How do I calculate the BLEU score in Python?

8 months ago

綾乃, 一希

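BLEU (Bilingual Evaluation Understudy) is a metric for measuring how good the output of a machine translation model is. It was originally designed only for translation models, but it is now used in other natural language processing applications as well.

The BLEU score compares a candidate sentence against one or more reference sentences and rates how well the candidate matches that list of references. The result is a score between 0 and 1.

A BLEU score of 1 means that the candidate sentence exactly matches one of the reference sentences.

The score is also a common metric for image captioning models.

In this tutorial, we will use the sentence_bleu() function from the nltk library. Let's get started.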

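Calculating the BLEU score in Python

To calculate the BLEU score, you need to provide the reference and candidate sentences as lists of tokens.

In this section, we will learn how to do that and how to calculate the score. Let's start by importing the required module.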

from nltk.translate.bleu_score import sentence_bleu

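Now we can enter the reference sentences as a list. Each sentence also has to be split into tokens before it is passed to the sentence_bleu() function.

1. Enter the sentences and split them into tokens

The sentences in our reference list are:

    'this is a dog'
    'it is dog'
    'dog it is'
    'a dog, it is'

We can split them into tokens with the split() function.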

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
print(reference)

Output:

[['this', 'is', 'a', 'dog'], ['it', 'is', 'dog'], ['dog', 'it', 'is'], ['a', 'dog,', 'it', 'is']]
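
Note that split() only breaks on whitespace, so the comma stays attached in 'dog,' and that token can never match a plain 'dog'. As a minimal sketch, assuming you would rather treat punctuation as separate tokens, a small regular expression can do the splitting. The helper name simple_tokenize is our own for this illustration and is not part of nltk.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; it is not part of nltk.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # punctuation character, so 'dog,' becomes 'dog' and ','.
    return re.findall(r"\w+|[^\w\s]", text)

print('a dog, it is'.split())           # whitespace split keeps 'dog,'
print(simple_tokenize('a dog, it is'))  # punctuation becomes its own token
```

Whichever tokenization you choose, apply the same one to the references and to the candidate, otherwise matching tokens may be missed.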

This is what the sentences look like in token form. We can now call the sentence_bleu() function to calculate the score.

2. Calculate the BLEU score in Python

Use the following code to calculate the score.

candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output:

BLEU score -> 1.0

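The score is a perfect 1 because the candidate is identical to one of the reference sentences. (The outputs shown here come from the NLTK version used for this tutorial. Recent NLTK releases may instead warn that the candidate has no 4-gram overlaps and print a score close to 0, since a three-word sentence contains no 4-grams.) Let's try another sentence.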

candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output:

BLEU score -> 0.8408964152537145

This candidate overlaps heavily with the reference sentences but does not match any of them exactly, so the score comes out at about 0.84 instead of 1.

3. Complete code for calculating the BLEU score in Python

Here is the complete code from this section.

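from nltk.translate.bleu_score import sentence_bleu

# Reference sentences, each split into a list of tokens
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split()
]

# Candidate that is identical to one of the references
candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

# Candidate that overlaps with the references without matching one exactly
candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))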

4. Calculating n-gram scores

When matching sentences, you can choose how many words are matched at a time. For example, you can match one word at a time (1-gram), pairs of words (2-gram), or triplets of words (3-gram).

In this section, we will learn how to calculate these n-gram scores.
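
To make the idea of an n-gram concrete, here is a minimal sketch that lists the n-grams of a token list with plain Python. The function name extract_ngrams is our own for this illustration and is not an nltk function.

```python
def extract_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order."""
    # Slide a window of length n over the tokens. If the sentence is
    # shorter than n tokens, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = 'it is a dog'.split()
for n in range(1, 5):
    print(n, extract_ngrams(candidate, n))
```

A four-word sentence has four 1-grams, three 2-grams, two 3-grams and a single 4-gram, which is why the higher orders are much harder to match.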

The sentence_bleu() function accepts a weights argument that holds one weight for each n-gram order.

For example, to calculate each n-gram score individually, you can use the following weights.

Individual 1-gram: (1, 0, 0, 0)
Individual 2-gram: (0, 1, 0, 0)
Individual 3-gram: (0, 0, 1, 0)
Individual 4-gram: (0, 0, 0, 1)

The Python code for this is as follows.

from nltk.translate.bleu_score import sentence_bleu
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
candidate = 'it is a dog'.split()

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

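Output:

Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 0.500000
Individual 4-gram: 1.000000

Note that the candidate's only 4-gram, 'it is a dog', does not occur in any reference, so its 4-gram precision is really 0. The 1.000000 above reflects the NLTK version used for this tutorial. Recent NLTK releases warn about the missing 4-gram overlap and report a value that is effectively 0.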
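
The individual scores can also be checked by hand. The following is a rough sketch of a clipped n-gram precision written in plain Python. It is not NLTK's implementation, and the names extract_ngrams and clipped_precision are made up for this illustration. By this hand count the precisions for 'it is a dog' come out as 1, 1, 0.5 and 0. How a given NLTK release reports an order with no matches may differ.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(references, candidate, n):
    """Share of candidate n-grams that also occur in some reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the single reference that contains it most often.
    """
    cand_counts = Counter(extract_ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # the candidate is shorter than n tokens

    # Highest count of each n-gram across the individual references
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(extract_ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split()
]
candidate = 'it is a dog'.split()

precisions = [clipped_precision(reference, candidate, n) for n in range(1, 5)]
print(precisions)
```

Only one of the two candidate 3-grams ('is a dog') appears in a reference, which gives the 0.5.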

By default, the sentence_bleu() function calculates the cumulative 4-gram BLEU score, also called BLEU-4. The weights for BLEU-4 are as follows.

(0.25, 0.25, 0.25, 0.25)

Let's look at the code for BLEU-4.

score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)

Output:

0.8408964152537145

This is exactly the same as the score we got without passing any n-gram weights, because (0.25, 0.25, 0.25, 0.25) is the default.
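
To see how the four precisions combine into a single number, here is a rough sketch of the weighted geometric mean behind the cumulative score. It is not NLTK's implementation: combine_precisions and its skip_zeros flag are names made up for this illustration, the precisions are hand-counted values for the candidate 'it is a dog', and the brevity penalty is taken as 1 because the candidate is as long as the closest reference.

```python
import math

def combine_precisions(precisions, weights, brevity_penalty=1.0, skip_zeros=False):
    """Weighted geometric mean of n-gram precisions times a brevity penalty.

    Hypothetical helper for illustration. With skip_zeros=True, orders
    whose precision is 0 are left out instead of forcing the score to 0.
    """
    log_sum = 0.0
    for p, w in zip(precisions, weights):
        if w == 0:
            continue  # this n-gram order does not contribute
        if p == 0:
            if skip_zeros:
                continue
            return 0.0  # a zero precision makes the geometric mean 0
        log_sum += w * math.log(p)
    return brevity_penalty * math.exp(log_sum)

# Hand-counted 1-gram to 4-gram precisions for 'it is a dog'
precisions = [1.0, 1.0, 0.5, 0.0]
weights = (0.25, 0.25, 0.25, 0.25)

strict = combine_precisions(precisions, weights)
lenient = combine_precisions(precisions, weights, skip_zeros=True)
print(strict, lenient)
```

With the zero 4-gram precision left out, the result is 0.5 ** 0.25, about 0.8409, which matches the score printed above. Counting it strictly gives 0. That the NLTK version used for this tutorial skipped the zero term is our assumption, based only on this arithmetic.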

Conclusion

This tutorial covered how to calculate the BLEU score in Python. We learned how to calculate both individual and cumulative n-gram BLEU scores. We hope you enjoyed learning with us!