BLEU Score

Calculating the Bilingual Evaluation Understudy (BLEU) score

We will be implementing a popular metric for evaluating the quality of machine-translated text: the BLEU score, proposed by Kishore Papineni et al. in their 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation". The BLEU score works by comparing a "candidate" text to one or more "reference" translations, and the closer the score is to 1 the better the candidate. Let's see how to compute this value in the following sections.

Imports

# python
from collections import Counter, namedtuple
from functools import partial
from pathlib import Path

import math
import os

# from pypi
from dotenv import load_dotenv
from nltk.util import ngrams

import hvplot.pandas
import numpy
import nltk
import sacrebleu
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

nltk.download('punkt')
slug = "bleu-score"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{slug}")

Plot = namedtuple("Plot", ["width", "height", "fontscale", "tan", "blue", "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

load_dotenv("posts/nlp/.env", override=True)

Middle

Part 1: BLEU Score

We will implement our own version of the BLEU Score using Numpy. To verify that our implementation is correct, we will compare our results with those generated by the SacreBLEU library. This package provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. It also knows all the standard test sets and handles downloading, processing, and tokenization.

Defining the BLEU Score

We can express the BLEU score as:

\[ BLEU = BP\left(\prod_{i=1}^{4}precision_i\right)^{(1/4)} \]

with the Brevity Penalty and precision defined as:

\[ BP = min\left(1, e^{(1-(\textit{reference}/\textit{candidate}))}\right) \]

\[ precision_i = \frac{\sum_{snt \in cand}\sum_{i\text{-gram} \in snt} min\Bigl(m^{i}_{cand}, m^{i}_{ref}\Bigr)}{w^{i}_{t}} \]

where:

  • \(m^{i}_{cand}\) is the count of i-grams in the candidate translation that match the reference translation.
  • \(m^{i}_{ref}\) is the count of i-grams in the reference translation.
  • \(w^{i}_{t}\) is the total number of i-grams in the candidate translation.
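
To make these terms concrete, here is a small sketch (the sentences are invented for illustration, and the Counter and ngrams imports from above are reused) that computes the clipped unigram numerator and denominator for a toy candidate/reference pair:

candidate = "the cat the cat on the mat".split()
reference = "the cat is on the mat".split()

cand_counts = Counter(ngrams(candidate, 1))
ref_counts = Counter(ngrams(reference, 1))

# m_cand: candidate unigrams that also appear in the reference, clipped to the reference counts
m_cand = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
# w_t: the total number of unigrams in the candidate
w_t = sum(cand_counts.values())

print(f"clipped unigram precision: {m_cand}/{w_t} = {m_cand / w_t:.2f}")  # -> 5/7 = 0.71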

Explaining the BLEU score

  • Brevity Penalty (example)
    ref_length = numpy.ones(100)
    can_length = numpy.linspace(1.5, 0.5, 100)
    x = ref_length / can_length
    y = 1 - x
    y = numpy.exp(y)
    brevity_penalty = numpy.minimum(numpy.ones(y.shape), y)
    
    frame = pandas.DataFrame.from_dict({"Reference Length/Candidate Length": x
                                        , "Brevity Penalty": brevity_penalty})
    plot = frame.hvplot(x="Reference Length/Candidate Length",
                        y="Brevity Penalty", title="Brevity Penalty").opts(
        width=PLOT.width,
        height=PLOT.height,
        fontscale=PLOT.fontscale    
    )
    output = Embed(plot=plot, file_name="brevity_penalty")()
    
    print(output)
    

    Figure Missing

    The brevity penalty penalizes generated translations that are too short compared to the closest reference length, with an exponential decay. It compensates for the fact that the BLEU score has no recall term.
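
    As a quick numeric sanity check of the same formula (the lengths here are arbitrary), a candidate shorter than the reference is penalized while a longer one is not:

    print(numpy.minimum(1, numpy.exp(1 - 10 / 8)))   # 8-token candidate, 10-token reference -> ~0.78
    print(numpy.minimum(1, numpy.exp(1 - 10 / 12)))  # 12-token candidate, 10-token reference -> 1.0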

N-Gram Precision (example)

And now for a meaningless plot.

data = pandas.DataFrame.from_dict({"1-gram": [0.8],
                                   "2-gram": [0.7],
                                   "3-gram": [0.6],
                                   "4-gram": [0.5]})
plot = data.hvplot.bar(title="N-Gram Precision").opts(
    width=PLOT.width,
    height=PLOT.height,
    fontscale=PLOT.fontscale,
)

output = Embed(plot=plot, file_name="n_gram_precision")()
print(output)

Figure Missing

The n-gram precision counts how many unigrams, bigrams, trigrams, and four-grams (i = 1, …, 4) match their n-gram counterparts in the reference translations. This term acts as a precision metric: unigrams account for adequacy while longer n-grams account for the fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference (\(m^{i}_{ref}\)). Typically precision shows exponential decay with the degree of the n-gram.
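
Clipping is what keeps a degenerate candidate from gaming the metric. In the classic example from the Papineni et al. paper, a candidate that just repeats "the" gets perfect unigram precision without clipping, but only 2/7 with it. A quick sketch (again reusing Counter and ngrams):

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

cand_counts = Counter(ngrams(candidate, 1))
ref_counts = Counter(ngrams(reference, 1))

unclipped = sum(count for gram, count in cand_counts.items() if gram in ref_counts)
clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())

print(unclipped / len(candidate))  # 7/7 = 1.0 without clipping
print(clipped / len(candidate))    # 2/7 ≈ 0.29 with clipping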

N-gram BLEU score (example):

Another meaningless plot.

data = pandas.DataFrame.from_dict({"1-gram": [0.8],
                                   "2-gram": [0.77],
                                   "3-gram": [0.74],
                                   "4-gram": [0.71]})
plot = data.hvplot.bar(title="Modified N-Gram Precision").opts(
    width=PLOT.width,
    height=PLOT.height,
    fontscale=PLOT.fontscale
)

output = Embed(plot=plot, file_name="modified_n_gram_precision")()
print(output)

Figure Missing

When the n-gram precision is multiplied by the brevity penalty, the exponential decay across n-gram orders is almost fully compensated. The BLEU score corresponds to a geometric average of this modified n-gram precision.
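
Concretely, the geometric average is the exponent of the weighted sum of the log precisions, which is exactly what the clipped_precision function below computes. Using the made-up precisions from the bar chart above (math is already imported):

precisions = [0.8, 0.77, 0.74, 0.71]  # the invented values plotted above
geometric_mean = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(round(geometric_mean, 3))  # -> 0.754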

Example Calculations of the BLEU score

In this example we will have a reference translation and two candidate translations. We will tokenize all sentences using NLTK.

  • Step 1: Computing the Brevity Penalty
    def brevity_penalty(candidate: list, reference: list) -> float:
        """Calculates the brevity penalty for the candidate translation"""
        reference_length = len(reference)
        candidate_length = len(candidate)

        # Brevity Penalty: no penalty (1) when the candidate is longer than the reference
        return 1 if reference_length < candidate_length else numpy.exp(
            1 - (reference_length / candidate_length))
    
  • Step 2: Computing the Precision
    def clipped_precision(candidate: list, reference: list) -> float:
        """
        Computes the geometric mean of the clipped n-gram precisions (n = 1 to 4)
        given a reference and a machine-translated sentence
        """
        clipped_precision_score = []

        for i in range(1, 5):
            ref_n_gram = Counter(ngrams(reference, i))
            cand_n_gram = Counter(ngrams(candidate, i))

            c = sum(cand_n_gram.values())

            for j in cand_n_gram: # for every n-gram up to 4 in candidate text
                if j in ref_n_gram: # check if it is in the reference n-gram
                    if cand_n_gram[j] > ref_n_gram[j]: # if the count of the candidate n-gram is bigger
                                                       # than the corresponding count in the reference n-gram,
                        cand_n_gram[j] = ref_n_gram[j] # then clip the candidate n-gram count to the
                                                       # count in the reference n-gram
                else:
                    cand_n_gram[j] = 0 # else set the candidate n-gram count to zero

            clipped_precision_score.append(sum(cand_n_gram.values()) / c)

        weights = [0.25] * 4

        s = (w_i * math.log(p_i) for w_i, p_i in zip(weights, clipped_precision_score))
        s = math.exp(math.fsum(s))
        return s
    
  • Step 3: Computing the BLEU score
    def bleu_score(candidate: list, reference: list) -> float:
        """Calculates the BLEU score: brevity penalty times the clipped precision"""
        BP = brevity_penalty(candidate, reference)
        precision = clipped_precision(candidate, reference)
        return BP * precision
    
  • Step 4: Testing with our Example Reference and Candidates Sentences
    reference = "The NASA Opportunity rover is battling a massive dust storm on planet Mars."
    candidate_1 = "The Opportunity rover is combating a big sandstorm on planet Mars."
    candidate_2 = "A NASA rover is fighting a massive storm on planet Mars."
    
    tokenized_ref = nltk.word_tokenize(reference.lower())
    tokenized_cand_1 = nltk.word_tokenize(candidate_1.lower())
    tokenized_cand_2 = nltk.word_tokenize(candidate_2.lower())
    
    print(
        "Results reference versus candidate 1 our own code BLEU: ",
        round(bleu_score(tokenized_cand_1, tokenized_ref) * 100, 1),
    )
    
    Results reference versus candidate 1 our own code BLEU:  27.6
    
    print(
        "Results reference versus candidate 2 our own code BLEU: ",
        round(bleu_score(tokenized_cand_2, tokenized_ref) * 100, 1),
    )
    
    Results reference versus candidate 2 our own code BLEU:  35.3
    
  • Step 5: Comparing the Results from our Code with the SacreBLEU Library
    print(
        "Results reference versus candidate 1 sacrebleu library BLEU: ",
        round(sacrebleu.corpus_bleu(candidate_1, reference).score, 1),
    )
    
    Results reference versus candidate 1 sacrebleu library BLEU:  27.6
    
    print(
        "Results reference versus candidate 2 sacrebleu library BLEU: ",
        round(sacrebleu.corpus_bleu(candidate_2, reference).score, 1),
    )
    
    Results reference versus candidate 2 sacrebleu library BLEU:  35.3
    

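As one more sanity check (not part of the original comparison), NLTK ships its own BLEU implementation in nltk.translate.bleu_score; running it on the same tokenized sentences should give numbers close to the ones above:

from nltk.translate.bleu_score import sentence_bleu

print(round(sentence_bleu([tokenized_ref], tokenized_cand_1) * 100, 1))
print(round(sentence_bleu([tokenized_ref], tokenized_cand_2) * 100, 1))
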
Part 2: BLEU Computation on a Corpus

Loading Data Sets for Evaluation Using the BLEU Score

In this section, we will show a simple pipeline for evaluating machine translated text. Due to storage and speed constraints, we will not be using our own model in this lab. Instead, we will be using Google Translate to generate English to German translations and we will evaluate it against a known evaluation set. There are three files we will need:

  1. A source text in English. In this lab, we will use the first 1671 words of the wmt19 evaluation dataset downloaded via SacreBLEU. We just grabbed a subset because of limitations in the number of words that can be translated using Google Translate.
  2. A reference translation to German of the corresponding first 1671 words from the original English text. This is also provided by SacreBLEU.
  3. A candidate machine translation to German from the same 1671 words. This is generated by feeding the source text to a machine translation model. As mentioned above, we will use Google Translate to generate the translations in this file.

With that, we can now compare the reference and candidate translations to get the BLEU score.

Load the raw data.

with Path(os.environ["WMT19_SOURCE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_src_1 = reader.read()

with Path(os.environ["WMT19_REFERENCE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_ref_1 = reader.read()

with Path(os.environ["WMT19_CANDIDATE"]).expanduser().open(encoding="utf-8") as reader:
    wmt19_can_1 = reader.read()

tokenized_corpus_src = nltk.word_tokenize(wmt19_src_1.lower())
tokenized_corpus_ref = nltk.word_tokenize(wmt19_ref_1.lower())
tokenized_corpus_cand = nltk.word_tokenize(wmt19_can_1.lower())    

Inspecting the first sentence of the data.

print("English source text:\n")
print(f"{wmt19_src_1[0:170]} -> {tokenized_corpus_src[0:30]}\n")
print("German reference translation:\n")
print(f"{wmt19_ref_1[0:219]} -> {tokenized_corpus_ref[0:35]}\n")
print("German machine translation:\n")
print(f"{wmt19_can_1[0:199]} -> {tokenized_corpus_cand[0:29]}")
English source text:

Welsh AMs worried about 'looking like muppets'
There is consternation among some AMs at a suggestion their title should change to MWPs (Member of the Welsh Parliament).
 -> ['\ufeffwelsh', 'ams', 'worried', 'about', "'looking", 'like', "muppets'", 'there', 'is', 'consternation', 'among', 'some', 'ams', 'at', 'a', 'suggestion', 'their', 'title', 'should', 'change', 'to', 'mwps', '(', 'member', 'of', 'the', 'welsh', 'parliament', ')', '.']

German reference translation:

Walisische Ageordnete sorgen sich "wie Dödel auszusehen"
Es herrscht Bestürzung unter einigen Mitgliedern der Versammlung über einen Vorschlag, der ihren Titel zu MWPs (Mitglied der walisischen Parlament) ändern soll.
 -> ['\ufeffwalisische', 'ageordnete', 'sorgen', 'sich', '``', 'wie', 'dödel', 'auszusehen', "''", 'es', 'herrscht', 'bestürzung', 'unter', 'einigen', 'mitgliedern', 'der', 'versammlung', 'über', 'einen', 'vorschlag', ',', 'der', 'ihren', 'titel', 'zu', 'mwps', '(', 'mitglied', 'der', 'walisischen', 'parlament', ')', 'ändern', 'soll', '.']

German machine translation:

Walisische AMs machten sich Sorgen, dass sie wie Muppets aussehen könnten
Einige AMs sind bestürzt über den Vorschlag, ihren Titel in MWPs (Mitglied des walisischen Parlaments) zu ändern.
Es ist aufg -> ['walisische', 'ams', 'machten', 'sich', 'sorgen', ',', 'dass', 'sie', 'wie', 'muppets', 'aussehen', 'könnten', 'einige', 'ams', 'sind', 'bestürzt', 'über', 'den', 'vorschlag', ',', 'ihren', 'titel', 'in', 'mwps', '(', 'mitglied', 'des', 'walisischen', 'parlaments']
print(
    "Results reference versus candidate 1 our own BLEU implementation: ",
    round(bleu_score(tokenized_corpus_cand, tokenized_corpus_ref) * 100, 1),
)
Results reference versus candidate 1 our own BLEU implementation:  43.6
print(
    "Results reference versus candidate 1 sacrebleu library BLEU: ",
    round(sacrebleu.corpus_bleu(wmt19_can_1, wmt19_ref_1).score, 1),
)
Results reference versus candidate 1 sacrebleu library BLEU:  43.2

BLEU Score Interpretation on a Corpus

Score    Interpretation
< 10     Almost useless
10 - 19  Hard to get the gist
20 - 29  The gist is clear, but has significant grammatical errors
30 - 40  Understandable to good translations
40 - 50  High quality translations
50 - 60  Very high quality, adequate, and fluent translations
> 60     Quality often better than human

From the table above, we can see that the translation is high quality. Moreover, the results of our hand-coded BLEU score are almost identical to those of the SacreBLEU package; the small remaining difference is largely down to tokenization, since SacreBLEU applies its own tokenizer internally while our version works on NLTK's word_tokenize output.
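
Finally, as a small convenience (a hypothetical helper, not part of the original lab), the table can be turned into a lookup function:

def interpret_bleu(score: float) -> str:
    """Maps a BLEU score (on the 0-100 scale) to the rough interpretation in the table above."""
    bands = (
        (10, "Almost useless"),
        (20, "Hard to get the gist"),
        (30, "The gist is clear, but has significant grammatical errors"),
        (40, "Understandable to good translations"),
        (50, "High quality translations"),
        (60, "Very high quality, adequate, and fluent translations"),
    )
    for upper_bound, interpretation in bands:
        if score < upper_bound:
            return interpretation
    return "Quality often better than human"

print(interpret_bleu(43.2))  # -> High quality translations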