Introduction
- The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems in an unsupervised setting.
- [Link to the paper](https://arxiv.org/abs/1603.08023)
Evaluation Metrics Considered
Word-Based Similarity Metrics
BLEU
- Analyses the co-occurrences of n-grams in the ground truth and the proposed responses.
- BLEU-N: N-gram precision for the entire dataset.
- Brevity penalty added to avoid bias towards short sentences.
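A minimal sentence-level sketch using NLTK (the paper reports corpus-level BLEU; the toy tokens here are made up):

```python
# Sentence-level BLEU-2 with NLTK; smoothing avoids zero scores on short texts.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thanks".split()  # ground-truth response
candidate = "i am fine thanks".split()        # proposed response

# weights=(0.5, 0.5) puts uniform weight on 1-gram and 2-gram precision (BLEU-2)
score = sentence_bleu([reference], candidate,
                      weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2: {score:.3f}")
```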
METEOR
- Creates an explicit alignment between the candidate and target responses (using exact matches, WordNet synonyms, stemmed tokens, etc.).
- Computes the harmonic mean of the precision and recall between the proposed and ground-truth responses.
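A sketch using NLTK's implementation (assumes the WordNet corpus has been downloaded; recent NLTK versions expect pre-tokenized input):

```python
# METEOR via NLTK; alignment falls back from exact match to stem and
# WordNet-synonym matches. Requires: nltk.download('wordnet')
from nltk.translate.meteor_score import meteor_score

reference = "i am doing well thanks".split()
candidate = "i am fine thanks".split()

print(f"METEOR: {meteor_score([reference], candidate):.3f}")
```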
ROUGE
- ROUGE-L: an F-measure based on the Longest Common Subsequence (LCS) between the candidate and target responses.
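A self-contained sketch of the LCS-based F-measure (the beta value follows the original ROUGE paper's recall-heavy weighting; treat it as illustrative):

```python
# ROUGE-L: F-measure over the Longest Common Subsequence of two token lists.
def lcs_length(x, y):
    # Standard O(|x| * |y|) dynamic program for LCS length.
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    # F-measure; beta > 1 weights recall more heavily than precision
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("i am fine thanks".split(), "i am doing well thanks".split()))
```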
Embedding-Based Metrics
Greedy Matching
- Each token in the ground-truth response is greedily matched to the most similar token in the predicted response, based on the cosine similarity of their word embeddings (and vice-versa).
- The total score is the average of these per-word matches, taken over both directions.
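A numpy sketch, assuming `emb` is any word-to-vector lookup (e.g., loaded Word2Vec vectors); the toy embeddings below are made up:

```python
import numpy as np

def greedy_match(seq_a, seq_b, emb):
    # For each token in seq_a, take its best cosine match in seq_b, then average.
    unit = lambda w: emb[w] / np.linalg.norm(emb[w])
    return float(np.mean([max(unit(a) @ unit(b) for b in seq_b) for a in seq_a]))

def greedy_matching_score(candidate, reference, emb):
    # Symmetric score: average the two matching directions.
    return 0.5 * (greedy_match(candidate, reference, emb)
                  + greedy_match(reference, candidate, emb))

emb = {"hi": np.array([1.0, 0.1]), "hello": np.array([0.9, 0.2]),
       "there": np.array([0.1, 1.0])}  # toy 2-d embeddings
print(greedy_matching_score(["hi", "there"], ["hello", "there"], emb))
```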
Embedding Average
- Calculates a sentence-level embedding by averaging the word-level embeddings.
- Compares the sentence-level embeddings of the candidate and target sentences via cosine similarity.
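A numpy sketch, reusing the same kind of `emb` lookup as above:

```python
import numpy as np

def embedding_average(tokens, emb):
    # Sentence-level embedding: the mean of the word vectors.
    return np.mean([emb[w] for w in tokens], axis=0)

def average_score(candidate, reference, emb):
    c = embedding_average(candidate, emb)
    r = embedding_average(reference, emb)
    # Cosine similarity between the two sentence embeddings.
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r)))
```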
Vector Extrema
- For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
- The idea is that taking the extrema along each dimension ignores the common words, which are pulled towards the origin of the vector space.
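A numpy sketch; per dimension it keeps whichever of the max or min has the larger magnitude:

```python
import numpy as np

def vector_extrema(tokens, emb):
    # Per dimension, keep the value with the largest absolute magnitude.
    vectors = np.stack([emb[w] for w in tokens])
    max_v, min_v = vectors.max(axis=0), vectors.min(axis=0)
    return np.where(max_v > np.abs(min_v), max_v, min_v)

def extrema_score(candidate, reference, emb):
    c = vector_extrema(candidate, emb)
    r = vector_extrema(reference, emb)
    # Cosine similarity between the two extrema embeddings.
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r)))
```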
Dialogue Models Considered
Retrieval Models
TF-IDF
- Compute the TF-IDF vectors for each context and response in the corpus.
- C-TFIDF computes the cosine similarity between the input context and every context in the corpus, and returns the response paired with the highest-scoring context.
- R-TFIDF computes the cosine similarity between the input context and each response directly.
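A hedged sketch of the R-TFIDF variant with scikit-learn (the corpus is a toy one; C-TFIDF would compare against the stored contexts instead and return the paired response):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contexts = ["how are you ?", "where are you from ?"]            # toy corpus
responses = ["i am doing fine , thanks for asking", "i am from montreal"]

vectorizer = TfidfVectorizer().fit(contexts + responses)
response_vecs = vectorizer.transform(responses)

def r_tfidf(input_context):
    # Score every stored response directly against the input context.
    sims = cosine_similarity(vectorizer.transform([input_context]), response_vecs)
    return responses[sims.argmax()]

print(r_tfidf("how are you doing today ?"))
```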
Dual Encoder
- Two RNNs that respectively compute vector representations of the input context and the response.
- The model then calculates the probability that the given response is the ground-truth response for that context.
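A minimal PyTorch sketch; the bilinear score sigmoid(cᵀ M r) follows the dual-encoder formulation of Lowe et al., while the layer types and sizes are illustrative choices:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.context_rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.response_rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, context_ids, response_ids):
        _, (c, _) = self.context_rnn(self.embed(context_ids))
        _, (r, _) = self.response_rnn(self.embed(response_ids))
        c, r = c.squeeze(0), r.squeeze(0)   # final hidden state of each RNN
        # p(response is the ground truth | context) = sigmoid(c^T M r)
        return torch.sigmoid((c @ self.M * r).sum(dim=1))

model = DualEncoder(vocab_size=10_000)
ctx = torch.randint(0, 10_000, (2, 20))     # batch of 2 contexts, 20 tokens each
rsp = torch.randint(0, 10_000, (2, 10))
print(model(ctx, rsp))                      # probabilities in (0, 1)
```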
Generative Models
LSTM language model
- An LSTM model trained to predict the next word in the (context, response) pairs.
- Given a context, the model encodes it with the LSTM and generates a response using a beam-search procedure (greedy decoding is the beam-width-1 case).
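A sketch of the greedy (beam-width-1) case; the `model` interface here, returning per-step logits plus a recurrent state, is a hypothetical stand-in rather than the paper's code:

```python
import torch

@torch.no_grad()
def generate(model, context_ids, eos_id, max_len=30):
    # Run the context through the LM to prime the recurrent state.
    # NOTE: `model(input_ids, state) -> (logits, state)` is an assumed interface.
    logits, state = model(torch.tensor([context_ids]), None)
    response, next_id = [], int(logits[0, -1].argmax())
    for _ in range(max_len):
        if next_id == eos_id:
            break
        response.append(next_id)
        # Feed the chosen word back in and pick the most likely successor.
        logits, state = model(torch.tensor([[next_id]]), state)
        next_id = int(logits[0, -1].argmax())
    return response
```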
Hierarchical Recurrent Encoder-Decoder (HRED)
- Uses a hierarchy of encoders.
- Each utterance in the context passes through an 'utterance-level' encoder, and the outputs of these encoders are passed through a 'context-level' encoder whose final state conditions the decoder.
- Better handling of long-term dependencies as compared to the conventional Encoder-Decoder.
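A compact PyTorch skeleton of the hierarchy (dimensions and wiring are illustrative; the published HRED has more machinery):

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utterance_enc = nn.GRU(emb_dim, hidden, batch_first=True)
        self.context_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, utterances, response_in):
        # utterances: (batch, n_utts, n_tokens); response_in: (batch, n_resp)
        b, n, t = utterances.shape
        # Encode every utterance independently at the utterance level.
        _, u = self.utterance_enc(self.embed(utterances.view(b * n, t)))
        # Run the context-level encoder over the sequence of utterance encodings.
        _, ctx = self.context_enc(u.view(b, n, -1))
        # The context state seeds the decoder, which emits the response.
        dec_out, _ = self.decoder(self.embed(response_in), ctx)
        return self.out(dec_out)   # next-word logits per response position
```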
Observations
- A human survey was used to determine the correlation between human judgements of response quality and the score assigned by each metric.
- Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
- Best-performing metrics:
  - Among word-overlap metrics: the BLEU-2 score.
  - Among embedding-based metrics: the embedding average.
- Embedding-based metrics would benefit from a weighting of word saliency.
- BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.