CommonLounge Archive

Hands-on Assignment: Implementing Language Identification from Scratch

April 17, 2018

This hands-on assignment guides you through implementing language identification from scratch in Python. In particular, we have 10,000 snippets of text from various languages (about 50 in total), where each snippet is exactly 100 characters long. We’ll use 8,000 of these snippets for learning patterns, and the other 2,000 to evaluate how accurately the resulting system makes predictions (above 90%!).

Overview

All the code in this assignment can be run. Parts have been left for you to fill in, marked with ... (three dots).

Dataset

The data was extracted from Wikipedia for a competition, the International Olympiad in Informatics. You can download the data here: language_identification. Once downloaded, you can load it using the following code.

import io
###############################################################################
## load data
file = io.open("grader.in.1", mode="r", encoding="utf-8")
data = file.read().split('\n')[:-1]
# each line starts with a 2-letter language code and ends with the 100-character snippet
languages = [line[:2] for line in data]
text = [line[-100:] for line in data]
###############################################################################
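As a quick sanity check (assuming the download worked), you can verify the size and contents of the dataset:

# expect 10,000 snippets spanning about 50 languages
print(len(text))            # 10000
print(len(set(languages)))  # about 50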

Here’s the first text snippet:

Tvangsauktion er en tvungen offentlig auktion over fast ejendom eller løsøre   som afholdes af en fo

The language for this text is da, i.e. Danish.

Splitting the data: Next, we’ll split the data into train data (used for learning patterns) and test data (used to evaluate the model).

###############################################################################
## split in train and test
train_xx, train_yy = text[:8000], languages[:8000]
test_xx, test_yy = text[8000:10000], languages[8000:10000]
###############################################################################

Implementing Language Identification (Coding time!)

Below is the code template for implementing language identification. It’s divided into 3 steps.

Step 1: Text to ngrams

###############################################################################
## helper functions
ngram_size = 1 # <= we'll change this from 1 to 5
def text_to_ngrams(text):
    # return a list of ngrams for the text 
    return ...
print(train_xx[0])
print(text_to_ngrams(train_xx[0]))
###############################################################################

The expected output for the above code for ngram_size = 3 is as follows:

Tvangsauktion er en tvungen offentlig auktion over fast ejendom eller løsøre   som afholdes af en fo
[u'Tva', u'van', u'ang', u'ngs', u'gsa', u'sau', u'auk', u'ukt', u'kti', u'tio', u'ion', u'on ', u'n e', u' er', u'er ', u'r e', u' en', u'en ', u'n t', u' tv', u'tvu', u'vun', u'ung', u'nge', u'gen', u'en ', u'n o', u' of', u'off', u'ffe', u'fen', u'ent', u'ntl', u'tli', u'lig', u'ig ', u'g a', u' au', u'auk', u'ukt', u'kti', u'tio', u'ion', u'on ', u'n o', u' ov', u'ove', u'ver', u'er ', u'r f', u' fa', u'fas', u'ast', u'st ', u't e', u' ej', u'eje', u'jen', u'end', u'ndo', u'dom', u'om ', u'm e', u' el', u'ell', u'lle', u'ler', u'er ', u'r l', u' l\xf8', u'l\xf8s', u'\xf8s\xf8', u's\xf8r', u'\xf8re', u're ', u'e  ', u'   ', u'  s', u' so', u'som', u'om ', u'm a', u' af', u'afh', u'fho', u'hol', u'old', u'lde', u'des', u'es ', u's a', u' af', u'af ', u'f e', u' en', u'en ', u'n f', u' fo']
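If you get stuck, here is one possible way to fill in text_to_ngrams (a sketch using a sliding window; the reference solution is in the Colab notebook linked at the end):

def text_to_ngrams(text):
    # return the list of all substrings of length ngram_size, in order
    return [text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)]

For a 100-character snippet and ngram_size = 3, this produces 100 - 3 + 1 = 98 ngrams, matching the output above.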

Step 2: Language models

The next step is to create language models. We’ll store the models in ngram_models. For example, ngram_models['en'] should be the model for English. We want ngram_models['en'][ngram] = (number of times the ngram appears across all English training snippets) / (number of English training snippets).

from collections import Counter, defaultdict
###############################################################################
## train
languages = list(set(train_yy))
language_counts = Counter(train_yy)
# sets ngram_models[language][ngram] = 0 for everything
ngram_models = dict([(language, defaultdict(float)) for language in languages])
# your code below. use "text_to_ngrams"
for text, language in zip(train_xx, train_yy):
    ...
print(ngram_models['en'])
###############################################################################

The following are the expected values for ngram_models['en'] (only the values for a-z are shown; other characters are also present). Since each snippet is 100 characters long, the values represent how many characters, out of every 100 in English text, are a, b, c, etc.

# 'a': 7.47
# 'b': 1.02
# 'c': 2.15
# 'd': 2.29
# 'e': 7.25
# 'f': 1.53
# 'g': 0.98
# 'h': 2.52
# 'i': 6.11
# 'j': 0.13
# 'k': 0.48
# 'l': 3.08
# 'm': 1.83
# 'n': 5.70
# 'o': 5.13
# 'p': 1.28
# 'q': 0.03
# 'r': 4.85
# 's': 4.33
# 't': 4.84
# 'u': 1.83
# 'v': 0.58
# 'w': 1.01
# 'x': 0.09
# 'y': 1.27
# 'z': 0.15
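If you’re stuck, here is one possible way to fill in the training loop (a sketch; dividing each count by language_counts[language] gives the average count per snippet, which is exactly the quantity described above):

for text, language in zip(train_xx, train_yy):
    # add 1 / (number of snippets in this language) per ngram occurrence,
    # so the final value is (total occurrences) / (number of snippets)
    for ngram in text_to_ngrams(text):
        ngram_models[language][ngram] += 1.0 / language_counts[language]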

Step 3: Making predictions

Now we’ll write some code to make predictions.

import math
###############################################################################
## scoring and prediction
def predict(text):
    scores = dict([(language, score(text, language)) for language in ngram_models.keys()])
    return max(scores.keys(), key=scores.get)
def score(text, language):
    model = ngram_models[language]
    return ... 
print(train_xx[64], predict(train_xx[64])) # outputs 'en'
print(train_xx[63], predict(train_xx[63])) # outputs 'ka'
print(train_xx[62], predict(train_xx[62])) # outputs 'bg'
###############################################################################

The formula for score is as follows,

$$ score(text, L) = \sum_{ngram \in text} \log(model_L(ngram) + \epsilon) $$

where $model_L$ is the model we learnt for language L, and $\epsilon$ is a very small number, so that we never take $\log(0)$ (which throws an error). For example, $\epsilon = 10^{-9}$.
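Following the formula directly, one possible implementation of score looks like this (a sketch, with epsilon = 1e-9; since each model is a defaultdict, unseen ngrams simply contribute log(epsilon)):

def score(text, language):
    # sum of log(ngram frequency + epsilon) over all ngrams in the text
    model = ngram_models[language]
    return sum(math.log(model[ngram] + 1e-9) for ngram in text_to_ngrams(text))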

Evaluation

You can use the following code for evaluating the model.

from sklearn.metrics import classification_report, accuracy_score
###############################################################################
## evaluate
predictions = [predict(text) for text in test_xx]
print(classification_report(test_yy, predictions))
print(accuracy_score(test_yy, predictions))
###############################################################################

I get the following accuracies (in %) for ngram_size 1 through 5. Accuracy peaks at ngram_size = 4: longer ngrams capture more context, but they also occur more rarely, so with only 8,000 training snippets the counts eventually become too sparse to be reliable.

ngram_size = 1 # 83
ngram_size = 2 # 87
ngram_size = 3 # 92
ngram_size = 4 # 93
ngram_size = 5 # 91

Solution on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments in your browser. If you’re unfamiliar with Google Colaboratory, you should read this 1-minute tutorial first. Note that for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your own system.

Hope you enjoyed the assignment!


© 2016-2022. All rights reserved.