Language identification is the task of computationally determining the language of a given piece of text. A text document could be written entirely in a single language such as English, French, German, or Spanish (monolingual language identification), or it could contain multiple languages in different parts (multilingual language identification).
Language identification is important for most NLP applications to work accurately, since models are usually trained on data from a single language. If a model is trained on English text and used for prediction on French text, we usually see a significant decrease in accuracy. In applications such as spam filtering, machine translation, and document summarization, language identification is used as the first step, and based on the detected language, a different trained model is used for the prediction step.
Another important area where correct identification of the language is critical is search engines. First, search engines crawl web pages and need to identify the language of each page to decide whether or not to show it in search results. Second, they need to identify the language of the search query itself, since some users might use different languages for different queries. Search engines also apply spell checking and stemming to search queries, which are certainly language-specific.
Before understanding how language identification algorithms work, let’s see a simple example. We’ll use Python’s langdetect package, which supports 55 languages. You can install the package using
pip install langdetect
Using the package is straightforward:
from langdetect import detect

detect('Can you guess my language, and how accurate are you in detecting it')
>>> 'en' (English)

detect('Puedes adivinar mi lenguaje y qué precisión tienes para detectarlo')
>>> 'es' (Spanish)

detect('Kannst du meine Sprache erraten und wie genau erkennst du sie?')
>>> 'de' (German)

detect('क्या आप मेरी भाषा का अनुमान लगा सकते हैं, और आप इसका पता लगाने में कितना सही हैं')
>>> 'hi' (Hindi)
Interesting, isn’t it?
Just a word of caution: the library may not work accurately on small chunks of text (less than 5-6 words).
Let us now understand the approaches that can be taken for language detection when there is a single language per document (monolingual language identification).
One of the simplest yet effective approaches for language classification is to maintain a corpus of words for each language and identify the language of a text based on the occurrences of those words. The tokens in the input text are compared to the tokens in each stored corpus to find the strongest correlation with a language. For the corpus, the most common strategy is to choose very common words. This is usually the list of stop words (for example, in English, words like “the”, “and”, “or”, etc.).
This approach works well when the input data is relatively lengthy. The shorter the input text, the less likely these common words are to appear, and hence the less likely the algorithm is to classify it correctly. Also, some languages have no spaces between written words, making this approach less feasible for them.
NLTK has a corpus of stop words for a number of languages, which can be used to compute a language probability depending on which stop words are present in the input text.
from nltk.corpus import stopwords

print(stopwords.fileids())
>>> ['danish', 'dutch', 'english', 'finnish', 'french', 'german',
     'hungarian', 'italian', 'norwegian', 'portuguese', 'russian',
     'spanish', 'swedish', 'turkish']
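The stop-word approach can be sketched in a few lines. To keep the example self-contained, it uses tiny hand-picked stop-word sets (an assumption for illustration; in practice you would use NLTK’s full corpora as above):

```python
# Minimal sketch of stop-word-based language identification.
# These tiny word sets are illustrative only; real systems use full
# stop-word corpora such as nltk's.
STOPWORDS = {
    "english": {"the", "and", "or", "is", "are", "of", "to", "in"},
    "spanish": {"el", "la", "y", "o", "es", "de", "en"},
    "german":  {"der", "die", "und", "oder", "ist", "von", "zu"},
}

def detect_by_stopwords(text):
    tokens = text.lower().split()
    # Score each language by how many of its stop words appear in the text;
    # the language with the highest count wins.
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_by_stopwords("the cat and the dog are in the house"))  # english
print(detect_by_stopwords("el gato y el perro es de la casa"))      # spanish
```

As the caution above suggests, such counting works poorly on very short inputs, where few or no stop words appear.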
Overview: An advanced language detection algorithm works by calculating and comparing language profiles of character N-gram frequencies.
An N-gram here is an N-character slice of a string. Special characters are typically added to the beginning and end of the string. For example the string “TEXT” would be composed of the following N-grams:
uni-grams: T, E, X, T
bi-grams: _T, TE, EX, XT, T_
tri-grams: __T, _TE, TEX, EXT, XT_, T__
4-grams: ___T, __TE, _TEX, TEXT, EXT_, XT__, T___
Typically, we use N in the range 2 to 4. N is chosen based on trade-offs among the size of the training dataset, desired speed and memory efficiency, and desired accuracy of the model.
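Generating these padded N-grams is a one-liner. A minimal sketch that reproduces the “TEXT” example above:

```python
def char_ngrams(text, n):
    # Pad with underscores so the leading and trailing characters also
    # form full N-grams, mirroring the "TEXT" example.
    padded = "_" * (n - 1) + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("TEXT", 2))
# ['_T', 'TE', 'EX', 'XT', 'T_']
print(char_ngrams("TEXT", 3))
# ['__T', '_TE', 'TEX', 'EXT', 'XT_', 'T__']
```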
Advantages: N-gram based matching is successful in dealing with noisy input. Since every string is decomposed into small parts, any errors (such as spelling errors) affect only a small number of N-grams, leaving the rest intact. This is not true in the case of complete words. N-gram matching also works better on small pieces of text than the stop-words method.
Method: First, we generate a language profile of each particular language based on frequency of occurrence of various N-grams. In any language, some words and patterns will be very frequently used and they dominate most of the language in terms of frequency.
Given this data, inference on an unknown text document is straightforward. We compute the N-gram frequency profile of the test document and compute its distance to each of the known language profiles (using, say, cosine distance). The language profile with the minimal distance is taken to represent the detected language.
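The whole method fits in a short sketch. Here the “language profiles” are built from tiny sample sentences purely for illustration (an assumption; real profiles are computed from large training corpora):

```python
from collections import Counter
from math import sqrt

def profile(text, n=3):
    # Character N-gram frequency profile of a text (underscore-padded).
    padded = "_" * (n - 1) + text.lower() + "_" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_distance(p, q):
    # 1 - cosine similarity between two N-gram frequency profiles.
    dot = sum(p[g] * q[g] for g in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1 - dot / norm

# Toy language profiles built from tiny samples (illustrative only).
profiles = {
    "english": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": profile("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

unknown = profile("the dog and the cat")
detected = min(profiles, key=lambda lang: cosine_distance(unknown, profiles[lang]))
print(detected)  # english (the closest profile wins)
```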
N-gram based matching can also be extended from a stream of characters to a stream of bytes. This is helpful in case the encoding of a text document is not known, since many languages use variable-length encodings, i.e. encodings in which a different number of bytes encodes a single character. It also helps in situations where different languages have widely varying numbers of characters. For example, in English, the number of distinct characters is about 100 (including punctuation, etc.), whereas in Chinese (Mandarin), the number of distinct characters is 7,000+.
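A minimal sketch of byte-level N-grams, assuming UTF-8 encoding: multi-byte characters (as in Hindi or Chinese) simply contribute several byte N-grams each.

```python
def byte_ngrams(text, n=2, encoding="utf-8"):
    # Slice the raw byte stream instead of the character stream.
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]

print(len("中文"))           # 2 characters
print(len("中文".encode()))  # 6 bytes in UTF-8 (3 bytes per character)
print(byte_ngrams("ab", 2))  # [b'ab']
```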
Task: A typical assumption in basic language identification techniques is that each document is monolingual. A more advanced problem is to appropriately handle documents that contain text written in more than one language. Here, we need to identify both the languages present in the document and their boundaries, i.e. the start and end positions of each language in the document.
Method: To perform multilingual language identification, we often break the problem into multiple steps:
- Step 1: Segment the document into monolingual chunks.
- Step 2: Use monolingual language identification algorithms to categorize the language present in each chunk.
- Step 3: Merge adjacent chunks with the same language into a single segment.
Below, we discuss possible approaches for step 1, i.e. segmenting the document into monolingual chunks.
Approach 1: This approach is used for multilingual documents whose text segments are longer than a clause (sentences or paragraphs). In this technique, we identify a set of characters that segregate languages. For example, we could assume that different languages are separated by full stops (’.’), question marks (’?’), etc. These characters are used to break the document into chunks.
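This chunking step can be sketched with a regular expression split. The assumption (as stated above) is that language switches happen only at sentence boundaries:

```python
import re

def split_into_chunks(document):
    # Assume language switches only at sentence-ending punctuation,
    # so split on '.', '?' and '!' and discard empty chunks.
    chunks = re.split(r"[.?!]", document)
    return [c.strip() for c in chunks if c.strip()]

doc = "This part is English. Esta parte es español. Dieser Teil ist Deutsch."
print(split_into_chunks(doc))
# ['This part is English', 'Esta parte es español', 'Dieser Teil ist Deutsch']
```

Each resulting chunk can then be fed to a monolingual identifier (step 2), and adjacent same-language chunks merged (step 3).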
Approach 2: For documents where multiple languages are mixed at a finer granularity (words, phrases, etc.), or when the above approach doesn’t work, we can use the following more advanced method:
- Perform language identification on the entire document and shortlist the top K languages.
- Compute a relevance score between each word in the document and each of the K languages.
- For each language, construct the list of scores over the entire document as a series (score(w1, lang), score(w2, lang), score(w3, lang), …). To reduce noise, smooth this signal, e.g. with a moving average.
- Identify a small number of boundaries (for each language) based on local minima and thresholds. These are the highest-likelihood points of transition from one language to another.
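The smoothing and boundary-detection steps can be sketched as follows. The per-word relevance scores here are a hypothetical, hand-made series (high while the text is in the scored language, dipping where it switches), and the threshold is an illustrative choice:

```python
def moving_average(scores, window=3):
    # Smooth the per-word score series to reduce noise before looking
    # for transition points.
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window // 2), min(len(scores), i + window // 2 + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def find_boundaries(smoothed, threshold=0.5):
    # A candidate boundary is a local minimum below the threshold:
    # the likeliest point of transition from one language to another.
    return [i for i in range(1, len(smoothed) - 1)
            if smoothed[i] < threshold
            and smoothed[i] <= smoothed[i - 1]
            and smoothed[i] <= smoothed[i + 1]]

# Hypothetical per-word scores for one language: the dip around
# positions 3-5 marks a stretch written in another language.
scores = [0.9, 0.8, 0.9, 0.2, 0.1, 0.2, 0.9, 0.8]
print(find_boundaries(moving_average(scores)))  # [4]
```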
Other applications: Since most natural language processing techniques presuppose monolingual input data, inclusion of data in foreign languages can act as noise and degrade the performance of NLP systems. Therefore, automatic detection of multilingual documents is also used as a pre-filtering step to improve the quality of input data.
In this tutorial, we discussed the problem of language identification and why it is important. Then, we saw several methods to solve the problem of monolingual language identification (single language per document) and multilingual language identification (a document may have multiple languages).