Natural Language Identification: What it is, why it's important, and how it works.
What is natural language identification?
Language identification is the task of computationally determining the language of a given piece of text. A text document could be written entirely in a single language such as English, French, German, or Spanish (monolingual language identification), or it could contain multiple languages in different parts (multilingual language identification).
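As a rough illustration, language identification is often done by comparing character n-gram profiles of a document against profiles built from known-language text (the classic Cavnar-Trenkle approach). Below is a minimal pure-Python sketch; the two tiny training corpora are invented for illustration and are far too small for real use.

```python
from collections import Counter

def trigram_profile(text, top_k=200):
    """Build a ranked character-trigram profile of a text."""
    text = text.lower()
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    return [t for t, _ in Counter(trigrams).most_common(top_k)]

def profile_distance(doc_profile, lang_profile):
    """'Out-of-place' measure: sum of rank differences; unseen trigrams get a max penalty."""
    max_penalty = len(lang_profile)
    rank = {t: i for i, t in enumerate(lang_profile)}
    return sum(abs(i - rank.get(t, max_penalty)) for i, t in enumerate(doc_profile))

def identify_language(text, training_texts):
    """Return the training language whose trigram profile is closest to the text's."""
    profiles = {lang: trigram_profile(corpus) for lang, corpus in training_texts.items()}
    doc = trigram_profile(text)
    return min(profiles, key=lambda lang: profile_distance(doc, profiles[lang]))

training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "french":  "le renard brun saute par dessus le chien paresseux et le chat est sur le tapis",
}
print(identify_language("the dog and the cat", training))  # english
```

Real systems train the profiles on large corpora per language; libraries such as langdetect wrap this kind of statistical approach behind a one-line API.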
Why is language identification important?
Language identification is important for most NLP applications to work accurately, since models are usually trained on data from a single language. If a model is trained on English text and used to make predictions on French text, we usually see a significant drop in accuracy. In applications such as spam filtering, machine t...
Introduction to Named Entity Recognition with Examples and Python Code for training Machine Learning model
Named Entity Recognition (NER) is one of the most useful information extraction techniques for identifying and classifying named entities in text. These entities belong to pre-defined categories such as person names, organizations, locations, time expressions, financial elements, etc.
Apart from these generic entities, a particular problem may call for other, domain-specific terms. These terms represent elements that have a unique context compared to the rest of the text: for example, operating systems, programming languages, or football league team names. Machine learning models can be trained to recognize such custom entities, which are usually denoted by proper names and therefore mostly appear as noun phrases in text documents.
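As a toy illustration of entity categories, here is a minimal rule-based tagger that matches capitalized tokens against hand-built gazetteers. The names and lists are invented for illustration; a real NER system would use a trained model rather than fixed lookup tables.

```python
import re

# Hypothetical, hand-built gazetteers; a real system learns these patterns from labelled data.
GAZETTEERS = {
    "PERSON": {"Harry", "Alice"},
    "ORG": {"Google", "Microsoft"},
    "LOC": {"Paris", "London"},
}

def tag_entities(text):
    """Label capitalized tokens found in a gazetteer; a toy stand-in for a trained NER model."""
    entities = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        token = match.group()
        for label, names in GAZETTEERS.items():
            if token in names:
                entities.append((token, label))
    return entities

print(tag_entities("Harry moved from London to Paris to work at Google."))
# [('Harry', 'PERSON'), ('London', 'LOC'), ('Paris', 'LOC'), ('Google', 'ORG')]
```

The obvious limitation of this sketch is that it cannot handle unseen names or ambiguous tokens, which is exactly why statistical and neural models dominate in practice.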
Broadly NER has three top-level categorizations - en...
Chunking (Shallow Parsing): Understanding Text Syntax and Structures, Part 2
We were introduced to text syntax and structures and took a detailed look at part-of-speech tagging in part 1 of this tutorial series. In this tutorial, we will learn about phrasal structure and shallow parsing.
A phrase can be a single word or a combination of words, based on the syntax and position of the phrase in a clause or sentence. For example, in the following sentence
My dog likes his food.
there are three phrases. "My dog" is a noun phrase, "likes" is a verb phrase, and "his food" is also a noun phrase.
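The noun phrases in the example above can be recovered with a minimal shallow parser. The sketch below assumes the tokens have already been POS-tagged with Penn Treebank tags (assigned by hand here) and groups determiner/possessive, adjective, and noun sequences into NP chunks; it is a toy illustration, not a production chunker.

```python
def chunk_noun_phrases(tagged):
    """Group (DT|PRP$|JJ)* NN+ sequences into NP chunks from POS-tagged tokens."""
    chunks, current = [], []
    for word, tag in tagged:
        starts_np = tag in ("DT", "PRP$", "JJ")
        is_noun = tag.startswith("NN")
        if is_noun or (starts_np and not any(t.startswith("NN") for _, t in current)):
            current.append((word, tag))  # extend the chunk in progress
        else:
            if current:
                chunks.append(" ".join(w for w, _ in current))
            # a determiner/possessive after a noun starts a fresh chunk
            current = [(word, tag)] if starts_np else []
    if current:
        chunks.append(" ".join(w for w, _ in current))
    return chunks

# "My dog likes his food." with hand-assigned Penn Treebank tags
tagged = [("My", "PRP$"), ("dog", "NN"), ("likes", "VBZ"),
          ("his", "PRP$"), ("food", "NN"), (".", ".")]
print(chunk_noun_phrases(tagged))  # ['My dog', 'his food']
```

In practice, a regex-based chunk grammar (for example NLTK's RegexpParser) or a learned chunker plays this role, but the grouping logic is the same idea.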
Parsing: Understanding Text Syntax and Structures, Part 3
We took a detailed look at part-of-speech tagging in part 1 and chunking in part 2 of this tutorial series. In this tutorial, we will learn what parsing is, its different types, and techniques to automatically infer the parse tree of sentences.
Natural language parsing (also known as deep parsing) is the process of analyzing the complete syntactic structure of a sentence. This includes how different words in a sentence relate to each other, for example, which words are the subject or object of a verb. Probabilistic parsing uses linguistic knowledge, such as grammatical rules or a supervised training set of hand-parsed sentences, to infer the most likely syntax and structure of new sentences.
Parsing is used to solve various complex NLP problems such as conversational dialogue and text summarization. It differs from 'shallow parsing' in that it yields more expressive structur...
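As a small sketch of how a parser decides whether a sentence fits a grammar, here is a CYK recognizer over a toy context-free grammar in Chomsky normal form. The grammar, lexicon, and sentences are invented for illustration; real parsers use far larger grammars (or learned models) and also build the parse tree, not just a yes/no answer.

```python
def cyk_parse(tokens, lexicon, binary_rules):
    """CYK recognition: chart[i][l] holds nonterminals spanning tokens[i:i+l]."""
    n = len(tokens)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    # base case: which nonterminals produce each word
    for i, tok in enumerate(tokens):
        chart[i][1] = {lhs for lhs, word in lexicon if word == tok}
    # combine adjacent spans bottom-up with the binary rules
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            for split in range(1, length):
                for lhs, (b, c) in binary_rules:
                    if b in chart[start][split] and c in chart[start + split][length - split]:
                        chart[start][length].add(lhs)
    return "S" in chart[0][n]

# A hypothetical toy grammar: S -> NP VP, VP -> V NP
lexicon = [("NP", "Harry"), ("V", "plays"), ("NP", "football")]
binary_rules = [("S", ("NP", "VP")), ("VP", ("V", "NP"))]
print(cyk_parse("Harry plays football".split(), lexicon, binary_rules))  # True
print(cyk_parse("plays Harry".split(), lexicon, binary_rules))           # False
```

A probabilistic parser follows the same chart-filling scheme but attaches probabilities to rules and keeps the highest-scoring analysis per span.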
Part-of-Speech Tagging: Understanding Text Syntax and Structures, Part 1
Language Syntax and Structure
The syntax and structure of a natural language such as English are governed by a set of specific rules, conventions, and principles that dictate how words are combined into phrases, phrases into clauses, and clauses into sentences. All of these constituents exist together in any sentence and are related to each other in a hierarchical structure.
Let’s consider a very basic example of language structure that illustrates the subject and predicate relationship. Consider a simple sentence:
Harry is playing football
This sentence is talking about two subjects: Harry and football. To find the subject of the sentence, it is easier to fir...
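A minimal sketch of POS tagging is a plain lexicon lookup with a default tag for unknown words. The tiny hand-built lexicon below is invented for illustration; real taggers (HMM, averaged perceptron, neural) learn tag probabilities from annotated corpora and use context, not just the word itself.

```python
# Hypothetical, hand-built lexicon mapping words to Penn Treebank tags.
LEXICON = {
    "harry": "NNP", "is": "VBZ", "playing": "VBG", "football": "NN",
}

def pos_tag(tokens, default="NN"):
    """Tag each token by lexicon lookup; fall back to the most common tag (noun)."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(pos_tag("Harry is playing football".split()))
# [('Harry', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'), ('football', 'NN')]
```

Even this baseline gets surprisingly far on common text, which is why most-frequent-tag lookup is the standard baseline against which trained taggers are measured.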
Topic Modeling is an efficient way to organize, understand, and summarize large volumes of text. With the huge amount of text data generated every day, it becomes challenging to access the most relevant information. Topic modeling enables efficient text browsing by:
Discovering hidden topical patterns present across the corpus
Annotating each document according to these topics
And finally, these annotations can be used to organize, search and summarize texts
Topic Modeling is a method for finding a group of words from a collection of documents that best represents the information in the collection. It can also be thought of as a form of text mining - a way to obtain recurring patterns of words in text. While there are many different algorithms for topic modeling, the most common is Latent Dirichlet Allocation, or LDA.
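To make LDA concrete, below is a minimal collapsed Gibbs sampler in pure Python. The toy corpus is invented for illustration, and this is a sketch of the sampling loop only; in practice you would use a library such as gensim or scikit-learn, which handle hyperparameters, convergence, and scale properly.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the top words of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]          # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                           # topic totals
    z = []                                        # topic assignment per token
    for d, doc in enumerate(docs):                # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):                        # resample every token's topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [sorted(nkw[t], key=nkw[t].get, reverse=True)[:3] for t in range(n_topics)]

docs = [
    "dog cat pet dog animal".split(),
    "cat pet dog cat".split(),
    "stock market trade stock price".split(),
    "market price trade market".split(),
]
print(lda_gibbs(docs))
```

With this tiny corpus the sampler tends to separate the pet words from the finance words into the two topics, which is exactly the "hidden topical patterns" behavior described above.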
Text Classification (Topic Categorization, Spam Filtering, etc.)
Text classification (or categorization) has long been in high demand, and it has only become more important with the increasing scale of text generated online. Moreover, differing contextual information across domains makes it challenging to improve the accuracy and performance of traditional approaches to text classification.
Some example applications of text classification include:
Assigning multiple topics to documents
Grouping of documents into a fixed number of predefined classes
Segregating the contextual details from a multi-domain corpus
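As a concrete example of grouping documents into predefined classes, here is a minimal multinomial Naive Bayes spam filter in pure Python with add-one smoothing. The tiny labelled corpus is invented for illustration; real classifiers are trained on thousands of examples and usually built with a library rather than by hand.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Count class priors and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

train = [
    ("win money now free prize", "spam"),
    ("free cash win lottery", "spam"),
    ("meeting agenda for the project", "ham"),
    ("please review the project report", "ham"),
]
model = train_nb(train)
print(classify("free prize money", model))        # spam
print(classify("project meeting report", model))  # ham
```

The same scaffold extends directly to topic categorization: swap the spam/ham labels for topic labels and the training examples for topic-labelled documents.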
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency, a very common algorithm for transforming text into a meaningful numeric representation. The technique is widely used to extract features for various NLP applications. This article will help you understand the importance of TF-IDF and how to compute and apply the algorithm in your applications.
Vector representation of Text
To use a machine learning algorithm or a statistical technique on any form of text, the text must first be transformed into a numeric or vector representation. This numeric representation should capture the significant characteristics of the text. There are many such techniques, for example occurrence counts, term frequency, TF-IDF, word co-occurrence matrices, word2vec, and GloVe.
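As a concrete sketch, TF-IDF weights for a toy corpus can be computed directly from the definition tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) the number of documents containing term t. The documents below are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by term frequency times inverse document frequency."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (count / len(doc)) * math.log(N / df[t])
                        for t, count in tf.items()})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats are pets".split(),
]
w = tfidf(docs)
# "the" appears in two of three documents, so its idf is low;
# "cat" appears in only one, so it gets a higher idf.
print(round(w[0]["cat"], 3))  # 0.183
print(round(w[0]["the"], 3))  # 0.135
```

Note that a term appearing in every document gets weight zero under this definition, which is the intended behavior: such a term carries no discriminating information. Library implementations (e.g. scikit-learn's TfidfVectorizer) use smoothed variants of the same formula.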
Occurrence-based vector representation
Since TF-IDF is an occurrence based numeric represen...