This article contains a list of project ideas that you can use to get hands-on experience in Natural Language Processing. While “Hello World” problems help with quick onboarding, the following 10 “Real World” problems should make you more comfortable solving NLP problems in the future. Each idea includes a link to a freely available public dataset, as well as a suggested algorithm for solving the problem.
Problem: Train a machine learning model to predict tags for StackOverflow questions. This is a classic multi-label text classification problem, i.e. each question can have multiple tags associated with it.
Suggested Algorithm: Labeled LDA
Dataset: You can use any one or both of the following datasets
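The article suggests Labeled LDA; as a simpler starting point, a one-vs-rest classifier over TF-IDF features is a common multi-label baseline. The questions and tags below are invented stand-ins for the StackOverflow data:

```python
# Sketch: a one-vs-rest baseline for multi-label tag prediction.
# (Labeled LDA is the suggested algorithm; this is a simpler alternative.)
# The example questions and tags are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

questions = [
    "How do I merge two dicts in Python?",
    "Segmentation fault when freeing a pointer in C",
    "How to center a div with CSS?",
    "Python list comprehension inside a class body",
]
tags = [["python"], ["c"], ["css", "html"], ["python"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary indicator column per tag

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),  # one binary classifier per tag
)
clf.fit(questions, Y)

pred = clf.predict(["How to sort a dict by value in Python?"])
print(mlb.inverse_transform(pred))  # tags whose classifiers fired
```

Because each tag gets its own binary classifier, a question can receive zero, one, or several tags, which matches the multi-label framing above.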
Problem: Use an unsupervised algorithm for topic modeling / clustering to divide the dataset into K clusters. Manually inspect the clusters to see if they look sensible.
Dataset: A Million News Headlines
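One common way to set this up is TF-IDF features plus k-means; the headlines below are invented, with the real input being the dataset above:

```python
# Sketch: unsupervised clustering of headlines into K clusters.
# TF-IDF + KMeans is one simple option; the headlines are invented.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "interest rates rise as inflation climbs",
    "bank raises interest rates again",
    "local team wins cup final",
    "striker scores twice in cup final win",
]

X = TfidfVectorizer(stop_words="english").fit_transform(headlines)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # manually inspect which headlines landed in each cluster
```

The manual-inspection step from the problem statement amounts to reading the headlines grouped by `km.labels_` and judging whether each cluster has a coherent theme.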
Problem: The challenge is to identify (and therefore avoid) duplicate questions being asked on Quora. We need to determine semantic equivalence between questions posted on Quora and suggest the closest one. Quora provides a labelled dataset of 400,000 question pairs, with a binary value indicating whether each pair is a duplicate or not.
Feature engineering is going to be really important for this problem (see Chunking tutorial and NER tutorial for examples). This is also a good dataset to apply sentence level techniques such as parsing.
Dataset: Quora Dataset
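A very simple baseline before any feature engineering is to score each pair by TF-IDF cosine similarity; the pairs below are invented examples, not from the Quora dataset:

```python
# Sketch: a cosine-similarity baseline for duplicate-question detection.
# A real solution would train a classifier on the labelled pairs using
# engineered features (chunk overlap, shared entities, parse features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("How can I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("How can I learn Python quickly?", "What is the capital of France?"),
]

vec = TfidfVectorizer().fit([q for pair in pairs for q in pair])

sims = []
for q1, q2 in pairs:
    sim = cosine_similarity(vec.transform([q1]), vec.transform([q2]))[0, 0]
    sims.append(sim)
    print(q1, "|", q2, "->", round(sim, 2))
```

Thresholding this similarity gives a weak duplicate detector; the similarity score also makes a useful input feature alongside the hand-engineered ones mentioned above.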
Problem: Build an automatic spell checking and correction algorithm.
Suggested Algorithm: Spell Checking and Correction
- Corpora of misspellings for download contains a list of sentences, with corrections done manually. The main file has tags of the following form: <ERR targ=sister> siter </ERR>, meaning that siter was written for sister. There are other supporting files with statistics, such as how many times each word was misspelled in spelling contests.
- RWSE Datasets contains a list of misspellings that were fixed on Wikipedia. This dataset focuses more on the wrong word being used, so the context in which the words appear matters more. For example: He had a prepossessing body, broad soldiers, he had a goaty and long hair. (soldiers should be shoulders).
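For the first kind of error (a non-word like siter), a minimal Norvig-style corrector works: generate all candidates within one edit and pick the most frequent known word. The tiny hard-coded vocabulary below stands in for real corpus counts:

```python
# Sketch: a minimal edit-distance spell corrector. A real corrector would
# build WORDS from word frequencies in a large corpus; this toy vocabulary
# is hard-coded for illustration.
from collections import Counter

WORDS = Counter(["sister", "sister", "the", "spelling", "correction"])

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Most frequent known word within one edit (falls back to the input)."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correction("siter"))  # → sister (the misspelling from the corpus example)
```

The second kind of error (a real word used in the wrong context, like soldiers for shoulders) is harder and needs context models rather than edit distance alone.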
Problem: Implement a machine learning algorithm to automatically grade essay responses. Feature engineering is going to be really important for this problem (see Chunking tutorial and NER tutorial for examples).
Suggested Algorithm: Linear Regression on features such as number of entities used, lexical diversity, emotions, sentiments, etc.
Dataset: Essays with human graded scores
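Following the suggested approach, here is a sketch of linear regression on two simple hand-crafted features. The essays, scores, and feature choices are invented for illustration; real features would include entity counts, sentiment, and syntax-based measures:

```python
# Sketch: linear regression on simple essay features (length and
# lexical diversity). Essays and scores below are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

def features(essay):
    words = essay.lower().split()
    # feature 1: essay length; feature 2: lexical diversity (type/token ratio)
    return [len(words), len(set(words)) / len(words)]

essays = [
    "the cat sat on the mat",
    "economic policy shapes long term growth through investment and education",
    "good good good good",
]
scores = [2.0, 5.0, 1.0]

X = np.array([features(e) for e in essays])
model = LinearRegression().fit(X, scores)

pred = model.predict(np.array([features("a short but varied answer here")]))
print(pred)
```

Adding richer features (named entities, discourse markers, grammar errors) is where most of the improvement on this task comes from, which is why the problem statement stresses feature engineering.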
Problem: Train a model that can categorize opinions expressed by people in their tweets. The sentiment in a tweet could be positive, neutral or negative.
Dataset: Tweets sentiment tagged by humans
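A standard baseline for three-class sentiment is a linear classifier over TF-IDF features; the tweets and labels below are invented stand-ins for the human-tagged dataset:

```python
# Sketch: three-class tweet sentiment classification with a linear model.
# Training examples are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "I love this phone, best purchase ever",
    "worst service I have ever had, truly awful",
    "the package arrived on tuesday",
    "absolutely fantastic experience, love it",
    "terrible, awful quality, never again",
    "the meeting is at noon",
]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(tweets, labels)
print(clf.predict(["I love it, fantastic!"]))
```

With real tweet data, preprocessing (handling hashtags, mentions, emoji, and elongated words like "soooo") usually matters as much as the choice of classifier.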
Problem: A classic named entity extraction problem: extract medical entities from clinical text obtained from electronic health records. These entities are clinical concepts such as diseases, disorders, symptoms, medications, procedures, etc.
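The simplest possible baseline is dictionary lookup against a medical lexicon; real systems train sequence models (e.g. CRFs or neural taggers) on annotated clinical text. The lexicon and note below are invented for illustration:

```python
# Sketch: a dictionary-lookup baseline for medical entity extraction.
# The lexicon and the clinical note are made up; a real lexicon would come
# from a resource such as a medical terminology database.
MEDICAL_LEXICON = {
    "diabetes": "DISEASE",
    "hypertension": "DISEASE",
    "metformin": "MEDICATION",
    "headache": "SYMPTOM",
}

def extract_entities(text):
    """Return (token, label) pairs for tokens found in the lexicon."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [(t, MEDICAL_LEXICON[t]) for t in tokens if t in MEDICAL_LEXICON]

note = "Patient with diabetes and hypertension, started on metformin."
print(extract_entities(note))
```

A lookup baseline misses multi-word concepts, abbreviations, and misspellings common in clinical notes, which is exactly what a trained sequence tagger is meant to handle.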
Problem: Spam SMS is a major problem that annoys many people. A spam filter can be rule-based, but spammers easily identify the rules and find ways to deceive them. A machine learning model can predict whether an SMS is spam or not, and can be retrained on new data whenever spammers find new terms for spam.
Suggested Algorithm: Naive Bayes
Dataset: SMS Spam Collection Dataset
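The suggested Naive Bayes classifier can be sketched as a bag-of-words pipeline; the messages below are invented, with the real input being the SMS Spam Collection:

```python
# Sketch: the suggested Naive Bayes classifier for SMS spam detection.
# Training messages are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "WINNER! Claim your free prize now, text WIN to 80082",
    "URGENT: your account has won a cash award, call now",
    "are we still meeting for lunch today?",
    "can you pick up milk on the way home",
]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(messages, labels)
print(clf.predict(["Free prize! call now to claim"]))
```

Retraining, as the problem statement notes, is just re-running `fit` on a dataset extended with the newly observed spam.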
Problem: Implement a language identification algorithm to predict the language in which a tweet is written.
Suggested Algorithm: Natural Language Identification
Dataset: Short text language identification
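Language identification on short texts usually works well with character n-gram features, since they capture language-specific letter patterns. The training sentences below are invented; real systems train on far more text per language:

```python
# Sketch: language identification with character n-gram features.
# Two toy languages (English, Spanish) with invented training sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "the weather is nice today",
    "where is the nearest station",
    "el tiempo es muy agradable hoy",
    "donde esta la estacion mas cercana",
]
labels = ["en", "en", "es", "es"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # char n-grams
    MultinomialNB(),
).fit(texts, labels)
print(clf.predict(["where is the weather today"]))
```

Character n-grams are preferred over word features here because tweets are short, noisy, and full of out-of-vocabulary tokens.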
The following is a list of some more datasets that could be used for different NLP problems:
- Text-based datasets available at Kaggle.com (scroll down to check the extensive list)
- Home Page for 20 Newsgroups Data Set - A collection of approximately 20,000 news documents, partitioned into 20 different categories (groups). The dataset is extensively used for Text Classification and Clustering.
- DBPedia - Contains 6.6M entities. In total, 5.5M resources are classified in a consistent ontology, consisting of 1.5M persons, 840K places, 286K organizations, 306K species, 58K plants and 6K diseases. (derived from Wikipedia).
- Reuters-21578 Text Categorization Collection Data Set - A collection of categorized documents that appeared on Reuters newswire in 1987.
- Text REtrieval Conference (TREC) Dataset - Dataset used in information retrieval.
- CSTR Dataset - A dataset used in speech-related research such as speech recognition, speech synthesis, dialogue systems, etc.
- World Factbook Dataset provides the US government's profiles of countries and territories around the world. It has information on geography, people, government, transportation, economy, etc.
- ConceptNet is a semantic network dataset which is designed to help computers understand the meanings of words. It is used for many textual reasoning based problems.
For some ideas around data science in general (not just natural language processing), take a look at the Data Science Project Ideas article.