List of NLP Project Ideas (including Datasets)

March 30, 2018

This article contains a list of project ideas that you can use to get hands-on experience in Natural Language Processing. While “Hello World” problems help with quick onboarding, the following 10 “Real World” problems should make you feel more comfortable solving NLP problems in the future. Each idea includes a link to a freely available public dataset, as well as a suggested algorithm to solve the problem.

1. Tagging of Stack Overflow Questions

Problem: Train a machine learning model to predict tags for Stack Overflow questions. This is a classic multi-label text classification problem, i.e., each question can have multiple tags associated with it.

Suggested Algorithm: Labeled LDA
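
Because each question can carry several tags, the training target is a set of tags rather than a single class. The following is a minimal sketch of one common way to encode such targets (one 0/1 column per tag). It only illustrates the target representation, not Labeled LDA itself; the function names and the example tags are made up for illustration.

```python
from typing import Dict, List


def build_tag_index(tag_lists: List[List[str]]) -> Dict[str, int]:
    """Map every distinct tag to a column index (sorted for determinism)."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(vocab)}


def to_multi_hot(tags: List[str], index: Dict[str, int]) -> List[int]:
    """Encode one question's tags as a 0/1 vector with one slot per known tag."""
    row = [0] * len(index)
    for tag in tags:
        if tag in index:  # tags never seen while building the index are ignored
            row[index[tag]] = 1
    return row


# Made-up tag sets for three questions (not taken from the real dataset).
question_tags = [["python", "pandas"], ["java"], ["python", "regex"]]
tag_index = build_tag_index(question_tags)
targets = [to_multi_hot(tags, tag_index) for tags in question_tags]
```

A multi-label classifier can then be trained to predict each column independently or jointly, depending on the algorithm chosen.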

Dataset: You can use any one or both of the following datasets

2. Clusters for News Headlines

Problem: Use an unsupervised algorithm for topic modeling / clustering to divide the dataset into K clusters. Manually inspect the clusters to see if they look sensible.

Suggested Algorithm: LDA or Latent Semantic Analysis

Dataset: A Million News Headlines

3. Finding duplicate questions on Quora

Problem: The challenge is to identify (and therefore avoid) duplicate questions being asked on Quora. We need to determine semantic equivalence between questions posted on Quora and suggest the closest one. There is a labelled dataset from Quora that provides 400,000 pairs of questions, and a binary value that indicates whether the questions are a duplicate pair or not.

Feature engineering is going to be really important for this problem (see Chunking tutorial and NER tutorial for examples). This is also a good dataset to apply sentence level techniques such as parsing.

Suggested Algorithm: Classification algorithms such as Support Vector Machines or Naive Bayes

Dataset: Quora Dataset
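
As a starting point for the feature engineering mentioned above, here is a minimal sketch of a few hand-crafted features for a question pair, which could be fed to a classifier such as an SVM. The feature names and the example questions are assumptions chosen for illustration; word overlap alone will not capture semantic equivalence.

```python
import re
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def pair_features(q1: str, q2: str) -> Dict[str, float]:
    """Compute simple overlap features for a pair of questions."""
    t1, t2 = tokens(q1), tokens(q2)
    shared = t1 & t2
    union = t1 | t2
    return {
        # Share of distinct words that the two questions have in common.
        "jaccard": len(shared) / len(union) if union else 0.0,
        "shared_words": float(len(shared)),
        # Difference in length, measured in whitespace-separated words.
        "length_diff": float(abs(len(q1.split()) - len(q2.split()))),
    }


# Made-up question pair, not taken from the Quora dataset.
features = pair_features("How do I learn Python?", "How can I learn Python quickly?")
```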

4. Spell check

Problem: Build an automatic spell checking and correction algorithm.

Suggested Algorithm: Spell Checking and Correction

Dataset:

Corpora of misspellings for download contains a list of sentences, with corrections done manually. The main file has tags of the following form: <ERR targ=sister> siter </ERR>, meaning that siter was written for sister. There are other supporting files with statistics such as which word has been misspelled how many times in spelling contests, etc.
RWSE Datasets contains a list of misspellings that were fixed on Wikipedia. This dataset focuses more on cases where an incorrect word is used, and hence the context in which the words are used is more important. For example: “He had a prepossessing body, broad soldiers, he had a goaty and long hair.” (soldiers should be shoulders).
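
The tag format described above can be read with a regular expression, and the extracted pairs can be used to evaluate a corrector. The sketch below pairs that parser with a simple edit-distance-1 corrector; it is one possible baseline, not the method behind the linked tutorial. The regular expression assumes the tags look exactly like the `<ERR targ=sister> siter </ERR>` example, and the word frequencies are made up.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

# Assumes tags of the form <ERR targ=sister> siter </ERR>.
ERR_TAG = re.compile(r"<ERR targ=([^>]+)>\s*(.*?)\s*</ERR>")


def extract_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (misspelling, intended word) pairs found in tagged text."""
    return [(wrong, target.strip()) for target, wrong in ERR_TAG.findall(text)]


def edits1(word: str) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, counts: Counter) -> str:
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (counts[w], w)) if candidates else word


pairs = extract_pairs("my <ERR targ=sister> siter </ERR> lives here")
word_counts = Counter({"sister": 5, "site": 2, "sitter": 1})  # made-up frequencies
suggestion = correct("siter", word_counts)
```

A corrector like this only handles non-word errors. Real-word errors such as soldiers for shoulders need a model of the surrounding context.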

5. Automated Essay Grading

Problem: Implement a machine learning algorithm to automatically grade essay responses. Feature engineering is going to be really important for this problem (see Chunking tutorial and NER tutorial for examples).

Suggested Algorithm: Linear Regression on features such as number of entities used, lexical diversity, emotions, sentiments, etc.

Dataset: Essays with human graded scores
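
Before fitting a regression model, each essay has to be turned into numeric features. The following is a minimal sketch of three surface features, including the lexical diversity mentioned above (computed here as distinct words divided by total words). The feature set and function name are assumptions for illustration; entity, emotion, and sentiment features would need additional tooling.

```python
import re
from typing import Dict


def essay_features(essay: str) -> Dict[str, float]:
    """Compute simple surface features that could feed a regression model."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    # Treat runs of ., ! or ? as sentence boundaries (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    return {
        "word_count": float(n_words),
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


# Made-up text, not taken from the essay dataset.
features = essay_features("The cat sat. The cat ran!")
```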

6. Sentiment Analysis on tweets

Problem: Train a model that can categorize opinions expressed by people in their tweets. The sentiment in a tweet could be positive, neutral or negative.

Suggested Algorithm: Naive Bayes or Random Forest

Dataset: Tweets sentiment tagged by humans

7. Entity Extraction in clinical text

Problem: A classic named entity extraction problem: extract medical entities from clinical text obtained from electronic health records. These entities consist of clinical concepts such as diseases, disorders, symptoms, medications, procedures, etc.

Suggested Algorithm: Conditional Random Fields and Named Entity Recognition

Dataset: Informatics for Integrating Biology and the Bedside (i2b2)
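
Sequence models such as Conditional Random Fields are usually trained on per-token features paired with per-token labels. The sketch below shows one way such features might be built. The feature names, the example sentence, and the BIO-style labels are all hypothetical and do not reflect the annotation scheme of the dataset above.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe the token at position i using itself and its neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Made-up sentence with hypothetical labels: B- starts an entity, I- continues it.
sentence = ["Patient", "was", "given", "aspirin", "for", "chest", "pain"]
labels = ["O", "O", "O", "B-MEDICATION", "O", "B-SYMPTOM", "I-SYMPTOM"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```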

8. SMS Spam Filtering

Problem: Spam SMS is a major problem that annoys many people. A spam filter can be rule-based, but spammers easily identify the rules and find ways to evade them. A machine learning based model can predict whether an SMS is spam or not, and it can be retrained on new data whenever spammers find new terms for spam.

Suggested Algorithm: Naive Bayes

Dataset: SMS Spam Collection Dataset
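
To make the suggested approach concrete, here is a minimal sketch of a multinomial Naive Bayes filter with add-one smoothing, written with the standard library only. The class name and the four training messages are made up for illustration; in practice a library implementation and the full dataset would be the better choice.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesFilter:
    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def fit(self, messages: List[Tuple[str, str]]) -> None:
        """Add (text, label) pairs to the counts; can be called again on new data."""
        for text, label in messages:
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text: str) -> str:
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = "", -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                count = self.word_counts[label][tok]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training messages, not taken from the SMS Spam Collection.
training = [
    ("Win a free prize now", "spam"),
    ("Free entry, claim your cash prize", "spam"),
    ("Are we still meeting for lunch", "ham"),
    ("Call me when you get home", "ham"),
]
spam_filter = NaiveBayesFilter()
spam_filter.fit(training)
```

Because `fit` only accumulates counts, calling it again with newly labelled messages updates the model without starting over.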

9. Language Detection of a Tweet

Problem: Implement a language identification algorithm to predict the language in which a tweet is written.

Suggested Algorithm: Natural Language Identification

Dataset: Short text language identification
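
One common baseline for language identification compares character n-gram profiles. The sketch below builds a trigram profile per language and picks the language whose profile is most similar to the input by cosine similarity. The function names and the two tiny training sentences are made up for illustration; real profiles would be built from much larger samples per language.

```python
import math
from collections import Counter
from typing import Dict


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams, padding with spaces to mark word edges."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_language(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Made-up training sentences, far too small for real use.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple sentence",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es una frase simple",
}
profiles = {lang: char_ngrams(text) for lang, text in samples.items()}
guess = detect_language("this is the dog", profiles)
```

Tweets are short and noisy, so hashtags, mentions, and URLs would likely need to be stripped before building the n-gram counts.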

10. More Datasets

The following is a list of additional datasets that could be used for different NLP problems:

Text based datasets available at Kaggle.com (Scroll down to check the extensive list)
Home Page for 20 Newsgroups Data Set - A collection of approximately 20,000 news documents, partitioned into 20 different categories (groups). The dataset is extensively used for Text Classification and Clustering.
DBpedia - A dataset derived from Wikipedia that contains 6.6M entities. In total, 5.5M resources are classified in a consistent ontology, consisting of 1.5M persons, 840K places, 286K organizations, 306K species, 58K plants and 6K diseases.
Reuters-21578 Text Categorization Collection Data Set - A collection of categorized documents that appeared on Reuters newswire in 1987.
Text REtrieval Conference (TREC) Dataset - Dataset used in information retrieval.
CSTR Dataset - The dataset used in speech related research work such as speech recognition, speech synthesis, dialogue systems, etc.
World Factbook Dataset provides the US government's profiles of countries and territories around the world. It has information on geography, people, government, transportation, economy, etc.
ConceptNet is a semantic network dataset which is designed to help computers understand the meanings of words. It is used for many textual reasoning based problems.

11. Data Science projects

For some ideas around data science in general (not just natural language processing), take a look at the Data Science Project Ideas article.