CommonLounge Archive

Data Science and Machine Learning Project Ideas

March 30, 2018

In this article, we discuss some data science project ideas that should be fun to implement and teach you a lot along the way. We provide the datasets you will need, along with some reference implementations so you can get started. The ideas are divided into three categories: Regression, Classification and Clustering.

Regression Problems

House Prices: Advanced Regression Techniques

A house's price depends on many variables. When we buy a house, we look at features such as space, number of bedrooms, location, construction quality, etc., and pay accordingly. In this problem, we will predict the prices of residential homes in Ames, Iowa using 79 different features. The problem is hosted as a Kaggle competition, and a link to the data is given below.

Field: Finance

Data: House Prices: Advanced Regression Techniques

Suggested Algorithms: Boosting, Random Forests
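Before trying boosting or random forests, it helps to decide how predictions will be scored. The sketch below rests on an assumption of ours rather than on anything stated above: it computes root-mean-squared error on the logarithm of prices, a common choice for house prices and, as far as we know, the metric used by the competition (check the Kaggle evaluation page to confirm). The function name rmse_log and the sample prices are hypothetical.

```python
import math


def rmse_log(actual_prices, predicted_prices):
    """Root-mean-squared error between log(predicted) and log(actual)."""
    if len(actual_prices) != len(predicted_prices):
        raise ValueError("both lists must have the same length")
    squared_errors = [
        (math.log(predicted) - math.log(actual)) ** 2
        for actual, predicted in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices: each prediction is 10% too high.
cheap_house_error = rmse_log([100_000.0], [110_000.0])
expensive_house_error = rmse_log([1_000_000.0], [1_100_000.0])
print(cheap_house_error, expensive_house_error)
```

Working in log space means that a 10% miss on a cheap house is penalized about as much as a 10% miss on an expensive one, so a few very expensive homes do not dominate the score.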

References:

  1. http://ww2.amstat.org/publications/jse/v19n3/decock.pdf
  2. A gentle introduction to data science – Part I – akquinet AG – Blog

Power Consumption Forecasting

In this problem, you will predict daily electricity consumption.

We will use the Reference Energy Disaggregation Data Set (REDD) provided by the Massachusetts Institute of Technology (MIT). The data set contains several weeks of power data for 6 different homes, along with high-frequency current/voltage data for the main power supply of two of these homes. It is time series data, recorded for 18 days between April 2011 and June 2011 and sampled at intervals of 3 seconds.

An example of energy consumption over the course of a day for one of the houses in REDD

Field: Energy

Data: REDD

Suggested Algorithms: Linear Regression, KNN regression, Neural Networks
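Since the readings arrive every 3 seconds but the target is daily consumption, a first step is to aggregate the raw samples into one value per day. The sketch below shows one way to do that, assuming each reading is a (timestamp, watts) pair; the function name daily_energy_kwh and the sample data are hypothetical and do not reflect the actual REDD file layout.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SAMPLE_SECONDS = 3  # sampling interval mentioned in the description


def daily_energy_kwh(readings, sample_seconds=SAMPLE_SECONDS):
    """Sum (timestamp, watts) readings into kilowatt-hours per calendar day."""
    joules_per_day = defaultdict(float)
    for timestamp, watts in readings:
        # Energy of one sample in watt-seconds (joules): power * duration.
        joules_per_day[timestamp.date()] += watts * sample_seconds
    # 1 kWh = 3,600,000 watt-seconds.
    return {day: joules / 3_600_000 for day, joules in sorted(joules_per_day.items())}


# Hypothetical data: a constant 1000 W load for one hour (1200 samples).
start = datetime(2011, 4, 18, 0, 0, 0)
readings = [
    (start + timedelta(seconds=SAMPLE_SECONDS * i), 1000.0) for i in range(1200)
]
print(daily_energy_kwh(readings))
```

The resulting daily totals can then serve as the regression target, with features such as the previous day's total or the day of the week.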

Reference:

  1. https://pdfs.semanticscholar.org/bfa7/9b3975c4cc1a37fb62d685e813f8f53040f0.pdf

Comment Volume Prediction

In this problem, the task is to predict how many comments a Facebook post will receive. The data set consists of about 40 thousand data points with 54 features each. Attributes include page popularity, post length, post share count, etc.

Facebook Comment

Field: Social Networks

Data: Index of /ml/machine-learning-databases/00363

Suggested Algorithms: Linear Regression, Decision Trees, Neural Networks
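As a starting point for the linear regression option, here is a minimal sketch of ordinary least squares with a single feature. It assumes we predict comment count from share count alone; the function name fit_line and the numbers are made up for illustration, and the real data set has 54 features.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical posts: share counts and the comments each post received.
share_counts = [0, 10, 20, 30]
comment_counts = [1, 6, 11, 16]

slope, intercept = fit_line(share_counts, comment_counts)
predicted_comments = slope * 40 + intercept  # prediction for a post with 40 shares
print(slope, intercept, predicted_comments)
```

A single-feature fit like this makes a useful baseline to beat before moving on to decision trees or neural networks with all the features.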

References:

  1. https://pdfs.semanticscholar.org/f6a7/5ba8cc59b1dc7286454c97d7f3830e9d2c82.pdf
  2. ResearchGate

Classification Problems

Breast Cancer Detection

In this problem, we will try to detect breast cancer using features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass (a sample image is given below). The data set given below was created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian from the University of Wisconsin. Features measured from the breast cell nuclei include area, perimeter, radius, texture, etc. There are 32 attributes per example (an ID, the diagnosis label, and 30 numeric features) and 569 examples in total. This is a small data set where feature extraction and feature selection techniques will be useful, as well as model averaging methods.

Sample of an Image from which Cell Nuclei Features were extracted

Field: Health

Data: Index of /ml/machine-learning-databases/breast-cancer-wisconsin

Suggested Algorithms: Logistic Regression, K-nearest neighbors, Support Vector Machine (SVM), Decision Trees combined with Dimensionality Reduction, Model Averaging
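To make the K-nearest neighbors option concrete, here is a minimal sketch of a KNN classifier with feature standardization, which matters because features like area and texture live on very different scales. The two features, their values, and the function names are hypothetical; the labels "M" (malignant) and "B" (benign) are an assumption about how the diagnosis is encoded.

```python
import math
from collections import Counter


def fit_scaler(rows):
    """Return per-feature means and standard deviations."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid /0
        for col, m in zip(columns, means)
    ]
    return means, stds


def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    means, stds = fit_scaler(train_rows)
    scaled_rows = [scale(row, means, stds) for row in train_rows]
    scaled_query = scale(query, means, stds)
    nearest = sorted(
        zip(scaled_rows, train_labels),
        key=lambda pair: math.dist(pair[0], scaled_query),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical (radius, texture) measurements.
train_rows = [[12.0, 15.0], [13.0, 14.0], [11.5, 16.0],
              [20.0, 25.0], [19.0, 27.0], [21.0, 24.0]]
train_labels = ["B", "B", "B", "M", "M", "M"]
print(knn_predict(train_rows, train_labels, [20.5, 26.0]))
```

On the real data, the scaler should be fit on the training split only, and k is worth tuning with cross-validation given how few examples there are.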

References:

  1. https://ac.els-cdn.com/S1877050916302575/1-s2.0-S1877050916302575-main.pdf?tid=022dbe73-6cc7-4680-8c5b-731606f550cf&acdnat=15218721839ecac853ad6f5d0c6ac6dc97e27d8d78
  2. Python Programming Tutorials

Sentiment Classification

Sentiment classification is a type of text classification problem. People express different kinds of sentiment, such as praise, sarcasm, doubt, fear and dislike, in natural languages like English, Hindi, French or Nepali. The data set given below is a collection of movie reviews divided into 25,000 training data points and 25,000 test data points. Each review carries one of two sentiment labels: positive or negative. This is a first step into sentiment classification. You can also look up data sets with more than two classes.

Field: Natural Language Processing, Text Classification

Data: Sentiment Analysis

Suggested Algorithms: TF-IDF, Text Classification, Deep Natural Language Processing, Naive Bayes
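As an illustration of the Naive Bayes option, the sketch below trains a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. Note that it uses raw word counts rather than TF-IDF weights, to keep the example short. The four training reviews and all function names are invented for illustration; the real data set has 25,000 training reviews.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(documents, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy reviews.
reviews = ["a great and moving film", "great acting and a fun story",
           "a boring and dull film", "dull story and terrible acting"]
sentiments = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(reviews, sentiments)
print(predict(model, "a fun and great story"))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long reviews.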

References:

  1. http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf
  2. https://thesai.org/Downloads/Volume8No6/Paper57-SentimentAnalysisusingDeep_Learning.pdf

Hand-written Digit Recognition

Handwritten digit recognition is a first step into computer vision. The MNIST (Modified National Institute of Standards and Technology) data set was first released in 1999, and it has since been a benchmark for classification algorithms. The data set consists of 60,000 training examples and 10,000 test examples. It has ten classes, one for each digit from 0 to 9. This can be the first step towards a larger system for optical character recognition, and many interesting applications can be built on top of such a system. Besides that, it is a good place to start with deep learning. Deep learning algorithms work very well for computer vision tasks and have achieved human-level (or higher) accuracy on several benchmarks. This should be a fun project where you learn several things at once.

Sample of MNIST Data set

Field: Computer vision

Data: MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges

Suggested Algorithms: Support Vector Machine (SVM), K-nearest neighbors, Convolutional Neural Networks
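For readers new to Convolutional Neural Networks, the core operation is sliding a small kernel over the image and summing element-wise products. The sketch below implements that operation (strictly speaking a cross-correlation, since the kernel is not flipped, which is the usual convention in deep learning) with no padding and a stride of 1. The tiny 4x5 image and the vertical-edge kernel are hypothetical stand-ins; MNIST images are 28x28 pixels, and a real network learns its kernels from data.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Sum of element-wise products over the current window.
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1]]
# Kernel that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The output is zero where the window covers a flat region and positive where it straddles the edge, which is the kind of local pattern a convolutional layer picks up in a digit's strokes.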

References:

  1. https://arxiv.org/pdf/1702.00723.pdf
  2. MNIST For ML Beginners
  3. Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras

Clustering Problems

Document Clustering for Topic Detection

Unsupervised machine learning methods like clustering can be used to automatically group similar documents together. In this problem, we have to cluster news articles into different categories. For this we will use the 20 Newsgroups data set, which consists of about 20,000 news documents partitioned into 20 groups. Examples of group topics include science, technology, medicine, atheism, baseball, etc. This data set can be used for text classification as well as text clustering. We will use it for clustering by taking the labels of the documents as ground truth, i.e. if two documents share the same label, they should belong to the same cluster.

An Example of Text Clustering

Field: Natural Language Processing

Data: Home Page for 20 Newsgroups Data Set

Suggested Algorithms: Latent Dirichlet Allocation, K-Means Clustering
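The ground-truth idea described above (two documents with the same label should land in the same cluster) can be turned into a score by checking every pair of documents. The sketch below computes the fraction of pairs on which the labels and the clustering agree, a measure known as the Rand index. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, cluster_ids):
    """Fraction of document pairs where labels and clusters agree."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("both lists must have the same length")
    if len(true_labels) < 2:
        raise ValueError("need at least two documents to form a pair")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_label = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        # A pair agrees if it is together in both or apart in both.
        if same_label == same_cluster:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical newsgroup labels for four documents.
labels = ["sci", "sci", "sports", "sports"]
print(rand_index(labels, [0, 0, 1, 1]))  # perfect clustering
print(rand_index(labels, [0, 1, 1, 1]))  # one document misplaced
```

The cluster ids themselves are arbitrary, so the score depends only on which documents are grouped together. Checking all pairs is quadratic in the number of documents, so for the full data set a sample of pairs or a library implementation is more practical.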

References:

  1. https://pdfs.semanticscholar.org/9860/487cd9e840d946b93457d11605be643e6d4c.pdf
  2. Clustering text documents using k-means — scikit-learn 0.19.1 documentation

Grouping Visually Similar Images

Grouping visually similar images makes searching for images easier. In this problem, we will automatically group similar images using clustering algorithms. For this purpose, we will use the INRIA Holidays data set, a collection of personal holiday pictures grouped into 500 image categories.

Visually Similar Images

Field: Computer Vision

Data: Download datasets

Suggested Algorithms: K-Means Clustering, Convolutional Neural Network
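To show what K-Means does once each image has been turned into a feature vector, here is a minimal sketch of the algorithm. It assumes every image is already represented by a short list of numbers (for example, a color histogram or features from a convolutional network); the two-dimensional vectors, the choice of starting centroids, and the function name are all hypothetical.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical image features: (fraction of blue pixels, fraction of green pixels).
features = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
            [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
assignments, centroids = kmeans(features, [features[0], features[3]])
print(assignments, centroids)
```

In practice the starting centroids are chosen at random (often several times, keeping the best run), and the quality of the grouping depends far more on the image features than on the clustering step itself.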

References:

  1. http://www.keysers.net/daniel/files/it2003.pdf
  2. https://link.springer.com/content/pdf/10.1007/s10596-014-9459-2.pdf

More Projects

For project ideas specifically related to natural language processing, take a look at: List of NLP Project Ideas (including Datasets)

