# Data Science and Machine Learning Project Ideas

March 30, 2018

In this article, we will discuss some Data Science project ideas that should be fun to implement and teach you a lot along the way. We point to the datasets you will need, and to some reference implementations so you can get started. The list is divided into three categories: Regression, Classification and Clustering.

# Regression Problems

## House Prices: Advanced Regression Techniques

The price of a house depends on many variables. When we go to buy a house, we look at features such as space, number of bedrooms, location and construction quality, and pay the price accordingly. In this problem we will predict the price of residential homes in Ames, Iowa using 79 different features. The problem is hosted as a Kaggle competition, and a link to the data is given below.

**Field:** Finance

**Data:** House Prices: Advanced Regression Techniques

**Suggested Algorithms:** Boosting, Random Forests
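
To make the boosting idea concrete, here is a minimal sketch of gradient boosting for regression, assuming squared-error loss and depth-1 trees (stumps) on a single feature. The living areas and prices are made-up illustration values, not rows from the Ames data, and all function names are hypothetical. A real attempt would use all 79 features and most likely a library implementation.

```python
# Toy gradient boosting for regression with decision stumps.
# Assumption: one numeric feature with at least two distinct values.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(x, base, stumps, learning_rate):
    """Base prediction plus the shrunken contribution of every stump."""
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - predict(x, base, stumps, learning_rate)
                     for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mean_squared_error(xs, ys, base, stumps, learning_rate):
    errors = [(y - predict(x, base, stumps, learning_rate)) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical data: living area (sq ft) and sale price (thousands of dollars).
areas = [850, 1100, 1300, 1500, 1750, 2000, 2400, 3000]
prices = [105, 128, 140, 165, 180, 210, 250, 310]

LEARNING_RATE = 0.1
base, stumps = fit_boosting(areas, prices, n_rounds=50, learning_rate=LEARNING_RATE)
baseline_mse = mean_squared_error(areas, prices, base, [], LEARNING_RATE)
boosted_mse = mean_squared_error(areas, prices, base, stumps, LEARNING_RATE)
print(f"baseline MSE: {baseline_mse:.1f}, boosted MSE: {boosted_mse:.1f}")
```

The errors printed here are training errors only. For the actual competition you would hold out part of the data to check that the model generalises.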

**References:**

- http://ww2.amstat.org/publications/jse/v19n3/decock.pdf
- A gentle introduction to data science – Part I – akquinet AG – Blog

## Power Consumption Forecasting

In this problem, you will predict daily electricity consumption.

We will use the Reference Energy Disaggregation Data Set (REDD) provided by the Massachusetts Institute of Technology (MIT). The data set consists of several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. It is time series data recorded for 18 days between April 2011 and June 2011 and sampled at intervals of 3 seconds.

An example of energy consumption over the course of a day for one of the houses in REDD
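
Since the raw readings arrive every 3 seconds but the target is daily consumption, a first step is to aggregate the samples per day. The sketch below shows one way to do this, assuming each reading is a `(timestamp, watts)` pair. That input layout, the function name and the synthetic readings are assumptions for illustration, not the actual REDD file format.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SAMPLE_INTERVAL_S = 3  # sampling interval mentioned in the text


def daily_energy_kwh(readings, interval_s=SAMPLE_INTERVAL_S):
    """Sum (timestamp, watts) readings into kWh per calendar day."""
    watt_seconds = defaultdict(float)
    for timestamp, watts in readings:
        # Each sample is treated as constant power over one interval.
        watt_seconds[timestamp.date()] += watts * interval_s
    # 1 kWh = 1000 W * 3600 s = 3,600,000 watt-seconds.
    return {day: ws / 3_600_000 for day, ws in sorted(watt_seconds.items())}


# Synthetic example: a constant 1000 W load for two full days.
start = datetime(2011, 4, 18)  # arbitrary start date for the illustration
samples_per_day = 24 * 60 * 60 // SAMPLE_INTERVAL_S
readings = [
    (start + timedelta(seconds=i * SAMPLE_INTERVAL_S), 1000.0)
    for i in range(2 * samples_per_day)
]

daily = daily_energy_kwh(readings)
for day, kwh in daily.items():
    print(day, round(kwh, 2))
```

The resulting daily totals can then serve as the target series for the regression models suggested below.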

**Field:** Energy

**Data:** REDD

**Suggested Algorithms:** Linear Regression, KNN regression, Neural Networks


## Comment Volume Prediction

In this problem, the task is to predict how many comments a Facebook post received. The data set consists of about 40 thousand data points with 54 features each. Some of the attributes are page popularity, post length, post share count, etc.


**Field:** Social Networks

**Data:** Index of /ml/machine-learning-databases/00363

**Suggested Algorithms:** Linear Regression, Decision Trees, Neural Networks
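
As a starting point for the linear regression baseline, the following sketch fits a straight line by ordinary least squares on a single feature. The share counts and comment counts are invented so that the relationship is exactly linear; they are not taken from the data set, and `fit_line` is a hypothetical helper name. With all 54 features you would use a multivariate fit from a library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data following comments = 2 * shares + 1.
share_counts = [0, 2, 5, 10, 20, 40]
comment_counts = [1, 5, 11, 21, 41, 81]

slope, intercept = fit_line(share_counts, comment_counts)
predicted = slope * 15 + intercept  # predicted comments for 15 shares
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```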


# Classification Problems

## Breast Cancer Detection

In this problem, we will try to detect breast cancer using features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass (a sample image is given below). The data set was created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian from the University of Wisconsin. Features measured from the breast cell nuclei include area, perimeter, radius, texture, etc. There are 569 examples with 32 attributes each: an ID, the diagnosis label, and 30 real-valued features. This is a **small dataset**, so feature extraction and feature selection techniques will be useful, as well as model averaging methods.

Sample of an Image from which Cell Nuclei Features were extracted

**Field:** Health

**Data:** Index of /ml/machine-learning-databases/breast-cancer-wisconsin

**Suggested Algorithms:** Logistic Regression, K-nearest neighbors, Support Vector Machine (SVM), Decision Trees combined with Dimensionality Reduction, Model Averaging
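
As an illustration of one of the suggested algorithms, here is a minimal k-nearest neighbors sketch with feature standardization, which matters because distance-based methods are sensitive to feature scale. The six two-feature rows (radius, texture) are made-up values rather than rows from the Wisconsin data, and the helper names are hypothetical.

```python
import math
from collections import Counter


def fit_scaler(rows):
    """Per-column mean and standard deviation (1.0 if a column is constant)."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return means, stds


def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (radius, texture) measurements with their diagnosis.
features = [
    [12.0, 15.0], [11.5, 14.0], [13.0, 16.5],
    [20.0, 25.0], [19.0, 23.5], [21.5, 27.0],
]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

means, stds = fit_scaler(features)
train_x = [scale(row, means, stds) for row in features]

prediction_small = knn_predict(train_x, labels, scale([12.5, 15.5], means, stds))
prediction_large = knn_predict(train_x, labels, scale([20.5, 26.0], means, stds))
print(prediction_small, prediction_large)
```

On the real data, the scaler should be fitted on the training split only and then applied to the test split.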

**References:**

- https://ac.els-cdn.com/S1877050916302575/1-s2.0-S1877050916302575-main.pdf?_tid=022dbe73-6cc7-4680-8c5b-731606f550cf&acdnat=1521872183_9ecac853ad6f5d0c6ac6dc97e27d8d78
- Python Programming Tutorials

## Sentiment Classification

Sentiment classification is a type of text classification problem. People express different kinds of sentiment, such as praise, sarcasm, doubt, fear and dislike, in natural languages like English, Hindi, French or Nepali. The data set given below is a collection of movie reviews divided into 25,000 training data points and 25,000 test data points. The reviews carry one of two sentiment labels: positive or negative. This is a good first step into sentiment classification. You can also look up data sets with more than two classes.

**Field:** Natural Language Processing, Text Classification

**Data:** Sentiment Analysis

**Suggested Algorithms:** TF-IDF, Text Classification, Deep Natural Language Processing, Naive Bayes
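
The Naive Bayes option can be sketched in a few lines. The example below is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing and whitespace tokenization. The four training reviews are invented for illustration and the function names are hypothetical; the real data set would need proper tokenization and far more data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents is a list of (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training reviews.
reviews = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("a dull and terrible film", "negative"),
]

model = train_naive_bayes(reviews)
print(classify("a great and wonderful story", model))
print(classify("terrible and boring", model))
```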

**References:**

- http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf
- https://thesai.org/Downloads/Volume8No6/Paper_57-Sentiment_Analysis_using_Deep_Learning.pdf

## Hand-written Digit Recognition

Handwritten digit recognition is a common first step into computer vision. The **MNIST** (Modified National Institute of Standards and Technology) data set was first released in 1999 and has since been a benchmark for different classification algorithms. It consists of 60,000 training examples and 10,000 test examples in ten classes, one for each digit from 0 to 9. This can be the first step towards a larger system for optical character recognition, and many interesting applications can be built on top of such a system. It is also a good place to start with deep learning. Deep learning algorithms work very well for computer vision tasks and have achieved human-level (or higher) accuracy on different benchmarks. This should be a fun project where you learn several things at once.

Sample of MNIST Data set

**Field:** Computer vision

**Data:** MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges

**Suggested Algorithms:** Support Vector Machine (SVM), K-nearest neighbors, Convolutional Neural Networks

**References:**

- https://arxiv.org/pdf/1702.00723.pdf
- MNIST For ML Beginners
- Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras

# Clustering Problems

## Document Clustering for Topic Detection

Unsupervised machine learning methods like clustering can be used to automatically group similar documents together. In this problem, we have to cluster news articles into different categories. For this we will use the 20 Newsgroups data set, which consists of 20,000 news documents partitioned into 20 groups. Examples of group topics include science, technology, medicine, atheism, baseball, etc. This data set can be used for text classification as well as text clustering. We will use it for clustering and take the document labels as ground truth, i.e. if two documents share the same label, they should belong to the same cluster.

An Example of Text Clustering
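
The ground-truth criterion described above (two documents with the same label should land in the same cluster) can be turned into a score by counting document pairs on which the labels and the clustering agree. This pairwise agreement is known as the Rand index. The sketch below is a straightforward implementation on made-up labels; the function name and the toy inputs are illustrative.

```python
from itertools import combinations


def rand_index(true_labels, cluster_ids):
    """Fraction of document pairs on which labels and clusters agree."""
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_label = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_label == same_cluster:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Made-up labels for four documents and two candidate clusterings.
labels = ["science", "science", "baseball", "baseball"]
perfect = rand_index(labels, [0, 0, 1, 1])
imperfect = rand_index(labels, [0, 1, 1, 1])
print(perfect, imperfect)
```

The loop visits every pair, so it is quadratic in the number of documents. For the full 20,000 documents a library routine such as scikit-learn's `adjusted_rand_score`, which works from a contingency table and also corrects for chance agreement, is likely a better fit.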

**Field:** Natural Language Processing

**Data:** Home Page for 20 Newsgroups Data Set

**Suggested Algorithms:** Latent Dirichlet Allocation, K-Means Clustering

**References:**

- https://pdfs.semanticscholar.org/9860/487cd9e840d946b93457d11605be643e6d4c.pdf
- Clustering text documents using k-means — scikit-learn 0.19.1 documentation

## Grouping Visually Similar Images

Grouping visually similar images makes searching for images easier. In this problem we will automatically group similar images using clustering algorithms. For this purpose, we will use the INRIA Holidays data set, a collection of personal holiday pictures grouped into 500 image categories.


**Field:** Computer Vision

**Data:** Download datasets

**Suggested Algorithms:** K-Means Clustering, Convolutional Neural Network
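
A minimal k-means sketch is shown below, assuming each image has already been reduced to a fixed-length feature vector (for example, activations from a convolutional neural network). The 2-D vectors here are made-up stand-ins for such features, and the function name, iteration count and seed are arbitrary choices for the illustration.

```python
import math
import random


def kmeans(points, k, n_iters=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [
        min(range(k), key=lambda c: math.dist(point, centroids[c]))
        for point in points
    ]
    return centroids, assignments


# Hypothetical image feature vectors forming two well-separated groups.
image_features = [
    [0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]

centroids, assignments = kmeans(image_features, k=2)
print(assignments)
```

Cluster numbering depends on the random initialisation, so only the grouping itself is meaningful, not which group is called 0 or 1.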

**References:**

- http://www.keysers.net/daniel/files/it2003.pdf
- https://link.springer.com/content/pdf/10.1007/s10596-014-9459-2.pdf

# More Projects

For project ideas specifically related to natural language processing, take a look at: List of NLP Project Ideas (including Datasets)