In this article, we will discuss some Data Science project ideas which should be fun to implement and also help you learn a lot. We will provide the datasets you will need, and also some reference implementations so you can get started. We will divide this project ideas list into three categories: Regression, Classification and Clustering.
The price of a house depends on many variables. When we buy a house, we look at features such as floor space, number of bedrooms, location, and construction quality, and pay a price accordingly. In this problem we will predict the price of residential homes in Ames, Iowa using 79 different features. The problem is hosted as a Kaggle competition, and a link to the data is given below.
- A gentle introduction to data science – Part I – akquinet AG – Blog
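To make the regression setup concrete, here is a minimal sketch using scikit-learn. It does not use the Ames data: the two features (`area`, `bedrooms`), the coefficients and the noise level are all made up for illustration, so treat it as a template for the workflow (split, fit, score) rather than a reference solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a housing table (NOT the Ames data):
# price grows with living area and number of bedrooms, plus noise.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(50, 250, n)        # hypothetical living area
bedrooms = rng.integers(1, 6, n)      # hypothetical bedroom count, 1 to 5
price = 50_000 + 100 * area + 10_000 * bedrooms + rng.normal(0, 1_000, n)

X = np.column_stack([area, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Fit an ordinary least squares model and score it on held-out rows.
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)      # coefficient of determination (R^2)
print(f"held-out R^2: {r2:.3f}")
print("learned coefficients:", model.coef_)
```

With the real data you would also need to encode the categorical columns and handle missing values before fitting.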
In this problem, you will predict daily electricity consumption.
We will be using the Reference Energy Disaggregation Data Set (REDD) provided by the Massachusetts Institute of Technology (MIT) for our task. The data set contains several weeks of power data for 6 different homes, and high-frequency current/voltage data for the main power supply of two of these homes. It is time series data, recorded for 18 days between April 2011 and June 2011 and sampled at intervals of 3 seconds.
An example of energy consumption over the course of a day for one of the houses in REDD
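Since the readings arrive every 3 seconds but the target is daily consumption, a first step is to aggregate the samples by day. The sketch below assumes each sample is a `(timestamp, watts)` pair and that a reading holds for one full sampling interval; the function name, the input layout and the example date are assumptions for illustration, not the REDD file format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_energy_kwh(samples, interval_s=3.0):
    """Sum power readings (watts) into energy per calendar day (kWh)."""
    joules_per_day = defaultdict(float)
    for timestamp, watts in samples:
        # Energy = power * time; each reading is assumed to last interval_s.
        joules_per_day[timestamp.date()] += watts * interval_s
    # 1 kWh = 3.6 million joules.
    return {day: j / 3.6e6 for day, j in sorted(joules_per_day.items())}


# Hypothetical example: a constant 1000 W load sampled every 3 s for one hour.
start = datetime(2011, 4, 18)
samples = [(start + timedelta(seconds=3 * i), 1000.0) for i in range(1200)]
daily = daily_energy_kwh(samples)
print(daily)
```

The resulting daily totals form the series you would then try to forecast.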
In this problem, the task is to predict how many comments a Facebook post will receive. The data set consists of about 40,000 data points with 54 features each. Some of the attributes are page popularity, post length, and post share count.
Field: Social Networks
In this problem, we will try to detect breast cancer using features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass (a sample image is given below). The data set given below was created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian from the University of Wisconsin. Features measured from the breast cell nuclei include area, perimeter, radius, and texture. There are 32 attributes per example (an ID, the diagnosis, and 30 computed features) and 569 examples in total. This is a small data set, so feature extraction and feature selection techniques will be useful, as will model averaging methods.
Sample of an Image from which Cell Nuclei Features were extracted
- Python Programming Tutorials
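As a starting point, here is a minimal sketch of a classification pipeline with feature selection. It relies on the copy of the Wisconsin diagnostic data that ships with scikit-learn (`load_breast_cancer`, 569 examples with 30 features). Keeping 10 features and using logistic regression are arbitrary choices made for illustration, not recommendations from the original data set authors.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the data set: no download needed.
X, y = load_breast_cancer(return_X_y=True)

# Scale features, keep the 10 with the highest ANOVA F-score, then classify.
# Putting every step in one pipeline keeps the selection inside each CV fold.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5)  # 5-fold accuracy
mean_accuracy = scores.mean()
print(f"mean cross-validated accuracy: {mean_accuracy:.3f}")
```

Try varying `k` or swapping the classifier to see how much the selected features matter on a data set this small.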
Sentiment classification is a type of text classification problem. People express different kinds of sentiment, such as praise, sarcasm, doubt, fear, or dislike, in natural languages like English, Hindi, French or Nepali. The data set given below is a collection of movie reviews divided into 25,000 training data points and 25,000 test data points. Each review carries one of two polar sentiment labels: positive or negative. This is a good first step into sentiment classification. You can also look up data sets with more than two classes.
Field: Natural Language Processing, Text Classification
Data: Sentiment Analysis
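A common baseline for this kind of task is a bag-of-words model. The sketch below trains TF-IDF features plus logistic regression on a handful of made-up sentences; the toy reviews and the label convention (1 = positive, 0 = negative) are assumptions for illustration only, and you would replace them with the real review texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the movie reviews.
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "I loved every minute",
    "brilliant, funny and touching",
    "a dull and boring film",
    "terrible acting and a weak story",
    "I hated every minute",
    "awful, slow and pointless",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# Turn each text into TF-IDF weights, then fit a linear classifier on them.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

predictions = classifier.predict(["great and wonderful", "boring and awful"])
print(predictions)
```

On the full data set, fit on the 25,000 training reviews and report accuracy on the 25,000 test reviews.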
Handwritten digit recognition is a classic first step into computer vision. The MNIST (Modified National Institute of Standards and Technology) data set was first released in 1999 and has since been a benchmark for classification algorithms. It consists of 60,000 training examples and 10,000 test examples, in ten classes, one for each digit from 0 to 9. This can be the first step towards a larger system for optical character recognition, and many interesting applications can be built on top of such a system. It is also a good place to start with deep learning. Deep learning algorithms work very well for computer vision tasks and have achieved human-level (or higher) accuracy on several benchmarks. This should be a fun project where you learn several things at once.
Sample of MNIST Data set
Field: Computer vision
- MNIST For ML Beginners
- Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras
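Before moving to MNIST and a convolutional network, it can help to run the whole train-and-evaluate loop on something small. The sketch below uses scikit-learn's bundled `load_digits` data (1,797 images of 8x8 pixels), which is a much smaller stand-in and not MNIST itself, together with a small fully connected network. The layer size and iteration count are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small bundled digits data set (8x8 images), used here instead of MNIST.
digits = load_digits()
X = digits.data / 16.0   # pixel values range from 0 to 16; rescale to [0, 1]
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One hidden layer of 64 units; a simple fully connected network.
network = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
network.fit(X_train, y_train)

accuracy = network.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

The same split, fit and score steps carry over when you switch to the full 28x28 MNIST images and a deep learning library.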
Unsupervised machine learning methods like clustering can be used to automatically place similar documents in the same group. In this problem, we have to cluster news articles into different categories. For this we will use the 20 Newsgroups data set, which consists of about 20,000 news documents partitioned into 20 groups. Examples of group topics include science, technology, medicine, atheism, and baseball. This data set can be used for text classification as well as text clustering. We will use it for clustering, taking the document labels as ground truth: if two documents share the same label, they should belong to the same cluster.
An Example of Text Clustering
Field: Natural Language Processing
- Clustering text documents using k-means — scikit-learn 0.19.1 documentation
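The evaluation rule described above (documents with the same label should share a cluster) can be scored by counting document pairs, which is known as the Rand index. Here is a minimal pure-Python sketch; the function name and the toy labels are made up for illustration.

```python
from itertools import combinations


def pairwise_agreement(labels, clusters):
    """Rand index: fraction of document pairs on which labels and clusters agree.

    A pair counts as an agreement when the two documents share a label and a
    cluster, or differ in both.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    pairs = list(combinations(range(len(labels)), 2))
    if not pairs:
        raise ValueError("need at least two documents")
    agreements = sum(
        (labels[i] == labels[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: four documents, two topics.
labels = ["science", "science", "baseball", "baseball"]
perfect = pairwise_agreement(labels, [1, 1, 0, 0])  # cluster ids are arbitrary
mixed = pairwise_agreement(labels, [0, 1, 0, 1])    # each cluster mixes topics
print(perfect, mixed)
```

Cluster ids do not need to match label names, only the grouping matters. For real experiments, scikit-learn's `adjusted_rand_score` provides a chance-corrected variant of this measure.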
Grouping visually similar images makes searching for images easier. In this problem we will automatically group similar images using clustering algorithms. For this purpose, we will use the INRIA Holidays data set, a collection of personal holiday pictures grouped into 500 image categories.
Visually Similar Images
Field: Computer Vision
Data: Download datasets
For project ideas specifically related to natural language processing, take a look at: List of NLP Project Ideas (including Datasets)