Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Correlation analysis is a statistical method used to measure the strength of the relationship between two numerical variables. This type of analysis is useful when we want to check whether there is any positive or negative association between the variables.
We will start by loading the wine_v2, tips and questions_data datasets.
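Before turning to those datasets, here is a minimal sketch of what a correlation computation looks like, using a small hypothetical sample (hours studied vs. exam score) rather than the tutorial's data:

```python
import numpy as np

# Hypothetical sample data: hours studied vs. exam score for 8 students
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

# Pearson correlation coefficient: +1 is a perfect positive relationship,
# -1 a perfect negative one, and 0 no linear relationship at all
r = np.corrcoef(hours, score)[0, 1]
print(round(r, 3))
```

Since scores rise steadily with hours studied, the coefficient here comes out very close to +1, indicating a strong positive relationship.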
End-to-End Project: Data Cleaning and Analysis [Under Construction]
In this project, we will bring together all the pieces we have learnt so far for data exploration, data cleaning and data analysis. This includes loading datasets, exploring numerical and categorical variables, visualizing data, handling missing values, checking for outliers, performing data transformations, etc.
Let's get started!
The student dataset
For this tutorial, we will use a dataset which has information about students admitted to an engineering college in India....
In the last couple of tutorials, we learned how to select various subsets of a DataFrame. In this tutorial, we will use these techniques to select a subset of the DataFrame and modify the selected data.
As in the previous tutorials, let's start by importing the libraries and loading the dataset.
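As a minimal sketch of the select-then-modify pattern, here is a small hypothetical DataFrame (a stand-in for the tutorial's dataset) where we use `.loc` to select a subset of rows and update them in place:

```python
import pandas as pd

# A small hypothetical student DataFrame, standing in for the tutorial's data
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Karan"],
    "score": [45, 82, 38, 91],
})

# Select rows with .loc and modify the selected data:
# give every student scoring below 40 a 5-mark grace bump
df.loc[df["score"] < 40, "score"] += 5

print(df)
```

Using `.loc` with a boolean mask both selects and assigns in one step, which avoids the chained-indexing pitfalls (`SettingWithCopyWarning`) that come from writing `df[df["score"] < 40]["score"] = ...`.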
A histogram graphically summarizes the distribution of numerical data. Typically, it consists of vertical rectangular bars that show the frequency of the data (along the y-axis) in successive class intervals of equal width (along the x-axis). The height of each bar is proportional to the frequency of the data in that interval.
Let’s take a sample data of heights (in cm) of 50 students in a class.
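To see how the binning works, here is a sketch using 50 hypothetical heights drawn from a seeded normal distribution (not the tutorial's actual sample). `numpy.histogram` computes the same counts that a plotted histogram's bars would show:

```python
import numpy as np

# 50 hypothetical student heights in cm, from a seeded normal distribution
rng = np.random.default_rng(0)
heights = rng.normal(loc=165, scale=8, size=50)

# Bin the data into 6 equal-width class intervals
counts, bin_edges = np.histogram(heights, bins=6)

# Each count is the frequency of heights falling in that interval;
# a text bar chart stands in for the plotted histogram here
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} - {right:6.1f} cm : {'#' * count} ({count})")
```

Plotting the same data with `plt.hist(heights, bins=6)` would draw one vertical bar per interval, with bar heights equal to these counts.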
Text Classification (Topic Categorization, Spam Filtering, etc.)
Text classification (or categorization) has long been in high demand and has only become more important with the increasing scale of text generated online. Moreover, differing contextual information across domains makes it challenging to improve the accuracy and performance of traditional approaches to text classification.
Some example applications of text classification include:
Assigning multiple topics to documents
Grouping of documents into a fixed number of predefined classes
Segregating the contextual details from a multi-domain corpus
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency and is a very common algorithm for transforming text into a meaningful numeric representation. The technique is widely used to extract features in various NLP applications. This article will help you understand the importance of TF-IDF, and how to compute and apply the algorithm in your applications.
Vector representation of Text
To use a machine learning algorithm or a statistical technique on any form of text, the text must first be transformed into some numeric or vector representation. This numeric representation should capture significant characteristics of the text. There are many such techniques, for example, occurrence, term frequency, TF-IDF, word co-occurrence matrices, word2vec and GloVe.
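As a minimal sketch of one of these techniques, here is a plain-Python TF-IDF computation on a toy three-document corpus. This uses the textbook formulation (raw count ratio for TF, unsmoothed log for IDF); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization on top of this idea:

```python
import math

# A toy corpus of three short "documents"
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of docs containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in 2 of the 3 docs, so its IDF discounts it heavily;
# "cat" appears in only 1 doc, so it scores higher in that document
print(round(tfidf("the", tokenized[0], tokenized), 3))
print(round(tfidf("cat", tokenized[0], tokenized), 3))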
Types of Learning Algorithms: Supervised, Unsupervised and Reinforcement Learning
There are three main types of learning algorithms in machine learning: supervised learning, unsupervised learning, and reinforcement learning.
Currently, most machine learning products use supervised learning. Here, we have a set of features or inputs X (for example, an image) and our model predicts a target or output variable y (for example, a caption for the image).
y = f(X)
In other words, our model learns a function that maps inputs to desired outputs. Features are independent variables and targets are the dependent variable.
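As a minimal sketch of "learning a function that maps inputs to outputs", here is a least-squares fit on hypothetical data generated from roughly y = 3x + 2. The data, the true relationship and the function name are all illustrative assumptions:

```python
import numpy as np

# Hypothetical supervised data: inputs X and noisy targets y, where y ~ 3x + 2
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 11.0, 13.9])

# "Learning" here means fitting the parameters of f by least squares
slope, intercept = np.polyfit(X, y, deg=1)

# The learned function f maps new inputs to predicted outputs
def f(x):
    return slope * x + intercept

print(round(f(5.0), 1))
```

The fitted slope and intercept land close to the true values (3 and 2), so the learned f generalizes to inputs it never saw during fitting.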
K-nearest neighbors (KNN) is one of the simplest Machine Learning algorithms. It is a supervised learning algorithm which can be used for both classification and regression.
Understanding the classification algorithm (illustration)
Let us understand this algorithm with a classification problem. For simplicity of visualization, we'll assume that our input data has 2 dimensions. We will also assume that it is a binary classification task, i.e. the target can take two possible labels - green and red.
Here's what a plot of our training data looks like.
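The setup above can be sketched directly in code. This is a minimal pure-Python KNN classifier over a tiny hypothetical 2-D training set with the two labels (green and red); a library implementation such as scikit-learn's `KNeighborsClassifier` does the same thing with more options:

```python
from collections import Counter
import math

# Toy 2-D training data with two labels, mirroring the illustration:
# a "green" cluster near (1.5, 1.5) and a "red" cluster near (5.5, 5)
train = [
    ((1.0, 1.0), "green"), ((1.5, 2.0), "green"), ((2.0, 1.5), "green"),
    ((5.0, 5.0), "red"),   ((5.5, 4.5), "red"),   ((6.0, 5.5), "red"),
]

def knn_predict(query, train, k=3):
    # Sort training points by Euclidean distance to the query point
    neighbors = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    # Majority vote among the k nearest labels
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

print(knn_predict((1.2, 1.8), train))  # a point near the green cluster
print(knn_predict((5.4, 5.2), train))  # a point near the red cluster
```

For a query near the green cluster, all k = 3 nearest neighbors are green, so the vote is unanimous; the same holds for a query near the red cluster. The choice of k controls how smooth the decision boundary is.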