Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Correlation analysis is statistical evaluation method used to study the strength of relationship between two numerical variables. This type of analysis is useful when we want to check if there exist any positive or negative connections between the variables.
We will start by loading the wine_v2 , tips and questions_data datasets.
Correlation analysis can help us understand whether, and how strongly, a pair of variables are related.
In data science and machine learning, this can help us understand relationships between features/predictor variables and outcomes. It can also help us understand dependencies between different feature variables.
How strong is the correlation between mental stress and cardiac issues?
Is there a correlation between literacy rate and frequency of criminal activities?
This tutorial will help you learn the different tech...
Feature scaling is an important part of data pre-processing.
Often, numeric variables in a dataset have very different scales. For example, let's say we have a dataset which includes the area of a house (in square feet) and its corresponding price (in US dollars). Typically, the area of the house will be in the range 500 - 5000 square feet, but the price will range from $100,000 - $5,000,000. As you can see, the scale of the features are very different. In this case, the price is almost 1000x square feet area.
In this tutorial, we will first talk about how having all the variables be in a similar scale helps us. Then, we will talk about various methods to perform scaling.
In statistical modeling and machine learning, a categorical variable is a variable which can only have a fixed set of values. Some examples of categorical variables are nationality, size of clothes (small, medium, large), day of the week, genre of music, educational qualification (doctorate, graduate, diploma), etc.
Categorical Feature labels
As opposed to a continuous numerical variable such as height, age, and distance, the above variables are not intrinsically represented by continuous numbers. Instead they represented by labels. Each unique value in a categorical variable is known as a label.
For example, if the categorical variable was Size of clothing, the labels would be small, medium and large. For the catego...