So far in the course, we have learnt quite a bit about DataFrames. In particular, we learnt about using various boolean and arithmetic operations on DataFrame columns, and also about indexing to select and modify various subsets of a DataFrame.
In this tutorial, we will learn another way to operate on and modify a DataFrame: DataFrame methods such as apply() and applymap(). These methods let us apply a function over an entire DataFrame.
Let's get started!
As in the previous tutorials, let us load the Pandas and NumPy libraries at the beginning.
type=codeblock|id=pd_func_import|autocreate=datascience|show_output=0
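As a rough illustration of what these methods do, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions, not the tutorial's data). apply() passes whole columns or rows to the function, while applymap() works element by element. Recent Pandas versions rename applymap() to DataFrame.map(), so the sketch picks whichever one is available.

```python
import pandas as pd

# Hypothetical toy data; the tutorial's own dataset is not shown here.
df = pd.DataFrame({"a": [1, 4, 9], "b": [16, 25, 36]})

# apply() hands each column (default, axis=0) to the function as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# With axis=1, apply() hands each row to the function instead.
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise application: applymap() in older Pandas,
# DataFrame.map() in newer releases.
elementwise = getattr(df, "map", None) or df.applymap
squared = elementwise(lambda x: x * x)

print(col_range)
print(row_total)
print(squared)
```

Note that none of these calls change df itself; each returns a new object that can be assigned back if the modification should be kept.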
Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Correlation analysis is a statistical evaluation method used to study the strength of the relationship between two numerical variables. This type of analysis is useful when we want to check whether there is a positive or negative relationship between the variables.
We will start by loading the wine_v2, tips and questions_data datasets.
type=codeblock|id=load_data1|autocreate=datascience|show_output=0
Correlation analysis can help us understand whether, and how strongly, a pair of variables are related.
In data science and machine learning, this can help us understand relationships between features/predictor variables and outcomes. It can also help us understand dependencies between different feature variables.
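To make this concrete, here is a minimal sketch that computes Pearson correlation coefficients with Pandas. The data is invented for illustration and is not one of the tutorial's datasets; the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 8, 6, 5, 3],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df.corr()

# Correlation between one specific pair of variables.
r = df["hours_studied"].corr(df["exam_score"])

print(corr_matrix)
print(r)
```

A coefficient close to +1 indicates a strong positive relationship, close to -1 a strong negative one, and close to 0 little or no linear relationship.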
For example:
This tutorial will help you learn the different tech...
You have already learnt about the basics of the Pandas DataFrame, a 2-dimensional data structure supported by Pandas that looks and behaves like a table.
In this tutorial and the next one, we will learn how to select various subsets of the DataFrame. The Pandas library provides us with a number of flexible options to do this.
As in the first Pandas tutorial, let's start by importing the libraries and loading the dataset.
type=codeblock|id=pd_index_load_adult|autocreate=datascience|show_output=1
import pandas as pd
import numpy as np
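As a preview of what selecting subsets can look like, here is a minimal sketch on a small invented DataFrame. The column names, index labels and values are assumptions for illustration and are not taken from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data with string index labels.
df = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    },
    index=["p1", "p2", "p3", "p4"],
)

ages = df["age"]                       # a single column, returned as a Series
first_two = df.iloc[:2]                # rows selected by integer position
by_label = df.loc["p2":"p3", ["age"]]  # rows and columns selected by label
over_40 = df[df["age"] > 40]           # rows selected by a boolean condition

print(first_two)
print(by_label)
print(over_40)
```

One detail worth noticing: label slices with loc include both end points, whereas position slices with iloc exclude the stop position.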
Polynomial features are higher power polynomial terms of the original features, which are added to the feature space of a model.
Let us understand this with a few examples.
Suppose we have a dataset with features x_1, x_2 and target variable y. A multivariable linear regression model for this set of data would be:
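Assuming weights w_1 and w_2 and a bias term b (this notation is an assumption, since the text does not fix it), such a model can be written as:

```latex
y = w_1 x_1 + w_2 x_2 + b
```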
Polynomial features are higher-order terms of x_1 and x_2 that we can add to this model, e.g. x_1^2, x_1^3, x_2^2, etc.
Our new model would look like this:
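Assuming weights w_1 to w_5 and a bias b (assumed notation, not given in the text), one possible extended model is y = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1^3 + w_5 x_2^2 + b. The model stays linear in its weights; only the feature space grows. A minimal NumPy sketch of building those extra columns, using an invented feature matrix:

```python
import numpy as np

# Hypothetical feature matrix: column 0 is x_1, column 1 is x_2.
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 5.0]])
x1, x2 = X[:, 0], X[:, 1]

# Append x_1^2, x_1^3 and x_2^2 to the original two features.
X_poly = np.column_stack([x1, x2, x1**2, x1**3, x2**2])

print(X_poly)
```

Any linear regression routine can then be fitted on X_poly in place of X.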
Feature scaling is an important part of data pre-processing.
Often, numeric variables in a dataset have very different scales. For example, let's say we have a dataset which includes the area of a house (in square feet) and its corresponding price (in US dollars). Typically, the area of the house will be in the range 500 - 5000 square feet, but the price will range from $100,000 - $5,000,000. As you can see, the scales of the two features are very different. In this case, the price is up to roughly 1000 times the area in square feet.
In this tutorial, we will first talk about how having all the variables on a similar scale helps us. Then, we will talk about various methods to perform scaling.
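As a hedged preview, here is a minimal sketch of two commonly used scaling methods, min-max scaling and standardization, applied to invented area and price values. These two methods are an assumption about what "various methods" covers, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical house data in the ranges described above.
area = np.array([500.0, 1500.0, 3000.0, 5000.0])
price = np.array([100_000.0, 400_000.0, 1_500_000.0, 5_000_000.0])

def min_max_scale(x):
    """Rescale values linearly so they fall in the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Shift and rescale values to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

print(min_max_scale(area), min_max_scale(price))
print(standardize(area), standardize(price))
```

After either transformation, area and price occupy comparable numeric ranges even though their raw values differ by several orders of magnitude.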
In statistical modeling and machine learning, a categorical variable is a variable which can only have a fixed set of values. Some examples of categorical variables are nationality, size of clothes (small, medium, large), day of the week, genre of music, educational qualification (doctorate, graduate, diploma), etc.
Categorical Feature labels
As opposed to continuous numerical variables such as height, age, and distance, the above variables are not intrinsically represented by continuous numbers. Instead, they are represented by labels. Each unique value in a categorical variable is known as a label.
For example, if the categorical variable was Size of clothing, the labels would be small, medium and large. For the categorical feature Educational qualification, some of the possible labels would be doctorate, graduate, diploma, etc.
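A minimal sketch of how the labels of a categorical variable can be inspected in Pandas, using the clothing-size example. The sample values are invented, and storing the column with the "category" dtype is an illustrative choice rather than something the text prescribes.

```python
import pandas as pd

# Hypothetical clothing sizes stored as a categorical column.
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")

# The distinct labels of the categorical variable.
labels = sizes.cat.categories.tolist()

# Pandas also keeps an integer code for each label internally.
codes = sizes.cat.codes.tolist()

print(labels)
print(codes)
```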
In this tutorial, we will talk about
Both of these are extremely important concepts in machine learning.
Let's get started!
Before we make predictions using a machine learning model, we first estimate the parameters (such as weights and bias) of the model.
The dataset based on which we estimate the parameter...
In the last couple of tutorials, we learned how to select various subsets of a DataFrame. In this tutorial, we will use these techniques to select a subset of the DataFrame and modify the selected data.
As in the previous tutorials, let's start by importing the libraries and loading the dataset.
We will use the same dataset that we used in the Indexing and Slicing and the Criteria Based Selection tutorials.
type=codeblock|id=pd_modeindex_load_adult|autocreate=datascience|show_output=1
import pandas as pd
import numpy as np
# load the dataset
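To illustrate the idea of selecting a subset and then modifying it, here is a minimal sketch on a small invented DataFrame. The column names and values are assumptions and do not come from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration.
df = pd.DataFrame({"age": [39, 50, 38], "hours_per_week": [40, 13, 40]})

# Select rows with a boolean condition and one column by label,
# then assign a new value to exactly that selection.
df.loc[df["age"] > 38, "hours_per_week"] = 0

# Modify a single cell, addressed by row label and column label.
df.loc[0, "age"] = 40

print(df)
```

Doing the selection and the assignment in a single loc call, rather than chaining two separate indexing steps, helps ensure the change is written to the original DataFrame and not to a temporary copy.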