Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
As part of data cleaning, we often need to transform data into a more usable form for data science and machine learning.
We will start with data transformation methods for numerical variables, and then move on to categorical variables.
Let's get started!
Set-up
We will need the Pandas and NumPy libraries for this tutorial, so let us load them at the beginning.
type=codeblock|id=trans_import|show_output=0|autocreate=datascience
import pandas as pd
import numpy as np
In this tutorial, we'll be looking at some tables. So let's also ask pandas to increase the display width to 120 characters, and the maximum number of columns it should display to 10:
type=codeblock|id=trans_displaywidth|autocreate=datascience|show_output=0|depends_on=trans_import
pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 10)
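To check that the display settings took effect, or to change them only for a short stretch of code, something like the following sketch may help. It uses pandas' `get_option` and `option_context`; the tiny `toy` table is made up purely for illustration and is not the tutorial's dataset.

```python
import pandas as pd

pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 10)

# read back the current values to confirm the settings
width = pd.get_option('display.width')
max_cols = pd.get_option('display.max_columns')
print(width, max_cols)

# hypothetical toy table, only for illustration
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# override an option temporarily; it reverts when the with-block ends
with pd.option_context('display.max_columns', 1):
    inside = pd.get_option('display.max_columns')
    print(toy)

after = pd.get_option('display.max_columns')
```

Using `option_context` keeps a one-off change from leaking into the rest of the notebook, which can be handy when only a single wide table needs different settings.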
Finally, we will be using the student_transformations dataset to illustrate the various concepts for this tutorial. Let's load the dataset:
type=codeblock|id=trans_load|autocreate=datascience|show_output=1|depends_on=trans_displaywidth
# load the dataset and see first few rows
df = load_commonlounge_dataset('student_transformations')
print(df.head())
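Since the tutorial treats numerical and categorical variables separately, it can be useful to see which columns fall into each group right after loading. The sketch below shows one possible way to do this with `select_dtypes`. The DataFrame and its column names are hypothetical stand-ins, not the actual contents of student_transformations.

```python
import pandas as pd

# hypothetical stand-in for the loaded dataset; column names are made up
df = pd.DataFrame({
    'hours_studied': [2.5, 4.0, 1.0],
    'score': [61, 78, 45],
    'grade': ['C', 'B', 'D'],
})

# columns with a numeric dtype (ints and floats)
numeric_cols = df.select_dtypes(include='number').columns.tolist()

# everything else, treated here as categorical
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that dtype alone is only a first guess: a column of integer codes may really be categorical, so it is worth checking the result against what the columns mean.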
As in the previous tutorial, let's start by importing the libraries, changing the pandas display options, and loading the dataset.