It has been quite a few days I have been working with Pandas and apparently I feel I have gotten quite good at it. (Quite a Braggard I know) So thought about adding a post about Pandas usage here. I intend to make this post quite practical and since I find the pandas syntax quite self explanatory, I won't be explaining much of the codes. Just the use cases and the code to achieve them.
We Start by importing the libraries that we will need to use.
import pandas as pd
# Read from csv data files# With Headerdf = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv")# Without Header. sep param to provide the delimiterdf = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv", header=None, sep= ",")# Reading from SQL Datasourceimport MySQLdbfrom pandas import DataFramefrom pandas.io.sql import read_sqldb = MySQLdb.connect(host="localhost", # your host, usually localhostuser="root", # your usernamepasswd="password", # your passworddb="dbname") # name of the data basequery = "SELECT * FROM tablename"data = read_sql(query, db)# Reading from ExcelFiledata = pd.read_excel(filename)
For now, we will be working with the file at http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv. The Sales Jan 2009 file contains some “sanitized” sales transactions during the month of January. If you want to work along you can download this file from that location.
df = pd.read_csv("/Users/ragarw5/Downloads/SalesJan2009.csv")
# top 5 rowsdf.head()# top 50 rowsdf.head(50)# last 5 rowsdf.tail()# last 50 rowsdf.tail(50)
columnnames = df.columns
Sometimes you want to change the column names:
df.columns = ['Transdate', 'Product', 'Price', 'PaymentType', 'Name','City', 'State', 'Country', 'AccountCreated', 'LastLogin','Latitude', 'Longitude']
Sometimes you only need to work with specific columns in a dataframe only. You can subset the columns in the dataframe using
newDf = df[['Product', 'Price', 'PaymentType', 'Name', 'City', 'State', 'Country']]
First thing i try is this.
newDf['Price'] = newDf['Price'].astype('int')
It gives error: ValueError: invalid literal for long() with base 10: '13,000'. That is you cannot cast a string with , to an int. To do that we first have to get rid of the comma. For that we use a particular lambda-apply functionality which lets us apply functions to each row in the data.
newDf['Price'] = newDf.apply(lambda x: int(x['Price'].replace(',', '')),axis=1)
# To get statistics of numerical columnsnewDf.describe()# To get maximum value of a column. When you take a single column you can think of it as a list and apply functions you would apply to a listmax(newDf['Price'])# no of rows in dataframelen(newDf)# Shape of DataframenewDf.shape
# Create a column Address containing City,State and Country. Simply concat the columns.newDf['Address'] = newDf['City'] +","+ newDf['State'] +","+ newDf['Country']# I like to use a function defined approach with lambda-apply as it gives me more flexibility and more options. Like if i want to create a column which is 1 if the price is greater than 1200 and 0 otherwise.def gt(x):if x>1200:return 1else:return 0newDf['Pricegt1200'] = newDf.apply(lambda x: gt(x['Price']),axis=1)
# Single condition: dataframe with all entries priced greater than 1500df_gt_1500 = newDf[newDf['Price']>1500]# Multiple conditions: AND - dataframe with all entries priced greater than 1500 and from LondonAnd_df = newDf[(newDf['Price']>1500) & (newDf['City']=='London')]# Multiple conditions: OR - dataframe with all entries priced greater than 1500 or from LondonOr_df = newDf[(newDf['Price']>1500) | (newDf['City']=='London')]# Multiple conditions: NOT - dataframe with all entries priced greater than 1500 or from London have to be excludedNot_df = newDf[~((newDf['Price']>1500) | (newDf['City']=='London'))]
# In the state column the state is abbreviated as 'TX'. We want the whole name 'Texas' in therenewDf.loc[newDf['State']=='TX','State'] = 'Texas'# When City is Monaco State is not given. You want to impute 'Monaco State' as state also.newDf.loc[newDf['City']=='Monaco','State'] = 'Monaco State'
One of the most used functionality. One simple example
# Find out the sum of transactions by a state. reset_index() is a function that resets the index of a dataframe. I apply this function ALWAYS whenever I do a groupby and you might think of it as a default syntax for groupby operationsimport numpy as npnewDf.groupby(['State']).aggregate(np.sum).reset_index()# You might get a few extra columns that you dont need. Just subset the columns in the dataframe. You could just chain the commands to subset for the columns you need.newDf.groupby(['State']).aggregate(np.sum).reset_index()[['State','Price']]# Find minimum transaction in each statenewDf.groupby(['State']).aggregate(np.min).reset_index()[['State','Price']]# You might want to groupby more than one columnnewDf.groupby(['State','City']).aggregate(np.sum).reset_index()[['State','City','Price']]
You have two dataframes df1 and df2 you need to concat. Means append one below the other you can do it using:
#Suppose in the start, you had two dataframes. One which contains city and price information:City_Price = newwDf[['City','Price']]#And another which contains 'City' and 'State' insformationCity_State = newDf[['City','State']].drop_duplicates(keep=False).reset_index()#You need to merge these datatframes on basis of city. You need to do:City_Price_State_df = pd.merge(City_Price,City_State,on=['City'],how='left')
# To Csv filenewDf.to_csv("NewDfData.csv",index=False)# To Excel Filefrom pandas import ExcelWriterwriter = ExcelWriter('NewDfData.xlsx')newDf.to_excel(writer,'Sheet1')writer.save()
from pandas.io import sqlimport MySQLdbdb = MySQLdb.connect(host="localhost", # your host, usually localhostuser="root", # your usernamepasswd="password", # your passworddb="dbname") # name of the data basenewDf.to_sql(con = db, name='tablename',if_exists='append',flavor='mysql', chunksize=10000,index=False)
Hope you found this post useful and worth your time. I tried to make this as simple as possible but You may always ask me or see the documentation for doubts.
If you have any more ideas on how to use Pandas or other use-cases, please suggest in the comments section.
Till then ciao!
- Intro to Pandas By Greg Rada What I have written is in a condensed form, If you want to get a detailed description visit Greg Rada's 3 posts series.
- Pandas Documentation
Link to original article: Some Basic Pandas Functions