In this tutorial, we will learn how we can handle datetime data with pandas.
We will also use our previous knowledge in indexing DataFrames to handle time-series data specifically.
We will first look at the basic data structures available for dealing with date and time in Pandas.
Timestamp- These refer to particular instance of time. Eg:
'2018-05-31 12:30:45'. This is similar to Python’s
datetime. It is based on NumPy’s data type
Period- These refer to standard time intervals such as a month, a week or a day. These are fixed-frequency intervals. This is based on the intervals of NumPy’s data type
Timedelta- These refer to exact time intervals such as an interval of 35.27 seconds. It is similar to Python’s
datetime.timedelta, but is based on NumPy’s data type
For this tutorial, we will be making use of some time series data. Specifically, we will be using the
The data is taken from ”Open Power System Data” and modified for this tutorial.
The file contains data about electric power consumption and generation in United Kingdom. The data is recorded on an hourly basis from 11th May 2018 04:00 to 31st May 2018 23:00
Attribution: Open Power System Data. 2018. Data Package Time series. Version 2018-06-30. https://doi.org/10.25832/time_series/2018-06-30
Variable descriptions and data types:
utc_timestamp: Start of time period in Coordinated Universal Time
Britain_hourly_load: Total load in United Kingdom (in MW)
Britain_solar: Solar power generation in United Kingdom (in MW)
Britain_wind: Wind power generation in United Kingdom (in MW)
Let’s load the
electricity dataset and get a glimpse of it using the
load_commonlounge_dataset("electricity") edata = pd.read_csv("/tmp/electricity.csv") print(edata.head())
We can see that the
utc_timestamp column contains date-time data.
Let us see the dtypes of the columns with the
We can see that the
utc_timestamp column has the dtype
object. As CSV files can only have numbers and strings, we have to explicitly convert the data to
We can convert the
utc_timestamp column to
Timestamp using the
to_datetime() function. The syntax is as follows:
var = pandas.to_datetime(arg)
arg here is the argument we pass, which we want to convert to
var returned, depends on the argument of
- For a scalar it returns a
- For a Series it returns a Series of dtype
datetime64(This allows Pandas to take advantage of vectorized operations over the NumPy arrays).
Let us convert the column
utc_timestamp from dtype
edata.utc_timestamp = pd.to_datetime(edata.utc_timestamp)
Let us check if the dtype has changed.
As we can see, it is now of dtype
To convert datetime data from string to datetime, Pandas has to figure out the format in which the datetime data is written in the input file.
By default, Pandas infers the format for each element separately. If all the elements in our column have the same format, then we can considerably speed up the conversion by explicitly specifying the format.
For our data, the format is
'%Y-%m-%d %H:%M:%S'. The specification for
format is shared with Python
strftime() functions. These functions are used for date time conversion in Python. The most commonly used codes are mentioned below:
We can specify the
format parameter with the following syntax:
series = pandas.to_datetime(arg, format='%Y-%m-%d %H:%M:%S')
We can also parse the datetime data using the
parse_dates parameter in
read_csv function. To do this, we pass the list of column names or column positions to parse as datetime to the
dataframe = pandas.read_csv("path_to_file", parse_date=[col1, col2, col3])
Pandas supports an index structure called
DatetimeIndex which is based on the
datetime64 datatype. This index allows us to use
Timestamp data to identify rows, or to use with
loc for slicing.
To set a column as Index, we will use the
set_index method from Pandas DataFrame.
set_index sets an existing column as the new index for the DataFrame. The argument is the column label of the column we want as the new index.
Let us do it for our DataFrame:
inplace parameter is
True, then the original DataFrame will itself be modified, instead of creating and returning a new DataFrame.
Let us have a look at our data with the
Finding trends and patterns over time is a very essential aspect in exploring and modelling time-series data. Some of the attributes from
DatetimeIndex help us do that.
Suppose we are interested in knowing the electricity consumption level at 12pm everyday. To do this, we can use the
hour attribute of
DatetimeIndex. This returns only the hours of the datetime objects as an integer index.
DatetimeIndex can be accessed with
DataFrame.index, and the hours can be accessed as
Let us look at what the attribute returns:
Let’s use this to select all the rows with hour 12pm
# select data for 12pm everyday subset = edata[edata.index.hour==12] # display first few results print(subset.head())
We could use other attributes like:
day: day of the month as an integer
weekday: day of the week as an integer (Monday is 0, Tuesday is 1, …, Sunday is 6)
month: month of the year as an integer
year: year as integer
We can also use
loc to slice data according to date time.
For example if we want all the data from 1pm to 11pm of 11th May 2018, we could get it as follows:
print(edata.loc['2018-05-11 13:00:00':'2018-05-11 23:00:00'])
Our current data is stored on an hourly basis. From time to time, we might need to get data according to different frequencies (daily, weekly, etc).
Changing the Frequency is called resampling. It could take two different courses:
- downsample: resample to lower frequency. For example calculating daily total or average consumption from the hourly consumption.
- upsample: resample to higher frequency. This is much rarer as the increased frequency will create extra data points in time for which we don’t have values. The values are then filled by different methods like forward fill or backward fill, similar to imputation techniques.
Resampling is done with the
resample() function. The syntax is:
The following are some of the arguments that can be passed to the
rule parameter, to specify the new frequency.
'M': Calendar month ends
'A': Year ends
Note: These strings map to
DataOffsetobjects, which denote duration of time, similar to
timedelta, but with standard calendar rules. For example a one month increment by
DataOffsetfrom 5th January will be 5th February and then 5th March, irrespective of whether the month contains 28, 29, 30 or 31 days.
For downsampling we can assign an aggregate function like
sum() to fill up the columns with appropriate aggregate values that we want.
For upsampling, we can assign a filling method for
Let us take an example of downsampling to daily frequency from our original hourly frequency, with the aggregator function
mean(). This will give us the daily mean electricity consumption levels, and wind and solar generation levels.
# downsampling to daily frequency resampled_df = edata.resample('D').mean() # display first few results print(resampled_df.head())
Similar to resampling, rolling windows aggregates data in each window with a function such as
However, unlike resampling where the time intervals don’t overlap, rolling windows overlap and “roll” along at the same frequency as the data. Thus the new DataFrame has the same frequency as the original one.
The syntax is:
window parameter decides how many rows to include in each bin.
- It could be an integer to specify exact number of rows, or
- It could be a frequency -
'M'denoting hourly, daily, monthly, etc
Let us see an example where we take a rolling window of 48 hours (since our data us hourly, that means window of 48 rows). We will use the aggregate function
mean() for each window and display rows 50 to 60 (as the first 47 rows will have
rolling_mean = edata.rolling(48).mean() print(rolling_mean.iloc[50:60,:])
- Data type of variables can be converted to
to_datetime()function. It converts scalars to
Timestampand Series to
read_csv(), allows us to convert a column to
DatetimeIndexis a special index structure made of
datetime64data type. It has helpful attributes like
- We can resample Time-series data using
resample(), with specific frequencies such as hourly, daily, monthly, etc.
rolling()function allows aggregate calculations on a rolling window of rows.