Learn how to work with the pandas resample method, a cool way to handle time-based data
Time-based data is one of the most common data formats that you, as a data scientist, have probably stumbled upon. Whether in the form of historical features (for instance, customer data) or time series data, it’s pretty common to have to deal with timestamp columns in data pipelines.
If you work as a data scientist in 2022, it’s mandatory that you know how to work with pandas, one of the most powerful Python libraries for wrangling data. pandas is famous for its regular updates and new methods that fit a lot of data wrangling and manipulation tasks.
One of its most famous methods is resample, which lets you handle time-based data smoothly. In a nutshell, resample contains several features that help you tackle time-based grouping and aggregation in a really smooth way, improving speed and simplicity when working with datetime columns. In the past, you would probably have had to construct complicated (or even buggy) functions that tended to be hard to maintain. With resample, that’s completely gone, and you can play around with dates in pandas dataframes much more easily.
In this post, we’ll explore some cool things about this method and how easy it is to apply it — particularly when we want to create time based features for our data science algorithms.
Reading the Data
Let me start by loading this dataset into Python:
gallipoli_data = pd.read_csv('../Hourly Weather Data in Gallipoli (2008–2021).csv')
Let’s look at the first 10 rows of the dataset:
To work with resample, our data must meet two conditions:
- The column that we will be using to represent time must be of a datetime type.
- That column must go into the index of the dataframe.
We start by converting the DateTime column into a datetime type:
gallipoli_data['DateTime'] = pd.to_datetime(gallipoli_data['DateTime'])
And passing our column into the index of the dataframe:
gallipoli_data.index = gallipoli_data[‘DateTime’]
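Those two preparation steps can be checked end to end on a tiny synthetic frame. The column names below mirror the post, but the values are made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the Gallipoli dataframe (illustrative values)
df = pd.DataFrame({
    'DateTime': ['2008-01-02 09:00', '2008-01-02 10:00'],
    'Temperature': [5.0, 6.0],
})

# Condition 1: the time column must be a datetime type
df['DateTime'] = pd.to_datetime(df['DateTime'])

# Condition 2: that column must also be the index
df.index = df['DateTime']

print(df.index.dtype)
```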
Our dataframe is ready! Let’s also load the libraries needed to follow along this blog post:
import pandas as pd
import numpy as np
Having everything set up, let’s tackle the resample method.
Let’s start with a simple use case: imagine that we would like to get the average temperature in Gallipoli for every 3-hour interval. Although simple, if we tried to do this in base Python or with custom pandas code, it would be a bit cumbersome, as we would have to:
- create some type of column that increments every 3 hours;
- use a group by on that column;
- aggregate the temperature values for each group.
Although this is manageable with a couple of lines of code, a one-liner would be better, right? And what if that one-liner could also generalize to other time frames (weeks, months, quarters, or even minutes)? That’s exactly what resample is for! We can perform a quick time-based aggregation on the data:
gallipoli_data.Temperature.resample(rule = '3h').mean()
Here’s the output of the code above:
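For comparison, the manual bucket-then-group approach described above can be sketched on a small synthetic hourly series. The data and names here are made up for illustration; the full Gallipoli file isn’t needed to show the idea:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the Gallipoli data: 12 hourly temperature readings
idx = pd.date_range('2008-01-02', periods=12, freq='h')
temps = pd.Series(np.arange(12, dtype=float), index=idx, name='Temperature')

# Manual approach: floor each timestamp to its 3-hour bucket, then group by it
manual = temps.groupby(temps.index.floor('3h')).mean()

# resample one-liner: same buckets, same result
resampled = temps.resample('3h').mean()

print(manual.tolist(), resampled.tolist())
```

Both produce one mean per 3-hour bucket; resample simply hides the bucketing step behind the rule argument.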
How cool is that? The resample method contains a bunch of knobs we can turn, such as the rule or the aggregator function we are using. Can you guess what we need to tweak if we want to check the maximum sunshine duration in an interval of 3 hours? We just switch the aggregator:
gallipoli_data[‘Sunshine Duration’].resample(rule = ‘3h’).max()
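The aggregator isn’t limited to a single function, either. A small sketch with made-up sunshine values (illustrative only) shows how .agg returns several statistics per bucket in one call:

```python
import pandas as pd

# Six made-up hourly sunshine readings (minutes of sunshine per hour)
idx = pd.date_range('2008-01-02', periods=6, freq='h')
sunshine = pd.Series([0, 10, 60, 60, 25, 38], index=idx, name='Sunshine Duration')

# min, max, and mean of every 3-hour bucket in a single call
summary = sunshine.resample('3h').agg(['min', 'max', 'mean'])
print(summary)
```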
To make this even more clear and peek under the hood of resample, let’s see the evolution of Sunshine Duration for the 2nd of January in Gallipoli:
gallipoli_data[
    (gallipoli_data.DateTime >= '2008-01-02')
    & (gallipoli_data.DateTime <= '2008-01-03')
]
What resample is doing is grouping the data for every three hours and applying the following:
- From 9 a.m. until noon, the maximum Sunshine Duration is 60.
- From noon until 3 p.m., the maximum Sunshine Duration is 60.
- From 3 p.m. until 6 p.m., the maximum Sunshine Duration is 38.
This is reflected in the table we’ve seen above. Checking it again:
Notice that the interval is closed on the left side, meaning that the row for 2008-01-02 09:00:00 aggregates all the data from 9:00:00 a.m. until 11:59:59 a.m. If we want to change this behavior, we need to pass the closed argument:
gallipoli_data['Sunshine Duration'].resample(rule = '3h', closed = 'right').max()
In this case, the same row 2008-01-02 09:00:00 aggregates all the data from just after 9:00:00 a.m. up to and including 12:00:00 p.m. (noon).
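To see both behaviors side by side, here’s a sketch on a small synthetic series (the values are made up; only the closed argument differs between the two calls):

```python
import pandas as pd

# Made-up sunshine values from 09:00 to 15:00, one reading per hour
idx = pd.date_range('2008-01-02 09:00', periods=7, freq='h')
s = pd.Series([60, 60, 55, 60, 40, 38, 20], index=idx)

left = s.resample('3h').max()                   # default: closed='left'
right = s.resample('3h', closed='right').max()  # buckets include their right edge

print(left.tolist(), right.tolist())
```

Note that with closed='right' the bucket edges shift: each labeled row now excludes its own timestamp’s reading from the next bucket and includes the reading exactly on its right edge.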
Tweaking resample’s knobs is very easy. Another common argument that we may want to tweak is the time frame of the aggregation, for instance, aggregating our data by week or month. Is this possible? Let’s see, next.
Modifying the Rule Argument
With the rule argument, we’re not restricted to hourly aggregations! This is one of the coolest features of the resample function. For instance, if we would like a bi-weekly aggregation, we can do:
gallipoli_data.Temperature.resample(rule = '2w').mean()
The number before the w defines how many weeks we want to group together, and the w defines that we want a weekly time frame.
Can you guess which argument we need to pass to aggregate our data for every 4 months?
gallipoli_data.Temperature.resample(rule = '4m').mean()
The rule controls the time interval we want to group by. Some of the most famous rule parameters you can use are:
- min for minutes;
- h for hours;
- d for days;
- w for weeks;
- m for months;
- q for quarters;
- y for years.
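As a quick check of how different rules change the output, here’s a sketch on synthetic daily data (the values are invented; the range covers four full Monday-to-Sunday weeks):

```python
import pandas as pd

# Four full weeks of made-up daily readings, starting on a Monday
idx = pd.date_range('2022-01-03', periods=28, freq='D')
daily = pd.Series(range(28), index=idx, dtype=float)

weekly = daily.resample('W').mean()      # one bucket per calendar week
biweekly = daily.resample('2W').count()  # two-week buckets; readings per bucket

print(weekly.tolist())
```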
Filling in the Blanks
Another important feature of resample is that it can help you create a more complete dataframe by making sense of the gaps in the data. For instance, imagine we had an hourly sales dataset from a shop:
date_range = pd.date_range('1/2/2022', periods=24, freq='H')
sales = np.random.randint(100, 400, size=24)
sales_data = pd.DataFrame(
    sales,
    index = date_range,
    columns = ['Sales']
)
Let’s delete the row of the sales at 10 a.m. (imagining there really were no sales in this store during that hour):
sales_data = sales_data.drop(sales_data.iloc[10].name)
If we apply a resample to this dataframe, a new row with NaN will show up:
This is extremely handy in a lot of time-based scenarios. Normally, if we have a gap in our time series data, we want a row that represents that gap, and resample gives us a quick way to achieve it. In the example above, we can even add a fillna(0) to make our scenario of “lack of sales” at 10 a.m. even more obvious:
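Putting the whole gap-filling example together, here is a self-contained sketch with synthetic sales, where the 10 a.m. row is removed as above:

```python
import pandas as pd
import numpy as np

# Hourly synthetic sales for one day
rng = pd.date_range('1/2/2022', periods=24, freq='h')
sales_data = pd.DataFrame(
    np.random.randint(100, 400, size=24),
    index=rng,
    columns=['Sales'],
)

# Drop the 10 a.m. row to simulate a missing reading
sales_data = sales_data.drop(rng[10])

# resample restores the missing hour (as NaN); fillna(0) marks it as "no sales"
complete = sales_data.resample('h').mean().fillna(0)
print(complete.loc['2022-01-02 10:00:00'])
```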
Filling the gaps is a really common thing we may want to do when working with time-based data, not only when creating features, but also as a way to catch major bugs in our data pipelines.
Thank you for taking the time to read this post!
If you work with time-based data and have never used resample before, I really recommend that you study this method thoroughly, as it will save you a lot of headaches during your data wrangling process.
Whether we’re building time-based features or manipulating time-series data, using resample makes our code more efficient and clean. As we’ve discussed, this method is also handy for catching unexpected gaps in time-based data, so I hope you can use the knowledge in this post in your future data pipelines!
The dataset used in this post is under CC BY 4.0 and belongs to Meteoblue.