Feature Engineering is a vital part of any Machine Learning task, and it becomes even more important when the input consists of sequential data such as time series. Typically, these features are generated manually, which is not ideal. In this blog, we will look at an open-source Python package called tsfresh that we can use to generate hundreds of time-series features in an automated fashion. First, we will briefly explain Feature Engineering.
Once we are familiar with Feature Engineering, we will look at how we can use tsfresh to automate the process of generating time-series features. All the code used in this blog is available on the following GitHub repository:
What is Feature Engineering?
A Machine Learning feature is any measurable value that can be used as an input for a Machine Learning task. In the simplest terms, it can be considered a column of the input data to a Machine Learning model, where different observations represent the rows. For example, in the famous Iris dataset, where the goal is to predict the type of species, the input values of Sepal length, Sepal width, Petal length, and Petal width are called features. The task of the Machine Learning model is to predict the Species, given some feature values.
Figure 1: Features are input values that help a Machine Learning model better predict the Target Value
Feature Engineering, therefore, is the process of transforming the raw data into useful features that better characterize the data; thus, enabling the machine learning model to learn better from those features. An example of Feature Engineering on time-series sales data is given below. Here we have sales data over time, and we aim to predict future sales. We can use Feature Engineering to include additional data such as ‘Mean Sales Last year’ or ‘Sales on the same day last year.’ The main advantage of adding these features is to enable the Machine Learning model to better forecast future sales.
Engineered features that could be derived from the raw time-series sales data include:
• Mean Sales in last 7 days
• Max Sales in last 7 days
• Sales same date last year
• Sales same date last month
• Holiday data
For in-depth details, visit the two-part blog here on the end-to-end Feature Engineering process.
Now that we know what features are and why we need Feature Engineering, let’s look at how we can perform automated feature extraction for time-series data using tsfresh.
Feature Extraction using tsfresh in Python
‘tsfresh’ is an open-source Python package that automatically calculates hundreds of features from sequential data such as time series. tsfresh also includes methods to evaluate feature importance and assist in feature selection.
Figure 2: Image from the official tsfresh documentation illustrating feature extraction using tsfresh
In the above figure, we have sequential raw data (based on time). Using tsfresh we can extract features such as maximum, minimum, mean, median, number of peaks, etc. Once we have extracted these helpful features, we can use tsfresh or any other suitable feature selection method to reduce the feature set and only keep the most important features for machine learning.
Let’s look at the implementation of tsfresh in a Jupyter notebook using Python. First, we need to install the tsfresh module using pip. This can be done from the terminal or directly within the Jupyter notebook:
Next, we need to download sample data from the UCI Machine Learning Repository that we can use for our experiments. The documentation for the dataset is provided at:
The dataset represents Force and Torque measurements from sensors on robots. The dataset contains 88 samples represented by the ‘id’ column. The time column represents the sequence of readings.
As you can see, the dataframe has 1320 rows and 8 columns.
The ‘y’ column represents whether or not the sensor data corresponds to a robot failure. This is the target value, and our goal can be to classify the torque and force measurements as either a failure or not.
Next, we need to import the ‘extract_features’ method and use it to extract features from our dataset. We pass the data to ‘extract_features’ along with the column that represents the sequence of readings and the ‘id’ column that distinguishes the individual time series. In our case, the readings from each robot form one time series, identified by the ‘id’ column and sorted on the ‘time’ column.
In the above snippet, we can see that tsfresh has returned 4722 columns, covering time-series features for every numeric column across all the individual time series.
You can use the following code to identify all the features calculated by tsfresh.
Suppose we need to extract only a limited set of features for a single column, F_x. We can do that using a features dictionary as below:
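The post’s original ten features are not shown here, so as an illustrative assumption, this dictionary picks ten parameter-free tsfresh calculators for the `F_x` column (`None` means the calculator takes no parameters):

```python
# Settings dictionary: column name -> {calculator name: parameters}
fc_parameters = {
    "F_x": {
        "mean": None,
        "median": None,
        "maximum": None,
        "minimum": None,
        "standard_deviation": None,
        "variance": None,
        "sum_values": None,
        "abs_energy": None,
        "length": None,
        "kurtosis": None,
    }
}

print(len(fc_parameters["F_x"]))  # 10 features requested
```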
We can pass this dictionary to tsfresh to extract the corresponding features.
As seen above, we now have 10 columns, because we specified 10 features in our dictionary and requested them only for the F_x column. Listing the columns of the returned dataframe confirms which features were extracted.
This shows how easy it is to extract time-series features in an automated and customizable way using the tsfresh package.
To conclude, we have seen what Feature Engineering is and why it is important. We have also seen how easy it is to extract time-series features using tsfresh. It can easily be integrated into existing Machine Learning workflows, such as Scikit-learn pipelines, saving us precious coding and processing time. For further details on tsfresh, you can visit the official documentation at https://tsfresh.readthedocs.io/en/latest/index.html