Practices for Feature Engineering in 2022: From ‘raw data’ to Insights

The age of IoT has ushered in a data-driven approach to finding better solutions and guiding the decision-making process. According to Domo, we create 2.5 quintillion bytes of data every day.

Is all this data useful?
No. 

Can this data be used directly?
No. 

So, this is where feature engineering comes into play. 

image001

What is Feature Engineering? 

Think of feature engineering as a puzzle in which not every piece fits. There is no single, objectively correct solution: from a plethora of candidate features, we employ the set that produces the best results. A feature, in essence, is a piece of data that is relevant to the problem at hand and can contribute towards finding a better solution.

image003

A data engineer will rarely come across real-life data that is organized and structured to this extent. Before raw data can be consumed by a machine learning algorithm, used in business intelligence reporting, or employed for any other purpose, it needs to be converted into a structured format.

This process of transforming raw data into useful features is known as feature engineering.  

This brings us to the question: if there is no single correct solution, why should we go about this at all?

Necessity of feature engineering 

Since there is no single correct solution, we need empirical evidence that the proposed solution is the most suitable and relevant one. The selected feature set and the mathematical transformations applied to it contribute immensely to the reliability of the machine learning model. In fact, feeding a model poorly chosen features is a case of “garbage in, garbage out”: as the name suggests, it produces incorrect and unreliable results.

There are a few techniques that help avoid such situations. We will skim over some of them here and go into detail in a later article:

  • Handling missing values: No data is perfect, and missing values are almost always a problem. Before a dataset can be used, these values need to be either removed or imputed. Replacing them with the average of the observed values is one way to go; other methods exist, each with its own applicability and shortcomings.

image005

  • Outlier detection: Like missing values, outliers are common in datasets. They can be caused by a malfunctioning device, environmental swings, or an extremely rare occurrence. Either way, such data points may skew the results.

image006

  • Data transformation: Data transformations are another essential part of data engineering. As mentioned earlier, data is not necessarily in its most usable state by default and may need to be transformed. A log-based transformation often comes in handy for reducing skew and bringing right-skewed data closer to a normal distribution, but there are other options, depending on the data and the intent of the transformation.
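The three techniques above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a recommended pipeline: the sensor readings and income figures are hypothetical, the function names are made up for this example, and the 1.5 × IQR rule (Tukey's rule) is just one common way to flag outliers.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress large values; assumes x >= 0."""
    return [math.log1p(v) for v in values]


# Hypothetical sensor readings: None marks a missing value, 250.0 is a spike.
readings = [21.0, 22.5, None, 23.0, 22.0, 250.0, 21.5, None, 22.8]

# 1. Flag outliers among the observed readings.
observed = [v for v in readings if v is not None]
outliers = iqr_outliers(observed)

# 2. Treat flagged outliers as missing, then impute with the mean.
masked = [None if v in outliers else v for v in readings]
cleaned = impute_mean(masked)

# 3. Log-transform a separate, right-skewed (hypothetical) feature.
incomes = [20_000, 35_000, 50_000, 1_200_000]
log_incomes = log_transform(incomes)

print("outliers:", outliers)
print("cleaned:", [round(v, 2) for v in cleaned])
print("log incomes:", [round(v, 2) for v in log_incomes])
```

Outliers are flagged before imputing here so that the spike does not inflate the mean used to fill the gaps. That ordering is a design choice for this sketch; whether an outlier should be masked, capped, or kept depends on the data and the problem.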

Conclusion 

In a nutshell, these efforts make a ‘make-or-break’ difference to the solution that is to be built. Every valid transformation, every correctly selected feature, and every properly imputed missing value takes us one step closer to the optimal solution. Time-series prediction, for example, is quite sensitive to the imputation technique used, and a poor choice can introduce bias. An incorrect feature-engineering process steers us away from a reliable solution and lays waste to the efforts that follow.
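To illustrate how sensitive a time series can be to the choice of imputation, the sketch below fills the same gap in two ways: with the overall mean and with linear interpolation between the nearest observed neighbours. The series is hypothetical and the function names are made up for this example; linear interpolation is shown only as one possible alternative, not as the right choice for every series.

```python
import statistics


def impute_mean(series):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in series if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in series]


def impute_linear(series):
    """Fill gaps by linear interpolation between observed neighbours.

    Assumes the first and last points are observed.
    """
    filled = list(series)
    if not filled or filled[0] is None or filled[-1] is None:
        raise ValueError("first and last points must be observed")
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1          # last observed index before the gap
            end = i
            while filled[end] is None:
                end += 1           # first observed index after the gap
            step = (filled[end] - filled[start]) / (end - start)
            for k in range(start + 1, end):
                filled[k] = filled[start] + step * (k - start)
            i = end
        else:
            i += 1
    return filled


# Hypothetical steadily rising series with a two-point gap.
series = [10.0, 12.0, 14.0, None, None, 20.0, 22.0]

print("mean imputation:  ", impute_mean(series))
print("linear imputation:", impute_linear(series))
```

On this series, mean imputation fills both gaps with the same value and flattens the upward trend, whereas linear interpolation follows it. A model trained on the mean-imputed version would see a pattern that was never in the underlying signal.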
