A Practical Overview of Feature Engineering in Machine Learning and How to Automate It

In this blog post, I am going to talk about feature engineering in machine learning, which is fundamental but difficult. I will first examine the techniques we have for applying feature engineering and then discuss some recent work on automating the process.

1. Introduction

Feature engineering is probably the most time-consuming part of tuning a machine learning pipeline. It consists of pre-processing single features (e.g., scaling numerical features or encoding categorical features), exploring the data to find correlations (e.g., feature crossing), feature selection, and dimensionality reduction.

Feature engineering is very important to the final performance of the predictive pipeline; to a large extent it decides the predictive power. In general, the better the features we compose, the better the results we get. On Kaggle (a website for all kinds of machine learning competitions), most winners spend the bulk of their effort on feature engineering.

In this blog post, I will first talk about the basic methods used in feature engineering, and then review recent work on automating this process.

2. Methods for Feature Engineering

Basically, there are four different groups of methods for feature engineering:

  • Pre-processing single features, e.g., scaling, encoding and embedding
  • Combination of multiple features, e.g., feature crossing
  • Feature selection, e.g., decision tree-based feature selection
  • Dimensionality reduction, e.g., PCA

In this section, I will go over these four groups of methods in detail, and for each group I will discuss some widely-used techniques and their benefits, along with their implementations in scikit-learn.

2.1 Single Feature Processing

For single features, we first check whether there are missing values (if so, we need to impute them), then scale numerical features so that they have comparable magnitudes, and encode categorical features for later computation. For some specific features, we can also apply quantization-based methods to transform them into quantile ranges and discard redundant detail.

Imputation of Missing Values

Some features may contain missing values, so we have to apply some form of imputation. In sklearn, we can use sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None), where strategy specifies what value is used to replace the missing ones, e.g. mean, median, most_frequent, or constant.
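As a minimal sketch (on a made-up toy array), mean imputation replaces each missing entry with the mean of its column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry (made up for illustration).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
# The first column mean is (1 + 7) / 2 = 4, so the nan becomes 4.0.
```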

Scaling of Numerical Features

Since different numerical features often have scales that differ by orders of magnitude (e.g., the number of students at MIT versus the average number of subjects registered per student), the behavior of our predictive model can end up being dominated by the features with large scale and variance. Therefore, it is important to bring all numerical features to comparable scales.

There are two commonly-used scalers in sklearn:

  • sklearn.preprocessing.StandardScaler: standardizing features by removing the mean and scaling to unit variance, i.e., \(\frac{x - \mu}{\sigma}\)
  • sklearn.preprocessing.MinMaxScaler: transforming features by scaling each feature to a given range, e.g., to transform features into the range (0, 1), we can scale the features as \(\frac{X - X_{min}}{X_{max} - X_{min}}\).

There are also some other scalers. For example, sklearn.preprocessing.RobustScaler is more robust to outliers because it uses statistics such as the median and the interquartile range. Besides these, there is also sklearn.preprocessing.Normalizer, which rescales each sample to a unit vector.
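A minimal sketch of the two main scalers on a made-up matrix whose two columns differ in magnitude:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy matrix: the second column is several orders of magnitude larger (made up).
X = np.array([[1.0, 10000.0],
              [2.0, 20000.0],
              [3.0, 30000.0]])

# Zero mean and unit variance per column.
X_std = StandardScaler().fit_transform(X)

# Rescale each column into the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
```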

Encoding of Categorical Features

There are two main types of encodings in sklearn: ordinal encoding and one-hot encoding. Ordinal encoding maps each category to an integer between 0 and the number of categories minus 1; it is implemented by sklearn.preprocessing.OrdinalEncoder (for converting features) and sklearn.preprocessing.LabelEncoder (for converting labels).

One-hot encoding converts each feature value into a vector whose length equals the number of categories (one position per category), with a 1 at the position of the corresponding category and 0 elsewhere. Sklearn provides sklearn.preprocessing.OneHotEncoder (for converting features) and sklearn.preprocessing.LabelBinarizer (for converting labels).
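Here is a small sketch of both encoders on a made-up color feature:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Toy categorical feature (made up for illustration).
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# Each category becomes an integer in [0, n_categories - 1].
X_ordinal = OrdinalEncoder().fit_transform(X)

# Each category becomes a one-hot vector of length n_categories.
X_onehot = OneHotEncoder().fit_transform(X).toarray()
```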

Binarization and Quantization

Sometimes the exact value of a feature makes little difference once it is above some threshold (e.g., the GRE score for a graduate school application), so it can be useful to simply binarize the data. Sklearn provides an easy-to-use sklearn.preprocessing.Binarizer which binarizes data (sets feature values to 0 or 1) according to a given threshold.

Binarization is actually a special case of quantization. We can employ quantization to transform features so that they follow some specific distribution (e.g., a uniform or a normal distribution). Sklearn implements this in sklearn.preprocessing.QuantileTransformer; there are other transformers as well, e.g., sklearn.preprocessing.PowerTransformer.
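A small sketch with made-up GRE-like scores, binarizing at an assumed threshold of 320 and mapping the same scores to an approximately uniform distribution:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, QuantileTransformer

# Made-up GRE-like scores; the threshold 320 is an arbitrary assumption.
X = np.array([[310.0], [325.0], [300.0], [335.0]])

# 1 if the score is above 320, otherwise 0.
X_bin = Binarizer(threshold=320).fit_transform(X)

# Map the values to an approximately uniform distribution on [0, 1].
X_quant = QuantileTransformer(n_quantiles=4,
                              output_distribution="uniform").fit_transform(X)
```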

2.2 Combination of Multiple Features

Sometimes it is important to combine several related features: for example, if we have height and weight information, we can compute BMI as a new feature. However, this usually requires domain knowledge about which features are meaningful to combine; the techniques below are more generic.

Polynomial Features

An easy trick to extend linear regression to polynomial regression is to augment the data with polynomial features. For example, if the feature vector is \((X_1, X_2)\), we can augment it into \((X_1, X_2, {X_1}^2, X_1 X_2, {X_2}^2)\), so that a linear model can fit the data in the quadratic space. We can easily implement this using sklearn.preprocessing.PolynomialFeatures by specifying the degree (e.g., 2 in this example).
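A minimal sketch of the degree-2 expansion from the example above (the toy values are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy feature vectors (X1, X2), made up for illustration.
X = np.array([[2.0, 3.0],
              [1.0, 5.0]])

# Degree-2 expansion without the constant bias column:
# (X1, X2) -> (X1, X2, X1^2, X1*X2, X2^2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```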

Functional Transformations

It is also possible to apply more complex transformations, e.g., \(X_3 = \frac{X_1}{X_2}\). Sklearn provides sklearn.preprocessing.FunctionTransformer, which wraps a user-defined function that transforms the features.
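A minimal sketch that appends the ratio \(X_1 / X_2\) as a new column; the helper function add_ratio is made up for this example:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy matrix whose columns play the roles of X1 and X2 (made up).
X = np.array([[10.0, 2.0],
              [9.0, 3.0]])

def add_ratio(X):
    # Append X1 / X2 as a third column.
    return np.hstack([X, X[:, [0]] / X[:, [1]]])

X_new = FunctionTransformer(add_ratio).fit_transform(X)
```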

2.3 Feature Selection

Filtering

Filter-based methods score each feature with some metric that evaluates its importance to the problem at hand, and keep only the most important features. For example:

  • Variance, where sklearn.feature_selection.VarianceThreshold removes all low-variance features based on the given threshold;
  • Pearson correlation, where we pass the Pearson correlation (e.g., scipy.stats.pearsonr) as the score_func for sklearn.feature_selection.SelectKBest;
  • Chi-squared stats, where we pass the chi-squared stats (e.g., sklearn.feature_selection.chi2) as the score_func for sklearn.feature_selection.SelectKBest;
  • ANOVA F-value between label/feature, where we pass sklearn.feature_selection.f_classif (for classification) or sklearn.feature_selection.f_regression (for regression) as the score_func for sklearn.feature_selection.SelectKBest.

There are also some other score functions, e.g., mutual information (sklearn.feature_selection.mutual_info_classif or sklearn.feature_selection.mutual_info_regression).
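As a small sketch, the following keeps the two features with the highest ANOVA F-value on the iris dataset; chi2 or mutual information could be swapped in as score_func:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-value with respect to the label.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
```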

Wrapper Methods

One of the most widely-used methods is recursive feature elimination. Given a base model that can assign an importance to each feature (e.g., a decision tree), we train it recursively: in each round we remove the features with the smallest weights, until the desired number of features remains. This method is implemented by sklearn.feature_selection.RFE.
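A minimal sketch using a decision tree as the base model on the iris dataset (the choice of estimator and the target of 2 features are arbitrary assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Recursively drop the least important feature until only 2 remain,
# using the decision tree's feature importances for ranking.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=2)
X_selected = rfe.fit_transform(X, y)
```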

Embedded Methods

Embedded methods perform feature selection as part of training the predictive model itself. For example, we can use L1 or L2 regularization for feature selection (L1 in particular, since it drives coefficients to exactly zero). This is implemented by sklearn.feature_selection.SelectFromModel.
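A minimal sketch with L1-regularized logistic regression on the iris dataset (the estimator and regularization strength are arbitrary assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L1 regularization drives some coefficients to exactly zero;
# SelectFromModel keeps only the features with sufficiently large weights.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_selected = SelectFromModel(l1_model).fit_transform(X, y)
```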

2.4 Dimensionality Reduction

After all of the above preprocessing steps, we are essentially ready to train our first predictive model. However, if the number of features is too large (the so-called curse of dimensionality), training will be very slow and performance will generally suffer. Therefore, it is sometimes necessary to reduce the dimensionality. There are several methods we can use (see the short PCA example after the list):

  • PCA (Principal Component Analysis), this is implemented in sklearn.decomposition.PCA, and its kernel variant sklearn.decomposition.KernelPCA, sparse variant sklearn.decomposition.SparsePCA and online variant sklearn.decomposition.IncrementalPCA.
  • Truncated SVD (PCA is equivalent to truncated SVD applied to mean-centered data), this is implemented in sklearn.decomposition.TruncatedSVD;
  • Factor Analysis, this is implemented in sklearn.decomposition.FactorAnalysis;
  • Independent Component Analysis, this is implemented in sklearn.decomposition.FastICA;
  • Non-Negative Matrix Factorization, this is implemented in sklearn.decomposition.NMF;
  • Dictionary Learning, this is implemented in sklearn.decomposition.DictionaryLearning;
  • Agglomerative Clustering, this is implemented in sklearn.cluster.FeatureAgglomeration.
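Here is the short PCA example mentioned above, projecting the four iris features onto two principal components (the choice of dataset and n_components is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional iris features onto the top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# pca.explained_variance_ratio_ shows how much variance each component retains.
```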

3. Automatic Feature Engineering

Although deep learning has been employed for feature engineering on image, video and text data, it requires a huge volume of training instances and is therefore not suitable for small or medium-sized datasets. Also, it is difficult to find a good input representation for deep learning on ordinary tabular datasets (whereas data like images and video have a natural representation). Furthermore, the features learned by deep learning are difficult to explain and interpret, which prevents users from understanding the machine learning pipeline.

To this end, several other methods have been proposed to automate the tedious process of feature engineering. In general, they fall into three categories: (1) expansion-reduction; (2) greedy evolution; (3) learning-based transformation.

3.1 Rule-based Feature Expansion-Reduction

(Kanter & Veeramachaneni, 2015) propose a rule-based method, Deep Feature Synthesis, to expand the feature space for relational data. Basically, they have pre-defined functions for processing rows of a table (e.g., aggregations over columns such as MIN/MAX/SUM), and they follow the links (i.e., primary/foreign keys) to join tables together and run these pre-defined functions on the joined tables.

After obtaining the processed features, they further use truncated SVD for feature selection and dimensionality reduction. They build a random forest on top, and since the processed features are fixed, they can fine-tune the hyper-parameters of the random forest with Bayesian optimization.

3.2 Greedy Feature Evolution

(Katz et al., 2016) construct features greedily by evaluating the performance of a model trained with each candidate feature added. To reduce the expensive computation of training models, they employ learning-to-rank over the newly constructed features and only evaluate the most promising ones. However, since they still need to train models to expand the feature space, this category of methods is still considered time-consuming.

3.3 Learning-based Transformation

Supervised Learning

(Nargesian et al., 2017) propose a machine learning-based method for transforming features. Specifically, they have a pre-defined set of unary (e.g., log, square root) and binary (e.g., sum, subtraction) operations on features, and they train a classifier that predicts the most promising transformation for each feature, taking the Quantile Sketch Array of the feature(s) across all classes as input.

To train this classifier, they evaluate the model with the original feature and with the transformed feature, and if the improvement is beyond a threshold, they use this transformation as a positive sample and the other transformations as negative samples. One limitation is that their method only works for binary classification problems and only considers single features (it does not support operations over multiple features).

Reinforcement Learning

(Khurana et al., 2018) is essentially a follow-up to the above paper. They again define a set of transformations for features, but they abstract feature engineering as a traversal of a transformation graph: each node is a dataset, and an edge between two nodes is a transformation that turns the source dataset into the destination dataset by applying that transformation to all applicable features. The traversal starts from the initial dataset, and the ideal solution is the sequence of transformations that leads to the optimal node.

To guide the expansion of the transformation graph, we need a strategy that picks the “correct” transformation and the “correct” source node at each step. Many factors influence this decision, e.g., the node’s accuracy, the transformation’s average performance, the number of times the transformation has already been applied, the accuracy gain of the source node over its parent, the depth of the node, the remaining budget, the number of features in the node, and so on. They use reinforcement learning (more specifically, Q-learning with function approximation) to learn a good strategy. One limitation is that the Q-value function has to be learned separately for every dataset.

4. References

  1. Kanter, J. M., & Veeramachaneni, K. (2015). Deep feature synthesis: Towards automating data science endeavors. 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 1–10.
  2. Katz, G., Shin, E. C. R., & Song, D. (2016). Explorekit: Automatic feature generation and selection. 2016 IEEE 16th International Conference on Data Mining (ICDM), 979–984.
  3. Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E. B., & Turaga, D. S. (2017). Learning Feature Engineering for Classification. IJCAI, 2529–2535.
  4. Khurana, U., Samulowitz, H., & Turaga, D. (2018). Feature engineering for predictive modeling using reinforcement learning. Thirty-Second AAAI Conference on Artificial Intelligence.
Written on March 5, 2019