Let us say you have a machine learning problem that you'd like to solve. You try a few machine learning algorithms, and they give okay results, say in the range 55%-65% accuracy. You'd like to do better.
In practice, the most important step is what is known as feature engineering. Feature engineering is the process of finding the optimal set of features, i.e. the inputs given to the machine learning model.
In fact, how the data is presented to the model highly influences the results. With great features, simpler algorithms can perform pretty well on a number of tasks. Feature engineering seeks the best representation of the data. That is, it deals with transforming the data into meaningful features which capture well the inherent structures.
A lot of machine learning in practice is feature engineering. It takes up most of the time (up to 80%!), since the technical details of machine learning algorithms are already taken care of by machine learning libraries.
Moreover, feature engineering does not appear in most research papers and machine learning books! Hence, for a person new to machine learning, it is often the most underrated concept.
In addition, feature engineering is domain-specific and therefore hard. It requires intuition, smart decisions, creativity and lots of trial and error. Mastering it takes practice.
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering. - Andrew Ng.
Toy example: Separating concentric circles
Let's say we have a 2D dataset for a classification problem (as shown on the right below). As you might remember, a number of machine learning algorithms only work when the dataset is linearly separable. Unfortunately, our data is not.
Data in the original two-dimensional input space (right) can be mapped to a three-dimensional space (left) via φ(x,y)=(x, y, x²+y²). An extra, engineered feature has been added to the data representation. Then, a linear decision boundary (green hyperplane) can be found which separates the data. The hyperplane when mapped back to the input space looks like a circle, even though the machine learning algorithm can only find linear decision boundaries.
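To make the mapping concrete, here is a minimal sketch, assuming scikit-learn and using make_circles as a stand-in for the dataset pictured above; the numbers it prints are illustrative, not results from the figure.

```python
# Minimal sketch: adding the engineered feature x^2 + y^2 makes the
# concentric circles linearly separable. Dataset is a stand-in (make_circles).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# Linear classifier on the raw 2D coordinates: close to chance level.
print(cross_val_score(LinearSVC(max_iter=10000), X, y, cv=5).mean())

# Engineered third feature: phi(x, y) = (x, y, x^2 + y^2).
X_mapped = np.column_stack([X, (X ** 2).sum(axis=1)])

# The same linear classifier now separates the two circles almost perfectly.
print(cross_val_score(LinearSVC(max_iter=10000), X_mapped, y, cv=5).mean())
```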
In general, we work with very many dimensions and much messier data, so this is just a toy example for illustration.
General Techniques in Feature Engineering
There are many ways to obtain engineered features - by combining raw attributes in tabular data, by choosing context-specific indicators in text data, by finding relevant structures in image data, etc. Some examples are discussed below.
- Binarization: Decomposing any type of attribute into two groups. For example, categorical attributes into two classes (blue/red/unknown into color/no color), or numerical quantities into two intervals (above/below X), with the special case of frequencies (positive/zero).
- Combining features by means of any kind of operation such as addition or product, like in polynomial linear regression. We can also combine categorical features, by treating a pair of features taken together as a single feature, i.e. feature C = (feature A, feature B).
- Quantization or binning: Transforming continuous numerical values into categories. The signal (weights, view counts) is decomposed into a finite number of intervals. Note that the special case of two intervals corresponds to binarization. Bins can have fixed width or be adaptive (e.g. using quantiles). This can be extended to any type of attribute; for example, temporal attributes in date-time format can be quantized into days, hours, parts of the day, etc. (A short code sketch of these techniques follows this list.)
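Below is a hedged sketch of binarization, feature combination and binning on a small made-up table; the column names, thresholds and bin edges are invented purely for illustration (pandas assumed).

```python
# Hypothetical tabular data illustrating the techniques listed above.
import pandas as pd

df = pd.DataFrame({
    "color":      ["blue", "red", "unknown", "blue"],
    "weight_kg":  [54.0, 71.5, 88.2, 63.3],
    "view_count": [0, 3, 120, 0],
    "signup":     pd.to_datetime(["2021-01-03 08:15", "2021-01-04 21:40",
                                  "2021-02-11 13:05", "2021-03-02 02:30"]),
})

# Binarization: categorical -> known color / no color; counts -> positive / zero.
df["has_color"] = (df["color"] != "unknown").astype(int)
df["was_viewed"] = (df["view_count"] > 0).astype(int)

# Combining features: treat a pair of attributes as a single categorical feature.
df["color_x_viewed"] = df["color"] + "_" + df["was_viewed"].astype(str)

# Binning: fixed-width bins and adaptive (quantile) bins for a numeric column.
df["weight_fixed_bin"] = pd.cut(df["weight_kg"], bins=3, labels=False)
df["weight_quantile_bin"] = pd.qcut(df["weight_kg"], q=2, labels=False)

# Quantizing a date-time attribute into coarser units (hour, part of the day).
df["signup_hour"] = df["signup"].dt.hour
df["signup_part_of_day"] = pd.cut(df["signup_hour"], bins=[0, 6, 12, 18, 24],
                                  labels=["night", "morning", "afternoon", "evening"],
                                  include_lowest=True)
print(df)
```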
The process of engineering features is usually iterative: applying any technique and subsequently evaluating the model to see if the new features helped.
Examples of Feature Engineering
#1. Predicting language
Suppose we are given some sentences along with the language each sentence is written in. For example, below are some sentences in English and French.
English:
- What’s your name?
- How’s the weather?
- What’s your favorite movie?

French:
- Comment vous appelez-vous?
- Quel temps fait-il?
- Quel est ton/votre film préféré?
Now, you are given a new sentence, such as Est-ce que vous avez des frères et sœurs? or Do you have siblings? and you'd like to predict the language.
In this case, there are a number of possible features to input into a machine learning algorithm. We could break up the sentences into individual words and use those as features, we could break them up into characters, we could use character bigrams, and so on. We could also use all of them.
Depending on which features we choose, we'll get algorithms with different accuracies and different run-times. Which features work best also depends on the size of the dataset and on whether or not each document mixes sentences in different languages.
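As a rough sketch (assuming scikit-learn, and treating the handful of sentences above as the whole training set, which is far too little data in practice), character bigrams can be extracted and fed to a simple classifier like this:

```python
# Character bigram counts as engineered features for language identification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "What's your name?", "How's the weather?", "What's your favorite movie?",
    "Comment vous appelez-vous?", "Quel temps fait-il?",
    "Quel est ton film préféré?",
]
labels = ["en", "en", "en", "fr", "fr", "fr"]

# The feature choice lives in the vectorizer: here, character bigrams.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    MultinomialNB(),
)
model.fit(sentences, labels)

print(model.predict(["Est-ce que vous avez des frères et sœurs?",
                     "Do you have siblings?"]))
```

Swapping the vectorizer for word tokens or longer character n-grams changes the feature set without touching the rest of the pipeline.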
#2. Fourier transforms for speech transcription
Note: For a person familiar with signal processing, this section might seem quite trivial.
Speech transcription is the problem of converting speech into written text. First of all, the input sound data must be transformed into numbers. The amplitude of the sound wave is measured at discrete, equally spaced points in time. This procedure, called sampling, outputs a long array of numerical values.
When this raw signal is fed to an ML model, finding regularities in the data is a difficult task. People speak differently and at varying speeds, which results in the same word being encoded by very different signals.
If the signal is processed using Fourier transforms, an important gain in accuracy can be achieved. Fourier analysis breaks one complex sound wave apart into multiple simple components. A sample sound wave looks as follows:
This sound wave can be decomposed into multiple sine waves of different frequencies. That is, it can be represented as the addition of different sine waves.
First, the samples in the raw signal are broken into short time intervals. Then, the Fourier transform is applied to each temporal segment. Below is the Fourier transform of one segment.
And the following shows the Fourier transforms of the successive segments attached together. Each segment is one column, and time moves forward from left to right. This is called a spectrogram.
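The following is a minimal sketch of that procedure using SciPy on a synthetic wave; a real pipeline would start from recorded speech, and the window lengths are just plausible choices.

```python
# Short-time Fourier analysis of a toy signal, producing a spectrogram.
import numpy as np
from scipy import signal

fs = 16000                      # sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)   # two seconds of "audio"

# A toy sound wave: two sine components plus noise, standing in for speech.
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
wave += 0.1 * np.random.randn(t.size)

# 25 ms windows with 10 ms hops: the Fourier transform is applied per segment.
freqs, times, spec = signal.spectrogram(wave, fs=fs,
                                        nperseg=int(0.025 * fs),
                                        noverlap=int(0.015 * fs))

# Each column of `spec` is the Fourier transform of one temporal segment.
print(spec.shape)   # (n_frequencies, n_segments)
```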
In a spectrogram, one can much more easily find patterns and tell different sounds apart. For example, different vowels such as 'a' and 'o' are characterized by different frequencies, and sounds that require closing the lips, such as 'p' and 'b', have characteristic patterns, and so on.
By feeding in this data representation instead of the raw signal, machine learning models are better able to recognize patterns. The best machine learning models for this task are recurrent neural networks and hierarchical hidden Markov models.
#3. Classification problem for the 2010 KDD Cup
The task consisted of predicting student test performance based on past behavior. The winners of the competition created millions of binary features using binarization and discretization techniques, and then applied simple linear models. Check the paper here.
#4. Binary classification with kernels
The kernel trick is a very important method of modifying the data representation. The idea is to introduce non-linearity in a machine learning model. In this case, the complex features are not calculated explicitly. By exploiting properties of the inner product, one can implicitly work in the space induced by a non-linear function φ, saving computational cost.
Pairwise inner products of the inputs are computed by means of a kernel function k: k(c₁, c₂) = φ(c₁)ᵀφ(c₂). Predictions can then be computed in terms of these inner products. The radial basis function (RBF) kernel is one of the most common kernel functions. By applying the kernel trick, simple linear methods such as SVM or logistic regression can perform much better on classification tasks. Recall that the SVM technique seeks a hyperplane separating the two classes, as described here. When linearly non-separable data is mapped to a feature space of higher dimension, the SVM has an easier time finding such a hyperplane.
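A minimal sketch of the same idea, again on synthetic concentric circles (the dataset and hyperparameters are illustrative): the linear-kernel SVM struggles, while the RBF-kernel SVM separates the classes without ever computing φ explicitly.

```python
# Linear SVM vs. RBF-kernel SVM on non-linearly separable toy data.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

print(cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())  # near chance
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())     # near perfect
```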
Feature selection vs feature extraction vs feature engineering vs feature learning
Feature selection is essential when creating the dataset. The subset of most relevant features can be automatically chosen by ranking feature scores or by trial & error over feature subsets. However, features must be created beforehand.
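For example, here is a minimal sketch of automatic feature selection with scikit-learn's SelectKBest; the dataset is just a stand-in, and any scored selector would do.

```python
# Rank existing features by a univariate score and keep the best k.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print(X.shape, "->", X_selected.shape)   # 30 features reduced to the 5 highest-scoring
```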
Feature extraction applies automatic methods (PCA, DSP methods, etc.) to construct new features, typically from large raw inputs such as pixels or words.
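Similarly, a hedged sketch of feature extraction, where PCA constructs a small set of new features from raw pixel intensities; the digits dataset is just an illustrative choice.

```python
# PCA automatically constructs new features from raw pixel values.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # raw features: 64 pixel intensities
X_extracted = PCA(n_components=10).fit_transform(X)
print(X.shape, "->", X_extracted.shape)   # 64 raw features -> 10 extracted ones
```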
Feature engineering (sometimes called feature construction) deals with the manual construction of features from raw data. This is where most of the time and thinking goes: extracting features that suit the underlying problem, which starts with understanding it.
The newer approach of feature learning tries to construct those features automatically; for example, deep learning can learn abstract representations of features.