This tutorial describes the important components of a learning algorithm: representation (what the model looks like), evaluation (how do we differentiate good models from bad ones), and optimization (what is our process for finding the good models among all the possible models).
Choosing a representation of a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called hypothesis space.
Example: In the case of linear regression, the representation of our model is a linear function. Regardless of what weights w we choose, our model will be a linear function.
Key considerations: Is the scenario you are trying to capture well represented by the model function? Is it overly restrictive? Is it overly flexible? For example, if the data has a quadratic trend, but we are trying to fit a linear function to it, we are being overly restrictive.
An evaluation or objective function (or cost function) is needed to distinguish good classifiers from bad ones.
Example: In the case of least-squares linear regression, the cost function was mean-squared error cost function.
Key considerations: Does your cost function capture the relative importance of different kinds of mistakes? For example, is being off by 0.3 for one data point and 0.1 for another data points better or worse than being off by 0.2 for both data points? Is a false positive as bad as a false negative?
The process or algorithm for finding the best model in hypothesis space.
Example: Gradient descent is an optimization algorithm that can be used for finding the optimal model for least-squares linear regression.
- How efficient is the optimization technique in practice?
- Does it always find the optimal solution? Is it possible for it to output sub-optimal solutions? If yes, how often does it happen in practice?
Note that the choosing any one component of the machine learning algorithm (say the optimization method) depends on our choices for the other components.
For example, gradient descent can only be applied if the cost function is differentiable, i.e. it is a smooth function (since we need to calculate the gradients).
Another more involved example is, assuming that we have a non-convex cost function, does gradient descent get stuck on local minima?
The table below lists a few examples of each of the components of a learning algorithm. It's okay if you haven't heard many of the names before, but see if you can spot ones you have heard of.
You will find it useful to come back to this article as you go through the journey of learning more machine learning algorithms. In particular, when you learn a new algorithm, ask yourself what representation does it use, what cost function does it use, and what is the optimization process.