Explore cases that show that curriculum learning benefits machine learning.

Offer hypotheses about when and why these benefits arise.

Explore the relation of curriculum learning to other machine learning approaches.

Experiments with convex criteria

Training a perceptron where some input data is irrelevant (not predictive of the target class).

Difficulty can be defined in terms of the number of irrelevant inputs or the margin from the separating hyperplane.

The curriculum-trained model outperforms the no-curriculum baseline.

Surprisingly, when difficulty is defined in terms of the number of irrelevant inputs, the anti-curriculum strategy also outperforms the no-curriculum strategy.
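The three orderings compared above (curriculum, anti-curriculum, no-curriculum) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names, the relevant_mask argument, and the choice to count only non-zero irrelevant inputs toward difficulty are assumptions made for the example.

```python
import numpy as np

def difficulty_by_irrelevant_inputs(X, relevant_mask):
    # Assumed difficulty score: number of non-zero irrelevant inputs per example.
    return np.count_nonzero(X[:, ~relevant_mask], axis=1)

def difficulty_by_margin(X, y, w_ref):
    # Assumed alternative score: a smaller margin from a reference
    # hyperplane w_ref means a harder example, so negate the margin.
    return -(y * (X @ w_ref)) / np.linalg.norm(w_ref)

def presentation_order(difficulty, strategy="curriculum", rng=None):
    # Return example indices in the order they are shown to the learner.
    order = np.argsort(difficulty, kind="stable")  # easiest first
    if strategy == "curriculum":
        return order
    if strategy == "anti-curriculum":
        return order[::-1]  # hardest first
    if strategy == "no-curriculum":
        rng = rng or np.random.default_rng()
        return rng.permutation(len(difficulty))  # random order
    raise ValueError(f"unknown strategy: {strategy}")

def train_perceptron(X, y, order):
    # One online pass of the classic perceptron update over the given order.
    w = np.zeros(X.shape[1])
    for i in order:
        if y[i] * (X[i] @ w) <= 0:  # misclassified or on the boundary
            w += y[i] * X[i]
    return w
```

Labels y are assumed to be in {-1, +1}. Only the order passed to train_perceptron changes between the three strategies; the learner and the data stay the same.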

Experiments on shape recognition with datasets having different variability in shapes

Standard (target) dataset - Images of rectangles, ellipses, and triangles.

Easy dataset - Images of squares, circles, and equilateral triangles.

Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the switch epoch).

For no-curriculum learning, the first epoch is the switch epoch.

As the switch epoch increases, the classification error decreases, with the best performance when the switch epoch is half the total number of epochs.

The paper does not report results for higher values of the switch epoch.
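The switch-epoch schedule described above can be sketched as below. The function names and the train_one_epoch callback are hypothetical, and epochs are assumed to be 0-indexed, so switch_epoch = 0 corresponds to the no-curriculum case.

```python
def dataset_for_epoch(epoch, switch_epoch):
    # Use the easy dataset before the switch epoch and the target dataset from it onward.
    return "easy" if epoch < switch_epoch else "target"

def train_with_switch(model, datasets, total_epochs, switch_epoch, train_one_epoch):
    # datasets is a dict with "easy" and "target" entries;
    # train_one_epoch(model, data) is a user-supplied training step (hypothetical).
    for epoch in range(total_epochs):
        data = datasets[dataset_for_epoch(epoch, switch_epoch)]
        train_one_epoch(model, data)
    return model
```

Setting switch_epoch to half of total_epochs gives the configuration reported as performing best.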

Experiments on language modeling

The standard dataset is the set of all possible 5-word windows of text from Wikipedia in which every word is among the 20000 most frequent words.

The easy dataset considers only those windows in which every word is among the 5000 most frequent words in the vocabulary.

Each word from the vocabulary is embedded into a d-dimensional feature space using a matrix W (to be learnt).

The model predicts the score of the next word, given a window of words.

The expected value of a ranking loss function is minimized to learn W.

The curriculum-based model overtakes the other model soon after switching to the target vocabulary, indicating that it quickly learns new words.
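A ranking loss of this kind can be sketched as a margin (hinge) comparison between the observed window and copies whose last word is replaced by each vocabulary word. The sketch below is illustrative only: the linear scorer u stands in for the paper's neural network, and the names window_score and ranking_loss are assumptions.

```python
import numpy as np

def window_score(window, W, u):
    # Concatenate the embeddings of the words in the window and apply
    # a linear scorer u (a stand-in for the scoring network).
    features = np.concatenate([W[i] for i in window])
    return float(features @ u)

def ranking_loss(window, W, u):
    # Average hinge loss over every vocabulary word substituted into the last
    # position: the true window should outscore each substitute by a margin of 1.
    # The true word itself is included and contributes exactly 1 to the sum.
    vocab_size = W.shape[0]
    true_score = window_score(window, W, u)
    total = 0.0
    for w in range(vocab_size):
        corrupted = list(window[:-1]) + [w]
        total += max(0.0, 1.0 - true_score + window_score(corrupted, W, u))
    return total / vocab_size
```

Here W has one row per vocabulary word and d columns, and u has length d times the window size. Restricting the rows of W to the most frequent words would give the easy-vocabulary version of the same loss.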

Curriculum as a continuation method

Continuation methods start with a smoothed objective function and gradually move to less smoothed functions.

Useful when the objective function is non-convex.

Consider a family of cost functions C_λ(θ) such that C_0(θ) can be easily optimized and C_1(θ) is the actual objective function.

Start with C_0(θ) and increase λ, keeping θ at a local minimum of C_λ(θ).

The idea is to move θ towards a dominant (if not global) minimum of C_1(θ).

Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.

The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
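One way to picture this formulation is to reweight the target distribution over training examples and renormalise it, so that easy examples dominate early and the target distribution is recovered when λ reaches 1. The sketch below assumes a hard threshold weight function and difficulty scores scaled to [0, 1]; both choices, and all function names, are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def curriculum_weights(difficulty, lam):
    # Assumed weight function: weight 1 for examples whose difficulty
    # (scaled to [0, 1]) is at most lam, and weight 0 otherwise.
    return (np.asarray(difficulty) <= lam).astype(float)

def curriculum_distribution(p_target, weights):
    # Training distribution proportional to weight times target probability.
    q = np.asarray(p_target, dtype=float) * np.asarray(weights, dtype=float)
    total = q.sum()
    if total == 0:
        raise ValueError("all examples have zero weight at this step")
    return q / total

def sample_batch(rng, q, batch_size):
    # Draw example indices according to the current training distribution.
    return rng.choice(len(q), size=batch_size, p=q)
```

Increasing lam from 0 to 1 only ever adds examples, so the training distribution widens step by step until it equals the target distribution.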

Advantages of Curriculum Learning

Faster training in the online setting, as the learner does not try to learn difficult examples before it is ready.

Guiding training towards better local minima in parameter space, which is especially useful for non-convex methods.

Relation to other machine learning approaches

Unsupervised preprocessing - Both have a regularizing effect and lower the generalization error for the same training error.

Active learning - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.

Boosting Algorithms - Both gradually emphasise difficult examples, but a curriculum starts with a focus on easier examples, whereas in boosting the training criterion does not change.

Transfer learning and Life-long learning - Initial tasks are used to guide the optimisation problem.

Criticism

Curriculum Learning is not well understood, making it difficult to define the curriculum.

In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.