# Introduction

- *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty of the tasks.
- The motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
- Link to the paper.

# Contributions of the paper

- Explore cases which show that curriculum learning benefits machine learning.
- Offer hypotheses about when and why this happens.
- Explore the relation of curriculum learning to other machine learning approaches.

# Experiments with convex criteria

- Training a perceptron where some of the input data is irrelevant (not predictive of the target class).
- Difficulty can be defined in terms of the number of irrelevant samples or the margin from the separating hyperplane.
- The curriculum learning model outperforms the no-curriculum approach.
- Surprisingly, when difficulty is defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms the no-curriculum strategy.
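The margin-based notion of difficulty can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's experimental setup: the helper names (`margin`, `curriculum_order`, `train_perceptron`) are hypothetical, and the true hyperplane `w_true` is assumed to be known so that it can serve as a difficulty oracle.

```python
import math
import random


def margin(x, w_true):
    """Distance of x from the hyperplane w_true . x = 0."""
    dot = sum(wi * xi for wi, xi in zip(w_true, x))
    return abs(dot) / math.sqrt(sum(wi * wi for wi in w_true))


def curriculum_order(X, w_true, anti=False):
    """Indices sorted easy-to-hard (largest margin first).

    anti=True gives the anti-curriculum order (smallest margin first).
    """
    return sorted(range(len(X)), key=lambda i: margin(X[i], w_true), reverse=not anti)


def train_perceptron(X, y, order, epochs=1):
    """Classic perceptron updates, visiting examples in the given order."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for i in order:
            score = sum(wi * xi for wi, xi in zip(w, X[i]))
            if y[i] * score <= 0:  # misclassified or on the boundary
                w = [wi + y[i] * xi for wi, xi in zip(w, X[i])]
    return w


# Synthetic, linearly separable data (assumed for illustration only).
random.seed(0)
w_true = [1.0, -2.0]
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1 if x[0] * w_true[0] + x[1] * w_true[1] > 0 else -1 for x in X]

easy_first = curriculum_order(X, w_true)
w = train_perceptron(X, y, easy_first)
```

Passing `anti=True` or a shuffled index list to `train_perceptron` gives the anti-curriculum and no-curriculum baselines for comparison.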

# Experiments on shape recognition with datasets having different variability in shapes

- Standard (target) dataset - Images of rectangles, ellipses, and triangles.
- Easy dataset - Images of squares, circles, and equilateral triangles.
- Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the *switch epoch*).
- For no-curriculum learning, the first epoch is the *switch epoch*.
- As the *switch epoch* increases, the classification error comes down, with the best performance when the *switch epoch* is half the total number of epochs.
- The paper does not report results for higher values of the *switch epoch*.
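The switching schedule described above can be sketched as follows. The function names and the 1-indexed epoch convention are assumptions made for illustration; the convention is chosen so that a *switch epoch* of 1 corresponds to no curriculum. The `train_one_epoch` callback stands in for whatever gradient-descent step is actually used.

```python
def dataset_for_epoch(epoch, switch_epoch, easy_data, target_data):
    """Easy dataset before switch_epoch, target dataset from switch_epoch onward.

    Epochs are 1-indexed, so switch_epoch=1 means training on the target
    dataset from the start (no curriculum).
    """
    return easy_data if epoch < switch_epoch else target_data


def run_schedule(total_epochs, switch_epoch, easy_data, target_data, train_one_epoch):
    """Call train_one_epoch(data) once per epoch and record which dataset was used."""
    history = []
    for epoch in range(1, total_epochs + 1):
        data = dataset_for_epoch(epoch, switch_epoch, easy_data, target_data)
        train_one_epoch(data)
        history.append("easy" if data is easy_data else "target")
    return history


# Placeholder datasets: class names stand in for the actual images.
easy = ["square", "circle", "equilateral triangle"]
target = ["rectangle", "ellipse", "triangle"]

seen = []  # a stand-in "trainer" that just records what it was given
history = run_schedule(4, 3, easy, target, seen.append)
print(history)  # ['easy', 'easy', 'target', 'target']
```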

# Experiments on language modeling

- The standard dataset is the set of all possible windows of size 5 from the text of Wikipedia in which every word in the window is among the 20,000 most frequent words.
- The easy dataset considers only those windows in which every word is among the 5,000 most frequent words in the vocabulary.
- Each word from the vocabulary is embedded into a *d*-dimensional feature space using a matrix **W** (to be learnt).
- The model predicts the score of the next word, given a window of words.
- The expected value of a ranking loss function is minimized to learn **W**.
- The curriculum learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that the curriculum-based model quickly learns new words.
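The construction of the easy and standard datasets can be sketched as follows. This is a toy illustration: the helper names are hypothetical, the corpus is a single made-up sentence, and the vocabulary sizes and window size are shrunk so the example stays readable (the notes above use 5,000 and 20,000 words and windows of size 5).

```python
from collections import Counter


def top_k_vocab(tokens, k):
    """The k most frequent words in the token stream."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def windows_in_vocab(tokens, vocab, size=5):
    """All contiguous windows of `size` tokens whose words all lie in `vocab`."""
    return [
        tuple(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
        if all(tok in vocab for tok in tokens[i:i + size])
    ]


# Toy corpus (assumed); real experiments would tokenize Wikipedia text.
tokens = "the cat sat on the mat the cat ate".split()

easy_vocab = top_k_vocab(tokens, 2)    # stands in for the 5,000-word vocabulary
target_vocab = top_k_vocab(tokens, 6)  # stands in for the 20,000-word vocabulary

easy_windows = windows_in_vocab(tokens, easy_vocab, size=2)
target_windows = windows_in_vocab(tokens, target_vocab, size=2)
```

Because the easy vocabulary is a subset of the target vocabulary, every easy window is also a target window, so the curriculum only restricts which examples are seen first.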

# Curriculum as a continuation method

- Continuation methods start with a smoothed objective function and gradually move to a less smoothed function.
- Useful in the case where the objective function is non-convex.
- Consider a family of cost functions *C_λ(θ)* such that *C_0(θ)* can be easily optimized and *C_1(θ)* is the actual objective function.
- Start with *C_0(θ)* and increase λ, keeping θ at a local minimum of *C_λ(θ)*.
- The idea is to move θ towards a dominant (if not global) minimum of *C_1(θ)*.
- Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
- The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
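The continuation idea can be sketched on a one-dimensional toy problem. Everything specific here is an assumption made for illustration: the choice of *C_0* and *C_1*, the linear interpolation *C_λ = (1 - λ) C_0 + λ C_1*, the step size, and the number of stages. The notes above do not prescribe any of these.

```python
import math


def c0(theta):
    """Smoothed, convex surrogate (assumed): a plain quadratic."""
    return theta ** 2


def c1(theta):
    """Actual non-convex objective (assumed toy example)."""
    return theta ** 2 + 2.0 * math.sin(5.0 * theta)


def grad_c_lambda(theta, lam):
    """Gradient of C_lam = (1 - lam) * C_0 + lam * C_1 for the toy functions above."""
    return 2.0 * theta + lam * 10.0 * math.cos(5.0 * theta)


def descend(theta, lam, lr=0.01, steps=500):
    """Plain gradient descent on C_lam, starting from theta."""
    for _ in range(steps):
        theta -= lr * grad_c_lambda(theta, lam)
    return theta


def continuation(theta, num_stages=10):
    """Increase lam from 0 to 1, re-optimising theta from its previous value each time."""
    for k in range(num_stages + 1):
        theta = descend(theta, k / num_stages)
    return theta


theta_start = 2.0
theta_cont = continuation(theta_start)    # tracks a minimum as lam grows
theta_plain = descend(theta_start, 1.0)   # optimises C_1 directly from the same start

print(c1(theta_cont), c1(theta_plain))
```

On this toy objective, direct descent from `theta_start` settles in a nearby local minimum of *C_1*, while the continuation path ends in a lower one. This is a property of the chosen example and starting point, not a general guarantee.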

# Advantages of Curriculum Learning

- Faster training in the online setting, as the learner does not try to learn difficult examples before it is ready.
- Guiding training towards better local minima in parameter space, which is especially useful for non-convex methods.

# Relation to other machine learning approaches

- **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error.
- **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.
- **Boosting Algorithms** - Difficult examples are gradually emphasised, though the curriculum starts with a focus on easier examples and the training criteria do not change.
- **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem.

# Criticism

- Curriculum Learning is not well understood, making it difficult to define the curriculum.
- In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.