Live note-taking from an AI meet-up series talk. Just about to start!
A note about Galvanize (the hosts) and their learning programs.
Why John started working on this topic: teaching robots to do new things. Example: surgery, tying knots in rope. The solution involved human demonstration (hand-holding the robot), and the robot generalizes from a single demonstration. [plays video] It generalizes to new positions of the rope. (This solution used traditional techniques.)
It was a brittle system that worked only under the right settings (green carpet). The robot keeps going through its motions even if it fails to grab the rope.
This was 2013, when good deep learning results started coming out for computer vision, and then speech recognition. This motivated him to try to do the same thing in robotics.
[plays video] Video is a documentary of a weaver bird making its nest by tying knots in grass, balancing on one leg on a tree branch. Still playing: the nest being completed. Teardrop-shaped nest with one hole in the front where you can go in and out. The male birds build the nests and the females pick the best one! The birds have a brain that's less than 5 grams.
What's the right kind of machine learning for not ending up with brittle systems? Reinforcement learning.
Introduction to reinforcement learning: you take actions in a world and affect the state of the world. The environment returns an observation and a reward.
Deep reinforcement learning: deep learning is used to approximate the various functions involved, such as the policy, the value function, or the model of the world's dynamics.
Example: in inventory management, the observations are current inventory levels, the actions are what to purchase, and the reward is the profit.
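To make the loop concrete, here's a toy sketch for the inventory example (my addition, not from the talk; the demand model, prices, and the naive policy are all invented for illustration):

```python
import numpy as np

# Toy RL loop for inventory management (all numbers invented):
# observation = current stock level, action = units to order,
# reward = profit from sales minus ordering and holding costs.
rng = np.random.default_rng(0)
stock, total_reward = 10, 0.0
for t in range(100):
    action = max(0, 15 - stock)          # naive policy: restock up to 15
    demand = rng.poisson(5)              # environment dynamics
    sold = min(stock + action, demand)
    reward = 2.0 * sold - 1.0 * action - 0.1 * (stock + action - sold)
    stock = stock + action - sold        # next observation
    total_reward += reward
print(total_reward)
```

An RL agent would learn the ordering policy from the reward signal instead of hard-coding it.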
Even traditional machine learning tasks often make sense to model as reinforcement learning problems. For example, when classifying a large image, your observation would be the current image window and the action would be where to look next.
Skipping: Relation between reinforcement learning and other machine learning problems.
What makes reinforcement learning hard?
- Your input data depends on your previous actions.
- Your rewards might be delayed.
Skipping: Should I use RL on my practical problem?
Recent success stories in deep RL
- Atari using deep Q-learning, policy gradients, DAgger
- Superhuman Go
- Robotic manipulation using guided policy search
- Robotic locomotion using policy gradients
- 3D games using policy gradients
Skipping: Technical formulation of RL problem
Skipping: Episodic settings description
Skipping: Parameterized policies
Turning RL problems into optimization problems
Derivative-free optimization: think of each episode as a data point mapping parameters to the episode's reward. Works embarrassingly well.
Cross-entropy method: sort of an evolutionary algorithm. Sample a batch of thetas from a Gaussian. Run the agent in the environment with each theta. Keep the best 20% of thetas. Fit a Gaussian distribution to this elite set. Repeat.
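A minimal sketch of that loop (my addition; `episode_reward` is an assumed black box that runs one episode with parameters `theta` and returns its total reward):

```python
import numpy as np

# Cross-entropy method sketch: sample thetas from a Gaussian, evaluate,
# keep the elite 20%, refit the Gaussian to the elite set, repeat.
def cross_entropy_method(episode_reward, dim, iters=50, batch=100, elite_frac=0.2):
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(batch * elite_frac)
    for _ in range(iters):
        thetas = mu + sigma * np.random.randn(batch, dim)  # sample candidates
        rewards = np.array([episode_reward(th) for th in thetas])
        elite = thetas[np.argsort(rewards)[-n_elite:]]     # best 20%
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```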
Policy Gradient methods: Overview
Collect a bunch of trajectories. Then either:
1. Make the good trajectories more probable.
2. Make the good actions more probable.
3. Push the actions towards good actions.
We'll talk about 1 and 2. Leading methods are in category 2.
Score Function Gradient Estimator: an estimator for the gradient of an expectation E[f(x)], where x is sampled from a distribution p(x; theta). I'll skip the algebra and cut to the intuition. f(x) measures how good a sample is, and the gradient sample is f(x) times grad_theta log p(x; theta). You move theta in the direction which gives more probability to the good samples and less to the bad ones.
This is back-propagation for reinforcement learning; a major breakthrough of the last 4-5 years.
The estimator is really noisy, which is where the engineering challenges are.
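A toy illustration of the estimator and its noisiness (my addition; the distribution and f are invented): maximize E[f(x)] over the mean of a Gaussian, using f(x) * grad log p(x; theta) as the gradient sample.

```python
import numpy as np

# Score function gradient estimator on a toy problem:
# x ~ N(theta, 1), f(x) = -(x - 3)^2, so the optimum is theta = 3.
# For this Gaussian, grad_theta log p(x; theta) = (x - theta).
rng = np.random.default_rng(0)
theta = 0.0
for step in range(200):
    x = theta + rng.standard_normal(100)  # batch of samples
    f = -(x - 3.0) ** 2                   # how good each sample is
    grad = np.mean(f * (x - theta))       # score function estimate
    theta += 0.01 * grad                  # noisy gradient ascent step
print(theta)                              # ends up near 3
```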
Score Function Gradient Estimator for policies: Use good trajectories (high reward) as supervised examples in classification / regression.
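A tiny runnable version of this idea (my addition, on an invented 2-armed bandit): a softmax policy whose log-probability gradient is weighted by the reward, so rewarded actions become more probable.

```python
import numpy as np

# REINFORCE-style update on a 2-armed bandit: arm 1 pays more on average.
rng = np.random.default_rng(0)
theta = np.zeros(2)                          # one logit per action
for step in range(500):
    p = np.exp(theta) / np.exp(theta).sum()  # softmax policy
    a = rng.choice(2, p=p)                   # sample an action
    r = rng.normal(loc=(0.0, 1.0)[a])        # arm 1 is the better arm
    grad_logp = -p
    grad_logp[a] += 1.0                      # grad of log pi(a; theta)
    theta += 0.05 * r * grad_logp            # reward-weighted step
print(p)                                     # mass shifts toward action 1
```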
Methods to decrease variance of estimator
- Using temporal structure: an action can only influence rewards that come after it, so figure out which action in the sequence caused the rewards to increase and credit only those. [Will elaborate on this soon]
- Introducing a baseline: things become more stable if the rewards are centered around zero, so subtract a baseline (here, the mean of the rewards) from the individual rewards.
- Discounts: if an action is far away in time from a reward, they probably didn't have much to do with each other, so down-weight distant rewards with a discount factor. (See the sketch below.)
This is essentially the state-of-the-art algorithm (apart from a couple more tricks for step sizes and some further variance reduction in the estimators).
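Here's the promised sketch of those variance-reduction tricks on one trajectory's rewards (my addition; in practice the baseline is usually a learned value function, the mean is used here to match the talk):

```python
import numpy as np

# Discounted reward-to-go per timestep, minus a mean baseline.
def advantages(rewards, gamma=0.99):
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):     # temporal structure: each
        running = rewards[t] + gamma * running  # action is credited only
        rtg[t] = running                        # with rewards after it
    return rtg - rtg.mean()                     # baseline: center at zero

print(advantages([0.0, 0.0, 1.0, 0.0, 2.0]))
```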
Demo: walking man in a physics simulation. Input = joint angles, velocities, and positions of body parts. Output = joint torques. Reward = move forward as fast as possible, and don't apply large torques (save energy). After 2000 iterations it figured out how to walk!
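The reward he described might look something like this (my hedged sketch; the exact terms and coefficients weren't shown):

```python
# Locomotion reward sketch: move forward fast, penalize large torques.
def walker_reward(forward_velocity, torques, torque_cost=1e-3):
    return forward_velocity - torque_cost * sum(u * u for u in torques)
```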
The same algorithm works on many different robot shapes (e.g., a four-legged robot) and tasks (e.g., standing up from a sitting position). Two weeks of training time!