Starting from an expected-reward objective augmented with γ-discounted entropy regularization, they formulate a new notion of softmax temporal consistency that the entropy-regularized optimal values and the optimal policy must jointly satisfy.
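Restating the single-step relation from the paper (Theorem 1), with τ the regularization temperature, s′ the next state reached from (s, a), and V*, π* the entropy-regularized optimal value and policy:

$$V^{*}(s) - \gamma V^{*}(s') \;=\; r(s,a) - \tau \log \pi^{*}(a \mid s), \quad \text{for all } (s, a).$$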
They then introduce a new RL algorithm called Path Consistency Learning (PCL) that extends this softmax temporal consistency to arbitrary (multi-step) trajectories. Pseudocode for the PCL algorithm is given in Section 3.5 of the paper.
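As a minimal sketch of the idea (not the paper's code), the multi-step consistency error over a sub-trajectory of length d can be written as a function of the value estimates, rewards, and policy log-probabilities along the path; the names `path_consistency_error`, `pcl_loss`, and their arguments are ours, and a fixed rollout length is assumed.

```python
import numpy as np

def path_consistency_error(values, rewards, log_pis, gamma=0.99, tau=0.01):
    """Soft consistency error C for one sub-trajectory s_i, a_i, ..., s_{i+d}.

    values:  shape (d + 1,) -- value estimates V(s_i), ..., V(s_{i+d})
    rewards: shape (d,)     -- r(s_{i+j}, a_{i+j}) for j = 0..d-1
    log_pis: shape (d,)     -- log pi(a_{i+j} | s_{i+j}) under the current policy
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    # Discounted sum of entropy-regularized rewards along the path.
    soft_return = np.sum(discounts * (rewards - tau * log_pis))
    # C = -V(s_i) + gamma^d * V(s_{i+d}) + soft_return; PCL drives C toward zero.
    return -values[0] + gamma ** d * values[-1] + soft_return

def pcl_loss(values, rewards, log_pis, gamma=0.99, tau=0.01):
    """Squared consistency error, minimized w.r.t. both policy and value parameters."""
    c = path_consistency_error(values, rewards, log_pis, gamma, tau)
    return 0.5 * c ** 2
```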
Unlike algorithms that rely on on-policy Qπ-values, PCL seamlessly combines on-policy and off-policy traces. Unlike algorithms based on hard-max consistency and Q◦-values, PCL generalizes easily to multi-step backups over arbitrary paths, while retaining proper regularization and a consistent mathematical justification (outlined in Section 3 and the appendix of the paper).
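To illustrate that point with the `pcl_loss` sketch above (toy, made-up numbers, not from the paper): the same squared-consistency objective is applied to a freshly collected on-policy sub-trajectory and to one drawn from a replay buffer, which is what lets a single PCL update consume both kinds of data.

```python
import numpy as np

# Hypothetical on-policy sub-trajectory: value estimates, rewards, log-probs.
on_policy = (np.array([1.2, 0.9, 0.7]),
             np.array([0.5, -0.1]),
             np.array([-0.3, -0.8]))
# Hypothetical replayed (off-policy) sub-trajectory, scored the same way.
replayed = (np.array([0.4, 0.6, 0.2]),
            np.array([0.0, 1.0]),
            np.array([-1.2, -0.4]))

total_loss = sum(pcl_loss(v, r, lp) for (v, r, lp) in (on_policy, replayed))
print(total_loss)
```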