Deep Reinforcement Learning: life doesn’t give you its MDP; life is uncertain.
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error, receiving rewards (positive or negative) as their only feedback. In AI, and particularly in reinforcement learning (RL), a control task is a problem in which an agent must make sequential decisions to achieve a goal. These tasks are often modeled as Markov Decision Processes (MDPs) or Partially Observable MDPs (POMDPs). The Markov property implies that our agent needs only the current state to decide which action to take, not the full history of states and actions it took before.
- state, action, reward and next state
- S0, A0, R1, S1
- The agent’s goal is to maximize its cumulative reward, called the expected return (or simply the return).
- This is because RL is based on the reward hypothesis: all goals can be described as the maximization of the expected cumulative reward.
- The main goal of reinforcement learning is to find the optimal policy π* that maximizes the expected cumulative reward.
- State s: a complete description of the state of the world (there is no hidden information); this is what we have in a fully observed environment.
- Observation o: a partial description of the state; this is what we have in a partially observed environment.
The Action space is the set of all possible actions in an environment. The actions can come from a discrete or continuous space.
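To make this loop concrete, here is a minimal sketch of the agent-environment interaction, assuming the `gymnasium` package and its standard CartPole-v1 environment (a random action stands in for the agent’s policy); it also accumulates the discounted return along the way:

```python
# Minimal sketch of the RL loop: S0 -> A0 -> R1, S1 -> ...
# Assumes `gymnasium` is installed; CartPole-v1 is just a convenient stand-in.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)              # S0
gamma, t, discounted_return, done = 0.99, 0, 0.0, False

while not done:
    action = env.action_space.sample()       # At (random policy for illustration)
    next_state, reward, terminated, truncated, info = env.step(action)  # R(t+1), S(t+1)
    discounted_return += (gamma ** t) * reward
    state = next_state
    done = terminated or truncated
    t += 1

print(f"Discounted return of this episode: {discounted_return:.2f}")
env.close()
```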
- Episodic tasks: there is a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states.
- Continuing tasks: tasks that continue forever (no terminal state). Here, the agent must learn how to choose the best actions while simultaneously interacting with the environment.
- Exploration means trying random actions in order to find more information about the environment.
- Exploitation means using known information to maximize the reward.
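A common way to balance exploration and exploitation is ε-greedy action selection. The sketch below is illustrative and assumes a tabular action-value array `Q` indexed by state and action (not any specific library API):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon, explore (random action); otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploration: try something random
    return int(np.argmax(Q[state]))           # exploitation: best known action
```

Decaying ε over training is a common refinement: explore a lot at first, exploit more as the estimates improve.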
The Policy π: the agent’s brain. This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find π* through training.
- Policy-based methods (directly): we learn a policy function directly. We aim to optimize the policy itself, without using a value function.
1- Deterministic
2- Stochastic (a sketch contrasting the two follows this list)
- Subclasses:
- Policy-Gradient methods: optimize the policy directly by estimating the weights of the optimal policy using gradient ascent
- REINFORCE:
- uses Monte-Carlo sampling to estimate the return (we use an entire episode to calculate it), which gives significant variance in the policy-gradient estimate.
- This Monte-Carlo variance leads to slower training, since we need a lot of samples to mitigate it.
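As a concrete illustration of the deterministic/stochastic distinction above, here is a small sketch assuming PyTorch; the network shape and function names are illustrative, not taken from any particular implementation:

```python
import torch
import torch.nn as nn

# A tiny policy network: 4-dimensional observation -> 2 action logits (illustrative sizes).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def deterministic_policy(state):
    """pi(s) = a: always pick the highest-scoring action."""
    logits = policy_net(state)
    return torch.argmax(logits, dim=-1)

def stochastic_policy(state):
    """pi(a|s): sample an action from a probability distribution over actions."""
    logits = policy_net(state)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action)      # the log-probability is what REINFORCE uses
```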
- Value-based methods (indirectly): we learn a value function that maps a state to the expected value of being in that state. The idea is that an optimal value function leads to an optimal policy (see the tabular sketch after this list).
- state-value function
- action-value function
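To make the “an optimal value function leads to an optimal policy” idea concrete, here is a tabular sketch (sizes and names are illustrative): acting greedily with respect to the action-value function gives the policy, and the state value is just the best action value:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # action-value function Q(s, a)

def greedy_policy(state):
    """Derive the policy from Q: act greedily with respect to the action values."""
    return int(np.argmax(Q[state]))

def state_value(state):
    """State-value under the greedy policy: V(s) = max_a Q(s, a)."""
    return float(np.max(Q[state]))
```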
- Actor-critic method (hybrid architecture), which is a combination of value-based and policy-based methods.
- We learn two function approximations (two neural networks): 1- a policy that controls how our agent acts, and 2- a value function that assists the policy update by measuring how good the taken action is (a minimal sketch follows this list).
- We stabilize training by reducing the variance, using:
- An Actor that controls how our agent behaves (Policy-Based method)
- A Critic that measures how good the taken action is (Value-Based method)
- Algorithms:
- Advantage Actor Critic (A2C)
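Here is a minimal sketch of the actor-critic idea, assuming PyTorch (network sizes, names, and the one-step advantage form are illustrative choices, not a reference implementation): the critic’s value estimate forms the advantage that scales the actor’s policy-gradient update:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # policy (Actor)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # value function (Critic)

def a2c_losses(state, action, reward, next_state, done, gamma=0.99):
    """One-step advantage actor-critic losses for a batch of transitions (sketch).

    `done` is a 0/1 float tensor: 1.0 where next_state is terminal.
    """
    value = critic(state).squeeze(-1)
    next_value = critic(next_state).squeeze(-1).detach()
    td_target = reward + gamma * next_value * (1.0 - done)
    advantage = (td_target - value).detach()            # how much better than expected

    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -(dist.log_prob(action) * advantage)   # policy-gradient term
    critic_loss = (td_target - value).pow(2)            # value-regression term
    return actor_loss.mean(), critic_loss.mean()
```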
- Monte Carlo vs Temporal Difference Learning:
- Monte Carlo uses an entire episode of experience before learning
- Temporal Difference (TD) uses only a step (St, At, Rt+1, St+1) to learn.
- TD (Temporal Difference) methods: Temporal Difference learning combines ideas from Monte Carlo methods (learning from experience) and Dynamic Programming (bootstrapping). Tabular update sketches for SARSA and Q-learning follow this list:
- TD learning, TD(0)/One-Step TD, TD(λ)/Forward View
- SARSA (On-Policy)
- Expected SARSA
- Q-Learning (Off-Policy)
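The tabular SARSA and Q-learning updates mentioned above, written out as a sketch (the array size and hyperparameters are illustrative, and both updates assume a non-terminal next state):

```python
import numpy as np

Q = np.zeros((16, 4))          # tabular action-values (illustrative size)
alpha, gamma = 0.1, 0.99       # learning rate and discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the policy actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    """Off-policy: bootstrap from the greedy action, whatever we actually do next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference is the bootstrap term, which is exactly the on-policy vs off-policy distinction.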
- Value-Based Methods: NFQ, DQN (Deep Q-Network), Double DQN (DDQN), PER
- Policy-Based Methods: REINFORCE (Monte-Carlo policy-gradient), VPG, GAE, Actor-Critic
- Advanced Actor-Critic Methods: A3C (Asynchronous Advantage Actor-Critic), A2C (Advantage Actor-Critic), DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed DDPG), SAC (Soft Actor-Critic), PPO
Policy-gradient methods, which we’re going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time on-policy, since for each update we only use data (trajectories) collected by our most recent version of πθ.
The difference between these two methods lies in how we optimize the parameter θ:
- In policy-based methods, we search directly for the optimal policy. We can optimize the parameter θ indirectly by maximizing a local approximation of the objective function, with techniques like hill climbing, simulated annealing, or evolution strategies.
- In policy-gradient methods, because they are a subclass of policy-based methods, we also search directly for the optimal policy. But we optimize the parameter θ directly by performing gradient ascent on the objective function J(θ).
- The objective function gives us the performance of the agent: given a trajectory (a state-action sequence that, contrary to an episode, does not include the rewards), it outputs the expected cumulative reward. A REINFORCE-style update sketch follows the list of methods below.
- Cross-Entropy Method (CEM)
- Evolutionary Strategies
- Covariance Matrix Adaptation (CMA-ES)
- REINFORCE (Monte Carlo Policy Gradient)
- Natural Policy Gradient
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
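As a concrete instance of gradient ascent on J(θ), here is a REINFORCE-style update sketch, assuming PyTorch (function and variable names are illustrative). It uses the Monte-Carlo returns of one complete episode, which is exactly where the high variance discussed earlier comes from:

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single complete episode (sketch).

    log_probs: list of log pi_theta(a_t | s_t) tensors collected during the episode
    rewards:   list of scalar rewards R_{t+1}
    """
    # Monte-Carlo returns G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing the returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Gradient ascent on J(theta) == gradient descent on -J(theta).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```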
- Monte Carlo (MC) Methods
- First-Visit MC
- Every-Visit MC
- Temporal-Difference (TD) Methods
- TD(0)
- TD(λ)
- SARSA (On-Policy TD Control)
- Q-learning (Off-Policy TD Control)
- Expected SARSA
- Linear Function Approximation
- Deep Q-Network (DQN) and variants
- Advantage Actor-Critic (A2C)
- Asynchronous Advantage Actor-Critic (A3C)
- Soft Actor-Critic (SAC)
- Twin Delayed DDPG (TD3)
- Deep Deterministic Policy Gradient (DDPG)
- Dyna-Q
- Model-Based Value Expansion
- Monte Carlo Tree Search (MCTS)
- iLQR (Iterative LQR)
- Monte Carlo (uses complete episode returns)
- Temporal-Difference (uses estimated returns)
- Dynamic Programming (uses full model)
- Cross-Entropy Method (optimizes policy directly)
- Policy Gradient methods
- CEM ↔ Policy-Based (gradient-free)
- MC ↔ Value-Based (full returns)
- TD ↔ Value-Based (bootstrapping)
- Actor-Critic = Policy-Based (Actor) + Value-Based (Critic)
- Deep Reinforcement Learning agents learn with batches of experience. The question is, how do they collect it?
- Online RL, which is what we’ve covered so far: the agent gathers data directly. It collects a batch of experience by interacting with the environment, then uses this experience immediately (or via some replay buffer) to learn from it, i.e. to update its policy.
- Online RL implies that you either train your agent directly in the real world or have a simulator.
- Offline RL: the agent only uses data collected from other agents or human demonstrations. It does not interact with the environment. (A sketch contrasting the two settings follows this list.)
- Create a dataset using one or more policies and/or human interactions.
- Run offline RL on this dataset to learn a policy
- Problem: the counterfactual queries problem. What do we do if our agent decides to do something for which we don’t have data? For instance, turning right at an intersection when that trajectory is not in the dataset.
- Reality is varied, non-stationary, and open-ended.
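To make the online/offline distinction concrete, here is a sketch: the online part assumes `gymnasium`, while the offline part assumes a hypothetical file of logged transitions (the file name and keys are made up for illustration):

```python
import gymnasium as gym
import numpy as np

def collect_online(n_steps=1000):
    """Online RL: the agent itself interacts with the environment to gather experience."""
    env = gym.make("CartPole-v1")
    state, _ = env.reset(seed=0)
    batch = []
    for _ in range(n_steps):
        action = env.action_space.sample()    # placeholder for the current policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        batch.append((state, action, reward, next_state, terminated))
        state = next_state if not (terminated or truncated) else env.reset()[0]
    env.close()
    return batch

def load_offline(path="logged_transitions.npz"):
    """Offline RL: learn only from a fixed dataset collected by other policies or humans."""
    data = np.load(path)                      # hypothetical file of logged transitions
    return list(zip(data["states"], data["actions"], data["rewards"],
                    data["next_states"], data["dones"]))
```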
