RL-Course-2
Note two of the RL course from the University of Alberta.
Monte Carlo
Monte Carlo methods let the agent learn from experience in the form of episodes. They require no model of the environment, only experience, and they can be used with simulated or real experience.
To use pure DP, the agent needs to know the environment's transition probabilities.
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
If we use simulated experience, a model is required; however, it only needs to generate sample transitions, not the complete probability distributions.
Monte Carlo methods estimate values by averaging over a large number of random samples.
Monte Carlo is used to estimate value functions from sample interaction.
First-visit MC prediction, for estimating $V \approx V_{\pi}$
Input: a policy $\pi$ to be evaluated
Initialize:
- $V(s) \in R$ for all $s \in S$
- Returns(s) $\leftarrow$ an empty list for all $s \in S$
Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, …, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2, …, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - Unless $S_t$ appears in $S_0, S_1, …, S_{t-1}$:
    - Append $G$ to Returns($S_t$)
    - $V(S_t) \leftarrow average(Returns(S_t))$
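A minimal Python sketch of the first-visit MC prediction loop above, assuming a hypothetical `env` with `reset()` and `step(action) -> (next_state, reward, done)` and a `policy` callable mapping states to actions:

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0):
    V = defaultdict(float)          # V(s), defaults to 0
    returns = defaultdict(list)     # Returns(s): observed returns per state

    for _ in range(num_episodes):
        # Generate an episode following the policy: a list of (S_t, R_{t+1})
        episode = []
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state

        # Work backwards through the episode, accumulating the return G
        G = 0.0
        states_in_episode = [s for s, _ in episode]
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: record G only for the earliest visit to s
            if s not in states_in_episode[:t]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```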
Both first-visit MC and every-visit MC converge to the true value function as the number of visits to each state approaches infinity. By the law of large numbers, the sequence of sample averages converges to the expected values.
Each average is itself an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{N}$, where $N$ is the number of samples.
Using Monte Carlo for action values
Estimate action-value function using MC
Action values are useful for learning a policy.
Maintaining exploration in MC
- Exploring starts: episodes start from every state-action pair, so every pair has a nonzero probability of beginning an episode.
- $\epsilon$-greedy strategy
Monte Carlo for generalized policy iteration
Generalized policy iteration (GPI) is a way to combine policy evaluation and improvement.
How to use Monte Carlo to implement GPI?
\[\pi_{k+1}(s) = \arg\max_{a} Q_{\pi_k}(s, a)\]
For evaluation, Monte Carlo prediction can be used to estimate the value function of the policy.
Initialize:
- $\pi(s) \in A(s)$ for all $s \in S$
- $Q(s, a) \in R$ for all $s \in S$, $a \in A(s)$
- Returns($s, a$) $\leftarrow$ an empty list for all $s \in S$, $a \in A(s)$
Loop forever (for each episode):
- Choose $S_0 \in S$, $A_0 \in A(S_0)$ randomly such that all pairs have probability > 0
- Generate an episode starting from $S_0, A_0$, following $\pi$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2, …, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - Append $G$ to Returns($S_t, A_t$)
  - $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
  - $\pi(S_t) \leftarrow \arg\max_{a} Q(S_t, a)$
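A sketch of Monte Carlo control with exploring starts, under stated assumptions: a hypothetical `generate_episode(policy, s0, a0)` helper that returns a list of `(S_t, A_t, R_{t+1})` tuples, and explicit lists of `states` and `actions`:

```python
import random
from collections import defaultdict

def mc_control_es(generate_episode, states, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)                       # Q(s, a) estimates
    returns = defaultdict(list)                  # Returns(s, a) lists
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring starts: every state-action pair can begin an episode
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(policy, s0, a0)   # [(S_t, A_t, R_{t+1}), ...]

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                        # return following (s, a)
            returns[(s, a)].append(G)                # every-visit, as in the pseudocode
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            # Greedy policy improvement at the visited state
            policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```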
Epsilon-soft policies
An $\epsilon$-soft policy is a policy in which the probability of selecting each action in each state is at least $\epsilon / |A(s)|$.
With this, we can drop the exploring starts requirement in the Monte Carlo control algorithm.
But it cannot find the deterministic optimal policy; it can only find the optimal $\epsilon$-soft policy.
Algorithm parameter: small $\epsilon > 0$
Initialize:
- $\pi \leftarrow$ an arbitrary $\epsilon$-soft policy
- $Q(s, a) \in R$ for all $s \in S$, $a \in A(s)$
- Returns($s, a$) $\leftarrow$ an empty list for all $s \in S$, $a \in A(s)$
Loop forever (for each episode):
- Generate an episode following ${pi}$: $S_0, A_0, R_1, …, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2, …, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - Append $G$ to Returns($S_t, A_t$)
  - $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
  - $A^* \leftarrow \arg\max_{a} Q(S_t, a)$ (with ties broken arbitrarily)
  - For all $a \in A(S_t)$:
    - $\pi(a \mid S_t) \leftarrow 1 - \epsilon + \epsilon / |A(S_t)|$ if $a = A^*$, otherwise $\epsilon / |A(S_t)|$
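A small sketch of the $\epsilon$-soft policy improvement step above, assuming `pi` and `Q` are dictionaries keyed by `(state, action)`:

```python
def epsilon_soft_update(pi, Q, state, actions, epsilon):
    """Set pi to the epsilon-greedy distribution w.r.t. Q at the given state."""
    best = max(actions, key=lambda a: Q[(state, a)])     # A*, ties broken by max()
    n = len(actions)
    for a in actions:
        # The greedy action gets the extra 1 - epsilon probability mass
        pi[(state, a)] = (1 - epsilon + epsilon / n) if a == best else epsilon / n
```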
Off-policy learning
Off-policy learning can deal with the exploration problem.
On-Policy learning: improve and evaluate the policy being used to select actions
Off-Policy learning: improve and evaluate a different policy from the one used to select actions. The policy we are learning is not the policy we are using for action selection.
Target policy: the policy we are learning about. Learn values for this policy. For example the optimal policy.
Behavior policy: the policy we use to select actions. This is the policy that generates behavior. It’s used to select action for the agent.
Why are we decoupling the target policy and the behavior policy?
If the agent only follows the target policy, it may visit only a small number of states; a more exploratory behavior policy lets it experience many more. Facilitating exploration is one of the main motivations for off-policy learning.
The behavior policy must cover the target policy: $b(a \mid s) > 0$ wherever $\pi(a \mid s) > 0$.
Off-policy learning allows learning an optimal policy from suboptimal behavior.
Importance sampling
Importance sampling allows us to do off-policy learning.
- Estimate the expected value of a random variable under one distribution using samples drawn from a different distribution.
Derivation of importance sampling:
Sample $x \sim b$; estimate $E_{\pi}[X]$.
\[E_{\pi}[X] = \sum_{x \in X} x \, \pi(x) = \sum_{x \in X} x \, \frac{\pi(x)}{b(x)} \, b(x)\]
The factor $\rho(x) = \pi(x) / b(x)$ is called the importance sampling ratio, so
\[E_{\pi}[X] = \sum_{x \in X} x \, \rho(x) \, b(x) = E_{b}[X \rho(X)]\]
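A quick numeric sanity check of the identity $E_{\pi}[X] = E_{b}[X \rho(X)]$, using two made-up discrete distributions (the values and probabilities below are arbitrary):

```python
import random

outcomes = [1.0, 2.0, 3.0]
pi = {1.0: 0.1, 2.0: 0.2, 3.0: 0.7}   # target distribution
b  = {1.0: 0.4, 2.0: 0.4, 3.0: 0.2}   # behavior distribution (covers pi)

true_value = sum(x * pi[x] for x in outcomes)   # E_pi[X] = 2.6

# Sample from b, then reweight each sample by rho(x) = pi(x) / b(x)
samples = random.choices(outcomes, weights=[b[x] for x in outcomes], k=100_000)
estimate = sum(x * pi[x] / b[x] for x in samples) / len(samples)

print(true_value, estimate)   # the estimate should be close to 2.6
```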
Off-policy MC prediction
How to use importance sampling to correct returns
\[V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]\]
With episodes generated by the behavior policy $b$, each observed return is weighted by its importance sampling ratio before averaging:
\[V_{\pi}(s) \approx average(\rho_{0:T-1} G_0, \rho_{1:T-1} G_1, \ldots, \rho_{T-1:T-1} G_{T-1})\]
\[V_{\pi}(s) = E_{b}[\rho_{t:T-1} G_t \mid S_t = s]\]
The ratio comes from the probability of the rest of the trajectory. Under $b$:
\[P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t) = b(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)\, b(A_{t+1} \mid S_{t+1}) \cdots = \prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)\]
Taking the ratio of the trajectory probabilities under $\pi$ and under $b$, the transition probabilities cancel:
\[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\]
Modify the Monte Carlo prediction algorithm for off-policy learning:
Input: a policy $\pi$ to be evaluated, and a behavior policy $b$
Initialize:
- $V(s) \in R$ for all $s \in S$
- Returns($s$) $\leftarrow$ an empty list for all $s \in S$
Loop forever (for each episode):
- Generate an episode following $b$: $S_0, A_0, R_1, …, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- $W \leftarrow 1$
- Loop for each step of episode, $t = T-1, T-2, …, 0$:
  - $G \leftarrow \gamma G + R_{t+1}$
  - $W \leftarrow W \cdot \pi(A_t \mid S_t) / b(A_t \mid S_t)$
  - Append $W G$ to Returns($S_t$)
  - $V(S_t) \leftarrow average(Returns(S_t))$
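A sketch of the off-policy prediction loop above, assuming episodes generated by $b$ are given as lists of `(S_t, A_t, R_{t+1})` tuples and that `pi(a, s)` and `b(a, s)` are callables returning action probabilities:

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    V = defaultdict(float)
    returns = defaultdict(list)

    for episode in episodes:                      # each episode generated by b
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                     # plain return under b
            W *= pi(a, s) / b(a, s)               # rho_{t:T-1}, built incrementally
            returns[s].append(W * G)              # importance-weighted return
            V[s] = sum(returns[s]) / len(returns[s])
    return V
```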
Temporal Difference Learning
Values from returns:
\[V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]\]
\[V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\]
Bootstrapping and Temporal Difference
\[G_t \approx R_{t+1} + \gamma V(S_{t+1})\]
\[V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))\]
The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the difference between the TD target and the current estimate.
Tabular TD(0) for estimating $V \approx V_{\pi}$
Input: the policy $\pi$ to be evaluated
Initialize:
- $V(s) \in R$ for all $s \in S$
- Step size $\alpha \in (0, 1]$
Loop forever (for each episode):
- Initialize S
- Loop for each step of episode (until S is terminal):
  - A = action given by $\pi$ for S
  - Take action A, observe R, S'
  - $V(S) \leftarrow V(S) + \alpha (R + \gamma V(S') - V(S))$
  - S = S'
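A minimal tabular TD(0) sketch, again assuming a hypothetical `env` with `reset()`/`step(action)` and a `policy` callable:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    V = defaultdict(float)            # V(s), initialized to 0

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD(0) update: bootstrap from the estimate of the next state
            # (a terminal state has value 0)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```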
TD combines DP and MC
TD bootstraps from its own value estimates like DP, and learns directly from sample experience like MC.
It does not require a model of the environment, and it is online and incremental.
In practice it often converges faster than MC.
TD vs MC
TD typically converges faster than MC and reaches a lower final error. With a larger learning rate, the TD error drops faster early on, but a smaller learning rate reaches a lower error in the end.
Sarsa: GPI with TD
How GPI can be used with TD to find improved policies
Generalized Policy Iteration consists of policy evaluation and policy improvement.
GPI with MC:
- Policy evaluation: estimate $V_{\pi}$
- Policy improvement: $\pi_{k+1}(s) = \arg\max_{a} Q_{\pi_k}(s, a)$
GPI with TD:
GPI with TD uses the action-value function instead of the state-value function.
Instead of going from state to state, we go from state-action pair to state-action pair.
SARSA: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, …$
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))\]
Q-learning
Q-learning is an off-policy TD control algorithm.
\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t))\]
It uses the maximum action value of the next state to update the current state-action value.
The SARSA update is an on-policy update, while the Q-learning update is an off-policy update.
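The two update rules side by side, as a sketch where `Q` is a dictionary keyed by `(state, action)`:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action actually selected next (A_{t+1})
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy (max) action value in S_{t+1}
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```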
Expected Sarsa
Expected Sarsa is a TD control algorithm; because its target takes an expectation over next actions, it can also learn off-policy without importance sampling.
The Bellman equation for action-values
\[Q_{\pi}(s, a) = E_{\pi}[R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]\]
The Expected Sarsa algorithm:
Initialize Q(s, a) arbitrarily for all state-action pairs
For each episode:
- Initialize state S
- Loop for each step of the episode, until S is terminal:
  - Choose action A from S using a policy derived from Q (e.g., ε-greedy)
  - Take action A, observe reward R and next state S'
  - Update Q(S, A) using the expected value over next actions:
    Q(S, A) ← Q(S, A) + α[R + γ Σ_{a'} π(a' | S') Q(S', a') − Q(S, A)]
  - S ← S'
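A sketch of the Expected Sarsa update with an ε-greedy target policy; the expectation term is computed explicitly over the next actions:

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, alpha, gamma, epsilon):
    # Action values in the next state and their epsilon-greedy probabilities
    q_next = [Q[(s_next, a2)] for a2 in actions]
    n = len(actions)
    probs = [epsilon / n] * n
    probs[q_next.index(max(q_next))] += 1 - epsilon   # greedy action gets the rest

    # Target uses the expectation over next actions, not a sampled A'
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    target = r + gamma * expected_q
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```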
Model-based RL
What is a model in RL?
There are distribution models and sample models: a distribution model gives the complete probabilities of next states and rewards, while a sample model only produces sample transitions.
Planning is the process of using model-generated experience to improve the policy without interacting with the environment.
Tabular Dyna-Q
Algorithm:
Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$, $a \in A(s)$
Loop forever (for each time step):
- S = current state
- A = epsilon-greedy(S, Q)
- Take action A, observe R, S'
- Q(S, A) = Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
- Model(S, A) = R, S'
- Loop repeat n times (planning updates):
  - S = a random previously observed state
  - A = a random action previously taken in S
  - R, S' = Model(S, A)
  - Q(S, A) = Q(S, A) + alpha * (R + gamma * max_a Q(S', a) - Q(S, A))
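A sketch of tabular Dyna-Q under stated assumptions: a hypothetical `env` with `reset()` and `step(action) -> (next_state, reward, done)`, and a known list of `actions`:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def dyna_q(env, actions, num_steps, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                    # Model(S, A) -> (R, S')
    state = env.reset()

    for _ in range(num_steps):
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env.step(action)

        # Direct RL: one-step Q-learning update from real experience
        target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])

        # Model learning: remember the last observed outcome of (S, A)
        model[(state, action)] = (reward, next_state)

        # Planning: n more Q-learning updates from simulated experience
        for _ in range(n_planning):
            s, a = random.choice(list(model.keys()))
            r, s_next = model[(s, a)]
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

        state = env.reset() if done else next_state
    return Q
```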
Inaccurate model issue
Ways in which models can be inaccurate
The transitions stored in the model may differ from the actual transitions in the environment.
Models can also be incomplete, with some transitions missing.
The environment may change: before and after the change, the same action can lead to different states with different rewards.
The effects of planning with an inaccurate model:
Dyna-Q with an inaccurate model can be unstable.