Notes (part one) from the Reinforcement Learning course by the University of Alberta: the k-armed bandit problem and action-value methods.

Roadmap

  1. Multi-armed bandit problem (MAB)
  2. Markov Decision Process (MDP)
  3. Dynamic programming

K-armed bandit problem

The agent needs to choose from k actions, each with an unknown reward distribution. Once an action is performed, the agent observes the reward.

  • reward
  • time steps
  • values: the value of an action is the expected reward of taking that action. The goal of the agent is to maximize this expected reward: \(a_* = argmax_a \, q_*(a)\), i.e. the argument that maximizes the \(q_*\) function.

Action-value methods

\[q_*(a) = E[R_t | A_t = a]\]

This number is unknown, so we need to estimate it.

  • Sample-average method: the sum of the rewards received when a was taken prior to t, divided by the number of times a was taken prior to t
\[Q_t(a) = \frac{\text{sum of rewards when } a \text{ is taken prior to } t}{\text{number of times } a \text{ is taken prior to } t}\]
  • Greedy action selection: choose the action with the highest estimated value.

  • exploration-exploitation dilemma

Sequential decision making with evaluative feedback

The agent needs to make a sequence of decisions and learn from both its errors and its gains.

The feedback needs to be quantitative.

Learning action values

What if the doctors (in the course's medical example) already knew the long-term effect of each action? Then choosing an action would be easy. In practice these values are unknown and must be learned.

Estimating action values incrementally

incremental update rule for estimating action values

\[Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i = \frac{1}{n} \Big(R_n + \sum_{i=1}^{n-1} R_i\Big) = \frac{1}{n} \Big(R_n + (n-1)\frac{1}{n-1}\sum_{i=1}^{n-1} R_i\Big) = \frac{1}{n} \big(R_n + (n-1)Q_{n}\big) = \frac{1}{n}\big(R_n + nQ_{n} - Q_{n}\big) = Q_n + \frac{1}{n}\big(R_n - Q_n\big)\] \[NewEstimate = OldEstimate + StepSize \, (Target - OldEstimate)\]

The error of the estimate is the difference between the target and the old estimate.

The new reward is our target

The StepSize can be a function of $n$ with values between 0 and 1.

The general update rule can also be used to solve a non-stationary bandit problem.
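
A minimal sketch of the incremental update rule (not the course's official code): the step size can be 1/n for a sample average, or a constant for the non-stationary case discussed next.


import numpy as np

def incremental_update(q_old, reward, step_size):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return q_old + step_size * (reward - q_old)

# Example usage with hypothetical rewards: sample-average step size 1/n.
q, n = 0.0, 0
for r in [1.0, 0.0, 2.0]:
    n += 1
    q = incremental_update(q, r, 1.0 / n)
print(q)  # 1.0, the sample average of the three rewards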

Non-stationary bandit problem

The reward distribution changes over time. If the step size is constant, the most recent rewards have more influence on the estimate than older rewards.

Decaying the past rewards
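
To make the decay explicit, unrolling the constant step-size update (a standard identity from the Sutton & Barto textbook) gives an exponential recency-weighted average:

\[Q_{n+1} = Q_n + \alpha (R_n - Q_n) = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i\]

so a reward received long ago is weighted by \(\alpha(1-\alpha)^{n-i}\), which shrinks geometrically with its age.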

Exploration and exploitation trade-off

When to explore and when to exploit?

  • exploration: improve knowledge for long-term benefit
  • exploitation: maximize reward in the short term

Epsilon-greedy methods

With probability \(\epsilon\), select a random action. With probability \(1-\epsilon\), select the greedy action.

\[A_t = \begin{cases} argmax_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}\]

Use the results of multiple runs to get the average reward.

Optimistic initial values

Optimistic initial values can encourage early exploration (the agent will not get stuck in a suboptimal action at the very beginning).

The limitations of optimistic initial values

  • they only drive early exploration
  • they are not well-suited for non-stationary problems
  • it is not always clear how to set the optimistic initial value

Upper-Confidence-Bound Action Selection

UCB uses the uncertainty of the action-value estimates to balance exploration and exploitation.

If we are uncertain about an action's value, we should act according to its upper bound. \(A_t = argmax_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right]\) The first term is the exploitation part; the second term is the exploration part.
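
A minimal sketch of UCB action selection, assuming numpy arrays q_values and arm_count like the agents below, with the time step t starting at 1; arms that have never been pulled get infinite priority so each arm is tried at least once.


import numpy as np

def ucb_action(q_values, arm_count, t, c=2.0):
    # Upper confidence bound: Q_t(a) + c * sqrt(ln(t) / N_t(a)).
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = c * np.sqrt(np.log(t) / arm_count)
    # Arms with zero pulls get an infinite bound so they are explored first.
    ucb = np.where(arm_count > 0, q_values + bonus, np.inf)
    return int(np.argmax(ucb))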

Greedy Bandit Algorithms

The update step is the most important part of the algorithm. In the RL setting we clearly need both an environment and an agent.


class GreedyAgent(main_agent.Agent):
    def agent_step(self, reward, observation=None):
        """
        Takes one step for the agent. It takes in a reward and observation and 
        returns the action the agent chooses at that time step.
        
        Arguments:
        reward -- float, the reward the agent received from the environment after taking the last action.
        observation -- float, the observed state the agent is in. Do not worry about this as you will not use it
                              until future lessons
        Returns:
        current_action -- int, the action chosen by the agent at the current time step.
        """
        ### Useful Class Variables ###
        # self.q_values : An array with what the agent believes each of the values of the arm are.
        # self.arm_count : An array with a count of the number of times each arm has been pulled.
        # self.last_action : The action that the agent took on the previous time step
        #######################
        
        # Update Q values Hint: Look at the algorithm in section 2.4 of the textbook.
        # increment the counter in self.arm_count for the action from the previous time step
        # update the step size using self.arm_count
        # update self.q_values for the action from the previous time step


        self.arm_count[self.last_action] += 1  # increment the pull count for the previous action
        alpha = 1 / self.arm_count[self.last_action]  # sample-average step size 1/n
        self.q_values[self.last_action] += alpha * (reward - self.q_values[self.last_action])  # incremental update toward the new reward
        current_action = argmax(self.q_values)  # argmax is the notebook's helper, which breaks ties randomly (unlike np.argmax)
        self.last_action = current_action
        return current_action

Epsilon-Greedy Bandit Algorithm

class EpsilonGreedyAgent(main_agent.Agent):
    def agent_step(self, reward, observation):
        """
        Takes one step for the agent. It takes in a reward and observation and 
        returns the action the agent chooses at that time step.
        
        Arguments:
        reward -- float, the reward the agent received from the environment after taking the last action.
        observation -- float, the observed state the agent is in. Do not worry about this as you will not use it
                              until future lessons
        Returns:
        current_action -- int, the action chosen by the agent at the current time step.
        """
        
        ### Useful Class Variables ###
        # self.q_values : An array with what the agent believes each of the values of the arm are.
        # self.arm_count : An array with a count of the number of times each arm has been pulled.
        # self.last_action : The action that the agent took on the previous time step
        # self.epsilon : The probability an epsilon greedy agent will explore (ranges between 0 and 1)
        #######################
        
        # Update Q values - this should be the same update as your greedy agent above
 
        # Choose action using epsilon greedy
        # Randomly choose a number between 0 and 1 and see if it's less than self.epsilon
        # (hint: look at np.random.random()). If it is, set current_action to a random action.
        # otherwise choose current_action greedily as you did above.
  
        self.arm_count[self.last_action] += 1
        alpha = 1 / self.arm_count[self.last_action] 
        self.q_values[self.last_action] += alpha * (reward - self.q_values[self.last_action])
        random_float = np.random.random()
        if random_float < self.epsilon:
            current_action = np.random.randint(len(self.arm_count))
        else:
            current_action = argmax(self.q_values)
        
        self.last_action = current_action
        
        return current_action

Optimistic initial values

  1. How optimistic initial values encourage early exploration

  2. Describe limitations of optimistic initial values

Markov Decision Process

The bandit reward depends only on the most recent action. In an MDP, the reward depends on the current state and action.

Formalize the interaction between the agent and the environment.

  • agent: the learner and decision maker
  • environment: the thing that the agent interacts with.
  • action: the decision that the agent makes
  • state: the current situation

At each time step, the agent receives a state from the environment and selects an action; the environment then returns a reward and the next state.

The transition function tells us the probability of the next state and reward given the current state and action. The state, action, and reward sets are finite.

Markov property: the future state depends only on the current state and action, not on the earlier history.

MDPs provide a framework for modeling sequential decision making.

The dynamics of an MDP are defined by a probability distribution.
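
In the Sutton & Barto textbook's notation (added here for reference), the dynamics function is

\[p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}\]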

Example: Recycling Robot

  • battery state: low and high

  • action: the available actions depend on the battery state. If the state is high, the robot can search or wait; if the state is low, the robot can search, wait, or recharge.

  • The transition dynamics are the critical part of an MDP, and of RL more broadly.

  • Searching with a high battery may affect the battery state: it can remain high or change to low, with some probability involved. The probability of staying in the high state is \(\alpha\), and the probability of changing to the low state is \(1-\alpha\). Both outcomes give the robot the reward for searching. If the robot decides to wait, it stays in the same state and gets the reward for waiting. (See the sketch of these dynamics after this list.)

  • States can be low-level sensory readings or high-level abstractions.

  • Actions can be low-level motor commands or high-level goals.

  • Time steps can be milliseconds or years.

  • These make the MDP framework very flexible.
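
As a minimal sketch, part of the recycling robot's dynamics can be written as a Python dict mapping (state, action) to a list of (probability, next_state, reward) tuples. The values of alpha, r_search, and r_wait below are placeholders for illustration, not numbers given in the course.


# Hypothetical parameter values for illustration only.
alpha, r_search, r_wait = 0.7, 4.0, 1.0

# p maps (state, action) -> list of (probability, next_state, reward).
p = {
    ("high", "search"): [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):   [(1.0, "high", r_wait)],
    # ... transitions for the "low" state (search / wait / recharge) omitted here
}

# Sanity check: probabilities for each (state, action) pair sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9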

Goal of RL

  • the agent should have long-term goals: what looks good in the short term may not be good in the long term.

  • maximize the total future reward. Does this capture the long-term goal of the agent?

  • maximize the expected sum of future rewards, because different trajectories can yield different rewards.

  • for this sum to be well-defined, the number of steps must be finite.

Identify episodic tasks.

  • break the trajectory into episodes; each episode has a starting state and a terminal state.
  • at termination, the agent resets to a starting state.

The Reward Hypothesis

  • Give a man a fish and he'll eat for a day. This is programming.

  • Teach a man to fish and he'll eat for a lifetime. This is supervised learning.

  • Give a man a taste for fish and he'll figure out how to get fish even when the details change. This is reinforcement learning.

Versions of Hypothesis

“All goals can be described by maximization of expected cumulative rewards”

“Intelligent behavior arises from the actions of an individual seeking to maximize its received reward signals in a complex and changing world”

  • where is the reward signal coming from?
  • develop algorithms that search the space of behaviors to maximize the reward signal.

Goal-reward coding

+1 for reaching the goal, 0 otherwise; or -1 for every step before the goal, 0 once the goal is reached.

Whence Rewards

  • Programming
    • Coding
    • Human-in-the-loop
  • Examples
    • Mimic reward
    • Inverse reinforcement learning
  • Optimization
    • Evolutionary optimization
    • Meta RL

Challenges to the hypothesis

  • The target may be something other than expected cumulative reward, especially for risk-sensitive behavior, e.g. trading.
  • How to capture the diversity in behavior.

Continuing Tasks

Differentiate between episodic and continuing tasks.

  • Episodic Tasks
    1. Interaction breaks into episodes
    2. Each episode ends in a terminal state
    3. Episodes are independent
\[G_t = R_{t+1} + R_{t+2} + ... + R_T\]
  • Continuing Tasks
    1. Interaction goes on continually
    2. No terminal state
\[G_t = R_{t+1} + R_{t+2} + ...\]

Formulate returns for continuing tasks using discounting.

Since the sum of rewards in a continuing task can be infinite, we need to discount the return.

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\]

Assume \(R_{max}\) is the maximum reward, then

\[G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \le \sum_{k=0}^{\infty} \gamma^k R_{max} = R_{max} \sum_{k=0}^{\infty} \gamma^k = \frac{R_{max}}{1 - \gamma}\]

When \(\gamma = 0\), the agent is short-sighted; as \(\gamma\) approaches 1, the agent becomes far-sighted.
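
One step worth making explicit, since it is used in the Bellman equation derivation below: the discounted return can be written recursively.

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) = R_{t+1} + \gamma G_{t+1}\]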

Formalize a task as episodic or continuing.

Specifying Policies

A policy is a distribution over actions for each possible state.

  • deterministic policy: a function that takes a state as input and outputs an action, \(\pi(s) = a\).

  • stochastic policy: a distribution over actions for each state; it is a conditional distribution \(\pi(a|s)\) over actions given the state.

Similarities and differences between stochastic and deterministic policies.

The current state contains all the information the agent can use to make a decision; the policy does not depend on anything else.

What is a valid policy for a given MDP?

If the action depends on anything other than the current state, it is not a valid policy.

Value Functions

The reward captures the short-term gain; the value function captures the long-term gain.

state-value and action-value functions.

The state-value function is the expected return starting from the state and following the policy.

\[v_{\pi}(s) = E_{\pi}[G_t | S_t = s]\]

The action-value function is the expected return starting from the state, taking the action, and thereafter following the policy.

\[q_{\pi}(s, a) = E_{\pi}[G_t | S_t = s, A_t = a]\]

relationship between value functions and policies

Value functions predict rewards into the future. They are used to evaluate policies: some policies are better than others.

Bellman Equation Derivation

Bellman equation for state-value functions

\[v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s]\] \[= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]\] \[= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \, r + \gamma \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) \, v_{\pi}(s')\]

Bellman equation for action-value functions

\[q_{\pi}(s, a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]\] \[= \sum_{s', r} p(s', r | s, a) \Big[r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')\Big]\]

The current time-step’s state/action values can be written recursively in terms of future state/action values.

how Bellman equations relate current and future values

Bellman equations can be used to solve for a value function by writing a system of linear equations.

We can only solve small MDPs directly this way, but the Bellman equations factor into the solution methods for larger MDPs.
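
As a sketch of the "system of linear equations" view (the MDP and policy below are hypothetical, not from the course): under a fixed policy, the Bellman equation in matrix form is v = r_pi + gamma * P_pi v, which numpy can solve directly for a small state space.


import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a); r_pi[s] = expected one-step reward under pi.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.2, 0.0, 0.8]])
r_pi = np.array([0.0, 1.0, 2.0])
gamma = 0.9

# (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)  # exact state values for this policy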

Optimal policies

A policy defines how the agent behaves.

How do we evaluate a policy, and how do we identify an optimal policy for a given MDP?

An optimal policy is as good as or better than all the other policies. There is always at least one optimal policy: wherever two policies disagree on a state, we can take the better choice there, and combining these per-state choices gives a policy at least as good as both.

The number of possible deterministic policies is exponential in the number of states (there are \(|A|^{|S|}\) of them), so brute-force search over policies is infeasible.

Optimal value functions

\[\pi_{1} \geq \pi_{2} \iff v_{\pi_1}(s) \geq v_{\pi_2}(s) \text{ for all } s\]

The optimal value function is as good as or better than every other value function: for every state, it attains the maximum value over all policies.

\[v_{\pi_*}(s) = E_{\pi_*}[G_t | S_t = s]\]

Optimal policies share the same optimal action-value function (and the same optimal state-value function).

Bellman optimality equation for state-value functions

\[v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]\] \[v_{*}(s) = \sum_{a} \pi_{*}(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]\] \[v_{*}(s) = \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]\]

Bellman optimality equation for action-value functions

\[q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) \Big[r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')\Big]\] \[q_{*}(s, a) = \sum_{s', r} p(s', r | s, a) \Big[r + \gamma \max_{a'} q_{*}(s', a')\Big]\]

Can we solve the Bellman optimality equation directly as a linear system?

Taking the max over actions is not a linear operation, so the system of equations is non-linear.

Using optimal value functions to get the optimal policy

The connection between the optimal value functions and optimal policies

\[v_{*}(s) = \max_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]\] \[\pi_{*}(a|s) = 1 \text{ if } a = argmax_{a'} \sum_{s', r} p(s', r | s, a') [r + \gamma v_{*}(s')], \text{ and } 0 \text{ otherwise}\]
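
A minimal sketch of extracting the greedy policy from a value function. It assumes the dynamics are given as an array p[s, a, s_next] of transition probabilities and an array r[s, a] of expected rewards; this is a simplification of p(s', r | s, a), using the identity that the one-step lookahead equals r(s, a) + gamma * sum_s' p(s'|s, a) v(s').


import numpy as np

def greedy_policy(v, p, r, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to v.

    p[s, a, s_next] -- transition probabilities, r[s, a] -- expected immediate rewards.
    """
    # One-step lookahead: q[s, a] = r[s, a] + gamma * sum_s' p[s, a, s'] * v[s']
    q = r + gamma * np.einsum("san,n->sa", p, v)
    return np.argmax(q, axis=1)  # one action index per state

Applied to \(v_{*}\) this recovers an optimal policy; applied to \(v_{\pi}\) it performs the greedification step used in policy iteration below.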

Policy Evaluation vs Control

Policy evaluation is the task of determining the value function for a given policy.

Policy control is the task of finding a policy to obtain as much reward as possible. Control is the ultimate goal of RL.

But policy evaluation is usually the first step of control: it is hard to improve a policy without knowing how good it is.

Dynamic programming is a general solution method for policy evaluation and control. What are its limitations?

It uses the bellman equations to define iterative algorithms for both policy evaluation and control.

\[v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]\]

Control is the task of improving a policy. The goal of the control is to produce a policy that is strictly better than the current one.

Dynamic programming techniques can be used to solve both if we have access to the dynamics function p.

Iterative Policy Evaluation

iterative policy evaluation algorithm for estimating state values under a given policy

\[v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s')]\]

Sweep through all the states and update the value function.

Two-array version: one array stores the current value function, the other stores the next value function.

One-array version: update the value function in place. This is called in-place dynamic programming; it still converges, and often converges faster than the two-array version.

Algorithm: iterative policy evaluation for estimating $v_{\pi}(s)$

input $\pi$, the policy to be evaluated

initialize $V(s) = 0$, for all s

loop:

  delta = 0

  for each s:

    v = V(s)

    V(s) = sum_{a} pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma V(s')]

    delta = max(delta, |v - V(s)|)

until delta < theta (a small positive number)

return V
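
A compact Python sketch of iterative policy evaluation (in-place variant), assuming the same hypothetical array-based dynamics p[s, a, s_next] and expected rewards r[s, a] as in the greedy-policy sketch above, and a stochastic policy pi[s, a].


import numpy as np

def policy_evaluation(pi, p, r, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation.

    pi[s, a] -- policy probabilities, p[s, a, s_next] -- transition probabilities,
    r[s, a]  -- expected immediate rewards.
    """
    num_states = p.shape[0]
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_old = V[s]
            # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
            V[s] = np.sum(pi[s] * (r[s] + gamma * p[s] @ V))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V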

Policy Improvement

policy improvement theorem

\[\pi_{*}(s) = argmax_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]\]

The greedy action maximizes the right-hand side of the Bellman optimality equation in each state.

The new policy must be different from \(\pi\) unless \(\pi\) is already optimal: if greedification does not change the policy, the policy is already optimal.

\(q_{\pi}(s, \pi'(s)) \geq q_{\pi}(s, \pi(s))\) for all s.

At each state, the value of the action selected by \(\pi'\) is greater than or equal to the value of the action selected by \(\pi\). \(\pi'\) is strictly better if the inequality is strict in at least one state.

Policy iteration

\[\pi_{*}(s) = argmax_{a} \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]\]

\(\pi_{*}\) is the greedy policy with respect to the value function \(v_{*}\).

The greedified policy is strictly better than the original policy unless the original policy is already optimal.

Outline the policy iteration algorithm for finding the optimal policy.

how policy iteration reaches the optimal policy by alternating between evaluating a policy and improving it.

steps of the policy iteration algorithm

step 1: starting from the initial policy, evaluate the policy to get its value function.

step 2: improve the policy by selecting the greedy action with respect to that value function.

pseudocode:

  1. initialize $\pi$ randomly

  2. Policy Evaluation

    loop:

      delta = 0

      for each s:

        v = V(s)

        V(s) = sum_{a} pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma V(s')]

        delta = max(delta, |v - V(s)|)

    until delta < theta

  3. Policy Improvement

    policy_stable = true

    for each s:

      old_action = pi(s)

      pi(s) = argmax_{a} sum_{s', r} p(s', r | s, a) [r + gamma V(s')]

      if old_action != pi(s):

        policy_stable = false

    if policy_stable:

      return pi

    else:

      go to step 2 (Policy Evaluation)
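
Putting the two steps together, here is a short policy iteration sketch in Python, reusing the hypothetical policy_evaluation and array-based dynamics from the sketches above. A deterministic policy is represented as an array of action indices.


import numpy as np

def policy_iteration(p, r, gamma=0.9, theta=1e-8):
    """Policy iteration with a deterministic policy pi[s] = action index.

    p[s, a, s_next] -- transition probabilities, r[s, a] -- expected rewards.
    Requires the policy_evaluation function defined in the sketch above.
    """
    num_states, num_actions = r.shape
    pi = np.zeros(num_states, dtype=int)      # arbitrary initial policy
    while True:
        # Policy evaluation for the current deterministic policy.
        pi_probs = np.eye(num_actions)[pi]    # one-hot pi[s, a]
        V = policy_evaluation(pi_probs, p, r, gamma, theta)
        # Policy improvement: greedify with respect to V.
        q = r + gamma * np.einsum("san,n->sa", p, V)
        new_pi = np.argmax(q, axis=1)
        if np.array_equal(new_pi, pi):        # policy stable -> optimal
            return pi, V
        pi = new_pi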

Flexibility of Policy Iteration

Generalized Policy Iteration

Value iteration is an important special case of GPI.

Each sweep over all states uses the max over actions (a combined evaluation and improvement step); once the values converge, a greedy policy improvement extracts the policy.

Loop:

  delta = 0

  for each s:

    v = V(s)

    V(s) = max_{a} sum_{s', r} p(s', r | s, a) [r + gamma V(s')]

    delta = max(delta, |v - V(s)|)

until delta < theta

pi(s) = argmax_{a} sum_{s', r} p(s', r | s, a) [r + gamma V(s')]
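
And a corresponding value iteration sketch under the same hypothetical array-based dynamics as above.


import numpy as np

def value_iteration(p, r, gamma=0.9, theta=1e-8):
    """Value iteration: sweep with the max over actions, then extract the greedy policy."""
    num_states = p.shape[0]
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_old = V[s]
            V[s] = np.max(r[s] + gamma * p[s] @ V)  # max_a over the one-step lookahead
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    pi = np.argmax(r + gamma * np.einsum("san,n->sa", p, V), axis=1)
    return pi, V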

synchronous and asynchronous dynamic programming

synchronous dynamic programming: update all the states in each sweep.

asynchronous dynamic programming: update the states in any order; a full ordered sweep is not required.

Efficiency of Dynamic Programming

Monte Carlo Sampling

\[v_{\pi}(s) = E_{\pi}[G_t | S_t = s]\]

Take a large number of sampled returns under \(\pi\) and average them; this has to be done for each state separately.

This approach treats the states independently, which is not efficient.

Dynamic programming is more efficient: it uses the Bellman equation to update a state's value from the values of its successor states.

Using the value estimates of successor states to update the current state's estimate is known as bootstrapping.

Brute-force search over deterministic policies is exponential in the number of states, whereas dynamic programming is polynomial in the number of states and actions.

The Curse of Dimensionality

The size of the state space grows exponentially with the number of state variables (dimensions). This is not an issue specific to dynamic programming; it is inherent to the problem itself.