Note three of the RL course from the University of Alberta.

Parameterized functions to approximate values

Bring parameters into the value function along with the state:

\(V(s, w)\), where \(w\) is the parameter vector. We can learn \(w\).

By changing the weights, we change the value function.

Linear value function approximation

\[V(s, w) = \sum_i w_i x_i(s)\]

Where $x_i(s)$ is the $i$-th feature of state $s$ and $x(s)$ is the feature vector. The features act as basis functions.
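
A minimal NumPy sketch of the linear value estimate, assuming one-hot (state-aggregation) features; the helper `one_hot_features` and the five-state sizing are made up for illustration:

```python
import numpy as np

# Hypothetical one-hot (state-aggregation) features for a 5-state problem.
def one_hot_features(state, num_states=5):
    x = np.zeros(num_states)
    x[state] = 1.0
    return x

def linear_value(x, w):
    """Linear value estimate: V(s, w) = sum_i w_i * x_i(s)."""
    return np.dot(w, x)

w = np.zeros(5)
print(linear_value(one_hot_features(2), w))  # 0.0 before any learning
```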

Generalization and Discrimination

Generalization: updates to one state affect the value of other states.

Discrimination: the ability to assign different values to different states.
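
A small sketch of both effects under state aggregation (the feature map below is made up for illustration): states 0 and 1 share a feature, so an update made in state 0 also moves the value of state 1 (generalization), while the representation can never give those two states different values (no discrimination between them).

```python
import numpy as np

# Hypothetical aggregation: states 0 and 1 share feature 0, state 2 uses feature 1.
features = {0: np.array([1.0, 0.0]),
            1: np.array([1.0, 0.0]),
            2: np.array([0.0, 1.0])}

w = np.zeros(2)
alpha, target = 0.5, 10.0

# Update toward a target observed in state 0 only.
x = features[0]
w += alpha * (target - w.dot(x)) * x

print(w.dot(features[0]))  # 5.0 -- state 0 moved toward the target
print(w.dot(features[1]))  # 5.0 -- state 1 changed too (generalization)
print(w.dot(features[2]))  # 0.0 -- state 2 unaffected
# States 0 and 1 always share a value, so this representation cannot discriminate them.
```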

Framing value estimation as supervised learning

The function approximator should be compatible with online updates, since the agent sees data incrementally as it interacts with the environment.

Semi-Gradient TD for Policy Evaluation

  • TD update for function approximation

The gradient Monte Carlo update equation:

\[w \leftarrow w + \alpha (G - V(s, w)) \nabla_w V(s, w)\]
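A sketch of this update for the linear case, where $\nabla_w V(s, w) = x(s)$; the `(state, G)` episode format and the `features` callable are assumptions for illustration:

```python
import numpy as np

def gradient_mc_update(w, episode, alpha, features):
    """One gradient Monte Carlo sweep over an episode.

    episode: list of (state, G) pairs, where G is the return observed
    from that state; features(state) returns the feature vector x(s).
    For a linear approximator, grad_w V(s, w) = x(s).
    """
    for state, G in episode:
        x = features(state)
        w = w + alpha * (G - w.dot(x)) * x  # move V(s, w) toward the return G
    return w
```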

If we replace the return $G$ with another estimate of the value, we get the more general update:

\[w \leftarrow w + \alpha (U_t - V(s, w)) \nabla_w V(s, w)\]

Where $U_t$ is the update target. If the target is an unbiased estimate of the true value (as the Monte Carlo return is), then $w$ converges to a local optimum under the usual step-size conditions. The TD target $R_{t+1} + \gamma V(S_{t+1}, w)$ depends on the current weights and is biased; treating it as a constant when taking the gradient gives the semi-gradient TD update.
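
A sketch of one semi-gradient TD(0) step for the linear case, again with an assumed `features` callable and an assumed `(s, r, s_next, done)` transition format; note that the target uses the current weights but is not differentiated:

```python
import numpy as np

def semi_gradient_td0_step(w, transition, alpha, gamma, features):
    """One semi-gradient TD(0) update for linear V(s, w) = w . x(s).

    transition: (s, r, s_next, done). The target U_t bootstraps from the
    current weights but is treated as a constant when taking the gradient,
    which is what makes this a *semi*-gradient method.
    """
    s, r, s_next, done = transition
    x = features(s)
    target = r if done else r + gamma * w.dot(features(s_next))  # U_t
    return w + alpha * (target - w.dot(x)) * x  # grad_w V(s, w) = x(s)
```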