Note three of the RL course from the University of Alberta.

Parameterized functions to approximate values

Bring parameters into the value function along with the state:

\(V(s, w)\), where \(w\) is the parameter vector. We can learn \(w\).

By changing the weights, we change the value function.

Linear value function approximation

\[V(s, w) = \sum_i w_i x_i(s)\]

Where $x_i(s)$ is the $i$-th feature of state $s$ and $x(s)$ is the feature vector. The features act as basis functions.
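
A minimal NumPy sketch of the linear value estimate, assuming one-hot (state-aggregation) features; the helper `one_hot_features` and the five-state sizing are made up for illustration:

```python
import numpy as np

# Hypothetical one-hot (state-aggregation) features for a 5-state problem.
def one_hot_features(state, num_states=5):
    x = np.zeros(num_states)
    x[state] = 1.0
    return x

def linear_value(x, w):
    """Linear value estimate: V(s, w) = sum_i w_i * x_i(s)."""
    return np.dot(w, x)

w = np.zeros(5)
print(linear_value(one_hot_features(2), w))  # 0.0 before any learning
```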

Generalization and Discrimination

Generalization: updates to one state affect the value of other states.

Discrimination: the ability to assign different values to different states.
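
A small sketch of both effects under state aggregation (the feature map below is made up for illustration): states 0 and 1 share a feature, so an update made in state 0 also moves the value of state 1 (generalization), while the representation can never give those two states different values (no discrimination between them).

```python
import numpy as np

# Hypothetical aggregation: states 0 and 1 share feature 0, state 2 uses feature 1.
features = {0: np.array([1.0, 0.0]),
            1: np.array([1.0, 0.0]),
            2: np.array([0.0, 1.0])}

w = np.zeros(2)
alpha, target = 0.5, 10.0

# Update toward a target observed in state 0 only.
x = features[0]
w += alpha * (target - w.dot(x)) * x

print(w.dot(features[0]))  # 5.0 -- state 0 moved toward the target
print(w.dot(features[1]))  # 5.0 -- state 1 changed too (generalization)
print(w.dot(features[2]))  # 0.0 -- state 2 unaffected
# States 0 and 1 always share a value, so this representation cannot discriminate them.
```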

Framing value estimation as supervised learning

The function approximator should be compatible with online updates, since the agent sees data incrementally as it interacts with the environment.

Semi-Gradient TD for Policy Evaluation

  • TD update for function approximation

The gradient Monte Carlo update equation:

\[w \leftarrow w + \alpha (G - V(s, w)) \nabla_w V(s, w)\]
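A sketch of this update for the linear case, where $\nabla_w V(s, w) = x(s)$; the `(state, G)` episode format and the `features` callable are assumptions for illustration:

```python
import numpy as np

def gradient_mc_update(w, episode, alpha, features):
    """One gradient Monte Carlo sweep over an episode.

    episode: list of (state, G) pairs, where G is the return observed
    from that state; features(state) returns the feature vector x(s).
    For a linear approximator, grad_w V(s, w) = x(s).
    """
    for state, G in episode:
        x = features(state)
        w = w + alpha * (G - w.dot(x)) * x  # move V(s, w) toward the return G
    return w
```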

If we replace the return $G$ with another estimate of the value, we get the more general update:

\[w \leftarrow w + \alpha (U_t - V(s, w)) \nabla_w V(s, w)\]

Where $U_t$ is the update target. If the target is an unbiased estimate of the true value (as the Monte Carlo return is), then $w$ converges to a local optimum under the usual step-size conditions. The TD target $R_{t+1} + \gamma V(S_{t+1}, w)$ depends on the current weights and is biased; treating it as a constant when taking the gradient gives the semi-gradient TD update.
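
A sketch of one semi-gradient TD(0) step for the linear case, again with an assumed `features` callable and an assumed `(s, r, s_next, done)` transition format; note that the target uses the current weights but is not differentiated:

```python
import numpy as np

def semi_gradient_td0_step(w, transition, alpha, gamma, features):
    """One semi-gradient TD(0) update for linear V(s, w) = w . x(s).

    transition: (s, r, s_next, done). The target U_t bootstraps from the
    current weights but is treated as a constant when taking the gradient,
    which is what makes this a *semi*-gradient method.
    """
    s, r, s_next, done = transition
    x = features(s)
    target = r if done else r + gamma * w.dot(features(s_next))  # U_t
    return w + alpha * (target - w.dot(x)) * x  # grad_w V(s, w) = x(s)
```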