This article is rated Stub-class on Wikipedia's content assessment scale.
When did this algorithm get invented? the day of the of the pear 19:46, 7 May 2007 (UTC)
For updates, SARSA uses the next action actually chosen, not the best next action, so the update reflects the value of the last state/action pair under the current policy. If you use the best next action instead, you end up with Watkins' Q-Learning, which SARSA was designed as an alternative to. By updating with the value of the best next action (Watkins' Q-Learning), the update can over-estimate values, because the control method will not pick that action every time (it has to balance exploration and exploitation). A comparison between Q-Learning and SARSA, perhaps on the Cliff World example from Rich Sutton's 'Reinforcement Learning: An Introduction' (1998), might help clarify the differences and the resulting behaviour. -- 131.217.6.6 08:17, 29 May 2007 (UTC)
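To make "the control method" above concrete, here is a minimal epsilon-greedy selection sketch in Python; the dictionary-based Q table and all names here are illustrative assumptions, not taken from any particular source:

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore by picking a uniformly random action;
    # otherwise exploit by picking the action with the highest estimated value.
    # Q is assumed to map (state, action) pairs to value estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Because this policy occasionally takes non-greedy actions, SARSA's on-policy target matches what the agent actually does, while Q-Learning's max-based target does not.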
This is the algorithm presented in Sutton's book:

Q-Learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

SARSA: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
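For concreteness, a minimal Python sketch of the two updates side by side, assuming a tabular Q stored as a dict from (state, action) to value (all names illustrative):

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Off-policy: bootstrap from the best action in the next state,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    # On-policy: bootstrap from the action actually chosen in the next state
    # under the current (e.g. epsilon-greedy) policy.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))

The only difference is the bootstrap term: the max over actions (Q-Learning) versus the sampled next action (SARSA).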
Uses "backpropagation"? updates previous Q entry with future reward? Dspattison ( talk) 19:20, 19 March 2008 (UTC)
Is the algorithm given correct? Should it not be R(t) rather than R(t+1)? I've looked at [1], and that seems to support what Thrun & Norvig teach in their Stanford ai-class. wheeliebin (talk) 04:58, 12 November 2011 (UTC)
There are different definitions in use. Sutton [2] uses "R(t+1)" for the immediate reward received when choosing action "a(t)" in state "s(t)", while Norvig uses "R(t)". It makes no real difference, but mentioning the different conventions might be a good idea. Bomberzocker (talk) 19:36, 6 February 2018 (UTC)
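A sketch in LaTeX of the two reward-indexing conventions mentioned above (same update rule; only the subscript of the reward differs):

% Sutton & Barto: the reward received after taking a_t in s_t is R_{t+1}
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Russell & Norvig: the same reward is written R_t
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]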