TALK:REINFORCEMENT LEARNING

This is the talk page for discussing improvements to the Reinforcement learning article.
This is not a forum for general discussion of the article's subject.

WikiProject Robotics (Mid-importance)

This article is within the scope of WikiProject Robotics, a collaborative effort to improve the coverage of Robotics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.
This article has been marked as needing immediate attention.

Question

Is it $R=\sum \limits _{t}\gamma ^{t}r_{t}$, $R=\sum \limits _{t^{\gamma }}^{t}r_{t}$, $R=\sum \limits _{t\gamma }^{t}r_{t}$ or $R=\sum \limits _{t}^{t}\gamma r_{t}$?

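Answer: It is $R=\sum \limits _{t=0}^{\infty }\gamma ^{t}r_{t}$, i.e. the reward $r_{t}$ received at step $t$ is weighted by the discount factor $\gamma$ raised to the power $t$.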
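
To make the formula in the answer concrete, here is a minimal sketch in Python. It assumes a finite list of rewards (the infinite sum is truncated at the end of the list), and the function name `discounted_return` is hypothetical, chosen only for this illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply-add:
    # G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # 1.75
```

The backward loop is only a convenience; summing `gamma**t * r_t` directly over `t` gives the same value.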

Policies

What exactly is a policy? The Sutton-Barto book is very vague on this point, and so is this article. In both cases the word is used without much explanation.

According to both the book and the article, a policy is a mapping from states to action probabilities. Fine. But this is not elaborated upon. What does a policy look like? I infer that it must be a table (2-D array), indexed by state and action, and containing probabilities, say p_ij for the i-th state and j-th action, each p_ij being a transition probability for the MDP. If so, what is its relation to the values derived from rewards? I.e. where exactly do the probabilities p_ij come from? How does one generate a policy table starting from values?

Sorry if I appear stupid, but I've been studying the book and I find it very difficult to comprehend, even though the maths is very simple (almost too simple). Or maybe it's in there somewhere but I've missed it?

-- 84.9.83.127 09:36, 18 November 2006 (UTC) reply

A policy is indeed a mapping from states to action probabilities, usually written π. So we could write π:S×A→[0,1], saying that π gives a probability of taking a given action a in state s. It doesn't have to be a table, it is just a function. If S and A are discrete then it can be easily written as a table, but if either is continuous then another form is needed. For instance, if S is the interval [0,10], we can set a number of radial basis functions over that interval (say, 11 of them, one at 0, one at 1, one at 2, etc.). Number them r₀, ... r₁₀. Now our policy is a function π:r₀×...×r₁₀×A→[0,1], which we can no longer write as a table.
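
As a rough illustration of the two cases described above, the sketch below shows a policy written as a table for discrete states and as a function of radial basis features for S = [0, 10]. The state and action names, the feature width, the weights and the softmax-over-linear-preferences form are all assumptions made for this example; they are not taken from the book or the article.

```python
import math

# Discrete case: pi_table[state][action] is the probability of that action.
pi_table = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

# Continuous case: 11 radial basis functions centred at 0, 1, ..., 10.
CENTERS = [float(c) for c in range(11)]
ACTIONS = ["left", "right"]


def rbf_features(s, width=1.0):
    """Activation of each basis function for a state s in [0, 10]."""
    return [math.exp(-((s - c) ** 2) / (2 * width ** 2)) for c in CENTERS]


def policy(s, weights):
    """Map a continuous state to action probabilities.

    weights[action] is a list of 11 numbers (hypothetical parameters);
    preferences are linear in the features and passed through a softmax.
    """
    phi = rbf_features(s)
    prefs = {a: sum(w * f for w, f in zip(weights[a], phi)) for a in ACTIONS}
    m = max(prefs.values())  # subtract the max for numerical stability
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


weights = {"left": [0.0] * 11, "right": [0.1 * c for c in CENTERS]}
probs = policy(7.3, weights)
```

In both cases the result is a probability for each action in a given state; only the representation differs.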

The relation of the policy to values depends on the particular solution being used for the RL problem. In an actor-critic architecture, the policy is the set of state-action values along with a function for selecting an action (softmax, for instance, or just choosing the action with the highest value) and the state-action values are updated according to state values and the error signal. In a Q-learning agent, the policy and the values are essentially the same. Well, more correctly the policy is a function of the values given by the action selection mechanism.
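
The idea that the policy is a function of the values, given by the action selection mechanism, can be sketched as follows. This is only an illustration for a single state: the action names, the Q-values and the temperature parameter are hypothetical, and the two selection rules shown are the ones mentioned above (highest value and softmax).

```python
import math
import random


def greedy_policy(q_values):
    """Put probability 1 on the highest-valued action (first one on ties)."""
    best = max(q_values, key=q_values.get)
    return {a: 1.0 if a == best else 0.0 for a in q_values}


def softmax_policy(q_values, temperature=1.0):
    """Turn action values into probabilities; higher values are more likely."""
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def sample_action(probs, rng=random):
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical state-action values for one state.
q = {"left": 1.0, "right": 2.0}
print(greedy_policy(q))   # {'left': 0.0, 'right': 1.0}
print(softmax_policy(q))  # 'right' gets the larger probability
```

Changing the values changes the policy without any separate policy table being stored, which is the sense in which the two are "essentially the same" in a Q-learning agent.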

For the most part, when you're just learning reinforcement learning theory, the use of policies may not be particularly clear. At least, in my own case, I didn't understand the focus on policies until I read Sutton, Precup, and Singh (1999) on options [1], at which point policies became crystal clear.

Hope that answers your question. digfarenough ( talk) 19:25, 4 March 2007 (UTC) reply

Thanks. But your reply raises more questions for me, which I need to try and find answers to! -- 84.9.75.142 22:41, 16 March 2007 (UTC) (formerly 84.9.83.127) reply

Feel free to ask further questions on my talk page. I'm certainly no expert on reinforcement learning, but I've written one paper on it and have written a large number of simulations of RL-related things, so I at least know the basics. digfarenough ( talk) 01:09, 17 March 2007 (UTC) reply

I hope the new version explains what a policy might mean. In fact, it has multiple meanings and is used somewhat inconsistently in the literature. Szepi ( talk) 03:11, 7 September 2010 (UTC) reply

merge with Q learning

There is a short article on Q learning, which could be merged with reinforcement learning. Kpmiyapuram 14:23, 24 April 2007 (UTC) reply

I'd offer that Q Learning be expanded instead. In Q Learning's "See Also" there's Watkins' thesis, which I faintly remember is where Q Learning was introduced; but there's no mention of Watkins or any other researcher in the article. Additionally, Sutton's RL book is listed, which would be a great source to mine for further detail on history and application. -- 59.167.203.115 ( talk) 01:17, 11 January 2008 (UTC) reply

I'd back Q-learning being expanded instead, with a summary in RL. As Q-learning is an active area of research it will grow over time, so it would be short-sighted to merge them - especially as they are already separate. At the start of my research it would have been SO helpful to know what was applicable to RL generally, and what was Q-Learning. -- 217.37.215.53 ( talk) 10:05, 6 March 2008 (UTC) reply

algorithms/concepts not mentioned

active (policy improvement) vs passive (policy evaluation)
Adaptive Dynamic Programming (ADP) —Preceding unsigned comment added by 132.177.27.1 ( talk) 17:23, 1 April 2008 (UTC) reply

Policy improvement and evaluation are included now. However, these methods are rarely if ever called active/passive. The problems addressed by these methods are control learning and prediction learning. These could be included.
ADP refers to approximate dynamic programming, as far as I know. I have added the term to the article. Thanks for the suggestions.

Szepi ( talk) 03:20, 7 September 2010 (UTC) reply

Economics?

Where's all the stuff about learning in games? It would be great if someone could incorporate this. Jeremy Tobacman 23:40, 1 August 2007 (UTC) reply

It's certainly relevant, but you may have to add it yourself if you're familiar with the subject. I've come across that aspect a few times but never really looked into it, though I have seen quite a few papers on interacting multiagent systems from the game and economic perspectives (always, I think, the agents were working against each other to try to maximize profit or win the game, etc.). So add what you know, and others may be able to clean up any incorrect claims. digfarenough ( talk) 16:31, 2 August 2007 (UTC) reply

Psychology

This article starts with a reference to 'Reinforcement learning' in psychology. Isn't there an article about that? -- Rinconsoleao 13:43, 27 September 2007 (UTC) reply

Found it... -- Rinconsoleao 13:45, 27 September 2007 (UTC) reply

Literature

I feel the literature referenced by Csaba Szepesvári was a useful addition and perhaps should not have been removed. Even though he referenced a book written by himself, he is a well-known and respected researcher in reinforcement learning and this book is a useful overview of the field. I do not know of many good recent alternatives, so I would favor reverting MrOllie's revision. However, rather than immediately doing so, I thought it might be better to start a discussion.

What literature would be indispensable? (In my opinion, in any case the books by Sutton & Barto and by Bertsekas & Tsitsiklis, although most of the other referenced work at present also looks fine.)
What literature might be removed? (For instance, I haven't read the latest addition by Tokic; is this a relevant enough paper to include?)
Is there any important work missing? (As mentioned, I would favor the return of a reference to Csaba Szepesvári's book.) —Preceding unsigned comment added by 192.16.201.233 ( talk) 12:04, 20 September 2010 (UTC) reply

Attention needed

Is there any difference between "inverse" and "apprenticeship" learning? From the descriptions, they appear to be basically the same.
Refs - needs inline refs
Check content for missing statements
Assess on B scale
Broken link: A Short Introduction To Some Reinforcement Learning Algorithms — Preceding unsigned comment added by 192.76.175.3 ( talk) 01:11, 19 March 2016 (UTC) reply

Chaosdruid ( talk) 05:03, 6 March 2011 (UTC) reply

small and large mdps

'The theory of small mdps is [..] mature; [..] the theory of large mdps needs more work.'

What does that even mean? Theory is theory; if you understand an MDP with 10 states, then you understand one with ten million states. Although standard algorithms may run too slowly, I can't see the conceptual difference between ten and ten million as far as theory is concerned.

Does the author mean either: a) small equals finite and large equals countably or uncountably infinite, or b) approximation methods (in themselves only useful when direct methods fail) are not as well understood?

— Preceding unsigned comment added by 157.193.140.25 ( talk) 09:21, 26 August 2011 (UTC) reply

I only use small in the context of finite MDPs. "Theory of small, finite MDPs" means theoretical results concerning algorithms whose complexity scales at least linearly with the size of the state-action space. I think this is intuitive, but I would welcome any alternative suggestions. I realize this could be misunderstood (someone might think that small means 10s or 100s, though I did not think this would be likely to happen).

Szepi ( talk) 15:23, 16 September 2011 (UTC) reply

category needed

Can someone make a sub-category for machine learning maybe? -- 77.4.90.71 ( talk) 16:35, 1 November 2011 (UTC) reply

The whole article is a subcategory of machine learning. Perhaps you seek practical applications or tools? Or I'm just not sure what you mean. Krehel ( talk) 00:11, 24 September 2018 (UTC) reply

The comparison of algorithms table

The table comparing algorithms is just plain wrong:

Monte Carlo is not an algorithm at all, but a family of algorithms for all kinds of problems (including RL). For RL many different Monte Carlo algorithms exist. The description is even more misleading: "Every visit to Monte Carlo". Every-visit is only one value of one option in Monte-Carlo RL methods (the other option being First-visit). Not picking Every-visit doesn't make the method less Monte-Carlo, it just changes the update operator for the value function.
The Policy column is not actually about the type of policy, but about how the policy is optimised (on- or off-policy).
The Operator column does not contain operators: Q-value and Advantage are types of value functions. The operators used on these value functions are what defines the method. The book by Sutton and Barto defines these operators using backup diagrams. Note that Monte-Carlo methods typically also maintain a value function (such as Q-values), they are just updated differently from methods such as Q-learning, which use Bellman backups rather than Monte Carlo estimators of the returns.
A number of relevant properties are omitted.
The table seems heavily biased towards recent neural-network based methods (only the simpler classical methods are represented, giving for example the wrong impression that no classical methods existed that could handle continuous state or action spaces).

I'm not quite sure how to reorganise this table without it becoming monstrous in size. In its present state it is however highly confusing and misleading. I'd say it would be better to remove it than to keep it as it currently is. LordDamorcro ( talk) 18:39, 5 July 2021 (UTC) reply
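
To illustrate the distinction drawn above between first-visit and every-visit Monte Carlo, and between a Monte Carlo estimate of the return and a Bellman-style backup as used in Q-learning, here is a minimal tabular sketch. The episode format, function names and step-size values are assumptions made for this example and do not come from the article's table.

```python
from collections import defaultdict


def mc_value_estimates(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging sampled returns.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    reward is the reward received after leaving that state (assumed format).
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following every time step, working backwards.
        g = 0.0
        g_at = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            g_at[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(state)
            returns[state].append(g_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """One Q-learning backup: bootstrap from max_b Q(s_next, b)
    instead of waiting for a sampled return."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


# State "A" is visited twice in this episode, so the two options differ.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
print(mc_value_estimates([episode], first_visit=True))   # {'A': 3.0, 'B': 2.0}
print(mc_value_estimates([episode], first_visit=False))  # {'A': 2.5, 'B': 2.0}
```

Both Monte Carlo variants maintain a value function; the first-visit/every-visit switch only changes which sampled returns are averaged, while the Q-learning update replaces the sampled return with a bootstrapped target.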

A Commons file used on this page or its Wikidata item has been nominated for deletion

The following Wikimedia Commons file used on this page or its Wikidata item has been nominated for deletion:

DNC training recall task.gif

Participate in the deletion discussion at the nomination page. — Community Tech bot ( talk) 20:25, 11 September 2021 (UTC) reply

General level of accessibility.

Wikipedia is not for the purpose only of informing persons already expert in the subject matter, nor is it a forum for authors to demonstrate their knowledge or show off their technical grasp to others in their field. Articles in Wikipedia are supposed to EXPLAIN things. This means breaking down jargon. It means setting out topics in a manner that makes them approachable for people not already well read in the field.

Too many Wikipedia articles, including this one, are written by people incapable of understanding this extremely obvious perspective. The purpose is not to compose some form of canonical description of the field in the most compact, concise or dense language possible. It is the opposite. Many authors here are academics, but it seems clear many would struggle successfully to teach a class anything at all. 49.180.205.46 ( talk) 10:20, 9 September 2022 (UTC) reply

research project

related literature about effect of academic pressure 209.35.172.23 ( talk) 07:02, 20 April 2023 (UTC) reply

A section on Applications

It would be good to have a section on the applications of RL on this page, e.g. robotics, self-driving cars, gaming (AlphaGo), etc. I haven't done any major writing on wiki and am not sure if I can just add one. Amitkannan ( talk) 07:16, 26 September 2023 (UTC) reply

Please don't. Applications sections are spam magnets and generally fill up with advertising and self promotion in short order. MrOllie ( talk) 12:24, 26 September 2023 (UTC) reply
