- Abstract:
- This paper sets out the concept of consistent exploration of observation-action pairs. We present a new temporal difference algorithm, CEQ(lambda), based on this concept, and demonstrate, using a randomly generated set of partially observable Markov decision processes (POMDPs), that it outperforms SARSA(lambda). This result should generalise to any POMDP where satisficing policies that map observations to actions exist. We also set out reasons for preferring CEQ(lambda) over an alternative Monte-Carlo style algorithm, MCESP, when working in the robotics domain.
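For context, the SARSA(lambda) baseline the abstract compares against can be sketched as a tabular temporal-difference update with eligibility traces, applied to observation-action pairs (in a POMDP the learner sees observations, not underlying states). The toy corridor environment, parameter values, and function names below are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon, rng):
    """Behaviour policy over observations; ties broken at random."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(obs, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(obs, a)] == best])

def sarsa_lambda(step, reset, actions, episodes=500, alpha=0.1,
                 gamma=0.95, lam=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA(lambda) with accumulating eligibility traces,
    keyed on (observation, action) pairs."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)          # eligibility traces, reset per episode
        obs = reset()
        act = epsilon_greedy(Q, obs, actions, epsilon, rng)
        done = False
        while not done:
            obs2, reward, done = step(obs, act)
            act2 = (None if done
                    else epsilon_greedy(Q, obs2, actions, epsilon, rng))
            # TD error: bootstrap from the *next* chosen action (on-policy)
            target = reward + (0.0 if done else gamma * Q[(obs2, act2)])
            delta = target - Q[(obs, act)]
            e[(obs, act)] += 1.0        # accumulating trace
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam   # decay all traces
            obs, act = obs2, act2
    return Q

# Hypothetical 4-cell corridor: observations 0..2, goal at 3,
# actions 0 = left, 1 = right, reward 1.0 on reaching the goal.
def reset():
    return 0

def step(obs, act):
    nxt = max(0, obs - 1) if act == 0 else obs + 1
    if nxt == 3:
        return nxt, 1.0, True
    return nxt, 0.0, False

Q = sarsa_lambda(step, reset, actions=[0, 1])
greedy_at_start = max([0, 1], key=lambda a: Q[(0, a)])
```

CEQ(lambda), as described in the paper, modifies how exploration interacts with updates like the one above so that exploratory actions are taken consistently for a given observation; the sketch here shows only the standard SARSA(lambda) machinery it builds on.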
- Link to paper: http://homepages.inf.ed.ac.uk/pcrook/publications/alag2007-paper.pdf
- BibTeX:

  @InProceedings{EDI-INF-RR-1259,
    author    = {Paul Crook and Gillian Hayes},
    title     = {Consistent exploration improves convergence of reinforcement learning on POMDPs},
    booktitle = {AAMAS 2007 Workshop on Adaptive and Learning Agents (ALAg-07)},
    year      = {2007},
    month     = {May},
    url       = {http://homepages.inf.ed.ac.uk/pcrook/publications/alag2007-paper.pdf},
  }