Li, Y., Szepesvári, C., and Schuurmans, D. (2009). Learning exercise policies for American options. In Proc. of the Twelfth International Conference on Artificial Intelligence and Statistics, JMLR: W&CP, volume 5, pages 352–359.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning, 9:293–321.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Cohen and Hirsh (1994), pages 157–163.

Littman, M. L., Sutton, R. S., and Singh, S. P. (2001). Predictive representations of state.
In Dietterich et al. (2001), pages 1555–1561.

Maei, H., Szepesvári, C., Bhatnagar, S., Silver, D., Precup, D., and Sutton, R. (2010a).
Convergent temporal-difference learning with arbitrary smooth function approximation.
In NIPS-22, pages 1204–1212.

Maei, H., Szepesvári, C., Bhatnagar, S., and Sutton, R. (2010b). Toward off-policy learning
control with function approximation. In Wrobel et al. (2010).

Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Baum, E., Hutter, M., and Kitzelmann, E., editors, Proceedings of the Third Conference on Artificial General Intelligence, pages 91–96. Atlantis Press.

Mahadevan, S. (2009). Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, 1(4):403–565.

McAllester, D. A. and Myllymäki, P., editors (2008). Proceedings of the 24th Conference in
Uncertainty in Artificial Intelligence (UAI’08). AUAI Press.

Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning
with function approximation. In Cohen et al. (2008), pages 664–671.

Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal
difference reinforcement learning. Annals of Operations Research, 134(1):215–238.

Mnih, V., Szepesvári, C., and Audibert, J.-Y. (2008). Empirical Bernstein stopping. In
Cohen et al. (2008), pages 672–679.

Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal
of Machine Learning Research, 9:815–857.

Nascimento, J. and Powell, W. (2009). An optimal approximate dynamic programming
algorithm for the lagged asset acquisition problem. Mathematics of Operations Research,
34:210–237.

Nedić, A. and Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear
function approximation. Discrete Event Dynamic Systems, 13(1):79–110.

Neu, G., György, A., and Szepesvári, C. (2010). The online loop-free stochastic shortest-path
problem. In COLT-10.

Ng, A. Y. and Jordan, M. (2000). PEGASUS: A policy search method for large MDPs and
POMDPs. In Boutilier, C. and Goldszmidt, M., editors, Proceedings of the 16th Confer-
ence in Uncertainty in Artificial Intelligence (UAI’00), pages 406–415, San Francisco, CA.
Morgan Kaufmann.

Nouri, A. and Littman, M. (2009). Multi-resolution exploration in continuous spaces. In Koller et al. (2009), pages 1209–1216.

Ormoneit, D. and Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49:161–178.

Ortner, R. (2008). Online regret bounds for Markov decision processes with deterministic
transitions. In Freund, Y., Györfi, L., Turán, G., and Zeugmann, T., editors, Proc. of the
19th International Conference on Algorithmic Learning Theory (ALT 2008), volume 5254
of Lecture Notes in Computer Science, pages 123–137. Springer.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of
linear models, linear value-function approximation, and feature selection for reinforcement
learning. In Cohen et al. (2008), pages 752–759.

Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. (2007). Analyzing feature gener-
ation for value-function approximation. In Ghahramani (2007), pages 737–744.

Perkins, T. and Precup, D. (2003). A convergent form of approximate policy iteration. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 1595–1602, Cambridge, MA, USA. MIT Press.

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7–9):1180–1190.

Peters, J., Vijayakumar, S., and Schaal, S. (2003). Reinforcement learning for humanoid
robotics. In Humanoids2003, Third IEEE-RAS International Conference on Humanoid
Robots, pages 225–230.

Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors (2008). Advances in Neural
Information Processing Systems 20, Cambridge, MA, USA. MIT Press.

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30:838–855.

Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete
Bayesian reinforcement learning. In Cohen and Moore (2006), pages 697–704.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons, New York.

Proper, S. and Tadepalli, P. (2006). Scaling model-based average-reward reinforcement learning for product delivery. In Fürnkranz et al. (2006), pages 735–742.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Rasmussen, C. and Williams, C. (2005). Gaussian Processes for Machine Learning (Adaptive
Computation and Machine Learning). The MIT Press.

Rasmussen, C. E. and Kuss, M. (2004). Gaussian processes in reinforcement learning. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 751–759, Cambridge, MA, USA. MIT Press.

Riedmiller, M. (2005). Neural fitted Q iteration – first experiences with a data efficient
neural reinforcement learning method. In Gama, J., Camacho, R., Brazdil, P., Jorge, A.,
and Torgo, L., editors, Proceedings of the 16th European Conference on Machine Learning
(ECML-05), volume 3720 of Lecture Notes in Computer Science, pages 317–328. Springer.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the
American Mathematical Society, 58:527–535.

Ross, S. and Pineau, J. (2008). Model-based Bayesian reinforcement learning in large struc-
tured domains. In McAllester and Myllymäki (2008), pages 476–483.

Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. (2008). Online planning algorithms for
POMDPs. Journal of Artificial Intelligence Research, 32:663–704.

Rummery, G. A. (1995). Problem solving with reinforcement learning. PhD thesis, Cambridge
University.

Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems.
Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Depart-
ment.

Rusmevichientong, P., Salisbury, J. A., Truss, L. T., Van Roy, B., and Glynn, P. W. (2006).
Opportunities and challenges in using online preference data for vehicle pricing: A case
study at General Motors. Journal of Revenue and Pricing Management, 5(1):45–61.

Rust, J. (1996). Using randomization to break the curse of dimensionality. Econometrica, 65:487–516.

Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the
Bellman residual? The unified oblique projection view. In Wrobel et al. (2010).

Schölkopf, B., Platt, J. C., and Hoffman, T., editors (2007). Advances in Neural Information
Processing Systems 19, Cambridge, MA, USA. MIT Press.

Schraudolph, N. (1999). Local gain adaptation in stochastic gradient descent. In Ninth International Conference on Artificial Neural Networks (ICANN 99), volume 2, pages 569–574.

Shapiro, A. (2003). Monte Carlo sampling methods. In Stochastic Programming, Handbooks in OR & MS, volume 10. North-Holland Publishing Company, Amsterdam.

Shavlik, J. W., editor (1998). Proceedings of the 15th International Conference on Machine
Learning (ICML 1998), San Francisco, CA, USA. Morgan Kaufmann.

Silver, D., Sutton, R. S., and Müller, M. (2007). Reinforcement learning of local shape in
the game of Go. In Veloso, M. M., editor, Proceedings of the 20th International Joint
Conference on Artificial Intelligence (IJCAI 2007), pages 1053–1058.

Simão, H. P., Day, J., George, A. P., Gifford, T., Nienow, J., and Powell, W. B. (2009). An
approximate dynamic programming algorithm for large-scale fleet management: A case
application. Transportation Science, 43(2):178–197.

Singh, S. P. and Bertsekas, D. P. (1997). Reinforcement learning for dynamic channel al-
location in cellular telephone systems. In Mozer, M. C., Jordan, M. I., and Petsche, T.,
editors, NIPS-9: Advances in Neural Information Processing Systems: Proceedings of the
1996 Conference, pages 974–980, Cambridge, MA, USA. MIT Press.

Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state
aggregation. In Tesauro et al. (1995), pages 361–368.

Singh, S. P., Jaakkola, T., Littman, M. L., and Szepesvári, C. (2000). Convergence results for
single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.

Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158.

Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss from approximate optimal-
value functions. Machine Learning, 16(3):227–233.

Solla, S. A., Leen, T. K., and Müller, K. R., editors (1999). Advances in Neural Information
Processing Systems 12, Cambridge, MA, USA. MIT Press.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). PAC model-free
reinforcement learning. In Cohen and Moore (2006), pages 881–888.

Strehl, A. L. and Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. In De Raedt and Wrobel (2005), pages 857–864.

Strehl, A. L. and Littman, M. L. (2008). Online linear regression and its application to
model-based reinforcement learning. In Platt et al. (2008), pages 1417–1424.

Strens, M. (2000). A Bayesian framework for reinforcement learning. In Langley, P., edi-
tor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000),
pages 943–950. Morgan Kaufmann.

Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale
Workshop on Adaptive and Learning Systems, pages 161–166.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Bradford Book. MIT Press.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and
Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Danyluk et al. (2009), pages 993–1000.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient
methods for reinforcement learning with function approximation. In Solla et al. (1999),
pages 1057–1063.

Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs:
A framework for temporal abstraction in reinforcement learning. Artificial Intelligence,
112:181–211.

Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. (2008). Dyna-style
planning with linear function approximation and prioritized sweeping. In McAllester and
Myllymäki (2008), pages 528–536.

Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Koller et al. (2009), pages 1609–1616.

Szepesvári, C. (1997). The asymptotic convergence-rate of Q-learning. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems 10, pages 1064–1070, Cambridge, MA, USA. MIT Press.

Szepesvári, C. (1997). Learning and exploitation do not conflict under minimax optimality. In van Someren, M. and Widmer, G., editors, Machine Learning: ECML’97 (9th European Conf. on Machine Learning, Proceedings), volume 1224 of Lecture Notes in Artificial Intelligence, pages 242–249. Springer, Berlin.

Szepesvári, C. (1998). Static and Dynamic Aspects of Optimal Sequential Decision Making.
PhD thesis, Bolyai Institute of Mathematics, University of Szeged, Szeged, Aradi vrt. tere
1, HUNGARY, 6720.

Szepesvári, C. (2001). Efficient approximate planning in continuous space Markovian decision problems. AI Communications, 13:163–176.

Szepesvári, C. and Littman, M. L. (1999). A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017–2059.

Szepesvári, C. and Smart, W. D. (2004). Interpolation-based Q-learning. In Brodley, C. E., editor, Proceedings of the 21st International Conference on Machine Learning (ICML 2004), pages 791–798. ACM.

Szita, I. and Lőrincz, A. (2008). The many faces of optimism: a unifying approach. In Cohen
et al. (2008), pages 1048–1055.

Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight
exploration complexity bounds. In Wrobel et al. (2010).

Tadić, V. B. (2004). On the almost sure rate of convergence of linear stochastic approximation
algorithms. IEEE Transactions on Information Theory, 50(2):401–409.

Tanner, B. and White, A. (2009). RL-Glue: Language-independent software for
reinforcement-learning experiments. Journal of Machine Learning Research, 10:2133–2136.

Taylor, G. and Parr, R. (2009). Kernelized value function approximation for reinforcement
learning. In Danyluk et al. (2009), pages 1017–1024.

Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219.

Tesauro, G., Touretzky, D., and Leen, T., editors (1995). NIPS-7: Advances in Neural
Information Processing Systems: Proceedings of the 1994 Conference, Cambridge, MA,
USA. MIT Press.

Thrun, S. B. (1992). Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, PA.

Toussaint, M., Charlin, L., and Poupart, P. (2008). Hierarchical POMDP controller opti-
mization by likelihood maximization. In McAllester and Myllymäki (2008), pages 562–570.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.

Tsitsiklis, J. N. and Mannor, S. (2004). The sample complexity of exploration in the multi-
armed bandit problem. Journal of Machine Learning Research, 5:623–648.

Tsitsiklis, J. N. and Van Roy, B. (1996). Feature-based methods for large scale dynamic
programming. Machine Learning, 22:59–94.

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal difference learning with
function approximation. IEEE Transactions on Automatic Control, 42:674–690.

Tsitsiklis, J. N. and Van Roy, B. (1999a). Average cost temporal-difference learning. Automatica, 35(11):1799–1808.

Tsitsiklis, J. N. and Van Roy, B. (1999b). Optimal stopping of Markov processes: Hilbert
space theory, approximation algorithms, and an application to pricing financial derivatives.
IEEE Transactions on Automatic Control, 44:1840–1851.

Tsitsiklis, J. N. and Van Roy, B. (2001). Regression methods for pricing complex American-
style options. IEEE Transactions on Neural Networks, 12:694–703.

Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer Verlag.

Van Roy, B. (2006). Performance loss bounds for approximate value iteration with state
aggregation. Mathematics of Operations Research, 31(2):234–244.

Wahba, G. (2003). Reproducing kernel Hilbert spaces – two brief reviews. In Proceedings of
the 13th IFAC Symposium on System Identification, pages 549–559.

Wang, T., Lizotte, D. J., Bowling, M. H., and Schuurmans, D. (2008). Stable dual dynamic
programming. In Platt et al. (2008).

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, UK.

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4):279–292.

Widrow, B. and Stearns, S. (1985). Adaptive Signal Processing. Prentice Hall, Englewood
Cliffs, NJ.

Williams, R. J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA.

Wrobel, S., Fürnkranz, J., and Joachims, T., editors (2010). Proceedings of the 27th An-
nual International Conference on Machine Learning (ICML 2010), ACM International
Conference Proceeding Series, New York, NY, USA. ACM.

Xu, X., He, H., and Hu, D. (2002). Efficient reinforcement learning using recursive least-
squares methods. Journal of Artificial Intelligence Research, 16:259–292.

Xu, X., Hu, D., and Lu, X. (2007). Kernel-based least squares policy iteration for reinforce-
ment learning. IEEE Transactions on Neural Networks, 18:973–992.

Yu, H. and Bertsekas, D. (2007). Q-learning algorithms for optimal stopping based on least
squares. In Proceedings of the European Control Conference.

Yu, H. and Bertsekas, D. P. (2008). New error bounds for approximations from projected linear equations. Technical Report C-2008-43, Department of Computer Science, University of Helsinki. Revised July 2009.

Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary
reward processes. Mathematics of Operations Research. To appear.

Zhang, W. and Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Perrault, C. R. and Mellish, C. S., editors, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI 95), pages 1114–1120, San Francisco, CA, USA. Morgan Kaufmann.

