Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning, 9:293–321.
Littman, M. L., Sutton, R. S., and Singh, S. P. (2001). Predictive representations of state.
In Dietterich et al. (2001), pages 1555–1561.
Maei, H., Szepesvári, C., Bhatnagar, S., Silver, D., Precup, D., and Sutton, R. (2010a).
Convergent temporal-difference learning with arbitrary smooth function approximation.
In NIPS-22, pages 1204–1212.
Maei, H., Szepesvári, C., Bhatnagar, S., and Sutton, R. (2010b). Toward off-policy learning
control with function approximation. In Wrobel et al. (2010).
Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-
difference prediction learning with eligibility traces. In Baum, E., Hutter, M., and Kitzel-
mann, E., editors, Proceedings of the Third Conference on Artificial General Intelligence,
pages 91–96. Atlantis Press.
McAllester, D. A. and Myllymäki, P., editors (2008). Proceedings of the 24th Conference in
Uncertainty in Artificial Intelligence (UAI’08). AUAI Press.
Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning
with function approximation. In Cohen et al. (2008), pages 664–671.
Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal
difference reinforcement learning. Annals of Operations Research, 134(1):215–238.
Mnih, V., Szepesvári, C., and Audibert, J.-Y. (2008). Empirical Bernstein stopping. In
Cohen et al. (2008), pages 672–679.
Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal
of Machine Learning Research, 9:815–857.
Nascimento, J. and Powell, W. (2009). An optimal approximate dynamic programming
algorithm for the lagged asset acquisition problem. Mathematics of Operations Research,
34:210–237.
Nedić, A. and Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear
function approximation. Discrete Event Dynamic Systems, 13(1):79–110.
Neu, G., György, A., and Szepesvári, C. (2010). The online loop-free stochastic shortest-path
problem. In COLT-10.
Ng, A. Y. and Jordan, M. (2000). PEGASUS: A policy search method for large MDPs and
POMDPs. In Boutilier, C. and Goldszmidt, M., editors, Proceedings of the 16th Confer-
ence in Uncertainty in Artificial Intelligence (UAI’00), pages 406–415, San Francisco CA.
Morgan Kaufmann.
Ortner, R. (2008). Online regret bounds for Markov decision processes with deterministic
transitions. In Freund, Y., Györfi, L., Turán, G., and Zeugmann, T., editors, Proc. of the
19th International Conference on Algorithmic Learning Theory (ALT 2008), volume 5254
of Lecture Notes in Computer Science, pages 123–137. Springer.
Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of
linear models, linear value-function approximation, and feature selection for reinforcement
learning. In Cohen et al. (2008), pages 752–759.
Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. (2007). Analyzing feature gener-
ation for value-function approximation. In Ghahramani (2007), pages 737–744.
Peters, J., Vijayakumar, S., and Schaal, S. (2003). Reinforcement learning for humanoid
robotics. In Humanoids2003, Third IEEE-RAS International Conference on Humanoid
Robots, pages 225–230.
Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors (2008). Advances in Neural
Information Processing Systems 20, Cambridge, MA, USA. MIT Press.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete
Bayesian reinforcement learning. In Cohen and Moore (2006), pages 697–704.
Rasmussen, C. and Williams, C. (2005). Gaussian Processes for Machine Learning (Adaptive
Computation and Machine Learning). The MIT Press.
Riedmiller, M. (2005). Neural fitted Q iteration – first experiences with a data efficient
neural reinforcement learning method. In Gama, J., Camacho, R., Brazdil, P., Jorge, A.,
and Torgo, L., editors, Proceedings of the 16th European Conference on Machine Learning
(ECML-05), volume 3720 of Lecture Notes in Computer Science, pages 317–328. Springer.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the
American Mathematical Society, 58:527–535.
Ross, S. and Pineau, J. (2008). Model-based Bayesian reinforcement learning in large struc-
tured domains. In McAllester and Myllymäki (2008), pages 476–483.
Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. (2008). Online planning algorithms for
POMDPs. Journal of Artificial Intelligence Research, 32:663–704.
Rummery, G. A. (1995). Problem solving with reinforcement learning. PhD thesis, Cambridge
University.
Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems.
Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Depart-
ment.
Rusmevichientong, P., Salisbury, J. A., Truss, L. T., Van Roy, B., and Glynn, P. W. (2006).
Opportunities and challenges in using online preference data for vehicle pricing: A case
study at General Motors. Journal of Revenue and Pricing Management, 5(1):45–61.
Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the
Bellman residual? The unified oblique projection view. In Wrobel et al. (2010).
Schölkopf, B., Platt, J. C., and Hoffman, T., editors (2007). Advances in Neural Information
Processing Systems 19, Cambridge, MA, USA. MIT Press.
Shavlik, J. W., editor (1998). Proceedings of the 15th International Conference on Machine
Learning (ICML 1998), San Francisco, CA, USA. Morgan Kaufmann.
Silver, D., Sutton, R. S., and Müller, M. (2007). Reinforcement learning of local shape in
the game of Go. In Veloso, M. M., editor, Proceedings of the 20th International Joint
Conference on Artificial Intelligence (IJCAI 2007), pages 1053–1058.
Simão, H. P., Day, J., George, A. P., Gifford, T., Nienow, J., and Powell, W. B. (2009). An
approximate dynamic programming algorithm for large-scale fleet management: A case
application. Transportation Science, 43(2):178–197.
Singh, S. P. and Bertsekas, D. P. (1997). Reinforcement learning for dynamic channel al-
location in cellular telephone systems. In Mozer, M. C., Jordan, M. I., and Petsche, T.,
editors, NIPS-9: Advances in Neural Information Processing Systems: Proceedings of the
1996 Conference, pages 974–980, Cambridge, MA, USA. MIT Press.
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state
aggregation. In Tesauro et al. (1995), pages 361–368.
Singh, S. P., Jaakkola, T., Littman, M. L., and Szepesvári, C. (2000). Convergence results for
single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.
Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss from approximate optimal-
value functions. Machine Learning, 16(3):227–233.
Solla, S. A., Leen, T. K., and Müller, K. R., editors (1999). Advances in Neural Information
Processing Systems 12, Cambridge, MA, USA. MIT Press.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). PAC model-free
reinforcement learning. In Cohen and Moore (2006), pages 881–888.
Strehl, A. L. and Littman, M. L. (2008). Online linear regression and its application to
model-based reinforcement learning. In Platt et al. (2008), pages 1417–1424.
Strens, M. (2000). A Bayesian framework for reinforcement learning. In Langley, P., edi-
tor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000),
pages 943–950. Morgan Kaufmann.
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale
Workshop on Adaptive and Learning Systems, pages 161–166.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and
Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning
with linear function approximation. In Danyluk et al. (2009), pages 993–1000.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient
methods for reinforcement learning with function approximation. In Solla et al. (1999),
pages 1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs:
A framework for temporal abstraction in reinforcement learning. Artificial Intelligence,
112:181–211.
Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. (2008). Dyna-style
planning with linear function approximation and prioritized sweeping. In McAllester and
Myllymäki (2008), pages 528–536.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) temporal-
difference algorithm for off-policy learning with linear function approximation. In Koller
et al. (2009), pages 1609–1616.
Szepesvári, C. (1997). Learning and exploitation do not conflict under minimax optimality. In
Someren, M. and Widmer, G., editors, Machine Learning: ECML’97 (9th European Conf.
on Machine Learning, Proceedings), volume 1224 of Lecture Notes in Artificial Intelligence,
pages 242–249. Springer, Berlin.
Szepesvári, C. (1998). Static and Dynamic Aspects of Optimal Sequential Decision Making.
PhD thesis, Bolyai Institute of Mathematics, University of Szeged, Szeged, Aradi vrt. tere
1, HUNGARY, 6720.
Szita, I. and Lőrincz, A. (2008). The many faces of optimism: a unifying approach. In Cohen
et al. (2008), pages 1048–1055.
Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight
exploration complexity bounds. In Wrobel et al. (2010).
Tadić, V. B. (2004). On the almost sure rate of convergence of linear stochastic approximation
algorithms. IEEE Transactions on Information Theory, 50(2):401–409.
Tanner, B. and White, A. (2009). RL-Glue: Language-independent software for
reinforcement-learning experiments. Journal of Machine Learning Research, 10:2133–2136.
Taylor, G. and Parr, R. (2009). Kernelized value function approximation for reinforcement
learning. In Danyluk et al. (2009), pages 1017–1024.
Tesauro, G., Touretzky, D., and Leen, T., editors (1995). NIPS-7: Advances in Neural
Information Processing Systems: Proceedings of the 1994 Conference, Cambridge, MA,
USA. MIT Press.
Toussaint, M., Charlin, L., and Poupart, P. (2008). Hierarchical POMDP controller opti-
mization by likelihood maximization. In McAllester and Myllymäki (2008), pages 562–570.
Tsitsiklis, J. N. and Mannor, S. (2004). The sample complexity of exploration in the multi-
armed bandit problem. Journal of Machine Learning Research, 5:623–648.
Tsitsiklis, J. N. and Van Roy, B. (1996). Feature-based methods for large scale dynamic
programming. Machine Learning, 22:59–94.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal difference learning with
function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Tsitsiklis, J. N. and Van Roy, B. (1999a). Average cost temporal-difference learning. Auto-
matica, 35(11):1799–1808.
Tsitsiklis, J. N. and Van Roy, B. (1999b). Optimal stopping of Markov processes: Hilbert
space theory, approximation algorithms, and an application to pricing financial derivatives.
IEEE Transactions on Automatic Control, 44:1840–1851.
Tsitsiklis, J. N. and Van Roy, B. (2001). Regression methods for pricing complex American-
style options. IEEE Transactions on Neural Networks, 12:694–703.
Van Roy, B. (2006). Performance loss bounds for approximate value iteration with state
aggregation. Mathematics of Operations Research, 31(2):234–244.
Wahba, G. (2003). Reproducing kernel Hilbert spaces – two brief reviews. In Proceedings of
the 13th IFAC Symposium on System Identification, pages 549–559.
Wang, T., Lizotte, D. J., Bowling, M. H., and Schuurmans, D. (2008). Stable dual dynamic
programming. In Platt et al. (2008).
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, UK.
Widrow, B. and Stearns, S. (1985). Adaptive Signal Processing. Prentice Hall, Englewood
Cliffs, NJ.
Wrobel, S., Fürnkranz, J., and Joachims, T., editors (2010). Proceedings of the 27th An-
nual International Conference on Machine Learning (ICML 2010), ACM International
Conference Proceeding Series, New York, NY, USA. ACM.
Xu, X., He, H., and Hu, D. (2002). Efficient reinforcement learning using recursive least-
squares methods. Journal of Artificial Intelligence Research, 16:259–292.
Xu, X., Hu, D., and Lu, X. (2007). Kernel-based least squares policy iteration for reinforce-
ment learning. IEEE Transactions on Neural Networks, 18:973–992.
Yu, H. and Bertsekas, D. (2007). Q-learning algorithms for optimal stopping based on least
squares. In Proceedings of the European Control Conference.
Yu, J. and Bertsekas, D. P. (2008). New error bounds for approximations from projected lin-
ear equations. Technical Report C-2008-43, Department of Computer Science, University
of Helsinki. Revised July 2009.
Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary
reward processes. Mathematics of Operations Research, to appear.
Zhang, W. and Dietterich, T. G. (1995). A reinforcement learning approach to job-shop
scheduling. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95), pages 1114–1120, San Francisco, CA, USA. Morgan Kaufmann.