Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning

This document provides an overview of the book "Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning" by Abhijit Gosavi. The book covers various parametric optimization techniques and reinforcement learning methods that can be applied to problems modeled via simulation. Chapter 1 introduces why the book was written and how it is organized. Subsequent chapters cover background topics, basic simulation concepts, response surface methodology, parametric optimization algorithms, dynamic programming, and reinforcement learning. The book provides details on modeling simulation problems and algorithms for optimizing systems modeled via simulation.


SIMULATION-BASED OPTIMIZATION:

Parametric Optimization Techniques


and Reinforcement Learning

ABHIJIT GOSAVI
Department of Industrial Engineering
The State University of New York, Buffalo

Kluwer Academic Publishers


Boston/Dordrecht/London
Contents

List of Figures xvii


List of Tables xxi
Acknowledgments xxiii
Preface xxv
1. BACKGROUND 1
1.1 Why this book was written 1
1.2 Simulation-based optimization and modern times 3
1.3 How this book is organized 7
2. NOTATION 9
2.1 Chapter Overview 9
2.2 Some Basic Conventions 9
2.3 Vector notation 9
2.3.1 Max norm 10
2.3.2 Euclidean norm 10
2.4 Notation for matrices 10
2.5 Notation for n-tuples 11
2.6 Notation for sets 11
2.7 Notation for Sequences 11
2.8 Notation for Transformations 11
2.9 Max, min, and arg max 12
2.10 Acronyms and Abbreviations 12
2.11 Concluding Remarks 12
3. PROBABILITY THEORY: A REFRESHER 15
3.1 Overview of this chapter 15
3.1.1 Random variables 15
3.2 Laws of Probability 16

3.2.1 Addition Law 17


3.2.2 Multiplication Law 18
3.3 Probability Distributions 21
3.3.1 Discrete random variables 21
3.3.2 Continuous random variables 22
3.4 Expected value of a random variable 23
3.5 Standard deviation of a random variable 25
3.6 Limit Theorems 27
3.7 Review Questions 28
4. BASIC CONCEPTS UNDERLYING SIMULATION 29
4.1 Chapter Overview 29
4.2 Introduction 29
4.3 Models 30
4.4 Simulation Modeling of Random Systems 32
4.4.1 Random Number Generation 33
4.4.1.1 Uniformly Distributed Random Numbers 33
4.4.1.2 Other Distributions 36
4.4.2 Re-creation of events using random numbers 37
4.4.3 Independence of samples collected 42
4.4.4 Terminating and non-terminating systems 43
4.5 Concluding Remarks 44
4.6 Historical Remarks 44
4.7 Review Questions 45
5. SIMULATION OPTIMIZATION: AN OVERVIEW 47
5.1 Chapter Overview 47
5.2 Stochastic parametric optimization 47
5.2.1 The role of simulation in parametric optimization 50
5.3 Stochastic control optimization 51
5.3.1 The role of simulation in control optimization 53
5.4 Historical Remarks 54
5.5 Review Questions 54
6. RESPONSE SURFACES AND NEURAL NETS 57
6.1 Chapter Overview 57
6.2 RSM: An Overview 58
6.3 RSM: Details 59
6.3.1 Sampling 60
6.3.2 Function Fitting 60
6.3.2.1 Fitting a straight line 60
6.3.2.2 Fitting a plane 63
6.3.2.3 Fitting hyper-planes 64

6.3.2.4 Piecewise regression 65


6.3.2.5 Fitting non-linear forms 66
6.3.3 How good is the metamodel? 67
6.3.4 Optimization with a metamodel 68
6.4 Neuro-Response Surface Methods 69
6.4.1 Linear Neural Networks 69
6.4.1.1 Steps in the Widrow-Hoff Algorithm 72
6.4.1.2 Incremental Widrow-Hoff 72
6.4.1.3 Pictorial Representation of a Neuron 73
6.4.2 Non-linear Neural Networks 73
6.4.2.1 The Basic Structure of a Non-Linear Neural Network 75
6.4.2.2 The Backprop Algorithm 78
6.4.2.3 Deriving the backprop algorithm 79
6.4.2.4 Backprop with a Bias Node 82
6.4.2.5 Deriving the algorithm for the bias weight 82
6.4.2.6 Steps in Backprop 84
6.4.2.7 Incremental Backprop 86
6.4.2.8 Example D 88
6.4.2.9 Validation of the neural network 89
6.4.2.10 Optimization with a neuro-RSM model 90
6.5 Concluding Remarks 90
6.6 Bibliographic Remarks 90
6.7 Review Questions 91
7. PARAMETRIC OPTIMIZATION 93
7.1 Chapter Overview 93
7.2 Continuous Optimization 94
7.2.1 Gradient Descent 94
7.2.1.1 Simulation and Gradient Descent 98
7.2.1.2 Simultaneous Perturbation 101
7.2.2 Non-derivative methods 104
7.3 Discrete Optimization 106
7.3.1 Ranking and Selection 107
7.3.1.1 Steps in the Rinott method 108
7.3.1.2 Steps in the Kim-Nelson method 109
7.3.2 Meta-heuristics 110
7.3.2.1 Simulated Annealing 111
7.3.2.2 The Genetic Algorithm 117
7.3.2.3 Tabu Search 119
7.3.2.4 A Learning Automata Search Technique 123
7.3.2.5 Other Meta-Heuristics 128
7.3.2.6 Ranking and selection & meta-heuristics 128
7.4 Hybrid solution spaces 128
7.5 Concluding Remarks 129

7.6 Bibliographic Remarks 129


7.7 Review Questions 131
8. DYNAMIC PROGRAMMING 133
8.1 Chapter Overview 133
8.2 Stochastic processes 133
8.3 Markov processes, Markov chains and semi-Markov processes 136
8.3.1 Markov chains 139
8.3.1.1 n-step transition probabilities 140
8.3.2 Regular Markov chains 142
8.3.2.1 Limiting probabilities 143
8.3.3 Ergodicity 145
8.3.4 Semi-Markov processes 146
8.4 Markov decision problems 148
8.4.1 Elements of the Markov decision framework 151
8.5 How to solve an MDP using exhaustive enumeration 157
8.5.1 Example A 158
8.5.2 Drawbacks of exhaustive enumeration 161
8.6 Dynamic programming for average reward 161
8.6.1 Average reward Bellman equation for a policy 162
8.6.2 Policy iteration for average reward MDPs 163
8.6.2.1 Steps 163
8.6.3 Value iteration and its variants: average reward MDPs 165
8.6.4 Value iteration for average reward MDPs 165
8.6.4.1 Steps 166
8.6.5 Relative value iteration 168
8.6.5.1 Steps 168
8.6.6 A general expression for the average reward of an MDP 169
8.7 Dynamic programming and discounted reward 170
8.7.1 Discounted reward 171
8.7.2 Discounted reward MDP 171
8.7.3 Bellman equation for a policy: discounted reward 173
8.7.4 Policy iteration for discounted reward MDPs 173
8.7.4.1 Steps 174
8.7.5 Value iteration for discounted reward MDPs 175
8.7.5.1 Steps 176
8.7.6 Getting value iteration to converge faster 177
8.7.6.1 Gauss-Seidel value iteration 178
8.7.6.2 Relative value iteration for discounted reward 179
8.7.6.3 Span seminorm termination 180
8.8 The Bellman equation: An intuitive perspective 181
8.9 Semi-Markov decision problems 182
8.9.1 The natural process and the decision-making process 184
8.9.2 Average reward SMDPs 186

8.9.2.1 Exhaustive enumeration for average reward SMDPs 186


8.9.2.2 Example B 187
8.9.2.3 Policy iteration for average reward SMDPs 189
8.9.2.4 Value iteration for average reward SMDPs 191
8.9.2.5 Counterexample for regular value iteration 192
8.9.2.6 Uniformization for SMDPs 193
8.9.2.7 Value iteration based on the Bellman equation 194
8.9.2.8 Extension to random time SMDPs 194
8.9.3 Discounted reward SMDPs 194
8.9.3.1 Policy iteration for discounted SMDPs 195
8.9.3.2 Value iteration for discounted reward SMDPs 195
8.9.3.3 Extension to random time SMDPs 196
8.9.3.4 Uniformization 196
8.10 Modified policy iteration 197
8.10.1 Steps for discounted reward MDPs 198
8.10.2 Steps for average reward MDPs 199
8.11 Miscellaneous topics related to MDPs and SMDPs 200
8.11.1 A parametric-optimization approach to solving MDPs 200
8.11.2 The MDP as a special case of a stochastic game 201
8.11.3 Finite Horizon MDPs 203
8.11.4 The approximating sequence method 206
8.12 Conclusions 207
8.13 Bibliographic Remarks 207
8.14 Review Questions 208
9. REINFORCEMENT LEARNING 211
9.1 Chapter Overview 211
9.2 The Need for Reinforcement Learning 212
9.3 Generating the TPM through straightforward counting 214
9.4 Reinforcement Learning: Fundamentals 215
9.4.1 Q-factors 218
9.4.1.1 A Q-factor version of value iteration 219
9.4.2 The Robbins-Monro algorithm 220
9.4.3 The Robbins-Monro algorithm and Q-factors 221
9.4.4 Simulators, asynchronous implementations, and step sizes 222
9.5 Discounted reward Reinforcement Learning 224
9.5.1 Discounted reward RL based on value iteration 224
9.5.1.1 Steps in Q-Learning 225
9.5.1.2 Reinforcement Learning: A "Learning" Perspective 227
9.5.1.3 On-line and Off-line 229
9.5.1.4 Exploration 230
9.5.1.5 A worked-out example for Q-Learning 231
9.5.2 Discounted reward RL based on policy iteration 234
9.5.2.1 Q-factor version of regular policy iteration 235
9.5.2.2 Steps in the Q-factor version of regular policy iteration 235
9.5.2.3 Steps in Q-P-Learning 237
9.6 Average reward Reinforcement Learning 238
9.6.1 Discounted RL for average reward MDPs 238
9.6.2 Average reward RL based on value iteration 238
9.6.2.1 Steps in Relative Q-Learning 239
9.6.2.2 Calculating the average reward of a policy in a simulator 240
9.6.3 Other algorithms for average reward MDPs 241
9.6.3.1 Steps in R-Learning 241
9.6.3.2 Steps in SMART for MDPs 242
9.6.4 An RL algorithm based on policy iteration 244
9.6.4.1 Steps in Q-P-Learning for average reward 244
9.7 Semi-Markov decision problems and RL 245
9.7.1 Discounted Reward 245
9.7.1.1 Steps in Q-Learning for discounted reward DTMDPs 245
9.7.1.2 Steps in Q-P-Learning for discounted reward DTMDPs 246
9.7.2 Average reward 247
9.7.2.1 Steps in SMART for SMDPs 248
9.7.2.2 Steps in Q-P-Learning for SMDPs 250
9.8 RL Algorithms and their DP counterparts 252
9.9 Actor-Critic Algorithms 252
9.10 Model-building algorithms 253
9.10.1 H-Learning for discounted reward 254
9.10.2 H-Learning for average reward 255
9.10.3 Model-building Q-Learning 257
9.10.4 Model-building relative Q-Learning 258
9.11 Finite Horizon Problems 259
9.12 Function approximation 260
9.12.1 Function approximation with state aggregation 260
9.12.2 Function approximation with function fitting 262
9.12.2.1 Difficulties 262
9.12.2.2 Steps in Q-Learning coupled with neural networks 264
9.12.3 Function approximation with interpolation methods 265
9.12.4 Linear and non-linear functions 269
9.12.5 A robust strategy 269
9.12.6 Function approximation: Model-building algorithms 270
9.13 Conclusions 270
9.14 Bibliographic Remarks 271
9.14.1 Early works 271
9.14.2 Neuro-Dynamic Programming 271
9.14.3 RL algorithms based on Q-factors 271
9.14.4 Actor-critic Algorithms 272
9.14.5 Model-building algorithms 272
9.14.6 Function Approximation 273
9.14.7 Some other references 273
9.14.8 Further reading 273
9.15 Review Questions 273
10. MARKOV CHAIN AUTOMATA THEORY 277
10.1 Chapter Overview 277
10.2 The MCAT framework 278
10.2.1 The working mechanism of MCAT 278
10.2.2 Step-by-step details of an MCAT algorithm 280
10.2.3 An illustrative 3-state example 282
10.2.4 What if there are more than two actions? 284
10.3 Concluding Remarks 285
10.4 Bibliographic Remarks 285
10.5 Review Questions 285
11. CONVERGENCE: BACKGROUND MATERIAL 287
11.1 Chapter Overview 287
11.2 Vectors and Vector Spaces 288
11.3 Norms 290
11.3.1 Properties of Norms 291
11.4 Normed Vector Spaces 291
11.5 Functions and Mappings 291
11.5.1 Domain and Range of a function 291
11.5.2 The notation for transformations 293
11.6 Mathematical Induction 294
11.7 Sequences 297
11.7.1 Convergent Sequences 298
11.7.2 Increasing and decreasing sequences 300
11.7.3 Boundedness 300
11.8 Sequences in R^n 306
11.9 Cauchy sequences in R^n 307
11.10 Contraction mappings in R^n 308
11.11 Bibliographic Remarks 315
11.12 Review Questions 315
12. CONVERGENCE: PARAMETRIC OPTIMIZATION 317
12.1 Chapter Overview 317
12.2 Some Definitions and a result 317
12.2.1 Continuous Functions 318
12.2.2 Partial derivatives 319
12.2.3 A continuously differentiable function 319
12.2.4 Stationary points, local optima, and global optima 319
12.2.5 Taylor's theorem 320
12.3 Convergence of gradient-descent approaches 323
12.4 Perturbation Estimates 327
12.4.1 Finite Difference Estimates 327
12.4.2 Notation 328
12.4.3 Simultaneous Perturbation Estimates 328
12.5 Convergence of Simulated Annealing 333
12.6 Concluding Remarks 341
12.7 Bibliographic Remarks 341
12.8 Review Questions 341
13. CONVERGENCE: CONTROL OPTIMIZATION 343
13.1 Chapter Overview 343
13.2 Dynamic programming transformations 344
13.3 Some definitions 345
13.4 Monotonicity of T, T_μ, L, and L_μ 346
13.5 Some results for average & discounted MDPs 347
13.6 Discounted reward and classical dynamic programming 349
13.6.1 Bellman Equation for Discounted Reward 349
13.6.2 Policy Iteration 356
13.6.3 Value iteration for discounted reward MDPs 359
13.7 Average reward and classical dynamic programming 364
13.7.1 Bellman equation for average reward 365
13.7.2 Policy iteration for average reward MDPs 368
13.7.3 Value Iteration for average reward MDPs 372
13.8 Convergence of DP schemes for SMDPs 379
13.9 Convergence of Reinforcement Learning Schemes 379
13.10 Background Material for RL Convergence 380
13.10.1 Non-Expansive Mappings 380
13.10.2 Lipschitz Continuity 380
13.10.3 Convergence of a sequence with probability 1 381
13.11 Key Results for RL convergence 381
13.11.1 Synchronous Convergence 382
13.11.2 Asynchronous Convergence 383
13.12 Convergence of RL based on value iteration 392
13.12.1 Convergence of Q-Learning 392
13.12.2 Convergence of Relative Q-Learning 397
13.12.3 Finite Convergence of Q-Learning 397
13.13 Convergence of Q-P-Learning for MDPs 400
13.13.1 Discounted reward 400
13.13.2 Average Reward 401
13.14 SMDPs 402

13.14.1 Value iteration for average reward 402


13.14.2 Policy iteration for average reward 402
13.15 Convergence of Actor-Critic Algorithms 404
13.16 Function approximation and convergence analysis 405
13.17 Bibliographic Remarks 406
13.17.1 DP theory 406
13.17.2 RL theory 406
13.18 Review Questions 407
14. CASE STUDIES 409
14.1 Chapter Overview 409
14.2 A Classical Inventory Control Problem 410
14.3 Airline Yield Management 412
14.4 Preventive Maintenance 416
14.5 Transfer Line Buffer Optimization 420
14.6 Inventory Control in a Supply Chain 423
14.7 AGV Routing 424
14.8 Quality Control 426
14.9 Elevator Scheduling 427
14.10 Simulation optimization: A comparative perspective 429
14.11 Concluding Remarks 430
14.12 Review Questions 430
15. CODES 433
15.1 Introduction 433
15.2 C programming 434
15.3 Code Organization 436
15.4 Random Number Generators 437
15.5 Simultaneous Perturbation 439
15.6 Dynamic Programming Codes 441
15.6.1 Policy Iteration for average reward MDPs 442
15.6.2 Relative Value Iteration for average reward MDPs 447
15.6.3 Policy Iteration for discounted reward MDPs 450
15.6.4 Value Iteration for discounted reward MDPs 453
15.6.5 Policy Iteration for average reward SMDPs 460
15.7 Codes for Neural Networks 464
15.7.1 Neuron 465
15.7.2 Backprop Algorithm — Batch Mode 470
15.8 Reinforcement Learning Codes 478
15.8.1 Codes for Q-Learning 478
15.8.2 Codes for Relative Q-Learning 486
15.8.3 Codes for Relaxed-SMART 495
15.9 Codes for the Preventive Maintenance Case Study 506
15.9.1 Learning Codes 507
15.9.2 Fixed Policy Codes 521
15.10 MATLAB Codes 531
15.11 Concluding Remarks 535
15.12 Review Questions 535
16. CONCLUDING REMARKS 537
References 539
Index 551
