The document outlines the instructions for Programming Assignment 2, including submission guidelines and the importance of originality. It introduces a Markov Decision Process with four states and two actions, focusing on a deterministic career path problem. Additionally, it includes code snippets for implementing the environment and testing policies within the framework of the assignment.

1 {

2 "cells": [
3 {
4 "cell_type": "markdown",
5 "id": "17c4e1c2-931b-4e68-9378-5d376b0df0ef",
6 "metadata": {
7 "id": "17c4e1c2-931b-4e68-9378-5d376b0df0ef"
8 },
9 "source": [
10 "# Programming Assignment 2"
11 ]
12 },
13 {
14 "cell_type": "markdown",
15 "id": "7d597afc-35f8-4b03-8084-de771032604c",
16 "metadata": {
17 "id": "7d597afc-35f8-4b03-8084-de771032604c"
18 },
19 "source": [
20 "**Name:** <br />\n",
21 "**Roll No:**\n",
22 "***\n",
23 "\n",
24 "## Instructions\n",
25 "\n",
26 "\n",
27 "- Kindly name your submission files as `RollNo_Name_PA2.ipynb`. <br />\n",
28 "- You are required to work out your answers and submit only the iPython Notebook.
The code should be well commented and easy to understand as there are marks for
this. This notebook can be used as a template for assignment submission. <br />\n",
29 "- Submissions are to be made through iPearl portal. Submissions made through mail
will not be graded.<br />\n",
30 "- Answers to the theory questions if any should be included in the notebook itself.
While using special symbols use the $\\LaTeX$ mode <br />\n",
31 "- Make sure your plots are clear and have title, legends and clear lines, etc. <br
/>\n",
32 "- Plagiarism of any form will not be tolerated. If your solutions are found to
match with other students or from other uncited sources, there will be heavy
penalties and the incident will be reported to the disciplinary authorities. <br
/>\n",
33 "- In case you have any doubts, feel free to reach out to TAs for help. <br />\n",
34 "\n",
35 "***"
36 ]
37 },
38 {
39 "cell_type": "markdown",
40 "id": "69751002-4656-47cf-8b2c-6016a434f4b6",
41 "metadata": {
42 "id": "69751002-4656-47cf-8b2c-6016a434f4b6"
43 },
44 "source": [
45 "## E1: A Deterministic Career Path\n",
46 "\n",
47 "Consider a simple Markov Decision Process below with four states and two actions
available at each state. In this simplistic setting actions have deterministic
effects, i.e., taking an action in a state always leads to one next state with
transition probability equal to one. There are two actions out of each state for the
agent to choose from: D for development and R for research. The
_ultimately-care-only-about-money_ reward scheme is given along with the states.\n",
48 "\n",
49 "<img src='assets/mdp-d.png' width=\"700\" align=\"left\"></img>"
50 ]
51 },
52 {
53 "cell_type": "code",
54 "execution_count": null,
55 "id": "b0f991f3-9630-4656-9caa-135f13847ed8",
56 "metadata": {
57 "id": "b0f991f3-9630-4656-9caa-135f13847ed8"
58 },
59 "outputs": [],
60 "source": [
61 "# import required libraries\n",
62 "import gymnasium as gym\n",
63 "import copy\n",
64 "import numpy as np\n",
65 "import matplotlib.pyplot as plt\n",
66 "import matplotlib.font_manager\n",
67 "import random"
68 ]
69 },
70 {
71 "cell_type": "markdown",
72 "id": "47afefbb-7b00-44e5-82dd-16953b59a7f3",
73 "metadata": {
74 "id": "47afefbb-7b00-44e5-82dd-16953b59a7f3"
75 },
76 "source": [
77 "### E1.1 Environment Implementation"
78 ]
79 },
80 {
81 "cell_type": "code",
82 "execution_count": null,
83 "id": "61c01aa4-94c3-48e5-ad3b-279e03685260",
84 "metadata": {
85 "id": "61c01aa4-94c3-48e5-ad3b-279e03685260"
86 },
87 "outputs": [],
88 "source": [
89 "'''\n",
90 "Represents a Career Path problem Gym Environment which provides a Fully
observable\n",
91 "MDP\n",
92 "'''\n",
93 "class CareerPathEnv(gym.Env):\n",
94 " '''\n",
95 " CareerPathEnv represents the Gym Environment for the Career Path problem
environment\n",
96 " States : [0:'Unemployed',1:'Industry',2:'Grad School',3:'Academia']\n",
97 " Actions : [0:'Research', 1:'Development']\n",
98 " '''\n",
99 " metadata = {'render.modes': ['human']}\n",
100 "\n",
101 " def __init__(self,initial_state=0,no_states=4,no_actions=2):\n",
102 " '''\n",
103 " Constructor for the CareerPath class\n",
104 "\n",
105 " Args:\n",
106 " initial_state : starting state of the agent\n",
107 " no_states : The no. of possible states which is 4\n",
108 " no_actions : The no. of possible actions which is 2\n",
109 "\n",
110 " '''\n",
111 " self.initial_state = initial_state\n",
112 " self.state = self.initial_state\n",
113 " self.nA = no_actions\n",
114 " self.nS = no_states\n",
115 " self.prob_dynamics = {\n",
116 " # s: {\n",
117 " # a: [(p(s,s'|a), s', r', terminal/not)]\n",
118 " # }\n",
119 "\n",
120 " 0: {\n",
121 " 0: [(1.0, 2, 0.0, False)],\n",
122 " 1: [(1.0, 1, 100.0, False)],\n",
123 " },\n",
124 " 1: {\n",
125 " 0: [(1.0, 0, -10.0, False)],\n",
126 " 1: [(1.0, 1, 100.0, False)],\n",
127 " },\n",
128 " 2: {\n",
129 " 0: [(1.0, 3, 10.0, False)],\n",
130 " 1: [(1.0, 1, 100.0, False)],\n",
131 " },\n",
132 " 3: {\n",
133 " 0: [(1.0, 3, 10.0, False)],\n",
134 " 1: [(1.0, 1, 100.0, False)],\n",
135 " },\n",
136 " }\n",
137 " self.reset()\n",
138 "\n",
139 " def reset(self):\n",
140 " '''\n",
141 " Resets the environment\n",
142 " Returns:\n",
143 " observations containing player's current state\n",
144 " '''\n",
145 " self.state = self.initial_state\n",
146 " return self.get_obs()\n",
147 "\n",
148 " def get_obs(self):\n",
149 " '''\n",
150 " Returns the player's state as the observation of the environment\n",
151 " '''\n",
152 " return (self.state)\n",
153 "\n",
154 " def render(self, mode='human'):\n",
155 " '''\n",
156 " Renders the environment\n",
157 " '''\n",
158 " print(\"Current state: {}\".format(self.state))\n",
159 "\n",
160 " def sample_action(self):\n",
161 " '''\n",
162 " Samples and returns a random action from the action space\n",
163 " '''\n",
164 " return random.randint(0, self.nA)\n",
165 " def P(self):\n",
166 " '''\n",
167 " Defines and returns the probabilty transition matrix which is in the form
of a nested dictionary\n",
168 " '''\n",
169 " self.prob_dynamics = {\n",
170 " 0: {\n",
171 " 0: [(1.0, 2, 0.0, False)],\n",
172 " 1: [(1.0, 1, 100.0, False)],\n",
173 " },\n",
174 " 1: {\n",
175 " 0: [(1.0, 0, -10.0, False)],\n",
176 " 1: [(1.0, 1, 100.0, False)],\n",
177 " },\n",
178 " 2: {\n",
179 " 0: [(1.0, 3, 10.0, False)],\n",
180 " 1: [(1.0, 1, 100.0, False)],\n",
181 " },\n",
182 " 3: {\n",
183 " 0: [(1.0, 3, 10.0, False)],\n",
184 " 1: [(1.0, 1, 100.0, False)],\n",
185 " },\n",
186 " }\n",
187 " return self.prob_dynamics\n",
188 "\n",
189 "\n",
190 " def step(self, action):\n",
191 " '''\n",
192 " Performs the given action\n",
193 " Args:\n",
194 " action : action from the action_space to be taking in the
environment\n",
195 " Returns:\n",
196 " observation - returns current state\n",
197 " reward - reward obtained after taking the given action\n",
198 " done - True if the episode is complete else False\n",
199 " '''\n",
200 " if action >= self.nA:\n",
201 " action = self.nA-1\n",
202 "\n",
203 " dynamics_tuple = self.prob_dynamics[self.state][action][0]\n",
204 " self.state = dynamics_tuple[1]\n",
205 "\n",
206 "\n",
207 " return self.state, dynamics_tuple[2], dynamics_tuple[3]"
208 ]
209 },
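For orientation, each entry of the dynamics dictionary returned by `P()` is a list of `(probability, next_state, reward, done)` tuples. A minimal sketch of reading one entry, assuming the `CareerPathEnv` cell above has been run:

```python
# Sketch: inspect the transition encoding used by CareerPathEnv.
env = CareerPathEnv()

# State 0 ('Unemployed'), action 1 ('Development') has a single deterministic outcome:
# probability 1.0, next state 1 ('Industry'), reward 100.0, non-terminal.
prob, next_state, reward, done = env.P()[0][1][0]
print(prob, next_state, reward, done)  # 1.0 1 100.0 False
```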
210 {
211 "cell_type": "markdown",
212 "id": "c9125c6e-8599-4dea-b388-b20596e33201",
213 "metadata": {
214 "id": "c9125c6e-8599-4dea-b388-b20596e33201"
215 },
216 "source": [
217 "### E1.2 Policies\n",
218 "\n",
219 "After implementing the environment let us see how to make decisions in the
environment. Let $\\pi_1(s) = R$ and $\\pi_2(s) = D$ for any state be two policies.
Let us see how these policies look like."
220 ]
221 },
222 {
223 "cell_type": "code",
224 "execution_count": null,
225 "id": "7d58aa48-25aa-4cb3-a70d-c4ac68f0cacc",
226 "metadata": {
227 "id": "7d58aa48-25aa-4cb3-a70d-c4ac68f0cacc",
228 "outputId": "8d2c9d13-71f0-4ed9-b0d8-28b4eac47e1e"
229 },
230 "outputs": [
231 {
232 "name": "stdout",
233 "output_type": "stream",
234 "text": [
235 "Research policy: \n",
236 " [[1. 0.]\n",
237 " [1. 0.]\n",
238 " [1. 0.]\n",
239 " [1. 0.]]\n",
240 "Development policy: \n",
241 " [[0. 1.]\n",
242 " [0. 1.]\n",
243 " [0. 1.]\n",
244 " [0. 1.]]\n",
245 "Random policy: \n",
246 " [[1 0]\n",
247 " [0 1]\n",
248 " [0 1]\n",
249 " [1 0]]\n",
250 "Uncertain policy: \n",
251 " [[0.5 0.5]\n",
252 " [0.5 0.5]\n",
253 " [0.5 0.5]\n",
254 " [0.5 0.5]]\n"
255 ]
256 }
257 ],
258 "source": [
259 "policy_R = np.concatenate((np.ones([4, 1]), np.zeros([4, 1])), axis=1)\n",
260 "policy_D = np.concatenate((np.zeros([4, 1]), np.ones([4, 1])), axis=1)\n",
261 "policy_random = np.array((np.random.permutation(2), np.random.permutation(2),
np.random.permutation(2), np.random.permutation(2)))\n",
262 "print(\"Research policy: \\n\",policy_R)\n",
263 "print(\"Development policy: \\n\", policy_D)\n",
264 "print(\"Random policy: \\n\",policy_random)\n",
265 "\n",
266 "policy_uncertain = np.concatenate((0.5*np.ones([4, 1]), 0.5*np.ones([4, 1])),
axis=1)\n",
267 "print(\"Uncertain policy: \\n\",policy_uncertain)"
268 ]
269 },
270 {
271 "cell_type": "markdown",
272 "id": "ed00cfd0",
273 "metadata": {
274 "id": "ed00cfd0"
275 },
276 "source": [
277 "### E1.3 Testing\n",
278 "\n",
279 "By usine one of the above policies, lets see how we navigate the environment. We
want to see how we make take and action based on a given policy, what state we
transition to and obtain the rewards from the transition."
280 ]
281 },
282 {
283 "cell_type": "code",
284 "execution_count": null,
285 "id": "3fd4869e",
286 "metadata": {
287 "id": "3fd4869e",
288 "outputId": "e0dc70c0-be1d-469f-f1db-d1157bd19c1c"
289 },
290 "outputs": [
291 {
292 "name": "stdout",
293 "output_type": "stream",
294 "text": [
295 "State\t Action\t New State\t Reward\t is_Terminal\n",
296 " 0 \t 1 \t 1 \t 100.0 \t False\n",
297 " 1 \t 1 \t 1 \t 100.0 \t False\n",
298 " 1 \t 0 \t 0 \t -10.0 \t False\n",
299 " 0 \t 0 \t 2 \t 0.0 \t False\n",
300 " 2 \t 0 \t 3 \t 10.0 \t False\n",
301 " 3 \t 1 \t 1 \t 100.0 \t False\n",
302 " 1 \t 1 \t 1 \t 100.0 \t False\n",
303 " 1 \t 1 \t 1 \t 100.0 \t False\n",
304 " 1 \t 0 \t 0 \t -10.0 \t False\n",
305 " 0 \t 0 \t 2 \t 0.0 \t False\n",
306 "Total Number of steps: 10\n",
307 "Final Reward: 490.0\n"
308 ]
309 }
310 ],
311 "source": [
312 "env = CareerPathEnv()\n",
313 "is_Terminal = False\n",
314 "start_state = env.reset()\n",
315 "steps = 0\n",
316 "total_reward = 0\n",
317 "\n",
318 "# you may change policy here\n",
319 "policy = policy_uncertain\n",
320 "# policy = policy_R\n",
321 "# policy = policy_D\n",
322 "# policy = policy_random\n",
323 "\n",
324 "print(\"State\\t\", \"Action\\t\" , \"New State\\t\" , \"Reward\\t\" ,
\"is_Terminal\")\n",
325 "steps = 0\n",
326 "max_steps = 5\n",
327 "\n",
328 "prev_state = start_state\n",
329 "\n",
330 "while steps < 10:\n",
331 " steps += 1\n",
332 "\n",
333 " action = np.random.choice(2,1,p=policy[prev_state])[0] #0 -> Research, 1 ->
Development\n",
334 " state, reward, is_Terminal = env.step(action)\n",
335 "\n",
336 " total_reward += reward\n",
337 "\n",
338 " print(\" \",prev_state, \"\\t \", action, \"\\t \", state, \"\\t\", reward,
\"\\t \", is_Terminal)\n",
339 " prev_state = state\n",
340 "\n",
341 "print(\"Total Number of steps:\", steps)\n",
342 "print(\"Final Reward:\", total_reward)"
343 ]
344 },
345 {
346 "cell_type": "markdown",
347 "id": "9121e977-35c4-4845-888d-2b662262347a",
348 "metadata": {
349 "id": "9121e977-35c4-4845-888d-2b662262347a"
350 },
351 "source": [
352 "### Iterative Policy Evaluation\n",
353 "Iterative Policy Evaluation is commonly used to calculate the state value function
$V_\\pi(s)$ for a given policy $\\pi$. Here we implement a function to compute the
state value function $V_\\pi(s)$ for a given policy\n",
354 "\n",
355 "<img src='assets/policy_eval.png' width=\"500\" align=\"left\"></img>"
356 ]
357 },
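The referenced `policy_eval.png` is not reproduced in this text. For reference, the update the cell below applies (in place, state by state) is the standard Bellman expectation backup, swept over all states until the largest change falls below $\theta$:

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\left[\, r + \gamma V_k(s') \,\right]$$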
358 {
359 "cell_type": "code",
360 "execution_count": null,
361 "id": "c360b73b-e4d4-4538-b654-5837a123ee11",
362 "metadata": {
363 "id": "c360b73b-e4d4-4538-b654-5837a123ee11"
364 },
365 "outputs": [],
366 "source": [
367 "# Policy Evaluation\n",
368 "def EvaluatePolicy(env, policy, gamma=0.9, theta=1e-8, draw=False):\n",
369 " V = np.zeros(env.nS)\n",
370 " while True:\n",
371 " delta = 0\n",
372 " for s in range(env.nS):\n",
373 " Vs = 0\n",
374 " for a, action_prob in enumerate(policy[s]):\n",
375 " for prob, next_state, reward, done in env.P()[s][a]:\n",
376 " Vs += action_prob * prob * (reward + gamma * V[next_state])\n",
377 " delta = max(delta, np.abs(V[s]-Vs))\n",
378 " V[s] = Vs\n",
379 " if delta < theta:\n",
380 " break\n",
381 " return V"
382 ]
383 },
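A short usage sketch (illustrative, not an assignment cell): evaluating the two fixed policies from E1.2 on the deterministic environment, assuming the cells above have been run.

```python
# Sketch: state values of the all-Research and all-Development policies.
env = CareerPathEnv()
V_R = EvaluatePolicy(env, policy_R, gamma=0.9)
V_D = EvaluatePolicy(env, policy_D, gamma=0.9)
print("V under policy_R:", V_R)
# The all-D policy earns 100 per step, so with gamma = 0.9 each V(s) approaches 100 / (1 - 0.9) = 1000.
print("V under policy_D:", V_D)
```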
384 {
385 "cell_type": "markdown",
386 "id": "7b891c3a",
387 "metadata": {
388 "id": "7b891c3a"
389 },
390 "source": [
391 "### Policy improvement\n",
392 "\n",
393 "$\\pi'(s) = \\arg \\max_a \\sum_{s',r} p(s',r|s,a)\\left[ r + \\gamma v_\\pi(s')
\\right ]$\n"
394 ]
395 },
396 {
397 "cell_type": "code",
398 "execution_count": null,
399 "id": "930db58c",
400 "metadata": {
401 "id": "930db58c"
402 },
403 "outputs": [],
404 "source": [
405 "##Policy Improvement Function\n",
406 "def ImprovePolicy(env, v, gamma):\n",
407 " num_states = env.nS\n",
408 " num_actions = env.nA\n",
409 " prob_dynamics = env.P()\n",
410 "\n",
411 " q = np.zeros((num_states, num_actions))\n",
412 "\n",
413 " for state in prob_dynamics:\n",
414 " for action in prob_dynamics[state]:\n",
415 " #print(state, action)\n",
416 " for prob, new_state, reward, is_terminal in
prob_dynamics[state][action]:\n",
417 " #print(prob, new_state, reward, is_terminal)\n",
418 " q[state][action] += prob*(reward + gamma*v[new_state])\n",
419 "\n",
420 " new_pi = np.zeros((num_states, num_actions))\n",
421 "\n",
422 " for state in range(num_states):\n",
423 " opt_action = np.argmax(q[state])\n",
424 " new_pi[state][opt_action] = 1.0\n",
425 "\n",
426 " return new_pi"
427 ]
428 },
429 {
430 "cell_type": "markdown",
431 "id": "d8634e11",
432 "metadata": {
433 "id": "d8634e11"
434 },
435 "source": [
436 "### Policy Iteration\n",
437 "\n",
438 "<img src='assets/policy_iteration.png' width=\"500\" align=\"left\"></img>"
439 ]
440 },
441 {
442 "cell_type": "code",
443 "execution_count": null,
444 "id": "1860233b",
445 "metadata": {
446 "id": "1860233b"
447 },
448 "outputs": [],
449 "source": [
450 "def PolicyIteration(env, pi, gamma, tol = 1e-10):\n",
451 " num_states = env.nS\n",
452 " num_actions = env.nA\n",
453 " iterations = 0\n",
454 "\n",
455 " while True:\n",
456 " # print(pi)\n",
457 " iterations += 1\n",
458 " pi_old = pi\n",
459 " v = EvaluatePolicy(env, pi_old, gamma, tol)\n",
460 " pi = ImprovePolicy(env, v, gamma)\n",
461 "\n",
462 " is_equal = True\n",
463 " for s in range(num_states):\n",
464 " if np.argmax(pi_old[s]) == np.argmax(pi[s]):\n",
465 " continue\n",
466 " is_equal = False\n",
467 " if is_equal == True:\n",
468 " break\n",
469 " return pi, v, iterations\n",
470 "\n"
471 ]
472 },
473 {
474 "cell_type": "markdown",
475 "id": "5049ce61",
476 "metadata": {
477 "id": "5049ce61"
478 },
479 "source": [
480 "### Testing Policy Iteration"
481 ]
482 },
483 {
484 "cell_type": "code",
485 "execution_count": null,
486 "id": "9da9d38f",
487 "metadata": {
488 "id": "9da9d38f",
489 "outputId": "dfcaf066-c998-4339-c29c-24633da4e496"
490 },
491 "outputs": [
492 {
493 "name": "stdout",
494 "output_type": "stream",
495 "text": [
496 "Initial Policy: \n",
497 " [[1 0]\n",
498 " [0 1]\n",
499 " [0 1]\n",
500 " [1 0]]\n",
501 "Final Policy: \n",
502 " [[0. 1.]\n",
503 " [0. 1.]\n",
504 " [0. 1.]\n",
505 " [0. 1.]]\n",
506 "State Value Function: [1000. 1000. 1000. 1000.]\n",
507 "Number of iterations for Policy Iteration: 2\n",
508 "Iterations:\n",
509 "Min\t Max\t Average\n",
510 "1 \t 2 \t 1.91\n"
511 ]
512 }
513 ],
514 "source": [
515 "gamma = 0.9\n",
516 "env = CareerPathEnv()\n",
517 "\n",
518 "print(\"Initial Policy: \\n\",policy_random)\n",
519 "pi, v, iters = PolicyIteration(env, policy_random, gamma)\n",
520 "print(\"Final Policy: \\n\",pi)\n",
521 "print(\"State Value Function: \",v)\n",
522 "print(\"Number of iterations for Policy Iteration: \",iters)\n",
523 "\n",
524 "# average number of iterations required\n",
525 "avg_iters = 0\n",
526 "min_iters = 1000\n",
527 "max_iters = 0\n",
528 "for _ in range(100):\n",
529 " policy_random = np.array((np.random.permutation(2), np.random.permutation(2),
np.random.permutation(2), np.random.permutation(2)))\n",
530 " _, _, iters = PolicyIteration(env,policy_random, gamma)\n",
531 " avg_iters += iters\n",
532 " min_iters = min(min_iters, iters)\n",
533 " max_iters = max(max_iters, iters)\n",
534 "avg_iters /= 100\n",
535 "print(\"Iterations:\")\n",
536 "print(\"Min\\t\", \"Max\\t\" , \"Average\")\n",
537 "print(min_iters,\"\\t\", max_iters,\"\\t\", avg_iters)"
538 ]
539 },
540 {
541 "cell_type": "markdown",
542 "id": "b57f1d6c-dd20-4902-9581-ce7f8a0ec944",
543 "metadata": {
544 "id": "b57f1d6c-dd20-4902-9581-ce7f8a0ec944"
545 },
546 "source": [
547 "***"
548 ]
549 },
550 {
551 "cell_type": "markdown",
552 "id": "c086d4b9",
553 "metadata": {
554 "id": "c086d4b9"
555 },
556 "source": [
557 "### A1. Find an optimal policy to navigate the given environment using Value
Iteration (VI)\n",
558 "\n",
559 "<img src='assets/value_iteration.png' width=\"500\" align=\"left\"></img>"
560 ]
561 },
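The `value_iteration.png` pseudocode is not reproduced in this text. As a reference point only, a minimal sketch of standard Value Iteration written against the same `env.P()` / `env.nS` / `env.nA` interface used above might look like the following (the function name `ValueIteration` and its return format are illustrative choices, not part of the assignment template):

```python
# Sketch of standard Value Iteration; reuses ImprovePolicy from above to extract the greedy policy.
def ValueIteration(env, gamma=0.9, theta=1e-8):
    V = np.zeros(env.nS)
    iterations = 0
    while True:
        iterations += 1
        delta = 0
        for s in range(env.nS):
            # Back up each action's value and keep the best one.
            q = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P()[s][a]:
                    q[a] += prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(q.max() - V[s]))
            V[s] = q.max()
        if delta < theta:
            break
    # Greedy (deterministic) policy with respect to the converged values.
    pi = ImprovePolicy(env, V, gamma)
    return pi, V, iterations
```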
562 {
563 "cell_type": "code",
564 "execution_count": null,
565 "id": "82e66db5",
566 "metadata": {
567 "id": "82e66db5"
568 },
569 "outputs": [],
570 "source": [
571 "# write your code here\n"
572 ]
573 },
574 {
575 "cell_type": "markdown",
576 "id": "876d88b5",
577 "metadata": {
578 "id": "876d88b5"
579 },
580 "source": [
581 "### Testing Value Iterations"
582 ]
583 },
584 {
585 "cell_type": "code",
586 "execution_count": null,
587 "id": "90c71594",
588 "metadata": {
589 "id": "90c71594"
590 },
591 "outputs": [],
592 "source": [
593 "# write your code for testing value iteration here\n"
594 ]
595 },
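For reference, a short test mirroring the Policy Iteration test above; it assumes the illustrative `ValueIteration` sketch above is defined.

```python
# Sketch: run the illustrative ValueIteration on the deterministic environment and report the result.
gamma = 0.9
env = CareerPathEnv()
pi_vi, v_vi, iters_vi = ValueIteration(env, gamma)
print("Final Policy: \n", pi_vi)
print("State Value Function: ", v_vi)
print("Number of iterations for Value Iteration: ", iters_vi)
```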
596 {
597 "cell_type": "markdown",
598 "id": "83bf1175",
599 "metadata": {
600 "id": "83bf1175"
601 },
602 "source": [
603 "### A1.2 Compare PI and VI in terms of convergence (average number of iteration,
time required for each iteration). Is the policy obtained by both same?\n",
604 "\n",
605 "Write your answer here"
606 ]
607 },
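One way to collect the comparison data is sketched below; it assumes the earlier cells (including the illustrative `ValueIteration`) have been run, and `time.perf_counter` is used only as an example timer.

```python
# Sketch: compare wall-clock time and iteration counts of PI and VI on the deterministic environment.
import time

gamma = 0.9
env = CareerPathEnv()

start = time.perf_counter()
pi_pi, v_pi, iters_pi = PolicyIteration(env, policy_random, gamma)
pi_time = time.perf_counter() - start

start = time.perf_counter()
pi_vi, v_vi, iters_vi = ValueIteration(env, gamma)
vi_time = time.perf_counter() - start

print("PI:", iters_pi, "iterations,", pi_time, "s")
print("VI:", iters_vi, "iterations,", vi_time, "s")
print("Same greedy policy:", np.array_equal(np.argmax(pi_pi, axis=1), np.argmax(pi_vi, axis=1)))
```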
608 {
609 "cell_type": "markdown",
610 "id": "4cc712ac",
611 "metadata": {
612 "id": "4cc712ac"
613 },
614 "source": [
615 "***"
616 ]
617 },
618 {
619 "cell_type": "markdown",
620 "id": "a47f6340",
621 "metadata": {
622 "id": "a47f6340"
623 },
624 "source": [
625 "## Part B : A Stochastic Career Path\n",
626 "\n",
627 "Now consider a more realistic Markov Decision Process below with four states and
two actions available at each state. In this setting Actions have nondeterministic
effects, i.e., taking an action in a state always leads to one next state, but which
state is the one next state is determined by transition probabilities. These
transition probabilites are shown in the figure attached to the transition arrows
from states and actions to states. There are two actions out of each state for the
agent to choose from: D for development and R for research. The same
_ultimately-care-only-about-money_ reward scheme is given along with the states.\n",
628 "\n",
629 "<img src='assets/mdp-nd.png' width=\"700\" align=\"left\"></img>"
630 ]
631 },
632 {
633 "cell_type": "code",
634 "execution_count": null,
635 "id": "73a0ac3e-223e-42c9-896f-d3d74c060258",
636 "metadata": {
637 "id": "73a0ac3e-223e-42c9-896f-d3d74c060258"
638 },
639 "outputs": [],
640 "source": [
641 "'''\n",
642 "Represents a Career Path problem Gym Environment which provides a Fully
observable\n",
643 "MDP\n",
644 "'''\n",
645 "class StochasticCareerPathEnv(gym.Env):\n",
646 " '''\n",
647 " StocasticCareerPathEnv represents the Gym Environment for the Career Path
problem environment\n",
648 " States : [0:'Unemployed',1:'Industry',2:'Grad School',3:'Academia']\n",
649 " Actions : [0:'Research', 1:'Development']\n",
650 " '''\n",
651 " metadata = {'render.modes': ['human']}\n",
652 "\n",
653 " def __init__(self,initial_state=3,no_states=4,no_actions=2):\n",
654 " '''\n",
655 " Constructor for the CareerPath class\n",
656 "\n",
657 " Args:\n",
658 " initial_state : starting state of the agent\n",
659 " no_states : The no. of possible states which is 4\n",
660 " no_actions : The no. of possible actions which is 2\n",
661 "\n",
662 " '''\n",
663 " self.initial_state = initial_state\n",
664 " self.state = self.initial_state\n",
665 " self.nA = no_actions\n",
666 " self.nS = no_states\n",
667 " self.prob_dynamics = {\n",
668 " # s: {\n",
669 " # a: [(p(s,s'|a), s', r', terminal/not), (p(s,s''|a), s'', r'',
terminal/not)]\n",
670 " # }\n",
671 "\n",
672 " 0: {\n",
673 " 0: [(1.0, 2, 0.0, False)],\n",
674 " 1: [(1.0, 1, 100.0, False)],\n",
675 " },\n",
676 " 1: {\n",
677 " 0: [(0.9, 0, -10.0, False),(0.1, 1, 100, False)],\n",
678 " 1: [(1.0, 1, 100.0, False)],\n",
679 " },\n",
680 " 2: {\n",
681 " 0: [(0.9, 3, 10.0, False),(0.1, 2, 0, False)],\n",
682 " 1: [(0.9, 1, 100.0, False),(0.1, 1, 100, False)],\n",
683 " },\n",
684 " 3: {\n",
685 " 0: [(1.0, 3, 10.0, False)],\n",
686 " 1: [(0.9, 1, 100.0, False),(0.1, 3, 10, False)],\n",
687 " },\n",
688 " }\n",
689 " self.reset()\n",
690 "\n",
691 " def reset(self):\n",
692 " '''\n",
693 " Resets the environment\n",
694 " Returns:\n",
695 " observations containing player's current state\n",
696 " '''\n",
697 " self.state = self.initial_state\n",
698 " return self.get_obs()\n",
699 "\n",
700 " def get_obs(self):\n",
701 " '''\n",
702 " Returns the player's state as the observation of the environment\n",
703 " '''\n",
704 " return (self.state)\n",
705 "\n",
706 " def render(self, mode='human'):\n",
707 " '''\n",
708 " Renders the environment\n",
709 " '''\n",
710 " print(\"Current state: {}\".format(self.state))\n",
711 "\n",
712 " def sample_action(self):\n",
713 " '''\n",
714 " Samples and returns a random action from the action space\n",
715 " '''\n",
716 " return random.randint(0, self.nA)\n",
717 " def P(self):\n",
718 " '''\n",
719 " Defines and returns the probabilty transition matrix which is in the form
of a nested dictionary\n",
720 " '''\n",
721 " self.prob_dynamics = {\n",
722 " 0: {\n",
723 " 0: [(1.0, 2, 0.0, False)],\n",
724 " 1: [(1.0, 1, 100.0, False)],\n",
725 " },\n",
726 " 1: {\n",
727 " 0: [(0.9, 0, -10.0, False),(0.1, 1, 100, False)],\n",
728 " 1: [(1.0, 1, 100.0, False)],\n",
729 " },\n",
730 " 2: {\n",
731 " 0: [(0.9, 3, 10.0, False),(0.1, 2, 0, False)],\n",
732 " 1: [(0.9, 1, 100.0, False),(0.1, 1, 100, False)],\n",
733 " },\n",
734 " 3: {\n",
735 " 0: [(1.0, 3, 10.0, False)],\n",
736 " 1: [(0.9, 1, 100.0, False),(0.1, 3, 10, False)],\n",
737 " },\n",
738 " }\n",
739 " return self.prob_dynamics\n",
740 "\n",
741 "\n",
742 " def step(self, action):\n",
743 " '''\n",
744 " Performs the given action\n",
745 " Args:\n",
746 " action : action from the action_space to be taking in the
environment\n",
747 " Returns:\n",
748 " observation - returns current state\n",
749 " reward - reward obtained after taking the given action\n",
750 " done - True if the episode is complete else False\n",
751 " '''\n",
752 " if action >= self.nA:\n",
753 " action = self.nA-1\n",
754 "\n",
755 " if self.state == 0 or (self.state == 1 and action == 1) or (self.state == 3
and action == 0):\n",
756 " index = 0\n",
757 " else:\n",
758 " index = np.random.choice(2,1,p=[0.9,0.1])[0]\n",
759 "\n",
760 " dynamics_tuple = self.prob_dynamics[self.state][action][index]\n",
761 " self.state = dynamics_tuple[1]\n",
762 "\n",
763 "\n",
764 " return self.state, dynamics_tuple[2], dynamics_tuple[3]"
765 ]
766 },
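As a quick sanity check on the stochastic dynamics (illustrative only), the expected immediate reward of an action can be read straight off the listed tuples; for example, choosing Research in state 1 ('Industry') gives 0.9·(−10) + 0.1·100 = 1.

```python
# Sketch: expected immediate reward for (state=1 'Industry', action=0 'Research') in the stochastic MDP.
env = StochasticCareerPathEnv()
expected_r = sum(prob * reward for prob, next_state, reward, done in env.P()[1][0])
print(expected_r)  # 0.9*(-10) + 0.1*100 = 1.0
```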
767 {
768 "cell_type": "markdown",
769 "id": "4e6a5212",
770 "metadata": {
771 "id": "4e6a5212"
772 },
773 "source": [
774 "### Navigating in Stochastic Career Path"
775 ]
776 },
777 {
778 "cell_type": "code",
779 "execution_count": null,
780 "id": "d3e69064",
781 "metadata": {
782 "id": "d3e69064",
783 "outputId": "8bba3991-3a27-43c3-dcca-5919f2277c1a"
784 },
785 "outputs": [
786 {
787 "name": "stdout",
788 "output_type": "stream",
789 "text": [
790 "State\t Action\t New State\t Reward\t is_Terminal\n",
791 " 3 \t 1 \t 1 \t 100.0 \t False\n",
792 " 1 \t 1 \t 1 \t 100.0 \t False\n",
793 " 1 \t 1 \t 1 \t 100.0 \t False\n",
794 " 1 \t 1 \t 1 \t 100.0 \t False\n",
795 " 1 \t 1 \t 1 \t 100.0 \t False\n",
796 " 1 \t 1 \t 1 \t 100.0 \t False\n",
797 " 1 \t 1 \t 1 \t 100.0 \t False\n",
798 " 1 \t 1 \t 1 \t 100.0 \t False\n",
799 " 1 \t 1 \t 1 \t 100.0 \t False\n",
800 " 1 \t 1 \t 1 \t 100.0 \t False\n",
801 "Total Number of steps: 10\n",
802 "Final Reward: 1000.0\n"
803 ]
804 }
805 ],
806 "source": [
807 "env = StochasticCareerPathEnv()\n",
808 "is_Terminal = False\n",
809 "start_state = env.reset()\n",
810 "steps = 0\n",
811 "total_reward = 0\n",
812 "\n",
813 "# you may change policy here\n",
814 "policy = policy_random\n",
815 "# policy = policy_1\n",
816 "# policy = policy_2\n",
817 "\n",
818 "print(\"State\\t\", \"Action\\t\" , \"New State\\t\" , \"Reward\\t\" ,
\"is_Terminal\")\n",
819 "steps = 0\n",
820 "max_steps = 5\n",
821 "\n",
822 "prev_state = start_state\n",
823 "\n",
824 "while steps < 10:\n",
825 " steps += 1\n",
826 "\n",
827 " action = np.random.choice(2,1,p=policy[prev_state])[0] #0 -> Research, 1 ->
Development\n",
828 " state, reward, is_Terminal = env.step(action)\n",
829 "\n",
830 " total_reward += reward\n",
831 "\n",
832 " print(\" \",prev_state, \"\\t \", action, \"\\t \", state, \"\\t\", reward,
\"\\t \", is_Terminal)\n",
833 " prev_state = state\n",
834 "\n",
835 "print(\"Total Number of steps:\", steps)\n",
836 "print(\"Final Reward:\", total_reward)"
837 ]
838 },
839 {
840 "cell_type": "markdown",
841 "id": "c01a3612",
842 "metadata": {
843 "id": "c01a3612"
844 },
845 "source": [
846 "### B1.1 Find an optimal policy to navigate the given SCP environment using Policy
Iteration (PI)"
847 ]
848 },
849 {
850 "cell_type": "code",
851 "execution_count": null,
852 "id": "eb1faa8a",
853 "metadata": {
854 "id": "eb1faa8a"
855 },
856 "outputs": [],
857 "source": [
858 "# [Hint] What would change for the stochastic MDP in the Policy Iteration code from
Part A?\n",
859 "# write your code here"
860 ]
861 },
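Because `EvaluatePolicy` and `ImprovePolicy` already sum over every `(prob, next_state, reward, done)` tuple returned by `env.P()`, the Part A Policy Iteration can be reused here. A minimal usage sketch (illustrative, assuming the Part A cells have been run):

```python
# Sketch: Policy Iteration applied to the stochastic environment.
gamma = 0.9
env_scp = StochasticCareerPathEnv()
pi_scp, v_scp, iters_scp = PolicyIteration(env_scp, policy_random, gamma)
print("Final Policy: \n", pi_scp)
print("State Value Function: ", v_scp)
```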
862 {
863 "cell_type": "markdown",
864 "id": "3264b19c",
865 "metadata": {
866 "id": "3264b19c"
867 },
868 "source": [
869 "### B1.2 Find an optimal policy to navigate the given SCP environment using Value
Iteration (VI)"
870 ]
871 },
872 {
873 "cell_type": "code",
874 "execution_count": null,
875 "id": "77079cf6",
876 "metadata": {
877 "id": "77079cf6"
878 },
879 "outputs": [],
880 "source": [
881 "# [Hint] What would change for the stochastic MDP in the Value Iteration code from
Part A?\n",
882 "# write your code here"
883 ]
884 },
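Similarly, a short usage sketch applying the illustrative `ValueIteration` from Part A to the stochastic environment:

```python
# Sketch: the illustrative ValueIteration applied to StochasticCareerPathEnv.
pi_scp_vi, v_scp_vi, iters_scp_vi = ValueIteration(StochasticCareerPathEnv(), gamma=0.9)
print(pi_scp_vi, v_scp_vi, iters_scp_vi)
```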
885 {
886 "cell_type": "markdown",
887 "id": "dda774d8",
888 "metadata": {
889 "id": "dda774d8"
890 },
891 "source": [
892 "### B1.3 Compare PI and VI in terms of convergence (average number of iteration,
time required for each iteration). Is the policy obtained by both same for SCP
environment?\n"
893 ]
894 },
895 {
896 "cell_type": "code",
897 "execution_count": null,
898 "id": "5c6f9de8",
899 "metadata": {
900 "id": "5c6f9de8"
901 },
902 "outputs": [],
903 "source": [
904 "# write your code for comparison here"
905 ]
906 },
907 {
908 "cell_type": "markdown",
909 "id": "5a61e7a8",
910 "metadata": {
911 "id": "5a61e7a8"
912 },
913 "source": [
914 "Write your comments compairing convergence and policies here."
915 ]
916 }
917 ],
918 "metadata": {
919 "colab": {
920 "provenance": []
921 },
922 "kernelspec": {
923 "display_name": "Python 3 (ipykernel)",
924 "language": "python",
925 "name": "python3"
926 },
927 "language_info": {
928 "codemirror_mode": {
929 "name": "ipython",
930 "version": 3
931 },
932 "file_extension": ".py",
933 "mimetype": "text/x-python",
934 "name": "python",
935 "nbconvert_exporter": "python",
936 "pygments_lexer": "ipython3",
937 "version": "3.9.12"
938 },
939 "vscode": {
940 "interpreter": {
941 "hash": "f9f85f796d01129d0dd105a088854619f454435301f6ffec2fea96ecbd9be4ac"
942 }
943 }
944 },
945 "nbformat": 4,
946 "nbformat_minor": 5
947 }
948
