DRL Compre Regular
Work Integrated Learning Programmes Division
First Semester 2023-2024
Comprehensive Test (EC-3 Regular)
Course No.: AIMLCZG512    Course Title: Deep Reinforcement Learning
Nature of Exam: Open Book    Weightage: 40%
No. of Pages = 2    No. of Questions = 5
Duration: 2:30 Hours / 150 Mins    Date of Exam: 06-06-2024 (AN)
Note to Students:
1. Answer all the questions. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
2. Write all the answers neatly on A4 paper, scan and upload them.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.
Q1) [Answer parts and their subparts in the same sequence.]
Imagine you are an investor trying to optimize your trading strategy for four different stocks, labeled A, B, C, and D. Each stock has its own unique potential for profit, which is unknown to you. To maximize your returns over a series of 100 trades, you decide to implement an ε-greedy strategy with ε being 0.1. The actual returns from each stock follow these distributions:
Stock A: 70% chance of +1 return, 30% chance of 0.
Stock B: 50% chance of +2 return, 50% chance of 0.
Stock C: 10% chance of +5 return, 90% chance of 0.
Stock D: Guaranteed return of +0.5.
Given this, answer the following questions:
(a) Show how you model this as a Reinforcement Learning problem. [1 M]
(b) An investor intends to buy 100 times (each time buying one share of one stock). The strategies the investor may choose are: (i) follow ε-greedy for the initial 25 trades and only exploit the information for the next 75 purchases [1.5 M]; (ii) follow ε-greedy for the initial 75 trades and only exploit the information for the next 25 purchases [1.5 M]; (iii) follow ε-greedy for all 100 purchases [1.5 M]. Support the investor with your analysis. Show all the steps, tabulate answers for all the options, and write your conclusion. [1 M]
(c) What are MDP, POMDP and CMDP? Suggest one RL technique that is used to solve problems stated using each of them. It is adequate if you write just one or two lines for each. [1.5 M]
[1 + 1.5 + 1.5 + 1.5 + 1 + 1.5 = 8 M]
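For illustration only (not part of the required answer), the following is a minimal sketch of how the three strategies in part (b) could be simulated, assuming sample-average value estimates with ties broken by list order; the function names and the number of simulation runs are hypothetical choices.

    # Illustrative sketch: epsilon-greedy simulation of the stock/bandit setup in Q1.
    import random

    def draw_return(stock):
        """Sample one return according to the distributions stated in Q1."""
        if stock == "A":
            return 1.0 if random.random() < 0.70 else 0.0
        if stock == "B":
            return 2.0 if random.random() < 0.50 else 0.0
        if stock == "C":
            return 5.0 if random.random() < 0.10 else 0.0
        return 0.5  # Stock D: guaranteed return

    def run_strategy(explore_steps, total_steps=100, eps=0.1):
        """epsilon-greedy for the first explore_steps trades, pure exploitation after."""
        stocks = ["A", "B", "C", "D"]
        q = {s: 0.0 for s in stocks}   # sample-average value estimates
        n = {s: 0 for s in stocks}     # pull counts
        total = 0.0
        for t in range(total_steps):
            if t < explore_steps and random.random() < eps:
                choice = random.choice(stocks)             # explore
            else:
                choice = max(stocks, key=lambda s: q[s])   # exploit current estimate
            r = draw_return(choice)
            n[choice] += 1
            q[choice] += (r - q[choice]) / n[choice]       # incremental mean update
            total += r
        return total

    # Average total return of each strategy in part (b) over many simulated runs.
    for explore_steps in (25, 75, 100):
        avg = sum(run_strategy(explore_steps) for _ in range(2000)) / 2000
        print(f"explore for first {explore_steps:3d} trades: avg total return ~ {avg:.1f}")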
Q2) [Answer parts and their subparts in the same sequence.]
Consider the MDP given below, containing two states A and B with an action Shift that may result in A, B, or a terminal state. The rewards obtained are as indicated along the edges in the figure (X-2, X-3, X-1, X, X+2). Treat the value of X to be 6. The transition probabilities are as given along the edges. Let the discount factor be 0.4.
(a) Evaluate the given deterministic policy, in which Shift always executes the higher-probability action. Improve it up to 1 iteration. Use the Dynamic Programming solution to the MDP. [4 M]
(b) Using value iteration of dynamic programming, determine the values of states A and B. Let the values of A and B be initialized to 1. Show 1 iteration. [4 M]
[4 + 4 = 8 Marks]
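For illustration only, a minimal sketch of one value-iteration sweep for a generic two-state MDP; since the figure is not reproduced here, the transition probabilities and the assignment of the rewards X-2, X-3, X-1, X, X+2 to edges below are placeholders to be replaced with the values read from the figure.

    # Illustrative sketch: one sweep of value iteration for a two-state MDP.
    GAMMA = 0.4
    X = 6

    # transitions[state][action] = list of (probability, next_state, reward);
    # next_state None denotes the terminal state (value 0).
    # NOTE: the numbers below are PLACEHOLDERS, not the values from the figure.
    transitions = {
        "A": {"Shift": [(0.7, "B", X - 2), (0.3, None, X - 3)]},
        "B": {"Shift": [(0.6, "A", X - 1), (0.4, None, X + 2)]},
    }

    V = {"A": 1.0, "B": 1.0}          # initial values as stated in part (b)

    def backup(state, action, V):
        """Expected one-step return: sum over outcomes of p * (r + gamma * V(s'))."""
        return sum(p * (r + GAMMA * (V[s2] if s2 is not None else 0.0))
                   for p, s2, r in transitions[state][action])

    # One iteration of value iteration (max over actions; here only "Shift").
    V_new = {s: max(backup(s, a, V) for a in transitions[s]) for s in transitions}
    print(V_new)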
Q3) [Answer parts and their subparts in the same sequence.]
(a) What are the two most important issues when you have to learn the value function using first-visit Monte Carlo for a deterministic policy? [2 M] Explain. Also, provide possible solutions. [1.5 M]
(b) Explain any 3 of the most significant action selection strategies used in RL and mention how each selection method balances exploration and exploitation. Provide your answer as a table. [3 M]
(c) If we utilize a policy gradient method to address a reinforcement learning problem and find that the policy it provides is not optimal, what could be the possible explanations for this? State the 3 most relevant reasons. [1.5 M]
[2 + 1.5 + 3 + 1.5 = 8 Marks]
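For illustration only, a minimal sketch of three common action-selection strategies of the kind referred to in part (b), written over a vector of action-value estimates q; the function names and parameter values (eps, tau, c) are hypothetical.

    # Illustrative sketch: epsilon-greedy, softmax (Boltzmann) and UCB selection.
    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q, eps=0.1):
        """With prob. eps pick a random action (explore), else the greedy one (exploit)."""
        if rng.random() < eps:
            return int(rng.integers(len(q)))
        return int(np.argmax(q))

    def softmax_boltzmann(q, tau=1.0):
        """Sample actions in proportion to exp(q/tau); tau controls exploration."""
        prefs = np.exp((q - np.max(q)) / tau)       # subtract max for numerical stability
        return int(rng.choice(len(q), p=prefs / prefs.sum()))

    def ucb(q, counts, t, c=2.0):
        """Pick the action maximising q + c*sqrt(ln t / n); optimism drives exploration."""
        bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-9))
        return int(np.argmax(q + bonus))

    q = np.array([0.7, 1.0, 0.5, 0.5])              # example value estimates
    print(epsilon_greedy(q), softmax_boltzmann(q), ucb(q, np.array([10, 5, 2, 20]), t=37))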
Q4) [Answer parts and their subparts in the same sequence.]
For each of the questions, answer in not more than 4 precise statements. Vague answers will not be accepted.
(a) Why does AlphaGo use a separate policy network and a separate value network? [1 M]
(b) How does MCTS ensure that an action with the highest value is found in real time? If the best action can be selected only by MCTS, why is any prior learning of Q(s,a) required? [2 M]
(c) We have learned that supervised learning, which learns with samples from a given distribution, does not capture the online nature of interactions required for reinforcement learning very well.
(i) Why does AlphaGo use supervised learning to learn the initial policy (and even further)? [1.5 M]
(ii) In what ways are the shortcomings of supervised learning mitigated in AlphaGo? [2 M]
(d) How does DQN handle the challenges referred to in part (c) of this question? [1.5 M]
[1 + 2 + 1.5 + 2 + 1.5 = 8 Marks]
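For illustration only, a minimal sketch of the UCT selection rule used inside MCTS (relevant to part (b)); the node statistics and the exploration constant are hypothetical placeholders.

    # Illustrative sketch: UCT child selection inside MCTS.
    import math

    def uct_select(children):
        """children: list of dicts with visit count N and total value W.
        Pick the child maximising mean value + exploration bonus."""
        total_visits = sum(ch["N"] for ch in children) + 1
        c = 1.4                                     # exploration constant (placeholder)
        def uct(ch):
            mean = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0
            bonus = c * math.sqrt(math.log(total_visits) / (ch["N"] + 1e-9))
            return mean + bonus
        return max(children, key=uct)

    # A learned prior over actions (as in AlphaGo's PUCT variant) weights this bonus
    # by the policy prior, focusing the limited real-time simulation budget on
    # promising moves instead of spreading it uniformly.
    print(uct_select([{"N": 10, "W": 6.0}, {"N": 3, "W": 2.5}, {"N": 0, "W": 0.0}]))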
Q5) [Answer parts and their subparts in the same sequence.]
(a) Consider the following ways of organizing reinforcement learning techniques: (i) Model-Based vs. Model-Free; (ii) Value-Based vs. Policy-Based; (iii) On-Policy vs. Off-Policy. Write a statement or two on each of the points (for both categories) explaining the kind of problems those RL techniques are suited to. Provide your response in a neatly organized table. [3 M]
(b) Consider the following learning scenario. A human expert is presented with two trajectories taken by two drivers on a highway stretch. The human expert marks which of the trajectories is better. The agent learns this expertise (to decide the better trajectory when given two unseen trajectories) by observing the expert's decisions from many such examples. Explain how you precisely model this as an appropriate RL problem [3 M]. Show all the elements of your modeling, making necessary assumptions [2 M]. [Note: Only the most appropriate modeling gets the credit.]
[3 + 3 + 2 = 8 Marks]
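For illustration only, one standard way to treat a scenario like part (b) is preference-based reward learning, where a score function is fit so that the expert-preferred trajectory receives the higher score (a Bradley-Terry / logistic model); the feature extraction, dataset and learning-rate values below are hypothetical placeholders.

    # Illustrative sketch: fitting a trajectory score from pairwise expert preferences.
    import numpy as np

    rng = np.random.default_rng(0)

    def score(w, features):
        """Scalar 'how good is this trajectory' score: a linear model over features."""
        return features @ w

    # Each example: (features of trajectory i, features of trajectory j, expert prefers i?)
    dim = 5
    data = [(rng.normal(size=dim), rng.normal(size=dim), bool(rng.integers(2)))
            for _ in range(200)]                    # placeholder preference dataset

    w = np.zeros(dim)
    lr = 0.1
    for _ in range(500):
        grad = np.zeros(dim)
        for fi, fj, i_preferred in data:
            # P(i preferred over j) = sigmoid(score_i - score_j)
            p_i = 1.0 / (1.0 + np.exp(-(score(w, fi) - score(w, fj))))
            label = 1.0 if i_preferred else 0.0
            grad += (p_i - label) * (fi - fj)       # gradient of the logistic loss
        w -= lr * grad / len(data)

    # The learned score can rank two unseen trajectories, and can also serve as a
    # reward signal for a downstream RL agent.
    print(w)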