MODEL PAPER-I
END TERM EXAMINATION [MAY-2016]
EIGHTH SEMESTER [B.TECH] MACHINE LEARNING [ETCS-402]
Time: 3 Hrs. M.M.: 75
Note: Q.No. 1 is compulsory. Attempt any four from others.

Q.1. (a) Explain using an example the concept of a learning system. (5)
Ans. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
A checkers learning problem:
* Task T: playing checkers
* Performance measure P: percent of games won against opponents
* Training experience E: playing practice games against itself
A handwriting recognition learning problem:
* Task T: recognizing and classifying handwritten words within images
* Performance measure P: percent of words correctly classified
* Training experience E: a database of handwritten words with given classifications
A robot driving learning problem:
* Task T: driving on public four-lane highways using vision sensors
* Performance measure P: average distance traveled before an error (as judged by a human overseer)
* Training experience E: a sequence of images and steering commands recorded while observing a human driver.

Q.1. (b) What are the goals of machine learning? (5)
Ans. The primary goal of machine learning is to develop general-purpose algorithms of practical value. Such algorithms should be efficient. Learning algorithms should also be as general purpose as possible: we are looking for algorithms that can be easily applied to a broad class of learning problems. Of primary importance, we want the result of learning to be a prediction rule that is as accurate as possible in the predictions that it makes. Occasionally, we may also be interested in the interpretability of the prediction rules produced by learning. In other words, in some contexts (such as medical diagnosis), we want the computer to find prediction rules that are easily understandable by human experts.
Machine learning can be thought of as "programming by example." The results of using machine learning are often more accurate than what can be created through direct programming. The reason is that machine learning algorithms are data driven and are able to examine large amounts of data. On the other hand, a human expert is likely to be guided by imprecise impressions or perhaps an examination of only a relatively small number of examples.

Q.1. (c) What are generative learning algorithms? (5)
Ans. Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line, that is, a decision boundary, that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.
Here is a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms. Algorithms that instead try to model p(x|y) (and p(y)) are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features and p(x|y = 1) models the distribution of elephants' features.
The different generative learning algorithms are:
* Gaussian discriminant analysis
* Naive Bayes
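As a minimal sketch of the generative idea (not part of the original answer, and assuming NumPy plus a made-up two-feature animal dataset), the snippet below estimates p(y) and a per-class Gaussian p(x|y), then classifies a new example by comparing p(x|y = 0)p(y = 0) against p(x|y = 1)p(y = 1):

import numpy as np

# Toy training data: two features per animal, label 1 = elephant, 0 = dog (made up).
X = np.array([[3.0, 2.5], [2.8, 2.2], [0.5, 0.4], [0.6, 0.7]])
y = np.array([1, 1, 0, 0])

# Generative step: model p(y) and p(x|y) with independent per-class Gaussians.
priors, means, variances = {}, {}, {}
for c in (0, 1):
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)
    means[c] = Xc.mean(axis=0)
    variances[c] = Xc.var(axis=0) + 1e-6      # guard against zero variance

def log_joint(x, c):
    # log p(x|y=c) under independent Gaussians, plus log p(y=c)
    var = variances[c]
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - means[c]) ** 2 / var)
    return ll + np.log(priors[c])

x_new = np.array([2.9, 2.4])
prediction = max((0, 1), key=lambda c: log_joint(x_new, c))
print(prediction)   # 1 -> the new animal looks more like the elephants seen in training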
Q.1. (d) What do you understand by Reinforcement Learning? (5)
Ans. Reinforcement Learning refers to the problem of a goal-directed agent interacting with an uncertain environment. The goal of an RL agent is to maximize a long-term scalar reward by sensing the state of the environment and taking actions which affect the state. At each step, an RL system gets evaluative feedback about the performance of its action, allowing it to improve the performance of subsequent actions. Several RL methods have been developed and successfully applied in machine learning to learn optimal policies for finite-state finite-action discrete-time Markov Decision Processes (MDPs).
An analogous RL control system is shown in the figure, where the controller, based on state feedback and reinforcement feedback about its previous action, calculates the next control, which should lead to an improved performance. The reinforcement signal is the output of a performance evaluator function, which is typically a function of the state and the control. An RL system has a similar objective to an optimal controller, which aims to optimize a long-term performance criterion while maintaining stability. RL methods typically estimate the value function, which is a measure of goodness of a given action for a given state.

Q.1. (e) What is overfitting? (5)
Ans. Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up.
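A small illustrative sketch of that last point (assuming scikit-learn and synthetic data, neither of which appears in the original answer): an unrestricted decision tree memorizes the noisy training set, while limiting its depth, one simple form of pruning, narrows the gap between training and test accuracy.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # grows until leaves are pure
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)  # detail deliberately limited

for name, model in [("unrestricted", full), ("depth-limited", pruned)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
# The unrestricted tree typically scores ~1.0 on the training data but noticeably
# lower on unseen data; restricting depth trades training fit for generalization.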
Q.2. Explain models of learning in detail. (12.5)
Ans. Learning algorithms are normally designed around a particular 'paradigm' for the learning process, i.e. the overall approach to learning. A computational learning model should be clear about the following aspects:
* Learner: Who or what is doing the learning, i.e. an algorithm or a computer program. Learning algorithms may be embedded in more general software systems, e.g. involving systems of agents, or may be embodied in physical objects like robots and ad-hoc networks of processors in intelligent environments.
* Domain: What is being learned, i.e. a function or a concept. Among the many other possibilities are: the operation of a device, a tune, a game, a language, a preference, and so on. In the case of concepts, the sets of concepts that are considered for learning are called concept classes.
* Goal: Why the learning is done. The learning can be done to retrieve a set of rules from spurious data, to become a good simulator for some physical phenomenon, to take control over a system, and so on.
* Representation: The way the objects to be learned are represented, i.e. the way they are to be represented by the computer program. The hypotheses which the program develops while learning may be represented in the same way, or in a broader (or more restricted) format.
* Algorithmic technology: The algorithmic framework to be used. Among the many different 'technologies' are: artificial neural networks, belief networks, case-based reasoning, decision trees, grammars, liquid state machines, probabilistic networks, rule learning, support vector machines, and threshold functions. One may also specify the specific learning paradigm or discovery tools to be used. Each algorithmic technology has its own learning strategy and its own range of application. There are also mixed-strategy approaches.
* Information source: The information (training data) the program uses for learning. This could have different forms: positive and negative examples (called labeled examples), answers to queries, feedback from certain actions, and so on. Functions and concepts are typically revealed in the form of labeled instances taken from an instance space X. One often identifies a concept with the set of all its positive instances, a subset of X. An information source may be noisy, i.e. the training data may contain errors. Examples may be clustered before use in training a program.
* Training scenario: The description of the learning process. Here, mostly on-line learning is discussed. In an on-line learning scenario, the program is given examples one by one, and it recalculates its hypothesis of what it learns after each example. Examples may be drawn from a random source, according to some known or unknown probability distribution. An on-line scenario can also be interactive, in which case new examples are supplied depending on the performance of the program on previous examples. In contrast, in an off-line learning scenario the program receives all examples at once. One often distinguishes between:
* Supervised learning: the scenario in which a program is fed examples and must predict the label of every next example before a teacher tells the answer.
* Unsupervised learning: the scenario in which the program must determine certain regularities or properties of the instances it receives, e.g. from an unknown physical process, all by itself (without a teacher).
Training scenarios are typically finite. On the other hand, in inductive inference a program can be fed an unbounded amount of data. In reinforcement learning the inputs come from an unpredictable environment and positive or negative feedback is given at the end of every small sequence of learning steps, e.g. in the process of learning an optimal strategy.
* Prior knowledge: What is known in advance about the domain, e.g. about specific properties (mathematical or otherwise) of the concepts to be learned. This might help to limit the class of hypotheses that the program needs to consider during the learning, and thus to limit its 'uncertainty' about the unknown object it learns and to converge faster. The program may also use it to bias its choice of hypothesis.
* Success criteria: The criteria for successful learning, i.e. for determining when the learning is completed or has otherwise converged sufficiently. Depending on the goal of the learning program, the program should be fit for its task. If the program is used e.g. in safety-critical environments, it must have reached sufficient accuracy in the training phase so it can decide or predict reliably during operation. A success criterion can be 'measured' by means of test sets or by theoretical analysis.
* Performance: The amount of time, space and computational power needed in order to learn a certain task, and also the quality (accuracy) reached in the process. There is often a trade-off between the number of examples used to train a program, and thus the computational resources used, and the capabilities of the program afterwards.
Computational learning models may depend on many more criteria and on specific theories of the learning process.

Q.3. Explain the decision tree algorithm in detail. (12.5)
Ans. The basic algorithm for decision tree learning corresponds approximately to the ID3 algorithm. The ID3 algorithm learns decision trees by constructing them top-down, beginning with the question "which attribute should be tested at the root of the tree?" The best attribute is selected and used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute). The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
* Entropy characterizes the (im)purity of an arbitrary collection of examples. Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
Entropy(S) = -p+ log2 p+ - p- log2 p-,
where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
* Information gain measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree. Information gain is simply the expected reduction in entropy caused by partitioning the examples according to the attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v),
where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}). The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
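The two formulas translate directly into a few lines of code. The sketch below (an illustration, assuming only the Python standard library and the Wind example used in the next paragraph) computes entropy and Gain(S, A) exactly as defined above:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum over classes c of p_c * log2(p_c)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# The [9+, 5-] Wind example discussed below: 8 Weak days (6+, 2-) and 6 Strong days (3+, 3-).
examples = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
labels = ["+"] * 6 + ["-"] * 2 + ["+"] * 3 + ["-"] * 3
print(round(entropy(labels), 3))                              # 0.940
print(round(information_gain(examples, labels, "Wind"), 3))   # 0.048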
(Figure: a decision tree for the PlayTennis concept. Outlook is tested at the root; the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).)
To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is
Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
Suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong. Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as
Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
= Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
= 0.940 - (8/14) 0.811 - (6/14) 1.00 = 0.048.
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree. Information gain can be used to evaluate the relevance of attributes: below, the information gain of two different attributes, Humidity and Wind, is computed in order to determine which is the better attribute for classifying the training examples.
Which attribute is the best classifier?
S = [9+, 5-], E = 0.940
Humidity: High [3+, 4-], E = 0.985; Normal [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14) 0.985 - (7/14) 0.592 = 0.151
Wind: Weak [6+, 2-], E = 0.811; Strong [3+, 3-], E = 1.00
Gain(S, Wind) = 0.940 - (8/14) 0.811 - (6/14) 1.0 = 0.048

Q.4. (a) Consider the following set of training examples: (8.5)
Instance | Classification | Attrb1 | Attrb2 | Attrb3
(ten training instances, with classifications c1, c2 and c3)
(a) What is the entropy of this collection of training examples with respect to the target function classification?
(b) What is the information gain of Attrb1 relative to these training examples?
(c) Create the decision tree for these training examples using ID3.
(d) Convert the decision tree into rules.
Ans. (a) Entropy = -5/10 log2(5/10) - 2/10 log2(2/10) - 3/10 log2(3/10) = 1.48
(b) E(Attrb1 = a) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811
E(Attrb1 = b) = -2/4 log2(2/4) - 1/4 log2(1/4) - 1/4 log2(1/4) = 1.5
E(Attrb1 = c) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1
Gain of Attrb1 = E - 4/10 E(Attrb1 = a) - 4/10 E(Attrb1 = b) - 2/10 E(Attrb1 = c) = 0.36
(c) The ID3 tree tests Attrb3 at the root, with branches for its values a, b, c and d; the a branch then tests Attrb2, the b branch tests Attrb1 (and, for Attrb1 = b, then Attrb2), while the c and d branches are leaves. The tree is equivalent to the rules in part (d).
(d) If (Attrb3 = a) ∧ (Attrb2 = T) Then Classification = c1
If (Attrb3 = a) ∧ (Attrb2 = F) Then Classification = c3
If (Attrb3 = b) ∧ (Attrb1 = a) Then Classification = c1
If (Attrb3 = b) ∧ (Attrb1 = b) ∧ (Attrb2 = T) Then Classification = c3
If (Attrb3 = b) ∧ (Attrb1 = b) ∧ (Attrb2 = F) Then Classification = c1
If (Attrb3 = c) Then Classification = c2
If (Attrb3 = d) Then Classification = c1

Q.4. (b) Give decision trees to represent the following boolean functions:
(a) A ∨ [B ∧ C]
(b) [A ∧ B] ∨ [C ∧ D]
Ans. (a) A ∨ [B ∧ C]: test A at the root; if A = True the answer is Yes; if A = False, test B; if B = False the answer is No; if B = True, test C, with C = True giving Yes and C = False giving No.
(b) [A ∧ B] ∨ [C ∧ D]: test A at the root. If A = True, test B; B = True gives Yes, while B = False leads to a test of C and then D (Yes only when both are True). If A = False, test C; C = False gives No, while C = True leads to a test of D, with D = True giving Yes and D = False giving No.
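A decision tree like the one in part (a) of Q.4(b) is just a chain of nested tests, which makes it easy to check against the boolean function it represents. The snippet below is an illustrative check (not part of the original answer) that the tree for A ∨ [B ∧ C] agrees with the formula on every truth assignment:

from itertools import product

def tree(A, B, C):
    # The tree from part (a): test A, then B, then C.
    if A:
        return True      # A = True -> Yes
    if not B:
        return False     # A = False, B = False -> No
    return C             # A = False, B = True -> answer follows C

print(all(tree(A, B, C) == (A or (B and C))
          for A, B, C in product([True, False], repeat=3)))   # True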
Q.5. What is the difference between supervised and unsupervised learning? (12.5)
Ans. Supervised machine learning: The majority of practical machine learning uses supervised learning. Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. More generally, classification learning is appropriate for any problem where deducing a classification is useful and the classification is easy to determine. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. The algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variables (Y) for that data. Supervised learning problems can be further grouped into regression and classification problems.
* Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
* Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
Some popular examples of supervised machine learning algorithms are:
* Logistic regression
* Decision trees
* Support vector machines (SVM)
* k-nearest neighbours
* Naive Bayes
* Random forests
* Linear regression
* Polynomial regression
* SVM for regression
Unsupervised learning: The goal is to have the computer learn how to do something that we don't tell it how to do. There are actually two approaches to unsupervised learning. The first approach is to teach the agent not by giving explicit categorizations, but by using some sort of reward system to indicate success. A second type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function, but simply to find similarities in the training data. The assumption is often that the clusters discovered will match reasonably well with an intuitive classification.
Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data. These are called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be further grouped into clustering and association problems.
* Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.
* Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
All clustering algorithms come under unsupervised learning algorithms, e.g.
* k-means clustering
* Hierarchical clustering
* Hidden Markov models
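The contrast can be made concrete with a short sketch. The example below (illustrative only, assuming scikit-learn and synthetic blob data) fits a supervised classifier on labeled data and then runs k-means clustering on the same inputs without using the labels at all:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: inputs X and labels y are both given; learn the mapping Y = f(X).
clf = LogisticRegression().fit(X, y)
print("predicted label:", clf.predict(X[:1]))

# Unsupervised: only X is given; the algorithm finds structure (clusters) on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])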
Q.6. What is the difference between model selection and feature selection? (12.5)
Ans. (a) Feature selection: In machine learning, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons:
* simplification of models to make them easier to interpret by researchers/users,
* shorter training times,
* enhanced generalization by reducing overfitting (formally, reduction of variance).
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of samples.
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features, finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets.
Feature selection algorithms
There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods.
1. Filter methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable. Some examples of filter methods include the chi-squared test, information gain and correlation coefficient scores.
2. Wrapper methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes to add and remove features. An example of a wrapper method is the recursive feature elimination algorithm.
3. Embedded methods
Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods. Regularization methods are also called penalization methods: they introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients). Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
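As an illustration of a filter method (a sketch only, assuming scikit-learn and its bundled iris dataset, neither of which the answer mentions), the snippet below scores each feature with the chi-squared test and keeps the two highest-scoring ones; no predictive model is trained during the selection itself:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Filter method: rank features by a statistical score, keep the top k.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("chi-squared score per feature:", selector.scores_)
print("reduced data shape:", selector.transform(X).shape)   # (150, 2)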
(b) Model selection: It is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice.
Model selection is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model predicting those observations. Once the set of candidate models has been chosen, the statistical analysis allows us to select the best of these models. A good model selection technique will balance goodness of fit with simplicity. More complex models will be better able to adapt their shape to fit the data, but the additional parameters may not represent anything useful. Goodness of fit is generally determined using a likelihood ratio approach, or an approximation of this, leading to a chi-squared test. The complexity is generally measured by counting the number of parameters in the model.
Model selection techniques can be considered as estimators of some physical quantity, such as the probability of the model producing the given data. The bias and variance are both important measures of the quality of this estimator; efficiency is also often considered. A standard example of model selection is that of curve fitting, where, given a set of points and other background knowledge, we must select a curve that describes the function that generated the points.

Q.7. Explain any two: (12.5)
(a) Markov Decision Process (MDP)
(b) Bellman's Equation
(c) Value function approximation algorithm
Ans. (a) Markov Decision Process (MDP)
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
* S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)
* A is a set of actions. (For example, the set of all possible directions in which you can push the helicopter's control sticks.)
* P_sa are the state transition probabilities. For each state s ∈ S and action a ∈ A, P_sa is a distribution over the state space: P_sa gives the distribution over what states we will transition to if we take action a in state s.
* γ ∈ [0, 1) is called the discount factor.
* R : S × A → R is the reward function. (Rewards are sometimes also written as a function of a state only, in which case we would have R : S → R.)
The dynamics of an MDP proceed as follows: We start in some state s0, and get to choose some action a0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ~ P_{s0 a0}. Then, we get to pick another action a1. As a result of this action, the state transitions again, now to some s2 ~ P_{s1 a1}. We then pick a2, and so on. Pictorially, we can represent this process as follows:
s0 --a0--> s1 --a1--> s2 --a2--> s3 --a3--> ...
Upon visiting the sequence of states s0, s1, ... with actions a0, a1, ..., our total payoff is given by
R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ...
Or, when we are writing rewards as a function of the states only, this becomes
R(s0) + γ R(s1) + γ² R(s2) + ...
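As a small worked example of that payoff formula (the rewards and discount factor below are made up for illustration and are not from the paper):

# Discounted total payoff R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for one trajectory.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]          # R(s0), R(s1), R(s2), R(s3)

total_payoff = sum(gamma ** t * r for t, r in enumerate(rewards))
print(total_payoff)                      # 1 + 0 + 0.81*2 + 0.729*5 = 6.265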
Ans. (b) Bellman's Equation
A Bellman equation is also known as a dynamic programming equation. It is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It writes the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into simpler subproblems, as Bellman's "Principle of Optimality" prescribes.
Almost any problem which can be solved using optimal control theory can also be solved by analyzing the appropriate Bellman equation. However, the term 'Bellman equation' usually refers to the dynamic programming equation associated with discrete-time optimization problems. The equation can be simplified further if we drop time subscripts and plug in the value of the next state:
V(x) = max_a { F(x, a) + β V(T(x, a)) }.
This is the Bellman equation. It is classified as a functional equation, because solving it means finding the unknown function V, which is the value function. Recall that the value function describes the best possible value of the objective, as a function of the state x. By calculating the value function, we will also find the function a(x) that describes the optimal action as a function of the state; this is called the policy function.
For a general stochastic sequential optimization problem with Markovian shocks, where the agent is faced with its decision ex-post, the Bellman equation takes a very similar form:
V(x, z) = max_c { F(x, c, z) + β ∫ V(T(x, c), z′) dH_z(z′) }.
Ans. (c) Value function approximation algorithm
Value function approximation has been successfully applied to many reinforcement learning problems. It is an alternative method for finding policies in continuous-state MDPs, in which we approximate V* directly, without resorting to discretization. To develop a value function approximation algorithm, we will assume that we have a model, or simulator, for the MDP. Informally, a simulator is a black box that takes as input any (continuous-valued) state s_t and action a_t, and outputs a next state s_{t+1} sampled according to the state transition probabilities P_{s_t a_t}.
There are several ways that one can get such a model. One is to use physics simulation. For example, the simulator for the inverted pendulum in PS4 was obtained by using the laws of physics to calculate what position and orientation the cart/pole will be in at time t + 1, given the current state at time t and the action a taken, assuming that we know all the parameters of the system such as the length of the pole, the mass of the pole, and so on. Alternatively, one can also use an off-the-shelf physics simulation software package which takes as input a complete physical description of a mechanical system, the current state s_t and action a_t, and computes the state s_{t+1} of the system a small fraction of a second into the future.
An alternative way to get a model is to learn one from data collected in the MDP. For example, suppose we execute m trials in which we repeatedly take actions in an MDP, each trial for T timesteps. This can be done by picking actions at random, executing some specific policy, or via some other way of choosing actions. We would then observe m state sequences of the form
s0^(i) --a0^(i)--> s1^(i) --a1^(i)--> s2^(i) --a2^(i)--> ... --a_{T-1}^(i)--> s_T^(i),  for i = 1, ..., m.
We can then apply a learning algorithm to predict s_{t+1} as a function of s_t and a_t. For example, one may choose to learn a linear model of the form
s_{t+1} = A s_t + B a_t,
using an algorithm similar to linear regression. Here, the parameters of the model are the matrices A and B, and we can estimate them using the data collected from our m trials, by picking
arg min_{A,B} Σ_{i=1}^{m} Σ_{t=0}^{T-1} || s_{t+1}^(i) - (A s_t^(i) + B a_t^(i)) ||².
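A minimal sketch of that last estimation step, assuming NumPy and trajectories simulated from made-up true matrices (nothing here comes from the paper itself): stacking each [s_t, a_t] pair into one design matrix lets the whole arg min be solved as a single least-squares problem.

import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])   # hypothetical dynamics used to generate data
B_true = np.array([[0.0], [0.5]])

S, U, S_next = [], [], []
for _ in range(20):                           # m = 20 trials
    s = rng.normal(size=2)
    for _ in range(30):                       # T = 30 timesteps per trial
        a = rng.normal(size=1)
        s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=2)
        S.append(s); U.append(a); S_next.append(s_next)
        s = s_next

# Solve min_{A,B} sum ||s_{t+1} - (A s_t + B a_t)||^2 via least squares on [s_t, a_t].
Z = np.hstack([np.array(S), np.array(U)])
W, *_ = np.linalg.lstsq(Z, np.array(S_next), rcond=None)
A_hat, B_hat = W.T[:, :2], W.T[:, 2:]
print(np.round(A_hat, 2))                     # close to A_true
print(np.round(B_hat, 2))                     # close to B_true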
MODEL PAPER-II
END TERM EXAMINATION [MAY-2016]
EIGHTH SEMESTER [B.TECH] MACHINE LEARNING [ETCS-402]
Time: 3 Hrs. M.M.: 75
Note: Q.No. 1 is compulsory. Attempt any four from others.

Q.1. (a) What do you mean by machine learning? Give some examples of machine learning. (5)
Ans. Machine learning studies computer algorithms for learning to do stuff. The learning that is being done is always based on some sort of observations or data, such as examples, direct experience, or instruction. So in general, machine learning is about learning to do better in the future based on what was experienced in the past. The emphasis of machine learning is on automatic methods. In other words, the goal is to devise learning algorithms that do the learning automatically without human intervention or assistance. The machine learning paradigm can be viewed as "programming by example." Often we have a specific task in mind, such as spam filtering. But rather than program the computer to solve the task directly, in machine learning we seek methods by which the computer will come up with its own program based on examples that we provide. Machine learning is a core subarea of artificial intelligence. There are many examples of machine learning problems. Here are several:
* Optical character recognition: categorize images of handwritten characters by the letters represented
* Face detection: find faces in images (or indicate if a face is present)
* Spam filtering: identify email messages as spam or non-spam
* Topic spotting: categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
* Spoken language understanding: within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories
* Medical diagnosis: diagnose a patient as a sufferer or non-sufferer of some disease
* Customer segmentation: predict, for instance, which customers will respond to a particular promotion
* Fraud detection: identify credit card transactions (for instance) which may be fraudulent in nature
* Weather prediction: predict, for instance, whether or not it will rain tomorrow.

Q.1. (b) Explain in brief supervised machine learning. Also give some examples of supervised machine learning. (5)
Ans. Supervised machine learning: The majority of practical machine learning uses supervised learning. Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. More generally, classification learning is appropriate for any problem where deducing a classification is useful and the classification is easy to determine. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. The algorithm iteratively makes predictions on the training data and is corrected by the teacher.
Learning stops when the algorithm achieves an acceptable level of performance. Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variables (Y) for that data. Supervised learning problems can be further grouped into regression and classification problems.
* Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
* Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
Some popular examples of supervised machine learning algorithms are:
* Logistic regression
* Decision trees
* Support vector machines (SVM)
* k-nearest neighbours
* Naive Bayes
* Random forests
* Linear regression
* Polynomial regression
* SVM for regression

Q.1. (c) What is a perceptron? Explain in brief. (5)
Ans. One type of ANN system is based on a unit called a perceptron. A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise. More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is
o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise,
where each wi is a real-valued constant, or weight, that determines the contribution of input xi to the perceptron output. Notice the quantity (-w0) is a threshold that the weighted combination of inputs w1x1 + ... + wnxn must surpass in order for the perceptron to output a 1.
To simplify notation, we imagine an additional constant input x0 = 1, allowing us to write the above inequality as Σ_{i=0}^{n} wi xi > 0, or in vector form as w·x > 0. For brevity, we will sometimes write the perceptron function as
o(x) = sgn(w·x), where sgn(y) = 1 if y > 0 and -1 otherwise.
We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances (i.e., points). The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on the other side.
(Figure: a perceptron. The inputs x1, ..., xn are multiplied by weights w1, ..., wn and summed together with w0; the unit outputs 1 if the sum is greater than 0 and -1 otherwise.)
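A minimal sketch of the unit just described, together with the standard perceptron training rule, is shown below (the toy data and learning rate are made up for illustration; only NumPy is assumed):

import numpy as np

# Toy linearly separable data; targets are +1 / -1 as in the perceptron definition.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])

X1 = np.hstack([np.ones((len(X), 1)), X])    # prepend the constant input x0 = 1
w = np.zeros(X1.shape[1])                    # weights w0, w1, w2
eta = 0.1                                    # learning rate

def output(w, x):
    return 1 if w @ x > 0 else -1            # o(x) = sgn(w . x)

for _ in range(20):                          # a few passes over the data
    for x, target in zip(X1, t):
        o = output(w, x)
        w = w + eta * (target - o) * x       # perceptron training rule
print(w, [output(w, x) for x in X1])         # learned weights and the (now correct) outputs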
After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data. Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice. Q.1. (e) What do you understand by POMDP? 6) Ans. A Partially Observable Markov Decision Process: (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it ig assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP. ‘The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine 4-MP-I Righth Semester, Machine Learning maintenance, and planning under uncertainty in general. An exact ae 7 a Pomp yields the optimal action for each possible belief over the wor 14 states. The opting action maximizes (or minimizes) the expected reward (or cos ine 1 ee Over Possibly infinite horizon. The sequence of optimal actions is known as the oPting policy of the agent for interacting with its environment. A discrete-time POMDP models the relationship between an’ agent ang it environment. Formally, a POMDP is a 7-tuple (S, A,T, R, Q O,y) where * S is a set of states, * Aisa set of actions * Tis a set of conditional transition probabilities between states *R:SXA> Ris the reward function * Q isa set of observations * O is a set of conditional observation probabilities, and * ye [0,1] is the discount factor Q.2. Give the different aspects of designing a learning system. (12.5) Ans. In order to illustrate some of the basic design issues and approaches to machine learning, let us consider designing a program to learn'to play checkers, with the goal of entering it in the world checkers tournament. 1. Choosing the Training Experience The first design choice we face is to choose the type of training experience from which our system will learn. The type of training experience available can have a significant impact on success or failure of the learner. One key attribute is whether the Asecond important attribute of the training experience is the degree to which the learner controls the sequence of: training examples. For example, the learner might rely on the teacher to select informative board states and to provide the correct move for cach. Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move. Or the learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present. A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured. In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples, . 
Q.1. (e) What do you understand by POMDP? (5)
Ans. A Partially Observable Markov Decision Process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.
The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple (S, A, T, R, Ω, O, γ), where
* S is a set of states,
* A is a set of actions,
* T is a set of conditional transition probabilities between states,
* R : S × A → R is the reward function,
* Ω is a set of observations,
* O is a set of conditional observation probabilities, and
* γ ∈ [0, 1] is the discount factor.

Q.2. Give the different aspects of designing a learning system. (12.5)
Ans. In order to illustrate some of the basic design issues and approaches to machine learning, let us consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament.
1. Choosing the Training Experience
The first design choice we face is to choose the type of training experience from which our system will learn. The type of training experience available can have a significant impact on success or failure of the learner. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples. For example, the learner might rely on the teacher to select informative board states and to provide the correct move for each. Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move. Or the learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present.
A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured. In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
A checkers learning problem:
* Task T: playing checkers
* Performance measure P: percent of games won against opponents
* Training experience E: playing practice games against itself
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program. Let us begin with a checkers-playing program that can generate the legal moves from any board state. The program needs only to learn how to choose the best move from among these legal moves. This learning task is representative of a large class of tasks for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known.
Given this setting where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state. Let us call this function ChooseMove and use the notation
ChooseMove : B → M
to indicate that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
We find it useful to define one particular target function V among the many that produce optimal play. As we shall see, this will make it easier to design a training algorithm. Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100;
2. if b is a final board state that is lost, then V(b) = -100;
3. if b is a final board state that is drawn, then V(b) = 0;
4. if b is not a final state in the game, then V(b) = V(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).
While this recursive definition specifies a value of V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable. Except for the trivial cases (cases 1-3) in which the game has already ended, determining the value of V(b) for a particular board state requires (case 4) searching ahead for the optimal line of play, all the way to the end of the game. Because this definition is not efficiently computable by our checkers-playing program, we say that it is a nonoperational definition. The goal of learning in this case is to discover an operational description of V.
3. Choosing a Representation for the Target Function
Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V̂ that it will learn. In general, this choice of representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V. On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.
Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
* x1: the number of black pieces on the board
* x2: the number of red pieces on the board
* x3: the number of black kings on the board
* x4: the number of red kings on the board
* x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
* x6: the number of red pieces threatened by black
Thus, our learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6,
where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm. Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board, whereas the weight w0 will provide an additive constant to the board value.

Q.3. Explain in detail the concept of decision trees in machine learning. (12.5)
Ans. Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
(Figure: a decision tree for the concept PlayTennis. Outlook is tested at the root; the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).)
The figure shows a decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf. For instance, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., PlayTennis = no).
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. For example, the decision tree above corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak).

Q.4. (a) Consider the following set of training examples: (8.5)
Instance | Classification | a1 | a2
(six training instances with classifications + and -)
(a) What is the entropy of this collection of training examples with respect to the target function classification?
(b) What is the information gain of a2 relative to these training examples?
(c) Create the decision tree for these training examples using ID3.
Ans. (a) Entropy = -3/6 log2(3/6) - 3/6 log2(3/6) = 1
(b) E(a2 = T) = -2/4 log2(2/4) - 2/4 log2(2/4) = 1
E(a2 = F) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1
Gain of a2 = E(S) - 4/6 E(a2 = T) - 2/6 E(a2 = F) = 1 - 4/6 - 2/6 = 0
(c) Since Gain(S, a1) > Gain(S, a2), attribute a1 is placed at the root of the tree, and the subtrees below it are grown in the same way using the remaining attribute.

Q.4. (b) Give decision trees to represent the following boolean functions:
(a) A ∧ ¬B
(b) A XOR B
Ans. (a) A ∧ ¬B: test A at the root; if A = False the answer is No; if A = True, test B, with B = True giving No and B = False giving Yes.
(b) A XOR B: test A at the root and test B on both branches; if A = True, then B = True gives No and B = False gives Yes; if A = False, then B = True gives Yes and B = False gives No.
Q.5. Explain in detail Support Vector Machines. How is the optimal hyperplane computed? (12.5)
Ans. A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
Let's consider the following simple problem: for a linearly separable set of 2D points which belong to one of two classes, find a separating straight line.
(Figure: a set of 2D points from two classes, with several candidate separating lines.)
In the figure you can see that there exist multiple lines that offer a solution to the problem. Is any of them better than the others? We can intuitively define a criterion to estimate the worth of the lines: a line is bad if it passes too close to the points, because it will be noise sensitive and it will not generalize correctly. Therefore, our goal should be to find the line passing as far as possible from all points.
The operation of the SVM algorithm is then based on finding the hyperplane that gives the largest minimum distance to the training examples. Twice this distance receives the important name of margin within SVM theory. Therefore, the optimal separating hyperplane maximizes the margin of the training data.
How is the optimal hyperplane computed?
A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
Let's introduce the notation used to define the hyperplane formally:
f(x) = β0 + βᵀx,
where β is known as the weight vector and β0 as the bias. The optimal hyperplane can be represented in an infinite number of different ways by scaling β and β0. As a matter of convention, among all the possible representations of the hyperplane, the one chosen is
|β0 + βᵀx| = 1,
where x symbolizes the training examples closest to the hyperplane. In general, the training examples that are closest to the hyperplane are called support vectors. This representation is known as the canonical hyperplane.
Now, we use the result of geometry that gives the distance between a point x and a hyperplane (β, β0):
distance = |β0 + βᵀx| / ||β||.
In particular, for the canonical hyperplane, the numerator is equal to one and the distance to the support vectors is
distance_support vectors = |β0 + βᵀx| / ||β|| = 1 / ||β||.
Recall that the margin, here denoted as M, is twice the distance to the closest examples:
M = 2 / ||β||.
Finally, the problem of maximizing M is equivalent to the problem of minimizing a function L(β) subject to some constraints. The constraints model the requirement for the hyperplane to classify correctly all the training examples x_i. Formally,
min_{β, β0} L(β) = ½ ||β||²  subject to  y_i (βᵀx_i + β0) ≥ 1 for all i,
where y_i represents each of the labels of the training examples.
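A short sketch of these quantities (illustrative only; it assumes scikit-learn and a made-up separable 2D dataset): fitting a linear SVM recovers β and β0, from which the margin 2/||β|| can be read off directly.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of 2D points (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
beta, beta0 = clf.coef_[0], clf.intercept_[0]

print("support vectors:\n", clf.support_vectors_)
print("margin 2/||beta|| =", 2 / np.linalg.norm(beta))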
When they are added, they are typically wei hited in = way that is usually related to the weak learners’ accuracy. After a weak lene ia added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak Tearners focu: more on the examples that previous weak learners misclassified. ° Only algorithms that are provable boosting algorithms in the probabl. approximately correct learning formulation can accurately be called boosting elgnrithme. Other algorithms that are similar in spirit to boosting algorithms are sometimes ealled “jeveraging algorithms”. The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and perhaps the most significant historically as it was the first algorithm that could adapt to the weak learners. However, there are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, and others. AdaBoost / ‘AdaBoost (Adaptive Boosting) is a machine learning meta-algorithm. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output: of the other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (e.g, their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to astrong learner. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. AdaBoost refers to a particular method of training a boosted classifier. A boost classifier is a classifier in the form T = YA@ Fy) Dy fi where each F, is a weak Jearner that takes an object x as input and returns a value indicating the class of the object. For example in the two class problem, the sign of the weak learner output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the ‘Tt classifier will be positive if the sample is believed to be in the positive class and negative otherwise. Each weak learner produces an output, hypothesis h(x), for each sample in the training set, At each iteration ¢, a weak learner is elected and assigned a coefficient a, such that the sum training error Ey of the resulting T-stage boost classifier is minimized. p= DAR ae) rahe) eee i fier that has been built up to the previous stage of Here f, ,(x) is the boosted classifi nthe yrerens na of training, E(F) is some error function and f(x) = o,)(x) is the weal considered for addition to the final classifier. 12-MP-II Eighth Semester, Machine Learning Q.7. Explain any two (12.5) (a) Linear quadratic regulation (LQR) (6) Q-learning (c) Hidden Markov models Ans. (a) Linear quadratic regulation (LQR) The theory of optimal control is concerned.with operating a dynamic system at minimum cost. The case where the system dynamics are described by a set of linear differential equations and the cost is described by a quadratic function is called the LQ problem. 
Q.7. Explain any two: (12.5)
(a) Linear quadratic regulation (LQR)
(b) Q-learning
(c) Hidden Markov models
Ans. (a) Linear quadratic regulation (LQR)
The theory of optimal control is concerned with operating a dynamic system at minimum cost. The case where the system dynamics are described by a set of linear differential equations and the cost is described by a quadratic function is called the LQ problem. The LQR is an important part of the solution to the LQG (linear-quadratic-Gaussian) problem.
The cost V of a particular control input u is defined by the following integral:
V(u) = ∫_{t0}^{tf} l(x, u, t) dt + m(x(tf)),
where
(a) tf is the final time of the control problem,
(b) l is a scalar-valued function of x, u and t, and
(c) m is a function of x, called the terminal penalty function.
We assume that x0, t0 and tf are known, fixed values, and x(tf) is free. Our goal is to choose the control u[t0, tf] to minimize V.
A case which is typical in applications is where m and l are quadratic functions of their arguments. For an LTV model, the system description and cost are then given by
ẋ = A(t)x + B(t)u,  x(t0) = x0,
V(u) = ∫_{t0}^{tf} (xᵀQ(t)x + uᵀR(t)u) dt + xᵀ(tf) M x(tf),
where M, Q and R are positive semidefinite matrix-valued functions of time. These matrices can be chosen by the designer to obtain a desirable closed-loop response. The minimization of the quadratic cost V for a linear system is known as the linear quadratic regulator (LQR) problem.
(b) Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.
(c) Hidden Markov model (HMM): A hidden Markov model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network.
In simpler Markov models (like a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the output, dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states. The adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; the model is still referred to as a 'hidden' Markov model even if these parameters are known exactly.
Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
In its discrete form, a hidden Markov process can be visualized as a generalization of the urn problem with replacement (where each item from the urn is returned to the original urn before the next step). Consider this example: in a room that is not visible to an observer there is a genie. The room contains urns x1, x2, x3, ..., each of which contains a known mix of balls, each ball labeled y1, y2, y3, .... The genie chooses an urn in that room and randomly draws a ball from that urn. It then puts the ball onto a conveyor belt, where the observer can observe the sequence of the balls but not the sequence of urns from which they were drawn. The genie has some procedure to choose urns; the choice of the urn for the n-th ball depends only upon a random number and the choice of the urn for the (n − 1)-th ball. The choice of urn does not directly depend on the urns chosen before this single previous urn; therefore, this is called a Markov process. It can be described by the upper part of Figure 1.
The Markov process itself cannot be observed, only the sequence of labeled balls; thus this arrangement is called a 'hidden Markov process'. This is illustrated by the lower part of the diagram shown in Figure 1, where one can see that balls y1, y2, y3, y4 can be drawn at each state. Even if the observer knows the composition of the urns and has just observed a sequence of three balls, e.g. y1, y2 and y3 on the conveyor belt, the observer still cannot be sure which urn (i.e., at which state) the genie has drawn the third ball from. However, the observer can work out other information, such as the likelihood that the third ball came from each of the urns.
Fig. 1. Probabilistic parameters of a hidden Markov model (example): x — states; y — possible observations; a — state transition probabilities; b — output probabilities.
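That last computation is exactly what the forward algorithm of an HMM provides. The sketch below is a minimal illustration with made-up transition, emission and initial probabilities for a three-urn, four-label version of the example (none of these numbers come from the original answer); it returns the posterior probability of each urn after observing y1, y2, y3.

```python
import numpy as np

# Hypothetical parameters: 3 urns (hidden states), 4 ball labels (observations).
A = np.array([[0.6, 0.3, 0.1],        # A[i, j] = P(next urn j | current urn i)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
B = np.array([[0.5, 0.2, 0.2, 0.1],   # B[i, k] = P(drawing ball label k | urn i)
              [0.1, 0.4, 0.3, 0.2],
              [0.2, 0.1, 0.3, 0.4]])
pi = np.array([0.5, 0.3, 0.2])        # initial urn distribution

obs = [0, 1, 2]                       # observed ball labels y1, y2, y3

# Forward algorithm: alpha[i] = P(observations so far, current urn = i)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

# Posterior probability of each urn for the third ball
posterior = alpha / alpha.sum()
print(np.round(posterior, 3))
```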
FIRST TERM EXAMINATION [FEB. 2017]
EIGHTH SEMESTER [B.TECH.]
MACHINE LEARNING [ETCS-402]
Time : 1½ hrs.   M.M. : 30
Note: Q.1. is compulsory. Attempt any two more questions from the rest.
Q.1. (a) What do you understand by Reinforcement Learning? (2.5×4=10)
Ans. Reinforcement learning (RL) refers to the problem of a goal-directed agent interacting with an uncertain environment. The goal of an RL agent is to maximize a long-term scalar reward by sensing the state of the environment and taking actions which affect the state. At each step, an RL system gets evaluative feedback about the performance of its action, allowing it to improve the performance of subsequent actions. Several RL methods have been developed and successfully applied in machine learning to learn optimal policies for finite-state, finite-action, discrete-time Markov decision processes (MDPs).
An analogous RL control system can be viewed as a feedback loop in which the controller, based on state feedback and reinforcement feedback about its previous action, calculates the next control, which should lead to improved performance. The reinforcement signal is the output of a performance evaluator function, which is typically a function of the state and the control. An RL system has an objective similar to that of an optimal controller, which aims to optimize a long-term performance criterion while maintaining stability. RL methods typically estimate the value function, which is a measure of the goodness of a given action in a given state.
Q.1. (b) What is overfitting?
Ans. Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning the tree after it has learned, in order to remove some of the detail it has picked up.
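To make this concrete, the following sketch contrasts an unrestricted decision tree with a depth-limited one on noisy synthetic data. It assumes scikit-learn is available, and the dataset, noise level and depth limit are made-up illustrative choices rather than anything prescribed by the answer; typically the unrestricted tree scores near 100% on the training set but noticeably lower on held-out data, which is the signature of overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data (illustrative only)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (None, 3):   # None = grow the tree fully; 3 = constrained tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```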
Q.1. (c) What do you understand by a partially observable Markov decision process (POMDP)?
Ans. A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.
The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment. A discrete-time POMDP models the relationship between an agent and its environment.
Formally, a POMDP is a 7-tuple (S, A, T, R, Ω, O, γ), where
* S is a set of states,
* A is a set of actions,
* T is a set of conditional transition probabilities between states,
* R : S × A → ℝ is the reward function,
* Ω is a set of observations,
* O is a set of conditional observation probabilities, and
* γ ∈ [0, 1] is the discount factor.
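To show how T and O are used in practice, the sketch below performs the standard belief update b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s) that a POMDP agent carries out after taking an action and receiving an observation. The two-state transition and observation matrices and the initial belief are made-up illustrative values for a single fixed action, not taken from the original answer.

```python
import numpy as np

# Hypothetical 2-state POMDP slice for a single fixed action a (illustrative values).
T = np.array([[0.7, 0.3],        # T[s, s'] = P(s' | s, a)
              [0.2, 0.8]])
O = np.array([[0.9, 0.1],        # O[s', o] = P(o | s', a)
              [0.3, 0.7]])

def belief_update(b, obs):
    """Return the new belief over states after taking action a and observing obs."""
    predicted = b @ T                      # sum over s of T(s'|s,a) * b(s)
    unnormalized = O[:, obs] * predicted   # weight by observation likelihood
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                   # initial belief over the two states
print(belief_update(b, obs=0))             # belief after observing o = 0
```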
Q.1. (d) Explain the concept of Hidden Markov model.
Ans. A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
In its discrete form, a hidden Markov process can be visualized with the urn example described earlier: a room not visible to the observer contains urns X1, X2, X3, ..., each of which contains a known mix of balls, each ball labeled y1, y2, y3, .... The genie chooses an urn in that room and randomly draws a ball from it, and the observer sees only the sequence of balls, never the sequence of urns.