UNIT II : Supervised Learning

Syllabus
Linear Regression Models : Least squares, single & multiple variables, Bayesian linear regression, gradient descent. Linear Classification Models : Discriminant function - Perceptron algorithm, Probabilistic discriminative model - Logistic regression, Probabilistic generative model - Naive Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random Forests.

Contents
2.1 Regression
2.2 Linear Classification Models
2.3 Probabilistic Generative Model
2.4 Maximum Margin Classifier : Support Vector Machine
2.5 Decision Tree
2.6 Random Forests
2.7 Two Marks Questions with Answers

2.1 Regression

• Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, the task is called regression.
• Regression algorithms therefore help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices etc.
• Fig. 2.1.1 shows regression : a scatter of data points with a fitted line, the dependent variable plotted against the independent variable.
• When the targets in a dataset are real numbers, the machine learning task is known as regression, and each sample in the dataset has a real-valued output or target.
• Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables. It can be utilized to assess the strength of the relationship between variables and to model the future relationship between them.
• The two basic types of regression are linear regression and multiple linear regression.

Linear Regression Models

• Linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
• The objective of a linear regression model is to find a relationship between the input variables and a target variable :
  1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
  2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
• Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city.
• Imagine that we fit a line with the training points we have. If we want to add another data point, then to fit it we would need to change the existing model. This will happen with each data point that we add to the model; hence, linear regression is not well suited to classification problems.
• The regression line gives the average relationship between the two variables in mathematical form. For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y : gives the best estimate for the value of X for any specific given value of Y :
      X = a + bY
  where a = X-intercept, b = slope of the line, X = dependent variable and Y = independent variable.
• Regression line of Y on X : gives the best estimate for the value of Y for any specific given value of X :
      Y = a + bX
  where a = Y-intercept, b = slope of the line, Y = dependent variable and X = independent variable.
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form :
      ŷ = a + bx,   equivalently   ŷ = ȳ + b(x - x̄)
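As a small illustration of fitting ŷ = a + bx by least squares, the closed-form estimates b = Sxy/Sxx and a = ȳ - b·x̄ can be computed in a few lines of Python. This sketch is not from the textbook; the data values and the function name fit_line are invented purely for the example.

    # Minimal sketch: fit y-hat = a + b*x by ordinary least squares.
    # The data values here are made up for illustration only.
    def fit_line(xs, ys):
        n = len(xs)
        x_mean = sum(xs) / n
        y_mean = sum(ys) / n
        # b = Sxy / Sxx, then a = y_mean - b * x_mean
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        sxx = sum((x - x_mean) ** 2 for x in xs)
        b = sxy / sxx
        a = y_mean - b * x_mean
        return a, b

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 2.9, 4.2, 4.8, 6.1]
    a, b = fit_line(xs, ys)
    print(f"y-hat = {a:.3f} + {b:.3f} x")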
• Fig. 2.1.2 shows the linear regression model : the fitted line ŷ = w0 + w1x, where w0 is the bias term (intercept) and w1 is the weight applied to the input vector x.
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear regression model, the variable of interest (the "dependent" variable) is predicted from k other variables (the "independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
      Yt = β0 + β1X1t + β2X2t + ... + βkXkt + εt
  where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• At each split point, the "error" between the predicted value and the actual values is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is continued recursively.
• The error function measures how much our predictions deviate from the desired answers. Mean-squared error :
      Jn = (1/n) Σ (yi - f(xi))²,  summed over i = 1, ..., n
• Advantages :
  a. Training a linear regression model is usually much faster than methods such as neural networks.
  b. Linear regression models are simple and require minimum memory to implement.
  c. By examining the magnitude and sign of the regression coefficients you can infer how predictor variables affect the target outcome.

Least Squares

• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.
• Consider an arbitrary straight line, ŷ = b0 + b1x, to be fitted through the data points. Which line is the most representative ? What are the values of b0 and b1 such that the resulting line "best" fits the data points ? (Fig. 2.1.3 : fitted line ŷ = b0 + b1x, with yi - ŷi as the error or residual.) And what goodness-of-fit criterion should be used to decide among all possible combinations of b0 and b1 ?
• The Least Squares (LS) criterion states that the sum of the squares of the errors is minimum. The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure the outputs to lie in the range [0, 1].
• How do we draw such a line based on the observed data points ? Suppose an imaginary line y = a + bx, and imagine a vertical distance (error) between the line and a data point, E = Y - E(Y), where E(Y) = a + bX. This error is the deviation of the data point from the imaginary line, the regression line (Fig. 2.1.4). Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation, so why do we use squares of deviations ? We obtain the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β. The process of getting parameter estimators (e.g. a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
• Disadvantages of least squares :
  1. Lack of robustness to outliers.
  2. Certain datasets are unsuitable for least squares classification.
  3. The decision boundary corresponds to the ML solution.
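The normal-equations form of this estimate, X = (AᵀA)⁻¹AᵀL, is exactly what the worked example below computes by hand. A minimal NumPy sketch (assuming NumPy is available; the four data points are taken from that example) :

    import numpy as np

    # Least squares via the normal equations: solve (A^T A) X = A^T L.
    x = np.array([3.00, 4.25, 5.50, 8.00])     # x values of points A-D (example below)
    y = np.array([4.50, 4.25, 5.50, 5.50])     # corresponding y values

    A = np.column_stack([x, np.ones_like(x)])  # design matrix, one row [x, 1] per point
    m, b = np.linalg.solve(A.T @ A, A.T @ y)   # slope m and intercept b
    residuals = A @ np.array([m, b]) - y       # V = AX - L

    print(round(m, 3), round(b, 3))            # about 0.246 and 3.663
    print(np.round(residuals, 2))              # about [-0.10  0.46 -0.48  0.13]

The printed values should agree with the hand computation in the example that follows.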
Example : Fit a straight line to the points in the table. Compute m and b by least squares.

  Point    x       y
  A        3.00    4.50
  B        4.25    4.25
  C        5.50    5.50
  D        8.00    5.50

Solution : Represent the observation equations in matrix form, with X = [m, b]ᵀ and residual vector V :

      | 3.00  1 |            | 4.50 |   | vA |
      | 4.25  1 |  | m |  =  | 4.25 | + | vB |
      | 5.50  1 |  | b |     | 5.50 |   | vC |
      | 8.00  1 |            | 5.50 |   | vD |

      X = (AᵀA)⁻¹ AᵀL = | 121.3125   20.7500 |⁻¹ | 105.8125 |  =  | 0.246 |
                        |  20.7500    4.0000 |   |  19.7500 |     | 3.663 |

      V = AX - L = | 3.00  1 |             | 4.50 |   | -0.10 |
                   | 4.25  1 |  | 0.246 |  | 4.25 |   |  0.46 |
                   | 5.50  1 |  | 3.663 | -| 5.50 | = | -0.48 |
                   | 8.00  1 |             | 5.50 |   |  0.13 |

Multiple Regression

• Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between the predictors and responses. Predictors can be continuous or categorical or a mixture of both.
• If multiple independent variables affect the response variable, then the analysis calls for a model different from that used for the single predictor variable. In a situation where more than one independent factor (variable) affects the outcome of a process, a multiple regression model is used. This is referred to as a multiple linear regression model or multivariate least squares fitting.
• Let Z1, ..., Zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form
      Yj = β0 + β1Zj1 + β2Zj2 + ... + βrZjr + εj
  where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.
• With n independent observations, we can write one model for each sample unit, so that
      Y = Zβ + ε
  where Y is n x 1, Z is n x (r + 1), β is (r + 1) x 1 and ε is n x 1.
• In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.
• In matrix form, we can arrange the data as

      Z = | 1  x11  x12 ... x1K |     Y = | y1 |     β = | β0 |
          | 1  x21  x22 ... x2K |         | y2 |         | β1 |
          | ...                 |         | ...|         | ...|
          | 1  xN1  xN2 ... xNK |         | yN |         | βK |

  where the β̂j are the estimates of the regression coefficients.

Difference between simple regression and multiple regression :

  Simple regression :
  • One dependent variable Y predicted from one independent variable X.
  • One regression coefficient.
  • r² : proportion of variation in the dependent variable Y predictable from X.

  Multiple regression :
  • One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xk).
  • One regression coefficient for each independent variable.
  • R² : proportion of variation in the dependent variable Y predictable from the set of independent variables (the X's).

Bayesian Linear Regression

• Bayesian linear regression provides a useful mechanism to deal with insufficient data, or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. A prior is a distribution on a parameter.
• If we could flip a coin an infinite number of times, inferring its bias would be easy by the law of large numbers. However, what if we could only flip the coin a handful of times ? Would we conclude that a coin is biased if we saw three heads in
three flips, an event that happens one out of eight times with unbiased coins ? MLE would overfit these data, inferring a coin bias of p = 1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are unbiased, i.e. that the prior on the bias parameter is peaked around one-half. The data must overwhelm this prior belief about coins.
• Bayesian methods allow us to estimate model parameters, to construct model forecasts and to conduct model comparisons, and Bayesian learning calculates explicit probabilities for hypotheses.
• Bayesian classifiers use a simple idea : the training data are utilized to calculate an observed probability of each class based on feature values. When the Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict the most likely class for the new features.
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability for each candidate hypothesis and a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities. Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
• Uses of Bayesian classifiers are as follows :
  1. Text-based classification, for example spam or junk mail filtering.
  2. Medical diagnosis.
  3. Network security, such as detecting illegal intrusion.
• The basic procedure for implementing Bayesian linear regression is :
  i) Specify priors for the model parameters.
  ii) Create a model mapping the training inputs to the training outputs.
  iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior distributions for the parameters.

Gradient Descent

• Goal : solving nonlinear minimization problems through derivative information.
• First and second order derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix. Derivative based optimization is also called nonlinear optimization; it is capable of determining search directions according to an objective function's derivative information.
• Derivative based optimization methods are used for :
  1. Optimization of nonlinear neuro-fuzzy models
  2. Neural network learning
  3. Regression analysis in nonlinear models
• Basic descent methods are as follows :
  1. Steepest descent
  2. Newton-Raphson method
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
• Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient -∇f(x), as the sketch below illustrates.
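A minimal Python sketch of this iteration, not taken from the textbook : the objective f(x, y) = x² + 10y², the starting point and the step size are all invented, illustrative choices. It uses the update x(i+1) = x(i) - λ∇f(x(i)) that is stated formally in the points that follow.

    # Gradient descent on an illustrative objective f(x, y) = x**2 + 10*y**2.
    def grad_f(x, y):
        # Analytic gradient of f: (df/dx, df/dy).
        return 2.0 * x, 20.0 * y

    x, y = 4.0, 2.0      # initial guess
    lam = 0.04           # step size lambda > 0, kept small so the jumps stay small
    for _ in range(100):
        gx, gy = grad_f(x, y)
        x, y = x - lam * gx, y - lam * gy   # x_{i+1} = x_i - lambda * grad f(x_i)

    print(x, y)          # both coordinates approach 0, the minimum of f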
• Locally, the negated gradient is the steepest descent direction, i.e. the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if xi lies at a local maximum.
• The gradient gives the slope of the curve at that x, and its direction points toward an increase in the function. So we change x in the opposite direction to lower the function value :
      xi+1 = xi - λ∇f(xi)
  The λ > 0 is a small number that forces the algorithm to make small jumps.

Limitations of Gradient Descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to many other methods.
• For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent :
• Steepest descent is also known as the gradient method.
• This method is based on a first order Taylor series approximation of the objective function. It is also called the saddle point method. Fig. 2.1.5 shows the steepest descent method.
• Steepest descent is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is the direction opposite to ∇f(x). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the minimum.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move along the chosen direction is found by a local (line) minimization rather than by a fixed gradient step. It is typically able to converge in few steps, but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also very stable; if the minimum points exist, the method is guaranteed to locate them after at least an infinite number of iterations.

2.2 Linear Classification Models

• A classification algorithm (classifier) makes its classification based on a linear predictor function combining a set of weights with the feature vector.
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier merges into its weights all the characteristics that define a particular class.
• Linear classifiers can represent a lot of things, but they can't represent everything. The classic example of what they can't represent is the XOR function.

Discriminant Function

• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique in supervised learning. Basically, it is a preprocessing step for pattern classification and machine learning applications. LDA is a powerful algorithm that can be used to determine the best separation between two or more classes.
• LDA is a supervised learning algorithm, which means that it requires a labelled training set of data points in order to learn the linear discriminant function.
• The main purpose of LDA is to find the line or plane that best separates data points belonging to different classes. The key idea behind LDA is that the decision boundary should be chosen such that it maximizes the distance between the means of the two classes while simultaneously minimizing the variance within each class's data (the within-class scatter). This criterion is known as the Fisher criterion.
• LDA is one of the most widely used machine learning algorithms due to its accuracy and flexibility. LDA can be used for a variety of tasks such as classification, dimensionality reduction and feature selection.
• Suppose we have two classes and we need to classify them efficiently; Fig. 2.2.1 shows how the classes are divided before LDA and after LDA.
• The LDA algorithm works based on the following steps :
  a) The first step is to calculate the means and standard deviation of each feature.
  b) The within-class scatter matrix and the between-class scatter matrix are calculated.
  c) These matrices are then used to calculate the eigenvectors and eigenvalues.
  d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
  e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.
  f) Once the transformation matrix transforms the data into the new space with k dimensions, LDA can then be used for classification or dimensionality reduction.
• Benefits of using LDA :
  a) LDA is used for classification problems.
  b) LDA is a powerful tool for dimensionality reduction.
  c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning algorithms.

Logistic Regression

• Logistic regression is a form of regression analysis in which the outcome variable is dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
• Logistic component : Instead of modelling the outcome, Y, directly, the method models the log odds of Y using the logistic function.
• Regression component : Methods used to quantify the association between an outcome and predictor variables. It could be used to build predictive models as a function of the predictors.
• Simple logistic regression is logistic regression with one predictor variable. The logistic regression model is
      ln[P(Y) / (1 - P(Y))] = β0 + β1X1 + β2X2 + ... + βkXk
• With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1 variable. Logistic regression is used to determine whether other measurements are related to the presence of some characteristic, for example, whether certain blood measures are predictive of having a disease.
• If analysis of covariance can be said to be a t-test adjusted for other variables, then logistic regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other variables.
• While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.
• Fig. 2.2.2 shows the sigmoid curve for logistic regression (a linear response compared with the S-shaped logistic response).
• The linear and logistic probability models are :
      Linear regression :    p = a0 + a1X1 + a2X2 + ... + akXk
      Logistic regression :  ln[p / (1 - p)] = b0 + b1X1 + b2X2 + ... + bkXk
• The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1 - p) is a linear function of the regressors.
• The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1.
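A small sketch contrasting the two models (again not from the textbook) : a1 = 0.05 echoes the interpretability example above, while a0, b0 and b1 are invented. It shows that the linear probability model can drift outside [0, 1], whereas the logistic model cannot.

    import math

    # Illustrative coefficients for the two models above.
    a0, a1 = 0.10, 0.05    # linear probability model: p = a0 + a1*x
    b0, b1 = -2.0, 0.25    # logistic model: ln(p / (1 - p)) = b0 + b1*x

    for x in (0, 5, 10, 20, 30):
        p_linear = a0 + a1 * x                          # can fall outside [0, 1]
        p_logistic = 1 / (1 + math.exp(-(b0 + b1 * x))) # always stays inside (0, 1)
        print(x, round(p_linear, 2), round(p_logistic, 2))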
