Logistic Regression and Support Vector Machine

Syllabus
Logistic Regression, Introduction to Support Vector Machine, The Dual Formulation, Maximum Margin with Noise, Nonlinear SVM and Kernel Functions, SVM : Solution to the Dual Problem.

Contents
5.1 Logistic Regression
5.2 Introduction to Support Vector Machine
5.3 Kernel Methods for Non-linearity

5.1 Logistic Regression

* Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. It is a statistical method used to model binary outcomes using predictor variables.
* Logistic component : Instead of modeling the outcome Y directly, the method models the log odds of Y using the logistic function.
* Regression component : Methods used to quantify the association between the outcome and the predictor variables. It could be used to build predictive models as a function of the predictors.
* Simple logistic regression is logistic regression with one predictor variable.

  log( P(Y) / (1 - P(Y)) ) = β0 + β1X1 + β2X2 + ... + βkXk
  Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

* With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1 variable.

Fig. 5.1.1 Logistic regression

* If analysis of covariance can be said to test the equality of means adjusted for other variables, logistic regression can be used to test the equality (homogeneity) of proportions adjusted for other variables. While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not itself predict the 0/1 variable.

5.2 Introduction to Support Vector Machine

* Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification. The SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
* An SVM is a kind of large-margin classifier : it is a vector-space-based machine learning method whose goal is to find a decision boundary between two classes that is maximally far from any point in the training data.
* Given a set of training examples, each marked as belonging to one of two categories, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible.
* New examples are then mapped into the same space and classified as belonging to a class based on which side of the gap they fall on.

Two Class Problems

* Many decision boundaries can separate these two classes. Which one should we choose ? The perceptron learning rule can be used to find any decision boundary between class 1 and class 2.

Fig. 5.2.1 Two class problem

* The line that maximizes the minimum margin is a good bet. The model class of "hyper-planes with a margin of m" has a low VC dimension if m is big. This maximum-margin separator is determined by a subset of the data points; data points in this subset are called "support vectors". It is useful computationally if only a small fraction of the data points are support vectors, because we use the support vectors to decide which side of the separator a test case is on.
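As a concrete illustration of the two classifiers introduced so far, the short Python sketch below fits a logistic regression model and a maximum-margin linear SVM on a small synthetic two-class dataset and reports the support vectors chosen by the SVM. This example is not part of the original text; the dataset, the scikit-learn estimators and all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: toy data and default parameters are assumptions,
# not taken from the textbook.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two well-separated Gaussian blobs stand in for the two classes.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

# Logistic regression models the log odds of the 0/1 response (Section 5.1).
log_reg = LogisticRegression().fit(X, y)
print("P(Y = 1 | x) for the first point :", log_reg.predict_proba(X[:1])[0, 1])

# A linear SVM looks for the maximum-margin separating hyperplane (Section 5.2).
svm = SVC(kernel="linear").fit(X, y)
print("Support vectors per class :", svm.n_support_)
print("Support vectors :\n", svm.support_vectors_)
```

Only the points listed in svm.support_vectors_ determine the separator; the remaining training examples could be removed without changing the decision boundary.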
Example of Bad Decision Boundaries

* SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.

Fig. 5.2.2 Bad decision boundary of SVM

* The empirical risk is the average loss of an estimator for a finite set of data drawn from P. The idea of risk minimization is not only to measure the performance of an estimator by its risk, but to actually search for the estimator that minimizes risk over the distribution P. Because we do not know the distribution P, we instead minimize the empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
* Fig. 5.2.3 shows empirical risk.

Fig. 5.2.3 Empirical risk versus the complexity of the function set

* If a data point lies very close to the boundary, the classifier may still be consistent but is more "likely" to make errors on new instances from the distribution. Hence, we prefer classifiers that maximize the minimal distance of the data points to the separator.

Fig. 5.2.4 Good decision boundary

1. Margin (m) : The margin is the minimum distance between the data points and the classifier boundary, that is, the minimum distance of any sample to the decision boundary. If the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector : the margin is given by the projection of the distance between two such points on the direction perpendicular to the hyperplane, and the margin of the separator is the distance between the support vectors.

   Margin = 2 / ||w||   (a numerical check of this formula appears at the end of this sub-section)

2. Maximal margin classifier : A classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and PAC theory, and implies that only support vectors matter; other training examples are ignorable.

Example : For the following figure, find a linear hyperplane (decision boundary) that will separate the data (Fig. 5.2.5).

Solution : Fig. 5.2.6 illustrates the reasoning :
1. One possible solution (B1).
2. A second possible solution (B2).
3. Other possible solutions.
4. Which one is better, B1 or B2 ?
5. How do you define "better" ?
6. Find the hyperplane that maximizes the margin.
7. B1 is better than B2.
8. The separating hyperplane is w·x + b = 0, with the margin boundaries w·x + b = +1 and w·x + b = -1.

In general, the SVM formulation proceeds in three steps :
1. Define what an optimal hyperplane is : maximize the margin.
2. Extend the above definition for non-linearly separable problems : add a penalty term for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces : reformulate the problem so that the data is mapped implicitly to this space.
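The margin formula above, Margin = 2 / ||w||, can be checked numerically. The sketch below is illustrative only : the synthetic dataset and the use of a very large C value to approximate a hard (maximal) margin are assumptions, not part of the original text.

```python
# Illustrative sketch: recovering the margin 2/||w|| of a (nearly) hard-margin
# linear SVM and checking that the support vectors lie on the canonical
# hyperplanes w.x + b = +1 and w.x + b = -1.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

# A very large C approximates the hard-margin classifier described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                        # weight vector of the hyperplane
print("Margin = 2/||w|| =", 2.0 / np.linalg.norm(w))

# w.x + b evaluated at the support vectors should be close to +1 or -1.
print("w.x + b at the support vectors :", clf.decision_function(clf.support_vectors_))
```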
5.2.1 Key Properties of Support Vector Machines

1. They use a single hyperplane which subdivides the space into two half-spaces, one of which is occupied by Class 1 and the other by Class 2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
3. They have the ability to handle large feature spaces.
4. Overfitting can be controlled by the soft margin approach.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.

* The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.

5.2.2 SVM Applications

* SVM has been used successfully in many real-world problems :
1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (protein classification, cancer classification)
4. Hand-written character recognition
5. Determination of SPAM email.

5.2.3 Limitations of SVM

1. It is sensitive to noise.
2. The biggest limitation of the SVM lies in the choice of the kernel.
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.

5.2.4 Soft Margin SVM

* For the very high dimensional problems common in text classification, the data are sometimes linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
* What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin. A soft margin allows a few points to cross into the margin or over the hyperplane, allowing misclassification.
* We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
* All observations have an associated slack variable (see the sketch after this list) :
1. Slack variable = 0 : the point lies on or beyond the margin, on the correct side of the hyperplane.
2. Slack variable > 0 : the point lies inside the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
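The role of C can be seen directly in code. The sketch below is illustrative only : the overlapping synthetic dataset and the particular C values are assumptions; it simply counts how many training points violate the margin (slack variable > 0) as C changes.

```python
# Illustrative sketch: the soft-margin trade-off controlled by C.  A small C
# tolerates many margin violations (large total slack); a large C penalizes
# violations heavily and shrinks the margin instead.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so the two classes are NOT linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
signs = np.where(y == 1, 1.0, -1.0)      # labels recoded as +1 / -1

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A point violates the margin (slack > 0) when y * (w.x + b) < 1.
    violations = (signs * clf.decision_function(X) < 1).sum()
    print(f"C = {C:>6} : support vectors = {clf.n_support_.sum():3d}, "
          f"margin violations = {violations:3d}")
```

In scikit-learn's SVC the parameter C is this trade-off : it weights the slack (misclassification) penalty against the margin size.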
5.2.5 Comparison of SVM and Neural Networks

Support Vector Machine : The kernel maps the data to a very high dimensional space.
Neural Network : The hidden layers map the input to lower dimensional spaces.

Example 5.2.1 : From the following diagram, identify which data points (1, 2, 3, 4, 5) are support vectors (if any), slack variables on the correct side of the classifier (if any) and slack variables on the wrong side of the classifier (if any). Mention which point will have maximum penalty and why. (Fig. 5.2.7)

Solution : Points 1 and 5 will have the maximum penalty.
* The margin (m) is the gap between the data points and the classifier boundary, that is, the minimum distance of any sample to the decision boundary; if the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector. A maximal margin classifier is a classifier in the family F that maximizes this margin, which is good according to intuition and PAC theory and implies that only the support vectors matter while the other training examples are ignorable.
* When the training set is not linearly separable, slack variables allow misclassification of difficult or noisy examples, giving a soft margin that lets a few points cross into the margin or over the hyperplane. The crossover is penalized according to the number and distance of the misclassifications, a trade-off between hyperplane violations and margin size : a point with slack variable = 0 lies on or beyond the margin on the correct side, a point with slack variable > 0 lies inside the margin or on the wrong side of the hyperplane, and C sets the trade-off between the slack penalty and the margin. Since the penalty of a point grows with how far it crosses the soft margin, points 1 and 5 incur the maximum penalty.

5.3 Kernel Methods for Non-linearity

* Kernel methods refer to a family of widely used nonlinear algorithms for machine learning tasks like classification, regression and feature extraction.
* Any non-linear problem (classification, regression) in the original input space can be converted into a linear one by making a non-linear mapping into a feature space of higher dimension, as shown in Fig. 5.3.1.
* Often we want to capture nonlinear patterns in the data.
  Nonlinear regression : The input-output relationship may not be linear.
  Nonlinear classification : The classes may not be separable by a linear boundary.
* Kernels make linear models work in nonlinear settings. Kernels, using a feature mapping φ, map the data to a new space where the original learning problem becomes easy.

Fig. 5.3.1 Mapping from the input space to the feature space

* Consider two data points x = (x1, x2) and z = (z1, z2). Suppose we have a function k which takes x and z as inputs and computes

  K(x, z) = (x^T z)^2 = (x1 z1 + x2 z2)^2
          = x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2
          = (x1^2, √2 x1 x2, x2^2)^T (z1^2, √2 z1 z2, z2^2)
          = φ(x)^T φ(z)   (an inner product; a numerical check appears at the end of this section)

Fig. 5.3.2 Computing the kernel function from the examples and their features

* The above k implicitly defines a mapping φ to a higher dimensional space : φ(x) = (x1^2, √2 x1 x2, x2^2).
* We did not need to pre-define or compute the mapping φ in order to compute K(x, z).
* The function k is known as the kernel function. K is the N × N matrix of pairwise similarities between the examples in the feature space.

Advantages :
1. The kernel defines a similarity measure between two data points and thus allows one to incorporate prior knowledge of the problem domain.
2. Most importantly, the kernel contains all of the information about the relative positions of the inputs in the feature space; the actual learning algorithm is based only on the kernel function and can thus be carried out without explicit use of the feature space.
3. The number of operations required is not necessarily proportional to the number of features.
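As a quick numerical check of the identity K(x, z) = (x^T z)^2 = φ(x)^T φ(z) derived above, the sketch below evaluates both sides for two arbitrary 2-dimensional points. The example points and the helper name phi are illustrative assumptions, not from the original text.

```python
# Illustrative check: the degree-2 polynomial kernel equals an inner product
# in the feature space defined by phi, without needing phi at prediction time.
import numpy as np

def phi(v):
    """Explicit feature map for k(x, z) = (x^T z)^2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])    # arbitrary example points
z = np.array([3.0, -1.0])

k_input_space = np.dot(x, z) ** 2           # kernel computed in the input space
k_feature_space = np.dot(phi(x), phi(z))    # inner product in the feature space

print(k_input_space, k_feature_space)       # both print 1.0
```

The kernel value is obtained entirely in the original two-dimensional input space, while the equivalent inner product lives in a three-dimensional feature space; this is exactly the point of advantage 2 above.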
