SVM, RF, Decision Tree

The document discusses supervised learning, focusing on linear regression and its advantages, such as speed and simplicity. It explains the least squares method for parameter estimation and introduces gradient descent as an optimization technique. It also covers Naive Bayes classifiers, conditional probability, Bayes' theorem, decision tree learning, random forests, and Support Vector Machines for classification tasks.

Linear Regression
• Regression analysis is used to explain the relationship between one dependent variable and one or more independent variables. Classification predicts categorical labels (classes); prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute, and uses this in classifying new data. Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
• Regression analysis is the art and science of fitting straight lines to patterns of data. The regression line gives the average relationship between the two variables (Fig. 2.1.2).
• For two variables X and Y, there are always two lines of regression :
  Regression line of X on Y : Gives the best estimate for the value of X for any specific given value of Y :
    X = a + bY
  where a = X-intercept, b = slope of the line, X = dependent variable, Y = independent variable.
  Regression line of Y on X : Gives the best estimate for the value of Y for any specific given value of X :
    Y = a + bX
  where a = Y-intercept, b = slope of the line, Y = dependent variable, X = independent variable.
• In a linear regression model, the variable of interest (the "dependent" variable) is predicted from k other variables (the "independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
    Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt
  where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• At each split point, the "error" between the predicted value and the actual values is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is recursively continued.
• The error function measures how much our predictions deviate from the desired answers. Mean-squared error :
    J = (1/n) Σ_{i=1..n} (y_i − f_i)²
• Advantages :
  a. Training a linear regression model is usually much faster than methods such as neural networks.
  b. Linear regression models are simple and require minimum memory to implement.
  c. By examining the magnitude and sign of the regression coefficients you can infer how predictor variables affect the target outcome.

Least Squares
• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.
• Consider an arbitrary straight line, y = b0 + b1 x, to be fitted through these data points (Fig. 2.1.3). The question is : which line is the most representative ? What are the values of b0 and b1 such that the resulting line "best" fits the data points ?
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form
    ŷ = a + bX
  where the least squares estimates are b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − b X̄.
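A minimal sketch of this fit in Python with NumPy (the x and y values are made-up illustrative data, not from the text) :

import numpy as np

# Toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least squares estimates of slope b and intercept a for y_hat = a + b*x,
# i.e. the a, b that minimize the sum of squared vertical deviations.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"y_hat = {a:.3f} + {b:.3f} x")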
• Which goodness-of-fit criterion should be used to determine, among all possible combinations of b0 and b1, the best fitting line ?
• The Least Squares (LS) criterion states that the sum of the squares of the errors is minimum. The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
• How do we draw such a line based on the observed data points ? Suppose an imaginary line y = a + bx. Imagine the vertical distance between the line and a data point, E = Y − E(Y). This error is the deviation of the data point from the imaginary line, the regression line (Fig. 2.1.4). Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation. Then why do we use squares of the deviation ? We choose the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β. The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
• Disadvantages of least squares :
  1. Lacks robustness to outliers.
  2. Certain datasets are unsuitable for least squares classification.
  3. The decision boundary corresponds to the maximum likelihood (ML) solution.

Gradient Descent
• Goal : Solving nonlinear minimization problems through derivative information.
• First and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix.
• Derivative based optimization is also called nonlinear optimization. It is capable of determining search directions according to an objective function's derivative information.
• Derivative based optimization methods are used for :
  1. Optimization of nonlinear neuro-fuzzy models
  2. Neural network learning
  3. Regression analysis in nonlinear models
• Basic descent methods are as follows :
  1. Steepest descent
  2. Newton-Raphson method
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x0, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient −∇f(x). Locally, the negated gradient is the steepest descent direction, i.e. the direction in which x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or stop at a local maximum.
• The gradient gives the slope of the curve at that x, and its direction points toward an increase in the function. So we change x in the opposite direction to lower the function value :
    x_{k+1} = x_k − λ ∇f(x_k)
  where λ > 0 is a small number that forces the algorithm to make small jumps.
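A minimal sketch of this update rule in Python with NumPy (the quadratic test function, the step size lam and the stopping tolerance are illustrative assumptions, not from the text) :

import numpy as np

def gradient_descent(grad_f, x0, lam=0.1, tol=1e-6, max_iter=1000):
    # Repeat x_{k+1} = x_k - lam * grad_f(x_k) until the step becomes tiny.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lam * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Example objective: f(x, y) = (x - 3)^2 + 2*(y + 1)^2, minimum at (3, -1)
grad = lambda v: np.array([2.0 * (v[0] - 3.0), 4.0 * (v[1] + 1.0)])
print(gradient_descent(grad, x0=[0.0, 0.0]))

With a small fixed λ the iterates approach (3, −1); too large a λ makes the iterates overshoot and zigzag, which relates to the limitations discussed next.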
Limitations of Gradient Descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to that of many other methods.
• For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent :
• Steepest descent is also known as the gradient descent method. This method is based on a first order Taylor series approximation of the objective function. It is also called the saddle point method. (Fig. 2.1.5 : Steepest descent method)
• Steepest descent is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is the direction opposite to ∇f(x_i). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move is computed using a local minimization rather than computing a gradient. It is typically able to converge in few steps, but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction.

Naive Bayes
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. They are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem.
• A Naive Bayes classifier is a program which predicts a class value given a set of attributes.
• For each known class value :
  1. Calculate probabilities for each attribute, conditional on the class value.
  2. Use the product rule to obtain a joint conditional probability for the attributes.
  3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption, but it results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we have a probability of a data instance belonging to that class, as in the sketch below.
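A minimal sketch of these steps for categorical attributes (plain Python; the tiny weather-style dataset, the function names and the absence of smoothing are illustrative assumptions, not part of the text) :

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    # Step 1: per-class counts of each attribute value -> conditional probabilities
    class_counts = Counter(y)
    cond_counts = defaultdict(Counter)      # (attribute index, class) -> value counts
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            cond_counts[(j, c)][v] += 1
    priors = {c: n / len(y) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    # Steps 2-3: multiply the conditional probabilities and the prior, take the argmax
    best_class, best_score = None, -1.0
    for c in priors:
        score = priors[c]
        for j, v in enumerate(x):
            score *= cond_counts[(j, c)][v] / class_counts[c]   # P(attribute value | class)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
y = ["yes", "no", "yes", "no", "yes"]
priors, cond_counts, class_counts = train_naive_bayes(X, y)
print(predict(("sunny", "no"), priors, cond_counts, class_counts))   # -> "yes"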
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is :
    P(B|A) = P(A ∩ B) / P(A)   or equivalently   P(A ∩ B) = P(A) P(B|A)
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e. the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A, given B, is
    P(A|B) = P(A ∩ B) / P(B)
• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, A ∩ B = ∅ and P(A|B) = 0.
• Another way to look at the conditional probability formula is :
    P(Second | First) = P(First choice and second choice) / P(First choice)
• Conditional probability is a defined quantity and cannot be proven.
• The key to solving conditional probability problems is to :
  1. Define the events.
  2. Express the given information and question in probability notation.
  3. Apply the formula.

Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found by multiplying the two probabilities. Thus for two events A and B the special rule of multiplication, shown symbolically, is : P(A and B) = P(A) P(B).

Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different. Bayes' theorem gives a relation between P(A|B) and P(B|A).
• An important application of Bayes' theorem is that it gives a rule for how to update or revise the strengths of evidence-based beliefs in the light of new evidence, a posteriori.
• A prior probability is an initial probability value originally obtained before any additional information is obtained. A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that B1, B2, B3, ..., Bk partition the outcomes of an experiment and that A is another event. For any j, with 1 ≤ j ≤ k,
    P(Bj | A) = P(A | Bj) P(Bj) / [ P(A | B1) P(B1) + ... + P(A | Bk) P(Bk) ]

Decision Tree
• The final steps of the Generate_decision_tree(D, attribute_list) algorithm are :
    ... attribute_list ← attribute_list − splitting_attribute;
    9. For (each outcome j of splitting criterion)
    10.   Let Dj be the set of data tuples in D satisfying outcome j;
    11.   If Dj is empty then attach a leaf labeled with the majority class in D to node N;
    12.   Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    13. End of for loop
    14. Return N;
• Decision tree generation consists of two phases : tree construction and tree pruning.
• In the tree construction phase, all the training examples are at the root; examples are partitioned recursively based on selected attributes (a short usage sketch follows the list below).
• In the tree pruning phase, branches that reflect noise or outliers are identified and removed.
• There are various paradigms that are used for learning binary classifiers, which include :
  1. Decision Trees
  2. Neural Networks
  3. Bayesian Classification
  4. Support Vector Machines
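A brief sketch of tree construction in practice (assuming Python with scikit-learn and its bundled iris dataset, neither of which is mentioned in the text) :

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree construction: recursive partitioning on the attribute/split point chosen
# by the splitting criterion; max_depth acts as a simple pre-pruning control.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

Here max_depth limits growth up front; scikit-learn also exposes cost-complexity pruning via the ccp_alpha parameter, rather than a separate pruning pass.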
Fig. 2.5.1 : Decision tree - a tree with the training set class distribution in the leaves, and the decision tree obtained using the majority class decision rule.

Appropriate Problems for Decision Tree Learning
• Decision tree learning is generally best suited to problems with the following characteristics :
  1. Instances are represented by attribute-value pairs. There is a fixed set of attributes, and the attributes take a small number of disjoint possible values.
  2. The target function has discrete output values.
  4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in the classifications of the training examples and errors in the attribute values that describe these examples.
  5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values.
  6. Decision tree learning has been applied to problems such as learning to classify.

Advantages and Disadvantages of Decision Tree
Advantages :
1. Rules are simple and easy to understand.
2. Decision trees can handle both nominal and numerical attributes.
3. Decision trees are capable of handling datasets that may have errors.
4. Decision trees are capable of handling datasets that may have missing values.
5. Decision trees are considered to be a nonparametric method.
6. Decision trees are self-explanatory.
Disadvantages :
1. Most of the algorithms require that the target attribute has only discrete values.
2. Some problems are difficult to solve, like XOR.
3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.

Random Forests
• Random forest is a popular machine learning algorithm that belongs to the supervised learning method. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the overall performance of the model.
• As the name indicates, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, it predicts the final output.
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

How Does the Random Forest Algorithm Work ?
• Random forest works in two phases : the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
• The working process can be explained in the following steps and diagram :
  Step 1 : Select random K data points from the training set.
  Step 2 : Build the decision trees associated with the selected data points (subsets).
  Step 3 : Choose the number N of decision trees we want to build.
  Step 4 : Repeat steps 1 and 2.
  Step 5 : For new data points, find the predictions of each decision tree and assign the new data points to the category that wins the majority of the votes.
• These steps are illustrated in the sketch below.
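A minimal sketch of Steps 1-5 (assuming Python with NumPy arrays for X, y and x, and scikit-learn's DecisionTreeClassifier as the base tree; the function names and the sampling-with-replacement choice are illustrative assumptions) :

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10, k=None, seed=0):
    # Steps 1-4: draw K random data points per tree and fit one decision tree per subset
    rng = np.random.default_rng(seed)
    k = k or len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=k)      # random K points (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, x):
    # Step 5: every tree predicts, the category with the most votes wins
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]

Note that this sketch only resamples the data; a full random forest implementation additionally samples a random subset of the attributes at each split.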
• The working of the algorithm can be better understood by the following example.
• Example : Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of the results, the random forest classifier predicts the final decision. Consider the image below.
Fig. 2.6.1 : Example of random forest (Tree-1 ... Tree-n voting between Class-A and Class-B)

Applications of Random Forest
There are mainly four sectors where random forest is normally used :
1. Banking : The banking sector mainly uses this algorithm for the identification of loan risk.
2. Medicine : With the help of this algorithm, disease trends and the risks of a disease can be identified.
3. Land use : We can identify areas of similar land use with the aid of this algorithm.
4. Marketing : Marketing trends can be identified by using this algorithm.

Advantages of Random Forest
• Random forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting problem.
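A brief usage sketch with a library implementation (assuming Python with scikit-learn and its bundled iris dataset, which the text does not mention) :

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number N of trees; more trees generally improve accuracy
# and help against overfitting, at the cost of training time.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))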
