Chapter 4. Classification in Machine Learning
Abebe B, PhD

Classification in Machine Learning
* Classification is a supervised machine learning method where the model tries to predict the correct label of a given input.
* In classification, the model is fully trained on the training data, evaluated on test data, and then used to perform predictions on new, unseen data.
* The prediction task is a classification when the target variable is discrete. An example application is identifying the underlying sentiment of a piece of text.
* Linear classification techniques:
  * Linear classifier
  * Perceptron
  * Logistic regression
  * Linear SVM
  * Naive Bayes
* Non-linear classification techniques:
  * k-nearest neighbors
  * Non-linear SVM
  * Neural networks
  * Decision trees
  * Random forest

Linear vs Non-linear Techniques
* For some tasks the input data are linearly separable, and linear classifiers can be suitably applied.
* For other tasks, linear classifiers may have difficulties producing adequate decision boundaries.
* (Figure: linearly separable vs non-linearly separable data.)

Non-linear Techniques
* Non-linear classification: the features z_n are obtained as non-linear functions of the inputs x_n.
* This results in non-linear decision boundaries and can deal with non-linearly separable data.
* Inputs: $x_n = [x_{n1}\ x_{n2}]$
* Features: $z_n = [x_{n1}\ x_{n2}\ x_{n1}x_{n2}\ x_{n1}^2\ x_{n2}^2\ \dots]$
* Outputs: $f(x_n, W, b) = W z_n + b$

* There are four main classification tasks in machine learning: binary, multi-class, multi-label, and imbalanced classification.

Binary Classification
* In a binary classification task, the goal is to classify the input data into two mutually exclusive categories.
* The training data in such a situation is labeled in a binary format: true and false, positive and negative, 0 and 1, spam and not spam, etc., depending on the problem being tackled.
* For instance, we might want to detect whether a given image shows a truck or a boat.
* Logistic Regression and Support Vector Machines are natively designed for binary classification. However, other algorithms such as K-Nearest Neighbors and Decision Trees can also be used for binary classification.

Binary Classification Problem
* N i.i.d. training samples: {x_n, c_n}
* Class label: c_n ∈ {0, 1}
* Feature vector: x ∈ R^d
* Focus on modeling the conditional probabilities P(C|X)
* Needs to be followed by a decision step

Multi-Class Classification
* Multi-class classification has more than two mutually exclusive class labels, and the goal is to predict to which class a given input example belongs.
* For example, using a model to identify animal types in images from an encyclopedia is a multi-class classification problem, because each image can be assigned to one of many different animal classes.
* Most binary classification algorithms can also be used for multi-class classification. These include, but are not limited to: Random Forest, Naive Bayes, K-Nearest Neighbors, Gradient Boosting, SVM, and Logistic Regression.
* Unlike binary classification, multi-class classification has no notion of normal and abnormal outcomes. Instead, examples are classified as belonging to one of a range of known classes.
* Examples: classification of types of crops, classification of types of music.
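As an illustration of such a multi-class task, the following is a minimal sketch (not from the original slides; the dataset and model choice are assumptions) using scikit-learn's LogisticRegression on the three-class iris data, trained on a training split and evaluated on a held-out test split:

```python
# A minimal sketch (not from the slides): training and evaluating a
# multi-class classifier with scikit-learn. Dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # 3 classes -> a multi-class task
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)      # binary by design, handled as multi-class by scikit-learn
clf.fit(X_train, y_train)                    # train on the training split

y_pred = clf.predict(X_test)                 # evaluate on the held-out test split
print("accuracy:", accuracy_score(y_test, y_pred))
```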
Binary vs Multi-class Classification
* A classification problem with only 2 classes is referred to as binary classification.
  * The output labels are 0 or 1.
  * E.g., benign or malignant tumor, spam or non-spam email.
* A problem with 3 or more classes is referred to as multi-class classification.
* (Figure: a binary classification problem with two point clouds vs a multi-class problem with several point clouds.)

Linear vs Non-linear Classification
* Linear classification refers to categorizing a set of data points into a discrete class based on a linear combination of its explanatory variables.
* Non-linear classification refers to categorizing those instances that are not linearly separable.
* Comparison:
  1. Linear: it is possible to classify the data with a straight line. Non-linear: it is not easy to classify the data with a straight line.
  2. Linear: the data is classified with the help of a hyperplane. Non-linear: kernels are used to transform non-separable data into separable data.

K-Nearest Neighbor (KNN) Algorithm
* K-NN stores all the available data and classifies a new data point based on similarity.
* This means that when new data appears, it can easily be assigned to a well-suited category using the K-NN algorithm.
* It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at classification time, performs an action on it.
* At the training phase, K-NN just stores the dataset; when it gets new data, it classifies that data into the category most similar to it.
* The working of K-NN can be explained with the following steps (see the code sketch at the end of this section):
  * Step-1: Select the number K of neighbors.
  * Step-2: Calculate the Euclidean distance from the new data point to the training points.
  * Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
  * Step-4: Among these K neighbors, count the number of data points in each category.
  * Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
  * Step-6: Our model is ready.

k-Nearest Neighbors Classifier
* The k-nearest neighbors approach considers multiple neighboring data points to classify a test data point.
* E.g., with 3 nearest neighbors, the class of the test example (the + mark in the figure) is obtained by voting among the 3 closest points, based on their distances.

Advantages of the KNN Algorithm:
* It is simple to implement.
* It is robust to noisy training data.
* It can be more effective if the training data is large.
* There is no need to build a model, tune several parameters, or make additional assumptions.

Disadvantages of the KNN Algorithm:
* The value of K always needs to be determined, which can be complex at times.
* The computation cost is high, because the distance to every training sample must be calculated.
* To select the K that is right for your data, run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors while maintaining the algorithm's ability to make accurate predictions on data it has not seen before.
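The steps above translate directly into plain Python/NumPy. The following is a minimal sketch (not from the slides; the toy data points are illustrative) of Euclidean-distance K-NN with majority voting:

```python
# A minimal sketch (not from the slides) of the K-NN steps above,
# using Euclidean distance and majority voting. The data is illustrative.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count labels among the neighbors and take the majority
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([3.8, 4.0]), k=3))  # -> 1
```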
Linear Classifier
* Find a linear function f of the inputs x_i that separates the classes: $f(x_i, W, b) = W x_i + b$
* Use pairs of inputs and labels to find the weights matrix W and the bias vector b.
  * The weights and biases are the parameters of the function f.
* Several methods have been used to find the optimal set of parameters of a linear classifier.
  * A common method of choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation).
* The linear classifier is a simple approach, but it is a building block of advanced classification algorithms, such as SVMs and neural networks.
  * Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs).

Linear Classifier
* The decision boundary is linear: a straight line in 2D, a flat plane in 3D, and a hyperplane in higher-dimensional spaces.
* Example: classify an input image. (Figure: the image pixels are stretched into a single column vector x_i, and $f(x_i; W, b) = W x_i + b$ produces one score per class, e.g. cat score -96.8, dog score 437.9, ship score 61.95.)
* The selected parameters in this example are not good, because the predicted cat score is low.

Support Vector Machines
* How do we find the best decision boundary?
  * All lines in the figure correctly separate the 2 classes.
  * The line that is farthest from all training examples will have better generalization capabilities.
* SVM solves an optimization problem:
  * First, identify a decision boundary that correctly classifies the examples.
  * Next, maximize the geometric margin between the boundary and all examples.
* The data points that define the maximum margin width are called support vectors.
* Find w and b by solving: $\min_{w,b} \tfrac{1}{2}\lVert w \rVert^2$ subject to $y_j (w \cdot x_j + b) \ge 1, \ \forall x_j$

Support Vector Machines
* The line that maximizes the minimum margin is a good bet.
* The model class of "hyperplanes with a margin of m" has a low VC dimension if m is big.
* This maximum-margin separator is determined by a subset of the data points; the points in this subset are called "support vectors".
* It is useful computationally if only a small fraction of the data points are support vectors, because we use the support vectors to decide which side of the separator a test case is on.
* (Figure: the support vectors are indicated by the circles around them.)

Training a Linear SVM
* To find the maximum-margin separator, we have to solve the following optimization problem:
  * $w \cdot x^c + b \ge +1$ for positive cases
  * $w \cdot x^c + b \le -1$ for negative cases
  * and $\lVert w \rVert^2$ is as small as possible.
* This is tricky, but it is a convex problem. There is only one optimum, and we can find it without fiddling with learning rates, weight decay, or early stopping.
* Do not worry about the optimization problem: it has been solved. It is called quadratic programming.
* It takes time proportional to N^2, which is really bad for very big datasets, so for big datasets we end up doing approximate optimization.
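As an illustration of the maximum-margin idea (a minimal sketch, not from the slides; the data points and parameter values are assumptions), scikit-learn's SVC with a linear kernel solves the quadratic program above and exposes the resulting support vectors:

```python
# A minimal sketch (not from the slides): fitting a maximum-margin linear SVM
# and inspecting its support vectors. The data is illustrative.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],    # class 0
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)    # solves the quadratic programming problem above
clf.fit(X, y)

print("w =", clf.coef_, "b =", clf.intercept_)       # separator: w.x + b = 0
print("support vectors:\n", clf.support_vectors_)    # points that define the margin
print(clf.predict([[2.5, 2.0]]))                      # decision: which side of the separator
```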
Testing a Linear SVM
* The separator is defined as the set of points for which $w \cdot x + b = 0$.
* So if $w \cdot x^c + b > 0$, say it is a positive case; and if $w \cdot x^c + b < 0$, say it is a negative case.

Non-linear Support Vector Machines
* In a non-linear SVM, the original input space is mapped to a higher-dimensional feature space where the training set is linearly separable.
* A non-linear kernel function is defined to calculate a non-linear decision boundary in the original feature space.
* (Figure: linearly separable vs non-linearly separable data.)
* A separator of this type best provides the classification, but it is quite difficult to train such a model; how strictly the model must fit the data is controlled by what is termed the regularisation parameter.

SVM Tuning
* Regularization parameter
* Gamma
* Margin

Margin
* The margin is the perpendicular distance between the closest data points and the hyperplane (on both sides).
* The best optimized line (hyperplane) with the maximum margin is termed the maximal-margin hyperplane.
* The closest points, from which the margin distance is calculated, are considered the support vectors.
* (Figure: maximum-margin decision hyperplane with support vectors; the margin is maximized.)

Regularization
* This is the 'C' parameter in Python's scikit-learn library.
* It tunes the SVM classifier to control how much misclassification of the data is tolerated.
* C large → margin of the hyperplane small; C small → margin of the hyperplane large (misclassification possible).
* C large → chance of overfitting; C small → chance of underfitting.
* (Figure: decision boundaries for a large vs a small value of parameter C.)

Gamma
* Defines how far the influence of a training point reaches when calculating the plausible line of separation.
* Low gamma → points far from the plausible separation line are also considered in the calculation.
* High gamma → only points close to the plausible line are considered.
* (Figure: decision boundaries for a high vs a low gamma value.)

Kernels
* Kernels are mathematical functions for transforming the data using some linear algebra.
* Different SVM algorithms use different types of kernel functions; various kernels are available, e.g., kernel functions based on the dot product of n-dimensional inputs.
* Pros:
  * It works really well with a clear margin of separation.
  * It is effective in high-dimensional spaces.
  * It is effective in cases where the number of dimensions is greater than the number of samples.
  * It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Cons:
  * It does not perform well on large data sets, because the required training time is higher.
  * It also does not perform very well when the data set has more noise, i.e., the target classes are overlapping.
  * SVM does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (related to the SVC method of Python's scikit-learn library).
* Applications:
  1. Face detection
  2. Text and hypertext categorization
  3. Classification of images
  4. Bioinformatics
  5. Handwriting recognition
  6. Protein fold and remote homology detection
  7. Generalized predictive control (GPC)

Linear Kernel
* As the name suggests, this kernel separates the data linearly; it is the one-dimensional kernel.
* A linear kernel is fast compared to other kernels, because the linear kernel needs only the regularization parameter, whereas the other kernels also need the γ parameter when performing a grid search (a grid-search sketch follows below).
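The grid search over kernel, C, and gamma mentioned above can be sketched as follows (a minimal example, not from the slides; the dataset, kernel list, and C/gamma grids are illustrative assumptions):

```python
# A minimal sketch (not from the slides) of tuning kernel, C, and gamma
# for scikit-learn's SVC via a grid search. All grid values are illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

param_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.1, 1, 10],        # small C -> wider margin, large C -> risk of overfitting
    "gamma": [0.01, 0.1, 1],  # low gamma -> far points count, high gamma -> only close points
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```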
* Equation of the linear kernel:
  * $k(x, y) = x^T y$, where x and y are input column vectors.
* The degree is 1 and coef0 = 0; if coef0 is zero the kernel is homogeneous, i.e., it maps the data into a compact linear representation.

Polynomial Kernel
* As the name suggests, this kernel deals with the degree/order of the features. When the relationship between features is not linear but rather a curve, the linear kernel is not suited to get the best result, because the residual between observed and predicted values will be larger.
* The working function of this kernel is less effective compared to other kernels.
* Equation of the polynomial kernel:
  * $k(x, y) = (\gamma\, x^T y + c_0)^d$, where d = kernel degree.

Sigmoid Kernel
* This kernel is mostly used with neural networks or the perceptron in machine learning; to classify the classes in the data it works like an activation function. The curve of this kernel is also called a cumulative distribution function that goes from 0 to 1.
* Equation of the sigmoid kernel:
  * $k(x, y) = \tanh(\gamma\, x^T y + c_0)$, where x and y are input column vectors, γ = slope, and $c_0$ = intercept.

RBF (Radial Basis Function) Kernel
* The most used kernel in machine learning algorithms; it classifies the data without knowing the data types and tries to separate the classes smoothly.
* The other kernels do not scale as well to a huge number of input features.
* Equation of the RBF kernel:
  * $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, where x and y are input column vectors and γ is set from the kernel variance σ².

Laplacian Kernel
* The Laplacian kernel is from the family of RBF kernels and can be used on noiseless data. It is very little affected by changes in the data and shares some features with the exponential kernel.
* The application of this kernel is mainly in image processing, to detect the edges of objects via the Laplacian over Gaussian filter (LoG).
* Equation of the Laplacian kernel:
  * $k(x, y) = \exp(-\gamma \lVert x - y \rVert_1)$, where x and y are input column vectors and $\lVert x - y \rVert_1$ is the Manhattan distance metric.

(Figure: decision boundaries on the iris data, sepal length vs sepal width, for SVC with a linear kernel, LinearSVC (linear kernel), SVC with an RBF kernel, and SVC with a polynomial (degree 3) kernel.)

Naive Bayes Classifier
* The Naive Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
* It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
* A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is based on Bayes' theorem.
* Why is it called "naive"? Because it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
* It is easy to build and useful for very large data sets.

Bayes' Theorem
* The theorem can be stated mathematically as follows:
  * $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
* P(A) and P(B) are the probabilities of observing A and B without regard to each other, also known as prior probabilities.
* P(A | B), a conditional (posterior) probability, is the probability of observing event A given that B is true.
* P(B | A) is the conditional probability of observing event B given that A is true (the likelihood).
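As a quick numeric illustration of Bayes' theorem (a sketch, not from the slides; all probability values below are hypothetical, for illustration only):

```python
# A minimal sketch (not from the slides): applying Bayes' theorem.
# The probability values are hypothetical.
def posterior(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

p_spam = 0.2               # prior P(A): an email is spam
p_word_given_spam = 0.6    # likelihood P(B|A): a given word appears in spam emails
p_word = 0.15              # evidence P(B): the word appears in any email

print(posterior(p_word_given_spam, p_spam, p_word))  # P(spam | word) ≈ 0.8
```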
How Does the Naive Bayes Classifier Work Based on Bayes' Theorem?
* Let D be a training set of tuples, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
* Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X.
* That is, the naive Bayesian classifier predicts that tuple X belongs to class Ci if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
* By Bayes' theorem: $P(C_i \mid X) = \dfrac{P(X \mid C_i)\, P(C_i)}{P(X)}$
* P(X) is constant for all classes, so only $P(X \mid C_i)\, P(C_i)$ needs to be maximized.
* To reduce the computation in evaluating $P(X \mid C_i)$, the naive assumption of class-conditional independence is made. This presumes that the attributes' values are conditionally independent of one another given the class label of the tuple (i.e., there are no dependence relationships among the attributes).
* Thus: $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$

How Naive Bayes Works (Hands-on Calculation)
* Given all the previous patients' symptoms and diagnoses:

  chills | runny nose | headache | fever | flu?
  Y      | N          | Mild     | Y     | N
  Y      | Y          | No       | N     | Y
  Y      | N          | Strong   | Y     | Y
  N      | Y          | Mild     | Y     | Y
  N      | N          | No       | N     | N
  N      | Y          | Strong   | Y     | Y
  N      | Y          | Strong   | N     | N
  Y      | Y          | Mild     | Y     | Y

* Does a patient with the following symptoms have the flu?

  chills = Y, runny nose = N, headache = Mild, fever = Y, flu = ?

* First, we compute all possible individual probabilities conditioned on the target attribute (flu):

  P(flu=Y) = 0.625                      P(flu=N) = 0.375
  P(chills=Y | flu=Y) = 0.6             P(chills=Y | flu=N) = 0.333
  P(chills=N | flu=Y) = 0.4             P(chills=N | flu=N) = 0.666
  P(runny nose=Y | flu=Y) = 0.8         P(runny nose=Y | flu=N) = 0.333
  P(runny nose=N | flu=Y) = 0.2         P(runny nose=N | flu=N) = 0.666
  P(headache=Mild | flu=Y) = 0.4        P(headache=Mild | flu=N) = 0.333
  P(headache=No | flu=Y) = 0.2          P(headache=No | flu=N) = 0.333
  P(headache=Strong | flu=Y) = 0.4      P(headache=Strong | flu=N) = 0.333
  P(fever=Y | flu=Y) = 0.8              P(fever=Y | flu=N) = 0.333
  P(fever=N | flu=Y) = 0.2              P(fever=N | flu=N) = 0.666

Structure of Logistic Regression
* Linear model: $w_0 + \sum_i w_i x_i$ (bias weight $w_0$ plus one weight $w_i$ per feature)
* $y^* = \mathrm{sigmoid}\left(w_0 + \sum_i w_i x_i\right)$
* $\mathrm{sigmoid}(z) = \dfrac{1}{1 + e^{-z}}$
* Predict 1 if $y^*$ is above the threshold (e.g., 0.5); otherwise predict 0.
* (Figure: a worked example with feature values and weights $w_0$, $w_1$, $w_2$.)
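A minimal sketch of this logistic regression structure (not from the slides; the weight values are hypothetical) computes the linear model, squashes it with the sigmoid, and thresholds the result at 0.5:

```python
# A minimal sketch (not from the slides) of the logistic regression structure above:
# a linear model passed through the sigmoid, thresholded at 0.5. Weights are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, w0, threshold=0.5):
    score = w0 + np.dot(w, x)           # linear model: w0 + sum_i w_i * x_i
    prob = sigmoid(score)               # squashed into (0, 1)
    return int(prob > threshold), float(prob)

w0, w = -1.0, np.array([0.8, 0.5])      # hypothetical bias and per-feature weights
print(predict(np.array([2.0, 1.0]), w, w0))   # -> (1, ~0.75)
```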
