the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions. This makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM.

Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel is:

K(x, xi) = sum(x * xi)

From the above formula, we can see that the product between two vectors x and xi is the sum of the multiplication of each pair of input values.

2.5. Unsupervised Machine Learning
2.5.1. Introduction to clustering
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the model itself finds the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:

"Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision."

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose an unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it has no prior idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. It performs this task by clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?
Below are some of the main reasons that describe the importance of unsupervised learning:
• Unsupervised learning is helpful for finding useful insights in the data.
• Unsupervised learning is much like the way a human learns to think from their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
• In the real world, we do not always have input data with corresponding outputs, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning
The working of unsupervised learning can be understood from the diagram below:

[Figure: unlabeled raw input data (e.g., images of cats) is fed to the model, which interprets it and divides it into groups.]

Here, we have taken unlabeled input data, which means it is not categorized and the corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find the hidden patterns in it, and then applies a suitable algorithm such as k-means clustering. Once the algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
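To make the workflow above concrete, here is a minimal sketch, not taken from the lecture, of feeding unlabeled feature vectors to k-means with scikit-learn; the feature matrix X is a hypothetical stand-in for, say, extracted image features.

```python
# Minimal sketch of the unsupervised workflow described above, assuming
# scikit-learn is available. The feature matrix X is hypothetical: in the
# cats-and-dogs example it would hold image features, with no labels at all.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled raw data: each row is one object described by two features.
X = np.array([[1.0, 2.1], [0.9, 1.8], [8.2, 7.9], [7.8, 8.4], [1.2, 2.0]])

# Fit k-means: the model finds hidden structure without any target labels.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each object is assigned to a group purely by similarity to the others.
print(model.labels_)           # e.g. [0 0 1 1 0]
print(model.cluster_centers_)  # centroid of each discovered group
```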
Types of Unsupervised Learning Algorithm
Unsupervised learning algorithms can be further categorized into two types of problems:

Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.

Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of association rules is Market Basket Analysis. (A small worked co-occurrence example is sketched after the advantage and disadvantage lists below.)

Unsupervised Learning algorithms
Below is a list of some popular unsupervised learning algorithms:
• K-means clustering
• KNN (k-nearest neighbours)
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular Value Decomposition

Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks than supervised learning because, in unsupervised learning, we do not have labeled input data.
• Unsupervised learning is preferable because it is much easier to obtain unlabeled data than labeled data.

Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning because it has no corresponding output to learn from.
• The result of an unsupervised learning algorithm may be less accurate because the input data is not labeled, so the algorithm does not know the exact output in advance.
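To make the association idea above concrete, here is a minimal sketch, not from the lecture, that counts which pairs of items occur together across a few made-up transactions; these co-occurrence counts are the raw ingredient of Market Basket Analysis.

```python
# Minimal sketch of "items that occur together", assuming a tiny, made-up
# list of market-basket transactions. Real association-rule mining (e.g. the
# Apriori algorithm listed above) builds on exactly these co-occurrence counts.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
# ('bread', 'butter') appears in 2 of 4 baskets -> support 0.5,
# matching the intuition that bread buyers also tend to buy butter.
```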
Supervised Learning vs. Unsupervised Learning
• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
• The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
• Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be categorized into Clustering and Association problems.
• Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
• A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result in comparison.
• Supervised learning is not close to true Artificial Intelligence, because we first train the model on each example and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, since it learns much as a child learns daily-routine things from experience.
• Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree and Bayesian Logic; unsupervised learning includes algorithms such as Clustering, KNN and the Apriori algorithm.

2.5.2. K-Means Clustering
The k-means clustering algorithm
One of the most widely used clustering algorithms is k-means. It groups the data into k clusters, where k is given as input to the algorithm, according to the existing similarities among the data points. Let us start with a simple example. Imagine we have 5 objects (say 5 people) and for each of them we know two features (height and weight). We want to group them into k = 2 clusters.

Our dataset will look like this:

            Height (H)   Weight (W)
Person 1    167          55
Person 2    120          32
Person 3    113          33
Person 4    175          76
Person 5    108          25

First of all, we have to initialize the values of the centroids for our clusters. For instance, let us choose Person 2 and Person 3 as the two centroids c1 and c2, so that c1 = (120, 32) and c2 = (113, 33). Now we compute the Euclidean distance between each of the two centroids and each point in the data. Doing all the calculations gives the following numbers:

            Distance from c1   Distance from c2
Person 1    52.3               58.3
Person 2    0                  7.1
Person 3    7.1                0
Person 4    70.4               75.4
Person 5    13.9               9.4

At this point, we assign each object to the cluster it is closer to (that is, we take the minimum of the two computed distances for each object). We can then arrange the points as follows:

Person 1 -> cluster 1
Person 2 -> cluster 1
Person 3 -> cluster 2
Person 4 -> cluster 1
Person 5 -> cluster 2

Let us iterate, which means redefining the centroids by calculating the mean of the members of each of the two clusters:

c'1 = ((167 + 120 + 175)/3, (55 + 32 + 76)/3) = (154, 54.3)
c'2 = ((113 + 108)/2, (33 + 25)/2) = (110.5, 29)

Then we calculate the distances again and re-assign the points to the new centroids. We repeat this process until the centroids do not move any more (or the difference between successive centroids is below a certain small threshold). In our case, the result we get is shown in the figure below: the two clusters are labelled with two different colours and the positions of the final centroids are marked by crosses.

[Figure: scatter plot of the five points, coloured by cluster, with the final centroid positions marked by crosses.]

How to apply k-means?
In practice, Python libraries can be used to run this analysis. The k-means algorithm is implemented in the scikit-learn package; using it only takes importing KMeans from sklearn.cluster and fitting it to the data, as sketched below.
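The following is a minimal sketch, not taken from the lecture, of rerunning the worked example above with scikit-learn's KMeans; the initial centroids are set to Person 2 and Person 3, exactly as in the manual computation.

```python
# A minimal sketch (assuming scikit-learn is installed) that reruns the
# worked example above: 5 people, 2 features (height, weight), k = 2 clusters,
# initial centroids chosen as Person 2 and Person 3.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [167, 55],   # Person 1
    [120, 32],   # Person 2
    [113, 33],   # Person 3
    [175, 76],   # Person 4
    [108, 25],   # Person 5
], dtype=float)

init_centroids = np.array([[120, 32], [113, 33]], dtype=float)  # c1, c2

kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)

print(kmeans.labels_)           # cluster index assigned to each person
print(kmeans.cluster_centers_)  # final centroid positions (the "crosses")
# After convergence, the taller/heavier people (Person 1 and Person 4) should
# end up in one cluster and the other three people in the second cluster.
```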
What if our data is... non-numerical?
At this point you may have noticed something. The basic machinery of k-means rests on numerical calculations (means, Euclidean distances). But what if our data is non-numerical or, in other words, categorical? Imagine, for instance, having the ID code and date of birth of the five people from the previous example instead of their heights and weights.
We could think of transforming our categorical values into numerical values and then applying k-means. But beware: k-means uses numerical distances, so it could consider two really distant objects to be close merely because they happen to have been assigned two close numbers.
k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects), and instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data. We will have as many modes as the number of clusters we require, since they act as centroids.

Unit 3: Ensemble and Probabilistic Learning
Ensemble Learning: Model Combination Schemes, Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting: Adaboost, Stacking.
Probabilistic Learning: Gaussian mixture models - The Expectation-Maximization (EM) Algorithm, Information Criteria, Nearest neighbour methods - Nearest Neighbour Smoothing, Efficient Distance Computations: the KD-Tree, Distance Measures.

3. Introduction: Ensemble Learning
Ensemble learning is a technique that creates multiple models and then combines them to produce improved results. Ensemble learning usually produces more accurate solutions than a single model would.
• Ensemble learning methods are applied to regression as well as classification.
• Ensemble learning for regression creates multiple regressors, i.e. multiple regression models such as linear, polynomial, etc.
• Ensemble learning for classification creates multiple classifiers, i.e. multiple classification models such as logistic regression, decision trees, KNN, SVM, etc.

[Figure 1: Ensemble learning view - N classifiers are trained on the input features and a combiner merges their individual outputs into the final class prediction.]

Which components to combine?
• different learning algorithms
• the same learning algorithm trained in different ways
• the same learning algorithm trained the same way

There are two steps in ensemble learning:
1. Multiple machine learning models are generated using the same or different machine learning algorithms. These are called "base models".
2. The prediction is performed on the basis of the base models.

Techniques/Methods in ensemble learning: Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting: Adaboost, Stacking.

3.1 Model Combination Schemes - Combining Multiple Learners
We discussed many different learning algorithms in the previous chapters. Though these are generally successful, no single algorithm is always the most accurate. Now, we are going to discuss models composed of multiple learners that complement each other so that, by combining them, we attain higher accuracy. There are also different ways the multiple base-learners can be combined to generate the final output.

[Figure 2: General idea - combining multiple learners.]

Multiexpert combination
Multiexpert combination methods have base-learners that work in parallel. These methods can in turn be divided into two:
• In the global approach, also called learner fusion, given an input, all base-learners generate an output and all these outputs are used. Examples are voting and stacking (a small voting sketch follows this list).
• In the local approach, or learner selection, for example in a mixture of experts, there is a gating model which looks at the input and chooses one (or very few) of the learners as responsible for generating the output.
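As a concrete illustration of learner fusion, the sketch below (hypothetical, not from the lecture) combines three different learning algorithms with scikit-learn's VotingClassifier on a made-up dataset.

```python
# Minimal sketch of the "global" (learner fusion) approach: three different
# base-learners are trained in parallel and their votes are combined.
# Assumes scikit-learn; the toy dataset is a hypothetical stand-in.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",  # average the predicted class probabilities (the sum rule)
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))  # combined class predictions
```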
Multistage combination
Multistage combination methods use a serial approach, where the next base-learner is trained with, or tested on, only the instances on which the previous base-learners are not accurate enough. The idea is that the base-learners (or the different representations they use) are sorted in increasing complexity, so that a complex base-learner is not used (or its complex representation is not extracted) unless the preceding simpler base-learners are not confident. An example is cascading.

Let us say that we have L base-learners. We denote by dj(x) the prediction of base-learner Mj given the arbitrary-dimensional input x. In the case of multiple representations, each Mj uses a different input representation xj. The final prediction is calculated from the predictions of the base-learners:

y = f(d1, d2, ..., dL | Φ)

where f(·) is the combining function with Φ denoting its parameters.

[Figure: base-learners dj produce their predictions, which are combined using f(·) to give the output y.]

This is for a single output; in the case of classification, each base-learner has K outputs that are separately used to calculate yi, and then we choose the class with the maximum value. Note that here all learners observe the same input; it may be the case that different learners observe different representations of the same input object or event. When there are K outputs, for each learner there are dji(x), i = 1, ..., K, j = 1, ..., L, and, combining them, we also generate K values yi, i = 1, ..., K. Then, for example in classification, we choose the class with the maximum yi value:

Choose Ci if yi = max_k yk

3.2 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners (see the base-learner combination figure above):

yi = Σ_j wj dji,  where wj ≥ 0 and Σ_j wj = 1

This is also known as ensembles and linear opinion pools. In the simplest case, all learners are given equal weight and we have simple voting, which corresponds to taking an average. Still, taking a (weighted) sum is only one of the possibilities; there are other combination rules as well, as shown in Table 1. If the outputs are not posterior probabilities, these rules require that the outputs be normalized to the same scale.

Table 1 - Classifier combination rules
Rule            Fusion function f(d1, ..., dL)
Sum             yi = (1/L) Σ_j dji
Weighted sum    yi = Σ_j wj dji,  wj ≥ 0,  Σ_j wj = 1
Median          yi = median_j dji
Minimum         yi = min_j dji
Maximum         yi = max_j dji
Product         yi = Π_j dji

An example of the use of these rules is shown in Table 2, which demonstrates the effects of the different rules. The sum rule is the most intuitive and the most widely used in practice. The median rule is more robust to outliers; the minimum and maximum rules are pessimistic and optimistic, respectively. With the product rule, each learner has veto power: regardless of the other learners, if one learner has an output of 0, the overall output goes to 0. Note that after applying a combination rule, the yi do not necessarily sum to 1.

Table 2: Example of combination rules on three learners and three classes
            C1     C2     C3
d1          0.2    0.5    0.3
d2          0.0    0.6    0.4
d3          0.4    0.4    0.2
Sum         0.2    0.5    0.3
Median      0.2    0.5    0.3
Minimum     0.0    0.4    0.2
Maximum     0.4    0.6    0.4
Product     0.0    0.12   0.024

In the weighted sum, dji is the vote of learner j for class Ci and wj is the weight of its vote. Simple voting is a special case where all voters have equal weight, namely wj = 1/L. In classification, this is called plurality voting, where the class having the maximum number of votes is the winner. When there are two classes, this is majority voting, where the winning class gets more than half of the votes.
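To check the numbers in Table 2, here is a small sketch (assuming NumPy) that applies each combination rule to the three learners' outputs and picks the winning class.

```python
# Reproduces Table 2: three base-learners, three classes, and the effect of
# each combination rule from Table 1 (equal weights, i.e. simple voting).
import numpy as np

# Rows = learners d1, d2, d3; columns = classes C1, C2, C3.
d = np.array([
    [0.2, 0.5, 0.3],
    [0.0, 0.6, 0.4],
    [0.4, 0.4, 0.2],
])

rules = {
    "sum (average)": d.mean(axis=0),   # (1/L) * sum_j d_ji
    "median":        np.median(d, axis=0),
    "minimum":       d.min(axis=0),
    "maximum":       d.max(axis=0),
    "product":       d.prod(axis=0),   # any single 0 vetoes the class
}

for name, y in rules.items():
    print(f"{name:14s} y = {np.round(y, 3)}  -> choose C{y.argmax() + 1}")
# Every rule here selects C2; e.g. the sum rule gives y = [0.2, 0.5, 0.3].
```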
If the voters can also supply the additional information of how much they vote for each class (e.g., by the posterior probability), then, after normalization, these can be used as weights in a weighted voting scheme. Equivalently, if dji are the class posterior probabilities P(Ci | x, Mj), then we can just sum them up (with wj = 1/L) and choose the class with maximum yi.
In the case of regression, simple or weighted averaging or the median can be used to fuse the outputs of the base-regressors. The median is more robust to noise than the average.
Another possible way to find the wj is to assess the accuracies of the learners (regressors or classifiers) on a separate validation set and use that information to compute the weights, so that more accurate learners are given larger weights.
Voting schemes can be seen as approximations under a Bayesian framework, with the weights approximating prior model probabilities and the model decisions approximating model-conditional likelihoods:

P(Ci | x) = Σ_{all models Mj} P(Ci | x, Mj) P(Mj)

Simple voting corresponds to a uniform prior. If we have a prior distribution preferring simpler models, this would give larger weights to them. We cannot integrate over all models; we only choose a subset for which we believe P(Mj) is high, or we can have another Bayesian step, calculate P(Mj | X), the probability of a model given the sample, and sample highly probable models from this density.

Let us assume that the dj are iid with expected value E[dj] and variance Var(dj). Then, when we take a simple average with wj = 1/L, the expected value and variance of the output are:

E[y] = E[(1/L) Σ_j dj] = (1/L) · L · E[dj] = E[dj]
Var(y) = Var((1/L) Σ_j dj) = (1/L²) · L · Var(dj) = (1/L) Var(dj)

We see that the expected value does not change, so the bias does not change. But the variance, and therefore the mean square error, decreases as the number of independent voters, L, increases. In the general case,

Var(y) = (1/L²) Var(Σ_j dj) = (1/L²) [ Σ_j Var(dj) + 2 Σ_j Σ_{i<j} Cov(dj, di) ]

which implies that if the learners are positively correlated, the variance (and error) increases. We can thus view using different algorithms and input features as efforts to decrease, if not completely eliminate, this positive correlation.
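The variance argument above can be checked numerically; the following sketch (hypothetical, assuming NumPy, with a made-up output distribution) simulates L independent, identically distributed learner outputs and shows that averaging leaves the mean unchanged while shrinking the variance by roughly 1/L.

```python
# Numerical check of E[y] = E[d_j] and Var(y) = Var(d_j)/L for a simple
# average of L independent, identically distributed learner outputs.
# Purely illustrative; the output distribution below is a made-up stand-in.
import numpy as np

rng = np.random.default_rng(0)
L = 10                       # number of independent voters
n_trials = 100_000

# Each row is one trial, each column one learner's output d_j.
d = rng.normal(loc=0.6, scale=0.2, size=(n_trials, L))

y = d.mean(axis=1)           # simple voting with w_j = 1/L

print("E[d_j]   ~", d.mean())   # ~0.6
print("E[y]     ~", y.mean())   # ~0.6   (bias unchanged)
print("Var(d_j) ~", d.var())    # ~0.04
print("Var(y)   ~", y.var())    # ~0.004 = Var(d_j)/L
```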
3.3 Error-Correcting Output Codes
The Error-Correcting Output Codes method is a technique that allows a multi-class classification problem to be reframed as multiple binary classification problems, so that native binary classification models can be used directly. Unlike one-vs-rest and one-vs-one methods, which offer a similar solution by dividing a multi-class classification problem into a fixed number of binary classification problems, the error-correcting output codes technique allows each class to be encoded with an arbitrary number of binary classification problems. When an overdetermined representation is used, the extra models can act as "error-correction" predictions, which can result in better predictive performance.
In error-correcting output codes (ECOC), the main classification task is defined in terms of a number of subtasks that are implemented by the base-learners. The idea is that the original task of separating one class from all other classes may be a difficult problem. Instead, we want to define a set of simpler classification problems, each specializing in one aspect of the task; by combining these simpler classifiers, we get the final classifier.
The base-learners are binary classifiers with output -1/+1, and there is a code matrix W of size K × L whose K rows are the binary codes of the classes in terms of the L base-learners dj.
For example, if the second row of W is [-1, +1, +1, -1], this means that for us to say an instance belongs to C2, the instance should be on the negative side of d1 and d4, and on the positive side of d2 and d3.
Similarly, the columns of the code matrix define the tasks of the base-learners. For example, if the third column is [-1, +1, +1]^T, we understand that the task of the third base-learner, d3, is to separate the instances of C1 from the instances of C2 and C3 combined. This is how we form the training sets of the base-learners: in this case, all instances labeled with C2 and C3 form X3+ and the instances labeled with C1 form X3-, and d3 is trained so that instances in X3+ give output +1 and instances in X3- give output -1.
The code matrix thus allows us to define a polychotomy (a K > 2 classification problem) in terms of dichotomies (K = 2 classification problems), and it is a method that is applicable using any learning algorithm.
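Below is a small, hypothetical sketch (NumPy only, not from the lecture) showing how a code matrix W turns a 3-class problem into binary subtasks and how a prediction is decoded by picking the class row that agrees most with the base-learners' outputs. Only the second row and the third column follow the example above; the rest of W, the labels y, and the output vector d_out are made up for illustration.

```python
# Hypothetical ECOC sketch: K = 3 classes, L = 4 binary base-learners.
# Each row of W is the -1/+1 code of a class; each column defines one
# binary subtask (which classes fall on the positive vs. negative side).
import numpy as np

W = np.array([
    [+1, -1, -1, +1],   # code for class C1 (made up)
    [-1, +1, +1, -1],   # code for class C2 (the row used in the example above)
    [-1, -1, +1, +1],   # code for class C3 (made up)
])
# Note: column 3 of W is [-1, +1, +1]^T, so d3 separates C1 from C2 and C3.

y = np.array([0, 0, 1, 2, 1, 2])   # class index of each training instance

# Binary targets for base-learner j are column j of W, looked up per instance.
binary_targets = W[y]              # shape (n_instances, L), entries -1/+1
# In practice, each column binary_targets[:, j] would be used to train d_j.

# Decoding: suppose the trained base-learners output this -1/+1 vector for a
# new instance; it is C2's code with the first bit flipped (a learner error).
d_out = np.array([+1, +1, +1, -1])

# Choose the class whose code row agrees most with the outputs
# (equivalently, the smallest Hamming distance).
scores = W @ d_out
print("predicted class: C", scores.argmax() + 1, sep="")   # -> C2, error corrected
```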
