Figure 2.10: Normal Q-Q plot of sample data versus the standard normal distribution (x-axis: standard normal quantiles; y-axis: quantiles of the input sample)

Ideally, the points fall along the 45-degree reference line if the data follows a normal distribution. The greater the deviation from this line, the stronger the evidence that the dataset follows a distribution other than the normal distribution. In such a case, the statistical investigation should be analysed carefully before interpretation. Thus, skewness, kurtosis, mean absolute deviation and coefficient of variation help in assessing univariate data.

2.6 BIVARIATE DATA AND MULTIVARIATE DATA

Bivariate data involves two variables. Bivariate data deals with causes of relationships, and the aim is to find relationships among the data. Consider Table 2.3, with data of the temperature in a shop and sales of sweaters.

Table 2.3: Temperature in a Shop and Sales Data

Temperature (in centigrade)    Sales of sweaters
 5                             200
10                             150
15                             140
20                              75
25                               -
30                              55
35                              20

Here, the aim of bivariate analysis is to find relationships among the variables. The relationships can then be used in comparisons, in finding causes, and in further exploration. To do that, a graphical display of the data is necessary. One such graphical method is called a scatter plot.

A scatter plot is used to visualize bivariate data. It is useful for plotting two variables, with or without nominal variables, to illustrate trends and also to show differences. It is a plot between explanatory and response variables, that is, a 2D graph showing the relationship between two variables. The scatter plot (refer to Figure 2.11) indicates the strength, shape, direction and the presence of outliers. It is useful for exploring the data before calculating a correlation coefficient or fitting a regression curve.

Figure 2.11: Scatter Plot (sales versus temperature)

Line graphs are similar to scatter plots. The line chart for the sales data is shown in Figure 2.12.

Figure 2.12: Line Chart (sales data)

2.6.1 Bivariate Statistics

Covariance and correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of two random variables, say X and Y. Generally, random variables are represented in capital letters. It is written as covariance(X, Y) or COV(X, Y) and is used to measure how the two variables vary together. The formula for finding the covariance for specific values x_i and y_i is:

$\mathrm{COV}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\,(x_i - E(X))(y_i - E(Y))$        (2.17)

Here, x_i and y_i are data values from X and Y, E(X) and E(Y) are the mean values of x_i and y_i, and N is the number of given data points. Also, COV(X, Y) is the same as COV(Y, X).

Example 2.5: Find the covariance of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

Solution: Mean(X) = E(X) = 3 and Mean(Y) = E(Y) = 55/5 = 11. The covariance is computed using Eq. (2.17) as:

$\frac{(1-3)(1-11) + (2-3)(4-11) + (3-3)(9-11) + (4-3)(16-11) + (5-3)(25-11)}{5} = \frac{60}{5} = 12$

The covariance between X and Y is 12. It can be normalized to a value between -1 and +1 by dividing it by the product of the standard deviations of the variables; the normalized value is called the Pearson correlation coefficient. Sometimes, N - 1 can also be used instead of N in Eq. (2.17). In that case, the covariance is 60/4 = 15.
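The arithmetic of Example 2.5 can be cross-checked with a short script. The sketch below is only illustrative; the function name compute_covariance is not from the text, and NumPy is assumed to be available.

```python
import numpy as np

def compute_covariance(x, y, denominator="N"):
    """Covariance as in Eq. (2.17): the mean of the products of deviations.

    denominator="N" divides by N; "N-1" gives the variant mentioned in the text (60/4 = 15).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    deviations = (x - x.mean()) * (y - y.mean())
    n = len(x) if denominator == "N" else len(x) - 1
    return deviations.sum() / n

X = [1, 2, 3, 4, 5]
Y = [1, 4, 9, 16, 25]
print(compute_covariance(X, Y))          # 12.0, as in Example 2.5
print(compute_covariance(X, Y, "N-1"))   # 15.0, using N - 1
```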
Correlation

The Pearson correlation coefficient is the most common test for determining an association between two phenomena. It measures the strength and direction of a linear relationship between the x and y variables. The correlation indicates the relationship between the dimensions using its sign, and the sign is more important than the actual value.

1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, then it indicates that the two dimensions are independent of each other.

If the dimensions are correlated, then it is better to remove one of them, as it is a redundant dimension.

If the given attributes are X = (x_1, x_2, ..., x_N) and Y = (y_1, y_2, ..., y_N), then the Pearson correlation coefficient, denoted as r, is given as:

$r = \frac{\mathrm{COV}(X, Y)}{\sigma_X \sigma_Y}$        (2.18)

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.

Example 2.6: Find the correlation coefficient of the data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.

Solution: The mean values of X and Y are 3 and 11. The standard deviations of X and Y are 1.41 and 8.6486, respectively. Therefore, the correlation coefficient is given as the ratio of the covariance (12, from Example 2.5) to the product of the standard deviations of X and Y, as per Eq. (2.18):

$r = \frac{12}{1.41 \times 8.6486} \approx 0.984$

2.7 MULTIVARIATE STATISTICS

In machine learning, almost all datasets are multivariable. Multivariate data involves the analysis of more than two observable variables, and often thousands of multiple measurements need to be conducted for one or more subjects. Multivariate data is like bivariate data but may have more than two dependent variables. Some of the multivariate analyses are regression analysis, principal component analysis, and path analysis. Consider the following data:

Id    Attribute 1    Attribute 2    Attribute 3
1     1              4              1
2     2              5              2
3     3              6              1

The mean of multivariate data is a mean vector, and the mean of the above three attributes is given as (2, 5, 1.33). The variance of multivariate data becomes the covariance matrix. The mean vector is called the centroid and the variance is called the dispersion matrix. This is discussed in the next section.

Multivariate data has three or more variables, and the aims of multivariate analysis are much broader. They include regression analysis, factor analysis and multivariate analysis of variance, which are explained in the subsequent chapters of this book.
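Example 2.6 and the mean vector and covariance matrix of the small three-attribute table above can be checked with NumPy. The sketch below is only illustrative; the variable names are not from the text, and the correlation comes out as roughly 0.981 when the standard deviations are not rounded (the text's 0.984 uses the rounded value 1.41).

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1, 4, 9, 16, 25], dtype=float)

# Pearson correlation as in Eq. (2.18): covariance / (sigma_X * sigma_Y)
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
r = cov_xy / (X.std() * Y.std())
print(round(r, 3))                     # ~0.981

# Mean vector (centroid) and covariance (dispersion) matrix of the 3-attribute data
data = np.array([[1, 4, 1],
                 [2, 5, 2],
                 [3, 6, 1]], dtype=float)
print(data.mean(axis=0))               # [2.  5.  1.33...] -- the mean vector
print(np.cov(data, rowvar=False))      # 3x3 covariance matrix (uses N - 1 by default)
```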
Heatmap

A heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and colours it. Darker colours indicate very large values and lighter colours indicate smaller values. The advantage of this method is that humans perceive colours well, so larger values can be grasped easily through colour shading. For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through a heatmap. In Figure 2.13, patient data highlighting weight and health status is plotted. Here, the X-axis is weight and the Y-axis is self-assessed health status; the darker colour regions highlight higher patient counts.

Figure 2.13: Heatmap for Patient Data (count of self-assessed health status vs. weight)

Pairplot

A pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. All the results are presented in a matrix format, and by visual examination of the chart one can easily find relationships among the variables, such as correlation between the variables. A random matrix of three columns is chosen and the relationships among the columns are plotted as a pairplot (or scatter matrix), as shown in Figure 2.14.

Figure 2.14: Pairplot for Random Data
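Plots of the kind shown in Figures 2.13 and 2.14 can be produced with standard plotting libraries. The sketch below assumes pandas, seaborn and matplotlib are installed and uses randomly generated data, so it only mirrors the style of the figures, not their exact content.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# Heatmap of a 2D matrix: darker/lighter cells encode larger/smaller values
sns.heatmap(df.corr(), annot=True, cmap="Blues")
plt.title("Heatmap of pairwise correlations")
plt.show()

# Pairplot (scatter matrix): pair-wise scatter plots of all variables
sns.pairplot(df)
plt.show()
```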
2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability and information theory. The subsequent sections discuss important aspects of linear algebra and probability.

Linear algebra is a branch of mathematics that is central to many scientific applications and other mathematical subjects. While all branches of mathematics are crucial for machine learning, linear algebra plays a major role, as it is the mathematics of data. Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. These are the driving forces of machine learning, and machine learning cannot exist without these data types. Let us discuss some of the important concepts of linear algebra now.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data

A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution x is given as:

$x = A^{-1}y$        (2.19)

This is true if A is not singular, that is, its inverse exists. The logic can be extended to a set of N equations with n unknown variables. It means that if A is the matrix of coefficients and y = (y_1, y_2, ..., y_n), then the unknown vector x can be computed as:

$x = A^{-1}y$        (2.20)

If there is a unique solution, then the system is called consistent independent. If there are various solutions, then the system is called consistent dependent. If there are no solutions and the equations are contradictory, then the system is called inconsistent.

For solving a large number of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is as follows:

1. Write the given matrix A.
2. Append the vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element $a_{11}$ as the pivot and eliminate $a_{21}$ in the second row using the row operation $R_2 \leftarrow R_2 - \left(\frac{a_{21}}{a_{11}}\right)R_1$, where $R_2$ is the second row and $\frac{a_{21}}{a_{11}}$ is called the multiplier. The same logic can be used to eliminate $a_{i1}$ in all the other rows.
4. Repeat the same logic to reduce the matrix to row echelon form. Then, the last unknown variable is found as:

$x_n = \frac{y_n}{a_{nn}}$        (2.21)

The remaining unknown variables can then be found by back-substitution:

$x_i = \frac{1}{a_{ii}}\left(y_i - \sum_{j=i+1}^{n} a_{ij}\,x_j\right)$        (2.22)

This part is called backward substitution.

To facilitate the application of the Gaussian elimination method, the following row operations are applied:

1. Swapping two rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it

These concepts are illustrated in Example 2.8.

Example 2.8: Solve the following set of equations using the Gaussian elimination method.

$2x_1 + 4x_2 = 6$
$4x_1 + 3x_2 = 7$

Solution: Rewrite this in augmented matrix form as follows:

$\left(\begin{array}{cc|c} 2 & 4 & 6 \\ 4 & 3 & 7 \end{array}\right)$

Apply the transformation by dividing row 1 by 2. There are no general guidelines on the order of row operations other than reducing the given matrix to row echelon form. The operator ~ means 'reduces to'. The above matrix can be reduced as follows:

$\left(\begin{array}{cc|c} 1 & 2 & 3 \\ 4 & 3 & 7 \end{array}\right)
\sim
\left(\begin{array}{cc|c} 1 & 2 & 3 \\ 0 & -5 & -5 \end{array}\right)
\sim
\left(\begin{array}{cc|c} 1 & 2 & 3 \\ 0 & 1 & 1 \end{array}\right)$

Here, the second row is first replaced by $R_2 - 4R_1$ and is then divided by -5. Therefore, from the reduced echelon form, it can be observed that $x_2 = 1$ and, by back-substitution, $x_1 = 3 - 2(1) = 1$.
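The result of Example 2.8 can be verified numerically. The short sketch below assumes NumPy and simply solves the same 2 x 2 system; it is not the textbook's own code.

```python
import numpy as np

# Coefficient matrix A and right-hand side y for Example 2.8
A = np.array([[2.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 7.0])

x = np.linalg.solve(A, y)   # internally uses an LU-based elimination
print(x)                    # [1. 1.]  ->  x1 = 1, x2 = 1
```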
2.8.2 Matrix Decompositions

It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. These methods are also known as matrix factorization methods.

The most popular matrix decomposition is called eigen decomposition. It is a way of reducing the matrix into eigen values and eigen vectors. The matrix A can then be decomposed as:

$A = Q \Lambda Q^{T}$        (2.23)

where Q is the matrix of eigen vectors, $\Lambda$ is the diagonal matrix of eigen values, and $Q^{T}$ is the transpose of matrix Q.

LU Decomposition

One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices:

$A = LU$

Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix and obtain the matrices L and U. Example 2.9 illustrates the application of Gaussian elimination to get L and U.

Example 2.9: Find the LU decomposition of the given matrix:

$A = \begin{pmatrix} 1 & 2 & 4 \\ 3 & 3 & 2 \\ 3 & 4 & 2 \end{pmatrix}$

Solution: Start from the identity matrix augmented to A and apply Gaussian elimination to A, recording the multipliers in the lower triangle. Eliminating the first column uses the multipliers 3 and 3 ($R_2 \leftarrow R_2 - 3R_1$, $R_3 \leftarrow R_3 - 3R_1$), and eliminating the second column uses the multiplier 2/3 ($R_3 \leftarrow R_3 - \frac{2}{3}R_2$). This gives:

$L = \begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 3 & \frac{2}{3} & 1 \end{pmatrix}, \qquad
U = \begin{pmatrix} 1 & 2 & 4 \\ 0 & -3 & -10 \\ 0 & 0 & -\frac{10}{3} \end{pmatrix}$

It can be observed that the first matrix is L, the lower triangular matrix whose entries are the multipliers used in the reduction above, namely 3, 3 and 2/3. The second matrix is U, the upper triangular matrix whose entries are the values of the matrix reduced by Gaussian elimination.

It can be cross-verified that the multiplication of L and U yields the original matrix A. What are the applications of LU decomposition? Some of the applications are finding matrix inverses and the determinant. If the order of the matrix is large, then this method can be used.
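Both decompositions can be reproduced with standard libraries. The sketch below assumes NumPy and SciPy are available; note that scipy.linalg.lu applies partial pivoting and also returns a permutation matrix P, so its L and U may differ from the hand-computed ones by a row permutation.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 3.0, 2.0],
              [3.0, 4.0, 2.0]])

# LU decomposition (with partial pivoting): A = P @ L @ U
P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))    # True -- cross-verification that L and U rebuild A

# Eigen decomposition of a general square matrix: A = Q diag(lam) Q^{-1}
lam, Q = np.linalg.eig(A)
print(np.allclose(Q @ np.diag(lam) @ np.linalg.inv(Q), A))    # True
```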
2.8.3 Machine Learning and the Importance of Probability and Statistics

Machine learning is linked with statistics and probability. Like linear algebra, statistics is the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult. Probability is especially important for machine learning. Any data can be assumed to be generated by a probability distribution, and machine learning datasets have multiple data that are generated by multiple distributions. So, knowledge of probability distributions and random variables is a must for a better understanding of machine learning concepts.

What is the objective of an experiment? It is about a hypothesis, or constructing a model. Machine learning has many models. So, the theory of hypotheses, the construction of models, the evaluation of models using hypothesis testing, the significance of a model and the evaluation of a model are all factors linking machine learning and probability theory. Similarly, the construction of datasets using sampling theory is also required. The subsequent section deals with the important concepts of statistics and probability.

Probability Distributions

A probability distribution of a variable, say X, summarizes the probability associated with X's events. A distribution is a parameterized mathematical function; in other words, a distribution is a function that describes the relationship between the observations in a sample space.

Consider a set of data. The data is said to follow a distribution if it obeys a mathematical function that characterizes that distribution. The function can be used to calculate the probability of individual observations. Probability distributions are of two types:

1. Discrete probability distributions
2. Continuous probability distributions

The relationship between the events for a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). The PDF gives the relative likelihood of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability that an observation is less than or equal to a given value. Both PDF and CDF are defined over continuous values. The discrete equivalent of the PDF for a discrete distribution is called the Probability Mass Function (PMF). For a continuous variable, the probability of an exact outcome cannot be read off directly; it should be computed as the area under the PDF curve over a small interval around the specific outcome, and the CDF gives the accumulated area up to a value. Let us discuss some of the distributions that are encountered in machine learning.

Continuous Probability Distributions

Normal, rectangular (uniform), and exponential distributions fall under this category.
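A small illustration of PDF, CDF and PMF is sketched below. It assumes SciPy is available; the chosen distributions and parameters are arbitrary examples, not values from the text.

```python
from scipy import stats

# Continuous: standard normal distribution
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0.0))                  # density at 0 (~0.3989), the height of the bell curve
print(norm.cdf(1.96))                 # P(X <= 1.96) ~ 0.975

# Probability over a small interval = area under the PDF over that interval
print(norm.cdf(0.1) - norm.cdf(-0.1))

# Discrete: binomial distribution, where the PMF gives P(X = k) directly
binom = stats.binom(n=10, p=0.5)
print(binom.pmf(5))                   # P(exactly 5 successes in 10 trials) ~ 0.246
```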
One such important framework is PAC (Probably Approximately Correct) learning. This learning framework is discussed in a detailed manner in Section 3.7. COLT focuses on supervised learning tasks. Since the complexity of the analysis is otherwise difficult, binary classification tasks are normally considered for analysis.

3.3 DESIGN OF A LEARNING SYSTEM

A system that is built around a learning algorithm is called a learning system. The design of such systems focuses on these steps:

1. Choosing a training experience
2. Choosing a target function
3. Representation of the target function
4. Function approximation

Training Experience

Let us consider designing a chess game. In direct experience, individual board states and the correct moves of the chess game are given directly. In an indirect system, only the move sequences and results are given. The training experience also depends on the presence of a supervisor who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, provided the training samples cover all scenarios, or in other words, are distributed enough for performance computation. If the training samples and testing samples have the same distribution, the results will be good.

Determine the Target Function

The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt is determined. In direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen as B -> M, where B is a board state and M is a legal move. In indirect experience, all legal moves are accepted and a score is generated for each; the move with the largest score is then chosen and executed.

Determine the Target Function Representation

The representation of the knowledge may be a table, a collection of rules, or a neural network. A linear combination of board features can be coined as:

$V = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

where $x_1$, $x_2$ and $x_3$ represent different board features and $w_0$, $w_1$, $w_2$ and $w_3$ represent weights.

Choosing an Approximation Algorithm for the Target Function

The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given as:

$E \equiv \sum_{b} \left[ V_{train}(b) - \hat{V}(b) \right]^{2}$

Here, b is a training sample and $\hat{V}(b)$ is the predicted hypothesis value. The approximation is carried out as follows:

* Compute the error as the difference between the trained and expected hypothesis. Let this error be error(b).
* Then, for every board feature $x_i$, the weights are updated as:

  $w_i = w_i + \mu \times error(b) \times x_i$

Here, $\mu$ is a constant that moderates the size of the weight update. A small sketch of this update rule is given after the list of components below.

Thus, the learning system has the following components:

* A Performance system to allow the game to play against itself.
* A Critic system to generate the samples.
* A Generalizer system to generate a hypothesis based on the samples.
* An Experimenter system to generate a new system based on the currently learnt function. This is sent as input to the performance system.
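The sketch below illustrates this LMS-style weight update. It is illustrative only: the feature values, the learning rate mu and the function names are assumptions made up for the example, not taken from the text.

```python
def predict(weights, features):
    """V_hat(b) = w0 + w1*x1 + w2*x2 + w3*x3 for one board b."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, mu=0.01):
    """One update step: w_i <- w_i + mu * error(b) * x_i (x_0 is taken as 1)."""
    error = v_train - predict(weights, features)
    return [w + mu * error * x for w, x in zip(weights, [1.0] + list(features))]

# Toy usage: one training sample b with three board features and a target value
weights = [0.0, 0.0, 0.0, 0.0]          # w0..w3
board_features = [2.0, 0.0, 1.0]        # x1..x3 (hypothetical values)
weights = lms_update(weights, board_features, v_train=1.0)
print(weights)                          # [0.01, 0.02, 0.0, 0.01]
```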
3.4 INTRODUCTION TO CONCEPT LEARNING

Concept learning is a learning strategy of acquiring abstract knowledge, inferring a general concept, or deriving a category from the given training samples. It is a process of abstraction and generalization from the data. Concept learning helps to classify an object that has a set of common, relevant features. Thus, it helps a learner compare and contrast categories based on the similarity and association of positive and negative instances in the training data. The learner tries to simplify by observing the common features of the training samples and then applies this simplified model to future samples. This task is also known as learning from experience.

Each concept or category obtained by learning is a Boolean-valued function which takes a true or false value. For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features. The special features that distinguish one animal from another can be called a concept. This way of learning categories for an object, and recognizing new instances of those categories, is called concept learning. It is formally defined as inferring a Boolean-valued function by processing training instances.

Concept learning requires three things:

1. Input - A training dataset, which is a set of training instances, each labelled with the name of the concept or category to which it belongs. This past experience is used to train and build the model.
2. Output - A target concept or target function f. It is a mapping function f(x) from input x to output y. It determines the specific or common features needed to identify an object; in other words, it finds the hypothesis that determines the target concept. For example, the specific set of features that identifies an elephant among all animals.
3. Test - New instances to test the learned model.

Formally, concept learning is defined as: "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept."

Consider the following set of training instances shown in Table 3.1.

Table 3.1: Sample Training Instances

No.  Horns  Tail   Tusks  Paws  Fur  Color  Hooves  Size    Elephant
1.   No     Short  Yes    No    No   Black  No      Big     Yes
2.   Yes    Short  No     No    No   Brown  Yes     Medium  No
3.   No     Short  Yes    No    No   Black  No      Medium  Yes
4.   No     Long   No     Yes   Yes  White  No      Medium  No
5.   No     Short  Yes    Yes   Yes  Black  No      Big     Yes

Here, in this set of training instances, the independent attributes considered are 'Horns', 'Tail', 'Tusks', 'Paws', 'Fur', 'Color', 'Hooves' and 'Size'. The dependent attribute is 'Elephant'. The target concept is to identify the animal as an elephant.

Let us now take this example and understand the concept of a hypothesis further.

Target Concept: Predict the type of animal - for example, 'Elephant'.

3.4.1 Representation of a Hypothesis

A hypothesis 'h' approximates a target function 'f' to represent the relationship between the independent attributes and the dependent attribute of the training instances. The hypothesis is the predicted approximate model that best maps the inputs to outputs. Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part, for example, (Tail = Short) AND (Color = Black). The set of hypotheses in the search space is called the hypotheses; 'hypotheses' is the plural form of 'hypothesis'. Generally, 'H' is used to represent the set of hypotheses and 'h' is used to represent a candidate hypothesis.

Each attribute condition is a constraint on the attribute, represented as an attribute-value pair. In the antecedent of an attribute condition of a hypothesis, each attribute can take the value '?', the value 'φ', or a single specific value:

* '?' denotes that the attribute can take any value [e.g., Color = ?]
* 'φ' denotes that the attribute cannot take any value, i.e., it represents a null value [e.g., Horns = φ]
* A single value denotes a specific value from the acceptable values of the attribute, e.g., the attribute 'Tail' can take the value 'Short' [e.g., Tail = Short]

For example, over the attributes (Horns, Tail, Tusks, Paws, Fur, Color, Hooves, Size), a hypothesis may look like:

h = <?  Short  ?  ?  ?  Black  ?  ?>

Given a test instance x, we say h(x) = 1 if the test instance x satisfies this hypothesis h.

The training dataset given above has 5 training instances with 8 independent attributes and one dependent attribute. The task is to predict the best hypothesis for the target concept (an elephant). The most general hypothesis allows any value for each of the attributes. It is represented as:

<?, ?, ?, ?, ?, ?, ?, ?>

This hypothesis indicates that any animal can be an elephant. The most specific hypothesis does not allow any value for any of the attributes; it is represented as:

<φ, φ, φ, φ, φ, φ, φ, φ>

Target concept c: Elephant - {Yes, No}
Hypotheses H: Set of hypotheses, each with a conjunction of literals as propositions [i.e., each literal is represented as an attribute-value pair]

Solution: The hypothesis 'h' for the concept learning task of an Elephant is built up by scanning the training instances of Table 3.1 one by one.

When reading the second instance I2, it is a negative instance, so it is ignored.

I2: Yes  Short  No  No  No  Brown  Yes  Medium   (Negative instance)

Similarly, when reading the third instance I3, it is a positive instance, so the hypothesis is generalized to accommodate I3; the resulting hypothesis is more general.

I3: No  Short  Yes  No  No  Black  No  Medium   (Positive instance)

I4: No  Long  No  Yes  Yes  White  No  Medium   (Negative instance)

Remove any hypothesis that is inconsistent with this negative instance.

I5: No  Short  Yes  Yes  Yes  Black  No  Big   (Positive instance)
h = <?  ?  Yes  ?  ?  ?  ?  ?>,  <?  ?  ?  ?  ?  Black  ?  ?>,  <?  ?  ?  ?  ?  ?  ?  Big>

Thus, h is the hypothesis space generated, which will classify the positive instances as true and the negative instances as false.

3.4.5 Hypothesis Space Search by Find-S Algorithm

The Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the positive instances in the training dataset. Obviously, it will also be consistent with the negative instances. Thus, this algorithm considers only the positive instances and eliminates the negative instances while generating the hypothesis. It initially starts with the most specific hypothesis.

Input: Positive instances in the training dataset
Output: Hypothesis 'h'

1. Initialize 'h' to the most specific hypothesis:
   h = <φ, φ, φ, ..., φ>

Step 3: Scan the next instance I2. Since I2 is a positive instance, generalize 'h' to include positive instance I2. For each non-matching attribute value in 'h', put a '?' to include this positive instance. The third attribute value is mismatching in 'h' with I2, so put a '?'.

I2: ≥9  Yes  Good  Good  Fast  Yes   (Positive instance)
h = <≥9  Yes  ?  Good  Fast  Yes>

Now scan I3. Since it is a negative instance, ignore it. Hence, the hypothesis remains the same, without any change, after scanning I3.

I3: ≥8  No  Good  Good  Fast  No   (Negative instance)
h = <≥9  Yes  ?  Good  Fast  Yes>

Now scan I4. Since it is a positive instance, check for mismatches between the hypothesis 'h' and I4. The 5th and 6th attribute values are mismatching, so add '?' to those attributes in 'h'.

I4: ≥9  Yes  Good  Good  Slow  No   (Positive instance)
h = <≥9  Yes  ?  Good  ?  ?>

Now, the final hypothesis generated with the Find-S algorithm is:

h = <≥9  Yes  ?  Good  ?  ?>

It includes all positive instances and obviously ignores any negative instance.
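A compact sketch of Find-S is given below. It is a generic illustration, not the textbook's code; PHI stands for the 'no value allowed' constraint, and the small dataset mirrors the instances of the walkthrough above (the first instance is taken from the related worked example later in this chapter, so treat the exact values as assumptions).

```python
PHI = "phi"   # the most specific constraint: no value allowed

def find_s(instances):
    """instances: list of (attribute_tuple, label) pairs; only positives are used."""
    n = len(instances[0][0])
    h = [PHI] * n                      # step 1: most specific hypothesis
    for attrs, label in instances:
        if label != "Yes":             # negative instances are ignored
            continue
        for i, value in enumerate(attrs):
            if h[i] == PHI:            # first positive instance: copy its values
                h[i] = value
            elif h[i] != value:        # mismatch: generalize that attribute to '?'
                h[i] = "?"
    return h

data = [
    ((">=9", "Yes", "Excellent", "Good", "Fast", "Yes"), "Yes"),
    ((">=9", "Yes", "Good",      "Good", "Fast", "Yes"), "Yes"),
    ((">=8", "No",  "Good",      "Good", "Fast", "No"),  "No"),
    ((">=9", "Yes", "Good",      "Good", "Slow", "No"),  "Yes"),
]
print(find_s(data))   # ['>=9', 'Yes', '?', 'Good', '?', '?']
```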
This algorithm computes the version space by the combination of the two cases namely, * Specific to General learning ~ Generalize S to include the positive example * General to Specific learning - Specialize G to exclude the negative example Basics oftearning Theory» 94 Using the Candidate Elimination algorithm, we can compute the version space containing ail and only those) hypotheses from H that are consistent with the given observed sequence of training instances, The algorithm defines two boundaries called ‘general boundary’ which is a set of ail hypotheses that are the most general and ‘specific boundary’ which isa set ofall hypotheses that sre the most specific. Thus, the algorithm limits the version space to contain only those hypotheses that are most general and most specific. Thus, it provides a compact representation of List-then algorithm. Input: Set of instances in the Training dataset Output: Hypothesis G and S 1. Initialize G, to the maximally general hypotheses. 2. Initialize §, to the maximally specific hypotheses. © Generalize the initial hypothesis for the first positive instance. 3, For each subsequent new training instance, * Ifthe instance is positive, © Generalize $ to include the positive instance, > Check the attribute value of the positive instance and S, «If the attribute value of positive instance and $ are different, fill that field value with ‘?’. + If the attribute value of positive instance and $ are same, then do no change. 0 Prune G to exclude all inconsistent hypotheses in G with the positive instance. + Ifthe instance is negative, © Specialize G to exclude the negative instance, > Add to G all minimal specializations to exclude the negative example and be consistent with S. + If the attribute value of $ and the negative instance are different, then fill that attribute value with $ value, «If the attribute value of $ and negative instance are same, no need to update ‘G’ and fill that attribute value with ‘7’. ‘0 Remove from S all inconsistent hypotheses with the negative instance. Generating Positive Hypothesis ‘S’ If itis a positive example, refine $ to include the positive instance, We need to generalize $ to include the positive instance. The hypothesis is the conjun S’ and positive instance. When generalizing, for the first positive instance, add to § all minimal generalizations such that S is filled with attribute values of the positive instance. For the subsequent Dositive instances scanned, check the attribute value of the positive instance and S obtained in the 92. © Machine Learning $$ poms previous iteration. If the attribute values: of positive instance and S$ are different, fill that field value / jteration with a7. Ifthe attribute values of positive instance and $ are same, no change is required. i Now Sea Ifitis a negative instance, it skips, : Generating Negative Hypothesis ‘G’ If it is a negative instance, refine G to exclude the gine negative instance. Then, prune G to exclude all inconsistent hypotheses in G with the positive consiste instance, The idea is to add to G all minimal specializations to exclude the negative instance and end fill be consistent with the positive instance. Negative hypothesis indicates general hypothesis. ‘elutes, If the attribute values of positive and negative instances are different, then fill that field with are gen positive instance value so that the hypothesis does not classify that negative instance as true. 
Ifthe The attribute values of positive and negative instances are same, then no need to update ‘G’ and fill that oa attribute value with a’. e Generating Version Space - [Consistent Hypothesis] We need to take the combination of sets in‘G’ and check that with ‘S', When the combined set fields are matched with fields in’S’, then only that is included in the version space as consistent hypothesis. a 33 BPERRSREWA Consider the same set of instances from the training dataset shown in-Table 3.3 Iterati and generate version space as consistent hypothesis. Bae e Solution: u Step 1: Initialize ‘G’ boundary to the maximally general hypotheses, Gao? 2 2% 2% ® s Step 2: Initialize ‘S’ boundary to the maximally specific hypothesis. There are 6 attributes, so for u each attribute, we initially fill’¢ in the hypothesis ‘S’. c S=<—p 9 9 9 P Generalize the initial hypothesis for the first positive instance. Il is a positive instance, so generalize the most specific hypothesis ‘S’ to include this positive instance. Hence, I: 29 Yes Excellent © Good Fast. Yes _ Positive instance one. Si=<29 Yes Excellent Good Fast Yes> G 7 2 2 2 > Step 3: Iteration 1 cor: Scan the next instance 12. Since [2 is a positive instance, generalize ‘S1' to include positive instance 12. For each of the non-matching attribute value in ‘$1’, put a‘?’ to include this positive instance. The third attribute value is mismatching in ‘$1’ with 12, so put a‘?’. R: 29 Yes Good Good Fast Yes _ Positive instance $2=<29 Yes? Good Fast - Yes> Prune G1 to exclude all inconsistent hypotheses with the positive instance. Since G1 is o consistent with this positive instance, there is no change. The resulting G2 is, Gl=2? 2? ? ? 2 > —__——— Basics of Learning Theory + 93 Iteration 2 Now Sean 13, B: 2 No Good Good Fast. No Negative instance since it is a negative instance, specialize G2 to exclude the negative example but stay consistent with $2. Generate hypothesis for each of the non-matching, attribute value in $2 and fill with the attribute value of $2, In those generated hypotheses, for all matching attribute values, put a‘? The first, second and 6* attribute values do not match, hence ‘3 hypotheses are generated in G3. ‘There is no inconsistent hypothesis in $2 with the negative instance, hence $3 remains the same. B09 27 2 2 2 Bb £ a Ys 2° 2 2 Pb : ar 2 2 2 Yes §3=<29 Yes ? — Good Fast’ Yeo eration 3 Now Scan [4, Since it is a positive instance, check for mismatch in the hypothesis ‘53° with M4. ‘The 5 and 6" attribute value are mismatching, so add ‘?’ to those attributes in ‘54’. Is 29 Yes Good Good Slow No Positive instance Sh=<29 Yes 2 Good 2? 9 Prune G3 to exclude all inconsistent hypotheses with the positive instance I4. G=s9 2 2 2 2 Bw “a Yes? 2 > 2 Yes, Indonsistent Since the third hypothesis in G3 is inconsistent with this positive instance, remove the third one, The resulting G4 is, ou 2 Gh=<29 ? 2 2 2 > 2 Yes? 2 2 > Using the two boundary sets, $4 and G4, the version space is converged to contain the set of consistent hypotheses. ‘The final version space is, <29 Yes? ? 2 > 2 Good ? > Yes? Good jorithm finds the version s those hypotheses that are most general and most specific. The shown in Figure 3,2, nntation of deriving the version space mmatic rep!
