Pattern Recognition - Organizer - 2023
B.Tech Section, 2021-2022
PATTERN RECOGNITION

Contents:
Basics of Pattern Recognition
Bayesian Decision Theory
Parameters Estimation Methods
Hidden Markov Models for Sequential Pattern Classification
Dimension Reduction Methods
Non-Parametric Techniques for Density Estimation
Linear Discriminant Function Based Classifier
Non-Metric Methods for Pattern Classification
Unsupervised Learning and Clustering

NOTE: The MAKAUT course structure and syllabus of the 6th semester has been changed from 2021. PATTERN RECOGNITION has been introduced as a new subject in the present curriculum. Taking special care of this matter, we are providing chapter-wise model questions and answers, so that students can get an idea about university question patterns.

BASICS OF PATTERN RECOGNITION

Multiple Choice Type Questions

1. Which of the following is an example of pattern recognition? [MODEL QUESTION]
a) Speech recognition  b) Speaker identification  c) MDR  d) All of the above
Answer: (d)

2. Pattern recognition solves the problem of fake biometric detection. [MODEL QUESTION]
a) TRUE  b) FALSE  c) Can be true or false  d) Cannot say
Answer: (a)

3. Which of the following is a disadvantage of pattern recognition? [MODEL QUESTION]
a) The syntactic pattern recognition approach is complex to implement
b) It is a very slow process
c) Sometimes, to get better accuracy, a larger dataset is required
d) All of these
Answer: (d)

4. In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. [MODEL QUESTION]
a) TRUE  b) FALSE  c) Can be true or false  d) Cannot say
Answer: (a)

5. ________ is the process of recognizing patterns by using a machine learning algorithm. [MODEL QUESTION]
a) Processed Data  b) Literate Statistical Programming  c) Pattern Recognition  d) Likelihood
Answer: (c)

Short Answer Type Questions

1. What is pattern recognition? [MODEL QUESTION]
Answer:
Pattern recognition is the process of recognizing patterns by using a machine learning algorithm. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One of the important aspects of pattern recognition is its application potential.
Examples: speech recognition, speaker identification, multimedia document recognition (MDR), automatic medical diagnosis.
In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. Pattern recognition involves the classification and clustering of patterns.
- In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.
- Clustering generates a partition of the data which helps decision making, the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.
Features may be represented as continuous, discrete, or discrete binary variables. A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object. Example: if we consider a face, then the eyes, ears, nose, etc. are features of the face. A set of features taken together forms a feature vector. Example: in the face example above, if all the features (eyes, ears, nose, etc.) are taken together, then the sequence is a feature vector ([eyes, ears, nose]).
The feature vector is the sequence of features represented as a d-dimensional column vector. In the case of speech, the MFCCs (Mel-Frequency Cepstral Coefficients) are the spectral features of the speech, and the sequence of the first 13 such features forms a feature vector.

A pattern recognition system should possess the following capabilities:
- recognize familiar patterns quickly and accurately;
- recognize and classify unfamiliar objects;
- accurately recognize shapes and objects from different angles;
- identify patterns and objects even when partly hidden;
- recognize patterns quickly, with ease, and with automaticity.

Long Answer Type Questions

1. Explain the idea of pattern recognition. [MODEL QUESTION]
Answer:
The problem of searching for patterns in data is a fundamental one and has a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics. Similarly, the discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century. The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to take actions such as classifying the data into different categories.

Consider the example of recognizing handwritten digits. Each digit corresponds to a 28x28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that takes such a vector x as input and produces the identity of the digit 0, ..., 9 as the output. This is a nontrivial problem due to the wide variability of handwriting. Far better results can be obtained by adopting a machine learning approach in which a large set of N digits {x1, ..., xN}, called a training set, is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance, typically by inspecting them individually and hand-labelling them. We can express the category of a digit using a target vector t, which represents the identity of the corresponding digit. Suitable techniques for representing categories in terms of vectors will be discussed later. Note that there is one such target vector t for each digit image x.

[Figure: Examples of hand-written digits taken from US zip codes]

Pre-processing might also be performed in order to speed up computation. For example, if the goal is real-time face detection in a high-resolution video stream, the computer must handle huge numbers of pixels per second, and presenting these directly to a complex pattern recognition algorithm may be computationally infeasible. Instead, the aim is to find useful features that are fast to compute and yet also preserve useful discriminatory information enabling faces to be distinguished from non-faces. These features are then used as the inputs to the pattern recognition algorithm. For instance, the average value of the image intensity over a rectangular sub-region can be evaluated extremely efficiently (Viola and Jones, 2004), and a set of such features can prove very effective in fast face detection. Because the number of such features is smaller than the number of pixels, this kind of pre-processing represents a form of dimensionality reduction.
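The training-set idea described above for handwritten digits can be made concrete in a few lines of code. The following is a minimal sketch, not part of the original text: it uses scikit-learn's bundled digits dataset (8x8 pixel images rather than the 28x28 images discussed above), and the classifier choice is arbitrary.

```python
# Minimal sketch: supervised digit classification with a hand-labelled training set.
# Assumes scikit-learn is available; its digits are 8x8 images, not 28x28.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                      # data: flattened pixel feature vectors, target: digit labels
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)   # any simple adaptive model works for this illustration
clf.fit(X_train, y_train)                   # "tune the parameters" on the training set
print("test accuracy:", clf.score(X_test, y_test))
```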
Care must be taken during pre-processing, because information is often discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.

In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, which is called clustering; or to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

2. What are the scopes of pattern recognition? [MODEL QUESTION]
Answer:
Pattern recognition is a mature but exciting and fast-developing field, which underpins developments in cognate fields such as computer vision, image processing, text and document analysis, and neural networks. It is closely akin to machine learning and also finds applications in fast-emerging areas such as biometrics, bio-informatics, multimedia data analysis and, most recently, data science. The journal Pattern Recognition was established some 50 years ago, as the field emerged in the early years of computer science. Over the intervening years it has expanded considerably.
Scope of PR in machine learning:
- Data Mining: It refers to the extraction of useful information from large amounts of data from heterogeneous sources. The meaningful data obtained from data mining techniques is used for prediction making and data analysis.
- Recommender Systems: Most of the websites dedicated to online shopping make use of recommender systems. These systems collect data related to each customer purchase and make suggestions using machine learning algorithms by identifying trends in the pattern of customer purchases.
- Image Processing: Image processing is basically of two types: digital image processing and analog image processing. Digital image processing uses intelligent machine learning algorithms for enhancing the quality of images obtained from distant sources such as satellites.
- Bio-Informatics: It is a field of science that uses computational tools and software to make predictions relating to biological data. For example, suppose someone discovered a new protein in the lab, but the sequence of the protein is not known. Using bio-informatics tools, the unknown protein is compared with a huge number of proteins stored in a database to predict a sequence based on similar patterns.
- Analysis: Pattern recognition is used for identifying important data trends. These trends can be used for future predictions. Analysis is required in almost every domain, be it technical or non-technical. For example, the tweets made by a person on Twitter help in sentiment analysis by identifying the patterns in the posts using natural language processing.

3. What are the challenges in pattern recognition? [MODEL QUESTION]
Answer:
- Data Collection: The input data, whatever its form, is sampled at fixed intervals in the time or image metric domain and digitised to be presented with a preset number of bits per measurement. Any additional noise will be disadvantageous to the successful operation of the system. So, to have a "clean" input, we have to use many techniques such as filtering. Sometimes, for example in online processing, we also have to store or transfer a lot of data, and the real problem that may come up is a bottleneck.
- Segmentation: Depending on how the application has been realized, the segmentation block may either add the information regarding the segment boundaries to the data flow, or alternatively copy all the segments into separate buffers and pass them to the following stage one by one. There is plenty of research in this field, but it is still an open problem.
- Feature Extraction: It is still a big question for every researcher because of some obvious questions, such as: how are objects described, how complex may the description be, and what are the ways to incorporate knowledge about the application domain? At this moment we do not have a complete answer for any kind of object, so the only thing that can help is doing research and studying existing research on the application domain.
- Classification: This is the most crucial step in the process of pattern recognition. The primary division of the various classification algorithms is between syntactic and statistical methods. Numerous taxonomies for classification in pattern recognition have been presented. None has been so clearly more advantageous than the others that it would have gained uncontested status.

4. What are the applications of pattern recognition? [MODEL QUESTION]
Answer:
1. Natural Language Processing (NLP): Pattern recognition algorithms are used in NLP for building strong software systems that have further applications in the computer and communications industry.
2. Network Intrusion Detection: Network intrusion detection is one of the sectors of security. Intrusion is one of the serious threats posed to any data firm. PR system applications help in intrusion detection by recognizing patterns of intrusion over time. This ensures security systems are alerted if the slightest patterns of intrusion show their traces over the network.
3. Disease Categorization: PR systems have been employed in disease recognition and imaging.
4. Image Sensing and Recognition: Pattern recognition well suits image processing and its segmentation. The analysis is then performed and forwarded for expert review. PR algorithms have gradually incorporated intelligence similar to humans; machine learning has boosted their recognition power in medical image sensing and recognition.
5. Data Mining and Warehousing Patterns or Knowledge Discovery: KDD and related projects are used for finding patterns when performing data mining.
6. Acting as Eyes in Computer Vision: Pattern recognition is widely used in computer vision. It helps in extracting meaningful features from image and video samples. There are applications in biomedical and medical imaging.
7. Prediction: Analysis tools are integrated with pattern recognition algorithms to identify trends and make predictions.
8. Seismic Analysis: PR approaches are used for finding, imaging and elucidating patterns in recorded seismic data. In this application, statistical pattern recognition is implemented in seismic analysis and data subdivision.
9. Radar Signal Recognition and Analysis: Pattern recognition schemes are used in radar signal classification and analysis. Signal processing methods are used in various applications of radar signal classification, such as AP mine detection and identification.
10. Speech Recognition: The huge success of pattern recognition is seen in the speech recognition domain. Linguistics and PR systems are going hand in hand with the research and developments.
It uses algorithms that are competitive and able to treat large data sets simultaneously.
11. Agriculture: In the agriculture industry there are many applications, reflected in the contribution of economic benefits. Pattern recognition works hand in hand with the breeding industry; researchers are using multiple pattern recognition schemes for the identification and improvement of key breed traits. This helps in dealing with rising production demands, increasing resistance to various diseases, and reducing threats to the environment by using less water, fertilizer, etc.
12. Financial Services: In financial companies, PR systems help in recognizing trends in the financial markets and identifying key insights. This might prevent financial crashes and save society from financial troubles. The technology is further used to make investments and expand businesses. Cyber surveillance is one of the examples that help in recognizing risks in a timely manner and taking steps to prevent them.
13. Fingerprint Identification: Fingerprint recognition technology is a dominant technology in the biometric market. A number of recognition methods have been used to perform fingerprint matching, out of which pattern recognition approaches are widely used.
14. Texture Discrimination: The textile industry makes use of PR systems for texture determination for its clients based on their housing needs.
15. Transportation: PR systems are expanding their applications to the transportation sector as well. On the basis of travel history, the pattern of routes, packages, destinations and costs is made available in advance to customers using PR systems. Transportation companies are thus able to forecast potential risks that might occur on certain routes and suitably counsel their customers in a timely manner.

BAYESIAN DECISION THEORY

Multiple Choice Type Questions

1. Which of the following statements is TRUE about the Bayes classifier? [MODEL QUESTION]
a) The Bayes classifier works on the Bayes theorem of probability.
b) The Bayes classifier is an unsupervised learning algorithm.
c) The Bayes classifier is also known as the maximum apriori classifier.
d) It assumes independence between the independent variables or features.
Answer: (a)

2. How do we perform Bayesian classification when some features are missing? [MODEL QUESTION]
a) We assume the missing values to be the mean of all values.
b) We ignore the missing features.
c) We integrate the posterior probabilities over the missing features.
d) We drop the features completely.
Answer: (c)

Short Answer Type Questions

1. What is a random variable? Narrate Bayes' theorem. [MODEL QUESTION]
Answer:
Random Variable
A random variable is a function that maps a possible set of outcomes to some values, e.g. while tossing a coin, taking head H as 1 and tail T as 0, where 0 and 1 are the values of the random variable.
Bayes Theorem
The conditional probability of A given B, represented by P(A|B), is the chance of occurrence of A given that B has occurred.
P(A|B) = P(A, B)/P(B)
By using the chain rule, this can also be written as:
P(A, B) = P(A|B)P(B) = P(B|A)P(A)
Therefore, P(A|B) = P(B|A)P(A)/P(B)   ... (1)
where P(B) = P(B, A) + P(B, A') = P(B|A)P(A) + P(B|A')P(A').
Eqn. (1) is known as the Bayes theorem of probability.

2. What is prior probability? [MODEL QUESTION]
Answer:
Prior probability is calculated according to the past occurrences of the outcomes.
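As a quick numerical illustration of Eqn. (1) (this example is not from the original text and the numbers are made up), the posterior P(A|B) can be computed directly from a prior and the two likelihoods:

```python
# Hedged illustration of Bayes' theorem, Eqn. (1): P(A|B) = P(B|A)P(A) / P(B),
# with the evidence P(B) expanded by the law of total probability.
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    p_not_a = 1.0 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a   # P(B) = P(B|A)P(A) + P(B|A')P(A')
    return p_b_given_a * p_a / p_b

# Example: prior P(A) = 0.3, likelihoods P(B|A) = 0.8 and P(B|A') = 0.1
print(posterior(0.8, 0.3, 0.1))   # -> about 0.774: observing B raises P(A) from 0.3 to roughly 0.77
```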
Bayes Decision Rule:
Suppose we must classify a fish as one of two types, i.e. decide between two states of nature ω1 and ω2, and that all we know are the prior probabilities P(ω1) and P(ω2). The natural rule is: Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
This rule makes sense if we are to judge just one fish, but if we are to judge many fish, using this rule repeatedly may seem a bit strange. After all, we would always make the same decision even though we know that both types of fish will appear. How well it works depends upon the values of the prior probabilities. If P(ω1) is very much greater than P(ω2), our decision in favour of ω1 will be right most of the time. If P(ω1) = P(ω2), we have only a fifty-fifty chance of being right. In general, the probability of error is the smaller of P(ω1) and P(ω2), and we shall see later that under these conditions no other decision rule can yield a larger probability of being right.

In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier. Different fish will yield different lightness readings, and we express this variability in probabilistic terms; we consider x to be a continuous random variable whose distribution depends on the state of nature and is expressed as p(x|ω). This is the class-conditional probability density function. Strictly speaking, the probability density function p(x|ω) should be written as p_X(x|ω) to indicate that we are speaking about a particular density function for the random variable X. This more elaborate subscripted notation makes it clear that p_X(.) and p_Y(.) denote different functions. Since this potential confusion rarely arises in practice, we have elected to adopt the simpler notation. This is the probability density function for x given that the state of nature is ω1 (it is also sometimes called the state-conditional probability density). The difference between p(x|ω1) and p(x|ω2) then describes the difference in lightness between populations of sea bass and salmon (Fig. 1).

Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj). Suppose further that we measure the lightness of a fish and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature, that is, the category of the fish? We note first that the (joint) probability density of finding a pattern that is in category ωj and has feature value x can be written in two ways: p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj). Rearranging these leads us to the answer to our question, which is called Bayes' formula:

P(ωj|x) = p(x|ωj) P(ωj) / p(x)    ... (1)

where in this case of two categories

p(x) = Σ_{j=1}^{2} p(x|ωj) P(ωj)    ... (2)

Bayes' formula can be expressed informally in English by saying that

posterior = (likelihood x prior) / evidence

Bayes' formula shows that by observing the value of x we can convert the prior probability P(ωj) to the a posteriori (or posterior) probability P(ωj|x): the probability of the state of nature being ωj given that feature value x has been measured. We call p(x|ωj) the likelihood of ωj with respect to x (a term chosen to indicate that, other things being equal, the category ωj for which p(x|ωj) is large is more "likely" to be the true category). Notice that it is the product of the likelihood and the prior probability that is most important in determining the posterior probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one, as all good probabilities must. The variation of P(ωj|x) with x is illustrated in Fig. 2 for the case P(ω1) = 2/3 and P(ω2) = 1/3.

Fig. 1: Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the length of a fish, the two curves might describe the difference in length of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0.

If we have an observation x for which P(ω1|x) is greater than P(ω2|x), we would naturally be inclined to decide that the true state of nature is ω1. Similarly, if P(ω2|x) is greater than P(ω1|x), we would be inclined to choose ω2. To justify this decision procedure, let us calculate the probability of error whenever we make a decision. Whenever we observe a particular x,

P(error|x) = P(ω1|x) if we decide ω2, and P(error|x) = P(ω2|x) if we decide ω1.    ... (4)

Clearly, for a given x we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise. Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx    ... (5)

Fig. 2: Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3, for the class-conditional probability densities shown in Fig. 1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω1 is roughly 0.08 and that it is in ω2 is 0.92. At every x, the posteriors sum to 1.0.

If for every x we ensure that P(error|x) is as small as possible, then the integral must be as small as possible. Thus we have justified the following Bayes' decision rule for minimizing the probability of error:

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2    ... (6)

and under this rule Eqn. (4) becomes

P(error|x) = min[P(ω1|x), P(ω2|x)]

This form of the decision rule emphasizes the role of the posterior probabilities. By using Eqn. (1), we can instead express the rule in terms of the conditional densities and the priors. First note that the evidence, p(x), in Eqn. (1) is unimportant as far as making a decision is concerned. It is basically just a scale factor that states how frequently we will actually measure a pattern with feature value x; its presence in Eqn. (1) guarantees that the posterior probabilities sum to one.

Discriminant Functions:
A pattern classifier can be represented by a set of discriminant functions g_i(x), i = 1, ..., c; the classifier assigns a feature vector x to class ωi if g_i(x) > g_j(x) for all j ≠ i. A Bayes classifier is easily and naturally represented in this way. For the general case with risks, we can let g_i(x) = -R(α_i|x), since the maximum discriminant function will then correspond to the minimum conditional risk. For the minimum-error-rate case, we can simplify things further by taking g_i(x) = P(ωi|x), so that the maximum discriminant function corresponds to the maximum posterior probability.

Clearly, the choice of discriminant functions is not unique. We can always multiply the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision. More generally, if we replace every g_i(x) by f(g_i(x)), where f(.) is a monotonically increasing function, the resulting classification is unchanged. This observation can lead to significant analytical and computational simplifications. In particular, for minimum-error-rate classification, any of the following choices gives identical classification results, but some can be much simpler to understand or to compute than others:

g_i(x) = P(ωi|x) = p(x|ωi) P(ωi) / Σ_j p(x|ωj) P(ωj)    ... (2)
g_i(x) = p(x|ωi) P(ωi)    ... (3)
g_i(x) = ln p(x|ωi) + ln P(ωi)    ... (4)

where ln denotes the natural logarithm.

Fig. 2: In this two-dimensional two-category classifier, the probability densities are Gaussian (with 1/e ellipses shown); the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected.

Even though the discriminant functions can be written in a variety of forms, the decision rules are equivalent. The effect of any decision rule is to divide the feature space into c decision regions, R1, ..., Rc. If g_i(x) > g_j(x) for all j ≠ i, then x is in Ri, and the decision rule calls for us to assign x to ωi. The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions (Fig. 2).

The Two-Category Case:
While the two-category case is just a special instance of the multicategory case, it has traditionally received separate treatment. Indeed, a classifier that places a pattern in one of only two categories has a special name: a dichotomizer. Instead of using two discriminant functions g1 and g2 and assigning x to ω1 if g1 > g2, it is more common to define a single discriminant function

g(x) = g1(x) - g2(x)

and to use the following decision rule: Decide ω1 if g(x) > 0; otherwise decide ω2.
Thus, a dichotomizer can be viewed as a machine that computes a single discriminant function g(x) and classifies x according to the algebraic sign of the result. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two (derived from Eqns. (2) and (4)) are particularly convenient:

g(x) = P(ω1|x) - P(ω2|x)
g(x) = ln[ p(x|ω1) / p(x|ω2) ] + ln[ P(ω1) / P(ω2) ]

3. What do you know about the normal density and discriminant functions? [MODEL QUESTION]
Answer:
Before talking about discriminant functions for the normal density, we first need to know what a normal distribution is and how it is represented for just a single variable and for a vector variable. Let us begin with the continuous univariate normal (Gaussian) density:

p(x) = (1 / (sqrt(2π) σ)) exp[ -(1/2) ((x - μ)/σ)^2 ]

for which the expected value of x is

μ = E[x] = ∫ x p(x) dx

and the expected squared deviation, or variance, is

σ^2 = E[(x - μ)^2] = ∫ (x - μ)^2 p(x) dx

The univariate normal density is completely specified by two parameters: its mean μ and variance σ^2. The density can be written as N(μ, σ^2), which says that x is distributed normally with mean μ and variance σ^2. Samples from a normal distribution tend to cluster about the mean with a spread related to the standard deviation σ.

For the multivariate normal density in d dimensions, p(x) is written as

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ -(1/2) (x - μ)^T Σ^(-1) (x - μ) ]

where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| and Σ^(-1) are its determinant and inverse respectively, and (x - μ)^T denotes the transpose of (x - μ). Also,

μ = E[x] = ∫ x p(x) dx
Σ = E[(x - μ)(x - μ)^T] = ∫ (x - μ)(x - μ)^T p(x) dx

where the expected value of a vector or a matrix is found by taking the expected values of the individual components; i.e. if x_i is the i-th component of x, μ_i the i-th component of μ, and σ_ij the ij-th component of Σ, then

μ_i = E[x_i]  and  σ_ij = E[(x_i - μ_i)(x_j - μ_j)]

The covariance matrix Σ is always symmetric and positive definite, which means that the determinant of Σ is strictly positive. The diagonal elements σ_ii are the variances of the respective x_i (i.e. σ_i^2), and the off-diagonal elements σ_ij are the covariances of x_i and x_j. If x_i and x_j are statistically independent, then σ_ij = 0. If all of the off-diagonal elements are zero, p(x) reduces to the product of the univariate normal densities for the components of x.

Discriminant Functions:
Discriminant functions are used to find the minimum probability of error in decision-making problems. In a problem with feature vector Y and state of nature variable w, we can represent the discriminant function as

g_i(Y) = ln p(Y|w_i) + ln P(w_i)

where p(Y|w_i) is the conditional probability density function for Y with w_i being the state of nature, and P(w_i) is the prior probability that nature is in state w_i. If we take p(Y|w_i) to be a multivariate normal distribution, that is p(Y|w_i) = N(μ_i, Σ_i), then the discriminant function becomes

g_i(x) = -(1/2)(x - μ_i)^T Σ_i^(-1) (x - μ_i) - (d/2) ln 2π - (1/2) ln|Σ_i| + ln P(w_i)

We will now look at the different cases for a multivariate normal distribution.

Case 1: Σ_i = σ^2 I
This is the simplest case, and it occurs when the features are statistically independent and each feature has the same variance σ^2. Here the covariance matrix is diagonal, since it is simply σ^2 times the identity matrix I. This means that each sample falls into equal-sized clusters that are centered about their respective mean vectors. The determinant and the inverse are |Σ_i| = σ^(2d) and Σ_i^(-1) = (1/σ^2) I. Because both |Σ_i| and the (d/2) ln 2π term in the equation above are independent of i, we can ignore them, and thus we obtain the simplified discriminant function

g_i(x) = -||x - μ_i||^2 / (2σ^2) + ln P(w_i)

where ||.|| denotes the Euclidean norm, that is, ||x - μ_i||^2 = (x - μ_i)^T (x - μ_i).
If the prior probabilities are not equal, then the discriminant function shows that the squared distance ||x - μ_i||^2 must be normalized by the variance σ^2 and offset by adding ln P(w_i); therefore, if x is equally near two different mean vectors, the optimal decision will favour the a priori more likely class.
Expansion of the quadratic form (x - μ_i)^T (x - μ_i) yields

g_i(x) = -(1/(2σ^2)) [x^T x - 2 μ_i^T x + μ_i^T μ_i] + ln P(w_i)

which looks like a quadratic function of x. However, the quadratic term x^T x is the same for all i, meaning it can be ignored since it is just an additive constant; thereby we obtain the equivalent linear discriminant function

g_i(x) = w_i^T x + w_i0

where w_i = μ_i / σ^2 and w_i0 = -μ_i^T μ_i / (2σ^2) + ln P(w_i).
w_i0 is the threshold or bias for the i-th category.
A classifier that uses linear discriminant functions is called a linear machine. For a linear machine, the decision surfaces are just pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. In this situation, the equation can be written as

w^T (x - x_0) = 0

where w = μ_i - μ_j and

x_0 = (1/2)(μ_i + μ_j) - (σ^2 / ||μ_i - μ_j||^2) ln[ P(w_i)/P(w_j) ] (μ_i - μ_j)

Because w = μ_i - μ_j, the hyperplane separating R_i and R_j is orthogonal to the line linking the means. If P(w_i) = P(w_j), the point x_0 is halfway between the means and the hyperplane is the perpendicular bisector of the line between the means (Fig. 1). If P(w_i) ≠ P(w_j), the point x_0 shifts away from the more likely mean.

Case 2: Σ_i = Σ
Another case occurs when the covariance matrices for all of the classes are identical. Geometrically, this corresponds to the situation where the samples fall into hyperellipsoidal clusters of equal size and shape, with the cluster of the i-th class being centered about the mean vector μ_i. Both |Σ_i| and the (d/2) ln 2π terms can be ignored because they are independent of i. This leads to the simplified discriminant function

g_i(x) = -(1/2)(x - μ_i)^T Σ^(-1) (x - μ_i) + ln P(w_i)

If the prior probabilities P(w_i) are equal for all classes, then the ln P(w_i) term can be ignored; however, if they are unequal, then the decision will be biased in favour of the a priori more likely class.

Case 3: Σ_i = arbitrary
In the general multivariate Gaussian case where the covariance matrices are different for each class, the only term that can be dropped from the initial discriminant function is the (d/2) ln 2π term. The resulting discriminant function is quadratic:

g_i(x) = x^T W_i x + w_i^T x + w_i0

where W_i = -(1/2) Σ_i^(-1), w_i = Σ_i^(-1) μ_i, and w_i0 = -(1/2) μ_i^T Σ_i^(-1) μ_i - (1/2) ln|Σ_i| + ln P(w_i). (Fig. 3)

Example: Given the set of data below for a distribution with two classes w1 and w2, both with prior probability 0.5, find the discriminant functions and decision boundary.

Sample    w1        w2
  1      -5.01      0.91
  2      -5.43      1.30
  3       1.08     -7.75
  4       0.86       ...
 5-8       ...       ...

Minimum-Risk Classification:
Let Y be a d-component vector-valued random variable and let p(Y|x_j) be the conditional probability density function for Y, with x_j being the true state of nature. As discussed before, P(x_j) is the prior probability that nature is in state x_j; therefore, by using Bayes' formula we can find the posterior probability P(x_j|Y):

P(x_j|Y) = p(Y|x_j) P(x_j) / p(Y)    ... (1)

where

p(Y) = Σ_j p(Y|x_j) P(x_j)    ... (2)

Now suppose we observe a particular feature vector Y and we decide to take an action k_i. If the true state of nature is x_j, then from the definition of the loss function we incur the loss λ(k_i|x_j). Because P(x_j|Y) is the probability that the true state of nature is x_j, the expected loss associated with taking action k_i can be expressed as

R(k_i|Y) = Σ_j λ(k_i|x_j) P(x_j|Y)    ... (3)

In decision-theory terminology, an expected loss is called a risk, and R(k_i|Y) is called the conditional risk. So whenever we have an observation Y, we can minimize the expected loss by choosing the action that minimizes the conditional risk. To minimize the overall risk, compute the conditional risk in Eqn. (3) for i = 1, ..., a and then select the action k_i for which R(k_i|Y) is minimum. The resulting minimum overall risk is called the Bayes risk and is denoted by R*.

Discrete Features:
In many practical applications, the components of the feature vector are binary, ternary, or higher integer valued, so that Y can assume one of m discrete values {y_1, ..., y_m}. In these cases, integrals over probability density functions are replaced by sums of the form Σ_Y P(Y|x_j), where we understand that the summation is over all values of Y in the discrete distribution. Bayes' formula then involves probabilities rather than probability densities.

PARAMETERS ESTIMATION METHODS

Multiple Choice Type Questions

1. The maximum likelihood estimate is [MODEL QUESTION]
a) a minimum of the likelihood, not necessarily in the parameter space
b) a maximum of the likelihood in the parameter space
c) a maximum of the likelihood, not necessarily in the parameter space
d) a minimum of the likelihood in the parameter space
Answer: (b)

Short Answer Type Questions

1. What are parameters? [MODEL QUESTION]
Answer:
Often in machine learning we use a model to describe the process that results in the data that are observed. For example, we may use a random forest model to classify whether customers may cancel a subscription from a service (known as churn modelling), or we may use a linear model to predict the revenue that will be generated for a company depending on how much they may spend on advertising (this would be an example of linear regression). Each model contains its own set of parameters that ultimately define what the model looks like.
For a linear model we can write this as y = mx + c. In this example, x could represent the advertising spend and y might be the revenue generated; m and c are parameters of the model. Different values for these parameters will give different models.

[Figure: Three linear models with different parameter values]

These parameters define a blueprint for the model. It is only when specific values are chosen for the parameters that we get an instantiation of the model that describes a given phenomenon.

2. What is maximum likelihood estimation? [MODEL QUESTION]
Answer:
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
If the likelihood function is differentiable, the derivative test for determining maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved explicitly; for instance, the ordinary least squares estimator maximizes the likelihood of the linear regression model. Under most circumstances, however, numerical methods will be necessary to find the maximum of the likelihood function. The likelihood function can also be non-concave, with multiple local maxima requiring the use of heuristic global optimization techniques.

3. What is the expectation-maximization method? [MODEL QUESTION]
Answer:
In most real-life problem statements of machine learning, it is very common that many relevant features are available for building the model, but only a small portion of them are observable. Since we do not have the values for the unobserved (latent) variables, the expectation-maximization (EM) algorithm tries to use the existing data to determine the optimum values for these variables and then finds the model parameters.

Long Answer Type Questions

1. Explain maximum likelihood estimation. [MODEL QUESTION]
Answer:
There are many methods for estimating unknown parameters from data. We will first consider the maximum likelihood estimate (MLE), which answers the question: for which parameter value does the observed data have the biggest probability? The MLE is an example of a point estimate because it gives a single value for the unknown parameter (later our estimates will involve intervals and probabilities). Two advantages of the MLE are that it is often easy to compute and that it agrees with our intuition in simple examples. We will explain the MLE through a series of examples.

Example 1. A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood estimate for the probability p of heads on a single toss.
Before actually solving the problem, let's establish some notation and terms.
We can think of counting the number of heads in 100 tosses as an experiment. For a given value of p, the probability of getting 55 heads in this experiment is the binomial probability

P(55 heads) = C(100, 55) p^55 (1 - p)^45

The probability of getting 55 heads depends on the value of p, so let's include p in the notation using conditional probability:

P(55 heads | p) = C(100, 55) p^55 (1 - p)^45

You should read P(55 heads | p) as "the probability of 55 heads given p", or, more precisely, as "the probability of 55 heads given that the probability of heads on a single toss is p".
Here are some standard terms we will use as we do statistics.
- Experiment: flip the coin 100 times and count the number of heads.
- Data: the data is the result of the experiment. In this case it is "55 heads".
- Parameter(s) of interest: we are interested in the value of the unknown parameter p.
- Likelihood, or likelihood function: this is P(data | p). Note it is a function of both the data and the parameter p. In this case the likelihood is P(55 heads | p) = C(100, 55) p^55 (1 - p)^45.
Look carefully at the definition. One typical source of confusion is to mistake the likelihood P(data | p) for P(p | data). We know from our earlier work with Bayes' theorem that P(data | p) and P(p | data) are usually very different.
Definition: Given data, the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P(data | p). That is, the MLE is the value of p for which the data is most likely.
Answer: For the problem at hand, we saw above that the likelihood is

P(55 heads | p) = C(100, 55) p^55 (1 - p)^45

We'll use the notation p̂ for the MLE. We use calculus to find it, by taking the derivative of the likelihood function and setting it to 0:

d/dp P(data | p) = C(100, 55) [55 p^54 (1 - p)^45 - 45 p^55 (1 - p)^44] = 0

Solving this for p we get

55 p^54 (1 - p)^45 = 45 p^55 (1 - p)^44
55 (1 - p) = 45 p
55 = 100 p

The MLE is p̂ = 0.55.
Note:
1. The MLE for p turned out to be exactly the fraction of heads we saw in our data.
2. The MLE is computed from the data. That is, it is a statistic.
3. Officially you should check that the critical point is indeed a maximum. You can do this with the second derivative test.

Log likelihood: It is often easier to work with the natural log of the likelihood function. For short this is simply called the log likelihood. Since ln(x) is an increasing function, the maxima of the likelihood and log likelihood coincide.

Example 2. Redo the previous example using the log likelihood.
Answer: We had the likelihood P(55 heads | p) = C(100, 55) p^55 (1 - p)^45. Therefore the log likelihood is

ln(P(55 heads | p)) = ln(C(100, 55)) + 55 ln(p) + 45 ln(1 - p)

Maximizing the likelihood is the same as maximizing the log likelihood. We check that calculus gives us the same answer as before:

d/dp (log likelihood) = 55/p - 45/(1 - p) = 0
=> 55 (1 - p) = 45 p
=> p = 0.55

Maximum likelihood for continuous distributions:
For continuous distributions, we use the probability density function to define the likelihood. We show this in a few examples. In the next section we explain how this is analogous to what we did in the discrete case.
Find the maximum likelihood estimate for the pair (12,07). Answer: Let’s be precise and phrase this in terms of random variables and densities. uppercase X,,...,X, be iid. N(u,0°) random variables and let lowercase x, be! value X, takes. The density for each X, is 7 1 (=a =, te? Fai) =p=—* Sihce the X, are independent their joint pdf is the product of the individual pdf's: LY -yyGeat Hrsg) (ze) ee ai 6 "Fea vt For the fixed data x,,...,x,, the likelihood and log likelihood are a Mn-e%lme)=[ Gen} waste 3 “4 In( f(x,POPULAR PUBLICATIONS » x, =2 years and x, =3 years. Find the value of 4 that maximizes the nent tata paradox to deal with is that for a continuous distrib probability of a single value, say x,=2, is zero. We resolve this para remembering that a single measurement really means a range of values, eg, in| example we might check the light bulb once a day. So the data x, =2 years reall x, is somewhere in a range of | day around 2 years. If the range is small we call it dx,. The probability that X, is in the approximated by fy, (x,|2)dr,. This is illustrated in the figure below. The data is treated in exactly the same way. Density f,, (x4) Density f,, (14) Probability = f,(x,| A) & 4 The usual relationship between density and probability for small ranges Since the data is collected independently the joint probability is the product 0 individual probabilities. Stated carefully P(X, in range, X, in range| A)» f;, (x,|4)dr,-fy, (x, |Z) de, Finally, using the values x,=2 and x, =3 and the formula for an exponential p have “ P(X, in range, X, in range| 2) ~ Ae dy, -Ae dx, = 16 *dx,de, Now that we have a genuine probability we can look for the value of 2 that m it. Looking at the formula above we see that the factor dx,dx, will play no role in a the maximum. So for the MLE we drop it and simply call the density the likelihood: likelihood= f (x, ,x,|4)=A7e"* The value of A that maximizes this is found just like in the example above. It is Aa 3. What is Gaussian mixture model (GMM)? [MODEL Q ‘Answer: 4 The Gaussian mixture model is defined as a clustering algorithm that is used to d the underlying groups of data. It can be understood as a probabilistic model Gaussian distributions are assumed for each group and they have means and co Which define their parameters. GMM consists of two parts — mean vectors covariance matrices (E). A Gaussian distribution is defined as a continuous pré distribution that takes on a bell-shaped curve. Another name for Gaussian d the normal distribution, Here is a picture of Gaussian mixture models: PRN-32 ZCluster 1 4, What is expectation-maximization (EM) method in relation to GMM? [MODEL QUESTION] Answer: In Gaussian mixture models, expectation-maximization method is used to find the gaussian mixture model parameters. Expectation is termed as E and maximization is termed as M. Expectation is used to find the gaussian parameters which are used to represent each component of gaussian mixture models. Maximization is termed as M and itis involved in determining whether new data points can be added or not. Expectation-maximization method is a two-step iterative algorithm that alternates between performing an expectation step, in which we compute expectations for each data point using current parameter estimates and then maximize these to produce new gaussian, followed by a maximization step where we update our gaussian means based on the maximum likelihood estimate. This iterative process is performed until the gaussians parameters converge. 
Here is a picture representing the two-step iterative aspect of algorithm: Mestep Update hypothesis Estep Update variables 5. What are the key steps of using Gaussian mixture models? (MODEL QUESTION] Answer; Th i = ; . * following are three different steps to using gaussian mixture models: Determining a covariance matrix that defines how each Gaussian is related to one another. The more similar two Gaussians are, the closer their means will be and PRN-33POPULAR PUBLICATIONS way from each other in terms of similarity, vice versa if they are far away fr other in ter vr cture model can have a covariance matrix that is diagonal or sym «Determining the number of gaussians in each group defines how many cai which define how to optimally separate d * Selecting the hyperparameters fine gaussian mixture models as well as deciding on whether or not each g; covariance matrix is diagonal or symmetric. mixture models and other ty 6. What are the differences between Gaussian machine learning algorithms such as K-means, support vector race Answer: : 2 Gaussian mixture models are an unsupervised machine learning algorithm, while vector machines (SVM) is a supervised learning algorithm. This means that g mixture models can be used when there is no labeled data, however, the opposite labelled dataset for training the SVM models. The Gaussian mixture model is different from K-means because Gaussian mit models discover the underlying groups of data which is different than simply divide data into different parts. Another difference is that Gaussian mixture provide a probability for each category which can be used to make more decisions and predictions about the data at hand. In addition, Gaussian mixture have a higher chance of finding the right number of clusters in the data compared to| means. Gaussian mixture models have been found to outperform other machine algorithms such as artificial neural networks (ANN) when it comes to sep volatility from trend and noise. 7. What are the scenarios when gaussian mixture models can be used? [MODEL QUESTI Answer: The following are different scenarios when GMMs can be used: 4 In case of time series analysis, GMMs can be used tol discover how volatility i related to trend and noise which can help predict future stock prices. One could consist of a trend in the time series while another can have noise and vol from other factors such as seasonality or external events which affect the stock In ener pera out these clusters, GMMs can be used because they pro probability for each category i ir ividi i Ic eee oe eee, instead of simply dividing the data into two parts "7 * Another example is when there are different groups in it's har them as belonging to one group or another which nab ant ‘i learning algorithms such as K-means clustering algorithm to separate out # aa can be used in this case because they find gaussian mixture models aon Gach group and provide a probability for each cluster which i help PRN-34PATTERN RECOGNITION. «Another example where gaussian mixture model can be useful is when it is desired to discover underlying groups of categories such as types of aes
DIMENSION REDUCTION METHODS

...the information preserved by F1. Now suppose we want to convert this 2D data into 1D: we need to drop one feature. We know that feature F1 preserves much less information than F2, so we can drop feature F1 and use feature F2 as our final feature. Now, let's look at one more plot:

[Figure: the data plotted against F2 (Feature 2)]
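The "keep the feature that preserves more information" idea can be illustrated numerically. This sketch is not from the original text; the data and spreads are invented, and variance is used as the stand-in for information preserved.

```python
# Hedged sketch: when reducing 2-D data to 1-D by dropping a feature, keep the
# feature with the larger spread (variance), since it preserves more information.
# PCA generalizes this idea to arbitrary directions rather than the raw features.
import numpy as np

rng = np.random.default_rng(1)
F1 = rng.normal(0.0, 0.2, 500)        # hypothetical feature with little spread
F2 = rng.normal(0.0, 2.0, 500)        # hypothetical feature with large spread
X = np.column_stack([F1, F2])

variances = X.var(axis=0)
keep = int(np.argmax(variances))      # index of the feature that preserves more information
print("feature variances:", variances.round(3))
print("keep feature F%d and drop the other" % (keep + 1))
```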
You might also like
PATTERN RECOGNITION Final Notes
PDF
90% (10)
PATTERN RECOGNITION Final Notes
40 pages
Operating System Knowledge Gate
PDF
No ratings yet
Operating System Knowledge Gate
84 pages
Neural Networks & Deep Learning Makaut & & 7th SemNotes
PDF
No ratings yet
Neural Networks & Deep Learning Makaut & & 7th SemNotes
36 pages
NLP Tech Neo Mumbai University Revised Schemes C 2019
PDF
No ratings yet
NLP Tech Neo Mumbai University Revised Schemes C 2019
146 pages
Pattern Recognition Organizer
PDF
No ratings yet
Pattern Recognition Organizer
112 pages
AI Unit-5
PDF
No ratings yet
AI Unit-5
66 pages
ML Unit No.4 Naïve Bayes Classifiers PPT Notes
PDF
No ratings yet
ML Unit No.4 Naïve Bayes Classifiers PPT Notes
47 pages
DEEP LEARNING NOTES - Btech
PDF
No ratings yet
DEEP LEARNING NOTES - Btech
26 pages
Machine Learning Paradigms
PDF
No ratings yet
Machine Learning Paradigms
27 pages
Image Processing
PDF
No ratings yet
Image Processing
145 pages
Big Data Analytics Mumbai University
PDF
100% (1)
Big Data Analytics Mumbai University
95 pages
Hypothesis Space Search in Decision Trees
PDF
No ratings yet
Hypothesis Space Search in Decision Trees
15 pages
CNS Notes
PDF
No ratings yet
CNS Notes
244 pages
NNDL Technical Publication Notes
PDF
No ratings yet
NNDL Technical Publication Notes
81 pages
RTN & ATN in AI
PDF
100% (1)
RTN & ATN in AI
15 pages
BDA Techneo
PDF
100% (1)
BDA Techneo
91 pages
Chapters (5 - 8) TOC BOOK by Adesh K Pandey
PDF
No ratings yet
Chapters (5 - 8) TOC BOOK by Adesh K Pandey
95 pages
DL Notes 1 5 Deep Learning
PDF
100% (1)
DL Notes 1 5 Deep Learning
189 pages
DBMS Organizer 2023
PDF
No ratings yet
DBMS Organizer 2023
160 pages
FOL1
PDF
100% (1)
FOL1
74 pages
ML UNIT-5 Notes PDF
PDF
No ratings yet
ML UNIT-5 Notes PDF
41 pages
Web Security Unit 5
PDF
No ratings yet
Web Security Unit 5
22 pages
Jntuh r18 DM Gunshot ? Very Important ??? Questions and Answers
PDF
No ratings yet
Jntuh r18 DM Gunshot ? Very Important ??? Questions and Answers
95 pages
Deep Learning R18 Jntuh Lab Manual
PDF
0% (1)
Deep Learning R18 Jntuh Lab Manual
21 pages
Soft Computing Mcqs Questions CS25june2020
PDF
75% (4)
Soft Computing Mcqs Questions CS25june2020
32 pages
Features of Bayesian Learning Methods
PDF
No ratings yet
Features of Bayesian Learning Methods
39 pages
Unit 5
PDF
100% (1)
Unit 5
19 pages
FEATURES AND AUGMENTED GRAMMARS Overall
PDF
No ratings yet
FEATURES AND AUGMENTED GRAMMARS Overall
3 pages
DL Unit-3
PDF
No ratings yet
DL Unit-3
9 pages
Deep Learning Unit-II
PDF
No ratings yet
Deep Learning Unit-II
19 pages
Artificial Intelligence (AI) Part - 2, Lecture - 12: Unification in First-Order Logic
PDF
0% (1)
Artificial Intelligence (AI) Part - 2, Lecture - 12: Unification in First-Order Logic
18 pages
Web Security Unit 4
PDF
No ratings yet
Web Security Unit 4
14 pages
Unit 2 Machine Learning Notes
PDF
100% (1)
Unit 2 Machine Learning Notes
25 pages
Artificial Intelligence
PDF
50% (2)
Artificial Intelligence
16 pages
TOC Reference
PDF
100% (1)
TOC Reference
25 pages
Bda Viva Q&a
PDF
No ratings yet
Bda Viva Q&a
24 pages
Machine Learning UNIT 1 PDF
PDF
100% (1)
Machine Learning UNIT 1 PDF
33 pages
Designing A Learning System
PDF
No ratings yet
Designing A Learning System
12 pages
UNIT 1 TOC Sem5 RGPV
PDF
100% (2)
UNIT 1 TOC Sem5 RGPV
12 pages
AI Lab Manual 1
PDF
100% (4)
AI Lab Manual 1
12 pages
Machine Learning Question Paper Solved ML
PDF
No ratings yet
Machine Learning Question Paper Solved ML
55 pages
CLIQUE and PROCLUS
PDF
0% (1)
CLIQUE and PROCLUS
13 pages
AKTU Notes Machine Learning (ROE083) Unit-1 - UPTU Notes PDF
PDF
50% (2)
AKTU Notes Machine Learning (ROE083) Unit-1 - UPTU Notes PDF
66 pages
Analytical Learning
PDF
No ratings yet
Analytical Learning
42 pages
Assg 7
PDF
71% (7)
Assg 7
4 pages
SPCC Viva Question PDF
PDF
100% (4)
SPCC Viva Question PDF
36 pages
Unit I: Chapter 3:functional Units For Anns For Pattern Recognition Task
PDF
100% (2)
Unit I: Chapter 3:functional Units For Anns For Pattern Recognition Task
24 pages
1108 Solutions
PDF
100% (1)
1108 Solutions
13 pages
DL Unit - 5
PDF
No ratings yet
DL Unit - 5
14 pages
DL Unit - 4
PDF
No ratings yet
DL Unit - 4
14 pages
Deep Learning-KTU
PDF
No ratings yet
Deep Learning-KTU
6 pages
0/1 Knapsack Problem Using FIFO-BB
PDF
No ratings yet
0/1 Knapsack Problem Using FIFO-BB
5 pages
Equivalence of PDA and CFG Enotes
PDF
100% (2)
Equivalence of PDA and CFG Enotes
7 pages
Data Mining-Mining Time Series Data
PDF
0% (1)
Data Mining-Mining Time Series Data
7 pages
Characteristics of Soft Computing
PDF
88% (8)
Characteristics of Soft Computing
11 pages
STM Viva Que
PDF
100% (2)
STM Viva Que
54 pages
Soft Computing (SC) Topper Solution
PDF
100% (2)
Soft Computing (SC) Topper Solution
35 pages
Transaction With Replicated Data PDF
PDF
No ratings yet
Transaction With Replicated Data PDF
3 pages
Question Bank Module-1 Questions. Introduction and Concept Learning
PDF
No ratings yet
Question Bank Module-1 Questions. Introduction and Concept Learning
6 pages
Unit 5
PDF
No ratings yet
Unit 5
8 pages