01 and 02: Introduction, Regression Analysis, and Gradient Descent

Introduction to the course

- We'll learn about
  - The state of the art
  - How to do the implementation
- Applications of machine learning include
  - Search
  - Photo tagging
  - Spam filters
- The AI dream is building machines as intelligent as humans
  - Many people believe the best way to do that is to mimic how humans learn
- What the course covers
  - State-of-the-art algorithms
  - But the algorithms and math alone are no good
  - You need to know how to get these to work on real problems

Why is ML so prevalent?

- Grew out of AI
- Goal: build intelligent machines
  - You can program a machine to do some simple thing
  - For the most part, hard-wiring AI is too difficult
  - The best way to do it is to have some way for machines to learn things themselves
    - A mechanism for learning - if a machine can learn from input, then it does the hard work for you

Examples

- Database mining
  - Machine learning has recently become so big partly because of the huge amount of data being generated
  - Large datasets come from the growth of automation and the web
  - Sources of data include
    - Web data (click-stream or click-through data)
      - Mined to understand users better
      - A huge segment of Silicon Valley
    - Medical records
      - Electronic records -> turn records into knowledge
    - Biological data
      - Gene sequences; ML algorithms give a better understanding of the human genome
    - Engineering info
      - Data from sensors, log reports, photos, etc.
- Applications that we cannot program by hand
  - Autonomous helicopter
  - Handwriting recognition
    - This is very inexpensive because when you write an envelope, algorithms can automatically route it through the postal system
  - Natural language processing (NLP)
    - AI pertaining to language
  - Computer vision
    - AI pertaining to vision
- Self-customizing programs
  - Netflix
  - Amazon
  - iTunes Genius
  - Take user info and learn from your behavior
- Understanding human learning and the brain
  - If we can build systems that mimic (or try to mimic) how the brain works, this may push forward our own understanding of the associated neurobiology

What is machine learning?

- Here we
  - Define what it is
  - Decide when to use it
- There is no single well-defined definition
  - Here are a couple of examples of how people have tried to define it
- Arthur Samuel (1959)
  - Machine learning: "Field of study that gives computers the ability to learn without being explicitly programmed"
  - Samuel wrote a checkers-playing program
    - Had the program play 10,000 games against itself
    - Worked out which board positions were good and bad depending on wins/losses
- Tom Mitchell (1998)
  - Well-posed learning problem: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
  - The checkers example
    - E = 10,000 games played against itself
    - T is playing checkers
    - P is whether you win or not
- Several types of learning algorithms
  - Supervised learning
    - Teach the computer how to do something, then let it use its new-found knowledge to do it
  - Unsupervised learning
    - Let the computer learn how to do something, and use this to determine structure and patterns in data
  - Reinforcement learning
  - Recommender systems
- This course
  - Looks at practical advice for applying learning algorithms
  - Learning a set of tools and how to apply them

Supervised learning - introduction

- Probably the most common problem type in machine learning
- Starting with an example
  - How do we predict housing prices?
    - Collect data regarding housing prices and how they relate to size in square feet
  - [Plot: housing price prediction - price ($) in 1000s vs. size in feet², showing scattered training points]
- Example problem: "Given this data, a friend has a house of 750 square feet - how much can they be expected to get?"
- What approaches can we use to solve this? (A short Octave sketch of this example appears at the end of this section, after the classification examples.)
  - Straight line through the data
    - Maybe $150,000
  - Second-order polynomial
    - Maybe $200,000
  - One thing we discuss later - how to choose between a straight or curved line
  - Each of these approaches represents a way of doing supervised learning
- What does this mean?
  - We gave the algorithm a data set where a "right answer" was provided
    - So we know the actual prices for the houses in the training set
    - The idea is that we can learn what makes the price a certain value from the training data
  - The algorithm should then produce more right answers for new data where we don't know the price already
    - i.e. predict the price
- We also call this a regression problem
  - Predict a continuous-valued output (price)
  - No real discrete delineation
- Another example
  - Can we classify a breast tumor as malignant or benign based on tumor size?
  - [Plot: malignant? (1 = yes, 0 = no) vs. tumor size]
- Looking at the data
  - Five examples of each class
  - Can you estimate the prognosis based on tumor size?
  - This is an example of a classification problem
    - Classify data into one of two discrete classes - no in between, either malignant or not
    - In classification problems, the output can take on a discrete number of possible values
      - e.g. maybe there are four values
        - 0 - benign
        - 1 - type 1
        - 2 - type 2
        - 3 - type 3
- In classification problems we can plot the data in a different way
  - [Plot: the same data on a single tumor-size axis, with the two classes marked differently]
- Here we use only one attribute (size)
  - In other problems we may have multiple attributes
  - We may also, for example, know age as well as tumor size
  - [Plot: age vs. tumor size, with malignant and benign cases marked differently]
- Based on that data, you can try to define separate classes by
  - Drawing a straight line between the two groups
  - Using a more complex function to define the two groups (which we'll discuss later)
  - Then, when you have an individual with a specific tumor size and a specific age, you can hopefully use that information to place them into one of your classes
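To make the housing-price example above concrete, here is a minimal Octave sketch. The data points are made up to roughly follow the scatter plot (they are not the course's dataset), and it simply fits the two candidate hypotheses mentioned above - a straight line and a second-order polynomial - then predicts a price for the friend's 750 square-foot house under each.

```octave
% Made-up training examples roughly following the scatter plot above
% (sizes in units of 1000 feet^2, prices in $1000s) - illustrative only
sizes  = [0.6; 1.0; 1.25; 1.4; 1.7; 2.0; 2.4];
prices = [135; 190;  248; 260; 335; 395; 470];

% Two candidate hypotheses from the notes: a straight line and a quadratic
p1 = polyfit(sizes, prices, 1);
p2 = polyfit(sizes, prices, 2);

% Predict the price for the friend's 750 square-foot (0.75) house under each model
new_size = 0.75;
fprintf('Straight line: about $%.0fk\n', polyval(p1, new_size));
fprintf('Quadratic:     about $%.0fk\n', polyval(p2, new_size));
```

The exact numbers will differ from the rough $150,000 / $200,000 estimates in the lecture because the data here is invented; the point is the workflow - fit a hypothesis to labeled ("right answer") training data, then query it for a new input.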
- You might have many features to consider, e.g.
  - Clump thickness
  - Uniformity of cell size
  - Uniformity of cell shape
- The most exciting algorithms can deal with an infinite number of features
  - How do you deal with an infinite number of features?
  - There is a neat mathematical trick in the support vector machine (which we discuss later)
    - Even if you have an infinitely long list of features, we can develop an algorithm to deal with that
- Summary
  - In supervised learning, the training data comes with the "right" answers
  - Regression problems
  - Classification problems

Unsupervised learning - introduction

- Second major problem type
- In unsupervised learning, we get unlabeled data
  - Just told - here is a data set, can you find structure in it?
- One way of doing this is to cluster the data into groups
  - This is a clustering algorithm

Clustering algorithm

- Examples of clustering algorithms
  - Google News
    - Groups news stories into cohesive groups
  - Used in many other problems as well
    - Genomics
      - Microarray data
        - Have a group of individuals
        - For each, measure the expression of a gene
        - Run the algorithm to cluster individuals into types of people
    - Organize computer clusters
      - Identify potential weak spots or distribute workload effectively
    - Social network analysis
    - Customer data
    - Astronomical data analysis
  - The algorithms give amazing results
- Basically
  - Can you automatically generate structure from the data?
  - Because we don't give it the answer, it's unsupervised learning

Cocktail party algorithm

- Cocktail party problem
  - Lots of overlapping voices - hard to hear what everyone is saying
    - Two people talking
    - Microphones at different distances from the speakers
- Each microphone records a slightly different version of the conversation depending on where it is
  - But the recordings overlap nonetheless
- Take the recordings of the conversation from each microphone
  - Give them to a cocktail party algorithm
  - The algorithm processes the audio recordings
    - Determines there are two audio sources
    - Separates out the two sources
- Is this a very complicated problem?
  - The algorithm can be done with one line of code!
    - [W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
  - Not easy to come up with
    - But programs can be short!
- Using Octave (or MATLAB) for examples
  - Often prototype algorithms in Octave/MATLAB to test, as it gives much faster, more agile development
  - Only once you've shown it works do you migrate it to C++
- Understanding this algorithm
  - svd - a linear algebra routine which is built into Octave
    - In C++ this would be very complicated!
  - This shows that using Octave/MATLAB to prototype is a really good way to work

Linear Regression

- Housing price data example used earlier
  - A supervised learning regression problem
- What do we start with?
  - Training set (this is your data set)
  - Notation (used throughout the course; a short Octave illustration of it appears just below)
    - m = number of training examples
    - x's = input variables / features
    - y's = output variable, the "target" variable
    - (x, y) - a single training example
    - (x^(i), y^(i)) - the i-th training example
      - i is an index into the training set

    Size in feet² (x) | Price ($) in 1000s (y)
    ------------------|-----------------------
    2104              | 460
    1416              | 232
    1534              | 315
    852               | 178
    ...               | ...

- With our training set defined - how do we use it?
  - Take the training set
  - Pass it into a learning algorithm
  - The algorithm outputs a function, denoted h (h = hypothesis)
    - This function takes an input (e.g. the size of a new house)
    - It tries to output the estimated value of y
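As a minimal illustration of the notation above, here is how the training set from the table could be held in Octave (the variable names are my own, not from the course):

```octave
% Training set from the table above: size in feet^2 (x) and price in $1000s (y)
x = [2104; 1416; 1534; 852];   % input variable / feature
y = [ 460;  232;  315;  178];  % output / "target" variable
m = length(y);                 % m = number of training examples (here 4)

% (x^(i), y^(i)) is the i-th training example, e.g. the 2nd one:
i = 2;
fprintf('x^(%d) = %d, y^(%d) = %d\n', i, x(i), i, y(i));
```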
- How do we represent the hypothesis h?
  - We write it as
    - hθ(x) = θ0 + θ1·x
    - h(x) is shorthand for hθ(x)
- What does this mean?
  - It means y is modeled as a linear function of x
  - The θi are the parameters
    - θ0 is the zero condition (the intercept)
    - θ1 is the gradient (the slope)
- This kind of function is linear regression with one variable
  - Also called univariate linear regression
- So in summary, a hypothesis
  - Takes in some variable x
  - Uses parameters determined by a learning system
  - Outputs a prediction based on that input

Linear regression - implementation (cost function)

- A cost function lets us figure out how to fit the best straight line to our data
- Choosing values for the θi (the parameters)
  - Different values give you different functions
  - If θ0 is 1.5 and θ1 is 0 then we get a straight line parallel with the x axis at 1.5
  - If θ1 is > 0 then we get a positive slope
- Based on our training set we want to generate parameters which make the straight line fit the data
  - Choose these parameters so hθ(x) is close to y for our training examples
    - Basically, use the x's in the training set with hθ(x) to give an output which is as close to the actual y value as possible
    - Think of hθ(x) as a "y imitator" - it tries to convert the x into y, and since we already have y we can evaluate how well hθ(x) does this
- To formalize this
  - We want to solve a minimization problem
  - Minimize the squared difference (hθ(x) - y)²
    - i.e. minimize the difference between hθ(x) and y for each/any/every example
  - Sum this over the training set:
    - minimize over θ0, θ1:  (1/2m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i))²
  - Minimize the squared difference between the predicted house price and the actual house price
    - The 1/m means we take the average
    - The extra 1/2 makes the math a bit easier, and doesn't change the values we determine at all (i.e. half the smallest value is still the smallest value!)
  - Minimizing over θ0 and θ1 means we get the values of θ0 and θ1 which, on average, give the minimal deviation of the prediction hθ(x) from y when we use those parameters in our hypothesis function
- More cleanly, this is a cost function:
  - J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i))²
- And we want to minimize this cost function
  - Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time
- So to recap
  - Hypothesis - is like your prediction machine; throw in an x value, get a putative y value
    - [Plot: hθ(x) vs. x for a simple data set, e.g. a line of slope 1 through the origin]
  - Cost - is a way to, using your training data, determine values for your θ parameters which make the hypothesis as accurate as possible
    - minimize over θ0, θ1:  J(θ0, θ1)   <- the cost function
    - This cost function is also called the squared error cost function
      - This cost function is a reasonable choice for most regression problems
      - Probably the most commonly used cost function
  - In case J(θ0, θ1) is a bit abstract, we go into what it does, why it works and how we use it in the coming sections
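As a small sketch of the squared error cost function just defined (compute_cost is my own helper name, not code from the course), J(θ0, θ1) could be evaluated in Octave like this:

```octave
% Squared error cost function J(theta0, theta1) for univariate linear regression
compute_cost = @(x, y, theta0, theta1) ...
    (1 / (2 * length(y))) * sum(((theta0 + theta1 * x) - y) .^ 2);

% Example: the training set from the table earlier (size in feet^2, price in $1000s)
x = [2104; 1416; 1534; 852];
y = [ 460;  232;  315;  178];

compute_cost(x, y, 0, 0)     % hypothesis predicts 0 everywhere -> large cost
compute_cost(x, y, 0, 0.2)   % slope roughly matching price/size -> much smaller cost
```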
Cost function - a deeper look

- Let's consider some intuition about the cost function and why we want to use it
  - The cost function determines the parameters
  - The values of the parameters determine how your hypothesis behaves; different values generate different functions
- Simplified hypothesis
  - Assume θ0 = 0, so
    - hθ(x) = θ1·x
    - J(θ1) = (1/2m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i))²
    - Goal: minimize J(θ1) over θ1
- The cost function and goal here are very similar to when we have both θ0 and θ1, but with a simpler parameter set
  - The simplified hypothesis makes visualizing the cost function J() a bit easier
- So the hypothesis passes through (0, 0)
- There are two key functions we want to understand
  - hθ(x)
    - The hypothesis is a function of x - a function of what the size of the house is
  - J(θ1)
    - Is a function of the parameter θ1
  - So for example, with the toy data points (1, 1), (2, 2) and (3, 3) used in the lecture plots
    - When θ1 = 1 the hypothesis matches the data exactly, so J(θ1) = 0
    - When θ1 = 0, J(θ1) ≈ 2.3
  - If we compute J(θ1) over a range of values and plot J(θ1) vs. θ1, we get a polynomial (it looks like a quadratic)
  - [Plot: J(θ1) vs. θ1 - a bowl-shaped curve with its minimum at θ1 = 1]
- The optimization objective for the learning algorithm is to find the value of θ1 which minimizes J(θ1)
  - Here θ1 = 1 is the best value for θ1

A deeper insight into the cost function - simplified cost function

- This section assumes you're familiar with contour plots or contour figures
  - We use the same cost function, hypothesis and goal as previously
  - It's OK to skip parts of this section if you don't understand contour plots
- Using our original, more complex hypothesis with two parameters
  - So J(θ0, θ1)
  - Example: say
    - θ0 = 50
    - θ1 = 0.06
- Previously we plotted our cost function by plotting
  - θ1 vs. J(θ1)
- Now we have two parameters
  - The plot becomes a bit more complicated
  - It generates a 3D surface plot, where the axes are
    - x = θ1
    - z = θ0
    - y = J(θ0, θ1)
  - [Plot: 3D bowl-shaped surface of J(θ0, θ1) over θ0 and θ1]
- We can see that the height (y) indicates the value of the cost function, so we want to find where it is at a minimum
- Instead of a surface plot we can use contour figures/plots
  - A set of ellipses in different colors
  - Each colour marks the same value of J(θ0, θ1), but obviously plotted at different locations because θ0 and θ1 vary
  - Imagine a bowl-shaped function coming out of the screen, so the middle of the concentric ellipses is the minimum
  - [Plot: contour plot of J(θ0, θ1) as a function of the parameters θ0, θ1]
- Each point on the contour plot represents a pair of parameter values for θ0 and θ1
  - Our example here put the values at
    - θ0 ≈ 800
    - θ1 ≈ -0.15
  - Not a good fit
    - i.e. these parameters give a value on our contour plot far from the center
  - If instead we have
    - θ0 ≈ 360
    - θ1 = 0
    - This gives a better hypothesis, but still not great - it's not in the center of the contour plot
  - Finally we find the minimum, which gives the best hypothesis
- Doing this by eye/hand is a pain
  - What we really want is an efficient algorithm for finding the minimum over θ0 and θ1 - that is what gradient descent (next section) gives us
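Before moving on to gradient descent, here is a small Octave sketch (my own, not from the course) of the brute-force version of this search for the simplified one-parameter case: sweep θ1 over a range, evaluate J(θ1) at each value, and look for the bottom of the bowl.

```octave
% Toy data from the cost-function intuition section
x = [1; 2; 3];
y = [1; 2; 3];
m = length(y);

% Evaluate J(theta1) for a range of theta1 values (theta0 fixed at 0)
theta1_vals = -0.5:0.05:2.5;
J_vals = arrayfun(@(t) (1 / (2 * m)) * sum((t * x - y) .^ 2), theta1_vals);

plot(theta1_vals, J_vals);
xlabel('theta_1'); ylabel('J(theta_1)');

% The curve is a parabola whose minimum J = 0 sits at theta1 = 1
[J_min, idx] = min(J_vals);
fprintf('Minimum J = %.3f at theta1 = %.2f\n', J_min, theta1_vals(idx));
```

This only works because there is a single parameter over a small range; with more parameters an exhaustive sweep quickly becomes impractical, which is the motivation for gradient descent below.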
Gradient descent algorithm

- Minimize the cost function J
- Gradient descent
  - Used all over machine learning for minimization
- Start by looking at a general J() function
- Problem
  - We have J(θ0, θ1)
  - We want to get min J(θ0, θ1)
- Gradient descent applies to more general functions too
  - J(θ0, θ1, θ2, ..., θn)
  - min J(θ0, θ1, θ2, ..., θn)

How does it work?

- Start with initial guesses
  - Start at (0, 0) (or any other value)
  - Keep changing θ0 and θ1 a little bit to try and reduce J(θ0, θ1)
- Each time you change the parameters, you move in the direction which reduces J(θ0, θ1) the most
- Repeat
- Do so until you converge to a local minimum
- Has an interesting property
  - Where you start can determine which minimum you end up in
  - [Plot: two gradient descent paths on a J(θ0, θ1) surface, starting from nearby points but ending in different local minima]
  - One initialization point led to one local minimum
  - The other led to a different one

A more formal definition

- Repeat the following until convergence:
  - θj := θj - α · (∂/∂θj) J(θ0, θ1)    (for j = 0 and j = 1)
- What does this all mean?
  - Update θj by subtracting α times the partial derivative of the cost function with respect to θj
- Notation
  - :=
    - Denotes assignment
    - NB: a = b is a truth assertion, not an assignment
  - α (alpha)
    - Is a number called the learning rate
    - Controls how big a step you take
      - If α is big, you take aggressive gradient descent steps
      - If α is small, you take tiny steps
- Derivative term
  - (∂/∂θj) J(θ0, θ1)
  - Not going to talk about it now; we derive it later
- There is a subtlety about how this gradient descent algorithm is implemented
  - Do this for θ0 and θ1
  - "For j = 0 and j = 1" means we simultaneously update both
  - How do we do this?
    - Compute the right-hand side for both θ0 and θ1
      - So we need temp values
    - Then update θ0 and θ1 at the same time
    - We show this below:
      - temp0 := θ0 - α · (∂/∂θ0) J(θ0, θ1)
      - temp1 := θ1 - α · (∂/∂θ1) J(θ0, θ1)
      - θ0 := temp0
      - θ1 := temp1
- If you implement the non-simultaneous update it's not gradient descent, and it will behave weirdly
  - But it might look sort of right - so it's important to remember this!

Understanding the algorithm

- To understand gradient descent, we'll return to a simpler function where we minimize one parameter, to help explain the algorithm in more detail
  - min over θ1 of J(θ1), where θ1 is a real number
- Two key terms in the algorithm
  - Alpha
  - The derivative term
- Notation nuances
  - Partial derivative vs. derivative
    - Use partial derivative when we have multiple variables but only differentiate with respect to one
    - Use derivative when we are differentiating with respect to all the variables
- Derivative term
  - (d/dθ1) J(θ1)
  - The derivative says: take the tangent at the current point and look at the slope of that line
  - If the slope is positive (we're to the right of the minimum), the update decreases θ1; if the slope is negative (we're to the left of the minimum), the update increases θ1
    - Alpha is always positive, so either way the update moves θ1 towards the minimum
- Alpha term (α)
  - What happens if alpha is too small or too large?
  - Too small
    - Take baby steps
    - Takes too long
  - Too large
    - Can overshoot the minimum and fail to converge
- When you get to a local minimum
  - The gradient of the tangent (the derivative) is 0
  - So the derivative term = 0
  - alpha · 0 = 0
  - So θ1 := θ1 - 0, i.e. θ1 remains the same
- As you approach the global minimum, the derivative term gets smaller, so your update gets smaller, even with alpha fixed
  - This means that as the algorithm runs you take smaller steps as you approach the minimum
  - So there is no need to change alpha over time

Linear regression with gradient descent

- Apply gradient descent to minimize the squared error cost function J(θ0, θ1)
- Now we have a partial derivative term to work out
- So here we're just expanding out the first expression
  - J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i))²
  - hθ(x^(i)) = θ0 + θ1·x^(i)
- So we need to determine the derivative for each parameter, i.e.
  - When j = 0
  - When j = 1
- Figure out what this partial derivative is for the θ0 and θ1 cases
  - When we derive this expression in terms of j = 0 and j = 1 we get the following:
    - (∂/∂θ0) J(θ0, θ1) = (1/m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i))
    - (∂/∂θ1) J(θ0, θ1) = (1/m) · Σ_{i=1..m} (hθ(x^(i)) - y^(i)) · x^(i)
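Putting the update rules above together, here is a minimal Octave sketch of batch gradient descent for univariate linear regression with the simultaneous update. The function name, learning rate and iteration count are my own illustrative choices, not values from the course.

```octave
% Batch gradient descent for h(x) = theta0 + theta1 * x
%   x, y      - column vectors of training data
%   alpha     - learning rate
%   num_iters - number of iterations to run
function [theta0, theta1] = gradient_descent(x, y, alpha, num_iters)
  m = length(y);
  theta0 = 0;  theta1 = 0;            % start at (0, 0)
  for iter = 1:num_iters
    h = theta0 + theta1 * x;          % current predictions for all m examples
    % Compute both updates first (temp0/temp1), then assign simultaneously
    temp0 = theta0 - alpha * (1 / m) * sum(h - y);
    temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x);
    theta0 = temp0;
    theta1 = temp1;
  end
end

% Example usage (e.g. at the Octave prompt, once the function is defined):
%   x = [1; 2; 3];  y = [1; 2; 3];
%   [t0, t1] = gradient_descent(x, y, 0.1, 1000);   % t0 -> ~0, t1 -> ~1
```

Note the temp0/temp1 step: both partial derivatives are computed from the same (θ0, θ1) before either parameter is overwritten - this is the simultaneous update the notes warn about.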
- To check these derivations you need to know multivariate calculus
- We can plug these derivatives back into the gradient descent algorithm
- How does it work?
  - In general there is a risk of converging to different local optima depending on initialization
  - But the linear regression cost function is always a convex function - it always has a single minimum
    - Bowl shaped
    - One global optimum
    - So gradient descent will always converge to the global optimum
- In action
  - Initialize the values to something like θ0 = 900 and θ1 = -0.1, a poor downward-sloping fit
  - [Plot: left, hθ(x) over the housing data for the fixed θ0, θ1 (a function of x); right, the contour plot of J(θ0, θ1) (a function of the parameters θ0, θ1), with the current hypothesis marked]
  - As gradient descent runs, the point on the contour plot moves towards the center and the hypothesis line fits the data better and better
  - We end up at the global minimum
- This is actually Batch Gradient Descent
  - The name refers to the fact that at each step you look at all the training data
    - Each step computes a sum over the m training examples
  - Sometimes non-batch versions exist, which look at small subsets of the data
    - We'll look at other forms of gradient descent (to use when m is too large) later in the course
- There also exists a numerical method for finding the minimum directly
  - The normal equations method
  - Gradient descent scales better to large data sets though
  - Gradient descent is used in lots of contexts across machine learning

What's next - important extensions

Two extensions to the algorithm:

- 1) Normal equation for a numeric solution
  - To solve the minimization problem min J(θ0, θ1), we can solve it exactly using a numeric method which avoids the iterative approach used by gradient descent
  - The normal equations method
  - Has advantages and disadvantages
    - Advantages
      - No longer an alpha term
      - Can be much faster for some problems
    - Disadvantage
      - Much more complicated
  - We discuss the normal equation in the linear regression with multiple features section
- 2) We can learn with a larger number of features
  - So we may have other features which contribute towards the price
    - e.g. with houses
      - Size
      - Age
      - Number of bedrooms
      - Number of floors
    - x1, x2, x3, x4
  - With multiple features it becomes hard to plot
    - Can't really plot in more than 3 dimensions
    - The notation becomes more complicated too
      - The best way to deal with this is the notation of linear algebra
      - It gives a notation and a set of operations you can do with matrices and vectors
  - For example, the training data can be organized into a matrix X and a vector y (see the sketch below):

        X = [ 2104  5  1  45 ]        y = [ 460 ]
            [ 1416  3  2  40 ]            [ 232 ]
            [ 1534  3  2  30 ]            [ 315 ]
            [  852  2  1  36 ]            [ 178 ]

  - We see here this matrix shows us
    - Size
    - Number of bedrooms
    - Number of floors
    - Age of home
  - All in one variable
    - A block of numbers; take all the data and organize it into one big block
  - Vector
    - Shown as y
    - Shows us the prices
- We need linear algebra for more complex linear regression models
- Linear algebra is good for making computationally efficient models (as seen later too)
  - It provides a good way to work with large data sets
  - Typically, vectorization of a problem is a common optimization technique
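As a small illustration of that last point, here is a hedged Octave sketch (the variable names and the parameter values are my own, purely for illustration) of holding the multi-feature training data as a matrix and computing every prediction with a single vectorized operation:

```octave
% Design matrix X: one row per training example, one column per feature
% (size in feet^2, number of bedrooms, number of floors, age of home in years)
X = [2104  5  1  45;
     1416  3  2  40;
     1534  3  2  30;
      852  2  1  36];
y = [460; 232; 315; 178];        % prices in $1000s

% Add a column of ones so the first parameter plays the role of theta0 (the intercept)
m = size(X, 1);
X = [ones(m, 1), X];

% With some parameter vector theta, all m predictions are computed in one line -
% no explicit loop over the training examples (illustrative theta, not learned)
theta = [0; 0.2; 5; 10; -1];
predictions = X * theta;
```

The same vectorized pattern (X * theta instead of a loop) is what makes these models efficient on large data sets.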
