Chapter 12: Logistic Regression

12.1 Modeling Conditional Probabilities

So far, we have either looked at estimating the conditional expectations of continuous variables (as in regression), or at estimating distributions. There are many situations, however, where we are interested in input-output relationships, as in regression, but the output variable is discrete rather than continuous. In particular there are many situations where we have binary outcomes (it snows in Pittsburgh on a given day, or it doesn't; this squirrel carries plague, or it doesn't; this loan will be paid back, or it won't; this person will get heart disease in the next five years, or they won't). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How could we model and analyze such data?

We could try to come up with a rule which guesses the binary output from the input variables. This is called classification, and is an important topic in statistics and machine learning. However, simply guessing "yes" or "no" is pretty crude -- especially if there is no perfect rule. (Why should there be?) Something which takes noise into account, and doesn't just give a binary answer, will often be useful. In short, we want probabilities -- which means we need to fit a stochastic model.

What would be nice, in fact, would be to have the conditional distribution of the response Y, given the input variables, Pr(Y | X). This would tell us about how precise our predictions are. If our model says that there's a 51% chance of snow and it doesn't snow, that's better than if it had said there was a 99% chance of snow (though even a 99% chance is not a sure thing). We have seen how to estimate conditional probabilities non-parametrically, and could do this using the kernels for discrete variables from lecture 6. While there are a lot of merits to this approach, it does involve coming up with a model for the joint distribution of the outputs Y and the inputs X, which can be quite time-consuming.

Let's pick one of the classes and call it "1" and the other "0". (It doesn't matter which is which.) Then Y becomes an indicator variable, and you can convince yourself that Pr(Y = 1) = E[Y]. Similarly, Pr(Y = 1 | X = x) = E[Y | X = x]. (In a phrase, "conditional probability is the conditional expectation of the indicator".) This helps us because by this point we know all about estimating conditional expectations. The most straightforward thing for us to do at this point would be to pick out our favorite smoother and estimate the regression function for the indicator variable; this will be an estimate of the conditional probability function.

There are two reasons not to just plunge ahead with that idea. One is that probabilities must be between 0 and 1, but our smoothers will not necessarily respect that, even if all the observed y_i they get are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by more explicitly modeling the probability.

Assume that Pr(Y = 1 | X = x) = p(x;\theta), for some function p parameterized by \theta, and further assume that observations are independent of each other. Then the (conditional) likelihood function is

    \prod_{i=1}^{n} \Pr(Y = y_i | X = x_i) = \prod_{i=1}^{n} p(x_i;\theta)^{y_i} (1 - p(x_i;\theta))^{1-y_i}    (12.1)

Recall that in a sequence of Bernoulli trials y_1, ..., y_n, where there is a constant probability of success p, the likelihood is

    \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}    (12.2)

As you learned in intro stats, this likelihood is maximized when p = \hat{p} = n^{-1} \sum_i y_i.
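As a quick numerical check of that last claim, here is a small R sketch (mine, not from the text); the simulated data and the use of optimize() are illustrative assumptions, and the numerical maximizer should agree with the sample proportion.

    # Illustrative sketch: the Bernoulli likelihood is maximized at the sample proportion.
    set.seed(42)
    y <- rbinom(100, size = 1, prob = 0.3)   # simulated binary outcomes (assumed data)

    # Bernoulli log-likelihood as a function of a single success probability p
    bernoulli.loglik <- function(p, y) {
      sum(y * log(p) + (1 - y) * log(1 - p))
    }

    # Maximize numerically over p in (0, 1) and compare to the sample mean
    fit <- optimize(bernoulli.loglik, interval = c(1e-6, 1 - 1e-6), y = y, maximum = TRUE)
    c(numerical.mle = fit$maximum, sample.mean = mean(y))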
If each trial had its own success probability p_i, this likelihood becomes

    \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}    (12.3)

Without some constraints, estimating the "inhomogeneous Bernoulli" model by maximum likelihood doesn't work; we'd get \hat{p}_i = 1 when y_i = 1, \hat{p}_i = 0 when y_i = 0, and learn nothing. If on the other hand we assume that the p_i aren't just arbitrary numbers but are linked together, those constraints give non-trivial parameter estimates, and let us generalize. In the kind of model we are talking about, the constraint p_i = p(x_i;\theta) tells us that p_i must be the same whenever x_i is the same, and if p is a continuous function, then similar values of x_i must lead to similar values of p_i. Assuming p is known (up to parameters), the likelihood is a function of \theta, and we can estimate \theta by maximizing the likelihood. This lecture will be about this approach.

12.2 Logistic Regression

To sum up: we have a binary output variable Y, and we want to model the conditional probability Pr(Y = 1 | X = x) as a function of x; any unknown parameters in the function are to be estimated by maximum likelihood. By now, it will not surprise you to learn that statisticians have approached this problem by asking themselves "how can we use linear regression to solve this?"

1. The most obvious idea is to let p(x) be a linear function of x. Every increment of a component of x would add or subtract so much to the probability. The conceptual problem here is that p must be between 0 and 1, and linear functions are unbounded. Moreover, in many situations we empirically see "diminishing returns" -- changing p by the same amount requires a bigger change in x when p is already large (or small) than when p is close to 1/2. Linear models can't do this.

2. The next most obvious idea is to let log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount. The problem is that logarithms are unbounded in only one direction, and linear functions are not.

3. Finally, the easiest modification of log p which has an unbounded range is the logistic (or logit) transformation, log p/(1-p). We can make this a linear function of x without fear of nonsensical results. (Of course the results could still happen to be wrong, but they're not guaranteed to be wrong.)

This last alternative is logistic regression. Formally, the logistic regression model is that

    \log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta    (12.4)

Solving for p, this gives

    p(x;\beta_0,\beta) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}    (12.5)

Notice that the overall specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability. [Footnote: Unless you've taken statistical mechanics, in which case you recognize that this is the Boltzmann distribution for a system with two states, which differ in energy by \beta_0 + x \cdot \beta.]

To minimize the mis-classification rate, we should predict Y = 1 when p >= 0.5 and Y = 0 when p < 0.5. This means guessing 1 whenever \beta_0 + x \cdot \beta is non-negative, and 0 otherwise. So logistic regression gives us a linear classifier. The decision boundary separating the two predicted classes is the solution of \beta_0 + x \cdot \beta = 0, which is a point if x is one dimensional, a line if it is two dimensional, etc. One can show (exercise!) that the distance from the decision boundary is \beta_0/\|\beta\| + x \cdot \beta/\|\beta\|. Logistic regression not only says where the boundary between the classes is, but also says (via Eq. 12.5) that the class probabilities depend on the distance from the boundary, in a particular way, and that they go towards the extremes (0 and 1) more rapidly when \|\beta\| is larger. It's these statements about probabilities which make logistic regression more than just a classifier. It makes stronger, more detailed predictions, and can be fit in a different way; but those strong predictions could be wrong.
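To make the classification rule concrete, here is a minimal R sketch of Eq. 12.5 and the threshold at 1/2 (my own illustration, not from the chapter); the coefficient values and the test point are made up.

    # Illustrative sketch of Eq. 12.5 and the linear decision rule.
    # The coefficients below are arbitrary, made-up values.
    beta0 <- -0.5
    beta  <- c(1, 1)

    # Conditional probability Pr(Y = 1 | X = x) from the logistic model (Eq. 12.5)
    logistic.prob <- function(x, beta0, beta) {
      1 / (1 + exp(-(beta0 + sum(x * beta))))
    }

    # Predict the class by thresholding the probability at 1/2,
    # which is the same as checking the sign of beta0 + x . beta
    x <- c(0.2, -0.7)
    p <- logistic.prob(x, beta0, beta)
    yhat <- as.numeric(p >= 0.5)
    c(probability = p, predicted.class = yhat)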
Using logistic regression to predict class probabilities is a modeling choice, just like it's a modeling choice to predict quantitative variables with linear regression. In neither case is the appropriateness of the model guaranteed by the gods, nature, mathematical necessity, etc. We begin by positing the model, to get something to work with, and we end (if we know what we're doing) by checking whether it really does match the data, or whether it has systematic flaws.

[Figure 12.1: Effects of scaling logistic regression parameters. Values of x_1 and x_2 are the same in all plots (both coordinates ~ Unif(-1,1)), but labels were generated randomly from logistic regressions with \beta_0 = -0.1, \beta = (-0.2, 0.2) (top left); from \beta_0 = -0.5, \beta = (1, 1) (top right); from \beta_0 = -2.5, \beta = (5, 5) (bottom left); and from a perfect linear classifier with the same boundary (bottom right). The large black dot is the origin.]

Logistic regression is one of the most commonly used tools for applied statistics and discrete data analysis. There are basically four reasons for this.

1. Tradition.

2. In addition to the heuristic approach above, the quantity log p/(1-p) plays an important role in the analysis of contingency tables (the "log odds"). Classification is a bit like having a contingency table with two columns (classes) and infinitely many rows (values of x). With a finite contingency table, we can estimate the log-odds for each row empirically, by just taking counts in the table. With infinitely many rows, we need some sort of interpolation scheme; logistic regression is linear interpolation for the log-odds.

3. It's closely related to "exponential family" distributions, where the probability of some vector v is proportional to \exp(\beta_0 + \sum_j f_j(v)\beta_j). If one of the components of v is binary, and the functions f_j are all the identity function, then we get a logistic regression. Exponential families arise in many contexts in statistical theory (and in physics), so there are lots of problems which can be turned into logistic regression.

4. It often works surprisingly well as a classifier. But, many simple techniques often work surprisingly well as classifiers, and this doesn't really testify to logistic regression getting the probabilities right.

12.2.1 Likelihood Function for Logistic Regression

Because logistic regression predicts probabilities, rather than just classes, we can fit it using the likelihood. For each training data-point, we have a vector of features, x_i, and an observed class, y_i. The probability of that class was either p_i, if y_i = 1, or 1 - p_i, if y_i = 0. The likelihood is then

    L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}    (12.6)

(I could substitute in the actual equation for p, but things will be clearer in a moment if I don't.) The log-likelihood turns products into sums:

    \ell(\beta_0, \beta) = \sum_{i=1}^{n} y_i \log p(x_i) + (1 - y_i)\log(1 - p(x_i))    (12.7)
                         = \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i \log \frac{p(x_i)}{1 - p(x_i)}    (12.8)
                         = \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (12.9)
                         = \sum_{i=1}^{n} -\log(1 + e^{\beta_0 + x_i \cdot \beta}) + \sum_{i=1}^{n} y_i (\beta_0 + x_i \cdot \beta)    (12.10)

where in the next-to-last step we finally use equation 12.4.

Typically, to find the maximum likelihood estimates we'd differentiate the log-likelihood with respect to the parameters, set the derivatives equal to zero, and solve. To start that, take the derivative with respect to one component of \beta, say \beta_j:

    \frac{\partial \ell}{\partial \beta_j} = -\sum_{i=1}^{n} \frac{e^{\beta_0 + x_i \cdot \beta}}{1 + e^{\beta_0 + x_i \cdot \beta}} x_{ij} + \sum_{i=1}^{n} y_i x_{ij}    (12.11)
                                           = \sum_{i=1}^{n} (y_i - p(x_i;\beta_0,\beta)) x_{ij}    (12.12)

We are not going to be able to set this to zero and solve exactly. (That's a transcendental equation, and there is no closed-form solution.) We can however approximately solve it numerically.
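As a bridge to that numerical approach, here is a rough R sketch (not from the text) of the log-likelihood of Eq. 12.10 and the gradient of Eq. 12.12, in a form that could be handed to a general-purpose optimizer such as optim(); the design matrix X and the 0/1 response y are assumed inputs.

    # Illustrative sketch: log-likelihood (Eq. 12.10) and gradient (Eq. 12.12)
    # for logistic regression. X is an n x d matrix of inputs and y a 0/1 vector;
    # theta = c(beta0, beta) stacks the intercept and the slopes.

    logistic.loglik <- function(theta, X, y) {
      eta <- theta[1] + X %*% theta[-1]            # beta0 + x . beta for each row
      sum(y * eta - log(1 + exp(eta)))             # Eq. 12.10
    }

    logistic.grad <- function(theta, X, y) {
      eta <- theta[1] + X %*% theta[-1]
      p <- 1 / (1 + exp(-eta))                     # fitted probabilities (Eq. 12.5)
      resid <- as.vector(y - p)                    # y_i - p(x_i)
      c(sum(resid), t(X) %*% resid)                # Eq. 12.12, plus the beta0 component
    }

    # Example use with optim() (maximizing, so set control$fnscale = -1):
    # fit <- optim(rep(0, ncol(X) + 1), logistic.loglik, gr = logistic.grad,
    #              X = X, y = y, method = "BFGS", control = list(fnscale = -1))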
12.2.2 Logistic Regression with More Than Two Classes

If Y can take on more than two values, say k of them, we can still use logistic regression. Instead of having one set of parameters \beta_0, \beta, each class c in 0 : (k-1) will have its own offset \beta_0^{(c)} and vector \beta^{(c)}, and the predicted conditional probabilities will be

    \Pr(Y = c | X = x) = \frac{e^{\beta_0^{(c)} + x \cdot \beta^{(c)}}}{\sum_{c'} e^{\beta_0^{(c')} + x \cdot \beta^{(c')}}}    (12.13)

You can check that when there are only two classes (say, 0 and 1), equation 12.13 reduces to equation 12.5, with \beta_0 = \beta_0^{(1)} - \beta_0^{(0)} and \beta = \beta^{(1)} - \beta^{(0)}. In fact, no matter how many classes there are, we can always pick one of them, say c = 0, and fix its parameters at exactly zero, without any loss of generality. [Footnote: Since we can arbitrarily choose which class's parameters to "zero out" without affecting the predicted probabilities, strictly speaking the model in Eq. 12.13 is unidentified. That is, different parameter settings lead to exactly the same outcome, so we can't use the data to tell which one is right. The usual response is to deal with this by a convention: we decide to zero out the parameters of one class, and then estimate the contrasting parameters for the others.]

Calculation of the likelihood now proceeds as before (only with more book-keeping), and so does maximum likelihood estimation.
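To illustrate Eq. 12.13, here is a small R sketch (my own, not from the text) that converts per-class offsets and coefficient vectors into predicted class probabilities for one input x; the three-class parameter values are invented.

    # Illustrative sketch of Eq. 12.13: multiclass logistic (softmax-style) probabilities.
    # beta0 is a vector of per-class offsets; B is a matrix whose c-th row is beta^(c).
    multiclass.probs <- function(x, beta0, B) {
      scores <- beta0 + B %*% x                 # beta0^(c) + x . beta^(c) for each class c
      exp.scores <- exp(scores - max(scores))   # subtract the max for numerical stability
      as.vector(exp.scores / sum(exp.scores))   # normalize so the probabilities sum to 1
    }

    # Made-up three-class example in two dimensions; class 1 is "zeroed out" by convention.
    beta0 <- c(0, -0.5, 1)
    B <- rbind(c(0, 0), c(1, -1), c(0.5, 2))
    multiclass.probs(c(0.3, -0.2), beta0, B)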
12.3 Newton's Method for Numerical Optimization

There are a huge number of methods for numerical optimization; we can't cover all bases, and there is no magical method which will always work better than anything else. However, there are some methods which work very well on an awful lot of the problems which keep coming up, and it's worth spending a moment to sketch how they work. One of the most ancient yet important of them is Newton's method (alias "Newton-Raphson").

Let's start with the simplest case of minimizing a function of one scalar variable, say f(\beta). We want to find the location of the global minimum, \beta^*. We suppose that f is smooth, and that \beta^* is a regular interior minimum, meaning that the derivative at \beta^* is zero and the second derivative is positive. Near the minimum we could make a Taylor expansion:

    f(\beta) \approx f(\beta^*) + \frac{1}{2}(\beta - \beta^*)^2 \left.\frac{d^2 f}{d\beta^2}\right|_{\beta = \beta^*}    (12.14)

(We can see here that the second derivative has to be positive to ensure that f(\beta) > f(\beta^*).) In words, f(\beta) is close to quadratic near the minimum.

Newton's method uses this fact, and minimizes a quadratic approximation to the function we are really interested in. (In other words, Newton's method is to replace the problem we want to solve with a problem which we can solve.) Guess an initial point \beta^{(0)}. If this is close to the minimum, we can take a second-order Taylor expansion around \beta^{(0)} and it will still be accurate:

    f(\beta) \approx f(\beta^{(0)}) + (\beta - \beta^{(0)}) \left.\frac{df}{d\beta}\right|_{\beta = \beta^{(0)}} + \frac{1}{2}(\beta - \beta^{(0)})^2 \left.\frac{d^2 f}{d\beta^2}\right|_{\beta = \beta^{(0)}}    (12.15)

Now it's easy to minimize the right-hand side of equation 12.15. Let's abbreviate the derivatives, because they take up a lot of space: f'(\beta^{(0)}) for the first derivative at \beta^{(0)}, and f''(\beta^{(0)}) for the second. We just take the derivative with respect to \beta, and set it equal to zero at a point we'll call \beta^{(1)}:

    0 = f'(\beta^{(0)}) + f''(\beta^{(0)})(\beta^{(1)} - \beta^{(0)})    (12.16)

    \beta^{(1)} = \beta^{(0)} - \frac{f'(\beta^{(0)})}{f''(\beta^{(0)})}    (12.17)

The value \beta^{(1)} should be a better guess at the minimum \beta^* than the initial one \beta^{(0)} was. So if we use it to make a quadratic approximation to f, we'll get a better approximation, and so we can iterate this procedure, minimizing one approximation and then using that to get a new approximation:

    \beta^{(n+1)} = \beta^{(n)} - \frac{f'(\beta^{(n)})}{f''(\beta^{(n)})}    (12.18)

Notice that the true minimum \beta^* is a fixed point of equation 12.18: if we happen to land on it, we'll stay there (since f'(\beta^*) = 0). We won't show it, but it can be proved that if \beta^{(0)} is close enough to \beta^*, then \beta^{(n)} -> \beta^*, and that in general |\beta^{(n)} - \beta^*| = O(n^{-2}), a very rapid rate of convergence. (Doubling the number of iterations we use doesn't reduce the error by a factor of two, but by a factor of four.)

Let's put this together in an algorithm.

my.newton = function(f, f.prime, f.prime2, beta0, tolerance = 1e-3, max.iter = 50) {
  beta = beta0
  old.f = f(beta)
  iterations = 0
  made.changes = TRUE
  while (made.changes & (iterations < max.iter)) {
    iterations <- iterations + 1
    made.changes <- FALSE
    new.beta = beta - f.prime(beta) / f.prime2(beta)
    new.f = f(new.beta)
    relative.change = abs(new.f - old.f) / abs(old.f)
    made.changes = (relative.change > tolerance)
    beta = new.beta
    old.f = new.f
  }
  if (made.changes) {
    warning("Newton's method terminated before convergence")
  }
  return(list(minimum = beta, value = f(beta), deriv = f.prime(beta),
              deriv2 = f.prime2(beta), iterations = iterations))
}

The first three arguments here have to all be functions. The fourth argument is our initial guess for the minimum, \beta^{(0)}. The last arguments keep Newton's method from cycling forever: tolerance tells it to stop when the function stops changing very much (the relative difference between f(\beta^{(n)}) and f(\beta^{(n+1)}) is small), and max.iter tells it to never do more than a certain number of steps no matter what. The return value includes the estimated minimum, the value of the function there, and some diagnostics -- the derivative should be very small, and the second derivative should be positive.

You may have noticed some potential problems -- what if we land on a point where f'' is zero? What if f(\beta^{(n+1)}) > f(\beta^{(n)})? Etc. There are ways of handling these issues, and more, which are incorporated into real optimization algorithms from numerical analysis -- such as the optim function in R; I strongly recommend you use that, or something like that, rather than trying to roll your own optimization code. [Footnote: optim is actually a wrapper for several different optimization methods; method=BFGS selects a Newtonian method; BFGS is an acronym for the names of the algorithm's inventors.]

12.3.1 Newton's Method in More than One Dimension

Suppose that the objective f is a function of multiple arguments, f(\beta_1, \beta_2, ..., \beta_p). Let's bundle the parameters into a single vector, \beta. Then the Newton update is

    \beta^{(n+1)} = \beta^{(n)} - H^{-1}(\beta^{(n)}) \nabla f(\beta^{(n)})    (12.19)

where \nabla f is the gradient of f, its vector of partial derivatives [\partial f/\partial \beta_1, \partial f/\partial \beta_2, ..., \partial f/\partial \beta_p], and H is the Hessian of f, its matrix of second partial derivatives, H_{ij} = \partial^2 f / \partial \beta_i \partial \beta_j.

Calculating H and \nabla f isn't usually very time-consuming, but taking the inverse of H is, unless it happens to be a diagonal matrix. This leads to various quasi-Newton methods, which either approximate H by a diagonal matrix, or take a proper inverse of H only rarely (maybe just once), and then try to update an estimate of H^{-1}(\beta^{(n)}) as \beta^{(n)} changes.
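To make Eq. 12.19 concrete, here is a brief R sketch of the multidimensional Newton update (my own illustration, not from the text); the objective, its gradient grad.f, and its Hessian hess.f are made-up placeholders, and the step solves a linear system rather than forming H^{-1} explicitly.

    # Illustrative sketch of the multidimensional Newton update (Eq. 12.19).
    # grad.f and hess.f are assumed to return the gradient vector and Hessian
    # matrix of the objective at beta; they are placeholders here.
    newton.step <- function(beta, grad.f, hess.f) {
      # Solve H d = grad rather than explicitly inverting the Hessian
      beta - solve(hess.f(beta), grad.f(beta))
    }

    # Made-up example: minimize f(beta) = beta[1]^2 + beta[2]^2 + beta[1]*beta[2],
    # whose gradient and Hessian are known in closed form.
    grad.f <- function(beta) c(2 * beta[1] + beta[2], 2 * beta[2] + beta[1])
    hess.f <- function(beta) matrix(c(2, 1, 1, 2), nrow = 2)
    beta <- c(1, -3)
    for (i in 1:5) beta <- newton.step(beta, grad.f, hess.f)
    beta   # converges to the minimum at (0, 0)

Because this made-up objective is exactly quadratic, a single step already lands on the minimum; the loop just mirrors the iteration of Eq. 12.18.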
12.3.2 Iteratively Re-Weighted Least Squares

This discussion of Newton's method is quite general, and therefore abstract. In the particular case of logistic regression, we can make everything look much more "statistical".

Logistic regression, after all, is a linear model for a transformation of the probability. Let's call this transformation g:

    g(p) \equiv \log \frac{p}{1-p}    (12.20)

So the model is

    g(p) = \beta_0 + x \cdot \beta    (12.21)

and what we'd like to do is take g(y) and regress it linearly on x. Of course, the variance of Y, according to the model, is going to change depending on x -- it will be (g^{-1}(\beta_0 + x \cdot \beta))(1 - g^{-1}(\beta_0 + x \cdot \beta)) -- so we really ought to do a weighted linear regression, with weights inversely proportional to that variance. Since writing g^{-1}(\beta_0 + x \cdot \beta) is getting annoying, let's abbreviate it by \mu (for "mean"), and let's abbreviate that variance as V(\mu).

The problem is that y is either 0 or 1, so g(y) is either -\infty or +\infty. We will evade this by using a Taylor expansion:

    g(y) \approx g(\mu) + (y - \mu) g'(\mu) \equiv z    (12.22)

The right-hand side, z, will be our effective response variable. To regress it, we need its variance, which by propagation of error will be (g'(\mu))^2 V(\mu).
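As a preview of where this is heading, here is a minimal R sketch (mine, under the assumptions above) of a single re-weighted least squares update for the logistic model, using the effective response z of Eq. 12.22 and weights inversely proportional to its variance; X and y are assumed inputs, and in practice the update would be repeated until the coefficients stop changing.

    # Illustrative sketch of one iteratively re-weighted least squares update
    # for logistic regression. X is an n x d matrix, y a 0/1 vector, and
    # theta = c(beta0, beta) the current coefficient guess.
    irwls.update <- function(theta, X, y) {
      eta <- as.vector(theta[1] + X %*% theta[-1])  # linear predictor beta0 + x . beta
      mu  <- 1 / (1 + exp(-eta))                    # fitted probabilities g^{-1}(eta)
      z   <- eta + (y - mu) / (mu * (1 - mu))       # effective response z (Eq. 12.22); g'(mu) = 1/(mu(1-mu))
      w   <- mu * (1 - mu)                          # weights, inversely proportional to (g'(mu))^2 V(mu)
      fit <- lm(z ~ X, weights = w)                 # weighted linear regression of z on x
      unname(coef(fit))                             # updated c(beta0, beta)
    }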