Linear Regression
• Linear regression models are used to identify the relationship between a continuous dependent variable and one or more independent variables.
• Simple Linear Regression:
  • When there is only one independent variable and one dependent variable.
• Multiple Linear Regression:
  • When there is more than one independent variable.
• Y = a + b1(X1) + b2(X2) + … + bn(Xn)
• Where:
  • Y is the dependent variable
  • X1, …, Xn are the independent variables
  • a is the intercept
  • b1, …, bn are the slope coefficients

Logistic Regression (Logit)
• Like linear regression, logistic regression is used to estimate the relationship between a dependent variable and one or more independent variables.
• However, it is used to make a prediction about a categorical variable rather than a continuous one.
• A categorical variable can be true or false, yes or no, 1 or 0, etc.
• Logit estimates the probability of an event occurring, such as voted or didn’t vote, based on a given data set of independent variables.
• The Logit equation is written as:
• Log Odds of Event = β0 + β1X1 + β2X2 + ⋯ + βnXn
• Where:
  • β0 is the intercept.
  • β1, β2, …, βn are the coefficients for the predictors X1, X2, …, Xn.
• The term log odds is a way of expressing the likelihood of an event (e.g., a loan defaulting, a person being employed) in a form that can be modeled linearly.
• Odds: The ratio p/(1−p), where p is the probability of the event happening.
• Log Odds: The natural logarithm of the odds, ln(p/(1−p)), which allows probabilities to be modeled linearly.
• Logistic regression predicts log odds, which can be transformed back to probabilities for interpretation.

Types of Logit
• Binary logistic regression:
  • In this approach, the response or dependent variable is dichotomous, i.e. it has only two possible outcomes (e.g. 0 or 1).
  • Within logistic regression, this is the most commonly used approach, and more generally, it is one of the most common classifiers for binary classification.
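The relationship between log odds, odds, and probability described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the intercept, the coefficients, and the predictor values are all hypothetical and do not come from the slides.

```python
import math

def log_odds(intercept, coefficients, predictors):
    """Linear predictor: b0 + b1*X1 + ... + bn*Xn."""
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1.0 - p)

def probability(z):
    """Invert the log odds: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, coefficients, and predictor values (illustration only).
z = log_odds(-1.0, [0.5, 0.25], [2.0, 4.0])  # -1.0 + 1.0 + 1.0 = 1.0
p = probability(z)

print(f"log odds = {z:.2f}, probability = {p:.3f}, odds = {odds(p):.3f}")
```

Taking the natural logarithm of `odds(p)` returns the original log odds, which is the sense in which the model is linear on the log-odds scale even though the predicted probability is not linear in the predictors.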
• Example 1: Suppose that we are interested in the factors that influence whether a political candidate wins an election.
  • The outcome (response) variable is binary (0/1): win or lose.
  • The predictor variables of interest are:
    • the amount of time spent on the campaign,
    • the amount of money spent campaigning,
    • whether the candidate is an incumbent.
• Example 2: A researcher is interested in how variables such as GRE (Graduate Record Exam) scores, GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school.
  • The outcome variable, admit/don’t admit, is binary.
• Multinomial logistic regression:
  • In this type of logistic regression model, the dependent variable has three or more possible outcomes; however, these values have no specified order.
  • E.g.: A movie studio wants to predict which genre of film a moviegoer is likely to see so that it can market films more effectively. A multinomial logistic regression model can help the studio determine how strongly a person's age, gender, and dating status influence the type of film they prefer. The studio can then orient the advertising campaign for a specific movie toward the group of people likely to go see it.
  • E.g.: The marketing team of an organization can use the model to predict the likelihood of a customer purchasing a specific product type (Basic, Standard, or Premium) based on their age, income, and gender.
• Ordinal logistic regression:
  • In this type of logistic regression model, the response variable has three or more possible outcomes.
  • But in this case, these values have a defined order.
  • E.g.: grading scales from A to F or rating scales from 1 to 5.

Some Applications of Logit
• Fraud detection: Logistic regression models can help teams identify data anomalies that are predictive of fraud.
  Certain behaviors or characteristics may be more strongly associated with fraudulent activity, which is particularly useful for banks and other financial institutions seeking to protect their clients.
• Disease prediction: In medicine, Logit can be used to predict the likelihood of disease or illness for a given population. Healthcare organizations can set up preventative care for individuals who show a higher propensity for specific illnesses.
• Churn prediction: Specific behaviors may be indicative of churn in different functions of an organization. For example, human resources and management teams may want to know whether there are high performers within the company who are at risk of leaving the organization; this type of insight can prompt conversations to understand problem areas within the company, such as culture or compensation.

Case Study
• A leading financial institution is striving to improve its loan approval process by better understanding the risk factors associated with loan default. Defaulting on a loan not only causes financial losses but also affects the institution's operational efficiency and reputation. Using data on borrowers, the institution seeks to develop a predictive model to identify individuals who are more likely to default on their loans. The institution has collected data on past loans, including financial, demographic, and loan-specific attributes of borrowers, and their loan repayment outcomes (whether they defaulted or not). The goal is to analyze this dataset and build a model that predicts the probability of loan default based on borrower characteristics.
• You are required to create a logistic regression model that:
  1. Identifies the key predictors of loan default.
  2. Provides actionable insights into borrower profiles more likely to default.

Case Study (Data description)
• The dataset consists of the following features:
  1. Income: Annual income of the borrower.
  2. Credit Score: Credit score of the borrower, reflecting their creditworthiness.
  3. Employment Status: Employment status of the borrower (0 for unemployed, 1 for employed).
  4. Debt-to-Income Ratio: Ratio of the borrower’s debt payments to their income.
  5. Loan Amount: Amount of loan requested by the borrower.
  6. Age: Age of the borrower.
  7. Loan Default: The target variable (1 for default, 0 for no default).

Results
• A p-value below 0.05 indicates statistical significance at the 95% confidence level. This means that the variable likely affects the dependent variable.
• Income:
  • Interpretation: Measures the effect of a one-unit increase in income (e.g., 1 dollar) on the log odds of loan default. The negative coefficient (−0.0002) suggests that higher income reduces the likelihood of default.
  • p-value: 0.1320. This is not statistically significant (p>0.05), meaning the effect of income on loan default is not conclusive in this model.
• Credit Score:
  • Interpretation: Measures the effect of a one-point increase in credit score on the log odds of loan default. The negative coefficient (−0.0638) suggests higher credit scores reduce the likelihood of default.
  • p-value: 0.1020. This is above the 0.05 threshold, so the effect is not statistically significant in this model.
• Employment Status:
  • Interpretation: Employment status is encoded as 0 (unemployed) and 1 (employed). The negative coefficient (−0.2967) suggests that being employed might reduce the likelihood of default, though the estimate cannot be distinguished from zero.
  • p-value: 0.9380. This indicates no significant effect of employment status on loan default.
• Interpreting coefficients of a dummy variable:
  • A positive coefficient suggests that when the dummy variable is 1 (as opposed to 0), the log odds of the outcome (e.g., loan default), and hence its probability, are expected to be higher.
  • A negative coefficient suggests that when the dummy variable is 1 (as opposed to 0), the log odds of the outcome, and hence its probability, are expected to be lower.
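One common way to read a dummy coefficient in a logit model is to exponentiate it into an odds ratio. The sketch below does this for the Employment Status coefficient reported above. The baseline default probability of 0.20 for unemployed borrowers is a hypothetical value chosen only for illustration (the slides do not report an intercept), and the helper names are assumptions.

```python
import math

EMPLOYMENT_COEF = -0.2967  # Employment Status coefficient from the results above

def odds_ratio(coef):
    """Multiplicative change in the odds when the dummy goes from 0 to 1."""
    return math.exp(coef)

def shifted_probability(baseline_p, coef):
    """Probability after adding coef to the log odds implied by baseline_p."""
    z = math.log(baseline_p / (1.0 - baseline_p)) + coef
    return 1.0 / (1.0 + math.exp(-z))

ratio = odds_ratio(EMPLOYMENT_COEF)

# Hypothetical baseline: assume unemployed borrowers default with probability 0.20.
p_unemployed = 0.20
p_employed = shifted_probability(p_unemployed, EMPLOYMENT_COEF)

print(f"odds ratio (employed vs. unemployed): {ratio:.3f}")
print(f"default probability: {p_unemployed:.3f} -> {p_employed:.3f}")
```

An odds ratio below 1 corresponds to a negative coefficient and lower odds of default for employed borrowers. Given the reported p-value of 0.9380, this estimate should not be treated as evidence of a real effect.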
• Debt-to-Income Ratio:
  • Interpretation: Reflects the effect of a one-unit increase in debt-to-income ratio on the log odds of default. The positive coefficient (5.75555) suggests that higher ratios might increase default likelihood.
  • p-value: 0.6830. This is not statistically significant.
• Loan Amount:
  • Interpretation: Reflects the effect of a one-unit increase in loan amount (e.g., 1 dollar) on the log odds of default. The negative coefficient (−0.0002) suggests larger loan amounts might reduce default likelihood.
  • p-value: 0.2840. This indicates no significant effect of loan amount on default.
• Age:
  • Interpretation: Reflects the effect of a one-year increase in age on the log odds of default. The negative coefficient (−0.1643) suggests older individuals are slightly less likely to default.
  • p-value: 0.3540. This is not statistically significant.
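The results above can be collected into a single summary. The following sketch takes the coefficients and p-values exactly as reported in the slides, converts each coefficient to an odds ratio, and flags significance at the 0.05 threshold. The variable and function names are assumptions, and no model is fitted here because the underlying dataset is not included.

```python
import math

ALPHA = 0.05  # significance threshold used in the slides

# (variable, coefficient, p-value) as reported in the Results slides.
RESULTS = [
    ("Income", -0.0002, 0.1320),
    ("Credit Score", -0.0638, 0.1020),
    ("Employment Status", -0.2967, 0.9380),
    ("Debt-to-Income Ratio", 5.75555, 0.6830),
    ("Loan Amount", -0.0002, 0.2840),
    ("Age", -0.1643, 0.3540),
]

def summarize(results, alpha=ALPHA):
    """Attach an odds ratio and a significance flag to each reported row."""
    rows = []
    for name, coef, p_value in results:
        rows.append({
            "variable": name,
            "coefficient": coef,
            "odds_ratio": math.exp(coef),  # change in odds per one-unit increase
            "p_value": p_value,
            "significant": p_value < alpha,
        })
    return rows

summary = summarize(RESULTS)
for row in summary:
    flag = "yes" if row["significant"] else "no"
    print(f"{row['variable']:<22} OR={row['odds_ratio']:.4f}  "
          f"p={row['p_value']:.4f}  significant={flag}")
```

None of the six reported p-values falls below 0.05, so the direction of each coefficient is best read as suggestive rather than as an established effect.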