3rd Module EDBA Continuation 1

Exploratory Data Analysis

4.1 - Variable Selection for the Linear Model

So in linear regression, the more features X_j the better (since RSS keeps going down)? No! Carefully selected features can improve model accuracy, but adding too many can lead to overfitting:

* Overfitted models describe random error or noise instead of any underlying relationship.
* They generally have poor predictive performance on test data.
* For instance, we can use a 15-degree polynomial function to fit the data below so that the fitted curve passes nicely through every data point. However, a brand new dataset collected from the same population may not fit this particular curve well at all.

[Figure: a 15-degree polynomial curve passing exactly through a small set of data points.]

Sometimes when we do prediction we may not want to use all of the predictor variables (sometimes p is too big). For example, a DNA array expression example has a sample size (N) of 96 but a dimension (p) of over 4000! In such cases, we would select a subset of predictor variables to perform regression or classification, e.g. choosing the k predictors, from the total of p variables, that yield the minimum RSS(\beta).

Variable Selection for the Linear Regression Model

When the association of Y and X_j, conditioning on the other features, is of interest, we are interested in testing H_0: \beta_j = 0 versus H_a: \beta_j \neq 0.

* Under the normal error (residual) assumption, z_j = \hat{\beta}_j / (\hat{\sigma} \sqrt{v_j}), where \hat{\sigma} is the estimated residual standard deviation and v_j is the jth diagonal element of (X^T X)^{-1}.
* z_j is distributed as t_{N-p-1} (a Student's t-distribution with N - p - 1 degrees of freedom).

When the prediction is of interest, the main tools are:

* the F-test;
* the likelihood ratio test;
* information criteria such as AIC and BIC;
* cross-validation.

F-test

The residual sum of squares RSS(\beta) is defined as:

RSS(\beta) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2

Let RSS_1 correspond to the bigger model with p_1 + 1 parameters, and RSS_0 correspond to the nested smaller model with p_0 + 1 parameters. The F statistic measures the reduction of RSS per additional parameter in the bigger model:

F = \frac{(RSS_0 - RSS_1)/(p_1 - p_0)}{RSS_1/(N - p_1 - 1)}

Under the normal error assumption, the F statistic has an F_{(p_1 - p_0), (N - p_1 - 1)} distribution. For linear regression models, an individual t-test is equivalent to an F-test for dropping a single coefficient \beta_j from the model.

Likelihood Ratio Test (LRT)

Let L_1 be the maximum value of the likelihood of the bigger model, and L_0 the maximum value of the likelihood of the nested smaller model. The likelihood ratio \lambda = L_0 / L_1 is always between 0 and 1, and the less plausible the restrictive assumptions underlying the smaller model, the smaller \lambda will be. The likelihood ratio test statistic (deviance), -2 \log(\lambda), approximately follows a \chi^2_{p_1 - p_0} distribution, so we can test the fit of the 'null' model M_0 against the more complex model M_1. Note that, as N grows, the quantiles of the F_{(p_1 - p_0), (N - p_1 - 1)} distribution, multiplied by (p_1 - p_0), approach those of the \chi^2_{p_1 - p_0} distribution, so the F-test and the LRT agree in large samples.

Akaike Information Criterion (AIC)

Use of the LRT requires that our models are nested. Akaike (1971/74) proposed a more general measure of "model badness":

AIC = -2 \log L(\hat{\theta}) + 2p

where p is the number of parameters. Faced with a collection of putative models, the 'best' (or 'least bad') one can be chosen by seeing which has the lowest AIC. The scale is statistical, not scientific, but the trade-off is clear: we must improve the log-likelihood by one unit for every extra parameter. AIC is asymptotically equivalent to leave-one-out cross-validation.
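To make these comparisons concrete, here is a minimal R sketch (R being the language of this lesson's companion scripts in Section 4.2) that compares two nested linear models using the F-test, AIC, and BIC. The data frame dat and the predictors x1, x2, x3 are simulated purely for illustration and are not part of the original lesson:

    # Simulate a small dataset: only x1 truly predicts y
    set.seed(1)
    n   <- 100
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    dat$y <- 1 + 2 * dat$x1 + rnorm(n)

    # Nested models: fit0 has p0 = 1 predictor, fit1 has p1 = 3
    fit0 <- lm(y ~ x1, data = dat)
    fit1 <- lm(y ~ x1 + x2 + x3, data = dat)

    anova(fit0, fit1)   # F-test: reduction in RSS per added parameter
    AIC(fit0, fit1)     # penalty of 2 per parameter; lower is better
    BIC(fit0, fit1)     # penalty of log(n) per parameter

Here anova() reports the F statistic defined above; because x2 and x3 carry no signal, both criteria should prefer fit0, with BIC's log(n) penalty making that preference stronger.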
Bayes Information Criterion (BIC)

AIC tends to overfit models (see Good and Hardin, Chapter 12, for ways to check this). Another information criterion, which penalizes complex models more severely, is:

BIC = -2 \log L(\hat{\theta}) + p \log(n)

also known as the Schwarz criterion due to Schwarz (1978), where an approximate Bayesian derivation is given. As before, the lowest BIC is taken to identify the 'best' model. BIC tends to favor simpler models than those chosen by AIC.

Stepwise Selection

AIC and BIC also allow stepwise model selection. An exhaustive search over all subsets may not be feasible if p is very large, so there are two main alternatives:

* Forward stepwise selection: First, we approximate the response variable y with a constant (i.e., an intercept-only regression model). Then we gradually add one variable at a time (or add main effects first, then interactions). At each step we choose, from the remaining variables, the one that yields the best predictive accuracy when added to the pool of already selected variables; this accuracy can be measured by the F-statistic, the LRT, AIC, BIC, etc. For example, if we have 10 predictor variables, we first approximate y with a constant, then perform 10 regressions, each using a different single predictor; the variable that yields the minimum residual sum of squares is chosen and put in the pool of selected variables. We then choose the next variable from the 9 that remain, and so on.
* Backward stepwise selection: This is similar to forward stepwise selection, except that we start with the full model using all the predictors and gradually delete one variable at a time.

There are various methods for choosing the number of predictors, for instance the F-ratio test: we stop forward or backward stepwise selection when no candidate variable produces an F-ratio statistic greater than some threshold. A sketch of both procedures in R appears below.
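Both procedures can be run with base R's step() function, which searches by AIC by default; setting its penalty argument k = log(n) makes it use the BIC penalty instead. A minimal illustration, reusing the simulated dat from the earlier sketch:

    # Endpoints of the search: intercept-only and full models
    fit_null <- lm(y ~ 1, data = dat)
    fit_full <- lm(y ~ x1 + x2 + x3, data = dat)

    # Forward stepwise: start from the constant, add one variable at a time
    fwd <- step(fit_null,
                scope = list(lower = fit_null, upper = fit_full),
                direction = "forward")

    # Backward stepwise: start from the full model, delete one at a time;
    # k = log(n) swaps AIC's 2-per-parameter penalty for BIC's log(n)
    bwd <- step(fit_full, direction = "backward", k = log(nrow(dat)))

step() prints the criterion at each step, so you can watch variables enter the model (forward) or leave it (backward).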
