Coding Activity 3.ipynb - Colaboratory
Student_ID Student_Name
196026P Fernando D. N. N.
196047F Madurasinghe M. A. V. N
196086X Sumaiya A. M. N.
196090E Weerahannedige D. C.
This coding activity involves studying the credit card balance data in the file Credit.csv . The data set contains information on 400 customers
on the following variables.
ID : Identification
Income : Income in $1,000's
Limit : Credit limit
Rating : Credit rating
Cards : Number of credit cards
Age : Age in years
Education : Number of years of education
Gender : A factor with levels Male and Female
Student : A factor with levels No and Yes indicating whether the individual was a student
Married : A factor with levels No and Yes indicating whether the individual was married
Ethnicity : A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance : Average credit card balance in $
The aim is to determine which factors influence the credit card balance of any given individual.
1. Examine the data type of each variable and, if needed, cast the variables in order to fit a linear regression model. Based on the scatterplot matrix and the matrix of correlations, briefly summarize the relationships between the quantitative variables.
# SAMPLE CODE:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import itertools
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
%matplotlib inline
credit = pd.read_csv("Credit.csv", index_col=[0])
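# Examine the data type of each variable before casting; .info() lists each
# column's dtype (the object-typed factors are cast to category below).
credit.info()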
credit.Gender = credit.Gender.astype("category")
credit.Student = credit.Student.astype("category")
credit.Married = credit.Married.astype("category")
credit.Ethnicity = credit.Ethnicity.astype("category")
credit.describe()
credit.describe(include=['category'])
        Gender Student Married Ethnicity
unique       2       2       2         3
dummies = pd.get_dummies(data=credit, drop_first=True)
numerical = credit.select_dtypes(include=["int64", "float64"])
pd.plotting.scatter_matrix(numerical, alpha=0.2, figsize=(10, 10))
plt.show()
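The task also asks for the matrix of correlations between the quantitative variables; a minimal sketch using pandas' built-in .corr() (rounded for readability):
print(numerical.corr().round(2))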
2. Split the data set into a training set and a test set.
# SAMPLE CODE:
y = credit.Balance
X_numerical = numerical.drop(["Balance"], axis=1)
# Create all features
X = pd.concat([X_numerical, dummies[["Gender_Female", "Student_Yes", "Married_Yes", "Ethnicity_Asian", "Ethnicity_Caucasian"]]], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
credit_train = pd.concat([X_train, y_train], axis=1)
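A quick sanity check of the split sizes (a minimal sketch; with 400 observations and test_size=0.2 we expect 320 training rows and 80 test rows):
print(X_train.shape, X_test.shape)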
3. Fit a linear model using least squares on the training set for predicting Balance using all other explanatory variables. Calculate the test
error obtained.
Use the .summary() function to print the results and comment on the output. For instance:
(i) Is there a significant relationship between all the predictors and the response?
(ii) Which predictors appear to have a statistically significant relationship to the response?
Test if the variable Education can be dropped from the full model. State your conclusion.
# SAMPLE CODE:
mod0 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Education + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod0.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.68e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Not all the predictor variables are significant at the 5% level of significance: Gender_Female, Education, Married_Yes, Ethnicity_Asian and Ethnicity_Caucasian are not significant, since their p-values are greater than 0.05.
Income, Limit, Rating, Cards, Age and Student_Yes are statistically significant variables, because their p-values are less than the 5% level of significance.
y_pred = mod0.predict(X_test)
mean_squared_error(y_test, y_pred)
12827.009329624554
mod1 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod1.summary())
print(sm.stats.anova_lm(mod1,mod0))
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
df_resid ssr df_diff ss_diff F Pr(>F)
0 309.0 2.811583e+06 0.0 NaN NaN NaN
1 308.0 2.810816e+06 1.0 766.588031 0.084 0.772143
Since the p-value of the test for dropping Education (0.772) is greater than the significance level (0.05), we do not reject the null hypothesis. Therefore the variable Education can be dropped from the full model.
4. Perform best subset selection on the training set and use BIC to select the best model. Calculate the test error obtained for the selected
best model.
# SAMPLE CODE:
X_train_const = sm.add_constant(X_train)
def processSubset(feature_set):
    # Fit model on feature_set and calculate RSS
    model = sm.OLS(y_train, X_train_const[list(feature_set)]).fit()
    RSS = ((model.predict(X_train_const[list(feature_set)]) - y_train) ** 2).sum()
    return {"model": model, "RSS": RSS}

def getBest(k):
    tic = time.time()
    results = []
    for combo in itertools.combinations(X_train_const.columns, k):
        results.append(processSubset(combo))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS
    best_model = models.loc[models["RSS"].argmin()]
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in", (toc - tic), "seconds.")
    # Return the best model, along with some other useful information about the model
    return best_model
models_best = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
for i in range(1, 12):
    models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
print(models_best.loc[7, "model"].summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
models_best.apply(lambda row: row["model"].bic, axis=1)
1 4465.255067
2 4367.426611
3 4150.651368
4 3864.274214
5 3860.900589
6 3857.960126
7 3857.390089
8 3861.443547
9 3865.936177
10 3871.616516
11 3877.384808
dtype: float64
Since the model with 7 variables has the lowest BIC, it is selected as the best model.
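The test error of the selected 7-variable model is not computed above; a sketch, assuming the fitted statsmodels result exposes the names of its regressors through model.exog_names:
# Evaluate the BIC-selected model on the test set
best7 = models_best.loc[7, "model"]
X_test_const = sm.add_constant(X_test)
selected = best7.model.exog_names  # columns the selected model was fitted on
mean_squared_error(y_test, best7.predict(X_test_const[selected]))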
5. Fit a ridge regression model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained.
# SAMPLE CODE:
list_num = X_train.columns
scaler = StandardScaler().fit(X_train[list_num])
X_train[list_num] = scaler.transform(X_train[list_num])
X_test[list_num] = scaler.transform(X_test[list_num])
alphas = np.linspace(0.0001,1000,100)
ridgecv = RidgeCV(alphas=alphas, fit_intercept=True, scoring="neg_mean_squared_error", cv=5)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.0001
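The selected α is the smallest value in the grid, so the ridge fit below applies almost no shrinkage. A log-spaced grid (a hypothetical alternative, not used for the results below) would let the cross-validation explore small penalties more finely:
alphas_fine = np.logspace(-4, 3, 100)  # hypothetical finer, log-spaced grid of candidate alphas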
ridge = Ridge(alpha=ridgecv.alpha_)
ridge.fit(X_train, y_train)
pd.Series(ridge.coef_, index=X_train.columns)
Income -262.323521
Limit 330.480411
Rating 262.561427
Cards 19.237719
Age -12.880879
Education -1.563786
Gender_Female -0.001867
Student_Yes 120.955032
Married_Yes -0.029201
Ethnicity_Asian 10.916195
Ethnicity_Caucasian 7.175075
dtype: float64
mean_squared_error(y_test, ridge.predict(X_test))
12827.039551508242
6. Fit a lasso model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained, along with the number of
non-zero coefficient estimates.
# SAMPLE CODE:
lassocv = LassoCV(fit_intercept=True, cv=5, random_state=0, max_iter=10000)
lassocv.fit(X_train, y_train)
lassocv.alpha_
1.0350511646444023
lasso = Lasso(max_iter=10000)
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
pd.Series(lasso.coef_, index=X_train.columns)
Income -257.683223
Limit 325.263111
Rating 263.185199
Cards 18.272406
Age -12.248352
Education -0.419213
Gender_Female 0.000000
Student_Yes 119.565342
Married_Yes -0.000000
Ethnicity_Asian 8.661128
Ethnicity_Caucasian 5.024818
dtype: float64
The lasso has shrunk the coefficients of Gender_Female and Married_Yes exactly to zero, so the selected model effectively excludes these two variables and keeps 9 non-zero coefficient estimates.
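The number of non-zero coefficient estimates asked for in the question can also be counted directly from the fitted lasso (a minimal sketch):
print((lasso.coef_ != 0).sum())  # number of non-zero coefficients (9 here, from the listing above)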
mean_squared_error(y_test, lasso.predict(X_test))
12899.346237007816
7. Compare the results obtained. Is there much difference among the test errors resulting from the four approaches above? Which method
would you recommend? Justify your conclusions.
lasso_err = mean_squared_error(y_test, lasso.predict(X_test))
ridge_err = mean_squared_error(y_test, ridge.predict(X_test))
ols_err = mean_squared_error(y_test, y_pred)  # y_pred comes from the full least squares model (mod0)
print('Lasso MSE', lasso_err)
print('Ridge MSE', ridge_err)
print('Full OLS MSE', ols_err)
The test errors of the different approaches are very close to one another: the full least squares model and the ridge model (whose selected λ is essentially zero) give almost identical test MSEs, while the lasso's test MSE is only slightly higher. Since the errors are so similar, we would recommend the model chosen by best subset selection: it achieves comparable test error with fewer predictors, which makes it simpler and easier to interpret.