Coding Activity 3.ipynb - Colaboratory


List of Group Members:

Student_ID Student_Name

196026P Fernando D. N. N.

196047F Madurasinghe M. A. V. N.

196086X Sumaiya A. M. N.

196090E Weerahannedige D. C.

196063B Hasali Perera

This coding activity involves studying the credit card balance data in the file Credit.csv. The data set contains information on 400 customers, covering the following variables.

ID : Identification
Income : Income in $1,000's
Limit : Credit limit
Rating : Credit rating
Cards : Number of credit cards
Age : Age in years
Education : Number of years of education
Gender : A factor with levels Male and Female
Student : A factor with levels No and Yes indicating whether the individual was a student
Married : A factor with levels No and Yes indicating whether the individual was married
Ethnicity : A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance : Average credit card balance in $

The aim is to determine which factors influence the credit card balance of any given individual.

1. Perform an exploratory analysis of the data, including the following as well:

Examine the data type of each variable and if needed, cast the variables in order to fit a linear regression model.

Produce a numerical summary of all quantitative variables.

Produce a scatterplot matrix of all quantitative variables.

Compute the matrix of correlations between (all quantitative) variables.

Based on the scatterplot matrix and the matrix of correlations, briefly summarize the relationships between the quantitative variables.

For a more detailed exploratory data analysis, refer to https://www.kaggle.com/code/suzanaiacob/predicting-credit-card-balance-using-regression/notebook#Finding-the-Best-Model

# SAMPLE CODE:

import pandas as pd 
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import itertools
import time
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV

%matplotlib inline

credit = pd.read_csv("Credit.csv", index_col=[0])
credit.Gender = credit.Gender.astype("category")
credit.Student = credit.Student.astype("category")
credit.Married = credit.Married.astype("category")
credit.Ethnicity = credit.Ethnicity.astype("category")

credit.describe()
credit.describe(include=['category'])

        Gender Student Married  Ethnicity
count      400     400     400        400
unique       2       2       2          3
top     Female      No     Yes  Caucasian
freq       207     360     245        199

dummies = pd.get_dummies(data=credit, drop_first=True)

numerical = credit.select_dtypes(include=["int64", "float64"])
pd.plotting.scatter_matrix(numerical, alpha=0.2, figsize=(10, 10))
plt.show()
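
Item 1 also asks for the matrix of correlations between the quantitative variables; the sample code above does not compute it. A minimal sketch, using the numerical data frame defined above, would be:

# Sketch: correlation matrix of the quantitative variables
print(numerical.corr().round(3))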

2. Split the data set into a training set and a test set.

# SAMPLE CODE:

y = credit.Balance
X_numerical = numerical.drop(["Balance"], axis=1)

# Create all features
X = pd.concat([X_numerical, dummies[["Gender_Female", "Student_Yes", "Married_Yes", "Ethnicity_Asian", "Ethnicity_Caucasian"]]], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
credit_train = pd.concat([X_train, y_train], axis=1)
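
As a quick sanity check (a sketch, not required by the activity), the split sizes can be printed; the 320 training rows match the No. Observations reported in the model summaries below.

# Sketch: confirm the 80/20 split of the 400 customers
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)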
3. Fit a linear model using least squares on the training set for predicting Balance using all other explanatory variables. Calculate the test
error obtained.

Do the following as well:

Use the .summary() function to print the results. Comment on the output. For instance:

(i) Is there a significant relationship between all the predictors and the response?

(ii) Which predictors appear to have a statistically significant relationship to the response?

Test if the variable Education can be dropped from the full model. State your conclusion.

# SAMPLE CODE:

mod0 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Education + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod0.summary())

OLS Regression Results


==============================================================================
Dep. Variable: Balance R-squared: 0.956
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 608.4
Date: Fri, 03 Feb 2023 Prob (F-statistic): 1.46e-201
Time: 05:59:44 Log-Likelihood: -1907.0
No. Observations: 320 AIC: 3838.
Df Residuals: 308 BIC: 3883.
Df Model: 11
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept -503.6308 37.411 -13.462 0.000 -577.245 -430.017
Income -7.6994 0.256 -30.133 0.000 -8.202 -7.197
Limit 0.1482 0.036 4.065 0.000 0.076 0.220
Rating 1.7440 0.542 3.219 0.001 0.678 2.810
Cards 13.8773 4.686 2.961 0.003 4.657 23.098
Age -0.7371 0.312 -2.361 0.019 -1.351 -0.123
Education -0.5018 1.731 -0.290 0.772 -3.909 2.905
Gender_Female -0.0038 10.800 -0.000 1.000 -21.255 21.247
Student_Yes 421.3362 18.936 22.250 0.000 384.076 458.597
Married_Yes -0.0601 11.236 -0.005 0.996 -22.169 22.048
Ethnicity_Asian 25.1061 14.957 1.679 0.094 -4.324 54.537
Ethnicity_Caucasian 14.3602 12.970 1.107 0.269 -11.160 39.881
==============================================================================
Omnibus: 27.155 Durbin-Watson: 2.143
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.624
Skew: 0.755 Prob(JB): 1.36e-07
Kurtosis: 3.305 Cond. No. 3.68e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.68e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Not all of the predictor variables are significant at the 5% level of significance: Gender_Female, Education, Married_Yes, Ethnicity_Asian and Ethnicity_Caucasian are not significant, since their p-values are greater than the 5% significance level.

Income, Limit, Rating, Cards, Age and Student_Yes are statistically significant because their p-values are less than the 5% level of significance.

y_pred = mod0.predict(X_test)
mean_squared_error(y_test, y_pred)

12827.009329624554

mod1 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod1.summary())
print(sm.stats.anova_lm(mod1,mod0))

OLS Regression Results


==============================================================================
Dep. Variable: Balance R-squared: 0.956
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 671.2
Date: Fri, 03 Feb 2023 Prob (F-statistic): 5.71e-203
Time: 06:07:50 Log-Likelihood: -1907.0
No. Observations: 320 AIC: 3836.
Df Residuals: 309 BIC: 3877.
Df Model: 10
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept -510.5521 28.755 -17.755 0.000 -567.133 -453.971
Income -7.6968 0.255 -30.186 0.000 -8.199 -7.195
Limit 0.1474 0.036 4.061 0.000 0.076 0.219
Rating 1.7554 0.540 3.254 0.001 0.694 2.817
Cards 13.8664 4.679 2.964 0.003 4.660 23.073
Age -0.7397 0.312 -2.374 0.018 -1.353 -0.127
Gender_Female 0.0604 10.782 0.006 0.996 -21.154 21.275
Student_Yes 420.8087 18.820 22.359 0.000 383.776 457.841
Married_Yes -0.2927 11.190 -0.026 0.979 -22.312 21.726
Ethnicity_Asian 25.1786 14.933 1.686 0.093 -4.204 54.561
Ethnicity_Caucasian 14.3779 12.950 1.110 0.268 -11.104 39.860
==============================================================================
Omnibus: 27.238 Durbin-Watson: 2.143
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.730
Skew: 0.755 Prob(JB): 1.29e-07
Kurtosis: 3.312 Cond. No. 2.89e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
   df_resid           ssr  df_diff     ss_diff      F    Pr(>F)
0     309.0  2.811583e+06      0.0         NaN    NaN       NaN
1     308.0  2.810816e+06      1.0  766.588031  0.084  0.772143

Since the slope coefficient of Education has a p-value (0.772) higher than the significance level (0.05), we do not reject the null hypothesis that its coefficient is zero. Therefore the variable Education can be dropped from the full model.

4. Perform best subset selection on the training set and use BIC to select the best model. Calculate the test error obtained for the selected
best model.

# SAMPLE CODE:

X_train_const = sm.add_constant(X_train)

def processSubset(feature_set):
  # Fit model on feature_set and calculate RSS
  model = sm.OLS(y_train, X_train_const[list(feature_set)]).fit()
  RSS = ((model.predict(X_train_const[list(feature_set)])-y_train)**2).sum()
  return {"model":model, "RSS":RSS}

def getBest(k):
  tic = time.time()

  results = []

  for combo in itertools.combinations(X_train_const.columns, k):
    results.append(processSubset(combo))
    
  # Wrap everything up in a nice dataframe
  models = pd.DataFrame(results)
    
  # Choose the model with the lowest RSS
  best_model = models.loc[models["RSS"].argmin()]
    
  toc = time.time()
  print("Processed", models.shape[0], "models on", k, "predictors in", (toc-tic), "seconds.")
    
  # Return the best model, along with some other useful information about the model
  return best_model

models_best = pd.DataFrame(columns=["RSS", "model"])

tic = time.time()
for i in range(1,12):
  models_best.loc[i] = getBest(i)

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")

Processed 12 models on 1 predictors in 0.03239750862121582 seconds.
Processed 66 models on 2 predictors in 0.17331290245056152 seconds.
Processed 220 models on 3 predictors in 0.6331427097320557 seconds.
Processed 495 models on 4 predictors in 1.627932071685791 seconds.
Processed 792 models on 5 predictors in 2.0984458923339844 seconds.
Processed 924 models on 6 predictors in 2.5077216625213623 seconds.
Processed 792 models on 7 predictors in 2.1479976177215576 seconds.
Processed 495 models on 8 predictors in 1.3949360847473145 seconds.
Processed 220 models on 9 predictors in 0.5890235900878906 seconds.
Processed 66 models on 10 predictors in 0.1715695858001709 seconds.
Processed 12 models on 11 predictors in 0.053664445877075195 seconds.
Total elapsed time: 11.522297382354736 seconds.

print(models_best.loc[7, "model"].summary())

OLS Regression Results


==============================================================================
Dep. Variable: Balance R-squared: 0.956
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 1122.
Date: Fri, 03 Feb 2023 Prob (F-statistic): 2.57e-208
Time: 06:31:10 Log-Likelihood: -1908.5
No. Observations: 320 AIC: 3831.
Df Residuals: 313 BIC: 3857.
Df Model: 6
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const -495.0901 26.071 -18.990 0.000 -546.387 -443.793
Income -7.6960 0.254 -30.327 0.000 -8.195 -7.197
Limit 0.1497 0.036 4.155 0.000 0.079 0.221
Rating 1.7223 0.535 3.218 0.001 0.669 2.775
Cards 14.0663 4.643 3.030 0.003 4.931 23.201
Age -0.7748 0.310 -2.502 0.013 -1.384 -0.166
Student_Yes 421.4346 18.650 22.596 0.000 384.738 458.131
==============================================================================
Omnibus: 26.571 Durbin-Watson: 2.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.843
Skew: 0.748 Prob(JB): 2.01e-07
Kurtosis: 3.279 Cond. No. 2.55e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

models_best.apply(lambda row: row["model"].bic, axis=1)

1 4465.255067
2 4367.426611
3 4150.651368
4 3864.274214
5 3860.900589
6 3857.960126
7 3857.390089
8 3861.443547
9 3865.936177
10 3871.616516
11 3877.384808
dtype: float64

Since the model with 7 selected columns (the intercept plus the six predictors Income, Limit, Rating, Cards, Age and Student_Yes) has the lowest BIC value, it is chosen as the best model.
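
Item 4 also asks for the test error of the selected model, which the sample code above does not compute. A minimal sketch, assuming the 7-column model stored in models_best and using X_test before it is standardized in part 5, would be:

# Sketch: test MSE of the BIC-selected subset model
best_mod = models_best.loc[7, "model"]
X_test_const = sm.add_constant(X_test)      # match the training design matrix
selected_cols = best_mod.model.exog_names   # columns used by the selected model
y_pred_best = best_mod.predict(X_test_const[selected_cols])
print(mean_squared_error(y_test, y_pred_best))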

5. Fit a ridge regression model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained.

# SAMPLE CODE:
list_num = X_train.columns

scaler = StandardScaler().fit(X_train[list_num]) 
X_train[list_num] = scaler.transform(X_train[list_num])
X_test[list_num] = scaler.transform(X_test[list_num])

alphas = np.linspace(0.0001,1000,100)

ridgecv = RidgeCV(alphas=alphas, fit_intercept=True, scoring="neg_mean_squared_error", cv=5)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_

0.0001
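
Note that the selected α is the smallest value in the supplied grid, so the ridge penalty is essentially zero here and the fit is close to ordinary least squares. If desired, the search could be repeated over a log-spaced grid to confirm that no larger penalty does better (an optional sketch, not part of the original activity):

# Optional sketch: 5-fold CV for ridge over a log-spaced alpha grid
alphas_log = np.logspace(-4, 4, 100)
ridgecv_log = RidgeCV(alphas=alphas_log, fit_intercept=True,
                      scoring="neg_mean_squared_error", cv=5)
ridgecv_log.fit(X_train, y_train)
print(ridgecv_log.alpha_)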

ridge = Ridge(alpha=ridgecv.alpha_)
ridge.fit(X_train, y_train)
pd.Series(ridge.coef_, index=X_train.columns)

Income -262.323521
Limit 330.480411
Rating 262.561427
Cards 19.237719
Age -12.880879
Education -1.563786
Gender_Female -0.001867
Student_Yes 120.955032
Married_Yes -0.029201
Ethnicity_Asian 10.916195
Ethnicity_Caucasian 7.175075
dtype: float64

mean_squared_error(y_test, ridge.predict(X_test))

12827.039551508242

6. Fit a lasso model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained, along with the number of
non-zero coefficient estimates.

# SAMPLE CODE:

lassocv = LassoCV(fit_intercept=True, cv=5, random_state=0, max_iter=10000)
lassocv.fit(X_train, y_train)
lassocv.alpha_

1.0350511646444023

lasso = Lasso(max_iter=10000)

lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
pd.Series(lasso.coef_, index=X_train.columns)

Income -257.683223
Limit 325.263111
Rating 263.185199
Cards 18.272406
Age -12.248352
Education -0.419213
Gender_Female 0.000000
Student_Yes 119.565342
Married_Yes -0.000000
Ethnicity_Asian 8.661128
Ethnicity_Caucasian 5.024818
dtype: float64

The lasso shrinks the coefficients of Gender_Female and Married_Yes exactly to zero, so the fitted model effectively drops these two variables and retains 9 non-zero coefficient estimates.
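
The number of non-zero coefficient estimates requested in item 6 can also be counted directly (a one-line sketch using numpy, already imported above):

# Sketch: count the non-zero lasso coefficient estimates
print(np.sum(lasso.coef_ != 0))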

mean_squared_error(y_test, lasso.predict(X_test))

12899.346237007816
7. Compare the results obtained. Is there much difference among the test errors resulting from the four approaches above? Which method
would you recommend? Justify your conclusions.

YOUR ANSWERS HERE!

lasso_err = mean_squared_error(y_test, lasso.predict(X_test))
ridge_err = mean_squared_error(y_test, ridge.predict(X_test))
ols_err = mean_squared_error(y_test, y_pred)   # y_pred comes from the full least-squares model in part 3
print('lasso MSE', lasso_err)
print('Ridge MSE', ridge_err)
print('full OLS model MSE', ols_err)

lasso MSE 12899.346237007816
Ridge MSE 12827.039551508242
full OLS model MSE 12827.009329624554

The three test errors are very close. The ridge fit behaves almost exactly like the full least-squares model because the cross-validated λ is essentially zero, and the lasso is only slightly worse. Given these near-identical test errors, the BIC-selected best subset model is the recommended choice: it achieves essentially the same fit (R-squared of 0.956 on the training set) while using far fewer predictors (Income, Limit, Rating, Cards, Age and Student_Yes).
