Coding Activity 3.ipynb - Colaboratory
Student_ID Student_Name
196026P Fernando D. N. N.
196047F Madurasinghe M. A. V. N
196086X Sumaiya A. M. N.
196090E Weerahannedige D. C.
This coding activity involves studying the credit card balance data in the file Credit.csv . The data set contains information on 400 customers
on the following variables.
ID : Identification
Income : Income in $1,000's
Limit : Credit limit
Rating : Credit rating
Cards : Number of credit cards
Age : Age in years
Education : Number of years of education
Gender : A factor with levels Male and Female
Student : A factor with levels No and Yes indicating whether the individual was a student
Married : A factor with levels No and Yes indicating whether the individual was married
Ethnicity : A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance : Average credit card balance in $
The aim is to determine which factors influence the credit card balance of any given individual.
1. Examine the data type of each variable and, if needed, cast the variables in order to fit a linear regression model. Based on the scatterplot matrix and the matrix of correlations, briefly summarize the relationships between the quantitative variables.
# SAMPLE CODE:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import itertools
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
%matplotlib inline
credit = pd.read_csv("Credit.csv", index_col=[0])
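# Examine the data type of each variable before casting; .info() lists each
# column's dtype (the object-typed factors are cast to category below).
credit.info()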
credit.Gender = credit.Gender.astype("category")
credit.Student = credit.Student.astype("category")
credit.Married = credit.Married.astype("category")
credit.Ethnicity = credit.Ethnicity.astype("category")
credit.describe()
credit.describe(include=['category'])
        Gender Student Married Ethnicity
unique       2       2       2         3
dummies = pd.get_dummies(data=credit, drop_first=True)
numerical = credit.select_dtypes(include=["int64", "float64"])
pd.plotting.scatter_matrix(numerical, alpha=0.2, figsize=(10, 10))
plt.show()
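The task also asks for the matrix of correlations between the quantitative variables; a minimal sketch using pandas' built-in .corr() (rounded for readability):
print(numerical.corr().round(2))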
2. Split the data set into a training set and a test set.
# SAMPLE CODE:
y = credit.Balance
X_numerical = numerical.drop(["Balance"], axis=1)
# Create all features
X = pd.concat([X_numerical, dummies[["Gender_Female", "Student_Yes", "Married_Yes", "Ethnicity_Asian", "Ethnicity_Caucasian"]]], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
credit_train = pd.concat([X_train, y_train], axis=1)
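A quick sanity check of the split sizes (a minimal sketch; with 400 observations and test_size=0.2 we expect 320 training rows and 80 test rows):
print(X_train.shape, X_test.shape)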
3. Fit a linear model using least squares on the training set for predicting Balance using all other explanatory variables. Calculate the test
error obtained.
Use the .summary() function to print the results and comment on the output. For instance:
(i) Is there a significant relationship between all the predictors and the response?
(ii) Which predictors appear to have a statistically significant relationship to the response?
Test if the variable Education can be dropped from the full model. State your conclusion.
# SAMPLE CODE:
mod0 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Education + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod0.summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.68e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Not all the predictor variables are significant at the 5% level of significance: Gender_Female, Education, Married_Yes, Ethnicity_Asian and Ethnicity_Caucasian are not significant, since their p-values are greater than 0.05.
Income, Limit, Rating, Cards, Age and Student_Yes are statistically significant variables, because their p-values are less than the 5% level of significance.
y_pred = mod0.predict(X_test)
mean_squared_error(y_test, y_pred)
12827.009329624554
mod1 = smf.ols("Balance ~ Income + Limit + Rating + Cards + Age + Gender_Female + Student_Yes + Married_Yes + Ethnicity_Asian + Ethnicity_Caucasian", data=credit_train).fit()
print(mod1.summary())
print(sm.stats.anova_lm(mod1,mod0))
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
df_resid ssr df_diff ss_diff F Pr(>F)
0 309.0 2.811583e+06 0.0 NaN NaN NaN
1 308.0 2.810816e+06 1.0 766.588031 0.084 0.772143
Since the p-value of the test for dropping Education (0.772) is greater than the significance level (0.05), we do not reject the null hypothesis. Therefore the variable Education can be dropped from the full model.
4. Perform best subset selection on the training set and use BIC to select the best model. Calculate the test error obtained for the selected
best model.
# SAMPLE CODE:
X_train_const = sm.add_constant(X_train)
def processSubset(feature_set):
    # Fit model on feature_set and calculate RSS
    model = sm.OLS(y_train, X_train_const[list(feature_set)]).fit()
    RSS = ((model.predict(X_train_const[list(feature_set)]) - y_train) ** 2).sum()
    return {"model": model, "RSS": RSS}

def getBest(k):
    tic = time.time()
    results = []
    for combo in itertools.combinations(X_train_const.columns, k):
        results.append(processSubset(combo))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the lowest RSS
    best_model = models.loc[models["RSS"].argmin()]
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in", (toc - tic), "seconds.")
    # Return the best model, along with some other useful information about the model
    return best_model
models_best = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
for i in range(1, 12):
    models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
print(models_best.loc[7, "model"].summary())
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
models_best.apply(lambda row: row["model"].bic, axis=1)
1 4465.255067
2 4367.426611
3 4150.651368
4 3864.274214
5 3860.900589
6 3857.960126
7 3857.390089
8 3861.443547
9 3865.936177
10 3871.616516
11 3877.384808
dtype: float64
Since the model with 7 variables has the lowest BIC, it is selected as the best model.
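The test error of the selected 7-variable model is not computed above; a sketch, assuming the fitted statsmodels result exposes the names of its regressors through model.exog_names:
# Evaluate the BIC-selected model on the test set
best7 = models_best.loc[7, "model"]
X_test_const = sm.add_constant(X_test)
selected = best7.model.exog_names  # columns the selected model was fitted on
mean_squared_error(y_test, best7.predict(X_test_const[selected]))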
5. Fit a ridge regression model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained.
# SAMPLE CODE:
list_num = X_train.columns
scaler = StandardScaler().fit(X_train[list_num])
X_train[list_num] = scaler.transform(X_train[list_num])
X_test[list_num] = scaler.transform(X_test[list_num])
alphas = np.linspace(0.0001,1000,100)
ridgecv = RidgeCV(alphas=alphas, fit_intercept=True, scoring="neg_mean_squared_error", cv=5)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.0001
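The selected α is the smallest value in the grid, so the ridge fit below applies almost no shrinkage. A log-spaced grid (a hypothetical alternative, not used for the results below) would let the cross-validation explore small penalties more finely:
alphas_fine = np.logspace(-4, 3, 100)  # hypothetical finer, log-spaced grid of candidate alphas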
ridge = Ridge(alpha=ridgecv.alpha_)
ridge.fit(X_train, y_train)
pd.Series(ridge.coef_, index=X_train.columns)
Income -262.323521
Limit 330.480411
Rating 262.561427
Cards 19.237719
Age -12.880879
Education -1.563786
Gender_Female -0.001867
Student_Yes 120.955032
Married_Yes -0.029201
Ethnicity_Asian 10.916195
Ethnicity_Caucasian 7.175075
dtype: float64
mean_squared_error(y_test, ridge.predict(X_test))
12827.039551508242
6. Fit a lasso model on the training set, with λ chosen by 5-fold cross-validation. Report the test error obtained, along with the number of
non-zero coefficient estimates.
# SAMPLE CODE:
lassocv = LassoCV(fit_intercept=True, cv=5, random_state=0, max_iter=10000)
lassocv.fit(X_train, y_train)
lassocv.alpha_
1.0350511646444023
lasso = Lasso(max_iter=10000)
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(X_train, y_train)
pd.Series(lasso.coef_, index=X_train.columns)
Income -257.683223
Limit 325.263111
Rating 263.185199
Cards 18.272406
Age -12.248352
Education -0.419213
Gender_Female 0.000000
Student_Yes 119.565342
Married_Yes -0.000000
Ethnicity_Asian 8.661128
Ethnicity_Caucasian 5.024818
dtype: float64
The lasso has shrunk the coefficients of Gender_Female and Married_Yes exactly to zero, so the selected model effectively excludes these two variables and keeps 9 non-zero coefficient estimates.
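The number of non-zero coefficient estimates asked for in the question can also be counted directly from the fitted lasso (a minimal sketch):
print((lasso.coef_ != 0).sum())  # number of non-zero coefficients (9 here, from the listing above)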
mean_squared_error(y_test, lasso.predict(X_test))
12899.346237007816
7. Compare the results obtained. Is there much difference among the test errors resulting from the four approaches above? Which method
would you recommend? Justify your conclusions.
lasso_err = mean_squared_error(y_test, lasso.predict(X_test))
ridge_err = mean_squared_error(y_test, ridge.predict(X_test))
ols_err = mean_squared_error(y_test, y_pred)  # y_pred comes from the full least squares model (mod0)
print('Lasso MSE', lasso_err)
print('Ridge MSE', ridge_err)
print('Full OLS MSE', ols_err)
The test errors of the different approaches are very close to one another: the full least squares model and the ridge model (whose selected λ is essentially zero) give almost identical test MSEs, while the lasso's test MSE is only slightly higher. Since the errors are so similar, we would recommend the model chosen by best subset selection: it achieves comparable test error with fewer predictors, which makes it simpler and easier to interpret.