Multiple Regressor - Jupyter Notebook
import time
import warnings
# data handling and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# ols library
import statsmodels.api as sm
import statsmodels.formula.api as smf
# pre-processing methods
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
# cross-validation methods
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# feature-selection methods
from sklearn.feature_selection import SelectFromModel
# bootstrap sampling
from sklearn.utils import resample
In [3]: # suppress display of warnings
warnings.filterwarnings('ignore')
2 Data Collection
In [4]: # Reading Concrete data
concrete_df = pd.read_csv("concrete.csv")
Out[5]: (first rows of the dataframe; columns: cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength)
Out[6]: (1030, 9)
In [7]: print("Number of rows = {0} and Number of Columns = {1} in Data frame".format(concrete_df.shape[0], concrete_df.shape[1]))
From the output of info(), we see that except for the 'age' column (int64), all columns have dtype float64. The data has 8 quantitative input variables and 1 quantitative output variable, strength.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cement 1030 non-null float64
1 slag 1030 non-null float64
2 ash 1030 non-null float64
3 water 1030 non-null float64
4 superplastic 1030 non-null float64
5 coarseagg 1030 non-null float64
6 fineagg 1030 non-null float64
7 age 1030 non-null int64
8 strength 1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
3 Data Cleaning
In [11]: # Check duplicates in a data frame
concrete_df.duplicated().sum()
Out[11]: 25
In [12]: duplicates = concrete_df.duplicated() # Boolean mask of duplicated rows
concrete_df[duplicates]
Out[12]: (the duplicated rows; columns: cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength)
Out[14]: (1005, 9)
# Calculate IQR
Q1 = concrete_df_outliers.quantile(0.25)
Q3 = concrete_df_outliers.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
cement 158.3000000
slag 142.5000000
ash 118.3000000
water 26.3000000
superplastic 10.0000000
coarseagg 99.0000000
fineagg 97.9000000
age 49.0000000
strength 21.3500000
dtype: float64
In [17]: concrete_df.columns
In [18]: # use the IQR score to filter out the outliers by keeping only valid values
# Replace every outlier on the upper side by the upper whisker - for the 'water', 'superplastic',
# 'fineagg', 'age' and 'strength' columns
for i, j in zip(np.where(concrete_df_outliers > Q3 + 1.5 * IQR)[0],
                np.where(concrete_df_outliers > Q3 + 1.5 * IQR)[1]):
    concrete_df_outliers.iloc[i, j] = (Q3 + 1.5 * IQR).iloc[j]
# Replace every outlier on the lower side by the lower whisker - for the 'water' column
for i, j in zip(np.where(concrete_df_outliers < Q1 - 1.5 * IQR)[0],
                np.where(concrete_df_outliers < Q1 - 1.5 * IQR)[1]):
    concrete_df_outliers.iloc[i, j] = (Q1 - 1.5 * IQR).iloc[j]
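For reference, the same whisker-capping can be written more compactly with pandas DataFrame.clip; this is only a sketch, assuming the Q1, Q3 and IQR Series computed above, and unlike the loops it would clip both tails of every column rather than only the lower tail of 'water':
# cap each column of the outlier subset at its IQR whiskers (both tails)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
concrete_df_outliers = concrete_df_outliers.clip(lower = lower_whisker, upper = upper_whisker, axis = 1)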
In [19]: # Remove the outlier-treated columns - 'water', 'superplastic', 'fineagg', 'age' and 'strength' - from concrete_df
concrete_df.drop(columns = ['water', 'superplastic', 'fineagg', 'age', 'strength'], inplace = True)
In [20]: # Add the outlier-treated 'water', 'superplastic', 'fineagg', 'age' and 'strength' columns (with no outliers) back to
# concrete_df
concrete_df = pd.concat([concrete_df, concrete_df_outliers], axis = 1)
Out[22]: cement 0
slag 0
ash 0
water 0
superplastic 0
coarseagg 0
fineagg 0
age 0
strength 0
dtype: int64
In [23]: # Check the presence of missing values or special characters
concrete_df_missval = concrete_df.copy() # Make a copy of the dataframe
has_special_chars = False
for x in concrete_df_missval.columns:
    concrete_df_missval[x] = concrete_df_missval[x].astype(str).str.replace('.', '', regex=False) # drop decimal points
    result = concrete_df_missval[x].astype(str).str.isalnum() # Check whether all remaining characters are alphanumeric
    if False in result.unique():
        has_special_chars = True
        print('For column "{}" unique values are {}'.format(x, concrete_df_missval[x].unique()))
        print('\n')
if not has_special_chars:
    print('No special characters or missing-value markers in this dataset')
Out[25]: (summary statistics per column: count, mean, std, min, 25%, 50%, 75%, max)
Cement column - right-skewed distribution: cement is skewed towards higher values.
Slag column - right-skewed distribution: slag is skewed towards higher values and there are two gaussians.
Ash column - right-skewed distribution: ash is skewed towards higher values and there are two gaussians.
Water column - moderately left-skewed distribution.
Superplastic column - right-skewed distribution: superplastic is skewed towards higher values and there are two gaussians.
Coarseagg column - moderately left-skewed distribution.
Fineagg column - moderately left-skewed distribution.
Age column - right-skewed distribution: age is skewed towards higher values and there are three gaussians.
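These observations can be backed numerically; a small sketch, assuming the cleaned dataframe is still named concrete_df (positive skew means a tail towards higher values, negative skew a tail towards lower values):
# numeric skewness per column, largest first
print(concrete_df.skew(numeric_only=True).sort_values(ascending=False))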
In [27]: plt.figure(figsize=(13,6))
sns.distplot(concrete_df["strength"], color="b", rug=True)
plt.axvline(concrete_df["strength"].mean(), linestyle="dashed", color="k", label="Mean")
plt.legend(loc="best", prop={"size":14})
plt.title("Concrete compressive strength distribution")
plt.show()
Out[28]: (summary statistics per column: count, mean, std, min, 25%, 50%, 75%, max)
The above output prints the important summary statistics of all the numeric variables like the
mean, median (50%), minimum, and maximum values, along with the standard deviation.
Diagonal Analysis: Looking at the KDE plots on the diagonal, there are at least two gaussians (two peaks) in Slag, Ash, Superplastic and Age. Even though this is not an unsupervised learning problem, the dataset contains at least two clusters, and there may be more.
The diagonal analysis gives the same insights as the univariate analysis.
Off-Diagonal Analysis: relationships between independent attributes (scatter plots)
Cement vs other independent attributes: this attribute has no significant relationship with the other independent features; the points spread like a cloud, and the r value would be close to 0.
Slag vs other independent attributes: no significant relationship with the other independent features; the points spread like a cloud, and the r value would be close to 0.
Ash vs other independent attributes: no significant relationship with the other independent features; the points spread like a cloud, and the r value would be close to 0.
Water vs other independent attributes: this attribute has a negative curvilinear relationship with Fineagg, Coarseagg and Superplastic - as the water content increases, Fineagg, Coarseagg and Superplastic decrease. It has no significant relationship with the other independent attributes.
Superplastic vs other independent attributes: this attribute has a negative linear relationship with Water only. It has no significant relationship with the other independent attributes.
Coarseagg vs other independent attributes: no significant relationship with the other independent features; the points spread like a cloud, and the r value would be close to 0.
Fineagg vs other independent attributes: it has a negative linear relationship with Water and no significant relationship with any other attribute; the points spread like a cloud, and the r value would be close to 0.
The reason for this analysis is that if we find dimensions that are very strongly correlated, i.e. an r value close to 1 or -1, those dimensions give the same information to the algorithm and are redundant. In such cases we may want to keep one and drop the other; which one to keep and which to drop depends on domain expertise, for example which dimension is more prone to errors - that is the one I would drop. Alternatively, we can combine the dimensions and create a composite dimension out of them.
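As a sketch of that check, the strongly correlated pairs can be listed directly from the correlation matrix; the 0.8 threshold below is an arbitrary choice, not one used in the notebook:
# flag attribute pairs whose absolute correlation exceeds the threshold
corr = concrete_df.corr()
threshold = 0.8
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > threshold:
            print('Redundant pair candidate:', a, b, round(corr.loc[a, b], 2))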
Out[30]: (correlation matrix of cement, slag, ash, water, superplastic, coarseagg, fineagg, age, strength)
plt.figure(figsize = (12,10))
sns.heatmap(lower_triangle, center = 0.5, cmap = 'coolwarm', annot = True,
            xticklabels = lower_triangle.columns, yticklabels = lower_triangle.columns,
            cbar = True, linewidths = 1, mask = mask) # correlation heatmap
plt.show()
Observations:
1. Looking at the correlation table, the 'Cement', 'Water', 'Superplastic' and 'Age' features influence the concrete strength.
2. The concrete strength feature has a moderate positive correlation with the Cement feature.
3. The concrete strength feature has a low positive correlation with the Superplastic and Age features.
4. The concrete strength feature has a low positive correlation with the Water feature.
5. The concrete strength feature has negligible correlation with the Slag, Ash, Coarseagg and Fineagg features.
6. The Water feature has a moderate positive correlation with the Superplastic feature.
7. The Cement feature has a low positive correlation with the Slag and Ash features.
8. The Fineagg feature has a low positive correlation with the Water feature.
9. The Ash feature has a low positive correlation with the Superplastic feature.
Except for the 'Cement', 'Water', 'Superplastic' and 'Age' features, all other features have a very weak relationship with the concrete 'Strength' feature and do not support a statistical decision based on correlation.
The Cement feature has a low positive correlation with the Slag and Ash features, so perhaps we can create additional features like (cement + slag) and (cement + ash) to predict the concrete strength.
The Fineagg feature has a low positive correlation with the Water feature, so perhaps we can create an additional feature like (water + fineagg) to predict the concrete strength.
The Ash feature has a low positive correlation with the Superplastic feature, so perhaps we can create an additional feature like (ash + superplastic) to predict the concrete strength.
5 Feature Engineering
Identify opportunities (if any) to create a composite feature, drop a feature, etc. As mentioned in the EDA summary, the independent features influencing concrete strength are 'Cement', 'Water', 'Superplastic' and 'Age'.
Candidate composite features influencing concrete strength are cement + slag, cement + ash and water + fineagg. We can create these composite features because they have some relationship within them, as shown in the sketch below.
Note: Before concluding anything, we can try the feature-selection methods and then compare the results.
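A minimal sketch of those candidate composite features, on a copy of the dataframe (the new column names are illustrative, not from the notebook):
# candidate composite features suggested by the correlation analysis
concrete_df_fe = concrete_df.copy()
concrete_df_fe['cement_slag'] = concrete_df_fe['cement'] + concrete_df_fe['slag']
concrete_df_fe['cement_ash'] = concrete_df_fe['cement'] + concrete_df_fe['ash']
concrete_df_fe['water_fineagg'] = concrete_df_fe['water'] + concrete_df_fe['fineagg']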
print (model)
print ("**********************************************************************
if scale == 'yes':
# prepare the model with input scaling
pipeline = Pipeline([('scaler', PowerTransformer()), ('model', model
elif scale == 'no':
# prepare the model with input scaling
pipeline = Pipeline([('model', model)])
if of_type == "coef":
# Intercept and Coefficients
print("The intercept for our model is {}".format(model.intercept_),
print ("**********************************************************************
if of_type == "coef":
# Store the accuracy results for each model in a dataframe for final compariso
resultsDf = pd.DataFrame({'Method': method, 'R Squared': r2, 'RMSE': rmse
'Test Accuracy': test_accuracy_score}, index=
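The helper these fragments belong to is truncated in the export. The sketch below shows one way such a train_test_model helper could look, assuming the argument order and the metric names implied by the fragments and by the results table; it is not the notebook's exact implementation:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import mean_squared_error, r2_score

def train_test_model(model, method, X_train, X_test, y_train, y_test, i, scale='yes', of_type='coef'):
    print(model)
    print("*" * 75)
    if scale == 'yes':
        # prepare the model with input scaling
        pipeline = Pipeline([('scaler', PowerTransformer()), ('model', model)])
    else:
        # prepare the model without input scaling
        pipeline = Pipeline([('model', model)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # metrics reported in the comparison table
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    train_accuracy_score = pipeline.score(X_train, y_train)
    test_accuracy_score = pipeline.score(X_test, y_test)
    if of_type == "coef":
        # Intercept and Coefficients (only meaningful for linear-style models)
        print("The intercept for our model is {}".format(pipeline.named_steps['model'].intercept_))
    print("*" * 75)
    # Store the accuracy results for this model in a dataframe for final comparison
    return pd.DataFrame({'Method': method, 'R Squared': r2, 'RMSE': rmse,
                         'Train Accuracy': train_accuracy_score,
                         'Test Accuracy': test_accuracy_score}, index=[i])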
resultsDf_common = pd.DataFrame()
i = 1
for name, regressor in models:
# Train and Test the model
reg_resultsDf = train_test_model(regressor, name, X_train_common, X_test_c
    # Store the accuracy results for each model in a dataframe for final comparison
resultsDf_common = pd.concat([resultsDf_common, reg_resultsDf])
i = i+1
return resultsDf_common
# summarize results
print(name, "- Least: RMSE %f using %s" % (model_grid_result.best_score_
return model_grid_result.best_estimator_
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf
LinearRegression()
***************************************************************************
***************************************************************************
Out[42]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: This model performs well on the training set but poorly on the test set, which shows that it is overfitting and too complex.
In [43]: cdf = pd.DataFrame(lr.coef_, X_train.columns, columns=['Coefficients'])
print(cdf)
Coefficients
cement 0.1180563
slag 0.0978802
ash 0.0803260
water -0.1609238
superplastic 0.2661270
coarseagg 0.0089746
fineagg 0.0166621
age 0.2477612
In [44]: lr.intercept_
Out[44]: -12.664422252648194
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This might indicate that there are strong multicollinearity or other numerical problems.
R-squared and Adj. R-squared are very close, which is a sign that the predictors are relevant to the overall model.
The F-statistic of 240.8 is large and its p-value of 1.64e-194 is very close to 0 and less than 0.05, hence we can reject the null hypothesis. That means there is evidence of a good amount of linear relationship between the target variable (Strength) and the predictors.
Looking at the coefficients column of the OLS summary, the values are the same as the sklearn linear model coefficients, and even the intercept is the same.
Looking at the t-test columns of the OLS summary: the constant term has a t-value of -0.564 and a p-value of 0.573, which is greater than 0.05, so we fail to reject the null hypothesis.
Looking at the t-test columns of the OLS summary: Cement, Slag, Ash, Water and Superplastic have p-values < 0.05 (testing at a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis. That means there is evidence that these predictors have a good amount of linear relationship with the target variable.
Looking at the t-test columns of the OLS summary: Coarseagg and Fineagg have p-values > 0.05 (at the 95% confidence level), so we fail to reject the null hypothesis. That means there is evidence that these predictors do not have a good amount of linear relationship with the target variable.
Looking at the std err column of the OLS summary: the standard error reflects the level of accuracy of the coefficients. The std err values are very close to 0, except for the intercept, which means the level of accuracy is high.
Residual test results:
Skew: 0.076, there is a small tail to the right in the residuals distribution. Kurtosis: 3.285, there is a peak in the residuals distribution. Prob(Omnibus): 0.225 and Prob(JB): 0.216 indicate p-values > 0.05, meaning the departure from normality is not significant and the residuals are approximately normally distributed. The condition number is large, 1.05e+05, which indicates that some of the features are collinear.
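For reference, the OLS summary discussed above is typically produced with statsmodels along these lines; a sketch assuming the same training split (X_train, y_train):
import statsmodels.api as sm

X_train_ols = sm.add_constant(X_train)           # add the intercept ('const') term
ols_model = sm.OLS(y_train, X_train_ols).fit()   # ordinary least squares fit
print(ols_model.summary())                       # coefficients, t-tests, F-statistic, residual tests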
Ridge Regression
In [46]: # Building a Ridge Regression model
rr = Ridge(random_state = 1)
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf
Ridge(random_state=1)
***************************************************************************
The intercept for our model is 35.338420398009966
Out[46]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: This model performs well on both the training set and the test set, and the RMSE is also reduced to 6.77.
Lasso Regression
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf
Lasso(random_state=1)
***************************************************************************
The intercept for our model is 35.338420398009966
Out[47]:
Method R Squared RMSE Train Accuracy Test Accuracy
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,lr_resultsDf])
resultsDf
LinearRegression()
***************************************************************************
***************************************************************************
Out[49]:
Method R Squared RMSE Train Accuracy Test Accuracy
Fit a regularized linear model on interaction terms - Ridge Regression
In [50]: # Building a Ridge Regression model
rr = Ridge(random_state = 1)
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf
Ridge(random_state=1)
***************************************************************************
The intercept for our model is 34.16267671277516
Out[50]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: Notice that the test accuracy is better than that of the linear regression with interaction features. Still, this model performs better on the training set than on the test set, which shows that it is overfitting and too complex.
Fit a regularized linear model on interaction terms - Lasso Regression
In [51]: # Building a Lasso Regression model
lasso = Lasso(random_state = 1)
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf
Lasso(random_state=1)
***************************************************************************
The intercept for our model is 35.63838636257186
Out[51]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: This model performs well on the training set but poorly on the test set, which shows that it is overfitting and too complex.
In [52]: ## Let's try a polynomial model on the same data, with polynomial features of degree 1 to 5
In [53]: for i in range(1,6):
    pipe = Pipeline([('scaler', PowerTransformer()), ('polynomial', PolynomialFeatures(degree = i)),
                     ('model', LinearRegression())])
    pipe.fit(X_train, y_train) # Fit the model on the Training set
    prediction = pipe.predict(X_test) # Predict on the Test set
    print('Degree {} polynomial: RMSE = {:.2f}'.format(i, np.sqrt(((y_test - prediction) ** 2).mean())))
Looking at the above results, the RMSE for the degree-1 polynomial is 6.77 and it comes down to 5.50 for degree-2 polynomial features. From degree 3 onwards the RMSE starts increasing, hence the optimal polynomial degree is 2.
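The cell that refits the chosen degree-2 model is truncated in the export; a sketch of how it could look, under the same pipeline assumptions, whose printed metrics would correspond to the figures below:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

pipe2 = Pipeline([('scaler', PowerTransformer()),
                  ('polynomial', PolynomialFeatures(degree = 2)),
                  ('model', LinearRegression())])
pipe2.fit(X_train, y_train)                      # fit on the training set
prediction = pipe2.predict(X_test)               # predict on the test set
print('R-Squared :', r2_score(y_test, prediction))
print('ROOT MEAN SQUARED ERROR :', np.sqrt(mean_squared_error(y_test, prediction)))
print('Accuracy of Training data set:', round(pipe2.score(X_train, y_train), 4))
print('Accuracy of Test data set:', round(pipe2.score(X_test, y_test), 4))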
R-Squared : 0.877656458878097
ROOT MEAN SQUARED ERROR : 5.372598462804477
Accuracy of Training data set: 0.8787 %
Accuracy of Test data set: 0.8777 %
In [55]: # Store the accuracy results for each model in a dataframe for final comparison
poly_resultsDf = pd.DataFrame({'Method': 'Linear Regression with Polynomial features',
                               'R Squared': r2, 'RMSE': rmse, 'Train Accuracy': train_accuracy_score,
                               'Test Accuracy': accuracy_score}, index=[7])
resultsDf = pd.concat([resultsDf, poly_resultsDf])
resultsDf
Out[55]:
Method R Squared RMSE Train Accuracy Test Accuracy
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf
Ridge(random_state=1)
***************************************************************************
The intercept for our model is 20.104208789635525
Out[57]:
Method R Squared RMSE Train Accuracy Test Accuracy
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf
Lasso(random_state=1)
***************************************************************************
The intercept for our model is 36.07805219223952
Out[58]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: This model performs well on the training set but poorly on the test set, which shows that it is overfitting and too complex.
6.3 Explore for gaussians. If the data is likely to be a mix of gaussians, explore the individual clusters and present findings in terms of the independent attributes and their suitability to predict strength
K Means Clustering
labels = clusters.labels_
centroids = clusters.cluster_centers_
cluster_errors.append(clusters.inertia_ )
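The loop surrounding these three lines is truncated in the export; a minimal sketch of the elbow computation, assuming cluster counts 1 to 14 and the scaled features in concrete_df_scaled:
import pandas as pd
from sklearn.cluster import KMeans

cluster_range = range(1, 15)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters = num_clusters, random_state = 1)
    clusters.fit(concrete_df_scaled)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)     # within-cluster sum of squares
clusters_df = pd.DataFrame({'num_clusters': cluster_range, 'cluster_errors': cluster_errors})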
Out[60]:
num_clusters cluster_errors
0 1 9045.0000000
1 2 7184.1710349
2 3 6171.5046087
3 4 5406.3104852
4 5 4916.4063781
5 6 4433.3515066
6 7 4089.0709455
7 8 3805.7902271
8 9 3633.1877291
9 10 3438.5080988
10 11 3303.1169542
11 12 3091.2006462
12 13 2988.2801883
13 14 2870.5853780
In [61]: # Elbow plot
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o")
In [62]: # k = 6
cluster = KMeans(n_clusters = 6, random_state = 1)
cluster.fit(concrete_df_scaled)
In [63]: # Creating a new column "GROUP" which will hold the cluster id of each record
prediction=cluster.predict(concrete_df_scaled)
concrete_df_scaled["GROUP"] = prediction
Out[65]: (dataframe preview; columns: cement, slag, ash, water, superplastic, coarseagg, fineagg, ...)
In [66]: ## Instead of interpreting the numerical values of the centroids, let us do a visual comparison of the
## centroids and the data in each cluster using box plots.
concrete_df_scaled.boxplot(by = 'GROUP', layout=(3,3), figsize=(15, 10));
Here, none of the dimensions is a good predictor of the target variable: for every dimension (variable), each cluster has a similar range of values except in one case, and the bodies of the clusters overlap. So although k-means does find clusters in the dataset along different dimensions, we cannot see any distinct characteristics of these clusters that would tell us to break the data into separate groups and build separate models for them.
KNN Regressor
In [67]: def train_test_transform(X_train, X_test):
    scale = PowerTransformer()
    X_train_scaled = pd.DataFrame(scale.fit_transform(X_train), columns = X_train.columns)
    # transform (not fit) the test set, to avoid fitting the scaler on test data
    X_test_scaled = pd.DataFrame(scale.transform(X_test), columns = X_test.columns)
    return X_train_scaled, X_test_scaled

error = []
for k in range(1, 30):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    # predict the response
    y_pred = knn.predict(X_test_scaled)
    error.append(np.mean(y_pred != y_test))
In [69]: plt.figure(figsize=(12,6))
plt.plot(range(1,30), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue')
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean error')
Optimal value of K is 2
In [70]: # Building a KNN Regression model
knn = KNeighborsRegressor(n_neighbors = 2)
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, knn_resultsDf])
resultsDf
KNeighborsRegressor(n_neighbors=2)
***************************************************************************
***************************************************************************
Out[70]:
Method R Squared RMSE Train Accuracy Test Accuracy
Observation: This model performs well on the training set but poorly on the test set, which shows that it is overfitting and too complex.
i = 11
for name, regressor in models:
if name == 'SVR':
# Train and Test the model
svr_resultsDf = train_test_model(regressor, name, X_train, X_test,
        # Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, svr_resultsDf])
elif name == 'BaggingRegressor':
# Train and Test the model
bag_resultsDf = train_test_model(regressor, name, X_train, X_test,
        # Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, bag_resultsDf])
else:
# Train and Test the model
ensemble_resultsDf = train_test_model(regressor, name, X_train, X_test
        # Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, ensemble_resultsDf])
i = i+1
SVR(kernel='linear')
***************************************************************************
The intercept for our model is [35.08548458]
Out[72]:
Method R Squared RMSE Train Accuracy Test Accuracy
I am able to predict the concrete compressive strength from a few ingredients, with the details below.
I tried simple linear regression first, which overfit, so I moved on to regularized and polynomial models.
a. Simple linear regression with polynomial features of degree 2 performs well on both the training and test sets, with a 1% difference between them.
b. Ridge and Lasso with polynomial features turned out to be overfit models.
I also tried a Support Vector Regressor, and it performs well on both the training and test sets.
a. Linear Regression with Polynomial features - Test accuracy = 86.94% with RMSE = 5.50
b. Ridge regression with original features - Test accuracy = 80.24% with RMSE = 6.77
c. SVR with original features - Test accuracy = 80.03% with RMSE = 6.81
7 Optimization
SVR - Least: RMSE 8.340742 using {'C': 50, 'gamma': 'scale', 'kernel': 'poly'}
Total duration 2.113403797149658
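The search cell itself is truncated; the sketch below shows the kind of GridSearchCV that could produce the output above. The parameter grid, the scoring choice and the cv value are assumptions, not the notebook's exact settings:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'C': [1, 10, 50, 100],
              'gamma': ['scale', 'auto'],
              'kernel': ['linear', 'rbf', 'poly']}
grid = GridSearchCV(SVR(), param_grid, scoring = 'neg_root_mean_squared_error', cv = 5)
model_grid_result = grid.fit(X_train, y_train)
# best_score_ is a negated RMSE with this scoring, so flip the sign when reporting
print('SVR', "- Least: RMSE %f using %s" % (-model_grid_result.best_score_, model_grid_result.best_params_))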
resultsDf_hp = pd.DataFrame()
i = 1
for name, regressor in models:
# Train and Test the model
resultsDf_hp_ind = train_test_model(regressor, name, X_train, X_test, y_train
    # Store the accuracy results for each model in a dataframe for final comparison
resultsDf_hp = pd.concat([resultsDf_hp, resultsDf_hp_ind])
i = i+1
Ridge(alpha=1)
***************************************************************************
***************************************************************************
SVR(C=50, kernel='poly')
***************************************************************************
***************************************************************************
Out[74]:
Method R Squared RMSE Train Accuracy Test Accuracy
Current Time
In [76]: import datetime
now = datetime.datetime.now()
print ("Current date and time : ")
print (now.strftime("%Y-%m-%d %H:%M:%S"))
Ridge Regressor
In [77]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# fit model
rrTree = Ridge(alpha=1)
# fit against independent variables and corresponding target values
rrTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = rrTree.score(test[:, :-1] , y_bs_test)
predictions = rrTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
rr_stats.append(score)
rr_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(rr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(rr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
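The resampling lines inside these bootstrap cells are truncated in the export. The sketch below shows the full pattern in one place for the Ridge case, assuming a 50% bootstrap sample size (n_size) and an out-of-bag test set built as in the RandomForest cell further down; the same loop is repeated for each regressor that follows:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import resample

values = concrete_df_scaled.values
n_iterations = 1000
n_size = int(len(values) * 0.50)                 # bootstrap sample size (assumed fraction)
rr_stats = list()
for i in range(n_iterations):
    # prepare train and test sets: sample with replacement, test on the out-of-bag rows
    train = resample(values, n_samples = n_size)
    test = np.array([x for x in values if x.tolist() not in train.tolist()])
    y_bs_test = test[:, -1]                      # target assumed to be the last column
    # fit and score the model on this bootstrap split
    rrTree = Ridge(alpha = 1)
    rrTree.fit(train[:, :-1], train[:, -1])
    rr_stats.append(rrTree.score(test[:, :-1], y_bs_test))
# 95% confidence interval from the bootstrap distribution of scores
lower, upper = np.percentile(rr_stats, [2.5, 97.5])
print('95.0 confidence interval %.1f%% and %.1f%%' % (lower * 100, upper * 100))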
Lasso Regressor
In [79]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# fit model
lrTree = Lasso(alpha=0.02)
# fit against independent variables and corresponding target values
lrTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = lrTree.score(test[:, :-1] , y_bs_test)
predictions = lrTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
lr_stats.append(score)
lr_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(lr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(lr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
KNeighbors Regressor
In [81]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# fit model
knTree = KNeighborsRegressor(metric='euclidean', n_neighbors=7, weights='distance')
# fit against independent variables and corresponding target values
knTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = knTree.score(test[:, :-1] , y_bs_test)
predictions = knTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
kn_stats.append(score)
kn_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(kn_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(kn_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
SVR
In [83]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# fit model
svrTree = SVR(C=50, gamma='scale', kernel='poly')
# fit against independent variables and corresponding target values
svrTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = svrTree.score(test[:, :-1] , y_bs_test)
predictions = svrTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
svr_stats.append(score)
svr_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(svr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(svr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
Bagging Regressor
In [85]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
brm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
brmTree = BaggingRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
brmTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = brmTree.score(test[:, :-1] , y_bs_test)
predictions = brmTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
brm_stats.append(score)
brm_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(brm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(brm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
etm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
etmTree = ExtraTreesRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
etmTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = etmTree.score(test[:, :-1] , y_bs_test)
predictions = etmTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
etm_stats.append(score)
etm_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(etm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(etm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
adm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
admTree = AdaBoostRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
admTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = admTree.score(test[:, :-1] , y_bs_test)
predictions = admTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
adm_stats.append(score)
adm_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(adm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(adm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
cbm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
cbmTree = CatBoostRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
cbmTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = cbmTree.score(test[:, :-1] , y_bs_test)
predictions = cbmTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
cbm_stats.append(score)
cbm_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(cbm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(cbm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
XGB Regressor
In [93]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
xgb_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
xgbTree = XGBRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
xgbTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = xgbTree.score(test[:, :-1] , y_bs_test)
predictions = xgbTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
xgb_stats.append(score)
xgb_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(xgb_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(xgb_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
GradientBoostingRegressor
In [95]: values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
gbm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")
# fit model
gbmTree = GradientBoostingRegressor(n_estimators=50)
# fit against independent variables and corresponding target values
gbmTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = gbmTree.score(test[:, :-1] , y_bs_test)
predictions = gbmTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
gbm_stats.append(score)
gbm_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(gbm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(gbm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
RandomForestRegressor
In [106]: values = concrete_df_scaled.values
# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
rf_stats = list()
for i in range(n_iterations):
# prepare train and test sets
train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])
# fit model
rfTree = RandomForestRegressor(max_features='auto', n_estimators=100)
# fit against independent variables and corresponding target values
rfTree.fit(train[:,:-1], train[:,-1])
# evaluate model
# predict based on independent variables in the test data
score = rfTree.score(test[:, :-1] , y_bs_test)
predictions = rfTree.predict(test[:, :-1])
#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
rf_stats.append(score)
rf_duration = duration
# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left, 2.5% on each side
lower = max(0.0, np.percentile(rf_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(rf_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
8 Conclusion
We were able to predict the concrete compressive strength using the original features, with an accuracy of 86.94% on test data and RMSE = 5.50.
Looking at the results from the various methods, we got the best accuracy from the original features, after following the steps below.
We had outliers in the 'Water', 'Superplastic', 'Fineagg', 'Age' and 'Strength' columns; we handled these outliers by replacing every outlier with the upper or lower whisker value.
Except for the 'Cement', 'Water', 'Superplastic' and 'Age' features, all other features have a very weak relationship with the concrete 'Strength' feature and do not support a statistical decision based on correlation.
Simple linear regression with polynomial features of degree 2 performs well on both the training and test sets, with a 1% difference between them.
In [ ]: