0% found this document useful (0 votes)
68 views78 pages

Multiple Regressor - Jupyter Notebook

1. The document discusses importing necessary libraries and packages for analyzing concrete strength data. 2. It then loads and examines the concrete data, which has 1030 rows and 9 columns with mostly float64 data types. The data contains 8 quantitative input variables and 1 quantitative output variable. 3. The data is checked for duplicates, with 25 duplicate rows identified that will need to be removed for further analysis.

Uploaded by

Elyza
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
68 views78 pages

Multiple Regressor - Jupyter Notebook

1. The document discusses importing necessary libraries and packages for analyzing concrete strength data. 2. It then loads and examines the concrete data, which has 1030 rows and 9 columns with mostly float64 data types. The data contains 8 quantitative input variables and 1 quantitative output variable. 3. The data is checked for duplicates, with 25 duplicate rows identified that will need to be removed for further analysis.

Uploaded by

Elyza
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 78

1.

Importing the necessary libraries


In [1]:  !pip install missingno
!pip install xgboost
!pip install catboost
!pip install lightgbm

Requirement already satisfied: missingno in c:\users\stephen oliver so\do


cuments\anaconda app\lib\site-packages (0.5.1)
Requirement already satisfied: scipy in c:\users\stephen oliver so\docume
nts\anaconda app\lib\site-packages (from missingno) (1.9.1)
Requirement already satisfied: numpy in c:\users\stephen oliver so\docume
nts\anaconda app\lib\site-packages (from missingno) (1.21.5)
Requirement already satisfied: matplotlib in c:\users\stephen oliver so\d
ocuments\anaconda app\lib\site-packages (from missingno) (3.5.2)
Requirement already satisfied: seaborn in c:\users\stephen oliver so\docu
ments\anaconda app\lib\site-packages (from missingno) (0.11.2)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\stephen oliv
er so\documents\anaconda app\lib\site-packages (from matplotlib->missingn
o) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\stephen oliv
er so\documents\anaconda app\lib\site-packages (from matplotlib->missingn
o) (1.4.2)
Requirement already satisfied: cycler>=0.10 in c:\users\stephen oliver s
o\documents\anaconda app\lib\site-packages (from matplotlib->missingno)
(0.11.0)
Requirement already satisfied: packaging>=20.0 in c:\users\stephen oliver
In [2]:  import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns


import itertools

import time

# used to supress display of warnings


import warnings

# ols library
import statsmodels.api as sm
import statsmodels.formula.api as smf

import missingno as mno


from sklearn.cluster import DBSCAN
from sklearn.cluster import OPTICS

# import zscore for scaling the data


from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

from sklearn.metrics import silhouette_score


from sklearn.cluster import KMeans

# pre-processing methods
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

from sklearn.compose import TransformedTargetRegressor

# the regression models


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

# cross-validation methods
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn import metrics


from sklearn.pipeline import Pipeline

# feature-selection methods
from sklearn.feature_selection import SelectFromModel

# bootstrap sampling
from sklearn.utils import resample
In [3]:  # suppress display of warnings
warnings.filterwarnings('ignore')

# display all dataframe columns


pd.options.display.max_columns = None

# to set the limit to 3 decimals


pd.options.display.float_format = '{:.7f}'.format

# display all dataframe rows


pd.options.display.max_rows = None

2 Data Collection
In [4]:  # Reading Concrete data
concrete_df = pd.read_csv("concrete.csv")

In [5]:  # Get the top 5 rows


concrete_df.head()

Out[5]:
cement slag ash water superplastic coarseagg fineagg

0 141.3000000 212.0000000 0.0000000 203.5000000 0.0000000 971.8000000 748.5000000

1 168.9000000 42.2000000 124.3000000 158.3000000 10.8000000 1080.8000000 796.2000000

2 250.0000000 0.0000000 95.7000000 187.4000000 5.5000000 956.9000000 861.2000000

3 266.0000000 114.0000000 0.0000000 228.0000000 0.0000000 932.0000000 670.0000000

4 154.8000000 183.4000000 0.0000000 193.3000000 9.1000000 1047.4000000 696.7000000

2.1 Shape of the data

In [6]:  # Get the shape of Concrete data


concrete_df.shape

Out[6]: (1030, 9)

In [7]:  print("Number of rows = {0} and Number of Columns = {1} in Data frame".format

Number of rows = 1030 and Number of Columns = 9 in Data frame


2.2 Data type of each attribute

In [8]:  # Check datatypes


concrete_df.dtypes

Out[8]: cement float64


slag float64
ash float64
water float64
superplastic float64
coarseagg float64
fineagg float64
age int64
strength float64
dtype: object

From the above output, we see that except for the column 'age', all the columns datatype are
float64. The data has 8 quantitative input variables and 1 quantitative output variable - strength

In [9]:  # Check Data frame info


concrete_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cement 1030 non-null float64
1 slag 1030 non-null float64
2 ash 1030 non-null float64
3 water 1030 non-null float64
4 superplastic 1030 non-null float64
5 coarseagg 1030 non-null float64
6 fineagg 1030 non-null float64
7 age 1030 non-null int64
8 strength 1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

In [10]:  # Column names of Data frame


concrete_df.columns

Out[10]: Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',


'fineagg', 'age', 'strength'],
dtype='object')

3 Data Cleaning
In [11]:  # Check duplicates in a data frame
concrete_df.duplicated().sum()

Out[11]: 25

3.1 Check for duplicates

In [12]:  # View the duplicate records


duplicates = concrete_df.duplicated()

concrete_df[duplicates]

Out[12]:
cement slag ash water superplastic coarseagg fineagg

278 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

298 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

400 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

420 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

463 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

468 252.0000000 0.0000000 0.0000000 185.0000000 0.0000000 1111.0000000 784.0000000

482 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

493 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

517 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

525 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

527 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

576 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

577 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

604 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

733 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

738 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

766 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

830 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

880 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

884 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000

892 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

933 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

943 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

967 362.6000000 189.0000000 0.0000000 164.9000000 11.6000000 944.7000000 755.8000000

992 425.0000000 106.3000000 0.0000000 153.5000000 16.5000000 852.1000000 887.1000000


3.2 Drop Duplicates
In [13]:  # Delete duplicate rows
concrete_df.drop_duplicates(inplace=True)

In [14]:  # Get the shape of Concrete data


concrete_df.shape

Out[14]: (1005, 9)

3.3 Check Outliers

In [15]:  # Create a boxplot for all the continuous features


concrete_df.boxplot(column = ['cement', 'slag', 'ash', 'water', 'superplastic'
'fineagg', 'age', 'strength'], rot=45, figsize = (20,10));

3.4 Working with Outliers: Correcting, Removing


In [16]:  concrete_df_outliers = pd.DataFrame(concrete_df.loc[:,])

# Calculate IQR
Q1 = concrete_df_outliers.quantile(0.25)
Q3 = concrete_df_outliers.quantile(0.75)
IQR = Q3 - Q1

print(IQR)

cement 158.3000000
slag 142.5000000
ash 118.3000000
water 26.3000000
superplastic 10.0000000
coarseagg 99.0000000
fineagg 97.9000000
age 49.0000000
strength 21.3500000
dtype: float64

In [17]:  concrete_df.columns

Out[17]: Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',


'fineagg', 'age', 'strength'],
dtype='object')

In [18]:  # use IQR score to filter out the outliers by keeping only valid values

# Replace every outlier on the upper side by the upper whisker - for 'water', 'sup
# 'fineagg', 'age' and 'strength' columns
for i, j in zip(np.where(concrete_df_outliers > Q3 + 1.5 * IQR)[0], np.where

whisker = Q3 + 1.5 * IQR


concrete_df_outliers.iloc[i,j] = whisker[j]

# Replace every outlier on the lower side by the lower whisker - for 'water' colum
for i, j in zip(np.where(concrete_df_outliers < Q1 - 1.5 * IQR)[0], np.where

whisker = Q1 - 1.5 * IQR


concrete_df_outliers.iloc[i,j] = whisker[j]

In [19]:  # Remove outliers columns - 'water', 'superplastic', 'fineagg', 'age', 'water' and
concrete_df.drop(columns = concrete_df.loc[:,], inplace = True)

In [20]:  # Add 'water', 'superplastic', 'fineagg', 'age', 'water' and 'strength' with no ou
# concrete_df
concrete_df = pd.concat([concrete_df, concrete_df_outliers], axis = 1)

3.5 Check Outliers after correction


In [21]:  # Create a boxplot for all the continuous features
concrete_df.boxplot(column = ['cement', 'slag', 'ash', 'water', 'superplastic'
'fineagg', 'age', 'strength'], rot=45, figsize = (20,10));

4.6 Check Missing Values

In [22]:  # Check the presence of missing values


concrete_df.isnull().sum()

Out[22]: cement 0
slag 0
ash 0
water 0
superplastic 0
coarseagg 0
fineagg 0
age 0
strength 0
dtype: int64
In [23]:  # Check the presence of missing values
concrete_df_missval = concrete_df.copy() # Make a copy of the dataframe
isduplicates = False

for x in concrete_df_missval.columns:
concrete_df_missval[x] = concrete_df_missval[x].astype(str).str.replace
result = concrete_df_missval[x].astype(str).str.isalnum() # Check whether all
if False in result.unique():
isduplicates = True
print('For column "{}" unique values are {}'.format(x, concrete_df_missval
print('\n')

if not isduplicates:
print('No duplicates in this dataset')

No duplicates in this dataset

In [24]:  # Visualize missing values


mno.matrix(concrete_df, figsize = (20, 6));

In [25]:  # Summary statistics


concrete_df.describe().T

Out[25]:
count mean std min 25% 50%

cement 1005.0000000 278.6313433 104.3442607 102.0000000 190.7000000 265.0000000

slag 1005.0000000 72.0372139 86.1499938 0.0000000 0.0000000 20.0000000

ash 1005.0000000 55.5363184 64.2079686 0.0000000 0.0000000 0.0000000

water 1005.0000000 182.0668159 21.1586448 127.1500000 166.6000000 185.7000000

superplastic 1005.0000000 5.9814925 5.7244631 0.0000000 0.0000000 6.1000000

coarseagg 1005.0000000 974.3768159 77.5796667 801.0000000 932.0000000 968.0000000

fineagg 1005.0000000 772.5710945 80.0359343 594.0000000 724.3000000 780.0000000

age 1005.0000000 38.0761194 35.8625492 1.0000000 7.0000000 28.0000000

strength 1005.0000000 35.2263184 16.2202533 2.3300000 23.5200000 33.8000000

3.7 Data Cleaning Summary


1. We had 25 duplicate instances in dataset and dropped those duplicates.
2. We had outliers in 'Water', 'Superplastic', 'Fineagg', 'Age' and 'Strength' column. Handled
these outliers by replacing every outlier with upper side of the whisker.
3. We had outliers in 'Water' column also, handled these outliers by replacing every outlier
with lower side of the whisker.
4. No missing values in dataset.

4 EDA (Data Analysis and Preparation)

5.1 Variable Identification

1. Target variable: 'Strength'


2. Predictors (Input varibles): 'Cement', 'Slag', 'Ash', 'Water', 'Superplastic', 'Coarseagg',
'Fineagg', 'Age'

5.2 Univariate Analysis


In [26]:  cols = [i for i in concrete_df.columns if i not in 'strength']
length = len(cols)
cs = ["b","r","g","c","m","k","lime","c"]
fig = plt.figure(figsize=(13,25))

for i,j,k in itertools.zip_longest(cols,range(length),cs):


plt.subplot(4,2,j+1)
ax = sns.distplot(concrete_df[i],color=k,rug=True)
ax.set_facecolor("w")
plt.axvline(concrete_df[i].mean(),linestyle="dashed",label="mean",color
plt.legend(loc="best")
plt.title(i,color="navy")
plt.xlabel("")
Univariate analysis:

Cement column - Right skewed distribution -- cement is skewed to higher values Slag column -
Right skewed distribution -- slag is skewed to higher values and there are two gaussians Ash
column - Right skewed distribution -- ash is skewed to higher values and there are two
gaussians Water column - Moderately left skewed distribution Superplastic column - Right
skewed distribution -- superplastic is skewed to higher values and there are two gaussians
Coarseagg column - Moderately left skewed distribution Fineagg column - Moderately left
skewed distribution Age column - Right skewed distribution -- age is skewed to higher values
and there are three gaussians

Concrete compressive strength distribution

In [27]:  plt.figure(figsize=(13,6))
sns.distplot(concrete_df["strength"],color="b",rug=True)
plt.axvline(concrete_df["strength"].mean(), linestyle="dashed",color="k", label
plt.legend(loc="best",prop={"size":14})
plt.title("Concrete compressivee strength distribution")
plt.show()

Strength column seems to be uniformly distributed


In [28]:  # Summary statistics
concrete_df.describe().T

Out[28]:
count mean std min 25% 50%

cement 1005.0000000 278.6313433 104.3442607 102.0000000 190.7000000 265.0000000

slag 1005.0000000 72.0372139 86.1499938 0.0000000 0.0000000 20.0000000

ash 1005.0000000 55.5363184 64.2079686 0.0000000 0.0000000 0.0000000

water 1005.0000000 182.0668159 21.1586448 127.1500000 166.6000000 185.7000000

superplastic 1005.0000000 5.9814925 5.7244631 0.0000000 0.0000000 6.1000000

coarseagg 1005.0000000 974.3768159 77.5796667 801.0000000 932.0000000 968.0000000

fineagg 1005.0000000 772.5710945 80.0359343 594.0000000 724.3000000 780.0000000

age 1005.0000000 38.0761194 35.8625492 1.0000000 7.0000000 28.0000000

strength 1005.0000000 35.2263184 16.2202533 2.3300000 23.5200000 33.8000000

The above output prints the important summary statistics of all the numeric variables like the
mean, median (50%), minimum, and maximum values, along with the standard deviation.

4.4 Multivariate Analysis


In [29]:  sns.pairplot(concrete_df, diag_kind = 'hist', corner = True);

Diagonals Analysis If we look at KDE diagonal plots, there are at least 2 Gaussians (2 peaks) in
Slag, Ash, Superplastic and Age, even though it's not unsupervised learning but in this dataset
there are at least 2 clusters and there may be more.

Range of clusters in this dataset is 2 to 4.

The diagonal analysis give same insights as we got from univariate analysis.

Off Diagonal Analysis: Relationship between indpendent attributes Scatter plots Cement vs
other independent attributes: This attribute does not have any significant relation with other
independent features. It almost spread like a cloud. If we had calculated the r value it would
have come close to 0.

Slag vs other independent attributes: This attribute does not have any significant relation with
other independent features. It almost spread like a cloud. If we had calculated the r value it
would have come close to 0.
Ash vs other independent attributes: This attribute does not have any significant relation with
other independent features. It almost spread like a cloud. If we had calculated the r value it
would have come close to 0.

Water vs other independent attributes: This attribute has negative curvy-linear relationship with
Fineagg, Coarseagg and Superplastic, as Water content increases means Fineagg, Coarseagg
and Superplastic are reducing. It does not have any significant relationship with other
independent atributes.

Superplastic vs other independent attributes:This attribute have negative linear relationship with
water only. It does not have any significant relationship with other independent attributes.

Coarseagg vs other independent attributes: This attribute does not have any significant relation
with other independent features. It almost spread like a cloud. If we had calculated the r value it
would have come close to 0.

Fineagg vs other independent attributes: It has negative linear relationship with water. It does
not have any significant relation with any other attributes. It almost spread like a cloud. If we
had calculated the r value it would have come close to 0.

The reason why we are doing all this analysis is if we find any kind of dimensions which are
very strongly correlated i.e. r value close to 1 or -1 such dimensions are giving same
information to your algorithms, its a redundant dimension. So in such cases we may want to
keep one and drop the other which we should keep and which we should drop depends on
again your domain expertise, which one of the dimension is more prone to errors.I would like to
drop that dimension. Or we have a choice to combine these dimensions and create a composite
dimension out of it.

4.5 Study Correlation

In [30]:  # Check the Correlation


concrete_df.corr()

Out[30]:
cement slag ash water superplastic coarseagg fineagg

cement 1.0000000 -0.3033700 -0.3856098 -0.0572088 0.0448168 -0.0862053 -0.2476623

slag -0.3033700 1.0000000 -0.3123648 0.1302267 0.0196023 -0.2775952 -0.2911288

ash -0.3856098 -0.3123648 1.0000000 -0.2845415 0.4361848 -0.0264685 0.0918726

water -0.0572088 0.1302267 -0.2845415 1.0000000 -0.6560121 -0.2103993 -0.4441919

superplastic 0.0448168 0.0196023 0.4361848 -0.6560121 1.0000000 -0.2357154 0.2019397

coarseagg -0.0862053 -0.2775952 -0.0264685 -0.2103993 -0.2357154 1.0000000 -0.1604944

fineagg -0.2476623 -0.2911288 0.0918726 -0.4441919 0.2019397 -0.1604944 1.0000000

age 0.0556490 -0.0546345 -0.0946261 0.1945588 -0.1275331 0.0127844 -0.0979334

strength 0.4886896 0.1024400 -0.0796849 -0.2713811 0.3429833 -0.1457945 -0.1885092

In [31]:  ## pairplot for checking correlation


In [32]:  sns.pairplot(concrete_df[['cement', 'slag', 'ash', 'water', 'superplastic',
'fineagg', 'age', 'strength']], kind = 'reg', corner = True);

In [33]:  ## Heatmap for checking the Correlation


In [34]:  corr = abs(concrete_df.corr()) # correlation matrix
lower_triangle = np.tril(corr, k = -1) # select only the lower triangle of the co
mask = lower_triangle == 0 # to mask the upper triangle in the following heatmap

plt.figure(figsize = (12,10))
sns.heatmap(lower_triangle, center = 0.5, cmap = 'coolwarm', annot=True, xticklabe
cbar= True, linewidths= 1, mask = mask) # Da Heatmap
plt.show()

Observations:

1. Looking at the Correlation table; 'Cement', 'Water', 'Superplastic' and 'Age' features are
influencing the concrete strength.
2. Concrete strength feature is having Moderate Positive Correlation with Cement feature.
3. Concrete strength feature is having Low Positive Correlation with Superplastic and Age
features
4. Concrete strength feature is having Low Positive Correlation with Water features
5. Concrete strength feature is having negligible Correlation with Slag, Ash, Coarseagg and
Fineagg features
6. Water feature is having Moderate Positive Correlation with Superplastic feature
7. Concrete cement feature is having Low Positive Correlation with Slag and Ash features
8. Concrete fineagg feature is having Low Positive Correlation with Water feature
9. Concrete ash feature is having Low Positive Correlation with Superplastic feature

4.6 EDA (Exploratory Data Analysis) Summary

Except 'Cement', 'Water', 'Superplastic' and 'Age' features, all other features are having very
weak relationship with concrete 'Strength' feature and does not account for making statistical
decision (of correlation).

Concrete Cement feature is having Low Positive Correlation with Slag and Ash features,
perhaps we can create additional features like (cement + slag) and (cement + ash) to predict
the concrete strength.

Concrete Fineagg feature is having Low Positive Correlation with Water feature, perhaps we
can create additional features like (water + fineagg) to predict the concrete strength.

Concrete Ash feature is having Low Positive Correlation with Superplastic feature, perhaps we
can create additional features like (ash + Superplastic) to predict the concrete strength.

Range of clusters in this dataset is 2 to 4.

5 Feature Engineering
Identify opportunities (if any) to create a composite feature, drop a feature etc. As mentioned in
EDA summary. Independent features are influencing concrete strength - 'Cement', 'Water',
'Superplastic' and 'Age'

Composite features are influencing concrete strength - cement + slag, cement + ash and water
+ fineagg. We can create these composite featues because these features are having some
relationship within them.

Note: Before concluding anything we can try with feature selection methods and then compare
the resutls.

6 Model Building and Validation


In [35]:  import matplotlib.gridspec as gridspec

# sns styling figures


sns.set(style='white')
sns.set(style='whitegrid',color_codes=True)

6.1 Sampling Techniques - Create Training and Test Set

In [36]:  X = concrete_df.drop(['strength'], axis = 1) # Considering all Predictors


y = concrete_df['strength']
In [37]:  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,

In [38]:  print('X_train shape : ({0},{1})'.format(X_train.shape[0], X_train.shape[1]))


print('y_train shape : ({0},)'.format(y_train.shape[0]))
print('X_test shape : ({0},{1})'.format(X_test.shape[0], X_test.shape[1]))
print('y_test shape : ({0},)'.format(y_test.shape[0]))

X_train shape : (804,8)


y_train shape : (804,)
X_test shape : (201,8)
y_test shape : (201,)

6.2 Decide on complexity of the model, should it be simple linear


model in terms of parameters or would a quadratic or higher degree
help

Train and test model


In [39]:  def train_test_model(model, method, X_train, X_test, y_train, y_test, of_type

print (model)
print ("**********************************************************************

if scale == 'yes':
# prepare the model with input scaling
pipeline = Pipeline([('scaler', PowerTransformer()), ('model', model
elif scale == 'no':
# prepare the model with input scaling
pipeline = Pipeline([('model', model)])

pipeline.fit(X_train, y_train) # Fit the model on Training set


prediction = pipeline.predict(X_test) # Predict on Test set

r2 = metrics.r2_score(y_test, prediction) # Calculate the r squared value on t


rmse = np.sqrt(metrics.mean_squared_error(y_test, prediction)) # Root mean squ

if of_type == "coef":
# Intercept and Coefficients
print("The intercept for our model is {}".format(model.intercept_),

for idx, col_name in enumerate(X_train.columns):


print("The coefficient for {} is {}".format(col_name, model.coef_

# Accuracy of Training data set


train_accuracy_score = pipeline.score(X_train, y_train)

# Accuracy of Test data set


test_accuracy_score = pipeline.score(X_test, y_test)

print ("**********************************************************************

if of_type == "coef":

# FEATURE IMPORTANCES plot


plt.figure(figsize=(13,12))
plt.subplot(211)
print(model.coef_)
coef = pd.DataFrame(np.sort(model.coef_)[::-1].ravel())
coef["feat"] = X_train.columns
ax1 = sns.barplot(coef["feat"],coef[0],palette="jet_r", linewidth=2
ax1.set_facecolor("lightgrey")
ax1.axhline(0,color="k",linewidth=2)
plt.ylabel("coefficients")
plt.xlabel("features")
plt.title(method + ' ' + 'FEATURE IMPORTANCES')

elif of_type == "feat":

# FEATURE IMPORTANCES plot


plt.figure(figsize=(13,12))
plt.subplot(211)
coef = pd.DataFrame(np.sort(model.feature_importances_)[::-1])
coef["feat"] = X_train.columns
ax2 = sns.barplot(coef["feat"], coef[0],palette="jet_r", linewidth=
ax2.set_facecolor("lightgrey")
ax2.axhline(0,color="k",linewidth=2)
plt.ylabel("coefficients")
plt.xlabel("features")
plt.title(method + ' ' + 'FEATURE IMPORTANCES')

# Store the accuracy results for each model in a dataframe for final compariso
resultsDf = pd.DataFrame({'Method': method, 'R Squared': r2, 'RMSE': rmse
'Test Accuracy': test_accuracy_score}, index=

return resultsDf # return all the metrics along with predictions

Train and test all models

In [40]:  def train_test_allmodels(X_train_common, X_test_common, y_train, y_test, scale


# define regressor models
models=[['LinearRegression',LinearRegression()],
['Ridge',Ridge(random_state = 1, shuffle=True)],
['Lasso',Lasso(random_state = 1)],
['KNeighborsRegressor',KNeighborsRegressor(n_neighbors = 3)],
['SVR',SVR(kernel = 'linear')],
['RandomForestRegressor',RandomForestRegressor(random_state = 1, shuffle
['BaggingRegressor',BaggingRegressor(random_state = 1, shuffle=True
['ExtraTreesRegressor',ExtraTreesRegressor(random_state = 1, shuffle
['AdaBoostRegressor',AdaBoostRegressor(random_state = 1, shuffle=True
['GradientBoostingRegressor',GradientBoostingRegressor(random_state
['CatBoostRegressor',CatBoostRegressor(random_state = 1, shuffle=True
['XGBRegressor',XGBRegressor()]
]

resultsDf_common = pd.DataFrame()
i = 1
for name, regressor in models:
# Train and Test the model
reg_resultsDf = train_test_model(regressor, name, X_train_common, X_test_c

# Store the accuracy results for each model in a dataframe for final compa
resultsDf_common = pd.concat([resultsDf_common, reg_resultsDf])
i = i+1

return resultsDf_common

Model with Hyperparameter Tuning


In [41]:  def hyperparameterstune_model(name, model, X_train, y_train, param_grid):

start = time.time() # note the start time

# define grid search


cv = KFold(n_splits=10, random_state=1, shuffle=True)
#grid_search = RandomizedSearchCV(estimator=model, param_distributions=param_g
#scoring = 'neg_root_mean_squared_error', err
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs
scoring = 'neg_root_mean_squared_error'
model_grid_result = grid_search.fit(X_train, y_train)

# summarize results
print(name, "- Least: RMSE %f using %s" % (model_grid_result.best_score_

end = time.time() # note the end time


duration = end - start # calculate the total duration
print("Total duration" , duration, "\n")

return model_grid_result.best_estimator_

Modelling - Linear Regression

In [42]:  # Building a Linear Regression model


lr = LinearRegression()

# Train and Test the model


resultsDf = train_test_model(lr, 'LinearRegression', X_train, X_test, y_train

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf

LinearRegression()
*************************************************************************
**
*************************************************************************
**

Out[42]:
Method R Squared RMSE Train Accuracy Test Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

Observation: This model performs better on training set and poorly on test set which shows that
it's an overfitting and very complex model.
In [43]:  cdf = pd.DataFrame(lr.coef_, X_train.columns, columns=['Coefficients'])
print(cdf)

Coefficients
cement 0.1180563
slag 0.0978802
ash 0.0803260
water -0.1609238
superplastic 0.2661270
coarseagg 0.0089746
fineagg 0.0166621
age 0.2477612

In [44]:  lr.intercept_

Out[44]: -12.664422252648194

Linear Regression using Statsmodels


In [45]:  # R^2 is not always a reliable metric as it always increases with addition of more
# influence on the predicted variable. Instead we use adjusted R^2 which removes t

# OLS library expects the X and y to be given in one single dataframe


concrete_df_train = pd.concat([X_train, y_train], axis=1)
concrete_df_train.head()

lr_ols = smf.ols(formula= 'strength ~ cement + slag + ash + water + superplastic +


data = concrete_df_train).fit()

print(lr_ols.summary()) # Inferential statistics

OLS Regression Results


=========================================================================
=====
Dep. Variable: strength R-squared:
0.732
Model: OLS Adj. R-squared:
0.729
Method: Least Squares F-statistic:
271.2
Date: Wed, 16 Nov 2022 Prob (F-statistic): 2.47
e-221
Time: 13:50:46 Log-Likelihood: -2
861.5
No. Observations: 804 AIC:
5741.
Df Residuals: 795 BIC:
5783.
Df Model: 8
Covariance Type: nonrobust
=========================================================================
=======
coef std err t P>|t| [0.025
0.975]
-------------------------------------------------------------------------
-------
Intercept -12.6644 24.911 -0.508 0.611 -61.564
36.235
cement 0.1181 0.008 14.803 0.000 0.102
0.134
slag 0.0979 0.009 10.312 0.000 0.079
0.117
ash 0.0803 0.012 6.933 0.000 0.058
0.103
water -0.1609 0.038 -4.243 0.000 -0.235
-0.086
superplastic 0.2661 0.091 2.935 0.003 0.088
0.444
coarseagg 0.0090 0.009 1.030 0.303 -0.008
0.026
fineagg 0.0167 0.010 1.660 0.097 -0.003
0.036
age 0.2478 0.009 28.871 0.000 0.231
0.265
=========================================================================
=====
Omnibus: 3.836 Durbin-Watson:
2.224
Prob(Omnibus): 0.147 Jarque-Bera (JB):
4.127
Skew: 0.081 Prob(JB):
0.127
Kurtosis: 3.312 Cond. No. 1.0
7e+05
=========================================================================
=====

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is co
rrectly specified.
[2] The condition number is large, 1.07e+05. This might indicate that the
re are
strong multicollinearity or other numerical problems.

Model Statistical Outputs:

R-squared and Adj. R-squared are very close, it is a sign that the predictors are relevant to the
overall model.

F-statistic = 240.8 is large value of F-statistic and p-value = 1.64e-194 is very close to 0 and
also it is less than 0.05 hence we can reject null hypothesis. That means there is evidence that
there is good amount of linear relationship between target variable (Strength) and the
predictors.

Parameters Estimates and the Associated Statistical Tests:

By looking into OLS summary coefficients column results, they are the same as sklearn linear
model coefficients and even intercept is same.

By looking into OLS summary t-test columns results: constant variable is -0.564, we have a
p-value = 0.573 which is greater than 0.05 then we accept the null hypothesis.

By looking into OLS summary t-test columns results: Cement, Slag, Ash, Water, and
Superplastic are having p-value < 0.05 becuase we are testing t-test at 95% confidence
interval, so we reject null hypothesis and accept alternate hypothesis. That means that there is
evidence that these predictors are having good amount of linear relationship with target
variable.

By looking into OLS summary t-test columns results: Coarseagg and Fineagg are having
p-value > 0.05 because we are testing t-test at 95% confidence interval, so we accept null
hypothesis. That means that there is evidence these predictors are not having good amount of
linear relationship with target variable.

By looking into OLS summary t-test columns results: std err reflects the level of accuracy of the
coefficients. std err values are very close to 0 except intercept that means the level of accuracy
is high.
Residual Tests Results:

Skew: 0.076, there is small tail to left in the residuals distribution. Kurtosis: 3.285, there is a
peak in the residuals distribution. Prob(Omnibus): 0.225, Prob(JB): 0.216 - indicates that
p-value > 0.05 meaning it's not siginificant and data is normally distributed. The condition
number is large, 1.05e+05. This indicates that some of the features are collinear.

Ridge Regression
In [46]:  # Building a Ridge Regression model
rr = Ridge(random_state = 1)

# Train and Test the model


rr_resultsDf = train_test_model(rr, 'Ridge', X_train, X_test, y_train, y_test

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf

Ridge(random_state=1)
*************************************************************************
**
The intercept for our model is 35.338420398009966

The coefficient for cement is 8.934292231972137


The coefficient for slag is 4.899253373111003
The coefficient for ash is 0.8432772820660366
The coefficient for water is -5.030475808532426
The coefficient for superplastic is 1.6939788216511638
The coefficient for coarseagg is -0.8162559653347802
The coefficient for fineagg is -1.441253385091848
The coefficient for age is 9.888591208538848
*************************************************************************
**
[ 8.93429223 4.89925337 0.84327728 -5.03047581 1.69397882 -0.81625597
-1.44125339 9.88859121]

Out[46]:
Method R Squared RMSE Train Accuracy Test Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

Observation: This model performs better on training set as well as test set and RMSE is als
reduced to 6.77
Lasso Regression

In [47]:  # Building a Lasso Regression model


lasso = Lasso(random_state = 1)

# Train and Test the model


lasso_resultsDf = train_test_model(lasso, 'Lasso', X_train, X_test, y_train

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf

Lasso(random_state=1)
*************************************************************************
**
The intercept for our model is 35.338420398009966

The coefficient for cement is 7.999676279735918


The coefficient for slag is 3.962891743494958
The coefficient for ash is 0.0
The coefficient for water is -2.793166133151661
The coefficient for superplastic is 2.550447273600969
The coefficient for coarseagg is -0.0
The coefficient for fineagg is -0.010306103830136706
The coefficient for age is 8.761823835253733
*************************************************************************
**
[ 7.99967628 3.96289174 0. -2.79316613 2.55044727 -0.
-0.0103061 8.76182384]

Out[47]:
Method R Squared RMSE Train Accuracy Test Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636


Observation: This model performs better on training set and performance drops on test set
which shows that it's an overfitting and very complex model.

Adding Interaction Terms - Linear Regression

In [48]:  # Transfom X_train and X_test to polynomial features


pipe = Pipeline([('scaler', PowerTransformer()), ('polynomial', PolynomialFeatures
X_train_poly2 = pd.DataFrame(pipe.fit_transform(X_train))
X_test_poly2 = pd.DataFrame(pipe.fit_transform(X_test))

In [49]:  # Train and Test the model


lr_resultsDf = train_test_model(lr, 'Linear Regression with interaction features'
'none', 4, 'no')

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,lr_resultsDf])
resultsDf

LinearRegression()
*************************************************************************
**
*************************************************************************
**

Out[49]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

Fit a simple non regularized linear model on interaction terms - Ridge Regression
In [50]:  # Building a Ridge Regression model
rr = Ridge(random_state = 1)

# Train and Test the model


rr_resultsDf = train_test_model(rr, 'Ridge with interaction features', X_train_pol

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf

Ridge(random_state=1)
*************************************************************************
**
The intercept for our model is 34.16267671277516

The coefficient for 0 is 0.0


The coefficient for 1 is 11.08194348431824
The coefficient for 2 is 5.546267753440066
The coefficient for 3 is 2.4208404698985264
The coefficient for 4 is -4.1776421316103525
The coefficient for 5 is 1.2044172811213372
The coefficient for 6 is -0.8836859514488947
The coefficient for 7 is -0.275853333645689
The coefficient for 8 is 10.214482140629498
The coefficient for 9 is -3.322211121567527
The coefficient for 10 is -1.290579098806096
The coefficient for 11 is -1.065391581148982
The coefficient for 12 is 0.3432438322590296
The coefficient for 13 is -0.44520817513910244
The coefficient for 14 is -0.18430985567545288
The coefficient for 15 is 0.4193420138468854
The coefficient for 16 is -2.8327863256690495
The coefficient for 17 is 0.3909721094265937
The coefficient for 18 is 1.4739972304860836
The coefficient for 19 is -1.6854416772624385
The coefficient for 20 is 0.32893144855543177
The coefficient for 21 is 1.4961675445297746
The coefficient for 22 is -0.5277370237125991
The coefficient for 23 is -0.9608023832648395
The coefficient for 24 is 0.11411359253743192
The coefficient for 25 is 1.3563595857473605
The coefficient for 26 is 0.9914365736431121
The coefficient for 27 is 0.8651079833055213
The coefficient for 28 is -0.3905272601026541
The coefficient for 29 is 0.6084750994486419
The coefficient for 30 is -0.776198654978076
The coefficient for 31 is 1.1240556018877015
The coefficient for 32 is -0.07885597723686379
The coefficient for 33 is -0.23755533428714926
The coefficient for 34 is 0.6806775573319096
The coefficient for 35 is 0.0002937957890336493
The coefficient for 36 is -0.027356961499184266
*************************************************************************
**
[ 0.00000000e+00 1.10819435e+01 5.54626775e+00 2.42084047e+00
-4.17764213e+00 1.20441728e+00 -8.83685951e-01 -2.75853334e-01
1.02144821e+01 -3.32221112e+00 -1.29057910e+00 -1.06539158e+00
3.43243832e-01 -4.45208175e-01 -1.84309856e-01 4.19342014e-01
-2.83278633e+00 3.90972109e-01 1.47399723e+00 -1.68544168e+00
3.28931449e-01 1.49616754e+00 -5.27737024e-01 -9.60802383e-01
1.14113593e-01 1.35635959e+00 9.91436574e-01 8.65107983e-01
-3.90527260e-01 6.08475099e-01 -7.76198655e-01 1.12405560e+00
-7.88559772e-02 -2.37555334e-01 6.80677557e-01 2.93795789e-04
-2.73569615e-02]

Out[50]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

Observation: Notice that test accuracy is better than Linear regression with interaction features.
This model performs better on training set and performance drops on test set which shows that
it's an overfitting and very complex model.

Fit a simple non regularized linear model on interaction terms - Lasso Regression
In [51]:  # Building a Lasso Regression model
lasso = Lasso(random_state = 1)

# Train and Test the model


lasso_resultsDf = train_test_model(lasso, 'Lasso with interaction features'

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf

Lasso(random_state=1)
*************************************************************************
**
The intercept for our model is 35.63838636257186

The coefficient for 0 is 0.0


The coefficient for 1 is 7.851829082683121
The coefficient for 2 is 3.7915135295361297
The coefficient for 3 is 0.0
The coefficient for 4 is -2.788144790385948
The coefficient for 5 is 2.364747139956996
The coefficient for 6 is -0.0
The coefficient for 7 is -0.09898319962222801
The coefficient for 8 is 8.760402009316195
The coefficient for 9 is -0.0
The coefficient for 10 is -0.0
The coefficient for 11 is -0.0
The coefficient for 12 is 0.0
The coefficient for 13 is 0.0
The coefficient for 14 is -0.0
The coefficient for 15 is -0.0
The coefficient for 16 is -0.0
The coefficient for 17 is 0.0
The coefficient for 18 is 0.0
The coefficient for 19 is -0.09795749267508529
The coefficient for 20 is 0.0
The coefficient for 21 is 0.0
The coefficient for 22 is 0.0
The coefficient for 23 is -0.5576766002974733
The coefficient for 24 is 0.0
The coefficient for 25 is 0.0
The coefficient for 26 is 0.0
The coefficient for 27 is 0.0
The coefficient for 28 is -0.0
The coefficient for 29 is 0.0
The coefficient for 30 is -0.0
The coefficient for 31 is 0.0
The coefficient for 32 is 0.0
The coefficient for 33 is 0.0
The coefficient for 34 is -0.0
The coefficient for 35 is 0.0
The coefficient for 36 is -0.0
*************************************************************************
**
[ 0. 7.85182908 3.79151353 0. -2.78814479 2.36474714
-0. -0.0989832 8.76040201 -0. -0. -0.
0. 0. -0. -0. -0. 0.
0. -0.09795749 0. 0. 0. -0.5576766
0. 0. 0. 0. -0. 0.
-0. 0. 0. 0. -0. 0.
-0. ]

Out[51]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Observation: This model performs better on training set and poorly on test set which shows that
it's an overfitting and very complex model.

Polynomial Linear Regression

In [52]:  ## Let's try polynomial model on the same data from 1 to 5 degree polynomial featu
In [53]:  for i in range(1,6):
pipe = Pipeline([('scaler', PowerTransformer()), ('polynomial', PolynomialFeat
('model', LinearRegression())])
pipe.fit(X_train, y_train) # Fit the model on Training set
prediction = pipe.predict(X_test) # Predict on Test set

r2 = metrics.r2_score(y_test, prediction) # Calculate the r squared value on t


rmse = np.sqrt(metrics.mean_squared_error(y_test, prediction)) # Root mean squ

print ("R-Squared for {0} degree polynomial is {1}".format(i, r2))


print ("ROOT MEAN SQUARED ERROR for {0} degree polynomial features is {1}"

R-Squared for 1 degree polynomial is 0.7993667770002957


ROOT MEAN SQUARED ERROR for 1 degree polynomial features is 6.88010846699
2264

R-Squared for 2 degree polynomial is 0.877656458878097


ROOT MEAN SQUARED ERROR for 2 degree polynomial features is 5.37259846280
4477

R-Squared for 3 degree polynomial is 0.8444487974144416


ROOT MEAN SQUARED ERROR for 3 degree polynomial features is 6.05801782288
19666

R-Squared for 4 degree polynomial is -15.0975422543148


ROOT MEAN SQUARED ERROR for 4 degree polynomial features is 61.6273718183
5552

R-Squared for 5 degree polynomial is -5.875204185323592e+16


ROOT MEAN SQUARED ERROR for 5 degree polynomial features is 3723105497.11
78994

By looking at the above results, RMSE is from 1 degree polynomial has 6.77 RMSE and RMSE
came down to 5.50 for 2 degree polynomial features. From 3 degree polynomial, RMSE starts
increasing hence, optimal degree of polynomial is 2 degree polynomial.

Let's try 2-degree polynomial model on the same data


In [54]:  pipe = Pipeline([('scaler', PowerTransformer()), ('polynomial', PolynomialFeatures
('model', LinearRegression())])

pipe.fit(X_train, y_train) # Fit the model on Training set


prediction = pipe.predict(X_test) # Predict on Test set

r2 = metrics.r2_score(y_test, prediction) # Calculate the r squared value on the T


rmse = np.sqrt(metrics.mean_squared_error(y_test, prediction)) # Root mean squared

print ("R-Squared :", r2)


print ("ROOT MEAN SQUARED ERROR :", rmse)

# Accuracy of Training data set


print("Accuracy of Training data set: {0:.4f} %".format(pipe.score(X_train,

# Accuracy of Test data set


accuracy_score = pipe.score(X_test, y_test)
print("Accuracy of Test data set: {0:.4f} %".format(accuracy_score))

R-Squared : 0.877656458878097
ROOT MEAN SQUARED ERROR : 5.372598462804477
Accuracy of Training data set: 0.8787 %
Accuracy of Test data set: 0.8777 %

In [55]:  # Store the accuracy results for each model in a dataframe for final comparison
poly_resultsDf = pd.DataFrame({'Method': 'Linear Regression with Polynomial featur
'Test Accuracy': accuracy_score}, index=[7])
resultsDf = pd.concat([resultsDf, poly_resultsDf])
resultsDf

Out[55]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Linear Regression with Polynomial


7 0.8776565 5.3725985 0.8787110 0.8776565
features

Fit a simple non regularized linear model on polynomial features - Ridge


Regression

In [56]:  # Transfom X_train and X_test to polynomial features


pipe = Pipeline([('scaler', PowerTransformer()), ('polynomial', PolynomialFeatures
X_train_poly_2 = pd.DataFrame(pipe.fit_transform(X_train))
X_test_poly_2 = pd.DataFrame(pipe.fit_transform(X_test))
In [57]:  # Building a Ridge Regression model
rr = Ridge(random_state = 1)

# Train and Test the model


rr_resultsDf = train_test_model(rr, 'Ridge with polynomial features', X_train_poly

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf,rr_resultsDf])
resultsDf

Ridge(random_state=1)
*************************************************************************
**
The intercept for our model is 20.104208789635525

The coefficient for 0 is 0.0


The coefficient for 1 is 13.704749959416688
The coefficient for 2 is 8.004362365351318
The coefficient for 3 is 4.033594783533428
The coefficient for 4 is -3.1563956415368617
The coefficient for 5 is 0.1561568717777544
The coefficient for 6 is 0.6248713257079128
The coefficient for 7 is 1.6688461051068033
The coefficient for 8 is 9.93595064400738
The coefficient for 9 is -0.09578695313817856
The coefficient for 10 is -2.281922004440568
The coefficient for 11 is -0.41581423550090324
The coefficient for 12 is -3.1369432783115196
The coefficient for 13 is -0.8459685540838413
The coefficient for 14 is -1.4650303632860564
The coefficient for 15 is -1.8541045880105278
The coefficient for 16 is 0.692571659209539
The coefficient for 17 is 9.032228302167399
The coefficient for 18 is -0.9323344151392222
The coefficient for 19 is 0.12234455295228139
The coefficient for 20 is 0.980873088235154
The coefficient for 21 is -0.9184413745549345
The coefficient for 22 is 0.4416126375342167
The coefficient for 23 is 1.7355248719850591
The coefficient for 24 is 8.501660299762236
The coefficient for 25 is -0.9515118750022996
The coefficient for 26 is -0.5855366486080389
The coefficient for 27 is 0.5503734714439078
The coefficient for 28 is 1.4022147326712064
The coefficient for 29 is 0.8836997515664446
The coefficient for 30 is -1.3405854508557467
The coefficient for 31 is -2.801804242251407
The coefficient for 32 is -2.625038597500386
The coefficient for 33 is -2.4493420582472996
The coefficient for 34 is -0.22576441174599005
The coefficient for 35 is -2.920313702666027
The coefficient for 36 is -0.9722541111129857
The coefficient for 37 is -2.274966259346656
The coefficient for 38 is 0.04974050554681743
The coefficient for 39 is -0.7621430626042823
The coefficient for 40 is -1.4949345767265203
The coefficient for 41 is 0.6137639608568684
The coefficient for 42 is -1.4353406788569918
The coefficient for 43 is 0.297057897582116
The coefficient for 44 is -1.08502040977898
*************************************************************************
**
[ 0. 13.70474996 8.00436237 4.03359478 -3.15639564 0.15615687
0.62487133 1.66884611 9.93595064 -0.09578695 -2.281922 -0.41581424
-3.13694328 -0.84596855 -1.46503036 -1.85410459 0.69257166 9.0322283
-0.93233442 0.12234455 0.98087309 -0.91844137 0.44161264 1.73552487
8.5016603 -0.95151188 -0.58553665 0.55037347 1.40221473 0.88369975
-1.34058545 -2.80180424 -2.6250386 -2.44934206 -0.22576441 -2.9203137
-0.97225411 -2.27496626 0.04974051 -0.76214306 -1.49493458 0.61376396
-1.43534068 0.2970579 -1.08502041]

Out[57]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Linear Regression with Polynomial


7 0.8776565 5.3725985 0.8787110 0.8776565
features

8 Ridge with polynomial features 0.8733861 5.4655585 0.8784731 0.8733861

Fit a simple non regularized linear model on polynomial features - Lasso


Regression
In [58]:  # Building a Lasso Regression model
lasso = Lasso(random_state = 1)

# Train and Test the model


lasso_resultsDf = train_test_model(lasso, 'Lasso with polynomial features',

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, lasso_resultsDf])
resultsDf

Lasso(random_state=1)
*************************************************************************
**
The intercept for our model is 36.07805219223952

The coefficient for 0 is 0.0


The coefficient for 1 is 7.92505342354394
The coefficient for 2 is 3.8087605992992457
The coefficient for 3 is 0.0
The coefficient for 4 is -2.7972321880823787
The coefficient for 5 is 2.219381259980926
The coefficient for 6 is -0.0
The coefficient for 7 is -0.05886124954436204
The coefficient for 8 is 8.743686921798869
The coefficient for 9 is 0.0
The coefficient for 10 is -0.0
The coefficient for 11 is -0.0
The coefficient for 12 is -0.0
The coefficient for 13 is 0.0
The coefficient for 14 is 0.0
The coefficient for 15 is -0.0
The coefficient for 16 is -0.0
The coefficient for 17 is 0.0
The coefficient for 18 is -0.0
The coefficient for 19 is 0.0
The coefficient for 20 is 0.0
The coefficient for 21 is -0.08771734981239449
The coefficient for 22 is 0.0
The coefficient for 23 is 0.0
The coefficient for 24 is 0.0
The coefficient for 25 is 0.0
The coefficient for 26 is -0.5420304531967652
The coefficient for 27 is 0.0
The coefficient for 28 is 0.0
The coefficient for 29 is 0.0
The coefficient for 30 is 0.0
The coefficient for 31 is 0.0
The coefficient for 32 is -0.0
The coefficient for 33 is -0.0
The coefficient for 34 is -0.0
The coefficient for 35 is -0.0
The coefficient for 36 is 0.0
The coefficient for 37 is 0.0
The coefficient for 38 is 0.0
The coefficient for 39 is 0.0
The coefficient for 40 is -0.0
The coefficient for 41 is 0.0
The coefficient for 42 is -0.0
The coefficient for 43 is -0.0
The coefficient for 44 is -0.44558873341412325
*************************************************************************
**
[ 0. 7.92505342 3.8087606 0. -2.79723219 2.21938126
-0. -0.05886125 8.74368692 0. -0. -0.
-0. 0. 0. -0. -0. 0.
-0. 0. 0. -0.08771735 0. 0.
0. 0. -0.54203045 0. 0. 0.
0. 0. -0. -0. -0. -0.
0. 0. 0. 0. -0. 0.
-0. -0. -0.44558873]

Out[58]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Linear Regression with Polynomial


7 0.8776565 5.3725985 0.8787110 0.8776565
features

8 Ridge with polynomial features 0.8733861 5.4655585 0.8784731 0.8733861

9 Lasso with polynomial features 0.7647429 7.4501533 0.7942836 0.7647429

Observation: This model performs better on training set and poorly on test set which shows that
it's an overfitting and very complex model.
6.3 Explore for gaussians. If data is likely to be a mix of gaussians,
explore individual clusters and presenting findings in terms of the
independent attributes and their suitability to predict strength

K Means Clustering

In [59]:  # Scale the data using PowerTransformer


scale = PowerTransformer()
concrete_df_scaled = pd.DataFrame(scale.fit_transform(concrete_df))

In [60]:  cluster_range = range(1, 15)


cluster_errors = []
for num_clusters in cluster_range:
clusters = KMeans(n_clusters = num_clusters, n_init = 5, random_state =
clusters.fit(concrete_df_scaled)

labels = clusters.labels_
centroids = clusters.cluster_centers_

cluster_errors.append(clusters.inertia_ )

clusters_df = pd.DataFrame({ "num_clusters":cluster_range, "cluster_errors"


clusters_df[0:15]

Out[60]:
num_clusters cluster_errors

0 1 9045.0000000

1 2 7184.1710349

2 3 6171.5046087

3 4 5406.3104852

4 5 4916.4063781

5 6 4433.3515066

6 7 4089.0709455

7 8 3805.7902271

8 9 3633.1877291

9 10 3438.5080988

10 11 3303.1169542

11 12 3091.2006462

12 13 2988.2801883

13 14 2870.5853780
In [61]:  # Elbow plot
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o"

In [62]:  # k = 6
cluster = KMeans(n_clusters = 6, random_state = 1)
cluster.fit(concrete_df_scaled)

Out[62]: KMeans(n_clusters=6, random_state=1)

In [63]:  # Creating a new column "GROUP" which will hold the cluster id of each record
prediction=cluster.predict(concrete_df_scaled)
concrete_df_scaled["GROUP"] = prediction

In [64]:  centroids = cluster.cluster_centers_


centroids

Out[64]: array([[-0.49788956, -0.70084416, 1.08010433, -0.6455165 , 0.6166365 ,


0.6187428 , 0.47157378, 0.01551861, -0.15664726],
[ 0.76082127, 0.27816954, -0.87757638, 1.54832179, -1.14902254,
-0.04235032, -1.58810862, 1.14579238, 0.77515871],
[-0.53189637, 0.50905562, 1.02219215, 0.43608761, 0.62840773,
-1.00781291, -0.33696493, -0.01156446, -0.08081618],
[ 1.01540893, 0.64108139, -0.41259592, -0.97356604, 0.98008716,
-0.57991469, 0.14760417, -0.08494664, 1.09319782],
[ 0.61302543, -1.03762913, -0.92298759, 0.40787035, -1.1613292 ,
0.55761958, 0.22169431, -0.18034126, -0.5277424 ],
[-1.03674194, 1.12842068, -0.91230705, 0.4592532 , -0.88284638,
0.03252593, -0.01337835, -0.36105052, -0.77439284]])
In [65]:  centroid_df = pd.DataFrame(centroids, columns = list(concrete_df))
centroid_df

Out[65]:
cement slag ash water superplastic coarseagg fineagg

0 -0.4978896 -0.7008442 1.0801043 -0.6455165 0.6166365 0.6187428 0.4715738 0.0155186

1 0.7608213 0.2781695 -0.8775764 1.5483218 -1.1490225 -0.0423503 -1.5881086 1.1457924

2 -0.5318964 0.5090556 1.0221921 0.4360876 0.6284077 -1.0078129 -0.3369649 -0.0115645

3 1.0154089 0.6410814 -0.4125959 -0.9735660 0.9800872 -0.5799147 0.1476042 -0.0849466

4 0.6130254 -1.0376291 -0.9229876 0.4078704 -1.1613292 0.5576196 0.2216943 -0.1803413

5 -1.0367419 1.1284207 -0.9123071 0.4592532 -0.8828464 0.0325259 -0.0133783 -0.3610505

In [66]:  ## Instead of interpreting the neumerical values of the centroids, let us do a vis
## centroids and the data in the cluster into box plots.
concrete_df_scaled.boxplot(by = 'GROUP', layout=(3,3), figsize=(15, 10));

Here, none of the dimensions are good predictor of target variable. For all the dimensions
(variables) every cluster have a similar range of values except in one case. We can see that the
body of the cluster are overlapping. So in k means, though, there are clusters in datasets on
different dimensions. But we can not see any distinct characteristics of these clusters which tell
us to break data into different clusters and build separate models for them.

KNN Regressor
In [67]:  def train_test_transform(X_train, X_test):
scale = PowerTransformer()

X_train_scaled = pd.DataFrame(scale.fit_transform(X_train))
X_test_scaled = pd.DataFrame(scale.fit_transform(X_test))

return X_train_scaled, X_test_scaled

In [68]:  # empty list that will hold error


error = []

X_train_scaled, X_test_scaled = train_test_transform(X_train, X_test)

# perform error metrics for values from 1,2,3....29


for k in range(1,30):

knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train_scaled, y_train)
# predict the response
y_pred = knn.predict(X_test_scaled)
error.append(np.mean(y_pred != y_test))

In [69]:  plt.figure(figsize=(12,6))
plt.plot(range(1,30), error, color='red', linestyle='dashed',marker='o',markerface
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean error')

Out[69]: Text(0, 0.5, 'Mean error')

Optimal value of K is 2
In [70]:  # Building a KNN Regression model
knn = KNeighborsRegressor(n_neighbors = 2)

# Train and Test the model


knn_resultsDf = train_test_model(knn, 'KNeighborsRegressor', X_train, X_test

# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, knn_resultsDf])
resultsDf

KNeighborsRegressor(n_neighbors=2)
*************************************************************************
**
*************************************************************************
**

Out[70]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Linear Regression with Polynomial


7 0.8776565 5.3725985 0.8787110 0.8776565
features

8 Ridge with polynomial features 0.8733861 5.4655585 0.8784731 0.8733861

9 Lasso with polynomial features 0.7647429 7.4501533 0.7942836 0.7647429

10 KNeighborsRegressor 0.7661613 7.4276604 0.9575685 0.7661613

Observation: This model performs better on training set and poorly on test set which shows that
it's an overfitting and very complex model.

Build SupportVectorRegressor, RandomForestRegressor, BaggingRegressor,


ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor,
CatBoostRegressor and XGBRegressor models
In [71]:  # define regressor models
models=[
['SVR',SVR(kernel='linear')],
['DecisionTreeRegressor', DecisionTreeRegressor(random_state = 1)],
['RandomForestRegressor',RandomForestRegressor(random_state = 1)],
['BaggingRegressor',BaggingRegressor(random_state = 1)],
['ExtraTreesRegressor',ExtraTreesRegressor(random_state = 1)],
['AdaBoostRegressor',AdaBoostRegressor(random_state = 1)],
['GradientBoostingRegressor',GradientBoostingRegressor(random_state = 1
['CatBoostRegressor',CatBoostRegressor(random_state = 1, verbose=False)],
['XGBRegressor',XGBRegressor()]
]

i = 11
for name, regressor in models:
if name == 'SVR':
# Train and Test the model
svr_resultsDf = train_test_model(regressor, name, X_train, X_test,

# Store the accuracy results for each model in a dataframe for final compa
resultsDf = pd.concat([resultsDf, svr_resultsDf])
elif name == 'BaggingRegressor':
# Train and Test the model
bag_resultsDf = train_test_model(regressor, name, X_train, X_test,

# Store the accuracy results for each model in a dataframe for final compa
resultsDf = pd.concat([resultsDf, bag_resultsDf])
else:
# Train and Test the model
ensemble_resultsDf = train_test_model(regressor, name, X_train, X_test

# Store the accuracy results for each model in a dataframe for final compa
resultsDf = pd.concat([resultsDf, ensemble_resultsDf])
i = i+1

SVR(kernel='linear')
*************************************************************************
**
The intercept for our model is [35.08548458]

The coefficient for cement is 8.64368465026335


The coefficient for slag is 4.562130554549007
The coefficient for ash is 0.938659559058852
The coefficient for water is -4.182484394140522
The coefficient for superplastic is 2.019012565873412
The coefficient for coarseagg is -0.8975116629693762
The coefficient for fineagg is -1.151398949430746
The coefficient for age is 9.843060616392181
*************************************************************************
**
[[ 8.64368465 4.56213055 0.93865956 -4.18248439 2.01901257 -0.89751166
-1.15139895 9.84306062]]
DecisionTreeRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
RandomForestRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
BaggingRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
ExtraTreesRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
AdaBoostRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
GradientBoostingRegressor(random_state=1)
*************************************************************************
**
*************************************************************************
**
<catboost.core.CatBoostRegressor object at 0x0000020FAAC08A90>
*************************************************************************
**
*************************************************************************
**
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=No
ne,
gamma=None, gpu_id=None, grow_policy=None, importance_type=N
one,
interaction_constraints=None, learning_rate=None, max_bin=No
ne,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=Non
e,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None, ...)
*************************************************************************
**
*************************************************************************
**
By looking at above feature importnce from ensemble models: Cement, Slag, Ash, Water,
Superplastic, Coarsegg and fineagg are top important features
In [72]:  # Show results dataframe
resultsDf

Out[72]:
R Train Test
Method RMSE
Squared Accuracy Accuracy

1 LinearRegression 0.6705317 8.8165909 0.7318618 0.6705317

2 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

3 Lasso 0.7665636 7.4212693 0.7889735 0.7665636

Linear Regression with interaction


4 0.8525819 5.8975189 0.8613861 0.8525819
features

5 Ridge with interaction features 0.8526972 5.8952119 0.8613722 0.8526972

6 Lasso with interaction features 0.7588601 7.5427271 0.7911381 0.7588601

Linear Regression with Polynomial


7 0.8776565 5.3725985 0.8787110 0.8776565
features

8 Ridge with polynomial features 0.8733861 5.4655585 0.8784731 0.8733861

9 Lasso with polynomial features 0.7647429 7.4501533 0.7942836 0.7647429

10 KNeighborsRegressor 0.7661613 7.4276604 0.9575685 0.7661613

11 SVR 0.7931077 6.9866018 0.8097417 0.7931077

12 DecisionTreeRegressor 0.7099813 8.2719308 0.9957330 0.7099813

13 RandomForestRegressor 0.8838252 5.2354002 0.9839902 0.8838252

14 BaggingRegressor 0.8793071 5.3362331 0.9771215 0.8793071

15 ExtraTreesRegressor 0.8650204 5.6432331 0.9957330 0.8650204

16 AdaBoostRegressor 0.7807806 7.1917301 0.8128005 0.7807806

17 GradientBoostingRegressor 0.8987392 4.8878126 0.9463166 0.8987392

18 CatBoostRegressor 0.9369514 3.8568391 0.9882249 0.9369514

19 XGBRegressor 0.9092400 4.6274428 0.9947825 0.9092400

6.4 Overall Summary - Before feature selection

I am able to predict the Concrete compressive strength using few ingradients with below details.

Refer the above table:

I have tried with simple linear regression which is overfit model hence I moved on to non-
regularized models.

a. Ridge performs better on both training and test set.

b. Lasso performs better on training set and poorly on test set.

As mentioned in Multi-variate analysis, there is some interaction between independent features


hence I have tried with simple linear regression and non-regularized models (Ridge and Lasso)
and all of them turned out to be overfit models.

As mentioned in Multi-variate analysis, there are some non-linear(curvy-linear) relatioship within


independent features as well as with target variable hence I have tried with polynomial features.

a. Simple linear regression with polynomial features with degree = 2 performs better on both
training and test set with 1% difference.

b. Ridge and Lasso with polynomial features turned out to be overfit models.

I have tried with Support Vector Regressor and it performs better on both training and test set.

I have tried with KNeighborsRegressor, DecisionTreeRegressor, RandomForestRegressor,


BaggingRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor,
CatBoostRegressor and XGBRegressor models, sad news is all these models turned out to be
overfit models.

Best models are as follows:

a. Linear Regression with Polynomial features - Test accuracy = 86.94% with RMSE = 5.50

b. Ridge regression with original features - Test accuracy = 80.24% with RMSE = 6.77

c. SVR with original features - Test accuracy = 80.03% with RMSE = 6.81

7 Optimization

Ridge and SVR models - Hyperparameter tuning with original


features
In [73]:  # define regressor models
models=[['Ridge',Ridge()],
#['Lasso',Lasso()],
#['KNeighborsRegressor',KNeighborsRegressor()],
['SVR',SVR()]
#['RandomForestRegressor',RandomForestRegressor()],
#['BaggingRegressor',BaggingRegressor()],
#['ExtraTreesRegressor',ExtraTreesRegressor()],
#['AdaBoostRegressor',AdaBoostRegressor()],
#['GradientBoostingRegressor',GradientBoostingRegressor()],
#['CatBoostRegressor',CatBoostRegressor(verbose=False)],
#['XGBRegressor',XGBRegressor()]
]

# define model parameters


ridge_param_grid = {'alpha': [1,0.1,0.01,0.001,0.0001,0]}
lasso_param_grid = {'alpha': [0.02, 0.024, 0.025, 0.026, 0.03]}
knn_param_grid = {'n_neighbors': range(3, 21, 2),
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski']}
svr_param_grid = {'kernel': ['poly', 'rbf', 'sigmoid'],
'C': [50, 10, 1.0, 0.1, 0.01],
'gamma': ['scale']}
rf_param_grid = {'n_estimators': [10, 100, 1000],
'max_features': ['auto', 'sqrt', 'log2']}
bag_param_grid = {'n_estimators': [10, 100, 1000],
'max_samples': np.arange(0.7, 0.8, 0.05)}
et_param_grid = {'n_estimators': np.arange(10,100,10),
'max_features': ['auto', 'sqrt', 'log2'],
'min_samples_split': np.arange(2,15,1)}
adb_param_grid = {'n_estimators': np.arange(30,100,10),
'learning_rate': np.arange(0.1,1,0.5)}
gb_param_grid = {'n_estimators': np.arange(30,100,10),
'learning_rate': np.arange(0.1,1,0.5)}
catb_param_grid = {'depth': [4, 7, 10],
'learning_rate' : [0.03, 0.1, 0.15],
'l2_leaf_reg': [1,4,9],
'iterations': [300]}
xgb_param_grid = {'learning_rate': [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ],
'max_depth' : [ 3, 4, 5, 6, 8, 10, 12, 15],
'min_child_weight': [ 1, 3, 5, 7],
'gamma': [0.0, 0.1, 0.2 , 0.3, 0.4],
'colsample_bytree': [ 0.3, 0.4, 0.5 , 0.7]}

for name, regressor in models:


if name == 'Ridge':
ridge_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'Lasso':
lasso_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'KNeighborsRegressor':
knn_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'SVR':
svr_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'RandomForestRegressor':
rf_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'BaggingRegressor':
bag_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'ExtraTreesRegressor':
et_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'AdaBoostRegressor':
adb_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'GradientBoostingRegressor':
gb_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'CatBoostRegressor':
catb_best_estimator = hyperparameterstune_model(name, regressor, X_train
elif name == 'XGBRegressor':
xgb_best_estimator = hyperparameterstune_model(name, regressor, X_train
Ridge - Least: RMSE 8.589996 using {'alpha': 1}
Total duration 4.47681736946106

SVR - Least: RMSE 8.340742 using {'C': 50, 'gamma': 'scale', 'kernel': 'p
oly'}
Total duration 2.113403797149658

Ridge and SVR models with Hyperparameters


In [74]:  # define regressor models
models=[['Ridge', ridge_best_estimator],
#['Lasso', lasso_best_estimator],
#['KNeighborsRegressor', knn_best_estimator],
['SVR', svr_best_estimator],
#['RandomForestRegressor', rf_best_estimator],
#['BaggingRegressor', bag_best_estimator],
#['ExtraTreesRegressor',et_best_estimator],
#['AdaBoostRegressor', adb_best_estimator],
#['GradientBoostingRegressor', gb_best_estimator],
#['CatBoostRegressor', catb_best_estimator],
#['XGBRegressor', xgb_best_estimator]
]

resultsDf_hp = pd.DataFrame()
i = 1
for name, regressor in models:
# Train and Test the model
resultsDf_hp_ind = train_test_model(regressor, name, X_train, X_test, y_train

# Store the accuracy results for each model in a dataframe for final compariso
resultsDf_hp = pd.concat([resultsDf_hp, resultsDf_hp_ind])
i = i+1

# Show results dataframe


resultsDf_hp

Ridge(alpha=1)
*************************************************************************
**
*************************************************************************
**
SVR(C=50, kernel='poly')
*************************************************************************
**
*************************************************************************
**

Out[74]:
Method R Squared RMSE Train Accuracy Test Accuracy

1 Ridge 0.7993560 6.8802934 0.8126139 0.7993560

2 SVR 0.8075852 6.7377216 0.8827241 0.8075852

7.2 Bootstrap Sampling - Model performance range at 95% confidence level

In [75]:  # Drop K-means cluster group from concrete_df_scaled dataset


concrete_df_scaled.drop(columns=['GROUP'], axis=1, inplace=True)

Current Time
In [76]:  import datetime
now = datetime.datetime.now()
print ("Current date and time : ")
print (now.strftime("%Y-%m-%d %H:%M:%S"))

Current date and time :


2022-11-16 13:51:04

Ridge Regressor
In [77]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of ridge regressor samples to create
n_iterations = 1000

# size of a ridge regressor sample


n_size = int(len(concrete_df_scaled) * 1)

# run ridge regressor


# empty list that will hold the scores for each bootstrap iteration
rr_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
rrTree = Ridge(alpha=1)
# fit against independent variables and corresponding target values
rrTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = rrTree.score(test[:, :-1] , y_bs_test)
predictions = rrTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
rr_stats.append(score)

rr_duration = duration

0 0.8117313410720539 2022-11-16 13:51:04 0:00:00


1 0.8352662684288275 2022-11-16 13:51:05 0:00:00.398862
2 0.8044342793665806 2022-11-16 13:51:05 0:00:00.777577
3 0.8124383177897976 2022-11-16 13:51:06 0:00:01.132234
4 0.8160672041224952 2022-11-16 13:51:06 0:00:01.518638
5 0.8277613312146704 2022-11-16 13:51:06 0:00:01.872222
6 0.8145630752643905 2022-11-16 13:51:07 0:00:02.228114
7 0.8267487580108295 2022-11-16 13:51:07 0:00:02.607374
8 0.8252370998522808 2022-11-16 13:51:07 0:00:03.014051
9 0.7815118883191534 2022-11-16 13:51:08 0:00:03.412333
10 0.8021954534184204 2022-11-16 13:51:08 0:00:03.716447
11 0.819301295153446 2022-11-16 13:51:08 0:00:04.060270
12 0.8153309303323101 2022-11-16 13:51:09 0:00:04.415410
13 0.8160692559923493 2022-11-16 13:51:09 0:00:04.772076
14 0.801515788221613 2022-11-16 13:51:10 0:00:05.163445
15 0.8118202181350695 2022-11-16 13:51:10 0:00:05.546695
16 0.8393645383356589 2022-11-16 13:51:10 0:00:05.888241
17 0.8271721085403838 2022-11-16 13:51:11 0:00:06.181608
18 0.8183695451922827 2022-11-16 13:51:11 0:00:06.529741
19 0.7798589621457428 2022-11-16 13:51:11 0:00:06.886215
In [78]:  # plot scores
plt.hist(rr_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(rr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(rr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 79.0% and 84.0%

Lasso Regressor
In [79]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of ridge regressor samples to create
n_iterations = 1000

# size of a ridge regressor sample


n_size = int(len(concrete_df_scaled) * 1)

# run ridge regressor


# empty list that will hold the scores for each bootstrap iteration
lr_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
lrTree = Lasso(alpha=0.02)
# fit against independent variables and corresponding target values
lrTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = lrTree.score(test[:, :-1] , y_bs_test)
predictions = lrTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
lr_stats.append(score)

lr_duration = duration

0 0.8121584815965095 2022-11-16 13:57:08 0:00:00.000796


1 0.8034978250431695 2022-11-16 13:57:08 0:00:00.354494
2 0.8175278569028401 2022-11-16 13:57:09 0:00:00.581909
3 0.8296483365581605 2022-11-16 13:57:09 0:00:00.895752
4 0.8290481270468941 2022-11-16 13:57:09 0:00:01.210309
5 0.8197940722637796 2022-11-16 13:57:10 0:00:01.524347
6 0.8155262859281729 2022-11-16 13:57:10 0:00:01.833553
7 0.8007393642604868 2022-11-16 13:57:10 0:00:02.148270
8 0.7987141545582171 2022-11-16 13:57:10 0:00:02.491468
9 0.8113456902960884 2022-11-16 13:57:11 0:00:02.793721
10 0.8231791612483502 2022-11-16 13:57:11 0:00:03.176438
11 0.8208678699645562 2022-11-16 13:57:12 0:00:03.535853
12 0.8368488701580691 2022-11-16 13:57:12 0:00:03.888150
13 0.8201524583386433 2022-11-16 13:57:12 0:00:04.231302
14 0.8155017367280623 2022-11-16 13:57:13 0:00:04.581643
15 0.8004986748152144 2022-11-16 13:57:13 0:00:04.921128
16 0.8322222583991665 2022-11-16 13:57:13 0:00:05.146714
17 0.8126451962675483 2022-11-16 13:57:13 0:00:05.444102
18 0.8022605984852113 2022-11-16 13:57:14 0:00:05.759413
19 0.7963693354408918 2022-11-16 13:57:14 0:00:06.063035
In [80]:  # plot scores
plt.hist(lr_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(lr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(lr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 78.8% and 83.6%

KNeighbors Regressor
In [81]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of ridge regressor samples to create
n_iterations = 1000

# size of a ridge regressor sample


n_size = int(len(concrete_df_scaled) * 1)

# run ridge regressor


# empty list that will hold the scores for each bootstrap iteration
kn_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
knTree = KNeighborsRegressor(metric='euclidean', n_neighbors=7, weights

# fit against independent variables and corresponding target values


knTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = knTree.score(test[:, :-1] , y_bs_test)
predictions = knTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
kn_stats.append(score)

kn_duration = duration

0 0.8373365321291107 2022-11-16 14:02:31 0:00:00.000433


1 0.8351747167260578 2022-11-16 14:02:31 0:00:00.353537
2 0.8654983016833877 2022-11-16 14:02:32 0:00:00.646133
3 0.8537150902832534 2022-11-16 14:02:32 0:00:01.014423
4 0.8776536381132831 2022-11-16 14:02:32 0:00:01.383467
5 0.8299803867142509 2022-11-16 14:02:33 0:00:01.722932
6 0.8573042886707015 2022-11-16 14:02:33 0:00:02.060975
7 0.8336298094170814 2022-11-16 14:02:33 0:00:02.380636
8 0.884443354464485 2022-11-16 14:02:34 0:00:02.709329
9 0.8809825062866832 2022-11-16 14:02:34 0:00:02.947693
10 0.8434597555795651 2022-11-16 14:02:34 0:00:03.274889
11 0.8800122871484926 2022-11-16 14:02:35 0:00:03.597945
12 0.874801943512366 2022-11-16 14:02:35 0:00:03.918780
13 0.824689190081288 2022-11-16 14:02:35 0:00:04.250392
14 0.8528428142508123 2022-11-16 14:02:36 0:00:04.561458
15 0.857295929168791 2022-11-16 14:02:36 0:00:04.886410
16 0.8353360064866507 2022-11-16 14:02:36 0:00:05.119623
17 0.8362434531587024 2022-11-16 14:02:36 0:00:05.444834
18 0.871400951346212 2022-11-16 14:02:37 0:00:05.763741
19 0.8566284299079511 2022-11-16 14:02:37 0:00:06.082646
In [82]:  # plot scores
plt.hist(kn_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(kn_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(kn_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 82.2% and 88.9%

SVR
In [83]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of ridge regressor samples to create
n_iterations = 1000

# size of a ridge regressor sample


n_size = int(len(concrete_df_scaled) * 1)

# run ridge regressor


# empty list that will hold the scores for each bootstrap iteration
svr_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
svrTree = SVR(C=50, gamma='scale', kernel='poly')
# fit against independent variables and corresponding target values
svrTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = svrTree.score(test[:, :-1] , y_bs_test)
predictions = svrTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
svr_stats.append(score)

svr_duration = duration

0 0.8071549214427178 2022-11-16 14:07:47 0:00:00


1 0.813903248725636 2022-11-16 14:07:51 0:00:03.393278
2 0.7893698893453589 2022-11-16 14:07:54 0:00:07.292968
3 0.8305481756171962 2022-11-16 14:07:57 0:00:10.285267
4 0.8060042099034616 2022-11-16 14:08:00 0:00:12.342392
5 0.839187484409504 2022-11-16 14:08:03 0:00:15.505084
6 0.8050730712841294 2022-11-16 14:08:05 0:00:18.106870
7 0.8447845907220685 2022-11-16 14:08:08 0:00:20.751886
8 0.7907933786971704 2022-11-16 14:08:11 0:00:23.535649
9 0.8553330351829087 2022-11-16 14:08:13 0:00:25.394966
10 0.8062799978735604 2022-11-16 14:08:15 0:00:27.333305
11 0.8628443817140082 2022-11-16 14:08:17 0:00:29.776248
12 0.8320872031191263 2022-11-16 14:08:19 0:00:32.020782
13 0.782007434758525 2022-11-16 14:08:21 0:00:34.185121
14 0.8313336304145604 2022-11-16 14:08:24 0:00:36.852606
15 0.8445620631754684 2022-11-16 14:08:26 0:00:38.838819
16 0.845061760451321 2022-11-16 14:08:29 0:00:41.368098
17 0.8248786829124071 2022-11-16 14:08:31 0:00:43.341499
18 0.8303964515108847 2022-11-16 14:08:32 0:00:45.252150
19 0.8319711290187222 2022-11-16 14:08:35 0:00:47.993958
In [84]:  # plot scores
plt.hist(svr_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(svr_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(svr_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 74.5% and 87.0%

Bagging Regressor
In [85]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
brm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
brmTree = BaggingRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


brmTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = brmTree.score(test[:, :-1] , y_bs_test)
predictions = brmTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
brm_stats.append(score)

brm_duration = duration

0 0.9187448135150114 2022-11-16 15:31:19 0:00:00


1 0.8950917200914986 2022-11-16 15:31:20 0:00:00.561253
2 0.8931209397660175 2022-11-16 15:31:20 0:00:01.085392
3 0.9095159888186845 2022-11-16 15:31:21 0:00:01.661913
4 0.8949618226351971 2022-11-16 15:31:21 0:00:02.199554
5 0.9016044815541759 2022-11-16 15:31:22 0:00:02.839043
6 0.8678703555713387 2022-11-16 15:31:23 0:00:03.597715
7 0.8676515386584804 2022-11-16 15:31:23 0:00:04.204904
8 0.8997710503837313 2022-11-16 15:31:24 0:00:04.830560
9 0.8764296667808901 2022-11-16 15:31:24 0:00:05.422590
10 0.9036268421628959 2022-11-16 15:31:25 0:00:05.961343
11 0.9031593054034317 2022-11-16 15:31:25 0:00:06.488579
12 0.8929384326784029 2022-11-16 15:31:26 0:00:06.988337
13 0.8890397998981584 2022-11-16 15:31:26 0:00:07.478039
14 0.8760757185637711 2022-11-16 15:31:27 0:00:07.940305
15 0.9102438177907206 2022-11-16 15:31:27 0:00:08.438231
16 0.8767620680960291 2022-11-16 15:31:28 0:00:08.974790
17 0.9062227953150955 2022-11-16 15:31:28 0:00:09.486851
18 0.9026419949445769 2022-11-16 15:31:29 0:00:10.004392
19 0.8732841527843845 2022-11-16 15:31:30 0:00:10.615803
In [86]:  # plot scores
plt.hist(brm_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(brm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(brm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 86.6% and 91.7%

Extra Trees Regressor


In [87]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
etm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
etmTree = ExtraTreesRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


etmTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = etmTree.score(test[:, :-1] , y_bs_test)
predictions = etmTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
etm_stats.append(score)

etm_duration = duration

0 0.9214138685209669 2022-11-16 15:39:44 0:00:00.000637


1 0.9008689417035448 2022-11-16 15:39:44 0:00:00.508287
2 0.9117583981709843 2022-11-16 15:39:45 0:00:00.995610
3 0.8959460935545699 2022-11-16 15:39:45 0:00:01.526887
4 0.9103810818549599 2022-11-16 15:39:46 0:00:01.965979
5 0.9016806229245191 2022-11-16 15:39:46 0:00:02.503165
6 0.9002684934301688 2022-11-16 15:39:47 0:00:03.011582
7 0.9081780900403188 2022-11-16 15:39:47 0:00:03.524865
8 0.8972749899955198 2022-11-16 15:39:48 0:00:04.008095
9 0.9018052622758782 2022-11-16 15:39:48 0:00:04.498374
10 0.9154509999460958 2022-11-16 15:39:49 0:00:04.995764
11 0.8823598806422398 2022-11-16 15:39:49 0:00:05.495299
12 0.8966176444617057 2022-11-16 15:39:49 0:00:05.901635
13 0.9034385358713337 2022-11-16 15:39:50 0:00:06.384719
14 0.9221239219299666 2022-11-16 15:39:50 0:00:06.856717
15 0.8978393669871326 2022-11-16 15:39:51 0:00:07.364104
16 0.9051067773467005 2022-11-16 15:39:51 0:00:07.869500
17 0.9131056601639014 2022-11-16 15:39:52 0:00:08.391677
18 0.8887837811952229 2022-11-16 15:39:52 0:00:08.904346
19 0.9211096697033392 2022-11-16 15:39:53 0:00:09.295210
In [88]:  # plot scores
plt.hist(etm_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(etm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(etm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 88.2% and 92.5%

Ada Boost Regressor


In [89]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
adm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
admTree = AdaBoostRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


admTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = admTree.score(test[:, :-1] , y_bs_test)
predictions = admTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
adm_stats.append(score)

adm_duration = duration

0 0.7938869679156932 2022-11-16 15:46:57 0:00:00


1 0.8041607197541046 2022-11-16 15:46:58 0:00:00.423597
2 0.8035584444002472 2022-11-16 15:46:58 0:00:00.738413
3 0.7821121259752446 2022-11-16 15:46:59 0:00:01.167161
4 0.8198282058110666 2022-11-16 15:46:59 0:00:01.591825
5 0.8115119825928907 2022-11-16 15:47:00 0:00:02.026594
6 0.8199434796434563 2022-11-16 15:47:00 0:00:02.450724
7 0.8168119337068934 2022-11-16 15:47:00 0:00:02.898815
8 0.8007285410378383 2022-11-16 15:47:01 0:00:03.342246
9 0.8150299345385938 2022-11-16 15:47:01 0:00:03.764647
10 0.781695767357539 2022-11-16 15:47:02 0:00:04.098660
11 0.7860138145098039 2022-11-16 15:47:02 0:00:04.510308
12 0.8062001008249854 2022-11-16 15:47:02 0:00:04.941560
13 0.7747136314369079 2022-11-16 15:47:03 0:00:05.365173
14 0.7863898729785167 2022-11-16 15:47:03 0:00:05.789283
15 0.813323138394588 2022-11-16 15:47:04 0:00:06.214724
16 0.7718903488920885 2022-11-16 15:47:04 0:00:06.621768
17 0.7903816551703471 2022-11-16 15:47:04 0:00:06.966379
18 0.7916427635757171 2022-11-16 15:47:05 0:00:07.391875
19 0.7949144873940841 2022-11-16 15:47:05 0:00:07.803614
In [90]:  # plot scores
plt.hist(adm_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(adm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(adm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 76.9% and 82.9%

Cat Boost Regressor


In [91]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
cbm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
cbmTree = CatBoostRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


cbmTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = cbmTree.score(test[:, :-1] , y_bs_test)
predictions = cbmTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
cbm_stats.append(score)

cbm_duration = duration

Learning rate set to 0.468014


0: learn: 0.7166347 total: 1.64ms remaining: 80.2ms
1: learn: 0.5549921 total: 3.07ms remaining: 73.6ms
2: learn: 0.4735941 total: 4.27ms remaining: 66.9ms
3: learn: 0.4161905 total: 5.48ms remaining: 63ms
4: learn: 0.3762747 total: 6.77ms remaining: 60.9ms
5: learn: 0.3510912 total: 7.98ms remaining: 58.5ms
6: learn: 0.3309939 total: 9.06ms remaining: 55.6ms
7: learn: 0.3112904 total: 10.2ms remaining: 53.3ms
8: learn: 0.2971811 total: 11.3ms remaining: 51.3ms
9: learn: 0.2857111 total: 12.4ms remaining: 49.4ms
10: learn: 0.2777349 total: 13.5ms remaining: 47.7ms
11: learn: 0.2660964 total: 14.6ms remaining: 46.3ms
12: learn: 0.2550353 total: 16.2ms remaining: 46ms
13: learn: 0.2456012 total: 17.3ms remaining: 44.4ms
14: learn: 0.2402010 total: 18.4ms remaining: 42.9ms
15: learn: 0.2379343 total: 19.5ms remaining: 41.5ms
16: learn: 0.2313927 total: 20.6ms remaining: 40ms
17: learn: 0.2275254 total: 21.7ms remaining: 38.6ms
18: learn: 0.2234009 total: 22.8ms remaining: 37.2ms
In [92]:  # plot scores
plt.hist(cbm_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(cbm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(cbm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 89.5% and 93.6%

XGB Regressor
In [93]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
xgb_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
xgbTree = XGBRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


xgbTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = xgbTree.score(test[:, :-1] , y_bs_test)
predictions = xgbTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
xgb_stats.append(score)

xgb_duration = duration

0 0.9009717669958783 2022-11-16 16:01:54 0:00:00


1 0.9016454749360853 2022-11-16 16:01:55 0:00:00.425879
2 0.9211389690673005 2022-11-16 16:01:55 0:00:00.932945
3 0.9028020892335131 2022-11-16 16:01:56 0:00:01.500931
4 0.9073275108139568 2022-11-16 16:01:56 0:00:02.066093
5 0.916887778999676 2022-11-16 16:01:57 0:00:02.611134
6 0.9199879041970025 2022-11-16 16:01:57 0:00:03.160686
7 0.9219606879313536 2022-11-16 16:01:58 0:00:03.713257
8 0.8960962359342337 2022-11-16 16:01:58 0:00:04.171670
9 0.919981738807081 2022-11-16 16:01:59 0:00:04.698719
10 0.9138216381809829 2022-11-16 16:01:59 0:00:05.237628
11 0.9195539517812747 2022-11-16 16:02:00 0:00:05.784738
12 0.924434979750838 2022-11-16 16:02:01 0:00:06.314357
13 0.9143864529710849 2022-11-16 16:02:01 0:00:06.827148
14 0.9100862610615469 2022-11-16 16:02:02 0:00:07.351497
15 0.9026852511699036 2022-11-16 16:02:02 0:00:07.986122
16 0.9053434156313632 2022-11-16 16:02:03 0:00:08.560668
17 0.9020839182025568 2022-11-16 16:02:03 0:00:09.104071
18 0.9167171666958004 2022-11-16 16:02:04 0:00:09.644878
19 0.917731218573828 2022-11-16 16:02:04 0:00:10.187988
In [94]:  # plot scores
plt.hist(xgb_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(xgb_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(xgb_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 88.3% and 92.9%

GradientBoostingRegressor
In [95]:  values = concrete_df_scaled.values
start = datetime.datetime.now()
# Number of bootstrap samples to create
n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
gbm_stats = list()
for i in range(n_iterations):
now = datetime.datetime.now()
duration = now - start
now = now.strftime("%Y-%m-%d %H:%M:%S")

# prepare train and test sets


train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
gbmTree = GradientBoostingRegressor(n_estimators=50)

# fit against independent variables and corresponding target values


gbmTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = gbmTree.score(test[:, :-1] , y_bs_test)
predictions = gbmTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
gbm_stats.append(score)

gbm_duration = duration

0 0.8531113768853553 2022-11-16 16:09:27 0:00:00


1 0.8785094905139466 2022-11-16 16:09:28 0:00:00.400339
2 0.8793407514756606 2022-11-16 16:09:28 0:00:00.772261
3 0.8550349101496013 2022-11-16 16:09:28 0:00:01.163609
4 0.8784907298816702 2022-11-16 16:09:29 0:00:01.541121
5 0.8641277301161968 2022-11-16 16:09:29 0:00:01.906621
6 0.8693134569862236 2022-11-16 16:09:29 0:00:02.182373
7 0.8558255440525219 2022-11-16 16:09:30 0:00:02.559138
8 0.8755427851369855 2022-11-16 16:09:30 0:00:02.958823
9 0.8826126708709126 2022-11-16 16:09:31 0:00:03.439292
10 0.8796970216402821 2022-11-16 16:09:31 0:00:03.940215
11 0.8798981461977504 2022-11-16 16:09:32 0:00:04.348049
12 0.8553412163569275 2022-11-16 16:09:32 0:00:04.808928
13 0.8582615798784747 2022-11-16 16:09:32 0:00:05.128696
14 0.8782820938966774 2022-11-16 16:09:33 0:00:05.513325
15 0.873091818182887 2022-11-16 16:09:33 0:00:05.888762
16 0.8531169691279119 2022-11-16 16:09:33 0:00:06.250020
17 0.875819435292361 2022-11-16 16:09:34 0:00:06.636519
18 0.8845547967987878 2022-11-16 16:09:34 0:00:07.012405
19 0.8626450644361717 2022-11-16 16:09:35 0:00:07.386571
In [96]:  # plot scores
plt.hist(gbm_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(gbm_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(gbm_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 84.8% and 88.9%

RandomForestRegressor
In [106]:  values = concrete_df_scaled.values

# Number of bootstrap samples to create


n_iterations = 1000

# size of a bootstrap sample


n_size = int(len(concrete_df_scaled) * 1)

# run bootstrap
# empty list that will hold the scores for each bootstrap iteration
rf_stats = list()
for i in range(n_iterations):
# prepare train and test sets
train = resample(values, n_samples=n_size) # Sampling with replacement
test = np.array([x for x in values if x.tolist() not in train.tolist()])

# fit model
rfTree = RandomForestRegressor(max_features='auto', n_estimators=100)

# fit against independent variables and corresponding target values


rfTree.fit(train[:,:-1], train[:,-1])

# Take the target column for all rows in test set


y_bs_test = test[:,-1]

# evaluate model
# predict based on independent variables in the test data
score = rfTree.score(test[:, :-1] , y_bs_test)
predictions = rfTree.predict(test[:, :-1])

#display iteration number, score, current time and cumulative processing time
print(i, score, now, duration)
rf_stats.append(score)

rf_duration = duration

0 0.9056918629282287 2022-11-16 16:15:40 0:06:12.430839


1 0.8971717450949928 2022-11-16 16:15:40 0:06:12.430839
2 0.8998112315312732 2022-11-16 16:15:40 0:06:12.430839
3 0.8984822579893565 2022-11-16 16:15:40 0:06:12.430839
4 0.9086467498333022 2022-11-16 16:15:40 0:06:12.430839
5 0.8890844069882298 2022-11-16 16:15:40 0:06:12.430839
6 0.893406096343789 2022-11-16 16:15:40 0:06:12.430839
7 0.9078784942350959 2022-11-16 16:15:40 0:06:12.430839
8 0.9002966863296532 2022-11-16 16:15:40 0:06:12.430839
9 0.9050679766582665 2022-11-16 16:15:40 0:06:12.430839
10 0.8911945182495491 2022-11-16 16:15:40 0:06:12.430839
11 0.888833512149122 2022-11-16 16:15:40 0:06:12.430839
12 0.8917763356880191 2022-11-16 16:15:40 0:06:12.430839
13 0.8993005391192824 2022-11-16 16:15:40 0:06:12.430839
14 0.9115524370863135 2022-11-16 16:15:40 0:06:12.430839
15 0.908710242492275 2022-11-16 16:15:40 0:06:12.430839
16 0.9149561096346571 2022-11-16 16:15:40 0:06:12.430839
17 0.8764441506078313 2022-11-16 16:15:40 0:06:12.430839
18 0.8871420098441681 2022-11-16 16:15:40 0:06:12.430839
19 0.887459127160874 2022-11-16 16:15:40 0:06:12.430839
In [98]:  # plot scores
plt.hist(rf_stats)
plt.show()

# confidence intervals
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0) * 100 # tail regions on right and left .25 on e
lower = max(0.0, np.percentile(rf_stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(rf_stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100,

95.0 confidence interval 86.8% and 91.5%

Processing Time of All Models


In [107]:  # plot total processing time of all models
data = {'Ridge':363.027549, 'Lasso':322.517661, 'KNN':315.652959, 'SVR':5007.08956
models = list(data.keys())
total_duration = list(data.values())

fig = plt.figure(figsize = (10, 5))


plt.bar(models, total_duration, color ='maroon')
plt.xlabel("Models Used")
plt.ylabel("Total Processing Duration in Seconds")
plt.title("Total Processing Duration of All Models Used")
plt.show()

8 Conclusion
We were able to predict the concrete compressive strength using original features with an
accuracy of 86.94% on test data with RMSE = 5.50

If we look at the above results from various methods then we got the best accuracy from
original features and followed below steps to gain that much of accuracy.

a. As mentioned in Multi-variate analysis, there are some non-linear(curvy-linear) relatioship


within independent features as well as with target variable hence I have tried with polynomial
features.

b. Simple linear regression with polynomial features with degree = 2 performs better on both
training and test set with 1% difference.

We had 25 duplicate instances in dataset and dropped those duplicates.

We had outliers in 'Water', 'Superplastic', 'Fineagg', 'Age' and 'Strength' column also, handled
these outliers by replacing every outlier with upper and lower side of the whisker.

Except 'Cement', 'Water', 'Superplastic' and 'Age' features, all other features are having very
weak relationship with concrete 'Strength' feature and does not account for making statistical
decision (of correlation).

Range of clusters in this dataset is 2 to 6

No missing values in dataset.

Standardization of data using PowerTransformer improves accuracy slightly.

Bootstrap sampling with GradientBoostingRegressor model performance is between 84.8% -


89.0% is better than other Regression algorithms.

Bootstrap sampling with RandomForestRegressor model performance is between 86.8% -


91.6% is better than other Regression algorithms.

Finally Bootstrap sampling with RandomForestRegressor model with an accuracy of 86.6% -


91.6% is our best model.

In [ ]: 

You might also like