Notebook - measures of computer systems
1 Problem - 1
2 Context
The comp-activ database comprises activity measures of computer systems. Data was gathered
from a Sun Sparcstation 20/712 with 128 Mbytes of memory, operating in a multi-user university
department. Users engaged in diverse tasks, such as internet access, file editing, and CPU-intensive
programs.
3 Objective
Being an aspiring data scientist, you aim to establish a linear equation for predicting ‘usr’ (the
percentage of time CPUs operate in user mode). Your goal is to analyze various system attributes
to understand their influence on the system’s ‘usr’ mode.
14. atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per
second
15. pgin - Number of page-in requests per second
16. ppgin - Number of pages paged in per second
17. pflt - Number of page faults caused by protection errors (copy-on-writes).
18. vflt - Number of page faults caused by address translation.
19. runqsz - Process run queue size (The number of kernel threads in memory that are waiting
for a CPU torun.Typically, this value should be less than 2. Consistently higher values mean
that the system might be CPU-bound.)
20. freemem - Number of memory pages available to user processes
21. freeswap - Number of disk blocks available for page swapping
5 Criteria
5.1 Problem 1 - Define the problem and perform exploratory Data Analysis
• Problem definition
• Check shape, Data types, statistical summary
• Univariate analysis
• Multivariate analysis
• Use appropriate visualizations to identify the patterns and insights
• Key meaningful observations on individual variables and the relationship between variables
[ ]: import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import warnings
warnings.filterwarnings("ignore")
5.6 Objective:
Build a linear regression model that accurately predicts the ‘usr’ mode based on the provided
system measures. Additionally, interpret the coefficients of the model to understand the influence
of each system attribute on the ‘usr’ mode.
[ ]: df = pd.read_excel('compactiv.xlsx')
df.head()
[ ]: lread lwrite scall sread swrite fork exec rchar wchar pgout \
0 1 0 2147 79 68 0.2 0.2 40671.0 53995.0 0.0
1 0 0 170 18 21 0.2 0.2 448.0 8385.0 0.0
2 15 3 2162 159 119 2.0 2.4 NaN 31950.0 0.0
3 0 0 160 12 16 0.2 0.2 NaN 8670.0 0.0
4 5 1 330 39 38 0.4 0.4 NaN 12185.0 0.0
freeswap usr
0 1730946 95
1 1869002 97
2 1021237 87
3 1863704 98
4 1760253 90
[5 rows x 22 columns]
[ ]: df.tail()
[5 rows x 22 columns]
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8192 entries, 0 to 8191
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lread 8192 non-null int64
1 lwrite 8192 non-null int64
2 scall 8192 non-null int64
3 sread 8192 non-null int64
4 swrite 8192 non-null int64
5 fork 8192 non-null float64
6 exec 8192 non-null float64
7 rchar 8088 non-null float64
8 wchar 8177 non-null float64
9 pgout 8192 non-null float64
10 ppgout 8192 non-null float64
11 pgfree 8192 non-null float64
12 pgscan 8192 non-null float64
13 atch 8192 non-null float64
14 pgin 8192 non-null float64
15 ppgin 8192 non-null float64
16 pflt 8192 non-null float64
17 vflt 8192 non-null float64
18 runqsz 8192 non-null object
19 freemem 8192 non-null int64
20 freeswap 8192 non-null int64
21 usr 8192 non-null int64
dtypes: float64(13), int64(8), object(1)
memory usage: 1.4+ MB
[ ]: df.describe().T.round(2)
75% max
lread 20.00 1845.00
lwrite 10.00 575.00
scall 3317.25 12493.00
sread 279.00 5318.00
swrite 185.00 5456.00
fork 2.20 20.12
exec 2.80 59.56
rchar 267828.75 2526649.00
wchar 106101.00 1801623.00
pgout 2.40 81.44
ppgout 4.20 184.20
pgfree 5.00 523.00
pgscan 0.00 1237.00
atch 0.60 211.58
pgin 9.76 141.20
ppgin 13.80 292.61
pflt 159.60 899.80
vflt 251.80 1365.00
freemem 2002.25 12027.00
freeswap 1730379.50 2243187.00
usr 94.00 99.00
[ ]: df.select_dtypes(include=['object']).describe().T
[ ]: df['runqsz'].value_counts()
[ ]: runqsz
Not_CPU_Bound 4331
CPU_Bound 3861
Name: count, dtype: int64
[ ]: def missing_values_summary(df):
         # Count missing values in each column
         missing_values = df.isnull().sum()
         # Print the missing value information
         for col, missing_count in missing_values.items():
             print(f"{col}: {missing_count} missing values")

[ ]: missing_values_summary(df)
lread: 0 missing values
lwrite: 0 missing values
scall: 0 missing values
sread: 0 missing values
swrite: 0 missing values
fork: 0 missing values
exec: 0 missing values
rchar: 104 missing values
wchar: 15 missing values
pgout: 0 missing values
ppgout: 0 missing values
pgfree: 0 missing values
pgscan: 0 missing values
atch: 0 missing values
pgin: 0 missing values
ppgin: 0 missing values
pflt: 0 missing values
vflt: 0 missing values
runqsz: 0 missing values
freemem: 0 missing values
freeswap: 0 missing values
usr: 0 missing values
[ ]: def histogram_boxplot(data, column):
         # One figure with a histogram on top and a boxplot below
         fig, axes = plt.subplots(2, 1, figsize=(10, 6))
         # Histogram
         sns.histplot(data[column], ax=axes[0], kde=True)
         axes[0].set_title(f'Histogram of {column}')
         # Boxplot
         sns.boxplot(x=data[column], ax=axes[1])
         axes[1].set_title(f'Boxplot of {column}')
         plt.show()
histogram_boxplot(df, 'lread')
histogram_boxplot(df, 'lwrite')
histogram_boxplot(df, 'scall')
histogram_boxplot(df, 'sread')
histogram_boxplot(df, 'swrite')
histogram_boxplot(df, 'fork')
histogram_boxplot(df, 'exec')
histogram_boxplot(df, 'rchar')
histogram_boxplot(df, 'wchar')
histogram_boxplot(df, 'pgout')
histogram_boxplot(df, 'ppgout')
histogram_boxplot(df, 'pgfree')
histogram_boxplot(df, 'pgscan')
histogram_boxplot(df, 'atch')
histogram_boxplot(df, 'pgin')
histogram_boxplot(df, 'ppgin')
histogram_boxplot(df, 'pflt')
histogram_boxplot(df, 'vflt')
histogram_boxplot(df, 'freemem')
histogram_boxplot(df, 'freeswap')
histogram_boxplot(df, 'usr')
[ ]: df.select_dtypes(include=['int', 'float']).skew().round(2)
[ ]: lread 13.90
lwrite 5.28
scall 0.90
sread 5.46
swrite 9.61
fork 2.25
exec 4.07
rchar 2.88
wchar 3.85
pgout 5.07
ppgout 4.68
pgfree 4.77
pgscan 5.81
atch 21.54
pgin 3.24
ppgin 3.90
pflt 1.72
vflt 1.74
freemem 1.81
freeswap -0.79
usr -3.42
dtype: float64
[ ]: plt.title('runqsz')
     sns.countplot(x=df['runqsz'])
     plt.tight_layout()
     plt.show()
[ ]: def segregate_numerical_columns(df):
         # Select columns with numeric data types
         numerical_columns = df.select_dtypes(include=['int', 'float']).columns
         df_num = df[numerical_columns]
         return df_num

     df_num = segregate_numerical_columns(df)

     plt.figure(figsize=(20, 7))
     sns.heatmap(df_num.corr(), annot=True, mask=np.triu(df_num.corr(), +1), cmap='RdYlGn');
Typically, the rule of thumb for correlation values is as follows:
1. r between −0.4 and +0.4 indicates absence of linear dependence.
2. r between −0.7 and −0.4 or r between +0.4 and +0.7 indicates moderate linear dependence, the sign indicating its direction.
3. r less than −0.7 or r greater than +0.7 indicates strong linear dependence.
[ ]: def filter_correlation(df):
         # Calculate the correlation matrix
         corr_matrix = df.corr()
         moderate_correlations = []
         strong_correlations = []
         # Walk over each unique pair of columns and bucket by |r|
         for i in range(len(corr_matrix.columns)):
             for j in range(i + 1, len(corr_matrix.columns)):
                 r = corr_matrix.iloc[i, j]
                 if 0.4 <= abs(r) < 0.7:
                     moderate_correlations.append((corr_matrix.columns[i], corr_matrix.columns[j], r))
                 elif abs(r) >= 0.7:
                     strong_correlations.append((corr_matrix.columns[i], corr_matrix.columns[j], r))
         return moderate_correlations, strong_correlations

     moderate_correlations, strong_correlations = filter_correlation(df_num)
print("Moderate Correlations:")
for corr in moderate_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")
print("\nStrong Correlations:")
for corr in strong_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")
Moderate Correlations:
lread - lwrite: 0.5337368224057958
scall - sread: 0.6968867812358538
scall - swrite: 0.6199837643477688
scall - fork: 0.44676647183917245
scall - pflt: 0.481780709551168
scall - vflt: 0.5317598003509384
sread - fork: 0.4167207141572767
sread - rchar: 0.4999982166371595
sread - wchar: 0.4014265486206024
sread - pflt: 0.45201960899213534
sread - vflt: 0.49104525598137194
swrite - vflt: 0.41657080986074013
exec - pflt: 0.6452390212895793
exec - vflt: 0.6917544848966637
rchar - wchar: 0.4995687409698955
pgout - pgscan: 0.5539159057375587
pgout - ppgin: 0.41486526490355896
ppgout - pgin: 0.4882613687043727
ppgout - ppgin: 0.5423920181151021
pgfree - pgin: 0.5328340692850801
pgfree - ppgin: 0.5933957386640062
pgscan - pgin: 0.4968263206180219
pgscan - ppgin: 0.5649909017757778
vflt - usr: -0.4206853097412153
freemem - freeswap: 0.5726322069049757
freeswap - usr: 0.6785262417399971
Strong Correlations:
sread - swrite: 0.8810693839008278
fork - exec: 0.7639742315330512
fork - pflt: 0.9310399616311366
fork - vflt: 0.9393484703374151
pgout - ppgout: 0.8724453798209877
pgout - pgfree: 0.7303809913990689
ppgout - pgfree: 0.9177904500452804
ppgout - pgscan: 0.7852562930865677
pgfree - pgscan: 0.9152168107574576
pgin - ppgin: 0.9236207464900754
pflt - vflt: 0.935369585359654
[ ]: #sns.pairplot(df_num, diag_kind='kde')
#plt.show()
[ ]:
5.11 Get the count of zeros in each column
[ ]: def zeros_percentage(df):
         # Count the number of zeros in each column
         zero_counts = (df == 0).sum()
         # Report counts alongside their share of all rows
         zeros_info = pd.DataFrame({'zero_count': zero_counts,
                                    'zero_percentage': (zero_counts / len(df) * 100).round(2)})
         return zeros_info

     zeros_info_df = zeros_percentage(df)
     print(zeros_info_df)
5.12 Checking Outliers
[ ]: def outliers_summary(df):
         # Select only integer and float columns
         numeric_columns = df.select_dtypes(include=['int', 'float']).columns
         # Count observations beyond the IQR whiskers of each column
         for col in numeric_columns:
             Q1, Q3 = df[col].quantile([0.25, 0.75])
             IQR = Q3 - Q1
             upper = (df[col] > Q3 + 1.5 * IQR).sum()
             lower = (df[col] < Q1 - 1.5 * IQR).sum()
             if upper + lower > 0:
                 print(f"{col}: Total outliers = {upper + lower}, Upper outliers = {upper}, Lower outliers = {lower}")

     outliers_summary(df)
atch: Total outliers = 1209, Upper outliers = 1209, Lower outliers = 0
pgin: Total outliers = 789, Upper outliers = 789, Lower outliers = 0
ppgin: Total outliers = 821, Upper outliers = 821, Lower outliers = 0
pflt: Total outliers = 395, Upper outliers = 395, Lower outliers = 0
vflt: Total outliers = 484, Upper outliers = 484, Lower outliers = 0
freemem: Total outliers = 1185, Upper outliers = 1185, Lower outliers = 0
freeswap: Total outliers = 294, Upper outliers = 0, Lower outliers = 294
usr: Total outliers = 430, Upper outliers = 0, Lower outliers = 430
[ ]: def remove_outlier(col):
         # IQR-based whisker limits for a column
         Q1, Q3 = np.percentile(col, [25, 75])
         IQR = Q3 - Q1
         lower_range = Q1 - (1.5 * IQR)
         upper_range = Q3 + (1.5 * IQR)
         return lower_range, upper_range

     # Cap every numeric column at its whisker limits
     cont = df.select_dtypes(include=['int', 'float']).columns
     for column in cont:
         lr, ur = remove_outlier(df[column])
         df[column] = np.where(df[column] > ur, ur, df[column])
         df[column] = np.where(df[column] < lr, lr, df[column])

     plt.figure(figsize=(10, 10))
     df[cont].boxplot(vert=0)
     plt.title('After Outlier Removal', fontsize=16)
     plt.show()
[ ]: #df_attr = (df[cont])
#sns.pairplot(df_attr, diag_kind='kde')
#plt.show()
[ ]: lread lwrite scall sread swrite fork exec rchar wchar pgout \
0 1.0 0.0 2147.0 79.0 68.0 0.2 0.2 40671.0 53995.0 0.0
1 0.0 0.0 170.0 18.0 21.0 0.2 0.2 448.0 8385.0 0.0
2 15.0 3.0 2162.0 159.0 119.0 2.0 2.4 125473.5 31950.0 0.0
3 0.0 0.0 160.0 12.0 16.0 0.2 0.2 125473.5 8670.0 0.0
4 5.0 1.0 330.0 39.0 38.0 0.4 0.4 125473.5 12185.0 0.0
runqsz_Not_CPU_Bound
0 False
1 True
2 True
3 True
4 True
[5 rows x 22 columns]
[ ]: df_dummy.describe().T.round(2)
75% max
lread 20.00 47.00
lwrite 10.00 25.00
scall 3317.25 6775.12
sread 279.00 568.50
swrite 185.00 368.00
fork 2.20 4.90
exec 2.80 6.70
rchar 265394.75 611196.12
wchar 106037.00 230625.88
pgout 2.40 6.00
ppgout 4.20 10.50
pgfree 5.00 12.50
pgscan 0.00 0.00
atch 0.60 1.50
pgin 9.76 23.51
ppgin 13.80 33.60
pflt 159.60 361.50
vflt 251.80 561.40
freemem 2002.25 4659.12
freeswap 1730379.50 2243187.00
usr 94.00 99.00
Before outlier treatment, the variable “pgscan” had more than 75% of its value counts as 0. After
outlier treatment, all the entries for “pgscan” became 0. Calculating the Variance Inflation Factor
(VIF) involves computing the inverse of the correlation matrix of the predictors; if a variable has
zero variance (i.e., all values are the same), this computation breaks down, leading to a division by
zero error or NaN values. A variable with zero variance, like “pgscan” here, provides no useful
information for prediction, since all observations share the same value. Thus, it can be safely
dropped from the model, as it doesn’t contribute to the prediction of the target variable. A quick
check of this is sketched below.
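A minimal illustrative check (a sketch, assuming df_dummy is the encoded frame summarized above): a constant column has zero variance, so every correlation involving it is undefined (NaN), which is exactly what breaks the matrix inversion inside the VIF computation.

[ ]: # "pgscan" is constant after capping, so its variance is 0 ...
     print(df_dummy['pgscan'].var())
     # ... and any correlation with it is NaN, which breaks VIF's matrix inverse
     print(df_dummy[['pgscan', 'usr']].corr())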
[ ]: df_dummy.drop(columns=['pgscan'], inplace=True)
[ ]: df_dummy.columns
[ ]: from sklearn.model_selection import train_test_split

     X = df_dummy.drop(columns=['usr'])   # predictors
     y = df_dummy['usr']                  # target
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
[ ]: import statsmodels.api as sm
     X_train = sm.add_constant(X_train)   # adds the constant term beta0 to the regression
     X_test = sm.add_constant(X_test)
model = sm.OLS(endog=y_train,exog=X_train).fit()
print(model.summary())
exec                    -0.3212      0.052     -6.220      0.000      -0.422      -0.220
rchar                -5.167e-06   4.88e-07    -10.598      0.000   -6.12e-06   -4.21e-06
wchar                -5.403e-06   1.03e-06     -5.232      0.000   -7.43e-06   -3.38e-06
pgout                   -0.3688      0.090     -4.098      0.000      -0.545      -0.192
ppgout                  -0.0766      0.079     -0.973      0.330      -0.231       0.078
pgfree                   0.0845      0.048      1.769      0.077      -0.009       0.178
atch                     0.6276      0.143      4.394      0.000       0.348       0.908
pgin                     0.0200      0.028      0.703      0.482      -0.036       0.076
ppgin                   -0.0673      0.020     -3.415      0.001      -0.106      -0.029
pflt                    -0.0336      0.002    -16.957      0.000      -0.037      -0.030
vflt                    -0.0055      0.001     -3.830      0.000      -0.008      -0.003
freemem                 -0.0005   5.07e-05     -9.038      0.000      -0.001      -0.000
freeswap              8.832e-06    1.9e-07     46.472      0.000    8.46e-06     9.2e-06
runqsz_Not_CPU_Bound     1.6153      0.126     12.819      0.000       1.368       1.862
==============================================================================
Omnibus: 1103.645 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2372.553
Skew: -1.119 Prob(JB): 0.00
Kurtosis: 5.219 Cond. No. 7.74e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.74e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
5.12.5 Interpretation of Coefficients
• The coefficients tell us how one unit change in X can affect y.
• The sign of the coefficient indicates if the relationship is positive or negative.
• Multicollinearity occurs when predictor variables in a regression model are correlated. This
correlation is a problem because predictor variables should be independent. If the collinearity
between variables is high, we might not be able to trust the p-values to identify independent
variables that are statistically significant.
• When we have multicollinearity in the linear model, the coefficients that the model suggests
are unreliable.
[ ]: from statsmodels.stats.outliers_influence import variance_inflation_factor

     vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 29.229332
lread 5.350560
lwrite 4.328397
scall 2.960609
sread 6.420172
swrite 5.597135
fork 13.035359
exec 3.241417
rchar 2.133616
wchar 1.584381
pgout 11.360363
ppgout 29.404223
pgfree 16.496748
atch 1.875901
pgin 13.809339
ppgin 13.951855
pflt 12.001460
vflt 15.971049
freemem 1.961304
freeswap 1.841239
runqsz_Not_CPU_Bound 1.156815
dtype: float64
As a few predictors have VIF values > 2, there is some multicollinearity in the data. The
variable with the highest VIF value is ‘ppgout’, with a VIF of 29.404223. This value suggests
a high degree of multicollinearity.
• The VIF values indicate that the features (ppgout,pgfree,vflt,ppgin,pgin,fork,pflt,pgout,sread,swrite,lread)
are correlated with one or more independent features.
• Multicollinearity affects only the specific independent variables that are correlated.
• To treat multicollinearity, we will have to drop one or more of the correlated features.
• We will drop the variable that has the least impact on the adjusted R-squared of the model.
[ ]: X_train.columns
Let’s remove/drop multicollinear columns one by one and observe the effect on our
predictive model
[ ]: import statsmodels.api as sm
     import numpy as np

     def remove_multicollinear_columns(X_train, y_train, vif_values):
         """
         Parameters:
         - X_train: DataFrame containing predictor variables.
         - y_train: Series containing the target variable.
         - vif_values: Dictionary containing VIF values for predictor variables.

         Returns:
         - List of tuples containing the name of the removed column and the
           corresponding adjusted R-squared and R-squared.
         """
         results_adj_r_squared = []
         results_r_squared = []

         # Initial model
         olsmod = sm.OLS(y_train, X_train)
         olsres = olsmod.fit()
         initial_adj_r_squared = olsres.rsquared_adj
         initial_r_squared = olsres.rsquared
         results_adj_r_squared.append(('Initial', initial_adj_r_squared))
         results_r_squared.append(('Initial', initial_r_squared))

         # Refit the model with each high-VIF column dropped in turn
         for column in vif_values:
             olsres_temp = sm.OLS(y_train, X_train.drop(columns=[column])).fit()
             # Append results
             results_adj_r_squared.append((column, olsres_temp.rsquared_adj))
             results_r_squared.append((column, olsres_temp.rsquared))

         return results_adj_r_squared, results_r_squared

[ ]: # Example usage:
     # Assuming X_train and y_train are your training data
     vif_values = {
         'ppgout': 29.404223,
         'pgfree': 16.496748,
         'vflt': 15.971049,
         'ppgin': 13.951855,
         'pgin': 13.809339,
         'fork': 13.035359,
         'pflt': 12.00146,
         'pgout': 11.360363,
         'sread': 6.420172,
         'swrite': 5.597135,
         'lread': 5.35056,
         'lwrite': 4.328397,
     }
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

     print("R-squared Results:")
     for column, r_squared in removed_columns_r_squared:
         print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")
R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: ppgout, R-squared: 0.796
Removed: pgfree, R-squared: 0.796
Removed: vflt, R-squared: 0.796
Removed: ppgin, R-squared: 0.796
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.796
Removed: pflt, R-squared: 0.786
Removed: pgout, R-squared: 0.796
Removed: sread, R-squared: 0.796
Removed: swrite, R-squared: 0.796
Removed: lread, R-squared: 0.794
Removed: lwrite, R-squared: 0.796
We will remove ppgout first.
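Each removal step below follows the same pattern: drop the column, refit the OLS model, and reprint the summary and VIF values. The exported notebook lost these cells; a minimal sketch of the step (the model variable name is assumed):

[ ]: # Drop the highest-VIF column and refit; the same pattern repeats for
     # vflt, ppgin, and the other removals below
     X_train = X_train.drop(["ppgout"], axis=1)
     olsres_2 = sm.OLS(y_train, X_train).fit()   # model name assumed
     print(olsres_2.summary())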
fork                     0.0325      0.132      0.247      0.805      -0.226       0.291
exec                    -0.3225      0.052     -6.247      0.000      -0.424      -0.221
rchar                -5.166e-06   4.88e-07    -10.598      0.000   -6.12e-06   -4.21e-06
wchar                 -5.45e-06   1.03e-06     -5.283      0.000   -7.47e-06   -3.43e-06
pgout                   -0.4264      0.068     -6.286      0.000      -0.559      -0.293
pgfree                   0.0477      0.029      1.634      0.102      -0.010       0.105
atch                     0.6295      0.143      4.407      0.000       0.349       0.909
pgin                     0.0212      0.028      0.745      0.456      -0.035       0.077
ppgin                   -0.0685      0.020     -3.482      0.001      -0.107      -0.030
pflt                    -0.0336      0.002    -16.957      0.000      -0.037      -0.030
vflt                    -0.0055      0.001     -3.846      0.000      -0.008      -0.003
freemem                 -0.0005   5.07e-05     -9.074      0.000      -0.001      -0.000
freeswap              8.824e-06    1.9e-07     46.472      0.000    8.45e-06     9.2e-06
runqsz_Not_CPU_Bound     1.6130      0.126     12.804      0.000       1.366       1.860
==============================================================================
Omnibus: 1102.077 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2366.754
Skew: -1.118 Prob(JB): 0.00
Kurtosis: 5.216 Cond. No. 7.71e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.71e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 29.021961
lread 5.350387
lwrite 4.328325
scall 2.960379
sread 6.420135
swrite 5.597025
fork 13.027305
exec 3.239231
rchar 2.133614
wchar 1.580894
pgout 6.453978
pgfree 6.172847
atch 1.875553
pgin 13.784007
ppgin 13.898848
pflt 12.001460
vflt 15.966865
freemem 1.959267
freeswap 1.838167
runqsz_Not_CPU_Bound 1.156421
dtype: float64
[ ]: vif_values = {
         'vflt': 15.966865,
         'ppgin': 13.898848,
         'pgin': 13.784007,
         'fork': 13.027305,
         'pflt': 12.00146,
         'pgout': 6.453978,
         'sread': 6.420135,
         'pgfree': 6.172847,
         'swrite': 5.597025,
         'lread': 5.350387,
     }
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")
R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: vflt, R-squared: 0.796
Removed: ppgin, R-squared: 0.796
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.796
Removed: pflt, R-squared: 0.786
Removed: pgout, R-squared: 0.795
Removed: sread, R-squared: 0.796
Removed: pgfree, R-squared: 0.796
Removed: swrite, R-squared: 0.796
Removed: lread, R-squared: 0.794
We are removing vflt
Time: 12:53:12 Log-Likelihood: -16665.
No. Observations: 5734 AIC: 3.337e+04
Df Residuals: 5715 BIC: 3.349e+04
Df Model: 18
Covariance Type: nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   84.0090      0.313    268.139      0.000      83.395      84.623
lread                   -0.0654      0.009     -7.281      0.000      -0.083      -0.048
lwrite                   0.0491      0.013      3.735      0.000       0.023       0.075
scall                   -0.0007   6.28e-05    -10.769      0.000      -0.001      -0.001
sread                -2.068e-05      0.001     -0.021      0.984      -0.002       0.002
swrite                  -0.0053      0.001     -3.720      0.000      -0.008      -0.003
fork                    -0.2082      0.116     -1.793      0.073      -0.436       0.019
exec                    -0.3293      0.052     -6.376      0.000      -0.431      -0.228
rchar                -5.294e-06   4.87e-07    -10.871      0.000   -6.25e-06   -4.34e-06
wchar                -4.982e-06   1.03e-06     -4.858      0.000   -6.99e-06   -2.97e-06
pgout                   -0.4205      0.068     -6.194      0.000      -0.554      -0.287
pgfree                   0.0408      0.029      1.397      0.162      -0.016       0.098
atch                     0.5868      0.143      4.116      0.000       0.307       0.866
pgin                     0.0086      0.028      0.305      0.760      -0.047       0.064
ppgin                   -0.0685      0.020     -3.476      0.001      -0.107      -0.030
pflt                    -0.0373      0.002    -21.570      0.000      -0.041      -0.034
freemem                 -0.0005   5.07e-05     -9.165      0.000      -0.001      -0.000
freeswap              8.945e-06   1.87e-07     47.712      0.000    8.58e-06    9.31e-06
runqsz_Not_CPU_Bound     1.6096      0.126     12.761      0.000       1.362       1.857
==============================================================================
Omnibus: 1058.324 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2225.362
Skew: -1.085 Prob(JB): 0.00
Kurtosis: 5.145 Cond. No. 7.65e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.65e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.641818
lread 5.335455
lwrite 4.327130
scall 2.952947
sread 6.374687
swrite 5.595777
fork 10.089700
exec 3.235396
rchar 2.123783
wchar 1.558923
pgout 6.450724
pgfree 6.149223
atch 1.864254
pgin 13.602134
ppgin 13.898845
pflt 9.131802
freemem 1.957966
freeswap 1.787695
runqsz_Not_CPU_Bound 1.156363
dtype: float64
[ ]: vif_values = {
'ppgin': 13.898845,
'pgin': 13.602134,
'fork': 10.0897,
'pflt': 9.131802,
'pgout': 6.450724,
'sread': 6.374687,
'pgfree': 6.149223,
'swrite': 5.595777,
'lread' : 5.335455,
}
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: ppgin, R-squared: 0.795
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.779
Removed: pgout, R-squared: 0.794
Removed: sread, R-squared: 0.796
Removed: pgfree, R-squared: 0.795
Removed: swrite, R-squared: 0.795
Removed: lread, R-squared: 0.794
[ ]: # Removing ppgin
X_train = X_train.drop(["ppgin"], axis=1)
pgfree                   0.0311      0.029      1.070      0.284      -0.026       0.088
atch                     0.5966      0.143      4.181      0.000       0.317       0.876
pgin                    -0.0839      0.009     -8.848      0.000      -0.103      -0.065
pflt                    -0.0374      0.002    -21.568      0.000      -0.041      -0.034
freemem                 -0.0005   5.08e-05     -9.196      0.000      -0.001      -0.000
freeswap              8.922e-06   1.88e-07     47.572      0.000    8.55e-06    9.29e-06
runqsz_Not_CPU_Bound     1.6017      0.126     12.689      0.000       1.354       1.849
==============================================================================
Omnibus: 1052.296 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2207.367
Skew: -1.081 Prob(JB): 0.00
Kurtosis: 5.137 Cond. No. 7.65e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.65e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.594882
lread 5.304009
lwrite 4.316362
scall 2.951826
sread 6.374556
swrite 5.595670
fork 10.074886
exec 3.235387
rchar 2.090401
wchar 1.558921
pgout 6.445478
pgfree 6.093623
atch 1.863536
pgin 1.529142
pflt 9.131545
freemem 1.957713
freeswap 1.785393
runqsz_Not_CPU_Bound 1.155990
dtype: float64
[ ]: vif_values = {
         'fork': 10.074886,
         'pflt': 9.131545,
         'pgout': 6.445478,
         'sread': 6.374556,
         'pgfree': 6.093623,
         'swrite': 5.59567,
         'lread': 5.304009,
     }
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.778
Removed: pgout, R-squared: 0.794
Removed: sread, R-squared: 0.795
Removed: pgfree, R-squared: 0.795
Removed: swrite, R-squared: 0.795
Removed: lread, R-squared: 0.793
wchar                -4.978e-06   1.02e-06     -4.870      0.000   -6.98e-06   -2.97e-06
pgout                   -0.4138      0.068     -6.092      0.000      -0.547      -0.281
pgfree                   0.0311      0.029      1.071      0.284      -0.026       0.088
atch                     0.5966      0.143      4.183      0.000       0.317       0.876
pgin                    -0.0839      0.009     -8.852      0.000      -0.103      -0.065
pflt                    -0.0374      0.002    -21.608      0.000      -0.041      -0.034
freemem                 -0.0005   5.08e-05     -9.198      0.000      -0.001      -0.000
freeswap              8.922e-06   1.87e-07     47.758      0.000    8.56e-06    9.29e-06
runqsz_Not_CPU_Bound     1.6017      0.126     12.690      0.000       1.354       1.849
==============================================================================
Omnibus: 1052.284 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2207.327
Skew: -1.081 Prob(JB): 0.00
Kurtosis: 5.137 Cond. No. 7.64e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.64e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.524054
lread 5.296795
lwrite 4.307417
scall 2.696760
swrite 3.201334
fork 10.073151
exec 3.229896
rchar 1.673676
wchar 1.545377
pgout 6.444771
pgfree 6.092930
atch 1.862227
pgin 1.528098
pflt 9.099426
freemem 1.957131
freeswap 1.771837
runqsz_Not_CPU_Bound 1.155984
dtype: float64
[ ]: vif_values = {
'fork': 10.073151,
'pflt': 9.099426,
'pgout': 6.444771,
'pgfree': 6.09293,
'lread': 5.296795,
}
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.778
Removed: pgout, R-squared: 0.794
Removed: pgfree, R-squared: 0.795
Removed: lread, R-squared: 0.793
atch                     0.6002      0.143      4.210      0.000       0.321       0.880
pgin                    -0.0825      0.009     -8.788      0.000      -0.101      -0.064
pflt                    -0.0374      0.002    -21.620      0.000      -0.041      -0.034
freemem                 -0.0005   5.06e-05     -9.304      0.000      -0.001      -0.000
freeswap              8.927e-06   1.87e-07     47.804      0.000    8.56e-06    9.29e-06
runqsz_Not_CPU_Bound     1.5998      0.126     12.676      0.000       1.352       1.847
==============================================================================
Omnibus: 1054.101 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2212.372
Skew: -1.082 Prob(JB): 0.00
Kurtosis: 5.139 Cond. No. 7.64e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.64e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.523969
lread 5.291929
lwrite 4.301867
scall 2.693729
swrite 3.200197
fork 10.072215
exec 3.227631
rchar 1.673056
wchar 1.545060
pgout 2.029269
atch 1.861177
pgin 1.500133
pflt 9.098450
freemem 1.946319
freeswap 1.770539
runqsz_Not_CPU_Bound 1.155750
dtype: float64
[ ]: vif_values = {
'fork': 10.072215,
'pflt': 9.09845,
'lread': 5.291929,
'lwrite': 4.301867
}
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.778
Removed: lread, R-squared: 0.793
Removed: lwrite, R-squared: 0.795
OLS Regression Results
==============================================================================
Dep. Variable: usr R-squared: 0.795
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 1584.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:14 Log-Likelihood: -16673.
No. Observations: 5734 AIC: 3.338e+04
Df Residuals: 5719 BIC: 3.348e+04
Df Model: 14
Covariance Type: nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   84.0919      0.312    269.420      0.000      83.480      84.704
lread                   -0.0684      0.009     -7.653      0.000      -0.086      -0.051
lwrite                   0.0523      0.013      3.995      0.000       0.027       0.078
scall                   -0.0007   5.96e-05    -11.111      0.000      -0.001      -0.001
swrite                  -0.0058      0.001     -5.481      0.000      -0.008      -0.004
exec                    -0.3568      0.049     -7.355      0.000      -0.452      -0.262
rchar                -5.511e-06   4.33e-07    -12.740      0.000   -6.36e-06   -4.66e-06
wchar                -4.872e-06   1.02e-06     -4.779      0.000   -6.87e-06   -2.87e-06
pgout                   -0.3540      0.038     -9.287      0.000      -0.429      -0.279
atch                     0.6055      0.143      4.247      0.000       0.326       0.885
pgin                    -0.0820      0.009     -8.730      0.000      -0.100      -0.064
pflt                    -0.0396      0.001    -37.292      0.000      -0.042      -0.038
freemem                 -0.0005   5.06e-05     -9.328      0.000      -0.001      -0.000
freeswap              8.915e-06   1.87e-07     47.769      0.000    8.55e-06    9.28e-06
runqsz_Not_CPU_Bound     1.5953      0.126     12.641      0.000       1.348       1.843
==============================================================================
Omnibus: 1045.912 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2203.816
Skew: -1.073 Prob(JB): 0.00
Kurtosis: 5.150 Cond. No. 7.61e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.366778
lread 5.272488
lwrite 4.282984
scall 2.653943
swrite 3.012451
exec 2.847353
rchar 1.672481
wchar 1.537067
pgout 2.029172
atch 1.860242
pgin 1.497984
pflt 3.436202
freemem 1.945888
freeswap 1.767780
runqsz_Not_CPU_Bound 1.155214
dtype: float64
[ ]: vif_values = {
'lread': 5.272488,
'lwrite': 4.282984,
'pflt': 3.436202,
'swrite': 3.012451
}
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: lread, R-squared: 0.793
Removed: lwrite, R-squared: 0.794
Removed: pflt, R-squared: 0.745
Removed: swrite, R-squared: 0.794
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   84.1528      0.312    269.584      0.000      83.541      84.765
lread                   -0.0374      0.004     -8.429      0.000      -0.046      -0.029
scall                   -0.0007   5.97e-05    -11.237      0.000      -0.001      -0.001
swrite                  -0.0058      0.001     -5.512      0.000      -0.008      -0.004
exec                    -0.3696      0.048     -7.627      0.000      -0.465      -0.275
rchar                -5.533e-06   4.33e-07    -12.774      0.000   -6.38e-06   -4.68e-06
wchar                -4.572e-06   1.02e-06     -4.491      0.000   -6.57e-06   -2.58e-06
pgout                   -0.3572      0.038     -9.359      0.000      -0.432      -0.282
atch                     0.6127      0.143      4.293      0.000       0.333       0.893
pgin                    -0.0872      0.009     -9.373      0.000      -0.105      -0.069
pflt                    -0.0405      0.001    -38.806      0.000      -0.043      -0.038
freemem                 -0.0005   5.07e-05     -9.226      0.000      -0.001      -0.000
freeswap              8.916e-06   1.87e-07     47.713      0.000    8.55e-06    9.28e-06
runqsz_Not_CPU_Bound     1.6330      0.126     12.959      0.000       1.386       1.880
==============================================================================
Omnibus: 1041.933 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2191.377
Skew: -1.070 Prob(JB): 0.00
Kurtosis: 5.144 Cond. No. 7.61e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 28.299206
lread 1.294870
scall 2.650952
swrite 3.012182
exec 2.834855
rchar 1.672218
wchar 1.528722
pgout 2.028322
atch 1.859941
pgin 1.468363
pflt 3.300995
freemem 1.944841
freeswap 1.767776
runqsz_Not_CPU_Bound 1.148773
dtype: float64
[ ]: vif_values = {
'pflt': 3.300995,
'swrite': 3.012182,
'exec': 2.834855,
'scall': 2.650952,
'pgout': 2.028322
}
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

     print("R-squared Results:")
     for column, r_squared in removed_columns_r_squared:
         print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")
R-squared Results:
Removed: Initial, R-squared: 0.794
Removed: pflt, R-squared: 0.74
Removed: swrite, R-squared: 0.793
Removed: exec, R-squared: 0.792
Removed: scall, R-squared: 0.79
Removed: pgout, R-squared: 0.791
rchar                -5.597e-06   4.34e-07    -12.894      0.000   -6.45e-06   -4.75e-06
wchar                -6.141e-06    9.8e-07     -6.267      0.000   -8.06e-06   -4.22e-06
pgout                   -0.3577      0.038     -9.349      0.000      -0.433      -0.283
atch                     0.6264      0.143      4.378      0.000       0.346       0.907
pgin                    -0.0882      0.009     -9.452      0.000      -0.106      -0.070
pflt                    -0.0428      0.001    -44.601      0.000      -0.045      -0.041
freemem                 -0.0004   5.06e-05     -8.679      0.000      -0.001      -0.000
freeswap              8.982e-06   1.87e-07     48.039      0.000    8.62e-06    9.35e-06
runqsz_Not_CPU_Bound     1.6322      0.126     12.920      0.000       1.385       1.880
==============================================================================
Omnibus: 994.479 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2042.636
Skew: -1.035 Prob(JB): 0.00
Kurtosis: 5.065 Cond. No. 7.52e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 27.624372
lread 1.294639
scall 1.738190
exec 2.566477
rchar 1.671017
wchar 1.409194
pgout 2.028308
atch 1.859379
pgin 1.467864
pflt 2.775762
freemem 1.923931
freeswap 1.760599
runqsz_Not_CPU_Bound 1.148772
dtype: float64
[ ]: vif_values = {
         'pflt': 2.775762,
         'exec': 2.566477,
         'pgout': 2.028308,
     }
     removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)
R-squared Results:
Removed: Initial, R-squared: 0.793
Removed: pflt, R-squared: 0.721
Removed: exec, R-squared: 0.792
Removed: pgout, R-squared: 0.79
[ ]: # 'exec' no longer appears in the VIF list below, so it was dropped at this step
     X_train = X_train.drop(["exec"], axis=1)
     olsres_10 = sm.OLS(y_train, X_train).fit()
     print(olsres_10.summary())
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: vif_series1 = pd.Series(
         [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
         index=X_train.columns,
     )
     print("VIF values: \n\n{}\n".format(vif_series1))
VIF values:
const 27.601456
lread 1.285248
scall 1.732634
rchar 1.671015
wchar 1.408716
pgout 2.028061
atch 1.851850
pgin 1.453333
pflt 1.564570
freemem 1.922817
freeswap 1.760449
runqsz_Not_CPU_Bound 1.148484
dtype: float64
Now that we do not have multicollinearity in our data, the p-values of the coefficients have become
reliable, and no non-significant p-values are left.
5.14 Linearity and Independence of predictors
[ ]: df_pred = pd.DataFrame()
     # Fitted values and residuals from the current model (olsres_10; name assumed,
     # as the original cell body was lost in export)
     df_pred["Fitted Values"] = olsres_10.fittedvalues
     df_pred["Residuals"] = olsres_10.resid
     df_pred.head()
[ ]: # columns in training set
X_train.columns
No pattern is visible in the residual plot, thus the assumptions of linearity and independence of predictors are satisfied.
[ ]: # Correlation of the remaining predictors with each other and the target
     # (column list inferred from the correlation output below)
     columns_of_interest = ['usr', 'lread', 'scall', 'rchar', 'wchar', 'pgout',
                            'atch', 'pgin', 'pflt', 'freemem', 'freeswap']
     subset_df = df_dummy[columns_of_interest]
     corr_matrix = subset_df.corr()

     # Plot heatmap
     plt.figure(figsize=(12, 8))
     sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 10})
     plt.title('Correlation Heatmap')
     plt.show()

     moderate_correlations, strong_correlations = filter_correlation(subset_df)
print("Moderate Correlations:")
for corr in moderate_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")
print("\nStrong Correlations:")
for corr in strong_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")
Moderate Correlations:
usr - lread: -0.43816331190269986
usr - scall: -0.6189316332505844
usr - rchar: -0.5075608593466034
usr - pgin: -0.459132824706294
usr - pflt: -0.6960625200066154
usr - freeswap: 0.5649289834256034
scall - pflt: 0.4853611683732183
rchar - wchar: 0.48631681076384864
pgout - atch: 0.6429403043922839
pgout - pgin: 0.43791642745583104
pgout - freemem: -0.4698311159510902
atch - freemem: -0.4420624602944536
freemem - freeswap: 0.6070003248418868
Strong Correlations:
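The next summary includes the squared terms ‘pflt_sq’ and ‘scall_sq’, so a feature-construction step must have preceded it. A minimal sketch, with the column names taken from the summary output and the model fit on a copy (later models do not contain these columns, so X_train itself should stay unchanged):

[ ]: # Exploratory fit with squared terms for the two curved predictors
     X_train_sq = X_train.copy()
     X_train_sq['pflt_sq'] = X_train_sq['pflt'] ** 2
     X_train_sq['scall_sq'] = X_train_sq['scall'] ** 2
     print(sm.OLS(y_train, X_train_sq).fit().summary())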
----------------------------------------------------------------------------------------
const                   80.3842      0.358    224.249      0.000      79.682      81.087
lread                   -0.0409      0.004     -9.481      0.000      -0.049      -0.032
scall                    0.0009      0.000      6.462      0.000       0.001       0.001
rchar                -5.625e-06   4.24e-07    -13.282      0.000   -6.46e-06    -4.8e-06
wchar                -7.261e-06   9.58e-07     -7.576      0.000   -9.14e-06   -5.38e-06
pgout                   -0.3064      0.037     -8.189      0.000      -0.380      -0.233
atch                     0.4929      0.139      3.536      0.000       0.220       0.766
pgin                    -0.1131      0.009    -12.410      0.000      -0.131      -0.095
pflt                    -0.0274      0.002    -12.818      0.000      -0.032      -0.023
freemem                 -0.0003   5.02e-05     -5.901      0.000      -0.000      -0.000
freeswap              9.518e-06   1.85e-07     51.451      0.000    9.16e-06    9.88e-06
runqsz_Not_CPU_Bound     1.9046      0.124     15.350      0.000       1.661       2.148
pflt_sq              -5.942e-05   5.92e-06    -10.036      0.000    -7.1e-05   -4.78e-05
scall_sq             -2.827e-07   2.09e-08    -13.535      0.000   -3.24e-07   -2.42e-07
==============================================================================
Omnibus: 844.795 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1634.856
Skew: -0.918 Prob(JB): 0.00
Kurtosis: 4.864 Cond. No. 7.84e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.84e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: from itertools import combinations

     # Get the list of column names from X_train
     columns_list = X_train.columns
     # Enumerate all unique pairs of predictors as candidate interactions
     interactions = list(combinations(columns_list, 2))
     interactions
[ ]: [('const', 'lread'),
('const', 'scall'),
('const', 'rchar'),
('const', 'wchar'),
('const', 'pgout'),
('const', 'atch'),
('const', 'pgin'),
('const', 'pflt'),
('const', 'freemem'),
('const', 'freeswap'),
('const', 'runqsz_Not_CPU_Bound'),
('lread', 'scall'),
('lread', 'rchar'),
('lread', 'wchar'),
('lread', 'pgout'),
('lread', 'atch'),
('lread', 'pgin'),
('lread', 'pflt'),
('lread', 'freemem'),
('lread', 'freeswap'),
('lread', 'runqsz_Not_CPU_Bound'),
('scall', 'rchar'),
('scall', 'wchar'),
('scall', 'pgout'),
('scall', 'atch'),
('scall', 'pgin'),
('scall', 'pflt'),
('scall', 'freemem'),
('scall', 'freeswap'),
('scall', 'runqsz_Not_CPU_Bound'),
('rchar', 'wchar'),
('rchar', 'pgout'),
('rchar', 'atch'),
('rchar', 'pgin'),
('rchar', 'pflt'),
('rchar', 'freemem'),
('rchar', 'freeswap'),
('rchar', 'runqsz_Not_CPU_Bound'),
('wchar', 'pgout'),
('wchar', 'atch'),
('wchar', 'pgin'),
('wchar', 'pflt'),
('wchar', 'freemem'),
('wchar', 'freeswap'),
('wchar', 'runqsz_Not_CPU_Bound'),
('pgout', 'atch'),
('pgout', 'pgin'),
('pgout', 'pflt'),
('pgout', 'freemem'),
('pgout', 'freeswap'),
('pgout', 'runqsz_Not_CPU_Bound'),
('atch', 'pgin'),
('atch', 'pflt'),
('atch', 'freemem'),
('atch', 'freeswap'),
('atch', 'runqsz_Not_CPU_Bound'),
('pgin', 'pflt'),
('pgin', 'freemem'),
('pgin', 'freeswap'),
('pgin', 'runqsz_Not_CPU_Bound'),
('pflt', 'freemem'),
('pflt', 'freeswap'),
('pflt', 'runqsz_Not_CPU_Bound'),
('freemem', 'freeswap'),
('freemem', 'runqsz_Not_CPU_Bound'),
('freeswap', 'runqsz_Not_CPU_Bound')]
[ ]: interaction_dict = {}
     for interaction in interactions:
         X_train_int = X_train.copy()
         X_train_int['int'] = X_train_int[interaction[0]] * X_train_int[interaction[1]]
         lr3 = LinearRegression()
         lr3.fit(X_train_int, y_train)
         interaction_dict[lr3.score(X_train_int, y_train)] = interaction

     # Show the five interactions with the highest training R-squared
     for score in sorted(interaction_dict.keys(), reverse=True)[:5]:
         print(interaction_dict[score])
('freeswap', 'runqsz_Not_CPU_Bound')
('freemem', 'freeswap')
('rchar', 'freeswap')
('wchar', 'freeswap')
('freemem', 'runqsz_Not_CPU_Bound')
(‘freemem’, ‘freeswap’): This interaction term involves two variables related to memory usage,
‘freemem’ and ‘freeswap’. Memory-related variables are often crucial in system performance
analysis, and their interaction may capture complex relationships affecting the outcome variable.
(‘rchar’, ‘freeswap’): This interaction term involves ‘rchar’, which represents the number of char-
acters transferred per second by system read calls, and ‘freeswap’, which represents the number of
disk blocks available for page swapping. This interaction might capture the relationship between
disk I/O operations and available swap space, which could impact system performance.
[ ]: X_train_int['freemem_freeswap_interaction'] = X_train_int['freemem'] * X_train_int['freeswap']
     X_train_int['rchar_freeswap_interaction'] = X_train_int['rchar'] * X_train_int['freeswap']

     # Search for the polynomial term with the best training score
     # (the set of degrees searched is assumed)
     poly_dict = {}
     for feature in X_train.columns.drop('const'):
         for degree in [2, 3]:
             X_train_poly = X_train_int.copy()
             X_train_poly['poly'] = X_train_poly[feature] ** degree
             lr = LinearRegression()
             lr.fit(X_train_poly, y_train)
             poly_dict[lr.score(X_train_poly, y_train)] = (feature, degree)

     max_score = max(poly_dict.keys())
     max_score_feature, max_score_degree = poly_dict[max_score]
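Before olsmod_12 is fit, the winning interaction and squared terms must be added to X_train; a minimal sketch, with the column construction assumed from the names that appear in the summaries below:

[ ]: # Add the selected engineered features to the training matrix
     X_train['freeswap_sq'] = X_train['freeswap'] ** 2
     X_train['freemem_freeswap_interaction'] = X_train['freemem'] * X_train['freeswap']
     X_train['rchar_freeswap_interaction'] = X_train['rchar'] * X_train['freeswap']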
olsmod_12 = sm.OLS(y_train, X_train)
olsres_12 = olsmod_12.fit()
print(olsres_12.summary())
rchar_freeswap_interaction   -2.649e-12   5.88e-13     -4.507      0.000    -3.8e-12    -1.5e-12
==============================================================================
Omnibus: 450.177 Durbin-Watson: 1.955
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1838.480
Skew: -0.295 Prob(JB): 0.00
Kurtosis: 5.710 Cond. No. 1.79e+13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.79e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]:
5.14.2 As observed above, a predictor (‘rchar’) has p-value > 0.05, so we remove it and
rebuild the model
[ ]: X_train = X_train.drop(["rchar"], axis=1)
olsmod_13 = sm.OLS(y_train, X_train)
olsres_13 = olsmod_13.fit()
print(olsres_13.summary())
wchar                         -5.25e-06      7e-07     -7.505      0.000   -6.62e-06   -3.88e-06
pgout                           -0.2679      0.029     -9.260      0.000      -0.325      -0.211
atch                             0.2222      0.104      2.132      0.033       0.018       0.426
pgin                            -0.1768      0.007    -25.917      0.000      -0.190      -0.163
pflt                            -0.0456      0.001    -86.879      0.000      -0.047      -0.045
freemem                         -0.0008      0.000     -3.209      0.001      -0.001      -0.000
freeswap                      3.956e-05   4.61e-07     85.813      0.000    3.87e-05    4.05e-05
runqsz_Not_CPU_Bound            -0.1498      0.098     -1.530      0.126      -0.342       0.042
freeswap_sq                   -1.46e-11   2.27e-13    -64.409      0.000    -1.5e-11   -1.42e-11
freemem_freeswap_interaction  8.138e-10   1.49e-10      5.472      0.000    5.22e-10    1.11e-09
rchar_freeswap_interaction   -2.859e-12   2.38e-13    -12.017      0.000   -3.33e-12   -2.39e-12
==============================================================================
Omnibus: 448.712 Durbin-Watson: 1.955
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1814.195
Skew: -0.297 Prob(JB): 0.00
Kurtosis: 5.691 Cond. No. 1.38e+13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.38e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
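In the summary above, runqsz_Not_CPU_Bound has p-value 0.126 (> 0.05), and it no longer appears in the next summary, so it was dropped here. A minimal sketch of the step (variable names follow the olsres_14 reference later in the notebook):

[ ]: # Drop the non-significant dummy and refit the final model
     X_train = X_train.drop(["runqsz_Not_CPU_Bound"], axis=1)
     olsmod_14 = sm.OLS(y_train, X_train)
     olsres_14 = olsmod_14.fit()
     print(olsres_14.summary())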
No. Observations: 5734 AIC: 2.975e+04
Df Residuals: 5721 BIC: 2.984e+04
Df Model: 12
Covariance Type: nonrobust
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           73.3141      0.263    278.979      0.000      72.799      73.829
lread                           -0.0405      0.003    -12.582      0.000      -0.047      -0.034
scall                           -0.0018   3.67e-05    -48.930      0.000      -0.002      -0.002
wchar                        -5.138e-06   6.96e-07     -7.384      0.000    -6.5e-06   -3.77e-06
pgout                           -0.2690      0.029     -9.297      0.000      -0.326      -0.212
atch                             0.2229      0.104      2.139      0.032       0.019       0.427
pgin                            -0.1767      0.007    -25.896      0.000      -0.190      -0.163
pflt                            -0.0456      0.001    -86.938      0.000      -0.047      -0.045
freemem                         -0.0007      0.000     -2.956      0.003      -0.001      -0.000
freeswap                      3.935e-05    4.4e-07     89.484      0.000    3.85e-05    4.02e-05
freeswap_sq                  -1.449e-11   2.14e-13    -67.822      0.000   -1.49e-11   -1.41e-11
freemem_freeswap_interaction  7.603e-10   1.45e-10      5.259      0.000    4.77e-10    1.04e-09
rchar_freeswap_interaction   -2.805e-12   2.35e-13    -11.921      0.000   -3.27e-12   -2.34e-12
==============================================================================
Omnibus: 449.041 Durbin-Watson: 1.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1809.231
Skew: -0.299 Prob(JB): 0.00
Kurtosis: 5.686 Cond. No. 1.37e+13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.37e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: df_pred = pd.DataFrame()
     df_pred["Residuals"] = olsres_14.resid   # residuals of the final model
[ ]: sns.histplot(df_pred["Residuals"], kde=True)
plt.title("Normality of residuals")
plt.show()
[ ]: import pylab
     import scipy.stats as stats

     # QQ plot of residuals against the normal distribution
     stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
     plt.show()
The QQ plot of residuals can be used to visually check the normality assumption. The normal
probability plot of residuals should approximately follow a straight line.
[ ]: stats.shapiro(df_pred["Residuals"])

[ ]: ShapiroResult(statistic=0.9624219536781311, pvalue=3.6266662517326524e-36)
Since the p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
• The presence of non-constant variance in the error terms results in heteroscedasticity. Generally,
non-constant variance arises in the presence of outliers.
How to check if the model has heteroscedasticity?
• We can use the Goldfeld-Quandt test. If we get p-value > 0.05 we can say that the residuals are
homoscedastic; otherwise they are heteroscedastic.
How to deal with heteroscedasticity?
• It can be fixed by adding other important features or making transformations.
The null and alternate hypotheses of the Goldfeld-Quandt test are as follows:
• Null hypothesis: Residuals are homoscedastic
• Alternate hypothesis: Residuals are heteroscedastic
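The p-value below comes from the Goldfeld-Quandt test; a minimal sketch of the call, assuming it was run on the training residuals and predictors at this step:

[ ]: from statsmodels.stats.diagnostic import het_goldfeldquandt

     # het_goldfeldquandt returns (F statistic, p-value, ordering); index 1 is the p-value
     het_goldfeldquandt(df_pred["Residuals"], X_train)[1]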
[ ]: 0.22334693193252067
• Since p-value > 0.05 we can say that the residuals are homoscedastic.
5.15 All the assumptions of linear regression are now satisfied. Let’s check the
summary of our final model (olsmod_14).
[ ]: print(olsres_14.summary())
wchar                        -5.138e-06   6.96e-07     -7.384      0.000    -6.5e-06   -3.77e-06
pgout                           -0.2690      0.029     -9.297      0.000      -0.326      -0.212
atch                             0.2229      0.104      2.139      0.032       0.019       0.427
pgin                            -0.1767      0.007    -25.896      0.000      -0.190      -0.163
pflt                            -0.0456      0.001    -86.938      0.000      -0.047      -0.045
freemem                         -0.0007      0.000     -2.956      0.003      -0.001      -0.000
freeswap                      3.935e-05    4.4e-07     89.484      0.000    3.85e-05    4.02e-05
freeswap_sq                  -1.449e-11   2.14e-13    -67.822      0.000   -1.49e-11   -1.41e-11
freemem_freeswap_interaction  7.603e-10   1.45e-10      5.259      0.000    4.77e-10    1.04e-09
rchar_freeswap_interaction   -2.805e-12   2.35e-13    -11.921      0.000   -3.27e-12   -2.34e-12
==============================================================================
Omnibus: 449.041 Durbin-Watson: 1.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1809.231
Skew: -0.299 Prob(JB): 0.00
Kurtosis: 5.686 Cond. No. 1.37e+13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.37e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: df_pred = pd.DataFrame()
     df_pred["Fitted Values"] = olsres_14.fittedvalues
     df_pred["Residuals"] = olsres_14.resid

     # Scatter of residuals against fitted values
     sns.scatterplot(x="Fitted Values", y="Residuals", data=df_pred)
     plt.title("Fitted vs Residual plot")
     plt.show()
[ ]: const 7.331406e+01
lread -4.045581e-02
scall -1.797301e-03
wchar -5.137688e-06
pgout -2.689676e-01
atch 2.229092e-01
pgin -1.766742e-01
pflt -4.564858e-02
freemem -7.423463e-04
freeswap 3.934889e-05
freeswap_sq -1.448617e-11
freemem_freeswap_interaction 7.602648e-10
rchar_freeswap_interaction -2.804670e-12
dtype: float64
5.16.1 Observations
Intercept: The intercept term is 73.3140576481782. This represents the predicted value of ‘usr’
when all predictor variables are zero.
Coefficients:
The coefficients associated with the predictor variables have varying magnitudes, indicating their
respective impacts on the target variable ‘usr’.
Negative coefficients (e.g., for ‘lread’, ‘scall’, ‘wchar’, ‘pgout’, ‘pgin’, ‘pflt’, ‘freemem’) suggest a
negative relationship with ‘usr’. An increase in these variables tends to decrease the predicted
value of ‘usr’.
Positive coefficients (e.g., for ‘atch’, ‘freeswap’, interaction terms) suggest a positive relationship
with ‘usr’. An increase in these variables tends to increase the predicted value of ‘usr’.
Magnitude of Coefficients: The magnitude of the coefficients indicates the strength of the
relationship between the predictor variables and the target variable. Larger magnitude coefficients
suggest a stronger influence on the target variable.
Interaction Terms:
Interaction terms such as ‘freemem_freeswap_interaction’ and ‘rchar_freeswap_interaction’ are
included in the equation. These terms represent the combined effect of two predictor variables
(‘freemem’ and ‘freeswap’, ‘rchar’ and ‘freeswap’) on the target variable ‘usr’. The coefficients
associated with interaction terms indicate the impact of the interaction between the respective
predictor variables on the target variable.
Squared Term:
The squared term ‘freeswap_sq’ is included in the equation. This suggests that the relationship
between ‘freeswap’ and ‘usr’ may not be linear but quadratic: as ‘freeswap’ increases, its effect on
‘usr’ may vary nonlinearly rather than remain constant.
Overall, the final equation provides insights into how each predictor variable contributes to the
prediction of ‘usr’ and how their interactions affect the target variable. It can be used to make
predictions and understand the relationship between the predictors and the target variable in the
context of the specific problem domain. The fitted equation can be printed directly from the model
parameters, as sketched below.
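A minimal sketch that assembles the fitted equation from the final model’s parameters (olsres_14 is the final model referenced above):

[ ]: # Build a readable equation string from the fitted coefficients
     terms = " + ".join(f"({coef:.4g} * {name})"
                        for name, coef in olsres_14.params.items() if name != "const")
     print(f"usr = {olsres_14.params['const']:.4f} + {terms}")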
5.17 Predictions
[ ]: X_train.columns
[ ]: X_test.columns
[ ]: Index(['const', 'lread', 'scall', 'wchar', 'pgout', 'atch', 'pgin', 'pflt',
'freemem', 'freeswap', 'freemem_freeswap_interaction',
'rchar_freeswap_interaction', 'freeswap_sq'],
dtype='object')
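The two numbers below are the train and test RMSE values quoted in the conclusion; a minimal sketch of how they would be computed (X_test is assumed to carry the same engineered columns, per the Index output above):

[ ]: # RMSE on the training set
     np.sqrt(metrics.mean_squared_error(y_train, olsres_14.predict(X_train)))

[ ]: # RMSE on the test set
     np.sqrt(metrics.mean_squared_error(y_test, olsres_14.predict(X_test)))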
[ ]: 3.232277869882356
[ ]: 255.21624622518092
[ ]: X_test.columns
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                           72.2401      0.375    192.697      0.000      71.505      72.975
lread                           -0.0297      0.005     -6.144      0.000      -0.039      -0.020
scall                           -0.0018   5.89e-05    -30.093      0.000      -0.002      -0.002
wchar                        -5.483e-06    1.1e-06     -4.986      0.000   -7.64e-06   -3.33e-06
pgout                           -0.1708      0.043     -3.930      0.000      -0.256      -0.086
atch                            -0.0002      0.157     -0.001      0.999      -0.309       0.309
pgin                            -0.1713      0.010    -16.356      0.000      -0.192      -0.151
pflt                            -0.0441      0.001    -53.917      0.000      -0.046      -0.042
freemem                         -0.0009      0.000     -2.421      0.016      -0.002      -0.000
freeswap                      4.056e-05   6.37e-07     63.694      0.000    3.93e-05    4.18e-05
freemem_freeswap_interaction  8.909e-10   2.09e-10      4.273      0.000    4.82e-10     1.3e-09
rchar_freeswap_interaction   -2.872e-12   3.62e-13     -7.929      0.000   -3.58e-12   -2.16e-12
freeswap_sq                  -1.499e-11   3.19e-13    -46.937      0.000   -1.56e-11   -1.44e-11
==============================================================================
Omnibus: 195.660 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1031.784
Skew: -0.143 Prob(JB): 8.93e-225
Kurtosis: 6.161 Cond. No. 1.23e+13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.23e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
Training data set: R-squared: 0.891, Adj. R-squared: 0.891, RMSE: 3.232277869882356
Testing data set: R-squared: 0.885, Adj. R-squared: 0.885, RMSE: 255.21624622518092
Training Dataset:
R-squared (Coefficient of Determination): The R-squared value of 0.891 indicates that approximately
89.1% of the variability in the target variable (‘usr’ - the percentage of time CPUs operate
in user mode) can be explained by the predictor variables included in the model. A higher R-squared
value suggests that the model fits the training data well.
Adjusted R-squared: The adjusted R-squared value of 0.891 is almost the same as the R-squared
value, which indicates that the model’s performance is consistent even after adjusting for the number
of predictors in the model.
RMSE (Root Mean Squared Error): The RMSE value of 3.232 suggests that, on average, the model’s
predictions deviate from the actual values by approximately 3.232 percentage points. Lower RMSE
values indicate better accuracy of the model.
Testing Dataset:
R-squared: The R-squared value of 0.885 for the testing dataset indicates that approximately
88.5% of the variability in the target variable can be explained by the model. This suggests that
the model’s performance on unseen data is also relatively high.
Adjusted R-squared: Like the training dataset, the adjusted R-squared value is also 0.885, indicating
consistent performance after adjusting for the number of predictors.
RMSE: The RMSE value of 255.216 suggests that, on average, the model’s predictions deviate from
the actual values by approximately 255.216 percentage points on the testing dataset.
Overall Interpretation: The model has some issues and appears to be overfitting; in particular, the
test RMSE is far larger than the training RMSE. This needs further investigation.
[ ]: