Notebook - measures of computer systems

Saurabh_Agrawal

February 28, 2024

1 Problem - 1

2 Context
The comp-activ database comprises activity measures of computer systems. Data was gathered
from a Sun Sparcstation 20/712 with 128 Mbytes of memory, operating in a multi-user university
department. Users engaged in diverse tasks, such as internet access, file editing, and CPU-intensive
programs.

3 Objective
Being an aspiring data scientist, you aim to establish a linear equation for predicting ‘usr’ (the
percentage of time CPUs operate in user mode). Your goal is to analyze various system attributes
to understand their influence on the system’s ‘usr’ mode.

4 Data Description: System measures used


1. lread - Reads (transfers per second) between system memory and user memory
2. lwrite - Writes (transfers per second) between system memory and user memory
3. scall - Number of system calls of all types per second
4. sread - Number of system read calls per second
5. swrite - Number of system write calls per second
6. fork - Number of system fork calls per second
7. exec - Number of system exec calls per second
8. rchar - Number of characters transferred per second by system read calls
9. wchar - Number of characters transferred per second by system write calls
10. pgout - Number of page out requests per second
11. ppgout - Number of pages paged out per second
12. pgfree - Number of pages per second placed on the free list
13. pgscan - Number of pages checked if they can be freed per second
14. atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per
second
15. pgin - Number of page-in requests per second
16. ppgin - Number of pages paged in per second
17. pflt - Number of page faults caused by protection errors (copy-on-writes)
18. vflt - Number of page faults caused by address translation
19. runqsz - Process run queue size (the number of kernel threads in memory that are waiting
for a CPU to run; typically this value should be less than 2, and consistently higher values mean
the system might be CPU-bound)
20. freemem - Number of memory pages available to user processes
21. freeswap - Number of disk blocks available for page swapping
22. usr - Portion of time (%) that CPUs run in user mode

5 Criteria
5.1 Problem 1 - Define the problem and perform exploratory Data Analysis
• Problem definition
• Check shape, Data types, statistical summary
• Univariate analysis
• Multivariate analysis
• Use appropriate visualizations to identify the patterns and insights
• Key meaningful observations on individual variables and the relationship between variables

5.2 Problem 1 - Data Pre-processing


Prepare the data for modelling:
• Missing Value Treatment (if needed)
• Outlier Detection (treat, if needed)
• Feature Engineering
• Encode the data
• Train-test split

5.3 Problem 1- Model Building - Linear regression


• Apply Linear Regression using sklearn
• Using statsmodels, perform checks for significant variables using the appropriate method
• Create multiple models and check the performance of predictions on the train and test sets using
R-squared, RMSE & Adjusted R-squared.

5.4 Problem 1 - Business Insights & Recommendations


• Comment on the Linear Regression equation from the final model and the impact of relevant
variables (at least 2) as per the equation
• Conclude with the key takeaways (actionable insights and recommendations) for the business

[ ]: import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import warnings
warnings.filterwarnings("ignore")

5.5 Problem Statement:


Given the comp-activ database comprising activity measures of a computer system, the task is to
develop a linear equation to predict the percentage of time CPUs operate in user mode (‘usr’). This
prediction will be based on various system attributes that influence the ‘usr’ mode. The objective
is to analyze these system attributes and understand their impact on the ‘usr’ mode.

5.6 Objective:
Build a linear regression model that accurately predicts the ‘usr’ mode based on the provided
system measures. Additionally, interpret the coefficients of the model to understand the influence
of each system attribute on the ‘usr’ mode.

[ ]: df = pd.read_excel('compactiv.xlsx')
df.head()

[ ]: lread lwrite scall sread swrite fork exec rchar wchar pgout \
0 1 0 2147 79 68 0.2 0.2 40671.0 53995.0 0.0
1 0 0 170 18 21 0.2 0.2 448.0 8385.0 0.0
2 15 3 2162 159 119 2.0 2.4 NaN 31950.0 0.0
3 0 0 160 12 16 0.2 0.2 NaN 8670.0 0.0
4 5 1 330 39 38 0.4 0.4 NaN 12185.0 0.0

… pgscan atch pgin ppgin pflt vflt runqsz freemem \


0 … 0.0 0.0 1.6 2.6 16.00 26.40 CPU_Bound 4670
1 … 0.0 0.0 0.0 0.0 15.63 16.83 Not_CPU_Bound 7278
2 … 0.0 1.2 6.0 9.4 150.20 220.20 Not_CPU_Bound 702
3 … 0.0 0.0 0.2 0.2 15.60 16.80 Not_CPU_Bound 7248
4 … 0.0 0.0 1.0 1.2 37.80 47.60 Not_CPU_Bound 633

freeswap usr
0 1730946 95
1 1869002 97
2 1021237 87
3 1863704 98
4 1760253 90

[5 rows x 22 columns]

[ ]: df.tail()

[ ]: lread lwrite scall sread swrite fork exec rchar wchar \


8187 16 12 3009 360 244 1.6 5.81 405250.0 85282.0
8188 4 0 1596 170 146 2.4 1.80 89489.0 41764.0
8189 16 5 3116 289 190 0.6 0.60 325948.0 52640.0
8190 32 45 5180 254 179 1.2 1.20 62571.0 29505.0
8191 2 0 985 55 46 1.6 4.80 111111.0 22256.0

pgout … pgscan atch pgin ppgin pflt vflt runqsz \


8187 8.02 … 55.11 0.6 35.87 47.90 139.28 270.74 CPU_Bound
8188 3.80 … 0.20 0.8 3.80 4.40 122.40 212.60 Not_CPU_Bound
8189 0.40 … 0.00 0.4 28.40 45.20 60.20 219.80 Not_CPU_Bound
8190 1.40 … 18.04 0.4 23.05 24.25 93.19 202.81 CPU_Bound
8191 0.00 … 0.00 0.2 3.40 6.20 91.80 110.00 CPU_Bound

freemem freeswap usr


8187 387 986647 80
8188 263 1055742 90
8189 400 969106 87
8190 141 1022458 83
8191 659 1756514 94

[5 rows x 22 columns]

[ ]: print('Number of rows: ', df.shape[0], '\n''Number of columns: ', df.shape[1])

Number of rows: 8192


Number of columns: 22

[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8192 entries, 0 to 8191
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lread 8192 non-null int64
1 lwrite 8192 non-null int64
2 scall 8192 non-null int64
3 sread 8192 non-null int64
4 swrite 8192 non-null int64
5 fork 8192 non-null float64
6 exec 8192 non-null float64
7 rchar 8088 non-null float64
8 wchar 8177 non-null float64
9 pgout 8192 non-null float64
10 ppgout 8192 non-null float64

11 pgfree 8192 non-null float64
12 pgscan 8192 non-null float64
13 atch 8192 non-null float64
14 pgin 8192 non-null float64
15 ppgin 8192 non-null float64
16 pflt 8192 non-null float64
17 vflt 8192 non-null float64
18 runqsz 8192 non-null object
19 freemem 8192 non-null int64
20 freeswap 8192 non-null int64
21 usr 8192 non-null int64
dtypes: float64(13), int64(8), object(1)
memory usage: 1.4+ MB

[ ]: df.describe().T.round(2)

[ ]: count mean std min 25% 50% \


lread 8192.0 19.56 53.35 0.0 2.0 7.0
lwrite 8192.0 13.11 29.89 0.0 0.0 1.0
scall 8192.0 2306.32 1633.62 109.0 1012.0 2051.5
sread 8192.0 210.48 198.98 6.0 86.0 166.0
swrite 8192.0 150.06 160.48 7.0 63.0 117.0
fork 8192.0 1.88 2.48 0.0 0.4 0.8
exec 8192.0 2.79 5.21 0.0 0.2 1.2
rchar 8088.0 197385.73 239837.49 278.0 34091.5 125473.5
wchar 8177.0 95902.99 140841.71 1498.0 22916.0 46619.0
pgout 8192.0 2.29 5.31 0.0 0.0 0.0
ppgout 8192.0 5.98 15.21 0.0 0.0 0.0
pgfree 8192.0 11.92 32.36 0.0 0.0 0.0
pgscan 8192.0 21.53 71.14 0.0 0.0 0.0
atch 8192.0 1.13 5.71 0.0 0.0 0.0
pgin 8192.0 8.28 13.87 0.0 0.6 2.8
ppgin 8192.0 12.39 22.28 0.0 0.6 3.8
pflt 8192.0 109.79 114.42 0.0 25.0 63.8
vflt 8192.0 185.32 191.00 0.2 45.4 120.4
freemem 8192.0 1763.46 2482.10 55.0 231.0 579.0
freeswap 8192.0 1328125.96 422019.43 2.0 1042623.5 1289289.5
usr 8192.0 83.97 18.40 0.0 81.0 89.0

75% max
lread 20.00 1845.00
lwrite 10.00 575.00
scall 3317.25 12493.00
sread 279.00 5318.00
swrite 185.00 5456.00
fork 2.20 20.12
exec 2.80 59.56

rchar 267828.75 2526649.00
wchar 106101.00 1801623.00
pgout 2.40 81.44
ppgout 4.20 184.20
pgfree 5.00 523.00
pgscan 0.00 1237.00
atch 0.60 211.58
pgin 9.76 141.20
ppgin 13.80 292.61
pflt 159.60 899.80
vflt 251.80 1365.00
freemem 2002.25 12027.00
freeswap 1730379.50 2243187.00
usr 94.00 99.00

[ ]: df.select_dtypes(include=['object']).describe().T

[ ]: count unique top freq


runqsz 8192 2 Not_CPU_Bound 4331

[ ]: df['runqsz'].value_counts()

[ ]: runqsz
Not_CPU_Bound 4331
CPU_Bound 3861
Name: count, dtype: int64

5.6.1 Statistical Summary Insights:


Variables such as lread, lwrite, scall, sread, swrite, rchar, wchar, pgout, ppgout, pgfree, pgscan,
atch, pgin, ppgin, pflt, vflt, freemem, and freeswap exhibit a wide range of values, indicating
significant variability in their measurements.
Some variables have noticeable differences between their mean and median values, suggesting
potential skewness in their distributions.
Variables like fork and exec have relatively low mean values compared to their maximum values,
indicating potential outliers or distributions skewed towards lower values.
The dependent variable usr has a mean value of 83.97% with a standard deviation of 18.40, indicating
considerable variability in the percentage of time CPUs run in user mode across observations.
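
To make the mean-vs-median point concrete, the gap and the skew can be tabulated directly; the snippet
below is a small sketch that uses only the pandas objects already loaded above (output not reproduced here).

[ ]: # Compare mean, median and skew for every numeric column; large positive
# mean-median gaps line up with the right-skewed variables noted above
num_cols = df.select_dtypes(include='number')
summary = pd.DataFrame({'mean': num_cols.mean(), 'median': num_cols.median(), 'skew': num_cols.skew()})
print(summary.round(2))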

[ ]:

5.7 Checking Missing Value


[ ]: def missing_values_summary(df):
# Calculate total missing values for each column
missing_values = df.isnull().sum()

# Print the missing value information
for col, missing_count in missing_values.items():
print(f"{col}: {missing_count} missing values")

missing_values_summary(df)

lread: 0 missing values


lwrite: 0 missing values
scall: 0 missing values
sread: 0 missing values
swrite: 0 missing values
fork: 0 missing values
exec: 0 missing values
rchar: 104 missing values
wchar: 15 missing values
pgout: 0 missing values
ppgout: 0 missing values
pgfree: 0 missing values
pgscan: 0 missing values
atch: 0 missing values
pgin: 0 missing values
ppgin: 0 missing values
pflt: 0 missing values
vflt: 0 missing values
runqsz: 0 missing values
freemem: 0 missing values
freeswap: 0 missing values
usr: 0 missing values

5.8 Treating Missing Value


[ ]: df['rchar'] = df['rchar'].fillna(df['rchar'].median())
df['wchar'] = df['wchar'].fillna(df['wchar'].median())

[ ]: missing_values_summary(df)

lread: 0 missing values


lwrite: 0 missing values
scall: 0 missing values
sread: 0 missing values
swrite: 0 missing values
fork: 0 missing values
exec: 0 missing values
rchar: 0 missing values
wchar: 0 missing values
pgout: 0 missing values
ppgout: 0 missing values

pgfree: 0 missing values
pgscan: 0 missing values
atch: 0 missing values
pgin: 0 missing values
ppgin: 0 missing values
pflt: 0 missing values
vflt: 0 missing values
runqsz: 0 missing values
freemem: 0 missing values
freeswap: 0 missing values
usr: 0 missing values

5.9 Checking Duplicate


[ ]: dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

5.10 Univariate Analysis


[ ]: def histogram_boxplot(data, column):
fig, axes = plt.subplots(1, 2, figsize=(12, 3))

# Histogram
sns.histplot(data[column], ax=axes[0], kde=True)
axes[0].set_title(f'Histogram of {column}')

# Boxplot
sns.boxplot(x=data[column], ax=axes[1])
axes[1].set_title(f'Boxplot of {column}')

plt.show()

histogram_boxplot(df, 'lread')
histogram_boxplot(df, 'lwrite')
histogram_boxplot(df, 'scall')
histogram_boxplot(df, 'sread')
histogram_boxplot(df, 'swrite')
histogram_boxplot(df, 'fork')
histogram_boxplot(df, 'exec')
histogram_boxplot(df, 'rchar')
histogram_boxplot(df, 'wchar')
histogram_boxplot(df, 'pgout')
histogram_boxplot(df, 'ppgout')
histogram_boxplot(df, 'pgfree')
histogram_boxplot(df, 'pgscan')
histogram_boxplot(df, 'atch')

histogram_boxplot(df, 'pgin')
histogram_boxplot(df, 'ppgin')
histogram_boxplot(df, 'pflt')
histogram_boxplot(df, 'vflt')
histogram_boxplot(df, 'freemem')
histogram_boxplot(df, 'freeswap')
histogram_boxplot(df, 'usr')

[ ]: df.select_dtypes(include=['int', 'float']).skew().round(2)

[ ]: lread 13.90
lwrite 5.28
scall 0.90
sread 5.46
swrite 9.61
fork 2.25

exec 4.07
rchar 2.88
wchar 3.85
pgout 5.07
ppgout 4.68
pgfree 4.77
pgscan 5.81
atch 21.54
pgin 3.24
ppgin 3.90
pflt 1.72
vflt 1.74
freemem 1.81
freeswap -0.79
usr -3.42
dtype: float64

[ ]: plt.title('runqsz')
sns.countplot(df['runqsz'])

plt.subplots_adjust()
plt.tight_layout()

plt.show()

[ ]: def segregate_numerical_columns(df):
# Select columns with numeric data types
numerical_columns = df.select_dtypes(include=['int', 'float']).columns
df_num = df[numerical_columns]
return df_num

df_num = segregate_numerical_columns(df)

plt.figure(figsize=(20,7))
sns.heatmap(df_num.corr(), annot=True, mask=np.triu(df_num.corr(), +1), cmap='RdYlGn');

Typically, the rule of thumb for correlation values is as follows:
1. r between −0.4 and +0.4 indicates absence of linear dependence
2. r between −0.7 and −0.4, or r between +0.4 and +0.7, indicates moderate linear dependence, the sign indicating its direction
3. r less than −0.7 or r greater than +0.7 indicates strong linear dependence

[ ]: def filter_correlation(df):
    # Calculate the correlation matrix
    corr_matrix = df.corr()

    # Initialize lists to store column pairs for moderate and strong correlations
    moderate_correlations = []
    strong_correlations = []

    # Iterate over the upper triangle of the correlation matrix
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            # Check for moderate correlation (0.4 <= |r| <= 0.7)
            if ((corr_matrix.iloc[i, j] >= 0.4 and corr_matrix.iloc[i, j] <= 0.7) or
                    (corr_matrix.iloc[i, j] <= -0.4 and corr_matrix.iloc[i, j] >= -0.7)):
                moderate_correlations.append(
                    (corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))
            # Check for strong correlation (|r| > 0.7)
            elif (corr_matrix.iloc[i, j] < -0.7 or corr_matrix.iloc[i, j] > 0.7):
                strong_correlations.append(
                    (corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

    return moderate_correlations, strong_correlations

moderate_correlations, strong_correlations = filter_correlation(df_num)

print("Moderate Correlations:")
for corr in moderate_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")

print("\nStrong Correlations:")
for corr in strong_correlations:
print(f"{corr[0]} - {corr[1]}: {corr[2]}")

Moderate Correlations:
lread - lwrite: 0.5337368224057958
scall - sread: 0.6968867812358538
scall - swrite: 0.6199837643477688
scall - fork: 0.44676647183917245
scall - pflt: 0.481780709551168
scall - vflt: 0.5317598003509384
sread - fork: 0.4167207141572767
sread - rchar: 0.4999982166371595
sread - wchar: 0.4014265486206024
sread - pflt: 0.45201960899213534
sread - vflt: 0.49104525598137194
swrite - vflt: 0.41657080986074013
exec - pflt: 0.6452390212895793
exec - vflt: 0.6917544848966637
rchar - wchar: 0.4995687409698955
pgout - pgscan: 0.5539159057375587
pgout - ppgin: 0.41486526490355896
ppgout - pgin: 0.4882613687043727
ppgout - ppgin: 0.5423920181151021
pgfree - pgin: 0.5328340692850801
pgfree - ppgin: 0.5933957386640062
pgscan - pgin: 0.4968263206180219
pgscan - ppgin: 0.5649909017757778
vflt - usr: -0.4206853097412153
freemem - freeswap: 0.5726322069049757
freeswap - usr: 0.6785262417399971

Strong Correlations:
sread - swrite: 0.8810693839008278
fork - exec: 0.7639742315330512
fork - pflt: 0.9310399616311366
fork - vflt: 0.9393484703374151
pgout - ppgout: 0.8724453798209877
pgout - pgfree: 0.7303809913990689
ppgout - pgfree: 0.9177904500452804

ppgout - pgscan: 0.7852562930865677
pgfree - pgscan: 0.9152168107574576
pgin - ppgin: 0.9236207464900754
pflt - vflt: 0.935369585359654

[ ]: #sns.pairplot(df_num, diag_kind='kde')
#plt.show()

5.10.1 Strong Correlations between Variables:


sread - swrite: There is a strong positive correlation (0.881) between the number of system read
calls per second and the number of system write calls per second. This suggests that these two
activities are closely related.
fork - exec: Another strong positive correlation (0.764) is observed between the number of system
fork calls per second and the number of system exec calls per second. This indicates that processes
created by fork calls are often followed by execution calls.
fork - pflt & fork - vflt: High positive correlations (0.931 and 0.939 respectively) between fork calls
and page faults caused by protection errors and address translation suggest that fork calls might
lead to page faults.
pgout - ppgout: There is a strong positive correlation (0.872) between the number of page out
requests per second and the number of pages paged out per second, indicating consistency in these
activities.
ppgout - pgfree & ppgout - pgscan: These pairs also show strong positive correlations, suggesting
interdependence in the activities they represent.

5.10.2 Dependent Variable (usr) Correlations:


vflt - usr: There is a moderate negative correlation (-0.421) between the number of page faults
caused by address translation and the percentage of time CPUs run in user mode.
freeswap - usr: A moderate positive correlation (0.679) is observed between the amount of free disk
blocks available for page swapping and the percentage of time CPUs run in user mode.

5.10.3 Interpretation and Relationship Between Variables:


Variables showing strong positive correlations may represent related system activities or processes
that tend to occur together.
Negative correlations with the dependent variable usr suggest that higher values of those variables
might lead to lower percentages of CPU time spent in user mode.
Positive correlations with usr indicate that higher values of those variables might lead to higher
percentages of CPU time spent in user mode.

[ ]:

5.11 Get the count of Zeros in column
[ ]: def zeros_percentage(df):
# Count the number of zeros in each column
zero_counts = (df == 0).sum()

# Calculate the percentage of zeros in each column


total_counts = df.shape[0]
zero_percentages = (zero_counts / total_counts) * 100

# Create a DataFrame to store the results


zeros_info = pd.DataFrame({'Zero Count': zero_counts, 'Zero Percentage': zero_percentages})

return zeros_info

zeros_info_df = zeros_percentage(df)
print(zeros_info_df)

Zero Count Zero Percentage


lread 675 8.239746
lwrite 2684 32.763672
scall 0 0.000000
sread 0 0.000000
swrite 0 0.000000
fork 21 0.256348
exec 21 0.256348
rchar 0 0.000000
wchar 0 0.000000
pgout 4878 59.545898
ppgout 4878 59.545898
pgfree 4869 59.436035
pgscan 6448 78.710938
atch 4575 55.847168
pgin 1220 14.892578
ppgin 1220 14.892578
pflt 3 0.036621
vflt 0 0.000000
runqsz 0 0.000000
freemem 0 0.000000
freeswap 0 0.000000
usr 283 3.454590
Ideally, we would delete columns with more than 75% of their values equal to zero, but given the
nature of these columns we consider the zeros to be genuine entries, so we are not deleting pgscan.
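
A quick way to flag such columns programmatically is to filter the zero-count summary computed above;
this one-liner is a small sketch rather than part of the original treatment.

[ ]: # Columns where more than 75% of the observations are zero
print(zeros_info_df[zeros_info_df['Zero Percentage'] > 75])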

5.12 Checking Outlier
[ ]: def outliers_summary(df):
# Select only integer and float columns
numeric_columns = df.select_dtypes(include=['int', 'float']).columns

# Dictionary to store outlier information for each column


outliers_info = {}

# Iterate through each numeric column


for col in numeric_columns:
# Calculate the IQR (Interquartile Range)
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1

# Calculate upper and lower bounds for outliers


upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR

# Count the number of outliers


upper_outliers = (df[col] > upper_bound).sum()
lower_outliers = (df[col] < lower_bound).sum()
total_outliers = upper_outliers + lower_outliers

# Store the outlier information for the column


outliers_info[col] = (total_outliers, upper_outliers, lower_outliers)

# Print the outlier information


for col, info in outliers_info.items():
print(f"{col}: Total outliers = {info[0]}, Upper outliers = {info[1]},␣
↪Lower outliers = {info[2]}")

outliers_summary(df)

lread: Total outliers = 753, Upper outliers = 753, Lower outliers = 0


lwrite: Total outliers = 1305, Upper outliers = 1305, Lower outliers = 0
scall: Total outliers = 108, Upper outliers = 108, Lower outliers = 0
sread: Total outliers = 340, Upper outliers = 340, Lower outliers = 0
swrite: Total outliers = 495, Upper outliers = 495, Lower outliers = 0
fork: Total outliers = 943, Upper outliers = 943, Lower outliers = 0
exec: Total outliers = 710, Upper outliers = 710, Lower outliers = 0
rchar: Total outliers = 465, Upper outliers = 465, Lower outliers = 0
wchar: Total outliers = 817, Upper outliers = 817, Lower outliers = 0
pgout: Total outliers = 988, Upper outliers = 988, Lower outliers = 0
ppgout: Total outliers = 1315, Upper outliers = 1315, Lower outliers = 0
pgfree: Total outliers = 1555, Upper outliers = 1555, Lower outliers = 0
pgscan: Total outliers = 1744, Upper outliers = 1744, Lower outliers = 0

atch: Total outliers = 1209, Upper outliers = 1209, Lower outliers = 0
pgin: Total outliers = 789, Upper outliers = 789, Lower outliers = 0
ppgin: Total outliers = 821, Upper outliers = 821, Lower outliers = 0
pflt: Total outliers = 395, Upper outliers = 395, Lower outliers = 0
vflt: Total outliers = 484, Upper outliers = 484, Lower outliers = 0
freemem: Total outliers = 1185, Upper outliers = 1185, Lower outliers = 0
freeswap: Total outliers = 294, Upper outliers = 0, Lower outliers = 294
usr: Total outliers = 430, Upper outliers = 0, Lower outliers = 430

[ ]: # construct box plot for continuous variables


cont=df.dtypes[(df.dtypes!='object') & (df.dtypes!='bool')].index
plt.figure(figsize=(10,10))
df[cont].boxplot(vert=0)
plt.title('With Outliers',fontsize=16)
plt.show()

[ ]: def remove_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range

for column in df[cont].columns:


lr,ur=remove_outlier(df[column])
df[column]=np.where(df[column]>ur,ur,df[column])
df[column]=np.where(df[column]<lr,lr,df[column])

plt.figure(figsize=(10,10))
df[cont].boxplot(vert=0)
plt.title('After Outlier Removal',fontsize=16)
plt.show()

[ ]: #df_attr = (df[cont])
#sns.pairplot(df_attr, diag_kind='kde')
#plt.show()

5.12.1 Encode the data


[ ]: df_dummy = pd.get_dummies(df, drop_first=True)
df_dummy.head()

[ ]: lread lwrite scall sread swrite fork exec rchar wchar pgout \
0 1.0 0.0 2147.0 79.0 68.0 0.2 0.2 40671.0 53995.0 0.0
1 0.0 0.0 170.0 18.0 21.0 0.2 0.2 448.0 8385.0 0.0
2 15.0 3.0 2162.0 159.0 119.0 2.0 2.4 125473.5 31950.0 0.0

25
3 0.0 0.0 160.0 12.0 16.0 0.2 0.2 125473.5 8670.0 0.0
4 5.0 1.0 330.0 39.0 38.0 0.4 0.4 125473.5 12185.0 0.0

… pgscan atch pgin ppgin pflt vflt freemem freeswap usr \


0 … 0.0 0.0 1.6 2.6 16.00 26.40 4659.125 1730946.0 95.0
1 … 0.0 0.0 0.0 0.0 15.63 16.83 4659.125 1869002.0 97.0
2 … 0.0 1.2 6.0 9.4 150.20 220.20 702.000 1021237.0 87.0
3 … 0.0 0.0 0.2 0.2 15.60 16.80 4659.125 1863704.0 98.0
4 … 0.0 0.0 1.0 1.2 37.80 47.60 633.000 1760253.0 90.0

runqsz_Not_CPU_Bound
0 False
1 True
2 True
3 True
4 True

[5 rows x 22 columns]

[ ]: df_dummy.describe().T.round(2)

[ ]: count mean std min 25% 50% \


lread 8192.0 13.42 15.16 0.0 2.00 7.0
lwrite 8192.0 6.66 9.29 0.0 0.00 1.0
scall 8192.0 2294.48 1593.09 109.0 1012.00 2051.5
sread 8192.0 199.78 146.76 6.0 86.00 166.0
swrite 8192.0 137.97 97.14 7.0 63.00 117.0
fork 8192.0 1.56 1.59 0.0 0.40 0.8
exec 8192.0 1.93 2.03 0.0 0.20 1.2
rchar 8192.0 178884.07 174589.21 278.0 34860.50 125473.5
wchar 8192.0 75645.54 71262.96 1498.0 22977.75 46619.0
pgout 8192.0 1.42 2.20 0.0 0.00 0.0
ppgout 8192.0 2.56 4.04 0.0 0.00 0.0
pgfree 8192.0 3.16 4.98 0.0 0.00 0.0
pgscan 8192.0 0.00 0.00 0.0 0.00 0.0
atch 8192.0 0.39 0.56 0.0 0.00 0.0
pgin 8192.0 6.39 7.68 0.0 0.60 2.8
ppgin 8192.0 9.14 11.16 0.0 0.60 3.8
pflt 8192.0 105.64 101.55 0.0 25.00 63.8
vflt 8192.0 175.62 162.50 0.2 45.40 120.4
freemem 8192.0 1387.62 1605.76 55.0 231.00 579.0
freeswap 8192.0 1328519.88 420782.72 10989.5 1042623.50 1289289.5
usr 8192.0 86.25 9.75 61.5 81.00 89.0

75% max
lread 20.00 47.00
lwrite 10.00 25.00

scall 3317.25 6775.12
sread 279.00 568.50
swrite 185.00 368.00
fork 2.20 4.90
exec 2.80 6.70
rchar 265394.75 611196.12
wchar 106037.00 230625.88
pgout 2.40 6.00
ppgout 4.20 10.50
pgfree 5.00 12.50
pgscan 0.00 0.00
atch 0.60 1.50
pgin 9.76 23.51
ppgin 13.80 33.60
pflt 159.60 361.50
vflt 251.80 561.40
freemem 2002.25 4659.12
freeswap 1730379.50 2243187.00
usr 94.00 99.00

Before outlier treatment, the variable “pgscan” had more than 75% of its values equal to 0. After
outlier treatment, all the entries for “pgscan” became 0. Calculating the Variance Inflation Factor
(VIF) involves computing the inverse of the correlation matrix of the predictors; if a variable has
zero variance (i.e., all values are the same), this calculation breaks down, leading to division-by-zero
errors or NaN values. A variable with zero variance, like “pgscan” here, provides no useful information
for prediction since all observations share the same value. Thus, it can be safely dropped from the
model as it does not contribute to the prediction of the target variable.

[ ]: df_dummy.drop(columns=['pgscan'], inplace=True)

[ ]: df_dummy.columns

[ ]: Index(['lread', 'lwrite', 'scall', 'sread', 'swrite', 'fork', 'exec', 'rchar',


'wchar', 'pgout', 'ppgout', 'pgfree', 'atch', 'pgin', 'ppgin', 'pflt',
'vflt', 'freemem', 'freeswap', 'usr', 'runqsz_Not_CPU_Bound'],
dtype='object')

5.12.2 Train-test split


[ ]: # Copy all the predictor variables into X dataframe
X = df_dummy.drop('usr', axis=1)

# Copy target into the y dataframe.


y = df_dummy[['usr']]

# Split X and y into training and test set in 70:30 ratio


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

5.12.3 Linear Regression using statsmodel(OLS)

[ ]: import statsmodels.api as sm
X_train = sm.add_constant(X_train)  # This adds the constant term beta0 to the linear regression.

X_test=sm.add_constant(X_test)

# Convert boolean columns to numerical representation


X_train['runqsz_Not_CPU_Bound'] = X_train['runqsz_Not_CPU_Bound'].astype(int)

model = sm.OLS(endog=y_train,exog=X_train).fit()
print(model.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.796
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1115.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:10 Log-Likelihood: -16657.
No. Observations: 5734 AIC: 3.336e+04
Df Residuals: 5713 BIC: 3.350e+04
Df Model: 20
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.1217 0.316 266.106 0.000 83.502
84.741
lread -0.0635 0.009 -7.071 0.000 -0.081
-0.046
lwrite 0.0482 0.013 3.671 0.000 0.022
0.074
scall -0.0007 6.28e-05 -10.566 0.000 -0.001
-0.001
sread 0.0003 0.001 0.305 0.760 -0.002
0.002
swrite -0.0054 0.001 -3.777 0.000 -0.008
-0.003
fork 0.0293 0.132 0.222 0.824 -0.229
0.288

exec -0.3212 0.052 -6.220 0.000 -0.422
-0.220
rchar -5.167e-06 4.88e-07 -10.598 0.000 -6.12e-06
-4.21e-06
wchar -5.403e-06 1.03e-06 -5.232 0.000 -7.43e-06
-3.38e-06
pgout -0.3688 0.090 -4.098 0.000 -0.545
-0.192
ppgout -0.0766 0.079 -0.973 0.330 -0.231
0.078
pgfree 0.0845 0.048 1.769 0.077 -0.009
0.178
atch 0.6276 0.143 4.394 0.000 0.348
0.908
pgin 0.0200 0.028 0.703 0.482 -0.036
0.076
ppgin -0.0673 0.020 -3.415 0.001 -0.106
-0.029
pflt -0.0336 0.002 -16.957 0.000 -0.037
-0.030
vflt -0.0055 0.001 -3.830 0.000 -0.008
-0.003
freemem -0.0005 5.07e-05 -9.038 0.000 -0.001
-0.000
freeswap 8.832e-06 1.9e-07 46.472 0.000 8.46e-06
9.2e-06
runqsz_Not_CPU_Bound 1.6153 0.126 12.819 0.000 1.368
1.862
==============================================================================
Omnibus: 1103.645 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2372.553
Skew: -1.119 Prob(JB): 0.00
Kurtosis: 5.219 Cond. No. 7.74e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.74e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

5.12.4 Interpretation of R-squared


• The R-squared value tells us that our model can explain 79.6 % of the variance in the training
set.

5.12.5 Interpretation of Coefficients
• The coefficients tell us how a one-unit change in X affects y (see the sketch after this list).
• The sign of the coefficient indicates if the relationship is positive or negative.
• Multicollinearity occurs when predictor variables in a regression model are correlated. This
correlation is a problem because predictor variables should be independent. If the collinearity
between variables is high, we might not be able to trust the p-values to identify independent
variables that are statistically significant.
• When we have multicollinearity in the linear model, the coefficients that the model suggests
are unreliable.
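
For example, the estimated coefficients and their signs can be read straight off the fitted results
object; the line below is a small sketch using the first model fitted above.

[ ]: # Coefficients of the first OLS model, sorted so positive and negative effects are easy to scan
print(model.params.sort_values().round(4))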

5.12.6 Interpretation of p-values (P > |t|)


• For each predictor variable there is a null hypothesis and alternate hypothesis.
– Null hypothesis : Predictor variable is not significant
– Alternate hypothesis : Predictor variable is significant
• (P > |t|) gives the p-value for each predictor variable to check the null hypothesis.
• If the level of significance is set to 5% (0.05), p-values greater than 0.05 indicate that the
corresponding predictor variables are not significant (see the sketch after this list).
• However, due to the presence of multicollinearity in our data, the p-values will also change.
• We need to ensure that there is no multicollinearity in order to interpret the p-values.
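
As a quick illustration of this check on the first model (a sketch; the exact list depends on the fitted
model), the predictors that fail the 5% test can be pulled directly out of the results object.

[ ]: # Predictors whose p-values exceed the 5% significance level in the first model
insignificant = model.pvalues[model.pvalues > 0.05]
print(insignificant.round(3))  # sread, fork, ppgout, pgfree and pgin in the summary above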

5.12.7 How to check for Multicollinearity


• There are different ways of detecting (or testing) multicollinearity. One such way is the Variance
Inflation Factor.
• Variance Inflation factor: Variance inflation factors measure the inflation in the variances
of the regression coefficients estimates due to collinearities that exist among the predictors. It
is a measure of how much the variance of the estimated regression coefficient 𝛽𝑘 is “inflated”
by the existence of correlation among the predictor variables in the model.
• General Rule of Thumb:
– If VIF is 1, then there is no correlation among the 𝑘th predictor and the remaining
predictor variables, and hence, the variance of 𝛽𝑘 is not inflated at all.
– If VIF exceeds 5, we say there is moderate VIF, and if it is 10 or exceeding 10, it shows
signs of high multi-collinearity.
– The purpose of the analysis should dictate which threshold to use.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 29.229332
lread 5.350560
lwrite 4.328397
scall 2.960609
sread 6.420172
swrite 5.597135
fork 13.035359
exec 3.241417
rchar 2.133616
wchar 1.584381
pgout 11.360363
ppgout 29.404223
pgfree 16.496748
atch 1.875901
pgin 13.809339
ppgin 13.951855
pflt 12.001460
vflt 15.971049
freemem 1.961304
freeswap 1.841239
runqsz_Not_CPU_Bound 1.156815
dtype: float64

As a few predictors have VIF values > 2, there is some multicollinearity in the data. The variable
with the highest VIF value is ‘ppgout’, with a VIF of 29.404223, which suggests a high degree of
multicollinearity.
• The VIF values indicate that the features (ppgout,pgfree,vflt,ppgin,pgin,fork,pflt,pgout,sread,swrite,lread)
are correlated with one or more independent features.
• Multicollinearity affects only the specific independent variables that are correlated.
• To treat multicollinearity, we will have to drop one or more of the correlated features.
• We will drop the variable that has the least impact on the adjusted R-squared of the model.

[ ]: X_train.columns

[ ]: Index(['const', 'lread', 'lwrite', 'scall', 'sread', 'swrite', 'fork', 'exec',


'rchar', 'wchar', 'pgout', 'ppgout', 'pgfree', 'atch', 'pgin', 'ppgin',
'pflt', 'vflt', 'freemem', 'freeswap', 'runqsz_Not_CPU_Bound'],
dtype='object')

Let’s remove/drop multicollinear columns one by one and observe the effect on our
predictive model

[ ]: import statsmodels.api as sm
import numpy as np

def remove_multicollinear_columns(X_train, y_train, vif_values):
    """
    Remove multicollinear columns one by one and observe the effect
    on the adjusted R-squared of the model.

    Parameters:
    - X_train: DataFrame containing predictor variables.
    - y_train: Series containing the target variable.
    - vif_values: Dictionary containing VIF values for predictor variables.

    Returns:
    - Two lists of tuples with the name of the removed column and the
      corresponding adjusted R-squared and R-squared.
    """
    results_adj_r_squared = []
    results_r_squared = []

    # Initial model with all predictors
    olsmod = sm.OLS(y_train, X_train)
    olsres = olsmod.fit()
    results_adj_r_squared.append(('Initial', olsres.rsquared_adj))
    results_r_squared.append(('Initial', olsres.rsquared))

    # Iterate through predictor variables sorted by VIF values (highest first)
    for column, vif in sorted(vif_values.items(), key=lambda x: x[1], reverse=True):
        # Remove one column at a time
        X_train_temp = X_train.drop(columns=[column])

        # Fit the model without that column
        olsmod_temp = sm.OLS(y_train, X_train_temp)
        olsres_temp = olsmod_temp.fit()

        # Append results
        results_adj_r_squared.append((column, olsres_temp.rsquared_adj))
        results_r_squared.append((column, olsres_temp.rsquared))

    return results_adj_r_squared, results_r_squared

# Example usage:
# Assuming X_train and y_train are your training data
vif_values = {
    'ppgout': 29.404223,
    'pgfree': 16.496748,
    'vflt': 15.971049,
    'ppgin': 13.951855,
    'pgin': 13.809339,
    'fork': 13.035359,
    'pflt': 12.00146,
    'pgout': 11.360363,
    'sread': 6.420172,
    'swrite': 5.597135,
    'lread': 5.35056,
    'lwrite': 4.328397,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results


print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: ppgout, Adjusted R-squared: 0.795
Removed: pgfree, Adjusted R-squared: 0.795
Removed: vflt, Adjusted R-squared: 0.795
Removed: ppgin, Adjusted R-squared: 0.795
Removed: pgin, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.795
Removed: pflt, Adjusted R-squared: 0.785
Removed: pgout, Adjusted R-squared: 0.795
Removed: sread, Adjusted R-squared: 0.795
Removed: swrite, Adjusted R-squared: 0.795
Removed: lread, Adjusted R-squared: 0.794
Removed: lwrite, Adjusted R-squared: 0.795

R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: ppgout, R-squared: 0.796

Removed: pgfree, R-squared: 0.796
Removed: vflt, R-squared: 0.796
Removed: ppgin, R-squared: 0.796
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.796
Removed: pflt, R-squared: 0.786
Removed: pgout, R-squared: 0.796
Removed: sread, R-squared: 0.796
Removed: swrite, R-squared: 0.796
Removed: lread, R-squared: 0.794
Removed: lwrite, R-squared: 0.796
We will remove ppgout first.

[ ]: X_train = X_train.drop(["ppgout"], axis=1)

[ ]: olsmod_2 = sm.OLS(y_train, X_train)


olsres_2 = olsmod_2.fit()
print(olsres_2.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.796
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1174.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:11 Log-Likelihood: -16658.
No. Observations: 5734 AIC: 3.336e+04
Df Residuals: 5714 BIC: 3.349e+04
Df Model: 19
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.1477 0.315 267.138 0.000 83.530
84.765
lread -0.0635 0.009 -7.077 0.000 -0.081
-0.046
lwrite 0.0482 0.013 3.675 0.000 0.022
0.074
scall -0.0007 6.28e-05 -10.575 0.000 -0.001
-0.001
sread 0.0003 0.001 0.303 0.762 -0.002
0.002
swrite -0.0054 0.001 -3.782 0.000 -0.008
-0.003

fork 0.0325 0.132 0.247 0.805 -0.226
0.291
exec -0.3225 0.052 -6.247 0.000 -0.424
-0.221
rchar -5.166e-06 4.88e-07 -10.598 0.000 -6.12e-06
-4.21e-06
wchar -5.45e-06 1.03e-06 -5.283 0.000 -7.47e-06
-3.43e-06
pgout -0.4264 0.068 -6.286 0.000 -0.559
-0.293
pgfree 0.0477 0.029 1.634 0.102 -0.010
0.105
atch 0.6295 0.143 4.407 0.000 0.349
0.909
pgin 0.0212 0.028 0.745 0.456 -0.035
0.077
ppgin -0.0685 0.020 -3.482 0.001 -0.107
-0.030
pflt -0.0336 0.002 -16.957 0.000 -0.037
-0.030
vflt -0.0055 0.001 -3.846 0.000 -0.008
-0.003
freemem -0.0005 5.07e-05 -9.074 0.000 -0.001
-0.000
freeswap 8.824e-06 1.9e-07 46.472 0.000 8.45e-06
9.2e-06
runqsz_Not_CPU_Bound 1.6130 0.126 12.804 0.000 1.366
1.860
==============================================================================
Omnibus: 1102.077 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2366.754
Skew: -1.118 Prob(JB): 0.00
Kurtosis: 5.216 Cond. No. 7.71e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.71e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 29.021961
lread 5.350387
lwrite 4.328325
scall 2.960379
sread 6.420135
swrite 5.597025
fork 13.027305
exec 3.239231
rchar 2.133614
wchar 1.580894
pgout 6.453978
pgfree 6.172847
atch 1.875553
pgin 13.784007
ppgin 13.898848
pflt 12.001460
vflt 15.966865
freemem 1.959267
freeswap 1.838167
runqsz_Not_CPU_Bound 1.156421
dtype: float64

[ ]: vif_values = {
    'vflt': 15.966865,
    'ppgin': 13.898848,
    'pgin': 13.784007,
    'fork': 13.027305,
    'pflt': 12.00146,
    'pgout': 6.453978,
    'sread': 6.420135,
    'pgfree': 6.172847,
    'swrite': 5.597025,
    'lread': 5.350387,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results

print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: vflt, Adjusted R-squared: 0.795
Removed: ppgin, Adjusted R-squared: 0.795
Removed: pgin, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.795
Removed: pflt, Adjusted R-squared: 0.785
Removed: pgout, Adjusted R-squared: 0.794
Removed: sread, Adjusted R-squared: 0.795
Removed: pgfree, Adjusted R-squared: 0.795
Removed: swrite, Adjusted R-squared: 0.795
Removed: lread, Adjusted R-squared: 0.794

R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: vflt, R-squared: 0.796
Removed: ppgin, R-squared: 0.796
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.796
Removed: pflt, R-squared: 0.786
Removed: pgout, R-squared: 0.795
Removed: sread, R-squared: 0.796
Removed: pgfree, R-squared: 0.796
Removed: swrite, R-squared: 0.796
Removed: lread, R-squared: 0.794
We are removing vflt

[ ]: X_train = X_train.drop(["vflt"], axis=1)


olsmod_3 = sm.OLS(y_train, X_train)
olsres_3 = olsmod_3.fit()
print(olsres_3.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.796
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1235.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00

Time: 12:53:12 Log-Likelihood: -16665.
No. Observations: 5734 AIC: 3.337e+04
Df Residuals: 5715 BIC: 3.349e+04
Df Model: 18
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.0090 0.313 268.139 0.000 83.395
84.623
lread -0.0654 0.009 -7.281 0.000 -0.083
-0.048
lwrite 0.0491 0.013 3.735 0.000 0.023
0.075
scall -0.0007 6.28e-05 -10.769 0.000 -0.001
-0.001
sread -2.068e-05 0.001 -0.021 0.984 -0.002
0.002
swrite -0.0053 0.001 -3.720 0.000 -0.008
-0.003
fork -0.2082 0.116 -1.793 0.073 -0.436
0.019
exec -0.3293 0.052 -6.376 0.000 -0.431
-0.228
rchar -5.294e-06 4.87e-07 -10.871 0.000 -6.25e-06
-4.34e-06
wchar -4.982e-06 1.03e-06 -4.858 0.000 -6.99e-06
-2.97e-06
pgout -0.4205 0.068 -6.194 0.000 -0.554
-0.287
pgfree 0.0408 0.029 1.397 0.162 -0.016
0.098
atch 0.5868 0.143 4.116 0.000 0.307
0.866
pgin 0.0086 0.028 0.305 0.760 -0.047
0.064
ppgin -0.0685 0.020 -3.476 0.001 -0.107
-0.030
pflt -0.0373 0.002 -21.570 0.000 -0.041
-0.034
freemem -0.0005 5.07e-05 -9.165 0.000 -0.001
-0.000
freeswap 8.945e-06 1.87e-07 47.712 0.000 8.58e-06
9.31e-06
runqsz_Not_CPU_Bound 1.6096 0.126 12.761 0.000 1.362

1.857
==============================================================================
Omnibus: 1058.324 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2225.362
Skew: -1.085 Prob(JB): 0.00
Kurtosis: 5.145 Cond. No. 7.65e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.65e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.641818
lread 5.335455
lwrite 4.327130
scall 2.952947
sread 6.374687
swrite 5.595777
fork 10.089700
exec 3.235396
rchar 2.123783
wchar 1.558923
pgout 6.450724
pgfree 6.149223
atch 1.864254
pgin 13.602134
ppgin 13.898845
pflt 9.131802
freemem 1.957966
freeswap 1.787695
runqsz_Not_CPU_Bound 1.156363
dtype: float64

[ ]: vif_values = {
'ppgin': 13.898845,
'pgin': 13.602134,
'fork': 10.0897,
'pflt': 9.131802,
'pgout': 6.450724,
'sread': 6.374687,
'pgfree': 6.149223,
'swrite': 5.595777,
'lread' : 5.335455,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results


print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: ppgin, Adjusted R-squared: 0.795
Removed: pgin, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.795
Removed: pflt, Adjusted R-squared: 0.778
Removed: pgout, Adjusted R-squared: 0.794
Removed: sread, Adjusted R-squared: 0.795
Removed: pgfree, Adjusted R-squared: 0.795
Removed: swrite, Adjusted R-squared: 0.794
Removed: lread, Adjusted R-squared: 0.793

R-squared Results:
Removed: Initial, R-squared: 0.796
Removed: ppgin, R-squared: 0.795
Removed: pgin, R-squared: 0.796
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.779
Removed: pgout, R-squared: 0.794
Removed: sread, R-squared: 0.796
Removed: pgfree, R-squared: 0.795
Removed: swrite, R-squared: 0.795

Removed: lread, R-squared: 0.794

[ ]: # Removing ppgin
X_train = X_train.drop(["ppgin"], axis=1)

olsmod_4 = sm.OLS(y_train, X_train)


olsres_4 = olsmod_4.fit()
print(olsres_4.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.795
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1305.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:13 Log-Likelihood: -16671.
No. Observations: 5734 AIC: 3.338e+04
Df Residuals: 5716 BIC: 3.350e+04
Df Model: 17
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.0531 0.313 268.240 0.000 83.439
84.667
lread -0.0678 0.009 -7.563 0.000 -0.085
-0.050
lwrite 0.0513 0.013 3.909 0.000 0.026
0.077
scall -0.0007 6.29e-05 -10.692 0.000 -0.001
-0.001
sread -4.826e-06 0.001 -0.005 0.996 -0.002
0.002
swrite -0.0054 0.001 -3.732 0.000 -0.008
-0.003
fork -0.1927 0.116 -1.659 0.097 -0.420
0.035
exec -0.3290 0.052 -6.364 0.000 -0.430
-0.228
rchar -5.506e-06 4.84e-07 -11.385 0.000 -6.45e-06
-4.56e-06
wchar -4.978e-06 1.03e-06 -4.849 0.000 -6.99e-06
-2.97e-06
pgout -0.4138 0.068 -6.091 0.000 -0.547
-0.281

pgfree 0.0311 0.029 1.070 0.284 -0.026
0.088
atch 0.5966 0.143 4.181 0.000 0.317
0.876
pgin -0.0839 0.009 -8.848 0.000 -0.103
-0.065
pflt -0.0374 0.002 -21.568 0.000 -0.041
-0.034
freemem -0.0005 5.08e-05 -9.196 0.000 -0.001
-0.000
freeswap 8.922e-06 1.88e-07 47.572 0.000 8.55e-06
9.29e-06
runqsz_Not_CPU_Bound 1.6017 0.126 12.689 0.000 1.354
1.849
==============================================================================
Omnibus: 1052.296 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2207.367
Skew: -1.081 Prob(JB): 0.00
Kurtosis: 5.137 Cond. No. 7.65e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.65e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.594882
lread 5.304009
lwrite 4.316362
scall 2.951826
sread 6.374556
swrite 5.595670
fork 10.074886
exec 3.235387
rchar 2.090401

wchar 1.558921
pgout 6.445478
pgfree 6.093623
atch 1.863536
pgin 1.529142
pflt 9.131545
freemem 1.957713
freeswap 1.785393
runqsz_Not_CPU_Bound 1.155990
dtype: float64

[ ]: vif_values = {
    'fork': 10.074886,
    'pflt': 9.131545,
    'pgout': 6.445478,
    'sread': 6.374556,
    'pgfree': 6.093623,
    'swrite': 5.59567,
    'lread': 5.304009,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results


print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.794
Removed: pflt, Adjusted R-squared: 0.778
Removed: pgout, Adjusted R-squared: 0.793
Removed: sread, Adjusted R-squared: 0.795
Removed: pgfree, Adjusted R-squared: 0.795
Removed: swrite, Adjusted R-squared: 0.794
Removed: lread, Adjusted R-squared: 0.792

R-squared Results:

Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.778
Removed: pgout, R-squared: 0.794
Removed: sread, R-squared: 0.795
Removed: pgfree, R-squared: 0.795
Removed: swrite, R-squared: 0.795
Removed: lread, R-squared: 0.793

[ ]: X_train = X_train.drop(["sread"], axis=1)

olsmod_5 = sm.OLS(y_train, X_train)


olsres_5 = olsmod_5.fit()
print(olsres_5.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.795
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1387.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:13 Log-Likelihood: -16671.
No. Observations: 5734 AIC: 3.338e+04
Df Residuals: 5717 BIC: 3.349e+04
Df Model: 16
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.0530 0.313 268.596 0.000 83.440
84.666
lread -0.0677 0.009 -7.569 0.000 -0.085
-0.050
lwrite 0.0513 0.013 3.913 0.000 0.026
0.077
scall -0.0007 6.01e-05 -11.189 0.000 -0.001
-0.001
swrite -0.0054 0.001 -4.939 0.000 -0.008
-0.003
fork -0.1927 0.116 -1.659 0.097 -0.420
0.035
exec -0.3290 0.052 -6.370 0.000 -0.430
-0.228
rchar -5.507e-06 4.33e-07 -12.728 0.000 -6.36e-06
-4.66e-06

wchar -4.978e-06 1.02e-06 -4.870 0.000 -6.98e-06
-2.97e-06
pgout -0.4138 0.068 -6.092 0.000 -0.547
-0.281
pgfree 0.0311 0.029 1.071 0.284 -0.026
0.088
atch 0.5966 0.143 4.183 0.000 0.317
0.876
pgin -0.0839 0.009 -8.852 0.000 -0.103
-0.065
pflt -0.0374 0.002 -21.608 0.000 -0.041
-0.034
freemem -0.0005 5.08e-05 -9.198 0.000 -0.001
-0.000
freeswap 8.922e-06 1.87e-07 47.758 0.000 8.56e-06
9.29e-06
runqsz_Not_CPU_Bound 1.6017 0.126 12.690 0.000 1.354
1.849
==============================================================================
Omnibus: 1052.284 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2207.327
Skew: -1.081 Prob(JB): 0.00
Kurtosis: 5.137 Cond. No. 7.64e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.64e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.524054
lread 5.296795
lwrite 4.307417
scall 2.696760
swrite 3.201334

fork 10.073151
exec 3.229896
rchar 1.673676
wchar 1.545377
pgout 6.444771
pgfree 6.092930
atch 1.862227
pgin 1.528098
pflt 9.099426
freemem 1.957131
freeswap 1.771837
runqsz_Not_CPU_Bound 1.155984
dtype: float64

[ ]: vif_values = {
'fork': 10.073151,
'pflt': 9.099426,
'pgout': 6.444771,
'pgfree': 6.09293,
'lread': 5.296795,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results


print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.794
Removed: pflt, Adjusted R-squared: 0.778
Removed: pgout, Adjusted R-squared: 0.793
Removed: pgfree, Adjusted R-squared: 0.795
Removed: lread, Adjusted R-squared: 0.793

R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795

Removed: pflt, R-squared: 0.778
Removed: pgout, R-squared: 0.794
Removed: pgfree, R-squared: 0.795
Removed: lread, R-squared: 0.793

[ ]: X_train = X_train.drop(["pgfree"], axis=1)

olsmod_6 = sm.OLS(y_train, X_train)


olsres_6 = olsmod_6.fit()
print(olsres_6.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.795
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 1479.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:14 Log-Likelihood: -16672.
No. Observations: 5734 AIC: 3.338e+04
Df Residuals: 5718 BIC: 3.348e+04
Df Model: 15
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.0536 0.313 268.595 0.000 83.440
84.667
lread -0.0675 0.009 -7.540 0.000 -0.085
-0.050
lwrite 0.0508 0.013 3.877 0.000 0.025
0.077
scall -0.0007 6.01e-05 -11.231 0.000 -0.001
-0.001
swrite -0.0053 0.001 -4.919 0.000 -0.007
-0.003
fork -0.1915 0.116 -1.649 0.099 -0.419
0.036
exec -0.3275 0.052 -6.343 0.000 -0.429
-0.226
rchar -5.498e-06 4.33e-07 -12.709 0.000 -6.35e-06
-4.65e-06
wchar -4.993e-06 1.02e-06 -4.886 0.000 -7e-06
-2.99e-06
pgout -0.3536 0.038 -9.277 0.000 -0.428
-0.279

atch 0.6002 0.143 4.210 0.000 0.321
0.880
pgin -0.0825 0.009 -8.788 0.000 -0.101
-0.064
pflt -0.0374 0.002 -21.620 0.000 -0.041
-0.034
freemem -0.0005 5.06e-05 -9.304 0.000 -0.001
-0.000
freeswap 8.927e-06 1.87e-07 47.804 0.000 8.56e-06
9.29e-06
runqsz_Not_CPU_Bound 1.5998 0.126 12.676 0.000 1.352
1.847
==============================================================================
Omnibus: 1054.101 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2212.372
Skew: -1.082 Prob(JB): 0.00
Kurtosis: 5.139 Cond. No. 7.64e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.64e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors


from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.523969
lread 5.291929
lwrite 4.301867
scall 2.693729
swrite 3.200197
fork 10.072215
exec 3.227631
rchar 1.673056
wchar 1.545060
pgout 2.029269
atch 1.861177

pgin 1.500133
pflt 9.098450
freemem 1.946319
freeswap 1.770539
runqsz_Not_CPU_Bound 1.155750
dtype: float64

[ ]: vif_values = {
'fork': 10.072215,
'pflt': 9.09845,
'lread': 5.291929,
'lwrite': 4.301867
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results


print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared,␣
↪3)}")

# Print the R-squared results


print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.795
Removed: fork, Adjusted R-squared: 0.794
Removed: pflt, Adjusted R-squared: 0.778
Removed: lread, Adjusted R-squared: 0.793
Removed: lwrite, Adjusted R-squared: 0.794

R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: fork, R-squared: 0.795
Removed: pflt, R-squared: 0.778
Removed: lread, R-squared: 0.793
Removed: lwrite, R-squared: 0.795

[ ]: X_train = X_train.drop(["fork"], axis=1)

olsmod_7 = sm.OLS(y_train, X_train)


olsres_7 = olsmod_7.fit()
print(olsres_7.summary())

OLS Regression Results
==============================================================================
Dep. Variable: usr R-squared: 0.795
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 1584.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:14 Log-Likelihood: -16673.
No. Observations: 5734 AIC: 3.338e+04
Df Residuals: 5719 BIC: 3.348e+04
Df Model: 14
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.0919 0.312 269.420 0.000 83.480
84.704
lread -0.0684 0.009 -7.653 0.000 -0.086
-0.051
lwrite 0.0523 0.013 3.995 0.000 0.027
0.078
scall -0.0007 5.96e-05 -11.111 0.000 -0.001
-0.001
swrite -0.0058 0.001 -5.481 0.000 -0.008
-0.004
exec -0.3568 0.049 -7.355 0.000 -0.452
-0.262
rchar -5.511e-06 4.33e-07 -12.740 0.000 -6.36e-06
-4.66e-06
wchar -4.872e-06 1.02e-06 -4.779 0.000 -6.87e-06
-2.87e-06
pgout -0.3540 0.038 -9.287 0.000 -0.429
-0.279
atch 0.6055 0.143 4.247 0.000 0.326
0.885
pgin -0.0820 0.009 -8.730 0.000 -0.100
-0.064
pflt -0.0396 0.001 -37.292 0.000 -0.042
-0.038
freemem -0.0005 5.06e-05 -9.328 0.000 -0.001
-0.000
freeswap 8.915e-06 1.87e-07 47.769 0.000 8.55e-06
9.28e-06
runqsz_Not_CPU_Bound 1.5953 0.126 12.641 0.000 1.348
1.843
==============================================================================
Omnibus: 1045.912 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2203.816
Skew: -1.073 Prob(JB): 0.00
Kurtosis: 5.150 Cond. No. 7.61e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.366778
lread 5.272488
lwrite 4.282984
scall 2.653943
swrite 3.012451
exec 2.847353
rchar 1.672481
wchar 1.537067
pgout 2.029172
atch 1.860242
pgin 1.497984
pflt 3.436202
freemem 1.945888
freeswap 1.767780
runqsz_Not_CPU_Bound 1.155214
dtype: float64

[ ]: vif_values = {
    'lread': 5.272488,
    'lwrite': 4.282984,
    'pflt': 3.436202,
    'swrite': 3.012451
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results
print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
    print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared, 3)}")

# Print the R-squared results
print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
    print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.794
Removed: lread, Adjusted R-squared: 0.792
Removed: lwrite, Adjusted R-squared: 0.794
Removed: pflt, Adjusted R-squared: 0.745
Removed: swrite, Adjusted R-squared: 0.793

R-squared Results:
Removed: Initial, R-squared: 0.795
Removed: lread, R-squared: 0.793
Removed: lwrite, R-squared: 0.794
Removed: pflt, R-squared: 0.745
Removed: swrite, R-squared: 0.794

[ ]: X_train = X_train.drop(["lwrite"], axis=1)

olsmod_8 = sm.OLS(y_train, X_train)


olsres_8 = olsmod_8.fit()
print(olsres_8.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.794
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 1700.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:14 Log-Likelihood: -16681.
No. Observations: 5734 AIC: 3.339e+04
Df Residuals: 5720 BIC: 3.348e+04
Df Model: 13
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 84.1528 0.312 269.584 0.000 83.541
84.765
lread -0.0374 0.004 -8.429 0.000 -0.046
-0.029
scall -0.0007 5.97e-05 -11.237 0.000 -0.001
-0.001
swrite -0.0058 0.001 -5.512 0.000 -0.008
-0.004
exec -0.3696 0.048 -7.627 0.000 -0.465
-0.275
rchar -5.533e-06 4.33e-07 -12.774 0.000 -6.38e-06
-4.68e-06
wchar -4.572e-06 1.02e-06 -4.491 0.000 -6.57e-06
-2.58e-06
pgout -0.3572 0.038 -9.359 0.000 -0.432
-0.282
atch 0.6127 0.143 4.293 0.000 0.333
0.893
pgin -0.0872 0.009 -9.373 0.000 -0.105
-0.069
pflt -0.0405 0.001 -38.806 0.000 -0.043
-0.038
freemem -0.0005 5.07e-05 -9.226 0.000 -0.001
-0.000
freeswap 8.916e-06 1.87e-07 47.713 0.000 8.55e-06
9.28e-06
runqsz_Not_CPU_Bound 1.6330 0.126 12.959 0.000 1.386
1.880
==============================================================================
Omnibus: 1041.933 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2191.377
Skew: -1.070 Prob(JB): 0.00
Kurtosis: 5.144 Cond. No. 7.61e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.61e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 28.299206
lread 1.294870
scall 2.650952
swrite 3.012182
exec 2.834855
rchar 1.672218
wchar 1.528722
pgout 2.028322
atch 1.859941
pgin 1.468363
pflt 3.300995
freemem 1.944841
freeswap 1.767776
runqsz_Not_CPU_Bound 1.148773
dtype: float64

[ ]: vif_values = {
    'pflt': 3.300995,
    'swrite': 3.012182,
    'exec': 2.834855,
    'scall': 2.650952,
    'pgout': 2.028322
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results
print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
    print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared, 3)}")

# Print the R-squared results
print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
    print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.794
Removed: pflt, Adjusted R-squared: 0.74
Removed: swrite, Adjusted R-squared: 0.793
Removed: exec, Adjusted R-squared: 0.792
Removed: scall, Adjusted R-squared: 0.789
Removed: pgout, Adjusted R-squared: 0.791

R-squared Results:
Removed: Initial, R-squared: 0.794
Removed: pflt, R-squared: 0.74
Removed: swrite, R-squared: 0.793
Removed: exec, R-squared: 0.792
Removed: scall, R-squared: 0.79
Removed: pgout, R-squared: 0.791

[ ]: X_train = X_train.drop(["swrite"], axis=1)

olsmod_9 = sm.OLS(y_train, X_train)


olsres_9 = olsmod_9.fit()
print(olsres_9.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.793
Model: OLS Adj. R-squared: 0.793
Method: Least Squares F-statistic: 1830.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:15 Log-Likelihood: -16696.
No. Observations: 5734 AIC: 3.342e+04
Df Residuals: 5721 BIC: 3.350e+04
Df Model: 12
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 83.8871 0.309 271.300 0.000 83.281
84.493
lread -0.0377 0.004 -8.482 0.000 -0.046
-0.029
scall -0.0009 4.84e-05 -17.826 0.000 -0.001
-0.001
exec -0.2874 0.046 -6.217 0.000 -0.378
-0.197
rchar -5.597e-06 4.34e-07 -12.894 0.000 -6.45e-06
-4.75e-06
wchar -6.141e-06 9.8e-07 -6.267 0.000 -8.06e-06
-4.22e-06
pgout -0.3577 0.038 -9.349 0.000 -0.433
-0.283
atch 0.6264 0.143 4.378 0.000 0.346
0.907
pgin -0.0882 0.009 -9.452 0.000 -0.106
-0.070
pflt -0.0428 0.001 -44.601 0.000 -0.045
-0.041
freemem -0.0004 5.06e-05 -8.679 0.000 -0.001
-0.000
freeswap 8.982e-06 1.87e-07 48.039 0.000 8.62e-06
9.35e-06
runqsz_Not_CPU_Bound 1.6322 0.126 12.920 0.000 1.385
1.880
==============================================================================
Omnibus: 994.479 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2042.636
Skew: -1.035 Prob(JB): 0.00
Kurtosis: 5.065 Cond. No. 7.52e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 27.624372
lread 1.294639
scall 1.738190
exec 2.566477
rchar 1.671017
wchar 1.409194
pgout 2.028308
atch 1.859379
pgin 1.467864
pflt 2.775762
freemem 1.923931
freeswap 1.760599
runqsz_Not_CPU_Bound 1.148772
dtype: float64

[ ]: vif_values = {
    'pflt': 2.775762,
    'exec': 2.566477,
    'pgout': 2.028308,
}

removed_columns_adj_r_squared, removed_columns_r_squared = remove_multicollinear_columns(X_train, y_train, vif_values)

# Print the adjusted R-squared results
print("Adjusted R-squared Results:")
for column, adj_r_squared in removed_columns_adj_r_squared:
    print(f"Removed: {column}, Adjusted R-squared: {np.round(adj_r_squared, 3)}")

# Print the R-squared results
print("\nR-squared Results:")
for column, r_squared in removed_columns_r_squared:
    print(f"Removed: {column}, R-squared: {np.round(r_squared, 3)}")

Adjusted R-squared Results:


Removed: Initial, Adjusted R-squared: 0.793
Removed: pflt, Adjusted R-squared: 0.721
Removed: exec, Adjusted R-squared: 0.792
Removed: pgout, Adjusted R-squared: 0.79

R-squared Results:
Removed: Initial, R-squared: 0.793
Removed: pflt, R-squared: 0.721
Removed: exec, R-squared: 0.792
Removed: pgout, R-squared: 0.79

[ ]: X_train = X_train.drop(["exec"], axis=1)

olsmod_10 = sm.OLS(y_train, X_train)


olsres_10 = olsmod_10.fit()
print(olsres_10.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.792
Model: OLS Adj. R-squared: 0.792
Method: Least Squares F-statistic: 1980.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:15 Log-Likelihood: -16715.
No. Observations: 5734 AIC: 3.345e+04
Df Residuals: 5722 BIC: 3.353e+04
Df Model: 11
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 83.8317 0.310 270.346 0.000 83.224
84.440
lread -0.0400 0.004 -9.015 0.000 -0.049
-0.031
scall -0.0009 4.85e-05 -18.147 0.000 -0.001
-0.001
rchar -5.594e-06 4.36e-07 -12.844 0.000 -6.45e-06
-4.74e-06
wchar -6.029e-06 9.83e-07 -6.133 0.000 -7.96e-06
-4.1e-06
pgout -0.3551 0.038 -9.251 0.000 -0.430
-0.280
atch 0.5698 0.143 3.977 0.000 0.289
0.851
pgin -0.0939 0.009 -10.088 0.000 -0.112
-0.076
pflt -0.0467 0.001 -64.665 0.000 -0.048
-0.045
freemem -0.0004 5.07e-05 -8.504 0.000 -0.001
-0.000
freeswap 8.993e-06 1.88e-07 47.941 0.000 8.63e-06
9.36e-06
runqsz_Not_CPU_Bound 1.6447 0.127 12.977 0.000 1.396
1.893
==============================================================================
Omnibus: 950.679 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1909.425
Skew: -1.003 Prob(JB): 0.00
Kurtosis: 4.992 Cond. No. 7.52e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.52e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: # let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series1 = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print("VIF values: \n\n{}\n".format(vif_series1))

VIF values:

const 27.601456
lread 1.285248
scall 1.732634
rchar 1.671015
wchar 1.408716
pgout 2.028061
atch 1.851850
pgin 1.453333
pflt 1.564570
freemem 1.922817
freeswap 1.760449
runqsz_Not_CPU_Bound 1.148484
dtype: float64

Now that multicollinearity has been removed from the data, the p-values of the coefficients are reliable, and no non-significant predictors remain in the model.
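
As a quick check in code, the fitted results object exposes all coefficient p-values directly, so any remaining non-significant predictors (other than the constant) can be listed in one line. A small illustrative snippet:

[ ]: # List any predictors whose p-value exceeds 0.05 (should be empty at this point)
non_significant = olsres_10.pvalues.drop("const")
print(non_significant[non_significant > 0.05])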

5.13 Testing the Assumptions of Linear Regression


For Linear Regression, we need to check if the following assumptions hold:-
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of error terms
5. No strong Multicollinearity

5.14 Linearity and Independence of predictors
[ ]: df_pred = pd.DataFrame()

df_pred["Actual Values"] = y_train.values.flatten() # actual values


df_pred["Fitted Values"] = olsres_10.fittedvalues.values # predicted values
df_pred["Residuals"] = olsres_10.resid.values # residuals

df_pred.head()

[ ]: Actual Values Fitted Values Residuals


0 91.0 91.390211 -0.390211
1 94.0 91.651718 2.348282
2 61.5 74.111751 -12.611751
3 83.0 80.837876 2.162124
4 94.0 98.081853 -4.081853

[ ]: # let us plot the fitted values vs residuals


sns.set_style("whitegrid")
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()

[ ]: # columns in training set
X_train.columns

[ ]: Index(['const', 'lread', 'scall', 'rchar', 'wchar', 'pgout', 'atch', 'pgin',


'pflt', 'freemem', 'freeswap', 'runqsz_Not_CPU_Bound'],
dtype='object')

There is no discernible pattern in the fitted-versus-residuals plot, so the assumptions of linearity and independence of the predictors are satisfied.

[ ]: # checking the distribution of variables in training set with dependent variable
# sns.pairplot(df_dummy[["usr", 'lread', 'scall', 'rchar', 'wchar', 'pgout', 'atch', 'pgin', 'pflt', 'freemem', 'freeswap', 'runqsz_Not_CPU_Bound']])
# plt.show()

[ ]: # Extract the subset of columns
columns_of_interest = ["usr", 'lread', 'scall', 'rchar', 'wchar', 'pgout', 'atch', 'pgin', 'pflt', 'freemem', 'freeswap', 'runqsz_Not_CPU_Bound']
subset_df = df_dummy[columns_of_interest]

# Calculate the correlation matrix
corr_matrix = subset_df.corr()

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 10})
plt.title('Correlation Heatmap')
plt.show()

[ ]: moderate_correlations, strong_correlations = filter_correlation(subset_df)

print("Moderate Correlations:")
for corr in moderate_correlations:
    print(f"{corr[0]} - {corr[1]}: {corr[2]}")

print("\nStrong Correlations:")
for corr in strong_correlations:
    print(f"{corr[0]} - {corr[1]}: {corr[2]}")

Moderate Correlations:

usr - lread: -0.43816331190269986
usr - scall: -0.6189316332505844
usr - rchar: -0.5075608593466034
usr - pgin: -0.459132824706294
usr - pflt: -0.6960625200066154
usr - freeswap: 0.5649289834256034
scall - pflt: 0.4853611683732183
rchar - wchar: 0.48631681076384864
pgout - atch: 0.6429403043922839
pgout - pgin: 0.43791642745583104
pgout - freemem: -0.4698311159510902
atch - freemem: -0.4420624602944536
freemem - freeswap: 0.6070003248418868

Strong Correlations:

5.14.1 Feature Engineering

Feature Engineering is the process of taking certain variables (features) from our dataset and transforming them for use in a predictive model. Essentially, we manipulate single variables and combinations of variables in order to engineer new features. By creating these new features, we increase the likelihood that one of the new variables has more predictive power over the outcome variable than the original, un-transformed variables.
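
The cells below engineer new features by hand (squared terms and pairwise interaction terms). As an aside, scikit-learn's PolynomialFeatures can generate the same kind of degree-2 terms automatically; a minimal sketch, where the chosen subset of columns is only an example:

[ ]: # Illustrative alternative: auto-generate squares and pairwise interactions
# for a small example subset of predictors using scikit-learn.
from sklearn.preprocessing import PolynomialFeatures

example_cols = ["pflt", "scall", "freeswap"]  # example subset, not the notebook's choice
pf = PolynomialFeatures(degree=2, include_bias=False)
poly_terms = pf.fit_transform(X_train[example_cols])

poly_df = pd.DataFrame(
    poly_terms,
    columns=pf.get_feature_names_out(example_cols),
    index=X_train.index,
)
poly_df.head()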

[ ]: # using square transformation


X_train["pflt_sq"] = np.square(X_train["pflt"])
X_train["scall_sq"] = np.square(X_train["scall"])

# let's create a model with the transformed data


olsmod_11 = sm.OLS(y_train, X_train)
olsres_11 = olsmod_11.fit()
print(olsres_11.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.803
Model: OLS Adj. R-squared: 0.803
Method: Least Squares F-statistic: 1798.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:35 Log-Likelihood: -16553.
No. Observations: 5734 AIC: 3.313e+04
Df Residuals: 5720 BIC: 3.323e+04
Df Model: 13
Covariance Type: nonrobust
================================================================================
========
coef std err t P>|t| [0.025
0.975]
--------------------------------------------------------------------------------
--------
const 80.3842 0.358 224.249 0.000 79.682
81.087
lread -0.0409 0.004 -9.481 0.000 -0.049
-0.032
scall 0.0009 0.000 6.462 0.000 0.001
0.001
rchar -5.625e-06 4.24e-07 -13.282 0.000 -6.46e-06
-4.8e-06
wchar -7.261e-06 9.58e-07 -7.576 0.000 -9.14e-06
-5.38e-06
pgout -0.3064 0.037 -8.189 0.000 -0.380
-0.233
atch 0.4929 0.139 3.536 0.000 0.220
0.766
pgin -0.1131 0.009 -12.410 0.000 -0.131
-0.095
pflt -0.0274 0.002 -12.818 0.000 -0.032
-0.023
freemem -0.0003 5.02e-05 -5.901 0.000 -0.000
-0.000
freeswap 9.518e-06 1.85e-07 51.451 0.000 9.16e-06
9.88e-06
runqsz_Not_CPU_Bound 1.9046 0.124 15.350 0.000 1.661
2.148
pflt_sq -5.942e-05 5.92e-06 -10.036 0.000 -7.1e-05
-4.78e-05
scall_sq -2.827e-07 2.09e-08 -13.535 0.000 -3.24e-07
-2.42e-07
==============================================================================
Omnibus: 844.795 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1634.856
Skew: -0.918 Prob(JB): 0.00
Kurtosis: 4.864 Cond. No. 7.84e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 7.84e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: X_train.drop(['pflt_sq', 'scall_sq'], axis=1, inplace=True)

from itertools import combinations

# Get the list of column names from X_train
columns_list = X_train.columns

# Generate combinations of column names for interaction terms
interactions = list(combinations(columns_list, 2))
interactions

[ ]: [('const', 'lread'),
('const', 'scall'),
('const', 'rchar'),
('const', 'wchar'),
('const', 'pgout'),
('const', 'atch'),
('const', 'pgin'),
('const', 'pflt'),
('const', 'freemem'),
('const', 'freeswap'),
('const', 'runqsz_Not_CPU_Bound'),
('lread', 'scall'),
('lread', 'rchar'),
('lread', 'wchar'),
('lread', 'pgout'),
('lread', 'atch'),
('lread', 'pgin'),
('lread', 'pflt'),
('lread', 'freemem'),
('lread', 'freeswap'),
('lread', 'runqsz_Not_CPU_Bound'),
('scall', 'rchar'),
('scall', 'wchar'),
('scall', 'pgout'),
('scall', 'atch'),
('scall', 'pgin'),
('scall', 'pflt'),
('scall', 'freemem'),
('scall', 'freeswap'),
('scall', 'runqsz_Not_CPU_Bound'),
('rchar', 'wchar'),
('rchar', 'pgout'),
('rchar', 'atch'),
('rchar', 'pgin'),
('rchar', 'pflt'),
('rchar', 'freemem'),
('rchar', 'freeswap'),
('rchar', 'runqsz_Not_CPU_Bound'),
('wchar', 'pgout'),
('wchar', 'atch'),
('wchar', 'pgin'),
('wchar', 'pflt'),
('wchar', 'freemem'),
('wchar', 'freeswap'),
('wchar', 'runqsz_Not_CPU_Bound'),
('pgout', 'atch'),
('pgout', 'pgin'),
('pgout', 'pflt'),
('pgout', 'freemem'),
('pgout', 'freeswap'),
('pgout', 'runqsz_Not_CPU_Bound'),
('atch', 'pgin'),
('atch', 'pflt'),
('atch', 'freemem'),
('atch', 'freeswap'),
('atch', 'runqsz_Not_CPU_Bound'),
('pgin', 'pflt'),
('pgin', 'freemem'),
('pgin', 'freeswap'),
('pgin', 'runqsz_Not_CPU_Bound'),
('pflt', 'freemem'),
('pflt', 'freeswap'),
('pflt', 'runqsz_Not_CPU_Bound'),
('freemem', 'freeswap'),
('freemem', 'runqsz_Not_CPU_Bound'),
('freeswap', 'runqsz_Not_CPU_Bound')]

[ ]: interaction_dict = {}
for interaction in interactions:
    X_train_int = X_train.copy()
    X_train_int['int'] = X_train_int[interaction[0]] * X_train_int[interaction[1]]

    lr3 = LinearRegression()
    lr3.fit(X_train_int, y_train)
    interaction_dict[lr3.score(X_train_int, y_train)] = interaction

[ ]: top_5 = sorted(interaction_dict.keys(), reverse=True)[:5]

for interaction in top_5:
    print(interaction_dict[interaction])

('freeswap', 'runqsz_Not_CPU_Bound')
('freemem', 'freeswap')
('rchar', 'freeswap')
('wchar', 'freeswap')
('freemem', 'runqsz_Not_CPU_Bound')
(‘freemem’, ‘freeswap’): This interaction term involves two variables related to memory usage, ‘freemem’ and ‘freeswap’. Memory-related variables are often crucial in system performance analysis, and their interaction may capture complex relationships affecting the outcome variable.
(‘rchar’, ‘freeswap’): This interaction term involves ‘rchar’, which represents the number of characters transferred per second by system read calls, and ‘freeswap’, which represents the number of disk blocks available for page swapping. This interaction might capture the relationship between disk I/O operations and available swap space, which could impact system performance.

[ ]: X_train_int['freemem_freeswap_interaction'] = X_train_int['freemem'] * X_train_int['freeswap']
X_train_int['rchar_freeswap_interaction'] = X_train_int['rchar'] * X_train_int['freeswap']

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_dict = {}

for feature in X_train_int.columns:
    for p in range(2, 5):
        X_train_poly = X_train_int.copy()  # Make a copy of the DataFrame
        X_train_poly['sq'] = X_train_poly[feature] ** p

        lr = LinearRegression()
        lr.fit(X_train_poly, y_train)

        poly_dict[lr.score(X_train_poly, y_train)] = [feature, p]

max_score = max(poly_dict.keys())
max_score_feature, max_score_degree = poly_dict[max_score]

print("Max R-squared:", max_score)
print("Feature with max R-squared:", max_score_feature)
print("Degree of polynomial for max R-squared:", max_score_degree)

Max R-squared: 0.8916079062437935


Feature with max R-squared: freeswap
Degree of polynomial for max R-squared: 2

[ ]: # using square transformation
X_train["freeswap_sq"] = np.square(X_train["freeswap"])

# Define interaction terms for ('freemem', 'freeswap') and ('rchar', 'freeswap')
X_train['freemem_freeswap_interaction'] = X_train['freemem'] * X_train['freeswap']
X_train['rchar_freeswap_interaction'] = X_train['rchar'] * X_train['freeswap']

# let's create a model with the transformed data
olsmod_12 = sm.OLS(y_train, X_train)
olsres_12 = olsmod_12.fit()
print(olsres_12.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.891
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 3339.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:37 Log-Likelihood: -14862.
No. Observations: 5734 AIC: 2.975e+04
Df Residuals: 5719 BIC: 2.985e+04
Df Model: 14
Covariance Type: nonrobust
================================================================================
================
coef std err t P>|t|
[0.025 0.975]
--------------------------------------------------------------------------------
----------------
const 73.4619 0.344 213.297 0.000
72.787 74.137
lread -0.0407 0.003 -12.633 0.000
-0.047 -0.034
scall -0.0018 3.75e-05 -48.185 0.000
-0.002 -0.002
rchar -3.057e-07 7.82e-07 -0.391 0.696
-1.84e-06 1.23e-06
wchar -5.198e-06 7.12e-07 -7.295 0.000
-6.59e-06 -3.8e-06
pgout -0.2678 0.029 -9.255 0.000
-0.325 -0.211
atch 0.2245 0.104 2.151 0.032
0.020 0.429
pgin -0.1766 0.007 -25.812 0.000
-0.190 -0.163
pflt -0.0456 0.001 -86.807 0.000
-0.047 -0.045
freemem -0.0008 0.000 -3.197 0.001
-0.001 -0.000
freeswap 3.945e-05 5.45e-07 72.345 0.000
3.84e-05 4.05e-05
runqsz_Not_CPU_Bound -0.1447 0.099 -1.465 0.143
-0.338 0.049
freeswap_sq -1.457e-11 2.41e-13 -60.539 0.000
-1.5e-11 -1.41e-11
freemem_freeswap_interaction 8.123e-10 1.49e-10 5.459 0.000
5.21e-10 1.1e-09
rchar_freeswap_interaction -2.649e-12 5.88e-13 -4.507 0.000
-3.8e-12 -1.5e-12
==============================================================================
Omnibus: 450.177 Durbin-Watson: 1.955
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1838.480
Skew: -0.295 Prob(JB): 0.00
Kurtosis: 5.710 Cond. No. 1.79e+13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.79e+13. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]:

5.14.2 The 'rchar' predictor now has a p-value > 0.05, so we remove it and rebuild the model
[ ]: X_train = X_train.drop(["rchar"], axis=1)
olsmod_13 = sm.OLS(y_train, X_train)
olsres_13 = olsmod_13.fit()
print(olsres_13.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.891
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 3596.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:37 Log-Likelihood: -14862.
No. Observations: 5734 AIC: 2.975e+04
Df Residuals: 5720 BIC: 2.985e+04
Df Model: 13
Covariance Type: nonrobust
================================================================================
================
coef std err t P>|t|
[0.025 0.975]
--------------------------------------------------------------------------------
----------------
const 73.3764 0.266 275.953 0.000
72.855 73.898
lread -0.0407 0.003 -12.647 0.000
-0.047 -0.034
scall -0.0018 3.74e-05 -48.341 0.000
-0.002 -0.002
wchar -5.25e-06 7e-07 -7.505 0.000
-6.62e-06 -3.88e-06
pgout -0.2679 0.029 -9.260 0.000
-0.325 -0.211
atch 0.2222 0.104 2.132 0.033
0.018 0.426
pgin -0.1768 0.007 -25.917 0.000
-0.190 -0.163
pflt -0.0456 0.001 -86.879 0.000
-0.047 -0.045
freemem -0.0008 0.000 -3.209 0.001
-0.001 -0.000
freeswap 3.956e-05 4.61e-07 85.813 0.000
3.87e-05 4.05e-05
runqsz_Not_CPU_Bound -0.1498 0.098 -1.530 0.126
-0.342 0.042
freeswap_sq -1.46e-11 2.27e-13 -64.409 0.000
-1.5e-11 -1.42e-11
freemem_freeswap_interaction 8.138e-10 1.49e-10 5.472 0.000
5.22e-10 1.11e-09
rchar_freeswap_interaction -2.859e-12 2.38e-13 -12.017 0.000
-3.33e-12 -2.39e-12
==============================================================================
Omnibus: 448.712 Durbin-Watson: 1.955
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1814.195
Skew: -0.297 Prob(JB): 0.00
Kurtosis: 5.691 Cond. No. 1.38e+13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.38e+13. This might indicate that there are
strong multicollinearity or other numerical problems.

Similarly, 'runqsz_Not_CPU_Bound' is no longer significant in olsres_13 (p-value 0.126 > 0.05), so it is dropped as well.

[ ]: X_train = X_train.drop(["runqsz_Not_CPU_Bound"], axis=1)


olsmod_14 = sm.OLS(y_train, X_train)
olsres_14 = olsmod_14.fit()
print(olsres_14.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.891
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 3895.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:37 Log-Likelihood: -14863.
No. Observations: 5734 AIC: 2.975e+04
Df Residuals: 5721 BIC: 2.984e+04
Df Model: 12
Covariance Type: nonrobust
================================================================================
================
coef std err t P>|t|
[0.025 0.975]
--------------------------------------------------------------------------------
----------------
const 73.3141 0.263 278.979 0.000
72.799 73.829
lread -0.0405 0.003 -12.582 0.000
-0.047 -0.034
scall -0.0018 3.67e-05 -48.930 0.000
-0.002 -0.002
wchar -5.138e-06 6.96e-07 -7.384 0.000
-6.5e-06 -3.77e-06
pgout -0.2690 0.029 -9.297 0.000
-0.326 -0.212
atch 0.2229 0.104 2.139 0.032
0.019 0.427
pgin -0.1767 0.007 -25.896 0.000
-0.190 -0.163
pflt -0.0456 0.001 -86.938 0.000
-0.047 -0.045
freemem -0.0007 0.000 -2.956 0.003
-0.001 -0.000
freeswap 3.935e-05 4.4e-07 89.484 0.000
3.85e-05 4.02e-05
freeswap_sq -1.449e-11 2.14e-13 -67.822 0.000
-1.49e-11 -1.41e-11
freemem_freeswap_interaction 7.603e-10 1.45e-10 5.259 0.000
4.77e-10 1.04e-09
rchar_freeswap_interaction -2.805e-12 2.35e-13 -11.921 0.000
-3.27e-12 -2.34e-12
==============================================================================
Omnibus: 449.041 Durbin-Watson: 1.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1809.231
Skew: -0.299 Prob(JB): 0.00
Kurtosis: 5.686 Cond. No. 1.37e+13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.37e+13. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: df_pred = pd.DataFrame()

df_pred["Actual Values"] = y_train.values.flatten() # actual values


df_pred["Fitted Values"] = olsres_14.fittedvalues.values # predicted values
df_pred["Residuals"] = olsres_14.resid.values # residuals

5.14.3 Test for Normality

[ ]: sns.histplot(df_pred["Residuals"], kde=True)
plt.title("Normality of residuals")
plt.show()

• The histogram suggests the residuals are approximately normally distributed, with a slight left skew.

[ ]: import pylab
import scipy.stats as stats

stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)


plt.show()

The QQ plot of residuals can be used to visually check the normality assumption. The normal
probability plot of residuals should approximately follow a straight line.

[ ]: from scipy import stats


stats.shapiro(df_pred["Residuals"])

[ ]: ShapiroResult(statistic=0.9624219536781311, pvalue=3.6266662517326524e-36)

Since the p-value < 0.05, the residuals are not strictly normal according to the Shapiro-Wilk test. Note that with a sample this large the test flags even small departures from normality, so the histogram and Q-Q plot remain useful complementary checks.

5.14.4 Test for Homoscedasticity

• Homoscedasticity - If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
• Heteroscedasticity - If the variance of the residuals is unequal across the regression line, then the data is said to be heteroscedastic. In this case the residuals can form a funnel (arrow) shape or any other non-symmetrical shape.
Why the test?
• The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.
How to check if the model has heteroscedasticity?
• We can use the Goldfeld-Quandt test. If we get a p-value > 0.05 we can say that the residuals are homoscedastic; otherwise they are heteroscedastic.
How to deal with heteroscedasticity?
• It can be fixed by adding other important features or by making transformations.
The null and alternate hypotheses of the Goldfeld-Quandt test are as follows:
• Null hypothesis: Residuals are homoscedastic
• Alternate hypothesis: Residuals are heteroscedastic

[ ]: import statsmodels.stats.api as sms


sms.het_goldfeldquandt(df_pred["Residuals"], X_train)[1]

[ ]: 0.22334693193252067

• Since p-value > 0.05 we can say that the residuals are homoscedastic.

5.15 With the assumptions of linear regression reasonably satisfied, let's check the summary of our final model (olsres_14).
[ ]: print(olsres_14.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.891
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 3895.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:53:40 Log-Likelihood: -14863.
No. Observations: 5734 AIC: 2.975e+04
Df Residuals: 5721 BIC: 2.984e+04
Df Model: 12
Covariance Type: nonrobust
================================================================================
================
coef std err t P>|t|
[0.025 0.975]
--------------------------------------------------------------------------------
----------------
const 73.3141 0.263 278.979 0.000
72.799 73.829
lread -0.0405 0.003 -12.582 0.000
-0.047 -0.034
scall -0.0018 3.67e-05 -48.930 0.000
-0.002 -0.002
wchar -5.138e-06 6.96e-07 -7.384 0.000
-6.5e-06 -3.77e-06
pgout -0.2690 0.029 -9.297 0.000
-0.326 -0.212
atch 0.2229 0.104 2.139 0.032
0.019 0.427
pgin -0.1767 0.007 -25.896 0.000
-0.190 -0.163
pflt -0.0456 0.001 -86.938 0.000
-0.047 -0.045
freemem -0.0007 0.000 -2.956 0.003
-0.001 -0.000
freeswap 3.935e-05 4.4e-07 89.484 0.000
3.85e-05 4.02e-05
freeswap_sq -1.449e-11 2.14e-13 -67.822 0.000
-1.49e-11 -1.41e-11
freemem_freeswap_interaction 7.603e-10 1.45e-10 5.259 0.000
4.77e-10 1.04e-09
rchar_freeswap_interaction -2.805e-12 2.35e-13 -11.921 0.000
-3.27e-12 -2.34e-12
==============================================================================
Omnibus: 449.041 Durbin-Watson: 1.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1809.231
Skew: -0.299 Prob(JB): 0.00
Kurtosis: 5.686 Cond. No. 1.37e+13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.37e+13. This might indicate that there are
strong multicollinearity or other numerical problems.

[ ]: df_pred = pd.DataFrame()

df_pred["Actual Values"] = y_train.values.flatten() # actual values


df_pred["Fitted Values"] = olsres_14.fittedvalues.values # predicted values
df_pred["Residuals"] = olsres_14.resid.values # residuals

# let us plot the fitted values vs residuals


sns.set_style("whitegrid")
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()

5.16 The model equation will be as follows:


[ ]: # let's check the model parameters
olsres_14.params

[ ]: const 7.331406e+01
lread -4.045581e-02
scall -1.797301e-03
wchar -5.137688e-06
pgout -2.689676e-01
atch 2.229092e-01
pgin -1.766742e-01
pflt -4.564858e-02
freemem -7.423463e-04
freeswap 3.934889e-05
freeswap_sq -1.448617e-11
freemem_freeswap_interaction 7.602648e-10
rchar_freeswap_interaction -2.804670e-12
dtype: float64

[ ]: # Let us write the equation of linear regression
Equation = "usr ="
print(Equation, end=" ")
for i in range(len(X_train.columns)):
    if i == 0:
        print(olsres_14.params.iloc[i], "+", end=" ")
    elif i != len(X_train.columns) - 1:
        print(
            olsres_14.params.iloc[i],
            "* (",
            X_train.columns[i],
            ")",
            "+",
            end=" ",
        )
    else:
        print(olsres_14.params.iloc[i], "* (", X_train.columns[i], ")")

usr = 73.3140576481782 + -0.04045581278423039 * ( lread ) +


-0.001797301323264955 * ( scall ) + -5.1376876443090065e-06 * ( wchar ) +
-0.2689676302487187 * ( pgout ) + 0.22290922929807327 * ( atch ) +
-0.17667416630777863 * ( pgin ) + -0.045648578057200836 * ( pflt ) +
-0.0007423462908686035 * ( freemem ) + 3.934888880829468e-05 * ( freeswap ) +
-1.4486165831222174e-11 * ( freeswap_sq ) + 7.602648419605341e-10 * (
freemem_freeswap_interaction ) + -2.8046701130685438e-12 * (
rchar_freeswap_interaction )

5.16.1 Observations
Intercept: The intercept term is 73.3140576481782. This represents the predicted value of ‘usr’
when all predictor variables are zero.
Coefficients:
The coefficients associated with the predictor variables have varying magnitudes, indicating their
respective impacts on the target variable ‘usr’.
Negative coefficients (e.g., for ‘lread’, ‘scall’, ‘wchar’, ‘pgout’, ‘pgin’, ‘pflt’, ‘freemem’) suggest a
negative relationship with ‘usr’. An increase in these variables tends to decrease the predicted
value of ‘usr’.
Positive coefficients (e.g., for ‘atch’, ‘freeswap’, interaction terms) suggest a positive relationship
with ‘usr’. An increase in these variables tends to increase the predicted value of ‘usr’.
Magnitude of Coefficients: The magnitude of the coefficients indicates the strength of the relation-
ship between the predictor variables and the target variable. Larger magnitude coefficients suggest
a stronger influence on the target variable.

Interaction Terms:
Interaction terms such as ‘freemem_freeswap_interaction’ and ‘rchar_freeswap_interaction’ are
included in the equation. These terms represent the combined effect of two predictor variables
(‘freemem’ and ‘freeswap’, ‘rchar’ and ‘freeswap’) on the target variable ‘usr’. The coefficients
associated with interaction terms indicate the impact of the interaction between the respective
predictor variables on the target variable.
Squared Term:
The squared term ‘freeswap_sq’ is included in the equation. This suggests that the relationship between ‘freeswap’ and ‘usr’ may not be linear but quadratic: as ‘freeswap’ increases, its effect on ‘usr’ is not constant but varies nonlinearly.
Overall, the final equation provides insight into how each predictor variable contributes to the prediction of ‘usr’ and how their interactions affect the target variable. It can be used to make predictions and to understand the relationship between the predictors and the target variable in the context of the specific problem domain.
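
As a quick illustration of how the fitted equation can be applied, the sketch below plugs one made-up observation into the coefficients of olsres_14. All input values are hypothetical, chosen only for demonstration; note that the engineered terms (freeswap_sq and the two interaction columns) must be derived from the raw inputs exactly as they were during training.

[ ]: # Hypothetical example: predict 'usr' for a single made-up observation.
# None of these raw values come from the dataset; they are illustrative only.
example = {
    "const": 1.0,
    "lread": 20,        # reads per second
    "scall": 2000,      # system calls per second
    "wchar": 100000,    # characters written per second
    "pgout": 1.5,
    "atch": 0.5,
    "pgin": 5.0,
    "pflt": 120.0,
    "freemem": 3000,
    "freeswap": 1200000,
}

# Engineered features are derived from the raw inputs, mirroring the training step
example["freeswap_sq"] = example["freeswap"] ** 2
example["freemem_freeswap_interaction"] = example["freemem"] * example["freeswap"]
example["rchar_freeswap_interaction"] = 150000 * example["freeswap"]  # assumes rchar = 150000

# Dot product of the coefficients with the feature values gives the prediction
predicted_usr = sum(olsres_14.params[name] * value for name, value in example.items())
print(f"Predicted usr: {predicted_usr:.2f}%")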

5.17 Predictions
[ ]: X_train.columns

[ ]: Index(['const', 'lread', 'scall', 'wchar', 'pgout', 'atch', 'pgin', 'pflt',


'freemem', 'freeswap', 'freeswap_sq', 'freemem_freeswap_interaction',
'rchar_freeswap_interaction'],
dtype='object')

[ ]: X_test.columns

[ ]: Index(['const', 'lread', 'lwrite', 'scall', 'sread', 'swrite', 'fork', 'exec',


'rchar', 'wchar', 'pgout', 'ppgout', 'pgfree', 'atch', 'pgin', 'ppgin',
'pflt', 'vflt', 'freemem', 'freeswap', 'runqsz_Not_CPU_Bound'],
dtype='object')

[ ]: X_test = X_test.drop(["ppgout"], axis=1)


X_test = X_test.drop(["vflt"], axis=1)
X_test = X_test.drop(["ppgin"], axis=1)
X_test = X_test.drop(["sread"], axis=1)
X_test = X_test.drop(["pgfree"], axis=1)
X_test = X_test.drop(["fork"], axis=1)
X_test = X_test.drop(["lwrite"], axis=1)
X_test = X_test.drop(["swrite"], axis=1)
X_test = X_test.drop(["exec"], axis=1)
X_test['freemem_freeswap_interaction'] = X_test['freemem'] * X_test['freeswap']
X_test['rchar_freeswap_interaction'] = X_test['rchar'] * X_test['freeswap']
X_test["freeswap_sq"] = np.square(X_test["freeswap"])
X_test = X_test.drop(["rchar"], axis=1)
X_test = X_test.drop(["runqsz_Not_CPU_Bound"], axis=1)

[ ]: X_test.columns

[ ]: Index(['const', 'lread', 'scall', 'wchar', 'pgout', 'atch', 'pgin', 'pflt',
'freemem', 'freeswap', 'freemem_freeswap_interaction',
'rchar_freeswap_interaction', 'freeswap_sq'],
dtype='object')

[ ]: # let's make predictions on the test set


y_pred_test = olsres_14.predict(X_test)
y_pred_train = olsres_14.predict(X_train)

[ ]: # To check model performance


from sklearn.metrics import mean_absolute_error, mean_squared_error

[ ]: # let's check the RMSE on the train data


rmse1 = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse1

[ ]: 3.232277869882356

[ ]: # let's check the RMSE on the test data


rmse2 = np.sqrt(mean_squared_error(y_test, y_pred_test))
rmse2

[ ]: 255.21624622518092
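
A test RMSE of roughly 255 is implausibly large for a target bounded between 0 and 100, which points to a data-preparation issue rather than model error. One likely culprit (an assumption worth verifying): the engineered columns were appended to X_test in a different order than in X_train, and statsmodels applies coefficients positionally when a model is fit on a plain DataFrame rather than via a formula. A small sketch to check this by reindexing the test set to the training column order:

[ ]: # Sketch: align X_test to the exact column order used to fit olsres_14,
# then recompute the test RMSE. If ordering was the problem, this value
# should fall to something comparable to the training RMSE.
X_test_aligned = X_test[X_train.columns]  # same columns, training order
y_pred_test_aligned = olsres_14.predict(X_test_aligned)
print("Test RMSE (aligned columns):",
      np.sqrt(mean_squared_error(y_test, y_pred_test_aligned)))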

[ ]: X_test.columns

[ ]: Index(['const', 'lread', 'scall', 'wchar', 'pgout', 'atch', 'pgin', 'pflt',


'freemem', 'freeswap', 'freemem_freeswap_interaction',
'rchar_freeswap_interaction', 'freeswap_sq'],
dtype='object')

[ ]: olsmod_14 = sm.OLS(y_test, X_test)


olsres_14 = olsmod_14.fit()
print(olsres_14.summary())

OLS Regression Results


==============================================================================
Dep. Variable: usr R-squared: 0.885
Model: OLS Adj. R-squared: 0.885
Method: Least Squares F-statistic: 1575.
Date: Tue, 27 Feb 2024 Prob (F-statistic): 0.00
Time: 12:54:01 Log-Likelihood: -6398.0
No. Observations: 2458 AIC: 1.282e+04
Df Residuals: 2445 BIC: 1.290e+04
Df Model: 12
Covariance Type: nonrobust
================================================================================
================
coef std err t P>|t|
[0.025 0.975]
--------------------------------------------------------------------------------
----------------
const 72.2401 0.375 192.697 0.000
71.505 72.975
lread -0.0297 0.005 -6.144 0.000
-0.039 -0.020
scall -0.0018 5.89e-05 -30.093 0.000
-0.002 -0.002
wchar -5.483e-06 1.1e-06 -4.986 0.000
-7.64e-06 -3.33e-06
pgout -0.1708 0.043 -3.930 0.000
-0.256 -0.086
atch -0.0002 0.157 -0.001 0.999
-0.309 0.309
pgin -0.1713 0.010 -16.356 0.000
-0.192 -0.151
pflt -0.0441 0.001 -53.917 0.000
-0.046 -0.042
freemem -0.0009 0.000 -2.421 0.016
-0.002 -0.000
freeswap 4.056e-05 6.37e-07 63.694 0.000
3.93e-05 4.18e-05
freemem_freeswap_interaction 8.909e-10 2.09e-10 4.273 0.000
4.82e-10 1.3e-09
rchar_freeswap_interaction -2.872e-12 3.62e-13 -7.929 0.000
-3.58e-12 -2.16e-12
freeswap_sq -1.499e-11 3.19e-13 -46.937 0.000
-1.56e-11 -1.44e-11
==============================================================================
Omnibus: 195.660 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1031.784
Skew: -0.143 Prob(JB): 8.93e-225
Kurtosis: 6.161 Cond. No. 1.23e+13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.23e+13. This might indicate that there are
strong multicollinearity or other numerical problems.
Training data set: R-squared: 0.891, Adj. R-squared: 0.891, RMSE: 3.232277869882356
Testing data set: R-squared: 0.885, Adj. R-squared: 0.885, RMSE: 255.21624622518092
Training Dataset:
R-squared (Coefficient of Determination): The R-squared value of 0.891 indicates that approximately 89.1% of the variability in the target variable (‘usr’ - the percentage of time CPUs operate in user mode) can be explained by the predictor variables included in the model. A higher R-squared value suggests that the model fits the training data well.
Adjusted R-squared: The adjusted R-squared value of 0.891 is almost the same as the R-squared value, which indicates that the model’s performance is consistent even after adjusting for the number of predictors in the model.
RMSE (Root Mean Squared Error): The RMSE value of 3.232 suggests that, on average, the model’s predictions deviate from the actual values by approximately 3.232 percentage points. Lower RMSE values indicate better accuracy of the model.
Testing Dataset:
R-squared: The R-squared value of 0.885 for the testing dataset indicates that approximately
88.5% of the variability in the target variable can be explained by the model. This suggests that
the model’s performance on unseen data is also relatively high.
Adjusted R-squared: Like the training dataset, the adjusted R-squared value is also 0.885, indicating
consistent performance after adjusting for the number of predictors.
RMSE: The RMSE value of 255.216 implies that, on average, the model’s predictions deviate from the actual values by approximately 255 percentage points on the testing dataset, which is implausibly large for a target bounded between 0 and 100.
Overall Interpretation: The model has an issue; the gap between the training and testing RMSE is far too large to be ordinary overfitting and more likely points to a data-preparation problem, such as the mismatch in column order between X_train and X_test checked in the sketch above. This needs further investigation before the model is used.

[ ]:
