Intro
Data Science is a blend of various tools, algorithms, and machine learning principles to uncover insights from the hidden patterns of a raw
dataset.
1. Load Dataset
2. Inspect Dataset (Feature Engineering)
3. Clean Dataset (Feature Engineering)
4. Split training and testing dataset for modelling (Machine Learning)
5. Evaluate model performance (Evaluation)
https://colab.research.google.com/drive/1vId0Qd-ep01kr8VyX5V_N5Rapa4ynDz5#scrollTo=Bhi1dEJB6LU1&printMode=true 1/26
5/26/24, 7:43 PM Introduction_to_Data_Analytics_Master_Guide.ipynb - Colab
How can you ensure that your dataset is ready for machine learning (ML)?
How to choose the most suitable algorithms for your dataset?
How to define the feature variables that can potentially be used for machine learning?
Exploratory Data Analysis (EDA) can help to answer all these questions and ensure the best outcomes for the project. It is an approach for
summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.
2. Get a feel for the data: describe the data and look at a sample of it, such as the first and last rows
3. Take a deeper look into the data by querying or indexing the data
Data profiling
4 Objectives of EDA
Discover Patterns
Spot Anomalies
Frame Hypothesis
Check Assumptions
Univariate Analysis
Bivariate Analysis
Trends
Distribution
Mean
Median
Outlier
Correlations
Hypothesis testing
Visual Exploration
Description
In this dataset, there are 17 measured features namely:
Numerical Features
Categorical Features
1. Load data
2. Check total number of entries and column types
3. Check any null values
4. Check duplicate entries
5. Univariate Analysis
6. Bi-Variate Analysis
7. Multi-Variate Analysis (FYI only)
8. Summary of EDA
In Google Colab, the files must be uploaded to the session's temporary storage using the file icon tab on the left of the notebook.
df = pd.read_csv("/content/data.csv") # Take note: your file path might be different depending on your file name.
But before we do any cleaning, let's load the data and get an initial feel for what the data looks like.
We will use the head, tail, columns, shape and info methods to diagnose the data.
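The inspection methods named above can be sketched as follows. This is a minimal illustration on a small synthetic DataFrame; the values are made up, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the water tank data (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "DateTime": [f"2021-01-01 0{i}:00" for i in range(8)],
    "In_Flow": rng.normal(size=8).round(2),
    "Rot_Speed": rng.normal(size=8).round(2),
    "Quality_type": rng.choice(["Premium", "Approved", "Rejected"], size=8),
})

print(df.head(3))        # first three rows
print(df.tail(3))        # last three rows
print(list(df.columns))  # column names
print(df.shape)          # (number of rows, number of columns)
df.info()                # dtypes, non-null counts and memory usage
```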
0     2021-01-01 00:00   1.41   0.40   0.27  -0.31   1.75  -0.93   0.09
1     2021-01-01 01:00  -1.41  -0.02  -0.52  -1.83  -3.13  -0.52  -2.04
2     2021-01-01 02:00   0.29   0.67  -0.36  -0.01  -1.93   0.33  -0.51
categorical=df[[
'Quality_type'
]]
['DateTime',
'In_Flow',
'In_Angle',
'In_Temp',
'Out_Temp',
'In_Pressure',
'Viscosity',
'Humidity',
'Rot_Speed',
'Valve_Opening',
'Tank_%',
'Out_Speed',
'Thickness',
'Vibration',
'Col_density',
'Oxygen',
'Quality_type']
4995  2021-07-27 03:00   0.22  -0.42   0.56  -0.97   1.40   1.48  -0.74
4996  2021-07-27 04:00   0.87  -1.54   1.78  -0.64   1.04   0.66   1.97
4997  2021-07-27 05:00   0.17  -0.75   0.46   0.33   1.05   0.99   0.70
0 1.85
1 -1.16
2 0.72
3 1.16
4 1.31
5 0.51
6 -4.79
7 -0.14
8 0.17
9 2.59
Name: Rot_Speed, dtype: float64
(5000, 17)
# We can also try some simple visualisation. Surveying the data labels (column heading name) in the given dataset, let's inspect the amo
df['Quality_type'].value_counts() # I chose this as a demo since quality type is the only categorical data
Quality_type
Premium 1698
Approved 1678
Rejected 1624
Name: count, dtype: int64
# You can also plot it as a graph of your choice for better visualisation
plt.figure(figsize=(10,10)) # create the figure first so that the bar chart is drawn on it
barlist = df['Quality_type'].value_counts().plot(kind='bar', rot=0) # rot is rotation of text
Insights
Our total measured data stands at 5000 entries. From the initial inspection, about 34.0% (1698), 33.6% (1678) and 32.5% (1624) of the tanks were graded as
"Premium", "Approved" and "Rejected" quality type respectively.
Note: The reason I talk about variance here is that if your data does not exhibit any variance at all, then the dataset has a problem even at first
glance, and machine learning will not be possible!
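A quick way to check the variance point is to compute the variance of every numerical column and flag any that are exactly zero. The sketch below uses synthetic data, and the `Constant_Sensor` column is a hypothetical example of a feature that never changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "In_Flow": rng.normal(size=100),
    "In_Temp": rng.normal(size=100),
    "Constant_Sensor": np.full(100, 3.0),  # hypothetical column with no variance
})

# Variance of each numerical column; a value of 0 means the column never changes.
variances = df.var(numeric_only=True)
zero_variance_cols = variances[variances == 0].index.tolist()

print(variances)
print("Columns with zero variance:", zero_variance_cols)
```

A column that shows up in `zero_variance_cols` carries no information for a model and is usually a candidate for removal.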
df.corr(numeric_only=True)
# Please take note: after 14 May 2024, the behaviour of .corr() changed (pandas 2.0 and later no longer restrict the calculation to
# numerical columns by default), so if, like in our case, the correlation study should cover only the numerical data,
# we need to include numeric_only=True in the argument () as above.
Insights
From the basic correlation table, Humidity and Rot_Speed may have some bearing on the quality type, because they correlate relatively strongly
with the In_Flow feature, at 58% and 56% respectively.
We can convert data types, for example from str to categorical or from int to float.
# info gives data type like dataframe, number of sample or row, number of feature or column, feature types and memory usage
df.info()
# If you want to convert from one data type to another, you can follow the example below:
# let's convert object (str) to categorical and int to float.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 5000 non-null object
1 In_Flow 5000 non-null float64
2 In_Angle 5000 non-null float64
3 In_Temp 5000 non-null float64
4 Out_Temp 5000 non-null float64
5 In_Pressure 5000 non-null float64
6 Viscosity 5000 non-null float64
7 Humidity 5000 non-null float64
8 Rot_Speed 5000 non-null float64
9 Valve_Opening 5000 non-null float64
10 Tank_% 5000 non-null float64
11 Out_Speed 5000 non-null float64
12 Thickness 5000 non-null float64
13 Vibration 5000 non-null float64
14 Col_density 5000 non-null float64
15 Oxygen 5000 non-null float64
16 Quality_type 5000 non-null object
dtypes: float64(15), object(2)
memory usage: 664.2+ KB
For columns that are categorical, i.e., columns that take on a limited, and usually fixed, number of possible values, we set their type as
“category”. E.g., gender, blood type, day of the week and country are all categorical data. In our dataset, only Quality_type is of this kind.
For columns that are numeric, we can either set their type as “int64” (integer) or “float64” (floating point number). E.g., sales, temperature and
number of people are all numeric data.
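The conversions described above can be sketched with `astype`. This is only an illustration on a tiny made-up frame; `Batch_Count` is a hypothetical integer column, since every numerical column in the water tank dataset is already float64.

```python
import pandas as pd

df = pd.DataFrame({
    "Quality_type": ["Premium", "Approved", "Rejected", "Premium"],
    "Batch_Count": [1, 2, 3, 4],  # hypothetical integer column
})

# object (str) -> category
df["Quality_type"] = df["Quality_type"].astype("category")
# int -> float
df["Batch_Count"] = df["Batch_Count"].astype("float64")

print(df.dtypes)
```

Storing a repetitive text column as a category generally reduces memory usage, which matches the smaller figure reported by `df.info()` after conversion.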
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 5000 non-null object
1 In_Flow 5000 non-null float64
2 In_Angle 5000 non-null float64
3 In_Temp 5000 non-null float64
4 Out_Temp 5000 non-null float64
5 In_Pressure 5000 non-null float64
6 Viscosity 5000 non-null float64
7 Humidity 5000 non-null float64
8 Rot_Speed 5000 non-null float64
9 Valve_Opening 5000 non-null float64
10 Tank_% 5000 non-null float64
11 Out_Speed 5000 non-null float64
12 Thickness 5000 non-null float64
13 Vibration 5000 non-null float64
14 Col_density 5000 non-null float64
15 Oxygen 5000 non-null float64
16 Quality_type 5000 non-null category
dtypes: category(1), float64(15), object(1)
memory usage: 630.1+ KB
Column name inconsistency, such as upper/lower case letters or spaces between words
Duplicated data
Missing data, which can be handled in several ways:
leave as is
drop them with dropna()
fill missing values with fillna()
fill missing values with summary statistics such as the mean
Assert statement: a check that you can turn on or off once you are done testing the program
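The cleaning options listed above can be sketched as follows. The frame is made up for illustration, and the messy column name is deliberately invented to show the renaming step.

```python
import numpy as np
import pandas as pd

demodf = pd.DataFrame({
    " In Flow": [1.41, np.nan, 0.29, -0.50],  # deliberately messy column name
    "Quality_type": ["Premium", "Approved", "Rejected", "Rejected"],
})

# Column name inconsistency: strip outer spaces and replace inner spaces with "_"
demodf.columns = demodf.columns.str.strip().str.replace(" ", "_")

# Option: drop every row that contains a missing value
dropped = demodf.dropna()

# Option: fill missing numerical values with the column mean
filled = demodf.copy()
filled["In_Flow"] = filled["In_Flow"].fillna(filled["In_Flow"].mean())

# Assert statement: fails loudly if any missing value is left in the column
assert filled["In_Flow"].notnull().all()

print(dropped.shape, filled.shape)
```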
(5000, 3)
# generate preview of entries with null values
if demodf.isnull().any(axis=None):
print("\nPreview of data with null values")
print(demodf[demodf.isnull().any(axis=1)].head(3))
missingno.matrix(demodf)
plt.show()
# Now, from the demo output, you can see that missing values occur in the quality_type column
Should there be any missing entries, you should decide on the cleaning steps required before proceeding further to the other parts of EDA.
One common way of dealing with missing data is to either drop them or fill them with something like 'empty', as shown in the following
example.
If a column has too many missing values, consider dropping the whole column instead if you deem it not too important
df2 = df.drop(['In_Flow'],axis=1)
Example: drop the column "In_Flow" and transfer all remaining features to a new dataframe named df2
# Let's assume that I don't know whether Quality_type has missing values
# Let's check
demodf["Quality_type"].value_counts(dropna=False)
#demodf["Quality_type"].value_counts(dropna=True) # This variant leaves NaN values out of the counts
Quality_type
Premium 1696
Approved 1676
Rejected 1619
NaN 9
Name: count, dtype: int64
# If it is numerical data, for simplicity and as a starting point, you can fill
# them with '0'
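A minimal sketch of both fill strategies is given below, on a made-up four-row frame. The exact cell used in the notebook is not shown here, so treat this as one plausible way to do it with `fillna`.

```python
import numpy as np
import pandas as pd

demodf = pd.DataFrame({
    "In_Flow": [1.41, np.nan, 0.29, -0.50],
    "Quality_type": ["Premium", None, "Approved", "Premium"],
})

# Count the categories, including missing entries
print(demodf["Quality_type"].value_counts(dropna=False))

# Categorical column: replace missing entries with the label 'empty'
demodf["Quality_type"] = demodf["Quality_type"].fillna("empty")
# Numerical column: replace missing entries with 0
demodf["In_Flow"] = demodf["In_Flow"].fillna(0)

print(demodf["Quality_type"].value_counts())
```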
Quality_type
Premium 1696
Approved 1676
Rejected 1619
empty 9
Name: count, dtype: int64
0 False
1 True
2 False
3 True
4 True
...
4995 True
4996 True
4997 True
4998 True
4999 True
Name: Quality_type, Length: 5000, dtype: bool
In our case, there are also no duplicated entries, which the code will point out directly by printing the output “No duplicated entries found”.
In the event of duplicated entries, the output will show the number of duplicated entries and a preview of the duplicated entries.
Should there be any duplicated entries, you should decide on the cleaning steps required before proceeding further to the other parts of
EDA.
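A duplicate check that behaves as described could look like the sketch below. The data is made up, and the message text simply mirrors the one quoted above; the notebook's own cell may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "In_Flow": [1.41, -1.41, 1.41],
    "Quality_type": ["Premium", "Rejected", "Premium"],
})

# keep=False marks every member of a duplicated group, which is useful for a preview
duplicates = df[df.duplicated(keep=False)]

if duplicates.empty:
    print("No duplicated entries found")
else:
    # duplicated() with its default keep='first' counts only the repeats
    print(f"{df.duplicated().sum()} duplicated entries found")
    print(duplicates.head())

# Non-destructive alternative to inplace=True: keep the result in a new frame
deduplicated = df.drop_duplicates()
```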
One common way of dealing with duplicated entries is to drop them using the following code:
df.drop_duplicates(inplace=True)
NOTE: We will revert to our original dataset df from this point
# Since Quality_type is our target, it may be worthwhile to further split the data by the target's classes,
# namely Premium, Approved and Rejected, for later EDA
Premium_type = df[df["Quality_type"]=='Premium']
Approved_type = df[df["Quality_type"]=='Approved']
Rejected_type = df[df["Quality_type"]=='Rejected']
Below shows all numeric feature distributions using the histogram (barplot).
The x-axis represents the given values and the y-axis represents their frequencies.
df.hist(figsize=[15,15])
plt.suptitle("Histogram Demonstration")
plt.show()
Insights
Observing the histograms for all input features, our raw data appear to be quite centrally distributed.
This is a good sign: it gives a first indication that our data do not contain many extreme outliers.
In addition, another common definition considers any point away from the mean by more than a fixed number of standard deviations to be an
“outlier”. In other words, we can consider data values corresponding to areas of the population with low density or probability as suspected
outliers.
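The standard deviation rule above can be sketched as a z-score filter. The data is synthetic with one planted extreme value, and the cut-off of 3 standard deviations is an assumption chosen for illustration, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.normal(size=500), name="In_Flow")
values.iloc[10] = 8.0  # planted extreme value

threshold = 3  # assumed cut-off, in standard deviations

# z-score: distance from the mean measured in standard deviations
z = (values - values.mean()) / values.std()
outliers = values[z.abs() > threshold]

print(outliers)
```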
A boxplot is very good at presenting statistical information such as outliers. The plot consists of a rectangular box bounded above and below by
“hinges” that represent the 75% and 25% quantiles respectively.
The “median” is shown as the horizontal line through the box. You can also see the upper and lower “whiskers”. The vertical axis is in the
units of the quantitative variable.
Data points far above the upper whisker or far below the lower whisker are the suspected “outliers”.
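The points a boxplot flags can also be found numerically. The sketch below assumes the usual 1.5 × IQR rule, which is the default whisker setting in matplotlib and seaborn, and uses a small made-up series.

```python
import pandas as pd

values = pd.Series([-0.9, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 0.8, 6.0], name="In_Flow")

q1 = values.quantile(0.25)  # lower hinge
q3 = values.quantile(0.75)  # upper hinge
iqr = q3 - q1               # interquartile range (height of the box)

# Whisker limits under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

suspected = values[(values < lower_limit) | (values > upper_limit)]
print(suspected)
```

In the plot itself, each whisker is drawn to the most extreme data point that still lies inside these limits, and anything beyond is shown as an individual marker.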
For demonstration, below shows an example of plotting boxplot for three variables "In_Flow", "In_Angle" and "In_Temp"
sns.boxplot(x="Quality_type", y="In_Flow", data=df)
plt.show()
sns.boxplot(x="Quality_type", y="In_Angle", data=df)
plt.show()
sns.boxplot(x="Quality_type", y="In_Temp", data=df)
plt.show()
Insights
In_Temp in general does not contain many outliers, probably just two or three points under the "Rejected" quality type. This is negligible for the
whole analysis.
In_Angle, however, shows two distinct outliers under "Premium" at points -3 and -4.5. As there are so few of them, I believe they can also be neglected
for now.
In_Pressure shows a small cluster of outliers in the range below -3 for both "Approved" and "Rejected". Again, as the numbers are rather small,
we can choose to neglect them for now.
sns.FacetGrid(df,hue="Quality_type",height = 5).map(sns.distplot,"In_Flow").add_legend()
plt.show()
/usr/local/lib/python3.10/dist-packages/seaborn/axisgrid.py:854: UserWarning:
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
func(*plot_args, **plot_kwargs)
Insights
It is highly likely that the inlet flow rate of the water tank affects the quality type a lot, specifically the "Approved" grade, especially at higher
flow rates averaging around 2.
This is supported by the heavy overlap observed when In_Flow is between -2 and 4.
sns.set_style("whitegrid")
sns.FacetGrid(df, hue = "Quality_type" , height = 6).map(plt.scatter,"In_Flow","In_Temp").add_legend()
plt.show()
Insights
From the scatterplot, it is quite evident that both In_Temp and In_Flow have a significant influence on the "Rejected" quality type, as shown by the distinctive
cluster in green.
In addition, the scatterplot also shows that the quality type, regardless of its class, is mainly affected when inlet temperature and flow are in the
-2 to 2 region.
Moreover, we can plot pairs of variables for a better understanding of the associations between variables. Below, we plot the pairwise association of the In_Flow
rate and the In_Temp value to see their relationship with the quality type.
selected_numeric_features = df[["Quality_type","In_Flow","In_Temp"]]
sns.set_style("whitegrid")
sns.pairplot(selected_numeric_features, hue = "Quality_type", height = 5)
plt.show()
As a result, some key decisions have to be made when creating a correlation matrix. One is the choice of the relevant correlation statistic and the
coding of the variables. Another is the treatment of missing data and the dropping of highly correlated variables from feature sets.
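Dropping highly correlated variables can be sketched as follows. The data is synthetic, with `Humidity` built as a near-copy of `In_Flow`, and the 0.9 cut-off is an assumption picked for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
numeric = pd.DataFrame({
    "In_Flow": base,
    "Humidity": base + rng.normal(scale=0.1, size=200),  # nearly a copy of In_Flow
    "In_Temp": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair is seen once
corr = numeric.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = numeric.drop(columns=to_drop)

print("Dropped:", to_drop)
```

For each highly correlated pair this keeps the column that appears first and drops the later one; which of the two to keep is a judgment call that depends on the problem.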
Below we plot a correlation heatmap for all the numerical features present in the original dataset.
corr = numeric.corr() # recall above, "numeric" is the variable name for all numerical feature
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(240, 10, n=9)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.4, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .8}) # feel free to adjust vmax for the scale
plt.show()
Insights
As per my hunch, Humidity and Rot_Speed, and also In_Pressure, show high correlation with regard to the tank quality. In contrast, the
correlation between Col_density and In_Pressure is distinctly weak.
In the process of EDA, the inspection of single variables or pairs of variables sometimes won’t suffice to rule out certain hypotheses (or outliers and
anomalous cases) from your dataset. That’s where multivariate analysis comes into play. This type of analysis generally shows the relationship
between two or more variables using statistical techniques. It addresses the need to take more variables into consideration at a time
during analysis, in order to reveal more insights in your data.
8. Summary of EDA
It is obvious that it is difficult to actually understand the dataset and draw conclusions without looking through the entire data set. In fact,
we have shown that time spent exploring the dataset is time well invested.
Although this preliminary step of processing data is tedious, all the hard work is necessary. In fact, when you begin to actually see
interesting insights, you will appreciate every single minute invested in such a process.
EDA enables data scientists to quickly understand key issues in the data and to guide deeper analysis in the right directions.
Successfully exploring the data assures stakeholders that they won’t be missing out on opportunities to leverage their data. They can
easily pinpoint risks including poor data quality, unreliable data, poor feature engineering, and other uncertainties.
In this section, we have covered a range of useful introductory data exploration and analysis methods and visualization techniques. The
EDA guidelines shown should allow data scientists to gain an insightful and deeper understanding of the problem at hand and decide on the
next move confidently.
In this 2nd section, we will explore how sklearn can help us split our original dataset into a training dataset for the machine learning algorithms and a
testing dataset for model performance evaluation in the final step.
# Import the function to randomly split the dataset into Training and Testing sets
from sklearn.model_selection import train_test_split
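A minimal sketch of the split is shown below on synthetic data. The 80/20 ratio, the fixed `random_state` and the use of `stratify` are assumptions made for this illustration; the notebook's own settings are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "In_Flow": rng.normal(size=150),
    "In_Temp": rng.normal(size=150),
    "Quality_type": np.repeat(["Premium", "Approved", "Rejected"], 50),
})

X = df.drop(columns=["Quality_type"])  # feature variables
y = df["Quality_type"]                 # target variable

# Hold out 20% of the rows for testing; stratify keeps the class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```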
As our target is categorical (Premium, Approved, or Rejected), we choose a “Classifier”. In case we would work with a numerical target (i.e.
predicting a value), we would opt for a “Regressor” but the syntax would be similar.
In this section, I have updated the guide to include each classifier, along with some parameter tuning advice.
The dataset used here is a housing price dataset for demo purposes, because linear regression predicts a numerical value while my original dataset on the water tank quality is a classification problem.
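import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# load dataset
dataset = pd.read_csv("/content/house_prices.csv")
# size_data and price_data are the house size and price columns selected from the dataset (that selection step is not shown here).
# We have to convert them into arrays to be used as the training dataset,
# and then use the reshape function to convert shape (n,) to shape (n,1) so that each observation has its own row.
size = np.array(size_data).reshape(-1,1)
price = np.array(price_data).reshape(-1,1)
# Fit the linear regression model with size as the feature and price as the target
model = LinearRegression()
model.fit(size, price)
# Predict Price
price_predicted = model.predict(size)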
Insights
The red line represents the regression equation and provides the relationship between the housing prices with respect to the house sizes given
in an estate.
From here onwards, we return to my original water tank dataset, denoted by "df"
# Import LR Library
from sklearn.linear_model import LogisticRegression
# The syntax for setting parameters in the logistic regression command is as follows:
LR.fit(X_train, y_train)
LR1.fit(X_train, y_train)
LR2.fit(X_train, y_train)
LogisticRegression(C=0.6)
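A sketch of fitting logistic regression with several settings of the regularisation parameter `C` follows. Only `C=0.6` appears in the output above; the other values, `max_iter=1000` and the synthetic data from `make_classification` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the water tank data
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Smaller C means stronger regularisation (a simpler model)
models = {}
for C in (0.1, 0.6, 1.0):
    models[C] = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)

for C, model in models.items():
    print(f"C={C}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```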
We will be using Support Vector Classification (SVC) since our problem is a classification problem.
# Load library
from sklearn.svm import SVC
SVC()
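# load libraries
from sklearn.neighbors import KNeighborsClassifier
# From my experience, KNN is one of the simplest models to deploy, but it is plagued
# by the difficulty of choosing the right K (number of neighbours). Too small a K may make your model overfit by following
# noise in the training data, while too large a K may underfit by over-smoothing, both leading to lower accuracy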
KNeighborsClassifier(n_neighbors=3)
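One way to explore the choice of K is to fit the classifier for a few values and compare training and test accuracy. The sketch below uses synthetic data, and the list of K values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for k in (1, 3, 15, 51):  # arbitrary values chosen for the comparison
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"K={k}: train={train_acc:.3f}, test={test_acc:.3f}")
```

With K=1 every training point is its own nearest neighbour, so the training accuracy is perfect even though the model has only memorised the data; a large gap between the two columns is the signal to look for.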
# The syntax for setting parameters in the Decision Tree (DTC) command is as follows:
# by default the tree is grown without any limit on leaf nodes; you can test by limiting max_leaf_nodes to a number
DecisionTreeClassifier()
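The effect of limiting `max_leaf_nodes` can be sketched by comparing an unrestricted tree with a restricted one. The data is synthetic and the limit of 10 leaves is an arbitrary value for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Default settings: the tree keeps splitting until the training data is fitted exactly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Restricted tree: at most 10 leaves (arbitrary limit for this sketch)
small_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_train, y_train)

for name, tree in (("unrestricted", full_tree), ("max_leaf_nodes=10", small_tree)):
    print(f"{name}: leaves={tree.get_n_leaves()}, "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```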
For your assignment, you can try random forest, but take note that the number of nodes to be computed may be high for your dataset and the
search for a solution may take a very long time.
# Import Library
from sklearn.ensemble import RandomForestClassifier
# The syntax for setting parameters in the Random Forest (RFC) command is as follows:
# From my experience, since RFC is an extension of DTC, it shares nearly the same concerns in terms of influencing factors.
rfc = RandomForestClassifier(criterion='gini', max_leaf_nodes=None) # these are the default values, written out explicitly
RandomForestClassifier()
#Compare Score
print ("Accuracy Score of Training Set:")
Insights
We notice that the models reach 100% accuracy on the training dataset for decision tree and random forest. This could be good news, but we
are probably facing an “overfitting” issue, meaning that the model performs perfectly on training data by learning predictions “by heart” and
will probably fail to reach the same level of accuracy on unseen data.
In other words, we are getting 100% accuracy because the training data are being used for testing. At the time of training, the decision tree or
random forest gained knowledge about that data, and if you now give it the same data to predict, it will return exactly the same values. That is why
decision tree and random forest are producing correct results every time.
# Evaluates the performance on the Test Set
#Compare Score
print ("Accuracy Score of Testing Set:")
Insights
The drop of about 20% to 25% in accuracy on the test data for SVC, KNN and RFC may well suggest the issue of "overfitting", because the test
data is cut from the same original dataset as the training data, so its value ranges all lie within the original dataset's region.
In other words, if the model performs better on the training set than on the test set, it means that the model is likely overfitting.
The phenomenon is very prominent when you compare the training and testing dataset accuracy of KNN, DTC and RFC.
Note
Also, it is clear that LR and the decision tree have the lowest accuracy scores. Of course, you will need to explore the parameters to
tune them and test for accuracy improvements.
Train dataset accuracy approximately equal to test dataset accuracy, with both below about 70% (based on my personal guiding principle) -->
Underfitting
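The rules of thumb above can be written down as a small helper. `diagnose` is a hypothetical function made up for this sketch; the 70% floor follows the guiding principle stated above, while the 10-point gap used to flag overfitting is an additional assumption.

```python
def diagnose(train_acc, test_acc, gap=0.10, floor=0.70):
    """Label a model from its training and test accuracy (both between 0 and 1).

    gap   -- assumed: train accuracy exceeding test accuracy by more than this suggests overfitting
    floor -- both accuracies below this level suggests underfitting
    """
    if train_acc - test_acc > gap:
        return "possible overfitting"
    if train_acc < floor and test_acc < floor:
        return "possible underfitting"
    return "no obvious problem"


# Illustrative accuracy pairs (made-up numbers)
print(diagnose(1.00, 0.69))  # perfect on training data, much lower on test data
print(diagnose(0.65, 0.64))  # similar but low on both
print(diagnose(0.85, 0.82))  # similar and reasonably high
```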
# You can use this command to check the prediction vs actual in a dataframe format, its optional and up to your preference.
1332  2021-02-25 12:00  -0.52   0.14   1.05  -1.48   0.11  -1.44  -0.72
2844  2021-04-28 12:00   1.15  -1.93  -0.28   2.40   0.86  -0.37   2.71
4665  2021-07-13 09:00   0.92   1.19   0.98  -2.00  -1.60   0.46  -0.61
3408  2021-05-22 00:00   3.16   0.75   1.41  -0.60  -1.13   0.63   1.15
975   2021-02-10 15:00   1.55   1.46   0.56   0.25   2.18  -0.08   2.29
283   2021-01-12 19:00   2.29   0.80   0.71  -0.09   1.48   0.04   2.75
4982  2021-07-26 14:00   0.32   0.37  -1.68  -1.21   1.73  -0.34   0.51
4880  2021-07-22 08:00  -0.93   0.23   0.69   3.46   0.97   0.74  -1.76
2981  2021-05-04 05:00   1.19  -0.15   0.07  -2.66   0.53   0.80   1.30
2021-03-
1830 1 51 0 65 0 09 0 91 1 60 0 12 0 28
What are the features among the different process parameters which help the model predict the quality of the product?
I have moved this to section 4.3, as the method introduced is only applicable to tree based models and is not considered to be very critical in
terms of model performance evaluation
Overfitting occurs when the model’s accuracy on the training set is very high but its accuracy on the test set is low.
If overfitting does occur, then we need to further reduce the complexity of the model, for example by using a dimensionality reduction method (advanced).
Inspection of Underfitting
Underfitting occurs when the model’s accuracy on both the training and test datasets is very low.
If underfitting does occur, then we can either try another ML model or increase the complexity of the current model.
In the basic approach, this is how we split our training and testing dataset:
In the cross validation approach, the cross_val_score function will help you segment your original dataset into a number of
portions specified by the user.
For example:
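A minimal sketch of 10-fold cross validation for a decision tree, on synthetic data standing in for the water tank features and labels (the notebook's own cell is not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (stand-in for the real X and y)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

dtc = DecisionTreeClassifier(random_state=0)

# cv=10 splits the data into 10 segments and returns one accuracy score per segment
scores = cross_val_score(dtc, X, y, cv=10)
print(scores)
```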
In the above example, the original dataset is segmented into 10 portions, 9 for training and 1 for testing. The K value determines the number of
segments to be divided, usually no more than 10, and in every iteration exactly 1 segment is held out as the test dataset.
After the split, the cross validation command trains the model on the training segments and evaluates it on the held-out test segment, then rotates the
held-out segment so that every segment is used for testing once. This small batch evaluation reduces errors. Therefore, if you decide to split the dataset
into 10 segments as shown above, you should expect 10 iterations of training and inspection. (Warning: this can sometimes be slow)
Therefore only perform cross validation on models which you think have a higher possibility of overfitting (e.g. KNN (maybe), DTC
and RFC (definitely))
The above shows the 10 accuracy scores after cross validation for DTC. You can compare these to the training set accuracy score achieved in
section 4.1.
However, to make life easier, you can simply find the mean score and compare.
score_mean = scores.mean()
score_mean
0.6866
# For better visualisation, we can use a barplot to compare between the accuracy score of training set and cross validation set for clar
fig = plt.figure()
ax = fig.add_axes([0,0,0.5,1])
labels = ['Train_Accuracy', 'Cross_Val']
accuracies = [Train_acc,score_mean]
ax.bar(labels,accuracies)
plt.show()
Insights
It is obvious that the cross validation results confirm that overfitting occurs in the decision tree model. Hence, our next action may include
retuning the model or proceeding with dimensionality reduction methods.
Importance
Out_Temp 16.301445
In_Pressure 11.971816
Col_density 10.845535
Rot_Speed 8.824287
Tank_% 8.426211
In_Flow 8.078741
Humidity 8.013428
Oxygen 3.585178
In_Temp 3.568850
Vibration 3.471560
Valve_Opening 3.444680
Viscosity 3.391949
In_Angle 3.380412
Out_Speed 3.361289
Thickness 3.334621
Insights:
From the inspection above, we can see that the features "Out_Temp", "In_Pressure" and "Col_density" play the most active roles in the predictions.
Importance
Out_Temp 19.642729
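An importance table like the ones above can be produced from a fitted tree-based model through its `feature_importances_` attribute. The sketch below uses synthetic data with four borrowed column names; scaling by 100 to get percentages is an assumption based on the values in the table summing to about 100.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the column names are borrowed from the dataset for readability only
feature_names = ["In_Flow", "In_Temp", "Out_Temp", "In_Pressure"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; multiply by 100 to express it in percent
importance = pd.DataFrame(
    {"Importance": rfc.feature_importances_ * 100}, index=feature_names
).sort_values("Importance", ascending=False)

print(importance)
```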