
5/26/24, 7:43 PM Introduction_to_Data_Analytics_Master_Guide.ipynb - Colab

MME 3153 IA Data Analytics Master Guide (2024)


This notebook serves as today's lecture and master guide for your assignment (Updated 21 May 2024)

Summary of changes: Changed some deprecated syntax.

What is Data Science

Data Science is a blend of various tools, algorithms, and machine learning principles to uncover insights from the hidden patterns of a raw
dataset.

Data Analytics is the approach used to realise these goals.

Four Common Applications of Data Analytics

1. Predictive Causal Analytics.
2. Prescriptive Analytics.
3. Making Predictions.
4. Pattern Discovery.

Steps to Perform Data Analysis:

1. Load Dataset
2. Inspect Dataset (Feature Engineering)
3. Clean Dataset (Feature Engineering)
4. Split training and testing dataset for modelling (Machine Learning)
5. Evaluate model performance (Evaluation)
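The five steps above can be sketched end-to-end in a few lines. This is a minimal sketch with a toy inline dataset; the column names, the classifier choice, and the 80/20 split are illustrative assumptions, not the lecture's actual assignment data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load dataset (inline toy data standing in for a CSV file)
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "feature_b": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4],
    "target":    ["good", "bad", "good", "bad", "good", "bad", "good", "bad"],
})

# 2. Inspect dataset
df.info()

# 3. Clean dataset (here: just drop exact duplicates)
df = df.drop_duplicates()

# 4. Split into training and testing sets
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5. Train a model and evaluate its performance
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", acc)
```

Each step is expanded in its own section below; this cell only shows how they chain together.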

Mission for Lecture 3

1. Exploratory Data Analysis (EDA)
2. Prepare Data for Machine Learning (ML)
3. Techniques for Model Evaluation

Mission for Lecture 4

1. Discuss Common Machine Learning Models
2. Parameter Tuning
3. Examine One or Two Performance Metrics via Evaluation
4. Detection of Over/Under Fitting
5. Assignment Q&A


Step 1. Exploratory Data Analysis


What is Exploratory Data Analysis (EDA)?

How can you ensure that your dataset is ready for machine learning (ML)?
How to choose the most suitable algorithms for your dataset?
How to define the feature variables that can potentially be used for machine learning?

Exploratory Data Analysis (EDA) can help to answer all these questions and ensures the best outcomes for the project. It is an approach for
summarizing, visualizing, and becoming intimately familiar with the important characteristics of a data set.

The typical steps to reach these goals are as follows:

1. Import the data

2. Get a feel of the data: describe it and look at a sample, such as the first and last rows

3. Take a deeper look into the data by querying or indexing the data

4. Identify features of interest

5. Recognise the challenges posed by data - missing values, outliers

6. Discover patterns in the data

One of the important things about EDA is Data profiling.

Data profiling

Summarizing your dataset through descriptive statistics.


Done through a variety of measurements to better understand your dataset.
Goal is to have a solid understanding of your data so you can afterwards start querying and visualizing your data in various ways.
Depending on the result of the data profiling, you might decide to correct, discard or handle your data differently.
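In pandas, the standard starting point for data profiling is `describe()`. Below is a minimal sketch on a toy frame, since the lecture's `data.csv` is not reproduced here; the column names mimic the lecture dataset but the values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "In_Temp": [0.27, -0.52, -0.36, 0.56, 1.78],
    "Quality_type": ["Premium", "Approved", "Rejected", "Premium", "Approved"],
})

# Numeric profile: count, mean, std, min, quartiles, max per numeric column
print(df.describe())

# Categorical profile: count, number of unique values, most frequent value
print(df.describe(include="object"))
```

Depending on what this reveals (a suspicious max, a near-constant column), you can then decide to correct, discard or handle columns differently.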

Key Concepts of Exploratory Data Analysis


2 types of Data Analysis

Confirmatory Data Analysis

Exploratory Data Analysis

4 Objectives of EDA

Discover Patterns

Spot Anomalies

Frame Hypothesis

Check Assumptions

2 methods for exploration

Univariate Analysis

Bivariate Analysis

Stuff done during EDA

Trends

Distribution

Mean

Median

Outlier

Spread measurement (SD)

Correlations

Hypothesis testing

Visual Exploration

Overview of Dataset


In this demonstration, I will use a dataset comprising the measured characteristics of 5000 pressurised water tanks during inspection. My
objective is to build a predictor via a machine learning model to predict the quality grade of pressurised water tanks using this dataset.


Description
In this dataset, there are 17 measured features namely:

Numerical Features

In_Flow: Flowrate through Tank Inlet.


In_Angle: Angle of the Tank Inlet.
In_Temp: Temperature of the Tank Inlet.
Out_Temp: Temperature of the Tank Outlet.
In_Pressure: Pressure in the Tank Inlet.
Viscosity: Viscosity of Test Fluid.
Humidity: Humidity of Tank.
Rot_Speed: Rotor speed of pipe threading machine.
Valve Opening: Position of the valve plug to its closed position against the valve seat.
Tank_%: Tank Size in terms of Percentage.
Out_speed: Flowrate of the liquid exiting the Tank Outlet.
Thickness: Tank wall thickness.
Vibration: Vibration of the tank during water filling.
Col_Density: Column density to represent water level measured.
Oxygen: Oxygen level present in the pressurised tank.

Categorical Features

DateTime: Date and Time of measurement


Quality Type: Grade of the Water Tank

General Outline in this EDA Demonstration

Our code template shall perform the following steps:

1. Load data
2. Check total number of entries and column types
3. Check any null values
4. Check duplicate entries
5. Univariate Analysis
6. Bi-Variate Analysis
7. Multi-Variate Analysis (FYI only)
8. Summary of EDA

0. Load Common Libraries


import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
%matplotlib inline

1. Load Data


Load and store data using pandas for better manipulation.

In google colab, the files must be uploaded to the session's temporary storage using the file icon tab on your left of the python notebook.

df = pd.read_csv("/content/data.csv") #Take note, your file path might be different according to your naming.

2. Check Data Characteristics


We need to diagnose and clean the data before training models on our dataset with machine learning algorithms.

But before we do any cleaning, let's load the data and get an initial feel of what the data looks like.

We will use head, tail, columns, shape and info methods to diagnose data.


df.head() # head shows first 5 rows

           DateTime  In_Flow  In_Angle  In_Temp  Out_Temp  In_Pressure  Viscosity  Humidity  ...
0  2021-01-01 00:00     1.41      0.40     0.27     -0.31         1.75      -0.93      0.09
1  2021-01-01 01:00    -1.41     -0.02    -0.52     -1.83        -3.13      -0.52     -2.04
2  2021-01-01 02:00     0.29      0.67    -0.36     -0.01        -1.93       0.33     -0.51
(output truncated)

# columns gives column names of features


df.columns

Index(['DateTime', 'In_Flow', 'In_Angle', 'In_Temp', 'Out_Temp', 'In_Pressure',


'Viscosity', 'Humidity', 'Rot_Speed', 'Valve_Opening', 'Tank_%',
'Out_Speed', 'Thickness', 'Vibration', 'Col_density', 'Oxygen',
'Quality_type'],
dtype='object')

# We can extract the numeric and categorical features in


# separate data frames for easier manipulation and visualisation.
numeric = df[['In_Flow', 'In_Angle', 'In_Temp', 'Out_Temp', 'In_Pressure',
'Viscosity', 'Humidity', 'Rot_Speed', 'Valve_Opening', 'Tank_%',
'Out_Speed', 'Thickness', 'Vibration', 'Col_density', 'Oxygen'
]]

categorical=df[[
'Quality_type'
]]

# To obtain a list of the column headers:


list(df)

['DateTime',
'In_Flow',
'In_Angle',
'In_Temp',
'Out_Temp',
'In_Pressure',
'Viscosity',
'Humidity',
'Rot_Speed',
'Valve_Opening',
'Tank_%',
'Out_Speed',
'Thickness',
'Vibration',
'Col_density',
'Oxygen',
'Quality_type']

# tail shows last 5 rows


df.tail()

              DateTime  In_Flow  In_Angle  In_Temp  Out_Temp  In_Pressure  Viscosity  Humidity
4995  2021-07-27 03:00     0.22     -0.42     0.56     -0.97         1.40       1.48     -0.74
4996  2021-07-27 04:00     0.87     -1.54     1.78     -0.64         1.04       0.66      1.97
4997  2021-07-27 05:00     0.17     -0.75     0.46      0.33         1.05       0.99      0.70
(output truncated)

# If you just need to use the data in a single column:


df['Rot_Speed'][:10]

0 1.85
1 -1.16
2 0.72
3 1.16
4 1.31
5 0.51
6 -4.79
7 -0.14
8 0.17

9 2.59
Name: Rot_Speed, dtype: float64

# shape gives the number of rows and columns in a tuple


df.shape

(5000, 17)

# We can also try some simple visualisation. Surveying the data labels (column heading names) in the given dataset, let's inspect the amount of entries per quality type.
df['Quality_type'].value_counts() # I chose this as a demo since quality type is the only categorical data

Quality_type
Premium 1698
Approved 1678
Rejected 1624
Name: count, dtype: int64

# You can also plot it as a graph of your choice for better visualisation
plt.figure(figsize=(10,10)) # set the figure size before plotting, otherwise an extra empty figure is created
barlist = df['Quality_type'].value_counts().plot(kind='bar', rot=0) # rot is rotation of text
plt.show()

Insights

Our total measured data stands at 5000. From the initial inspection, about 34% (1698), 33%(1678) and 32%(1624) of the tanks were graded as
"Premium", "Approved" and "Rejected" quality type.

The variance between the quality-type counts is considerably small.

Note The reason I mention variance here is that if your data does not exhibit any variance at all, then the dataset looks problematic from the
outset and machine learning will not be possible!

# Creates a correlation matrix to identify basic relationships between variables

df.corr(numeric_only=True)

# Please take note: in recent pandas versions the .corr() function's syntax has changed, and if, as in our case, we restrict the correlation study to
# numerical data, then we need to include numeric_only=True in the arguments like above.


In_Flow In_Angle In_Temp Out_Temp In_Pressure Viscosity Humid

In_Flow 1.000000 -0.003215 -0.013636 0.052537 0.225834 -0.005518 0.5775

In_Angle -0.003215 1.000000 -0.021197 -0.011058 0.020030 0.011512 0.012

In_Temp -0.013636 -0.021197 1.000000 0.017025 -0.005671 -0.004853 -0.0184

Out_Temp 0.052537 -0.011058 0.017025 1.000000 0.335845 0.015195 0.0520

In_Pressure 0.225834 0.020030 -0.005671 0.335845 1.000000 0.014567 0.3594

Viscosity -0.005518 0.011512 -0.004853 0.015195 0.014567 1.000000 -0.005

Humidity 0.577518 0.012110 -0.018469 0.052069 0.359416 -0.005198 1.0000

Rot_Speed 0.563143 -0.020147 0.008818 0.595381 0.199894 0.009250 -0.1446

Valve_Opening -0.020819 -0.013598 -0.020885 -0.012813 -0.009023 -0.011055 -0.0190

Tank_% -0.107559 0.004809 -0.012042 0.019571 0.033507 -0.002515 0.7090

Out_Speed 0.005053 0.004233 -0.002168 -0.014401 -0.002565 -0.008765 0.0066

Thickness -0.022900 0.015907 0.012368 -0.022045 0.009006 -0.007998 -0.0204

Vibration -0.006062 -0.023547 -0.002501 -0.010388 0.014644 -0.007153 0.0058

Col_density -0.038673 0.021942 0.013037 -0.034745 -0.352036 -0.016358 -0.0026

Oxygen -0.000930 -0.003048 -0.000689 -0.015645 0.009033 -0.007108 0.0344

Insights

From the basic correlation table, Humidity and Rot_Speed may have some effect on the quality type, given their high correlations with
the In_Flow feature at 58% and 56% respectively.

2.1 Data Types and Conversion Method


There are 5 basic data types: object (string), boolean, integer, float and categorical.

We can convert between data types, e.g. from str to categorical or from int to float.

Why is the category type important:

makes the dataframe smaller in memory

can be utilized for analysis, especially with sklearn (we will learn this later)
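The memory saving is easy to verify. A small sketch, where the repeated strings stand in for a low-cardinality column like Quality_type:

```python
import pandas as pd

# A low-cardinality string column repeated many times, like Quality_type
s = pd.Series(["Premium", "Approved", "Rejected"] * 10_000)

bytes_as_object = s.memory_usage(deep=True)                       # every string stored separately
bytes_as_category = s.astype("category").memory_usage(deep=True)  # small integer codes + 3 labels

print(bytes_as_object, "->", bytes_as_category)
```

Under the hood, the category dtype stores one copy of each distinct label plus an integer code per row, which is why the saving grows with repetition.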

# info gives the object type (DataFrame), number of samples (rows), number of features (columns), feature dtypes and memory usage
df.info()

# If you want to convert from one data type to another you can follow the examples below:
# let's convert object (str) to categorical and float to int.

#df['In_Flow'] = df['In_Flow'].astype('int') --> converting from float to int


#df['Quality_type'] = df['Quality_type'].astype('category') --> converting from string to category

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 5000 non-null object
1 In_Flow 5000 non-null float64
2 In_Angle 5000 non-null float64
3 In_Temp 5000 non-null float64
4 Out_Temp 5000 non-null float64
5 In_Pressure 5000 non-null float64
6 Viscosity 5000 non-null float64
7 Humidity 5000 non-null float64
8 Rot_Speed 5000 non-null float64
9 Valve_Opening 5000 non-null float64
10 Tank_% 5000 non-null float64
11 Out_Speed 5000 non-null float64
12 Thickness 5000 non-null float64
13 Vibration 5000 non-null float64
14 Col_density 5000 non-null float64
15 Oxygen 5000 non-null float64
16 Quality_type 5000 non-null object
dtypes: float64(15), object(2)
memory usage: 664.2+ KB

Consideration for Data Type Conversion

For columns that are categorical, i.e., columns that take on a limited, and usually fixed, number of possible values, we set their type as
"category". E.g., gender, blood type, number of days per week and country are all categorical data. Our dataset does not have such data.

For columns that are numeric, we can either set their type as “int64” (integer) or “float64” (floating point number). E.g., sales, temperature and
number of people are all numeric data.

# set categorical data - "Quality_type"


df['Quality_type'] = df['Quality_type'].astype('category')
df.info()

# Converting to category datatype, especially in this case of string, will


# allow us to save memory space for faster processing.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 5000 non-null object
1 In_Flow 5000 non-null float64
2 In_Angle 5000 non-null float64
3 In_Temp 5000 non-null float64
4 Out_Temp 5000 non-null float64
5 In_Pressure 5000 non-null float64
6 Viscosity 5000 non-null float64
7 Humidity 5000 non-null float64
8 Rot_Speed 5000 non-null float64
9 Valve_Opening 5000 non-null float64
10 Tank_% 5000 non-null float64
11 Out_Speed 5000 non-null float64
12 Thickness 5000 non-null float64
13 Vibration 5000 non-null float64
14 Col_density 5000 non-null float64
15 Oxygen 5000 non-null float64
16 Quality_type 5000 non-null category
dtypes: category(1), float64(15), object(1)
memory usage: 630.1+ KB

3. Check for Missing Data


Common unclean data:

Column name inconsistency, like upper/lower-case letters or spaces between words
duplicated data
missing data

If we encounter missing data, what we can do:

leave it as is
drop it with dropna()
fill missing values with fillna()
fill missing values with a test statistic like the mean
Assert statements: checks that you can turn on or off once you are done testing the program
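Each of those options is a one-liner in pandas. A small sketch on a toy column (the `In_Flow` name and values here are illustrative):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"In_Flow": [1.4, np.nan, 0.3, np.nan, 2.1]})

# Option: drop rows containing NaN
dropped = demo.dropna()

# Option: fill with a constant placeholder
zero_filled = demo["In_Flow"].fillna(0)

# Option: fill with a test statistic such as the mean
mean_filled = demo["In_Flow"].fillna(demo["In_Flow"].mean())

# Assert statement: a check you keep on during testing and can remove later
assert mean_filled.notna().all()

print(dropped.shape, zero_filled.tolist(), mean_filled.tolist())
```

Which option is right depends on why the data is missing; filling with a statistic keeps the row count but can distort the distribution.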

# generate preview of entries with null values


if df.isnull().any(axis=None):
print("\nPreview of data with null values")
print(df[df.isnull().any(axis=1)].head(3))
missingno.matrix(df)
plt.show()

# No output is returned as my dataset has no missing values

# My initial dataset has no missing values.


# For demonstration purposes, let me delete some data from the column "Quality_type"

# Lets load demo dataset !!


demodf = pd.read_csv("/content/MD_Example.csv") # I use demodf to prevent confusion
print(demodf.shape)

(5000, 3)

# generate preview of entries with null values
if demodf.isnull().any(axis=None):
print("\nPreview of data with null values")
print(demodf[demodf.isnull().any(axis=1)].head(3))
missingno.matrix(demodf)
plt.show()
# Now, from the demo output, you can see that missing values occur in the quality_type column

Preview of data with null values


DateTime Quality_type In_Flow
8 1/1/2021 8:00 NaN 1.04
33 1/2/2021 9:00 NaN 0.05
46 1/2/2021 22:00 NaN 1.13

Decision Factor for Missing Data

Should there be any missing entries, you should decide on the cleaning steps required before proceeding further with the other parts of EDA.

One common way of dealing with missing data is to fill it with a placeholder such as 'empty', as shown in the following example.

If a column has too many missing values, consider dropping the whole column instead, if you deem it not too important.

df2 = df.drop(['In_Flow'],axis=1)

axis = 0 is for row control

axis = 1 is for column control

Example drop the column "In_Flow" and transfer all remaining features to new dataframe named df2

# Let's assume that I don't know Quality_type has missing values
# Let's check
demodf["Quality_type"].value_counts(dropna =False)
#demodf["Quality_type"].value_counts(dropna =True) # This variant would exclude NaN from the counts

# As you can see, there are 9 NaN values

Quality_type
Premium 1696
Approved 1676
Rejected 1619
NaN 9
Name: count, dtype: int64

# We can choose to fill those NaN with "empty"


demodf["Quality_type"] = demodf["Quality_type"].fillna('empty') # assignment instead of inplace=True, which is deprecated on a column
demodf["Quality_type"].value_counts()

# If it is numerical data, for simplicity and foundation purposes, you can fill
# it with '0'

Quality_type
Premium 1696
Approved 1676
Rejected 1619
empty 9
Name: count, dtype: int64

4. Check for Duplicated Data


# To check duplication within one column, you can use the following code
df["Quality_type"].duplicated()

0 False
1 True
2 False
3 True
4 True
...
4995 True
4996 True
4997 True
4998 True
4999 True
Name: Quality_type, Length: 5000, dtype: bool

# Let us explore our original dataset

# generate count statistics of duplicate entries


if len(df[df.duplicated()]) > 0:
print("Number of duplicated entries: ", len(df[df.duplicated()]))
print(df[df.duplicated(keep=False)].sort_values(by=list(df.columns)).head())
else:
print("No duplicated entries found")

No duplicated entries found

In our case, there are also no duplicated entries, which the code will point out directly by printing the output “No duplicated entries found”.

In the event of duplicated entries, the output will show the number of duplicated entries and a preview of the duplicated entries.

# Let us explore our modified dataset

# generate count statistics of duplicate entries


if len(demodf[demodf.duplicated()]) > 0:
#print("No. of duplicated entries: ", len(demodf[demodf.duplicated()]))
print(demodf[demodf.duplicated(keep=False)].sort_values(by=list(demodf.columns)).head())
else:
print("No duplicated entries found")

DateTime Quality_type In_Flow


0 1/1/2021 0:00 Premium 1.41
1 1/1/2021 0:00 Premium 1.41

Decision Factor for Duplicated Data

Should there be any duplicated entries, you should decide on the cleaning steps required before proceeding further to the other parts of
EDA.

One common way of dealing with duplicated entries is to drop them using the code below:

df.drop_duplicates(inplace=True)

NOTE: We will revert to our original dataset df from this point

# The target is the output that the input variables map to.

# Since Quality_type is our target, it may be worthwhile to further split the data by the target's
# categories, namely Premium, Approved and Rejected, for later EDA

Premium_type = df[df["Quality_type"]=='Premium']
Approved_type = df[df["Quality_type"]=='Approved']
Rejected_type = df[df["Quality_type"]=='Rejected']

5. Univariate Analysis


The simplest form of statistical analysis in the EDA process is univariate analysis. Only one variable is involved in this analysis. In fact, the main
purpose of the analysis is to describe and understand the population distribution, detect outliers, summarize, and find patterns for a single feature.
Note that this type of analysis differs for numeric versus categorical features in terms of the characteristics of interest of the data values.

5.1 Histogram


Histograms can be used to show the distribution of variables of interest. You can even plot the distributions for multiple variables concurrently.
This will give you a holistic view of all variables for better understanding.

Below shows all numeric feature distributions using the histogram (barplot).

The x-axis represents the given values and the y-axis represents their frequencies.

df.hist(figsize=[15,15])
plt.suptitle("Histogram Demonstration")
plt.show()


Insights

Observing the histograms for all input features, it seems like our raw data are quite centrally distributed.

Good sign as this is a first feel that our data did not possess any outliers.

5.2 Outliers Inspection (Boxplot)


The definition of outliers is tricky, as there is no single, generally recognized formal definition of an outlier. Roughly, it refers to values that are
outside the areas of a distribution that would commonly occur.

In addition, another common definition considers any point away from the mean by more than a fixed number of standard deviations to be an
"outlier". In other words, we can consider data values corresponding to areas of the population with low density or probability as suspected
outliers.
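The "fixed number of standard deviations" rule translates directly into code. A sketch, where the 2-standard-deviation threshold is an arbitrary choice (3 is also common):

```python
import pandas as pd

values = pd.Series([0.1, -0.2, 0.3, 0.0, -0.1, 8.5])  # 8.5 is an obvious outlier

# Standardize: how many sample standard deviations is each point from the mean?
z_scores = (values - values.mean()) / values.std()

# Flag points more than 2 standard deviations from the mean
outliers = values[z_scores.abs() > 2]
print(outliers)
```

Note that a single extreme point inflates both the mean and the standard deviation, which is one reason boxplots (built on quartiles) are often preferred for outlier inspection.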

Boxplot is very good at presenting statistical information such as outliers. The plot consists of a rectangular box bounded above and below by
“hinges” that represent 75% and 25% quantiles respectively.

We can view the “median” as the horizontal line through the box. You can also see the upper and lower “whiskers”. The vertical axis is in the
units of the quantitative variable.

Data points above the upper whiskers and far away are the suspected “outliers”.

For demonstration, below shows an example of plotting boxplot for three variables "In_Flow", "In_Angle" and "In_Temp"

sns.boxplot(x=df["Quality_type"],y=df['In_Flow'],data=df)
plt.show()
sns.boxplot(x=df["Quality_type"],y=df['In_Angle'],data=df)
plt.show()
sns.boxplot(x=df["Quality_type"],y=df['In_Temp'],data=df)
plt.show()

Insights

In_Temp in general does not contain many outliers, probably just two or three points under the "Rejected" quality type. This is negligible for the
whole analysis.

In_Angle however shows two distinctive outliers under "Premium" at points -3 and -4.5. Given their small number, I believe they can also be neglected
for now.

In_Flow shows a small cluster of outliers in the range below -3 for both "Approved" and "Rejected". Again, as the numbers are rather small,
we can choose to neglect them for now.

5.3 Probability Density Function (PDF)


Visualizing with a PDF allows us to inspect a single feature, for example plotting the PDF to inspect the relationship between "In_Flow" and
"Quality_type". Below, we plot the PDF graph for "In_Flow". The x-axis represents the value ranges while the y-axis represents the percentage
of data points for each target value.

# Note: sns.distplot is deprecated (removed in seaborn v0.14); sns.histplot / sns.displot are the modern replacements
sns.FacetGrid(df,hue="Quality_type",height = 5).map(sns.distplot,"In_Flow").add_legend()
plt.show()

/usr/local/lib/python3.10/dist-packages/seaborn/axisgrid.py:854: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

func(*plot_args, **plot_kwargs)
(the same warning is emitted three times, once per hue level)

Insights

It is highly likely that the tank's inlet flow rate affects the quality type a lot, in particular the "Approved" grade, especially at higher
flow rates averaging around 2.

This is supported by the heavy overlap observed when In_Flow is between -2 and 4.

6. Bi-Variate Analysis


Bivariate analysis is another step in our EDA process, where the analysis takes place between two variables (features). Its purpose is to explore
the relationship between two variables: their association, its strength, and whether there are differences and how significant they are.


6.1 Scatterplot


A scatter plot can be a very useful representation for visualizing the relationship between two numerical variables. In fact, it is most beneficial to
plot it before fitting a regression model, to inspect any potential linear correlation. The resulting pattern indicates the type (linear or non-linear)
and strength of the relationship between the two variables. We can add more information to the 2D scatter plot; for example, we may label points
by the In_Flow and In_Temp values that affect the quality type (target). Below is the scatter plot comparing In_Flow with
In_Temp.

sns.set_style("whitegrid")
sns.FacetGrid(df, hue = "Quality_type" , height = 6).map(plt.scatter,"In_Flow","In_Temp").add_legend()
plt.show()

Insights

From the scatterplot, it is quite evident that both In_Temp and In_Flow have a significant influence on the "Rejected" quality type, with the distinctive
cluster in green.

In addition, the scatterplot also shows that the quality type, regardless of its grade, is mainly affected when the inlet temperature and flow are in the
-2 to 2 region.

Moreover, we can plot pairs of variables for a better understanding of variable associations. Below, we plot the pair association for the In_Flow
rate and the In_Temp value to see their relationship to the quality type.

You can think of this as a combination of scatterplot and PDF.

selected_numeric_features = df[["Quality_type","In_Flow","In_Temp"]]
sns.set_style("whitegrid")
sns.pairplot(selected_numeric_features, hue = "Quality_type", height = 5)
plt.show()


6.2 Correlation matrix


A correlation matrix is a square matrix with the same variables shown in the rows and columns. The level of correlation between the variables is
highlighted with different colour intensities. The numeric values for the correlation range from -1 (strongly negatively correlated) through 0 (not
correlated) to 1 (strongly positively correlated). Among the use cases of a correlation matrix is summarizing data as input to more advanced analysis.

As a result, some key decisions can be made when creating a correlation matrix. One is the choice of relevant correlation statistics and the
coding of the variables. Another is the treatment of missing data and dropping highly correlated variables from feature sets.
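Dropping highly correlated variables first requires finding them. A sketch that walks the upper triangle of the correlation matrix; the 0.9 threshold and the toy columns are illustrative assumptions:

```python
import pandas as pd

numeric = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.1, 4.0, 6.2, 7.9],   # nearly proportional to a
    "c": [0.5, -0.3, 0.8, 0.1],
})

corr = numeric.corr().abs()
threshold = 0.9  # an arbitrary cut-off; tune per project

# Visit each unordered pair once via the upper triangle
high_pairs = [
    (corr.columns[i], corr.columns[j], round(corr.iloc[i, j], 3))
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)  # candidates: drop one variable of each pair
```

On the tank dataset, the same loop over `numeric.corr()` would surface pairs like Humidity/In_Flow for the same decision.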

Below we plot a correlation heatmap for all the numerical features present in the original dataset.

corr = numeric.corr() # recall above, "numeric" is the variable name for all numerical feature
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(240, 10, n=9)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.4, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .8}) # feel free to adjust vmax for the scale
plt.show()


Insights

As per my hunch, high correlations exist for Humidity and Rot_Speed, and also In_Pressure, with regard to the tank quality. In contrast,
Col_density stands out with a distinctly negative correlation against In_Pressure.

7. Multivariate Analysis (FYI)

In the process of EDA, sometimes the inspection of single variables or pairs of variables won't suffice to rule out certain hypotheses (or outliers and
anomalous cases) from your dataset. That's why multivariate analysis comes into play. This type of analysis generally shows the relationship
between two or more variables using statistical techniques. Hence the need to consider more variables at a time
during analysis to reveal more insight into your data.
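One simple multivariate view is a group-wise profile of every numeric feature per target class. A sketch with toy values; the column names mirror the lecture dataset but the numbers are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Quality_type": ["Premium", "Premium", "Approved", "Approved", "Rejected", "Rejected"],
    "In_Flow": [1.4, 1.2, 0.3, 0.5, -1.1, -0.9],
    "In_Temp": [0.2, 0.4, -0.1, 0.1, -0.6, -0.4],
})

# Mean of each numeric feature within each quality grade:
# a compact view of how the classes separate across several variables at once
profile = df.groupby("Quality_type")[["In_Flow", "In_Temp"]].mean()
print(profile)
```

Tools like `sns.pairplot` with a `hue` extend the same idea visually across all feature pairs.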

For your assignment, I expect analysis only up to the bi-variate level.

8. Summary of EDA

It is obvious that it is difficult to truly understand the dataset and draw conclusions without looking through the entire data set. In fact,
we have shown that time spent exploring the dataset is time well invested.

Despite it being tedious in the preliminary step to processing data, all the hard work is necessary. In fact, when you begin to actually see
interesting insights you will appreciate every single minute invested in such a process.

EDA enables data scientists to quickly understand key issues in the data and to guide deeper analysis in the right directions.
Successfully exploring the data assures stakeholders that they won't be missing out on opportunities to leverage their data. They can
easily pinpoint risks including poor data quality, unreliable or poorly engineered features, and other uncertainties.

In this section, we have covered a range of useful introductory exploratory data analysis methods and visualization techniques. These EDA guidelines should give data scientists an insightful, deeper understanding of the problem at hand and the confidence to decide on the next move.


keyboard_arrow_down Step 2. Prepare Data for ML using Pandas and SciKit-Learn


Scikit-learn is an excellent repository of machine learning algorithms that are well-optimized, reliable, easy to work with in Python, and open source. It is built on NumPy's functionality, working with arrays and matrices as the main input. It is also Pandas-compliant, so we can feed pandas DataFrames into scikit-learn functions and it knows how to work with them.

In this 2nd section, we will explore how sklearn can help us split our original dataset into a training dataset for the machine learning algorithms and a testing dataset for model performance evaluation in the final step.

# Import the function to randomly split the dataset into Training and Testing sets
from sklearn.model_selection import train_test_split

# Extracts the Quality_type column to create a y dataframe


y = df["Quality_type"]

# Extract the other columns from the dataset to create an X dataframe


X = df.drop(["DateTime", "Quality_type"], axis=1)

# Split X and y vectors into Training (80%) and Testing (20%)


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, #you can choose Training 70% and Testing 30%, up to you, just adjust
random_state=22) #
# Train > Test
# Random_State (Extracted from SkLearn Documentation)
# Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls.
# However, it may be worthwhile checking that your results are stable across a number of different distinct random seeds.
# Popular integer random seeds are 0 and 42. (I tried 22, you are free to increase or decrease, just don't put 0)
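One optional refinement you may see in the wild (not required for the assignment): passing `stratify=y` to train_test_split keeps the class proportions of the target roughly identical in the train and test splits, which matters when classes are imbalanced. A self-contained sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(22)
X_demo = rng.normal(size=(1000, 3))
# Imbalanced, made-up class labels for illustration
y_demo = rng.choice(["Premium", "Approved", "Rejected"],
                    size=1000, p=[0.2, 0.5, 0.3])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=22, stratify=y_demo)

# The class fractions in train and test now closely match each other
for label in ["Premium", "Approved", "Rejected"]:
    print(label, round((y_tr == label).mean(), 2), round((y_te == label).mean(), 2))
```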

print ("X_train:", X_train.shape)


print ("X_test:", X_test.shape)
print ("y_train:", y_train.shape)
print ("y_test:", y_test.shape)

X_train: (4000, 15)


X_test: (1000, 15)
y_train: (4000,)
y_test: (1000,)

keyboard_arrow_down Step 3. Model Creation and Training


The data is now properly structured into Train and Test sets.

As our target is categorical (Premium, Approved, or Rejected), we choose a “Classifier”. If we were working with a numerical target (i.e. predicting a value), we would opt for a “Regressor”, but the syntax would be similar.

In this section, I have updated the notebook to include each classifier and some parameter-tuning advice.

keyboard_arrow_down 3.0 Linear Regression (For Demo Only)


Note

The dataset used here is a housing-price dataset for demo purposes, as my original dataset on the water tank quality is a classification problem.

from sklearn.linear_model import LinearRegression
# load dataset
dataset = pd.read_csv("/content/house_prices.csv")

# Single Feature Variable - Size of the Houses


size_data = dataset['size']

# Target/Dependent Variable - Price of the Houses


price_data = dataset['price']

# Size and Target are pandas Series. We have to convert them into arrays to be used as the training dataset,
# then use the reshape function to convert shape (n,) to (n, 1) so each sample sits in its own row.

size = np.array(size_data).reshape(-1,1)
price = np.array(price_data).reshape(-1,1)

# Train the Model


model = LinearRegression()
model.fit(size,price)

# Predict Price
price_predicted = model.predict(size)

# Plot the result


plt.scatter(size,price, color="green")
plt.plot(size,price_predicted, color="red")
plt.title("Linear Regression")
plt.xlabel("House Size")
plt.ylabel("House Price")

Text(0, 0.5, 'House Price')

Insights

The red line represents the regression equation and provides the relationship between the housing prices with respect to the house sizes given
in an estate.
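Beyond eyeballing the red line, you can quantify the fit with the learned slope/intercept and the R² score. This is a sketch on synthetic data (since `house_prices.csv` is not reproduced here); the 3000-per-unit slope and 50000 intercept are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(22)
size = rng.uniform(50, 250, size=100).reshape(-1, 1)            # synthetic house sizes
price = 3000 * size + 50_000 + rng.normal(0, 20_000, (100, 1))  # invented price rule + noise

model = LinearRegression().fit(size, price)
pred = model.predict(size)

print("slope:", model.coef_[0][0])        # should recover roughly 3000
print("intercept:", model.intercept_[0])  # roughly 50000
print("R^2:", r2_score(price, pred))      # near 1.0 for this clean relationship
```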

keyboard_arrow_down 3.1 Logistic Regression (LR)


Note

From here onwards is back to my original water tank dataset denoted by "df"

# Import LR Library

from sklearn.linear_model import LogisticRegression

# The syntax for parameters setting in the logistic regression command as follows:

# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


# intercept_scaling=1, l1_ratio=None, max_iter=100,
# multi_class='auto', n_jobs=None, penalty='l2',
# random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
# warm_start=False)

# From my experience, common influencing parameters for the model's performance are
# the type of solver, the regularisation strength C, and the maximum iterations, max_iter.
# Read the documentation for each parameter's definition here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lo

# Train the Model

LR = LogisticRegression(C=1.0, solver='lbfgs', max_iter=100)


LR1 = LogisticRegression(C=0.8, solver='lbfgs', max_iter=100)
LR2 = LogisticRegression(C=0.6, solver='lbfgs', max_iter=100)
# I chose the lbfgs solver, which is the default and handles multinomial loss,
# meaning it can handle our multi-class target using every feature

LR.fit(X_train, y_train)
LR1.fit(X_train, y_train)
LR2.fit(X_train, y_train)

▾ LogisticRegression
LogisticRegression(C=0.6)
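LR, LR1 and LR2 above differ only in the regularisation strength C. A compact way to compare such variants is a loop over C values; the sketch below uses a synthetic 3-class dataset, but with the real data you would substitute X_train/y_train and X_test/y_test:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 15-feature, 3-class water tank problem
X_demo, y_demo = make_classification(n_samples=1000, n_features=15,
                                     n_informative=6, n_classes=3,
                                     random_state=22)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=22)

for C in [1.0, 0.8, 0.6]:
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=200).fit(Xtr, ytr)
    print(f"C={C}: train={clf.score(Xtr, ytr):.3f}  test={clf.score(Xte, yte):.3f}")
```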

keyboard_arrow_down 3.2 Support Vector Machine (SVM)


In support vector machine (SVM), there are support vector regression (SVR) and support vector classification (SVC).

We will be using Support Vector Classification (SVC) since our problem is a classification problem.

# Load library

from sklearn.svm import SVC

# The syntax for parameters setting in SVC command as follows:

# SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,


# decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
# max_iter=-1, probability=False, random_state=None, shrinking=True,
# tol=0.001, verbose=False)

# Read up on the parameters section for SVC here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

# From my experience, common influencing parameters for the model's performance are
# the kernel type and, for the 'poly' kernel, the degree of the polynomial to fit, "degree"

# remember, most supervised machine learning classifiers give you the
# prediction based on the best-fit decision curve

svc = SVC(kernel='rbf', degree=3) # default values used (note: degree only affects the 'poly' kernel, so it is ignored with 'rbf')


svc.fit(X_train, y_train)

▾ SVC
SVC()
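One caveat worth flagging: the RBF kernel is distance-based, so SVC is sensitive to feature scale. If your features have very different ranges, wrapping the model with a StandardScaler in a Pipeline usually helps. A sketch on synthetic data with one deliberately blown-up feature:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     random_state=22)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=22)

raw = SVC(kernel="rbf").fit(Xtr, ytr)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(Xtr, ytr)

print("unscaled test acc:", raw.score(Xte, yte))
print("scaled test acc:  ", scaled.score(Xte, yte))
```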

keyboard_arrow_down 3.3 K-Nearest Neighbour

# load libraries

from sklearn.neighbors import KNeighborsClassifier

# The syntax for parameters setting in KNN command as follows:

# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',


# metric_params=None, n_jobs=None, n_neighbors=3, p=2,
# weights='uniform')

# Read up on the parameters section for KNN here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifi

# From my experience, KNN is one of the simplest models to deploy, but it is plagued
# by the choice of the K (n_neighbors) value. Too small a K tends to overfit (the model chases noise),
# while too large a K over-smooths the decision boundary and can underfit, lowering accuracy

# Train the Model


knn = KNeighborsClassifier(n_neighbors=3) # I chose a K value of 3 to begin with.
# feel free to play around with the k value to see how it affects the accuracy
knn.fit(X_train, y_train)

▾ KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)
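To see the effect of K described above, a small sweep over n_neighbors (on synthetic data here; substitute your real train/test split) makes the trade-off visible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=800, n_features=10,
                                     random_state=22)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=22)

results = {}
for k in [1, 3, 5, 11, 21, 51]:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    results[k] = (knn_k.score(Xtr, ytr), knn_k.score(Xte, yte))
    print(f"k={k:3d}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")

# k=1 always scores 1.0 on the training data (each point is its own
# nearest neighbour), the classic overfitting signature
```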

keyboard_arrow_down 3.4 Decision Tree


# Load Library

from sklearn.tree import DecisionTreeClassifier

# The syntax for parameters setting in Decision Tree (DTC) command as follows:

# DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',


# max_depth=None, max_features=None, max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=2,
# min_weight_fraction_leaf=0.0, presort='deprecated',
# random_state=None, splitter='best')

# Read up on the parameters section for DTC here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.

# From my experience, the tree split criterion and the maximum
# number of leaf nodes play influencing roles in the accuracy

# Setup Decision Tree Model (dtc)


dtc = DecisionTreeClassifier(criterion='gini', max_leaf_nodes=None) # default value used

# this tree is grown with unlimited leaf nodes; you can test by limiting max_leaf_nodes with a number

# Train the Model


dtc.fit(X_train, y_train)

▾ DecisionTreeClassifier
DecisionTreeClassifier()
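As the comments note, an unconstrained tree grows until it can memorise the training set. Capping max_leaf_nodes (or max_depth) is the simplest pruning lever; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=800, n_features=10,
                                     random_state=22)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=22)

# Unconstrained tree vs. a tree capped at 20 leaves
full = DecisionTreeClassifier(random_state=22).fit(Xtr, ytr)
pruned = DecisionTreeClassifier(max_leaf_nodes=20, random_state=22).fit(Xtr, ytr)

print("full   train/test:", full.score(Xtr, ytr), full.score(Xte, yte))
print("pruned train/test:", pruned.score(Xtr, ytr), pruned.score(Xte, yte))
```

The full tree typically hits 100% training accuracy while the capped tree trades a little training accuracy for better generalisation.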

keyboard_arrow_down 3.5 Random Forest



For your assignment, you can try random forest, but take note that the nodes required for computing may be high for your dataset and the
search for solution may take very long.

# Import Library

from sklearn.ensemble import RandomForestClassifier

# The syntax for parameters setting in Random Forest (RFC) command as follows:

#RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,


# criterion='gini', max_depth=None, max_features='auto',
# max_leaf_nodes=None, max_samples=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=2,
# min_weight_fraction_leaf=0.0, n_estimators=100,
# n_jobs=None, oob_score=False, random_state=None,
# verbose=0, warm_start=False)

# Read up on the parameters section for RFC here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassif

# From my experience, since RFC is an extension of DTC, it shares similar concerns in terms of influencing parameters.

# Setup RFC Model

rfc = RandomForestClassifier(criterion='gini', max_leaf_nodes=None) # these are the default values

# Launch Random Forest training


rfc.fit(X_train, y_train)

▾ RandomForestClassifier
RandomForestClassifier()

keyboard_arrow_down Step 4. Basic Model Evaluation


keyboard_arrow_down 4.1 Accuracy Appraisal
Let's attempt to evaluate the model using simple approaches first via accuracy calculation!!!

# Evaluates the performance on the Training Set

#Compare Score
print ("Accuracy Score of Training Set:")

# Print out all models' accuracy score for training set


print("LR:",LR.score(X_train, y_train))
print("SVC:",svc.score(X_train, y_train))
print("KNN:",knn.score(X_train, y_train))
print("DTC:",dtc.score(X_train, y_train))
print("RFC:",rfc.score(X_train, y_train))

Accuracy Score of Training Set:


LR: 0.61525
SVC: 0.81725
KNN: 0.8445
DTC: 1.0
RFC: 1.0

Insights

We notice that the model reaches a 100% accuracy on the training dataset for decision tree and random forest. This could be good news but we
are probably facing an “overfitting” issue, meaning that the model performs perfectly on training data by learning predictions “by heart” and it
will probably fail to reach the same level of accuracy on unseen data.

In other words, we are getting 100% accuracy because the training data is being used for testing. During training, the decision tree and random forest memorised that data, so if you give them the same data to predict, they return exactly the same values. That is why decision tree and random forest produce correct results every time on the training set.

# Evaluates the performance on the Test Set

#Compare Score
print ("Accuracy Score of Testing Set:")

# Print out all models' accuracy scores for the testing set


print("LR:",LR.score(X_test, y_test))
print("SVC:",svc.score(X_test, y_test))
print("KNN:",knn.score(X_test, y_test))
print("DTC:",dtc.score(X_test, y_test))
print("RFC:",rfc.score(X_test, y_test))

Accuracy Score of Testing Set:


LR: 0.614
SVC: 0.809
KNN: 0.757
DTC: 0.705
RFC: 0.816

Insights

The dip of roughly 10% to 30% in accuracy on the test data for KNN, DTC and RFC suggests "overfitting". Note that even though the test data is carved out of the same original dataset (so its value ranges lie within the same region as the training data), these models still generalise noticeably worse to it.

In other words, if the model performs better on the training set than on the test set, the model is likely overfitting.

The phenomenon is most prominent when you compare the training and testing accuracies of KNN, DTC and RFC.

Note

Also, it is prominent that LR and the decision tree have the lowest test accuracy scores. Of course, you will need to explore the parameters to tune them and test for their accuracy improvements.

keyboard_arrow_down Guideline to Identifying Overfitting and Underfitting


Train dataset accuracy > test dataset accuracy by more than about 20% --> Overfitting

Train dataset accuracy approximately = test dataset accuracy, and both are below about 70% (based on my personal guiding principle) --> Underfitting
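The rule of thumb above can be wrapped in a tiny helper. The 20% gap and 70% floor are the lecturer's heuristics, not universal constants, so treat the thresholds as tunable:

```python
def fit_diagnosis(train_acc, test_acc, gap=0.20, floor=0.70):
    """Rough over/underfitting label based on the guideline above."""
    if train_acc - test_acc > gap:
        return "overfitting"
    if train_acc < floor and test_acc < floor:
        return "underfitting"
    return "ok"

# Scores taken from section 4.1 above
print(fit_diagnosis(1.0, 0.705))    # DTC -> overfitting
print(fit_diagnosis(0.615, 0.614))  # LR  -> underfitting
print(fit_diagnosis(0.817, 0.809))  # SVC -> ok
```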

# Let's compare predictions with actual values


df["Quality_prediction"]=rfc.predict(X)
df.sample(15, random_state=22)

# look at last two column "Quality_type" vs "Quality_prediction"

# You can use this command to check the prediction vs actual in a dataframe format, its optional and up to your preference.

[Output: a 15-row sample of df showing DateTime and the process features (In_Flow, In_Angle, In_Temp, Out_Temp, In_Pressure, Viscosity, Humidity, ...), truncated in this export; the last two columns, "Quality_type" and "Quality_prediction", are the ones to compare.]


Despite things looking OK so far, there is a question to think about:

Which features among the different process parameters help the model predict the quality of the product?

I have moved this discussion to Section 4.3, as the method introduced applies only to tree-based models and is not considered very critical for model performance evaluation.

keyboard_arrow_down 4.2 Overfitting / Underfitting Inspection (Cross Validation Approach)


In this newly added section for Lecture 4, I would like to show you a simple approach to further verify issues of overfitting/underfitting.

Inspection of Overfitting (More Common)

When the model's accuracy on the training set is very high but its accuracy on the test set is low.

We can further inspect using a cross-validation (CV) scheme.

If overfitting does occur, we need to reduce the complexity of the model, for example using dimensionality reduction methods (advanced).

Inspection of Underfitting

When the model's accuracy on both the training and test datasets is very low.

If underfitting does occur, we can either try another ML model or increase the complexity of the current model.

What is Cross Validation then?

In the basic approach, this is how we split our dataset: a specific percentage goes to training and the remainder goes to testing.

In the cross-validation approach, the command cross_val_score segments your original dataset into a specific number of portions (folds) given by the user.

For example:


For K = 10, the original dataset is segmented into 10 portions: in each iteration, 9 are used for training and 1 for testing. The K value determines the number of segments to be divided, usually no more than 10.

After the split, the cross-validation command repeats the train/evaluate cycle K times, with each segment taking a turn as the test set, and reports one accuracy score per iteration. Therefore, if you decide to split the dataset into 10 segments, expect 10 iterations of evaluation. (Warning: this can sometimes be slow.)

Therefore, only perform cross-validation on trained models which you think have a higher possibility of an overfitting problem (e.g. KNN (maybe), DTC and RFC (definitely)).

# Cross Validation on DTC as an example

from sklearn.model_selection import cross_val_score


scores = cross_val_score(dtc, X, y, cv=10) # I set the data to be split into 10 segments with 10 iterations
scores

array([0.708, 0.702, 0.648, 0.67 , 0.7 , 0.678, 0.71 , 0.718, 0.694,


0.638])

The above shows the 10 accuracy scores from cross-validation for DTC. You can compare these to the training-set accuracy score achieved in Section 4.1.

However, to make life easier, you can simply find the mean score and compare.

score_mean = scores.mean()
score_mean

0.6866

# For better visualisation, we can use a barplot to compare the training-set accuracy and the cross-validation mean for clarity

Train_acc = dtc.score(X_train, y_train)

fig = plt.figure()
ax = fig.add_axes([0, 0, 0.5, 1])
labels = ['Train_Accuracy', 'Cross_Val']
accuracies = [Train_acc, score_mean]
ax.bar(labels, accuracies)
plt.show()

Insights

It is obvious that the cross-validation results reaffirm that overfitting occurs in the decision tree model. Hence, our next action may include retuning the model or proceeding with dimensionality reduction methods.

keyboard_arrow_down 4.3 Features Inspection (Optional)


The feature inspection method here only applies to tree-based models such as decision tree or random forest. For other classifiers such as logistic regression, SVC and KNN, whose algorithms do not expose feature importances directly, you will need other techniques such as dimensionality reduction or feature extraction on the independent features.
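For those non-tree models, one model-agnostic alternative (available in scikit-learn since 0.22) is permutation importance: shuffle one feature at a time and measure how much the score drops. A sketch with a logistic regression on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     n_informative=3, random_state=22)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=22)

clf = LogisticRegression(max_iter=500).fit(Xtr, ytr)

# Mean accuracy drop over 10 shuffles of each feature, on held-out data
result = permutation_importance(clf, Xte, yte, n_repeats=10, random_state=22)

for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.4f}")
```

Features whose shuffling barely changes the score are candidates for removal.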

Example for Random Forest

# Raw features importance display for Random Forest


rfc.feature_importances_

array([0.08078741, 0.03380412, 0.0356885 , 0.16301445, 0.11971816,


0.03391949, 0.08013428, 0.08824287, 0.0344468 , 0.08426211,
0.03361289, 0.03334621, 0.0347156 , 0.10845535, 0.03585178])

# Let's create a DataFrame containing the features importance


rfc_imp = pd.DataFrame(rfc.feature_importances_, columns=['Importance'])

# Let's multiply the importance value by 100 to ease reading


rfc_imp["Importance"]=rfc_imp["Importance"]*100

# Let's use the features names as Index


rfc_imp = rfc_imp.set_index([X.columns])

# Let's display the DataFrame with descending Importance values


display(rfc_imp.sort_values(by="Importance", ascending=False))


Importance

Out_Temp 16.301445
In_Pressure 11.971816
Col_density 10.845535
Rot_Speed 8.824287
Tank_% 8.426211
In_Flow 8.078741
Humidity 8.013428
Oxygen 3.585178
In_Temp 3.568850
Vibration 3.471560
Valve_Opening 3.444680
Viscosity 3.391949
In_Angle 3.380412
Out_Speed 3.361289
Thickness 3.334621

Insights:

From the inspection above, we can see that the features "Out_Temp", "In_Pressure" and "Col_density" play active roles in the predictions.

Example for Decision Tree

# Raw features importance display


dtc.feature_importances_

array([0.06875458, 0.02891116, 0.02224669, 0.19642729, 0.10737901,


0.03242396, 0.08930121, 0.06008405, 0.02833254, 0.09962833,
0.02573949, 0.0212217 , 0.02944355, 0.15445596, 0.0356505 ])

# Let's create a DataFrame containing the features importance


dtc_imp = pd.DataFrame(dtc.feature_importances_, columns=['Importance'])

# Let's multiply the importance value by 100 to ease reading


dtc_imp["Importance"]=dtc_imp["Importance"]*100

# Let's use the features names as Index


dtc_imp = dtc_imp.set_index([X.columns])

# Let's display the DataFrame with descending Importance values


display(dtc_imp.sort_values(by="Importance", ascending=False))

Importance

Out_Temp 19.642729

