CSC407 - Chapter 2-3
(csc 407)
Chapters 2 & 3
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
2: Data Preprocessing
• Importance of Data Quality
• Data is a crucial component in the field of Machine Learning.
• It refers to the set of observations or measurements that can be
used to train a machine-learning model.
• The quality and quantity of data available for training and testing
play a significant role in determining the performance of a
machine-learning model.
• Data is the most important part of Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we cannot train
any model, and modern research and automation would be in vain.
• For decision-making, the integrity of the conclusions drawn heavily
relies on the cleanliness of the underlying data.
• Without proper data cleaning, inaccuracies, outliers, missing
values, and inconsistencies can compromise the validity of
analytical results.
• Moreover, clean data facilitates more effective modelling and
pattern recognition, as algorithms perform optimally when fed
clean, consistent input.
2: Data Preprocessing
• Importance of Data Quality Cont’d
• Additionally, clean datasets enhance the
interpretability of findings, aiding in the
formulation of actionable insights.
• Big enterprises spend a lot of money just to
gather as much data as possible.
Example: Why did Facebook acquire WhatsApp by
paying a huge price of $19 billion?
• The answer is simple and logical: to gain
access to user information that Facebook
may not have but WhatsApp does. This
information about users is of paramount
importance to Facebook because it helps
them improve their services.
2: Data Preprocessing
• Different Forms of Data
• Numeric Data: If a feature represents a characteristic
measured in numbers, it is called a numeric feature.
• Categorical Data: A categorical feature is an
attribute that can take on one of a limited, and
usually fixed, number of possible values on the basis
of some qualitative property. A categorical feature is
also called a nominal feature. Examples: gender,
colour, race.
• Ordinal Data: This denotes a nominal variable with
categories falling in an ordered list. Examples include
clothing sizes such as small, medium, and large, or a
measurement of customer satisfaction on a scale from
“not at all happy” to “very happy”.
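A small illustrative sketch in pandas of the three forms of data; the column names and values are made up for illustration only:
import pandas as pd

# Hypothetical data illustrating numeric, nominal and ordinal features
df = pd.DataFrame({
    'age': [25, 32, 47],                  # numeric feature
    'colour': ['red', 'blue', 'red'],     # categorical (nominal) feature
    'size': ['small', 'large', 'medium']  # ordinal feature
})

# Encode the ordinal feature with an explicit category order
df['size'] = pd.Categorical(df['size'],
                            categories=['small', 'medium', 'large'],
                            ordered=True)
print(df.dtypes)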
2: Data Preprocessing
• Advantages of data processing in Machine Learning:
• Improved model performance: Data processing helps improve the
performance of the ML model by cleaning and transforming the data
into a format that is suitable for modeling.
• Better representation of the data: Data processing allows the data to be
transformed into a format that better represents the underlying
relationships and patterns in the data, making it easier for the ML model
to learn from the data.
• Increased accuracy: Data processing helps ensure that the data is
accurate, consistent, and free of errors, which can help improve the
accuracy of the ML model.
• Disadvantages of data processing in Machine Learning:
• Time-consuming: Data processing can be a time-consuming task,
especially for large and complex datasets.
• Error-prone: Data processing can be error-prone, as it involves
transforming and cleaning the data, which can result in the loss of
important information or the introduction of new errors.
• Limited understanding of the data: Data processing can lead to a limited
understanding of the data, as the transformed data may not be
representative of the underlying relationships and patterns in the data.
2: Data Preprocessing
• Steps to Perform Data Cleaning
• Performing data cleaning involves a
systematic process to identify and
rectify errors, inconsistencies, and
inaccuracies in a dataset.
2: Data Preprocessing
• The following are essential steps to perform data
cleaning:
• Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
• Fixing Structural Errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in data
representation. Fixing structural errors enhances data consistency and facilitates
accurate analysis and interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context, decide
whether to remove outliers or transform them to minimize their impact on
analysis. Managing outliers is crucial for obtaining more accurate and reliable
insights from the data.
• Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods,
removing records with missing values, or employing advanced imputation
techniques. Handling missing data ensures a more complete dataset, preventing
biases and maintaining the integrity of analyses.
2: Data Preprocessing
• Python Implementation for Data Cleaning
• Let’s understand each step of data cleaning
using the Titanic dataset.
• Below are the necessary steps (a code sketch follows):
• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()
• Check for duplicate rows using df.duplicated()
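A minimal, hedged sketch of these setup steps; the file name 'titanic.csv' is an assumption about where the dataset is stored locally. The boolean Series shown below is the kind of output df.duplicated() produces when checking for duplicate rows:
# Import the necessary libraries
import pandas as pd
import numpy as np

# Load the Titanic dataset (file name is an assumption)
df = pd.read_csv('titanic.csv')

# Check the data information
df.info()

# Check for duplicate rows (produces the boolean Series shown below)
df.duplicated()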
Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
From df.info() we can see that Age and Cabin have fewer non-null
counts than the other columns. Some of the columns are categorical
with data type object, and some are numerical (int64/float64).
2: Data Preprocessing
• Check the Categorical and Numerical Columns.
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Number of unique values in each categorical column (e.g. via df[cat_col].nunique()):
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2: Data Preprocessing
• Removal of all Above Unwanted
Observations
• This includes deleting duplicate, redundant, or irrelevant
values from your dataset. Duplicate observations most
frequently arise during data collection, while irrelevant
observations are those that do not actually fit the specific
problem you are trying to solve.
• Redundant observations reduce efficiency to a great extent
because the repeated data can push results towards either
the correct or the incorrect side, producing unreliable
results.
• Irrelevant observations are any type of data that is of no use
to us and can be removed directly.
• Here we drop the Name column because names are always
unique and have little influence on the target variable. For the
Ticket column, let’s first print the first 50 unique tickets.
df['Ticket'].unique()[:50]
2: Data Preprocessing
Output: array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450', ...]
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
2: Data Preprocessing
• Handling Missing Data
• Missing data is a common issue in real-world datasets, and it can occur due to various reasons
such as human errors, system failures, or data collection issues. Various techniques can be
used to handle missing data, such as imputation, deletion, or substitution.
• Let’s check the percentage of missing values in each column using df.isnull()
• it checks whether each value is null and returns boolean values.
• .sum() then counts the null values in each column; dividing by the total number of rows
(df1.shape[0]) and multiplying by 100 gives the percentage of missing values per
column.
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot simply ignore or remove the missing observations; they must be handled carefully, as they
can bias the results and reduce the quality of the model.
2: Data Preprocessing
• Handling Missing Data
• The two most common ways to deal with missing data are:
• Dropping Observations with missing values.
• Imputing the missing values from past observations.
• It is not a good idea to impute a column with 77% null values, so we
drop the Cabin column. The Embarked column has only 0.22% null
values, so we drop the rows where Embarked is null.
df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape
Output:
(889, 9)
2: Data Preprocessing
• Handling Missing Data
• Imputing the missing values from past observations.
• Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information.
You’re just reinforcing the patterns already provided by other features.
• We can use Mean imputation or Median imputations for the case.
• Note:
• Mean imputation is suitable when the data is normally distributed and has no extreme outliers.
• Median imputation is preferable when the data contains outliers or is skewed.
# Mean imputation for the Age column (only Age still has missing values)
df3 = df2.copy()
df3['Age'] = df3['Age'].fillna(df3['Age'].mean())
# Let's check the null values again
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
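If Age were skewed or contained extreme outliers, a median-based variant could be used instead; a short sketch (not part of the original slide code):
# Median imputation for the Age column
df3_median = df2.copy()
df3_median['Age'] = df3_median['Age'].fillna(df3_median['Age'].median())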
2: Data Preprocessing
• Handling Outliers
• Outliers are extreme values that deviate significantly from the majority of
the data.
• They can negatively impact the analysis and model performance.
• Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.
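A hedged sketch that is consistent with the bounds printed below; it assumes a two-standard-deviation rule on the Age column of df3 from the previous step (both the rule and the column are assumptions):
# Compute mean and standard deviation of Age
mean_age = df3['Age'].mean()
std_age = df3['Age'].std()

# Keep values within two standard deviations of the mean
lower_bound = mean_age - 2 * std_age
upper_bound = mean_age + 2 * std_age
print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)

# Drop rows whose Age falls outside the bounds
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]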
Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
3: Exploratory Data Analysis (EDA)
• info() facilitates comprehension of each column’s data type,
the number of non-null records in each column, and the
dataset’s memory usage.
#data information
df.info()
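These EDA examples use the red-wine quality dataset (1599 samples, 12 columns). A hedged setup sketch; the file name 'winequality-red.csv' is an assumption:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the red-wine quality dataset (file name is an assumption)
df = pd.read_csv('winequality-red.csv')

# 1599 rows and 12 columns
print(df.shape)

# Data types, non-null counts and memory usage
df.info()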
3: Exploratory Data Analysis (EDA)
# describing the data
df.describe()
#column to list
df.columns.tolist()
Here, a count plot shows the number of wine samples at each quality rating.
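A hedged sketch of the count plot described above, assuming the target column is named 'quality' as in the UCI wine-quality data:
import seaborn as sns
import matplotlib.pyplot as plt

# Count of wine samples per quality rating
sns.countplot(x='quality', data=df)
plt.title('Count of Wine Samples by Quality')
plt.show()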
3: Exploratory Data Analysis (EDA)
• Kernel density plot
# Set Seaborn style
sns.set_style("darkgrid")
Skewness is depicted by observing whether the density curve is
symmetric or has a longer tail on one side.
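A hedged sketch of a kernel density plot for one numeric feature; the column name 'alcohol' is an assumption:
import seaborn as sns
import matplotlib.pyplot as plt

# Set Seaborn style
sns.set_style("darkgrid")

# Kernel density estimate of a single numeric feature
sns.kdeplot(data=df, x='alcohol', fill=True)
plt.title('Kernel Density Plot of Alcohol')
plt.show()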
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Violin plot
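A hedged sketch of a bivariate violin plot; the column names 'quality' and 'alcohol' are assumptions:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of alcohol content for each quality rating
sns.violinplot(x='quality', y='alcohol', data=df)
plt.title('Alcohol Content by Wine Quality')
plt.show()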
The code fragment below belongs to the correlation heatmap that is interpreted on the next slide; the missing heatmap call is reconstructed here:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
3: Exploratory Data Analysis (EDA)
• Multivariate Analysis
Interpreting the correlation matrix plot
• Values close to +1 indicate a strong positive
correlation, values close to -1 indicate a strong
negative correlation, and values near 0 suggest
no linear correlation.
• Darker colors signify stronger correlations,
while lighter colors represent weaker
correlations.
• Positively correlated variables move in the
same direction: as one increases, the other
tends to increase as well.
3: Exploratory Data Analysis (EDA)
• In summary, the Python-based exploratory data analysis (EDA)
of the wine dataset has yielded important new information about
the properties of the wine samples.
• We investigated correlations between variables, identified
outliers, and obtained a knowledge of the distribution of
important features using statistical summaries and
visualizations.
• The quantitative and qualitative features of the dataset were
analyzed in detail through the use of various plots, including
pair, box, and histogram plots. Finding patterns, trends, and
possible topics for more research was made easier by this EDA
method.
• Furthermore, the analysis demonstrated the ability to visualize
and analyze complicated datasets using Python tools such as
Matplotlib, Seaborn, and Pandas.
• The results provide a thorough grasp of the wine dataset and lay
the groundwork for further analysis and modelling.