CSC407 - Chapter 2-3

The document discusses the importance of data quality in machine learning and data science, emphasizing that clean data is essential for effective modeling and accurate results. It outlines various forms of data, advantages and disadvantages of data processing, and steps for data cleaning, including handling missing data and outliers. Additionally, it introduces exploratory data analysis (EDA) as a method for understanding datasets and preparing them for further statistical analysis.


Machine Learning/Data Science

(CSC 407)
Chapters 2 & 3
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
2: Data Preprocessing
• Importance of Data Quality
• Data is a crucial component in the field of Machine Learning.
• It refers to the set of observations or measurements that can be
used to train a machine-learning model.
• The quality and quantity of data available for training and testing
play a significant role in determining the performance of a
machine-learning model.
• Data is the most important part of all Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we cannot train
any model, and modern research and automation efforts would be in vain.
• For decision-making, the integrity of the conclusions drawn heavily
relies on the cleanliness of the underlying data.
• Without proper data cleaning, inaccuracies, outliers, missing
values, and inconsistencies can compromise the validity of
analytical results.
• Moreover, clean data facilitates more effective modelling and
pattern recognition, as algorithms perform optimally when fed clean, high-quality data.
2: Data Preprocessing
• Importance of Data Quality Cont’d
• Additionally, clean datasets enhance the
interpretability of findings, aiding in the
formulation of actionable insights.
• Big enterprises spend lots of money just to
gather as much data as possible.
Example: Why did Facebook acquire WhatsApp by
paying a huge price of $19 billion?
• The answer is very simple and logical – it is to
have access to the users’ information that
Facebook may not have but WhatsApp will have.
This information about their users is of paramount
importance to Facebook as it will facilitate the
task of improvement in their services.
2: Data Preprocessing
• Different Forms of Data
• Numeric Data: If a feature represents a characteristic
measured in numbers, it is called a numeric feature.
• Categorical Data: A categorical feature is an
attribute that can take on one of the limited, and
usually fixed number of possible values on the basis of
some qualitative property. A categorical feature is also
called a nominal feature. Examples: gender, colour, race.
• Ordinal Data: This denotes a nominal variable with
categories falling in an ordered list. Examples include
clothing sizes such as small, medium, and large, or a
measurement of customer satisfaction on a scale from
“not at all happy” to “very happy”.
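As an illustration, here is a minimal sketch of how these three forms of data might be represented in pandas; the toy column names and values are made up for this example:

import pandas as pd

# Toy data frame illustrating the three forms of data (values are made up)
survey = pd.DataFrame({
    'age': [23, 35, 41],                    # numeric feature
    'colour': ['red', 'blue', 'red'],       # categorical (nominal) feature
    'size': ['small', 'large', 'medium'],   # ordinal feature
})

# Give the ordinal feature an explicit order so that comparisons make sense
survey['size'] = pd.Categorical(survey['size'],
                                categories=['small', 'medium', 'large'],
                                ordered=True)

print(survey.dtypes)             # int64, object, category
print(survey['size'] > 'small')  # ordered comparison on the ordinal column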
2: Data Preprocessing
• Advantages of data processing in Machine Learning:
• Improved model performance: Data processing helps improve the
performance of the ML model by cleaning and transforming the data
into a format that is suitable for modeling.
• Better representation of the data: Data processing allows the data to be
transformed into a format that better represents the underlying
relationships and patterns in the data, making it easier for the ML model
to learn from the data.
• Increased accuracy: Data processing helps ensure that the data is
accurate, consistent, and free of errors, which can help improve the
accuracy of the ML model.
• Disadvantages of data processing in Machine Learning:
• Time-consuming: Data processing can be a time-consuming task,
especially for large and complex datasets.
• Error-prone: Data processing can be error-prone, as it involves
transforming and cleaning the data, which can result in the loss of
important information or the introduction of new errors.
• Limited understanding of the data: Data processing can lead to a limited
understanding of the data, as the transformed data may not be
representative of the underlying relationships and patterns in the data.
2: Data Preprocessing
• Steps to Perform Data Cleaning
• Performing data cleaning involves a
systematic process to identify and
rectify errors, inconsistencies, and
inaccuracies in a dataset.
2: Data Preprocessing
• The following are essential steps to perform data
cleaning:
• Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
• Fixing Structural Errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in data
representation. Fixing structural errors enhances data consistency and facilitates
accurate analysis and interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context, decide
whether to remove outliers or transform them to minimize their impact on
analysis. Managing outliers is crucial for obtaining more accurate and reliable
insights from the data.
• Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods,
removing records with missing values, or employing advanced imputation
techniques. Handling missing data ensures a more complete dataset, preventing
biases and maintaining the integrity of analyses.
2: Data Preprocessing
• Python Implementation for Data Cleaning
• Let’s understand each step of data cleaning
using the Titanic dataset.
• Below are the necessary steps:
• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()

# Here is the code to load the titanic dataset


import pandas as pd
import numpy as np

# URL of the Titanic dataset


url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
df.head() # to see first few rows of the dataset
2: Data Preprocessing
• Your output should look like this (the first few rows of the Titanic dataset):
2: Data Preprocessing
• Let’s first understand the data by inspecting its structure and identifying
missing values, outliers, and inconsistencies, and check for duplicate rows
with the Python code below:
df.duplicated()

Output:

0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool

The output shows that there are no duplicates
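As a quick check (a minimal sketch, not part of the original slides), we can count the duplicates directly instead of scanning the boolean Series by eye:

# Count duplicate rows directly; drop them if any are found
n_duplicates = df.duplicated().sum()
print('Number of duplicate rows:', n_duplicates)   # 0 for this dataset
if n_duplicates > 0:
    df = df.drop_duplicates()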


2: Data Preprocessing
#Check the data information using df.info()
df.info()
Output:

From the above data info, we can see that Age and Cabin
have fewer non-null counts than the other columns (i.e. missing values). Some of the columns
are categorical with data type object, and the others are numerical (int64 or float64).
2: Data Preprocessing
• Check the Categorical and Numerical Columns.
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)

Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

• Check the total number of unique values in the categorical columns
df[cat_col].nunique()
Output:

Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2: Data Preprocessing
• Removal of all Above Unwanted
Observations
• This includes deleting duplicate/ redundant or irrelevant
values from your dataset. Duplicate observations most
frequently arise during data collection and Irrelevant
observations are those that don’t actually fit the specific
problem that you’re trying to solve.
• Redundant observations reduce efficiency to a great extent,
since the repeated data points can bias the results towards the correct
or the incorrect side, thereby producing unreliable results.
• Irrelevant observations are any type of data that is of no use
to us and can be removed directly.
• Here we drop the Name column because names are always
unique and have little influence on the target variable.
For the Ticket column, let’s first print the first 50 unique tickets.

df['Ticket'].unique()[:50]
2: Data Preprocessing
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
       '330877', '17463', '349909', '347742', '237736', 'PP 9549',
       '113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
       '244373', '345763', '2649', '239865', '248698', '330923', '113788',
       '347077', '2631', '19950', '330959', '349216', 'PC 17601',
       'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
       'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
       'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
       '2662', '349237', '3101295'], dtype=object)

• From the above tickets, we can observe that many values are made up of two parts,
e.g. the first value ‘A/5 21171’ is a combination of ‘A/5’ and ‘21171’; this may
influence our target variable, so let’s drop the “Name” and “Ticket” columns.
2: Data Preprocessing
• Drop Name and Ticket Columns

df1 = df.drop(columns=['Name','Ticket'])
df1.shape

Output:

(891, 10)
2: Data Preprocessing
• Handling Missing Data
• Missing data is a common issue in real-world datasets, and it can occur due to various reasons
such as human errors, system failures, or data collection issues. Various techniques can be
used to handle missing data, such as imputation, deletion, or substitution.

• Let’s check the percentage of missing values column-wise using df.isnull()
• it checks whether each value is null or not and returns boolean values.
• .sum() adds up the number of null values in each column; we divide it by the total number of
rows present in the dataset and then multiply by 100 to express the result as a percentage,
i.e. how many values per 100 are null.

round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:

PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot just ignore or remove the missing observations. They must be handled carefully, as they can reduce the quality of the dataset and bias the resulting model.
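As a hedged sketch (the 50% cutoff below is an illustrative choice, not prescribed by the slides), columns with a very high share of missing values can also be identified and dropped programmatically:

# Drop columns whose missing-value percentage exceeds a chosen threshold
missing_pct = round((df1.isnull().sum() / df1.shape[0]) * 100, 2)
cols_to_drop = missing_pct[missing_pct > 50].index.tolist()
print('Columns to drop:', cols_to_drop)   # only 'Cabin' exceeds 50% here
df_reduced = df1.drop(columns=cols_to_drop)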
2: Data Preprocessing
• Handling Missing Data
• The two most common ways to deal with missing data are:
• Dropping Observations with missing values.
• Imputing the missing values from past observations.

• Dropping Observations with missing values.


• As we can see from the above result, Cabin has 77.1% null values,
Age has 19.87%, and Embarked has 0.22% null values.

• It is not a good idea to impute a column that is 77% missing, so we will drop
the Cabin column. The Embarked column has only 0.22% null values,
so we drop the rows where Embarked is missing.

df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape

Output:
(889, 9)
2: Data Preprocessing
• Handling Missing Data
• Imputing the missing values from past observations.
• Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information.
You’re just reinforcing the patterns already provided by other features.
• We can use Mean imputation or Median imputations for the case.

• Note:
• Mean imputation is suitable when the data is normally distributed and has no extreme outliers.
• Median imputation is preferable when the data contains outliers or is skewed.
# Mean imputation (only the Age column still contains missing values at this point)
df3 = df2.fillna({'Age': df2['Age'].mean()})
# Let's check the null values again
df3.isnull().sum()

Output:

PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
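Since the Age distribution in the Titanic data is right-skewed, median imputation (as the note above recommends for skewed data) is a reasonable alternative; a minimal sketch:

# Median imputation as an alternative to the mean
df3_median = df2.fillna({'Age': df2['Age'].median()})
print(df3_median['Age'].isnull().sum())   # 0 missing values remain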
2: Data Preprocessing
• Handling Outliers
• Outliers are extreme values that deviate significantly from the majority of
the data.
• They can negatively impact the analysis and model performance.
• Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.

• To check for outliers, we generally use a box plot.


• A box plot, also referred to as a box-and-whisker plot, is a graphical
representation of a dataset’s distribution.
• It shows a variable’s median, quartiles, and potential outliers.
• The line inside the box denotes the median, while the box itself denotes the
interquartile range (IQR).
• The whiskers extend to the most extreme non-outlier values within 1.5
times the IQR.
• Individual points beyond the whiskers are considered potential outliers.
• A box plot offers an easy-to-understand overview of the range of the data
and makes it possible to identify outliers or skewness in the distribution.
2: Data Preprocessing
• Handling Outliers
• Let’s plot the box plot for Age column data.

import matplotlib.pyplot as plt


plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
2: Data Preprocessing

• As we can see from the above box-and-whisker plot, the Age column
contains outliers: values less than about 5 and more than about 55 are outliers.

# calculate summary statistics
mean = df3['Age'].mean()
std = df3['Age'].std()   # std is needed for the bounds computed below
2: Data Preprocessing
# Calculate the lower and upper bounds
lower_bound = mean - std*2
upper_bound = mean + std*2

print('Lower Bound :',lower_bound)


print('Upper Bound :',upper_bound)

# Drop the outliers


df4 = df3[(df3['Age'] >= lower_bound)
& (df3['Age'] <= upper_bound)]

Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785

• Similarly, we can remove the outliers of the remaining columns.
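An IQR-based rule, matching the box-plot definition of outliers described earlier, is a common alternative to the mean ± 2·std bounds above; a minimal sketch:

# IQR-based outlier removal for the Age column
Q1 = df3['Age'].quantile(0.25)
Q3 = df3['Age'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df4_iqr = df3[(df3['Age'] >= lower) & (df3['Age'] <= upper)]
print(df4_iqr.shape)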


3: Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to
understand their main characteristics, discover patterns, locate outliers, and identify relationships
between variables.
• EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or
modeling.
• The Goals of EDA
• Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes
techniques such as data imputation, handling missing data, and identifying and removing outliers.
• Descriptive Statistics: EDA uses summary statistics to understand the central tendency, variability, and
distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are
commonly used.
• Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations
such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying patterns,
trends, and relationships within the data.
• Feature Engineering: EDA allows for the exploration of different variables and their transformations to create new
features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding
categorical variables, and creating interaction or derived variables.
• Correlation and Relationships: EDA helps discover relationships and dependencies between variables.
Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and
direction of relationships between variables.
• Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain
criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and
can lead to more focused analysis.
• Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the
preliminary exploration of the data. It forms the foundation for further analysis and model building.
• Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It involves
checking for data integrity, consistency, and accuracy to make certain the data is suitable for analysis.
3: Exploratory Data Analysis (EDA)
• Types of EDA
• Univariate Analysis: This type of analysis explores each variable in a
dataset individually. It involves summarizing and visualizing one variable at a time to
understand its distribution, central tendency, and other relevant characteristics.
Techniques like histograms, box plots, and bar charts are commonly used in
univariate analysis.
• Bivariate Analysis: Bivariate analysis involves exploring the relationship
between two variables. It helps find associations, correlations, and dependencies
between pairs of variables. Scatter plots, line plots, correlation matrices, and
cross-tabulation are commonly used techniques in bivariate analysis.
• Multivariate Analysis: Multivariate analysis extends bivariate analysis to
encompass more than two variables. It aims to understand the complex
interactions and dependencies among multiple variables in a data set.
Techniques such as heatmaps, parallel coordinates, factor analysis, and
principal component analysis (PCA) are used for multivariate analysis.
• Time Series Analysis: This type of analysis is mainly applied to data sets
that have a temporal component. Time series analysis involves examining
and modelling patterns, trends, and seasonality in the data over
time. Techniques like line plots, autocorrelation analysis, moving averages,
and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used.
3: Exploratory Data Analysis (EDA)
• Types of EDA Cont’d
• Missing Data Analysis: Missing data is a common issue in
datasets, and it may impact the reliability and validity of the analysis.
Missing data analysis involves identifying missing values, understanding
the patterns of missingness, and using suitable techniques to deal with
missing data. Techniques such as missing data patterns, imputation
strategies, and sensitivity analysis are employed in missing data
analysis.
• Outlier Analysis: Outliers are data points that drastically deviate
from the general pattern of the data. Outlier analysis includes identifying
and understanding the presence of outliers, their potential causes, and their
impact on the analysis. Techniques such as box plots, scatter plots, z-
scores, and clustering algorithms are used for outlier analysis.
• Data Visualization: Data visualization is a critical aspect of EDA that
involves creating visual representations of the data to facilitate
understanding and exploration. Various visualization techniques, including
bar charts, histograms, scatter plots, line plots, heatmaps, and
interactive dashboards, are used to represent different kinds of data.
The choice of techniques depends on the data characteristics, research questions, and goals of the analysis.
3: Exploratory Data Analysis (EDA)
• For simplicity, we will use a single dataset: the winequality-red
dataset. It is a CSV file that you can download.
Note: if you are using Google Colab, you need to mount your drive so that Colab can have
access to the folder where your data resides. You can do that with the following two lines of
code:

from google.colab import drive
drive.mount('/content/drive')

The next is to import the necessary Python libraries


# importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wr
wr.filterwarnings('ignore')

# loading and reading dataset


df = pd.read_csv("winequality-red.csv")
3: Exploratory Data Analysis (EDA)
# loading and reading dataset
df = pd.read_csv('/content/drive/MyDrive/CSC407/winequality-red.csv', encoding='latin-1')  # using Colab
df = pd.read_csv('winequality-red.csv', encoding='latin-1')  # use this if you are running Python locally
print(df.head())

Output: the first few rows of the dataset.

• Analyzing the Data


Gaining general knowledge about the data, including its values, types, number of rows and
columns, and missing values, is the primary objective of data understanding.
shape: shape will show how many features (columns) and observations (rows) there are in the
dataset.
# shape of the data
df.shape

Output:
(1599, 12)
3: Exploratory Data Analysis (EDA)
• info() facilitates comprehension of the data type and
related information, such as the quantity of records in
each column, whether the data is null or not, the type of
data, and the dataset’s memory use.

#data information
df.info()
3: Exploratory Data Analysis (EDA)
# describing the data
df.describe()

The DataFrame “df” is statistically summarized by the code df.describe(), which gives the
count, mean, standard deviation, minimum, quartiles, and maximum for each numerical
column. The dataset’s central tendency and spread can thus be assessed at a glance.
3: Exploratory Data Analysis (EDA)
• Checking Columns

#column to list
df.columns.tolist()

• The code df.columns.tolist() converts the column names of the DataFrame
‘df’ into a Python list, providing a convenient way to access and iterate over them.
• Checking Missing Values

# check for missing values:


df.isnull().sum()

• The code df.isnull().sum() checks for missing values in each
column of the DataFrame ‘df’ and returns the number of null values per column.
• Checking for the unique values
#checking unique values
df.nunique()

• The function df.nunique() determines how many unique values
there are in each column of the DataFrame ‘df’, offering
information about the variety of data that makes up each feature.
3: Exploratory Data Analysis (EDA)
• Univariate Analysis
• In univariate analysis, plotting the right charts can help us better
understand the data, which is why data visualization is so important.
The Matplotlib and Seaborn libraries are used here to visualize our data.
• Basic charts can be created with Matplotlib, a Python 2D plotting package.
• Seaborn is a Python library that leverages short code segments to
generate and customize statistical charts from Pandas and NumPy data,
built on top of the Matplotlib framework.
• Univariate analysis can be applied to both numerical and categorical data.
• In this example, we are going to plot different types of plots, such as
count plots and kernel density plots.
3: Exploratory Data Analysis (EDA)
• Univariate Analysis
# Assuming 'df' is your DataFrame
quality_counts = df['quality'].value_counts()

# Using Matplotlib to create a count plot


plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

Here, this count plot shows the number of wine samples for each quality rating.
3: Exploratory Data Analysis (EDA)
• Kernel density plot
# Set Seaborn style
sns.set_style("darkgrid")

# Identify numerical columns


numerical_columns = df.select_dtypes(include=["int64",
"float64"]).columns

# Plot distribution of each numerical feature


plt.figure(figsize=(14, len(numerical_columns) * 3))
for idx, feature in enumerate(numerical_columns, 1):
plt.subplot(len(numerical_columns), 2, idx)
sns.histplot(df[feature], kde=True)
plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

# Adjust layout and show plots
plt.tight_layout()
plt.show()
3: Exploratory Data Analysis (EDA)
• Kernel density plot

Here, the title of each kernel density plot reports the skewness of the
corresponding feature. Features with a skewness of exactly 0 have a
symmetrical distribution, while plots with a skewness of 1 or above are
positively (right) skewed. In a right-skewed (positively skewed)
distribution, the tail extends to the right and the mean is greater than
the median.
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis

• When doing a bivariate analysis, two variables are examined
simultaneously in order to look for patterns, dependencies, or
interactions between them.
• Understanding how changes in one variable may correspond to
changes in another requires the use of this statistical method.
• Bivariate analysis allows for a thorough comprehension of the
interdependence between two variables within a dataset by revealing
information on the type and intensity of associations.
• Let’s plot a pair plot for the data.


3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Pair plot

# Set the color palette
sns.set_palette("Pastel1")

# Assuming 'df' is your DataFrame
plt.figure(figsize=(10, 6))

# Using Seaborn to create a pair plot with the specified color palette
sns.pairplot(df)

plt.suptitle('Pair Plot for DataFrame')
plt.show()
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Pair plot
Interpreting the pair plot:
• The diagonal shows histograms or kernel density plots, which display
the distribution of the individual variables.
• The scatter plots in the lower triangle display the relationship
between each pair of variables.
• The scatter plots above and below the diagonal are mirror images of
each other, indicating symmetry.
• More centred histograms indicate where the peaks of the
distributions are located.
• Skewness can be judged by observing whether a histogram has a
longer tail on one side of its peak.
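Plotting all twelve columns at once can be slow and crowded. A minimal sketch that restricts the pair plot to a few columns and colours the points by quality (the column names 'alcohol', 'pH', and 'density' assume the standard winequality-red.csv file):

# Pair plot on a subset of columns, coloured by wine quality
sns.pairplot(df, vars=['alcohol', 'pH', 'density'], hue='quality')
plt.suptitle('Pair Plot for Selected Features', y=1.02)
plt.show()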
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Violin plot

# Assuming 'df' is your DataFrame


df['quality'] = df['quality'].astype(str) # Convert 'quality' to
categorical

plt.figure(figsize=(10, 8))

# Using Seaborn to create a violin plot


sns.violinplot(x="quality", y="alcohol", data=df, palette={
'3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen', '6':
'gold', '7': 'lightskyblue', '8': 'lightpink'}, alpha=0.7)

plt.title('Violin Plot for Quality and Alcohol')


plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Violin plot
Interpreting the violin plot:
• A wider section indicates higher density, suggesting more data points
at that value.
• A symmetrical plot indicates a balanced distribution.
• A peak or bulge in the violin plot represents the most common value
in the distribution.
• Longer tails indicate greater variability.
• The median marker inside the violin indicates the central value of the
distribution.
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Box plot
#plotting box plot between alcohol and quality
sns.boxplot(x='quality', y='alcohol', data=df)
plt.show()

Interpreting the box plot:
• The box represents the IQR; the longer the box, the greater the
variability.
• The median line in the box indicates the central tendency.
• Whiskers extend from the box to the smallest and largest values
within a specified range (typically 1.5 times the IQR).
• Individual points beyond the whiskers are potential outliers.
3: Exploratory Data Analysis (EDA)
• Multivariate Analysis
• Interactions between three or more variables in a dataset are
simultaneously analyzed and interpreted in multivariate analysis.

• In order to provide a comprehensive understanding of the collective


behavior of several variables, it seeks to reveal intricate patterns,
relationships, and interactions between them.

• Multivariate analysis examines correlations and dependencies between


numerous variables by using sophisticated statistical techniques such as
factor analysis, principal component analysis, and multivariate regression.

• Multivariate analysis, which is widely applied in domains such as biology,


economics, and marketing, enables thorough insights and helps decision-
makers make well-informed judgments based on complex relationships
found in multidimensional datasets.

• Here, we are going to show the multivariate analysis using a correlation


matrix plot.
3: Exploratory Data Analysis (EDA)
• Multivariate Analysis
• Correlation Matrix
# Assuming 'df' is your DataFrame
plt.figure(figsize=(15, 10))

# Using Seaborn to create a heatmap


# numeric_only=True skips non-numeric columns (e.g. 'quality' after the earlier string conversion)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f',
            cmap='Pastel2', linewidths=2)

plt.title('Correlation Heatmap')
plt.show()
3: Exploratory Data Analysis (EDA)
• Multivariate Analysis
Interpreting the correlation matrix plot:
• Values close to +1 indicate a strong positive correlation, values close
to -1 indicate a strong negative correlation, and values near 0 suggest
no linear correlation.
• Darker colours signify stronger correlations, while lighter colours
represent weaker correlations.
• Positively correlated variables move in the same direction: as one
increases, the other also increases.
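Rather than reading the strongest relationships off the heatmap by eye, here is a minimal sketch that lists the most strongly correlated pairs (each unordered pair appears twice in the unstacked matrix):

# List the strongest pairwise correlations from the correlation matrix
corr = df.corr(numeric_only=True)
pairs = corr.unstack().sort_values(key=abs, ascending=False)
pairs = pairs[pairs < 1.0]    # drop the self-correlations on the diagonal
print(pairs.head(10))         # ten largest correlations in absolute value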
3: Exploratory Data Analysis (EDA)
• In summary, the Python-based exploratory data analysis (EDA)
of the wine dataset has yielded important new information about
the properties of the wine samples.
• We investigated correlations between variables, identified
outliers, and obtained a knowledge of the distribution of
important features using statistical summaries and
visualizations.
• The quantitative and qualitative features of the dataset were
analyzed in detail through the use of various plots, including
pair, box, and histogram plots. Finding patterns, trends, and
possible topics for more research was made easier by this EDA
method.
• Furthermore, the analysis demonstrated the ability to visualize
and analyze complicated datasets using Python tools such as
Matplotlib, Seaborn, and Pandas.
• The results provide a thorough grasp of the wine dataset and lay the
groundwork for further modeling and statistical analysis.