Lesson 5 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a critical step in Business Analytics that prepares data for modeling by identifying patterns, anomalies, and relationships among variables. It employs data visualization and statistical techniques to ensure data quality and appropriateness for analysis, ultimately aiding data scientists in achieving valid business outcomes. The document outlines practical steps for conducting EDA using Python, including loading data, identifying null values, and visualizing unique counts.

Uploaded by

Saadie Essie
Copyright
© All Rights Reserved

BUSINESS INTELLIGENCE AND ANALYTICS

Lesson 5: Exploratory data analysis (EDA)


Setting the context
Before you start a Business Analytics project, it is important to ensure that the
data is ready for modeling work.
o Exploratory Data Analysis (EDA) ensures the readiness of the data for
Business Analytics.
o EDA also makes the data more usable. Without a proper EDA,
machine learning work suffers from accuracy issues, and in many cases the
algorithms will not work at all.

What is exploratory data analysis?

• Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often employing data
visualization methods.
o It helps determine how best to manipulate data sources to get the
answers you need, making it easier for data scientists to discover
patterns, spot anomalies, test a hypothesis, or check assumptions.

• EDA is primarily used to see what data can reveal beyond the formal modeling
or hypothesis testing task and provides a better understanding of data set
variables and the relationships between them.
• It can also help determine if the statistical techniques you are considering for data
analysis are appropriate.
• Originally developed by American mathematician John Tukey in the 1970s,
EDA techniques continue to be widely used in the data discovery
process today.

Why is exploratory data analysis important in Business Analytics?

The main purpose of EDA is to help look at data before making any assumptions. It can
help identify obvious errors, better understand patterns within the data, detect
outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are
valid and applicable to any desired business outcomes and goals.
• EDA also helps stakeholders by confirming they are asking the right questions.
• EDA can help answer questions about standard deviations, categorical variables,
and confidence intervals.
• Once EDA is complete and insights are drawn, its features can then be used for
more sophisticated data analysis or modeling, including machine learning.

Programming Language Used

• Python: an interpreted, object-oriented programming language with dynamic
semantics. Its high-level, built-in data structures, combined with dynamic
typing and dynamic binding, make it very attractive for rapid application
development, as well as for use as a scripting or glue language to connect
existing components together.
Python and EDA can be used together to identify missing values in a data set, which
is important so you can decide how to handle missing values for machine learning.

Practical case study illustrating how to conduct exploratory data analysis (EDA) using Python

Some steps used to investigate data

1. Exploratory Data Analysis - EDA.
2. Load the Data.
3. Basic information about data - EDA.
4. Duplicate values.
5. Summary statistics, i.e. mean, count, standard deviation, etc.
6. Unique values in the data.
7. Visualize the Unique counts.
8. Find the Null values.
9. Replace the Null values.

1. Load the Data

Well, first things first. We will load the Titanic dataset into Python to perform EDA.
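
As a minimal sketch of this step, the snippet below loads a few Titanic-style rows with pandas. The inline CSV text, its values, and the Kaggle-style column names (Pclass, Age, Cabin and so on) are assumptions made so the example runs on its own. With the real dataset you would pass a file path (for instance a hypothetical titanic.csv) to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical sample rows laid out like the Kaggle Titanic file.
# Empty fields (Age in row 5, several Cabin entries) become missing values.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin
1,0,3,male,22,7.25,
2,1,1,female,38,71.28,C85
3,1,3,female,26,7.93,
4,1,1,female,35,53.10,C123
5,0,3,male,,8.05,
6,0,2,male,54,51.86,E46
"""

# With a real file: df = pd.read_csv("titanic.csv")  (hypothetical path)
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows
```

Looking at df.shape and df.head() right after loading is a quick check that the file was parsed with the expected columns.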

2. Basic information about data - EDA

The df.info() function will give us the basic information about the dataset. For any data,
it is good to start by knowing its information. Let’s see how it works with our data.

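Using this function, you can see the non-null count of each column (and therefore the
number of null values), the datatypes, and the memory usage. Descriptive statistics such
as mean, count and standard deviation come from the companion df.describe() function.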
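
The two calls can be sketched as follows. The small DataFrame is a hypothetical stand-in for the Titanic data (column names and values are assumptions), so the printed numbers will differ from those of the real dataset.

```python
import pandas as pd

# Hypothetical Titanic-style sample; None marks a missing value.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Cabin": [None, "C85", None, "C123", None, "E46"],
})

# Prints column names, non-null counts, datatypes and memory usage.
df.info()

# Descriptive statistics (count, mean, std, min, quartiles, max)
# for the numeric columns only.
summary = df.describe()
print(summary)
```

A column whose non-null count is lower than the number of rows, such as Age here, contains missing values.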

3. Duplicate values

You can use df.duplicated().sum() to count the duplicate rows, if any are present.
It returns the number of rows that are exact copies of an earlier row in the data.

Well, the function returned ‘0’. This means there is not a single duplicate row present
in our dataset, which is a very good thing to know.
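
A minimal sketch of the duplicate check, on hypothetical rows (the column names and values are assumptions). A copy of the first row is appended on purpose to show what a non-zero result looks like and how drop_duplicates() removes it.

```python
import pandas as pd

# Hypothetical sample with no repeated rows.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
})

# duplicated() flags every row that repeats an earlier row;
# summing the flags gives the number of duplicate rows.
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Append a copy of the first row to create one duplicate on purpose.
df_with_dupe = pd.concat([df, df.iloc[[0]]], ignore_index=True)
n_duplicates_after = df_with_dupe.duplicated().sum()
print(n_duplicates_after)

# drop_duplicates() keeps the first occurrence of each row.
deduped = df_with_dupe.drop_duplicates()
print(len(deduped))
```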

4. Unique values in the data

You can find the unique values in a particular column using the unique() function
in pandas.

array([3, 1, 2], dtype=int64)

array([0, 1], dtype=int64)

array(['male', 'female'], dtype=object)

The unique() function has returned the unique values which are present in the data,
which is pretty cool!
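
The calls behind output of this kind can be sketched as below. Mapping the three arrays to the Pclass, Survived and Sex columns is an assumption based on their values, and the DataFrame is a hypothetical sample.

```python
import pandas as pd

# Hypothetical Titanic-style sample.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# unique() returns the distinct values in order of first appearance.
pclass_values = df["Pclass"].unique()
survived_values = df["Survived"].unique()
sex_values = df["Sex"].unique()
print(pclass_values, survived_values, sex_values)

# nunique() returns only the number of distinct values per column.
n_unique = df.nunique()
print(n_unique)
```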

5. Visualize the Unique counts

Yes, you can visualize the unique values present in the data. For this, we will be using
the seaborn library. You have to call the sns.countplot() function and specify the
variable to plot.
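
A minimal sketch of a count plot, assuming seaborn and matplotlib are installed. The sample values and the choice of the Pclass column are assumptions. The backend line only lets the sketch run without a display and can be dropped in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical passenger classes.
df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2]})

# The numbers behind the plot: how often each class occurs.
counts = df["Pclass"].value_counts()
print(counts)

# One bar per distinct value, with the bar height equal to its count.
ax = sns.countplot(x="Pclass", data=df)
ax.set_title("Passengers per class")
```

In a script, calling matplotlib.pyplot.show() or ax.figure.savefig(...) afterwards displays or stores the figure.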

6. Find the Null values

Finding the null values is one of the most important steps in EDA, because ensuring the
quality of the data is paramount.

In the Titanic data, we have some null values in the ‘Age’ and ‘Cabin’ variables.
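
Counting the null values per column can be sketched as follows. The DataFrame is a hypothetical sample built so that Age and Cabin contain missing entries, mirroring the situation described above. The counts are not those of the real dataset.

```python
import pandas as pd

# Hypothetical Titanic-style sample; None marks a missing value.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Cabin": [None, "C85", None, "C123", None, "E46"],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the columns that actually contain missing values.
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```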

7. Replace the Null values

Hey, we have a replace() function to replace all the null values with a specific value.
It is very handy!

It is very easy to find and replace the null values in the data. Here, 0 was used to
replace the null values. You can also opt for more meaningful replacements such as the
mean or median of the column.
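
The options mentioned here can be sketched on a single column. The ages are hypothetical values, and using fillna() for the mean and median variants is one possible choice rather than the lesson's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry (index 4).
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, np.nan, 59.0]})

# Replace missing values with a fixed value, as in the text.
age_zero = df["Age"].replace(np.nan, 0)

# Alternatives: fill with the column mean or median instead.
# mean() and median() skip missing values by default.
age_mean = df["Age"].fillna(df["Age"].mean())
age_median = df["Age"].fillna(df["Age"].median())

print(age_zero.tolist())
print(age_mean.tolist())
print(age_median.tolist())
```

Filling with 0 is simple but drags down statistics such as the average age, which is why the mean or median is often the more meaningful replacement for a numeric column.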

8. Know the datatypes

Knowing the datatypes which you are exploring is very important and an easy process
too. Let’s see how it works.

You have to use the dtypes attribute for this, and you will get the datatype of
each attribute.
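
A short sketch of the datatype check on a hypothetical sample (column names and values are assumptions). The exact label shown for text columns can differ between pandas versions.

```python
import pandas as pd

# Hypothetical Titanic-style sample.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
})

# dtypes is an attribute (no parentheses): one datatype per column.
column_types = df.dtypes
print(column_types)

# Helper checks from pandas.api.types work on individual columns.
print(pd.api.types.is_numeric_dtype(df["Age"]))
print(pd.api.types.is_numeric_dtype(df["Sex"]))
```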

9. Filter the Data

Yes, you can filter the data based on some logic.

For example, selecting the rows where the passenger class equals 1 returns only the
data values that belong to class 1.
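
Filtering by class can be sketched with a boolean mask. The DataFrame is a hypothetical sample and the Pclass column name is an assumption.

```python
import pandas as pd

# Hypothetical Titanic-style sample.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# The comparison yields one True/False per row; indexing with that
# mask keeps only the rows where the condition holds.
first_class = df[df["Pclass"] == 1]
print(first_class)

# Conditions can be combined with & (and) and | (or);
# each condition needs its own parentheses.
first_class_women = df[(df["Pclass"] == 1) & (df["Sex"] == "female")]
print(len(first_class_women))
```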

10. A quick box plot

You can create a box plot for any numerical column using a single line of code.
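
A minimal sketch of such a one-line box plot, assuming matplotlib is installed. The ages are hypothetical and the Age column name is an assumption. The backend line only lets the sketch run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit in a notebook

import pandas as pd

# Hypothetical ages.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})

# The single line that draws the box plot for the numerical column.
ax = df[["Age"]].boxplot()
ax.set_ylabel("Age")
```

The box spans the first to third quartile, the line inside it marks the median, and points drawn beyond the whiskers are candidates for outliers.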

11. Correlation Plot - EDA

Finally, to find the correlation among the variables, we can make use of the corr()
function. This will give you a fair idea of the correlation strength between different
variables.

The result is a correlation matrix with values ranging from +1 to -1, where +1 means
perfectly positively correlated and -1 means perfectly negatively correlated.
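
The correlation matrix can be sketched as below on a hypothetical sample (column names and values are assumptions). Passing numeric_only=True is a choice made here so that text columns such as Sex are skipped. It requires a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical Titanic-style sample.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# A single pair can be read straight from the matrix.
print(corr.loc["Pclass", "Fare"])
```

Each column correlates perfectly with itself, so the diagonal of the matrix is 1. In this sample Pclass and Fare are negatively correlated because the higher class numbers carry the lower fares.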

12. seaborn library

You can even visualize the correlation matrix using the seaborn library.
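
The original example is not preserved here, so the sketch below assumes a heatmap drawn with sns.heatmap, which is a common way to plot a correlation matrix. The data is a hypothetical sample, and the backend line only lets the sketch run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical numeric Titanic-style sample.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

corr = df.corr()

# annot=True writes the coefficient into each cell;
# vmin/vmax pin the colour scale to the full -1..+1 range.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix")
```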

Exploratory Data Analysis – EDA Summary

• EDA is applied to investigate the data and summarize the key insights.
• It will give you a basic understanding of your data, its
distribution, null values and much more.
• You can explore data either using graphs or through some Python
functions.
• There are two types of analysis: univariate and bivariate. In
univariate analysis, you analyze a single attribute. In bivariate
analysis, you analyze an attribute together with the target attribute.
• In the non-graphical approach,
o you will be using functions and attributes such as shape, describe,
isnull, info, dtypes and more.
• In the graphical approach,
o you will be using plots such as scatter, box, bar, density and
correlation plots.
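
The difference between univariate and bivariate analysis can be sketched with the non-graphical approach. The data is a hypothetical sample, so the survival rates it prints are illustrative only. Survived is assumed to be the target attribute.

```python
import pandas as pd

# Hypothetical Titanic-style sample; Survived is the assumed target.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0, 54.0],
})

# Univariate: one attribute at a time.
age_summary = df["Age"].describe()
sex_counts = df["Sex"].value_counts()
print(age_summary)
print(sex_counts)

# Bivariate: an attribute together with the target attribute.
# The mean of a 0/1 column is the share of ones in each group.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```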

Revision Questions Exploratory Data Analysis (EDA)


1. What is the difference between univariate, bivariate, and multivariate
analysis in EDA?

2. During the data preprocessing step, how should one treat missing/null
values? How will you deal with them?

3. What is an outlier, and how can outliers be identified?
