Lesson 5 Exploratory Data Analysis
EDA is primarily used to see what data can reveal beyond the formal modeling
or hypothesis testing task and provides a better understanding of data set
variables and the relationships between them.
It can also help determine if the statistical techniques you are considering for data
analysis are appropriate.
Originally developed by American mathematician John Tukey in the 1970s,
EDA techniques continue to be a widely used method in the data discovery
process today.
The main purpose of EDA is to help look at data before making any assumptions. It can
help identify obvious errors, better understand patterns within the data, detect
outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are
valid and applicable to any desired business outcomes and goals.
EDA also helps stakeholders by confirming they are asking the right questions.
EDA can help answer questions about standard deviations, categorical variables,
and confidence intervals.
Once EDA is complete and insights are drawn, its features can then be used for
more sophisticated data analysis or modeling, including machine learning.
Practical Case study to illustrate how to conduct Exploratory data analysis (EDA)
1. Load the data
First things first: we will load the Titanic dataset into Python to perform EDA.
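As a minimal sketch of the loading step, the snippet below builds a tiny hand-made stand-in for the Titanic data so that it runs without any file. The column names follow the commonly distributed Titanic CSV and are an assumption here, the values are illustrative only, and the `titanic.csv` path in the comment is hypothetical.

```python
import pandas as pd

# With the real file you would typically write something like:
#   df = pd.read_csv("titanic.csv")   # hypothetical path
# Here we use a small illustrative sample instead (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, None, 14.0, 40.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 30.07, 8.05],
    "Cabin":    [None, "C85", None, "C123", None, None],
})

# Look at the first few rows to confirm the data loaded as expected.
print(df.head())
```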
2. Basic information about data - EDA
The df.info() function gives us basic information about the dataset. For any data,
it is good to start by knowing its information. Let’s see how it works with our data.
Using df.info(), you can see the number of non-null values, datatypes, and memory
usage. The df.describe() function adds descriptive statistics for the numeric columns.
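A short sketch of these two calls, run on a small illustrative DataFrame (the column names and values are assumptions, not the real Titanic data):

```python
import pandas as pd

# Small illustrative sample (made-up values, Titanic-like column names).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age":    [22.0, 38.0, None, 35.0],
    "Fare":   [7.25, 71.28, 7.92, 53.10],
    "Sex":    ["male", "female", "female", "male"],
})

# Prints column names, non-null counts, dtypes and memory usage.
df.info()

# Count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()
print(summary)
```

Note that `info()` reports non-null counts, so the number of missing values in a column is the total row count minus that figure.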
3. Duplicate values
You can use df.duplicated().sum() to count the duplicate values, if any are present.
It returns the number of duplicate rows in the data.
Well, the function returned ‘0’. This means there is not a single duplicate row
in our dataset, which is a very good thing to know.
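For illustration, the sketch below runs the duplicate check on a small made-up sample that deliberately contains one repeated row, so the count is non-zero (unlike the result reported above for the lesson's dataset). The column names are assumptions.

```python
import pandas as pd

# Illustrative sample: the last row repeats the first one on purpose.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 3],
    "Sex":    ["male", "female", "female", "male"],
    "Age":    [22.0, 38.0, 26.0, 22.0],
})

# duplicated() flags each row that repeats an earlier row; sum() counts the flags.
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# If duplicates are found, drop_duplicates() keeps the first occurrence of each row.
deduplicated = df.drop_duplicates()
print(len(deduplicated))
```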
4. Unique values in the data
You can find the unique values in a particular column using the unique() function
in pandas.
The unique() function returns the unique values that are present in the column, which
is pretty cool!
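A minimal sketch on an illustrative sample; the `Pclass` column name is an assumption. It also shows `nunique()`, which gives the number of distinct values rather than the values themselves.

```python
import pandas as pd

# Illustrative sample (made-up values).
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Distinct values in the column, in order of first appearance.
unique_classes = df["Pclass"].unique()
print(unique_classes)

# Number of distinct values in the column.
n_classes = df["Pclass"].nunique()
print(n_classes)
```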
5. Visualize the unique counts
Yes, you can visualize the unique values present in the data. For this, we will be using
the seaborn library. You have to call the sns.countplot() function and specify the
variable to plot.
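A count plot along these lines might look like the sketch below. It assumes seaborn and matplotlib are installed and uses a made-up sample with an assumed `Pclass` column; the plot title is also just for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative sample (made-up values).
df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2, 3]})

# One bar per distinct value; bar height is the number of rows with that value.
ax = sns.countplot(x="Pclass", data=df)
ax.set_title("Passengers per class")

# plt.show()  # uncomment to display the chart in an interactive session
```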
6. Find the Null values
Finding the null values is the most important step in EDA, because ensuring the quality
of data is paramount.
We have some null values in the ‘Age’ and ‘Cabin’ variables.
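One common way to get per-column null counts is `isnull().sum()`. The sketch below uses a made-up sample in which only the assumed `Age` and `Cabin` columns contain missing values.

```python
import pandas as pd

# Illustrative sample (made-up values) with gaps in Age and Cabin.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age":    [22.0, None, 26.0, None],
    "Cabin":  [None, "C85", None, None],
})

# isnull() marks every missing cell; sum() adds the marks up per column.
null_counts = df.isnull().sum()
print(null_counts)
```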
7. Replace the null values
We have the replace() function to replace all the null values with a specific value,
which is very handy.
It is very easy to find and replace the null values in the data as shown. I have used 0 to
replace null values. You can even opt for more meaningful methods such as mean or
median.
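The two options mentioned here can be sketched as follows, on a made-up sample with an assumed `Age` column. The median fill uses `fillna()`, which is a pandas alternative to `replace()` and not necessarily what the lesson's original code used.

```python
import numpy as np
import pandas as pd

# Illustrative sample (made-up values).
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 30.0]})

# Option 1: replace every null with 0 (simple, but it distorts the distribution).
filled_zero = df.replace(np.nan, 0)

# Option 2: fill nulls with the column median, computed from the non-null values.
filled_median = df.copy()
filled_median["Age"] = filled_median["Age"].fillna(filled_median["Age"].median())

print(filled_zero["Age"].tolist())
print(filled_median["Age"].tolist())
```

Both calls return new objects, so the original `df` still contains its null values unless you assign the result back.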
8. Know the datatypes
Knowing the datatypes of the data you are exploring is very important, and it is an easy
process too. Let’s see how it works.
You have to use the dtypes attribute (df.dtypes) for this, and you will get the datatype
of each attribute.
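A minimal sketch on an illustrative sample (column names are assumptions):

```python
import pandas as pd

# Illustrative sample (made-up values).
df = pd.DataFrame({
    "Pclass": [3, 1],
    "Age":    [22.0, 38.0],
    "Sex":    ["male", "female"],
})

# dtypes is an attribute, not a method, so it is used without parentheses.
column_types = df.dtypes
print(column_types)
```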
9. Filter the data
You can filter the data on a condition. For example, filtering on the passenger class
returns only the data values that belong to class 1.
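A filter of this kind can be sketched with a boolean mask. The `Pclass` column name is an assumption, and the sample values are made up.

```python
import pandas as pd

# Illustrative sample (made-up values).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 30.07],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows.
first_class = df[df["Pclass"] == 1]
print(first_class)
```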
10. A quick box plot
You can create a box plot for any numerical column using a single line of code.
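One way to do this is pandas' built-in `boxplot()` method, sketched below on a made-up sample with an assumed `Fare` column. It relies on matplotlib being installed.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative sample (made-up values); the last fare is a deliberate outlier.
df = pd.DataFrame({"Fare": [7.25, 7.92, 8.05, 13.00, 26.55, 30.07, 250.00]})

# The single line that draws the box plot for one numeric column.
ax = df.boxplot(column="Fare")

# plt.show()  # uncomment to display the chart in an interactive session
```

The box spans the first to third quartile, and points drawn beyond the whiskers are candidate outliers worth a closer look.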
11. Correlation Plot - EDA
Finally, to find the correlation among the variables, we can make use of the corr()
function. This will give you a fair idea of the correlation strength between different
variables.
The correlation matrix ranges from +1 to -1, where +1 is a perfect positive
correlation and -1 is a perfect negative correlation.
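As a sketch, the correlation matrix can be computed as below on a made-up sample. The column names are assumptions, and the non-numeric column is dropped first because correlation is only defined for numeric data.

```python
import pandas as pd

# Illustrative sample (made-up values).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 30.07, 8.05],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr_matrix = df.select_dtypes(include="number").corr()
print(corr_matrix.round(2))
```

To turn the matrix into a plot, a heatmap such as seaborn's `sns.heatmap(corr_matrix)` is one common choice.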
Exploratory Data Analysis – EDA Summary
2. During the data preprocessing step, how should one treat missing/null
values? How will you deal with them?