Unit - III - EDA
1. Discovery:
The first phase is discovery, which involves asking the right questions.
When we start any data science project, we need to determine the basic requirements, priorities, and project budget.
In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation:
3. Model Planning:
In this phase, we need to determine the various methods and techniques to establish the relations between the input variables.
We will apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between the variables and to see what the data can tell us.
Common tools used for model planning are:
1. SQL Analysis Services
2. R
3. SAS
4. Python
4. Model-building:
5. Operationalize:
In this phase, we will deliver the final reports of the project, along with briefings, code, and
technical documents.
This phase provides us with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results:
In this phase, we will check whether we have reached the goal that we set in the initial phase. We will communicate the findings and the final result to the business team.
Exploratory Data Analysis(EDA)
What is Data Analysis:
Exploratory Data Analysis (EDA) is the first step in the data analysis process.
Exploratory Data Analysis (EDA) was developed by John Tukey in the 1970s.
Exploratory Data Analysis (EDA) means understanding data sets by summarizing their main characteristics, often by plotting them visually.
This step is very important especially when we arrive at modeling the data in order to apply
Machine learning.
Plotting in EDA consists of Histograms, Box plot, Scatter plot and many more.
It often takes much time to explore the data.
Exploratory Data Analysis helps us to –
Give insight into a data set.
Understand the underlying structure.
Extract important parameters and relationships that hold between them.
Test underlying assumptions
1. Importing the libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
2. Loading the data into the data frame.
Loading the data into the pandas data frame is certainly one of the most important steps in
EDA.
The values in the data set are comma-separated, so we read the CSV into a pandas data frame:
df =pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")
df.head() # To display the top 5 rows
Here we check the datatypes, because sometimes the Salary of the employees could be stored as a string. In that case, we would have to convert the string to numeric data; only then could we plot the data in a graph.
Here the data is already in integer format, so there is nothing to worry about.
df.dtypes
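If a numeric column has been read in as strings, it can be converted with pd.to_numeric; a minimal sketch using a small hypothetical frame (the values are made up):

```python
import pandas as pd

# Hypothetical frame where Salary was read in as strings
df_demo = pd.DataFrame({"Salary": ["97308", "61933", "130590"]})
print(df_demo.dtypes)  # Salary is object (i.e. string)

# Convert the strings to integers so the column can be plotted
df_demo["Salary"] = pd.to_numeric(df_demo["Salary"])
print(df_demo.dtypes)  # Salary is now int64
```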
Let’s see the shape of the data using the shape attribute.
df.shape
(1000,8)
This means that this dataset has 1000 rows and 8 columns.
Let’s get a quick summary of the dataset using the describe() method. The describe() function
applies basic statistical computations on the dataset like extreme values, count of data points
standard deviation, etc. Any missing value or NaN value is automatically skipped. describe()
function gives a good picture of the distribution of data.
df.describe()
Now, let’s also look at the columns and their data types. For this, we will use the info() method.
df.info()
4. Dropping irrelevant columns.
This step is needed in almost every EDA, because there are often columns that we never use. In such cases, we drop the irrelevant columns.
For example, in this case, columns such as Last Login Time and Senior Management don't add anything to the analysis, so we can simply drop them.
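A minimal sketch of the drop itself (the column names follow the employees.csv file used above, but the values here are hypothetical; only Last Login Time is dropped, since Senior Management is filled in a later step of these notes):

```python
import pandas as pd

# Toy frame with the same kind of columns (values are hypothetical)
df = pd.DataFrame({
    "First Name": ["Maria", "Douglas"],
    "Last Login Time": ["12:42 PM", "6:53 AM"],
    "Salary": [130590, 97308],
})

# Drop the column we never use; errors="ignore" makes re-running safe
df = df.drop(columns=["Last Login Time"], errors="ignore")
print(df.columns.tolist())  # ['First Name', 'Salary']
```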
df.head(5)
5. Renaming the columns.
In this instance, some of the column names are confusing to read, so we rename them.
This is a good practice: it improves the readability of the data set.
df = df.rename(columns={"Start Date": "SDate"})
df.head(5)
6. Handling missing values.
Missing data is a very big problem in real-life scenarios. Missing data can also be referred to as NA (Not Available) values in pandas. There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
print(df.isnull().sum())
We can see that every column has a different number of missing values: Gender has 145 missing values, while Salary has 0. For handling these missing values there are several options, like dropping the rows containing NaN or replacing NaN with the mean, median, mode, or some other value.
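As a sketch of the mean option (with made-up numbers):

```python
import pandas as pd
import numpy as np

# Toy Salary column containing a missing value
s = pd.Series([100.0, 200.0, np.nan, 300.0], name="Salary")

# Replace NaN with the column mean; median()/mode() work the same way
filled = s.fillna(s.mean())
print(filled.tolist())  # [100.0, 200.0, 200.0, 300.0]
```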
Now, let’s try to fill the missing values of gender with the string “No Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()
Now, Let’s fill the senior management with the mode value.
mode = df['Senior Management'].mode().values[0]
df['Senior Management']= df['Senior Management'].replace(np.nan, mode)
df.isnull().sum()
Now for the first name and team, we cannot fill the missing values with arbitrary data, so,
let’s drop all the rows containing these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
7. Detecting Outliers:
Outliers are extreme values that deviate from the other observations in the dataset.
An outlier is a point or set of points that differ from the rest of the points.
They can be very high or very low values.
It is often a good idea to detect and remove outliers, because they are one of the primary reasons for a less accurate model.
The IQR (Inter-Quartile Range) score technique is used to detect and remove outliers.
Outliers can be seen with visualizations using a box plot.
sns.boxplot(x=df['Salary'])
Here, in all the plots, we can see some points lying outside the box and whiskers: these are the outliers.
# Restrict to numeric columns so the quantile comparison is well defined
num = df.select_dtypes(include=np.number)
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
df.shape
8. Data Visualization
Data Visualization is the process of presenting data in the form of graphs or maps, which makes it a lot easier to understand the trends or patterns in the data. There are various types of visualizations –
Histogram Plot
A histogram shows the distribution of a numeric variable by grouping the values into bins and plotting the frequency of each bin.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df)
plt.show()
Box Plot
Box Plot is the visual representation of groups of numerical data through their quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare easily across groups.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x="Salary", y='Team', data=df)
plt.show()
Heat Maps Plot
A heat map is a type of plot that is useful when we need to find the dependent variables.
One of the best ways to find the relationships between the features is to use a heat map.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
c = df.corr(numeric_only=True)
sns.heatmap(c, cmap="BrBG", annot=True)
Scatter Plot
A Scatter plot is a type of data visualization technique that shows the relationship between two numerical variables.
Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
Descriptive Statistics
Descriptive Statistics is the default process in Data analysis.
Exploratory Data Analysis (EDA) is not complete without a Descriptive Statistic analysis.
Descriptive Statistics is divided into two parts:
1. Measures of Central Tendency and
2. Measures of Dispersion.
1. Measures of Central Tendency
The following operations are performed under Measures of Central Tendency. Each of these measures describes a different indication of the typical or central value in the distribution.
1. Count
2. Mean
3. Mode
4. Median
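All four measures are one-liners in pandas; a sketch on a small hypothetical series:

```python
import pandas as pd

# Hypothetical numeric column
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.count())    # 8   -> number of data points
print(s.mean())     # 5.0 -> sum of values / number of rows
print(s.median())   # 4.5 -> middle value of the sorted data
print(s.mode()[0])  # 4   -> most frequent value (there can be several)
```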
2. Measures of Dispersion
The following operations are performed under Measures of Dispersion. Measures of dispersion can be defined as positive real numbers that measure how homogeneous or heterogeneous the given data is.
1. Range
2. Percentiles (or) Quartiles
3. Standard deviation
4. Variance
5. Skewness
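pandas covers each of these as well; a sketch on a small hypothetical series (note that std() and var() use the sample versions, dividing by n − 1):

```python
import pandas as pd

# Hypothetical numeric column
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.max() - s.min())                    # 7 -> range
print(s.quantile(0.75) - s.quantile(0.25))  # inter-quartile range
print(s.std())                              # standard deviation
print(s.var())                              # variance = std ** 2
print(s.skew())                             # skewness (0 if symmetric)
```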
Example:
Consider a file:
https://media.geeksforgeeks.org/wp-content/uploads/employees.csv
Before starting the descriptive statistics analysis, complete the data collection and cleaning process.
Data Collection:
Data Cleaning:
Data cleaning means fixing bad data in your data set before data analysis.
Describe() method :
Let’s get a quick summary of the dataset using the describe() method.
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points standard deviation, etc.
Any missing value or NaN value is automatically skipped.
describe() function gives a good picture of the distribution of data.
Count
It calculates the total count of the values of a numerical column, or of each category of a categorical variable.
height = df["height"]
print(height)
Mean
The sum of the values present in a column divided by the total number of rows of that column is known as the mean.
Median
The median value divides the data points into two parts: 50% of the data points lie above the median and 50% below.
Mode
There is only one mean and one median value for each column, but an attribute can have more than one mode value.
Measures of Dispersion
Range
The difference between the maximum value and the minimum value in a column is known as the range.
Standard Deviation
The standard deviation value tells us how much the data points deviate from the mean value.
The standard deviation is affected by outliers because it uses the mean in its calculation:
σ = √( Σ (xi − x̄)² / n )
where:
σ = Standard Deviation
xi = Terms Given in the Data
x̄ = Mean
n = Total number of Terms
df.loc[:,"height"].std()
Output:
2.442525704031867
Variance
Variance is the square of the standard deviation. In the case of outliers, the variance value becomes large and noticeable.
df.loc[:,"height"].var()
Output:
5.965931814856368
Skewness
Ideally, the distribution of data should be in the shape of Gaussian (bell curve).
But practically, data shapes are skewed or have asymmetry. This is known as skewness in
data.
The skewness value can indicate a negative (left) skew or a positive (right) skew. Ideally, its value should be close to zero.
Example for skewness
df.skew()
df.loc[:,"height"].skew()
Output:
0.06413448813322854
Percentiles or Quartiles
Quartiles divide an ordered data set into four equal parts. Based on the quartiles, there is another measure, called the inter-quartile range (IQR), that also measures the variability in the dataset. It is defined as:
IQR = Q3 - Q1
Output:
8718.5
Basic tools (plots, graphs and summary statistics) of EDA
Exploratory data analysis or “EDA” is a critical first step in analyzing the data
The uses of EDA are:
1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the explanatory variables
Data Types:
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no
quantitative value.
Nominal data has no order: if the order of the values changed, their meaning would not change.
Examples: gender, nationality, eye colour (categories with no natural order).
Ordinal Data
Ordinal values represent discrete and ordered units, such as education level: the order matters, but the differences between the values are not meaningful.
Numerical Data
1. Discrete Data
Discrete data has values that are countable, such as the number of employees in a company.
2. Continuous Data
Continuous data represents measurements, so its values cannot be counted but can be measured.
An example would be the height of a person, which can be described using intervals on the real number line.
Interval Data
Interval values represent ordered units that have the same difference between them.
Example: temperature in degrees Celsius, where the difference between 20° and 30° is the same as between 30° and 40°, but there is no true zero.
Ratio Data
Ratio values are also ordered units that have the same difference.
Ratio values are the same as interval values, with the difference that they do have an absolute
zero.
Good examples are height, weight, length etc.
Types of EDA
Univariate non-graphical:
This is the simplest form of data analysis among the four options.
In this type of analysis, the data that is being analysed consists of just a single variable.
The main purpose of this analysis is to describe the data and to find patterns.
Univariate graphical:
Non-graphical methods give a quantitative summary but not a full picture of the data, so graphical methods such as histograms, box plots, and stem-and-leaf plots are used to examine a single variable visually.
Multivariate non-graphical:
The multivariate non-graphical type of EDA generally depicts the relationship between
multiple variables of data through cross-tabulation or statistics.
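A cross-tabulation is a single pandas call; a sketch with hypothetical Gender/Team values (column names borrowed from the employees data used earlier):

```python
import pandas as pd

# Hypothetical categorical data for two variables
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Team": ["Sales", "Sales", "HR", "HR", "Sales"],
})

# Count every Gender/Team combination
ct = pd.crosstab(df["Gender"], df["Team"])
print(ct)
```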
Multivariate graphical:
This type of EDA displays the relationship between two or more sets of data. An example is a grouped bar chart, where each group represents a level of one of the variables and each bar within the group represents the levels of the other variables.
COVARIANCE
Covariance is a measure of the relationship between two variables that is scale-dependent, i.e. it shows how much one variable changes when another variable changes.
This can be represented with the following equation:
cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)
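pandas implements the sample covariance directly; a sketch with made-up paired values:

```python
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1, 2, 3, 4])
y = pd.Series([2, 4, 6, 8])

# Sample covariance: sum((xi - x_mean) * (yi - y_mean)) / (n - 1)
print(x.cov(y))  # positive -> x and y increase together
```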
CORRELATION
Correlation is a normalized, scale-independent form of covariance whose value always lies between −1 and +1. It can be calculated easily within Python, particularly when using Pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df.corr()
The important reasons to implement EDA when working with data are:
1. To gain intuition about the data;
2. To make comparisons between distributions;
3. For sanity checking (making sure the data is on the scale we expect, in the
format we thought it should be);
4. To find out where data is missing or if there are outliers; and to summarize the data.
In the context of data generated from logs, EDA also helps with debugging the logging
process.
In the end, EDA helps us to make sure the product is performing as intended.
There’s lots of visualization involved in EDA.
The distinction between EDA and data visualization is that EDA is done toward the beginning of the analysis, while data visualization is done toward the end to communicate one's findings.
With EDA, the graphics are solely done for us to understand what’s going on.
EDA is also used to improve the development of algorithms.
Data Visualization
What is Data Visualization:
Data Visualization is the graphical representation of data so that trends and patterns are easier to see. A good visualization has three characteristics:
Clarity - Clarity ensures that the data set is complete and relevant.
Accuracy – Accuracy ensures using appropriate graphical representation to convey the right
message.
Efficiency - Efficiency means using an efficient visualization technique that highlights all the data points.
Visual effect
Coordination System
Data Types and Scale
Informative Interpretation
Visual effect - Visual Effect includes the usage of appropriate shapes, colors, and
size to represent the analyzed data.
Coordination System - The Coordinate System helps to organize the data points
within the provided coordinates.
Data Types and Scale - The Data Types and Scale choose the type of data such as
numeric or categorical.
Informative Interpretation – The Informative Interpretation helps create visuals in an effective and easily interpretable manner using labels, titles, legends, and pointers.
Matplotlib
Pandas Visualization
Seaborn
ggplot
Plotly
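Of these, Pandas Visualization (a thin wrapper around Matplotlib) is often the quickest to use; a minimal sketch with hypothetical salary values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, runs without a display
import matplotlib.pyplot as plt

# Hypothetical Salary values
df = pd.DataFrame({"Salary": [35000, 42000, 42000, 61000, 97000]})

# pandas forwards plot() calls to matplotlib
ax = df["Salary"].plot(kind="hist", title="Salary distribution")
plt.savefig("salary_hist.png")
```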
Plots (graphics), also known as charts, are a visual representation of data in the form of (mostly) colored graphics. Common examples are the histogram, box plot, heat map, and scatter plot shown above.