Unit - III - EDA
1. Discovery:
The first phase is discovery, which involves asking the right questions.
When we start any data science project, we need to determine the basic requirements, priorities, and project budget.
In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation:
3. Model Planning:
In this phase, we need to determine the various methods and techniques to establish the relations between the input variables.
We will apply Exploratory Data Analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between the variables and to see what the data can tell us.
Common tools used for model planning are:
1. SQL Analysis Services
2. R
3. SAS
4. Python
4. Model-building:
5. Operationalize:
In this phase, we will deliver the final reports of the project, along with briefings, code, and
technical documents.
This phase provides us with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results:
In this phase, we will check whether we have reached the goal that we set in the initial phase. We will communicate the findings and the final result to the business team.
Exploratory Data Analysis(EDA)
What is Data Analysis:
Exploratory Data Analysis (EDA) is the first step in the data analysis process.
Exploratory Data Analysis (EDA) was developed by John Tukey in the 1970s.
Exploratory Data Analysis (EDA) means understanding data sets by summarizing their main characteristics, often by plotting them visually.
This step is very important especially when we arrive at modeling the data in order to apply
Machine learning.
Plotting in EDA consists of Histograms, Box plot, Scatter plot and many more.
It often takes much time to explore the data.
Exploratory Data Analysis helps us to –
Give insight into a data set.
Understand the underlying structure.
Extract important parameters and relationships that hold between them.
Test underlying assumptions
1. Importing the libraries.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
2. Loading the data into the data frame.
Loading the data into the pandas data frame is certainly one of the most important steps in
EDA.
The values in the data set are comma-separated, so we read the CSV into a pandas data frame:
df =pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/employees.csv")
df.head() # To display the top 5 rows
Here we check the datatypes, because sometimes the Salary of the employees could be stored as a string. In that case, we would have to convert the string to numeric data; only then could we plot the data in a graph.
Here the data is already in integer format, so there is nothing to worry about.
df.dtypes
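If a numeric column has been read in as strings, it can be converted with pd.to_numeric; a minimal sketch using a small hypothetical frame (the values are made up):

```python
import pandas as pd

# Hypothetical frame where Salary was read in as strings
df_demo = pd.DataFrame({"Salary": ["97308", "61933", "130590"]})
print(df_demo.dtypes)  # Salary is object (i.e. string)

# Convert the strings to integers so the column can be plotted
df_demo["Salary"] = pd.to_numeric(df_demo["Salary"])
print(df_demo.dtypes)  # Salary is now int64
```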
Let’s see the shape of the data using the shape attribute.
df.shape
(1000,8)
This means that this dataset has 1000 rows and 8 columns.
Let’s get a quick summary of the dataset using the describe() method. The describe() function
applies basic statistical computations on the dataset like extreme values, count of data points
standard deviation, etc. Any missing value or NaN value is automatically skipped. describe()
function gives a good picture of the distribution of data.
df.describe()
Now, let’s also look at the columns and their data types. For this, we will use the info() method.
df.info()
4. Dropping irrelevant columns.
This step is needed in almost every EDA, because there are often columns that we never use. In such cases, we drop the irrelevant columns.
For example, in this case, columns such as Last Login Time and Senior Management don't add anything to the analysis, so we can simply drop them.
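A minimal sketch of the drop itself (the column names follow the employees.csv file used above, but the values here are hypothetical; only Last Login Time is dropped, since Senior Management is filled in a later step of these notes):

```python
import pandas as pd

# Toy frame with the same kind of columns (values are hypothetical)
df = pd.DataFrame({
    "First Name": ["Maria", "Douglas"],
    "Last Login Time": ["12:42 PM", "6:53 AM"],
    "Salary": [130590, 97308],
})

# Drop the column we never use; errors="ignore" makes re-running safe
df = df.drop(columns=["Last Login Time"], errors="ignore")
print(df.columns.tolist())  # ['First Name', 'Salary']
```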
df.head(5)
5. Renaming the columns.
In this instance, some of the column names are confusing to read, so we rename them.
This is a good practice: it improves the readability of the data set.
df = df.rename(columns={"Start Date": "SDate"})
df.head(5)
6. Handling missing values.
Missing data is a very big problem in real-life scenarios. Missing data can also be referred to as NA (Not Available) values in pandas. There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
print(df.isnull().sum())
We can see that every column has a different number of missing values: Gender has 145 missing values, while Salary has 0. For handling these missing values there are several options, like dropping the rows containing NaN or replacing NaN with the mean, median, mode, or some other value.
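As a sketch of the mean option (with made-up numbers):

```python
import pandas as pd
import numpy as np

# Toy Salary column containing a missing value
s = pd.Series([100.0, 200.0, np.nan, 300.0], name="Salary")

# Replace NaN with the column mean; median()/mode() work the same way
filled = s.fillna(s.mean())
print(filled.tolist())  # [100.0, 200.0, 200.0, 300.0]
```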
Now, let’s try to fill the missing values of gender with the string “No Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()
Now, Let’s fill the senior management with the mode value.
mode = df['Senior Management'].mode().values[0]
df['Senior Management']= df['Senior Management'].replace(np.nan, mode)
df.isnull().sum()
Now for the first name and team, we cannot fill the missing values with arbitrary data, so,
let’s drop all the rows containing these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
7. Detecting Outliers:
Outliers are extreme values that deviate from the other observations in the dataset.
An outlier is a point or set of points that differ from the rest of the points.
They can be very high or very low values.
It is often a good idea to detect and remove outliers, because they are one of the primary reasons for a less accurate model.
The IQR (Inter-Quartile Range) score technique is used to detect and remove outliers.
Outliers can be seen with visualizations using a box plot.
sns.boxplot(x=df['Salary'])
Here, in all the plots, we can see some points lying outside the box and whiskers: these are the outliers.
# Restrict to numeric columns so the quantile comparison is well defined
num = df.select_dtypes(include=np.number)
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape
df.shape
8. Data Visualization
Data Visualization is the process of presenting data in the form of graphs or maps, which makes it a lot easier to understand the trends or patterns in the data. There are various types of visualizations –
Histogram Plot
A histogram shows the distribution of a numeric variable by grouping the values into bins and plotting the frequency of each bin.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df)
plt.show()
Box Plot
Box Plot is the visual representation of groups of numerical data through their quartiles. A boxplot is also used to detect outliers in a data set. It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare easily across groups.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x="Salary", y='Team', data=df)
plt.show()
Heat Maps Plot
A heat map is a type of plot that is useful when we need to find the dependent variables.
One of the best ways to find the relationships between the features is to use a heat map.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
c = df.corr(numeric_only=True)
sns.heatmap(c, cmap="BrBG", annot=True)
Scatter Plot
A Scatter plot is a type of data visualization technique that shows the relationship between two numerical variables.
Calling the scatter() method on the plot member draws a plot between two variables or two
columns of pandas DataFrame.
Example:
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['First Name'], df['Salary'])
ax.set_xlabel('First Name')
ax.set_ylabel('Salary')
plt.show()
Descriptive Statistics
Descriptive Statistics is the default process in Data analysis.
Exploratory Data Analysis (EDA) is not complete without a Descriptive Statistic analysis.
Descriptive Statistics is divided into two parts:
1. Measures of Central Tendency and
2. Measures of Dispersion.
1. Measures of Central Tendency
The following operations are performed under Measures of Central Tendency. Each of these measures describes a different indication of the typical or central value in the distribution.
1. Count
2. Mean
3. Mode
4. Median
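All four measures are one-liners in pandas; a sketch on a small hypothetical series:

```python
import pandas as pd

# Hypothetical numeric column
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.count())    # 8   -> number of data points
print(s.mean())     # 5.0 -> sum of values / number of rows
print(s.median())   # 4.5 -> middle value of the sorted data
print(s.mode()[0])  # 4   -> most frequent value (there can be several)
```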
2. Measures of Dispersion
The following operations are performed under Measures of Dispersion. Measures of dispersion can be defined as positive real numbers that measure how homogeneous or heterogeneous the given data is.
1. Range
2. Percentiles (or) Quartiles
3. Standard deviation
4. Variance
5. Skewness
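pandas covers each of these as well; a sketch on a small hypothetical series (note that std() and var() use the sample versions, dividing by n − 1):

```python
import pandas as pd

# Hypothetical numeric column
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.max() - s.min())                    # 7 -> range
print(s.quantile(0.75) - s.quantile(0.25))  # inter-quartile range
print(s.std())                              # standard deviation
print(s.var())                              # variance = std ** 2
print(s.skew())                             # skewness (0 if symmetric)
```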
Example:
Consider a file:
https://media.geeksforgeeks.org/wp-content/uploads/employees.csv
Before starting the descriptive statistics analysis, complete the data collection and cleaning process.
Data Collection:
Data Cleaning:
Data cleaning means fixing bad data in your data set before data analysis.
Describe() method :
Let’s get a quick summary of the dataset using the describe() method.
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points standard deviation, etc.
Any missing value or NaN value is automatically skipped.
describe() function gives a good picture of the distribution of data.
Count
It calculates the total count of the values of a numerical column, or of each category of a categorical variable.
height = df["height"]
print(height)
Mean
The sum of the values present in a column divided by the total number of rows of that column is known as the mean.
Median
The median value divides the data points into two parts: 50% of the data points lie above the median and 50% below.
Mode
There is only one mean and one median value for each column, but an attribute can have more than one mode value.
Measures of Dispersion
Range
The difference between the maximum value and the minimum value in a column is known as the range.
Standard Deviation
The standard deviation value tells us how much the data points deviate from the mean value.
The standard deviation is affected by outliers because it uses the mean in its calculation:
σ = √( Σ (xi − x̄)² / n )
where:
σ = Standard Deviation
xi = Terms Given in the Data
x̄ = Mean
n = Total number of Terms
df.loc[:,"height"].std()
Output:
2.442525704031867
Variance
Variance is the square of the standard deviation. In the case of outliers, the variance value becomes large and noticeable.
df.loc[:,"height"].var()
Output:
5.965931814856368
Skewness
Ideally, the distribution of data should be in the shape of Gaussian (bell curve).
But practically, data shapes are skewed or have asymmetry. This is known as skewness in
data.
The skewness value can indicate a negative (left) skew or a positive (right) skew. Ideally, its value should be close to zero.
Example for skewness
df.skew()
df.loc[:,"height"].skew()
Output:
0.06413448813322854
Percentiles or Quartiles
Quartiles divide an ordered data set into four equal parts. Based on the quartiles, there is another measure, called the inter-quartile range (IQR), that also measures the variability in the dataset. It is defined as:
IQR = Q3 - Q1
Output:
8718.5
Basic tools (plots, graphs and summary statistics) of EDA
Exploratory data analysis or “EDA” is a critical first step in analyzing the data
The uses of EDA are:
1. Detection of mistakes
2. Checking of assumptions
3. Preliminary selection of appropriate models
4. Determining relationships among the explanatory variables
Data Types:
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no
quantitative value.
Nominal data has no order: if the order of the values changed, their meaning would not change.
Examples: gender, nationality, eye colour (categories with no natural order).
Ordinal Data
Ordinal values represent discrete and ordered units, such as education level: the order matters, but the differences between the values are not meaningful.
Numerical Data
1. Discrete Data
Discrete data has values that are countable, such as the number of employees in a company.
2. Continuous Data
Continuous data represents measurements, so its values cannot be counted but can be measured.
An example would be the height of a person, which can be described using intervals on the real number line.
Interval Data
Interval values represent ordered units that have the same difference between them.
Example: temperature in degrees Celsius, where the difference between 20° and 30° is the same as between 30° and 40°, but there is no true zero.
Ratio Data
Ratio values are also ordered units that have the same difference.
Ratio values are the same as interval values, with the difference that they do have an absolute
zero.
Good examples are height, weight, length etc.
Types of EDA
Univariate non-graphical:
This is the simplest form of data analysis among the four options.
In this type of analysis, the data that is being analysed consists of just a single variable.
The main purpose of this analysis is to describe the data and to find patterns.
Univariate graphical:
Non-graphical methods give a quantitative summary but not a full picture of the data, so graphical methods such as histograms, box plots, and stem-and-leaf plots are used to examine a single variable visually.
Multivariate non-graphical:
The multivariate non-graphical type of EDA generally depicts the relationship between
multiple variables of data through cross-tabulation or statistics.
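A cross-tabulation is a single pandas call; a sketch with hypothetical Gender/Team values (column names borrowed from the employees data used earlier):

```python
import pandas as pd

# Hypothetical categorical data for two variables
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Team": ["Sales", "Sales", "HR", "HR", "Sales"],
})

# Count every Gender/Team combination
ct = pd.crosstab(df["Gender"], df["Team"])
print(ct)
```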
Multivariate graphical:
This type of EDA displays the relationship between two or more sets of data. An example is a grouped bar chart, where each group represents a level of one of the variables and each bar within the group represents the levels of the other variables.
COVARIANCE
Covariance is a measure of the relationship between two variables that is scale-dependent, i.e. it shows how much one variable changes when another variable changes.
This can be represented with the following equation:
cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)
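pandas implements the sample covariance directly; a sketch with made-up paired values:

```python
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1, 2, 3, 4])
y = pd.Series([2, 4, 6, 8])

# Sample covariance: sum((xi - x_mean) * (yi - y_mean)) / (n - 1)
print(x.cov(y))  # positive -> x and y increase together
```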
CORRELATION
Correlation is a normalized, scale-independent form of covariance whose value always lies between −1 and +1. It can be calculated easily within Python, particularly when using Pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df.corr()
The important reasons to implement EDA when working with data are:
1. To gain intuition about the data;
2. To make comparisons between distributions;
3. For sanity checking (making sure the data is on the scale we expect, in the
format we thought it should be);
4. To find out where data is missing or if there are outliers; and to summarize the data.
In the context of data generated from logs, EDA also helps with debugging the logging
process.
In the end, EDA helps us to make sure the product is performing as intended.
There’s lots of visualization involved in EDA.
The distinction between EDA and data visualization is that EDA is done toward the beginning of the analysis, while data visualization is done toward the end to communicate one's findings.
With EDA, the graphics are solely done for us to understand what’s going on.
EDA is also used to improve the development of algorithms.
Data Visualization
What is Data Visualization:
Data Visualization is the graphical representation of data so that trends and patterns are easier to see. A good visualization has three characteristics:
Clarity - Clarity ensures that the data set is complete and relevant.
Accuracy – Accuracy ensures using appropriate graphical representation to convey the right
message.
Efficiency - Efficiency means using an efficient visualization technique that highlights all the data points.
Visual effect
Coordination System
Data Types and Scale
Informative Interpretation
Visual effect - Visual Effect includes the usage of appropriate shapes, colors, and
size to represent the analyzed data.
Coordination System - The Coordinate System helps to organize the data points
within the provided coordinates.
Data Types and Scale - The Data Types and Scale choose the type of data such as
numeric or categorical.
Informative Interpretation – The Informative Interpretation helps create visuals in an effective and easily interpretable manner using labels, titles, legends, and pointers.
Matplotlib
Pandas Visualization
Seaborn
ggplot
Plotly
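Of these, Pandas Visualization (a thin wrapper around Matplotlib) is often the quickest to use; a minimal sketch with hypothetical salary values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, runs without a display
import matplotlib.pyplot as plt

# Hypothetical Salary values
df = pd.DataFrame({"Salary": [35000, 42000, 42000, 61000, 97000]})

# pandas forwards plot() calls to matplotlib
ax = df["Salary"].plot(kind="hist", title="Salary distribution")
plt.savefig("salary_hist.png")
```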
Plots (graphics), also known as charts, are a visual representation of data in the form of (mostly) colored graphics. Common examples are the histogram, box plot, heat map, and scatter plot shown above.