Exploratory Data Analysis (Eda)

Uploaded by

vishnuai4568

EXPLORATORY DATA ANALYSIS (EDA)

FUNDAMENTALS

• Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Such data is collected and stored by every event or process occurring in several disciplines, including biology, economics, engineering, marketing, and others.

• Processing such data elicits useful information, and processing such information generates useful knowledge.

“EDA is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.”

UNDERSTANDING DATA SCIENCE

Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics. There are several phases of data analysis, including

• data requirements,

• data collection,

• data processing,

• data cleaning,

• exploratory data analysis,

• modeling and algorithms,

• data product, and

• communication.

These phases are similar to the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework.

• Data requirements:

There can be various sources of data for an organization. It is important to comprehend what type of data the organization needs to collect, curate, and store.

For example, an application tracking the sleeping pattern of patients suffering from dementia requires storage for several types of sensor data, such as sleep data, the patient's heart rate, electro-dermal activity, and user activity patterns.

All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application. In addition, it is required to categorize the data as numerical or categorical and to decide the format of storage and dissemination.

• Data collection:

Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company.

✓ As mentioned previously, data can be collected from several objects during several events using different types of sensors and storage tools.

• Data processing:

Preprocessing involves pre-curating the dataset before the actual analysis. Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.

• Data cleaning:

Preprocessed data is still not ready for detailed analysis. It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing value check. These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct record, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values.

✓ Finding such data issues requires us to perform some analytical techniques. Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
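
The cleaning checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are invented, filling with the median is just one possible strategy, and the 1.5 * IQR rule (Tukey's rule) is one common outlier detection method, not necessarily the one these notes have in mind.

```python
from statistics import median, quantiles

# Hypothetical raw records: (patient_id, weight); None marks a missing value.
raw_records = [
    ("001", 68.0),
    ("002", 10.0),
    ("002", 10.0),   # exact duplicate of the previous row
    ("003", None),   # missing weight
    ("004", 23.0),
    ("005", 75.0),
    ("006", 640.0),  # suspiciously large, possibly a data-entry error
]

# Duplicates check: keep the first occurrence of each identical record.
deduplicated = list(dict.fromkeys(raw_records))

# Missing value check: fill missing weights with the median of the known ones.
known_weights = [w for _, w in deduplicated if w is not None]
fill_value = median(known_weights)
filled = [(pid, fill_value if w is None else w) for pid, w in deduplicated]

# Error check: flag weights lying more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles([w for _, w in filled], n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(pid, w) for pid, w in filled if not low <= w <= high]

print("records after deduplication:", len(deduplicated))
print("fill value for missing weights:", fill_value)
print("flagged as outliers:", outliers)
```

A flagged value is only a candidate for correction; whether it is an error or a genuine extreme still has to be decided by someone who knows the data.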

• EDA:

✓ Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message contained in the data. It should be noted that several types of data transformation techniques might be required during the process of exploration.

• Modeling and algorithm:

✓ From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation.

These models or equations involve one or more variables that depend on other variables to cause an event. For example, when buying, say, pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the quantity. Hence, the total price is referred to as the dependent variable, and the unit price and the quantity are referred to as independent variables.

✓ In general, a model always describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
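
The pen example can be written directly as a function. This is a minimal sketch: the function name and the price of 2.5 per pen are assumptions made for illustration.

```python
def total_price(unit_price: float, quantity: int) -> float:
    """Model from the text: Total = UnitPrice * Quantity."""
    return unit_price * quantity


# Hypothetical inputs: pens at 2.5 currency units each.
# Total (dependent variable) changes as Quantity (independent variable) changes.
totals = [total_price(2.5, quantity) for quantity in (1, 2, 4)]
print(totals)
```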

• Data product:

Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product.

✓ A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.

• Communication:

✓ This stage deals with disseminating the results to end stakeholders so that they can use them for business intelligence. One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed result.

THE SIGNIFICANCE OF EDA

Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected.

It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed, in which we go through distinctive analysis processes.

Exploratory data analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.

❖ EDA reveals the ground truth about the content without making any underlying assumptions. This is why data scientists use this process to understand what type of modeling and hypotheses can be created.

❖ Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.
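
As a minimal sketch of the summarizing component, the snippet below computes a few descriptive statistics. It deliberately uses only Python's standard `statistics` module as a stand-in for the packages named above, and the heart-rate values are invented for illustration.

```python
from statistics import mean, median, stdev

# Hypothetical sample: daily resting heart rates (beats per minute).
heart_rates = [62, 64, 61, 90, 63, 65, 62]

summary = {
    "count": len(heart_rates),
    "mean": mean(heart_rates),
    "median": median(heart_rates),
    "stdev": stdev(heart_rates),  # sample standard deviation
    "min": min(heart_rates),
    "max": max(heart_rates),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this sample the mean lies above the median because of the single reading of 90, which is the kind of pattern a summary is meant to surface before any modeling starts.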

Steps in EDA

• Problem definition:

Before trying to extract useful insight from the data, it is essential to define the business problem to be solved. The problem definition works as the driving force for executing a data analysis plan.

The main tasks involved in problem definition are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution plan can be created.

• Data preparation:

This step involves methods for preparing the dataset before actual analysis. In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.

• Data analysis:

This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data, developing predictive models, evaluating the models, and calculating the accuracies.

Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
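
Two of the summarization techniques listed above, grouping and correlation statistics, can be sketched as follows. The observations are invented, and the Pearson coefficient is used here as one common correlation statistic; the notes do not prescribe a specific one.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical observations: (gender, height_cm, weight_kg).
observations = [
    ("Female", 160, 55), ("Female", 165, 60), ("Female", 170, 65),
    ("Male", 170, 70), ("Male", 175, 75), ("Male", 180, 80),
]

# Grouping: mean weight per gender.
weights_by_gender = defaultdict(list)
for gender, _, weight in observations:
    weights_by_gender[gender].append(weight)
mean_weight = {g: mean(ws) for g, ws in weights_by_gender.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlation: how strongly height and weight move together in this sample.
heights = [h for _, h, _ in observations]
weights = [w for _, _, w in observations]
r = pearson(heights, weights)

print(mean_weight)
print(f"height/weight correlation: {r:.3f}")
```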

• Development and representation of the results:

This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. This is also an essential step, as the result analyzed from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA.

Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.

MAKING SENSE OF DATA

It is crucial to identify the type of data under analysis. In this section, we are going to learn about different types of data that you can encounter during analysis. Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients’ data, universities store students’ and teachers’ data, and real estate industries store house and building datasets.

A dataset contains many observations about a particular object. For instance, a dataset about patients in a hospital can contain many observations. A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable. Each observation can have a specific value for each of these variables.

PATIENT_ID = 1001

Name = Yoshmi Mukhiya

Address = Mannsverk 61, 5094, Bergen, Norway

Email = [email protected]

Weight = 10

Gender = Female

PATIENT_ID | NAME                 | ADDRESS                    | DOB       | EMAIL             | GENDER | WEIGHT
-----------+----------------------+----------------------------+-----------+-------------------+--------+-------
001        | Suresh Kumar Mukhiya | Mannsverk, 61              | 30.12.198 | [email protected] | Male   | 68
002        | Yoshmi Mukhiya       | Mannsverk 61, 5094, Bergen | 10.07.201 | [email protected] | Female | 1
003        | Anju Mukhiya         | Mannsverk 61, 5094, Bergen | 10.12.199 | [email protected] | Female | 24
004        | Asha Gaire           | Butwal, Nepal              | 30.11.199 | [email protected] | Female | 23
005        | Ola Nordmann         | Danmark, Sweden            | 12.12.178 | [email protected] | Male   | 75
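
To make the terms concrete, the patient table above can be held in code as a list of observations, each mapping variable names to values. This is a sketch only: the lowercase key names are an assumption, and the DOB and email columns are left out because their values are incomplete in the source.

```python
# Rows taken from the patient table above; DOB and email are omitted
# because their values are incomplete in the source.
patients = [
    {"patient_id": "001", "name": "Suresh Kumar Mukhiya", "gender": "Male", "weight": 68},
    {"patient_id": "002", "name": "Yoshmi Mukhiya", "gender": "Female", "weight": 1},
    {"patient_id": "003", "name": "Anju Mukhiya", "gender": "Female", "weight": 24},
    {"patient_id": "004", "name": "Asha Gaire", "gender": "Female", "weight": 23},
    {"patient_id": "005", "name": "Ola Nordmann", "gender": "Male", "weight": 75},
]

# Each dictionary is one observation; each key is one variable.
variables = list(patients[0].keys())
n_observations = len(patients)

# Selecting one variable across all observations gives a column of values.
weights = [p["weight"] for p in patients]

print(variables)
print(n_observations, "observations")
print(weights)
```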

Most datasets broadly fall into two groups: numerical data and categorical data.

1. Numerical data

This data has a sense of measurement involved in it; for example, a person’s age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is often referred to as quantitative data in statistics. A numerical dataset can be either discrete or continuous.

a) Discrete data

This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of heads in 200 coin flips can take any whole-number value from 0 to 200 (a finite number of cases). A variable that represents a discrete dataset is referred to as a discrete variable.

The discrete variable takes a fixed number of distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.

b) Continuous data

A variable that can have an infinite number of numerical values within a specific range is classified as continuous data. A variable describing continuous data is a continuous variable.

For example, what is the temperature of your city today? Can its possible values be listed as a finite set? They cannot. Similarly, the weight variable in the previous section is a continuous variable.
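
The difference between the two numerical types can be illustrated with a short sketch. The fixed seed and the two temperature readings are assumptions chosen for illustration; note also that floating-point numbers only approximate a truly continuous range.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Discrete: the number of heads in 200 coin flips is one of the
# 201 whole numbers 0, 1, ..., 200, so every possible value can be listed.
heads = sum(rng.random() < 0.5 for _ in range(200))
possible_head_counts = range(0, 201)

# Continuous: between any two temperature readings there is always
# another possible value, so the values cannot be listed out.
low_temp, high_temp = 21.5, 21.6
midpoint = (low_temp + high_temp) / 2

print("heads:", heads, "out of", len(possible_head_counts), "possible counts")
print("a temperature between the two readings:", midpoint)
```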

2. Categorical data

This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of the movies. This data is often referred to as qualitative datasets in statistics. To understand clearly, here are some of the most common types of categorical data you can find in data:

• Gender (Male, Female, Other, or Unknown)

• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)

• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
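
Because a categorical variable takes its values from a fixed set of labels, it is usually summarized by counting how often each label occurs instead of by averaging. A minimal sketch, assuming the gender categories listed above and the gender column of the patient table from earlier in these notes:

```python
from collections import Counter

# Allowed categories, taken from the gender example above.
GENDER_CATEGORIES = {"Male", "Female", "Other", "Unknown"}

# Gender values of the five patients in the earlier table.
genders = ["Male", "Female", "Female", "Female", "Male"]

# Frequency table: the typical summary for a categorical variable.
frequency = Counter(genders)

# Values outside the allowed set would point to a data-entry problem.
unexpected = [g for g in genders if g not in GENDER_CATEGORIES]

print(frequency.most_common())
print("unexpected values:", unexpected)
```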
