Unit 4 Exploratory Data Analysis and the Data Science Process (1)

The document outlines the philosophy and process of Exploratory Data Analysis (EDA), emphasizing curiosity, skepticism, and visualization to uncover insights from data. It details various statistical tools and techniques used in EDA, including measures of central tendency and graphical representations like histograms and box plots. Additionally, it describes the systematic data science process, which involves framing problems, collecting and processing data, exploring insights, performing analysis, and communicating results effectively.

Uploaded by

Aashu Kññûjìyá

Unit 4 Exploratory Data Analysis and the Data Science Process

Philosophy of EDA-
The "philosophy of EDA" isn’t a formally defined doctrine, but it reflects a mindset and
approach to data analysis rooted in curiosity, skepticism, and pragmatism. Exploratory Data
Analysis (EDA), pioneered by statistician John Tukey in the 1970s, emphasizes letting the
data speak for itself before imposing rigid models or assumptions. Here’s a breakdown of
its philosophical underpinnings:
1. Data-Driven Curiosity
EDA starts with an open-ended question: "What’s in the data?" It’s about exploring without
preconceived notions, treating the dataset as a story waiting to be uncovered. The
philosophy here is that truth lies in the raw observations, not in hypotheses you bring to the
table.
Think of it like a detective poking around a crime scene—looking for clues, not assuming
who the culprit is.
2. Skepticism of Assumptions
Traditional statistical methods often rely on assumptions (e.g., normality, linearity). EDA
rejects jumping straight to those, insisting you first check if they even hold. It’s skeptical of
forcing data into a box it doesn’t fit.
Tukey’s view was that assumptions should be earned through evidence, not blindly applied.
This is a rebellion against overly theoretical, model-first approaches.
3. Visualization as Truth-Seeking
EDA leans heavily on graphical tools (histograms, scatter plots, box plots) because seeing
is believing. The philosophy here is that human intuition, paired with visual patterns, can
reveal insights that equations might miss.
It’s almost an artistic stance: the data’s shape, spread, and quirks are more honest when
you *look* at them rather than reduce them to numbers.
4. Embracing the Mess
Real-world data is messy—outliers, missing values, weird distributions. EDA’s philosophy
doesn’t shy away from that chaos; it dives in. Instead of cleaning data to fit a model, it asks,
“What does the mess tell us?”
5. Iterative and Flexible Thinking
EDA isn’t a one-and-done step; it’s a cycle of looking, questioning, and digging deeper. The
philosophy values adaptability—pivot when you spot something unexpected, chase
anomalies, refine your focus as you go.
This pragmatism makes it a tool for doers: engineers, scientists, analysts who need to solve
problems, not just publish papers.
In Essence:
The philosophy of EDA is about approaching data with a beginner’s mind—curious, critical,
and unburdened by dogma (a belief or set of beliefs that people are expected to accept as
true without questioning). It’s a call to listen to the data first, using simple tools to uncover
its structure, quirks, and secrets, before deciding what to do next. It’s less a rigid method
and more a way of thinking: trust the evidence, question the obvious, and let patterns
emerge naturally.

Basic tools (plots, graphs and summary statistics) of EDA

In Exploratory Data Analysis (EDA), summary statistics are used to provide a concise
overview of a dataset's key characteristics, such as its central tendency, dispersion, and
distribution, helping to identify patterns and anomalies.
EDA aims to understand the data, identify patterns, and formulate hypotheses before formal
statistical modeling.
• Types of Summary Statistics:
• Measures of Central Tendency: Mean, median, and mode help describe the
typical or average value of a dataset.
• Measures of Dispersion: Standard deviation, variance, and range quantify the
spread or variability of the data.
• Percentiles and Quartiles: These help understand the distribution of data by
dividing it into different segments
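
The summary statistics listed above can be sketched with Python's standard `statistics`
module. This is a minimal illustration rather than a prescribed workflow: the `data` list
is made up, and the quartiles use the module's default method, so other tools may report
slightly different cut points.

```python
import statistics

# Hypothetical sample: daily order counts (made-up numbers for illustration).
data = [12, 15, 15, 18, 20, 22, 25, 30, 45]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion (sample versions, which divide by n - 1)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)
value_range = max(data) - min(data)

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f} range={value_range}")
print(f"Q1={q1} Q2={q2} Q3={q3}")
```

Here the mean is noticeably larger than the median because the single large value (45)
pulls it upward, which is the kind of hint EDA looks for.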

Data can be represented mostly in these forms-

• Univariate Exploratory Data Analysis – In univariate analysis we examine one variable
(feature) at a time to describe its characteristics. Because only one variable is involved,
we study its distribution rather than relationships between variables. We can work either
with the raw data or with a graphical approach.
o In the univariate raw data (non-graphical) approach, we describe the distribution of one
variable using a sample from the population. Identifying outliers, and deciding whether to
remove them, can also be part of this process.
Examples are-
The measure of Central tendency − Central tendency tries to summarize a whole
population or dataset with a single value that represents its center.
The three measures are the mean, the median, and the mode.
Mean − It is the average of all the observations, i.e., the sum of all observations divided
by the number of observations.
Median − It is the middle value of the observations after arranging them in ascending or
descending order. With an even number of observations, it is the average of the two
middle values.
Mode − It is the most frequently occurring observation.

Variance − It indicates the spread of the data about the mean. It is calculated as the mean
of the squared deviations of the observations from the mean (not the mean of the squared
observations themselves). Its square root is the standard deviation.

Skewness − It is the measure of the asymmetry of the observations. The distribution can be
left-skewed (long tail on the left) or right-skewed (long tail on the right); a symmetric
distribution has skewness close to zero.
Kurtosis − It measures how heavy the tails of a distribution are compared with a normal
distribution. High kurtosis is known as leptokurtic, medium kurtosis (similar to the normal
distribution) as mesokurtic, and low kurtosis as platykurtic.
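
Variance, skewness, and kurtosis can all be written in terms of central moments, which makes
them easy to compute by hand. The sketch below assumes the population (divide by n) versions
and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0). The
function names and sample lists are hypothetical, and statistical packages may apply slightly
different bias corrections.

```python
def central_moment(values, k):
    """Return the k-th central moment: the mean of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def describe_shape(values):
    """Return (variance, skewness, excess kurtosis) using population formulas."""
    variance = central_moment(values, 2)
    std_dev = variance ** 0.5
    skewness = central_moment(values, 3) / std_dev ** 3
    excess_kurtosis = central_moment(values, 4) / std_dev ** 4 - 3
    return variance, skewness, excess_kurtosis


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

for name, sample in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    variance, skewness, excess_kurtosis = describe_shape(sample)
    print(f"{name}: variance={variance:.3f} "
          f"skewness={skewness:.3f} excess_kurtosis={excess_kurtosis:.3f}")
```

The symmetric sample has zero skewness and negative excess kurtosis (platykurtic), while
the sample with the large value 10 has positive skewness (right-skewed).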

o In the Univariate graphical approach, we may use any graphing library to generate
graphs like histograms, boxplots, quantile-quantile plots, violin plots, etc. for
visualization. Data Scientists often use visualization to discover anomalies and patterns.
The graphical method is a more subjective approach to EDA. These are some of the
graphical tools to perform univariate analysis.
o Histograms − They show how many observations fall within each range (bin) of values.
The frequency of each bin is drawn as a rectangle, and the bars can be either vertical or
horizontal. A histogram resembles a bar graph, but its bars are adjacent because they
represent consecutive ranges of a numeric variable rather than separate categories.
o Box plots − Also known as box and whisker plots. They use lines and boxes to show the
distribution of data from one or more groups. The box spans the first to the third
quartile, and a central line indicates the median value. The extended lines (whiskers)
capture the rest of the data, with unusually distant points often drawn individually as
outliers. They are useful for comparing groups of data and for judging symmetry.
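
The numbers behind these two plots can be computed without any graphing library. The sketch
below is illustrative only: the `values` list and the bin width are made up, the quartiles
use the default method of Python's `statistics.quantiles`, and the 1.5 × IQR rule for
flagging outliers is a common convention that the notes above do not specify.

```python
import statistics
from collections import Counter

# Hypothetical measurements (made up), including one unusually large value.
values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 21]

# Histogram: count how many values fall in each equal-width bin.
bin_width = 5
counts = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(counts):
    print(f"[{start:>2}, {start + bin_width:>2}) {'#' * counts[start]}")

# Box plot ingredients: quartiles, interquartile range, and outlier fences.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1} median={median} Q3={q3} IQR={iqr}")
print(f"outliers={outliers}")
```

A plotting library would draw the same information: the bin counts as adjacent bars, and
the quartiles, whiskers, and flagged points as a box plot.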

• Multivariate Non-Graphical (raw data) − Techniques such as cross-tabulation of two or
more variables. The ANOVA test can also play a significant role.
ANOVA, which stands for Analysis of Variance, is a technique that determines whether the
averages of three or more independent groups differ significantly from one another.
The independent variable(s) must be categorical (nominal or ordinal), such as different
groups or treatments, and the dependent variable must be continuous. You cannot conduct
an ANOVA test if your dependent variable is nominal.
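
To make the idea concrete, the F statistic of a one-way ANOVA compares the variability
between group means with the variability inside the groups. The following is a minimal
sketch with a hypothetical function name and made-up groups. It computes only the F
statistic; turning F into a p-value requires the F distribution, which a statistics library
(for example `scipy.stats.f_oneway` in SciPy) provides.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    k = len(groups)        # number of groups (levels of the categorical variable)
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up continuous measurements for three treatment groups.
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F means the group averages differ by more than the within-group noise would
suggest; an F near zero means the group averages are nearly identical.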
• Multivariate Graphical − In visualization analysis for multivariate statistics, the below
plots can be used.
o Scatterplot − It is used to display the relationship between two variables by
plotting the data as dots. Additionally, color coding can be intelligently used to
show groups within the two features based on a third feature.
o Heatmap − In this visualization technique the values are represented with colors
with a legend showing color for different levels of the value. It is a 2d graph.
o Bubble plot − In this graph circles are used to show different values. The size of
each circle on the chart is scaled to the value of the data point, so larger values
appear as larger bubbles.
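
A common use of a heatmap in EDA is to color a correlation matrix, so that strongly related
pairs of variables stand out. The sketch below computes such a matrix by hand; the column
names and values are made up, and Pearson correlation is an assumed choice of measure. The
resulting numbers are what a plotting library would map to colors.

```python
def pearson(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (spread_x * spread_y) ** 0.5


# Hypothetical dataset with three numeric features.
columns = {
    "ad_spend": [1, 2, 3, 4, 5],
    "sales": [2, 4, 6, 8, 10],
    "returns": [5, 4, 3, 2, 1],
}

names = list(columns)
matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]

# Print the matrix as a small table; a heatmap would color these cells.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:>10}" + "".join(f"{value:>10.2f}" for value in row))
```

Values near +1 or -1 indicate strong linear relationships worth inspecting in a scatterplot;
values near 0 indicate little linear relationship.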

Data Science Process-

Data Science is all about a systematic process used by Data Scientists to analyze, visualize
and model large amounts of data. A data science process helps data scientists use the tools
to find unseen patterns, extract data, and convert information to actionable insights that can
be meaningful to the company. This aids companies and businesses in making decisions that
can help in customer retention and profits.
The six data science process steps are as follows:
1. Frame the problem
2. Collect the raw data needed for your problem
3. Process the data for analysis
4. Explore the data
5. Perform in-depth analysis
6. Communicate results of the analysis
Step 1: Framing the Problem
A great way to go through this step is to ask questions like:
• Who are the customers?
• How can they be identified?
• What is the sales process right now?
• Why are they interested in your products?
• What products are they interested in?
Numbers need much more context before they become insights. At the end of
this step, you must have as much information at hand as possible.
Step 2: Collecting the Raw Data for the Problem
After defining the problem, you will need to collect the requisite data to derive insights and
turn the business problem into a probable solution. The process involves thinking through
your data and finding ways to collect and get the data you need. It can include scanning your
internal databases or purchasing databases from external sources.
Many companies store the sales data they have in customer relationship management
(CRM) systems. The CRM data can be easily analyzed by exporting it to more advanced tools
using data pipelines.
Step 3: Processing the Data to Analyze
Data can be messy if it has not been appropriately maintained, leading to errors that easily
corrupt the analysis. These issues include values set to null when they should be zero (or
vice versa), missing values, duplicate values, and many more. You will have to go through
the data and check it for problems to get more accurate insights.
The most common errors that you can encounter and should look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even started
Once you have completed the data cleaning process, your data will be ready for an
exploratory data analysis (EDA).
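
The checks described in this step can be sketched as a small validation pass. Everything
here is hypothetical: the record layout, the field names, the `SALES_START` date, and the
rule that a negative amount is invalid are assumptions chosen to illustrate missing values,
invalid entries, duplicates, and date range errors.

```python
from datetime import date

SALES_START = date(2024, 1, 1)  # hypothetical date on which sales began

# Made-up records containing one example of each kind of problem.
records = [
    {"order_id": 1, "amount": 120.0, "sold_on": date(2024, 1, 15)},
    {"order_id": 2, "amount": None, "sold_on": date(2024, 2, 3)},    # missing value
    {"order_id": 3, "amount": -40.0, "sold_on": date(2024, 2, 9)},   # invalid entry
    {"order_id": 4, "amount": 75.5, "sold_on": date(2023, 12, 20)},  # before sales began
    {"order_id": 1, "amount": 120.0, "sold_on": date(2024, 1, 15)},  # duplicate
]


def find_problems(rows, sales_start):
    """Return a list of (row index, description) pairs for suspicious rows."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        if row["order_id"] in seen_ids:
            problems.append((index, "duplicate order_id"))
        seen_ids.add(row["order_id"])

        if row["amount"] is None:
            problems.append((index, "missing amount"))
        elif row["amount"] < 0:
            problems.append((index, "invalid amount"))

        if row["sold_on"] < sales_start:
            problems.append((index, "sale before start date"))
    return problems


problems = find_problems(records, SALES_START)
for index, description in problems:
    print(f"row {index}: {description}")
```

Reporting problems instead of silently dropping rows keeps the decision about how to fix
each issue (fill in, correct, or discard) in the analyst's hands.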
Step 4: Exploring the Data
In this step, you will have to develop ideas that can help identify hidden patterns and
insights. You will look for interesting patterns in the data, such as why sales of a
particular product or service have gone up or down, and then examine those parts of the
data more thoroughly. This is one of the most crucial steps in the data science process.
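
As a small illustration of spotting which products have gone up or down, the sketch below
totals units per product per month and reports the relative change between the first and
last month. The sales rows, product names, and month labels are made up for the example.

```python
from collections import defaultdict

# Hypothetical sales rows: (month, product, units sold).
sales = [
    ("2024-01", "widget", 100),
    ("2024-02", "widget", 140),
    ("2024-01", "gadget", 90),
    ("2024-02", "gadget", 60),
]

# Total units per product per month.
totals = defaultdict(dict)
for month, product, units in sales:
    totals[product][month] = totals[product].get(month, 0) + units

# Relative change between the earliest and latest month for each product.
changes = {}
for product, by_month in totals.items():
    months = sorted(by_month)
    first, last = by_month[months[0]], by_month[months[-1]]
    changes[product] = (last - first) / first

for product, change in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{product}: {change:+.0%}")
```

A summary like this only shows where something changed; explaining why it changed is the
follow-up question that drives the rest of the exploration.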
Step 5: Performing In-depth Analysis
This step will test your mathematical, statistical, and technological knowledge. You must use
the data science tools available to you to crunch the data and discover every insight you can.
You might have to prepare a predictive model that can compare your average customer with
those who are underperforming. Your analysis might identify factors such as age or social
media activity as crucial in predicting the consumers of a service or product. It might also
reveal preferences that affect the customer, for example that some people prefer being
reached over the phone rather than through social media.
Step 6: Communicating Results of this Analysis
After all these data science steps, it is vital to convey your insights and findings to the
sales head and make them understand their importance. Clear communication helps solve the
problem you have been given: proper communication will lead to action, whereas poor
communication may lead to inaction.
You need to link the data you have collected and your insights with the sales head's
knowledge so that they can understand it better.
