Unit 4 Exploratory Data Analysis and the Data Science Process (1)
Philosophy of EDA
The "philosophy of EDA" isn’t a formally defined doctrine, but it reflects a mindset and
approach to data analysis rooted in curiosity, skepticism, and pragmatism. Exploratory Data
Analysis (EDA), pioneered by statistician John Tukey in the 1970s, emphasizes letting the
data speak for itself before imposing rigid models or assumptions. Here’s a breakdown of
its philosophical underpinnings:
1. Data-Driven Curiosity
EDA starts with an open-ended question: "What’s in the data?" It’s about exploring without
preconceived notions, treating the dataset as a story waiting to be uncovered. The
philosophy here is that truth lies in the raw observations, not in hypotheses you bring to the
table.
Think of it like a detective poking around a crime scene—looking for clues, not assuming
who the culprit is.
2. Skepticism of Assumptions
Traditional statistical methods often rely on assumptions (e.g., normality, linearity). EDA
rejects jumping straight to those, insisting you first check if they even hold. It’s skeptical of
forcing data into a box it doesn’t fit.
Tukey’s view was that assumptions should be earned through evidence, not blindly applied.
This is a rebellion against overly theoretical, model-first approaches.
3. Visualization as Truth-Seeking
EDA leans heavily on graphical tools (histograms, scatter plots, box plots) because seeing
is believing. The philosophy here is that human intuition, paired with visual patterns, can
reveal insights that equations might miss.
It’s almost an artistic stance: the data’s shape, spread, and quirks are more honest when
you *look* at them rather than reduce them to numbers.
4. Embracing the Mess
Real-world data is messy—outliers, missing values, weird distributions. EDA’s philosophy
doesn’t shy away from that chaos; it dives in. Instead of cleaning data to fit a model, it asks,
“What does the mess tell us?”
5. Iterative and Flexible Thinking
EDA isn’t a one-and-done step; it’s a cycle of looking, questioning, and digging deeper. The
philosophy values adaptability—pivot when you spot something unexpected, chase
anomalies, refine your focus as you go.
This pragmatism makes it a tool for doers: engineers, scientists, analysts who need to solve
problems, not just publish papers.
In Essence:
The philosophy of EDA is about approaching data with a beginner’s mind—curious, critical,
and unburdened by dogma (a belief or set of beliefs that people are expected to accept as
true without questioning). It’s a call to listen to the data first, using simple tools to uncover
its structure, quirks, and secrets, before deciding what to do next. It’s less a rigid method
and more a way of thinking: trust the evidence, question the obvious, and let patterns
emerge naturally.
Variance − It indicates the spread of the data about the mean. It tells us how far
observations typically lie from the central value. It is calculated as the mean of the squared
deviations of the observations from the mean.
Skewness − It is a measure of the asymmetry of the observations. A distribution can be
left-skewed or right-skewed, with a long tail on the left or the right respectively.
Kurtosis − It measures how heavy the tails of a distribution are relative to a normal
distribution. High kurtosis is known as leptokurtic, medium kurtosis (similar to the normal
distribution) as mesokurtic, and low kurtosis as platykurtic.
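The three measures above can be computed directly from their definitions. The following is a minimal sketch using only the Python standard library. The function name `describe_shape` and the sample values are hypothetical, and it uses population (divide by n) moments, which is one common convention among several.

```python
def describe_shape(values):
    """Return mean, population variance, skewness, and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments: the average of (x - mean) raised to a power.
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "variance": m2,
        # Positive: long right tail. Negative: long left tail.
        "skewness": m3 / m2 ** 1.5,
        # Measured relative to the normal distribution, whose kurtosis is 3.
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical sample with one unusually large value on the right.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 12]
shape = describe_shape(sample)
for name, value in shape.items():
    print(f"{name}: {value:.3f}")
```

An excess kurtosis near zero corresponds to mesokurtic, a clearly positive value to leptokurtic, and a clearly negative value to platykurtic.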
o In the univariate graphical approach, we may use any graphing library to generate
graphs such as histograms, box plots, quantile-quantile plots, and violin plots for
visualization. Data scientists often use visualization to discover anomalies and patterns.
The graphical method is a more subjective approach to EDA. The following are some of the
graphical tools used to perform univariate analysis.
o Histograms − They show the count of observations falling within each range (bin) of values.
The frequencies are drawn as adjacent rectangles, similar in appearance to a bar graph,
and can be oriented vertically or horizontally.
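The counting step behind a histogram can be sketched without a plotting library. This is an illustrative example only: `histogram_counts`, the bin width, and the `ages` values are assumptions made for the sketch, and a graphing library would normally choose the bins and draw the rectangles for you.

```python
def histogram_counts(values, bin_width, start):
    """Count values per half-open bin [lo, lo + bin_width), keyed by lo."""
    counts = {}
    for x in values:
        index = int((x - start) // bin_width)  # which bin x falls into
        lo = start + index * bin_width         # lower edge of that bin
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical ages, grouped into bins of width 10 starting at 20.
ages = [21, 22, 25, 29, 31, 34, 35, 38, 41, 58]
counts = histogram_counts(ages, bin_width=10, start=20)

# A text rendering: one '#' per observation in the bin.
for lo, count in counts.items():
    print(f"{lo}-{lo + 10}: {'#' * count}")
```

Bins that contain no observations are simply omitted in this sketch.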
o Box plots − Also known as box-and-whisker plots. They use boxes and lines to show the
distribution of data for one or more groups. The box spans the first to third quartile, and
a central line indicates the median. The whiskers extend from the box to cover the rest of
the data, with extreme points often drawn individually as outliers. They are useful for
comparing groups of data and for judging symmetry.
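The numbers a box plot draws can be computed as follows. This is a sketch under stated assumptions: the 1.5 × IQR rule for whiskers and outliers is a widely used convention rather than something these notes specify, the "inclusive" quantile method is one of several in use, and `box_plot_summary` and the sample data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations that lie inside the fences.
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = box_plot_summary(data)
print(summary)
```

Running the same function on each group and comparing the summaries side by side mirrors what a grouped box plot shows visually.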
• Multivariate Non-Graphical (raw data) − Techniques such as cross-tabulation of two or more
variables. The ANOVA test can also play a significant role.
ANOVA, which stands for Analysis of Variance, is a technique that tests whether the
means of three or more independent groups differ significantly from one another.
The independent variable(s) must be categorical (nominal or ordinal), for example
different groups or treatments, and the dependent variable should be continuous. You
cannot conduct an ANOVA if the dependent variable is nominal.
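To make the idea concrete, here is a minimal sketch of the one-way ANOVA F statistic, which compares the variation between group means with the variation within groups. The function name `one_way_anova_f` and the three groups are hypothetical. Turning F into a p-value requires the F distribution, which this sketch leaves out; a statistics library such as SciPy (`scipy.stats.f_oneway`) would normally handle both steps.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group variation: how far observations are from their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical continuous outcome measured under three treatments (categorical).
treatments = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_stat, df_between, df_within = one_way_anova_f(treatments)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F suggests the group means differ by more than the within-group scatter would explain; whether it is significant depends on the degrees of freedom and the chosen significance level.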
• Multivariate Graphical − In visualization analysis for multivariate statistics, the following
plots can be used.
o Scatterplot − It is used to display the relationship between two variables by
plotting the data as dots. Additionally, color coding can be intelligently used to
show groups within the two features based on a third feature.
o Heatmap − In this visualization technique, values are represented by colors arranged in a
2D grid, with a legend showing which color corresponds to which level of the value.
o Bubble plot − In this graph, circles are used to show different values. The size of each
circle on the chart (its radius or, in many tools, its area) is scaled to the value of the data point.
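One common input to a heatmap is a correlation matrix, which summarizes every pairwise scatterplot relationship as a single number. The sketch below computes Pearson correlations with the standard library only. The column names and values are hypothetical, and using correlations as the heatmap values is an assumption made for illustration, since any grid of values can be shown as a heatmap.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations for named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical dataset with three numeric features.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 66, 74, 82],
    "commute": [30, 10, 25, 5, 20],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

A plotting library would then map each cell of this matrix to a color, with the legend showing the scale from -1 to 1.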