
Unit 4 Exploratory Data Analysis and the Data Science Process

Philosophy of EDA-
The "philosophy of EDA" isn’t a formally defined doctrine, but it reflects a mindset and
approach to data analysis rooted in curiosity, skepticism, and pragmatism. Exploratory Data
Analysis (EDA), pioneered by statistician John Tukey in the 1970s, emphasizes letting the
data speak for itself before imposing rigid models or assumptions. Here’s a breakdown of
its philosophical underpinnings:
1. Data-Driven Curiosity
EDA starts with an open-ended question: "What’s in the data?" It’s about exploring without
preconceived notions, treating the dataset as a story waiting to be uncovered. The
philosophy here is that truth lies in the raw observations, not in hypotheses you bring to the
table.
Think of it like a detective poking around a crime scene—looking for clues, not assuming
who the culprit is.
2. Skepticism of Assumptions
Traditional statistical methods often rely on assumptions (e.g., normality, linearity). EDA
rejects jumping straight to those, insisting you first check if they even hold. It’s skeptical of
forcing data into a box it doesn’t fit.
Tukey’s view was that assumptions should be earned through evidence, not blindly applied.
This is a rebellion against overly theoretical, model-first approaches.
3. Visualization as Truth-Seeking
EDA leans heavily on graphical tools—histograms, scatter plots, box plots—because seeing
is believing. The philosophy here is that human intuition, paired with visual patterns, can
reveal insights that equations might miss.
It’s almost an artistic stance: the data’s shape, spread, and quirks are more honest when
you *look* at them rather than reduce them to numbers.
4. Embracing the Mess
Real-world data is messy—outliers, missing values, weird distributions. EDA’s philosophy
doesn’t shy away from that chaos; it dives in. Instead of cleaning data to fit a model, it asks,
“What does the mess tell us?”
5. Iterative and Flexible Thinking
EDA isn’t a one-and-done step; it’s a cycle of looking, questioning, and digging deeper. The
philosophy values adaptability—pivot when you spot something unexpected, chase
anomalies, refine your focus as you go.
This pragmatism makes it a tool for doers: engineers, scientists, analysts who need to solve
problems, not just publish papers.
In Essence:
The philosophy of EDA is about approaching data with a beginner’s mind—curious, critical,
and unburdened by dogma (a belief or set of beliefs that people are expected to accept as
true without questioning). It’s a call to listen to the data first, using simple tools to uncover
its structure, quirks, and secrets, before deciding what to do next. It’s less a rigid method
and more a way of thinking: trust the evidence, question the obvious, and let patterns
emerge naturally.

Basic tools (plots, graphs and summary statistics) of EDA


In Exploratory Data Analysis (EDA), summary statistics are used to provide a concise
overview of a dataset's key characteristics, such as its central tendency, dispersion, and
distribution, helping to identify patterns and anomalies.
EDA aims to understand the data, identify patterns, and formulate hypotheses before formal
statistical modeling.
• Types of Summary Statistics:
• Measures of Central Tendency: Mean, median, and mode help describe the
typical or average value of a dataset.
• Measures of Dispersion: Standard deviation, variance, and range quantify the
spread or variability of the data.
• Percentiles and Quartiles: These help understand the distribution of data by
dividing it into different segments
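The measures listed above can be computed with Python's standard-library statistics module; a minimal sketch on a small made-up sample:

```python
import statistics as st

data = [12, 15, 15, 18, 20, 22, 22, 22, 30, 95]  # hypothetical sample

# Central tendency
mean = st.mean(data)      # 27.1
median = st.median(data)  # 21.0
mode = st.mode(data)      # 22

# Dispersion
stdev = st.stdev(data)        # sample standard deviation
variance = st.variance(data)  # sample variance
value_range = max(data) - min(data)

# Quartiles (the 25th, 50th, and 75th percentiles)
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
```

Note how the single outlier (95) pulls the mean well above the median; comparing the two is a quick way to spot skew.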

EDA is mostly carried out in the following forms-


• Univariate Exploratory Data Analysis – In Univariate Data Analysis we use one variable or
feature to determine the characteristics of the dataset. We derive the relationships and
distribution of data concerning only one feature or variable. In this category, we have the
liberty to use either the raw data or follow a graphical approach.
o In the Univariate raw data approach or Non-Graphical , we determine the
distribution of data based on one variable and study a sample from the population.
Also, we may include outlier removal which is a part of this process.
Examples are-
The measure of Central tendency − Central tendency tries to summarize a whole
population or dataset with the help of a single value that represents the central value.
The three measures are the mean, the median, and the mode.
Mean − It is the average of all the observations, i.e., the sum of all observations divided
by the number of observations.
Median − It is the middle value of the observations after arranging them in ascending or
descending order.
Mode − It is the most frequently occurring observation.

Variance − It indicates the spread of the data about the mean. It helps us understand how
the observations are dispersed around measures of central tendency like the mean. It is
calculated as the mean of the squared deviations of the observations from the mean.

Skewness − It is a measure of the asymmetry of the observations. A distribution can be
left-skewed or right-skewed, forming a long tail in either direction.
Kurtosis − It measures how heavy-tailed a distribution is relative to a normal distribution.
Kurtosis like that of the normal distribution is called mesokurtic, low kurtosis is called
platykurtic, and high kurtosis is called leptokurtic.
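Skewness and excess kurtosis can be computed directly from their definitions; a small standard-library sketch (population formulas, with no bias correction):

```python
import statistics as st

def skewness(xs):
    """Mean cubed z-score: 0 for symmetric data, > 0 for a long right tail."""
    m, s = st.mean(xs), st.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Mean fourth-power z-score minus 3 (0 for a normal distribution)."""
    m, s = st.mean(xs), st.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]       # skewness is exactly 0
right_skewed = [1, 2, 2, 3, 14]   # the outlier creates a right tail
```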

o In the Univariate graphical approach, we may use any graphing library to generate
graphs like histograms, boxplots, quantile-quantile plots, violin plots, etc. for
visualization. Data Scientists often use visualization to discover anomalies and patterns.
The graphical method is a more subjective approach to EDA. These are some of the
graphical tools to perform univariate analysis.
o Histograms − They show the count of observations falling within each of a set of value
ranges (bins), drawn as adjacent rectangles whose heights represent frequency. Unlike a
bar graph of categories, the bins lie on a continuous numeric scale; the bars can be drawn
vertically or horizontally.
o Box plots − Also known as box and whisker plots. They use lines and boxes to show the
distribution of data from one or more than one groups. A central line indicates the
median value. The extended line captures the rest of the data. They are useful in the way
that they can be used to compare groups of data and compare symmetry.
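What these two plots draw can be computed by hand: a box plot shows the five-number summary, and a histogram's rectangles are just bin counts. A standard-library sketch:

```python
import statistics as st

def five_number_summary(xs):
    """min, Q1, median, Q3, max — the values a box plot draws."""
    q1, q2, q3 = st.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, q2, q3, max(xs)

def histogram_counts(xs, bins=5):
    """Count how many observations fall into each of `bins` equal-width ranges."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return counts
```

In practice a plotting library (e.g. matplotlib) performs these computations for you, but seeing them spelled out clarifies what the pictures mean.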

Multivariate Non-Graphical (raw data) − Techniques like cross-tabulation of two or more
variables. The ANOVA test can also play a significant role.
ANOVA, which stands for Analysis of Variance, is a technique that determines whether the
means of three or more independent groups differ significantly from one another.
The independent variable(s) must be categorical, such as nominal or ordinal data (e.g.,
different groups or treatments), while the dependent variable should be continuous. You
cannot conduct an ANOVA test if your dependent variable is nominal.
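The one-way ANOVA F statistic compares variation between group means to variation within groups; a standard-library sketch built from the definitions (in practice scipy.stats.f_oneway computes this and also returns a p-value):

```python
import statistics as st

def one_way_anova_f(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                  # number of groups
    n = sum(len(g) for g in groups)  # total observations
    grand_mean = st.mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (st.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - st.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F means the group means differ by more than within-group noise would suggest; groups with identical means give F = 0.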
• Multivariate Graphical − In visualization analysis for multivariate statistics, the below
plots can be used.
o Scatterplot − It is used to display the relationship between two variables by
plotting the data as dots. Additionally, color coding can be intelligently used to
show groups within the two features based on a third feature.
o Heatmap − In this visualization technique the values are represented with colors
with a legend showing color for different levels of the value. It is a 2d graph.
o Bubble plot − In this graph, circles are used to show different values. Each circle's size
(conventionally its area rather than its radius) is drawn proportional to the value of the
data point.
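A minimal matplotlib sketch (assuming matplotlib is installed) combining these ideas: a scatter plot whose colors encode a third feature and whose point sizes turn it into a bubble plot. All the data values are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]             # hypothetical feature 1
y = [2, 4, 5, 4, 6]             # hypothetical feature 2
group = [0, 0, 1, 1, 1]         # third feature, shown via color
size = [40, 80, 120, 160, 200]  # fourth feature, shown via bubble area

fig, ax = plt.subplots()
ax.scatter(x, y, c=group, s=size, cmap="viridis")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
fig.savefig("bubble.png")
```

Note that matplotlib's `s` parameter scales the marker *area*, which is the perceptually honest way to encode a value in a bubble.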

Data Science Process-


Data Science is all about a systematic process used by Data Scientists to analyze, visualize
and model large amounts of data. A data science process helps data scientists use the tools
to find unseen patterns, extract data, and convert information to actionable insights that can
be meaningful to the company. This aids companies and businesses in making decisions that
can help in customer retention and profits.
The six data science process steps are as follows:
1. Frame the problem
2. Collect the raw data needed for your problem
3. Process the data for analysis
4. Explore the data
5. Perform in-depth analysis
6. Communicate results of the analysis
Step 1: Framing the Problem
A great way to work through this step is to ask questions like:
• Who are the customers?
• How do we identify them?
• What is the sales process right now?
• Why are they interested in your products?
• What products are they interested in?
Numbers need much more context before they become insights. At the end of this step,
you should have as much information at hand as possible.
Step 2: Collecting the Raw Data for the Problem
After defining the problem, you will need to collect the requisite data to derive insights and
turn the business problem into a probable solution. The process involves thinking through
your data and finding ways to collect and get the data you need. It can include scanning your
internal databases or purchasing databases from external sources.
Many companies store the sales data they have in customer relationship management
(CRM) systems. The CRM data can be easily analyzed by exporting it to more advanced tools
using data pipelines.
Step 3: Processing the Data to Analyze
Data can be messy if it has not been appropriately maintained, leading to errors that can
easily corrupt the analysis. These issues include values set to null when they should be zero
(or the exact opposite), missing values, duplicate values, and many more. You will have to go
through the data and check it for problems to get accurate insights.
The most common errors that you can encounter and should look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even started
Once you have completed the data cleaning process, your data will be ready for an
exploratory data analysis (EDA).
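A pandas sketch (assuming pandas is available) of this cleaning pass on a tiny made-up CRM export. The column names and the "missing amount means zero" rule are illustrative assumptions, not prescribed:

```python
import pandas as pd

# Hypothetical CRM export with the kinds of errors listed above
raw = pd.DataFrame({
    "customer": ["a", "a", "b", None, "c"],
    "amount":   [10.0, 10.0, None, 5.0, 7.0],
})

clean = (
    raw.drop_duplicates()            # remove fully duplicated rows
       .dropna(subset=["customer"])  # rows with no customer id are unusable
       .fillna({"amount": 0.0})      # assumed rule: a missing amount means zero
)
```

Each cleaning rule is a business decision (is a missing amount really zero, or unknown?), so document the rules you apply rather than cleaning silently.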
Step 4: Exploring the Data
In this step, you will have to develop ideas that can help identify hidden patterns and
insights. You will have to find more interesting patterns in the data, such as why sales of a
particular product or service have gone up or down. Patterns like these deserve closer
examination; this is one of the most crucial steps in the data science process.
Step 5: Performing In-depth Analysis
This step will test your mathematical, statistical, and technological knowledge. You must use
all the data science tools to crunch the data successfully and discover every insight you can.
You might have to prepare a predictive model that can compare your average customer with
those who are underperforming. You might find several reasons in your analysis, like age or
social media activity, as crucial factors in predicting the consumers of a service or product.
You might find several aspects that affect the customer, like some people may prefer being
reached over the phone rather than social media.
Step 6: Communicating Results of this Analysis
After all these data science steps, it is vital to convey your insights and findings to the sales
head and make them understand their importance. It will help if you communicate
appropriately to solve the problem you have been given. Proper communication will lead to
action; poor communication may lead to inaction.
You need to link the data you have collected and your insights with the sales head’s
knowledge so that they can understand it better.
