Exploratory Data Analysis unit 2
Understanding your data through visualization and statistics
Presented By:
Prof. Tithirupa Tapaswini
Assistant Professor
Department of CSE
Medicaps University Indore
What is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing and understanding
data sets with the goal of discovering patterns, identifying anomalies, testing
hypotheses, and checking assumptions. It involves summarizing the main
characteristics of the data, often using visual methods.
Here are some key components of EDA:
1. Descriptive Statistics: Summarizing data through measures such as mean, median, standard deviation, and variance.
2. Data Visualization: Using graphs and plots to visualize distributions, relationships, and trends.
3. Data Cleaning: Identifying and addressing missing values, outliers, and inconsistencies in the data.
4. Univariate Analysis: Examining individual variables to understand their distribution and central tendency.
5. Bivariate and Multivariate Analysis: Exploring relationships between two or more variables to identify correlations and patterns.
6. Feature Engineering: Creating new features or modifying existing ones to improve the quality of the data or make it more suitable for modeling.
EDA is often a preliminary step before more formal statistical analysis or machine
learning, helping to guide the analysis and ensure that any modeling is based on a
sound understanding of the data.
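In practice these components are usually carried out with a data-analysis library. The sketch below is illustrative only, assuming pandas and matplotlib; the file name "data.csv" and its columns are hypothetical placeholders.

```python
# Minimal EDA pass touching the components listed above (illustrative sketch;
# "data.csv" and its columns are hypothetical).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Descriptive statistics: mean, std, quartiles for numeric columns
print(df.describe())

# Data cleaning check: count missing values per column
print(df.isna().sum())

# Univariate analysis: histogram of each numeric column
df.hist(bins=30, figsize=(10, 6))
plt.tight_layout()
plt.show()

# Bivariate/multivariate analysis: pairwise correlations between numeric columns
print(df.corr(numeric_only=True))
```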
Assumptions in EDA:
The gamut of scientific and engineering experimentation is virtually limitless. In this sea of diversity, is there any common basis that allows the analyst to systematically and validly arrive at supportable, repeatable research conclusions?
Underlying Assumptions:
There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":
1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location;
4. with the distribution having fixed variation.
Understanding and addressing the underlying assumptions in Exploratory Data
Analysis (EDA) is crucial for several reasons:
1. Validity of Analysis:
Importance: Assumptions help ensure that the methods and techniques used
during EDA are valid and appropriate for the data. If assumptions are violated, the
results of the analysis may be misleading or incorrect.
Example: If you assume data independence but your observations are correlated
(e.g., time series data), statistical tests that assume independence might produce
unreliable results.
2. Data Preparation:
Importance: Addressing assumptions helps in cleaning and preparing data effectively. This includes handling missing values, correcting errors, and transforming variables to meet assumptions.
Example: If the data is not normally distributed, you might need to apply transformations or use non-parametric methods (a quick check for this, and for the examples in points 4 and 5, is sketched in the code after this list).
3. Accuracy of Insights:
Importance: Assumptions impact the accuracy and reliability of the insights derived from EDA. If the assumptions are not met, the insights may be flawed or incomplete.
Example: In linear regression, if the relationship between variables is not linear, the model's predictions and interpretations could be inaccurate.
4. Detecting Problems:
Importance: Checking assumptions helps in identifying and diagnosing potential problems or limitations in the data or analysis process.
Example: Detecting multicollinearity (high correlation between independent variables) can prevent issues in regression analysis, such as inflated standard errors.
5. Improving Model Performance:
Importance: Properly addressing assumptions can improve the performance of statistical models and machine learning algorithms. It ensures that models are well suited to the data, which enhances their predictive accuracy and generalizability.
Example: Addressing heteroscedasticity (non-constant variance of errors) in regression analysis can lead to more reliable coefficient estimates and confidence intervals.
6. Communicating Results:
Importance: Being aware of and communicating the assumptions made during EDA
helps stakeholders understand the context and limitations of the analysis. This
transparency builds trust in the findings and helps in making informed decisions.
Example: Clearly stating that the data was assumed to be normally distributed
helps stakeholders understand the basis of the statistical tests used and their
appropriateness.
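The checks mentioned in the examples above (normality, multicollinearity, heteroscedasticity) can each be done in a few lines. The sketch below is a minimal illustration on synthetic data, assuming numpy, pandas, and scipy; the variable names and data are made up and the checks are informal, not prescriptive tests.

```python
# Quick, informal checks for the assumptions discussed above (illustrative sketch
# on synthetic data; all names are hypothetical).
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
df["x3"] = 0.95 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=200)

# Normality check (point 2): Shapiro-Wilk test; a small p-value suggests non-normality
stat, p = stats.shapiro(df["y"])
print(f"Shapiro-Wilk p-value for y: {p:.3f}")

# Multicollinearity check (point 4): pairwise correlations between predictors
print(df[["x1", "x2", "x3"]].corr())

# Heteroscedasticity check (point 5): least-squares fit, then see whether the
# residual spread is related to the fitted values
X = np.column_stack([np.ones(len(df)), df[["x1", "x2"]].to_numpy()])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
fitted = X @ beta
resid = df["y"].to_numpy() - fitted
print("corr(|residual|, fitted):", np.corrcoef(np.abs(resid), fitted)[0, 1])
```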
Graphical Techniques: On the other hand, there is a large collection of statistical tools that we generally refer to as graphical techniques.
These include:
scatterplots
histograms
probability plots
residual plots
box plots and block plots
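Several of these plots can be produced with a few lines of matplotlib and scipy. The sketch below uses synthetic data invented purely for illustration.

```python
# Illustrative sketch of some of the graphical techniques listed above,
# using synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(scale=1.0, size=100)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(x, y)
axes[0, 0].set_title("Scatterplot")
axes[0, 1].hist(y, bins=20)
axes[0, 1].set_title("Histogram")
stats.probplot(y, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title("Normal probability plot")
axes[1, 1].boxplot(y)
axes[1, 1].set_title("Box plot")
plt.tight_layout()
plt.show()
```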
An EDA/Graphics Example
Data:
X        Y
10.00    8.04
8.00     6.95
13.00    7.58
9.00     8.81
11.00    8.33
14.00    9.96
6.00     7.24
4.00     4.26
12.00    10.84
7.00     4.82
5.00     5.68
The goal of the analysis is to compute summary statistics and determine the best linear fit for Y as a function of X; the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
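The quoted figures can be reproduced with numpy and scipy. The sketch below uses the 11-point data set tabulated above; the residual standard deviation is computed with n - 2 degrees of freedom.

```python
# Reproducing the summary statistics and linear fit for the example data (sketch).
import numpy as np
from scipy import stats

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

print("N =", len(x))
print("Mean of X =", x.mean())
print("Mean of Y =", y.mean())

fit = stats.linregress(x, y)          # least-squares line y = intercept + slope * x
print("Intercept =", fit.intercept)
print("Slope =", fit.slope)
print("Correlation =", fit.rvalue)

resid = y - (fit.intercept + fit.slope * x)
print("Residual std dev =", np.sqrt(np.sum(resid**2) / (len(x) - 2)))
```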
The above quantitative analysis, although valuable, gives us only limited insight into the data.
Scatter Plot: In contrast, a simple scatter plot of the data suggests the following:
1. The data set "behaves like" a linear curve with some scatter;
2. there is no justification for a more complicated model (e.g., quadratic);
3. there are no outliers;
4. the vertical spread of the data appears to be of equal height irrespective of the
X-value; this indicates that the data are equally-precise throughout and so a
"regular" (that is, equi-weighted) fit is appropriate.
This kind of characterization of the data serves as the core for getting insight into, and a feel for, the data.
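A matplotlib sketch of the scatter plot described above, with the stated fitted line overlaid:

```python
# Scatter plot of the example data with the fitted line y = 3 + 0.5x (sketch).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

plt.scatter(x, y)
plt.plot(np.sort(x), 3 + 0.5 * np.sort(x), color="red", label="y = 3 + 0.5x")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```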
The same summary statistics can be computed for four quite different data sets (the well-known Anscombe quartet) and come out nearly identical; only the scatter plots reveal the differences. Conclusions from the scatter plots are:
1. data set 1 is clearly linear with some scatter.
2. data set 2 is clearly quadratic.
3. data set 3 clearly has an outlier.
4. data set 4 is obviously the victim of a poor experimental design with a
single point far removed from the bulk of the data "wagging the dog".
Techniques for Testing Assumptions
The following EDA techniques are simple, efficient, and powerful for the
routine testing of underlying assumptions:
1. run sequence plot (Yi versus i)
2. lag plot (Yi versus Yi-1)
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)
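These four plots are commonly drawn together as a single "4-plot". The sketch below is illustrative, using matplotlib and scipy on synthetic, well-behaved data; with real data, y would be the measured response in run order.

```python
# A 4-plot sketch: run sequence plot, lag plot, histogram, and
# normal probability plot, shown on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(loc=10.0, scale=1.0, size=200)   # a well-behaved process

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Run sequence plot: Yi versus i (checks fixed location and fixed variation)
axes[0, 0].plot(np.arange(1, len(y) + 1), y, marker=".", linestyle="-")
axes[0, 0].set_title("Run sequence plot")

# 2. Lag plot: Yi versus Yi-1 (checks randomness)
axes[0, 1].scatter(y[:-1], y[1:], s=10)
axes[0, 1].set_title("Lag plot")

# 3. Histogram (checks the shape of the distribution)
axes[1, 0].hist(y, bins=20)
axes[1, 0].set_title("Histogram")

# 4. Normal probability plot (checks normality)
stats.probplot(y, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```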
Sample Plot: Assumptions Do Not Hold
If one or more of the four underlying assumptions does not hold, then this will show up in the corresponding plots of the 4-plot.
Interpretation of 4-Plot
The four EDA plots discussed on the previous page are used to
test the underlying assumptions:
1. Fixed Location: If the fixed location assumption holds, then
the run sequence plot will be flat and non-drifting.
2. Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.
3. Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.
4. Fixed Distribution: If the fixed distribution assumption holds (in particular, if a fixed normal distribution holds), then the histogram will be bell-shaped and the normal probability plot will be linear.
If all four of the assumptions hold, then the process is said definitionally to
be "in statistical control".
Consequences:
The following sections discuss in more detail the
consequences of invalid assumptions:
1. Consequences of non-randomness
2. Consequences of non-fixed location parameter
3. Consequences of non-fixed variation
4. Consequences related to distributional assumptions
Consequences of Non-Randomness
The randomness assumption is the most critical but the least tested.
If the randomness assumption does not hold, then
1. All of the usual statistical tests are invalid.
2. The calculated uncertainties for commonly used statistics
become meaningless.
3. The calculated minimal sample size required for a pre-
specified tolerance becomes meaningless.
4. The simple model: y = constant + error becomes invalid.
5. The parameter estimates become suspect and non-supportable.
Autocorrelation:
A common problem associated with non-randomness is autocorrelation. Autocorrelation is the correlation between Yt and Yt-k, where k is an integer that defines the lag for the autocorrelation. That is, autocorrelation is a time-dependent non-randomness.
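A lag-k autocorrelation can be computed directly as the correlation between the series and a shifted copy of itself. The sketch below is illustrative, using numpy on synthetic data; the helper name autocorrelation is hypothetical.

```python
# Computing the lag-k autocorrelation of a series (illustrative sketch).
import numpy as np

def autocorrelation(y, k=1):
    """Correlation between Y_t and Y_{t-k} for a 1-D series y."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[k:], y[:-k])[0, 1]

rng = np.random.default_rng(3)
white_noise = rng.normal(size=500)
drifting = np.cumsum(rng.normal(size=500))   # random walk: strongly autocorrelated

print("lag-1 autocorrelation, white noise:", autocorrelation(white_noise, 1))
print("lag-1 autocorrelation, random walk:", autocorrelation(drifting, 1))
```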
Consequences of Non-Fixed Location
If the run sequence plot does not support the assumption of fixed location, then:
1. The location may be drifting.
2. The single location estimate may be meaningless (if the process is drifting).
3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.