4. Visualization

The document discusses Exploratory Data Analysis (EDA) and its importance in understanding data without assumptions. It outlines various visualization techniques such as histograms, boxplots, and scatter plots, which help in identifying patterns, outliers, and relationships between variables. EDA is emphasized as a precursor to confirmatory analysis to ensure appropriate methods are applied to the data set.

Uploaded by

varmakdc

Visualization (EDA)

- Exploratory Data Analysis (EDA)
- Descriptive data summarization
EDA
- Why EDA: data should first be explored without assumptions about probabilistic models, error distributions, relationships between the variables, etc.
- Purpose:
  - discover what the data can tell us about the phenomena we are investigating
  - visualize the underlying process that generated the data
- Goal: explore the data to reveal patterns and features that will help the analyst better understand, analyze, and model the data.
EDA
- EDA should precede confirmatory analysis (such as hypothesis testing or ANOVA) to ensure that the analysis is appropriate for the data set. For example:
  - Time series: plot the values over time to look for patterns such as trends, seasonal effects, or change points.
  - Two variables: is the relationship linear or nonlinear? Are there patterns that can provide insight into the process that relates the variables?
  - Look for outliers or aberrant observations that might contaminate the results.
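
One way to make a trend in a time series easier to see is to smooth it before plotting. The sketch below is only an illustration: the trailing moving average is one possible smoother (the slides do not prescribe one), and the `readings` values and the function name `moving_average` are invented for the example.

```python
def moving_average(values, window):
    """Return the trailing moving average of `values` over `window` points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical readings taken at equal time steps (not from the slides).
readings = [10, 12, 11, 15, 14, 18, 17, 21]
smoothed = moving_average(readings, window=4)
print(smoothed)  # the smoothed values rise steadily, which suggests an upward trend
```

Plotting `smoothed` next to `readings` would show the trend without the point-to-point noise.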
Exploring Univariate Data—Graphical Method

- Two important goals of EDA:
  - determine a reasonable model for the process that generated the data, and
  - locate possible outliers in the sample.
- For example:
  - Is the distribution that generated the data symmetric or skewed?
  - Does it have one mode or many modes?
- Univariate visualization techniques can help answer such questions.
Graphic Displays of Basic Statistical Descriptions of Data

- Graphic displays of basic statistical descriptions include:
  - quantile plots,
  - quantile–quantile plots,
  - histograms, and
  - scatter plots.
- These displays are helpful for the visual inspection of data, which is useful for data preprocessing. The first three show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate distributions (i.e., involving two attributes).
- A quantile plot is a simple and effective way to have a first look at a univariate data distribution.
  - First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences).
  - Second, it plots quantile information.
Histogram Analysis
- Graphic displays of basic statistical class descriptions
- Frequency histograms
  - A univariate graphical method
  - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
  - A relative frequency histogram uses the same information as a frequency histogram but expresses each class count as a fraction of the total number of items
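
The relationship between a frequency histogram and a relative frequency histogram can be sketched in a few lines. This is a minimal illustration using equal-width classes; the `prices` data, the class width of 20, and the function names are assumptions made for the example, not values from the slides.

```python
from collections import Counter


def frequency_histogram(values, bin_width, start=0):
    """Count the values in each class [lo, lo + bin_width), keyed by lo."""
    counts = Counter(
        start + ((v - start) // bin_width) * bin_width for v in values
    )
    return dict(sorted(counts.items()))


def relative_frequencies(counts):
    """Divide each class count by the total number of items."""
    total = sum(counts.values())
    return {lo: n / total for lo, n in counts.items()}


# Hypothetical unit prices (not from the slides).
prices = [42, 47, 55, 58, 61, 63, 64, 72, 78, 95]
counts = frequency_histogram(prices, bin_width=20, start=40)
shares = relative_frequencies(counts)

# Text rendering: one row of '#' per class, with the relative frequency.
for lo, n in counts.items():
    print(f"[{lo}, {lo + 20}): {'#' * n}  ({shares[lo]:.0%})")
```

Both histograms have the same shape; only the scale of the vertical axis differs.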
Histograms
Stem-and-leaf plots

Stem-and-leaf plots display data in a structured list. Presenting data in a plain table or an ordered list does not readily convey how the data are distributed; a stem-and-leaf plot, like a histogram, does.

E.g., a stem-and-leaf chart can be used when grading.
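
A stem-and-leaf display for grades can be sketched as follows. The sketch assumes non-negative integer scores, uses the tens digit as the stem and the units digit as the leaf, and works on invented `scores`; other stem widths are equally valid.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (tens) and list their leaves (units)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return {stem: groups[stem] for stem in sorted(groups)}


# Hypothetical exam scores (not from the slides).
scores = [83, 67, 94, 72, 88, 75, 90, 81, 78, 83]
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The length of each row shows how many scores fall in that decade, so the display reads like a histogram turned on its side while still keeping every individual value.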


Measuring the Dispersion of Data: Boxplots and Outliers

Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:
- Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range (IQR).
- The median is marked by a line within the box.
- Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations. When outliers are plotted individually, the whiskers stop at the most extreme observations that lie within 1.5 × IQR of the quartiles.
- For branch 1, the median price of items sold is $80, Q1 is $60, and Q3 is $100. Two outlying observations for this branch are plotted individually because their values, 175 and 202, lie more than 1.5 × IQR beyond the quartiles (here the IQR is 40, so 1.5 × IQR = 60 and the upper limit is $160).
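
The outlier rule used in the branch 1 example can be checked with a short calculation. In this sketch the quartiles Q1 = 60 and Q3 = 100 and the outliers 175 and 202 come from the example above; the other prices in the list and the function names are made up for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]


low, high = outlier_fences(60, 100)  # IQR = 40, so the limits are 0 and 160
# 40, 80 and 120 are hypothetical prices; 175 and 202 are from the example.
flagged = flag_outliers([40, 80, 120, 175, 202], q1=60, q3=100)
print(low, high, flagged)
```

Any observation inside the limits is covered by the box and whiskers; the flagged values are the ones drawn as individual points.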
Visualization of Data Dispersion: Boxplot Analysis
Properties of Normal Distribution Curve

- The normal (distribution) curve
  - From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  - From μ–2σ to μ+2σ: contains about 95% of the measurements
  - From μ–3σ to μ+3σ: contains about 99.7% of the measurements
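
These percentages follow from the normal distribution itself: the probability that a normal variable lies within k standard deviations of its mean is erf(k/√2). The function name `normal_coverage` below is just a label chosen for this sketch.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.2%}")
```

The result does not depend on the particular values of μ and σ, which is why the rule applies to any normal curve.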
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
- For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
Quantile Plot

- A simple way to take a first look at a univariate data distribution.
- Displays all of the data for the given attribute, to assess both
  - the overall behavior and
  - unusual occurrences.
- It plots quantile information.
- Let xi, for i = 1 to N, be the data sorted in increasing order, so that
  - x1 is the smallest observation and
  - xN is the largest for some attribute X.
- Each observation xi is paired with a fraction fi, which indicates that approximately 100 fi% of the data are below or equal to the value xi.
- Note: the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is Q3.
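
The pairing of each sorted observation with a fraction can be sketched as below. The formula fi = (i − 0.5)/N is an assumption: it is one common convention for quantile plots, but it is not stated in the text above. The sample values are invented.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N (assumed convention)."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


# Hypothetical unit prices (not from the slides).
points = quantile_plot_points([64, 40, 100, 80])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f gives the quantile plot: every observation appears once, and the fractions increase in equal steps of 1/N from just above 0 to just below 1.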
Quantile Plot
- Let
Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to the other
Quantile-Quantile (Q-Q) Plot
- Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile.
- For comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is the same.
- The darker points correspond to the data for Q1, the median, and Q3.
- At Q1, the unit price of items sold at branch 1 is lower than at branch 2: 25% of items sold at branch 1 cost $60 or less, versus $64 or less at branch 2.
- At Q2, the 50th percentile (the median), 50% of items sold at branch 1 cost $75 or less, versus $85 or less at branch 2.
- In general, there is a shift in the distribution of branch 1 with respect to branch 2, in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
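
The points of a Q-Q plot can be computed by evaluating the same quantiles on both samples. The sketch below is illustrative only: the two five-item samples are invented (chosen so that Q1 and the median match the figures quoted above), and linear interpolation between order statistics is just one of several quantile conventions.

```python
def quantile(sorted_xs, p):
    """Linearly interpolated quantile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def qq_points(sample_a, sample_b, probs=(0.25, 0.5, 0.75)):
    """Return (quantile of a, quantile of b) pairs for the given probabilities."""
    a, b = sorted(sample_a), sorted(sample_b)
    return [(quantile(a, p), quantile(b, p)) for p in probs]


# Hypothetical unit prices for the two branches (not the slides' data).
branch_1 = [40, 60, 75, 100, 120]
branch_2 = [45, 64, 85, 105, 125]
pairs = qq_points(branch_1, branch_2)
print(pairs)
# Every pair has a smaller first coordinate, so all points sit on the
# branch 2 side of the line where both branches would have equal prices.
print(all(a < b for a, b in pairs))
```

If the two distributions were identical, every pair would have equal coordinates and the points would fall on the reference line.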
Positively and Negatively Correlated Data

Figure panels: positively correlated data; negatively correlated data.

July 4, 2019 Data Mining: Concepts and Techniques 19


Not Correlated Data


Exploring Bivariate Data
Scatter plot
- A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numeric attributes.
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
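
A scatter plot shows the direction of a relationship visually; a numeric companion is the Pearson correlation coefficient, which is positive for positively correlated data, negative for negatively correlated data, and near zero for uncorrelated data. The coefficient is brought in here as a supplement (the slides only show the plots), and the data are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attribute values (not from the slides).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases with x
falling = [10, 8, 6, 4, 2]   # y decreases as x increases
print(pearson_r(x, rising), pearson_r(x, falling))
```

The coefficient only measures linear association, so it complements the scatter plot rather than replacing it: a clear curved pattern can still produce a value near zero.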
Surface Plots

- If the data represent a function defined over a bivariate domain,

  z = f(x, y),

  then the values of z can be viewed as a surface.


Example 5.13
Contour Plots
- Use contour plots to view a surface. Contour plots show lines of constant surface value, similar to topographical maps.
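
Both surface plots and contour plots start from the same ingredient: the values of z = f(x, y) evaluated on a grid of (x, y) points. The sketch below builds such a grid; the function `paraboloid` and the grid coordinates are hypothetical choices made for illustration.

```python
def evaluate_on_grid(f, xs, ys):
    """Return rows of z = f(x, y): one row per y value, one column per x value."""
    return [[f(x, y) for x in xs] for y in ys]


def paraboloid(x, y):
    """Hypothetical example surface z = x^2 + y^2."""
    return x * x + y * y


axis = [-1, 0, 1]
z = evaluate_on_grid(paraboloid, axis, axis)
for row in z:
    print(row)
```

Grid cells that share a z value lie on the same contour line; in this small example the four corners share one value and the four edge midpoints share another, around a minimum at the center.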
Summary: Graphic Displays of Basic Statistical Descriptions

- Histogram
- Boxplot
- Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi% of the data are ≤ xi
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as a point in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
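
The idea behind a loess curve can be sketched as a weighted straight-line fit repeated at each point of interest. This is a simplified illustration, not a full loess implementation: it assumes distinct x values, uses tricube weights over the nearest `frac` share of the points, and omits the robustness iterations found in library versions. The function name and parameters are chosen for this sketch.

```python
def loess_at(xs, ys, x0, frac=0.5):
    """Estimate y at x0 from a tricube-weighted straight-line fit to nearby points."""
    k = min(len(xs), max(3, round(frac * len(xs))))  # neighbours in the local fit
    h = sorted(abs(x - x0) for x in xs)[k - 1]       # distance to the k-th neighbour
    # Tricube weights: 1 at x0, falling to 0 at distance h (assumes h > 0).
    ws = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:       # too few distinct weighted points for a line
        return swy / sw          # fall back to the weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0


# Hypothetical data lying exactly on y = 2x + 1, so the fit should reproduce it.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
curve = [loess_at(xs, ys, x0) for x0 in xs]
print([round(v, 6) for v in curve])
```

Evaluating `loess_at` at many x positions and joining the results gives the smooth curve that is overlaid on the scatter plot; a larger `frac` produces a smoother curve.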