
Lecture 02- Exploratory Data and Descriptive Statistics

Exploratory Data Analysis (EDA) is a statistical approach that summarizes data characteristics using visual methods to understand underlying structures and identify anomalies. Key concepts include measures of central tendency (mean, median, mode) and variation (range, interquartile range, variance, standard deviation). Box-and-whisker plots are used to visually represent data distributions and identify skewness.

Exploratory Data Analysis / Descriptive Statistics
Exploratory Data Analysis

• In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

• It focuses on exploring data to understand the data's underlying structure and variables, to develop intuition about the data set, to consider how that data set came into existence, and to decide how it can be investigated with more formal statistical methods.

Exploratory Data
Analysis

• Perform Exploratory Data Analysis (EDA) to understand the distribution of a variable and to check for anomalies and outliers.

• Create histograms and boxplots, transform variables, and examine trade-offs in visualizations.

Probability & Statistics

Descriptive Statistics

▪ In the Using Statistics scenarios, customers ask questions about numerical variables.
▪ When summarizing and describing numerical variables, you have to do more than just prepare tables.
▪ You need to consider the central tendency, variation, and shape of each numerical variable.

Numerical Descriptive Measures
Central Tendency
Mean

The arithmetic mean (mean) is the most common measure of central tendency.

For a population of size N:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
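
As a minimal sketch of the mean, the function below adds up the observations and divides by their count. The function name `arithmetic_mean` and the data values are illustrative assumptions, not part of the lecture.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean is undefined for an empty data set")
    return sum(values) / len(values)


# Illustrative data (not from the lecture).
sample = [2, 4, 6, 8]
print(arithmetic_mean(sample))
```

The same arithmetic applies to a population and to a sample; only the notation (N and mu versus n and X-bar) differs.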
Median

The location of the median

If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers.

Median position = (n + 1)/2 position in the ordered data

Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data.
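
The position rule can be sketched in code as follows. The function name `median_by_position` is a hypothetical helper; it sorts the data, computes the 1-based position (n + 1)/2, and averages the two neighbouring values when that position falls between two ranks.

```python
def median_by_position(values):
    """Median located via the (n + 1) / 2 position in the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("the median is undefined for an empty data set")
    position = (n + 1) / 2              # 1-based position, e.g. 5.0 or 2.5
    lower = ranked[int(position) - 1]   # value at the whole-number rank
    if position.is_integer():           # odd n: the middle value itself
        return lower
    upper = ranked[int(position)]       # even n: the next ranked value
    return (lower + upper) / 2


print(median_by_position([11, 12, 13, 16, 16, 17, 18, 21, 22]))  # odd n
print(median_by_position([4, 1, 3, 2]))                          # even n
```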
MODE
• A measure of central tendency
• Value with the highest frequency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

Mode

[Figure: two dot plots. Left, on an axis from 0 to 14: Mode = 9. Right, on an axis from 0 to 6: No Mode.]
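
A small sketch of finding the mode, covering the "no mode" and "several modes" cases. The function name `modes` and the convention that a data set has no mode when no value repeats are assumptions made for this example; the data values are illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency.

    Assumed convention: if no value occurs more than once,
    the data set is reported as having no mode (empty list).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 3, 3, 9, 9, 9, 14]))       # one mode
print(modes([0, 1, 2, 3, 4, 5, 6]))        # no mode
print(modes(["a", "b", "a", "b", "c"]))    # several modes, categorical data
```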

QUARTILES

Quartiles split the ranked data into 4 segments with an equal number of values per segment.

[Figure: ranked data divided into four segments of 25% each by Q1, Q2, and Q3]

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
Q2 is the same as the median (50% are smaller, 50% are larger).
Only 25% of the observations are greater than the third quartile, Q3.

Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where

First quartile position: Q1 = (n+1)/4
Second quartile (median) position: Q2 = (n+1)/2
Third quartile position: Q3 = 3(n+1)/4

where n is the number of observed values.

Quartiles
Example: Find the first quartile

Sample data in ordered array: 11 12 13 16 16 17 18 21 22 (n = 9)

Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values, so Q1 = 12.5

Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
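
The position formulas and the worked example can be sketched as below. The helper names are hypothetical. The lecture only shows the "halfway" case (a position ending in .5); using linear interpolation for any other fractional position is an assumption of this sketch, and some textbooks round to the nearest rank instead.

```python
def value_at_position(ranked, position):
    """Value at a 1-based, possibly fractional, position in ranked data.

    Fractional positions are resolved by linear interpolation between
    the two neighbouring ranks (an assumption; 2.5 gives the halfway value).
    """
    lower_rank = int(position)
    fraction = position - lower_rank
    lower = ranked[lower_rank - 1]
    if fraction == 0:
        return lower
    upper = ranked[lower_rank]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions k(n + 1) / 4 for k = 1, 2, 3."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 3:
        raise ValueError("need at least 3 values for these position formulas")
    return tuple(value_at_position(ranked, k * (n + 1) / 4) for k in (1, 2, 3))


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
```

For the nine sample values, Q1 lands at position 2.5, between 12 and 13, which reproduces the 12.5 from the example.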

EXERCISE
Consider the following stem-and-leaf display (stem = tens digit, leaf = units digit).
Find the range, median (Q2), mode, Q1, Q3, and interquartile range.

-2 | 2
-1 | 20
-0 | 5320
 0 | 01146688
 1 | 3357
 2 | 23346889999
 3 | 056789
 4 | 235799
 5 | 48
 6 | 38
 7 |
 8 | 6

EXERCISE
Solution for the stem-and-leaf display above (n = 47):

Range = 86 – (–22) = 108
Median = value in the (47+1)/2 = 24th position = 26
Mode = 29
Q1 = value in the (47+1)/4 = 12th position = 6
Q3 = value in the 3(47+1)/4 = 36th position = 39
IQR = 39 – 6 = 33
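
The exercise answers can be checked with a short script. This is a sketch: the tuple encoding of the display and the helper names are assumptions, and the whole-number position lookup works here only because n + 1 = 48 is divisible by 4.

```python
from collections import Counter

# Stem-and-leaf rows from the exercise: (stem, leaves), stem = tens digit.
STEM_AND_LEAF = [
    ("-2", "2"), ("-1", "20"), ("-0", "5320"), ("0", "01146688"),
    ("1", "3357"), ("2", "23346889999"), ("3", "056789"),
    ("4", "235799"), ("5", "48"), ("6", "38"), ("7", ""), ("8", "6"),
]


def expand(rows):
    """Turn stem-and-leaf rows into a sorted list of values."""
    values = []
    for stem, leaves in rows:
        sign = -1 if stem.startswith("-") else 1
        tens = abs(int(stem))
        for leaf in leaves:
            values.append(sign * (10 * tens + int(leaf)))
    return sorted(values)


data = expand(STEM_AND_LEAF)
n = len(data)


def at(position):
    """Value at a 1-based whole-number position in the ranked data."""
    return data[position - 1]


data_range = data[-1] - data[0]
median = at((n + 1) // 2)
mode = Counter(data).most_common(1)[0][0]
q1 = at((n + 1) // 4)
q3 = at(3 * (n + 1) // 4)
iqr = q3 - q1

print(n, data_range, median, mode, q1, q3, iqr)
```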

Measures of Variation

Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

◼ Measures of variation give information on the spread or variability of the data values.

[Figure: two distributions with the same center, different variation]

Range

• Simplest measure of variation
• Difference between the largest and the smallest observations:

Range = Xlargest – Xsmallest

Example:

[Figure: dot plot on an axis from 0 to 14]

Range = 14 - 1 = 13

Saturday, January 2, 2016

Interquartile Range

• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values

Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1
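
A hedged sketch comparing the range with the interquartile range. It relies on Python's `statistics.quantiles` with `method="exclusive"`, which places the quartiles at the (n + 1)-based positions and interpolates linearly between ranks. The function names and the outlier value 220 are illustrative assumptions.

```python
import statistics


def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1, with quartiles at the (n + 1)-based positions."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 220]  # hypothetical outlier

print(data_range(sample), interquartile_range(sample))
print(data_range(with_outlier), interquartile_range(with_outlier))
```

Replacing the largest value with an extreme one changes the range but leaves the interquartile range alone, which is the sense in which the IQR sidesteps some outlier problems.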


Variance

• Average (approximately) of squared deviations of values from the mean

– Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
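
The sample variance formula translates almost directly into code. This is a minimal sketch; the function name and the data values are illustrative assumptions.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Illustrative data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))  # 32 / 7
```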

Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

– Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Population Standard Deviation

Here we use the formula

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

That is, replace n – 1 in the sample standard deviation formula with n in the denominator.
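
The only difference between the two standard deviations is the denominator, which the sketch below makes explicit. The function name, the `population` flag, and the data values are assumptions made for illustration.

```python
import math


def standard_deviation(values, population=False):
    """Square root of the variance.

    Divides by n - 1 for a sample (the default) and by n for a population.
    """
    n = len(values)
    denominator = n if population else n - 1
    if denominator <= 0:
        raise ValueError("not enough values for this standard deviation")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / denominator)


# Illustrative data: squared deviations from the mean (5) sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(data))                   # sqrt(32 / 7)
print(standard_deviation(data, population=True))  # sqrt(32 / 8)
```

For the same data, the sample value is always the larger of the two, because it divides by the smaller number.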

[Figure: two distributions compared, one with a small standard deviation and one with a large standard deviation]


Shape of a Distribution
• Describes how data is distributed
• Measures of shape
  – Symmetric or skewed

Saturday, February 4, 2017


Exploratory Data Analysis


• Box-and-Whisker Plot: A graphical display of data using the 5-number summary:

Minimum -- Q1 -- Median -- Q3 -- Maximum

Example:

[Figure: box-and-whisker plot in which each of the four sections (Minimum to Q1, Q1 to Median, Median to Q3, Q3 to Maximum) holds 25% of the data]
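
The 5-number summary behind the plot can be computed as in this sketch. The function name is an assumption. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which uses the (n + 1)-based positions with linear interpolation, and the data are the nine values from the quartile example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return (min(values), q1, median, q3, max(values))


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
minimum, q1, median, q3, maximum = five_number_summary(sample)
print(minimum, q1, median, q3, maximum)
```

These five numbers are the whisker ends, the box edges, and the central line of a box-and-whisker plot.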



Shape of Box-and-Whisker Plots

• The box and central line are centered between the endpoints if the data are symmetric around the median

[Figure: box-and-whisker plot labeled Min, Q1, Median, Q3, Max]

• A box-and-whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box-and-Whisker Plot

[Figure: three distributions (left-skewed, symmetric, right-skewed), each shown with its box-and-whisker plot marking Q1, Q2, and Q3]
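
One rough way to read shape off the quartiles is to compare the two halves of the box. The sketch below is a heuristic assumed for illustration, not a rule stated in the lecture: the function name, the `tolerance` parameter, and the decision rule are all assumptions, and it ignores the whiskers.

```python
def box_shape(q1, q2, q3, tolerance=0.0):
    """Classify shape by comparing the two halves of the box.

    Heuristic (an assumption of this sketch): a longer upper half
    (Q3 - Q2) suggests right skew, a longer lower half (Q2 - Q1)
    suggests left skew, and equal halves suggest symmetry.
    """
    lower_half = q2 - q1
    upper_half = q3 - q2
    if abs(upper_half - lower_half) <= tolerance:
        return "symmetric"
    return "right-skewed" if upper_half > lower_half else "left-skewed"


print(box_shape(12.5, 16, 19.5))  # equal halves
print(box_shape(1, 2, 10))        # median close to Q1
print(box_shape(1, 9, 10))        # median close to Q3
```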
