
Program: Course Code: Course Name:: M.C.A. MCAS9220 Data Science Fundamentals

The document provides an overview of Exploratory Data Analysis (EDA) in the context of Data Science, emphasizing its importance in understanding data sets and generating hypotheses. It discusses various methods for visualizing data, including frequency tables, histograms, and stem-and-leaf plots, as well as the concept of the five-number summary. The document highlights the significance of careful data representation to maintain the richness of the data while identifying trends and outliers.

Uploaded by

Manish Verma
Copyright
© © All Rights Reserved

School of Computing Science and Engineering

Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis

Statistics 2126
Introduction
• If you are going to find out anything about
a data set you must first understand the
data
• Basically getting a feel for your numbers
– Easier to find mistakes
– Easier to guess what actually happened
– Easier to find odd values
Introduction
• One of the most important and overlooked
parts of statistics is Exploratory Data
Analysis, or EDA
• Developed by John Tukey
• Allows you to generate hypotheses as well
as get a feel for your data
• Get an idea of how the experiment went
without losing any richness in the data
Hey look, numbers!
x (the value) f (frequency)
10 1
23 2
25 5
30 2
33 1
35 1
Frequency tables make stuff easy

Σxf
• 10(1)+23(2)+25(5)+30(2)+33(1)+35(1)
• = 309
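As a quick check of the sum above, here is a minimal Python sketch. The names `freq_table`, `sum_xf` and `n` are illustrative and do not come from the slides.

```python
# Frequency table from the slide: value x -> frequency f
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}

# Sum of x * f: each value is counted as many times as it occurs
sum_xf = sum(x * f for x, f in freq_table.items())

# The number of observations is the sum of the frequencies
n = sum(freq_table.values())

print(sum_xf)      # 309
print(n)           # 12
print(sum_xf / n)  # 25.75, the mean
```

Dividing the sum by the number of observations gives the mean without ever writing out the raw data.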
Relative Frequency Histogram
• You can use this to make a relative
frequency histogram
• Lose no richness in the data
• Easy to reconstruct data set
• Allows you to spot oddities
[Figure: Relative Frequency Histogram; x axis: Score, y axis: Frequency (0 to 6)]
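A rough sketch of the idea, assuming the frequency table from the earlier slide: relative frequency is each frequency divided by the total, and the raw data can be rebuilt from the table. The text bars are only a stand-in for a real plot, and the variable names are illustrative.

```python
# Frequency table from the earlier slide: value x -> frequency f
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}
n = sum(freq_table.values())

# Relative frequency = frequency / total number of observations
rel_freq = {x: f / n for x, f in freq_table.items()}

# Crude text histogram: one '*' per observation, then the relative frequency
for x, f in freq_table.items():
    print(f"{x:>3} | {'*' * f:<5} {rel_freq[x]:.3f}")

# Nothing is lost: the full data set can be rebuilt from the table
data = [x for x, f in freq_table.items() for _ in range(f)]
print(data)
```

Because every distinct score keeps its own bar, nothing is grouped away, which is why the data set is easy to reconstruct.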
Categorical Data
• With categorical data you do not get a
histogram, you get a bar graph
• You could do a pie chart too, though I hate
them (but I love pie)
• Pretty much the same thing, but the x axis
really does not have a scale so to speak
• So say we have a STAT 2126 class with
38 Psych majors, 15 Soc, 18 CESD
majors and five Bio majors
Like this
[Figure: STAT 2126 majors shown as a bar graph (x axis: Major, y axis: Count, 0 to 40) and as a pie chart; categories: Psych, Soc, CESD, Biology]
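The class counts from the example can be tabulated in a few lines. This is only a sketch: the text bars stand in for a bar graph, and the dictionary name `majors` is illustrative.

```python
# Counts of majors in the STAT 2126 example
majors = {"Psych": 38, "Soc": 15, "CESD": 18, "Biology": 5}
total = sum(majors.values())

# One row per category: count, share of the class, and a text bar.
# The categories have no numeric scale, so their order is arbitrary.
for major, count in majors.items():
    bar = "#" * count
    print(f"{major:<8}{count:>3} {count / total:6.1%} {bar}")
```

The share column is what a pie chart would show; the bar length is what a bar graph would show.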
Quantitative Variables
• So with these of course we use a
histogram
• We can see central tendency
• Spread
• Shape
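Central tendency and spread can also be put into numbers. The sketch below uses Python's standard `statistics` module on the frequency-table data from the earlier slide. Which measures to report is an assumption here, since the slides do not name specific ones.

```python
import statistics

# Rebuild the raw data from the earlier frequency table
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}
data = sorted(x for x, f in freq_table.items() for _ in range(f))

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Spread
stdev = statistics.stdev(data)      # sample standard deviation
data_range = max(data) - min(data)

print(mean, median, mode)   # 25.75 25.0 25
print(data_range)           # 25
print(round(stdev, 2))
```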
Skewness

Kurtosis


• Leptokurtic means peaked


• Platykurtic means flat
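Skewness and kurtosis can be estimated from the data. The sketch below uses the simple moment-based (population) formulas, which is one convention among several; statistical packages often apply small-sample corrections and will give slightly different numbers. The function name `shape_moments` is hypothetical.

```python
def shape_moments(data):
    """Return (skewness, excess kurtosis) from central moments.

    Assumes the values are not all equal (otherwise the variance is zero).
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3            # 0 for a normal curve
    return skewness, excess_kurtosis

# Data rebuilt from the earlier frequency table
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}
data = [x for x, f in freq_table.items() for _ in range(f)]

skewness, excess_kurtosis = shape_moments(data)
print(skewness, excess_kurtosis)
```

Under this convention, positive skewness means a longer right tail and negative a longer left tail; positive excess kurtosis is leptokurtic and negative is platykurtic.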
More on shape
• A distribution can be symmetrical or
asymmetrical
• It may also be unimodal or bimodal
• It could be uniform
An example
• Number of goals
scored per year by
Mario Lemieux
• 43 48 54 70 85 45 19
44 69 17 69 50 35 6
28 1 7
• A histogram is a good
start, but you
probably need to
group the values
Mario could sorta play
[Figure: histogram "Goals Per Season"; x axis: Goal Totals (classes labelled 10 to 90), y axis: Frequency (0 to 6)]

• Wait a second, what is with that 90?


• Labels are midpoints, limits are 5-14 … 85-94
• Real limits of the top class are 84.5 – 94.5
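Grouping the goal totals into classes of width 10 can be sketched as follows, assuming classes labelled by their midpoints with limits 5-14, 15-24 and so on. The season with 1 goal falls below the 5-14 class, so this sketch adds a class labelled 0; that extra class is an assumption, and the counts need not match the plotted figure, which may place its class limits differently.

```python
from collections import Counter

# Goals per season from the example slide
goals = [43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7]

def class_midpoint(x, width=10):
    # Limits 5-14 map to label 10, 15-24 to label 20, and so on
    return width * ((x + width // 2) // width)

counts = Counter(class_midpoint(g) for g in goals)

# Print every class, including empty ones, so gaps stay visible
for mid in range(0, 100, 10):
    lo, hi = mid - 5, mid + 4
    print(f"{lo:>3} to {hi:<3} (label {mid:>2}): {'*' * counts[mid]}")
```

Once the values are grouped, the individual totals inside each class can no longer be recovered, which is the loss of richness that grouping brings.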
Careful
• You have to make sure the scale makes
sense
• Especially the Y axis
• One of the problems with a histogram with
grouped data like this is that you lose
some of the richness of the data, which is
OK with a big data set, perhaps not here
though
Stem and Leaf Plot
0 | 1 6 7
1 | 7 9
2 | 8
3 | 5
4 | 3 4 5 8
5 | 0 4
6 | 9 9
7 | 0
8 | 5
• This one is an ordered stem and leaf
• You interpret this like a histogram
• Easy to spot outliers
• Preserves data
• Easy to get the middle or 50th percentile,
which is 44 in this case
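An ordered stem and leaf plot like this one can be built with a short sketch: the tens digit is the stem and the ones digit is the leaf. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Goals per season from the example slide
goals = [43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7]

def stem_and_leaf(values):
    # Sorting first makes the leaves on each stem come out in order
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return stems

plot = stem_and_leaf(goals)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")

# With an odd number of values the median is the middle ordered value
median = sorted(goals)[len(goals) // 2]
print(median)  # 44
```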
The Five Number Summary
• You can get other stuff from a stem and
leaf as well
• Median
• First quartile (18 in our case, halfway
between 17 and 19)
• Third quartile (61.5 here)
• Quartiles are the 25th and 75th percentiles
• So the middle of the values below the
median, and the middle of the values above it
You said there were five numbers..
• Yeah so also there is the minimum, 1
• And the maximum, 85
– These two, by the way, give you the range (85 − 1 = 84)
• Now you take those five numbers and
make what is called a box and whisker
plot, or a boxplot
• Gives you an idea of the shape of the data
And here you go…
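The five numbers behind a boxplot can be computed with a short sketch. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, leaving the median itself out when the number of values is odd. Quartile conventions differ between textbooks and software, so results can differ slightly from hand-worked values. The function name `five_number_summary` is illustrative.

```python
import statistics

# Goals per season from the example slide
goals = [43, 48, 54, 70, 85, 45, 19, 44, 69, 17, 69, 50, 35, 6, 28, 1, 7]

def five_number_summary(values):
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]                              # values below the median
    upper = s[half + 1:] if n % 2 else s[half:]   # values above the median
    return (
        s[0],                       # minimum
        statistics.median(lower),   # first quartile
        statistics.median(s),       # median
        statistics.median(upper),   # third quartile
        s[-1],                      # maximum
    )

minimum, q1, median, q3, maximum = five_number_summary(goals)
print(minimum, q1, median, q3, maximum)
print("range:", maximum - minimum)
print("interquartile range:", q3 - q1)
```

In a boxplot the box runs from the first to the third quartile with a line at the median, and the whiskers reach out toward the minimum and maximum.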
