Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis
Statistics 2126
Introduction
• If you are going to find out anything about
a data set, you must first understand the
data
• Basically, getting a feel for your numbers
– Easier to find mistakes
– Easier to guess what actually happened
– Easier to find odd values
Introduction
• One of the most important and overlooked
parts of statistics is Exploratory Data
Analysis, or EDA
• Developed by John Tukey
• Allows you to generate hypotheses as well
as get a feel for your data
• Get an idea of how the experiment went
without losing any richness in the data
Hey look, numbers!
x (the value) f (frequency)
10 1
23 2
25 5
30 2
33 1
35 1
Frequency tables make stuff easy
Σxf
• 10(1)+23(2)+25(5)+30(2)+33(1)+35(1)
• = 309
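The sum above can be checked with a short sketch. Python is an assumption here (the slides name no language), and the names freq_table, total and n are illustrative only. The last line divides by the number of observations to get the mean, a step the slide does not show.

```python
# Frequency table from the slide: value x -> frequency f
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}

# Sum of x*f: each value weighted by how often it occurs
total = sum(x * f for x, f in freq_table.items())

# The number of observations is the sum of the frequencies
n = sum(freq_table.values())

print(total)      # 309
print(n)          # 12
print(total / n)  # 25.75 (mean; not shown on the slide)
```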
Relative Frequency Histogram
• You can use this to make a relative
frequency histogram
• Lose no richness in the data
• Easy to reconstruct data set
• Easy to spot oddities
[Figure: "Relative Frequency Histogram"; x axis: Score, y axis: Frequency]
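As a rough sketch of the idea on this slide, the relative frequency of each score is its frequency divided by the number of observations, and the original data set can be rebuilt from the table. Python and the names rel_freq and reconstructed are assumptions made for illustration, not part of the course material.

```python
from fractions import Fraction

# Frequency table from the earlier slide: value x -> frequency f
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}
n = sum(freq_table.values())

# Relative frequency = frequency / number of observations
rel_freq = {x: Fraction(f, n) for x, f in freq_table.items()}

# Text version of the histogram: one '*' per observation
for x, p in rel_freq.items():
    print(f"{x:>3}  {float(p):.3f}  {'*' * freq_table[x]}")

# Nothing is lost: the raw data can be rebuilt from the table
reconstructed = [x for x, f in freq_table.items() for _ in range(f)]
print(reconstructed)
```

Exact fractions are used so the relative frequencies sum to exactly 1; plain floats would work as well for plotting.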
Categorical Data
• With categorical data you do not get a
histogram, you get a bar graph
• You could do a pie chart too, though I hate
them (but I love pie)
• Pretty much the same thing, but the x axis
really does not have a scale so to speak
• So say we have a STAT 2126 class with
38 Psych majors, 15 Soc, 18 CESD
majors and five Bio majors
Like this
[Figure: bar graph "STAT 2126"; x axis: Major (Psych, Soc, CESD, Biology), y axis: Count]
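A minimal sketch of the bar graph for the class counts given above, drawn as text so it needs no plotting library. Python, the function name text_bar_chart, and the use of '#' for bars are illustrative assumptions. Note that the categories have no numeric scale; only the counts do.

```python
# Counts per major, as stated on the slide
majors = {"Psych": 38, "Soc": 15, "CESD": 18, "Biology": 5}
total = sum(majors.values())

def text_bar_chart(counts):
    """Return one line per category: label, a bar of '#' marks, and the count."""
    lines = []
    for label, count in counts.items():
        lines.append(f"{label:<8}{'#' * count} {count}")
    return lines

for line in text_bar_chart(majors):
    print(line)
print("Total students:", total)
```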
Quantitative Variables
• So with these of course we use a
histogram
• We can see central tendency
• Spread
• Shape
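The first two of these can also be put into numbers. The sketch below uses the scores from the earlier frequency table and Python's standard statistics module; the choice of mean, median and mode for central tendency, and of range and population standard deviation for spread, is an assumption, since the slides do not say which measures are intended.

```python
import statistics

# Rebuild the raw scores from the earlier frequency table
freq_table = {10: 1, 23: 2, 25: 5, 30: 2, 33: 1, 35: 1}
data = [x for x, f in freq_table.items() for _ in range(f)]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Spread
data_range = max(data) - min(data)
spread = statistics.pstdev(data)  # population standard deviation

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "std dev:", round(spread, 2))
```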
Skewness
Kurtosis
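Since the pictures for these two slides did not survive, here is a hedged sketch of how skewness and kurtosis are commonly computed from central moments. It uses the population (moment-based) definitions; statistical packages often apply sample corrections, so their values may differ slightly. Python and the function names are illustrative assumptions.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)**k."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```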
[Figure: histogram; x axis: Goal Totals (10 to 90), y axis: Frequency (0 to 6)]