Data Analytics Theory

Exploratory data analysis involves formulating a problem, collecting relevant data, visualizing the data, performing statistical calculations, and interpreting the results. The document discusses exploratory data analysis techniques, including descriptive statistics, inferential statistics, and data visualization using tables, graphs and charts. It covers concepts such as data types, measures of central tendency, dispersion, outliers, and the use of standardization to compare data.

Uploaded by

Chandra Mohan


Exploratory Data Analysis

“Statistical Techniques/Methods”

Formulate problem → Get some data → Visualize the data → Do some statistical calculations → Interpret results
DATA AND SUMMARIZATION
Primary Uses of Statistics

• Descriptive statistics – the collection, organization, presentation and summary of data.

• Inferential statistics – generalizing from a sample to a population, estimating unknown parameters, drawing conclusions, making decisions.
Basic Vocabulary of Statistics

POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.

SAMPLE
A sample is the portion of a population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
Quantitative
• Discrete (no. of customers, no. of claims)
• Continuous (salary, price)

Qualitative (Categorical)
• Ordinal (customer satisfaction, efficiency of workers, bond rating)
• Nominal (sex, nationality, eye color)
Cross-Sectional Data

• Cross-sectional data: data collected at the same or approximately the same point in time
• Time series data: data collected over different time periods
Data Visualization
Preliminary Analysis: Page Views

Treatment     Average
Control       420
Treatment 1   501
Treatment 2   483

Preliminary Analysis: Calls

Treatment     Average
Control       34
Treatment 1   37
Treatment 2   42

Preliminary Analysis: Reservations

Treatment     Average
Control       34
Treatment 1   34
Treatment 2   42

Preliminary Analysis: Page Views

Treatment     Res Type      Average
Control       Chain         600
              Independent   300
Treatment 1   Chain         690
              Independent   375
Treatment 2   Chain         691
              Independent   345

Preliminary Analysis: Calls

Treatment     Res Type      Average
Control       Chain         40
              Independent   30
Treatment 1   Chain         44
              Independent   33
Treatment 2   Chain         48
              Independent   37

Preliminary Analysis: Reservations

Treatment     Res Type      Average
Control       Chain         40
              Independent   30
Treatment 1   Chain         40
              Independent   30
Treatment 2   Chain         48
              Independent   37
Frequency distribution table of Reservations data

Class interval   Frequency   Density
15-20            90          90/(30000 × 5) = 0.0006
20-25            1387
25-30            5776
30-35            7551
35-40            6809
40-45            4620
45-50            2082
50-55            922
55-60            500
60-65
65-70
70-75
75-80
Total            30000       1 (total area)
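The remaining density cells can be filled in the same way as the first row. The sketch below is one possible way to do it, using only the nine classes for which the table gives a frequency; the function names are illustrative, and the class width of 5 is read off the class intervals.

```python
# Frequencies copied from the table above. Classes from 60-65 onward are
# left out because the table gives no counts for them.
class_edges = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
frequencies = [90, 1387, 5776, 7551, 6809, 4620, 2082, 922, 500]
TOTAL = 30000  # total number of observations stated in the table


def relative_frequencies(freqs, total):
    """Share of all observations that falls in each class."""
    return [f / total for f in freqs]


def densities(freqs, total, edges):
    """Relative frequency per unit of class width (histogram bar height)."""
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return [f / (total * w) for f, w in zip(freqs, widths)]


rel = relative_frequencies(frequencies, TOTAL)
dens = densities(frequencies, TOTAL, class_edges)

for lo, hi, f, r, d in zip(class_edges[:-1], class_edges[1:], frequencies, rel, dens):
    print(f"{lo}-{hi}: freq={f:5d}  rel_freq={r:.4f}  density={d:.5f}")
```

For the 15-20 class this gives a relative frequency of 0.003 and a density of 0.0006. Relative frequencies sum to 1 over all classes, while densities multiplied by the class width sum to 1.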
Skewness
[Figure: three distributions, skewed to left, symmetric, and skewed to right]
What is the point? Why collect this data?
Data and randomness

• Three questions that good business managers ask themselves when they look at “the numbers”:

• What is a typical or central value?

• How much variability is present in the data set?

• Are there unusual shocks/events/cases (shape of the curve)?

Dispersion
• Describes how similar a set of observations are to each other, or the degree of deviation (spread) of a set of data from their central value

• In general, the more spread out a distribution is, the larger the measure of dispersion will be
Measures of Dispersion

• Which of the distributions of demand has the larger dispersion?

[Figure: two bar charts of demand, each with a vertical axis from 0 to 125 and a horizontal axis from 1 to 10]

The upper distribution has more dispersion because the scores are more spread out.
Measures of Dispersion

• There are four main measures of dispersion:

• Range
• Variance
• Standard Deviation
• Inter-quartile range (IQR)
Interpretation

• The larger the SD/variance is, the more the observations deviate, on average, from the mean
• The smaller the SD/variance is, the less the observations deviate, on average, from the mean
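As a rough illustration of the four measures, the sketch below computes them for two made-up samples that share the same mean but differ in spread. The data and the helper name `dispersion_summary` are assumptions for illustration. The variance and SD use the sample (n - 1) form, and the quartiles use the default method of `statistics.quantiles`, which is only one of several quartile conventions.

```python
import statistics


def dispersion_summary(values):
    """Range, sample variance, sample SD and IQR of a list of numbers."""
    ordered = sorted(values)
    q1, _median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "range": ordered[-1] - ordered[0],
        "variance": statistics.variance(ordered),
        "sd": statistics.stdev(ordered),
        "iqr": q3 - q1,
    }


# Two made-up samples, both with mean 50.
tight = [48, 49, 50, 50, 51, 52]
spread = [20, 35, 50, 50, 65, 80]

tight_summary = dispersion_summary(tight)
spread_summary = dispersion_summary(spread)

for name, summary in (("tight", tight_summary), ("spread", spread_summary)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Every measure comes out larger for the second sample, which matches the interpretation above: its observations sit further from the mean on average.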
Coefficient of Variation (CV)

• Relative measure (unit free) used for the purpose of comparison of variability.

• Relative measure = (absolute measure / average) × 100

$$ CV = \frac{s}{\bar{x}} \times 100 $$
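A small sketch of the CV calculation follows. The means and standard deviations are the ones reported in the page views example later in this document; the function name and the dictionary layout are illustrative choices.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: standard deviation relative to the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# (mean, SD) pairs taken from the summary table in the page views example.
summary = {
    "pageviews": (468.1, 168.16),
    "calls": (37.71, 7.97),
    "reservations": (36.55, 7.99),
}

cv = {name: coefficient_of_variation(sd, mean) for name, (mean, sd) in summary.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.1f}%")
```

Page views come out at roughly 36%, while calls and reservations are both a little above 21%. Because the CV is unit free, these can be compared directly even though the three variables have very different scales.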
Percentiles, Quartiles and IQR

• Percentiles divide ordered data into 100 groups (99 percentiles).
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide the data into 10 groups (9 deciles).
• Quartiles divide the data into 4 groups (3 quartiles).
Percentiles, quartiles, and the IQR

The 10th percentile (denoted by P10) is the number such that 10% of the values are less than it and 90% are bigger.

The median is the 50th percentile.

The 1st quartile (denoted by Q1) is the number such that 25% of the values are less than it and 75% are bigger.

The 3rd quartile (denoted by Q3) is the number such that 75% of the values are less than it and 25% are bigger.

Interquartile range (IQR) = Q3 - Q1
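The definitions above can be tried out on a small made-up sample. This sketch assumes NumPy is available and uses `numpy.percentile` with its default linear interpolation; other software may use a different quartile convention and give slightly different values.

```python
import numpy as np

# Made-up sample, used only to illustrate the definitions.
scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 10th percentile, first quartile, median and third quartile in one call.
p10, q1, median, q3 = np.percentile(scores, [10, 25, 50, 75])
iqr = q3 - q1

print(f"P10 = {p10}, Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```

With only nine values the "25% below Q1" statement holds only approximately (two of the nine values lie below Q1 here); the percentages become more exact as the sample grows.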

Box Plot

Describes the overall distribution of a set of numbers but is simpler than a histogram.

Useful when comparing several samples because too many histograms on one graph would be both crowded and confusing.

Also produces a useful display with small data sets.

Useful to detect outliers / extreme values

EXAMPLE: Page views data

Measures   pageviews   calls   reservations
Min.       145.0       17.00   15.00
1st Qu.    328.0       32.00   31.00
Median     391.0       37.00   36.00
Mean       468.1       37.71   36.55
3rd Qu.    636.0       42.00   41.00
Max.       929.0       77.00   79.00
SD         168.16      7.97    7.99
MAD        149.74      7.41    7.41

restaurant_type
chain         12000
independent   18000
Box Plot

S=smallest, L=Largest, M=median

Q1=lower quartile, Q3=upper quartile
Detection of Outliers (Box Plot)

• Calculate Q1 - 1.5 × IQR and Q3 + 1.5 × IQR

• Any data point lying outside this interval is an outlier
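The rule can be applied directly to the quartiles reported in the page views example earlier in this document. The sketch below is illustrative; the function names are assumptions, and only the minimum and maximum of each variable are checked because the raw data are not available here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# (Q1, Q3) and (min, max) taken from the summary table of the example.
quartiles = {"pageviews": (328.0, 636.0), "calls": (32.0, 42.0), "reservations": (31.0, 41.0)}
extremes = {"pageviews": (145.0, 929.0), "calls": (17.0, 77.0), "reservations": (15.0, 79.0)}

for name, (q1, q3) in quartiles.items():
    lower, upper = iqr_fences(q1, q3)
    smallest, largest = extremes[name]
    print(
        f"{name}: fences = ({lower}, {upper}), "
        f"min is outlier: {is_outlier(smallest, q1, q3)}, "
        f"max is outlier: {is_outlier(largest, q1, q3)}"
    )
```

For reservations the fences are 16 and 56, so both the minimum (15) and the maximum (79) are flagged. For page views the fences are wide enough that neither extreme is flagged.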
[Box plot: IBM, axis from 83 to 91]

[Box plot: EDS, axis from 18.5 to 22.5]
A large number of fast-food restaurants have drive-through windows, offering drivers and their passengers the advantage of quick service. To measure how good the service is, an organization called QSR planned a study in which the time taken by a sample of drive-through customers at each of five restaurants was recorded. Compare the five sets of data using a box plot and interpret the results.
Standardising Data
• Purpose: To compare each data point to the natural range and variation of the dataset.
• Method: For each data value, subtract the sample mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They
• measure how many standard deviations above or below the mean a data point is
• are “unit free”
• have mean zero and SD 1
Standardising Data

How: To compare each data point to the natural range and variation of the dataset, compute

$$ z = \frac{x - \bar{x}}{s} $$

A z-score can be either positive or negative.
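A minimal sketch of standardisation on a made-up sample is shown below. It uses the sample standard deviation (n - 1 in the denominator) from Python's `statistics` module; the data and the function name are illustrative.

```python
import statistics


def z_scores(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


# Made-up salaries (in thousands), used only for illustration.
salaries = [30, 40, 50, 60, 70]
z = z_scores(salaries)

print([round(value, 3) for value in z])
print("mean of z:", round(statistics.mean(z), 10))
print("SD of z:", round(statistics.stdev(z), 10))
```

Values below the sample mean get negative z-scores and values above it get positive ones, and the standardised values have mean zero and SD 1, as stated above.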

Capturing variation

• Chebyshev’s Theorem
Applies to any distribution, regardless of shape

• Empirical Rule
Applies only to roughly mound-shaped and symmetric distributions
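Chebyshev’s Theorem
• At least $1 - \frac{1}{k^2}$ of the elements of any distribution lie within $k$ standard deviations of the mean (for $k > 1$).

$k = 2$: at least $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$ lie within 2 standard deviations of the mean
$k = 3$: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$ lie within 3 standard deviations of the mean
$k = 4$: at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$ lie within 4 standard deviations of the mean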
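The three worked values can be reproduced with a short function. This is a sketch; the function name is illustrative, and exact fractions are used so that the results can be compared with 3/4, 8/9 and 15/16 directly.

```python
from fractions import Fraction


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / Fraction(k) ** 2


for k in (2, 3, 4):
    bound = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {bound} = {float(bound):.1%}")
```

These are lower bounds that hold for any distribution, so the actual share within k standard deviations is often considerably higher.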
Empirical Rule
• For roughly mound-shaped and symmetric distributions, approximately:

68% lie within 1 standard deviation of the mean

95% lie within 2 standard deviations of the mean

Almost all lie within 3 standard deviations of the mean
Empirical Rule
[Figure: bell curve centered at μ, with 68.26% of values within μ ± 1σ, 95.44% within μ ± 2σ, and 99.72% within μ ± 3σ]
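For an exactly normal distribution the three percentages can be computed from the error function, since P(|Z| < k) = erf(k / √2) for a standard normal Z. The sketch below assumes normality, which is stronger than the "roughly mound-shaped" condition of the rule; the function name is illustrative.

```python
import math


def normal_share_within(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {normal_share_within(k):.4%}")
```

The computed shares are about 68.27%, 95.45% and 99.73%. The figures quoted above differ slightly in the last digit, which is presumably what you get when the values are read from a z-table rounded to four decimals.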
Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the linear association between two variables
• Only concerned with strength of the relationship
• No causal effect is implied
Scatter Plot Examples
[Figure: scatter plots of y against x illustrating linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient

• The correlation coefficient (r) is used to measure the strength of the linear relationship in the sample observations
Calculating sample Correlation Coefficient

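$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} $$

$$ \operatorname{cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) $$

$$ s_x = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum (y_i - \bar{y})^2} $$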
Features of correlation coefficient
• Unit free
• Ranges between -1.00 and 1.00
• -1≤r<0 implies that as X ↑ (↓), Y ↓ (↑ )
• 0< r≤1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
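The sample correlation coefficient defined above can be computed step by step as in the sketch below. It follows the 1/n form for both the covariance and the standard deviations, so the factor cancels in the ratio; the function name and the data are made up for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation r = cov(x, y) / (s_x * s_y), using 1/n throughout."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


# Made-up data: a perfectly increasing, a perfectly decreasing and a noisy pair.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2, 4, 6, 8, 10])
r_down = correlation(x, [10, 8, 6, 4, 2])
r_noisy = correlation(x, [2, 1, 4, 3, 5])

print(round(r_up, 3), round(r_down, 3), round(r_noisy, 3))
```

The first two pairs lie exactly on a line, so r is 1 and -1 respectively, while the noisy pair gives a value strictly between 0 and 1. The function divides by zero if either variable is constant, in which case r is undefined.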
Examples of Approximate r Values

[Figure: five scatter plots of y against x with r = -1.00, r = -0.60, r = 0.00, r = 0.20, and r = 1.00]
