Data Analytics Theory

Exploratory data analysis involves formulating a problem, collecting relevant data, visualizing the data, and performing statistical calculations to interpret the results. The document discusses exploratory data analysis techniques including descriptive statistics, inferential statistics, and data visualization using tables, graphs and charts. It covers concepts such as data types, measures of central tendency, dispersion, outliers, and the use of standardization to compare data.

Uploaded by

Chandra Mohan
Copyright
© All Rights Reserved
Exploratory Data Analysis

“Statistical Techniques/Methods”

Formulate problem → Get some data → Visualize the data → Do some statistical calculations → Interpret results
DATA AND SUMMARIZATION
Primary Uses of Statistics

• Descriptive statistics – the collection, organization, presentation and summary of data.

• Inferential statistics – generalizing from a sample to a population, estimating unknown parameters, drawing conclusions, making decisions.
Basic Vocabulary of Statistics

POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.

SAMPLE
A sample is the portion of a population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
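
The distinction between a parameter and a statistic can be illustrated with a minimal sketch. The population below is made up purely for illustration (hypothetical claim counts), and the names `population`, `sample`, `population_mean` and `sample_mean` are assumptions, not part of the original slides.

```python
import random
import statistics

# Hypothetical population: claim counts for 10,000 customers (made-up values).
rng = random.Random(42)
population = [rng.randint(0, 20) for _ in range(10_000)]

# Parameter: a numerical measure computed from the whole population.
population_mean = statistics.mean(population)

# Sample: a portion of the population selected for analysis.
sample = rng.sample(population, 200)

# Statistic: the same measure computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not equal to, the population mean; inferential statistics is about quantifying that gap.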
Quantitative
• Discrete (no. of customers, no. of claims)
• Continuous (salary, price)

Qualitative (Categorical)
• Ordinal (customer satisfaction, efficiency of workers, bond rating)
• Nominal (sex, nationality, eye color)
Cross-Sectional Data

• Cross-sectional data: data collected at the same or approximately the same point in time
• Time series data: data collected over different time periods
Data Visualization
Preliminary Analysis: Page Views
Treatment      Average
Control        420
Treatment 1    501
Treatment 2    483

Preliminary Analysis: Calls
Treatment      Average
Control        34
Treatment 1    37
Treatment 2    42

Preliminary Analysis: Reservations
Treatment      Average
Control        34
Treatment 1    34
Treatment 2    42
Preliminary Analysis: Page Views
Treatment      Restaurant Type    Average
Control        Chain              600
Control        Independent        300
Treatment 1    Chain              690
Treatment 1    Independent        375
Treatment 2    Chain              691
Treatment 2    Independent        345

Preliminary Analysis: Calls
Treatment      Restaurant Type    Average
Control        Chain              40
Control        Independent        30
Treatment 1    Chain              44
Treatment 1    Independent        33
Treatment 2    Chain              48
Treatment 2    Independent        37

Preliminary Analysis: Reservations
Treatment      Restaurant Type    Average
Control        Chain              40
Control        Independent        30
Treatment 1    Chain              40
Treatment 1    Independent        30
Treatment 2    Chain              48
Treatment 2    Independent        37
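
The averages by restaurant type relate to the overall treatment averages through a weighted mean. The sketch below checks this for page views, using the subgroup averages from the tables and the restaurant counts (12,000 chain, 18,000 independent) given in the summary later in this document. It assumes, and the slides do not state this, that every treatment group has the same chain/independent mix as the full data set.

```python
# Subgroup averages for page views, copied from the table in this document.
page_views = {
    "Control": {"chain": 600, "independent": 300},
    "Treatment 1": {"chain": 690, "independent": 375},
    "Treatment 2": {"chain": 691, "independent": 345},
}

# Assumption: each treatment group has the overall 12000:18000 mix of types.
weights = {"chain": 12000 / 30000, "independent": 18000 / 30000}


def overall_average(by_type, weights):
    """Weighted mean of the subgroup averages."""
    return sum(weights[kind] * avg for kind, avg in by_type.items())


overall = {name: overall_average(by_type, weights)
           for name, by_type in page_views.items()}

for name, value in overall.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption the weighted means reproduce the overall page-view averages of 420, 501 and 483 (the last after rounding), which suggests the two sets of tables are consistent with each other.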
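Frequency distribution table of Reservations data
Class interval    Frequency    Density
15-20             90           90/(30000 × 5) = 0.0006
20-25             1387
25-30             5776
30-35             7551
35-40             6809
40-45             4620
45-50             2082
50-55             922
55-60             500
60-65
65-70
70-75
75-80
Total             30000        1

Density = frequency / (total frequency × class width). With a class width of 5, the areas of the histogram bars (density × width) sum to 1.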
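
As a rough sketch, the density column can be filled in for the classes whose frequencies are listed. The calculation assumes density means relative frequency divided by the class width of 5, which is the reading that gives 0.0006 for the first class; the variable names are illustrative.

```python
class_width = 5   # each class interval spans 5 units
total = 30000     # total number of observations in the table

# Frequencies as listed in the table. Classes from 60-65 upward are omitted
# because the source does not give their counts.
frequencies = {
    "15-20": 90, "20-25": 1387, "25-30": 5776, "30-35": 7551, "35-40": 6809,
    "40-45": 4620, "45-50": 2082, "50-55": 922, "55-60": 500,
}

relative_frequency = {c: f / total for c, f in frequencies.items()}
density = {c: f / (total * class_width) for c, f in frequencies.items()}

for c in frequencies:
    print(f"{c}: frequency={frequencies[c]:5d}  "
          f"relative={relative_frequency[c]:.4f}  density={density[c]:.5f}")
```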
Skewness
• Skewed to left
• Symmetric
• Skewed to right
What is the point? Why collect this data?
Data and randomness

• Three questions that good business managers ask themselves when they look at “the numbers”:

• What is a typical or central value?

• How much variability is present in the data set?

• Are there unusual shocks/events/cases (shape of the curve)?


Dispersion
• Describes how similar the observations in a set are to each other, or equivalently the degree of deviation (spread) of the data from their central value

• In general, the more spread out a distribution is, the larger the measure of dispersion will be
Measures of Dispersion

• Which of the distributions of demand has the larger dispersion?

[Figure: two bar charts of demand, horizontal axis from 1 to 10, vertical axis from 0 to 125]

The upper distribution has more dispersion because the scores are more spread out
Measures of Dispersion

• There are four main measures of dispersion:


• Range
• Variance
• Standard Deviation
• Inter-quartile range (IQR)
Interpretation

• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
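
The four measures can be computed with Python's standard `statistics` module, as in the minimal sketch below. The data values are hypothetical. The variance and SD here divide by n, matching the 1/n convention used in the correlation formulas later in this document; `statistics.variance` and `statistics.stdev` give the n - 1 versions instead.

```python
import statistics

# Hypothetical demand observations (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
variance = statistics.pvariance(data)  # mean squared deviation (divides by n)
std_dev = statistics.pstdev(data)      # square root of that variance

# quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range={data_range}, variance={variance}, SD={std_dev}, IQR={iqr}")
```

The range and SD react strongly to a single extreme value, while the IQR depends only on the middle half of the data.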
Coefficient of Variation (CV)

• Relative measure (unit free) used for the purpose of comparison of variability.

• Relative measure = (absolute measure / average) × 100

$$ CV = \frac{s}{\bar{x}} \times 100 $$
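
As an illustration, the CV can be computed from the mean and SD values reported in the summary table later in this document (page views, calls, reservations). The dictionary layout and names are assumptions made for this sketch.

```python
# (mean, SD) pairs copied from the summary table in this document.
summary = {
    "pageviews": (468.1, 168.16),
    "calls": (37.71, 7.97),
    "reservations": (36.55, 7.99),
}

# CV = (s / mean) * 100, expressed as a percentage.
cv = {name: sd / mean * 100 for name, (mean, sd) in summary.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}%")
```

Because the CV is unit free, it allows a comparison that the raw SDs do not: page views vary more relative to their mean than calls or reservations do.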
Percentiles, Quartiles and IQR

• Percentiles divide ordered data into 100 equal groups (99 percentiles).
• For example, suppose you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide ordered data into 10 equal groups (9 deciles).
• Quartiles divide ordered data into 4 equal groups (3 quartiles).
Percentiles, quartiles, and the IQR

The 10th percentile (denoted by P10) is the number such that 10% of the values are less than it and 90% are bigger.

The median is the 50th percentile.

The 1st quartile (denoted by Q1) is the number such that 25% of the values are less than it and 75% are bigger.

Inter-quartile range (IQR) = Q3 - Q1
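
A short sketch of these definitions using `statistics.quantiles` from the Python standard library. The scores are hypothetical (the integers 1 to 100). Software packages use slightly different interpolation rules for percentiles, so other tools may return marginally different cut points.

```python
import statistics

# Hypothetical test scores: the integers 1..100 (illustrative only).
scores = list(range(1, 101))

# n=100 returns the 99 percentile cut points P1..P99.
percentiles = statistics.quantiles(scores, n=100)
p10 = percentiles[9]    # 10th percentile
p50 = percentiles[49]   # 50th percentile, i.e. the median

# n=4 returns the three quartiles Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Share of values strictly below the 10th percentile.
share_below_p10 = sum(s < p10 for s in scores) / len(scores)

print(f"P10={p10}, median={p50}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"share of scores below P10: {share_below_p10:.0%}")
```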


Box Plot

• Describes the overall distribution of a set of numbers but is simpler than a histogram.
• Useful when comparing several samples because too many histograms on one graph would be both crowded and confusing.
• Also produces a useful display with small data sets.
• Useful to detect outliers / extreme values.


EXAMPLE: Page views data

Measures    pageviews    calls    reservations
Min.        145.0        17.00    15.00
1st Qu.     328.0        32.00    31.00
Median      391.0        37.00    36.00
Mean        468.1        37.71    36.55
3rd Qu.     636.0        42.00    41.00
Max.        929.0        77.00    79.00
SD          168.16       7.97     7.99
MAD         149.74       7.41     7.41

restaurant_type
chain       : 12000
independent : 18000
Box Plot

S = smallest, L = largest, M = median, Q1 = lower quartile, Q3 = upper quartile
Detection of Outliers (Box Plot)

• Calculate Q1 - 1.5 × IQR and Q3 + 1.5 × IQR
• Any data point lying outside this interval is an outlier
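
The rule can be applied to the quartiles, minimum and maximum from the page views summary table in this document. This is a sketch only: with summary values alone it can show whether at least one outlier exists on each side, not how many there are. The function and key names are illustrative.

```python
# Minimum, quartiles and maximum copied from the summary table in this document.
summary = {
    "pageviews":    {"min": 145.0, "q1": 328.0, "q3": 636.0, "max": 929.0},
    "calls":        {"min": 17.0,  "q1": 32.0,  "q3": 42.0,  "max": 77.0},
    "reservations": {"min": 15.0,  "q1": 31.0,  "q3": 41.0,  "max": 79.0},
}


def fences(q1, q3):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


results = {}
for name, s in summary.items():
    lower, upper = fences(s["q1"], s["q3"])
    results[name] = {
        "lower": lower,
        "upper": upper,
        # If the minimum or maximum lies outside a fence, at least one
        # outlier exists on that side.
        "low_outliers": s["min"] < lower,
        "high_outliers": s["max"] > upper,
    }
    print(name, results[name])
```

For calls, for example, the fences are 17 and 57, so the maximum of 77 is flagged as an outlier while the minimum of 17 sits exactly on the lower fence and is not.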
[Box plot: IBM, horizontal axis from 83 to 91]

[Box plot: EDS, horizontal axis from 18.5 to 22.5]
A large number of fast-food restaurants have drive-through windows, offering drivers and their passengers the advantage of quick service. To measure how good the service is, an organization called QSR planned a study in which the amount of time taken by a sample of drive-through customers at each of five restaurants was recorded. Compare the five sets of data using a box plot and interpret the results.
Standardising Data
• Purpose: To compare each data point to the natural range and variation of the dataset.
• Method: For each data value, subtract the sample mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They:
• measure how many standard deviations above or below the mean a data point is
• are “unit free”
• have mean zero and SD 1
Standardising Data

How: To compare each data point to the natural range and variation of the dataset, compute

$$ z = \frac{x - \bar{x}}{s} $$

A z-score can be either positive or negative.
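
A minimal sketch of standardisation, assuming hypothetical salary values and using the sample standard deviation (n - 1 divisor) from the `statistics` module:

```python
import statistics

# Hypothetical salaries in thousands (illustrative values only).
values = [42.0, 48.0, 50.0, 55.0, 61.0, 74.0]

mean = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)

# z = (x - mean) / s for every data value.
z_scores = [(x - mean) / s for x in values]

for x, z in zip(values, z_scores):
    print(f"x={x:5.1f}  z={z:+.2f}")

print("mean of z:", round(statistics.mean(z_scores), 10))
print("SD of z:  ", round(statistics.stdev(z_scores), 10))
```

Values below the mean get negative z-scores and values above it get positive ones, and the standardised values have mean zero and SD 1 regardless of the original units.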


Capturing variation

• Chebyshev’s Theorem
Applies to any distribution, regardless of shape

• Empirical Rule
Applies only to roughly mound-shaped and symmetric distributions
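Chebyshev’s Theorem
• At least $1 - \frac{1}{k^2}$ of the elements of any distribution lie within k standard deviations of the mean (k > 1)

k    At least                                                               Lie within
2    $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$             2 standard deviations of the mean
3    $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$       3 standard deviations of the mean
4    $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$    4 standard deviations of the mean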
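
The bound is simple enough to tabulate directly. The function name below is an assumption made for this sketch.

```python
def chebyshev_lower_bound(k):
    """Minimum share of any distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4)}

for k, share in bounds.items():
    print(f"k={k}: at least {share:.1%} within {k} standard deviations")
```

These are guaranteed minimums for any shape of distribution, so the actual share within k standard deviations is often considerably higher.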
Empirical Rule
• For roughly mound-shaped and symmetric distributions, approximately:
68% lie within 1 standard deviation of the mean
95% lie within 2 standard deviations of the mean
Almost all lie within 3 standard deviations of the mean
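Empirical Rule
[Figure: bell curve of x centered at the mean m, with 68.26% of the area between m – 1s and m + 1s, 95.44% between m – 2s and m + 2s, and 99.72% between m – 3s and m + 3s]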
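
For an exactly normal distribution, the shares can be computed from the cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library. The results agree with the percentages quoted above to within about 0.01 percentage points, and they apply only approximately to real data that is merely mound-shaped.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k)
            for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.2%}")
```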
Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the linear association between two variables
• Only concerned with the strength of the relationship
• No causal effect is implied
Scatter Plot Examples
[Figure: example scatter plots of y against x showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient

• The correlation coefficient (r) is used to measure the strength of the linear relationship in the sample observations
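Calculating sample Correlation Coefficient

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} $$

$$ \operatorname{cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) $$

$$ s_x = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2} \qquad s_y = \sqrt{\frac{1}{n} \sum (y_i - \bar{y})^2} $$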
Features of correlation coefficient
• Unit free
• Ranges between -1.00 and 1.00
• -1≤r<0 implies that as X ↑ (↓), Y ↓ (↑ )
• 0< r≤1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
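
The calculation and several of these features can be checked with a small sketch. It follows the 1/n convention of the covariance and standard deviation formulas above; the function name `correlation` and all data values are hypothetical.

```python
import math


def correlation(x, y):
    """Sample correlation r = cov(x, y) / (s_x * s_y), using 1/n throughout."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (s_x * s_y)


# Hypothetical data (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_positive = correlation(x, y)

# A perfectly decreasing straight line gives r = -1.
r_negative = correlation(x, [10 - 2 * a for a in x])

# A curvilinear relationship (y = x^2 on symmetric x) gives r = 0 even though
# y is completely determined by x: r measures only linear association.
x_sym = [-2, -1, 0, 1, 2]
r_curved = correlation(x_sym, [a**2 for a in x_sym])

print(f"r (positive example)    = {r_positive:.4f}")
print(f"r (perfect negative)    = {r_negative:.4f}")
print(f"r (curvilinear, y=x^2)  = {r_curved:.4f}")
```

The last case is why r = 0 should be read as "not linearly associated" and not as "unrelated".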
Examples of Approximate r Values

[Figure: five scatter plots of y against x with r = -1.00, r = -.60, r = 0.00, r = 0.20, and r = 1.00]
