
Descriptive Data Analytics

The document discusses descriptive data analytics including measures of central tendency, measures of dispersion, population and sample means and standard deviations, and measures of co-movement between variables. Descriptive analytics is used to identify trends, patterns and relationships in data.

Uploaded by Lia Ann Vargas

DESCRIPTIVE DATA ANALYTICS (PART I)
23 September 2023
Content
• Descriptive Data Analytics
• Sample Use Cases
• Measures of Central Tendency
• Measures of Dispersion
• Population Mean and Standard Deviation
• Sample Mean and Standard Deviation
• Measures of Co-movement between Variables
• Presentation of Data Analysis
Descriptive Data Analytics
• Process of using current and historical data to identify trends, patterns and relationships
• Simplest form of data analysis
Sample Use Cases
Traffic and Engagement Reports
• Analyze user traffic on social media or a webpage
• Evaluate whether advertisements increase traffic
• Understand the dynamics of user traffic
Sample Use Cases
Financial Applications
• Look at underlying patterns to assess a company’s financial health
• Understand cost drivers of financial metrics
• Assess performance of funds and other investments
Sample Use Cases

Demand Trends
• Determine which products or services are trending
• Identify which products are favored at a given point in time
• Understand patterns in consumer behavior
Sample Use Cases

Aggregated Survey Results
• Generate key insights from surveys
• Detect correlations of variables and understand relationships of factors based on survey responses
Sample Use Cases

Progress to Goals
• Analyze results of Key Performance Indicators (KPI) to check whether efforts are on track or adjustments need to be made
• Generate dashboards for updating project milestones
Measures of Central Tendency
• Goal: Describe the center of a data set
• Three Common Ways:
(1) Mean: Sum all the numbers and divide by how many numbers you have
(2) Median: Arrange the numbers in ascending order and find the middle number
(3) Mode: Find the most commonly occurring number
Measures of Central Tendency: Illustrations
Suppose you are given the net worth of 10 people:

$22,000 $13,000 $1,200,000 $150 $45,000


$30,000 $45,000 $40,000 $14,000 $45,000

Find the mean, median and mode.


Measures of Central Tendency: Illustrations
Mean
= [$22,000 + $13,000 + $1,200,000 + $150 + $45,000 + $30,000 + $45,000 + $40,000 + $14,000 + $45,000]/10
= $145,415
Measures of Central Tendency: Illustrations
Median

Original order:
$22,000 $13,000 $1,200,000 $150 $45,000
$30,000 $45,000 $40,000 $14,000 $45,000

Ascending order:
$150 $13,000 $14,000 $22,000 $30,000
$40,000 $45,000 $45,000 $45,000 $1,200,000

Median = [$30,000 + $40,000]/2 = $35,000


Measures of Central Tendency: Illustrations
Mode

Original order:
$22,000 $13,000 $1,200,000 $150 $45,000
$30,000 $45,000 $40,000 $14,000 $45,000
Value         Frequency
$150          1
$13,000       1
$14,000       1
$22,000       1
$30,000       1
$40,000       1
$45,000       3
$1,200,000    1

Mode = $45,000
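The three illustrations above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median, mode

# Net worth of the 10 people from the illustration, in dollars
net_worth = [22_000, 13_000, 1_200_000, 150, 45_000,
             30_000, 45_000, 40_000, 14_000, 45_000]

avg = mean(net_worth)     # sum of all values divided by how many there are
mid = median(net_worth)   # even count, so the two middle values are averaged
most = mode(net_worth)    # most commonly occurring value

print(f"Mean   = ${avg:,.0f}")
print(f"Median = ${mid:,.0f}")
print(f"Mode   = ${most:,.0f}")
```

The single very large value pulls the mean far above what most of the ten people are worth, while the median and mode stay close to the bulk of the data.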
Measures of Central Tendency

Measure of Central Tendency: Advantage and When to Use

Mean
• Advantage: Considers the impact of all values in the data set; fairly stable for a sufficiently large sample
• When to use: No obvious extreme values
Median
• Advantage: Avoids data distortion from extreme outlier values
• When to use: There are extreme outliers observed
Mode
• Advantage: Provides results when analyzing the frequency of qualitative data
• When to use: Useful for analyzing qualitative data
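To make the guidance on outliers concrete, the sketch below reuses the net worth data from the earlier illustration and compares the mean and median with and without the single extreme value. Treating $1,200,000 as "the outlier" is an assumption made here for illustration only.

```python
from statistics import mean, median

net_worth = [22_000, 13_000, 1_200_000, 150, 45_000,
             30_000, 45_000, 40_000, 14_000, 45_000]

# Assumption for illustration: drop the one extreme value ($1,200,000)
without_outlier = [v for v in net_worth if v != 1_200_000]

# How far does each measure move when the outlier is removed?
mean_shift = mean(net_worth) - mean(without_outlier)
median_shift = median(net_worth) - median(without_outlier)

print(f"Mean moves by   ${mean_shift:,.2f}")
print(f"Median moves by ${median_shift:,.2f}")
```

The mean moves by more than $100,000 while the median moves by only $5,000, which is why the median is preferred when extreme outliers are present.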
Measures of Dispersion
• Goal: Describe how spread out the values in a dataset are
• Common tools:
(1) Range: Maximum Value – Minimum Value
(2) Standard Deviation: “Average” Deviation from the Mean
(3) Interquartile Range: Range in which the middle 50% of the data distribution lies (or Third Quartile – First Quartile)
Measures of Dispersion: Illustrations
Determine the range of marks on tests A and B

Range of Marks on Test A = 70 – 45 = 25
Range of Marks on Test B = 65 – 45 = 20
Measures of Dispersion: Illustrations
Determine the standard deviation of the populations of the following 6 cities in the Bay Area, California:

City             Population
San Jose         1,000,000
San Ramon        85,000
San Francisco    870,000
Daly City        100,000
Palo Alto        69,000
Oakland          420,000


Measures of Dispersion: Illustrations
Mean
= [1,000,000 + 85,000 + 870,000 + 100,000 + 69,000 + 420,000]/6
= 424,000
City             Population    Deviation from Mean (Actual – Mean)    Squared Deviation
San Jose         1,000,000     576,000                                (576,000)²
San Ramon        85,000        -339,000                               (-339,000)²
San Francisco    870,000       446,000                                (446,000)²
Daly City        100,000       -324,000                               (-324,000)²
Palo Alto        69,000        -355,000                               (-355,000)²
Oakland          420,000       -4,000                                 (-4,000)²

Variance = “Average” Squared Deviation
= [(576,000)² + (-339,000)² + (446,000)² + (-324,000)² + (-355,000)² + (-4,000)²]/6
= 146,105,000,000
Standard Deviation = √Variance = √146,105,000,000 = 382,236.84
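The variance and standard deviation for the six cities can be reproduced as follows. This is a small sketch with Python's `statistics` module, where `pvariance` and `pstdev` divide by the number of values (6), matching the calculation in the illustration; the variable names are illustrative.

```python
from statistics import fmean, pvariance, pstdev

# City populations in the order listed in the illustration
populations = [1_000_000, 85_000, 870_000, 100_000, 69_000, 420_000]

mu = fmean(populations)             # arithmetic mean
variance = pvariance(populations)   # average squared deviation (divides by N)
std_dev = pstdev(populations)       # square root of the variance

print(f"Mean               = {mu:,.0f}")
print(f"Variance           = {variance:,.0f}")
print(f"Standard deviation = {std_dev:,.2f}")
```

The standard deviation is the square root of the variance, which brings the measure back to the same unit as the data (number of people).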
Measures of Dispersion: Illustrations
Interquartile Distribution

First quartile (Q1) is the point where 25% of the observations are less than or equal to the first quartile and 75% are greater than or equal to the first quartile.

Third quartile (Q3) is the point where 75% of the observations are less than or equal to the third quartile and 25% are greater than or equal to the third quartile.
Measures of Dispersion: Illustrations
Interquartile Distribution

Interquartile range is the spread of the observations from Q1 to Q3, which covers the middle 50% of the observations
Measures of Dispersion: Illustrations
Interquartile Distribution

Suppose we have the following observations:

15 18 19 20 20 20 21 23 23 24 24 25

Determine the interquartile range.


Measures of Dispersion: Illustrations
Step 1: Determine the median of the observations

15 18 19 20 20 20 21 23 23 24 24 25

Median = [20 + 21]/2 = 20.5


Measures of Dispersion: Illustrations
Step 2: Determine the first quartile of the observations

15 18 19 20 20 20

First Quartile = [19 + 20]/2 = 19.5


Measures of Dispersion: Illustrations
Step 3: Determine the third quartile of the observations

21 23 23 24 24 25

Third Quartile = [23 + 24]/2 = 23.5


Measures of Dispersion: Illustrations
Step 4: Find the interquartile range

Interquartile Range
= Third Quartile – First Quartile
= 23.5 – 19.5
= 4
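Steps 1 to 4 can be sketched in code as below. The helper `quartiles` is a hypothetical name; it follows the same convention as the illustration (split the ordered data at the median and take the median of each half). Other quartile conventions exist and can give slightly different values on the same data.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: n // 2]            # observations below the median position
    upper = ordered[n // 2 + n % 2:]     # observations above it (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

observations = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
q1, q2, q3 = quartiles(observations)
iqr = q3 - q1   # Third Quartile minus First Quartile

print(f"Q1 = {q1}, Median = {q2}, Q3 = {q3}, IQR = {iqr}")
```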
Population Mean and Standard Deviation
Recall: Population vs. Sample

• Population: The complete set of elements in a data set
• Sample: A representative subset drawn from the population
Population Mean and Standard Deviation

The weights of five children in a family are:

x1 = 3.5 kg, x2 = 12.3 kg, x3 = 17.7 kg, x4 = 20.9 kg, x5 = 23.1 kg

Determine the population mean and standard deviation of their weights.
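One way to work this out is sketched below. It assumes the usual population definitions: the mean is the sum of the values divided by N, and the standard deviation is the square root of the average squared deviation from the mean (dividing by N, because the five children are the whole population of interest). The variable names are illustrative.

```python
from statistics import fmean, pstdev

# Weights of the five children, in kilograms
weights_kg = [3.5, 12.3, 17.7, 20.9, 23.1]

mu = fmean(weights_kg)      # population mean
sigma = pstdev(weights_kg)  # population standard deviation (divides by N)

print(f"Population mean               = {mu:.2f} kg")
print(f"Population standard deviation = {sigma:.2f} kg")
```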


Sample Mean and Standard Deviation
Exercise: Sample Mean and Standard Deviation
Suppose you went to Japan and visited a dojo of veteran sumo wrestlers. You asked a few of them and noted down the weights of 5 sumo wrestlers:

x1 = 205 kg, x2 = 192 kg, x3 = 223 kg, x4 = 240 kg, x5 = 188 kg

Determine the sample mean and standard deviation of the weights.
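A possible way to check an answer to this exercise is sketched below. It assumes the usual sample definitions: the mean divides by the number of observations, while the standard deviation divides the sum of squared deviations by N – 1 (one less than the sample size) before taking the square root. The function name is hypothetical.

```python
from math import sqrt

def sample_mean_and_sd(values):
    """Return the sample mean and the sample standard deviation (N - 1 divisor)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return x_bar, sqrt(squared_deviations / (n - 1))

# Weights of the five sumo wrestlers, in kilograms
sumo_kg = [205, 192, 223, 240, 188]
x_bar, s = sample_mean_and_sd(sumo_kg)

print(f"Sample mean               = {x_bar:.1f} kg")
print(f"Sample standard deviation = {s:.2f} kg")
```

Dividing by N – 1 instead of N makes the sample standard deviation slightly larger, which compensates for the fact that only a few wrestlers from the dojo were measured.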
Measures of Co-movement between Variables

• Suppose we want to measure the co-movement between two variables X and Y
• We can get the sample covariance and sample correlation between X and Y
Measures of Co-movement between Variables

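• If we have two data series X1, ..., XN, and Y1, ..., YN, we can estimate their expected covariance using sample covariance

s(X,Y) = [(X1 – Average of X)(Y1 – Average of Y) + ... + (XN – Average of X)(YN – Average of Y)]/[N – 1]

and their correlation using sample correlation

ρ(X,Y) = s(X, Y)/[s(X) * s(Y)]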


Measures of Co-movement between Variables
Steps in computing s(X,Y):
(1) Get the average of X and Y
(2) For each sample data in X, determine its distance from the average of X. For each sample data in Y, determine its distance from the average of Y.
(3) Multiply the calculated distance of X from its respective average with the corresponding distance of Y from its respective average. We call this new variable Z.
(4) Sum up all the entries in variable Z.
(5) Divide the sum in Step 4 by N – 1, where N is the size of the sample data. The result obtained in Step 5 is the Sample Covariance. If we have a sufficiently large data set, Sample Covariance can be used to approximate Population Covariance.
Measures of Co-movement between Variables
Determine the sample covariance s(X,Y) of two variables X and Y with the following
dataset:

Variable X Variable Y
4 8
3 10
6 5
3 9
Measures of Co-movement between Variables
Step 1. Get the average of X and Y
Variable X Variable Y
4 8
3 10
6 5
3 9

Average of X = [4 + 3 + 6 + 3]/4 = 4
Average of Y = [8 + 10 + 5 + 9]/4 = 8
Measures of Co-movement between Variables
Step 2. For each sample data in X, determine its distance from the average of X. For
each sample data in Y, determine its distance from the average of Y.

X    X – Average of X    Y     Y – Average of Y
4    (4 – 4) = 0         8     (8 – 8) = 0
3    (3 – 4) = -1        10    (10 – 8) = 2
6    (6 – 4) = 2         5     (5 – 8) = -3
3    (3 – 4) = -1        9     (9 – 8) = 1
Measures of Co-movement between Variables
Step 3. Multiply the calculated distance of X from its respective average with the
corresponding distance of Y from its respective average. We call this new variable Z.

X    X – Average of X    Y     Y – Average of Y    Z
4    (4 – 4) = 0         8     (8 – 8) = 0         0 x 0 = 0
3    (3 – 4) = -1        10    (10 – 8) = 2        -1 x 2 = -2
6    (6 – 4) = 2         5     (5 – 8) = -3        2 x -3 = -6
3    (3 – 4) = -1        9     (9 – 8) = 1         -1 x 1 = -1
Measures of Co-movement between Variables
Step 4. Sum up all the entries in variable Z.

X    X – Average of X    Y     Y – Average of Y    Z
4    (4 – 4) = 0         8     (8 – 8) = 0         0 x 0 = 0
3    (3 – 4) = -1        10    (10 – 8) = 2        -1 x 2 = -2
6    (6 – 4) = 2         5     (5 – 8) = -3        2 x -3 = -6
3    (3 – 4) = -1        9     (9 – 8) = 1         -1 x 1 = -1

Sum of entries in Z = 0 + (-2) + (-6) + (-1) = -9


Measures of Co-movement between Variables
Step 5. Divide the sum in Step 4 by N – 1, where N is the size of the sample data. The result obtained in Step 5 is the Sample Covariance. If we have a sufficiently large data set, Sample Covariance can be used to approximate Population Covariance.

Sum of entries in Z = 0 + (-2) + (-6) + (-1) = -9

N = 4 since we have 4 data points

Thus, s(X,Y) = Sum of entries in Z/[N – 1]
= -9/[4 – 1]
= -3
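The five steps can be sketched as a small function. The name `sample_covariance` and the intermediate variable names are illustrative; the data are the X and Y values used in the Step 1 table (X = 4, 3, 6, 3 and Y = 8, 10, 5, 9).

```python
def sample_covariance(xs, ys):
    """Sample covariance s(X,Y), following Steps 1 to 5."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("X and Y need the same length, with at least 2 data points")
    n = len(xs)
    # Step 1: averages of X and Y
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    # Step 2: distance of each data point from its own average
    dist_x = [x - avg_x for x in xs]
    dist_y = [y - avg_y for y in ys]
    # Step 3: multiply the paired distances to get the new variable Z
    z = [dx * dy for dx, dy in zip(dist_x, dist_y)]
    # Steps 4 and 5: sum the entries of Z and divide by N - 1
    return sum(z) / (n - 1)

X = [4, 3, 6, 3]
Y = [8, 10, 5, 9]
cov_xy = sample_covariance(X, Y)
print(f"s(X,Y) = {cov_xy}")
```

A negative covariance indicates that X and Y tend to move in opposite directions: when X is above its average, Y tends to be below its own.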
Measures of Co-movement between Variables
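Additionally, we can determine the sample correlation ρ(X,Y) of variables X and Y through the formula:
ρ(X,Y) = s(X, Y)/[s(X) * s(Y)]
where s(X) and s(Y) are the sample standard deviations of X and Y.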
Measures of Co-movement between Variables
To get the sample standard deviation of X (i.e., s(X)):

X    X – Average of X    (X – Average of X)²
4    (4 – 4) = 0         0
3    (3 – 4) = -1        1
6    (6 – 4) = 2         4
3    (3 – 4) = -1        1

Sample variance of X = (0 + 1 + 4 + 1)/3 = 2
s(X) = √2 = 1.4142
Measures of Co-movement between Variables
To get the sample standard deviation of Y (i.e., s(Y)):

Y     Y – Average of Y    (Y – Average of Y)²
8     (8 – 8) = 0         0
10    (10 – 8) = 2        4
5     (5 – 8) = -3        9
9     (9 – 8) = 1         1

Sample variance of Y = (0 + 4 + 9 + 1)/3 = 14/3 = 4.6667
s(Y) = √4.6667 = 2.1602
Measures of Co-movement between Variables
Finally, using the formula ρ(X,Y) = s(X, Y)/[s(X) * s(Y)], we determine the sample correlation ρ(X,Y) as follows:

ρ(X,Y) = s(X, Y)/[s(X) * s(Y)]
= -3/[1.4142 * 2.1602]
= -0.9820
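The formula ρ(X,Y) = s(X, Y)/[s(X) * s(Y)] can be sketched in code as follows. The function name is hypothetical. Note that s(X) and s(Y) here are standard deviations, so each sum of squared deviations is divided by N – 1 and then square-rooted before it enters the denominator.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    cov = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = sqrt(sum((x - avg_x) ** 2 for x in xs) / (n - 1))  # s(X)
    sd_y = sqrt(sum((y - avg_y) ** 2 for y in ys) / (n - 1))  # s(Y)
    return cov / (sd_x * sd_y)

X = [4, 3, 6, 3]
Y = [8, 10, 5, 9]
rho = sample_correlation(X, Y)
print(f"Sample correlation = {rho:.4f}")
```

A correlation always lies between -1 and 1, so a value close to -1 for this data indicates a strong negative linear relationship between X and Y.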
Measures of Co-movement between Variables
[Scatter plot: Stock B (vertical axis) against Stock A (horizontal axis)]

Example of two stocks A and B with perfect positive correlation (and positive covariance)
Measures of Co-movement between Variables
[Scatter plot: Stock Y (vertical axis) against Stock X (horizontal axis)]

Example of two stocks X and Y with perfect negative correlation (and negative covariance)
Measures of Co-movement between Variables
[Scatter plot: Stock D (vertical axis) against Stock C (horizontal axis)]

Example of two uncorrelated stocks C and D


Presentation of Data Analysis
• Histogram
• Box-plot
• Scatter Plot
• Bar Graph
• Line Graph
• Pie Chart
Presentation of Data Analysis: Histogram

• Divide data into a number of classes
• The number (frequency) of observations in each class is represented by a vertical rectangle
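The class-counting step behind a histogram can be sketched as below, reusing the 12 observations from the interquartile range illustration. The class boundaries (starting at 15, width 3) and the function name are assumptions chosen for illustration, and the bars are printed sideways as text rather than drawn as vertical rectangles.

```python
from collections import Counter

def class_frequencies(data, start, width):
    """Count how many observations fall into each class [low, high)."""
    counts = Counter((value - start) // width for value in data)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

observations = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
freq = class_frequencies(observations, start=15, width=3)

for (low, high), count in freq.items():
    print(f"[{low}, {high}): {'#' * count}")
```

Changing `width` changes how coarse or fine the histogram looks, so it is worth trying a few class widths before settling on one.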
Presentation of Data Analysis: Box-plot

• Constructed using quartiles
• Gives good indication of spread of data set and its symmetry (or lack of symmetry)
• Consists of a scale, a box drawn between first and third quartile, the median placed within the box, whiskers on both sides and outliers (if any)
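The ingredients of a box-plot can be computed before drawing anything. The sketch below reuses the 12 observations from the interquartile range illustration. It assumes a common convention that is not stated in the slides: a point is flagged as an outlier when it lies more than 1.5 times the interquartile range beyond the box, and each whisker ends at the most extreme point that is not an outlier. The function name is hypothetical.

```python
from statistics import median

def box_plot_summary(data):
    """Quartiles, whisker ends and outliers (1.5 x IQR rule, an assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

observations = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
summary = box_plot_summary(observations)
print(summary)
```

For this data set no observation falls outside the fences, so the whiskers simply run from the minimum to the maximum.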
Presentation of Data Analysis: Scatter Plot
[Scatter plot: Stock Y (vertical axis) against Stock X (horizontal axis)]

• Plots two variables X and Y on a two-dimensional graph with X- and Y-coordinates
• Useful for visualizing correlation of two variables
Presentation of Data Analysis: Bar Graph and Pie Chart

Useful for visualizing mode of dataset


Presentation of Data Analysis: Line Graph

Useful for visualizing time-series data
