
Da Stats Topic02 Introduction To Descriptive and Predictive Analysis

The document introduces descriptive and predictive analysis techniques. It discusses descriptive analysis which understands the past, predictive analysis which predicts the future, and prescriptive analysis which recommends actions. It also covers data-driven and model-driven approaches, descriptive statistics including scales of measurement, and measures of central tendency, dispersion and distribution shape.


DA-STATS

Topic 02: Introduction to Descriptive and Predictive Analysis

Shan E Ahmed Raza


Department of Computer Science
University of Warwick
[email protected]
Outline

• Introduction to Descriptive, Predictive and Prescriptive Techniques
• Data and Model Driven Modelling
• Descriptive Analysis
Scales of Measurement and data
Measures of central tendency, statistical dispersion and shape of distribution
• Correlation Techniques

DA-STATS - Topic 01: Introduction to Descriptive and Predictive


2
Analysis
Books – Lecture Material

Statistical Analysis

• Descriptive analysis – understand the past.

• Predictive analysis – predict the future.

• Prescriptive analysis – recommend an action.

Data driven vs Model driven approach

• Data-driven modelling approach
Aims to derive a description of behavior from observations of a system, so that it can describe how that system behaves (its output) under different conditions or scenarios (its input).
Generally, the more data (observations) that can be used to form the description, the more accurate the description will be; hence the interest in big data analytics, which uses large data sets.
Machine learning uses a selection of learning algorithms that use large data sets and a desired outcome to derive an algorithm

Data driven vs Model driven approach
• Model-driven modelling approach
Aims to explain a system’s behavior not just from its inputs but through a representation of the system’s internal structure.
A real system is simplified into its essential elements (its processes) and the relationships between these elements (its structure).
In addition to input data, information is required on the system's processes, the function of these processes and the essential parts of the relationships between these processes.
Such models are called explanatory models, as they represent the real system and attempt to explain the behavior that occurs.
They generally have far smaller data needs than data-driven models because of the key role played by the representation of structure.

Descriptive Analysis

• Use of reports and visual displays to explain or understand past and current business performance
• Contain statistical summaries of metrics such as sales and revenue
• Intended to provide an outline of trends in current and past performance

What Happened?

Predictive Analysis

• Ability to predict future performance
• Detecting patterns or relationships in historical data
• Project these relationships into the future
• Domain knowledge to construct a simplified representation

What could happen?

Prescriptive Analysis

• Recommend a choice of action from predictions of future performance
• Optimum decision based on the need to maximize (or minimize) some aspect of performance
• Many different scenarios can be tested until one is found that best meets the optimization criteria

What should be done?

Supermarket

Data Driven Modelling

• Data-driven modeling aims to derive a description of behavior from observations of a system.
• Describe the relationship between input and output
• Also known as descriptive models
Note this is different from descriptive analysis
• Imitates real behavior
• More data (observations) to form the description → more accurate the description

Model Driven Modelling
• Explain a system’s behavior not just derived from its inputs
but through a representation of the internal system’s
structure.
• A real system is simplified into its essential elements (its
processes) and relationships between these elements (its
structure)
• The effect of a change on design of the process can be
assessed by changing the structure of the model.
• Generally, we have far smaller data needs than data-driven
models

Pros and Cons
• Data driven models
built-in error terms
errors can be quantified, and confidence levels can be estimated
a large amount of data is needed to estimate the parameters and fit the model

• Process driven models
built using mathematical equations
errors may be introduced during simplifications
real-world observations are used to evaluate the model

Descriptive Statistics
• Usually, the first step prior to more complex analysis
• Scales of Measurement
• Using Tables to Organize Data
• Measures of
Location or central tendency
Statistical dispersion
Shape of distribution

Scales of Measurement

• Measurement is the assignment of numbers to attributes, objects, or events according to predetermined rules.
• Four different measurement scales
Nominal Scales
Ordinal Scales
Interval Scales
Ratio Scales

Nominal Scale
• No quantitative information
• Numbers merely to distinguish one type of thing from
another type of thing or one event from another event
• As no quantitative information is being communicated, we are free to exchange one number for any other currently unused number
• For instance, the numbers assigned to the members of a
football team do not carry any quantitative value.

Ordinal Scales

• Adds relative position information to nominal scale
• Reflects a quantitative relationship between the various categories
• Rankings
Comparatively more or less
Not how much

Interval Scales

• Set of quantitatively ordered categories
• Intervals between the categories are held constant
• Do not possess a true zero point.
• Different interval scales can be used to measure the same quantity (for example, temperature in degrees Celsius or Fahrenheit)

Ratio Scale

• Addition of an absolute zero point to interval scale
• Zero marks the absence of a quantity

Discrete and Continuous Quantities

• An important feature of measuring variables concerns how many different values can be assigned

Discrete and Continuous Quantities

• Discrete Variables
Can take on only a finite number of values
No meaningful values exist between any two adjacent values
To find statistical features of sets of discrete data it is permissible
to use “in-between” values

Discrete and Continuous Quantities

• Continuous Variables
Theoretically have an infinite number of points between any two
numbers
Variables do not have gaps between adjacent numbers

Unorganised Raw Data - Example
Scores from 90 participants who completed a questionnaire
measuring their achievement.

Simple Frequency Distribution
• How many participants received each score?
• Frequency is the number of occurrences of a repeating event
• Another example is the frequency of radio waves, defined per unit time: the number of cycles a wave completes in a unit of time (per second)

$$f = \frac{1}{T}$$

where T is the time taken to complete one cycle (the period)

Grouped Frequency Distribution

• The number of scores that fall into each of several ranges of scores
• Some sets of data cover a wide range of possible scores, which can make the resulting frequency distributions long and cumbersome
• Trade some loss of information for a table that is easier to understand

Cumulative Frequency Distributions

• The sum of frequencies found at that interval plus all preceding intervals
• Keeps a running tally of all scores up through each given interval
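
The simple, grouped and cumulative frequency distributions described on these slides can be sketched in a few lines of Python. The scores and the interval width of 10 below are made-up illustrative values, not the 90 questionnaire scores from the lecture.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical scores (not the data set shown in the lecture).
scores = [52, 55, 55, 61, 64, 64, 64, 70, 73, 78, 81, 81]

# Simple frequency distribution: how many participants received each score.
simple = Counter(scores)

# Grouped frequency distribution: count the scores in each interval.
width = 10  # assumed interval width
grouped = Counter((s // width) * width for s in scores)
intervals = sorted(grouped)

# Cumulative frequency: running total up through each interval.
cumulative = list(accumulate(grouped[lo] for lo in intervals))

for lo, cum in zip(intervals, cumulative):
    print(f"{lo}-{lo + width - 1}: f={grouped[lo]}, cumulative f={cum}")
```

Grouping shortens the table at the cost of no longer knowing the exact scores inside each interval.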

Measures of Central Tendency

• Statistical indices designed to communicate what is the “center” or “middle” of a distribution
• Three measures of central tendency
Mean
Median
Mode

The Mean
• The mean, colloquially referred to as the “average,” is the
most frequently used measure of central tendency.

$$\mu = \frac{\Sigma X}{N}$$

𝜇 = the symbol for the mean of a population
𝑋 = a score in the distribution
𝑁 = population size
Σ = sum up a set of scores, $\Sigma X = X_1 + X_2 + X_3 + \cdots + X_N$

The Mean

• What is the mean of this population of scores?

5, 8, 10, 11, 12

$$\mu = \frac{5 + 8 + 10 + 11 + 12}{5} = \frac{46}{5} = 9.20$$

The Mean of a Frequency Distribution

• Mean of a frequency distribution

$$\mu = \frac{\Sigma X f}{\Sigma f}$$

𝑓 = frequency with which a score appears
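
As a minimal sketch, both forms of the mean can be checked in Python. The five raw scores are the ones from the earlier mean example; the frequency table `freq` is a hypothetical one invented for illustration.

```python
# Population mean of the example scores 5, 8, 10, 11, 12.
scores = [5, 8, 10, 11, 12]
mu = sum(scores) / len(scores)

# Hypothetical frequency table: score -> frequency with which it appears.
freq = {3: 2, 4: 5, 5: 3}

# Mean of a frequency distribution: sum(X * f) / sum(f).
mu_freq = sum(x * f for x, f in freq.items()) / sum(freq.values())

# Expanding the table back into raw scores gives the same mean.
raw = [x for x, f in freq.items() for _ in range(f)]
mu_raw = sum(raw) / len(raw)

print(mu, mu_freq, mu_raw)
```

The frequency form is only a shortcut: it multiplies each distinct score by its count instead of adding the same score repeatedly.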

Calculating the mean from a frequency distribution

The Weighted Mean

• Weighted Mean
$$M = \frac{n_1 V_1 + n_2 V_2 + n_3 V_3 + \cdots + n_n V_n}{n_1 + n_2 + n_3 + \cdots + n_n}$$

Weighted Mean - Example
• Let’s consider the scores of students in a subject from three schools

School A          School B          School C
Student  Score    Student  Score    Student  Score
A        60       C        80       F        75
B        40       D        70       G        60
                  E        60       H        60

Mean: 𝑀1 = 50, 𝑀2 = 70, 𝑀3 = 65

Mean of all scores: 63.125

Mean of 𝑀1, 𝑀2, 𝑀3: ~61.667

Weighted mean of 𝑀1, 𝑀2, 𝑀3 where 𝑛1 = 2, 𝑛2 = 3, 𝑛3 = 3:

$$\frac{2(50) + 3(70) + 3(65)}{2 + 3 + 3} = 63.125$$
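
The school example can be reproduced with a short Python sketch. The scores come from the table above; the third school is labelled "C" here, and the variable names are illustrative choices.

```python
# Scores per school, taken from the example table.
schools = {
    "A": [60, 40],
    "B": [80, 70, 60],
    "C": [75, 60, 60],
}

means = {name: sum(s) / len(s) for name, s in schools.items()}
sizes = {name: len(s) for name, s in schools.items()}

# Unweighted mean of the three school means.
mean_of_means = sum(means.values()) / len(means)

# Weighted mean: each school mean weighted by its number of students.
weighted = sum(sizes[k] * means[k] for k in schools) / sum(sizes.values())

# Mean of all eight scores pooled together.
all_scores = [x for s in schools.values() for x in s]
overall = sum(all_scores) / len(all_scores)

print(mean_of_means, weighted, overall)
```

Weighting by group size recovers the pooled mean, whereas the plain mean of the three school means gives the two-student school as much influence as the three-student schools.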
The Median

• The median divides the distribution based on the frequency or number of scores above and below a given point.
• Not algebraically defined but there are algorithms to calculate the median

Calculating the Median – Example 1

• What is the median of this distribution?
40, 1, 4, 42, 6, 8, 43, 45, 47
Sorted: 1, 4, 6, 8, 40, 42, 43, 45, 47
Median = 40 (4 scores below, 4 scores above)

Calculating the Median – Example 2

• What is the median of this distribution? (with even number of scores)
3, 9, 15, 16, 19, 22
2 scores below and 2 scores above the two middle scores, 15 and 16

$$\text{Median} = \frac{15 + 16}{2} = 15.5$$
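
One possible algorithm for the median, sketched in Python: sort the scores, then take the middle one, or the mean of the two middle ones when the count is even. The helper name `median_by_sorting` is an illustrative choice; the standard library's `statistics.median` gives the same result.

```python
from statistics import median


def median_by_sorting(values):
    """Median via sorting: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [40, 1, 4, 42, 6, 8, 43, 45, 47]   # Example 1 (nine scores)
even = [3, 9, 15, 16, 19, 22]            # Example 2 (six scores)

print(median_by_sorting(odd), median(odd))
print(median_by_sorting(even), median(even))
```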

The Mode

• The most typical or most frequent score in the distribution.

The Mode - Example

• What is the mode of this distribution?
100, 101, 105, 105, 107, 108
Mode = 105

The Mode from Frequency Distribution
Mode: 15 & 16
• This is known as a bimodal distribution (“bi,” meaning two).
• A distribution with a single mode is termed unimodal (“uni,” meaning one).
• Having more than two modes is called "multimodal".
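
A small Python sketch of finding the mode or modes. The unimodal list is the mode example from the slides; the bimodal list is hypothetical data constructed so that 15 and 16 tie, since the original frequency table is not reproduced in the text. It assumes Python 3.8 or later for `statistics.multimode`.

```python
from statistics import multimode

unimodal = [100, 101, 105, 105, 107, 108]

# Hypothetical scores in which 15 and 16 are tied for most frequent.
bimodal = [14, 15, 15, 15, 16, 16, 16, 17]

# multimode returns a list of every value sharing the highest frequency.
modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)

print(modes_uni, modes_bi)
```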
How the Shape of Distributions Affects
Measures of Central Tendency

• If the distribution is symmetrical and unimodal, then all three measures of central tendency will be identical.
How the Shape of Distributions Affects
Measures of Central Tendency

• If the distribution is symmetrical but bimodal, then the mean and median are identical, but the two modes lie on either side of them.
How the Shape of Distributions Affects
Measures of Central Tendency
[Figure: a negatively skewed and a positively skewed distribution]

• If a distribution is skewed, then the mean, median, and mode will all be different.
When to Use the Mean, Median, and Mode?

• Mean – no meaningful quantitative information for nominal or ordinal scales
• Median – measure of choice for ordinal scale
• Mode – preferred for nominal scale
• Sensitivity to extreme scores
• Skewed Distributions
• Scores of a distribution are truncated

Measures of Variability

• Convey the degree to which scores are spread out and dispersed around a central point
Range
Mean Deviation
The Variance
The Standard Deviation

Range

• Overall span of the scores in a distribution – from the lowest value up to the highest value

• Range
$$\text{Range} = X_H - X_L$$

$X_H$ = Highest score in the distribution
$X_L$ = Lowest score in the distribution

The Range - Example

• What is the range of this distribution?


17, 44, 50, 23, 42

Range = 50 – 17 = 33

The Interquartile Range and Semi-Interquartile Range

• Every distribution can be divided into four equal sections or quartiles.
• A quartile is one-fourth of a distribution of scores.
• The bottom 25% of the values in a distribution make up
the first quartile.

The Interquartile Range and Semi-Interquartile Range
• Interquartile range, IQR
$$\text{IQR} = Q_3 - Q_1$$
$Q_3$ = the third quartile (75th percentile)
$Q_1$ = the first quartile (25th percentile)

• Semi-interquartile range, SIQR
$$\text{SIQR} = \frac{Q_3 - Q_1}{2}$$
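
A rough Python sketch of the range, IQR and SIQR, using the five scores from the range example. Several conventions exist for locating quartiles and the slides do not say which one is intended, so the `method="inclusive"` setting of `statistics.quantiles` is an assumption; other conventions can give slightly different quartiles on small data sets.

```python
from statistics import quantiles

scores = [17, 44, 50, 23, 42]

# Range: highest score minus lowest score.
value_range = max(scores) - min(scores)

# Quartiles under the assumed "inclusive" convention
# (linear interpolation between the sorted data points).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1      # interquartile range
siqr = iqr / 2     # semi-interquartile range

print(value_range, q1, q3, iqr, siqr)
```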
Mean Deviation

• The degree to which scores deviate from the mean

$$MD = \frac{\Sigma |X - \mu|}{N}$$
𝑋 = a raw score
𝜇 = the population mean
𝑁 = the number of scores in the population

Mean Deviation – Example (1)

• Consider these two distributions:
Distribution A: 11, 12, 13, 14, 15, 16, 17
Distribution B: 5, 8, 11, 14, 17, 20, 23
𝜇 = 14 for both distributions
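
The mean deviation of the two distributions can be worked out with a short Python sketch; the function name `mean_deviation` is an illustrative choice. Both distributions share the same mean, so the difference in the result reflects only how widely the scores are spread.

```python
def mean_deviation(values):
    """Average absolute distance of the scores from their mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)


dist_a = [11, 12, 13, 14, 15, 16, 17]
dist_b = [5, 8, 11, 14, 17, 20, 23]

md_a = mean_deviation(dist_a)  # 12 / 7
md_b = mean_deviation(dist_b)  # 36 / 7

print(round(md_a, 3), round(md_b, 3))
```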

Mean Deviation – Example (2)

The Variance
• Variance
$$\sigma^2 = \frac{\Sigma (X - \mu)^2}{N}$$

$\sigma^2$ = the symbol for the population variance


𝑋 = a raw score
𝜇 = the population mean
𝑁 = the number of scores in the population

The Variance - Example
• What is the variance of this sample of scores?
3, 4, 6, 8, 9

The Standard Deviation
• Variance is a squared value
so it is not stated in the original units of the measured variable
• Standard deviation
the square root of the variance

$$\sigma = \sqrt{\frac{\Sigma (X - \mu)^2}{N}}$$
𝜎 = the symbol for the standard deviation
𝑋 = a raw score
𝜇 = the population mean
𝑁 = the number of scores in the population
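
A minimal Python sketch of the variance and standard deviation, using the scores 3, 4, 6, 8, 9 from the variance example. The slides give the population formulas (dividing by N); the sample variance, which divides by N − 1, is shown only for comparison and is not a formula from the slides.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

scores = [3, 4, 6, 8, 9]
n = len(scores)
mu = sum(scores) / n

# Population variance: mean of the squared deviations from the mean.
var_pop = sum((x - mu) ** 2 for x in scores) / n

# Standard deviation: square root of the variance, back in the original units.
sd_pop = sqrt(var_pop)

# Sample variance divides by n - 1 instead (shown for comparison only).
var_sample = variance(scores)

print(var_pop, round(sd_pop, 3), var_sample)
```

The standard library functions `pvariance` and `pstdev` implement the same population formulas and can be used to cross-check the hand calculation.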

The Standard Deviation and the Normal Curve
• In a normal distribution, approximately 68% (95%, 99.7%) of the scores fall within plus and minus one (two, three) standard deviations of the mean
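
These percentages can be checked against the standard normal distribution. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` is available.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
```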

Deciding Which Measure of Variability to Use
• Range is most vulnerable to extreme scores.
• IQR and SIQR are not much influenced by a small number of extreme scores
• Variance and standard deviation are also affected by extreme scores, since the deviations are squared
• For a skewed distribution, IQR and SIQR best describe
variability
• If the scale of measurement does not allow for the
calculation of a mean, then deviation scores cannot be
calculated.

Measures of Shape Distribution

• “Deviation from Normal Distribution” described by
Number of peaks
Possession of Symmetry
Tendency to Skew
Uniformity

Skewed Distribution
• Most of the scores lie near one end of the distribution.

[Figures: a positively skewed distribution; a negatively skewed distribution]


Skewed Distribution
• For univariate data $X_1, X_2, \ldots, X_N$, the formula for skewness is:

$$g_1 = \frac{\sum_{i=1}^{N} (X_i - \mu)^3 / N}{s^3}$$

𝑠 = standard deviation of the distribution

• The above formula for skewness is referred to as the Fisher-Pearson coefficient of skewness

Kurtosis
• The quality of the peak of the curve
• Leptokurtic – narrow and accentuated peak
• Platykurtic – broad and muted peak

Kurtosis

• For univariate data $X_1, X_2, \ldots, X_N$, the formula for kurtosis is:

$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (X_i - \mu)^4 / N}{s^4}$$

𝑠 = standard deviation of the distribution
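
The skewness and kurtosis formulas can be sketched in Python as ratios of central moments. The sketch assumes that s is the standard deviation computed with N in the denominator, and the two data sets are hypothetical values chosen for illustration.

```python
from math import sqrt


def central_moment(values, k):
    """k-th central moment: mean of (X - mu) ** k."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** k for x in values) / len(values)


def skewness(values):
    s = sqrt(central_moment(values, 2))  # SD with N in the denominator
    return central_moment(values, 3) / s ** 3


def kurtosis(values):
    s = sqrt(central_moment(values, 2))
    return central_moment(values, 4) / s ** 4


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 9]  # hypothetical positively skewed data

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed), kurtosis(right_tailed))
```

A symmetric data set gives a skewness of zero, while a long right tail gives a positive value. With this definition a normal distribution has a kurtosis of 3.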

Correlation

• Measure of the strength of association between two variables.
• Correlational analyses are often applied to two variables or
scores – bivariate distribution
• A correlation coefficient can range from −1 to +1.

Correlation

• Larger the absolute value of the correlation → stronger the association
• Correlation coefficient → 0, weaker relationship between
variables.
• A positive sign indicates a positive relationship
• A negative sign indicates a negative relationship

Correlation

Correlation

• Four types of correlations:
Pearson correlation
Kendall rank correlation
Spearman correlation
Point-Biserial

Pearson Correlation
• Most powerful and most frequently used version of the correlation measure

$$\rho = \frac{\Sigma (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\Sigma (x_i - \mu_x)^2 \, \Sigma (y_i - \mu_y)^2}}$$

𝜌 = correlation coefficient
𝑥𝑖 = values of the 𝑥 variable in a sample
𝑦𝑖 = values of the 𝑦 variable in a sample
𝜇𝑥 = mean of 𝑥 values
𝜇𝑦 = mean of 𝑦 values
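
The Pearson formula translates directly into Python. The function name `pearson` and the small data sets are illustrative assumptions; perfectly linear increasing and decreasing relationships are used so the results sit at the two ends of the −1 to +1 range.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # increases with x
y_down = [10, 8, 6, 4, 2]   # decreases with x

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)

print(r_up, r_down)
```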

Kendall Rank 𝜏
• Kendall rank correlation is a non-parametric measure of
relationship between two ranked variables.

$$\tau = \frac{n_c - n_d}{n(n - 1)/2}$$

𝜏 = Kendall rank 𝜏
𝑛𝑐 = number of concordant pairs
𝑛𝑑 = number of discordant pairs
𝑛 = number of observations
https://www.statisticshowto.com/kendalls-tau/
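
A minimal Python sketch of the Kendall formula, counting concordant and discordant pairs directly. It assumes there are no tied values (the version often called tau-a); the function name `kendall_tau` and the rank data are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau from pair counts, assuming no ties in xs or ys."""
    n = len(xs)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables order the pair the same way
        elif product < 0:
            discordant += 1   # the variables order the pair oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)


rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]  # hypothetical ranks with two swapped pairs

tau = kendall_tau(rank_x, rank_y)
print(tau)
```

Of the 10 possible pairs here, 8 are concordant and 2 are discordant, giving (8 − 2) / 10.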
Spearman Rank Correlation
• Spearman rank correlation is a non-parametric measure of
relationship between two ranked variables.
• Suited for correlation analysis of variables on ordinal
scale.
$$\rho = 1 - \frac{6 \Sigma d_i^2}{n(n^2 - 1)}$$
𝜌 = Spearman Rank Correlation
𝑑𝑖 = difference between the ranks of corresponding pairs
𝑛 = number of observations
https://www.youtube.com/watch?v=DE58QuNKA-c
Spearman Rank Correlation

• Assumptions
data must be at least ordinal

the scores are monotonically related
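
The Spearman formula can be sketched in Python as follows. The sketch assumes the inputs are already ranks and that there are no tied ranks, which is when the 6Σd² shortcut applies exactly; the function name `spearman_rho` and the rank data are hypothetical.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two lists of ranks, assuming no tied ranks."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]     # hypothetical ranks, mostly in agreement
reversed_y = [5, 4, 3, 2, 1]  # perfectly decreasing monotonic relationship

rho = spearman_rho(rank_x, rank_y)
rho_reversed = spearman_rho(rank_x, reversed_y)

print(rho, rho_reversed)
```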
