Quantitative Methods - Organizing, Visualizing and Describing Data
Quantitative Methods 2
Organizing, Visualizing and Describing Data
BINUS Financial Analyst Academy Program
Adapted from CFA® Institute Curriculum Level 1
Learning Outcome Statements (LOS)
A. identify and compare data types;
B. describe how data are organized for quantitative analysis;
C. interpret frequency and related distributions;
D. interpret a contingency table;
E. describe ways that data may be visualized and evaluate uses of specific visualizations;
F. describe how to select among visualization types;
G. calculate and interpret measures of central tendency;
H. evaluate alternative definitions of mean to address an investment problem;
I. calculate quantiles and interpret related visualizations;
J. calculate and interpret measures of dispersion;
K. calculate and interpret target downside deviation;
L. interpret skewness;
M. interpret kurtosis;
N. interpret correlation between two variables (refer to QM-3)
This presentation material is strictly for class discussion purpose only, and can not be used, copied
and distributed for other purposes.
Data Types
Data can be defined as a collection of numbers, characters, words, and text—as well as images, audio, and
video—in a raw or organized format to represent facts or information.
➢ Numerical data (or quantitative data) are values that represent measured or counted quantities as a number.
✓ Continuous data can be measured and can take on any numerical value in a specified range of values.
E.g. the actual amount of rainfall, anywhere between 10 and 20 inches per month.
✓ Discrete data are numerical values that result from a counting process and are limited to a finite number of
values. E.g. the number of days it rains in a month.
➢ Categorical data (or qualitative data) are values that describe a quality or characteristic of a group of
observations and usually take only a limited number of values that are mutually exclusive.
✓ Nominal data are categorical values that have no particular or logical order. E.g. GICS sectors: 10
Energy, 20 Industrials, etc.
✓ Ordinal data are categorical values that can be logically ordered or ranked. E.g. Morningstar ratings for
investment funds.
➢ Observation: the value of a specific variable collected at a point in time or over a specified period of time. E.g.
EPS of $7.5.
➢ Time series data: a sequence of observations for a single observational unit of a specific variable collected
over time at discrete and typically equally spaced intervals, such as daily, weekly, monthly, quarterly, or
annually. E.g. monthly Microsoft stock returns for the past 5 years.
➢ Cross-sectional data: a list of the observations of a specific variable from multiple observational units at a
given point in time. E.g. the inflation rate for EU countries in January.
➢ Panel data consist of observations through time on one or more variables for multiple observational units. E.g.
the inflation rate in EU countries over a 5-year period.
➢ Longitudinal data consist of observations on characteristic(s) of the same observational unit through time.
E.g. the financial ratios of a company over a 10-year period.
Structured and Unstructured Data
• Data can be classified as structured or unstructured, depending on whether they are in a highly
organized form.
• Structured data are highly organized in a pre-defined manner, usually with repeating patterns.
Typical examples:
• Market data: issued by stock exchanges, e.g. closing stock price, volume.
• Fundamental data: e.g. EPS, P/E ratio, dividend yield, ROE.
• Analytical data: derived from analytics, e.g. earnings growth estimate.
• Unstructured data do not follow any conventionally organized form. Common types: text (e.g.
financial news, company filings) and audio/video (e.g. earnings calls). They are typically alternative
data, as they are usually collected from unconventional sources. Three groups of sources:
• Individuals: social media posts, web searches
• Business processes: credit card transactions, corporate filings
• Sensors: satellite images, foot traffic from mobile devices
Frequency Distribution
• A frequency distribution is a tabular display of data constructed either by counting the observations
of each unique value of a variable or by grouping the values of a numerical variable into a set
of intervals (ranges) and counting the observations in each.
• Interval or bucket: the set of return values within which an observation falls. Intervals must be
all-inclusive, non-overlapping, and mutually exclusive.
• Intervals make a frequency distribution easy to read: the number of members in each group shows
where the observations are concentrated.
• From a set of raw data, we can build absolute frequencies, relative frequencies, and cumulative
absolute/relative frequencies.
Frequency Distribution
Procedures to construct frequency distribution:
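The procedure itself is not reproduced on this slide. As a rough sketch of one common approach (sort the data, split the range into k equal-width intervals, then count the observations in each), the Python below bins a small set of returns. The function name `frequency_distribution`, the choice of k = 4, and the return values are hypothetical and used only for illustration.

```python
def frequency_distribution(values, k):
    """Count observations in k equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # Intervals are [lower, upper); the maximum value is assigned to the
        # last interval so that every observation is counted exactly once.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts


# Hypothetical monthly returns in percent (not taken from the slides).
returns = [-4.0, -1.5, 0.5, 1.0, 2.5, 3.0, 3.5, 6.0, 7.5, 12.0]
edges, counts = frequency_distribution(returns, 4)

for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"{lower:6.1f}% to {upper:6.1f}%: {count}")
```

Because the intervals are all-inclusive and non-overlapping, the counts always add up to the number of observations.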
Relative Frequencies and Cumulative Frequencies
                   Absolute   Relative   Cumulative Absolute  Cumulative Relative  Cumulative
Interval           Frequency  Frequency  Frequency            Frequency            Interval
-30% ≤ R < -20%    1          5%         1                    5%                   R < -20%
-20% ≤ R < -10%    2          10%        3                    15%                  R < -10%
-10% ≤ R < 0%      3          15%        6                    30%                  R < 0%
0% ≤ R < 10%       7          35%        13                   65%                  R < 10%
10% ≤ R < 20%      3          15%        16                   80%                  R < 20%
20% ≤ R < 30%      2          10%        18                   90%                  R < 30%
30% ≤ R < 40%      1          5%         19                   95%                  R < 40%
40% ≤ R < 50%      1          5%         20                   100%                 R < 50%
Total              20         100%
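The relative and cumulative columns follow mechanically from the absolute frequencies. A minimal Python sketch that recomputes them from the absolute frequencies in the table (the variable names are illustrative):

```python
from itertools import accumulate

# Absolute frequencies from the table, lowest interval first.
absolute = [1, 2, 3, 7, 3, 2, 1, 1]
total = sum(absolute)

# Relative frequency = absolute frequency / total number of observations.
relative = [f / total for f in absolute]

# Cumulative frequencies are running totals of the columns above.
cum_absolute = list(accumulate(absolute))
cum_relative = [c / total for c in cum_absolute]

for f, r, ca, cr in zip(absolute, relative, cum_absolute, cum_relative):
    print(f"{f:3d} {r:6.0%} {ca:3d} {cr:6.0%}")
```

The last cumulative absolute frequency equals the total number of observations (20), and the last cumulative relative frequency is 100%.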
Contingency table
Contingency table: a tabular format that displays the frequency distributions of two or more
variables simultaneously; used for finding patterns between the variables. A contingency table for two
categorical variables is also known as a two-way table.
Joint frequency: the number of observations that share a given value of the row variable (e.g. sector)
and a given value of the column variable (e.g. market cap).
Marginal frequency: the total of the joint frequencies across a row or down a column.
One application of contingency tables is evaluating the performance of a classification model
(using a confusion matrix). Another application is investigating a potential
association between two categorical variables by performing a chi-square test of independence.
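As an illustration of joint and marginal frequencies, the sketch below tabulates a small set of stocks by sector and market-cap bucket. The six observations are hypothetical; only the counting logic matters.

```python
from collections import Counter

# Hypothetical observations: (sector, market-cap bucket) for six stocks.
stocks = [
    ("Energy", "Large"), ("Energy", "Small"), ("Utilities", "Large"),
    ("Utilities", "Large"), ("Energy", "Large"), ("Utilities", "Small"),
]

# Joint frequency: count of each (row value, column value) pair.
joint = Counter(stocks)

# Marginal frequencies: totals per row (sector) and per column (market cap).
row_totals = Counter(sector for sector, _ in stocks)
col_totals = Counter(cap for _, cap in stocks)

print(joint[("Energy", "Large")])   # joint frequency of Energy and Large
print(row_totals["Energy"])         # marginal frequency of Energy
print(col_totals["Large"])          # marginal frequency of Large
```

Each marginal frequency equals the sum of the joint frequencies in its row or column, and the row totals and column totals both add up to the number of observations.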
[Figure: example of a confusion matrix for a bond default prediction model]
Data Visualization
Visualization: the presentation of data in a graphical format to increase understanding and gain
insights into the data. The key consideration when selecting among chart types is the intended
purpose of visualizing the data (i.e. exploring/presenting distributions or relationships, or making
comparisons).
• Histogram: a bar chart of data that are grouped into a frequency distribution. A frequency
polygon is a graph of frequency distributions obtained by drawing straight lines joining successive
midpoints of bars representing the class frequencies.
• Bar chart is used to plot the frequency distribution of data, with each bar representing a distinct
category and the bar’s height proportional to the frequency of the corresponding category. Grouped
bar charts or stacked bar charts can present the frequency distribution of multiple categorical
variables simultaneously.
• Tree-map: a graphical tool to display categorical data. It consists of a set of colored rectangles to
represent distinct groups, and the area of each rectangle is proportional to the value of the
corresponding group. Additional dimensions of categorical data can be displayed by nested
rectangles.
Example: Data visualization
[Figures: histogram and frequency polygon; cumulative frequency distribution]
Example: Data visualization (2)
[Figures: line chart; word cloud]
What is Statistics?
Statistics refers to data and the methods we use to analyze the data. There are two categories:
• Descriptive Statistics
How data can be summarized effectively to describe the important aspects of a large data set;
consolidates data into useful information.
• Inferential Statistics
Procedures used to make forecasts, estimates, or judgments about a large set of data on the basis
of the statistical characteristics of a smaller set (a sample).
Sample
• A portion, or subset, of the population of interest, taken to represent the
whole population
• E.g. this class is a sample of CFA L1 course participants
Parameter
• Any descriptive measure of a population characteristic
• Although there are many parameters, only a few are commonly used, e.g. mean return and standard
deviation of returns
• Four characteristics are used to describe a distribution: central tendency, dispersion, skewness, and kurtosis.
Measures of Central Tendency
Measures of central tendency identify the center, or average, of a data set. This central point
can be used to represent the typical or expected value in the data set.
• Population Mean
• Sample Mean
• Weighted Mean
• Median
• Mode
• Arithmetic Mean
• Geometric Mean
• Quantiles
Mean
• Population mean: the sum of all observed values in the population divided by the number of
observations in the population, N
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Sample mean: the sum of all values in a sample divided by the number of observations
in the sample, n
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Example:
Stock AXZ has the following annual returns:
12%, 25%, 34%, 15%, 19%, 44%, 54%, 33%, 22%, 28%, 17%, 24%
$\mu = \frac{12 + 25 + 34 + 15 + 19 + 44 + 54 + 33 + 22 + 28 + 17 + 24}{12} = 27.25\%$
Arithmetic Mean
• Population mean and sample mean are examples of arithmetic means
• The arithmetic mean is the sum of the observation values divided by the number of observations
• All interval and ratio data sets have an arithmetic mean
• A data set has only one arithmetic mean
• The sum of the deviations of the observations from the mean is always zero
• Sum of mean deviations: $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$
Year    Annualized Return    Deviation $(X_i - \bar{X})$
1       30%                  +8% (30% - 22%)
2       12%                  -10% (12% - 22%)
3       25%                  +3% (25% - 22%)
4       20%                  -2% (20% - 22%)
5       23%                  +1% (23% - 22%)
Mean    22%                  $\sum (X_i - \bar{X}) = 0$
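The two worked examples can be checked with a few lines of Python. This is a minimal sketch using the returns quoted above, with returns held as percentages and illustrative variable names.

```python
# Annual returns of stock AXZ, in percent.
axz_returns = [12, 25, 34, 15, 19, 44, 54, 33, 22, 28, 17, 24]
axz_mean = sum(axz_returns) / len(axz_returns)

# Five annualized returns, in percent.
five_year = [30, 12, 25, 20, 23]
mean = sum(five_year) / len(five_year)

# Deviation of each observation from the arithmetic mean.
deviations = [x - mean for x in five_year]

print(axz_mean)         # 27.25
print(mean)             # 22.0
print(deviations)       # [8.0, -10.0, 3.0, -2.0, 1.0]
print(sum(deviations))  # 0.0
```

Positive and negative deviations cancel exactly, which is why the sum of deviations cannot serve as a measure of dispersion.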
Weighted Mean, Median, Mode
• Weighted mean: the mean of the data in which each observation is weighted by its frequency or weight relative to the whole
data set, $\bar{X}_w = \sum_{i=1}^{n} w_i X_i$, where the weights sum to 1
Example: A portfolio consists of 50% common stocks, 40% bonds, and 10% cash. If the return on common stocks is 12%, the
return on bonds is 7%, and the return on cash is 3%, what is the return of the portfolio?
(0.50 × 12%) + (0.40 × 7%) + (0.10 × 3%) = 9.1%
• Median: the midpoint of the data when they are sorted from the largest to the smallest value (or vice versa)
Example: What is the median of 30, 25, 23, 21, 15?
23 is the median
What is the median of 30, 28, 25, 23, 21, 15?
With an even number of observations, the median is the average of the two middle values: (25 + 23) / 2 = 24
• Mode: the value that occurs most frequently in the data
Example: What is the mode of 30, 28, 25, 23, 28, 21, 15, 5?
28 is the mode.
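The three examples above can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import statistics

# Weighted mean: portfolio weights and asset-class returns (in percent).
weights = [0.50, 0.40, 0.10]
asset_returns = [12.0, 7.0, 3.0]
portfolio_return = sum(w * r for w, r in zip(weights, asset_returns))

# Median: with an even count, the two middle values are averaged.
median_odd = statistics.median([30, 25, 23, 21, 15])
median_even = statistics.median([30, 28, 25, 23, 21, 15])

# Mode: the most frequent value.
mode = statistics.mode([30, 28, 25, 23, 28, 21, 15, 5])

print(round(portfolio_return, 2), median_odd, median_even, mode)
```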
Geometric Mean
The geometric mean is often used when calculating investment returns over multiple periods or to find the
compound annual growth rate (CAGR)
Formula: $R_G = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$
Example: for the last 3 years the returns on Acme Corp common stock have been -9.34%, 23.45%
and 8.92%. What is the compound annual rate of return over the 3 years?
$R_G = \sqrt[3]{0.9066 \times 1.2345 \times 1.0892} - 1 \approx 6.8\%$
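A short Python sketch of the geometric mean return for the Acme Corp example: each return is converted to a growth factor (1 + R), the factors are multiplied, and the n-th root is taken. The function name `geometric_mean_return` is illustrative.

```python
import math


def geometric_mean_return(returns):
    """Compound per-period return from a list of decimal returns."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


acme_returns = [-0.0934, 0.2345, 0.0892]
geo = geometric_mean_return(acme_returns)
arith = sum(acme_returns) / len(acme_returns)

print(f"geometric mean:  {geo:.2%}")
print(f"arithmetic mean: {arith:.2%}")
```

The geometric mean comes out at roughly 6.8%, below the arithmetic mean of the same three returns, which is the usual relationship whenever returns vary from period to period.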
Harmonic Mean
Used for certain computations, such as the average cost of shares purchased over time when the same
amount is invested each period
Calculated as: $\bar{X}_H = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}$
Example: An investor purchases $100 of stock each month for 3 months. The purchase prices
are $8, $9 and $10. Calculate the average cost per share.
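The example can be worked two ways, which should agree: with the harmonic mean of the three prices, or by dividing the total amount invested by the total number of shares bought. A minimal Python sketch:

```python
import statistics

prices = [8, 9, 10]   # purchase price per share each month, in dollars
invested = 100        # dollars invested each month

# Harmonic mean of the prices: N / sum(1 / X_i).
harmonic = statistics.harmonic_mean(prices)

# Cross-check: total dollars spent divided by total shares acquired.
shares = sum(invested / p for p in prices)
average_cost = invested * len(prices) / shares

print(round(harmonic, 2), round(average_cost, 2))
```

Both approaches give an average cost of about $8.93 per share, below the simple average price of $9, because more shares are bought in the months when the price is lower.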
Quantiles
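General term for a value at or below which a stated fraction of the data lies. Examples: quartiles (quarters),
quintiles (fifths), deciles (tenths), and percentiles (hundredths).
Example: Find the third quartile (the value below which 75% of the observations lie) for the following distribution of returns:
8%, 10%, 12%, 13%, 15%, 17%, 17%, 18%, 19%, 23%, 24% (11 observations)
$L_y = (n + 1) \times \frac{y}{100} = (11 + 1) \times \frac{75}{100} = 9$, so the third quartile is the 9th observation from the bottom, 19%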
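A sketch of the location formula in Python. When the location is not a whole number, the code interpolates linearly between the two neighbouring observations. The function name `percentile_value` is illustrative, and other software may use a different quantile convention and return slightly different values.

```python
def percentile_value(values, y):
    """Value at the y-th percentile using the location L = (n + 1) * y / 100."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * y / 100
    lower = int(location)        # whole part: position of the lower observation
    fraction = location - lower  # fractional part used for interpolation
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    # Positions are 1-based in the formula, list indices are 0-based.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


returns = [8, 10, 12, 13, 15, 17, 17, 18, 19, 23, 24]  # in percent
third_quartile = percentile_value(returns, 75)
median = percentile_value(returns, 50)
print(third_quartile, median)
```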
Measures of Dispersion
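Dispersion is the variability around the central tendency. In the trade-off between reward and
variability, the central tendency (mean) measures the reward and dispersion measures the risk.
Mean absolute deviation (MAD) is the average of the absolute values of the deviations
of individual observations from the arithmetic mean:
$MAD = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
Measures of Dispersion
• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
• For annual returns of 30%, 12%, 25%, 20% and 23% (mean 22%): σ = 5.97%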
Dispersion from The Mean
Year    Annualized Return
1       30%
2       12%
3       25%
4       20%
5       23%
Mean    22%
MAD and standard deviation measure the average dispersion of the observations
from the mean. The more dispersed the observations are around the mean, the weaker
the predictive value of the mean, and vice versa.
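For the five annual returns in the table, MAD and the population standard deviation can be computed as in the sketch below. Returns are held as percentages, and the five years are treated as the whole population, which is what reproduces the σ = 5.97% quoted in these slides.

```python
import statistics

returns = [30, 12, 25, 20, 23]  # annualized returns, in percent
mean = statistics.fmean(returns)

# Mean absolute deviation: average distance from the mean, ignoring sign.
mad = sum(abs(x - mean) for x in returns) / len(returns)

# Population variance and standard deviation (divide by N).
variance = statistics.pvariance(returns)
sigma = statistics.pstdev(returns)

print(f"mean={mean:.1f}% MAD={mad:.1f}% variance={variance:.1f} sigma={sigma:.2f}%")
```

Squaring the deviations gives more weight to large deviations, which is why the standard deviation (5.97%) is larger than MAD (4.8%) here.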
Measures of Dispersion
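Example:
Suppose that the average score on a CFA test is 70, with a standard deviation of 4 points.
At least what percentage of the tests have a score of at least 62 and at most 78?
Answer: 62 and 78 lie k = 2 standard deviations from the mean (70 ± 2 × 4). By Chebyshev's inequality, at least 1 - 1/k² = 1 - 1/4 = 75% of the scores fall within this range.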
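The 75% figure comes from Chebyshev's inequality, which holds for any distribution with a finite variance: at least 1 - 1/k² of the observations lie within k standard deviations of the mean, for k > 1. A small Python sketch (the function name is illustrative):

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality is informative only for k > 1")
    return 1 - 1 / k ** 2


mean_score, std_dev = 70, 4
lower, upper = 62, 78

# Both bounds are the same distance from the mean, measured in standard deviations.
k = (upper - mean_score) / std_dev
fraction = chebyshev_min_fraction(k)
print(k, fraction)
```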
Target Downside Deviation
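• Variance and standard deviation of returns take account of returns above and below the mean equally.
However, investors are typically concerned only with downside risk, for example returns below the mean or
below some specified minimum target return.
• Target downside deviation, or target semideviation: a measure of the dispersion of the observations below the
target. Steps:
1. Specify the target B
2. For each observation at or below the target, compute the deviation from the target, $X_i - B$
3. Square those deviations and sum them
4. Divide the sum by n - 1, where n is the total number of observations
5. Take the square root
• Formula: $s_{Target} = \sqrt{\frac{\sum_{X_i \le B} (X_i - B)^2}{n - 1}}$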
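A minimal Python sketch of target downside deviation: only observations at or below the target contribute to the sum of squared deviations, while the divisor is n - 1 with n the total number of observations. The five returns are the ones used earlier in these slides; the 20% target is a hypothetical choice for illustration.

```python
import math


def target_downside_deviation(returns, target):
    """Dispersion of the observations at or below the target."""
    n = len(returns)
    # Sum squared shortfalls; observations above the target contribute nothing.
    shortfall_sq = sum((r - target) ** 2 for r in returns if r <= target)
    return math.sqrt(shortfall_sq / (n - 1))


returns = [30.0, 12.0, 25.0, 20.0, 23.0]  # annual returns, in percent
target = 20.0                              # hypothetical target return, in percent
tdd = target_downside_deviation(returns, target)
print(tdd)  # 4.0
```

Only the 12% return falls short of the target (the 20% return sits exactly on it), so the result is sqrt(64 / 4) = 4.0%.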
Coefficient of Variation (CV)
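Only the heading survives for this slide. Under the standard definition, the coefficient of variation is the standard deviation divided by the mean, CV = s / X̄, which expresses risk per unit of mean return. The sketch below applies that definition to the five annual returns used earlier in these slides, treating them as a sample (an assumption).

```python
import statistics

returns = [30, 12, 25, 20, 23]  # annual returns, in percent

mean = statistics.fmean(returns)
s = statistics.stdev(returns)   # sample standard deviation (divides by n - 1)

# Coefficient of variation: dispersion per unit of mean return.
cv = s / mean
print(round(cv, 3))
```

Because CV is a ratio, it allows dispersion to be compared across data sets with very different means.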
Symmetric Distribution
• A distribution is the graphical depiction of the collection of all the possible results of the observations
• A symmetric distribution means that the left half of the distribution is a mirror image of the right half,
with the mean located at the center of the distribution.
Skewness
• Skewness: the distribution is not symmetrical. A distribution can be positively or negatively skewed; the skew
results from outliers in the data
• Outliers are observations with extraordinarily large (extreme) values, either positive or negative
• A positively skewed distribution has its outliers in the upper region, giving a long right tail
• A negatively skewed distribution has its outliers in the lower region, giving a long left tail
[Figure: positively skewed, symmetric, and negatively skewed distributions]
Symmetric: Mean = Median = Mode
Positively skewed: Mean > Median > Mode
Negatively skewed: Mean < Median < Mode
Kurtosis
• Kurtosis is a measure of the degree to which a distribution is more or less peaked than a normal
distribution (which is mesokurtic)
• Platykurtic (thin-tailed): a distribution that is less peaked, or flatter, than a normal distribution
• Leptokurtic (fat-tailed): a distribution with more observations near the center and fatter tails than a normal
distribution. This is common in stock markets, where most daily returns fall in a relatively tight range but large
movements occasionally occur, creating outliers.
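Skewness and kurtosis can be estimated from a sample with the approximations below, which average the cubed and fourth-power deviations and scale them by the sample standard deviation. These are large-sample approximations; statistical packages often apply small-sample corrections and will report slightly different numbers. The function names and the two data sets are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Approximate skewness: average cubed deviation divided by s cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * s ** 3)


def excess_kurtosis(values):
    """Approximate kurtosis in excess of the normal distribution's value of 3."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3


symmetric = [-2, -1, 0, 1, 2]     # hypothetical, mirror-image around 0
right_tailed = [1, 1, 1, 1, 10]   # hypothetical, one large positive outlier

print(sample_skewness(symmetric))     # 0.0 for perfectly symmetric data
print(sample_skewness(right_tailed))  # positive: long right tail
print(excess_kurtosis(right_tailed))
```

A negative skewness indicates a long left tail, and a positive excess kurtosis indicates fatter tails than a normal distribution.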
Example: IHSG negatively skewed
[Figure: histogram of IHSG monthly returns, Jan 2011 - Jun 2022; x-axis: Return, y-axis: Frequency]
Parameter    Value
Mean         0.5%
St dev       4.0%
Skewness     -1.01
Kurtosis     2.03
Risk and Return Tradeoff
Example:
IHSG on 31 Dec 2010: 3703
IHSG on 30 June 2022: 6911
CAGR (compound annual growth rate)?
(6911/3703)^(1/11.5) - 1 = 5.6%
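The CAGR calculation as a short Python sketch, using the two index levels above and the 11.5-year holding period (31 Dec 2010 to 30 June 2022). The function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


ihsg_start = 3703   # index level on 31 Dec 2010
ihsg_end = 6911     # index level on 30 June 2022
years = 11.5

growth = cagr(ihsg_start, ihsg_end, years)
print(f"{growth:.1%}")
```

The growth factor (6911/3703)^(1/11.5) is about 1.056; subtracting 1 gives the annual rate of about 5.6%.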
Exercise (for practice, to be discussed together in class)
Which of the following is most accurate regarding a distribution of returns that has a mean greater
than its median?
A. It is positively skewed
B. It is a symmetric distribution
C. It is negatively skewed
The intervals in a frequency distribution should always have which of the following characteristics?
A. Be truncated
B. Be open ended
C. Be non-overlapping
References
• Drake, Pamela Peterson, PhD, CFA, and Jian Wu, PhD, “Organizing, Visualizing, and Describing Data”, CFA Program
• SchweserNotes for the CFA Exam Level 1: CFA Program Exam Prep. Kaplan, Inc., 2018.
@binus_executive_education BINUS Executive Education bbs.binus.ac.id/exed