Module 1 - Descriptive Stats

Uploaded by

jennylehuynh29

Module 1: Descriptive Statistics

Data Types
Qualitative/categorical
● Mutually exclusive labels (one label cannot mean two things)
● Usually not numbers; if numbers are used, they have no mathematical meaning
- Nominal: ordering/ranking makes no sense; numerical labels are arbitrary
- Ordinal: ordering/ranking has meaning and can be interpreted; numerical labels
respect the ordering
Quantitative/numerical
● Numbers used to record certain events; the numbers have mathematical meaning
- Interval: the difference between two quantities is meaningful, but their ratio is not;
zero has no natural meaning
- Ratio: both the difference and the ratio of two quantities are meaningful; zero is
meaningful

Using categorical/qualitative data


Frequency distribution
● Frequency: the total number of occurrences for each
category
● Relative frequency: the fraction of the total number of items
belonging to a category (eg. 102 ÷ 808 = 0.1262)
● Percent frequency: relative frequency x 100%
Histograms
● Categories on x-axis
● Frequency, relative frequency, percent frequency on y-axis
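
As a minimal sketch of the frequency table described above, the following Python snippet counts the categories in a small made-up sample and derives the relative and percent frequencies. The category labels and counts are illustrative assumptions, not data from the module.

```python
from collections import Counter

# Hypothetical categorical sample (illustrative labels only)
responses = ["bus", "car", "car", "train", "bus", "car", "walk", "car"]

frequency = Counter(responses)            # number of occurrences per category
total = sum(frequency.values())           # total number of items

# Relative frequency: fraction of all items that fall in each category
relative = {cat: count / total for cat, count in frequency.items()}
# Percent frequency: relative frequency x 100%
percent = {cat: 100 * rel for cat, rel in relative.items()}

for cat in frequency:
    print(f"{cat:6s} {frequency[cat]:3d} {relative[cat]:.4f} {percent[cat]:.2f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100%, which is a quick check on any hand-built table.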

Using numerical/quantitative data


Frequency distributions and histograms
● Values on the x-axis are grouped into classes (eg. 0-5, 5-10, 10-15)
● Density frequency: relative frequency divided by class width, so that the areas of the bars sum to 1
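
A possible sketch of a grouped frequency distribution is shown below. It assumes that "density frequency" means relative frequency divided by class width, and that a value sitting on a class boundary is counted in the upper class; both are assumptions, as are the sample values.

```python
# Hypothetical numerical sample (illustrative values only)
values = [1, 2, 4, 6, 7, 7, 8, 11, 12, 14]
width = 5                    # class width, giving classes 0-5, 5-10, 10-15
edges = [0, 5, 10, 15]

# Count the values in each class [lower, lower + width)
counts = [sum(lo <= v < lo + width for v in values) for lo in edges[:-1]]

n = len(values)
relative = [c / n for c in counts]           # relative frequency per class
density = [r / width for r in relative]      # relative frequency per unit of x

# With density on the y-axis, the bar areas (density x width) sum to 1
total_area = sum(d * width for d in density)
```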

Probability theory
● Random variable (r.v.) - a variable whose value is determined by a random process
● Population - the complete pool of possible observations of a certain random variable
● Sample - a random collection of a certain size drawn from the population

Probability distribution
● Probability distribution - the general shape of probability for values that a random
variable may take

Notation
● Random variable denoted by X, Y (capital letters)
- Eg. X: number of children in household
- Eg. Y: amount of time spent by husband on
housework per day
● Realisations/observations of a random variable are denoted by xᵢ,
yᵢ (lowercase letters with subscript)
- Eg. x₁: number of children in household is 1
- Eg. y₁₃₇: amount of time spent by husband on housework per day is 137
● N and n denote the size or number of observations.
- N denotes the population size
- n denotes the sample size

Descriptive Statistics
Central tendency
● A measure of central tendency yields info about the centre of a set of numbers
(the distribution of a r.v.); it does not focus on the span of the dataset or how far
values are from the middle
● Gives an idea of the typical, middle, or average value that a r.v. can take
● Sometimes called measures of location

three measures of central tendency

Mode ● most frequently occurring value in a set of data


● If there are 2 modes, the 2 modes are listed and the data is said to be bimodal
● Datasets with 3 or more modes are referred to as multimodal
● Concept of mode is often used in determining sizes
● Appropriate descriptive summary measure for categorical data

Median ● middle value in an ordered array of numbers


● Locate the median by finding the (n+1)/2 th term in the ordered array
● Large and small values do not inordinately influence the median; hence it is the
best measure of location to use in the analysis of variables in which extreme but
acceptable values can occur at just one end of the data
● Not all info from the dataset is used
● Data must be quantitative or be able to be ranked

Mean ● Average of a set of numbers


● Sample mean is represented by X̄
● Population mean is represented by μ
● Data should be quantitative as it needs to be summed
● Affected by all values – advantage because it reflects all the data, but
disadvantage because extreme values pull the mean towards extremes
● To calculate the mean forecast value (the expected value), we multiply each possible
value by its probability and sum up the products.

- If we denote the r.v. by X: E(X) = Σᵢ xᵢ × P(X = xᵢ)
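
The three measures of central tendency can be sketched with Python's standard `statistics` module. The sample values, forecast values, and probabilities below are made up for illustration.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [3, 5, 5, 6, 8, 9, 20]

modes = statistics.multimode(data)     # list of all most frequent values
median = statistics.median(data)       # middle value of the ordered data
mean = statistics.mean(data)           # sum of the values divided by n

# Position of the median in the ordered array: the (n+1)/2 th term
n = len(data)
median_position = (n + 1) / 2
median_by_position = sorted(data)[int(median_position) - 1]

# Expected value of a discrete r.v.: each possible value times its probability, summed
# (hypothetical forecast values and probabilities)
outcomes = {10: 0.2, 20: 0.5, 30: 0.3}
expected = sum(x * p for x, p in outcomes.items())
```

In this sample the single large value 20 pulls the mean above the median, which illustrates why the median is preferred when extreme values sit at one end of the data.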


Variability
● Measures of variability yield info about how likely a realisation of the r.v. is to lie
away from the centre of its distribution; they describe the spread/dispersion of a dataset
● Gives an idea of fluctuation and volatility across realisations of the r.v.
● The more variability in a dataset, the less typical the central values are of the whole set
● Using measures of variability in conjunction with measures of central tendency
makes possible a more complete numerical description of the data (a measure of
variability is necessary to complement the mean value when describing data)
● The more spread out the r.v. is, the larger its variability (risk/dispersion)
● Also called measures of scale, spread, dispersion or risk
● Measures of variability
- Variance (Var): average of squared distances from the mean
- Standard deviation (std): square root of variance
- Coefficient of variation (CV): standard deviation ÷ mean × 100%

Variability formulas
Variance
● It computes the average squared distance between data points and their mean,
depending on sample or population
● Population variance
- Finite population
- Denoted by σ² (sigma squared) or
Var(X) (variance of X)
● Sample variance
- Denoted by s²
Standard deviation
● Standard deviation solves the problem of
squared units: it has the same unit as the
original data
● Population standard deviation
- Denoted by σ (sigma) or std(X)
● Sample standard deviation
- Denoted by s
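
The formulas themselves did not survive in these notes. The standard textbook definitions are sketched below, assuming a finite population of size N with mean μ and a sample of size n with sample mean x̄; the n − 1 divisor for the sample variance is the usual convention and is an assumption about what the module uses.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2, & \sigma &= \sqrt{\sigma^2} \\
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2, & s &= \sqrt{s^2}
\end{align*}
```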
Coefficient of variation
● Measures standard deviation per unit of mean
● In finance, when the r.v. X denotes asset returns, CV measures risk per unit of
expected return
● It is unit free, because both the numerator and
denominator have the same unit as the original data and
they cancel each other
● Population CV: CV = σ ÷ μ × 100%
- when σ increases, CV increases
- when μ increases, CV decreases
- Ratio between risk and expected return
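
The variability measures above can be sketched with the standard `statistics` module, where `pvariance`/`pstdev` divide by N (population) and `variance`/`stdev` divide by n − 1 (sample). The return figures are hypothetical.

```python
import statistics

# Hypothetical asset returns in percent (illustrative values only)
returns = [2.0, 4.0, 4.0, 6.0, 9.0]

mu = statistics.mean(returns)

pop_var = statistics.pvariance(returns)    # population variance, divides by N
samp_var = statistics.variance(returns)    # sample variance, divides by n - 1
pop_std = statistics.pstdev(returns)       # square root of population variance
samp_std = statistics.stdev(returns)       # square root of sample variance

# Coefficient of variation: standard deviation per unit of mean, in percent
cv_percent = pop_std / mu * 100
```

Because the standard deviation and the mean share the unit of the data, `cv_percent` is unit free and can be compared across assets measured on different scales.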
Skewness
Shape
● Central tendency and variability are useful to describe and summarise data or the
distribution of r.v.’s
● Skewness - measure of asymmetry
● Mode: value on the horizontal axis where the high point of the curve occurs
● Mean: towards the tail of the distribution (drawn towards the extreme values)


● Median: generally located somewhere between the mode and the mean
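
As a small illustration of the ordering described above, the sketch below uses a made-up right-skewed sample in which a few large values form the tail.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values only)
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(data)       # most frequent value, the high point of the distribution
median = statistics.median(data)   # middle of the ordered data
mean = statistics.mean(data)       # pulled towards the tail by 9 and 14
```

For this sample the mode is smallest, the mean is largest, and the median sits in between; for a left-skewed sample the ordering would typically be reversed.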

Probability theory
● Multi-dimensional data
● Experiment: a random process that creates outcomes (eg. the data collection
procedure)
● Sample space: the set of all possible outcomes
● Event: a set of outcomes (can contain no outcome, single outcome or multiple
outcomes) of an experiment to which probability is assigned. So an event is a subset
of the sample space
● Relative frequency: outcomes receive probability corresponding to their number of
occurrences → P(outcomeᵢ) = number of occurrences of outcomeᵢ ÷ total number of
occurrences of all outcomes

Law of addition
Joint vs marginal probabilities
● Distinguish joint and marginal probability through multidimensional outcomes
● Joint probability: denotes relative frequency when asking about all dimensions
- Eg. what is relative frequency that customer bought a $49 plan on a weekday
● Marginal probability: displays relative frequency when only asking about a single
dimension
- Eg. relative frequency that customer bought a $49 plan
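
A minimal sketch of the distinction, using a made-up table of purchase counts by plan and day type. The counts and the $29 plan are hypothetical; only the $49 plan and the weekday example come from the notes.

```python
# Hypothetical purchase counts (illustrative numbers only): (plan, day type) -> count
counts = {
    ("$29", "weekday"): 30, ("$29", "weekend"): 20,
    ("$49", "weekday"): 25, ("$49", "weekend"): 25,
}
total = sum(counts.values())

# Joint probability: asks about both dimensions at once
p_49_and_weekday = counts[("$49", "weekday")] / total

# Marginal probability: asks about one dimension, summing over the other
p_49 = sum(c for (plan, _), c in counts.items() if plan == "$49") / total
p_weekday = sum(c for (_, day), c in counts.items() if day == "weekday") / total
```

A marginal probability is the sum of the joint probabilities across the dimension that is being ignored.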

The complement of an event A is denoted as A’ (pronounced "A prime"), meaning "not A";
a dash at the top of an event means the outcome does not occur

When referring to joint probability, we use the intersection symbol “∩”. The event A∩B
(read: "the intersection of A and B", or "A intersection B") means the event where both
A and B are true, or both A and B occur

Venn diagram: visualisation of probability


● Venn diagram shows logic relations across sets
● The external rectangle indicates the whole sample space
● The internal circle indicates some event A
Joint events
● Joint events such as A ∩ B is the intersection (∩) of A and B
Union of events
● Indicates the event that A or B (or both) happens
● This is denoted by A∪B, pronounced as "the union of A and B" or "A union B".
So P(A∪B) indicates the probability that A or B is true, or that A or B occurs

Mutually exclusive events


● If events A and B cannot occur at the same time (A occurs only if B
does not occur), we say A and B are mutually exclusive (events)
● Any event and its complement are mutually exclusive: either “A
occurs” or “A does not occur”
● P(A∩A’) = 0

Collectively exhaustive events


● If the occurrence of events A and B covers the whole sample
space, we say A and B are collectively exhaustive (events)
● Any event and its complement are collectively exhaustive: “A occurs” and “A does not
occur” make up all possible outcomes
● P( A∪A’) = 1
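
The set relations above can be checked on a small example. The sketch below assumes one roll of a fair die (an example not taken from the notes) and uses the standard addition law P(A∪B) = P(A) + P(B) − P(A∩B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}        # one roll of a fair die (assumed example)

def prob(event):
    """Probability of an event (a subset of the sample space), all outcomes equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}                            # roll is even
B = {4, 5, 6}                            # roll is greater than 3
A_complement = sample_space - A          # A': roll is not even

# Addition law: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A | B)
p_by_addition_law = prob(A) + prob(B) - prob(A & B)

# An event and its complement are mutually exclusive and collectively exhaustive
p_both = prob(A & A_complement)
p_either = prob(A | A_complement)
```

Subtracting P(A∩B) avoids counting the outcomes 4 and 6 twice; for mutually exclusive events that term is 0.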

Conditional probabilities and independence


Conditional probabilities
● P(A|B) denotes the probability that event A occurs, conditional on B occurring
● The symbol P(X=x|Y=y) denotes the probability of r.v. X taking value x, conditional on
the r.v. Y taking value y
● Formula: P(A|B) = P(A∩B) ÷ P(B), provided P(B) > 0

● Bayes rule: P(A|B) = P(B|A) × P(A) ÷ P(B)

Law of total probability


● Joint probability = conditional probability multiplied by marginal probability
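
A short numerical sketch of these relations, using the standard definitions P(A|B) = P(A∩B) ÷ P(B) and Bayes' rule P(B|A) = P(A|B) × P(B) ÷ P(A). The three input probabilities are hypothetical.

```python
from fractions import Fraction

# Hypothetical probabilities (illustrative numbers only)
p_A = Fraction(2, 5)          # marginal P(A)
p_B = Fraction(1, 2)          # marginal P(B)
p_A_and_B = Fraction(1, 4)    # joint P(A and B)

# Conditional probability: joint divided by the marginal of the conditioning event
p_A_given_B = p_A_and_B / p_B

# Joint probability = conditional probability x marginal probability
joint_rebuilt = p_A_given_B * p_B

# Bayes' rule reverses the conditioning
p_B_given_A = p_A_given_B * p_B / p_A
```

Exact fractions are used here so that the identities hold without rounding error.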

Independent events: formula


● If A and B are independent events, whether or not B occurs should not affect the
probability that A occurs; also, whether or not A occurs should not affect the probability
that B occurs
● Formula: P(A|B) = P(A)

● Bayes rule:

Implications of formulas

Binomial experiments
● Eg. toss a coin 3 times in a row and you are interested in how likely it is that you get
exactly two heads
● A binomial experiment assesses the number of a certain outcome from repeated
independent trials
● Each trial has two possible outcomes (eg. heads or tails, success or failure)

Binomial tree
● When two outcomes are independent, P(A|B) = P(A)
● Suppose we have three products; each can be defective (D) with probability p or
functional (F) with probability q = 1 - p
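
The coin example can be worked through in a few lines. The sketch below assumes a fair coin (p = 0.5, which the notes do not state) and uses the standard binomial formula, then cross-checks it by listing every branch of the binomial tree.

```python
from itertools import product
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Three tosses of a fair coin, exactly two heads
p_two_heads = binomial_pmf(2, 3, 0.5)

# Cross-check: enumerate all 2^3 equally likely branches of the tree
branches = list(product("HT", repeat=3))
two_head_branches = [b for b in branches if b.count("H") == 2]
p_by_tree = len(two_head_branches) / len(branches)
```

The same function covers the product example by treating "defective" as the success outcome, eg. `binomial_pmf(1, 3, p)` for exactly one defective product out of three.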

Continuous probability distributions


● Discrete probability distribution: the distribution of a discrete random variable
● Discrete random variable: a r.v. that takes discrete values. Discrete r.v. typically
counts
- Eg. number of kids in a household, number of successes in n trials
● Continuous random variable: a r.v. that takes values on (part of) the real line.
Continuous r.v. measures
- Eg. waiting time in a queue, height of soldiers, inflation rates

2 different probability distribution functions (pdfs): Discrete, Continuous

Scores add up to 1

Probability density function


● A continuous probability distribution for X is defined by means of a
probability density function (pdf), which assigns a non-negative value to
each possible outcome of X such that the density integrates to 1 (the
area under the curve is 1). The probability that X lies between two
numbers is the area under the pdf between those numbers
Discrete random variable
● Unlike for a discrete r.v., we do not work with P(X=x) for a continuous r.v., where x is
some specific value, because P(X=x) = 0 always
● A continuous r.v. has infinitely many outcomes. If every single outcome had positive
probability, the probabilities would add up to infinity and not 1
● Eg. What is the probability that a random person waits exactly
2.71285748634050284… minutes?
- The probability is 0. However, the probability that a person waits between
2.71…84 and 2.71…9 is strictly positive
Implications for Inequalities

Cumulative distribution function for a continuous pdf


● P(X<x) for a continuous r.v. defines the cumulative distribution
function (CDF), also called the cumulative density function

Conditions for a pdf f_X(x):


1. Total area under the pdf equals 1: P(-∞ < X < ∞) = 1
2. Because probabilities are areas under the pdf, the pdf can never be
negative; a negative pdf would imply negative probabilities over some range
- Eg.

Continuous uniform distribution


● A r.v. X taking any value within [a,b] is said to follow the continuous
uniform distribution if all potential outcomes (realisations) between a and b
are equally likely
● Notation: X ~ Unif(a,b)
● There are two parameters:
- a: the minimum value that X can assume
- b: the maximum value that X can assume

- E(X) = (a+b)/2, Var(X) = (b−a)²/12

● For any continuous r.v., P(x₁ < X < x₂) = P(X < x₂) − P(X < x₁): the area under the pdf
from x₁ to x₂ is the difference between the values of the cdf at x₂ and x₁
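
A minimal sketch of the uniform distribution formulas above. The cdf expression (x − a) ÷ (b − a) is the standard result for Unif(a,b) and is not stated in the notes; the parameter values are hypothetical.

```python
def uniform_cdf(x, a, b):
    """P(X < x) for X ~ Unif(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 10.0                  # hypothetical minimum and maximum

mean = (a + b) / 2                # E(X)
variance = (b - a) ** 2 / 12      # Var(X)

# P(x1 < X < x2) as a difference of two cdf values
x1, x2 = 4.0, 6.0
p_between = uniform_cdf(x2, a, b) - uniform_cdf(x1, a, b)
```

Since the pdf is flat at height 1 ÷ (b − a), the same probability can be read off as a rectangle: (x₂ − x₁) ÷ (b − a).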
