Unit 3 - Descriptive Statistics

Uploaded by

Rajdeep Singh

EU Business School Munich

Quantitative Business Methods


Lecturer: Hashem Zarafat

UNIT 3 - FUNDAMENTALS OF
STATISTICS
Some terms first…

FUNDAMENTALS OF STATISTICS
What Is Statistics?

1. Collecting Data
e.g., surveys, databases
2. Presenting Data
e.g., plots, charts, tables, data visualization
3. Characterizing Data
e.g., means, correlations…
Why? Data Analysis → Decision-Making



Statistical Methods
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics
1. Involves
• Collecting Data
• Presenting Data
• Characterizing Data
2. Purpose
• Describe Data
Numerical measures that describe a distribution by providing:
• Information on the central tendency of the distribution
• The width of the distribution
• The shape of the distribution
Measure of central tendency: a number that characterizes the “middleness” of an entire distribution
[Figure: bar chart of $ values by quarter (Q1 to Q4), annotated with X̄ = 30.5 and S² = 113]
Inferential Statistics
1. Involves
• Estimation
• Hypothesis Testing
2. Purpose
• Make decisions about population characteristics
Fundamental elements:
1. Statistical Inference
• Estimation or prediction or generalization about a
population based on information contained in a
sample
2. Measure of Reliability
• Statement (usually qualified) about the degree of
uncertainty associated with a statistical inference
Populations vs Samples

• A population (census) includes all of the entities of interest, whether they be people, households, machines, or whatever.
The following are three typical populations:
– All potential voters in a presidential election
– All subscribers to cable television
– All invoices submitted for Medicare reimbursement by
nursing homes
• A sample is a subset of the population, often randomly chosen
and preferably representative of the population as a whole
A Puzzle is a Sample Until It Is Done! The Sample Allows One to Guess at the Picture.
SCALES OF MEASUREMENT
Nominal Scale (Categorical)

• Objects or individuals are assigned to categories that have no numerical properties
• Characteristic of identity
• Categorical variables: variables measured on a nominal scale
• Examples: names, ethnicity, gender, favorite color, citizenship
• Dummy variables (experimental = 1; control = 2)

• R: “character” object
Ordinal Scale (Categorical)

• Objects or individuals are categorized, and the categories form a rank order along a continuum
• Properties of identity and magnitude
• Ordinal data: referred to as ranked data
• Examples: income and age brackets, educational level, grades (A, B, C), scale of preferences (e.g., 0 to 10)

• R: “factor” object
Interval Scale (Numerical, Continuous)

• Intervals between the numbers on the scale are all equal in size
• Criteria of identity, magnitude, and equal unit size
are met
• Example: Fahrenheit temperature scale

• R: “numeric” object
Ratio Scale (Numerical, Continuous)

• A scale in which, in addition to order and equal units of measurement, an absolute zero indicates an absence of the variable being measured
• Ratio data have all properties of measurement
• Examples: number of children, times you check Facebook per day…

• R: also “numeric” object (SPSS: “scale” variable, no differentiation from the interval scale)
Scale of Measurement and Mathematical Operations
Scale of Measurement and Statistical Tests
Descriptive statistics
• Nominal: mode
• Ordinal: mode, median, range statistics
• Interval: mode, median, mean, range statistics, variance, standard deviation
• Ratio: mode, median, mean, range statistics, variance, standard deviation
Inferential statistics
• Nominal: non-parametric (chi-square)
• Ordinal: non-parametric (Mann-Whitney U, Kruskal-Wallis H, Friedman ANOVA, Spearman correlation)
• Interval: parametric (t test, ANOVA, Pearson correlation); if not normal: non-parametric
• Ratio: parametric (t test, ANOVA, Pearson correlation); if not normal: non-parametric
What about Measures of Central Tendency?
Data, data, data…
Probability Distributions

DISCRETE AND CONTINUOUS VARIABLES
Discrete Variables
• Consist of whole number units or categories
• Made up of chunks or units that are detached and distinct
from one another
• Main questions:
– Are the variable’s values exhaustive (finite) or not?
– Do we know/ can we count all the possible values?

• Most nominal and ordinal data are discrete
– Examples: gender, political party, ethnicity
• Some interval or ratio data can be discrete
– Example: number of children in a family (one cannot have
5.34 children)
Continuous Variables

• Usually fall along a continuum and allow for fractional amounts
• There is no way we can count all the possible values!
• Examples: age (22.7 years), height (64.5 inches),
weight (113.25 pounds)
• Most interval and ratio data are continuous in nature
Data Sets, Variables, Observations

• A data set is usually a rectangular array of data, with variables in columns and observations in rows.
– R: “Data Frames”
• A variable (or field or attribute) is a characteristic of
members of a population, such as height, gender, or
salary.
• An observation (or case or record) is a list of all
variable values for a single member of a population.
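The rectangular layout described above can be sketched in plain Python. This is only an illustration: the three records and their values are hypothetical, and a list of dictionaries stands in for what R would hold in a data frame.

```python
# A tiny data set: each row is one observation (case),
# each key is one variable (field). All values are hypothetical.
data_set = [
    {"height": 64.5, "gender": "F", "salary": 52000},
    {"height": 70.2, "gender": "M", "salary": 48000},
    {"height": 68.3, "gender": "F", "salary": 61000},
]

# One observation: all variable values for a single member.
first_observation = data_set[0]

# One variable: the same characteristic across all members (a "column").
heights = [row["height"] for row in data_set]

print(first_observation)
print(heights)
```

Reading a row gives an observation and reading a key across rows gives a variable, which mirrors the rows-and-columns logic of a database table.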
➢ Logic of a Database
➢ R/Excel: you sometimes need two files (data and desc)
STATISTICAL MEASURES
Measures of Central Tendency: The Mean

• The mean is the average of all values of a variable.
• If the data set represents a sample from some larger population, we call this measure the sample mean and denote it by \(\bar{X}\).
• If the data set represents the entire population, we call it the population mean and denote it by μ.
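Written as formulas, the two means have the same form and differ only in what is being averaged. The notation here is an assumption based on standard usage rather than something shown on the slide: n is the sample size, N the population size, and X_i the i-th observation.

```latex
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\qquad\text{and}\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} X_i
\]
```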
Measures of Central Tendency: The Median

• The median is the middle observation when the data is arranged from smallest to largest.
• If the number of observations is odd, the median is
literally the middle observation.
• If the number of observations is even, the median
is usually defined as the average of the two middle
observations.
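The odd and even cases can be sketched as a short function. The helper name `median_by_hand` is made up for this illustration; the six exam scores are the ones used in this unit's worked example, and the five-score list is simply the first five of them.

```python
import statistics

def median_by_hand(values):
    """Sort the data, then take the middle observation
    (or the average of the two middle observations)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middle observation itself.
        return ordered[mid]
    # Even count: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

exam_scores = [57, 99, 78, 73, 84, 95]   # even number of observations
odd_scores = exam_scores[:5]             # odd number of observations

print(median_by_hand(exam_scores))   # 81.0
print(median_by_hand(odd_scores))    # 78
```

Python's built-in `statistics.median` follows the same convention, so it can be used to check the result.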
Measures of Central Tendency: The Mode

• The mode is the value that appears most often.
• In most cases where a variable is essentially
continuous, the mode is not very interesting
because it is often the result of a few lucky
ties.
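A small sketch of why the mode suits categorical data better than continuous data. Both lists are hypothetical; `statistics.mode` and `statistics.multimode` are standard-library functions.

```python
import statistics

# Categorical variable: the mode is meaningful.
grades = ["B", "A", "C", "B", "B", "A"]
print(statistics.mode(grades))        # B

# Essentially continuous variable: no value repeats,
# so every value is tied as "most frequent".
weights = [113.25, 120.5, 98.75, 134.0]
print(statistics.multimode(weights))
```

When no value repeats, `multimode` returns every value, which is another way of saying the mode carries no information for that variable.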
Minimum, Maximum, Percentiles

• The minimum and maximum are self-explanatory

• For any percentage p, the pth percentile is the value such that a percentage p of all values are less than it.
– It splits the data into two pieces: the lower piece contains p percent of the data, and the upper piece contains the rest of the data.

• Calculating: For example, suppose you have 25 test scores, and in order from
lowest to highest they look like this: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72,
77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for
these (ordered) scores, start by multiplying 90% times the total number of scores,
which gives 90% ∗ 25 = 0.90 ∗ 25 = 22.5 (the index). Rounding up to the nearest
whole number, you get 23. Counting from left to right (from the smallest to the
largest value in the data set), you go until you find the 23rd value in the data set.
That value is 98, and it’s the 90th percentile for this data set.
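The calculation above can be sketched as a function. The name `percentile_round_up` is hypothetical, and one detail is an assumption: the text only covers the case where the index is fractional, so when the index is already a whole number this sketch simply takes that position. Statistical software often interpolates instead, so results may differ slightly from this rule.

```python
import math

def percentile_round_up(values, p):
    """p-th percentile using the round-up rule from the example above.

    p is a whole-number percentage, e.g. 90 for the 90th percentile.
    """
    ordered = sorted(values)
    # Index = p% of the number of values, rounded up (1-based position).
    index = math.ceil(p * len(ordered) / 100)
    index = max(index, 1)   # guard for very small p
    return ordered[index - 1]

test_scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
               78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

print(percentile_round_up(test_scores, 90))   # 98
```

Here 90% of 25 is 22.5, which rounds up to position 23, and the 23rd ordered score is 98.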
Measures of Variability:
Quartiles, Range, and Interquartile Range

• The quartiles divide the data into four groups, each with
(approximately) a quarter of all observations.
• Naturally, the first, second and third quartiles are the percentiles
corresponding to p = 25%, p = 50%, and p = 75%.
• By definition, the second quartile (p = 50%) is equal to the median.

• The range is defined as the maximum value minus the minimum value.
• The range is a fairly crude measure of variability.
• The interquartile range (IQR) is defined as the third quartile minus
the first quartile.
• Thus, the IQR is the range of the middle 50% of the data.
• It is less sensitive to extreme values than the range.
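A minimal sketch of these measures for the 25 test scores from the percentile example. It relies on `statistics.quantiles` from the standard library, which interpolates between observations; other tools (R, Excel, SPSS) use slightly different quartile conventions, so Q1 and Q3 may not match exactly across software.

```python
import statistics

test_scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
               78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

# n=4 cuts the data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(test_scores, n=4)

value_range = max(test_scores) - min(test_scores)   # maximum minus minimum
iqr = q3 - q1                                       # range of the middle 50%

print("Q1, Q2, Q3:", q1, q2, q3)
print("Range:", value_range)
print("IQR:", iqr)
```

The second quartile equals the median, as stated above, while the range depends only on the two most extreme scores.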
Measures of Variability: Variance and
Standard Deviation
• The variance is essentially the average of the squared deviations from the mean.
• If \(X_i\) is a typical observation, its squared deviation from the mean is \((X_i - \bar{X})^2\).
• The sample variance is denoted by \(s^2\), and the population variance by \(\sigma^2\).
• It is hard to interpret the variance numerically because it is in squared units (e.g., dollars → squared dollars).
Formula for the Variance
Sample Variance: \(s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}\)
Population Variance: \(\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}\)

• If all of the observations are close to the mean, then their squared
deviations from the mean will be relatively small, and the variance
will be relatively small.
• If at least a few of the observations are far from the mean, then
their squared deviations from the mean will be large, and this will
cause the variance to be large.
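The arithmetic can be sketched step by step for the six exam scores used in this unit's worked example. The variable names are illustrative only; the sample variance divides by n − 1 and the population variance by the number of observations.

```python
import math
import statistics

exam_scores = [57, 99, 78, 73, 84, 95]

mean = statistics.mean(exam_scores)                       # 81
squared_deviations = [(x - mean) ** 2 for x in exam_scores]

# Sample variance: divide by n - 1.
sample_variance = sum(squared_deviations) / (len(exam_scores) - 1)
# Population variance: divide by the number of observations.
population_variance = sum(squared_deviations) / len(exam_scores)

# Standard deviation: square root of the variance, back in original units.
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 1))   # 235.6
print(round(sample_sd, 2))         # 15.35
```

The score of 57 alone contributes 576 of the 1178 total squared deviations, which shows how a single observation far from the mean inflates the variance.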
Empirical Rules for Interpreting Standard
Deviation
• A more natural measure is the standard deviation which is the square
root of variance (denoted as just s or σ).
• The interpretation of the standard deviation can be stated as three
empirical rules.
• If the values of a variable are approximately normally distributed
(symmetric and bell-shaped), then the following rules hold:
(1) Approximately 68% of the observations are within one standard deviation of
the mean.
(2) Approximately 95% of the observations are within two standard deviations
of the mean.
(3) Approximately 99.7% of the observations are within three standard deviations
of the mean.
• Fortunately, many variables in real-world data are indeed approximately
normally distributed.
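The three percentages can be checked against an exactly normal distribution. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the dictionary name `shares` is just for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"Within {k} standard deviation(s): {shares[k]:.1%}")
```

The computed shares are roughly 68.3%, 95.4%, and 99.7%, so the rules above are rounded values. Real data that are only approximately normal will deviate from them somewhat.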
Normal Distribution: Empirical Rules
• Standard normal distribution: a normal distribution with a mean of
0 and a standard deviation of 1 (→ “standardization”)
• Probability: the expected relative frequency of a particular outcome
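Standardization, mentioned above, is usually written as the following transformation. The symbol Z is an assumption taken from common usage; it expresses a value X as the number of standard deviations it lies above or below the mean, so the standardized variable has mean 0 and standard deviation 1.

```latex
\[
Z = \frac{X - \mu}{\sigma}
\]
```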
Mean Absolute Deviation

• The mean absolute deviation (MAD) is another measure of variability.
• For many variables, the standard deviation is approximately 25% larger than the MAD: \(s \approx 1.25 \times \text{MAD}\)

• Formula for Mean Absolute Deviation: \(\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}\)
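A short sketch of the MAD for the six exam scores from this unit's worked example, assuming the usual definition (the average absolute deviation from the mean, divided by n). The variable names are illustrative.

```python
import statistics

exam_scores = [57, 99, 78, 73, 84, 95]

mean = statistics.mean(exam_scores)   # 81
absolute_deviations = [abs(x - mean) for x in exam_scores]

# Mean absolute deviation: average distance from the mean.
mad = sum(absolute_deviations) / len(exam_scores)

# Compare with the sample standard deviation.
sample_sd = statistics.stdev(exam_scores)
ratio = sample_sd / mad

print(round(mad, 2))
print(round(ratio, 2))
```

For such a small sample the ratio of standard deviation to MAD is only loosely near 1.25. The 25% figure is a rough guide for many variables, not an exact identity.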


Dataset (exam scores; random sample): 57, 99, 78, 73, 84, 95
Compute:
➢ Mean
➢ (57+99+78+73+84+95)/6 = 81
➢ Median
➢ 57, 73, 78, 84, 95, 99
➢ Even: (78+84)/2 = 81
➢ 90th percentile
➢ 0.9*6=5.4; round-up=6; 6th score in ranked dataset: 99.
➢ Range
➢ 99-57=42
➢ Variance
➢ ((57-81)^2+…)/(6-1) = 235.6
➢ Standard Deviation
➢ √235.6 ≈ 15.35
Measures of Shape: Skewness
• Skewness occurs because of a lack of symmetry.
▪ A variable can be skewed to the right (or positively
skewed) because of some really large values (e.g.
comparing Baseball players’ salaries).
▪ Or it can be skewed to the left (or negatively skewed)
because of some really small values (e.g. examining
temperature lows in Antarctica).
Measures of Shape: Kurtosis
▪ Kurtosis has to do with the “fatness” of the tails of the
distribution relative to the tails of a normal distribution.
▪ A distribution with high kurtosis has many extreme
observations.
Skewness and Kurtosis – What are acceptable
values?

– Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0.
– None of the methods is absolutely definitive.
– We will use the criterion that the skewness and kurtosis of the distribution both fall between –1.0 and +1.0.
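The rule of thumb can be sketched with simple moment-based formulas. Several things here are assumptions: the function names are made up, the salary figures are hypothetical, kurtosis is measured as excess kurtosis (a normal distribution scores 0), and the formulas are the plain population versions. Packages such as SPSS or Excel apply small-sample adjustments, so their values will differ somewhat.

```python
def central_moment(values, k):
    """Average of (x - mean)^k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    # Third central moment scaled by the standard deviation cubed.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment scaled by the variance squared,
    # minus 3 so that a normal distribution scores 0.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def roughly_normal(values, limit=1.0):
    """Rule of thumb: both measures fall between -limit and +limit."""
    return abs(skewness(values)) < limit and abs(excess_kurtosis(values)) < limit

exam_scores = [57, 99, 78, 73, 84, 95]    # scores from the worked example
salaries = [30, 32, 35, 38, 40, 250]      # hypothetical, one very large value

print(roughly_normal(exam_scores))
print(skewness(salaries) > 0)             # True: skewed to the right
print(roughly_normal(salaries))
```

The single very large salary pulls the skewness well above +1, so the hypothetical salary data fail the rule of thumb.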
Outliers (Z Score)
• An outlier is literally a value or an entire observation that lies well outside
of the norm.
• You might define an outlier as any value more than three standard
deviations from the mean, but this is only a rule of thumb.
• Boxplots are widely used to identify outliers.
• Probably the best advice for dealing with outliers is to run the analyses two
ways: With the outliers and without them.
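A minimal sketch of the three-standard-deviation rule and of running an analysis both ways. The function name and the data are hypothetical. With very small samples no value can reach a z-score of 3, so the rule only makes sense for reasonably large data sets.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > threshold]

# Hypothetical measurements with one extreme value.
measurements = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
                51, 52, 48, 50, 51, 49, 50, 52, 48, 200]

outliers = z_score_outliers(measurements)
without_outliers = [x for x in measurements if x not in outliers]

# Run the analysis two ways: with the outliers and without them.
mean_with = statistics.mean(measurements)
mean_without = statistics.mean(without_outliers)

print(outliers)        # [200]
print(mean_with)       # 57.5
print(mean_without)    # 50
```

Comparing the two means shows how much a single extreme value can move a result, which is the point of reporting both versions.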
Missing Values
• What are missing values?
– What to do about them?
• Never use “0” or a “blank” for a missing value (or any
“possible” value)!
• Missing data are coded in a variety of ways.
– Excel: empty cells/ codes.
– SPSS: 999 or 999.999 for a missing value
– R: NA for a missing value
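The warning against coding a missing value as 0 can be illustrated with a short sketch. It reuses the exam scores from the worked example but treats one score as missing, which is a hypothetical change; Python's `None` stands in for R's `NA`.

```python
import statistics

MISSING = None   # stand-in for a missing value (an assumption for this sketch)

exam_scores = [57, 99, MISSING, 73, 84, 95]

# Correct handling: leave the missing value out of the calculation.
observed = [x for x in exam_scores if x is not MISSING]
mean_observed = statistics.mean(observed)

# Incorrect handling: coding the missing value as 0, a possible real score.
zero_coded = [0 if x is MISSING else x for x in exam_scores]
mean_zero_coded = statistics.mean(zero_coded)

print(mean_observed)     # 81.6
print(mean_zero_coded)   # 68
```

Coding the missing score as 0 drags the mean down by more than 13 points, because 0 is treated as a real exam result.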
