Statistics Final
Contingency Tables
Contingency tables are tools used in statistics to summarize the relationship between two
categorical variables. Each entry in the table represents the frequency count of the
occurrences of specific combinations of variables.
Testing Dependency
The dependency between variables in a contingency table is tested using the Chi-square (χ2)
test and Fisher’s Exact test, depending on the sample size and expected frequencies:
Chi-square Test: Suitable for larger sample sizes (n > 40, or when 20 < n ≤ 40 and no expected frequency is < 5). It checks the independence between two variables by comparing observed frequencies with expected frequencies under the assumption that the variables are independent.
Decision Rule: If the computed chi-square statistic is greater than the critical value from chi-square distribution tables, we reject the null hypothesis (H0) and conclude that there is a dependency between the variables.
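In code, the decision rule is a single comparison; a minimal sketch, assuming alpha = 0.05 and a 2x2 table (so df = 1), with a hypothetical computed statistic:

from scipy.stats import chi2

alpha = 0.05
df = 1                                    # (rows - 1) * (cols - 1) for a 2x2 table
critical_value = chi2.ppf(1 - alpha, df)  # approx. 3.841
computed_stat = 5.2                       # hypothetical chi-square statistic
print(computed_stat > critical_value)     # True -> reject H0, variables dependent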
Fisher’s Exact Test: Used when sample sizes are smaller (n ≤ 20 or when 20 < n ≤ 40 and any
expected frequency is < 5). This test is more accurate when the sample size is too small for
the χ2 test to be reliable.
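Both tests are available in scipy; a minimal sketch on an invented 2x2 table:

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: rows = groups, columns = outcome yes/no.
table = np.array([[12, 5],
                  [6, 9]])

# Chi-square test of independence; the returned expected frequencies
# let you check the sample-size rules above.
chi2_stat, p_chi2, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2_stat:.3f}, p = {p_chi2:.3f}")
print("expected frequencies:", expected)

# Fisher's Exact test, preferred when n is small or any expected
# frequency is below 5.
odds_ratio, p_fisher = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_fisher:.3f}")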
Measures of Association
Relative Risk (RR) and Odds Ratio (OR)
Relative Risk (RR): This tells you how much more likely something is to happen in one group
compared to another. For example, if the relative risk of developing a disease among
smokers compared to non-smokers is 2, it means smokers are twice as likely to develop the
disease compared to non-smokers.
Odds Ratio (OR): This compares the odds of something happening in one group to the odds
of it happening in another group. It's commonly used in case-control studies. An odds ratio
of 1 means there's no difference in odds between the groups. If it's greater than 1, it means
the event is more likely to happen in the first group, and if it's less than 1, it's more likely to
happen in the second group.
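Both measures follow directly from the four cells of a 2x2 table; a minimal sketch with the conventional cell labels a, b, c, d and invented counts:

# Hypothetical 2x2 table:
#                 disease   no disease
#   exposed          a=20         b=80
#   non-exposed      c=10         d=90
a, b, c, d = 20, 80, 10, 90

risk_exposed = a / (a + b)          # P(disease | exposed) = 0.20
risk_unexposed = c / (c + d)        # P(disease | non-exposed) = 0.10
rr = risk_exposed / risk_unexposed  # relative risk = 2.0
odds_ratio = (a / b) / (c / d)      # odds ratio = ad/bc = 2.25

print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")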
Terminology to remember
Nominal Data: Data categorized without an inherent order; distinct categories based on
names or labels only.
Ordinal Data: Data arranged in a specific sequence where the order is meaningful, but
differences between values are not consistent.
Contingency Table: A matrix format that displays the frequency distribution of variables to
study the dependency between them.
Dependency: A situation in statistical terms where the occurrence of one event is
influenced by the presence of another.
Chi-square Test: A statistical method used to assess whether observed frequencies in a
contingency table differ significantly from expected frequencies.
Fisher Test: Also known as Fisher’s Exact Test, used for exact hypothesis testing on
categorical data, especially useful with small sample sizes.
Observed Frequency: The actual number of observations in each category or cell of a
contingency table.
Expected Frequency: The theoretical frequency calculated under the null hypothesis of
independence in a contingency table.
Relative Risk (RR): Measures the ratio of the probability of an event occurring in an exposed
group compared to a non-exposed group.
PRE Measure: Proportional reduction in error; indicates how much better one can predict
the dependent variable by knowing the independent variable.
Chi-square Critical Value: The threshold value against which the calculated chi-square
statistic is compared to decide whether to reject the null hypothesis.
Alpha Level (α): The threshold of probability at which you reject the null hypothesis;
commonly set at 0.05 or 5%.
Independence in Statistics: The scenario where the occurrence of one event does not affect
the probability of occurrence of another event.
Power of the Test: The likelihood that a test will correctly reject the null hypothesis when it
is false, i.e., it will detect an effect if there is one.
Coefficient of Association: A measure used to quantify the strength of the relationship between two nominal variables (nominal categories have no order, so there is no direction to measure).
Fisher’s Exact Probability Test: A statistical significance test used in the analysis of 2x2
contingency tables.
Null Hypothesis (H0): A general statement or default position that there is no relationship
between two measured phenomena.
Alternative Hypothesis (H1): The hypothesis that sample observations are influenced by
some non-random cause.
P-value: The probability of observing test results at least as extreme as the results actually
observed, under the assumption that the null hypothesis is correct.
Cramer’s V: A measure of association between two nominal variables, giving a value
between 0 and 1 where 0 indicates no association.
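Cramér's V follows directly from the chi-square statistic via V = sqrt(chi2 / (n * (min(r, c) - 1))); a minimal sketch on an invented 3x2 table:

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],
                  [20, 20],
                  [10, 30]])  # hypothetical 3x2 contingency table

chi2_stat, p, dof, expected = chi2_contingency(table)
n = table.sum()
r, c = table.shape
v = np.sqrt(chi2_stat / (n * (min(r, c) - 1)))
print(f"Cramer's V = {v:.3f}")  # 0 = no association, 1 = perfect association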
Presentation #2
Simple Linear Regression Analysis
Linear regression models the relationship between two variables where one variable
(dependent, Y) is considered a function of the other variable (independent, X).
The equation for simple linear regression is typically expressed as Y = a + bX, where:
• a (intercept) is the constant term.
• b (regression coefficient) represents the change in the dependent variable for a one-unit change in the independent variable.
Parameter Estimation:
The parameters are estimated from a random sample: just as counting how often a value occurs in a sample predicts its proportion in the whole population, the sampled (X, Y) pairs are used to estimate a and b for the population.
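A minimal sketch of the estimation itself, using the ordinary least squares formulas on an invented sample:

import numpy as np

# Hypothetical sample of (X, Y) pairs.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates: b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X).
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
print(f"Y = {a:.3f} + {b:.3f} * X")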
Correlation analysis
Coefficient of Correlation (r):
Measures the strength and direction of the linear relationship between two variables.
Values range from -1 to +1:
• +1 indicates a perfect positive linear relationship,
• -1 indicates a perfect negative linear relationship,
• 0 indicates no linear relationship.
Coefficient of Determination (r2):
Represents the proportion of the variance in the dependent variable that is predictable from
the independent variable.
Measures how well the variability of one variable predicts the other variable.
In essence, the t-test looks at each individual predictor's significance, while the F-test looks
at the overall usefulness of the entire model in predicting the outcome.
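In scipy, r, r2, and the slope's t-test p-value come from a single call; a minimal sketch on invented data (with a single predictor the F-test and the slope's t-test are equivalent, F = t2, so one p-value covers both):

import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

res = linregress(x, y)
print(f"r = {res.rvalue:.3f}")        # coefficient of correlation
print(f"r2 = {res.rvalue ** 2:.3f}")  # coefficient of determination
print(f"p = {res.pvalue:.4f}")        # t-test for the slope b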
Terminology to remember
Linearity:
The relationship between the independent and dependent variable must be linear.
Verification is through visual inspection of data plots or statistical tests.
Homoscedasticity:
The assumption that the variance of the errors is constant across all values of the independent variable.
Ordinary Least Squares (OLS) - Method for estimating the parameters in a regression model by finding the intercept and slope of the line that minimize the sum of squared residuals.
Intercept (a) - The expected mean value of Y when X = 0.
Regression Coefficient (b) - Represents the change in the dependent variable for a one-unit
change in the independent variable.
Coefficient of Correlation (r) - Measures the strength and direction of a linear relationship
between two variables.
Coefficient of Determination (r2) - Proportion of the variance in the dependent variable that
is predictable from the independent variable.
Testing in Regression - Processes like t-tests and F-tests to assess the significance of
regression models and their coefficients.
Presentation #3
Time Series Fundamentals
A time series is a sequence of data points recorded at consistent time intervals. This data
can be analyzed from several perspectives:
Fixed Moment and Interval: Data recorded at specific points in time (e.g., a balance on the last day of the month) versus over whole intervals (e.g., monthly sales).
Periodicity: Divided into short-term and long-term series, reflecting the length of time over
which data are collected.
Variable Types: Original variables (raw data) and derived variables (calculated or processed
data).
Unit of Measure: Natural variables (raw units like liters, meters) and financial variables
(monetary units).
Exponential Smoothing
Adaptive Models: These models adjust the parameters over time and do not assume
stability of the trend. Exponential smoothing is a key example where recent observations
are given more weight, decreasing exponentially into the past.
Types of Exponential Smoothing:
Simple Exponential Smoothing: Used when data has no trend or seasonality.
Double and Triple Exponential Smoothing: Handle data with trends (linear or quadratic).
Holt and Winters Methods: For data with trends and seasonal patterns.
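A minimal sketch of the simplest case, with an invented series and an assumed smoothing constant alpha = 0.3:

def simple_exponential_smoothing(series, alpha):
    # Each smoothed value mixes the newest observation with the previous
    # smoothed value, so older observations fade exponentially.
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

data = [10, 12, 11, 13, 12, 14, 13]  # hypothetical series, no trend/seasonality
print(simple_exponential_smoothing(data, alpha=0.3))

The one-step-ahead forecast under this model is simply the last smoothed value.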
Analysis of Seasonality
Seasonal Indices: Measures that quantify the seasonal pattern within a time series, allowing
adjustments to forecasts to account for seasonality.
Calculation of Seasonal Indices: Involves smoothing data points, averaging them over
periods, and comparing these averages to the overall trend to determine seasonal effects.
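A minimal sketch of that calculation using the ratio-to-moving-average approach (three years of invented quarterly data):

import numpy as np

sales = np.array([20, 35, 50, 25, 24, 42, 60, 30, 28, 49, 70, 35], dtype=float)
period = 4

# Step 1: a centered moving average smooths the seasonality out (the trend).
ma = np.convolve(sales, np.ones(period) / period, mode="valid")
trend = (ma[:-1] + ma[1:]) / 2
obs = sales[period // 2 : period // 2 + len(trend)]  # align data with trend

# Step 2: the observation-to-trend ratio isolates the seasonal effect.
ratios = obs / trend

# Step 3: average the ratios quarter by quarter -> seasonal indices.
quarter_of = [(i + period // 2) % period for i in range(len(ratios))]
indices = [np.mean([r for r, q in zip(ratios, quarter_of) if q == qq])
           for qq in range(period)]
print(indices)  # an index above 1 marks a quarter that runs above trend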
Terminology to remember
Time Series: Data points indexed in time order.
OLS: Method for estimating the unknown parameters in a linear regression model.
Trend: The underlying long-term pattern in a time series, excluding irregular effects.
Stationarity: A statistical characteristic of a time series whose mean and variance are
constant over time.
Seasonality: Repeating patterns or cycles of behavior over time.
Seasonal Index: A measure used to adjust predictions based on seasonal variations.
Forecast: A calculation or estimate of future events.
Exponential Smoothing: A rule of weighted moving averages where weights decrease
exponentially.
Accuracy: The closeness of a measured or calculated value to its true value.
Smoothing Constant (α): The weighting applied to the most recent period's value in
exponential smoothing.
Presentation #4
Introduction to Index Numbers
Index numbers are statistical measures designed for comparing quantities over different
periods or different entities. They are primarily used to measure changes in economic data
such as price levels, quantities, or other financial indicators over time.
Types of Indices
Individual Indices:
• Simple: Measures a single variable.
• Composite: Combines multiple simple indices.
Aggregate Indices:
Combine data from non-homogenous variables to create a unified index.
Decomposition of Indices
Indices can be decomposed to understand the influence of different components like price
and quantity on the overall index value. This can be done using:
Index of Constant Composition (ICC): Holds one element constant to measure the effect of
the other.
Index of Structure (ISTR): Measures the structural changes.
Practical Computations
Various exercises are provided to compute changes in:
• Employee numbers and wages in industries and branches.
• Quantity sold and earnings for specific products.
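A minimal sketch of the three weighted aggregate indices named in the terminology below, on invented prices p and quantities q (subscript 0 = base period, 1 = current period):

import numpy as np

p0 = np.array([10.0, 4.0, 2.5])  # base-period prices of three products
p1 = np.array([12.0, 4.5, 2.0])  # current-period prices
q0 = np.array([100, 200, 400])   # base-period quantities
q1 = np.array([90, 250, 420])    # current-period quantities

laspeyres = (p1 @ q0) / (p0 @ q0)      # base-period quantities as weights
paasche = (p1 @ q1) / (p0 @ q1)        # current-period quantities as weights
fisher = np.sqrt(laspeyres * paasche)  # geometric mean of the two

print(f"Laspeyres = {laspeyres:.3f}")
print(f"Paasche = {paasche:.3f}")
print(f"Fisher = {fisher:.3f}")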
Terminology to remember
Base Period: The time against which all comparisons are made.
Current Period: The time period under analysis for changes.
Simple Index: Measures one variable directly.
Composite Index: Combines several simple indices.
Aggregate Index: Combines different data types into a unified measure.
Laspeyres Index: Uses base period quantities as weights.
Paasche Index: Uses current period quantities for weighting.
Fisher Index: Geometric mean of Laspeyres and Paasche indices.
Extensity: A volume-type variable measuring the extent of activity (e.g., quantity sold, number of employees).
Intensity: A rate-type variable measuring the level of activity per unit (e.g., price, average wage).
Decomposition: Breaking down an index into components to analyze effects separately.