Pointers to Review Statistics
1. Descriptive and Inferential Statistics
Descriptive Statistics:
Descriptive statistics involves summarizing and organizing data in a meaningful way. These
statistics describe what is observed and help present data without drawing conclusions or
making predictions beyond the data itself.
Measures of Central Tendency:
o Mean: The average of the data set.
o Median: The middle value when the data is arranged in order.
o Mode: The most frequent value in the data set.
Measures of Dispersion:
o Range: The difference between the highest and lowest values.
o Variance: The average squared deviation from the mean.
o Standard Deviation: The square root of the variance, indicating how spread out
the data is around the mean.
Data Distribution:
o Normal Distribution: A bell-shaped curve that is symmetrical.
o Skewness: The degree to which data is asymmetrical.
o Kurtosis: The heaviness of the tails of the distribution relative to a normal
distribution (often loosely described as the sharpness of the peak).
Inferential Statistics:
Inferential statistics allows us to make predictions or inferences about a population based on a
sample of data. These techniques help assess whether a result is statistically significant or
could plausibly be due to random chance.
Hypothesis Testing: Testing an assumption regarding a population parameter.
o Null Hypothesis (H0): Assumes no effect or difference.
o Alternative Hypothesis (H1): Assumes there is an effect or difference.
Confidence Intervals: A range of values, stated with a confidence level (e.g., 95%), used to
estimate a population parameter.
P-value: The probability of obtaining test results at least as extreme as the observed
results under the assumption that the null hypothesis is correct. If the p-value is below a
threshold (usually 0.05), the null hypothesis is rejected.
Common Inferential Tests:
o t-test: Compares the means of two groups.
o Chi-square test: Tests relationships between categorical variables.
o ANOVA (Analysis of Variance): Compares means among three or more groups.
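To make the p-value definition concrete, here is a minimal Python sketch of an exact permutation test for a difference in means between two small groups. The scores and the function name `permutation_p_value` are hypothetical, and this is only one simple way to obtain a p-value; it is not the t-test itself.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Under H0 the group labels are interchangeable, so enumerate every
    # way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Count splits at least as extreme as the observed difference
        # (small tolerance guards against floating-point rounding).
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical scores for two groups of four.
group_a = [1, 2, 3, 4]
group_b = [5, 6, 7, 8]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

With these values, only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 (about 0.029). That is below the usual 0.05 threshold, so H0 would be rejected.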
2. Variables and Types of Data
Variables:
A variable is any characteristic, number, or quantity that can be measured or counted. Variables
can change or vary across individuals or over time.
Independent Variable (IV): The variable that is manipulated to observe its effect on the
dependent variable.
Dependent Variable (DV): The outcome variable that is being studied and measured.
Types of Data:
1. Quantitative Data (Numerical):
o Discrete Data: Countable data (e.g., number of students in a class).
o Continuous Data: Measurable data that can take any value within a range (e.g.,
height, weight).
2. Qualitative Data (Categorical):
o Nominal Data: Data that represents categories without a natural order (e.g.,
gender, eye color).
o Ordinal Data: Data that represents categories with a meaningful order but no
fixed interval between categories (e.g., rankings, educational levels).
3. Data Collection and Sampling Techniques
Data Collection Methods:
Surveys/Questionnaires: A set of questions administered to collect data from
respondents.
Observational Studies: Collecting data by observing subjects in a natural or controlled
setting.
Experiments: Conducting controlled trials where one or more variables are manipulated
to determine their effect.
Sampling Techniques:
1. Probability Sampling: Every member of the population has a known, non-zero chance of
being selected.
o Simple Random Sampling: Every individual has an equal chance of being
chosen.
o Stratified Sampling: The population is divided into strata (groups), and random
samples are drawn from each group, usually in proportion to its size.
o Cluster Sampling: The population is divided into clusters, and a random sample
of clusters is chosen.
o Systematic Sampling: Every nth individual is selected from a list of the
population, starting from a randomly chosen point.
2. Non-Probability Sampling: Selection is not random, so the chance of any given member
being selected is unknown.
o Convenience Sampling: Sampling those who are easily accessible.
o Judgmental/Purposive Sampling: Selecting a sample based on the researcher’s
judgment.
o Snowball Sampling: Current participants recruit future participants.
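The four probability sampling techniques above can be sketched with Python's standard `random` module. The population of 100 ID numbers, the two strata, the 10% sampling fraction, and the fixed seed are all assumptions made for illustration.

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 IDs
rng = random.Random(42)           # fixed seed so the sketch is repeatable

# Simple random sampling: every individual has an equal chance.
simple_sample = rng.sample(population, 10)

# Systematic sampling: every k-th individual after a random start.
k = len(population) // 10
start = rng.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: sample each stratum in proportion to its size.
strata = {
    "A": population[:60],  # hypothetical stratum with 60 members
    "B": population[60:],  # hypothetical stratum with 40 members
}
fraction = 0.1
stratified_sample = {
    name: rng.sample(members, round(len(members) * fraction))
    for name, members in strata.items()
}

# Cluster sampling: split into clusters, then pick whole clusters at random.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [member for cluster in chosen_clusters for member in cluster]

print(simple_sample)
print(systematic_sample)
print(stratified_sample)
print(cluster_sample)
```

Non-probability methods such as convenience or snowball sampling depend on who is reachable or on the researcher's judgment, so they are not simulated here.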
1. Organizing Data
Organizing data involves arranging raw data in a structured format to facilitate analysis and
interpretation. There are several methods to organize data:
Frequency Distribution Table:
A table that displays the frequency (number of occurrences) of each value or range of
values in a data set.
Class Intervals: In the case of continuous data, values are grouped into ranges or
intervals (e.g., 10-19, 20-29).
Frequency: The number of observations that fall within each interval.
Steps to Create a Frequency Distribution Table:
1. Decide on the number of classes (usually between 5 and 10).
2. Determine the class width by dividing the range (maximum value – minimum value) by
the number of classes and rounding up to a convenient number.
3. List the class intervals and count the number of data points in each interval.
Cumulative Frequency:
A running total of frequencies as you progress through the class intervals.
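The steps above, together with the cumulative frequency, can be sketched in Python as follows. The helper name `frequency_table` and the scores are hypothetical, and the sketch assumes whole-number data with inclusive class limits such as 10-19.

```python
def frequency_table(data, num_classes):
    """Return rows of (lower, upper, frequency, cumulative frequency)."""
    lowest, highest = min(data), max(data)
    # Whole-number class width just above range / num_classes, so the
    # largest value always falls inside the last class.
    width = (highest - lowest) // num_classes + 1

    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = lowest + k * width
        upper = lower + width - 1  # inclusive upper limit, e.g. 10-19
        frequency = sum(lower <= x <= upper for x in data)
        cumulative += frequency    # running total of frequencies
        rows.append((lower, upper, frequency, cumulative))
    return rows


# Hypothetical exam scores.
scores = [10, 12, 15, 21, 22, 25, 28, 33, 37, 41, 44, 45, 52, 58, 59]
table = frequency_table(scores, 5)
for lower, upper, frequency, cumulative in table:
    print(f"{lower}-{upper}: frequency={frequency}, cumulative={cumulative}")
```

The last cumulative frequency should always equal the total number of observations, which is a quick check that no value was left out of the table.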
2. Histograms and Frequency Polygons
Histograms:
A graphical representation of the frequency distribution of a data set, typically used for
continuous data.
Characteristics:
o Bars are used to represent frequency.
o The x-axis represents class intervals (ranges of data).
o The y-axis represents the frequency (number of occurrences).
o There are no gaps between bars, as the data is continuous.
Steps to Create a Histogram:
1. Organize the data into class intervals.
2. Draw the x-axis (class intervals) and y-axis (frequency).
3. Draw bars for each class interval with heights corresponding to the frequency.
Frequency Polygons:
A frequency polygon is similar to a histogram but uses points connected by straight lines
to represent frequency.
The midpoints of the class intervals are plotted on the x-axis, and the frequencies are
plotted on the y-axis.
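As a rough illustration, the sketch below prints a text histogram and computes the points of a frequency polygon from the same grouped data. The class intervals and frequencies are hypothetical, and the bars are drawn sideways because the output is plain text.

```python
# Hypothetical class intervals (inclusive limits) and their frequencies.
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
frequencies = [3, 4, 2, 3, 3]

# Histogram: one bar per class interval, bar length equal to the frequency.
histogram_lines = [
    f"{lower}-{upper} | " + "#" * freq
    for (lower, upper), freq in zip(classes, frequencies)
]
print("\n".join(histogram_lines))

# Frequency polygon: plot (class midpoint, frequency) and join the points
# with straight lines.
polygon_points = [
    ((lower + upper) / 2, freq)
    for (lower, upper), freq in zip(classes, frequencies)
]
print(polygon_points)
```

Both graphs use the same frequencies; the polygon simply replaces each bar with a point at the class midpoint.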
1. Measures of Central Tendency
Measures of central tendency describe the center or average of a data set. The three most
common measures are mean, median, and mode.
Mean:
The arithmetic average of a data set.
Formula: $\text{Mean} = \frac{\sum x_i}{n}$, where $x_i$ are the data values and $n$ is the
number of observations.
Example: For the data set [2, 4, 6, 8, 10], $\text{Mean} = \frac{2 + 4 + 6 + 8 + 10}{5} = 6$.
Median:
The middle value when data is arranged in ascending or descending order.
If the number of observations is odd, the median is the middle value. If even, it is the
average of the two middle values.
Example: For [1, 3, 7, 9, 11], the median is 7.
For [2, 4, 6, 8], the median is $\frac{4 + 6}{2} = 5$.
Mode:
The value that appears most frequently in the data set.
A data set may have no mode, one mode (unimodal), or more than one mode (bimodal or
multimodal).
Example: For [3, 7, 3, 9, 3], the mode is 3.
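The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the example data sets from this section; the bimodal list at the end is an added, hypothetical example.

```python
from statistics import mean, median, multimode

mean_value = mean([2, 4, 6, 8, 10])      # arithmetic average
median_odd = median([1, 3, 7, 9, 11])    # odd count: the middle value
median_even = median([2, 4, 6, 8])       # even count: average of the two middle values
modes = multimode([3, 7, 3, 9, 3])       # list of the most frequent value(s)
two_modes = multimode([1, 1, 2, 2, 3])   # hypothetical bimodal data: two values tie

print(mean_value, median_odd, median_even, modes, two_modes)
```

`multimode` returns a list so that bimodal and multimodal data sets are reported in full rather than reduced to a single value.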
2. Measures of Dispersion
Measures of dispersion describe how spread out or dispersed the data values are from the central
point.
Range:
The difference between the highest and lowest values in the data set.
Formula: $\text{Range} = \text{Maximum value} - \text{Minimum value}$
Example: For [2, 4, 6, 8, 10], $\text{Range} = 10 - 2 = 8$.
Variance:
Measures the average squared deviation of each data point from the mean.
Formula for population variance ($\sigma^2$):
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Where $x_i$ is each data point, $\mu$ is the population mean, and $N$ is the total
number of data points.
Formula for sample variance ($s^2$):
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
Where $\bar{x}$ is the sample mean, and $n$ is the sample size.
Standard Deviation:
The square root of the variance, indicating how much the values deviate from the mean
on average.
Formula: $\text{Standard Deviation} = \sqrt{\text{Variance}}$
Example: For the data set [2, 4, 4, 4, 5, 5, 7, 9], the mean is 5, and the sample standard
deviation is calculated as:
$s = \sqrt{\frac{(2-5)^2 + (4-5)^2 + \cdots + (9-5)^2}{8-1}} = \sqrt{4.57} \approx 2.14$
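The variance and standard deviation formulas can be verified step by step in Python. This is a minimal sketch using the same data set as the example; the variable names are illustrative, and the standard `statistics` functions are included only as a cross-check.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = mean(data)

# Squared deviation of each data point from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Population variance divides by N; sample variance divides by n - 1.
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation is the square root of the variance.
population_sd = population_variance ** 0.5
sample_sd = sample_variance ** 0.5

print(population_variance, population_sd)
print(sample_variance, sample_sd)

# Cross-check against the standard library.
print(pvariance(data), pstdev(data), variance(data), stdev(data))
```

For this data the squared deviations sum to 32, so the population variance is 32/8 = 4 (standard deviation 2), while the sample variance is 32/7 ≈ 4.57 (standard deviation ≈ 2.14).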
Interquartile Range (IQR):
The range of the middle 50% of the data.
Formula: $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the first quartile (25th percentile) and
$Q_3$ is the third quartile (75th percentile).
Example: For [1, 3, 5, 7, 9, 11, 13], $Q_1 = 3$ and $Q_3 = 11$, so
$\text{IQR} = 11 - 3 = 8$.
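The IQR example can be reproduced with a short Python sketch. It assumes the "median of each half" convention for quartiles, which is what the example values imply; the helper name `quartiles_by_halves` is hypothetical, and the data must contain at least two values.

```python
from statistics import median


def quartiles_by_halves(data):
    """Q1, Q2, Q3 as medians of the lower half, all data, and upper half."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the median position
    upper = ordered[-half:]  # values above the median position
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles_by_halves([1, 3, 5, 7, 9, 11, 13])
iqr = q3 - q1
print(q1, q3, iqr)
```

When the number of observations is odd, this convention leaves the median itself out of both halves.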
3. Measures of Position
Measures of position indicate where a particular value lies in relation to other values in a data
set. Common measures of position include percentiles, quartiles, and z-scores.
Percentiles:
Percentiles divide the data set into 100 equal parts. The p-th percentile represents the
value below which $p$% of the data falls.
Example: If a student scores in the 90th percentile on an exam, they scored higher than
90% of the other students.
Quartiles:
Quartiles divide the data into four equal parts.
o First Quartile (Q1): The 25th percentile.
o Second Quartile (Q2): The 50th percentile (same as the median).
o Third Quartile (Q3): The 75th percentile.
Example: For [1, 2, 3, 4, 5, 6, 7, 8, 9], $Q_1 = 2.5$, $Q_2 = 5$, and $Q_3 = 7.5$. Here each
quartile is taken as the median of the half of the data below or above the overall median;
other conventions can give slightly different values.
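Quartile values depend on the convention used, which a short Python sketch can show. It relies on `statistics.quantiles` from the standard library, whose default "exclusive" method reproduces the example values here, while the "inclusive" method gives a different Q1 and Q3 for the same data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
exclusive = quantiles(data, n=4, method="exclusive")  # the default method
inclusive = quantiles(data, n=4, method="inclusive")

print(exclusive)
print(inclusive)
```

Both results are legitimate; when reporting quartiles or percentiles it helps to state which method was used.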
Z-scores:
A z-score (or standard score) indicates how many standard deviations a data point is from
the mean.
Formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the data point, $\mu$ is the mean, and
$\sigma$ is the standard deviation.
Example: If the mean test score is 80 with a standard deviation of 10, a student who
scored 90 has a z-score of $z = \frac{90 - 80}{10} = 1$. This means the student's score is
1 standard deviation above the mean.
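The z-score formula translates directly into code. The sketch below defines a hypothetical helper `z_score`, applies it to the example above, and then standardizes a small made-up list of scores using the population standard deviation.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Example from the text: mean 80, standard deviation 10, score 90.
z = z_score(90, 80, 10)
print(z)

# Hypothetical scores standardized against their own mean and spread.
scores = [70, 80, 90]
mu = mean(scores)
sigma = pstdev(scores)
z_scores = [z_score(x, mu, sigma) for x in scores]
print(z_scores)
```

A positive z-score lies above the mean, a negative one below it, and a value equal to the mean has a z-score of 0.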
Summary:
Measures of Central Tendency: Mean, median, and mode describe the center of a data
set.
Measures of Dispersion: Range, variance, standard deviation, and interquartile range show
how spread out the data is.
Measures of Position: Percentiles, quartiles, and z-scores help identify where a
particular data value stands relative to the rest of the data.