Data Analysis Notes
• Simple linear regression model: Y = β₀ + β₁X + ε
➢ β₀ (Intercept): Value of Y when X = 0
➢ β₁ (Slope): Change in Y for each one-unit change in X
➢ ε (Residual error): Difference between the actual and predicted values of Y
Example: Predicting a car's price based on mileage.
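A minimal sketch of fitting this simple regression with statsmodels, following the car example. The mileage and price values below are made up purely for illustration, not real data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical car data: mileage (km) and price (values are illustrative only).
mileage = np.array([20000, 45000, 60000, 80000, 120000], dtype=float)
price = np.array([18000, 15500, 14000, 12500, 9000], dtype=float)

X = sm.add_constant(mileage)      # adds the intercept term (beta_0)
model = sm.OLS(price, X).fit()    # ordinary least squares fit

print(model.params)               # [beta_0, beta_1]
print(model.resid)                # epsilon: actual minus predicted prices
```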
Hierarchical Regression
What is it?
• Tests models in steps, adding variables progressively.
• Helps assess whether new variables improve predictions.
Example scenario:
• Model 1: Salary = β₀ + β₁ (Experience)
• Model 2: Salary = β₀ + β₁ (Experience) + β₂ (Education)
• Model 3: Salary = β₀ + β₁ (Experience) + β₂ (Education) + β₃ (Age)
How to evaluate? Compare R² across models and check whether the change in R² (ΔR²) at each step is meaningful; see the sketch below.
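A small sketch of the three-step salary example using the statsmodels formula API. The data frame and its column names (salary, experience, education, age) are assumptions invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical salary data; values and column names are assumptions for illustration.
df = pd.DataFrame({
    "salary":     [42000, 55000, 61000, 48000, 70000, 75000, 52000, 66000],
    "experience": [2, 6, 8, 4, 12, 15, 5, 10],
    "education":  [12, 16, 16, 14, 18, 18, 14, 16],
    "age":        [25, 31, 35, 28, 42, 47, 30, 38],
})

# Enter predictors block by block, as in the hierarchical scenario above.
m1 = smf.ols("salary ~ experience", data=df).fit()
m2 = smf.ols("salary ~ experience + education", data=df).fit()
m3 = smf.ols("salary ~ experience + education + age", data=df).fit()

for name, m in [("Model 1", m1), ("Model 2", m2), ("Model 3", m3)]:
    print(name, "R² =", round(m.rsquared, 3))
print("ΔR² (Model 2 vs Model 1):", round(m2.rsquared - m1.rsquared, 3))
print("ΔR² (Model 3 vs Model 2):", round(m3.rsquared - m2.rsquared, 3))
```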
PART 2
These are different types of sampling methods used in research and data collection.
Sampling methods can be broadly categorized into probability sampling (where every
individual has a known, non-zero chance of being selected) and non-probability sampling
(where selection is not random, but based on convenience, judgement, or other specific
criteria). A short pandas sketch of a few of these methods follows the list below.
1. Simple Random Sampling (Probability Sampling)
● Every member of the population has an equal chance of being selected.
● Example: Assigning numbers to students in a school and using a random number
generator to pick participants.
2. Systematic Random Sampling (Probability Sampling)
● Selecting every nth person from a list after choosing a random starting point.
● Example: Surveying every 5th customer who enters a store.
3. Stratified Sampling (Probability Sampling)
● The population is divided into subgroups (strata) based on characteristics (e.g., age,
gender), and participants are randomly chosen from each group.
● Example: If a school has 60% girls and 40% boys, a stratified sample would maintain
this ratio.
4. Cluster Sampling (Probability Sampling)
● The population is divided into clusters, and entire clusters are randomly selected
instead of individuals.
● Example: Instead of selecting students across all schools, you randomly pick entire
schools and survey all students within them.
5. Convenience Sampling (Non-Probability Sampling)
● Choosing participants who are easily accessible.
● Example: Surveying students in a nearby café instead of the whole university.
6. Self-Selection Sampling (Non-Probability Sampling)
● Participants volunteer themselves for the study.
● Example: An online survey where people choose to participate.
7. Purposive Sampling (Non-Probability Sampling)
● Researchers select individuals based on specific characteristics that are relevant to
the study.
● Example: Interviewing only experts in a medical study on heart disease.
8. Snowball Sampling (Non-Probability Sampling)
● Existing participants recruit others, creating a chain-referral effect.
● Example: A study on drug users in which one participant refers others to the researcher.
9. Quota Sampling (Non-Probability Sampling)
● Researchers select a specific quota of people from different groups.
● Example: A market research study might ensure that 50% of respondents are men
and 50% are women, even if they are not randomly selected.
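A minimal pandas sketch of three of the probability methods above (simple random, systematic, and stratified sampling). The student roster, its size, and the gender column are assumptions made up for illustration, echoing the 60%/40% school example.

```python
import pandas as pd

# Hypothetical roster of 100 students; 60% girls, 40% boys as in the example above.
students = pd.DataFrame({
    "student_id": range(1, 101),
    "gender": ["F"] * 60 + ["M"] * 40,
})

# Simple random sampling: every student has an equal chance of selection.
simple = students.sample(n=10, random_state=42)

# Systematic sampling: every 5th student after a random starting point.
start = 3
systematic = students.iloc[start::5]

# Stratified sampling: sample within each gender group to keep the 60/40 ratio.
stratified = students.groupby("gender", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)

print(len(simple), len(systematic), len(stratified))
```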
VARIABLES (DATA)
• QUALITATIVE/CATEGORICAL: Nominal (e.g., profession, gender...); Ordinal (e.g.,
education level, social class...).
• QUANTITATIVE/NUMERICAL: Discrete (e.g., number of children, number of system
accesses...); Continuous (e.g., height, weight, salary...).
• INDEPENDENT (X/Cause): Condition or cause leading to a particular effect or
consequence.
• DEPENDENT (Y/Effect): Consequence or response to a given stimulus.
• MISSING VALUES: Absent data caused by a lack of responses in a questionnaire or
missing information in a database.
🔹 Quick difference between the two normality tests:
S-W (Shapiro-Wilk): best suited to testing normality in small/medium samples.
K-S (Kolmogorov-Smirnov): best suited to comparing a sample against a reference distribution, typically with large samples.
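A small scipy sketch of the two tests on the same sample. The simulated height data and its parameters are assumptions chosen only to make the example runnable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=50)   # hypothetical height data

# Shapiro-Wilk: tests normality, usually preferred for small/medium samples.
sw_stat, sw_p = stats.shapiro(sample)

# Kolmogorov-Smirnov: compares the sample against a reference distribution
# (here a normal with the sample's own mean and SD), more common for large samples.
ks_stat, ks_p = stats.kstest(sample, "norm", args=(sample.mean(), sample.std()))

print(f"Shapiro-Wilk p = {sw_p:.3f}, K-S p = {ks_p:.3f}")
```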
Task: Outliers
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least one
numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Were any outliers identified in the graphs and statistics?
Do the outliers significantly influence the mean or data dispersion?
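A minimal sketch of flagging outliers with the 1.5 × IQR rule and checking their effect on the mean. The income values are invented for illustration; they do not come from the Spinner Regression Dataset.

```python
import pandas as pd

# Hypothetical income data; 9800 is a deliberate outlier.
income = pd.Series([2100, 2300, 2250, 2400, 2200, 2500, 9800])

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print("Outliers:", outliers.tolist())
print("Mean with outlier:", round(income.mean(), 1),
      "| without:", round(income.drop(outliers.index).mean(), 1))
```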
Task: Normality
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
What do the results of statistical tests (e.g., Shapiro-Wilk, Kolmogorov-Smirnov) indicate
about normality?
• If residuals are not normally distributed, what potential issues could this cause in the
analysis?
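A minimal sketch of checking the normality of regression residuals (rather than the raw variables), using simulated Age and Income data as a stand-in for the dataset named above.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
age = rng.uniform(20, 60, 100)                        # hypothetical predictor
income = 800 + 45 * age + rng.normal(0, 150, 100)     # hypothetical response

model = sm.OLS(income, sm.add_constant(age)).fit()

# Test the residuals, not the raw variables, for normality.
stat, p = stats.shapiro(model.resid)
print(f"Shapiro-Wilk on residuals: p = {p:.3f}")

# A Q-Q plot gives a visual check of the same assumption.
sm.qqplot(model.resid, line="s")
plt.show()
```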
Task: Homoscedasticity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Do the residual plots show a constant spread of residuals across all levels of the
independent variable(s)?
• How might heteroscedasticity impact the reliability of hypothesis tests and confidence
intervals?
• Does heteroscedasticity suggest the presence of omitted variables or model
misspecification?
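A minimal sketch of checking homoscedasticity with the Breusch-Pagan test on simulated data whose error spread deliberately grows with the predictor; the variables are assumptions for illustration only.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
age = rng.uniform(20, 60, 200)
income = 800 + 45 * age + rng.normal(0, 5 * age, 200)   # error spread grows with age

X = sm.add_constant(age)
model = sm.OLS(income, X).fit()

# Breusch-Pagan tests whether the residual variance depends on the predictors.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p = {lm_p:.4f}")   # a small p-value suggests heteroscedasticity
```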
Task: Multicollinearity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
What are the Variance Inflation Factor (VIF) values for each predictor variable?
• Are any VIF values above the commonly accepted thresholds (e.g., greater than 5 or 10),
indicating severe multicollinearity?
• Did the Tolerance values confirm the presence of multicollinearity (i.e., values close to 0)?
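A minimal sketch of computing VIF and Tolerance (1/VIF) for each predictor with statsmodels. The data frame and its columns (age, experience, education) are invented; age and experience are made strongly correlated on purpose.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; 'age' and 'experience' are deliberately correlated.
df = pd.DataFrame({
    "age":        [25, 30, 35, 40, 45, 50, 55, 60],
    "experience": [3,  8, 12, 17, 22, 27, 32, 37],
    "education":  [12, 16, 14, 16, 18, 12, 16, 18],
})

X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{col}: VIF = {vif:.2f}, Tolerance = {1 / vif:.3f}")
```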
Task: Linearity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Do the scatter plots or residual plots show a linear relationship between the independent and
dependent variables?
• Are there influential outliers that might be distorting the perceived linear relationship?
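A minimal sketch of the two plots the reflection questions refer to: a scatter plot of the predictor against the outcome and a residuals-versus-fitted plot. The simulated Age/Income relationship is deliberately non-linear, so the curvature shows up in both plots; all values are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
age = rng.uniform(20, 60, 150)
income = 500 + 2 * age**2 + rng.normal(0, 300, 150)   # deliberately non-linear

model = sm.OLS(income, sm.add_constant(age)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(age, income, s=10)                         # raw relationship
ax1.set(xlabel="Age", ylabel="Income", title="Scatter plot")
ax2.scatter(model.fittedvalues, model.resid, s=10)     # curvature here signals non-linearity
ax2.axhline(0, color="grey")
ax2.set(xlabel="Fitted values", ylabel="Residuals", title="Residual plot")
plt.tight_layout()
plt.show()
```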