Data Analysis Notes

MO1. Parametric and Non-Parametric Tests

1.1 General Concepts


• Parametric Tests: Statistical methods that assume a specific data distribution (usually
normal).
• Non-Parametric Tests: Tests that do not assume a specific distribution, offering more
flexibility.
• Hypothesis Testing: The process of testing an assumption (null hypothesis H₀) based on
sample data.
• p-value: The probability of obtaining a result at least as extreme as the one observed, assuming H₀ is true.
• Significance Level (α): The probability threshold to reject H₀ (commonly 0.05).

1.2 One-Sample Tests


• One-Sample t-Test: Compares the sample mean to a known population value.
• Wilcoxon Signed-Rank Test: Non-parametric alternative to compare the sample median to
a specific value.
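
A minimal sketch of both one-sample tests in Python using scipy; the height data and the reference value of 176 cm are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: heights (cm) of 12 people; 176 cm is an illustrative reference value.
heights = np.array([168, 172, 165, 180, 175, 169, 171, 174, 167, 178, 170, 173])

# One-sample t-test: is the sample mean different from 176?
t_stat, p_t = stats.ttest_1samp(heights, popmean=176)

# Wilcoxon signed-rank test: is the sample median different from 176?
w_stat, p_w = stats.wilcoxon(heights - 176)

print(f"t-test: t = {t_stat:.2f}, p = {p_t:.3f}")
print(f"Wilcoxon: W = {w_stat:.1f}, p = {p_w:.3f}")
```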

1.3 Comparing Two or More Groups


• Independent Samples t-Test: Compares means between two independent groups.
• Paired Samples t-Test: Compares means of two related datasets (e.g., before-and-after studies).
• Mann-Whitney U Test: Non-parametric alternative to the t-test for independent samples.
• ANOVA (Analysis of Variance): Tests differences across three or more groups.
• Kruskal-Wallis Test: Non-parametric version of ANOVA.
• Post-hoc Tests: Tests conducted after ANOVA to identify specific group differences.
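
A hedged sketch of these group-comparison tests using scipy; the three groups are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical scores for three independent groups.
group_a = rng.normal(70, 10, 30)
group_b = rng.normal(75, 10, 30)
group_c = rng.normal(72, 10, 30)

# Two independent groups: parametric and non-parametric versions.
t_stat, p_t = stats.ttest_ind(group_a, group_b)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)

# Paired samples (e.g., before-and-after measurements on the same subjects).
before, after = group_a, group_a + rng.normal(2, 3, 30)
t_pair, p_pair = stats.ttest_rel(before, after)

# Three or more groups: one-way ANOVA and its non-parametric counterpart.
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

print(p_t, p_u, p_pair, p_anova, p_kw)
```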

MO2. Multiple Linear Regression

2.1 Core Concepts


• Linear Regression: Technique to model the relationship between a dependent variable and
one or more independent variables.
• Multiple Linear Regression (MLR): An extension of linear regression with multiple
predictors.
• Dependent Variable (Y): The variable being predicted or explained.
• Independent Variables (X): Variables used to explain or influence Y.
• Regression Coefficients (β): Represent the impact of each independent variable on Y.
• Intercept (β₀): The expected value of Y when all predictors are zero.
• R-squared (R²): Measures the proportion of variance in Y explained by the model.
• Adjusted R-squared: Adjusted version of R² for multiple predictors.
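
A minimal sketch of fitting a multiple linear regression with statsmodels; the variable names (study_hours, motivation, score) and the simulated data are illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
# Hypothetical data: exam score explained by study hours and motivation.
df = pd.DataFrame({
    "study_hours": rng.uniform(0, 20, n),
    "motivation": rng.uniform(1, 10, n),
})
df["score"] = 40 + 2.0 * df["study_hours"] + 1.5 * df["motivation"] + rng.normal(0, 5, n)

X = sm.add_constant(df[["study_hours", "motivation"]])  # add_constant adds the intercept β₀
model = sm.OLS(df["score"], X).fit()

print(model.params)                          # β₀ (const) and a β for each predictor
print(model.rsquared, model.rsquared_adj)    # R² and adjusted R²
```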

2.2 Quantitative and Qualitative Information


• Dummy Variables: Categorical variables converted to binary for regression (0 or 1).
• Multicollinearity: When independent variables are highly correlated, affecting coefficient
precision.
• Homoscedasticity: Assumption of constant variance in model errors.
• Residual Analysis: Evaluating errors to verify model assumptions.
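
A short sketch of turning a categorical predictor into dummy variables before fitting the regression, using pandas and statsmodels; the sector and salary variables are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 90
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "sector": rng.choice(["public", "private", "ngo"], n),  # categorical predictor
})
df["salary"] = 1500 + 80 * df["experience"] + rng.normal(0, 200, n)

# Convert the categorical variable into 0/1 dummy variables, dropping one
# category so the dummies are not perfectly collinear with the intercept.
dummies = pd.get_dummies(df["sector"], prefix="sector", drop_first=True, dtype=float)
X = sm.add_constant(pd.concat([df[["experience"]], dummies], axis=1))

model = sm.OLS(df["salary"], X).fit()
print(model.params)  # each dummy coefficient is interpreted relative to the dropped category
```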

MO3. Exploratory Factor Analysis (EFA)

3.1 Theory and Application


• Factor Analysis: Technique to identify latent factors that explain correlations among
observed variables.
• Exploratory Factor Analysis (EFA): Used to uncover the underlying structure of data without
prior hypotheses.
• Factors: Latent variables not directly observed.
• Factor Loadings: Correlations between observed variables and latent factors.
• Communality: Proportion of a variable's variance explained by the factors.

3.2 Discovering Factors


• Eigenvalues: Indicate the importance of each factor. Factors with eigenvalues > 1 are often
retained.
• Scree Plot: Graph used to determine the number of factors to retain.
• Rotation (Varimax, Promax): Techniques to simplify factor interpretation.
• Kaiser-Meyer-Olkin (KMO) Test: Measures sampling adequacy for factor analysis.
• Bartlett’s Test of Sphericity: Tests if variable correlations are significant enough for EFA.
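
A sketch of an EFA workflow in Python. It assumes the third-party factor_analyzer package (not mentioned in the notes; just one common option), and the six questionnaire items are simulated from two latent factors:

```python
import numpy as np
import pandas as pd
# factor_analyzer is a third-party package (pip install factor_analyzer).
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(2)
n = 300
# Hypothetical questionnaire: six observed items driven by two latent factors.
f1, f2 = rng.normal(size=n), rng.normal(size=n)
df = pd.DataFrame({
    "item1": f1 + rng.normal(0, 0.5, n), "item2": f1 + rng.normal(0, 0.5, n),
    "item3": f1 + rng.normal(0, 0.5, n), "item4": f2 + rng.normal(0, 0.5, n),
    "item5": f2 + rng.normal(0, 0.5, n), "item6": f2 + rng.normal(0, 0.5, n),
})

# Adequacy checks before extracting factors.
chi2, p_bartlett = calculate_bartlett_sphericity(df)   # want p < 0.05
kmo_per_item, kmo_total = calculate_kmo(df)            # want overall KMO above roughly 0.6

fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(df)
eigenvalues, _ = fa.get_eigenvalues()    # eigenvalues > 1 suggest how many factors to retain
print(eigenvalues)
print(fa.loadings_)                      # factor loadings
print(fa.get_communalities())            # communalities
```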

MO4. Reliability Analysis

4.1 Measures of Reliability


• Reliability: Degree of consistency or stability of a measure over time.
• Internal Consistency: Measures if items within a test assess the same construct.
• Cronbach's Alpha: Coefficient assessing internal consistency; values above 0.7 are
acceptable.
• Split-Half Reliability: Divides the test into two parts to evaluate correlation.
• Test-Retest Reliability: Assesses measurement stability over time.
• Inter-Rater Reliability: Measures agreement between different evaluators.
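
A minimal sketch of computing Cronbach's alpha directly from its formula with pandas and numpy; the four-item scale is simulated for illustration:

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(3)
true_score = rng.normal(size=200)
# Hypothetical 4-item scale: each item is the respondent's true score plus noise.
scale = pd.DataFrame({f"q{i}": true_score + rng.normal(0, 0.6, 200) for i in range(1, 5)})

print(cronbach_alpha(scale))  # values above 0.7 are usually considered acceptable
```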

MO5. Cluster Analysis

5.1 Basic Concepts


• Cluster Analysis: Technique for grouping objects based on similarity, without dependent
variables.
• Clusters: Groups of objects more similar to each other than to those in other groups.
• Distance Metrics: Measures to calculate similarity between objects (e.g., Euclidean,
Manhattan distance).
• Dendrogram: Hierarchical graphical representation of clusters.

5.2 Clustering Methods and Algorithms

• Hierarchical Clustering: Builds a hierarchy of clusters using agglomerative (bottom-up) or
divisive (top-down) methods.
• K-Means Clustering: Partitions data into k predefined clusters.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points based on data density, identifying noise points and outliers.
• Silhouette Score: Metric evaluating the quality of formed clusters.
• Elbow Method: Technique to determine the optimal number of clusters in K-means.
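
A short sketch of K-means with the elbow method and silhouette score using scikit-learn; the 2-D data are generated artificially:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Artificial 2-D data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Elbow method: inspect inertia (within-cluster sum of squares) for several k;
# the silhouette score gives a complementary quality measure.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# Fit the chosen solution (k = 3 here) and read off the cluster assignments.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])
```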

Introduction to Simple, Multiple, and Hierarchical Linear Regression

Understand the basic concepts of simple, multiple, and hierarchical linear regression.
Linear Regression
• A statistical technique used to model the relationship between a dependent variable (Y)
and one or more independent variables (X).
Applications:
• Predicting house prices
• Analysing academic performance
• Economic impact studies

Simple Linear Regression


• Formula: Y = β₀ + β₁X + ε
• Explanation:
➢​ β₀ (Intercept): Value of Y when X = 0
➢​ β₁ (Slope): Change in Y for each unit change in X
➢​ ε (Residual error): Difference between actual and predicted values
Example: Predicting a car's price based on mileage.
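
A minimal sketch of this example using scipy.stats.linregress; the mileage and price values are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data: car price (thousands) as a function of mileage (thousands of km).
mileage = np.array([10, 30, 50, 70, 90, 110, 130, 150])
price = np.array([28, 25, 23, 20, 18, 15, 14, 11])

result = stats.linregress(mileage, price)
print(result.intercept)    # β₀: predicted price at zero mileage
print(result.slope)        # β₁: change in price for each extra unit of mileage
print(result.rvalue ** 2)  # R²
```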

Example of Simple Linear Regression


A single predictor (X) in a linear regression model, shown as a scatter plot with a best-fit regression line. The closer the points lie to the line, the better the fit.

Multiple Linear Regression


• Formula: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Key difference from simple regression:


Uses multiple independent variables to improve predictions.
• Example: Predicting academic performance (Y) based on:
➢ Study hours (X₁)
➢ Teacher quality (X₂)
➢ Student motivation (X₃)

Example of Multiple Linear Regression


Multiple Linear Regression Model with a 3D scatter plot and a best-fit regression plane.

Hierarchical Regression
What is it?
• Tests models in steps, adding variables progressively.
• Helps assess whether new variables improve predictions.
Example scenario:
• Model 1: Salary = β₀ + β₁ (Experience)
• Model 2: Salary = β₀ + β₁ (Experience) + β₂ (Education)
• Model 3: Salary = β₀ + β₁ (Experience) + β₂ (Education) + β₃ (Age)
How to evaluate? Compare R² across models; the change in R² (ΔR²) at each step shows whether the added variables improve prediction.
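
A hedged sketch of hierarchical regression with statsmodels: three nested models are fitted, their R² values compared, and each step tested with an F-test (compare_f_test). The salary data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 120
# Hypothetical predictors of salary.
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "education": rng.integers(9, 22, n).astype(float),
    "age": rng.integers(20, 65, n).astype(float),
})
df["salary"] = 1000 + 60 * df["experience"] + 90 * df["education"] + rng.normal(0, 300, n)

y = df["salary"]
m1 = sm.OLS(y, sm.add_constant(df[["experience"]])).fit()
m2 = sm.OLS(y, sm.add_constant(df[["experience", "education"]])).fit()
m3 = sm.OLS(y, sm.add_constant(df[["experience", "education", "age"]])).fit()

print(m1.rsquared, m2.rsquared, m3.rsquared)  # compare R² at each step
print(m2.compare_f_test(m1))                  # F-test: does Model 2 improve on Model 1?
print(m3.compare_f_test(m2))                  # F-test: does Model 3 improve on Model 2?
```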

Example of Hierarchical Regression


Hierarchical Regression, showing the stepwise addition of predictors and their contribution to
explaining variance (R²).

Identify and interpret dependent and independent variables.


Dependent Variable (Y) → What we aim to predict
Independent Variable (X) → Factors influencing Y
Examples:
• Y: Monthly salary | X: Years of experience
• Y: Student grades | X: Study hours

Learn the key assumptions of multiple linear regression.


• Linearity → Linear relationship between X and Y
• Independence of errors → Durbin-Watson test
• Normality of residuals → Shapiro-Wilk test
• Homoscedasticity → Residual plots
• No multicollinearity → Variance Inflation Factor (VIF)

PART 2

Regression with Quantitative Data


• Modelling Continuous Variables
• Interpretation of Coefficients and Model Adjustments

• POPULATION, SAMPLING, AND SAMPLE


• Population: The set of people to whom we aim to generalize the conclusions of the analysis, if possible.
• Sampling: The process of defining the sample.
• Sample: A subgroup of the population selected for analysis.

These are different types of sampling methods used in research and data collection. Sampling methods can be broadly categorized into probability sampling (where every member of the population has a known, non-zero chance of being selected) and non-probability sampling (where selection is based on specific criteria or convenience). A short code sketch after the list illustrates three of these methods.
1. Simple Random Sampling (Probability Sampling)
●​ Every member of the population has an equal chance of being selected.
●​ Example: Assigning numbers to students in a school and using a random number
generator to pick participants.
2. Systematic Random Sampling (Probability Sampling)
●​ Selecting every nth person from a list after choosing a random starting point.
●​ Example: Surveying every 5th customer who enters a store.
3. Stratified Sampling (Probability Sampling)
●​ The population is divided into subgroups (strata) based on characteristics (e.g., age,
gender), and participants are randomly chosen from each group.
●​ Example: If a school has 60% girls and 40% boys, a stratified sample would maintain
this ratio.
4. Cluster Sampling (Probability Sampling)
●​ The population is divided into clusters, and entire clusters are randomly selected
instead of individuals.
●​ Example: Instead of selecting students across all schools, you randomly pick entire
schools and survey all students within them.
5. Convenience Sampling (Non-Probability Sampling)
●​ Choosing participants who are easily accessible.
●​ Example: Surveying students in a nearby café instead of the whole university.
6. Self-Selection Sampling (Non-Probability Sampling)
●​ Participants volunteer themselves for the study.
●​ Example: An online survey where people choose to participate.
7. Purposive Sampling (Non-Probability Sampling)
●​ Researchers select individuals based on specific characteristics that are relevant to
the study.
●​ Example: Interviewing only experts in a medical study on heart disease.
8. Snowball Sampling (Non-Probability Sampling)
●​ Existing participants recruit others, creating a chain-referral effect.
●​ Example: A study on drug users where one participant refers to others.
9. Quota Sampling (Non-Probability Sampling)
●​ Researchers select a specific quota of people from different groups.
●​ Example: A market research study might ensure that 50% of respondents are men
and 50% are women, even if they are not randomly selected.
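
A minimal sketch of methods 1 to 3 (simple random, systematic, and stratified sampling) with pandas; the student population is invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical sampling frame: 1000 students, 60% girls and 40% boys.
population = pd.DataFrame({
    "student_id": range(1000),
    "gender": rng.choice(["girl", "boy"], size=1000, p=[0.6, 0.4]),
})

# 1. Simple random sampling: every student has the same chance of selection.
simple = population.sample(n=100, random_state=0)

# 2. Systematic sampling: every 10th student after a random starting point.
start = int(rng.integers(0, 10))
systematic = population.iloc[start::10]

# 3. Stratified sampling: sample 10% within each gender, preserving the 60/40 ratio.
stratified = population.groupby("gender").sample(frac=0.1, random_state=0)

print(len(simple), len(systematic))
print(stratified["gender"].value_counts(normalize=True))
```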

VARIABLES (DATA)
• QUALITATIVE/CATEGORICAL: Nominal (e.g., profession, gender...); Ordinal (e.g.,
education level, social class...).
• QUANTITATIVE/NUMERICAL: Discrete (e.g., number of children, number of system
accesses...); Continuous (e.g., height, weight, salary...).
• INDEPENDENT (X/Cause): Condition or cause leading to a particular effect or
consequence.
• DEPENDENT (Y/Effect): Consequence or response to a given stimulus.
• MISSING VALUES: Absent data caused by a lack of responses in a questionnaire or
missing information in a database.

Assumptions of Multiple Regression


Linearity → Linear relationship between X and Y
Independence of errors → Durbin-Watson test
Normality of residuals → Shapiro-Wilk test
Homoscedasticity → Residual plots
No multicollinearity → Variance Inflation Factor (VIF)

A minimum of 20 subjects per independent variable (X).

• Histogram of Standardized Residuals (normal distribution)


• a. If standardized residuals fall outside the range -3 to +3, outliers are present.
• Goal: Absence of Outliers.
• Normality of residuals
• Shapiro-Wilk test (S-W)
➔​ Purpose: Checks if data follows a normal distribution.
➔ Best for: Small to medium sample sizes (roughly n ≤ 2000; most informative for small samples).
➔​ Interpretation:
1.​ p > 0.05 → Data is normal
2.​ p < 0.05 → Data is not normal
➔​ Limitation: Very sensitive to small deviations.
• Kolmogorov-Smirnov (K-S)
➔​ Purpose: Compares data to a reference distribution (e.g., normal, uniform).
➔​ Best for: Large sample sizes.
➔​ Interpretation:
1.​ p > 0.05 → Data matches the reference distribution
2.​ p < 0.05 → Data does not match
➔ Limitation: Less powerful than S-W for detecting departures from normality.

🔹 Quick Difference:
S-W: Best for normality testing (small/medium samples).
K-S: Best for comparing distributions (large samples).
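
A minimal sketch of both normality tests with scipy; the residuals are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.normal(0, 1, 200)   # hypothetical regression residuals

# Shapiro-Wilk: the usual choice for small/medium samples.
w_stat, p_sw = stats.shapiro(residuals)

# Kolmogorov-Smirnov against a normal distribution with the residuals' own mean
# and standard deviation (estimating the parameters from the data makes the
# standard K-S p-value approximate; the Lilliefors variant corrects for this).
d_stat, p_ks = stats.kstest(residuals, "norm", args=(residuals.mean(), residuals.std(ddof=1)))

print(f"Shapiro-Wilk p = {p_sw:.3f}, K-S p = {p_ks:.3f}")  # p > 0.05 → no evidence against normality
```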

Task: Outliers
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least one
numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Were any outliers identified in the graphs and statistics?
Do the outliers significantly influence the mean or data dispersion?
Task: Normality
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
What do the results of statistical tests (e.g., Shapiro-Wilk, Kolmogorov-Smirnov) indicate
about normality?
• If residuals are not normally distributed, what potential issues could this cause in the
analysis?

• Diagnosis of Homoscedasticity (constant variance of residuals across the range of predicted values)
a. If the residual plot shows a roughly rectangular, patternless band, it indicates homoscedasticity; a conical (funnel-shaped) pattern indicates heteroscedasticity.
• Goal: Presence of Homoscedasticity
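
A short sketch of the residuals-vs-fitted plot used for this diagnosis, with statsmodels and matplotlib; the data are simulated and roughly homoscedastic:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)   # hypothetical, roughly homoscedastic data

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: a random horizontal band suggests homoscedasticity,
# while a conical (funnel-shaped) spread suggests heteroscedasticity.
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```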

Task: Homoscedasticity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Do the residual plots show a constant spread of residuals across all levels of the
independent variable(s)?
• How might heteroscedasticity impact the reliability of hypothesis tests and confidence
intervals?
• Does heteroscedasticity suggest the presence of omitted variables or model
misspecification?

• Independence of errors → Durbin-Watson test (Independence Diagnosis - Uncorrelated Residuals), assessed together with the plot of Predicted vs. Observed values.
• a. A Durbin-Watson statistic close to 2 is ideal; values between 1.5 and 2.5 are acceptable.
• Goal: Absence of Autocorrelation among Residuals.
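
A minimal sketch of computing the Durbin-Watson statistic on regression residuals with statsmodels; the data are simulated with independent errors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 150)
y = 1 + 2 * x + rng.normal(0, 1, 150)   # hypothetical data with independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))  # close to 2 → no autocorrelation; 1.5 to 2.5 is usually acceptable
```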

Task: Independence of errors


• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Did the Durbin-Watson test indicate significant autocorrelation in the residuals?
• Do the residual plots show any patterns (e.g., trends or cycles) suggesting
non-independence?
• If autocorrelation is present, what could be causing it (e.g., omitted variables, time
dependence)?

• Analysis of Collinearity and Multicollinearity


• a. If the tolerance value is > 0.1, there is no multicollinearity.
• b. If the VIF value is < 10, there is also no multicollinearity.
• Goal: Absence of Multicollinearity
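
A short sketch of computing VIF and tolerance for each predictor with statsmodels; the predictors are simulated, with x2 deliberately correlated with x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
# Hypothetical predictors; x2 is deliberately correlated with x1.
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["x2"] = 0.8 * df["x1"] + rng.normal(0, 0.6, n)
df["x3"] = rng.normal(size=n)

X = sm.add_constant(df)
# VIF is computed per column of the design matrix; skip the constant (column 0).
for i, name in enumerate(X.columns[1:], start=1):
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```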

Task: Multicollinearity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
What are the Variance Inflation Factor (VIF) values for each predictor variable?
• Are any VIF values above the commonly accepted threshold (e.g., greater than 5 or 10), indicating severe multicollinearity?
• Did the Tolerance values confirm the presence of multicollinearity (i.e., values close to 0)?

• Linearity → Linear relationship between X and Y


• Linearity is confirmed if the points in the residuals plot do not show a systematic pattern
(such as curves or shapes).
• Goal: Presence of Linearity

Task: Linearity
• Load an existing dataset (Spinner Regression Dataset) or create a new one with at least
one numerical variable.
Examples of variables: Age, Income, Height.
• Questions for Reflection
Do the scatter plots or residual plots show a linear relationship between the independent and
dependent variables?
• Are there influential outliers that might be distorting the perceived linear relationship?
