Slides
Regression
Qiyu Wang 2025-03-05
• Introduction to Simple Regression
• Basic Concepts and Formulas
• Data Preparation & Analysis
Simple regression can be classified into two main types based on the nature of the relationship
between the variables:
Linear Regression: This type of regression assumes a linear relationship between the predictor
and the response variables, i.e., a straight line can be fitted to the data points. It is the most
commonly used type of simple regression.
Non-linear Regression: In this type of regression, the relationship between the predictor and the
response variables is not linear. Various forms of non-linear functions can be used to describe the
relationship, such as polynomial, exponential, or logarithmic functions.
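As a rough illustration of the two types, the functional forms named above might be written as follows. The symbols (\beta_0, \beta_1, \beta_2, a, b) are notation assumed here, not taken from the slides, and error terms are omitted for brevity.

```latex
\begin{align*}
\text{Linear:}                & \quad y = \beta_0 + \beta_1 x \\
\text{Polynomial (degree 2):} & \quad y = \beta_0 + \beta_1 x + \beta_2 x^2 \\
\text{Exponential:}           & \quad y = a\, e^{b x} \\
\text{Logarithmic:}           & \quad y = a + b \ln x
\end{align*}
```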
Applications in Various Fields
Social Sciences
Behavioral Analysis: Simple regression can help researchers understand the relationship between
different behaviors, such as the impact of education on income or the relationship between age and
voting patterns.
Survey Research: In survey research, simple regression can be used to analyze the relationship
between survey responses and various demographic variables, such as age, gender, or income.
Science and Engineering
Experimental Data Analysis: Scientists and engineers often use simple regression to analyze
experimental data and understand the relationship between different variables, such as the effect of
temperature on chemical reactions or the relationship between pressure and volume in a gas.
Quality Control: Simple regression can be used in quality control processes to identify relationships
between different variables and to predict the quality of products based on certain characteristics.
02 Basic Concepts and Formulas
Independent and Dependent Variables
01 Independent Variable
A variable whose variation does not depend on the variation
of another variable; it is the presumed cause.
02 Dependent Variable
A variable whose variation is dependent on the variation of
another variable; it is the presumed effect.
03 Examples
In a study on the relationship between hours of study and
test scores, hours of study would be the independent variable
and test scores would be the dependent variable.
Scatter Plot Diagram
Scatter Plot
A graph that displays the relationship between
two variables; each data point is plotted as a
dot on the graph.
Trend Line
A line that summarizes the direction of the
relationship between two variables; it is often
used to identify patterns in scatter plots.
Positive/Negative Correlation
A positive correlation indicates that as one
variable increases, the other variable also
increases; a negative correlation indicates that
as one variable increases, the other variable
decreases.
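The direction of a correlation can be checked numerically with the Pearson correlation coefficient. The sketch below is one possible way to compute it with plain Python; the data and the names `pearson_r`, `hours`, `scores`, and `fatigue` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]   # rises with hours: positive correlation
fatigue = [9, 7, 6, 4, 2]       # falls with hours: negative correlation

r_pos = pearson_r(hours, scores)
r_neg = pearson_r(hours, fatigue)
print(f"hours vs scores:  r = {r_pos:.3f}")
print(f"hours vs fatigue: r = {r_neg:.3f}")
```

A value near +1 suggests a strong positive linear relationship, and a value near -1 a strong negative one.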
Linear Equation & Regression Line
Linear Equation: An equation that represents a straight line; in the context of regression, it is
used to model the relationship between two variables.
Regression Line: A line that is calculated to best fit the data points in a scatter plot; it is used
to predict the value of the dependent variable based on the value of the independent variable.
Slope: The measure of the steepness of the regression line; it indicates the change in the
dependent variable for a one-unit change in the independent variable.
Intercept: The point where the regression line crosses the Y-axis; it represents the value of the
dependent variable when the independent variable is zero.
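The slope and intercept of the regression line are commonly estimated by ordinary least squares. A minimal sketch, assuming the usual formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x); the function name `fit_line` and the study-hours data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours of study (independent) and test scores (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = fit_line(hours, scores)
predicted_score = intercept + slope * 6  # prediction for 6 hours of study
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_score:.2f}")
```

For this made-up data the slope is 5.6, meaning each extra hour of study is associated with about 5.6 more points on average.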
Residual Analysis
01 Residual
The difference between the observed value of the dependent variable and the value predicted by
the regression line.
02 Residual Plot
A graph that shows the residuals on the vertical axis and the independent variable on the
horizontal axis; it is used to check for patterns or outliers in the data.
03 Patterns in Residuals
If the residuals show a pattern, it suggests that the relationship between the variables is not
linear; this may indicate that a more complex model is needed.
04 Outliers
Data points that lie far away from the regression line; they can have a large impact on the slope
and intercept of the regression line and should be carefully examined.
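Residuals can be computed directly once a line has been fitted. The sketch below uses hypothetical data and a hypothetical helper `fit_line`; it lists each residual and picks out the point that lies farthest from the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = fit_line(hours, scores)

# Residual = observed value minus value predicted by the regression line.
residuals = [y - (intercept + slope * x) for x, y in zip(hours, scores)]

# Index of the observation farthest from the line (a candidate for closer inspection).
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))

for x, e in zip(hours, residuals):
    print(f"x={x}: residual={e:+.2f}")
print(f"largest residual at x={hours[largest]}")
```

With a least squares fit that includes an intercept, the residuals sum to zero, so it is their pattern against x, not their total, that carries the diagnostic information.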
03 Data Preparation & Analysis
Data Collection Methods
Experimental data
Controlled experiments can provide high-quality data for regression
analysis, allowing for the manipulation of independent variables.
Observational data
Observational data, collected in natural or real-world settings without manipulating any variable,
can provide insights into relationships between variables.
Data Cleaning & Processing Techniques
Data normalization
Data deduplication
Visual inspection
Outliers can be identified through visual inspection of scatterplots or other
graphical representations of the data.
Statistical methods
Statistical methods, such as the Z-score or IQR method, can be used to detect
outliers.
Treatment options
Outliers can be removed, transformed, or kept in the dataset depending on their
nature and impact on the regression model.
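The Z-score and IQR methods mentioned above can be sketched with Python's standard `statistics` module. The data, the function names, and the thresholds (2.5 for the Z-score, 1.5 times the IQR) are assumptions chosen for this example; a Z-score cutoff of 3 is also common, but in a sample of only 10 values no point can reach it.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 40]

z_flagged = zscore_outliers(data)
iqr_flagged = iqr_outliers(data)
print("Z-score method:", z_flagged)
print("IQR method:    ", iqr_flagged)
```

Both methods only flag candidates; whether a flagged point is removed, transformed, or kept remains a judgment call.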
Data Transformation Strategies
Linearizing transformation
Techniques such as log transformation or square root transformation can help to linearize
relationships between variables.
Polynomial transformation
When the relationship between variables is non-linear, polynomial transformation can help to
model the curvature.
Dummy variables
For categorical variables, dummy variables can be created to represent the different
categories in the regression model.
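As an example of how a log transformation can linearize a relationship, the sketch below generates exponential data y = a * exp(b * x), takes the logarithm of y, and fits a straight line to the result. The values a = 2 and b = 0.5 and the helper `fit_line` are assumptions made for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical noise-free exponential data: y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# ln(y) = ln(a) + b * x is linear in x, so a straight-line fit applies.
log_ys = [math.log(y) for y in ys]
b, log_a = fit_line(xs, log_ys)
a = math.exp(log_a)

print(f"estimated a = {a:.3f}, estimated b = {b:.3f}")
```

Because the data here contain no noise, the fit recovers a and b essentially exactly; with real data the estimates would only be approximate.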
04 Model Building & Evaluation Metrics
Simple Linear Regression Model Building
01 R-squared
Measuring the proportion of the variance in the dependent variable that is predictable from the
independent variable.
02 Residual Analysis
Examining the residuals (differences between observed and predicted values) to check for
patterns or non-linearity.
03 Normality of Residuals
Checking if the residuals follow a normal distribution.
04 Visual Inspection
Using plots to visually assess the fit of the model.
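The proportion of variance explained by the model is usually reported as R-squared. A minimal sketch, assuming the definition R² = 1 - SS_res / SS_tot; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """R^2 = 1 - SS_res / SS_tot for a simple linear regression."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r2 = r_squared(hours, scores)
print(f"R-squared = {r2:.3f}")
```

For this made-up data R-squared is 0.98, so the fitted line accounts for about 98% of the variation in the scores.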
Evaluation Metrics: RMSE, MAE, etc.
03 Mean Squared Error (MSE)
Similar to RMSE, but without taking the square root.
04 Accuracy
Measuring the proportion of predictions that are
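The error metrics named on this slide can be computed from paired observed and predicted values. The sketch below assumes the standard definitions of MSE, RMSE, and MAE; the two lists are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed values and model predictions.
actual = [52, 58, 61, 70, 74]
predicted = [51.8, 57.4, 63.0, 68.6, 74.2]

mse_value = mse(actual, predicted)
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"MSE={mse_value:.3f}, RMSE={rmse_value:.3f}, MAE={mae_value:.3f}")
```

RMSE and MAE are in the same units as the dependent variable, while MSE is in squared units; RMSE penalizes large errors more heavily than MAE.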
Slope coefficient
Represents the average change in the dependent variable for a
one-unit increase in the independent variable.
Intercept coefficient
Represents the value of the dependent variable when the
independent variable is zero.
Confidence intervals
Ranges of values that are likely to contain the true parameter values.
Hypothesis testing
A statistical method used to test the validity of a claim or hypothesis about a population
parameter.
P-value
The probability of observing a test statistic at least as extreme as
the one observed, given that the null hypothesis is true.
Statistical significance
A measure of the evidence against the null hypothesis, usually expressed in terms
of a p-value.
Alpha level
The pre-determined threshold for statistical significance, typically set at 0.05.
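For the slope of a simple linear regression, the null hypothesis is often that the true slope is zero. The sketch below computes the t statistic and a confidence interval for the slope, then compares |t| with a critical value, which is equivalent to checking whether the p-value falls below alpha. The function name, the data, and the critical value 3.182 (two-sided, alpha = 0.05, 3 degrees of freedom, from a t table) are assumptions for this example; the standard library has no t distribution, so the exact p-value is not computed here.

```python
import math

def slope_inference(xs, ys, t_crit):
    """t statistic and confidence interval for the slope of y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_var = ss_res / (n - 2)            # residual variance, n - 2 degrees of freedom
    se_slope = math.sqrt(resid_var / sxx)   # standard error of the slope
    t_stat = slope / se_slope               # test of H0: slope = 0
    ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
    reject_null = abs(t_stat) > t_crit
    return slope, se_slope, t_stat, ci, reject_null

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, se_slope, t_stat, ci, reject_null = slope_inference(hours, scores, t_crit=3.182)
print(f"slope={slope:.2f}, SE={se_slope:.3f}, t={t_stat:.2f}")
print(f"95% CI for slope: ({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject_null}")
```

If the confidence interval excludes zero, the same data also lead to rejecting the null hypothesis at the corresponding alpha level.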
Prediction & Forecasting Methods
Point prediction: A single estimated value for the dependent variable.
Prediction interval: A range of values within which the dependent variable is expected to fall.
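Both kinds of prediction can be sketched for a new value of the independent variable. The example assumes the usual prediction interval for simple linear regression, y_hat ± t * s * sqrt(1 + 1/n + (x_new - mean_x)² / Sxx); the function name, the data, and the critical value 3.182 (two-sided, alpha = 0.05, 3 degrees of freedom) are hypothetical choices.

```python
import math

def predict_with_interval(xs, ys, x_new, t_crit):
    """Point prediction and prediction interval at x_new."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(ss_res / (n - 2))  # residual standard error
    point = intercept + slope * x_new
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    return point, (point - half_width, point + half_width)

# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

point, (lower, upper) = predict_with_interval(hours, scores, x_new=6, t_crit=3.182)
print(f"point prediction: {point:.1f}")
print(f"95% prediction interval: ({lower:.1f}, {upper:.1f})")
```

The interval widens as x_new moves away from the mean of the observed x values, which is one reason extrapolating far beyond the data is risky.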
Sales Forecasting
Estimating future sales based on historical data.
Employee Turnover
Identifying factors that lead to employee retention or
turnover.
Trend Analysis in Economics
Housing Prices
Healthcare
Predicting patient outcomes
based on various health
indicators.
Social Sciences
Examining relationships
between variables in social
science research.
07 Challenges & Limitations in Simple Regression
Multicollinearity Issue
Misleading results
If the true relationship between the variables is non-linear, the regression coefficients may
be misleading, leading to incorrect conclusions about the relationship between the variables.
Limited generalizability
If the assumptions are not met, the regression model may not be
generalizable to other populations or situations, limiting its
usefulness.
Thanks
Presenter: XXX 2025-03-05