Coding 2
Word Count:
Session: May 2025
Linear Regression Simulation
Introduction:
This project comes from my interest in understanding how data and mathematics work
together to explain patterns. Since learning about graphs and equations, I’ve been curious
about how we can use simple numbers to represent real-world situations. This study
explores how to create a small dataset (a group of points) and then use math, specifically
the least squares method, to find the straight line that best fits those points. This method
chooses the line that minimizes the sum of the squared vertical distances between the line
and the points, and the size of those distances shows how closely the data follows a trend.
By completing this experiment, I aim to better understand how we can use
math not just to solve equations, but to clearly represent and interpret data.
Research Question:
How does the best-fit line, calculated using linear regression, vary when applied to different
sets of data points with varying levels of noise, while keeping the number of points and the
overall trend constant?
Theoretical Framework
Linear regression is one of the most widely used methods in data analysis and statistics for identifying and
modeling the relationship between two quantitative variables. In its most basic form—simple linear
regression—this method seeks to find the best-fitting straight line through a set of data points plotted on a
coordinate plane. The line is meant to show how one variable (called the independent variable, usually
represented as x) influences another (the dependent variable, represented as y).
This method assumes that the relationship between the variables is approximately linear, meaning that
changes in x result in proportional changes in y. The key goal is to create a predictive model—an equation
that allows for estimating unknown y values based on given x values.
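A residual is the vertical distance between an observed y value and the value the line predicts for the same x.
By minimizing the sum of the squared residuals, the least squares method ensures that the resulting line is as
close as possible to all data points overall. The standard form of the line is:
y = mx + b
Where:
m is the slope, the change in y for each one-unit increase in x
b is the y-intercept, the value of y when x = 0
The slope captures the direction (positive or negative) and steepness of the linear relationship between the two
variables. How strong the relationship is depends on how tightly the points cluster around the line.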
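To make the least squares calculation concrete, here is a minimal sketch in plain Python. The function name fit_line and the sample points are my own choices for illustration. It uses the standard closed-form result for simple linear regression: the slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations, and the intercept follows from the fact that the line passes through the point of means.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations of x and y from their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Illustrative points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```

Because these sample points sit exactly on a line, the fit recovers that line. A library routine such as numpy.polyfit(xs, ys, 1) should return the same slope and intercept.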
Linear regression is used in nearly every field, including economics, physics, biology, the social sciences,
and machine learning, to discover trends and relationships and to predict future outcomes based on observed
data. In this
investigation, linear regression is applied to explore how well a best-fit line can represent different data sets
and how changes in the data (e.g., added noise or outliers) affect the accuracy and reliability of the model.
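To measure how well the regression line fits the data, we use the coefficient of determination, known as R².
For a least squares line, this value ranges from 0 to 1 and represents the proportion of the variance in the
dependent variable that is predictable from the independent variable. A value of 1 means every point lies
exactly on the line, while a value of 0 means the line explains none of the variation in y.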
In practice, R² values between 0.7 and 1 are often considered strong, though this depends on the context and
field of study. R² is a critical part of evaluating the reliability of any regression model.
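As a worked illustration of R², the sketch below uses the usual definition: R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. The function name r_squared and the five sample points are assumptions made for this example. The slope 0.6 and intercept 2.2 are the least squares values for these particular points, worked out by hand.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination for the line y = m*x + b."""
    mean_y = sum(ys) / len(ys)
    # Variation left unexplained by the line (squared residuals).
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Illustrative data; its least squares line is y = 0.6x + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys, m=0.6, b=2.2)
print(round(r2, 3))  # 0.6
```

Here SS_res is 2.4 and SS_tot is 6, so the line accounts for 60% of the variation in y.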
Graphical Interpretation
Plotting the data points alongside the regression line allows for a clear visual assessment of the model. When
the points are closely clustered around the line, this suggests a strong linear relationship. If the points are
widely scattered, the linear model may not be appropriate.
This visual aspect is also useful for identifying outliers, which are points that deviate significantly from the
pattern of the other data. Outliers can heavily influence the slope and intercept of the line and should be
carefully examined when interpreting results.
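The influence of a single outlier can be checked numerically. The sketch below is an illustration with made-up points, not data from this investigation: it fits five points that lie exactly on y = 2x + 1, then replaces the last y value with an outlier and fits again. The helper fit_line is a hypothetical name for the closed-form least squares calculation.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


xs = [1, 2, 3, 4, 5]
clean_ys = [3, 5, 7, 9, 11]    # exactly y = 2x + 1
outlier_ys = [3, 5, 7, 9, 21]  # last point moved 10 units upward

m_clean, b_clean = fit_line(xs, clean_ys)
m_out, b_out = fit_line(xs, outlier_ys)
print(m_clean, b_clean)  # 2.0 1.0
print(m_out, b_out)      # 4.0 -3.0
```

Changing one point out of five doubles the slope and moves the intercept from 1 to -3. The effect is large here partly because the outlier sits at the edge of the x range, where a point has the most leverage on the line.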
In this project, the data sets are created manually or generated through simulations. The regression analysis is
conducted using software tools such as Microsoft Excel, Desmos, or Python-based libraries like NumPy and
Matplotlib. These tools compute the line of best fit using least squares, provide the regression equation, and
automatically calculate the R² value.
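One possible way to set up the simulation is sketched below, using only the Python standard library. The underlying trend (y = 2x + 1), the 20 points, the three noise levels, and the function names are all assumptions chosen for illustration, not the settings used in this investigation. The number of points and the trend stay fixed while the standard deviation of the random noise changes, and a fixed seed makes each run repeatable.

```python
import random


def fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def r_squared(xs, ys, m, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def simulate(noise_sd, n_points=20, true_m=2.0, true_b=1.0, seed=0):
    """Generate noisy points around a fixed line and fit them."""
    rng = random.Random(seed)  # fixed seed so results are repeatable
    xs = [float(i) for i in range(n_points)]
    # Each y is the true line value plus normally distributed noise.
    ys = [true_m * x + true_b + rng.gauss(0, noise_sd) for x in xs]
    m, b = fit_line(xs, ys)
    return m, b, r_squared(xs, ys, m, b)


# Hypothetical noise levels: none, low, and high.
results = {sd: simulate(sd) for sd in (0.0, 1.0, 5.0)}
for sd, (m, b, r2) in results.items():
    print(f"noise sd = {sd}: m = {m:.3f}, b = {b:.3f}, R^2 = {r2:.3f}")
```

With no noise the fit should recover the underlying line and give an R² of 1. As the noise grows, the fitted slope and intercept are expected to drift away from the true values and R² to fall, which is the comparison the research question asks for.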
This approach offers a straightforward way to analyze patterns, test how noise or data changes affect the
model, and build a deeper understanding of how statistical tools can simplify complex data.