SSMDA Notes Unit 2
Unit - II
Statistical Modeling
Statistical modeling is the process of using statistical techniques to describe,
analyze, and make predictions about relationships and patterns within data. It
involves formulating mathematical models that represent the underlying structure
of the data and capture the relationships between variables. Statistical models are
used to test hypotheses, make predictions, and infer information about
populations based on sample data. Statistical modeling is widely employed across
various disciplines, including economics, finance, biology, sociology, and
engineering, to understand complex phenomena and inform decision-making.
Key Concepts:
Model Formulation:
The choice of model depends on the nature of the data, the research
question, and the assumptions underlying the modeling process.
Parameter Estimation:
Model Evaluation:
Model Selection:
Applications:
Example:
Once validated, the model can be used to predict treatment outcomes for new
patients and inform clinical decision-making.
watch: https://www.khanacademy.org/math/statistics-probability/analysis-of-variance-anova-library/analysis-of-variance-anova/v/anova-1-calculating-sst-total-sum-of-squares
https://youtu.be/0Vj2V2qRU10?si=1ZGk9n7xTUk9yE8t
Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) is a statistical method for comparing the means of two or more groups by partitioning the total variability in the data into between-group and within-group components.
Key Concepts:
Variability:
Hypothesis Testing:
ANOVA tests the null hypothesis that the means of all groups are equal
against the alternative hypothesis that at least one group mean is different.
The test statistic used in ANOVA is the F-statistic, which compares the
ratio of between-group variability to within-group variability.
Types of ANOVA
Assumptions:
ANOVA assumes that the data within each group are normally distributed,
the variances of the groups are homogeneous (equal), and the
observations are independent.
Example:
Using ANOVA
The researcher collects performance data from each group and conducts a
one-way ANOVA to compare the mean performance scores across the three
groups.
By using ANOVA, the researcher can determine whether there are significant
differences in performance outcomes among the training programs and make
informed decisions about which program is most effective for improving employee
performance.
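As a rough sketch of how this one-way comparison might be run in Python (the scores and group names below are invented purely for illustration, and SciPy is assumed to be available):

```python
# Hypothetical one-way ANOVA: compare mean performance across three training programs.
from scipy import stats

# Made-up performance scores for employees in each program
program_a = [78, 82, 85, 80, 77, 84]
program_b = [88, 90, 86, 92, 89, 87]
program_c = [75, 79, 74, 81, 78, 76]

# One-way ANOVA: H0 = all group means are equal
f_stat, p_value = stats.f_oneway(program_a, program_b, program_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A small p-value (e.g., < 0.05) suggests at least one program mean differs.
```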
1. Variability:
ANOVA breaks down the total variation in the data into two parts: variation between groups and variation within groups (the code sketch after this list works through this decomposition numerically).
It's like comparing how much people in different classes score on a test
compared to how much each person's score varies within their own class.
2. Hypothesis Testing:
It uses the F-statistic, which compares the variability between groups to the
variability within groups.
For instance, it's like seeing if there's a big difference in test scores between
classes compared to how much scores vary within each class.
3. Types of ANOVA
For example, it's like comparing test scores based on different teaching
methods (one-way) or considering both teaching method and study time (two-
way).
4. Assumptions:
ANOVA assumes data in each group are normally distributed, group variances
are equal, and observations are independent.
Imagine it as assuming each class's test scores follow a bell curve, have
similar spreads, and aren't influenced by other classes.
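As promised in point 1, here is a small Python fragment that computes the between-group and within-group sums of squares by hand and forms the F-statistic; the class scores are invented for illustration:

```python
import numpy as np

# Invented test scores for three classes
groups = [np.array([70, 72, 68, 75]),
          np.array([80, 78, 83, 79]),
          np.array([65, 67, 70, 66])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k = len(groups)               # number of groups
n_total = all_scores.size     # total number of observations

# Between-group (SSB) and within-group (SSW) sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = ((all_scores - grand_mean) ** 2).sum()

# Total variation splits into the two parts: SST = SSB + SSW
assert np.isclose(sst, ssb + ssw)

# F compares between-group to within-group variability
f_stat = (ssb / (k - 1)) / (ssw / (n_total - k))
print(f"SSB={ssb:.1f}, SSW={ssw:.1f}, F={f_stat:.2f}")
```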
Applications:
Example:
If the overall F-test is significant, post-hoc tests (e.g., Tukey's HSD) reveal which groups differ from each other.
Gauss-Markov Theorem
The Gauss-Markov theorem, also known as the Gauss-Markov linear model
theorem, is a fundamental result in the theory of linear regression analysis. It
provides conditions under which the ordinary least squares (OLS) estimator is the
best linear unbiased estimator (BLUE) of the coefficients in a linear regression
model. The theorem plays a crucial role in understanding the properties of OLS
estimation and the efficiency of estimators in the context of linear regression.
Key Concepts:
The OLS estimator provides estimates of the coefficients that best fit the
observed data points in a least squares sense.
Gauss-Markov Theorem:
The Gauss-Markov theorem states that under certain conditions, the OLS
estimator is the best linear unbiased estimator (BLUE) of the coefficients in
a linear regression model.
Specifically, if the errors (residuals) in the model have a mean of zero, are
uncorrelated, and have constant variance (homoscedasticity), then the
OLS estimator is unbiased and has minimum variance among all linear
unbiased estimators.
Additionally, the OLS estimator is efficient in the sense that it achieves the
smallest possible variance among all linear unbiased estimators, making it
the most precise estimator under the specified conditions.
Finance and Business: In finance and business analytics, the theorem is used
to model relationships between financial variables, forecast future trends, and
assess the impact of business decisions.
Example:
If the assumptions of the theorem hold (e.g., errors have zero mean, are
uncorrelated, and have constant variance), then the OLS estimator provides
unbiased and efficient estimates of the regression coefficients.
The researcher can use the OLS estimates to assess the impact of advertising
spending on sales revenue and make predictions about future sales based on
advertising budgets.
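A minimal sketch of how the advertising-and-sales example could be estimated with OLS in Python; the monthly figures and variable names are assumptions invented here for illustration:

```python
import numpy as np

# Invented monthly data: advertising spend (in $1000s) and sales revenue (in $1000s)
ad_spend = np.array([10, 15, 20, 25, 30, 35, 40])
sales    = np.array([120, 150, 170, 200, 220, 260, 280])

# OLS via least squares: beta_hat solves min ||y - X beta||^2
X = np.column_stack([np.ones_like(ad_spend, dtype=float), ad_spend])  # intercept + slope
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, slope = beta_hat
print(f"sales ≈ {intercept:.1f} + {slope:.2f} * ad_spend")

# Under the Gauss-Markov conditions (zero-mean, uncorrelated, constant-variance errors),
# these OLS estimates are the best linear unbiased estimates of the true coefficients.
pred_at_50 = intercept + slope * 50   # predicted sales for a $50k advertising budget
print(f"Predicted sales at ad_spend=50: {pred_at_50:.1f}")
```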
Imagine you have a bunch of points on a graph, and you want to draw a
straight line that goes through them as best as possible. That's what a
linear regression model does. It helps us understand how one thing (like
how much we spend on advertising) affects another thing (like how much
stuff we sell).
OLS is like drawing that line through the points by minimizing the distance
between the line and each point. It's like trying to draw the best line that
gets as close as possible to all the points.
This is a fancy rule that says if we follow certain rules when drawing our
line (like making sure the errors average out to zero, have a similar spread,
and don't follow any patterns), then the line we draw using OLS will be the best one we can
make. It's like saying, "If we play by the rules, the line we draw will be the
most accurate one."
It's like having a superpower when we're trying to understand how things
are connected. We can trust that the line we draw using OLS will give us
the best idea of how one thing affects another thing. This helps us make
better predictions and understand the world around us.
Examples:
Let's say you're trying to figure out if eating more vegetables makes you grow
taller. You collect data from a bunch of kids and use OLS to draw a line
showing how eating veggies affects height. The Gauss-Markov theorem tells
you that if you follow its rules, that line will be the most accurate prediction of
how veggies affect height.
Or imagine you're a scientist studying how temperature affects how fast ice
cream melts. By following the rules of the Gauss-Markov theorem when using
OLS, you can trust that the line you draw will give you the best understanding
of how temperature affects melting speed.
In simple terms, the Gauss-Markov theorem is like a set of rules that, when
followed, help us draw the best line to understand how things are connected in the
world. It's like having a secret tool that helps us make really good guesses about
how things work!
https://www.youtube.com/watch?v=osh80YCg_GM&list=PLE7DDD91010BC51F8&index=17&pp=iAQB
Geometry of Least Squares
Key Concepts:
The OLS regression line is the line that best fits the observed data points
by minimizing the sum of squared vertical distances (residuals) between
the observed yᵢ values and the corresponding predicted values on the
regression line.
The residual for each observation is the vertical distance between the
observed yᵢ value and the predicted value on the regression line.
Each observed data point (xᵢ, yᵢ) can be projected onto the regression line
to obtain the predicted value ŷᵢ.
The vertical distance between the observed data point and its projection
onto the regression line represents the residual for that observation.
Minimization of Residuals:
Assessment of Model Fit: Geometric insights can help assess the adequacy
of the regression model by examining the distribution of residuals around the
regression line. A good fit is indicated by residuals that are randomly scattered
around the line with no discernible pattern.
Example:
Each observed data point can be projected onto the regression line to obtain
the predicted exam score.
The vertical distance between each data point and its projection onto the
regression line represents the residual for that observation.
The OLS regression line is chosen to minimize the sum of squared residuals,
ensuring that the residual vector is orthogonal to the fitted values (the model subspace).
By understanding the geometry of least squares, analysts can gain insights into
how the OLS estimator works geometrically, facilitating better interpretation and
application of regression analysis in various fields.
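To make the geometric picture concrete, the following sketch (with invented hours-studied and exam-score data) fits the OLS line and checks that the residual vector is orthogonal to the columns of the design matrix, which is exactly the minimization property described above:

```python
import numpy as np

# Invented data: hours studied (x) and exam score (y)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([55.0, 62.0, 70.0, 78.0, 90.0])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat                             # projections onto the model subspace
residuals = y - fitted                            # vertical distances to the line

# Orthogonality of residuals to the model subspace: X'e = 0 (up to rounding)
print(X.T @ residuals)     # both entries should be ~0
print(residuals.sum())     # residuals sum to ~0 when an intercept is included
```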
Another way to look at it: the subspace formulation.
Key Concepts:
Each observed data point corresponds to a vector in the space, where the
components represent the values of the independent variables.
In the context of linear models, the space spanned by the observed data
points is the data subspace, while the space spanned by the columns of the
design matrix (the predictors) is the model or coefficient subspace.
Basis vectors are vectors that span a subspace, meaning that any vector in
the subspace can be expressed as a linear combination of the basis
vectors.
The projection of a data point onto the coefficient subspace represents the
predicted response value for that data point based on the linear model.
The difference between the observed response value and the projected
value is the residual, representing the error or discrepancy between the
observed data and the model prediction.
Orthogonal Decomposition:
Example:
Consider a simple linear regression model with one independent variable (x) and
one dependent variable (y). The subspace formulation represents the observed
data points (xᵢ, yᵢ) as vectors in a two-dimensional space, where xᵢ is the
independent variable value and yᵢ is the corresponding dependent variable value.
The data subspace is spanned by the observed data points, representing the
space of possible values for the dependent variable given the independent
variable.
The fitted values are the projection of the observed response vector onto the
coefficient subspace, giving the best linear approximation to the relationship
between x and y.
Vectors:
In simple terms, it's like an arrow with a certain length and direction in
space.
Subspaces:
Basis:
A basis for a vector space is a set of vectors that are linearly independent
and span the space.
Linear independence means that none of the vectors in the basis can be
expressed as a linear combination of the others.
For example, in 2D space, the vectors (1, 0) and (0, 1) form a basis, as they
are linearly independent and can represent any vector in the plane.
Linear Independence:
For example, in 2D space, the vectors (1, 0) and (0, 1) are linearly
independent because neither can be written as a scalar multiple of the
other.
Understanding these concepts lays a strong foundation for more advanced topics
in linear algebra and helps in solving problems involving vectors, subspaces, and
linear transformations.
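A quick numerical way to check linear independence is to look at the rank of the matrix whose columns are the candidate vectors; this NumPy sketch (with illustrative vectors) also expresses an arbitrary vector in the chosen basis:

```python
import numpy as np

# Candidate basis vectors for 2D space
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
v  = np.array([2.0, 3.0])   # some vector in the plane

# Columns are linearly independent iff the rank equals the number of columns
M = np.column_stack([e1, e2])
print(np.linalg.matrix_rank(M) == M.shape[1])   # True -> {e1, e2} is a basis for R^2

# Any vector in the plane is a linear combination of the basis vectors
coeffs = np.linalg.solve(M, v)
print(coeffs)                                    # [2. 3.] -> v = 2*e1 + 3*e2
```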
Orthogonal Projections
https://youtu.be/5B8XluiqdHM?si=uvhg24qroSLd-k-
Key Concepts:
Applications:
Example:
Key Concepts:
In regression analysis, the observed data points are projected onto the
model space defined by the regression coefficients.
Orthogonality of Residuals:
The least squares criterion aims to minimize the sum of squared residuals,
which is equivalent to finding the orthogonal projection of the data onto
the model space.
Orthogonal Decomposition:
Applications:
Example:
Consider a simple linear regression model with one predictor variable X and one
response variable Y. The goal is to estimate the regression coefficients
(intercept and slope) that best describe the relationship between X and Y.
The observed data points (Xᵢ, Yᵢ) are projected onto the model space spanned
by the predictor variable X.
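A minimal sketch of this projection in matrix form, using invented (Xᵢ, Yᵢ) values: the hat matrix H = X(XᵀX)⁻¹Xᵀ projects the observed responses onto the model space, and the resulting residuals are orthogonal to that space:

```python
import numpy as np

# Invented predictor and response values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])        # design matrix (intercept + slope)

# Hat (projection) matrix: H = X (X'X)^(-1) X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                                    # orthogonal projection of y onto the model space
residuals = y - y_hat

print(np.allclose(H, H @ H))                     # H is idempotent (a projection)
print(np.allclose(X.T @ residuals, 0))           # residuals are orthogonal to the model space
```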
Factorial Experiments
What are Factorial Experiments?
Imagine you're doing a science experiment where you want to see how
different things affect a plant's growth, like temperature and humidity.
Instead of just changing one thing at a time, like only changing the
temperature or only changing the humidity, you change both at the same time
in different combinations.
So, you might have some plants in high temperature and high humidity, some
in high temperature and low humidity, and so on. Each of these combinations
is called a "treatment condition."
Key Concepts:
Factorial Design:
This just means you're changing more than one thing at a time in your
experiment.
Main Effects:
This is like looking at how each thing you change affects the plant's
growth on its own, without considering anything else.
So, we'd look at how temperature affects the plant's growth, ignoring
humidity, and vice versa.
Interaction Effects:
Sometimes, how one thing affects the plant depends on what's happening
with the other thing.
For example, maybe high temperature helps the plant grow more, but only
if the humidity is also high. If the humidity is low, high temperature might
not make much difference.
Factorial Notation:
This is just a fancy way of writing down what you're doing in your
experiment.
For example, if you have two factors, like temperature and humidity, each
with two levels (high and low), you'd write it as a "2×2" (or 2²) factorial design.
Advantages:
Efficiency:
You can learn more from your experiment by changing multiple things at
once, rather than doing separate experiments for each factor.
Comprehensiveness:
Factorial designs give you a lot of information about how different factors
affect your outcome, including main effects and interaction effects.
Flexibility:
You can study real-world situations where lots of things are changing at
once, like in nature or in product development.
Applications:
Example:
In our plant experiment, we're changing both temperature and humidity to see
how they affect plant growth. By looking at the growth rates of plants under
different conditions, we can figure out how each factor affects growth on its
own and if their effects change when they're combined.
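A sketch of how the plant experiment could be analysed as a 2×2 factorial design in Python, assuming the pandas and statsmodels packages are available; the growth figures are invented for illustration:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Invented plant-growth data for a 2x2 factorial design (temperature x humidity)
data = pd.DataFrame({
    "temp":     ["high", "high", "high", "high", "low", "low", "low", "low"] * 2,
    "humidity": ["high", "high", "low",  "low",  "high", "high", "low", "low"] * 2,
    "growth":   [12.1, 11.8, 8.4, 8.9, 9.2, 9.5, 7.1, 6.8,
                 12.4, 11.5, 8.7, 8.2, 9.0, 9.8, 7.4, 7.0],
})

# Main effects of temperature and humidity plus their interaction
model = ols("growth ~ C(temp) * C(humidity)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # ANOVA table with main and interaction effects
```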
Analysis of Covariance (ANCOVA)
Key Concepts:
In ANCOVA, group means are compared while statistically adjusting for the
effects of one or more continuous covariates. This adjustment helps
reduce error variance and increase the sensitivity of the analysis.
Model Formula:
Assumptions:
Hypothesis Testing:
Applications:
Imagine this:
You want to compare two groups, like students who study with Method 1 and
students who study with Method 2, to see if one method is better for test
scores.
But there's a twist! You also know that students' scores before the test (let's
call them "pre-test scores") might affect their test scores.
ANCOVA looks at the differences in test scores between the two groups
(Method 1 and Method 2) while taking into account the pre-test scores.
It's like saying, "Okay, let's see if Method 1 students have higher test scores
than Method 2 students, but let's also make sure any differences aren't just
because Method 1 students started with higher pre-test scores."
Key Terms:
Covariate: This is just a fancy word for another factor we think might affect
the outcome. In our example, the pre-test scores are the covariate because
we think they could influence test scores.
Model Formula: This is just the math equation ANCOVA uses to do its job. It
looks at how the independent variables (like the teaching method) and the
covariate (like pre-test scores) affect the outcome (test scores).
ANCOVA helps us get a clearer picture by considering all the factors that
could affect our results. It's like wearing glasses to see better!
Example:
Let's say we find out that Method 1 students have higher test scores than
Method 2 students. But, without ANCOVA, we might wonder if this is because
Method 1 is truly better or just because Method 1 students had higher pre-test
scores to begin with. ANCOVA helps us tease out the real answer.
So, ANCOVA is like a super detective that helps us compare groups while making
sure we're not missing anything important!
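A rough sketch of the study-method example as an ANCOVA in Python (again assuming pandas and statsmodels; the scores are invented), with the pre-test score entering the model as a covariate alongside the group factor:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Invented data: teaching method, pre-test score (covariate), post-test score (outcome)
data = pd.DataFrame({
    "method":   ["Method1"] * 6 + ["Method2"] * 6,
    "pre_test": [60, 65, 70, 55, 72, 68, 58, 63, 69, 61, 66, 71],
    "post":     [75, 80, 86, 70, 88, 83, 68, 72, 79, 70, 76, 81],
})

# ANCOVA: compare methods on post-test scores while adjusting for pre-test scores
model = ols("post ~ C(method) + pre_test", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-test for method, adjusted for the covariate
```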
Residuals and Diagnostics
Key Concepts:
Residuals:
Types of Residuals:
Residual Analysis:
Influence Diagnostics:
Advantages:
Applications:
Example:
Variable Transformations
Key Concepts:
Logarithmic Transformation:
Log transformations are useful for dealing with data that exhibit
exponential growth or decay, such as financial data, population growth
rates, or reaction kinetics.
Square root transformations involve taking the square root of the variable.
Reciprocal Transformation:
Reciprocal transformations are useful for dealing with data that exhibit a
curvilinear relationship, where the effect of the predictor variable on the
response variable diminishes as the predictor variable increases.
Exponential Transformation:
Choosing Transformations:
Visual Inspection:
Statistical Tests:
Applications:
Example:
2. Why Transform?
Sometimes, the relationship between variables isn't linear, or the data doesn't
meet regression assumptions like normality or constant variance.
3. Common Transformations:
The appropriate choice among these (log, square root, reciprocal, exponential) depends on the data's characteristics.
5. Advantages of Transformations:
Improves linearity: Helps make the relationship between variables more linear.
6. Example: see the code sketch after this list for a concrete illustration.
7. Caution:
Choosing the right transformation is crucial for enhancing model accuracy and
ensuring valid interpretations of results.
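The example promised in point 6: a small sketch that applies log and square-root transformations to invented right-skewed data and checks how the skewness changes (SciPy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented right-skewed data, e.g. incomes or reaction times
raw = rng.lognormal(mean=3.0, sigma=0.8, size=200)

log_transformed = np.log(raw)        # logarithmic transformation
sqrt_transformed = np.sqrt(raw)      # square-root transformation (milder)

print("skewness raw:  ", round(stats.skew(raw), 2))
print("skewness log:  ", round(stats.skew(log_transformed), 2))
print("skewness sqrt: ", round(stats.skew(sqrt_transformed), 2))
# The log-transformed data are typically much closer to symmetric,
# which often improves linearity and stabilizes variance in regression.
```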
Box-Cox Transformation
The Box-Cox transformation is a widely used technique in statistics for stabilizing
variance and improving the normality of data distributions. It is particularly useful
in regression analysis when the assumptions of constant variance
(homoscedasticity) and normality of residuals are violated. The Box-Cox
transformation provides a family of power transformations that can be applied to
the response variable to achieve better adherence to the assumptions of linear
regression.
Key Concepts:
Assumptions:
The Box-Cox transformation assumes that the data are strictly positive;
therefore, it is not suitable for non-positive data.
Applications:
Time Series Analysis: In time series analysis, the Box-Cox transformation can
be applied to stabilize the variance of time series data and remove trends or
seasonal patterns.
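A brief sketch of the Box-Cox transformation using SciPy's boxcox function, which estimates the power parameter λ by maximum likelihood and returns the transformed data (the input series is invented and strictly positive):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=2.0, sigma=0.6, size=300)   # strictly positive, skewed response

# Box-Cox: y(lambda) = (y**lambda - 1)/lambda for lambda != 0, log(y) for lambda == 0
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda: {lam:.2f}")
print(f"skewness before: {stats.skew(y):.2f}, after: {stats.skew(y_transformed):.2f}")
```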
Model Selection and Building Strategies
Key Concepts:
Variable Selection:
Model Complexity:
Model Interpretability:
Model interpretability refers to the ease with which the model's predictions
can be explained and understood by stakeholders.
Strategies:
Start Simple: Begin with a simple model that includes only the most important
predictor variables and assess its performance.
Iterative Model Building: Iteratively add or remove variables from the model
based on their significance and contribution to model performance.
Applications:
Example:
Model Building: Start with a simple linear regression model using the selected
predictor variables and assess its performance using cross-validation
techniques (e.g., k-fold cross-validation).
By following these model selection and building strategies, the data scientist can
develop a reliable predictive model for housing price forecasting that effectively
captures the relationships between predictor variables and housing prices while
ensuring robustness and generalizability.
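A sketch of the start-simple, iterate strategy using scikit-learn's k-fold cross-validation on an invented housing dataset; the feature names and coefficients are assumptions made here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200

# Invented predictors: square footage, number of bedrooms, age of the house
sqft     = rng.uniform(500, 3500, n)
bedrooms = rng.integers(1, 6, n)
age      = rng.uniform(0, 50, n)
price = 50_000 + 120 * sqft + 8_000 * bedrooms - 500 * age + rng.normal(0, 20_000, n)

# Start simple (sqft only), then iteratively add predictors and compare CV performance
X_simple = sqft.reshape(-1, 1)
X_full   = np.column_stack([sqft, bedrooms, age])

for name, X in [("sqft only", X_simple), ("sqft + bedrooms + age", X_full)]:
    scores = cross_val_score(LinearRegression(), X, price, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```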
Logistic Regression Models
Key Concepts:
Assumptions:
Linearity in the Logit: The relationship between the predictor variables and
the log-odds of the outcome is assumed to be linear.
Large Sample Size: Logistic regression performs well with large sample sizes.
Applications:
Example:
The bank decides to use logistic regression to build a predictive model. They
preprocess the data, splitting it into training and testing datasets. Then, they fit a
logistic regression model to the training data, with transaction features as
predictor variables and the binary outcome variable (fraudulent or not) as the
response variable.
After fitting the model, they evaluate its performance using metrics such as
accuracy, precision, recall, and the area under the ROC curve (AUC-ROC) on the
testing dataset. The bank uses these metrics to assess the model's predictive
accuracy and determine its suitability for detecting fraudulent transactions in real-
time.
In summary, logistic regression models are valuable tools for predicting binary
outcomes in various fields, providing insights into the factors that influence the
likelihood of an event occurring. They are widely used in practice due to their
simplicity, interpretability, and effectiveness in classification tasks.
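A minimal sketch of the fraud-detection workflow described above using scikit-learn; the transaction features, labels, and parameters are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(7)
n = 1000

# Invented transaction features (amount, hour of day) and synthetic fraud labels
amount = rng.exponential(scale=100, size=n)
hour   = rng.integers(0, 24, size=n)
X = np.column_stack([amount, hour])
y = (rng.random(n) < 1 / (1 + np.exp(-(0.01 * amount - 2.5)))).astype(int)

# Split into training and testing data, then fit the logistic regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Evaluate with accuracy, precision, recall, and AUC-ROC on the test set
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
print("accuracy: ", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("AUC-ROC:  ", roc_auc_score(y_test, prob))
```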
Key Concepts: