Linear Models and Linear Mixed Effects Models in R


Pitch ~ sex
The thing on the left of the ~ is the dependent variable (the thing you measure). The thing on the right is the independent variable, i.e. the fixed effect.

Even if we measured all of these factors, there would still be other factors influencing pitch that we cannot control for. Hence, let’s update our formula to capture the existence of these “random” factors:

pitch ~ sex + ε

ε is an error term. It stands for all of the things that affect pitch that are not sex: all of the stuff that, from the perspective of our experiment, is random or uncontrollable.

You could call the former the “structural” or “systematic” part of your
model and the latter the “random” or “probabilistic” part of the model.
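Both parts can be seen when fitting the model in R. A minimal sketch, using assumed toy values (six speakers, chosen so the summary reproduces the F(1,4) = 46.61 example cited in these notes):

```r
# Toy data: three female and three male speakers (assumed values, in Hz)
pitch <- c(233, 204, 242, 130, 112, 142)
sex   <- factor(rep(c("female", "male"), each = 3))
my.df <- data.frame(sex, pitch)

# Fit the linear model pitch ~ sex; the error term ε is implicit
my.lm <- lm(pitch ~ sex, data = my.df)
summary(my.lm)  # coefficients (systematic part) and residuals (random part)
```

summary() prints the coefficient table, R², F-statistic and p-values discussed below.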

The R-squared (R²) value measures how much of the variance in the data is explained by the model. For example, an R² value of 0.921 means that 92.1% of the variance is explained by the model. In this case the model uses one predictor (sex), so R² reflects the variance accounted for by the difference between males and females.

The Adjusted R-squared adjusts for the number of predictors used. If more
variables are added, this value helps indicate whether the additional
variables improve the model’s fit. It is usually lower than the regular R²
when there are many predictors, as the adjustment controls for potential
overfitting.
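Both values can be read directly off the model summary. A minimal sketch, assuming the same six toy pitch values used throughout these notes:

```r
pitch <- c(233, 204, 242, 130, 112, 142)  # assumed toy values (Hz)
sex   <- factor(rep(c("female", "male"), each = 3))
s <- summary(lm(pitch ~ sex))

s$r.squared      # multiple R-squared: ~0.921 for these values
s$adj.r.squared  # adjusted R-squared: slightly lower
```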

Next comes the p-value, a commonly used measure for testing statistical significance. The p-value is the probability of obtaining data like the observed data (or more extreme) if the null hypothesis were true. For example, if the null hypothesis is “sex has no effect on pitch”, a small p-value says that the observed data would be unlikely under that hypothesis, which counts as evidence in favour of the alternative hypothesis (“sex affects pitch”).

Finally, there is a distinction between the p-value for the overall model
and the p-values for individual coefficients. The p-value for the overall
model assesses all predictors together, whereas individual p-values assess
the significance of each predictor separately.
If you wanted to say that your result is “significant”, you would have to write something like this:
“We constructed a linear model of pitch as a function of sex. This model was significant (F(1,4) = 46.61, p < 0.01).”
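The F-statistic and the overall p-value can be extracted from the summary object. A minimal sketch, again assuming the six toy pitch values:

```r
pitch <- c(233, 204, 242, 130, 112, 142)  # assumed toy values (Hz)
sex   <- factor(rep(c("female", "male"), each = 3))
s <- summary(lm(pitch ~ sex))

s$fstatistic  # F value with numerator and denominator degrees of freedom
# overall model p-value, recomputed from the F distribution
pf(s$fstatistic[[1]], s$fstatistic[[2]], s$fstatistic[[3]], lower.tail = FALSE)
```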

The explanation focuses on the p-value associated with the coefficient for “sexmale” in a linear model where sex is the only fixed effect. Here the p-value for the “sexmale” coefficient equals the overall model’s p-value, since there is only one explanatory variable. With multiple fixed effects, however, the significance of the overall model and of each individual coefficient would differ, because the overall model considers all variables together.

The term “sexmale” appears instead of just “sex” because, in a regression model, R uses one category (here “female”, the alphabetically first level) as the baseline. The estimate for “(Intercept)” represents the mean of the female voice pitch, while the estimate for “sexmale” indicates the difference in pitch between males and females.

Since the estimate for "sexmale" is negative, this implies that male voice
pitches are lower than female ones. The difference between the estimates
for "(Intercept)" and "sexmale" gives the mean pitch for males. In short,
the Intercept represents the female pitch average, and the coefficient for
"sexmale" shows the difference from that baseline, providing a
comparison between the two categories.
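This reading of the coefficients can be checked directly. A minimal sketch, assuming the toy six-speaker dataset:

```r
pitch <- c(233, 204, 242, 130, 112, 142)  # assumed toy values (Hz)
sex   <- factor(rep(c("female", "male"), each = 3))
my.lm <- lm(pitch ~ sex)

coef(my.lm)                    # (Intercept) = female mean; sexmale = difference
mean(pitch[sex == "female"])   # same value as the intercept
coef(my.lm)[["(Intercept)"]] + coef(my.lm)[["sexmale"]]  # = mean male pitch
```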

With a two-level categorical predictor, the difference between the two category means is exactly the slope of the fitted line: place the categories at 0 and 1 on the x-axis, and moving from one to the other changes the predicted pitch by exactly the coefficient. The larger the distance between the category means, the steeper the slope.

By centering our variable (subtracting its mean so that the new mean is 0), we made the intercept more meaningful: it becomes the predicted value at the average of the predictor rather than at zero.
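A minimal sketch of centering, with an assumed continuous predictor (the ages and pitch values below are invented for illustration):

```r
age   <- c(20, 30, 40, 50, 60, 70)        # assumed toy ages
pitch <- c(250, 240, 230, 220, 210, 200)  # assumed toy pitch values (Hz)

age.c <- age - mean(age)   # centered predictor: mean of age.c is 0
m <- lm(pitch ~ age.c)
coef(m)[["(Intercept)"]]   # predicted pitch at the *average* age, not at age 0
```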

Pitch ~ dialect + sex + age + ε


And so on and so on. The only thing that changes is the following. The p-
value at the bottom of the output will be the p-value for the overall model.
This means that the p-value considers how well all of your fixed effects
together help in accounting for variation in pitch. The coefficient output
will then have p-values for the individual fixed effects.
This is called multiple regression: you model one response variable as a function of multiple predictor variables.
- “Multiple regression” and “linear model” are often used interchangeably; a multiple regression is just a linear model with more than one predictor.
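A minimal sketch of such a model, with an assumed toy data frame (all values invented for illustration):

```r
set.seed(1)
d <- data.frame(
  pitch   = rnorm(20, mean = 180, sd = 30),      # assumed toy responses (Hz)
  dialect = factor(rep(c("north", "south"), 10)),
  sex     = factor(rep(c("female", "male"), each = 10)),
  age     = seq(20, 58, by = 2)
)

m <- lm(pitch ~ dialect + sex + age, data = d)
summary(m)  # bottom p-value: overall model; table rows: individual predictors
```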

Linearity
In linear models, the outcome must be a linear combination of the
predictors. If this assumption is violated, a residual plot will reveal
patterns like curves or multiple lines. Residuals are the differences
between observed and predicted values. A residual plot helps check for
violations of linearity; ideally, there should be no obvious pattern. If a
nonlinear pattern appears, options include adding a missing fixed effect,
applying a nonlinear transformation (e.g., log), or including polynomial
terms (like age² for a U-shaped relationship). If stripes appear, you may be
dealing with categorical data and need logistic models.
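A minimal sketch of a residual plot, using simulated toy data with a genuinely linear relation:

```r
set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)  # assumed toy data: linear trend plus noise

m <- lm(y ~ x)
plot(fitted(m), residuals(m))  # ideally a patternless cloud around zero
abline(h = 0, lty = 2)
```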

Absence of Collinearity
Collinearity occurs when two or more predictors are highly correlated,
causing instability in the model’s interpretation. For example, if you're
studying how talking speed affects intelligence, but you include multiple
related speed measures (syllables, words, sentences per second), these
variables may "compete" with each other, making it hard to identify which
one is significant.
To prevent or address collinearity, focus on uncorrelated predictors from
the design stage or select the most meaningful measure in the analysis
phase. You could also use Principal Component Analysis to reduce the
number of correlated variables and improve model stability.
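A minimal sketch of detecting collinearity and collapsing the redundant measures with PCA (the speed measures below are simulated for illustration):

```r
set.seed(7)
syll  <- rnorm(30, mean = 5, sd = 1)      # syllables per second (toy values)
words <- syll / 2 + rnorm(30, sd = 0.1)   # words per second: nearly redundant

cor(syll, words)  # close to 1, so the two predictors are collinear

# One remedy: collapse the correlated measures into principal components
pc <- prcomp(cbind(syll, words), scale. = TRUE)
summary(pc)  # PC1 carries most of the shared "speed" variance
```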

Homoskedasticity
Homoskedasticity refers to the assumption that the variance of residuals
is roughly equal across the predicted values. If this assumption is violated,
it leads to heteroskedasticity, where the variance of the residuals is
unequal, often larger at higher predicted values. You can check this with a
residual plot, which should resemble a blob-like pattern if the assumption
is met. If the residuals appear to fan out or cluster, this indicates
heteroskedasticity. A common solution is to apply a transformation, such
as a log-transform, to stabilise variance.
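A minimal sketch of the log-transform remedy, using simulated data whose spread grows with the mean:

```r
set.seed(3)
x <- runif(100, 1, 10)
y <- exp(0.5 * x + rnorm(100, sd = 0.3))  # toy data: variance grows with fitted value

m1 <- lm(y ~ x)       # residuals fan out at higher fitted values
m2 <- lm(log(y) ~ x)  # log-transform stabilises the variance
plot(fitted(m1), residuals(m1))  # fan shape: heteroskedastic
plot(fitted(m2), residuals(m2))  # blob-like, as desired
```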

A blob-like pattern in a data visualisation signals a lack of association between two variables. It represents randomness in the data, making it difficult to infer meaningful relationships or apply predictive models like regression.

When a residual is 0 in a regression analysis, the predicted value from the regression model perfectly matches the observed value for that particular data point. In other words, the model has made an exact prediction for that observation.
- Perfect fit for that data point: a residual of 0 means that the regression model has perfectly predicted the actual value of the dependent variable for that specific data point.

Normality of Residuals
The normality of residuals is a less critical assumption but still worth
checking. If violated, linear models are generally robust enough to handle
this. You can check for normality by creating a histogram of residuals,
which should resemble a bell curve, or by using a Q-Q plot, which
compares the distribution of residuals to a normal distribution. If the
residuals follow a straight line in a Q-Q plot, the normality assumption is
likely met.
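Both checks take one line each in R. A minimal sketch, using simulated toy data with normal errors:

```r
set.seed(5)
x <- rnorm(50)
y <- 2 + x + rnorm(50)  # assumed toy data with normally distributed errors
m <- lm(y ~ x)

hist(residuals(m))    # should look roughly bell-shaped
qqnorm(residuals(m))  # points near a straight line suggest normality
qqline(residuals(m))
```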

Absence of Influential Data Points
Influential data points can distort your model’s results. You can use the
dfbeta() function in R to check the impact of each data point on the
model’s coefficients. A data point is considered influential if its removal
significantly changes the slope of the coefficient or reverses its sign. If
influential data points are identified, it's best to report results both with
and without those points, unless the point clearly results from an error.
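A minimal sketch of this check, on simulated toy data (the flagging threshold is an assumption, not a fixed rule):

```r
set.seed(9)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)  # assumed toy data
m <- lm(y ~ x)

db <- dfbeta(m)  # change in each coefficient if that one row is left out
head(db)
# rows whose removal would shift the slope noticeably (assumed 2-SD cutoff)
which(abs(db[, "x"]) > 2 * sd(db[, "x"]))
```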

Independence
The independence assumption is the most crucial in statistical tests,
particularly in linear model analyses. It assumes that each data point
comes from a different subject, meaning there should be no relationship
between observations. For example, in the studies mentioned, each row of
data corresponds to a unique subject, allowing for proper analysis.
Independence can be compared to the randomness of a coin flip, where
each flip's outcome has no influence on the others. Violating this
assumption, such as by collecting multiple responses from the same
subject without adjusting the analysis, can lead to inaccurate results and
meaningless p-values.
This assumption is frequently violated in various fields, leading to a body
of literature discussing the issue. To ensure independence, careful
attention to experimental design is required. In cases where repeated
measures are collected from the same subject, statistical techniques like
mixed models can be applied to account for the lack of independence.
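A minimal sketch of the mixed-model remedy, assuming the lme4 package is installed (the subject structure and values below are invented for illustration):

```r
library(lme4)  # assumed to be installed; not part of base R
set.seed(11)

subject  <- factor(rep(1:10, each = 4))  # 4 repeated measures per subject
baseline <- rnorm(10, mean = 180, sd = 25)[subject]  # subject-specific baseline
sex      <- factor(rep(c("female", "male"), each = 20))  # constant within subject
pitch    <- baseline + ifelse(sex == "male", -60, 0) + rnorm(40, sd = 5)

# A random intercept per subject accounts for the non-independence
m <- lmer(pitch ~ sex + (1 | subject))
summary(m)
```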

The best fitting line, also known as the regression line or line of best fit, is the straight line that best represents the relationship between the independent (predictor) variable and the dependent (outcome) variable in a set of data. In simple linear regression, the best fitting line minimises the sum of the squared differences between the observed data points and the predicted values (the “errors” or “residuals”).
- The slope b indicates the strength and direction of the relationship
between the independent and dependent variables. A positive slope
means a positive relationship, while a negative slope means a
negative relationship.
- The intercept a is the point where the line crosses the Y-axis,
representing the predicted value of Y when the independent variable
X is zero.

Residuals (Errors):
- Residuals are the differences between the observed data points and
the predicted values on the best fitting line. The smaller these
residuals are, the better the line fits the data.
- A good fit is when the residuals are randomly distributed around zero,
without any systematic pattern.
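The slope and intercept can be computed directly from these definitions. A minimal sketch, with assumed toy values roughly following y = 2x:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # assumed toy data

b <- cov(x, y) / var(x)     # slope: strength and direction of the relation
a <- mean(y) - b * mean(x)  # intercept: predicted y when x = 0
c(intercept = a, slope = b)

# lm() finds exactly the same least-squares line
all.equal(unname(coef(lm(y ~ x))), c(a, b))
```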

The p-value measures the probability of observing your data (or something more extreme) under the assumption that the null hypothesis is true. When the p-value is smaller than the predetermined significance level α, the result is considered statistically significant and the null hypothesis is rejected. When the p-value is larger than α, there is not enough evidence to reject the null hypothesis.

R-squared (R²), also known as the coefficient of determination, is a key statistic in regression analysis that measures the proportion of the variance in the dependent variable (outcome) that is predictable from the independent variables (predictors). In simple terms, it indicates how well the model explains the variation in the data.
Adjusted R² will always be a bit lower than the multiple R².
