Linear Models and Linear Mixed Effects Models in R
pitch ~ sex
The thing on the left of the tilde is the dependent variable (the thing you measure).
The thing on the right is the independent variable: the fixed effect.
The fixed effect is the “structural” or “systematic” part of your model; the error term, everything the model cannot explain, is the “random” or “probabilistic” part of the model.
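As a minimal sketch of what this looks like in R (the data frame and its values are hypothetical, chosen so that there are six subjects, which matches the F(1,4) reported below), the model is fitted with lm() and inspected with summary():

# hypothetical data frame: one pitch measurement per subject
my.df <- data.frame(
  sex   = c("female", "female", "female", "male", "male", "male"),
  pitch = c(233, 204, 242, 130, 112, 142)
)

# fit the linear model: pitch as a function of sex
pitch.model <- lm(pitch ~ sex, data = my.df)

# the summary reports coefficients, R-squared, the F-statistic and p-values
summary(pitch.model)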
The model summary reports an R-squared (R²) value, which measures how much of the variance in the data is explained by the model. For example, an R² value of 0.921 means that 92.1% of the variance is explained by the model. Here the model uses a single predictor (sex), so R² reflects the variance accounted for by the difference between males and females.
The Adjusted R-squared adjusts for the number of predictors used. If more
variables are added, this value helps indicate whether the additional
variables improve the model’s fit. It is usually lower than the regular R²
when there are many predictors, as the adjustment controls for potential
overfitting.
Finally, there is a distinction between the p-value for the overall model
and the p-values for individual coefficients. The p-value for the overall
model assesses all predictors together, whereas individual p-values assess
the significance of each predictor separately.
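If it helps to see where these numbers live, they can be pulled out of the summary object of the sketch model above; note that the overall model p-value is not stored directly and has to be computed from the F-statistic:

s <- summary(pitch.model)

s$r.squared      # overall R-squared
s$adj.r.squared  # adjusted R-squared
s$coefficients   # per-coefficient estimates, standard errors, t-values, p-values
s$fstatistic     # F-statistic with numerator and denominator degrees of freedom

# p-value for the overall model, derived from the F-statistic
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)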
If you wanted to say that your result is “significant”, you would have to write something like this:
“We constructed a linear model of pitch as a function of sex. This model was significant (F(1,4)=46.61, p<0.01).”
Since the estimate for "sexmale" is negative, male voice pitches are lower than female ones. Adding the "sexmale" estimate to the "(Intercept)" estimate gives the mean pitch for males. In short, the intercept represents the female pitch average, and the coefficient for "sexmale" is the difference from that baseline, providing a direct comparison between the two categories.
With a categorical predictor coded as 0 and 1, the difference between the two category means is exactly the slope of the fitted line: if the distance between the means grows, the slope grows by the same amount.
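To see this correspondence concretely, the group means can be compared with the model coefficients (still using the hypothetical data frame from the sketch above):

# group means computed directly from the data
tapply(my.df$pitch, my.df$sex, mean)

# "(Intercept)" is the female mean (the baseline level);
# "sexmale" is the male mean minus the female mean
coef(pitch.model)

# the male mean is therefore the intercept plus the sexmale coefficient
coef(pitch.model)["(Intercept)"] + coef(pitch.model)["sexmale"]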
Linearity
In linear models, the outcome must be a linear combination of the
predictors. If this assumption is violated, a residual plot will reveal
patterns like curves or multiple lines. Residuals are the differences
between observed and predicted values. A residual plot helps check for
violations of linearity; ideally, there should be no obvious pattern. If a
nonlinear pattern appears, options include adding a missing fixed effect,
applying a nonlinear transformation (e.g., log), or including polynomial
terms (like age² for a U-shaped relationship). If stripes appear, you may be
dealing with categorical data and need logistic models.
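A residual plot takes a couple of lines in base R; a minimal sketch using the model fitted above:

# residuals versus fitted values: ideally a patternless cloud around zero
plot(fitted(pitch.model), residuals(pitch.model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)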
Absence of Collinearity
Collinearity occurs when two or more predictors are highly correlated,
causing instability in the model’s interpretation. For example, if you're
studying how talking speed affects intelligence, but you include multiple
related speed measures (syllables, words, sentences per second), these
variables may "compete" with each other, making it hard to identify which
one is significant.
To prevent or address collinearity, focus on uncorrelated predictors from
the design stage or select the most meaningful measure in the analysis
phase. You could also use Principal Component Analysis to reduce the
number of correlated variables and improve model stability.
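As a rough sketch of both checks (the data frame speed.df and its column names are invented for illustration):

# hypothetical, deliberately correlated speed measures for the talking-speed example
speech.rate <- rnorm(20, mean = 5, sd = 1)
speed.df <- data.frame(
  syllables.per.sec = speech.rate,
  words.per.sec     = speech.rate / 2.5 + rnorm(20, sd = 0.1),
  sentences.per.sec = speech.rate / 15  + rnorm(20, sd = 0.05)
)

# pairwise correlations: values near 1 or -1 flag collinear predictors
cor(speed.df)

# Principal Component Analysis collapses correlated measures into fewer components
prcomp(speed.df, scale. = TRUE)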
Homoskedasticity
Homoskedasticity refers to the assumption that the variance of residuals
is roughly equal across the predicted values. If this assumption is violated,
it leads to heteroskedasticity, where the variance of the residuals is
unequal, often larger at higher predicted values. You can check this with a
residual plot, which should resemble a blob-like pattern if the assumption
is met. If the residuals appear to fan out or cluster, this indicates
heteroskedasticity. A common solution is to apply a transformation, such
as a log-transform, to stabilise variance.
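Two quick checks and a possible fix, sketched with the model from above:

# scale-location plot: the spread of the residuals should not grow with the fitted values
plot(pitch.model, which = 3)

# if the residuals fan out, refitting on a log scale may stabilise the variance
log.model <- lm(log(pitch) ~ sex, data = my.df)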
Normality of Residuals
The normality of residuals is a less critical assumption but still worth
checking. If violated, linear models are generally robust enough to handle
this. You can check for normality by creating a histogram of residuals,
which should resemble a bell curve, or by using a Q-Q plot, which
compares the distribution of residuals to a normal distribution. If the
residuals follow a straight line in a Q-Q plot, the normality assumption is
likely met.
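Both checks are one-liners in base R, again using the model sketched above:

# histogram of residuals: should look roughly bell-shaped
hist(residuals(pitch.model))

# Q-Q plot: points should fall close to the reference line
qqnorm(residuals(pitch.model))
qqline(residuals(pitch.model))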
Independence
The independence assumption is the most crucial in statistical tests,
particularly in linear model analyses. It assumes that each data point
comes from a different subject, meaning there should be no relationship
between observations. For example, in the studies mentioned, each row of
data corresponds to a unique subject, allowing for proper analysis.
Independence can be compared to the randomness of a coin flip, where
each flip's outcome has no influence on the others. Violating this
assumption, such as by collecting multiple responses from the same
subject without adjusting the analysis, can lead to inaccurate results and
meaningless p-values.
This assumption is frequently violated in various fields, leading to a body
of literature discussing the issue. To ensure independence, careful
attention to experimental design is required. In cases where repeated
measures are collected from the same subject, statistical techniques like
mixed models can be applied to account for the lack of independence.
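A minimal sketch of such a model, assuming the lme4 package and a hypothetical long-format data frame repeated.df with columns pitch, sex and subject (several rows per subject):

library(lme4)

# the random intercept (1 | subject) accounts for repeated measures
# from the same subject, i.e. the lack of independence
pitch.lmer <- lmer(pitch ~ sex + (1 | subject), data = repeated.df)
summary(pitch.lmer)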
The best fitting line, also known as the regression line or line of best fit, is the straight line that best represents the relationship between the independent (predictor) variable and the dependent (outcome) variable in a set of data; its equation is Y = a + bX. In simple linear regression, the best fitting line minimises the sum of the squared differences between the observed data points and the predicted values (the “errors” or “residuals”).
- The slope b indicates the strength and direction of the relationship
between the independent and dependent variables. A positive slope
means a positive relationship, while a negative slope means a
negative relationship.
- The intercept a is the point where the line crosses the Y-axis,
representing the predicted value of Y when the independent variable
X is zero.
Residuals (Errors):
- Residuals are the differences between the observed data points and
the predicted values on the best fitting line. The smaller these
residuals are, the better the line fits the data.
- A good fit is when the residuals are randomly distributed around zero,
without any systematic pattern.
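A small worked example with a continuous predictor (the x and y values are invented) shows where a, b and the residuals come from:

# simple linear regression: y ~ x
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

coef(fit)            # "(Intercept)" is a, "x" is the slope b
fitted(fit)          # predicted values a + b * x
residuals(fit)       # observed minus predicted values
sum(residuals(fit))  # least squares forces these to sum to (numerically) zero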