Help Statistiek Intro To Long Data Analysis 2023

This document provides an introduction to longitudinal data analysis and summarizes a lunchtime lecture on the topic. Longitudinal data involves measuring multiple subjects at several points in time. Linear regression is not suitable for longitudinal data due to dependency between observations from the same subject over time. Two common approaches for analyzing longitudinal data are using summary measures, which reduces the data but allows standard analyses, or multilevel modeling, which better utilizes all data by accounting for the clustering of observations within subjects. The lecture will introduce multilevel models for change to model trajectories over time while accounting for the nested structure of longitudinal data.

Uploaded by

Gebrekiros

Help! Statistics!

Introduction to Longitudinal Data Analysis

Sacha la Bastide-van Gemert


Medical Statistics and Decision Making
Epidemiology, UMCG
Help! Statistics! lunchtime lectures

What? Frequently used statistical methods and questions in a manageable timeframe for all researchers at the UMCG.
No knowledge of advanced statistics is required.

When? Lectures take place every 1st Tuesday of every two months, 12.00-13.00hrs.

Who? Unit for Medical Statistics and Decision Making and colleagues

When?          Where?    What?                                        Who?
Feb 7, 2023    Room 16   Introduction to longitudinal data analysis   Sacha la Bastide
April 4, 2023  Room 16   Machine Learning                             Hylke Donker
Jun 6, 2023    Room 16   …                                            …

Slides from each presentation can be downloaded from:
https://www.rug.nl/research/epidemiology/download-area
Introduction to longitudinal data analysis: overview

• What is longitudinal data?

• Why does it need a special approach?
  - revisiting the linear regression model

• Longitudinal data analysis:
  - using summary measures
  - introduction of the multilevel model for change (a.k.a. mixed effects model)
And yes, there will be some mathematical notation…

• To denote the value of a variable $Y$ for subject number $i$ (at time-point $j$) in our dataset, we use subscripts like this: $Y_i$ ($Y_{ij}$)

• To denote a variable $\varepsilon$ to be Normally distributed, with mean 0 and standard deviation $\sigma$, we write: $\varepsilon \sim N(0, \sigma^2)$

• A linear relationship (= “a straight line”) between variables $X$ and $Y$ with a certain intercept and a certain slope is described by a formula such as: $Y = 2.5 + 0.5 * X$ (intercept 2.5, slope 0.5)
What is longitudinal data? (1)
Clustered data
Clustered (or nested/multilevel/hierarchical/...) data
Example: several classrooms, within each classroom students
(For many more examples, see the previous Help! Statistics! lecture)

• Observations from students from the same classroom are more alike than observations from students from different classrooms: students are nested in classrooms

• Variables at student level: gender, SES, ...
• Variables at classroom level: teacher effect, ...
→ multilevel data
What is longitudinal data? (2)

• Longitudinal data: several subjects, each measured at several (different) points in time t1, t2, t3, t4, t5

[Figure: timelines for several subjects, each measured at a different subset of the time points t1 to t5]

• Measurements (at different time points) from one subject are more alike than measurements from different subjects: measurements are nested within subjects

• Variables at each time point: lengths, grades...
• Variables for each subject: gender, SES, ...
→ multilevel data

Today: focus on continuous outcome variables
Example: adolescent alcohol use (Curran et al, 1997)*

• Sample of 82 adolescents:
37 are Children Of an Alcoholic parent (COAs), 45 are non-COAs

• Research design:
- each child assessed 3 times
(at ages 14, 15, 16)
- outcome: alcuse (continuous,
“alcohol use”-questionnaire)
- covariate: coa (dichotomous, 0=no, 1=yes)

• Research question:
Do trajectories of adolescent alcohol use differ by parental alcoholism?

* Example from: Singer & Willett: Applied longitudinal data analysis. Modeling change and event occurrence (Oxford, 2003)
Longitudinal data
The data-set: person-period format

Data in long format: for each person, each repeated measurement is stored as a new case.

Here: 3 rows per person
- a time variable: age
- an outcome variable: alcuse
- a (time-independent) covariate: coa

[Figure: screenshot of the data-set, with the rows grouped by person (person 1, person 2, ...)]

→ clustered data!
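As a rough illustration of the person-period layout, the sketch below reshapes a wide table (one row per adolescent) into long format (one row per adolescent per age). The column names `alcuse_14`, `alcuse_15`, `alcuse_16`, the helper `to_person_period` and all numbers are assumptions made up for this example; they are not the Curran et al. data.

```python
# Hypothetical wide-format records: one row per adolescent, one column per age.
# All values are invented for illustration only.
wide_rows = [
    {"id": 1, "coa": 1, "alcuse_14": 1.7, "alcuse_15": 2.0, "alcuse_16": 2.0},
    {"id": 2, "coa": 0, "alcuse_14": 0.0, "alcuse_15": 1.0, "alcuse_16": 1.0},
]

AGES = (14, 15, 16)


def to_person_period(rows, ages=AGES):
    """Return one record per person per measurement occasion (long format)."""
    long_rows = []
    for row in rows:
        for age in ages:
            long_rows.append(
                {
                    "id": row["id"],
                    "age": age,                      # time variable
                    "alcuse": row[f"alcuse_{age}"],  # outcome at that age
                    "coa": row["coa"],               # time-independent covariate
                }
            )
    return long_rows


long_rows = to_person_period(wide_rows)
for record in long_rows:
    print(record)
```

Each person now contributes three rows, and the value of coa is simply repeated on each of them, which is what makes the rows clustered within persons.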
Investigating change over time
Scatterplot age-alcuse for the whole data-set:

[Figure: scatterplot of alcuse against age (years)]

Continuous outcome variable alcuse, covariates age and coa…

... what about linear regression of alcuse on age?

Let’s revisit (simple) linear regression analysis...
Intermezzo
The linear regression model revisited (1)
Cross-sectional data: for each adolescent i one observation (alcuse, age, coa)

Investigating the linear relation between age and outcome alcuse:
• what is the best fitted straight line?
  = find the line “closest” to the data points in the scatter plot

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$

Here, $\beta_0$ and $\beta_1$ are estimated to be -3.0 and 0.26
$\varepsilon_i$: residuals

Note: cross-sectional data
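The slide does not say how “closest” is measured. Assuming the usual least-squares criterion (sum of squared vertical distances), the estimates of the intercept and slope can be written down directly, with $\bar X$ and $\bar Y$ denoting the sample means:

```latex
\begin{aligned}
(\hat\beta_0, \hat\beta_1) &= \arg\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2 \\
\hat\beta_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2},
\qquad
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
\end{aligned}
```

The residual of observation i is then its vertical distance to the fitted line, $Y_i - \hat\beta_0 - \hat\beta_1 X_i$.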
Intermezzo
The linear regression model revisited (2)
Linear regression:
- we assume the mean alcohol use values for fixed age values lie on a straight line
- individual observations are assumed to be normally distributed around these means (random residual)

Formally: we assume an underlying true population linear relationship

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$

Residual $\varepsilon_i$: random variable, normally distributed with constant variance $\sigma^2$, independent from the value of X

Assumptions made in order for the model to be valid:
• independent observations
• linear relation between Y and X
• normally distributed residuals
• homogeneity of the residuals’ variance across values of X
Back to our longitudinal data-example...
Longitudinal data
Plot of whole group

Remember the research question:

Do trajectories of adolescent
alcohol use differ by parental
alcoholism?

Different measurements from one adolescent are related:

dependency between observations!

Linear regression is no longer an option...
Analysis of longitudinal data
Using summary measures (1)
Solution:
Choose ONE suitable summary measure Y which reflects a relevant feature of the curve:
- mean over time
- maximum value
- time of reaching the maximum
- maximal velocity/increase
- ...
Now there is just one outcome variable (the summary measure Y)
per adolescent ⟶ independent observations ⟶ multiple regression analysis!

Advantages:
- simple and easy (can be done using standard techniques)
- provides nice summaries of the data
Disadvantages:
- inefficient use of the whole data
- not all types of research questions can be addressed
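The reduction to one value per adolescent can be sketched as follows. The records are invented for illustration, and the helper name `summarise` and the key `alcuse_mean` are assumptions of this sketch; `alcuse_max` follows the name used in the example on the next slide.

```python
from collections import defaultdict

# Hypothetical long-format records: (id, age, alcuse, coa). Values are invented.
records = [
    (1, 14, 1.7, 1), (1, 15, 2.0, 1), (1, 16, 2.0, 1),
    (2, 14, 0.0, 0), (2, 15, 1.0, 0), (2, 16, 1.0, 0),
    (3, 14, 0.0, 0), (3, 15, 0.0, 0), (3, 16, 1.0, 0),
]


def summarise(rows):
    """Collapse the repeated measurements to one summary row per person."""
    values_by_person = defaultdict(list)
    coa_by_person = {}
    for person_id, _age, alcuse, coa in rows:
        values_by_person[person_id].append(alcuse)
        coa_by_person[person_id] = coa
    return {
        person_id: {
            "coa": coa_by_person[person_id],
            "alcuse_max": max(values),                 # maximum value over time
            "alcuse_mean": sum(values) / len(values),  # mean over time
        }
        for person_id, values in values_by_person.items()
    }


summary = summarise(records)
for person_id, row in summary.items():
    print(person_id, row)
```

After this step there is one row per adolescent, so the rows can be treated as independent observations, at the price of discarding the timing of the individual measurements.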
Analysis of longitudinal data
Using summary measures (2)

Example: for each adolescent we take the maximum value of alcohol use over the three years: alcuse_max

[Figure: distribution of alcuse_max by coa]

• Higher median alcuse_max for COA=1 group than for COA=0 group

• Different distributions of the two groups: alcuse_max much more skewed in COA=0 than in COA=1

Does COA affect maximum alcohol use?
(Mann-Whitney test for independent groups)

Let’s see if we can make better use of all our data!
Analysis of longitudinal data
Summarizing so far...

• Investigating change over time requires longitudinal data: multiple (ideally ≥ 3 waves) measurements over time per subject

• Using summary measures is an option, but means throwing away information and is limited in answering research questions on change/trajectories

• The linear regression model is not applicable, due to violations of the model assumptions (dependency in longitudinal data!)

… so time to tackle the clustering!
Analysis of longitudinal data
Introducing the multilevel model for change
We want to expand the linear regression model with several random effects: mixed effects or multilevel model

“random effects & fixed effects”, “individual level & group level”

This model answers:
- Level 1: within-person questions (intra-individual)
  How does each person’s alcohol use change over time? (trajectories)
- Level 2: between-person questions (inter-individual)
  How does having an alcoholic parent affect these trajectories?

Multilevel model: a linked pair of statistical models
Introducing the multilevel model
Exploring individual’s growth plots
to come up with a level-1 submodel

Plotting regression models for a group of subjects i to help answer the question:

What population individual growth model might have generated these sample data?

elevation? tilt? (non-)linear?

Note: “simpler is better”

Here we choose a linear model
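A minimal sketch of this exploratory step, assuming a separate least-squares line is fitted to each subject's own measurements. The data are invented and the helper `fit_line` is a hypothetical name introduced here; the fitted intercept gives the "elevation" and the fitted slope the "tilt" of each individual line.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical measurements per subject: (ages, alcuse values). Values are invented.
person_data = {
    1: ([14, 15, 16], [1.0, 2.0, 3.0]),
    2: ([14, 15, 16], [0.0, 0.0, 1.0]),
}

# One exploratory regression line per subject.
fits = {pid: fit_line(ages, values) for pid, (ages, values) in person_data.items()}

for pid, (intercept, slope) in fits.items():
    print(f"subject {pid}: intercept = {intercept:.2f}, slope = {slope:.2f}")
```

These per-subject fits are only a descriptive aid for choosing the shape of the level-1 submodel; they are not the multilevel model itself.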
Introducing the multilevel model
The level-1 submodel for individual change
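Assumption: in the population, $alcuse_{ij}$ is a linear function of child i’s age on occasion j:

$alcuse_{ij} = \pi_{0i} + \pi_{1i} \, age_{ij} + \varepsilon_{ij}$

i = 1, ..., 82 (children)
j = 1, 2, 3 (measurements)

$\pi_{0i}$ is the intercept of i’s true trajectory (= “alcuse at age 0”)
$\pi_{1i}$ is the slope of i’s true trajectory (= “rate of alcuse change”)
$\varepsilon_{i1}$, $\varepsilon_{i2}$ and $\varepsilon_{i3}$ are deviations of i’s true trajectory from linearity on each occasion (random errors)
Assumption: $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$

[Figure: individual i’s (hypothesized) true trajectory, alcuse (0 to 4) against age (14, 15, 16), with the deviations $\varepsilon_{i1}$ and $\varepsilon_{i2}$ marked]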
Introducing the multilevel model
What do we want from our level-2 submodels?
Demands:
1. We need two level-2 submodels:
- one for intercept $\pi_{0i}$
- one for slope $\pi_{1i}$
These models should:
2. specify the relationship between $\pi_{0i}$ and $\pi_{1i}$ and the covariate of interest (COA)
3. allow adolescents with common COA-values to have different individual trajectories

[Figure: two panels (COA=0 and COA=1) showing individual trajectories of alcuse against age]
Introducing the multilevel model
The level-2 submodels for inter-individual differences in change

$\pi_{0i} = \gamma_{00} + \gamma_{01} COA_i + \zeta_{0i}$ (random intercept)
$\pi_{1i} = \gamma_{10} + \gamma_{11} COA_i + \zeta_{1i}$ (random slope)

Level-2 intercepts $\gamma_{00}$ and $\gamma_{10}$:
Population average intercept and slope for COA=0

Level-2 slopes $\gamma_{01}$ and $\gamma_{11}$:
Effect of COA on intercept and on slope

Level-2 residuals $\zeta_{0i}$ and $\zeta_{1i}$:
Deviations of each individual’s trajectory around the predicted average intercept and slope: allowing for “scattering” of the individual trajectories around the population mean growth trajectories

Extra model assumptions:
>>> beyond the scope of today’s lecture <<<
Introducing the multilevel model
Estimating the fixed effects (the $\gamma$'s)
Summarizing the total model:

$alcuse_{ij} = \pi_{0i} + \pi_{1i} \, age_{ij} + \varepsilon_{ij}$ (level 1)

$\pi_{0i} = \gamma_{00} + \gamma_{01} COA_i + \zeta_{0i}$
$\pi_{1i} = \gamma_{10} + \gamma_{11} COA_i + \zeta_{1i}$ (level 2)

Fitted model for intercept: $\hat\pi_{0i} = -3.8 + 1.4 * COA_i$
- Initial alcuse (“alcuse at age 0”) for the average non-COA adolescent is -3.8
- For the average COA-adolescent, it is 1.4 higher (at age 0) (difference in initial alcuse between COA-groups)

Fitted model for slope: $\hat\pi_{1i} = 0.29 - 0.05 * COA_i$
- Annual rate of change for the average non-COA adolescent is 0.29
- For the average COA-adolescent, it is 0.05 lower (non significant) (difference in slope between COA-groups)
Introducing the multilevel model
Visualizing the results: constructing fitted growth trajectories

Fitted level-2 models:
$\hat\pi_{0i} = -3.8 + 1.4 * COA_i$
$\hat\pi_{1i} = 0.29 - 0.05 * COA_i$

For COA=0 we get:
$\hat\pi_{0i} = -3.8$
$\hat\pi_{1i} = 0.29$

For COA=1 we get:
$\hat\pi_{0i} = -3.8 + 1.4 * 1 = -2.4$
$\hat\pi_{1i} = 0.29 - 0.05 * 1 = 0.24$

Substitute the estimates into the level-1 model to get fitted growth trajectories:
when $COA_i = 1$: $\hat Y_{ij} = -2.4 + 0.24 * age$
when $COA_i = 0$: $\hat Y_{ij} = -3.8 + 0.29 * age$

[Figure: fitted growth trajectories, ALCUSE (0 to 2) against AGE (13 to 17), one line for COA = 1 and one for COA = 0]

dotted line: individual estimated trajectory for one child i (randomly deviating from the bold green curve due to $\zeta_{0i}$ and $\zeta_{1i}$)

green dots: actual observed values of alcuse for child i (randomly scattered around the dotted green line due to $\varepsilon_{ij}$)
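The substitution can be checked with a few lines of code. This sketch uses the rounded fixed-effect estimates reported on the slides (-3.8, 1.4, 0.29, -0.05); the constant names and the function `fitted_alcuse` are assumptions introduced for illustration, and the printed values inherit the rounding of the estimates.

```python
# Rounded fixed-effect estimates as reported on the slides.
GAMMA_00 = -3.8   # intercept for the average non-COA adolescent
GAMMA_01 = 1.4    # difference in intercept for COA = 1
GAMMA_10 = 0.29   # annual rate of change for the average non-COA adolescent
GAMMA_11 = -0.05  # difference in slope for COA = 1


def fitted_alcuse(age, coa):
    """Fitted population-average alcuse for a given age and COA group (0 or 1)."""
    intercept = GAMMA_00 + GAMMA_01 * coa
    slope = GAMMA_10 + GAMMA_11 * coa
    return intercept + slope * age


for coa in (0, 1):
    for age in (14, 15, 16):
        print(f"COA={coa}, age={age}: fitted alcuse = {fitted_alcuse(age, coa):.2f}")
```

The intercepts refer to age 0, far outside the observed range of 14 to 16, so only the fitted values within that range are meaningful to compare.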
The multilevel model
Combining the levels: rewriting the model

$\pi_{0i} = \gamma_{00} + \gamma_{01} COA_i + \zeta_{0i}$
$\pi_{1i} = \gamma_{10} + \gamma_{11} COA_i + \zeta_{1i}$

$alcuse_{ij} = \pi_{0i} + \pi_{1i} \, age_{ij} + \varepsilon_{ij}$

…Toto, I’ve got a feeling this is not regular linear regression anymore…

$alcuse_{ij} = (\gamma_{00} + \gamma_{01} COA_i + \zeta_{0i}) + (\gamma_{10} + \gamma_{11} COA_i + \zeta_{1i}) * age_{ij} + \varepsilon_{ij}$

Same model, now in one equation!

Complex residuals!
They change with age now and are autocorrelated (dependent)
(… very much unlike linear regression...)

Fixed part of the model shows clearly how alcuse depends on:
- covariates age and COA
- interaction term, COA × age, allowing the effect of age to differ for levels of COA
JUST LIKE LINEAR REGRESSION!
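Multiplying out the brackets makes the two parts explicit. The following is only a regrouping of the one-equation model, with no new assumptions; the labels "fixed part" and "composite residual" are descriptive names added here.

```latex
\begin{aligned}
alcuse_{ij}
  &= (\gamma_{00} + \gamma_{01}\, COA_i + \zeta_{0i})
   + (\gamma_{10} + \gamma_{11}\, COA_i + \zeta_{1i})\, age_{ij}
   + \varepsilon_{ij} \\
  &= \underbrace{\gamma_{00} + \gamma_{01}\, COA_i + \gamma_{10}\, age_{ij}
       + \gamma_{11}\, COA_i \, age_{ij}}_{\text{fixed part}}
   + \underbrace{\zeta_{0i} + \zeta_{1i}\, age_{ij} + \varepsilon_{ij}}_{\text{composite residual}}
\end{aligned}
```

The composite residual contains the term $\zeta_{1i}\, age_{ij}$, so it varies with age, and the same $\zeta_{0i}$ and $\zeta_{1i}$ appear in all three measurements of child i, which is why residuals within one child are dependent.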
Some final remarks on mixed effects/multilevel models

• A lot more can be considered, such as:
- unbalanced/missing data
- time-dependent covariates
- different correlation structures/model designs/estimation methods
- models for different types of outcome variables
- …
• Mixed effects models are complex and applying them correctly is a challenge

… so why bother?
- to make efficient use of all data (even for subjects with missing measurements!)
- to properly account for correlation structures within your data (and avoid estimation bias in your confidence intervals/standard errors)
- in general: estimating fewer parameters, reducing number of tests, …
Books, courses and an e-module

• Snijders & Bosker: Multilevel Analysis. An introduction to basic and advanced multilevel modeling (London, 1999, 2011)
• Verbeke & Molenberghs: Linear mixed models for longitudinal data (New York, 2000)
• Singer & Willett: Applied longitudinal data analysis. Modeling change and event occurrence (Oxford, 2003)
• Pinheiro & Bates: Mixed effects models in S and S-plus (New York, 2000)

Courses offered from our unit:


• Generalized and Linear Mixed Effects Models (SPSS, R)
• Applied Longitudinal Data Analysis (SPSS, R)
• Beyond Regression (CPE-students)
https://edubox.nl/portal.aspx#opleiding=opleiding_gnk
Next Help! Statistics!-lunchtime lecture

Hylke Donker
Machine learning
April 4th, 2023
Room 16
