Rbrul Manual
2017
Welcome To The Wonderful World Of Rbrul! Or So...
Rbrul is a script that is written in the programming language R. It is specifically designed to make statistical analysis of linguistic data fast and simple.
Relatively fast and simple.
A very comprehensive guide by Agata Daleszynska, also linked on the Rbrul website:
http://agatad.co.uk/images/Rbrul/rbrul%20handout_daleszynska.pdf
These and other guides you may find elsewhere are often strong on statistical explanation, so if you struggle while analysing your
results, you may find answers there. However, most of them do not provide detailed installation instructions or troubleshooting sections.
The Rbrul Manual of the Student Assistants Linguistics, English Department, University of Bern
This manual provides detailed instructions for the installation of both R and Rbrul (and even RStudio), as well as an overview of a number of basic functions.
It also contains a troubleshooting section and a short glossary with explanations of the most important terms.
This manual has been put together, and is continuously enhanced, by the Student Assistants of the English Department, University of Bern. All persons involved
are clearly more linguists than statisticians. Any error reports, feedback and suggestions for additions and enhancements are greatly appreciated. Please get in
touch via the following email address:
[email protected]
- thank you
Manual Contents
Installation
0.1 Installation of R
0.2 Installation of Rbrul Packages
Basic Functions
1. Launching of R and Rbrul
1.1 Launching of R
1.2 The Rbrul source code
1.3 The Rbrul launch command
6. Modeling
6.1 The 'modeling' menu
6.2 Preparing a test on effects and significances
6.3 Running a test on effects and significances
7. Adjusting data
7.1 The 'adjust data' menu
7.2 Excluding of variable response(s)
7.3 Recoding of variable response(s)
8. Saving data
8.1 The 'load/save' menu
8.2 Saving of current data
9. Grouping variables and variable responses
9.1 Grouping in Excel
9.2 Grouping in Rbrul
Glossary
o Variable
o Variable response/variant
o Dependent variable/DV
o Independent variable/IV
o Random vs fixed effect
o Step-up/step-down model
o Factor weight
o R^2
o Intercept
o Significance
Installation
0.1 Installation of R
0.1.1 Visit the following web site:
https://cran.r-project.org/
0.1.2 Install the newest version of R with default settings.
2. Preparing of data
2.1 Use Excel to prepare data before loading it into Rbrul.
Make sure that the data file does not include impermissible entries or typos.
2.2 Save the Excel file
(Recommendation: Save as Text (Tab delimited) file.)
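One quick way to catch typos before loading the file is to list the distinct entries in each column: a misspelt response then shows up as an extra value. The sketch below does this in Python for tab-delimited text. It is only an illustration and not part of Rbrul; the function name, the column names and the sample rows are invented for the example.

```python
import csv
import io
from collections import Counter

def distinct_values(tab_text):
    """Return {column name: Counter of entries} for tab-delimited text."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    counts = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            # Missing cells are counted as empty strings.
            counts[name][(row.get(name) or "").strip()] += 1
    return counts

# Hypothetical sample data; "femal" is a deliberate typo.
sample = (
    "speaker\tsex\tst\n"
    "A\tfemale\tstand\n"
    "B\tmale\tnon-st\n"
    "C\tfemal\tstand\n"
)

summary = distinct_values(sample)
for column, values in summary.items():
    print(column, dict(values))
```

A column that should have two responses but lists three, as 'sex' does here, is worth checking in Excel before the file is loaded into Rbrul.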
MAIN MENU
1-load/save data
9-reset 0-exit
1:
4. Crosstabs
Note: Examples contain random numbers; counts and percentages do not match.
4.4 pages (optional) Define whether there should be several crosstabs, one for each response of a third variable
columns X rows X pages

variable: age = young
                          variable: st.
                          stand    non-st
variable: sex   female       27        13
                male         14         2

variable: age = old
                          variable: st.
                          stand    non-st
variable: sex   female       38         4
                male         15        16
4.5 cells Select output format (Enter for counts, > 1 for proportions/means)
counts: see above
proportions/means:

binary variables:
cells = proportions for e.g. young speakers vs other age groups (i.e. variable: age)
                          variable: st.
                          stand    non-st
variable: sex   female     63.3      15.2
                male       57.4      21.9

continuous variables:
cells = means for e.g. first formant frequencies (F1) (i.e. variable: age)
                          variable: st.
                          stand    non-st
variable: sex   female    422.3     440.9
                male      401.7     419.1
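The difference between the two cell formats can be sketched outside Rbrul. The Python example below builds a sex X st. crosstab and reports, for each cell, the token count and the proportion of young speakers, mirroring the 'binary variables' case above. The tokens, variable names and helper functions are invented for illustration; this is not how Rbrul computes its tables internally.

```python
from collections import defaultdict

# Hypothetical tokens: (sex, realisation, age) triples.
tokens = [
    ("female", "stand", "young"), ("female", "stand", "old"),
    ("female", "non-st", "young"), ("male", "stand", "young"),
    ("male", "stand", "young"), ("male", "non-st", "old"),
]

# Collect the age of every token under its (row, column) cell.
cells = defaultdict(list)
for sex, st, age in tokens:
    cells[(sex, st)].append(age)

def count(sex, st):
    """Cell format 'counts': number of tokens in the cell."""
    return len(cells[(sex, st)])

def proportion_young(sex, st):
    """Cell format 'proportions': share of young speakers in the cell.
    Assumes the cell is not empty."""
    ages = cells[(sex, st)]
    return sum(age == "young" for age in ages) / len(ages)

for sex in ("female", "male"):
    for st in ("stand", "non-st"):
        print(sex, st, count(sex, st), f"{proportion_young(sex, st):.1%}")
```

For a continuous cell variable such as F1, the same structure applies with the mean of the collected values in place of the proportion.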
5. Plotting
Note: The aspect ratio of the scatterplots depends on the window size and the screen size.
In RStudio, use the 'Export' function to manually specify the ratio.
This graph could represent how many tokens females and males (IV ‘sex’, plotted on the x-axis)
realise in the standard way (DV ‘st’, plotted on the y-axis)
5.2.4 Color separation (optional) Define which variable is to be plotted by means of different colors, e.g. IV age
This graph could represent how many tokens young and old (IV ‘age’, color) females and males
(IV ‘sex’, x-axis) realise in the standard way (DV ‘st’, y-axis)
5.2.5 Averages (optional) Indicate whether the graph should contain a black average line
5.2.6 Horizontal panels and/or vertical panels (optional) Indicate whether there should be different panels for different responses of further IVs
This graph could represent how many tokens young and old (IV ‘age’, color) females and males
(IV ‘sex’, x-axis), who speak a different dialect (IV ‘dialect’, panels), realise in the
standard way (DV ‘st’, y-axis)
This creates circles on the graph that represent the number of tokens for a given category:
the bigger the circle, the more data it represents
6.2.5 Continuous IV(s) Define those IV(s) that take numerical values on a continuous scale rather than a fixed set of categories
6.2.6 Interaction(s) (optional) Define whether certain IVs are to be paired up (e.g. how do age AND sex together
influence the realisation of X?)
6.2.7 Random effects Define those IV(s) that are characterised by significant heterogeneity in terms of the number of
occurrences of their responses
E.g. 1: 540 tokens for speaker A, 660 tokens for speaker B:
relatively similar number of occurrences, i.e. fixed effect;
E.g. 2: 1100 tokens for speaker A, 100 tokens for speaker B:
relatively dissimilar number of occurrences, i.e. random effect.
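Before deciding on random effects, it can help to look at how the tokens are distributed over the responses of an IV. The Python sketch below simply counts tokens per speaker and prints each speaker's share, using the figures from E.g. 2 above. It is an illustration outside Rbrul with invented variable names, and it only describes the data; the decision itself remains a judgement call.

```python
from collections import Counter

# Hypothetical token list: one speaker label per token (figures from E.g. 2).
speakers = ["A"] * 1100 + ["B"] * 100

tokens_per_speaker = Counter(speakers)
total = sum(tokens_per_speaker.values())

# Share of all tokens contributed by each speaker.
shares = {speaker: n / total for speaker, n in tokens_per_speaker.items()}

for speaker, n in tokens_per_speaker.items():
    print(speaker, n, f"{shares[speaker]:.1%}")
```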
The variables have been defined successfully when the following appears:
Current variables are:
.
.
MODELING MENU
1-choose variables 2-one-level 3-step-up 4-step-down 5-step-up/step-down
6-trim 7-plotting 8-settings 9-main menu 0-exit
10-chi-square test
6.3 > 5 Command: step-up/step-down model
Rbrul is running the calculations as specified. This may take a while.
Rbrul has successfully run its analysis when the following appears:
STEPPING UP...
.
.
.
.
.
.
STEP-UP AND STEP-DOWN MATCH!*
MODELING MENU
1-choose variables 2-one-level 3-step-up 4-step-down 5-step-up/step-down
6-trim 7-plotting 8-settings 9-main menu 0-exit
10-chi-square test
$[IV]
factor        logodds   tokens   [DV responses]   centered factor weight
[IV resp.]      0.012      345            0.910                    0.789
[IV resp.]     -0.012      678            0.123                    0.211
...
$misc
...
STEPPING DOWN...
$[IV]
factor        logodds   tokens   [DV responses]   centered factor weight
[IV resp.]      0.012      345            0.910                    0.789
[IV resp.]     -0.012      678            0.123                    0.211
...
This is the summary of the stepping-up calculations, followed by the stepping-down calculations.
This summary is the starting point for interpreting the step-up/step-down calculations.
(Recommendation: Copy, paste and save such blocks in a Word or Excel file, and add comments about what is being calculated in
each block.)
6.3.1.3 Interpretation (binary)
Check the Glossary for a more detailed description of the expressions used in this section.
tokens 345/678
Number of tokens for each response of an IV, e.g. IV sex: female, male
DV responses 0.910/0.090
Proportion of the tokens of an IV response that occur with a specified DV response, e.g. female and standard, male and standard
In the example above, 91% of 345 tokens are realised as X (X = possible DV response, e.g. standard)
p* 0.00143
The significance p indicates whether this favouring or disfavouring effect is accidental or the manifestation of a clear pattern
(see Glossary entry)
Key values:
0.05 (< 0.05: significant; > 0.05: not significant)
0.01 (< 0.01: significant; > 0.01: not significant)
0.001 (< 0.001: significant; > 0.001: not significant)
(e stands for "times 10 to the power of", e.g. 3.46e-15 = 3.46 * 10^-15 = 0.00000000000000346)
* Factor weights of significant IVs should be taken from the step-up summary or from the individual step-up calculations
Factor weights of non-significant IVs should be taken from the first iteration of the step-down calculations
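Reading p values in scientific notation and comparing them with the key values can be sketched as follows. The Python example is an illustration only and not part of Rbrul; the function name is invented, and the thresholds are the three key values listed above.

```python
def significance_level(p, thresholds=(0.001, 0.01, 0.05)):
    """Return the strictest key value that p falls below, or None."""
    for threshold in thresholds:
        if p < threshold:
            return threshold
    return None

p = float("3.46e-15")      # scientific notation as it appears in the output
print(f"{p:.17f}")         # the same value written out in full

for value in (p, 0.00143, 0.2):
    level = significance_level(value)
    if level is None:
        print(value, "not significant at any key value")
    else:
        print(value, "significant at", level)
```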
$[IV]
factor           coef   tokens       mean
[IV resp.]    202.827       99   1729.505
[IV resp.]    163.793       68   1690.471
...
$misc.1
   n   df   intercept   overall mean
1172    8    1526.678       1504.713
$misc.2
 deviance        AIC       AICc      R2
168000194   17259.16   17259.31   0.068
Run X (above) with [IV] is better/worse than Run X-1 without [IV], p = 3.46e-15
...
(Recommendation: Copy, paste and save such blocks in a Word or Excel file, and add comments about what is being calculated in
each block.)
6.3.2.3 Interpretation (continuous)
Check the Glossary for a more detailed description of the expressions used in this section.
tokens 99/68
Number of tokens for an IV response of an IV, e.g. IV sex: female, male
mean 1729.505/1690.471
The mean of the DV values for a given IV response, e.g. IV speaker: speaker A mean 1729.505, speaker B mean 1690.471, ...
intercept 1526.678
The mean of the means of the DV values within a given IV response, e.g. the mean of 1729.505, 1690.471, ...
coef 202.827/163.793
The deviation of a given IV response mean from the intercept
e.g. mean formant frequency for speaker A = intercept + coef(speaker A) = 1526.678 + 202.827 = 1729.505
R^2 0.068
Multiple R-squared
Do the IV(s), e.g. speaker, etc., have a strong or weak effect on the DV?
How much of the variance in the DV values can be explained by the IVs?
In the example above, only 6.8% of the variance in the DV values (e.g. formant frequency) is due to the effect of all IVs in the
calculation (e.g. speaker, etc.).
Individual IV effects can be taken from the individual step-up calculations
p 3.46e-15
The significance p indicates whether the effect of the IV on the DV is accidental or the manifestation of a clear pattern (see
Glossary entry)
Key values:
0.05 (< 0.05: significant; > 0.05: not significant)
0.01 (< 0.01: significant; > 0.01: not significant)
0.001 (< 0.001: significant; > 0.001: not significant)
(e stands for "times 10 to the power of", e.g. 3.46e-15 = 3.46 * 10^-15 = 0.00000000000000346)
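The relationship between intercept, coef and mean described above can be checked with a few lines of arithmetic. The Python sketch below uses the figures from the example output; the labels 'speaker A' and 'speaker B' follow the illustration in the text and are otherwise arbitrary. It is not part of Rbrul.

```python
intercept = 1526.678
coefs = {"speaker A": 202.827, "speaker B": 163.793}  # illustrative labels

# Mean for each IV response = intercept + coef of that response.
means = {label: round(intercept + coef, 3) for label, coef in coefs.items()}
for label, value in means.items():
    print(label, value)

# R^2 read as a percentage of explained variance.
r_squared = 0.068
percent_explained = f"{r_squared:.1%}"
print(percent_explained, "of the variance in the DV is accounted for by the IVs")
```

The two results agree with the 'mean' column of the output (1729.505 and 1690.471).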
8. Saving data
8.1 > 1 Command: load/save
8.2 Save current data
Yes In case a new file is to be created; name the new file (the original file remains unchanged)
No In case new data is to be loaded (the changes applied to the data loaded in Rbrul will be lost)
8.3 Load new data
Specify file type In case new data is to be loaded
Enter In case the current data is to be kept
9.2.1.9 Pairwise interaction Choose the two variables that are of interest for their combined occurrence, e.g. age and sex
9.2.1.10 Random effects Define those IV(s) that are characterised by significant heterogeneity in terms of the
number of occurrences of their responses
E.g. 1: 540 tokens for speaker A, 660 tokens for speaker B - relatively similar number of occurrences, i.e.
fixed effect
E.g. 2: 1100 tokens for speaker A, 100 tokens for speaker B - relatively dissimilar number of occurrences,
i.e. random effect
MAIN MENU
1-load/save data
9-reset 0-exit
1:
More than two variables can be grouped together, e.g. age, sex and dialect, by first creating a full interaction group of age and sex,
and then creating a full interaction group of the newly created variable (age:sex) and dialect (age:sex:dialect)
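The two-step grouping can be pictured as pasting the responses of the variables together. The Python sketch below illustrates the idea only: the function, the sample rows and the ':' separator for the combined responses are assumptions made for the example, and Rbrul's own labels for the combined responses may look different.

```python
def interaction(rows, variables, sep=":"):
    """Add a combined column such as 'age:sex' to every row (a dict)
    and return the name of the new column."""
    name = sep.join(variables)
    for row in rows:
        row[name] = sep.join(row[v] for v in variables)
    return name

# Hypothetical tokens.
rows = [
    {"age": "young", "sex": "female", "dialect": "north"},
    {"age": "old", "sex": "male", "dialect": "south"},
]

first = interaction(rows, ["age", "sex"])        # step 1: creates "age:sex"
second = interaction(rows, [first, "dialect"])   # step 2: creates "age:sex:dialect"

print(second, [row[second] for row in rows])
```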
9.2.2.5 Rbrul analysis
It is now possible to analyse this new variable, which is a combination of other variables.
MAIN MENU
1-load/save data
9-reset 0-exit
1:
If Excel does not handle imported data correctly in further calculations or when creating graphs, go to "Troubleshooting\Change number
categories in Excel"
Troubleshooting
Rbrul has a tendency to crash, sometimes for no apparent reason. In most cases, a reboot is necessary,
which is why it is convenient to save the source code and have it handy at all times when working with Rbrul
(see Basic Functions, 1.2.1).
Alternative launches
If the Main Menu does not show up after the source code and Rbrul launch command have been entered, try the following alternative launches
in this order.
AL 4.1 Close R
AL 4.2 Uninstall R
AL 4.2.W Windows users:
AL 4.2.W.1 Locate R in the Control Panel, uninstall it AND search the computer for any remaining folders and files (in most cases,
there will be quite a few).
AL 4.2.W.2 Delete everything.
AL 4.2.M Mac users:
AL 4.2.M.1 Go to the location where R is saved and delete it from there.
Alternatively, install another version of R (step AL 4.3) without deleting the old one; follow the steps outlined here:
https://support.rstudio.com/hc/en-us/articles/200486138-Using-Different-Versions-of-R
Note: This leaves behind quite a number of files that are normally invisible.
If AL3 does not work the first time, try to find and delete those remainders by following the steps laid out in this guide:
http://macs.about.com/od/tipstricks/qt/hiddenfolder.htm
Dependent variable/DV The feature in the centre of attention. How does X (the DV) behave in different contexts? How is it realised in these
contexts?
(context: the speaker of X is male/female; the speaker of X is young/old; X occurs before/after Y; etc. These are IVs.)
Independent variable/IV Context variable, e.g. sex, age, linguistic setting etc.
Random vs fixed effect A variable that is characterised by significant heterogeneity in terms of the number of occurrences of its responses is to
be treated as a random effect
A variable that is not characterised by significant heterogeneity in terms of the number of occurrences of its responses is
to be treated as a fixed effect
E.g. 1: 540 tokens for speaker A, 660 tokens for speaker B: relatively similar number of occurrences,
i.e. fixed effect;
E.g. 2: 1100 tokens for speaker A, 100 tokens for speaker B: significantly dissimilar number of occurrences, i.e. random
effect.
Note: This is a rule of thumb. More generally, a random effect is a grouping variable, such as speaker or word, whose responses
are a sample from a larger population; unbalanced token numbers are a common reason to treat such a variable as random, not the
definition itself.
Intercept The mean of the means of the DV values within given IV responses.
E.g. The effect of the IV speaker on the DV formant frequency is investigated.
Every speaker (IV response) is likely to end up with different DV values and thus a different DV mean value.
The intercept is the calculation of the mean of every speaker's individual mean DV value:
mean formant frequency for speaker A: 100
mean formant frequency for speaker B: 200
mean formant frequency for speaker C: 300
intercept: (100 + 200 + 300) / 3 = 600 / 3 = 200
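The mean of the speaker means is not the same as the overall mean of all tokens as soon as speakers contribute different numbers of tokens, which is one reason why 'intercept' and 'overall mean' can differ in the output. The Python sketch below illustrates this under the simplified definition given here; the formant values and variable names are invented, and the example is not part of Rbrul.

```python
from statistics import mean

# Hypothetical formant values per speaker; speaker C contributes more tokens.
values = {
    "A": [90, 110],
    "B": [190, 210],
    "C": [280, 300, 320, 300],
}

# Mean DV value of each speaker: 100, 200 and 300, as in the example above.
speaker_means = {speaker: mean(v) for speaker, v in values.items()}

# Intercept in the sense of this glossary: the mean of the speaker means.
mean_of_means = mean(speaker_means.values())

# Overall mean: every token counts equally, so speaker C weighs more.
overall_mean = mean(x for v in values.values() for x in v)

print(speaker_means, mean_of_means, overall_mean)
```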
Significance Is this favouring or disfavouring effect accidental or the manifestation of a clear pattern?
Key values:
0.05 (< 0.05: significant; > 0.05: not significant)
0.01 (< 0.01: significant; > 0.01: not significant)
0.001 (< 0.001: significant; > 0.001: not significant)
(e stands for "times 10 to the power of", e.g. 3.46e-15 = 3.46 * 10^-15 = 0.00000000000000346)
Statistical significance is a means to assess how much confidence researchers can have in their findings.
I.e. it is a means of evaluation of findings.
Colquhoun (2014) suggests that a significance threshold of 0.001 be set. With a threshold of 0.05, there is at least a 30%
likelihood that conclusions about effects are wrong! Colquhoun therefore advises:
- to set a threshold of 0.001, which reduces the likelihood of false discoveries to less than 5%;
- to treat findings with p ~ 0.05 merely as “worth another look”.
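The kind of reasoning behind such figures can be sketched with simple arithmetic: the share of 'significant' results that are false discoveries depends on the threshold, on the power of the test and on how many of the tested effects are real. In the Python sketch below, the power of 0.8 and the assumption that 10% of tested effects are real are illustrative values chosen for the example, not figures taken from this manual; keeping the power constant for both thresholds is a further simplification.

```python
def false_discovery_rate(alpha, power, prevalence):
    """Share of significant results that are false positives.

    alpha:      significance threshold
    power:      probability of detecting an effect that is real
    prevalence: proportion of tested effects that are real
    """
    false_positives = alpha * (1 - prevalence)
    true_positives = power * prevalence
    return false_positives / (false_positives + true_positives)

# Illustrative assumptions: power 0.8, 10% of tested effects are real.
fdr_05 = false_discovery_rate(alpha=0.05, power=0.8, prevalence=0.1)
fdr_001 = false_discovery_rate(alpha=0.001, power=0.8, prevalence=0.1)

print(f"threshold 0.05:  {fdr_05:.1%}")
print(f"threshold 0.001: {fdr_001:.1%}")
```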
Non-significance is often a consequence of low token numbers for a particular independent variable (IV).
If token numbers are high, it is worth reporting that a particular IV has no statistically significant
effect on the dependent variable (DV). This is a valid and important finding!
If token numbers are low, it is not possible to draw conclusions for that IV:
low token numbers do not mean that the IV does or does not exert an influence on the dependent
variable (DV); they simply mean that there is not enough data to make an assessment. Where relevant, it is worth
pointing this out and, with caution, making tentative statements.
E.g. "Low token numbers for IV X render an interpretation of its influence on the DV impossible. However, the factor
weight seems to suggest that, if there were more data, an XYZ effect might be apparent."
Significance tests:
likelihood-ratio Chi-squared test for runs with binary DV responses (logistic regression)
F-test for runs with continuous DV responses (linear regression)
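As a rough illustration of the first of these tests: a likelihood-ratio test compares the deviance of a model without an IV to the deviance of a model with it, and looks up the difference in a chi-squared distribution. The Python sketch below covers only the simplest case, an IV that adds one parameter (1 degree of freedom); the deviance values and function names are invented, and this is not Rbrul's own code.

```python
import math

def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))

def likelihood_ratio_p(deviance_without, deviance_with):
    """p value for adding one parameter (1 df) to a nested model."""
    # The larger model can never fit worse, so the statistic is >= 0.
    statistic = max(deviance_without - deviance_with, 0.0)
    return chi2_upper_tail_df1(statistic)

# Hypothetical deviances of two nested logistic regression models.
p = likelihood_ratio_p(deviance_without=1210.4, deviance_with=1200.2)
print(f"p = {p:.5f}")
```

IVs with more than two responses add more than one parameter and need a chi-squared distribution with correspondingly more degrees of freedom.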