Quantitative Methods in Linguistics - Lecture 3
Adrian Brasoveanu∗
March 30, 2014
Contents
1 Basic graphics
2 Data frames
   2.1 Saving a data frame to a file
   2.2 Attaching and detaching data frames
4 Lists
5 Character / String Processing
6 More graphics
1 Basic graphics
> x <- c(12, 15, 13, 20, 14, 16, 10, 10, 8, 15)
A histogram:
> hist(x)
∗These notes have been generated with the ‘knitr’ package (Xie 2013) and are based on many sources, including but not limited to: Abelson (1995), Miles and Shevlin (2001), Faraway (2004), De Veaux et al. (2005), Braun and Murdoch (2007), Gelman and Hill (2007), Baayen (2008), Johnson (2008), Wright and London (2009), Gries (2009), Kruschke (2011), Diez et al. (2013), Gries (2013).
[Figure: histogram of x; x-axis values 8 to 20, y-axis "Frequency" from 0 to 3]
A barplot:
> barplot(table(x))
[Figure: barplot of table(x); one bar for each observed value (8, 10, 12, 13, 14, 15, 16, 20), counts up to 2]
More examples, with random draws from a standard normal distribution (mean 0, standard deviation 1):
> (x <- rnorm(100))
> hist(x)
[Figure: "Histogram of x" for the 100 normal draws; x-axis roughly -2 to 2, y-axis "Frequency" up to about 25]
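The command for the next histogram is missing from the extracted notes; judging by the frequencies (up to about 150), it was drawn from a larger normal sample, e.g.:

> # plausible reconstruction; the sample size is an assumption
> x <- rnorm(1000)
> hist(x)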
[Figure: "Histogram of x" for a larger normal sample; x-axis -3 to 3, y-axis "Frequency" up to about 150]
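The commands for the following two-panel figure are also missing; given the panel titles and the par(mfrow = c(1, 1)) reset just below, a plausible reconstruction is:

> # the sample size is an assumption
> x <- rnorm(10000)
> par(mfrow = c(1, 2))   # two plotting panels side by side
> hist(x)                # left panel: histogram of x
> plot(density(x))       # right panel: kernel density estimate of x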
[Figure: two panels; left: "Histogram of x" (Frequency), right: "density.default(x = x)" (Density); x-axis -4 to 4 in both]
> # ?density
> par(mfrow = c(1, 1))
The two together – note the freq=F option passed to the hist function:
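The code chunk itself did not survive extraction; based on the surrounding text and the # ?lines comment below, it was presumably:

> hist(x, freq = F)    # freq = F plots densities rather than counts
> lines(density(x))    # overlay the kernel density estimate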
[Figure: "Histogram of x" on the density scale with the kernel density curve overlaid; x-axis -4 to 4]
> # ?lines
A scatterplot:
[1] 1 2 3 4 5 6 7 8 9 10
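The assignments of x and y are missing here; the vector printed above and the range of the plotted points (minimum -25 at x = 5) match the parabola used again later in these notes, so a plausible reconstruction is:

> # plausible reconstruction; the formula for y is an assumption
> (x <- 1:10)
> (y <- x^2 - 10 * x)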
> plot(x, y)
[Figure: scatterplot of y against x; x from 2 to 10, y from -25 to 0]
> plot(x, y, type = "b")
[Figure: the same points plotted with type = "b", i.e. points joined by lines]
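The two printouts below also lack their commands; they match the assignments that reappear explicitly further down:

> (x <- seq(1, 10, by = 0.2))   # a finer grid from 1 to 10
> (y <- x^2 - 10 * x)           # the same parabola on that grid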
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6
[15] 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4
[29] 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2
[43] 9.4 9.6 9.8 10.0
[1] -9.00 -10.56 -12.04 -13.44 -14.76 -16.00 -17.16 -18.24 -19.24 -20.16
[11] -21.00 -21.76 -22.44 -23.04 -23.56 -24.00 -24.36 -24.64 -24.84 -24.96
[21] -25.00 -24.96 -24.84 -24.64 -24.36 -24.00 -23.56 -23.04 -22.44 -21.76
[31] -21.00 -20.16 -19.24 -18.24 -17.16 -16.00 -14.76 -13.44 -12.04 -10.56
[41] -9.00 -7.36 -5.64 -3.84 -1.96 0.00
> plot(x, y)
[Figure: scatterplot of y against x on the finer grid; the points trace the parabola from -25 back up to 0]
> curve(expr = sin, from = 0, to = 6 * pi)
[Figure: plot of sin(x) for x from 0 to 6*pi]
> par(mfrow = c(1, 2))
> (x <- seq(1, 10, by = 0.2))
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6
[15] 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4
[29] 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2
[43] 9.4 9.6 9.8 10.0
[1] -9.00 -10.56 -12.04 -13.44 -14.76 -16.00 -17.16 -18.24 -19.24 -20.16
[11] -21.00 -21.76 -22.44 -23.04 -23.56 -24.00 -24.36 -24.64 -24.84 -24.96
[21] -25.00 -24.96 -24.84 -24.64 -24.36 -24.00 -23.56 -23.04 -22.44 -21.76
[31] -21.00 -20.16 -19.24 -18.24 -17.16 -16.00 -14.76 -13.44 -12.04 -10.56
[41] -9.00 -7.36 -5.64 -3.84 -1.96 0.00
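The command assigning y and the plotting calls for the next two side-by-side figures did not survive; given the surviving axis labels ("(x^2) - (10 * x)", "y", and the typeset "x^2 - 10x"), a plausible reconstruction is:

> # plausible reconstruction; panel details are assumptions
> (y <- x^2 - 10 * x)
> curve((x^2) - (10 * x), from = 1, to = 10)   # default ylab is the deparsed expression
> plot(x, y)                                   # second panel, ylab "y"
> # the second figure apparently repeats the pair with a typeset label:
> plot(x, y, ylab = expression(x^2 - 10 * x))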
[Figure: two pairs of side-by-side panels showing the same parabola; in the first pair the y-axes are labelled "(x^2) - (10 * x)" and "y", in the second pair both carry the typeset expression x^2 - 10x]
> # ?rbinom
> (a <- rbinom(100, 1, 0.5))
[1] 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1
[36] 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 0 1
[71] 0 1 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0
> sum(a)
[1] 51
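The 0/1 vector above cannot produce the histograms that follow (values around 35 to 65, frequencies in the thousands), so a was presumably redefined as a larger binomial sample before plotting; the exact call is an assumption:

> a <- rbinom(10000, 100, 0.5)   # 10000 draws of 'successes out of 100 trials'
> hist(a)                        # counts
> hist(a, freq = F)              # densities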
[Figure: "Histogram of a"; values roughly 35 to 65, y-axis "Frequency" up to about 1500]
[Figure: "Histogram of a" on the density scale; values roughly 35 to 65]
> hist(a, probability = TRUE, col = "lightblue", border = "white", main = "A prob. distribution",
+ xlab = "value", ylab = "probability", breaks = 30)
> # ?hist
> lines(density(a), col = "darkblue", lwd = 3)
[Figure: "A prob. distribution"; histogram of a with a density curve overlaid, x-axis "value" (35 to 65), y-axis "probability"]
2 Data frames
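The commands creating the data frame x and displaying its structure are missing from the extraction; the (partial) str() output below, and the full listing later in the Lists section, are consistent with a construction like the following (reading the same columns from a file is equally possible):

> x <- data.frame(PartOfSpeech   = c("ADJ", "ADV", "N", "CONJ", "PREP"),
+                 TokenFrequency = c(421, 337, 1411, 458, 455),
+                 TypeFrequency  = c(271, 103, 735, 18, 37),
+                 Class          = c("open", "open", "open", "closed", "closed"))
> str(x)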
$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1 2 4 3 5
$ TokenFrequency: num 421 337 1411 458 455
$ TypeFrequency : num 271 103 735 18 37
$ Class : Factor w/ 2 levels "closed","open": 2 2 2 1 1
> summary(x)
> x$PartOfSpeech
> str(x$PartOfSpeech)
> summary(x$PartOfSpeech)
> str(x.2)
> x.2$PartOfSpeech
NULL
> row.names(x.2)
> names(x.2)
character(0)
> # View(x)
> str(x)
'data.frame': 5 obs. of 4 variables:
$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1 2 4 3 5
$ TokenFrequency: int 421 337 1411 458 455
$ TypeFrequency : int 271 103 735 18 37
$ Class : Factor w/ 2 levels "closed","open": 2 2 2 1 1
> x$TokenFrequency
> x$Class
character(0)
> str(x)
> attach(x)
> Class
> x$TokenFrequency
> TokenFrequency[4] <- 458
> detach(x)
> Class
> x
> x[2, 3]
[1] 103
> x[2, ]
> x[, 3]
> x$TypeFrequency
> x
> x[2:3, 4]
> x[3:4, 4]
[1] open closed
Levels: closed open
> x
TokenFrequency Class
1 421 open
3 1411 open
[1] 3 4 5
> x$TokenFrequency
[1] 271 103 735
> x
> x$TokenFrequency
> order(x$TokenFrequency)
[1] 2 1 5 4 3
> x$TokenFrequency[order(x$TokenFrequency)]
[1] 337 421 455 458 1411
[1] 2 1 5 4 3
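The variable ordering.index used repeatedly below is never assigned in the surviving text; each use was presumably preceded by a fresh assignment along these lines:

> (ordering.index <- order(x$TokenFrequency))   # row order by ascending token frequency
> x[ordering.index, ]                           # reorder the rows of x accordingly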
> x[ordering.index, ]
> x
> x[order(x$TokenFrequency), ]
> -x$TokenFrequency
> order(-x$TokenFrequency)
[1] 3 4 5 1 2
> x$TokenFrequency[order(-x$TokenFrequency)]
[1] 4 5 2 1 3
> x[ordering.index, ]
> x$Class
> str(x$Class)
> order(x$Class)
[1] 4 5 1 2 3
> x$Class[order(x$Class)]
[1] 4 5 1 2 3
> x[ordering.index, ]
[1] 5 4 2 1 3
> x[ordering.index, ]
[1] 4 5 3 1 2
> x[ordering.index, ]
PartOfSpeech TokenFrequency TypeFrequency Class
4 CONJ 458 18 closed
5 PREP 455 37 closed
3 N 1411 735 open
1 ADJ 421 271 open
2 ADV 337 103 open
> x
> dim(x)
[1] 5 4
[1] 5
[1] 4
[1] 2 5 1 4 3
> x[ordering.index, ]
> x[sample(dim(x)[1]), ]
> x$Class
> sort(x$Class)
> x$PartOfSpeech
> sort(x$PartOfSpeech)
> x$PartOfSpeech
> rank(x$PartOfSpeech)
[1] 1 2 4 3 5
> order(rank(x$PartOfSpeech))
[1] 1 2 4 3 5
> x$PartOfSpeech[order(rank(x$PartOfSpeech))]
> x$PartOfSpeech
> rank(x$PartOfSpeech)
[1] 1 2 4 3 5
> -rank(x$PartOfSpeech)
[1] -1 -2 -4 -3 -5
> x$PartOfSpeech[order(-rank(x$PartOfSpeech))]
4 Lists
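The command that builds a.list is missing from the extraction; the components printed below (a numeric vector, the data frame x, and a character vector) suggest something like:

> a.list <- list(1:10, x, c("This", "may", "be", "a", "sentence",
+                           "from", "a", "corpus", "file", "."))
> a.list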
[1] 1 2 3 4 5 6 7 8 9 10
PartOfSpeech TokenFrequency TypeFrequency Class
1 ADJ 421 271 open
2 ADV 337 103 open
3 N 1411 735 open
4 CONJ 458 18 closed
5 PREP 455 37 closed
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
PartOfSpeech TokenFrequency TypeFrequency Class
1 ADJ 421 271 open
2 ADV 337 103 open
3 N 1411 735 open
4 CONJ 458 18 closed
5 PREP 455 37 closed
[[3]]
[1] "This" "may" "be" "a" "sentence" "from"
[7] "a" "corpus" "file" "."
> str(a.list)
List of 3
$ : int [1:10] 1 2 3 4 5 6 7 8 9 10
$ :'data.frame': 5 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1 2 4 3 5
..$ TokenFrequency: int [1:5] 421 337 1411 458 455
..$ TypeFrequency : int [1:5] 271 103 735 18 37
..$ Class : Factor w/ 2 levels "closed","open": 2 2 2 1 1
$ : chr [1:10] "This" "may" "be" "a" ...
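The list components are then given names; the command is missing, but the $Part1/$Part2/$Part3 labels below imply:

> names(a.list) <- c("Part1", "Part2", "Part3")
> a.list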
$Part1
[1] 1 2 3 4 5 6 7 8 9 10
$Part2
PartOfSpeech TokenFrequency TypeFrequency Class
1 ADJ 421 271 open
2 ADV 337 103 open
3 N 1411 735 open
4 CONJ 458 18 closed
5 PREP 455 37 closed
$Part3
[1] "This" "may" "be" "a" "sentence" "from"
[7] "a" "corpus" "file" "."
$Part1
[1] 1 2 3 4 5 6 7 8 9 10
$Part2
PartOfSpeech TokenFrequency TypeFrequency Class
1 ADJ 421 271 open
2 ADV 337 103 open
3 N 1411 735 open
4 CONJ 458 18 closed
5 PREP 455 37 closed
$Part3
[1] "This" "may" "be" "a" "sentence" "from"
[7] "a" "corpus" "file" "."
> a.list[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
> a.list[[2]]
> a.list[[3]]
> a.list[1]
$Part1
[1] 1 2 3 4 5 6 7 8 9 10
> is.list(a.list[1])
[1] TRUE
> is.list(a.list[[1]])
[1] FALSE
> is.vector(a.list[[1]])
[1] TRUE
> a.list$Part1
[1] 1 2 3 4 5 6 7 8 9 10
> a.list[["Part1"]]
[1] 1 2 3 4 5 6 7 8 9 10
> a.list["Part1"]
$Part1
[1] 1 2 3 4 5 6 7 8 9 10
> a.list[c(1, 3)]
$Part1
[1] 1 2 3 4 5 6 7 8 9 10
$Part3
[1] "This" "may" "be" "a" "sentence" "from"
[7] "a" "corpus" "file" "."
> a.list[[1]][3]
[1] 3
> a.list[[1]][3:5]
[1] 3 4 5
> a.list[[1]][c(3, 5)]
[1] 3 5
> a.list[[2]][3, 2]
[1] 1411
> a.list[[2]][3, 2:4]
TokenFrequency TypeFrequency Class
3 1411 735 open
> a.list[[2]][3, c(2, 4)]
TokenFrequency Class
3 1411 open
> x <- a.list[[2]]
> y <- split(x, x$Class)
> str(y)
List of 2
$ closed:'data.frame': 2 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 3 5
..$ TokenFrequency: int [1:2] 458 455
..$ TypeFrequency : int [1:2] 18 37
..$ Class : Factor w/ 2 levels "closed","open": 1 1
$ open :'data.frame': 3 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1 2 4
..$ TokenFrequency: int [1:3] 421 337 1411
..$ TypeFrequency : int [1:3] 271 103 735
..$ Class : Factor w/ 2 levels "closed","open": 2 2 2
> y$open
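The output of y$open and the command producing the str() listing below are missing; the listing (five one-row data frames named after the parts of speech) points to a split by PartOfSpeech, e.g.:

> y <- split(x, x$PartOfSpeech)   # plausible reconstruction of the missing command
> str(y)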
List of 5
$ ADJ :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1
..$ TokenFrequency: int 421
..$ TypeFrequency : int 271
..$ Class : Factor w/ 2 levels "closed","open": 2
$ ADV :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 2
..$ TokenFrequency: int 337
..$ TypeFrequency : int 103
..$ Class : Factor w/ 2 levels "closed","open": 2
$ CONJ:'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 3
..$ TokenFrequency: int 458
..$ TypeFrequency : int 18
..$ Class : Factor w/ 2 levels "closed","open": 1
$ N :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 4
..$ TokenFrequency: int 1411
..$ TypeFrequency : int 735
..$ Class : Factor w/ 2 levels "closed","open": 2
$ PREP:'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 5
..$ TokenFrequency: int 455
..$ TypeFrequency : int 37
..$ Class : Factor w/ 2 levels "closed","open": 1
> y$ADJ
> y$N
> y$PREP
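The str() listing below (ten data frames named closed.ADJ, open.ADJ, and so on) points to a split on both factors at once; the command itself is missing, but it was presumably:

> str(split(x, list(x$Class, x$PartOfSpeech)))   # split on the Class-by-PartOfSpeech combinations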
List of 10
$ closed.ADJ :'data.frame': 0 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..:
..$ TokenFrequency: int(0)
..$ TypeFrequency : int(0)
..$ Class : Factor w/ 2 levels "closed","open":
$ open.ADJ :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 1
..$ TokenFrequency: int 421
..$ TypeFrequency : int 271
..$ Class : Factor w/ 2 levels "closed","open": 2
$ closed.ADV :'data.frame': 0 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..:
..$ TokenFrequency: int(0)
..$ TypeFrequency : int(0)
..$ Class : Factor w/ 2 levels "closed","open":
$ open.ADV :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 2
..$ TokenFrequency: int 337
..$ TypeFrequency : int 103
..$ Class : Factor w/ 2 levels "closed","open": 2
$ closed.CONJ:'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 3
..$ TokenFrequency: int 458
..$ TypeFrequency : int 18
..$ Class : Factor w/ 2 levels "closed","open": 1
$ open.CONJ :'data.frame': 0 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..:
..$ TokenFrequency: int(0)
..$ TypeFrequency : int(0)
..$ Class : Factor w/ 2 levels "closed","open":
$ closed.N :'data.frame': 0 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..:
..$ TokenFrequency: int(0)
..$ TypeFrequency : int(0)
..$ Class : Factor w/ 2 levels "closed","open":
$ open.N :'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 4
..$ TokenFrequency: int 1411
..$ TypeFrequency : int 735
..$ Class : Factor w/ 2 levels "closed","open": 2
$ closed.PREP:'data.frame': 1 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..: 5
..$ TokenFrequency: int 455
..$ TypeFrequency : int 37
..$ Class : Factor w/ 2 levels "closed","open": 1
$ open.PREP :'data.frame': 0 obs. of 4 variables:
..$ PartOfSpeech : Factor w/ 5 levels "ADJ","ADV","CONJ",..:
..$ TokenFrequency: int(0)
..$ TypeFrequency : int(0)
..$ Class : Factor w/ 2 levels "closed","open":
5 Character / String Processing
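The commands producing the first few outputs of this section are missing. A plausible reconstruction (the example strings are assumptions, chosen to match the visible output):

> nchar(c("I", "do", "not", "know"))   # number of characters per word: 1 2 3 4
> example.1 <- "international"
> substr(example.1, 6, 13)             # "national"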
[1] 1 2 3 4
[1] "national"
> substr(example.1, 2, 3)
> tolower(example.1)
> toupper(example.1)
> paste("I", "do", "not", "know", sep = " ", collapse = " ")
> paste(example.1)
[1] "I do not know"
[[1]]
[1] "I"
[[2]]
[1] "do"
[[3]]
[1] "not"
[[4]]
[1] "know"
> paste(list.1)
[[1]]
[1] "I" "do" "not" "know"
[[1]]
[1] "I" " " "d" "o" " " "n" "o" "t" " " "k" "n" "o" "w"
[[1]]
[1] "Hello" "Elmo"
[[2]]
[1] "Bye" "bye" "binky"
> text.1 <- "This is the first sentence. This is the second sentence."
> strsplit(text.1, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[24] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[47] "" "" "" "" "" "" "" "" "" ""
[[1]]
[1] "This is the first sentence" "This is the second sentence."
[[1]]
[1] "This" "is" "the" "first" "sentence"
[[2]]
[1] "This" "is" "the" "second" "sentence."
integer(0)
[1] 3 8
> str(list.1)
List of 4
$ : chr "I"
$ : chr "do"
$ : chr "not"
$ : chr "know"
> strsplit(unlist(list.1), " ")
[[1]]
[1] "I"
[[2]]
[1] "do"
[[3]]
[1] "not"
[[4]]
[1] "know"
[[1]]
[1] "I"
[[2]]
[1] "d" "o"
[[3]]
[1] "n" "o" "t"
[[4]]
[1] "k" "n" "o" "w"
6 More graphics
> VADeaths
> # ?VADeaths
> barplot(VADeaths)
[Figure: stacked barplot of VADeaths; y-axis 0 to 200]
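The command for the next figure did not survive; judging by the 0 to 70 scale, it was presumably the grouped version of the same barplot, still without a legend:

> barplot(VADeaths, beside = TRUE)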
[Figure: grouped barplot of VADeaths; y-axis 0 to 70]
> barplot(VADeaths, beside = TRUE, legend = TRUE, ylim = c(0, 90), ylab = "Deaths per 1000",
+ main = "Death rates in Virginia")
[Figure: "Death rates in Virginia"; grouped barplot with a legend for the age groups 50-54, 55-59, 60-64, 65-69, 70-74; y-axis "Deaths per 1000", 0 to 80]
> dotchart(VADeaths, xlim = c(0, 75), xlab = "Deaths per 1000", main = "Death rates in Virginia",
+ pch = 20, col = "blue")
[Figure: "Death rates in Virginia"; dotchart of VADeaths, one dot per age group (50-54 through 70-74) within Rural Male, Rural Female, Urban Male, and Urban Female; x-axis "Deaths per 1000", 0 to 75]
> # ?iris
> head(iris)
> str(iris)
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris)
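The call producing the following boxplot is missing; given the title and axis label in the figure, it was presumably along these lines (grouping by Species is an assumption):

> boxplot(Sepal.Length ~ Species, data = iris,
+         main = "Iris measurements", ylab = "Sepal length (cm)")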
[Figure: "Iris measurements"; boxplot of sepal length (cm), roughly 4.5 to 8.0, with one low outlier]
> # ?boxplot
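The commands that read in and inspect brown.corpus.structure are missing; the partial str() output below points to a tab-delimited file with columns such as Category, ContentOfCategory, GenreGroup, and NumberOfTexts, read with something like (the file name is a placeholder):

> brown.corpus.structure <- read.delim("brown_corpus_structure.txt")   # hypothetical file name
> str(brown.corpus.structure)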
$ ContentOfCategory: Factor w/ 15 levels "adventureWestern",..: 10 3 11 9 15 8 2 6 13 4 ...
$ NumberOfTexts : int 44 27 17 17 36 48 75 30 80 29 ...
> brown.corpus.structure
> # View(brown.corpus.structure)
>
> (number.of.texts.by.genre <- tapply(brown.corpus.structure$NumberOfTexts,
+ brown.corpus.structure$GenreGroup, sum))
>
> barplot(number.of.texts.by.genre, cex.names = 0.9, col = c("red",
+ "gray", "green", "blue"))
> legend(3, max(number.of.texts.by.genre), paste(names(number.of.texts.by.genre),
+ ":", number.of.texts.by.genre), cex = 0.9, fill = c("red", "gray",
+ "green", "blue"))
[Figure: barplot of number of texts by genre group, with legend: fiction : 126, generalProse : 206, learned : 80, press : 88]
>
> (number.of.texts.by.category <- tapply(brown.corpus.structure$NumberOfTexts,
+ brown.corpus.structure$Category, sum))
A B C D E F G H J K L M N P R
44 27 17 17 36 48 75 30 80 29 24 6 29 29 9
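The code for the next three figures is also missing; they appear to plot NumberOfTexts for the fifteen categories in three groups of five, each with a legend built via paste(). A sketch for the first group (the row range and graphical details are assumptions):

> bcs <- brown.corpus.structure[1:5, ]
> barplot(bcs$NumberOfTexts, names.arg = as.character(bcs$Category), horiz = TRUE)
> legend("topright", paste(bcs$Category, bcs$ContentOfCategory, ":", bcs$NumberOfTexts), cex = 0.9)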
[Figure: horizontal barplot of number of texts for categories A to E, with legend: A reportage : 44, B editorial : 27, C review : 17, D religion : 17, E skillsTradesHobbies : 36]
[Figure: horizontal barplot for categories F to K, with legend: F popularLore : 48, G belleslettresBiographiesEssays : 75, H miscellaneous : 30, J science : 80, K generalFiction : 29]
[Figure: horizontal barplot for categories L to R, with legend: L misteryDetectiveFiction : 24, M scienceFiction : 6, N adventureWestern : 29, P romanceLoveStory : 29, R humor : 9]
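Only the tail of the command that reads the tagged Brown corpus survives below (its sep and comment.char arguments); it was presumably a scan() call along these lines, with the actual file path lost in extraction:

> # plausible reconstruction; the file name is a placeholder
> brown.tagged <- scan("brown_tagged.txt", what = character(0),
+                      sep = "\n", comment.char = "")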
+ sep = "\n", comment.char = ""))
> head(brown.tagged)
[1] "|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR an_AT investigation_NN of_IN
[2] "|SA01:2 the_AT jury_NN further_RBR said_VBD in_IN term-end_NN presentments_NNS that_CS the_AT City_
[3] "|SA01:3 the_AT September-October_NP term_NN jury_NN had_HVD been_BEN charged_VBN by_IN Fulton_NP Su
[4] "|SA01:4 only_RB a_AT relative_JJ handful_NN of_IN such_JJ reports_NNS was_BEDZ received_VBN ,_, the
[5] "|SA01:5 the_AT jury_NN said_VBD it_PPS did_DOD find_VB that_CS many_AP of_IN Georgia's_NP$ registra
[6] "|SA01:6 it_PPS recommended_VBD that_CS Fulton_NP legislators_NNS act_VB to_TO have_HV these_DTS law
> tail(brown.tagged)
[1] "|SR09:93 as_CS you_PPSS can_MD count_VB on_IN me_PPO to_TO do_DO the_AT same_AP ._."
[2] "|SR09:94 compassionately_RB yours_PP$$ ,_, S._NP J._NP Perelman_NP revulsion_NN in_IN the_AT desert
[3] "|SR09:95 the_AT doors_NNS of_IN the_AT J_NP train_NN slid_VBD shut_VBN ,_, and_CC as_CS I_PPSS drop
[4] "|SR09:96 she_PPS was_BEDZ a_AT living_VBG doll_NN and_CC no_AT mistake_NN --_-- the_AT blue-black_J
[5] "|SR09:97 from_IN what_WDT I_PPSS was_BEDZ able_JJ to_IN gauge_NN in_IN a_AT swift_JJ ,_, greedy_JJ
[6] "|S:1 ._"
> length(brown.tagged)
[1] 57067
> tail(brown.tag.set)
Tag Description
221 WRB+DO WH-adverb + verb "to do", present, not 3rd person singular
222 WRB+DOD WH-adverb + verb "to do", past tense
223 WRB+DOD* WH-adverb + verb "to do", past tense, negated
224 WRB+DOZ WH-adverb + verb "to do", present tense, 3rd person singular
225 WRB+IN WH-adverb + preposition
226 WRB+MD WH-adverb + modal auxillary
Examples
221 howda
222 where'd how'd
223 whyn't
224 how's
225 why'n
226 where'd
> dim(brown.tag.set)
[1] 226 3
> str(brown.tag.set)
> # brown.tag.set
> brown.words.1 <- strsplit(brown.tagged, " ")
> head(brown.words.1)
[[1]]
[1] "|SA01:1" "the_AT" "Fulton_NP"
[4] "County_NN" "Grand_JJ" "Jury_NN"
[7] "said_VBD" "Friday_NR" "an_AT"
[10] "investigation_NN" "of_IN" "Atlanta's_NP$"
[13] "recent_JJ" "primary_NN" "election_NN"
[16] "produced_VBD" "no_AT" "evidence_NN"
[19] "that_CS" "any_DTI" "irregularities_NNS"
[22] "took_VBD" "place_NN" "._."
[[2]]
[1] "|SA01:2" "the_AT" "jury_NN"
[4] "further_RBR" "said_VBD" "in_IN"
[7] "term-end_NN" "presentments_NNS" "that_CS"
[10] "the_AT" "City_NN" "Executive_JJ"
[13] "Committee_NN" ",_," "which_WDT"
[16] "had_HVD" "over-all_JJ" "charge_NN"
[19] "of_IN" "the_AT" "election_NN"
[22] ",_," "deserves_VBZ" "the_AT"
[25] "praise_NN" "and_CC" "thanks_NNS"
[28] "of_IN" "the_AT" "City_NN"
[31] "of_IN" "Atlanta_NP" "for_IN"
[34] "the_AT" "manner_NN" "in_IN"
[37] "which_WDT" "the_AT" "election_NN"
[40] "was_BEDZ" "conducted_VBN" "._."
[[3]]
[1] "|SA01:3" "the_AT" "September-October_NP"
[4] "term_NN" "jury_NN" "had_HVD"
[7] "been_BEN" "charged_VBN" "by_IN"
[10] "Fulton_NP" "Superior_JJ" "Court_NN"
[13] "Judge_NN" "Durwood_NP" "Pye_NP"
[16] "to_TO" "investigate_VB" "reports_NNS"
[19] "of_IN" "possible_JJ" "irregularities_NNS"
[22] "in_IN" "the_AT" "hard-fought_JJ"
[25] "primary_NN" "which_WDT" "was_BEDZ"
[28] "won_VBN" "by_IN" "Mayor-nominate_NN"
[31] "Ivan_NP" "Allen_NP" "Jr._NP"
[34] "._."
[[4]]
[1] "|SA01:4" "only_RB" "a_AT" "relative_JJ"
[5] "handful_NN" "of_IN" "such_JJ" "reports_NNS"
[9] "was_BEDZ" "received_VBN" ",_," "the_AT"
[13] "jury_NN" "said_VBD" ",_," "considering_IN"
[17] "the_AT" "widespread_JJ" "interest_NN" "in_IN"
[21] "the_AT" "election_NN" ",_," "the_AT"
[25] "number_NN" "of_IN" "voters_NNS" "and_CC"
[29] "the_AT" "size_NN" "of_IN" "this_DT"
[33] "city_NN" "._."
[[5]]
[1] "|SA01:5" "the_AT" "jury_NN"
[4] "said_VBD" "it_PPS" "did_DOD"
[7] "find_VB" "that_CS" "many_AP"
[10] "of_IN" "Georgia's_NP$" "registration_NN"
[13] "and_CC" "election_NN" "laws_NNS"
[16] "are_BER" "outmoded_JJ" "or_CC"
[19] "inadequate_JJ" "and_CC" "often_RB"
[22] "ambiguous_JJ" "._."
[[6]]
[1] "|SA01:6" "it_PPS" "recommended_VBD"
[4] "that_CS" "Fulton_NP" "legislators_NNS"
[7] "act_VB" "to_TO" "have_HV"
[10] "these_DTS" "laws_NNS" "studied_VBN"
[13] "and_CC" "revised_VBN" "to_IN"
[16] "the_AT" "end_NN" "of_IN"
[19] "modernizing_VBG" "and_CC" "improving_VBG"
[22] "them_PPO" "._."
> tail(brown.words.1)
[[1]]
[1] "|SR09:93" "as_CS" "you_PPSS" "can_MD" "count_VB" "on_IN"
[7] "me_PPO" "to_TO" "do_DO" "the_AT" "same_AP" "._."
[[2]]
[1] "|SR09:94" "compassionately_RB" "yours_PP$$"
[4] ",_," "S._NP" "J._NP"
[7] "Perelman_NP" "revulsion_NN" "in_IN"
[10] "the_AT" "desert_NN"
[[3]]
[1] "|SR09:95" "the_AT" "doors_NNS" "of_IN"
[5] "the_AT" "J_NP" "train_NN" "slid_VBD"
[9] "shut_VBN" ",_," "and_CC" "as_CS"
[13] "I_PPSS" "dropped_VBD" "into_IN" "a_AT"
[17] "seat_NN" "and_CC" ",_," "exhaling_VBG"
[21] ",_," "looked_VBD" "up_RP" "across_IN"
[25] "the_AT" "aisle_NN" ",_," "the_AT"
[29] "whole_JJ" "aviary_NN" "in_IN" "my_PP$"
[33] "head_NN" "burst_VBD" "into_IN" "song_NN"
[37] "._."
[[4]]
[1] "|SR09:96" "she_PPS" "was_BEDZ"
[4] "a_AT" "living_VBG" "doll_NN"
[7] "and_CC" "no_AT" "mistake_NN"
[10] "--_--" "the_AT" "blue-black_JJ"
[13] "bang_NN" ",_," "the_AT"
[16] "wide_JJ" "cheekbones_NNS" ",_,"
[19] "olive-flushed_JJ" ",_," "that_WPS"
[22] "betrayed_VBD" "the_AT" "Cherokee_NP"
[25] "strain_NN" "in_IN" "her_PP$"
[28] "Midwestern_JJ" "lineage_NN" ",_,"
[31] "and_CC" "the_AT" "mouth_NN"
[34] "whose_WP$" "only_AP" "fault_NN"
[37] ",_," "in_IN" "the_AT"
[40] "novelist's_NN$" "carping_VBG" "phrase_NN"
[43] ",_," "was_BEDZ" "that_CS"
[46] "the_AT" "lower_JJR" "lip_NN"
[49] "was_BEDZ" "a_AT" "trifle_NN"
[52] "too_QL" "voluptuous_JJ" "._."
[[5]]
[1] "|SR09:97" "from_IN" "what_WDT"
[4] "I_PPSS" "was_BEDZ" "able_JJ"
[7] "to_IN" "gauge_NN" "in_IN"
[10] "a_AT" "swift_JJ" ",_,"
[13] "greedy_JJ" "glance_NN" ",_,"
[16] "the_AT" "figure_NN" "inside_IN"
[19] "the_AT" "coral-colored_JJ" "boucle_NN"
[22] "dress_NN" "was_BEDZ" "stupefying_VBG"
[25] "._."
[[6]]
[1] "|S:1" "._"
> length(unlist(brown.words.1))
[1] 1194534
> words.per.sentence.1 <- sapply(brown.words.1, length)
> str(words.per.sentence.1)
int [1:57067] 24 42 34 34 23 23 42 3 25 24 ...
> sum(words.per.sentence.1)
[1] 1194534
> mean(words.per.sentence.1)
[1] 20.93
> summary(words.per.sentence.1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 11.0 19.0 20.9 28.0 181.0
> boxplot(words.per.sentence.1)
[Figure: boxplot of words.per.sentence.1; the box sits below about 30 words per sentence, with many outliers ranging up to roughly 180]
References
Abelson, R.P. (1995). Statistics as Principled Argument. L. Erlbaum Associates.
Baayen, R. Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge
University Press.
Braun, J. and D.J. Murdoch (2007). A First Course in Statistical Programming with R. Cambridge University
Press.
De Veaux, R.D. et al. (2005). Stats: Data and Models. Pearson Education, Limited.
Diez, D. et al. (2013). OpenIntro Statistics: Second Edition. CreateSpace Independent Publishing Platform.
URL: http://www.openintro.org/stat/textbook.php.
Faraway, J.J. (2004). Linear Models With R. Chapman & Hall Texts in Statistical Science Series. Chapman &
Hall/CRC.
Francis, W. N. and H. Kucera (1979). Brown Corpus Manual. Tech. rep. Department of Linguistics, Brown
University, Providence, Rhode Island, US. URL: http://icame.uib.no/brown/bcm.html.
Gelman, A. and J. Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical
Methods for Social Research. Cambridge University Press.
Gries, S.T. (2009). Quantitative Corpus Linguistics with R: A Practical Introduction. Taylor & Francis.
— (2013). Statistics for Linguistics with R: A Practical Introduction, 2nd Edition. Mouton De Gruyter.
Johnson, K. (2008). Quantitative Methods in Linguistics. Blackwell Publishing.
Kruschke, John K. (2011). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press/Elsevier.
Miles, J. and M. Shevlin (2001). Applying Regression and Correlation: A Guide for Students and Researchers.
SAGE Publications.
Wright, D.B. and K. London (2009). Modern regression techniques using R: A practical guide for students and
researchers. SAGE.
Xie, Yihui (2013). Dynamic Documents with R and knitr. Chapman and Hall/CRC.