Lec 07
Lec 7: Strings
Owen G. Ward
• Basics of Strings
• Basic String Operations
• Regular Expressions
library(tidyverse)
library(stringr)
Reading
Required reading:
Useful reference:
Working with …
• Character string manipulation in base R has evolved over time into a bit of a patchwork
of tools.
• The names and functionality of these tools have been taken from string manipulation tools
in Unix and scripting languages like Perl.
• The stringr package aims for a cleaner interface for tasks that relate to detecting,
extracting, replacing and splitting on substrings.
mystrings <- c(
"one fish", "two fish",
"red fish", "blue fish"
)
str_length(mystrings)
[1] 8 8 8 9
Combining Strings with str_c()
str_c(mystrings[1], mystrings[2])
[1] "one fishtwo fish"
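A minimal sketch of how str_c() behaves with its sep and collapse arguments, and with a missing value. The separator ", " and the object names pair, joined and with_na are chosen here for illustration only.

```r
library(stringr)

mystrings <- c("one fish", "two fish", "red fish", "blue fish")

# `sep` is placed between the arguments being combined.
pair <- str_c(mystrings[1], mystrings[2], sep = ", ")
pair
# [1] "one fish, two fish"

# `collapse` joins all elements of a single vector into one string.
joined <- str_c(mystrings, collapse = ", ")
joined
# [1] "one fish, two fish, red fish, blue fish"

# A missing input makes the whole result missing.
with_na <- str_c(mystrings[1], NA)
with_na
# [1] NA
```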
str_sub(mystrings, 1, 3)
str_sub(mystrings, 1, 10000)
str_sub(mystrings, 9, 10000)
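For reference, a short sketch of what these str_sub() calls should return on the mystrings vector defined earlier. The object names first3, whole and tail9 are illustrative.

```r
library(stringr)

mystrings <- c("one fish", "two fish", "red fish", "blue fish")

# Characters 1 to 3 of each string.
first3 <- str_sub(mystrings, 1, 3)
first3
# [1] "one" "two" "red" "blu"

# An end position past the end of the string is truncated silently.
whole <- str_sub(mystrings, 1, 10000)

# A start position past the end of the string gives an empty string.
tail9 <- str_sub(mystrings, 9, 10000)
tail9
# [1] ""  ""  ""  "h"
```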
Exercise 1 Part 1
Exercise 1 Part 2
2. extract the last four characters of each of the three components; and
3. combine the three components into one string, separated by a plus-sign.
Note: These are separate sub-exercises. (2) does not follow from (1), etc.
• Fixed strings are interpreted literally, while regular expressions are a language for
specifying patterns.
– For example, “fish” is fixed and matches only “fish”, while “f[aeiou]sh” matches
“fash”, “fesh”, …, “fush”.
• Functions from stringr that detect/find/extract/substitute strings can do so with either
fixed strings or regular expressions.
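The difference can be sketched with str_detect(). The test vector x is made up for illustration; patterns are treated as regular expressions by default, and fixed() requests a literal match.

```r
library(stringr)

# Hypothetical test strings, including the pattern text itself.
x <- c("fish", "fash", "fush", "f[aeiou]sh")

# Default: the pattern is a regular expression, so [aeiou] means "one vowel".
as_regex <- str_detect(x, "f[aeiou]sh")
as_regex
# [1]  TRUE  TRUE  TRUE FALSE

# fixed(): the brackets are matched literally, character by character.
as_fixed <- str_detect(x, fixed("f[aeiou]sh"))
as_fixed
# [1] FALSE FALSE FALSE  TRUE
```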
• We will illustrate these functions with fixed strings first, then discuss regular expressions.
• The text used to discuss regular expressions first but has now created a full chapter on
them. We won’t go into much detail.
• Regular expressions are (for me) difficult to remember and something I always have to
Google.
mystrings[str_detect(mystrings, pattern)]
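A small sketch of this subsetting idiom. The value of pattern below is a hypothetical fixed string, picked so that only some elements match; it is not necessarily the one used in the lecture.

```r
library(stringr)

mystrings <- c("one fish", "two fish", "red fish", "blue fish")
pattern <- "e fish"  # hypothetical pattern, for illustration only

# str_detect() returns one logical value per string.
hits <- str_detect(mystrings, pattern)
hits
# [1]  TRUE FALSE FALSE  TRUE

# Use the logical vector to keep only the matching strings.
kept <- mystrings[hits]
kept
# [1] "one fish"  "blue fish"

# str_subset() is a shortcut for the same operation.
kept2 <- str_subset(mystrings, pattern)
```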
• (We will later see that we can specify a more general pattern than a fixed string.)
• str_locate() returns the start and stop positions of the first occurrence of a string.
• str_locate_all() returns the start and stop of all occurrences.
str_locate(Seuss, pattern)
     start end
[1,]     5   8
str_locate_all(Seuss, pattern)
[[1]]
     start end
[1,]     5   8
[2,]    15  18
[3,]    25  28
[4,]    36  39
Seuss
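The definitions of Seuss and pattern are not shown above. The sketch below assumes Seuss is the four phrases joined by ", " and that pattern is the fixed string "fish"; this assumption reproduces the start and end positions printed above, but it is a reconstruction.

```r
library(stringr)

mystrings <- c("one fish", "two fish", "red fish", "blue fish")

# Assumed definitions (not given in the notes).
Seuss <- str_c(mystrings, collapse = ", ")
pattern <- "fish"

# First occurrence only: a one-row matrix with columns start and end.
loc <- str_locate(Seuss, pattern)

# All occurrences: a list with one matrix per input string.
all_loc <- str_locate_all(Seuss, pattern)[[1]]
all_loc
```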
Splitting Strings
• Some characters in strings, such as ., have a special meaning (more in a minute). One
option is to wrap such patterns in fixed() for a fixed string.
[[1]]
[1] "" "" "" "" "" ""
[[2]]
[1] "" "" "" "" "" ""
[[1]]
[1] "20" "50"
[[2]]
[1] "33" "33"
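The calls that produced this output are not shown. A sketch that gives the same two results, assuming a hypothetical input vector prices <- c("20.50", "33.33"):

```r
library(stringr)

# Hypothetical input; the original vector is not given in the notes.
prices <- c("20.50", "33.33")

# As a regular expression, "." matches every character, so each
# five-character string is split into six empty pieces.
split_regex <- str_split(prices, ".")

# fixed(".") treats the dot literally and splits at the decimal point.
split_fixed <- str_split(prices, fixed("."))
split_fixed[[1]]
# [1] "20" "50"

# Escaping the dot inside a regular expression has the same effect.
split_escaped <- str_split(prices, "\\.")
```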
Regular expressions
• Regular expressions (abbreviated regexps) are recipes used to specify search patterns.
• We use character strings to specify regexps in R.
• Regular expressions are a complex topic. We’ll only cover (some of) the basics.
• To illustrate pattern matching, use a simple pattern p.n, meaning p followed by any
character, followed by n.
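A quick sketch of the p.n pattern on a made-up vector x, together with the escaped form p\\.n, which matches a literal dot. In an R string the backslash itself must be doubled.

```r
library(stringr)

# Hypothetical test strings.
x <- c("pan", "p.n", "pn", "spin")

# "." stands for any single character, so "pn" (nothing in between) fails.
any_char <- str_detect(x, "p.n")
any_char
# [1]  TRUE  TRUE FALSE  TRUE

# "\\." matches only a literal dot.
literal_dot <- str_detect(x, "p\\.n")
literal_dot
# [1] FALSE  TRUE FALSE FALSE
```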
Matching Special Characters
mystrings
pattern
[1] "p.n"
str_split(mystrings, pattern)
[[1]]
[1] "" "eapple"
[[2]]
[1] "apple"
[[3]]
[1] "" ""
str_locate(mystrings, pattern)
     start end
[1,]     1   3
[2,]    NA  NA
[3,]     1   3
mystrings
pattern
[1] "p.n"
str_extract(mystrings, pattern)
str_match(mystrings, pattern)
     [,1]
[1,] "pin"
[2,] NA
[3,] "pen"
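The contents of mystrings are not printed above. The sketch below assumes mystrings <- c("pineapple", "apple", "pen"), which is consistent with every output shown for the pattern p.n; treat it as a reconstruction.

```r
library(stringr)

# Assumed input (not shown in the notes).
mystrings <- c("pineapple", "apple", "pen")
pattern <- "p.n"

pieces <- str_split(mystrings, pattern)    # text on either side of the match
where <- str_locate(mystrings, pattern)    # start/end of the first match
found <- str_extract(mystrings, pattern)   # the matched text, NA if none
found
# [1] "pin" NA    "pen"

# str_match() returns a matrix; column 1 holds the full match.
m <- str_match(mystrings, pattern)
```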
Replacing patterns
str_replace(mystrings, pattern, "p.n")
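A sketch of what this call does, again assuming mystrings <- c("pineapple", "apple", "pen") and pattern <- "p.n". The dot is special only in the pattern; in the replacement it is an ordinary character. The string "pin pen pan" is a made-up example.

```r
library(stringr)

# Assumed input (not shown in the notes).
mystrings <- c("pineapple", "apple", "pen")
pattern <- "p.n"

# The first match in each string is replaced by the literal text "p.n".
replaced <- str_replace(mystrings, pattern, "p.n")
replaced
# [1] "p.neapple" "apple"     "p.n"

# str_replace() changes only the first match; str_replace_all() changes all.
first_only <- str_replace("pin pen pan", pattern, "X")
every_one <- str_replace_all("pin pen pan", pattern, "X")
```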
Exercise 2
str_extract(mystrings, pattern2)
Numerical quantifiers
• Use {n} to require exactly n matches, {n,} to require n or more, {0,m} to allow at most m,
and {n,m} to require between n and m.
str_extract(mystrings, "f.{6}n")
str_extract(mystrings, "f.{1,13}n")
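A sketch of these two patterns on a single made-up string x (the mystrings vector in use at this point is not shown). Quantifiers are greedy by default; adding ? after one makes it match as little as possible.

```r
library(stringr)

# Hypothetical input string.
x <- "fan, fin, fun"

# f, then exactly six characters of any kind, then n.
exact <- str_extract(x, "f.{6}n")
exact
# [1] "fan, fin"

# Greedy: takes as many characters as it can while still ending in n.
greedy <- str_extract(x, "f.{1,13}n")
greedy
# [1] "fan, fin, fun"

# Lazy: stops at the first n it can reach.
lazy <- str_extract(x, "f.{1,13}?n")
lazy
# [1] "fan"
```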
Anchors
str_extract(mystrings, "e$")
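A brief sketch of the two anchors, ^ for the start of the string and $ for the end, on a made-up vector x:

```r
library(stringr)

# Hypothetical test strings.
x <- c("apple", "pen", "eel")

# "e$" matches an e only when it is the last character.
ends_in_e <- str_extract(x, "e$")
ends_in_e
# [1] "e" NA  NA

# "^e" matches an e only when it is the first character.
starts_with_e <- str_detect(x, "^e")
starts_with_e
# [1] FALSE FALSE  TRUE
```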
Exercise 3
• Create a regular expression that matches words that are exactly three letters long.
pattern4 <- "f[aeiou]*n"
mystrings <- c(
"fan", "fin", "fun", "fan, fin, fun",
"friend", "faint"
)
str_extract(mystrings, pattern4)
mystrings
pattern4
[1] "f[aeiou]*n"
str_extract_all(mystrings, pattern4)
[[1]]
[1] "fan"
[[2]]
[1] "fin"
[[3]]
[1] "fun"
[[4]]
[1] "fan" "fin" "fun"
[[5]]
character(0)
[[6]]
[1] "fain"
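For comparison, a sketch of what str_extract() gives with the same inputs: only the first match per string, with NA where nothing matches. The "fn" example is added here to show that * also allows zero vowels.

```r
library(stringr)

pattern4 <- "f[aeiou]*n"
mystrings <- c(
  "fan", "fin", "fun", "fan, fin, fun",
  "friend", "faint"
)

# One result per string: the first match, or NA.
first_only <- str_extract(mystrings, pattern4)
first_only
# [1] "fan"  "fin"  "fun"  "fan"  NA     "fain"

# "*" means zero or more, so f immediately followed by n also matches.
zero_vowels <- str_extract("fn", pattern4)

# simplify = TRUE returns a character matrix instead of a list.
as_matrix <- str_extract_all(mystrings, pattern4, simplify = TRUE)
```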
Short-hands for Common Character Classes
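A minimal sketch of three common short-hands: \d (a digit), \s (whitespace) and \w (a word character). Each backslash is doubled inside an R string. The input x is made up for illustration.

```r
library(stringr)

# Hypothetical input string.
x <- "one fish, 2 fish"

# \d matches a single digit.
digits <- str_extract_all(x, "\\d")[[1]]
digits
# [1] "2"

# \w+ matches a run of letters, digits or underscores.
words <- str_extract_all(x, "\\w+")[[1]]
words
# [1] "one"  "fish" "2"    "fish"

# \s matches whitespace; str_count() counts the matches.
n_spaces <- str_count(x, "\\s")
n_spaces
# [1] 3
```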
Exercise 4
• Create a regular expression that matches words that end in ed but not eed.
Alternatives
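A minimal sketch of alternation with |, which matches either the pattern on its left or the one on its right. Parentheses limit how far the alternation reaches. The patterns and the grey/gray vector are illustrative.

```r
library(stringr)

mystrings <- c("one fish", "two fish", "red fish", "blue fish")

# Match either colour word.
colour <- str_extract(mystrings, "red|blue")
colour
# [1] NA     NA     "red"  "blue"

# Parentheses restrict the alternatives to a single position.
spelling <- str_detect(c("grey", "gray", "groy"), "gr(e|a)y")
spelling
# [1]  TRUE  TRUE FALSE
```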
Converting Case
str_to_upper(Seuss)
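A short sketch of the case-conversion functions, assuming Seuss is the four phrases joined by ", " (its definition is not shown in these notes).

```r
library(stringr)

# Assumed definition of Seuss.
Seuss <- str_c(
  c("one fish", "two fish", "red fish", "blue fish"),
  collapse = ", "
)

upper <- str_to_upper(Seuss)
upper
# [1] "ONE FISH, TWO FISH, RED FISH, BLUE FISH"

# str_to_lower() reverses the conversion for this all-lowercase input.
lower <- str_to_lower(upper)
```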
In summary…