String R
A,B,C(AIDS) Date:
R Strings
Creation of String in R
R strings are created by assigning quoted character values to a variable. These strings can then be
concatenated, using functions such as paste(), to form a longer string.
Example
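A minimal sketch of string creation and concatenation. The variable names first, second, and place are hypothetical and chosen only for illustration:

```r
# Strings may be written with double or single quotes
first <- "South"
second <- 'Pole'

# paste() joins the pieces, separated by a space by default
place <- paste(first, second)
print(place)          # "South Pole"
print(class(place))   # "character"
```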
An Overview of String-Manipulation Functions
nchar(): The call nchar(x) returns the length of a string x.
For example, the string "South Pole" has 10 characters. C programmers, take note: no null
character terminates R strings. Note that the results of nchar() may be unpredictable if x
is not in character mode.
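The count described above can be reproduced with a short sketch. The variable names n and lens are assumptions for illustration:

```r
# The space between the two words counts as a character
n <- nchar("South Pole")
print(n)      # 10

# nchar() also works element by element on a character vector
lens <- nchar(c("Equator", "North Pole", "South Pole"))
print(lens)   # 7 10 10
```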
grep(): The call grep(pattern,x) searches for a specified substring pattern in a vector x of strings.
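A minimal sketch of two such calls, assuming the three-element vector that appears in the regular-expression examples of these notes. The names places, hit, miss, and loose are hypothetical:

```r
places <- c("Equator", "North Pole", "South Pole")

# grep() returns the indices of the elements that contain the pattern
hit <- grep("Pole", places)
print(hit)    # 2 3

# Matching is case-sensitive, so lowercase "pole" matches nothing
miss <- grep("pole", places)
print(miss)   # integer(0)

# ignore.case = TRUE relaxes the comparison
loose <- grep("pole", places, ignore.case = TRUE)
print(loose)  # 2 3
```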
Explanation: In the first case, the string "Pole" was found in elements 2 and 3 of the second
argument, hence the output (2,3). In the second case, the string "pole" was not found anywhere,
because matching is case-sensitive, so an empty vector was returned.
paste(): The call paste(...) concatenates several strings, returning the result in one long string.
Note: the optional argument sep can be used to put something other than a space between the
pieces being spliced together. If you specify sep as an empty string, the pieces won’t have any
character between them.
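The effect of sep can be sketched as follows. The input strings and variable names are chosen only for illustration:

```r
# Default separator is a single space
a <- paste("North", "Pole")
print(a)    # "North Pole"

# An empty sep leaves nothing between the pieces
b <- paste("North", "Pole", sep = "")
print(b)    # "NorthPole"

# Any other string may serve as the separator
c3 <- paste("North", "Pole", sep = ".")
print(c3)   # "North.Pole"

# More than two pieces can be joined in one call
d <- paste("North", "and", "South", "Poles")
print(d)    # "North and South Poles"
```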
sprintf(): The call sprintf(...) assembles a string from parts in a formatted manner.
> i <- 8
> s <- sprintf("the square of %d is %d", i, i^2)
> s # Output: "the square of 8 is 64"
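Explanation: The name of the function is intended to evoke string print, for “printing” to a string
rather than to the screen. Here, we are printing to the string s.
What are we printing? The format string says to first print “the square of”, then the decimal
value of i, then “is”, and finally the decimal value of i^2. Decimal here means base 10, as
requested by %d, not that the result contains a decimal point.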
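Other format specifiers follow the same pattern. The sketch below uses %s for a string and %.2f for a number rounded to two decimal places. Neither specifier appears in the notes above, and the variable names are hypothetical:

```r
place <- "South Pole"

# %s inserts a string, %d an integer (nchar() returns an integer)
msg <- sprintf("%s has %d characters", place, nchar(place))
print(msg)       # "South Pole has 10 characters"

# %.2f formats a number with two digits after the decimal point
rounded <- sprintf("%.2f", pi)
print(rounded)   # "3.14"
```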
substr(): The call substr(x,start,stop) returns the substring in the given character position range
start:stop in the given string x.
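A short sketch of substr(), using the word "Equator" as an assumed input:

```r
# Characters 3 through 5 of "Equator" are u, a, t
piece <- substr("Equator", 3, 5)
print(piece)   # "uat"
```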
strsplit(): The call strsplit(x,split) splits a string x into an R list of substrings, using the
string split as the delimiter.
> strsplit("6-16-2011",split="-") # Output: a list whose single element is "6" "16" "2011"
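Because strsplit() returns a list, the pieces usually have to be extracted before further use. A minimal sketch, with hypothetical variable names parts, fields, and year:

```r
parts <- strsplit("6-16-2011", split = "-")
print(class(parts))   # "list"

# The list has one element per input string; take the first
fields <- parts[[1]]
print(fields)         # "6" "16" "2011"

# The pieces are still strings, so convert them when numbers are needed
year <- as.integer(fields[3])
print(year)           # 2011
```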
regexpr(): The call regexpr(pattern, text) finds the character position of the first instance of
pattern within text, as in this example:
> regexpr("uat", "Equator") # Output: 3
gregexpr(): The call gregexpr(pattern, text) is the same as regexpr(), but it finds all instances of
pattern.
> gregexpr("iss","Mississippi") # Output: 2 5
Explanation: This finds that “iss” appears twice in “Mississippi,” starting at character positions
2 and 5.
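Both functions return more than the bare positions: the result also carries a match.length attribute, and gregexpr() wraps its result in a list. A sketch of how the positions might be pulled out, with hypothetical variable names:

```r
# regexpr() reports only the first match
first_hit <- regexpr("iss", "Mississippi")
print(as.integer(first_hit))   # 2

# gregexpr() returns a list with one element per input string
all_hits <- gregexpr("iss", "Mississippi")[[1]]
print(as.integer(all_hits))    # 2 5

# The length of each match is stored as an attribute
hit_lengths <- as.integer(attr(all_hits, "match.length"))
print(hit_lengths)             # 3 3
```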
String Replacement:
sub(): Replace the first occurrence of a pattern in a string.
gsub(): Replace all occurrences of a pattern in a string.
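The difference between the two can be sketched on the word "Mississippi", which is used here only as an assumed example input:

```r
word <- "Mississippi"

# sub() replaces only the first match of the pattern
once <- sub("ss", "SS", word)
print(once)    # "MiSSissippi"

# gsub() replaces every match
every <- gsub("ss", "SS", word)
print(every)   # "MiSSiSSippi"
```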
String Trimming:
trimws(): Remove leading and trailing whitespace.
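A minimal sketch of trimws(), including its which argument for trimming only one side. The padded input string is an assumption for illustration:

```r
padded <- "  South Pole  "

# By default both ends are trimmed
both <- trimws(padded)
print(both)    # "South Pole"

# which = "left" or which = "right" trims one side only
left <- trimws(padded, which = "left")
right <- trimws(padded, which = "right")
print(left)    # "South Pole  "
print(right)   # "  South Pole"
```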
String Slicing
str <- "Learn Code"
len <- nchar(str) # number of characters in str: 10
print(substr(str, 1, 4)) # first four characters: "Lear"
print(substr(str, len-2, len)) # last three characters: "ode"
Case Conversion
toupper(), which converts all the characters to upper case,
tolower(), which converts all the characters to lower case, and
casefold(…, upper=TRUE/FALSE), which converts to upper case when upper is TRUE and to
lower case when it is FALSE. All of these functions also accept a vector of several strings.
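The three functions can be sketched on a small character vector. The vector and variable names are assumptions for illustration:

```r
places <- c("Equator", "North Pole")

up <- toupper(places)
print(up)       # "EQUATOR" "NORTH POLE"

low <- tolower(places)
print(low)      # "equator" "north pole"

# casefold() picks the direction from its upper argument
cf_up <- casefold(places, upper = TRUE)
cf_low <- casefold(places, upper = FALSE)
print(cf_up)    # "EQUATOR" "NORTH POLE"
print(cf_low)   # "equator" "north pole"
```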
Regular Expressions
When dealing with string-manipulation functions in programming languages, the
notion of regular expressions sometimes arises. In R, we must pay attention to this
point when using the string functions grep(), grepl(), regexpr(), gregexpr(), sub(),
gsub(), and strsplit().
A regular expression is a kind of wild card. It’s shorthand to specify broad classes of
strings. For example, the expression "[au]" refers to any string that contains either of
the letters a or u.
> grep("[au]", c("Equator", "North Pole", "South Pole")) # Output: 1 3
This reports that elements 1 and 3 of ("Equator", "North Pole", "South Pole")—that
is, “Equator” and “South Pole”—contain either an a or a u.
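grepl(), listed above but not demonstrated, answers the same question with one TRUE or FALSE per element instead of indices. A sketch using the same pattern, with hypothetical variable names:

```r
places <- c("Equator", "North Pole", "South Pole")

# One logical value per element of the vector
has_au <- grepl("[au]", places)
print(has_au)    # TRUE FALSE TRUE

# value = TRUE makes grep() return the matching strings themselves
matched <- grep("[au]", places, value = TRUE)
print(matched)   # "Equator" "South Pole"
```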
A period (.) represents any single character. Here’s an example of using it:
> grep("o.e", c("Equator", "North Pole", "South Pole")) # Output: 2 3
But what if you want to search for a period using grep()? Here’s the naive approach:
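A sketch of such a call, assuming a hypothetical three-element vector (named files here for illustration) in which only the third string contains a period:

```r
# Hypothetical test strings: only the third contains a period
files <- c("abc", "de", "f.g")

# An unescaped "." matches any single character, so every element matches
naive <- grep(".", files)
print(naive)   # 1 2 3
```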
The result should have been 3, not (1, 2, 3). This call failed because periods are
metacharacters: the pattern "." matches any single character, so every nonempty string
matches. We need to escape the metacharacter nature of the period, which is
done via a backslash:
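A sketch of the escaped search, assuming a hypothetical three-element vector (named files here for illustration) in which only the third string contains a period:

```r
# Hypothetical test strings: only the third contains a period
files <- c("abc", "de", "f.g")

# The escaped period matches only a literal "."
escaped <- grep("\\.", files)
print(escaped)   # 3

# Alternatively, fixed = TRUE turns off regular-expression matching
literal <- grep(".", files, fixed = TRUE)
print(literal)   # 3
```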
Now, didn’t I say a backslash? Then why are there two? Well, the sad truth is that the
backslash itself must be escaped, which is accomplished by its own backslash!
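The double backslash exists only in the source code. A short sketch showing that the string handed to the regular-expression engine holds a single backslash followed by a period (the variable name pattern is an assumption):

```r
pattern <- "\\."

# The string holds two characters: one backslash and one period
n_chars <- nchar(pattern)
print(n_chars)        # 2

# writeLines() shows the raw contents, without R's escaping
writeLines(pattern)   # \.
```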