
Lenguaje R C2

Uploaded by

amaury bascos

Statistical Language

2
Working with Logical Vectors.
You can construct vectors that contain only logical values and use them as
an argument for the index functions.
Comparing values
To build logical vectors, you’d better know how to compare values, and R
contains a set of operators that you can use for this purpose.

All these operators are, again, vectorized.
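As a minimal sketch, the comparison operators in R are ==, !=, <, <=, > and >=. The vector scores below is a made-up example; each comparison is applied element by element and returns a logical vector.

```r
# Hypothetical example vector
scores <- c(12, 4, 4, 6, 9, 3)

scores == 4   # equal to:                 FALSE TRUE TRUE FALSE FALSE FALSE
scores != 4   # not equal to:             TRUE FALSE FALSE TRUE TRUE TRUE
scores <  6   # less than:                FALSE TRUE TRUE FALSE FALSE TRUE
scores <= 6   # less than or equal to:    FALSE TRUE TRUE TRUE FALSE TRUE
scores >  6   # greater than:             TRUE FALSE FALSE FALSE TRUE FALSE
scores >= 6   # greater than or equal to: TRUE FALSE FALSE TRUE TRUE FALSE
```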


> baskets.of.Granny
[1] 12 4 4 6 9 3
> baskets.of.Geraldine
[1] 5 3 2 2 12 9

> baskets.of.Granny > 5


[1] TRUE FALSE FALSE TRUE TRUE FALSE

To find out in which games Granny scored more than five baskets, use the
which() function, which is especially handy for long vectors.
> which(baskets.of.Granny > 5)
[1] 1 4 5

The which() function takes a logical vector as its argument. Hence, you can
save the outcome of a comparison in an object and pass that object to
which():
> the.best <- baskets.of.Geraldine < baskets.of.Granny
> the.best
[1] TRUE TRUE TRUE TRUE FALSE FALSE
> which(the.best)
[1] 1 2 3 4
Using logical vectors as indices
The index function doesn’t take only numerical vectors as arguments; it also
works with logical vectors. If you use a logical vector to index, R returns a
vector with only the values for which the logical vector is TRUE.
> baskets.of.Granny[the.best]
[1] 12 4 4 6
> baskets.of.Granny[baskets.of.Granny > baskets.of.Geraldine]
[1] 12 4 4 6

> x <- c(3, 6, 1, NA, 2)


> x[x > 2]
[1] 3 6 NA
> x > 2
[1] TRUE TRUE FALSE NA FALSE

> x <- c(3, 6, 1, NA, 2)


> mayor.que.dos <- x > 2
> mayor.que.dos
[1] TRUE TRUE FALSE NA FALSE
> which(mayor.que.dos)
[1] 1 2
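The difference between the two indexing styles matters when the vector contains NA. As a small sketch using the same x as above: a logical index keeps a slot for every NA in the comparison, whereas which() returns only the positions that are TRUE.

```r
x <- c(3, 6, 1, NA, 2)

x[x > 2]          # the NA in the logical index yields an NA element: 3 6 NA
x[which(x > 2)]   # which() ignores NA and keeps only TRUE positions: 3 6
```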
Combining logical statements
Find out the games in which Granny scored the fewest baskets and the
games in which she scored the most baskets:
1. Create two logical vectors, as follows:
> min.baskets <- baskets.of.Granny == min(baskets.of.Granny)
> min.baskets
[1] FALSE FALSE FALSE FALSE FALSE TRUE
> which(min.baskets)
[1] 6
> max.baskets <- baskets.of.Granny == max(baskets.of.Granny)
> max.baskets
[1] TRUE FALSE FALSE FALSE FALSE FALSE
> which(max.baskets)
[1] 1
min.baskets tells you whether the value is equal to the minimum, and
max.baskets tells you whether the value is equal to the maximum.
2. Combine both vectors with the OR operator (|), as follows:
> min.baskets | max.baskets
[1] TRUE FALSE FALSE FALSE FALSE TRUE
This method actually isn’t the most efficient way to find those values. The
match() function lets you do things like this more efficiently.
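One possible shortcut, offered here only as an assumption about what a match()-based version could look like, uses the %in% operator, which base R defines in terms of match(). The sketch tests each value for membership in the pair returned by range().

```r
baskets.of.Granny <- c(12, 4, 4, 6, 9, 3)

# range() returns c(minimum, maximum); %in% tests membership in that pair
is.extreme <- baskets.of.Granny %in% range(baskets.of.Granny)
is.extreme          # TRUE FALSE FALSE FALSE FALSE TRUE
which(is.extreme)   # 1 6
```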
To drop the missing values in the vector x, for example, use the following:
> x[!is.na(x)] # The NOT operator (!)
[1] 3 6 1 2

> x == NA
[1] NA NA NA NA NA
That won’t work: any comparison with NA returns NA, so you need to use is.na().

Summarizing logical vectors


You can also use logical values in arithmetic operations. In that case, R sees
TRUE as 1 and FALSE as 0.
Suppose you want to know how often Granny scored more than Geraldine did:
> sum(the.best)
[1] 4

You have an easy way to figure out whether any value in a logical vector is TRUE
with the function any(). To ask R whether Granny was better than Geraldine in any
game, use this code:
> any(the.best)
[1] TRUE
 
To find out whether Granny was always better than Geraldine, use the following
code:
> all(the.best)
[1] FALSE
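Because TRUE counts as 1 and FALSE as 0, other arithmetic functions work on logical vectors too. As a brief sketch with the same data, mean() gives the proportion of games in which Granny outscored Geraldine.

```r
baskets.of.Granny    <- c(12, 4, 4, 6, 9, 3)
baskets.of.Geraldine <- c(5, 3, 2, 2, 12, 9)
the.best <- baskets.of.Geraldine < baskets.of.Granny

sum(the.best)    # number of TRUE values: 4
mean(the.best)   # proportion of TRUE values: 4 out of 6 games
TRUE + TRUE      # logicals are coerced to numbers: 2
```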
Powering Up Your Math
Vectorization is the Holy Grail for every R programmer. Using indices
and vectorized operators can save you a lot of coding and
calculation time.
Using arithmetic vector operations
Some arithmetic functions, such as sum(), prod(), min(), and max(), produce an
outcome that depends on more than one value in the vector. Often, the
idea behind these operations requires some form of looping over the
different values in a vector, which R does for you internally.
Calculations with missing values always return NA as a result. The same is
true for vector operations as well. R, however, gives you a way to simply
discard the missing values by setting the argument na.rm to TRUE.
> x <- c(3, 6, 2, NA, 1)
> sum(x)
[1] NA
> sum(x, na.rm = TRUE)
[1] 12
This argument works in sum(), prod(), min(), and max().
If you have a vector that contains only missing values and you set the
argument na.rm to TRUE, the sum is 0, the product is 1,
the minimum is Inf, and the maximum is -Inf.
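The all-missing case can be checked with a short sketch. The vector all.missing is a hypothetical example; min() and max() also warn that no non-missing values remain, so the warnings are suppressed here to show only the results.

```r
all.missing <- c(NA_real_, NA_real_)   # hypothetical all-NA numeric vector

sum(all.missing, na.rm = TRUE)    # 0
prod(all.missing, na.rm = TRUE)   # 1

# min() and max() warn when nothing is left after removing NA values
suppressWarnings(min(all.missing, na.rm = TRUE))   # Inf
suppressWarnings(max(all.missing, na.rm = TRUE))   # -Inf
```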
Cumulating operations
Suppose that after every game, you want to update the total number of
baskets that Granny made during the season. After the second game, that’s
the total of the first two games; after the third game, it’s the total of the first
three games; and so on. In other words, you want to calculate the cumulative
sum of the baskets Granny scored. You can make this calculation easily by
using the function cumsum() as in the following example:
> cumsum(baskets.of.Granny)
[1] 12 16 20 26 35 38
In a similar way, cumprod() gives you the cumulative product. You also can
get the cumulative minimum and maximum with the related functions
cummin() and cummax(). To find the maximum number of baskets Geraldine
scored up to any given game, you can use the following code:
> cummax(baskets.of.Geraldine)
[1] 5 5 5 5 12 12
These functions don’t have an extra argument to remove missing values.
Missing values are propagated through the vector, as shown in the following
example:
> cummin(x)
[1] 3 3 2 NA NA
Working with vectors with NA elements
> d <- c(3, NA, 5, 7, NA, 10)
> d
[1] 3 NA 5 7 NA 10
> index.clean.d <- which(!is.na(d))
> cumsum(d[index.clean.d])
[1] 3 8 15 25
> cumprod(d[index.clean.d])
[1] 3 15 105 1050
> cummin(d[index.clean.d])
[1] 3 3 3 3
> cummax(d[index.clean.d])
[1] 3 5 7 10
Calculating differences
You can calculate the difference in the number of baskets between every
two games Granny played by using the following code:
> diff(baskets.of.Granny)
[1] -8 0 2 3 -6
The vector returned by diff() is always one element shorter
than the original vector you gave as an argument.
The rule about missing values applies here, too. When your vector
contains a missing value, the result from that calculation will be NA.
> diff(d)
[1] NA NA 2 NA NA
Just like the cumulative functions, the diff() function doesn’t have an
argument to eliminate the missing values.
Recycling arguments
Whenever you combine a vector with multiple values and one with a single value in a
function, R applies the function using that single value for every value in the vector.
More generally, R repeats the shorter vector as often as necessary to carry out the
task you asked it to perform. This is called recycling.
Suppose you split up the number of baskets Granny made into two-pointers and
three-pointers:
> Granny.pointers <- c(10, 2, 4, 0, 4, 1, 4, 2, 7, 2, 1, 2)
You arrange the numbers in such a way that for every game, first the number of
two-pointers is given, followed by the number of three-pointers. Now Granny wants to
know how many points she’s actually scored this season. You can calculate that
easily with the help of recycling:
> points <- Granny.pointers * c(2, 3)
> points
[1] 20 6 8 0 8 3 8 6 14 6 2 6
> sum(points)
[1] 87
If the length of the longer vector isn’t exactly a multiple of the length of the shorter
vector, you can get unexpected results.
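A short sketch of that situation, using a hypothetical vector of length 5 multiplied by a vector of length 2: R still recycles the shorter vector, but cuts it off partway and issues a warning. The handler below only prints the warning text so that the result can be inspected.

```r
short <- 1:5   # hypothetical vector whose length (5) is not a multiple of 2

# c(2, 3) is recycled as c(2, 3, 2, 3, 2); R warns about the length mismatch
result <- withCallingHandlers(
  short * c(2, 3),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
result   # 2 6 6 12 10
```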
 Now Granny wants to know how much she improved every game.
> round(diff(baskets.of.Granny) / baskets.of.Granny[1:5] * 100)
[1] -67 0 50 50 -67
Getting Started with Reading and Writing
You assign text to variables. You manipulate these variables in many different
ways, including finding text within text and concatenating different pieces of
text into a single vector. You also use R functions to sort text and to find words
in text with some powerful pattern search functions, called regular
expressions. Finally, you work with factors, the R way of representing
categories (or categorical data, as statisticians call it).
Using Character Vectors for Text Data
Text in R is represented by character vectors. In the world of computer
programming, text often is referred to as a string. Here, text refers to a single
element of a vector: each element of a character vector is a bit of text,
also known as a string.
Named vectors are vectors in which each element has a name. This is useful
because you can then refer to the elements by name as well as position.
Assigning a value to a character vector
You assign a value to a character vector by using the assignment operator
(<-), the same way you do for all other variables. You test whether a variable
is of class character, for example, by using the is.character() function as
follows:
> x <- "Hello world!"
> is.character(x)
[1] TRUE
Notice that x is a character vector of length 1. To find out how many characters
are in the text, use nchar():
> length(x)
[1] 1
> nchar(x)
[1] 12
The results tell you that x has length 1 and that the single element in x has 12
characters.
Creating a character vector with more than one element
To create a character vector with more than one element, use the combine
function, c():
> x <- c("Hello", "world!")
> length(x)
[1] 2
> nchar(x)
[1] 5 6
Notice that this time, R tells you that your vector has length 2 and that the first
element has five characters and the second element has six characters.
Extracting a subset of a vector
You use the same indexing rules for character vectors that you use for
numeric vectors (or for vectors of any type). The process of referring to a
subset of a vector through indexing its elements is also called subsetting. In
other words, subsetting is the process of extracting a subset of a vector.
R has two built-in character vectors, letters and LETTERS, that contain the lowercase
and uppercase letters of the alphabet. Use these built-in vectors whenever you need
to make lists of things.
> letters[10]
[1] "j"
> LETTERS[24:26]
[1] "X" "Y" "Z"
You can use the tail() function to display the trailing elements of a vector. To
get the last five elements of LETTERS, try:
> tail(LETTERS, 5)
[1] "V" "W" "X" "Y" "Z"
Similarly, you can use the head() function to get the first elements of a
vector. By default, both head() and tail() return six elements.
> head(letters, 10)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Naming the values in your vectors
You can use named vectors in R to associate text values (names)
with any other type of value. Then you can refer to these values by name in
addition to position in the vector. This format has a wide range of applications;
for example, named vectors make it easy to create lookup tables. The built-in
dataset islands is a named vector:
> str(islands)
Named num [1:48] 11506 5500 16988 2968 16...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia"...
Because each element in the vector has a value as well as a name, now
you can subset the vector by name. To retrieve the sizes of Asia, Africa, and
Antarctica, use the following:
> islands[c("Asia", "Africa", "Antarctica")]
Asia Africa Antarctica
16988 11506 5500
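Subsetting by name is what makes a named vector usable as a lookup table. The following is a minimal sketch; state.codes and orders are hypothetical names and made-up data used only for illustration.

```r
# Hypothetical lookup table: names are the keys, elements are the values
state.codes <- c(CA = "California", NY = "New York", TX = "Texas")

orders <- c("NY", "TX", "NY", "CA")   # made-up keys to look up

state.codes[orders]           # named vector with one value per key
unname(state.codes[orders])   # "New York" "Texas" "New York" "California"
```

A key may appear more than once in the index, so a single subsetting call translates a whole vector of codes.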
You use the names() function to retrieve the names of a named vector:
> names(islands)[1:9]
To get the names of the six largest islands, sort the vector first:
> names(sort(islands, decreasing = TRUE)[1:6])
[1] "Asia" "Africa" "North America"
[4] "South America" "Antarctica" "Europe"
Creating and assigning named vectors
You use the assignment operator (<-) to assign names to vectors in much
the same way that you assign values to character vectors.
Imagine you want to create a named vector with the number of days in
each month. First, create a numeric vector containing the number of days
in each month. Then use the built-in constant month.name for the month
names, as follows:
> month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
> names(month.days) <- month.name
> month.days
January February March April
31 28 31 30
May June July August
31 30 31 31
September October November December
30 31 30 31
Now you can use this vector to find the names of the months with 31 days:
> names(month.days[month.days == 31])
[1] "January" "March" "May"
[4] "July" "August" "October"
[7] "December"
Splitting text
> pangram <- "The quick brown fox jumps over the lazy dog"
> pangram
[1] "The quick brown fox jumps over the lazy dog"
To split this text at the word boundaries (spaces), you can use strsplit() as
follows:
> strsplit(pangram, " ")
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
Notice that strsplit() returns a list. Lists allow you to combine all kinds of variables.
In the preceding example, this list has only a single component, a vector.
To extract a component from a list, you have to use double square
brackets. Split your pangram into words, and assign the first component to
a new variable called words, using double-square-brackets ([[ ]]) subsetting.
> words <- strsplit(pangram, " ")[[1]]
> words
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
To find the unique elements of a vector, including a vector of text, you use
the unique() function.
> unique(tolower(words)) # tolower() converts text to lowercase
[1] "the" "quick" "brown" "fox" "jumps" "over" "lazy"
[8] "dog"
> colores <- strsplit(c("Los arboles son verdes", "El Mar es azul", "El sol es amarillo"), ",")
> oracion.color1 <- colores[[1]]
> oracion.color1
[1] "Los arboles son verdes"
> color1.separadas <- strsplit(oracion.color1, " ")
> color1.separadas
[[1]]
[1] "Los" "arboles" "son" "verdes"
> color1.separadas <- strsplit(oracion.color1, "")
> color1.separadas
[[1]]
[1] "L" "o" "s" " " "a" "r" "b" "o" "l" "e" "s" " " "s" "o" "n" " " "v" "e" "r"
[20] "d" "e" "s"

> color1.separadas2 <- strsplit(oracion.color1, "")[[1]]


> color1.separadas2
[1] "L" "o" "s" " " "a" "r" "b" "o" "l" "e" "s" " " "s" "o" "n" " " "v" "e" "r"
[20] "d" "e" "s"
> unique(tolower(color1.separadas2))
[1] "l" "o" "s" " " "a" "r" "b" "e" "n" "v" "d"
Concatenating text
Now that you’ve split text, you can concatenate these components so that
they again form a single text string.
Changing text case
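To change some elements of words to uppercase, use the toupper() function:
> toupper(words[c(4, 9)])
[1] "FOX" "DOG"
To keep the change, assign the result back to those elements:
> words[c(4, 9)] <- toupper(words[c(4, 9)])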
To change text to lowercase, use tolower():
> tolower("Some TEXT in Mixed CASE")
[1] "some text in mixed case"
To concatenate text, use the paste() function. By default, paste() separates
the components with a blank space (the sep argument defaults to " "), unless
you tell it otherwise.
> paste("The", "quick", "brown", "fox")
[1] "The quick brown fox"
The c() function combines objects into a vector (or list). By default, paste()
concatenates separate vectors — it doesn’t collapse elements of a vector.
> paste(c("The", "quick", "brown", "fox"))
[1] "The" "quick" "brown" "fox"
When you want to concatenate the elements of a vector by using paste(), you use
the collapse argument, as follows:
> paste(words, collapse = " ")
[1] "The quick brown FOX jumps over the lazy DOG"
The collapse argument of paste() can take any character value. If you want to paste
together text by using an underscore, use the following:
> paste(words, collapse = "_")
[1] "The_quick_brown_FOX_jumps_over_the_lazy_DOG"
The paste() function takes vectors as input and joins them together. If one vector is
shorter than the other, R recycles (repeats) the shorter vector to match the length of
the longer one.
> paste("Sample", 1:5)
[1] "Sample 1" "Sample 2" "Sample 3" "Sample 4" "Sample 5"
> paste(c("A", "B"), c(1, 2, 3, 4), sep = "-")
[1] "A-1" "B-2" "A-3" "B-4"
> paste(c("A"), c(1, 2, 3, 4, 5), sep = "-")
[1] "A-1" "A-2" "A-3" "A-4" "A-5"
You can use sep and collapse in the same paste call. In this case, the vectors
are first pasted with sep and then collapsed with collapse. Try this:
> paste(LETTERS[1:5], 1:5, sep = "_", collapse = "---")
[1] "A_1---B_2---C_3---D_4---E_5"
Sorting text
> sort(letters, decreasing = TRUE)
[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p"
[12] "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e"
[23] "d" "c" "b" "a"
The sort() function sorts a vector. It doesn’t sort the characters of each element of
the vector.
> sort(words)
[1] "brown" "DOG" "FOX" "jumps" "lazy"
[6] "over" "quick" "the" "The"
The sort order will depend on the locale of the machine the code runs on.
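If you need the same order on every machine, one option (shown here as a sketch, not as the deck's recommendation) is sort() with method = "radix", which for character vectors does not respect the locale and sorts in C-locale order instead.

```r
words <- c("The", "quick", "brown", "FOX", "jumps", "over", "the", "lazy", "DOG")

# method = "radix" ignores the locale and compares by byte value,
# so uppercase letters come before lowercase ones on every machine
sort(words, method = "radix")
# "DOG" "FOX" "The" "brown" "jumps" "lazy" "over" "quick" "the"
```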
 Finding text inside text
Searching for individual words
Broadly speaking, you can find substrings in text in two ways:
✓ By position: For example, you can tell R to get three letters starting at position 5.
✓ By pattern: For example, you can tell R to get substrings that match a specific
word or pattern.
If you know the exact position of a subtext inside a text string, you use the substr()
function to return the value.
> head(state.name)
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
> head(substr(state.name, start = 3, stop = 6))
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
Searching by pattern
To find substrings, you can use the grep() function, which takes two
essential arguments:
✓ pattern: The pattern you want to find.
✓ x: The character vector you want to search.
> grep("New", state.name)
[1] 29 30 31 32
> state.name[29:32]
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
> grep("New", state.name, value = TRUE)
[1] "New Hampshire" "New Jersey"
[3] "New Mexico" "New York"
> state.name[grep("New", state.name)]
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
> state.name[grep("East", state.name)]
character(0)
The grep() function is case sensitive — it only matches text in the same
case (uppercase or lowercase) as your search pattern.
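To match regardless of case, grep() accepts the argument ignore.case = TRUE. The vector places in this small sketch is a made-up example.

```r
places <- c("New York", "newark", "Boston")   # hypothetical example vector

grep("new", places, value = TRUE)                       # "newark"
grep("new", places, value = TRUE, ignore.case = TRUE)   # "New York" "newark"
```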
The result character(0) is an empty character vector, which means that grep() found
no match. R makes a distinction between NULL and an empty vector. NULL usually
means something is undefined.
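The distinction can be seen in a short sketch: an empty result from grep() still has a type and a length of zero, but it is not NULL. The name no.match is introduced here only for illustration.

```r
no.match <- grep("East", state.name, value = TRUE)

no.match                 # character(0): an empty character vector
length(no.match)         # 0
is.character(no.match)   # TRUE: it still has a type
is.null(no.match)        # FALSE: empty is not the same as NULL

length(NULL)             # 0
is.null(NULL)            # TRUE
```

Checking length(no.match) == 0 is therefore a simple way to test whether a search found anything.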
Substituting text
The sub() function (short for substitute) searches for a pattern in text and
replaces this pattern with replacement text. You use sub() to substitute
text for text, and you use its cousin gsub() to substitute all occurrences of
a pattern.
(The g in gsub() stands for global.)
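The difference between the two functions shows up as soon as a pattern occurs more than once. A minimal sketch with a made-up string:

```r
fruit <- "banana"   # hypothetical example string

sub("a", "A", fruit)    # replaces the first match only: "bAnana"
gsub("a", "A", fruit)   # replaces every match:          "bAnAnA"
```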
The gsub() function takes three arguments: the pattern to find, the
replacement pattern, and the text to modify:
> gsub("cheap", "sheep's", "A wolf in cheap clothing")
[1] "A wolf in sheep's clothing"
Removing substrings is the same as replacing the substring with empty
text (that is, nothing at all).
> x <- c("file_a.csv", "file_b.csv", "file_c.csv")
> y <- gsub("file_", "", x)
> y
[1] "a.csv" "b.csv" "c.csv"
> gsub("\\.csv", "", y)
[1] "a" "b" "c"
A dot (.) is a wildcard in a regular expression. It indicates “any
character.” If you want to match a literal dot, you have to escape it with two
backslashes (\\.).
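To see why the escape matters, here is a small sketch with made-up file names. Without the escape, the dot also matches the underscore; with the escape, or with fixed = TRUE (which turns off regular expressions altogether), only the literal ".csv" is removed.

```r
files <- c("a.csv", "b_csv.csv")   # hypothetical file names

gsub(".csv", "", files)                 # dot matches any character: "a" "b"
gsub("\\.csv", "", files)               # escaped dot:               "a" "b_csv"
gsub(".csv", "", files, fixed = TRUE)   # literal matching:          "a" "b_csv"
```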
Revving up with regular expressions
Regular expressions allow three ways of making a search pattern more general than
a single, fixed expression:
✓ Alternatives: You can search for instances of one pattern or another, indicated
by the | symbol. For example, beach|beech matches both beach and beech.
On English and American English keyboards, you can usually find the | on the same
key as backslash (\).
✓ Grouping: You group patterns together using parentheses ( ). For example, you
write be(a|e)ch to find both beach and beech.
✓ Quantifiers: You specify whether a component in the pattern must be repeated
or not by adding * (occurs zero or many times) or + (occurs one or many times). For
example, to find either bach or beech (zero or more of a and e but not both), you use
b(e*|a*)ch.
> rwords <- c("bach", "back", "beech", "beach", "black")
> grep("beach|beech", rwords)
[1] 3 4
> rwords[grep("beach|beech", rwords)]
[1] "beech" "beach"
> rwords[grep("be(a|e)ch", rwords)]
[1] "beech" "beach"
> rwords[grep("b(e*|a*)ch", rwords)]
[1] "bach" "beech"
Extending text functionality with package stringr
Here are some of the advantages of using stringr rather than the standard R
functions:
✓ Function names and arguments are consistent and more descriptive. For
example, all stringr functions have names starting with str_ (such as str_detect()
and str_replace()).
✓ stringr has a more consistent way of dealing with cases with missing data or
empty values.
✓ stringr has a more consistent way of ensuring that input and output data are of
the same type.
The stringr equivalent for grep() is str_detect(), and the equivalent for gsub() is
str_replace_all().
As a starting point to explore stringr, you may find some of these functions useful:
✓ str_detect(): Detects the presence or absence of a pattern in a string
✓ str_extract(): Extracts the first piece of a string that matches a pattern
✓ str_length(): Returns the length of a string (in characters)
✓ str_locate(): Locates the position of the first occurrence of a pattern in a string
✓ str_match(): Extracts the first matched group from a string
✓ str_replace(): Replaces the first occurrence of a matched pattern in a string
✓ str_split(): Splits up a string into a variable number of pieces
✓ str_sub(): Extracts substrings from a character vector
✓ str_trim(): Trims white space from the start and end of string
✓ str_wrap(): Wraps strings into nicely formatted paragraphs
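As a rough sketch of how two of these functions compare with their base R counterparts, the example below assumes the stringr package is installed and skips the calls otherwise. Note that str_detect() returns a logical vector, so its closest base R relative is grepl() rather than grep().

```r
rwords <- c("bach", "back", "beech", "beach", "black")

if (requireNamespace("stringr", quietly = TRUE)) {
  # str_detect() returns TRUE or FALSE for each element
  print(stringr::str_detect(rwords, "be(a|e)ch"))
  # str_replace_all() replaces every match, like gsub()
  print(stringr::str_replace_all("file_a.csv", "file_|\\.csv", ""))
} else {
  message("stringr is not installed; run install.packages(\"stringr\") first")
}
```

In stringr the string comes first and the pattern second, which is the reverse of the argument order in grep() and gsub().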
Happy Coding!
Instructor: Amaury Beltrán Mendez
