Week 1-B. Data in R


25/01/2024, 12:53 Week 1: Review and Introduction to Data Analysis

Data in R
Matrix
It is often useful to work in two dimensions when working with data. In R, a matrix is a 2-dimensional object consisting of
values of a single data type.

You can create a matrix using matrix(<vector of values>, <number of rows>). The values fill the matrix column by column.

matrix(1:9, 3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

matrix(LETTERS, 2) # `LETTERS` is a built-in vector of the capital letters A to Z.

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"
## [2,] "B" "D" "F" "H" "J" "L" "N" "P" "R" "T" "V" "X" "Z"
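As a brief sketch of the fill order (the names by_col and by_row are only illustrative and not part of the tutorial): matrix() fills column by column by default, and its byrow argument switches to row-wise filling.

```r
# Default: values fill the matrix column by column.
by_col <- matrix(1:6, nrow = 2)
# byrow = TRUE fills the matrix row by row instead.
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

by_col[1, ]  # first row under column-wise filling: 1 3 5
by_row[1, ]  # first row under row-wise filling: 1 2 3
dim(by_col)  # number of rows and columns: 2 3
```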

You can access individual elements using [ as you would with vectors. If you supply a single value or a single vector, R treats the matrix
as a vector formed by concatenating its columns in order.

mat_ex <- matrix(1:12, 3)
mat_ex

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

mat_ex[5] # 5th value

## [1] 5

mat_ex[c(2, 11)] # 2nd and 11th values

## [1] 2 11
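One way to check the column-by-column ordering is to compare against the flattened vector and to translate a single index back into coordinates. This is a small sketch; arrayInd() is a base R helper that the tutorial itself does not use.

```r
mat_ex <- matrix(1:12, 3)

# A single index walks down the columns, so it matches the flattened vector.
as.vector(mat_ex)[5] == mat_ex[5]

# arrayInd() converts a single index into (row, column) coordinates.
pos <- arrayInd(5, dim(mat_ex))
pos  # row 2, column 2
```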

More intuitively, you can use 2-dimensional coordinates, separated by a comma, to extract elements. The first value
corresponds to the row and the second value to the column. Indexing starts at 1, as with a single-value index.

mat_ex[2, 3] # second row, third column

## [1] 8

mat_ex[c(1, 3), 3] # first and third row, third column

## [1] 7 9

mat_ex[c(2, 3), c(1, 4)] # second and third row, first and fourth column

## [,1] [,2]
## [1,] 2 11
## [2,] 3 12
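Besides positions, a logical vector can select rows or columns, which is the idea the exercise further down relies on. A minimal sketch, where the name keep is illustrative:

```r
mat_ex <- matrix(1:12, 3)

# Logical vector with one entry per row: TRUE where the first column is at least 2.
keep <- mat_ex[, 1] >= 2
keep

# Rows marked TRUE are kept; all columns are returned.
mat_ex[keep, ]
```

Here the result contains the second and third rows of mat_ex.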

If you leave one dimension blank, you extract entire rows or columns.

mat_ex[1, ] # all values from the first row

## [1] 1 4 7 10

mat_ex[ , 4] # all values from the fourth column

## [1] 10 11 12
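Note that extracting a single row or column returns a plain vector. If a one-row or one-column matrix is needed instead, the drop argument of [ can be set to FALSE. A short sketch with illustrative names:

```r
mat_ex <- matrix(1:12, 3)

row_vec <- mat_ex[1, ]                # plain vector, dimensions are dropped
row_mat <- mat_ex[1, , drop = FALSE]  # still a matrix with one row

dim(row_vec)  # NULL
dim(row_mat)  # 1 4
```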

https://rconnect.utstat.utoronto.ca/content/a57e9d9b-336e-4f49-9347-afee2f53a192/#section-data-in-r 1/5

There are a few convenient functions provided for working with multidimensional objects. For example, rowMeans()
computes mean values for each row. See documentation for details and other similar functions.


?rowMeans

colSums                  package:base                  R Documentation

Form Row and Column Sums and Means

Description:

     Form row and column sums and means for numeric arrays (or data
     frames).

Usage:

     colSums (x, na.rm = FALSE, dims = 1)
     rowSums (x, na.rm = FALSE, dims = 1)
     colMeans(x, na.rm = FALSE, dims = 1)
     rowMeans(x, na.rm = FALSE, dims = 1)

     .colSums(x, m, n, na.rm = FALSE)
     .rowSums(x, m, n, na.rm = FALSE)
     .colMeans(x, m, n, na.rm = FALSE)
     .rowMeans(x, m, n, na.rm = FALSE)

Arguments:

       x: an array of two or more dimensions, containing numeric,
          complex, integer or logical values, or a numeric data frame.
          For '.colSums()' etc, a numeric, integer or logical matrix
          (or vector of length 'm * n')
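The na.rm argument listed in the usage above controls how missing values are handled. A small sketch on a made-up matrix m (not from the tutorial):

```r
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

rowMeans(m)                # the second row contains NA, so its mean is NA
rowMeans(m, na.rm = TRUE)  # NA is skipped: 3 5
colSums(m, na.rm = TRUE)   # 1 7 11
```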

Exercise 4
mat is a 6 by 4 numeric matrix in the following code chunk. Use R code to

1. find the row numbers where the row mean is smaller than 15
2. find the column numbers where the column sum is larger than 75
3. check whether all values in the second row are larger than the corresponding values in the fourth row AND the value 5 is in the last column


mat

(1:6)[rowMeans(mat) < 15] # fill in the bracket with a logical statement
(1:4)[colSums(mat) > 75] # fill in the bracket with a logical statement

all(mat[2, ] > mat[4, ]) && (5 %in% mat[ ,4])

     [,1] [,2] [,3] [,4]
[1,]   21   20   24    2
[2,]    5   22   23    9
[3,]   13    7   10   16
[4,]   11   14   12   18
[5,]   19   17    3    6
[6,]    1   15    8    4

[1] 2 3 4 5 6

[1] 2 3

[1] FALSE
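As an alternative to indexing 1:6 and 1:4 with a logical vector, the base R function which() returns the positions of the TRUE values directly. The sketch below rebuilds mat from the printed output above so that it runs on its own; it is one possible approach, not necessarily the intended solution.

```r
# `mat` rebuilt from the printed output above, filled column by column.
mat <- matrix(c(21, 5, 13, 11, 19, 1,
                20, 22, 7, 14, 17, 15,
                24, 23, 10, 12, 3, 8,
                2, 9, 16, 18, 6, 4), nrow = 6)

which(rowMeans(mat) < 15)  # 2 3 4 5 6
which(colSums(mat) > 75)   # 2 3
```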

Data frames
Vectors and matrices can only hold values of a single data type. We can create tables whose columns have different
data types via data.frame() in R.


For example, the data frame below has columns of integers, characters, and logical values.

set.seed(238)

some_table <- data.frame(
  numbers = 1:5,
  alphabets = letters[1:5],
  l = sample(c(TRUE, FALSE), 5, replace = TRUE),
  y = rpois(5, 100),
  w = rgeom(5, 1/100)
)
some_table

numbers alphabets l     y     w
<int>   <chr>     <lgl> <int> <int>
1       a         TRUE  106   14
2       b         TRUE  100   98
3       c         FALSE 89    1
4       d         FALSE 100   141
5       e         TRUE  92    282

5 rows
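To see the contrast with vectors, the sketch below first combines mixed values in a vector and then stores them as separate columns. The names mixed and small_table are illustrative, and stringsAsFactors = FALSE is spelled out so that the character column behaves the same on older R versions.

```r
# Combining mixed values in a vector coerces everything to one type.
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"

# A data frame keeps a separate type for each column.
small_table <- data.frame(
  numbers = 1:3,
  alphabets = letters[1:3],
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
sapply(small_table, class)  # integer, character, logical
```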

Many functions in R are designed to work with data frames. For example, ggplot() (see note 1) expects a data frame as its first
input argument. You can then define the mappings between columns and aesthetics (e.g., x axis).

library(ggplot2)
ggplot(some_table,          # provide the table
       aes(x = numbers)) +  # map the column `numbers` to x-axis
  theme_minimal() +         # use minimal theme
  geom_point(aes(y = ifelse(l, y, w),  # draw points using `y` or `w` based on the value of `l`
                 color = l))           # change color based on the value of `l`

Specifying aesthetics such as color, size, linetype, or shape inside aes() makes them vary
with the mapped values.
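A minimal sketch of the difference between mapping an aesthetic inside aes() and setting it to a constant outside aes(). The data frame pts and the plot names are hypothetical and used only for illustration; the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical data for illustration only.
pts <- data.frame(x = 1:4, y = c(2, 4, 3, 5), group = c("a", "a", "b", "b"))

# Inside aes(): color varies with the `group` column and a legend is drawn.
p_mapped <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(aes(color = group))

# Outside aes(): every point gets the same fixed color.
p_fixed <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(color = "blue")
```

Printing p_mapped or p_fixed draws the corresponding plot.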
Most multivariate data come in a table format and R makes it easy to work with them.

You can access elements of a data frame using 2D indexes in the same way you use 2D indexes with matrices.

some_table[1, ] # extracting the first row

  numbers alphabets l     y     w
  <int>   <chr>     <lgl> <int> <int>
1 1       a         TRUE  106   14

1 row

some_table[1:3, ] # extracting the first three rows


  numbers alphabets l     y     w
  <int>   <chr>     <lgl> <int> <int>
1 1       a         TRUE  106   14
2 2       b         TRUE  100   98
3 3       c         FALSE 89    1

3 rows

some_table[ , 3] # extracting the third column

## [1] TRUE TRUE FALSE FALSE TRUE

some_table[1, 3:4] # extracting the elements in the first row, third and fourth columns

  l     y
  <lgl> <int>
1 TRUE  106

1 row

When you extract multiple columns, the result is still a data frame.

On the other hand, a single positional index behaves differently from a matrix: it returns the corresponding column as a data frame.

some_table[3]

l
<lgl>
TRUE
TRUE
FALSE
FALSE
TRUE

5 rows

You can also extract a column using $ followed by the column name.

some_table$l

## [1] TRUE TRUE FALSE FALSE TRUE

Using a 2D index to extract a column (e.g., some_table[ , 3]) or the $ extractor returns a vector, whereas using a 1D index
(e.g., some_table[3]) returns a single-column data frame. The difference doesn't affect code in most cases, but it is a good
place to check if you face an unexpected error.
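The distinction can be checked with is.data.frame(). The sketch below uses a small made-up table; the [[ extractor and the drop argument are standard base R features that the tutorial does not cover.

```r
small_table <- data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE))

is.data.frame(small_table[2])                  # TRUE: 1D index keeps a data frame
is.data.frame(small_table[, 2])                # FALSE: 2D index returns a vector
is.data.frame(small_table[, 2, drop = FALSE])  # TRUE: drop = FALSE keeps a data frame
identical(small_table[[2]], small_table$flag)  # TRUE: [[ ]] also returns the vector
```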

Exercise 5
In the code chunk below, some_table is available for you.

i. Extract values of y from rows where l is TRUE.
ii. Extract values of w from rows where l is FALSE.
iii. Add the five extracted values together.


some_table
sum(some_table$y[some_table$l]) + sum(some_table$w[!some_table$l])

numbers alphabets l     y     w
<int>   <chr>     <lgl> <int> <int>
1       a         TRUE  106   14
2       b         TRUE  100   98
3       c         FALSE 89    1
4       d         FALSE 100   141
5       e         TRUE  92    282

5 rows

[1] 440
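The same total can be reached with ifelse(), mirroring the expression used in the plotting example earlier. In this sketch some_table is rebuilt from the printed values above rather than from the random draws, so it runs on its own; the name picked is illustrative.

```r
# `some_table` rebuilt from the printed values above, not from the random draws.
some_table <- data.frame(
  numbers = 1:5,
  alphabets = letters[1:5],
  l = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  y = c(106, 100, 89, 100, 92),
  w = c(14, 98, 1, 141, 282)
)

# Pick `y` where `l` is TRUE and `w` otherwise, then add the five values.
picked <- with(some_table, ifelse(l, y, w))
picked
sum(picked)  # 440
```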


1. ggplot() is a plotting function available from the ggplot2 package (https://ggplot2.tidyverse.org).
