
Lecture 1

The document outlines a course structure for learning R programming, including lectures, labs, homework, and assessments. It introduces key concepts such as the R console, basic computations, data structures, and functions, as well as the RStudio environment. Additionally, it covers various data types, vectors, matrices, and data frames, emphasizing their roles in statistical analysis and data manipulation.

Uploaded by

justusnyamaiamin
Copyright
© © All Rights Reserved

Lecture1

Ann Mwangi, PhD


Course Structure

• Two lectures a week: concepts, methods, examples
• Lab
• HW bi-weekly
• In class Assessment
• Final Exam
Getting started in R

• R is a programming language
• One can use this language in two different ways:
– directly through the R console
– and using the RStudio development environment.

• To start, you need to download R and RStudio.


Why R?

• R is built for statisticians and data analysts.

• R is open source.

• R has most of the latest statistical methods available.

• R is flexible.
R Console

• The R console provides our first intro to code in R.
• You type in commands, R gives back answers (or errors)
• Menus and other graphical interfaces are extras built on top of the console
Basic computations and objects

Computations
• Addition: 7+8
• Subtraction: 7-6
• Multiplication: 2*7
• Division: 15/3
• Exponentiation: 8^7
• Modulo: 100 %% 5
Objects
• Named data structure that allows us to reference that data structure
• Commonly called variables.
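As a minimal sketch of how computations and objects fit together, the results of the operations listed above can be stored in named objects with the assignment operator `<-`. The object names (`total`, `remainder`, `power`, `quotient`) are illustrative choices, not part of the lecture.

```r
# Store the result of each computation in a named object.
total <- 7 + 8          # addition
remainder <- 100 %% 5   # modulo: remainder after division
power <- 8^7            # exponentiation
quotient <- 15 / 3      # division

total      # 15
remainder  # 0
power      # 2097152
quotient   # 5

# Objects can be reused in later computations.
total * 2  # 30
```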
RStudio environment
Panes

• There are four panes shown by default:
• Source Pane - used for editing code files such as R scripts or R Markdown documents.
• Console Pane - used to show the live R session.
• Environment Pane - containing the Environment and History tabs, used to keep track of the current state.
• Output Pane - containing the Plots and Packages tabs.
Functions

• A function takes in one or more inputs and returns a single output object.
• E.g. round(), exp(), ceiling()
• The round() function rounds a number ending in 0.5 to the closest even integer.
• round(6.5)
• #> [1] 6
• round(7.5)
• #> [1] 8
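The calls below are a small sketch of the three functions named on this slide. The `digits` argument to round() is standard base R, but the specific input values are chosen here only for illustration.

```r
# round() sends exact halves to the nearest even integer.
half_down <- round(6.5)   # 6
half_up <- round(7.5)     # 8

# A second argument controls the number of decimal places kept.
two_places <- round(3.14159, digits = 2)  # 3.14

# ceiling() always rounds up; exp() computes e raised to the input.
rounded_up <- ceiling(6.1)  # 7
e_to_zero <- exp(0)         # 1
```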
Working with directories and paths
• read.csv() reads in a comma-delimited file and returns the information as a data frame (typing ?read.csv in the console will provide more about this function).
• The first argument to this function is a file, which can be expressed as either a file name or a path to a file.
• By default, R looks for the file in your current working directory.
• To find the working directory, you can run getwd().
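The following sketch shows read.csv() being given a full path rather than a bare file name. To stay self-contained it first writes a small file into R's temporary directory; the file name `example_scores.csv` and its contents are made up for this illustration.

```r
# Where R currently looks for files given by name only.
getwd()

# Build a path to a throwaway file and write a tiny CSV there.
csv_path <- file.path(tempdir(), "example_scores.csv")
write.csv(data.frame(id = 1:3, score = c(7, 8, 10)),
          csv_path, row.names = FALSE)

# Read the file back using its path; the result is a data frame.
scores <- read.csv(csv_path)
class(scores)  # "data.frame"
nrow(scores)   # 3
```

If the file lived in the working directory, `read.csv("example_scores.csv")` would be enough.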
Installing and loading packages
• An R package is a collection of functions and/or sample data that allow us to expand on the functionality of R beyond the base functions.
• For example, when working with data frames, we often use the tidyverse package, which is a collection of R packages for data science applications.
• You can check whether you have the tidyverse package installed by going to the Packages tab in the Output Pane in RStudio or by running the following command, which displays all your installed packages.
• installed.packages()
• If you don't already have a package installed, you can install it using the install.packages() function.
• Note that you have to include single or double quotes around the package name when using this function.
• You only have to install a package one time.
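A rough sketch of this workflow is given below. The helper `is_installed()` is a hypothetical convenience function written for this example, and the install and load lines are left commented out because they need an internet connection and change your R library.

```r
# Hypothetical helper: is a package name among the installed packages?
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

is_installed("stats")  # TRUE: stats ships with every R installation

# Install once (note the quotes around the package name):
# install.packages("tidyverse")

# Load in every new R session where you want to use it:
# library(tidyverse)
```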
Data Structures
•Data structures are how information is stored in R and refer to the types of objects we can create in R.
•The data structures that we use inform R how to interpret our code.
•Any object is a named instance of a data structure.
•The main data structures in R are
– vectors, factors, matrices, arrays, lists, and data frames.
•These structures are distinguished by their dimensions and by the type of data they store
Data Types

• Each individual value in R has a type:
– logical, integer, double, or character.
• We can think of these as the building blocks of all data structures.
• We use the typeof() function to find the type of a value or vector.
• A double is a numeric value with a stored decimal.
• An integer is a whole number that does not contain a decimal.
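As a short illustration, typeof() can be applied to one value of each of the four types listed above. The example values are arbitrary; the `L` suffix is the standard way to ask R for an integer rather than a double.

```r
type_logical <- typeof(TRUE)       # "logical"
type_integer <- typeof(7L)         # "integer"
type_double <- typeof(7)           # "double": plain numbers are doubles
type_character <- typeof("seven")  # "character"

# typeof() works on whole vectors too.
type_vector <- typeof(c(1.5, 2))   # "double"
```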
Vectors

• Vectors are 1-dimensional data structures that can store multiple values of the same type (e.g. character, boolean, or numeric).
• A vector is a sequence of values, all of the same type
•x <- c(7, 8, 10, 45)
x
[1] 7 8 10 45
•is.vector(x)
[1] TRUE
•c() function returns a vector containing all its arguments in order
• The class() function returns the data structure of an object.
• Another way to create a vector is with the rep() or seq() functions
•rep(0, 5)
#> [1] 0 0 0 0 0
•rep("Monday", 4)
#> [1] "Monday" "Monday" "Monday" "Monday"
•seq(1, 5, 1)
#> [1] 1 2 3 4 5
•seq(0, -10, -2)
#> [1] 0 -2 -4 -6 -8 -10
Indexing Vectors
•Once we have a vector, we may want to access certain values stored in that vector.
•To do so, we index the vector using the position of each value
– x[1] is the first element,
– x[4] is the 4th element
– x[-4] is a vector containing all but the fourth element
•We can either access a single value or a subset of values using a vector of indices
•x[c(1,4)] gives the values at index 1 and at index 4
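The indexing rules above can be tried on the vector `x` from the earlier slide. This is only a sketch; the names `first`, `fourth`, `all_but_fourth` and `first_and_fourth` are introduced here for readability.

```r
x <- c(7, 8, 10, 45)

first <- x[1]                   # 7
fourth <- x[4]                  # 45
all_but_fourth <- x[-4]         # 7 8 10
first_and_fourth <- x[c(1, 4)]  # 7 45
```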
Operations on Vectors

• Vector arithmetic
– Operators apply to vectors “pairwise” or “elementwise”
•x <- c(7, 8, 10, 45)
•y <- c(-7, -8, -10, -45)
• x+y
[1] 0 0 0 0
•x*y
[1] -49 -64 -100 -2025
• Recycling
– repeat elements in shorter vector when combined with longer
•x + c(-7,-8)
[1] 0 0 3 37
•x^c(1,0,-1,0.5)
•Pairwise comparisons:
•x > 9
[1] FALSE FALSE TRUE TRUE
•Note: returns Boolean vector
•Boolean operators work elementwise:
•(x > 9) & (x < 20)
[1] FALSE FALSE TRUE FALSE
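To make recycling concrete, the sketch below spells out what R does with the shorter vector and evaluates the last expression from the slide. The variable names are illustrative only.

```r
x <- c(7, 8, 10, 45)

# The length-2 vector is repeated to length 4 before adding.
shifted <- x + c(-7, -8)
explicit <- x + c(-7, -8, -7, -8)
identical(shifted, explicit)  # TRUE

# Elementwise powers: 7^1, 8^0, 10^-1, 45^0.5
powers <- x^c(1, 0, -1, 0.5)
powers  # 7, 1, 0.1 and the square root of 45
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.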
Functions on Vectors
• mean(), median(), sd(), var(), max(), min(), length(), sum(): return single numbers
• sort() returns a new vector
• hist() takes a vector of numbers and produces a histogram, a highly structured object, with the side-effect of making a plot
• Similarly ecdf() produces an empirical cumulative-distribution-function object
• summary() gives a five-number summary, plus the mean, of numerical vectors
• any() and all() are useful on Boolean vectors
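A brief sketch of several of these functions applied to the vector `x` used on the earlier slides; the result names are chosen here for illustration.

```r
x <- c(7, 8, 10, 45)

x_mean <- mean(x)      # 17.5
x_median <- median(x)  # 9
x_sum <- sum(x)        # 70
x_length <- length(x)  # 4

# sort() leaves x untouched and returns a new vector.
x_desc <- sort(x, decreasing = TRUE)  # 45 10 8 7

# any() and all() collapse a Boolean vector to a single TRUE/FALSE.
any_large <- any(x > 40)  # TRUE
all_large <- all(x > 40)  # FALSE

summary(x)  # minimum, quartiles, median, mean and maximum
```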
Factors

•A factor is a special kind of vector that behaves like a regular vector except that it represents values from a category.
•A factor keeps track of all possible values of that category, which are called the levels of the factor.
• Factors are especially helpful when we start getting into data analysis and have categorical columns.
•The as.factor() function converts a vector to a factor.
•days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Thursday", "Wednesday")
•days_fct <- as.factor(days)
• class(days_fct)
#> [1] "factor"
•levels(days_fct)
#> [1] "Monday" "Thursday" "Tuesday" "Wednesday"
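Note that as.factor() orders the levels alphabetically, which is why "Thursday" comes before "Tuesday" above. As a sketch that goes slightly beyond the slide, the base function factor() accepts a `levels` argument to set the order explicitly, and table() counts how often each level occurs. The name `days_ordered` is introduced here for illustration.

```r
days <- c("Monday", "Tuesday", "Wednesday", "Monday",
          "Thursday", "Wednesday")

# Give the levels in calendar order instead of alphabetical order.
days_ordered <- factor(
  days,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday")
)

levels(days_ordered)  # "Monday" "Tuesday" "Wednesday" "Thursday"

# Internally each value is stored as the position of its level.
day_codes <- as.integer(days_ordered)  # 1 2 3 1 4 3

# Count the observations in each category.
day_counts <- table(days_ordered)
day_counts
```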
Matrices
•Matrices are similar to vectors in that they store data of the same type.
•However, matrices are two-dimensional, consisting of both rows and columns, as opposed to one-dimensional vectors.
•A matrix is a rectangular arrangement of numbers in rows and columns.
– rows run horizontally and
– columns run vertically.
•Matrices are two-dimensional, homogeneous data structures.
•In R, to create a matrix we use the function called matrix().
•The main argument to matrix() is the vector of elements, along with the number of rows and/or columns. By default, matrices are filled in column-wise order.
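The difference between the default column-wise fill and a row-wise fill can be seen in a small sketch. The input `1:6` and the names `by_col` and `by_row` are arbitrary choices for this example.

```r
# Default: values fill the first column, then the second, and so on.
by_col <- matrix(1:6, nrow = 2)
by_col[1, ]  # first row: 1 3 5

# byrow = TRUE fills the first row, then the second.
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row[1, ]  # first row: 1 2 3

dim(by_col)  # 2 rows, 3 columns
```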
Matrix Example

• Factory makes cars and trucks, using labor and steel
– a car takes 40 hours of labor and 1 ton of steel
– a truck takes 60 hours and 3 tons of steel
• In R, a matrix is a specialization of a 2D array
> factory <- matrix(c(40,1,60,3),nrow=2)
> is.array(factory)
[1] TRUE
> is.matrix(factory)
[1] TRUE
• could also specify ncol, and/or byrow=TRUE to fill by rows.
• Element-wise operations with the usual arithmetic and comparison operators (e.g., factory/3)
Matrix Multiplication
• Gets a special operator
> six.sevens <- matrix(rep(7,6),ncol=3)
> six.sevens
[,1] [,2] [,3]
[1,] 7 7 7
[2,] 7 7 7
> factory %*% six.sevens # [2x2] * [2x3]
[,1] [,2] [,3]
[1,] 700 700 700
[2,] 28 28 28
• What happens if you try six.sevens %*% factory?
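One way to reason about %*% is to compare dimensions first: the number of columns of the left matrix must equal the number of rows of the right matrix. The sketch below checks this with dim() and uses tryCatch() to capture the error message instead of stopping the script; the names `product` and `reversed` are illustrative.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2)
six.sevens <- matrix(rep(7, 6), ncol = 3)

dim(factory)     # 2 2
dim(six.sevens)  # 2 3

# [2x2] %*% [2x3]: inner dimensions match, result is 2x3.
product <- factory %*% six.sevens
dim(product)     # 2 3

# [2x3] %*% [2x2]: inner dimensions (3 and 2) differ, so R signals an error.
reversed <- tryCatch(
  six.sevens %*% factory,
  error = function(e) conditionMessage(e)
)
reversed  # the captured error message
```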
Multiplication of matrices and vectors
• Numeric vectors can act like proper vectors:
> output <- c(10,20)
> factory %*% output
[,1]
[1,] 1600
[2,] 70
> output %*% factory
[,1] [,2]
[1,] 420 660
• R silently casts the vector as either a row or a column matrix
Matrix Operators
• Transpose:
> t(factory)
[,1] [,2]
[1,] 40 1
[2,] 60 3
• Determinant:
> det(factory)
[1] 60
• Diagonal: The diag() function can extract the diagonal entries of a matrix:
> diag(factory)
[1] 40 3
Creating a diagonal or identity matrix
> diag(c(3,4))
[,1] [,2]
[1,] 3 0
[2,] 0 4
> diag(2)
[,1] [,2]
[1,] 1 0
[2,] 0 1
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
Inverting Matrix

> solve(factory)
[,1] [,2]
[1,] 0.05000000 -1.0000000
[2,] -0.01666667 0.6666667
> factory %*% solve(factory)
[,1] [,2]
[1,] 1 0
[2,] 0 1
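Beyond inverting, solve() can also be given a right-hand side, in which case it solves the linear system directly. The sketch below asks how many cars and trucks use up exactly 1600 hours of labor and 70 tons of steel (the resource figures that appear elsewhere in this lecture); the name `needed_output` is introduced for this example.

```r
factory <- matrix(c(40, 1, 60, 3), nrow = 2)
available <- c(1600, 70)  # labor hours, tons of steel

# Solve factory %*% needed_output = available for needed_output.
needed_output <- solve(factory, available)
needed_output  # 10 cars and 20 trucks

# Same answer via the explicit inverse.
via_inverse <- as.vector(solve(factory) %*% available)
```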
Naming Matrix

• We can name either rows or columns or both, with rownames() and colnames()
• These are just character vectors, and we use the same function to get and to set their values
• Names help us understand what we're working with
• Names can be used to coordinate different objects
> rownames(factory) <- c("labor","steel")
> colnames(factory) <- c("cars","trucks")
> factory
cars trucks
labor 40 60
steel 1 3
> available <- c(1600,70)
> names(available) <- c("labor","steel")
> output <- c(20,10)
> names(output) <- c("trucks","cars")
> factory %*% output # But we've got cars and trucks mixed
[,1]
labor 1400
steel 50
> factory %*% output[colnames(factory)]
[,1]
labor 1600
steel 70
Data Frames

•Data frames are generic data objects of R which are used to store tabular data.
•They are the most popular data objects in R programming because we are comfortable seeing data in tabular form.
•They are two-dimensional, heterogeneous data structures.
•These are lists of vectors of equal lengths.
•Data frames have the following constraints placed upon them:
•A data frame must have column names and every row should have a unique name.
•Each column must have the identical number of items.
•Each item in a single column must be of the same data type.
•Different columns may have different data types.
•To create a data frame we use the data.frame() function.
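As a minimal sketch of data.frame(), the example below builds a small table from three equal-length vectors of different types. The data frame `vehicles` and its column names are made up for illustration, reusing the car and truck figures from the matrix slides.

```r
vehicles <- data.frame(
  vehicle = c("car", "truck"),   # character column
  labor_hours = c(40, 60),       # numeric column
  steel_tons = c(1, 3),          # numeric column
  stringsAsFactors = FALSE       # keep text as character in all R versions
)

vehicles
dim(vehicles)       # 2 rows, 3 columns
names(vehicles)     # the column names
is.list(vehicles)   # TRUE: a data frame is a list of equal-length vectors
sapply(vehicles, class)  # each column has its own type
```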
Data frame
• Dataframe = the classic data table, n rows for cases, p
columns for variables
• Lots of the really-statistical parts of R presume dataframes
• Not just a matrix because columns can have different types
• Many matrix functions also work for dataframes
(rowSums(), summary(), apply())
• but no matrix multiplying dataframes, even if all columns are
numeric
Data frame
• a.matrix <- matrix(c(35,8,10,4),nrow=2)
• colnames(a.matrix) <- c("v1","v2")
• a.matrix
• v1 v2
• [1,] 35 10
• [2,] 8 4
• a.matrix[,"v1"] # Try a.matrix$v1 and see what happens
• [1] 35 8
Data frame
• a.data.frame <- data.frame(a.matrix,logicals=c(TRUE,FALSE))
• a.data.frame
• v1 v2 logicals
• 1 35 10 TRUE
• 2 8 4 FALSE
• a.data.frame$v1
• [1] 35 8
• a.data.frame[,"v1"]
• [1] 35 8
• a.data.frame[1,]
• v1 v2 logicals
• 1 35 10 TRUE
• colMeans(a.data.frame)
• v1 v2 logicals
• 21.5 7.0 0.5
Adding Rows and columns

• We can add rows or columns to an array or data-frame with rbind() and cbind(), but be careful about forced type conversions
• a.data.frame <- data.frame(a.matrix,logicals=c(TRUE,FALSE))
• rbind(a.data.frame,list(v1=-3,v2=-5,logicals=TRUE))
• v1 v2 logicals
• 1 35 10 TRUE
• 2 8 4 FALSE
• 3 -3 -5 TRUE
• rbind(a.data.frame,c(3,4,6))
• v1 v2 logicals
• 1 35 10 1
• 2 8 4 0
• 3 3 4 6
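The slide demonstrates rbind(); a comparable sketch for cbind() is shown below, together with a check of the type conversion in the second rbind() call. The column name `label` and its values are invented for this example.

```r
a.matrix <- matrix(c(35, 8, 10, 4), nrow = 2)
colnames(a.matrix) <- c("v1", "v2")
a.data.frame <- data.frame(a.matrix, logicals = c(TRUE, FALSE))

# cbind() adds a column; it needs one value per existing row.
wider <- cbind(a.data.frame, label = c("first", "second"))
names(wider)  # "v1" "v2" "logicals" "label"

# Binding a plain numeric vector as a row forces the logical
# column to become numeric (TRUE -> 1, FALSE -> 0).
mixed <- rbind(a.data.frame, c(3, 4, 6))
class(a.data.frame$logicals)  # "logical"
class(mixed$logicals)         # no longer logical
```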
Lists
• A list is a generic object consisting of an ordered collection of
objects. Lists are heterogeneous data structures.
• These are also one-dimensional data structures.
• A list can be a list of vectors, list of matrices, a list of characters and
a list of functions and so on.
• A data frame is actually a special type of another data structure
called a list, which is a collection of objects under the same name.
• There does not have to be any relation in size, type, or other
attribute between different members of the list.
• We create an example list using the list() function, which takes in a
series of objects. What are the types of each element of the following
list?
R program to illustrate a List

# The first attribute is a numeric vector
# containing the employee IDs, which is
# created using the 'c' command here
empId = c(1, 2, 3, 4)
# The second attribute is the employee name, created using
# this line of code, which is a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")
# The third attribute is the number of employees, a single numeric value
numberOfEmp = 4
# We can combine all these three different data types
# into a list containing the details of employees,
# which can be done using the list command
empList = list(empId, empName, numberOfEmp)
Lists
• Sequence of values, not necessarily all of the same type
• > my.distribution <- list("exponential",7,FALSE)
• > my.distribution
• [[1]]
• [1] "exponential"
• [[2]]
• [1] 7
• [[3]]
• [1] FALSE
• Most of what you can do with vectors you can also do with lists
Accessing pieces of a List

• Can use [ ] as with vectors
• or use [[ ]], but only with a single index
• [[ ]] drops names and structures, [ ] does not
> is.character(my.distribution)
[1] FALSE
> is.character(my.distribution[[1]])
[1] TRUE
> my.distribution[[2]]^2
[1] 49
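The distinction between the two kinds of brackets can be inspected with class(). This is a sketch using the same `my.distribution` list; the names `single`, `inner` and `subset_list` are introduced here.

```r
my.distribution <- list("exponential", 7, FALSE)

# [ ] keeps the list wrapper: the result is a list of length 1.
single <- my.distribution[2]
class(single)   # "list"

# [[ ]] pulls out the element itself.
inner <- my.distribution[[2]]
class(inner)    # "numeric"
inner^2         # 49

# [ ] also accepts several indices and returns a shorter list.
subset_list <- my.distribution[c(1, 3)]
length(subset_list)  # 2
```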
Exercise
• What happens if you try my.distribution[2]^2?
• What happens if you try [[ ]] on a vector?
Expanding and contracting List
• Add to lists with c() (also works with vectors):
• > my.distribution <- c(my.distribution,7)
• > my.distribution
• [[1]]
• [1] "exponential"
• [[2]]
• [1] 7
• [[3]]
• [1] FALSE
• [[4]]
• [1] 7
Expanding and contracting List
• Chop off the end of a list by setting the length to something smaller (also works
with vectors):
• > length(my.distribution)
• [1] 4
• > length(my.distribution) <- 3
• > my.distribution
• [[1]]
• [1] "exponential"
• [[2]]
• [1] 7
• [[3]]
• [1] FALSE
Naming elements in a List

• We can name some or all of the elements of a list


• names(my.distribution) <- c("family","mean","is.symmetric")
• my.distribution
• $family
• [1] "exponential"
• $mean
• [1] 7
• $is.symmetric
• [1] FALSE
• my.distribution[["family"]]
• [1] "exponential"
• my.distribution["family"]
• $family
• [1] "exponential"
Shortcut $
• Lists have a special short-cut way of using names, $ (which removes names and
structures):
• my.distribution[["family"]]
• [1] "exponential"
• my.distribution$family
• [1] "exponential"
Key Value Pairs
• Lists give us a way to store and look up data by name, rather than
by position
• A really useful programming concept with many names: key-value
pairs, dictionaries, associative arrays, hashes
• If all our distributions have components named family, we can
look that up by name, without caring where it is in the list
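A small sketch of lookup by name: the two lists below store the same components in different orders, yet `family` is found in both without knowing its position. The `gaussian` list and its values are made up for this illustration.

```r
exponential <- list(family = "exponential", mean = 7, is.symmetric = FALSE)
gaussian <- list(is.symmetric = TRUE, family = "gaussian", mean = 0)

distributions <- list(exponential, gaussian)

# Look up the value stored under the key "family" in every list.
families <- sapply(distributions, function(d) d$family)
families  # "exponential" "gaussian"

# The [[ ]] form does the same lookup with the key given as a string.
key <- "mean"
means <- sapply(distributions, function(d) d[[key]])
means  # 7 0
```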
Practice Questions
• Create a vector of the odd numbers from 1 to 11 using
the seq() function. Then, find the third value in the vector using
indexing, which should have value 5.
• Create a 3x4 matrix of all 1’s using
the rep() and matrix() functions. Then select the first and third
columns using indexing which returns a 3x2 matrix of all ones.
