0% found this document useful (0 votes)
8 views47 pages

Data - Analysis - With - R - 24

The document serves as an introduction to statistics using R, highlighting its advantages, disadvantages, and functionalities for data analysis. It covers the workflow of data analysis, the use of R for statistical computing, and the creation and manipulation of data frames. Additionally, it emphasizes R's capabilities for reproducible research and high-quality graphical outputs.

Uploaded by

Kar Wai Hong
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views47 pages

Data - Analysis - With - R - 24

The document serves as an introduction to statistics using R, highlighting its advantages, disadvantages, and functionalities for data analysis. It covers the workflow of data analysis, the use of R for statistical computing, and the creation and manipulation of data frames. Additionally, it emphasizes R's capabilities for reproducible research and high-quality graphical outputs.

Uploaded by

Kar Wai Hong
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 47

Introduction to Statistics

Joao Lourenço ([email protected]) and Rachel Marcone ([email protected])

January 2024
Data analysis with R:
An introduction
Data analysis workflow

Hadley Wickham

Garrett Grolemund
Prepare: make data available in a specific format

• Database
• Flat file
• Proprietary file
Which tool to use for data analysis ?
Annoyances with spreadsheets
Microsoft Excel

•Many standard methods in statistics are not available. Other


methods only offer basic options (linear regression)

• Different analysis require user to reorganize the data

•Probably ok for simple calculations (basic summary statistics,


simple regression)

•Add-ons can be used for missing functions (e.g. StatPlus for


Excel)
Libreoffice

• Many types of graphics violate standards of good graphics


Annoyances with spreadsheets

“The date conversions affect at least 30 gene names; the floating-point


conversions affect at least 2,000 if Riken identifiers are included. These
conversions are irreversible; the original gene names cannot be recovered.”
Example of a dataset which is difficult to use with any statistical program
https://en.wikipedia.org/wiki/Comparison_of_statistical_packages
What is R ?

• R is an open source complete and flexible software environment for


statistical computing and graphics.
• It includes :
• Tools for data import and manipulation
• Large set of data analysis tools
• Graphical tools
• As a programming language, a simple development environment, with a text
editor
• R itself is written primarily in C and Fortran, and is an implementation
of the statistical language S
Why R ?

• R has become the tool of choice for statistical analysis in several


fields, including life sciences
• Two reasons for this success: it is free and many contributed packages
are available (can be installed and run directly from R).
• Well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed.
• Many tools implemented for bioinformatics
Advantages of R

• Advantages of R
▪ Availability and compatibility
▪ State-of-the-art graphics capabilities
▪ Can import files from other (statistical) programs
▪ New version every x months
▪ Interactive development environments (IDEs) available
▪ Large users community

• Advantages of learning R
▪ Learn to program and do reproducible research
▪ Speak the common language
Drawbacks of R

• «Expert friendly»
• Learn by example
• Not very (easily) interactive
• Command-based
• Documentation sometimes cryptic

• (Too) large amount of resources


• Constantly evolving
• Memory intensive and slow at times
Downloading and installing R: the R website

https://www.r-project.org/
R console

The prompt “>”


indicates that R is
waiting for you to
type a command
RStudio interface

Editor Workspace,
history

File explorer,
plots,
packages,
Console,
help
terminal
R scripts and workspace

• R script (.R file)


▪ Very useful instead of typing commands on the console.
▪ Allows you to keep track of what you are doing and make any modification easier
▪ To actually execute some commands, you can select the lines and run the execution

• Workspace (.Rdata file)


▪ The internal memory where R will store the objects you created during the session.
▪ To list what is in your workspace: ls()
▪ To empty the workspace from all objects: rm(list=ls())
▪ To save only specific R objects: save(object_name(s),"name_of_file.RData")
▪ To save your entire workspace: save.image("name_of_file.RData")
▪ To load your workspace / specific R objects: load("name_of_file.RData")
R Markdown

• R Markdown provides an authoring framework for data science. You can use a single R Markdown file to
both:
▪ save and execute code
▪ generate high quality reports that can be shared with an audience
• R Markdown documents are fully reproducible and support dozens of static and dynamic output formats

https://rmarkdown.rstudio.com/lesson-1.html
LeavingR

• Toleave R, use the q()command (or "quit" from the menu in RStudio):
> q()
Save workspace image? [y/n/c]:

Answers:
y save workspace image
n don't save workspace image
c cancel quitting
Functions, operators and variables

CIhigh <- mean(x) + 1.96*sd(x)/sqrt(n)

Variables: objects stored in memory


Functions: always followed by parenthesis
Operators
R syntax

• Case sensitive: A is not a


• Variable names can include A-Z, a-z, 0-9, .… but can not start with a
number
• Commands can be separated by ; or newline
> x <- 2; x+2
[1] 4
• # indicates comments:
> maxvalue <- 2 # Data above two is not relevant
R help

> ?sum # equivalent to help(sum)


Using R as a calculator

> 2*3
[1] 6
>log(6)/2^2
[1] 0.4479399
>exp(6)-4
[1] 399.4288
> pi-3
[1] 0.1415927
Using R as a programming language

> x <- 2.0


> x
[1] 2.0
> y = 3.0 # Equivalent to y <- 3.0
> y; x
[1] 3
[1] 2
>1/x
[1] 0.5
Creating vectors using the c() command

> x <- c(1.3, 0.32 10.5, 5.9, 6.3)


,
> x
[1] 1.30 0.32 10.5 5.90 6.30
0
> y <- c(x, 1.4, x, x); y
[1] 1.30 0.32 10.5 5.90 6.30
0
[6] 1.40 1.30 0.32 10.50 5.90
[11] 6.30 1.30 0.3 10.50 5.90
2
[16] 6.30
Vector operations

Vector operations work element by element:

> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)


> y <- x*2; y
[1] 2.60 0.64 21.00 11.80 12.60
>z <- x*y; z
[1] 3.38 0.21 220.50 69.62 79.38
Recycling
• If a vector is too short, R recycles it (reuses it) as needed:
> x <- c(1.3, 0.32, 10.5, 5.9)
> y <- c(2, 10)
> x*y
[1] 2.6 3.2 21.0 59.0
1.3*2 0.32*10 10.5*2 5.9*10

• A warning message is displayed if the shortest vector can not be recycled entirely:
> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> x*y
[1] 2.6 3.2 21.0 59.0 12.6
Warning message:
In x * y :
longer object length is not a multiple of shorter object length
Generating sequences of numbers
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10

This is equivalent to:


>c(1,2,3,4,5,6,7,8,9,10)
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
Beware of operator priority
> x <- 2*1:10
# equivalent to x <- 2*(1:10)
> x
[1] 2 4 6 8 10 12 14 16 18 20
> n <- 10
> 1:n-1
# equivalent to (1:n)-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(n-1)
[1] 1 2 3 4 5 6 7 8 9
The seq() function: the same, but more flexible
> seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1, to=5, by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17 5.0
] 0
Non numeric vectors: boolean (logical) values
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17] 5.00
> y <- x<5 # help(“<”) shows list of relational operators
> y
[1] TRUE TRUE TRUE TRUE TRUE TRUE
[7] TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE FALSE
>sum(x<5)
[1] 16
Missing values are designated by NA
> z <- c(1:3,NA)
> z
[1] 1 2 3 NA
> is.na(z)
[1] FALSE FALSE FALSE TRUE
> mean(z)
[1] NA
> mean(z, na.rm=TRUE)
[1] 2
Character strings
> char <- c("hello","world","!"); char
[1] "hello" "world" "!"

Vectors can not combine numbers and characters:


> char <- c("hello",3:5,"world"); char
[1] "hello" "3" "4" "5" "world"
> char <- c(char, NA); char
[1] "hello" "3" "4" "5" "world" NA
Selecting subsets of vectors using [ ]
> x <- 10:30
> x[2]
[1] 11
> x[1:5]
[1] 10 11 12 13 14
Selecting subsets of vectors using [ ] and boolean vectors
> x <- 10:30
> x[x>25]
[1] 26 27 28 29 30
>x <-c(seq(from=5, to=10,by=0.5),NA,
seq(from=11,to=15,by=0.5),NA,
seq(from=16,to=20,by=0.5))
> x[!is.na(x)]
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
[9] 9.0 9.5 10.0 11.0 11.5 12.0 12.5 13.0
[17] 13.5 14.0 14.5 15.0 16.0 16.5 17.0 17.5
[25] 18.0 18.5 19.0 19.5 20.0
Changing parts of vectors using [ ]
> x[32] <- 200
> x[c(10,29)] <- c(1,100)
> x[x>15] <- NA
Finding the length of a vector
> x <- 1:5
> length(x)
[1] 5

> y <- 1:16


>len <- length(y) ; len
[1] 16
Data analysis workflow
Importing data into R
• R can import flat files using e.g. the commands:
read.table()
read.csv()
read.delim()
(with many options – check the help).

• R can also:
▪ Read Excel spreadsheets
▪ Read plenty of other formats
▪ Directly access databases
▪ Access files over the web
Data frames
• Data frames are made of columns having all the same number of elements
• They look like matrices, except that the columns can hold different variables
types
• They are typically used to store data, with
▪ Each row being an experimental unit
▪ Each column being a measurement

> data[,1] # access first column


> data[, "data1"] # access column "data1"
> data$data1 # … same
Creating data frames
> x <- 1:10
> y <- seq(from=5,to=10,length=10)
> z <- c("A","B","B","A","A","A","B","A","B","B")
> df <- data.frame(d1=x, d2=y, fact=z)
> df
d1 d2 fact
1 1 5.000000 A
2 2 5.555556 B
..
> names(df)
[1] "d1" "d2" "fact“
>dim(df)
[1] 10 3
Adding new columns
> df$d3 <- 10:1
> df
d1 d2 fact d3
1 1 5.000000 A 10
2 2 5.555556 B 9

> summary(df)
d1 d2 fact d3
Min. : 1.00 Min. : 5.00 Length:10 Min. : 1.00
1st Qu.: 3.25 1st Qu.: 6.25 Class :character 1st Qu.: 3.25
Median : 5.50 Median : 7.50 Mode :character Median : 5.50
Mean : 5.50 Mean : 7.50 Mean : 5.50
3rd Qu.: 7.75 3rd Qu.: 8.75 3rd Qu.: 7.75
Max. :10.00 Max. :10.00 Max. :10.00
Select data from a data frame
• Select all valuesof "d2" for which "fact" is "B"
> df[ df$fact == "B", "d2" ]
[1] 5.555556 6.111111 8.333333 9.444444 10.000000

• Select all values of "d1" for which "fact"is "B " and "d2" > 7
> df[ (df$fact == "B" & df$d2 > 7), "d1" ]
[1] 7 9 10

• Select all values of "d3" for which "fact" is “A " or "d2" < 6
>df[ (df$fact == "B" | df$d2 < 6), "d3" ]
[1] 10 9 8 4 2 1
Exercise
• Import students.csv into a variable (call it data)

• Extract the weight of women only in a new


variable

• Extract the weights of the people who weight


more than 80 kilos

• Extract the entries of men who weight more


than 80 kg (you can use the "&" operator to
include two conditions)
If you do not know what to do:

1.Extract the weight of women only in


a new variable
2.Extract the weights of the people
who weight more than 80 kilos
3.Extract the entries of men who
weight more than 80 kg
[you can use the "&" operator to
include two conditions]

You might also like