Data - Analysis - With - R - 24
January 2024
Data analysis with R:
An introduction
Data analysis workflow
Hadley Wickham
Garrett Grolemund
Prepare: make data available in a specific format
• Database
• Flat file
• Proprietary file
Which tool to use for data analysis?
Annoyances with spreadsheets
Microsoft Excel
• Advantages of R
▪ Availability and compatibility
▪ State-of-the-art graphics capabilities
▪ Can import files from other (statistical) programs
▪ New version every x months
▪ Interactive development environments (IDEs) available
▪ Large user community
• Advantages of learning R
▪ Learn to program and do reproducible research
▪ Speak the common language
Drawbacks of R
• «Expert friendly»
• Learn by example
• Not very (easily) interactive
• Command-based
• Documentation sometimes cryptic
https://www.r-project.org/
R console
RStudio window panes:
• Editor
• Workspace, history
• Console, terminal
• File explorer, plots, packages, help
R scripts and workspace
• R Markdown provides an authoring framework for data science. You can use a single R Markdown file to
both:
▪ save and execute code
▪ generate high quality reports that can be shared with an audience
• R Markdown documents are fully reproducible and support dozens of static and dynamic output formats
https://rmarkdown.rstudio.com/lesson-1.html
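To make the idea of a single file holding both text and code concrete, here is a minimal sketch that writes a tiny R Markdown file and renders it. The report title, the text and the code chunk are invented for illustration. Rendering assumes that the rmarkdown package and pandoc are installed, so the sketch checks for both and skips that step otherwise.

```r
# Three backticks mark a code chunk in R Markdown; they are built here
# with strrep() only to keep this listing readable.
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document: a YAML header, some text,
# and one R code chunk. All content is made up for this example.
rmd_lines <- c(
  "---",
  "title: \"A minimal report\"",
  "output: html_document",
  "---",
  "",
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "summary(c(1.3, 0.32, 10.5, 5.9, 6.3))",
  fence
)

# Save the document to a temporary file
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Render to HTML only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(html_file)  # path of the generated report
}
```

In RStudio the same result is usually obtained by opening the .Rmd file and clicking the "Knit" button.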
Leaving R
• To leave R, use the q() command (or "Quit" from the menu in RStudio):
> q()
Save workspace image? [y/n/c]:
Answers:
y save workspace image
n don't save workspace image
c cancel quitting
Functions, operators and variables
> 2*3
[1] 6
> log(6)/2^2
[1] 0.4479399
> exp(6)-4
[1] 399.4288
> pi-3
[1] 0.1415927
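The examples above use operators and functions only. As a small sketch of the third element, variables, the lines below store results with the assignment operator `<-` and reuse them. The variable names (radius, area, half_log) are made up for illustration.

```r
# Store a value in a variable with the assignment operator "<-"
radius <- 2

# Variables can be combined with operators and built-in constants such as pi
area <- pi * radius^2
area
# [1] 12.56637

# The result of a function call can be stored in the same way
half_log <- log(6) / 2^2
half_log
# [1] 0.4479399
```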
Using R as a programming language
• A warning message is displayed if the shorter vector cannot be recycled a whole number of times, i.e. if the longer length is not a multiple of the shorter one:
> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> y <- c(2, 10)
> x*y
[1]  2.6  3.2 21.0 59.0 12.6
Warning message:
In x * y :
  longer object length is not a multiple of shorter object length
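For contrast, here is a short sketch of recycling that produces no warning, because the longer length is a multiple of the shorter one. The vectors are invented for this example.

```r
# Lengths 4 and 2: the shorter vector is reused exactly twice, no warning
a <- c(1, 2, 3, 4)
b <- c(10, 100)
prod_ab <- a * b
prod_ab
# [1]  10 200  30 400

# A single number is a vector of length 1 and is recycled for every element
doubled <- a * 2
doubled
# [1] 2 4 6 8
```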
Generating sequences of numbers
> 1:10
[1]  1  2  3  4  5  6  7  8  9 10
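Besides the colon operator, base R offers seq() and rep() for less regular sequences. The calls below are a brief sketch with arbitrary values.

```r
# seq() with an explicit step size
quarters <- seq(0, 1, by = 0.25)
quarters
# [1] 0.00 0.25 0.50 0.75 1.00

# A negative step gives a decreasing sequence
countdown <- seq(10, 1, by = -3)
countdown
# [1] 10  7  4  1

# seq_len(n) gives 1, 2, ..., n
first_five <- seq_len(5)
first_five
# [1] 1 2 3 4 5

# rep() repeats the elements of a vector
labels <- rep(c("A", "B"), times = 3)
labels
# [1] "A" "B" "A" "B" "A" "B"
```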
• R can also:
▪ Read Excel spreadsheets
▪ Read plenty of other formats
▪ Directly access databases
▪ Access files over the web
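As an illustration of reading a flat file, the sketch below writes a small CSV file to a temporary location and reads it back with base R. The file contents and column names are invented. Excel files and databases are usually handled through add-on packages (for example readxl or DBI), which are not shown here.

```r
# Write a small comma-separated file to a temporary location.
# The contents are made up for this example.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,age,height",
             "Ann,23,1.68",
             "Bob,31,1.82",
             "Cleo,27,1.75"), csv_file)

# read.csv() returns a data frame; the first line supplies the column names
people <- read.csv(csv_file)
str(people)

# Columns are accessed with $
mean_age <- mean(people$age)
mean_age
# [1] 27

# read.table() is the more general function; separator and header
# are then given explicitly
people2 <- read.table(csv_file, header = TRUE, sep = ",")
```

read.csv() also accepts a URL in place of a file path, which is one way to access files over the web.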
Data frames
• Data frames are made of columns that all have the same number of elements
• They look like matrices, except that the columns can hold different variable types
• They are typically used to store data, with
▪ Each row being an experimental unit
▪ Each column being a measurement
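A minimal sketch of such a data frame, built by hand with data.frame(). The column names and values are invented for illustration; each row is one sample and each column one kind of measurement.

```r
# Four experimental units (rows), four variables (columns) of different types
measurements <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),        # character
  group  = factor(c("A", "A", "B", "B")),    # factor
  weight = c(12.1, 11.8, 14.3, 13.9),        # numeric
  passed = c(TRUE, FALSE, TRUE, TRUE),       # logical
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(measurements)
# [1] 4 4

# Compact overview of the column types
str(measurements)

# One column is an ordinary vector
mean_weight <- mean(measurements$weight)
mean_weight
# [1] 13.025
```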
• Select all values of "d1" for which "fact" is "B" and "d2" > 7
> df[ (df$fact == "B" & df$d2 > 7), "d1" ]
[1] 7 9 10
• Select all values of "d3" for which "fact" is "B" or "d2" < 6
> df[ (df$fact == "B" | df$d2 < 6), "d3" ]
[1] 10 9 8 4 2 1
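The data frame df used above is not shown, so the sketch below builds a stand-in with invented values, chosen only so that the two selections return the same results as above. It also shows the logical vector that does the actual row selection.

```r
# Hypothetical data frame (not the original data): values were picked
# so that the selections reproduce the outputs shown above
df <- data.frame(
  d1   = 1:10,
  d2   = c(1, 2, 3, 6, 7, 8, 9, 6, 8, 10),
  d3   = 10:1,
  fact = c("A", "A", "A", "A", "A", "A", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# The condition is a logical vector with one TRUE/FALSE per row
keep_and <- df$fact == "B" & df$d2 > 7
keep_and
# [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Rows where the condition is TRUE are kept; "d1" selects the column
sel_and <- df[keep_and, "d1"]
sel_and
# [1]  7  9 10

# "|" keeps rows where at least one of the two conditions holds
sel_or <- df[df$fact == "B" | df$d2 < 6, "d3"]
sel_or
# [1] 10  9  8  4  2  1

# subset() expresses the same selection without repeating "df$"
sel_subset <- subset(df, fact == "B" & d2 > 7)$d1
```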
Exercise
• Import students.csv into a variable (call it data)