Introduction to Statistics

Joao Lourenço and Rachel Marcone

January 2024

Data analysis with R: An introduction

Data analysis workflow

[Workflow diagram; credit: Hadley Wickham and Garrett Grolemund]

Prepare: make data available in a specific format

• Database
• Flat file
• Proprietary file
Which tool to use for data analysis?

Annoyances with spreadsheets (Microsoft Excel, LibreOffice)

• Many standard statistical methods are not available; others offer only basic options (e.g. linear regression)
• Different analyses require the user to reorganize the data
• Probably OK for simple calculations (basic summary statistics, simple regression)
• Add-ons can be used for missing functions (e.g. StatPlus for Excel)
• Many types of graphics violate standards of good graphics

Annoyances with spreadsheets

“The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.”

Example of a dataset which is difficult to use with any statistical program

https://en.wikipedia.org/wiki/Comparison_of_statistical_packages
What is R?

• R is an open-source, complete and flexible software environment for statistical computing and graphics.
• It includes:
▪ Tools for data import and manipulation
▪ A large set of data analysis tools
▪ Graphical tools
▪ As a programming language: a simple development environment with a text editor
• R itself is written primarily in C and Fortran, and is an implementation of the statistical language S
Why R?

• R has become the tool of choice for statistical analysis in several fields, including life sciences
• Two reasons for this success: it is free, and many contributed packages are available (they can be installed and run directly from R).
• Well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
• Many tools are implemented for bioinformatics
Advantages of R

• Advantages of R
▪ Availability and compatibility
▪ State-of-the-art graphics capabilities
▪ Can import files from other (statistical) programs
▪ New version every x months
▪ Interactive development environments (IDEs) available
▪ Large user community

• Advantages of learning R
▪ Learn to program and do reproducible research
▪ Speak the common language
Drawbacks of R

• «Expert friendly»
• Learn by example
• Not very (easily) interactive
• Command-based
• Documentation sometimes cryptic

• (Too) large amount of resources

• Constantly evolving
• Memory intensive and slow at times
Downloading and installing R: the R website

https://www.r-project.org/
R console

The prompt “>” indicates that R is waiting for you to type a command
RStudio interface

• Editor
• Workspace, history
• Console, terminal
• File explorer, plots, packages, help
R scripts and workspace

• R script (.R file)

▪ Very useful as an alternative to typing commands in the console.
▪ Allows you to keep track of what you are doing and makes modifications easier
▪ To execute commands, select the lines and run them

• Workspace (.RData file)

▪ The internal memory where R stores the objects you created during the session.
▪ To list what is in your workspace: ls()
▪ To remove all objects from the workspace: rm(list=ls())
▪ To save only specific R objects: save(object_name, file="name_of_file.RData")
▪ To save your entire workspace: save.image("name_of_file.RData")
▪ To load your workspace / specific R objects: load("name_of_file.RData")
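The save/load cycle described above can be sketched as follows. The object names (scores, label) and the temporary file are illustrative assumptions, not part of the course material.

```r
# Hypothetical objects created during a session
scores <- c(12, 15, 9)
label <- "pilot"

# Save only selected objects; the file name has to be passed as file=
rdata_file <- tempfile(fileext = ".RData")
save(scores, label, file = rdata_file)

# Remove the objects from the workspace, then restore them from disk
rm(scores, label)
restored <- load(rdata_file)   # load() returns the names of the restored objects
print(restored)
print(scores)
```

Writing to a temporary file keeps the sketch from overwriting any existing .RData file in the working directory.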
R Markdown

• R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both:
▪ save and execute code
▪ generate high quality reports that can be shared with an audience
• R Markdown documents are fully reproducible and support dozens of static and dynamic output formats

https://rmarkdown.rstudio.com/lesson-1.html
Leaving R

• To leave R, use the q() command (or "quit" from the menu in RStudio):
> q()
Save workspace image? [y/n/c]:

Answers:
y save workspace image
n don't save workspace image
c cancel quitting
Functions, operators and variables

CIhigh <- mean(x) + 1.96*sd(x)/sqrt(n)

Variables (CIhigh, x, n): objects stored in memory
Functions (mean, sd, sqrt): always followed by parentheses
Operators: <-, +, *, /
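As a hedged illustration of how the three kinds of elements combine, the sketch below computes both limits of a normal-approximation 95% interval for a mean. The sample values and the name CIlow are assumptions added here; the slide itself only defines CIhigh.

```r
# Hypothetical sample; any numeric vector would do
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6)
n <- length(x)                        # sample size

# Half-width of the interval: 1.96 standard errors
half_width <- 1.96 * sd(x) / sqrt(n)

CIlow  <- mean(x) - half_width        # assumed symmetric counterpart of CIhigh
CIhigh <- mean(x) + half_width

c(lower = CIlow, mean = mean(x), upper = CIhigh)
```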
R syntax

• Case sensitive: A is not a

• Variable names can include A-Z, a-z, 0-9, .… but cannot start with a number
• Commands can be separated by ; or newline
> x <- 2; x+2
[1] 4
• # indicates comments:
> maxvalue <- 2 # Data above two is not relevant
R help

> ?sum # equivalent to help(sum)

Using R as a calculator

> 2*3
[1] 6
> log(6)/2^2
[1] 0.4479399
> exp(6)-4
[1] 399.4288
> pi-3
[1] 0.1415927
Using R as a programming language

> x <- 2.0
> x
[1] 2
> y = 3.0 # Equivalent to y <- 3.0
> y; x
[1] 3
[1] 2
> 1/x
[1] 0.5
Creating vectors using the c() command

> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> x
[1] 1.30 0.32 10.50 5.90 6.30
> y <- c(x, 1.4, x, x); y
[1] 1.30 0.32 10.50 5.90 6.30
[6] 1.40 1.30 0.32 10.50 5.90
[11] 6.30 1.30 0.32 10.50 5.90
[16] 6.30
Vector operations

Vector operations work element by element:

> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)

> y <- x*2; y
[1] 2.60 0.64 21.00 11.80 12.60
> z <- x*y; z
[1] 3.3800 0.2048 220.5000 69.6200 79.3800
Recycling
• If a vector is too short, R recycles it (reuses it) as needed:
> x <- c(1.3, 0.32, 10.5, 5.9)
> y <- c(2, 10)
> x*y # 1.3*2, 0.32*10, 10.5*2, 5.9*10
[1] 2.6 3.2 21.0 59.0

• A warning message is displayed if the length of the longer vector is not a multiple of the length of the shorter one:
> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> x*y
[1] 2.6 3.2 21.0 59.0 12.6
Warning message:
In x * y :
  longer object length is not a multiple of shorter object length
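One way to see what recycling does is to write the repetition out by hand. The sketch below uses rep() to build the extended vector explicitly; the name y_full is introduced here for illustration only.

```r
x <- c(1.3, 0.32, 10.5, 5.9)
y <- c(2, 10)

# What recycling does implicitly: repeat y until it matches length(x)
y_full <- rep(y, length.out = length(x))
y_full

# The implicit and the explicit products are identical
identical(x * y, x * y_full)
```

Making the repetition explicit can be a useful habit when the lengths are not obviously compatible, since silent recycling is a common source of mistakes.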
Generating sequences of numbers
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10

This is equivalent to:

> c(1,2,3,4,5,6,7,8,9,10)
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
Beware of operator priority
> x <- 2*1:10 # equivalent to x <- 2*(1:10)
> x
[1] 2 4 6 8 10 12 14 16 18 20
> n <- 10
> 1:n-1 # equivalent to (1:n)-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(n-1)
[1] 1 2 3 4 5 6 7 8 9
The seq() function: the same, but more flexible
> seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1, to=5, by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17] 5.00
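Two details of seq() may be worth a short sketch. The argument written as length= above is matched by R to the full name length.out, and the related helper seq_len() behaves more predictably than 1:n when n can be zero. The variable names a, b and n are illustrative.

```r
# length= is partially matched to the full argument name length.out
a <- seq(from = 1, to = 5, length.out = 17)
b <- seq(from = 1, to = 5, by = 0.25)
all.equal(a, b)   # the same sequence, specified in two ways

# seq_len() is safer than 1:n when n may be 0
n <- 0
1:n          # counts down: 1 0
seq_len(n)   # empty vector
```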
Non-numeric vectors: boolean (logical) values
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17] 5.00
> y <- x<5 # help("<") shows list of relational operators
> y
[1] TRUE TRUE TRUE TRUE TRUE TRUE
[7] TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE FALSE
> sum(x<5)
[1] 16
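Because TRUE counts as 1 and FALSE as 0, logical vectors can be summarized with ordinary functions. A minimal sketch, using the same sequence as above and the illustrative name below:

```r
x <- seq(from = 1, to = 5, length.out = 17)
below <- x < 5          # logical vector

sum(below)              # number of TRUE values
mean(below)             # proportion of TRUE values
which(!below)           # positions where the condition fails
any(x > 4.9)            # is at least one element above 4.9?
all(x >= 1)             # are all elements at least 1?
```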
Missing values are designated by NA
> z <- c(1:3,NA)
> z
[1] 1 2 3 NA
> is.na(z)
[1] FALSE FALSE FALSE TRUE
> mean(z)
[1] NA
> mean(z, na.rm=TRUE)
[1] 2
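A few further ways of handling missing values, sketched on the same vector z. Note in particular that comparing with NA using == does not test for missingness.

```r
z <- c(1:3, NA)

sum(is.na(z))          # how many values are missing
z[!is.na(z)]           # keep only the observed values
sum(z, na.rm = TRUE)   # many summary functions accept na.rm
z == NA                # always NA: use is.na() to test for missing values
```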
Character strings
> char <- c("hello","world","!"); char
[1] "hello" "world" "!"

Vectors cannot mix numbers and characters; the numbers are converted to character strings:

> char <- c("hello",3:5,"world"); char
[1] "hello" "3" "4" "5" "world"
> char <- c(char, NA); char
[1] "hello" "3" "4" "5" "world" NA
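The conversion can be checked with class(). The sketch below also shows what happens when converting back; the names num, mixed and back are illustrative.

```r
num <- 3:5
mixed <- c("hello", num, "world")

class(num)      # "integer"
class(mixed)    # "character": the numbers were converted to strings

# Converting back: strings that are not numbers become NA (with a warning,
# suppressed here to keep the output short)
back <- suppressWarnings(as.numeric(mixed))
back
```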
Selecting subsets of vectors using [ ]
> x <- 10:30
> x[2]
[1] 11
> x[1:5]
[1] 10 11 12 13 14
Selecting subsets of vectors using [ ] and boolean vectors
> x <- 10:30
> x[x>25]
[1] 26 27 28 29 30
> x <- c(seq(from=5, to=10, by=0.5), NA,
         seq(from=11, to=15, by=0.5), NA,
         seq(from=16, to=20, by=0.5))
> x[!is.na(x)]
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
[9] 9.0 9.5 10.0 11.0 11.5 12.0 12.5 13.0
[17] 13.5 14.0 14.5 15.0 16.0 16.5 17.0 17.5
[25] 18.0 18.5 19.0 19.5 20.0
Changing parts of vectors using [ ]
> x[32] <- 200
> x[c(10,29)] <- c(1,100)
> x[x>15] <- NA
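The effect of each kind of assignment is easier to follow on a short vector. The values below are illustrative and are not the x defined earlier.

```r
x <- c(10, 12, 14, 16, 18)

x[7] <- 200               # assigning past the end extends the vector; the gap is filled with NA
x

x[c(1, 2)] <- c(1, 100)   # replace several positions at once
x

x[which(x > 15)] <- NA    # which() keeps only positions where the test is TRUE
x
```

Using which() here is a design choice for the sketch: it sidesteps the NA that the comparison x > 15 produces for the missing element.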
Finding the length of a vector
> x <- 1:5
> length(x)
[1] 5

> y <- 1:16

> len <- length(y); len
[1] 16
Data analysis workflow
Importing data into R
• R can import flat files using e.g. the commands:
read.table()
read.csv()
read.delim()
(with many options – check the help).

• R can also:
▪ Read Excel spreadsheets
▪ Read plenty of other formats
▪ Directly access databases
▪ Access files over the web
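A minimal, self-contained sketch of importing a flat file with read.csv(). The file contents and the column names (id, group, value) are made up for illustration; the file is written to a temporary location first so the example runs on its own.

```r
# Write a small illustrative file so the example is self-contained
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,value",
             "1,A,2.5",
             "2,B,3.1",
             "3,A,NA"), csv_file)

# read.csv() assumes a header row and comma separators by default
dat <- read.csv(csv_file, stringsAsFactors = FALSE)

str(dat)    # column types chosen by R
head(dat)   # first rows
```

The string NA in the file is read as a missing value by default; other conventions can be declared through the na.strings argument.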
Data frames
• Data frames are made of columns that all have the same number of elements
• They look like matrices, except that the columns can hold different variable types
• They are typically used to store data, with
▪ Each row being an experimental unit
▪ Each column being a measurement

> data[,1] # access first column

> data[, "data1"] # access column "data1"
> data$data1 # … same
Creating data frames
> x <- 1:10
> y <- seq(from=5,to=10,length=10)
> z <- c("A","B","B","A","A","A","B","A","B","B")
> df <- data.frame(d1=x, d2=y, fact=z)
> df
  d1       d2 fact
1  1 5.000000    A
2  2 5.555556    B
...
> names(df)
[1] "d1" "d2" "fact"
> dim(df)
[1] 10 3
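A few functions are commonly used to inspect a data frame once it has been created. The sketch below rebuilds df as above; stringsAsFactors = FALSE is added as an assumption so that the result does not depend on the R version.

```r
x <- 1:10
y <- seq(from = 5, to = 10, length.out = 10)
z <- c("A", "B", "B", "A", "A", "A", "B", "A", "B", "B")
df <- data.frame(d1 = x, d2 = y, fact = z, stringsAsFactors = FALSE)

str(df)              # one line per column: name, type, first values
sapply(df, class)    # columns can have different types
nrow(df); ncol(df)   # the two components of dim(df)
df[1:3, ]            # first three rows, all columns
```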
Adding new columns
> df$d3 <- 10:1
> df
  d1       d2 fact d3
1  1 5.000000    A 10
2  2 5.555556    B  9
…
> summary(df)
       d1              d2            fact                 d3
 Min.   : 1.00   Min.   : 5.00   Length:10          Min.   : 1.00
 1st Qu.: 3.25   1st Qu.: 6.25   Class :character   1st Qu.: 3.25
 Median : 5.50   Median : 7.50   Mode  :character   Median : 5.50
 Mean   : 5.50   Mean   : 7.50                      Mean   : 5.50
 3rd Qu.: 7.75   3rd Qu.: 8.75                      3rd Qu.: 7.75
 Max.   :10.00   Max.   :10.00                      Max.   :10.00
Select data from a data frame
• Select all values of "d2" for which "fact" is "B"
> df[ df$fact == "B", "d2" ]
[1] 5.555556 6.111111 8.333333 9.444444 10.000000

• Select all values of "d1" for which "fact" is "B" and "d2" > 7
> df[ (df$fact == "B" & df$d2 > 7), "d1" ]
[1] 7 9 10

• Select all values of "d3" for which "fact" is "B" or "d2" < 6
> df[ (df$fact == "B" | df$d2 < 6), "d3" ]
[1] 10 9 8 4 2 1
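The same selections can also be written with subset(), which evaluates the condition inside the data frame so the df$ prefix is not needed. This is an alternative added here for comparison, not something the slides use; the name sel is illustrative.

```r
df <- data.frame(d1 = 1:10,
                 d2 = seq(from = 5, to = 10, length.out = 10),
                 fact = c("A", "B", "B", "A", "A", "A", "B", "A", "B", "B"),
                 d3 = 10:1,
                 stringsAsFactors = FALSE)

# Same rows as df[(df$fact == "B" & df$d2 > 7), "d1"], returned as a data frame
sel <- subset(df, fact == "B" & d2 > 7, select = d1)
sel

# Leaving the column index empty keeps whole rows
df[df$fact == "B" & df$d2 > 7, ]
```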
Exercise
• Import students.csv into a variable (call it data)
• Extract the weights of women only into a new variable
• Extract the weights of the people who weigh more than 80 kg
• Extract the entries of men who weigh more than 80 kg (you can use the "&" operator to combine two conditions)

If you do not know what to do:

1. Extract the weights of women only into a new variable
2. Extract the weights of the people who weigh more than 80 kg
3. Extract the entries of men who weigh more than 80 kg (you can use the "&" operator to combine two conditions)
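One possible shape of a solution is sketched below. The structure of students.csv is not shown in the slides, so the column names (gender, weight), the coding of gender as "F"/"M" and all values are assumptions; a small stand-in data frame replaces the real import.

```r
# Stand-in for: data <- read.csv("students.csv")
# Column names and values are hypothetical and only serve the illustration
data <- data.frame(gender = c("F", "M", "M", "F", "M"),
                   weight = c(62, 85, 78, 81, 90),
                   stringsAsFactors = FALSE)

# 1. Weights of women only
weight_women <- data[data$gender == "F", "weight"]

# 2. Weights of the people who weigh more than 80 kg
heavy <- data[data$weight > 80, "weight"]

# 3. Entries (whole rows) of men who weigh more than 80 kg
heavy_men <- data[data$gender == "M" & data$weight > 80, ]

weight_women; heavy; heavy_men
```

With the real file, the column names reported by names(data) would need to be substituted for the assumed ones.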
