Introduction to R:
Why do we use R for statistical computing and graphics?
Which companies are using R?
Applications of R in the real world
R reserved words
R data types and constants
R data types:
Logical, numeric, integer, complex, character and raw data
R constants:
Numeric
Character
Built in
R data structures
Vector
List
Matrices
Arrays
Factors
1. How to create factors?
2. How to access components of a factor?
Data frames
1. How to create dataframe in R?
2. How to access components of a data frame?
Using row bind rbind() and column bind cbind() / Installing packages
R flow controls (loops and if then else) / import data sets/ How to read csv files?
R charts and graphs
Pie chart
Bar chart
Box plot
Histogram
Histogram with added parameters
Line graphs
Scatter plots
Strip charts
R statistical functions
Basic statistical functions: mean (average), median, mode, min, max
Correlation, linear regression and multiple linear regression functions
ANOVA functions
Mode of teaching: Lab sessions
Introduction to R programming:
• R is a programming language and environment commonly used in statistical
computing, data analytics and scientific research.
• It is one of the most popular languages used by statisticians, data analysts, researchers
and marketers to retrieve, clean, analyze, visualize and present data.
• Due to its expressive syntax and easy-to-use interface, it has grown in popularity in
recent years.
Why do we use R programming for statistical computing and graphics?
• R is open source and free!
• R is popular – and increasing in popularity
• R runs on all platforms
• Learning R will increase your chances of getting a job
• R is being used by the biggest tech giants
History of R
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland. It is an
implementation of the S programming language, which John Chambers and colleagues
developed at Bell Laboratories, combined with lexical scoping semantics inspired by Scheme.
R was named partly after the first names of its two authors. The project was conceived in
1992, with an initial version released in 1995 and a stable version (1.0.0) in 2000.
Which companies use R, and for what purpose?
Application of R programming in the real world:
• Data Science
• Programming languages like R give a data scientist superpowers that allow them to
collect data in real-time, perform statistical and predictive analysis, create
visualizations and communicate actionable results to stakeholders.
• Statistical computing
• It has a rich package repository with more than 9100 packages with every statistical
function you can imagine
• Machine Learning
Everyone from machine learning enthusiasts to researchers uses R to implement machine learning
algorithms in fields like finance, genetics research, retail, marketing and health care.
Alternatives of R programming:
SAS
SPSS
Python
Features of R
It supports procedural programming with functions and object-oriented
programming with generic functions. Procedural programming is built around procedures,
records, modules, and procedure calls, while object-oriented programming is built around
classes, objects, and functions.
Packages are part of R programming. They collect sets of R
functions into a single unit.
R programming features include database input, exporting data, viewing data, variable
labels, missing data, etc.
R is an interpreted language, so we can access it through a command-line interpreter.
R supports matrix arithmetic.
It has effective data handling and storage facilities.
R supports a large pool of operators for performing operations on arrays and matrices.
It has facilities to print the reports for the analysis performed in the form of graphs either
on-screen or on hardcopy.
We can obtain the installation files for R from the official R website
(www.r-project.org). The website has general documentation related to R along with the
libraries of routines. We can simply download and install R from there.
Run R programming in Windows
• Go to official site of R programming
• Click on the CRAN link on the left sidebar
• Select a mirror
• Click “Download R for Windows”
• Click on the link that downloads the base distribution
• Run the file and follow the steps in the instructions to install R.
RStudio GUI
a. Features of RStudio
Code highlighting that gives different colors to keywords and variables, making it easier
to read
Automatic bracket matching
Code completion, so as to reduce the effort of typing the commands in full
Easy access to R Help, with additional features for exploring functions and parameters of
functions
Easy exploration of variables and values
RStudio is available free of charge for Linux, Windows, and Mac devices, which makes it a
good option to use with R. To open RStudio, click the RStudio icon in the menu system or on
the desktop.
b. Components of RStudio
Source – Top left corner of the screen contains a text editor that lets the user work with
source script files. Multiple lines of code can also be entered here. Users can save R
script file to disk and perform other tasks on the script.
Console – Bottom left corner is the R console window. The console in RStudio is
identical to the console in RGui. All the interactive work of R programming is performed
in this window.
Workspace and History – The top right corner is the R workspace and history window.
This provides an overview of the workspace, where the variables created in the session
along with their values can be inspected. This is also the area where the user can see a
history of the commands issued in R.
Files, Plots, Packages, and Help – The bottom right corner gives access to the following tools:
Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where you can browse the built-in Help system of R.
R reserved words
Comparison of R with other technologies:
Data handling Capabilities – Good data handling capabilities and options for parallel
computation.
Availability / Cost – R is open source and free, so we can use it anywhere.
Advancement in Tool – If you work with the latest techniques, R usually gains the corresponding features quickly.
Ease of Learning – R has a steep learning curve. Work is done by writing code rather than
through menus, so even simple procedures can take several lines of code.
Job Scenario – It is a better option for start-ups and companies looking for cost
efficiency.
Graphical capabilities – R offers some of the most advanced graphical capabilities among
statistical tools.
Customer Service support and community – R has one of the largest and fastest-growing
online communities.
R code and explanation
Vectors:
A vector must have elements of the same type. If the elements differ in type, the c() function
will try to coerce them to a common type.
Coercion goes from lower to higher types: from logical to integer to double to character.
Example 1:
Code:
x <- c(1, 5, 4, 9, 0)
typeof(x)
length(x)
Example 2:
Code:
x <- c(1, 5.4, TRUE, "hello")
x
typeof(x)
If we want to create a vector of consecutive numbers, the : operator is very helpful.
Code:
x <- 1:7; x
y <- 2:-2; y
Creating a vector using seq() function
Code:
seq(1, 3, by=0.2) # specify step size
seq(1, 5, length.out=4) # specify length of the vector
Using integer vector as index
Vector indices in R start from 1, unlike most programming languages, where indices
start from 0.
We can use a vector of integers as index to access specific elements.
We can also use negative integers to return all elements except those specified.
But we cannot mix positive and negative integers while indexing and real numbers, if
used, are truncated to integers.
Code:
x <- c(0, 2, 4, 6, 8, 10); x
[1] 0 2 4 6 8 10
x[3] # access 3rd element
[1] 4
x[c(2, 4)] # access 2nd and 4th element
[1] 2 6
x[-1] # access all but 1st element
[1] 2 4 6 8 10
x[c(2, -4)] # cannot mix positive and negative integers
Error in x[c(2, -4)] : only 0's may be mixed with negative subscripts
x[c(2.4, 3.54)] # real numbers are truncated to integers
[1] 2 4
Using logical vector as index
When we use a logical vector for indexing, the position where the logical vector
is TRUE is returned.
This useful feature helps us filter a vector, as shown below.
x <- c(-3, -1, 0, 3) # assumed definition (not shown in the notes), consistent with the outputs below
x[c(TRUE, FALSE, FALSE, TRUE)]
[1] -3 3
x[x < 0] # filtering vectors based on conditions
[1] -3 -1
x[x > 0]
[1] 3
Using character vector as index
This type of indexing is useful when dealing with named vectors. We can name each
element of a vector.
x <- c("first"=3, "second"=0, "third"=9)
names(x)
[1] "first" "second" "third"
x["second"]
second
0
x[c("first", "third")]
first third
3 9
How to modify a vector in R?
We can modify a vector using the assignment operator.
We can use the techniques discussed above to access specific elements and modify
them.
If we want to truncate the elements, we can use reassignments.
x
[1] -3 -2 -1 0 1 2
x[2] <- 0; x # modify 2nd element
[1] -3 0 -1 0 1 2
x[x<0] <- 5; x # modify elements less than 0
[1] 5 0 5 0 1 2
x <- x[1:4]; x # truncate x to first 4 elements
[1] 5 0 5 0
How to delete a Vector?
We can delete a vector by simply assigning a NULL to it.
x
[1] -3 -2 -1 0 1 2
x <- NULL
x
NULL
x[4]
NULL
Matrix:
Matrix is a two dimensional data structure in R programming.
Matrix is similar to vector but additionally contains the dimension attribute.
All attributes of an object can be checked with the attributes() function (dimension
can be checked directly with the dim() function).
We can check if a variable is a matrix or not with the class() function.
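The checks described above can be sketched as follows. The matrix m is a hypothetical example that does not appear elsewhere in these notes.

```r
# Hypothetical example matrix (not from the notes above)
m <- matrix(1:6, nrow = 2, ncol = 3)

attributes(m)   # lists all attributes; for a plain matrix this is just $dim
dim(m)          # dimensions as c(rows, columns)
class(m)        # includes "matrix" (R 4.0.0 and later also report "array")
is.matrix(m)    # TRUE when the object has a dim attribute of length 2
```

Because class() can return more than one value in recent R versions, is.matrix() is often the simpler test inside an if condition.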
R code for practice:
*Character constants
'example'
typeof("5")
*Numeric Constants
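The notes give no examples under this heading, so here is a minimal sketch of the common numeric constant forms. The specific values are chosen only for illustration.

```r
# Numeric constants: plain numbers are stored as double by default
typeof(5)      # "double"
typeof(5L)     # "integer" (the L suffix requests an integer)
typeof(5i)     # "complex"
typeof(1e3)    # "double" (scientific notation for 1000)
hex_value <- 0xFF   # hexadecimal notation
hex_value           # 255
```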
Types of operators
Arithmetic operators
+ : Add two vectors
- : Subtract the second vector from the first
* : Multiply both the vectors
/ : Divide the first vector by the second
%% : Give the remainder of dividing the first vector by the second
%/% : Give the quotient of dividing the first vector by the second (integer division)
^ : Raise the first vector to the exponent of the second vector
u <- c(2,3,4)
v <- c(9,8,7)
print (u+v)
b <- c(1,2,3)
c <- c(9,8,7)
print(b-c)
print (u-v)
print(v-u)
g <- c(1,2)
h <- c(2,3,4)
print(g*h) # lengths differ, so g is recycled and R issues a warning
g <- c(1,2,3)
h <- c(3,5,6)
print (g*h)
g <- c(1,2,3)
h <- c(3,5,6)
print (g%%h)
print(g %/% h)
print (g^h)
Built in constants
LETTERS
letters
pi
month.name
month.abb
Vector:
Basic statistical operations (code)
mean(c(0, 5, 1, -10, 6))
median(c(0, 5, 1, -10, 6))
var(c(0, 5, 1, -10, 6))
length(c(1, 5, 6, -2))
quantile(c(5,6,7))
sd(c(5,6,7,8))
max(c(5,6,7,8))
min(c(5,6,7,8))
sqrt(c(2, 4))
Mode function:
# R has no built-in function for the statistical mode, so we define one
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
result <- getmode(v)
print(result)
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
vector:
A vector is a basic data structure in R. It contains elements of the same type. The
data types can be logical, integer, double, character, complex or raw.
A vector’s type can be checked with the typeof() function.
Another important property of a vector is its length. This is the number of
elements in the vector and can be checked with the function length().
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
Bschools <- c('MMS','PGPM','PGDM')
Bschools
print(class(Bschools))
List:
List is a data structure having components of mixed data types.
A vector having all elements of the same type is called atomic vector but a vector
having elements of different type is called list.
We can check if it’s a list with typeof() function and find its length using length().
Here is an example of a list having three components each of different data type.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
list2 <- list('MMS students',21.5,c(3,7,8,9)) # named list2 so it is not confused with the built-in list() function
list2
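A minimal sketch of checking and accessing a list follows. The list course and its component names are hypothetical and chosen only for illustration.

```r
# Hypothetical list with three named components of different types
course <- list(name = "MMS students", score = 21.5, marks = c(3, 7, 8, 9))

typeof(course)     # "list"
length(course)     # 3 (number of components, not total elements)

course[["marks"]]  # double brackets return the component itself
course$score       # $ is shorthand for [[ ]] with a name
course["name"]     # single brackets return a sub-list
```

The difference between single and double brackets matters in practice: course["marks"] is still a list, so arithmetic on it fails, while course[["marks"]] is a numeric vector.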
Matrix:
Create a matrix
Matrix can be created using the matrix() function.
Dimension of the matrix can be defined by passing appropriate value for
arguments nrow and ncol.
M = matrix( c('k','a','v','i','t','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
matrix(1:9, nrow = 3, ncol = 3)
matrix(1:9, nrow = 3)
matrix(1:9, nrow=3, byrow=TRUE) # fill matrix row-wise
x <- matrix(1:9, nrow = 3, dimnames = list(c("India","USA","UK"), c("C1","C2","C3")))
Changing and accessing column names and row names
colnames(x)
[1] "C1" "C2" "C3"
rownames(x)
[1] "India" "USA" "UK"
# It is also possible to change names
colnames(x) <- c("C1","C2","C3")
rownames(x) <- c("R1","R2","R3")
Column bind and row bind
cbind(c(1,2,3),c(4,5,6))
rbind(c(1,2,3),c(4,5,6))
cbind(c('t','e','a','c','h'),c(1,2,3,4,5))
rbind(c('t','e','a','c','h'),c(1,2,3,4,5))
How to modify a matrix?
x[2,2] <- 10; x # modify a single element
x[x<5] <- 0; x # modify elements less than 5
x[-1,] # select all rows except first
x[c(1,2),c(2,3)] # select rows 1 & 2 and columns 2 & 3
x[c(3,2),] # leaving column field blank will select entire columns
x[,] # leaving row as well as column field blank will select entire matrix
factors
A factor is a data structure used for fields that take only a predefined, finite number of
values (categorical data).
For example, a data field such as marital status may contain only values from single,
married, separated, divorced, or widowed. In such a case, we know the possible values
beforehand, and these predefined, distinct values are called levels. Following is an
example of a factor in R.
seeds_rice <- c('IR 20','Basmati','IR 60','Kolam','kolam nasik','IR idli rice','IR 20','Basmati','wada kolam')
seeds_rice
factor_seeds <- factor(seeds_rice)
print(factor_seeds)
print(nlevels(factor_seeds))
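Accessing the components of a factor works much like accessing a vector. The sketch below uses a hypothetical marital-status vector rather than the seed data above.

```r
# Hypothetical example data (not from the notes above)
status <- factor(c("single", "married", "married", "single", "widowed"))

levels(status)                # distinct values, sorted alphabetically by default
status[2]                     # access by position; the result is still a factor
status[status == "married"]   # logical filtering works as with vectors
table(status)                 # count of observations per level
as.integer(status)            # underlying integer codes that index into levels()
```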
Data frames
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
Temp <- data.frame(
Min = c(23,12,13,5),
Max = c(23,45,45,65)
)
print(Temp)
x <- data.frame("SN" = 1:2, "Age" = c(47,75), "Name" = c("kavita","ramalingam"))
str(x) # structure of x
x["Name"]
x$Name
x[["Name"]]
x[[3]]
Combining two data frames
library(gtools) # smartbind() comes from the gtools package, which must be installed first
df1 = data.frame(a = c(1:5), b = c(6:10))
df2 = data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])
smartbind(df1,df2)
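If installing gtools is not an option, a similar result can be sketched in base R. This is an alternative technique, not what smartbind() does internally: the missing column is added by hand before calling rbind().

```r
df1 <- data.frame(a = 1:5, b = 6:10)
df2 <- data.frame(a = 11:15, b = 16:20, c = LETTERS[1:5],
                  stringsAsFactors = FALSE)

# rbind() requires identical column names, so add the missing column first
df1$c <- NA_character_
combined <- rbind(df1, df2)
print(combined)
```

The rows that came from df1 carry NA in column c, which is also how smartbind() marks values that were absent in one of its inputs.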
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Ricky","Danish","Mini","Ryan","Gary"),
salary = c(643.3,515.2,671.0,729.0,943.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11","2015-03-27")),
stringsAsFactors = FALSE
)
print(emp.data)
Get the Structure of the R Data Frame
The structure of the data frame can be seen by using the str() function.
str(emp.data)
extract specific columns
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
extract first two rows
result <- emp.data[1:2,]
print(result)
3rd and 5th row and 2nd and 4th column
result <- emp.data[c(3,5),c(2,4)]
print(result)
add the dept column
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
create a 2nd data frame
emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)
emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
Using Loops
Exercise 1:
How to print a multiplication table
num = as.integer(readline(prompt = "Enter a number: "))
# use for loop to iterate 10 times
for(i in 1:10)
print(paste(num,'x', i, '=', num*i))
Exercise:2
How to print an addition table
num = as.integer(readline(prompt = "Enter a number: "))
for(i in 1:10)
print(paste(num,'+', i, '=', num +i))
Exercise:3
To check whether the given number is even or odd
num = as.integer(readline(prompt="Enter a number: "))
if((num %% 2) == 0) {
print(paste(num,"is Even"))
} else {
print(paste(num,"is Odd"))
}
Charts and its types:
max.temp <- c(22, 27, 26, 24, 23, 26, 28)
barplot(max.temp)
bar chart with added parameters:
barplot(max.temp,
main = "Maximum Temperatures in a Week",
xlab = "Degree Celsius",
ylab = "Day",
names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
col = "darkred",
horiz = TRUE)
plotting categorical data:
age <- c(17,18,18,17,18,19,18,16,18,18)
table(age)
barplot(table(age),
main="Age Count of 10 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10)
histogram
Builtin data sets
str(airquality) # str structure of the data set
Temperature <- airquality$Temp
hist(Temperature)
added parameters
hist(Temperature,
main="Maximum daily temperature at La Guardia Airport",
xlab="Temperature in degrees Fahrenheit",
xlim=c(50,100),
col="darkmagenta",
freq=FALSE)
return value of hist()
h <- hist(Temperature)
return values for labels using text()
h <- hist(Temperature,ylim=c(0,40))
text(h$mids,h$counts,labels=h$counts, adj=c(0.5, -0.5))
histogram using different breaks
hist(Temperature, breaks=4, main="With breaks=4")
hist(Temperature, breaks=20, main="With breaks=20")
histogram with non uniform width:
hist(Temperature,
main="Maximum daily temperature at La Guardia Airport",
xlab="Temperature in degrees Fahrenheit",
xlim=c(50,100),
col="chocolate",
border="brown",
breaks=c(55,60,70,75,80,100))
box plot
str(airquality)
boxplot(airquality$Ozone) # ozone readings
boxplot(airquality$Ozone,
main = "Mean ozone in parts per billion at Roosevelt Island",
xlab = "Parts Per Billion",
ylab = "Ozone",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE)
b <- boxplot(airquality$Ozone)
boxplot(Temp~Month,
data=airquality,
main="Different boxplots for each month",
xlab="Month Number",
ylab="Degree Fahrenheit",
col="orange",
border="brown")
strip chart
str(airquality)
stripchart(airquality$Ozone)
using jitter as a method
stripchart(airquality$Ozone,
main="Mean ozone in parts per billion at Roosevelt Island",
xlab="Parts Per Billion",
ylab="Ozone",
method="jitter",
col="orange",
pch=1)
to draw multiple strips we want to prepare data set
# prepare the data
temp <- airquality$Temp
# generate a normal distribution with the same mean and sd
tempNorm <- rnorm(200,mean=mean(temp, na.rm=TRUE), sd = sd(temp, na.rm=TRUE))
# make a list
x <- list("temp"=temp, "norm"=tempNorm)
stripchart(x,
main="Multiple stripchart for comparison",
xlab="Degree Fahrenheit",
ylab="Temperature",
method="jitter",
col=c("orange","red"),
pch=16)
strip chart from the formula
stripchart(Temp~Month,
data=airquality,
main="Different strip chart for each month",
xlab="Months",
ylab="Temperature",
col="brown3",
group.names=c("May","June","July","August","September"),
vertical=TRUE,
pch=16)
TYPES OF CHARTS
Data set
class.interval frequency
11.5-16.5 2
16.5-21.5 6
21.5-26.5 7
26.5-31.5 5
31.5-36.5 3
hist(CHARTS1$frequency,right = FALSE)
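The notes never define CHARTS1. A minimal sketch, assuming it is a data frame holding the table above with columns named class.interval and frequency:

```r
# Assumed structure: one row per class interval (column names taken from the table)
CHARTS1 <- data.frame(
  class.interval = c("11.5-16.5", "16.5-21.5", "21.5-26.5", "26.5-31.5", "31.5-36.5"),
  frequency = c(2, 6, 7, 5, 3),
  stringsAsFactors = FALSE
)

# For data that is already grouped, a bar plot of the frequencies
# keeps the class intervals on the axis
barplot(CHARTS1$frequency,
        names.arg = CHARTS1$class.interval,
        space = 0,   # no gaps, so the bars resemble a histogram
        xlab = "Class interval", ylab = "Frequency")
```

Note that hist(CHARTS1$frequency) bins the five frequency counts themselves rather than the underlying observations, so the bar plot above is probably closer to the intended picture of the grouped data.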
histogram
v <- c(9,13,21,8,36,22,12,41,31,33,19)
hist(v,xlab = "Weight",col = "yellow",border = "blue")
hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim =
c(0,5),
breaks = 5)
plot
v <- c(7,12,28,3,41)
plot(v,type = "o")
line chart
v <- c(7,12,28,3,41)
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
multiple lines in a chart
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
lines(t, type = "o", col = "blue")
stem(CHARTS1$frequency)
Output:
The decimal point is at the |
2 | 00
4 | 0
6 | 00
dotchart(CHARTS1$frequency)
SCATTER PLOT
plot(CHARTS1$frequency)
barplot(CHARTS1$frequency)
Values <- matrix(c(28,40,38,50,53,55,38,30,53),
nrow=3,ncol=3,byrow=TRUE,
dimnames = list(c("A","B","C"),c("1947","1957","1967")))
State <- c ("A","B","C")
colors <-c("darkblue","red","yellow")
barplot(Values, main="production of paddy",
xlab="Years", col=c("darkblue","red","yellow"),
beside=TRUE,ylab = "production of paddy in lakh tonnes")
legend("bottomright", State, cex=1.3, fill=colors)
Values <- matrix(c(28,40,38,50,53,55,38,30,53),
nrow=3,ncol=3,byrow=TRUE,
dimnames = list(c("A","B","C"),c("1947","1957","1967")))
State <- c ("A","B","C")
colors <-c("darkblue","red","yellow")
barplot(Values, main="production of paddy",
xlab="Years", col=c("darkblue","red","yellow"),
ylab = "production of paddy in lakh tonnes")
legend("bottomright", State, cex=1.3, fill=colors)
slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")
slices <- c(10, 12, 4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
main="Pie Chart of Countries")
x <- seq(-pi,pi,0.1)
plot(x, sin(x))
plot(x, sin(x),
main="The Sine Function",
ylab="sin(x)")
plot(x, sin(x),
main="The Sine Function",
ylab="sin(x)",
type="l",
col="blue")
plot(x, sin(x),
main="Overlaying Graphs",
ylab="",
type="l",
col="blue")
lines(x,cos(x), col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red"))
max.temp # a named vector used for plotting
Sun Mon Tue Wed Thu Fri Sat
22 27 26 24 23 26 28
par(mfrow=c(1,2)) # set the plotting area into a 1*2 array
barplot(max.temp, main="Barplot")
pie(max.temp, main="Piechart", radius=1)
Temperature <- airquality$Temp
Ozone <- airquality$Ozone
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature, horizontal=TRUE)
hist(Ozone)
boxplot(Ozone, horizontal=TRUE)
make labels and margins smaller
par(cex=0.7, mai=c(0.1,0.1,0.2,0.1))
Temperature <- airquality$Temp
# define area for the histogram
par(fig=c(0.1,0.7,0.3,0.9))
hist(Temperature)
# define area for the boxplot
par(fig=c(0.8,1,0,1), new=TRUE)
boxplot(Temperature)
# define area for the stripchart
par(fig=c(0.1,0.67,0.1,0.25), new=TRUE)
stripchart(Temperature, method="jitter")
drawing a 3D plot
cone <- function(x, y){
sqrt(x^2+y^2)
}
to prepare our variables
x <- y <- seq(-1, 1, length= 20)
z <- outer(x, y, cone)
persp(x, y, z)
persp(x, y, z,
main="Perspective Plot of a Cone",
zlab = "Height",
theta = 30, phi = 15,
col = "springgreen", shade = 0.5)
Read csv file:
mydata <- read.csv ("flowers.csv", header= TRUE)
mydata
output:
mydata <- read.csv ("flowers.csv", header= TRUE)
> mydata
flowers
1 rose
2 liliy
3 champa
4 mogra
5 malligai
6 mullai
7 orchid
8 hibiscus
9 jaswant
10 marigold
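read.csv() needs the file to exist in the working directory. The sketch below avoids that dependency by first writing a small hypothetical CSV to a temporary file; the file contents and the names csv_path and flowers_df are assumptions made for illustration.

```r
# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("flowers", "rose", "champa", "marigold"), csv_path)

# header = TRUE treats the first line as column names
flowers_df <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(flowers_df)      # inspect column names and types
nrow(flowers_df)     # number of data rows (header excluded)
```

If read.csv() reports that it cannot open a file, getwd() shows the directory R is searching, and setwd() or a full path can be used to point it at the right place.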
Anova
Analysis of Variance
Anova code:
y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1)
y2 = c(17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7)
y3 = c(15.2, 18.8, 17.7, 16.5, 15.9, 17.1, 16.7)
y = c(y1, y2, y3)
n = rep(7, 3)
group = rep(1:3, n)
group
tmp = tapply(y, group, stem)
tmpfn = function(x) c(sum = sum(x), mean = mean(x), var = var(x),
n = length(x))
tapply(y, group, tmpfn)
data = data.frame(y = y, group = factor(group))
fit = lm(y ~ group, data)
anova(fit)
df = anova(fit)[, "Df"]
names(df) = c("trt", "err")
df
anova(fit)["Residuals", "Sum Sq"]
anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), 18,
lower.tail = FALSE)
output:
> y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1)
> y2 = c(17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7)
> y3 = c(15.2, 18.8, 17.7, 16.5, 15.9, 17.1, 16.7)
> y = c(y1, y2, y3)
> n = rep(7, 3)
> n
[1] 7 7 7
> group = rep(1:3, n)
> group
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3
> tmp = tapply(y, group, stem)
The decimal point is at the |
16 | 8
17 | 6
18 | 28
19 | 17
20 | 1
The decimal point is at the |
15 | 9
16 | 4
17 | 47
18 | 47
19 | 1
The decimal point is at the |
15 | 29
16 | 57
17 | 17
18 | 8
>
> tmpfn = function(x) c(sum = sum(x), mean = mean(x), var = var(x),
+ n = length(x))
> tapply(y, group, tmpfn)
$`1`
sum mean var n
130.300000 18.614286 1.358095 7.000000
$`2`
sum mean var n
123.600000 17.657143 1.409524 7.000000
$`3`
sum mean var n
117.900000 16.842857 1.392857 7.000000
>
> data = data.frame(y = y, group = factor(group))
> fit = lm(y ~ group, data)
> anova(fit)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
group 2 11.007 5.5033 3.9683 0.03735 *
Residuals 18 24.963 1.3868
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> df = anova(fit)[, "Df"]
> names(df) = c("trt", "err")
> df
trt err
2 18
>
> anova(fit)["Residuals", "Sum Sq"]
[1] 24.96286
>
> anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), 18,
+ lower.tail = FALSE)
[1] 0.7918086 3.0328790
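Interpretation:
If the p-value from the F test is greater than or equal to 0.05, we fail to reject the null hypothesis
(that all group means are equal); otherwise we reject it. Here the p-value is 0.03735, which is
less than 0.05, so the null hypothesis is rejected.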
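The decision rule can also be applied in code. This is a minimal sketch that rebuilds the same data and reads the p-value from the ANOVA table; the 0.05 threshold and the names p_value and alpha are conventions assumed here, not part of the output above.

```r
y <- c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1,
       17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7,
       15.2, 18.8, 17.7, 16.5, 15.9, 17.1, 16.7)
group <- factor(rep(1:3, each = 7))
fit <- lm(y ~ group)

# Pull the p-value of the F test out of the ANOVA table
p_value <- anova(fit)["group", "Pr(>F)"]
alpha <- 0.05   # conventional significance level (an assumption)

if (p_value < alpha) {
  print("Reject the null hypothesis: at least one group mean differs")
} else {
  print("Fail to reject the null hypothesis")
}
```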
Correlation :
cor(CORRELATION, use="complete.obs", method="pearson")
CORRELATION
X Y
1 10 20
2 12 13
3 9 12
4 13 5
5 6 9
6 8 2
7 12 5
8 13 6
OUTPUT:
cor(CORRELATION, use="complete.obs", method="pearson")
X Y
X 1.00000000 -0.09610721
Y -0.09610721 1.00000000
>
cor(CORRELATION, use="complete.obs", method="spearman")
OUTPUT: cor(CORRELATION, use="complete.obs", method="spearman")
X Y
X 1.00000000 -0.09697148
Y -0.09697148 1.00000000
cor(CORRELATION, use="complete.obs", method="kendall")
cov(CORRELATION, use="complete.obs")
output:
cor(CORRELATION, use="complete.obs", method="kendall")
X Y
X 1.00000000 -0.03774257
Y -0.03774257 1.00000000
> cov(CORRELATION, use="complete.obs")
X Y
X 6.553571 -1.428571
Y -1.428571 33.714286
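The calls above assume a data frame named CORRELATION already exists. A minimal sketch of building it from the listed X and Y values and reproducing the Pearson coefficient follows; the variable name r is introduced here only for illustration.

```r
# Build the data frame from the X and Y values listed in the notes
CORRELATION <- data.frame(
  X = c(10, 12, 9, 13, 6, 8, 12, 13),
  Y = c(20, 13, 12, 5, 9, 2, 5, 6)
)

# Pearson correlation between the two columns
r <- cor(CORRELATION$X, CORRELATION$Y, method = "pearson")
r

# cor.test() additionally reports a significance test for the correlation
cor.test(CORRELATION$X, CORRELATION$Y)
```

A coefficient this close to zero suggests almost no linear relationship between X and Y in this small sample.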
Regression
alligator = data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alligator #view data
alligator_regression = lm(lnWeight ~ lnLength, data = alligator)
lm(formula = lnWeight ~ lnLength, data = alligator)
summary(alligator_regression)
alligator_regression = lm(lnWeight ~ lnLength, data = alligator)
> lm(formula = lnWeight ~ lnLength, data = alligator)
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
Coefficients:
(Intercept) lnLength
-8.476 3.431
>
> summary(alligator_regression)
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
Residuals: