R Studio Manual
Class : __________________________________
STATISTICS WITH R-PROGRAMMING
LABORATORY MANUAL
HINDUSTHAN COLLEGE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
COIMBATORE 641 032
Certificate
Date :
Place :
1 Introduction to R Programming
6 Application of F test
MARKS STATEMENT
1 Introduction to R Programming
6 Application of F test
Ex. No. :1 Date :
Introduction to R Programming
Objective:
To understand the concept of R and R Studio programming
To understand the basic data representation
To create a data frame using given data
To import data from an MS Excel file
Formula:
Introduction to R – Programming:
R Studio incorporates the console, script, graphical output and various other elements in
an accessible and easy-to-manipulate form. R Studio is free and available for both Windows and
Macintosh operating systems and can be downloaded from
http://www.rstudio.com/products/rstudio/. Note that the R Studio menu differs slightly between
PC and Mac versions.
The R Studio screen is divided into four resizable parts:
The upper left part contains a script editor where commands are written and
saved. The various tabs in the upper left part can contain multiple scripts and also
data files.
Commands are sent to the console in the lower left part using the key combination
cmd + Return (Macintosh) or Control + Enter (PC).
On the right side, R Studio displays a workspace tab listing all objects in the
current analysis and a history tab providing a recollection of executed
commands.
The lower right partition hosts a figure tab where graphical output is stored, a
package tab where packages can be viewed and installed, a file tab to manipulate
files and a help tab where R help information can be searched and displayed.
In R Studio, you can bundle your analyses into projects using the project drop-down
menu on the top (through “file” for PC and through “project” for Mac) or the pop-up menu in the
top right corner of R Studio (both versions). Projects will contain all elements of analyses
allowing you to continue a session exactly where you ended the previous time.
Procedure:
You can set a new working directory following the menu options Session > Set working
directory
To create a new script, you can follow “File > New File > R script”, or use the shortcut
Ctrl + Shift + N. Save your scripts regularly. A file that has been modified but not saved again
will show with a red title and a * at the end.
You can navigate between different plots produced during a session using the blue arrows
at the top left corner of the Plots tab. You can save your graphs by clicking on “Export”.
R stores data, analyses and outputs as objects in the current workspace that can be saved.
Observe that workspace is a collection of R objects with properties assigned through R
commands whereas the working directory basically is a folder on your computer that contains
various files of any type.
You assign information to a variable using the assignment operator, an "arrow" composed of a
less-than sign and a minus sign (<-) pointing to the name of the variable. There are certain rules for
naming your variables.
R-Code Program:
Online help:
Once R is installed, there is a comprehensive built-in help system.
For general help type:
> help.start()
For help about a function:
> help(log)
You can also list all functions containing a given string.
> apropos("log")
To show an example of a function, type:
> example(log)
First objects:
R lets you assign values to variables and refer to them by name.
In R, we use the symbol <- for assigning values to variables. We can see the results of
the assignment by typing the name of our new object.
> x <- 3
> x
[1] 3
The equals sign = can also be used as assignment operator in most circumstances.
> y = 5
> y
[1] 5
R is case sensitive so X is not the same as x.
> X
Error: object 'X' not found
Variable names should not begin with numbers or symbols and should not contain blank
spaces.
Character strings in R are made with matching quotes:
> myname <- "Bea"
> myname
[1] "Bea"
TRUE and FALSE are reserved words denoting logical constants in the R language.
> mylog <- TRUE
> mylog
[1] TRUE
Special values used in R:
At its most basic level, R can be viewed as a calculator. The elementary arithmetic
operators are +, -, *, / and ^ for raising to a power.
> 7 * (3 + 2)/2
[1] 17.5
> 2^3
[1] 8
Most of the work in R is done through functions. For example, if we want to compute √9
in R we type:
> sqrt(9)
[1] 3
The natural logarithm can be computed with the function log.
> log(5)
[1] 1.609438
Data types:
R deals with numbers, characters (entered in quotes, " ", as "Bea" in the example
above) and logical statements (TRUE or FALSE).
The following types of data are commonly used:
Vector:
One-dimensional, contains a sequence of one type of data, i.e. numbers OR categories
(letters, group names) OR logical statements. Vectors can be created using c(element1, element2,
element3, ...), which concatenates the different elements (connects them one after another) into
a vector. Note that the elements can themselves be vectors.
For example, c("population1","population2","population3","population4") will generate a
vector as follows:
[1] "population1" "population2" "population3" "population4"
Number sequences can be created using the operator ‘:’. For instance,
x <- 1:7 creates the vector x that contains a sequence of number from 1 to 7:
[1] 1 2 3 4 5 6 7
Besides, there are a number of other functions for creating vectors, for example seq()
for user-defined sequences and rep() for repeated elements. You can find out about these
functions using the R help.
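As a small illustration of seq() and rep(), the sketch below builds a few vectors. The particular values and the object names s1, s2, r1 and r2 are chosen only for illustration.

```r
# seq(): sequences with a user-defined step or a user-defined length
s1 <- seq(from = 0, to = 1, by = 0.25)   # step of 0.25 between 0 and 1
s2 <- seq(2, 20, length.out = 4)         # 4 equally spaced values from 2 to 20

# rep(): repeated elements
r1 <- rep("a", times = 3)                # the whole value repeated 3 times
r2 <- rep(c(1, 2), each = 2)             # each element repeated twice: 1 1 2 2

print(s1)
print(s2)
print(r1)
print(r2)
```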
Factor:
Similar to vectors but also contains information on levels. Entries of a factor that are
equal belong to the same factor level, or in other words, to the same category. Factors can be
created from vectors using factor().
For example, you can create a factor named sex using the code below:
sex <- c(rep("male",25), rep("female", 35))
sex <- factor(sex)
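To inspect the factor created above, the following sketch lists its levels and counts the entries in each level. It only uses the base functions levels(), nlevels() and table().

```r
# Build the factor from the example: 25 "male" entries and 35 "female" entries
sex <- factor(c(rep("male", 25), rep("female", 35)))

levels(sex)    # the distinct categories, sorted alphabetically
nlevels(sex)   # number of categories
table(sex)     # number of entries in each category
```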
Data Frame:
A data frame is used for storing data tables.
Each column can contain different modes of data. The first column can be numeric while
the second column can be character and third column can be logical.
It is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
This is the format commonly used for data analysis where each row corresponds to an
observation and each column corresponds to a variable (vector or factor). The section getting
data into R explains how to create data frames from your data and the sections accessing and
changing individual entries, accessing and changing entire rows or columns, adding and deleting
columns explain how to handle and manipulate the contents of data frames.
Example:1
Form a data frame containing the three vectors n, s, b where
n = (2, 3, 5), s = ("aa", "bb", "cc"), and b = (TRUE, FALSE, TRUE).
R-Code:
> s = c("aa", "bb", "cc")
> n = c(2, 3, 5)
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df
Output:
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE
Note:
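> df[1,2]
[1] aa
Levels: aa bb cc
(In R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so this
prints [1] "aa" without the Levels line.)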
> df[3,3]
[1] TRUE
Example: 2
Create a data frame from the following details regarding the name, gender, height, weight
and age.
Name: A, B, C
Gender: Male, Female, Male
Height:152,162,160
Weight:75,82,69
Age:40,35,29
R-Code:
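The R-Code for this example is left to be filled in. One possible answer is sketched below; the object name people is an illustrative choice and not part of the manual.

```r
# One vector per variable, all of equal length
Name   <- c("A", "B", "C")
Gender <- c("Male", "Female", "Male")
Height <- c(152, 162, 160)
Weight <- c(75, 82, 69)
Age    <- c(40, 35, 29)

# Combine the vectors column-wise into a data frame
people <- data.frame(Name, Gender, Height, Weight, Age)
people
str(people)   # shows the type of each column
```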
Data Import:
It is often necessary to import sample textbook data into R before you start working on
your homework.
Excel File:
Quite frequently, the sample data is in Excel format, and needs to be imported into R
prior to use. For this, we can use the function read_excel from the readxl package. It reads from an
Excel spreadsheet and returns a data frame. The following shows how to load an Excel
spreadsheet stored at "D:/kannan/kk.xls". (An older alternative is the function read.xls from the
gdata package, which requires a Perl runtime to be present in the system.)
R-Code:
> library(readxl)
> kk <- read_excel("D:/kannan/kk.xls")
> View(kk)
CSV File:
The sample data can also be in comma separated values (CSV) format. Each cell inside
such data file is separated by a special character, which usually is a comma, although other
characters can be used as well.
The first row of the data file should contain the column names instead of the actual data.
Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
After we copy and paste the data above in a file named "mydata.csv" with a text editor,
we can read the data with the function read.csv.
R-Code:
> mydata = read.csv("mydata.csv") # read csv file
> mydata
Output:
  Col1 Col2 Col3
1  100   a1   b1
2  200   a2   b2
3  300   a3   b3
In various European locales, as the comma character serves as the decimal point, the
function read.csv2 should be used instead. For further detail of the read.csv and read.csv2
functions, please consult the R documentation.
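To try read.csv and read.csv2 without creating a file first, both functions accept the data as a string through the text argument. The sketch below uses the sample data shown above; the semicolon-separated sample and the object names csv_text, csv2_text and mydata2 are made up for illustration.

```r
# Comma-separated sample from the text, passed as a string instead of a file
csv_text <- "Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3"
mydata <- read.csv(text = csv_text)
mydata

# Made-up sample in the European convention: ';' separates cells, ',' is the decimal point
csv2_text <- "Col1;Col2
1,5;a1
2,5;a2"
mydata2 <- read.csv2(text = csv2_text)
mydata2$Col1   # read as the numbers 1.5 and 2.5
```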
Example: To import data from excel sheet
R-Code:
> help(read.csv)
> details = read.csv("C:/Users/admin/Desktop/Hegiht and Weight.csv")
> details
> Highest = subset(details, Height >= 160)   # keep rows with height of at least 160
> print(Highest)
Output:
Name Weight Height
1 X1 62.0 152
2 X2 63.4 176
3 X3 63.2 165
4 X4 65.0 164
5 X5 66.0 162
6 X6 64.2 153
7 X7 64.0 152
8 X8 50.3 150
9 X9 52.0 162
10 X10 56.3 167
11 X11 54.0 164
12 X12 55.0 156
13 X13 56.0 168
14 X14 65.0 174
15 X15 65.0 172
16 X16 68.0 171
17 X17 69.0 170
18 X18 67.0 160
19 X19 65.0 161
20 X20 68.2 162
21 X21 54.0 158
22 X22 70.0 159
23 X23 71.2 160
24 X24 65.0 152
25 X25 64.0 164
26 X26 62.0 162
27 X27 63.0 171
28 X28 61.0 172
29 X29 65.0 170
30 X30 54.0 156
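The effect of the subset() call in the code above can be checked on a small piece of the data. The sketch below types in only the first five rows of the table by hand, so that it runs without the CSV file.

```r
# First five rows of the height and weight table, entered directly
details <- data.frame(
  Name   = c("X1", "X2", "X3", "X4", "X5"),
  Weight = c(62.0, 63.4, 63.2, 65.0, 66.0),
  Height = c(152, 176, 165, 164, 162)
)

# Keep only the rows whose Height is at least 160
Highest <- subset(details, Height >= 160)
print(Highest)
nrow(Highest)   # X1 (height 152) is dropped, four rows remain
```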
Working Directory:
Finally, the code samples above assume the data files are located in the R working
directory, which can be found with the function getwd.
> getwd() # get current working directory
You can select a different working directory with the function setwd(), and thus avoid
entering the full path of the data files.
> setwd("<new path>") # set working directory
Note that the forward slash should be used as the path separator even on Windows
platform.
> setwd("C:/MyDoc")
Task:1
Import the following "Marks statement" data from an Excel sheet and find who has
passed in Test 1.
`Sl. No` Name Test.1 Test.2
1 A 25 41
2 B 26 28
3 D 28 36
4 E 29 37
5 F 30 40
6 G 35 28
7 H 35 29
8 I 34 38
9 J 29 32
10 K 36 34
11 L 33 36
Task:2
Create a data frame from the following details regarding baby frocks:
S.No. Size Season Material Decoration Pattern price
1 L Spring Silk Embroidery Dot 650
2 M Summer Chiffon Bow Print 275
3 M Summer Cotton Null Animal 380
4 M Winter Cotton Null Patchwork 450
5 L Autumn Lines Ruffles Animal 420
Faculty Signature
Median:
The median of an observation variable is the value at the middle when the data is sorted in
ascending order. It is an ordinal measure of the central location of the data values.
Standard deviation:
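The standard deviation (also sd) is a measure of the spread of the data and is calculated as:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
where $\bar{x}$ is the sample mean and $n$ is the sample size.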
Variance:
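The variance is a numerical measure of how the data values are dispersed around the
mean. In particular, the sample variance is defined as:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean and $n$ is the sample size.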
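R's var() and sd() use the divisor n - 1. The sketch below checks this by computing the sample variance step by step for the ten values used in Example 2 of this experiment; the names xbar, var_manual and sd_manual are illustrative.

```r
x <- c(55, 54, 53, 56, 52, 58, 52, 49, 50, 51)

n    <- length(x)          # sample size
xbar <- sum(x) / n         # sample mean

# Sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - xbar)^2) / (n - 1)
sd_manual  <- sqrt(var_manual)

c(var_manual, var(x))      # both entries agree
c(sd_manual, sd(x))        # both entries agree
```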
Box Plot:
The box plot of an observation variable is a graphical representation based on its
quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the
data distribution.
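The numbers that a box plot is drawn from can also be listed directly. The sketch below uses the built-in faithful data set (the same one used in the example that follows) with the base functions fivenum() and boxplot.stats(); the labels given to the five values are illustrative.

```r
duration <- faithful$eruptions

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(duration)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# The statistics boxplot() itself uses; $out lists any points drawn as outliers
bs <- boxplot.stats(duration)
bs$stats
bs$out
```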
Example: 1
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot
Answer
The box plot of the eruption duration is:
R-Code Program:
To find the mean:
x=c(given data)
Mean = mean(x)
Mean
To find the median:
x=c(given data)
Median = median(x)
Median
To find the standard deviation:
x=c(given data)
StdDeviation = sd(x)
StdDeviation
To find Variance:
x=c(given data)
Variance = var (x)
Variance
Example: 2
Find the mean, median, standard deviation and variance of the data 55, 54, 53, 56, 52,
58, 52, 49, 50, 51.
R-Code:
x=c(55, 54, 53,56, 52, 58, 52, 49, 50, 51)
mean(x)
median(x)
sd(x)
var (x)
Output:
> mean(x)
[1] 53
> median(x)
[1] 52.5
> sd(x)
[1] 2.788867
> var (x)
[1] 7.777778
To find the mode:
x=c(given data)
mode=function(x){
ux=unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
mode(x)
Example: 3
Find the mode of the data 5, 2, 3, 3, 2, 5, 5, 1, 5, 1, 5, 6, 7.
R-Code:
x=c(5,2,3,3,2,5,5,1,5,1,5,6,7)
mode=function(x){
ux=unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
mode(x)
Output:
[1] 5
Task: 1
The following list gives the export quantity of raw material (in Million kg) for the seven
consecutive years 2010-11 to 2016-17: 195, 152, 174, 186.3, 158, 168, 170.5. Find the mean,
median and standard deviation.
To import data from excel sheet and to find the mean of weight and summary of the details
To import data from Excel sheet “Hegiht and Weight”, first save the file as *.csv
(comma delimited) in the current working directory. Then execute the following command
> help(read.csv)
> details=read.csv("C:/Users/admin/Desktop/Hegiht and Weight.csv")
> details
Output:
Name Weight Height
1 X1 62.0 152
2 X2 63.4 176
3 X3 63.2 165
4 X4 65.0 164
5 X5 66.0 162
6 X6 64.2 153
7 X7 64.0 152
8 X8 50.3 150
9 X9 52.0 162
10 X10 56.3 167
11 X11 54.0 164
12 X12 55.0 156
13 X13 56.0 168
14 X14 65.0 174
15 X15 65.0 172
16 X16 68.0 171
17 X17 69.0 170
18 X18 67.0 160
19 X19 65.0 161
20 X20 68.2 162
21 X21 54.0 158
22 X22 70.0 159
23 X23 71.2 160
24 X24 65.0 152
25 X25 64.0 164
26 X26 62.0 162
27 X27 63.0 171
28 X28 61.0 172
29 X29 65.0 170
30 X30 54.0 156
> mean(details$Weight)
[1] 62.26
> summary(details)
      Name        Weight          Height
 X1     : 1   Min.   :50.30   Min.   :150.0
 X10    : 1   1st Qu.:57.48   1st Qu.:158.2
 X11    : 1   Median :64.00   Median :162.0
 X12    : 1   Mean   :62.26   Mean   :162.8
 X13    : 1   3rd Qu.:65.00   3rd Qu.:169.5
 X14    : 1   Max.   :71.20   Max.   :176.0
 (Other):24
Task: 2
Import the excel file “Height and weight” from your directory to create a data
frame. Also find the mean, median, mode for height and weight.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Weight 69 80 80 84 81 85 72 55 72 64 52 71 74 80 75
Height 152 176 166 168 161 174 153 154 161 162 154 156 171 162 167
Viva voce Questions:
1. State the merits and demerits of mean.
2. State the merits and demerits of median.
3. Give the merits of mode.
4. What are the merits of standard deviation?
5. What is the box plot?
Marks Split up for Internal Marks
Experiment 10
Program 5
Output 5
Viva Voce 5
Total Marks / 25
Faculty Signature
Line of Regression:
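The equation of the line of regression of y on x is
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
where $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$.
The equation of the line of regression of x on y is
$$x - \bar{x} = b_{xy}\,(y - \bar{y})$$
where $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$.
Correlation coefficient:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$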
R-Code Program:
1. To construct the scatter plot with the variables x and y
x=c(a,b,…)
y=c(l,m,…)
plot(x,y,xlab="…",ylab="…",xlim=c(0,10),ylim=c(0,25),col=c("…"),main="…")
4. To find the regression line of y on x
regyx=lm(y~x)
regyx
5. To find the regression line of x on y
regxy=lm(x~y)
regxy
6. To construct the regression plot of y on x
plot(x,y)
abline(lm(y~x))
Note:
Example:
Construct the scatter plot and also find the coefficient of correlation, Spearman’s correlation
coefficient between the ends per inch (X) and picks per inch (Y). Also find the two regression
lines. Estimate the value of y when x=26.
x 23 27 28 28 29 30 31 33 35 36
y 18 20 22 27 21 29 27 29 28 29
R code:
x=c(23,27,28,28,29,30,31,33,35,36)
y=c(18,20,22,27,21,29,27,29,28,29)
plot(x,y,xlim=c(0,50), ylim=c(0,40))
r=cor(x,y)
r
rank = cor(x,y, method="spearman")
rank
Output:
[1] 0.8176052 – Correlation coefficient
[1] 0.8426568 - Spearman Correlation coefficient
Conclusion:
There is a strong positive correlation between x and y.
Plot:
regyx=lm(y~x)
regyx
Output:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-1.7391 0.8913
Hence, the regression line of y on x is y = -1.7391 + 0.8913 x.
To find the regression line of x on y
R code:
regxy=lm(x~y)
regxy
Output:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
11.25 0.75
Hence, the regression line of x on y is x = 11.25 + 0.75 y.
To find y when x=26
y1=-1.7391+0.8913*26
y1
[1] 21.4347
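Instead of typing the rounded coefficients by hand, the estimate can be taken from the fitted model itself. The sketch below refits the line of y on x with the example data and uses predict(); because it uses the unrounded coefficients, the result may differ from the hand calculation in the last decimal place.

```r
x <- c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y <- c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

regyx <- lm(y ~ x)     # regression line of y on x
coef(regyx)            # intercept and slope

# Estimated y at x = 26; newdata must use the same variable name as the model
y1 <- predict(regyx, newdata = data.frame(x = 26))
y1
```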
Regression plot of y on x
R code:
plot(x,y)
abline(regyx)
Output:
Task: 1
Compute the two equations of the regression lines for the following data:
A panel of judges A and B graded seven debaters and independently awarded the following
marks:
Marks by A: 40 34 28 30 44 38 31
Marks by B: 32 39 26 30 38 34 28
An eighth debater was awarded 36 marks by Judge A while Judge B was not present. If Judge B
was also present, how many marks would you expect him to award to eighth debater assuming
same degree of relationship exists in judgment?
Task: 2
The following table gives the ages and blood pressure of 10 men.
Age(X): 56 42 36 47 49 42 60 72 63 55
Blood Pressure(Y): 147 125 118 128 145 140 155 160 149 150
Find
i. The two regression line equations.
ii. Estimate the blood pressure of men whose age is 45 years.
iii. Estimate the age of men whose blood pressure is 172.
iv. Construct the regression plot of blood pressure on age.
Viva voce Questions:
1. Define correlation.
2. What are the various methods of studying correlation?
3. Explain scatter diagram.
4. Define regression.
5. What are regression lines? Write their equations.
Faculty Signature
Application of Normal distribution
Objective
To predict values and compute probabilities using the normal distribution
Formula:
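The normal distribution is defined by the following probability density function, where
$\mu$ is the population mean and $\sigma^2$ is the variance:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The normal distribution with $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution,
and is denoted as N(0,1). Consider a normal distribution with mean $\mu$ and standard deviation $\sigma$.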
R-Code Program:
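1. To find $P(X < a)$
R-code:
pnorm(a, mean = mu, sd = sigma)
2. To find $P(X > a)$
R-code:
pnorm(a, mean = mu, sd = sigma, lower.tail = FALSE)
3. To find $P(a < X < b)$
R-code:
pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
Note: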
Use lower.tail=TRUE if you are, e.g., finding the probability at the lower tail of a
confidence interval or if you want the probability of values no larger than z.
Use lower.tail=FALSE if you are, e.g., trying to calculate test value significance or at the
upper confidence limit, or you want the probability of values z or larger.
You should use pnorm(z, lower.tail=FALSE) instead of 1-pnorm(z) because the former
returns a more accurate answer for large z. The difference comes from floating-point
rounding: when pnorm(z) is very close to 1, the subtraction loses the small upper-tail value.
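The accuracy point in the note can be seen with a short experiment. The value z = 10 and the object names below are chosen only for illustration.

```r
z <- 10

# Upper-tail probability computed directly
upper_direct <- pnorm(z, lower.tail = FALSE)

# Upper-tail probability computed by subtraction
upper_subtract <- 1 - pnorm(z)

upper_direct     # a very small positive number
upper_subtract   # 0, because pnorm(10) is stored as exactly 1

# For moderate z both approaches agree closely
c(pnorm(1.5, lower.tail = FALSE), 1 - pnorm(1.5))
```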
Example:
A certain type of storage battery lasts on the average 3.0 years with standard deviation of 0.5
year. Assuming that the battery lives are normally distributed, find the probability that a given
battery will last (i) less than 2.3 years (ii) more than 3.1 years (iii) between 2.5 and 3.5 years.
Answer:
(i) R-Code: pnorm(2.3, mean = 3.0, sd = 0.5)
Output: [1] 0.08075666
(ii) R-Code: pnorm(3.1, mean = 3.0, sd = 0.5, lower.tail = FALSE)
Output: [1] 0.4207403
(iii) R-Code: pnorm(3.5, mean = 3.0, sd = 0.5) - pnorm(2.5, mean = 3.0, sd = 0.5)
Output: [1] 0.6826895
Task:1
Suppose the heights of men of a certain country are normally distributed with average 68 inches
and standard deviation 2.5, find the percentage of men who are
(i) Between 66 inches and 71 inches in height
(ii) Approximately 6 feet tall(ie, between 71.5 inches and 72.5 inches)
Task: 2
The mean yield for one-acre plots is 662 kgs with S.D. 32. Assuming normal distribution, how
many one-acre plots in a batch of 1000 plots would you expect to yield
(i) over 700 kgs
(ii) below 650 kgs?
(Note: Find the respective probabilities and multiply the probabilities by the number of
plots (= 1000) to get the final answers.)
Task:3
An intelligence test is administered to 1000 children. The average score is 42 and S.D is 24.
Assuming the test follows normal distribution
(i) Find the number of children exceeding the score 60.
(ii) Find the number of children with score lying between 20 and 40
Task: 4
The mean weight of 500 male students in a certain college is 151 lb and the standard deviation is
15 lb. Assuming the weights are normally distributed, find how many students weigh
(i) Between 120 and 155 lb (ii) More than 185 lb
Total Marks / 25
Faculty Signature
where $\bar{x}$ is the sample mean, $S$ is the sample standard deviation, $\mu$ is the
population mean and $n$ is the sample size.
Hypothesis:
The hypothesis that the estimate is based solely on chance is called the null hypothesis
(H0). Thus, the null hypothesis is true if the observed data (in the sample) do not differ from
what would be expected on the basis of chance alone. The complement of the null hypothesis is
called the alternative hypothesis (H1).
Types of errors:
or
R-Code Program:
Output:
data: x and y
t = 0.57131, df = 14, p-value = 0.5768
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.262417 14.262417
sample estimates:
mean of x mean of y
37 34
Conclusion:
t-value = 0.57 < 2.145 (critical value)
Hence, H0 is accepted and we may conclude that there is no significant difference
between the given sample means.
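The R code that produced the output above is not shown. As a rough sketch, output of this form comes from t.test(). The two samples below are made up for illustration and are not the data of the example, and var.equal = TRUE (pooled variance) is an assumption suggested by the whole-number degrees of freedom in the output. The critical value quoted in the conclusion can be obtained with qt().

```r
# Made-up samples, for illustration only
x <- c(12, 15, 11, 14, 13)
y <- c(10, 13, 12, 11, 9)

# Two-sample t-test assuming equal population variances
res <- t.test(x, y, var.equal = TRUE)
res

res$statistic   # calculated t value
res$parameter   # degrees of freedom = 5 + 5 - 2 = 8
res$p.value

# Two-sided 5% critical value for 14 degrees of freedom, as used in the conclusion above
qt(0.975, df = 14)
```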
Task: 2
Two horses A and B were tested according to time (in seconds) to run on a particular track with
the following results.
Horse A: 28 30 32 33 33 29 34
Horse B: 29 30 30 24 27 29
From the above data report, whether or not you can discriminate between the two horses. Test at
5% level.
Task: 3
The weight gain in pounds under two systems of feeding of calves of 10 pairs of
identical twins is given below.
Twin pair 1 2 3 4 5 6 7 8 9 10
Weight gain under system A 43 39 39 42 46 43 38 44 51 43
Weight gain under system B 37 35 34 41 39 37 37 40 48 36
Discuss whether the difference between the two systems of feeding is significant.
Viva voce Questions:
1. Define types of error.
2. Define Null Hypothesis.
3. What is the test statistic for testing hypothesis about a population mean?
4. What is the test statistic for testing a hypothesis about the difference between two means?
5. Define level of significance.
Program 5
Output 5
Viva Voce 5
Total Marks / 25
Faculty Signature
(OR)
Where
R-Code Program:
The null hypothesis is that the ratio of the variances of the populations from which x and
y were drawn, or in the data to which the linear models x and y were fitted, is equal to ratio.
R-Code for F-test:
var.test(x, y, ratio = 1, alternative = c("two.sided", "less", "greater"), conf.level = 0.95,
...)
Note:
x, y - numeric vectors of data values, or fitted linear model objects (inheriting from
class "lm").
ratio - the hypothesized ratio of the population variances of x and y.
alternative - a character string specifying the alternative hypothesis, must be one of
"two.sided" (default), "greater" or "less". You can specify just the initial letter.
conf.level - confidence level for the returned confidence interval.
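A minimal sketch of a call with these arguments is given below. The two samples are made up for illustration and are not taken from any example in this manual.

```r
# Made-up samples, for illustration only
x <- c(20, 23, 19, 25, 22, 24)
y <- c(21, 22, 20, 23, 21)

res <- var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
res

res$statistic   # F = var(x) / var(y)
res$parameter   # numerator df = 6 - 1 = 5, denominator df = 5 - 1 = 4
res$p.value
res$conf.int    # confidence interval for the ratio of variances
```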
Example:
Two samples of 6 and 7 items respectively have the following values for a variable
Sample 1 39 41 42 42 44 44
Sample 2 40 42 39 45 38 39 40
Do the sample variances differ significantly?
data: x and y
F = 1.8323, num df = 6, denom df = 5, p-value = 0.523
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2625934 10.9710044
sample estimates:
ratio of variances
1.832298
Conclusion:
Since F < F0.05 , we accept the null hypothesis and we may conclude that there is no
significant difference in the sample variances.
Task 1:
A group of 10 rats fed on diet A and another group of 8 rats on a different diet B, recorded the following
increase in weight (gms).
Diet A : 5 6 8 1 12 4 3 9 6 10
Diet B: 2 3 6 8 10 1 2 8
Find if the variances are significantly different.
Task 2:
The nicotine contents in two random samples of tobacco are given below:
Sample 1: 21 24 25 26 27
Sample 2: 22 27 28 30 31 36
Test whether the populations have the same variances.
Task 3:
Two horses A and B were tested according to the time (in seconds) to run a particular track with
the following results:
Horse A: 28 30 32 33 33 29 34
Horse B: 29 30 30 24 27 29
Test whether the two horses have the same running capacity.
Faculty Signature
Formula:
Tests of goodness of fit are used when we want to determine whether an actual sample
distribution matches a known theoretical distribution. They enable us to find whether the deviation
of the experiment from theory is just by chance or is due to the inadequacy of the
theory to fit the data.
Null Hypothesis $H_0$:
The difference between the observed and expected frequencies is not significant, i.e., the
theory fits well into the given data.
Regular method:
Compare the calculated $\chi^2$ value with the tabulated $\chi^2$ value (with n-1 d.f.) and form
the conclusion.
The $\chi^2$ test is used for testing the null hypothesis that two criteria of classification are
independent. Let the two attributes be A and B, where A has r categories and B has s categories.
Thus the members of the population and hence, those of the sample are divided into rs classes.
Let the total number of observations be N. The observations are arranged in the form of a matrix,
called contingency table.
Regular method:
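The expected frequencies for the various cells are calculated using the formula:
$$E_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{N}$$
Test statistic is
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency in cell (i, j).
Note: For a 2 × 2 contingency table with cell frequencies a, b, c, d the $\chi^2$ value is given by
$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad N = a + b + c + d$$
Degrees of freedom = 1.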
R-Code Program:
Null Hypothesis: The accidents are uniformly distributed over the week
Alternative Hypothesis: The accidents are not uniformly distributed over the week
Level of significance: 5%
R-code:
accident=c(14,18,12,11,15,14)
p=c(1/6,1/6,1/6,1/6,1/6,1/6)
a=chisq.test(accident,p=c(1/6,1/6,1/6,1/6,1/6,1/6))
a
Output:
Chi-squared test for given probabilities
data: accident
X-squared = 2.1429, df = 5, p-value = 0.829
Conclusion:
$\chi^2$ value = 2.1429 < 11.07 (table value)
Hence, H0 is accepted and we may conclude that the accidents occur uniformly over the
week.
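The calculated value and the table value used in this conclusion can both be reproduced in R. The sketch below computes the statistic directly from the observed counts and takes the 5% table value from qchisq(); the names expected, chi_sq and critical are illustrative.

```r
accident <- c(14, 18, 12, 11, 15, 14)

# Under the null hypothesis every day has the same expected count
expected <- sum(accident) / length(accident)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((accident - expected)^2 / expected)
chi_sq

# Table value at the 5% level with n - 1 = 5 degrees of freedom
critical <- qchisq(0.95, df = length(accident) - 1)
critical

chi_sq < critical   # TRUE means the null hypothesis is not rejected
```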
Example: 2
From the following data, test whether there is any association between intelligence and economic
conditions.
                                 Intelligence
Economic conditions   Excellent   Good   Medium   Dull   Total
Good                      48       200     150      80     478
Not good                  52       180     190     100     522
Total                    100       380     340     180
R-code:
good = c(48,200,150,80)
notgood = c(52,180,190,100)
economicconditions = as.data.frame(rbind(good, notgood))
chisq.test(economicconditions,simulate.p.value = TRUE)
Output:
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: economicconditions
X-squared = 6.2168, df = NA, p-value = 0.1034
Conclusion:
$\chi^2$ value = 6.2168 < 7.815 (table value)
Hence, H0 is accepted and we may conclude that there is no significant association
between intelligence and economic conditions.
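The output above reports df = NA because the p-value was simulated, while the table value 7.815 in the conclusion corresponds to 3 degrees of freedom. As a sketch, running chisq.test() on the same counts without simulate.p.value gives the same statistic together with the degrees of freedom; the object names tab and res are illustrative.

```r
good    <- c(48, 200, 150, 80)
notgood <- c(52, 180, 190, 100)
tab <- rbind(good, notgood)      # 2 x 4 contingency table

res <- chisq.test(tab)           # asymptotic test, no simulation
res$statistic                    # same X-squared as in the output above
res$parameter                    # df = (2 - 1) * (4 - 1) = 3
res$expected                     # expected frequency of every cell

qchisq(0.95, df = 3)             # 5% table value for 3 degrees of freedom
```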
Task: 1
The following table shows the distribution of digits in the numbers chosen at random from a
telephone directory:
Digits : 0 1 2 3 4 5 6 7 8 9 Total
Frequency: 1026 1107 997 966 1075 933 1107 972 964 853 10000
Test whether the digits may be taken to occur equally frequently in the directory.
Task: 2
Two researchers adopted different sampling techniques while investigating the same group of
students to find the number of students falling into different intelligence level. The results are as
follows.
Researcher Below average Average Above average Excellent Total
X 86 60 44 10 200
Y 40 33 25 2 100
Total 126 93 69 12 300
Would you say that the sampling techniques adopted by the researchers are significantly
different?
1. What are the conditions for the $\chi^2$ test?
4. What is the test statistic for testing a hypothesis with the $\chi^2$ test?
Faculty Signature
Source of variation   Sum of squares   Degrees of freedom   Mean square
Between rows          Q1 (SSR)         h-1                  Q1/(h-1)
Residual              Q2 (SSE)         N-h                  Q2/(N-h)
Where
The term one-way classification refers to the fact that a single variable factor of interest is
controlled and its effect on the elementary units is observed.
Three types of variation present in a data
1. Treatments
2. Environmental
3. Residual or Error
Assumptions for ANOVA test
1. The observations are independent.
2. The parent population is normal
3. Various treatment and environmental effects are additive in nature.
4. The samples have been randomly selected from the population
Null Hypothesis: There is no significant difference between the means of the samples.
Alternative Hypothesis: There is a significant difference between the means of the samples.
Drug C: 6 7 6 6 7 5 6 5 5
R-code:
pain=c(4,5,4,3,2,4,3,4,4,6,8,4,5,4,6,5,8,6,6,7,6,6,7,5,6,5,5)
drug=c(rep("A",9),rep("B",9),rep("C",9))
data=data.frame(pain,drug)
data
results=aov(pain~drug,data=data)
summary(results)
Output:
   pain drug
1     4    A
2     5    A
3     4    A
4     3    A
5     2    A
6     4    A
7     3    A
8     4    A
9     4    A
10    6    B
11    8    B
12    4    B
13    5    B
14    4    B
15    6    B
16    5    B
17    8    B
18    6    B
19    6    C
20    7    C
21    6    C
22    6    C
23    7    C
24    5    C
25    6    C
26    5    C
27    5    C
Conclusion:
Since the calculated F value exceeds the table value of F, we reject the null hypothesis and
conclude that the means of the three drug groups are different.
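The comparison behind this conclusion can be made in R as well. The sketch below refits the model with the data of the example, reads the calculated F from the ANOVA table and takes the 5% table value from qf(); the names tab, f_value and critical are illustrative.

```r
pain <- c(4,5,4,3,2,4,3,4,4, 6,8,4,5,4,6,5,8,6, 6,7,6,6,7,5,6,5,5)
drug <- factor(c(rep("A", 9), rep("B", 9), rep("C", 9)))

results <- aov(pain ~ drug)
tab <- summary(results)[[1]]     # the ANOVA table as a data frame
tab

f_value <- tab[["F value"]][1]   # calculated F for the drug effect

# 5% table value with 3 - 1 = 2 and 27 - 3 = 24 degrees of freedom
critical <- qf(0.95, df1 = 2, df2 = 24)

f_value
critical
f_value > critical               # TRUE means the null hypothesis is rejected
```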
Task:1
A completely randomized design experiment with 10 plots and 3 treatments gave the following
result.
Plot no    1 2 3 4 5 6 7 8 9 10
Treatment  A B C A C C A B A B
Yield      5 4 3 7 5 1 3 4 1 7
Analyse the result for treatment effects.
Task:2
The following table shows the lives in hours of four brands of electric lamps:
Brand
A: 1610, 1610, 1650, 1680, 1700, 1720, 1800
B: 1580, 1640, 1640, 1700, 1750
C: 1460, 1550, 1600, 1620, 1640, 1660, 1740, 1820
D: 1510, 1520, 1530, 1570, 1600, 1680
Perform an analysis of variance and test the homogeneity of the mean lives of the four brands of
lamps.
Viva voce Questions:
1. What are the basic principles of experimental design?
2. Mention the important design of experiments.
3. Explain one way classification
4. What is the purpose of ANOVA?
Marks Split up for Internal Marks
Experiment 10
Program 5
Output 5
Viva Voce 5
Total Marks / 25
Faculty Signature
Source of variation   Sum of squares   Degrees of freedom   Mean square
Between rows          Q1 (SSR)         h-1                  Q1/(h-1)
Between columns       Q2 (SSC)         k-1                  Q2/(k-1)
summary (av)
Example:
The following data represent the number of units of production per day turned out by 5 different
workers using 4 different types of machines.
Workers   Machine Type
          A    B    C    D
1        44   38   47   36
2        46   40   52   43
3        34   36   44   32
4        43   38   46   33
5        38   42   49   39
(a) Test whether the mean production is the same for the different machine types.
(b) Test whether the 5 men differ with respect to mean productivity.
R-code (each observation is coded by subtracting 2, which does not change the F ratios):
a=c(42,44,32,41,36,36,38,34,36,40,45,50,42,44,47,34,41,30,31,37)
f=c("w1","w2","w3","w4","w5")
h=5
k=4
worker=gl(h,1,k*h,factor(f))
worker
machine=gl(k,h,h*k)
machine
av = aov(a ~ worker+machine)
summary(av)
Output:
> a=c(42,44,32,41,36,36,38,34,36,40,45,50,42,44,47,34,41,30,31,37)
> f=c("w1","w2","w3","w4","w5")
> h=5
> k=4
> worker=gl(h,1,k*h,factor(f))
> worker
[1] w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5
Levels: w1 w2 w3 w4 w5
> machine=gl(k,h,h*k)
> machine
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Levels: 1 2 3 4
> av = aov(a ~ worker+machine)
> summary(av)
            Df Sum Sq Mean Sq F value   Pr(>F)
worker       4  161.5   40.37   6.574  0.00485 **
machine      3  338.8  112.93  18.388 8.78e-05 ***
Residuals   12   73.7    6.14
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion:
From the F-table, F0.05(4,12) = 3.26 < F1 = 6.574, hence we reject H01 and conclude that the 5
workers differ with respect to mean productivity.
F0.05(3,12) = 3.49 < F2 = 18.388, hence we reject H02 and conclude that the 4 machines
differ with respect to mean productivity.
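The two table values quoted in this conclusion can be looked up in R with qf() instead of a printed F-table. In the sketch below the calculated F values are copied from the summary output above; the names f_workers and f_machines are illustrative.

```r
# 5% table values of F for the two effects (residual df = 12)
f_workers  <- qf(0.95, df1 = 4, df2 = 12)   # workers: 5 - 1 = 4 df
f_machines <- qf(0.95, df1 = 3, df2 = 12)   # machines: 4 - 1 = 3 df

round(c(f_workers, f_machines), 2)          # 3.26 3.49

# Compare with the calculated F values from the ANOVA table
6.574  > f_workers    # TRUE: reject H01
18.388 > f_machines   # TRUE: reject H02
```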
Task: 1
A company appoints 4 salesmen A, B, C, D and observes their sales in 3 seasons: summer,
winter and monsoon. The figures (in lakhs of Rs.) are given in the following table:
SEASON     SALESMEN
           A    B    C    D
SUMMER    45   40   38   37
WINTER    43   41   45   38
MONSOON   39   39   41   41
Viva voce Questions:
1. When do you apply the analysis of variance technique?
2. Write any two advantages of RBD over CRD.
3. What is meant by two-way classification?
4. Write the differences between one way and two way classification?
Marks Split up for Internal Marks
Experiment 10
Program 5
Output 5
Viva Voce 5
Total Marks / 25
Faculty Signature