R Programming Notes
SUB CODE:IT412
August 8, 2016
Sample presentation
NULL HYPOTHESIS
"NULL" means nothing new or different; the assumption or status quo is
maintained.
The null hypothesis is denoted by H0.
ALTERNATIVE HYPOTHESIS
The "Alternative" is simply the other option when the null is rejected;
nothing more.
The alternative hypothesis is denoted by Ha.
(BEC ) Short version August 8, 2016 2 / 113
Differences between null hypothesis and alternative
hypothesis
POPULATION PARAMETER
Any measurable characteristic of a population is called a
POPULATION PARAMETER. (We usually use Greek letters for
population parameters.) A parameter is a numerical property of a
population.
Example
For example, the population mean µ and the population variance σ² are
population parameters.
TEST STATISTIC
A TEST STATISTIC is a value computed from the sample data that is
used to decide whether to reject H0.
Critical value
The CRITICAL VALUE separates the critical region from the noncritical
region.
Rejection Region
REJECTION REGION is the range of values of the test statistic that
indicates a significant difference, meaning the null hypothesis should
be rejected.
(BEC ) Short version August 8, 2016 9 / 113
Significance Level (α)
It is the cut-off probability, chosen in the experimental design, used
to decide whether an observed test statistic is extreme. α is usually
set to 0.05, 0.025, or 0.01.
We reject the null hypothesis if the probability of observing a test
statistic at least as extreme as the one obtained is smaller than α.
(BEC ) Short version August 8, 2016 12 / 113
The normal distribution is important because of the CENTRAL LIMIT
THEOREM, which states that the mean of samples of size n drawn from a
population with mean µ and variance σ² approaches a normal
distribution with mean µ and variance σ²/n as n approaches infinity.
Test
1. Start with a well-developed, clear research problem or question.
2. Establish hypotheses, both null and alternative.
3. Determine the appropriate statistical test and sampling distribution.
4. Choose the Type 1 error rate.
5. State the decision rule.
6. Gather sample data.
7. Calculate the test statistic.
8. State the statistical conclusion.
9. Make a decision or inference based on the conclusion.
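As a minimal sketch, the steps above can be walked through in R for a one-sample t-test. The data and hypothesized mean below are invented for illustration:

```r
# Hypothetical one-sample test of H0: mu = 50 vs Ha: mu != 50 at alpha = 0.05
scores <- c(52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.6, 54.4, 52.7)  # step 6: sample data
mu0    <- 50                                    # step 2: hypothesized mean
alpha  <- 0.05                                  # step 4: Type 1 error rate
n      <- length(scores)
t_stat <- (mean(scores) - mu0) / (sd(scores) / sqrt(n))  # step 7: test statistic
t_crit <- qt(1 - alpha / 2, df = n - 1)         # step 5: two-sided decision rule
reject <- abs(t_stat) > t_crit                  # steps 8-9: conclusion and decision
```

The same result can be obtained in one call with t.test(scores, mu = 50).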
problem
Assume that the test scores of a college entrance exam fit a normal
distribution. Furthermore, the mean test score is 72, and the standard
deviation is 15.2. What is the percentage of students scoring 84 or
more in the exam?
solution
We apply the function pnorm of the normal distribution with mean 72
and standard deviation 15.2. Since we are looking for the percentage of
students scoring higher than 84, we are interested in the upper tail of
the normal distribution.
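In R this is a single call; lower.tail = FALSE gives the upper-tail probability directly:

```r
# P(X >= 84) for X ~ N(mean = 72, sd = 15.2)
pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)
# equivalently: 1 - pnorm(84, mean = 72, sd = 15.2)
# ≈ 0.2149, i.e. about 21.5% of students score 84 or more
```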
problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with
5 degrees of freedom.
solution
We apply the quantile function qt of the Student t distribution against
the decimal values 0.025 and 0.975.
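In R, qt accepts a vector of probabilities, so both percentiles come from one call:

```r
# 2.5th and 97.5th percentiles of Student's t with 5 degrees of freedom
qt(c(0.025, 0.975), df = 5)
# ≈ -2.570582  2.570582 (symmetric about zero)
```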
t = (x̄ − µ0) / (s / √n)
x̄ = sample mean, µ0 = hypothesized population mean,
s = sample standard deviation, n = sample size.
The t value falls in the non-rejection or rejection region, determined
using df = n − 1.
solution
Step 1: Establish hypotheses.
H0: µ = 69,873
Ha: µ ≠ 69,873
Step 2: Determine the appropriate statistical test and sampling
distribution.
solution
This will be a two-sided test: salaries could be higher OR lower.
Since σ is unknown and n is small, we will use the t-distribution.
Step 3: Specify the Type-1 error rate (significance level):
α = .05.
Step 4: State the decision rule.
For df = 11:
if t > 2.201, reject H0;
if t < −2.201, reject H0.
n=15
solution
> x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
Two sample t-Test with Unequal Variance
The default form of t.test() does not assume that the samples
have equal variance, so the Welch two-sample test is carried out unless
you specify otherwise.
Ex:data2=3,5,7,5,3,2,6,8,5,6,9,4,5,7,3,4
data3=6,7,8,7,6,3,8,9,10,7,6,9
Example
> t.test(data2, data3)
We can override the default and use the classic t-test by adding
var.equal = T. The p-value is slightly different from the Welch version.
Example
> t.test(data2, data3, var.equal = T)
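Putting the data from above and both calls together as a self-contained sketch:

```r
data2 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
data3 <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

t.test(data2, data3)                    # Welch two-sample t-test (the default)
t.test(data2, data3, var.equal = TRUE)  # classic pooled-variance t-test
```

The Welch version adjusts the degrees of freedom for unequal variances, which is why its p-value differs slightly from the pooled version.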
V = X1² + X2² + ... + Xm² ∼ χ²(m)
Solution
We apply the quantile function qchisq of the Chi-Squared distribution
against the decimal value 0.95.
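The degrees of freedom for this problem are not shown in the surviving notes; assuming, for illustration, 7 degrees of freedom:

```r
# 95th percentile of the Chi-Squared distribution with 7 df (df assumed)
qchisq(0.95, df = 7)
# ≈ 14.067
```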
Z-test
When σ is known, we use the standard normal distribution, or z
distribution, to establish the non-rejection region and critical values.
When σ is not known, we use the t-distribution.
Some instructors or books indicate that using the z distribution is
acceptable any time n > 30.
n: sample size
x̄: sample mean
µ: hypothesized population mean
σ: population standard deviation
Problem
Suppose the food label on a bag states that there are at most 2 grams of
saturated fat in a single cookie. In a sample of 35 cookies, it is found
that the mean amount of saturated fat per cookie is 2.1 grams. Assume
the population standard deviation is 0.25 grams. At the 0.05 significance
level, can we reject the claim on the food label?
Solution
H0: µ ≤ 2
Ha: µ > 2
µ0 = 2, n = 35, x̄ = 2.1, σ = 0.25
z = (x̄ − µ0) / (σ / √n)
a1 <- qnorm(p = 0.05, lower.tail = F)
a1
[1] 1.64485
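The full computation in R, using the values from the problem statement above:

```r
xbar <- 2.1; mu0 <- 2; sigma <- 0.25; n <- 35
z <- (xbar - mu0) / (sigma / sqrt(n))       # ≈ 2.366
z_crit <- qnorm(0.05, lower.tail = FALSE)   # ≈ 1.645 (upper-tail critical value)
z > z_crit   # TRUE: z exceeds the critical value, so we reject H0
```

Since 2.366 > 1.645, we reject the claim on the food label at the 0.05 level.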
Example
In this example we have blood test results before and after receiving
some treatment.(25 paired observations of 3 variables)
> BP[c(1,3,5),]
For a generally upward shape we say that the correlation is positive:
as the independent variable increases, the dependent variable generally
increases.
For a generally downward shape we say that the correlation is negative:
as the independent variable increases, the dependent variable generally
decreases.
Covariance
The covariance of two variables x and y in a data sample measures
how the two are linearly related. A positive covariance indicates a
positive linear relationship between the variables, and vice versa.
Covariance = Σ(x − x̄)(y − ȳ) / n
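Note that R's built-in cov() divides by n − 1 (the sample covariance), while the formula above divides by n. A small sketch with made-up data showing both:

```r
x <- c(1, 2, 4, 5, 8)
y <- c(2, 3, 5, 8, 11)
n <- length(x)
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n   # formula above: divides by n
cov_n       # 8
cov(x, y)   # 10: R divides by n - 1 instead
```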
Correlation coefficient (r)
r = Sxy / √(Sxx · Syy)
• Sxx = Σx² − (Σx)²/n  or  Sxx = Σ(x − x̄)²
• Syy = Σy² − (Σy)²/n  or  Syy = Σ(y − ȳ)²
• Sxy = Σxy − (Σx)(Σy)/n  or  Sxy = Σ(x − x̄)(y − ȳ)
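A quick check that the sum-of-squares formulas agree with R's cor(); the data vectors are made up for illustration:

```r
x <- c(1, 2, 4, 5, 8)
y <- c(2, 3, 5, 8, 11)
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
r <- Sxy / sqrt(Sxx * Syy)   # Pearson correlation from the formulas above
all.equal(r, cor(x, y))       # TRUE: matches R's built-in cor()
```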
Spearman's rank correlation coefficient:
rs = 1 − 6Σd² / (n(n² − 1))
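With tie-free data the formula agrees with R's cor(..., method = "spearman"); the vectors below are made up for illustration:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
d <- rank(x) - rank(y)          # rank differences (no ties in this data)
rs <- 1 - 6 * sum(d^2) / (length(x) * (length(x)^2 - 1))
rs                               # 0.8
cor(x, y, method = "spearman")   # 0.8: agrees when there are no ties
```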
cluster analysis
Good clustering
A good clustering method will produce high-quality clusters, with high
intra-class similarity and low inter-class similarity.
3.Ordinal
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Ex: customer satisfaction (0: very dissatisfied, 1: dissatisfied,
2: neutral, 3: satisfied, 4: very satisfied).
•Different algorithms, and even multiple runs of the same algorithm,
will deliver different clusterings.
dist(x, y) = Σ |xi − yi|, summed over i = 1, …, n
Minkowski Distance
A general measure of distance:
d(x, y) = (|x1 − y1|^q + |x2 − y2|^q + ... + |xn − yn|^q)^(1/q)
If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan
distance. A variation is the weighted distance, used when variables
have different importance:
d(x, y) = (w1|x1 − y1|^q + w2|x2 − y2|^q + ... + wn|xn − yn|^q)^(1/q)
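A sketch of the weighted Minkowski distance as an R function. The function name and defaults are our own; for the unweighted case, R's built-in dist() with method = "minkowski" computes the same thing:

```r
# Weighted Minkowski distance (weights default to 1, giving the plain version)
minkowski <- function(x, y, q = 2, w = rep(1, length(x))) {
  (sum(w * abs(x - y)^q))^(1 / q)
}

x <- c(0, 3, 4); y <- c(4, 0, 4)
minkowski(x, y, q = 2)   # Euclidean: 5
minkowski(x, y, q = 1)   # Manhattan: 7
```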
For binary variables, with a, b, c, d the cell counts of the 2×2
contingency table of objects xi and xj:
•Simple matching coefficient (symmetric):
d(xi, xj) = (b + c) / (a + b + c + d)
•Jaccard coefficient (asymmetric: 1 is more important, e.g. diseases):
d(xi, xj) = (b + c) / (a + b + c)
Categorical Variables
•Partitioning algorithms:
Construct various partitions and then evaluate them by some criterion.
•Hierarchical algorithms:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion.
•Density-based:
Based on connectivity and density functions.
•Grid-based:
Based on a multiple-level granularity structure.
•Model-based:
A model is hypothesized for each of the clusters, and the idea is to
find the best fit of the data to the given model.
•Given:
→ a data set of n objects
→ K, the number of clusters to form
• Organize the objects into k partitions (k ≤ n), where each partition
represents a cluster.
• The clusters are formed to optimize an objective partitioning criterion:
→ objects within a cluster are similar
→ objects of different clusters are dissimilar
• Input
→ K: the number of clusters
→ D: a data set containing n objects
• Output: A set of k clusters
• Method:
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat:
3. Reassign each object to the most similar cluster, based on the mean
value of the objects in the cluster.
4. Update the cluster means.
5. Until no change.
→ E: the sum of the squared error for all objects in the data set
→ mi: the mean of cluster Ci
E = Σᵢ₌₁ᵏ Σ_{p ∈ Ci} |p − mi|², where p ranges over the points of
cluster Ci.
• It works well when the clusters are compact clouds that are rather
well separated from one another.
Advantages
• K-means is relatively scalable and efficient in processing large data
sets
• The computational complexity of the algorithm is O(nkt)
→ n: the total number of objects
→ k: the number of clusters
→ t: the number of iterations
→ Normally: k << n and t << n
Disadvantages
• Can be applied only when the mean of a cluster is defined
• Users need to specify k
• K-means is not suitable for discovering clusters with nonconvex
shapes or clusters of very different size
• It is sensitive to noise and outlier data points (which can influence
the mean value)
i <- read.csv(choose.files())        # load the iris data from a CSV file
str(i)
names(i)
i2 <- i[, -5]                        # drop the Species column before clustering
names(i2)
i3 <- kmeans(i2, 3)                  # k-means with k = 3
i3
table(iris$Species, i3$cluster)      # compare clusters with the true species
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
plot(iris$Petal.Length, iris$Petal.Width, col = i3$cluster)
plot(iris$Sepal.Length, iris$Sepal.Width, col = i3$cluster)
· E: the sum of the absolute error for all objects in the data set
· p: the data point in the space representing an object
· Oi: the representative object of cluster Ci
x <- read.csv(choose.files())        # load the iris data from a CSV file
str(x)
names(x)
x1 <- x[, -5]                        # drop the Species column
names(x1)
library(cluster)                     # provides pam()
km <- pam(x1, 3)                     # k-medoids (PAM) with k = 3
km
table(iris$Species, km$clustering)   # compare clusters with the true species
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
plot(iris$Petal.Length, iris$Petal.Width, col = km$clustering)
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)
plot(iris$Sepal.Length, iris$Sepal.Width, col = km$clustering)
data(nutrient, package = "flexclust")
View(nutrient)
row.names(nutrient) <- tolower(row.names(nutrient))
row.names(nutrient)
nutrient.scaled <- scale(nutrient)              # standardize the variables
nutrient.scaled
d <- dist(nutrient.scaled)                      # distance matrix
fit.average <- hclust(d, method = "average")    # average-linkage clustering
fit.average
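The same pipeline works on any numeric data frame. A self-contained sketch using the built-in USArrests data (swapped in here so the example runs without the flexclust package), including the usual dendrogram plot and tree cut:

```r
data(USArrests)
d <- dist(scale(USArrests))            # standardize, then compute distances
fit <- hclust(d, method = "average")   # average-linkage hierarchical clustering
plot(fit)                              # dendrogram of the 50 states
groups <- cutree(fit, k = 5)           # cut the tree into 5 clusters
table(groups)                          # cluster sizes
```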
Definition
The data analysis task is classification, where a model or classifier is
constructed to predict categorical labels, such as safe or risky for the
loan application data; yes or no for the marketing data; or treatment A,
treatment B, or treatment C for the medical data.
Regression analysis
Regression analysis is a statistical methodology that is most often
used for numeric prediction; hence the two terms are often used
synonymously.
•Some decision tree algorithms produce only binary trees (where each
internal node branches to exactly two other nodes), whereas others can
produce nonbinary trees.
How are decision trees used for classification? Given a tuple, X, for
which the associated class label is unknown, the attribute values of the
tuple are tested against the decision tree. A path is traced from the
root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
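The notes do not name a specific R implementation; as one common choice, the recommended rpart package grows CART-style classification trees. A sketch on the built-in iris data:

```r
library(rpart)                                            # CART-style trees
fit <- rpart(Species ~ ., data = iris, method = "class")  # grow a classification tree
print(fit)                                                # the tree's split rules
# classify three tuples by tracing each from root to leaf
pred <- predict(fit, iris[c(1, 51, 101), ], type = "class")
pred
```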
Input:
Data partition, D, which is a set of training tuples and their associated
class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting
criterion that best partitions the data tuples into individual classes.
This criterion consists of a splitting attribute and, possibly, either a
split point or a splitting subset.
Output: A decision tree
Method:
1. Create a node N;
2. if the tuples in D are all of the same class, C, then
3. return N as a leaf node labeled with the class C;
4. if attribute list is empty then
5. return N as a leaf node labeled with the majority class in D;
Info(D) = −Σᵢ₌₁ᵐ pi log2(pi)
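The expected information (entropy) formula translates directly into R; here p is the vector of class proportions pi:

```r
info <- function(p) -sum(p * log2(p))   # entropy of a class distribution
info(c(0.5, 0.5))      # 1 bit: a 50/50 split is maximally impure
info(c(9/14, 5/14))    # ≈ 0.940 bits for a 9-vs-5 split
```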