R Programming Notes

This document provides an overview of statistical hypothesis testing concepts. It defines key terms like the null hypothesis (H0), alternative hypothesis (Ha), type 1 and type 2 errors, test statistics, significance levels, p-values, and normal and t-distributions. It also outlines the steps for performing hypothesis tests, including establishing hypotheses, determining the appropriate statistical test, calculating test statistics, and making conclusions based on rejection regions. An example t-test is provided to illustrate the process.


DATA ANALYTICS-1

SUB CODE: IT412

DEPARTMENT OF INFORMATION TECHNOLOGY


BAPATLA ENGINEERING COLLEGE

August 8, 2016


Definition
HYPOTHESIS
A hypothesis is a statement made about the value of a
POPULATION PARAMETER that we wish to test by collecting
evidence in the form of a sample.
In a statistical hypothesis test the evidence comes from a sample, which
is summarized in the form of a statistic called the TEST STATISTIC.

NULL HYPOTHESIS
"NULL" means nothing new or different; the assumption or status quo is
maintained.
The null hypothesis is denoted by H0.

ALTERNATIVE HYPOTHESIS
The "alternative" is simply the other option when the null is rejected;
nothing more.
The alternative hypothesis is denoted by Ha.
Differences between null hypothesis and alternative hypothesis

(The comparison table is shown as a figure on the slides.)


Definition
TYPE-1 ERROR
Rejecting the assumption (null hypothesis) when it should not have
been rejected: incorrectly rejecting the null hypothesis.

TYPE-1 ERROR EXAMPLE


Let us consider a scenario:
1. You smell smoke.
2. You think, "This is not normal" (you reject the assumption that
everything is OK). You reject your null hypothesis.
3. Therefore you pull the fire alarm. The building is evacuated and the
fire department arrives to investigate.
4. After the investigation it is determined there was no fire. You
"falsely" pulled the fire alarm.
5. When you rejected your assumption that everything was OK, when it
really was OK, you committed a Type-1 error: a "false alarm."
Definition
TYPE-2 ERROR
Failing to reject the (null hypothesis) when it should have been
rejected: incorrectly not rejecting the null hypothesis.

TYPE-2 ERROR EXAMPLE


Let us consider a scenario:
1. You smell smoke.
2. You think, "It is probably someone who burned their lunch in the
microwave. No big deal."
3. Therefore you do not reject your assumption (null hypothesis) that
everything is OK. You uphold the null.
4. But there is indeed a fire. No one is injured, but the building burns
to the ground.
5. When you failed to reject your assumption that everything was
OK, when it really was NOT OK, you committed a Type-2 error.
Statistics
Introduction

POPULATION PARAMETER
Any characteristic of a population which is measurable is called a
POPULATION PARAMETER. (We usually use Greek letters for
population parameters.) A parameter is a numerical property of a
population, whereas a statistic is a numerical property of a sample.

Example
For example, the population mean µ and population variance σ² are
population parameters.



Population and sample

(Illustrated as a figure on the slides.)


Definition of terms for hypothesis test

TEST STATISTIC
The TEST STATISTIC is the value computed from the sample data that
is compared against the critical value in a hypothesis test.

Critical value
The CRITICAL VALUE separates the critical region from the
noncritical region.

Rejection Region
The REJECTION REGION is the range of values of the test value that
indicates that there is a significant difference and that the null
hypothesis should be rejected.
Significance Level (α)
It is a cut-off probability, chosen in the experimental design, used to
determine whether an observed test statistic is extreme or not. It is
usually set to 0.05, 0.025 or 0.01.
We reject the null hypothesis if the probability of the observed test
statistic appearing is smaller than α.


p-value
It is the probability of obtaining a test statistic equal to or more
extreme than the one actually observed, assuming the null hypothesis
is true.
A small p-value indicates that it is unlikely to get the value of the
observed test statistic under the null hypothesis.
We reject the null hypothesis if the p-value is smaller than α.


Normal distribution
The normal distribution is defined by the following probability density
function, where µ is the population mean and σ² is the variance:

f(x) = (1 / (σ√(2π))) · e^(−(x−µ)² / (2σ²))        (1)

If a random variable X follows the normal distribution, then we write:

X ∼ N(µ, σ²)        (2)

In particular, the normal distribution with µ = 0 and σ = 1 is called
the standard normal distribution, and is denoted N(0,1).
The normal distribution is important because of the CENTRAL LIMIT
THEOREM, which states that the distribution of the mean of samples of
size n, drawn from a population with mean µ and variance σ², approaches
a normal distribution with mean µ and variance σ²/n as n approaches
infinity.
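A quick illustration in R (not from the slides): means of samples of
size n = 30 from a skewed exponential population already look roughly
normal, centred on the population mean 1.

> means <- replicate(1000, mean(rexp(30, rate = 1)))
> hist(means)   # approximately bell-shaped around 1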


Hypothesis Testing Procedure

Test
1. Start with a well-developed, clear research problem or question.
2. Establish hypotheses, both null and alternative.
3. Determine the appropriate statistical test and sampling distribution.
4. Choose the Type-1 error rate.
5. State the decision rule.
6. Gather sample data.
7. Calculate the test statistic.
8. State the statistical conclusion.
9. Make a decision or inference based on the conclusion.


example

problem
Assume that the test scores of a college entrance exam fit a normal
distribution. Furthermore, the mean test score is 72, and the standard
deviation is 15.2. What is the percentage of students scoring 84 or
more in the exam?

solution
We apply the function pnorm of the normal distribution with mean 72
and standard deviation 15.2. Since we are looking for the percentage of
students scoring higher than 84, we are interested in the upper tail of
the normal distribution.
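In R, that upper-tail probability is a single call (the slide itself does
not show the code):

> pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)   # ≈ 0.215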

The percentage of students scoring 84 or more in the college entrance
exam is 21.5%.


Student-t distribution

Assume that a random variable Z has the standard normal
distribution, and another random variable V has the Chi-Squared
distribution with m degrees of freedom. Assume further that Z and
V are independent; then the quantity

t = Z / √(V/m)

follows a Student t distribution with m degrees of freedom.
(A graph of the Student t distribution with 5 degrees of freedom is
shown on the slides.)


Example

problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with
5 degrees of freedom.

solution
We apply the quantile function qt of the Student t distribution against
the decimal values 0.025 and 0.975.
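A sketch of that call:

> qt(c(0.025, 0.975), df = 5)
[1] -2.570582  2.570582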

The 2.5th and 97.5th percentiles of the Student t distribution with 5
degrees of freedom are −2.5706 and 2.5706 respectively.


T-Test for single mean
Significance of the t test
1. In probability and statistics, Student's t-distribution (or simply the
t-distribution) is any member of a family of continuous probability
distributions that arises when estimating the mean of a normally
distributed population in situations where the sample size is small and
the population standard deviation is unknown.
2. In real life it is usually impossible to know the standard deviation of
the population from which our sample is drawn.
3. When the population standard deviation is NOT KNOWN, we have to
use an estimate, s.
4. When σ is NOT KNOWN we use the t-distribution.
5. Every sample size has its own t-distribution, with n−1 degrees of
freedom.
6. The degrees of freedom change how the probability distribution looks.
7. The t-distribution has more probability at the tails and less
probability in the middle.
Comparison of t-distribution with z-distribution
8. Using the z-distribution is acceptable any time n ≥ 30.


T-Test statistic for single mean

t = (x̄ − µ0) / (s / √n)

x̄ = sample mean, µ0 = hypothesized population mean,
s = sample standard deviation, n = sample size.
The non-rejection and rejection regions for the t-test value are based
on df = n − 1.


T-distribution example
BUSINESS ANALYST SALARIES
A report from 6 years ago indicated that the average gross salary for a
business analyst was 69,873. Since this survey is now outdated, the
Bureau of Labor Statistics wishes to test this figure against current
salaries to see if the current salaries are statistically different from
the old ones.
Based on this sample, we found s = 14,985. We do not know σ and we
will therefore have to estimate it using s.
For this study, the BLS will take a sample of 12 current salaries.

solution
Step 1: Establish hypotheses.
H0: µ = 69,873
Ha: µ ≠ 69,873
Step 2: Determine the appropriate statistical test and sampling
distribution.
solution
This will be a two-sided test: salaries could be higher OR lower.
Since σ is unknown and n is small, we will use the t-distribution.
Step 3: Specify the Type-1 error rate (significance level):
α = .05.
Step 4: State the decision rule.
For df = 11:
if t > 2.201, reject H0;
if t < −2.201, reject H0.

Step 5: Gather data: n = 12, x̄ = 79,180.


Solution

(The worked computation appears as figures on the slides; a sketch
follows.)
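A minimal R sketch of that computation, using the figures given above
(x̄ = 79,180, µ0 = 69,873, s = 14,985, n = 12):

> t <- (79180 - 69873) / (14985 / sqrt(12))
> t   # ≈ 2.15

Since 2.15 lies inside (−2.201, 2.201), t does not fall in the rejection
region, so H0 is not rejected at α = .05.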


BUSINESS ANALYST SALARIES, when n = 15

(The worked figures for n = 15 appear on the slides; a sketch follows.)
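Assuming the same x̄ and s, a sketch of the effect of the larger sample:

> t15 <- (79180 - 69873) / (14985 / sqrt(15))
> t15                  # ≈ 2.41
> qt(0.975, df = 14)   # critical value ≈ 2.145

Now the statistic exceeds the critical value, so H0 would be rejected:
the larger sample changes the conclusion.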



one sample t test

one sample t test in R programming

∗ The R function t.test() can be used to perform both one and two
sample t-tests on vectors of data. The function contains a variety of
options and can be called as follows:
> t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
∗ Here x is a numeric vector of data values and y is an optional numeric
vector of data values.
∗ If y is excluded, the function performs a one-sample t-test on the
data contained in x; if it is included, it performs a two-sample t-test
using both x and y.
∗ The option mu provides a number indicating the true value of the
mean (or difference in means if you are performing a two sample test)
under the null hypothesis.


one sample t test

one sample t test in R programming

∗ The option alternative is a character string specifying the alternative
hypothesis, and must be one of the following: "two.sided" (which is the
default), "greater" or "less", depending on whether the alternative
hypothesis is that the mean is different than, greater than or less than
mu, respectively. For example the following call:
∗ > t.test(x, alternative = "less", mu = 10)
performs a one sample t-test on the data contained in x where the null
hypothesis is that µ = 10 and the alternative is that µ < 10.
∗ The option paired indicates whether or not you want a paired t-test
(TRUE = yes and FALSE = no). If you leave this option out it
defaults to FALSE.


one sample t test

one sample t test in R programming

∗ The option var.equal is a logical variable indicating whether or not to
assume the two variances as being equal when performing a two-sample
t-test. If TRUE then the pooled variance is used to estimate the
variance, otherwise the Welch (or Satterthwaite) approximation to the
degrees of freedom is used. If you leave this option out it defaults to
FALSE.
∗ Finally, the option conf.level determines the confidence level of the
reported confidence interval: for µ in the one-sample case, and for
µ1 − µ2 in the two-sample case.


one sample t test
one sample t test problem
Ex. An outbreak of Salmonella-related illness was attributed to ice
cream produced at a certain factory. Scientists measured the level of
Salmonella in 9 randomly sampled batches of ice cream. The levels (in
MPN/g) were:
0.593 0.142 0.329 0.691 0.231 0.793 0.519 0.392 0.418
Is there evidence that the mean level of Salmonella in the ice cream is
greater than 0.3 MPN/g?
Let µ be the mean level of Salmonella in all batches of ice cream. Here
the hypothesis of interest can be expressed as:
H0: µ = 0.3
Ha: µ > 0.3
Hence, we will need to include the options alternative="greater",
mu=0.3. Below is the relevant R code:

solution
> x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
> t.test(x, alternative = "greater", mu = 0.3)

(Running this gives t ≈ 2.2, df = 8 and p ≈ 0.029, so at the 5% level
there is evidence that the mean level exceeds 0.3 MPN/g.)
Two sample t-Test with Unequal Variance

The default form of t.test() does not assume that the samples have
equal variance, so the Welch two-sample test is carried out unless you
specify otherwise.
Ex:
> data2 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
> data3 <- c(6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9)

Example
> t.test(data2, data3)


Two sample t-Test with Equal Variance

We can override the default and use the classic t-test by adding
var.equal = T. The p-value is slightly different from the Welch version.

Example
> t.test(data2, data3, var.equal = T)


CHI-SQUARED DISTRIBUTION

If X1, X2, ..., Xm are m independent random variables having the
standard normal distribution, then the following quantity follows a
CHI-SQUARED DISTRIBUTION with m degrees of freedom. Its mean
is m, and its variance is 2m.

V = X1² + X2² + ... + Xm² ∼ χm²        (3)

(A graph of the Chi-Squared distribution with 7 degrees of freedom is
shown on the slides.)


Example
Find the 95th percentile of the Chi-Squared distribution with 7 degrees
of freedom.

Solution
We apply the quantile function qchisq of the Chi-Squared distribution
against the decimal value 0.95.
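A sketch of that call:

> qchisq(0.95, df = 7)
[1] 14.06714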

The 95th percentile of the Chi-Squared distribution with 7 degrees of
freedom is 14.067.


Z-Distribution

Z-test
When σ is known, we use the standard normal distribution, or
z-distribution, to establish the non-rejection region and critical values.
When σ is not known, we use the t-distribution.
Some instructors or books will indicate that using the z-distribution is
acceptable any time n > 30.

Z-Test for single mean

z = (x̄ − µ) / (σ / √n)

n: sample size
x̄: sample mean
µ: hypothesized population mean
σ: population standard deviation


Example

Problem
Suppose the food label on a bag states that there is at most 2 grams of
saturated fat in a single cookie. In a sample of 35 cookies, it is found
that the mean amount of saturated fat per cookie is 2.1 grams. Assume
the population standard deviation is 0.25 grams. At the 0.05
significance level, can we reject the claim on the food label?

Solution
H0: µ ≤ 2
Ha: µ > 2
µ0 = 2, n = 35, x̄ = 2.1, σ = 0.25
z = (x̄ − µ) / (σ / √n)


> z <- (2.1 - 2) / (0.25 / sqrt(35))
> z
[1] 2.36643

> a1 <- qnorm(p = 0.05, lower.tail = F)
> a1
[1] 1.64485

Since z > a1, reject the null hypothesis.

The z value is greater than the critical value, so z is in the rejection
region. We reject the null hypothesis and conclude that a cookie
contains more than 2 grams of saturated fat, contrary to the food
label's claim.


Mann Whitney U test / Mann Whitney Wilcoxon test
• The Mann-Whitney U test is used to compare two independent samples
by a rank test.
• It is the non-parametric equivalent of the independent-samples t-test.
• The 'U' statistic measures the degree of overlap in ranks between the
two groups.
• The samples come from distinct populations and do not affect each
other.
• Using the Mann Whitney Wilcoxon test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.
Note
The difference between the Mann Whitney Wilcoxon test and the
Wilcoxon signed rank test: Mann Whitney is for independent groups and
uses only ordinal information; Wilcoxon is for matched groups and uses
interval information. There is no reason to expect the two analyses to
give similar results.
Example of Mann Whitney
Note
It is a way of examining the relationship between a numeric outcome
variable y and a categorical explanatory variable (x, two levels) when
the two groups are independent.

Here we have a data file consisting of the lung capacity of persons, both
smokers and non-smokers (725 observations of 6 variables).
We examine the relation between smoking and lung capacity.

Here, we plot the relationship between lung capacity and smoking.

In this we have the null hypothesis:
H0: Median lung capacity of smokers is equal to that of non-smokers
Ha: Median lung capacity of smokers is not equal to that of non-smokers
And it is a two-sided test.
So, the test is done in R as follows (see the sketch below).
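The slide's code is shown only as a figure; a minimal sketch, assuming
the file has columns named LungCap and Smoke (hypothetical names):

> lung <- read.table(file.choose(), header = TRUE)
> boxplot(LungCap ~ Smoke, data = lung)   # visualise the two groups
> wilcox.test(LungCap ~ Smoke, data = lung,
              mu = 0, alternative = "two.sided", conf.int = TRUE)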


This Wilcoxon test shows rejection of the null hypothesis in favour of
the alternative hypothesis.


One sample U test

If we specify a single vector, a one-sample U test is carried out, as in
the sketch below.
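For example, reusing data2 from the earlier slide (the value 5 is just an
illustrative hypothesized median):

> wilcox.test(data2, mu = 5)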


Wilcoxon Signed rank test / Two sample U test

This is a non-parametric method appropriate for examining the median
difference in observations for 2 populations that are paired or
dependent on one another.

Example
In this example we have blood test results before and after receiving
some treatment (25 paired observations of 3 variables).


> BP <- read.table(file.choose(), header = T)
> attach(BP)
> names(BP)
[1] "Subject" "Before" "After"
> BP[c(1,3,5),]

We will score changes from before to after treatment.

Before examining the data, we plot the before and after values.


> boxplot(Before, After)

In the plot, we can see the BP decrement after taking the treatment.

H0: Median change in systolic blood pressure is 0.

This is a two-sided test:
> wilcox.test(Before, After, mu = 0, alt = "two.sided", paired = TRUE,
              conf.int = TRUE, conf.level = 0.95)
Correlation and Covariance

Correlation is a measure of the strength of the relationship or
association between two variables.
• For a generally upward shape we say that the correlation is positive:
as the independent variable increases, the dependent variable generally
increases.
• For a generally downward shape we say that the correlation is
negative: as the independent variable increases, the dependent variable
generally decreases.


Continues..

For randomly scattered points with no upward or downward trend, we
say there is no correlation.

Look at the spread of the points to make a judgement about the
strength of the correlation. (The scatter diagrams classifying positive
relationships appear as figures on the slides.)


continues..

We classify the strengths of negative relationships in the same way.

Covariance
The covariance of two variables x and y in a data sample measures how
the two are linearly related. A positive covariance indicates a positive
linear relationship between the variables, and vice versa.

Covariance = Σ(x − x̄)(y − ȳ) / n


Pearson's correlation coefficient (r)
If a horizontal line is drawn through the mean y value ȳ, and a vertical
line through the mean x value x̄, you can see the relationship between
the two variables in another way.


continues..

coefficient (r)

r = Sxy / √(Sxx Syy)

• Sxx = Σx² − (Σx)²/n, or Sxx = Σ(x − x̄)²
• Syy = Σy² − (Σy)²/n, or Syy = Σ(y − ȳ)²
• Sxy = Σxy − (Σx)(Σy)/n, or Sxy = Σ(x − x̄)(y − ȳ)


Spearman's rank correlation coefficient (rs)

In practice a simpler procedure is normally used to calculate rs. The
raw scores are converted to ranks, and the difference d between the
ranks of each observation on the two variables is calculated.
If there are no tied ranks, then rs is given by:

rs = 1 − (6 Σdᵢ²) / (n(n² − 1))

• dᵢ = the difference between the ranks of corresponding values x and y
• n = the number of pairs of values.
If there are tied ranks, the classic Pearson's correlation coefficient
between ranks can be used instead of this formula.
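R computes these measures directly; a sketch with illustrative vectors
x and y:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 1, 4, 3, 5)
> cov(x, y)                        # sample covariance
> cor(x, y)                        # Pearson's r
> cor(x, y, method = "spearman")   # Spearman's rank correlation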


Least squares regression line
Linear regression
Linear regression is a formal method of finding a line which best fits
a set of data.
We can use technology to perform linear regression and hence find the
equation of the line. Most graphics calculators and computer packages
use the method of 'least squares' to determine the gradient and
y-intercept.
The least squares regression line: we find the vertical distances
d1, d2, ... from the data points to the line of best fit.
We add the squares of these distances, giving d1² + d2² + ...
The least squares regression line is the one which makes this sum as
small as possible.
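In R, lm() fits the least squares line. A minimal sketch with made-up
illustrative data:

> x <- c(1, 2, 3, 4, 5)
> y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
> fit <- lm(y ~ x)    # least squares fit of y = a + b*x
> coef(fit)           # y-intercept a and gradient b
> plot(x, y)
> abline(fit)         # draw the fitted line on the scatterplot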


UNIT-2 (Machine Learning)

Cluster Analysis


Common steps in cluster analysis

Good clustering
A good clustering method will produce high-quality clusters with high
intra-class similarity and low inter-class similarity.

Data object: It represents an entity.
Ex: In a sales database the objects may be customers, store items, sales.
Attribute: It is a data field, representing a characteristic or feature
of data objects.
Types of attributes
The type of an attribute is determined by the set of possible values.
1. Nominal
Nominal means "relating to names". The values of a nominal attribute
are symbols or names of things.
Ex: hair color can take black, brown, gray, ...


2. Binary
A binary attribute is a nominal attribute with only two categories: 0, 1.
• A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight; that is, there is no preference on
which outcome should be coded as 0 or 1. Ex: gender (male, female).
• A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
medical test for HIV. By convention, we code the most important
outcome, which is usually the rarest one, by 1 (e.g. HIV positive) and
the other by 0 (e.g. HIV negative).

3. Ordinal
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Ex: customer satisfaction (0: very dissatisfied, 1: dissatisfied,
2: neutral, 3: satisfied, 4: very satisfied).


4. Numeric
A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values. Numeric attributes can
be interval-scaled or ratio-scaled.
• Interval-scaled attributes: measured on a scale of equal-size units.
Ex: temperature scales.
• Ratio-scaled attributes: numeric attributes with an inherent
zero-point.
Ex: you are 100 times richer with 100 crores than with 1 crore.

• CLUSTER: A collection of data objects, similar to one another within
the same cluster and dissimilar to the objects in other clusters.
• CLUSTER ANALYSIS:
Grouping a set of data objects into clusters.
Clustering is unsupervised classification: no predefined classes.
• Examples of clustering applications:
Marketing, land use, insurance, city planning, earthquake studies.


Calculating distances

Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, which is typically a metric: d(i,j).


Continues..

• The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal and ratio variables.

• Clustering may not be the best way to discover interesting groups in a
data set. Often visualisation methods work well, allowing the human
expert to identify useful groups. However, as data set sizes increase
to millions of observations, this becomes impractical, and clusters help
to partition the data so that we can deal with smaller groups.

• Different algorithms, and even multiple runs of the one algorithm,
will deliver different clusterings.


Similarity and Dissimilarity Between Observations

Distance measures the dissimilarity between two data observations
x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):
s(x,y) = 1 − d(x,y)

A distance measure should satisfy the following requirements:

• d(x, y) ≥ 0: distance is non-negative.
• d(x, x) = 0: distance to itself is 0.
• d(x, y) = d(y, x): distance is symmetric.
• d(x, y) ≤ d(x, z) + d(z, y): triangular inequality.


Euclidean Distance

* The straight-line distance between two points;
the default measure, based on numeric values;
* the distance calculation we learned in school:

dist(x, y) = √( Σ_{i=1..n} (xi − yi)² )

* x and y are the observations;
* n variables;
xi is the value of variable i for observation x;
similarly for yi.


Manhattan Distance
The distance walking the streets of Manhattan:

dist(x, y) = Σ_{i=1..n} |xi − yi|

Minkowski Distance
A general measure of distance:

d(x, y) = (|x1 − y1|^q + |x2 − y2|^q + ... + |xn − yn|^q)^(1/q)

If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan
distance. A variation is the weighted distance, where variables have
different importance:

d(x, y) = (w1|x1 − y1|^q + w2|x2 − y2|^q + ... + wn|xn − yn|^q)^(1/q)
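A sketch of these measures using R's dist() (x and y are illustrative):

> x <- c(1, 2, 3); y <- c(4, 0, 3)
> dist(rbind(x, y), method = "euclidean")
> dist(rbind(x, y), method = "manhattan")
> dist(rbind(x, y), method = "minkowski", p = 3)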


Issue of scale

(An example of scaling is shown as a figure on the slides.)


Interval-Scaled Variables

Interval-scaled variables are continuous variables on a roughly linear
scale.
• The Euclidean distance or some other instance of the Minkowski
distance can be used.
• Before applying the distance measure, the variables need to be
normalized.
Ex: variables with larger ranges (e.g., income) will overwhelm variables
with smaller ranges (e.g., age): 50,000 − 40,000 = 10,000 versus
50 years − 40 years = 10.
A variation of z-score normalisation:

v′ = (v − m) / s

where m is the mean and s is the mean absolute deviation (cf. the
standard deviation: this is more robust to outliers, and retains the
outliers).
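A sketch of this normalisation (v is an illustrative vector):

> v <- c(40000, 50000, 52000, 61000)
> m <- mean(v)
> s <- mean(abs(v - m))   # mean absolute deviation
> (v - m) / s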
Binary Variables
Binary variables have just two possible values: 0 and 1. We consider as
a group all of the binary variables and count, for observations xi and
xj, the number of variables where both are 1 (call this a), xi is 1 and
xj is 0 (b), xi is 0 and xj is 1 (c), or both are 0 (d), to build a
contingency table.
• Simple matching coefficient (symmetric variables):

d(xi, xj) = (b + c) / (a + b + c + d)

• Jaccard coefficient
(asymmetric: 1 is more important, e.g. diseases):

d(xi, xj) = (b + c) / (a + b + c)
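A sketch of the two coefficients for a pair of illustrative binary
vectors:

> xi <- c(1, 0, 1, 1, 0); xj <- c(1, 1, 0, 1, 0)
> a <- sum(xi == 1 & xj == 1); b <- sum(xi == 1 & xj == 0)
> c <- sum(xi == 0 & xj == 1); d <- sum(xi == 0 & xj == 0)
> (b + c) / (a + b + c + d)   # simple matching: 0.4
> (b + c) / (a + b + c)       # Jaccard: 0.5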
Categorical Variables

• A generalisation of the binary variable in that it can take more than
2 levels, e.g., red, yellow, blue, green.
• Method 1: Simple matching:

d(x, y) = (n − p) / n

where p is the number of matched categorical variables and n is the
total number of variables.
• Method 2: Convert each level into a binary variable, creating many
new binary variables.


Variables of Mixed Types

• A dataset may contain all types of variables: interval, binary,
categorical.
• Use a weighted formula to combine the different normalised (to [0,1])
distances, where the weights express the relative importance of the
variables:

d(x, y) = Σ_k wk · dij(Ak)

where wk is the weight of variable Ak, and dij(Ak) is the dissimilarity
between the ith observation and the jth observation on variable Ak,
normalised to [0, 1].
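In practice, the cluster package's daisy() function computes a
Gower-type mixed-variable dissimilarity along these lines:

> library(cluster)
> d <- daisy(iris, metric = "gower")   # iris mixes numeric and factor columns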


Major Clustering Approaches

• Partitioning algorithms:
construct various partitions and then evaluate them by some criterion.
• Hierarchy algorithms:
create a hierarchical decomposition of the set of data (or objects)
using some criterion.
• Density-based:
based on connectivity and density functions.
• Grid-based:
based on a multiple-level granularity structure.
• Model-based:
a model is hypothesized for each of the clusters, and the idea is to
find the best fit of the data to each model.


Partitioning Methods: The Principle

• Given:
→ a data set of n objects
→ K, the number of clusters to form.
• Organize the objects into k partitions (k ≤ n) where each partition
represents a cluster.
• The clusters are formed to optimize an objective partitioning
criterion:
→ objects within a cluster are similar;
→ objects of different clusters are dissimilar.


K-Means Method

Given k, the k-means algorithm is implemented in four steps (shown as
figures on the slides; the algorithm below spells them out).
Algorithm

• Input:
→ K: the number of clusters
→ D: a data set containing n objects
• Output: a set of k clusters
• Method:
1. Arbitrarily choose k objects from D as the initial cluster centers
2. Repeat
3. Reassign each object to the most similar cluster, based on the mean
value of the objects in the cluster
4. Update the cluster means
5. Until no change


K-Means Properties

• The algorithm attempts to determine k partitions that minimize the
square-error function

E = Σ_{i=1..k} Σ_{p∈Ci} (p − mi)²

→ E: the sum of the squared error for all objects in the data set
→ mi: the mean of cluster Ci
• It works well when the clusters are compact clouds that are rather
well separated from one another.


K-Means Properties

Advantages
• K-means is relatively scalable and efficient in processing large data
sets.
• The computational complexity of the algorithm is O(nkt)
→ n: the total number of objects
→ k: the number of clusters
→ t: the number of iterations
→ Normally: k << n and t << n
Disadvantages
• Can be applied only when the mean of a cluster is defined.
• Users need to specify k.
• K-means is not suitable for discovering clusters with nonconvex
shapes or clusters of very different size.
• It is sensitive to noise and outlier data points (they can influence
the mean value).


Variations of the K-Means Method

• A few variants of the k-means which differ in


· Selection of the initial k means
· Dissimilarity calculations
· Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang98)
· Replacing means of clusters with modes
· Using new dissimilarity measures to deal with categorical objects
· Using a frequency-based method to update modes of clusters
· A mixture of categorical and numerical data



Example of k-means

i <- read.csv(choose.files())   # load the iris data from a CSV file
str(i)
names(i)

i2 <- i[,-5]    # drop the 5th column (Species) before clustering
names(i2)


continues..

i3 <- kmeans(i2, 3)   # k-means with k = 3 clusters
i3


continues..

table(iris$Species, i3$cluster)   # compare clusters against true species

plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)


continues..

plot(iris$Petal.Length,iris$Petal.Width,col=i3$cluster)



continues..

plot(iris$Sepal.Length,iris$Sepal.Width,col=i3$cluster)



K-Medoids Method

• Minimizes the sensitivity of k-means to outliers.
• Picks actual objects to represent clusters instead of mean values.
• Each remaining object is clustered with the representative object
(medoid) to which it is the most similar.
• The algorithm minimizes the sum of the dissimilarities between each
object and its corresponding reference point:

E = Σ_{i=1..k} Σ_{p∈Ci} |p − Oi|

· E: the sum of absolute error for all objects in the data set
· p: the data point in the space representing an object
· Oi: the representative object of cluster Ci


K-Medoids Method: The Idea

• Initial representatives are chosen randomly.
• The iterative process of replacing representative objects by
non-representative objects continues as long as the quality of the
clustering is improved.
• For each representative object O:
→ for each non-representative object R, swap O and R.
• Choose the configuration with the lowest cost.
• The cost function is the difference in absolute error-value if a
current representative object is replaced by a non-representative
object.


K-Medoids Method: Example

(The step-by-step worked example appears as a sequence of figures on
the slides.)


K-Medoids Algorithm (PAM)
PAM: Partitioning Around Medoids
• Input:
→ K: the number of clusters
→ D: a data set containing n objects
• Output: a set of k clusters
• Method:
1. Arbitrarily choose k objects from D as representative objects (seeds)
2. Repeat
3. Assign each remaining object to the cluster with the nearest
representative object
4. For each representative object Oj
5. Randomly select a non-representative object Orandom
6. Compute the total cost S of swapping representative object Oj with
Orandom
7. If S < 0 then replace Oj with Orandom
8. Until no change
Example program of k-medoids

x <- read.csv(choose.files())   # load the iris data from a CSV file
str(x)
names(x)

x1 <- x[,-5]    # drop the Species column
names(x1)


continues..

library(cluster)   # pam() lives in the cluster package
km <- pam(x1, 3)   # partitioning around medoids with k = 3
km


continues..

table(iris$Species,km$clustering)

plot(iris$Petal.Length,iris$Petal.Width,col=iris$Species)



continues..

plot(iris$Petal.Length,iris$Petal.Width,col=km$clustering)



continues..

plot(iris$Sepal.Length,iris$Sepal.Width,col=iris$Species)



continues..

plot(iris$Sepal.Length,iris$Sepal.Width,col=km$clustering)



K-Medoids Properties (k-medoids vs. k-means)

• The complexity of each iteration is O(k(n − k)²).
• For large values of n and k, such computation becomes very costly.

• Advantages
→ The k-medoids method is more robust than k-means in the presence of
noise and outliers.
• Disadvantages
→ K-medoids is more costly than the k-means method.
→ Like k-means, k-medoids requires the user to specify k.
→ It does not scale well for large data sets.


Hierarchical Clustering

• Hierarchical clustering approach
- A typical clustering analysis approach via partitioning the data set
sequentially.
- Construct nested partitions layer by layer via grouping objects into a
tree of clusters (without the need to know the number of clusters in
advance).
- Use a (generalised) distance matrix as the clustering criterion.
• Agglomerative vs. Divisive - two sequential clustering strategies for
constructing a tree of clusters:
- Agglomerative: a bottom-up strategy.
Initially each data object is in its own (atomic) cluster;
then merge these atomic clusters into larger and larger clusters.
- Divisive: a top-down strategy.
Initially all objects are in one single cluster;
then the cluster is subdivided into smaller and smaller clusters.


Example

Agglomerative and divisive clustering on the data set {a, b, c, d, e}
(shown as a figure on the slides).


Example

data(nutrient, package = "flexclust")
View(nutrient)
row.names(nutrient) <- tolower(row.names(nutrient))
row.names(nutrient)

nutrient.scaled <- scale(nutrient)   # standardise before computing distances
nutrient.scaled


continues..

d <- dist(nutrient.scaled)                     # Euclidean distance matrix
fit.average <- hclust(d, method = "average")   # average-linkage clustering
fit.average


continues..

plot(fit.average, hang = -1, cex = 1.0, main = "Average linkage clustering")
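To obtain a flat clustering from the tree, cutree() can be used (k = 5
here is just an illustrative choice):

> clusters <- cutree(fit.average, k = 5)
> table(clusters)   # cluster sizes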


Classification

Definition
The data analysis task is classification, where a model or classifier is
constructed to predict categorical labels, such as safe or risky for the
loan application data; yes or no for the marketing data; or treatment A,
treatment B, or treatment C for the medical data.

Regression analysis
Regression analysis is a statistical methodology that is most often used
for numeric prediction, hence the two terms are often used
synonymously.


Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help
improve the accuracy, efficiency, and scalability of the classification
or prediction process.
• Data cleaning: the preprocessing of data in order to remove or
reduce noise, and the treatment of missing values.
• Relevance analysis: many of the attributes in the data may be
redundant. Correlation analysis can be used to identify whether any
two given attributes are statistically related.
Relevance analysis, in the form of correlation analysis and attribute
subset selection, can be used to detect attributes that do not
contribute to the classification or prediction task.
• Data transformation and reduction: the data may be transformed
by normalization, particularly when neural networks or methods
involving distance measurements are used. Data can also be reduced by
applying many other methods, ranging from wavelet transformation and
principal components analysis to discretization techniques, such as
binning, histogram analysis, and clustering.
Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from
class-labeled training tuples. A decision tree is a flowchart-like tree
structure, where each internal node (non-leaf node) denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree
is the root node.


Introduction

• Some decision tree algorithms produce only binary trees (where each
internal node branches to exactly two other nodes), whereas others can
produce nonbinary trees.

How are decision trees used for classification? Given a tuple, X, for
which the associated class label is unknown, the attribute values of the
tuple are tested against the decision tree. A path is traced from the
root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.

Why are decision tree classifiers so popular? The construction of
decision tree classifiers does not require any domain knowledge or
parameter setting, and therefore is appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data.


Decision Tree Induction

A researcher in machine learning, J. Ross Quinlan, developed a decision
tree algorithm known as ID3 (Iterative Dichotomiser). This work
expanded on earlier work on concept learning systems. Quinlan later
presented C4.5 (a successor of ID3), which became a benchmark to which
newer supervised learning algorithms are often compared.


Algorithm

Input:
Data partition, D, which is a set of training tuples and their
associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting
criterion that best partitions the data tuples into individual classes.
This criterion consists of a splitting attribute and, possibly, either a
split point or a splitting subset.
Output: a decision tree
Method:
1. Create a node N;
2. if tuples in D are all of the same class, C, then
3. return N as a leaf node labeled with the class C;
4. if attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D;


continues..

6. Apply Attribute_selection_method(D, attribute_list) to find the best
splitting criterion;
7. label node N with the splitting criterion;
8. if the splitting attribute is discrete-valued and multiway splits are
allowed then
9. attribute_list ← attribute_list − splitting_attribute;
10. for each outcome j of the splitting criterion
11. let Dj be the set of data tuples in D satisfying outcome j;
12. if Dj is empty then
13. attach a leaf labeled with the majority class in D to node N;
14. else attach the node returned by Generate_decision_tree(Dj,
attribute_list) to node N; endfor
15. return N;


Information gain

ID3 uses information gain as its attribute selection measure, a measure
based on Shannon's information theory, which studied the value or
information content of messages. Let node N represent or hold the
tuples of partition D. The attribute with the highest information gain
is chosen as the splitting attribute for node N. This attribute
minimizes the information needed to classify the tuples in the resulting
partitions and reflects the least randomness or impurity in these
partitions. Such an approach minimizes the expected number of tests
needed to classify a given tuple and guarantees that a simple (but not
necessarily the simplest) tree is found. The expected information needed
to classify a tuple in D is given by:

Info(D) = − Σ_{i=1..m} pi log2(pi)

• pi is the probability that an arbitrary tuple in D belongs to class
Ci:

pi = |Ci,D| / |D|


Continues..

• These partitions would correspond to the branches grown from node N.
Ideally, we would like this partitioning to produce an exact
classification of the tuples.
• That is, we would like each partition to be pure. However, it is quite
likely that the partitions will be impure. How much more information
would we still need (after the partitioning) in order to arrive at an
exact classification? This amount is measured by:

InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)

• The term |Dj| / |D| acts as the weight of the jth partition.
• Gain(A) = Info(D) − InfoA(D)
Gain(A) tells us how much would be gained by branching on A. It is the
expected reduction in the information requirement caused by knowing the
value of A.
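A minimal R sketch (not from the slides) of these formulas, for a vector
of class labels y and a categorical attribute a:

entropy <- function(y) {
  p <- table(y) / length(y)   # class proportions pi
  p <- p[p > 0]               # drop empty classes so log2 is defined
  -sum(p * log2(p))           # Info(D)
}
info_gain <- function(y, a) {
  w <- table(a) / length(a)              # weights |Dj| / |D|
  cond <- sapply(split(y, a), entropy)   # Info(Dj) for each value of a
  entropy(y) - sum(w * cond)             # Gain(A) = Info(D) - InfoA(D)
}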
