R-Prog Unit-5
Problem: Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of
successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:
b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3
b(2; 5, 0.167) = 0.161
R Code:
> dbinom(2, size=5, prob=0.167)
[1] 0.1612
Problem: In a restaurant, seventy percent of people order Chinese food and thirty percent order Italian food. A
group of three persons enters the restaurant. Find the probability that at least two of them order Italian food.
Solution:-
The probability of ordering Chinese food is 0.7 and the probability of ordering Italian food is 0.3. If
at least two of them order Italian food, then either two or all three order Italian food:
P(X >= 2) = 3C2 * (0.3)^2 * (0.7) + 3C3 * (0.3)^3 = 0.189 + 0.027 = 0.216
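The restaurant probability can be checked in R. This is a minimal sketch; the variable names are our own, and X stands for the number of the three diners who order Italian food.

```r
# X ~ Binomial(size = 3, prob = 0.3): number of diners ordering Italian food
p_two_or_three <- dbinom(2, size = 3, prob = 0.3) + dbinom(3, size = 3, prob = 0.3)

# Same quantity written as an upper tail: P(X >= 2) = P(X > 1)
p_upper <- pbinom(1, size = 3, prob = 0.3, lower.tail = FALSE)

print(p_two_or_three)  # 0.216
print(p_upper)
```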
Cumulative Binomial Probability:- A cumulative binomial probability refers to the probability that the
binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower
limit and less than or equal to a stated upper limit).
Problem: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five
possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a
student attempts to answer every question at random.
Solution:
Since only one out of five possible answers is correct, the probability of
answering a question correctly at random is 1/5 = 0.2.
The probability of having exactly 4 correct answers by
random attempts is found as follows.
> dbinom(4, size=12, prob=0.2)
[1] 0.1329
To find the probability of having four or fewer correct answers by random attempts, we
apply the function dbinom with x = 0,…,4 and add the results.
> dbinom(0, size=12, prob=0.2) + dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) + dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.9274
Alternatively, we can use the cumulative probability function for binomial
distribution pbinom.
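A minimal sketch of the pbinom alternative for the quiz problem, with a check against the dbinom sum (variable names are our own):

```r
# P(X <= 4) for X ~ Binomial(size = 12, prob = 0.2)
p_cum <- pbinom(4, size = 12, prob = 0.2)

# The cumulative value equals the sum of the individual dbinom terms
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

print(round(p_cum, 4))  # 0.9274
```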
Problem: Fit an appropriate binomial distribution to the following data and calculate the theoretical frequencies.
x: 0 1 2 3 4 5
f: 2 14 20 34 22 8
Solution:
Here n = 5, N = ∑fi = 100
Mean = ∑xi fi / ∑fi = 284/100 = 2.84
Setting the binomial mean np equal to the sample mean: np = 2.84
p = 2.84/5 = 0.568
q = 1 − p = 0.432
> plot(fitbin)
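The object fitbin is not defined anywhere in these notes, so its construction is unknown. As a hedged sketch of the same fit using only base R (all variable names here are our own), the theoretical frequencies are N times the binomial probabilities with p estimated from the mean:

```r
x <- 0:5
f <- c(2, 14, 20, 34, 22, 8)      # observed frequencies from the problem
N <- sum(f)                        # 100
n <- 5
p_hat <- sum(x * f) / (N * n)      # mean / n = 2.84 / 5 = 0.568

expected <- N * dbinom(x, size = n, prob = p_hat)   # theoretical frequencies
print(round(data.frame(x, observed = f, expected), 2))

# Observed frequencies (bars) against fitted frequencies (points)
bp <- barplot(f, names.arg = x, xlab = "x", ylab = "frequency")
points(bp, expected, pch = 16)
```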
Examples of variables that follow a Poisson distribution:
1. The number of defective electric bulbs manufactured by a reputed company.
2. The number of telephone calls per minute at a switchboard.
3. The number of cars passing a certain point in one minute.
4. The number of printing mistakes per page in a large text.
R has four built-in functions for the Poisson distribution. They are described below.
dpois(x, lambda, log = FALSE) :- This function gives the probability of exactly x events (the
probability mass function).
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE) :- This function gives the cumulative
probability P(X <= q), or the upper tail P(X > q) when lower.tail = FALSE.
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE):- This function takes a probability value
and gives the smallest number whose cumulative probability is at least that value.
rpois(n, lambda) :- This function generates n random values from a Poisson distribution with
mean lambda.
Following is the description of the parameters used −
x, q are vectors of non-negative counts.
p is a vector of probabilities.
n is the number of random values to generate.
lambda is the mean number of events per interval.
Problem:- If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution:-
The probability of having sixteen or fewer cars crossing the bridge in a
particular minute is given by the function ppois.
> ppois(16, lambda=12) # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the
bridge in a minute is the upper tail of the Poisson distribution.
> ppois(16, lambda=12, lower.tail=FALSE) # upper tail
[1] 0.10129
Answer:- If there are twelve cars crossing a bridge per minute on average, the probability of
having seventeen or more cars crossing the bridge in a particular minute is 10.1%.
Problem:- The average number of homes sold by the Acme Realty company is 2 homes per day. What is the
probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
μ = 2; since 2 homes are sold per day, on average.
x = 3; since we want to find the likelihood that 3 homes will be
sold tomorrow.
e = 2.71828; since e is a constant equal to approximately
2.71828.
We plug these values into the Poisson formula as follows:
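The worked formula is not shown in these notes. As a sketch, the standard Poisson formula P(x; μ) = (e^-μ)(μ^x) / x! can be evaluated in R and compared with dpois (variable names are our own):

```r
mu <- 2   # average homes sold per day
x  <- 3   # homes sold tomorrow

p_formula <- exp(-mu) * mu^x / factorial(x)   # Poisson formula written out
p_dpois   <- dpois(x, lambda = mu)            # built-in equivalent

print(round(p_formula, 3))  # 0.18
```

So the probability of selling exactly 3 homes tomorrow is about 0.180.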
Cumulative Poisson Probability:- A cumulative Poisson probability refers to the probability that the
Poisson random variable is greater than some specified lower limit and less than some specified upper
limit.
Problem:-Suppose the average number of lions seen on a 1-day safari is 5.
What is the probability that tourists will see fewer than four lions on the next
1-day safari?
Solution: This is a Poisson experiment in which we know the following:
μ = 5; since 5 lions are seen per safari, on average.
x = 0, 1, 2, or 3; since we want to find the likelihood that
tourists will see fewer than 4 lions; that is, we want the
probability that they will see 0, 1, 2, or 3 lions.
e = 2.71828; since e is a constant equal to approximately
2.71828.
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus,
we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute
this sum, we use the Poisson formula:
P(x ≤ 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)
P(x ≤ 3; 5) = [ (e^-5)(5^0) / 0! ] + [ (e^-5)(5^1) / 1! ] + [ (e^-5)(5^2) / 2! ] + [ (e^-5)(5^3) / 3! ]
P(x ≤ 3; 5) = [ (0.006738)(1) / 1 ] + [ (0.006738)(5) / 1 ] + [ (0.006738)(25) / 2 ] + [ (0.006738)(125) / 6 ]
P(x ≤ 3; 5) = [ 0.006738 ] + [ 0.03369 ] + [ 0.084224 ] + [ 0.140375 ]
P(x ≤ 3; 5) = 0.2650
Thus, the probability of seeing no more than 3 lions is 0.2650.
R Code:-
> ppois(3,lambda = 5)
[1] 0.2650259
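Normal Distribution:- A normal random variable X with mean μ and standard deviation σ has the probability
density f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), on the domain −∞ < x < ∞.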
Standard Normal Distribution
It is the distribution that occurs when a normal random variable has a mean of zero and a standard
deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score.
Every normal random variable X can be transformed into a z score via the following equation:
Z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation.
Standard Normal Curve:- One way of figuring out how data are
distributed is to plot them in a graph. If the data are normally distributed,
the plot is a bell curve. A bell curve has a small percentage
of the points in both tails and the larger percentage in the central part of
the curve. The standard normal distribution has this bell shape, centred at z = 0.
[Figure: standard normal curve]
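A minimal sketch for drawing the standard normal curve in base R, together with the share of the area near the centre of the bell (the axis labels and variable names are our own):

```r
# Density of the standard normal distribution (mean 0, sd 1) over -4..4
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "z", ylab = "density", main = "Standard normal curve")

# Share of the area within one, two and three standard deviations of the mean
within <- pnorm(1:3) - pnorm(-(1:3))
print(round(within, 4))  # 0.6827 0.9545 0.9973
```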
R functions:
dnorm(x, mean = 0, sd = 1, log = FALSE) :- This function gives the height of the probability density
at each point x.
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE):- This function gives the cumulative
probability P(X <= q), or the upper tail P(X > q) when lower.tail = FALSE.
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE):- This function takes a probability
value and gives the number whose cumulative probability matches it.
rnorm(n, mean = 0, sd = 1) :- This function generates n random values from a normal distribution
with the given mean and standard deviation.
Problem:-X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
Solution:
Here A(z1) denotes the table area between z = 0 and z = z1.
a) For x = 40,
z = (x − µ) / σ
⇒ z = (40 − 30) / 4
= 2.5 (= z1, say)
Hence P(x < 40) = P(z < 2.5)
= 0.5 + A(z1) = 0.5 + 0.4938 = 0.9938
b) For x = 21,
z = (x − µ) / σ
⇒ z = (21 − 30) / 4
= −2.25 (= −z1, say)
Hence P(x > 21) = P(z > −2.25)
= 0.5 + A(z1) = 0.5 + 0.4878 = 0.9878
c) For x = 30,
z = (x − µ) / σ
⇒ z = (30 − 30) / 4 = 0, and
for x = 35,
z = (x − µ) / σ
⇒ z = (35 − 30) / 4
= 1.25
Hence P(30 < x < 35) = P(0 < z < 1.25)
= [area to the left of z = 1.25] − [area to the left of z = 0]
= 0.8944 − 0.5 = 0.3944
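The three table look-ups can be cross-checked with pnorm, which works on the original scale so no z conversion is needed. A minimal sketch (variable names are our own):

```r
mu <- 30; sigma <- 4

p_a <- pnorm(40, mean = mu, sd = sigma)                       # P(x < 40)
p_b <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)   # P(x > 21)
p_c <- pnorm(35, mu, sigma) - pnorm(30, mu, sigma)            # P(30 < x < 35)

print(round(c(a = p_a, b = p_b, c = p_c), 4))  # 0.9938 0.9878 0.3944
```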
Problem:-The length of life of an instrument produced by a machine has a normal distribution with a mean of 12
months and a standard deviation of 2 months. Find the probability that an instrument
produced by this machine will last
a) less than 7 months.
b) between 7 and 12 months.
Solution:
a) P(x < 7)
For x = 7,
z = (x − µ) / σ
⇒ z = (7 − 12) / 2
= −2.5 (= z1, say)
Hence P(x < 7) = P(z < −2.5)
= 0.0062
b) P(7 < x < 12)
For x = 12,
z = (x − µ) / σ
⇒ z = (12 − 12) / 2
= 0 (= z2, say)
Hence P(7 < x < 12) = P(−2.5 < z < 0)
= 0.5 − 0.0062 = 0.4938
Problem:-The Tahoe Natural Coffee Shop morning customer load follows a normal
distribution with mean 45 and standard deviation 8. Determine the probability that the
number of customers tomorrow will be less than 42.
Solution:-
We first convert the raw score to a z-score. We have
z = (x − µ) / σ
⇒ z = (42 − 45) / 8 = −0.375
Next, we use the table to find the probability. The table gives 0.3520. (We have rounded the z-score
to −0.38.)
We can conclude that
P(x < 42) = P(z < −0.38)
= 0.352
That is, there is about a 35% chance that there will be fewer than 42 customers tomorrow.
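In R the rounding of z is unnecessary. The sketch below compares the unrounded probability with the table look-up at z = −0.38 (variable names are our own):

```r
p_exact <- pnorm(42, mean = 45, sd = 8)   # uses z = -0.375 without rounding
p_table <- pnorm(-0.38)                   # z rounded to two decimals, as in the table

print(round(c(exact = p_exact, table = p_table), 4))
```

The two values differ only in the third decimal place, so the conclusion of roughly a 35% chance is unchanged.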
Example:
> x <- c(92,117,109,85,117,107,82,83,119,113,101,106,101,84,126,69,82,79,84,100,104,111,109,92,93,107,
81,118,81,133,111,82,120,103,115,89,74,110,83,110,96,102,108,110,140,106,111,98,98,99,74,101,107,104,
128,87,95,109,104,91,83,98,99,103,126,123,85,98,93,100)
Problem:-Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore,
the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring
84 or more in the exam?
Solution:-
We apply the function pnorm of the normal distribution with mean 72 and standard deviation
15.2. Since we are looking for the percentage of students scoring higher than 84, we are interested in
the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Problem:- The local ice cream shop keeps track of how much ice cream they sell versus the temperature
on that day. Here are their figures for the last 12 days. Find the correlation between temperature and sales.
Temperature (°C):     14.2 16.4 11.9 15.2 18.5 22.1 19.4 25.1 23.4 18.1 22.6 17.2
Ice cream sales ($):  215  325  185  332  406  522  412  614  544  421  445  408
Solution:-
R Code:-
> temp <- c(14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2)
> sales <- c(215,325,185,332,406,522,412,614,544,421,445,408)
> corr_coeff <- cor(temp,sales)
> corr_coeff
[1] 0.9575066
> cov(temp,sales)
[1] 484.0932
> # Scatter plot with a line of best fit
> plot(temp, sales, pch=16, col="red")
> abline(lm(sales~temp), col="blue")
T-test for single mean:- The one-sample t-test is used to test whether the mean of a population equals a
specified theoretical mean (μ0), using a sample from that population.
Let X represent a set of n values with sample mean x̄ and sample standard deviation s.
The comparison of the observed mean x̄ with the theoretical value μ0 is performed with
the formula below:
t = (x̄ − μ0) / (s / √n)
To evaluate whether the difference is statistically significant, read from the t-table
the critical value of Student’s t distribution corresponding to the significance level alpha of your
choice (commonly 5%). The degrees of freedom (df) used in this test are: df = n − 1
Problem:- A professor wants to know if her introductory statistics class has a good grasp of basic
math. Six students are chosen at random from the class and given a math proficiency test. The professor
wants the class to be able to score above 70 on the test. The six students get scores of 62, 92, 75, 68, 83,
and 95. Can the professor have 90 percent confidence that the mean score for the class on the test would
be above 70?
Solution:-
Null hypothesis H0: μ = 70
Alternative hypothesis Ha: μ > 70
Level of significance: 0.10 (90 percent confidence), one-tailed
First, compute the sample mean and standard deviation:
x̄ = (62 + 92 + 75 + 68 + 83 + 95) / 6 = 475 / 6 = 79.17
s = 13.17
The test statistic is t = (x̄ − μ0) / (s / √n)
t = (79.17 − 70) / (13.17 / √6) = 9.17 / 5.38
= 1.71 (calculated value of t)
To test the hypothesis, the computed t-value of 1.71 is compared with the critical value in the t-table.
For a one-tailed test at the 0.10 level with 5 df the critical value is 1.476. The calculated t is more than
the table value of t, so the null hypothesis is rejected.
The R run below uses the two-sided alternative and reports p = 0.1489. For the one-tailed alternative μ > 70
the p-value is half of that, about 0.074, which is below 0.10.
R code:-
> x <- c(62, 92, 75, 68, 83, 95)
> t.test(x,alternative="two.sided",mu=70)
One Sample t-test
data: x
t = 1.7053, df = 5, p-value = 0.1489
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
65.34888 92.98446
sample estimates:
mean of x
79.16667
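The output above is for the two-sided alternative. A hedged sketch of the one-sided run that matches the alternative hypothesis μ > 70 at 90 percent confidence (the object name res is our own):

```r
x <- c(62, 92, 75, 68, 83, 95)

# One-sided test: H0 mu = 70 against Ha mu > 70
res <- t.test(x, mu = 70, alternative = "greater", conf.level = 0.90)

print(res$statistic)   # same t value as the two-sided run
print(res$p.value)     # half of the two-sided p-value
```

Because the one-sided p-value is below 0.10, the null hypothesis is rejected at the 10% level.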
Problem:- A sample of 26 bulbs gives a mean life of 990 hours with an S.D. of 20 hours. The manufacturer
claims that the mean life of bulbs is 1000 hours. Does the sample meet the standard?
Solution: Here n = 26,
Sample mean x̄ = 990 hours
S.D. s = 20 hours
Population mean µ = 1000 hours
df = n − 1 = 26 − 1 = 25
Null Hypothesis H0: The sample meets the standard, i.e. µ = 1000 hours
Alternative Hypothesis HA: µ not equal to 1000
Level of Significance: 0.05
The test statistic is
t = (x̄ − μ0) / (s / √n)
t = (990 − 1000) / (20 / √26)
= −2.55, so |t| = 2.55 (calculated value of t)
The two-tailed table value of t with 25 df at the 5% level is 2.060.
The calculated |t| is more than the table value of t, so the null hypothesis is rejected at the 5% level.
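Only summary statistics are given for the bulbs, so t.test cannot be called on raw data. A minimal sketch that computes the statistic, the two-tailed critical value and the p-value directly (variable names are our own):

```r
n <- 26; xbar <- 990; s <- 20; mu0 <- 1000

t_stat <- (xbar - mu0) / (s / sqrt(n))        # test statistic
t_crit <- qt(1 - 0.05 / 2, df = n - 1)        # two-tailed 5% critical value
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided p-value

print(round(c(t = t_stat, critical = t_crit, p = p_val), 4))
```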
Paired comparisons (Paired t-test):- Sometimes data come from non-independent samples. An
example might be testing "before and after" of cosmetics or consumer products. We could use a single
random sample and do "before and after" tests on each person. A hypothesis test based on these data
would be called a paired comparisons test. Since the observations come in pairs, we can study the
difference, d, between the samples. The difference between the i-th pair of measurements is called di.
Test statistic:- With a simple random sample of n pairs of measurements from a
normally distributed population, the mean of the differences, d̄, is tested using the following
form of t:
t = d̄ / (S / √n), with n − 1 degrees of freedom,
where S is the standard deviation of the differences.
Problem :- The blood pressure of 5 women before and after intake of a certain drug is
given below. Test whether there is a significant change in blood pressure at the 1% level of
significance.
Before 110 120 125 132 125
After 120 118 125 136 121
Solution:-
Null hypothesis H0: the mean difference is 0 (no change in blood pressure).
The differences di = After − Before are 10, −2, 0, 4, −4, so d̄ = 8/5 = 1.6.
S² = (1/(n − 1)) Σ (di − d̄)²
= (1/4) [(10 − 1.6)² + (−2 − 1.6)² + (0 − 1.6)² + (4 − 1.6)² + (−4 − 1.6)²]
= 123.20 / 4
= 30.8
S = √30.8 = 5.55
Test statistic: The test statistic is t, which is calculated as
t = d̄ / (S / √n)
= 1.6 / (5.55 / √5) = 0.645
Calculated |t| value is 0.645
The tabulated two-tailed value of t at the 1% level with 5 − 1 = 4 degrees of freedom is 4.604.
Since calculated |t| is less than the tabulated value, we accept the null hypothesis and conclude that there is no
significant change in blood pressure.
R code:-
> x <- c(110,120,125,132,125)
> y <- c(120,118,125,136,121)
> t.test(x,y,paired=TRUE)
Paired t-test
data: x and y
t = -0.64466, df = 4,
p-value = 0.5543
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.490956 5.290956
sample estimates:
mean of the differences
-1.6
State the level of significance and find the critical value. With the pooled standard deviation S
defined below, the critical value comes from Student’s t-distribution with n1 + n2 − 2 degrees of freedom.
Compute the test statistic.
Compare the test statistic to the critical value and state a conclusion.
t = (x̄ − ȳ) / (S √(1/n1 + 1/n2)) ~ t with n1 + n2 − 2 degrees of freedom
where
S² = [ Σ(xi − x̄)² + Σ(yi − ȳ)² ] / (n1 + n2 − 2)
or
S² = (n1 s1² + n2 s2²) / (n1 + n2 − 2)
Problem:- Two horses A and B were tested according to the time (in seconds) to run a particular track
with the following results.
Horse A 28 30 32 33 33 29 34
Horse B 29 30 30 24 27 29
Test whether the two horses have the same running capacity.
Solution:-
Here n1 = 7, n2 = 6, x̄ = 31.286 and ȳ = 28.167
S² = [ Σ(xi − x̄)² + Σ(yi − ȳ)² ] / (n1 + n2 − 2)
= (31.43 + 26.83) / (7 + 6 − 2)
= 5.30
Therefore S = √5.30 = 2.3
Computation: t = (x̄ − ȳ) / (S √(1/n1 + 1/n2))
= (31.286 − 28.167) / (2.3 √(1/7 + 1/6)) = 2.44
The tabulated t0.05 with 7 + 6 − 2 = 11 degrees of freedom at the 5% level of significance is 2.2.
Since calculated t > t0.05, we reject the null hypothesis and conclude that the two horses do not have the same
running capacity.
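The horse comparison can be reproduced with t.test. This sketch assumes equal variances (var.equal = TRUE) so that R uses the pooled standard deviation and n1 + n2 − 2 degrees of freedom, as in the hand calculation; the variable names are our own.

```r
A <- c(28, 30, 32, 33, 33, 29, 34)   # times for horse A
B <- c(29, 30, 30, 24, 27, 29)       # times for horse B

# Pooled-variance two-sample t-test
res <- t.test(A, B, var.equal = TRUE)
print(res)
```

Without var.equal = TRUE, t.test runs the Welch test, which uses a different degrees-of-freedom formula.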
Null hypothesis H0: There are no differences among the mean values of the groups being compared
(i.e., the group means are all equal):
H0: µ1 = µ2 = µ3 = … = µk
Alternative hypothesis H1 (the conclusion if H0 is rejected):
Not all group means are equal (i.e., at least one group mean is different from the rest).
Step 3: Total sum of squares
TSS = S²T = Σi Σj Xij² − cf
where cf is the correction factor.
Step 4: Treatment sum of squares
TrSS = S²Tr = Σj (Tj² / nj) − cf
where Tj is the total and nj the number of observations of the j-th treatment.
Step 5: Error sum of squares
ESS = S²E = TSS − TrSS
ANOVA table:
Source of variation           d.f.   Sum of squares             Mean sum of squares   F-test
Treatment (between samples)   k−1    S²Tr = Σj (Tj²/nj) − cf    S²Tr / (k−1)          Fcal = [S²Tr / (k−1)] / [S²E / (n−k)]
Error                         n−k    S²E = TSS − TrSS           S²E / (n−k)
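No worked data set accompanies the ANOVA steps in these notes. As an illustration only, the sketch below applies them to R's built-in PlantGrowth data (our choice, not the notes') and checks the result against anova(). The correction factor is taken as (grand total)² / N, the usual definition, which is an assumption because the notes do not show it.

```r
data(PlantGrowth)                 # built-in: plant weight for groups ctrl, trt1, trt2
y <- PlantGrowth$weight
g <- PlantGrowth$group
N <- length(y); k <- nlevels(g)

cf   <- sum(y)^2 / N                                           # correction factor (assumed form)
TSS  <- sum(y^2) - cf                                          # total sum of squares
TrSS <- sum(tapply(y, g, sum)^2 / tapply(y, g, length)) - cf   # treatment sum of squares
ESS  <- TSS - TrSS                                             # error sum of squares
F_cal <- (TrSS / (k - 1)) / (ESS / (N - k))

fit <- anova(lm(weight ~ group, data = PlantGrowth))
print(c(manual = F_cal, anova = fit[["F value"]][1]))
```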
Unit-6:- Linear Models, Simple Linear Regression, Multiple Regression, Generalized Linear
Models, Logistic Regression, Poisson Regression, other Generalized Linear Models, Survival
Analysis, Nonlinear Models, Splines, Decision Trees, Random Forests.
Regression:- Regression analysis is a widely used statistical tool for modelling the relationship
between two variables. One of these variables is called the predictor variable, whose value is gathered
through experiments. The other variable is called the response variable, whose value is predicted from the
predictor variable.
Linear Regression:- In Linear Regression these two variables are related through an equation, where
exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship where the exponent of any variable is not equal
to 1 creates a curve.
The general mathematical equation for a linear regression is − y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
lm() Function:- This function creates the relationship model between the predictor and the response
variable.
Syntax: lm(formula,data)
Following is the description of the parameters used −
formula is an expression such as y ~ x describing the relation between x and y.
data is the data frame containing the variables named in the formula.
Example:-
> height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(weight~height)
> print(relation)
Call:
lm(formula = weight ~ height)
Coefficients:
(Intercept) height
-38.4551 0.6746
> plot(height, weight, col = "blue", main = "Height & Weight Regression")
> abline(relation, col = "orange")
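Once the model is fitted it can be used for prediction. A minimal sketch, where the new height of 170 is a hypothetical value chosen for illustration:

```r
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(weight ~ height)

# Predicted weight for a person of height 170 (hypothetical input)
new_person <- data.frame(height = 170)
pred <- predict(relation, newdata = new_person)
print(pred)
```

The column name in the new data frame must match the predictor name used in the formula.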
r = [ (1/n) Σxy − x̄ ȳ ] / [ √((1/n) Σ(x − x̄)²) · √((1/n) Σ(y − ȳ)²) ]
Multiple Regression :- Multiple regression is an extension of simple linear regression. It is used when
we want to predict the value of a variable based on the value of two or more other variables. The
variable we want to predict is called the dependent variable.
The general mathematical equation for multiple regression with two predictors is – x1 = a0 + a1x2 + a2x3
Following is the description of the parameters used −
x1 is the response variable.
a0, a1, a2, ... are the coefficients.
x2, x3, ... are the predictor variables.
The normal equations for estimating a0, a1 and a2 are:
Σx1 = n a0 + a1 Σx2 + a2 Σx3
Σx1x2 = a0 Σx2 + a1 Σx2² + a2 Σx2x3
Σx1x3 = a0 Σx3 + a1 Σx2x3 + a2 Σx3²
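The normal equations can be solved directly in matrix form. The sketch below is an illustration on the built-in mtcars data, taking x1 = mpg, x2 = hp and x3 = wt (our choice of columns), and compares the result with lm():

```r
y <- mtcars$mpg                          # response x1
X <- cbind(1, mtcars$hp, mtcars$wt)      # column of 1s carries the intercept a0

# crossprod(X) holds n, sum(x2), sum(x3), sum(x2^2), sum(x2*x3), sum(x3^2);
# crossprod(X, y) holds sum(x1), sum(x1*x2), sum(x1*x3)
a <- solve(crossprod(X), crossprod(X, y))

fit <- lm(mpg ~ hp + wt, data = mtcars)
print(cbind(normal_eq = drop(a), lm = coef(fit)))
```

lm() uses a numerically more stable method internally, but for well-conditioned data both routes give the same coefficients.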
We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given set of
predictor variables using these coefficients.
lm() Function :- This function creates the relationship model between the predictors and the response
variable.
Syntax :- lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
formula is an expression describing the relation between the response variable and the predictor
variables.
data is the data frame containing those variables.
Example
> lm(mpg~disp+hp+wt,data=mtcars)
Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
Logistic Regression : Logistic regression is a regression model in which the response variable
(dependent variable) has categorical values such as True/False or 0/1. It models the
probability of a binary response as a function of the predictor variables through the
equation below.
The general mathematical equation for logistic regression is −
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Following is the description of the parameters used −
y is the response variable (the predicted probability).
x1, x2, x3, ... are the predictor variables.
a and b1, b2, b3, ... are the coefficients, which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax :- glm(formula,data,family)
Following is the description of the parameters used −
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object specifying the details of the model. Its value is binomial for logistic
regression.
For example, in the built-in data set mtcars, the data column am represents the transmission type of
the automobile model (0 = automatic, 1 = manual). With the logistic regression equation, we can model
the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.
> am.glm = glm(formula=am ~ hp + wt, data=mtcars, family=binomial)
> am.glm
Call: glm(formula = am ~ hp + wt, family = binomial, data = mtcars)
Coefficients:
(Intercept) hp wt
18.86630 0.03626 -8.08348
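The fitted model can estimate the probability of a manual transmission for a given car. In this sketch the horsepower of 120 and weight of 2.8 (in 1000 lbs) are hypothetical inputs chosen for illustration:

```r
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Hypothetical car: 120 hp, weight 2.8 (1000 lbs)
newdata <- data.frame(hp = 120, wt = 2.8)

# type = "response" returns a probability instead of the log-odds
p_manual <- predict(am.glm, newdata, type = "response")
print(p_manual)
```

The result is the logistic equation evaluated at the fitted coefficients, roughly a 64% chance of a manual transmission for these inputs.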
Poisson Regression:- Poisson regression involves regression models in which the response variable is
in the form of counts and not fractional numbers, for example the number of births or the number
of wins in a football match series. The values of the response variable are assumed to follow a Poisson distribution.
The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
y is the response variable.
a and b1, ..., bn are the numeric coefficients.
x1, ..., xn are the predictor variables.
The function used to create the Poisson regression model is the glm() function.
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension
(low, medium or high) on the number of warp breaks per loom. Let's consider "breaks" as the response
variable which is a count of number of breaks. The wool "type" and "tension" are taken as predictor
variables.
> output <-glm(formula = breaks ~ wool+tension,data = warpbreaks,
+ family = poisson)
> print(summary(output))
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06
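Because the model is linear in log(y), exponentiating a coefficient gives a multiplicative effect on the expected count. A minimal sketch of that interpretation (the name rate_ratios is our own):

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# exp(coefficient) = ratio of expected break counts relative to the baseline level
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))
```

For example, the woolB value of about 0.81 means wool B is expected to have roughly 19% fewer breaks than wool A at the same tension.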
CURVE FITTING:- Curve fitting is the process of constructing a curve, or mathematical function, that
has the best fit to a series of data points, possibly subject to constraints.
For the power curve y = a x^b, taking logarithms gives Y = A + bX, and the normal equations are
ΣY = nA + b ΣX
ΣXY = A ΣX + b ΣX²
where
Y = log y, X = log x and A = log a
x     x²     y      xy
4     16     6.3    100.8
Σx = 10   Σx² = 30   Σy = 16.9   Σxy = 156.3
(The last row's entry 100.8 equals x²y = 16 × 6.3; the product xy for that row is 4 × 6.3 = 25.2.)
Substituting the printed totals into Σy = na + bΣx and Σxy = aΣx + bΣx², we get
5a+10b = 16.9.........................(1)
10a+30b = 156.3 ...................(2)
Solving (1) and (2), we get
Multiply eq (1) by 2
10a+20b = 33.8....................... (3)
Subtract (2) from (3)
10a + 20b = 33.8
10a + 30b = 156.3
0 − 10b = −122.5
Therefore b = 12.25; substituting in (1) gives a = (16.9 − 122.5)/5 = −21.12.
Thus the equation of the straight line is y = a + bx
y = −21.12 + 12.25x
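Solution:-
The polynomial (second-degree) curve is y = a + bx + cx²
The three normal equations are
Σy = na + bΣx + cΣx²
Σxy = aΣx + bΣx² + cΣx³
Σx²y = aΣx² + bΣx³ + cΣx⁴
x    x²    x³     x⁴     y     xy    x²y
1    1     1      1      10    10    10
2    4     8      16     12    24    48
3    9     27     81     8     24    72
4    16    64     256    10    40    160
5    25    125    625    14    70    350
Σx = 15   Σx² = 55   Σx³ = 225   Σx⁴ = 979   Σy = 54   Σxy = 168   Σx²y = 640
Substituting the values, we get
5a+15b+55c = 54........................... (1)
15a+55b+225c = 168................... (2)
55a+225b+979c = 640 ................ (3)
Solving (1) and (2), we get
Multiply eq (1) by 3 and subtract (2)
15a+45b+165c = 162
(−)15a+55b+225c = 168
0 − 10b − 60c = −6 ........................ (4)
Multiply eq (1) by 11 and subtract (3)
55a+165b+605c = 594
(−)55a+225b+979c = 640
0 − 60b − 374c = −46 ........................ (5)
Solving (4) and (5) gives c = 0.714 and b = −3.686; substituting in (1) gives a = 14.
Thus the fitted curve is y = 14 − 3.686x + 0.714x²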
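The second-degree fit can be checked in R. In this sketch the sums are recomputed from the x and y columns of the table rather than copied from the totals row, and the normal equations are solved both directly and through lm() (variable names are our own):

```r
x <- 1:5
y <- c(10, 12, 8, 10, 14)

# Coefficient matrix and right-hand side of the three normal equations
M <- rbind(c(length(x), sum(x),   sum(x^2)),
           c(sum(x),    sum(x^2), sum(x^3)),
           c(sum(x^2),  sum(x^3), sum(x^4)))
rhs <- c(sum(y), sum(x * y), sum(x^2 * y))

abc <- solve(M, rhs)                 # a, b, c
fit <- lm(y ~ x + I(x^2))            # same fit by least squares

print(rhs)                           # 54 168 640
print(round(abc, 4))                 # 14.0000 -3.6857  0.7143
```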
ΣY = nA + b Σx
ΣxY = A Σx + b Σx²
(logarithms to base 10)
x    y       Y = log y    xY      x²
0    1.05    0.021        0       0
1    2.1     0.322        0.32    1
2    3.85    0.585        1.17    4
3    8.3     0.919        2.76    9
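The notes do not state the form of the curve being fitted here. As a hedged sketch, the code below fits the straight line Y = A + b x to Y = log10(y), which is what the two normal equations above describe, and back-transforms A; the variable names are our own.

```r
x <- 0:3
y <- c(1.05, 2.1, 3.85, 8.3)
Y <- log10(y)                 # base-10 logarithms, matching the table values

fit <- lm(Y ~ x)              # least squares line Y = A + b x
A <- coef(fit)[[1]]
b <- coef(fit)[[2]]
a <- 10^A                     # back-transform the intercept

print(round(c(A = A, b = b, a = a), 4))
```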
Survival analysis: Survival analysis is generally defined as a set of methods for analyzing data where
the outcome variable is the time until the occurrence of an event of interest. The event can be death,
occurrence of a disease, marriage, divorce, etc.
In survival analysis, there is a special structure for right-censored survival data. To use this, one first
must load the “survival” package, which is included in the main R distribution,
library(survival)
The basic syntax for creating survival analysis in R is −
Surv(time,event)
survfit(formula)
Following is the description of the parameters used −
time is the follow up time until the event occurs.
event indicates the status of occurrence of the expected event.
formula is a model formula with a Surv object on the left-hand side and the predictor variables on the right (use ~ 1 when there are no predictors).
Next, define the survival times “tt” and the censoring indicator “status”, where “status = 1” indicates
that the time is an observed event, and “status = 0” indicates that it is censored. Then the “Surv”
function binds them into a single object. In the following example, time 6 is right censored, while the
others are observed event times,
> tt <- c(2, 5, 6, 7, 8)
> status <- c( 1, 1, 0, 1, 1)
> Surv(tt, status) # Create a survival data structure
[1] 2 5 6+ 7 8
Example:-
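A minimal sketch of a survival fit on the small data set defined above: survfit with the formula ~ 1 gives the Kaplan-Meier estimate of the survival curve for a single group (the object name km is our own).

```r
library(survival)

tt     <- c(2, 5, 6, 7, 8)
status <- c(1, 1, 0, 1, 1)    # 0 marks the right-censored time 6

# Kaplan-Meier estimate with no predictors
km <- survfit(Surv(tt, status) ~ 1)
print(summary(km))

plot(km, xlab = "time", ylab = "survival probability")
```

The estimate drops only at observed event times; the censored time 6 reduces the number at risk without lowering the curve.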
Nonlinear Models
Decision trees:- A decision tree is a graph that represents choices and their results in the form of a tree. The
nodes in the graph represent an event or choice and the edges of the graph represent the decision rules
or conditions. It is widely used in machine learning and data mining applications in R.
Examples:
• Predicting whether an email is spam or not spam,
• Predicting whether a tumor is cancerous,
• Predicting whether a loan is a good or bad credit risk, based on the factors in each of these.
The package "party" has the function ctree() which is used to create and analyze a decision tree.
Syntax : ctree(formula, data)
Following is the description of the parameters used −
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Example:
library(party)
mydata <- iris   # assumption: mydata is the built-in iris data set, whose column names match the formula
model2<-ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=mydata)
plot(model2)
Random Forests:- In the random forest approach, a large number of decision trees are created. Every
observation is fed into every decision tree. The most common outcome for each observation is used as
the final output.
The package "randomForest" has the function randomForest() which is used to create and analyze
random forests.
Syntax :- randomForest(formula, data)
Following is the description of the parameters used −
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
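A hedged sketch of a random forest on the built-in iris data. It assumes the randomForest package is installed and skips the fit otherwise; the choice of data set, the seed and ntree = 200 are our own.

```r
# Assumes the randomForest package is installed; iris is a built-in data set
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)                                   # trees are grown from random samples
  rf <- randomForest::randomForest(Species ~ ., data = iris, ntree = 200)
  print(rf)                                     # includes the out-of-bag error estimate
  print(randomForest::importance(rf))           # variable importance scores
}
```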
Advantages
• It is often highly accurate: for many data sets it produces a strong classifier.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of which variables are important in the classification.
• It generates an internal estimate of the generalization error (the out-of-bag error) as the forest
building progresses.
• It has a method for estimating missing data and can maintain accuracy when a large
proportion of the data are missing.
Disadvantages
• Random forests have been observed to overfit for some data sets with noisy
classification/regression tasks.
• For data including categorical variables with different numbers of levels, random forests are
biased in favor of those attributes with more levels. Therefore, the variable importance
scores from random forests are not reliable for this type of data.
Splines: A linear spline is a continuous function formed by connecting linear segments. The points
where the segments connect are called the knots of the spline.
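A minimal sketch of a linear spline in base R: approxfun joins the knots with straight segments and returns the resulting continuous function. The knot positions and values are hypothetical, chosen for illustration.

```r
knots_x <- c(0, 1, 2, 4)      # knot locations (hypothetical)
knots_y <- c(0, 2, 1, 3)      # spline values at the knots (hypothetical)

# Linear interpolation between the knots defines the spline
lin_spline <- approxfun(knots_x, knots_y)

vals <- lin_spline(c(0.5, 1.5, 3))
print(vals)                   # 1.0 1.5 2.0

curve(lin_spline(x), from = 0, to = 4, xlab = "x", ylab = "spline(x)")
points(knots_x, knots_y, pch = 16)
```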