Unit3 R
UNIT-3
Statistics and Probability
Mean, Median, Mode
Mean: The mean is calculated by summing the values in a data series and
dividing by the number of values.
Syntax:
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
x is the input vector.
trim is the fraction (0 to 0.5) of observations to drop from each end of the
sorted vector before the mean is computed.
na.rm is a logical value; if TRUE, missing values are removed from the input
vector before the calculation.
Example:
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
o/p:
[1] 8.22
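The trim and na.rm parameters described above can be tried on the same vector. The following is a minimal sketch; the names trimmed_mean and x_with_na are illustrative and not part of the original example.

```r
# Same vector as in the example above.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# trim = 0.1 drops 10% of the sorted values (here one value) from each end,
# so the extremes -21 and 54 are ignored.
trimmed_mean <- mean(x, trim = 0.1)
print(trimmed_mean)

# A missing value makes the plain mean NA unless na.rm = TRUE is given.
x_with_na <- c(x, NA)
print(mean(x_with_na))
print(mean(x_with_na, na.rm = TRUE))
```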
Median: The middle value of a sorted data series is called the median; when the
number of values is even, it is the average of the two middle values. The
median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
x is the input vector.
na.rm is a logical value; if TRUE, missing values are removed from the input
vector before the calculation.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
o/p:
[1] 5.6
BCA V R
Mode: The mode is the value that has the highest number of occurrences in a set
of data. Unlike the mean and median, the mode can be found for both numeric and
character data.
R does not have a standard built-in function to calculate the mode.
Finding a mode is perhaps most easily achieved by using R’s table function,
which gives you the frequencies you need.
Example:
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> xtab <- table(xdata)
R> xtab
xdata
  2 2.2   3   4 4.4
  3   1   2   1   1
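The mode can then be read off the frequency table by picking the entry with the largest count. A minimal sketch, assuming a single most frequent value; the names mode_entries and mode_value are illustrative.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
xtab <- table(xdata)

# Keep the table entries whose count equals the largest count.
mode_entries <- xtab[xtab == max(xtab)]

# The names of a table are character strings, so convert back to numeric.
mode_value <- as.numeric(names(mode_entries))
print(mode_value)
```

If several values share the highest count, mode_value contains all of them.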
The min and max functions will report the smallest and largest values, with range
returning both in a vector of length 2.
R> min(xdata)
[1] 2
R> max(xdata)
[1] 4.4
R> range(xdata)
[1] 2.0 4.4
tapply() function
The tapply() function computes a statistical measure (mean, median, min, max,
etc.) or a self-written function for each group of values in a vector, where
the groups are defined by the levels of a factor.
Syntax: tapply(X, INDEX, FUN)
X: the input vector or object.
INDEX: the factor vector that splits the data into groups.
FUN: the function that is applied to each group.
The following call computes the proportion of observations in each feed group
of the built-in chickwts data set:
tapply(chickwts$weight,INDEX=chickwts$feed,FUN=function(x) length(x) /
nrow(chickwts) )
casein horsebean linseed meatmeal soybean sunflower
0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141
The round function rounds numeric output to a given number of decimal places.
R> round(table(chickwts$feed)/nrow(chickwts),digits=3)
casein horsebean linseed meatmeal soybean sunflower
0.169 0.141 0.169 0.155 0.197 0.169
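tapply() works the same way with a built-in function such as mean. The sketch below computes the mean chick weight per feed type; the name mean_weight is illustrative.

```r
# chickwts ships with R (datasets package).
mean_weight <- tapply(chickwts$weight, INDEX = chickwts$feed, FUN = mean)
print(round(mean_weight, digits = 1))

# The result has one entry per level of the factor.
print(names(mean_weight))
```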
Quantiles, Percentiles, and the Five-Number Summary: A quantile is a value
computed from a collection of numeric measurements that indicates an
observation’s rank when compared to all the other present observations. For
example, the median is itself a quantile: it gives you a value below which half
of the measurements lie, so it is the 0.5th quantile. Alternatively, quantiles
can be expressed as percentiles; this is identical but on a “percent scale” of
0 to 100.
quantile function:
Syntax: quantile(x, probs)
x: Data set
probs: numeric vector of probabilities between 0 and 1
Example:
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> quantile(xdata,probs=0.8)
80%
3.6
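Because probs accepts a vector, the five-number summary (minimum, quartiles, maximum) can be requested in one call. A minimal sketch; the name five_num is illustrative.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

# 0 and 1 give the minimum and maximum; 0.25, 0.5, 0.75 give the quartiles.
five_num <- quantile(xdata, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)
```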
Summary Function: The summary function reports the minimum, first quartile,
median, mean, third quartile, and maximum in a single call.
R> summary(xdata)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.600 2.825 3.250 4.400
Variance: The variance is a particular representation of the average squared
distance of each observation when compared to the mean.
The standard deviation is simply the square root of the variance.
The interquartile range (IQR) measures the width of the “middle 50 percent” of
the data, that is, the distance between the first quartile (25th percentile)
and the third quartile (75th percentile).
The direct R commands for computing these measures of spread are
var (variance), sd (standard deviation), and IQR (interquartile range).
R> var(xdata)
[1] 0.9078571
R> sd(xdata)
[1] 0.9528154
R> IQR(xdata)
[1] 1.25
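The three measures can also be computed by hand, which shows what var, sd, and IQR do internally. A minimal sketch; note that R's var divides by n - 1 (the sample variance). The manual_* names are illustrative.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
n <- length(xdata)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((xdata - mean(xdata))^2) / (n - 1)

# Standard deviation: square root of the variance.
manual_sd <- sqrt(manual_var)

# IQR: third quartile minus first quartile.
manual_iqr <- unname(quantile(xdata, 0.75) - quantile(xdata, 0.25))

print(c(manual_var, manual_sd, manual_iqr))
```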
Covariance and Correlation
The covariance expresses how much two numeric variables “change
together” and the nature of that relationship, whether it is positive or
negative.
Correlation allows you to interpret the covariance further by identifying
both the direction and the strength of any association.
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> ydata <- c(1,4.4,1,3,2,2.2,2,7)
R> cov(xdata,ydata)
[1] 1.479286
R> cor(xdata,ydata)
[1] 0.7713962
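Pearson's correlation is the covariance divided by the product of the two standard deviations, which is why it always lies between -1 and 1. A minimal sketch of that relationship; the name manual_cor is illustrative.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)

# Scale the covariance by both standard deviations.
manual_cor <- cov(xdata, ydata) / (sd(xdata) * sd(ydata))

print(manual_cor)
print(cor(xdata, ydata))
```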
BASIC DATA VISUALIZATION
R Visualization Packages
1) plotly
The plotly package provides interactive, high-quality graphs. This package
is built on top of the JavaScript library plotly.js.
2) ggplot2
The ggplot2 package lets us create graphics declaratively. This package is
known for its elegant, high-quality graphs, which set it apart from other
visualization packages.
3) tidyquant
tidyquant is a financial package that is used for carrying out quantitative
financial analysis. It fits into the tidyverse and is used for importing,
analyzing, and visualizing financial data.
4) taucharts
Data plays an important role in taucharts. The library provides a declarative
interface for rapid mapping of data fields to visual properties.
5) ggiraph
It is a tool that allows us to create dynamic ggplot graphs. This package allows
us to add tooltips, JavaScript actions, and animations to the graphics.
6) geofacet
This package provides geofaceting functionality for 'ggplot2'. Geofaceting
arranges a sequence of plots for different geographical entities into a grid that
preserves some of the geographical orientation.
7) googleVis
googleVis provides an interface between R and Google's charts tools. With the
help of this package, we can create web pages with interactive charts based on R
data frames.
8) RColorBrewer
This package provides color schemes for maps and other graphics, which are
designed by Cynthia Brewer.
9) dygraphs
The dygraphs package is an R interface to the dygraphs JavaScript charting
library. It provides rich features for charting time-series data in R.
10) shiny
The shiny package lets us develop interactive and aesthetically pleasing web
apps directly from R. These apps can be extended with HTML widgets, CSS, and
JavaScript.
barplot(): R uses the barplot() function to create bar charts. Both vertical
and horizontal bars can be drawn.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)
Parameters:
H: This parameter is a vector or matrix containing numeric values which are used
in bar chart.
xlab: This parameter is the label for x axis in bar chart.
ylab: This parameter is the label for y axis in bar chart.
main: This parameter is the title of the bar chart.
names.arg: This parameter is a vector of names appearing under each bar in bar
chart.
col: This parameter is used to give colors to the bars in the graph.
Example:
# Create the data for the chart
A <- c(17, 32, 8, 53, 1)
# Plot the bar chart
barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")
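The names.arg and col parameters, and a horizontal layout, can be sketched as follows. The labels in bar_names and the color are hypothetical choices for illustration; horiz = TRUE is the barplot() argument that draws horizontal bars.

```r
A <- c(17, 32, 8, 53, 1)

# Hypothetical labels, one per bar.
bar_names <- c("a", "b", "c", "d", "e")

# barplot() returns the midpoint position of each bar.
bar_positions <- barplot(A, names.arg = bar_names, col = "steelblue",
                         xlab = "Value", ylab = "Category",
                         main = "Horizontal Bar-Chart", horiz = TRUE)
print(bar_positions)
```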
R Histogram
A histogram is a type of bar chart that shows how many values fall into each of
a set of value ranges (bins). A histogram is used to show the distribution of a
numeric variable, whereas a bar chart is used for comparing different entities.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to specify the break points between bars, or a suggested number
of bars.
Example
# Creating data for the graph.
v <- c(12,24,16,38,21,13,55,17,39,10,60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Create the histogram.
hist(v)
# Save the file.
dev.off()
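To see what breaks does, the bins can be computed without drawing anything. This is a minimal sketch assuming explicit break points every 10 units; plot = FALSE makes hist() return the bin counts instead of plotting.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Explicit break points: bins are 10-20, 20-30, ..., 50-60.
h <- hist(v, breaks = seq(10, 60, by = 10), plot = FALSE)

print(h$breaks)  # the bin boundaries
print(h$counts)  # how many values fall into each bin
```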
R Pie Charts
A pie chart is a representation of values in the form of slices of a circle with
different colors. Pie charts are created with the help of the pie() function.
Syntax:
pie(x, labels, radius, main, col, clockwise)
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (a value between −1
and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or
anti clockwise.
Example:
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the pie chart.
pie(x, labels)
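Slice labels are often more useful when they include percentages. A minimal sketch, assuming rounded percentages are acceptable; the names pct and pct_labels and the chart title are illustrative.

```r
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Share of each slice as a rounded percentage of the total.
pct <- round(100 * x / sum(x))
pct_labels <- paste0(labels, " (", pct, "%)")

pie(x, labels = pct_labels, main = "City pie chart",
    col = rainbow(length(x)))
```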
R - Line Chart
A line chart is a graph that connects a series of points by drawing line segments
between them. The plot() function in R is used to create the line graph.
Syntax
plot(v,type,col,xlab,ylab,main)
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines
and "o" to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example:
v <- c(7,12,28,3,41)
# Plot the line chart.
plot(v,type = "o")
R – Boxplots
A boxplot shows how the values in a data set are distributed. It divides the
data set into quartiles and displays the minimum, first quartile, median,
third quartile, and maximum of the data set.
Syntax
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of each box
proportionate to the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph
Example
data <- data.frame(Group_A=c(25,28,30,32,35,37,38,39,40,41,42),
                   Group_B=c(22,24,26,29,31,33,36,37,38,40,43))
boxplot(data, main="Boxplot of Group A and B", xlab="Groups", ylab="values",
        col=c("lightblue","lightgreen"), border="black")
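The five values that a boxplot draws can be inspected directly. A minimal sketch using the Group_A values; fivenum() returns the minimum, lower hinge, median, upper hinge, and maximum, and boxplot(..., plot = FALSE) returns the same statistics when there are no outliers. The names group_a and box_stats are illustrative.

```r
group_a <- c(25, 28, 30, 32, 35, 37, 38, 39, 40, 41, 42)

# Minimum, lower hinge, median, upper hinge, maximum.
print(fivenum(group_a))

# The statistics boxplot() would draw, without producing a plot.
box_stats <- boxplot(group_a, plot = FALSE)$stats
print(as.vector(box_stats))
```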
R – Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point
represents the values of two variables. One variable is chosen in the horizontal
axis and another in the vertical axis.
Syntax
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example:
x <- 1:10
y <- c(2,4,5,7,8,10,11,13,14,16)
plot(x, y, main="Scatterplot example", xlab="X-axis", ylab="Y-axis",
     col="blue", pch=16, xlim=c(0,11), ylim=c(0,17))
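Normal Distribution
dnorm()
dnorm() function returns the value of the probability density function of the
normal distribution at x.
Syntax:
dnorm(x, mean, sd)
where,
– x is a vector of values at which the density is evaluated
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1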
Example:
# creating a sequence of values
# between -15 to 15 with a difference of 0.1
x = seq(-15, 15, by=0.1)
# density at each value, using the mean and sd of x
y = dnorm(x, mean(x), sd(x))
plot(x, y)
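dnorm() evaluates the normal density formula, which can be checked by writing the formula out by hand. A minimal sketch; normal_density is a hypothetical helper written only for this comparison, and the mean and sd values are illustrative.

```r
# Hand-written normal density: exp(-(x - mean)^2 / (2 sd^2)) / (sd * sqrt(2 pi)).
normal_density <- function(x, mean = 0, sd = 1) {
  exp(-(x - mean)^2 / (2 * sd^2)) / (sd * sqrt(2 * pi))
}

x <- c(-1, 0, 2.5)
print(dnorm(x, mean = 1, sd = 2))
print(normal_density(x, mean = 1, sd = 2))
```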
pnorm()
pnorm() function is the cumulative distribution function, which gives the
probability that a random variable X takes a value less than or equal to x,
i.e., F(x) = P(X ≤ x).
Syntax:
pnorm(x, mean, sd, lower.tail)
– x is a vector of values (quantiles)
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
– lower.tail is a logical value indicating whether to compute the lower tail
probability P(X ≤ x); if FALSE, P(X > x) is returned. Its default value is TRUE
Example
# creating a sequence of values
# between -10 to 10 with a difference of 0.1
x <- seq(-10, 10, by = 0.1)
plot(x, pnorm(x))
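A few single values make the meaning of pnorm() and lower.tail concrete. A minimal sketch for the standard normal distribution; the variable names are illustrative.

```r
# P(X <= 1.96) for a standard normal variable.
lower <- pnorm(1.96)

# P(X > 1.96): the upper tail is the complement of the lower tail.
upper <- pnorm(1.96, lower.tail = FALSE)

# Probability of falling within one standard deviation of the mean.
within_one_sd <- pnorm(1) - pnorm(-1)

print(c(lower, upper, within_one_sd))
```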
qnorm()
qnorm() function is the inverse of the pnorm() function. It takes a probability
value and returns the value x whose cumulative probability equals it. It is
useful in finding the percentiles of a normal distribution.
Syntax:
qnorm(p, mean, sd)
– p is a vector of probabilities
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
Example:
# Create a sequence of probability values
# incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
y <- qnorm(x, mean(x), sd(x))
plot(x, y)
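That qnorm() undoes pnorm() can be checked on a few probabilities. A minimal sketch for the standard normal distribution; the names p and z are illustrative.

```r
p <- c(0.025, 0.5, 0.975)

# Quantiles of the standard normal distribution for these probabilities.
z <- qnorm(p)
print(z)

# Feeding the quantiles back into pnorm() recovers the probabilities.
print(pnorm(z))
```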
rnorm()
rnorm() function in R programming is used to generate a vector of random
numbers which are normally distributed.
Syntax:
rnorm(n, mean, sd)
– n is the number of random values to generate
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
Example
# Create a vector of 10000 random numbers
# with mean=90 and sd=5
x <- rnorm(10000, mean=90, sd=5)
# Create the histogram with about 50 bars
hist(x, breaks=50)
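Random draws differ on every run unless a seed is fixed. The sketch below fixes an arbitrary seed and checks that the sample mean and standard deviation land close to the requested values.

```r
# Fix the seed so the same numbers are generated on every run.
set.seed(1)
x <- rnorm(10000, mean = 90, sd = 5)

# With 10000 draws these should be close to 90 and 5.
print(mean(x))
print(sd(x))
```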
Poisson Distribution
The Poisson distribution gives the probability of a given number of events
occurring in a fixed interval of space or time, if these events occur with a
known constant mean rate and independently of the time since the last event.
The probability mass function of the Poisson distribution is:
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
Where:
X is a random variable following a Poisson distribution
k is the number of times an event occurs
P(X = k) is the probability that an event will occur k times
e is Euler’s number (approximately 2.718)
λ is the average number of times an event occurs
! is the factorial function
R provides the following functions for working with the Poisson distribution:
• dpois
• ppois
• qpois
• rpois
dpois()
The function dpois() calculates the probability that a Poisson random variable
takes exactly a given value, P(X = k).
Syntax:
dpois(k, lambda, log)
where,
k: number of events that occurred in an interval
lambda: mean number of events per interval
log: if TRUE, the function returns the log of the probability
Example:
dpois(2, 3)
dpois(6, 6)
Output:
[1] 0.2240418
[1] 0.1606231
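The values returned by dpois() follow directly from the Poisson probability mass function, lambda^k * exp(-lambda) / k!. A minimal sketch; poisson_pmf is a hypothetical helper written only for this comparison.

```r
# Hand-written Poisson probability mass function.
poisson_pmf <- function(k, lambda) {
  lambda^k * exp(-lambda) / factorial(k)
}

print(poisson_pmf(2, 3))
print(dpois(2, 3))
```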
ppois()
The function ppois() calculates the probability that a Poisson random variable
is less than or equal to a given number, P(X ≤ q).
Syntax:
ppois(q, lambda, lower.tail, log.p)
where,
q: number of events that occurred in an interval
lambda: mean number of events per interval
lower.tail: if TRUE (default), the left tail P(X ≤ q) is returned; if FALSE,
the right tail P(X > q) is returned
log.p: if TRUE, the function returns the log of the probability
Example:
ppois(2, 3)
ppois(6, 6)
Output:
[1] 0.4231901
[1] 0.6063028
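ppois() is the running sum of dpois(), and lower.tail = FALSE gives the complementary probability. A minimal sketch; the variable names are illustrative.

```r
# P(X <= 2) is P(X = 0) + P(X = 1) + P(X = 2).
cumulative <- sum(dpois(0:2, lambda = 3))
print(cumulative)
print(ppois(2, 3))

# P(X > 2) via the upper tail.
upper <- ppois(2, 3, lower.tail = FALSE)
print(upper)
```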
qpois()
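qpois() is the quantile function of the Poisson distribution: for a probability p it returns the smallest integer k with P(X ≤ k) ≥ p, which makes it the inverse of ppois(). A minimal sketch with illustrative argument values:

```r
# Smallest k whose cumulative probability reaches 0.5 when lambda = 3.
k <- qpois(0.5, lambda = 3)
print(k)

# Check against ppois(): at k the cumulative probability is at least 0.5,
# at k - 1 it is still below 0.5.
print(ppois(k, 3))
print(ppois(k - 1, 3))
```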
rpois()
The function rpois() is used for generating random numbers from a given
Poisson distribution.
Syntax:
rpois(n, lambda)
where,
n: number of random numbers needed
lambda: mean per interval
Example
rpois(2, 3)
rpois(6, 6)
Output (the values are random and differ from run to run):
[1] 2 3
[1] 6 7 6 10 9 4
Binomial Distribution
The binomial distribution model deals with finding the probability of success of
an event which has only two possible outcomes in a series of experiments. For
example, tossing a coin always gives a head or a tail. The probability of
finding exactly 3 heads when tossing a coin 10 times is computed using the
binomial distribution.
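The coin example above can be worked out with dbinom(), the binomial probability mass function in base R, and checked against the formula choose(n, k) * p^k * (1 - p)^(n - k). A minimal sketch; the variable names are illustrative.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin.
p_three_heads <- dbinom(3, size = 10, prob = 0.5)
print(p_three_heads)

# The same value from the binomial formula.
manual <- choose(10, 3) * 0.5^3 * (1 - 0.5)^7
print(manual)
```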
pbinom()
This function gives the cumulative probability of an event, that is, the
probability of at most a given number of successes. It is a single value
representing the probability.
Example
# Probability of getting 26 or fewer heads from 51 tosses of a coin.
x <- pbinom(26,51,0.5)
print(x)
When we execute the above code, it produces the following result –
[1] 0.610116
qbinom()
This function takes a probability value and gives the smallest number whose
cumulative probability is at least that value.
Example
# Number of heads whose cumulative probability reaches 0.25 when a coin
# is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
When we execute the above code, it produces the following result −
[1] 23
rbinom()
This function generates the required number of random values, each being the
number of successes in a given number of trials with a given probability of
success.
Example
# Find 8 random values, each from 150 trials with success probability 0.4.
x <- rbinom(8,150,.4)
print(x)
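Uniform Distribution
dunif()
dunif() gives the density of the uniform distribution on the interval from min
to max.
Syntax:
dunif(x, min = 0, max = 1, log = FALSE)
Parameter:
x: input sequence
min, max: range of values
log: logical; if TRUE, the densities are returned as logarithms.
The result is computed for each value of the input. Hence, a sequence will
be generated.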
Example 1:
# generating a sequence of values
x <- 5:10
print ("dunif value")
# min and max below are illustrative values
dunif(x, min = 1, max = 20)
punif()
punif() gives the cumulative probability P(X ≤ q) for a uniform distribution
on the interval from min to max.
Syntax:
punif(q, min = 0, max = 1, lower.tail = TRUE)
Example:
min <- 0
max <- 60
# calculating punif value
punif (15 , min =min , max = max)
Output
[1] 0.25
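For a value q between min and max, the uniform cumulative probability is simply (q - min) / (max - min), which explains the result 0.25 above. A minimal sketch; the names lower_limit, upper_limit, and manual are illustrative and avoid reusing the names of R's min() and max() functions.

```r
lower_limit <- 0
upper_limit <- 60
q <- 15

# Fraction of the interval that lies at or below q.
manual <- (q - lower_limit) / (upper_limit - lower_limit)
print(manual)
print(punif(q, min = lower_limit, max = upper_limit))
```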
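Example 2:
# Grid of X-axis values
x <- seq(-0.5, 1.5, 0.01)
# Plot the cumulative probabilities of the uniform distribution on [0, 1]
plot(x, punif(x), type = "l")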
qunif()
The qunif() function calculates the quantile corresponding to any probability
p for a given uniform distribution.
Syntax:
qunif(p, min = 0, max = 1)
Parameter:
p – The vector of probabilities
min, max – The limits of the uniform distribution
Example
min <- 0
max <- 1
# quantile for an illustrative probability of 0.2
qunif(0.2, min = min, max = max)
Bernoulli Distribution
dbern()
dbern() function from the Rlab package gives the probability mass function
(density) of the Bernoulli distribution.
Syntax: dbern(x, prob, log = FALSE)
Parameter:
x: vector of quantiles
prob: probability of success on each trial
log: logical; if TRUE, probabilities p are given as log(p)
Example:
# Importing the Rlab library
library(Rlab)
# probability of 0 and of 1 for an illustrative success probability of 0.7
dbern(c(0, 1), prob = 0.7)
pbern()
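pbern() function in R programming gives the cumulative distribution function of
the Bernoulli distribution.
Syntax: pbern(q, prob, lower.tail = TRUE, log.p = FALSE)
Parameter:
q: vector of quantiles
prob: probability of success on each trial
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise
P[X > x]
log.p: logical; if TRUE, probabilities p are given as log(p).
Example:
# import Rlab library
library(Rlab)
# cumulative probability at 0 and at 1 for an illustrative success probability of 0.7
pbern(c(0, 1), prob = 0.7)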
qbern()
qbern() gives the quantile function for the Bernoulli distribution. A quantile
function in statistical terms specifies the value of the random variable such that
the probability of the variable being less than or equal to that value equals the
given probability.
Syntax: qbern(p, prob, lower.tail = TRUE, log.p = FALSE)
Parameter:
p: vector of probabilities.
prob: probability of success on each trial.
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise
P[X > x]
log.p: logical; if TRUE, probabilities p are given as log(p).
Example:
# import Rlab library
library(Rlab)
# x values for the
# qbern( ) function
x <- seq(0, 1, by = 0.2)
# quantiles for an illustrative success probability of 0.5
qbern(x, prob = 0.5)
rbern()
rbern() function in R programming is used to generate a vector of random
numbers which are Bernoulli distributed.
Syntax: rbern(n, prob)
Parameter:
n: number of observations.
prob: probability of success on each trial.
Example:
# import Rlab library
library(Rlab)
set.seed(9999)
# sample size
N <- 100
# generate random variables using
# rbern() function (prob = 0.5 is an assumed value)
random_values <- rbern(N, prob = 0.5)
print(random_values)
# plot of randomly
# drawn density
hist(random_values,breaks = 10,main = "")
Output:
[1] 0 0 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
[41] 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0
[81] 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 1 0 1
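A Bernoulli variable is a binomial variable with a single trial, so similar draws can be produced without the Rlab package by calling base R's rbinom() with size = 1. A minimal sketch; prob = 0.5 and the name draws are illustrative, and the numbers generated need not match the output above.

```r
set.seed(9999)
n <- 100

# Each draw is one trial that yields 0 (failure) or 1 (success).
draws <- rbinom(n, size = 1, prob = 0.5)

# Count of zeros and ones, and the observed share of successes.
print(table(draws))
print(mean(draws))
```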
Student t Distribution
The t-distribution, also known as the Student's t-distribution, is a type of
probability distribution that is similar to the normal distribution with its bell
shape but has heavier tails. It is used for estimating population parameters for
small sample sizes or unknown variances.
dt() function in R is used to find the value of the probability density function
(pdf) of the Student’s t-distribution at a given value x.
Syntax: dt(x, df)
Parameters:
x is the quantiles vector
df is the degrees of freedom (the degrees of freedom determine the shape of the
distribution; as they increase, it approaches the normal distribution)
Example:
x_dt <- seq(- 10, 10, by = 0.01)
y_dt <- dt(x_dt, df = 3)
plot(y_dt)
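The pt() function is used to get the cumulative distribution function (CDF) of a
t-distribution.
Syntax: pt(q, df, lower.tail = TRUE)
Parameter: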
q is the quantiles vector
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X >
x].
Example:
x_pt <- seq(- 10, 10, by = 0.01) # Specify x-values for pt function
y_pt <- pt(x_pt, df = 3) # Apply pt function
plot(y_pt) # Plot pt values
The qt() function is used to get the quantile function, or inverse cumulative
distribution function, of a t-distribution.
Syntax: qt(p, df, lower.tail = TRUE)
Parameter:
p is the vector of probabilities
df is the degrees of freedom
lower.tail – if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X >
x].
Example:
x_qt <- seq(0, 1, by = 0.01) # Specify x-values for qt function
y_qt <- qt(x_qt, df = 3) # Apply qt function
plot(y_qt) # Plot qt values
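The heavier tails of the t-distribution show up in its quantiles: for the same probability, the t quantile lies further from zero than the normal quantile, and the gap shrinks as the degrees of freedom grow. A minimal sketch; the probability 0.975 and the df values are illustrative.

```r
p <- 0.975

# Quantiles for few and many degrees of freedom, and for the normal distribution.
t_small_df <- qt(p, df = 3)
t_large_df <- qt(p, df = 1000)
z <- qnorm(p)
print(c(t_small_df, t_large_df, z))

# pt() undoes qt(), just as pnorm() undoes qnorm().
print(pt(t_small_df, df = 3))
```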