Unit3 R
V Semester BCA
Prepared By
Sabnam Pradhan
Professor and Faculty of Computer Applications
UNIT-3
Content
Statistics and probability, basic data visualization, probability, common probability
distributions: common probability mass functions (Bernoulli, binomial, Poisson
distributions), common probability density functions (uniform, normal, Student’s t
distribution).
--------------------------------------------------------------------------------------------------------------
Statistics and Probability
Mean, Median, Mode
Mean: It is calculated by taking the sum of the values and dividing by the number of
values in a data series.
Syntax:
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
● x is the input vector.
● trim is the fraction (0 to 0.5) of observations dropped from each end of the sorted
vector before the mean is computed.
● na.rm is a logical value; if TRUE, missing values (NA) are removed from the input
vector before the calculation.
Example:
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
o/p:
[1] 8.22
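The trim and na.rm arguments can be tried on the same vector. The sketch below is only
illustrative; the vector y, with one missing value added, is an assumption made for this
example.

```r
# Same data as in the example above.
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# trim = 0.1 drops 10% of the values (here 1 value) from each end
# of the sorted vector before averaging.
trimmed.mean <- mean(x, trim = 0.1)
print(trimmed.mean)

# y is a hypothetical copy of x with one missing value added.
y <- c(x, NA)
print(mean(y))                # NA, because of the missing value
print(mean(y, na.rm = TRUE))  # the NA is removed before averaging
```

The trimmed mean is less affected by the extreme values 54 and -21 than the ordinary mean.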
Median:
The middle value of a sorted data series is called the median. When the number of values
is even, the median is the average of the two middle values.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
● x is the input vector.
● na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
o/p:
[1] 5.6
Mode:
The mode is the value that has the highest number of occurrences in a set of data.
Unlike the mean and median, the mode can be found for both numeric and character data.
R does not have a standard built-in function to calculate the statistical mode (the
built-in mode() function reports the storage mode of an object instead).
Finding a mode is perhaps most easily achieved by using R’s table function, which gives
you the frequencies you need.
Example:
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> xtab <- table(xdata)
R> xtab
xdata
  2 2.2   3   4 4.4
  3   1   2   1   1
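The mode can be read off the frequency table by keeping the entry with the largest count.
This is a minimal sketch; the variable names top and xmode are chosen here for
illustration.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
xtab <- table(xdata)

# Keep only the table entries whose count equals the largest count.
top <- xtab[xtab == max(xtab)]

# The names of a table are character strings, so convert them back to numbers.
xmode <- as.numeric(names(top))
print(xmode)
```

If several values share the highest count, xmode contains all of them.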
The min and max functions report the smallest and largest values, and range
returns both in a vector of length 2.
R> min(xdata)
[1] 2
R> max(xdata)
[1] 4.4
R> range(xdata)
[1] 2.0 4.4
tapply() function
The tapply() function computes a statistical measure (mean, median, min, max, etc.) or a
self-written function for each group of values in a vector, where the groups are defined
by the levels of a factor.
Syntax: tapply(X, INDEX, FUN)
● X: the input vector or object.
● INDEX: the factor vector that splits the data into groups.
● FUN: the function to be applied to each group.
tapply(chickwts$weight, INDEX=chickwts$feed, FUN=function(x) length(x)/nrow(chickwts))
   casein horsebean   linseed  meatmeal   soybean sunflower
0.1690141 0.1408451 0.1690141 0.1549296 0.1971831 0.1690141
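The same pattern works with any summary function. The sketch below uses a small
hypothetical data set (scores and group are made-up names and values) so that the result
is easy to verify by hand.

```r
# Hypothetical data: six scores belonging to three groups.
scores <- c(10, 20, 30, 40, 50, 60)
group  <- factor(c("a", "a", "b", "b", "c", "c"))

# Apply mean() separately to the scores of each group.
group.means <- tapply(scores, INDEX = group, FUN = mean)
print(group.means)
```

The result is a named vector with one mean per factor level.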
The round function rounds numeric output to a given number of decimal places.
R> round(table(chickwts$feed)/nrow(chickwts),digits=3)
casein horsebean linseed meatmeal soybean sunflower
0.169 0.141 0.169 0.155 0.197 0.169
Quantiles, Percentiles, and the Five-Number Summary
A quantile is a value computed from a collection of numeric measurements that indicates
an observation’s rank when compared to all the other observations.
For example, the median is itself a quantile: it gives a value below which half of the
measurements lie, so it is the 0.5th quantile. Alternatively, a quantile can be expressed
as a percentile, which is identical but on a “percent scale” of 0 to 100; the median is
the 50th percentile. The five-number summary consists of the minimum, first quartile,
median, third quartile, and maximum.
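quantile function:
Syntax: quantile(x, probs)
● x: the data set (a numeric vector)
● probs: the probabilities (between 0 and 1) of the quantiles to compute
Example:
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> quantile(xdata,probs=0.8)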
80%
3.6
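Several quantiles can be requested in one call by passing a vector to probs. The sketch
below computes the five-number summary this way; the name five.num is chosen only for
illustration.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

# Minimum, first quartile, median, third quartile, maximum.
five.num <- quantile(xdata, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five.num)
```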
Summary Function: The summary function reports the minimum, first quartile, median,
mean, third quartile, and maximum in a single call.
R> summary(xdata)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.600 2.825 3.250 4.400
Variance: The variance is the average squared distance of the observations from the
mean. R's var function divides by n − 1, which gives the sample variance.
The standard deviation is simply the square root of the variance.
The interquartile range (IQR) measures the width of the “middle 50 percent” of the data,
that is, the distance between the first quartile and the third quartile.
The direct R commands for computing these measures of spread are
var (variance), sd (standard deviation), and IQR (interquartile range).
R> var(xdata)
[1] 0.9078571
R> sd(xdata)
[1] 0.9528154
R> IQR(xdata)
[1] 1.25
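To see what these functions compute, the same measures can be worked out step by step.
This is only a sketch; the names ending in .manual are chosen here for illustration.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
n <- length(xdata)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
var.manual <- sum((xdata - mean(xdata))^2) / (n - 1)

# Standard deviation: square root of the variance.
sd.manual <- sqrt(var.manual)

# IQR: third quartile minus first quartile.
q <- quantile(xdata, probs = c(0.25, 0.75))
iqr.manual <- unname(q[2] - q[1])

print(c(var.manual, sd.manual, iqr.manual))
```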
Covariance and Correlation
● The covariance expresses how much two numeric variables “change together” and
the nature of that relationship, whether it is positive or negative.
● Correlation allows you to interpret the covariance further by identifying both the
direction and the strength of any association.
R> xdata <- c(2,4.4,3,3,2,2.2,2,4)
R> ydata <- c(1,4.4,1,3,2,2.2,2,7)
R> cov(xdata,ydata)
[1] 1.479286
R> cor(xdata,ydata)
[1] 0.7713962
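The correlation is the covariance rescaled by the two standard deviations, which is why
it always lies between -1 and +1. A minimal check of this relationship (cor.manual is a
name chosen for illustration):

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)

# Pearson correlation = covariance / (sd of x * sd of y).
cor.manual <- cov(xdata, ydata) / (sd(xdata) * sd(ydata))
print(cor.manual)
```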
BASIC DATA VISUALIZATION
R Visualization Packages
1) plotly
The plotly package provides interactive, high-quality graphs. It is built on the
JavaScript library plotly.js.
2) ggplot2
The ggplot2 package lets us create graphics declaratively. It is well known for its
elegant, high-quality graphs, which set it apart from other visualization packages.
3) tidyquant
tidyquant is a package for quantitative financial analysis. It works within the
tidyverse and is used for importing, analyzing, and visualizing financial data.
4) taucharts
Data plays an important role in taucharts. The library provides a declarative interface for
rapid mapping of data fields to visual properties.
5) ggiraph
It is a tool that allows us to create dynamic ggplot graphs. This package allows us to add
tooltips, JavaScript actions, and animations to the graphics.
6) geofacet
This package provides geofaceting functionality for 'ggplot2'. Geofaceting arranges a
sequence of plots for different geographical entities into a grid that preserves some of the
geographical orientation.
7) googleVis
googleVis provides an interface between R and Google's charts tools. With the help of
this package, we can create web pages with interactive charts based on R data frames.
8) RColorBrewer
This package provides color schemes for maps and other graphics, which are designed
by Cynthia Brewer.
9) dygraphs
The dygraphs package is an R interface to the dygraphs JavaScript charting library. It
provides rich features for charting time-series data in R.
10) shiny
The shiny package lets us develop interactive and aesthetically pleasing web apps
directly from R. Apps can be extended with HTML widgets, CSS, and JavaScript.
barplot(): R uses the barplot() function to create bar charts. Both vertical and
horizontal bars can be drawn.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)
Parameters:
H: This parameter is a vector or matrix containing the numeric values used in the bar
chart.
xlab: This parameter is the label for the x axis of the bar chart.
ylab: This parameter is the label for the y axis of the bar chart.
main: This parameter is the title of the bar chart.
names.arg: This parameter is a vector of names appearing under each bar in the bar chart.
col: This parameter is used to give colors to the bars in the graph.
Example:
# Create the data for the chart
A <- c(17, 32, 8, 53, 1)
# Plot the bar chart
barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")
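The names.arg and col parameters listed above are not used in the basic example. A short
sketch of how they might be added follows; the bar labels and the color are assumptions
made for illustration.

```r
A <- c(17, 32, 8, 53, 1)

# Hypothetical labels, one for each bar.
bar.names <- c("P", "Q", "R", "S", "T")

# barplot() draws the chart and returns the x positions of the bar centers.
mids <- barplot(A, names.arg = bar.names, col = "steelblue",
                xlab = "X-axis", ylab = "Y-axis", main = "Bar-Chart")
print(mids)
```

The returned positions are useful when text or points need to be added above the bars.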
Creating a Horizontal Bar Chart in R
To create a horizontal bar chart:
● Take all the parameters required to make a simple bar chart.
● Add the parameter horiz = TRUE to make the bars horizontal.
barplot(A, horiz=TRUE)
Example:
barplot(A, horiz = TRUE, xlab = "X-axis", ylab = "Y-axis", main = "Horizontal Bar Chart")
R Histogram
A histogram is a type of bar chart that shows how many values fall into each of a set of
consecutive ranges (bins). A histogram shows the distribution of a single numeric
variable, whereas a bar chart is used for comparing different entities.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
● v is a vector containing numeric values used in histogram.
● main indicates title of the chart.
● col is used to set color of the bars.
● border is used to set border color of each bar.
● xlab is used to give description of x-axis.
● xlim is used to specify the range of values on the x-axis.
● ylim is used to specify the range of values on the y-axis.
● breaks controls the bins: either a suggested number of bars or a vector of break
points that sets the width of each bar.
Example
# Creating data for the graph.
v <- c(12,24,16,38,21,13,55,17,39,10,60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Create the histogram.
hist(v, main = "Histogram", xlab = "Values")
# Save the file.
dev.off()
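The effect of breaks is easiest to see by passing explicit break points and looking at
the counts that hist() returns. This is a sketch; the bin width of 10 is an assumption
chosen for illustration.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Break points 0, 10, ..., 60 give six bins, each 10 units wide.
h <- hist(v, breaks = seq(0, 60, by = 10), main = "Histogram", xlab = "Values")

# Number of values in each bin; a value equal to a break point
# falls in the bin to its left.
print(h$counts)
```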
R Pie Charts
A pie chart is a representation of values in the form of slices of a circle with different
colors. Pie charts are created with the help of the pie() function.
Syntax:
pie(x, labels, radius, main, col, clockwise)
● x is a vector containing the numeric values used in the pie chart.
● labels is used to give description to the slices.
● radius indicates the radius of the circle of the pie chart (value between −1 and +1).
● main indicates the title of the chart.
● col indicates the color palette.
● clockwise is a logical value indicating if the slices are drawn clockwise or
anticlockwise.
Example:
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Plot the chart.
pie(x,labels)
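Slice labels are often more useful when they also show percentages. The sketch below
builds such labels from the same data; the names cities, pct and pie.labels, the title,
and the rainbow() palette are assumptions made for illustration.

```r
x <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")

# Share of each slice, rounded to a whole percent.
pct <- round(100 * x / sum(x))

# Combine the city name and its percentage into one label per slice.
pie.labels <- paste0(cities, " (", pct, "%)")

pie(x, labels = pie.labels, main = "City pie chart", col = rainbow(length(x)))
```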
R - Line Chart
A line chart is a graph that connects a series of points by drawing line segments between
them. The plot() function in R is used to create the line graph.
Syntax
plot(v, type, col, xlab, ylab, main)
● v is a vector containing the numeric values.
● type takes the value "p" to draw only the points, "l" to draw only the lines and
"o" to draw both points and lines.
● xlab is the label for x axis.
● ylab is the label for y axis.
● main is the Title of the chart.
● col is used to give colors to both the points and lines.
Example:
v <- c(7,12,28,3,41)
# Plot the line chart.
plot(v,type = "o")
R – Boxplots
A boxplot summarizes how the values in a data set are distributed. The quartiles divide
the data set into four equal parts. The graph represents the minimum, first quartile,
median, third quartile, and maximum of the data set.
Syntax
boxplot(x, data, notch, varwidth, names, main)
● x is a vector or a formula.
● data is the data frame.
● notch is a logical value. Set as TRUE to draw a notch.
● varwidth is a logical value. Set as TRUE to draw the width of each box proportional to
the square root of the sample size.
● names are the group labels which will be printed under each boxplot.
● main is used to give a title to the graph
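The values that a boxplot draws can also be inspected numerically with boxplot.stats().
The sketch below uses illustrative sample values; the name box.values is chosen here for
the example.

```r
# Illustrative sample values.
Group_A <- c(25, 28, 30, 32, 35, 37, 38, 39, 40, 41, 42)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
box.values <- boxplot.stats(Group_A)$stats
print(box.values)
```

The hinges are close to the first and third quartiles but can differ slightly from the
values returned by quantile().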
Example
data <- data.frame(Group_A=c(25,28,30,32,35,37,38,39,40,41,42),
                   Group_B=c(22,24,26,29,31,33,36,37,38,40,43))
boxplot(data, main="Boxplot of Group A and B", xlab="Groups", ylab="Values",
        col=c("lightblue","lightgreen"), border="black")
R – Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is chosen in the horizontal axis and another in the
vertical axis.
Syntax
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
● x is the data set whose values are the horizontal coordinates.
● y is the data set whose values are the vertical coordinates.
● main is the title of the graph.
● xlab is the label in the horizontal axis.
● ylab is the label in the vertical axis.
● xlim is the limits of the values of x used for plotting.
● ylim is the limits of the values of y used for plotting.
● axes indicates whether both axes should be drawn on the plot.
Example:
> x <- 1:10
> y <- c(2,4,5,7,8,10,11,13,14,16)
> plot(x, y, main="Scatterplot Example", xlab="X-axis", ylab="Y-axis", col="blue", pch=16, xlim=c(0,11), ylim=c(0,17))
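Normal Distribution
dnorm()
dnorm() function gives the value of the probability density function of the normal
distribution at x.
Syntax:
dnorm(x, mean, sd)
where,
– x is a vector of values at which the density is evaluated
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1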
Example:
# creating a sequence of values
# between -15 and 15 with a difference of 0.1
x <- seq(-15, 15, by=0.1)
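One way to continue from this sequence is to evaluate the density and plot it. The sketch
below assumes the default parameters (mean = 0, sd = 1) and compares the result with the
normal density formula; the name manual is chosen for illustration.

```r
x <- seq(-15, 15, by = 0.1)

# Density of the standard normal distribution at each x
# (mean = 0 and sd = 1 are assumed here).
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l")

# The same values from the formula 1/sqrt(2*pi) * exp(-x^2/2).
manual <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)
print(all.equal(y, manual))
```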
pnorm()
pnorm() function is the cumulative distribution function. It gives the probability
that a normally distributed random variable X takes a value less than or equal to x, i.e.,
F(x) = P(X ≤ x)
Syntax:
pnorm(x, mean, sd,lower.tail)
– x is a vector of values at which the probability is evaluated
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
– lower.tail is a logical value indicating whether to compute the lower tail
probability P(X ≤ x). Its default value is TRUE; if FALSE, P(X > x) is returned
Example
# creating a sequence of values
# between -10 to 10 with a difference of 0.1
x <- seq(-10, 10, by=0.1)
y <- pnorm(x, mean = 2.5, sd = 2)
plot(x, y)
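For single probabilities, pnorm() can be called with one value at a time. The sketch
below uses the standard normal distribution; the cut-off values 1.96 and ±1 are chosen
for illustration.

```r
# P(X <= 1.96) for a standard normal variable (about 0.975).
p.lower <- pnorm(1.96)

# P(X > 1.96): the upper tail.
p.upper <- pnorm(1.96, lower.tail = FALSE)

# P(-1 <= X <= 1): difference of two cumulative probabilities.
p.within <- pnorm(1) - pnorm(-1)

print(c(p.lower, p.upper, p.within))
```

The lower and upper tail probabilities at the same point always add up to 1.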
qnorm()
qnorm() function is the inverse of pnorm() function. It takes a probability p and
returns the value x whose cumulative probability is p. It is useful in finding the
percentiles of a normal distribution.
Syntax:
qnorm(p, mean, sd)
– p is a vector of probabilities
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
Example:
# Create a sequence of probability values
# incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
y <- qnorm(x, mean(x), sd(x))
plot(x, y)
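The inverse relationship between qnorm() and pnorm() can be checked directly. This is a
small sketch; the probabilities 0.975 and 0.9 and the parameters mean = 2.5, sd = 2 are
chosen for illustration.

```r
# 97.5th percentile of the standard normal distribution (about 1.96).
z <- qnorm(0.975)

# Feeding the result back into pnorm() recovers the probability.
p.back <- pnorm(z)

# 90th percentile of a normal distribution with mean 2.5 and sd 2.
q90 <- qnorm(0.9, mean = 2.5, sd = 2)

print(c(z, p.back, q90))
```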
rnorm()
rnorm() function in R programming is used to generate a vector of random numbers
which are normally distributed.
Syntax:
rnorm(n, mean, sd)
– n is the number of random values to generate
– mean is the mean of the distribution. Its default value is 0
– sd is the standard deviation of the distribution. Its default value is 1
Example
# Create a vector of 10000 random numbers
# with mean=90 and sd=5
x <- rnorm(10000, mean=90, sd=5)
# Create the histogram with about 50 bars
hist(x, breaks=50)
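Because the numbers are random, each run gives a different sample. Setting a seed makes
the result repeatable, and the sample mean and standard deviation should come out close
to the requested values. In this sketch the seed value 123 is arbitrary.

```r
# Fix the random number generator so the sample can be reproduced.
set.seed(123)
x <- rnorm(10000, mean = 90, sd = 5)

# Sample statistics: close to, but not exactly, 90 and 5.
print(mean(x))
print(sd(x))
```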
Poisson Distribution:
The Poisson distribution is commonly used to model the number of events that occur
within a fixed interval of time or space, where these events happen with a known average
rate, and independently of the time since the last event.
Here are some of the key functions in R for working with the Poisson distribution:
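1. dpois(x, lambda): Probability mass function (PMF), which gives the probability
of observing exactly x events when the average rate is lambda.
PMF Example: To find the probability of observing exactly 3 events when the
average rate is 2:
dpois(3, lambda = 2)
2. ppois(q, lambda): Cumulative distribution function (CDF), which gives the
probability of observing q or fewer events when the average rate is lambda.
CDF Example: To find the probability of observing 3 or fewer events when the
average rate is 2:
ppois(3, lambda = 2)
3. qpois(p, lambda): Quantile function, which gives the smallest number of events
whose cumulative probability is at least p.
Quantile Example: To find the number of events that corresponds to a cumulative
probability of 0.7 when the average rate is 2:
qpois(0.7, lambda = 2)
4. rpois(n, lambda): Random generation, which generates n random numbers from the
Poisson distribution with average rate lambda.
Random Generation Example: To generate 10 random numbers when the average rate is 2:
rpois(10, lambda = 2)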
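These functions can be checked against the Poisson formula P(X = x) = e^(−λ) λ^x / x!.
The sketch below is illustrative; the names pmf.manual and cdf.manual are chosen here.

```r
lambda <- 2

# PMF from the formula: exp(-lambda) * lambda^x / x!  with x = 3.
pmf.manual <- exp(-lambda) * lambda^3 / factorial(3)

# CDF as the sum of the PMF over 0, 1, 2, 3.
cdf.manual <- sum(dpois(0:3, lambda))

print(c(pmf.manual, dpois(3, lambda)))
print(c(cdf.manual, ppois(3, lambda)))
```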
Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent
trials, each of which has only two possible outcomes and the same probability of success.
For example, tossing a coin always gives a head or a tail. The probability of getting
exactly 3 heads in 10 tosses of a coin is computed using the binomial distribution.
R has four built-in functions for the binomial distribution: dbinom, pbinom, qbinom,
and rbinom. The probability mass function dbinom is described below.
1. dbinom(x, size, prob): Probability mass function (PMF), which gives the probability
of exactly x successes in size trials.
# Probability of getting exactly 3 successes in 10 trials with a success probability of 0.5
dbinom(3, size = 10, prob = 0.5)
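The other three binomial functions follow the same pattern as for the other
distributions. A short sketch with the same coin-tossing setting follows; the seed and
the variable names are assumptions made for illustration.

```r
# Exactly 3 heads in 10 tosses; equals choose(10, 3) * 0.5^10.
p.exact <- dbinom(3, size = 10, prob = 0.5)

# pbinom: probability of 3 or fewer heads (CDF).
p.atmost <- pbinom(3, size = 10, prob = 0.5)

# qbinom: smallest number of heads whose cumulative probability is at least 0.5.
q.median <- qbinom(0.5, size = 10, prob = 0.5)

# rbinom: 5 random counts of heads, each from 10 tosses (seed is arbitrary).
set.seed(1)
draws <- rbinom(5, size = 10, prob = 0.5)

print(c(p.exact, p.atmost, q.median))
print(draws)
```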
Bernoulli Distribution
The Bernoulli distribution is a special case of the binomial distribution where there is only
one trial. It models a single trial with two possible outcomes: success (usually coded as 1)
or failure (usually coded as 0). The probability of success is denoted as p.
In R, the Bernoulli distribution can be handled using the binomial distribution functions
with the number of trials (size) set to 1.
Functions for Bernoulli Distribution in R
1. dbinom(x, size, prob): This is the Probability Mass Function (PMF). It returns the
probability of observing exactly x successes.
● x: The number of successes (0 or 1 for Bernoulli).
● size: Number of trials (set to 1 for Bernoulli).
● prob: Probability of success in a single trial.
# Example: Probability of success (1) with prob = 0.6
dbinom(1, size = 1, prob = 0.6)
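Bernoulli trials can also be simulated with rbinom() by keeping size = 1. In the sketch
below, the seed and the number of simulated trials (1000) are assumptions made for
illustration.

```r
# Probability of failure (0) is 1 - prob.
p.fail <- dbinom(0, size = 1, prob = 0.6)
print(p.fail)

# Simulate 1000 Bernoulli trials; each result is 0 or 1 (seed is arbitrary).
set.seed(42)
trials <- rbinom(1000, size = 1, prob = 0.6)

# The proportion of successes should be close to 0.6.
print(mean(trials))
```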
Student's t Distribution
● The t-distribution is symmetrical and bell-shaped like the normal distribution but
has heavier tails.
● The shape of the t-distribution depends on the degrees of freedom (df), which is
usually the sample size minus one (n − 1).
● As the degrees of freedom increase, the t-distribution approaches the standard
normal distribution.
Here are the main functions in R for working with the Student's t-distribution:
1. dt(x, df): Density function (PDF), which calculates the probability density for a
specific value of x with a given degrees of freedom df.
dt(2, df = 10)
2. pt(q, df): Cumulative distribution function (CDF), which gives the probability that
a t-distributed random variable is less than or equal to q for the specified degrees of
freedom df.
pt(2, df = 10)
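3. qt(p, df): Quantile function, which gives the value below which a t-distributed
random variable falls with probability p for the specified degrees of freedom df.
qt(0.95, df = 10)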
4. rt(n, df): Random generation, which generates n random numbers from the t-
distribution with the specified degrees of freedom df.
Random Generation Example: To generate 5 random numbers from a t-
distribution with 10 degrees of freedom:
rt(5, df = 10)
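The statement that the t-distribution approaches the standard normal distribution as the
degrees of freedom increase can be seen by comparing quantiles. In this sketch the
degrees of freedom in df.values are chosen only for illustration.

```r
# 97.5th percentile of the t-distribution for increasing degrees of freedom.
df.values <- c(2, 10, 30, 100)
t.crit <- qt(0.975, df = df.values)

# 97.5th percentile of the standard normal distribution.
z.crit <- qnorm(0.975)

# The t quantiles shrink toward the normal quantile as df grows.
print(round(t.crit, 3))
print(round(z.crit, 3))
```

The heavier tails of the t-distribution are why its quantiles are larger than the normal
quantile for small df.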