Unit 3 R As A Set of Statistical Tables
Probability and statistics are two important concepts in mathematics. Probability is
about chance, whereas statistics is about how we handle data using different
techniques; it helps to represent complicated data in an easy and understandable way.
Statistics and probability are usually introduced in Classes 10, 11 and 12, when
students are preparing for school exams and competitive examinations, and the
fundamentals are covered briefly in academic books and notes. Statistics is now widely
applied in data science, where professionals use it to make business predictions, such
as forecasting the future profit or loss of a company.
What is Probability?
Probability denotes the possibility of the outcome of any random event. The term
measures the extent to which an event is likely to happen. For example, when we flip
a coin in the air, what is the possibility of getting a head? The answer depends on
the number of possible outcomes. Here the outcome is either a head or a tail, and both
are equally likely, so the probability of getting a head is 1/2.
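Probability is the measure of the likelihood that an event will happen. It measures
the certainty of the event. When all outcomes are equally likely, the formula for
probability is:
P(E) = n(E)/n(S)
Here, n(E) is the number of outcomes favourable to the event E and n(S) is the total
number of outcomes in the sample space S.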
Statistics has a huge scope in many fields such as sociology, psychology, geology and
weather forecasting. The data collected for analysis can be quantitative or
qualitative. Quantitative data are of two types: discrete and continuous. Discrete
data take distinct, countable values, whereas continuous data can take any value
within a range. There are many terms and formulas used in this concept. The terms
listed below are explained in turn.
Random Experiment
Sample Space
Random variables
Expected Value
Independence
Variance
Mean
Random Experiment
An experiment whose result cannot be predicted until it is observed is called a random
experiment. For example, when we throw a die, the result is uncertain to us: we can
get any value from 1 to 6. Hence, this experiment is random.
Sample Space
A sample space is the set of all possible results or outcomes of a random experiment.
Suppose we throw a die. The sample space for this experiment is the set of all
possible outcomes of the throw:
Sample Space = { 1,2,3,4,5,6}
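The sample space above and the formula P(E) = n(E)/n(S) can be sketched in R as follows. The event used here (rolling an even number) and the variable names are illustrative assumptions, not part of the original example.

```r
# Sample space of one roll of a fair six-sided die
sample_space <- 1:6

# Illustrative event (an assumption): the roll is an even number
event <- sample_space[sample_space %% 2 == 0]

# P(E) = n(E) / n(S), valid when all outcomes are equally likely
p_event <- length(event) / length(sample_space)
print(p_event)
```

Three of the six outcomes are even, so the ratio is 3/6.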
Random Variables
The variables which denote the possible outcomes of a random experiment are called
random variables. They are of two types: discrete and continuous.
Discrete random variables take only distinct values which are countable, whereas
continuous random variables can take any value within an interval, so their possible
values are uncountably many.
Independent Event
When the probability of occurrence of one event has no impact on the probability of
another event, the two events are termed independent of each other. For example, if
you flip a coin and throw a die at the same time, the probability of getting a ‘head’
is independent of the probability of getting a 6 on the die.
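The coin-and-die example can be checked by listing every combined outcome. This is a minimal sketch, assuming a fair coin and a fair die; it relies on the standard product rule that independent events satisfy P(A and B) = P(A) * P(B), and the variable names are illustrative.

```r
# All equally likely outcomes of flipping a coin and throwing a die together
outcomes <- expand.grid(coin = c("head", "tail"), die = 1:6)

# Probabilities as the share of outcomes in which each event occurs
p_head <- mean(outcomes$coin == "head")
p_six  <- mean(outcomes$die == 6)
p_both <- mean(outcomes$coin == "head" & outcomes$die == 6)

# For independent events, P(head and 6) equals P(head) * P(6)
print(c(p_head, p_six, p_both))
print(isTRUE(all.equal(p_both, p_head * p_six)))
```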
Mean
The mean of a random variable is the average of its possible values, weighted by
their probabilities. In simple terms, it is the average outcome we expect when the
random experiment is repeated again and again, n times. It is also called the
expectation of the random variable.
Expected Value
Expected value is the mean of a random variable: the value we expect on average from
a random experiment. It is also called expectation, mathematical expectation or first
moment. For example, if we roll a die having six faces, the expected value is the
average of all the possible outcomes, i.e. (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
Variance
The variance tells us how the values of the random variable are spread around the
mean value. It is the expected squared deviation of the random variable from its mean.
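As a minimal sketch, the expected value and variance of a fair six-sided die can be computed directly from these definitions. The variable names are illustrative, and equal probabilities of 1/6 are assumed.

```r
# Outcomes of a fair six-sided die and their probabilities
outcomes <- 1:6
probs <- rep(1 / 6, 6)

# Expected value: probability-weighted average of the outcomes
expected_value <- sum(outcomes * probs)

# Variance: probability-weighted average squared distance from the mean
variance <- sum((outcomes - expected_value)^2 * probs)

print(expected_value)
print(variance)
```

The expected value agrees with the 3.5 quoted above, and the variance works out to 35/12.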
List of Probability Topics
Basic probability topics are:
Box and Whisker Plots          Comparing Two Means    Comparing Two Proportions
Categorical Data               Central Tendency       Correlation
Data Handling                  Degrees of Freedom     Empirical Rule
Frequency Distribution Table   Five Number Summary    Graphical Representation of Data
Histogram                      Mean                   Median
Mode                           Data Range             Relative Frequency
Population and Sample          Scatter Plots          Standard Deviation
Ungrouped Data                 Variance               Data Sets
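Mean
\[ \bar{x} = \frac{\sum x}{n} \]
Median
\[ M = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ term, if } n \text{ is odd} \]
\[ M = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ term} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ term}}{2}, \text{ if } n \text{ is even} \]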
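The two median cases above can be sketched as a small R function. The function name median_by_position and the sample vectors are assumptions made for illustration; base R's median() gives the same result.

```r
# Median computed from the positions of the sorted values
median_by_position <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the ((n + 1) / 2)th term
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: mean of the two middle terms
  }
}

odd_data  <- c(7, 1, 5)      # illustrative values (assumed)
even_data <- c(7, 1, 5, 3)   # illustrative values (assumed)

print(median_by_position(odd_data))
print(median_by_position(even_data))
```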
Solved Examples
Here are some solved examples based on the concepts of statistics and probability.
Students can practice more questions modelled on these examples to excel in the
topic, using the formulas given in the section above.
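Example 1: Find the mean and mode of the following data: 2, 3, 5, 6, 10, 6, 12, 6, 3,
4.
Solution:
Total Count: 10
Sum of observations = 2 + 3 + 5 + 6 + 10 + 6 + 12 + 6 + 3 + 4 = 57
Mean = 57/10 = 5.7
Mode = 6, since 6 occurs three times, more often than any other value.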
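The arithmetic in Example 1 can be checked in R. Base R has no function for the statistical mode, so this sketch counts the values with table(); the variable names are illustrative.

```r
# Data from Example 1
values <- c(2, 3, 5, 6, 10, 6, 12, 6, 3, 4)

# Mean: sum of the observations divided by their count
mean_value <- sum(values) / length(values)

# Mode: the value with the highest frequency
counts <- table(values)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(sum(values))
print(mean_value)
print(mode_value)
```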
Example 2: A bucket contains 5 blue, 4 green and 5 red balls. Sudheer is asked to
pick 2 balls randomly from the bucket without replacement and then one more
ball is to be picked. What is the probability he picked 2 green balls and 1 blue
ball?
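Solution:
The bucket holds 5 + 4 + 5 = 14 balls, and the balls are drawn without replacement.
Probability of drawing a green ball first = 4/14
Probability of drawing a second green ball = 3/13
Probability of then drawing a blue ball = 5/12
Probability of picking 2 green balls and 1 blue ball = 4/14 * 3/13 * 5/12 = 5/182.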
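The product in Example 2 can be reproduced step by step in R. This sketch follows the order used in the example (green, green, then blue); the variable names are illustrative.

```r
# Bucket contents from Example 2
blue  <- 5
green <- 4
red   <- 5
total <- blue + green + red

# Draws without replacement: the counts shrink after every pick
p_first_green  <- green / total
p_second_green <- (green - 1) / (total - 1)
p_then_blue    <- blue / (total - 2)

p <- p_first_green * p_second_green * p_then_blue
print(p)
print(isTRUE(all.equal(p, 5 / 182)))
```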
Example 3: What is the probability that Ram will choose a marble at random and
that it is not black, if the bowl contains 3 red, 2 black and 5 green marbles?
Solution:
Find the number of marbles that are not black and divide by the total number of
marbles.
So P(not black) = (number of red or green marbles)/(total number of marbles)
= 8 /10
= 4/5
Solution: Given,
55 36 95 73 60 42 25 78 75 62
Number of observations = 10
Example 5: Find the median and mode of the following marks (out of 10) obtained
by 20 students:
4, 6, 5, 9, 3, 2, 7, 7, 6, 5, 4, 9, 10, 10, 3, 4, 7, 6, 9, 9
Solution: Given,
4, 6, 5, 9, 3, 2, 7, 7, 6, 5, 4, 9, 10, 10, 3, 4, 7, 6, 9, 9
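Number of observations = n = 20
Arranged in ascending order:
2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 9, 9, 9, 9, 10, 10
Since n is even, Median = (10th term + 11th term)/2
= (6 + 6)/2
= 6
Mode = 9, since 9 occurs four times, more often than any other value.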
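Example 5 can be verified by sorting the marks and reading off the two middle positions. A minimal sketch with illustrative variable names:

```r
# Marks from Example 5
marks <- c(4, 6, 5, 9, 3, 2, 7, 7, 6, 5, 4, 9, 10, 10, 3, 4, 7, 6, 9, 9)

sorted_marks <- sort(marks)
n <- length(sorted_marks)

# n is even, so the median is the mean of the (n/2)th and (n/2 + 1)th terms
median_value <- (sorted_marks[n / 2] + sorted_marks[n / 2 + 1]) / 2

# Frequency of each mark; the mode is the mark with the highest count
counts <- table(marks)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(median_value)
print(mode_value)
```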
\[ \text{Average} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
where:
xi represents the data points.
n represents the total number of data points.
Suppose there are 8 data points: 2, 4, 4, 4, 5, 5, 7, 9. The
average of these 8 data points is
(2 + 4 + 4 + 4 + 5 + 5 + 7 + 9)/8 = 40/8 = 5.
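\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \]
where,
µ is the mean.
N is the total number of elements or frequency of distribution.
Let’s consider the same dataset that we used for the average.
First, calculate the deviation of each data point from the mean (5),
and square the result of each:
(2 - 5)^2 = 9, (4 - 5)^2 = 1, (4 - 5)^2 = 1, (4 - 5)^2 = 1,
(5 - 5)^2 = 0, (5 - 5)^2 = 0, (7 - 5)^2 = 4, (9 - 5)^2 = 16
Variance = (9 + 1 + 1 + 1 + 0 + 0 + 4 + 16)/8 = 32/8 = 4
Standard Deviation = √Variance = √4 = 2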
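# Calculating standard
# deviation using sd()
# sd() divides by N - 1, so it differs from the value computed by hand with N
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
print(sd(values))
Output:
[1] 2.13809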
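The hand calculation divides by N = 8, giving a variance of 4, whereas R's var() and sd() divide by N - 1, which is why sd() prints 2.13809. A minimal sketch comparing the two, with variable names chosen here for illustration:

```r
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(values)
mu <- mean(values)

# Population variance and standard deviation (divide by N)
pop_var <- sum((values - mu)^2) / n
pop_sd <- sqrt(pop_var)

# Sample versions, as computed by var() and sd() (divide by N - 1)
sample_var <- var(values)
sample_sd <- sd(values)

print(c(pop_var, pop_sd))
print(c(sample_var, sample_sd))
```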
Calculating All Three Metrics for a Dataset
Let’s calculate the mean, variance, and standard deviation for
the following dataset:
R
# Define the dataset
data <- c(12, 15, 18, 22, 30, 35)
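One way the three metrics could be computed for this dataset is sketched below. It uses base R's mean(), var() and sd(); note that var() and sd() return the sample versions, which divide by n - 1. The result variable names are illustrative.

```r
# Define the dataset
data <- c(12, 15, 18, 22, 30, 35)

data_mean <- mean(data)   # sum of the values divided by their count
data_var  <- var(data)    # sample variance (divides by n - 1)
data_sd   <- sd(data)     # square root of the sample variance

print(data_mean)
print(data_var)
print(data_sd)
```

Here the mean is 132/6 = 22 and the sample variance is 398/5 = 79.6.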
Output (the first rows of a data frame; the examples that follow refer to it as
myData):
  Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1   TM195  18   Male        14        Single     3       4  29562   112
2   TM195  19   Male        15        Single     2       3  31836    75
3   TM195  19 Female        14     Partnered     4       3  30699    66
4   TM195  19   Male        12        Single     3       3  32973    85
5   TM195  20   Male        13     Partnered     4       2  35247    47
6   TM195  20 Female        14     Partnered     3       3  32973    66
Mean in R Programming Language
It is the sum of observations divided by the total number of
observations. It is also defined as average which is the sum
divided by count.
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \]
R
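# R program to illustrate
# Descriptive Analysis
# Mode: the most frequent value of Age in myData.
# Named get_mode so that it does not mask base R's mode().
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
get_mode(myData$Age)
Output:
[1] 25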
where,
x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
N represents the total number of observations
Covariance Syntax in R
Syntax: cov(x, y, method)
where,
x and y represent the data vectors
method defines the type of method to be used to compute the
covariance. Default is “pearson”.
Example:
R
# Data vectors
x <- c(1, 3, 5, 10)
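A minimal sketch of a complete covariance example follows. The y vector is an assumption borrowed from the correlation example in this unit, and the manual calculation is included only to show that cov() divides by N - 1.

```r
# Data vectors (y is assumed, taken from the correlation example)
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Covariance using the default "pearson" method
cov_xy <- cov(x, y, method = "pearson")
print(cov_xy)

# The same value by hand: summed products of deviations, divided by N - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
print(manual_cov)
```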
where,
x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
Correlation in R
Syntax: cor(x, y, method)
where,
x and y represent the data vectors
method defines the type of method to be used to compute the
correlation. Default is “pearson”.
Example:
R
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# Pearson correlation between x and y
print(cor(x, y, method = "pearson"))
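The normal distribution functions dnorm(x, mean, sd), pnorm(x, mean, sd),
qnorm(p, mean, sd) and rnorm(n, mean, sd) use the following parameters:
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations (sample size).
mean is the mean value of the sample data. Its default
value is zero.
sd is the standard deviation. Its default value is 1.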
dnorm()
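# Create a sequence of numbers between -10 and 10, incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Density at each x. The original call is not shown, so the default
# mean = 0 and sd = 1 are assumed here.
y <- dnorm(x)
plot(x, y)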
qnorm()
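As a minimal sketch, qnorm() takes a probability and returns the value whose cumulative normal probability equals it, so it is the inverse of pnorm(). The probabilities below are illustrative assumptions, and the standard normal distribution (mean 0, sd 1) is used.

```r
# Illustrative probabilities (assumed for this sketch)
p <- c(0.025, 0.5, 0.975)

# Quantiles of the standard normal distribution for those probabilities
q <- qnorm(p, mean = 0, sd = 1)
print(q)

# pnorm() maps the quantiles back to the original probabilities
print(pnorm(q, mean = 0, sd = 1))
```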
rnorm()
This function is used to generate random numbers whose
distribution is normal. It takes the sample size as input and
generates that many random numbers. We draw a histogram to
show the distribution of the generated numbers.
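A minimal sketch of such a histogram follows. The sample size of 50, the seed and the plot labels are assumptions made for illustration; the default mean of 0 and standard deviation of 1 are used.

```r
# Fix the seed so that the sketch is reproducible (the value is arbitrary)
set.seed(42)

# Draw 50 random numbers from the standard normal distribution
x <- rnorm(50)

# Histogram of the generated numbers
hist(x, main = "Normal distribution", xlab = "x")
```

With a larger sample size the histogram looks closer to the bell shape of the normal density.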
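The binomial distribution functions dbinom(x, size, prob), pbinom(x, size, prob),
qbinom(p, size, prob) and rbinom(n, size, prob) use the following parameters:
x is a vector of numbers.
p is a vector of probabilities.
n is the number of observations.
size is the number of trials.
prob is the probability of success of each trial.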
dbinom()
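As a minimal sketch, dbinom() gives the probability of an exact number of successes. The coin-toss setting (51 tosses of a fair coin) is borrowed from the other binomial examples in this unit, and the count of 26 heads is an assumption chosen for illustration.

```r
# Probability of exactly 26 heads in 51 tosses of a fair coin
p_exact <- dbinom(26, size = 51, prob = 0.5)
print(p_exact)

# The probabilities of all possible head counts (0 to 51) add up to 1
p_total <- sum(dbinom(0:51, size = 51, prob = 0.5))
print(p_total)
```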
pbinom()
x <- pbinom(26, 51, 0.5)  # reconstructed call: P(26 or fewer heads in 51 fair tosses)
print(x)
[1] 0.610116
qbinom()
# Number of heads at the 0.25 quantile, i.e. the smallest count whose
# cumulative probability reaches 0.25, when a coin is tossed 51 times.
x <- qbinom(0.25,51,1/2)
print(x)
[1] 23
rbinom()
print(x)
When we execute the above code, it produces the following result:
[1] 58 61 59 66 55 60 61 67
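The call that defines x for rbinom() is not shown above. A minimal sketch of how eight such values could be generated follows; size = 150 and prob = 0.4 are assumptions chosen so that the expected count (size * prob = 60) is in line with the values printed above. The seed is arbitrary, so the numbers drawn will differ from that output.

```r
# Fix the seed so that the sketch is reproducible (the value is arbitrary)
set.seed(1)

# Eight random counts of successes, each out of 150 trials (assumed)
# with success probability 0.4 (assumed)
x <- rbinom(8, size = 150, prob = 0.4)
print(x)

# Expected number of successes per draw: size * prob
print(150 * 0.4)
```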