MSCE Unit 1 Slides Compressed


MATHEMATICS FOR COMPUTER

SCIENCE ENGINEERS
Generation of Random Variates

Dr.Mamatha.H.R

Department of Computer Science and


Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered

❖ Random Numbers

❖ Random Variate Generator

❖ Random Variates

❖ Techniques for generating Random Variates


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Numbers
● It is a random sequence of numbers obtained from a stochastic process.
● A random number is a number chosen as if by chance from some specified
distribution such that selection of a large set of these numbers reproduces
the underlying distribution.
● A random number is chosen using methods which give equal probability to all
numbers occurring in the specified distribution.
● In the real world, random numbers may be generated using a die or a roulette
wheel.

Source: tcsjohnhuxley.com, amazon.in


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Numbers
● Random numbers are most commonly produced with the help
of a random number generator.
● Random numbers have important applications, especially in
cryptography where they act as ingredients in encryption keys.
● One of the most important requirements for random numbers
is independence, so that there is no correlation between
successive numbers.
● The frequency of occurrence of each number should be
approximately the same. As a result, it is theoretically not
easy to generate a long sequence of random numbers.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Numbers
● Random numbers are also very important for simulations.
● All the randomness required by a simulation model is
obtained by a random number generator.
● The output of a random number generator is assumed to
be a sequence of independent and identically (uniformly)
distributed random numbers between 0 and 1.
● These random numbers are transformed into required
probability distributions.
● Example: The most common set from which random
numbers are derived is the set of single-digit decimal
numbers {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

Source: medium.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Generating Random Numbers
● Problem:
Generate sample of a random variable X with a given density f. (The
sample is called a random variate)
● Answer:
Develop an algorithm such that if one used it repeatedly (and
independently) to generate a sequence of samples X1, X2, . . . , Xn then
as n becomes large, the proportion of samples that fall in any interval
[a, b] is close to P(X ∈ [a, b]), i.e.

$\dfrac{\#\{i : X_i \in [a, b]\}}{n} \approx P(X \in [a, b])$


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Generating Random Numbers
● Solution: 2-step process
○ Generate a random variate uniformly distributed in [0, 1], also
called a random number.
○ Use an appropriate transformation to convert the random number
to a random variate of the correct distribution.
● Why is this approach good ?
Answer: It focuses on generating samples from ONE distribution only.
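
The two-step process can be sketched in Python. This is a minimal illustration, not part of the original slides: the exponential distribution, the function name `exponential_variate` and the rate value 2.0 are assumptions chosen only because the inverse of the exponential CDF has a simple closed form.

```python
import math
import random


def exponential_variate(rate, rng):
    """Step 1: draw a random number U in [0, 1).
    Step 2: transform U with the inverse of F(x) = 1 - exp(-rate * x)."""
    u = rng.random()
    return -math.log(1.0 - u) / rate


rng = random.Random(42)  # fixed seed so the run is reproducible
samples = [exponential_variate(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate
print(f"sample mean = {sample_mean:.3f}")
```

Only the transformation in step 2 depends on the target distribution; the source of uniform random numbers stays the same.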
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Number Generators
● A random number generator is a hardware device or software algorithm
that generates a number that is taken from a limited or unlimited
distribution and outputs it.
● The two main types of random number generators are pseudo random
number generators and true random number generators.
● The numbers or sequence of numbers generated must lack any pattern
(i.e. must appear random).
● True random number generator:
○ It measures some physical phenomenon that is expected to be
random and then compensates for possible biases in the
measurement process.
○ Example sources include measuring atmospheric noise, thermal
noise, and other external electromagnetic and quantum phenomena.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Number Generators
● Pseudo random number generator:
○ It uses computational algorithms that can produce long sequences of
apparently random results.
○ But these results are in fact completely determined by a shorter
initial value, known as a seed value or key.
○ As a result, the entire seemingly random sequence can be reproduced
if the seed value is known.
○ Properties that pseudo-random number generators should possess:
1. It should be fast and not memory intensive.
2. It must be able to reproduce a given stream of random numbers.
3. It should provide for producing several different independent streams of
random numbers.
● The random numbers generated must meet some statistical tests for
randomness intended to ensure that they do not have any easily
discernible patterns.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Number Seed
● A random seed (or seed state, or just
seed) is a number (or vector) used to
initialize a pseudorandom number
generator.
● Computer-based generators use random
number seeds for setting the starting
point of the random number sequence.
● For a seed to be used in a pseudorandom
number generator, it does not need to be
random.
● These seeds are often initialized using a
computer's real time clock in order to
have some external noise.

Source: http://www.cs.bilkent.edu.tr/~cagatay/cs503/_M&S_04_Random_Variate_Generation.pdf
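
A short Python sketch of how a seed fixes the sequence. The helper name `stream` and the seed values are hypothetical; `random.Random` is the standard-library pseudorandom generator.

```python
import random


def stream(seed, n=5):
    """Return the first n numbers produced by a generator started from seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = stream(12345)
second = stream(12345)  # same seed: the sequence is reproduced exactly
other = stream(54321)   # different seed: a different sequence

print(first == second)
print(first == other)
```

The seed itself (12345 here) is not random, which matches the point that a seed only sets the starting point of the sequence.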
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Variate
● A random variate is a variable generated from uniformly distributed
pseudorandom numbers.
● It is a particular outcome of a random variable.
● The random variates which are other outcomes of the same random
variable might have different values.
● Random variates are used when simulating processes that are driven by
random influences (stochastic processes).
● They are frequently used as the input to simulation models.
● Procedures to generate random variates corresponding to a given
distribution are known as procedures for random variate generation or
pseudo-random number sampling.
● Depending on how they are generated, a random variate can be
uniformly or non-uniformly distributed.
● Examples: Inter-arrival time and service time.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Variate Generation
● Random variate generation is a fundamental aspect of simulation
modeling and analysis.
● The objective of random variate generation is to produce observations
that have the stochastic properties of a given random variable.
● Various methods and algorithms have been developed to generate
random variates that are accurate (representative of the target
distribution) and computationally efficient.
● The distribution from which random variates are generated is assumed
to be completely specified.
● We wish to generate samples from this distribution as input to a
simulation model.
● Random variate generation relies on generating uniformly distributed
random numbers in the closed interval [0,1].
● Random variate generators use random numbers distributed as U[0,1]
as their starting point.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Variate Generation : Objectives
● The objective of random variate generation is to produce sample
observations that have the stochastic properties of a given random
variable, X, having distribution function
F(x) = Pr(X ≤ x) , where −∞ < x < ∞
● The development of the theory/concepts surrounding random variate
generation via computer algorithms is based on the following two key
assumptions :
○ Assumption 1 : There exists a perfect uniform (0,1), U(0,1), random
number generator that can produce a sequence of independent
random variables uniformly distributed on (0,1).
○ Assumption 2 : Computers can store and manipulate real numbers.
● Although Assumptions 1 and 2 are used for developing Random
Variate Generation theory, the assumptions are violated when
implementing Random Variate Generation algorithms on digital
computers.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors to be considered for random variate generation
1. Exactness:
● Exactness or accuracy refers to how well the generator produces
random variates with the characteristics of the desired
distribution.
● This refers to the theoretical exactness of the random variate
generator itself, as well as the error that is induced by the U(0,1)
random number generator and the error induced by digital
computer calculations.
2. Speed:
● Speed refers to the computational set-up and execution time
required to generate random variates. Contributions to time are:
a. Setup time
b. Variable generation time
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors to be considered for random variate generation
3. Space:
● Space refers to computer memory that is required for the generator.
● Although space is not typically a major consideration for modern
computers, computer memory was an important consideration in the
early days of Random Variate Generation development.
4. Simplicity:
● Simplicity refers to both the simplicity of the algorithm and the
simplicity of implementation.
● This includes the number of lines of code, support routines required,
number of mathematical operations, as well as portability across
platforms and interaction with other simulation methods such as
variance reduction techniques.
The importance of each of these factors will vary depending on the
particular situation or simulation application.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Variate Generation Techniques
● We assume that a pseudo random number generator
RN(0,1) producing a sequence of independent values
between 0 and 1 is available.
● General methods:
○ Inverse transform method
○ Acceptance-rejection method
○ Composite method
○ Translations and other simple transforms
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method
● The acceptance–rejection method is often used when a closed-form
cumulative distribution function does not exist or is difficult to
calculate.
● In this method, variates are generated from one distribution and are
either accepted or rejected in such a way that the accepted values
have the desired distribution.
● General acceptance–rejection algorithm:
(i) Given a random variable X, let f(x) denote the desired density
function of X.
(ii) Let t(x) be any majorizing function of f(x) such that t(x) ≥ f(x) for all
values of x.
(iii) Let g(x) = t(x)/c denote the density function proportional to t(x) such
that $\int_{-\infty}^{\infty} g(x)\,dx = 1$, i.e. $c = \int_{-\infty}^{\infty} t(x)\,dx$.
Source: Kuhl, Michael E. "History of random variate generation." 2017 Winter Simulation
Conference (WSC). IEEE, 2017.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method
(iv) Generate x ∼ g(x).
(v) Generate u ∼ U(0,1).
(vi) If u > f(x)/t(x), then reject x and go to step (iv).
(vii)Return x.

The execution time of the acceptance–rejection algorithm


depends on three main factors:
● The time to generate x from g(x).
● The time to perform the comparison in step (vi).
● The number of iterations required to return an accepted
value for x.
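
The general algorithm can be sketched as follows. The target density f(x) = 6x(1 − x) on [0, 1], the constant majorizing function t(x) = 1.5 and the helper names are illustrative assumptions, not taken from the slides; with this choice c = 1.5 and g is the U(0,1) density.

```python
import random


def accept_reject(f, t, sample_g, rng):
    """Return one variate with density f, given a majorizing function t >= f
    and a sampler for the density g = t / c."""
    while True:
        x = sample_g(rng)     # generate x ~ g
        u = rng.random()      # generate u ~ U(0, 1)
        if u <= f(x) / t(x):  # accept x with probability f(x) / t(x)
            return x          # otherwise loop and try again


def f(x):
    return 6.0 * x * (1.0 - x)  # illustrative target density on [0, 1]


def t(x):
    return 1.5                  # constant majorizing function, max of f


def sample_g(rng):
    return rng.random()         # g = t / 1.5 is the U(0, 1) density


rng = random.Random(1)
xs = [accept_reject(f, t, sample_g, rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

For this target the mean is 0.5 and the variance is 0.05, so the printed values should be close to those. On average c = 1.5 candidates are needed per accepted value.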
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method
● Suppose that we need to sample from a distribution whose inverse
function is hard to solve. In that case, acceptance-rejection method can
be used.
● Generate a random point (X,Y) on the graph.
● If (X,Y) lies under the graph of f(X) then
Accept X
Otherwise
Reject X

Source: cs.bilkent.edu.tr
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method
Illustration: To generate random variates, X ~ U(1/4,1)

● Here, R does not have the desired distribution, but R conditioned (R’) on
the event {R ≥ ¼} does.
● Efficiency: Depends heavily on the
ability to minimize the number of
rejections.
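
A minimal Python sketch of this illustration, assuming the procedure is: generate a random number R, accept it if R ≥ 1/4, and otherwise try again. The function name and the trial counter are additions for illustration.

```python
import random


def uniform_quarter_to_one(rng):
    """Return (x, trials): x ~ U(1/4, 1) and the number of random numbers used."""
    trials = 0
    while True:
        trials += 1
        r = rng.random()
        if r >= 0.25:  # accept R only on the event {R >= 1/4}
            return r, trials


rng = random.Random(3)
results = [uniform_quarter_to_one(rng) for _ in range(40_000)]
values = [x for x, _ in results]
mean_value = sum(values) / len(values)
mean_trials = sum(k for _, k in results) / len(results)
print(f"mean = {mean_value:.3f}, random numbers per variate = {mean_trials:.3f}")
```

Each trial is accepted with probability 3/4, so about 4/3 random numbers are used per accepted variate, and the accepted values average about 0.625.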

Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Drawback
● Trials ratio: Average number of points (X,Y) needed to produce one
accepted X.
● We want the trials ratio to be close to 1.
● Otherwise the generator may not be efficient enough because of wasted
computing effort.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: to increase efficiency
● One way to make the generator efficient is
to generate points uniformly scattered under a function e(x) such that
the area between the graphs of f and e is small.

Source: cs.bilkent.edu.tr
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Constructing e(x)
● Take e(x) = Kg(x)
● g(x) = density function of a distribution for which an easy way of
generating variates already exists.
● K = scale factor

Source: cs.bilkent.edu.tr
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Producing (X, Y)
● Let X = a variate produced from the density g(x)
● Let U = RN(0,1)
● (X,Y) = (X, UKg(X))

Source: cs.bilkent.edu.tr
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Poisson Distribution
The procedure for generating a Poisson random variate N with mean α is as follows:
1. Set n = 0, P = 1.
2. Generate a random number R_{n+1}, and replace P by P × R_{n+1}.
3. If P < exp(-α), then accept N = n.
Otherwise, reject the current n, increase n by one, and return to step 2.
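
The three steps translate almost line by line into Python. The function name `poisson_variate` and the value α = 3 are assumptions for the demonstration.

```python
import math
import random


def poisson_variate(alpha, rng):
    """Generate one Poisson variate with mean alpha."""
    threshold = math.exp(-alpha)
    n = 0
    p = 1.0                # step 1: n = 0, P = 1
    while True:
        p *= rng.random()  # step 2: P = P * R_{n+1}
        if p < threshold:  # step 3: accept N = n
            return n
        n += 1             # otherwise increase n and return to step 2


rng = random.Random(11)
draws = [poisson_variate(3.0, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

For a Poisson distribution the mean and variance both equal α, so both printed values should be close to 3. Each variate N = n consumes n + 1 random numbers, which is the overhead the example slide refers to.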
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Poisson Distribution example

Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Poisson Distribution example
• It took five random numbers to generate three Poisson
variates
• In long run, the generation of Poisson variates requires
some overhead!

Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Normal Distribution
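● If X is a random variable from a normal distribution N(0, 1),
then the density of |X| is given by the function
$f(x) = \sqrt{2/\pi}\; e^{-x^2/2}, \quad x \ge 0.$
● With g(x) = e^{-x} (the exponential density with mean 1), the function
c·g(x), where $c = \sqrt{2e/\pi}$, majorizes the function f(x).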


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Normal Distribution
The algorithm for generating X ~ N(0,1):
● Generate an exponential Y with mean 1.
● Generate U from U(0,1), independent of Y.
● If $U \le e^{-(Y-1)^2/2}$, then accept Y. Otherwise, reject Y and
return to the first step.
● Return X = Y or X = -Y, both with probability 0.5.
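
A minimal Python sketch of this algorithm. The function name is hypothetical, and the exponential variate is produced by the inverse transform −ln(1 − U), which is an implementation choice not stated on the slide.

```python
import math
import random


def standard_normal_variate(rng):
    """Generate X ~ N(0, 1) by acceptance-rejection from an exponential."""
    while True:
        y = -math.log(1.0 - rng.random())  # exponential Y with mean 1
        u = rng.random()                   # U ~ U(0, 1), independent of Y
        if u <= math.exp(-((y - 1.0) ** 2) / 2.0):
            # Accepted Y has the density of |X|; attach a random sign.
            return y if rng.random() < 0.5 else -y


rng = random.Random(5)
zs = [standard_normal_variate(rng) for _ in range(50_000)]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

The sample mean should be close to 0 and the sample variance close to 1.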
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Acceptance and Rejection Method: Advantages & Disadvantages
● Advantages:
○ Doesn’t require figuring out the inverse CDF.
○ It can be far more efficient compared with the naive methods in some
situations.
● Disadvantages:
○ May have to sample a lot to get an accept.
○ It can lead to a lot of unwanted samples being taken if the function
being sampled is highly concentrated in a certain region.
○ Curse of dimensionality:
As the dimensions of the problem get larger, the ratio of the
embedded volume to the corners of the embedding volume tends
towards zero.
Thus a lot of rejections can take place before a useful sample is
generated, thus making the algorithm inefficient and impractical.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Random Variate Generation: Exercise
1. Implement Random Variate Generation for Poisson Distribution.
2. Implement Random Variate Generation for Normal Distribution.
DATA ANALYTICS
References
http://cs.bilkent.edu.tr/~cagatay/cs503/_M&S_04_Random_Variate_Generation.pdf
https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
[1] Kuhl, Michael E. "History of random variate generation." 2017 Winter
Simulation Conference (WSC). IEEE, 2017.
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU

Dr.Mamatha H R
Professor, Department of Computer Science
PROBABILITY PLOTS

D. Uma
Mamatha H R

Computer Science and Engineering


MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Normal Probability Plot

Department of Computer Science and Engineering


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered...

The Normal Probability Plot.


Understanding Q-Q Plot.
Interpreting the Probability Plots.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Are my data “normal”?

● Not all continuous random variables are normally distributed!
● It is important to evaluate how well the data are approximated by a normal
distribution.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Are my data normally distributed?

1. Look at the histogram! Does it appear bell shaped?


2. Compute descriptive summary measures—are mean,
median, and mode similar?
3. Do 2/3 of observations lie within 1 std dev of the mean?
Do 95% of observations lie within 2 std dev of the
mean?
4. Look at a normal probability plot—is it approximately
linear?

Source: https://www.statology.org/histogram-mean-median/
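
Check 3 in the list above can be automated. The sketch below uses a hypothetical data set (10,000 simulated values with mean 50 and standard deviation 10) purely to illustrate the calculation; the helper name `fraction_within` is also an assumption.

```python
import random
import statistics


def fraction_within(data, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    inside = sum(1 for x in data if abs(x - mean) <= k * sd)
    return inside / len(data)


rng = random.Random(0)
data = [rng.gauss(50.0, 10.0) for _ in range(10_000)]  # hypothetical data

within_1 = fraction_within(data, 1)  # roughly 2/3 for normal data
within_2 = fraction_within(data, 2)  # roughly 95% for normal data
print(f"within 1 SD: {within_1:.3f}, within 2 SD: {within_2:.3f}")
```

Fractions far from these benchmarks suggest that the data are not well described by a normal distribution.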
[Figure residue: histograms of four variables, each annotated with summary statistics and with the intervals mean ± 1, 2 and 3 standard deviations. Lower bounds are truncated at 0 where the variable cannot be negative.]

Variable | Median | Mean | Mode | SD | Range | Mean ± 1 SD | Mean ± 2 SD | Mean ± 3 SD
1 | 6 | 7.1 | 0 | 6.8 | 0 to 24 | 0.3 – 13.9 | 0 – 20.7 | 0 – 27.5
2 | 5 | 5.4 | none | 1.8 | 2 to 9 | 3.6 – 7.2 | 1.8 – 9.0 | 0 – 10.8
3 | 3 | 3.4 | 3 | 2.5 | 0 to 12 | 0.9 – 5.9 | 0 – 8.4 | 0 – 10.9
4 (clock time) | 7:00 | 7:04 | 7:00 | 0:55 | 5:30 to 9:00 | 6:09 – 7:59 | 5:14 – 8:54 | 4:19 – 9:49
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What do Probability Plots mean?

● So far, we have always worked with randomly selected samples from some
population.
● We have used an appropriate probability distribution to fit the data.
● The probability plot is one way of assessing that fit graphically.
● By visualizing the data, we can obtain a great deal of information.
● For instance, the plot can show that the data are skewed or bimodal, which
helps determine the distribution of the population the data came from.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Probability Plots

● Data plotted against the theoretical normal distribution should form a
straight line if the data are normal.
● A straight diagonal line indicates that the data are normally distributed.
● The plot also shows whether the data are skewed to the left or right, in
which case they do not fit the normal distribution.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How can I claim that my data is normally distributed?

For larger samples,

● The histogram will have a symmetric, bell-shaped curve and there will not be
any outliers.
● The mean, median and mode will be similar and lie at nearly the same point.
● About 68% of observations lie within one standard deviation of the mean,
95% within two and 99.7% within three standard deviations.

For small samples,

● A histogram does not give a good visual impression, so to confirm
normality we can use probability plots.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Construction of a Probability Plot

1) Sort the data.
2) Assign evenly spaced values between 0 and 1 to the data.
3) For each xi in the data set, compute the plotting position
$p_i = (i - 0.5)/n$, where
i is the position of the data item in the sorted data and
n is the size of the data set.
4) Find the theoretical quantiles Qi, the quantiles of the suspected
distribution at the plotting positions.
5) Plot every point (xi , Qi).
6) Plot (xi , xi) as a reference line.
7) Check whether the points form an approximately straight
line. This helps us to understand the type of distribution.
[Figure: normal probability plot, coffee. Right-skewed (concave up).]
[Figure: normal probability plot, writing. Neither right-skewed nor left-skewed, but a big gap at 6.]
[Figure: normal probability plot, exercise. Right-skewed (concave up).]
[Figure: normal probability plot, wake-up time. Closest to a straight line.]

Formal tests for normality
Results:
● Coffee: Strong evidence of non-normality (p<.01)
● Writing love: Moderate evidence of non-normality (p=.01)
● Exercise: Weak to no evidence of non-normality (p>.10)
● Wakeup time: No evidence of non-normality (p>.25)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Different ways of dividing data equally between 0 and 1.

Plotting position methods:
● Blom
● Benard
● Hazen
● Van der Waerden
● Kaplan-Meier
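
The formula column of this slide did not survive extraction. The Python sketch below uses the forms commonly cited for these methods, which should be read as an assumption rather than as the slide's own definitions; only the Hazen form, (i − 0.5)/n, is confirmed by the worked problem in this deck.

```python
def plotting_positions(n, method="hazen"):
    """Plotting positions p_1..p_n for a sorted sample of size n.

    The formulas are the commonly cited ones (an assumption here)."""
    formulas = {
        "blom": lambda i: (i - 0.375) / (n + 0.25),
        "benard": lambda i: (i - 0.3) / (n + 0.4),
        "hazen": lambda i: (i - 0.5) / n,
        "van_der_waerden": lambda i: i / (n + 1),
        "kaplan_meier": lambda i: i / n,
    }
    rule = formulas[method]
    return [rule(i) for i in range(1, n + 1)]


for name in ("blom", "benard", "hazen", "van_der_waerden", "kaplan_meier"):
    print(name, [round(p, 3) for p in plotting_positions(5, name)])
```

With the Kaplan-Meier form the last position is exactly 1, which has no finite normal quantile. That is one reason the other forms shift the positions slightly away from 0 and 1.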
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem – Normal Probability Plot

Problem:

Construct a normal probability plot for the following data. Do


these data appear to come from an approximately normal
distribution?

3.01, 3.35, 4.79, 5.96, 7.89.

Solution:

Sort the values

3.01, 3.35, 4.79, 5.96, 7.89 / n = 5


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution - Find the Plotting Position using Hazen Method

The value (i − 0.5)/n is chosen to reflect the position of Xi in
the ordered sample.
There are i − 1 values less than Xi , and i values less than
or equal to Xi .
The quantity (i − 0.5)/n is a compromise between the
proportions (i − 1)/n and i/n.

i | Xi | (i − 0.5)/n
1 | 3.01 | 0.1
2 | 3.35 | 0.3
3 | 4.79 | 0.5
4 | 5.96 | 0.7
5 | 7.89 | 0.9

The distribution that the sample is suspected to come from is N(5, 2²).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution - Understanding behind Normal Probability Plot

● From the plot we can see that X1, with plotting position 0.1, corresponds
to the point Q1, the 10th percentile of the N(5, 2²) distribution.
● Applying similar reasoning to the remaining points (the 30th, 50th, 70th
and 90th percentiles), we would expect each Qi to be close to its
corresponding Xi.
● The probability plot consists of the points (Xi , Qi ).
● Since the distribution that generated the Qi was a normal
distribution, this is called a normal probability plot.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution - Understanding behind Normal Probability Plot

● If X1, . . . , Xn do in fact come from the distribution that
generated the Qi , the points should lie close to a straight line.
● To construct the plot, we must compute the Qi.
● These are the 100(i − 0.5)/n percentiles of the distribution that
is suspected of generating the sample.
● In this example the Qi are the 10th, 30th, 50th, 70th, and 90th
percentiles of the N(5, 2²) distribution.
● We could approximate these values by looking up the z-scores
corresponding to these percentiles, and then converting to
raw scores.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution - Find the Theoretical Quantiles Qi

i | Xi | (i − 0.5)/n | Closest area in z-table | z-score | Qi = z·σ + µ
1 | 3.01 | 0.1 | 0.1003 | -1.28 | -1.28 × 2 + 5 = 2.44
2 | 3.35 | 0.3 | 0.3015 | -0.52 | -0.52 × 2 + 5 = 3.96
3 | 4.79 | 0.5 | 0.5000 | 0.00 | 0.00 × 2 + 5 = 5.00
4 | 5.96 | 0.7 | 0.6985 | 0.52 | 0.52 × 2 + 5 = 6.04
5 | 7.89 | 0.9 | 0.8997 | 1.28 | 1.28 × 2 + 5 = 7.56
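
The same quantiles can be computed without a z-table. This sketch uses `statistics.NormalDist` from the Python standard library with the suspected N(5, 2²) distribution; the variable names are illustrative. Because it uses exact quantiles rather than z-scores rounded to two decimals, the results can differ from table-based values by about 0.01.

```python
from statistics import NormalDist

data = [3.01, 3.35, 4.79, 5.96, 7.89]
xs = sorted(data)
n = len(xs)

# Hazen plotting positions (i - 0.5) / n for i = 1..n
positions = [(i - 0.5) / n for i in range(1, n + 1)]

# Theoretical quantiles Q_i of the suspected N(5, 2^2) distribution
suspected = NormalDist(mu=5.0, sigma=2.0)
quantiles = [suspected.inv_cdf(p) for p in positions]

# The probability plot consists of the points (X_i, Q_i)
points = list(zip(xs, quantiles))
for x, q in points:
    print(f"{x:5.2f}  {q:5.2f}")
```

Plotting these pairs, for example with a plotting library of your choice, and checking whether they fall near a straight line completes the construction.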
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution - Normal Probability Plot

● The figure shows a normal probability plot for the sample X1, . . . , X5.
● A straight line is superimposed on the plot, to make it easier to judge
whether the points lie close to a straight line or not.
● The sample points are close to the line, so it is quite plausible that
the sample came from a normal distribution.
● The sample points X1, . . . , Xn are called empirical quantiles.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Q-Q Plot

● The points Q1, . . . , Qn are called quantiles of the distribution;
quantiles divide a distribution into equal-sized areas.
● A quantile is a point below which a certain proportion of the data falls.
● The probability plot is sometimes called a quantile–quantile plot, or
Q-Q plot.
● We can use the Q-Q plot to check the assumption of normality of the data.
● More generally, it shows whether two sets of quantiles come from
populations with the same distribution: if they do, the points roughly
form a straight line.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How to use Probability plots for different sample sizes?

● Probability plots work better with larger samples.
● A good rule of thumb is to require at least 30 points before
relying on a probability plot.
● Probability plots can still be used for smaller samples, but they
will detect only fairly large departures from normality.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Interpreting Probability Plots

● It’s best not to use hard-and-fast rules when interpreting a
probability plot. Judge the straightness of the plot by eye.
● When deciding whether the points on a probability plot lie
close to a straight line or not, do not pay too much attention to
the points at the very ends (high or low) of the sample, unless
they are quite far from the line.
● It is common for a few points at either end to stray from the
line somewhat.
● However, a point that is very far from the line when most
other points are close is an outlier, and deserves attention.
THANK YOU

D. Uma
Mamatha H R
Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
UE23MA242A

Unit 1: Population & Sampling

Mamatha.H.R

Department of Computer Science and


Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered

❖ Statistical Analysis

❖ Population

❖ Sample

❖ Sampling

❖ Types of Population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problems to be solved

Suppose, you are interested in finding

• Mean height of all male students of all the universities in India.


OR

• Average marks of all female students of PES University. OR

• Relationship between the time a student spends on studying


and the grades that he gets. OR

• Impact of rise in number of student assignments on their grades.

Slide courtesy: Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Statistical Analysis?
It’s the science of collecting, exploring and
presenting large amounts of data to discover
underlying patterns and trends.

Statistics are applied every day – in research,


industry and government – to become more
scientific about decisions that need to be made.

The basic idea behind all statistical methods of data


analysis is to make inferences about a population
by studying a relatively small sample chosen from
it.

Source: media3.giphy.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Population
A population is the entire collection of objects or outcomes about
which information is sought.

As mentioned, statistical methods are based on the idea of analyzing


a sample drawn from a population.

For this idea to work, identifying the population, sample and


choosing the sample in an appropriate manner becomes important.

In research, a population doesn’t always refer to people. It can mean


a group containing elements of anything you want to study, such as
objects, events, organizations, countries, species, organisms, etc.

Source: keydifferences.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample
A sample is a subset of a population, containing the objects or
outcomes that are actually observed.
Sample size: The number of items in a sample is called a sample size.
The size of the sample is always less than the total size of the
population.
The process of taking a predetermined number of observations from
a larger population is called sampling.

Sources: i.gifer.com, keydifferences.com


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Population vs Sample
Population | Sample
The population is a complete set. | The sample is a subset of the population.
A population is hard to define and observe in real life. | A sample is much easier to contact and observe.
It is time consuming and costly to study a population. | It is relatively less time consuming and low cost to study a sample.
A population contains all members of a specified group. | A sample is a subset that represents the entire population.
Reports on a population are a true representation of opinion. | Reports on a sample have a margin of error.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Population & Sample examples

Population | Sample
All countries of the world | Countries with published data available on birth rates and GDP since 2000
Songs from the Eurovision Song Contest | Winning songs from the Eurovision Song Contest that were performed in English
Undergraduate students in the Netherlands | 300 undergraduate students from three Dutch universities who volunteer for your psychology research study
Advertisements for IT jobs in the Netherlands | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Populations & Samples

In a recent survey, 250 college students at Union College were
asked if they smoked cigarettes regularly. 35 of the students said
yes. Identify the population and the sample.

Responses of all students


at Union College
(population)

Responses of 250
students in survey
(sample)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Populations & Samples

A city council member wanted to know how her constituents felt about a
planned rezoning. She randomly selected 75 names from the city phone
directory and conducted a phone survey. Identify the population and sample
in this setting.
Answer:

● The population is everyone listed in the city phone directory


● The sample is the 75 people selected for the phone survey.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Sampling?
The process of selecting observations(a sample) in order to make an
inference that can be generalized to the population.
[Figure: the population is what you want to talk about; the sample is what you
actually observe in the data; inference connects the two.]

Source Image : aprendeconalf.es Slide courtesy: Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling
The methodology used to sample from a larger population depends on the
type of analysis being performed.

The population: all of the individuals of interest.
The sample: the individuals selected from the population to participate in
the research study.
The results from the sample are generalized to the population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling

Sampling takes us from the population to the sample.
Population: use parameters to summarize features.
Sample: use statistics to summarize features.
Inference on the population is made from the sample.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why sampling?
We know that resources such as time, money and people are limited.
When the population is large in size, geographically dispersed, or difficult to
contact, it’s necessary to use a sample. Thus, most projects aim to gather data
from a sample, rather than from the entire population. Some reasons for
sampling are:
● Necessity: Sometimes it’s simply not possible to study the whole population
due to its size or inaccessibility.
● Practicality: It’s easier and more efficient to collect data from a sample.
● Cost-effectiveness: There are fewer participant, laboratory, equipment, and
researcher costs involved.
● Manageability: Storing and running statistical analyses on smaller datasets is
easier and more reliable.
● Saves time: Because the sample is smaller than the population, data collection
is faster.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Characteristics of a sample
● A sample must be representative of the population.
● It must be appropriately sized. i.e. it must be sufficiently large to represent
the population and provide statistical stability or reliability.
● It must be unbiased. It should contain all types of groups/units present in
the population in fair proportions.
● It must be selected at random. This means that every item in the population
has an equal chance of being selected and included in the sample.
● It must be economical. The objectives of the survey must be achieved with as
little cost and effort as possible.
● It must be goal-oriented. It must be oriented to the research objectives and
fitted to the survey conditions.
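The "selected at random" requirement above can be sketched in code. This is a minimal illustration, assuming a hypothetical sampling frame of 500 student IDs; the function name, the frame, and the sample size are not from the slides.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct units so that every unit is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical sampling frame: student IDs 1..500 (illustrative only)
student_ids = list(range(1, 501))
sample = simple_random_sample(student_ids, k=20, seed=42)
print(sorted(sample))
```

Sampling without replacement matches the tangible-population setting, where a unit is not returned to the population after it is drawn.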

Slide courtesy: Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?
Study : Survey of the job prospects of the students studying in a university.
Sample: Surveying the students who happen to be in the canteen.

Slide courtesy: Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?
This is not an example of a good sample because:
● The students in the canteen are not completely representative
of the students studying in the university.
● The size of the sample (i.e. the number of students in the
canteen) might not be sufficient to
represent the population (students studying in the university).
● The sample is not selected at random, as each
student studying in the university does not have an equal chance
of being selected.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?

➔ Study: To measure teenage use of illegal drugs in a city.
Sample: All high school students in the city.
This type of sampling results in a biased sample because it does not include
home-schooled students or dropouts.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?

➔ Study: To calculate the average number of hours a person spends exercising.
Sample: "Man on the street" interview which selects people who walk
by a certain location.
This type of sampling results in having an overrepresentation of
healthy individuals who are more likely to be out of the home than
individuals with a chronic illness. This may be an extreme form of
biased sampling, because certain members of the population are
totally excluded from the sample (that is, they have zero probability of
being selected).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?
➔ Study: A test of the effectiveness of a new high school curriculum introduced
Sample: Dividing an area by school district, then choosing a school or set number
of schools at random and sampling students from each school.
This type of sampling results in an unbiased sample because each school district in the
area is represented in the sample. Also, each school has an equal chance of
being chosen.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Is it a good sample?
➔ Study: Conduct observations to ensure that employees are following best
practices in the company.
Sample: Each employee is assigned a random number using computer software.
The same software is used periodically to choose a number of employees, who
are then observed. This is a good sample because each employee has an equal chance of being
selected.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of population

1. Tangible or concrete population

2. Conceptual population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tangible population
Populations where the members are physical objects, such as cars,
bolts, apples, etc., are called tangible or concrete populations.
Such populations are always assumed to be finite, so their members can
be counted.
After an item is sampled, the population size decreases by 1.
In principle, one could in some cases return the sampled item to the
population, with a chance to sample it again, but this is rarely done
in practice.

Source: https://www.hindivarta.com/jansankhya-ki-samasya-aur-samadhan-par-nibandh/
Slide courtesy: Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Conceptual population
Populations that do not consist of physical or actual objects are
called Conceptual populations.
Conceptual populations mostly arise from measurement: the same quantity
is measured multiple times.
Ex: the length of a metal rod.
A conceptual population is not a well-defined, fixed group. Not all of its elements are
available at the time the sample is collected, because the population
keeps growing.
The size of a conceptual population is usually large.
Ex: for a measuring scale, the population can be taken as all the possible readings it
could give, i.e. an infinite population. The measured values can be thought of as a
sample from this infinite population.
Slide courtesy: Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
● A shipment of bolts is received from a vendor. To check whether
the shipment is acceptable with regard to shear strength, an
engineer reaches into the container and selects 10 bolts, one by
one to test.
Ans: All the bolts in the shipment: Tangible population
● The resistance of a certain resistor is measured 5 times with the
same ohmmeter.
Ans: All measurements that could be made on that resistor with
that ohmmeter : Conceptual population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
● A geologist weighs a rock several times on a sensitive scale.
Ans: All the readings that the scale could produce: Conceptual
population
● A pollster samples 1000 registered voters in a certain state and
asks them which candidate they support for governor.
Ans: All registered voters in that state : Tangible population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tangible & Conceptual population examples
Define the population, and state whether it is tangible or conceptual.
● A quality engineer needs to estimate the percentage of bolts manufactured
on a certain day that meet a strength specification. At 3:00 in the afternoon
he samples the last 100 bolts to be manufactured.
Ans: All bolts manufactured on that day : Tangible population
● In a clinical trial to test a new drug that is designed to lower cholesterol, 100
people with high cholesterol levels are recruited to try the new drug.
Ans: All people with high cholesterol level: Tangible population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Target and Study population
● Target or Theoretical population refers
to the entire group of individuals or
objects to which researchers are
interested in generalizing the
conclusions.
It must meet a set of criteria of interest
to the researchers.
● Study population or accessible
population is the population to which
the researchers can actually apply their
conclusions.
It is a subset of the target population. It
may be limited to a region, state, city,
county, or institution.
(Figure labels: TARGET POPULATION, STUDY POPULATION, SAMPLE)
Slide courtesy: Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Target and Study population examples

Target Population | Study Population
All institutionalized elderly with Alzheimer's | All institutionalized elderly with Alzheimer's in St. Louis county nursing homes
All people with AIDS | All people with AIDS in the metropolitan St. Louis area
All low birth weight infants | All low birth weight infants admitted to the neonatal ICUs in St. Louis city & county
All school-age children with asthma | All school-age children with asthma treated in pediatric asthma clinics in university-affiliated medical centers in the Midwest
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Terminologies related to Sampling
● Target or Theoretical Population: The population to which the
investigator wants to generalize his results.
● Sampling Frame : The sampling frame is the list from which the
potential respondents are drawn.
Ex: List of Universities, List of Students, List of Airline Companies,
Telephone Directory
● Sampling Unit : Smallest Unit from which sample can be
selected.
● Sampling Scheme: Method of selecting sampling units from
sampling frame.
● Sample: All selected respondents form a sample.

Slide courtesy: Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Breakdown

Source: https://image.slidesharecdn.com/qrmtheory-180918191951/95/how-to-do-sampling-8-638.jpg?cb=1537298482
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Breakdown
Study: Find the mean weight of all students of all universities in India.

● To whom do you want to generalize the results?
All universities in India
➔ Target or Theoretical population
● What population can you get access to?
All universities in Karnataka
➔ Study population
● How can you get access to them?
List of Universities in Karnataka
➔ Sampling frame
● Who is in your study?
Two Universities from Karnataka
➔ Sample
Slide courtesy: Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
References

https://www.questionpro.com/blog/population-vs-sample/
https://www.scribbr.com/methodology/population-vs-sample/
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU

Dr.Mamatha H R
Professor, Department of Computer Science
+91 80 2672 1983 Extn 834
MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Unit 1: Additional Examples

Department of Computer Science and


Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Populations & Samples

The owners of a sports stadium wanted to predict what additional
refreshment options would sell well. They selected 80 seat numbers at
random and surveyed the occupants of those seats. Identify the population
and sample in this setting.
Answer:

● The population is the occupants of all seats in the sports stadium.
● The sample is the occupants of the 80 selected seats.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Populations & Samples
Lucio wants to know whether the food he serves in his restaurant is within a
safe range of temperatures. He randomly selects 70 entrees and measures
their temperatures just before he serves them to his customers. Identify the
population and sample in this setting.

Answer:

● The population is all of the entrees Lucio serves.


● The sample is the 70 selected entrees.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Mean
Calculate the average number of truck shipments from the
United States to five Canadian cities for the following data given
in thousands of bags:
Montreal: 64.0
Ottawa: 15.0
Toronto: 285.0
Vancouver: 228.0
Winnipeg: 45.0

Ans: 127.4
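The arithmetic behind this answer can be checked with a short sketch that uses Python's standard `statistics` module. The dictionary name is illustrative; the figures are the ones listed above.

```python
from statistics import mean

# Shipment figures from the slide, keyed by city
shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

# Mean = sum of the values divided by the number of values: 637.0 / 5
average = mean(shipments.values())
print(round(average, 1))  # 127.4
```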
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median-example
Consider the data given below:
5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7
Calculate the median.

Ans:
To calculate the median, we need to put the numbers in order and find the
middle value.
3 4 5 5 5 7 9 12 14 16 19
n = 11
● Here the median is 7 because this is the middle value.
● Half of the other values in the list are below 7 and half are above 7.

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median-example
Consider the data given below:
3, 6, 7, 8, 11, 15
Calculate the median.
Ans:
● When there are an even number of values, there is no clear middle
value.
In this case, there are two middle values.
3 6 7 8 11 15
n=6
● The median is the mean of these two middle numbers: (7 + 8) / 2
= 7.5
So the median for this set of values is 7.5.
● Like the mean, the median value does not always appear in the
original list of values.
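Both median examples (odd and even number of values) can be reproduced with the standard library. This is a small sketch; the variable names are illustrative.

```python
from statistics import median

odd_data = [5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7]   # n = 11, one middle value
even_data = [3, 6, 7, 8, 11, 15]                    # n = 6, two middle values

# median() sorts internally; for an even count it averages the two middle values
median_odd = median(odd_data)
median_even = median(even_data)
print(median_odd)   # 7
print(median_even)  # 7.5
```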
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode-example
Consider the data given below:
5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7
Calculate the mode of the above data.
Ans:
3 4 5 5 5 7 9 12 14 16 19
In this list the mode is 5, because it appears most often.

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode and Range-example
In the article “Evaluation of Low-Temperature Properties of HMA Mixtures” (P.
Sebaaly, A. Lake, and J. Epps, Journal of Transportation Engineering, 2002: 578–
583), the following values of fracture stress (in megapascals) were measured for a
sample of 24 mixtures of hot-mixed asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Ans:
There are three modes:
80, 179, and 232.
Each of these values appears twice, and no other value appears more than once.
The range is 470 − 30 = 440.
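When a data set has several modes, `statistics.multimode` (Python 3.8+) returns all of them. The sketch below applies it to the fracture-stress values and also computes the range; the variable names are illustrative.

```python
from statistics import multimode

fracture_stress = [
    30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
    223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470,
]

# multimode returns every value that shares the highest frequency
modes = multimode(fracture_stress)

# Range = largest value minus smallest value
data_range = max(fracture_stress) - min(fracture_stress)

print(modes)       # [80, 179, 232]
print(data_range)  # 440
```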

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Percentile example
Find the 65th percentile for the following data
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470

Ans:
R = (P/100)(N + 1)
= (65/100)(24 + 1)
= 16.25 (not an integer)

The 65th percentile is therefore found by averaging the 16th and 17th data points.

Percentile value = (16th value + 17th value)/2
= (236 + 240)/2
= 238
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range question
For the following data sets, calculate the quartiles and find the interquartile
range.
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Ans:
n = 24.
Position of the first quartile = (0.25)(25) = 6.25
First quartile = (6th value + 7th value)/2
= (105 + 126)/2 = 115.5

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range question
For the following data sets, calculate the quartiles and find the interquartile
range.
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
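Ans:
n = 24.
Position of the second quartile (median) = (0.5)(25) = 12.5
Second quartile = (12th value + 13th value)/2
= (191 + 223)/2
= 207
Position of the third quartile = (0.75)(25) = 18.75, so third quartile = (18th value + 19th value)/2 = (242 + 245)/2 = 243.5.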
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range question
For the following data sets, calculate the quartiles and find the interquartile
range.
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Ans:
n = 24.
Position of the third quartile = (0.75)(25) = 18.75
Third quartile = (18th value + 19th value)/2
= (242 + 245)/2 = 243.5.

IQR = 3rd quartile - 1st quartile = 243.5 - 115.5 = 128
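The position rule used in these percentile and quartile examples can be written as one small function. This is a sketch of the rule as the slides apply it (position (p/100)(n + 1), averaging the two neighbouring values when the position is not an integer). The function name is hypothetical, and library routines such as `statistics.quantiles` interpolate differently, so their results may not match.

```python
def percentile(sorted_data, p):
    """Return the p-th percentile using the position rule (p/100)(n + 1)."""
    n = len(sorted_data)
    # whole = integer part of the position, remainder != 0 means "not an integer"
    whole, remainder = divmod(p * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position falls outside the data")
    if remainder == 0:
        return sorted_data[whole - 1]          # positions are 1-based
    # Non-integer position: average the values on either side
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2

data = [
    30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
    223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470,
]  # already sorted

q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1
p65 = percentile(data, 65)
print(q1, q2, q3, iqr, p65)  # 115.5 207.0 243.5 128.0 238.0
```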


MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Maximum Likelihood Estimation
Prof. Mamatha H R
Prof. Uma D
Prof. Silviya Nancy J
Prof. Suganthi S

Department of Computer Science and Engineering


MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Maximum Likelihood Estimation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered...

The Method of Maximum Likelihood for Bernoulli, Binomial and Poisson Distributions.

The Method of Maximum Likelihood for Normal Distribution.

Pitfalls of Point Estimators


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Maximum Likelihood Estimate (MLE)

● Suppose a population follows a known type of distribution, but its
parameters (for example, the mean and variance) are unknown.

● We take an adequate number of samples from the population and look for
the parameter values under which the observed data are most likely to occur.

● Maximum Likelihood Estimation (MLE) is a good method for estimating
parameters in this way.

● It can be applied to any given distribution using the observed data.

● The idea is to estimate the parameter with the value
that makes the observed data most likely.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Definition

Text Book – Chapter 4.9 - Pg. No: 283


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Points to Remember

● The maximum likelihood estimate is the value of the estimator that,
when substituted for the parameter, maximizes the likelihood function.

● The likelihood function can be a probability density function
or a probability mass function.

● It can also be a joint probability density or mass function,
and is often the joint density or mass function of
independent random variables.

● Note: Joint probability is the probability of two
or more events occurring together.
Text Book – Chapter 4.9 - Pg. No: 283
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Obesity in Women – Understanding MLE
Let’s start with an example. Annie is a postgraduate student
who wants to study the growing health problems that obesity
causes in women. She decides to collect data from a sample
of women aged 20 to 25 years. To start with, she surveys 10
of her classmates with a questionnaire on their diet and exercise
habits, collects their weights, and plots them from low to high.

And it looks like this,

From the collected sample estimates, she intends to make
inferences about the population parameters.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Obesity in Women – MLE for Normal Distribution

As a first step, she has to decide which model describes
her data best. From the plot she sees that the weights are
adequately described by the Normal distribution.

The plot suggests that a Normal distribution is plausible
because most of the data points are clustered around the
middle, with a few scattered to the left and right.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Obesity in Women – How do we go about the distribution?

We already know that the Normal distribution has two
parameters associated with it: the mean µ and the standard
deviation σ. Different values of these parameters give
differently shaped curves.

MLE is the method that helps us find the values of µ
and σ that result in the bell curve that best fits our data.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
General Steps to proceed with MLE.

Step 1: Write down the likelihood function.

Step 2: Take the natural log of the likelihood function.
(Reason: the logarithm is an increasing function, so the value
that maximizes the log of a function is always the same value
that maximizes the function itself.)

Step 3: Differentiate the log-likelihood function with respect to the
parameter being estimated.

Step 4: Set the derivative equal to 0 and solve for the parameter to get the MLE.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)

If the Xi are independent Bernoulli random variables with unknown
parameter p, then the probability mass function of each Xi is:

$f(x_i; p) = p^{x_i}(1-p)^{1-x_i}$, for $x_i \in \{0, 1\}$
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)

For n observations, the joint probability mass/density (distribution) function
is the product of the individual density functions:

$f(x_1, x_2, \ldots, x_n; p) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n; p) = \prod_{i=1}^{n} f(x_i; p)$
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)

$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n; p) = f(x_1; p)\, f(x_2; p) \cdots f(x_n; p)$

Viewed as a function of p, this product is the Likelihood Function.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bernoulli Distribution – Estimate Likelihood Function for (p)
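The worked equations for these slides did not survive extraction. The standard derivation, following the four general steps given earlier, is sketched below; it assumes n independent Bernoulli(p) observations $x_1, \ldots, x_n$, each equal to 0 or 1.

```latex
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}
      = p^{\sum_i x_i}\,(1-p)^{\,n-\sum_i x_i} \\
\ln L(p) &= \Big(\sum_{i=1}^{n} x_i\Big)\ln p
          + \Big(n-\sum_{i=1}^{n} x_i\Big)\ln(1-p) \\
\frac{d}{dp}\ln L(p) &= \frac{\sum_i x_i}{p} - \frac{n-\sum_i x_i}{1-p} = 0 \\
\hat{p} &= \frac{1}{n}\sum_{i=1}^{n} x_i
\end{align*}
```

The MLE of p is therefore the sample proportion of successes.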
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example – Binomial Distribution – Estimate Likelihood Function

Consider the following example,

In the probability mass function, we write f(7; p) rather
than f(7). Here the data value 7 is constant and p is treated as the variable.

When a probability mass function or probability density
function is considered to be a function of the parameters, it is
called a likelihood function.
Text Book – Chapter 4.9 - Pg. No: 282
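The example statement itself is missing from the extracted slide. As a rough numerical illustration, the sketch below assumes X ~ Bin(20, p) with observed value X = 7. Only the value 7 appears in the text; n = 20 and the function name are assumptions. It evaluates the likelihood f(7; p) on a grid of p values and picks the maximizer.

```python
from math import comb

def binomial_likelihood(p, k=7, n=20):
    """Likelihood f(k; p) of observing k successes in n trials (n = 20 is assumed)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate the likelihood on a fine grid of candidate p values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binomial_likelihood)

print(p_hat)  # 0.35, which equals k / n = 7 / 20
```

A grid search is used here only to make the "value that makes the data most likely" idea visible; the calculus steps give the same answer, k/n, in closed form.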
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How can we maximise this likelihood function?

Text Book – Chapter 4.9 - Pg. No: 282


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Binomial Distribution – Estimate Likelihood Function
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Binomial Distribution – Estimate Likelihood Function
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Binomial Distribution – Estimate Likelihood Function
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Binomial Distribution – Estimate Likelihood Function
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
MLE for Poisson Distribution (λ) – Estimating Likelihood

Text Book – Chapter 4.9 - Pg. No: 283


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
MLE for Poisson Distribution (λ) – Estimating Likelihood

Text Book – Chapter 4.9 - Pg. No: 283
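The equations on these slides were lost in extraction. The standard derivation of the Poisson MLE is sketched below, assuming $x_1, \ldots, x_n$ are independent observations from a Poisson(λ) distribution.

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \\
\ln L(\lambda) &= -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big)\ln\lambda
                 - \sum_{i=1}^{n}\ln(x_i!) \\
\frac{d}{d\lambda}\ln L(\lambda) &= -n + \frac{\sum_i x_i}{\lambda} = 0 \\
\hat{\lambda} &= \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
\end{align*}
```

The MLE of λ is the sample mean.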


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example - Poisson Distribution

Problem:
The following data are the observed frequencies of occurrence
of domestic accidents, with n = 647 observations in total:

Number of accidents | Frequency
0 | 447
1 | 132
2 | 42
3 | 21
4 | 3
5 | 2

What is the estimate of λ if a Poisson model is assumed?
Problem Source - http://wwwf.imperial.ac.uk/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example - Poisson Distribution

Solution:
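The worked solution is not present in the extracted text. Assuming the standard result that the Poisson MLE of λ is the sample mean, the estimate can be computed from the frequency table as sketched below; the variable names are illustrative.

```python
# Frequency table from the problem: number of accidents -> observed frequency
accident_counts = {0: 447, 1: 132, 2: 42, 3: 21, 4: 3, 5: 2}

n = sum(accident_counts.values())                           # number of observations
total_accidents = sum(k * f for k, f in accident_counts.items())

# MLE of lambda under a Poisson model = sample mean
lambda_hat = total_accidents / n

print(n, total_accidents, round(lambda_hat, 4))  # 647 301 0.4652
```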

Problem Source - http://wwwf.imperial.ac.uk/


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Let x1, ..., xn be a random sample from an N(µ, σ²) population. Find the MLEs of µ and σ.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

The likelihood function

OR

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Note: with natural logarithms, log e = 1.
Text Book – Chapter 4.9 - Pg. No: 284
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normal Distribution – Estimate Likelihood Function

Text Book – Chapter 4.9 - Pg. No: 284
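The equations on the preceding slides were lost in extraction. The standard derivation for a random sample $x_1, \ldots, x_n$ from $N(\mu, \sigma^2)$ is sketched below; it is offered as a reconstruction of the usual textbook result, not as the exact slide content.

```latex
\begin{align*}
L(\mu,\sigma) &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
                 \exp\!\Big(-\frac{(x_i-\mu)^2}{2\sigma^2}\Big) \\
\ln L(\mu,\sigma) &= -\frac{n}{2}\ln(2\pi) - n\ln\sigma
                    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 \\
\frac{\partial \ln L}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
  \;\Rightarrow\; \hat{\mu} = \bar{x} \\
\frac{\partial \ln L}{\partial \sigma} &= -\frac{n}{\sigma}
  + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
  \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
\end{align*}
```

Note that the MLE of the variance divides by n, not n − 1, so it differs slightly from the usual sample variance.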


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example – Maximum Likelihood Estimate for Mean (µ)

Problem:

Suppose the weights of randomly selected female college students are
normally distributed with unknown mean μ and standard deviation σ. A
random sample of 10 female college students yielded the following
weights (in pounds):
115 122 130 127 149 160 152 138 149 180
Identify the likelihood function and the maximum likelihood estimator of
μ, the mean weight of all female college students. Using the given sample,
find a maximum likelihood estimate of μ as well.

Numerical Data Source: https://online.stat.psu.edu/stat414/


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example – Maximum Likelihood Estimate for Mean (µ)

Solution:
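The worked solution is missing from the extracted text. Assuming the standard result that the MLE of μ for a normal sample is the sample mean, the estimate for the ten weights in the problem can be computed as follows; the variable names are illustrative.

```python
from statistics import mean

# Weights (in pounds) of the 10 sampled students, from the problem statement
weights = [115, 122, 130, 127, 149, 160, 152, 138, 149, 180]

# For a normal model, the maximum likelihood estimate of mu is the sample mean
mu_hat = mean(weights)

print(mu_hat)  # 142.2
```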

Numerical Data Source: https://online.stat.psu.edu/stat414/


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Maximum Likelihood (MLE) – Desirable Properties

Maximum likelihood is the most commonly used method of estimation.

The main reason for this is that in most cases that arise in
practice, MLEs have two very desirable properties:

1. In most cases, as the sample size n increases, the bias of the
MLE converges to 0.
2. In most cases, as the sample size n increases, the variance of
the MLE converges to a theoretical minimum.

Text Book – Chapter 4.9 - Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Pitfalls of Point Estimators

● A point estimate is a single number, and it may vary from
sample to sample.

● A point estimate will almost certainly differ somewhat from the true
population parameter.

● On its own, it cannot be confidently claimed to be close to the actual
parameter.

● This is addressed by reporting an interval of values, centered on the
point estimate, that is likely to contain the population parameter.
This interval is called a confidence interval.

Note: Confidence Intervals will be covered shortly.


THANK YOU

Prof. Mamatha H R
Prof. Uma D
Prof. Silviya Nancy J
Prof. Suganthi S
Department of Computer Science and Engineering
PRINCIPLES OF POINT ESTIMATION

D. Uma
Mamatha H R

Computer Science and Engineering


MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Point Estimation

Computer Science and Engineering


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered...

Point Estimator.
Measuring Goodness of an Estimator.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample Statistics & Population Parameters

● Statistics are used to estimate parameters.

● Descriptive statistics describe the data; they
do not make any inference from the data.

● A numerical summary of a population is called a parameter,
and a numerical summary of a sample is called a statistic.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Inferential Statistics

● Inferential statistics draw inferences and make conclusions or
evaluations about a population using the evidence
provided by a sample.

● It helps to estimate the parameters of the population.

● The sample may not provide a complete depiction
of the population.

● There will always be some uncertainty when drawing
conclusions about the population from the sample.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Understanding Point Estimation

● Anna is interested in finding the mean weight of the apples
that are imported from Kashmir. A survey claims
that the average weight of an apple is around 90g.

What can we do now?

● We cannot weigh every apple (the population), so we need to take samples.

● From those samples we can make inferences about the entire
population.

● There is also a chance that the samples we examine will
contain some error.
Image Source: http://clipart-library.com/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling

● Since all the apples cannot be weighed, she
decides to take four samples of 20 apples each.

● After performing the test, the sample means are
displayed below.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Point Estimate?

● Recall the claim that the average
weight of an apple is around 90g.

● Among the samples, Sample 4, with a mean of 89.9g, is
pretty close to 90g.

● Since this is a sample, we cannot expect it to be exactly 90g. So,
can we claim that these samples accurately reflect the weight
parameter of the overall population?

● 89.9g is close enough to 90g that the claim is
plausible to accept. The value 89.9g is referred to as a Point Estimate.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Point Estimate

● It is a single numeric value computed from the data, which is
also referred to as a sample statistic.

● We collect data for the purpose of estimating some
numerical characteristic of the population from which they
come.

● A quantity calculated from the data is called a statistic, and
a statistic that is used to estimate an unknown constant, or
parameter, is called a point estimator. Once the data have
been collected, its computed value is called a point estimate.

● A point estimate is used to make inferences about the population parameters.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Inferences from Point Estimator

Point Estimator | Unknown Population Parameter
Sample Mean ($\bar{x}$) | Population Mean ($\mu$)
Sample Standard Deviation ($s$) | Population Standard Deviation ($\sigma$)
Sample Proportion ($\hat{p}$) | Population Proportion ($p$)

Inference runs from each point estimator to the corresponding population parameter.

A point estimate is used to estimate unknown
population parameters, including the population mean,
standard deviation and proportion.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example – Point Estimation of Population Mean

● The point estimate of the population mean µ is the sample mean $\bar{x}$.

● Example: A sample of the heights of 34 male freshman students in
a class was obtained.

185 161 174 175 202 178 202 139 177
170 151 176 197 214 283 184 189 168
188 170 207 180 167 177 166 231 176
184 179 155 148 180 194 176

● The sample mean of these values serves as the single numeric point estimate for
the population mean (true mean) of all the freshman
students.
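The computed value of the sample mean is not shown in the extracted slide. A short sketch that calculates it from the 34 listed heights follows; the variable names are illustrative, and no unit is stated in the source.

```python
from statistics import mean

# The 34 heights listed on the slide
heights = [
    185, 161, 174, 175, 202, 178, 202, 139, 177,
    170, 151, 176, 197, 214, 283, 184, 189, 168,
    188, 170, 207, 180, 167, 177, 166, 231, 176,
    184, 179, 155, 148, 180, 194, 176,
]

# Point estimate of the population mean = sample mean
x_bar = mean(heights)

print(len(heights), round(x_bar, 2))  # 34 182.44
```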
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example – Point Estimation of Population Proportion

● A point estimate of the population proportion, p, is the sample
proportion

$\hat{p} = X / n$

where X is the number of successes in the sample and n is
the sample size.

Example: A sample of 100 people was selected in a particular
locality to estimate the proportion of them who go walking in the
park every day. In this sample, 40 do, so $\hat{p} = 40/100 = 0.4$.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Properties of Point Estimator

● For a large population, an estimate obtained by sampling is
not always going to be perfect. There will always be some
uncertainty in estimation.

Property: 1 - Bias

● An estimator is biased when its expected value is different from
the value of the parameter that is being estimated.

● When they are equal, the estimator is called unbiased.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Properties of Point Estimator

Property: 2 - Consistency

● Consistency describes how close the point estimator stays to the
true value of the parameter as the sample size increases.

● The consistency and accuracy of a point estimator can be
improved by using large samples.

● This can be assessed through the mean and the variance of the estimator.

● For an estimator to be consistent, its value should move
towards the true value of the population parameter as the sample grows.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Properties of Point Estimator

Property: 3 - Efficiency

● A very efficient point estimator should have the following:

a) smallest variance.
b) unbiased observation.
c) consistent.

● All these properties can be achieved from a normally
distributed population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How can we measure the goodness of an estimator?

● Given a point estimator, how do we determine how good it is?

Goodness Measure - Mean Squared Error (MSE)

● What methods can be used to construct good point estimators?

Good Method to construct Point Estimator - Maximum Likelihood Estimate (MLE)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measuring Goodness of Estimator – Mean Squared Error (MSE)

● A good estimator should be both accurate and precise.

● Accuracy of an estimator is measured by its bias.

● Precision is measured by its standard deviation, or uncertainty.

● The overall goodness can be measured by a quantity called the Mean Squared Error
(MSE).

● MSE combines both bias and uncertainty.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Mean Squared Error (MSE) - Bias & Uncertainty

● $\theta$ is the unknown parameter.

● $\hat{\theta}$ denotes the estimator of $\theta$.

● The bias of the estimator is $\mu_{\hat{\theta}} - \theta$:
the difference between the mean of the estimator and the true
value.

● The uncertainty is the standard deviation $\sigma_{\hat{\theta}}$, also called the
standard error of the estimator.

Text Book – Chapter 4.9 - Pg. No: 280


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
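Mean Squared Error (MSE)

● MSE is found by adding the variance to the square of the bias.

● By definition, $\mathrm{MSE}_{\hat{\theta}} = (\mu_{\hat{\theta}} - \theta)^2 + \sigma_{\hat{\theta}}^2$

● An equivalent expression is $\mathrm{MSE}_{\hat{\theta}} = \mu_{(\hat{\theta} - \theta)^2}$, the mean of the squared error.

Note: $\hat{\theta} - \theta$ is the difference between the estimated value and the true
value, and it is called the error.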
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Derivation of Mean Squared Error (MSE)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Derivation of Mean Squared Error (MSE)
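The derivation itself is missing from the extracted slides. A standard version is sketched below, writing $\mu_{\hat{\theta}} = E(\hat{\theta})$ for the mean of the estimator and $\sigma_{\hat{\theta}}^2$ for its variance.

```latex
\begin{align*}
\mathrm{MSE}_{\hat{\theta}} &= E\big[(\hat{\theta}-\theta)^2\big] \\
 &= E\big[\big((\hat{\theta}-\mu_{\hat{\theta}}) + (\mu_{\hat{\theta}}-\theta)\big)^2\big] \\
 &= E\big[(\hat{\theta}-\mu_{\hat{\theta}})^2\big]
    + 2(\mu_{\hat{\theta}}-\theta)\,E\big[\hat{\theta}-\mu_{\hat{\theta}}\big]
    + (\mu_{\hat{\theta}}-\theta)^2 \\
 &= \sigma_{\hat{\theta}}^2 + (\mu_{\hat{\theta}}-\theta)^2
\end{align*}
```

The cross term vanishes because $E[\hat{\theta}-\mu_{\hat{\theta}}] = 0$, which leaves the variance plus the squared bias.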
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Solution:

Note: When the bias is 0, the MSE equals the variance.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Determine the MSE of the estimator $\hat{\mu} = \bar{X}$ of the
parameter µ of the Poisson(µ) distribution.

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284
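The solution steps for the Poisson problem did not survive extraction. A sketch of the standard argument follows, assuming $\bar{X}$ is the mean of n independent Poisson(µ) observations (the sample size n is an assumption, since the slide does not state it).

```latex
\begin{align*}
E(\bar{X}) &= \mu \;\Rightarrow\; \text{bias} = E(\bar{X}) - \mu = 0 \\
\mathrm{Var}(\bar{X}) &= \frac{\mathrm{Var}(X_i)}{n} = \frac{\mu}{n}
   \quad\text{(the variance of a Poisson($\mu$) variable is $\mu$)} \\
\mathrm{MSE}(\hat{\mu}) &= \mathrm{Var}(\hat{\mu}) + \text{bias}^2
   = \frac{\mu}{n} + 0 = \frac{\mu}{n}
\end{align*}
```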


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Problem

Determine the MSE of the estimator $\hat{\mu} = \bar{X}$ of the
parameter µ of the N(µ, σ²) distribution.

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
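Problem

$\mathrm{MSE}(\hat{\mu}) = \mathrm{Var}(\hat{\mu}) + \mathrm{bias}^2(\hat{\mu})$

Since the sample mean is an unbiased estimator of μ,
$E(\hat{\mu}) = \mu$, so bias = 0.
We know $\mathrm{Var}(\hat{\mu}) = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$.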

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
.
.
. Problem

𝜎2 𝜎2
Therefor MSE = +0=
𝑛 𝑛

Text Book – Chapter 4.9 – Ex.3 Pg. No: 284
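The result MSE = σ²/n for the sample mean of a normal population can be checked numerically. The sketch below is a minimal simulation, not part of the textbook solution; the function name and the values of µ, σ, n and the number of trials are illustrative assumptions.

```python
import random

def simulate_mse_of_sample_mean(mu, sigma, n, trials, seed=0):
    """Estimate the MSE of the sample mean by repeated sampling."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total_squared_error = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        total_squared_error += (x_bar - mu) ** 2  # squared error of this estimate
    return total_squared_error / trials

# Illustrative values (assumptions, not taken from the slides).
mu, sigma, n = 5.0, 2.0, 10
estimated = simulate_mse_of_sample_mean(mu, sigma, n, trials=20000)
theoretical = sigma ** 2 / n  # variance of the sample mean, since the bias is 0
print(f"simulated MSE = {estimated:.3f}, sigma^2/n = {theoretical:.3f}")
```

With enough trials the simulated value should land close to σ²/n, which is what an unbiased estimator with variance σ²/n predicts.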


THANK YOU

D. Uma
Mamatha H R

Computer Science and Engineering


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
UE23MA242A
Unit 1: Types of Statistics & Summary Statistics
Mamatha H R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered

❖ Statistics

❖ Types of Statistics

❖ Summary Statistics
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Statistics
● Statistics is the science of data. It involves collecting, classifying,
summarizing, organizing, analyzing, and interpreting numerical
information.
● It involves study and manipulation of data, including ways to
gather, review, analyze, and draw conclusions from data.

Source: i0.wp.com Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Statistics
Statistics involves:

• Collecting Data
• Ex: Survey

• Presenting Data
• Ex: Charts & Tables

• Characterizing Data
• Ex: Average

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why do we need to know about statistics
● To know how to properly present information.
● To know how to draw conclusions about populations based on sample
information.
● To know how to improve processes.
● To know how to obtain reliable forecasts.
● To find out why a process behaves the way it does.
● To find out why a process produces defective goods and services.
● To check various performance measures of a process.
● To prevent problems caused by various causes of variation in process.
● To analyze the real world.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Statistics

• Economics • Engineering
• Forecasting • Construction
• Demographics • Materials

• Sports • Business
• Individual & Team • Consumer Preferences
Performance • Financial Trends
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Processes of statistics
Statistics involves 2 main processes:

1. Describing sets of data.

2. Drawing conclusions (making estimates, decisions, predictions, etc.) about
sets of data based on sampling.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Population and Sample

POPULATION
A population is the entire collection of objects or outcomes about which
information is sought.

SAMPLE
A sample is a subset of a population, containing the objects or outcomes that
are actually observed.

Source: sigmamagic.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample Statistic and Population Parameter
➔Sample statistic:
● It is a numerical measurement describing some characteristic of a sample.
● Example: sample average, median, sample standard deviation, and percentiles.
➔Population parameter:
● It is a numerical measurement describing some characteristic of a population.
● Example: mean and variance of a population are population parameters.

Sources: youtube..com,
Slide Courtesy:Dr.Uma Pinkmonkey.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Taxonomy of Statistics

Source: image.slidesharecdn.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Branches of Statistics
The study of statistics has two major branches:
1) Descriptive statistics: involves organization, summarization, and display
of data.
2) Inferential statistics: involves using a sample to draw conclusions about
a population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics
● Descriptive statistics are methods for organizing and
summarizing data.
● Descriptive statistics utilizes numerical and graphical methods
to look for patterns in a data set, to summarize the
information revealed in a data set and to present that
information in a convenient form.
● A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic.
● For example, tables or graphs are used to organize data, and
descriptive values such as the average score are used to
summarize data.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics

■ Purpose: To describe data


■ Collect Data
■ e.g. Survey

■ Present Data
■ e.g. Tables and graphs

■ Characterize Data
■ e.g. Sample mean

Source: luminousmen.com/post/ Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics
Types of descriptive statistics:
■ Organize Data
■ Tables
■ Graphs

■ Summarize Data
■ Central Tendency
■ Variation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics
➔ Organizing Data
◆ Tables
● Frequency Distributions
● Relative Frequency Distributions
◆ Graphs
● Bar Chart or Histogram
● Stem and Leaf Plot
● Frequency Polygon
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics
➔ Summarizing Data:

■ Central Tendency (or Groups’ “Middle Values”)


■ Mean
■ Median
■ Mode

■ Variation (or Summary of Differences Within Groups)


■ Range
■ Interquartile Range
■ Variance
■ Standard Deviation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why is Descriptive Statistics used?

Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics Examples

Source: WPP Kantar media


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Inferential Statistics
● Inferential statistics utilizes sample data to make
estimates, decisions, predictions or other generalizations
about a larger set of data.
● There are two main areas of inferential statistics:
1. Estimating parameters: This means taking a statistic
from the sample data (for example the sample mean)
and using it to say something about a population
parameter (for example the population mean).
2. Hypothesis tests: This is where sample data can be
used to answer research questions. For example, one
might be interested in knowing if a new cancer drug is
effective; or if breakfast helps children perform better
in schools.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Inferential Statistics

■ Purpose: Make decisions about population characteristics.

■ Inferential statistics involves:
■ Estimation: e.g. Population Parameters
■ Hypothesis Testing

Source: luminousmen.com/post/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why is Inferential Statistics used?
Suppose you want to know the mean income of the subscribers of
Netflix.
● Mean (µ) — a parameter of a population.
● You draw a random sample of 100 subscribers and determine
that their mean income is $27,500.
● Mean( x̅ ) = $27,500 (a summary statistic).
● Conclusion : You conclude that the population mean income μ
is likely to be close to $27,500 as well.
● This is an example of statistical inference.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Inferential Statistics examples
● You randomly select a sample of 11th graders in your state
and collect data on their SAT scores and other characteristics.
You can use inferential statistics to make estimates and test
hypotheses about the whole population of 11th graders in
the state based on your sample data.

● To find out the average salary of IT engineers across the country:
We can select a predefined number of IT engineers from a particular
city, say Mumbai. We can gather data about their salaries much more
easily and then use the data to estimate the average income of IT
engineers across the country.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Descriptive Statistics vs Inferential Statistics

Descriptive Statistics (describing data)
• Organize
• Summarize
• Simplify
• Presentation of data

Inferential Statistics (making predictions)
• Generalize from samples to population
• Hypothesis testing
• Relationships among variables

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
Q1. In a recent study, volunteers who had less than 6 hours of
sleep were four times more likely to answer incorrectly on a
science test than were participants who had at least 8 hours of
sleep. Decide which part is the descriptive statistic and what
conclusion might be drawn using inferential statistics.

Ans: The statement “four times more likely to answer incorrectly” is a
descriptive statistic.
An inference drawn from the sample is that all individuals who sleep less
than 6 hours are more likely to answer science questions incorrectly than
individuals who sleep at least 8 hours.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
Q2. A burger outlet wanted to perform market research to
determine what type of chicken burgers their customers liked.
The outlet is researching to figure out the favourite tastes of
their customers to provide better services and dishes to the
customers. The outlet gathered a sample of 100 customers, drawn from
different age groups and from regular nearby customers of the outlet.
The outlet was able to determine that 80% of the customers liked their
chicken burgers to be spicy and crispy while the rest liked them
non-crunchy and non-spicy.
What type of statistics was applied to arrive at that conclusion?

Ans: Inferential statistics


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of Descriptive Statistics

[Figure: types of descriptive statistics, including interquartile range and variance. Source: geeksforgeeks]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency
● There are three different types of 'average'.
● These are the mean, the median and the mode.
● They are used by statisticians as a way of summarizing where the
‘centre’ of the data is.

Source: sixsigma-institute.org Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mean
• Mean is the arithmetic average computed by summing all the values
in the dataset and dividing the sum by the number of data values.
• The population mean is represented by Greek letter µ.
• For a finite data set with measurement values X1, X2, …, Xn
(a set of n numbers), it is defined by the formula:

a) Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$, where N is the population size

b) Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where n is the sample size

Source: sixsigma-institute.org Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Weighted mean
● Weighted mean is an average where certain values of the
data set contribute more to the mean value.
● For a finite data set with measurement values X1, X2, …, Xn
(a set of n numbers) and the corresponding weights w1, w2, …, wn,
it is defined by the formula:

$$\bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$
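A minimal sketch of the weighted mean, computed as the sum of each value times its weight divided by the sum of the weights. The function name `weighted_mean` and the example scores and weights are hypothetical, chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical example: two test scores, the second counting three times as much.
scores = [80, 90]
weights = [1, 3]
print(weighted_mean(scores, weights))  # (80*1 + 90*3) / 4 = 87.5
```

When all weights are equal, this reduces to the ordinary arithmetic mean.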
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Trimmed mean
● The trimmed mean is computed by arranging the sample
values in order, “trimming” an equal number of them from
each end, and computing the mean of those remaining.
● If p% of the data are trimmed from each end, the resulting
trimmed mean is called the “p% trimmed mean”.
● There are no hard-and-fast rules on how many values to trim.
● The most commonly used trimmed means are the 5%, 10%,
and 20% trimmed means.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Trimmed mean
● If the sample size is denoted by n, and a p% trimmed mean is
desired, the number of data points to be trimmed from each end is
np/100, rounded to the nearest whole number if necessary.
● It is used to reduce the effects of outliers on the calculated average.
● This method is best suited for data with large, erratic deviations or
extremely skewed distributions.

Source: exceluser.com, MathBitsNotebook


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mean
➔ Advantages:
● It takes into account all the available information.
● It can be combined with means of other groups to give the overall mean.
● Easy and quick way to represent the entire data values by a single or unique
number due to its straightforward method of calculation.
● Each data set has a unique mean value.
➔ Disadvantages:
● It is a very sensitive measure.
● Thus, its value is easily affected by extreme values known as the outliers.
● It can only be used on interval or ratio data.

Source: slideshare.net Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median
● Median is the value separating the higher half from the lower half of
a data sample, a population, or a probability distribution.
● It is the middle number in a sorted, ascending or descending, list of
numbers and can be more descriptive of that data set than the
average.
● For a data set, it may be thought of as "the middle" value.
● The basic feature of the median in describing data compared to the
mean (often simply described as the "average") is that it is not
affected by a small proportion of extremely large or small values, and
therefore provides a better representation of a "typical" value.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median
➔ Process of calculating median:
1. Arrange all the values of the data set in ascending order:
X1, X2, X3, …, Xn
2. Find the middle position.
3. If n is odd, the element at the middle position is the median,
i.e. median = value of the ((n+1)/2)th element.
4. If n is even, the median is the average of the two elements at the
middle positions, i.e.
median = (value of the (n/2)th element + value of the (n/2 + 1)th element)/2
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median
Odd no. of elements Even no. of
elements

Source: Chilimath Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median-example
Consider the data given below:
A simple random sample of five men is chosen from a large population of
men, and their heights are measured. The five heights (in inches) are
65.51, 72.30, 68.31, 67.05, 70.68.
Calculate the median.

Ans:
To calculate the median, we need to put the numbers in order and find the middle
value.
The five heights, arranged in increasing order, are
65.51 67.05 68.31 70.68 72.30.
n=5
● The sample median is the middle number, which is 68.31.
● Half of the other values in the list are below 68.31 and half are above 68.31.
Slide Courtesy:Dr.Uma
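The median procedure (sort, then take the middle element for odd n or the average of the two middle elements for even n) can be sketched as follows. The function name `median` is an illustrative choice; the first data set is the five heights from the example above, and the second is a made-up even-sized list.

```python
def median(values):
    """Median by sorting and picking the middle position(s)."""
    ordered = sorted(values)  # arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2  # 0-based index of the upper middle element
    if n % 2 == 1:
        return ordered[mid]  # the ((n+1)/2)th element in 1-based counting
    # even n: average the (n/2)th and (n/2 + 1)th elements
    return (ordered[mid - 1] + ordered[mid]) / 2

heights = [65.51, 72.30, 68.31, 67.05, 70.68]
print(median(heights))       # 68.31
print(median([3, 1, 4, 2]))  # (2 + 3) / 2 = 2.5
```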
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Median
➔ Advantages:
● Not affected by the outliers in the data set.
● An outlier is a data point that is radically “distant” or “away” from
common trends of values in a given set.
● It does not represent a typical number in the set.
● The concept of the median is intuitive and thus can easily be
explained as the center value.
● Each set has a unique median value.
➔ Disadvantages:
● Its value is perceived as it is.
● It cannot be utilized for further algebraic treatment.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode
● The mode is the value that appears most often in a set of data values.
● Like the statistical mean and median, the mode is a way of summarizing
a data set in a (usually) single number.
● To calculate the mode, we need to look at which value appears the
most often.
● Example: the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
● Given the list of data [1, 1, 2, 4, 4], its mode is not unique.
It has 2 modes: 1 and 4.
● A dataset, in such a case, is said to be bimodal, while a set with more
than two modes may be described as multimodal.
● Empirical formula (for moderately skewed distributions): Mode ≈ 3 × Median − 2 × Mean
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode-example

Mode = 15k

Source: slideshare.net
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode-example
Consider the data given below:

The values 3 and 4 appear the most number of times in the above data.
Since the above data has 2 modes, it is bimodal.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Mode
➔ Advantages:
● Quick and easy to compute.
● Unaffected by extreme values.
● Can be used at any level of measurement.
● Useful to find the most “popular” or common item. This includes data
sets that do not involve numbers.
➔ Disadvantages:
● It is a terminal statistic.
● A given subgroup could make this measure unrepresentative of the
population’s centre.
● If the set contains no repeating values, the mode is irrelevant.
● In contrast, if there are many values that have the same count, then
mode can be meaningless.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Questions
Alex did a survey of how many games each of his 20 friends owned, and
got this:
9, 15, 11, 12, 3, 5, 10, 20, 14, 6, 8, 8, 12, 12, 18, 15, 6, 9, 18, 11
Find the mean, median and mode.

Ans:
Sorting in ascending order:
3, 5, 6, 6, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 14, 15, 15, 18, 18, 20
● Mean = 222/20 = 11.1
● Median = (11+11)/2 = 11
● Mode = 12
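The three answers can be cross-checked with Python's standard `statistics` module. This is a quick verification sketch; the variable name `games` is an illustrative choice.

```python
import statistics

# Number of games owned by each of Alex's 20 friends.
games = [9, 15, 11, 12, 3, 5, 10, 20, 14, 6, 8, 8, 12, 12, 18, 15, 6, 9, 18, 11]

print(statistics.mean(games))       # 11.1
print(statistics.median(games))     # 11.0 (average of the 10th and 11th sorted values)
print(statistics.multimode(games))  # [12]
```

`multimode` returns every most-frequent value, so a bimodal data set would give a list with two entries.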
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Skewed and Symmetric distributions
● Skewness is a measure of the asymmetry of a distribution about
its mean.
● The skewness value can be positive, zero, negative, or undefined.
● Symmetric Distribution: A symmetric distribution is one where the left
and right hand sides of the distribution are roughly equally balanced
around the mean.
● In symmetric distributions with a single peak, the mean, median, and
mode are the same.
● Skewed Distribution: A skewed distribution is one where the left and
right hand sides of the distribution are not balanced around the mean.
● In skewed data, the mean and median lie further toward the skew than
the mode.
● The greater the distance between the mean and the median, the greater
the skewness of the distribution.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Skewed and Symmetric distributions

Source:www.slideshare.net Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Skewed and Symmetric distributions

Left skewed Right skewed


Source: https://www.fromthegenesis.com/skewness/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Skewed and Symmetric distributions
● Distribution of a variable: tells us what values the variable takes
and how often it takes these values.
● Shape: It is the “shape” of the distribution of the data.
● If mean = median = mode, the shape of the distribution is
symmetric.
● If mode < median < mean, the distribution trails to the right and is
positively skewed.
● If mean < median < mode, the distribution trails to the left and is
negatively skewed.
● Distributions of various “shapes” have different properties and
names such as the “normal” distribution, which is also known
as the “bell curve” (among mathematicians it is called the
Gaussian distribution)
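The rule of thumb relating the mean and median to the direction of skew can be sketched as a small helper. The function name `skew_direction` and the sample data are hypothetical, and comparing only the mean with the median is a simplifying assumption: real data rarely give exact equality, so this is a rough indicator rather than a formal skewness measure.

```python
import statistics

def skew_direction(values):
    """Classify the shape by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positively skewed (tail to the right)"
    if mean < median:
        return "negatively skewed (tail to the left)"
    return "roughly symmetric"

# Hypothetical data sets for illustration.
print(skew_direction([1, 2, 2, 3, 12]))     # mean 4 > median 2
print(skew_direction([1, 10, 11, 11, 12]))  # mean 9 < median 11
print(skew_direction([1, 2, 3, 4, 5]))      # mean 3 = median 3
```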
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency

Source:www.slideshare.net Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency
Various central tendency measures can be applied on different types of data.
• Quantitative data:
• Mode – the most frequently occurring observation
• Median – the middle value in the data
• Mean – arithmetic average

• Qualitative data:
• Mode – always appropriate
Ex: the most common colour
• Mean – never appropriate
Ex: an "average value" of the colour yellow has no meaning

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Question
For the following data
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Compute the mean, median, and the 5%, 10%, and 20% trimmed
means.
Solution:
● The mean is found by averaging together all 24 numbers, which
produces a value of 195.42.
● The median is the average of the 12th and 13th numbers, which is
(191 + 223)/2 = 207.00.
● To compute the 5% trimmed mean, we must drop 5% of the data
from each end.
● This comes to (0.05)(24) = 1.2 observations.
● We round 1.2 to 1, and trim one observation off each end.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of central tendency: Question
● The 5% trimmed mean is the average of the remaining 22
numbers: (75 + 79 + ··· + 274 + 384)/22 = 190.45
● To compute the 10% trimmed mean, round off (0.1)(24) =
2.4 to 2. Drop 2 observations from each end, and then
average the remaining 20: (79 + 80 + ··· + 254 + 274)/20 =
186.55
● To compute the 20% trimmed mean, round off (0.2)(24) =
4.8 to 5.
● Drop 5 observations from each end, and then average the
remaining 14: (105 + 126 + ··· + 242 + 245)/14 = 194.07
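The trimmed-mean calculation above can be reproduced with a short function. This is a minimal sketch; the name `trimmed_mean` is an illustrative choice, and it relies on Python's built-in `round`, which rounds exact halves to the nearest even integer (none of the cases here is an exact half).

```python
def trimmed_mean(values, percent):
    """Mean after dropping round(n * percent / 100) values from each end."""
    ordered = sorted(values)
    n = len(ordered)
    k = round(n * percent / 100)  # observations dropped from each end
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

for p in (5, 10, 20):
    print(p, round(trimmed_mean(data, p), 2))  # 190.45, 186.55, 194.07
```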
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
When to use mean, median and mode?
TYPE OF VARIABLE                 BEST MEASURE OF CENTRAL TENDENCY
Nominal                          Mode
Ordinal                          Median
Interval / Ratio (not skewed)    Mean
Interval / Ratio (skewed)        Median

Source: sixsigma-institute.org, statistics.laerd.com Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread/Dispersion
● In statistics, the measures of dispersion help to interpret the
variability of data.
● They show how homogeneous or heterogeneous the data are.
● In simple terms, they show how squeezed or scattered the variable is.
● There are two main types of dispersion measures in statistics:
(i) Absolute Measure of Dispersion
(ii) Relative Measure of Dispersion
Source: image.slidesharecdn.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread/Dispersion
● Absolute Measure of Dispersion:
It contains the same unit as the original data set. Absolute
dispersion method expresses the variations in terms of the average
of deviations of observations like standard or mean deviations. It
includes range, standard deviation, quartile deviation, etc.
● Relative Measure of Dispersion:
The relative measures of dispersion are used to compare the
distribution of two or more data sets. This measure compares
values without units. Common relative dispersion methods include:
Coefficient of Range, Coefficient of Variation, Coefficient of
Standard Deviation, Coefficient of Quartile Deviation, Coefficient of
Mean Deviation.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Range
● Range is the most common and easily understandable measure
of dispersion.
● It is the difference between two extreme observations of the
data set.
● If X max and X min are the two extreme observations then

Range = X max – X min

Here, range = 13 - 1 = 12

Source: Chilimath.com Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Range
Class A Class B

Observations:
Since the range of Class A is smaller than in Class B, can we claim that
the age distribution in Class A is more clustered (closely related) than in
Class B? In other words, are the ages listed in Class A more uniform than
in Class B?
Source: Chilimath.com Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Range
Range Can Be Misleading:
● The range can sometimes be misleading when there are
extremely high or low values.
● Example: {8, 11, 5, 9, 7, 6, 3616}
Lowest value: 5
Highest value: 3616
● So the range is 3616 - 5 = 3611.
● The single value of 3616 makes the range large, but most values
are around 10.
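A small sketch of how a single outlier dominates the range, using the data set from this slide. The function name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

data = [8, 11, 5, 9, 7, 6, 3616]
print(value_range(data))  # 3616 - 5 = 3611

# Dropping the single extreme value shows how tightly the rest are clustered.
without_outlier = [x for x in data if x != 3616]
print(value_range(without_outlier))  # 11 - 5 = 6
```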
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Range
➔ Advantages:
● It is the simplest of the measure of dispersion
● Easy to calculate
● Easy to understand
● Independent of change of origin
➔ Disadvantages:
● It is based on only the two extreme observations, so it is strongly
affected by fluctuations in them.
● A range is not a reliable measure of dispersion
● Dependent on change of scale
● It can drastically be affected by outliers (values that are not
typical as compared to the rest of the elements in the set).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread
When presenting or analysing measurements of a continuous
variable it is sometimes helpful to group subjects into several
equal groups.

For example, to create four equal groups we need the values
that split the data such that 25% of the observations are in
each group.

The cut off points are called quartiles, and there are three of
them (the middle one also being called the median).

Likewise, we use two tertiles to split data into three groups,
four quintiles to split them into five groups, and so on.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread

The general term for such cut off points is quantiles;
other values likely to be encountered are deciles,
which split data into 10 parts,
and centiles, which split the data into 100 parts
(also called percentiles).
Values such as quartiles can also be expressed as
centiles; for example, the lowest quartile is also the
25th centile and the median is the 50th centile.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quintiles
● A quintile is a statistical value of a data set that represents
20% of a given population, so the first quintile represents the
lowest fifth of the data (1% to 20%); the second quintile
represents the second fifth (21% to 40%) and so on.
Example:
● Quintiles are used to create cut-off points for a given
population; a government-sponsored socio-economic study
may use quintiles to determine the maximum wealth a
family could possess in order to belong to the lowest quintile
of society. This cut-off point can then be used as a
prerequisite for a family to receive a special government
subsidy aimed to help society's less fortunate.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measure of Spread:Percentile
● A percentile is a comparison measure between a particular value and
the values of the rest of the data set.
● It shows the percentage of values that a particular element has
surpassed.
● For example, if you score 75 points on a test, and are ranked in the
85th percentile, it means that the score 75 is higher than 85% of the
scores.
● The percentile rank is calculated using the formula
R = (P/100) × (N+1), where P is the desired percentile and N is the
number of data points.
● The pth percentile of a sample, for a number p between 0 and 100,
divides the sample such that
○ p% of the sample values are less than the pth percentile
○ (100 − p)% are greater than the pth percentile
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Percentile
Steps to calculate the percentile rank:
1. Order the n sample values from smallest to largest.
2. Compute the quantity (P/100)(n+1), where n is the sample
size.
3. If the above quantity is an integer, the sample value in this
position is the percentile.
4. Otherwise, average the two sample values at the integer positions
immediately before and after the quantity obtained in step 2.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Percentile example
If the scores of a set of students in a math test are 25, 7, 9, 13, 2 and 8 what is
the 15th percentile and 75th percentile?

Ans: Arrange the scores in ascending order and assign ranks from
1 (the lowest score) to 6 (the highest score):
Score: 2 7 8 9 13 25
15th percentile:
R = (P/100)(N+1)
= (15/100)(6+1)
= 1.05 (not an integer)
Percentile value = (1st element + 2nd element)/2
= (2+7)/2
= 4.5, which is the 15th percentile
75th percentile:
R = (75/100)(6+1)
= 5.25 (not an integer)
Percentile value = (5th element + 6th element)/2
= (13+25)/2
= 19
Thus, the score 19 is the 75th percentile.
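The percentile steps can be sketched in code using the (P/100)(n+1) position rule, averaging the two neighbouring values when the position is not an integer. The function name `percentile` is illustrative, and clamping positions that fall outside 1..n is an assumption added here, since the slides do not say what to do in that case.

```python
def percentile(values, p):
    """p-th percentile using the (p/100)(n+1) position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100  # 1-based position in the sorted sample
    lower = int(position)         # integer position at or just below
    upper = lower if position == lower else lower + 1
    # Assumption: positions outside 1..n are clamped to the nearest valid position.
    lower = min(max(lower, 1), n)
    upper = min(max(upper, 1), n)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

scores = [25, 7, 9, 13, 2, 8]
print(percentile(scores, 15))  # position 1.05 -> (2 + 7) / 2 = 4.5
print(percentile(scores, 75))  # position 5.25 -> (13 + 25) / 2 = 19.0
```

The same function gives the quartiles by calling it with p = 25, 50 and 75. Library routines generally interpolate between neighbouring values instead of averaging them, so their results can differ slightly from this rule.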
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● Quartiles are the values that divide a list of numbers into
quarters.
● Quartiles are obtained by first putting the list of numbers in
order and then cutting the list into four equal parts.
● The Quartiles are at the "cuts" in the data.
● The first quartile, (Q1) is the middle number between the
smallest number and the median of the data.
● The second quartile, (Q2) is the median of the data set.
● The third quartile, (Q3) is the middle number between the
median and the largest number.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● The first quartile is the 25th percentile

● The median is the 50th percentile

● The third quartile is the 75th percentile
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
➔ The First Quartile:
● The first quartile is the point which gives us 25% of the area to
the left of it and 75% to the right of it.
● This means that 25% of the observations are less than or equal
to the first quartile and 75% of the observations are greater than or
equal to the first quartile.
● The first quartile is also called the 25th percentile.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● To find the first quartile, compute the value 0.25(n +1).
● If this is an integer, then the sample value in that position is the
first quartile.
● If not, then take the average of the sample values on either side
of this value.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
➔ The Second Quartile or median:
● It is easy to see how to divide the area in Figure into two equal
parts, since the graph is symmetric.
● The point which gives us 50% of the area to the left of it and
50% to the right of it is called the second quartile or median
● Second quartile is calculated using the value 0.5(n+1)

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
➔ The Third Quartile:
● The third quartile is the point which gives us 75% of the area to the
left of it and 25% of the area to the right of it.
● This means that 75% of the observations are less than or equal to the
third quartile and 25% of the observations are greater than or equal to
the third quartile.
● The third quartile is also called the 75th percentile.
● The third quartile is computed in the same way, except that the value
0.75(n+1) is used.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartile Summary

Source: https://medium.com/analytics-vidhya/descriptive-statistics-in-data-science-with-illustrations-in-python-efd5ccc152f1 Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartile example

Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range
● Interquartile range is the distance or range between the 25th
percentile and the 75th percentile.
● That is, it quantifies the difference between the third and first
quartiles.
● Interquartile Range = Upper Quartile(Q3) – Lower Quartile(Q1)
IQR = Q3 –Q1

Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range

Source: sphweb.bumc.bu.edu/
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range
➔ Steps to find IQR :

1. Arrange the data scores in ascending order.
2. Find the median of the data set (the number in the middle).
3. Find the median of the lower half of the scores (Q1).
4. Find the median of the upper half of the scores (Q3).

Note: If the number of scores is even, the median is the average of the
two middle scores.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range example

Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range question
For the following data set, calculate the quartiles and find the interquartile range.

The following numbers represent the time in minutes that twelve employees took
to get to work on a particular day.

18 34 68 22 10 92 46 52 38 29 45 37
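One way to check an answer to this question is a short Python sketch of the median-of-halves steps listed earlier. The function name `quartiles` is a hypothetical helper, not part of the course material, and other quartile conventions can give slightly different values on other data sets.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)             # step 1: ascending order
    n = len(data)
    half = n // 2
    lower = data[:half]               # scores below the middle position
    upper = data[half + (n % 2):]     # skip the middle score when n is odd
    return median(lower), median(data), median(upper)

# Commute times (minutes) of the twelve employees from the question
times = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37]
q1, q2, q3 = quartiles(times)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For this data set the 0.25(n+1) and 0.75(n+1) positions fall between the same pairs of ordered values, so both approaches agree.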

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Variance
● Variance is a measure of the spread of the recorded values on a
variable.
● It is a measure of dispersion, meaning it is a measure of how far
a set of numbers is spread out from their average value.
● The larger the variance, the further the individual cases are from
the mean.

Mean
● The smaller the variance, the closer the individual scores are to
the mean.

Mean
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Variance
● It is the average of the squared distances
of the scores from the mean
(the mean squared deviation).
● Steps to calculate variance:
1. Find the mean of the given data values.
2. Subtract the mean from each data value.
3. Square each value obtained in step 2.
4. Find the sum of all the values obtained
in step 3.
5. Divide the result of step 4 by N (for a
population) or n − 1 (for a sample).
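The five steps can be sketched in Python as follows. The data values are made up for illustration, and `variance_by_steps` is a hypothetical helper; the standard library functions `statistics.pvariance` and `statistics.variance` compute the same quantities directly.

```python
from statistics import pvariance, variance

def variance_by_steps(values, sample=False):
    """Follow the five steps; divide by n - 1 when sample=True, else by N."""
    n = len(values)
    mean = sum(values) / n                       # step 1: mean
    deviations = [x - mean for x in values]      # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]       # step 3: square
    total = sum(squared)                         # step 4: sum
    return total / (n - 1 if sample else n)      # step 5: divide

scores = [2, 4, 4, 4, 5, 5, 7, 9]                # hypothetical data
pop_var = variance_by_steps(scores)              # population variance
samp_var = variance_by_steps(scores, sample=True)
print(pop_var, samp_var)
print(pvariance(scores), variance(scores))       # library cross-check
```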
Source: standard-deviation-calculator.com Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
● Standard deviation signifies the deviation of the elements of the data set
from the mean value of the distribution.
● It quantifies the amount of variation of a set of data values.
● It is a measure of the variability of a single item.
● The standard deviation does not decline as the sample size increases.
● The estimate of the standard deviation becomes more stable as the sample
size increases.

Source: exceluser.com,
MathBitsNotebook
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
● The larger the standard deviation, the greater the variation
around the mean.
● Standard deviation = 0 only when all values are the same (only when
you have a constant and not a “variable”).
● If you were to “rescale” a variable, the s.d. would change by
the same factor.
● Like the mean, the standard deviation will be inflated by an
outlier case value.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
Standard Deviation = Square root of Variance

Source: standard-deviation-calculator.com
DATA ANALYTICS
Measures of Spread: Standard Deviation example

Calculate Standard Deviation for the following discrete data:


Items 5 15 25 35
Frequency 2 1 1 3

Mean
x̄ = (5×2 + 15×1 + 25×1 + 35×3)/7 = (10 + 15 + 25 + 105)/7 = 155/7 ≈ 22.14

https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example

Calculate Standard Deviation for the following discrete data:

Items (x)   Frequency (f)   x̄       x − x̄     f(x − x̄)²
5           2               22.14   −17.14    587.76
15          1               22.14   −7.14     51.02
25          1               22.14   2.86      8.16
35          3               22.14   12.86     495.92
N = 7                                         ∑f(x − x̄)² = 1142.86
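The remaining step, taking the square root of the sum of squared deviations divided by N, can be checked with a few lines of Python. This is a sketch that assumes the population form (dividing by N = 7, as the table suggests); a sample standard deviation would divide by N − 1 instead.

```python
from math import sqrt

items = [5, 15, 25, 35]
freqs = [2, 1, 1, 3]

n = sum(freqs)                                           # N = 7
mean = sum(f * x for x, f in zip(items, freqs)) / n      # weighted mean
ss = sum(f * (x - mean) ** 2 for x, f in zip(items, freqs))
sigma = sqrt(ss / n)                                     # population standard deviation

print(round(mean, 2), round(ss, 2), round(sigma, 2))
```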

https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example

Calculate Standard Deviation for the following discrete data:

https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example
Calculate Standard Deviation for the following continuous data :
Items 0-10 10-20 20-30 30-40
Frequency 2 1 1 3

In case of a continuous series, the midpoint of each class is computed
as (lower limit + upper limit)/2.

https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example

Calculate Standard Deviation for the following data:

https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example

Calculate Standard Deviation for the following data:

https://www.tutorialspoint.com/statistics/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Practical Application for Understanding Variance and Standard Deviation
Even though we live in a world where we pay real dollars for goods and
services (not percentages of income), most American employers issue
raises based on percent of salary. Why do supervisors think the most fair
raise is a percentage raise?
Answer:
1) Because higher paid persons win the most money.
2) The easiest thing to do is raise everyone’s salary by a fixed percent.
If your budget went up by 5%, salaries can go up by 5%.
The problem is that the flat percent raise gives unequal increased
rewards.
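A small numerical sketch of this point, using made-up salaries: a 5% raise multiplies every salary by 1.05, so the standard deviation also grows by 5%, while a flat raise of the same amount for everyone leaves the spread unchanged. The salary figures and the 2000 flat amount are illustrative assumptions only.

```python
from statistics import pstdev

salaries = [30000, 50000, 90000]               # hypothetical salaries
percent_raise = [s * 1.05 for s in salaries]   # 5% raise for everyone
flat_raise = [s + 2000 for s in salaries]      # same amount for everyone

sd0 = pstdev(salaries)
sd_pct = pstdev(percent_raise)
sd_flat = pstdev(flat_raise)

# Individual increases under the percent raise are unequal
increases = [round(new - old) for new, old in zip(percent_raise, salaries)]
print(increases)
print(sd0, sd_pct, sd_flat)
```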
DATA ANALYTICS
References

Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU

Dr.Mamatha H R

Professor, Department of Computer Science


MATHEMATICS FOR COMPUTER
SCIENCE
Chebyshev’s Inequality
Dr. Mamatha H.R
Devika S Nair
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s inequality

● The mean of a random variable is a measure of the center of its
distribution, and the standard deviation is a measure of the
spread.

● Chebyshev’s inequality relates the mean and the standard
deviation by providing a bound on the probability that a
random variable takes on a value that differs from its mean by
more than a given multiple of its standard deviation.

● Specifically, the probability that a random variable differs from
its mean by k standard deviations or more is never greater than
1/k².

Theorem by Russian mathematician Pafnuty Chebyshev

Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality
Statement of Chebyshev’s Inequality

Chebyshev’s inequality states that at least 1 − 1/K² of data from a sample
must fall within K standard deviations from the mean, where K is any positive
real number greater than one.

Chebyshev’s Inequality is used to describe the percentage of values in
a distribution within an interval centered at the mean.

Source Image:ThoughtCo.com
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s inequality

Only the case k > 1 is useful. When k ≤ 1 the right-hand side 1/k² ≥ 1 and the
inequality is trivial, as all probabilities are ≤ 1.
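For reference, the inequality discussed on these slides is usually written as follows. This is the standard textbook form, with X a random variable with mean μ_X and standard deviation σ_X, and k > 0; the second line is the equivalent "at least" version used in the statement of the inequality.

```latex
P\bigl(|X - \mu_X| \ge k\,\sigma_X\bigr) \le \frac{1}{k^2}
\qquad\Longleftrightarrow\qquad
P\bigl(|X - \mu_X| < k\,\sigma_X\bigr) \ge 1 - \frac{1}{k^2}
```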

Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s inequality

Source: prepnuggets.com
MATHEMATICS FOR COMPUTER SCIENCE
Problems on Chebyshev’s inequality

Problem 1: The length of a rivet manufactured by a certain process has
mean μX = 50 mm and standard deviation σX = 0.45 mm. What is the
largest possible value for the probability that the length of the rivet is
outside the interval 49.1–50.9 mm?
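A possible way to work Problem 1: the interval endpoints are 0.9 mm from the mean, which is k = 0.9/0.45 = 2 standard deviations, so the bound is 1/k² = 1/4. The sketch below does the same arithmetic; `chebyshev_tail_bound` is a hypothetical helper name.

```python
def chebyshev_tail_bound(mean, sd, low, high):
    """Chebyshev upper bound on P(X outside [low, high]).

    Uses the distance from the mean to the nearer endpoint, so it is also
    valid (though looser) when the interval is not centred at the mean.
    """
    k = min(mean - low, high - mean) / sd
    if k <= 0:
        raise ValueError("the interval must contain the mean in its interior")
    return min(1.0, 1.0 / k ** 2)   # a probability can never exceed 1

bound = chebyshev_tail_bound(50, 0.45, 49.1, 50.9)
print(round(bound, 4))
```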
MATHEMATICS FOR COMPUTER SCIENCE
Problems on Chebyshev’s inequality

Problem-2:
MATHEMATICS FOR COMPUTER SCIENCE
Problems on Chebyshev’s inequality
MATHEMATICS FOR COMPUTER SCIENCE
More on Chebyshev’s inequality

● Because the Chebyshev bound is generally much larger than the
actual probability, it should only be used when the distribution of the
random variable is unknown.

● When the distribution is known, then the probability density function
or probability mass function should be used to compute probabilities.
MATHEMATICS FOR COMPUTER SCIENCE
Statement of Chebyshev’s inequality

● To illustrate the inequality, we will look at it for a few values of K:

○ For K = 2 we have 1 − 1/K² = 1 − 1/4 = 3/4 = 75%.
So Chebyshev’s inequality says that at least 75% of the data values of
any distribution must be within two standard deviations of the
mean.

○ For K = 3 we have 1 − 1/K² = 1 − 1/9 = 8/9 ≈ 89%.
So Chebyshev’s inequality says that at least 89% of the data values
of any distribution must be within three standard deviations of
the mean.
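These two cases can be checked empirically. The sketch below draws a skewed data set (an exponential sample, chosen here only as an illustrative assumption) and compares the observed fraction of values within K standard deviations of the mean with the guaranteed minimum 1 − 1/K². The guarantee holds for any data set; the observed fractions are typically well above it.

```python
import random
from statistics import fmean, pstdev

random.seed(0)                                   # fixed seed for repeatability
data = [random.expovariate(1.0) for _ in range(20000)]
mu, sigma = fmean(data), pstdev(data)

def fraction_within(k):
    """Fraction of data values strictly within k standard deviations of the mean."""
    return sum(abs(x - mu) < k * sigma for x in data) / len(data)

# Map K -> (observed fraction, Chebyshev lower bound)
results = {k: (fraction_within(k), 1 - 1 / k ** 2) for k in (2, 3)}
for k, (observed, bound) in results.items():
    print(k, round(observed, 3), round(bound, 3))
```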
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality-Practice problems
Problem 1

Computers from a particular company are found to last on average for three
years without any hardware malfunction, with standard deviation of two
months. At least what percent of the computers last between 31 months
and 41 months?
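One way to approach this problem, assuming nothing about the distribution beyond its mean and standard deviation: three years is 36 months, and both 31 and 41 months lie 5 months (2.5 standard deviations) from the mean. A short sketch of the arithmetic:

```python
mean_months = 3 * 12          # three years expressed in months
sd_months = 2

k = (41 - mean_months) / sd_months        # number of standard deviations
assert k == (mean_months - 31) / sd_months  # interval is centred at the mean

at_least = 1 - 1 / k ** 2     # Chebyshev lower bound on the proportion
print(k, round(at_least, 2))
```

So at least 84% of the computers last between 31 and 41 months.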
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality-Practice problems

Problem 2

What is the smallest number of standard deviations from the mean that
we must go if we want to ensure that we have at least 50% of the data
of a distribution?
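A sketch of one possible solution: setting 1 − 1/K² equal to the required proportion and solving for K gives K = 1/√(1 − proportion). The helper name `min_k_for_coverage` is hypothetical.

```python
from math import sqrt

def min_k_for_coverage(proportion):
    """Smallest K for which Chebyshev guarantees at least `proportion` of the data."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    return 1 / sqrt(1 - proportion)

k = min_k_for_coverage(0.5)
print(round(k, 3))   # square root of 2, about 1.414 standard deviations
```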
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality-Practice problems

Do It Yourself !!!

The length of a metal pin manufactured by a certain process has mean


50 mm and standard deviation 0.45mm.

What is the largest possible value for the probability that the length of
the metal pin is outside the interval [49.1 , 50.9] mm?
THANK YOU

Prof. Mamatha H.R



Devika S Nair

Department of Computer Science and Engineering


MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Sampling Distribution

Mamatha.H.R

Department of Computer Science and


Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution

• In inferential statistics, we want to use
characteristics of the sample (i.e. a statistic) to
estimate the characteristics of the population (i.e.
a parameter).
• If we obtain a random sample and calculate a sample
statistic from that sample, the sample statistic is a
random variable.
• The population parameters, however, are fixed. If
the statistic is a random variable, can we find the
distribution? The mean? The standard deviation?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

• The answer is yes! This is why we need to study the
sampling distribution of statistics. So what is a
sampling distribution?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

• The sampling distribution of a statistic is a probability
distribution based on a large number of samples of
size n from a given population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution of the Sample Mean
• In this example, the population is the weight of six
pumpkins (in pounds) displayed in a carnival "guess
the weight" game booth. You are asked to guess the
average weight of the six pumpkins by taking a
random sample without replacement from the
population.

Pumpkin A B C D E F
Weight (in 19 14 15 9 10 17
pounds)

Since we know the weights from the population, we can find the population mean.
μ=(19+14+15+9+10+17)/6=14 pounds
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

To demonstrate the sampling distribution,
let’s start by obtaining all of the possible samples of
size n = 2 from the population, sampling without
replacement.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Sample   Weight    x̄      Probability
A, B     19, 14    16.5   1/15
A, C     19, 15    17.0   1/15
A, D     19, 9     14.0   1/15
A, E     19, 10    14.5   1/15
A, F     19, 17    18.0   1/15
B, C     14, 15    14.5   1/15
B, D     14, 9     11.5   1/15
B, E     14, 10    12.0   1/15
B, F     14, 17    15.5   1/15
C, D     15, 9     12.0   1/15
C, E     15, 10    12.5   1/15
C, F     15, 17    16.0   1/15
D, E     9, 10     9.5    1/15
D, F     9, 17     13.0   1/15
E, F     10, 17    13.5   1/15

The table shows all the possible samples, the weights for the chosen pumpkins,
the sample mean and the probability of obtaining each sample. Since we are
drawing at random, each sample will have the same probability of being chosen.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

We can combine all of the values and create a table of the
possible values and their respective probabilities.

x̄            9.5   11.5  12.0  12.5  13.0  13.5  14.0  14.5  15.5  16.0  16.5  17.0  18.0
Probability  1/15  1/15  2/15  1/15  1/15  1/15  1/15  2/15  1/15  1/15  1/15  1/15  1/15

The table is the probability table for the sample mean
and it is the sampling distribution of the sample mean
weights of the pumpkins when the sample size is 2.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

The chance that the sample mean is exactly equal to the population mean is only 1 in 15,
which is very small.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
• Now that we have the sampling distribution of the sample
mean, we can calculate the mean of all the sample means. In
other words, we can find the mean (or expected value) of
all the possible x̄’s.
• The mean of the sample means is
μx̄ = ∑ x̄ᵢ p(x̄ᵢ)
= 9.5(1/15) + 11.5(1/15) + 12(2/15) + 12.5(1/15) + 13(1/15) + 13.5(1/15)
+ 14(1/15) + 14.5(2/15) + 15.5(1/15) + 16(1/15) + 16.5(1/15) + 17(1/15) + 18(1/15)
= 14
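The enumeration above can be reproduced with a short Python sketch. It lists every sample of size n drawn without replacement, tabulates the sample means with exact fractions, and confirms that the mean of the sample means equals the population mean of 14 pounds. The function name `sampling_distribution` is a hypothetical helper.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

weights = {"A": 19, "B": 14, "C": 15, "D": 9, "E": 10, "F": 17}

def sampling_distribution(n):
    """Exact distribution of the sample mean for samples of size n, without replacement."""
    samples = list(combinations(weights.values(), n))
    prob = Fraction(1, len(samples))          # every sample is equally likely
    dist = Counter()
    for sample in samples:
        dist[Fraction(sum(sample), n)] += prob
    return dict(sorted(dist.items()))

dist2 = sampling_distribution(2)
mean_of_means = sum(xbar * p for xbar, p in dist2.items())
for xbar, p in dist2.items():
    print(float(xbar), p)
print("mean of sample means:", mean_of_means)
```

Calling `sampling_distribution(5)` gives the six equally likely means for samples of size 5, again with mean 14.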
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Let's do the same thing as above but with sample size n = 5.
Sample Weights x¯ Probability
A, B, C, D, E 19, 14, 15, 9, 10 13.4 1/6

A, B, C, D, F 19, 14, 15, 9, 17 14.8 1/6

A, B, C, E, F 19, 14, 15, 10, 17 15.0 1/6

A, B, D, E, F 19, 14, 9, 10, 17 13.8 1/6

A, C, D, E, F 19, 15, 9, 10, 17 14.0 1/6

B, C, D, E, F 14, 15, 9, 10, 17 13.0 1/6


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

The sampling distribution is:

x̄            13.0  13.4  13.8  14.0  14.8  15.0
Probability  1/6   1/6   1/6   1/6   1/6   1/6

The mean of the sample means is
μx̄ = (1/6)(13 + 13.4 + 13.8 + 14.0 + 14.8 + 15.0) = 14 pounds
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

Population: 3, 5, 2, 1
Draw samples of size n = 3 without replacement

Possible samples
3, 5, 2
3, 5, 1
3, 2, 1
5, 2, 1
Each value of x-bar is equally likely, with probability 1/4.
[Figure: plot of p(x̄) against x̄, with each value at height 1/4]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Consider a population that consists of the numbers 1, 2, 3, 4 and 5
generated in a manner that the probability of each of those values
is 0.2 no matter what the previous selections were. This population
could be described as the outcome associated with a spinner such
as given below with the distribution next to it.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
If the sampling distribution for the means of
samples of size two is analyzed, it looks like
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

The original distribution and the sampling distribution
of means of samples with n = 2 are given below.

[Figure: two panels over the values 1 to 5: original distribution; sampling distribution, n = 2]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions

Sampling distributions for n=3 and n=4 were calculated and are
illustrated below. The shape is getting closer and closer to the
normal distribution.

[Figure: four panels over the values 1 to 5: original distribution; sampling distribution, n = 2; sampling distribution, n = 3; sampling distribution, n = 4]
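The sampling distributions for the spinner can be computed exactly by enumerating all equally likely outcomes of n independent spins. The sketch below is one way to do this; `mean_distribution` is a hypothetical helper name, and the printed probabilities show the mass concentrating around 3 as n grows.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [1, 2, 3, 4, 5]          # each value has probability 0.2 on every spin

def mean_distribution(n):
    """Exact distribution of the mean of n independent spins."""
    outcomes = list(product(values, repeat=n))
    prob = Fraction(1, len(outcomes))         # 5**n equally likely outcomes
    dist = Counter()
    for outcome in outcomes:
        dist[Fraction(sum(outcome), n)] += prob
    return dict(sorted(dist.items()))

for n in (1, 2, 3, 4):
    dist = mean_distribution(n)
    print(n, [f"{float(x):.2f}:{float(p):.3f}" for x, p in dist.items()])
```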
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Mean and Variance of a Sample Mean
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution-Example
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution-Example
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution of x̄

If a random sample of n measurements is selected from a
population with mean μ and standard deviation σ, the
sampling distribution of the sample mean x̄ will have mean μ
and standard deviation σ/√n.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution of x̄

Central Limit Theorem: If random samples of n
observations are drawn from a nonnormal population with
finite mean μ and standard deviation σ, then, when n is large, the
sampling distribution of the sample mean x̄ is approximately
normally distributed, with mean μ and standard deviation
σ/√n. The approximation becomes more accurate as n
becomes large.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why is this Important?

✔ The Central Limit Theorem also implies that the
sum of n measurements is approximately normal with
mean nμ and standard deviation σ√n.

✔ Many statistics that are used for statistical inference
are sums or averages of sample measurements.

✔ When n is large, these statistics will have
approximately normal distributions.

✔ This will allow us to describe their behavior and
evaluate the reliability of our inferences.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How Large is Large?

If the sampled population is normal, then the sampling
distribution of x̄ will also be normal, no matter
what the sample size.

When the sampled population is approximately
symmetric, the distribution of x̄ becomes approximately
normal for relatively small values of n.

When the sampled population is skewed, the sample
size must be at least 30 (a common rule of thumb) before the sampling
distribution of x̄ becomes approximately normal.
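A small simulation can illustrate these statements. The sketch below assumes a skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration), draws many samples of size n = 30, and compares the mean and standard deviation of the sample means with μ and σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)                     # fixed seed for repeatability
n, reps = 30, 5000                 # sample size and number of simulated samples
mu = sigma = 1.0                   # exponential(rate=1): mean 1, standard deviation 1

# One sample mean per simulated sample
means = [fmean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

est_mean = fmean(means)            # should be close to mu
est_sd = pstdev(means)             # should be close to sigma / sqrt(n)
print(round(est_mean, 3), round(est_sd, 3), round(sigma / sqrt(n), 3))
```

A histogram of `means` (for example with Matplotlib) would show a roughly bell-shaped curve even though the population itself is strongly skewed.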
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Central Limit Theorem
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Illustrations of Sampling Distributions

Symmetric normal like population


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Illustrations of Sampling Distributions

Skewed population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Illustrations of Sampling Distributions
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Finding Probabilities for the Sample Mean

✔ If the sampling distribution of x̄ is normal or
approximately normal, standardize or rescale the interval
of interest in terms of z = (x̄ − μ)/(σ/√n).

✔ Find the appropriate area using the z table.

Example: A random sample of size n = 16 from a normal distribution
with μ = 10 and σ = 8.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example

A soda filling machine is supposed to fill cans of soda with 12
fluid ounces. Suppose that the fills are actually normally
distributed with a mean of 12.1 oz and a standard deviation of 0.2
oz.
What is the probability that the average fill for a 6-pack of soda is
less than 12 oz?
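A sketch of one way to compute this probability: because the fills are normal, the mean of 6 fills is normal with mean 12.1 and standard deviation 0.2/√6, and the answer is the area to the left of 12. Python's `statistics.NormalDist` is used here in place of a z table.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.1, 0.2, 6

# Sampling distribution of the mean fill of a 6-pack
xbar = NormalDist(mu, sigma / sqrt(n))

z = (12.0 - mu) / (sigma / sqrt(n))   # z-score of 12 oz
p = xbar.cdf(12.0)                    # P(mean fill < 12 oz)
print(round(z, 2), round(p, 4))
```

The z-score is about −1.22, giving a probability of roughly 0.11.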
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Sampling Distribution of the Sample Proportion

✔ The Central Limit Theorem can be used to
conclude that the binomial random variable x is
approximately normal when n is large, with mean np
and variance npq.
✔ The sample proportion, p̂ = x/n, is simply a rescaling
of the binomial random variable x, dividing it by n.
✔ From the Central Limit Theorem, the sampling
distribution of p̂ will also be approximately
normal, with a rescaled mean and standard deviation.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Sampling Distribution of the Sample Proportion

✔ A random sample of size n is selected from a
binomial population with parameter p.
✔ The sampling distribution of the sample proportion,
p̂ = x/n, will have mean p and standard deviation √(pq/n),
where q = 1 − p.
✔ If n is large, and p is not too close to zero or one, the
sampling distribution of p̂ will be approximately
normal.
The standard deviation of p-hat is sometimes called
the STANDARD ERROR (SE) of p-hat.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Sampling Distribution of the Sample Proportion
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Sampling Distribution of the Sample Proportion
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Finding Probabilities for the Sample Proportion

✔ If the sampling distribution of p̂ is normal or
approximately normal, standardize or rescale the interval of
interest in terms of z = (p̂ − p)/√(pq/n).

✔ Find the appropriate area using the z table.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
The soda bottler in the previous example claims
that only 5% of the soda cans are underfilled.
A quality control technician randomly samples 200 cans
of soda. What is the probability that more than 10% of
the cans are underfilled?
n = 200
S: underfilled can
p = P(S) = 0.05
q = 0.95
np = 10, nq = 190: OK to use the normal approximation.
This would be very unusual, if indeed p = 0.05!
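The probability itself can be sketched with the normal approximation for the sample proportion: standardize 0.10 using mean p and standard error √(pq/n), then take the area to the right. This is an illustrative calculation without a continuity correction.

```python
from math import sqrt
from statistics import NormalDist

n, p = 200, 0.05
q = 1 - p

se = sqrt(p * q / n)                  # standard error of the sample proportion
z = (0.10 - p) / se                   # z-score of a 10% underfill rate
prob = 1 - NormalDist().cdf(z)        # P(sample proportion > 0.10)
print(round(se, 4), round(z, 2), round(prob, 4))
```

The z-score is about 3.24, so the probability is well under 0.001.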
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
Suppose 3% of the people contacted by phone are
receptive to a certain sales pitch and buy your product.
If your sales staff contacts 2000 people, what is the
probability that more than 100 of the people contacted
will purchase your product?
n = 2000, p = 0.03, np = 60, nq = 1940: OK to use the normal approximation
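This question can be worked on the count scale instead of the proportion scale. The sketch below approximates the number of buyers by a normal distribution with mean np and standard deviation √(npq), again without a continuity correction.

```python
from math import sqrt
from statistics import NormalDist

n, p = 2000, 0.03
mean = n * p                          # expected number of buyers, np = 60
sd = sqrt(n * p * (1 - p))            # sqrt(npq)

z = (100 - mean) / sd                 # z-score of 100 buyers
prob = 1 - NormalDist().cdf(z)        # P(more than 100 buyers)
print(round(sd, 2), round(z, 2), prob)
```

With a z-score above 5, more than 100 purchases would be extremely unlikely if the true rate is 3%.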
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
Let X denote the number of flaws in a 1 in. length of copper wire.
The probability mass function of X is presented in the following
table.
x P(X = x)
0 0.48
1 0.39
2 0.12
3 0.01
One hundred wires are sampled from this population. What is the
probability that the average number of flaws per wire in this
sample is less than 0.5?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
The population mean number of flaws is μ = 0.66, and the population
variance is σ² = 0.5244.
We need to find P(x̄ < 0.5).
The sample size is n = 100, which is a large sample. It follows from the
Central Limit Theorem that x̄ ∼ N(0.66, 0.005244).

The z-score of 0.5 is therefore
z = (0.5 − 0.66)/√0.005244 = −2.21

From the z table, the area to the left of −2.21 is 0.0136.

Therefore P(x̄ < 0.5) = 0.0136, so only 1.36% of samples of size 100
will have fewer than 0.5 flaws per wire.
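The numbers in this solution can be reproduced from the probability mass function. The sketch below computes μ and σ² from the table, builds the approximate normal distribution of the sample mean, and evaluates the area to the left of 0.5 using `statistics.NormalDist` rather than a z table.

```python
from math import sqrt
from statistics import NormalDist

pmf = {0: 0.48, 1: 0.39, 2: 0.12, 3: 0.01}      # P(X = x) from the table

mu = sum(x * p for x, p in pmf.items())                 # population mean
var = sum((x - mu) ** 2 * p for x, p in pmf.items())    # population variance

n = 100
xbar = NormalDist(mu, sqrt(var / n))    # approximate distribution of the sample mean
prob = xbar.cdf(0.5)                    # P(sample mean < 0.5)
print(round(mu, 2), round(var, 4), round(prob, 4))
```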
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
At a large university, the mean age of the students is 22.3 years, and the
standard deviation is 4 years. A random sample of 64 students is drawn.
What is the probability that the average age of these students is greater
than 23 years?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
Let X1, . . . , X64 be the ages of the 64 students in the sample.

Find P(x̄ > 23).

Now the population from which the sample was drawn has mean
μ = 22.3 and variance σ² = 16.

The sample size is n = 64.

It follows from the Central Limit Theorem that x̄ ∼ N(22.3, 0.25).

The z-score for 23 is
z = (23 − 22.3)/√0.25 = 1.40

From the z table, the area to the right of 1.40 is 0.0808.

Therefore P(x̄ > 23) = 0.0808.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
The manufacture of a certain part requires two different machine
operations. The time on machine 1 has mean 0.4 hours and standard
deviation 0.1 hours. The time on machine 2 has mean 0.45 hours and
standard deviation 0.15 hours. The times needed on the machines are
independent. Suppose that 65 parts are manufactured. What is the
distribution of the total time on machine 1? On machine 2? What is the
probability that the total time used by both machines together is between
50 and 55 hours?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example

Let X1, . . . , X65 represent the times of the 65 parts on machine 1.

The population from which this sample was drawn has mean μX = 0.4
and standard deviation σX = 0.1.

Let SX = X1 + ··· + X65 be the total time on machine 1.

It follows from the Central Limit Theorem that

SX ∼ N(65μX, 65σX²) = N(26, 0.65)


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example

Let Y1, . . . , Y65 represent the times of the 65 parts on machine 2.

The population from which this sample was drawn has mean μY = 0.45
and standard deviation σY = 0.15.

Let SY = Y1 + ··· + Y65 be the total time on machine 2.

It follows from the Central Limit Theorem that

SY ∼ N(65μY, 65σY²) = N(29.25, 1.4625)


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example
Let T = SX + SY represent the total time on both machines.

Since
SX ∼ N(26, 0.65), SY ∼ N(29.25, 1.4625), and SX and SY are
independent, it follows that
μT = 26 + 29.25 = 55.25, σT² = 0.65 + 1.4625 = 2.1125, and

T ∼ N(55.25, 2.1125)

To find P(50 < T < 55) we compute the z-scores of 50 and of 55.

For 50: z = (50 − 55.25)/√2.1125 = −3.61

For 55: z = (55 − 55.25)/√2.1125 = −0.17


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example

The area to the left of z = −3.61 is 0.0002.


The area to the left of z = −0.17 is 0.4325.
The area between z = −3.61 and z = −0.17 is 0.4325 − 0.0002 =
0.4323.

The probability that the total time used by both machines together is
between 50 and 55 hours is 0.4323
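The whole calculation can be checked with `statistics.NormalDist`, which supports adding independent normal random variables. The sketch below builds the two totals and their sum; the result differs slightly from the z-table answer because the z-scores are not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

n = 65
# Totals on each machine: mean n*mu, standard deviation sqrt(n)*sigma
s_x = NormalDist(n * 0.40, sqrt(n) * 0.10)     # machine 1: N(26, 0.65)
s_y = NormalDist(n * 0.45, sqrt(n) * 0.15)     # machine 2: N(29.25, 1.4625)

total = s_x + s_y              # sum of independent normals: variances add
prob = total.cdf(55) - total.cdf(50)
print(round(total.mean, 2), round(total.variance, 4), round(prob, 4))
```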
THANK YOU

Dr.Mamatha H R

Department of Computer Science and Engineering


MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS UE23MA242A
Unit 1:Introduction
Mamatha.H.R

Department of Computer Science and


Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Course content
Unit 1: Applications of Probability Distributions and Principles of Point Estimation
Introduction, Motivating Examples and Scope. Statistics: Introduction, Types of Statistics, Types of Data, Types
of Experiments – Controlled and Observational study, Sampling: Sampling Methods, Sampling Errors, Case
Study. Chebyshev's inequality, Normal Probability Plots, Introduction to Generation of Random Variates and
mention the types, Acceptance-Rejection method, Sampling Distribution, The Central Limit Theorem and
Applications, Principles of Point Estimation - Mean Squared Error for Bernoulli, Binomial, Poisson, Normal,
Maximum Likelihood Estimate for Bernoulli, Binomial, Poisson, Normal and Case Study. Introduction to
multivariate normal distribution, MAP distribution.

Self-Learning: Generation of Random Variates -Inverse Transform Method. 16 Hours

Unit 2: Confidence Intervals and Hypothesis Testing

Confidence Intervals: Interval Estimates for Mean of Large and Small Samples, Student's t Distribution, Interval
Estimates for Proportion of Large and Small Samples, Confidence Intervals for the Difference between Two
Means, Interval Estimates for Paired Data. Factors affecting Margin of Error, Hypothesis Testing for Population
Mean and Population Proportion of Large and Small Samples, Drawing conclusions from the results of
Hypothesis tests, Case Study.

Self-Learning: Confidence interval for difference between two proportions. 12 Hours
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Course content
Unit 3: Distribution Free Tests and Multiple Linear Regression

Distribution Free Tests, Chi-squared Test, Fixed Level Testing, Type I and Type II Errors, Power of a Test, Factors
Affecting Power of a Test. Simple Linear Regression: Introduction, Correlation, the Least Square Lines, Predictions
using regression models - Uncertainties in Regression Coefficients, Checking Assumptions and transforming data,
Introduction to the Multiple Regression Model, Case Study.
Self-Learning: F test for equality of Variance.
14 Hours
Unit 4: Engineering optimization
Introduction to Optimization-Based Design, Modelling Concepts, Unconstrained Optimization, Discrete Variable
Optimization, Genetic and Evolutionary Optimization, Constrained Optimization.
Self-Learning: Mathematical concepts of objective function, Constraints and Decision variables.

14 Hours
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Course content: Applications
Unit 1: Applications:
1. Poisson distribution, calculation of number of calls received in a specified time duration in call centers.
2. Variance, standard deviation, identifying the customer satisfaction in online shopping
3. Central limit theorem, Load Balancing in distributed systems and internet traffic prediction
4. Sampling mean, Estimating database query response times
Unit 2: Applications:
1. t-distribution, confidence interval, students’ performance analysis based on hours of study
2. z-test, application form processing in banking system.
3. Hypothesis testing, randomly trained students placement into tier-I and tier-II companies.
Unit 3: Applications:
1. Linear regression, stock market prediction
2. Chi-Square Test, analyzing the association between vaccination and recovery of patients using COVID data.
3. Chi-Square Test and Test of Independence, Analyzing the relationship between gender and preference for a product purchase.
4. Identifying Type 1 and Type 2 Errors in Spam mail classification.
Unit 4: Applications:
1. Minimize a loss function in Neural Networks using Batch gradient descent (Unconstrained Optimization)
2. Lagrange Multipliers to find local maxima and minima of a function subject to equations constraints (Constrained Optimization)
3. Case study on Bayesian Optimization with Discrete Variables (Discrete Variable optimization)
4. Use Genetic Algorithms to optimize Production Scheduling in a manufacturing environment, focusing on minimizing total production
costs while meeting job deadlines and machine constraints. Evaluate the GA’s effectiveness against traditional scheduling methods.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tools and Textbooks
Tools / Languages/Libraries: Jupyter Notebook, Python, Pandas, Matplotlib, Scipy, Seaborn, BeautifulSoup,
Numpy, Scikit learn.

Text Book(s):
1. “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.

2. “Optimization Methods for Engineering Design”, Parkinson, A.R., Balling, R., and J.D. Hedengren, Second Edition,
Brigham Young University, 2018.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Evaluation Policy
ISA Components   Conduction                          Reduced to
ISA 1            40                                  20
ISA 2            40                                  20
Assignment       Coding: 5 M, Datathon: 20           10
ESA              100                                 50
Assignment Components
1. Submission of hands-on session code = 5 Marks
2. Datathon = 5 Marks
Total = 10 Marks
Note
1. Codes and solutions for hands-on sessions are expected to be submitted on the same day the sessions are
conducted.
2. The Datathon will be conducted for 20 Marks and reduced to 5 Marks.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data Science?
● Have you ever wondered how YouTube recommends videos of
your liking?
● How Google’s autocomplete works?
● How Gmail filters your emails into spam and non-spam
categories?
These are some of the simplest applications of Data Science. Such
tasks would be impossible without the availability of data. Thus in
simple words, Data Science is all about using data to solve problems.

Source: https://coralogix.com/blog/elasticsearch-autocomplete-with-search-as-you-type/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data Science?
Data Science is an interdisciplinary field.
● It is focused on extracting knowledge and insights from data.
● Those insights are then applied to solve problems across a wide
range of domains.
● It incorporates skills from Statistics, Computer Science,
Mathematics, Business etc.

Source: theblog.adobe.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science

Source: edureka.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science

Source: edureka.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Airlines Industry
Data Science is used for various purposes like: route planning,
revenue management, prediction on in-flight sales and food
supplies etc.

Sources: Simplilearn, datasciencecentral.com


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Airlines Industry

Sources: Simplilearn, datasciencecentral.com


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Logistics Industries

Source: Simplilearn
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Logistics Industries
Logistics is a sector where data scientists can make a significant
impact in several areas such as:
● waste reduction
● optimizing delivery routes (which can translate into lower
delivery costs)
● selecting carriers that deploy best practices in mitigating the
effects of CO2 emissions
● ensuring that hazardous materials are handled with the
utmost care
● forecasting the supply and demand cycles
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems

Source: https://www.martechadvisor.com/articles/customer-experience-2/recommendation-engines-how-amazon-and-netflix-are-winning-the-personalization-battle/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Amazon has a huge bank of data on online consumer purchasing
behaviour.
The data includes
● purchased shopping cart
● items added to carts but abandoned
● wish lists
● dwell time
● referral sites
● customers’ demographic information
● number of times an item was viewed before the final purchase
● click paths in session, pricing experiments online etc.
Using this data, Amazon can uncover hidden factors and patterns
to generate the “Recommended for You” section, which helps to
create a personalized shopping experience for every customer.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems

Source: https://medium.com/swlh/recommendations-in-time-context-93b32f73d98d
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Netflix has set up 1300 recommendation clusters based on users’
viewing preferences.
Netflix’s personalized recommendation algorithms produce $1
billion a year in value from customer retention and account for
80% of its total views. Some of the user information that Netflix
captures to help in recommendation includes:
● Viewer interactions with Netflix services like viewer ratings,
viewing history, etc.
● Movie information such as categories, year of release,
title, genres etc.
● Other viewers with similar watching preferences.
● Time duration of a viewer watching a show.
● The device on which a viewer is watching.
● The time of the day a viewer watches.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Weather Forecasting

Source: phys.org/news
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Weather Forecasting
Weather forecasts are made by collecting quantitative data
about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.
So in general, weather forecasting is driven by the data about the
atmosphere.
There are a wide variety of devices and technologies gathering
information about the weather like:
thermometers, barometers, anemometers, weather balloons,
radar systems, satellites etc.
Various weather models analyse and try to make sense of all
the incoming information to accurately predict the weather.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports

Source: https://arstechnica.com/information-technology/2015/10/big-data-an-it-buzzword-that-is-actually-producing-results/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports
Players, team managers, coaches and fans rely on sports analytics
before making decisions or developing strategies to win games.

Sports data analysts spend their time collecting on-field and off-
field data from a variety of sources and then analyzing and
interpreting that data looking for meaningful insights.

The main objective of sports analysis is to improve team
performance and enhance the chances of winning the game.

Major teams and their analytics partners:
(i) Real Madrid and Microsoft
(ii) Manchester United and Aon
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports
● Moneyball, an American biographical film, recounts the attempts
of a baseball team’s general manager to assemble a competitive team
using sports analytics.
● He used sabermetrics to evaluate his potential roster by
performing data mining on hundreds of individual baseball players,
identifying statistics that were highly predictive of how many runs a
player would score.

Source: https://en.wikipedia.org/wiki/Moneyball_(film)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports

Source: https://fivethirtyeight.com/features/billion-dollar-billy-beane/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Politics
Political parties and their strategists have realized the importance of
mining real-time demographic and polling data.
The various data points may include voter sentiment, mass emotions,
citizen concerns in different constituencies, popular outlooks in
various states, etc. Political parties can use these insights to:
● pull voter donations
● convert undecided voters
● enroll young volunteers
● organize resources
● run social media campaigns
● improve effectiveness of electioneering activities etc.
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics

https://www.datacouncil.ai/talks/how-data-is-transforming-politics
https://projects.fivethirtyeight.com/polls/generic-ballot/2024/
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
Political strategists and digital analysts can deploy modern software
analytics to create detailed maps of voting patterns.
Data analytics can help these campaigners to paint a vivid picture of
political winds, party supporters, and trenchant opponents in every
demographic region.
This demographic data and other information can be used in
campaign-spending management. It can help determine whether a
voter would be most receptive to a phone call, a flyer or mailer, an in-
person visit, or some other form of campaigning.
By using data in this way, campaigns can avoid wasting money on
ineffective or unnecessary advertising, and have a better chance of
reaching someone who is receptive.
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics

Source:
Historical U.S. Presidential Elections
1789-2020 - 270toWin
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics

Source:
270toWin - 2024 Presidential Election
Interactive Map
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine

Source:
http://www.primeclasses.in/blog/2019/08/26/the-need-for-data-science-in-healthcare-industry/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine
There are several fields in healthcare like medical imaging, drug
discovery, genetics, predictive diagnosis etc that make use of
data science.
● Hospitals analyse medical data and patient records to predict
those patients that are likely to seek readmission within a
few months of discharge.
● Omada Health is a digital medical company that uses smart
devices to create customized behavioral plans and online
training to help prevent chronic health conditions, such as
diabetes, high blood pressure, and high cholesterol.
● On the mental health side, the Canadian start-up Awake
Labs tracks data from children with autism through wearable
devices, alerting parents before a meltdown.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine

Source: https://allofus.nih.gov
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in predicting people’s opinions

Source: Simplilearn
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data?
Technically, data refers to individual facts, statistics, or items of
information, often numeric, that are collected through
observation.

Source: https://www.twinkl.de/teaching-wiki/data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data vs Information
➔ Data
● Raw facts, usually formatted in a special way.
● Based on records, observations etc.
● Unorganized.
➔ Information
● A collection of facts organized in such a way that they have
additional value beyond the value of the facts themselves.
● Based on analysis of data.
● Organized and always depends on data.

Ex: Data – thermometer readings of temperature taken every hour:
(16.0, 17.0, 16.0, 18.5, 17.0, 15.5, …)
[on transformation]
Information – today’s high: 18.5, today’s low: 15.5
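
The transformation from data to information in this example can be sketched in a few lines of Python. This is only an illustration: the list holds just the six readings shown on the slide (the original series is truncated), and the variable names are assumptions.

```python
# Hourly thermometer readings listed on the slide (raw data).
# Only the six visible values are used; the original series continues.
readings = [16.0, 17.0, 16.0, 18.5, 17.0, 15.5]

# Summarising the raw readings turns data into information.
todays_high = max(readings)
todays_low = min(readings)

print(f"Today's high: {todays_high}, today's low: {todays_low}")
```

The raw list on its own says little; the two summary values are what a reader can act on.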
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data vs Information

Source: https://effectualsystems.com/data-need-information/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of Data

Data                 Represented by
Alphanumeric data    Numbers, letters, and other characters
Image data           Graphic images or pictures
Audio data           Sound, noise, tones
Video data           Moving images or pictures


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Structured, Unstructured & Semi-structured Data

Source: https://towardsdatascience.com/data-extraction-from-a-pdf-table-with-semi-structured-layout-ef694f3f8ff1

Source: slidegeeks.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Structured, Unstructured & Semi-structured Data
Structured Data:
Structured data is the data whose elements are
addressable for effective analysis. The data is
organized into a formatted repository that is typically a
database. Ex: Relational data.
Semi-Structured Data:
It is the data that doesn’t reside in a relational database
but has some organizational properties that make it
easier to analyse. Ex: XML data.
Unstructured Data:
It is the data which is not organized in a predefined
manner or doesn’t have a predefined data model, thus
not a good fit for a mainstream relational database.
Ex: Word, pdf, text etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Structured, Unstructured & Semi-structured Data

Source:
https://www.slidegeeks.com/pics/dgm/l/f/Forms_Type_Of_Big_Data_Ppt_PowerPoint_Presentation_Infographic_Template_Slide_1-.jpg
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Information

Source: guru99.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Information Concepts

Source: https://learningforsustainability.net
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Science
■ The word science comes from the Latin word scientia,
meaning knowledge.
■ Science is a systematic enterprise that builds and organizes
knowledge in the form of testable explanations and
predictions about the universe.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why do we need Data Science?

Source: https://static.seekingalpha.com/uploads/2020/1/14/50485001-15789998083991578_origin.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why do we need Data Science?
The main reason why we need data science is the ability to process
and interpret data. This enables users and industries to make
informed decisions as well as helps in their growth, optimization,
and performance.

We know that unstructured data is generated everywhere, every
second. Unstructured data isn't well organized or easy to access,
but its growth is enormous, and analyzing and drawing inferences
from this type of data is crucial.

Data Science provides a number of methods and techniques to
deal with such data.
This helps many businesses and industries to significantly
improve their productivity.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How is Data generated?
Tons of data are generated each day.
Some of the major sources from which data is generated are:
web, databases, media, IoT, cloud etc.
Insight into data generation in a day over the internet:
● 500 million tweets are sent
● 294 billion emails are sent
● 4 petabytes of data are created on Facebook
● 4 terabytes of data are created from each connected car
● 65 billion messages are sent on WhatsApp
● 5 billion searches are made

Slide courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How is Data generated?
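By 2025, it’s estimated that 463 exabytes of data will be created
each day globally
– at 4.7 GB per single-layer DVD, that’s the equivalent of roughly
98.5 billion DVDs per day! (The often-quoted figure of 212,765,957
DVDs corresponds to about one exabyte.)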
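
A quick arithmetic check of the DVD comparison can be sketched as follows. Two assumptions are made here that are not stated on the slide: an exabyte is taken as 10^18 bytes, and a DVD as a 4.7 GB single-layer disc.

```python
# Assumptions (not from the slide): decimal exabyte, single-layer DVD.
BYTES_PER_EXABYTE = 10**18
DVD_CAPACITY_BYTES = 4.7e9

# Number of DVDs needed to hold 463 exabytes.
daily_bytes = 463 * BYTES_PER_EXABYTE
dvds_per_day = daily_bytes / DVD_CAPACITY_BYTES

# Capacity, in exabytes, of the 212,765,957 DVDs quoted with this estimate.
quoted_dvds_in_exabytes = 212_765_957 * DVD_CAPACITY_BYTES / BYTES_PER_EXABYTE

print(f"{dvds_per_day:.3e} DVDs per day")
print(f"{quoted_dvds_in_exabytes:.3f} exabytes in 212,765,957 DVDs")
```

Under these assumptions, 463 exabytes corresponds to tens of billions of DVDs, while 212,765,957 DVDs hold about one exabyte.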

Source: theblog.adobe.com
Slide courtesy: Dr. Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data generation
● In 2014, Oscars host Ellen DeGeneres’ “celeb selfie” tweet
was viewed 26 million times across the Web during a 12-hour
period.
● More than one billion hours of TV shows and movies are
streamed from Netflix per month.
● Walmart handles more than 1 million customer transactions
every hour, feeding databases estimated at more than 2.5
petabytes (the equivalent of 167 times the books in
America's Library of Congress).
● Facebook is home to 40 billion photos.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data generation

Source: https://twitter.com/theellenshow
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data generation

Source: https://trak.in/tags/business/2014/04/15/digital-data-universe-expansion-2020/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Growth in Data generation
The total amount of data created, captured, copied and consumed
globally has been increasing exponentially.
In 2020, the amount of data created and replicated was higher than
expected because of increased demand during the pandemic.
By 2025, global data creation is projected to grow to more than
180 zettabytes.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How much of data is put into use?

Source: IDC, 2014


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How much of data is put into use?

Though a huge amount of data is generated each
day, it serves no purpose if it is left unused.

This can further lead to information overload, where there is an
overabundance of information but it is not put to work due to
lack of time, resources, understanding of the information,
irrelevance of the information or other reasons.

Thus, it is important to understand the data and know how to
utilize it in the right manner.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
No one knows how to use it

Source: https://image.slidesharecdn.com/instroductiontodatascience-160420090623/95/introduction-to-data-science-38-638.jpg?cb=1461307670
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
But is data all we need?
The graph below shows a strong correlation between
‘Age of Miss America’ and ‘Murders by steam, hot vapour and hot
objects’, yet a cause-and-effect relationship between the two is
implausible.
Thus, the presence of an interesting pattern need not
imply that it is meaningful.
Blindly applying various processes and techniques to data can
result in incorrect inferences.

Source: https://i2.wp.com/boingboing.net/wp-content/uploads/2016/02/chart.jpg?fit=800%2C315&ssl=1
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
But is data all we need?
The following work highlights the risk of amplifying and reinforcing
biases present in the data by blindly applying machine learning on it.

Source: https://arxiv.org/abs/1607.06520
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Learn how to use data
The above examples help us understand that we need to learn
how to utilize and handle the available data in the right manner
to be able to arrive at correct results and draw meaningful
inferences.

➔ Explore: identify patterns
➔ Predict: make informed guesses
➔ Infer: quantify what you know
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Learn how to use data

Source:slidesharecdn.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Science project life cycle
The correct process of using available data is shown in this life
cycle. It outlines the major stages in a data science project.

Source: https://static.javatpoint.com/tutorial/data-science/images/data-science-lifecycle.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Science project life cycle

Source: https://res.cloudinary.com/practicaldev
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Scientist
In simple words, Data Scientists are those who make sense of all the
available data and figure out what can be done with it.

Source: proschoolonline.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Scientist

Source: edureka!
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What does a Data Scientist do?
They are responsible for collecting, analyzing, modelling and
interpreting large amounts of data. Their role combines Computer
Science, Mathematics, Statistics etc.

Source: https://edvancer.in/wp-content/uploads/2015/11/76c99311fc4be19bf4353cfc3c2e94b2.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What does a Data Scientist do?

Source: medium.com

Slide courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Prerequisites for a Data Scientist

Curiosity, Common Sense, Communication skills

Sources: quickanddirtytips.com, dreamstime.com, linkedin.com
Slide courtesy: Dr. Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Prerequisites for a Data Scientist

Source: data-flair.training
Slide courtesy: Dr. Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Demand for Data Scientist
Data Science is a growing field. It is a popular and lucrative
profession. Glassdoor has ranked this profession at #3 in 2022
despite the occurrence of the pandemic.

Sources : Glassdoor, Forbes


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Demand for Data Scientist

Source:
https://cdn.ttgtmedia.com/rms/onlineimages/business_analytics-data_scientist_01_mobile.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How is it different from what Statisticians have been doing?

Both Statisticians and Data Scientists work closely with data.
● Statisticians use mathematical equations and statistical
models to analyze data and arrive at conclusions.
● Data Scientists, however, focus on delivering actionable
results and sometimes need to deploy the model to the
production system.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How is it different from what Statisticians have been doing?

Source:
https://scientistcafe.com/ids/images/softskill1.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Science vs Data Analysis
● Data Science is primarily used to make decisions
and predictions, making use of predictive causal
analytics, prescriptive analytics (predictive plus
decision science) and machine learning.

● Data Analysis includes descriptive analytics and
prediction to a certain extent.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Science vs Data Analysis

Source:
https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2017/01/Data-Analyst-vs-Data-Science-1-422x300.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Science vs Data Analysis

Source: edureka!
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Common tasks in Data Science

Source: Simplilearn
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Common tasks in Data Science

Source: https://static.javatpoint.com/tutorial/data-science/images/how-to-solve-a-problem-in-data-science.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
References

Text Book:
Statistics for Engineers and Scientists, William Navidi, 4th Edition,
McGraw Hill Education, India
THANK YOU

Dr.Mamatha H R
Professor, Department of Computer Science
+91 80 2672 1983 Extn 712
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
UE23MA242A
Unit 1: Sampling Methods

Mamatha H R

Department of Computer Science and
Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered

❖ Sampling methods

❖ Sampling process

❖ Probability and Non-probability sampling

❖ Advantages and disadvantages of different sampling methods


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What are Sampling methods?
In a statistical study, sampling methods refer to how we select
members from the population to be included in the study.
The selected sample must be representative of the population.
If a sample isn't randomly selected, it will probably be biased in
some way and the data may not be representative of the
population.
There are many ways to select a sample—some good and some
bad.

Sources: blog.masterofproject.com,
analytics-magazine.org
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling process
1. Define target population (population of concern)
2. Specify sampling frame
3. Specify sampling method
4. Determine sample size
5. Implement the sampling plan
6. Sampling and data collecting
7. Reviewing the sampling process
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling
➔ Factors that influence sample representativeness:
● Sampling procedure
● Sample size
● Participation (response)

➔ When might you sample the entire population?
● When your population is very small
● When you have extensive resources
● When you don’t expect a very high response

Source: thumbs.dreamstime.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Recap Population vs Sample
● A population can be defined as including all people or items
with the characteristic one wishes to understand.
● Because there is very rarely enough time or money to gather
information from everyone or everything in a population, the
goal becomes finding a representative sample (or subset) of
that population.
➔ Note:
● The population from which the sample is drawn may not be the
same as the population about which we actually want
information.
● Often there is large but not complete overlap between these
two groups due to frame issues etc .
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Frame
Sampling frame is the list of items or events from which the
potential respondents are drawn or which are possible to
measure.
● Sometimes, it is possible to identify and measure every single
item in the population and to include any one of them in our
sample.
● However, in the more general case this is not possible.
● For example, there is no way to identify all rats in the set of all rats.
● As a remedy, we seek a sampling frame which has the
property that we can identify every single element and include
any of them in our sample.
● The sampling frame must be representative of the population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Representative & Biased Sample

Population
Sample 1: Representative of the population
Sample 2: Biased sample


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of Sampling methods

Samples
➔ Probability Samples
● Simple Random
● Systematic
● Stratified
● Cluster
➔ Non-Probability Samples
● Judgement
● Snowball
● Convenience
● Quota
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Probability Sampling
● Probability sampling is a type of sampling in which every unit in the
population has a chance/probability (greater than zero) of being selected
in the sample, and this probability can be accurately determined.
● This type of sampling decreases bias and sampling error in the selection
process.
● When every element in the population does have the same
probability of selection, this is known as an 'equal probability
of selection' (EPS) design. Such designs are also referred to as
'self-weighting' because all sampled units are given the same
weight.

Source: www.mathstopia.net
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Non-Probability Sampling
● Non-Probability sampling is a type of sampling in which not every unit in the
population has a chance/probability (greater than zero) of being
selected in the sample.
● Here, some elements of the population have no chance of selection
(these are sometimes referred to as 'out of coverage'/'undercovered'),
or the probability of selection can't be accurately determined.
● It involves the selection of elements based on assumptions regarding the
population of interest, which forms the criteria for selection.
● The selection of elements is non random.
● Thus, non-probability sampling does not allow the estimation of sampling
errors.
● It is more likely to produce a biased sample and restricts generalization.
● It is not an appropriate data collection method for most of the statistical
analysis.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Probability Sampling
● Subjects of the sample are chosen based on known
probabilities.
Probability Samples
● Simple Random
● Systematic
● Stratified
● Cluster
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling
Simple random sampling, as the name suggests, is an entirely random
method of selecting the sample.
● Here, each subject or unit in the population has an equal chance of
being selected.
● The sampling frame should include the whole population.
● A table of random numbers or a lottery system is used to determine which
units are to be selected.
● Simple random sampling is always an EPS design, but not all EPS designs
are simple random sampling.

Source: datasciencemadesimple.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling

Source: questionpro.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling

● Purpose: Selection is random and thus tends to produce a representative sample.
● When to Use: Best used when the population is small,
as it produces a more representative sample.
● Key Aspect: Each member of the population has an equal
probability of getting selected.
● General Procedure: Assign numbers to all members of the
population & select randomly.
○ For a small population: Manual lottery method can be used
for selection.
○ For a larger population : System generated numbers can be
used to select elements from the population.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling Examples

● At a birthday party, teams for a game are chosen by putting everyone's
name into a jar, and then choosing the names at random for each team.
● A restaurant leaves a fishbowl on the counter for diners to drop their
business cards. Once a month, a business card is pulled out to award
one lucky diner with a free meal.
● All students in the Computer Science department are assigned numbers
and 100 random numbers are chosen to attend a webinar.

Sources: c8.alamy.com, wordwall.net


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling Examples

Here, each of the 20 coins has an equal probability of getting selected.

Source:
analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling Examples
Probability (%) = (n/N) x 100
Calculating the probability of each coin getting selected:
● Total population size (N) = 20
● Sample size (n) = 5
● Probability = (5/20) x 100
= 25%
● Thus, each coin has a 25% probability of getting selected.

Source:
analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling Examples
In a company consisting of 10,000 employees, 25 employees are selected
to survey the average number of hours a day they are present in the
office.
● Population frame: List of all employees numbered from 1-10,000
● Sample : Random number table consisting of 25 random employees.
● Probability of selection of each employee :
N = 10,000; n = 25
probability = (25/10,000) x 100 = 0.25%
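
The employee example can be sketched with Python's standard `random` module. This is a minimal illustration: the helper name `simple_random_sample` and the fixed seed are assumptions made here for reproducibility, not part of the slides.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Sampling frame: employees numbered 1 to 10,000.
employees = list(range(1, 10_001))
sample = simple_random_sample(employees, 25, seed=42)

# Probability (%) that any given employee ends up in the sample: (n/N) x 100.
inclusion_probability = len(sample) / len(employees) * 100

print(sorted(sample))
print(f"Inclusion probability: {inclusion_probability}%")
```

Using `random.sample` plays the role of the random number table: it selects without repeats, so no employee is surveyed twice.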

Source: 5found.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling: Advantages
➔ Advantages:
● This method is simple to use.
● Estimates are easy to calculate.
● Random samples are usually fairly representative since they don't
favor certain members of the population.
● Low sampling error.
● It needs only a minimum knowledge of the study group of
population in advance.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling: Disadvantages
➔ Disadvantages:
● If the sampling frame is large, this method is impracticable.
● Minority subgroups of interest in population may not be present in
sample in sufficient numbers for study.
● This type of sampling can’t be employed where the units of the
population are heterogeneous in nature.
● Sometimes, it is difficult to have a completely cataloged universe.
● This method lacks the use of available knowledge concerning the
population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling with replacement
● This is a sampling procedure in which each sampling unit randomly
selected from the population is measured or recorded and then returned
to the population. Thus, a sampling unit may be sampled multiple times.
● When sampling the first marble, each marble has the same chance of
0.1 of being sampled. When sampling the second marble and all the
subsequent marbles, each marble still has a 0.1 chance of being sampled.
● Each time we sample a unit, all units have the same chance of being
sampled.

Source: www.spss-tutorials.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Random Sampling without replacement
● This is a sampling procedure in which sampling units are selected from a
population without replacement, such that every sampling unit has an
equal probability of being selected.
● No element can be selected more than once in the same sample.
● For the first marble sampled, each marble has a 0.1 chance of being
sampled. However, the first unit we sampled has a zero chance of being
sampled again.
● Thus, the other 9 units each have a chance of 1 in 9 ≈ 0.11 of being
sampled as the second unit.
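
The difference between sampling with and without replacement can be sketched for the marble example. The population of 10 marbles is inferred from the 0.1 chance quoted on the slides; the seed and variable names are assumptions for illustration.

```python
import random
from fractions import Fraction

marbles = list(range(1, 11))  # 10 marbles, so each has a 1/10 = 0.1 chance
rng = random.Random(0)

# With replacement: a marble goes back after each draw, so repeats are possible.
with_replacement = rng.choices(marbles, k=5)

# Without replacement: a drawn marble cannot be drawn again.
without_replacement = rng.sample(marbles, k=5)

# Chance that a particular not-yet-drawn marble is picked on the second draw.
p_second_with = Fraction(1, len(marbles))         # still 1/10
p_second_without = Fraction(1, len(marbles) - 1)  # 1/9, about 0.11

print(with_replacement, without_replacement)
print(float(p_second_with), round(float(p_second_without), 2))
```

`random.choices` draws with replacement and `random.sample` draws without, which mirrors the two procedures described above.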

Source: www.spss-tutorials.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling
● Systematic sampling relies on arranging the target population
according to some ordering scheme and then selecting elements at
regular intervals through that ordered list.
● The first element is selected randomly.
● Then it proceeds with the selection of every kth element, where k is
the size of the selection interval: k = (population size/sample size)
● It is important that the starting point is not automatically the first in
the list, but is instead randomly chosen from within the first to the kth
element in the list.
● A simple example would be to select every 10th name from the
telephone directory (an 'every 10th' sample, also referred to as
'sampling with a skip of 10').
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling

Source: questionpro.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling
● Systematic sampling is an Equal Probability Sampling method, as
all elements have the same probability of selection (in the example
given below, with a selection interval of k = 3, one in three).
● It is not 'simple random sampling' because different subsets of the
same size have different selection probabilities.
● Ex: the set {2,5,8,11} has a one-in-three probability of selection,
but the set {1,3,6,7} has zero probability of selection.
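
The claim that subsets have different selection probabilities can be checked by enumerating every possible systematic sample. The sketch below assumes a population of N = 12 with interval k = 3, values inferred from the set {2,5,8,11} rather than stated in the slides. Under that assumption only three samples are possible, each chosen with probability 1/k.

```python
from math import comb

# Assumed setting, inferred from the example set {2, 5, 8, 11}.
N, k = 12, 3

# One systematic sample per random start 1..k; each start is equally likely.
systematic_samples = [tuple(range(start, N + 1, k)) for start in range(1, k + 1)]


def selection_probability(subset):
    """Probability that systematic sampling yields exactly this subset."""
    return 1 / k if tuple(sorted(subset)) in systematic_samples else 0.0


print(systematic_samples)
print(selection_probability({2, 5, 8, 11}))  # one of the three possible samples
print(selection_probability({1, 3, 6, 7}))   # can never be produced

# Under simple random sampling, every 4-element subset would be equally likely.
print(comb(N, N // k), "equally likely subsets under simple random sampling")
```

Every element still appears in exactly one of the three samples, which is why each element has the same chance of selection even though most subsets have none.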

Source: www.netquest.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling
● When to Use: When the project budget is tight and there is little time to complete it.
● Key Aspect: Find the kth value to select every kth member.
k=N/n
● General Procedure:
○ Assign numbers to each population element.
○ Arrange the population elements in an ordered sequence.
○ Find ‘k’ the size of the selection interval.
○ Select the first sample element randomly from the first
k population elements.
○ Thereafter, select the sample elements at a constant
interval, k, from the ordered sequence frame.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling Examples
From a classroom consisting of 64 students, the teacher wants to
select 8 students to check their assignments.
● Population size = N = 64
● Sample size = n =8
● Size of selection interval = k = N/n
= 64/8 = 8
● Randomly select the first student from the first 8, then select
every subsequent 8th student.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling Examples
Purchase orders for the previous fiscal year are serialized 1 to
10,000. A sample of fifty purchase orders is needed for an audit.
● N = 10,000
● n = 50
● k = 10,000/50
= 200
● First select an element randomly from the first 200 purchase
orders.
● Assume the 45th purchase order was selected.

● Subsequent sample elements: 245 (45+200),
445 (245+200),
645 (445+200), ...
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling Examples
Given a set of 20 coins, 5 coins must be selected from the population.
● N = 20; n = 5
● k = N/n = 20/5 = 4
● Randomly selecting the first element = 3 (suppose)
● Subsequent coins are selected at an interval of 4, starting from the 3rd coin
● Sampled coins = { 3, 3+4 = 7, 7+4 = 11, 11+4 = 15, 15+4 = 19}

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling: Advantages
● The sample is easy to select.
● A suitable sampling frame can be identified easily.
● The sample spreads evenly over the entire reference population.
● It is a cost-effective sampling method.
● Systematic sampling also carries a low risk of the data being
contaminated.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Systematic Sampling: Disadvantages
● This type of sampling might lead to bias if there is an underlying
pattern/periodicity in the population which coincides with the selection
interval.
Ex: If the HR database groups employees by team, and team members are
listed in order of seniority, there is a risk that the interval might skip over
people in junior roles, resulting in a sample that is skewed towards senior
employees.
● It is difficult to assess the precision of the estimate from one survey.
● Although each element has an equal chance of selection, not every
possible sample of size n can be selected.
● All elements between two consecutive kth elements are ignored.
● The size of the population is needed. Without knowing the specific
number of participants in a population, systematic sampling does not
work well.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling
● Stratified sampling is the type of sampling in which the population is
divided into 2 or more groups called strata based on a shared
characteristic or trait.
● Then simple random samples are selected from each group.
● The 2 or more samples selected are combined into one.
● The strata or groups don’t overlap, and together they cover the entire
population.
● The shared characteristics based on which the population is divided
could be gender, educational attainment, income, age etc.

Source: datasciencemadesimple.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling
● Each stratum is sampled as an independent sub-population.
● Every unit in a stratum has the same chance of being selected.
● Using the same sampling fraction for all strata ensures proportionate
representation in the sample.
● Adequate representation of minority subgroups of interest can be
ensured by stratification & varying sampling fraction between strata
as required.
● Since each stratum is treated as an independent population,
different sampling approaches can be applied to different strata.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling
● Purpose: To obtain an unbiased random sample from a larger
population.
● When to Use: When the population proportions must be reflected in the
sample.
● Key Aspect: The sample proportions are the same as the population
proportions; each stratum is internally homogeneous.
● General Procedure:
○ Divide the population into strata or groups.
○ Criteria for division could be: Gender, Hair Color, Eye Color,
Salary, Designation, Age etc.
○ Selection of sample: Simple random sampling is used
to sample units from each stratum.
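
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `stratified_sample`, the coin colours, and the choice of a single sampling fraction for all strata (proportionate allocation, with at least one unit per stratum) are hypothetical and not taken from the slides.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for unit in units:                        # step 1: divide the population into strata
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():           # step 2: simple random sample within each stratum
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample                             # step 3: combine the per-stratum samples

# 20 coins in 4 colour strata of 5 coins each (illustrative data)
colours = ["red", "blue", "green", "gold"]
coins = [(number, colours[number % 4]) for number in range(20)]
print(stratified_sample(coins, lambda coin: coin[1], 0.4, random.Random(0)))
```

Using the same fraction in every stratum keeps the sample proportions equal to the population proportions; a different fraction per stratum could be passed instead when a minority subgroup needs extra representation.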
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling examples
Given 20 coins of different colours.
● Population of coins is divided into 4 strata based on their colours.
● Coins from each stratum are sampled using simple random sampling.

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling examples
To find out the most popular song among the FM radio listeners.
● All listeners are stratified by age.
● Listeners from each age group are selected using simple random
sampling and surveyed for their favourite song of the year.
Stratified by age:
● 20 - 30 years old (homogeneous within the stratum)
● 30 - 40 years old (homogeneous within the stratum)
● 40 - 50 years old (homogeneous within the stratum)
The strata are heterogeneous with respect to one another.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling examples
A high school principal wants to conduct a survey to collect the
opinions of students.
● The students are grouped into 4 strata based on their grade.
● Then, simple random samples of 50 students from each grade are
selected to be included in the survey.

Source: statology.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling: Advantages
● It enhances the representativeness of the sample.
● It is easy to carry out.
● It has higher statistical efficiency.
● A stratified sample can provide a higher precision than a simple
random sample of the same size.
● As it provides a greater precision, this type of sampling often
requires a smaller sized sample which saves money.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Stratified Sampling: Disadvantages
● Sampling frame of the entire population has to be prepared
separately for each stratum.
● When examining multiple criteria to divide the population,
stratifying variables may be related to some but not to others
further complicating the design and potentially reducing the utility
of the strata.
● In some cases (such as designs with a large number of strata, or
those with a specified minimum sample size per group), stratified
sampling can potentially require a larger sample than other
methods.
● It is time consuming and expensive.
● It can lead to classification errors when units are assigned to the
wrong stratum.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling
● In cluster sampling, the population is divided into non-overlapping
clusters or areas, similar to stratified sampling.
● Each cluster is a miniature or microcosm of the population.
● Each cluster should have similar characteristics to the whole population.
● Instead of sampling individuals from each subgroup as in stratified
sampling, cluster sampling randomly selects a subset of entire clusters
for the sample.
● If the number of elements in the subset of clusters is larger than the
desired value of n(sample size), these clusters may be subdivided to
form a new set of clusters and subjected to a random selection
process.

Source: dataz4s.com Slide Courtesy:Dr.Uma


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling
● When to Use: When the population is already broken up into
groups (clusters).
● Key Aspect: Heterogeneous members in each group.
● General Procedure:
○ The population is divided into non-overlapping areas (clusters).
○ Each cluster is a miniature or microcosm of the population.
○ Clusters are selected randomly.
○ Either all elements of the selected clusters are included in the
sample, or elements from the selected clusters are chosen using simple
random sampling.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling examples
Given a set of 20 coins of different colours
● Population is divided into 5 clusters each having 4 coins.
● A whole cluster is randomly selected to be included in the sample.

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling examples
An athletic organization wishes to find out which
sports Grade 11 students are participating in across
Canada.
● It would be too costly and lengthy to survey
every Canadian in Grade 11, or even a couple of
students from every Grade 11 class in Canada.
● Instead, each school with Grade 11 students is
considered a cluster, and 100 schools are
randomly selected from all over Canada.
● These schools provide the clusters of the sample.
Then, every Grade 11 student in all 100 clusters is
surveyed. In effect, the students in these clusters
represent all Grade 11 students in Canada.
Source: s4be.cochrane.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling examples
The municipal council of a small city wants
to investigate the use of health care
services by residents.
● The council first obtains electoral
subdivision maps that identify and label
each city block. From these maps, the
council creates a list of all city blocks.
This list will serve as the sampling
frame.
● Every household in that city belongs to a
city block, and each city block
represents a cluster of households. The
council randomly picks a number of city
blocks.
Source:coronainsights.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling: Advantages
● It is more convenient for geographically dispersed populations.
● It can reduce the travel costs to contact sample elements.
● It simplifies the administration of the survey.
● It is more feasible. Dividing the entire population into clusters that
resemble one another increases the feasibility of the sampling.
● Since each cluster represents the entire population, more
subjects can be included in the study.
● Requires fewer resources. Since cluster sampling selects only
certain groups from the entire population, the method requires
fewer resources for the sampling process.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling: Disadvantages
● It is statistically less efficient when the cluster elements are
similar.
● The costs and problems of statistical analysis are greater than
for simple random sampling.
● There is higher sampling error.
● The method is prone to biases. If the clusters representing the
entire population were formed under a biased opinion, the
inferences about the entire population would be biased as
well.
● It’s difficult to guarantee that the sampled clusters are really
representative of the whole population.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling: Types
There are 2 types of cluster sampling methods.
● One-stage sampling: All of the elements within selected
clusters are included in the sample.
● Two-stage sampling: A subset of elements within selected
clusters are randomly selected for inclusion in the sample.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling: One-stage cluster sampling
Here, the population is divided into clusters. Then, some of the clusters are
randomly selected and all members from those clusters are included in the sample.

Source:statology.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Cluster Sampling: Two-stage cluster sampling
As the name suggests, this method of sampling involves 2 stages.

Step 1: Split a population into clusters, then randomly select some of the clusters.

Step 2: Within each chosen cluster, randomly select some of the members to be
included in the survey.
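
Both variants can be sketched with one hypothetical helper, `cluster_sample`. It assumes the clusters are given as a dictionary mapping a cluster name to its members; leaving `per_cluster` unset gives one-stage sampling, and setting it gives two-stage sampling. The names and data are illustrative only.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, rng=None):
    """One-stage cluster sampling, or two-stage when per_cluster is given."""
    rng = rng or random.Random()
    chosen = rng.sample(list(clusters), n_clusters)   # stage 1: randomly select whole clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)                    # one-stage: keep every member
        else:                                         # stage 2: random subset of members
            sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample

# 20 coins split into 5 clusters of 4 coins each (illustrative data)
clusters = {name: [f"{name}{i}" for i in range(1, 5)] for name in "ABCDE"}
print(cluster_sample(clusters, 2, rng=random.Random(0)))                  # one-stage
print(cluster_sample(clusters, 2, per_cluster=2, rng=random.Random(0)))   # two-stage
```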

Source:statology.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Difference between Strata and Clusters
Although strata and clusters are both non-overlapping subsets of the
population, they differ in several ways.

● All strata are represented in the sample. But only a subset of clusters are in
the sample.
● With stratified sampling, the best survey results occur when elements
within strata are internally homogeneous. However, with cluster sampling,
the best results occur when elements within clusters are internally
heterogeneous.

Source: miro.medium.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Non-probability Sampling
Non-probability sampling is a type of sampling in which not every unit in
the population has a known chance/probability (greater than zero) of
being selected in the sample.
Types of non-probability samples: Convenience, Judgement, Quota, Snowball.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling
● Sometimes it is also known as grab or opportunity sampling or
accidental or haphazard sampling.
● This is a type of nonprobability sampling in which the sample is
drawn from the part of the population that is close to hand,
that is, readily available and convenient.
● Here, sample elements are selected for the convenience of the
researcher.
● The researcher using such a sample cannot scientifically make
generalizations about the total population from this sample because
it would not be representative enough.

Source: googleusercontent.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling
● When to Use: When the population is not clearly defined, the sampling
unit is not clear, or a complete source list is not available.
● Key Aspect: Subjects for a study are easily available within the
proximity of the researcher.
● General procedure:
○ It is done at the “convenience” of the researcher.
○ Selection : The individuals that are convenient and easiest to
reach are selected to be included in the sample.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling examples
Given a set of 20 coins of different colours.
● Let’s say that the researcher likes the numbers 4,7,12,15,20 .
● Thus, the coins with the same numbers are included in the sample.

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling examples
To research the opinions about student support services in your university
● After each of your classes, you ask your fellow students to complete a
survey on the topic.
● This is a convenient way to gather data, but as you only surveyed students
taking the same classes as you at the same level, the sample is not
representative of all the students at your university.

Source: assets.pearsonschool.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling examples
To record the popular opinions of people about the current laws of the city.
● The researcher surveys all people that pass by his house.
● Again, this is a convenient way of studying the opinions of people living in
the city. But, it doesn’t reflect the opinions of all the residents of the city.

Source:slideshare.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Convenience Sampling: Advantages & Disadvantages
➔ Advantages:
● This type of sampling is useful in pilot study.
● It costs less and is an inexpensive way to gather initial data for the research.
● It saves time.
● It is relatively easy to get a sample.
● It is simple and easy to implement.
➔ Disadvantages:
● It is prone to significant bias as the sample may not be representative of the
characteristics of the population.
● Since the sample may not be representative of the population, this type of
sampling can’t produce generalizable results.
● It might lead to sampling errors.
● A study conducted on a convenience sample will have limited external
validity.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling
● Judgemental or Purposive sampling is a
type of non-probability sampling where
the researcher chooses the sample
based on who they think would be
appropriate for the study.
● This is used primarily when there is a
limited number of people that have
expertise in the area being researched.
● The sample depends on the judgement
of the experts conducting the study.
● It is not a scientific method of sampling.

Source: dataz4s.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling
● When to Use: This is used primarily when there is a limited number
of people that have expertise in the area being researched.
Also, the researcher must be confident that the chosen sample is
truly representative of the entire population.
● Key Aspect: The researcher selects a sample based on
experience or knowledge of the group to be sampled.
● General Procedure:
○ Elements of the population are sampled on the basis of the
researcher’s knowledge and judgment.
○ Selection : Elements that possess the qualities expected by the
researcher.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling examples
Given a set of 20 coins of different colours.
● Suppose, the experts believe that coins numbered 1, 7, 10, 15, and
19 should be considered for the sample as they may help us to infer
the population in a better way.

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling examples
To know more about the opinions and experiences of disabled students
at your university
● You purposefully select a number of students with different support
needs at your university in order to gather a varied range of data on
their experiences with student services.

Source: rm-15da4.kxcdn.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling examples
A panel decides to understand the factors which lead a person to select
ethical hacking as a profession.
● The researchers who understand what ethical hacking is will be
able to decide who should form the sample to learn about it as a
profession.
● Researchers can easily filter out those participants who can be
eligible to be a part of the research sample.

Source:statisticshowto.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Judgemental Sampling: Advantages & Disadvantages
➔ Advantages:
● It consumes minimum time.
● The researcher is given an opportunity to bring his judgement
and expertise to play.
● No special knowledge of statistics is needed.
● Real time results can be obtained.
➔ Disadvantages:
● It is prone to errors in judgment by researcher.
● Low level of reliability and high levels of bias.
● Inability to generalize research findings to the entire
population.
● It is difficult to choose the appropriate sample size.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling
● In this type of sampling, sample elements are selected until the
quota controls are satisfied.
● The population is first segmented into mutually exclusive sub-
groups, just as in stratified sampling.
● Then judgment is used to select subjects or units from each segment
based on a specified proportion.
● The population units are selected based on predetermined
characteristics of the population.
● It is similar to Stratified sampling but it doesn’t involve random
selection.
● Ex: recruiting the first 50 men and first 50 women that meet
inclusion criteria.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling
● When to Use: If a study aims to investigate a trait or a characteristic
of a certain subgroup, this type of sampling is the ideal technique.
● Key Aspect: Sample elements are selected until the quota controls
are satisfied.
● General Procedure:
○ Divide the population into subgroups.
○ Identify proportions or weightage in which the subgroups are
present in the population.
○ Select an appropriate sample size while maintaining the
proportions of the subgroups.
○ Conduct the surveys according to the quotas defined
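
The procedure above can be sketched as below. This is a minimal illustration, assuming respondents are taken in the order they become available (no random selection) until each subgroup's quota is filled. The function name `quota_sample` and the data are hypothetical.

```python
def quota_sample(arrivals, group_of, quotas):
    """Accept respondents in arrival order until every quota is satisfied."""
    remaining = dict(quotas)              # open places left per subgroup
    sample = []
    for person in arrivals:
        group = group_of(person)
        if remaining.get(group, 0) > 0:   # accept only if this subgroup still has room
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):   # stop once all quota controls are satisfied
            break
    return sample

# Illustrative stream of respondents as (id, gender) pairs
arrivals = [(i, "F" if i % 3 == 0 else "M") for i in range(30)]
print(quota_sample(arrivals, lambda person: person[1], {"M": 2, "F": 2}))
```

Because the first eligible respondents are always taken, the subgroup proportions are controlled but the selection within each subgroup is not random.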

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling examples
Given a set of 20 coins of different colours.
● Here we need to select items based on predetermined characteristics of
the population.
● Suppose we have to select coins having a number in multiples of four for
our sample. Thus, the coins 4,8,12,16,20 are sampled.

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling examples

To survey individuals about what smartphone brand they prefer to use.

● Suppose the researcher considers a sample size of 500 respondents.
Also, the researcher is only interested in surveying ten states in
the US. The researcher divides the population as follows:
○ Gender: 250 males and 250 females
○ Age: 125 respondents each between the ages of 1-50, and 51+
○ Location: 50 responses per state

Source: ovationmr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling examples
A cool drinks company wants to find out what age group prefers what brand of
drinks in a particular city.

● The researcher applies quotas on the age groups of 11-21,22-31, 32-41, 42-51.
● The researcher then samples people from each quota and surveys them to
gauge the trend among the population of the city.

Source: ovationmr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling: Advantages & Disadvantages
➔ Advantages:
● It is a cost effective method.
● There is convenience in execution of this sampling.
● It is a speedy process.
● The information can be deciphered once the sampling is done.
● It improves the representation of certain groups within the population
and also ensures that they are not over-represented.
➔ Disadvantages:
● Impossible to determine sampling error as the sample is not chosen
using random selection.
● Can result in sampling bias if the selection of units was based on ease of
access and cost considerations.
● It is not possible to make statistical inferences from the sample to the
population leading to the problems of generalization.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling
● In this type of sampling, survey subjects are selected based on referral
from other survey respondents.
● Existing subjects are asked to nominate further subjects known to them
so that the sample increases in size like a rolling snowball.
● This method of sampling is effective when a sampling frame is difficult to
identify.
● It is usually applied when the subjects are difficult to trace. Ex: it is
extremely challenging to survey homeless people or illegal immigrants.

Source: cuttingedgepr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling

● When to Use: When the desired sample characteristic is rare. It may be
extremely difficult or cost prohibitive to locate respondents in these
situations.
● Key Aspect: Research starts with a key person, who introduces the next
one, and so on, so that the referrals form a chain.
● How:
○ Identify initial subjects and ask these people to identify others.
○ Selection : This technique relies on referrals from initial subjects to
generate additional subjects.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling examples
To select students from a class of 20 to be a part of a volunteer club.
● Here, person 1 is randomly chosen for the sample; he/she
recommends person 6, person 6 recommends person 11,
and so on: 1->6->11->14->19
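
The referral chain in this example can be sketched as a simple traversal. This is a minimal illustration, assuming the referrals are known in advance as a dictionary (in a real study they are discovered one interview at a time). The function name `snowball_sample` is hypothetical.

```python
from collections import deque

def snowball_sample(seeds, referrals, target_size):
    """Grow a sample by following referrals from the initial subjects."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)                        # subjects waiting to be surveyed
    while queue and len(sample) < target_size:
        person = queue.popleft()
        sample.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:            # never recruit the same subject twice
                seen.add(referred)
                queue.append(referred)
    return sample

referrals = {1: [6], 6: [11], 11: [14], 14: [19]}   # who recommends whom
print(snowball_sample([1], referrals, 5))
```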

Source: analyticsvidhya.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling examples

To study the level of customer satisfaction among
the members of an elite country club.
● It is extremely difficult to reach primary data
sources unless a member of the club agrees to
have a direct conversation with you and
provides the contact details of the other
members of the club.
● Thus an initial primary data source is randomly
selected, and that member nominates other potential data
sources who will be able to participate in the
research study.

Source: cdn.scribbr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling examples

To research the experiences of homelessness in your city.
● Since there is no list of all homeless
people in the city, probability
sampling isn’t possible.
● You meet one person who agrees to
participate in the research, and she
puts you in contact with other
homeless people that she knows in
the area.

Source: miro.medium.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling: Advantages & Disadvantages
➔ Advantages:
● The chain referral process allows the researcher to reach populations that are
difficult to sample when using other sampling methods.
● The process is simple and cost-efficient.
● This sampling technique needs little planning and a smaller workforce
compared to other sampling techniques.
➔ Disadvantages:
● There is a significant risk of selection bias in snowball sampling, as the
referenced individuals will share common traits with the person who
recommends them.
● It is usually impossible to determine the sampling error or make inferences
about populations based on the obtained sample.
● The researcher has little control over the sampling method.
● Representativeness of the sample is not guaranteed.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample size
● The more heterogeneous a population is, the larger the sample
needs to be.
● For probability sampling, the larger the sample size, the better.
● With nonprobability samples, the results are not generalizable,
whatever the sample size.
● The main factors affecting the sample size are:
○ Total size of the population
○ Margin of error
○ Confidence level
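
One common way to combine these three factors is Cochran's formula for estimating a proportion, with a finite population correction. The sketch below is an illustration, not the method prescribed by the slides: the function name `sample_size` and the default values (5% margin of error, 95% confidence, worst-case proportion p = 0.5) are assumptions.

```python
import math
from statistics import NormalDist

def sample_size(population_size, margin_of_error=0.05, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion (Cochran, finite population corrected)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided critical value
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population_size)            # correction for a finite population
    return math.ceil(n)

print(sample_size(20_000))
print(sample_size(20_000, margin_of_error=0.03))
```

A smaller margin of error or a higher confidence level increases the required sample, while the population size matters little once the population is large.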
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample statistic & Population parameter
➔ Sample statistic:
● A sample statistic is a piece of information you get from a fraction
of a population i.e. a sample.
● It can also be defined as any number or statistic computed from
the sample data.
● Example: sample average, median, sample standard deviation,
and percentiles.
➔ Population parameter:
● A quantity or statistical measure, for a given population is called a
population parameter.
● It can also be defined as data that refers to something about an
entire population.
● Example: mean and variance of a population are population
parameters.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample statistic & Population parameter
Decide whether the numerical value describes a population
parameter or a sample statistic.

a.) A recent survey of a sample of 450 college students reported that
the average weekly income for students is $325.

Ans: Because the average of $325 is based on a sample, this is a
sample statistic.

b.) The average weekly income for all students is $405.

Ans: Because the average of $405 is based on a population, this is a
population parameter.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Errors in sampling

Errors in sampling are of two kinds:
● Sampling error or Random error: occurs when the sample is not
representative of the population.
● Non-sampling error or Systematic error: occurs during data collection,
causing the data to differ from the true values.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling error
● The discrepancy between a sample statistic and its population
parameter is called sampling error.
● Defining and measuring sampling error is a large part of inferential
statistics.
● It occurs when the sample is not representative of the population.
● The sampling error for a given sample is unknown but when the
sampling is random, for some estimates (for example, sample mean,
sample proportion) theoretical methods may be used to measure
the extent of the variation caused by sampling error.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling error

As we can see, there is a difference between the population parameters
and the sample statistics. This is due to sampling error.

Two samples from the same population have differing statistics. This is
due to sampling variation. It is also the reason why scientific
experiments produce different results under identical conditions.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Non-sampling error
● Non-sampling errors are the results of mistakes made in
implementing data collection and data processing, such as
○ failure to locate and interview the correct household
○ errors in understanding of the questions by either the
interviewer or the respondent
○ data entry errors
○ missing data
○ poorly conceived concepts, unclear definitions, and defective
questionnaires
○ response errors occurring when people are unaware, refuse to
answer, or overstate in their answers
● Major sources : Sampling Bias, Non-response Bias.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Bias
● Sampling bias occurs when a chosen sample is not representative of
the larger population.
● It occurs due to the sampling technique/method used to perform
data collection.
● It can be either selection bias or non-response bias.
● A sampling method has a sampling bias if all subjects in the
population are not equally likely to be included in a sample.
● That is, a sample is collected in such a way that some members of
the intended population have a lower or higher sampling probability
than others.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Selection Bias & Nonresponse bias
➔ Selection bias:
● It is a bias in which a sample is collected in such a way that some
members of the intended population have a lower or higher
sampling probability than others.
● It results in a biased sample of a population in which all individuals,
or instances, were not equally likely to have been selected.
➔ Nonresponse bias:
● Nonresponse bias is a type of sampling bias that occurs because of
the absence of certain objects or subjects from a sample.
● For example, some subjects don’t respond to surveys because they
refuse, cannot be contacted, or have a lack of interest in the survey
content.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bias ex:
Q) A new chemical process is run 10 times each morning for five
consecutive mornings. If the new process is put into production, it
will be run 10 hours each day, from 7 A.M. until 5 P.M. Is it
reasonable to consider the 50 yields to be a simple random
sample?
Ans) Since the new process runs during both morning and
afternoon, the population consists of all the yields that would ever
be observed, including both morning and afternoon runs.
The sample, however, is drawn only from that portion of the
population that consists of morning runs, and thus it is not a
simple random sample. It is biased: it is not representative of
the population intended to be studied.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling variation
● Simple random samples always differ from their populations in
some ways, and occasionally may be substantially different.
● Two different samples from the same population will differ from
each other as well.
● This phenomenon is known as sampling variation.
● Sampling variation is one of the reasons that scientific experiments
produce somewhat different results when repeated, even when the
conditions appear to be identical.
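
Sampling variation is easy to see in a small simulation. The sketch below is illustrative only: the population (the integers 1 to 1000), the sample size of 50, and the function name `sample_means` are assumptions made for the demonstration.

```python
import random
from statistics import mean

def sample_means(population, n, repeats, rng=None):
    """Draw `repeats` simple random samples of size n and return their means."""
    rng = rng or random.Random()
    return [mean(rng.sample(population, n)) for _ in range(repeats)]

population = list(range(1, 1001))
print("population mean:", mean(population))
print("sample means:   ", sample_means(population, 50, 5, random.Random(42)))
```

Each sample mean typically lands near the population mean, but the samples generally differ from it and from one another.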
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Independence
● The items in a sample are said to be independent if knowing
the values of some of them does not help to predict the values
of the others.
● With a finite, tangible population, the items in a simple
random sample are not strictly independent, because as each
item is drawn, the population changes.
● This change can be substantial when the population is small.
● However, when the population is very large, this change is
negligible and the items can be treated as if they were
independent
● As a rule of thumb, the items in a sample can be treated as independent
if the sample size is smaller than 5% of the population size.
● Since a conceptual population is infinite or very large, a sample
obtained from it (ex: measuring a rock) can always be treated as
independent.
Questions
(Q1.) A physical education professor wants to study the physical
fitness levels of students at her university. There are 20,000 students
enrolled at the university, and she wants to draw a sample of size 100
to take a physical fitness test. She obtains a list of all 20,000 students,
numbered from 1 to 20,000. She uses a computer random number
generator to generate 100 random integers between 1 and 20,000
and then invites the 100 students corresponding to those numbers to
participate in the study. Which sampling technique is used?
Answer:
The simple random sampling technique is used.
Note that it is analogous to a lottery in which each student has a
ticket and 100 tickets are drawn.
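
The lottery described above can be sketched with Python's standard library. This is an illustration rather than the professor's actual procedure: the function name `draw_student_ids` is hypothetical, and the sketch draws without replacement so that no student is invited twice (independently generated random integers could repeat).

```python
import random

def draw_student_ids(population_size=20_000, sample_size=100, rng=None):
    """Simple random sample of student numbers, drawn without replacement."""
    rng = rng or random.Random()
    return rng.sample(range(1, population_size + 1), sample_size)

ids = draw_student_ids(rng=random.Random(2024))
print(sorted(ids)[:10])   # the ten smallest selected student numbers
```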
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q2) A quality engineer wants to inspect rolls of wallpaper in order
to obtain information on the rate at which flaws in the printing are
occurring. She decides to draw a sample of 50 rolls of wallpaper
from a day’s production. Each hour for 5 hours, she takes the 10
most recently produced rolls and counts the number of flaws on
each. Is this a simple random sample?
Answer:
No. Not every subset of 50 rolls of wallpaper is equally likely to
comprise the sample. To construct a simple random sample, the
engineer would need to assign a number to each roll produced
during the day and then generate random numbers to determine
which rolls comprise the sample.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q3) A construction engineer has just received a shipment of 1000
concrete blocks, each weighing approximately 50 pounds. The
blocks have been delivered in a large pile. The engineer wishes to
investigate the crushing strength of the blocks by measuring the
strengths in a sample of 10 blocks. Which sampling method is
suitable?
Answer:
To draw a simple random sample would require removing blocks
from the center and bottom of the pile, which might be quite
difficult. For this reason, the engineer might construct a sample
simply by taking 10 blocks off the top of the pile.
Such a sample is called a convenience sample.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q4) A quality inspector draws a simple random sample of 40 bolts
from a large shipment and measures the length of each. He finds
that 34 of them, or 85%, meet a length specification. He concludes
that exactly 85% of the bolts in the shipment meet the specification.
The inspector’s supervisor concludes that the proportion of good
bolts is likely to be close to, but not exactly equal to, 85%. Which
conclusion is appropriate?
Answer:
Because of sampling variation, simple random samples don’t reflect
the population perfectly. However, they are often fairly close. It is
therefore appropriate to infer that the proportion of good bolts in
the lot is likely to be close to the sample proportion, which is 85%. It
is not likely, however, that the population proportion is exactly equal to 85%.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q5) Another inspector repeats the study with a different simple
random sample of 40 bolts. She finds that 36 of them, or 90%, are
good. The first inspector claims that she must have done something
wrong, since his results showed that 85%, not 90%, of bolts are good.
Is he right?
Answer:
No, he is not right. This is sampling variation at work. Two different
samples from the same population will differ from each other and
from the population.
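Sampling variation can be seen in a small simulation. The population below is hypothetical: 10,000 bolts of which 87% are assumed to be good. That figure is an assumption for illustration only; the slides do not give the true proportion.

```python
import random


def sample_proportion(population, sample_size, rng):
    """Proportion of good items (coded 1) in one simple random sample."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical shipment: 1 = meets the specification, 0 = does not.
population = [1] * 8_700 + [0] * 1_300

# Five inspectors each draw their own simple random sample of 40 bolts.
proportions = [sample_proportion(population, 40, rng) for _ in range(5)]
print(proportions)
```

Each inspector typically obtains a slightly different proportion, and none has to equal the population value exactly.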
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q6) A geologist weighs a rock several times on a sensitive scale. Each
time, the scale gives a slightly different reading. Under what
conditions can these readings be thought of as a simple random
sample? What is the population?
Answer:
If the physical characteristics of the scale remain the same for each
weighing, so that the measurements are made under identical
conditions, then the readings may be considered to be a simple
random sample. The population is conceptual. It consists of all the
readings that the scale could in principle produce.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Questions
(Q7) What sampling method can be recommended in each of the following cases?
● Determining the proportion of undernourished five-year-olds in a
village.
● Investigating the nutritional status of preschool children.
● In estimating immunization coverage in a province, data on
seven children aged 12-23 months in each of 30 clusters are used to
determine the proportion of fully immunized children in the
province. Give reasons why cluster sampling is used in this
survey.
DATA ANALYTICS
References

https://www.spss-tutorials.com/simple-random-sampling-what-is-it/
https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/
Text Book:
Statistics for Engineers and Scientists, William Navidi.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS UE23MA242A
Unit 1: Types of Data & Experiments

Mamatha H R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered

❖ Types of data

❖ Variables or Attributes

❖ Types of studies
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data
● Data refers to individual facts, statistics, or items of information
that are collected through observation.
● It can also be defined as the facts and figures collected,
summarized, analyzed and interpreted.
● The data collected in a particular study are referred to as the
data set.

Source: twinkl.de
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of data
Based on their mathematical properties, data are divided into
four groups, abbreviated NOIR:
• Nominal
• Ordinal
• Interval
• Ratio
From nominal to ratio, the levels are ordered by increasing
• Accuracy
• Power of measurement
• Precision
• Range of applicable statistical techniques

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data
● Quantitative Data are measurements that are recorded on a
naturally occurring numerical scale.
● They lend themselves readily to statistical manipulation and can be
represented by a wide variety of graphs and charts, such as
line charts, bar graphs, scatter plots, etc.
● This type of data answers questions such as
○ “how many”,
○ “how much” and
○ “how often”.
● Example: Age, GPA, Salary, Cost of books this semester, Scores of
tests and exams, weight of a person, temperature in a room etc.
● There are 2 general types of quantitative data:
○ Discrete data
○ Continuous data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data
● Qualitative Data are measurements that cannot be recorded on a
natural numerical scale, but are recorded in categories.
● It is also known as Categorical Data as the information can be sorted
by category, not by number.
● Example: Year in school, Live on/off campus, Major, Gender, colors
etc.
● These can answer the questions like:
○ “how this has happened”, or
○ “why this has happened”.
● In general, there are 2 types of qualitative data:
○ Nominal data
○ Ordinal data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data
● This data type is used just for labeling variables, without having any
quantitative value.
● Here, the term ‘nominal’ comes from the Latin word “nomen” which
means ‘name’.
● They are categories without any particular order or direction.
● Nominal data are sometimes referred to as “labels”.
● Their use is restricted to keeping track of people, objects and
events.
● They are the least powerful level of measurement, with no arithmetic
origin and no order.
● Hence, nominal data are of restricted or limited use.
● Examples: Gender (Women, Men), Hair color (Blonde, Brown,
Brunette, Red, etc.), Marital status (Married, Single, Widowed) etc.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data

● It can’t be manipulated using mathematical operators.
● But it can be visualized using a pie chart.
● Nominal labels can be words or numbers (for example, jersey numbers
or postal codes).
● Numeric labels carry no numerical relationship: they cannot be
meaningfully ordered, added or averaged.

Source: dpbnri2zg3lc2.cloudfront.net, researchgate.net
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data
➔ How to analyze Nominal Data?
● Use the grouping method: group the observations into categories.
● For each category, the frequency or percentage can be calculated.
● Hypothesis testing is carried out using nonparametric tests such as
the Chi-Square test, which determines whether there is a significant
difference between the expected frequencies and the observed frequencies.
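The grouping steps can be sketched in plain Python. The hair-colour observations are made up for illustration, and the expected frequencies assume that all categories are equally likely; both are assumptions of this example, not data from the slides.

```python
from collections import Counter


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical nominal observations (hair colour of 40 people).
hair_colors = ["Brown"] * 18 + ["Blonde"] * 12 + ["Red"] * 10

counts = Counter(hair_colors)            # frequency of each category
categories = sorted(counts)
observed = [counts[c] for c in categories]
total = sum(observed)
percentages = {c: 100 * counts[c] / total for c in categories}

# Assumed null hypothesis: every category is equally likely.
expected = [total / len(categories)] * len(categories)
statistic = chi_square_statistic(observed, expected)

print(percentages)
print(round(statistic, 3))
```

The statistic is then compared with a chi-square critical value with (number of categories - 1) degrees of freedom; for 2 degrees of freedom at the 5% level this value is about 5.991.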

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data examples
Gender, marital status or any alphabetic / numeric code without
intrinsic order or ranking.

Source:
www.slideshare.net

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Ordinal data
● Ordinal data are qualitative data whose values are ordered.
● Ordinal data may indicate superiority.
● But we cannot do arithmetic operations with ordinal data because they
only show the sequence.
● Based on the relative position, we can also assign numbers to ordinal data.
For example, “first, second, third…etc.”
● Ordinal data allow for setting up inequalities, but the differences
between values have no absolute meaning.
● More precise comparisons are not possible.
● Examples:
○ Ranking of users in a competition: The first, second, and third, etc.
○ Rating of a product taken by the company on a scale of 1-10.
○ Economic status: low, medium, and high.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Ordinal data
● Here, the order matters but not the
difference between values.
● Example: Pain Scales
○ Patients are asked to express the
amount of pain they are feeling on a
scale of 1 to 10.
○ A score of 7 means more pain than a
score of 5, and that is more pain than
a score of 3.
○ But the difference between the 7 and
the 5 may not be the same as that
between 5 and 3.
○ The values simply express an order.

Source: Questionpro, slideshare.net
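As a rough sketch of what ordinal data do and do not support, the example below uses the economic status categories mentioned earlier (low, medium, high). The helper names are hypothetical. Ranks are used only for comparing and for locating the median category, never for arithmetic on the categories themselves.

```python
from statistics import median_low

# Ordered categories; only their relative position carries meaning.
STATUS_LEVELS = ["low", "medium", "high"]
STATUS_RANK = {level: rank for rank, level in enumerate(STATUS_LEVELS)}


def is_higher(a, b):
    """Order comparison, which is valid for ordinal data."""
    return STATUS_RANK[a] > STATUS_RANK[b]


def ordinal_median(values):
    """Median category, found through the ranks of the observed values."""
    ranks = [STATUS_RANK[v] for v in values]
    return STATUS_LEVELS[median_low(ranks)]


responses = ["low", "high", "medium", "high", "low"]
print(is_higher("high", "low"))
print(ordinal_median(responses))
```

A mean of the ranks is deliberately not computed, since the distance between adjacent categories is unknown.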

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Discrete data
● A set of data is said to be discrete if the values belonging to the set are
distinct and separate.
● The data values cannot be divided into smaller parts. For example, the
number of students in a class is discrete data, since we
can count whole individuals but cannot count fractions such as 2.5 or 3.75
students.
● It often has a limited number of possible values, e.g. days of the month.
● Discrete data can take only certain values: they ‘jump’
from one value to another without taking any intermediate
value between them.
● Examples of discrete data:
○ The number of students in a class.
○ The number of workers in a company.
○ The number of test questions you answered correctly.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Discrete data
● Bar charts can be used to display discrete numerical data.
● For example, the bar chart below shows the number of CDs bought by a
group of children in a given month.

Source: slideplayer.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Continuous data
● A set of data is said to be continuous if the values belonging to
the set can take on any value within a finite or infinite interval.
● It represents the information that could be meaningfully
divided into its finer levels.
● It can be measured on a scale or continuum and can have
almost any numeric value. For Example, we can measure our
height at very precise scales in different units such as meters,
centimeters, millimeters, etc.
● Examples of continuous data:
○ The amount of time required to complete a project.
○ The height of children.
○ The speed of cars.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Interval data
● Interval data are measured along a scale on which each point is
placed at an equal distance from the next.
● These data are measurable and ordered, with equal spacing between
adjacent values, but have no meaningful zero.
● Interval scales tell us not only the order of the items but
also the size of the difference between any two items.
● Descriptive statistics that can be calculated for interval
data include:
○ Measures of central tendency (mean, median, mode)
○ Range (minimum, maximum)
○ Spread (percentiles, interquartile range, and standard deviation).
● Examples: Temperature (°C or °F, but not Kelvin),
Dates (1055, 1297, 1976 etc.), Time on a 12-hour clock (6 am, 6 pm)
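A short numeric sketch of why the missing true zero matters. The two temperatures are arbitrary illustrative values. Differences in °C are meaningful, but the ratio of two Celsius readings is not; converting to Kelvin, which has a true zero, gives the physically meaningful ratio.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (a scale with a true zero)."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c

# This ratio is 2.0, but 20 °C is NOT "twice as hot" as 10 °C.
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio is only slightly above 1.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 4))
```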
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Ratio data
● Ratio data classify and rank data, and use measured, continuous
intervals, just like interval data.
● However, unlike interval data, ratio data have a true zero.
● This means that zero is an absolute, below which there are no
meaningful values.
● Speed, age, and weight are all good examples, since none can have a
negative value (you cannot be -10 years old or weigh -160 pounds).
● These data also come in ordered units with equal differences between them.
● Ratio data allow for forming quotients in addition to setting up inequalities
and forming differences.
● All mathematical operations (manipulations with real numbers) are
possible on ratio data.
● Ratio data are the most precise and allow for the application of all statistical
techniques.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the type of data
➔ Number of cartons of milk manufactured each day.
Quantitative data, Discrete data
➔ Temperatures of airplane interiors at a given airport in Celsius.
Quantitative data, Continuous data, Interval data.
➔ College major of each student in a class.
Qualitative data, Nominal data
➔ Method of payment
Qualitative data, Nominal data
➔ Incomes of college students on work study programs.
Quantitative data, Continuous data, Ratio data.
➔ Weights of newborn calves.
Quantitative data, Continuous data, Ratio data.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the type of data
➔ Gender of each employee at a company.
Qualitative data, Nominal data
➔ Number of tomatoes on each plant in a field.
Quantitative data, Discrete data
➔ Number of defective items in a lot.
Quantitative data, Discrete data
➔ Salaries of CEOs of oil companies.
Quantitative data, Continuous data, Ratio data.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Attribute or Variable

● An attribute (or variable, feature, dimension) is a data field
representing a characteristic or feature of a data object.
● It is a property of a data object which is measured for each
observation or record.
● It can vary from one observation to another.
● Example : name, age, Student-ID, address, marks, gender etc.
● There are different types of attributes or variables such as:
○ Nominal
○ Ordinal
○ Discrete
○ Continuous
○ Binary
○ Interval
○ Ratio
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable
● A variable that can be measured numerically is called a quantitative
variable.
● The data collected on a quantitative variable are called quantitative
data.
● Thus, a quantitative variable represents a measure and is numeric.
Its values can be recorded on a numeric scale.
● Example: a country’s population, a book’s price, height, weight,
number of items sold to a shopper, time in 100 yard dash etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Discrete
● A variable whose values are countable is called a discrete variable.
● In other words, a discrete variable can assume only certain values
with no intermediate values.
● Its set of possible values is finite or countably infinite.
● Example: number of oranges in a bag, number of students in a
classroom, shoe size etc.

Sources: previews.123rf.com, teachoo.com


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Continuous
● A variable that can assume any numerical value over a certain
interval or intervals is called a continuous variable.
● It can be recorded as precisely as the measuring instrument allows.
● It can take an unlimited number of values between the lowest and
highest points of measurement.
● Continuous attributes are typically represented as floating-point
variables.
● Example: height, weight, temperature

Sources: slideplayer.com, researchgate.com


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative Variable
● A variable that cannot assume a numerical value but can be
classified into two or more non-numeric categories is called a
qualitative variable.
● It is also known as a categorical variable.
● The data collected on such a variable are called qualitative data.
● The values of this variable are not numeric as they do not result from
counting or measuring.
● Thus, arithmetic operations can’t be applied on these variables.
● Example: hair colour, favorite books, religion, political party in
power, profession, name etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative Variable: Nominal
● A nominal variable is a type of variable that is used to name, label or
categorize particular attributes that are being measured.
● Nominal means “relating to names”.
● The values of a nominal attribute are symbols or names of things.
● Each value represents some kind of category, code or state.
● It takes qualitative values representing different categories.
● There is no intrinsic ordering of these categories.
● Example: Gender - male, female;
Marital status - married, unmarried;
Skin colour - dark, white, brown.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative Variable: Ordinal
● An ordinal variable is a type of categorical variable that takes values with an
order or rank.
● Ordinal variables have natural, ordered categories, but the distances
between the categories are not known.
● Example: Likert scale; or the survey question "Is your general health poor,
reasonable, good, or excellent?", whose answers may be coded respectively as
1, 2, 3, and 4.
● Educational status - matriculate, undergraduate, postgraduate.

Source: cdn.wallstreetmojo.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Interval
● An interval variable is measured on a scale of equal-sized units.
● Values of interval variables have order.
● It has no true zero-point.
● Interval variables allow us to rank the items measured in order.
● They also allow us to quantify and compare the magnitude of
differences between them.
● Example: temperature in °C or °F, calendar dates etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Ratio
● Ratio variables represent the highest level of measurement.
● A ratio variable has an inherent or true zero-point.
● The numerical relationship between the values of a ratio variable is
meaningful.
● We can speak of one value as being a multiple (or ratio) of
another (10 K is twice as high as 5 K).
● Example: temperature in Kelvin, length, counts, monetary quantities
etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Properties of Attributes
The type of an attribute depends on which of the following properties it
possesses:
● Distinctness: =, ≠
● Order: <, >
● Addition: +, -
● Multiplication: *, /

➔ Nominal attribute: distinctness
➔ Ordinal attribute: distinctness & order
➔ Interval attribute: distinctness, order & addition
➔ Ratio attribute: all 4 properties
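The cumulative nature of these properties can be sketched as a lookup: each level of measurement has every property of the levels before it. The names used below are illustrative assumptions, not notation from the slides.

```python
# Properties in the order they are gained, and levels in increasing power.
ATTRIBUTE_PROPERTIES = ("distinctness", "order", "addition", "multiplication")
MEASUREMENT_LEVELS = ("nominal", "ordinal", "interval", "ratio")


def properties_of(level):
    """Properties possessed by an attribute at the given level."""
    index = MEASUREMENT_LEVELS.index(level)  # raises ValueError if unknown
    return ATTRIBUTE_PROPERTIES[: index + 1]


def supports(level, prop):
    """True if attributes at `level` support the property `prop`."""
    return prop in properties_of(level)


for level in MEASUREMENT_LEVELS:
    print(level, "->", ", ".join(properties_of(level)))

# Interval attributes support differences but not ratios.
print(supports("interval", "addition"), supports("interval", "multiplication"))
```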
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Examples
In the table below identify which columns represent qualitative
variables and which columns represent quantitative variables.

Answer: Qualitative variables: Name, River, State
Quantitative variables: Height, Completed
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of studies

We do studies to gather information and draw conclusions. The
type of conclusion we draw depends on the study method used:

I. Observational study: In an observational study, we measure
or survey members of a sample without trying to affect them.
II. Controlled study: In a controlled experiment, we assign
people or things to groups and apply some treatment to one
of the groups, while the other group does not receive the
treatment.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Control Group vs Experimental Group
➔ Control Group:
● A control group is a group separated from the rest of the experiment
such that the independent variable being tested cannot influence the
results.
● This isolates the independent variable’s effect on the experiment and
can help rule out alternative explanations of the experimental results.
➔ Experimental Group:
● An experimental group is a test sample or the group that receives an
experimental procedure.
● This group is exposed to changes in the independent variable being
tested.
● The values of the independent variable and the impact on the
dependent variable are recorded. An experiment may include
multiple experimental groups at one time.
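One common way to form the two groups is random assignment, sketched below. The function name, the participant IDs and the even split are assumptions made for this illustration; the slides do not prescribe a particular procedure.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into an experimental and a control group.

    Shuffling before splitting means each participant is equally likely
    to end up in either group.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Hypothetical study with 20 participants numbered 1..20.
groups = assign_groups(range(1, 21), seed=1)
print(groups["experimental"])
print(groups["control"])
```

Only the experimental group would then receive the treatment, and the dependent variable is recorded for both groups.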
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Controlled Experiment
● While all experiments have an experimental group, not all experiments
require a control group.
● Controls are extremely useful where the experimental conditions are
complex and difficult to isolate.
● Experiments that use control groups are called controlled experiments.

Source: cdn.kastatic.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Observational Study: Example
• There have been many studies conducted to determine the effect of
cigarette smoking on the risk of lung cancer. In these studies, rates of
cancer among smokers are compared with rates among non-smokers.

• The experimenters cannot control who smokes and who doesn’t; people
cannot be required to smoke just to make a statistician’s job easier.

• Observational studies are not nearly as good as controlled experiments for
obtaining reliable conclusions regarding cause and effect. For example,
people who choose to smoke may be more likely to get cancer for other
reasons.

• It took many years of carefully done observational studies before scientists
could be sure that smoking was actually the cause of the higher rate.

Source: cdn.kastatic.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Observational study vs. Experimental study
Observational Study                         | Experimental Study
Observe only, no “treatment” assigned.      | “Treatment” assigned.
Generally a control group is not needed.    | Uses control group for comparison.
Reports an association.                     | Reports a cause and effect.
May (or may not) use random sample sets.    | Randomization of sample group.
May (or may not) generalize to population.  | Generalizes to population.

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the types of study
Q1. A study took a random sample of adults and asked them about their bedtime
habits. The data showed that people who drank a cup of tea before bedtime
were more likely to go to sleep earlier than those who didn't drink tea.

Answer : Observational Study

Q2. A study took a group of adults and randomly divided them into two groups.
One group was told to drink tea every night for a week, while the other group
was told not to drink tea that week. Researchers then compared when each
group fell asleep.

Answer : Experimental Study

Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the types of study
Q3. A study randomly assigned volunteers to one of two groups:
● One group was directed to use social media sites as they usually do.
● One group was blocked from social media sites.

Answer : Experimental Study

Q4. A study took a random sample of people and examined their social
media habits. Each person was classified as either a light, moderate, or
heavy social media user. The researchers looked at which groups tended
to be happier.

Answer : Observational Study

Slide Courtesy:Dr.Uma
DATA ANALYTICS
References

https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-data-types-in-statistics-for-data-science/

Text Book:
Statistics for Engineers and Scientists, William Navidi.