MSCE Unit 1 Slides Compressed
Generation of Random Variates
Dr.Mamatha.H.R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered
❖ Random Numbers
❖ Random Variates
Source: medium.com
Generating Random Numbers
● Problem:
Generate a sample of a random variable X with a given density f. (The
sample is called a random variate.)
● Answer:
Develop an algorithm such that if it is used repeatedly (and
independently) to generate a sequence of samples X1, X2, . . . , Xn, then
as n becomes large, the proportion of samples that fall in any interval
[a, b] is close to P(X ∈ [a, b]), i.e.,
$P(X \in [a, b]) = \int_a^b f(x)\,dx$
Source: http://www.cs.bilkent.edu.tr/~cagatay/cs503/_M&S_04_Random_Variate_Generation.pdf
Random Variate
● A random variate is a variable generated from uniformly distributed
pseudorandom numbers.
● It is a particular outcome of a random variable.
● Other random variates drawn from the same random variable may take
different values.
● Random variates are used when simulating processes that are driven by
random influences (stochastic processes).
● They are frequently used as the input to simulation models.
● Procedures to generate random variates corresponding to a given
distribution are known as procedures for random variate generation or
pseudo-random number sampling.
● Depending on how they are generated, random variates can be
uniformly or non-uniformly distributed.
● Examples: Inter-arrival time and service time.
Random Variate Generation
● Random variate generation is a fundamental aspect of simulation
modeling and analysis.
● The objective of random variate generation is to produce observations
that have the stochastic properties of a given random variable.
● Various methods and algorithms have been developed to generate
random variates that are accurate (representative of the target
distribution) and computationally efficient.
● The distribution from which random variates are generated is assumed
to be completely specified.
● We wish to generate samples from this distribution as input to a
simulation model.
● Random variate generation relies on generating uniformly distributed
random numbers in the closed interval [0,1].
● Random variate generators use random numbers distributed as U[0,1]
as their starting point.
Random Variate Generation : Objectives
● The objective of random variate generation is to produce sample
observations that have the stochastic properties of a given random
variable, X, having distribution function
F(x) = Pr(X ≤ x) , where −∞ < x < ∞
● The development of the theory/concepts surrounding random variate
generation via computer algorithms is based on the following two key
assumptions :
○ Assumption 1 : There exists a perfect uniform (0,1), U(0,1), random
number generator that can produce a sequence of independent
random variables uniformly distributed on (0,1).
○ Assumption 2 : Computers can store and manipulate real numbers.
● Although Assumptions 1 and 2 are used for developing Random
Variate Generation theory, the assumptions are violated when
implementing Random Variate Generation algorithms on digital
computers.
Factors to be considered for random variate generation
1. Exactness:
● Exactness or accuracy refers to how well the generator produces
random variates with the characteristics of the desired
distribution.
● This refers to the theoretical exactness of the random variate
generator itself, as well as the error that is induced by the U(0,1)
random number generator and the error induced by digital
computer calculations.
2. Speed:
● Speed refers to the computational set-up and execution time
required to generate random variates. Contributions to time are:
a. Setup time
b. Variable generation time
3. Space:
● Space refers to computer memory that is required for the generator.
● Although space is not typically a major consideration for modern
computers, computer memory was an important consideration in the
early days of Random Variate Generation development.
4. Simplicity:
● Simplicity refers to both the simplicity of the algorithm and the
simplicity of its implementation.
● This includes the number of lines of code, support routines required,
number of mathematical operations, as well as portability across
platforms and interaction with other simulation methods such as
variance reduction techniques.
The importance of each of these factors will vary depending on the
particular situation or simulation application.
Random Variate Generation Techniques
● We assume that a pseudo random number generator
RN(0,1) producing a sequence of independent values
between 0 and 1 is available.
● General methods:
○ Inverse transform method
○ Acceptance-rejection method
○ Composite method
○ Translations and other simple transforms
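As an illustration of the first method in the list, here is a minimal sketch of the inverse transform method for an exponential random variate. The exponential target, the `rate` parameter, and the function name are assumptions chosen for illustration; they do not come from the slides. Python's `random.Random` stands in for the generator RN(0,1).

```python
import math
import random

def exponential_variate(rate, rng):
    """Inverse transform: solve F(x) = 1 - exp(-rate * x) = u for x."""
    u = rng.random()                    # u ~ U[0, 1), plays the role of RN(0,1)
    return -math.log(1.0 - u) / rate    # 1 - u lies in (0, 1], so the log is defined

rng = random.Random(42)                 # fixed seed so the run is repeatable
sample = [exponential_variate(2.0, rng) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)  # should be close to 1 / rate = 0.5
```

The same pattern works for any distribution whose cumulative distribution function can be inverted in closed form.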
Acceptance and Rejection Method
● The acceptance–rejection method is often used when a closed-form
cumulative distribution function does not exist or is difficult to
calculate.
● In this method, variates are generated from one distribution and are
either accepted or rejected in such a way that the accepted values
have the desired distribution.
● General acceptance–rejection algorithm:
(i) Given a random variable X, let f(x) denote the desired density
function of X.
(ii) Let t(x) be any majorizing function of f(x), such that t(x) ≥ f(x) for all
values of x.
(iii) Let g(x) = t(x)/c denote the density function proportional to t(x), where
$c = \int_{-\infty}^{\infty} t(x)\,dx$, so that $\int_{-\infty}^{\infty} g(x)\,dx = 1$.
Source: Kuhl, Michael E. "History of random variate generation." 2017 Winter Simulation
Conference (WSC). IEEE, 2017.
(iv) Generate x ∼ g(x).
(v) Generate u ∼ U(0,1).
(vi) If u > f(x)/t(x), then reject x and go to step (iv).
(vii) Otherwise, return x.
Source: cs.bilkent.edu.tr
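Steps (iv) to (vii) can be sketched as a short loop. This is a minimal sketch, not a definitive implementation: the function name `accept_reject`, the example density f(x) = 2x on [0, 1], and the constant majorizing function t(x) = 2 are assumptions made for illustration.

```python
import random

def accept_reject(f, t, sample_g, rng):
    """Return one variate with density f, using a majorizing function t >= f.

    sample_g(rng) must return a draw from the density g(x) = t(x) / c.
    """
    while True:
        x = sample_g(rng)         # step (iv): x ~ g
        u = rng.random()          # step (v): u ~ U(0, 1)
        if u <= f(x) / t(x):      # step (vi): otherwise reject x and try again
            return x              # step (vii)

# Hypothetical example: f(x) = 2x on [0, 1], t(x) = 2, so c = 2 and g is U(0, 1).
f = lambda x: 2.0 * x
t = lambda x: 2.0
sample_g = lambda rng: rng.random()

rng = random.Random(1)
xs = [accept_reject(f, t, sample_g, rng) for _ in range(50_000)]
mean_x = sum(xs) / len(xs)        # E[X] = 2/3 for this density
```

With this choice of t(x), each candidate is accepted with probability 1/c = 1/2, so on average two candidates are needed per returned variate.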
Illustration: To generate random variates, X ~ U(1/4,1)
Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
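The figure for this illustration is not reproduced here. A minimal sketch, assuming the usual procedure for this example (draw R ~ U(0,1), accept X = R if R ≥ 1/4, otherwise reject and draw again); the function name is hypothetical.

```python
import random

def uniform_quarter_to_one(rng):
    """Draw R ~ U(0, 1) repeatedly and return the first R that is at least 1/4."""
    while True:
        r = rng.random()
        if r >= 0.25:         # accept
            return r
        # otherwise reject R and draw a new random number

rng = random.Random(7)
xs = [uniform_quarter_to_one(rng) for _ in range(40_000)]
mean_x = sum(xs) / len(xs)    # U(1/4, 1) has mean 0.625
```

About one draw in four is rejected, so the accepted values are uniform on [1/4, 1) at the cost of some wasted random numbers.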
Acceptance and Rejection Method: Drawback
● Trials ratio: the average number of points (X,Y) needed to produce one
accepted X.
● The trials ratio should be as close to 1 as possible.
● Otherwise the generator may be inefficient because of wasted
computing effort.
Acceptance and Rejection Method: to increase efficiency
● One way to make the generator efficient is
to generate points uniformly scattered under a function e(x), where e(x) ≥ f(x) and
the area between the graphs of f and e is small.
Source: cs.bilkent.edu.tr
Acceptance and Rejection Method: Constructing e(x)
● Take e(x) = Kg(x)
● g(x) = density function of a distribution for which an easy way of
generating variates already exists.
● K = scale factor
Source: cs.bilkent.edu.tr
Acceptance and Rejection Method: Producing (X, Y)
● Let X = a variate produced from the density g(x)
● Let U = RN(0,1)
● (X,Y) = (X, U·K·g(X))
Source: cs.bilkent.edu.tr
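The construction of the points (X, Y) and the trials ratio can be sketched together. This is a sketch under stated assumptions: the target density f(x) = 6x(1 − x) on [0, 1], the choice g = U(0,1), and K = 1.5 are hypothetical, and the acceptance test Y ≤ f(X) is the usual rule rather than something stated on the slide.

```python
import random

K = 1.5                                 # scale factor, so that K * g(x) >= f(x) on [0, 1]
f = lambda x: 6.0 * x * (1.0 - x)       # hypothetical target density, maximum value 1.5
g = lambda x: 1.0                       # density of U(0, 1), which is easy to sample

def generate(rng):
    """Return (accepted X, number of points (X, Y) that were tried)."""
    trials = 0
    while True:
        trials += 1
        x = rng.random()                # X drawn from the density g
        y = rng.random() * K * g(x)     # Y = U * K * g(X), uniform under e(x) = K g(x)
        if y <= f(x):                   # the point lies under the graph of f
            return x, trials

rng = random.Random(3)
results = [generate(rng) for _ in range(50_000)]
trials_ratio = sum(n for _, n in results) / len(results)   # expected to be near K
```

Because g is a density, the area under e(x) = K g(x) is K, so the trials ratio equals K; the closer e(x) hugs f(x), the closer K is to 1.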
Acceptance and Rejection Method: Poisson Distribution
Procedure of generating a Poisson random variate N is as follows:
1. Set n = 0, P = 1.
2. Generate a random number R_{n+1}, and replace P by P · R_{n+1}.
3. If P < exp(−α), then accept N = n.
Otherwise, reject the current n, increase n by one, and return to step 2.
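The three steps above translate almost line for line into code. A minimal sketch, assuming Python's `random.Random` as the source of random numbers; the value α = 0.2 used in the demonstration is a hypothetical choice.

```python
import math
import random

def poisson_variate(alpha, rng):
    """Generate one Poisson(alpha) variate by multiplying uniform random numbers."""
    n, p = 0, 1.0                     # step 1
    threshold = math.exp(-alpha)
    while True:
        p *= rng.random()             # step 2: P <- P * R_{n+1}
        if p < threshold:             # step 3: accept N = n
            return n
        n += 1                        # otherwise reject n and return to step 2

rng = random.Random(11)
xs = [poisson_variate(0.2, rng) for _ in range(100_000)]
mean_n = sum(xs) / len(xs)            # a Poisson(alpha) variate has mean alpha
```

Each returned value N = n consumes n + 1 random numbers, which is why larger values of α make this generator slower.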
Acceptance and Rejection Method: Poisson Distribution example
Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
• It took five random numbers to generate three Poisson
variates
• In the long run, generating a Poisson variate with mean α requires
about α + 1 random numbers, so there is some overhead.
Source: https://www.mi.fu-berlin.de/inf/groups/ag-tech/teaching/2012_SS/L_19540_Modeling_and_Performance_Analysis_with_Simulation/07.pdf
Acceptance and Rejection Method: Normal Distribution
● If X is a random variable from a normal distribution N(0,
1), then the density of |X| is given by the function
$f(x) = \sqrt{2/\pi}\, e^{-x^2/2}$ for $x \ge 0$.
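One common way to use this fact is sketched below. The density of |X| is f(x) = sqrt(2/π)·exp(−x²/2) for x ≥ 0. Assuming an Exp(1) envelope g(x) = exp(−x) with scale factor K = sqrt(2e/π) ≈ 1.32 (an assumption of this sketch, not taken from the slide), the acceptance ratio f(x)/(K·g(x)) simplifies to exp(−(x − 1)²/2). A random sign then turns |X| into a standard normal variate.

```python
import math
import random

def standard_normal(rng):
    """Acceptance-rejection for |X| with an Exp(1) envelope, then a random sign."""
    while True:
        x = -math.log(1.0 - rng.random())           # candidate from Exp(1)
        u = rng.random()
        if u <= math.exp(-0.5 * (x - 1.0) ** 2):    # f(x) / (K * g(x))
            break                                   # x now has the density of |X|
    return x if rng.random() < 0.5 else -x          # attach a random sign

rng = random.Random(5)
zs = [standard_normal(rng) for _ in range(100_000)]
mean_z = sum(zs) / len(zs)                                    # close to 0
var_z = sum((z - mean_z) ** 2 for z in zs) / (len(zs) - 1)    # close to 1
```

The trials ratio for this envelope is K ≈ 1.32, so roughly three candidates out of four are accepted.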
Dr.Mamatha H R
Professor, Department of Computer Science
[email protected]
PROBABILITY PLOTS
D. Uma
Mamatha H R
https://www.statology.org/histogram-mean-median/
Summary statistics for the four histograms:
Data set 1: Median = 6, Mean = 7.1, Mode = 0, SD = 6.8, Range = 0 to 24
Data set 2: Median = 5, Mean = 5.4, Mode = none, SD = 1.8, Range = 2 to 9
Data set 3: Median = 3, Mean = 3.4, Mode = 3, SD = 2.5, Range = 0 to 12
Data set 4: Median = 7:00, Mean = 7:04, Mode = 7:00, SD = 0:55, Range = 5:30 to 9:00
Mean ± k·SD intervals (lower bounds are truncated at 0 where the computed value is negative):
7.1 ± 6.8 = 0.3 – 13.9
7.1 ± 2*6.8 = 0 – 20.7
7.1 ± 3*6.8 = 0 – 27.5
5.4 ± 1.8 = 3.6 – 7.2
5.4 ± 2*1.8 = 1.8 – 9.0
5.4 ± 3*1.8 = 0 – 10.8
3.4 ± 2.5 = 0.9 – 5.9
3.4 ± 2*2.5 = 0 – 8.4
3.4 ± 3*2.5 = 0 – 10.9
7:04 ± 0:55 = 6:09 – 7:59
7:04 ± 2*0:55 = 5:14 – 8:54
7:04 ± 3*0:55 = 4:19 – 9:49
What Do Probability Plots Mean?
For a symmetric, bell-shaped (normal) distribution, the mean, median and mode will be similar and lie at the
same point.
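$p_i = (i - 0.5)/n$
where
i is the position of the data item in the sorted sample and
n is the size of the data set.
4) Find the theoretical quantiles Qi, the 100·p_i-th percentiles of the assumed distribution.
5) Plot every point (xi , Qi).
6) Plot (xi , xi) as a reference line.
7) Check whether the points lie approximately on a straight
line. This helps us to judge whether the sample plausibly comes from the assumed type of distribution.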
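The points of a normal probability plot can be computed with the standard library. This is a minimal sketch: it assumes the plotting positions (i − 0.5)/n, a reference distribution N(5, 2²), and a made-up five-value sample; the function name is hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(data, mu, sigma):
    """Pair each sorted value x_i with the theoretical quantile Q_i of N(mu, sigma^2)."""
    xs = sorted(data)
    n = len(xs)
    dist = NormalDist(mu, sigma)
    # Q_i is the 100 * (i - 0.5) / n percentile of the reference distribution.
    return [(x, dist.inv_cdf((i - 0.5) / n)) for i, x in enumerate(xs, start=1)]

sample = [6.1, 2.9, 5.2, 4.4, 7.3]            # hypothetical data
points = normal_plot_points(sample, mu=5.0, sigma=2.0)
quantiles = [q for _, q in points]            # 10th, 30th, 50th, 70th, 90th percentiles
```

Plotting `points` and checking how closely they follow a straight line is then a matter of passing them to any plotting tool.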
Normal probability plot, coffee: right-skewed (concave up).
Normal probability plot, writing: neither right-skewed nor left-skewed, but a big gap at 6.
Normal probability plot, exercise: right-skewed (concave up).
Normal probability plot, wake-up time: closest to a straight line.
Formal tests for normality, results:
Coffee: Strong evidence of non-normality (p<.01)
Writing love: Moderate evidence of non-normality
(p=.01)
Exercise: Weak to no evidence of non-normality (p>.10)
Wakeup time: No evidence of non-normality (p>.25)
Different ways of dividing data equally between 0 and 1.
Problem:
Solution:
From the plot we can infer that (X1, 0.1) intersects at the point
(Q1, 0.1). We understand that Q1 is the 10th percentile of
the N(5, 2²) distribution.
In this example the Qi are the 10th, 30th, 50th, 70th, and 90th
percentiles of the N(5, 2²) distribution.
Probability plots can still be used for smaller samples, but they
will detect only fairly large departures from normality.
Interpreting Probability Plots
However, a point that is very far from the line when most
other points are close is an outlier, and deserves attention.
THANK YOU
D. Uma
Mamatha H R
Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
UE23MA242A
Mamatha.H.R
Mamatha H R
Department of Computer Science and Engineering
Topics to be covered
❖ Statistical Analysis
❖ Population
❖ Sample
❖ Sampling
❖ Types of Population
Problems to be solved
Source: media3.giphy.com
Population
A population is the entire collection of objects or outcomes about
which information is sought.
Source: keydifferences.com
Sample
A sample is a subset of a population, containing the objects or
outcomes that are actually observed.
Sample size: The number of items in a sample is called a sample size.
The size of the sample is always less than the total size of the
population.
The process of taking a predetermined number of observations from
a larger population is called sampling.
Population: All countries of the world
Sample: Countries with published data available on birth rates and GDP since 2000
Population: Songs from the Eurovision Song Contest
Sample: Winning songs from the Eurovision Song Contest that were performed in English
Population: Undergraduate students in the Netherlands
Sample: 300 undergraduate students from three Dutch universities who volunteer for your psychology research study
Population: Advertisements for IT jobs in the Netherlands
Sample: The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020
Populations & Samples
Responses of 250
students in survey
(sample)
A city council member wanted to know how her constituents felt about a
planned rezoning. She randomly selected 75 names from the city phone
directory and conducted a phone survey. Identify the population and sample
in this setting.
Answer:
INFERENCE
The population
All of the individuals of interest
The sample
The individuals selected to
participate in the research study
Sampling
Population: use parameters to summarize features.
Sample: use statistics to summarize features.
1. Conceptual population
Tangible population
Populations where the members are physical objects, such as cars,
bolts, apples, etc., are called tangible or concrete populations.
Such populations are always assumed to be finite and therefore
involve counting.
After an item is sampled, the population size decreases by 1.
In principle, one could in some cases return the sampled item to the
population, with a chance to sample it again, but this is rarely done
in practice.
Source: https://www.hindivarta.com/jansankhya-ki-samasya-aur-samadhan-par-nibandh/
Slide courtesy: Dr. Uma
Conceptual population
Populations that do not consist of physical or actual objects are
called Conceptual populations.
Conceptual populations are mostly the result of a measurement.
It involves measuring something multiple times.
Ex: length of a metal rod.
Source: https://image.slidesharecdn.com/qrmtheory-180918191951/95/how-to-do-sampling-8-638.jpg?cb=1537298482
Sampling Breakdown
Study : Find the mean weight of all students of all universities in India.
https://www.questionpro.com/blog/population-vs-sample/
https://www.scribbr.com/methodology/population-vs-sample/
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
[email protected]
+91 80 2672 1983 Extn 834
MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS
Unit 1: Additional Examples
Answer:
Ans: 127.4
Measures of central tendency: Median-example
Consider the data given below:
5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7
Calculate the median.
Ans:
To calculate the median, we need to put the numbers in order and find the
middle value.
3 4 5 5 5 7 9 12 14 16 19
n = 11
● Here the median is 7 because this is the middle value.
● Half of the other values in the list are below 7 and half are above 7.
Measures of central tendency: Median-example
Consider the data given below:
3, 6, 7, 8, 11, 15
Calculate the median.
Ans:
● When there are an even number of values, there is no clear middle
value.
In this case, there are two middle values.
3 6 7 8 11 15
n=6
● The median is the mean of these two middle numbers: (7 + 8)/2 = 7.5.
So the median for this set of values is 7.5.
● Like the mean, the median value does not always appear in the
original list of values.
Measures of central tendency: Mode-example
Consider the data given below:
5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7
Calculate the mode of the above data.
Ans:
3 4 5 5 5 7 9 12 14 16 19
In this list the mode is 5, because it appears most often.
Measures of central tendency: Mode and Range-example
In the article “Evaluation of Low-Temperature Properties of HMA Mixtures” (P.
Sebaaly, A. Lake, and J. Epps, Journal of Transportation Engineering, 2002: 578–
583), the following values of fracture stress (in megapascals) were measured for a
sample of 24 mixtures of hot-mixed asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Ans:
There are three modes:
80, 179, and 232.
Each of these values appears twice, and no other value appears more than once.
The range is 470 − 30 = 440.
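The median, mode and range examples above can be checked with Python's `statistics` module. This is a minimal sketch; the variable names are chosen for illustration, and `multimode` requires Python 3.8 or later.

```python
from statistics import median, mode, multimode

odd_data = [5, 9, 12, 4, 5, 14, 19, 16, 3, 5, 7]
even_data = [3, 6, 7, 8, 11, 15]
hma = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
       223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

median_odd = median(odd_data)     # middle value of the 11 sorted numbers
median_even = median(even_data)   # mean of the two middle values
mode_odd = mode(odd_data)         # most frequent value
hma_modes = multimode(hma)        # all values tied for the highest frequency
hma_range = max(hma) - min(hma)   # largest value minus smallest value
```

`multimode` is the safer choice when a data set may have several modes, since `mode` returns only one value.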
Percentile example
Find the 65th percentile for the following data
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Ans:
R = (P/100)(N+1)
= (65/100)(24+1)
= 16.25 (not an integer)
The 65th percentile is therefore found by averaging the 16th and 17th data points:
(236 + 240)/2 = 238
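The rank rule used above can be written as a small function. This is a sketch of the rule as it appears on these slides (take the value at rank R if R is an integer, otherwise average the two neighbouring values); the function name and the error handling are assumptions.

```python
def percentile(data, p):
    """p-th percentile using the rank R = (p / 100) * (n + 1)."""
    xs = sorted(data)
    n = len(xs)
    r = p * (n + 1) / 100             # multiply first so whole ranks stay exact
    if not 1 <= r <= n:
        raise ValueError("rank falls outside the data")
    k = int(r)
    if r == k:                        # integer rank: take that data point
        return xs[k - 1]
    return (xs[k - 1] + xs[k]) / 2    # otherwise average the k-th and (k+1)-th points

hma = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
       223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
p65 = percentile(hma, 65)             # rank 16.25, so average the 16th and 17th values
```

Other software packages use different interpolation rules, so their percentiles may differ slightly from this one.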
Measures of Spread: Inter-quartile Range question
For the following data sets, calculate the quartiles and find the interquartile
range.
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
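Ans:
n = 24.
First quartile: 0.25(25) = 6.25, so average the 6th and 7th values:
(105 + 126)/2 = 115.5
Second quartile (median): 0.5(25) = 12.5, so average the 12th and 13th values:
(191 + 223)/2 = 207
Third quartile: 0.75(25) = 18.75, so average the 18th and 19th values:
(242 + 245)/2 = 243.5
Interquartile range = 243.5 − 115.5 = 128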
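The quartiles and the interquartile range can be computed with the same positional rule, q(n + 1)/4. A minimal sketch, assuming at least three observations; the function name `quartile` is hypothetical.

```python
def quartile(data, q):
    """q-th quartile (q = 1, 2 or 3) at position q * (n + 1) / 4 in the sorted data."""
    xs = sorted(data)
    r = q * (len(xs) + 1) / 4
    k = int(r)
    if r == k:                        # integer position: take that value
        return xs[k - 1]
    return (xs[k - 1] + xs[k]) / 2    # otherwise average the two neighbours

hma = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
       223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
q1, q2, q3 = (quartile(hma, q) for q in (1, 2, 3))
iqr = q3 - q1                         # interquartile range
```

For this data set the positions are 6.25, 12.5 and 18.75, so each quartile is an average of two neighbouring values.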
For n observations, the likelihood function is
$L(p) = f(x_1, x_2, \ldots, x_n \mid p) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid p)$
Bernoulli Distribution – Estimate Likelihood Function for (p)
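The slide content for this derivation is not reproduced here. The standard derivation, sketched under the assumption of n independent Bernoulli(p) observations x_1, ..., x_n with s denoting the number of successes, runs as follows:

```latex
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = p^{s} (1 - p)^{n - s},
  \qquad s = \sum_{i=1}^{n} x_i \\
\ln L(p) &= s \ln p + (n - s) \ln(1 - p) \\
\frac{d}{dp} \ln L(p) &= \frac{s}{p} - \frac{n - s}{1 - p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{s}{n}
\end{align*}
```

The maximum likelihood estimate of p is therefore the sample proportion of successes.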
Example – Binomial Distribution – Estimate Likelihood Function
Problem:
The following data are the observed frequencies of occurrence
of domestic accidents (n = 647 observations):
Number of accidents    Frequency
0                      447
1                      132
2                      42
3                      21
4                      3
5                      2
What is the estimate of λ if a Poisson model is assumed ?
Problem Source - http://wwwf.imperial.ac.uk/
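For a Poisson model the maximum likelihood estimate of λ is the sample mean, a standard result that is assumed here rather than derived. A minimal sketch of the arithmetic for the frequency table above; the variable names are chosen for illustration.

```python
# Observed frequencies: number of accidents -> number of observations
counts = {0: 447, 1: 132, 2: 42, 3: 21, 4: 3, 5: 2}

n = sum(counts.values())                                   # total number of observations
total_accidents = sum(k * freq for k, freq in counts.items())
lambda_hat = total_accidents / n                           # sample mean = MLE of lambda
```

Here n = 647 and the total number of accidents is 301, so the estimate is 301/647, roughly 0.465.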
Example - Poisson Distribution
Solution:
Let x1, ... ,xn be a random sample from a N(µ, σ²) population. Find the MLEs of µ and σ.
Normal Distribution – Estimate Likelihood Function
Text Book – Chapter 4.9 - Pg. No: 284
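The worked solution for the normal case is not reproduced here. The standard result, sketched for a random sample x_1, ..., x_n from N(µ, σ²), is obtained by maximizing the log-likelihood:

```latex
\begin{align*}
\ln L(\mu, \sigma) &= -\frac{n}{2} \ln(2\pi\sigma^2)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\
\frac{\partial \ln L}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
  \quad\Longrightarrow\quad \hat{\mu} = \bar{x} \\
\frac{\partial \ln L}{\partial \sigma} &= -\frac{n}{\sigma}
  + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 = 0
  \quad\Longrightarrow\quad \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

Note that the MLE of σ divides by n rather than n − 1, so it differs slightly from the usual sample standard deviation.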
Normal Distribution – Estimate Likelihood Function
Problem:
Solution:
The main reason for this is that in most cases that arise in
practice, MLEs have two very desirable properties,
Prof. Mamatha H R
Prof. Uma D
Prof. Silviya Nancy J
Prof. Suganthi S
Department of Computer Science and Engineering
PRINCIPLES OF POINT ESTIMATION
D. Uma
Mamatha H R
Point Estimator.
Measuring Goodness of an Estimator.
Sample Statistics & Population Parameters
There is also a chance that the samples we examine will
contain some errors.
Image Source: http://clipart-library.com/
Sampling
Property : 1 - Bias
Property: 2 - Consistency
Property: 3 - Efficiency
a) smallest variance.
b) unbiased observation.
c) consistent.
By definition,
Solution:
$\mathrm{MSE}(\hat{\mu}) = \mathrm{variance}(\hat{\mu}) + \mathrm{bias}^2(\hat{\mu})$
Therefore $\mathrm{MSE}(\hat{\mu}) = \sigma^2/n + 0 = \sigma^2/n$
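The result MSE = σ²/n for the sample mean can be checked by simulation. This is a minimal sketch: the normal population and the values µ = 10, σ = 2, n = 25 are hypothetical choices made only for the demonstration.

```python
import random

mu, sigma, n = 10.0, 2.0, 25          # hypothetical population parameters and sample size
reps = 20_000                         # number of simulated samples
rng = random.Random(0)

squared_errors = []
for _ in range(reps):
    xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n   # sample mean of one sample
    squared_errors.append((xbar - mu) ** 2)

mse_estimate = sum(squared_errors) / reps   # should be near sigma**2 / n = 0.16
```

Increasing n shrinks the estimated MSE in proportion to 1/n, which matches the formula.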
D. Uma
Mamatha H R
Mamatha H R
Department of Computer Science and Engineering
Topics to be covered
❖ Statistics
❖ Types of Statistics
❖ Summary Statistics
Statistics
● Statistics is the science of data. It involves collecting, classifying,
summarizing, organizing, analyzing, and interpreting numerical
information.
● It involves study and manipulation of data, including ways to
gather, review, analyze, and draw conclusions from data.
• Collecting Data
• Ex: Survey
• Presenting Data
• Ex: Charts & Tables
• Characterizing Data
• Ex: Average
Slide Courtesy:Dr.Uma
Why do we need to know about statistics
● To know how to properly present information.
● To know how to draw conclusions about populations based on sample
information.
● To know how to improve processes.
● To know how to obtain reliable forecasts.
● To find out why a process behaves the way it does.
● To find out why a process produces defective goods and services.
● To check various performance measures of a process.
● To prevent problems caused by various causes of variation in process.
● To analyze the real world.
Slide Courtesy:Dr.Uma
Applications of Statistics
• Economics
  • Forecasting
  • Demographics
• Sports
  • Individual & Team Performance
• Engineering
  • Construction
  • Materials
• Business
  • Consumer Preferences
  • Financial Trends
Processes of statistics
Statistics involves 2 main processes:
POPULATION: A population is the entire collection of objects or outcomes about which
information is sought.
SAMPLE: A sample is a subset of a population, containing the objects or outcomes that
are actually observed.
Source: sigmamagic.com
Sample Statistic and Population Parameter
➔ Sample statistic:
● It is a numerical measurement describing some characteristic of a sample.
● Example: sample average, median, sample standard deviation, and percentiles.
➔ Population parameter:
● It is a numerical measurement describing some characteristic of a population.
● Example: mean and variance of a population are population parameters.
Sources: youtube.com, Pinkmonkey.com
Slide Courtesy: Dr. Uma
Taxonomy of Statistics
Source: image.slidesharecdn.com
Branches of Statistics
The study of statistics has two major branches:
1) Descriptive statistics
2) Inferential statistics
Statistics
Descriptive Inferential
statistics statistics
■ Present Data
■ e.g. Tables and graphs
■ Characterize Data
■ e.g. Sample mean
■ Summarize Data
■ Central Tendency
■ Variation
Descriptive Statistics
➔ Organizing Data
◆ Tables
● Frequency Distributions
● Relative Frequency Distributions
◆ Graphs
● Bar Chart or Histogram
● Stem and Leaf Plot
● Frequency Polygon
➔ Summarizing Data:
Source: www.slideshare.net
Why is Descriptive Statistics used?
Source: slidetodoc.com
Descriptive Statistics Examples
Source: luminousmen.com/post/
Why is Inferential Statistics used?
Suppose you want to know the mean income of the subscribers of
Netflix.
● Mean (µ) — a parameter of a population.
● You draw a random sample of 100 subscribers and determine
that their mean income is $27,500.
● Mean( x̅ ) = $27,500 (a summary statistic).
● Conclusion : You conclude that the population mean income μ
is likely to be close to $27,500 as well.
● This is an example of statistical inference.
Slide Courtesy:Dr.Uma
Inferential Statistics examples
● You randomly select a sample of 11th graders in your state
and collect data on their SAT scores and other characteristics.
Source: miro.medium.com
Descriptive Statistics vs Inferential Statistics
Source: selecthub.com
Slide Courtesy:Dr.Uma
Questions
Q1. In a recent study, volunteers who had less than 6 hours of
sleep were four times more likely to answer incorrectly on a
science test than were participants who had at least 8 hours of
sleep. Decide which part is the descriptive statistic and what
conclusion might be drawn using inferential statistics.
Interquartile
Range
Variance
Source: geeksforgeeks
Measures of central tendency
● There are three different
types of 'average'.
● These are the mean, the
median and the mode.
● They are used by
statisticians as a way of
summarizing where the
‘centre’ of the data is.
Ans:
To calculate the median, we need to put the numbers in order and find the middle
value.
The five heights, arranged in increasing order, are
65.51 67.05 68.31 70.68 72.30.
n=5
● The sample median is the middle number, which is 68.31.
● Half of the other values in the list are below 68.31 and half are above 68.31.
Slide Courtesy:Dr.Uma
Measures of central tendency: Median
➔ Advantages:
● Not affected by the outliers in the data set.
● An outlier is a data point that is radically “distant” or “away” from
common trends of values in a given set.
● It does not represent a typical number in the set.
● The concept of the median is intuitive and thus can easily be
explained as the center value.
● Each set has a unique median value.
➔ Disadvantages:
● Its value is perceived as it is.
● It cannot be utilized for further algebraic treatment.
Slide Courtesy:Dr.Uma
Measures of central tendency: Mode
● The mode is the value that appears most often in a set of data values
● Like the statistical mean and median, the mode is a way of summarizing
a data set in a (usually) single number.
● To calculate the mode, we need to look at which value appears the
most often.
● Example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
● Given the list of data [1, 1, 2, 4, 4] its mode is not unique.
It has 2 modes: 1 and 4
● A dataset, in such a case, is said to be bimodal, while a set with more
than two modes may be described as multimodal.
● Empirical formula: Mode ≈ 3 × Median − 2 × Mean
Measures of central tendency: Mode-example
Mode = 15k
Source: slideshare.net
Slide Courtesy:Dr.Uma
Measures of central tendency: Mode-example
Consider the data given below:
The values 3 and 4 appear the most number of times in the above data.
Since the above data has 2 modes, it is bimodal.
Slide Courtesy:Dr.Uma
Source: miro.medium.com
Measures of central tendency: Mode
➔ Advantages:
● Quick and easy to compute.
● Unaffected by extreme values.
● Can be used at any level of measurement.
● Useful to find the most “popular” or common item. This includes data
sets that do not involve numbers.
➔ Disadvantages:
● It is a terminal statistic.
● A given subgroup could make this measure unrepresentative of the
population’s centre.
● If the set contains no repeating values, the mode is irrelevant.
● In contrast, if there are many values that have the same count, then
mode can be meaningless.
Slide Courtesy:Dr.Uma
Measures of central tendency: Questions
Alex did a survey of how many games each of his 20 friends owned, and
got this:
9, 15, 11, 12, 3, 5, 10, 20, 14, 6, 8, 8, 12, 12, 18, 15, 6, 9, 18, 11
Find the mean, median and mode.
Ans:
Sorting in ascending order:
3, 5, 6, 6, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 14, 15, 15, 18, 18, 20
● Mean = 222/20 = 11.1
● Median = (11+11)/2 = 11
● Mode = 12
Skewed and Symmetric distributions
● Skewness is a measure of the asymmetry of a distribution about
its mean.
● The skewness value can be positive, zero, negative, or undefined.
● Symmetric Distribution: A symmetric distribution is one where the left
and right hand sides of the distribution are roughly equally balanced
around the mean.
● In symmetric unimodal distributions, the mean, median, and mode are the same.
● Skewed Distribution: A skewed distribution is one where the left and
right hand sides of the distribution are not balanced around the mean.
● In skewed data, the mean and median lie further toward the skew than
the mode.
● The greater the distance between the mean and the median, the greater the
skewness of the distribution.
Slide Courtesy:Dr.Uma
Skewed and Symmetric distributions
• Qualitative data:
• Mode – always appropriate
Ex : Maximum Type of Color
• Mean – never appropriate
Ex : Average value of Yellow color
Slide Courtesy:Dr.Uma
Measures of central tendency: Question
For the following data
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
Compute the mean, median, and the 5%, 10%, and 20% trimmed
means.
Solution:
● The mean is found by averaging together all 24 numbers, which
produces a value of 195.42.
● The median is the average of the 12th and 13th numbers, which is
(191 + 223)/2 = 207.00.
● To compute the 5% trimmed mean, we must drop 5% of the data
from each end.
● This comes to (0.05)(24) = 1.2 observations.
● We round 1.2 to 1, and trim one observation off each end.
● The 5% trimmed mean is the average of the remaining 22
numbers: (75 + 79 + ··· + 274 + 384)/22 = 190.45
● To compute the 10% trimmed mean, round off (0.1)(24) =
2.4 to 2. Drop 2 observations from each end, and then
average the remaining 20: (79 + 80 + ··· + 254 + 274)/20 =
186.55
● To compute the 20% trimmed mean, round off (0.2)(24) =
4.8 to 5.
● Drop 5 observations from each end, and then average the
remaining 14: (105 + 126 + ··· + 242 + 245)/14 = 194.07
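The trimmed-mean calculation can be sketched as a short function. The function name is hypothetical, and the rounding uses Python's built-in `round`, which matches the worked values here; for a count that lands exactly on .5, Python rounds to the nearest even integer, which may differ from the rule intended in a textbook.

```python
def trimmed_mean(data, percent):
    """Drop round(percent% of n) observations from each end, then average the rest."""
    xs = sorted(data)
    k = round(percent * len(xs) / 100)    # observations to trim from each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

hma = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
       223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
mean_all = sum(hma) / len(hma)            # ordinary mean of all 24 values
tm5 = trimmed_mean(hma, 5)                # trims 1 value from each end
tm10 = trimmed_mean(hma, 10)              # trims 2 values from each end
tm20 = trimmed_mean(hma, 20)              # trims 5 values from each end
```

Comparing `mean_all` with the trimmed means shows how much the extreme values 30 and 470 pull the ordinary mean.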
When to use mean, median and mode?
Type of variable: best measure of central tendency
Nominal: Mode
Ordinal: Median
Here, range = 13 - 1
= 12
Observations:
Since the range of Class A is smaller than in Class B, can we claim that
the age distribution in Class A is more clustered (closely related) than in
Class B? In other words, are the ages listed in Class A more uniform than
in Class B?
Source: Chilimath.com Slide Courtesy:Dr.Uma
Measures of Spread: Range
Range Can Be Misleading:
● The range can sometimes be misleading when there are
extremely high or low values.
● Example: {8, 11, 5, 9, 7, 6, 3616}
Lowest value: 5
Highest value: 3616
● So the range is 3616 − 5 = 3611.
● The single value of 3616 makes the range large, but most values
are around 10.
Measures of Spread: Range
➔ Advantages:
● It is the simplest of the measure of dispersion
● Easy to calculate
● Easy to understand
● Independent of change of origin
➔ Disadvantages:
● It is based on the two extreme observations and is therefore affected
by fluctuations in them.
● A range is not a reliable measure of dispersion
● Dependent on change of scale
● It can drastically be affected by outliers (values that are not
typical as compared to the rest of the elements in the set).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread
When presenting or analysing measurements of a continuous
variable it is sometimes helpful to group subjects into several
equal groups.
When four equal groups are used, the cut-off points are called
quartiles, and there are three of them (the middle one also being
called the median).
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Percentile example
If the scores of a set of students in a math test are 25, 7, 9, 13, 2 and 8 what is
the 15th percentile and 75th percentile?
Ans: Arrange the numbers in ascending order and assign ranks from
1 (the lowest number) to 6 (the highest number)
Score 2 7 8 9 13 25
15th percentile:
R = (P/100)(N+1)
= (15/100)(6+1)
= 1.05 (it is not an integer)
Percentile value = (1st element + 2nd element)/2
= (2+7)/2
= 4.5, which is the 15th percentile
75th percentile:
R = (75/100)(6+1)
= 5.25 (it is not an integer)
Percentile value = (5th element + 6th element)/2
= (13+25)/2
= 19
Thus, score 19 is the 75th percentile
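The rank rule R = (P/100)(N+1) used in this example can be sketched as a small Python function. The name `percentile_by_rank` and the handling of ranks that fall below 1 or above N (clamping to the smallest or largest score) are assumptions, since the slides do not cover those cases.

```python
import math


def percentile_by_rank(values, p):
    """Percentile using R = (P/100)(N+1); average the neighbours if R is not an integer."""
    data = sorted(values)
    n = len(data)
    rank = p * (n + 1) / 100
    lower = math.floor(rank)
    if lower < 1:          # assumption: clamp to the smallest score
        return data[0]
    if lower >= n:         # assumption: clamp to the largest score
        return data[-1]
    if rank == lower:      # integer rank: take the score in that position
        return data[lower - 1]
    # Non-integer rank: average the scores on either side (1-based positions).
    return (data[lower - 1] + data[lower]) / 2


scores = [25, 7, 9, 13, 2, 8]
print(percentile_by_rank(scores, 15))  # 4.5
print(percentile_by_rank(scores, 75))  # 19.0
```

Statistical packages often interpolate between neighbours instead of averaging them, so their percentile values may differ slightly from this rule.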
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● Quartiles are the values that divide a list of numbers into
quarters.
● Quartiles are obtained by first putting the list of numbers in
order and then cutting the list into four equal parts.
● The Quartiles are at the "cuts" in the data.
● The first quartile, (Q1) is the middle number between the
smallest number and the median of the data.
● The second quartile, (Q2) is the median of the data set.
● The third quartile, (Q3) is the middle number between the
median and the largest number.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● The first quartile is the 25th
percentile
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
● To find the first quartile, compute the value 0.25(n +1).
● If this is an integer, then the sample value in that position is the
first quartile.
● If not, then take the average of the sample values on either side
of this value.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
➔ The Second Quartile or median:
● It is easy to see how to divide the area in Figure into two equal
parts, since the graph is symmetric.
● The point which gives us 50% of the area to the left of it and
50% to the right of it is called the second quartile or median
● Second quartile is calculated using the value 0.5(n+1)
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartiles
➔ The Third Quartile:
● The third quartile is the point which gives
us 75% of the area to the left of it and
25% of the area to the right of it.
● This means that 75% of the observations
are less than or equal to the third quartile
and 25% of the observations are greater
than or equal to the third quartile.
● The third quartile is also called the 75th
percentile.
● The third quartile is computed in the
same way, except that
the value 0.75(n+1) is used.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartile Summary
Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Quartile example
Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range
● Interquartile range is the distance or range between the 25th
percentile and the 75th percentile.
● That is, it quantifies the difference between the third and first
quartiles.
● Interquartile Range = Upper Quartile(Q3) – Lower Quartile(Q1)
IQR = Q3 –Q1
Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range
Source: sphweb.bumc.bu.edu/
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range
➔ Steps to find IQR :
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range example
Source: mathsisfun.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Inter-quartile Range question
For the following data sets, calculate the quartiles and find the interquartile range.
The following numbers represent the time in minutes that twelve employees took
to get to work on a particular day.
18 34 68 22 10 92 46 52 38 29 45 37
Slide Courtesy:Dr.Uma
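One way to work this question is to apply the position rule from the quartile slides, q(n+1)/4 for q = 1, 2, 3, averaging the two neighbouring values when the position is not an integer. The sketch below follows that rule; the function name `quartile` and the requirement of at least three observations are assumptions made for illustration.

```python
import math


def quartile(values, q):
    """q-th quartile (q = 1, 2 or 3) using the position q(n+1)/4."""
    data = sorted(values)
    n = len(data)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    pos = q * (n + 1) / 4
    lower = math.floor(pos)
    if pos == lower:                       # integer position: take that value
        return data[lower - 1]
    # Otherwise average the sample values on either side of the position.
    return (data[lower - 1] + data[lower]) / 2


times = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37]
q1, q2, q3 = (quartile(times, q) for q in (1, 2, 3))
print(q1, q2, q3)        # 25.5 37.5 49.0
print("IQR =", q3 - q1)  # IQR = 23.5
```

Other quartile conventions (for example, linear interpolation) give slightly different values for the same data.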
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Variance
● Variance is a measure of the spread of the recorded values on a
variable.
● It is a measure of dispersion, meaning it is a measure of how far
a set of numbers is spread out from their average value.
● The larger the variance, the further the individual cases are from
the mean.
● The smaller the variance, the closer the individual scores are to
the mean.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Variance
● It is the average of the squared distance of each score from
the mean (squared deviation from the mean).
● Steps to calculate variance:
1. Find the mean of the given data values.
2. Subtract the mean from each data value.
3. Square each value obtained in step 2.
4. Find the sum of all values obtained in step 3.
5. Divide the result of step 4 by N (for a population) or
n-1 (for a sample).
Source: standard-deviation-calculator.com Slide Courtesy:Dr.Uma
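The five steps can be followed line by line in Python. This is a minimal sketch: the function name `variance`, the `sample` flag and the data values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import math


def variance(values, sample=True):
    """Variance computed by the five steps listed on the slide."""
    n = len(values)
    mean = sum(values) / n                       # step 1: mean
    deviations = [x - mean for x in values]      # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]       # step 3: square
    total = sum(squared)                         # step 4: sum
    return total / (n - 1 if sample else n)      # step 5: divide by n-1 or N


# Hypothetical data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data, sample=False))              # population variance: 32/8
print(variance(data))                            # sample variance: 32/7
print(math.sqrt(variance(data, sample=False)))   # population standard deviation
```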
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
● Standard deviation signifies the deviation of the elements of the data set
from the mean value of the distribution.
● It quantifies the amount of variation of a set of data values.
● It is a measure of the variability of a single item.
● The standard deviation does not decline as the sample size increases.
● The estimate of the standard deviation becomes more stable as the sample
size increases.
Source: exceluser.com,
MathBitsNotebook
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
● The larger the standard deviation, the greater the variation
around the mean.
● Std deviation = 0 only when all values are the same (only when
you have a constant and not a “variable”)
● If you were to “rescale” a variable, the s.d. would change by
the same factor.
● Like the mean, the standard deviation will be inflated by an
outlier case value.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measures of Spread: Standard Deviation
Standard Deviation = Square root of Variance
Source: standard-deviation-calculator.com
DATA ANALYTICS
Measures of Spread: Standard Deviation example
Mean
x̄ = (5×2 + 15×1 + 25×1 + 35×3)/7 = (10 + 15 + 25 + 105)/7 = 155/7 ≈ 22.14
https://www.tutorialspoint.com/statistics/
DATA ANALYTICS
Measures of Spread: Standard Deviation example
Calculate Standard Deviation for the following continuous data :
Items 0-10 10-20 20-30 30-40
Frequency 2 1 1 3
https://www.tutorialspoint.com/statistics/
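For grouped (continuous) data, each class is usually represented by its midpoint. The sketch below computes the mean and standard deviation for the table above, assuming the population form (dividing by the total frequency N); the worked slides are images in the source, so that choice is an assumption.

```python
import math

classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [2, 1, 1, 3]

# Represent every class by its midpoint: 5, 15, 25, 35.
midpoints = [(low + high) / 2 for low, high in classes]
n = sum(frequencies)

mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
# Assumption: population variance, i.e. divide by N rather than N - 1.
variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / n
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```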
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Practical Application for Understanding Variance and Standard Deviation
Even though we live in a world where we pay real dollars for goods and
services (not percentages of income), most American employers issue
raises based on percent of salary. Why do supervisors think the most fair
raise is a percentage raise?
Answer:
1) Because higher paid persons gain the most money.
2) The easiest thing to do is raise everyone’s salary by a fixed percent.
If your budget went up by 5%, salaries can go up by 5%.
The problem is that a flat percent raise gives unequal rewards: the
higher the salary, the larger the raise in dollars.
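The effect of a flat percent raise on the spread of salaries can be illustrated with a few made-up numbers. The salaries below are hypothetical; the point of the sketch is that the dollar raises differ and that the standard deviation grows by the same factor as the salaries.

```python
import statistics

salaries = [30_000, 50_000, 120_000]   # hypothetical salaries
raise_percent = 5

raises = [s * raise_percent / 100 for s in salaries]
new_salaries = [s + r for s, r in zip(salaries, raises)]

print(raises)   # dollar raises differ even though the percentage is the same

old_sd = statistics.pstdev(salaries)
new_sd = statistics.pstdev(new_salaries)
print(old_sd, new_sd, new_sd / old_sd)   # the spread grows by a factor of 1.05
```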
DATA ANALYTICS
References
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU
Dr.Mamatha H R
Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality
Statement of Chebyshev’s Inequality
Source Image:ThoughtCo.com
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s inequality
Only the case k > 1 is useful. When k ≤ 1, the right-hand side 1/k² ≥ 1 and the
inequality is trivial, as all probabilities are ≤ 1.
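The statement slide is an image in the source. For reference, the standard form of the inequality, consistent with the 1/k² bound discussed above, is given here; μ and σ denote the mean and standard deviation of X, notation assumed for this sketch.

```latex
P\left(|X - \mu| \ge k\sigma\right) \le \frac{1}{k^{2}}, \qquad k > 0,
\qquad\text{equivalently}\qquad
P\left(|X - \mu| < k\sigma\right) \ge 1 - \frac{1}{k^{2}}.
```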
Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s inequality
Source: prepnuggets.com
MATHEMATICS FOR COMPUTER SCIENCE
Problems on Chebyshev’s inequality
Problem-2:
MATHEMATICS FOR COMPUTER SCIENCE
Problems on Chebyshev’s inequality
MATHEMATICS FOR COMPUTER SCIENCE
More on Chebyshev’s inequality
Source: www.slideshare.net
MATHEMATICS FOR COMPUTER SCIENCE
Statement of Chebyshev’s inequality
Computers from a particular company are found to last on average for three
years without any hardware malfunction, with standard deviation of two
months. At least what percent of the computers last between 31 months
and 41 months?
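A possible way to work this problem: the mean is 36 months and the interval from 31 to 41 months is symmetric about it, so k = 5/2 = 2.5 standard deviations. The helper below is a sketch; its name and the symmetry check are assumptions.

```python
def chebyshev_lower_bound(mean, std_dev, low, high):
    """Chebyshev lower bound on P(low < X < high) for an interval symmetric about the mean."""
    half_width = high - mean
    if abs((mean - low) - half_width) > 1e-12:
        raise ValueError("interval must be symmetric about the mean")
    k = half_width / std_dev           # number of standard deviations
    if k <= 1:
        return 0.0                     # the bound is trivial when k <= 1
    return 1 - 1 / k ** 2


# Mean of three years = 36 months, standard deviation = 2 months.
bound = chebyshev_lower_bound(mean=36, std_dev=2, low=31, high=41)
print(f"at least {bound:.0%} of the computers")   # k = 2.5, so 1 - 1/6.25 = 0.84
```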
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality-Practice problems
Problem 2
What is the smallest number of standard deviations from the mean that
we must go if we want to ensure that we have at least 50% of the data
of a distribution?
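A sketch of the reasoning for this problem, using the bound 1 - 1/k² from Chebyshev's inequality:

```latex
1 - \frac{1}{k^{2}} \ge 0.5
\;\Longrightarrow\; \frac{1}{k^{2}} \le 0.5
\;\Longrightarrow\; k^{2} \ge 2
\;\Longrightarrow\; k \ge \sqrt{2} \approx 1.41
```

So going about 1.41 standard deviations on either side of the mean guarantees at least 50% of the data.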
MATHEMATICS FOR COMPUTER SCIENCE
Chebyshev’s Inequality-Practice problems
Do It Yourself !!!
What is the largest possible value for the probability that the length of
the metal pin is outside the interval [49.1 , 50.9] mm?
THANK YOU
Devika S Nair
Mamatha.H.R
Mamatha H R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution
Pumpkin             A   B   C   D   E   F
Weight (in pounds)  19  14  15  9   10  17
Since we know the weights from the population, we can find the population mean.
μ=(19+14+15+9+10+17)/6=14 pounds
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
x̄            9.5   11.5  12    12.5  13    13.5  14    14.5  15.5  16    16.5  17    18
Probability  1/15  1/15  2/15  1/15  1/15  1/15  1/15  2/15  1/15  1/15  1/15  1/15  1/15
The chance that the sample mean is exactly the population mean is only 1 in 15,
which is very small.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
• Now that we have the sampling distribution of the sample
mean, we can calculate the mean of all the sample means. In
other words, we can find the mean (or expected value) of
all the possible x̄'s.
• The mean of the sample means is
μ_x̄ = Σ x̄ᵢ p(x̄ᵢ)
= 9.5(1/15) + 11.5(1/15) + 12(2/15) + 12.5(1/15) + 13(1/15)
+ 13.5(1/15) + 14(1/15) + 14.5(2/15) + 15.5(1/15) + 16(1/15)
+ 16.5(1/15) + 17(1/15) + 18(1/15)
= 14
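The sampling distribution for the pumpkin weights can be reproduced by listing every possible sample without replacement. The sketch below uses exact fractions; the function name `sampling_distribution` is an assumption introduced for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

weights = {"A": 19, "B": 14, "C": 15, "D": 9, "E": 10, "F": 17}


def sampling_distribution(population, n):
    """Map each possible sample mean to its probability (samples drawn without replacement)."""
    samples = list(combinations(population.values(), n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}


dist = sampling_distribution(weights, 2)
for mean, prob in dist.items():
    print(float(mean), prob)

# Mean of all the sample means: equals the population mean of 14 pounds.
print(sum(mean * prob for mean, prob in dist.items()))
```

Calling `sampling_distribution(weights, 5)` gives the n = 5 case, where each of the six possible samples has probability 1/6.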
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Let's do the same thing as above but with sample size n = 5.
Sample          Weights             x̄     Probability
A, B, C, D, E   19, 14, 15, 9, 10   13.4  1/6
Population: 3, 5, 2, 1
Draw samples of size n = 3 without replacement
Possible samples
3, 5, 2
3, 5, 1
3, 2, 1
5, 2, 1
Each value of x̄ is equally likely, with probability p(x̄) = 1/4.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Consider a population that consists of the numbers 1, 2, 3, 4 and 5
generated in a manner that the probability of each of those values
is 0.2 no matter what the previous selections were. This population
could be described as the outcome associated with a spinner such
as given below with the distribution next to it.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
If the sampling distribution for the means of
samples of size two is analyzed, it looks like
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
[Figure: original distribution and sampling distribution of the mean for n = 2; horizontal axis from 1 to 5]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distributions
Sampling distributions for n=3 and n=4 were calculated and are
illustrated below. The shape is getting closer and closer to the
normal distribution.
[Figure: four panels titled "Original distribution", "Sampling distribution n = 2", "Sampling distribution n = 3" and "Sampling distribution n = 4"; horizontal axis from 1 to 5]
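The sampling distributions for the spinner can be computed exactly by enumerating all equally likely outcomes of n spins. This is a sketch; the function name `mean_distribution` and the text bars are assumptions used in place of the figures.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def mean_distribution(values, n):
    """Exact distribution of the mean of n independent, equally likely draws."""
    outcomes = list(product(values, repeat=n))
    counts = Counter(Fraction(sum(o), n) for o in outcomes)
    return {m: Fraction(c, len(outcomes)) for m, c in sorted(counts.items())}


for n in (1, 2, 3, 4):
    dist = mean_distribution([1, 2, 3, 4, 5], n)
    print(f"n = {n}")
    for mean, prob in dist.items():
        # One '#' per 1% of probability gives a rough picture of the shape.
        print(f"  {float(mean):4.2f} {'#' * int(prob * 100)}")
```

As n grows the bars pile up around 3 and thin out toward 1 and 5, which is the bell-like shape described above.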
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Mean and Variance of a Sample Mean
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution-Example
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution-Example
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sampling Distribution of
Skewed population
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Illustrations of Sampling Distributions
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Finding Probabilities for the Sample Mean
Now the population from which the sample was drawn has mean
μ = 22.3 and variance σ² = 16.
The population from which this sample was drawn has mean μX = 0.4
and standard deviation σX = 0.1.
The population from which this sample was drawn has mean μY = 0.45
and standard deviation σY = 0.15.
Since
SX ∼ N(26, 0.65), SY ∼ N(29.25, 1.4625), and SX and SY are
independent, it follows that
μT = 26 + 29.25 = 55.25, σT² = 0.65 + 1.4625 = 2.1125, and
T ∼ N(55.25, 2.1125)
To find P(50 < T < 55) we compute the z-scores of 50 and of 55.
The probability that the total time used by both machines together is
between 50 and 55 hours is 0.4323
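The last calculation can be checked with Python's standard library. The sketch assumes, as the text implies, that T is the sum of the two independent totals SX and SY; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Distributions stated above; NormalDist takes the standard deviation, not the variance.
s_x = NormalDist(mu=26, sigma=sqrt(0.65))
s_y = NormalDist(mu=29.25, sigma=sqrt(1.4625))

# For independent normal variables, means add and variances add.
total = NormalDist(mu=s_x.mean + s_y.mean,
                   sigma=sqrt(s_x.variance + s_y.variance))

z_50 = (50 - total.mean) / total.stdev
z_55 = (55 - total.mean) / total.stdev
p = total.cdf(55) - total.cdf(50)
print(round(z_50, 2), round(z_55, 2), round(p, 4))
```

The result agrees with 0.4323 to about three decimal places; the small difference comes from the z-scores being rounded to two decimals when a z table is used.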
THANK YOU
Dr.Mamatha H R
Mamatha H R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Course content
Unit 1: Applications of Probability Distributions and Principles of Point Estimation
Introduction, Motivating Examples and Scope. Statistics: Introduction, Types of Statistics, Types of Data, Types
of Experiments – Controlled and Observational study, Sampling: Sampling Methods, Sampling Errors, Case
Study. Chebyshev's inequality, Normal Probability Plots, Introduction to Generation of Random Variates and
mention the types, Acceptance-Rejection method, Sampling Distribution, The Central Limit Theorem and
Applications, Principles of Point Estimation - Mean Squared Error for Bernoulli, Binomial, Poisson, Normal,
Maximum Likelihood Estimate for Bernoulli, Binomial, Poisson, Normal and Case Study. Introduction to
multivariate normal distribution, MAP distribution.
Confidence Intervals: Interval Estimates for Mean of Large and Small Samples, Student's t Distribution, Interval
Estimates for Proportion of Large and Small Samples, Confidence Intervals for the Difference between Two
Means, Interval Estimates for Paired Data. Factors affecting Margin of Error, Hypothesis Testing for Population
Mean and Population Proportion of Large and Small Samples, Drawing conclusions from the results of
Hypothesis tests, Case Study.
Distribution Free Tests, Chi-squared Test, Fixed Level Testing, Type I and Type II Errors, Power of a Test, Factors
Affecting Power of a Test. Simple Linear Regression: Introduction, Correlation, the Least Square Lines, Predictions
using regression models - Uncertainties in Regression Coefficients, Checking Assumptions and transforming data,
Introduction to the Multiple Regression Model, Case Study.
Self-Learning: F test for equality of Variance.
14 Hours
Unit 4: Engineering optimization
Introduction to Optimization-Based Design, Modelling Concepts, Unconstrained Optimization, Discrete Variable
Optimization, Genetic and Evolutionary Optimization, Constrained Optimization.
Self-Learning: Mathematical concepts of objective function, Constraints and Decision variables.
14 Hours
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Course content: Applications
Unit 1: Applications:
1. Poisson distribution, calculation of number of calls received in a specified time duration in call centers.
2. Variance, standard deviation, identifying the customer satisfaction in online shopping
3. Central limit theorem, Load Balancing in distributed systems and internet traffic prediction
4. Sampling mean, Estimating database query response times
Unit 2: Applications:
1. t-distribution, confidence interval, students’ performance analysis based on hours of study
2. z-test, application form processing in banking system.
3. Hypothesis testing, randomly trained students placement into tier-I and tier-II companies.
Unit 3: Applications:
1. Linear regression, stock market prediction
2. Chi-Square Test, analyzing the association between vaccination and recovery of the patients considering COVID data.
3. Chi-Square Test and Test of Independence, Analyzing the relationship between gender and preference for a product purchase.
4. Identifying Type 1 and Type 2 Errors in Spam mail classification.
Unit 4: Applications:
1. Minimize a loss function in Neural Networks using Batch gradient descent (Unconstrained Optimization)
2. Lagrange Multipliers to find local maxima and minima of a function subject to equality constraints (Constrained Optimization)
3. Case study on Bayesian Optimization with Discrete Variables (Discrete Variable optimization)
4. Use Genetic Algorithms to optimize Production Scheduling in a manufacturing environment, focusing on minimizing total production
costs while meeting job deadlines and machine constraints. Evaluate the GA’s effectiveness against traditional scheduling methods.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Tools and Textbooks
Tools / Languages/Libraries: Jupyter Notebook, Python, Pandas, Matplotlib, Scipy, Seaborn, BeautifulSoup,
Numpy, Scikit learn.
Text Book(s):
1. “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.
2. “Optimization Methods for Engineering Design”, Parkinson, A.R., Balling, R., and J.D. Hedengren, Second Edition,
Brigham Young University, 2018
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Evaluation Policy
ISA Components
Component     Conducted for                        Reduced to
ISA 1         40                                   20
ISA 2         40                                   20
Assignment    Coding: 5 Marks, Datathon: 20 Marks  10
ESA           100                                  50
Assignment Components
1. Submission of the hands-on session code = 5 Marks
2. Datathon = 5 Marks
Total = 10 Marks
Note
1. The codes and solutions for hands-on sessions are expected to be submitted on the same day they are
conducted.
2. Datathon will be conducted for 20 Marks and will be reduced to 5 Marks
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data Science?
● Have you ever wondered how YouTube recommends videos of
your liking?
● How does Google’s autocomplete work?
● How does Gmail filter your emails into spam and non-spam
categories?
These are some of the simplest applications of Data Science. Such
tasks would be impossible without the availability of data. Thus in
simple words, Data Science is all about using data to solve problems.
Source: https://coralogix.com/blog/elasticsearch-
autocomplete-with-search-as-you-type/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data Science?
Data Science is an interdisciplinary field.
● It is focused on extracting knowledge and insights from data.
● Those insights are then applied to solve problems across a wide
range of domains.
● It incorporates skills from Statistics, Computer Science,
Mathematics, Business etc.
Source: theblog.adobe.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science
Source: edureka.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science
Source: edureka.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Airlines Industry
Data Science is used for various purposes like: route planning,
revenue management, prediction on in-flight sales and food
supplies etc.
Source: Simplilearn
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Logistics Industries
Logistics is a sector where data scientists can make a significant
impact in several areas such as:
● waste reduction
● optimizing delivery routes (which can translate into lower
delivery costs)
● selecting carriers that deploy best practices in mitigating the
effects of CO2 emissions
● ensuring that hazardous materials are handled with the
utmost care
● forecasting the supply and demand cycles
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Source : https://www.martechadvisor.com/articles/customer-experience-2/recommendation-engines-how-amazon-and-netflix-are-
winning-the-personalization-battle/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Amazon has a huge bank of data on online consumer purchasing
behaviour.
The data includes
● purchased shopping cart
● items added to carts but abandoned
● wish lists
● dwell time
● referral sites
● customers’ demographic information
● number of times viewed an item before final purchase
● click paths in session, pricing experiments online etc.
Using this data it can easily find the hidden factors and patterns
to generate the “Recommended for You” section which helps to
create a personalized shopping experience for every customer.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Source: https://medium.com/swlh/recommendations-in-time-context-93b32f73d98d
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Recommender systems
Netflix has set up 1300 recommendation clusters based on users’
viewing preferences.
Netflix’s personalized recommendation algorithms produce $1
billion a year in value from customer retention and account for
80% of its total views. Some of the user information that Netflix
captures to help in recommendation includes:
● Viewer interactions with Netflix services like viewer ratings,
viewing history, etc.
● Movie information such as categories, year of release,
title, genres etc.
● Other viewers with similar watching preferences.
● Time duration of a viewer watching a show.
● The device on which a viewer is watching.
● The time of the day a viewer watches.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Weather Forecasting
Source: phys.org/news
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Weather Forecasting
Weather forecasts are made by collecting quantitative data
about the current state of the atmosphere at a given place and
using meteorology to project how the atmosphere will change.
So in general, weather forecasting is driven by the data about the
atmosphere.
There are a wide variety of devices and technologies gathering
information about the weather like:
thermometers, barometers, anemometers, weather balloons,
radar systems, satellites etc.
Various weather models analyse and try to make sense of all
the incoming information to accurately predict the weather.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports
Source: https://arstechnica.com/information-technology/2015/10/big-data-an-it-buzzword-that-is-actually-producing-results/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports
Players, team managers, coaches and fans rely on sports analytics
before making decisions or developing strategies to win games.
Sports data analysts spend their time collecting on-field and off-
field data from a variety of sources and then analyzing and
interpreting that data looking for meaningful insights.
Source:
https://en.wikipedia.org/wiki/Moneyball_(film)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Sports
Source: https://fivethirtyeight.com/features/billion-dollar-billy-beane/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Politics
Political parties and their strategists have realized the importance of
mining real-time demographic and polling data.
The various data points may include voter sentiment, mass emotions,
citizen concerns in different constituencies, popular outlooks in
various states, etc. Political parties can use these insights to:
● pull voter donations
● convert undecided voters
● enroll young volunteers
● organize resources
● run social media campaigns
● improve effectiveness of electioneering activities etc.
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
https://www.datacouncil.ai/talks/how-data-is-transforming-politics
https://projects.fivethirtyeight.com/polls/generic-ballot/2024/
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
Political strategists and digital analysts can deploy modern software
analytics to create detailed maps of voting patterns.
Data analytics can help these campaigners to paint a vivid picture of
political winds, party supporters, and trenchant opponents in every
demographic region.
This demographic data and other information can be used in
campaign-spending management. It can help determine whether a
voter would be most receptive to a phone call, a flyer or mailer, an in-
person visit, or some other form of campaigning.
By using data in this way, campaigns can avoid wasting money on
ineffective or unnecessary advertising, and have a better chance of
reaching someone who is receptive.
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
Source:
Historical U.S. Presidential Elections
1789-2020 - 270toWin
Mathematics for Computer Science Engineers
Applications of Data Science : Data Science in Politics
Source:
270toWin - 2024 Presidential Election
Interactive Map
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine
Source:
http://www.primeclasses.in/blog/2019/08/26/the-
need-for-data-science-in-healthcare-industry/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine
There are several fields in healthcare like medical imaging, drug
discovery, genetics, predictive diagnosis etc that make use of
data science.
● Hospitals analyse medical data and patient records to predict
those patients that are likely to seek readmission within a
few months of discharge.
● Omada Health is a digital medical company that uses smart
devices to create customized behavioral plans and online
training to help prevent chronic health conditions, such as
diabetes, high blood pressure, and high cholesterol.
● On the mental health side, Awake Labs, a Canadian start-up,
tracks data on children with autism through wearable devices
and informs parents before a meltdown.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in Healthcare & Medicine
Source: https://allofus.nih.gov
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Applications of Data Science : Data Science in predicting people’s opinions
Source: Simplilearn
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
What is Data?
Technically, data refers to individual facts, statistics, or items of
information, often numeric, that are collected through
observation.
Source: https://www.twinkl.de/teaching-wiki/data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data vs Information
➔ Data
● Raw facts, usually formatted in a special way.
● Based on records, observations etc.
● Unorganized.
➔ Information
● A collection of facts organized in such a way that they have
additional value beyond the value of the facts themselves.
● Based on analysis of data.
● Organized and always depends on data.
Source: https://effectualsystems.com/data-need-information/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of Data
Data Represented by
Source: https://towardsdatascience.com/data-extraction-from-a-pdf-table-
with-semi-structured-layout-ef694f3f8ff1
Source: slidegeeks.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Structured, Unstructured & Semi-structured Data
Structured Data:
Structured data is the data whose elements are
addressable for effective analysis. The data is
organized into a formatted repository that is typically a
database. Ex: Relational data.
Semi-Structured Data:
It is the data that doesn’t reside in relational database
but has some organizational properties that make it
easier to analyse. Ex: XML data.
Unstructured Data:
It is the data which is not organized in a predefined
manner or doesn’t have a predefined data model, thus
not a good fit for a mainstream relational database.
Ex: Word, pdf, text etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Structured, Unstructured & Semi-structured Data
Source:
https://www.slidegeeks.com/pics/dgm/l/f/Forms_Type_Of_Big_Data_Ppt_PowerPoint_Presentation_Infographic_Template_Slide_1-.jpg
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data Information
Source: guru99.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Information Concepts
Source: https://learningforsustainability.net
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Science
■ Science: from the Latin word Scientia
■ Meaning: Knowledge
■ Science is a systematic enterprise that builds and organizes
knowledge in the form of testable explanations and
predictions about the universe.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why do we need Data Science?
Source: https://static.seekingalpha.com/uploads/2020/1/14/50485001-15789998083991578_origin.png
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Why do we need Data Science?
The main reason why we need data science is the ability to process
and interpret data. This enables users and industries to make
informed decisions as well as helps in their growth, optimization,
and performance.
Slide courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How is Data generated?
By 2025, it’s estimated that 463 exabytes of data will be created
each day globally
– that’s the equivalent of 212,765,957 DVDs per day!
Source: https://twitter.com/theellenshow
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data generation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data generation
Source: https://trak.in/tags/business/2014/04/15/digital-data-universe-expansion-2020/
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Growth in Data generation
The total amount of data created, captured, copied and consumed
globally has been exponentially increasing.
In 2020, the amount of data created & replicated was higher than
expected, driven by the increased demand due to the pandemic.
By 2025, global data creation is projected to grow to more than
180 zettabytes.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Growth in Data generation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Growth in Data generation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How much of data is put into use?
Source: https://image.slidesharecdn.com/instroductiontodatascience-160420090623/95/introduction-to-data-science-38-
638.jpg?cb=1461307670
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
But is data all we need?
The graph below shows a strong correlation between
‘Age of Miss America’ and ‘Murders by steam, hot vapour and hot
objects’, which clearly is not a cause & effect relationship.
Thus, we see that the presence of interesting patterns need not
imply their correctness.
Blindly applying various processes and techniques on data can
result in incorrect inferences.
Source: https://i2.wp.com/boingboing.net/wp-
content/uploads/2016/02/chart.jpg?fit=800%2C315&ssl=1
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
But is data all we need?
The following work highlights the risk of amplifying and reinforcing
biases present in the data by blindly applying machine learning on it.
Source: https://arxiv.org/abs/1607.06520
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Learn how to use data
The above examples help us understand that we need to learn
how to utilize and handle the available data in the right manner
to be able to arrive at correct results and draw meaningful
inferences.
Source:slidesharecdn.com
Data Science project life cycle
The correct process of using available data is shown in this life
cycle. It outlines the major stages in a data science project.
Source: https://static.javatpoint.com/tutorial/data-science/images/data-science-lifecycle.png
Data Scientist
In simple words, Data Scientists are those who make sense of all the
available data and figure out what can be done with it.
Source: proschoolonline.com
What does a Data Scientist do?
They are responsible for collecting, analyzing, modelling and
interpreting large amounts of data. Their role combines Computer
Science, Mathematics, Statistics etc.
Source: https://edvancer.in/wp-content/uploads/2015/11/76c99311fc4be19bf4353cfc3c2e94b2.png
Prerequisites for a Data Scientist
Source: data-flair.training
Slide courtesy:Dr.Uma
Demand for Data Scientist
Data Science is a growing field. It is a popular and lucrative
profession. Glassdoor has ranked this profession at #3 in 2022
despite the occurrence of the pandemic.
Source: https://cdn.ttgtmedia.com/rms/onlineimages/business_analytics-data_scientist_01_mobile.png
How is it different from what Statisticians have been doing?
Source: https://scientistcafe.com/ids/images/softskill1.png
Data Science vs Data Analysis
● Data Science is primarily used to make decisions
and predictions, using predictive causal
analytics, prescriptive analytics (predictive plus
decision science) and machine learning.
Source: https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2017/01/Data-Analyst-vs-Data-Science-1-422x300.png
Common tasks in Data Science
Source: Simplilearn
Source: https://static.javatpoint.com/tutorial/data-science/images/how-to-solve-a-problem-in-data-science.png
References
Text Book:
William Navidi, Statistics for Engineers and Scientists, 4th Edition,
McGraw Hill Education, India
Dr.Mamatha H R
Professor, Department of Computer Science
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
UE23MA242A
Unit 1: Sampling Methods
Mamatha H R
Department of Computer Science and Engineering
Topics to be covered
❖ Sampling methods
❖ Sampling process
Sources: blog.masterofproject.com,
analytics-magazine.org
Slide Courtesy:Dr.Uma
Sampling process
● Define target population (population of concern)
● Specify sampling frame
● Specify sampling method
● Reviewing the sampling process
Sampling
➔ Factors that influence sample representativeness:
● Sampling procedure
● Sample size
● Participation (response)
Source: thumbs.dreamstime.com
Recap: Population vs Sample
● A population includes all the people or items
with the characteristic one wishes to understand.
● Because there is very rarely enough time or money to gather
information from everyone or everything in a population, the
goal becomes finding a representative sample (or subset) of
that population.
➔ Note:
● The population from which the sample is drawn may not be the
same as the population about which we actually want
information.
● Often there is a large but incomplete overlap between these
two groups, for example because of sampling-frame issues.
Sampling Frame
The sampling frame is the list of items or events from which the
potential respondents are drawn, or which it is possible to
measure.
● Sometimes, it is possible to identify and measure every single
item in the population and to include any one of them in our
sample.
● However, in the more general case this is not possible.
● For example, there is no way to identify every rat in the set of all rats.
● As a remedy, we seek a sampling frame which has the
property that we can identify every single element and include
any of them in our sample.
● The sampling frame must be representative of the population.
Representative & Biased Sample
Sample 1: Representative of the population
Sample 2
Samples
● Probability Samples: Simple Random, Systematic, Stratified, Cluster
● Non-Probability Samples: Judgement, Snowball, Convenience, Quota
Probability Sampling
● Probability sampling is a type of sampling in which every unit in the
population has a chance/probability (greater than zero) of being selected
in the sample, and this probability can be accurately determined.
● This type of sampling decreases bias and sampling error in the selection
process.
● When every element in the population has the same
probability of selection, this is known as an 'equal probability
of selection' (EPS) design. Such designs are also referred to as
'self-weighting' because all sampled units are given the same
weight.
Source: www.mathstopia.net
Slide Courtesy:Dr.Uma
Non-Probability Sampling
● Non-probability sampling is a type of sampling in which not every unit in the
population has a chance/probability (greater than zero) of being
selected in the sample.
● Here, some elements of the population have no chance of selection
(these are sometimes referred to as 'out of coverage'/'undercovered'),
or the probability of selection can't be accurately determined.
● It involves the selection of elements based on assumptions regarding the
population of interest, which forms the criteria for selection.
● The selection of elements is non-random.
● Thus, non-probability sampling does not allow the estimation of sampling
errors.
● It is more likely to produce a biased sample and restricts generalization.
● It is not an appropriate data collection method for most statistical
analyses.
Slide Courtesy:Dr.Uma
Probability Sampling
● Subjects of the sample are chosen based on known
probabilities.
Probability Samples: Simple Random, Systematic, Stratified, Cluster
Simple Random Sampling
Simple random sampling, as the name suggests, is an entirely random
method of selecting the sample.
● Here, each subject or unit in the population has an equal chance of
being selected.
● The sampling frame should include the whole population.
● A table of random numbers or a lottery system is used to determine which
units are to be selected.
● Simple random sampling is always an EPS design, but not all EPS designs
are simple random sampling.
Source: datasciencemadesimple.com
Simple Random Sampling Examples
Calculating the probability of each coin being selected:
Probability = (n/N) x 100
● Total population size (N) = 20
● Sample size (n) = 5
● Probability = (5/20) x 100 = 25%
● Thus, each coin has a 25% probability of being selected.
Source: analyticsvidhya.com
Simple Random Sampling Examples
In a company consisting of 10,000 employees, 25 employees are selected
to survey the average number of hours a day they are present in the
office.
● Sampling frame: List of all employees, numbered from 1 to 10,000
● Sample: 25 employees whose numbers are drawn from a random number table
● Probability of selection of each employee:
N = 10,000; n = 25
probability = (25/10,000) x 100 = 0.25%
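The two examples above can be reproduced with a short Python sketch. The function names and the seed are illustrative assumptions, not part of the slides; `random.sample` from the standard library is used here as a stand-in for a random number table.

```python
import random


def selection_probability_percent(population_size, sample_size):
    """Chance, in percent, that any one unit lands in a simple random sample."""
    return sample_size / population_size * 100


def simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct units; every subset of that size is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, sample_size)


coins_pct = selection_probability_percent(20, 5)            # coin example
employees_pct = selection_probability_percent(10_000, 25)   # employee example

employee_ids = range(1, 10_001)   # sampling frame: employees numbered 1..10,000
surveyed = simple_random_sample(employee_ids, 25, seed=1)   # seed is arbitrary

print(coins_pct, employees_pct)
print(sorted(surveyed))
```

Fixing the seed only makes the run repeatable; in a real survey it would normally be left unset.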
Source: 5found.com
Simple Random Sampling: Advantages
➔ Advantages:
● This method is simple to use.
● Estimates are easy to calculate.
● Random samples are usually fairly representative since they don't
favor certain members of the population.
● Low sampling error.
● It requires only minimal advance knowledge of the population
under study.
Simple Random Sampling: Disadvantages
➔ Disadvantages:
● If the sampling frame is large, this method is impracticable.
● Minority subgroups of interest in the population may not be present in the
sample in sufficient numbers for study.
● This type of sampling is poorly suited to populations whose units are
heterogeneous in nature.
● Sometimes, it is difficult to have a completely cataloged universe.
● This method lacks the use of available knowledge concerning the
population.
Simple Random Sampling with replacement
● This is a sampling procedure in which each sampling unit randomly
selected from the population is measured or recorded and then returned
to the population. Thus, a sampling unit may be sampled multiple times.
● With 10 marbles, each marble has the same chance of 0.1 of being sampled
as the first marble. When sampling the second marble and all the
subsequent marbles, each marble still has a 0.1 chance of being sampled.
● Each time we sample a unit, all units have the same chance of being
sampled.
Source: www.spss-tutorials.com
Simple Random Sampling without replacement
● This is a sampling procedure in which sampling units are selected from a
population without replacement, such that at each draw every remaining unit has an
equal probability of being selected.
● No element can be selected more than once in the same sample.
● For the first marble sampled, each of the 10 marbles has a 0.1 chance of being
sampled. However, the first marble we sampled has zero chance of being
sampled again.
● Thus, the other 9 marbles each have a chance of 1 in 9 ≈ 0.11 of being
sampled as the second unit.
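The difference between the two procedures can be sketched in Python. This assumes ten marbles labelled 1 to 10 (the labels and the seed are illustrative); `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random
from fractions import Fraction

marbles = list(range(1, 11))   # ten marbles, labelled 1..10 (labels are illustrative)
rng = random.Random(0)         # fixed seed only so the run is repeatable

# With replacement: every draw sees all ten marbles, so repeats are possible.
with_replacement = rng.choices(marbles, k=5)

# Without replacement: a drawn marble leaves the pool, so no repeats can occur.
without_replacement = rng.sample(marbles, k=5)

# Without replacement, the chance that a given remaining marble is picked
# on successive draws is 1/10, then 1/9, then 1/8, and so on.
per_draw_chance = [Fraction(1, len(marbles) - i) for i in range(5)]

print(with_replacement)
print(without_replacement)
print([round(float(p), 2) for p in per_draw_chance])
```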
Source: www.spss-tutorials.com
Systematic Sampling
● Systematic sampling relies on arranging the target population
according to some ordering scheme and then selecting elements at
regular intervals through that ordered list.
● The first element is selected randomly.
● Selection then proceeds with every kth element, where k is
the size of the selection interval: k = (population size/sample size)
Source: questionpro.com
Systematic Sampling
● Systematic sampling is an equal probability of selection (EPS) method:
every element is selected with probability 1/k (one in three for the
interval k = 3 implied by the example set below).
● It is not 'simple random sampling' because different subsets of the
same size have different selection probabilities.
● Ex: the set {2,5,8,11} has a one-in-three probability of selection,
but the set {1,3,6,7} has zero probability of selection.
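The claim can be checked by enumerating every sample that systematic sampling is able to produce. The sketch below assumes a population of 12 units numbered 1 to 12 and an interval of k = 3, which is what the set {2,5,8,11} suggests; the original figure is not reproduced here, so these values are an assumption.

```python
from fractions import Fraction

N, k = 12, 3   # assumed population size and selection interval

# One possible sample per random start 1..k; each start is equally likely.
possible_samples = [tuple(range(start, N + 1, k)) for start in range(1, k + 1)]


def sample_probability(candidate):
    """Probability that systematic sampling returns exactly this set of units."""
    hits = sum(1 for s in possible_samples if set(s) == set(candidate))
    return Fraction(hits, k)


def inclusion_probability(unit):
    """Probability that a single unit appears in the selected sample."""
    hits = sum(1 for s in possible_samples if unit in s)
    return Fraction(hits, k)


print(possible_samples)
print(sample_probability({2, 5, 8, 11}), sample_probability({1, 3, 6, 7}))
print({unit: inclusion_probability(unit) for unit in range(1, N + 1)})
```

Under these assumptions every unit has the same inclusion probability, yet only k of the many possible four-element subsets can ever be drawn, which is why the design is EPS but not simple random sampling.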
Source: www.netquest.com
Systematic Sampling
● When to Use: When the project budget is tight and time is short.
● Key Aspect: Find the interval k and select every kth member.
k = N/n
● General Procedure:
○ Assign numbers to each population element.
○ Order the population elements in an ordered sequence
○ Find ‘k’ the size of the selection interval.
○ Select the first sample element randomly from the first
k population elements.
○ Thereafter, select the sample elements at a constant
interval, k, from the ordered sequence frame.
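The general procedure can be sketched in Python as follows. The function name, the seed, and the use of integer division for k are assumptions made for illustration; the usage lines apply it to a class of 64 students with a sample of 8.

```python
import random


def systematic_sample(ordered_frame, sample_size, seed=None):
    """Pick a random start among the first k elements, then take every k-th element."""
    population_size = len(ordered_frame)
    k = population_size // sample_size        # selection interval k = N / n
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    rng = random.Random(seed)
    start = rng.randrange(k)                  # 0-based index within the first k elements
    return [ordered_frame[start + i * k] for i in range(sample_size)]


students = list(range(1, 65))                 # roll numbers 1..64, already ordered
chosen = systematic_sample(students, 8, seed=3)
print(chosen)
```

When N is not an exact multiple of n, integer division leaves a few elements at the end of the frame that can never be selected; this sketch does not attempt to correct for that.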
Slide Courtesy:Dr.Uma
Systematic Sampling Examples
From a classroom consisting of 64 students, the teacher wants to
select 8 students to check their assignments.
● Population size = N = 64
● Sample size = n =8
● Size of selection interval = k = N/n = 64/8 = 8
● Randomly select the first student.
● Then select every subsequent 8th student.
Systematic Sampling Examples
Purchase orders for the previous fiscal year are serialized 1 to
10,000. A sample of fifty purchase orders is needed for an audit.
● N = 10,000
● n = 50
● k = 10,000/50 = 200
● First select an element randomly from the first 200 purchase
orders.
● Assume the 45th purchase order was selected.
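Continuing from a start at order 45, the remaining orders follow at steps of 200: 45, 245, 445, and so on up to 9845. A minimal Python sketch of that arithmetic (variable names are illustrative):

```python
N, n = 10_000, 50
k = N // n                    # selection interval
first = 45                    # the randomly chosen start assumed in the example

# Every k-th purchase order, starting from the first selected one.
selected_orders = [first + i * k for i in range(n)]

print(k, len(selected_orders))
print(selected_orders[:5], "...", selected_orders[-1])
```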
Source: analyticsvidhya.com
Systematic Sampling: Advantages
● Sample is easy to select.
● Suitable sampling frame can be identified easily.
● The sample is spread evenly over the entire reference population.
● It is a cost effective sampling method.
● It guarantees that the entire population is evenly sampled.
● Systematic sampling also carries a low-risk factor because there
is a low chance that the data can be contaminated.
Systematic Sampling: Disadvantages
● This type of sampling might lead to bias if there is an underlying
pattern/periodicity in the population which coincides with the selection.
Ex : If the HR database groups employees by team, and team members are
listed in order of seniority, there is a risk that the interval might skip over
people in junior roles, resulting in a sample that is skewed towards senior
employees.
● Difficult to assess precision of estimate from one survey.
● Although every element has the same chance of selection, not every
combination of elements can be selected.
● Elements lying between two consecutive selected positions are never considered.
● The size of the population is needed. Without knowing the specific
number of participants in a population, systematic sampling does not
work well.
Stratified Sampling
● Stratified sampling is the type of sampling in which the population is
divided into 2 or more groups called strata based on a shared
characteristic or trait.
● Then simple random samples are selected from each group.
● The selected 2 or more samples are combined into one.
● The strata or groups do not overlap, and together they cover the entire
population.
● The shared characteristics based on which the population is divided
could be gender, educational attainment, income, age etc.
Source: datasciencemadesimple.com
Stratified Sampling
● Each stratum is sampled as an independent sub-population.
● Every unit in a stratum has the same chance of being selected.
● Using the same sampling fraction for all strata ensures proportionate
representation in the sample.
● Adequate representation of minority subgroups of interest can be
ensured by stratification & varying sampling fraction between strata
as required.
● Since each stratum is treated as an independent population,
different sampling approaches can be applied to different strata.
Stratified Sampling
● Purpose: To obtain an unbiased random sample from a larger
population.
● When to Use: When the population proportions must be reflected in the
sample.
● Key Aspect: The sample proportions are the same as the population proportions;
each stratum is homogeneous.
● General Procedure:
○ Divide the population into Strata or Groups.
○ Criteria for division could be: Gender, Hair Color, Eye Color,
Salary, Designation, Age etc.
○ Selection of sample: Simple random sampling is used
to sample units from each stratum.
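The procedure can be sketched in Python with proportionate allocation, so that each stratum contributes in proportion to its size. The function name and the staff list stratified by designation are hypothetical, introduced only for illustration.

```python
import random
from collections import Counter, defaultdict


def proportionate_stratified_sample(units, stratum_of, sample_size, seed=None):
    """Split units into strata, then draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    fraction = sample_size / len(units)       # same sampling fraction for all strata
    sample = []
    for members in strata.values():
        quota = round(fraction * len(members))  # rounding may shift the total slightly
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical staff list of (id, designation) pairs: 60 + 30 + 10 = 100 people.
staff = (
    [(i, "engineer") for i in range(60)]
    + [(i, "manager") for i in range(60, 90)]
    + [(i, "director") for i in range(90, 100)]
)
picked = proportionate_stratified_sample(staff, lambda u: u[1], 20, seed=7)
counts = Counter(designation for _, designation in picked)
print(counts)
```

With a 20% sampling fraction the sample keeps the 60:30:10 mix of the population.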
Stratified Sampling examples
Given 20 coins of different colours.
● Population of coins is divided into 4 strata based on their colours.
● Coins from each stratum are sampled using simple random sampling.
Source: analyticsvidhya.com
Stratified Sampling examples
To find out the most popular song among the FM radio listeners.
● All listeners are stratified by age.
● Listeners from each age group are selected using simple random
sampling and surveyed for their favourite song of the year.
Stratified by Age
● 20 - 30 years old (homogeneous within the stratum)
● 30 - 40 years old (homogeneous within the stratum)
● 40 - 50 years old (homogeneous within the stratum)
Strata are heterogeneous with respect to one another.
Stratified Sampling examples
A high school principal wants to conduct a survey to collect the
opinions of students.
● The students are grouped into 4 strata based on their grade.
● Then, a simple random sample of 50 students is selected from each
grade to be included in the survey.
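Taking 50 students from every grade is an equal allocation, so grades of different sizes end up with different sampling fractions. The sketch below illustrates this; the grade names and enrolment figures are hypothetical, since the example gives only the 50-per-grade rule.

```python
import random

# Hypothetical enrolment per grade (not from the example).
enrolment = {"Grade 9": 400, "Grade 10": 350, "Grade 11": 300, "Grade 12": 250}
per_stratum = 50
rng = random.Random(11)       # fixed seed only so the run is repeatable

# Simple random sample of 50 roll numbers within each grade.
survey = {}
for grade, size in enrolment.items():
    roll_numbers = range(1, size + 1)
    survey[grade] = rng.sample(roll_numbers, per_stratum)

# Equal allocation gives each grade its own sampling fraction, so each sampled
# student stands for a different number of students in the population.
sampling_fraction = {g: per_stratum / size for g, size in enrolment.items()}
weight = {g: size / per_stratum for g, size in enrolment.items()}

print(sampling_fraction)
print(weight)
```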
Source: statology.org
Stratified Sampling: Advantages
● It enhances the representativeness of the sample.
● It is easy to carry out.
● It has higher statistical efficiency.
● A stratified sample can provide a higher precision than a simple
random sample of the same size.
● As it provides a greater precision, this type of sampling often
requires a smaller sized sample which saves money.
Slide Courtesy:Dr.Uma
Stratified Sampling: Disadvantages
● A separate sampling frame has to be prepared for each stratum,
covering the entire population.
● When examining multiple criteria to divide the population,
stratifying variables may be related to some but not to others
further complicating the design and potentially reducing the utility
of the strata.
● In some cases (such as designs with a large number of strata, or
those with a specified minimum sample size per group), stratified
sampling can potentially require a larger sample than other
methods.
● It is time consuming and expensive.
● It can lead to classification errors.
Slide Courtesy:Dr.Uma
Cluster Sampling
● In cluster sampling, the population is divided into non-overlapping
clusters or areas, as in stratified sampling.
● Each cluster is a miniature or microcosm of the population.
● Each cluster should have characteristics similar to those of the whole population.
● Instead of sampling individuals from each subgroup like in stratified
sampling, in cluster sampling entire clusters are randomly selected.
● A subset of the clusters is selected randomly for the sample.
● If the number of elements in the subset of clusters is larger than the
desired value of n(sample size), these clusters may be subdivided to
form a new set of clusters and subjected to a random selection
process.
Source:www.netquest.com
Slide Courtesy:Dr.Uma
Cluster Sampling
● When to Use: When the population is already broken up into
groups (clusters).
● Key Aspect: Heterogeneous members in each group.
● General Procedure:
○ The population is divided into non-overlapping areas (clusters).
○ Each cluster is a miniature or microcosm of the population.
○ Clusters are selected randomly.
○ Either all elements of the selected clusters are included in the sample,
or elements from the selected clusters are chosen using simple
random sampling.
Slide Courtesy:Dr.Uma
Cluster Sampling examples
Given a set of 20 coins of different colours
● Population is divided into 5 clusters each having 4 coins.
● A whole cluster is randomly selected to be included in the sample.
Source: analyticsvidhya.com
Cluster Sampling examples
An athletic organization wishes to find out which
sports Grade 11 students are participating in across
Canada.
● It would be too costly and lengthy to survey
every Canadian in Grade 11, or even a couple of
students from every Grade 11 class in Canada.
● Instead, each school with Grade 11
students is considered a cluster, and 100
schools are randomly selected from all over
Canada.
● These schools provide clusters of samples. Then,
every Grade 11 student in all 100 clusters is
surveyed. In effect, the students in these clusters
represent all Grade 11 students in Canada.
Source: s4be.cochrane.org
Cluster Sampling examples
The municipal council of a small city wants
to investigate the use of health care
services by residents.
● The council first obtains electoral
subdivision maps that identify and label
each city block. From these maps, the
council creates a list of all city blocks.
This list will serve as the sampling
frame.
● Every household in that city belongs to a
city block, and each city block
represents a cluster of households. The
council randomly picks a number of city
blocks.
Source:coronainsights.com
Cluster Sampling: Advantages
● It is more convenient for geographically dispersed populations.
● It can reduce the travel costs to contact sample elements.
● It simplifies the administration of the survey.
● It is more feasible. Dividing the entire population into
groups that resemble one another increases the feasibility of the sampling.
● Since each cluster represents the entire population, more
subjects can be included in the study.
● Requires fewer resources. Since cluster sampling selects only
certain groups from the entire population, the method requires
fewer resources for the sampling process.
Cluster Sampling: Disadvantages
● It is statistically less efficient when the cluster elements are
similar.
● The cost and complexity of the statistical analysis are greater than
for simple random sampling.
● There is higher sampling error.
● The method is prone to biases. If the clusters representing the
entire population were formed under a biased opinion, the
inferences about the entire population would be biased as
well.
● It’s difficult to guarantee that the sampled clusters are really
representative of the whole population.
Cluster Sampling: Types
There are 2 types of cluster sampling methods.
● One-stage sampling: All of the elements within selected
clusters are included in the sample.
● Two-stage sampling: A subset of elements within the selected
clusters is randomly selected for inclusion in the sample.
Cluster Sampling: One-stage cluster sampling
Here, the population is divided into clusters. Then, some of the clusters are
randomly selected and all members from those clusters are included in the sample.
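A minimal Python sketch of one-stage cluster sampling, using 20 coins split into 5 clusters of 4. The function name, the choice of two clusters, and the seed are assumptions made for illustration.

```python
import random


def one_stage_cluster_sample(clusters, clusters_to_pick, seed=None):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), clusters_to_pick)
    sample = [member for cid in chosen_ids for member in clusters[cid]]
    return chosen_ids, sample


# Coins 1..20 grouped into clusters 0..4 of four coins each.
coin_clusters = {c: list(range(4 * c + 1, 4 * c + 5)) for c in range(5)}
chosen_ids, coins = one_stage_cluster_sample(coin_clusters, 2, seed=5)
print(chosen_ids, coins)
```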
Source:statology.org
Cluster Sampling: Two-stage cluster sampling
As the name suggests, this method of sampling involves 2 stages.
Step 1: Split a population into clusters, then randomly select some of the clusters.
Step 2: Within each chosen cluster, randomly select some of the members to be
included in the survey.
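The two steps can be sketched in Python as follows. The school and student identifiers, the cluster counts, and the seed are hypothetical values chosen only to make the sketch runnable.

```python
import random


def two_stage_cluster_sample(clusters, clusters_to_pick, members_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick members within each picked cluster."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), clusters_to_pick)        # stage 1
    sample = {}
    for cid in chosen_ids:                                             # stage 2
        take = min(members_per_cluster, len(clusters[cid]))
        sample[cid] = rng.sample(clusters[cid], take)
    return sample


# Hypothetical population: 10 schools (clusters) with 30 students each.
schools = {f"school_{s}": [f"s{s}_{i}" for i in range(30)] for s in range(10)}
stage_two = two_stage_cluster_sample(schools, 3, 5, seed=2)
print(stage_two)
```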
Source:statology.org
Difference between Strata and Clusters
Although strata and clusters are both non-overlapping subsets of the
population, they differ in several ways.
● All strata are represented in the sample, but only a subset of the clusters is in
the sample.
● With stratified sampling, the best survey results occur when elements
within strata are internally homogeneous. However, with cluster sampling,
the best results occur when elements within clusters are internally
heterogeneous.
Source: miro.medium.com
Non-probability Sampling
Non-probability sampling is a type of sampling in which not every unit in
the population has a chance/probability (greater than zero) of
being selected in the sample.
Non-Probability Samples: Judgement, Snowball, Convenience, Quota
Convenience Sampling
● It is also known as grab, opportunity, accidental or
haphazard sampling.
● This is a type of non-probability sampling in which the sample
is drawn from that part of the population which is close to hand,
that is, readily available and convenient.
● Here, sample elements are selected for the convenience of the
researcher.
● The researcher using such a sample cannot scientifically make
generalizations about the total population from this sample because
it would not be representative enough.
Source: googleusercontent.com
Convenience Sampling
● When to Use: When the population is not clearly defined, the sampling
unit is not clear, or a complete source list is not available.
● Key Aspect: Subjects for a study are easily available within the
proximity of the researcher.
● General procedure:
○ It is done at the “convenience” of the researcher.
○ Selection : The individuals that are convenient and easiest to
reach are selected to be included in the sample.
Slide Courtesy:Dr.Uma
Convenience Sampling examples
Given a set of 20 coins of different colours.
● Let’s say that the researcher likes the numbers 4,7,12,15,20 .
● Thus, the coins with the same numbers are included in the sample.
Source: analyticsvidhya.com
Convenience Sampling examples
To research the opinions about student support services in your university
● After each of your classes, you ask your fellow students to complete a
survey on the topic.
● This is a convenient way to gather data, but as you only surveyed students
taking the same classes as you at the same level, the sample is not
representative of all the students at your university.
Source: assets.pearsonschool.com
Convenience Sampling examples
To record the popular opinions of people about the current laws of the city.
● The researcher surveys all people that pass by his house.
● Again, this is a convenient way of studying the opinions of people living in
the city. But, it doesn’t reflect the opinions of all the residents of the city.
Source:slideshare.com
Convenience Sampling: Advantages & Disadvantages
➔ Advantages:
● This type of sampling is useful in pilot study.
● It is an inexpensive way to gather initial data for the research.
● It saves time.
● It is relatively easy to get a sample.
● It is simple and easy to implement.
➔ Disadvantages:
● It is prone to significant bias as the sample may not be representative of the
characteristics of the population.
● Since the sample may not be representative of the population, this type of
sampling can’t produce generalizable results.
● It might lead to sampling errors.
● A study conducted on a convenience sample will have limited external
validity.
Slide Courtesy:Dr.Uma
Judgemental Sampling
● Judgemental or Purposive sampling is a
type of non-probability sampling where
the researcher chooses the sample
based on who they think would be
appropriate for the study.
● This is used primarily when there is a
limited number of people that have
expertise in the area being researched.
● The sample depends on the judgement
of the experts conducting the study.
● It is not a scientific method of sampling.
Source: dataz4s.com
Judgemental Sampling
● When to Use: This is used primarily when there is a limited number
of people that have expertise in the area being researched.
Also, the researcher must be confident that the chosen sample is
truly representative of the entire population.
● Key Aspect: The researcher selects a sample based on
experience or knowledge of the group to be sampled.
● General Procedure:
○ Elements of the population are sampled on the basis of the
researcher’s knowledge and judgment.
○ Selection: Elements that have the qualities the researcher is
looking for are chosen.
Slide Courtesy:Dr.Uma
Judgemental Sampling examples
Given a set of 20 coins of different colours.
● Suppose the experts believe that coins numbered 1, 7, 10, 15, and
19 should be considered for the sample as they may help us make
better inferences about the population.
Source: analyticsvidhya.com
Judgemental Sampling examples
To know more about the opinions and experiences of disabled students
at your university
● You purposefully select a number of students with different support
needs at your university in order to gather a varied range of data on
their experiences with student services.
Source: rm-15da4.kxcdn.com
Judgemental Sampling examples
A panel decides to understand the factors which lead a person to select
ethical hacking as a profession.
● The researchers who understand what ethical hacking is will be
able to decide who should form the sample to learn about it as a
profession.
● Researchers can easily identify the participants who are
eligible to be part of the research sample.
Source:statisticshowto.com
Judgemental Sampling: Advantages & Disadvantages
➔ Advantages:
● It consumes minimum time.
● The researcher is given an opportunity to bring their judgement
and expertise into play.
● No special knowledge of statistics is needed.
● Real time results can be obtained.
➔ Disadvantages:
● It is prone to errors in judgment by researcher.
● Low level of reliability and high levels of bias.
● Inability to generalize research findings to the entire
population.
● It is difficult to choose the appropriate sample size.
Slide Courtesy:Dr.Uma
Quota sampling
● In this type of sampling, sample elements are selected until the
quota controls are satisfied.
● The population is first segmented into mutually exclusive sub-
groups, just as in stratified sampling.
● Then judgment is used to select subjects or units from each segment
based on a specified proportion.
● The population units are selected based on predetermined
characteristics of the population.
● It is similar to Stratified sampling but it doesn’t involve random
selection.
● Ex: recruiting the first 50 men and first 50 women that meet
inclusion criteria.
Quota sampling
● When to Use: If a study aims to investigate a trait or a characteristic
of a certain subgroup, this type of sampling is the ideal technique.
● Key Aspect: Sample elements are selected until the quota controls
are satisfied.
● General Procedure:
○ Divide the population into subgroups.
○ Identify proportions or weightage in which the subgroups are
present in the population.
○ Select an appropriate sample size while maintaining the
proportions of the subgroups.
○ Conduct the surveys according to the quotas defined
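The quota mechanism can be sketched in Python: respondents are accepted in the order they arrive until each subgroup's quota is full, with no random selection involved. The function name and the stream of respondents are hypothetical, and the quotas are scaled down to 5 per group for readability.

```python
def quota_sample(arrivals, group_of, quotas):
    """Accept respondents in arrival order until every quota is filled (no randomness)."""
    remaining = dict(quotas)
    accepted = []
    for person in arrivals:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            accepted.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break                      # all quota controls are satisfied
    return accepted


# Hypothetical stream of (id, gender) respondents: every third arrival is "F".
arrivals = [(i, "M" if i % 3 else "F") for i in range(1, 31)]
accepted = quota_sample(arrivals, lambda p: p[1], {"M": 5, "F": 5})
print([person_id for person_id, _ in accepted])
```

Because acceptance depends only on who turns up first, the result reflects the arrival order rather than a random draw, which is the key difference from stratified sampling.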
Slide Courtesy:Dr.Uma
Quota sampling examples
Given a set of 20 coins of different colours.
● Here we need to select items based on predetermined characteristics of
the population.
● Suppose we have to select coins having a number in multiples of four for
our sample. Thus, the coins 4,8,12,16,20 are sampled.
Source: analyticsvidhya.com
Quota sampling examples
A cool drinks company wants to find out which age group prefers which brand of
drinks in a particular city.
● The researcher applies quotas on the age groups 11-21, 22-31, 32-41 and 42-51.
● The researcher then samples people from each quota and surveys them to
gauge the trend among the population of the city.
Source: ovationmr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quota sampling: Advantages & Disadvantages
➔ Advantages:
● It is a cost effective method.
● There is convenience in execution of this sampling.
● It is a speedy process.
● The information can be deciphered once the sampling is done.
● It improves the representation of certain groups within the population
and also ensures that they are not over-represented.
➔ Disadvantages:
● Impossible to determine sampling error as the sample is not chosen
using random selection.
● Can result in sampling bias if the selection of units was based on ease of
access and cost considerations.
● It is not possible to make statistical inferences from the sample to the
population leading to the problems of generalization.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling
● In this type of sampling, survey subjects are selected based on referral
from other survey respondents.
● Existing subjects are asked to nominate further subjects known to them
so that the sample increases in size like a rolling snowball.
● This method of sampling is effective when a sampling frame is difficult to
identify.
● It is usually applied when the subjects are difficult to trace. Ex: it is
extremely challenging to survey homeless people or illegal immigrants.
Source: cuttingedgepr.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling examples
To select students from a class of 20 to be part of a volunteer club:
● Here, person 1 is chosen at random for the sample. Person 1 then
recommends person 6, person 6 recommends person 11, and so on.
1 -> 6 -> 11 -> 14 -> 19
Source: analyticsvidhya.com
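The referral chain in this example can be traced with a short Python sketch. The names `snowball_sample` and `referrals` are hypothetical, and the sketch assumes each person nominates exactly one further person, as in the example; in practice a respondent may nominate several people.

```python
def snowball_sample(seed, referrals, max_size):
    """Follow a chain of referrals starting from a seed subject."""
    sample = [seed]
    current = seed
    while len(sample) < max_size:
        nominee = referrals.get(current)
        # Stop if nobody is nominated or the nominee is already in the sample.
        if nominee is None or nominee in sample:
            break
        sample.append(nominee)
        current = nominee
    return sample

# Who recommends whom, taken from the example: 1 -> 6 -> 11 -> 14 -> 19.
referrals = {1: 6, 6: 11, 11: 14, 14: 19}
chain = snowball_sample(seed=1, referrals=referrals, max_size=5)
```

Only the seed is chosen at random; every later member enters through an existing member, which is the source of the selection bias discussed for this method.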
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Snowball sampling: Advantages & Disadvantages
➔ Advantages:
● The chain referral process allows the researcher to reach populations that are
difficult to sample when using other sampling methods.
● The process is simple and cost-efficient.
● This sampling technique needs little planning and a smaller workforce
compared to other sampling techniques.
➔ Disadvantages:
● There is a significant risk of selection bias in snowball sampling, as the
referenced individuals will share common traits with the person who
recommends them.
● It is usually impossible to determine the sampling error or make inferences
about populations based on the obtained sample.
● The researcher has little control over the sampling method.
● Representativeness of the sample is not guaranteed.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample size
● The more heterogeneous a population is, the larger the sample
needs to be.
● For probability sampling, the larger the sample size, the better.
● With nonprobability samples, results are not generalizable to the
population, whatever the sample size.
● The main factors affecting the sample size are:
○ Total size of the population
○ Margin of error
○ Confidence level
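One common way to combine these three factors is Cochran's formula with a finite population correction. The slides do not name a formula, so the sketch below is an assumption chosen for illustration; the function name `sample_size` and the conservative default `p = 0.5` (the proportion that maximises the required size) are likewise illustrative.

```python
import math
from statistics import NormalDist

def sample_size(population, margin_of_error, confidence=0.95, p=0.5):
    """Cochran's formula with finite population correction (assumed here)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Required size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    # Correct for the finite population size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

n_5pct = sample_size(population=10_000, margin_of_error=0.05)
n_3pct = sample_size(population=10_000, margin_of_error=0.03)
```

A tighter margin of error or a higher confidence level increases the required sample size, while a smaller population reduces it.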
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample statistic & Population parameter
➔ Sample statistic:
● A sample statistic is a piece of information you get from a fraction
of a population i.e. a sample.
● It can also be defined as any number or statistic computed from
the sample data.
● Example: sample average, median, sample standard deviation,
and percentiles.
➔ Population parameter:
● A quantity or statistical measure, for a given population is called a
population parameter.
● It can also be defined as data that refers to something about an
entire population.
● Example: mean and variance of a population are population
parameters.
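The distinction can be illustrated with a small Python sketch. The population of heights is synthetic and purely hypothetical; the point is only that the parameter is computed from every unit, while the statistic is computed from a sample and generally differs from the parameter by a sampling error.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]
parameter = mean(population)          # population parameter (population mean)

sample = rng.sample(population, 100)  # simple random sample of 100 units
statistic = mean(sample)              # sample statistic (sample mean)

sampling_error = statistic - parameter
```

Drawing a different sample changes `statistic` and `sampling_error`, but `parameter` stays fixed for a given population.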
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Sample statistic & Population parameter
Decide whether the numerical value describes a population
parameter or a sample statistic.
Sampling error (random error) vs. non-sampling error (systematic error)
https://www.spss-tutorials.com/simple-random-sampling-what-is-it/
https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
+91 80 2672 1983 Extn 712
MATHEMATICS FOR COMPUTER
SCIENCE ENGINEERS UE23MA242A
Unit 1: Types of Data & Experiments
Mamatha.H.R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Topics to be covered
❖ Types of data
❖ Variables or Attributes
❖ Types of studies
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Data
● Data refers to individual facts, statistics, or items of information
that are collected through observation.
● It can also be defined as the facts and figures collected,
summarized, analyzed and interpreted.
● The data collected in a particular study are referred to as the
data set.
Source: twinkl.de
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of data
Based on their mathematical properties, data are divided into
four groups (NOIR):
• Nominal
• Ordinal
• Interval
• Ratio
They are listed in order of increasing:
• Accuracy
• Power of measurement
• Precision
• Range of applicable statistical techniques
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data
● Quantitative Data are measurements that are recorded on a
naturally occurring numerical scale.
● These are easily open for statistical manipulation and can be
represented by a wide variety of statistical types of graphs and
charts like line charts, bar graphs, scatter plots, etc.
● This type of data answers questions such as
○ “how many”,
○ “how much” and
○ “how often”.
● Example: Age, GPA, Salary, Cost of books this semester, Scores of
tests and exams, weight of a person, temperature in a room etc.
● There are 2 general types of quantitative data:
○ Discrete data
○ Continuous data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data
● Qualitative Data are measurements that cannot be recorded on a
natural numerical scale, but are recorded in categories.
● It is also known as Categorical Data as the information can be sorted
by category, not by number.
● Example: Year in school, Live on/off campus, Major, Gender, colors
etc.
● These can answer questions such as:
○ “how this has happened”, or
○ “why this has happened”.
● In general, there are 2 types of qualitative data:
○ Nominal data
○ Ordinal data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data
● This data type is used just for labeling variables, without having any
quantitative value.
● Here, the term ‘nominal’ comes from the Latin word “nomen” which
means ‘name’.
● They are categories without any particular order or direction.
● Nominal data are sometimes referred to as “labels”.
● Their use is restricted to keeping track of people, objects and
events.
● They are the least powerful level of measurement, with no
arithmetic origin or order.
● Hence, nominal data are of limited use.
● Examples: Gender (Women, Men), Hair color (Blonde, Brown,
Brunette, Red, etc.), Marital status (Married, Single, Widowed) etc.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Nominal data examples
Gender, marital status or any alphabetic / numeric code without
intrinsic order or ranking.
Source: www.slideshare.net
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Ordinal data
● Ordinal data are qualitative data whose values are ordered.
● Ordinal data may indicate superiority.
● But, we cannot do arithmetic operations with ordinal data because they
only show the sequence.
● Based on the relative position, we can also assign numbers to ordinal data.
For example, “first, second, third…etc.”
● Ordinal data allows for setting up inequalities, but it has no absolute value.
● More precise comparisons are not possible.
● Examples:
○ Ranking of users in a competition: The first, second, and third, etc.
○ Rating of a product collected by the company on a scale of 1-10.
○ Economic status: low, medium, and high.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Qualitative data: Ordinal data
● Here, the order matters but not the
difference between values.
● Example: Pain Scales
○ Patients are asked to express the
amount of pain they are feeling on a
scale of 1 to 10.
○ A score of 7 means more pain than a
score of 5, and that is more pain than
a score of 3.
○ But the difference between the 7 and
the 5 may not be the same as that
between the 5 and the 3.
Source: Questionpro, slideshare.net
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Continuous data
● A set of data is said to be continuous if the values belonging to
the set can take on any value within a finite or infinite interval.
● It represents the information that could be meaningfully
divided into its finer levels.
● It can be measured on a scale or continuum and can have
almost any numeric value. For Example, we can measure our
height at very precise scales in different units such as meters,
centimeters, millimeters, etc.
● Examples of continuous data:
○ The amount of time required to complete a project.
○ The height of children.
○ The speed of cars.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative data: Interval data
● It is a data type measured along a scale in which each point is
placed at an equal distance from the next.
● Interval data are measurable and ordered, with equal spacing between
adjacent values, but have no meaningful zero.
● Interval scales not only tell us about the order of the items but also
give information about the difference between any two items.
● There are some descriptive statistics that we can calculate for interval
data such as:
○ Measures of central tendency (mean, median, mode)
○ Range (minimum, maximum)
○ Spread (percentiles, interquartile range, and standard deviation).
● Examples: Temperature (°C or °F, but not Kelvin),
Dates (1055, 1297, 1976, etc.), Time on a 12-hour clock (6 am, 6 pm)
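The descriptive statistics listed above can be computed with Python's standard `statistics` module. The temperature readings below are hypothetical values chosen for illustration, and the quartiles use the module's default ("exclusive") method, which is one of several conventions.

```python
from statistics import mean, median, mode, stdev, quantiles

# Hypothetical room temperatures in degrees Celsius (interval data).
temps_c = [18, 20, 20, 22, 25]

# Measures of central tendency.
centre = {"mean": mean(temps_c), "median": median(temps_c), "mode": mode(temps_c)}

# Range from the minimum and maximum.
value_range = max(temps_c) - min(temps_c)

# Spread: quartiles, interquartile range and sample standard deviation.
q1, q2, q3 = quantiles(temps_c, n=4)
iqr = q3 - q1
spread = stdev(temps_c)
```

All of these rely only on order and differences between values, which is why they remain meaningful for interval data even though the scale has no true zero.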
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the type of data
➔ Gender of each employee at a company.
Qualitative data, Nominal data
➔ Number of tomatoes on each plant in a field.
Quantitative data, Discrete data
➔ Number of defective items in a lot.
Quantitative data, Discrete data
➔ Salaries of CEOs of oil companies.
Quantitative data, Continuous data
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Attribute or Variable
Source: towardsdatascience.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable
● A variable that can be measured numerically is called a quantitative
variable.
● The data collected on a quantitative variable are called quantitative
data.
● Thus, a quantitative variable represents a measure and is numeric.
Its values can be recorded on a numeric scale.
● Example: a country’s population, a book’s price, height, weight,
number of items sold to a shopper, time in a 100-yard dash etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Discrete
● A variable whose values are countable is called a discrete variable.
● In other words, a discrete variable can assume only certain values
with no intermediate values.
● Its values are finite or countably infinite in number.
● Example: number of oranges in a bag, number of students in a
classroom, shoe size etc.
Source: cdn.wallstreetmojo.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Interval
● An interval variable is measured on a scale of equal-sized units.
● Values of interval variables have order.
● It has no true zero-point.
● Interval variables allow us to rank the items measured in order.
● They also allow us to quantify and compare the magnitude of
differences between them.
● Example: temperature in ˚ C or ˚ F, calendar dates etc.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Quantitative Variable: Ratio
● Ratio variables represent the highest level of measurement.
● A ratio variable has an inherent or true zero-point.
● The numerical relationship between the values of a ratio variable is
meaningful.
● We can speak of one value as being a multiple of another
(10 K is twice as high as 5 K).
● Example: temperature in Kelvin, length, counts, monetary quantities
etc.
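A short Python sketch can show why ratios are meaningful on the Kelvin scale but not on the Celsius scale. The helper name `celsius_to_kelvin` is illustrative; the offset of 273.15 is the standard conversion constant.

```python
import math

def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to the Kelvin (ratio) scale."""
    return celsius + 273.15

# On a ratio scale the quotient of two values is meaningful.
ratio_kelvin = 10.0 / 5.0  # 10 K is twice as high as 5 K

# 10 °C is NOT twice as hot as 5 °C: compare the same readings in kelvin.
true_ratio = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences, however, are preserved by the conversion (interval property).
same_difference = math.isclose(
    celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0), 10.0 - 5.0
)
```

Here `true_ratio` is only slightly above 1, so dividing the raw Celsius readings would badly overstate the physical ratio.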
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Properties of Attributes
The type of an attribute depends on which of the following properties it
possesses:
● Distinctness: =, ≠
● Order: <, >
● Addition: +, -
● Multiplication: *, /
Source: dpbnri2zg3lc2.cloudfront.net
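Under the usual convention, each measurement level keeps the properties of the levels below it and adds one more. The mapping below is a sketch of that convention; the names `PROPERTIES` and `supports` are hypothetical and introduced only for this illustration.

```python
# Properties possessed by each attribute type, from weakest to strongest scale.
PROPERTIES = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio":    {"distinctness", "order", "addition", "multiplication"},
}

def supports(scale, prop):
    """Return True if attributes on the given scale possess the property."""
    return prop in PROPERTIES[scale]

# Calendar dates (interval) can be subtracted but not meaningfully divided.
dates_can_subtract = supports("interval", "addition")
dates_can_divide = supports("interval", "multiplication")
```

For example, hair colour (nominal) supports only equality tests, while length (ratio) supports all four groups of operations.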
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Examples
In the table below identify which columns represent qualitative
variables and which columns represent quantitative variables.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of studies
Source: statisticsguruonline.com
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Controlled Study
Source: prehospitalresearch.eu
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Control Group vs Experimental Group
Source: thoughtco.com
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Control Group vs Experimental Group
➔ Control Group:
● A control group is a group separated from the rest of the experiment
such that the independent variable being tested cannot influence the
results.
● This isolates the independent variable’s effect on the experiment and
can help rule out alternative explanations of the experimental results.
➔ Experimental Group:
● An experimental group is a test sample or the group that receives an
experimental procedure.
● This group is exposed to changes in the independent variable being
tested.
● The values of the independent variable and the impact on the
dependent variable are recorded. An experiment may include
multiple experimental groups at one time.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Controlled Experiment
● While all experiments have an experimental group, not all experiments
require a control group.
● Controls are extremely useful where the experimental conditions are
complex and difficult to isolate.
● Experiments that use control groups are called controlled experiments.
Source: cdn.kastatic.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Observational Study: Example
• There have been many studies conducted to determine the effect of
cigarette smoking on the risk of lung cancer. In these studies, rates of
cancer among smokers are compared with rates among non-smokers.
• The experimenters cannot control who smokes and who doesn’t; people
cannot be required to smoke just to make a statistician’s job easier.
Source: cdn.kastatic.org
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Observational study vs. Experimental study
Observational Study                          | Experimental Study
Observe only; no “treatment” assigned.       | “Treatment” assigned.
Generally a control group is not needed.     | Uses control group for comparison.
Reports an association.                      | Reports a cause and effect.
May (or may not) use random sample sets.     | Randomization of sample group.
May (or may not) generalize to population.   | Generalizes to population.
Slide Courtesy:Dr.Uma
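The randomization step that separates an experimental study from an observational one can be sketched as follows. The function name `randomly_assign` and the participant IDs are hypothetical; an even split into two groups is assumed for simplicity.

```python
import random

def randomly_assign(participants, seed=None):
    """Randomly split participants into an experimental and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random order removes any selection by the researcher
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1..20.
experimental_group, control_group = randomly_assign(range(1, 21), seed=0)
```

The experimental group would then receive the treatment while the control group would not, so that differences in the outcome can be attributed to the treatment rather than to how the groups were formed.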
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the types of study
Q1. A study took a random sample of adults and asked them about their bedtime
habits. The data showed that people who drank a cup of tea before bedtime
were more likely to go to sleep earlier than those who didn't drink tea.
Q2. A study took a group of adults and randomly divided them into two groups.
One group was told to drink tea every night for a week, while the other group
was told not to drink tea that week. Researchers then compared when each
group fell asleep.
Slide Courtesy:Dr.Uma
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Identify the types of study
Q3. A study randomly assigned volunteers to one of two groups:
● One group was directed to use social media sites as they usually do.
● The other group was blocked from social media sites.
Q4. A study took a random sample of people and examined their social
media habits. Each person was classified as either a light, moderate, or
heavy social media user. The researchers looked at which groups tended
to be happier.
Slide Courtesy:Dr.Uma
DATA ANALYTICS
References
https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-data-types-in-statistics-for-data-science/
Text Book:
Statistics for Engineers and Scientists, William Navidi.
THANK YOU
Dr.Mamatha H R
Professor, Department of Computer Science
+91 80 2672 1983 Extn 834
K.M Mitravinda