Unit 1
Date : 21/09/2024
Subject : #UE23MA242A
What is Data?
Data refers to individual facts, statistics or items of
information that are collected through observation
Example:
Data => {13.5C, 14.7C, 19.9C}
Information => Highest temp = 19.9C and lowest temp = 13.5C
Population:
A population is the entire collection of data/objects about
which the information is sought
Sample:
Population | Sample
It is the complete set | It is a subset of the population
Hard to define and observe | Easier to define and observe
Time consuming and costly to study | Faster and cheaper to study
Contains all the members of a specified group | Contains a small part of the population that represents the population
Eg: All the countries of the world | Eg: Countries with data available about GDP and birth rates since 2000
Why Sampling?
Necessity:
Sometimes it’s simply not possible to study the whole
population due to its size or inaccessibility.
Practicality:
It’s easier and more efficient to collect data from a
sample.
Cost-effectiveness:
There are fewer participant, laboratory, equipment, and
researcher costs involved.
Manageability:
Storing and running statistical analyses on smaller datasets
is easier and more reliable.
Saves time:
As the sample is much smaller than the population, data
collection is faster
Characteristics of Sample:
1. Unbiased, which means it should contain all types of data in
the same proportion as they occur in the population
2. It should represent the population
3. It should be Goal-Oriented
4. It should be random i.e. every item in the population must have
an equal chance of getting selected
5. It should be appropriately sized: it cannot be very small
compared to the population and also cannot be bigger than the
population. It should be just large enough to satisfy the
other characteristics
Types of Population:
1. Tangible or Concrete Population:
This type of population consists of physical objects such as
cars, bolts, apples etc which can be counted. Thus, this
population is usually finite and involves counting
Each time an item is sampled, the population size decreases by 1
2. Conceptual Population:
Populations which do not consist of physical objects are
called Conceptual Populations
This type of population usually involves measuring something
multiple times
The size of the population is usually very large
Example:
1. A geologist weighs a rock several times on a scale ->
Conceptual Population
3. Target Population:
It is the entire population to which researchers want to
generalise their conclusions
4. Study Population:
It is the population that the researchers have access to;
access can be limited by geography, permissions, money etc
It is a subset of the Target Population
Sampling Breakdown:
Study : Find the mean weight of all students of all universities
in India.
Samples
Probability Sampling
It is a type of sampling in which every unit in the population
has a non-zero chance/probability of being selected
This type of sampling reduces bias
When every item in the population has an equal probability of
being selected in the sample it is known as Equal Probability
Sampling or Self-Weighting
Probability Samples
1) Simple Random:
$\left(\frac{n}{N}\right) \times 100$
(n = sample size, N = population size)
2) Systematic:
In this type of sampling, we arrange the population in some
systematic way and then pick out items at regular intervals
The first element is selected randomly
Then we choose every K'th element where K = (Population Size /
Sample Size)
This is also an Equal Probability Sampling method
3) Stratified:
4) Cluster:
One-stage: clusters are made -> random clusters are selected ->
entire clusters are put in the sample
Two-stage: clusters are made -> random clusters are selected ->
random items from the randomly chosen clusters are selected ->
put in the sample
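The probability sampling methods described above can be sketched in Python. This is only an illustration: the population of 100 numbered units, the sample sizes, the cluster size of 20 and the function names are all assumptions made for the example, and only one-stage cluster sampling is shown.

```python
import random

def simple_random_sample(population, n, rng):
    """Pick n distinct items; every item has the same chance n/N."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random start in the first interval, then every K-th item."""
    k = len(population) // n           # sampling interval K = N / n
    start = rng.randrange(k)           # random first element in [0, K)
    return [population[start + i * k] for i in range(n)]

def cluster_sample(clusters, m, rng):
    """One-stage cluster sampling: pick m whole clusters at random."""
    chosen = rng.sample(clusters, m)
    return [item for cluster in chosen for item in cluster]

rng = random.Random(42)                # fixed seed so runs are repeatable
population = list(range(1, 101))       # hypothetical population of 100 unit IDs

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
clusters = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 clusters of 20
clu = cluster_sample(clusters, 2, rng)

print("Simple random:", sorted(srs))
print("Systematic:   ", sys_sample)
print("Cluster:      ", len(clu), "items from 2 clusters")
```

In the systematic sample the selected IDs are always exactly K apart, which is what distinguishes it from the simple random sample.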
Non-Probability Sampling:
In this type of sampling, every item does not have an equal
chance of getting selected in the sample.
This means some items might have a 0 probability of getting
selected. These items are usually referred to as Out of
coverage/Undercovered items
It involves the selection of elements based on assumptions
regarding the population of interest
It is a more biased sample
Non-Probability Samples
1) Convenience Sampling:
2) Judgement Sampling:
3) Snowball Sampling:
4) Quota Sampling:
In this sampling, sample elements are selected until the quota
controls are satisfied
The population is first divided into mutually exclusive
sub-groups, as in Stratified sampling, and then judgement
sampling is used to select items from each segment based on a
specific proportion
Errors in Sampling
1. Sampling Error or random error
The discrepancy between a sample statistic and its
population parameter is called sampling error
It occurs when the sample is not representative of the
population
2. Non-Sampling or Systematic Error
Occurs during data collection, causing the data to differ
from the true values
Sampling Bias:
This bias occurs when the sample is not representative of
the population
It can be either Selection bias or Non-Response Bias
Selection Bias:
It is a bias in which the samples are chosen in such
a way that some members of the intended population
have a higher or lower sampling probability than
others
Non-Response Bias:
This occurs due to the absence of certain groups of
items from the population during sampling
Types of Data:
NOIR => Nominal Ordinal Interval Ratio
1. Qualitative (N-O):
Measurements that cannot be recorded on a naturally
occurring scale
This information can be sorted into categories but cannot be
expressed as numbers
Nominal and Ordinal
Nominal:
Data that can be categorized without any natural order
Ex: Gender (Male, Female), Colours(Red, Green and Blue)
Ordinal:
Data has a natural order, but does not have a regular
interval between them
Ex: Grades (A,B,C), Satisfaction Levels
2. Quantitative (I-R):
These are measurements that can be recorded on a naturally
occurring scale
Such data lend themselves to statistical analysis and can be
plotted on various graphs
Discrete, Continuous, Interval and Ratio
Discrete:
If the values of the set are distinct and separate (countable)
then it is said to be discrete data
Usually bar charts are used to display this data
It has a limited number of possible values
Continuous:
If the values in the set can take any value within an
interval, it is said to be continuous data
Interval:
In this data type, data is measured along a scale with
regular intervals
Although they are placed at regular intervals, they do
not have a meaningful zero
Ex: Temperature in Celsius (0 Celsius does not mean "No
temperature")
Ratio:
Similar to interval data but it has a meaningful zero
point
It is the most precise type of data and allows for all
statistical techniques
Ex: Weight (0 means no weight)
Types of Studies:
Observational:
No interference from the researcher, subjects are just
observed
Experimental:
The researcher intervenes to perform an experiment and then
observations are made
Usually, experimental studies involve two groups, namely
Control and Experimental
The control group usually has no intervention by the
researcher, so that the independent variable being
studied has no effect on it
The experimental group is the group on which the
experiment is conducted, and where the effect of the
independent variable is observed
Types of Statistics:
Descriptive Statistics:
Involves organization, summarization and display of data
It uses numerical and graphical methods to look for patterns
in a data set, to summarize the given data and to display
the data in a convenient form
Inferential Statistics:
Involves using the sample to infer something about the
population
It uses sample data to make estimates, generalizations,
decisions and predictions about a larger set of data
There are two main areas of Inferential Statistics:
Estimating Parameters:
This means taking a sample statistic (from sample
data) and using it to say something about a
population parameter
Hypothesis Testing:
This is where sample data is used to answer research
questions
Descriptive Statistics:
Measures of central tendency:
There are 3 different types of average: mean, median and mode
All of them summarize where the centre of the data is
Mean:
It is the arithmetic average of the given data
Population mean:
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample mean:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
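As a small illustration of the arithmetic mean, the sketch below averages the three temperature readings used as example data at the start of these notes. The function name `mean` is just a label chosen for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

temps = [13.5, 14.7, 19.9]   # temperature readings in degrees Celsius
avg_temp = mean(temps)
print(round(avg_temp, 2))
```

The same function serves for a population mean or a sample mean; only the interpretation of the list (whole population or a sample from it) changes.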
Weighted mean:
It is the average where some of the elements contribute more to
the mean value
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
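A minimal sketch of the weighted mean, assuming a made-up example where course marks are weighted by credits. The marks, the credits and the function name are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

marks = [80, 90, 70]     # hypothetical marks in three courses
credits = [4, 3, 2]      # hypothetical weights (course credits)

wm = weighted_mean(marks, credits)
print(round(wm, 2))
```

With equal weights the result reduces to the ordinary arithmetic mean.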
Trimmed Mean:
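For a p% trimmed mean, the number of observations removed from
each end of the sorted data is
$\frac{np}{100}$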
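A sketch of a p% trimmed mean, under the assumption that np/100 observations are dropped from each end of the sorted data and that a non-integer np/100 is rounded down. The data set and the function name are made up for the example.

```python
def trimmed_mean(values, p):
    """Mean after dropping n*p/100 observations from each end (p < 50)."""
    data = sorted(values)
    n = len(data)
    cut = int(n * p / 100)        # assumption: round down if not an integer
    kept = data[cut:n - cut]
    return sum(kept) / len(kept)

data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 100]   # hypothetical data with one outlier

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 10)            # drops the values 2 and 100
print(plain, trimmed)
```

The outlier 100 pulls the plain mean well above most of the data, while the trimmed mean stays close to the bulk of the values.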
Median:
It is the value separating the upper half values from the lower
half
It is the middle number of a sorted sample
To find the median we first arrange the data in ascending order
then,
If the number of elements n is odd then the median is the value
of the (n+1)/2 th term
If the number of elements n is even then the median is the
average of the (n/2) th and (n/2 + 1) th terms
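The median rule can be sketched as follows. Positions in the notes are counted from 1 while Python lists are indexed from 0, so the (n+1)/2 th term is `data[n // 2]` when n is odd. The function name is chosen for the example.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # the (n+1)/2 th term, counting from 1
    return (data[mid - 1] + data[mid]) / 2    # the n/2 th and (n/2 + 1) th terms

print(median([7, 1, 3]))       # odd number of elements
print(median([4, 1, 3, 2]))    # even number of elements
```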
Mode:
Empirical Formula
Skewness:
It is the measure of asymmetry of the distribution about its
mean
Skewness can be +ive, -ive, zero or undefined
Symmetric distribution is the one where the left and right side
of the distributions are balanced. The mean, median and mode are
the same
Skewed distribution is the one where the left and right sides of
the distribution are imbalanced. The mean and median are pulled
further towards the long tail (the direction of skew) than the
mode.
mean < median < mode -> Left Skewed
mode < median < mean -> Right Skewed
mean = median = mode -> Symmetric
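The ordering rule above can be turned into a rough check. The sketch below compares only the mean and the median (the mode is left out because it may not be unique), so it is a rule of thumb rather than a proper skewness measure. The data sets and the function name are hypothetical.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: compare mean and median to guess the direction of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"      # long tail on the right pulls the mean up
    if mean < median:
        return "left"       # long tail on the left pulls the mean down
    return "symmetric"

print(skew_direction([1, 2, 2, 3, 20]))
print(skew_direction([1, 18, 19, 19, 20]))
print(skew_direction([1, 2, 3, 4, 5]))
```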
Measures of Spread/Deviation:
It tells us how spread out the data is, i.e. how
homogeneous/heterogeneous the data is
There are two main ways to measure the spread of data
Absolute:
Contains the same unit as the data, usually is the
average of deviations of observations such as standard
deviation etc
Relative:
This is used to compare the deviation of two or more data
sets
Range:
Percentile:
A percentile is a comparison measure between a particular value
and the values of rest of the dataset
For example, if you have scored 75 marks and are ranked in the
85th percentile, that means that 75 marks is greater than 85% of
the scores
The rank (position in the sorted data) of the P-th percentile is
calculated using
$R = \left(\frac{P}{100}\right)(n + 1)$
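Both ideas in this section can be sketched in a few lines: the share of scores that lie below a given score, and the position R = (P/100)(n + 1) of the P-th percentile in the sorted data. The list of scores and the function names are made up for illustration.

```python
def percent_below(scores, value):
    """Percentage of scores that are strictly smaller than the given value."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

def percentile_position(p, n):
    """Position (counting from 1) of the P-th percentile: R = (P/100)(n + 1)."""
    return (p / 100) * (n + 1)

scores = [35, 40, 48, 52, 55, 60, 66, 70, 75, 88]   # hypothetical marks

print(percent_below(scores, 75))        # share of scores below 75 marks
print(percentile_position(50, 9))       # position of the median among 9 values
```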
Quartile:
In this method the distribution is divided into 4 parts
First arrange the values in ascending order. The positions of the
quartiles in the sorted data are:
Position of Q1 = 0.25(n + 1)
Position of Q2 = 0.5(n + 1) (Q2 is also the median)
Position of Q3 = 0.75(n + 1)
If a position is an integer we directly take the value at that
position; if not, we take the average of the preceding and
succeeding values
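The quartile rule in these notes can be sketched as below. Other textbooks and libraries interpolate differently, so the results may not match a library routine exactly. The sketch assumes at least 3 values so that every position falls inside the data; the data sets and function names are hypothetical.

```python
import math

def value_at_position(sorted_data, pos):
    """Value at a 1-based position; average the neighbours if pos is not whole."""
    if pos.is_integer():
        return sorted_data[int(pos) - 1]
    lower = sorted_data[math.floor(pos) - 1]   # preceding value
    upper = sorted_data[math.ceil(pos) - 1]    # succeeding value
    return (lower + upper) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using positions 0.25(n+1), 0.5(n+1), 0.75(n+1)."""
    data = sorted(values)                      # ascending order first
    n = len(data)
    return tuple(value_at_position(data, q * (n + 1)) for q in (0.25, 0.5, 0.75))

print(quartiles([3, 7, 8, 5, 12, 14, 21]))     # n = 7, whole-number positions
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))     # n = 8, fractional positions
```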
InterQuartile Range:
Variance:
It is the measure of how spread the values are from the
centre/average of the distribution
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Steps to calculate:
1. Find the mean of the given data
2. Subtract the mean from each data value
3. Square each deviation found in step 2
4. Add all the squared deviations and divide by N for a
population and by n-1 for a sample
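The steps for calculating variance can be followed one by one in code. This is a minimal sketch with a made-up data set; the `sample` flag and the function name are assumptions of the example.

```python
def variance(values, sample=True):
    """Variance via the four steps: mean, deviations, squares, average."""
    n = len(values)
    mean = sum(values) / n                       # step 1: find the mean
    deviations = [x - mean for x in values]      # step 2: subtract the mean
    squared = [d ** 2 for d in deviations]       # step 3: square each deviation
    divisor = n - 1 if sample else n             # step 4: n-1 for sample, N for population
    return sum(squared) / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]                  # hypothetical data

pop_var = variance(data, sample=False)
sample_var = variance(data)
pop_sd = pop_var ** 0.5                          # standard deviation = sqrt(variance)
print(pop_var, round(sample_var, 3), pop_sd)
```

The sample variance is always a little larger than the population variance of the same numbers because it divides by n-1 instead of N.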
Standard Deviation:
Standard deviation = sqrt(Variance)
Larger the standard deviation, greater the spread
Just like mean, std. deviation is affected by outlier values
Chebyshev's Inequality:
This states that at least 1 - (1/k^2) of the data from a sample
must fall within k standard deviations of the mean
Example: For k = 2, we have 1 - (1/k^2) = 1 - 1/4 = 3/4 = 75%.
According to Chebyshev's Inequality at least 75% of the data
should lie within 2 standard deviations of the mean
The inequality is written as
$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$
which gives
$P(\mu - k\sigma \leq X \leq \mu + k\sigma) \geq 1 - \frac{1}{k^2}$
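Chebyshev's bound can be checked empirically on any data set. The sketch below uses a made-up list with one outlier, the population standard deviation from Python's `statistics` module, and a hypothetical helper `fraction_within`.

```python
import statistics

def fraction_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)      # population standard deviation
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]        # hypothetical data with an outlier

for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2                 # Chebyshev's guaranteed minimum
    print(k, round(fraction_within(data, k), 3), ">=", round(bound, 3))
```

The observed fraction is usually far above the bound; Chebyshev's Inequality only guarantees a minimum that holds for every distribution.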
Sampling Distribution:
It is a probability distribution of a sample statistic like
sample mean etc taken from different samples from the same
population
It is used to estimate population parameters
If X1...Xn is a simple random sample from a population with mean
μ and variance σ², then the sample mean X̄ is a random variable
with
$\mu_{\bar{x}} = \mu$
$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
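These results can be verified exactly on a tiny example by listing every possible sample. The sketch assumes a made-up population of four values and samples of size 2 drawn with replacement, so that the draws are independent (the condition under which the variance of the sample mean equals σ²/n).

```python
from itertools import product
from statistics import fmean, pvariance

population = [2, 4, 6, 8]      # hypothetical population
n = 2                          # sample size

mu = fmean(population)                 # population mean
sigma2 = pvariance(population)         # population variance

# Every possible ordered sample of size n, drawn with replacement
sample_means = [fmean(s) for s in product(population, repeat=n)]

mean_of_means = fmean(sample_means)    # should equal mu
var_of_means = pvariance(sample_means) # should equal sigma2 / n

print(mu, mean_of_means)
print(sigma2 / n, var_of_means)
```

The list `sample_means` is the sampling distribution of the sample mean for this population: it is centred on μ and is less spread out than the population itself.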