Bio401 Mid Subj by KAINAT ALVI
1. Descriptive Statistics
These are the statistical methods used to organize and summarize data.
The summary can be numeric or graphical, and each form has several
methods of its own. Descriptive methods form the basis of any
quantitative analysis of data.
Descriptive analysis is also known as Exploratory Data Analysis (EDA).
2. Inferential Statistics
These are the statistical methods by which we reach a conclusion about a
population on the basis of the information contained in a sample drawn
from it.
Inference
Using a random sample to learn something about a larger population.
Variable
It is a characteristic that takes on different values in different persons,
places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
Types of Variable
Quantitative variable: one that can be measured and expressed
numerically.
Discrete variable: a quantitative variable with gaps or jumps between its
possible values, so that no values exist between two neighbouring possible
values. Its values are countable. E.g. number of pages in a book, teeth per
child in a school, children in a family.
Continuous variable: a quantitative variable that can take any value
within an interval, so there are no gaps between its possible values. E.g.
height of a person, weight of a person.
Qualitative variable: one that cannot be measured in numerical form.
Each possible value is a category, which is why such variables are also
known as categorical variables or attributes. E.g. brand of PC, hair color,
marital status, type of animal.
Measurement
It is defined as the assignment of numbers to objects or events according
to a set of rules.
Properties of measurement scales:
Each scale of measurement satisfies one or more of the following
properties of measurement.
• Identity. Each value on the measurement scale has a unique meaning.
• Magnitude. Values on the measurement scale have an ordered
relationship to one another. That is, some values are larger and some are
smaller.
• Equal intervals. Scale units along the scale are equal to one another.
This means, for example, that the difference between 1 and 2 is equal to
the difference between 19 and 20.
• A minimum value of zero. The scale has a true zero point, below which
no values exist.
Types of Measurement Scales:
There are four types of measurement scales • Nominal • Ordinal • Interval •
Ratio
Location parameter
The mean and median are special cases of a family of parameters known
as location parameters.
Measures of position
Other location parameters, which help us determine the position of a
particular observation within a data set, are called measures of position.
Quantile
Measures of position that divide ordered sample data into n equal parts, or
divide a probability distribution into contiguous intervals with equal
probabilities, are called quantiles.
Types of quantiles
Quartile
In an ordered array, the quartiles are the 3 cut-points that divide the
values into 4 equal parts.
Denoted by Qk (k is a subscript; k = 1, 2, 3).
Decile
In an ordered array, the deciles are the 9 cut-points that divide the values
into 10 equal parts.
Denoted by Dk
Percentile
In an ordered array, the percentiles are the 99 cut-points that divide the
values into 100 equal parts.
Denoted by Pk
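As a rough illustration of the three kinds of quantiles, here is a minimal Python sketch. The data values are hypothetical, and `statistics.quantiles` uses one particular interpolation rule (its default "exclusive" method), so other textbooks or software may give slightly different cut-points for the same data.

```python
import statistics

# Hypothetical sample of 11 observations (not taken from the notes).
data = [5, 8, 12, 15, 18, 22, 25, 30, 36, 41, 53]

# statistics.quantiles returns the n - 1 cut-points that split the
# ordered data into n equal parts.
quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # P1 ... P99

print("Quartiles:", quartiles)
print("Number of deciles:", len(deciles))
print("Number of percentiles:", len(percentiles))
```

The second quartile, the fifth decile and the fiftieth percentile all coincide with the median.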
2. Random sample or probability sample: The selection of units in the
sample from a population is governed by the laws of chance or
probability. The probability of selection may be the same for every unit or
may differ between units.
3. Non-random sample or purposive sample: The selection of units in
the sample from the population is not governed by the laws of
probability. For example, the units are selected on the basis of the personal
judgment of the surveyor. Persons volunteering to take a medical test or to
try a new type of coffee also form a non-random sample.
Probability Sampling
This sampling technique uses randomization so that every element of the
population has a known, non-zero chance of being part of the selected
sample (an equal chance in the simplest designs). It is also known as
random sampling. Such sampling techniques can be classified in the
following way.
• Simple random sampling • Stratified sampling • Systematic sampling •
Cluster (area) sampling • Multistage sampling
Simple Random Sampling: Every element has an equal chance of being
selected as part of the sample. It is used when we don’t have any prior
information about the target population.
For example: random selection of 20 students from a class of 50 students.
Each student has an equal chance of being selected. Here the probability
that a given student is picked on the first draw is 1/50, and the probability
that the student ends up in the sample of 20 is 20/50.
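The class example can be sketched in Python as follows. The roster (students labelled 1 to 50) and the fixed seed are assumptions made only for illustration. Note that 1/50 is the chance of being picked on any single draw, whereas the chance that a given student ends up among the 20 selected is 20/50.

```python
import random

# Hypothetical class roster: 50 students labelled 1..50.
students = list(range(1, 51))

# Fixed seed only so that the sketch is reproducible.
rng = random.Random(42)

# Draw 20 distinct students; every student is equally likely to be chosen.
sample = rng.sample(students, k=20)

# Probability that any given student ends up in the sample.
inclusion_probability = len(sample) / len(students)

print(sorted(sample))
print("Inclusion probability:", inclusion_probability)
```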
Stratified Sampling: This technique divides the elements of the
population into subgroups (strata) based on similarity, in such a way that
the elements within a stratum are homogeneous while the strata differ
from one another. Elements are then randomly selected from each
stratum. We need prior information about the population to create the
strata.
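A minimal sketch of stratified sampling, assuming the same sampling fraction is used in every stratum (proportional allocation, which is only one of several possible choices). The function name `stratified_sample` and the population of students are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Group units into strata, then draw the same fraction from each."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 first-year and 40 final-year students.
population = [(i, "first-year" if i < 60 else "final-year") for i in range(100)]

rng = random.Random(0)  # fixed seed for reproducibility
chosen = stratified_sample(population, lambda u: u[1], 0.1, rng)

print(len(chosen), "students selected")
```

With a 10% fraction the sample contains 6 first-year and 4 final-year students, mirroring the make-up of the population.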
Cluster Sampling: The entire population is divided into clusters or
sections, and then clusters are randomly selected. Clusters are identified
using details such as age, sex, location etc. Cluster sampling can be done
in the following ways:
Single Stage Cluster Sampling: Entire clusters are selected randomly, and
all the elements of each selected cluster are included in the sample.
Two Stage Cluster Sampling: First we randomly select clusters, and
then from those selected clusters we randomly select elements for the
sample.
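The difference between single-stage and two-stage cluster sampling can be sketched as follows. The villages and households are hypothetical, as are the numbers of clusters and elements drawn.

```python
import random

# Hypothetical population: 6 clusters (villages) of 5 households each.
clusters = {
    f"village_{c}": [f"v{c}_house_{h}" for h in range(5)]
    for c in range(6)
}

rng = random.Random(1)  # fixed seed for reproducibility

# Randomly select 2 whole clusters.
picked = rng.sample(sorted(clusters), k=2)

# Single-stage: every element of each selected cluster is in the sample.
single_stage = [unit for name in picked for unit in clusters[name]]

# Two-stage: within each selected cluster, randomly select 2 elements.
two_stage = [
    unit for name in picked for unit in rng.sample(clusters[name], k=2)
]

print("Selected clusters:", picked)
print("Single-stage sample size:", len(single_stage))
print("Two-stage sample size:", len(two_stage))
```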
Systematic Sampling: Here the selection of elements is systematic and
not random, except for the first element. All the elements are first put
together in a sequence, a starting element is chosen at random, and the
remaining elements of the sample are then taken at regular intervals along
the sequence, so that each element still has an equal chance of being
selected.
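A minimal sketch of selecting elements at regular intervals after a random start. The helper name `systematic_sample` is hypothetical, and the sketch assumes the population size is a multiple of the sample size so that the interval is a whole number.

```python
import random

def systematic_sample(ordered_units, sample_size, rng):
    """Random start in the first interval, then every k-th unit."""
    k = len(ordered_units) // sample_size  # sampling interval
    start = rng.randrange(k)               # the only random choice
    return ordered_units[start::k][:sample_size]

# Hypothetical population: units numbered 1..100, already in sequence.
units = list(range(1, 101))
picked = systematic_sample(units, 10, random.Random(7))

print(picked)
```

Whatever the starting point, consecutive selected units are exactly 10 positions apart.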
Multi-Stage Sampling: It is a combination of one or more of the methods
described above. The population is divided into multiple clusters, and
these clusters are further divided and grouped into various subgroups
(strata) based on similarity. One or more clusters can be randomly
selected from each stratum. This process continues until the clusters can’t
be divided any more. For example, a country can be divided into states,
cities, and urban and rural areas, and all the areas with similar
characteristics can be merged together to form a stratum.
Non-Probability Sampling Methods
These methods do not rely on randomization. They depend instead on the
researcher’s ability to select elements for a sample. The outcome of
sampling might be biased, and not all elements of the population have an
equal chance of being part of the sample. This type of sampling is also
known as non-random sampling.
• Convenience Sampling • Purposive Sampling • Quota Sampling •
Referral /Snowball Sampling
Convenience Sampling: Here the samples are selected based on
availability. This method is used when units are hard to reach and costly to
obtain, so the sample is chosen on the basis of convenience. For example:
researchers prefer this method during the initial stages of survey research,
as it is quick and easy and delivers results fast.
Purposive Sampling: This is based on the intention or purpose of the
study. Only those elements that best suit the purpose of the study are
selected from the population. For example: if we want to understand the
thought process of people who are interested in pursuing a master’s
degree, then the selection criterion would be “Are you interested in a
Masters in..?” All the people who respond with a “No” will be excluded
from our sample.
Quota Sampling: This type of sampling depends on some pre-set
standard. It aims to select a representative sample from the population:
the proportion of a characteristic or trait in the sample should be the same
as in the population. Elements are selected until the exact proportions of
certain types of data are obtained, or until sufficient data in the different
categories has been collected.
For example: if our population has 45% females and 55% males, then our
sample should reflect the same percentages of males and females.
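The 45% / 55% example can be sketched as filling fixed quotas in arrival order. The function name `quota_sample`, the sample size of 20 and the stream of respondents are all assumptions made for illustration. No randomization is involved, which is what makes this a non-probability method.

```python
def quota_sample(respondents, quotas):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    sample = []
    for person, category in respondents:
        if remaining.get(category, 0) > 0:
            sample.append((person, category))
            remaining[category] -= 1
        if not any(remaining.values()):
            break  # all quotas are full
    return sample

sample_size = 20
# Quotas mirror the population proportions: 45% female, 55% male.
quotas = {
    "female": 45 * sample_size // 100,
    "male": 55 * sample_size // 100,
}

# Hypothetical stream of available respondents (id, sex).
respondents = [(i, "female" if i % 2 else "male") for i in range(100)]
sample = quota_sample(respondents, quotas)

females = sum(1 for _, sex in sample if sex == "female")
print("Females:", females, "Males:", len(sample) - females)
```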
Referral /Snowball Sampling: This technique is used in situations where
the population is largely unknown and its members are rare. We therefore
take the help of the first element we select from the population and ask
him or her to recommend other elements who fit the description of the
sample needed. This referral process goes on, increasing the size of the
sample like a snowball.
For example: it is used for highly sensitive topics like HIV/AIDS, where
people will not openly discuss their condition or participate in surveys to
share information about it. Not all the affected people will respond to the
questions asked, so researchers can contact people they know, or
volunteers, to get in touch with them and collect information.
Descriptive Statistics is performed using Tabular, Graphical, and
Numerical methods.
Tabular Methods are used to summarize the data in table form, that is, a
systematic organization of information in a grid of rows and columns.
The most frequently used tabular formats for data summarization are the
frequency table and the cross-tabulation.
Graphical Methods are a visual way of presenting data using charts and
graphs. The visuals make the data intuitive and easy to understand. The
most frequently used visual representations of data are the Bar Plot,
Histogram, Pareto Chart, Box Plot, Pie Chart, Line Plot, and Scatter Plot.
Ordered array
It is the listing of the values of a collection (either sample or population) in
order of magnitude, from the smallest value to the largest value.
Statistical Distribution
It is the distribution of a statistic across an infinite number of samples.
Frequency Distribution
It is the organization of raw data into tabular form using two important
components: class and class frequency.
• A class is a category, or a range of values, of a qualitative or quantitative
variable.
• The class frequency (f) is the number of data values (observations) that
fall in a specific class.
• The relative frequency is the proportion of observations associated with
a class.
In other words, a frequency distribution is a table that organizes data
values into classes or intervals along with the number of values that fall in
each class (frequency, f).
Types of frequency distribution
Categorical frequency distribution: is used for qualitative data {Nominal
& Ordinal}.
Ungrouped frequency distribution: is used for quantitative data {Interval
& Ratio}. It shows the frequency of each separate data value rather than
of groups of data values. It suits data sets with few different values, where
each value is in its own class.
Grouped frequency distribution: is used for quantitative data. It suits
data sets with many different values, which are grouped together into
classes.
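To make the contrast between ungrouped and grouped frequency distributions concrete, here is a small Python sketch. Both data sets and the class width of 10 are hypothetical.

```python
from collections import Counter

# Ungrouped: few distinct values, so each value is its own class.
children_per_family = [0, 2, 1, 2, 3, 1, 2, 0, 2, 1]
ungrouped = Counter(children_per_family)
print("Value  Frequency")
for value in sorted(ungrouped):
    print(f"{value:<6} {ungrouped[value]}")

# Grouped: many distinct values, so they are binned into classes.
ages = [23, 35, 41, 28, 52, 47, 39, 31, 26, 58, 44, 36]
width = 10  # class width (an assumption for this example)
grouped = Counter((age // width) * width for age in ages)
print("Class  Frequency")
for lower in sorted(grouped):
    print(f"{lower}-{lower + width - 1}  {grouped[lower]}")
```

In both tables the frequencies add up to the number of observations.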
Relative frequency
• Shows the proportion or percentage of the data that falls in a particular
class.
OR
A relative frequency is the ratio (fraction or proportion) of the number of
times a value of the data occurs to the total number of outcomes.
To convert a frequency into a proportion or relative frequency, we
divide the frequency for each class by the total of the frequencies. The sum
of the relative frequencies will always be 1.
R.F. = frequency of the class / total frequency
Cumulative relative frequency
It is the accumulation of the previous relative frequencies. To find the
cumulative relative frequency for a row, add all the previous relative
frequencies to the relative frequency for the current row.
Cumulative relative frequency = sum of previous relative frequencies +
current class relative frequency
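The two calculations can be sketched in a few lines of Python. The grades are hypothetical data. Note that the quantity added at each step is the relative frequency of the current class, so the last cumulative value is 1.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw data: exam grades of 10 students.
grades = ["A", "B", "B", "C", "A", "B", "C", "C", "B", "D"]

freq = Counter(grades)
total = sum(freq.values())
classes = sorted(freq)

# Relative frequency = class frequency / total frequency.
relative = [freq[c] / total for c in classes]

# Cumulative relative frequency = running sum of relative frequencies.
cumulative = list(accumulate(relative))

print("Class  f  R.F.  Cum. R.F.")
for c, rf, crf in zip(classes, relative, cumulative):
    print(f"{c:<6} {freq[c]}  {rf:.2f}  {crf:.2f}")
```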
Contingency tables
Contingency tables (also called crosstabs or two-way tables) are used in
statistics to summarize the relationship between several categorical
variables. A contingency table is a special type of frequency distribution
table, where two variables are shown simultaneously. The entries in the
cells of a two-way table can be frequency counts or relative frequencies
(just like a one-way table).
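A two-way table of frequency counts can be sketched by counting pairs of categories. The survey records below (sex and smoking status) are hypothetical.

```python
from collections import Counter

# Hypothetical records: (sex, smoker) for each person surveyed.
records = [
    ("F", "yes"), ("M", "no"), ("F", "no"), ("M", "yes"),
    ("F", "no"), ("M", "no"), ("F", "no"), ("M", "yes"),
]

# Each cell of the table counts one combination of the two variables.
cells = Counter(records)
rows = sorted({sex for sex, _ in records})
cols = sorted({smoker for _, smoker in records})

print("sex", *cols, sep="\t")
for r in rows:
    print(r, *(cells[(r, c)] for c in cols), sep="\t")
```

Dividing every cell by the total number of records would turn the counts into relative frequencies.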
Frequency Histograms
A graphical display of a frequency distribution is known as a frequency
histogram. It is a bar graph that represents the frequency distribution.
• The horizontal scale is quantitative and measures the data values.
• The vertical scale measures the frequencies of the classes.
• Consecutive bars must touch.
HISTOGRAM
A histogram consists of a set of adjacent rectangles whose bases are
marked off by class boundaries along the X-axis, and whose heights are
proportional to the frequencies associated with the respective
classes.
FREQUENCY POLYGON:
A frequency polygon is obtained by plotting the class frequencies against
the mid-points of the classes, and connecting the points so obtained by
straight line segments.
FREQUENCY CURVE: When the frequency polygon is smoothed, we
obtain what may be called the frequency curve.
Box and whisker plot
It shows the five-number summary of a set of data: minimum, lower
quartile, median, upper quartile, maximum.
A box plot (also called a box and whisker plot) displays data using the
middle value of the data (the median) and the quartiles, or 25% divisions
of the data.
Construction:
Step 1: Arrange the data in ascending order.
Step 2: Find the median, lower quartile and upper quartile.
Step 3: Draw a number line that will include the smallest and the largest
data.
Step 4: Draw three vertical lines at the lower quartile (12), median (22) and
the upper quartile (36), just above the number line.
Step 5: Join the lines for the lower quartile and the upper quartile to form a
box.
Step 6: Draw a line from the smallest value (5) to the left side of the box
and draw a line from the right side of the box to the biggest value (53).
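Interquartile range
The interquartile range is the difference between the upper quartile and the
lower quartile. For the values used in the steps above:
36 - 12 = 24
(The difference between the largest and smallest values, 53 - 5 = 48, is the
range, not the interquartile range.)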
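The five-number summary and the interquartile range can be checked with a short Python sketch. The data set is hypothetical: it was chosen so that its minimum, quartiles and maximum match the values quoted in the construction steps (5, 12, 22, 36, 53). Quartiles here follow the default rule of `statistics.quantiles`, and other conventions can give slightly different values. The interquartile range is the upper quartile minus the lower quartile, while the largest value minus the smallest value is the range.

```python
import statistics

# Hypothetical data chosen to reproduce the values in the steps above.
data = [5, 8, 12, 15, 18, 22, 25, 30, 36, 41, 53]

ordered = sorted(data)                               # Step 1
q1, median, q3 = statistics.quantiles(ordered, n=4)  # Step 2

five_number = (ordered[0], q1, median, q3, ordered[-1])
iqr = q3 - q1                          # upper quartile - lower quartile
data_range = ordered[-1] - ordered[0]  # largest value - smallest value

print("Five-number summary:", five_number)
print("Interquartile range:", iqr)
print("Range:", data_range)
```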
What are outliers?
Unusual data values as compared to the rest of the set. They may be
distinguished by gaps in a histogram.
What are shapes of distribution?
• Symmetric: Data is symmetric if the left half of its histogram is roughly a
mirror image of its right half.
• Skewed: Data is skewed if it is not symmetric and if it extends more to one
side than the other.
• Uniform: Data is uniform if it is equally distributed (on a histogram, all the
bars are the same height or approximately the same height).