Complete Notes STATS
Statistics is a branch of applied mathematics which deals with the collection, organization, presentation, analysis and interpretation of data.
Biostatistics is the application of statistics to problems in the biological sciences, health, and medicine.
Epidemiology is the study of the distribution and determinants of health, disease, or injury in human populations and the application of this
study to the control of health problems.
TYPES OF STATISTICS
A. Descriptive Statistics
deals with the collection and presentation of data and the computation of summary values that describe the characteristics of a group
B. Inferential Statistics
deals with predictions and inferences based on the analysis and interpretation of the results of the information gathered by
the statistician
VARIABLES
a characteristic or attribute associated with the population being studied
Types of Variables:
o Categorical or Qualitative Variables
example: Gender, Eye color, Blood Type, Civil Status, Socio Economic Status
o Numerical-Valued or Quantitative Variables
Discrete - a variable whose values are obtained by counting
Continuous - a variable whose values are obtained by measuring, such as temperature, distance, area, age, and
height
SCALES OF MEASUREMENT
A. Nominal Scale
categories with no inherent order
e.g., sex, nationality
B. Ordinal Scale
ordered, but differences between values are not meaningful
e.g., Likert scales (rate your degree of satisfaction on a scale of 1 to 5)
e.g., pain ratings
C. Interval Scale
ordered, constant scale, but no natural zero
e.g., temperature (°C, °F)
D. Ratio Scale
ordered, constant scale, natural zero
e.g., height, weight, age, length
SAMPLING TECHNIQUE
Population
o the group of people, animals, places, things, or ideas to which any conclusions based on the characteristics of a sample
will be applied
Sample
o subgroup of the population
SLOVIN’S FORMULA:
n = N / (1 + N·e²)
where:
n – sample size
N – population size
1 – constant
e – sampling error (margin of error)
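As a quick check of Slovin's formula, here is a minimal Python sketch. The function name `slovin_sample_size` and the choice to round the result up to a whole respondent are assumptions made for illustration; the notes give only the formula itself.

```python
import math


def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Return n = N / (1 + N * e^2), rounded up to a whole respondent."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    # Rounding up is an assumption of this sketch, not part of the formula.
    return math.ceil(n)


# N = 1000, e = 0.05: 1000 / (1 + 1000 * 0.0025) = 285.71..., rounded up
print(slovin_sample_size(1000, 0.05))  # 286
```

A smaller sampling error e makes the denominator smaller and therefore calls for a larger sample.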
1. Convenience Sampling
no system of selection; only those whom the researcher or interviewer meets by chance are included in the sample.
the process of picking out people in the most convenient and fastest way to immediately get their reactions to a certain hot
and controversial issue
not representative of the target population because sample members are selected only if they can be accessed easily and conveniently.
Advantage: easy to use
Disadvantage: bias is present
it can deliver accurate results when the population is homogeneous
2. Purposive Sampling
the respondents are chosen based on their knowledge of the information desired.
o Quota Sampling
a specified number of persons of certain types are included in the sample.
o Judgement Sampling
sample is taken based on certain judgements about the overall population
DESCRIPTIVE STATISTICS
✔ Deals with the collection and presentation of data and the computation of summary values that describe the characteristics
of a group
DATA
✔ gathered body of facts
✔ central thread of any activity
✔ Understanding the nature of data is most fundamental for proper and effective use of statistical skills
❖ TYPES OF DATA
o According to Source:
▪ Primary Data – interview, registration, experiment, questionnaire, etc.
▪ Secondary Data – book, journal, newspaper, thesis, dissertation, etc.
❖ Measures of dispersion
o single value that is used to describe the spread of the distribution
o A measure of central tendency alone does not uniquely describe a distribution
❖ Symmetry
o A distribution is said to be symmetric about the mean, if the distribution to the left of mean is the “mirror image”
of the distribution to the right of the mean
▪ Skewness - measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data
set, is symmetric if it looks the same to the left and right of the center point
● Positively Skewed (longer tail on the right)
● Negatively Skewed (longer tail on the left)
● Symmetrical Distribution/Equal
▪ Kurtosis - measure of whether the data are peaked or flat relative to a normal distribution.
● Leptokurtic (more peaked than normal)
● Mesokurtic (Normal)
● Platykurtic (flatter than normal)
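The notes do not give formulas for skewness or kurtosis. The sketch below assumes the common moment-based definitions (third and fourth central moments scaled by powers of the second, with 3 subtracted so that a normal distribution scores 0). The function names are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k over the data."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Moment-based skewness: > 0 positively skewed, < 0 negatively skewed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Excess kurtosis: > 0 leptokurtic, about 0 mesokurtic, < 0 platykurtic."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


print(skewness([1, 2, 3, 4, 5]))         # symmetric data
print(skewness([1, 2, 2, 3, 10]))        # one large value pulls the right tail out
print(excess_kurtosis([1, 2, 3, 4, 5]))  # evenly spread data is flatter than normal
```

Other definitions exist (for example, sample-size-adjusted versions), so values from statistical software may differ slightly from this sketch.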
PROBABILITY
✔ a branch of mathematics which deals with the study of possible outcomes of an event or set of events together with the
outcomes' relative likelihood and distributions
o Two types of probability:
1. Objective probability
a. Classical Probability – calculated by the process of abstract reasoning
b. Relative Frequency Probability – depends on the repeatability of some process and the
ability to count
2. Subjective probability – based upon an educated guess
PROBABILITY DISTRIBUTION
Frequency Distribution
It is a listing of the observed / actual frequencies of all the outcomes of an experiment that occurred when
the experiment was done
Probability Distribution
It is a listing of the probabilities of all the possible outcomes that could occur if the experiment was done
o It can be described as:
A diagram (Probability Tree)
A table
A mathematical formula
BINOMIAL DISTRIBUTION
There are certain phenomena in nature which can be identified as Bernoulli processes, in which:
o There is a fixed number n of trials carried out
o Each trial has only two possible outcomes, say success or failure, true or false, etc.
o The probability of occurrence of any outcome remains the same over successive trials
o Trials are statistically independent
expresses the probability of one set of alternatives – success (p) and failure (q)
o P(X = r) = nCr · p^r · q^(n − r) (probability of r successes in n trials)
n = no. of trials undertaken
r = no. of successes desired
p = probability of success
q = probability of failure (q = 1 − p)
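The binomial formula above can be sketched in Python as follows. The function name `binomial_pmf` is an assumption for illustration; `math.comb` supplies the nCr term.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    if not 0 <= r <= n:
        raise ValueError("r must be between 0 and n")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    q = 1 - p  # probability of failure
    return comb(n, r) * p ** r * q ** (n - r)


# Exactly 2 heads in 4 tosses of a fair coin: 6 * 0.25 * 0.25
print(binomial_pmf(2, 4, 0.5))  # 0.375
```

Summing the probabilities for r = 0, 1, ..., n gives 1, which is a handy sanity check when working by hand.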
POISSON DISTRIBUTION
When there is a large number of trials, but a small probability of success, binomial calculation becomes impractical
If λ = mean no. of occurrences of an event per unit interval of time/space, then the probability that it will occur exactly ‘x’ times
is given by
P(x) = λ^x · e^(−λ) / x! where e is Napier's constant, e ≈ 2.71828
Characteristics of Poisson Distribution
1. It is a discrete distribution
2. Occurrences are statistically independent
3. Mean no. of occurrences in a unit of time is proportional to size of unit
4. It is always right skewed
5. The Poisson distribution (PD) is a good approximation to the binomial distribution (BD) when n ≥ 20 and p ≤ 0.05
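To illustrate the Poisson probability and characteristic 5, the sketch below compares a binomial probability with its Poisson approximation using λ = n·p. The function names and the example values (n = 100, p = 0.02) are assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x occurrences when the mean count is lam."""
    return lam ** x * exp(-lam) / factorial(x)


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 100, 0.02  # large n, small p
lam = n * p       # mean number of successes
for x in range(4):
    # The two columns should be close to each other.
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))
```

The approximation gets better as n grows and p shrinks while n·p stays fixed.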
NORMAL DISTRIBUTION
Also called the Gaussian distribution
Named after the mathematician and astronomer Carl Friedrich Gauss (1777–1855)
It is symmetrical, unimodal (one peak)
The tails are asymptotic to horizontal axis.
The X axis represents a random variable such as height, weight, etc.
The Y axis represents its probability density function
The total area under the curve is 1 (or 100%)
It is fully described by two parameters: the mean and the standard deviation
Area under the curve tells the probability
o The mean ±1 standard deviation covers approximately 68% of the area under the curve
o The mean ±2 standard deviations covers approximately 95.5% of the area under the curve
o The mean ±3 standard deviations covers approximately 99.7% of the area under the curve
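The three areas listed above can be reproduced with the standard normal cumulative distribution function. This is a minimal sketch using `math.erf`; the function names are assumptions for illustration.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def area_within(k: float) -> float:
    """Area between mean - k standard deviations and mean + k standard deviations."""
    return normal_cdf(k) - normal_cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

Because any normal variable can be standardized with z = (x − mean) / standard deviation, the same areas apply whatever the mean and standard deviation are.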
SAMPLING DISTRIBUTION
distribution of values taken by the statistic in all possible samples of the same size from the same population.
SAMPLE PROPORTION
When we want information about the population proportion p of successes, we often take a simple random sample (SRS) and use the sample
proportion p̂ to estimate the unknown parameter p. The sampling distribution of p̂ describes how the statistic varies in
all possible samples from the population.
The mean of the sampling distribution of p̂ is equal to the population proportion p. That is, p̂ is an unbiased estimator
of p.
When the sample size n is large, the sampling distribution of p̂ is close to a Normal distribution with mean p and
standard deviation √(p(1 − p)/n).
In practice, use this Normal approximation when both np ≥ 10 and n(1 − p) ≥ 10 (the Normal condition).
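A short sketch of the two calculations for the sample proportion: the standard deviation of its sampling distribution, taken here as the usual √(p(1 − p)/n), and the Normal condition. Function names and example values are assumptions for illustration.

```python
from math import sqrt


def proportion_sd(p: float, n: int) -> float:
    """Standard deviation of the sampling distribution of the sample proportion."""
    return sqrt(p * (1 - p) / n)


def normal_condition(p: float, n: int) -> bool:
    """True when both np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


print(proportion_sd(0.5, 100))      # about 0.05
print(normal_condition(0.5, 100))   # True: np = 50 and n(1 - p) = 50
print(normal_condition(0.05, 20))   # False: np = 1
```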
SAMPLE MEANS
Sampling from Normal Population
o We have described the mean and standard deviation of the sampling distribution of the sample mean x̄ (mean μ and
standard deviation σ/√n) but not its shape. That's because the shape of the distribution of x̄ depends on the shape of the
population distribution
o In one important case, there is a simple relationship between the two distributions. If the population distribution
is Normal, then so is the sampling distribution of x̄. This is true no matter what the sample size is.
The mean is the same as the average value of a data set and is found using a
calculation. Add up all of the numbers and divide by the number of numbers in
the data set.
The median is the central number of a data set. Arrange data points from
smallest to largest and locate the central number. This is the median. If there are
2 numbers in the middle, the median is the average of those 2 numbers.
The mode is the number in a data set that occurs most frequently. Count how
many times each number occurs in the data set. The mode is the number with
the highest tally. It's ok if there is more than one mode. And if all numbers occur
the same number of times there is no mode.
Mean Formula
The mean x̄ of a data set is the sum of all the data divided by the count n: x̄ = (x1 + x2 + ... + xn) / n.
Median Example
For the data set 1, 1, 2, 6, 6, 9 the median is 4. Take the mean of the two middle values, 2 and 6:
(2+6)/2 = 4.
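The mean, median, and mode of the example data set can be checked with Python's standard library. The helper `modes` is a hypothetical function written for this sketch; it follows the convention in these notes that a data set has no mode when every value occurs equally often.

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return all most frequent values, or [] if every value is equally frequent."""
    counts = Counter(data)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # all values tie, so there is no mode
    return sorted(value for value, c in counts.items() if c == highest)


data = [1, 1, 2, 6, 6, 9]
print(mean(data))    # 25 / 6
print(median(data))  # 4.0, the average of the middle values 2 and 6
print(modes(data))   # [1, 6], two modes
```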
Outliers
Potential Outliers are values that lie above the Upper Fence or below the Lower
Fence of the sample set.
Upper Fence = Q3 + 1.5 × Interquartile Range
Lower Fence = Q1 − 1.5 × Interquartile Range
Quartiles
Quartiles mark each 25% of a set of data:
The second quartile Q2 is easy to find. It is the median of any data set and it
divides an ordered data set into upper and lower halves.
The first quartile Q1 is the median of the lower half not including the value of Q2.
The third quartile Q3 is the median of the upper half not including the value of Q2.
If the size of the data set is even, the median is the average of the middle 2
values in the data set. Add those 2 values, and then divide by 2. The median
splits the data set into lower and upper halves and is the value of the second
quartile Q2.
• IQR = Q3 - Q1
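The quartile rule described above (median of each half, leaving out Q2 when the data set has an odd size) and the fence formulas can be sketched as follows. Function names and the example data are assumptions for illustration; note that statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule; needs 2+ values."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For an odd-sized data set, skip the middle value (Q2) itself.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)


def fences(data):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(data):
    """Values below the lower fence or above the upper fence."""
    low, high = fences(data)
    return [x for x in data if x < low or x > high]


data = [1, 1, 2, 6, 6, 9, 30]
print(quartiles(data))           # (1, 6, 9)
print(fences(data))              # (-11.0, 21.0)
print(potential_outliers(data))  # [30]
```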
Ordering a data set from lowest to highest value, x1 ≤ x2 ≤ x3 ≤ ... ≤ xn, the
minimum is the smallest value x1. The formula for minimum is: Min = x1 = min(x1, ..., xn)
Ordering a data set from lowest to highest value, x1 ≤ x2 ≤ x3 ≤ ... ≤ xn, the
maximum is the largest value xn. The formula for maximum is: Max = xn = max(x1, ..., xn)
How to Find the Range of a Set of Data
The range of a data set is the difference between the minimum and maximum.
To find the range, calculate xn minus x1.
Coefficient of Variation
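The coefficient of variation is the ratio of the standard deviation to the mean.
For a Population: CV = σ / μ
For a Sample: CV = s / x̄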
Standard Deviation
Standard deviation is a measure of dispersion of data values from the mean. The
formula for standard deviation is the square root of the sum of squared
differences from the mean divided by the size of the data set (N for a population, n − 1 for a sample).
For a Population: σ = √( Σ(xi − μ)² / N )
For a Sample: s = √( Σ(xi − x̄)² / (n − 1) )
Variance
Variance measures dispersion of data from the mean. The formula for variance is
the sum of squared differences from the mean divided by the size of the data set (N for a population, n − 1 for a sample).
For a Population: σ² = Σ(xi − μ)² / N
For a Sample: s² = Σ(xi − x̄)² / (n − 1)
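The population and sample versions of variance, standard deviation, and coefficient of variation can be sketched in one place. This sketch assumes the usual convention of dividing by N for a population and by n − 1 for a sample; the function names and example data are illustrative only.

```python
from math import sqrt


def variance(data, sample=False):
    """Sum of squared differences from the mean, divided by N (or n - 1 for a sample)."""
    n = len(data)
    mean = sum(data) / n
    squared_diffs = sum((x - mean) ** 2 for x in data)
    return squared_diffs / (n - 1 if sample else n)


def std_dev(data, sample=False):
    """Standard deviation: the square root of the variance."""
    return sqrt(variance(data, sample))


def coeff_of_variation(data, sample=False):
    """Standard deviation divided by the mean (the mean must not be zero)."""
    return std_dev(data, sample) / (sum(data) / len(data))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared differences sum to 32
print(variance(data), std_dev(data), coeff_of_variation(data))  # population versions
print(variance(data, sample=True), std_dev(data, sample=True))  # sample versions
```

The sample versions are always a little larger than the population versions for the same data, because the divisor n − 1 is smaller than n.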
Midrange
The midrange of a data set is the average of the minimum and maximum values: Midrange = (x1 + xn) / 2.
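The minimum, maximum, range, and midrange described above need only the two extreme values. A minimal sketch, with a hypothetical helper name and the example data set used earlier in these notes:

```python
def extremes_summary(data):
    """Return the minimum, maximum, range, and midrange of a data set."""
    smallest = min(data)  # x1 in the ordered data set
    largest = max(data)   # xn in the ordered data set
    return {
        "min": smallest,
        "max": largest,
        "range": largest - smallest,
        "midrange": (smallest + largest) / 2,
    }


print(extremes_summary([1, 1, 2, 6, 6, 9]))
# {'min': 1, 'max': 9, 'range': 8, 'midrange': 5.0}
```

Because all four values depend only on the extremes, a single outlier can change them a great deal.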
EXAMPLES