2. Introduction to Statistics
Statistics can be divided into two main branches (see the previous diagram).
1. Descriptive statistics: concerned with the organization, presentation, and
summarization of data.
Examples: tables, graphs, numerical summary measures.
2. Inferential statistics: methods used for drawing conclusions about a population
based on the information obtained from a sample of observations drawn from that
population.
Examples: principles of probability, estimation, confidence intervals, comparison of two or
more means or proportions, hypothesis testing, etc.
Parameter & Statistic
Parameter: A descriptive measure computed from the data of a population.
The mean (µ) age of the target population
Statistic: A descriptive measure computed from the data of a sample.
The mean (x̄) age of the sample
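As a minimal sketch of this distinction, the Python snippet below treats a small made-up list of ages as the whole target population, computes the parameter from it, then draws a random sample and computes the corresponding statistic. The ages, the sample size and the seed are assumptions chosen only for illustration.

```python
import random
import statistics

# Hypothetical target population: ages 20..69, one person per age (assumption).
population_ages = list(range(20, 70))

# Parameter: computed from every member of the population.
mu = statistics.mean(population_ages)

# Statistic: computed from a sample drawn from that population.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_ages = rng.sample(population_ages, 10)
x_bar = statistics.mean(sample_ages)

print("population mean (mu):", mu)
print("sample mean (x-bar):", x_bar)
```

The parameter is fixed for a given population, while the statistic changes from one sample to the next.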
Difference Between Descriptive and Inferential Statistics
Role of statistics in using information from a sample to make inferences
about the population
Overview of population & sample
(Sampling frame)
Generalizability:
Generalization is a two-stage procedure: we need to be able to generalize
from the sample to the study population, and
then from the study population to the target population.
If the sample is not representative of the population, the conclusions are
restricted to the sample & don’t have general applicability.
Limitations of statistics:
It deals only with subjects of inquiry that can be quantitatively
measured and numerically expressed.
It deals with aggregates of facts; no importance is attached to individual items.
It is suited only when group characteristics are to be studied.
Statistical data are only approximate and not mathematically exact.
Variable
A variable is a characteristic of subjects that takes on different values for different subjects. OR
It is any aspect/characteristic of an individual or object that:
Can be measured (e.g., height, weight, BP, age) or
Can be categorized (e.g., sex, marital status, HIV test result, …) and
Takes different values for different individuals.
Based on their nature, variables can be categorized as:
Qualitative or non-numeric or categorical
Quantitative or numeric
Response/Explanatory Variable Distinction:
Most statistical analyses distinguish between response variables and explanatory
variables. For instance,
Models for a continuous response variable, such as annual income, describe how its
distribution changes according to levels of explanatory variables. Models for categorical
response variables analyse how such responses are influenced by explanatory variables.
The explanatory variables can be categorical or continuous.
The response variable is sometimes called the dependent variable or Y variable, and the
explanatory variable is sometimes called the independent variable or X variable.
Data
Data (singular: datum)
Data are numbers which can be obtained by measuring, counting or
observing.
The raw material for statistics
Numerical descriptions of things
Raw facts from which information is extracted
Raw data: data collected in original form.
DATA ⇒ INFORMATION ⇒ KNOWLEDGE ⇒ WISDOM
Types of Data
1. Primary data: collected directly from the items or individual respondents
by the investigator for the purpose of a study.
Original, first-hand information
Unorganized or raw data
2. Secondary data: data that were collected by other people or
organizations and statistically treated, and that are then
used by someone else for another purpose.
NB. The same data can be primary for one person and secondary for another.
Sources of Data
Routine health facility data (hospital, health center, clinic, health posts)
Routinely kept records, reports
Literature (published and unpublished)
Disease notifications, Epidemic reports
Census, Civil or vital registration
Laboratories
Surveys
Prospective and experimental studies …etc.
Measurement Scales
Measurement: a procedure in which qualities or quantities are assigned to the
characteristics of subjects, objects or events so that they can be compared.
There are four types of data/scales of measurement.
1. Nominal data
2. Ordinal data
3. Interval data
4. Ratio data
1. Nominal data/scale:
The values assigned to variables are used only to identify categories.
The simplest type of data, where measuring a variable means naming
or categorizing its possible values.
The values fall into unordered categories or classes.
The categories are mutually exclusive.
Uses names, labels, or symbols to classify each measurement.
Examples: Blood type, sex, race, marital status, religion, cause of illness, cause
of death, etc.
2. Ordinal scale:
The values assigned to variables are used to identify category and show magnitude.
Assigns each measurement to one of a limited number of categories that are ranked
in terms of order.
Although non-numerical, the categories have a natural ordering, e.g. cancer
stages, social class, etc.
The spaces or intervals between the categories are not necessarily equal. For example:
1. strongly agree
2. agree
3. no opinion
4. disagree
5. strongly disagree
3. Interval scale:
The values assigned to variables are used to identify category and show magnitude, and
have equal intervals.
In interval data the intervals between values are the same. For example,
in the Fahrenheit temperature scale, the difference between 70 degrees and 71
degrees is the same as the difference between 32 and 33 degrees.
But 40 degrees Fahrenheit is not twice as hot as 20 degrees Fahrenheit.
It has no true zero point: “0” is arbitrarily chosen and doesn’t reflect the absence of
temperature.
E.g., intelligence (IQ), calendar time (year), etc.
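The Fahrenheit example can be checked numerically. The sketch below is only an illustration: it converts the two temperatures to Kelvin, a scale that does have a true zero, to show why the apparent 2:1 ratio in Fahrenheit carries no physical meaning. The helper name `fahrenheit_to_kelvin` is an assumption introduced here.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit to Kelvin, a temperature scale with a true zero."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale: both gaps are 1 degree.
diff_high = 71 - 70
diff_low = 33 - 32

# Ratios are not: 40 F looks like "twice" 20 F only because of where 0 F sits.
ratio_fahrenheit = 40 / 20
ratio_kelvin = fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20)

print("equal differences:", diff_high == diff_low)
print("ratio of Fahrenheit readings:", ratio_fahrenheit)
print("ratio of the same temperatures in Kelvin:", round(ratio_kelvin, 3))
```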
4. Ratio scale
The values assigned to variables are used to identify category and show magnitude,
have equal intervals, and begin at a true zero point.
The data values in ratio data have meaningful ratios. For example, age is
ratio data: someone who is 40 is twice as old as someone who is 20.
Measurement begins at a true zero point and the scale has equal intervals.
Examples: height, age, weight, etc.
The highest scale of measurement.
Degree of precision in measuring
Increases in the order: nominal, ordinal, interval, ratio.
Frequency Tabulation and Distributions
Frequency tabulation provides a convenient counting summary for a set of
data that facilitates interpretation of various aspects of those data.
Its purpose is to produce an efficient counting summary of a sample of data points for ease of
interpretation.
A variable at any level of measurement can be summarised this way.
The display of a frequency tabulation is often referred to as the frequency distribution.
For each value of a variable, the frequency of its occurrence is reported.
It is possible to compute various percent (relative frequency) and percentile values
from a frequency distribution.
If a variable has many different scores, it may be more useful to tabulate frequencies
on the basis of intervals of scores.
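A frequency tabulation of this kind can be sketched in a few lines of Python. The responses below are made up for illustration and use the five-point agreement scale as an ordinal variable; the column layout (frequency, percent, cumulative percent) is one common convention, not the only one.

```python
from collections import Counter

# Ordered categories of the ordinal variable.
levels = ["strongly agree", "agree", "no opinion", "disagree", "strongly disagree"]

# Hypothetical sample of responses (made up for illustration).
responses = ["agree", "strongly agree", "agree", "no opinion", "disagree",
             "agree", "strongly agree", "agree", "disagree", "strongly disagree"]

counts = Counter(responses)
n = len(responses)

# One row per category: frequency, percent, cumulative percent.
table = []
cumulative = 0
for level in levels:
    freq = counts[level]
    cumulative += freq
    table.append((level, freq, 100 * freq / n, 100 * cumulative / n))

print(f"{'value':<20}{'freq':>6}{'pct':>8}{'cum pct':>9}")
for level, freq, pct, cum_pct in table:
    print(f"{level:<20}{freq:>6}{pct:>8.1f}{cum_pct:>9.1f}")
```

The cumulative percent column only makes sense when the categories are ordered, which is why the rows follow the scale order rather than the order of frequency.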
Frequency Tabulation
Statistical Tables
A statistical table is an orderly and systematic presentation of numerical data in
rows and columns.
Rows (stubs) are horizontal and columns (captions) are vertical arrangements.
The use of tables for organizing data involves:
Grouping the data into mutually exclusive categories of the variables and
Counting the number of occurrences (frequency) in each category.
Frequency Distribution
Features of a distribution are center, size, position and shape.
Normal Distribution
Ideally, data would be distributed symmetrically around the center.
It is bell shaped and the majority of the observations lie around the center; that is, as one
moves away from the center, observations become less frequent.
The normal distribution is important because it is the form of distribution that you
must assume describes the scores of a variable in the population when parametric tests
of statistical inference are undertaken.
The curve drawn on the frequency distribution (histogram) shows the ideal distribution
(normal distribution).
The standard normal distribution is defined as having a population mean of 0.0 and a population
standard deviation of 1.0.
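Any set of scores can be re-expressed on this 0/1 scale by converting it to z-scores, z = (x - mean) / SD. The sketch below uses made-up scores and the population standard deviation; note that standardizing changes only the location and scale of the scores, so it does not by itself make a non-normal variable normal.

```python
import statistics

# Hypothetical scores (made up for illustration).
scores = [52, 60, 65, 70, 71, 75, 80, 87]

mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)  # population SD, matching the definition above

# z = (x - mean) / sd re-expresses each score in standard-deviation units.
z_scores = [(x - mean) / sd for x in scores]

print("mean of z-scores:", round(statistics.fmean(z_scores), 10))
print("SD of z-scores:", round(statistics.pstdev(z_scores), 10))
```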
Skewed Distributions (Skewness)
Extremely low or extremely high observations are present in the distribution.
Lack of symmetry
Right- or left-tailed distributions
The most frequent observations are clustered at one end of the scale.
Kurtosis
Degree of clustering
Peakedness (clustering around the center)
Flatness (clustering around the tails)
Summary Measures
Measures of Central Tendency (MCT):
Provide numerical summary measures that give an indication of the central,
average or typical score in a distribution of scores for a variable
The three most commonly reported MCT are the mean, median and mode.
One very important feature of the mean is that it uses every score (the mode
and median ignore most of the scores in a data set).
The mean tends to be stable in different samples.
Based on skewness, a distribution can be:
a) Negatively skewed: occurs when the majority of scores are at the right end of the
curve and a few extreme small scores are scattered at the left end.
b) Positively skewed: occurs when the majority of scores are at the left end of
the curve and a few extreme large scores are scattered at the right end.
c) Symmetrical: neither positively nor negatively skewed. A curve is
symmetrical if one half of the curve is the mirror image of the other half.
NB. Use the next example data to show this using software.
When the distribution is skewed, the median is a better description (than the mean) of the
majority of the observations.
Example
Data: 14, 89, 93, 95, 96
Skewness is reflected in the outlying low value of 14.
The sample mean is 77.4.
The median is 93.
When the data are skewed, the mean is “dragged” in the direction of the
skewness.
In extreme cases it is possible for all but one of the sample points to lie on one
side of the arithmetic mean; in this case, the mean is a poor measure of central
location and does not reflect the center of the sample.
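The figures in this example can be reproduced with Python's standard library. The sketch also computes a moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5); that particular formula is an assumption added here for illustration, since the text does not name one, and statistical packages may report a slightly different, bias-adjusted version.

```python
import statistics

data = [14, 89, 93, 95, 96]  # example data from the text

mean = statistics.mean(data)
median = statistics.median(data)

# Moment-based skewness g1 = m3 / m2**1.5, where m2 and m3 are the second and
# third central moments. Negative values indicate a longer left tail.
n = len(data)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5

print("mean:", mean)
print("median:", median)
print("skewness:", round(skewness, 2))
```

The mean falls below four of the five observations, while the median stays with the bulk of the data, and the negative skewness reflects the single low value pulling the left tail out.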
Measures of dispersion/variation/spread/scatter:
Give an indication of the degree of spread in a sample of scores; that is, how
different the scores tend to be from each other with respect to a specific MCT.
Two or more data sets may have the same mean and/or median but be quite
different in their spread.
MCT alone do not describe the variability or spread of the values.
Consider the following two sets of data:
A: 177 193 195 209 226 Mean = 200
B: 192 197 200 202 209 Mean = 200
These two sets have the same mean (200) but different measures of dispersion:
the values in A are spread more widely (177 to 226) than those in B (192 to 209).
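A short sketch makes the contrast between sets A and B concrete. It reports the range and the sample standard deviation (divisor n - 1) for each set; the choice of these two measures and the helper name `describe` are assumptions made for illustration.

```python
import statistics

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

def describe(values):
    """Return the mean plus two simple measures of spread."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

summary_a = describe(set_a)
summary_b = describe(set_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, "mean:", summary["mean"], "range:", summary["range"],
          "sd:", round(summary["sd"], 2))
```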
The interquartile range (IQR) is a measure of variability that is specifically
designed to be used in conjunction with the median.
The IQR is the difference between the 75th and 25th percentiles; thus it indicates
the spread of the middle 50% of the observations (the half that surrounds the median).
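As an illustration, the IQR of data sets A and B from the dispersion example can be computed with `statistics.quantiles`. Quartiles of small samples depend on the interpolation convention; this sketch uses the function's default ("exclusive") method, so other software may report slightly different values. The helper name `interquartile_range` is an assumption.

```python
import statistics

def interquartile_range(values):
    """Return Q3 - Q1 using the default method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

iqr_a = interquartile_range(set_a)
iqr_b = interquartile_range(set_b)

print("IQR of A:", iqr_a)
print("IQR of B:", iqr_b)
```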
Other measures of location (Percentiles and Z-scores):
Percentiles are measures of the relative standing of observations.
Commonly used percentiles:
• 10, 20, ….. 90% (deciles)
• 20, 40, ….. 80% (quintiles)
• 25, 50, 75% (quartiles)
• 33.3, 66.7% (tertiles)
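Each of these sets of cut points can be obtained from `statistics.quantiles` by changing the number of groups `n`. The observations below (the integers 1 to 99) are made up so that the cut points are easy to read; the default interpolation method is assumed.

```python
import statistics

# Hypothetical observations (made up for illustration): the integers 1..99.
observations = list(range(1, 100))

deciles = statistics.quantiles(observations, n=10)    # 10, 20, ..., 90%
quintiles = statistics.quantiles(observations, n=5)   # 20, 40, 60, 80%
quartiles = statistics.quantiles(observations, n=4)   # 25, 50, 75%
tertiles = statistics.quantiles(observations, n=3)    # 33.3, 66.7%

print("deciles:", deciles)
print("quintiles:", quintiles)
print("quartiles:", quartiles)
print("tertiles:", [round(t, 1) for t in tertiles])
```

Splitting into n groups always yields n - 1 cut points, which is why there are nine deciles but only two tertiles.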
Software Procedures for Descriptive statistics (univariate analysis):
A proper analysis of data must begin with an analysis of the statistical
attributes of each variable in univariate analysis.
From such an analysis we can learn:
• how the values of a variable are distributed:
Normal, binomial, etc.
Under Plots, click “Normality plots with tests”.
3.1. Testing for normality using explore…
[Figure: two normal Q-Q plots, Expected Normal (vertical axis) against Observed Value (horizontal axis)]
The box plot also shows many outliers, suggesting the data are not normally distributed.
[Figure: box plot with many outliers flagged by case number; N = 1441]
When the variable is non-normal, the log of the variable may be distributed normally.
To do this:
Go to Analyze/Descriptive/Q-Q.
Place the variable into the box “Variable.”
On the right, choose “Normal” in the “Test Distribution” box.
In the “Transform” options area, choose “Natural Log Transform.”
Click on “OK.”
Then test the transformed variable again for normality; it can be treated as normal if, in the
chart, the dotted curve coincides with the straight line.
OR:
Generate a new variable using LN(var).
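Outside the statistics package, the same transformation is a one-line computation. The sketch below is an illustration only: the values are made up to be strongly right-skewed, and the comparison of mean and median is a rough symmetry check, not a formal normality test.

```python
import math
import statistics

# Hypothetical right-skewed values (made up for illustration); all must be > 0.
values = [1, 10, 100, 1000, 10000]

# Equivalent of LN(var): a new variable holding the natural log of each value.
ln_values = [math.log(v) for v in values]

raw_mean, raw_median = statistics.mean(values), statistics.median(values)
ln_mean, ln_median = statistics.mean(ln_values), statistics.median(ln_values)

print("raw mean vs median:", raw_mean, raw_median)
print("log mean vs median:", round(ln_mean, 3), round(ln_median, 3))
```

On the raw scale the mean lies far above the median; after the log transform the two coincide, which is what a symmetric distribution would show. The natural log is defined only for positive values, so variables containing zeros or negatives need separate handling.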