2. Introduction to Statistics

The document provides an introduction to statistics, covering its definition, branches (descriptive and inferential), and the application of biostatistics in public health. It explains key concepts such as parameters, statistics, variables, data types, measurement scales, and the importance of frequency distributions and summary measures. Additionally, it discusses the role of statistics in making inferences about populations based on sample data and highlights limitations and methods of data analysis.

Introduction to statistics

 Statistics: A field of study concerned with methods and procedures for:
 The collection, organization, analysis, summarization and interpretation of numerical data, and
 Making scientific inferences about a population using data collected from a representative sample drawn from the population under study.
 Biostatistics: The application of statistical methods and procedures to the fields of biology, medicine and public health.
 Concerned with the valid interpretation of biological/public health data, and
 The communication of information derived from these data to others.
 Has a central role in public health and medical investigations.

 Statistics can be divided into two main branches (see the previous diagram).
1. Descriptive statistics: Concerned with the organization, presentation, and summarization of data.
 Tables, graphs, numerical summary measures
2. Inferential statistics: Methods used for drawing conclusions about a population based on the information obtained from a sample of observations drawn from that population.
 Principles of probability, estimation, confidence intervals, comparison of two or more means or proportions, hypothesis testing, etc.

Parameter & Statistic
Parameter: A descriptive measure computed from the data of a population.
 The mean (µ) age of the target population
Statistic: A descriptive measure computed from the data of a sample.
 The mean (x̄) age of the sample
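The distinction between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a hypothetical, simulated population of ages (not real data); the names `population`, `mu` and `x_bar` are illustrative only.

```python
import random
import statistics

# Hypothetical population: ages of 10,000 people (simulated, not real data).
random.seed(42)
population = [random.randint(18, 80) for _ in range(10_000)]

# Parameter: computed from every member of the population.
mu = statistics.mean(population)

# Statistic: computed from a simple random sample of 100 people.
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"population mean (parameter) = {mu:.2f}")
print(f"sample mean (statistic)     = {x_bar:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; a different random sample would give a slightly different statistic, while the parameter stays fixed.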

Difference Between Descriptive and Inferential Statistics

 Role of statistics in using information from a sample to make inferences
about the population
 Overview of population & sample

(Sampling frame)

Role of statistics in using information from a sample to make inferences about the population…

Collect information from a relatively SMALL sample ⇒ Draw conclusions about a rather LARGE population

 Generalizability:
 Is a two-stage procedure: we need to be able to generalize
 From the sample to the study population, and
 Then from the study population to the target population.
 If the sample is not representative of the population, the conclusions are restricted to the sample and do not have general applicability.
Role of statistics in using information from a sample to make inferences about the population…

Limitations of statistics:
 It deals only with those subjects of inquiry that can be quantitatively measured and numerically expressed.
 It deals with aggregates of facts; no importance is attached to individual items.
 It is suited only when group characteristics are to be studied.
 Statistical results are approximate (estimates subject to error), not mathematically exact.

Variable
 A characteristic of subjects that takes on different values for different subjects. OR
 Any aspect/characteristic of an individual or object that:
 Can be measured (e.g., height, weight, BP, age) or
 Can be categorized (e.g., sex, marital status, HIV test result, …) and
 Takes different values for different individuals.
 Based on their nature, variables can be categorized as:
 Qualitative (non-numeric or categorical)
 Quantitative (numeric)

Variable…
 Response/Explanatory Variable Distinction:
 Most statistical analyses distinguish between response variables and explanatory variables. For instance,
 Statistical models for a continuous response variable, such as annual income, describe how its distribution changes according to the levels of the explanatory variables. Models for a categorical response variable describe how the response categories are influenced by the explanatory variables.
 The explanatory variables can be categorical or continuous.
 The response variable is sometimes called the dependent variable or Y variable, and the explanatory variable is sometimes called the independent variable or X variable.

Variables…

Data
 Data (singular: datum)
 Data are numbers obtained by measuring, counting or observing.
 The raw material for statistics
 Numerical descriptions of things
 Raw facts from which information is extracted
 Raw data: data collected in their original form.

DATA ⇒ INFORMATION ⇒ KNOWLEDGE ⇒ WISDOM

Types of Data
1. Primary data: Collected directly from the items or individual respondents by the investigator for the purpose of a study.
 Original, first-hand information
 Unorganized or raw data
2. Secondary data: Data that were collected, and often statistically treated, by other people or organizations, and are then used by someone else for a different purpose.

NB. The same data can be primary for one person and secondary for another.

Sources of Data
 Routine health facility data (hospital, health center, clinic, health posts)
 Routinely kept records, reports
 Literature (published and unpublished)
 Disease notifications, Epidemic reports
 Census, Civil or vital registration
 Laboratories
 Surveys
 Prospective and experimental studies …etc.

Measurement Scales
 Measurement: A procedure where qualities or quantities are assigned to the
characteristics of subjects, objects or events, which can be compared as well.
 There are four types of data/ scales of measurements.
1. Nominal data
2. Ordinal data
3. Interval data
4. Ratio data

1. Nominal data/scale:-
 The values assigned to variables are used to identify category.
 The simplest type of data, where the measurement of a variable involves the naming
or categorization of possible values of the variable
 The values fall into unordered categories or classes
 Mutually exclusive categories
 Uses names, labels, or symbols to assign each measurement.
 Examples: Blood type, sex, race, marital status, religion, cause of illness, cause
of death, etc.

2. Ordinal scale:
 The values assigned to variables are used to identify category and show magnitude.
 Assigns each measurement to one of a limited number of categories that are ranked
in terms of order.
 Although non‐numerical, can be considered to have a natural ordering. e.g. cancer
stages, social class, etc.
 The spaces or intervals between the categories are not necessarily equal. For example:
1. strongly agree
2. agree
3. no opinion
4. disagree
5. strongly disagree
3. Interval scale:
 The values assigned to variables identify category, show magnitude and have equal intervals.
 In interval data the intervals between values are the same. For example,
 In the Fahrenheit temperature scale, the difference between 70 degrees and 71 degrees is the same as the difference between 32 and 33 degrees.
 But 40 degrees Fahrenheit is not twice as hot as 20 degrees Fahrenheit.
 It has no true zero point. “0” is arbitrarily chosen and does not reflect the absence of temperature.
 E.g., temperature (°C or °F), intelligence (IQ), calendar time (year), etc.

4. Ratio scale
 The values assigned to variables identify category, show magnitude, have equal intervals and begin at a true zero point.
 Ratio data have meaningful ratios. For example, age is ratio data: someone who is 40 is twice as old as someone who is 20.
 Measurement begins at a true zero point and the scale has equal spacing.
 Examples: Height, age, weight, etc.
 The highest scale of measurement

Degree of precision in measuring

Nominal → Ordinal → Interval → Ratio (increasing precision)

 Both interval and ratio data involve measurement.
 Most data analysis techniques that apply to ratio data also apply to interval data.
 Therefore, in most practical aspects, these types of data (interval & ratio) are grouped under metric data.
 For interval or ratio data, the mean and standard deviation are appropriate, provided the data are not too skewed.
 In some other instances, these types of data are also known as numerical discrete and numerical continuous.
Descriptive statistics for summarising data
 In a research process, after data is
collected and processed, the next step is
to analyse the data you have collected.
 When data are of quantitative nature,
analysis involves both looking at your
data graphically to see what the general
trends in the data are, and also fitting
statistical models to the data.

Frequency Tabulation and Distributions
 Frequency tabulation provides a convenient counting summary for a set of data that facilitates interpretation of various aspects of those data.
 Its purpose is to produce an efficient counting summary of a sample of data points for ease of interpretation.
 A variable at any level of measurement can be summarised this way.
 The display of a frequency tabulation is often referred to as the frequency distribution.
 For each value of a variable, the frequency of its occurrence is reported.
 It is possible to compute percent (relative frequency) and percentile values from a frequency distribution.
 If a variable has many different scores, it may be more useful to tabulate frequencies on the basis of intervals of scores.
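As a rough illustration of frequency tabulation, the sketch below builds a frequency table with percent and cumulative percent columns. The data (number of clinic visits per person) and the helper name `frequency_table` are hypothetical, chosen only for the example.

```python
from collections import Counter

# Hypothetical data: number of clinic visits for 10 people (illustrative only).
visits = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1]

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * cumulative / n))
    return rows

table = frequency_table(visits)
print("value  freq  percent  cum. percent")
for value, freq, pct, cum_pct in table:
    print(f"{value:>5} {freq:>5} {pct:>7.1f}% {cum_pct:>12.1f}%")
```

The cumulative percent column is what allows percentile values to be read off the table; it is only meaningful when the values have a natural order (ordinal, interval or ratio data).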
Frequency Tabulation
 Statistical Tables
 A statistical table is an orderly and systematic presentation of numerical data in rows and columns.
 Rows (stubs) are horizontal and columns (captions) are vertical arrangements.
 The use of tables for organizing data involves:
 Grouping the data into mutually exclusive categories of the variables and
 Counting the number of occurrences (frequency) in each category.

Frequency Distribution
Features of a distribution are center, size, position and shape.
 Normal Distribution
 Ideally, data would be distributed symmetrically around the center.
 Bell shaped, with the majority of observations lying around the center; that is, the frequency of occurrence decreases as one moves away from the center.
 The normal distribution is important because it is the form of distribution that is assumed to describe the scores of a variable in the population when parametric tests of statistical inference are undertaken.
 The curve drawn on the frequency distribution (histogram) shows the ideal distribution (normal distribution).

The standard normal distribution is defined as having a population mean of 0.0 and a population
standard deviation of 1.0.
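
A short sketch of what this definition implies in practice: using Python's standard-library `NormalDist`, it computes the share of a standard normal distribution lying within 1, 2 and 3 standard deviations of the mean. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0.0, standard deviation 1.0.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a standard normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

Roughly 68%, 95% and 99.7% of observations fall within 1, 2 and 3 standard deviations, which is why values far from the mean are rare under normality.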

Frequency Distribution…
 Skewed Distributions (Skewness)
 Extremely low or extremely high observations are present in the distribution.
 Lack of symmetry
 Right- or left-tailed distributions
 The most frequent observations are clustered at one end of the scale

Frequency Distribution…
 Kurtosis
 Degree of clustering
 Peakedness (clustering around the center)
 Flatness (more observations toward the tails)

Summary Measures
 Measures of Central Tendency (MCT):
 Provide numerical summary measures that give an indication of the central,
average or typical score in a distribution of scores for a variable
 The three most commonly reported MCT are the mean, median and mode.
 One very important feature of the mean is that it uses every score (the mode
and median ignore most of the scores in a data set).
 The mean tends to be stable in different samples.

Based on shape, a distribution can be:
a) Negatively skewed distribution: Occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.
b) Positively skewed distribution: Occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
c) Symmetrical distribution: Neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.

NB. Use the next example data to show this using software!
 When the distribution is skewed, the median describes the majority of observations better than the mean.
 Example
 Data: 14, 89, 93, 95, 96
 Skewness is reflected in the outlying low value of 14
 The sample mean is 77.4
 The median is 93
 When the data are skewed, the mean is “dragged” in the direction of the skewness

 In extreme cases, all but one of the sample points can lie on one side of the arithmetic mean; the mean is then a poor measure of central location and does not reflect the center of the sample.
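The example can be reproduced with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the comparison without the low value of 14 is added only to show how much a single observation moves the mean.

```python
import statistics

scores = [14, 89, 93, 95, 96]

mean_all = statistics.mean(scores)
median_all = statistics.median(scores)

# Same data without the outlying low value, for comparison.
trimmed = [x for x in scores if x != 14]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with 14:    mean = {mean_all}, median = {median_all}")
print(f"without 14: mean = {mean_trimmed}, median = {median_trimmed}")
```

Dropping the single low value shifts the mean from 77.4 to 93.25 but the median only from 93 to 94, which illustrates why the median is preferred for skewed data.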
 Measures of dispersion/variation/spread/scatter:
 Give an indication of the degree of spread in a sample
of scores; that is, how different the scores tend to be
from each other with respect to a specific MCT.
 Two or more sets may have the same mean and/or
median but they may be quite different in their spread.
 MCT do not describe the variability or spread of the values.
 Consider the following two sets of data:
 A: 177 193 195 209 226 Mean = 200
 B: 192 197 200 202 209 Mean = 200
 These two distributions have the same mean (their medians are 195 and 200), but they differ in their dispersion.

NB. Use software to show this!
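
One way to show this with software is sketched below in Python. It computes the range, sample variance, sample standard deviation and coefficient of variation for sets A and B; the helper name `describe_spread` is illustrative, and the variance and SD use the sample (n − 1) formulas.

```python
import statistics

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

def describe_spread(values):
    """Return the mean and common measures of dispersion for a sample."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD (divides by n - 1)
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "sd": sd,
        "cv_percent": 100 * sd / mean,     # coefficient of variation
    }

a = describe_spread(set_a)
b = describe_spread(set_b)
for name, summary in (("A", a), ("B", b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both sets have a mean of 200, but set A is much more spread out than set B on every measure of dispersion.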


Measures of dispersion/variation/spread/scatter…
 There are a variety of measures of variability to choose from including the
range, interquartile range, variance, standard deviation, and coefficient of
Variation (CV).
 The easiest way to look at dispersion is to use the range of scores.
 The range is not explicitly associated with any measure of central tendency.
 Variance is difficult to interpret, thus standard deviation is usually used to
quantify the spread of data in conjunction with reports of sample mean.

Measures of dispersion/variation/spread/scatter…
 The Interquartile Range (IQR) is a measure of variability that is specifically designed to be used in conjunction with the median.
 IQR is the difference between the 75th and 25th percentiles; thus it indicates the spread of the middle 50% of the observations, which lie around the median.
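
A minimal sketch of the IQR calculation, using hypothetical data with one very large value. Note that software packages use slightly different rules for interpolating quartiles, so results may differ a little from this one; here Python's `statistics.quantiles` is used with its default ("exclusive") method.

```python
import statistics

# Hypothetical data with one very large value (illustrative only).
values = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 40]

# n=4 splits the data into quartiles: returns [Q1, Q2 (median), Q3].
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
value_range = max(values) - min(values)

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}, range = {value_range}")
```

The range (38) is dominated by the single value of 40, while the IQR (8) describes the spread of the middle half of the data and is unaffected by it.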

Summary Measures…
 Other measures of location (Percentiles and Z-scores):
 Percentiles are measures of relative standing of observations.
 Commonly used percentiles:
• 10, 20, ….. 90% (deciles)
• 20, 40, ….. 80% (quintiles)
• 25, 50, 75% (quartiles)
• 33.3, 66.7% (tertiles)

 Z-scores (standard scores) tell the position of an observation relative to the mean in standard deviation units: z = (x − mean) / SD.
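
As an illustration, z-scores for a small sample can be computed as follows. This sketch assumes the sample mean and sample standard deviation are used for standardization; the helper name `z_scores` and the data are illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values: distance from the mean in SD units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return [(x - mean) / sd for x in values]

data = [192, 197, 200, 202, 209]
z = z_scores(data)
for x, score in zip(data, z):
    print(f"x = {x}: z = {score:+.2f}")
```

An observation equal to the mean has z = 0; positive z-scores lie above the mean and negative ones below it.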

 Software Procedures for Descriptive statistics (univariate analysis):
 A proper analysis of data must begin with an analysis of the statistical
attributes of each variable in univariate analysis.
From such an analysis we can learn:
• how the values of a variable are distributed:
 Normal, binomial, etc.
• the central tendency of the values of a variable:
 Mean, median, and mode
• dispersion of the values:
 Standard deviation, variance, range, and quartiles
• presence of outliers (extreme values)
• if a statistical attribute (e.g. mean) of a variable equals a hypothesized value
Software Procedures for Descriptive statistics…
 Examining Summary Statistics for Individual Variables
 Categorical: Also referred to as qualitative data.
 Categorical variables can be string (alphanumeric) data or numeric variables that use
numeric codes to represent categories (for example, 0 = Unmarried and 1 = Married).
 There are two basic types of categorical data:
• Nominal; Categorical data where there is no inherent order to the categories.
• Ordinal; Categorical data where there is a meaningful order of categories, but there is not a
measurable distance between categories.
 Scale: Also referred to as quantitative or continuous data.
 Data measured on an interval or ratio scale, where the data values indicate both the
order of values and the distance between values.
1. Summary Measures for Categorical Data:
1.1. Frequencies for Categorical Data
 For categorical data, the most typical summary measure is the number or percentage of cases in each category.
 The mode is the category with the greatest number of cases. For ordinal data, the median (the value at which half of the cases fall above and half below) may also be a useful summary measure if there is a large number of categories.
 The Frequencies procedure produces frequency tables that display both the number and percentage of cases for each observed value of a variable.
To run the procedure: Analyze → Descriptive Statistics → Frequencies
1. Summary Measures for Categorical Data…

1.2. Charts for Categorical Data

 You can graphically display the information in a frequency table with a bar chart or pie chart.
To run the procedure: proceed as for the frequency table: Analyze → Descriptive Statistics → Frequencies → click Charts → select Bar charts or Pie charts and then click Continue → click OK in the main dialog box.
2. Summary Measures for Scale Data
2.1. Measures of central tendency and dispersion.
To run both procedures: proceed as for the frequency table: Analyze → Descriptive Statistics → Frequencies → click Reset to clear any previous settings → select the variable(s) and move them into the Variable(s) list → click Statistics → select Mean, Median, Std. deviation, Minimum, and/or Maximum and then click Continue → click OK in the main dialog box.
 NB. Deselect “Display frequency tables” in the main dialog box, because frequency tables are usually not useful for scale variables: there may be almost as many distinct values as there are cases in the data file.

The Frequencies Statistics table shows a large difference between the mean and the median, suggesting that the values are not normally distributed. You can also check the distribution visually with a histogram.
2. Summary Measures for Scale Data…
2.2. Charts for Scale Variables.
 The difference between the mean and the median displayed in the previous Frequencies Statistics table can be checked visually using a histogram.
To run the procedure: proceed as for the frequency table: Analyze → Descriptive Statistics → Frequencies → click Charts → select Histograms with the normal curve shown and then click Continue → click OK in the main dialog box.

 The majority of cases are clustered at the lower end of the scale, with most falling below 100,000. There are, however, a few cases in the 500,000 range and beyond (too few to even be visible without modifying the histogram). These high values for only a few cases have a large effect on the mean but little or no effect on the median, making the median a better indicator of central tendency in this example.
3. Checking the nature of the distribution of scale data

 The distributional assumptions of the planned analysis should be checked for continuous dependent variables.
 These continuous variables should be checked for a symmetrical (approximately normal) distribution.
 If they are not, many parametric methods of analysis are not appropriate and non-parametric methods should be considered.
3.1. Testing for normality using Explore:
 Since histograms from the Frequencies procedure provide only a rough visual idea of the distribution of a variable, Explore is the easiest way to obtain the data summaries together with normality checks.
 Therefore: Analyze → Descriptive Statistics → Explore…
 Under Explore: click “Plots” and select “Normality plots with tests”.
 The output includes:
 “Q-Q plot”: A visual check; on its own it may not be sufficient for determining whether a variable is normally distributed.
 “Kolmogorov-Smirnov”: A formal test of normality; a significant result indicates that the variable departs from a normal distribution.
 “Boxplots”:
o The spread of the values can be depicted using boxplots.
o A boxplot provides the median, quartiles, and range.
o It also provides information on outliers.


3.1. Testing for normality using explore…

[Screenshot: Explore dialog. Under Plots, tick “Normality plots with tests”.]
3.1. Testing for normality using explore…

[Figure: Normal Q-Q plots of “verbal fluency - animal naming sc” and “age in years”. Horizontal axis: Observed Value; vertical axis: Expected Normal.]

 Normal Q-Q plot:
 If the data are normally distributed, the red dots should lie on the straight diagonal line.
 If the accompanying normality test is significant, the variable is not normally distributed.
3.1. Testing for normality using explore…

 The boxplot also shows many outliers, suggesting that the data are not normally distributed.

[Figure: boxplot of “verbal fluency - ani” (N = 1441), with many cases flagged as outliers.]


3.2. Numerical ways of detecting non-normality (in addition to the visual ways)
 Outlier:
1. An outlying value is a value x such that either
• x > upper quartile + 1.5 × IQR OR x < lower quartile − 1.5 × IQR
2. An extreme outlying value is a value x such that either
• x > upper quartile + 3.0 × IQR OR x < lower quartile − 3.0 × IQR
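The two rules above can be sketched in Python as follows. The data and the helper name `classify_outliers` are hypothetical; quartiles come from `statistics.quantiles` with its default method, so the fences may differ slightly from those of other software.

```python
import statistics

def classify_outliers(values):
    """Split values into outlying (1.5 x IQR) and extreme (3.0 x IQR) cases."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    result = {"outlying": [], "extreme": []}
    for x in values:
        if x > q3 + 3.0 * iqr or x < q1 - 3.0 * iqr:
            result["extreme"].append(x)      # beyond the 3.0 x IQR fences
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            result["outlying"].append(x)     # beyond the 1.5 x IQR fences only
    return result

# Hypothetical data (illustrative only).
values = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30, 40]
flags = classify_outliers(values)
print(flags)
```

Here Q1 = 4 and Q3 = 12, so the 1.5 × IQR fences are −8 and 24 and the 3.0 × IQR fences are −20 and 36; the value 30 is outlying and 40 is an extreme outlying value.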
 Skewness (asymmetric distribution tails) and kurtosis (peakedness of the distribution).
 For a normal distribution, the skewness and kurtosis statistics (kurtosis being reported as excess kurtosis) are both zero: skewness = 0 means a symmetric distribution; kurtosis = 0 means a mesokurtic distribution. A value of skewness > 0 indicates positive skewness and skewness < 0 indicates negative skewness.
OR
 Taking values from the SPSS output, calculate the interval
skewness statistic ± 2 × std. error of the skewness statistic
(its low and high endpoints). Then,
 If zero falls inside this interval, there is no evidence of skewness (consistent with normality).
 If zero falls outside the interval, you likely have an issue with non-normality.
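The interval check can also be sketched outside SPSS. The code below assumes the commonly used adjusted sample skewness and its standard error, sqrt(6n(n − 1) / ((n − 2)(n + 1)(n + 3))); these are assumed here to correspond to the values SPSS prints, and the function names and data are illustrative.

```python
import math
import statistics

def skewness_with_se(values):
    """Adjusted sample skewness and its standard error (assumed formulas)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, se

def zero_inside_interval(values):
    """True if 0 lies within skewness +/- 2 standard errors."""
    g1, se = skewness_with_se(values)
    return g1 - 2 * se <= 0 <= g1 + 2 * se

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 50]

print("symmetric data, zero inside interval:", zero_inside_interval(symmetric))
print("skewed data, zero inside interval:   ", zero_inside_interval(right_skewed))
```

With very small samples the standard error is large, so the interval is wide and the check has little power; it is best read alongside the histogram and Q-Q plot.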
 Transforming a variable to make it normally distributed:

 When a variable is non-normal (typically positive and right-skewed), the log of the variable may be normally distributed.
 To do this:
 Go to Analyze → Descriptive Statistics → Q-Q Plots.
 Place the variable into the “Variables” box.
 On the right, choose “Normal” in the “Test Distribution” box.
 In the “Transform” options area, choose “Natural log transform”.
 Click “OK”.
 Then check the transformed variable for normality again: it looks normal in the chart if the dotted curve coincides with the straight line.
OR:
 Generate a new variable using LN(var).
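
The effect of a natural log transformation can be illustrated with a short Python sketch. The data are hypothetical and strictly positive (the log is undefined for zero or negative values); the `skewness` helper uses the adjusted sample skewness formula as an assumption.

```python
import math
import statistics

def skewness(values):
    """Adjusted sample skewness (assumed formula, for illustration)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

# Hypothetical right-skewed, strictly positive data (illustrative only).
raw = [1, 2, 2, 3, 4, 5, 8, 13, 30, 100]

# Natural log transform, the equivalent of LN(var).
logged = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before transform: {skew_raw:.2f}")
print(f"skewness after transform:  {skew_log:.2f}")
```

The transformed values are noticeably less skewed than the raw values, although the transformation does not guarantee normality; the transformed variable should still be checked.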
