1. Introduction and Descriptive Stats

The document outlines a course on Advanced Biostatistics offered by the College of Health and Medical Science, detailing its objectives, course content, and methods of data collection. It emphasizes the importance of statistics in understanding health-related data and introduces key concepts such as types of variables, levels of measurement, and data collection methods. Additionally, it discusses the design and implementation of questionnaires for effective data gathering in research.

Uploaded by

Abas Ahmed

College of Health and Medical Science
Department of Epidemiology and Biostatistics
Introduction and descriptive statistics
By Adisu B. (MPH, Assistant professor)
Course Description
• Course Name: Advanced Biostatistics

Course code
• COMH513 for Adult Health Nursing
• Pubh 614 for Maternity and Neonatal Nursing
• COMH513 for Paediatrics and Child Health Nursing
• PuHe5022 for MSc in Midwifery
• Credit Hours: 3
COURSE CONTENTS
• Introduction
• Methods of data collection
• Descriptive statistics
• Basic probability concepts
• Sampling techniques and sampling distribution
• Inference
• Sample size determination
• Statistical software (EPIDATA, Epi Info, STATA)
• Correlation and Linear regression
• Logistic regression
Introduction
Objectives of the section
At the end of this section students will be able to:
• Understand the rationale of statistics
• Be familiar with some frequently used concepts/terms in biostatistics
• Identify the different types of variables
• Understand the different types of data collection methods
• Be familiar with methods of data organization, presentation and summarization
Basic Concepts of Statistics and Biostatistics
• Statistical data: numerical descriptions of things. These may take the form of counts or measurements.
• Statistics is always about numerical description.
• Statistical methods: a body of methods used for collecting, organizing, analyzing and interpreting numerical data in order to understand a phenomenon or make wise decisions.
Some basic concepts ….
• Statistics is the science of gaining information from data in the presence of variation.
• Statistics is a scientific discipline: the science of gathering and describing data and subsequently drawing conclusions (inferences) from data.
Some basic concepts ….
 The tools of statistics are employed in many fields
(e.g. health, agriculture, business, economics,
education, psychology etc.)
 When the data being analyzed are derived from
the biological sciences, medicine or public
health, we use the term biostatistics to
distinguish this particular application of statistical
tools and concepts from the others
Rationale for studying Statistics
• Statistics provides a way of organizing information
• There is a great deal of inherent variation in most biological processes
• The planning, conduct, and interpretation of much of medical research depend on statistical methods
• Public health and medicine are quantitative in nature
• For the projection of health trends
• For the analysis of health and health-related data
Some basic concepts ….
Data: numerical facts or figures which are raw or unprocessed
Data → information → knowledge → wisdom
Statistics is a tool to convert data into useful information

Variable: a characteristic which takes different values in different persons, places, or objects (e.g., BP, age, sex, birth weight, etc.)
Types of variables
• In broad terms we have two types of variables: quantitative and qualitative.
1. Quantitative variable:
Information is measured by assigning numbers (e.g. age, BP, weight).
• We can further divide quantitative variables into two:
a. Discrete data: restricted to taking only specified values, often integers or counts, that differ by fixed amounts.
e.g. Number of new AIDS cases reported during a one-year period
b. Continuous data: represent measurable quantities but are not restricted to taking on certain specific values, i.e. fractional values are possible e.g.
Types of variables

2. Qualitative (categorical) variable:
• Information is measured by assigning names to items (events) according to a set of rules, which results in different types of data.
• E.g. gender, blood group, marital status
Levels of Measurement
The level of measurement determines which statistical
calculations are meaningful.
The four levels of measurement are: nominal, ordinal,
interval, and ratio.

Levels of measurement, from lowest to highest: Nominal, Ordinal, Interval, Ratio
Identify level of measurement
• Age
• FBS
• creatinine
• Viral load
• Educational status
• Level of anemia
• Temperature
• Marital status
Methods of Data Collection
Data and measurement
• Data are the raw material for statistics
• We may define data as figures which can be obtained from the process of measurement or by counting
Example:
• When a hospital administrator counts the number of patients (counting)
• When a nurse weighs a patient (measurement)

Source of data
• Goal: choose the source that gives data closest to the
“gold standard” while being feasible to collect.
• Such data are available from one or more of the
following sources:
1. Routinely kept records, e.g. hospital records
2. External sources:
– Published reports
– The research literature, i.e. someone else has already asked the same question
3. Survey: if the data are needed to answer certain questions
4. Experiment, etc.
Type of data based on source
1. Primary data: collected directly from the items or individual respondents by the researcher for the purpose of a particular study.
2. Secondary data: collected and statistically treated by other people or agencies, and then used for a purpose other than the one for which they were originally collected.
Secondary data can be obtained from journals, reports, government publications, and publications of professionals and research organizations.
Primary vs Secondary Data

Primary Data | Secondary Data
Real-time data | Past data
Sure about sources of data | Not sure about sources of data
Costly and time-consuming process | Cheap and not time-consuming
Response bias can be avoided | Bias in the data cannot be known
More flexible | Less flexible
Data collection
- It is the stage of a study that comes after proposal development and before analysis.
- Data collection means obtaining useful information.
- It allows us to collect information systematically.
- Haphazardly collected data will make it difficult to answer our research question.
Data collection choice

• What you must ask yourself:
– Will the data answer my research question?
• To answer that:
– You must first decide what your research question is
– Then you need to decide what data/variables are needed to scientifically answer the question
– Study subject (level of education)
Data collection choice

• If the data exist in secondary form, then use them to the extent you can, keeping their limitations in mind.
• But if they do not, and you are able to fund primary collection, then primary collection is the method of choice.
Data collection methods…

The choice of methods of data collection is based on:
♣ The accuracy of the information they will yield
♣ Practical considerations, such as the need for personnel, time, equipment and other facilities
♣ The types of data/information to be collected
Data collection method…
Types of data | Methods of data collection
Qualitative | FGD (focus group discussion); in-depth interview; observation
Quantitative | Questionnaires (open/closed, structured/unstructured, self/interviewer administered); observation; use of documentary sources
1. Observation
• Observation is a technique that involves systematically selecting, watching and recording behaviors of people or other phenomena and aspects of the setting in which they occur.
• Participant observation: the researcher actively participates in the activities of the group being studied to gain deeper insight into the context.
• Non-participant observation: the researcher observes from the outside without actively engaging.
Observation...
Structured observation:
• Uses a predetermined checklist or observation guide to record specific behaviors or events.
Unstructured observation:
• Allows more flexibility in recording observations without a strict framework, capturing a wider range of information.
Advantages:
• Gives relatively more accurate data on behavior and activities.
Disadvantages:
• The investigator's or observer's own biases, desires, etc. can influence what is recorded.
• Needs more resources and skilled manpower when sophisticated equipment is used.
2. Interview and self-administered questionnaires
Probably the most commonly used research data collection techniques.
There are three types of interview:
• The direct personal interview (the most common)
• The telephone interview
• The postal or mail interview
Advantages of self-administered questionnaires:
• Simpler and cheaper
• Can be administered to many persons simultaneously
Disadvantages:
• Demand a certain level of education and skill on the part of the respondents
• People of a low socio-economic status are less likely
Face-to-face interviews (Advantages)
• A good interviewer can stimulate and maintain the respondent's interest
• Can create an atmosphere conducive to answering the questions
• A serious approach by the respondent, resulting in accurate information
• Good response rate
• Complete data can be obtained
Advantages…
 Possible in-depth questions
 Interviewer can give help if there is a problem i.e.
Provide an explanation or alternative wording
 Can investigate motives and feelings
 Can use recording equipment
Disadvantages:
 Need to set up interviews
 Time consuming
 Geographic limitations
 Can be expensive
 Respondent bias
3. Documentary source
These sources include:
1. Official publications of the Central Statistical Authority
2. Publications of the Ministry of Health and other ministries
3. Newspapers and journals
4. International publications, such as those by WHO, World Bank and UNICEF
5. Records of hospitals or any other health institutions
Designing Questionnaire
A questionnaire is a research instrument used to gather information from respondents.
Requirements of questions
• Must be clear
• Must have validity
• Must be unambiguous
• Must not be offensive
• Should be fair
Avoid questions like:
• Questions which frighten the respondents
• Questions which may anger the informants
• Leading questions
• Have you stopped smoking?
• Is your married life happy?
• Such types of questions should be avoided (or asked in a tactful manner)
Type of Questions
Depending on how questions are asked and recorded, there are two types of questions:
• Open-ended questions
• Closed-ended questions
Open-ended questions
• The respondent records his/her answers or suggestions freely in his/her own words.
• The respondent is free to give any type of response.
• There are no pre-set answers to choose from.
E.g. Can you describe why most of the children in this village develop diarrhoea?
Open-ended questions are useful to obtain information on:
• Facts with which the researcher is not very familiar
• Opinions, attitudes, and suggestions of informants, or
• Sensitive issues
Closed-ended questions
• Give a list of possible options which are exhaustive and mutually exclusive; they are also called fixed-choice questions.
• Closed questions are useful if the range of possible responses is known.
E.g. Do you have children now? 1. Yes 2. No
E.g. What is your marital status? A. Single B. Married C. Separated/divorced D. Widowed
Designing a Questionnaire
 Designing a good questionnaire always takes several drafts.
 In the first draft we should concentrate on the content.
 In the second, we should look critically at the formulation and
sequencing of the questions.
 Then we should scrutinize the format of the questionnaire.
 Finally, we should do a test-run to check whether the questionnaire
gives us the information we require and whether both the
respondents and we feel at ease with it.
 Usually the questionnaire will need some further adaptation before
we can use it for actual data collection
Steps in Questionnaire design
Step 1: CONTENT
Take your objectives and variables as your starting
point.
Decide on what questions will be needed to measure
or to define your variables and reach your objectives.
When developing the questionnaire, you should
reconsider the variables you have chosen, and, if
necessary, add, drop or change some.
Step 2: FORMULATING QUESTIONS
Formulate one or more questions that will provide the
information needed for each variable.
Take care that questions are specific and precise
enough that different respondents do not interpret
them differently.
Avoid leading questions.
Step 3: SEQUENCING OF QUESTIONS
Design your interview schedule or questionnaire to be
“consumer friendly.”
The sequence of questions must be logical for the
respondent
At the beginning of the interview, keep questions
concerning “background variables” (e.g., age,
religion, education, marital status, or occupation) to a
minimum.
If possible, pose most or all of these questions later in
 Start with an interesting but non-controversial
question (preferably open) that is directly related to
the subject of the study.
 This type of beginning should help to raise the
informants’ interest and lessen suspicions concerning
the purpose of the interview (e.g., that it will be used
to provide information to use in levying taxes).
 Pose more sensitive questions as late as possible in
the interview (e.g., questions pertaining to income,
sexual behavior, or diseases with stigma attached
to them).
Step 4: FORMATTING THE QUESTIONNAIRE
When you finalize your questionnaire, be sure that:
• Each questionnaire has a heading and space to insert the number, date and location of the interview.
• Questions belonging together appear together visually.
• If the questionnaire is long, you may use subheadings for groups of questions.
• Sufficient space is provided for answers to open-ended questions.
Step 5: TRANSLATION
If the interview will be conducted in one or more local languages, the questionnaire has to be translated to standardize the way questions will be asked.
After having it translated you should have it retranslated into the original language.
You can then compare the two versions for differences.
Problems in gathering data
Common problems might include:
 Language barriers
 Lack of adequate time
 Expense
 Inadequately trained and experienced staff
 Invasion of privacy
 Suspicion (mistrust)
 Cultural norms
Pre-test vs. pilot study
What is a pre-test or pilot study of the methodology?
• A PRE-TEST usually refers to a small-scale trial of particular research components.
• A PILOT STUDY is the process of carrying out a preliminary study, going through the entire research procedure with a small sample.
Pre-test vs. pilot study cont…
WHY do we carry out a pre-test or pilot study?
• A pre-test or pilot study serves as a trial run that allows us to identify potential problems in the proposed study.
• A pre-test and/or pilot study enables us, if necessary, to revise the methods and logistics of data collection before starting the actual fieldwork.
• As a result, a good deal of time, effort and money can be saved in the long run.
• Pre-testing is simpler, less time-consuming and less costly than a pilot study.
WHAT aspects can be evaluated during pre-testing?
1. Reactions of the respondents to the research procedures
2. The data-collection tools
– Whether the sequence of questions is logical
– Whether the wording of the questions is clear
– Whether translations are accurate
– Whether space for answers is sufficient
– Whether there is a need to pre-categorise some answers or to change closed questions into open-ended questions
3. Sampling procedures
– Whether the instructions concerning how to select the sample can be followed
– How much time is needed to locate individuals to be included in the study
4. The proposed work plan and budget for research activities
• When do we carry out a pre-test?
– Pre-test at least your data collection tools, either during the workshop or, if that is impossible, immediately thereafter in the actual field situation.
– Pre-test the data collection and data-analysis process 1-2 weeks before starting the fieldwork.
• Who should be involved in the pre-test or pilot study?
– The research team, headed by the principal investigator.
– Any additional research assistants or data collectors that have been recruited.
Descriptive statistics
Introduction
• The aim of nearly all studies is to extrapolate from observations made on a sample of individuals to the population as a whole.
• The main components of a statistical system are shown in the following figure.
[Figure: Biostatistics divides into two branches.
Descriptive statistics: collection, organizing, summarizing and presenting of data.
Inferential statistics: making inferences, hypothesis testing, determining relationships, making predictions.]
Descriptive Statistics
• In descriptive analysis there are three major characteristics
of a single variable that we tend to look at:
• the distribution: The simplest distribution would list
every value of a variable and the number of persons
who had each value. E.g. frequency distribution
• the central tendency: an estimate of the "center" of
a distribution of values.
• the dispersion: the spread of the values around the
central tendency.
• Descriptive statistics do not involve generalizing beyond the data at hand.
Methods of data organization, presentation and
summarization
- Tables
- Graphs
- Numerical summary measures (NSM)
Methods of data organization and presentation

 Numbers that have not been summarized and organized are called raw
data. i.e. The data collected in a survey/study
 In most cases, useful information is not immediately evident from the
mass of unsorted data/raw data.
• Statistics is used to organize and interpret research observations and
findings
• Before interpretation & communication of the findings, the raw data
must be organized and presented in a clear and understandable way
1. Frequency Distribution
• The arrangement of a data set in a table showing the values and their corresponding frequencies of occurrence.
• Frequency: the number of times the same value occurs within a data set.
• Consists of the set of classes or categories along with the numerical count (frequency) in each.
• The actual summarization and organization of data starts from the frequency distribution.
• The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data.
Cumulative Frequency:
• When the frequencies of two or more classes are added up, the resulting totals are called cumulative frequencies.
• These frequencies help us to find the total number of items whose values are less than or greater than some value.
Relative frequencies
• Express the frequency of each value or class as a proportion or percentage of the total frequency.
A. Frequency distributions for categorical variables
• Count the number of observations (frequency) in each category and present them as relative frequencies (percentages)
• Often presented in the form of tables, bar charts and pie charts
Example 2
• n=118 female patients were diagnosed with regard to depressive illness as “no depression”, “mild depression”, “moderate depression” and “severe depression.”
• The observed absolute and relative frequencies are shown in a one-way table:

Depression | Frequency | Relative frequency (%)
None | 26 | 23.6
Mild | 67 | 60.9
Moderate | 17 | 15.5
Total | 110 | 100.0
B. Frequency distribution for numerical variables
• Shows the frequencies at different values or within certain ranges
• For a discrete variable, the frequencies may be tabulated either for each value of the variable or for groups of values
• With continuous variables, groups (class intervals) have to be formed
• The values are grouped into distinct, non-overlapping intervals, usually of equal width, so that each value can be placed in one, and only one, of the intervals.
Frequency distribution Tables
• The blood types of 30 patients were given as follows:
A AB B B A O O AB AB B O A A B B A AB A O AB B AB AB O A AB AB O A O

Solution
Type | Frequency | Relative frequency
A | 8 | 0.267
B | 6 | 0.20
AB | 9 | 0.30
O | 7 | 0.233
Total | 30 | 1.00

• Grouped frequency distribution of body mass index data for 120 participants from the 1998 National Health Interview Survey

Class interval for BMI levels | Frequency (f)
18 – 20 | 4
21 – 23 | 24
24 – 26 | 28
27 – 29 | 27
30 – 32 | 18
33 – 35 | 6
36 – 38 | 8
39 – 41 | 5
Total | 120
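The blood-type tally above can be checked with a few lines of Python. This is only a minimal sketch: the variable names (`blood_types`, `counts`) are illustrative and not part of the course material, and the data are the 30 values listed in the example.

```python
from collections import Counter

# Blood types of the 30 patients, as listed in the example above.
blood_types = (
    "A AB B "
    "B A O O AB AB B O A A "
    "B B A AB A O AB B AB AB "
    "O A AB AB O A O"
).split()

counts = Counter(blood_types)   # absolute frequency of each blood type
n = sum(counts.values())        # total number of patients

print(f"{'Type':<6}{'Frequency':>10}{'Relative frequency':>20}")
for blood_type in ["A", "B", "AB", "O"]:
    frequency = counts[blood_type]
    # Relative frequency = frequency divided by the total frequency.
    print(f"{blood_type:<6}{frequency:>10}{frequency / n:>20.3f}")
print(f"{'Total':<6}{n:>10}{1.0:>20.2f}")
```

The same pattern applies to any categorical variable: count each category, then divide by the total to obtain relative frequencies.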
Charts/graphs
 The frequency distribution of a categorical variable is often presented
graphically as a bar chart or pie chart.

Bar charts
Display the frequency distribution for nominal or ordinal data.
 In a bar chart the various categories into which the observation fall are
represented along horizontal axis and a vertical bar is drawn above
each category such that the height of the bar represents either the
frequency or the relative frequency of observation within the class.

 The bars should be of equal width and should be separated from one
another so as not to imply continuity
In order to compare the distribution of a variable for two or more groups, bars are often drawn alongside each other for the groups being compared in a single bar chart.

Figure 1. Bar chart showing the frequency distribution of the variable ‘BWT’.
[Figure: grouped bar chart; horizontal axis: BWT (Low, Normal, Big); vertical axis: Percent; legend: Yes, No]
Fig 2. Bar chart indicating categories of birth weight of 9975 newborns grouped by antenatal follow-up of the mothers
Pie chart
• Displays the frequency distribution for nominal or ordinal data.
• In a pie chart the categories into which the observations fall are represented as sectors of a circle; the angle of each sector is proportional to the frequency or relative frequency of observations in that category.
Fig 3(a) Pie chart indicating frequency of categories of birth weight
[Figure: pie chart with sectors for Very low, Low, Normal and Big birth weight; labelled counts 43, 793, 268 and 8870]
Histogram
• Histograms are frequency distributions with continuous class intervals that have been turned into graphs.
• To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.
• Non-overlapping intervals that cover all of the data values must be used.
• Bars are drawn over the intervals in such a way that the areas of the bars are proportional to their interval frequencies.
• When all intervals have the same width, the height of each bar is therefore proportional to the frequency of observations in the interval.
Example: Distribution of the age of women at the time of marriage

Age group | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49
Number | 11 | 36 | 28 | 13 | 7 | 3 | 2

[Figure: histogram titled "Age of women at the time of marriage"; horizontal axis: Age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; vertical axis: No of women]
Box and Whisker Plot
• It is another way to display information when the
objective is to illustrate certain locations (skewness) in
the distribution .
• Can be used to display a set of discrete or continuous
observations using a single vertical axis – only certain
summaries of the data are shown
• First the percentiles (or quartiles) of the data set
must be defined
• A box is drawn with the top of the box at the third
quartile (75%) and the bottom at the first quartile
(25%).
• The location of the mid-point (50%) of the distribution is
indicated with a horizontal line in the box.
• Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation
and from the centre of the bottom of the box to the
smallest observation.
Percentiles and Quartiles
– P0: the minimum
– P25: 25% of the sample values are less than or equal to this value (1st quartile). P25 means the 25th percentile.
– P50: 50% of the sample values are less than or equal to this value (2nd quartile).
– P75: 75% of the sample values are less than or equal to this value (3rd quartile).
– P100: the maximum
• Arrange the numbers in ascending order.
• Position of the pth percentile = p(n+1), where p is the required percentile expressed as a proportion.
A. 1st quartile = the 0.25(n+1)th observation
B. 2nd quartile = the 0.5(n+1)th observation
C. 3rd quartile = the 0.75(n+1)th observation
The pth percentile is:
– The observation at position p(n+1), if p(n+1) is an integer
– The average of the kth and (k+1)th observations if p(n+1) is not an integer, where k is the largest integer less than p(n+1)
– E.g. if p(n+1) = 3.6, the average of the 3rd and 4th observations
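The p(n+1) rule can be sketched in Python as follows. The function name `percentile` and both sample data sets are hypothetical, chosen only for illustration. Clamping to the minimum or maximum when the position falls outside 1..n is an assumption added here, since the rule above does not cover that case. Note that statistical packages often use other interpolation rules, so their results may differ slightly.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (p as a proportion) using the p(n + 1) rule."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    position = p * (n + 1)            # 1-based position of the percentile
    k = math.floor(position)          # largest integer not exceeding the position
    if k < 1:                         # assumption: clamp to the minimum
        return ordered[0]
    if k >= n:                        # assumption: clamp to the maximum
        return ordered[-1]
    if position == k:                 # integer position: take that observation
        return ordered[k - 1]
    # Non-integer position: average the k-th and (k+1)-th observations.
    return (ordered[k - 1] + ordered[k]) / 2

# Hypothetical data: 11 observations, so p(n + 1) is an integer for each quartile.
eleven = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23, 27]
print(percentile(eleven, 0.25), percentile(eleven, 0.50), percentile(eleven, 0.75))

# Hypothetical data: 10 observations, so the quartile positions are not integers.
ten = [3, 5, 7, 8, 12, 13, 14, 18, 21, 23]
print(percentile(ten, 0.25), percentile(ten, 0.50), percentile(ten, 0.75))
```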
How can the lower quartile, median and
upper quartile be used to judge the
symmetry of a distribution?
1. If the distribution is symmetric, then the upper and
lower quartiles should be approximately equally
spaced from the median.
2. If the upper quartile is farther from the median than
the lower quartile, then the distribution is positively
skewed.
3. If the lower quartile is farther from the median than
the upper quartile, then the distribution is
negatively skewed.
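The three rules above can be written as a small helper. This is a sketch only: the function name `quartile_skew`, the `tolerance` parameter (how close the two gaps must be to count as "approximately equal") and the quartile values in the usage lines are all hypothetical.

```python
def quartile_skew(q1, median, q3, tolerance=0.0):
    """Judge symmetry from the spacing of the quartiles around the median."""
    lower_gap = median - q1      # distance from lower quartile to median
    upper_gap = q3 - median      # distance from median to upper quartile
    if abs(upper_gap - lower_gap) <= tolerance:
        return "approximately symmetric"
    if upper_gap > lower_gap:
        return "positively skewed"
    return "negatively skewed"

# Hypothetical quartile values for illustration.
print(quartile_skew(10, 20, 30))   # equal gaps
print(quartile_skew(10, 15, 30))   # upper quartile farther from the median
print(quartile_skew(10, 25, 30))   # lower quartile farther from the median
```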
Box plots are useful for comparing two or
more groups of observations
Scatter plot
• For two quantitative variables we use bivariate plots (also
called scatter plots or scatter diagrams).
• In the study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures.
• A scatter diagram is constructed by drawing X- and Y-axes. Each dot represents a pair of values measured for a single study subject.
[Figure: scatter plot titled "Age and percentage saturation of bile for women patients in hospital Z, 1998"; horizontal axis: Age; vertical axis: Saturation of bile]
• The graph suggests the possibility of a positive relationship between age and percentage saturation of bile in women.
Line graph
• Useful for assessing the trend of a particular situation over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the horizontal axis, and
• Values of the quantity being studied are marked on the vertical axis.
• Values for each category are connected by a continuous line.
Numerical Summary Measures
• Numerical summary measures (NSM) are single numbers which quantify the characteristics of a distribution of values:
– Measures of central tendency (location)
– Measures of dispersion (spread)
• A frequency distribution gives a general picture of the distribution of a variable,
• but it cannot by itself indicate the average value or the spread of the values.
Measures of Central tendency

Definitions
- On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster.
- Since this stage is usually in the center of the distribution, the tendency of the statistical data to concentrate at this stage/value is called "central tendency".
- The various measures determining the actual value at which the data tend to concentrate are called measures of central tendency.
- So, a measure of central location is the single value that best represents the whole series.
Types of Measures of Central Tendency

The most commonly used are:
• Mean
• Mode
• Median
Common Measures of Central Tendency
A. Mean
The most commonly used measure of central location.
It is commonly called the “mean” or “average.”
In formulas, the arithmetic mean is usually represented as $\mu$ for the population mean and $\bar{X}$ (read as x-bar) for the sample mean.
The Mean…
Ungrouped data: The formula for calculating the mean from individual data of a sample is:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

For the population mean, replace $\bar{X}$ by $\mu$ and $n$ by $N$.
The Mean…

Example:
• The following are the lengths (in cm) of a sample of six garment blanks chosen at random from a large batch of similar blanks: 54.5, 53.0, 55.7, 51.8, 54.2, 52.4
What is the mean length of the sample of garments?
Solution:

$$\bar{X} = \frac{1}{6}\sum_{i=1}^{6} X_i = \frac{1}{6}(54.5 + 53.0 + 55.7 + 51.8 + 54.2 + 52.4) = \frac{321.6}{6} = 53.6 \text{ cm}$$
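As a quick check of the arithmetic, the sample mean can be computed directly. This sketch uses the six values that appear in the worked solution (54.5, 53.0, 55.7, 51.8, 54.2, 52.4); the variable names are illustrative only.

```python
# Lengths (in cm) as used in the worked solution above.
lengths_cm = [54.5, 53.0, 55.7, 51.8, 54.2, 52.4]

# Sample mean: sum of the observations divided by the number of observations.
mean_length = sum(lengths_cm) / len(lengths_cm)

print(f"n = {len(lengths_cm)}, mean = {mean_length:.1f} cm")
```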
The Mean…

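Grouped data

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{k} f_i X_i = \frac{1}{n}\left(f_1 X_1 + f_2 X_2 + \dots + f_k X_k\right)$$

where $f_i$ is the frequency and $X_i$ the class mark of the $i$-th class, $k$ is the number of classes, and $n = \sum f_i$.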
Example
• The table below shows the age groups of 50 MOHA permanent workers. Calculate the mean.

Class limits | Frequency ($f_i$) | Class mark ($X_i$) | $f_i X_i$
42-48 | 8 | 45 | 360
49-55 | 8 | 52 | 416
56-62 | 13 | 59 | 767
63-69 | 7 | 66 | 462
70-76 | 6 | 73 | 438
77-83 | 5 | 80 | 400
84-90 | 3 | 87 | 261
Total | 50 | | 3104

Solution:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{k} f_i X_i = \frac{1}{50}(360 + 416 + \dots + 261) = \frac{3104}{50} = 62.08$$
B. The Mode
• Mode is a value which occurs most frequently in a set
of values.
• The mode may not exist and even if it does exist, it may
not be unique.
• The value having the maximum frequency is the modal
value.
Examples:
1. Find the mode of 5, 3, 5, 8, 9. Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, 5. The data are bimodal: the modes are 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode for these data.
Mode for Grouped data
In grouped data, we usually refer to the modal class, the class with the highest frequency.
If a single value for the mode of grouped data must be specified, it is taken as:

$$\text{Mode} = L + w\,\frac{\Delta_1}{\Delta_1 + \Delta_2}$$

where $\Delta_1 = f_{mod} - f_1$ and $\Delta_2 = f_{mod} - f_2$,
$L$ = the lower class boundary of the modal class;
$w$ = the size (width) of the modal class;
$f_1$ = frequency of the class preceding the modal class;
$f_2$ = frequency of the class succeeding the modal class;
$f_{mod}$ = frequency of the modal class.
The Mode…
Example 2.40: Calculate the modal age for the age distribution of 228 patients below.

Class interval | Number of women
15-19 | 6
20-24 | 19
25-29 | 50
30-34 | 57
35-39 | 48
40-44 | 27
45-49 | 21
Total | 228
The Mode…
Solution: By inspection (simply looking at the frequencies), the mode lies in the fourth class, where $L = 29.5$, $f_{mod} = 57$, $f_1 = 50$, $f_2 = 48$, $w = 5$,
$\Delta_1 = 57 - 50 = 7$ and $\Delta_2 = 57 - 48 = 9$.

Therefore, the modal age is

$$\hat{x} = 29.5 + 5 \times \frac{7}{7 + 9} = 29.5 + 2.2 = 31.7$$
C. The Median
• The median is the middle value of a set of data that
has been put into rank order.
• In a distribution, the median is the value of the variable which divides it into two equal halves.
• Useful measure of central tendency when data are
skewed.
The Median…
A) Ungrouped data
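Let $x_1, x_2, \dots, x_n$ be $n$ ordered observations.
Then, the median value is:
a) The $\left(\frac{n+1}{2}\right)^{th}$ value if $n$ is odd, and
b) The average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2}+1\right)^{th}$ values, $\frac{1}{2}\left[x_{n/2} + x_{n/2+1}\right]$, if $n$ is even.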
Example: Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4.
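The odd/even rule for ungrouped data can be sketched in Python as follows. The function name `median` is illustrative; Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)       # put the data into rank order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th value (0-based index n // 2).
        return ordered[middle]
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [6, 5, 2, 8, 9, 4]          # values from example (a)
print(sorted(data), "median =", median(data))
```

Here n = 6 is even, so the result is the average of the 3rd and 4th ordered values.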
The Median…

B) Grouped Data
Steps
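1. Locate the class interval in which the median lies, using the following procedure:
– Find n/2 and identify the class interval with the smallest cumulative frequency that is greater than or equal to n/2; this is the median class.
2. To find a unique median value, use the following interpolation formula:

Median for grouped data

$$\tilde{x} = L + \frac{w}{f_{med}}\left(\frac{n}{2} - CF\right)$$

where $L$ is the lower class boundary of the median class, $w$ its width, $f_{med}$ its frequency, and $CF$ the cumulative frequency of the class preceding the median class.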
The Median…
Example 2.26: Compute the median for the
following distribution.

Grade | Frequency | CF
40-49 | 5 | 5
50-59 | 18 | 23
60-69 | 27 | 50
70-79 | 15 | 65
80-89 | 6 | 71
The Median…
Solution: Construct the less than
cumulative frequency distribution as
follows:
Class boundaries    Frequency    CF
39.5-49.5           5            5
49.5-59.5           18           23
59.5-69.5           27           50
69.5-79.5           15           65
79.5-89.5           6            71
The Median…
Since n = 71, n/2 = 35.5, and the smallest less-than CF
greater than or equal to 35.5 is 50; thus, the median class is
the third class. For this class,
L = 59.5, w = 10, $f_{med}$ = 27, CF = 23. Applying the
formula we get:
$$\tilde{x} = 59.5 + \frac{10}{27}\,(35.5 - 23) = 64.13$$
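The two steps (locate the median class, then interpolate) can be sketched in Python. The function `grouped_median` and its argument layout are assumptions made for illustration; it assumes equal class widths, as in the example.

```python
def grouped_median(lower_boundaries, width, frequencies):
    """Median of grouped data: L + (w / f_med) * (n / 2 - CF)."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0                      # CF of the classes before the current one
    for lower, freq in zip(lower_boundaries, frequencies):
        if cumulative + freq >= half:   # first class whose CF reaches n / 2
            return lower + (width / freq) * (half - cumulative)
        cumulative += freq
    raise ValueError("median class not found")


# Class boundaries and frequencies from the grade example
lowers = [39.5, 49.5, 59.5, 69.5, 79.5]
freqs = [5, 18, 27, 15, 6]
grade_median = grouped_median(lowers, 10, freqs)
print(round(grade_median, 2))  # 64.13
```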
Exercise

Values      Frequency
140-150     17
150-160     29
160-170     42
170-180     72
180-190     84
190-200     107
200-210     49
210-220     34
220-230     31
230-240     16
240-250     12
Compute the mean, mode and median.
Measures of Variation/Dispersion

Introduction
• Consider the following two sets of scores:
  Set 1: 40, 50, 60, 60, 40, 50
  Set 2: 0, 100, 25, 75, 80, 20
• Both these sets have the same mean (50), but
the second set is a lot more widely dispersed
("scattered") than the first.
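A quick check of this comparison, as a sketch using Python's standard `statistics` module (the variable names are illustrative): both sets have a mean of 50, yet their range and sample standard deviation differ widely.

```python
import statistics

set_1 = [40, 50, 60, 60, 40, 50]
set_2 = [0, 100, 25, 75, 80, 20]

for name, scores in (("Set 1", set_1), ("Set 2", set_2)):
    mean = statistics.mean(scores)
    spread = max(scores) - min(scores)      # range = maximum - minimum
    sd = statistics.stdev(scores)           # sample standard deviation (n - 1)
    print(f"{name}: mean={mean}, range={spread}, sd={sd:.2f}")
```

The ranges are 20 and 100, and the sample standard deviations are roughly 8.9 and 40.1, which quantifies how much more scattered the second set is.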
[Figure: Set 1 and Set 2]
Measures of Variation/Dispersion
• The degree to which numerical data tend to spread about an
average value is called the dispersion or variation of the data.
• Measures of dispersion are statistical measures that quantify
the extent to which data are dispersed or spread out.
Objectives of Measuring Variation
• To determine the reliability of an average by showing
how far it is representative of the entire data.
• To determine the nature and cause of variation in order
to control the variation itself.
• To enable comparison of two or more distributions with
regard to their variability.
• Measuring variability is of great importance to other
statistical analyses; e.g., it is the basis of statistical quality
control.
Types of Measures of Dispersion

• Range
• Variance
• Standard deviation
• Coefficient of variation
The Range
The range is the difference between the largest (maximum)
and smallest (minimum) values.
Range = Maximum − Minimum
For a frequency distribution, the range is the difference
between the upper class boundary of the last class and the
lower class boundary of the first class.
Example 4.1
Find the range of 54.5, 55.0, 55.7, 51.8, 54.2, 52.4
Solution: Range (R) = 55.7 − 51.8 = 3.9 cm
Example 4.2
Given the following frequency distribution.
Find the range.
Class          f
52.5-63.5      6
63.5-74.5      12
74.5-85.5      25
85.5-96.5      18
96.5-107.5     14
107.5-118.5    5
Solution: Range = UCB of the last class − LCB of the first class = 118.5 − 52.5 = 66
Interquartile range
• Shows the variability within the middle 50% of observations.
• It is defined as the difference between the 75th and 25th
percentiles of the data.
• Interquartile range = Q3 − Q1
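As a rough illustration, the interquartile range can be computed with `statistics.quantiles` from Python's standard library. Several conventions exist for locating quartiles in small samples; the `method="inclusive"` choice below is an assumption, and other conventions can give slightly different values. The data are the six measurements from Example 4.1.

```python
import statistics

lengths = [54.5, 55.0, 55.7, 51.8, 54.2, 52.4]

# n=4 splits the data into quartiles and returns the cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
iqr = q3 - q1                      # spread of the middle 50% of observations
print(f"Q1={q1:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
```

Under this convention Q1 is about 52.85 and Q3 about 54.88, so the interquartile range is about 2.03, compared with a full range of 3.9.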
Variance and Standard Deviation
• The variance and standard deviation are the most
widely used measures of dispersion.
• Both measure the average dispersion of the
observations around the mean.
• The variance is defined as the average of the squared
deviations from the mean.
Population and Sample Variance
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \quad i = 1, 2, \ldots, N$$
For the case of a frequency distribution:
$$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \mu)^2}{N}, \quad i = 1, 2, \ldots, k, \text{ where } X_i \text{ are class marks}$$
Sample variance: the sample variance is represented by $S^2$ and is given by
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \quad i = 1, 2, \ldots, n$$
For the case of a frequency distribution:
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n - 1}, \quad i = 1, 2, \ldots, k, \text{ where } n = \sum f_i$$
Standard deviation:
The positive square root of the variance is called the standard
deviation. Therefore
$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
$$\text{Sample standard deviation: } S = \sqrt{S^2} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
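The frequency-distribution form of the sample variance can be sketched in Python. The function name `grouped_sample_variance` is illustrative, and the data are the class boundaries and frequencies of Example 4.2, with each class mark taken as the midpoint of its boundaries (an assumption consistent with the usual definition of a class mark).

```python
import math


def grouped_sample_variance(class_marks, frequencies):
    """S^2 = sum(f_i * (X_i - mean)^2) / (n - 1), with n = sum(f_i)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n
    squared = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, frequencies))
    return squared / (n - 1)


boundaries = [(52.5, 63.5), (63.5, 74.5), (74.5, 85.5),
              (85.5, 96.5), (96.5, 107.5), (107.5, 118.5)]
freqs_42 = [6, 12, 25, 18, 14, 5]
marks = [(low + high) / 2 for low, high in boundaries]   # class marks

s2 = grouped_sample_variance(marks, freqs_42)
s = math.sqrt(s2)                                         # standard deviation
print(f"variance={s2:.2f}, sd={s:.2f}")
```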
Example 4.6
For the data 24, 25, 29, 29, 30, 31 the mean is 28. Find the variance and
standard deviation.
Solution:
Value minus mean    Difference    Difference squared
24 − 28             −4            16
25 − 28             −3            9
29 − 28             1             1
29 − 28             1             1
30 − 28             2             4
31 − 28             3             9
Total               0             40
Variance and Standard Deviation…
Variance = Sum of squared differences / (n − 1)
= 40/5 = 8
Standard deviation = √8 = 2.83
Variance is the mean of the squared differences of the
observations from the mean.
The standard deviation is the square root of the
variance.
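The arithmetic of Example 4.6 can be reproduced step by step in Python. This is a small sketch; the variable names are illustrative, and the final line cross-checks the hand calculation against the standard library's `statistics.variance`, which also divides by n − 1.

```python
import math
import statistics

data = [24, 25, 29, 29, 30, 31]
n = len(data)
mean = sum(data) / n                                   # 168 / 6

squared_deviations = [(x - mean) ** 2 for x in data]   # 16, 9, 1, 1, 4, 9
sample_variance = sum(squared_deviations) / (n - 1)    # 40 / 5
sample_sd = math.sqrt(sample_variance)

print(mean, sample_variance, round(sample_sd, 2))  # 28.0 8.0 2.83

# Cross-check with the standard library
assert math.isclose(sample_variance, statistics.variance(data))
```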
Individual Assignment
Find the variance, quartiles (Q1, Q2, Q3), quartile deviation,
mean deviation about the mean and about the median, deciles (D1, ..., D9)
and standard deviation of the following sample data. Also
draw a box-and-whisker plot and a histogram for both data sets.
1. 5, 17, 12, 10, 20, 15, 16, 8
2. The data are given in the form of a frequency distribution.
Class    Frequency
40-44    7
45-49    10
50-54    22
55-59    15
60-64    12
65-69    6
70-74    3
Special properties of standard deviation/variance
• The main drawback of the variance is that its unit is squared,
which is difficult to interpret.
• Variance gives more weight to extreme values than to
those near the mean, because the
differences are squared.
• Variance is zero when all observations have the
same value.
• The greater the differences among the values, the
greater the variance, and vice versa.
Coefficient of variation (CV)
• In situations where either two series have different units of
measurement, or their means differ sufficiently in size, the
CV should be used as a measure of dispersion.
$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$
$$CV = \frac{S}{\bar{X}} \times 100\% \text{ for a sample, and } CV = \frac{\sigma}{\mu} \times 100\% \text{ for a population}$$
• Although the CV is broadly applied, its
disadvantage is that it is not useful when the mean is zero.
• Interpretation of the coefficient of variation: the
distribution with the smaller CV is said to be more consistent.
EXAMPLE 4.7
For the garment length data, mean = 53.6 and standard
deviation = 1.46 cm, so that the coefficient of variation is
$$CV = \frac{S}{\bar{X}} \times 100\% = \frac{1.46}{53.6} \times 100\% = 2.72\%$$
Example
Suppose that the mean weight of a group of
students is 165 pounds with a S.D of 8
pounds. If the height of the same group of
students has a mean of 60 inches with a S.D
of 3 inches, compare the variability in weight
and height measurements.
Solution: For weight, C.V = (8 lb / 165 lb) × 100 = 4.85%,
and for height, C.V = (3 in / 60 in) × 100 = 5%.
=> The height data is more variable/less
consistent than the weight data.
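The comparison can be reproduced with a short Python sketch. The helper `coefficient_of_variation` is an illustrative name; it refuses a zero mean because the CV is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_weight = coefficient_of_variation(8, 165)   # pounds cancel out
cv_height = coefficient_of_variation(3, 60)    # inches cancel out

print(f"weight: {cv_weight:.2f}%")  # weight: 4.85%
print(f"height: {cv_height:.2f}%")  # height: 5.00%
```

Because the units cancel, the two percentages are directly comparable even though weight and height are measured on different scales.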
Skewness
• Skewness is the degree of asymmetry or
departure from symmetry of a distribution.
• A skewed frequency distribution is one that is
not symmetrical.
• Skewness is concerned with the shape of the
curve, not its size.
Skewness
• If the frequency curve (smoothed frequency polygon) of a
distribution has a longer tail to the right of the central
maximum than to the left, the distribution is said to be
skewed to the right or to have positive skewness.
• If it has a longer tail to the left of the central maximum
than to the right, it is said to be skewed to the left or
to have negative skewness.
Skewness

Remarks:
• In a positively skewed distribution, smaller
observations are more frequent than larger
observations, i.e. the majority of the observations
have a value below the average.
• In a negatively skewed distribution, smaller
observations are less frequent than larger
observations, i.e. the majority of the observations
have a value above the average.
Skewness
Fig. 2(a). Symmetric distribution: Mean = Median = Mode
Fig. 2(b). Distribution skewed to the right: Mean > Median > Mode
Fig. 2(c). Distribution skewed to the left: Mean < Median < Mode
Skewness
Example.
Suppose the mean, the mode, and the standard
deviation of a certain distribution are 32, 30.5
and 10 respectively. What is the shape of the
curve representing the distribution?
Solution:
$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{32 - 30.5}{10} = 0.15$$
The distribution is positively skewed.
Skewness
Measures of Skewness
Karl Pearson's coefficient of skewness (SK):
$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
If SK = 0, then the distribution is
symmetrical.
If SK > 0, then the distribution is
positively skewed.
If SK < 0, then the distribution is
negatively skewed.
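Both forms of Pearson's coefficient can be sketched in Python. The function names below are illustrative, and the worked values are those of the example with mean 32, mode 30.5 and standard deviation 10.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Sk = (mean - mode) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean - mode) / sd


def pearson_median_skewness(mean, median, sd):
    """Sk = 3 * (mean - median) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / sd


def describe(sk):
    """Translate the sign of Sk into the shape of the distribution."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"


sk = pearson_mode_skewness(32, 30.5, 10)
print(round(sk, 2), describe(sk))  # 0.15 positively skewed
```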
Summarize the data
• When the distribution of the data is symmetric and unimodal
(e.g. the data are approximately normally distributed), it is
usual to summarize the data using the mean and standard
deviation.
• However when the data are skewed, it is preferable to use
the median and quartiles as summary statistics.
• Median and quartiles are not easily influenced by extreme
values in a skewed distribution unlike means and standard
deviations.
• Remark:
– The mean and median of a symmetric distribution coincide.
– When the distribution is skewed to the right, its mean is
larger than its median.
– When the distribution is skewed to the left, its mean is
smaller than its median.
Assignment (Presentation)
Group 1: Basic probability concepts and
probability distributions (Binomial, Normal,
t distribution)
Group 2: Sampling techniques and
sampling distributions
To be presented on Thursday (February 27)
Any questions?
