Stat Basics-of-Research

The document provides an overview of variables in research, categorizing them into qualitative and quantitative types, as well as discrete and continuous classifications. It also discusses the measurement scales used in research, including nominal, ordinal, interval, and ratio scales, and outlines the stages and purposes of conducting research. Additionally, it emphasizes the importance of data collection methods and the use of multiple sources to ensure accurate and trustworthy information.

Uploaded by

Aireen Maten
Copyright
© All Rights Reserved

Variables

 A variable is a characteristic or
condition that can change or take on
different values. Ex?
 The values of the variable are data

 Most research begins with a general
question about a variable or the
relationship between two variables for
a specific group of individuals.
TYPES OF VARIABLES 1:
Qualitative, or Attribute, or Categorical, Variable:
A variable that categorizes or describes.
Ex. dead/alive, blood type, room number
Note: Arithmetic operations, such as addition and averaging,
are not meaningful for data resulting from a qualitative
variable.
Quantitative, or Numerical, Variable: A variable
that quantifies. Ex. height, weight, age
Note: Arithmetic operations, such as addition and averaging,
are meaningful for data resulting from a quantitative variable.
Statistical Description of Data
 Statistics describes a QUANTITATIVE
or numeric set of data by its
 Center (ex. Average height of Filipinos)
 Variability
 Shape
 Statistics describes a QUALITATIVE or
categorical set of data by
 Frequency, percentage or proportion of
each category. Ex. Percent of gay students at SMU
Types of Variables (2)
 Variables can be classified as
discrete or continuous.
 Discrete variables (such as class
size) consist of indivisible categories
 List other examples
 1. No. of siblings (0, 1, 2, 3, 4, 5, ...)
 2
 3
 4
 5
Types of Variables (2)
 Variables can be classified as
 discrete or continuous.
 continuous variables (such as time
or weight) are infinitely divisible into
whatever units a researcher may
choose. For example, time can be
measured to the nearest minute,
second, half-second, etc.
 Other Ex?

Continuous variables have
Real Limits
 To define the units for a continuous
variable, a researcher must use real
limits which are boundaries located
exactly half-way between adjacent
categories. Ex. 1.5 to 2.5

1 2 3 4
5

6
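The idea of real limits can be sketched in a few lines of Python. This is only an illustration: the function name `real_limits` and the `unit` parameter (the measurement precision) are assumptions, not part of the slides.

```python
def real_limits(value, unit=1.0):
    """Return the (lower, upper) real limits of a recorded value.

    `unit` is the assumed measurement precision; the limits sit
    half a unit below and half a unit above the recorded value.
    """
    half = unit / 2
    return (value - half, value + half)


# A score recorded as 2 on a scale measured in whole units
print(real_limits(2))        # (1.5, 2.5)

# A weight recorded to the nearest 10 units
print(real_limits(150, 10))  # (145.0, 155.0)
```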
Types of Variables (3)
 Univariate
 Bivariate
 Multivariate
Types of Variables (3)
 Univariate, bivariate, and multivariate
are terms used to describe the number of
variables being analyzed in a statistical
analysis.
• Univariate: Analyzes one variable at a
time
• Bivariate: Analyzes two variables at a
time
• Multivariate: Analyzes more than two
variables at a time
Types of Variables (3)
 Examples
• Univariate: Analyzing the length of iris
flower sepals in a dataset
• Bivariate: Analyzing the relationship
between temperature and ice cream
sales
• Multivariate: Analyzing how the
popularity of advertisements on a website
depends on age, gender, and location
Types of Variables (4)
 Independent Variables
 Dependent Variables
Note: In Experiments: IV-DV
 Is there an effect of IV on DV
 In Correlation:
 Is there a relationship between IV & DV
 In Differences:
 Is there a difference in DV when grouped by
IV?
Measuring Variables
 Researchers must observe the
variables and record their
observations. This requires that the
variables be measured.
 The process of measuring a variable
requires a set of categories called a
scale of measurement and a process
that classifies each individual into one
category.

4 Types of Measurement Scales
(NOIR)
1.nominal scale
2.ordinal scale
3.interval scale
4.ratio scale

4 Types of Measurement Scales
1. A nominal scale is an unordered set
of categories identified only by
name. Nominal measurements only
permit you to determine whether two
individuals are the same or different.
Examples:
1=male, 2=female
1=public, 2=private
1=accountancy, 2=nursing, 3=criminology

4 Types of Measurement Scales
2. An ordinal scale is an ordered set of
categories. Ordinal measurements tell
you the direction of difference between
two individuals, but they do not indicate
how large the difference is.

Ex. 1st, 2nd, 3rd
Small (1), Medium (2), Large (3)
Grade 7, 8, 9, 10
4 Types of Measurement Scales
3. An interval scale is an ordered series of
equal-sized categories. Interval
measurements identify the direction and
magnitude of a difference. The zero point
is located arbitrarily on an interval scale.
Indicates amount of difference between
scores.
Ex. Grades 85, 90, 91

4 Types of Measurement Scales
4. A ratio scale is an interval scale where a
value of zero indicates none of the
variable. Ratio measurements identify
the direction and magnitude of
differences and allow ratio comparisons
of measurements.

Ex. Tower 1 is 10 meters. Tower 2 is 20 meters. Tower 2 is
twice as tall as Tower 1.
Review of Definitions
Variable - any characteristic of an individual or entity. A variable can
take different values for different individuals. Variables can be
categorical or quantitative: Discrete or Continuous.
• Nominal - Categorical variables with no inherent order or ranking sequence
such as names or classes (e.g., gender). Values may be written as numerals but carry no
numerical meaning (e.g., I, II, III). The only operation that can be applied to Nominal
variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe.
Can be compared for equality, or greater or less, but not how much greater or less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally,
differences between values are meaningful, however, the scale is not absolutely
anchored. Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction, but not multiplication and division are meaningful
operations.
• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary
zero point, e.g. age, weight, temperature (Kelvin). Addition, subtraction,
multiplication, and division are all meaningful operations.
QUIZ: SELFIE
List 15 sentences that explain ideas
about basic concepts in statistics
discussed in class. Each sentence must
contain a basic statistical concept.
Underline the concept.
Ex.
1. a variable is a characteristic that can
have different values.
QUIZ: Multiple choice
1. The characteristics of a sample are
called statistics while the characteristics
of a population are called ______.
a. inferential statistic
b. variable
c. descriptive statistic
d. parameter
MODULE 2

BASIC CONCEPTS
IN RESEARCH
BASIC CONCEPTS IN research
 What are the ways of knowing?
 What is research?
 What is Basic & Applied Research?
 What are the 3 basic purposes of
research?
 What are the stages of research
 What is the traditional thesis format
 What is the IMRAD format of research?
Ways of Knowing
 Five ways we can know something
 Personal experience
 Tradition
 Experts and authorities
 Logic
 Inductive
 Deductive
 The scientific method (Research)

What is Research?
 Research is creative work undertaken
systematically to increase the stock of
knowledge (of humanity, culture and society),
and the use of this knowledge to devise new
applications (OECD).

 Research is the systematic process of collecting
and analysing information (data) in order to
increase our understanding of the phenomenon
with which we are concerned or interested.
What is research?
 activity classified as research is
characterised by originality
 investigation is a primary aim
 results are sufficiently general for
humanity's stock of knowledge to
be increased
 includes empirical and non-
empirical work
 Empirical means based on data
TYPES of Research:
Basic vs Applied
 BASIC
Seeks to increase understanding and prediction
of a particular phenomenon across individuals
and sites

 Ex. Thesis/Dissertation:
Theory Building
Theory Testing
Theory Expansion
Basic vs Applied
 APPLIED
 Seeks to increase understanding in order to
address the needs of an individual, group or
organization.
Ex. Culminating Project/Thesis
Assessment/Diagnosis
Program Development
Material Development
Evaluation Research
Case Study
Action Research
Purpose of Research
 Description
-Document a particular phenomenon
-Usually involves survey, naturalistic observations, case
studies, review of records (archives), etc

 Explanation
– Establish causality
– Usually involves experimental methods

 Prediction
• To forecast the variables, events and behaviors
associated with or resulting from the phenomenon
• Correlational studies
Stages of conducting research:
Research report format
 Traditional Thesis Format
 Chap 1: THE PROBLEM & ITS BACKGROUND
 Rationale
 Statement of the Problem
 Hypotheses
 Theoretical/Conceptual/Analytical
Framework
 Assumptions
 Significance of the Study
 Scope & Delimitations
 Definition of Terms
Research report format
 Traditional Thesis Format
 Chap II: REVIEW OF RELATED
LITERATURE AND STUDIES

Arrange by topics, combining literature and studies,
following the flow of the problems.

Provide a synthesis (how your study will differ
from the studies cited).
Research report format
 Traditional Thesis Format
 Chap III: METHODOLOGY
 Research Design
 Research Environment
 Population and Sample
 Instruments
 Procedure
 Data Analysis
Research report format
 Traditional Thesis Format
 Chap IV: RESULTS AND DISCUSSION
 by Section based on problems
 Chapter V: SUMMARY, FINDINGS &
CONCLUSIONS
 Summary
 Findings
 Conclusions
 Recommendations
REFERENCES
APPENDICES
Research report format:
APA format: IMRAD
 INTRODUCTION
 METHODS
 RESULTS
 and
 DISCUSSION
MODULE 3

Data Collection
Data collection
 Types of Data
 Methods of Data Collection
 Sampling
 Sampling Techniques
 Sampling Size
Types of Data:
Quantitative and Qualitative Data

5 (Quantity) Happy (Quality)


"Not everything that counts can be counted."
Kids
• Quantitative data collection methods
produce numbers.

• Qualitative data collection methods
produce words, pictures, images
• Quanti vs Quali

• Quantitative and qualitative each has its
strengths and weaknesses.

• Quantitative methods are more
structured and allow for aggregation
and generalization.

• Qualitative methods are more open and
provide for depth and richness.
Quantitative Approach
 Data in numerical form
 Data that can be precisely measured
 age, cost, length, height, area, volume,
weight, speed, time, and temperature
 Harder to develop
 Easier to analyze

IPDET © 2009
Qualitative Approach
 Data that deal with description
 Data that can be observed or self-
reported, but not always precisely
measured
 Less structured, easier to develop
 Can provide “rich data” — detailed and
widely applicable
 Is challenging to analyze
 Is labor intensive to collect
 Usually generates longer reports
Which Data? Quali-Quanti?

If you:                                        Then use:
- want to conduct statistical analysis         Quantitative
- want to be precise                           Quantitative
- know what you want to measure                Quantitative
- want to cover a large group                  Quantitative
- want narrative or in-depth information       Qualitative
- are not sure what you are able to measure    Qualitative
- do not need to quantify the results          Qualitative
Obtrusive vs. Unobtrusive
Methods
Obtrusive: data collection methods that directly obtain
information from those being evaluated,
e.g., interviews, surveys, focus groups

Unobtrusive: data collection methods that do not collect
information directly from evaluees,
e.g., document analysis, GoogleEarth, observation at a
distance, trash of the stars
What method shall I use?

There is no simple answer

There is no ONE best method

It all depends…

© 2009 University of Wisconsin-Extension, Cooperative Extension, Program Development and Evaluation


When choosing methods, consider…
• The purpose of your evaluation − Will the method
allow you to gather information that can be
analyzed and presented in a way that will be
credible and useful to you and others?

• The respondents − What is the most appropriate
method, considering how the respondents can
best be reached, how they might best respond,
literacy, cultural considerations, etc.?

Consider…
• What kind of data your stakeholders will find
most credible and useful

• Resources available. Time, money, and staff
to design, implement, and analyze the
information. What can you afford?

• Type of information you need. Numbers,
percents, comparisons, stories, examples,
etc.

Consider…

• Interruptions to program or participants. Which
method is likely to be least intrusive?

• Advantages and disadvantages of each
method.

• The importance of ensuring cultural
appropriateness.

Often, it is better to use more than one data
collection method.

Why would this be so?

When we use several methods we say we are
‘triangulating’. Triangulation is important in
evaluation because we want accurate and
trustworthy information.

Triangulation means the use of multiple sources
and methods to gain a better understanding.
Each source and each method has inherent
biases so using more than one source and/or
method provides a more accurate picture.

How might you mix various
sources of information
in your data gathering?

How might you mix various
data collection methods
to gather data?

Some ideas
Mix sources of information
For example, you might collect information
from program participants AND parents;
or from campers and camp leaders.

Mix data collection methods


For example, you might survey
participants AND interview a sample of
participants. You might conduct focus
group interviews with community service
participants AND observe the community
service projects.
Mixing sources and methods
Thinking back on the examples in the previous
slide, what different type of information might you
get from the different sources and methods?
Using multiple sources and/or methods means
more time and resources.
The choice of data collection method ultimately
depends upon the resources you have available.

Data Gathering Methods
 Self-reports
 Observation
 Biophysiological measurement
 Psychological Testing
 Interview
 Document review
 Portfolio assessment
 etc
The term, “instrument”, sounds like we
are talking about a dental office, a
cockpit or an orchestra.

Actually, we use the term “instrument”
to mean the tool on which the data is
actually recorded: the questionnaire,
the recording form, the video or audio
tape, for example.
If you have selected a survey as your
method, you automatically know that
you will need a questionnaire. But, if
you choose a method such as focus
group or interview or observation, think
about what you will use for recording
the information.
Identify your unit of
analysis
 Who supplies the information?
Students, teachers, parents, some
combination of these individuals, or
entire schools.
 At this early stage, you need to decide
at what level the data needs to be
gathered, e.g., individual, family,
school, or school district.
 This level is referred to as the unit of
analysis.
Specify the population
and sample
 Select individuals who are
representative
of the entire group.
 Representative: refers to the selection
of individuals from a sample of a
population, enabling you to draw
conclusions from the sample about the
population as a whole.
 Population: a group of individuals who
have the same characteristic.
Population:
a set which includes all
measurements of interest
to the researcher
(The collection of all
responses, measurements, or
counts that are of interest)

Sample:
A subset of the population
Target Population:
The population to be studied/ to which the
investigator wants to generalize his results
Types of sampling strategies:
Probability:
• Why? Generalize to population.
Some examples:
– Simple random sample
– Stratified sample
– Cluster sample
– Systematic sample

Nonprobability:
• Why? Generalizability not as important. Want to focus on “right cases.”
Some examples:
– Quota sample
– “Purposeful” sample
– “Convenience” or “opportunity” sample
Non probability samples

 Convenience samples (ease of access):
sample is selected from elements of a population
that are easily accessible
 Snowball sampling (friend of friend….etc.)
 Purposive sampling (judgemental)
• You choose who you think should be in the
study
Quota sample

Non probability samples

- Probability of being chosen is unknown
- Cheaper, but unable to generalise
- Potential for bias
Probability samples

• Random sampling
– Each subject has a known probability of
being selected
• Allows application of statistical sampling
theory to results to:
– Generalise
– Test hypotheses

Conclusions

• Probability samples are the best

• Ensure
– Representativeness
– Precision

Methods used in probability
samples

 Simple random sampling


 Systematic sampling

 Stratified sampling

 Multi-stage sampling

 Cluster sampling
Simple random sampling
Table of random numbers

684257954125632140
582032154785962024
362333254789120325
985263017424503686
Systematic sampling

Sampling fraction
Ratio between sample size and population
size
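A minimal Python sketch of systematic sampling follows. The slide gives only the sampling fraction, so the rest is an assumption based on the usual textbook procedure: compute the interval k = N / n, pick a random start within the first interval, then take every k-th unit. The function name `systematic_sample` and the list of ID numbers are hypothetical.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th unit after a random start (assumed procedure)."""
    rng = random.Random(seed)
    n_pop = len(population)
    fraction = sample_size / n_pop      # sampling fraction n / N
    k = n_pop // sample_size            # sampling interval
    start = rng.randrange(k)            # random start within the first interval
    picked = [population[start + i * k] for i in range(sample_size)]
    return fraction, k, picked


# Hypothetical population of 100 ID numbers, sample of 10
ids = list(range(1, 101))
fraction, k, picked = systematic_sample(ids, 10, seed=1)
print(fraction, k, picked)
```

The seed is there only so that the example can be repeated; in practice the start would be left random.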
Systematic sampling
Cluster sampling
Cluster: a group of sampling units close to each
other i.e. crowding together in the same area or
neighborhood
Cluster sampling
[Figure: a population divided into five clusters, labelled Section 1 to Section 5]
Cluster Sampling
Divide the population
into groups (called
clusters), randomly
select some of the
groups, and then
collect data from ALL
members of the
selected groups
Used extensively by
government and
private research
organizations
Examples:
 Exit Polls
 Stratified sampling
 Divide into groups and randomly sample from
each group (Male & Female) (Year level)

 Multi-stage sampling
 Ex. Randomly sample 20 of 80 provinces
 From each of the 20 provinces, randomly
select 5 towns
 From each town, select 10 barangays
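Stratified and multi-stage sampling might be sketched in Python as below. The data are made up for illustration (40 hypothetical students; 80 hypothetical provinces with 8 towns each), and the function names `stratified_sample` and `multistage_sample` are assumptions. Only the first two stages of the multi-stage example are shown.

```python
import random


def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw a random sample from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


def multistage_sample(provinces, n_provinces, n_towns, seed=None):
    """Stage 1: sample provinces. Stage 2: sample towns within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(provinces), n_provinces)
    return {p: rng.sample(provinces[p], n_towns) for p in chosen}


# Hypothetical students: (id, sex)
students = [(i, "Male" if i % 2 else "Female") for i in range(1, 41)]
by_sex = stratified_sample(students, lambda s: s[1], per_stratum=5, seed=0)

# Hypothetical frame: 80 provinces with 8 towns each
provinces = {f"Province {p}": [f"Town {p}-{t}" for t in range(1, 9)]
             for p in range(1, 81)}
stage = multistage_sample(provinces, n_provinces=20, n_towns=5, seed=0)

print({sex: len(sample) for sex, sample in by_sex.items()})
print(len(stage), "provinces,", sum(len(t) for t in stage.values()), "towns")
```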
Sampling Size
 Depends on Type of Research:
 If Survey,
 Slovin's Formula is used to calculate
the sample size (n) given the population size
(N) and a margin of error (e). It is computed as
n = N / (1 + Ne^2). If a sample is taken from a
population, a formula must be used to take
into account confidence levels and margins of
error.
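As a quick check of Slovin's formula, here is a short Python sketch. The function name `slovin_sample_size` is hypothetical, and rounding up to the next whole respondent is an assumption; the slide does not say how to round.

```python
import math


def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula: n = N / (1 + N * e**2), rounded up (assumed)."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


# Hypothetical survey: population of 1000, 5% margin of error
print(slovin_sample_size(1000, 0.05))  # 286
```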
Sampling Size: recommended
 If Experiment- 15 per cell
 If Interview- 15
 If case study, 1 to 5
 If correlation, more than 30
MODULE 4

ORGANIZING AND
PRESENTING DATA
ORGANIZING AND PRESENTING DATA
 Distribution of data
 Frequency distribution
 Cumulative frequency distribution
 Presenting Categorical variables
 Bar Graph
 Pie chart
 Presenting Numerical data
 Histogram
 Boxplot
 Pareto chart
 Ogive
Distribution
Distribution - (of a variable) tells us what values the variable
takes and how often it takes these values.
• Unimodal - having a single peak
• Bimodal - having two distinct peaks
• Multimodal-having several peaks
• Unimodal Symmetric - left and right half are mirror images.
• The normal distribution is a well-known example
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the
frequency distribution of variable ‘age’ can be tabulated as
follows:
Frequency Distribution of Age

Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2

Grouped Frequency Distribution of Age:

Age Group  1-2  3-4  5-6
Frequency    8   12    6
Cumulative Frequency
Cumulative frequency of data in previous page
Age                   1  2   3   4   5   6
Frequency             5  3   7   5   4   2
Cumulative Frequency  5  8  15  20  24  26

Age Group             1-2  3-4  5-6
Frequency               8   12    6
Cumulative Frequency    8   20   26
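The frequency and cumulative frequency tables for the 26 children can be reproduced with a short Python sketch. The list `ages` is rebuilt from the frequency table, since the slides do not list the raw data; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Raw ages rebuilt from the frequency table (26 children)
ages = [1] * 5 + [2] * 3 + [3] * 7 + [4] * 5 + [5] * 4 + [6] * 2

freq = Counter(ages)                       # age -> number of children
values = sorted(freq)
counts = [freq[v] for v in values]
cumulative = list(accumulate(counts))      # running total of the counts

# Grouped distribution using the age groups 1-2, 3-4, 5-6
groups = [(1, 2), (3, 4), (5, 6)]
grouped = [sum(freq[a] for a in range(lo, hi + 1)) for lo, hi in groups]
grouped_cumulative = list(accumulate(grouped))

print(counts)              # [5, 3, 7, 5, 4, 2]
print(cumulative)          # [5, 8, 15, 20, 24, 26]
print(grouped)             # [8, 12, 6]
print(grouped_cumulative)  # [8, 20, 26]
```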
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by the
shape, center, and spread of the data. An individual value that falls
outside the overall pattern is called an outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation –Categorical
Variable
Bar Diagram: Lists the categories and presents the percent or count of
individuals who fall in each category.

Figure 1: Bar Chart of Subjects in Treatment Groups
(x-axis: Treatment Group; y-axis: Number of Subjects)

Treatment Group  Frequency  Proportion       Percent (%)
1                15         (15/60) = 0.250  25.0
2                25         (25/60) = 0.417  41.7
3                20         (20/60) = 0.333  33.3
Total            60         1.00             100
Data Presentation –Categorical
Variable
Pie Chart: Lists the categories and presents the percent or count of
individuals who fall in each category.

Figure 2: Pie Chart of Subjects in Treatment Groups
(slices: group 1 = 25%, group 2 = 42%, group 3 = 33%)

Treatment Group  Frequency  Proportion       Percent (%)
1                15         (15/60) = 0.250  25.0
2                25         (25/60) = 0.417  41.7
3                20         (20/60) = 0.333  33.3
Total            60         1.00             100
Graphical Presentation –Numerical
Variable
Histogram: Overall pattern can be described by its shape, center,
and spread. The following age distribution is right skewed. The
center lies between 80 to 100. No outliers.

Figure 3: Age Distribution
(histogram; x-axis: Age in Month, bins at 40, 60, 80, 100, 120, 140, More; y-axis: Number of Subjects)

Mean                90.41666667
Standard Error      3.902649518
Median              84
Mode                84
Standard Deviation  30.22979318
Sample Variance     913.8403955
Kurtosis            -1.183899591
Skewness            0.389872725
Range               95
Minimum             48
Maximum             143
Sum                 5425
Count               60
Graphical Presentation –Numerical
Variable
Box-Plot: Describes the five-number summary

Figure 3: Distribution of Age
(box plot; y-axis from 0 to 160; marked values: min, q1, median, q3, max)
 A Pareto chart, named after
Vilfredo Pareto, is a type
of chart that contains both bars and
a line graph, where individual values
are represented in descending order
by bars, and the cumulative total is
represented by the line.
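The numbers behind a Pareto chart can be worked out as in the Python sketch below: the categories are sorted in descending order (the bars) and a running percentage is computed (the line). The function name `pareto_table` and the defect counts are made up for illustration; no plotting is done.

```python
from itertools import accumulate


def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest count first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(value for _, value in ordered)
    return [(name, value, 100 * cum / total)
            for (name, value), cum in zip(ordered, running)]


# Hypothetical counts of four kinds of defects
defects = {"A": 10, "B": 25, "C": 5, "D": 10}
table = pareto_table(defects)
for name, value, cum_percent in table:
    print(name, value, cum_percent)
```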
OGIVE
 In statistics, an ogive is a free-hand
graph showing the curve of a
cumulative distribution function. The
points plotted are the upper class
limit and the corresponding
cumulative frequency. (which, for the
normal distribution, resembles one
side of an Arabesque or ogival arch).
REVIEW OF DESCRIPTIVE STAT
 FOR FREQUENCIES & GRAPHS
Types of Statistics/Analyses
Descriptive Statistics: Describing a phenomenon
– Frequencies: How many? How much?
– Basic measurements: BP, HR, BMI, IQ, etc.

Inferential Statistics: Inferences about a phenomenon
– Hypothesis Testing: Proving or disproving theories
– Correlation: Associations between phenomena
– Confidence Intervals: If sample relates to the larger population
– Significance Testing: E.g., Diet and health
– Prediction
Descriptive Statistics
Descriptive statistics can be used to summarize
and describe a single variable (aka, UNIvariate)
•Frequencies (counts) & Percentages
– Use with categorical (nominal) data
• Levels, types, groupings, yes/no, Drug A vs. Drug B

•Means & Standard Deviations


– Use with continuous (interval/ratio) data
• Height, weight, cholesterol, scores on a test
Frequencies & Percentages
Look at the different ways we can display frequencies and
percentages for this data:

Pie chart

Table
AKA frequency distributions;
good if more than 20 observations

Bar chart
Good if more than 20 observations
Distributions
The distribution of scores or values can also be
displayed using Box and Whiskers Plots and Histograms
Continuous → Categorical

It is possible to take
continuous data
(such as hemoglobin
levels) and turn it
into categorical data
by grouping values
together. Then we
can calculate
frequencies and
percentages for each
group.
Continuous → Categorical
Distribution of Glasgow Coma Scale Scores

Even though this is continuous data, it is being treated as “nominal” as it is broken down into groups or categories.

Tip: It is usually better to collect continuous data and then break it
down into categories for data analysis as opposed to collecting data
that fits into preconceived categories.
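Grouping continuous values into categories, then counting each group, might look like the Python sketch below. The hemoglobin readings, cut points, and labels are invented for illustration only and are not clinical thresholds; `categorize` is a hypothetical helper.

```python
from collections import Counter


def categorize(value, cut_points, labels):
    """Return the label of the first group whose upper cut point exceeds value."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Made-up hemoglobin readings and illustrative (non-clinical) cut points
hemoglobin = [9.5, 11.2, 12.8, 13.5, 14.1, 15.9, 16.4, 10.8]
cut_points = [12.0, 16.0]
labels = ["low", "normal", "high"]

groups = Counter(categorize(h, cut_points, labels) for h in hemoglobin)
percent = {label: 100 * groups[label] / len(hemoglobin) for label in labels}

print(dict(groups))
print(percent)  # {'low': 37.5, 'normal': 50.0, 'high': 12.5}
```

Because the raw values are kept, the cut points can be changed later without collecting new data, which is the point of the tip above.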
Ordinal Level Data
Frequencies and percentages can be computed
for ordinal data
– Examples: Likert Scales (Strongly Disagree to Strongly
Agree); High School/Some College/College
Graduate/Graduate School
Interval/Ratio Data
We can compute frequencies and percentages
for interval and ratio level data as well
– Examples: Age, Temperature, Height, Weight,
Many Clinical Serum Levels
Distribution of Injury Severity
Score in a population of patients
Interval/Ratio Distributions
The distribution of interval/ratio data often
forms a “bell shaped” curve.
– Many phenomena in life are normally
distributed (age, height, weight, IQ).
• MAKE A FREQUENCY DISTRIBUTION AND
GRAPH OF THE GIVEN DATA

• 20 scores from a 100-point psychology exam


• 80, 90, 94, 82, 83, 84, 88, 90, 89, 92, 82, 83,
83, 84, 85, 85, 85, 87, 85, 84
MODULE 5

Averages and
measures of position
(Measures of Central
Tendency)
Numerical Presentation
A fundamental concept in summary statistics is that of a
central value for a set of observations and the extent to
which the central value characterizes the whole set of data.

Measures of central value such as the mean or median
must be coupled with measures of data dispersion (e.g.,
average distance from the mean) to indicate how well the
central value characterizes the data as a whole.
To understand how well a central value characterizes a
set of observations, let us consider the following two sets
of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But, the distance of
the observations from the mean in data set A is larger
than in the data set B. Thus, the mean of data set B is a
better representation of the data set than is the case for
set A.
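The comparison of data sets A and B can be checked with a small Python sketch. It uses the average distance from the mean (mean absolute deviation) as the dispersion measure; the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def mean_abs_deviation(values):
    """Average distance of the observations from their mean."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


a = [30, 50, 70]
b = [40, 50, 60]

print(mean(a), mean(b))       # 50.0 50.0
print(mean_abs_deviation(a))  # about 13.33
print(mean_abs_deviation(b))  # about 6.67
```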
Measures of central tendency
 Mean – arithmetic average of a set of
values and most frequently used
measure of central tendency
 Median- midpoint of values if they
are ordered from high to low
 Mode – value that occurs most
frequently

Methods of Central value (or
measurement)
Center measurement is a summary measure of the overall level of
a dataset
Mean: Summing up all the observation and dividing
by number of observations. Mean of 20, 30, 40 is
(20+30+40)/3 = 30.
Notation: Let x1, x2, ..., xn be n observations of a variable
x. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Compute: Mean
 Given Numbers:
7 26 54 82 32 26 51
 Total up the numbers
 7+26+54+82+32+26+51= 287
 Divide the total by the n (number of
values)
 (287 / 7 = 39.71)

Methods of Center Measurement
Median: The middle value in an ordered sequence
of observations.

To find the median we need to order the data set
and then find the middle value. (Data array =
arranged data like High to Low or Low to High)

For example, to find the median of {9, 3, 6, 7, 5}, we first
sort the data giving {3, 5, 6, 7, 9}, then choose the middle
value 6.
Methods of Center Measurement
Median: The middle value in an ordered sequence
of observations.

In case of an even number of observations the
average of the two middle most values is the
median.

If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2},
then the median is the average of the two middle values
from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.
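Both cases (odd and even number of observations) can be written as one short Python function. This is a sketch; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 3, 6, 7, 5]))     # 6
print(median([9, 3, 6, 7, 5, 2]))  # 5.5
```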
Methods of Center Measurement

Mode: The value that is observed most
frequently.

Given: 3,3,4,5,5,6,6,6,6,6,7,7,7,8,8,9
Mode = 6

The mode is undefined for sequences in which no
observation is repeated, or in which every value
occurs equally often.
Given: 3,3,3,5,5,5,6,6,6,7,7,7
Mode: undefined
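The two examples can be reproduced with the Python sketch below. Returning an empty list when every distinct value occurs equally often is an assumption made here to mirror the "undefined" case; the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    # Assumed convention: if several distinct values all tie, there is no mode
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes([3, 3, 4, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]))  # [6]
print(modes([3, 3, 3, 5, 5, 5, 6, 6, 6, 7, 7, 7]))              # []
```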
Median

Median: The middle value in a


ranked distribution. If there is an
even number of values, then take
the average of the middle two
values.

Raw Data: 2, 0, 1, 0, 2, 2, 1, 0, 0, 2, 1, 3,
3, 3, 2, 1, 3, 5, 1, 1, 0, 3, 1, 2, 2
Do not attempt to find the middle value of a
raw data set. It must first be sorted.
Median

Median: The middle value in a ranked


distribution. If there is an even
number of values, then take the
average of the middle two values.

Raw Data: 2, 0, 1, 0, 2, 2, 1, 0, 0, 2, 1, 3, 3, 3, 2, 1, 3, 5, 1, 1, 0, 3, 1, 2,
2
Sorted Data: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 5
Median

Median: The middle value in a ranked


distribution. If there is an even
number of values, then take the
average of the middle two values.

Raw Data: 2, 0, 1, 0, 2, 2, 1, 0, 0, 2, 1, 3, 3, 3, 2, 1, 3, 5, 1, 1, 0,
3, 1, 2, 2

Ranked: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 5

Median is 2
Median

Median: The middle value in a ranked


distribution. If there is an even
number of values, then take the
average of the middle two values.
Ranked (one value of 2 removed, leaving 24 values): 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 5

Median: (1 + 2) / 2
Median: 1.5
Effects of outliers on
Mean and Median
The median is less sensitive to outliers (extreme
scores) than the mean and thus a better measure
than the mean for highly skewed distributions, e.g.
family income.
For example, the mean of 20, 30, 40, and 990 is
(20+30+40+990)/4 = 270. The median of these four
observations is (30+40)/2 = 35. Here 3 observations
out of 4 lie between 20 and 40.
So the mean, 270, fails to give a realistic
picture of the major part of the data. It is influenced
by the extreme value 990.
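
A minimal sketch of the same comparison in Python, assuming the four observations above. The list name `values` and the idea of recomputing the mean without 990 are illustrative only.

```python
from statistics import mean, median

values = [20, 30, 40, 990]

mean_all = mean(values)      # pulled upward by the extreme value 990
median_all = median(values)  # average of the two middle values, 30 and 40

# the same mean with the extreme value left out
without_outlier = [x for x in values if x != 990]
mean_without = mean(without_outlier)

print(mean_all, median_all, mean_without)
```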
Measures of Central Tendency – And Outliers

Real World Use: When there is an outlier, your reporting options are to report:

(1) Median, or
(2) Median and Mean

If you think the outlier does not belong in the data set (i.e., was an error), then consider also reporting the mean without the outlier.
MODULE 6
MEASURES OF
VARIABILITY OR DISPERSION
Methods of Variability

Variability (or dispersion) measures the amount of scatter in a dataset.
Commonly used methods: range, variance, standard deviation, interquartile range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100-2) = 98. It’s a crude measure of variability.
Methods of Variability Measurement
Variance: The variance of a set of observations is, roughly, the average of the squares of the deviations of the observations from their mean; for a sample, the sum of squared deviations is divided by n − 1. In symbols, the variance of the n observations x1, x2, …, xn is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4$$

Standard Deviation: Square root of the variance. The standard deviation of the above example is 2.
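
The range, variance, and standard deviation examples can be reproduced as follows. This is a sketch using Python's standard `statistics` module, whose `variance` and `stdev` use the n − 1 denominator; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

range_data = [10, 5, 2, 100]
data_range = max(range_data) - min(range_data)

data = [5, 7, 3]
x_bar = mean(data)

# sum of squared deviations from the mean, divided by n - 1
s2_by_hand = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

s2 = variance(data)  # sample variance (n - 1 denominator)
s = stdev(data)      # square root of the sample variance

print(data_range, x_bar, s2_by_hand, s2, s)
```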
Methods of Variability Measurement

Quartiles: Data can be divided into four regions that cover the total range of observed values. Cut points for these regions are known as quartiles.
In notation, the q-th quartile of a data set is the ((n+1)/4)q th observation of the ordered data, where q is the desired quartile and n is the number of observations.
The first quartile (Q1) is the cut point below which the first 25% of the data lie. The second quartile (Q2) is the 50% cut point, i.e., the median. The third quartile (Q3) is the 75% cut point, so 25% of the data lie between the median and Q3.

Q1 is the median of the first half of the ordered observations and Q3 is the median of the second half of the ordered observations.
Methods of Variability Measurement
An example with 15 numbers

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

Here Q1 = ((15+1)/4)·1 = 4th observation of the data. The 4th observation is 11, so Q1 of this data is 11.
The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also the median). The third quartile is Q3 = 61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of the previous example is 61 − 11 = 50. The middle half of the ordered data lie between 11 and 61.
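
The ((n+1)/4)q position rule can be sketched in Python for the 15-number example. The helper `quartile_by_position` is hypothetical and only handles whole-number positions. The cross-check relies on `statistics.quantiles` with `method="exclusive"`, which places its cut points at the same (n + 1)-based positions.

```python
from statistics import quantiles

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

def quartile_by_position(values, q):
    """Return the ((n + 1) / 4) * q-th ordered observation (1-based)."""
    ordered = sorted(values)
    position = (len(ordered) + 1) * q / 4
    if position != int(position):
        raise ValueError("position is not a whole number; interpolation needed")
    return ordered[int(position) - 1]

q1, q2, q3 = (quartile_by_position(data, q) for q in (1, 2, 3))
iqr = q3 - q1

# cross-check with the standard library
library_quartiles = quantiles(data, n=4, method="exclusive")

print(q1, q2, q3, iqr, library_quartiles)
```

Other software may use a different quartile convention and give slightly different cut points on the same data.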
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then the cut points are called Deciles.
Percentiles: If data is ordered and divided into 100 parts, then the cut points are called Percentiles. The 25th percentile is Q1, the 50th percentile is the Median (Q2), and the 75th percentile of the data is Q3.

In notation, the p-th percentile of a data set is the ((n+1)/100)p th observation of the ordered data, where p is the desired percentile and n is the number of observations.

Coefficient of Variation: The standard deviation of data divided by its mean. It is usually expressed in percent.

$$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$$
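
As a rough illustration, the coefficient of variation and the percentile positions can be computed as below. Both helper functions are hypothetical names introduced here. The data 5, 7, 3 are reused from the variance example, and n = 15 from the quartile example.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

def percentile_position(n, p):
    """1-based position ((n + 1) / 100) * p of the p-th percentile."""
    return (n + 1) * p / 100

cv = coefficient_of_variation([5, 7, 3])  # S = 2, mean = 5

# positions of the 25th, 50th and 75th percentiles among 15 ordered values
positions = [percentile_position(15, p) for p in (25, 50, 75)]

print(cv, positions)
```

The three positions coincide with those of Q1, Q2, and Q3 in the 15-number example.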
Five Number Summary

Five Number Summary: The five number summary of a distribution consists of the smallest (Minimum) observation, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the largest (Maximum) observation, written in order from smallest to largest.

Box Plot: A box plot is a graph of the five number summary. The central box spans the quartiles. A line within the box marks the median. Lines extending above and below the box mark the smallest and the largest observations (i.e., the range). Outlying samples may be additionally plotted outside the range.
Boxplot
[Figure: Boxplot of the distribution of age in months (vertical axis 0 to 160), with the minimum, Q1, median, Q3, and maximum marked.]
Choosing a Summary
The five number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers.

The mean and standard deviation are reasonable for symmetric distributions that are free of outliers.

In real life we can’t always expect symmetry of the data.

It’s common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data.
We can include other summary statistics, such as Q1, Q3, and the coefficient of variation, if they are considered important for describing the data.
DESCRIBING DATA BASED ON THE SHAPE OF THE HISTOGRAM
Shape of Data
• Shape of data is measured by
  • Skewness (asymmetry of the histogram)
  • Kurtosis (height, or peakedness, of the histogram)
Skewness
• Measures asymmetry of data
• Positive or right skewed: Longer right tail
• Negative or left skewed: Longer left tail

Let x1, x2, …, xn be n observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$
(measures of the deviation from normality)
• Skewness: the extent to which a distribution of values deviates from symmetry around the mean.
• A value of zero means the distribution is symmetric,
• positive skewness indicates a greater number of smaller values,
• negative skewness indicates a greater number of larger values.
• Values for acceptability for psychometric purposes (+/-1 to +/-2) are the same as with kurtosis.
Kurtosis
• Measures peakedness of the distribution of data. The kurtosis of the normal distribution is 0.

Let x1, x2, …, xn be n observations. Then,

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$$
(measures of the deviation from normality)
• Kurtosis: a measure of the "peakedness" or "flatness" of a distribution.
• A kurtosis value near zero indicates a shape close to normal.
• A positive value indicates a distribution which is more peaked than normal, with heavier tails.
• A negative kurtosis indicates a shape flatter than normal.
• An extreme positive kurtosis indicates a distribution where more of the values are located in the tails of the distribution rather than around the mean. A kurtosis value of +/-1 is considered very good for most psychometric uses, but +/-2 is also usually acceptable.
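
A minimal sketch of the two shape measures, assuming the moment-based definitions: third and fourth central moments scaled by powers of the second, with 3 subtracted so that a normal distribution scores 0 on kurtosis. The function names and sample lists are illustrative. Statistical packages often report bias-adjusted versions that differ slightly for small n.

```python
from math import sqrt

def skewness(values):
    """sqrt(n) * sum(d^3) / (sum(d^2))^(3/2), with d = x - mean."""
    n = len(values)
    x_bar = sum(values) / n
    ss2 = sum((x - x_bar) ** 2 for x in values)
    ss3 = sum((x - x_bar) ** 3 for x in values)
    return sqrt(n) * ss3 / ss2 ** 1.5

def kurtosis(values):
    """n * sum(d^4) / (sum(d^2))^2 - 3, i.e. excess kurtosis."""
    n = len(values)
    x_bar = sum(values) / n
    ss2 = sum((x - x_bar) ** 2 for x in values)
    ss4 = sum((x - x_bar) ** 4 for x in values)
    return n * ss4 / ss2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # symmetric about 3
right_skewed = [1, 1, 1, 10]     # one large value gives a longer right tail

print(skewness(symmetric), kurtosis(symmetric), skewness(right_skewed))
```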
Why are kurtosis and skewness important?
• Parametric tests generally assume normally distributed data

• If the data distribution is not normal, then use non-parametric tests
How to obtain kurtosis and skew in SPSS
• Choose Statistics, Descriptives
  Choose "Options"
  Select skew and kurtosis
Interpretation of Skew and Kurtosis Output
  Divide Skew by SE Skew and divide Kurtosis by SE Kurtosis
  Absolute values of 2 or more suggest skew or kurtosis

• Viewing Normality of Distribution
  Choose Charts, Histogram
  Enter variable
  Check "Display normal curve"
[Figure: three histograms of sample means (horizontal axis z, 0 to 20) for sample sizes n = 2, n = 5, and n = 15. The mean of the sample means is 10 in each panel; the SD of the sample means is 4.16, 2.41, and 0.87, respectively.]
Summary of the Variable ‘Age’ in the given data set

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

[Figure: Histogram of Age; horizontal axis: Age in Month (40 to 160); vertical axis: Number of Subjects (0 to 10).]
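
The entries of this summary can be checked against one another, and the "divide by SE" rule from the SPSS slide can be applied to the reported skewness and kurtosis. The sketch below is illustrative only. The two standard-error formulas are an assumption: they are the usual sample-size-based expressions that accompany bias-adjusted skewness and kurtosis, and they are not stated in the slides.

```python
from math import sqrt

# values copied from the summary table for 'Age'
n = 60
total = 5425
sd = 30.22979318
skew = 0.389872725
kurt = -1.183899591

mean_age = total / n          # Sum / Count
se_mean = sd / sqrt(n)        # Standard Error of the mean
sample_variance = sd ** 2     # Sample Variance
value_range = 143 - 48        # Maximum - Minimum

# ASSUMED standard errors of skewness and kurtosis (not given in the slides)
se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))

skew_ratio = skew / se_skew
kurt_ratio = kurt / se_kurt

print(mean_age, se_mean, sample_variance, value_range)
print(round(skew_ratio, 2), round(kurt_ratio, 2))
```

Under these assumed formulas both ratios are below 2 in absolute value, although the kurtosis ratio is close to the cut-off.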
[Figure: Boxplot of Age in Month; vertical axis: Age (month), 60 to 140.]
Class Summary (First Part)
So far we have learned:
Statistics and data presentation/data summarization

Graphical Presentation: Bar Chart, Pie Chart, Histogram, and Box Plot
Numerical Presentation: Measuring the central value of data (mean, median, mode, etc.), measuring dispersion (standard deviation, variance, coefficient of variation, range, inter-quartile range, etc.), quartiles, percentiles, and the five number summary

Any questions?
USING STATISTICAL SOFTWARE
Statistical Software: Which to Use?
There are many software packages to perform statistical analysis and visualization of data.
Some of them are SAS (Statistical Analysis System), S-plus, R, Matlab, Minitab, BMDP, Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM, HIL, MS Excel, etc. We will discuss MS Excel and SPSS in brief.

Some useful websites for more information on statistical software:

http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org
Microsoft Excel
A spreadsheet application. It features calculation, graphing tools, pivot tables, and a macro programming language called VBA (Visual Basic for Applications).

There are many versions of MS Excel. Excel XP, Excel 2003, and Excel 2007 are capable of performing a number of statistical analyses.

Starting MS Excel: Double click on the Microsoft Excel icon on the desktop or click on Start --> Programs --> Microsoft Excel.

Worksheet: Consists of a grid of cells with numbered rows down the page and alphabetically titled columns across the page. Each cell is referenced by its coordinates. For example, A3 is used to refer to the cell in column A and row 3. B10:B20 is used to refer to the range of cells in column B and rows 10 through 20.
Opening a document: File --> Open (from an existing workbook). Change the directory area or drive to look for files in other locations.
Creating a new workbook: File --> New --> Blank Document
Saving a file: File --> Save

Selecting more than one cell: Click on a cell (e.g. A1), then hold the Shift key and click on another (e.g. D4) to select the cells between A1 and D4, or click on a cell and drag the mouse across the desired range.

Creating Formulas: 1. Click the cell in which you want to enter the formula, 2. Type = (an equal sign), 3. Click the Function button (fx), 4. Select the formula you want and step through the on-screen instructions.
Entering Date and Time: Dates are stored as MM/DD/YYYY. No need to enter in that format. For example, Excel will recognize jan 9 or jan-9 as 1/9/2007 (the current year is assumed) and jan 9, 1999 as 1/9/1999. To enter today’s date, press Ctrl and ; together. Use a or p to indicate am or pm. For example, 8:30 p is interpreted as 8:30 pm. To enter the current time, press Ctrl and : together.

Copy and Paste all cells in a Sheet: Ctrl+A for selecting, Ctrl+C for copying, and Ctrl+V for pasting.

Sorting: Data --> Sort --> Sort By …

Descriptive Statistics and other Statistical methods: Tools --> Data Analysis --> Statistical method. If Data Analysis is not available, then click on Tools --> Add-Ins and select Analysis ToolPak and Analysis ToolPak - VBA.
Statistical and Mathematical Functions: Start with the ‘=’ sign and then select a function from the function wizard (fx).

Inserting a Chart: Click on Chart Wizard (or Insert --> Chart), select the chart type, give the input data range, update the chart options, and select the output range/worksheet.

Importing Data in Excel: File --> Open --> File Type --> click on the file --> choose an option (Delimited/Fixed Width) --> choose the delimiter (Tab/Semicolon/Comma/Space/Other) --> Finish.

Limitations: Excel uses algorithms that are vulnerable to rounding and truncation errors and may produce inaccurate results in extreme cases.
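
To give a feel for what a rounding error in an "extreme case" can look like, the sketch below compares two algebraically equivalent variance formulas in ordinary floating-point arithmetic. It is an illustration of the general problem only. It does not claim to reproduce the algorithm of any particular Excel version, and the data values are made up so that the exact sample variance is 30.

```python
from statistics import variance

# four large values; deviations from the mean are -6, -3, 3, 6,
# so the exact sample variance is (36 + 9 + 9 + 36) / 3 = 30
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(data)

# one-pass "calculator" formula: subtracts two huge, nearly equal numbers,
# so rounding in the squares dominates the small true difference
one_pass = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# two-pass formula: subtract the mean first, then square the small deviations
x_bar = sum(data) / n
two_pass = sum((x - x_bar) ** 2 for x in data) / (n - 1)

print(one_pass, two_pass, variance(data))
```

The two-pass result and the library result agree with the exact value, while the one-pass result does not. This is why the choice of algorithm matters for data with large magnitudes and small spread.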
Statistical Package for the Social Sciences (SPSS)
A general-purpose statistical package, SPSS is widely used in the social sciences, particularly in sociology and psychology.
SPSS can import data from almost any type of file to generate tabulated reports, plots of distributions and trends, descriptive statistics, and complex statistical analyses.
Starting SPSS: Double click on SPSS on the desktop or Programs ⇒ SPSS.

Opening an SPSS file: File ⇒ Open

MENUS AND TOOLBARS

• Data Editor
Various pull-down menus appear at the top of the Data Editor window. These pull-down menus are at the heart of using SPSSWIN. The Data Editor menu items (with some of the uses of the menu) are:
FILE        used to open and save data files

EDIT        used to copy and paste data values; used to find data in a file; insert variables and cases; OPTIONS allows the user to set general preferences as well as the setup for the Navigator, Charts, etc.

VIEW        user can change toolbars; value labels can be seen in cells instead of data values

DATA        select, sort or weight cases; merge files

TRANSFORM   compute new variables, recode variables, etc.

ANALYZE     perform various statistical procedures

GRAPHS      create bar and pie charts, etc.

UTILITIES   add comments to accompany data file (and other, advanced features)

ADD-ONS     these are features not currently installed (advanced statistical procedures)

WINDOW      switch between data, syntax and navigator windows

HELP        to access SPSSWIN Help information

Navigator (Output) Menus
When statistical procedures are run or charts are created, the output will appear in the Navigator window. The Navigator window contains many of the pull-down menus found in the Data Editor window. Some of the important menus in the Navigator window include:

INSERT      used to insert page breaks, titles, charts, etc.

FORMAT      for changing the alignment of a particular portion of the output

• Formatting Toolbar
When a table has been created by a statistical procedure, the user can edit the
table to create a desired look or add/delete information. Beginning with version
14.0, the user has a choice of editing the table in the Output or opening it in a
separate Pivot Table (DEFINE!) window. Various pulldown menus are activated
when the user double clicks on the table. These include:

EDIT undo and redo a pivot, select a table or table body (e.g., to
change the font)

INSERT used to insert titles, captions and footnotes

PIVOT used to perform a pivot of the row and column variables

FORMAT various modifications can be made to tables and cells


• Additional menus
CHART EDITOR used to edit a graph

SYNTAX EDITOR used to edit the text in a syntax window


• Show or hide a toolbar

Click on VIEW ⇒ TOOLBARS, then check the toolbar’s box to show it or uncheck the box to hide it


• Move a toolbar

Click on the toolbar (but not on one of the pushbuttons) and then drag the toolbar to
its new location

• Customize a toolbar

Click on VIEW ⇒ TOOLBARS ⇒ CUSTOMIZE


Importing data from an EXCEL spreadsheet:
Data from an Excel spreadsheet can be imported into SPSSWIN as follows:
1. In SPSSWIN click on FILE ⇒ OPEN ⇒ DATA. The OPEN DATA FILE Dialog
Box will appear.
2. Locate the file of interest: Use the "Look In" pull-down list to identify the folder
containing the Excel file of interest
3. From the FILE TYPE pull down menu select EXCEL (*.xls).
4. Click on the file name of interest and click on OPEN or simply double-click on
the file name.
5. Keep the box checked that reads "Read variable names from the first row of
data". This presumes that the first row of the Excel data file contains the variable
names. [If the data reside in a different worksheet in the Excel
file, this would need to be entered.]
6. Click on OK. The Excel data file will now appear in the SPSSWIN Data
Editor.
7. The former EXCEL spreadsheet can now be saved as an SPSS file (FILE ⇒
SAVE AS) and is ready to be used in analyses. Typically, you would label variables
and values, and define missing values.
Importing an Access table
SPSSWIN does not offer a direct import for Access tables. Therefore, we must follow
these steps:
1. Open the Access file
2. Open the data table
3. Save the data as an Excel file
4. Follow the steps outlined in the data import from Excel Spreadsheet to SPSSWIN.
Importing Text Files into SPSSWIN
Text data points typically are separated (or “delimited”) by tabs or commas.
Sometimes they can be of fixed format.
Importing tab-delimited data
In SPSSWIN click on FILE ⇒ OPEN ⇒ DATA. Look in the appropriate location for
the text file. Then select “Text” from “Files of type”: Click on the file name and then
click on “Open.” You will see the Text Import Wizard – step 1 of 6 dialog box.

You will now have an SPSS data file containing the former tab-delimited data. You
simply need to add variable and value labels and define missing values.

Exporting Data to Excel

Click on FILE ⇒ SAVE AS. Click on the File Name for the file to be exported. For the “Save as Type” select Excel (*.xls) from the pull-down menu. You will notice the checkbox for “write variable names to spreadsheet.” Leave this checked, as you will want the variable names to be in the first row of each column in the Excel spreadsheet. Finally, click on Save.
Running the FREQUENCIES procedure

1. Open the data file of interest (from the menus, click on FILE ⇒ OPEN ⇒ DATA).

2. From the menus, click on ANALYZE ⇒ DESCRIPTIVE STATISTICS ⇒ FREQUENCIES.
3. The FREQUENCIES Dialog Box will appear. In the left-hand box will be a listing ("source variable list") of all the variables that have been defined in the data file. The first step is identifying the variable(s) for which you want to run a frequency analysis. Click on a variable name(s). Then click the [ > ] pushbutton. The variable name(s) will now appear in the VARIABLE[S]: box ("selected variable list"). Repeat these steps for each variable of interest.

4. If all that is being requested is a frequency table showing count, percentages (raw, adjusted and cumulative), then click on OK.
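
For readers who want to see what such a frequency table contains, here is a rough Python sketch that builds the same columns by hand. The data and names are hypothetical. "Raw" percent is taken over all cases, "adjusted" percent over non-missing cases only, and cumulative percent is the running total of the adjusted column. This is one reading of the terms and not a reproduction of SPSS output.

```python
from collections import Counter

# hypothetical responses; None marks a missing value
responses = ["A", "B", "A", "O", None, "A", "O", "AB", "B", "O"]

counts = Counter(r for r in responses if r is not None)
n_total = len(responses)
n_valid = sum(counts.values())

rows = []
cumulative = 0.0
for value, count in sorted(counts.items()):
    raw_pct = 100 * count / n_total   # percent of all cases
    adj_pct = 100 * count / n_valid   # percent of non-missing cases
    cumulative += adj_pct             # running total of adjusted percent
    rows.append((value, count, raw_pct, adj_pct, cumulative))

print(f"{'Value':<6}{'Count':>6}{'Raw %':>8}{'Adj %':>8}{'Cum %':>8}")
for value, count, raw_pct, adj_pct, cum_pct in rows:
    print(f"{value:<6}{count:>6}{raw_pct:>8.1f}{adj_pct:>8.1f}{cum_pct:>8.1f}")
```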
Requesting STATISTICS
Descriptive and summary STATISTICS can be requested for numeric variables. To
request Statistics:
1. From the FREQUENCIES Dialog Box, click on the STATISTICS... pushbutton.
2. This will bring up the FREQUENCIES: STATISTICS Dialog Box.
3. The STATISTICS Dialog Box offers the user a variety of choices:

DESCRIPTIVES

The DESCRIPTIVES procedure can be used to generate descriptive statistics (click on ANALYZE ⇒ DESCRIPTIVE STATISTICS ⇒ DESCRIPTIVES). The procedure offers many of the same statistics as the FREQUENCIES procedure, but without generating frequency analysis tables.
Requesting CHARTS
One can request a chart (graph) to be created for a variable or variables included in
a FREQUENCIES procedure.

1. In the FREQUENCIES Dialog box click on CHARTS.
2. The FREQUENCIES: CHARTS Dialog box will appear. Choose the intended chart (e.g. bar diagram, pie chart, histogram).

Pasting charts into Word

1. Click on the chart.
2. Click on the pulldown menu EDIT ⇒ COPY OBJECTS
3. Go to the Word document in which the chart is to be embedded. Click on EDIT ⇒ PASTE SPECIAL
4. Select Formatted Text (RTF) and then click on OK
5. Enlarge the graph to a desired size by dragging one or more of the black squares along the perimeter (if the black squares are not visible, click once on the graph).
BASIC STATISTICAL PROCEDURES: CROSSTABS
1. From the ANALYZE pull-down menu, click on DESCRIPTIVE STATISTICS ⇒
CROSSTABS.
2. The CROSSTABS Dialog Box will then open.

3. From the variable selection box on the left click on a variable you wish to
designate as the Row variable. The values (codes) for the Row variable make up
the rows of the crosstabs table. Click on the arrow (>) button for Row(s). Next,
click on a different variable you wish to designate as the Column variable. The
values (codes) for the Column variable make up the columns of the crosstabs
table. Click on the arrow (>) button for Column(s).

4. You can specify more than one variable in the Row(s) and/or Column(s). A
cross table will be generated for each combination of Row and Column variables.
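
The idea behind a crosstab (counting cases for every combination of a row value and a column value) can be sketched in a few lines of Python. The cases below are hypothetical, and the layout only loosely imitates a statistical package's table.

```python
from collections import Counter

# hypothetical cases as (row variable, column variable) pairs
cases = [("Male", "O"), ("Female", "A"), ("Male", "A"),
         ("Female", "O"), ("Male", "O"), ("Female", "B")]

cell_counts = Counter(cases)                        # count per (row, column) cell
row_values = sorted({row for row, _ in cases})      # codes of the Row variable
col_values = sorted({col for _, col in cases})      # codes of the Column variable

# header line, then one line per row value with a row total
print("".ljust(8) + "".join(c.rjust(4) for c in col_values) + "Total".rjust(7))
for r in row_values:
    counts = [cell_counts[(r, c)] for c in col_values]
    print(r.ljust(8) + "".join(str(k).rjust(4) for k in counts)
          + str(sum(counts)).rjust(7))
```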
Limitations: SPSS users have less control over data manipulation and statistical output than users of other statistical packages such as SAS, Stata, etc.

SPSS is a good first statistical package for quantitative research in social science because it is easy to use and because it can be a good starting point for learning more advanced statistical packages.
REVIEW: COURSE OUTLINE
1. Basic Statistical Concepts
2. Basic Concepts in Research
3. Data Collection
4. Organizing & Presenting Data
5. Averages and measures of position
6. Measures of Variability
7. Permutations and combinations
8. Probability
9. The normal Distribution
10. Correlation and simple regression
11. Tests of hypothesis for means
12. Analysis of Enumeration data
7. Permutations and combinations
10.3 – Using Permutations and Combinations
Permutation: The number of ways in which a subset of objects can be selected from a given set of objects, where order is important.
Given the set of three letters, {A, B, C}, how many possibilities are there for selecting any two letters where order is important?
(AB, AC, BC, BA, CA, CB)

Combination: The number of ways in which a subset of objects can be selected from a given set of objects, where order is not important.
Given the set of three letters, {A, B, C}, how many possibilities are there for selecting any two letters where order is not important?
(AB, AC, BC).
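
The two listings above can be generated mechanically. This is a small sketch using Python's standard `itertools` module; the variable names are illustrative.

```python
from itertools import combinations, permutations

letters = "ABC"

# order matters: AB and BA are different selections
ordered = ["".join(p) for p in permutations(letters, 2)]

# order does not matter: AB and BA are the same selection
unordered = ["".join(c) for c in combinations(letters, 2)]

print(ordered, len(ordered))
print(unordered, len(unordered))
```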
Factorial Formula for Permutations

$${}_nP_r = \frac{n!}{(n - r)!}$$

Factorial Formula for Combinations

$${}_nC_r = \frac{{}_nP_r}{r!} = \frac{n!}{r!\,(n - r)!}$$
Evaluate each problem.
a) 5P3 = 5·4·3 = 60
b) 5C3 = 10
c) 6P6 = 720
d) 6C6 = 1
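
The factorial formulas can be written out directly and cross-checked against Python's built-in `math.perm` and `math.comb` (available from Python 3.8). The helper names `n_p_r` and `n_c_r` are illustrative.

```python
from math import comb, factorial, perm

def n_p_r(n, r):
    """Permutations: n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

def n_c_r(n, r):
    """Combinations: nPr / r! = n! / (r! (n - r)!)"""
    return n_p_r(n, r) // factorial(r)

# a) 5P3   b) 5C3   c) 6P6   d) 6C6
answers = [n_p_r(5, 3), n_c_r(5, 3), n_p_r(6, 6), n_c_r(6, 6)]

# the same four values from the standard library
library_answers = [perm(5, 3), comb(5, 3), perm(6, 6), comb(6, 6)]

print(answers, library_answers)
```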
How many ways can you select two letters followed by three digits for an ID if repeats are not allowed?
Two parts:
1. Determine the set of two letters: 26P2 = 26·25 = 650
2. Determine the set of three digits: 10P3 = 10·9·8 = 720
650 · 720 = 468,000
A common form of poker involves hands (sets) of five cards each, dealt from a deck consisting of 52 different cards. How many different 5-card hands are possible?
Hint: Repetitions are not allowed and order is not important.

52C5 = 2,598,960 5-card hands
Find the number of different subsets of size 3 in the set: {m, a, t, h, r, o, c, k, s}.
9C3 = 84 different subsets

Find the number of arrangements of size 3 in the set: {m, a, t, h, r, o, c, k, s}.
9P3 = 9·8·7 = 504 arrangements
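
The three worked problems (ID codes, poker hands, and the nine-letter set) can be verified with the same standard-library functions. This is a sketch with illustrative variable names.

```python
from math import comb, perm

# two different letters in order, then three different digits in order
id_codes = perm(26, 2) * perm(10, 3)

# five cards from 52, order not important
poker_hands = comb(52, 5)

letters = "mathrocks"                   # the nine distinct letters of the set
subsets = comb(len(letters), 3)         # order not important
arrangements = perm(len(letters), 3)    # order important

print(id_codes, poker_hands, subsets, arrangements)
```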
Guidelines on Which Method to Use
8. PROBABILITY
Basic Probability Concepts

Dr Mona Hassan Ahmed Hassan
Prof of Biostatistics
High Institute of Public Health
Alexandria, Egypt
Introduction

People use the term probability many times each day. For example, a physician says that a patient has a 50-50 chance of surviving a certain operation. Another physician may say that she is 95% certain that a patient has a particular disease.
Definition

If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait, E, the probability of the occurrence of E is read as
P(E) = m/N
Properties

The probability ranges between 0 and 1
If an outcome cannot occur, its probability is 0
If an outcome is sure, it has a probability of 1
The sum of the probabilities of mutually exclusive outcomes that together cover all possibilities is equal to 1
P(M) + P(F) = 1
Definition

Experiment ==> any planned process of data collection. It consists of a number of trials (replications) under the same conditions.
Definition
Sample space: collection of unique, non-overlapping possible outcomes of a random circumstance.

Simple event: one outcome in the sample space; a possible outcome of a random circumstance.
Event: a collection of one or more simple events in the sample space; often written as A, B, C, and so on

Example: Male, Female
Definition
Complement ==> sometimes, we want to know the probability that an event will not happen; an event opposite to the event of interest is called a complementary event.
If A is an event, its complement is written A^C or Ā.
Example: The complement of the male event is the female event.

P(A) + P(A^C) = 1
Views of Probability:

1- Subjective:

It is an estimate that reflects a person’s opinion, or best guess, about whether an outcome will occur.

Important in medicine: it forms the basis of a physician’s opinion (based on information gained in the history and physical examination) about whether a patient has a specific disease. Such an estimate can be changed with the results of diagnostic procedures.
2- Objective
Classical
• It is well known that the probability of flipping a fair coin and getting a “tail” is 0.50.
• If a coin is flipped 10 times, is there a guarantee that exactly 5 tails will be observed?
• What if the coin is flipped 100 times? 1000 times?
• As the number of flips becomes larger, the proportion of coin flips that result in tails approaches 0.50.
Example: Probability of Male versus
Female Births
Long-run relative frequency of males born in
KSA is about 0.512 (512 boys born per 1000
births)
Table provides results of simulation: the proportion is far
from .512 over the first few weeks but in the long run
settles down around .512.
2- Objective
Relative frequency

Assume that an experiment can be repeated many times and that one or more outcomes can result from each repetition. Then the probability of a given outcome is the number of times that outcome occurs divided by the total number of repetitions.
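
The long-run behaviour described for coin flips and for male births can be imitated with a small simulation. This is an illustrative sketch only. The function name, the seed, and the trial counts are assumptions, and the simulated "births" are random draws with probability 0.512, not real data.

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Proportion of n_trials simulated trials in which an event with
    probability p occurs (fixed seed so the run is repeatable)."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n_trials))
    return hits / n_trials

# tails with a fair coin (p = 0.50) and male births (p = 0.512)
for n in (10, 100, 1000, 100_000):
    print(n, relative_frequency(0.50, n), relative_frequency(0.512, n))
```

With few trials the proportions can be far from 0.50 and 0.512; with many trials they settle close to those values.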
Problem 1.

Blood Group   Males   Females   Total
O             20      20        40
A             17      18        35
B             8       7         15
AB            5       5         10
Total         50      50        100
Problem 2.

An outbreak of food poisoning occurs in a group of students who attended a party.

                        Ill    Not Ill   Total
Ate Barbecue            90     30        120
Did Not Eat Barbecue    20     60        80
Total                   110    90        200



Marginal probabilities
Named so because they appear on the “margins” of a probability table. A marginal probability is the probability of a single outcome.
Example: In Problem 1, P(Male), P(Blood group A)
P(Male) = number of males/total number of subjects
= 50/100
= 0.5
Conditional probabilities
A conditional probability is the probability of an event on condition that a certain criterion is satisfied.

Example: If a subject was selected randomly and found to be female, what is the probability that she has blood group O?
Here the total possible outcomes constitute a subset (females) of the total number of subjects.
This probability is termed the probability of O given F
P(O\F) = 20/50
= 0.40
Joint probability
It is the probability of occurrence of two or more events together

Example: Probability of being male and belonging to blood group AB
P(M and AB) = P(M∩AB)
= 5/100
= 0.05
∩ = intersection
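
The marginal, conditional, and joint probabilities for Problem 1 can be read off the table programmatically. The sketch below stores the counts in a dictionary (an illustrative layout) and uses exact fractions, so no rounding is involved.

```python
from fractions import Fraction

# counts from the Problem 1 table: blood group by sex
table = {
    "O":  {"Male": 20, "Female": 20},
    "A":  {"Male": 17, "Female": 18},
    "B":  {"Male": 8,  "Female": 7},
    "AB": {"Male": 5,  "Female": 5},
}

total = sum(sum(row.values()) for row in table.values())
n_male = sum(row["Male"] for row in table.values())
n_female = sum(row["Female"] for row in table.values())

# marginal: uses a margin total and the grand total
p_male = Fraction(n_male, total)

# conditional: the denominator is the subset (females), not the grand total
p_o_given_female = Fraction(table["O"]["Female"], n_female)

# joint: a single cell over the grand total
p_male_and_ab = Fraction(table["AB"]["Male"], total)

print(float(p_male), float(p_o_given_female), float(p_male_and_ab))
```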
Rules of probability

1- Multiplication rule

Independence and multiplication rule

P(A and B) = P(A) P(B)

A and B are independent: P(B\A) = P(B)
Example:
The joint probability of being male and having blood type O
To check whether two events are independent, compute the marginal and conditional probabilities of one of them. If they are equal, the two events are independent. If they are not equal, the two events are dependent.
P(O) = 40/100 = 0.40
P(O\M) = 20/50 = 0.40
Then the two events are independent
P(O∩M) = P(O)P(M) = (40/100)(50/100)
= 0.20
Rules of probability

1- Multiplication rule

Dependence and the modified multiplication rule

P(A and B) = P(A) P(B\A)

A and B are not independent: P(B\A) ≠ P(B)
Example:
The joint probability of being ill and eating barbecue

P(Ill) = 110/200 = 0.55
P(Ill\Eat B) = 90/120 = 0.75
Then the two events are dependent
P(Ill∩Eat B) = P(Eat B)P(Ill\Eat B)
= (120/200)(90/120)
= 0.45
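
The independence check used in the two examples (compare the marginal probability with the conditional one) can be sketched as a small function. The name `is_independent` and its argument order are illustrative. Exact fractions are used so the equality test is reliable.

```python
from fractions import Fraction

def is_independent(n_a_and_b, n_a, n_b, n_total):
    """True when the marginal P(B) equals the conditional P(B given A)."""
    p_b = Fraction(n_b, n_total)
    p_b_given_a = Fraction(n_a_and_b, n_a)
    return p_b == p_b_given_a

# Problem 1: A = male (50), B = blood group O (40), both = 20, total = 100
blood_independent = is_independent(20, 50, 40, 100)

# Problem 2: A = ate barbecue (120), B = ill (110), both = 90, total = 200
illness_independent = is_independent(90, 120, 110, 200)

# modified multiplication rule: P(A and B) = P(A) * P(B given A)
p_ill_and_ate = Fraction(120, 200) * Fraction(90, 120)

print(blood_independent, illness_independent, float(p_ill_and_ate))
```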
Rules of probability

2- Addition rule
A and B are mutually exclusive
The occurrence of one event precludes the occurrence of the other

Addition Rule

P(A OR B) = P(A U B) = P(A) + P(B)

Example:
The probability of being either blood
type O or blood type A
P(OUA) = P(O) + P(A)
= (40/100)+(35/100)
= 0.75
A and B are not mutually exclusive
(Can occur together)
Example: Male and smoker

Modified Addition Rule

P(A OR B) = P(A U B) = P(A) + P(B) - P(A ∩ B)
Example:

Two events are not mutually exclusive (male gender and blood type O).
P(M OR O) = P(M)+P(O) – P(M∩O)
= 0.50 + 0.40 – 0.20
= 0.70
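
Both forms of the addition rule fit in one small function, sketched below. The name `p_union` is illustrative. For mutually exclusive events the intersection term is simply zero.

```python
from fractions import Fraction

def p_union(p_a, p_b, p_a_and_b=Fraction(0)):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# mutually exclusive: blood type O or blood type A
p_o_or_a = p_union(Fraction(40, 100), Fraction(35, 100))

# not mutually exclusive: male or blood type O (20 of 100 are both)
p_m_or_o = p_union(Fraction(50, 100), Fraction(40, 100), Fraction(20, 100))

print(float(p_o_or_a), float(p_m_or_o))
```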
Exercises

1. If tuberculous meningitis had a case fatality of 20%,
(a) Find the probability that this disease would be fatal in two randomly selected patients (the two events are independent)
(b) If two patients are selected randomly, what is the probability that at least one of them will die?

(a) P(first dies and second dies) = 20% × 20% = 0.04
(b) P(first dies or second dies)
= P(first dies) + P(second dies) - P(both die)
= 20% + 20% - 4%
= 36%
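
The answer to part (b) can be double-checked through the complement: "at least one dies" is the opposite of "both survive". This is a sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

p_die = Fraction(20, 100)

# (a) independent events: multiply
p_both = p_die * p_die

# (b) modified addition rule
p_at_least_one = p_die + p_die - p_both

# (b) again, as 1 - P(both survive)
p_via_complement = 1 - (1 - p_die) ** 2

print(float(p_both), float(p_at_least_one), float(p_via_complement))
```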
2. In a normally distributed population, the probability that a subject’s blood cholesterol level will be lower than 1 SD below the mean is 16%, and the probability of a blood cholesterol level higher than 2 SD above the mean is 2.5%. What is the probability that a randomly selected subject will have a blood cholesterol level lower than 1 SD below the mean or higher than 2 SD above the mean?

The two events are mutually exclusive, so
P(blood cholesterol level lower than 1 SD below the mean or higher than 2 SD above the mean) = 16% + 2.5%
= 18.5%
3. In a study of the optimum dose of lignocaine required to reduce pain on injection of an intravenous agent used for induction of anesthesia, four dosing groups were considered (group A received no lignocaine, while groups B, C, and D received 0.1, 0.2, and 0.4 mg/kg, respectively). The following table shows the patients cross-classified by dose and pain score:

Pain score   Group A   Group B   Group C   Group D   Total
0            49        73        58        62        242
1            16        7         7         8         38
2            8         5         6         6         25
3            4         1         0         0         5
Total        77        86        71        76        310

Compute the following probabilities for a randomly selected patient:
1. being of group D and experiencing no pain
2. belonging to group B or having a pain score of 2
3. having a pain score of 3 given that
Nightlights and Myopia
Assuming these data are representative of a larger population, what is the approximate probability that someone from that population who sleeps with a nightlight in early childhood will develop some degree of myopia?

Note: 72 + 7 = 79 of the 232 nightlight users developed some degree of myopia. So we estimate the probability to be 79/232 = 0.34.
Assignment:
• Daniel WW.
• Page 76-81
• Questions:
4, 6, 8, 10, 12, 14, 16, 18, 20, 22
