Econ 656 - Research Methods V
Part V
Data Processing and Analysis
Content of the Lecture
1. Introduction
2. Data Preparation and Processing
3. Analysis of Data
A. Univariate analysis
B. Bivariate analysis
C. Multivariate analysis
Introduction
Once data are acquired you will need to work with the
data and use them to address your research questions.
Working with data means:
Getting to know your data, becoming familiar with
what you have, ensuring completeness, etc.
Example: check the data for completeness and accuracy.
Even the most well-prepared and designed research
will probably have some problems with completeness
and accuracy.
Data analysis is the process of working with the data to
describe, discuss, interpret, evaluate, and
explain it in terms of the research questions or
hypothesis.
It is the computation of indices or measures
and searching for patterns and relationships.
It can range from simple summary statistics to
extremely complex multivariate analyses.
Collected data must be converted into a machine-
readable, numeric format so that they can be
analyzed by computer programs.
Most quantitative data analysis is conducted using
software programs such as SPSS, Stata, EViews, and SAS.
Of course, the analysis starts as soon as the project is
conceived.
By the time the project is designed and planned, the
analysis strategy and process should be clear.
Data Preparation and Processing
Data processing is the method by which the
collected data is organized for further analysis.
It is an intermediary stage between the
collection of data and their analysis and
interpretation.
The step involves editing, coding, classification,
and tabulation of data in order to make them
amenable to analysis.
Editing is the process of examining the collected raw
data to detect errors and omissions.
The first step in analyzing data is to “clean” it of
any obvious data entry errors:
Outliers? (really high or low numbers)
Example: Age = 110 (really 10 or 11?)
Value entered that doesn’t exist for a variable?
Example: 2 entered where 0 = male, 1 = female
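The two checks above (implausible values and codes that do not exist for a variable) can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the 0 to 100 age range are assumptions made for the example, not part of the lecture data.

```python
# Hypothetical raw records; field names and values are assumptions for illustration.
records = [
    {"id": 101, "age": 34, "sex": 0},
    {"id": 102, "age": 110, "sex": 1},  # suspicious age (really 10 or 11?)
    {"id": 103, "age": 27, "sex": 2},   # 2 is not a valid code (0 = male, 1 = female)
]

VALID_SEX_CODES = {0, 1}
AGE_RANGE = (0, 100)  # assumed plausible range for this survey


def find_entry_errors(rows):
    """Return (id, variable, value) for every value that fails a check."""
    problems = []
    for row in rows:
        if not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            problems.append((row["id"], "age", row["age"]))
        if row["sex"] not in VALID_SEX_CODES:
            problems.append((row["id"], "sex", row["sex"]))
    return problems


errors = find_entry_errors(records)
for record_id, variable, value in errors:
    print(f"Questionnaire {record_id}: check {variable} = {value}")
```

Flagged values should be checked against the original questionnaire rather than corrected by guesswork.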
It is a careful scrutiny of the completed
questionnaire to assure that the data are:
Accurate
Consistent with other facts gathered
Uniformly entered
Check:
Are the responses legible/readable?
Are all important questions answered?
Are the responses complete?
Editing can be either field editing or central (office) editing.
Field-level editing: after an interview, field workers
should review their reporting forms, complete what
was abbreviated, translate personal shorthand, rewrite
illegible entries, and make callbacks if necessary.
Central editing takes place when all forms have been
completed and returned to the office.
Data editors correct obvious errors such as an entry
in the wrong place or a value recorded in the wrong units.
Checking questionnaires’ identifiers
Each questionnaire needs a unique identifier.
Sometimes this will be assigned prior to the
fieldwork by numbering the questionnaires.
An identifier should be given to each data source.
Example: all questionnaires from people in
Addis Ababa may start with ‘1’ (101, 102,
103, etc.) and those from Haramaya with ‘2’
(201, 202, 203, etc.).
This makes it easier to sort out where
questionnaires have come from, and
allows analysis to be carried out on the two sets
of questionnaires separately.
It also enables the researcher to refer back to the
participants
(if, for example, the researcher wishes to involve
them in further research).
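As a rough sketch of how such identifiers can be used, the Python snippet below checks that identifiers are unique and groups questionnaires by site. The prefix scheme follows the Addis Ababa/Haramaya example; the helper name `site_of` and the list of identifiers are hypothetical.

```python
# Site prefixes follow the example in the text; everything else is illustrative.
SITE_BY_PREFIX = {"1": "Addis Ababa", "2": "Haramaya"}


def site_of(questionnaire_id):
    """Look up the data source from the first digit of the identifier."""
    return SITE_BY_PREFIX[str(questionnaire_id)[0]]


ids = [101, 102, 201, 103, 202]  # hypothetical identifiers

# Each questionnaire needs a unique identifier.
if len(ids) != len(set(ids)):
    raise ValueError("duplicate questionnaire identifier")

# Group identifiers by site so the two sets can be analyzed separately.
by_site = {}
for qid in ids:
    by_site.setdefault(site_of(qid), []).append(qid)

print(by_site)
```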
Incomplete data
What we do with the data depends on the number of
missing cases and the possible reasons for
incompleteness.
Can the data still be used, perhaps to address some of
your research questions?
Decide whether to reject incomplete questionnaires
or to include the partial information.
Whichever you choose, state the decision clearly when
writing up and discussing your findings.
Inconsistent data
Sometimes you will find that the information
given by a respondent within a questionnaire is
inconsistent.
This can be the case with both factual and value
data.
Example: a participant may give his/her year
of birth as 1990 but also record that he/she has
children born in 2000.
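A consistency check of this kind can be automated. The sketch below flags respondents whose children were born implausibly soon after their own birth year; the records, field names, and the 15-year minimum gap are assumptions chosen for illustration.

```python
# Assumed plausibility rule: a parent is at least 15 years older than a child.
MIN_PARENT_AGE = 15

# Hypothetical respondents; field names are illustrative.
respondents = [
    {"id": 101, "birth_year": 1990, "children_birth_years": [2000]},
    {"id": 102, "birth_year": 1975, "children_birth_years": [1999, 2003]},
]


def inconsistent_ids(rows, min_gap=MIN_PARENT_AGE):
    """Return ids of respondents with at least one implausible child birth year."""
    flagged = []
    for row in rows:
        gaps = [child - row["birth_year"] for child in row["children_birth_years"]]
        if any(gap < min_gap for gap in gaps):
            flagged.append(row["id"])
    return flagged


flagged = inconsistent_ids(respondents)
print(flagged)
```

The check only identifies suspect questionnaires; deciding which of the two answers is wrong still requires going back to the source.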
Coding: Some forms of data are in an unstructured
form.
Examples: answers to open questions.
In order to quantify and analyze such materials, the
researcher has to code them.
Coding refers to the process of assigning numerals to
answers.
It is the process of converting data into numeric
format.
Common coding systems (code and label) for dichotomous
variables:
0 = No 1 = Yes
0 = Male 1 = Female
The coding process is similar for other categorical variables.
Example: possible coding for education:
0 = High school
1 = Diploma
2 = First Degree
3 = Second Degree, etc.
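The code-and-label scheme above maps directly onto a lookup table. The Python sketch below uses the codes listed in the text; the `encode` helper and the sample answers are hypothetical.

```python
# Codes taken from the examples in the text.
SEX_CODES = {"Male": 0, "Female": 1}
EDUCATION_CODES = {
    "High school": 0,
    "Diploma": 1,
    "First Degree": 2,
    "Second Degree": 3,
}


def encode(answer, codebook):
    """Convert a text answer to its numeric code; fail loudly on unknown answers."""
    if answer not in codebook:
        raise ValueError(f"no code defined for answer: {answer!r}")
    return codebook[answer]


# Hypothetical answers as (sex, education) pairs.
answers = [("Female", "Diploma"), ("Male", "Second Degree")]
coded = [(encode(sex, SEX_CODES), encode(edu, EDUCATION_CODES)) for sex, edu in answers]
print(coded)
```

Raising an error for an answer without a code, instead of silently skipping it, makes gaps in the coding scheme visible early.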
A codebook, a comprehensive document containing a
detailed description of each variable, should be
created.
The coding must be:
Exhaustive - there must be a class for every data
item.
i.e., the list of categories must be complete and
therefore cover all possibilities.
If it is not, some material will not be capable of
being coded.
Mutually exclusive: category components should
be mutually exclusive
i.e. specific answers can be placed in one and
only one cell in a given category set.
The categories that are generated must not overlap.
If they do, the numbers that are assigned to
them cannot be applied to distinct categories.
Example
You can consider each of the listed foods as a variable
and code each variable as 1 if it is ticked and 2 if it
is not ticked.
Data entry: Coded data can be entered into a
spreadsheet, database, text file, or directly into a
statistical program like SPSS or STATA.
Observations can be entered as rows in the
spreadsheet and measurement items can be
represented as columns.
The entered data should be checked for accuracy,
via occasional spot checks during and after entry.
Classification is the process of arranging data
into groups or classes according to their common
characteristics.
It reduces a large volume of data into homogeneous
groups of manageable size.
Classification can be according to:
Attributes: descriptive characteristics such as literacy, sex,
and honesty.
Class intervals: numerical characteristics such as weight, age,
height, income, and expenditure, grouped into ranges,
e.g., income: Birr 2001 to Birr 4000, Birr 4001 to Birr
6000, and so on.
Geographical classification
Chronological Classification
Alphabetical Classification
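Classification by class intervals can be sketched as follows. The first two intervals come from the income example above; the third interval, the income values, and the assumption that incomes are recorded in whole Birr are illustrative only. The function also checks that each value falls into exactly one class, which is the exhaustive and mutually exclusive requirement.

```python
from collections import Counter

# First two intervals are from the text; the third is an assumed continuation.
INTERVALS = [(2001, 4000), (4001, 6000), (6001, 8000)]


def classify(income):
    """Return the label of the single class interval containing income (whole Birr)."""
    matches = [f"Birr {low} to Birr {high}" for low, high in INTERVALS if low <= income <= high]
    if len(matches) != 1:
        # Zero matches: classes are not exhaustive; several: classes overlap.
        raise ValueError(f"income {income} falls into {len(matches)} classes")
    return matches[0]


incomes = [2500, 4000, 4001, 7200, 3100]  # hypothetical incomes in Birr
counts = Counter(classify(value) for value in incomes)
for label, frequency in counts.items():
    print(label, frequency)
```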
Tabulation is the process of summarizing raw data and
displaying them in compact form (i.e., in the form of
statistical tables) for further analysis.
It is an orderly arrangement of data in columns and rows.
Objectives of tabulation
To clarify the object of the investigation
To clarify the characteristics of data
To present facts in the minimum space
To facilitate statistical processing
Analysis of quantitative data
Analysis is the computation of indices or measures
and searching for patterns of relationship that exist
among the data groups.
It involves estimating the values of unknown
population parameters and testing hypotheses
in order to draw inferences.
Statistical analysis may be broadly classified as
i. Descriptive analysis
ii. Inferential analysis
Descriptive analysis helps to describe and
summarize data in a meaningful way.
Usually two types of statistics are used to describe
data:
Measures of central tendency
Measures of Dispersion/spread/variation
Inferential analysis is concerned with making
predictions or inferences about a population from
observations and analyses of a sample.
There are two areas of statistical inference:
a) statistical estimation and
b) the testing of hypotheses.
Common techniques include the t-test, analysis of
variance/covariance, correlation analysis, and
regression analysis.
[Figure: Descriptive Statistics and Inferential Statistics]
03/06/25
With respect to the number of variables
three types of statistical analysis could be
considered:
Univariate analysis: only one variable
Bivariate analysis: two variables
Multivariate analysis: more than two
variables
Quantitative Data Analysis
• Descriptive statistics: the use of statistics to summarize,
describe or explain the essential characteristics of a data
set.
- Frequency Distributions
- Measures of Central Tendency
- Measures of Variability
• Inferential statistics: the use of statistics to make
generalizations or inferences about the characteristics of
a population using data from a sample.
- Estimation
- Hypothesis Testing
Frequency Distributions
• A frequency distribution is a summary of data that shows the
number (or frequency) of occurrences of different values
within a dataset. It helps in understanding the distribution of
data and is often represented in tables, histograms, or bar
charts.
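• Stata Commands for Frequency Distribution
• In Stata, you can use the following commands to generate a
frequency distribution:
1. For a categorical variable (e.g., gender, education level):
tabulate variable_name
The one-way table reports frequencies, percentages, and
cumulative percentages by default.
2. For a continuous variable (grouped frequencies), first group the
values into classes (here, for example, four) and then tabulate the
grouped variable:
egen variable_group = cut(variable_name), group(4)
tabulate variable_group
3. For percentages in a two-way table, add the row, column, or cell
option:
tabulate variable1 variable2, row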
Sample Data for Frequency Distribution
• Let's assume a dataset with three variables: ID, Gender,
and Education Level.
ID   Gender   Education
2    Female   Bachelor
3    Male     Master
5    Male     Bachelor
6    Female   Master
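To make the idea of a frequency distribution concrete, the following Python sketch counts the categories in the rows listed above and adds percentages. It uses only the four rows shown here; the function name `frequency_table` is hypothetical.

```python
from collections import Counter

# The four rows shown above: (ID, Gender, Education).
rows = [
    (2, "Female", "Bachelor"),
    (3, "Male", "Master"),
    (5, "Male", "Bachelor"),
    (6, "Female", "Master"),
]


def frequency_table(values):
    """Map each distinct value to (frequency, percent of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (n, 100 * n / total) for value, n in sorted(counts.items())}


gender_table = frequency_table(gender for _, gender, _ in rows)
education_table = frequency_table(education for _, _, education in rows)

for value, (n, percent) in gender_table.items():
    print(f"{value:<8} {n:>3} {percent:6.1f}%")
```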
Example Data: 2
Histogram for Age Distribution
Stata Command: histogram age, bin(5) normal title("Age Distribution")
[Figure: histogram "Age Distribution" with a normal density overlay; x-axis: Age, y-axis: Density]
Stata Command: histogram age, bin(1) normal title("Age Distribution")
[Figure: histogram "Age Distribution" with a normal density overlay; x-axis: Age, y-axis: Density]
Bar Chart for Gender Distribution
• Stata Command: graph bar (count), over(gender) title("Gender Distribution")
[Figure: bar chart "Gender Distribution"; x-axis: Female, Male; y-axis: frequency]
Bar Chart for Education Level Distribution
Stata Command: graph bar (count), over(education) title("Education Level Distribution")
Bar Chart for Mean Income by Education Level
Stata Command: graph bar (mean) income, over(education) title("Mean Income by Education Level")
[Figure: bar chart "Mean Income by Education Level"; y-axis: mean of income]
Normal Distribution
• Symmetrical
• Unimodal
• The normality assumption is necessary for conducting
many inferential statistics
[Figure: bell-shaped curve; x-axis runs from low values to high values]
Non-Normal Distributions
• Distributions that lack symmetry are skewed. Distributions
that have two frequently occurring values are bimodal.
[Figure: positively skewed distribution, negatively skewed distribution, and bimodal distribution]
Measures of Central Tendency
• Measures of central tendency provide information about the single
numerical value that is most typical of the values of a variable.
Case   Annual Salary
1      $20,100
2      $22,700
3      $25,600
4      $26,400
5      $27,900
6      $32,600
• Mean (average): the sum of values of
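As a small worked illustration, the mean and median of the six annual salaries listed in the table above can be computed with Python's standard `statistics` module. Only the six cases shown are used; the variable names are illustrative.

```python
from statistics import mean, median

# Annual salaries of the six cases shown in the table (in dollars).
salaries = [20100, 22700, 25600, 26400, 27900, 32600]

salary_mean = mean(salaries)      # sum of the values divided by the number of cases
salary_median = median(salaries)  # midpoint of the two middle values when n is even

print(f"Mean:   {salary_mean:,.2f}")
print(f"Median: {salary_median:,.2f}")
```

Because every salary occurs only once in these six cases, the mode is not informative here.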
Central Tendency & Normal Distributions
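• The mean and the median are affected by skewness, or lack of
symmetry, in the data: the mean is pulled furthest toward the tail,
the median less so, while the mode stays at the peak.
[Figure: positions of the mode, median, and mean in symmetric and skewed frequency distributions]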
Measures of Variability
• Measures of variability
[Table: values by case number for School A and School B]
Interpretation of Statistical Measures
1. Standard Deviation (SD)
• Measures the spread of data around the mean.
• Higher SD → data are more dispersed.
• Lower SD → data are closer to the mean.
• Example: If SD is 10 and the data are roughly normal, about 68% of
data points are within 10 units of the mean.
2. Standard Error (SE)
• Measures how precisely the sample mean estimates the
population mean.
• Smaller SE → more reliable estimate of the mean.
• Larger SE → more variability in the sample mean.
• Example: If SE is 2, the sample mean is likely within about 4 units
(roughly 2 × SE) of the true mean.
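The link between the two measures is that the standard error of the mean equals the standard deviation divided by the square root of the sample size. A minimal Python sketch, using a hypothetical sample invented for the example:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of eight observations.
sample = [12, 15, 11, 18, 14, 16, 13, 17]

n = len(sample)
sample_mean = mean(sample)
sd = stdev(sample)   # sample standard deviation (n - 1 in the denominator)
se = sd / sqrt(n)    # standard error of the sample mean

print(f"mean = {sample_mean}, SD = {sd:.3f}, SE = {se:.3f}")
```

The SE shrinks as the sample grows, even when the SD of the data stays the same.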
5. Kurtosis
• Measures the "tailedness" of a distribution.
• Kurtosis = 3 → Normal distribution (Mesokurtic).
• Kurtosis > 3 → Leptokurtic (sharp peak, heavy tails).
• Kurtosis < 3 → Platykurtic (flat peak, light tails).
• Example: Stock market returns are often leptokurtic
(many extreme values).
6. Percentiles divide the data into 100 equal parts, showing the
value below which a given percentage of observations fall. They
help understand the distribution of data.
Percentile               Interpretation
10th Percentile (P10)    10% of the data falls below this value.
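A percentile can be computed in several slightly different ways. The sketch below uses the nearest-rank convention, chosen here for simplicity and not necessarily the one a given statistical package applies; the scores are hypothetical.

```python
from math import ceil


def percentile_nearest_rank(values, p):
    """Smallest observed value with at least p percent of observations at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered) / 100))  # 1-based position in the sorted data
    return ordered[rank - 1]


scores = list(range(1, 21))  # hypothetical data: the integers 1 to 20

p10 = percentile_nearest_rank(scores, 10)
p50 = percentile_nearest_rank(scores, 50)
p90 = percentile_nearest_rank(scores, 90)
print(p10, p50, p90)
```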
Describing Single Variables: Univariate Analysis
BIVARIATE ANALYSIS
Bivariate analysis differs from univariate
analysis not only in looking at more than one
variable but also in its purpose, which goes
beyond simple description:
it is the analysis of the relationship
between the two variables.
Relationship here refers to the extent to
which knowing a case's value on the
independent variable makes it easier to
predict its value on the dependent variable.
Univariate Data vs. Bivariate Data
1. Univariate: involves a single variable.
   Bivariate: involves two variables.
2. Univariate: does not deal with causes or relationships.
   Bivariate: deals with causes or relationships.
3. Univariate: the major purpose of univariate analysis is to DESCRIBE.
   Bivariate: the major purpose of bivariate analysis is to EXPLAIN.
4. Univariate: central tendency (mean, mode, median); dispersion (range,
   variance, max, min, quartiles, standard deviation); frequency
   distributions; bar graph, histogram, pie chart, and line graph.
   Bivariate: analysis of two variables simultaneously; correlations,
   comparisons, relationships, causes, explanations; independent and
   dependent variables.
In this PowerPoint we look at sets of data
which contain two variables.
1. SCATTER DIAGRAMS
2. COVARIANCE
3. CORRELATION
4. REGRESSION
We often want to know if there is a
relationship between two numerical variables.
A scatter plot, which gives a visual display
of the relationship between two variables,
provides a good starting point.
Scatter diagrams are most useful for variables
that are closely related, that is, variables with
a relatively high covariance.
Consider data on ‘hours of study’ vs ‘test score’.
[Figure: scatter plot of hours of study against test score; the overall pattern is labelled as correlation and one point is marked as an outlier]
Describe these relationships
[Figure: five scatter plots: perfect negative linear relationship; no relationship; perfect positive linear relationship; moderate negative linear relationship; weak positive linear relationship]
Pearson’s product-moment correlation
coefficient, r
Correlation measures the strength of
the linear association between two
quantitative variables.
[Figure: scatter plots illustrating r = -1, r = 0, and r = 1, and positive and negative correlation]
Correlation treats x and y symmetrically.
The correlation of x and y is the same as
the correlation of y with x.
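• Pearson correlation coefficient
• r
• Linear relationship

\[ r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{(SS_X)(SS_Y)}} \]

\[ r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} \]

Calculating by hand…

\[ \hat{r} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}
= \frac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}
{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} \]

Simpler calculation formula…

The (n - 1) terms cancel, leaving the numerator of the covariance over
the numerators of the variances:

\[ \hat{r} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
= \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} \]

\[ SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{(numerator of covariance)} \]

\[ SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad SS_y = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{(numerators of variance)} \]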
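The hand calculation of the correlation coefficient can be followed step by step in Python. This is a sketch on hypothetical hours-of-study and test-score values invented for the example; the names `ss_xy`, `ss_x`, and `ss_y` stand for the sums of cross-products and squared deviations about the means.

```python
from math import sqrt

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]


def sum_of_products(x, y):
    """Sum of (x_i - mean of x) * (y_i - mean of y) over all observations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))


n = len(hours)
ss_xy = sum_of_products(hours, scores)   # numerator of the covariance
ss_x = sum_of_products(hours, hours)     # numerator of the variance of x
ss_y = sum_of_products(scores, scores)   # numerator of the variance of y

cov_xy = ss_xy / (n - 1)
var_x = ss_x / (n - 1)
var_y = ss_y / (n - 1)

r_from_cov = cov_xy / sqrt(var_x * var_y)
r_simple = ss_xy / sqrt(ss_x * ss_y)     # the (n - 1) terms cancel

print(f"cov = {cov_xy:.2f}, r = {r_simple:.4f}")
```

Both routes give the same r, and swapping the two variables leaves it unchanged.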
Variance vs Covariance
Variance:
• Gives information on the variability of a single variable.

\[ S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Covariance:
• Gives information on the degree to
which two variables vary together.
• Note how similar the covariance is
to variance: the equation simply
multiplies x’s error scores by y’s error
scores as opposed to squaring x’s
error scores.

\[ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
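Covariance

\[ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

When X and Y increase together: cov(x, y) is positive.
When X increases as Y decreases: cov(x, y) is negative.
When there is no consistent relationship: cov(x, y) = 0.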
Covariance is the joint variation of two variables
about their respective means.
The covariance is sometimes called a measure of
"linear dependence" between the two random
variables.
When the covariance is normalized by the product of
the two standard deviations, one obtains the Pearson
correlation coefficient. Its square gives the goodness
of fit for the best possible linear function
describing the relation between the variables.
In this sense, covariance is a linear gauge of
dependence.
Concept of a regression function
A. Historical Definition by Data Analysts
• The term was introduced into data analysis by
Sir Francis Galton (1822-1911).
• He found that although there was a tendency for
tall parents to have tall children and short
parents to have short children, the average height
of children born of parents of a given height
tended to move or “regress” toward the average
in the population as a whole (Galton’s law of
universal regression).
B. The Modern Definition of Regression
Regression analysis is concerned with describing and
evaluating the economic relationship between the
left-hand side variable (Y) and the right hand side
variable(s) (Xi’s).
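The OLS estimates are chosen to minimize the residual sum of squares,
where \hat{\alpha} denotes the estimated intercept and \hat{\beta}
the estimated slope:

\[ \min_{\hat{\alpha},\, \hat{\beta}} RSS = \sum_{i=1}^{N} \left( Y_i - \hat{\alpha} - \hat{\beta} X_i \right)^2 \]

This can be shown to yield the following pair of equations
known as the least-squares normal equations.

\[ \hat{\alpha} N + \hat{\beta} \sum_{i=1}^{N} X_i = \sum_{i=1}^{N} Y_i \]

\[ \hat{\alpha} \sum_{i=1}^{N} X_i + \hat{\beta} \sum_{i=1}^{N} X_i^2 = \sum_{i=1}^{N} X_i Y_i \]

Solving the two equations gives

\[ \hat{\beta} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} \]

Distribution of the OLS estimator
1. Unbiased

\[ E(\hat{\beta}) = \beta \]

2. Efficiency

\[ u_i \sim N(0, \sigma^2) \;\Rightarrow\; \hat{\beta} \sim N\!\left( \beta,\; \frac{\sigma^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \right) \]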
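For a single regressor, the least-squares intercept and slope have a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The Python sketch below applies this to hypothetical data and then verifies the two normal equations; the names `alpha_hat` and `beta_hat` and the data values are assumptions for illustration.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: X = hours of study, Y = test score.
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]

alpha_hat, beta_hat = ols_fit(x, y)

# The estimates should satisfy both least-squares normal equations.
n = len(x)
lhs_first = alpha_hat * n + beta_hat * sum(x)
lhs_second = alpha_hat * sum(x) + beta_hat * sum(xi ** 2 for xi in x)

print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")
print(lhs_first, sum(y))
print(lhs_second, sum(xi * yi for xi, yi in zip(x, y)))
```

In practice a statistical package would also report standard errors and test statistics; this sketch covers only the point estimates.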
Interpretation of the research findings
The task of analysis is incomplete without interpretation.
Research findings can be fully appreciated only through
interpretation of why the findings are what they are.