Econ 656 - Research Methods V

The document outlines the methodology for data processing and analysis in research, detailing steps such as data preparation, cleaning, coding, classification, and tabulation. It emphasizes the importance of ensuring data accuracy and completeness, as well as the use of statistical software for analysis. The analysis section covers descriptive and inferential statistics, including univariate, bivariate, and multivariate analyses.

Uploaded by

habtamulegese24
Copyright
© All Rights Reserved

Research Methodology

Part V
Data Processing and Analysis

Content of the Lecture

1. Introduction
2. Data Preparation and Processing
3. Analysis of Data
   A. Univariate analysis
   B. Bivariate analysis
   C. Multivariate analysis

Introduction
• Once data are acquired, you will need to work with them and use them to address your research questions.
• Working with data means:
  - Getting to know your data: becoming familiar with what you have, ensuring completeness, etc.
  - Example: check the data for completeness and accuracy.
• Even the most well-prepared and well-designed research will probably have some problems with completeness and accuracy.

Introduction
• Analysis is the most rewarding part of your research project.
• There is a sense of relief, excitement, and satisfaction that your work is meaningful.
• Analysis of data means making the raw data meaningful, i.e., drawing results from the data.

Introduction
• It is the process of working with the data to describe, discuss, interpret, evaluate, and explain them in terms of the research questions or hypotheses.
• It is the computation of indices or measures and the search for patterns and relationships.
• It can range from simple summary statistics to extremely complex multivariate analyses.

Introduction
• Collected data must be converted into a machine-readable, numeric format so that they can be analyzed by computer programs.
• Much quantitative data analysis is conducted using software programs such as SPSS, Stata, EViews, and SAS.
• Planning for the analysis starts as soon as the project is conceived.
• By the time the project is designed and planned, the analysis strategy and process should be clear.

Data Preparation and Processing
• Data processing is the method by which the collected data are organized for further analysis.
• It is an intermediate stage between the collection of data and their analysis and interpretation.
• This step involves editing, coding, classification, and tabulation of data in order to make them amenable to analysis.

Data Preparation and Processing
• Editing is the process of examining the collected raw data to detect errors and omissions.
• The first step in analyzing data is to "clean" them of any obvious data entry errors:
  - Outliers (really high or low numbers)? Example: Age = 110 (really 10 or 11?)
  - A value entered that doesn't exist for a variable? Example: 2 entered where 0 = male, 1 = female

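The two checks on this slide can be sketched in Python (standard library only). The records, the field names `age` and `sex`, and the plausible age range of 0 to 100 are illustrative assumptions, not part of the lecture's data.

```python
# Hypothetical survey records; field names and values are illustrative only.
records = [
    {"id": 101, "age": 34, "sex": 0},
    {"id": 102, "age": 110, "sex": 1},  # implausible age: possibly 10 or 11
    {"id": 103, "age": 27, "sex": 2},   # 2 is not a valid code (0 = male, 1 = female)
]

VALID_SEX_CODES = {0, 1}
AGE_RANGE = (0, 100)  # assumed plausible range for this survey


def find_entry_errors(rows):
    """Return (id, variable, value) for every value that fails a check."""
    problems = []
    for row in rows:
        if not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            problems.append((row["id"], "age", row["age"]))
        if row["sex"] not in VALID_SEX_CODES:
            problems.append((row["id"], "sex", row["sex"]))
    return problems


errors = find_entry_errors(records)
for questionnaire_id, variable, value in errors:
    print(f"Questionnaire {questionnaire_id}: check {variable} = {value}")
```

A flagged value is only a prompt to go back to the questionnaire; the check cannot say what the correct value is.
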
Data Preparation and Processing
• It is a careful scrutiny of the completed questionnaires to ensure that the data are:
  - Accurate
  - Consistent with other facts gathered
  - Uniformly entered
• Are:
  - the responses legible?
  - all important questions answered?
  - the responses complete?

Data Preparation and Processing
• Editing can be either field editing or office (central) editing.
• Field-level editing: after an interview, field workers should review their reporting forms, complete what was abbreviated, translate personal shorthand, rewrite illegible entries, and make callbacks if necessary.
• Central editing: done when all forms have been completed and returned to the office.
  - Data editors correct obvious errors such as an entry in the wrong place or a value recorded in the wrong units.

Data Preparation and Processing
• Checking questionnaires' identifiers
  - Each questionnaire needs a unique identifier.
  - Sometimes this will be assigned prior to the fieldwork by numbering the questionnaires.
  - An identifier should be given to each data source.
  - Example: all questionnaires from people in Addis Ababa may start with '1' (101, 102, 103, etc.) and those from Haramaya with '2' (201, 202, 203, etc.).

Data Preparation and Processing
• This makes it easier to sort out where questionnaires have come from, and allows analysis to be carried out on the two sets separately.
• It also enables the researcher to refer back to the participants (if, for example, the researcher wishes to involve them in further research).

Data Preparation and Processing
• What to do with partial responses (missing responses)
  - If there are missing responses, there may be a number of reasons for this.
  - It may be that the length of the questionnaire deterred your participants from completing it, or
  - they did not wish to answer a particular question, for example one on a sensitive topic.

Data Preparation and Processing
• What we do with the data depends on the number of missing cases and the possible reasons for incompleteness.
• Can the data be used, perhaps to address some of your research questions?
• Decide whether to reject some incomplete questionnaires or to include the partial information.
• Whichever you choose, you must make it clear when writing up and discussing your findings.

Data Preparation and Processing
• Inconsistent data
  - Sometimes you will find that the information given by a respondent within a questionnaire is inconsistent.
  - This can be the case with both factual and value data.
  - Example: a participant may give his/her year of birth as 1990 but also record that (s)he has children born in 2000.

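A consistency check of the kind described on this slide could look like the following Python sketch. The record layout and the minimum parental age of 15 are assumptions chosen for illustration; a real study would set its own threshold.

```python
# Hypothetical records: year of birth of each respondent and of their children.
respondents = [
    {"id": 201, "birth_year": 1975, "children_birth_years": [2000, 2004]},
    {"id": 202, "birth_year": 1990, "children_birth_years": [2000]},  # parent aged 10
]

MIN_PARENT_AGE = 15  # assumed threshold; choose one suited to the study population


def inconsistent_ids(rows, min_age=MIN_PARENT_AGE):
    """Return ids whose reported child birth years imply an implausibly young parent."""
    flagged = []
    for row in rows:
        gaps = (child - row["birth_year"] for child in row["children_birth_years"])
        if any(gap < min_age for gap in gaps):
            flagged.append(row["id"])
    return flagged


flagged = inconsistent_ids(respondents)
print(flagged)
```
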
Data Preparation and Processing
• Coding: some forms of data are unstructured.
  - Example: answers to open questions.
• In order to quantify and analyze such materials, the researcher has to code them.
• Coding refers to the process of assigning numerals to answers.
• It is the process of converting data into numeric format.

Data Preparation and Processing
• Common coding systems (code and label) for dichotomous variables:
  - 0 = No, 1 = Yes
  - 0 = Male, 1 = Female
• The coding process is similar for other categorical variables.
• Example: possible coding for education:
  0 = High school
  1 = Diploma
  2 = First Degree
  3 = Second Degree, etc.

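As a rough illustration, the coding scheme above can be written as lookup tables and applied to raw answers. The codes come from the slide; the three raw answers are made up for the example.

```python
# Codes as listed on the slide; the raw answers below are invented for illustration.
EDUCATION_CODES = {"High school": 0, "Diploma": 1, "First Degree": 2, "Second Degree": 3}
SEX_CODES = {"Male": 0, "Female": 1}

raw_answers = [
    ("Male", "Diploma"),
    ("Female", "Second Degree"),
    ("Female", "High school"),
]

# Replace each label with its numeric code.
coded = [(SEX_CODES[sex], EDUCATION_CODES[education]) for sex, education in raw_answers]
print(coded)
```

An answer that is missing from a table raises a `KeyError`, which is a useful signal that the list of categories is not yet exhaustive.
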
Data Preparation and Processing
• A codebook, a comprehensive document containing a detailed description of each variable, should be created.
• The coding must be:
  - Exhaustive: there must be a class for every data item, i.e., the list of categories must be complete and therefore cover all possibilities.
  - If it is not, some material cannot be coded.

Data Preparation and Processing
• Mutually exclusive: category components should be mutually exclusive, i.e., a specific answer can be placed in one and only one cell in a given category set.
• The categories that are generated must not overlap.
  - If they do, the numbers that are assigned to them cannot be applied to distinct categories.

Example
• You can consider each of the listed foods as a variable and code each variable as 1 if it is ticked, 2 if it is not ticked.
• You could then count how many people eat, for example, cereal more than twice a week.

Data Preparation and Processing
• Data entry: coded data can be entered into a spreadsheet, database, text file, or directly into a statistical program like SPSS or Stata.
• Observations can be entered as rows in the spreadsheet, and measurement items can be represented as columns.
• The entered data should be checked for accuracy via occasional spot checks during and after entry.

Data Preparation and Processing
• Classification is the process of arranging data into groups or classes according to their common characteristics.
• It reduces a large volume of data into homogeneous groups of manageable size.

Data Preparation and Processing
• Classification can be according to:
  - Attributes: descriptive characteristics such as literacy, sex, or honesty, or numerical characteristics such as weight, age, height, income, or expenditure.
  - Class intervals: e.g., income of Birr 2001 to Birr 4000, Birr 4001 to Birr 6000, and so on.
  - Geographical classification
  - Chronological classification
  - Alphabetical classification

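Classification by class intervals can be sketched as below. Two of the intervals follow the slide's example; the lowest interval, the incomes, and the assumption that incomes are recorded in whole Birr are added for illustration. The function also checks the two coding requirements stated earlier: each value must fall in exactly one class.

```python
# Class intervals with inclusive bounds, in whole Birr. The last two follow the
# slide's example; the first class and the incomes are illustrative assumptions.
INCOME_CLASSES = [(0, 2000), (2001, 4000), (4001, 6000)]


def classify(income, classes=INCOME_CLASSES):
    """Return the label of the single class that contains income."""
    matches = [f"Birr {low} to Birr {high}" for low, high in classes if low <= income <= high]
    if len(matches) != 1:
        # Zero matches: classes are not exhaustive. Several: not mutually exclusive.
        raise ValueError(f"income {income} falls in {len(matches)} classes")
    return matches[0]


incomes = [1500, 2500, 4000, 4001, 5999]
counts = {}
for value in incomes:
    label = classify(value)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```
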
Data Preparation and Processing
• Tabulation is the process of summarizing raw data and displaying them in compact form (i.e., in statistical tables) for further analysis.
• It is an orderly arrangement of data in columns and rows.
• Objectives of tabulation:
  - To clarify the object of the investigation
  - To clarify the characteristics of the data
  - To present facts in the minimum space
  - To facilitate statistical processing

Analysis of quantitative data
• Analysis is the computation of indices or measures and the search for patterns of relationship that exist among the data groups.
• It involves estimating the values of unknown population parameters and testing hypotheses in order to draw inferences.

Analysis of quantitative data
• Statistical analysis may be broadly classified as:
  i. Descriptive analysis
  ii. Inferential analysis
• Descriptive analysis helps to describe and summarize data in a meaningful way.
• Usually two types of statistics are used to describe data:
  - Measures of central tendency
  - Measures of dispersion/spread/variation

Analysis of quantitative data
• Inferential analysis is concerned with making predictions or inferences about a population from observations and analyses of a sample.
• There are two areas of statistical inference:
  a) statistical estimation, and
  b) the testing of hypotheses
• Common techniques include the t-test, analysis of variance/covariance, correlation analysis, and regression analysis.

Analysis of quantitative data

Descriptive Statistics        Inferential Statistics
• Organize                    • Generalize from samples to populations
• Summarize                   • Hypothesis testing
• Simplify                    • Relationships among variables
• Presentation of data
Describing data               Making predictions

Analysis of quantitative data
• With respect to the number of variables, three types of statistical analysis can be considered:
  - Univariate analysis: only one variable
  - Bivariate analysis: two variables
  - Multivariate analysis: more than two variables

Quantitative Data Analysis
• Descriptive statistics: the use of statistics to summarize,
describe or explain the essential characteristics of a data
set.
- Frequency Distributions
- Measures of Central Tendency
- Measures of Variability
• Inferential statistics: the use of statistics to make
generalizations or inferences about the characteristics of
a population using data from a sample.
- Estimation
- Hypothesis Testing
Frequency Distributions
• A frequency distribution is a summary of data that shows the number (or frequency) of occurrences of different values within a dataset. It helps in understanding the distribution of the data and is often presented in tables, histograms, or bar charts.
• Stata commands for frequency distributions:
  1. For a categorical variable (e.g., gender, education level):
     tabulate variable_name
  2. For a continuous variable (grouped frequencies), first group the values into classes, then tabulate the classes (variable_group is a placeholder name; group(5) asks for five classes of roughly equal frequency):
     egen variable_group = cut(variable_name), group(5)
     tabulate variable_group
  3. For frequencies with percentages: a one-way tabulate already reports the percent and cumulative percent of each value alongside its frequency, so no extra option is needed.

Sample Data for Frequency Distribution
• Let's assume a dataset with three variables: ID, Gender, and Education Level.

ID   Gender   Education
1    Male     High School
2    Female   Bachelor
3    Male     Master
4    Female   High School
5    Male     Bachelor
6    Female   Master

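For readers without Stata, the frequency table for the six sample observations can be reproduced with a short Python sketch. The function name `frequency_table` is an illustrative choice; the columns it returns (frequency, percent, cumulative percent) are the ones a one-way frequency table usually shows.

```python
from collections import Counter

# The six observations from the sample data slide.
data = [
    (1, "Male", "High School"),
    (2, "Female", "Bachelor"),
    (3, "Male", "Master"),
    (4, "Female", "High School"),
    (5, "Male", "Bachelor"),
    (6, "Female", "Master"),
]


def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for value in sorted(counts):
        percent = 100 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], round(percent, 2), round(cumulative, 2)))
    return rows


education_table = frequency_table(education for _, _, education in data)
gender_table = frequency_table(gender for _, gender, _ in data)
for row in education_table + gender_table:
    print(row)
```
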
Example Data: 2

ID   Gender   Age   Education     Income
1    Male     25    Bachelor      35000
2    Female   30    Master        45000
3    Male     28    High School   28000
4    Female   35    Bachelor      50000
5    Male     40    Master        60000

Histogram for Age Distribution
Stata command: histogram age, bin(5) normal title("Age Distribution")
[Figure: histogram titled "Age Distribution" with a normal curve overlaid; x-axis: Age (25 to 40), y-axis: Density.]

Stata command: histogram age, bin(1) normal title("Age Distribution")
[Figure: the same histogram drawn with a single bin; x-axis: Age (25 to 40), y-axis: Density.]

Bar Chart for Gender Distribution
• Stata command: graph bar (count), over(gender) title("Gender Distribution")
[Figure: bar chart titled "Gender Distribution"; bars for Female and Male, y-axis: frequency.]

Bar Chart for Education Level Distribution
Stata command: graph bar (count), over(education) title("Education Level Distribution")
[Figure: bar chart titled "Education Level Distribution"; bars for Bachelor, High School, and Master, y-axis: frequency.]

Bar Chart for Mean Income by Education Level
Stata command: graph bar (mean) income, over(education) title("Mean Income by Education Level")
[Figure: bar chart titled "Mean Income by Education Level"; bars for Bachelor, High School, and Master, y-axis: mean of income.]

Normal Distribution
Characteristics of a normal distribution:
• Symmetrical
• Unimodal
• Normality is an assumption of many inferential statistical procedures
[Figure: bell-shaped curve titled "Normal Distribution"; x-axis runs from low values to high values, y-axis: Frequency.]

Non-Normal Distributions
• Distributions that lack symmetry are skewed. Distributions that have two frequently occurring values are bimodal.
[Figure: three curves titled "Positively Skewed Distribution", "Negatively Skewed Distribution", and "Bimodal Distribution".]

Measures of Central Tendency
• Measures of central tendency provide information about the single numerical value that is most typical of the values of a variable.
• Mean (average): the sum of values of all cases divided by the total number of cases
• Median: the center point in an ordered set of values of a variable
• Mode: the most frequently occurring value of a variable

Case     Annual Salary
1        $20,100
2        $22,700
3        $25,600
4        $26,400
5        $27,900
6        $32,600
7        $38,400
8        $42,600
9        $55,700
10       $60,000
11       $550,000
Total    $902,000
Mean     $82,000
Median   $32,600

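The salary table makes the point that one extreme value can pull the mean far from the typical case while leaving the median almost untouched. A short Python check of the slide's figures, using only the standard `statistics` module:

```python
import statistics

# Annual salaries from the slide's table (cases 1 to 11).
salaries = [20100, 22700, 25600, 26400, 27900, 32600,
            38400, 42600, 55700, 60000, 550000]

mean_salary = statistics.mean(salaries)      # 902000 / 11
median_salary = statistics.median(salaries)  # 6th of the 11 ordered values

# Dropping the single extreme salary shows how strongly it pulls the mean.
mean_without_top = statistics.mean(salaries[:-1])
median_without_top = statistics.median(salaries[:-1])

print(mean_salary, median_salary)
print(mean_without_top, median_without_top)
```

With the $550,000 case included, the mean ($82,000) is larger than ten of the eleven salaries, which is why the median is usually the better summary for skewed variables such as income.
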
Summary of Key Commands for Measures of Central Tendency

Measure                  Stata Command              Example
Mean                     summarize var              summarize income
Detailed mean & median   summarize var, detail      summarize income, detail
Median                   centile var, centile(50)   centile income, centile(50)
Mode (categorical)       tabulate var               tabulate education

Central Tendency & Normal Distributions
• The mean and the median are affected by skewness, or lack of symmetry in the data; the mean is pulled furthest toward the long tail.
[Figure: three frequency curves. Negatively skewed distribution: typically mean < median < mode. Normal distribution: mean, median, and mode coincide. Positively skewed distribution: typically mode < median < mean.]

Measures of Variability
• Measures of variability provide information about how "spread out" the values of a variable are. Examples: standard deviation (SD), variance.
• Range: the difference between the highest and lowest values
• Standard Deviation (SD): how far the values tend to vary from the mean.

Case #   School A   School B
1        45         60
2        50         65
3        55         65
4        60         70
5        65         70
6        70         70
7        70         70
8        75         70
9        80         70
10       85         75
11       90         75
12       95         80
Mean     70         70
Median   70         70
SD       15.14      5

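The two schools have the same mean and median but very different spread. The Python sketch below recomputes the table's figures. One detail worth knowing: the SD values on the slide (15.14 and 5) match the formula that divides by n, whereas the sample formula that divides by n - 1 (the one Stata's `summarize` reports) gives slightly larger values.

```python
import statistics

# Test scores from the slide's table.
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]

summary = {}
for name, scores in (("School A", school_a), ("School B", school_b)):
    summary[name] = {
        "range": max(scores) - min(scores),
        "sd_n": statistics.pstdev(scores),         # divides by n
        "sd_n_minus_1": statistics.stdev(scores),  # divides by n - 1
    }
    row = summary[name]
    print(f"{name}: range={row['range']}, "
          f"SD(n)={row['sd_n']:.2f}, SD(n-1)={row['sd_n_minus_1']:.2f}")
```
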
Stata Commands for Measures of Variability
• Measures of variability describe how spread out the data are. The key measures include standard deviation (SD), variance, range, percentiles, interquartile range (IQR), and quartiles.
• Standard Deviation (SD) & Variance
  - The standard deviation measures the typical distance of the values from the mean; it is the square root of the variance.
  - The variance is the square of the standard deviation.
Stata command: summarize schoolA schoolB, detail

Interpretation of Statistical Measures
1. Standard Deviation (SD)
• Measures the spread of data around the mean.
• Higher SD → data are more dispersed.
• Lower SD → data are closer to the mean.
• Example: if SD is 10 and the data are roughly normal, about two-thirds of the data points lie within 10 units of the mean.
2. Standard Error (SE)
• Measures how precisely the sample mean estimates the population mean.
• Smaller SE → more reliable estimate of the mean.
• Larger SE → more variability in the sample mean.
• Example: if SE is 2, the sample mean lies within about 4 units (two standard errors) of the true mean in roughly 95% of samples.

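The standard error of the mean is the sample SD divided by the square root of the sample size. A minimal Python sketch, reusing the School A scores from the Measures of Variability slide; the 1.96 multiplier is the usual normal approximation and is an assumption here, since a t-based multiplier would give a somewhat wider interval with only 12 observations.

```python
import math
import statistics

# School A scores from the Measures of Variability slide.
scores = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(n)         # standard error of the mean

# Rough 95% interval for the population mean (normal approximation).
interval = (mean - 1.96 * se, mean + 1.96 * se)
print(round(se, 2), tuple(round(bound, 1) for bound in interval))
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.
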
Cont.
5. Kurtosis
• Measures the "tailedness" of a distribution.
• Kurtosis = 3 → normal distribution (mesokurtic).
• Kurtosis > 3 → leptokurtic (sharp peak, heavy tails).
• Kurtosis < 3 → platykurtic (flat peak, light tails).
• Example: stock market returns are often leptokurtic (many extreme values).

Cont.
6. Percentiles: divide the ordered data into 100 equal parts, showing the value below which a given percentage of observations fall. They help in understanding the distribution of the data.

Percentile                    Interpretation
10th percentile (P10)         10% of the data fall below this value.
25th percentile (P25 or Q1)   First quartile: 25% of the data fall below this value.
50th percentile (P50 or Q2)   Median: the middle of the data.
75th percentile (P75 or Q3)   Third quartile: 75% of the data fall below this value (the upper 25% lie above it).
90th percentile (P90)         90% of the data fall below this value.

Cont.
• Percentile: a value below which a certain percent of the ordered observations in a distribution are located.
• Inter-quartile range (IQR): the range of values within which the middle 50 percent of the observations lie, i.e., the third quartile minus the first quartile.
  - The first quartile: value below which 25 percent of the cases are found
  - The second quartile: value below which 50 percent of the cases are found
  - The third quartile: value below which 75 percent of the cases are found

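Quartiles and the inter-quartile range can be computed with Python's `statistics.quantiles`. The data are the School A scores from the Measures of Variability slide. Note that statistical packages use slightly different interpolation rules for percentiles, so other software may report marginally different quartiles for the same data; the sketch uses Python's default method.

```python
import statistics

# School A scores from the Measures of Variability slide.
scores = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]

# n=4 asks for the three cut points that split the ordered data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```
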
Describing Single Variables: Univariate Analysis

Variable Type                               Measure of Central Tendency   Measure of Variability
Nominal (e.g. gender)                       Mode                          n/a
Ordinal (e.g. Ed degree)                    Median                        Range
Ordinal (e.g. Likert scale)                 Mean                          Standard Deviation
Interval/Ratio (e.g. income, test scores)   Mean                          Standard Deviation
                                            Median                        Range

BIVARIATE ANALYSIS
• The major difference between univariate and bivariate analysis, in addition to looking at more than one variable, is that the purpose goes beyond simple description: it is the analysis of the relationship between the two variables.
• Relationship simply refers to the extent to which it becomes easier to know/predict a value for the dependent variable if we know a case's value on the independent variable.

#  Univariate Data                               Bivariate Data
1  Involving a single variable                   Involving two variables
2  Does not deal with causes or relationships    Deals with causes or relationships
3  The major purpose of univariate analysis      The major purpose of bivariate analysis
   is to DESCRIBE                                is to EXPLAIN
4  Central tendency: mean, mode, median          Analysis of two variables simultaneously
   Dispersion: range, variance, max, min,        Correlations, comparisons, relationships,
   quartiles, standard deviation                 causes, explanations
   Frequency distributions                       Independent and dependent variables
   Bar graph, histogram, pie chart, line graph

In this PowerPoint we look at sets of data
which contain two variables.
1. SCATTER DIAGRAMS
2. COVARIANCE
3. CORRELATION
4. REGRESSION
We often want to know if there is a relationship between two numerical variables. A scatter plot, which gives a visual display of the relationship between two variables, provides a good starting point.
Scatter diagrams show a pattern most clearly when the variables are closely related, i.e., when their covariance is large relative to their individual variability.
Consider data on 'hours of study' vs 'test score':

Hours Score Hours Score Hours Score
18 59 14 54 17 59
16 67 17 72 16 76
22 74 14 63 14 59
27 90 19 72 29 89
15 62 20 58 30 93
28 89 10 47 30 96
18 71 28 85 23 82
19 60 25 75 26 35
22 84 18 63 22 78
30 98 19 61
Cont.
We may want to see if we could predict the test score (response variable) based on the hours of study (explanatory variable).
y-axis: Test score
x-axis: Hours of study
Stata commands:
scatter yvar xvar
twoway (scatter yvar xvar) (lfit yvar xvar)
[Figure: scatter plot of test score against hours of study. Certain patterns tell us about the relationship; this is called correlation. One point is marked as an outlier.]
Describe these relationships
[Figure: five scatter plots labelled "Perfect, negative, linear relationship", "No relationship", "Perfect, positive, linear relationship", "Moderate, negative linear relationship", and "Weak, positive linear relationship".]
Pearson's product-moment correlation coefficient, r
Correlation measures the strength of the linear association between two quantitative variables.
The correlation coefficient may take any value between -1.0 and +1.0.
[Figure: three scatter plots. r = -1: points fall exactly on a downward-sloping straight line. r = 0: no linear relationship (uncorrelated). r = 1: points fall exactly on an upward-sloping straight line.]
Cont.
A correlation coefficient can be calculated for any pair of quantitative variables, but it measures the strength of the linear association only and will be misleading if the relationship is not linear.
Some facts about the correlation
coefficient
• The sign gives the direction of the association.
• Correlation is always between -1 and 1.
• Correlation treats x and y symmetrically. The
correlation of x and y is the same as the correlation of y
with x.
• Correlation has no units and is generally given as a
decimal.
• Note: variables can have a strong association but still
have a small correlation if the association isn’t linear.
• Correlation is sensitive to outliers. A single outlying
value can make a small correlation large or make a large
one small.
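
Two of these facts, symmetry and sensitivity to outliers, can be checked on the hours-of-study data listed earlier. The Python sketch below computes r from sums of squares and then recomputes it without the (26 hours, 35 marks) observation; treating that observation as the outlier is a judgment made for this illustration.

```python
import math

# Hours of study and test score, read row by row from the data slide (29 pairs).
hours = [18, 14, 17, 16, 17, 16, 22, 14, 14, 27, 19, 29, 15, 20, 30,
         28, 10, 30, 18, 28, 23, 19, 25, 26, 22, 18, 22, 30, 19]
score = [59, 54, 59, 67, 72, 76, 74, 63, 59, 90, 72, 89, 62, 58, 93,
         89, 47, 96, 71, 85, 82, 60, 75, 35, 84, 63, 78, 98, 61]


def pearson_r(x, y):
    """Pearson's r computed as SS_xy / sqrt(SS_x * SS_y)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_x * ss_y)


r_all = pearson_r(hours, score)
r_swapped = pearson_r(score, hours)  # symmetry: same value either way round

# Drop the observation that sits far below the other points.
kept = [(h, s) for h, s in zip(hours, score) if (h, s) != (26, 35)]
r_without_outlier = pearson_r([h for h, _ in kept], [s for _, s in kept])

print(round(r_all, 3), round(r_swapped, 3), round(r_without_outlier, 3))
```
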
[Figure: two scatter plots, one showing a positive and one a negative association.]
• Pearson correlation coefficient r
• Measures a linear relationship

$$ r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{(SS_X)(SS_Y)}} $$

$$ r = \frac{\operatorname{covariance}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} $$
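Calculating by hand…

$$ \hat{r} = \frac{\operatorname{covariance}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} = \frac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} $$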
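Simpler calculation formula…
The factors of n - 1 cancel, leaving only the numerator of the covariance and the numerators of the two variances:

$$ \hat{r} = \frac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}} $$

where $SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$ is the numerator of the covariance, and $SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $SS_y = \sum_{i=1}^{n} (y_i - \bar{y})^2$ are the numerators of the variances.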
Variance vs Covariance
Variance:
• Gives information on the variability of a single variable.

$$ S_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to the variance: the equation simply multiplies x's error scores by y's error scores as opposed to squaring x's error scores.

$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
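Covariance

$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

• When X increases as Y increases: cov(x, y) is positive.
• When X increases as Y decreases: cov(x, y) is negative.
• When there is no linear relationship: cov(x, y) is close to 0.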
Covariance
• Covariance is the joint variation of two variables about their respective means.
• The covariance is sometimes called a measure of "linear dependence" between the two random variables.
• When the covariance is normalized by the product of the two standard deviations, one obtains the Pearson correlation coefficient. Squaring it gives the coefficient of determination, which measures the goodness of fit of the best possible linear function describing the relation between the variables.
• In this sense, covariance is a linear gauge of dependence.
Concept of a regression function
A. Historical Definition by Data Analysts
• The person who coined the word in data analysis was Sir Francis Galton (1822-1911).
• He found that, although there was a tendency for tall parents to have tall children and short parents to have short children, the average height of children born of parents of a given height tended to move or "regress" toward the average in the population as a whole (Galton's law of universal regression).
B. The Modern Definition of Regression
• Regression analysis is concerned with describing and evaluating the economic relationship between the left-hand-side variable (Y) and the right-hand-side variable(s) (the X_i's).
Cont.
Suppose we wish to estimate the parameters of the following relationship:

$$ Y_i = \alpha + \beta X_i + u_i $$

A common method is to choose the parameters to minimise the residual sum of squares:

$$ \min_{\hat{\alpha},\, \hat{\beta}} RSS = \sum_{i=1}^{N} \left( Y_i - \hat{\alpha} - \hat{\beta} X_i \right)^2 $$

This can be shown to yield the following pair of equations, known as the least-squares normal equations:

$$ \hat{\alpha} N + \hat{\beta} \sum_{i=1}^{N} X_i = \sum_{i=1}^{N} Y_i $$

$$ \hat{\alpha} \sum_{i=1}^{N} X_i + \hat{\beta} \sum_{i=1}^{N} X_i^2 = \sum_{i=1}^{N} X_i Y_i $$

Solving these equations yields the least-squares estimates:

$$ \hat{\beta} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} $$
Distribution of the OLS estimator
First we will establish the conditions under which the OLS estimator is:
1. Unbiased: $E(\hat{\beta}) = \beta$
2. Efficient: the variance is the lowest in the class of linear unbiased estimators.
These results will depend on the Gauss-Markov assumptions.
The Gauss-Markov assumptions are a set of assumptions about the nature of the error term in the regression equation.
The Gauss-Markov assumptions
Under a specific set of assumptions the OLS estimates can be shown to be the best linear unbiased estimates (BLUE).
1. The error has expected value zero: $E(u_i) = 0, \ \forall i$
2. The errors are serially uncorrelated: $E(u_i u_j) = 0, \ \forall i \neq j$
3. The errors have constant variance: $E(u_i^2) = \sigma^2, \ \forall i$
4. The X variable is non-stochastic (fixed in repeated samples): $E(X_i u_i) = X_i E(u_i)$
Unbiasedness
The proof of the unbiasedness of the OLS estimator relies on only two of the Gauss-Markov assumptions. These are assumptions 1 and 4.

Efficiency
The proof of the efficiency of the OLS estimator relies on all the Gauss-Markov assumptions.
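
The unbiasedness claim can be made concrete with a short derivation, sketched here for the slope estimator of the two-variable model $Y_i = \alpha + \beta X_i + u_i$. It uses the fact that $\sum_i (X_i - \bar{X}) = 0$ and shows exactly where assumptions 4 and 1 enter.

```latex
\begin{aligned}
\hat{\beta}
  &= \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
   = \frac{\sum_{i=1}^{N} (X_i - \bar{X})\, Y_i}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \\
  &= \frac{\sum_{i=1}^{N} (X_i - \bar{X})(\alpha + \beta X_i + u_i)}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
   = \beta + \frac{\sum_{i=1}^{N} (X_i - \bar{X})\, u_i}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \\
E(\hat{\beta})
  &= \beta + \frac{\sum_{i=1}^{N} (X_i - \bar{X})\, E(u_i)}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
   && \text{(assumption 4: the } X_i \text{ are fixed)} \\
  &= \beta
   && \text{(assumption 1: } E(u_i) = 0\text{)}
\end{aligned}
```

Assumptions 2 and 3 are not used anywhere in this argument; they are needed only for the variance of the estimator, and hence for efficiency.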
Distribution of the OLS estimator
To derive the distribution of the OLS estimator we need to make some assumption about the distribution of the errors.
If we assume that the errors follow a normal distribution, then we can show that the OLS estimates are also normally distributed:

$$ u_i \sim N(0, \sigma^2) \;\Rightarrow\; \hat{\beta} \sim N\!\left( \beta,\ \frac{\sigma^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \right) $$
Interpretation of the research findings
• The task of analysis is incomplete without interpretation.
• Interpretation refers to the task of drawing inferences from the collected facts.
• It is the process of establishing relationships between variables and explaining why such relationships exist.

Interpretation of the research findings
• Research can be fully appreciated only through an interpretation of why the findings are what they are.
• You should help others understand the significance of the research findings.
• It is through interpretation that the researcher can understand the abstract principles underlying the findings.
• Interpretation leads to the establishment of concepts that can serve as a guide for further research.

Thank You !
