Data analysis notes
Data analysis is the process of inspecting, cleaning, transforming and modeling data with the
goal of discovering useful information, suggesting conclusions and supporting decision-making.
Its purpose is to extract useful information from data and to make decisions based on that
analysis. Data analysis has multiple facets and approaches, encompassing diverse techniques
under a variety of names, and is used in business, science, and social science domains. In
today's business world, data analysis plays a role in making decisions more scientific and
helping businesses operate more effectively.
The analysis of data requires a number of closely related operations, such as establishing
categories and data dimensions, applying these categories to the raw data through coding and
tabulation, and then drawing statistical inferences. Unwieldy data must be condensed into a few
manageable groups and tables for further analysis; thus, a researcher should classify the raw
data into purposeful and usable categories.
Data analysis involves ordering and organizing raw data so that useful information can be
extracted from it. This enables one to understand what the data does and does not contain.
Data analysis can be approached in many ways, and it is easy to manipulate data during the
analysis phase to push certain conclusions or agendas. For this reason, it is important to pay
attention when data analysis is presented, and to think critically about the data and the
conclusions that were drawn.
Raw data can take a variety of forms, including measurements, survey responses, and
observations. In its raw form, this information can be incredibly useful, but also overwhelming.
Over the course of the data analysis process, the raw data is ordered in a way which will be
useful. For example, survey results may be tallied, so that people can see at a glance how many
people answered the survey, and how people responded to specific questions. Modeling the data
with mathematics and other tools can accentuate points of interest in the data, making them
easier for the researcher to see.
When people encounter summarized data and conclusions, they should view them critically.
It is important to ask where the data came from, what sampling method was used to collect it,
and how large the sample was. If the source of the data appears to have a conflict of interest
with the type of data being gathered, the results may be called into question. Likewise, data
gathered from a small sample, or from a sample that is not truly random, may be of questionable validity.
Reputable researchers always provide information about the data gathering techniques used, the
source of funding, and the point of the data collection in the beginning of the analysis so that
readers can think about this information while they review the analysis.
Analysis refers to dividing a whole into its separate components for individual examination.
Data analysis is a process for obtaining raw data and subsequently converting it into
information useful for decision-making by users. Data is collected and analyzed to answer
questions, test hypotheses, or disprove theories.
Data analysis is an important stage of the research process. It provides a summary of the process
and explores specific areas of data analysis that might be applicable to learners studying at
undergraduate and postgraduate levels. In any research, one should take some time to carefully
review all of the data collected from the experiment. Use charts and graphs to help you analyze
the data and identify patterns. Did you get the results you had expected? What did you find out from
your experiment? Then, really think about what you have discovered and use your data to help
you explain why you think certain things happened.
Research Process
Researchers who are attempting to answer a research question employ the research process.
Though presented in a linear format, in practice the process of research can be less
straightforward. That said, researchers attempt to follow the process and use it to present their
research findings in research reports and journal articles.
b) Credibility improved through long engagement with the respondents or triangulation in data
collection (internal validity).
c) Transferability achieved through a thick description of the research process to allow a reader
to see if the results can be transferred to a different setting (external validity).
d) Dependability examined through the audit trail (reliability), e.g. member checking.
e) Confirmability established through audit trail categories, e.g. raw data included, data
analysis and reduction processes described, data reconstruction and synthesis (including the
structuring of categories and themes), process notes included, and instrument development
information included.
What if you do not have a question to begin with? Exploring data without a defined question,
sometimes referred to as “data mining”, can reveal interesting patterns in the data that are
worth exploring. Regardless of what leads you to look at data, thinking about your audience
(your staff, supervisor, Board members, etc.) is helpful to shape the story and guide your
thinking about the data.
Whenever you look at data, it is important to be open to unexpected patterns, explanations, and
unusual results. Sometimes the most interesting stories to be told with data are not the ones you
set out to tell.
Data is used to describe things by assigning a value to them. The values are then organized,
processed, and presented within a given context so that they become useful. Data can take
different forms: qualitative and quantitative.
Whether the study employs secondary or primary data, the researcher should formulate
questions that can be addressed with data and collect, organize and display relevant data to
answer them.
Three key questions that should be at the back of your mind when analysing data include:
i) Is this the right data to answer the proposed research questions?
ii) Is the data well organized for the software to know the type of data being analysed such as
cross section data, time series data and panel data?
iii) What is the appropriate technique to analyse the data to provide accurate solutions?
This will guide you to:
● understand the differences among various kinds of studies and which types of inferences can legitimately be drawn from each;
● know the characteristics of well-designed studies, including the role of randomization in surveys and experiments;
● understand the meaning of measurement data and categorical data, of univariate and bivariate data, and of the term variable;
● understand histograms, parallel box plots, and scatter plots and use them to display data;
● compute basic statistics and understand the distinction between a statistic and a parameter.
Quantitative data methods for outlier detection can be used to get rid of data that appears to
have a higher likelihood of having been input incorrectly. For textual data, spell checkers can
be used to lessen the number of mis-typed words; however, it is harder to tell whether the words
themselves are correct.
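As an illustration, below is a minimal sketch of one common screening rule for quantitative outliers, the interquartile-range (Tukey) fence; the variable names, data values and 1.5 multiplier are illustrative assumptions rather than a prescribed method, and numpy is assumed to be available.

import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR beyond the quartiles (Tukey's fence)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements containing one likely data-entry error
heights_cm = [150, 152, 149, 155, 148, 153, 510]
print(iqr_outliers(heights_cm))   # flags 510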
Types of Data
Data is mainly categorised into secondary and primary data. It can be further subcategorised into
nominal and ordinal, discrete and continuous, binary and categorical, or countable data. Table 1
summarises data types and their associated measurement levels, plus some examples. It is
important to appreciate that appropriate methods for summary and display depend on the type
of data being used. This is also true for ensuring the appropriate statistical test is employed.
a) Quantitative data
Quantitative data is that which can be easily measured and recorded in numerical form. It is
used extensively in education, in forms such as exam results, SATs results, and absence and
truancy figures. Quantitative data is collected by testing against agreed criteria, as in exams,
or by measuring, as in height or age. Often this data is expressed using percentages rather than
the actual numbers themselves.
“Quantitative data” is data that is expressed with numbers: data which can be counted, measured,
or ranked. Length, weight, age, cost, and rating scales are all examples of quantitative data.
Quantitative data can be represented visually in graphs and tables and can be statistically
analyzed.
Categorical data
Categorical data is data that has been placed into groups. An item cannot belong to more than
one group at a time. Examples of categorical data include the individual’s current living
situation, smoking status, or whether he/she is employed. As discussed in more detail later, the
type of analysis used with categorical data is the Chi-square test.
Continuous data
“Continuous data” is numerical data measured on a continuous range or scale. In continuous
data, all values are possible with no gaps in between. Examples of continuous data are a
person’s height or weight, and temperature. As discussed in more detail later, many types of
analysis can be used with continuous data, including effect size calculations.
Calculations and Summarizing Data
Often, you will need to perform calculations on your raw data in order to get the results from
which you will generate a conclusion. A spreadsheet program such as Microsoft Excel may be
a good way to perform such calculations, and then later the spreadsheet can be used to display
the results. Be sure to label the rows and columns, and don't forget to include the units of
measurement (grams, centimeters, liters, etc.).
You should have performed multiple trials of your experiment. Think about the best way to
summarize your data. Do you want to calculate the average for each group of trials, or
summarize the results in some other way such as ratios, percentages, or error and significance
for really advanced students? Or, is it better to display your data as individual data points?
Do any calculations that are necessary for you to analyze and understand the data from your
experiment.
● Use calculations from known formulas that describe the relationships you are testing.
● Pay careful attention because you may need to convert some of your units to do your
calculations correctly. All of the units for a measurement should be on the same scale
(keep L with L and mL with mL; do not mix L with mL!). A minimal example of this kind of
calculation is sketched below.
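For instance, here is a minimal, hypothetical sketch (the trial values and the unit conversion are invented for illustration) of averaging repeated trials after putting every measurement on the same scale:

# Hypothetical: three trials per group, with one value recorded in mL and converted to L first
trials_litres = {
    "group_A": [1.20, 1.25, 1.22],
    "group_B": [0.950, 980 / 1000, 0.960],   # 980 mL converted to 0.980 L
}

for group, values in trials_litres.items():
    average = sum(values) / len(values)
    print(f"{group}: average = {average:.3f} L over {len(values)} trials")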
b) Qualitative Data
Qualitative data is data that uses words and descriptions. Qualitative data can be observed but
is subjective and therefore difficult to use for the purposes of making comparisons. Descriptions
of texture, taste, or an experience are all examples of qualitative data. Qualitative data collection
methods include focus groups, interviews, or open-ended items on a survey. Qualitative data
is information that is represented by means other than numbers. This could be data on gender,
place of birth, school attended, etc. Data from questionnaires or forms is often of a qualitative
nature, and categories are often used to group the data together, such as questions on racial origin.
Qualitative data is often summarised in numbers or percentages, as in the statement "23% of
Makerere University students are of Ugandan origin."
Continuous data is that which can be any number on a scale. If you were to measure the heights
of the children in the class there would be a range of measurements from the shortest to tallest
child and these measurements could be anywhere on the chosen scale. For practical purposes
the measurements would usually be rounded off to the nearest whole or half unit but could
actually be at any point. Other examples of continuous data are things like rainfall, length of
feet and weight.
Such data are often plotted with line graphs, but are also plotted as bar graphs where times are
clumped together, e.g., days of the week.
a) Nominal: This is used to describe variables that are categorical in nature. The characteristics
of the data you're collecting fall into distinct categories. If there are a limited number of distinct
categories (usually only two), then you're dealing with a discrete variable. If there are an
unlimited or infinite number of distinct categories, then you're dealing with a continuous
variable. Nominal variables include demographic characteristics like sex, race, and religion.
b) Ordinal: This data measurement describes variables that can be ordered or ranked in some
order of importance. It describes most judgments about things, such as big or little, strong or
weak. Most opinion and attitude scales or indexes in the social sciences are ordinal in nature.
c) Interval: This data measurement describes variables that have more or less equal intervals, or
meaningful distances between their ranks. For example, if you were to ask somebody if they
were first, second, or third generation immigrant, the assumption is that the distance, or number
of years, between each generation is the same. All crime rates in criminal justice are interval
level measures, as is any kind of rate.
d) Ratio: This data measurement describes variables that have equal intervals and a fixed
reference point. It is possible to have zero income, zero education, and no involvement in crime,
but rarely do we see ratio level variables in social science since it's almost impossible to have
zero attitudes on things, although "not at all", "often", and "twice as often" might qualify as
ratio level measurement.
Descriptive statistics, such as the average or median, can be generated to aid in understanding
the data. Data visualization is also a useful technique, in which the analyst examines the data
in a graphical format in order to obtain additional insights regarding the messages within
the data.
Inferential statistics includes techniques that measure the relationships between particular
variables. For example, regression analysis may be used to model whether a change in
advertising (independent variable X) provides an explanation for the variation in sales
(dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising), which
may be written as Y = aX + b + error. Analysts may also attempt to build models that are
descriptive of the data, with the aim of simplifying analysis and communicating results.
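A minimal sketch of such a regression is shown below; the advertising and sales figures are invented purely for illustration, and ordinary least squares (via numpy) is used as the fitting method.

import numpy as np

# Hypothetical advertising spend (X) and sales (Y) for six periods
advertising = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
sales = np.array([44.0, 49.0, 58.0, 66.0, 71.0, 88.0])

# Fit Y = aX + b by ordinary least squares
a, b = np.polyfit(advertising, sales, deg=1)
residuals = sales - (a * advertising + b)

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
print("residuals (the error term):", np.round(residuals, 2))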
Data visualization
Once the data is analyzed, it may be reported in many formats to the users of the analysis to
support their requirements. The users may have feedback, which results in additional analysis.
As such, much of the analytical cycle is iterative.
When determining how to communicate the results, the analyst may consider implementing a
variety of data visualization techniques, to help clearly and efficiently communicate the
message to the audience. Data visualization uses information displays (graphics such as tables
and charts) to help communicate key messages contained in the data. Tables are valuable because
they enable a user to query and focus on specific numbers, while charts (e.g., bar charts or line
charts) may help explain the quantitative messages contained in the data.
Data analysts typically obtain descriptive statistics for study variables, such as the mean
(average), median, and standard deviation. They may also analyze the distribution of the key
variables to see how the individual values cluster around the mean.
Hypothesis testing is used when a particular hypothesis about the true state of affairs is made
by the analyst and data is analysed to determine whether that state of affairs is true or false. For
example, the hypothesis might be that "Unemployment has no effect on inflation", which relates
to an economics concept called the Phillips Curve. Hypothesis testing involves considering the
likelihood of Type I and type II errors, which relate to whether the data supports accepting or
rejecting the hypothesis.
Regression analysis may be used when the analyst is trying to determine the extent to which
independent variable X affects dependent variable Y (e.g., "To what extent do changes in the
unemployment rate (X) affect the inflation rate (Y)?"). This is an attempt to model or fit an
equation line or curve to the data, such that Y is a function of X.
Necessary condition analysis (NCA) may be used when the analyst is trying to determine the
extent to which independent variable X is necessary for dependent variable Y (for example: to
what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?). Whereas
(multiple) regression analysis uses additive logic where each X-variable can produce the
outcome and the X's can compensate for each other (they are sufficient but not necessary),
necessary condition analysis (NCA) uses necessity logic, where one or more X-variables allow
the outcome to exist, but may not produce it (they are necessary but not sufficient). Each single
necessary condition must be present and compensation is not possible.
Barriers to effective analysis may exist among the analysts performing the data analysis or
among the audience. Distinguishing fact from opinion, cognitive biases, and innumeracy are all
challenges to sound data analysis.
As another example, the auditor of a public company must arrive at a formal opinion on whether
financial statements of publicly traded corporations are "fairly stated, in all material respects”.
This requires extensive analysis of factual data and evidence to support their opinion. When
making the leap from facts to opinions, there is always the possibility that the opinion
is erroneous.
Cognitive biases
There are a variety of cognitive biases that can adversely affect analysis. For
example, confirmation bias is the tendency to search for or interpret information in a way that
confirms one’s preconceptions. In addition, individuals may discredit information that does not
support their views. Note that analysts may be trained specifically to be aware of these biases
and how to overcome them.
Innumeracy
Effective analysts are generally adept with a variety of numerical techniques. However,
audiences may not have such literacy with numbers or numeracy; they are said to be innumerate.
Persons communicating the data may also be attempting to mislead or misinform, deliberately
using bad numerical techniques.
For example, whether a number is rising or falling may not be the key factor. More important
may be the number relative to another number, such as the size of government revenue or
spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in
corporate financial statements. This numerical technique is referred to as normalization or
common-sizing. There are many such techniques employed by analysts, whether adjusting for
inflation (i.e., comparing real vs. nominal data) or considering population increases,
demographics, etc. Analysts apply a variety of techniques to address the various quantitative
messages described in the section above.
Qualitative methodology recognizes that the subjectivity of the researcher is intimately
involved in scientific research. Subjectivity guides everything from the choice of topic that one
studies, to formulating hypotheses, to selecting methodologies, and interpreting data. In
qualitative methodology, the researcher is encouraged to reflect on the values and objectives he
brings to his research and how these affect the research project. Other researchers are also
encouraged to reflect on the values that any particular investigator utilizes.
A key issue that arises with the recognition of subjectivity is how it affects objectivity. Two
positions have been articulated. Many qualitative researchers counterpoise subjectivity and
objectivity. Objectivity is said to negate subjectivity since it renders the observer a passive
recipient of external information, devoid of agency. And the researcher's subjectivity is said to
negate the possibility of objectively knowing a social psychological world. The investigator's
values are said to define the world that is studied. One never really sees or talks about the world,
per se. One only sees and talks about what one's values dictate. A world may exist beyond
values, but it can never be known as it is, only as values shape our knowledge of it.
If the data seem valid and reliable, you need to make sure that you have an accurate copy of
the data, especially if you obtained it through an electronic medium. This includes verifying
that you:
Why use secondary data? (Advantages and Disadvantages of Secondary Data Analysis)
It is unobtrusive research
It can be less expensive than gathering the data all over again.
It may allow the researcher to cover a wider geographic or temporal range; that is, using
secondary data ensures the breadth of data available.
It can allow for larger scale studies on a small budget.
It does not exhaust people's good will by re-collecting readily available data.
b) Data may have been intended for consumption by particular groups, which differ from the
present project
The descriptive data analysis techniques include graphics, tabulation, simple summary
statistics, and pictorial and textual methods. Summary statistics include measures of central
tendency (averages: mean, median and mode) and measures of variability about the average
(range and standard deviation). These give the reader a 'picture' of the data collected and used
in the research project.
Inferential statistics are the outcomes of statistical tests, helping deductions to be made from
the data collected, to test hypotheses set and relating findings to the sample or population.
Pictures are often better at communicating ideas than other media forms, and graphic presentations
are pictures. In working with graphics, pay particular attention to axes and/or labels and legends,
which define the data in the graphic. Basic examples of graphs used include the following.
Using graphs for data visualization is very common in day-to-day life; they often appear in
the form of charts and graphs. In other words, data is shown graphically so that it will be easier
for the human brain to understand and process. Data visualization is often used to discover
unknown facts and trends. By observing relationships and comparing datasets, you can find
meaningful information.
i) Scatter diagram
A scatter plot is used to show how two variables are related to each other. By observation
one is able to tell whether they are positively related, negatively related, or show no relationship.
A scatter diagram is useful in data exploration, and when a line of best fit is imposed one is able
to see how well the fitted line describes the relationship between the two variables. An example
of a scatter diagram with a line of best fit is shown in Figure 1 below.
Figure 1: Mean weekly hours worked by the youth by region and residence
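A minimal matplotlib sketch of a scatter diagram with a fitted line is given below; the x and y values are invented for illustration and are unrelated to the figure caption above, and matplotlib is assumed to be available.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 3.7, 5.2, 5.8, 7.1, 7.9, 9.2])

slope, intercept = np.polyfit(x, y, deg=1)   # line of best fit

plt.scatter(x, y, label="observations")
plt.plot(x, slope * x + intercept, label="line of best fit")
plt.xlabel("X variable")
plt.ylabel("Y variable")
plt.legend()
plt.show()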
iii) Histogram: A histogram is the most commonly used graph to show frequency distributions.
It looks very much like a bar chart, but there are important differences: a bar chart is better for
comparing different groups when the independent variable is not numeric. A histogram can be used
to examine the distribution of a variable, i.e. whether it is normally distributed or positively
or negatively skewed.
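A minimal matplotlib sketch of a histogram, using simulated right-skewed data purely for illustration, is shown below; visual inspection of the shape gives a rough sense of the skewness discussed next.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=500)   # simulated right-skewed data

plt.hist(values, bins=30, edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of a right-skewed variable")
plt.show()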
Skewed Distribution
The skewed distribution is asymmetrical because a natural limit prevents outcomes on one side.
The distribution’s peak is off center toward the limit and a tail stretches away from it. For
example, a distribution of analyses of a very pure product would be skewed, because the product
cannot be more than 100 percent pure. Other examples of natural limits are holes that cannot
be smaller than the diameter of the drill bit or call-handling times that cannot be less than zero.
These distributions are called right- or left-skewed according to the direction of the tail.
v) Pie Diagram
A pie diagram is another graphical method of representing data. It is drawn to depict the
total value of a given attribute using a circle; dividing the circle into corresponding degrees
of angle then represents the subsets of the data. Hence, it is also called a divided circle
diagram. Data are defined by labels and/or the legend associated with the chart. The angle for
each category is calculated as (category value / total value) × 360°.
Construction
(a) Mark time series data on X-axis and variable data on Y-axis as per the selected scale.
(b) Plot the data in closed columns.
Data Tabulations
This mainly involves data analysis using tabulation of frequencies, percentages or numbers of
observations. It may be a one-way tabulation, two-way tabulation, three-way tabulation, etc.,
depending on the information one wants to display and report to the readers. Below we focus
on one-way and two-way tabulations. However, in a two-way tabulation one needs to take
precautions by noting whether one is doing row tabulations, column tabulations or cell
tabulations, because these provide different interpretations of the study findings.
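A minimal pandas sketch of one way and two way tabulations is shown below; the small dataset is invented for illustration only.

import pandas as pd

# Hypothetical survey records
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "status": ["stable", "satisfactory", "in transition", "stable", "stable", "in transition"],
})

# One way tabulation: frequencies of each employment status
print(df["status"].value_counts())

# Two way tabulation: row percentages of status by sex
print(pd.crosstab(df["sex"], df["status"], normalize="index") * 100)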
One way tabulation
Table 3 presents youth transition stage in their employment status by gender. The results show
that 38.5% of male youth transited into stable jobs compared to 25.4% of their female
counterparts, while more female youth (39.4%) transited to satisfactory jobs than male youth
(32.3%) and more female youth are in transition (35.2%) than male youth (29.2%).
Table 3: Youth transition stage in employment by sex (%)
Sex       Transited to stable job   Transited to satisfactory job   In transition   Total
Male      38.5                      32.3                            29.2            100
Female    25.4                      39.4                            35.2            100
Three way tabulation
Table 4: Tabulation of region by sex by race (race: 1=white, 2=black, 3=other; sex: 1=male, 2=female)
Region         White            Black            Other
               Male    Female   Male    Female   Male    Female
North          962     1017     51      55       5       6
Central        1170    1292     133     162      7       10
South          1076    1208     247     301      9       12
West           1104    1236     69      68       82      69
The mean value is what we typically call the "average." You calculate the mean by adding up
all of the measurements in a group and then dividing by the number of measurements.
Quartiles: The lower (Q1) quartile is the value below which the bottom 25% of the sample data
lie, and the upper (Q3) quartile is the value above which the upper 25% lie. NB. The middle
quartile (Q2) corresponds to the median.
The median is a statistical value that lies mid-way in a data set arranged in ascending or
descending order, and it is less sensitive to “outliers” in the data, that is, data values at the
extremes of a group.
Range: This measures the distance between the lowest and highest values in the data set and
generally describes how spread out data are. The range gives only minimal information about
the spread of the data, by defining the two extremes. It says nothing about how the data are
distributed between those two endpoints. Two other related measures of dispersion, the variance
and the standard deviation, provide a numerical summary of how much the data are scattered.
For example, after an exam, an instructor may tell the class that the lowest score was 65 and
the highest was 95. The range would then be 30. Note that a good approximation of the standard
deviation can be obtained by dividing the range by 4.
Variance is expressed as the sum of the squares of the differences between each observation
and the mean, divided by the sample size. It is a measure of the dispersion of a set of data
points around their mean value: the mathematical expectation of the squared deviations from the
mean. For populations it is designated by the square of the Greek letter sigma (σ²); for samples
it is designated by the square of the letter s (s²). Since this is a quadratic expression, i.e. a
quantity raised to the second power, variance is the second moment of statistics.
Standard deviation is expressed as the positive square root of the variance, i.e. σ for
populations and s for samples. It is a measure of the dispersion of a set of data from its mean:
the more spread apart the data, the higher the deviation. Roughly speaking, it reflects the
typical difference between observed values and the mean. The standard deviation is used when
expressing dispersion in the same units as the original measurements, and it is used more
commonly than the variance in expressing the degree to which data are spread out.
Coefficient of variation measures relative dispersion by dividing the standard deviation by the
mean and then multiplying by 100 to give a percentage. It is a statistical measure of the
dispersion of data points in a data series around the mean, designated as V for populations and
v for samples, and it describes the relative variability of two data sets better than the standard
deviation alone. For example, one data set has a standard deviation of 10 and a mean of 5; its
values vary by two times the mean (a coefficient of variation of 200%). Another data set has the
same standard deviation of 10 but a mean of 5,000; in this case the dispersion is insignificant
relative to the mean (a coefficient of variation of 0.2%).
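The sketch below shows how these summary measures can be computed on a small invented sample; numpy is assumed, and ddof=1 is used to obtain the sample versions of the variance and standard deviation.

import numpy as np

data = np.array([65, 70, 72, 75, 78, 80, 85, 95])   # hypothetical exam scores

mean = data.mean()
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
data_range = data.max() - data.min()
var_s = data.var(ddof=1)        # sample variance
sd_s = data.std(ddof=1)         # sample standard deviation
cv = sd_s / mean * 100          # coefficient of variation, in percent

print(f"mean={mean:.1f} median={median:.1f} Q1={q1:.1f} Q3={q3:.1f}")
print(f"range={data_range} variance={var_s:.1f} sd={sd_s:.1f} CV={cv:.1f}%")
print(f"range/4 = {data_range / 4:.1f} (rough approximation of the standard deviation)")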
Percentiles measure the percentage of data points which lie below a certain value when the
values are ordered. For example, a student scores 1280 on the Scholastic Aptitude Test (SAT).
Her scorecard informs her she is in the 90th percentile of students taking the exam. Thus, 90
percent of the students scored lower than she did.
Quartiles group observations such that 25 percent are arranged together according to their
values. The top 25 percent of values are referred to as the upper quartile. The lowest 25 percent
of the values are referred to as the lower quartile. Often the two quartiles on either side of the
median are reported together as the interquartile range. Examining how data fall within quartile
groups describes how deviant certain observations may be from others.
Measures of skew describe how concentrated data points are at the high or low end of the scale
of measurement. Skew is designated by the symbols Sk for populations and sk for samples.
Skew indicates the degree of symmetry in a data set. The more skewed the distribution, the
higher the variability of the measures, and the higher the variability, the less reliable are the
data.
Skew is calculated either by multiplying the difference between the mean and the median by
three and then dividing by the standard deviation, or by summing the cubes of the differences
between each observation and the mean and then dividing by the sample size times the cube of
the standard deviation. Note that the use of cubic quantities helps explain why skew is called
the third moment.
More conceptually, skew defines the relative positions of the mean, median, and mode. If a
distribution is skewed to the right (positive skew), the mean lies to the right of both the mode
(most frequent value and hump in the curve) and median (middle value). That is, mode less
than (<) median less than (<) mean. But, if the distribution is skewed left (negative skew), the
mean lies to the left of the median and the mode. That is, mean < median < mode.
In a perfectly symmetrical distribution, mean = median = mode, and skew is 0. The equations
noted above will indicate left skew with a negative number and right skew with a positive
number.
Measures of kurtosis describe how concentrated data are around a single value, usually the
mean. It is statistical measure used to describe the distribution of observed data around the
mean. It is sometimes referred to as the “volatility of volatility”. Thus, kurtosis assesses how
peaked or flat is the data distribution. The more peaked or flat the distribution, the less normally
distributed the data. And the less normal the distribution, the less reliable the data.
Kurtosis is designated by the letter K for populations and k for samples and is calculated by
summing the fourth powers of the differences between each observation and the mean and then
dividing by the sample size times the fourth power of the standard deviation. Note that the use
of the fourth power explains why kurtosis is called the fourth moment. Three degrees of kurtosis
are noted:
Mesokurtic distributions are, like the normal bell curve, neither peaked nor flat.
Platykurtic distributions are flatter than the normal bell curve. A description of the kurtosis in
a distribution in which the statistical value is negative.
Leptokurtic distributions are more peaked than the normal bell curve. A description of the
kurtosis in a distribution in which the statistical value is positive.
The ideal value rendered by the equation for kurtosis is 3, the kurtosis of the normal bell curve.
The higher the number above 3, the more leptokurtic (peaked) is the distribution. The lower the
number below 3, the more platykurtic (flat) is the distribution.
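As a small illustration, the scipy functions below compute these moment-based measures for a simulated sample; fisher=False is passed so that the kurtosis of a normal distribution is reported as 3, matching the convention used above.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=10_000)   # roughly symmetric simulated data

print("skewness:", round(skew(sample), 3))                     # close to 0
print("kurtosis:", round(kurtosis(sample, fisher=False), 3))   # close to 3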
Correlation analysis
Correlation analysis is performed to establish how two variables are associated with one another.
The correlation coefficient ranges from -1 to +1. When it is negative we say that the two
variables are negatively correlated; when it is positive we say that the two variables are
positively correlated; and when it is zero, we say that the two variables are uncorrelated.
Note that correlation analysis is one of the exploratory data analysis techniques and can be
used to establish whether two variables are so highly associated that, if both were used as
independent variables in a regression model, the results would be biased, i.e. inefficient and
not accurate for forecasting. Using the rule of thumb, when the correlation coefficient is equal
to or greater than 0.8, we may suspect that there is a problem of multicollinearity in the data
and thus the data needs to be transformed. We can transform the data using ratios, logarithms,
lags or differencing.
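A minimal pandas sketch of this kind of multicollinearity screen is shown below; the variable names and values are invented for illustration, and 0.8 is used as the rule-of-thumb threshold mentioned above.

import pandas as pd

# Hypothetical dataset with two highly related candidate regressors
df = pd.DataFrame({
    "income": [100, 120, 140, 160, 180, 200],
    "spending": [90, 115, 130, 155, 170, 195],
    "household": [2, 5, 3, 6, 4, 7],
})

corr = df.corr()
print(corr.round(2))

# Flag pairs whose absolute correlation meets or exceeds the 0.8 rule of thumb
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) >= 0.8]
print("possible multicollinearity:", high)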
Accurate analysis of data using standardized statistical methods in scientific studies is critical
to determining the validity of empirical research. Statistical formulas such as regression,
uncertainty coefficient, t-test, chi square, and various types of ANOVA (analyses of variance)
are fundamental to forming logical, valid conclusions. If empirical data reach significance under
the appropriate statistical formula, the research hypothesis is supported. If not, the null
hypothesis is supported (or, more correctly, not rejected), meaning no effect of the independent
variable(s) was observed on the dependent variable(s).
It is important to understand that the outcome of empirical research using statistical hypothesis
testing is never proof. It can only support a hypothesis, reject it, or do neither. These methods
yield only probabilities.
Among scientific researchers, empirical evidence (as distinct from empirical research) refers to
objective evidence that appears the same regardless of the observer. For example, a
thermometer will not display different temperatures for each individual who observes it.
Temperature, as measured by an accurate, well calibrated thermometer, is empirical evidence.
By contrast, non-empirical evidence is subjective, depending on the observer. Following the
previous example, observer A might truthfully report that a room is warm, while observer B
might truthfully report that the same room is cool, though both observe the same reading on the
thermometer. The use of empirical evidence negates this effect of personal (i.e., subjective)
experience.
Empirical cycle
1. Observation: Collecting and organising empirical facts.
2. Induction: Formulating a hypothesis.
3. Deduction: Deducing the consequences of the hypothesis as testable predictions.
4. Testing: Testing the hypothesis with new empirical material.
5. Evaluation: Evaluating the outcome of testing, which feeds back into new observations.
We are bombarded with information all of our lives. Does the information make sense? Is it
important? Why should I care? Reading and thinking critically involves asking questions of
everything we read or study. The questions listed below are just some of the questions you need
to ask while reading reports of empirical studies. These questions will help you identify key
information in the report and reflect on what purpose the information serves.
In order to critically evaluate a report of empirical research, it is essential to consider the context
in which the research was conducted. The following questions will help you understand the
context for the research:
Who did the research? Where was it published? What are the research questions? Where did
these research questions come from? Is the research important? Why or why not? Researchers
use different methods to address the research questions. The following questions will help you
evaluate the way in which the research was conducted:
Who or what is involved in the study? Are the subjects appropriate for the study? What is the
research design? Is the research design appropriate for the research question(s)? What are the
measures? Are the measures appropriate for addressing the research question(s)? What ethical
considerations are important to address? Are they all addressed in the article?
The results of the study are used by the researchers to answer the research questions. Use the
following questions to help you understand the results and determine whether they answer the
research questions:
What are the main results of the study? Can the results be used to answer the research
question(s)? Can the results be generalized beyond the context of the study?
The conclusions place the results of the study into a broader context. The following
questions will help you understand how the researchers make sense of the results and how they
use the results to better understand the discipline:
What conclusions do the researchers draw from the results? Are the conclusions important?
Why or why not?
Diagnostic Analysis
Diagnostic analysis answers “Why did it happen?” by finding the causes behind the insights
uncovered in statistical analysis. This analysis is useful for identifying behaviour patterns in
data. If a new problem arrives in your business process, you can look to this analysis to find
similar patterns for that problem, and there may be a chance to apply similar prescriptions to
the new problem.
Predictive Analysis
Predictive analysis shows “what is likely to happen” by using previous data. The simplest
example: if last year I bought two dresses based on my savings, and this year my salary doubles,
then I can buy four dresses. Of course, it is not as easy as this, because you have to think about
other circumstances, such as the chance that the price of clothes has increased this year, or that
instead of dresses you want to buy a new bike, or you need to buy a house!
So this analysis makes predictions about future outcomes based on current or past data.
Forecasting is just an estimate; its accuracy depends on how much detailed information you
have and how deeply you dig into it.
Prescriptive Analysis
Prescriptive analysis combines the insights from all the previous analyses to determine which
action to take on a current problem or decision. Most data-driven companies utilize
prescriptive analysis because predictive and descriptive analysis alone are not enough to
improve performance. Based on current situations and problems, they analyze the data and make
decisions.
a) Continuous dependent variable models-These are linear regression models, used only when the
dependent variable is continuous, and estimated by minimising the sum of squared residuals:
y = a + bX + u
where y is height and X is age. The model is estimated to show the relationship between age
and height of an individual.
Regression output (abridged): Number of obs = 10,351; constant (_cons) coefficient = 173.1531,
std. err. = 0.272987, t = 634.29, P>|t| = 0.000, 95% CI = [172.618, 173.6882].
The model has been estimated using 10,351 observations. It has a statistically significant F-
statistic at the 1% level of significance, and thus we see that the model fits the data well. In
terms of the estimated coefficient, from the output above we observe a negative relationship
between age and height, and it is statistically significant at the 1% level of significance.
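A minimal sketch of how such a model could be estimated in Python is given below; the simulated age and height values are invented for illustration and do not reproduce the output above, and statsmodels is assumed to be available.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=500)                       # hypothetical ages
height = 173 - 0.05 * age + rng.normal(0, 5, size=500)    # hypothetical heights (cm)

X = sm.add_constant(age)          # adds the intercept term a
model = sm.OLS(height, X).fit()   # minimises the sum of squared residuals
print(model.summary())            # coefficients, F-statistic, R-squared, etc.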
b) Binary dependent variable models-These are also known as probability models. This is when
the outcome variable is coded as 0 or 1. In this case the most appropriate estimation technique
is either the logit or the probit technique. When a logit model is estimated we obtain
coefficients expressed as log-odds, which are difficult to interpret directly. Therefore, to get
meaningful results for the logit model, we can report the odds ratios or the marginal effects.
When odds ratios are reported for a logit model, odds ratios greater than 1 imply a positive
effect on the outcome variable, with the effect given by the odds ratio minus 1. On the other
hand, if the odds ratio is less than 1, it implies that the independent variable has a negative
effect on the outcome.
More commonly in empirical analysis, marginal effects can be computed for both the logit and
probit models. These are interpreted as marginal probabilities that give the change in the
likelihood of the outcome occurring with respect to a given independent variable.
Note: logit and probit models can be estimated for both cross-sectional and panel data. For
cross-sectional data we use the commands logit/probit or logistic, and for panel data we use
the command xtlogit/xtprobit.
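Below is a minimal, hypothetical statsmodels sketch of a logit model reporting odds ratios and average marginal effects; the data are simulated for illustration only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # true probability of the outcome
y = rng.binomial(1, p)                   # binary outcome coded 0/1

X = sm.add_constant(x)
logit_res = sm.Logit(y, X).fit(disp=False)

print("odds ratios:", np.exp(logit_res.params))   # exponentiated coefficients
print(logit_res.get_margeff().summary())          # average marginal effects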
c) Categorical dependent variable-This is when the dependent variable has more than two
outcome choices, for example the level of education coded 0 = no education, 1 = primary,
2 = secondary, 3 = post-secondary.
These models are known as multinomial logit/probit models. Ordinarily the coefficients of these
models do not make sense on their own, hence we usually report relative risk ratios for the
multinomial logit, or marginal probabilities.
Note: in the case of the multinomial logit/probit the dependent variable does not assume any
form of ordering. However, when the dependent variable assumes a given ordering, we estimate
ordered logit or probit models instead.
Note: in case the data have very many zeros, we instead estimate what are known as zero-inflated
negative binomial models or zero-inflated Poisson models.
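A minimal, hypothetical sketch of a multinomial logit with relative risk ratios is shown below; the three-category outcome and the single regressor are simulated purely to illustrate the syntax, with statsmodels assumed to be available.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=1500)
y = rng.choice([0, 1, 2], size=1500, p=[0.5, 0.3, 0.2])   # hypothetical 3-category outcome

X = sm.add_constant(x)
mnl = sm.MNLogit(y, X).fit(disp=False)

print(np.exp(mnl.params))   # relative risk ratios, relative to the base category (0)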
Econometric model evaluation
The precision of the estimate depends on the size of the sample. Clearly the larger the sample
the better the estimate will be. Precision is measured by calculating the standard error of the
estimate or a confidence interval (usually the 95% confidence interval).
Confidence interval
A confidence interval is a range of values within which we are fairly sure the true value of the
parameter being investigated lies. A common confidence interval (CI) is 95%. Thus, for example,
we can be 95% confident that the true population mean lies approximately within the interval
calculated as the sample mean ± 2 x standard error of the mean. The multiplier 2 is an
approximation that depends on the sample size.
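A minimal sketch of this calculation on an invented sample, using the approximate multiplier of 2 described above, is shown below.

import numpy as np

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.5])   # hypothetical measurements

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
lower, upper = mean - 2 * se, mean + 2 * se      # approximate 95% confidence interval

print(f"mean = {mean:.2f}, approximate 95% CI = ({lower:.2f}, {upper:.2f})")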
The first hypothesis is usually referred to as the Null Hypothesis because it is the hypothesis of
no effect or no difference between the populations of interest. It is usually given the symbol H0.
The second hypothesis is usually called the Alternative Hypothesis by statisticians, but since it
is often the hypothesis that the researcher would like to be true, it is sometimes referred to as
the Study Hypothesis or Research Hypothesis. Note, however, that in equivalence trials, where a
researcher would like a new (but perhaps cheaper) treatment to be as effective as the current
treatment, it is the null hypothesis that the researcher would like to see supported by the data.
The alternative hypothesis is usually given the symbol H1 or HA, and it states that there is an
effect or that there is a difference between the populations.
However, in some instances the researcher may be interested in a change in one direction only
(e.g. pulse is lower or pain relief is better). The alternative hypothesis in this case is known as a
directional (one-tailed) alternative hypothesis. In this case, the alternative hypothesis will take
the form, for example:
H1: on average, there is greater pain relief from taking drug A than from not taking it.
Note: the null hypothesis is the same for both directional and non-directional cases.
All statistical tests produce a p-value and this is equal to the probability of obtaining the
observed difference, or one more extreme, if the null hypothesis is true. To put it another way
- if the null hypothesis is true, the p-value is the probability of obtaining a difference at least as
large as that observed due to sampling variation.
Consequently, if the p-value is small the data support the alternative hypothesis, and if the
p-value is large the data support the null hypothesis. But how small is 'small' and how large is
'large'? Conventionally, a p-value below 5% is treated as small; this 5% value is called the
significance level of the test. Other significance levels that are commonly used are 1% and 10%.
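A minimal scipy sketch of obtaining a p-value from a two-sample t-test, using invented data and the 5% significance level, is shown below.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 10, size=40)   # hypothetical control scores
group_b = rng.normal(55, 10, size=40)   # hypothetical treatment scores

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0 at the 5% level" if p_value < 0.05 else "fail to reject H0 at the 5% level")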
Statistical Power
The use of a significance level of 5% controls the probability of erroneously rejecting the null
hypothesis when it is, in fact, true. Rejecting the null hypothesis when it is true is called a Type
I error. However, there is another error that can be made, namely failing to reject the null
hypothesis when it is, in fact, not true. This is called a Type II error.
The power of a test (or statistical power) refers to the probability that a statistical test will
correctly reject a null hypothesis when it is false. In other words, it’s the ability of the test to
detect an effect or difference when one truly exists. The power of a test is an important concept
in hypothesis testing, as it indicates how sensitive the test is to detecting real differences or
relationships in the data.
Significance level (α): The significance level (often set at 0.05) defines the threshold for
rejecting the null hypothesis. It represents the probability of making a Type I error, which
occurs when the null hypothesis is wrongly rejected. The lower the α value, the stricter the
criteria for rejecting the null hypothesis. However, a lower α may reduce the power of the test,
as fewer differences will be considered significant.
Sample size (n): Larger sample sizes generally increase the power of a test. This is because
larger samples provide more information, making it easier to detect a true effect. With smaller
sample sizes, the test may fail to detect differences, even if they exist, because the data is less
precise.
Effect size: The effect size measures the magnitude of the difference or relationship that the
test is trying to detect. A larger effect size makes it easier to detect a true effect, increasing the
power of the test. For example, if the difference between two groups is very large, the test is
more likely to detect that difference. If the effect size is small, the power of the test will
decrease.
Variability (Standard Deviation) in the Data: The less variability (or noise) there is in the
data, the easier it is to detect a significant effect. High variability (i.e., large standard deviations)
can obscure true differences or relationships, reducing the test's power.
Test type and design: The power of a test can also depend on the type of statistical test used
and the design of the experiment. For example, paired tests (such as paired t-tests) tend to have
more power than unpaired tests (like independent t-tests) because paired tests reduce the
variability by comparing the same subjects under different conditions.
Researchers can increase the power of a test in several ways:
Increase the sample size: Larger sample sizes decrease variability and increase the
precision of estimates.
Increase the effect size: Although the effect size is determined by the phenomenon being
studied, a more powerful experimental design can help magnify the effect.
Reduce variability: By controlling for sources of variation or improving measurement
techniques, researchers can reduce the noise in the data and increase power.
Choose a more sensitive test: Some statistical tests are more powerful than others (e.g.,
paired t-tests are more powerful than independent t-tests).
Before conducting a study, researchers can perform a power analysis to determine the required
sample size for a given power level. This helps ensure that the study is adequately powered to
detect meaningful effects while minimizing the risk of Type II errors. Power analysis is also
used to identify the likelihood of detecting an effect when designing experiments, especially
when resources or participants are limited.
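A minimal sketch of such a power analysis using statsmodels is shown below; the medium effect size of 0.5, the 5% significance level and the 80% power target are assumed values for illustration.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect the assumed effect with 80% power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required sample size per group: {n_per_group:.1f}")

# Conversely, the power achieved with a fixed sample of 30 per group
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"power with n = 30 per group: {achieved:.2f}")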
2. and by being aware of the theoretical positions available on the topic, researchers are 'pre-
figuring the field', i.e. anticipating what they may find.
Rigour
Pre-figuring the field runs the risk of researchers only finding what they want to find, by
only looking for a specific phenomenon or by being blind to other issues that arise. Rigour
involves the checks and balances built into qualitative research to make sure it is believable,
trustworthy and credible.
Reflexivity
Forewarned is forearmed. By being aware of the pitfalls of pre-figuring the field, researchers
can maintain an openness to the situation they are investigating. They can be attentive to issues
that are not expected or do not conform to existing accounts or theories of society. This idea of
being aware of your own values, ideas and pre-judgements as a researcher is known as
reflexivity.
Iteration
Iteration means moving back and forth. In qualitative research it is difficult to cleanly separate
data collection or generation from data analysis, because there is movement back and forth
between generation and analysis.
Researchers usually generate data at a point in time and also write analytical notes to themselves
about that data. These notes are then processed into memos or guiding notes to inform the next
bout of data collection. And so leads the merry dance.
Analytical memos
The sorts of things included are –
1. The identification of patterns;
2. Working out the limitations, exceptions and variations present in whatever is being
investigated;
3. Generating tentative explanations for the patterns and seeing if they are present or absent in
other settings or situations;
4. Working explanations into a theoretical model;
5. Confirming or modifying the theoretical model;
The way this is presented here sounds like it is an inevitable process that follows a straight line
and does not deviate. Of course life is not like that, and these stages are an ideal type meant to
help you get a handle on the topic. What makes qualitative data analysis dynamic, exciting and
intellectually challenging is the iteration between generation and analysis and within the
different types of analytical work.
Triangulation of analysis
It is very rare for qualitative data to be collected all in one go, then processed and analysed. If
this happened we might criticise the project for not being true to the context in which the data
were generated, which would make it a weak piece of work.
One way of producing believable, credible and trustworthy work is to use triangulation. This is
a term 'borrowed' from geography - and in qualitative analysis means more than one perspective
on a situation e.g. patients or service users, their families and friends, and service providers.
Fluency
To analyse texts for their meaning, researchers have to be fluent in the language which the
research participants use.
Not just the formal language, but also the colloquialisms used in everyday talk. Listen carefully
next time you are in a public place to the richness of everyday language that bears little
resemblance to standard English, and check with a friend their interpretation of a phrase or word
against your own. An inability to understand what is said will restrict researchers' abilities to
gain an understanding of participants' motives, meanings and behaviours.
Capturing talk
The act of capturing talk may shape what is said and in turn influence how it is analysed. Using
tape recorders to capture talk means that researchers may attend to the interviewee without
having to focus on writing down their talk verbatim. However, the recording has to be clear to
allow an accurate transcription, so attention to equipment and environment will have a direct
effect on the quality of the analysis.
Processing texts and archiving
The most common way of processing texts is to transcribe taped talk into word processed
documents. These may then be read and re-read to identify meaning, patterns and models.
Analytical notes and memos will be made, and all of these need to be stored carefully:
1. to protect the integrity of the original document,
2. to allow the various components of the current analysis to be identified,
3. to locate the source of the comments made.
There are software programmes which provide an orderly and rigorous framework for data
archival and administrative tasks. Each programme has built-in assumptions about data and how
it should be handled. Researchers need to choose with care a programme that is consistent with
their own perspective and with the characteristics of their data.
This is where qualitative data analysis software programmes come into their own because they
allow researchers to earmark segments of text, apply tags or descriptive labels to the segments,
and build up categories and themes of analysis. When it comes to writing the definitive research
document these segments can then be found easily in the archive, and directly inserted into the
text.