A Review of Research Process, Data Collection and Analysis: Surya Raj Niraula
Review Article
[Figure: The research process: define research problems → review of literature → formulate hypothesis → data collection → analysis]
On the other hand, in-depth interviews and unstructured observations are associated
with qualitative research. Socially stigmatized and hidden issues are best understood
and explored through the qualitative approach. The purpose of quantitative research,
in contrast, is to measure predetermined concepts or variables objectively and to
examine the relationships between them numerically and statistically. Researchers
have to choose the methods that are appropriate for answering their questions.
Basically there are two sources of data: primary and secondary. Secondary data,
which are generally obtained from different departments of a country, such as health,
education, and population, and may be collected from hospital, clinic, and school
records, can be utilized for our own research. Secondary sources include private and
foundation databases, city and county governments, surveillance data from government
programs, and federal agency statistics (Census, NIH, etc.). The use of secondary data
may save survey cost and time, and may be accurate if a government agency has
collected the information. However, it has several limitations. Secondary data may be
out of date for what we want to analyze, or may not have been collected over a long
enough period to detect trends, e.g. organism patterns registered in a hospital for only
two months. A major limitation is that the research objectives must be formulated
around the variables available in the data set. There may also be missing information
on some observations; unless such missing information is caught and corrected for,
the analysis will be biased. Many biases can arise, such as sample selection bias, source
choice bias, and dropout.
The primary source has more advantages than the secondary source of data.
Primary data can be collected through surveys, focus groups, questionnaires,
personal interviews, experiments, and observational studies. If we take the time to
design the collection instrument, select the population or sample, pretest/pilot the
instrument to work out sources of bias, administer the instrument, and collect and
enter the data, then by using a primary source the researcher can minimize sampling
bias and other confounding biases.
Analysis
Analysis is an important part of research. The analysis of the data depends
upon the types of variables and their nature [3]. The first step in data analysis is to
describe the characteristics of the variables. The analysis can be outlined as follows:
Summarizing data: Data are a collection of values of one or more variables. A
variable is a characteristic of the sample that takes different values for different
subjects. Values can be numeric (continuous), counts (discrete), or categories. Continuous
variables have numeric meaning, a unit of measurement, and may take fractional
values, like height, weight, blood pressure, monthly income, etc. The other type of
numeric variable is the discrete variable, which arises from a counting process, like the
number of students in different classes, the number of patients visiting the OPD each day, etc. [4].
If the variables are numeric, they can be explored by plotting a histogram, stem-and-leaf
plot, box-and-whisker plot, and normal plots to visualize how well the values fit a
normal distribution. When the variables are categorical, they can be visualized with pie
charts, bar diagrams, or simply frequencies and percentages.
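As a minimal sketch of this exploration step (Python with pandas and matplotlib; the data frame and its columns sbp and sex are hypothetical placeholders, not data from the article):

```python
# A minimal sketch: exploring one numeric and one categorical variable.
# The data frame and column names ("sbp", "sex") are made-up illustrations.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "sbp": [118, 125, 132, 121, 140, 128, 135, 119, 150, 127],  # numeric
    "sex": ["M", "F", "F", "M", "M", "F", "M", "F", "M", "F"],  # categorical
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Numeric variable: histogram and box-and-whisker plot to eyeball normality.
axes[0].hist(df["sbp"], bins=5)
axes[0].set_title("Histogram of SBP")
axes[1].boxplot(df["sbp"])
axes[1].set_title("Box-and-whisker plot")

# Categorical variable: bar diagram of frequencies.
df["sex"].value_counts().plot(kind="bar", ax=axes[2], title="Sex (frequency)")

plt.tight_layout()
plt.show()
```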
Simple summary statistics for numeric variables are:

a) Mean: the average of the values.

b) Standard deviation: the typical variation of individual values about the mean.

c) Standard error of the mean: the typical variation in the mean with repeated
sampling, obtained as the standard deviation divided by the square root of the
sample size.
The mean and standard deviation are the most commonly used measures of central
tendency and dispersion, respectively, for normally distributed data (Tables 1,2). The
median (middle value or 50th percentile) and quartiles (25th and 75th percentiles)
are used for grossly non-normally distributed data.
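In symbols (standard definitions, for a sample $x_1, \ldots, x_n$ of size $n$):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad
\mathrm{SEM} = \frac{s}{\sqrt{n}}
```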
Table 1 describes how the different tests are applied for different purposes.
Simple statistics for categorical variables are the frequency, proportion, or odds ratio.
The effect size derived from a statistical model (equation) of the form Y (dependent)
versus X (predictor) depends on the types of Y and X.
a) If the model is numeric versus numeric, e.g. SBP versus cholesterol, linear
regression with the correlation coefficient can be used to find the relationship
between the variables; the effect statistics are the slope and intercept of the
regression line, called the parameters. The correlation coefficient is interpreted
in terms of the variance explained by the model, which provides a measure of
goodness of fit. Another statistic, the typical or standard error of the estimate,
gives the residual error and is the basis of measures of validity (with the
criterion variable on the Y axis). A combined sketch of this and the following
models appears after this list.
b) If the model is numeric versus categorical, e.g. marks in a medical exam
versus sex, the test will be the t-test for 2 groups and one-way ANOVA for more
than two groups (Table 2). The effect statistics will be differences between
means, expressed as a raw difference, percent difference, or fraction of the root
mean square error, which is an average standard deviation of the groups.
Table 2 shows the results of ANOVA for academic performance.
c) If the model is categorical versus categorical, e.g. smoking habit versus sex,
the test will be the chi-square or Fisher exact test, where the effect statistics
are relative frequencies, expressed as a difference in frequencies, a ratio of
frequencies (relative risk), or an odds ratio. The relative risk is appropriate for
prospective (cohort) designs, whereas the odds ratio also suits case-control designs.
Table 2: Academic performance at different levels of the MBBS students during 1994 to 1996 (mean ± SD).

Batches (n) | SLCS | ISS | EES | MBBS I | MBBS II | MBBS III | MBBS IV | MBBS V | MBBS Total
1994 (29) | 74.2±6.3 | 71.9±7.8 | 71.3±2.5 | 67.8±5.2 | 71.0±5.1 | 73.3±5.9 | 69.5±4.3 | 65.8±3.0 | 69.5±4.2
1995 (29) | 75.2±5.1 | 69.9±8.7 | 52.1±4.0 | 67.3±5.4 | 68.0±4.0 | 65.3±3.5 | 65.3±3.5 | 62.3±17.6 | 65.6±5.6
1996 (28) | 76.4±5.3 | 71.2±8.4 | 54.4±4.2 | 69.3±5.1 | 73.2±4.6 | 64.3±3.0 | 65.7±3.2 | 66.5±3.2 | 67.8±3.4
F value | 1.1 | 0.4 | 241.2 | 1.1 | 9.2 | 42.7 | 11.1 | 1.3 | 5.2
P value | NS | NS | <0.0001 | NS | <0.0001 | <0.0001 | <0.0001 | NS | <0.01

Source: Niraula et al., 2006 [6].
d) If the model is a nominal category versus two or more predictors, e.g. heart
disease versus age, sex, and regular exercise, the test will be categorical
modeling, where the effect statistics are relative risks or odds ratios. This can
be analyzed using logistic regression or generalized linear modeling. Most
complex models are reducible to t-tests, regression, or relative frequencies.
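The following is a minimal, self-contained sketch of models a) through d) (Python with scipy and statsmodels; all the numbers are made-up illustrations and the variable names are hypothetical, not data from the article):

```python
# A combined sketch of the four model types; all data are invented examples.
import numpy as np
from scipy import stats
import statsmodels.api as sm

# a) Numeric vs numeric: linear regression (slope, intercept, variance explained).
chol = np.array([4.2, 5.1, 5.8, 6.3, 7.0, 7.7])
sbp = np.array([118, 124, 131, 135, 142, 150])
fit = stats.linregress(chol, sbp)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.1f}, r^2={fit.rvalue**2:.2f}")

# b) Numeric vs categorical: t-test for 2 groups, one-way ANOVA for more than 2.
male = [62, 70, 68, 75, 66]
female = [71, 74, 69, 78, 72]
print(stats.ttest_ind(male, female))               # two groups
print(stats.f_oneway(male, female, [65, 69, 73]))  # three groups

# c) Categorical vs categorical: chi-square (or Fisher exact for small counts).
table = np.array([[20, 30],   # smokers: male, female
                  [25, 25]])  # non-smokers: male, female
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"odds ratio={odds_ratio:.2f}, Fisher p={p_fisher:.3f}")

# d) Binary outcome vs several predictors: logistic regression; the
#    exponentiated coefficients are interpreted as odds ratios.
age = np.array([40, 50, 60, 45, 65, 55, 70, 35])
exercise = np.array([1, 0, 1, 1, 0, 0, 0, 1])  # 1 = regular exercise
disease = np.array([0, 1, 1, 0, 1, 0, 1, 0])
X = sm.add_constant(np.column_stack([age, exercise]))
model = sm.Logit(disease, X).fit(disp=False)
print(np.exp(model.params))  # odds ratios for intercept, age, exercise
```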
The logic of a significance test runs as follows:

− Assume the null hypothesis: that the true value is zero (null).

− If the observed value falls in a region of extreme values that would occur only
5% of the time, we reject the null hypothesis.

− That is, we decide that the true value is unlikely to be zero; we can state that the
result is statistically significant at the 5% level.

− If the observed value does not fall in the 5% unlikely region, most people
mistakenly accept the null hypothesis: they conclude that the true value is zero
or null!

− The p value helps us decide whether our result falls in the unlikely region.
One meaning of the p value is the probability of a more extreme observed value
(positive or negative) when the true value is zero. A better meaning of the p value: if we
observe a positive effect, 1 − p/2 is the chance that the true value is positive, and p/2 is
the chance that the true value is negative. For example, if we observe a 1.5% enhancement
of performance with p = 0.08, there is a 96% chance that the true effect is an
enhancement of some kind and a 4% chance that the true effect is an impairment. This
interpretation does not take into account trivial enhancements and impairments.
Therefore, if we must use p values, we should show exact values, not p<0.05 or p>0.05.
Meta-analysts also need the exact p value (or confidence limits).
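Making the arithmetic of that example explicit (under the interpretation just described, with p = 0.08 for an observed enhancement):

```latex
P(\text{true effect is an enhancement}) = 1 - \frac{p}{2} = 1 - 0.04 = 0.96, \qquad
P(\text{true effect is an impairment}) = \frac{p}{2} = 0.04
```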
If the true value is zero, there is a 5% chance of getting statistical significance: the
Type I error rate, or rate of false positives (false alarms). There is also a chance that the
smallest worthwhile true value will produce an observed value that is not statistically
significant: the Type II error rate, or rate of false negatives (failed alarms). The Type II
error rate is related to the sample size of the research. In the old-fashioned approach to
research design, we are supposed to have enough subjects to keep the Type II error rate
to 20%: that is, the study is supposed to have a power of 80% to detect the smallest
worthwhile effect. If we look at lots of effects in a study, there is an increased chance of
being wrong about at least one of them. Old-fashioned statisticians like to control this
inflation of the Type I error rate within an ANOVA to make sure the increased chance
is kept to 5%. This approach is misguided.
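As a minimal sketch of the 80%-power design rule (Python with statsmodels; the standardized effect size of 0.5 is a hypothetical "smallest worthwhile effect", not a value from the article):

```python
# Sample size per group for a two-sample t-test with alpha = 0.05 and
# power = 0.80 (i.e., a Type II error rate of 20%). The effect size of
# 0.5 SD is a made-up "smallest worthwhile effect" for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n_per_group:.0f}")  # about 64
```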
Summary
In summary, the research process begins with defining the research problem and
proceeds through review of the literature, formulation of hypotheses, data collection,
analysis, and interpretation, and ends with report writing. Many biases can occur
during data collection. Importantly, the analysis of research data should be done with
great caution. If a researcher uses a statistical test for significance, he/she should show
exact p values; better still, show confidence limits instead. The standard error of the
mean should be shown only when estimating a population parameter. Usually the
between-subject standard deviation should be presented to convey the spread between
subjects; in population studies, this standard deviation helps convey the magnitude of
differences or changes in the mean. In interventions, the within-subject standard
deviation (the typical error) should also be shown to convey the precision of measurement.
Chi-square and Fisher exact tests are used for categorical variables (category versus
category). Two numeric variables are examined with the correlation coefficient. For a
numeric variable versus two categories, the t test is suitable in the case of normal data;
ANOVA should be applied for a numeric variable versus more than two categories
(groups). A multiple regression model is used to find the adjusted effects of all
candidate predictors (two or more) on a numeric response variable.
References
1. The Advanced Learner’s Dictionary of Current English. Oxford. 1952; 1069. Ref.: https://goo.gl/K7pKvD
2. Farrugia P, Petrisor BA, Farrokhyar F, Bhandari M. Practical tips for surgical research: Research
questions, hypotheses and objectives. Can J Surg. 2010; 53: 278-281. Ref.: https://goo.gl/Rf6DED
3. Niraula SR, Jha N. Review of common statistical tools for critical analysis of medical research.
JNMA. 2003; 42: 113-119.
4. Reddy MV. Organisation and collection of data. In: Statistics for Mental Health Care Research. 2002;
Edition 1: 13-23.
5. Lindman HR. Analysis of Variance in experimental design. New York: Springer-Verlag, 1992. Ref.:
https://goo.gl/jXeec5
6. Niraula SR, Khanal SS. Critical analysis of performance of medical students. Education for Health.
2006; 19: 5-13. Ref.: https://goo.gl/5dFKUK
7. Indrayan A, Gupta P. Sampling techniques, confidence intervals and sample size. Natl Med J
India. 2000; 13: 29-36. Ref.: https://goo.gl/1nbNpQ
8. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med. 1986; 105: 429-435.
Ref.: https://goo.gl/acDett