Evaluating Research
Preparation of research reports for journal publication or dissemination in some other form is a central part of the research process. This article discusses preparation of the report in light of how the information is likely to be evaluated and how the report contributes to the research process. The focus is on three essential features: description, explanation, and contextualization of the study. These features are elaborated by reviewing the contents of each section of the manuscript and questions to guide authors and reviewers for preparing and evaluating the report. Emphasis is placed on conveying the rationale for decisions made in the design, execution, and analysis of the study. Common issues related to the interpretation of assessment studies, including test validity data, the relation of constructs and measures, and sampling, are highlighted as well.
The research process consists of the design and execution of the study, analysis of the results, and preparation of the report (e.g., journal article). The final step seems straightforward and relatively easy, given the nature and scope of the other steps. In fact, one often refers to preparation of the article as merely "writing up the results." Yet the implied simplicity of the task belies the significance of the product in the research process. The article is not the final step in this process. Rather, it is an important beginning. The article is often a launching platform for the next study for the authors themselves or for others in the field who are interested in pursuing the findings. Thus, the report is central to the research process.

The article itself is not only a description of what was accomplished, but it also conveys the extent to which the design, execution, and analyses were well conceived and appropriate. Recognition of this facet of the report is the reason why faculty require students in training to write a proposal of the study in advance of its execution. At the proposal stage, faculty can examine the thought processes, design, planned execution, and data analyses and make the necessary changes in advance. Even so, writing the full article at the completion of the study raises special issues. At that point, the authors evaluate critical issues, see the shortcomings of the design, and struggle with any clashes or ambiguities of the findings in light of the hypotheses.

The purpose of this article is to discuss the preparation and evaluation of research reports (articles) for publication.1 Guidelines are presented to facilitate preparation of research articles. The guidelines cover the types of details that are to be included and, more important, the rationale, logic, and flow of the article that facilitate communication and advance the next stage of the research process. Thus, preparation of a research report involves many of the same considerations that underlie the design and plan of the research.

Completion of this research was supported by Research Scientist Award MH00353 and Grant MH35408 from the National Institute of Mental Health. Correspondence concerning this article should be addressed to Alan E. Kazdin, Department of Psychology, Yale University, P.O. Box 208205, New Haven, Connecticut 06520-8205.

1 Preparation of manuscripts for publication can be discussed from the perspective of authors and the perspective of reviewers (i.e., those persons who evaluate the manuscript for publication). This article emphasizes the perspective of authors and the task of preparing an article for publication. The review process raises its own issues, which this article does not address. Excellent readings are available to prepare the author for the journal review process (Kafka, The Trial; The Myth of Sisyphus; and Dante's Inferno).

Reports of empirical studies have many characteristics in common, whether or not they focus on assessment. Even so, the present focus will emphasize studies that are designed to evaluate assessment devices, constructs that the measures are intended to reflect, and studies of test validation. Issues that commonly emerge in articles on assessment, and hence in the design of assessment studies, are highlighted as well.

Guidelines for Preparing Reports for Publication

Preparation of the report for publication involves three interrelated tasks, which I shall refer to as description, explanation, and contextualization. Failure to appreciate or to accomplish these tasks serves as a main source of frustration for authors as their articles traverse the process of manuscript review toward journal publication. Description is the most straightforward task and includes providing details of the study. Even though this is an obvious requirement of the report, basic details often are omitted in published articles (e.g., gender and race of the participants, means, and standard deviations; see Shapiro & Shapiro, 1983; Weiss & Weisz, 1990). Explanation is slightly more complex insofar as this task refers to presenting the rationale for several facets of the study. The justification, the decision-making process, and the connections between the decisions and the goals of the study move well beyond description. There are numerous decision points in any given study, most of which can be questioned. The author is obliged to make the case to explain why the specific options elected are well suited to the hypotheses or the goals of the study. Finally, contextualization moves one step further away from description of the details of the study and addresses how the study fits in the context of other studies and in the knowledge base more generally. This latter facet of
SPECIAL ISSUE: PREPARING RESEARCH REPORTS 229
article preparation reflects such lofty notions as scholarship and perspective, because the author places the descriptive and explanatory material into a broader context.

The extent to which description, explanation, and contextualization are accomplished increases the likelihood that the report will be viewed as a publishable article and facilitates integration of the report into the knowledge base. Guidelines follow that emphasize these tasks in the preparation and evaluation of research reports. The guidelines focus on the logic of the study; the interrelations of the different sections; the rationale for specific procedures and analyses; and the strengths, limitations, and place of the study in the knowledge base. It may be helpful to convey how these components can be addressed by focusing on the main sections of manuscripts that are prepared for journal publication.

Main Sections of the Article

Abstract. At first glance, the abstract may not seem to be an important section or core feature of the article. Yet two features of the abstract make this section quite critical. First, the abstract is likely to be read by many more people than is the article. The abstract probably will be entered into various databases that are available internationally. Consequently, this is the only information that most readers will have about the study. Second, for reviewers of the manuscript and readers of the journal article, the abstract sometimes is the first impression of what the author studied and found. Ambiguity, illogic, and fuzziness here are ominous. Thus, the abstract is sometimes the only impression or first impression one may have about the study. What is said is critically important.

Obviously, the purpose of the abstract is to provide a relatively brief statement of the purpose, methods, findings, and conclusions of the study. Critical methodological descriptors pertain to the participants and their characteristics, experimental and control groups or conditions, design, and major findings. Often space is quite limited; indeed, a word limit (e.g., a 100- or 120-word maximum) may be placed on the abstract by the journals. It is useful to make substantive statements about the characteristics of the study and the findings rather than to provide general and minimally informative comments. Similarly, vacuous statements (e.g., "Implications of the results are discussed" or "Future directions for research are suggested") should be replaced with comments about the findings or one or two specific implications and research directions (e.g., "The findings raise the prospect that there is a Big One rather than a Big Five set of personality characteristics").

Introduction. The introduction is designed to convey the overall rationale and objective of the research. The task of the author is to convey in a clear and concise fashion why this particular study is needed and the current questions, void, or deficiency the study is designed to address. The section should not review the literature in a study-by-study fashion, but rather convey issues and evaluative comments that set the stage for the study that is to follow. The task of contextualization is critically important in this section. Placing the study in the context of what is and is not known and conveying the essential next step in research in the field require mastery of the pertinent literatures and reasonable communication skills. Saying that the study is important (without systematically establishing the context) or noting that no one else has studied this phenomenon often is viewed as a feeble attempt to circumvent the contextualization of the study.

Limitations of previous work, and how those limitations can be overcome, may be important to consider. These statements build the critical transition from an existing literature to the present study and establish the rationale for design improvements or additions in relation to those studies. Alternatively or in addition, the study may build along new dimensions to advance the theory, hypotheses, and constructs to a broader range of domains of performance, samples, settings, and so on. The rationale for the specific study must be very clearly established. If a new measure is being presented, then the need for the measure, and how it supplements or improves on existing measures (if any are available), are important to include. If a frequently used measure is presented, the rationale needs to establish firmly what precisely this study will add.

In general, the introduction will move from the very general to the specific. The very general refers to the opening of the introduction, which conveys the area of research, the general topic, and the significance of a problem. For example, if an article is on the assessment of alcohol abuse or marital bliss (or their interrelation), a brief opening statement noting the current state of the topic and its implications outside of the context of measurement is very helpful. Although reviewers are likely to be specialists in the assessment domain, many potential readers would profit from clarification of the broader context.

The introduction does not usually permit authors to convey all of the information they wish to present. In fact, the limit is usually two to four manuscript pages. A reasonable use of this space involves brief paragraphs or implicit sections that describe the nature of the problem, the current status of the literature, the extension that this study is designed to provide, and how the methods to be used are warranted. The extent to which the author conveys a grasp of the issues in the area and can identify the lacunae that the study is designed to fill greatly improves the quality of the report and the chances of acceptance for journal publication.

Method. This section of the article encompasses several points related to who was studied, why, how, and so on. The section not only describes critical procedures, but also provides the rationale for methodological decisions. Initially, the research participants (or subjects) are described, including several basic descriptors (e.g., age, gender, ethnicity, education, occupation, and income). From a method and design standpoint, information beyond basic descriptors can be helpful to encompass factors that plausibly could affect generality or replication of the results or that might influence comparison of the data with information obtained from normative or standardization samples.

The rationale for the sample should be provided. Why was this sample included, and how is it appropriate to the substantive area and question of interest? In some cases, the sample is obviously relevant because participants have the characteristic or disorder of interest (e.g., parents accused of child abuse) or are in a setting of interest (e.g., nursing home residents). In other cases, samples are included merely because they are available (college students or a clinic population recruited for some purpose other than the study). Such samples of convenience often count against the investigator. If characteristics of the sample
230 ALAN E. KAZDIN
are potentially objectionable in relation to the goals of the study, the rationale may require full elaboration to convey why the sample was included and how features of the sample may or may not be relevant to the conclusions the author wishes to draw. A sample of convenience is not invariably a problem for drawing valid inferences. Yet, invariably, a thoughtful discussion will be required regarding its use. More generally, participant selection, recruitment, screening, and other features warrant comment. The issue for the author and reviewer is whether features of the participant selection process could restrict the conclusions in some unique fashion or, worse, in some way represent a poor test of the hypotheses.

Assessment studies may be experimental studies in which groups vary in whether they receive an intervention or experimental manipulation. More commonly, assessment studies focus on intact groups without a particular manipulation. The studies form groups based on subject selection criteria (e.g., one type of patient vs. another, men vs. women) for analyses. The rationale for selecting the sample is obviously important. If the sample is divided into subgroups, it is as critical to convey how the groups will provide a test of the hypotheses and to show that characteristics incidental to the hypotheses do not differ or do not obscure interpretation of the results (see Kazdin, 1992). Also, the selection procedure and any risks of misclassification based on the operational criteria used (e.g., false positives and negatives) warrant comment. Reliability of the assessment procedures used to select cases, especially when human judgment is required, is very important because of the direct implications for interpretation and replication of the findings. A common example in clinical research arises in invoking psychiatric diagnoses using interview techniques.

Several measures are usually included in the study. Why the constructs were selected for study should be clarified in the introduction. The specific measures, and why they were selected to operationalize the constructs, should be presented in the method section. Information about the psychometric characteristics of the measures is often summarized. This information relates directly to the credibility of the results. Apart from individual assessment devices, the rationale for including or omitting areas that might be regarded as crucial (e.g., multiple measures, informants, and settings) deserves comment. The principle here is similar to that of other sections, namely, that the rationale for the author's decisions ought to be explicit.

Occasionally, ambiguous statements may enter into descriptions of measures. For example, measures may be referred to as "reliable" or "valid" in previous research as part of the rationale for use in the present study. There are, of course, many different types of reliability and validity. It is important to identify those characteristics of the measure found in prior research that are relevant to the present research. For example, high internal consistency (reliability) in a prior study may not be a strong argument for use of the measure in a longitudinal design in which the author hopes for test-retest reliability. Even previous data on test-retest reliability (e.g., over 2 weeks) may not provide a sound basis for expecting test-retest reliability over annual intervals. The information conveys the suitability of the measure for the study and the rationale of the author for selecting the measure in light of available strategies.

Results. It is important to convey why specific analyses were selected and how a particular test or comparison addresses the hypotheses or purposes presented earlier in the article. It is often the case that analyses are reported in a rote fashion in which, for example, the main effects are presented first, followed by the interactions for each measure. The author presents the analyses in very much the same way as the computer printout that provided multiple runs of the data. Similarly, if several dependent measures are available, a particular set of analyses is automatically run (e.g., omnibus tests of multivariate analyses of variance followed by univariate analyses of variance for individual measures). These are not the ways to present the data.

In the presentation of the results, it is important to convey why specific tests were selected and how these tests serve the specific goals of the study. Knowledge of statistics is critical for selecting the analysis to address the hypotheses of interest and the conditions met by the data. The tests ought to relate to the hypotheses, predictions, or expectations outlined at the beginning of the article (Wampold, Davis, & Good, 1990). Presumably, the original hypotheses were presented in a special (nonrandom) order, based on importance or level of specificity. It is very useful to retain this order when the statistics are presented to test these hypotheses. As a general rule, it is important to emphasize the hypotheses or relations of interest in the results; the statistics are only tools in the service of these hypotheses.

It is often useful to begin the results by presenting basic descriptors of the data (e.g., means and standard deviations for each group or condition) so the readers have access to the numbers themselves. If there are patterns in the descriptors, it is useful to point them out. Almost-significant results might be noted here to err on the side of conservatism regarding group equivalence on some domain that might affect interpretation of the results, particularly if power (or sample size) was too weak to detect such differences.

The main body of the results presents tests of the hypotheses or predictions. Organization of the results (subheadings) or brief statements of hypotheses before the specific analyses are often helpful to prompt the author to clarify how the statistical test relates to the substantive questions. As a step toward that goal, the rationale for the statistical tests chosen, or for the variations within a particular type of test, ought to be noted. For example, within factor analysis or multiple regression, the options selected (e.g., method of extracting factors, rotation, and method of entering variables) should be described along with the rationale for why these particular options are appropriate. The rationales are important as a general rule but may take on even greater urgency because of the easy use of software programs that can run the analyses. Default criteria on many software programs are not necessarily related to the author's conceptualization of the data, that is, the hypotheses. (Such information is referred to as "default criteria" because if the results do not come out with thoughtless analyses, it is partially "de fault of the criteria de investigator used.") Statistical decisions, whether or not explicit, often bear conceptual implications regarding the phenomena under investigation and the relations of variables to each other and to other variables.

Several additional or ancillary analyses may be presented to elaborate the primary hypotheses. For example, one might be able to reduce the plausibility that certain biases may have accounted for group differences based on supplementary or ancillary data analyses. Ancillary analyses may be more exploratory
and diffuse than tests of the primary hypotheses. Manifold variables can be selected for these analyses (e.g., gender, race, and height differences) that are not necessarily conceptually interesting in relation to the goals of the study. The author may wish to present data and data analyses that were unexpected, were not of initial interest, and were not the focus of the study. The rationale for these excursions and the limitations of interpretation are worth noting. From the standpoint of the reviewer and reader, the results should make clear what the main hypotheses were, how the analyses provide appropriate and pointed tests, and what conclusions can be reached as a result. In addition, thoughtful excursions (i.e., with the rationale guiding the reader) in the analyses are usually an advantage.

Discussion. The discussion consists of the conclusions and interpretations of the study and hence is the final resting place of all issues and concerns. Typically, the discussion includes an overview of the major findings, integration or relation of these findings to theory and prior research, limitations and ambiguities and their implications for interpretation, and future directions. The extent to which this can be accomplished in a brief space (e.g., two to five manuscript pages) is to the author's advantage. Description and interpretation of the findings may raise a tension between what the author wishes to say about the findings and their meaning versus what can be said in light of how the study was designed and evaluated. Thus, the discussion shows the reader the interplay of the introduction, method, and results sections. For example, the author might draw conclusions that are not quite appropriate given the method and findings. The discussion conveys flaws, problems, or questionable methodological decisions within the design that were not previously evident. However, they are flaws only in relation to the introduction and discussion. That is, the reader of the article can now recognize that if these are the types of statements the author wishes to make, the present study (design, measures, and sample) is not well suited for making them. The slight mismatch of interpretative statements in the discussion and the methodology is a common, albeit tacit, basis for not considering a study as well conceived and well executed. A slightly different study may be required to support the specific statements the author makes in the discussion; alternatively, the discussion might be more circumscribed in the statements that are made.

It is usually to the author's credit to examine potential sources of ambiguity, given that he or she is in an excellent position to do so because of familiarity with the procedures and expertise in the area. A candid, nondefensive appraisal of the study is very helpful. Here, too, contextualization may be helpful because limitations of a study are also related to prior research, trade-offs inherent in the exigencies of design and execution, what other studies have and have not accomplished, and whether a finding is robust across different methods of investigation. Although it is to the author's credit to acknowledge limitations of the study, there are limits on the extent to which reviewers grant a pardon for true confessions. At some point, the flaw is sufficient to preclude publication, whether or not it is acknowledged by the author. At other points, acknowledging potential limitations conveys critical understanding of the issues and directs the field to future work. This latter use of acknowledgement augments the contribution of the study and the likelihood of favorable evaluation by readers.

Finally, it is useful in the discussion to contextualize the results by continuing the story line that began in the introduction. With the present findings, what puzzle piece has been added to the knowledge base, what new questions or ambiguities were raised, what other substantive areas might be relevant for this line of research, and what new studies are needed? From the standpoint of contextualization, the new studies referred to here are not merely those that overcome methodological limitations of the present study, but rather those that focus on the substantive foci of the next steps for research.

Guiding Questions

The section-by-section discussion of the content of an article is designed to convey the flow or logic of the study and the interplay of description, explanation, and contextualization. The study ought to have a thematic line throughout, and all sections ought to reflect that thematic line in a logical way. The thematic line consists of the substantive issues guiding the hypotheses and the decisions of the investigator (e.g., with regard to procedures and analyses) that are used to elaborate these hypotheses.

Another way to consider the tasks of preparing a report is to consider the many questions the article ought to answer. These are questions for the authors to ask themselves and, equally, questions that reviewers and consumers of the research are likely to want to ask. Table 1 presents questions that warrant consideration. They are presented according to the different sections of a manuscript. The questions emphasize the descriptive information, as well as the rationale for procedures, decisions, and practices in the design and execution. Needless to say, assessment studies can vary widely in their purpose, design, and methods of evaluation, so the questions are not necessarily appropriate to each study, nor are they necessarily exhaustive. The set of questions is useful as a way of checking to see that many important facets of the study have not been overlooked.

General Comments

Preparation of an article often is viewed as a task of describing what was done. With this in mind, authors often are frustrated at the reactions of reviewers. In reading the reactions of reviewers, the authors usually recognize and acknowledge the value of providing more of the details that are required (e.g., further information about the participants or procedure). However, when the requests pertain to explanation and contextualization, authors are more likely to be baffled or defensive. This reaction may be reasonable because graduate training devotes much less attention to these facets of preparing research reports than to description. Also, reviewers' comments and editorial decision letters may not be explicit about the need for explanation and contextualization. For example, some of the more general reactions of reviewers are often reflected in comments such as "Nothing in the manuscript is new," "I fail to see the importance of the study," or "This study has already been done in a much better way by others."2 In fact, such characterizations may be true. Alternatively, the comments could also reflect the

2 I am grateful to my dissertation committee for permitting me to quote their comments at my oral exam. In keeping with the spirit embodied in their use of pseudonyms in signing the dissertation, they wish not to be acknowledged by name here.
Table 1
Major Questions to Guide Journal Article Preparation
Abstract
What were the main purposes of the study?
Who was studied (sample, sample size, special characteristics)?
How were participants selected?
To what conditions, if any, were participants exposed?
What type of design was used?
What were the main findings and conclusions?
Introduction
What is the background and context for the study?
What in current theory, research, or clinical work makes this study useful, important, or of interest?
What is different or special about the study in focus, methods, or design to address a need in the area?
Is the rationale clear regarding the constructs to be assessed?
What specifically were the purposes, predictions, or hypotheses?
Method
Participants
Who were the participants and how many of them were there in this study?
Why was this sample selected in light of the research goals?
How was this sample obtained, recruited, and selected?
What are the participant and demographic characteristics of the sample (e.g., gender, age, ethnicity, race, socioeconomic status)?
What if any inclusion and exclusion criteria were invoked (i.e., selection rules to obtain participants)?
How many of those participants eligible or recruited actually were selected and participated in the study?
Was informed consent solicited? How and from whom, if special populations were used?
Design
What is the design (e.g., longitudinal, cross-sectional) and how does the design relate to the goals of the study?
How were participants assigned to groups or conditions?
How many groups were included in the design?
How were the groups similar and different in how they were treated in the study?
Why were these groups critical to address the questions of interest?
Assessment
What were the constructs of interest and how were they measured?
What are the relevant reliability and validity data from previous research (and from the present study) that support the use of these measures for
the present purposes?
Were multiple measures and methods used to assess the constructs?
Are response sets or styles relevant to the use and interpretation of the measures?
How was the assessment conducted? By whom (as assessors/observers)? In what order were the measures administered?
If judges (raters) were used in any facet of assessment, what is the reliability (inter- or intrajudge consistency) in rendering their
judgments/ratings?
Procedures
Where was the study conducted (setting)?
What materials, equipment, or apparatuses were used in the study?
What was the chronological sequence of events to which participants were exposed?
What intervals elapsed between different aspects of the study (e.g., assessment occasions)?
What procedural checks were completed to avert potential sources of bias in implementation of the manipulation and assessments?
What checks were made to ensure that the conditions were carried out as intended?
What other information does the reader need to know to understand how participants were treated and what conditions were provided?
Results
What were the primary measures and data on which the predictions depend?
What are the scores on the measures of interest for the different groups and sample as a whole (e.g., measures of central tendency and variability)?
How do the scores compare with those of other study, normative, or standardization samples?
Are groups of interest within the study similar on measures and variables that could interfere with interpretation of the hypotheses?
What analyses were used and how specifically did these address the original hypotheses and purposes?
Were the assumptions of the data analyses met?
If multiple tests were used, what means were provided to control error rates?
If more than one group was delineated, were they similar on variables that might otherwise explain the results (e.g., diagnosis, age)?
Were data missing due to incomplete measures (not filled out completely by the participants) or due to loss of participants? If so, how were these
handled in the data analyses?
Are there ancillary analyses that might further inform the primary analyses or exploratory analyses that might stimulate further work?
Discussion
What were the major findings of the study?
How do these findings add to research and how do they support, refute, or inform current theory?
What alternative interpretations can be placed on the data?
What limitations or qualifiers must be placed on the study given methodology and design issues?
What research follows from the study to move the field forward?
Note. Further discussion of questions that guide the preparation of journal articles can be obtained in additional sources (Kazdin, 1992; Maher,
1978). Concrete guidelines on the format for preparing articles are provided by the American Psychological Association (1994).
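One of the Results questions above asks what means were provided to control error rates when multiple tests are used. As a minimal illustration (the function name and p values here are invented for the example), a Bonferroni correction simply divides the significance level by the number of tests performed:

```python
# Hypothetical illustration of one way to control the familywise error
# rate across multiple significance tests; the p values are invented.

def bonferroni(p_values, alpha=0.05):
    """Return, for each test, whether it remains significant after
    dividing alpha by the number of tests performed."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Four tests reported in one study; corrected threshold is .05 / 4 = .0125.
print(bonferroni([0.001, 0.02, 0.04, 0.30]))  # [True, False, False, False]
```

More powerful alternatives (e.g., Holm's step-down procedure) follow the same logic of adjusting the per-test threshold rather than testing each result at the nominal alpha.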
SPECIAL ISSUE: PREPARING RESEARCH REPORTS 233
extent to which the author has failed to contextualize the study to obviate these kinds of reactions.

The lesson for preparing and evaluating research reports is clear. Describing a study does not eo ipso establish its contribution to the field, no matter how strongly the author feels that the study is a first. Also, the methodological options for studying a particular question are enormous in terms of possible samples, constructs and measures, and data-analytic methods. The reasons for electing the particular set of options the author has chosen deserve elaboration.

In some cases, the author selects options because they were used in prior research. This criterion alone may be weak, because objections levied at the present study may also be appropriate to some of the prior work as well. The author will feel unjustly criticized for a more general flaw in the literature. Yet, arguing for a key methodological decision solely because "others have done this in the past" provides a very weak rationale, unless the purpose of the study is to address the value of the option as a goal of the study. Also, it may be that new evidence has emerged that makes the past practice more questionable in the present. For example, investigators may rely on retrospective assessment to obtain lifetime data regarding symptoms or early characteristics of family life, a seemingly reasonable assessment approach. Evidence suggests, however, that such retrospective information is very weak, inaccurate, and barely above chance when compared with the same information obtained prospectively (e.g., Henry, Moffitt, Caspi, Langley, & Silva, 1994; Robins et al., 1985). As evidence accumulates over time to make this point clear and as the domain of false memories becomes more well studied, the use of retrospective assessment methods is likely to be less acceptable among reviewers. In short, over time, the standards and permissible methods may change.

In general, it is beneficial to the author and to the field to convey the thought processes underlying methodological and design decisions. This information will greatly influence the extent to which the research effort is appreciated and viewed as enhancing knowledge. Yet, it is useful to convey that decisions were thoughtful and that they represent reasonable choices among the alternatives for answering the questions that guide the study. The contextual issues are no less important. As authors, we often expect the latent Nobel Prize caliber of the study to be self-evident. It is better to be very clear about how and where the study fits in the literature, what it adds, and what questions and research the study prompts.

Common Interpretive Issues in Evaluating Assessment Studies

In conducting studies and preparing reports of assessment studies, a number of issues can be identified to which authors and readers are often sensitive. These issues have to do with the goals, interpretation, and generality of the results of studies. I highlight three issues here: test validation, the relations of constructs to measures, and sampling. Each of these is a weighty topic in its own right and will be considered in other articles in this issue. In this article, they are addressed in relation to interpretation and reporting of research findings.

Interpreting Correlations Among Test Scores

Test validation is a complex and ongoing process involving many stages and types of demonstrations. As part of that process, evidence often focuses on the extent to which a measure of interest (e.g., a newly developed measure) is correlated with other measures. Interpreting seemingly simple correlations between measures requires attention to multiple considerations.

Convergent validation. Convergent validity refers to the extent to which a measure is correlated with other measures that are designed to assess the same or related constructs (Campbell & Fiske, 1959). There are different ways in which convergent validity can be shown, such as demonstrating that a given measure correlates with related measures at a given point in time (e.g., concurrent validity) and that groups selected on some related criterion (e.g., history of being abused vs. no such history) differ on the measure, as expected (e.g., criterion or known-groups validity).3 In convergent validity, the investigator may be interested in showing that a new measure of a construct correlates with other measures of that same construct or that the new measure correlates with measures of related constructs. With convergent validity, some level of agreement between measures is sought.

3 There are of course many different types of validity, and often individual types are referred to inconsistently. For a discussion of different types of validity and their different uses, the reader is referred to other sources (Kline, 1986; Wainer & Braun, 1988).

In one scenario, the investigator may wish to correlate a measure (e.g., depression) with measures of related constructs (e.g., negative cognitions and anxiety). In this case, the investigator may search for correlations that are in the moderate range (e.g., r = .40-.60) to be able to say that the measure of interest was correlated in the positive direction, as predicted, with the other (criterion) measures. Very high correlations raise the prospect that the measure is assessing the "same" construct or adds no new information. In cases in which the investigator has developed a new measure, the correlations of that measure will be with other measures of the same construct. In this case, high correlations may be sought to show that the new measure in fact does assess the construct of interest.

Interpretation of convergent validation data requires caution. To begin with, the positive, moderate-to-high correlation between two measures could well be due to shared trait variance in the construct domains, as predicted between the two measures. For example, two characteristics (e.g., emotionality and anxiety) might overlap because of their common psychological, biological, or developmental underpinnings. This is usually what the investigator has in mind by searching for convergent validity. However, other interpretations are often as parsimonious or even more so. For example, shared method variance may be a viable alternative interpretation for the positive correlation. Shared method variance refers to similarity or identity in the procedure or format of assessment (e.g., both measures are self-report or both are paper-and-pencil measures). For example, if two measures are completed by the same informant, their common method variance might contribute to the magnitude of the correlation. The correlations reflect the shared method variance, rather than, or in addition to, the shared construct variance.
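The role of shared method variance can be made concrete with a small simulation. This is a hedged sketch with invented weights, not an analysis from any study discussed here: each score is built from a trait component shared by both measures, an optional method component (e.g., the same informant completing both), and unique error. Adding the method component inflates the observed correlation even though the shared trait variance is unchanged.

```python
# Minimal simulation of trait variance vs. shared method variance.
# All weights and sample sizes are invented for illustration.
import random

def simulate_correlation(n=20000, trait_w=0.6, method_w=0.0, seed=1):
    """Correlation between two simulated measures that share a trait
    component and, optionally, a method component."""
    random.seed(seed)
    xs, ys = [], []
    for _ in range(n):
        trait = random.gauss(0, 1)    # shared construct (e.g., depression)
        method = random.gauss(0, 1)   # shared method (e.g., same informant)
        xs.append(trait_w * trait + method_w * method + random.gauss(0, 1))
        ys.append(trait_w * trait + method_w * method + random.gauss(0, 1))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

r_trait_only = simulate_correlation(method_w=0.0)
r_with_method = simulate_correlation(method_w=0.6)
# The second correlation is larger, although the trait overlap is identical.
print(round(r_trait_only, 2), round(r_with_method, 2))
```

In expectation, the trait-only correlation here is .36/1.36 (about .26), while adding the method component raises it to .72/1.72 (about .42): the same construct overlap, a noticeably higher "convergent" correlation.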
234 ALAN E. KAZDIN
The correlation between two measures that is taken to be evidence for validity also could be due to shared items in the measures. For example, studies occasionally evaluate the interrelations (correlations) among measures of depression, self-esteem, hopelessness, and negative cognitive processes. Measures of these constructs often overlap slightly, so that one particular scale has items that very closely resemble items in another scale (e.g., how one views or feels about oneself). Item overlap is not an inherent problem because conceptualizations of the two domains may entail common features (i.e., shared trait variance). However, in an effort at scale validation, it may provide little comfort to note that the two domains (e.g., hopelessness and negative cognitive processes) are moderately to highly correlated "as predicted." When there is item overlap, the correlation combines reliability (alternative form or test-retest) with validity (concurrent and predictive).

Low correlations between two measures that are predicted to correlate moderately to highly warrant comment. In this case, the magnitude of the correlation is much lower than the investigator expected and is considered not to support the validity of the measure that is being evaluated. Three considerations warrant mention here and perhaps analysis in the investigation. First, the absolute magnitude of the correlation between two measures is limited by the reliability of the individual measures. The low correlation may then underestimate the extent to which the reliable portion of variance within each measure is correlated. Second, it is possible that the sample and its scores on one or both of the measures represent a restricted range. The correlation between two measures, even if high in the population across the full range of scores, may be low in light of the restricted range. Third, it is quite possible that key moderators within the sample account for the low correlation. For example, it is possible that the correlation is high (and positive) for one subsample (men) and low (and negative) for another subsample. When these samples are treated as a single group, the correlation may be low or zero, and nonsignificant. A difficulty is scavenging for these moderators in a post hoc fashion. However, in an attempt to understand the relations between measures, it is useful to compute within-subsample correlations on key moderators such as gender, ethnicity, and patient status (patient vs. community) where relations between the measures are very likely to differ. Of course, the study is vastly superior when an influence moderating the relations between measures is theoretically derived and predicted.

Discriminant validity. Discriminant validity refers to the extent to which measures not expected to correlate or not to correlate very highly in fact show this expected pattern.4 By itself, discriminant validity may resemble support for the null hypothesis; namely, no relation exists between two measures. Yet, the meaning of discriminant validity derives from the context in which it is demonstrated. That context is a set of measures, some of which are predicted to relate to the measure of interest (convergent validity) and others predicted to relate less well or not at all (discriminant validity). Convergent and discriminant validity operate together insofar as they contribute to construct validity (i.e., identifying what the construct is and is not like). A difficulty in many validational studies is attention only to convergent validity.

4 Discriminant validity is used here in the sense originally proposed by Campbell and Fiske (1959). Occasionally, discriminant validity is used to refer to cases in which a measure can differentiate groups (e.g., Trull, 1991). The different meanings of the term and the derivation of related terms such as discriminate, discriminative, and divergent validity reflect a well-known paradox of the field, namely, that there is little reliability in discussing validity.

With discriminant validity, one looks for little or no relation between two or more measures. As with convergent validity, discriminant validity also raises interpretive issues. Two measures may have no conceptual connection or relation but still show significant and moderate-to-high correlation because of common method variance. If method variance plays a significant role, as is often the case when different informants are used, then all the measures completed by the same informant may show a similar level of correlation. In such a case, discriminant validity may be difficult to demonstrate.

Discriminant validity raises another issue for test validation. There is an amazing array of measures and constructs in the field of psychology, with new measures being developed regularly. The question in relation to discriminant validity is whether the measures are all different and whether they reflect different or sufficiently different constructs. The problem has been recognized for some time. For example, in validating a new test, Campbell (1960) recommended that the measure be correlated with measures of social desirability, intelligence, and acquiescence and other response sets. A minimal criterion for discriminant validation, Campbell proposed, is to show that the new measure cannot be accounted for by these other constructs. These other constructs, and no doubt additional ones, have been shown to have a pervasive influence across several domains, and their own construct validity is relatively well developed. It is likely that they contribute to and occasionally account for other new measures.

Few studies have adhered to Campbell's (1960) advice, although the recommendations remain quite sound. For example, a recent study validating the Sense of Coherence Scale showed that performance on the scale has a low and nonsignificant correlation with intelligence (r = .11) but a small-to-moderate correlation (r = .39) with social desirability (Frenz, Carey, & Jorgensen, 1993). Of course, convergent and discriminant validity depend on multiple sources of influence rather than two correlations. Even so, as the authors noted, the correlation with social desirability requires some explanation and conceptual elaboration.

General comments. Convergent and discriminant validity raise fundamental issues about validation efforts because they require specification of the nature of the construct and then tests to identify the connections and boundary conditions of the measure. Also, the two types of validity draw attention to patterns of correlations among measures in a given study and the basis of the correlation. The importance of separating or examining the influence of shared method factors that contribute to this correlation pattern motivated the recommendation to use multitrait and multimethod matrices in test validation (Campbell & Fiske, 1959). In general, demonstration of convergent and discriminant validity and evaluation of the impact of common method variance are critical to test validation. In the design and reporting of assessment studies, interpretation of the results very much depends on what can and cannot be said about the measure. The interpretation is greatly facilitated by
providing evidence for both convergent and discriminant validity.

Constructs and Measures

Assessment studies often vary in the extent to which they reflect interests in constructs or underlying characteristics of the measures and in specific assessment devices themselves. These emphases are a matter of degree, but worth distinguishing to convey the point and its implications for preparing and interpreting research reports. Usually researchers develop measures because they are interested in constructs (e.g., temperament, depression, or neuroticism). Even in cases in which measures are guided by immediately practical goals (e.g., screening and selection), there is an interest in the bases for the scale (i.e., the underlying constructs).

The focus on constructs is important to underscore. The emphasis on constructs draws attention to the need for multiple measures. Obviously, a self-report measure is important, but it is an incomplete sample of the construct. Perhaps less obvious is the fact that direct samples of behavior also are limited, because they are only a sample of the conditions as specified at a given time under the circumstances of the observations. Sometimes investigators do not wish to go beyond the measure or at least too much beyond the measure in relation to the inferences they draw. Self-report data on surveys (e.g., what people say about a social issue or political candidate or what therapists say they do in therapy with their clients) and direct observations of behavior (e.g., how parents interact with their children at home) may be the assessment focus. Even in these instances, the measure is used to represent broader domains (e.g., what people feel, think, or do) beyond the confines of the operational measure. In other words, the measure may still be a way of talking about a broader set of referents that is of interest besides test performance. Anytime an investigator wishes to say more than the specific items or contents of the measure, constructs are of interest.

Any one measure, however well established, samples only a part or facet of the construct of interest. This is the inherent nature of operational definitions. In preparing reports of assessment studies, the investigator ought to convey what constructs are underlying the study and present different assessment devices in relation to the sampling from the construct domain. A weakness of many studies is using a single measure to assess a central construct of interest. A single measure can sample a construct, but a demonstration is much better when multiple measures represent that construct.

The focus on constructs also draws attention to the interrelation among different constructs. Although a researcher may wish to validate a given measure and evaluate his or her operational definition, he or she also wants to progress up the ladder of abstraction to understand how the construct behaves and how the construct relates to other constructs. These are not separate lines of work, because an excellent strategy for validating a measure is to examine the measure in the context of other measures of that construct and measures of other constructs. For example, a recent study examined the construct psychological stress by administering 27 self-report measures and identifying a model to account for the measures using latent-variable analyses (Scheier & Newcomb, 1993). Nine latent factors were identified through confirmatory factor analyses (e.g., emotional distress, self-derogation, purpose in life, hostility, anxiety, and others). Of special interest is that the study permitted evaluation of the relations of several scales to each other as well as to the latent variables and the relation of latent variables (as second-order factors) to each other. This level of analysis provides important information about individual measures and contributes to the understanding of different but related domains of functioning and their interrelations to each other. At this higher level of abstraction, one can move from assessment to understanding the underpinnings of the constructs or domains of functioning (e.g., in development), their course, and the many ways in which they may be manifested.

Although all assessment studies might be said to reflect interest in constructs, clearly many focus more concretely at a lower level of abstraction. This is evident in studies that focus on the development of a particular scale, as reflected in evaluation of psychometric properties on which the scale depends. Efforts to elaborate basic features of the scale are critically important. Later in the development of the scale, one looks to a measure to serve new purposes or to sort individuals in ways that elaborate one's understanding of the construct. It is still risky to rely on a single measure of a construct no matter how good that validational research has been. Thus, studies using an IQ test or an objective personality inventory still raise issues if only one test is used, as highlighted later. For a given purpose (e.g., prediction), a particular measure may do very well. Ultimately, the goal is understanding in addition to prediction, and that requires greater concern with the construct and multiple measures that capture different facets of the construct.

In designing studies that emphasize particular measures, it is important to draw on theory and analyses of the underlying constructs as much as possible. From the standpoint of psychology, interest usually extends to the theory, construct, and clinical phenomena that the measure was designed to elaborate. Also, research that is based on a single assessment device occasionally is met with ambivalence. The ambivalence often results from the view that a study of one measure is technical in nature, crassly empirical, and theoretically bereft. The focus on a single measure without addressing the broader construct in different ways is a basis for these concerns. And, at the level of interpretation of the results, the reliance on one measure, however well standardized, may be viewed as a limitation.

At the same time, there is a widespread recognition that the field needs valid, standardized, and well-understood measures. Programs of research that do the necessary groundwork are often relied on when selecting a measure or when justifying its use in a study or grant proposal. When preparing articles on assessment devices, it is important to be sensitive to the implications that the study has for understanding human functioning in general, in addition to understanding how this particular measure operates. Relating the results of assessment studies to conceptual issues, rather than merely characterizing a single measure, can greatly enhance a manuscript and the reactions of consumers regarding the contribution.

Sample Characteristics and Assessment Results

Sampling can refer to many issues related to the participants, conditions of the investigation, and other domains to which one
wishes to generalize (Brunswik, 1955). In assessment studies, a special feature of sampling warrants comment because of its relevance for evaluating research reports. The issue pertains to the structure and meaning of a measure with respect to different population characteristics. Occasionally, the ways in which studies are framed suggest that the characteristics of a scale inhere in the measure in some fixed way, free from the sample to which the scale was applied.

It is quite possible that the measure and indeed the constructs that the measure assesses behave differently across samples, as a function of gender, age, race, and ethnicity (e.g., McDermott, 1995). Such differences have important implications for test standardization and interpretation beyond the scope of the present discussion. Sensitivity to such potential differences and evaluation of such differences in the design of research can be very helpful. Ideally, an assessment study will permit analyses of the influence of one or more sample characteristics that plausibly could influence conclusions about the measure. For example, in a recent evaluation of scales to study motives for drinking alcohol, analyses showed that the factor model that fit the measure was invariant across male and female, Black and White, and older and younger adolescents (Cooper, 1994). The inclusion of multiple samples and a sufficient sample size to permit these subsample analyses (N > 2,000) enabled the research to make a significant contribution to assessment and scale structure. From the study, it was learned that the structure of the measure is robust across samples. Apart from scale characteristics, the generality of the model may have important implications for adolescent functioning in general.

A more common research approach is to sift through separate studies, each representing an attempt to replicate the factor structure with a slightly different population (e.g., Derogatis & Cleary, 1977; Schwarzwald, Weisenberg, & Solomon, 1991; Takeuchi, Kuo, Kim, & Leaf, 1989). Such research often shows that the central features of the measure differ with different samples. One difficulty lies in bringing order to these sample differences, in large part because they are not tied to theoretical hypotheses about characteristics of the samples that might explain the differences (Betancourt & Lopez, 1993). Also, from the standpoint of subsequent research, guidelines for using the measure are difficult to cull from the available studies.

Evaluating assessment devices among samples with different characteristics is important. However, one critically important step before evaluating these assessment devices is the replication of the scale results with separate samples from the same population. Some studies include large standardization samples and hence provide within-sample replication opportunities. More common among assessment studies is the evaluation of the measure with smaller samples. It is important to replicate findings on the structure of the scale or the model used to account for the factors within the scale. Even when separate samples are drawn from the same population, the findings regarding scale characteristics may not be replicated (e.g., Parker, Endler, & Bagby, 1993). Evaluation of multiple samples is very important in guiding use of the measure in subsequent research.

Sampling extends beyond issues related to participants. Sampling refers to drawing from the range of characteristics or domains to which one wishes to generalize (Brunswik, 1955). In relation to assessment studies, the use of multiple measures to assess a construct is based in part on sampling considerations. Conclusions should not be limited to a single operation (measure or type of measure). There may be irrelevancies associated with any single measure that influence the obtained relation between the constructs of interest. A study is strengthened to the extent that it samples across different assessment methods and different sources of information.

A familiar finding when multiple measures of a given construct are used is that the measures often reflect different conclusions. For example, two measures of family functioning may show that they are not very highly related to each other. One measure may show great differences between families selected because of a criterion variable, whereas the other measure may not. These results are often viewed as mixed or as partial support for an original hypothesis. The investigator usually has to provide a good reason why different measures of seemingly similar constructs show different results. However, the study is stronger for the demonstration when compared with a study that did not operationalize family functioning in these different ways. An issue for the field is to make much further conceptual progress in handling different findings that follow from different methods of assessment.

Conclusion

Preparing reports for publication involves describing, explaining, and contextualizing the study. The descriptive feature of the study is essential for the usual goals such as facilitating interpretation and permitting replication of the procedures, at least in principle. However, the tasks of explaining the study by providing a well thought-out statement of the decisions and contextualizing the study by placing the demonstration into the field more generally are the challenges. The value of a study is derived from the author's ability to make the case that the study contributes to the literature, addresses an important issue, and generates important answers and questions.

In this article, I discussed some of the ways in which authors can make such a case when preparing a research article.5 Generally, the task is to convey the theme or story line, bringing all of the sections of the study in line with that, and keeping irrelevancies to a minimum. In the context of assessment studies, three issues were highlighted because they affect many studies and their interpretation. These include interpretation of correlations between measures, the relation of constructs and measures, and sampling. Each issue was discussed from the standpoint of ways of strengthening research. Test validation, development of assessment methods from constructs, and sampling raise multiple substantive and methodological issues that affect both the planning and reporting of research. Many of the articles that follow elaborate on these issues.

5 In closing, it is important to convey that recommendations in this article regarding manuscript preparation and journal publication derive from my experiences as an editor rather than as an author. As an author, the picture has not always been as pretty. For example, over the course of my career, such as it is, two journals went out of business within a few months after a manuscript of mine was accepted for publication and forwarded to production. Although this could be a coincidence in the career of one author, in this case the result was significant (p < .05), using a chi-square test and correcting for continuity, sphericity, and leptokurtosis.
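The within-sample replication of scale structure discussed under Sample Characteristics can be sketched numerically. In this hypothetical example (all loadings are invented), item loadings for one factor are derived separately from two halves of a standardization sample and compared with Tucker's congruence coefficient; values near 1.0 suggest that the factor replicates across subsamples:

```python
# Hedged sketch: comparing a factor's item loadings across two subsamples
# with Tucker's congruence coefficient. All loadings are invented.

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den

loadings_half1 = [0.71, 0.65, 0.58, 0.70]  # four items, subsample 1
loadings_half2 = [0.69, 0.62, 0.61, 0.66]  # the same items, subsample 2
print(congruence(loadings_half1, loadings_half2))  # close to 1.0
```

The coefficient is the cosine of the angle between the two loading vectors, so it rewards agreement in the pattern and relative size of loadings rather than in their absolute values alone.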
SPECIAL ISSUE: PREPARING RESEARCH REPORTS 237