Collecting Existing Data
frequently make a survey the preferred data collection method. One or more of the six survey
designs reviewed in this chapter (including mixed-mode) can be applied to almost any research
question. It is no wonder that surveys have become the most popular research method in social
work and that they frequently inform discussion and planning about important social and political
questions. As use of the Internet increases, survey research should become even more efficient and
popular.
The relative ease of conducting at least some types of survey research leads many people to
imagine that no particular training or systematic procedures are required. Nothing could be further
from the truth. But as a result of this widespread misconception, you will encounter a great many
nearly worthless survey results.
You must be prepared to examine carefully the procedures used in any survey before accepting its
findings as credible. If you decide to conduct a survey, you must be prepared to invest the time and
effort required by proper procedures.
20 Content Analysis
22 Historical Research
Secondary Analysis
Allen Rubin
As you know by now, data can be collected from research participants or respondents when we use
structured observations (Chapter 16), interviewing (Chapter 17), and surveys (Chapter 18). In
some research situations, particularly if the costs or time requirements of these data collection
methods are a consideration, a sample is difficult to locate, or a low response rate is likely, we
have the alternative of using data that have already been collected.
The data collection method discussed in this chapter and the one described in the following
chapter (content analysis) rely on available data; that is, the data have already been collected.
Secondary analysis is more accurately referred to as a data utilization method rather than a data
collection method. This process does not involve the collection of original data; instead, available
data relevant to a problem or question are located in records or a computerized database. A
research study is then designed to analyze these data with the goal of answering questions, arriving
at hypotheses that can be tested, or building theory.
UTILIZATION OF DATA
In secondary analysis, the data must be at hand before the design of the study in which they will be
analyzed can be formulated. Moreover, the researcher usually has little control over the original
format of the data and must take them as they are presented. This does not relieve the researcher of
the responsibility for verifying the validity and reliability of the data, however. As with any
research design, studies that rely on secondary analysis must use (or, in this case, reuse) data that
accurately measure what they are supposed to measure.
Another consideration is that a secondary analysis cannot serve the same purpose the data served
when they were first collected. Even when researchers participated in or designed the original
research study intending to do a secondary analysis later, that later analysis qualifies as secondary
only if it is conducted as a separate, independent investigation. It must not merely inquire further
into the implications of the data as an expansion of the preceding study.
For example, suppose a social service agency uses data from a research study of clients and
services provided as a basis for designing a system for collecting and storing comprehensive data
on this topic. The system is put into operation, and then the researchers who helped develop it are
approached with additional practice questions involving the database. To answer these questions,
they could formulate a new analysis of the data. This would be a secondary analysis even though
the researchers had helped to gather the original data.
Example: Big Brothers/Big Sisters Programs
To demonstrate the secondary analysis method of data utilization, an example will be given that
involves a Big Brothers/Big Sisters program. In these programs, youths at risk are matched with
volunteers of the same gender. A motherless girl, for example, will be matched with a Big Sister
who will act as a role model and try to compensate for the absence of the mother; a fatherless boy
will be matched with a Big Brother. There are two major problems that affect both the
maintenance of the programs and the quality of the services they provide:
1. premature turnover among volunteers (the Big Brothers and Big Sisters)
2. uncertainty about how the volunteers might affect the psychosocial functioning of the service
recipients (the Little Brothers and Little Sisters)
Turnover is a clinical problem in terms of the potentially harmful impact that aborted relationships
might have on the boys and girls. It is an administrative problem for the same reason, and it
represents a drain on scarce agency resources and can ultimately lower the public's image of the
program and its effectiveness.
To reduce turnover, social workers and administrators need to know about the variables that
distinguish volunteers who leave prematurely from those who do not. They need data they can use
as a basis for decisions about modifications in volunteer screening, the orientation of volunteers,
or changes in the ways volunteers and youths are matched.
Administrators, in turn, need data documenting the program's effectiveness that they can use to
present the program as a selling point with funding sources; indeed, funders are demanding such data. Social
workers could use the data to identify the types of volunteers and youths associated with the best
outcomes for the service recipients.
They could see which dyad matches do best; that is, what type of youth benefits most from
association with what type of volunteer. In order to obtain data about the program's effectiveness,
however, it is necessary to obtain and analyze data about the psychosocial functioning of the
youths the program affects.
The question is, who will collect the data to address these types of questions? Who has the time
and money to generate data about volunteers and youths throughout the course of their
involvement in the program? Obtaining turnover data could mean waiting several years after the
initial data-gathering effort to see when volunteers terminate. Obtaining the program impact data
could mean gathering longitudinal data on each child's psychosocial functioning several times over
the same period.
In Big Brothers/Big Sisters programs, social workers' efforts are often stretched thin with heavy
caseloads, and administrators can be swamped with the resource procurement and public relations
activities necessary to keep the program afloat. Collecting original data themselves, in person or
by mail or phone, may be out of the question, and limited finances may make it impossible for
them to hire outside researchers to do the data collection.
Moreover, since the needed data are usually collected with a longitudinal study design, both
administrators and social workers may well be working elsewhere by the time a study is
completed. Even if they are still with the same program, the agency boards or funding bodies
wanting these data may not wait that long. The problems could become irrelevant by the time the
studies are completed.
Secondary analyses can be used to study the questions implicit in both the turnover and the
program-effectiveness questions. The turnover question can be studied retrospectively by
analyzing existing data in the agency's records about volunteers who had participated in the
program in the past. The attributes of these volunteers and their Little Brothers or Little Sisters
could be examined in relation to how long the volunteer-youth relationship had lasted.
This would require considerable time to examine the agency's records and process and analyze
data from them, particularly if the files have not been computerized. Nevertheless, using available
data could make an otherwise impractical research project feasible. It also could enable agency
staff to complete the study before it becomes irrelevant to those who want to utilize the findings.
The same applies to the program-effectiveness question. Assessing effectiveness with a sufficient
degree of internal validity would require higher-level research designs such as those discussed in
Chapter 15. Suppose one effectiveness measure of the program to be assessed is the school
performance of the Little Brothers and Little Sisters.
Existing school records on such indicators as grades, tardiness, and number of detentions per
report period would be used to conduct a series of studies with single-system or multiple-baseline
designs (see Chapter 14). These studies would be based on time-series analyses of each child's
school performance before and after the child was matched with a volunteer. Because available
data would be used, the study could focus on closed cases, and the analyses could include trends in
school performance (e.g., grades, tardiness, detentions) after the intervention had been completed.
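The before-and-after comparison described above can be sketched in a few lines of Python. The monthly grade figures, the phase lengths, and the simple least-squares trend calculation are all illustrative assumptions, not data from any actual program:

```python
# Hypothetical monthly grade averages for one child, before and after
# being matched with a volunteer (illustrative numbers only).
before = [68, 70, 67, 69, 71, 70]   # baseline phase
after  = [72, 74, 75, 77, 76, 78]   # intervention phase

def mean(xs):
    return sum(xs) / len(xs)

def slope(xs):
    # Least-squares trend per period: cov(t, x) / var(t).
    n = len(xs)
    t_bar = (n - 1) / 2
    x_bar = mean(xs)
    num = sum((t - t_bar) * (x - x_bar) for t, x in enumerate(xs))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den

print(f"baseline: mean {mean(before):.1f}, trend {slope(before):+.2f} per period")
print(f"intervention: mean {mean(after):.1f}, trend {slope(after):+.2f} per period")
```

A rise in both the level and the trend of the series after the match date is the kind of pattern a single-system time-series design looks for.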
Sources of Data
For many secondary analyses conducted by social workers, the principal source of existing data is
social service agency records. But available data may be obtained from a variety of other sources.
In the United States, many federal agencies, such as the Department of Health and Human
Services, the Bureau of Labor Statistics, the Federal Bureau of Investigation, and the National
Center for Health Statistics, regularly report current data.
The largest secondary database is provided by the U.S. Bureau of the Census, which conducts a
national census of the population every 10 years that attempts to account for every person in the
country and to
provide data on a variety of variables such as housing, gender, age, income, and family
relationships.
The bureau also conducts periodic surveys and publishes annual editions of the Statistical Abstract
of the United States, in which data from numerous federal agencies on a wide variety of topics of
interest to social workers are presented in tabular form.
Other good sources of secondary data are national voluntary agencies such as the Council on
Social Work Education (CSWE), which annually produces a report of statistics on students and
faculty in all accredited undergraduate and graduate schools of social work in the United States.
(These data provide a source for a variety of secondary analyses that will be used as examples in
this chapter.) On the state and local levels, public agencies such as local housing departments and
private agencies such as community councils also make data available.
The use of agency records for secondary analysis, including the case records on file in almost any
social service agency, has been stimulated by the adoption in recent years of management
information systems (MISs) in which data from agency records are processed, stored, retrieved,
and analyzed by computer technology.
These systems provide agencies with a computerized database of information that is collected
routinely by social workers on such variables as client characteristics, workers' activities, costs and
outcomes of agency services, and other aspects of service delivery. They have made secondary
analysis a common method of data utilization for social workers.
Secondary analyses may follow either deductive or inductive logic. Deductive logic consists of forming a
theory, making deductions from the theory, and testing those deductions, or hypotheses, against
reality. This is the basis for the hypothetico-deductive approach to hypothesis testing introduced in
Chapter 1, which represents the positivistic approach to research.
As pointed out in Chapter 3, inductive logic begins with specific observations of actual events,
things, or processes, as in naturalistic research, and builds on these observations to make
inferences or more general statements. In research, therefore, inductive reasoning is applied to data
collection and research results in order to make general statements and see if they fit theory;
deductive reasoning is applied to theory in order to arrive at hypotheses that can be empirically
tested.
Briefly stated, inductive secondary analyses emphasize the process of accumulating observations
to build theories, whereas deductive secondary analyses test the hypotheses derived from theories.
As indicated in Chapter 3, the greater the number of hypotheses that are derived from a theory and
proved true, the more reason there is to believe in the truth of the theory. Conversely, the greater
the number of hypotheses that are derived, tested, and proved false, the greater is the skepticism
generated about the theory from which the hypotheses were drawn.
Secondary analyses are typically associated with inductive reasoning, but they also can be
employed in a deductive fashion. For example, various research studies have found that when
schizophrenics live with their families, relapse rates are higher among patients whose families
express high levels of emotionality than among patients whose families are more restrained.
Other studies have observed that psychiatric programs that seek to increase cognitive arousal and
rely on psychotherapy are less successful than those that seek to develop occupational
rehabilitation and de-emphasize psychotherapy.
In inductive fashion, a variety of such studies have been taken together to develop a theory on the
vulnerability of schizophrenics to overly stimulating environments. From this theory, hypotheses
have been deduced regarding intervention with families and agencies to reduce the level of
overstimulation and expressed emotion, which can affect patients. The results of studies to verify
the efficacy of the treatment interventions have supported not only the hypotheses derived from
the theory but also the plausibility of the theory.
In the Big Brothers/Big Sisters example used earlier in this chapter, problems of volunteer
turnover and program effectiveness for service recipients were addressed with two different types
of secondary analysis. The first study described dealt with attributes of youths and volunteers that
were associated with the length of the youth-volunteer relationship. This study could be termed
inductive because data were examined in an effort to discover some plausible pattern that would
account for volunteer turnover.
The second study described examined the program's impact on the school performance of the
Little Brothers and Little Sisters. This study could be termed deductive because it began with a
theory concerning the impact of same-gender role models on the psychosocial functioning of the
youths. It then utilized an explanatory research design to verify a hypothesis derived from the
theory concerning changes in school performance. In other words, the first study inductively
accumulated observations to generate theory, whereas the second study addressed the plausibility
of a theory by verifying a hypothesis derived from it.
Exploratory Secondary Analyses
If there is no strong basis for explaining a phenomenon by testing hypotheses with a
limited set of variables, the procedures of secondary analysis can be utilized to search through a
much wider range of independent variables to determine which ones might be associated with a
particular dependent variable. An association discovered through secondary analysis might merit
further theoretical development and study at the explanatory level.
An example using the CSWE data will help illustrate this point. Social work educators have been
concerned about the potential consequences of a significant decline in MSW applications to
graduate schools of social work in recent years. To find ways to offset this decline and maintain
enrollments and the quality of students who do apply, and to ensure the survival of these
educational programs, bold and innovative strategies and curriculum revisions have been tried.
One of the first attempts to seek quantitative evidence supporting such proposals was an
exploratory secondary analysis of the CSWE database undertaken by a graduate-level social work
research class. This class project was essentially a "fishing expedition" to consider every variable
in the CSWE data that seemed to have any possibility of explaining why the decline in
applications was much greater in some MSW programs than in others.
These variables included tuition; provisions for part-time study; provisions for advanced standing,
enabling students to graduate sooner; concentrations offered; a BSW or doctoral social work
program; length and types of field practicums; program auspices (public or private); availability of
financial grants to students; student-faculty ratios; and size of the program.
The study's findings indicated that the steepest declines were occurring in schools with doctoral
programs, schools with lower (i.e., "better") student-faculty ratios, and schools offering more
financial aid. Based on these findings, which at first seemed to defy logic, the class postulated that
the schools experiencing the smallest declines in applications were those that had lower academic
standards and fewer resources, compared with better-endowed schools with more academic
prestige. They suggested that the former schools had more applicants because they took earlier,
more aggressive steps to recruit them (Rubin, Conway, Paterson, & Spence, 1983).
This study, being exploratory, utilized secondary analysis to generate hypotheses, not test them.
The reasoning was in the form of a postulated explanation of variation in applications decline; it
was not an attempt to imply that the true explanation had been found or verified. The ease and
speed of analyzing arrays of variables with computers to determine relationships simplifies the
task, but it is a method that can be misused.
The more pairs of dependent and independent variables that are examined in order to find pairs
that are related, or covary, the greater is the likelihood of finding some pairs that covary not for
any theoretical reason but merely due to statistical chance. When relationships are found in such
fishing expeditions, it makes a great deal of difference whether the findings are interpreted as
plausible hypotheses warranting further investigation or as verification of hypotheses. The former
would be an acceptable conclusion in an exploratory context; the latter would be misleading and
wrong.
To understand the importance of this distinction requires some knowledge of the concept of
statistical significance. The level of statistical significance or probability found in a statistical test
(usually set at .05) represents the probability of finding a relationship by chance when no true
relationship exists. With a .05
significance level, if researchers were to look for covariation among 100 pairs of variables, they
could expect to find it for five pairs merely due to statistical chance. Therefore, the relationships
found for those five pairs of variables need not have any theoretical or practical meaning.
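This expectation is easy to verify with a small simulation. Under the null hypothesis a test's p-value is uniformly distributed between 0 and 1, so each of 100 unrelated variable pairs has a 5% chance of crossing the .05 threshold; the sketch below (plain Python, purely illustrative) shows that about five "significant" pairs turn up per study on average:

```python
import random

random.seed(1)

ALPHA = 0.05      # conventional significance level
N_PAIRS = 100     # pairs of variables examined in a "fishing expedition"
N_RUNS = 10_000   # repeated simulated studies

# Under the null hypothesis, each test's p-value is uniform on [0, 1],
# so each pair "covaries significantly" by chance with probability .05.
false_positives = [
    sum(random.random() < ALPHA for _ in range(N_PAIRS))
    for _ in range(N_RUNS)
]

avg = sum(false_positives) / N_RUNS
print(f"average 'significant' pairs per study: {avg:.2f}")
```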
When statistical significance is found with secondary analysis, therefore, the researcher must take
into account whether it was found in order to verify a hypothesis or was found for some pairs of
variables out of a much larger array that was explored with electronic data processing. Failure to
identify the tentative nature of findings that are only postulations for future study represents an
abuse of computer technology--massaging the data until something "significant" appears and then
misrepresenting that finding as a verified hypothesis.
Descriptive Secondary Analyses
The CSWE statistics provide an example of descriptive data useful in
educational planning without getting into explanatory or theoretical issues. The council often
receives requests from schools of social work for secondary analyses with additional descriptive
data that are not included in its annual reports.
A dean of a midwestern school of social work in a public university might need to know how the
school's student-faculty ratio compares with that of other similar schools in the Midwest, for
example. The dean needs these data not to construct or test an explanation about the causes or
consequences of variation in student-faculty ratios but merely to help make a case with faculty or
administration regarding the adequacy or inadequacy of the school's resources compared with the
resources of similar schools.
Descriptive secondary analyses also are used frequently in social service agencies with
computerized information systems to make such descriptive data as caseload trends and program
costs readily available to administrators and planners.
Explanatory Secondary Analyses
Explanatory designs are used in secondary analyses, as in other types of studies, to verify
hypotheses and examine relationships between variables. A cross-sectional secondary analysis
could be used in an explanatory study to analyze the existing data to see if a specific hypothesized
relationship exists at one point in time.
It might also assess whether that relationship holds up when rival hypotheses in the database are
controlled for. An example would be a secondary analysis of the CSWE database to assess a
postulated disparity between male and female faculty salaries when academic degrees, ranks, and
administrative responsibilities are controlled for (held constant).
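Holding a variable constant in this way amounts to comparing the groups within each stratum of the control variable. A minimal sketch, using made-up faculty records rather than actual CSWE data (salaries in thousands of dollars):

```python
from collections import defaultdict

# Hypothetical faculty records: (rank, gender, salary in $1,000s).
faculty = [
    ("assistant", "F", 61), ("assistant", "M", 62),
    ("associate", "F", 72), ("associate", "M", 74),
    ("full",      "F", 95), ("full",      "M", 97),
    ("assistant", "F", 60), ("full",      "M", 96),
]

# Group salaries by rank, then compare genders within each stratum,
# so that rank is "held constant" in the comparison.
by_rank = defaultdict(lambda: defaultdict(list))
for rank, gender, salary in faculty:
    by_rank[rank][gender].append(salary)

gaps = {}
for rank, groups in sorted(by_rank.items()):
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    gaps[rank] = means["M"] - means["F"]
    print(f"{rank}: male-female salary gap = {gaps[rank]:.1f} (in $1,000s)")
```

A gap that persists within every rank stratum cannot be explained away by rank alone, which is the point of controlling for rival hypotheses.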
ISSUES IN METHODOLOGY
There are a number of methodological issues in the analysis of data that have been previously
collected, such as the misuse of computer technology, already discussed. These problems concern
the adequacy of the database and the reliability, validity, and availability of data relevant to a
proposed research question.
An easy mistake researchers can make when using secondary analysis is to assume that, because
they are using an available database and were not responsible for gathering the data, the validity
and reliability of the data they wish to reanalyze have already been established. It does not follow
that because others have collected the data, no matter how prestigious or "official" their auspices
may be, researchers doing secondary analyses can accept the data without question.
Measurement Reliability
As we know from Chapter 11, reliability refers to accuracy, precision, or consistency in
measurement, and some available databases are riddled with measurement inconsistencies. If
social workers who are instructed to collect data for a new computerized management information
system in their agency resent the extra work and the time it takes away from direct service to their
clients, for example, they may not be careful to complete the measuring instruments accurately,
precisely, and consistently.
Inconsistencies in measurement can plague even the most esteemed databases. In the CSWE
annual data, for example, one variable is ethnicity. Asking schools to report the ethnicity of their
faculty and students seems a relatively straightforward and easy task. A couple of decades ago, the
CSWE administration decided that, in order to be more sensitive to Native American constituents,
the category of ethnicity formerly labeled American Indian should more correctly read Native
American.
Theoretically, this was a proper decision. However, in the year the change was made, the
proportion of Native American students and faculty reported on the forms shot up to a level
several times higher than it had been previously. In fact, the data reported by some schools in
industrial areas of the Northeast and Midwest indicated that all their faculty and students classified
as white the previous year had been replaced by Native Americans!
Fortunately, it was not too difficult to spot and correct this error. Some of the clerical staff who
completed the CSWE forms apparently thought that Native American meant white or born in the
United States and overlooked the category of white ethnicity, which appeared lower in the list.
Measurement Validity
As noted in Chapter 11, inconsistent or unreliable measuring instruments cannot produce valid
data because they do not accurately measure what they purport to measure. But even if the data are
reliable, they still may not be valid. Consistency is a necessary condition for a measuring
instrument to be valid, but it is not a sufficient condition.
For example, suppose a database in a probation department contains data on the frequency of
criminal acts by probationers, based on their reports to their
probation officers. The probationers would undoubtedly underreport such acts. These data would
be reliable as a measure of their self-reports, but they would not be valid because they would
contain a consistent bias that has more to do with avoiding being returned to prison than with an
accurate count of criminal acts.
It can be assumed that a state hospital would be consistent in measuring the number of former
patients from various communities who are readmitted to it and the length of each hospitalization.
But would this measurement really mean what it is intended to? The administration and staff of
community-based psychiatric rehabilitation centers often are dedicated to their belief that such
patients are almost always better off in the community than in the hospital.
They are unlikely to recommend rehospitalization for patients who experience a relapse; indeed,
they may forcefully advocate against it. Mental health professionals in a community with a
different type of program might view the state hospital in positive terms, as an acceptable way to
provide a temporary, protected environment for patients who need it. They therefore would send
comparatively more patients back to the hospital.
A good case could be made that reliable data indicating this disparity in rehospitalization rates
would really be measuring largely a difference in ideology, not a difference in the clinical benefits
of the different community treatment programs. At the least, there should be considerable doubt as
to how much of the disparity is due to ideological differences and how much really reflects
differences in the well-being of the former patients.
Before using such available data, therefore, a brief preliminary investigation should be conducted
to examine the attitudes, beliefs, and practices of those who influenced or determined what was
recorded in the database. A few brief interviews might reveal disparate staff ideologies concerning
the value of rehospitalization, which would suggest the dubious validity of these data as a measure
of clinical outcome.
An even more likely case, in view of cutbacks in funding for community mental health programs,
is that a community with a program for former patients of state hospitals would be found to have
significantly higher rehospitalization rates than another community that lacked any sort of
program for former patients. Would the higher rates mean the program was harmful?
Or would they mean that the community with the program was doing a better job in following up
on discharged patients and making sure that some form of service (in this case, rehospitalization)
was provided in case of relapse? Conceivably, the community with no program simply neglected
these patients, allowing them to go their own way and get along in the community as best they
could with the minimal public health and welfare benefits available.
Sources of Error
Problems in the validity and reliability of available data can stem from a variety of sources. Some
have been mentioned previously, such as haphazard data recording. There are inconsistencies in
how respondents interpret the meaning of questions or response categories on measuring
instruments, and their personal response sets may lead them to supply data that present a socially
desirable impression or that they think will please the interviewer or observer. There also are
inconsistencies in the definition of available data that may be reliably recorded in a certain context
but mean something different in another context.
Another source of error is that different agencies or programs contributing to a database may have
different operational definitions of the same variables. Because they define the terms differently,
their respective counts might not be comparable, and the sum of all the counts might overestimate
or underestimate the true amount.
For example, at one time the CSWE database included the full-time enrollment of each
undergraduate social work program in the United States. In some schools, however, freshman and
sophomore social work majors were included in these counts; other schools could not do so
because university policy required students to wait until the junior year to declare their majors.
Consequently, the latter schools appeared to be devoting excessive resources to faculty; they had
lower student-faculty ratios than the schools that counted freshmen and sophomores.
The CSWE adjusted its reporting procedures to improve the comparability among undergraduate
programs by asking them all to include only juniors and seniors in their enrollment counts. This,
however, introduced a different problem; the assessment of trends over time became more
difficult.
In comparing the number of full-time students before and after the CSWE adjusted its reporting
procedures, the implications of including freshmen and sophomores before the adjustment but not
after it must be taken into account. If this is not done, it would be easy to conclude that full-time
enrollments had dropped more than they really had in some schools, simply because freshman and
sophomore students no longer were being counted. Before trends can be assessed over time using
secondary analyses, therefore, careful inquiries must be made about changes in reporting
procedures.
For the same reasons, the possibility of differences in response rates from year to year must also
be considered. Each year, a small proportion of social work education programs fail to respond to
the annual canvass for the CSWE database. Let's take a simple hypothetical example to illustrate
this point. Let's say that in 2010, 19% of CSWE-accredited baccalaureate programs failed to
respond; the previous year (2009), 13% failed to do so.
The larger proportion of nonrespondents in 2010 than in 2009 must be taken into account in
developing trend data from one year to the next. Moreover, changes in the programs that fail to
respond can cause distortions regarding some variables.
For example, a substantial drop in the proportion of Puerto Rican faculty members and students in
a given year may have more to do with the failure of one large program in Puerto Rico to respond
to the canvass that year than with any real trend in the ethnicity of faculty and students.
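The distortion that changing response rates can introduce into raw trend figures is easy to illustrate. In the sketch below the true per-program enrollment is held flat across both years, yet the unadjusted totals suggest a decline; all numbers are hypothetical:

```python
TOTAL_PROGRAMS = 400   # hypothetical number of accredited programs
AVG_ENROLLMENT = 50    # assume true per-program enrollment is unchanged

# Response rates implied by the 13% and 19% nonresponse figures above.
response_rate = {2009: 1 - 0.13, 2010: 1 - 0.19}

raw, adjusted = {}, {}
for year, rate in sorted(response_rate.items()):
    responding = round(TOTAL_PROGRAMS * rate)
    raw[year] = responding * AVG_ENROLLMENT   # what the canvass reports
    adjusted[year] = raw[year] / rate         # naive nonresponse correction
    print(f"{year}: reported total {raw[year]}, adjusted {adjusted[year]:.0f}")
```

The reported totals fall from one year to the next even though nothing real changed; once the differing response rates are taken into account, the adjusted totals are identical.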
Missing Data
A related problem is missing data. Just because an agency intends to include a particular statistic in
its records or database is no guarantee that the statistic will be consistently available. It may be
missing or "unknown" for many cases. In the CSWE database, for example, certain schools
regularly refuse to supply data on faculty salaries. In social service agencies, some social workers
may not conform to expectations that they routinely record data on service goals or client
outcomes.
There may even be a bias influencing which data are omitted. For example, social workers might
be more disposed to record behaviors and events that reflect client progress than those that reflect
poor treatment outcomes. They might be more likely to be aware of (and thus record) acting-out
behavior than withdrawal behavior.
Consequently, before a great deal of time is invested in a research study, some preliminary
spot-checking of the data sources to be used is advisable to get an idea of how consistently the data are
available. This provides a basis for deciding whether the extent of missing data will create too
great a margin of error in the results of the proposed research study.
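A preliminary spot-check of this kind can be as simple as tallying empty fields in a sample of records. The sketch below uses made-up records and field names, not those of any real agency system:

```python
from collections import Counter

# Hypothetical case records drawn from an agency database; the field
# names are illustrative only.
records = [
    {"client_id": 1, "service_goal": "reunification", "outcome": "met"},
    {"client_id": 2, "service_goal": None, "outcome": "not met"},
    {"client_id": 3, "service_goal": "stabilization", "outcome": None},
    {"client_id": 4, "service_goal": None, "outcome": None},
]

# Count how often each field is empty across the sampled records.
missing = Counter()
for record in records:
    for field, value in record.items():
        if value is None:
            missing[field] += 1

shares = {field: missing[field] / len(records) for field in records[0]}
for field, share in shares.items():
    print(f"{field}: {share:.0%} of records missing")
```

If key fields such as service goals or outcomes turn out to be missing in a large share of records, the margin of error in the proposed study may be too great to proceed.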
SUMMARY
Secondary analysis is a method of data utilization in which existing data collected by others are
used to answer questions, test hypotheses, or build theory, so there is no need to collect original
data. In some research situations, the time, cost, and other resources saved by avoiding the
collection of original data are a
distinct advantage of secondary analysis. It also may be the preferable method when it is difficult
to find an appropriate sample or a low response rate is anticipated.
Data for secondary analyses may be obtained from a variety of agency sources--public or voluntary
and at the national, state, or local levels. Especially when computer technology is used to process,
store, retrieve, and analyze data on an ongoing basis, secondary analyses can be done routinely
and easily by social work researchers.
Designs for studies using secondary analyses can be classified in several ways. In inductive
studies, theories are built from accumulated observations, whereas in deductive studies, the
hypotheses derived from these theories are tested. Secondary analyses are typically associated with
inductive reasoning, but they can be employed in a deductive fashion as well.
They can also be used at all three levels of the research knowledge continuum. In exploratory
studies, the utility of secondary analyses is related to the ease with which many different possible
relationships between variables can be identified in an existing database. In descriptive studies,
social work researchers can use existing data to obtain more precise definitions of variables,
without examining or testing relationships between them. In explanatory designs, they can analyze
existing data to try to verify research hypotheses about these relationships.
A number of methodological issues are involved in the analysis of data that others have collected.
The data researchers propose to analyze must be valid and reliable. Problems in the validity and
reliability of available data can stem from haphazard recording of original data, inconsistencies in
how respondents interpret requests for information, or the inclination of respondents to report
socially desirable information.
Data that have been reliably recorded can mean different things in different contexts, and some
data may be missing in a database. Researchers must also take care that they do not abuse
computer technology by merely massaging data in inappropriate efforts to statistically verify
potential hypotheses.
Craig W. LeCroy
Gary Solomon