Bigdata 2


AJPH PERSPECTIVES

Big Data, Large-Scale Text Analysis, and Public Health Research

“Big data” is often an amorphous catchall. And lately, it is one that has attracted more criticism than praise. Some have pointed out that a torrent of new data hardly guarantees good insights. Others point to its threats to privacy, often in service of highly targeted and invasive marketing. Then there are those who simply question the hype.

These criticisms are valid but can overshadow the potential of big data. I examine avenues that big data opens for public health researchers, especially those in health promotion. Boiled down to its essence, the term encompasses two revolutions that are undeniably real: more computational horsepower and a greater volume of data.1 Here, I focus on text, the volume of which has boomed in the digital era, and the analysis of it, often called “natural language processing.” Although I provide a necessarily survey-level scan, I offer enough depth so that readers can familiarize themselves with emerging technological trends and methods.

HIGH-PERFORMANCE COMPUTING AND NOVEL DATABASES

New computing infrastructure is required to handle ever-growing amounts of text, including social media chatter, online ramblings, digitized periodicals, Internet ads, and scanned paper documents. Fortunately, the past decade has seen the rise of computing consortiums that allow us to harness any number of computers, from a single server to tens of thousands, for a single task. Commonly referred to as “high-performance computing,” this pooled computing power is now accessible to researchers of all sorts, including those without enormous monetary resources. One prime example is the Open Science Grid, a collection of research universities that share their computing resources.2 The Open Science Grid’s staff and campus liaisons provide assistance to researchers with all levels of ability. In practical terms, this means a researcher can open up his or her laptop, distribute a task from a home institution across a grid of servers, and complete it in a fraction of the time it might have taken a couple of decades ago. Concurrently, private firms now sell access to as much computing as a researcher needs, along with technical assistance, all at a relatively low cost. Amazon Web Services has been a leading player for years, with Google Cloud, Microsoft Azure, and others offering similar services.

Paralleling the rise in computing power has been the advent of novel database architectures that can be adapted to increasingly diverse forms of data. Until recently, most researchers have used some version of a relational database: tabular spreadsheet-like databases with individual cells, often queried using the Structured Query Language (SQL). For decades, relational databases have proven useful for holding information about individuals and their various traits, such as those typically recorded in a longitudinal data set, and for data in static form. For more complex and “unstructured” data types, however, SQL databases exhibit severe limits. Single cells in a grid are inefficient ways to hold, for example, the contents of a 200-page legal text. Further, many relational databases are constrictive because of limits on the number of observations and their frequent inability to dynamically add additional fields if needed.

Nonrelational databases, often called “NoSQL,” offer a solution to these problems. Most of them contain data conforming to JavaScript Object Notation (JSON), a nontabular standardized format that holds data in labeled fields, such as “title:,” “publication date:,” and “publisher:.”3 Dynamic updating of the field list is easy and can be done in real time, and there is no limit to the number of fields or total entries in a database. In addition to this flexibility, most nonrelational databases (e.g., the popular MongoDB) are available free of charge with robust online support communities.

RAPID PROCESSING

Tasks that are too overwhelming for a personal computer can now be distributed over high-performance computing grids like those described. Open-source utilities such as Docker, HTCondor, and Hadoop can facilitate the simultaneous operation of an application across any designated number of servers. One recent high-performance computing application is ToxicDocs, a publicly available data set of once-secret corporate documents on industrial poisons that have emerged from the vaults of major multinational corporations.4 ToxicDocs’s creators confronted the task of rendering millions of pages of documents full-text searchable. This typically requires using optical character recognition software that converts images of letters and numbers into actual recognizable characters. Optical character recognition, however, is an inherently slow process. But by deploying the task on a high-performance computing grid, the project shortened what would have taken months on a single desktop computer to a handful of days.5

AUTOMATIC DOCUMENT CLASSIFICATION

Large amounts of data often need to be sorted into discrete categories. In the previously described ToxicDocs example, it would be useful not just to possess an enormous cache of documents but also to know whether a certain artifact is a newspaper article, e-mail, scientific article, or internal memorandum. Automatic classification combines older and newer techniques as follows. One typically starts with a team of humans manually analyzing a sample set of documents, known as “training data,” while affixing categories to them in a fashion not unlike traditional coding. With more classifications, a probability model then identifies distinctive characteristics in each type of document category (e.g., special characters and formatting in e-mails) and, over time, increases the accuracy of a category guess. Many techniques exist for identifying recurrent characteristics across documents. An increasingly common one is support vector machine classification, in which words and phrases are converted into mathematical representations in a vector space, which are then used to calculate similarity and dissimilarity to other words and phrases. In a health promotion context, a hypothetical researcher might analyze thousands of health department complaints from constituents and automatically slot them into discrete categories, for example, by complaint type (noise, vermin).

NAMED ENTITY RECOGNITION

Numerous robust tools for natural language processing are available to researchers, such as the NLTK (Natural Language Toolkit) and spaCy libraries for the Python computer language.6,7 They enable rapid analysis of text, ranging from identifying recurring words to detecting their parts of speech and characterizing nouns as places, organizations, or people. In massive databases, this allows the rapid isolation of commonly used phrases or names of people who might otherwise be an anonymous blur. For those interested in beliefs about certain health practices, named entity recognition could isolate commonly invoked authors on bulletin boards where users regularly swap health information of varying quality, among dozens of other applications.

SOCIAL MEDIA AND METADATA

Finally, social media contain a trove of data, and what is beneath the surface is often as interesting as an utterance itself. Behind a single tweet, for example, is information in a JSON format that contains not only a username and the text of a tweet but also its date and time, the geographical location, the number of times a tweet was liked or retweeted, and its hashtags, among other information. Using the techniques I have described, one can use these data to reconstruct influence networks (for instance, who spreads anti-vaccination messaging and with what phrases) and identify the relative sway of different purveyors of information.

FUTURE CHALLENGES

For all the potential they carry, these techniques, and big data more broadly, raise critical questions about current institutional structures. One is the question of standards. How will researchers adjudicate among several ways of doing something? What kind of protocol transparency ought to be mandatory, if any? Then there is the need to update training in both public health and computer science and other related fields. This will facilitate collaboration among future generations, who can use new computational techniques while respecting the longstanding norms of the public health profession. Considering that big data is in its infancy, these issues will occupy much attention in the coming years. But there is no question that big data is here to stay and, with it, an enormous opportunity for all in public health.

Merlin Chowkwanyun, PhD, MPH

ABOUT THE AUTHOR

Merlin Chowkwanyun is with the Center for the History and Ethics of Public Health, Department of Sociomedical Sciences, Mailman School of Public Health, Columbia University, New York, NY.
Correspondence should be sent to Merlin Chowkwanyun, Center for the History and Ethics of Public Health, Department of Sociomedical Sciences, Mailman School of Public Health, Columbia University, 722 West 168th Street #R931, New York, NY 10034 (e-mail: mc2028@columbia.edu). Reprints can be ordered at http://www.ajph.org by clicking the “Reprints” link.
This editorial was accepted January 8, 2019.
doi: 10.2105/AJPH.2019.304965

S126 Editorial Chowkwanyun AJPH Supplement 2, 2019, Vol 109, No. S2

ACKNOWLEDGMENTS

I wish to thank Alex Farrill for clarifying discussions of nonrelational databases with me. Yoka Tomita assisted with preparation of the manuscript, and conversations with Nora Landis-Shack shaped much of my thinking on these issues.

CONFLICTS OF INTEREST

The author has no conflicts of interest to declare.

REFERENCES

1. Barrett MA, Humblet O, Hiatt RA, Adler NE. Big data and disease prevention: from quantified self to quantified communities. Big Data. 2013;1(3):168–175.
2. Juve G, Rynge M, Deelman E, Vockler J, Berriman GB. Comparing FutureGrid, Amazon EC2, and Open Science Grid for scientific workflows. Comput Sci Eng. 2013;15(4):20–29.
3. Crockford D. Introducing JSON. 2019. Available at: https://www.json.org. Accessed January 24, 2019.
4. Chowkwanyun M, Markowitz G, Rosner D. ToxicDocs: Version 1.0. [database]. New York, NY: Columbia University and City University of New York; 2018.
5. Rosner D, Markowitz G, Chowkwanyun M. Toxicdocs (www.toxicdocs.org): from history buried in stacks of paper to open, searchable archives online. J Public Health Policy. 2018;39(1):4–11.
6. NLTK Project. Natural language toolkit (NLTK). 2019. Available at: https://www.nltk.org. Accessed January 24, 2019.
7. spaCy. Industrial-strength natural language processing in Python. 2019. Available at: https://spacy.io. Accessed January 24, 2019.
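
To make the labeled-field JSON format from the discussion of nonrelational databases concrete, here is a minimal Python sketch that uses only the standard library. The field names "title", "publication date", and "publisher" come from the text. The record values and the added "document type" field are hypothetical, chosen only to show that a new field can be attached to one record without redefining a table.

```python
import json

# Hypothetical record: only the field names "title", "publication date",
# and "publisher" come from the text. All values are invented.
record = {
    "title": "Example memorandum",
    "publication date": "1975-03-14",
    "publisher": "Example Corporation",
}

# Adding a field needs no schema change: assign a new labeled field.
record["document type"] = "internal memorandum"  # hypothetical field

# Serialize to JSON text, then parse it back to confirm the round trip.
encoded = json.dumps(record, indent=2)
decoded = json.loads(encoded)

print(encoded)
```

A document database such as MongoDB stores records of roughly this shape, although its own storage and query interfaces are not shown here.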
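
The vector-space step described for automatic document classification can be sketched in a few lines of standard-library Python. This is not a support vector machine. It is a simpler stand-in that turns each text into a word-count vector and gives a new text the category of the most similar hand-labeled example, measured by cosine similarity. The complaint texts, the category labels, and the helper names `vectorize`, `cosine`, and `classify` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labeled "training data": (complaint text, category).
TRAINING = [
    ("loud music from the bar next door every night", "noise"),
    ("construction noise starts before dawn", "noise"),
    ("rats in the basement of the building", "vermin"),
    ("mice and roaches in the restaurant kitchen", "vermin"),
]


def vectorize(text):
    """Turn text into a sparse word-count vector (a Counter)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text):
    """Return the category of the most similar labeled example."""
    vec = vectorize(text)
    best = max(TRAINING, key=lambda ex: cosine(vec, vectorize(ex[0])))
    return best[1]


print(classify("rats seen near the kitchen"))
print(classify("loud noise every night"))
```

With realistic volumes of text, a researcher would more likely rely on an established machine learning library than on a hand-written similarity function. The sketch is meant only to show what "converting words into a vector space" looks like in practice.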
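
As an illustration of the metadata that sits behind a social media post, the following Python sketch parses a few JSON records and tallies retweets per account and hashtag frequency. The field names (`username`, `text`, `created_at`, `retweet_count`, `hashtags`) and all values are hypothetical simplifications, not the actual schema of any platform. The tally is only a first step toward gauging relative sway and falls short of reconstructing a full influence network.

```python
import json
from collections import Counter

# Hypothetical, simplified records. Real platform payloads use
# different and far more numerous field names; these are invented.
RAW = """
[
  {"username": "user_a", "text": "example post one",
   "created_at": "2019-01-08T09:00:00", "retweet_count": 40,
   "hashtags": ["health", "vaccines"]},
  {"username": "user_b", "text": "example post two",
   "created_at": "2019-01-08T10:30:00", "retweet_count": 5,
   "hashtags": ["vaccines"]},
  {"username": "user_a", "text": "example post three",
   "created_at": "2019-01-09T08:15:00", "retweet_count": 15,
   "hashtags": ["health"]}
]
"""

tweets = json.loads(RAW)

# Total retweets per account: a crude proxy for relative sway.
retweets_by_user = Counter()
# How often each hashtag appears across all posts.
hashtag_counts = Counter()

for tweet in tweets:
    retweets_by_user[tweet["username"]] += tweet["retweet_count"]
    hashtag_counts.update(tweet["hashtags"])

# Accounts ordered from most to least retweeted.
ranking = retweets_by_user.most_common()
print(ranking)
print(dict(hashtag_counts))
```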



Copyright of American Journal of Public Health is the property of American Public Health
Association and its content may not be copied or emailed to multiple sites or posted to a
listserv without the copyright holder's express written permission. However, users may print,
download, or email articles for individual use.
