Bigdata 2
high-performance computing grid, the project shortened what would have taken months on a single desktop computer to a handful of days.5

AUTOMATIC DOCUMENT CLASSIFICATION
Large amounts of data often need to be sorted into discrete categories. In the previously described ToxicDocs example, it would be useful not just to possess an enormous cache of documents but also to know whether a certain artifact is a newspaper article, e-mail, scientific article, or internal memorandum. Automatic classification combines older and newer techniques as follows. One typically starts with a team of humans manually analyzing a sample set of documents, known as “training data,” while affixing categories to them in a fashion not unlike traditional coding. With more classifications, a probability model then identifies distinctive characteristics in each type of document category (e.g., special characters and formatting in e-mails) and, over time, increases the accuracy of a category guess. Many techniques exist for identifying recurrent characteristics across documents. An increasingly common one is support vector machine classification, in which words and phrases are converted into mathematical representations in a vector space, which are then used to calculate similarity and dissimilarity to other words and phrases. In a health promotion context, a hypothetical researcher might analyze thousands of health department complaints from constituents and automatically slot them into discrete categories, for example, by complaint type (noise, vermin).

NAMED ENTITY RECOGNITION
Numerous robust tools for natural language processing are available to researchers, such as the NLTK (Natural Language Toolkit) and spaCy libraries for the Python computer language.6,7 They enable rapid analysis of text, ranging from identifying recurring words; detecting their parts of speech; and characterizing nouns as places, organizations, or people. In massive databases, this allows the rapid isolation of commonly used phrases or names of people who might otherwise be an anonymous blur. For those interested in beliefs about certain health practices, named entity recognition could isolate commonly invoked authors on bulletin boards where users regularly swap health information of varying quality, among dozens of other applications.

SOCIAL MEDIA AND METADATA
Finally, social media contain a trove of data, and what is beneath the surface is often as interesting as an utterance itself. Behind a single tweet, for example, is information in a JSON format that contains not only a username and the text of a tweet but also its date and time, the geographical location, the number of times a tweet was liked or retweeted, and its hashtags, among other information. Using the techniques I have described, one can use these data to reconstruct influence networks (for instance, who spreads anti-vaccination messaging and with what phrases) and identify the relative sway of different purveyors of information.

FUTURE CHALLENGES
For all the potential they carry, these techniques, and big data more broadly, raise critical questions about current institutional structures. One is the question of standards. How will the researchers adjudicate among several ways of doing something? What kind of protocol transparency ought to be mandatory, if any? Then there is the need to update training in both public health and computer science and other related fields. This will facilitate collaboration among future generations, who can use new computational techniques while respecting the longstanding norms of the public health profession. Considering that big data is in its infancy, these issues will occupy much attention in the coming years. But there is no question that big data is here to stay and, with it, an enormous opportunity for all in public health.

Merlin Chowkwanyun, PhD, MPH

ACKNOWLEDGMENTS
I wish to thank Alex Farrill for clarifying discussions of nonrelational databases with me. Yoka Tomita assisted with preparation of the manuscript, and conversations with Nora Landis-Shack shaped much of my thinking on these issues.

CONFLICTS OF INTEREST
The author has no conflicts of interest to declare.

REFERENCES
1. Barrett MA, Humblet O, Hiatt RA, Adler NE. Big data and disease prevention: from quantified self to quantified communities. Big Data. 2013;1(3):168–175.
2. Juve G, Rynge M, Deelman E, Vockler J, Berriman GB. Comparing FutureGrid, Amazon EC2, and Open Science Grid for scientific workflows. Comput Sci Eng. 2013;15(4):20–29.
3. Crockford D. Introducing JSON. 2019. Available at: https://www.json.org. Accessed January 24, 2019.
4. Chowkwanyun M, Markowitz G, Rosner D. ToxicDocs: Version 1.0. [database]. New York, NY: Columbia University and City University of New York; 2018.
5. Rosner D, Markowitz G, Chowkwanyun M. Toxicdocs (www.toxicdocs.org): from history buried in stacks of paper to open, searchable archives online. J Public Health Policy. 2018;39(1):4–11.
6. NLTK Project. Natural language toolkit (NLTK). 2019. Available at: https://www.nltk.org. Accessed January 24, 2019.
7. spaCy. Industrial-strength natural language processing in Python. 2019. Available at: https://spacy.io. Accessed January 24, 2019.
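
The vector-space idea described under AUTOMATIC DOCUMENT CLASSIFICATION can be illustrated with a minimal Python sketch. It is not a support vector machine: as a deliberately simpler stand-in, it turns each complaint into a word-count vector and assigns a new complaint to the category whose hand-labeled examples it most resembles by cosine similarity. The training complaints, the category names, and every function name here are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labeled "training data": (complaint text, category).
TRAINING = [
    ("loud music from the bar next door every night", "noise"),
    ("construction noise starts before dawn and is very loud", "noise"),
    ("rats in the basement and droppings in the hallway", "vermin"),
    ("mice and roaches seen in the restaurant kitchen", "vermin"),
]


def vectorize(text):
    """Represent a text as a sparse vector mapping each word to its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the vectors of each category's examples into one centroid."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


centroids = train(TRAINING)
print(classify("loud noise from a party upstairs", centroids))
print(classify("rats near the garbage bins", centroids))
```

A real project would likely use far more training data and an established library rather than this hand-rolled similarity measure; the sketch only shows how text becomes vectors that can be compared.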
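
The simplest task named under NAMED ENTITY RECOGNITION, identifying recurring words and phrases, can be sketched with the Python standard library alone. This sketch does not use NLTK or spaCy and does not perform named entity recognition, which requires a trained language model. The bulletin-board posts and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical bulletin-board posts, invented for illustration.
POSTS = [
    "Dr Smith says raw milk cures allergies",
    "I read that raw milk is risky",
    "My neighbor swears by raw milk",
]


def recurring_phrases(posts, n=2, min_count=2):
    """Count every n-word phrase and keep those seen at least min_count times."""
    counts = Counter()
    for post in posts:
        words = re.findall(r"[a-z']+", post.lower())
        counts.update(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


print(recurring_phrases(POSTS))
```

Setting `n=1` counts single words instead of two-word phrases; raising `min_count` keeps only the most frequently repeated phrases.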
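
The metadata described under SOCIAL MEDIA AND METADATA can likewise be sketched in Python. The records below are hypothetical and heavily simplified: the field names (`user`, `retweet_of`, `hashtags`, and so on) are assumptions made for this example and do not reproduce any platform's actual JSON schema. The sketch parses the JSON, links each original author to the users who retweeted them, and ranks authors by how many distinct users passed their messages along.

```python
import json
from collections import Counter, defaultdict

# Hypothetical, simplified tweet records; real platform JSON is richer
# and uses different field names.
RAW = """
[
  {"user": "alice", "text": "example message one #tag1",
   "created_at": "2019-01-20T14:03:00Z", "retweet_of": null,
   "hashtags": ["tag1"]},
  {"user": "bob", "text": "example message one #tag1",
   "created_at": "2019-01-20T14:10:00Z", "retweet_of": "alice",
   "hashtags": ["tag1"]},
  {"user": "carol", "text": "example message one #tag1",
   "created_at": "2019-01-20T15:30:00Z", "retweet_of": "alice",
   "hashtags": ["tag1"]},
  {"user": "erin", "text": "example message two #tag2",
   "created_at": "2019-01-21T09:00:00Z", "retweet_of": null,
   "hashtags": ["tag2"]},
  {"user": "dave", "text": "example message two #tag2",
   "created_at": "2019-01-21T09:45:00Z", "retweet_of": "erin",
   "hashtags": ["tag2"]}
]
"""


def influence_network(tweets):
    """Map each original author to the set of users who retweeted them."""
    edges = defaultdict(set)
    for tweet in tweets:
        if tweet["retweet_of"]:
            edges[tweet["retweet_of"]].add(tweet["user"])
    return edges


tweets = json.loads(RAW)
edges = influence_network(tweets)

# Rank authors by the number of distinct users who retweeted them.
sway = Counter({author: len(users) for author, users in edges.items()})

# Count how often each hashtag appears across all records.
hashtags = Counter(tag for tweet in tweets for tag in tweet["hashtags"])

print(sway.most_common())
print(hashtags.most_common())
```

Counting distinct retweeting users is only one crude proxy for the "relative sway" of an author; timestamps and locations in the same records could support finer-grained measures.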