Data Visualization: A Guide to Visual Storytelling for Libraries
LIBRARY AND INFORMATION TECHNOLOGY ASSOCIATION (LITA)
GUIDES
Marta Mestrovic Deyrup, PhD
Acquisitions Editor, Library and Information Technology Association, a division of
the American Library Association
The Library and Information Technology Association (LITA) Guides provide information and guidance on topics related to cutting-edge technology for library and IT specialists.
Written by top professionals in the field of technology, the guides are sought after
by librarians wishing to learn a new skill or to become current in today’s best
practices.
Each book in the series has been overseen editorially since conception by LITA and
reviewed by LITA members with special expertise in the specialty area of the book.
Established in 1966, the Library and Information Technology Association (LITA) is
the division of the American Library Association (ALA) that provides its members
and the library and information science community as a whole with a forum for
discussion, an environment for learning, and a program for action on the design,
development, and implementation of automated and technological systems in the
library and information science field.
Approximately 25 LITA Guides were published by Neal-Schuman and ALA
between 2007 and 2015. Rowman & Littlefield took over publication of the series
beginning in late 2015. Books in the series published by Rowman & Littlefield are:
Digitizing Flat Media: Principles and Practices
The Librarian’s Introduction to Programming Languages
Library Service Design: A LITA Guide to Holistic Assessment, Insight, and
Improvement
Data Visualization: A Guide to Visual Storytelling for Libraries
Data Visualization
A Guide to Visual Storytelling for Libraries
The paper used in this publication meets the minimum requirements of American National Standard for
Information Sciences—Permanence of Paper for Printed Library Materials, ANSI/NISO Z39.48-1992.
Printed in the United States of America
Contents
Preface
1 Sculpting Data for a Successful Visualization
2 Designing Public Visualizations of Library Data
3 Tools and Technologies
4 Using Google Tools to Create Public Analytics Visualizations
5 Minding the Gap
6 A Picture Is Worth a Thousand Books
7 Visualizing the Topical Coverage of an Institutional Repository with VOSviewer
8 Visualizing Archival Context and Content for Digital Collections
9 Using R and ggvis to Create Interactive Graphics for Exploratory Data Analysis
10 Integrating Data and Spatial Literacy into Library Instruction
11 Using Infographics to Teach Data Literacy
Appendix
About the Editor
About the Contributors
Preface
Libraries are embracing the expectation that they demonstrate their effectiveness
and be accountable to communities and the institutions they serve. Libraries collect
mountains of raw data about how their collections and services are used, but
communicating the impact of that data to a variety of audiences can be challenging.
Raw data are not enough—data must be carefully prepared and presented in a way
that is understandable, transparent, and compelling. Emerging data visualization
technologies provide the ability to create engaging, interactive visualizations that
effectively tell the story of a library’s impact on its users.
Data visualization is interdisciplinary, combining elements of data science,
statistics, visual communication, and visual design. Although emerging technologies
are making it easier to create complex visualizations using large data sets, data
visualization can refer to the design and creation of visuals with or without digital
technology. The ultimate goal of data visualization is to help viewers understand
data better—by providing context, illustrating trends, showcasing patterns, and
enabling interactive exploration of data.
The purpose of this book is to provide guidance and practical use cases to illustrate
how libraries can use data visualization technologies to both understand and
communicate data. Case studies in this book discuss a variety of technologies and
frameworks that can be used in data visualization, including D3.js (chapter 3),
Google Charts (chapters 3 and 4), and visualization libraries for the R programming
language (chapters 5 and 9). To create compelling visualizations, you must first
acquire a deep understanding of the underlying data. A common theme throughout
this book is that a significant amount of work is required to clean, prepare, and
transform data to make truly meaningful visualizations, so real-world examples of
how this data preparation work can be accomplished are provided.
This book features case studies and examples that can be useful to the total
visualization beginner as well as those with some experience using data visualization
technologies. Whether you have just started creating visualizations using charting
tools like Excel or Google Charts or you have used sophisticated front-end data
visualization code libraries, you will find useful case studies and inspirational
applications of visualization technologies in this book.
The scope of this book covers core principles of data preparation and visualization
and describes how libraries apply data visualization technologies and strategies to
improve library services, enhance accountability, and understand collections and
users. Also included are practical strategies for incorporating data visualizations, data
literacy, and visual literacy into information literacy and instructional services.
The book begins with Eric Phetteplace’s chapter, “Sculpting Data for a Successful
Visualization,” which provides a practical overview of methods for cleaning and
preparing data for visualizations. In “Designing Public Visualizations of Library
Data,” Angela Zoss discusses essential principles of visualization design and provides
a comprehensive overview of common data visualization types and techniques.
The following chapters discuss case studies and examples of creating visualizations
using data about library services, from discovery and web analytics to interlibrary
loan data. In “Tools and Technologies: Visualizing Research Activity in the
Discovery Layer in Real Time,” Godmar Back and Annette Bailey describe how
they created a series of innovative visualizations to better understand and
communicate search behavior in their Summon discovery system. In “Using Google
Tools to Create Public Analytics Visualizations,” I provide a step-by-step tutorial
for setting up a service that enables querying Google Analytics data for a website to
create publicly accessible, constantly updating visualizations. Roger Taylor and
Emily Mitchell, in “Minding the Gap: Utilizing Data Visualizations for Library
Collection Development,” provide a comprehensive overview of how collection
circulation data can be mined and visualized to provide guidance in making
collection decisions. In “A Picture Is Worth a Thousand Books: OBILLSK, Data
Visualization, and Interlibrary Loan,” Ryan Litsey, Kenny Ketner, and Scott Luker
describe the creation of a tool that gathers and renders interlibrary loan data from
across a library consortium to provide greater accountability for interlibrary loan
services.
Several chapters in the book focus on visualizing archives metadata, institutional
repository collections, and scholarly research activities. David E. Polley’s chapter,
“Visualizing the Topical Coverage of an Institutional Repository Using
VOSviewer,” presents a detailed exploration of how an institution’s faculty research
can be visualized in an innovative way using term co-occurrence maps. In
“Visualizing Archival Context and Content for Digital Collections,” Stephen Kutay
discusses how visualizations can enable users to better understand digital archival
collections and details strategies that can be used to effectively identify, curate, and
present visualizations using a digital archives platform. Tim Dennis, in “Using R
and ggvis to Create Interactive Graphics for Exploratory Data Analysis,” explores
how data visualization packages for the R programming language can generate a
range of visualizations using data sets such as gate counts, circulation statistics, and
article-level metrics.
As libraries become more advanced creators of data visualizations, they can also
play a role in educating their users to become data literate consumers and producers
of visualizations. Data literacy and visual literacy are increasingly important to
information literacy instruction, as explored in chapters by Charissa Jefferson and
Caitlin A. Bagley. In “Integrating Data and Spatial Literacy into Information
Literacy Instruction,” Charissa Jefferson provides a comprehensive overview of
strategies that libraries can use to incorporate open data and geospatial data into
library instruction and educate library users on finding, understanding, and using
data to create effective visualizations. Caitlin A. Bagley, in her chapter “Using
Infographics to Teach Data Literacy,” describes how using the infographic service
Piktochart enhanced information literacy instruction and empowered students to
create their own unique visualizations from research data.
Creating effective data visualizations requires a wide variety of skills, including a thorough understanding of math and statistics, knowledge of data storage and mining methods, and front-end design and development skills. This book
provides an overview of how these skills can be applied in a library context to create
better libraries, better services, and better instruction and to showcase the impact
libraries have on their users and their communities. It is my hope that after reading
this book, you will have the knowledge and tools needed to begin transforming your
library’s data into a compelling, meaningful story.
1
Sculpting Data for a Successful Visualization
Eric Phetteplace
There are many prerequisites for successful information visualization, not the least
of which is meaningful data. Ben Fry’s Visualizing Data breaks down the process of
creating a visual into seven steps: acquire, parse, filter, mine, represent, refine, and
interact (Fry 2008). Importantly, we do not even begin to work with something
visual until step five (represent); the first four steps all relate to manipulating data.
We must first acquire and hone meaningful information that can be used to answer
a pressing question about library services, to demonstrate value, or to simply serve as
a launch pad for exploration. While much has been written about how to evaluate
library value or what data libraries should be collecting, this chapter focuses on the
procedures of preparing data for visualization.
What is at stake when refining our data? Poorly processed information can make
creating an insightful visualization impossible or, even worse, mislead its audience.
Consider figures 1.1 and 1.2. In figure 1.1, the data’s trend is not evident at all
because an outlier distorts the chart. The chart’s vertical range has to be extended so
much to accommodate one aberration that the rest of the data seem, by
comparison, near identical in their Y values. The same data source is used in figure
1.2 except that the outlier has been removed; suddenly, a general downward trend is
evident. The two figures demonstrate how a solitary point can completely obscure
the distinctions present within the rest of our data.
Figure 1.1. A single data point greatly influences the chart’s boundaries.
Figure 1.2. Removing the single point allows the chart to intensify its focus.
Does that mean we are justified in removing the obnoxious outlier? Not in the
least. That a datum is inconvenient is not cause to remove it. We must ask why the
outlier exists, which in turn involves knowing what our data describe and how they
were collected. Consider that our data represent returns at a slot machine, with the x-axis representing time and the y-axis money. As the gambler plays, he slowly loses money, occasionally rebounding slightly. But suddenly he hits the jackpot, and his winnings for the session surge. It would be massively misleading to simply discard the
jackpot spin due to its extremity. In this example, the outlier must be retained if we
are to accurately represent the phenomena being studied. If we are unhappy with
our visualization, perhaps we should ask whether the data set itself is sufficient. Is a
single gambling session a meaningful sample? Would it not be more sensible to
gather the results of many sessions or else scrap the investigation if that’s not
obtainable?
On the other hand, not all outliers belong. Some represent errors in the data
collection process and should be removed. Identifying errors within our data set and
handling them appropriately requires familiarity with the data and good judgment.
It is an ethical issue as well as a technical one. Data processing is arguably a less
interesting topic than data visualization. It is dry, and tedious, and difficult. But it is
also essential. This chapter explores the multifarious facets of manipulating the raw
clay of a troublesome data set into a gorgeous sculpture.
TERMS
Before we discuss the various pieces of data manipulation and some specific case
studies, it is important that we have a shared understanding of vital terms. Some of
these terms will no doubt be familiar, acronyms you have at least seen or heard
before. Please feel free to skip terms you are already acquainted with; this chapter
does not provide any sterling insight into their essence but rather a basic outline of
their primary properties.
CSV
CSV (comma-separated values) is a plain text format in which each line of a file represents a row of data and the values within a row are separated by commas. The one caveat to the CSV format is that it truly is a row-based, delimited data set
in which the delimiter can vary; tabs are also common. Most visualization
applications have a means of reading in and exporting to delimited data formats
such as CSV.
JSON
JSON (JavaScript Object Notation) is a lightweight, text-based format that represents structured data as nested objects of key-value pairs and ordered arrays (ECMA International 2013). JSON is important because a number of web service APIs return JSON data.
Furthermore, the ease with which programming languages can manipulate it makes
it easier to process than many other formats. Finally, if we want to build a
visualization that lives on the web, it’s likely that JSON will be our desired format
due to the many JavaScript visualization tools, such as the D3.js library,1 that
consume it.
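To make the relationship between these formats concrete, here is a minimal sketch in Python that reads a delimited file and writes it back out as JSON; the file names and columns (a hypothetical circulation.csv) are invented for illustration, not taken from this chapter.

import csv
import json

# Read the delimited file; each row becomes a dictionary keyed by the header row.
with open("circulation.csv", newline="", encoding="utf-8") as infile:
    rows = list(csv.DictReader(infile))

# Write a JSON array of objects, the shape that tools such as D3.js consume readily.
with open("circulation.json", "w", encoding="utf-8") as outfile:
    json.dump(rows, outfile, indent=2)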
Regular Expressions
Regular expressions are magical incantations used to extract patterns from text.
They are bewildering jumbles of punctuation understood by few but used by many,
since their power is great. With luck, we will not need to know much about regular
expressions to perform the text processing we so require. Indeed, the mystic
expressions needed to match such common strings as an email address or URL can
often be found by searching online, though we should verify them against our data
sets before trusting them. To learn more about regular expressions, you are best
served by searching the Internet.2
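As a small taste of what regular expressions can do, the following Python sketch strips trailing birth and death dates from authorized name headings. The sample names are invented, and, as cautioned above, any pattern like this should be checked against your own data before you trust it.

import re

names = ["Morrison, Toni, 1931-", "King, Augusta Ada, 1815-1852"]  # invented examples

# Remove a trailing ", YYYY-" or ", YYYY-YYYY" date range from each heading.
cleaned = [re.sub(r",? \d{4}-\d{0,4}$", "", name) for name in names]

print(cleaned)  # ['Morrison, Toni', 'King, Augusta Ada']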
Spreadsheet
By spreadsheet data I mean data that are present in a spreadsheet application, such
as Apple’s Numbers, Google Drive’s Sheets, Microsoft Excel, or Libre/OpenOffice
Calc. It is perhaps misleading to separate these from other formats, as they can be
created from CSV files and are often implemented as XML documents of a
particular structure (as is the case with Excel and Calc). However, spreadsheet data
are distinct in that they are most useful in their native application and
suffer elsewhere. One can export between formats or to a more neutral form such as
CSV, but some features may be lost in the transition. For this reason, most
sophisticated visualizations are done in simpler and more open formats that are
easier to interchange between applications. If we intend to create a visualization
using a tool like Processing or D3.js, we will not be able to use spreadsheet data but
must export them first.
Unix
Unix is an amorphous concept. As I use it in this chapter, it stands less for a family
of computer operating systems than a systems design philosophy. Unix programs
are small and perform one task well and use text as a universal interface to exchange
data. The output of one program can always be used as the input of another. Thus
Unix is less about a specific era in the history of computers than the idea of small,
interlocking parts that form a greater whole. In terms of the importance of data
processing, well-established Unix tools allow us to chain together a suite of simple
operations to accomplish tasks of arbitrary complexity.
FOUR PLATEAUS
Fry’s aforementioned stages of information visualization are an important guiding
point, but they should not be taken as gospel. Instead, we will discuss four
intermingled plateaus of data preparation: the acquisition and collection, the
filtering and refinement, the processing and formatting, and the publishing of data.3
These plateaus are interleaved; they do not occur in cleanly discrete stages. For
instance, we might start immediately with the publishing stage by declaring our
research interests and data collection methodology publicly. As we collect more
data, we continually publish, perhaps on a website that allows prior versions to be
viewed and compared to present work, such as the software version control service
GitHub.4 Thus the publishing plateau is not a final pinnacle to be reached after we
obtain and process data but is present simultaneously alongside those other terrains.
Acquire/Collect
Acquiring data is increasingly becoming the easiest step in the data visualization
process, as libraries monitor more and more of their services and sources of open
data continue to grow. However, it is by no means straightforward to obtain the
data we are interested in. Depending on our position within our institution, some
data sets may be restricted. Departments may be possessive of their data or
legitimately worry about misrepresentation. It takes social tact and negotiation skills
to secure access to data outside our position’s immediate responsibilities. In these situations, I find it effective to assure data owners that they will have a chance to review any work done with their information and that they will likely benefit from my work.
On the other hand, open data are widely available through sites such as data.gov,
wikidata.org, and datahub.io. Library-specific data sets can be found through
OCLC or Open Library, to name but two major services. However, while open data
make acquisition easier, they can complicate the other plateaus we discuss in this
section. Data may be incomplete, inconsistent, in an odd format, or
undocumented. Thus the work of filling in missing data, normalizing values,
converting to a usable format, and discerning the data’s internal structure is
increased.
Thus far we have discussed obtaining fully formed data sets, but what about
collecting our own data from scratch? Defining our own data collection procedures
is incredibly powerful because we know the strengths and weaknesses of the
collection methodology and can adjust subsequent work accordingly (for instance,
by filtering out known inaccuracies). Collecting one’s own data lets data processing
inform collection. Were we unable to answer a research question with our available
information? Was it a struggle to process the raw data into a meaningful summary?
Then we can improve how data are gathered to make future efforts more successful.
Finally, data stewarding is worth discussing alongside issues of collection. Libraries
often collect data with personally identifiable information (PII) included; consider
circulation records attached to a patron’s identity or a recording of a patron’s voice
during a web usability study. Even supposedly anonymized data can potentially be
linked back to an individual via their patterns; the release of anonymized AOL
search logs in 2006 stands as an infamous example of how inadequate naïve
anonymization is (Hafner 2006). For that reason, we should avoid working with
data at a level that can be linked to individuals, instead seeking aggregated forms. If
we must work with PII, a timescale for data retention should be defined. We should
keep strict standards on how long we will maintain access to the data and how
securely they are stored.
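As one illustration of working at an aggregated level, the sketch below collapses a hypothetical transaction log into monthly counts; the patron identifier is read but never stored, and the file and column names are invented for the example.

import csv
from collections import Counter

monthly_counts = Counter()
with open("reference-transactions.csv", newline="", encoding="utf-8") as infile:
    for row in csv.DictReader(infile):
        # Keep only the year and month of an ISO timestamp, e.g., "2014-07";
        # the patron_id column is deliberately ignored.
        monthly_counts[row["timestamp"][:7]] += 1

for month, count in sorted(monthly_counts.items()):
    print(month, count)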
Filter/Refine
Altering the data post acquisition is perhaps the hardest area to discuss in the
abstract, because its nature will vary so greatly with each project. Data sets may need
to be supplemented, pruned, or aggregated. We must always keep in mind that we
refine our data sets to enhance their objectivity, not discard it. Discarding data
points simply because they are problematic is unethical and distorts the meaning of
the set as a whole. Rather, we should discard a datum if we can prove it to be
erroneous or outside the scope of our particular investigation. Are we researching a
library’s virtual reference services that occur online? Then we filter out interactions
with an unsuitable mode of communication, such as telephone or in person. Are we
researching print circulation? Then ILS records describing DVDs are ignored.
While these are two examples of filtering data outside our scope, refinement refers
less to narrowing our data than improving their quality. For instance, we can utilize
our knowledge of the library’s hours of operation to delete inexplicable data; an in-
person reference interaction at 2:00 a.m. when all library branches are closed is
clearly an artifact of poor data entry. Similarly, two separate and near-identical
bibliographic records for the same title need to be de-duplicated for our circulation
study to be accurate.
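A sketch of this kind of scoping in Python might look like the following; the column names, modes, and service hours are invented, and the point is simply that every exclusion is an explicit, documentable rule rather than an ad hoc deletion.

import csv

OPEN_HOURS = range(8, 22)          # assumed service hours, 8:00 a.m. to 10:00 p.m.
ONLINE_MODES = {"chat", "email"}   # modes in scope for a virtual reference study

kept, excluded = [], []
with open("reference-interactions.csv", newline="", encoding="utf-8") as infile:
    for row in csv.DictReader(infile):
        in_scope = row["mode"] in ONLINE_MODES and int(row["hour"]) in OPEN_HOURS
        (kept if in_scope else excluded).append(row)

# Excluded rows are kept separately so the filtering decisions can be reviewed.
print(len(kept), "rows in scope;", len(excluded), "rows excluded")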
Whatever our methods and reasoning when filtering data, documentation is
utterly vital. We should maintain a running narrative of the alterations we make
and their rationale. The value of this document is manifold: formally justifying our
decisions strengthens them and makes us less likely to rely on indefensible personal
preferences. With a detailed guide we, or anyone else, can re-create our study and
generate longitudinally comparable data in the future, and we can ask others to
critique our decision making to gain perspective.
Format/Process
While perhaps similar to Filter/Refine, here we are more concerned with the
technical procedures of converting data from their original format to the one that
our visualization software utilizes. While we may be handed an Excel file full of
equations, macros, and cell references, that is rarely the most suitable form for our
final product. We may need to convert equations into their raw values, export
multiple worksheets into separate tabular data files, and normalize character
encodings from a common Windows encoding (e.g., Windows-1252) to the more
ubiquitous UTF-8.
Any nontrivial format conversion should be verified to some extent. It’s easy to
grab a random sample of data points to verify that all values were translated
sensibly. We can perhaps inspect a random subset of ≈10 percent of our data to
confirm that everything converted accordingly. Often, it’s possible to verify via
aggregation methods that data were transformed appropriately. For instance, if we
convert from CSV to JSON, we can compute the average of a numeric field before
and after conversion to virtually ensure that nothing was distorted in that particular
column.
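A quick check along those lines might be sketched in Python as follows; the field name checkouts and the file names are hypothetical.

import csv
import json

def mean(values):
    return sum(values) / len(values)

with open("circulation.csv", newline="", encoding="utf-8") as f:
    csv_mean = mean([float(row["checkouts"]) for row in csv.DictReader(f)])

with open("circulation.json", encoding="utf-8") as f:
    json_mean = mean([float(record["checkouts"]) for record in json.load(f)])

# The two means should agree; a discrepancy points to a problem in the conversion.
print(csv_mean, json_mean)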
As we go along converting data from one format to another, it behooves us to keep
intermediary copies of the data as they transition. If we discover down the line that
something has become corrupted, we had better be able to pinpoint where in the
formatting procedures this occurred and determine a fix. As mentioned at the start
of this section, an ideal tool for saving snapshots of our data after each
transformation is version control software such as Git. If we download an Excel file,
export to CSV, and convert from CSV to JSON, we can record each alteration in a
“commit” that can easily be reverted to later. Version control software is also
another argument for simple, text-based file formats. Programs such as Git have a
much easier time tracking the changes in a line-oriented text file than in more
complicated formats meant to be read by a particular application, such as Excel
spreadsheets.
Publish
Many volumes could be written on publishing data alone. Rather than devote a
substantial segment of this chapter to the subject, I will simply emphasize that
publishing data is important. If we want our research study to be verifiable, we must
publish the data in some form alongside our final visualization. Beyond verifiability,
publishing our data means that others can correct errors in the event that our
filtering/refining was imperfect, remix our data by applying it to another
visualization technique, or reuse it as part of a larger study that takes our data as but
one of several sources.
There are a few major considerations when publishing data. For one, we should be
wary of publishing any PII or data with strong patterns that could be linked to
individuals, as discussed in the Acquire/Collect section. One value that sets libraries
apart in the digital age is our commitment to patron privacy, and that absolutely
extends to how we treat potentially sensitive data. If in doubt, we should not share
or publish data that could reveal information about an individual without his or her
explicit permission. Returning to our dual examples of reference interactions and
circulation totals, it is easy to see how disaggregated data could be abused. Knowing
that the library answered 73 email reference questions in the month of July is likely
an unrevealing level of aggregation; knowing that one of the emails was answered at
7:05 p.m. on July 3 on the subject of illegal fireworks is potentially incriminating
and could clearly be used by local law enforcement. Circulation records are an even
more obvious liability, with the classic example of exposing a patron’s checkout
history that includes The Anarchist Cookbook.
If we do publish data, it is our duty to document them thoroughly and provide
them in a useful format. A raw data set with no description of its schema, data
types, or collection procedures is often worse than no data at all because of the ease
with which it is misinterpreted. Rather, we now find a use for all the documentation
created during our Filter/Refine plateau. Furthermore, presenting a formal data
schema is of immense value to potential consumers. Knowing whether a particular
field is composed of integers, unique ID numbers, or floating point values has
implications for how others perform their own processing of our information.
While documenting a data schema is a somewhat nascent field, there is a
datapackage.json standard that lets us describe our data with a lightweight metadata
file we place alongside our published product (Pollock and Keegan 2015).
Standardized efforts to describe data structure have the added advantage of making
our data set more accessible to machines; while we do want individual researchers to
be able to ascertain the meaning of our data, machine-readable documentation
again enables larger meta-studies that quickly process our data via code (Phetteplace
2015).
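A minimal descriptor of this kind, written from Python, might look like the sketch below; the names, path, and field types are illustrative only, and the current Data Packages documentation should be consulted before publishing.

import json

descriptor = {
    "name": "reference-interactions-2014",
    "resources": [
        {
            "path": "reference-interactions.csv",
            "schema": {
                "fields": [
                    {"name": "month", "type": "string"},
                    {"name": "mode", "type": "string"},
                    {"name": "count", "type": "integer"},
                ]
            },
        }
    ],
}

# Written alongside the published CSV so both humans and machines can read the schema.
with open("datapackage.json", "w", encoding="utf-8") as outfile:
    json.dump(descriptor, outfile, indent=2)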
A SERIES OF STUDIES
No single case study can easily encompass the range of work involved in preparing
data for visualization. Rather, we will work through a number of small issues in a
variety of projects to capture an expansive view of the topic.
Character Encoding
In our smallest case study, we can convert from uncommon or proprietary character
encodings to more usable, web-friendly formats with command line tools or other
software services. If you are unfamiliar with character encodings, they are essentially
reductions of character sets (e.g., the Western alphabet, Arabic, Wingdings symbols)
to digital representations like binary or octets. When a piece of software opens a
document in an unanticipated character encoding, errors or misinterpreted
characters can occur, often indicated by replacement characters. Recently, I was
provided with a set of metadata files in a reusable CSV format but encoded in Mac
OS Roman; inspecting the data in my terminal showed replacement characters, and
the bulk metadata tool I intended to use expected UTF-8 data. I was able to convert
the metadata with the “iconv” command line program that comes preinstalled on
Mac OS X (iconv(1) Mac OS X Manual Page 2015):
iconv -f MACROMAN -t UTF-8 metadata.csv > utf-8-metadata.csv
The short “-f” and “-t” flags stand for “from” and “to” respectively, while the
greater than sign indicates that I am piping the converted output of the iconv
command into a new file named “utf-8-metadata.csv.” The iconv program knows
well over a hundred different character encodings, which are listed with the “iconv --list” command.
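If a command line is not handy, the same transcoding can be sketched in a few lines of Python, which ships with a mac_roman codec; the file names mirror the iconv example above.

# Re-encode a Mac OS Roman file as UTF-8, mirroring the iconv example above.
with open("metadata.csv", encoding="mac_roman") as infile:
    text = infile.read()

with open("utf-8-metadata.csv", "w", encoding="utf-8") as outfile:
    outfile.write(text)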
Unix Text Processing
The Unix command line offers an incredible set of tools for processing text. What’s
more, the Unix philosophy of small programs that perform one task well and
operate on text input has proven to be a solid historical base on which many
modern tools choose to operate. For this reason, learning basic command line
operations can greatly enhance our ability to meld data from a troublesome format
into a useful one. For instance, say we are given a text file of author names, only
there are duplicates in the file, which has no ordering logic, and we need a list of all
unique names to seed our network diagram. A short chain of Unix commands
produces the desired data:
sort authors.txt | uniq > unique-authors.txt
Here we first use the “sort” program to sort our text file alphabetically. Next we
“pipe” the output of the sort program to the “uniq” program with the vertical bar
character. The uniq program then de-duplicates identical, adjacent lines in our text
file. This sorted text output is redirected into a new “unique-authors.txt” file, where
it is ready to be used, using the same greater than symbol we saw in the character
encoding section. This example, while trivial, demonstrates the elegant power of
Unix text processing; though each program does only a simple task, the ability to
chain together the output of one as the input of the next allows us to create complex
text manipulation procedures. For instance, what if our author names followed the
common format in authority files and had a birth and death date following them?
We could use the “stream editor” sed to search for and remove date sequences from
our list of names:
sed -e 's| [[:digit:]]*-.*||' unique-authors.txt > unique-authors-without-dates.txt
The above sed command is more mysterious than those we have seen previously,
but only because of the presence of regular expressions. The sed program, like sort
and uniq before it, operates on a text file and has its output redirected to a new text
file. The “-e ‘s| [[:digit:]]*-.*||’” string tells sed to perform an edit, specifically a
substitution (“s”), which substitutes an empty string (the final pair of adjacent
vertical bars “||”; if we wanted to replace with something instead of nothing, the
replacement is set between these two characters) for character patterns like “1983-,”
“500-600,” or “999-1020.”5 As a short aside, our process of saving the result of each
procedure as a new file is good practice for an initial processing run-through. Until
we are certain that our procedures are flawless, saving the inchoate results helps us
rewind to a safe point if we make a tragic mistake later. The most surefire, formal
way of saving our progress is by fully versioning our data using a program such as
Git, Mercurial, or Subversion, but relying on meaningful file naming conventions is a
cheap method of accomplishing the same.
But what if we have our data in the fine CSV format but want to perform slightly
more complicated spreadsheet-like operations? Procedures that are conceptually
simple, such as “extract all the values from the second column of our data,” can
become extraordinarily complex if we limit ourselves to standard command line
tools such as sed, sort, and uniq. However, the csvkit suite of tools written in
Python can solve these problems with ease: the in2csv tool converts from Excel or
JSON to CSV; csvcut allows us to slice particular columns out of our data; csvgrep
mimics the powerful Unix “grep” command except it searches over the contents of
cells; csvstat reports summary statistics on our columns; csvlook renders our data as
a readable table right in our terminal (Csvkit 0.9.1 — Csvkit 0.9.1 Documentation
2015). These are but a few examples of useful csvkit tools.
Returning to our earlier example, what if we had a CSV of circulation transactions
and wanted to extract a list of unique authors? We know how to perform this sort
of extraction with a simplistic text file, but now we can do the same with our CSV:
csvcut -c "Author" circulation-events.csv | tail -n +2 | sort | uniq >
unique-authors.txt
Assuming the column we want is labeled in the header row as “Author,” this string
of procedures first extracts the authors column, removes the first line with “tail” to
delete the “Author” header, sorts it alphabetically, and then de-duplicates as before.
The csvkit tools are powerful and comprehensive, extending all the usual line-
oriented text operations to everyone’s favorite delimited data format.
While many applications can perform the operations described in this section, the
strength of utilizing common Unix tools lies in the ability to write long scripts that
flawlessly perform the exact same operation in the future. If we are publishing our
data, it is easier to include a four-line shell script that converts our list of authors
than to describe a series of Excel sorts and equations that achieve the same outcome
but are less universal. We—or other parties—can update the backing data of a
visualization in mere seconds by downloading a fresh batch of author names and
running our script, confident that the output will be consistent with prior versions.
Custom Scripting
EZproxy is a ubiquitous piece of library technology; virtually every library needs a
proxy server to provide authenticated access to its subscription databases outside of
the physical library building, and a large majority choose to employ OCLC’s
software to do so. But what if we want to investigate our proxy server’s usage
patterns to inform purchasing decisions, argue for the library’s value, or modify our
services? EZproxy logs are packed with informative data but they are not
immediately useful. Here’s an example of one line from a log file:
12.123.23.32 - wp0Zort3D7B5a2w [20/Jan/2014:07:56:40 -0500] "GET
http://global.ebsco-
content.com:80/embimages/c964b0e32e6167277108703cad5e9f2d/52dd1d01/imh/upi/thumbnail
HTTP/1.1" 200 2244
If you can read that and garner useful clues from it, great! You can skip this section.
But if you are utterly baffled at what half of the pieces mean, then it might require
more work to make the raw data usable. EZproxy data are most similar to CSV
format; each piece is separated by a single space. But then there are oddities:
the timestamp is wrapped in brackets while the HTTP request details are in quotes.
Even a simple conversion to the more universal CSV format is nontrivial in that
respect. Furthermore, EZproxy can be set to log every HTTP request that clogs up
the data with tons of lines representing rather minor events, such as the loading of a
thumbnail (which is what’s happening above). If we have a huge text file of
thousands of HTTP requests, how can we distill it down to a metric our
administrators care about, such as an individual person’s session?
One method of converting EZproxy data is to write a custom script that recognizes
the meaning of each element in a single line of a log file. Robin Camille Davis,
emerging technologies and distance services librarian at John Jay College of
Criminal Justice (CUNY), wrote a piece of code in the Python programming
language to do just such a thing (2015). The script takes a directory of EZproxy log
files and produces a series of aggregate statistics, such as the total number of
connections, number of on-campus connections, percentage of connections that
were on campus, number of connections from within the library building, number
of student sessions from off campus, and number of faculty or staff sessions from off
campus. It should be immediately evident how much more useful those figures are
in comparison with the raw data that a particular IP address requested a certain
thumbnail image at 7:00 a.m. on January 20. Camille Davis’s script not only
provides a high level of insight into our EZproxy usage but also enables rapid bulk
processing of large sets of logs that would be cumbersome and time consuming to
perform otherwise.
The Python language is ideally suited to many kinds of data processing. For one, it
is meant to be a readable and easy-to-learn language that mimics the English
language, more so than other languages that sometimes rely on arcane combinations
of punctuation as shorthand for certain operations. Furthermore, Python has a “batteries included” philosophy, which means that the language comes with a large
number of built-in features for handling common tasks. Inspecting the EZproxy
parsing script, the utility of these features is obvious: regular expressions are used to
search for patterns in each line of the logs, a system module is used to accept user-
entered arguments on the command line such as which directory to process the files
within, and the “glob” feature makes iterating over said files straightforward. The
full script is powerful yet only 124 lines long, many of which are explanatory
comments and not computer code. Python’s prowess is not limited to only text-
based processing like this, as it has tools for manipulating numeric and scientific
data as well. While Python is a strong choice for custom data processing, it is not
the only one. Many scripting languages exist that aim to offer the programmer
extensive control of his or her data in just a few lines of expressive code. Other
languages, such as Ruby, JavaScript, and Perl, could perform the same logs analysis
with ease.
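To give a flavor of those features without reproducing Camille Davis’s script, the sketch below globs a directory of logs and uses a regular expression to pull a few fields out of lines shaped like the example shown earlier; the directory name is invented, and a real script would need to adapt the pattern to the local EZproxy log format.

import glob
import re
from collections import Counter

# Rough pattern for lines like the earlier example: IP address, an ignored field,
# session identifier, bracketed timestamp, quoted request, status code, byte count.
LINE = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

requests_per_ip = Counter()
for path in glob.glob("ezproxy-logs/*.log"):  # hypothetical log directory
    with open(path, encoding="utf-8", errors="replace") as logfile:
        for line in logfile:
            match = LINE.match(line)
            if match:
                requests_per_ip[match.group(1)] += 1

print(len(requests_per_ip), "distinct client addresses in the logs")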
Coding a custom script is certainly not the only means for processing EZproxy
logs. In a blog post, Lauren Magnuson and Robin Camille Davis outline several
approaches to squeezing delicious lemonade out of sour lemon server logs (2014).
Besides scripts, they provide several potential options for the aspiring analyst and outline the pros and cons of each, noting that AWStats, Piwik, ezPAARSE, and
Splunk are all useful for these purposes. I have used Sawmill to quickly summarize
EZproxy data myself, extracting useful information such as what is my library’s
busiest month of the year, busiest day of the week, or busiest time of day and which
databases are utilized disproportionately from off campus given their overall usage
statistics.
While it’s by no means necessary to code a custom processing solution, the ability
to do so provides full control over our data. Particularly if we have opaque data that
are specific to our institution, coding may be the absolute best method we have of
ensuring they are processed properly. EZproxy is actually not a great example due to
its ubiquity; EZproxy logs are not too dissimilar from other formats of web server
logs, and thus a few pieces of software come with an understanding of the log
format. However, if we are faced with locally maintained vocabularies, heavily
modified metadata schemas, or simply the desire to approach a common data set in
a novel manner, coding knowledge is invaluable.
Normalization with OpenRefine
Names are difficult data. Whether of people, places, or institutions, names often
come in a variety of forms. There is a reason that authority control is such an
important concept in library science. We keep carefully maintained lists of
authoritative forms of proper nouns specifically because our data would lose a
substantial amount of value without them. Knowing that the strings “Samuel
Clemens,” “Mark Twain,” and “Twain, Mark, 1835-1910” all refer to the same
person allows us to provide a single access point to a set of metadata records that
share a relation to that person. Consider a data set with the following entries:
"King, Augusta Ada, 1815-1852"
"Morrison, Toni"
"Morrison, Toni, 1931-"
"King, Augusta Ada"
"Morrison, Toni 1931-"
"Lovelace, Ada"
Figure 1.3. OpenRefine’s algorithms recognize that several cells of data can be combined.
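OpenRefine’s clustering feature spots such variants by reducing each cell to a normalized key and grouping the cells whose keys collide. A simplified imitation of that idea in Python, applied to the six entries above, might look like this:

import re
from collections import defaultdict

names = [
    "King, Augusta Ada, 1815-1852",
    "Morrison, Toni",
    "Morrison, Toni, 1931-",
    "King, Augusta Ada",
    "Morrison, Toni 1931-",
    "Lovelace, Ada",
]

def key(name):
    # Drop digits and punctuation, lowercase, and sort the remaining tokens,
    # a rough imitation of OpenRefine's fingerprint keying method.
    tokens = re.sub(r"[\d\-,.]", " ", name).lower().split()
    return " ".join(sorted(set(tokens)))

clusters = defaultdict(list)
for name in names:
    clusters[key(name)].append(name)

for members in clusters.values():
    print(members)

# Note that "Lovelace, Ada" keys differently from the "King, Augusta Ada" forms,
# a reminder that algorithmic clustering supplements, but does not replace,
# authority control.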
ASSESSING IMPACT
To discuss the impact of data cleanup, I present one final case study. At a prior
institution, a small community college where many students drove from their
homes to class, I was a member of a sustainability committee. We were tasked with
estimating the total carbon emissions of the college as part of our participation in
the American College and University Presidents’ Climate Commitment (Home |
Presidents’ Climate Commitment 2015). Many pieces of the puzzle could be put
together with information from our facilities or budgetary offices, but calculating
student commute mileage was a challenge. We knew that it was likely a large
portion of our total emissions, but there was no easy way to determine it given the
data at hand, which consisted of course schedules for all our students.
First of all, student course schedules are obviously personal information. It would
be highly troubling were someone to gain access to a schedule without
authorization. So I created a data retention plan to delete all but the aggregated data
totals and equations used to derive them three months after the project was
completed. I also introduced another layer of anonymization into the data,
removing student ID numbers that could be traced to individuals and replacing
them with different unique identifiers. My eventual goal was to derive an
approximation of how many miles each student drove during the academic year
from a course schedule and home zip code. To do so, I utilized many of the
approaches discussed in this chapter. I filtered out extraneous data that didn’t relate
to automobile emissions: online classes, classes taught to local high schools, and
inexplicable data such as classes that apparently never met. I had to adjust for factors
such as students dropping out midsemester; hybrid classes that had only half of their
sessions in person; finals; classes that met twice on the same day, thus implying only
a single trip to campus (e.g., science laboratory courses); and students who took
what limited public transit was available in the area.
With an estimation of the number of trips each student took to campus, I still
needed to derive the distance from their homes to each of our campuses. So I
employed a Google Maps API that accepted a “from” and “to” zip code and
returned a distance in miles. But looking over the mileage data, I spotted some
strange results. We had students commuting from Texas and the middle of the
German countryside! It turned out that some of our students listed zip codes that
do not exist in our system, which Google Maps interpreted as foreign addresses, and
others listed a home address where they clearly did not live during the academic
year. I had to further filter our data set by removing rows that met an arbitrary
standard of feasibility; no one was commuting from hundreds of miles away. The
final piece of the equation—taking mileage and converting it into emissions—was
actually done for me by a spreadsheet application designed exclusively for the
purpose of calculating an institution’s emissions. It uses an average gas mileage of
American automobiles to produce a number of metric tons of carbon dioxide.
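The feasibility filter described above might be sketched in Python as follows; the 150-mile cutoff, file name, and column name are all invented for illustration, and the mileage is assumed to have already been retrieved from the mapping API.

import csv

MAX_ONE_WAY_MILES = 150  # arbitrary feasibility cutoff for a commuter campus

plausible = []
with open("student-commutes.csv", newline="", encoding="utf-8") as infile:
    for row in csv.DictReader(infile):
        # Rows beyond the cutoff (Texas, the German countryside) are excluded
        # and noted in the running narrative of alterations.
        if float(row["one_way_miles"]) <= MAX_ONE_WAY_MILES:
            plausible.append(row)

print(len(plausible), "plausible commuting records retained")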
There were many assumptions made throughout the process of estimating carbon
emissions from student commuting. I had to repeatedly adjust my ideas of how
classes were scheduled and where students were coming from by performing a data
transformation and inspecting its effects. The final product, however, was quite
worth it; we produced a few bold graphics to demonstrate the primary sources of
the college’s emissions and displayed them prominently at a major campus event.
These figures further informed policy decisions at the college and how best we
could begin approaching carbon neutrality. Knowing student commuting totals,
even as the rough estimate we had, was a key ingredient in our planning. Besides
leaving behind a series of spreadsheets with painstakingly developed equations, I
also wrote a long narrative document describing each operation, the rationale
behind the equations, and the assumptions made. This document makes it easier to
re-create my steps and inspect my assumptions for flaws.
Often, the difficulty with data cleanup is knowing when to stop. If our goals are
strictly defined at the start—we just need the data in a slightly different format, we
want to normalize certain values, etc.—then completing them is no challenge. In
this case, the value of our work is often obvious since without it the creation of a
corresponding visual would be impossible. But if our objective is more nebulous, we
can spend much time endlessly tweaking our data set in an attempt to perfect it.
Much as writers struggle with endless revisions, we need to stop at some point and
publish. I could have spent years fine-tuning our commuting data, but deadlines
demanded I do the best I could within a restricted timeframe. Luckily, the open-
ended nature of data processing works well with iterative design; we clean up our
data, experiment with a visualization, consider its efficacy, and repeat. Only once
our visual is compelling can we claim that our data cleanup is complete.
CONCLUSION
Data cleanup is a difficult and perilous task. It requires not only variegated skills,
from statistical analysis to coding to knowledge of sophisticated software, but also
subjective decision making. The line between removing irrelevant noise from a data
set and introducing bias is often razor thin. While the visual steps that follow early
stage data sculpting are more glamorous and yield more interesting results, they are
not necessarily more important. Edward Tufte (1990) famously said, “If the
statistics are boring, then you’ve got the wrong numbers.” Investing time and effort
in the earliest stages of data visualization to ensure that your collection and filtering
procedures are top notch is the best way to obtain the right numbers.
NOTES
1. D3.js refers to the Data-Driven Documents JavaScript library, which is a widely used code library for
creating data visualizations. Examples of D3.js will be discussed in chapter 3 of this volume. For more
information, see https://d3js.org.
2. I would be remiss not to mention two of my favorite sources for learning regular expressions: Kim, Bohyun.
2013. “Fear No Longer Regular Expressions.” ACRL TechConnect Blog. July 31.
http://acrl.ala.org/techconnect/?p=3549; Verou, Lea. 2012. “Reg(exp){2}lained: Demystifying Regular
Expressions.” Presented at the O’Reilly Fluent, San Francisco, CA, May 29. https://www.youtube.com/watch?v=EkluES9Rvak.
3. I use the word plateau deliberately, following philosophers Gilles Deleuze and Felix Guattari in conceiving
it as a medium to be passed through and returned to, not a final destination. See: Deleuze, Gilles, and Félix
Guattari. 1987. A Thousand Plateaus: Capitalism and Schizophrenia. Minneapolis: University of Minnesota
Press.
4. GitHub is a repository hosting service that is used for storing and sharing code and version control
information through the Git version control system. Accounts can be created for free at https://github.com.
5. I do not claim that this pattern sufficiently captures the possible forms of dates in authorized name formats;
for instance it does not accommodate AD/AH/BC/BCE calendar abbreviations, which might precede a date.
The best way to identify an appropriate pattern is by inspecting one’s own data and seeing what works
sufficiently, as the possible formats for any string of text (whether it be a name, date, etc.) are incredibly diverse.
REFERENCES
Camille Davis, Robin. 2015. “Robincamille/ezproxy-Analysis.” GitHub. Accessed August 2, 2015.
https://github.com/robincamille/ezproxy-analysis.
“Csvkit 0.9.1 — Csvkit 0.9.1 Documentation.” 2015. Csvkit 0.9.1. Accessed July 27, 2015.
https://csvkit.readthedocs.org/en/0.9.1/.
Deleuze, Gilles, and Félix Guattari. 1987. A Thousand Plateaus: Capitalism and Schizophrenia. Minneapolis:
University of Minnesota Press.
ECMA International. 2013. The JSON Data Interchange Format. 1st ed. Geneva. http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf.
Fry, Ben. 2008. Visualizing Data. Beijing; Cambridge: O’Reilly Media, Inc.
Hafner, Katie. 2006. “Researchers Yearn to Use AOL Logs, but They Hesitate.” The New York Times, August
23, sec. Technology. http://www.nytimes.com/2006/08/23/technology/23search.html.
“Home | Presidents’ Climate Commitment.” 2015. Accessed August 2, 2015.
http://www.presidentsclimatecommitment.org/.
“iconv(1) Mac OS X Manual Page.” 2015. OS X Man Pages. Accessed July 20, 2015.
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/iconv.1.html.
“JSON.” 2015. Accessed July 11, 2015. http://json.org/.
Kim, Bohyun. 2013. “Fear No Longer Regular Expressions.” ACRL TechConnect Blog.
http://acrl.ala.org/techconnect/?p=3549.
Magnuson, Lauren, and Robin Camille Davis. 2014. “Analyzing EZProxy Logs.” Blog. ACRL TechConnect
Blog. http://acrl.ala.org/techconnect/?p=4684.
Phetteplace, Eric. 2015. “A Forray into Publishing Open Data on GitHub.” ACRL TechConnect Blog.
http://acrl.ala.org/techconnect/?p=5084.
Pollock, Rufus, and Martin Keegan. 2015. “Data Packages.” Data Protocols Lightweight Standards and Patterns
for Data. http://dataprotocols.org/datapackages/.
Tufte, Edward R. 1990. Envisioning Information. Cheshire, CT: Graphics Press.
Verou, Lea. 2012. “Reg(exp){2}lained: Demystifying Regular Expressions.” Presented at the O’Reilly Fluent, San
Francisco, CA. https://www.youtube.com/watch?v=EkluES9Rvak.
2
Designing Public Visualizations of Library
Data
Angela M. Zoss
As in many other organizations and fields of inquiry, the data generated by libraries become ever more complex, and the need to communicate trends both internally and externally continues to grow. As visualizations become increasingly
embedded in library assessment and outreach, it is crucial to consider their audience and to design them to be easy to interpret. This chapter will walk readers through the process of selecting a
visualization based on a particular data representation need, designing that
visualization to be optimized to its specific purpose, and combining visualizations
into larger narratives to engage a public audience.
The process of choosing, designing, and combining visualizations in a way that is
engaging to a broader public requires an understanding of how visualizations
convert data into shapes or spatial arrangements that can be analyzed and
understood. Various types of visualizations have been developed over the centuries
to take advantage of human skills and acuities. To understand why some work
better than others in certain situations, it helps to first examine the components of
these graphs and how well they match up with what humans are good at.
Several studies have been done to explore how well individuals read different
components of visualizations. Cleveland and McGill (1985) conducted a seminal
study in which users were asked to evaluate two data elements in terms of the
proportion one represented of the other. This task was performed using multiple
visual representations: position, length, angle, slope, and area. The results suggested
that individuals are much better at assessing numerical data using position and
length encodings than they are using slope or area of a shape. This study was
replicated and extended by Heer and Bostock (2010), who confirmed that position
and length afford more accuracy than slope or area, including both circular and
rectangular areas.
A quote from Moritz Stefaner (2012), a professional visualization designer and
researcher, summarizes this and related academic and professional work around
designing data visualizations: “position is everything, color is difficult.” The
following section will explore how these studies and inherent properties of different
types of visualizations can be used to select an appropriate visualization for a data
need.
VISUALIZATION SELECTION
This section will review common types of charts and maps and discuss their
strengths and weaknesses in relation to different types of data and audiences. For
example, some visualizations are matched well with the visual acuities of humans,
some are good at filling space on a page, and some can be especially engaging for a
public audience. Additional visualization-specific design suggestions will accompany
the discussions of each visualization type.
Basic charts and graphs1 (as opposed to custom/novel visualizations or highly
scientific plots) are the easiest visualizations to produce, as well as the easiest
visualizations for a broad audience to understand. Their ubiquity does not,
however, mean that all chart types can be equally well understood by all people.
First, different charts use different visual encodings, and those encodings match up
differently with human perceptual abilities. Second, people do not have equal
exposure to all chart types. Depending on someone’s educational background, she
or he may have more or less practice interpreting a particular chart type. Third, the
data being visualized may have properties that do not match well with a particular
chart type, making the resulting chart difficult or impossible to read effectively.
Choosing an appropriate chart type requires familiarity with the chart options
available, as well as with the properties of the data and the purpose of the chart.
What trends in the data are most important? What chart will best show those
trends? Who is going to be reading the chart? The following section will review six
basic chart types: bar charts, pie charts, line charts, scatterplots, bubble charts, and
heat maps. Each chart will be presented with a discussion of the best uses of the
chart, the most common concerns with the chart, and any additional suggestions
about the use of the chart.
Bar Chart
Bar charts are a staple of data visualization—especially those that are intended for a
broad audience. They are some of the most general types of charts, with few limits
on the kind of data that can be used. Here are just a few basic design tips to keep in mind to make sure you’re taking full advantage of how powerful they can be.
Data. The most basic form of a bar chart2 involves one categorical variable and
one numerical variable. The categories from the categorical variable are each given a
bar, and the quantities from the numerical variable are used to determine the length
of the bar. An optional, additional categorical variable can be added to make either
a grouped bar chart (where several bars appear next to each other for each category
in the original categorical variable) or a stacked bar chart (where the original bar is
split up into several colored segments, stacked on top of each other).
Strengths. Bar charts are especially good for times when it’s important to be able
to read numbers accurately off the chart. The reason for this lies in the human
visual system. Humans have excellent perceptual acuity for differences in alignment
and line length (Cleveland and McGill 1985, Ware 2013). Bar charts are also
extremely common. They are some of the first charts we learn to read and produce.
Comparing the individual bars to each other or exploring the overall trend across
categories are both good uses for a bar chart.
Weaknesses. Bar charts have a few common concerns, however. First, labeling the
bars with long phrases can cause design problems. If the bars are vertical, it can be
difficult to make the bar labels horizontal; the words will be much wider than the
bars, leaving large gaps or awkward line breaks in the labels. If, on the other hand,
the bars are horizontal, it is very easy to have long labels for each bar. The text
height and the bar height can be very similar, and the wide format for the chart fits
much better with the aspect ratio of screens nowadays.
Sometimes one or two bars will be much longer than the others; consider creating
a second chart in which you exclude the long bar and zoom in on the shorter bars
(Few 2012) or investigate whether several categories can be combined into an
“other” category. Also consider what comparisons or evaluations are most important
for your bar chart; if alphabetical order is not relevant for your data, for example, it
might make sense to sort the bars instead by length (Cleveland 1994).
Most important, however, is to pay attention to the numerical axis of the bar
chart. Because humans are so good at perceiving differences in length between bars,
having the axis start at anything other than zero will distort how the length
corresponds to the data. This renders the bars, which are a powerful visual cue,
inaccurate and hard to ignore. The numerical axes for bar charts should always start
at zero (Few 2012, Yau 2013).
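To make these recommendations concrete, here is a minimal sketch using Python and the matplotlib library (just one of many tools that can produce bar charts); the category names and counts are invented for the example. It draws horizontal bars, sorts them by length, and keeps the numerical axis anchored at zero.

import matplotlib.pyplot as plt

# Invented example data: checkouts by collection
categories = ["Reference", "Juvenile", "Periodicals", "Nonfiction", "Fiction"]
checkouts = [420, 980, 310, 1450, 2210]

# Sort the bars by length so the ranking is easy to read
pairs = sorted(zip(checkouts, categories))
checkouts, categories = zip(*pairs)

fig, ax = plt.subplots()
ax.barh(categories, checkouts)   # horizontal bars leave room for long labels
ax.set_xlim(left=0)              # the numerical axis of a bar chart starts at zero
ax.set_xlabel("Checkouts")
plt.tight_layout()
plt.show()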
Pie Chart
Along with bar charts, pie charts are also very common in infographics, reports, etc.
They are the subject of much discussion and often criticized because of the many
problems associated with them, but they are still a valuable chart in certain contexts.
Data. Like a bar chart, a pie chart is built from one categorical variable and one
numerical variable. In a bar chart, each category becomes a separate bar. In a pie
chart, each category becomes a separate wedge of the pie. The proportion of the
total data associated with each category is represented by the area of the wedge.
Typically, programs that create pie charts will start the first wedge at the “12:00”
position of the circle and proceed clockwise. Programs also typically assign a
different color to each wedge; this color is then used to match a wedge with the
legend entry for the wedge. If wedges are instead labeled directly, they can each have
the same color, provided a contrasting border color is used to show the wedge
boundaries (Wong 2010).
Strengths. Pie charts are a favorite when people want to show a part-to-whole
relationship. That is, when you have data on the total amount of some variable, like
a budget, it is natural to want to represent the subdivisions of that total in a way
that shows that all of the parts add up to the total. Pie charts also are quite common
and easy to produce, so audience familiarity should be relatively high.
It has been said that the best data set for a pie chart is a single number. Focusing
on one category and how much of the total data it represents can be very
compelling. This technique is used more frequently in an infographic context—one
where data are reduced to bite-sized pieces—rather than a reporting context, where
completeness may be important.
Weaknesses. Humans are notoriously bad at reading data values from pie charts.
Our perceptual system is not very precise at evaluating angles, orientations, or two-
dimensional areas (Cleveland and McGill 1985, Heer and Bostock 2010). With
each wedge rotated in space, it is incredibly difficult either to read precise values or
to compare wedges, especially those that do not appear next to each other.
This problem is compounded if the differences between data values are very small.
For example, if you have five categories and each is only slightly above or slightly
below 20 percent of the total, the fine distinctions will be lost to chart readers. If the
goal of your chart is to help readers understand small differences, the pie chart may
not be the best choice; consider switching to a bar chart. If, on the other hand,
there are noticeable differences in the sizes of most of your wedges, you can make
sure to help your readers as much as possible by putting the wedges in decreasing
order by size (i.e., left-hand edge of largest wedge at “12:00,” followed on the right
by next largest wedge, etc.).3 Sorting by size can aid readers in making relative
estimates of data values.
Also beware of data that do not have a true part-to-whole relationship. One
example might be data that change over time—for example, where each “category”
is a single year. In this case, there is a mismatch between the visual encoding or the
visual metaphor (one that joins parts of the chart into a solid shape) and the nature
of the data (one where a single variable is changing over time). Combining multiple
years does not often result in a meaningful total, so encoding that total as a visual
element (a completed circle) is seldom necessary or natural. Time-based variables
are often a good match for line charts, which will be discussed next.
Pie charts have another notorious concern—the use of three-dimensional special
effects. Some programs allow the chart’s creator to add a special effect that renders
charts as though they are 3D objects. This may simply involve drawing a 2D pie
chart to look like it is popping forward, but it may also involve actually tilting the
chart “away” from the reader. This tilting effect causes an extreme distortion of the
visual area of the chart (Skau 2011). The wedge at the bottom of the chart will end
up taking up much more of the chart space than it would in a 2D version, and the
wedges at the top will take up much less chart space than they should. Thus, in a
tilted pie chart, neither the angle nor the area of the wedges accurately corresponds
to the data values in question. An already difficult chart to read has become highly
distorted. Consider settling for the more boring, but much more faithful, 2D
version (Few 2012).
The final problem with pie charts occurs when there are a large number of
categories (see figure 1 in the photospread), which can result in very small slices,
including ones that may even be invisible depending on the size of the chart. A large
number of categories can also lead to difficulties with color if the program you are
using assigns a new color to each wedge. If the program has to repeat the same color
multiple times, it becomes hard to match a wedge to the correct entry in the legend.
If wedges are labeled directly to solve the color problem, the many labels can add a
lot of visual complexity to the chart. If the data have a natural grouping, you could
combine wedges into larger categories or simply create an “other” category for
wedges under a certain size.
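If you do use a pie chart, the ordering advice above is easy to apply in most tools. The following minimal sketch uses Python's matplotlib with invented budget shares; it sorts the wedges in decreasing order, starts the largest wedge at the "12:00" position, and proceeds clockwise.

import matplotlib.pyplot as plt

# Invented example data: shares of a materials budget
labels = ["Databases", "Print books", "Ebooks", "Media", "Other"]
shares = [45, 25, 18, 8, 4]

# Sort wedges in decreasing order by size
pairs = sorted(zip(shares, labels), reverse=True)
shares, labels = zip(*pairs)

fig, ax = plt.subplots()
# startangle=90 puts the first wedge edge at "12:00"; counterclock=False draws clockwise
ax.pie(shares, labels=labels, startangle=90, counterclock=False)
plt.show()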
Line Chart
Line charts are more limited than bar charts and don’t necessarily match well with
proportional data, but they are very well suited to visualizing continuity and
temporal data. The main concerns involve data complexity or consistency.
Data. Line charts typically show change over time. As such, they often use one
date variable (for the horizontal axis) and one numerical variable (for the vertical
axis). If your software program does not require line charts to use dates, the
program might treat that axis as a series of ordered categories (like days of the week)
or just as another numerical variable (like treating years as integers). An extra
categorical variable can be used to split the line into multiple lines, one for each
category.
Strengths. The continuity of the line in a line chart is a great match for data that
change continually over time. Pie charts, as previously discussed, should be used only for separate categories that add up to a meaningful whole. A bar chart is a good
general chart and can be used either for parts of a whole or for quantities that
change over time, but the separations between bars do not reinforce the idea that
data are changing in a more fluid manner.
The line chart also offers practical features beyond a bar chart. First, the line chart
can be more elegant than a bar chart when you add a category variable. Imagine
looking at a data set over time with a grouped bar chart. For each time point, you
will have several bars, but they have to be placed next to each other. This will add
width to the chart and make it difficult for the eye to connect the bars that
represent a single category. The other option, a stacked bar chart, makes it very
difficult to read the exact value of each category because all but the bottom-most
category start at a value other than zero.
A line chart can fix both of these problems. Each data point is positioned from the
bottom of the y-axis;4 data points within a single category are clearly connected by
lines; and the data points from all categories line up vertically at each time point,
keeping the width of the chart manageable.
A final benefit of using a line chart is that the date measurements do not have to
be evenly spaced. Bars default to even spacing, which would distort how quickly
data are changing if, say, some dates are a single year apart while others are two or
three years apart. With a line chart, the horizontal position of the data points is (or
should be) precise, corresponding exactly to the date and accurately representing the
distance between measurements. (See figure 2 in the photospread.)
Weaknesses. The strengths of the line chart become its weaknesses when it comes
to data overlap. Unlike bar and pie charts, line charts have nothing that prevents the
lines from covering each other up. Typically line charts use color to differentiate the
lines, which can help when lines cross a lot, but this also makes it more difficult for
users to read the chart. Users end up needing to look back and forth between the
chart and the legend to identify the lines. Where possible, placing labels on the
chart, next to the lines, can improve readability and potentially even eliminate the
need to use color for identifying categories.
Depending on the software being used, you may still need to worry about the
problem of uneven spacing between date measurements. If your dates look more
like categories—for example, if you are using the names of months—you may have
to create the complete list of all possible months and then leave “null” values or
zeroes for the months that have no data. (For a good discussion about how to deal
with missing data in a line chart, see Kandel et al. [2011]).
Finally, if you use a simple style (i.e., a plain line) for your line chart, you may find
that it is hard to tell where the data points are. The measured data will be visible
only if the line bends one way or the other at that particular point. If the slope of
the line is relatively consistent, it is hard to tell how evenly spaced the data are.
Consider adding small dots on top of the line to clarify where the measurements
are, especially if they are unevenly distributed (Few 2012). It is typically not
necessary to use multiple shapes for these dots; a single shape (e.g., a circle) for all
lines will help maintain a consistent style and ensure that each data point is equally
visible.
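As a concrete illustration, the short Python/matplotlib sketch below uses invented, unevenly spaced yearly measurements. Because the horizontal values are real dates, the gap between 2012 and 2015 is drawn to scale, and the circle markers show exactly where the measurements fall.

import datetime as dt
import matplotlib.pyplot as plt

# Invented, unevenly spaced measurements (note the gap between 2012 and 2015)
dates = [dt.date(year, 1, 1) for year in (2010, 2011, 2012, 2015, 2016)]
transactions = [1200, 1350, 1700, 2600, 2900]

fig, ax = plt.subplots()
ax.plot(dates, transactions, marker="o")   # markers show where each measurement falls
ax.set_ylabel("Ebook transactions")
plt.tight_layout()
plt.show()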
Scatterplot
Scatterplots are some of the most precise and data-dense charts available for
visualization. Special patterns in scatterplots can reveal strong trends in a data set or
even suggest errors or outliers in the data (Yau 2013). Scatterplots are a great match
for our perceptual system, but they can fall short in terms of audience familiarity.
Data. Scatterplots are typically used to show the relationship between two
numerical variables. Each variable is assigned to an axis, and a point is placed in
space so that it lines up with the correct values on each axis. An additional
categorical or numerical variable can be added to change the color or shade of the
dots in the scatterplot. A slight variation on the scatterplot, often called a bubble
chart, will change the size (specifically, the area) of the dots according to yet another
numerical variable.
Strengths. Scatterplots offer a visual representation of the correlation of two
numerical variables. If there is a strong (linear)5 relationship between these variables,
the dots will form something like a line, angled either from top left to bottom right
(negative correlation) or from bottom left to top right (positive correlation).
Scatterplots are also useful for showing clusters of data points (i.e., data points that
are much closer to each other than they are to other data points) or outliers (i.e.,
individual data points that are very different from the rest of the data points) (Yau
2013).
A scatterplot also can be used as a general form to make other types of charts.
Take, for example, a data set where you want to see the distribution—or range of
values—for a particular variable. Maybe you have the number of pages each user
read of a particular ebook. You don’t necessarily need to know the name of each
user; you just want to see how the different users varied from each other, what the
lowest and highest values were, etc. Instead of making a bar chart, which could take
up a lot of space, you could use a scatterplot in which one axis is the number of
pages read and the other axis is a single value—basically, a fake integer representing
the name of the book. This is called a one-dimensional scatterplot (Cleveland
1994), or strip plot (Few 2012), and all of the dots representing the users will be
placed in a single vertical or horizontal line. This will help you see clearly whether
there is a pattern across these ebook readers or if they vary wildly, and the chart
itself will be very compact. The distribution of a variable can also be shown with a
histogram (see previous note), which may be a better option if you have a lot of data
points or if many of the data points lie on top of each other.
Note, however, that this simplified scatterplot technique would also work for
multiple books. You could assign a different fake number to each book, and the
data points for each book would then show up in parallel lines on the chart. More
generally, this is how you can use a categorical variable, like book title or genre, with
a scatterplot. This can even be used as a replacement for a bar chart. Each category
is assigned to a different integer, and each category has a single numerical value, or a
single dot. Because the dot is not visually powerful in the same way as a bar, this
type of chart—often called a dot plot—does not require a full numerical axis that
goes all the way to zero (Cleveland 1994, Few 2012). See figure 3 of the
photospread for examples of using scatterplots with different combinations of
variables.
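A one-dimensional scatterplot of this kind takes only a few lines in most tools. The sketch below, written in Python with matplotlib and invented pages-read values, gives every reader the same fake vertical value so that all of the dots fall along a single horizontal line.

import matplotlib.pyplot as plt

# Invented example: pages read by each user of one ebook
pages_read = [12, 30, 31, 45, 47, 52, 60, 95, 110, 250]

fig, ax = plt.subplots(figsize=(6, 1.5))
ax.scatter(pages_read, [1] * len(pages_read))   # same fake y-value for every reader
ax.set_yticks([])                               # the fake axis carries no information
ax.set_xlabel("Pages read")
plt.tight_layout()
plt.show()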
Weaknesses. One of the primary weaknesses of scatterplots is that they are less
common than bar charts, pie charts, and line charts. It does require some training to
be able to detect and make sense of trends in dot patterns, and the sheer number of
data points may also be overwhelming.
Likewise, the more data you have, the greater the chance that dots will cover each
other up. Unlike the line chart, where lines coming into and out of data points can
help users infer an obscured data point, scatterplot points can simply disappear
when drawn on top of each other. Some solutions involve using transparency to
show when multiple dots are stacked, using logarithmic axes to space dots out more
toward the lower ends of the data values, or even using aggregation to total up the
number of points that are in exactly the same spot (Cleveland 1994). This gives you
an additional variable that you can encode as either color or size to make a bubble
chart.
Bubble charts, though, have an additional weakness. Like a pie chart, a bubble
chart requires users to compare the areas of different shapes to understand trends in
the data. Comparing the areas of circles has been found to be extremely difficult
(Cleveland 1994), and bubble size is thus not a good match for data that need to be
interpreted precisely. If you have three numerical variables and you would like to
show them all on the same chart, it is best to use the axes for the two that are most
important or that require the most precision.
Heat Map
Heat maps are much less common than the other types of charts, but they can be
extremely useful for representing a large amount of data. They are more appropriate
for general trends than for precise data lookup.
Data. Heat maps show data in rows and columns, much like a standard table. The
difference between a table and a heat map is that every number in a table cell is
converted to a color. The grid of a heat map can be created with categorical or
numerical variables. For categorical variables, each category becomes a separate row
(or column). For numerical variables that are easily separated, such as integers, the
values can be treated just like categories. For continuous numerical variables,
though, it is typical to create number ranges (see previous note on histograms).
Within the grid, a third numerical variable (often the number of data points
associated with that particular row and column) is encoded as the color of the cell.
Strengths. While not a common data visualization, heat maps fill a gap in the
ability of visualizations to highlight the relationships between two categorical
variables. Grouped or stacked bar charts can also visualize two categorical variables
along with a numerical variable, but both make it difficult to see interesting
interactions between the categories, and both prioritize one variable over the other
(see figure 4 of the photospread). Laying two categorical variables out on a grid
gives both of those variables equal focus. Finally, heat maps are also very space
efficient; they can compress an amazing amount of data into a small area (Munzner
2014).
Weaknesses. The primary weakness of a heat map is that the numerical variable
cannot be read precisely. The color encoding may give a sense of what categories are
consistently high or low or where the interaction between two categories results in a
high point or a low point. Like area encoding, however, numerical color encoding does not allow a reader to match a color precisely to a data value. Humans are simply not
proficient at detecting small changes in the brightness or saturation of a color and
matching the changes to a data range.
Another weakness of heat maps is that they are relatively uncommon. While they
can be produced even in common spreadsheet programs, they do not appear as an
official chart type. This makes them harder to produce than other chart types, and
the lack of popularity also means that users will be less familiar with them.
Finally, heat maps are highly dependent on the order of the rows and columns to
reveal similarities between different categories within the same variable. If the
categories do not have a natural ordering—say, if they are institution names instead
of months of the year—you may wish to inspect the data to see whether some of the
categories should be placed next to each other (Yau 2013).
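Even without an official heat map chart type, a grid of color-encoded values can be produced with general-purpose tools. The following Python/matplotlib sketch converts an invented table of checkouts by branch and month into a colored grid with a color legend.

import matplotlib.pyplot as plt

# Invented example: checkouts by branch (rows) and month (columns)
branches = ["Main", "East", "West"]
months = ["Jan", "Feb", "Mar", "Apr"]
checkouts = [[320, 280, 450, 500],
             [120, 150, 170, 160],
             [210, 190, 260, 300]]

fig, ax = plt.subplots()
image = ax.imshow(checkouts)                    # each cell's value becomes a color
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(branches)))
ax.set_yticklabels(branches)
fig.colorbar(image, ax=ax, label="Checkouts")   # legend for the color encoding
plt.tight_layout()
plt.show()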
MAPS
Data often include a spatial component. Maps can be a very powerful way to engage
an audience and to highlight patterns in data that relate strongly to location.
Sometimes, however, location data are a red herring. Sometimes the patterns in
your data are not inherently spatial (Wong 2010). Even if location is important,
sometimes a map is too constraining to show the data well.
For example, you may have data on where website traffic is coming from (see
figure 2.6). In all likelihood, most of the traffic is coming from the country in
which your university is located. Maybe a few other countries are represented, but
there may be no spatial pattern to where those other requests are coming from.
There may be a small number of those external requests, and they might be
scattered across the rest of the globe, hard to see among the many countries without
data and the large oceans. That would be a lot of space devoted to showing a very
small amount of data, and unless location turns out to be important somehow (e.g.,
requests are coming only from places in a particular time zone), there may be a
better visualization type than a map. Remember—position is everything in data
visualization. When you use a map, you lose the ability to encode a variable (or
two!) by position.
Figure 2.6. Two symbol maps focusing on locations.
If spatial patterns are important, however, maps can often immediately pull users
in by helping them “find themselves” in the data. Different types of maps allow for
different types of data analysis. This chapter will focus on the two most common
types of maps for data presentation: choropleth maps and (proportional) symbol
maps.
Choropleth Map
Choropleth maps are a basic and common map type. They fill geographic regions (e.g., countries and states) with colors that represent data. They are easy to create with a variety of tools and easy to read. While there are some problems to be aware of, knowing how to use these maps well can lead to very effective presentations of spatial data.
Data. Choropleth maps typically take two variables—the name of the region of
interest and a numerical or categorical variable that gets converted into a color.
When using a numerical variable with a choropleth map, cartographers often use
number ranges (see the note on histograms) instead of a continuous color scale.
This is a way to adjust for unevenness in the numerical variable. For example, if
your data values are very skewed and you have a lot of very low values, you might
want to limit how many of those low values show up as a very light color. You can
set up number ranges that start out very narrow and gradually get larger, grouping
together enough data points in the upper ranges that the map shows a good amount
of differentiation.
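The classing idea can be sketched without any mapping software at all. The short Python example below uses invented, skewed usage counts and hypothetical break points; each region is assigned to a number range whose width grows toward the upper end, so the many low values do not all collapse into the lightest color.

# Invented, skewed usage counts per region and hypothetical class breaks
usage_by_region = {"County A": 3, "County B": 8, "County C": 25,
                   "County D": 90, "County E": 600}

# Ranges start narrow and widen toward the top; each range maps to one map color
breaks = [0, 5, 10, 50, 100]
labels = ["0-4", "5-9", "10-49", "50-99", "100 and up"]

def classify(value):
    # Return the label of the number range that contains value
    for upper, label in zip(breaks[1:], labels):
        if value < upper:
            return label
    return labels[-1]

for region, count in usage_by_region.items():
    print(region, "->", classify(count))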
Strengths. Choropleths are especially good for making sure that data points do
not overlap each other. Map regions can get very complicated, but in most cases it is
easy to draw boundaries and keep data separate from one region to another.
Choropleths can also take advantage of audience familiarity with particular region
shapes; the shape of a county, state, or country can sometimes stand on its own,
without the need for a label, which could add visual complexity.
Weaknesses. The major weakness of a choropleth map is that sometimes large
regions take up a huge amount of visual space but don’t necessarily warrant that
much attention. On the iconic “red state/blue state” maps that emerge around
election times in the United States, large but sparsely populated states such as
Montana end up making the country look like a lot of the population leans “red.”
In truth, the densely populated areas such as large cities take up very little visual
space but can have a much larger effect on the results of an election.
The other major weakness of a choropleth map is that the only variables are a
location variable and a single extra variable. Being wedded to both the accurate
position and the accurate size of regions means that both of these encodings are
unavailable. Special maps called “cartograms” play with this notion by changing the
size of the regions while trying to keep them looking approximately the same as the
true regions. For example, Montana might shrink down in proportion to the
number of Electoral College votes it has, while New York and California would
swell quite a bit. The distortion can make these difficult to read, depending on how
different the region ends up looking, and it is also hard to find software that will
create cartograms.
Symbol Map
Symbol maps use a standard map in the background, but the data are encoded in
circles that are placed on top of the correct locations. These maps are becoming
more common, but they fall prey to some of the same problems experienced by
scatterplots and bubble charts.
Active Titling
One safe rule of thumb is to make sure that every axis (or visual encoding) is
labeled. Sometimes chart designers will use the title of the chart to describe the
variables in the chart and leave the axes unlabeled. If you label the axes directly,
however, it actually becomes unnecessary to duplicate this information in the chart
title. Instead, the chart title becomes an opportunity to help users understand the
chart better.
Consider using a practice called “active titling.” With active titling, the standard
boring chart title transforms into an active statement of the importance of the chart.
Instead of "Change in ebook transactions over time," you can make a concrete statement about how the data changed over time—for example, "Ebook transactions have been doubling each year."
Annotations
Annotations, or special informative text or reference points that have been added to
visualizations, take the principle of active titling one step further. Annotations can
offer users additional contextual information, focus attention on a single data point
or specific trend, or add an additional reference against which to evaluate the data.
Examples include paragraphs of text placed on or near the chart, special labels for
points, additional lines or boxes added to areas of graphs, or even a “callout” box
that zooms into or out of the data in the chart. While it’s important not to
overwhelm the reader or create too much visual complexity, it’s also important to
make sure the chart is understandable, especially if it is being designed for a broad
audience (Yau 2013).
Different publication venues may limit the use of annotations, however. For
example, a paragraph of explanatory text may not be appropriate for a PowerPoint
presentation. Instead, consider creating several versions of the chart, each focusing
on a small aspect or adding one small annotation at a time. On the other hand,
web-based visualizations may offer wonderful opportunities for “on-demand”
annotations. Perhaps annotations can appear when the mouse hovers over a data
point and go away again when they are no longer needed, thus reducing visual
complexity.
Most programs that create visualizations will allow you to customize the text on
the charts or maps. If not, consider removing all of the text from the chart and
adding it back in later with a graphic design program or even a simple drawing or
presentation program. The same is true for most annotations; even if you are adding
a reference line, it may be easier to figure out how to add it in with a drawing
program than to figure out how to get a visualization program to do it. Just make
sure any manual edits look professional and match the style of the original chart.
STORYTELLING WITH DATA
These days, it isn’t quite enough to produce a well-designed and accurate
visualization. The focus is increasingly on storytelling with data, or on creating
some kind of narrative arc that ties multiple visualizations together into a clear and
compelling progression. This can happen sequentially over time, as in a digital
slideshow, or by arranging the visualizations in space, as with an infographic, online
dashboard, or poster.
Narrative Design
Designing a narrative with separate visualizations is much like studying how a user
explores data with a single visualization. For many years, the field of data
visualization has subscribed to Ben Shneiderman’s “visual information seeking
mantra” (1996), which states that the following information-seeking tasks should be
prioritized in the design of interactive visualization systems: “overview first, zoom
and filter, then details-on-demand.” Several additional tasks, including a “relate”
task defined as “viewing relationships among items” (p. 2), are listed in the paper
but not included in the mantra.
Applying these tasks to a narrative, it makes sense to start with an overview of the
data you want the user to understand. Where in an interactive visualization you
could literally zoom into parts of the visualization, in a narrative you would create a
separate visualization that is already zoomed or filtered—that is, limited in
numerical ranges or categories or time periods. Perhaps to show this different level
of detail, you will also want to change the chart type. For details on demand, you
may be able to generate a series of small, highly focused charts that allow users to
focus on the details that are most interesting to them. Finally, to summarize what
the data suggest, using the additional “relate” task to build back to an interpretation
of the data will help users understand why you are drawing certain conclusions from
the data.
There are other ways to structure narratives across multiple visualizations. Few
(2012) recommends designing visual communication by following this
organizational process: “1. Group (i.e., segment information into meaningful
sections), 2. Prioritize (i.e., rank information by importance), 3. Sequence (i.e.,
provide direction for the order in which information should be read)” (p. 144).
Going through this process is especially helpful when determining the spatial layout
of your visualizations, which will be addressed in the next section.
Finally, a recent study by Hullman et al. (2013) presents a survey of transition
types found within a sample of narrative visualizations, as well as visualization blogs
and repositories, from the field of journalism. The study found six transition types
(presented in decreasing order of frequency): temporal, granularity, comparison,
causal, spatial, and dialogue. Some kind of temporal transition, or moving from one
time period to another, appears in almost 90 percent of the examined narratives.
Transitions in granularity (i.e., from general to specific or vice versa) closely mirror
the “overview first, zoom and filter” components of the visual information-seeking
mantra (Shneiderman 1996). The final transitions—those of comparison, causality,
spatial proximity, and dialogue—each extend the normal approach to designing a
user experience with a visualization by exploring the different types of connections
that can occur between different visualizations. When considering how to organize
visualizations into a larger narrative, it may be necessary to move back and forth
between general questions of organization (“What are my priorities?”) and specific
questions of arrangement (“What transition makes the most sense here?”).
Arrangement of Visuals
After the content of the visualizations has been selected and the visualizations have
been generated, it’s important to lay the visualizations out in an appealing and
understandable arrangement. Especially for posters, infographics, or dashboards, the
size and arrangement of visualizations have a huge effect on how users read the
display.
The arrangement of panels in a graphic novel or comic book offers a great example
(McCloud 1993). The size and arrangement of the panels must guide the reader’s
eye so that he or she follows the narrative in the correct order. Cultural differences
can also play a part here: Is your audience likely to start in the upper left corner or
the upper right corner? Is the primary reading direction horizontal or vertical? When
you have a diverse audience, you will want to use visual elements to help guide the
eye and encourage users to follow the narrative in the way or ways that will make
the most sense. Typically, in Western cultures, the most important elements are
placed at the top left, and additional elements follow left to right first and then top
to bottom. Consider using size as well as position to emphasize which elements are
most important.
Gestalt principles explain some of the native tendencies of humans when they are
interpreting visual elements (Ware 2008). For example, the Gestalt principle of
enclosure says that visual elements that are all contained within an enclosing shape
will be seen as more similar to each other than to elements outside of that shape. To
apply this principle, any elements contained within a box or circle will be
interpreted together, separately from others. Another Gestalt principle, the principle
of proximity, states that objects that are close together will seem more similar than
those that are far apart. This offers another practical solution: You can simply use whitespace in your arrangement to guide the eye between elements.
Make the spacing tight between elements that should be interpreted as a single
group and then increase the space between that group and any other groups (Few
2012).
Finally, the arrangement of elements in your layout should take into consideration
graphic design principles such as the rule of thirds and the golden ratio (see figure
2.9). When arranging elements in visual space, having perfectly centered and
symmetrical arrangements is not always considered the most aesthetically pleasing.
The rule of thirds suggests that the focal point of an arrangement should be
determined by dividing a space into thirds, both horizontally and vertically, and
then placing visual elements along or at the intersections of those divisions. The
elements of interest in the display thus get placed slightly off center but not all the
way at the edge of the visual field.
The golden ratio is used to create rectangles where the long end is about 1.618
times the length of the short end, or almost two thirds longer. The exact ratio is
perhaps less important than the idea that nestling golden ratio rectangles within
other golden ratio rectangles creates another type of arrangement that keeps
elements slightly off center and encourages changes in size, which in turn
encourages users to focus their attention differently on different elements.
Combining the golden ratio and the rule of thirds with more basic grids can
generate very visually appealing narrative visualizations.
Design Consistency
In any situation where you are combining multiple visual elements, however, it is
important to use graphic design to ensure consistency across all elements. This will produce a more professional result, but it will also aid comprehension by allowing
users to learn design conventions once and apply them successfully to multiple
visualizations (Wainer 2008).
Design consistency extends from using the same or compatible fonts across all text
and visual elements to the use of the same color scheme (or same specific colors)
from one chart to the next. A special case of design consistency arises when you
create multiple visualizations from the same variables. Small multiples, also known
as trellis plots or lattice plots, are created when a complex chart such as a line chart
with too many lines is split up into a series of nearly identical charts, each showing
one category. In this situation, it is essential that not only the data colors but also the axis ranges, grid lines, and other reference marks be kept consistent (Cleveland 1994). To make accurate judgments about how the data differ from one
category to another, the reference systems for each chart have to be identical.
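Most plotting libraries can enforce a shared reference system for small multiples automatically. In the minimal Python/matplotlib sketch below (with invented data for three categories), the panels share both axes, so every line is judged against identical ranges and grid lines.

import matplotlib.pyplot as plt

# Invented example: yearly counts for three categories
years = [2013, 2014, 2015, 2016]
series = {"Ebooks": [200, 400, 800, 1600],
          "Print": [900, 850, 820, 780],
          "Media": [100, 150, 140, 180]}

# sharex/sharey keep the reference system identical across every panel
fig, axes = plt.subplots(1, len(series), sharex=True, sharey=True, figsize=(9, 3))
for ax, (name, values) in zip(axes, series.items()):
    ax.plot(years, values, marker="o")
    ax.set_title(name)
    ax.grid(True)                      # identical grid lines in each panel
plt.tight_layout()
plt.show()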
CONCLUSION
The considerations presented in this chapter will hopefully provide a foundation for
developing visualizations that can appeal to a broad public. The process, however, is
likely to be much less linear than the information here suggests. You may find while
building a narrative that a new chart is necessary for a particular transition. You may
realize that you have several different charts all using color in different ways and
decide to try to re-encode at least one of those variables in some other type of
graphic element. Specialized visualization software such as Tableau
(http://www.tableau.com) can be especially helpful during planning phases because
of the ease with which additional charts can be created. The best advice for creating
great visualizations is often to create as many draft visualizations as possible and to
seek out feedback on your ideas. With practice, you will develop your own style and
intuitions, and your visualizations will become increasingly clear and engaging.
NOTES
1. Some authors make distinctions between charts and graphs. For example, Börner (2015) uses “chart” for
visualizations that lack a formal reference system (e.g., word clouds). Both Börner (2015) and Few (2012)
reserve “graph” for visualizations that do use a reference system—typically a Cartesian or polar reference system.
In many software applications, however, these terms are used interchangeably, and this chapter will follow that
convention.
2. Sometimes bar charts are incorrectly referred to as “histograms.” A true histogram visualizes only a single
numerical variable. The “bars” in a histogram are actually numerical ranges (e.g., 0 to 4.99, 5 to 9.99), dividing
the values of the variable into a series of chunks or “bins.” The number of data points that fall within the range
becomes the length of the bar for that bin. You can usually tell the difference between a histogram and a bar
chart by looking for gaps between the bars. In a histogram, the lower value of one bin starts right where the
upper value of the previous bin leaves off so that the bars in a histogram usually do not have any gap between
them.
3. Wong (2010), however, recommends placing the largest wedge to the right of the 12:00 position and then
starting the second largest wedge to the left of 12:00, continuing counterclockwise after that. This places the
two most important wedges at the top of the chart, instead of placing the most and the least important wedges
at the top. If wedge sizes are similar, however, it may not be obvious to readers that this nontraditional ordering
is being used.
4. There is some debate about whether the y-axis of a line chart should start at zero. Because there is no bar
representing the number as a continuous visual area, it may be appropriate to limit the y-axis to the range of
interest, effectively “zooming in” on the line chart (Cleveland 1994). This can be especially useful for data series
where small variations are important or significant.
5. Two numerical variables could be related to each other nonlinearly, as well. Scatterplots can also show this,
but broad audiences are less likely to understand nonlinear relationships.
6. This actually may differ between maps and charts, where the data are less densely positioned. In a
choropleth map, for example, the fact that the background behind the map is white may not matter if you are
filling in every state with a color. Each state will be surrounded by other states filled with some kind of color. If
a state filled with a light color is surrounded by states filled with dark colors, that state may actually draw more
attention instead of less attention. You may sometimes want to choose a darker center color—say, a medium
gray—so it doesn’t pop so much in context.
REFERENCES
Borland, David, and Russell M. Taylor. 2007. “Rainbow Color Map (Still) Considered Harmful.” IEEE
Computer Graphics and Applications 27(2): 14–17.
Börner, Katy. 2015. Atlas of Knowledge: Anyone Can Map. Cambridge, MA: MIT Press.
Byrne, Michael D. 2002. “Reading Vertical Text: Rotated vs. Marquee.” Proceedings of the Human Factors
and Ergonomics Society 46th Annual Meeting, Santa Monica, CA.
Cleveland, William S. 1994. The Elements of Graphing Data. Murray Hill, NJ: AT&T Bell Laboratories.
Cleveland, William S., and Robert McGill. 1985. “Graphical Perception and Graphical Methods for Analyzing
Scientific Data.” Science 229(4716): 828–833.
Few, Stephen. 2012. Show Me the Numbers: Designing Tables and Graphs to Enlighten. 2nd ed. Burlingame, CA:
Analytics Press.
Gilmartin, Patricia, and Elisabeth Shelton. 1990. “Choropleth Maps on High Resolution CRTs: The Effects of
Number of Classes and Hue on Communication.” Cartographica 26(2): 40–52.
Healey, Christopher G. 1996. “Choosing Effective Colors for Data Visualization.” Proceedings of the 7th
Conference on Visualization ’96 (VIS ’96), San Francisco, CA.
Healey, Christopher G., and James T. Enns. 2012. “Attention and Visual Memory in Visualization and
Computer Graphics.” IEEE Transactions on Visualization and Computer Graphics 18(7): 1170–1188.
Heer, Jeffrey, and Michael Bostock. 2010. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10), Atlanta, GA. New York: ACM.
Hullman, Jessica, Steven Drucker, Nathalie Henry Riche, Bongshin Lee, Danyel Fisher, and Eytan Adar. 2013.
“A Deeper Understanding of Sequence in Narrative Visualization.” IEEE Transactions on Visualization and
Computer Graphics 19(12): 2406–2415.
Kandel, Sean, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris
Weaver, Bongshin Lee, Dominique Brodbeck, and Paolo Buono. 2011. “Research Directions in Data
Wrangling: Visualizations and Transformations for Usable and Credible Data.” Information Visualization
10(4): 271–288.
Lin, Sharon, Julie Fortuna, Chinmay Kulkarni, Maureen Stone, and Jeffrey Heer. 2013. “Selecting
Semantically-Resonant Colors for Data Visualization.” Proceedings of the Eurographics Conference on
Visualization (EuroVis) 2013, Leipzig, Germany.
McCloud, Scott. 1993. Understanding Comics: The Invisible Art. New York: HarperPerennial.
McClure, Leslie, Mohammad Al-Hamdan, William Crosson, Sigrid Economou, Maurice Estes Jr., Sue Estes,
Mark Puckett, and Dale Quattrochi Jr. 2013. North America Land Data Assimilation System (NLDAS) Daily
Air Temperatures and Heat Index (1979-2011). CDC WONDER Online Database.
Munzner, Tamara. 2014. Visualization Analysis & Design. Boca Raton, FL: CRC Press.
Shneiderman, Ben. 1996. “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations.”
Proceedings of IEEE Symposium on Visual Languages, Boulder, CO.
Skau, Drew. 2011. “2D’s Company, 3D’s a Crowd.” Visual.ly blog. http://blog.visual.ly/2ds-company-3ds-a-
crowd/.
Stefaner, Moritz. 2012. Data Stories Episode #5: How to Learn Data Visualization (with Andy Kirk). In Data
Stories, edited by Enrico Bertini and Moritz Stefaner.
Wainer, Howard. 2008. “Improving Graphic Displays by Controlling Creativity.” Chance 21(2): 46–53.
Ware, Colin. 2008. Visual Thinking for Design. Burlington, MA: Morgan Kaufmann Publishers.
Ware, Colin. 2013. Information Visualization: Perception for Design. 3rd ed. Waltham, MA: Morgan Kaufmann
Publishers.
Wong, Bang. 2011. “Points of View: Salience to Relevance.” Nature Methods 8(11): 889.
Wong, Dona M. 2010. The Wall Street Journal Guide to Information Graphics: The Dos & Don’ts of Presenting
Data, Facts, and Figures. New York: W. W. Norton.
Yau, Nathan. 2013. Data Points: Visualization That Means Something. Indianapolis: Wiley.
3
Tools and Technologies
Visualizing Research Activity in the Discovery
Layer in Real Time
Godmar Back and Annette Bailey
During the past decade, the role of academic and public libraries alike has been
steadily shifting. The vast majority of circulation today is electronic, and thus many
libraries have reduced the amount of physical space dedicated to print resources.
Physical library space is increasingly used to provide collaborative space for students
and faculty. Libraries are repurposing themselves to create information commons
(Bailey and Tierney 2008), which are physical and online spaces dedicated to
information and knowledge accumulation and discovery. Small- and large-scale wall
displays for instruction or exhibition purposes are becoming ubiquitous. Digital
signage has emerged as a managed aspect of public relations in many libraries,
serving to share upcoming events and to offer another venue for library displays.
Libraries have been replacing their traditional, outdated online public access
catalogs (OPACs) with modern discovery systems, which are full-text, index-based
search engines that match the convenience and ease of use of general-purpose search
engines, yet provide users with access to scholarly materials selected by their library.
The resources users click on in their search results in a discovery interface represent a transaction similar to the traditional checkout of books and other resources in the OPAC. Examples of library discovery systems currently available in
the library marketplace include ProQuest’s Summon, Ex Libris’s Primo, and
EBSCO’s EBSCO Discovery Service (EDS).
These changes are exciting for the new opportunities they offer, but they also pose
a number of risks to the future relevance of libraries. For instance, if interaction
with the library becomes entirely private, the communal character of the
transaction, once so common at the circulation desk, is forfeited. Consequently,
library resources come to be viewed as commodities, and the library is no longer seen as their selector and provider. This effect leads some to doubt the library’s relevance,
particularly in the face of declining print circulation, and makes it difficult for
libraries to justify their collection budgets.
Notwithstanding the emergent use of such standards as COUNTER (Counting Online Usage of Networked Electronic Resources) and the desire of most libraries to
carefully compile, process, and publish usage statistics, such statistics are often
delayed, incomplete, or used only for internal purposes. No major discovery system
provides real-time statistics on its use, and those systems that do collect search statistics place this information in a silo that makes it difficult for libraries to analyze; cross-institutional analyses are nearly impossible.
This chapter describes the LibFX technology developed by the authors, which
collects and processes real-time usage data from a library’s discovery system
(in our case, ProQuest’s Summon) and creates compelling real-
time visualizations. We gather the interactions of users (while protecting their
privacy) with their library’s discovery system, collect this information in real time
on a web-based server, perform real-time processing, and broadcast the processed
data to data visualization applications. These applications can drive small or large
public displays with which users can interact, and they can also be integrated into
existing or new web offerings. We chose the name LibFX for our system to
emphasize its ability to visualize the data in an engaging manner.
RELATED WORK
Hilary Davis wrote an influential blog article in 2009 for In the Library with the
Lead Pipe about why visualization of library data matters to libraries (Davis 2009).
According to Davis, libraries have many sources of data, and the challenge in
communicating that data to outside stakeholders has become increasingly complex.
Visualizations can provide a compelling way to communicate with stakeholders.
Several projects have used visualizations for library data, some of them to gain a
deeper understanding to aid in collection management and others to collaborate
with artists to display library data to the public.
The Harvard Library Explorer project uses a variety of visualizations in a single
window to demonstrate aggregate data about the circulation of the library’s
collections over a fixed period of time (Harvard 2012). Using the visualizations, the
user can drill down to more specific data about how many times books under a
Library of Congress Subject Heading circulated during each year. Color gradients
are used to show “popularity” so that one can see at a glance the level of interest in
particular subjects. Another graph demonstrates the number of books published in a
given year on a subject. This display effectively visualizes a cross-section of data that
inform both librarians and users about past publication and circulation activities.
The Seattle Central Library installed large-scale displays in its circulation and other
public areas that display aggregate data about how many transactions have transpired
over a period of time, along with other information (Legrady n.d.). Aptly named
“Making the Invisible Visible,” the library’s data visualizations are known for being
a key part of its community members’ experience in the physical space of their
library. The installation allows the user community that creates the data to see these data, which are typically seen only by librarians.
The Indianapolis Museum of Art (IMA) uses a dashboard to communicate with its
community members about their interactions with it and its art collection (IMA
2010). Using a tiled display, IMA tells visitors at a glance how many Facebook
friends the museum currently has and other real-time data, such as the current
attendance at its physical museum space. These real-time data enrich the
museumgoers’ experience by sharing with them the community’s current activities.
The Filament Mind project (Lee and Brush 2013) in the Teton County Library is
a striking art installation that taps in real time into every library catalog search in
the state of Wyoming, using 1,000 light filaments, each corresponding to one of 1,000 Dewey Decimal classifications. The filaments are lit based on
the Dewey Decimal classification of the search results users see. The choice of
physical strands of fiber creates an experience that powerfully unites the “real” with
the “virtual” world.
To the best of our knowledge, ours is the only project that focuses on the real-time visualization of discovery system activity.
ARCHITECTURE
Extracting real-time data and creating visualizations using LibFX involves multiple
steps, which are depicted in figure 3.1. Users interact with a modified front end
when they use Summon for discovery. A number of server components record a
user’s interaction with Summon results and then log the results. A log watcher
component processes the metadata associated with each user event and prepares
them for broadcasting by the data server to the web-based visualizations. We discuss
each of these components in turn.
Front-End Modifications
Although Summon provides usage statistics via the back-end configuration interface
(known as the Client Center), they are aggregate only and not retrievable in real
time. Thus, to record which Summon results our users clicked on, we needed to
modify the Summon search page. Such modifications are possible because Summon
allows its clients to specify a URL pointing to a JavaScript file that is loaded into the
page when search results are displayed. This feature is both powerful and
rudimentary: powerful because it gives clients full control over the web page
including the ability to add, remove, or change any content displayed and to
intercept and record any user interactions with the page, but also rudimentary
because there is no vendor support for any of these actions—the vendor code is
undocumented, provides no public APIs at this level, and is subject to change
without notice!
Since we possessed a significant amount of JavaScript experience from past
projects, we reverse engineered the user interface and identified the elements we
needed to monitor. When Summon upgraded from 1.0 to 2.0, the entire front-end
implementation changed, and we had to start from scratch. Reverse engineering
Summon 2.0 was significantly more difficult than modifying Summon 1.0 because
it uses higher-level libraries (e.g., AngularJS [Hevery 2009]) with significantly
higher levels of abstraction. We ended up publishing the technical details of how we
modified Summon 2.0 in a separate paper (Back and Bailey 2014).
Briefly put, our modifications identify which search results are displayed on the
results page. For each result, we retrieve the unique Summon ID of the associated
displayed item, which is embedded in information to which our JavaScript code has
access. We install an on-click event handler for each hyperlink a user may click on
when accessing the displayed link to the item, and there are multiple hyperlinks for
each item. When users click, we send the IDs of the items on which they clicked to
our server, which then logs them. Similar to what Google Analytics does, we exploit
a request to a 1 × 1 GIF image for this purpose and send along the ID in the
request’s query parameters. We use a standard Apache server installation to serve
these images, but we configured it such that requests for those images are logged in
a separate log file. Dedicating a separate log file to these requests allows the subsequent tools in our chain to focus on only this log file, eliminating the need to extract these requests from a shared log file and keeping that shared file uncluttered.
Metadata Collection
The click logging results in a timestamped log of (time, Summon ID) pairs, each
referencing an event in which a user accessed a Summon result. For the purposes of
analysis and visualization, we need the metadata associated with each Summon ID.
The Summon API (Serials Solutions 2009) allows us to retrieve this metadata using
the ID. An edited excerpt of a typical record is shown below using JSON (Bray
2014) notation:
{
  "Abstract": [ "Data-Driven Documents (D3) is a novel representation-transparent..." ],
  "Author": [ "Heer, J", "Ogievetsky, V", "Bostock, M" ],
  "AuthorAffiliation": [ "Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA" ],
  "ContentType": [ "Journal Article" ],
  "Copyright": [ "1995-2012 IEEE", "2010 IEEE" ],
  "DOI": [ "10.1109/TVCG.2011.185" ],
  "DatabaseTitle": [ "PubMed", "CrossRef" ],
  "Discipline": [ "Engineering" ],
  "EISSN": [ "1941-0506" ],
  "EndPage": [ "2309" ],
  "Genre": [ "orig-research", "Journal Article" ],
  "ID": [ "FETCH-LOGICAL-c900-3917ebf7400b0992ef397b577022eff784f41b08302ff0c446ef0440fe8ac9a33" ],
  . . .
}
As of this writing, the Summon index contains well beyond one billion entries,
many of which were created by automated metadata processing algorithms. A
general lack of authority control resulted in records that are very heterogeneous,
both in terms of the types of fields they contain and their content. This
inconsistency constitutes a challenge for our data processing and visualization tools.
In implementing the processing of these metadata records we encountered the
problem that Summon IDs are short-lived. As new records are being added to the
Summon index and merged with records already in the index, new IDs are created
for the new, merged records while the old IDs are removed and can no longer be
retrieved. To circumvent this problem, and also to avoid frequent accesses to the
Summon API, we retrieve and keep a copy of each record in a local SQLite database
(Hipp 2015). The database contains a single table that maps an ID to a JSON
encoding of the record’s data retrieved via the API. For our institution, this method
resulted in the retrieval of 1.3 million unique records stemming from about 1.7 million
clicks during the 30.5-month period from mid-January 2013 to August 2015,
during which time the SQLite database grew to 9.7GB in size. Recently, ProQuest
introduced a PermaLink feature, which may alleviate the need to keep a copy of the
metadata.
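A cache of this kind can be sketched in a few lines of Python. In the illustrative sketch below, fetch_from_summon_api is a hypothetical placeholder for the authenticated Summon API lookup, and the database and table names are invented; only the general approach (an ID column paired with a JSON-encoded record) follows the description above.

import json
import sqlite3

def fetch_from_summon_api(summon_id):
    # Hypothetical placeholder for the authenticated Summon API lookup.
    raise NotImplementedError

def get_metadata(db, summon_id):
    # Return the metadata for summon_id, fetching and caching it on a miss.
    row = db.execute("SELECT record FROM records WHERE id = ?", (summon_id,)).fetchone()
    if row is not None:
        return json.loads(row[0])      # already cached locally
    record = fetch_from_summon_api(summon_id)
    db.execute("INSERT INTO records (id, record) VALUES (?, ?)",
               (summon_id, json.dumps(record)))
    db.commit()
    return record

db = sqlite3.connect("libfx_metadata.db")   # invented file name
db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, record TEXT)")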
From a software engineering perspective, we decoupled the click logging and
metadata collection and analysis by performing all log-related activities in a separate,
continuously running Python program, which we refer to as log watcher. The log
watcher is suspended while there is no activity. When a user clicks on a result, an
entry is added to the log file; this wakes up the log watcher via the Linux inotify interface (McCutchan 2006), and the watcher then reads the added line from the log file.
After extracting the ID, the log watcher will check whether the metadata associated
with that ID are already contained within our local database. If they are not, they
are fetched from the Summon API and added to the database.
The log watcher program also needs to be able to interact with standard log
rotation, which is controlled by a system program and occurs at the end of each
month. To prevent log files from growing too large, the existing log file is renamed
and a new log file is started. We use inotify to detect this situation and continue our
data collection seamlessly with the new log file.
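The core loop of such a watcher can be approximated with nothing but the Python standard library. The sketch below is illustrative rather than the authors' implementation: it simply polls the log file for new lines instead of using inotify, process_click is a hypothetical stand-in for the ID extraction and metadata lookup, and log rotation is not handled.

import time

def process_click(log_line):
    # Hypothetical stand-in for extracting the Summon ID and updating the cache.
    print("new click:", log_line.strip())

def watch_log(path):
    # Follow a log file and process each newly appended line (polling, not inotify).
    with open(path) as log:
        log.seek(0, 2)                 # start at the current end of the file
        while True:
            line = log.readline()
            if line:
                process_click(line)
            else:
                time.sleep(0.5)        # no new entries yet; wait briefly

# watch_log("/var/log/apache2/libfx_clicks.log")   # invented log path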
Metadata Aggregation
To be meaningful, most of our visualizations require processing and aggregation of
the metadata contained in multiple records. In our architecture, this processing
takes place within the log watcher process, which maintains a number of tabulators that count frequencies related to different metadata fields in the timestamped stream of click events.
To provide greater flexibility to our visualizations, we simultaneously maintain
several tabulators. Each tabulator is responsible for one or more metadata fields,
whose contents are aggregated over some past time frame. Tabulators can count the
frequencies of either entire fields or a specified combination of fields. In addition,
they may first split a field’s content into words, whose frequency is then counted.
We tabulate the discipline, source type, content type, publication year, keyword,
and subject terms fields. We split and then tabulate the title and abstract fields. In
addition, we tabulate the combination of keyword and subject terms and the
combination of abstract and title.
Since our visualization is expected to run unattended during times of both heavy
and light use of the Summon discovery system, the choice of time frame over which
to aggregate proved challenging. At first, we considered using time periods only
(e.g., the last minute, five minutes, hour, day, and week) because these periods
provide observers with a time frame to which they can easily relate. However, this
approach does not work well during periods of light use (e.g., when there are only a
few clicks per minute) because the resulting aggregate frequencies can be zero or
near zero. For this reason, we also added tabulators for event periods, whose length
is determined by a given number of events, regardless of when they occurred. We
keep track of aggregations over the last 50, 100, 200, 500, 1,000, and 2,000 events.
Tabulators process new events and update their aggregations incrementally; each
update involves two steps. First, the metadata record associated with the new event
is retrieved, and the field contents are extracted and added with their respective
frequencies. Then the tabulator examines whether any events need to be removed
from the aggregation because they have expired. Event period tabulators expire a
single old event for each new event (e.g., an event period tabulator with a period
of 200 events retrieves and expires the event that occurred 200 events earlier).
Time period–based tabulators remove all events whose time stamps are older than
the time period they are aggregating.
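The following Python sketch illustrates an event period tabulator of this kind; it assumes each event carries a dictionary of metadata fields and is a simplified illustration rather than our production code, with time-based expiry and word splitting left out.
from collections import Counter, deque

class EventPeriodTabulator:
    def __init__(self, field, period):
        self.field = field          # metadata field to aggregate, e.g., "Discipline"
        self.period = period        # number of most recent events to keep
        self.events = deque()       # field values of the events inside the window
        self.counts = Counter()     # aggregated frequencies

    def add(self, record):
        value = record.get(self.field)
        self.events.append(value)
        if value is not None:
            self.counts[value] += 1
        # Once the window is full, expire exactly one old event per new event.
        if len(self.events) > self.period:
            expired = self.events.popleft()
            if expired is not None:
                self.counts[expired] -= 1
                if self.counts[expired] == 0:
                    del self.counts[expired]

    def snapshot(self):
        return self.counts.most_common()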
To decouple the metadata aggregations from their later use in the visualizations, a
snapshot of the aggregated metadata during each time or event period is saved after
each click event. We use a ring buffer with 100 entries to keep the last 100 updates.
To keep the implementation simple, this ring buffer is implemented using
subdirectories numbered 0 to 99, each of which contains JSON files that represent a
snapshot of the aggregation after the last event, the next-to-last event, the one before that, and so on.
A symbolic link named now points to the most recently processed subdirectory.
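A sketch of this snapshot layout is shown below; the ring directory location is illustrative, and the production code differs in its details.
import json, os

RING_DIR = "/var/lib/libfx/ring"    # illustrative location
RING_SIZE = 100

def write_snapshot(slot, aggregations):
    """Write one JSON file per channel (e.g., Title.last50) into the slot directory."""
    slot_dir = os.path.join(RING_DIR, str(slot))
    os.makedirs(slot_dir, exist_ok=True)
    for channel, counts in aggregations.items():
        with open(os.path.join(slot_dir, channel + ".json"), "w") as f:
            json.dump(counts, f)
    # Repoint the "now" symlink atomically via a temporary link and a rename.
    tmp_link = os.path.join(RING_DIR, "now.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(slot_dir, tmp_link)
    os.replace(tmp_link, os.path.join(RING_DIR, "now"))
    return (slot + 1) % RING_SIZE   # slot to use for the next event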
We maintain 10 tabulators over 5 time periods and 6 event periods, for a total of
110 aggregations. A snapshot of these aggregations requires roughly 450KB of disk
space; the entire ring buffer thus takes about 45MB. As a concrete example, a snapshot of
the frequencies of the words occurring in the Title fields of the last 50 clicks might
look like this:
{
"timestamp" : "2015-08-03T12:00:27-04:00",
"Title" : [
[ "legionella", 5 ],
[ "robotics", 4 ],
[ "anatomy", 4 ],
[ "boltzmann", 4 ],
[ "lattice", 4 ],
[ "handbook", 4 ],
[ "sciences", 3 ],
[ "mathematical", 3 ],
[ "groundwater", 3 ],
[ "relationships", 3 ],
[ "shear", 3 ],
[ "writing", 3 ],
[ "dynamics", 2 ],
[ "tissue", 2 ],
[ "brain", 2 ],
[ "experimental", 2 ],
[ "education", 2 ],
[ "invariance", 2 ],
[ "vascular", 2 ],
[ "range", 2 ],
. . .
Data Server
The data server component broadcasts real-time updates to the visualization clients.
We chose to build the data server on top of the node.js platform (Dahl 2009).
Node.js is a server-side JavaScript runtime built on Google's V8 JavaScript engine,
which also powers the Google Chrome browser. It provides an environment that
supports an event-based programming style, which allows for efficient network
communication and efficient access to files on disk but requires that the
programmer arrange his or her program as a collection of event handlers that are
executed in response to I/O events. Node.js network applications hold the
promise of supporting large numbers of simultaneous clients while using limited
resources.
We decided to use node.js for three reasons: first, for its aforementioned promise
of efficiency and the resulting scalability; second, because it is a JavaScript
environment and thus provides natural support for operating on JSON data,
including reading, writing, manipulating, sending, and receiving; and, third,
because its ecosystem of packages includes the Socket.io library (Rauch 2015),
which provides real-time, bidirectional, and event-based communication with the
visualization clients in which the actual visualization is implemented.
Clients connect to our data server after bootstrapping a JavaScript file, which
provides the code to create a Socket.io1 connection. On most modern browsers, this
connection is based on the HTML5 Web Sockets transport protocol,2 although
Socket.io has the ability to fall back on older technologies such as AJAX-based long
polling. Clients first send a message to subscribe to one or more channels. Each
channel corresponds to one of the 110 metadata aggregations computed by the log
watcher; one additional channel simply broadcasts the most recent metadata record.
The data server obtains this data from the ring buffer, into which the log watcher
places records and aggregated data. Changes in the ring buffer are recognized via
inotify when the now symbolic link is updated, which signals an advance to the next
subdirectory. Thus, the log watcher and data server programs communicate solely
via the file system; no other form of interprocess communication is necessary. The
data server can also be started and stopped independently of the log watcher.
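The data server itself is written in JavaScript for node.js; the Python fragment below only illustrates the reading side of this file-system handoff, using the same illustrative ring directory as above.
import json, os

RING_DIR = "/var/lib/libfx/ring"    # illustrative location (see above)

def read_channel(channel):
    """Return the latest snapshot for a channel such as Discipline.last50."""
    current = os.path.realpath(os.path.join(RING_DIR, "now"))
    with open(os.path.join(current, channel + ".json")) as f:
        return json.load(f)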
After a client subscribes to one or more channels, the data server sends the client
updates for each subscribed channel whenever a new event is processed by the log
watcher and added to the ring buffer. From the server’s perspective, the data
associated with each event is read once and then broadcast to all connected clients
while respecting their individual subscriptions.
A number of our visualizations use some form of animation to advance from one
visualization state to a new visualization state in response to a new event. To test
these animations, our data server supports a feature that allows a client to request
the data resulting from past events for each channel to which the client subscribes.
These data are sent after a connection has been established and can be processed by
the client in the same way as if the events had occurred in real time. Besides testing,
this feature is also useful for demonstrating our visualizations and animations.
VISUALIZATIONS
This section describes the implementation of the different visualizations we have
built and discusses the underlying technologies. All the visualizations execute inside
modern, standards-compliant browsers and are therefore programmed using HTML5,
CSS, and JavaScript. We reuse widely available open source and proprietary libraries
and modules where appropriate; an example is the powerful visualization library
D3.js (Bostock, Ogievetsky, and Heer 2011). In keeping with adopted terminology,
and to emphasize that our code can be integrated into an existing web page, we refer
to some of the visualization implementations as widgets.
Speed Gauge
The speed gauge is a simple visualization that displays one number, the current
speed of discovery (see figure 3.2). To implement the speed gauge display, we
reused the “JustGage” library by Bojan Đuričić (2012), which in turn relies on the
Raphaël JavaScript library (Baranovskiy 2013).
Discipline Ticker
The data for the discipline ticker are computed by comparing the aggregate
frequency of each discipline at the most recent event with the aggregate frequency
count 10 events earlier. To do so, the widget subscribes to the Discipline.last50
channel. For each discipline with a nonzero frequency, the difference is computed
and displayed as a delta in an appropriate color.
The ticker was implemented from scratch via CSS3 transitions (Jackson, Hyatt,
Marrin, and Baron 2013). CSS3 transitions provide a way to have the browser
change CSS attributes over time, resulting in an animated transition. For instance,
via CSS3 transitions, a browser can be instructed to change an element’s position
from (0, 0) to (100, 0) over the time frame of 2 seconds, making the element move
on the screen. Before the widespread support and use of CSS3 transitions, such
animations were accomplished using JavaScript code, which needed to be invoked
repeatedly in small time steps and which then set those attributes to their
interpolated values (for instance, moving the x-value from 0 to 100 over 2 seconds
might result in 100 calls, one every 20 ms, each incrementing the x-value by one).
CSS3 transitions simplify this process by off-loading it onto the browser without
the need for any JavaScript code. This provides the additional advantage that the
browser can use the built-in capabilities of modern graphic processing units
(GPUs), which are now found in all desktop computers. Without CSS3 transitions,
the CPU can quickly become a bottleneck, limiting the number of elements that
can be simultaneously animated.
We implemented the ticker effect by positioning two rectangular div elements
relative to their container, which in CSS is accomplished by setting the position to
absolute and choosing appropriate values for the top, left, bottom, and right
attributes. The container is chosen so that extraneous content is clipped (see figure
3.4).
After determining the width of the ticker content we wish to display, we use a
CSS3 transition on the left property to move the ticker band to the left, behind the
visible area. To achieve the effect of an infinite ticker, we associate an event handler
with the end of the transition. At this point, we append new content (if any was
received from the server) to the right end and restart the transition. If no new
content was received, we repeat the last content. In addition, we check whether any
content has completely disappeared from view; if it has, we remove this content
from the band and adjust its left property accordingly. These manipulations require
knowledge of how long the ticker band is, which is the sum of the widths of each
piece of content. We compute these widths by laying out each content piece in a
hidden div element and retrieving its CSS width attribute.
Implementing the ticker revealed an interesting design and programming
challenge that recurs in the other visualizations we implemented: the animation of
the ticker occurs concurrently with the receipt of new events, yet we need to
coordinate the two. We accomplish this by queuing events on receipt and then
checking the queue of received but not yet processed events whenever a CSS3
transition ends (i.e., when new content has to be added to the band on the right).
The Summon Cube
The Summon Cube visualization projects the metadata records of the most recently
accessed items onto a 3-dimensional cube (see figure 3.5). As a new item is being
accessed, the cube rotates around one or more of its axes and stops with a new face
pointing forward (figure 3.5 displays a snapshot in the middle of such a rotation).
The face is populated using the metadata of the most recent record. We include
book images (if the record contains them), along with metadata such as title,
author, publication date, and publisher. If the record contains an abstract, we
display it. For books, we also include snippets provided by Bowker.
Our implementation of the rotating cube was inspired by a CSS 3D tutorial that
implemented a continuously rotating cube (Crombie 2015). This rotating cube uses
CSS3 animations. CSS3 animations are, like CSS3 transitions, geometric
transformations that are computed and rendered by the browser’s layout engine
with the help of a GPU, without requiring continuous JavaScript code to run
during the animation. Animations are specified using key frames that represent
states that correspond to a set of CSS properties. For instance, the following
keyframe declaration specifies an animation with two key frames (labeled from and
to):
@-webkit-keyframes spincube4 {
from { -webkit-transform: rotateY(-180deg) rotateZ(90deg); }
to { -webkit-transform: rotateY(90deg) rotateX(90deg); }
}
If this animation is applied to an element, the element rotates around three axes
simultaneously: the x-axis from 0 to 90 degrees, the y-axis from –180 to 90 degrees,
and the z-axis from 90 degrees to 0 degrees. To make the rotating cube work, we
had to decompose the key frames describing a continuously animated cube into six
separate states (and animations that move from one to the next state in a circular
fashion).
Unlike the ticker, which is continuously running, the cube moves only in response
to new records arriving. A question that arises is how long each record should be
displayed on a face before advancing to the next one. During periods of low activity, it is
reasonable to simply move to the next face, as a viewer will have ample time to read
the information displayed. During bursts of high activity, however, advancing to
the next face should not occur in lockstep, or else a viewer might not have enough
time to read the displayed information, similar to how a bad presenter might
advance slides too fast for his or her audience to read. On the other hand, if the
burst of high activity persists for too long and the per-face pause is set too high, the
animation might eventually be too far out of sync.
We used a variable delay queue to mitigate this problem. As new records arrive,
they are placed in a queue that releases items only after a delay. The delay depends
on the number of elements in the queue before it. The more elements, the smaller
the delay, down to a set minimum. For instance, a [6,4,2] queue might have a delay
of six seconds if there is only one item in the queue when a new one is added, four
seconds if there are two items, and a delay reduced to two seconds if more than two
items are already in it. In this way, if two records are accessed at nearly the same
time, the first one to arrive will not cause a “flashing” of the animation but rather
will be visible for six seconds before the next one is displayed. In the event that
several records are accessed nearly simultaneously, they will be displayed at a pace
of one every two seconds, which then increases to four and then to six seconds as
the burst ebbs.
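The release policy can be sketched in a few lines of Python; the widgets themselves implement the same idea in JavaScript, and the exact delays and bookkeeping in our code differ.
import threading, time
from collections import deque

class VariableDelayQueue:
    def __init__(self, delays=(6, 4, 2), display=print):
        self.delays = delays        # seconds of pause, indexed by queue length
        self.display = display      # callback that shows a record, e.g., on a cube face
        self.queue = deque()
        self.lock = threading.Lock()
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, record):
        with self.lock:
            self.queue.append(record)

    def _delay(self, backlog):
        # Fewer queued items mean a longer pause; large backlogs use the minimum delay.
        return self.delays[min(backlog, len(self.delays)) - 1]

    def _drain(self):
        while True:
            with self.lock:
                backlog = len(self.queue)
                record = self.queue.popleft() if backlog else None
            if record is None:
                time.sleep(0.1)     # idle: wait for new arrivals
                continue
            self.display(record)
            time.sleep(self._delay(backlog))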
Word Cloud
D3.js is a JavaScript library for creating dynamic, interactive data visualizations in
web browsers. It was developed by Mike Bostock, Vadim Ogievetsky, and Jeffrey
Heer (Bostock, Heer, and Ogievetsky 2011; Bostock, Ogievetsky, and Heer 2011).
It is based on manipulating web documents that consist of HTML, CSS, and, in
particular, scalable vector graphics (SVG) in accordance with data coming from a source.
Such data binding allows data-driven transformations: for example, a given data
element might be bound to an HTML div element’s height CSS attribute, which
represents a bar in a bar chart. As the data are updated, the bar changes its height
accordingly.
D3.js provides a number of higher-level abstractions to facilitate this programming
style and make it suitable for arbitrary, dynamic sets of data. In addition, it allows
the creation of smooth transitions as data change and simplifies the creation of
interactive applications that respond to user input. The true power of D3.js,
however, lies in the components and plug-ins built by its avid developer community,
which support many different styles of visualization (Bostock n.d.).
We implemented a dynamic word cloud visualization based on a D3.js plug-in
written by Jason Davies (2013). An example is shown in figure 3.6, based on terms
found in the titles of retrieved items over the last 50 events. A word cloud is a
popular way to visualize the frequency of words in a text. More frequently occurring
words are displayed using a larger font. The words are arranged to fill a given area in
a nonoverlapping way. To achieve maximum coverage, words are laid out in a spiral
starting from the center until a gap is found that is large enough so that a word will
fit. This layout logic is implemented in the plug-in we are using.
Figure 3.6. Word Cloud Visualization.
The plug-in is written using D3.js’s model of separating data (also known as
model) and view. The data comprise the words and their frequencies. The view is a
set of svg:text elements (W3C 2001) along with a set of attributes that are computed
by the layout algorithm and include the x, y position at which the text element is
placed inside the SVG canvas, an angle of rotation, and the font size at which the
text element should be displayed. In its basic use (for a static word cloud), the set of
words is provided to the layout algorithm, which creates the corresponding text
elements and initiates the computation of their layout.
Computing the layout is time intensive (it may take several seconds for large word
clouds) and for that reason is performed using a programming pattern that spreads
the computation over multiple chunks, interleaving it with other work so that the
web page does not lose its interactive responsiveness. When the last chunk is
finished, a callback function is invoked; in that callback, the computed attributes
are applied to the text elements that correspond to the model.
The code becomes more complicated if we wish to implement a word cloud that
changes as new aggregated data arrive. In this scenario, some words are in both the
old and the new set (although their frequency may change), some words are in only
the new set, and some old words are no longer in the set. Visually, the words that
stay transition from their old positions and sizes to their new positions and sizes.
The words that are removed will fade away, and new words will fade in at their new
positions. These three transitions take place simultaneously over a given time span.
We expressed this logic using the idioms required by D3.js’s programming model.
As we do for the Summon cube visualization, we use a variable delay queue to
ensure that the word cloud does not update immediately during bursts of click
events.
Building dynamic word clouds, as opposed to designing and tweaking a word
cloud for a static set of words obtained from a text, poses another challenge, namely,
how to choose the function that maps frequencies to font size. If the font size is too
large, only a few words will fit on the page (the algorithm will discard any word it
cannot place without overlap). If the font size is too small, the smallest words will
be hard to read and/or there may not be enough words in the set to fully cover the
given area, which makes the word cloud look empty. We developed a heuristic to
address this problem. We determine the maximum and minimum frequencies
occurring in the set, then provide a font size function that logarithmically maps this
[min, max] range to a range of font sizes [h/40, h/5], where h is the height of
the provided area into which the word cloud is drawn. This heuristic is based on the
idea that if the words of the cloud were laid out in simple horizontal lines, the entire
area would be filled with five lines of the most frequent words or 40 lines of the
least frequent words.
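The heuristic can be expressed in a few lines; the sketch below is a Python rendering of the mapping just described, not the JavaScript used in the widget.
import math

def font_size(freq, min_freq, max_freq, height):
    """Map a word frequency logarithmically onto a font size in [height/40, height/5]."""
    lo, hi = height / 40.0, height / 5.0
    if max_freq == min_freq:
        return (lo + hi) / 2.0      # all words are equally frequent
    t = (math.log(freq) - math.log(min_freq)) / (math.log(max_freq) - math.log(min_freq))
    return lo + t * (hi - lo)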
Listen to Summon
Another visualization is an adaptation of the well-known Listen to Wikipedia
visualization, which was created by Mahmoud Hashemi and Stephen LaPorte
(LaPorte and Hashemi 2013). A snapshot of the Listen to Summon visualization
can be seen in figure 6 in the photospread.
The Listen to Wikipedia visualization provides a visual and aural depiction of real-
time updates to Wikipedia. After a Wikipedia article has been edited, a circle whose
size is proportional to the size of the edit appears at a random location on a canvas.
In addition, a semitransparent ring grows around the circle, creating a
ripple-like effect. The title of the Wikipedia article is displayed over the circle and
hyperlinked to point to the actual article. Colors are used to differentiate between
edits by unregistered contributors, Wikipedia editors, and bots. Sound is used to
denote the type of edit: bells signal additions and string plucks subtractions. The
sounds come from an array of sounds with varying pitch, and the pitch chosen is
based on the size of the edit. To avoid cluttering the animation, older entries fade
away after a set time period.
We adapted the Listen to Wikipedia code and modified it to suit our purposes. The
visualization code subscribes to the Record channel and displays new circles when
new records are broadcast. A variable delay queue is used to avoid overwhelming the
viewer during bursts of events. We map the number of pages of a book or article (if
provided in the record) to the size of the circle. The first discipline listed in the
record determines the circle's color. We use the two types of instruments to
distinguish records that came from our original catalog from all others, and the
publication year, if given, determines the pitch of the sound. Unlike Listen to
Wikipedia, which removes entries after a set time period, we fade old entries out
only if needed to keep the number of displayed circles constant.
Google Chart API
We also used the Google Chart API3 to visualize some data. Google Charts is a
proprietary library created by Google, subject to the Google API Terms of Service.
It supports a closed set of chart types, the code for which is owned and hosted by
Google. These chart types include column charts, bar charts, pie charts, and several
other types. These charts are highly customizable.
An example is shown in figure 7 in the photospread, which displays a pie chart of
the ContentType.last100 aggregation channel. Integrating Google Charts into our
visualization required us to transform our data into the table-based model required
by the Chart API, which was largely straightforward. However, the Chart API also
supports animated transitions when a new data set arrives. For these animated
transitions to work well, we needed to retain the row assignment for each
category between updates. This required some programming to remember the
assigned row index for each category. To handle bursts, we again used a variable
delay queue.
A second use of the Google Chart API is for the creation of QR Codes. Our QR
Code widget subscribes to the Record channel; for each record, the widget retrieves
the same URL to which the user who clicked on the Summon result was led and
encodes it as a QR Code, which can then be embedded in the visualization. This
feature allows passersby to retrieve any items that may pique their interest.
PUTTING IT ALL TOGETHER
The visualizations we have presented so far exist as code that requires a combination
of HTML, CSS, and JavaScript (our code as well as that of the libraries we are
using). Integrating these visualizations into existing web pages requires moderate
effort—a user must copy and paste the corresponding <script> and <link> tags and
ensure that those files are accessible.
To deploy some of our visualizations in a public space we have prototyped a
Chrome app (Google Inc. 2014). Chrome applications are prepackaged bundles
that combine HTML, CSS, and JavaScript and can be invoked by simply installing
and launching them in the Chrome web browser. This packaging results in a
streamlined process for the person deploying them on the computer connected to a
public display and should enable integration into digital signage systems, something
we are in the process of testing.
Our Chrome application is an arrangement of visualizations designed for a newly
installed large display in the main entrance of the Newman Library at Virginia
Tech. This large display consists of four 60-inch screens arranged in a 2 × 2 pattern.
Due to technical limitations, the display supports only HD resolution (1920 ×
1080); the bezels separating the four screens are also large. For this reason, we split
the application into four quadrants; each quadrant displays a frameless top-level
window. Each window contains a single <webview> element that displays a different
page (see figure 8 in the photospread).
A webview element can load and render a web page from an external
source. This external source is our server, which provides the actual code so that
code updates do not require access to the machine that physically drives the display.
As a noteworthy technical detail, we needed to assign each webview element a
different “partition” property when opening it. By doing so, we instruct the
Chrome browser that the code running in the four windows will not attempt to
communicate with one another (despite being loaded from the same domain),
which in turn enables Chrome to place the four windows into four separate
operating system processes. These processes can in turn be placed onto four
different CPU cores, which allows their code to execute in parallel. This separation
allows us to exploit multiprocessing for CPU-intensive JavaScript portions. As a
result, the layout computation in the window displaying the word cloud does not
affect the other windows, and we are able to drive smooth animations in all four
quadrants, even on a commodity PC.
SHORTCOMINGS AND IDEAS FOR FUTURE WORK
Our technology is still at the prototype stage and can be improved in multiple
ways. Several engineering challenges still lie ahead, such as improving the ease of
packaging of the log watcher and data server components. A second goal is to
improve the ease with which visualizations can be embedded into our web pages.
There are also numerous ways to improve the visualizations themselves. For
instance, the word extraction algorithm that determines the frequency of words in
titles or abstracts currently does not take collocations into account; we are working
on integrating a multiword extractor (MWE) from a text processing library. Such
an extractor could rely on a dictionary but ideally would be trained by the records it
sees as it processes the event logs to learn which groups of words form multiword
expressions.
A second improvement would be to perform deeper longitudinal processing of the
data. For instance, it may be interesting to visualize research activity during longer
time frames than just the recent past. This will require improvements to the data
processing stage to compute and store aggregations over longer periods.
We are also considering accessing external sources with which to enrich the data. A
possible example is the use of altmetrics (Priem, Taraborelli, Groth, and Neylon
2010), which can provide enriching information about the items being accessed. Our
visualization system could itself become a source of altmetrics, particularly if it were
deployed at multiple institutions.
CONCLUSION
We have developed LibFX, a web-based technology that taps into our
library’s discovery system, collects user interaction in real time, and creates several
visualizations that can be deployed on wall displays or integrated in web pages.
We believe that deploying LibFX has the potential to amplify our library's impact on
its community. Particularly on academic campuses, libraries remain highly
frequented physical centers of their institutions. Engaging and lively visualizations
of their discovery system’s activities could rekindle the public’s awareness of the
library’s mission as a hub for the discovery and provision of information. Such
displays could spark conversations about the use of books, ebooks, and journals and
unveil hot spots of research interest. Visitors would recognize libraries as active
places that connect the on- and offline activities of their users, which would in turn
reinforce the physical and virtual ties that bind an academic community.
NOTES
1. See http://socket.io/.
2. See http://tools.ietf.org/html/rfc6455.
3. See https://developers.google.com/chart.
REFERENCES
Back, Godmar, and Annette Bailey. 2014. “Hacking Summon 2.0 the Elegant Way.” The Code4Lib Journal
(26).
Bailey, D. R., and B. Tierney. 2008. Transforming Library Service through Information Commons: Case
Studies for the Digital Age. Chicago, IL: American Library Association.
Baranovskiy, Dimitri. 2013. “Raphael—JavaScript Library.” Accessed August 3, 2015. http://raphaeljs.com/.
Bostock, Mike. n.d. “D3 Gallery.” Accessed August 3, 2015. https://github.com/mbostock/d3/wiki/Gallery.
Bostock, Mike, Jeffrey Heer, and Vadim Ogievetsky. 2011. “D3.js.” Accessed August 3, 2015. https://d3js.org.
Bostock, Mike, Vadim Ogievetsky, and Jeffrey Heer. 2011. “D3 Data-Driven Documents.” IEEE Transactions
on Visualization and Computer Graphics 17(12): 2301–2309.
Bray, Tim. 2014. “The JavaScript Object Notation (JSON) Data Interchange Format.” http://www.rfc-
editor.org/rfc/rfc7159.txt.
Crombie, Duncan. 2015. “CSS 3D Transforms and Animations.” Accessed August 3, 2015. http://www.the-
art-of-web.com/css/3d-transforms/.
Dahl, Ryan. 2009. “Node.js.” Accessed August 3, 2015. https://nodejs.org/.
Davies, Jason. 2013. “How the Word Cloud Generator Works.” Accessed August 3, 2015.
http://www.jasondavies.com/wordcloud/about/.
Davis, Hilary M. 2009. “Not Just Another Pretty Picture.”
http://www.inthelibrarywiththeleadpipe.org/2009/not-just-another-pretty-picture/.
Đuričić, Bojan. 2012. “JustGage.” Accessed August 3, 2015. http://justgage.com/.
Google Inc. 2014. “What are Chrome Apps?” https://developer.chrome.com/apps/about_apps.
Harvard University. 2012. “Harvard Library Explorer.” http://librarylab.law.harvard.edu/toolkit/.
Hevery, Miško. 2009. “Building Web Apps with Angular.” Paper presented at the OOPSLA. Retrieved from
http://www.oopsla.org/oopsla2009/program/demonstrations/254-building-web-apps-with-angular-2-of-2.
Hipp, D. Richard. 2015. “SQLite.” https://www.sqlite.org/.
Indianapolis Museum of Art (IMA). 2010. “Dashboard Indianapolis Museum of Art.” Accessed August 3,
2015. http://dashboard.imamuseum.org/.
Jackson, Dean, David Hyatt, Chris Marrin, and L. David Baron. 2013. “CSS Transitions.” Accessed August 3,
2015. http://www.w3.org/TR/css3-transitions/.
LaPorte, Stephen, and Mahmoud Hashemi. 2013. “Listen to Wikipedia.” Accessed August 3, 2015.
http://listen.hatnote.com/.
Lee, Yong Ju, and Brian Brush. 2013. “Filament Mind.” Accessed August 3, 2015.
http://www.eboarch.com/Filament-Mind.
Legrady, George. n.d. “Making Visible the Invisible, 2005–2014.” Accessed August 3, 2015.
http://www.mat.ucsb.edu/~g.legrady/glWeb/Projects/spl/spl.html.
McCutchan, John. 2006. “inotify(7)—Linux Man Page.” http://linux.die.net/man/7/inotify.
Priem, Jason, Dario Taraborelli, Paul Groth, and Cameron Neylon. 2010. “altmetrics: a manifesto.” Accessed
February 2, 2014. http://altmetrics.org/manifesto/.
Rauch, Guillermo. 2015. “Socket.IO.” http://socket.io/.
Serials Solutions. 2009. “API Documentation Center.” http://api.summon.serialssolutions.com/.
W3C. 2001. “Scalable Vector Graphics (SVG).” Accessed August 3, 2015. http://www.w3.org/Graphics/SVG/.
4
Using Google Tools to Create Public
Analytics Visualizations
Lauren Magnuson
Libraries offer access to a huge range of digital resources through their websites, and
gathering meaningful data about how those resources are used can be a significant
challenge. At the same time, both internal and external library stakeholders
increasingly expect to be able to easily access up-to-the-minute information about
how library web resources are used. While many libraries have adopted Google
Analytics as a free, convenient way of tracking user behavior on their websites, the
ability to easily share up-to-date data and trends from Google Analytics with library
staff and stakeholders is limited. Users can share Google Analytics data through
Google Accounts, but this requires that they log in to view the data and that the
Google Analytics account administrator grants individual access. Understanding
analytics data through Google Analytics interfaces also requires users to have some
working knowledge of how Google Analytics works and how to navigate its
administrative menus.
This chapter will describe how the Google Analytics Application Programming
Interface (API) can be used to create and share data visualizations that can be
published on public-facing web pages or intranets. Beginning with creating a
Google Analytics account and setting up a web property, this chapter will provide a
step-by-step guide for using the Google Analytics API, a Google superProxy
application, and the Google Charts API to create a public-facing dashboard to share
Google Analytics data with your library’s stakeholders.
Once your project has been created, you will have access to settings for the project.
Navigate to the APIs and auth section and perform a search for Analytics to locate
the Analytics API. Click on the Analytics API and click Enable. Then visit the
Credentials menu to set up an OAuth 2.0 Client ID. You will be prompted to
configure the Consent Screen menu and choose an email address (such as your
Google account email address), fill in the product name field with your Application
Identifier (e.g., mylibrarycharts), and save your settings. If you do not include these
settings, you may experience errors when accessing your superProxy application
admin menu.
Next, set up authentication credentials. Your application type is likely to be a Web
application. Set the Authorized JavaScript Origins value to your appspot domain
(e.g., http://mylibrarycharts.appspot.com). Use the same value for the Authorized
Redirect URI, but add /adminauth to the end (e.g.,
http://mylibrarycharts.appspot.com/adminauth). Note the OAuth Client ID, OAuth
Client Secret, and OAuth Redirect URI that are stored here, as you will need to
reference them later before you deploy your superProxy application to the Google
App Engine.
When experimenting with the Google Analytics Query Explorer, make note of all
the elements you use in your query. For example, to create a query that retrieves the
number of users who visited your site between July 4 and July 18, 2015, you will
need to select your Google Account, Property and View from the drop-down menus
and then build a query with the following parameters:
ids = this is a number (usually eight digits) that will be automatically populated for
you when you choose your Google Analytics Account, Property and View. The ids
value is your property ID, and you will need this value later when building your
superProxy query.
dimensions = ga:browser
metrics = ga:users
start-date = 2015-07-04
end-date = 2015-07-18
You can set the max-results value to limit the number of results returned. For
queries that could potentially have thousands of results (such as individual search
terms entered by users), limiting to the top ten or fifty results will retrieve data more
quickly. Clicking on any of the fields will generate a menu from which you can
select available options. Click Get Data to retrieve Google Analytics data and verify
that your query works (figure 4.4).
Underneath the results of your query, you will see a field labeled API Query URI.
This URI is an encoded query that you will be able to use in your superProxy
application to schedule automatically refreshed data from the Google Analytics API.
The encoded URI will look something like this:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga%3A9999999&start-date=30daysAgo&end-date=yesterday&metrics=ga%3Ausers&dimensions=ga%3Abrowser&max-results=10
Note that this URI, if entered into a browser, does not result in any data being
returned. Instead, entering this URI will result in an error because a log-in is
required. This is because the URI can return data only if the request is wrapped in
an authenticated API request via the superProxy. Return to your superProxy
application’s admin page (e.g., http://[yourapplicationid].appspot.com/admin) and
select Create Query. Name your query something that makes it easy to identify later
(e.g., Users by Browser). The Refresh Interval refers to how often you want the
superProxy to retrieve fresh data from Google Analytics. For most queries, a daily
refresh of the data will be sufficient, so if you are unsure, set the refresh interval to
86400. This will refresh your data every 86,400 seconds, or once per day (figure
4.5).
You can use the Google Analytics API Explorer to generate encoded URIs, or you
can write them yourself to retrieve nearly any data from your Google Analytics
account. Building an encoded URI for the query may seem daunting, but
understanding the various pieces of the URI can simplify creating custom queries
(figure 4.6). Here is an example of an encoded URI that queries the number of
users (organized by browser) who have visited a web property in the past 30 days:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991&start-
date=30daysAgo&end-
date=yesterday&metrics=ga:users&dimensions=ga:browser&sort=-
ga:users&max-results=5
All encoded URIs will start with https://www.googleapis.com/analytics/v3/data/ga?.
The encoded URI also contains the following elements:
ids. The ids value is equal to the eight-digit Property ID of your Google Analytics
property. This can be retrieved through the Google Analytics API Query Explorer.
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991&start-
date=30daysAgo&end-
date=yesterday&metrics=ga:users&dimensions=ga:browser&sort=-
ga:users&max-results=5
metrics. The metrics property of the encoded URI determines the quantitative
measure you wish to have returned from Google Analytics. For example, metrics
might include the number of users, the number of sessions, or the number of new
users.
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991&start-
date=30daysAgo&end-
date=yesterday&metrics=ga:users&dimensions=ga:browser&sort=-
ga:users&max-results=5
sort. This element enables the data returned to be sorted by either the data’s
metrics or dimensions. In this example, dimensions are sorted by number of users in
descending order, which gives us the top browsers by number of users:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991&start-
date=30daysAgo&end-
date=yesterday&metrics=ga:users&dimensions=ga:browser&sort=-
ga:users&max-results=5
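If you prefer to assemble such URIs programmatically, a short Python sketch using the standard library is shown below; the ids, metrics, and dimensions values are placeholders that you would replace with your own.
from urllib.parse import urlencode

BASE = "https://www.googleapis.com/analytics/v3/data/ga"

params = {
    "ids": "ga:99999991",        # placeholder Google Analytics ID
    "start-date": "30daysAgo",
    "end-date": "yesterday",
    "metrics": "ga:users",
    "dimensions": "ga:browser",
    "sort": "-ga:users",         # descending by number of users
    "max-results": 5,
}

encoded_uri = BASE + "?" + urlencode(params)   # colons are percent-encoded as %3A
print(encoded_uri)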
Before saving, be sure to run Test Query to see a preview of the kind of data that are
returned by your query. A successful query will return a JSON (JavaScript Object
Notation) structure, which is a raw form of your Google Analytics data (displayed in
the test preview using Python's dictionary notation). A snippet of this response is shown below:
{u'kind': u'analytics#gaData', u'rows': [[u'Chrome', u'57512'],
[u'Internet Explorer', u'14215'], [u'Safari', u'14096'], [u'Firefox',
u'6074'], [u'Edge', u'850']], u'containsSampledData': False,
u'totalsForAllResults': {u'ga:users': u'93442'}, u'id':
u'https://www.googleapis.com/analytics/v3/data/ga?
ids=ga:12906498&dimensions=ga:browser&metrics=ga:users&sort=-
ga:users&start-date=30daysAgo&end-date=yesterday&max-results=5',
u'itemsPerPage': 5, u'nextLink':
u'https://www.googleapis.com/analytics/v3/data/ga?
ids=ga:12906498&dimensions=ga:browser&metrics=ga:users&sort=-
ga:users&start-date=30daysAgo&end-date=yesterday&start-index=6&max-
results=5', . . .
The response contains data showing the number of users who accessed your site,
organized by browser. Once you have tested a successful query, save it; saving makes
the data accessible to an application that can visualize them. After saving, you will
be directed to the management screen for
your API, where you will need to click Activate Endpoint to begin publishing the
results of the query in a way that is retrievable on the web from any web page. Then
click Start Scheduling so that the query data are refreshed on the schedule you
determined when you built the query (e.g., once a day). Finally, click Refresh Data
to return data for the first time so that you can start interacting with the data
returned from your query. Return to your superProxy application’s Admin page,
where you will be able to manage your query and locate the public end point
needed to create a chart visualization.
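Because the activated end point is an ordinary public URL, it can be fetched by any script or page. The Python sketch below assumes the end point was published in the default Google Analytics response format shown above; copy the exact URL from the Manage Query page, as the address here is only a placeholder.
import json
from urllib.request import urlopen

ENDPOINT = "http://mylibrarycharts.appspot.com/query?id=YOUR_QUERY_ID&format=json"  # placeholder

with urlopen(ENDPOINT) as response:
    data = json.loads(response.read().decode("utf-8"))

# Each row pairs a dimension value with its metric, e.g., ["Chrome", "57512"].
for browser, users in data.get("rows", []):
    print(browser, users)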
Several values in superproxy-demo.html need to be modified to create a working chart visualization:
dataSourceUrl. This value is the public end point of the superProxy query you
have created. To get this value, navigate to your superProxy admin page and click
Manage Query on the Users by Browser query you have created. On this page, right
click the DataTable (JSON Response) link and copy the URL (figure 4.6). Paste the
copied URL into superproxy-demo.html, replacing the text REPLACE WITH
Google Analytics superProxy PUBLIC URL, DATA TABLE FORMAT. Leave quotes
around the pasted URL.
Figure 4.6. Using the DataTable (JSON Response) link to copy a URL to your clipboard.
refreshInterval. This can be the same refresh interval of your superProxy query (in
seconds, e.g., 86,400).
chartType. Change this value to BarChart to display the data as a bar chart
instead of the default pie chart. You can change the chartType value to any type
supported by the Google Charts API that makes sense with your data.5
title. This is the title that will appear above your chart.
Save the modified superproxy-demo.html file to your server or localhost
environment, and load the saved page in a browser. You should see a bar chart that
shows the types of browsers used by your website’s visitors, similar to figure 4.1,
shown at the beginning of this chapter.
CONCLUSION
Configuring the superProxy to return Google Analytics data requires an initial up-
front investment in time and effort. However, once your superProxy application has
been set up you can create multiple queries and reference the data in multiple
visualizations that load on any web page. The ability to visualize and share usage
trends can support decision making throughout a library’s organization, and the
superProxy is flexible enough to adapt to the changing data needs of your library’s
environment. As libraries are increasingly data-driven organizations required to
demonstrate accountability and effectiveness to a variety of stakeholders, using
Google Analytics data to illustrate the usage of your library’s web resources can be
one method of meeting that goal.
NOTES
1. Python 2.7 is freely available from https://www.python.org/download/releases/2.7/.
2. Available from https://developers.google.com/appengine/downloads#Google_App_Engine_SDK_for_Python.
3. https://github.com/googleanalytics/googleanalytics-superproxy.
4. A localhost development environment can be installed on Windows using free XAMPP software (available
from https://www.apachefriends.org/index.html) or on Mac OS X with MAMP (available from
http://www.mamp.info/en/downloads/).
5. A gallery of available chart types, as well as sample code, is available in the Google Charts Chart Gallery
(https://developers.google.com/chart/interactive/docs/gallery?hl=en).
REFERENCES
Google. 2015a. Google Analytics superProxy. Accessed October 31, 2015.
https://github.com/googleanalytics/googleanalytics-superproxy.
Google. 2015b. Google Visualization API Reference. Accessed October 31, 2015.
https://developers.google.com/chart/interactive/docs/reference.
Google. 2015c. Set up Analytics tracking. Accessed October 31, 2015.
https://support.google.com/analytics/answer/1008080?hl=en.
Google. 2015d. User permissions. Accessed October 31, 2015.
https://support.google.com/analytics/answer/2884495?hl=en.
Google. 2015e. What Is The Core Reporting API - Overview. Accessed October 31, 2015.
https://developers.google.com/analytics/devguides/reporting/core/v3/.
5
Minding the Gap
Utilizing Data Visualizations for Library
Collection Development
Roger S. Taylor and Emily Mitchell
A large proportion of library data centers on our collections and their use. Each title
that comes into the collection receives metadata through cataloging and then has
further data associated with it as the library tracks its use. Even small libraries often
contain several thousand titles, and the sheer numbers make it difficult for data
novices to pull actionable information out of the massive spreadsheets that ILSs
(integrated library systems) provide as usage reports.
This overwhelming flood of data, plus most librarians’ lack of training on how to
deal with it, may well explain why circulation data are rarely leveraged to the full
extent they could be. Common advice to librarians taking over a new collection
development area is to familiarize themselves with their collection by “browsing the
book stacks, journals, and the reference section. . . . This type of assessment will
provide information about the format, age, quality, scope . . . and physical
condition of the collection” (Tucker and Torrence 2004).
Much less common is advice to spend time with circulation data. While browsing
the stacks is indeed a useful means of getting a feel for a collection, it cannot hope
to convey the same wealth of information that could be gleaned from well-
visualized usage reports. As Tukey wrote in 1977, “The greatest value of a picture is
when it forces us to notice what we never expected to see.” Of course, this
presupposes that the tools and skill set to create meaningful graphics are available.
In many libraries, it is extremely useful to have someone on staff with even a
rudimentary idea of how to visualize all the data about the collection and make
them more palatable to human beings. This point is well illustrated by a recent
article in Computers in Libraries that focuses on using circulation data to improve
collection development (Meyer 2015). The author used Microsoft Excel to create
line graphs showing increases and decreases in circulation of particular library
collections over time.
While this is a worthwhile first step toward using data visualizations to improve
collection development, experts generally consider Excel to be a poor tool for
processing data and creating visualizations (e.g., Panko 2008). There is a great deal
more that can be done with circulation data than just creating line graphs in Excel!
Better ways of visualizing the same data can help librarians learn more and different
things about their collections. Using scripting rather than Excel reduces errors,
increases the possibilities for how to visualize the data, and allows for better
comparisons across time periods.
Collection development, reduced to its simplest form, is merely a matter of adding
to the collection those items that can and should be added to the collection, and
removing from the collection those items that can and should be removed from the
collection. In practice, however, collection development requires a great deal more
skill and knowledge than such a simplistic definition would imply. How can one
know which items to select? How can one know which items to weed from the
collection? A library’s budget, goals, and policies all affect the answers to those
questions—but so should data about the library’s collections. This chapter will
focus on the latter. Specifically, we will be looking at the following questions, which
any collection development librarian would want answered:
Depending on the degree of “messiness” of the data, they might need to be
processed before they can be loaded into analysis and visualization software. In our
case, the data obtained from the ILS were relatively tidy (Wickham 2014). Namely,
each variable was saved in its own column and each observation was saved in its
own row. In addition, the column headers contained the variable names.
In our example the raw data required a moderate amount of cleaning before they
could be processed using the R programming language. For this first step of data
cleaning we used spreadsheet software. A variety of spreadsheet programs can be
used to open and process CSV files (e.g., Microsoft Excel, Apple Numbers);
however, continuing with our goal of using accessible tools, we used the free and
open source LibreOffice (v. 4.4) Calc spreadsheet (downloadable from
https://www.libreoffice.org).
We opened (imported) the CSV file we named “circulation_data.txt” (see figure
5.1). Note the checkmark for the “Comma” separator option. The bottom of the
window provides a preview of how the data will be structured when opened in the
spreadsheet. Many times the initial “raw” data set will include irrelevant variables,
which one may want to remove at this point. In our case, we removed the first
column (Bibliography Number), second column (OCLC Number), and fifth
column (Imprint data).
The next step was to replace the variable names on the first row to make them
easier to read for machines and humans. We eliminated all nonnumber and
nonletter characters, with the exception of the underscore (“_”), which was used to
replace blank spaces. In our specific case the variable “Call Num” was changed to
“Call_Number” and the variable “# of Transactions” was changed to “Transactions.”
We were left with five variables, shown below in table 5.1 and figure 5.2.
Table 5.1. Variable names (R code variable names and the data to which they refer)
Variable Name Data
Title Title of the book
Author Author of the book
Year Year of publication
Call_Number Call number code
Transactions Number of transactions
Figure 5.2. Cleaned and formatted spreadsheet view of CSV data file in LibreOffice.
After the variable names (column headers) had been fixed, we next examined the
specific values (individual cells) for each variable. To do this we used the
spreadsheet filter function to determine whether the variable values were
appropriate (e.g., only numerical values in the year variable). For example, there was
an instance in which text (“New York . . .”) was incorrectly entered in the year
variable (see figure 5.3).
Figure 5.3. Filter function allowing a scan for
nonviable values (e.g., nonnumerical years).
Another common problem is missing values—instances in which a variable value is
empty (i.e., no data in a spreadsheet cell). We solved this problem by first using the
spreadsheet’s filter tool to display instances (rows) of a variable that were empty
(i.e., missing). For instance, figure 5.4 shows how we selected the instances in which
the author information was missing.
Figure 5.4. Filter function allowing selection of all instances (rows) of a variable.
The next step was to fill in the missing values. However, there were many
instances in which this was not possible. For those cases, we deleted the instances
(rows) that could not be repaired. After deleting these cases, one must remember to
turn off the filter to be able to view the main data set, which will no longer include
the rows that had empty variables. After this initial cleaning, the data were tidy
enough to be loaded into our data manipulation and visualization software, which
we describe in the next section.
R SOFTWARE: DEDUCER
R is an open source programming language used for data manipulation, statistical
analyses, and visualizations (R Core Team 2014). R is extremely powerful, but it
has a steep learning curve, which limits its use for people without a programming
background. To help overcome this problem we chose to use Deducer, a user-friendly
graphical user interface (GUI) for R. More detailed information is available
in Fellows’s excellent 2012 Journal of Statistical Software Deducer article or at the
software website: http://www.deducer.org. One especially valuable Deducer feature
is that the software converts the GUI selections into R code, which appears in the
Console window (see figure 5.5) and can be copied, reused, or modified. The
Data Viewer window has two tabs that can be selected: Data View and Variable
View. The Data View tab, shown below in figure 5.6, shows the information as it
would appear in a spreadsheet.
Figure 5.5. Deducer software Console window.
Figure 5.6. Deducer software Data Viewer window, Data View tab.
The Variable View tab (see figure 5.7) shows the data’s variable names and type
(categories). There are many different data types, but here we’ll focus on just those
directly relevant to our case study. The most straightforward is Integer, which
means the data comprise only whole numbers. In our case there can’t be fractional
transactions, so these data are stored as integers. The second type is Character, in
which R categorizes the data as a mere collection of letters, numbers, and symbols.
In our case this would include the book titles, authors, and call numbers. The third
data type is Factor, which is somewhat more complicated than the other two. Factor
variables are used to distinguish groups that will be compared. In our case, we’ll
compare different call number subcategories (e.g., Architecture books versus
Sculpture books). Factors represent what would be called levels of a variable in
statistics. The ordering of the factors can influence the format of some data
visualizations. Deducer automatically puts the factors in alphabetical order. The
easiest way to change the ordering is from within the Data Viewer window Variable
View tab. Simply click on the variable Factor Levels cell and a pop-up window will
appear (see figure 5.8). Then use the arrow keys to change the ordering of the factor
levels.
Figure 5.7. Deducer software Data Viewer window, Variable View tab.
Figure 5.8. Factor Editor window.
R PACKAGES
As mentioned earlier, R is extremely powerful and allows several ways to accomplish
the same result. Another advantage of R is the constant development of
supplementary modules called packages that can be downloaded from the Internet
and installed to provide the user with additional functionality. In this chapter we’ll
demonstrate how to install and use several such packages. For data manipulation we
used “stringi,” “dplyr,” and “stringr,” which we’ll describe in detail below.
The stringi package was designed to provide users with a powerful way to
manipulate their text data (Gagolewski and Tartanus 2015). The code box below
shows how to install the stringi package and load it into R. The next three code
blocks use the call number variable data to create three new variables: Class,
Subclass, and Subclass_Number. These three new variables will be used in the
subsequent data visualizations. Lines of code that begin with # are ignored by R and
serve only to make the code more readable to humans.
install.packages("stringi")
library(stringi)
# Class. Matches (extracts) the first letter.
circulation_data <- transform(circulation_data, Class =
  as.character(stri_match_first(circulation_data$Call_Number, regex = "[:alpha:]")))
# Subclass. Matches (extracts) the first set of letters.
circulation_data <- transform(circulation_data, Subclass =
  as.character(stri_extract_first(circulation_data$Call_Number, regex = "[:alpha:]{1,}")))
# Subclass_Number. Matches (extracts) the first set of digits.
circulation_data <- transform(circulation_data, Subclass_Number =
  as.integer(stri_extract_first(circulation_data$Call_Number, regex = "\\d{1,}")))
Figure 5.10. Transformed data with call number extracted for new variables.
The dplyr package was designed to provide users with a powerful and easy way to
manipulate their data (Wickham and Francois 2015).
In our example we decided to look at just the art books (Class N). A single line of
code, shown below, shows how to create this subset. The new data frame will not
automatically appear in the Data Viewer window. To view these new data click on
the pull-down menu and select the data frame called “art_data”:
install.packages("dplyr")
library(dplyr)
# Create subset of Art books (call numbers that begin with N)
art_data <- filter(circulation_data, Class == "N")
It’s easy to forget the specific library subclass codes (e.g., NB for Sculpture), so
we decided to add a new variable, “Subclass_Names,” to help with this problem.
From the Deducer Console Window click on “Data” in the menu bar and select
“Recode Variables” (make sure that “art_data” is selected; see figure 5.11).
Figure 5.11. Recode Variables window.
After selecting “Subclass” and clicking the triangle to move it to the window on
the right, select “Subclass -> Subclass” in the middle window and then click on the
“Target” button on the right. This will open a new input window; type in
“Subclass_Names” and press the “okay” button. Now click on the “Define Recode”
button. In the “Set Variable Codings” window specify how each value gets
recoded (e.g., N into Visual Arts, NA into Architecture, etc.; see figure 5.12).
art_data[c("Subclass_Names")] <-
recode.variables(art_data[c("Subclass")] , "'N' -> 'Visual Arts';NA ->
'Architeture';'NB' -> 'Sculpture';'NC' ->
'Drawing/Design/Illustration';'ND' -> 'Painting';'NE' -> 'Print
Media';'NK' -> 'Decorative Arts';'NX' -> 'General Arts';")
Figure 5.12. Set Variable Codings window.
Since the data have already been cleaned, they are now ready to be used to create
the data visualizations. However, before proceeding we need to set the working
directory, using the GUI command located under “File” in the Console Window
pull-down menu, and then save the data from the same menu. The example syntax
for both is shown below.
setwd("/Users/rtaylor2/Documents/R_Folder/LDV_R")
save(circulation_data, file='/Users/rtaylor2/Documents/R_Folder/LDV_R/circulation_data.RData')
Perhaps the main reason for the popularity of pie charts is that under very specific
circumstances, they can provide an intuitive understanding of part-whole
relationships. This requires that the individual slices be mutually exclusive and add
up to a meaningful whole. The main difficulty with pie charts is that people are not
good at estimating the angles of the slices. A better alternative is the waffle chart,
also called a “square pie chart” (Kosara and Ziemkiewicz 2010). The waffle chart
eliminates the problem of angle estimation and allows the reader to easily compare
the individual parts of the whole. Furthermore, if desired, one could count the
number of squares to get more precise measurements. For instance, in the waffle
chart (see figure 10 in the photospread) of library transactions per Fine Arts
subclass, we can quickly count and see that General Art (shown in orange) had
twenty transactions.
install.packages(c("waffle"))
library(waffle)
art_transactions <- c('Visual Arts (N)'=342,
'Architecture (NA)'=82,
'Sculpture (NB)'= 30,
'Drawing.Design.Illustration (NC)'=116,
'Painting (ND)'=187,
'Print Media (NE)'=38,
'Decorative Arts (NK)'=87,
'General Arts (NX)'=36)
waffle(art_transactions, rows=20, size=.5,
colors=c("#E16D5E", "#D3709A", "#9289C0", "#2A9CB8", "#0BA287",
"#699D4D", "#A88D32", "#D3784E"),
title="Transactions per Fine Arts Subclass",
xlab="1 square == 1 Transaction during past year")
The waffle chart works well with a relatively small number of categories but can
express only a single dimension of data. In our case we would like to be able to see
the book class/subclass, year of publication, and number of transactions.
Scatterplots can provide us with that capability.
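One way to sketch such a scatterplot is with the ggplot2 package (Wickham 2009). The example below is an illustration only, not the chapter’s original code: the columns Year_Published and Transactions are hypothetical stand-ins for fields you would derive from your own circulation data, while Subclass_Names comes from the recode step above.
install.packages("ggplot2")
library(ggplot2)
# Hypothetical columns: Year_Published (publication year) and Transactions
# (checkouts during the past year) for each title in art_data.
ggplot(art_data, aes(x = Year_Published, y = Transactions,
                     color = Subclass_Names)) +
  geom_point(alpha = 0.7) +
  labs(title = "Transactions by publication year and Fine Arts subclass",
       x = "Year of publication",
       y = "Transactions during past year",
       color = "Subclass")
Mapping subclass to color lets a single plot carry all three dimensions that the waffle chart cannot.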
REFERENCES
Bertin, Jacques. 1983. Semiology of Graphics. Madison: University of Wisconsin Press.
Cleveland, W. S., and McGill, R. 1985. “Graphical Perception and Graphical Methods for Analyzing Scientific
Data.” Science 229(4716): 828–833.
Fellows, Ian. 2012. “Deducer: A Data Analysis GUI for R.” Journal of Statistical Software 49: 1–15.
Gagolewski, Marek, and Bartek Tartanus. 2015. R Package Stringi: Character String Processing Facilities.
http://stringi.rexamine.com/. doi:10.5281/zenodo.19071.
Heer, Jeffrey, Michael Bostock, and Vadim Ogievetsky. 2010. “A Tour Through the Visualization
Zoo.” Communications of the ACM 53(6): 59–67.
Kosara, Robert, and Caroline Ziemkiewicz. 2010. “Do Mechanical Turks Dream of Square Pie Charts?”
Proceedings of the 3rd BELIV’10 Workshop. 63–70.
LeVeque, Randall J., Ian M. Mitchell, and Victoria Stodden. 2012. “Reproducible Research for Scientific
Computing: Tools and Strategies for Changing the Culture.” Computing in Science & Engineering. 14(4): 13–
17.
Meyer, J. 2015. “Monitoring the Pulse: Data-driven Collection Management.” Computers in Libraries 6: 16.
Panko, Raymond R. 2008. What We Know About Spreadsheet Errors. Accessed November 10, 2015.
http://panko.shidler.hawaii.edu/SSR/Mypapers/whatknow.htm.
Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334(6060): 1226–1227.
R Core Team. 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. Accessed November 10, 2015. http://www.R-project.org/.
Rudis, Bob. 2015. Waffle: Create Waffle Chart Visualizations in R. R package version 0.3.1. Accessed November
10, 2015. http://CRAN.R-project.org/package=waffle.
Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, Eivind Hovig, and Philip E. Bourne. 2013. “Ten Simple
Rules for Reproducible Computational Research.” PLoS Computational Biology 9(10).
http://doi.org/10.1371/journal.pcbi.1003285.
Spence, Ian. 2005. “No Humble Pie: The Origins and Usage of a Statistical Chart.” Journal of Educational and
Behavioral Statistics 30(4): 353–368. doi:10.3102/10769986030004353.
Stevens, S. S. 1946. “On the Theory of Scales of Measurement.” Science 103(2684): 677–680.
Tucker, James Cory, and Matt Torrence. 2004. “Collection Development for New Librarians: Advice from the
Trenches.” Library Collections, Acquisitions and Technical Services 28(4): 397–409.
Tufte, Edward R. 1983. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Tukey, John W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59(10): 1–23.
Wickham, Hadley, and Romain Francois. 2015. dplyr: A Grammar of Data Manipulation. R package version
0.4.2. Accessed November 10, 2015. http://CRAN.R-project.org/package=dplyr.
6
A Picture Is Worth a Thousand Books
OBILLSK, Data Visualization, and Interlibrary
Loan
Ryan Litsey, Kenny Ketner, and Scott Luker
“A picture is worth a thousand words.” This commonly spoken phrase captures the
power an image can have, especially when we think about how we can represent
large amounts of data in ways that can be better understood by decision makers and
managers. The representation of large-scale data lies at the heart of data
visualization. One of the biggest large-scale data sets in libraries exists in the
transactions between institutions via Interlibrary Loan (ILL). The Online Based
InterLibrary Loan Statistical Kit, or OBILLSK for short, can help both report on
and direct the decisions of librarians within a consortium. OBILLSK can also better
represent visually the large amount of data contained within an ILL transaction.
Being able to accurately and clearly represent a large data set describing a workflow
or process is a tremendous aid to understanding the behavior of a system. Nowhere
is this more necessary than in today’s academic libraries.
Academic libraries are very good at collecting data. We collect data on the
circulation of books, eDatabase usage, patrons in the building, ILL statistics, and so
on. All this data tracking, however, is a futile exercise unless we can accurately
represent that data in models that can help inform decision making. One of the best
areas in which we can utilize data visualization in decision making is in ILL. ILL, or
resource sharing, as it has become more recently known, is the sharing of items
owned by a library with another library that may not have access to that item. This
is a process that at its core is a fundamental part of any library. However, in the
academic library, it is of critical importance. Since not every library can be expected
to own everything, resource sharing becomes an important part of the functioning
of the academic library. This means that any large-scale academic library will have
an ILL department that processes tens of thousands of these types of transactions
every year. For example, last year the Texas Tech University Libraries handled over
90,000 transactions of either Texas Tech sending an item to another university,
known in ILL as lending, or another university sending Texas Tech something it
owned, which is known as borrowing. These processes generate a mountain of data:
the citation information for the item, where it is going, the shipping address, when
it is due back, and the overall current status of the request.
At Texas Tech, this type of data is usually recorded in an SQL database that is
utilized by the dominant ILL software ILLiad.
Atlas Systems’s ILLiad is the software many academic libraries use to manage the
exchanges between institutions in the ILL process. The software is very good at
tracking the different statuses of each request as well as the myriad potential
outcomes of the request. There is one area, though, that the system is unable to
process: what happens in between libraries. In an era of shrinking budgets, many
libraries have entered into partnerships with other libraries to provide resources at
little or no cost. In the ILL world, these partnerships are called consortiums. A
library often will be a member of many consortiums to meet the resource needs of
its patrons. While ILLiad does a good job of presenting in-house statistics, it does
not provide much insight into the statistical data of the consortium as a whole.
Essentially, how long does it take for a request to be sent from one library, received
by another, and sent back? And what are the reasons why each step takes the
amount of time that it does?
Data about these kinds of transaction details are lacking in the world of ILL. The
development of OBILLSK and the data it visualizes can rise to the challenge of
filling that gap. Before we can illustrate what type of data is being visualized, it is
important to understand the background behind the design, the architecture of how
OBILLSK was built, the software involved, and the problem it solves: presenting
the ILL metrics for all ILL transactions between the libraries of the Greater Western
Library Alliance (GWLA). GWLA is a large academic consortium made up of the
large research institutions of the American Southwest and Pacific Northwest. To see
how we got to where we are today with OBILLSK, it is important first to examine
how we as human beings come to need visualized data and how those data can help
inform decision making in libraries, especially in the logistically complex arena of
ILL.
BACKGROUND
To begin to understand the importance of data visualization, we must first engage
in a bit of visualization ourselves. Imagine a river or stream flowing into a mountain
lake. Now instead of water, think of data as the water flowing through a river and
emptying into a big “data” lake. Traditional methods of “big data” analysis would
argue that to get a picture of what is happening in the environment, we should
sample the data in the lake. To sample the lake, we would dip our toes in and pull
out a specific fact, maybe even casting in a line with a certain question like bait on a
hook to elicit a response from the lake. We may even get lucky and catch a fish. The
problem with this type of analysis is that it does not tell you anything about the
composition of the data lake or how the data came to coalesce in that area. So to
understand the formation of the data lake, we must examine the data as they are
flowing in the river.
Of course, rivers do not stand still; they move in real time. To examine the data in
real time, we must use visualization to accurately represent the different elements so
that the researcher or analyst can get an accurate snapshot quickly, on demand. This
simple description of data is emblematic of a larger point about the cognitive nature
of human thinking. As Larkin and Simon argue,
When they are solving problems, human beings use both internal representations, stored in their brains, and
external representations, recorded on a paper, on a blackboard, or on some other medium. (Larkin and
Simon 1987, 66)
Larkin and Simon touch on an ongoing debate about how we as people come to
formulate ideas within our minds. Their paper illustrates how a diagram can serve as
a better model for understanding than merely words, because data described by
words in written sentences can relate to each other only laterally. In other words, a
single word in a sentence can relate only to the word before it and the word after it.
In contrast, data elements in a diagram can relate to a number of adjacent variables.
This leads to the conclusion (Larkin and Simon 1987, 98) that a diagram is a
superior method of communicating information for the following reasons:
Diagrams can group together all information that is used together, thus
avoiding large amounts of search for the elements needed to make a
problem-solving inference.
Diagrams typically use location to group information about a single
element, avoiding the need to match symbolic labels.
Diagrams automatically support a large number of perceptual inferences,
which are extremely easy for humans to understand.
GWLA holds its member institutions to a shared set of service benchmarks:
Eighty percent of all loans shipped from one GWLA school to another must
be received within five days of receipt of the request.
Eighty percent of all articles must be received within seventy-two hours of
receipt of the request from the borrowing institution.
This set of benchmarks has distinguished the GWLA consortium from other
consortiums in that the member institutions are accountable for a standard of
service to each other. These benchmarks are also known in ILL terminology as the
turnaround time for processing items. They are a good indicator of the efficiency
and effectiveness of the local institution’s ILL operation.
The regular requirement to examine the GWLA metrics means that a member
institution would need to volunteer to do the data analysis. Historically, it has been
the responsibility of a single person to perform the data analysis; therefore the
GWLA member institutions were limited to handmade reports based on
semiannual sample sets of data. From a workflow perspective, this method required
that each member of the consortium download a set of data from our ILL systems
for the months of February through April and September through December of
each year and then send them to the University of Kansas ILL department for
analysis. This process is cumbersome and time consuming. There is also a high
chance for error since the process relies on thirty-three people (the ILL librarians of
the GWLA member institutions) to access their ILL data and send them to the
analyst. So the programmers at Texas Tech University Libraries brainstormed a way
to make this process better. The result is a small Windows executable program that
each member institution runs locally to extract and upload its ILL data.
The initial screen of the program is a minimal text input form requesting the
location and log-in credentials for the institution’s ILLiad SQL database (figure
6.1). On successful connection to the database, the next screen displays two buttons
(figure 6.2). The first button executes a query, populates a CSV file, and opens a
save file dialogue window allowing the user to save the file to a location of choice.
The second button facilitates the file upload process by linking directly to an upload
page contained in the web application. In the final step, the user selects the CSV file
generated by the program and uploads the file to the OBILLSK server. The data we
are collecting are key to understanding how we negotiate the turnaround times.
The data collected consist of information from two ILLiad database tables:
transactions and tracking. The data from the transactions table contain the ILL
number, which serves as the system-level unique identifier, and citation information
utilized in detailed lists within the web application. The data from the tracking table
include timestamps and transactional status changes. These status changes are
critical in analyzing the data and calculating borrowing and lending turnaround
times. All data collected by the executable program are imported into the OBILLSK
database. To achieve a high level of data integrity, this cumulative data set is
referenced by only SQL queries and stored procedures, which create secondary data
tables while leaving the source data pristine and intact. Once the data are collected,
the steps necessary to display them in a visually appealing way become very important.
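To make the turnaround calculation concrete, here is a minimal sketch in R (the language used elsewhere in this book) of the kind of computation the stored procedures perform. It is not the OBILLSK code: the data frame tracking and its columns TransactionNumber, ChangedTo, and Timestamp, as well as the status labels, are hypothetical stand-ins for the ILLiad tracking export described above.
library(dplyr)
# Hypothetical columns: TransactionNumber (unique ILL number), ChangedTo
# (the status a request changed to), and Timestamp (when the change occurred).
turnaround <- tracking %>%
  group_by(TransactionNumber) %>%
  summarise(
    requested = min(Timestamp[ChangedTo == "Request Sent"]),      # assumed status label
    received  = min(Timestamp[ChangedTo == "Request Finished"])   # assumed status label
  ) %>%
  mutate(turnaround_hours = as.numeric(difftime(received, requested, units = "hours")))
Each request is reduced to the elapsed time between its first and last status change, which is the figure the dashboard ultimately summarizes.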
Using ILL SQL data, we are able to gather around 100,000 records per institution
at a time and upload them to the TTU servers. Such a large amount of data is not
useful unless we can design an interface and present the data in a visually appealing
and useful way. That is where the web application coding for OBILLSK takes
center stage.
The OBILLSK system is a Microsoft ASP.NET web application written in C#.
Visual Studio 2013 is the primary development environment. To remain consistent
with the Microsoft technological stack selected for the user interface, Microsoft
SQL Server 2012 powers the database and stored procedures behind the
application. These Microsoft technologies provide simplification of complex
processes and are institutionally supported at Texas Tech University.
On top of this .NET web application foundation, other contemporary website
user interface technologies are implemented to make the data as visually appealing
as possible. The OBILLSK site uses a responsive design built with Bootstrap,
a popular cascading style sheets (CSS) and JavaScript library for rendering websites
mobile first, appropriately formatted for any size device. For example, larger menus
automatically collapse into a single drop-down menu represented by a three-line
icon commonly known as a hamburger button. Other buttons that open various
reports on the screen will line up vertically for cell phone users and in a wide
rectangle for tablet and PC users. Reports and charts resize automatically for
changing screen resolutions, such as turning a mobile phone or tablet from a vertical
to a horizontal orientation. Responsive design is a key element to the web pages
because as decision makers access the site, they may not always be near a desktop
computer. Because of this we needed to incorporate a design that allowed for ease of
use on any device.
Charts, graphs, and reports are rendered through a third-party commercial tool
called Shield UI, which was selected because the charts can be implemented easily
and are visually appealing. For example, a few lines of code are all that is required to
generate a line chart of historical data such as total loans by day over a five-day
period. The Shield UI controls appear on the page as simple HTML div tags.
Additionally, the jVectorMap library is used to display a map of the United States
with dots at each of the participating institutions. Users can hover over the dots to
see a quick snapshot of total articles and loans sent to each institution. The map can
be zoomed with the mouse scroll wheel as well, and the map view is clear at all
levels of magnification due to its vector nature. The jVectorMap library is simple,
lightweight, and highly customizable; these are the reasons for its use over more
complex options such as Google Maps, Apple Maps, or Bing Maps. Each of those
other options provides unnecessary functionality, which slows down the web
application, and they are all harder to customize. Selecting the proper tool for
simple geographic visualization will help users understand where their ILL materials
are going. By incorporating a map in the center frame we are able to draw the users’
attention away from the side buttons and to the center, which is where we placed
our more pertinent and easy-to-access information. Since it is a dynamic map, it is
not intended merely for graphic reference; it is able to reproduce where and how
many items have gone to the different universities represented. This accomplishes
two goals: (1) it gives perspective on how large the consortium is, and (2) it can tell
the users where a majority of their items are going, which can influence shipping
decisions.
Data analysis, calculations, and anomaly detection are primarily performed by a
series of stored procedures. Stored procedures are subroutines contained within a
database that consist of extensive SQL statements. This modular approach was
chosen to isolate the complex logic required for statistical calculations from the web
application to increase overall performance and efficiency. Another set of stored
procedures monitors the database for any anomalies that may have occurred during
the collection, analysis, or calculation processes. Scheduled tasks, a common server
technology, are used to execute the stored procedures automatically, requiring no
human intervention.
Together, this complex set of stored procedures, web application code, and data
provided by ILL librarians at each institution yields a very user-friendly and
visually appealing data presentation. As you can see in figure 6.3, there is a wealth of
data being presented. Since our main focus was to graphically represent the ILL
turnaround times for all of the GWLA institutions, we made that the central focus
of the page when the users first log in. What they see is a main menu with a variety
of easy-to-access informational options. In the center frame we have the quick-
access buttons and the map.
Figure 6.3. The main menu screen for the OBILLSK web page.
As you can see in figure 6.3, we have divided the turnaround times into six unique
categories. We followed a similar categorization already in existence with ILL
processes. Those categories are the distinction between articles and loans and the
division between borrowing, lending, and in transit. All of this terminology was
chosen to correspond with the existing terminology of ILL. This allows users to be
able to adopt the system faster because there is no barrier in understanding the
wording of the different options. As Marlin Brown (1988) writes in his book on
user interface design, “reduce the amount of mental manipulation of data required
of the user. Present data, messages, and prompts in clear and directly useable form”
(7). With this in mind we made sure to present the pertinent data the system was
designed for front and center with terms that were comfortable to users. In addition
to the dashboard we wanted a customizable log-in. To accomplish that we needed
to create a user authentication system so we could control what the specific users see
when they log in.
After authentication, the OBILLSK system begins with a dashboard screen that
features average turnaround time summaries for six categories as well as the
jVectorMap of the United States with dots for each institution. The average
turnaround time summary reports presented on the dashboard include Borrowing
Articles, In Transit Articles, Lending Articles, Borrowing Loans, In Transit Loans,
and Lending Loans. These average times are presented in a format of days, hours,
and minutes. The ability to categorize each of these times, especially the in-transit
times, is new in the ILL world and is possible only because data from each
institution are collected. Daily processing of that data guarantees up-to-date
information presented to the users.
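As a small illustration of that presentation format, the sketch below converts an average turnaround time, here assumed to be held in seconds, into days, hours, and minutes. The function name and the input value are hypothetical and are not part of OBILLSK.
# Convert a turnaround time in seconds into the days/hours/minutes format
# used on the dashboard (the input value here is made up for illustration).
format_turnaround <- function(seconds) {
  days    <- seconds %/% 86400
  hours   <- (seconds %% 86400) %/% 3600
  minutes <- (seconds %% 3600) %/% 60
  sprintf("%d days, %d hours, %d minutes", days, hours, minutes)
}
format_turnaround(275700)   # "3 days, 4 hours, 35 minutes"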
We have built within the center frame another subset of controls that can provide
users with more detailed information should they require it. Each of the turnaround
boxes is selectable and will display a graphical representation of the number of
requests for the current calendar year. The decision was made here to provide a brief
graph as a way of helping users get a better sense of the current year's trends in
their ILL departments. From the central frame, where we choose to
prioritize the most important data, we move to the left navigational menu.
The web page follows an F-pattern design. F-pattern design is a widely researched
paradigm that illustrates that web users typically read a page in an F-shaped pattern,
starting from the left and moving to the end of a line and then returning to the next
line. With that in mind, we have a left navigational menu section that can help
users navigate to the main overview, which displays the data above. However, we
also give users the option to drill down to a more microlevel of detail in
understanding the ILL transactions between institutions. In the left navigational
menu, users can access the details of all the completed ILLs by looking at one of the
options listed.
Borrowing and lending requests are further divided into lists of articles and lists of
loans, for a total of four levels of analysis. When the user clicks on one of the
options (for example the borrowing requests) the center frame will change and
display an ordered list of all of the transactions for that function with additional
details. Each row in each list displays basic identifying information about the
request, including unique request number, borrowing institution, lending
institution, and the turnaround and in-transit times for that request. Again, here we
are trying to design a system that can present a lot of information in a visually easy-
to-understand way. With that in mind, we kept all of the actions focused on the
center frame. So when users first select an option on the left, the center frame
changes as described above. If this information is not sufficient users can still take
the analysis to a further level by clicking on the button shown in figure 6.4. Each
row also has a button that can be clicked to display a modal pop-up window that
presents the full citation information for that item. These historical lists of requests
are also searchable.
Figure 6.4. An example of the transaction-level detail that OBILLSK can provide.
One of the main benefits of viewing these request details in OBILLSK instead of
in the native ILL interface (ILLiad) is the easy juxtaposition of the citation detail
along with turnaround and in-transit times. It is in this microlevel of detail that the
power of the system really begins to show through. If we travel with users from first
log-in, we can see how as they move through the system they get better and better
detail. This is very helpful in meeting the needs of a large audience. For example,
upper administrators may be more interested in the global view of things. They may
want to see what they are sending, where they are sending it, and how long it is
taking to get there. This type of global overview is what we present first. Daily
practitioners may be more concerned with what they are sending, which is where
the tools on the left come into play. The left navigational menu also includes an
option for institutional comparisons.
Institutional comparisons within the consortium are provided by OBILLSK. Two
versions of this report are available: one for articles and one for loans. For articles,
the report counts each institution’s total articles lent—articles lent in 72 hours or
less, articles lent in under 96 hours, articles lent in under 120 hours, and articles
lent in 120 hours or more—and its compliance percentage within the rules of the
consortium. For loans, the report counts each institution’s total loans—items lent
in under 5 days, items lent in under 6 days, items lent in under 7 days, and items
lent in 7 days or longer—and its compliance percentage within the rules of the
consortium. Institutional comparison gives users another avenue to understand how
their libraries may compare to the other libraries in the consortium. The data here
are presented in a grid pattern. This makes it easy to compare the fill rates of one
library to those of another library. Users are able to see their institutions and
compare them with other institutions. We have also incorporated an existing
consortium-level chart generated outside of OBILLSK in this web page to centralize
all of the services we offer into a single spot.
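A minimal sketch of that kind of comparison report, continuing the hypothetical turnaround data frame from the earlier sketch (with an added, equally hypothetical LendingInstitution column), might bin article turnaround times and compute each institution's compliance with the seventy-two-hour benchmark:
library(dplyr)
# Hypothetical input: one row per completed article request, with the lending
# institution and its turnaround time in hours.
article_report <- turnaround %>%
  group_by(LendingInstitution) %>%
  summarise(
    total           = n(),
    within_72_hours = sum(turnaround_hours <= 72),
    under_96_hours  = sum(turnaround_hours < 96),
    under_120_hours = sum(turnaround_hours < 120),
    over_120_hours  = sum(turnaround_hours >= 120)
  ) %>%
  mutate(compliance_pct = round(100 * within_72_hours / total, 1))
The resulting one-row-per-institution grid mirrors the layout of the comparison report described above.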
The final chart we added is an existing account of ILL statistics that we have been
tracking. The OBILLSK system also hosts the Relais D2D load-leveling statistics
within the GWLA as a courtesy to GWLA members. The Relais system is an
additional service offered by the consortium that allows for real-time availability and
building of lending strings. Although Relais provides interesting information on its
own, the consortium wanted to get more complete detail about which universities
were receiving requests and how many. This information was helpful to understand
whether the load-leveling algorithms offered by the Relais service are functioning
correctly. To accomplish this detective work, Texas Tech Libraries wrote a series of
queries that interrogate the Relais D2D system and return a numerical value for
requests received by that institution. We then divided the result into days and
archived them by month so users could return to the chart and access the historical
load-leveling statistics going all the way back to the implementation of the system.
The final options on the left navigational menu were designed to give users the
ability to manage their accounts and to upload the data that are generated by the
Windows executable file. This is the place where the data files are loaded into the
server. Users are also able to contact us if they have any questions about the system
or to download the executable file should they lose the copy, which can occur when
their computers are replaced or re-imaged by their IT staff. The entire web page
visualization scheme functions in the center frame. By isolating the actions of the
different functions on the center frame, users are presented with a visually consistent
set of actions and reactions. This overall design presents a consistent experience so
that users can be comfortable with using the system and asking questions.
Taking into account the whole system, it is easy to see the amount of data
collected. To collect transaction-level detail of all ILL transactions at a given
institution is daunting enough, but to then make the comparisons to other
institutions within a certain group makes it easy to see how the presentation of that
type of data needed to be visual in nature. By presenting the data as graphs and
charts users can easily get an “at a glance” understanding of what is happening in
the ILL transactions within the consortium. By providing visual feedback when
options are selected, in the form of the buttons changing color, we are able to guide
users through the different displays of information presented. All of this design was
deliberate to correspond to existing web presentation styles and user experiences. A
consistent user experience is something we strive for in our design.
In his classic paper on the “magical number seven, plus or minus two,” George
Miller (1994) describes the idea that the human mind can store in immediate
memory only about seven objects at any given moment. Further, if you would like
to store more, then the information needs to be broken up into categories or
chunks. Two designers at Microsoft echo this idea, writing this:
In essence, G. Miller’s findings that people are only able to make quick, accurate decisions with a small
handful of objects at a time has had wide support across studies, and may provide useful guidance in the
design of web hyperlinks across pages. (Larson and Czerwinski 1998, 26)
All three of these authors are illustrating one of the key components to the visual
design for OBILLSK. Examining figures 6.1 and 6.2, you will see that much of
what we have done follows this principle. The left navigational menu has only ten
total options, which is close to the seven plus or minus two idea. What we have
done, however, is make sure that the navigational menu is divided into distinct
parts. That allows the immediate memory to grapple with the number of choices in
a categorized way, thus allowing for more information to be fed to users. We also
ensure that the dynamically changing center frame has only seven options at
log-in. Those options are the six turnaround time metrics and the map. This total
number of seven keeps users in a comfortable space for making decisions with the
information. As users continue to drill down into the microlevel detail of the
requests, they continue to be presented with seven options. Each of the microlevel
analyses of the borrowing requests and lending requests, while having a long list of
transactions, has between five and six headings for each column. This allows users
to process the list in a more comfortable way by categorizing the information into
discrete packets that can be assessed one at a time. You can also see that the
dynamic between loans and articles is at work throughout the web page
functionality. This leads to another important discussion point: the consistent user
experience.
A consistent user experience is also a key component to the discussion of the
visualization of the data. Even though we can categorize data into discrete packets
and ensure that we organize them in a certain way to help users make assessments
from what they are seeing, if we do not have a similar experience across categories,
then we will lose users rather quickly. With that in mind, we incorporated a few
elements to make the user experience seem consistent across the web page and
software.
The first user experience element employed in the system, the color palette, is
consistent with the branding of the software and occurs across both the executable
file and the web pages. This gives users a sense of identification and familiarity that
this is in fact an OBILLSK system. We also stuck with a more appealing palette of
whites and blues.
The second element that helps with a consistent user experience is the feedback the
page gives when an option is selected. We used green to highlight when a page is
active. This is illustrated when you click on one of the turnaround metrics and the
button highlights green to show that it is active. We also carried that over into the
microlevel analysis. When you choose article or loan on the tabs above, the tab color
changes to green, which lets users know that the tab is active. Using color to provide
feedback across the platform allows users to have a consistent understanding of what
is happening when things are active. Essentially, users can come to understand that
the color green, which is already associated with action in our daily lives, also means
action on the web page.
CONCLUSION
The field of ILL is essentially a logistics management discipline. That means that
librarians in this discipline, rather than viewing the books as resources of
information, need to come to understand them as UPS or FedEx would understand
a package. This leads to a few fundamental questions, the chief among them being
“How do we get item A from here to there in the most efficient and effective way
possible?” Viewing ILL in this manner can help increase the access to information
that all users of the library need and deserve. However, to construct an efficient
system, libraries must come to rely on visualized data that can tell a story not only
of what is happening at their institutions but also what is happening across a more
global system. Librarians in essence need to be able to see the entire supply chain to
make accurate assessments. This goal, to give ILL librarians easy access to the ILL
data across an entire consortium, is the core motivating idea behind the
development of OBILLSK. Decision makers and practitioners of ILL should be able
to make accurate assessments of what is happening on a broad scale and in near real
time. To accomplish the display of so much varied detail, it is necessary to develop a
cohesive, accessible system that follows consistent design principles so that the end
users of the product can transition from the Windows executable file to the web
page with little to no change in user experience. The OBILLSK system does a very
good job of presenting this large-scale information with varied complexity in a
simple and easy-to-use interface. As we go forward, it will be interesting to see how
the influence of large-scale logistical management systems like OBILLSK can help
create a more efficient ILL system and thus continue to help library patrons get the
resources they want when they want them.
REFERENCES
Brown, C. Marlin. 1988. Human-Computer Interface Design Guidelines. Norwood, NJ: Ablex Publishing.
Larkin, Jill H., and Herbert A. Simon. 1987. “Why a Diagram Is (Sometimes) Worth Ten Thousand Words.”
Cognitive Science 11(1): 65–100.
Larson, Kevin, and Mary Czerwinski. 1998. “Web Page Design: Implications of Memory, Structure and Scent
for Information Retrieval.” CHI ’98 Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, 25–32.
Miller, George. 1994. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for
Processing Information.” Psychological Review 101(2): 343–352.
Photospread figure captions (color plates):
1. Two visualizations of the same data.
4. Using a grouped bar chart yields a chart with redundant labels and wasted space.
5. Two choropleth maps of the counties of the contiguous United States.
6. Listen to Summon.
7. A pie chart displaying the content types of the last 100 items.
17. Information literacy term appears at the boundary between education-related research and library-related research.
18. Science-related terms with clustering resolution of 2.0.
20. Scatterplot timeline of a correspondence collection displaying peaks and types of activity.
21. Visual map of a correspondence collection.
25. A radial chart that displays relationships between records, creators, and functions.
26. A visual tool for analyzing dates, extent, document types, and subjects across multiple collections.
27. A ggvis-generated scatterplot created from the diamonds data set showing carat weight by price. The color of each data point represents the quality of the cut of each diamond.
28. Using the same data as figure 9.4, color has been added using the fill property to enable differentiation of branch data.
29. Circulation by gate count with a fill color indicating library branch.
30. A scatterplot using the same data as figure 9.7, with the addition of a smooth model curve to highlight the data trend, which shows a positive relationship between gate count and circulation.
BACKGROUND
Term co-occurrence maps (sometimes referred to as co-word maps or term maps)
have a rich history in bibliometrics, a subfield of library and information science
that uses various methods to quantitatively analyze scholarly literature. Most often
this analysis focuses on a specific domain to understand both its current state and
evolution over time. Term co-occurrence maps attempt to show the dominant
themes in a set of documents by connecting terms that occur together in a single
document. A document can be a paragraph, an abstract, a title, or the full text of an
article. Term co-occurrences in a body of text are organized into a matrix, which is
interpreted as a network where terms are nodes connected by links based on their
co-occurrence in a document. These maps are typically displayed in two dimensions
using a variety of techniques. Term maps date back to the early 1980s with Callon
et al.’s (1983) landmark study involving a co-word analysis of keywords from 172
scientific articles on dietary fibers. When mapped, terms are placed in a vertical
fashion with more frequently occurring terms appearing at the top and co-
occurrence represented by links connecting terms. Not all co-occurrences are
represented in the map. To simplify the maps and reduce term density, a term must
appear at least three times in association with one other term in the data to meet the
threshold for inclusion in the map (Callon et al. 1983).
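As a rough illustration of the idea (not VOSviewer's or Callon et al.'s implementation), the sketch below builds a small term co-occurrence matrix in R from three toy documents; every document and term in it is made up for the example.
# Three toy documents
docs <- c("library instruction improves information literacy",
          "information literacy and student learning outcomes",
          "student engagement with library instruction")
# Binary document-term matrix: 1 if the document contains the term
terms <- sort(unique(unlist(strsplit(docs, "\\s+"))))
dtm <- t(sapply(strsplit(docs, "\\s+"), function(words) as.integer(terms %in% words)))
colnames(dtm) <- terms
# Term-by-term co-occurrence: how many documents contain both terms
cooc <- t(dtm) %*% dtm
diag(cooc) <- 0                      # ignore a term's co-occurrence with itself
cooc["library", "instruction"]       # 2 documents contain both terms
Interpreting cooc as a network, each term is a node and each nonzero cell is a link whose weight is the number of shared documents.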
Subsequent term maps emphasized the strength of co-occurrence by using
weighted links to connect the terms. The more frequently two terms co-occur, the
thicker the link connecting the terms appears in the map. In their article, Rip and
Courtial (1984) show the connections between keywords from articles published
over a ten-year period in Biotechnology and Bioengineering, a core journal in
biotechnology. Both circular and vertical maps are used to visualize the data.
Similarity between terms is measured using the Jaccard Index and shown through
weighted links (Rip and Courtial 1984). The circular maps facilitate
interpretation by placing the most frequently occurring terms at the center of the map.
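Continuing the toy example above, a Jaccard-style link weight for two terms is simply the number of documents containing both terms divided by the number containing either term; the sketch below is an illustration, not the measure as implemented in the cited study.
# Jaccard similarity between two terms, computed from the binary
# document-term matrix (dtm) built in the previous sketch.
jaccard <- function(dtm, term_i, term_j) {
  both   <- sum(dtm[, term_i] & dtm[, term_j])   # documents containing both terms
  either <- sum(dtm[, term_i] | dtm[, term_j])   # documents containing either term
  both / either
}
jaccard(dtm, "information", "literacy")   # 1: the two terms always appear together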
One of the major drawbacks of early term co-occurrence maps is the lack of
objectivity regarding term placement on the map. Terms are situated in two-
dimensional space in an ad hoc manner simply to facilitate ease of reading (Rip and
Courtial 1984). The arguably intuitive assumption that distance between terms in
the map corresponds to the terms’ similarity does not hold true. To address this
shortcoming, multidimensional scaling (MDS), a method from spatial-data analysis,
was introduced as a method for creating term maps. Using this approach, maps are
generated where terms are automatically placed using computer software so that the
distance between terms reflects the rate of co-occurrence, resulting in highly co-
occurring terms being placed in close proximity, forming clusters of similar terms
(Tijssen and Van Raan 1989). Ultimately this approach yields maps that are more
intuitive than previous term co-occurrence maps. However, map readability,
especially for larger term maps, still proves challenging due to overlapping term
labels and link density.
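A minimal sketch of that approach, again reusing the toy co-occurrence matrix from above: convert the co-occurrence counts into dissimilarities and let classical multidimensional scaling (R's built-in cmdscale) place the terms in two dimensions so that frequently co-occurring terms end up close together. The conversion used here is one simple choice for illustration, not the method used in the cited studies.
# Higher co-occurrence should mean a smaller distance between terms
dissimilarity <- as.dist(max(cooc) - cooc)
# Classical multidimensional scaling into two dimensions
coords <- cmdscale(dissimilarity, k = 2)
# Plot the term map: each term is placed at its MDS coordinates
plot(coords, type = "n", xlab = "", ylab = "", main = "Toy term map (MDS)")
text(coords, labels = rownames(cooc))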
More recently, computer programs such as VOSviewer enable the analysis of much
larger bodies of text and increase map readability simultaneously through
improvements in term placement. At the heart of the tool is a mapping technique
referred to as visualization of similarities (VOS), which differs from prior methods
for term placement. The VOS method improves on multidimensional scaling by
locating terms closer to their ideal coordinates on the map and by giving weight to
indirect similarities (van Eck and Waltman 2007). Additionally, previous tools for
visualizing term co-occurrence maps, such as SPSS or Pajek, suffer from problems
of labels overlapping and a lack of ways to explore small portions of the map in any
detail (van Eck and Waltman 2010). The VOSviewer program is highly flexible.
The tool can read data directly from Web of Science or Scopus, allowing users to
generate term maps from article abstracts, or from text files, allowing for the
creation of term maps from any text. Users can employ the VOS mapping method
to create maps from a data set in the tool itself or view maps created using
multidimensional scaling in other programs such as SPSS (van Eck and Waltman
2010). Once maps are created, either natively or in another tool, VOSviewer
provides two ways to visualize the data: the network visualization view or the
density visualization view. In the network visualization view terms are represented by
labels on top of circles. The size of the label and circle corresponds to the overall
frequency in the data set. The color of the circle corresponds to the cluster to which
the term has been assigned. In the density view, terms are represented by labels,
which again correspond to frequency in the data set. The color in the density view
ranges from blue (lowest density) to red (highest density). These color values are
determined by the number of nearest terms in the area around a point and the
weight, or relative frequency, in the case of term co-occurrence maps, in the data set
(van Eck and Waltman 2015). Each view offers users a unique way to uncover
patterns in the data. Additionally, users can view small portions of the map by using
a zoom and scroll functionality. Finally, the tool also offers the ability to take
screenshots of maps and to save both image and map files in a variety of formats.
While VOSviewer was initially designed to create bibliometric maps such as
journal citation maps, it performs well as a text-mining tool for creating term co-
occurrence maps, easily ingesting large amounts of text. Creating a term co-
occurrence map in VOSviewer involves four steps. In the first step, the tool
identifies noun phrases, which are word sequences consisting of only nouns and
adjectives, via part-of-speech tagging using the Apache OpenNLP tool kit (van Eck
and Waltman 2011). In the second step, VOSviewer identifies relevant terms, a
process that ultimately reduces clutter in the resulting map. To determine a term’s
relevance, the tool filters out more general noun phrases by distinguishing noun
phrases that co-occur with only a limited set of other noun phrases from those
that co-occur with many different noun phrases (Waltman, van Raan,
and Smart 2014). The third step involves mapping and clustering the terms using
the VOS mapping technique combined with a modified modularity-based
clustering approach (Waltman, van Eck, and Noyons 2010). Finally, the map is
displayed in both the network visualization view and the density visualization view.
VOSviewer has recently gained popularity for its ease of use, the intuitive maps it
generates, and its scalability. The tool has been used to study the evolution of
scholarship in academic domains ranging from land use and urban planning (Gobster
2014) to computer and information ethics (Heersmink et al. 2011). The tool is also
adept at illuminating connections between research areas in highly interdisciplinary
fields, such as the interface between engineering and physical sciences with health
and life sciences (Waltman, van Raan, and Smart 2014). Due to VOSviewer’s easy-
to-use interface, ability to ingest large volumes of text, and utility in showing
connections in highly interdisciplinary areas, it is a good tool for analyzing the
topical coverage of an institutional repository.
AN EXAMPLE OF A PROJECT
Background
This project began in early 2015 as a way to understand the current state of
IUPUI’s institutional repository, ScholarWorks. The first item was deposited in the
repository, which at the time was named IDeA (IUPUI Digital Archive), in August
2003 (Odell 2014). The first instance of IUPUI’s repository ran on the first version
of DSpace, which was released the year before. Early adopters on campus included
the School of Medicine, University Library, and Herron School of Art and Design
(Staum and Halverson 2004). Over the years the repository has grown and been
organized into different communities, with some of the original communities
subsumed as collections into larger communities. At the time of this study,
ScholarWorks archives over 4,000 unique items and hosts twenty-five communities,
spanning the sciences, social sciences, and humanities (see table 7.1).
Table 7.1. ScholarWorks communities and number of items
Community | Number of items
Theses, Dissertations, and Doctoral Papers | 1,255
School of Medicine | 1,136
Faculty Articles | 858
University Library | 772
School of Liberal Arts | 467
Office of the Vice Chancellor for Research | 286
School of Informatics and Computing | 241
Robert H. McKinney School of Law | 214
Lilly Family School of Philanthropy | 175
School of Education | 142
School of Public and Environmental Affairs | 78
School of Science | 70
Herron School of Art and Design | 64
School of Engineering and Technology | 55
School of Dentistry | 49
Richard M. Fairbanks School of Public Health | 41
Moi University/IUPUI Partnership | 38
School of Nursing | 37
Kelly School of Business–Indianapolis | 26
Indiana University-Purdue University Columbus | 23
Center for Service Learning | 17
School of Rehabilitation Sciences | 12
School of Physical Education & Tourism Management | 11
School of Social Work | 8
Alumni Works | 5
Initially the project was undertaken as a proof of concept, but it was also done
with an eye toward the future. One of the goals of this project is to serve as a
baseline against which to assess the evolution and growth of ScholarWorks as a
repository. This study proves timely due to the recent passing of a campus-level
open access policy. In October 2014, the IUPUI Faculty Council passed an open
access policy, encouraging faculty and researchers to make their scholarship as
openly available as possible (“Open Access Policy” 2015). While self-archiving is
not mandated by the policy (researchers are able to opt out on an article-by-article
basis), a significant component of the work involved in implementing the policy
centers on an aggressive outreach program aimed at helping faculty and researchers
self-archive their journal articles in ScholarWorks. Due to an increase in this work,
the number of submissions is expected to grow, expanding the repository's coverage
significantly in the coming years. Thus, studying the dominant research themes of
items archived in the repository at this point is an important first step in assessing
future expansion of repository coverage.
1. Launch the program and select create from the action panel menu on the
left of the tool. A pop-up will appear; select Create a map based on a text
corpus.
2. Choose the text file with the abstracts and titles. Load that file as a
VOSviewer corpus file. It is not necessary to use a VOSviewer scores file.
3. Set counting method to binary. This is preferred over full counting,
especially for larger bodies of text. Full counting uses every instance of a
term in a document to assess its similarity to others, while binary counting
uses only the presence or absence of the term (see the sketch following this
list). This prevents the maps from being skewed by a single term appearing
frequently within one document.
4. Ignore the thesaurus file. This file will eliminate certain noun phrases from
the final map. Terms can always be deselected at a later stage, but supplying
a thesaurus at this step can be helpful in eliminating potentially
nonmeaningful terms, such as results or methodology, from the resulting
map.
5. Set the minimum occurrence threshold. By default, VOSviewer uses a
threshold of ten, which works well for fairly large data sets. The total
number of terms in the ScholarWorks data set is 75,134 terms. Using a
minimum occurrence threshold of ten, the data set is pared down to 1,801
terms.
6. VOSviewer assigns relevance scores to each term. The distribution of
second-order co-occurrences of a single noun phrase over all noun phrases is
compared with the overall distribution of noun phrases over all noun
phrases; the greater the difference between these two distributions, the more
relevant the term is considered to be (van Eck and Waltman 2011). This
significantly reduces the number of terms to 60 percent of the terms above
the selected threshold. For the ScholarWorks data, reducing the terms to
the most relevant 60 percent results in 1,081 terms.
7. Verify selected terms and deselect any nonmeaningful terms outside the
scope of analysis. Clicking on the column heading for Occurrences or
Relevance allows for the sorting of these terms in either ascending or
descending order. Sorting by the most frequently occurring terms facilitates
the removal of nonmeaningful terms from the map. For example,
frequently occurring terms such as article could be removed from the
analysis. This ultimately makes the map easier to read and highlights
meaningful relationships between the terms. Generally, term deselection is
done in an ad hoc fashion and will vary depending on the data and goals of
the project. For the initial exploratory analysis of the ScholarWorks data, no
terms were deselected.
8. Click finish and VOSviewer performs mapping and clustering. Term co-
occurrence maps created from text files are available to view in either
Network Visualization or Density View. To change between views, click on
the tabs at the top of the main panel in the center of the tool.
9. Changing the clustering resolution increases or decreases the number of
clusters in the map, which can help uncover patterns in the data. To change
this parameter, click on the Map tab in the action panel on the left of the
tool. By default, the clustering resolution is set to 1.0. Increasing this
number produces more clusters in the map, and decreasing reduces the
number of clusters.
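To make the distinction in step 3 concrete, here is a tiny sketch in R (not VOSviewer's code) of full versus binary counting of a single term across three made-up documents.
# Three toy documents; "library" appears twice in the first one
sample_docs <- c("library data and library services",
                 "library assessment methods",
                 "student learning outcomes")
# Number of times "library" appears in each document
occurrences <- sapply(strsplit(sample_docs, "\\s+"),
                      function(words) sum(words == "library"))
full_count   <- sum(occurrences)       # every instance counts: 3
binary_count <- sum(occurrences > 0)   # each document counts once: 2
Binary counting prevents a term that is repeated heavily inside one document from dominating the similarity calculations.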
Results
The initial map shows six clusters of terms in the Network Visualization view (see
figure 13 in the photospread). The red cluster to the left of the map includes terms
associated with social sciences and humanities disciplines, the green and blue
clusters to the right include science-related terms, and the yellow cluster that
connects the two areas has many public health–related terms (see table 7.2). These
four clusters will be examined in detail later. However, it is worth analyzing the
remaining two clusters. The purple-colored cluster in the upper left of the map
contains terms that could not easily be assigned to one of the other clusters. This
occurs for two reasons. First, general terms, such as period, appear in many titles and
abstracts but do not co-occur frequently enough with any other specific terms to be
assigned to any of the other clusters. Second, terms in this cluster such as attorney
general and opinion are highly specific to a set of items within the repository. In the
case of attorney general, opinion, and official opinion, these terms refer to a historical
set of digitized opinions from the Indiana attorney general. Other terms such as
digital aerial photography, county, accuracy, and report are all associated with a set of
county horizontal accuracy reports, which provide aerial photographs of Indiana
counties. Due to the uniformity of the titles and lack of additional text that might
associate them with their respective disciplines, law and geography, these items are
clustered together.
Table 7.2. Top five most frequently occurring terms from each cluster
Term | Occurrences | Cluster | Color
student | 568 | Social Sciences & Humanities | Red
cell | 441 | Molecular Biology & Genetics | Green
function | 412 | Molecular Biology & Genetics | Green
experience | 376 | Social Sciences & Humanities | Red
program | 371 | Social Sciences & Humanities | Red
library | 353 | Social Sciences & Humanities | Red
community | 334 | Social Sciences & Humanities | Red
mechanism | 329 | Molecular Biology & Genetics | Green
protein | 327 | Molecular Biology & Genetics | Green
expression | 292 | Molecular Biology & Genetics | Green
property | 198 | Other Sciences & Dentistry | Blue
concentration | 169 | Other Sciences & Dentistry | Blue
teeth | 166 | Other Sciences & Dentistry | Blue
score | 165 | Public Health | Yellow
agent | 144 | Other Sciences & Dentistry | Blue
surface | 143 | Other Sciences & Dentistry | Blue
diabetes | 99 | Public Health | Yellow
predictor | 92 | Public Health | Yellow
reliability | 89 | Public Health | Yellow
item | 86 | Public Health | Yellow
The light blue cluster consisting of two terms, una and cultura, represents a small
number of Spanish language items in the repository, all of which are found in the
Theses, Dissertations, and Doctoral Papers community. VOSviewer is designed for
data in English and cannot perform part-of-speech tagging on other languages,
which is why the article una made it through to the map and was not excluded
during stopword removal. However, the presence and clustering of these terms
suggest some possibility for a basic language-based map for multilingual
repositories. Due to the limited number of foreign-language materials in
ScholarWorks, this type of analysis is beyond the scope of this study.
The largest cluster is the humanities and social sciences cluster at the left of the
map, including 478 terms (see figure 13 in the photospread). Upon initial review,
the terms that stand out the most include student, program, experience, and library. It
is not really surprising that library-related terms figure so prominently in this
cluster. The University Library community is the fourth largest in ScholarWorks,
which is likely due to the fact that librarians are more aware of this service and are
often advocates for open access. However, it is interesting that despite its relatively
small size, especially when compared to the School of Medicine and Theses,
Dissertations, and Doctoral Papers communities (see table 7.1), terms from this
community dominate the map. This suggests the presence of a large amount of
library-related research in the repository or that these items use similar language to
describe the research.
Switching to the density visualization view provides more information on the
overall structure of the map (see figure 14 in the photospread). It is immediately
apparent that the highest term density occurs at the center of the social sciences and
humanities cluster. The highest density area centers on the term student, which
makes sense given that it is the most frequently occurring term in the data set. The
next two highest areas of term density occur in the science clusters, centered on the
terms cell and function. The area connecting the science clusters with the social
sciences and humanities clusters, containing public health terms, has a relatively low
term density compared to the rest of the map.
To examine the social sciences and humanities cluster more closely, the clustering
resolution is increased in VOSviewer to provide a more granular view. The default
clustering resolution of 1.0 does not provide much detail (see figure 15 in the
photospread). However, changing this parameter to 2.0 yields a map with sufficient
granularity to see different research areas (see figure 16 in the photospread).
There are now four prominent subclusters present. The largest of these subclusters
is the arts and humanities (green) and is spread across the upper portion of the map.
Within this subcluster, the most frequently occurring terms are experience, history,
place, world, and idea. It is important to note that while the terms experience and
history appear in this subcluster, they are centrally located on the map, suggesting
their use as terms in a variety of items across the social sciences and humanities and
providing an example of how VOSviewer handles indirect similarities. The next
largest subcluster includes terms that are related to the scholarship of education
(yellow) in the lower left of the social sciences and humanities cluster. The most
frequently occurring terms in this cluster include student, program, education,
opportunity, and university. It is interesting to note the overlap between this
subcluster and the adjacent library research subcluster (gold), above the scholarship
of education subcluster. In fact, the term information literacy, which is too small to
appear in figure 16 in the photospread but can be seen in figure 17 in the
photospread, spans the boundary between these two subclusters. The library
research cluster is dominated by terms that include article, resource, and service. The
last subcluster within the social sciences and humanities cluster is government, public
policy, and law, which can be seen in purple at the top of the social sciences and
humanities cluster. The most frequently occurring terms in this cluster include
United States, law, opinion, government, and right.
The right side of the term map (figure 13 in the photospread) is dominated by the
two science clusters, which include the biophysics and dentistry cluster (blue) and
the molecular biology and genetics cluster (green). Examining the structure of the
two clusters yields nothing unexpected. For example, the term mechanical property
appears toward the bottom of the biophysics and dentistry cluster, far away from
terms such as protein protein interaction, which occurs at the top of the molecular
biology and genetics cluster due to a high level of dissimilarity (see figure 12 in the
photospread). Conversely, highly similar terms such as disease and resistance occur at
the boundary between these two clusters. To identify further patterns, the clustering
resolution is changed. Increasing the clustering resolution parameter to just 1.5
results in a clearer distinction between the dentistry-related terms (purple) and
biophysics terms (light blue) to their right, which include mostly bone-related
research (see figure 12 in the photospread). To confirm the relatively large amount of
bone-related research, a quick keyword search is done in ScholarWorks for the term
bone, returning 761 results.
Even at this level of clustering, all the molecular biology and genetics terms appear
clustered together, represented by the green-colored terms (see figure 12 in the
photospread). Increasing the clustering resolution to 2.0 produces higher
granularity, but without validation by a subject-matter expert it is difficult to
identify any meaningful subclusters or patterns in the data (see figure 18 in the
photospread). However, even with expert input, this research area could still lack
any easily identifiable clusters of terms because of either the relatively small amount
of data or the diversity of research in this area.
Perhaps the most interesting feature of the map is the cluster that connects the
three clusters of social sciences and humanities, biophysics and dentistry, and
molecular biology and genetics. The yellow cluster that bridges the sciences with the
social sciences and humanities contains many public health–related research terms.
This cluster is the most widely dispersed in the map, with terms scattered among
the social sciences and humanities cluster, and the two sciences clusters. In total, the
public health cluster contains 145 terms, which include frequently occurring terms
such as diabetes, predictor, mortality, depression, and race. There are also a number of
terms that indicate the heavy use of surveys as a data collection method, such as
score, item, and questionnaire.
Probably the most interesting feature of the public health cluster is where it
intersects with the other clusters on the map. As an interdisciplinary field, there is
much overlap between public health and other areas. At the intersection of the
public health cluster with the social sciences and humanities cluster, terms that
indicate health economics research such as consumer, patient care, and health care
system are found. Additionally, terms such as race, income, and disparity are found at
the edge of the public health cluster and the social sciences cluster, indicating the
presence of sociological and public policy health-related research. On the opposite
side of the public health cluster, terms that are more often associated with health-
related research in the sciences are found. Terms such as smoking, cardiovascular
disease, and infection intermingle with the terms in the two science clusters.
Discussion
The distribution of term densities across the map is interesting and somewhat
unexpected. The relatively high density of terms in the social sciences and humanities
cluster was surprising, given that the majority of research at IUPUI is happening in
medicine and health sciences. When the two science clusters are combined, they
total 442 terms, which is roughly similar in size to the social sciences and
humanities cluster, with 478 terms. However, the density of terms appears far
greater in the social sciences and humanities cluster. This raises interesting questions
about the research that is archived in these areas. Perhaps research in the social
sciences and humanities has a more limited set of terms with which to describe the
research being done. Or perhaps the research archived in ScholarWorks in the social sciences and humanities is concentrated on similar topics, such as student engagement.
Whatever the case, it appears that the research in the sciences that is archived in
ScholarWorks is more diverse than the research in the social sciences and
humanities, at least based on the terms used to describe this research. This
difference represents an area where ScholarWorks may not accurately reflect the
research landscape of the institution and is something to which librarians should
give consideration. Those librarians serving faculty in the social sciences and
humanities should take steps if possible to ensure that the full range of research
happening in their departments is accurately reflected.
The overall structure of the map provides further insight into the connections
between major research areas. As mentioned earlier, IUPUI is a campus with a
strong emphasis on the health sciences, and as such it is not surprising to see so
many health-related terms scattered throughout the map. In this way, the term map
serves as an apt metaphor for campus, with researchers focusing on health-related
issues physically spread across campus in various departments. Furthermore, it is
interesting to see how distinctly the public health cluster bridges the gap between
the social sciences and humanities cluster with the two science clusters, providing
evidence for the highly interdisciplinary nature of public health research. However,
one of the major challenges in this project reveals itself in the structure of the map.
The small collections of specific items, usually with uniform titles such as the
Opinions of the Attorney General of Indiana collection in the Robert H. McKinney
School of Law community, create separate clusters that are not connected to the rest of the map, making the map difficult to interpret. If viewers are unaware of these collections and their uniform titles, which inflate the frequency of certain words, they might attach too much importance to these clusters. While these
clusters do provide important insight into the contents of the repository, they
distract from the more interesting relationships between the research areas that are
depicted in the rest of the map. Therefore, librarians engaged in creating these types
of term maps should have some basic level of familiarity with the contents of their
repositories and, as should always be the case, approach the resulting maps with a
critical eye. Another challenge related to the structure of the map and cluster
formation pertains to the way bodies of text containing many different research
areas do not always form coherent clusters. While VOSviewer can show the
connections between interdisciplinary areas of research, it relies on a sufficient amount of high-quality data. The ScholarWorks data set needs to be larger to more accurately delineate the relationships among the research areas present in the repository.
Despite the relatively small amount of data, there are many groups of terms in the
clusters that point to easily identifiable research areas. Some of the more prominent
terms provide clues about institutional values, or at least the values of those actively
engaged in supporting the repository. For example, terms related to student
engagement and educational research figure prominently in the social sciences and
humanities cluster. Much of this research is archived in the Center for Service
Learning community. However, it is interesting to compare the prevalence of these
terms with the relatively small size of the community, suggesting that these terms
are used throughout the social sciences and humanities cluster. This pattern meshes
well with many of IUPUI’s institutional values, which prize student engagement
and student learning as key values. Similarly, the health-related research across the
disciplines and not just in the health sciences is strongly indicative of IUPUI’s
culture. Programs such as Medical Humanities & Health Studies2 and new degrees
such as the PhD in Health Communication3 mean that health research terms show
up in unexpected places, as evidenced by the many health-related terms at the
bottom left of the social sciences and humanities cluster. However, these terms do
not form any easily identifiable clusters, due in equal parts to the small number
of items in these research areas and the difficulty in clustering interdisciplinary
research. One of the limitations of using term co-occurrence maps to draw
conclusions about the nature of research archived in an institutional repository is
how susceptible they are to individual researchers with many items on the same
topic. For example, much of the bone-related research in the biophysics subcluster
(see figure 18 in the photospread) is attributable to one researcher at the university.
The “repeat customer” phenomenon can make it seem as though a lot of research is
being done institutionally in a particular area when in reality there are ten articles
from one researcher on a single topic. Again, accurate interpretation of these maps
relies heavily on a knowledge of the repository’s contents.
There are a number of areas noticeably absent from the ScholarWorks term map.
Given the strong presence of an engineering program on campus, it is surprising to
see the lack of an engineering cluster or at least a significant number of engineering-
related terms. Another gap in the map is in the area of physics. These gaps are
confirmed by consulting the repository. Only one item is archived in the Physics
collection within the School of Science community, and the School of Engineering
and Technology community has only fifty-five items. Further gaps include math,
chemistry, and chemical biology. The lack of chemistry-related research is not
surprising due to issues around research-related patents and trepidation toward
open access. Despite the lack of some areas in the map, there are small clusters of
terms that suggest emerging areas in the repository. Identifying a potential emerging
area requires a general knowledge of the institution and its research. One potential emerging area at IUPUI is philanthropy, with the recent founding of the Philanthropic Studies program in the Lilly Family School of Philanthropy. Terms related to
this emerging area appear in the social sciences and humanities cluster, just above
the library-related terms, and include philanthropy, giving, grant, fund, and nonprofit
organization.
CONCLUSION
This chapter demonstrates how librarians can visually represent the research
archived in library-run institutional repositories using term co-occurrence maps.
Specifically, these maps demonstrate different research clusters around themes in
the sciences, social sciences, and humanities. Somewhat unexpectedly, the highest
density of terms appears in the social sciences and humanities, followed by the
sciences. These two sections of the map are connected by public health. This map
serves as a valuable resource to subject librarians in two primary ways. First, the map
charts the research landscape of the institution, showing connections that, while obvious to some, are new to others. For example, some librarians may be unaware of
just how pervasive health-related research is on IUPUI’s campus, showing up in
social sciences and humanities research as well as in the sciences. Second, the map
identifies gaps in the repository’s coverage. One prominent example is the relatively
small amount of scientific research outside of the health sciences. Many of these
gaps are evident when looking directly at the numbers of items in the collections
that make up the ScholarWorks communities, but visualizing the entire repository
as one term map brings these gaps into context.
The two biggest limitations of these term maps are the relatively small data set and
the necessary reliance on subject-matter expert input for interpretation. These maps
are made with the titles and abstracts from 4,346 items, which is a relatively small
amount of data for this type of large-scale textual analysis. Furthermore, the
relatively small amount of data makes these term maps susceptible to being skewed
by small special collections with uniform titles, such as the Opinions of the Indiana
Attorney General, and single researchers who have a number of articles on the same
topic. However, as the repository expands in size it will be less vulnerable to being
skewed and will more accurately reflect the institution’s research landscape.
Additionally, input from subject-matter experts will result in a more comprehensive
analysis. Many librarians lack the specialized knowledge to connect clusters of terms
with the research areas these terms potentially represent. For the ScholarWorks term
map, this is especially true in the sciences, where a lack of expert knowledge allows
for only the general classification of clusters as dentistry, biophysics, and molecular
biology and genetics.
Future iterations of this project will need to include an interpretation and
validation phase that involves input from faculty or other subject-matter experts on
cluster identification. This input will facilitate librarians’ understanding of the map
and improve everyone’s understanding of the research landscape at IUPUI.
Furthermore, a much larger high-quality data set will improve the resulting map. As
more time passes since the implementation of the campus-level open access policy
and librarians work to mediate submissions of faculty research, the amount of text in the repository available for analysis will only continue to grow. Replicating these term
maps in a year or two years will yield a much fuller picture of the research landscape
and potentially provide insight into new and emerging research areas on campus.
Despite the drawbacks of the ScholarWorks term maps, they are still useful for
librarians planning outreach around the open access policy. With these term maps
in mind, librarians should focus on increasing the diversity of social sciences
research beyond library and education research and increase the repository’s
holdings in scientific research beyond the health sciences. Lastly, these maps have
the potential for helping librarians, particularly those new to campus, to begin to
chart the research and intellectual landscape at their institutions.
NOTES
1. Visualizing the topical coverage of an institutional repository using VOSviewer.
http://hdl.handle.net/11243/9.
2. Medical Humanities & Health Studies. http://liberalarts.iupui.edu/mhhs/.
3. Communication Studies. http://liberalarts.iupui.edu/comm/.
REFERENCES
Callon, Michel, Jean-Pierre Courtial, William A. Turner, and Serge Bauin. 1983. “From Translations to
Problematic Networks: An Introduction to Co-Word Analysis.” Social Science Information 22(2): 191–235.
doi:10.1177/053901883022002003.
Gobster, Paul H. 2014. “(Text) Mining the LANDscape: Themes and Trends over 40 Years of Landscape and
Urban Planning.” Landscape and Urban Planning 126 (June): 21–30. doi:10.1016/j.landurbplan.2014.02.025.
Heersmink, Richard, Jeroen van den Hoven, Nees Jan van Eck, and Jan van den Berg. 2011. “Bibliometric
Mapping of Computer and Information Ethics.” Ethics and Information Technology 13(3): 241–249.
doi:http://dx.doi.org/10.1007/s10676-011-9273-7.
Odell, Jere. 2014. “Building, Growing and Maintaining Institutional Repositories.” Presented at the Michiana
Scholarly Communication Librarianship Conference, IUSB, South Bend, IN, October 20.
“Open Access Policy, IUPUI Faculty Council (October 7, 2014) | Open Access @ IUPUI.” 2015. Accessed
May 20. https://openaccess.iupui.edu/policy.
Peters, H. P. F., and A. F. J. van Raan. 1993. “Co-Word-Based Science Maps of Chemical Engineering. Part I:
Representations by Direct Multidimensional Scaling.” Research Policy 22(1): 23–45. doi:10.1016/0048-
7333(93)90031-C.
Rip, Arie, and J. Courtial. 1984. “Co-Word Maps of Biotechnology: An Example of Cognitive Scientometrics.”
Scientometrics 6(6): 381–400.
Staum, Sonja, and Randall Halverson. 2004. “IDEA: Sharing Scholarly Digital Resources.” IUPUI,
Indianapolis, IN, February 27.
Tijssen, R., and A. Van Raan. 1989. “Mapping Co-Word Structures: A Comparison of Multidimensional
Scaling and LEXIMAPPE.” Scientometrics 15(3-4): 283–295.
van Eck, Nees Jan, and Ludo Waltman. 2007. “VOS: A New Method for Visualizing Similarities Between
Objects.” In Advances in Data Analysis, edited by Reinhold Decker and Hans-J. Lenz, 299–306. Studies in
Classification, Data Analysis, and Knowledge Organization. Springer Berlin Heidelberg.
http://dx.doi.org/10.1007/978-3-540-70981-7_34.
———. 2010. “Software Survey: VOSviewer, a Computer Program for Bibliometric Mapping.” Scientometrics
84(2): 523–538. doi:10.1007/s11192-009-0146-3.
———. 2011. “Text Mining and Visualization Using VOSviewer.” arXiv:1109.2058 [cs], September.
http://arxiv.org/abs/1109.2058.
———. 2015. “VOSviewer Manual (Version 1.6.0).”
Waltman, Ludo, Nees Jan van Eck, and Ed C. M. Noyons. 2010. “A Unified Approach to Mapping and
Clustering of Bibliometric Networks.” Journal of Informetrics 4(4): 629–635. doi:10.1016/j.joi.2010.07.002.
Waltman, Ludo, Anthony F. J. van Raan, and Sue Smart. 2014. “Exploring the Relationship between the
Engineering and Physical Sciences and the Health and Life Sciences by Advanced Bibliometric Methods.”
PLoS ONE 9(10): e111530. doi:10.1371/journal.pone.0111530.
8
Visualizing Archival Context and Content
for Digital Collections
Stephen Kutay
A NOTE ON TERMINOLOGY
To provide clarity, definitions for the following terms are provided as they pertain
to the subject of this chapter. Digital collections are not restricted to primary
sources; however, digital collections are referenced here as collections of digital
surrogates (or digital objects) derived from documents that make up physical and/or
digital archival collections. Source collection refers to the archival collections from
which a digital collection is, at least in part, derived. Record refers to the original (or
archival) source document from which a digital surrogate (or object) is derived.
Agent refers to the creator of a record. Digital heritage is referenced here to broadly
describe the publication and promotion of surrogates of archival materials online.
ARCHIVAL CONTEXT
To clarify the meaning of the term as it is applied here, A Glossary of Archival and
Records Terminology defines “context” as:
1. The organizational, functional, and operational circumstances surrounding materials’ creation, receipt,
storage, or use, and its relationship to other materials. 2. The circumstances that a user may bring to a
document that influences that user’s understanding of the document. (Pearce-Moses 2005, 90)
When visualizations are interactive, they provide additional access points that help to restore arrangement: for example, results presented in original order, or a list of all surrogates in a specific record series, in addition to ordered lists of objects by place, date of creation, or document genre.
Locating Context
Context for archival collections (and by extension digital collections) resides in
multiple locations. Precisely what constitutes context may vary depending on the
perspective of the user/researcher. Descriptive metadata that may logically pertain to
content could under some conditions provide context depending on the nature of
one’s research inquiry. For example, a box list inventory in a finding aid used to
order and describe the content of an archival collection could also help demonstrate
the frequency with which an agent was engaged in some kind of activity; in that case,
the number of records supporting (or not supporting) such activity provides the
context for a researcher’s argument or assertion. Therefore, context is relational,
elusive, and target dependent (Lee 2011, 96). Contexts shift and emerge according
to new needs, and the line between context and content is not always clear. Add to
this the emergence of digital humanities that facilitate computational analyses of
digitized works and corpora, and it becomes unreasonable to predict how and why a
researcher is interested in a collection or a record or its digital surrogate. However,
locating original context can be mediated for the benefit of students and researchers
alike. Some of the most useful sources for locating context for digital collections
include finding-aid metadata, individual records, digital collection–level metadata,
annotations, publications, related collections, and encoded archival context (EAC).
Finding-Aid Metadata
Finding aids place archival materials in their original context (Pearce-Moses 2005,
168) by supplying a biographical or historical narrative regarding the person, family,
or organization responsible for creating the collection. A section devoted to the
scope and contents of the collection reveals the extent and media that make up each record series, which together illustrate the arrangement of the records in a collection according to the activities of the agent(s). Finding aids are critical sources for
locating original context that help inform the description of a digital collection. The
rules for creating finding-aid metadata are provided by Describing Archives: A
Content Standard (DACS), which designates the elements and formatting of values
for archival description in finding aids. Elements that are especially rich with
context are noted below and paraphrased from DACS.
Individual Records
Some individual records within archival collections are especially rich with context.
Articles of incorporation, mission statements, mandates, diaries, correspondence, or
other documents help supply evidence to support the administrative and
biographical narratives that are commonly found in finding aids and digital
collections. Graphic hierarchies and networks communicate relationships,
structures, and functions, thereby placing the records in context to the overall
activities of the group. To leverage these relationships interactively, however, a
corresponding metadata element must be assigned to describe each functional entity
responsible for the creation of each document in a collection. For example, a
department or other designated role can be assigned to each digital object by using
creator, contributor, or description elements. In doing this, metadata serve to mediate
access to the corpus of documents of each functional entity via the URL produced
by searching the appropriate field in which the entity is described.
ARCHIVAL CONTENT
In contrast to context, content speaks to the “informational value” (Schellenberg
1984, 58) of a record beyond its significance as evidence and provides the
“intellectual substance” (Pearce-Moses 2005, 89) often useful to researchers. It
reflects what a record is and what it is about (as opposed to what it historically
represents or demonstrates). With respect to aggregations of records, a collection’s
scope of documentary genres and subjects gives users of digital collections a
comprehensive idea of a collection’s content. As such, visualizing content enables
users to understand the parameters of a digital collection according to its breadth
and scope, while providing exceptional access points based on authorized terms used
to describe the activities and types of documents created by them. Archival content
is best reflected by archival and digital collection metadata elements such as
document type or genre, extent, subject (as it pertains to what something is “about”),
and archival arrangement (as it pertains to document types or genres produced).
Note that subject and arrangement can represent both context and content
depending on what is described or suggested in the descriptions. A user’s exposure
to the overall content of a digital collection is critical to understanding what a
collection has to offer that user and subsequently provides a way for her or him to
make informed choices regarding search queries.
Table 8.1. Example of metadata from a digital collection of correspondence, stripped of all fields
except those required to populate a data visualization with context, content, and a means to
access the resource (URL)
Letter   Date (context)   Author (content)    Country (context)   Language (context)   Topic (context, content)    URL (access)
1        1912-12-06       Suydam, Lambert     United States       eng                  Musicianship                http://digital-
2        1912-12-17       Bone, Philip J.     Great Britain       eng                  Musicianship                http://digital-
3        1913-04-02       Laurie, Alexander   United States       eng                  American Guitar Society     http://digital-
Specifications
Type: Timeline
Mode: Interactive
Host technology: Viewshare (Library of Congress)
Purpose: Communicate context and content through temporal distribution
Recommended metadata: Author, Date
Optional metadata: Broad Topic, Place of origin, URLs (to access digital objects)
Instructions
1. Create a new data set from a selected set of fields such as author, date,
country of origin, and digital object URL (a small R sketch following these
instructions shows one way to select these fields).
2. Upload the data set into Viewshare.
3. Configure the Timeline view to publish and embed, if desired.
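If the collection metadata live in a spreadsheet or CSV export, step 1 can also be scripted. The following is a minimal sketch in R (the language used in chapter 9), assuming a hypothetical export named collection-metadata.csv with columns Author, Date, Country, and URL; actual file and column names will vary by collection.

library(dplyr)

# Read the exported collection metadata (hypothetical file and column names)
meta <- read.csv("collection-metadata.csv", stringsAsFactors = FALSE)

# Keep only the fields needed for the Viewshare timeline view
timeline_data <- meta %>%
  select(Author, Date, Country, URL)

# Write the trimmed data set to a new CSV for upload to Viewshare
write.csv(timeline_data, "timeline-data.csv", row.names = FALSE)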
Specifications
Type: Scatterplot/timeline
Mode: Static
Host technology: Microsoft Excel or other spreadsheet application (alternative:
Google Charts)7
Purpose: Communicate context and content through temporal distribution
Recommended metadata: Date (year), date frequency (calculated)
Optional metadata: Broad Topic, Subject, or other desired element
Instructions
1. Custom format the date column of the collection metadata to display only
year (yyyy).
2. Alphabetically sort the Broad Topic field (or subject).
3. Copy and analyze the frequency of years per Broad Topic. This can be
achieved by (1) searching the database for both author and topic and then
sorting by year or (2) using a word-frequency calculator (a small R sketch
following these instructions shows a scripted alternative).
4. Create a new data set and paste the year frequency data into columns each
labeled by a Broad Topic.
5. Assign a scatterplot chart from the spreadsheet application (if charts are
supported).
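For librarians comfortable with R (the approach taken in chapter 9), steps 1 through 4 can also be scripted rather than done by hand. A minimal sketch, assuming a hypothetical export named collection-metadata.csv with columns Date and Broad.Topic; actual file and column names will differ:

library(dplyr)
library(tidyr)

# Read the exported collection metadata (hypothetical file and column names)
meta <- read.csv("collection-metadata.csv", stringsAsFactors = FALSE)

year_freq <- meta %>%
  mutate(Year = substr(Date, 1, 4)) %>%   # keep only the year (yyyy)
  group_by(Broad.Topic, Year) %>%
  summarize(Count = n()) %>%              # frequency of each year per Broad Topic
  ungroup() %>%
  spread(Broad.Topic, Count, fill = 0)    # one column per Broad Topic, as in step 4

head(year_freq)

The resulting table can then be pasted back into the spreadsheet or charted directly.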
Specifications
Type: Geolocation map (or geographic distribution)
Mode: Interactive
Host technology: Library of Congress’s Viewshare (alternative: Google Fusion
Table)
Purpose: Communicate context and content through geographic distribution
Required metadata: Author (or Title), Country (or coordinates)
Optional metadata: Date, Language, Broad Topic, URLs (to access objects)
Instructions
1. Create a new data set from at least the following fields: author, date,
country of origin, and digital object URL.
2. Upload the data set into Viewshare.
3. Configure the Map view to publish and embed, if desired.
Specifications
Type: Representational interface
Mode: Interactive (optional static display)
Host technology: HTML (alternative to HTML coding: Adobe Muse application)
Purpose: Communicate context and content through the transformative use of an
archival document
Recommended metadata elements: Creator, Genre, Subject, Extent
Instructions
The interface is dependent on the type of document and its structure; therefore
instructions are limited to aggregating the metadata that are desired and relevant to
display over (or from) the chosen document. Knowledge of HTML coding is
recommended, if not required.
Visualizing Narratives
Many digital collections are accompanied by introductory text that provides archival
context in a narrative format. As previously mentioned, this effective method of
grounding a collection to its archival counterpart becomes complicated or
overwhelming when a digital collection incorporates materials from multiple
archival collections. In thematic collections, the argument for communicating
context is even greater. One solution is to treat the individual narratives as access
points into the collection, thereby promoting use of the materials according to
provenance.
In this example, six independent collections are used to populate a digital
collection based on a history of water development in Los Angeles.8 Rather than
place the narratives together, forming a lengthy page of text, this visual element
utilizes an HTML gallery menu to show/hide narrative text of each collection via
rolling over the collection name. Users can generate all the digital objects within
each archival collection by clicking the collection name (see figure 23 in the
photospread).
Specifications
Type: Navigational gallery menu
Mode: Interactive
Host technology: HTML
Purpose: Independently communicate contexts across multiple collections
Recommended metadata: Source collection and Administrative/Biographical history
and/or Scope and Contents (from finding aid)
Instructions
Specifications
Type: Tree chart
Mode: Interactive
Host technology: Scalar (alternatives: D3, JavaScript InfoVis Toolkit)
Purpose: Communicate context and content through a display of a collection’s
physical (series) arrangement. This is especially useful for digital collections of
multiple archival collections.
Recommended metadata: Source Collection and Arrangement from “Scope and
Contents” in the finding aid
Optional data: Title, URL (to access objects)
Instructions
1. In Scalar, set up one page for every collection, series, and subseries (if
applicable).
2. Call up each page to edit and tag associated pages (and media, if applicable)
and then save.
3. Click “Explore/Visualization” on the left toolbar menu.
4. Click “Paths” to display.
Visualizing Relationships
Records are associated with their creators and the functions they serve. Radial charts
effectively communicate these relationships because of the differences in position
each point occupies around the circle, which enables connections between all
points. In this example, contexts are communicated through the visible association
of records with their creators and function(s). Hovering over any record, function, or
person (entity) highlights these relationships through the connections that appear in
the chart. Like the tree chart, this example was created using the Scalar web
development platform; however, this visualization type is available as part of
JavaScript libraries (see figure 25 in the photospread).
Specifications
Type: Radial chart
Mode: Interactive
Host technology: Scalar (alternative: D3)
Purpose: Communicate context and content through a display of relationships or
associations between records, their creators, and the functions they serve
Recommended metadata: Creator or Contributor, Title
Optional metadata: Function or Functional Entity (preferred from a repeated
instance of Description or other dedicated element), URLs (to access objects)
Instructions
1. In Scalar, set up one page for every creator, record, and function (if
applicable).
2. Call up each page to edit and tag associated pages (and media, if applicable)
and then save.
3. Click “Explore/Visualization” on the left toolbar menu.
4. Click “Radial” to display.
Specifications
Type: Combo chart
Mode: Interactive
Host technology: HTML (alternative to HTML coding: Adobe Muse application)
Purpose: Communicate provenance, contexts, and contents for digital collections
comprising multiple archival collections.
Recommended metadata (per archival collection): Source Collection, Date, Extent,
Document Genre (or type), Subject
Optional data: Source Collection URLs (from a database search in the Source
Collection field), Date URLs (from a database search of the date range by decade)
Instructions
MOVING FORWARD
One important indicator of the future of digital collections is the coordination of
many academic and heritage institutions, both large and small, that are invested in
making archives discoverable from even the most remote locations. Massive digital
libraries such as the DPLA have helped remove barriers to archives that otherwise
lack the resources to place their materials online. Archives with digital collections
programs have begun to share their metadata with the DPLA, OCLC, and other
regional services to increase the discoverability and utility of their materials by
reaching audiences far beyond the localities that their collections represent. Another
indicator of digital collections growth is that granting agencies are focusing more on the digitization, preservation, and availability of archival materials online,13
not just the physical processing required to make them initially accessible to
researchers.
With digital collections becoming more discoverable, there is little doubt that
online archives will continue to gain momentum. Yet important questions must be
considered. What kinds of purposes will future researchers have for “big digital
heritage”? To what extent will heritage data become further removed from the
contexts that embed them in their local communities and give them meaning? Can
archival surrogates withstand their digital mobility and malleability as part of a
menagerie of materials among online social and market spaces? Regardless of
whether such concerns ever materialize, we are well served by investigating methods that help ground digital surrogates in their original context should any future
counterweight be needed to balance emerging ways of interacting with these sources
that prompt new archival theorizing. It could be that metadata visualizations and
interfaces for digital collections offer at least one helpful step toward that end as
tools for mediating archival context and content.
NOTES
1. Metadata sharing from data providers that operate digital libraries is facilitated by the Open Archives
Initiative (2015) Protocol for Metadata Harvesting (OAI-PMH), which maps all elements used to describe a
resource to simple Dublin Core, where the metadata is ingested, indexed, and delivered to users by Internet
service providers.
2. The DPLA and its European counterpart provide portals for searching digital heritage materials throughout
the United States and Europe, respectively.
3. “International Standard Authority Record for Corporate Bodies, Persons and Families,” International
Council on Archives, April 1, 2004, http://www.icacds.org.uk/eng/ISAAR(CPF)2ed.pdf.
4. “D3 Data-Driven Documents,” accessed August 14, 2015, http://d3js.org/. “Dygraphs,” accessed August
14, 2015, http://dygraphs.com/. “InfoVis Toolkit,” accessed August 14, 2015, https://philogb.github.io/jit/.
These libraries consist of data visualization templates that utilize JavaScript programming language to populate
sophisticated graphic renderings of data.
5. “Viewshare,” Library of Congress, accessed August 14, 2015, http://viewshare.org/. Viewshare is a free
service for creating and embedding or linking useful data visualizations for libraries. Visualizations include
maps, timelines, scatterplots, bar and pie charts, tables, and galleries. Prospective users must request an account.
6. “Vahdah Olcott-Bickford Correspondence,” Delmar T. Oviatt Library, California State University
Northridge, accessed August 14, 2015, http://digital-library.csun.edu/cdm/landingpage/collection/VOBCorr.
7. “Google Charts,” Google Developers, accessed August 14, 2015, https://developers.google.com/chart/?
hl=en. Google charts provide a large set of JavaScript visualizations from which to choose.
8. “Water Works: Documenting Water History in Los Angeles,” Delmar T. Oviatt Library, California State
University Northridge, accessed August 14, 2015, http://digital-library.csun.edu/WaterWorks/.
9. “Scalar,” The Alliance for Networking Visual Culture, accessed August 17, 2015, http://scalar.usc.edu/.
Data visualizations in Scalar are created through tagging pages, digital media, and other elements used in
presentations (or online exhibits) of media with text.
10. “Water Works.”
11. As with Koshman’s “Testing User Interaction,” Bergström and Atkinson’s study, “Augmenting Digital
Libraries with Web-Based Visualizations,” compared the effectiveness of visualizations used to reduce the
mental activities and barriers required to locate relevant sources of academic information.
12. Some studies offering data or guidelines that address effective visualizations for specific disciplines are
Gehlenborg and Wong, 2012, “Into the Third Dimension,” 851; Gehlenborg and Wong, 2012, “Power of the
Plane,” 935; and Kelleher and Wagener, 2011, “Ten Guidelines for Effective Data Visualization in Scientific
Publications,” 822–827.
13. Funded by the Andrew W. Mellon Foundation, “Digitizing Hidden Special Collections and Archives” is a
new program of the Council on Library and Information Resources (2014), which effectively replaces the
program “Cataloging Hidden Special Collections and Archives.”
REFERENCES
Appadurai, Arjun. 1988. “Introduction: Commodities and the Politics of Value.” In The Social Life of Things:
Commodities in Cultural Perspective, edited by Arjun Appadurai, 3–63. Cambridge, UK: Cambridge University
Press.
Bergström, Peter, and Darren C. Atkinson. 2010. “Augmenting Digital Libraries with Web-Based
Visualizations.” Journal of Digital Information Management 8(6): 377–386. http://go.galegroup.com/ps/i.do?
id=GALE|A250885768&v=2.1&u=csunorthridge&it=r&p=CDB&sw=w&asid=f9461a7badcc72732806acd9ff2c4c39
Conway, Paul. 2014. “Digital Transformations and the Archival Nature of Surrogates.” Archival Science 15(1):
51–69. doi:10.1007/s10502-014-9219-z.
Council on Library and Information Resources. 2015. “Digitizing Hidden Special Collections and Archives.”
Council on Library and Information Resources. Accessed August 17, http://www.clir.org/hiddencollections.
Dey, Anind K. 2001. “Understanding and Using Context.” Personal and Ubiquitous Computing 5(1): 4–7.
doi:10.1007/s007790170019.
Digital Public Library of America. 2015. “Become a Hub.” Accessed August 17, http://dp.la/info/hubs/become-
a-hub/.
Duff, Wendy M., Emily Monks-Leeson, and Alan Galey. 2012. “Contexts Built and Found: A Pilot Study on
the Process of Archival Meaning-Making.” Archival Science 12(1): 69–92.
doi:http://dx.doi.org.libproxy.csun.edu/10.1007/s10502-011-9145-2.
Europeana. 2015. “Why Become a Data Provider?” Europeana Pro. Accessed August 17,
http://pro.europeana.eu/page/become-a-data-provider.
Gehlenborg, Nils, and Bang Wong. 2012. “Into the Third Dimension: Three-Dimensional Visualizations Are
Effective for Spatial Data But Rarely for Other Data Types.” Nature Methods 9(9): 851.
http://go.galegroup.com/ps/i.do?
id=GALE|A302298887&v=2.1&u=csunorthridge&it=r&p=HRCA&sw=w&asid=1354e6b6b2e3c36a85980c2781c1c4e6
———. 2012. “Power of the Plane: Two-Dimensional Visualizations of Multivariate Data Are Most Effective
When Combined.” Nature Methods 9(10): 935. http://go.galegroup.com/ps/i.do?
id=GALE|A304942671&v=2.1&u=csunorthridge&it=r&p=HRCA&sw=w&asid=e340ce3461b1fa6016092fe243066244
Ham, F. G. 1993. Selecting and Appraising: Archives and Manuscripts. Chicago, IL: Society of American
Archivists.
Huang, Weidong, Peter Eades, and Seok-Hee Hong. 2009. “Measuring Effectiveness of Graph Visualizations:
A Cognitive Load Perspective.” Information Visualization 8(3): 139–152.
doi:http://dx.doi.org/10.1057/ivs.2009.10.
Kelleher, Christa, and Thorsten Wagener. 2011. “Ten Guidelines for Effective Data Visualization in Scientific
Publications.” Environmental Modelling & Software 26(6): 822–827. doi:10.1016/j.envsoft.2010.12.006.
Koshman, Sherry. 2005. “Testing User Interaction with a Prototype Visualization-Based Information Retrieval
System.” Journal of the American Society for Information Science and Technology 56(8): 824–833.
doi:10.1002/asi.20175.
Lea, Martin, Tim O’Shea, and Pat Fung. 1995. “Constructing the Networked Organization: Content and
Context in the Development of Electronic Communications.” Organization Science 6(4): 462–478.
http://www.jstor.org/stable/2634998.
Lee, Christopher A. 2011. “A Framework for Contextual Information in Digital Collections.” Journal of
Documentation 67: 95–143. doi:10.1108/00220411111105470.
Matusiak, Krystyna K. 2006. “Information Seeking Behavior in Digital Image Collections: A Cognitive
Approach.” The Journal of Academic Librarianship 32(5): 479–488. doi:10.1016/j.acalib.2006.05.009.
Morville, Peter. 2015. “User Experience Design.” Semantic Studios. Accessed August 20,
http://semanticstudios.com/user_experience_design/.
Open Archives Initiative. 2015. “Open Archives Initiative Protocol for Metadata Harvesting.” Accessed August 14, https://www.openarchives.org/pmh/.
Pearce-Moses, Richard. 2005. A Glossary of Archival and Records Terminology. Chicago, IL: Society of American Archivists. Accessed August 15, 2015, http://www2.archivists.org/glossary.
Schellenberg, Theodore. 1984. “The Appraisal of Modern Public Records.” In A Modern Archives Reader: Basic
Readings on Archival Theory and Practice, edited by Maygene F. Daniels and Timothy Walch, 57–70.
Washington, DC: National Archives and Records Service, U.S. General Services Administration.
Shiri, Ali. 2008. “Metadata-Enhanced Visual Interfaces to Digital Libraries.” Journal of Information Science
34(6): 763–775. doi:10.1177/0165551507087711.
Society of American Archivists. 2015. “Encoded Archival Context - Corporate Bodies, Persons, and Families
(EAC-CPF).” Accessed August 17, http://www2.archivists.org/groups/technical-subcommittee-on-eac-
cpf/encoded-archival-context-corporate-bodies-persons-and-families-eac-cpf.
Thangaraj, M., and V. Gayatri. 2013. “An Effective Technique for Context-Based Digital Collection Search.”
International Journal of Machine Learning and Computing 3(4): 372–375. doi:10.7763/IJMLC.2013.V3.341.
University of British Columbia. 2002. “InterPARES Glossary: A Controlled Vocabulary of Terms Used in the
InterPARES Project.” Accessed August 14, 2015, http://www.interpares.org/documents/InterPARES Glossary
2002-1.pdf.
US Department of Health and Human Services. 2015. “User Experience Basics.” Usability.gov. Accessed
August 20, http://www.usability.gov/what-and-why/user-experience.html.
Weiler, Angela. 2005. “Information-Seeking Behavior in Generation Y Students: Motivation, Critical
Thinking, and Learning Theory.” The Journal of Academic Librarianship 31(1): 46–53.
doi:10.1016/j.acalib.2004.09.009.
9
Using R and ggvis to Create Interactive
Graphics for Exploratory Data Analysis
Tim Dennis
Creating interactive web graphics requires a heterogeneous set of technical skills that
can include data cleaning, analysis, web development, and design. Data need to be
acquired, munged, transformed, and then sent to a web application for rendering as
a plot. This creates a barrier to entry for an analyst to effectively explore data and
create interactive graphics using a single programming language in a flexible,
iterative, and reproducible way. This chapter will cover how a new package for the
R programming language, ggvis (Chang and Wickham 2015), enables librarians to
explore data and communicate findings using interactive controls without the
overhead of tinkering with web application frameworks. To make the data
visualization workflow more coherent, ggvis brings together three concepts: (1) it is
based on a grammar of graphics, a structure for defining the elements of a graphic;
(2) it produces reactive and interactive plotting in a rendered web-based graphic;
and (3) it provides support for creating data pipelines, making data manipulation
and plotting code more readable.
I will start the chapter by going over the technology suite and dependencies
needed to run ggvis. I’ll also cover the data used in the chapter and where to obtain
them. Once the computing and data sources are covered, I’ll introduce how ggvis
utilizes a grammar of graphics to break graphs into components made up of data, a
coordinate system, a mark type, and properties. This modularity allows analysts to
better understand graphical elements and swap out different components to make a
multitude of different plots. Following the grammar of graphics, I’ll start using
ggvis to build simple static graphs with a built-in data set in R. While
demonstrating how to create simple plots such as histograms, bar charts, and
scatterplots, I will introduce how ggvis incorporates the R data pipelining syntax to
improve code readability and maintenance. I will also show how to use ggvis in
conjunction with features from the data manipulation package, dplyr (Wickham
and Francois 2015), to filter, group, and summarize the data before graphing. After
covering the basic features of ggvis, I will introduce interactivity by plotting library-
related data sets (gate count data, circulation counts, and article-level metrics).
Starting simply, I will add a slider to a scatterplot. I will then show how to create a
histogram and density plot with interactive inputs. Still very young, ggvis is
currently not recommended for production use. However, ggvis is under active
development and coauthored by two important R developers with strong track
records of delivering important R packages.
For a librarian with knowledge of R, ggvis opens the door to creating interactive
graphics without knowing the intricacies of a JavaScript framework or data format
transformations. Furthermore, through the grammar of graphics and data pipelines,
ggvis makes graphing code understandable and more reproducible.
BACKGROUND
Used extensively in both business and academic settings, R is a statistical
programming language. One of the major reasons for its popularity is that
statisticians and developers have contributed over 7,000 packages that provide
additional functionality to base R, including statistical techniques and data
acquisition, visualization, and reporting tools. Developed by Winston Chang and
Hadley Wickham, ggvis is an R package that employs a grammar of graphics in
making data visualizations (Chang and Wickham 2015). The goal of the package is
to make it easier for people to build dynamic interactive visualizations that can be
embedded on the web or in a dynamic report.
Grammar of Graphics
Leland Wilkinson introduced the concept of a grammar of graphics in 1999, partly as a reaction to tools such as Excel that provided a selectable taxonomy of visual treatments that users then altered postselection (Wilkinson 2005).
from this canned approach to visualizations, he proposed decomposing graphics
into their elemental parts such as scales, layers, and marks and developing a
language for descriptively defining a plot in code. In 2005, Hadley Wickham
implemented his take on Wilkinson’s grammar of graphics in an R visualization
package, ggplot2, which was intended to make it easier to create publication-quality
static graphics in R (Wickham 2009). Typically among the top package downloads
in the Comprehensive R Archive Network (CRAN), ggplot2 has been a very
popular package in R. However, ggplot2 is primarily focused on providing R users
with easier ways to string together different parts of a plot to create static graphics.
With ggvis, Wickham also employed a grammar of graphics to organize and provide
structure to the graphing syntax that outputs as web-based plots.
USING GGVIS
Setup
To use ggvis or run the code in this chapter, you must install a number of tools on
your machine. You must have base R, RStudio, ggvis, tidyr (Wickham 2014c), and
dplyr installed and available. These tools are all open source and free; base R and the add-on packages can be downloaded from the R package repository (CRAN), and RStudio from its own website. Follow the instructions below for more information on getting set up.
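Once base R itself is installed (see the next section), the add-on packages can be installed from CRAN in one step. A minimal sketch; ggplot2 is included here because it supplies the diamonds data set used later in the chapter:

# Install the add-on packages used in this chapter from CRAN
install.packages(c("ggvis", "tidyr", "dplyr", "ggplot2"))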
Install R
If you are on a Linux machine, R is most likely in the package manager for your
distribution (e.g., apt-get install r-base). On Windows and Macs, go to
http://www.r-project.org and follow the download R link on that page. R code and
packages are mirrored all over the world via the CRAN, so choose the mirror
nearest you (in my case it’s http://cran.stat.ucla.edu/). Choose your operating
system, select “base,” and then download the latest R release (as of this writing it’s R
3.2.2). Run the installer, and you should have R installed. You can confirm R is
installed by opening a terminal in Mac or Linux and typing R. On Windows, click
the RGui desktop icon that the install process created. Once started, you should see
a note about the R version you are running and information about demos, citing R,
and so on.
R belongs to a class of programming languages (Python, Julia, and Stata are
others) that adopts a REPL (read-evaluate-print loop) style of interactive coding. This simply means that you will often try out code snippets in the console to see how they work and gradually build up a runnable script in a text file. You can try this out by using the console as a calculator. Type 2
+ 2 or 4 / 2 and observe the output. Familiarizing yourself with how the R console
operates is a key step in becoming an efficient R data analyst. Before we explore
further, let’s install an integrated development environment for R. Type q() to quit
R.
Install RStudio
There are many ways to create and edit R scripts and interact with the R language.
For this chapter we are using RStudio because it makes presenting interactive ggvis
plots seamless (RStudio Team 2012). Running R natively via the console will open
ggvis plots in a browser, so you can do it that way, but RStudio provides a nice
View panel for viewing graphics. It also provides a multipaneled interface with
separate areas for script editing, the R help library, package installation,
environment objects, command history, and a project management tool. The
author highly recommends using it with the code in this chapter. To install, go to
https://www.rstudio.com/products/rstudio/download/ and select your operating
system to download. Once installed you can open the application and familiarize
yourself with the RStudio windows. With R and RStudio installed, you can work
through any number of introductory free R courses on the web. One novel tutorial
is Swirl,1 an interactive R course built as an R package (Carchedi et al. 2015). This
tutorial will also develop your familiarity with RStudio as you interact with the
challenges.
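Swirl is installed and launched like any other R package; a minimal sketch (the interactive course menu takes over from there):

# Install and start the Swirl interactive R tutorial
install.packages("swirl")
library(swirl)
swirl()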
Let’s analyze the code. First, we must use the library() statement to load ggvis into the current R session. Otherwise, even though we installed it on our file system earlier, R will not know the package is available. Loading only the packages we need limits the clutter in our session (and its memory footprint). diamonds is
the name of a data set containing prices, carat size, and other features of diamonds.
It is installed as part of the ggplot2 package that we installed earlier. To find out
more information about the data set, type ?diamonds in the R console. Notice there
are variables such as price, carat, and cut, among others. To create figure 9.1, we
need to provide ggvis with a data set and map variables in the data set to some type
of mark in the graph. We do this by mapping x to the weight variable (~carat) and y
to the price variable (~price). We then select the ggvis points mark with
layer_points(). This creates a scatterplot of points with the carat variable on the x-
axis and price on the y-axis. These elements are composed this way using the pipe-
forward symbol (%>%) and are intended to be read as a sequence, like so:
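A minimal sketch of the code being described here, since the original listing for figure 9.1 is not reproduced; the data set, variable mappings, and mark are those named above (library(ggplot2) is added only to make the diamonds data set available):

library(ggvis)    # load ggvis into the current R session
library(ggplot2)  # provides the diamonds data set

diamonds %>%                         # take the diamonds data set, then
  ggvis(x = ~carat, y = ~price) %>%  # map carat to x and price to y, then
  layer_points()                     # draw a point for each diamond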
Coordinate System
In this chapter, we are mainly concerned with Cartesian coordinates. Currently,
ggvis doesn’t support polar coordinates (pie charts), and the other coordinate
systems supported are out of scope for this chapter.
Marks
We’ve already encountered a marking system by using layer_points() in the previous
example. We can change this marking system by editing the previous code to pipe
our data into layer_lines() instead (see figure 9.2 for the result).
library(ggvis)
diamonds %>%
  ggvis(x = ~carat, y = ~price) %>%
  layer_lines()
Figure 9.2. A ggvis-generated line chart using layer_lines().
Data Pipelines
Let’s return to the concept of data pipelining and demonstrate the power of this
approach in data preparation. Hadley Wickham has characterized tidy data (data
that are ready to be analyzed) as data with variables across the columns and
observations in the rows (Wickham 2014b). The data we normally receive in the
library world aren’t in this tidy format. Often, data we receive will be in a
spreadsheet format with values such as year across the columns or variables down
the rows. We also may receive data in separate files that need to be merged before
analysis or visualization. Hadley Wickham’s tidyr package helps us get the data in
the proper shape. In the past, to do these types of operations on data we often
needed to nest function calls and save outputs of these processes as intermediary
data frames. The problem of nesting these function calls is sometimes called the
“Dagwood sandwich problem” (Wickham 2014a, 237) and makes the code hard to
parse or work with. Additionally, having to name and track intermediary data
frames as we create steps to clean and prepare the data clutters up our code and
makes it harder to understand and maintain. The piping operator provided by
magrittr solves these problems by allowing us to forward our data through cleaning,
manipulation, and visualization functions, making our code cleaner and easier to
read (and comprehend).
As mentioned before, you might receive data such as that below that have been
created for human consumption but aren’t ready to visualize in a tool such as R.
gate <- read.csv('data/1-gate-count.csv', check.names=F, sep=',')
gate <- tbl_df(gate)
head(gate)
## Source: local data frame [6 x 6]
##
## branch 2009 2010 2011 2012 2013
## (int) (int) (int) (int) (int) (int)
## 1 1 10238 10203 12223 11005 13172
## 2 2 8710 12950 23917 26936 29171
## 3 3 60946 56632 69950 78811 106627
## 4 4 60529 53500 69088 92319 111320
## 5 5 58549 52418 58113 60111 94568
## 6 6 38199 30085 46305 47137 48466
In this case, we see years are in the column headers and branches are in the rows.
With the data frame in this form, there would be no way to plot branch by count
and year in ggvis because neither count nor year is accessible as a variable. We need
to use the package tidyr to reshape the data so the year and count variables are in
their own columns. In tidyr, we do this by using the gather() function, which will
take multiple columns and gather them into key-value pairs. In this case, we want to
gather the years and their counts and create two columns for each. Once we have a
variable for each branch, year, and count in our data frame, we can then use the
group_by() function provided by dplyr to group the data by branch. Only after
these steps are finished can we plot our data.
Let us first look at how we would tidy the data as described above using tidyr and
dplyr without pipes.
library(tidyr)
gate2 <- arrange(gather(gate, Year, Count, -branch), Year, Count)
gate3 <- group_by(gate2, branch)
head(gate3)
## Source: local data frame [6 x 3]
## Groups: branch [6]
##
## branch Year Count
## (int) (fctr) (int)
## 1 2 2009 8710
## 2 1 2009 10238
## 3 12 2009 12044
## 4 11 2009 15715
## 5 7 2009 22107
## 6 6 2009 38199
Above we are nesting arrange() and gather() functions with the gate data and saving
it to a gate2 temporary data frame. Following this, we use the group_by() function to group gate2 by branch, saving the result as gate3. With data requiring multiple steps for cleaning, sorting,
and summarization, the R code can quickly become nested or littered with
temporary data objects. Let’s clean this code up by using the pipe operator to set up
a chained sequence and then saving it as a new data frame (gate2).
library(tidyr)
gate2 <- gate %>%
gather(Year, Count, -branch) %>%
group_by(branch) %>%
arrange(Count)
head(gate2)
## Source: local data frame [6 x 3]
## Groups: branch [2]
##
## branch Year Count
## (int) (fctr) (int)
## 1 1 2010 10203
## 2 1 2009 10238
## 3 1 2012 11005
## 4 1 2011 12223
## 5 1 2013 13172
## 6 2 2009 8710
Now, both examples above produce the same data frame. But the second example
is clear, and it is easy to see what operations are happening to the data frame. Plus,
we neither have to name and keep track of temporary data objects in our code nor
use hard-to-read nested function calls. We will now repeat our data-cleaning
operations with the last chain plotting the data using ggvis. See figure 9.3 to see
the result.
gate %>%
  gather(Year, Count, -branch) %>%
  ggvis(x = ~Year, y = ~Count) %>%
  layer_bars()
Figure 9.3. A ggvis-generated bar chart showing gate counts for each year.
Wouldn’t it be better if we added some color and a stacked bar to differentiate the
branches? This is easy to do in ggvis by adding a fill property assigned to our branch
variable as an argument inside our ggvis() function. We also need to convert our
branch to a factor (categorical variable) so ggvis can group these on the graph (see
figure 28 in the photospread). The changes are made as follows:
gate %>%
  gather(Year, Count, -branch) %>%
  ggvis(x = ~Year, y = ~Count, fill = ~factor(branch)) %>%
  layer_bars()
Now we have color! ggvis will take the factor and automatically group and assign a
color to represent each branch (there are ways to control the color, but that’s
beyond the scope of this chapter). Notice how the resulting code is easy to read and
much more maintainable than nesting functions and keeping track of temporary
data frames. We can also easily alter the layers and properties to create different
plots. This is a powerful way to explore data in a succinct way and build up
meaningful plots in an interactive process.
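For example, swapping the final bar layer for a point layer, while keeping the rest of the pipeline intact, gives a quick alternate view of the same gate-count data (a small sketch based on the pipeline above):
gate %>%
  gather(Year, Count, -branch) %>%
  ggvis(x = ~Year, y = ~Count, fill = ~factor(branch)) %>%
  layer_points()  # only the final layer call changes from the stacked bar example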
Let’s add another variable, circulation statistics, to our gate count example and
look at a scatterplot. We will need to read in another CSV that contains our library
branches, gate, and circulation counts by year. These data have already been cleaned
and prepared for our use.
gcirc <- read.csv('data/2-gate-circ.csv', check.names=F, sep=',')
gcirc <- tbl_df(gcirc)
head(gcirc)
## Source: local data frame [6 x 4]
##
## branch Year Count circ
## (int) (int) (int) (int)
## 1 8 2009 45210 689
## 2 8 2010 52468 785
## 3 8 2013 45961 800
## 4 8 2011 54392 858
## 5 8 2012 89169 32105
## 6 1 2010 10203 8780
Returning to ggvis, we can create a scatterplot with this new data frame (figure 9.4).
gcirc %>%
  ggvis(x = ~Count, y = ~circ) %>%
  layer_points()
Figure 9.4. A ggvis-generated scatterplot showing circulation by gate count.
Nothing too surprising here. We expect a relationship between gate count and circulation. We notice a few outliers that might be branches that do not
circulate as much or have restricted collections.
We can add a fill color to the graph for each branch. Once again, we need to
convert our branch numbers to a factor; otherwise ggvis will treat them as a
continuous spectrum of color instead of as groups (see figures 29 and 30 in the
photospread).
gcirc %>%
  ggvis(x = ~Count, y = ~circ, fill = ~factor(branch)) %>%
  layer_points()

gcirc %>%
  ggvis(x = ~Count, y = ~circ, fill = ~factor(branch)) %>%
  layer_points() %>%
  layer_smooths()
Finally, we can draw a fitted curve between our two variables on the graph using layer_smooths(). This adds a prediction line based on a smoothing model, which lets us see whether the two variables trend together. In this case, an increase in gate count is associated with higher circulation
statistics. If we had noisier data, this fitted, curved line might highlight trends that
were not readily apparent in the scatterplot.
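If the fitted curve looks too wiggly or too flat, layer_smooths() also accepts a span argument that controls the amount of smoothing; a small sketch, with the span value chosen purely for illustration:
gcirc %>%
  ggvis(x = ~Count, y = ~circ) %>%
  layer_points() %>%
  layer_smooths(span = 0.3)  # smaller span values follow the data more closely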
Adding Interactivity
Now that we have used the grammar of graphics with ggvis and have created a
number of plots, we can start to add interactivity to our plot. ggvis supports the
following interactive controls that you can include in your plot:
input_slider(): a slider that produces a range control
input_checkbox(): an interactive checkbox
input_checkboxgroup(): a grouping of checkboxes
input_numeric(): a numeric input box that accepts only numbers
input_radiobuttons(): selectable radio buttons; only one can be selected
input_select(): a drop-down selection box
input_text(): plain text input
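Each control is used the same way: it is assigned to a plot property or layer parameter. As a hypothetical example, input_checkbox() could dim the points in our earlier scatterplot, with its map argument translating the checkbox's TRUE/FALSE value into an opacity value (a sketch, not an example from this chapter's data set of figures):
gcirc %>%
  ggvis(x = ~Count, y = ~circ) %>%
  layer_points(opacity := input_checkbox(label = "Dim points",
                                         map = function(val) if (val) 0.3 else 1))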
Let’s continue on from our previous example and alter it to add a point size
operator to the graph. We do this by setting the size of layer_points() to an
input_slider on a range from 100 to 1,000 (figure 9.5).
gcirc %>%
ggvis(x = ~Count, y = ~circ) %>%
layer_points(size := input_slider(100, 1000, value = 100)) %>%
layer_smooths()
Figure 9.5. A scatterplot using the same data as those used in figure 9.4.
We can also add radio buttons that set the color of the points by adding a fill
parameter set to input_radiobuttons(). For this example, we will see how the output
looks in RStudio (figure 9.6).
gcirc %>%
ggvis(x = ~Count, y = ~circ,
fill := input_radiobuttons(label = "Choose color:",
choices = c("blue", "red", "green"))) %>%
layer_points(size = ~Count)
Figure 9.6. The Viewer panel and console in RStudio.
As we run this code in RStudio, it will open up in the Viewer panel, but notice that
there is a red stop sign on the upper right of the console window. The console also
prints out a nice note letting us know we are running a dynamic visualization and
how to stop it. What’s happening here is that with the ggvis visualization running in
the viewer, the R process is still running and waiting to respond to changes in the
visualization radio button element. Once we select a different button, the plot will
be rendered again with the changes to the plot. As mentioned before, this call-and-
response type activity in ggvis is characterized as a reactive functional programming
style, and in ggvis it characterizes an interplay between an HTML/JavaScript
visualization framework (Shiny) and our R code. After running this code in
RStudio, interact with the radio buttons and alter the plot. Notice the red stop sign is still showing and the process is waiting for further changes.
Notice that we have variables on the DOIs, titles, and metrics from each source.
Also notice that these data include metrics from social media and web sources such
as Twitter and Wikipedia. One way to explore these data from each source is by
using a histogram to see the distribution of metrics of these journal articles we have
in our data set. Since we haven’t covered a histogram yet, let’s create a static one
(figure 9.7).
altmet %>%
ggvis(~scopus_total) %>%
layer_histograms()
Figure 9.7. A histogram showing the distribution of Scopus metrics.
To create a histogram, ggvis must first create “bins” in which to group and place the
data. This happens behind the scenes in ggvis, but it is worth noting that ggvis has a
compute_bin() function that layer_histograms() calls to do the binning. This can be called separately, like so:
binned <- altmet %>% compute_bin(~scopus_total)
head(binned)
## count_ x_ xmin_ xmax_ width_
## 1 3 0 -2.5 2.5 5
## 2 8 5 2.5 7.5 5
## 3 10 10 7.5 12.5 5
## 4 9 15 12.5 17.5 5
## 5 6 20 17.5 22.5 5
## 6 2 25 22.5 27.5 5
From here, we could use this binned data frame to build the histogram in a more manual fashion. Bin width is an important parameter in a histogram: if the bins are too wide or too narrow, the shape of the distribution can be missed. Let's look at our previous plot with a different bin width (figure 9.8).
altmet %>%
ggvis(~scopus_total) %>%
layer_histograms(width = 15)
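The result (figure 9.8) groups the same data into wider bins. To let the reader experiment with the binning, as in the interactive histogram shown in figure 9.9, the width and center parameters of layer_histograms() can be tied to sliders; a minimal sketch, with the slider ranges chosen only for illustration:
altmet %>%
  ggvis(~scopus_total) %>%
  layer_histograms(
    width  = input_slider(1, 30, value = 5, step = 1, label = "Bin width"),
    center = input_slider(0, 10, value = 0, step = 1, label = "Bin center"))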
Figure 9.9. A histogram with interactive user controls for bin width and center.
As we’ve seen, the bin width strongly affects the shape of the data, and altering this
width can lead us to discovering different features in the data that might not have
been apparent using default or fixed bins. Another way to get a different sense of the
distribution of a variable is through a kernel density plot. These plots are available
in ggvis via the layer_densities() function. The density layer can produce a number of
types of density plots based on different smoothing algorithms, called kernel
smoothing. By default, ggvis uses Gaussian smoothing. Let’s see what a static
version looks like against our Altmetrics data (figure 9.10).
altmet %>%
ggvis(~scopus_total) %>%
layer_densities()
Figure 9.10. A kernel density plot created using the ggvis layer_densities() function.
Now let’s make this graphic interactive by allowing the user to select different
kernel smoothers. This is done by assigning the kernel parameter in the
layer_densities() function to an input_select() control containing various smoothers. Run the
code and switch back and forth between the different types of kernel smoothers.
You can see each kernel produces a different shape and hopefully gives more insight
to the data (figure 9.11).
altmet %>%
  ggvis(x = ~scopus_total) %>%
  layer_densities(
    adjust = input_slider(.1, 2, value = 1, step = .1,
                          label = "Bandwidth adjustment"),
    kernel = input_select(
      c("Gaussian" = "gaussian",
        "Epanechnikov" = "epanechnikov",
        "Rectangular" = "rectangular",
        "Triangular" = "triangular",
        "Biweight" = "biweight",
        "Cosine" = "cosine",
        "Optcosine" = "optcosine"),
      label = "Kernel")
  )
Figure 9.11. A kernel density plot with interactive kernel smoothing input.
DISCUSSION
Because ggvis is currently under development, if you load ggvis from the CRAN repository, you will get this message: "The ggvis API is currently rapidly evolving. We strongly recommend that you do not rely on this for production, but feel free to explore. If you encounter a clear bug, please file a minimal reproducible example at https://github.com/rstudio/ggvis/issues. For questions and other discussion, please use https://groups.google.com/group/ggvis."
At the time of this writing, ggvis is at version 0.4.2. There are limitations, and as the
developers indicate it really shouldn’t be run in production. That said, as this
chapter has demonstrated, it can be a great tool to explore and get a feel for data.
Also, since the aim of ggvis is to be the successor to ggplot2, it’s reasonable to
assume that learning ggvis will put you in good stead for the future of visualization
in R. However, it will be a moving target, and code you write now might need to be
altered as the API changes.
CONCLUSION
In this chapter we introduced the R visualization package ggvis and its goal to make
it easier to create interactive graphics for exploratory data analysis. We walked
through how ggvis employs a grammar of graphics in decomposing plotting
elements into declarative components. We also demonstrated how ggvis plugs into a
larger R ecosystem for creating data pipelines to improve code readability and
maintenance. Finally, we created interactive graphics with library-related data and
showed how they change when a user changes elements in the graph.
NOTES
1. http://swirlstats.com/.
2. https://github.com/jt14den/rggvis-libdata.
3. http://lagotto.io/.
4. http://API.plos.org/.
REFERENCES
Allaire, J. J., Jeffrey Horner, Vicent Marti, and Natacha Porte. 2015a. Markdown: “Markdown” Rendering for R.
http://cran.r-project.org/package=markdown.
Allaire, J. J., Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, and
Rob Hyndman. 2015b. Rmarkdown: Dynamic Documents for R. http://CRAN.R-
project.org/package=rmarkdown.
Bache, Stefan Milton, and Hadley Wickham. 2014. magrittr: A Forward-Pipe Operator for R. http://cran.r-
project.org/package=magrittr.
Carchedi, Nick, Bill Bauer, Gina Grdina, and Sean Kross. 2015. Swirl: Learn R, in R. http://CRAN.R-
project.org/package=swirl.
Chamberlain, Scott, Carl Boettiger, and Karthik Ram. 2015b. alm: R Client for the Lagotto Altmetrics Platform.
http://CRAN.R-project.org/package=alm.
———. 2015a. rplos: Interface to the Search “API” for “PLoS” Journals. https://cran.r-project.org/package=rplos/.
Chang, Winston, and Hadley Wickham. 2015. ggvis: Interactive Grammar of Graphics.
http://ggvis.rstudio.com/.
Chang, Winston, Joe Cheng, J. J. Allaire, Yihui Xie, and Jonathan McPherson. 2015. shiny: Web Application
Framework for R. http://cran.r-project.org/package=shiny.
Knuth, Donald E. 1984. “Literate Programming.” Computer Journal 27(2): 97–111. Oxford, UK: Oxford
University Press. doi:10.1093/comjnl/27.2.97.
RStudio Team. 2012. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc.
http://www.rstudio.com/.
Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
http://www.springer.com/us/book/9780387981406.
———. 2014a. Advanced R. Boca Raton, FL: CRC Press.
———. 2014b. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10.
———. 2014c. tidyr: Easily Tidy Data with “Spread()” and “Gather()” Functions. http://cran.r-
project.org/package=tidyr.
Wickham, Hadley, and Romain Francois. 2015. dplyr: A Grammar of Data Manipulation. http://cran.r-
project.org/package=dplyr.
Wilkinson, Leland. 2005. The Grammar of Graphics. New York: Springer.
Xie, Yihui. 2015. knitr: A General-Purpose Package for Dynamic Report Generation in R.
http://yihui.name/knitr/.
10
Integrating Data and Spatial Literacy into
Library Instruction
Charissa Jefferson
BACKGROUND
Data literacy can be defined as being able to analyze and work with quantitative
information. People who are data literate must be able to understand how to read a
basic spreadsheet’s columns and rows. Data literate people can organize and
deconstruct the differing information by comparing each of the row’s and column’s
attributes and data points. From a spreadsheet, data literate people can
understand the structure and find the outlying data points. Data literate people can
detect trends in the data and create the appropriate form of graph, chart, or table to
illustrate the numerical information. Additionally, data literate people can describe
results of the data set to make statistical arguments and conclusions. An essential
component of quantitative literacy is statistical competency because statistics come
from raw data. It is important for people to be able to critically evaluate data-related
arguments by understanding the process of information creation (Prado and Marzal
2013).
The Association of American Colleges and Universities (AACU), an organization
comprising members from higher educational institutions, has created an initiative
to evaluate undergraduate learning. VALUE (Valid Assessment of Learning in
Undergraduate Education) Rubrics offer assessment tools for sixteen different
literacies including quantitative literacy, civic engagement, and global learning. The
Quantitative Literacy (QL) VALUE Rubric emphasizes that “[quantitative literacy]
is not just computation, not just the citing of someone’s data. QL is a habit of
mind, a way of thinking about the world that relies on data and on the
mathematical analysis of data to make connections and draw conclusions” (AACU
2009). The AACU makes an effort to address analysis of data as an important
achievement: “Virtually all of today’s students, regardless of career choice will need
basic QL skills such as the ability to draw information from charts, graphs, and
geometric figures, and the ability to accurately complete straightforward estimations
and calculations” (AACU 2009). When a student can draw information from charts,
graphs, and so on, the student can determine the accuracy of the information by
critically evaluating the underlying data. The QL VALUE Rubric measures (1)
interpretation and (2) representation of tables, graphs, diagrams, and words; (3)
calculation; (4) application/analysis; (5) assumptions; and (6) communication of
numbers as evidence.
Upon developing its Data Information Literacy (DIL) program, Purdue University
spearheaded a collaborative project involving university libraries to determine the
data literacy needs of their patrons through interviews (Carlson et al. 2011). The
DIL interviews were initiated by librarians from Purdue University, Cornell
University, the University of Minnesota, and the University of Oregon. During the
interviews, participants were asked to rate twelve DIL competencies on a five-point
Likert scale ranging from “not important” to “essential.” Each competency was
ranked at least “important” by all participants. However, data discovery and
acquisition and data visualization and representation were among those rated most
essential by faculty. Data processing and analysis, followed by data management and
organization, was ranked highest by the students. Next in importance was data discovery and acquisition, followed by data visualization and
representation, which shared ranking with ethics and attribution and metadata and
data description.
The new Framework for Information Literacy for Higher Education, created by the Association of College and Research Libraries (ACRL), addresses the concept of information creation as a process in one of its six frames (ACRL 2015). The process
of information creation includes each of Bloom’s lower and higher order thinking
skills, including remembering, understanding, applying, analyzing, evaluating, and
creating. Library instruction often focuses on evaluating information. The following
competencies are linked to evaluation of information: critiquing, justifying,
summarizing, describing, and contrasting. Skills such as synthesizing, categorizing,
combining, compiling, composing, generating, modifying, organizing, planning,
rearranging, reusing, and rewriting are linked to creation. Integrating data literacy
into instruction is one of the best ways for learners to acquire these skills, because
learners must learn to choose the appropriate data set by locating the appropriate
repository or source of data. Then the student must make an accurate
representation of the content in the appropriate format.
Figure 10.3. A shapefile .dbf file opened and being edited using LibreOffice.
In addition to spatial data, attribute data consist of “what” about “where” and
“why.” Attribute data are often formatted in Excel or CSV or text files and include
latitude and longitude, street addresses, and a wealth of detail in architectural
description (Elliott 2014). Although spatial data usually refer to points or location
elements on a map and are often distinguished from the descriptive information of
attribute data, spatial literacy incorporates both spatial and attribute data. This is
because the scholarly aspect of GIS in curricula allows students to describe their
geographic information in their project narratives.
Libraries are providing GIS support to fulfill the need within the community to
understand spatial information and create something new. By providing GIS
services, libraries expand patrons’ ability to not only access data and statistics across
multiple disciplines but also to view them spatially. Patrons can “manipulate [that
data] through queries and analysis to create new information” (Elliott 2014, 9).
Many data sets from governments, organizations, institutional repositories, companies, and user-generated sources are available in online open data repositories.
Eva Dodsworth, a librarian on the forefront of GIS services in the library for
reference and instruction, wrote the LITA guide “Getting Started with GIS”
(2012), which extensively speaks to this. As libraries adopt more tools and
technologies to engage patrons in civic engagement and global citizenry, the
traditional library reference services have expanded to mapping and georeferencing.
Dodsworth explains integrating mapping in reference services by emphasizing
georeferencing: “Georeferencing is the procedure used to establish the spatial
location of an object (image, map, document, etc.) by linking its position to the
earth’s surface” (Dodsworth 2012, 11). Ms. Dodsworth also provides an online
professional development course for librarians who are interested in gaining
technical skills in the area of mapping using freely available GIS software such as
Google Earth, Google Maps, and Esri products such as ArcGIS Explorer. There are
also other open source GIS programs from local governments and smaller
organizations with a specific subject focus such as public and urban fruit trees. The
open source mapping software can be easily utilized in reference and instruction to
increase patron’s awareness of their communities and beyond.
Branch, in his article “Libraries and Spatial Literacy: Toward Next-Generation
Education,” states that spatial literacy ought to be included in library services
because it “helps develop critical thinkers and an engaged citizenry” (Branch 2014,
109). He goes on to state that:
collaboration, collective problem solving, and data sharing are desirable habits for learners to develop
through such trends as community informatics, citizen science, constructivism in the classroom, or policy
debate. A critical thinker may benefit and develop from a librarian-led data experience; the librarian can pull
spatial data sources that already exist and assist the researcher, student, or faculty member in utilizing
government data that has been vetted to render a visual data comparison to better argue an intellectual
perspective. (Branch 2014, 110)
Branch argues that librarians ought to utilize census data and earth science data in
information literacy instruction to impart the important skill of spatial literacy in
global citizenry. Because world governments use geospatial data for decision
making, students need to learn the skills of understanding maps to problem solve
regional and global issues.
The AACU has two VALUE Rubrics that would be useful for assessment of spatial
literacy: Civic Engagement and Global Learning. The Civic Engagement Rubric
emphasizes historical and current policies, values, and cultural influences to create
cultural awareness through civic identity. The rubric measures (1) diversity of
communities and cultures, (2) analysis of knowledge, (3) civic identity, (4) civic
communication, (5) civic action and reflection, and (6) civic contexts.
An example of a civic engagement project using spatial literacy skills comes from a
participatory mapping project out of the University of Washington where students
used their own interpretations of archival “quantitative data with human-centered
narratives” (Williams et al. 2014) to think spatially about community culture
(Mitchell and Elwood 2012). Inspired by this participatory mapping project, a research team created City Digits: "Local Lotto, a mathematics curriculum that
incorporates data collection and analysis methods informed by participatory media
projects for use in New York City high schools” (Williams et al. 2014). The
students use census data and state lottery data to formulate opinions about the
social justice issues the lottery impacts. The section headed Examples of Cases in
this chapter will present more detail about these projects.
The AACU Global Learning VALUE Rubric evaluates (1) global self-awareness,
(2) perspective taking, (3) cultural diversity, (4) personal and social responsibility,
(5) understanding global systems, and (6) applying knowledge to contemporary
global contexts.
Maps can illustrate the interdependent global systems of economic, political,
social, and physical legacies and their implications for people’s lives by providing
diverse perspectives. At Hobart and William Smith Colleges, assistant professors
Kristen Brubaker, Environmental Studies, and Christine Houseworth, Economics,
collaborated in a Mellon Grant–funded project to create a curriculum for social
science spatial literacy. Their curriculum provided an understanding of how
location could impact economic or other social issues (Brubaker and Houseworth
2014).
EXAMPLES OF CASES
This section describes the curricula some institutions have specifically provided to
engage students in data and/or spatial literacy activities. These examples of projects
include small to large higher education institutions with a variety of subject-specific
instruction in the social or environmental sciences. These projects are examples of
how students can utilize open data and open source software to better understand
their communities and their disciplines.
DISCUSSION
There are some possible challenges to implementing data literacy or spatial literacy
in library instruction. While there are open source data repositories and geographic
spatial data freely available online, keeping up the currency and relevancy of the
data sets can be difficult. Although data sharing and management incentives and
initiatives help researchers publish their data more regularly, not all data sharing is
mandated. Researchers may also delay sharing their data for fear of being scooped, so there can be a lapse in time before the data are accessible to the public. Government data are
produced regularly depending on the nature and breadth of the surveys. However,
although the data may have been recently collected, it takes several months, even
years, before the data are available to the public. Sometimes the data available aren’t
as current as one would hope, thus presenting a challenge to teach the application of
data sets on research projects with current events.
Libraries may also consider using subscription databases to incorporate data and
spatial thinking into library instruction. At my institution, California State
University, Northridge, I have used Data-Planet and PolicyMap. Data-Planet
combines government and proprietary data that enable a user to have much more
control to layer multiple data sets and create choropleth maps. PolicyMap is
effective for political science, public policy, and social science spatial thinking. The
challenge of relying on proprietary sources of data is the expense. Because most of
the research and analytics of subscription databases are subject specific and can go
through a detailed vetting process to meet the competitive intelligence needs of the
databases’ clients, many of their subscription fees are too high for most public-
serving institutions. Overall, “libraries need budgets that allow them to implement
data services” (Branch 2014, 112). One possible solution may be that while libraries
are assessing the current usage of subscription databases, they could trial a
competitive intelligence data analytics database during a time aligned with students’
research projects. California State University, Northridge, trialed a proprietary
business database, IBISWorld, during the height of project-oriented research
assignments. IBISWorld received high usage, and we were able to cancel a database with similar content and cost but substantially lower usage, replacing its coverage.
Overall, the students and faculty are more pleased with this acquisition, and much
of our data needs in the subject area have been met with the new database. Students
want to use the newer database, and it has made teaching data visualization more
effective because of the currency and relevancy of the information, which is
packaged in a more user-friendly way. We were lucky that we had the ability to
cancel one database and subscribe to another. However, I realize that this option is
not always available.
An additional challenge in the area of spatial data literacy involves the training of
librarians. While GIS software costs have decreased and many open source software
programs are available online, the cost of infrastructure, staffing, and training to
offer the services and programs must be considered. As the popularity of GIS
increases among patrons and librarians, a geospatially equipped librarian workforce
needs initial and ongoing training to meet an inevitably growing demand. Despite
the growing role of GIS and geospatial visualization in the profession, many library
and information science programs do not incorporate GIS training in the curricula.
Some libraries are hiring geospatial data specialists to meet their demands, but they
are hiring outside of the library profession because of the lack of specialized training
librarians have received in this area. Geospatial thinking is another area in which
libraries need to adapt to remain relevant to users’ needs. In their article “Geospatial
Thinking of Information Professionals,” authors Bradley Bishop and Melissa
Johnston provide recommendations on how to incorporate geospatial thinking in
library science curricula. They suggest that management classes involving strategic
planning incorporate an aspect of facility location analysis. The authors also suggest
that in specialized reference courses, a portion of the class ought to be devoted to
“finding and locating geospatial data” (Bishop and Johnston 2013, 20).
NOTE
1. LibreOffice can be downloaded for free from https://www.libreoffice.org/.
REFERENCES
Abresch, John. 2008. Integrating Geographic Information Systems into Library Services: A Guide for Academic
Libraries. Hershey, PA: Information Science Pub.
Association of American Colleges and Universities (AACU). 2009. VALUE rubrics. Accessed October 30, 2015.
http://www.aacu.org.
Association of College and Research Libraries (ACRL). 2015. Framework for Information Literacy for Higher
Education. Accessed October 30, 2015. http://www.ala.org/acrl/standards/ilframework.
Bishop, Bradley Wade, and Melissa P. Johnston. 2013. “Geospatial Thinking of Information Professionals.”
Journal of Education for Library and Information Science 54(1): 15.
Bloom, Benjamin S. 1956. Taxonomy of Educational Objectives: The Classification of Educational Goals. New
York: Longmans, Green.
Branch, Benjamin D. 2014. “Libraries and Spatial Literacy: Toward Next-Generation Education.” College &
Undergraduate Libraries 21(1): 109–114.
Brubaker, Kristen, and Christina Houseworth. 2014. Digital Pedagogy Project: Teaching Geographic Information
Systems and Spatial Literacy in the Social Sciences. Hobart and William Smith Colleges.
http://www.hws.edu/offices/provost/digital.aspx.
Carlson, Jacob, Michael Fosmire, C. C. Miller, and Megan Sapp Nelson. 2011. "Determining Data Information
Literacy Needs: A Study of Students and Research Faculty.” Portal: Libraries and the Academy 11(2): 629–657.
Carlson, Jake, and Marianne S. Bracke. 2015. “Agriculture/Graduate Students/Carlson & Bracke/Purdue
University/2014.” Data Information Literacy Case Study Directory 1(3).
Dodsworth, Eva. 2010. “Indirect Outreach in a GIS Environment: Reflections on a Map Library’s Approach to
Promoting GIS Services to Non-GIS Users.” Journal of Library Innovation 1(1): 24.
Dodsworth, Eva. 2012. Getting Started with GIS: A LITA Guide. New York: Neal-Schuman Publishers.
Elliott, Rory. 2014. “Geographic Information Systems (GIS) and Libraries: Concepts, Services and Resources.”
Library Hi Tech News 31(8): 8–11.
Fitzpatrick, Charlie. 2011. Using External Data Tables with ArcGIS Online. Esri.
http://edcommunity.esri.com/Resources/ArcLessons/Lessons/U/Using_External_Data_Tables_wit.
Koltay, Tibor. 2015. “Data Literacy: In Search of a Name and Identity.” Journal of Documentation 71(2): 401–
415.
Mitchell, K., and S. Elwood. 2012. "Engaging Students through Mapping Local History." Journal of Geography
111(4): 148–157.
Prado, Javier Calzada, and Miguel Ángel Marzal. 2013. “Incorporating Data Literacy into Information Literacy
Programs: Core Competencies and Contents.” Libri 63(2): 123–134.
Williams, Sarah, Erica Deahl, Laurie Rubel, and Vivian Lim. 2014. “City Digits: Local Lotto: Developing
Youth Data Literacy by Investigating the Lottery.” Journal of Digital and Media Literacy 2(2).
http://www.jodml.org/2014/12/15/city-digits-local-lotto-developing-youth-data-literacy-by-investigating-the-
lottery/.
11
Using Infographics to Teach Data Literacy
Caitlin A. Bagley
Since 2012, librarians at Gonzaga University have been teaching data visualization elements to their students through the use of infographics. A range of methods has been tried over the course of the past three years, and with these experiences in place this chapter aims to discuss best practices as well as some common pitfalls
when attempting to use infographics in lesson planning. Infographics can be an
exciting tool to employ in instruction, but as with all new instructional
technologies, it takes time to learn and use effectively. Similarly, not all methods are
best suited for all classes. During the learning period, instructors discovered that
there were times when it was best not to use infographics, such as for more
advanced classes or classes that relied heavily on lecture and in-depth work.
While this chapter was being written, librarians were busy finalizing the
Framework for Information Literacy for Higher Education from the Association of
College and Research Libraries (ACRL). Keeping this in mind, the author seeks to
examine how infographics and data literacy can fit into the ACRL information
literacy standards as they currently exist and to offer some thoughts on the way they
might fit into future iterations. As no pedagogical method stands alone, it is
important to evaluate the methods of instruction and the value they could
potentially bring to students before adding them into the curriculum. While data
visualization does not and cannot touch on all aspects of the standards, it can
provide the instructor with structure and guidance when presenting the idea of
using data visualization within instructional programs to a director or dean.
The tools and resources covered in this chapter are largely free or already owned by
most academic libraries. In particular, the resources include reference staples such as
the Statistical Abstract of the United States and other government publications that
most libraries have access to. The federal government has ceased print publication
of the abstracts as of 2011, but since then ProQuest has published the annual as a
searchable database. Although some librarians (and many patrons) can be
intimidated by government documents, data visualization is particularly well suited
to the strictures and methods of government data collection, and the use of these
collections strengthens both librarian and patron skill levels. Other resources involve
freemium web apps such as Piktochart, an infographic generator perfect for students
unfamiliar with the concept and new to graphic design. The use of these tools
delicately walks the line between hand-holding and allowing students to explore
their own creativity without being hampered by preset rules. With these two tools,
instructors were able to do the majority of instruction.
BACKGROUND
The Foley Center Library serves Gonzaga University, a Jesuit Catholic institution based in Spokane, Washington, with a diverse student body of over 8,000 traditional, distance, and online students. Every year, instruction librarians teach a wide span of classes, ranging from upper-division one-shots to entry-to-the-major and freshman-level courses. One set of nonacademic classes that they spend a special
amount of time on is the Pathways series of classes. These classes differ from
traditional freshman classes in that they usually comprise honors students who are
new to the university and they have no specific focus other than to develop a cohort
of students familiar with the university. Special focus is put on developing a
program for them, and in 2012 librarians decided to show students in the Pathways
program how to create infographics.
For those readers who are unfamiliar with the concept of infographics, the
generally accepted definition is a visual image such as a chart or diagram that is used
to represent information or data in a creative way. Although infographics have been
around in many forms for some time, only in the past few years have they come into common use in classroom and other academic settings. In some ways,
this is due to generators that make infographics easier to create along with a more
general familiarity with them in the populace. In recent years, their use has
exploded into lesson plans in libraries and across other curricula at the university
level. They represent a quickly and easily understood method of seeing numbers
and hard facts for people who, prior to the lesson, may not have felt entirely
comfortable with either graphic design or statistics. Many students have remarked
that they would not have used these sources had they not learned about them in
class.
Another key concept to be aware of is that many infographics are built around
templates, blocks, and themes. A template is a blank or prefilled infographic screen
that allows users to build their individual infographics. Within Piktochart, the
service that we will be exploring here, templates can be created out of blocks so that
each template can be as long or as short as desired (see figure 11.1). Each template
starts with three blank blocks, which can be added or deleted as needed. Similar to a
template, a theme is an infographic that has already had the majority of its design
elements created along particular thematic approaches but leaves spaces for users to
add their own information and stylistic choices. Individually, all of these elements
were approachable, but combined as a whole, the instructors found that students
could easily become overwhelmed. So they made sure that they discussed each
element of the infographic. Instructors also relied heavily on Microsoft Excel and
Google Spreadsheets at the beginning of the course.
During the time that these classes first began to be taught, professors at the
university were looking for creative ways to bring multimodal learning and digital
humanities aspects into their teaching styles. Likewise, the librarians wanted to
explore new methods for instruction to find ways of teaching that were not static. In
particular, with the Pathways classes, the author was struggling to find a way to
teach an academic class without a fixed topic and without repeating ideas that the
students would most likely receive in future library instruction. Prior to giving
infographic lessons, instruction had centered on projects such as creating videos and
gamification in the library. A precedent had been set for the Pathways classes to be
energetic and fun but with an academic lesson hidden within.
When deciding what to teach to the students, the librarians decided to represent
underused databases that were not typically shown in instruction, so they reviewed
database usage statistics to identify such databases. During a time of increased costs
and decreased budgets, librarians used this opportunity to promote databases that
they knew students could benefit from. Eventually the librarians decided to settle
on the ProQuest Statistical Abstract of the United States. They did so because it was
a new database to the library with which many librarians were unfamiliar and
because it had been a reference staple prior to its publication as an electronic
resource. Students often see statistics and large data sets as intimidating, so the
librarians were concerned that it would not be possible to teach the new database in
a fifty-minute time slot. Infographics were suggested as a potential solution.
Relatively quick and heavily reliant on data, they seemed like they would make the
perfect bridge between teaching a relatively new database and maintaining a sense of fun to encourage students to get involved with the library beyond their
initial introduction to it.
ASSESSMENT
One of the biggest points of discussion when thinking of how to design the lesson
revolved around how to judge and assess creativity. The instructors knew they could
not require students to be perfect at this on their first attempt, but they did want
infographics to have a few specific elements. Although the ideal of a perfect
infographic could not be dictated, there were a few tenets that we all felt each
infographic should contain. In the expanding and evolving world of infographics,
there has been much debate over how to teach and effectively use such tools. Some
instructors, such as Abilock and Williams (2014), have a line of questioning that
they prefer their students follow called an Infographic Question Matrix, which
builds on addressing most effective areas. Others, such as Richard Saul Wurman
(1989), rely on acronyms such as LATCH (location, alphabet, time, category, and
hierarchy). Bearing these two ideas in mind, the librarians elected to create their
own model. These were the items written on the white board and on the handout
so that each student knew what was expected of his or her end product. They were
as follows (see figure 31 in the photospread):
A title
One graph/chart
A citation
Images/clip art
Explanatory information on data
Although there were numerous other things the instructors would have wanted in
an idealized infographic, they felt it was best to keep the requirements to a smaller
number so that students would not feel overwhelmed with too much information
and too many new technologies. Time was also a huge factor. Had the class taken
place over a longer period of time or several days there would likely have been more
requirements. In the best-case scenarios of the class after it had spread to other
professors, the students would have a clear objective and audience to which they
could direct these infographics. Lacking these things often gave students a small
hurdle. It was not that they did not have anything to say but rather that they did
not have a clear direction in which to point what they had to say. Particularly in the
original Pathways classes, it became clear that having a directed topic was a necessity
and that prompts were needed to give direction. Sometimes this allowed for
breaking up students into small groups of two or three to have them work together
to research a common topic (see figure 31 in the photospread).
TECHNICAL CONSIDERATIONS
As noted above, one of the major, unexpected hurdles to deal with when first
presenting the topic was scalability. Most of the instructors, having grown up with Microsoft Excel and used it frequently in professional environments, assumed that their
students would also understand how to input information in a spreadsheet. While
all students had used Microsoft Excel before, instructors found that the degree to
which the students had used it did not necessarily match up with the skill level
needed for the course. The most common issues dealt with a basic
misunderstanding of how columns and rows interacted with each other, in
particular realizing that these columns and rows would relate directly to their x- and
y-axes within their graphs. Understandably, many students were inputting
information in a manner that would make the spreadsheet inherently readable to
human readers but often proved to be unreadable or incoherent to machines when
translating it to graphs. Much time was spent correcting students and showing them
that they needed to move entire rows or columns. Likewise, many failed to
understand how information would be relayed onto a graph, in that they would
scale their information so that the numbers would look disproportionately small or
large by putting information on the wrong axis. Usually these mistakes could be
spotted before they were committed to a final form, but this required the instructors
to spend the majority of their time walking around the classrooms and checking in
with individual students to monitor their progress. In a class of twenty or more
students this could sometimes take up the majority of instructional time. Initially,
these classes were taught with two instructors so there would be a backup in case
there were too many questions for one instructor to handle alone. Since first
debuting this course that practice has changed to only one instructor per class.
Although the individual problems have not changed that much from class to class,
the instructors have become more experienced in handling the problems and now
feel competent to teach the class alone.
A lingering problem with using Piktochart involves the fact that it works best in
Mozilla Firefox and Chrome but many students default to using Internet Explorer.
At the start of every class, students were reminded that the preferred Internet
browser for the course was Firefox, but there were always one or two students who
simply did not hear or ignored the directions. It is unlikely that this problem will
ever be completely resolved unless Internet Explorer should suddenly become
compatible with Piktochart. The best bet is to be aware of the issue so that when
the problem arises you know how to handle it. The most frequent problem is that
using Internet Explorer with Piktochart tends to cause the browser to either drag
down performance, freeze, or in the worst scenario, crash. One of the nicer features
of Piktochart is that it automatically saves all work. In the event of crashes or having
to restart machines, users can log back into Piktochart in a different browser and
restart where they left off. A good way to prevent the problem is to mention that
everyone should use Firefox or Chrome when students are logging in to their
computers.
Another small annoyance about Piktochart is inconsistency within the product
when creating chart types (see figure 31 in the photospread). In the best-case
scenario, the program should allow users to input multiple types of charts that they
can choose from including icon matrices, gauges, and progress bars, which students
are often not as familiar with. Information can be manually added directly or can be
imported from CSV or Excel files (.xls) or Google Spreadsheet. While individually
many of the students had the capability to understand and import the
information via CSV or .xls files, in practical terms the time limitations of the class
did not allow for the teachers to teach every student how to do this. Frequently,
data were manually imported into Piktochart’s in-house spreadsheets. Yet not all
students had access to the same types of charts as their classmates. This was a
problem for which instructors could find no obvious explanation of why one student would get chart types that another did not. At first glance it seemed as though it might be a
difference between browsers or even operating systems, but the instructors have
never been able to fully determine what causes these inconsistencies. Since 2012,
when we first began to roll this out, this problem has slowly receded, and now most
students have the same default charts when they first sign up. The best guess
instructors can offer for this discrepancy would be that these new types of charts
were phased rollouts that were randomly assigned to new accounts before the
finalized versions went site wide. While it would appear that this problem has been
temporarily resolved, it is something that is always on the back burner during
instruction as something to deal with should it pop up. Considering that this is a free resource that provides a powerful tool, users must take what they are given, but there are other options an instructor might want to consider before settling on Piktochart. There were and remain many other viable options for generating infographics, but instructors have generally felt that the limitations of Piktochart are small compared to the abilities it offers. Likewise, the other options considered were frequently not as user friendly, nor did they offer the breadth of services wanted for students (see figure 32 in the photospread).
As the course shifted from being one taught in a nonacademic setting to being
specifically requested by professors for their own individual courses, the librarians
had to think of a way for it to still work in a now deliberately academic setting. The
time constraints in place were already fairly inflexible, and it did not seem likely
that in most cases there would be more class time. However, it became clear that
while the Statistical Abstract was a stellar database for a nonacademic class on this
subject, it was not always the best solution when students had course-specific
subjects that were not necessarily targeted by the database. English, business, and
some science classes began to adopt the lesson plan in droves, but each had its own
unique demands. In some cases, professors were fine with instructors continuing to
provide instruction with the Statistical Abstract, but especially with English classes
they wanted other options. In a typical English 101 class, an instructor would go
over basic search construction while using EBSCO’s Academic Search Complete.
Initially instructors had hoped that they would be able to use this database as a
replacement for the Statistical Abstract. Unfortunately, while it had data sets hidden
within the articles, most students did not want to put in the effort to fully explore
the articles or they would not choose with careful deliberation, instead choosing the
first article that matched their keywords. Had the instructors been able to do this
over multiple days in embedded instruction, this plan likely would have worked
because there would be more time for discussion about sourcing materials and
choosing appropriate resources. Very little embedded instruction takes place with
the infographics lessons, so ultimately the instructors elected to work with tools
from the Census or Gale’s Opposing Viewpoints database, which frequently
covered the types of arguments that freshmen were offered and gave sourcing for
their data sets that students could use. Subject specialists often knew of databases or
websites that would give advanced scholars the information that we were seeking,
but that would not be appropriate for a freshman with little research experience. It
was a balancing act of exposing them to new information without overwhelming
them.
DISCUSSION
The initial infographics project for freshman students new to the university was a
resounding success in many ways. Word of the lesson spread from professor to
professor, and it became a common request from professors regardless of whether
their specific class had initially been targeted to receive that type of instruction.
Infographics instruction is now offered not only to freshmen but also to English 101 courses,
business students, and communications students, among other populations. While
the lesson has proved popular across campus, there were some concerns that the
lesson would get overblown and eventually students would receive instruction in it
more than once or simply move on beyond the topic, having done it already in
other courses. Preventing these occurrences is a special pet project of the instructors
in that each instructor tries to tailor his or her lessons to each class and make it
unique no matter how many times a student may have been in the library. There
was some initial concern, so instructors made sure that even these infographics lessons would end up being unique in their own way, whether by focusing on a specific type of data or by going into a more detailed discussion of design elements
for the infographics. Many instructors chose to incorporate infographics into
multimodal lesson plans, often having the students use the infographics as a final
project to sum up a semester’s worth of research into a presentable project. In
particular, professors who were seeking to add digital humanities elements to their
courses were impressed by the infographics, and they felt that this project was
perfectly matched to melding their digital requirements with their research needs.
Assessing the end product of the infographics initially offered some amount of
confusion. While some students would turn in almost astounding infographics that
offered comprehensive thought, beautiful pictures, and great composition, there
were also students who turned in infographics that seemed to offer no thought
process or simply seemed as though the student was not engaged with the project
and was merely going through the motions with it. Given such wildly variable end
products, instructors debated for a while whether this had been a successful project.
Indeed, although professors were excited about the project, the library instructors
wanted to ensure that this was also an exciting project for the students as well.
Overall, consensus and reports from the students indicated that they were enjoying the project. Feedback came from firsthand, face-to-face interactions and
emailed follow-up from instructors and students. It was not uncommon to see
students later in the semester and hear them say, “Hey! You’re that infographics
lady!” Similarly, this familiarity with librarian instructors encouraged students to
feel welcome and relaxed at the library because they knew a friendly face. While
many students had been in the library before coming to their orientations, many
expressed that they were uncomfortable using services such as reference, whereas
after instruction they were more comfortable in the space overall.
Although part of assessment involved ensuring that each infographic had the
elements listed above as part of the necessary requirements, assessment still needed
to be more than a simple checklist. Some discussions have revolved around the
library’s need for annual assessment. One way that instructors hope to incorporate
instruction assessment is to have professors submit the infographics to librarians
after class and have them build a rubric and norming session to see how successful
the infographics truly were in effectively communicating to the students the
particular strengths of the assignment. Items up for debate on the rubric would be
creativity (use of color, imagery, etc.), citation, data accuracy, and time. This may
vary depending on what instructors end up preferring in their own teaching styles,
but we wanted to ensure that all infographics were judged based on the same basic
elements. Once a semester’s worth of infographics have been gathered, library
instructors can spend time reviewing and assessing just what has come of these
projects. Depending on your institution, you may need to gather institutional
review board approval or a student release form for these data.
While using infographics in class is not necessarily the most straightforward
assignment, if instructors have the time to plan their lessons carefully, they should
be able to convey their own sense of excitement to their students along with their own expectations of and goals for those students. One librarian
took it on herself to find books and reference materials on graphic design and used
them in her lesson plans to incorporate better creativity and more appealing
infographics. With these small variations, infographics instruction began to vary.
One instructor might truly excel at explaining graphic design and ways to choose
colors and imagery, while another instructor might have strengths that rely mostly
on statistics and data literacy. While both of these skill sets are invaluable to a class
on infographics, inconsistent instruction can over time create wildly differing
classes. It was important to ensure that while some instructors might be stronger at
teaching different aspects of the class, they were all relying on the same basic plan.
This ensures that while details of individual classes might be different, all students
are still receiving the same quality and level of education, appropriate to where they
are in their educational process. It is an easy step to overlook when scheduling and
organizing, but it is a vital step that cannot be forgotten, no matter the class.
CONCLUSION
The overall consensus from both students and faculty was that this project was fun
and engaging, and while it was not a task students would immediately associate with
the library, it helped them learn about resources they otherwise would not have
known about. Although the project focused mostly on the Statistical Abstract, it was
easily adapted to other databases, particularly for class-specific research in areas
such as business and engineering.
Piktochart is developing a mobile-friendly app, and instructors hope to use it in the
classroom in the future. Not all students have tablets or smartphones, but those
who do not can increasingly be accommodated by the tablets the library already
makes available for checkout. This would not only let students learn to interpret
and share information using mobile technology, it would also help instructors
adapt to increasingly rapid technological change. Just as there is room for growth
with mobile applications, instructors continually watch for new and better
infographics sources in case there is a way to improve the lesson plan. Just because
something works does not mean it should be left alone without periodic
reassessment; what was effective at first may, over time, no longer be relevant or
needed by students. Continuous assessment lets instructors see where their
strengths and weaknesses in a particular subject area lie. Currently, the mobile app
is best suited to viewing infographics that have already been created, but there is
hope that future versions will support creating them within the app.
Teaching nonacademic courses can be a challenge for any library instructor. The
instructors were used to arriving with new ideas for this course each year, but as the
class was taught over a string of semesters to different and growing groups, they
began to see that this was a powerful tool worth investing time in. Many new
instruction techniques come across as the buzzword of the week and can lead to
burnout on new ideas; this project instead helped renew the passion that many felt
about instruction. It can be worth trying new things, and this was a stellar project
that could easily be translated into a number of environments. The library
instructors would recommend it for a wide range of audiences, although those
already familiar with the details of data visualization might find it too simple. It
makes an excellent beginner’s project for those who are new to data literacy and
want a threshold project to introduce them to the topic. While there are certainly
more in-depth ways to take advantage of a project like this, it works well as an
entry-level way to bring data visualization to students who may be unfamiliar with
the concept.
Appendix
The following pages list and describe the data visualization technologies discussed in
this book.
ArcGIS
Type: Software
Description: Suite of geospatial visualization and GIS tools.
Supported Operating Systems, Browsers, and Software: Windows, Mac OS X, Linux (ArcGIS Server)
Examples of Associated File Extensions: (see Shapefile below)
Proprietary: Yes

CSS3
Type: Stylesheet language
Description: Display, formatting, and style information for web pages and web data visualizations. Can be edited using plain text editors and IDEs.¹
Supported Operating Systems, Browsers, and Software: All modern Internet browsers²
Examples of Associated File Extensions: .css
Proprietary: No

CSV
Type: Data format
Description: Comma Separated Values; stores data delimited by commas.
Supported Operating Systems, Browsers, and Software: Excel, Apple Numbers, Open/LibreOffice
Examples of Associated File Extensions: .csv
Proprietary: No
1. Integrated development environments designed for reading and editing computer code such as
Sublime Text (https://www.sublimetext.com/).
2. Modern browsers include Internet Explorer, Chrome, Firefox, and Safari.
3. The source code for Recollection is freely available to download
(http://sourceforge.net/projects/loc-recollect/).
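As a quick illustration of the comma-delimited structure described in the CSV entry above, the short R sketch below writes and then reads a tiny .csv file; the file name, columns, and values are invented for the example and are not taken from this book.

# Illustrative only: the file name, columns, and values below are invented.
circulation <- data.frame(branch = c("Main", "East"),
                          checkouts = c(1250, 430))
write.csv(circulation, "circulation.csv", row.names = FALSE)  # writes comma-delimited rows
readLines("circulation.csv")  # "branch","checkouts" / "Main",1250 / "East",430
read.csv("circulation.csv")   # reads the file back into a two-column data frame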
About the Editor