Seeing Cities Through Big Data: Research, Methods and Applications in Urban Informatics
Piyushimita (Vonu) Thakuriah, Nebiyou Tilahun, Moira Zellner (Editors)
Springer Geography
The Springer Geography series seeks to publish a broad portfolio of scientific books, aiming at researchers, students, and everyone interested in geographical research. The series includes peer-reviewed monographs, edited volumes, textbooks, and conference proceedings. It covers the entire research area of geography including, but not limited to, Economic Geography, Physical Geography, Quantitative Geography, and Regional/Urban Planning.
Preface

Big Data is spawning new areas of research, new methods and tools, and new
insights into Urban Informatics. This edited volume presents several papers
highlighting the opportunities and challenges of using Big Data for understanding
urban patterns and dynamics. The volume is intended for researchers, educators,
and students who are working in this relatively new area and outlines many of the
considerations that are likely to arise in research, applications, and education. The
papers tackle a myriad of issues—while some empirical papers showcase insights
that Big Data can provide on urban issues, others consider methodological issues or
case studies which highlight how Big Data can enrich our understanding of urban
systems in a variety of contexts.
The chapters in this book are peer-reviewed papers selected among those
originally presented in a 2-day workshop on Big Data and Urban Informatics
sponsored by the National Science Foundation and held at the University of Illinois
at Chicago in 2014. The workshop brought together researchers, educators, practitioners, and students representing a variety of academic disciplines including Urban Planning, Computer Science, Civil Engineering, Economics, Statistics, and Geography. It was a unique opportunity for urban social scientists and data scientists to exchange ideas on how Big Data can be or is being used to address a variety of urban
challenges. This edited volume draws from these various disciplines and seeks to
address the numerous important issues emerging from these areas.
This volume is intended to introduce and familiarize the reader with how Big
Data is being used as well as to highlight different technical and methodological
issues that need to be addressed to ensure urban Big Data can answer critical urban
questions. The issues explored in this volume cover eight broad categories and span
several urban sectors including energy, the environment, transportation, housing,
and emergency and crisis management. Authors have also considered the complexities and institutional factors involved in the use of Big Data, from meeting educational needs to changing organizational and social equity perspectives regarding data innovations and entrepreneurship. Others consider the methodological and technical issues that arise in collecting, managing, and analyzing unstructured
user-generated content and other sensed urban data. We have aimed to make the
volume comprehensive by incorporating papers that show both the immense potential Big Data holds for Urban Informatics and the challenges it poses.
We would like to acknowledge the support of the National Science Foundation
which funded the Big Data and Urban Informatics workshop, without which this
volume would not have been possible. We would also like to thank the Department
of Urban Planning and Policy at the University of Illinois at Chicago which
provided additional support for the workshop. A number of people helped us in
preparing this edited volume and in the events that led up to the workshop. A special
thank you to Alison Macgregor of the University of Glasgow who helped us
organize and manage the review process and to Keith Maynard for providing
editing support. We are immensely grateful to Ms. Nina Savar whose efforts
ensured a successful workshop. We are also indebted to all of the anonymous
reviewers who took their time to provide useful feedback to the authors in this
volume.
Contributors

Deepti Adlakha, M.U.D. Brown School, Washington University in St. Louis, St.
Louis, MO, USA
Prevention Research Center, Washington University in St. Louis, St. Louis, MO,
USA
Amna Alruheli Department of Landscape Architecture and Environmental
Planning, University of California, Berkeley, CA, USA
Francisco Antunes Center of Informatics and Systems of the University
of Coimbra, Coimbra, Portugal
Elsa Arcaute Centre for Advanced Spatial Analysis (CASA), University College
London, London, UK
Camille Barchers School of City and Regional Planning, Georgia Institute
of Technology, Atlanta, GA, USA
Michael Batty Centre for Advanced Spatial Analysis (CASA), University College
London, London, UK
Itzhak Benenson Department of Geography and Human Environment, Tel Aviv
University, Tel Aviv, Israel
Eran Ben-Elia Department of Geography and Environmental Development,
Ben-Gurion University of the Negev, Beersheba, Israel
Gregory S. Biging Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, USA
Emma Boundy, M.A. Department of City and Regional Planning, University
of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Piyushimita (Vonu) Thakuriah Urban Studies and Urban Big Data Centre,
University of Glasgow, Glasgow, UK
Andrew Tice Faculty of Built Environment, City Futures Research Centre,
University of New South Wales, Kensington, NSW, Australia
Nebiyou Tilahun Department of Urban Planning and Policy, College of Urban Planning and Public Affairs, University of Illinois at Chicago, Chicago, IL, USA
Shaowen Wang CyberGIS Center for Advanced Digital and Spatial Studies, University of Illinois at Urbana-Champaign, Urbana, IL, USA
CyberInfrastructure and Geospatial Information Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Department of Geography and Geographic Information Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Department of Urban and Regional Planning, University of Illinois at Urbana-Champaign, Urbana, IL, USA
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Zhenxin Wang Center for Human-Engaged Computing, Kochi University of
Technology, Kochi, Japan
Department of Urban and Regional Planning, University at Buffalo, The State
University of New York, Buffalo, NY, USA
Nigel Waters GeoInformatics and Earth Observation Laboratory, Department of
Geography and Institute for CyberScience, The Pennsylvania State University,
University Park, PA, USA
Laiyun Wu Department of Urban and Regional Planning, University at Buffalo,
The State University of New York, Buffalo, NY, USA
Jeremy S. Wu, Ph.D. Retired, Census Bureau, Suitland, Maryland, and
Department of Statistics, George Washington University, Washington, DC, USA
Ci Yang, Ph.D. Senior Transportation Data Scientist, DIGITALiBiz, Inc., Cambridge, MA, USA
Junjun Yin CyberGIS Center for Advanced Digital and Spatial Studies, University of Illinois at Urbana-Champaign, Urbana, IL, USA
CyberInfrastructure and Geospatial Information Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Introduction to Seeing Cities Through Big Data: Research, Methods and Applications in Urban Informatics
Piyushimita (Vonu) Thakuriah, Nebiyou Tilahun and Moira Zellner

The chapters in this book were first presented in a 2-day workshop on Big Data and
Urban Informatics held at the University of Illinois at Chicago in 2014. The
workshop, sponsored by the National Science Foundation, brought together approximately 150 educators, practitioners and students from 91 different institutions in
11 countries. Participants represented a variety of academic disciplines including
Urban Planning, Computer Science, Civil Engineering, Economics, Statistics, and
Geography and provided a unique opportunity for discussions by urban social
scientists and data scientists interested in the use of Big Data to address urban
challenges. The papers in this volume are a selected subset of those presented at the
workshop and have gone through a peer-review process.
Our main motivation for the workshop was to convene researchers and professionals working on the emerging interdisciplinary research area around urban Big Data. We sought to organize a community with interests in theoretical developments and applications demonstrating the use of urban Big Data, and the next-
generation of Big Data services, tools and technologies for Urban Informatics. We
were interested in research results as well as idea pieces and works in progress that
highlighted research needs and data limitations. We sought papers that clearly create or use novel, emerging sources of Big Data for urban and regional analysis in transportation, environment, public health, land-use, housing, and economic development.
The chapters in this book are organized around eight broad categories: (1) Analytics
of user-generated content; (2) Challenges and opportunities of urban Big Data;
(3) Changing organizational and educational perspectives with urban Big Data;
(4) Urban data management; (5) Urban knowledge discovery applied to a variety of
urban contexts; (6) Emergencies and Crisis; (7) Health and well-being; and
(8) Social equity and data democracy.
The second set of papers considers the challenges and opportunities of urban Big
Data, particularly as an auxiliary data source that can be combined with more
traditional survey data, or even as a substitute for large survey-based public
datasets. Big Data exists within a broader data economy that has changed in recent years, raising questions about the quality of traditional sources such as the American Community Survey (ACS). Spielman (2016) argues that carefully considered Big Data sources hold the potential to increase confidence in the estimates provided by data sources such as the ACS. Recognizing that Big Data appears to be an attractive alternative to design-based survey data, Johnson and Smith (2016) caution that it carries potentially serious methodological costs, and they call for efforts to find ways of integrating these data sources, whose different qualities make them valuable for understanding cities.
In addition to the cost savings, the potential of data fusion strategies lies in the integration of a rich diversity of data sources, shedding light on complex urban phenomena from different angles and filling different gaps. There are, however, major barriers to doing so, stemming from the difficulty of controlling the quality and quantity of the data, and from privacy issues (Spielman 2016). The proliferation of Big Data sources also demands new approaches to computation and analysis. Gunturi and Shekhar (2016) explore the computational challenges posed by spatio-temporal Big Data generated from location-aware sensors and how these may be addressed by scalable analytics. In another application, Antunes et al. (2016) discuss how explicitly addressing heteroscedasticity greatly improves the quality of model predictions and the confidence associated with those predictions in regression analysis using Big Data.
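To make the heteroscedasticity point concrete, here is a minimal sketch of variance-aware regression on synthetic data, assuming statsmodels' standard OLS/WLS interface; it illustrates the general technique, not the model of Antunes et al. (2016).

# Weighted least squares as a simple response to heteroscedasticity.
# Illustrative only: synthetic data, not the authors' model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(1, 10, n)                      # e.g., trip distance
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)     # error spread grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                       # ignores heteroscedasticity

# Feasible GLS: model the log-variance from OLS residuals, then reweight.
log_var = sm.OLS(np.log(ols.resid ** 2), X).fit()
weights = 1.0 / np.exp(log_var.fittedvalues)   # weight = 1 / est. variance
wls = sm.WLS(y, X, weights=weights).fit()

print(ols.bse)   # OLS standard errors (unreliable here)
print(wls.bse)   # WLS standard errors after modeling the variance

In a real application the variance model would be chosen and validated carefully; the point is simply that modeling the error variance changes both the fitted weights and the stated confidence of the predictions.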
The third set of papers focuses on the organizational and educational perspectives that change with Big and Open Urban Data. Cities are investing in technologies to enhance human and automated decision-making. For smarter cities, however, urban systems and subsystems require connectivity through data and information management. Conceptualizing cities as platforms, Krishnamurthy et al. (2016) discuss how data and technology management are critical for cities to become agile, adaptable and scalable, while also raising critical considerations to ensure such goals are achieved. Thakuriah et al. (2016a) review organizations in the urban data sector with the aim of understanding their role in the production of data and in service delivery using data. They identify nine organizational types in this dynamic and rapidly evolving sector, which they align along five dimensions to account for their mission, major interests, products and activities: techno-managerial, scientific, business and commercial, urban engagement, and openness and transparency.
Despite the rapid emergence of this data-rich world, French et al. (2016) ask whether the urban planners of tomorrow are being trained to leverage these emerging resources for creating better urban spaces. They argue that urban planners are still being educated to work in a data-poor environment, taking courses in statistics, survey research, and projection and estimation that are designed to fill in the gaps in this environment. With the advent of Big Data, visualization, simulation, data mining and machine learning may become the appropriate tools planners can use, and planning education and practice need to reflect this new reality (French et al. 2016). In line with this argument, Estiri (2016) proposes new frameworks for planning for urban energy demand, based on the improvements that non-linear modeling approaches provide over mainstream linear modeling.
The book also includes examples of online platforms and software tools that allow for urban data management, and applications that use such urban data to measure urban indicators. The AURIN (Australian Urban Research Infrastructure Network) workbench (Pettit et al. 2016), for example, provides machine-to-machine online access to large-scale distributed and heterogeneous data resources from across Australia, which can be used to understand, among other things, housing affordability. AURIN allows users to systematically access existing data and run spatial-statistical analyses, but a number of additional software tools are required to undertake data extraction and manipulation. In another application, measuring the performance of transit systems in San Francisco, researchers have developed software tools to support the fusion and analysis of large, passively collected data sources such as automated vehicle location (AVL) and automated passenger count (APC) records (Erhardt et al. 2016). The tools include methods to expand the data from a sample of buses, and are able to report and track performance on several key metrics over several years. Queries and comparisons support the analysis of change over time.
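A minimal sketch of the expansion step mentioned above, assuming hypothetical pandas DataFrames of sampled APC boardings and the full schedule; the actual tools' data model and expansion method are more elaborate.

# Expand boardings observed on the instrumented subset of buses to
# route/period totals. Hypothetical column names; illustrative only.
import pandas as pd

apc = pd.DataFrame({            # one row per sampled (instrumented) trip
    "route": ["38", "38", "14", "14"],
    "period": ["AM", "AM", "AM", "AM"],
    "boardings": [52, 61, 47, 40],
})
schedule = pd.DataFrame({       # all scheduled trips, sampled or not
    "route": ["38", "14"],
    "period": ["AM", "AM"],
    "scheduled_trips": [20, 12],
})

obs = (apc.groupby(["route", "period"])["boardings"]
          .agg(mean_boardings="mean", sampled_trips="size")
          .reset_index())

est = obs.merge(schedule, on=["route", "period"])
est["expansion"] = est["scheduled_trips"] / est["sampled_trips"]
est["est_total_boardings"] = est["mean_boardings"] * est["scheduled_trips"]
print(est)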
Owen and Levinson (2016) also showcase a national public transit job accessibility evaluation at the Census block level. This involved assembling and
processing a comprehensive national database of public transit network topology
and travel times, allowing users to calculate accessibility continuously for every
minute within a departure time window of interest. The increased computational
complexity is offset by the robust representation of the interaction between transit
service frequency and accessibility at multiple departure times.
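A minimal sketch of this minute-by-minute accessibility idea, assuming travel-time arrays are already computed for each departure minute (tiny synthetic data standing in for the authors' national database):

# Cumulative-opportunity accessibility averaged over departure minutes.
import numpy as np

minutes, blocks, dests = 60, 3, 4              # 60-minute departure window
jobs = np.array([100, 250, 50, 400])           # jobs at each destination

# travel_time[m, i, j]: minutes, leaving origin block i at minute m.
rng = np.random.default_rng(0)
travel_time = rng.uniform(5, 60, size=(minutes, blocks, dests))

reachable = travel_time <= 30.0                # within a 30-minute threshold
access_by_minute = reachable @ jobs            # jobs reachable, per minute
avg_access = access_by_minute.mean(axis=0)     # time-averaged accessibility
print(avg_access)                              # one value per origin block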
Yet the data infrastructure needed to support Urban Informatics does not materialize overnight. Wu and Zhang (2016) demonstrate how resources at the scale of an entire country are needed to establish the basic processes required to develop comprehensive citizen-oriented services. Focusing on China's emerging smart cities program, they demonstrate the need for a proactive data-driven approach to meet the challenges posed by China's urbanization. The approach requires not only a number of technological and data-oriented solutions, but also a change in culture towards statistical thinking, quality management, and data integration. New investments in smart cities create the opportunity to design systems whose data can drive much-needed governmental innovation.
Big Data is playing a major role in urban knowledge discovery and planning support. For example, a high-resolution digital surface model (DSM) from Light Detection and Ranging (LiDAR) has supported the dynamic simulation of flooding due to sea level rise in California (Ju et al. 2016). This study provides more detailed information than static mapping, and serves as a fine-grained database for better planning, management, and governance to understand future scenarios. In
another example, Khan and Machemehl (2016) study how land use and different
social and policy variables affect free-floating carsharing vehicle choice and
parking duration, for which there is very little empirical data. The authors use
two approaches, logistic regression and a duration model, and find that land-use-level socio-demographic attributes are important factors in explaining usage patterns of carsharing services. This has implications for carsharing parking policy and
the availability of transit around intermodal transportation. Another example by
Grinberger et al. (2016) shows that synthetic big data can also be generated from
standard administrative small data for applications in urban disaster scenarios. The
data decomposition process involves moving from a database describing only
hundreds or thousands of spatial units to one containing records of millions of
buildings and individuals (agents) over time, which then populate an agent-based simulation of responses to a hypothetical earthquake in downtown Jerusalem.
Simulations show that temporary shocks to movement and traffic patterns can
generate longer term lock-in effects, which reduce commercial activity. The issue
arising here is the ability to identify when this fossilization takes place and when a
temporary shock has passed the point of no return. A high level of household turnover and ‘churning’ through the built fabric of the city in the aftermath of an earthquake was also observed, which points to a waste of material, human, and emotional resources. Less vulnerable socio-economic groups ‘weather the
storm’ by dispersing and then re-clustering over time.
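Returning to the DSM-based flood simulation of Ju et al. (2016) mentioned above: here is a minimal sketch of a static ‘bathtub’ inundation screen with hydrologic connectivity on a toy elevation grid; the authors' dynamic simulation is considerably more sophisticated.

# Cells flood if they lie below the water level AND connect to the sea.
# A toy grid stands in for a LiDAR-derived DSM raster; illustrative only.
import numpy as np
from scipy import ndimage

dsm = np.array([                 # elevations in meters; column 0 is the coast
    [0.2, 1.5, 3.0, 2.8],
    [0.1, 0.8, 2.5, 0.4],        # the low inland cell (0.4 m) ...
    [0.3, 1.2, 2.9, 3.1],
])
water_level = 1.0                # sea-level-rise scenario, meters

below = dsm <= water_level       # low-lying cells
labels, _ = ndimage.label(below) # connected groups of low-lying cells
coastal = set(labels[:, 0]) - {0}           # groups touching the coastline
flooded = np.isin(labels, list(coastal))    # ... stays dry: not connected
print(flooded.astype(int))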
A suite of studies focuses on new methods to apply Big Data to transportation
planning and management, particularly with the help of GIS tools. Benenson
et al. (2016) use big urban GIS data that is already available to measure accessibility
from the viewpoint of an individual traveler going door-to-door. In their work, a computational application based on intensive querying of relational database management systems was developed to construct high-resolution accessibility maps for an entire metropolitan area and to evaluate new infrastructure projects.
High-resolution representations of trips enabled unbiased accessibility estimates,
providing more realistic assessments of such infrastructure investments, and a
platform for transportation planning. Similarly, Yang and Gonzales (2016) show
that Big Data derived from taxicabs’ Global Positioning Systems (GPS) can be used
to refine travel demand and supply models and street network assessments by processing the data and integrating it with GIS. Such evaluations can help identify service mismatches, and support fleet regulation and management. Hwang et al. (2016) demonstrate a case where GPS trajectory data are used to study travel behavior and
to estimate carbon emission from vehicles. They propose a reliable method for
partitioning GPS trajectories into meaningful elements for detecting a stay point
(where an individual stays for a while) using a density-based spatial clustering
algorithm.
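A minimal sketch of density-based stay-point detection in this spirit, assuming scikit-learn's DBSCAN and a synthetic trajectory; the authors' partitioning method involves further steps.

# Dense clusters of GPS fixes indicate stay points; sparse fixes are travel.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
home = rng.normal([41.8781, -87.6298], 0.0002, size=(30, 2))  # long dwell
trip = np.column_stack([np.linspace(41.8781, 41.8900, 10),    # en route
                        np.linspace(-87.6298, -87.6200, 10)])
work = rng.normal([41.8900, -87.6200], 0.0002, size=(30, 2))  # long dwell
fixes = np.vstack([home, trip, work])

# eps of roughly 100 m expressed in degrees (rough, latitude-dependent);
# min_samples acts as a minimum dwell requirement.
db = DBSCAN(eps=0.001, min_samples=15).fit(fixes)

for label in sorted(set(db.labels_) - {-1}):   # -1 marks noise (moving)
    centroid = fixes[db.labels_ == label].mean(axis=0)
    print("stay point", label, centroid)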
Big Data has particular potential in helping to deal with emergencies and urban
crises in real time. Cervone et al. (2016) propose a new method to use real-time
social media data (e.g., Twitter, photos) to augment remote sensing observations of
transportation infrastructure conditions in response to emergencies. Challenges remain, however, associated with producer anonymity and geolocation accuracy, as well as differing levels of data confidence.
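One early step in such a pipeline is spatial filtering of harvested posts to the affected area; a minimal sketch, with hypothetical record fields (real social media payloads and the authors' fusion method differ):

# Keep only geotagged posts inside the affected area's bounding box.
posts = [
    {"text": "bridge closed, water over the road", "lat": 29.95, "lon": -90.07},
    {"text": "lovely day downtown", "lat": 40.71, "lon": -74.00},
    {"text": "flooding on Elm St", "lat": 29.97, "lon": -90.05},
]
bbox = (29.90, -90.15, 30.00, -90.00)   # (south, west, north, east)

def in_bbox(post, bbox):
    south, west, north, east = bbox
    return south <= post["lat"] <= north and west <= post["lon"] <= east

relevant = [p for p in posts if in_bbox(p, bbox)]
for p in relevant:
    print(p["text"])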
Health and well-being is another major area where Big Data is making significant contributions. Data on pedestrian movement has, however, proven difficult and costly to collect and analyze. Yin et al. (2016b) propose and test a new image-based machine learning method which processes panoramic street images from Google Street View to detect pedestrians. Initial results with this method closely resemble pedestrian field counts, and so it can be used for planning and design. Another paper, by Hipp et al. (2016), uses the Archive of Many Outdoor Scenes (AMOS) project, which aims to geolocate, annotate, archive, and visualize outdoor cameras and images to serve as a resource for a wide variety of scientific applications. The AMOS image dataset, crowdsourcing, and eventually machine learning can be used to develop reliable, real-time, non-labor-intensive and valid tools to improve physical activity assessment via online, outdoor webcam capture of global physical activity patterns and urban built environment characteristics.
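Both projects rest on detecting people in street-level imagery. A minimal sketch using OpenCV's stock HOG pedestrian detector as a stand-in for the authors' own trained models:

# Count pedestrian candidates in a street-level image with OpenCV's
# built-in HOG + linear SVM people detector. Illustrative stand-in only.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street_scene.jpg")   # hypothetical input image
boxes, weights = hog.detectMultiScale(
    image, winStride=(8, 8), padding=(8, 8), scale=1.05
)

print("detected", len(boxes), "pedestrian candidates")
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("street_scene_annotated.jpg", image)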
A third paper (Park 2016) describes research conducted under the Citygram
project umbrella and illustrates how a cost-effective prototype sensor network,
remote sensing hardware and software, database interaction APIs, soundscape
analysis software, and visualization formats can help characterize and address
urban noise pollution in New York City. This work embraces the idea of time-
variant, poly-sensory cartography, and reports on how scalable infrastructural
technologies can capture urban soundscapes to create dynamic soundmaps.
Last, but not least, Nguyen and Boundy (2016) discuss issues surrounding Big Data
and social equity by focusing on three dimensions: data democratization, digital
access and literacy, and promoting equitable outcomes. The authors examine how
Big Data has changed local government decision-making, and how Big Data is
being used to address social equity in New York, Chicago, Boston, Philadelphia,
and Louisville. Big Data is changing decision-making by supplying more data
sources, integrating cross-agency data, and using predictive rather than reactive
analytics. Still, no study has examined the cost-effectiveness of these programs to
determine the return on investment. Moreover, local governments have largely
focused on tame problems and gains in efficiency. Technologies remain largely
accessible to groups that are already advantaged, and may exacerbate social
inequalities and inhibit democratic processes. Carr and Lassiter (2016) question
the effectiveness of civic apps as an interface between urban data and urban
residents, and ask who is represented by and who participates in the solutions
offered by apps. They determine that the transparency, collaboration and innovation
that hackathons aim to achieve are not yet fully realized, and suggest that a first step
to improving the outcomes of civic hackathons is to subject these processes to the
same types of scrutiny as any other urban practice.
3 Conclusions
The urban data landscape is changing rapidly. There has been a tremendous amount
of interest in the use of emerging forms of data to address complex urban problems.
It is therefore an opportune time for an interdisciplinary research community to
have a discussion on the range of issues relating to the objectives of Urban
Informatics, the research approaches used, the research applications that are emerg-
ing, and finally, the many challenges involved in using Big Data for Urban
Informatics.
We hope this volume familiarizes the reader with both the potential and the
technological and methodological challenges of Big Data, the complexities and
institutional factors involved, as well as the educational needs for adopting these
emerging data sources into practice, and for adapting to the new world of urban Big
Data. We have also sought to incorporate papers that highlight the challenges that
need to be addressed so the promise of Big Data is fulfilled. The challenges of
representativeness and of equity in the production of such data and in applications
that use Big Data are also areas needing continued attention. We have aimed to make the volume comprehensive, but we also recognize that a single volume
cannot completely cover the broad range of applications using Big Data in urban
contexts. We hope this collection proves an important starting point.
References
Krishnamurthy R, Smith KL, Desouza KC (2016) Urban informatics: critical data and technology
considerations. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data:
research, methods and applications in urban informatics. Springer, New York
Nguyen MT, Boundy E (2016) Big data and smart (equitable) cities. In: Thakuriah P, Tilahun N,
Zellner M (eds) Seeing cities through big data: research, methods and applications in urban
informatics. Springer, New York
Owen A, Levinson DM (2016) Developing a comprehensive U.S. transit accessibility database. In:
Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data: research, methods and
applications in urban informatics. Springer, New York
Park TH (2016) Mapping urban soundscapes via Citygram. In: Thakuriah P, Tilahun N, Zellner M
(eds) Seeing cities through big data: research, methods and applications in urban informatics.
Springer, New York
Pettit C, Tice A, Randolph B (2016) Using an online spatial analytics workbench for understand-
ing housing affordability in Sydney. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities
through big data: research, methods and applications in urban informatics. Springer, New York
Spielman SE (2016) The potential for big data to improve neighborhood-level census data. In:
Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data: research, methods and
applications in urban informatics. Springer, New York
Tang Z, Zhou Y, Yu H, Gu Y, Liu T (2016) Developing an interactive mobile volunteered
geographic information platform to integrate environmental big data and citizen science in
urban management. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data:
research, methods and applications in urban informatics. Springer, New York
Tasse D, Hong JI (2016) Using user-generated content to understand cities. In: Thakuriah P,
Tilahun N, Zellner M (eds) Seeing cities through big data: research, methods and applications
in urban informatics. Springer, New York
Thakuriah P, Dirks L, Mallon-Keita K (2016a) Digital infomediaries and civic hacking in emerging urban data initiatives. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities
through big data: research, methods and applications in urban informatics. Springer, New York
Thakuriah P, Tilahun N, Zellner M (2016b) Big data and urban informatics: innovations and
challenges to urban planning and knowledge discovery. In: Thakuriah P, Tilahun N, Zellner M
(eds) Seeing cities through big data: research, methods and applications in urban informatics.
Springer, New York
Wu J, Zhang R (2016) Seeing Chinese cities through big data and statistics. In: Thakuriah P,
Tilahun N, Zellner M (eds) Seeing cities through big data: research, methods and applications
in urban informatics. Springer, New York
Yang C, Gonzales EJ (2016) Modeling taxi demand and supply in New York City using large-scale
taxi GPS data. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data:
research, methods and applications in urban informatics. Springer, New York
Yin J, Gao Y, Wang S (2016a) CyberGIS-enabled urban sensing from volunteered citizen
participation using mobile devices. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities
through big data: research, methods and applications in urban informatics. Springer, New York
Yin L, Cheng Q, Shao Z, Wang Z, Wu L (2016b) ‘Big Data’: pedestrian volume using Google
street view images. In: Thakuriah P, Tilahun N, Zellner M (eds) Seeing cities through big data:
research, methods and applications in urban informatics. Springer, New York
Big Data and Urban Informatics: Innovations and Challenges to Urban Planning and Knowledge Discovery
Piyushimita (Vonu) Thakuriah, Nebiyou Tilahun and Moira Zellner
Abstract Big Data is the term being used to describe a wide spectrum of observational or “naturally-occurring” data generated through transactional, operational,
planning and social activities that are not specifically designed for research. Due to
the structure and access conditions associated with such data, their use for research
and analysis becomes significantly complicated. New sources of Big Data are
rapidly emerging as a result of technological, institutional, social, and business
innovations. The objective of this background paper is to describe emerging sources
of Big Data, their use in urban research, and the challenges that arise with their use.
To a certain extent, Big Data in the urban context has become narrowly associated
with sensor (e.g., Internet of Things) or socially generated (e.g., social media or
citizen science) data. However, there are many other sources of observational data
that are meaningful to different groups of urban researchers and user communities.
Examples include privately held transactions data, confidential administrative
micro-data, data from arts and humanities collections, and hybrid data consisting
of synthetic or linked data.
The emerging area of Urban Informatics focuses on the exploration and understanding of urban systems by leveraging novel sources of data. The major potential
of Urban Informatics research and applications is in four areas: (1) improved
strategies for dynamic urban resource management, (2) theoretical insights and
knowledge discovery of urban patterns and processes, (3) strategies for urban
engagement and civic participation, and (4) innovations in urban management,
and planning and policy analysis. Urban Informatics utilizes Big Data in innovative
ways by retrofitting or repurposing existing urban models and simulations that are
underpinned by a wide range of theoretical traditions, as well as through data-
driven modeling approaches that are largely theory-agnostic, although these divergent research approaches are starting to converge in some ways. The paper surveys
the kinds of urban problems being considered in going from a data-poor environment to a data-rich world, and the ways in which such enquiries have the potential to enhance our understanding, not only of urban systems and processes overall, but also of contextual peculiarities and local experiences. The paper concludes by commenting on challenges that are likely to arise in varying degrees when using Big Data for Urban Informatics: technological, methodological, theoretical/epistemological, and the emerging political economy of Big Data.
1 Introduction
Urban and regional analysis involves the use of a wide range of approaches to
understand and manage complex sectors, such as transportation, environment,
health, housing, the built environment, and the economy. The goals of urban
research are many, and include theoretical understanding of infrastructural, physical and socioeconomic systems; developing approaches to improve urban operations and management; long-range plan making; and impact assessments of urban policy.
Globally, more people live in urban areas than in rural areas, with 54 % of the
world’s population estimated to be residing in urban areas in 2014 (United
Nations 2014), placing unprecedented demands on resources and leading to
significant concerns for urban management. Decision-makers face a myriad of
questions as a result, including: What strategies are needed to operate cities
effectively and efficiently? How can we evaluate potential consequences of
complex social policy change? What makes the economy resilient and strong,
and how do we develop shockproof cities? How do different cities recover from
man-made or natural disasters? What are the technological, social and policy
mechanisms needed to develop interventions for healthy and sustainable behavior? What strategies are needed for lifelong learning, civic engagement, and community participation, adaptation and innovation? How can we generate hypotheses about the historical evolution of social exclusion and the role of agents, policies and practices?
The Big Data tsunami has hit the urban research disciplines just like many other
disciplines. It has also stimulated the interest of practitioners and decision-makers
seeking solutions for governance, planning and operations of multiple urban sectors. The objective of this background paper is to survey the use of Big Data in the
urban context across different academic and professional communities, with a
particular focus on Urban Informatics. Urban Informatics is the exploration and
understanding of urban systems for resource management, knowledge discovery of
patterns and dynamics, urban engagement and civic participation, and planning and policy analysis.
For many, Big Data is just a buzzword and to a certain extent, the ambiguity in its
meaning reflects the different ways in which it is used in different disciplines and
user communities. The ambiguity is further perpetuated by the multiple concepts
that have become associated with the topic. However, the vagueness and well-worn
clichés surrounding the subject have overshadowed potentially strong benefits in
well-considered cases of use.
Based on a review of 1437 conference papers and articles that contained the full term “Big Data” in either the title or within the author-provided keywords, De Mauro et al. (2015) arrived at four groups of definitions of Big Data. These definitions focus on: (1) the characteristics of Big Data (massive, rapid, complex, unstructured and so on), with the 3-Vs (Volume, Variety and Velocity), referring to the sheer amount of information and the challenges it poses (Laney 2001), being a particularly over-hyped example; (2) the technological needs behind the processing of large amounts of data (e.g., needing serious computing power, or scalable architecture for efficient storage, manipulation, and analysis); (3) Big Data as associated with the crossing of some sort of threshold (e.g., exceeding the processing capacity of conventional database systems); and (4) the impact of Big Data advancement on society (e.g., shifts in the way we analyze information that transform how we understand and organize society).
Moreover, the term Big Data has also come to be associated with not just the data
itself, but with curiosity and goal-driven approaches to extract information out of
the data (Davenport and Patil 2012), with a focus on the automation of the entire
scientific process, from data capture to processing to modeling (Pietsch 2013). This
is partly an outcome of the close association between Big Data and data science,
which emphasizes data-driven modeling, hypothesis generation and data description in a visually appealing manner. These are elements of what has become known as the Fourth Paradigm of scientific discovery (Gray 2007, as given in Hey et al. 2009), which focuses on exploratory, data-intensive research, in contrast to earlier research paradigms focusing on describing, theory-building and computationally simulating observed phenomena.
Quantitative urban research has historically relied on data from censuses, surveys, and specialized sensor systems. While these sources of data will continue to
play a vital role in urban analysis, declining response rates to traditional surveys,
and increasing costs of administering the decennial census and maintaining and
replacing sensor systems have led to significant challenges to having high-quality
data for urban research, planning and operations. These challenges have led to
increasing interest in looking at alternative ways of supplementing the urban data
infrastructure.
For our purposes, Big Data refers to structured and unstructured data generated
naturally as a part of transactional, operational, planning and social activities, or the
linkage of such data to purposefully designed data. The use of such data gives rise to
technological and methodological challenges and complexities regarding the scientific paradigm and political economy supporting inquiry. Established and emerging sources of urban Big Data are summarized in Table 1: sensor systems, user-generated content, administrative data (open and confidential micro-data), private sector transactions data, data from arts and humanities collections, and hybrid data sources, including linked data and synthetic data. While there are many ways to organize Big Data for urban research and applications, the grouping here is primarily informed by the user community typically associated with each type of data;
other factors such as methods of generation, and issues of ownership and access are
also considered. The grouping is not mutually exclusive; for example, sensor
systems might be owned by public agencies for administrative and operational
purposes as well as by private companies to assist with transactions.
Novel patterns of demand and usage can be extracted from these data. The sensors
detect activity and changes in a wide variety of urban phenomena involving
inanimate objects (infrastructure, building structure), physical aspects of urban
areas (land cover, water, tree cover, atmospheric conditions), movement (of cars,
people, animals), and activity (use patterns, locations).
As noted earlier, sensor systems might be government or privately owned, with
very different access and data governance conditions, and some have been operational for a long time. Typical user communities are public and private urban operations management organizations, independent ICT developers, and others.
Transformative changes have taken place in the last decade regarding ways in
which citizens are being involved in co-creating information, and much has been
written about crowd-sourcing, Volunteered Geographic Information and, generally, User-Generated Content (UGC). Citizens, through the use of sensors or social
media, and other socially generated information resulting from their participation in
social, economic or civic activities, are going from being passive subjects of survey
and research studies to being active generators of information. Citizen-based
approaches can be categorized as contributory, collaborative, or co-created
(Bonney et al. 2009). UGC can generally occur: (1) proactively, when users voluntarily generate data on ideas, solve problems, and report on events, disruptions or activities that are of social and civic interest, or (2) retroactively, when analysts process secondary sources of user-submitted data published through the web, social
media and other tools (Thakuriah and Geers 2013).
UGC can be proactively generated through idea generation, feedback and
problem solving. Developments in Information and Communications Technology
(ICT) have expanded the range and diversity of ways in which citizens provide
input into urban planning. It has enabled sharing ideas and voting on urban projects,
and providing feedback regarding plans and proposals with the potential to affect
life in cities. These range from specialized focus groups where citizens provide
input to “hackathons” where individuals passionate about ICT and cities get
together to generate solutions to civic problems using data. Citizens also solve
problems; for example, through human computation (described further in Sect. 3.4)
to assess livability or the quality of urban spaces where objective metrics from
sensors and machines are not accurate. These activities produce large volumes of
structured and unstructured data that can be analyzed to obtain insights into
preferences, behaviors and so on.
There has been an explosive growth in the wealth of data proactively generated
through different sensing systems. Depending on the level of decision-making
needed on the part of users generating information, proactive sensing modes can
be disaggregated into participatory (active) and opportunistic (passive) sensing
modes. In participatory sensing, users voluntarily opt into systems that are specifically designed to collect information of interest (e.g., through apps which capture
information on quality of local social, retail and commercial services, or websites
that consolidate information for local ride-sharing), and actively report or upload
information on objects of interest. In opportunistic sensing, users enable their
wearable or in-vehicle location-aware devices to automatically track and passively
transmit their physical sensations, or activities and movements (e.g., real-time
automotive tracking applications which measure vehicle movements yielding data
on speeds, congestion, incidents and the like, as well as biometric sensors, life
loggers and a wide variety of other devices for personal informatics relating to
health and well-being). The result of these sensing programs are streams of content
including text, images, video, sound, GPS trajectories, physiological signals and
others, which are available to researchers at varying levels of granularity depending
on, among other factors, the need to protect personally identifiable information.
In terms of retroactive UGC, massive volumes of content are also created every
second of every day as a result of users providing information online about their
lives and their experiences. The key difference from the proactive modes is that
users are not voluntarily opting into specific systems for the purpose of sharing
information on particular topics and issues. There are many different types of
retroactive UGC that could be used for urban research including Internet search
terms, customer ratings, web usage data, and trends data. Data from social networks, micro-blogs or social media streams have generated a lot of interest among
researchers, with the dominant services at the current time being online social
networks such as Twitter, Facebook, LinkedIn and Foursquare, the latter being a
location-based social network. Additionally, there are general question-and-answer
databases from which data relevant to urban researchers could be retrieved as well.
Open government data initiatives, meanwhile, face barriers such as privacy legislation, limitations in data quality that prohibit publication, and limited user-friendliness (Huijboom and van den Broek 2011).
Many valuable uses of administrative data require access to personally identifiable information, typically micro-data at the level of individual persons, which are usually strictly protected by data protection laws or other governance mechanisms. Personally identifiable information is information that can be used on its own or together with other information to identify a specific individual, and the benefits of accessing and sharing identifiable administrative data for research purposes have to be balanced against the requirements for data security to ensure the protection of individuals’ personal information. Confidential administrative micro-data are of great interest to urban social scientists involved in economic and social policy research, as well as to public health and medical researchers.
There are several current activities involving confidential data that are likely to
be of interest to urban researchers. The UK Economic and Social Research Council
recently funded four large centers on administrative data research, including running data services to support confidential administrative data linkage, in a manner
similar to that offered in other countries such as Denmark, Finland, Norway and
Sweden. In the US, the Longitudinal Employment Household Dynamics (LEHD)
program of the Census Bureau is an example of an ambitious nationwide program
combining federal, state and Census Bureau data on employers and employees from
unemployment insurance records, data on employment and wages, additional
administrative data and data from censuses and surveys (Abowd et al. 2005), to
create detailed estimates of workforce and labor market dynamics.
Administrative data in some cases can be linked both longitudinally for the same
person over time and between registers of different types, e.g. linking employment
data of parents to children’s test scores, or linking medical records to a person’s
historical location data and other environmental data. The latter, for example, could
potentially allow research to investigate questions relating to epigenetics and
disease heritability (following Aguilera et al. 2010). Such linkages are also likely
to allow in-depth exploration of spatial and temporal variations in health and social
exclusion.
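A minimal sketch of such register linkage, assuming hypothetical de-identified tables keyed by a shared person identifier; real linkage work involves probabilistic matching, consent, and strict disclosure control.

# Link a child-level school register to a parent-level employment register.
# Hypothetical tables and column names; illustrative only.
import pandas as pd

employment = pd.DataFrame({
    "person_id": [1, 2, 3],
    "year": [2014, 2014, 2014],
    "employed": [True, False, True],
})
school = pd.DataFrame({
    "child_id": [10, 11],
    "parent_id": [1, 2],
    "test_score": [82, 67],
})

linked = school.merge(employment, left_on="parent_id",
                      right_on="person_id", how="left")
print(linked[["child_id", "employed", "test_score"]])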
There are vast arts and humanities collections that depict life in the city in the form
of text, image, sound recording, and linguistic collections, as well as media
repositories such as film, art, material culture, and digital objects. These highly
unstructured sources of data allow the representation of the ocular, acoustic and
other patterns and transformations in cities to be mapped and visualized, to potentially shed light on social, cultural and built environment patterns in cities. For
example, a recent project seeks to digitize a treasure trove of everyday objects such as
“advertisements, handbills, pamphlets, menus, invitations, medals, pins, buttons,
badges, three-dimensional souvenirs and printed textiles, such as ribbons and sashes”
to provide “visual and material insight into New Yorkers’ engagement with the
social, creative, civic, political, and physical dynamics of the city, from the Colonial
era to the present day” (Museum of the City of New York 2014). This collection will
contain detailed metadata making it searchable through geographic querying.
Inferring knowledge from such data involves digitization, exploratory media
analysis, text and cultural landscape mapping, 3-D mapping, electronic literary
analysis, and advanced visualization techniques. With online publishing and virtual
archives, content creators and users have the potential to interact with source
materials to create new findings, while also facilitating civic engagement, community building and information sharing. Recent focus has been on using the humanities to foster civic engagement; for example, the American Academy of Arts and Sciences (2013), while making a case for federal funding for the public humanities, emphasized the need to encourage “civic vigor” and to prepare citizens to be “voters,
jurors, and consumers”. There is potential for this line of work in improving the
well-being of cities by going beyond civic engagement, for example, to lifelong
learning (Hoadley and Bell 1996; CERI/OECD 1992). Stakeholders engaged in this
area are typically organizations involved in cultural heritage and digital culture,
such as museums, galleries, memory institutions, libraries, archives and institutions
of learning. Typical user communities for this type of data are history, urban design,
art and architecture, and digital humanities organizations, as well as community and
civic organizations, data scientists, and private organizations. The use of such data
in quantitative urban modeling opens up a whole new direction of urban research.
Data combinations can occur in two ways: through study design, collecting structured and unstructured data during the same data collection effort (e.g., obtaining GPS data from social survey participants, so that detailed movement data are collected from the persons for whom survey responses are available), or through a combination of different data sources brought together by data linkage or multi-sensor data fusion, under the overall banner of what has recently been called “broad data” (Hendler 2014).
There are now several examples where data streams have been linked by design
such as household travel surveys and activity diaries administered using both a
questionnaire-based survey instrument and a GPS element. One of many examples
is the 2007/2008 Travel Tracker data collection by the Chicago Metropolitan
Agency for Planning (CMAP), which included travel diaries collected via computer
assisted telephone interviews (CATI) and GPS data collected from a subset of
participants over 7 days. Recent efforts have expanded the number of sensing
devices used and the types of contextual data collected during the survey period.
For example, the Integrated Multimedia City Data (iMCD) project (Urban Big Data Centre n.d.; Thakuriah et al., forthcoming) involves a questionnaire-based survey covering travel, ICT use, education and literacy, civic and community engagement, and sustainable behavior of a random sample of households in Glasgow, UK. Respondents undertake a sensing survey using GPS and life-logging sensors, yielding location and mobility data and rapid still images of the world as the survey respondent sees it. In the survey background is a significant Information Retrieval effort drawing on numerous social media and multimedia web sources, as well as retrieval of information from transport, weather, crime-monitoring CCTV and other urban sectors. Alongside these data streams are Very High Resolution satellite data and LiDAR, allowing digital surface modeling to create 3D urban representations.
The census is the backbone for many types of urban analysis; however, its escalating costs have been noted to be unsustainable, with the cost of the 2010 US Census being almost $94 per housing unit, representing a 34 % increase in the cost per housing unit over Census 2000 costs, which in turn represented a 76 % increase over the costs of the 1990 Census (Reist and Ciango 2013). There was an estimated net undercount of 2.07 % for Blacks, 1.54 % for Hispanics, and 4.88 % for American Indians and Alaska Natives, while non-Hispanic whites had a net over-count of 0.83 % (Williams 2012). Vitrano and Chapin (2014) estimated that without significant intervention, the 2020 Census would cost about $151 per household. This has led the US Census Bureau to actively consider innovative solutions designed to reduce costs while maintaining a high-quality census in 2020. Some of the strategies being considered include leveraging the Internet and new methods of communication to improve self-response by driving respondents to the Internet and taking advantage of Internet response processes. Another census hybridization step being considered is the use of administrative records to reduce or eliminate some interviews of households that do not respond to the census and related field contacts.
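As a back-of-envelope check (our arithmetic, not figures from the sources), the cited growth rates imply approximate per-housing-unit costs of

c_{2000} \approx \frac{\$94}{1.34} \approx \$70, \qquad c_{1990} \approx \frac{\$70}{1.76} \approx \$40

for the 2000 and 1990 Censuses, respectively.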
Similar concerns in the UK led to the Beyond 2011 program, where different approaches to producing population statistics were considered. The program recommended several potential approaches, such as the use of an online survey for the decennial census and a census using existing government data and annual compulsory surveys (Office for National Statistics 2015). The ONS Big Data project also evaluated the possibility of using web scraping through a series of pilot projects, including Internet price data for the Consumer Price Index (CPI) and the Retail Price Index (RPI) and Twitter data to infer student movement, a population that has historically been difficult to capture through traditional surveys (Naylor et al. 2015). Other Big Data sources being studied as part of the pilots are smart meter data, to identify household size/structure and the likelihood of occupancy during the day, and mobile phone positioning data, to infer travel patterns of workers.
Another situation is where data on survey respondents are linked to routine administrative records. One approach involves an informed consent process in which respondents who agree to participate in a survey are explicitly asked whether the information they provide can be linked to their administrative records.
One example of this approach is the UK Biobank Survey (Lightfoot and Dibben
2013). Having survey responses linked to administrative data enables important
urban policy questions to be evaluated; the key issue here is that participants
understand and agree to such linkage.
From an urban operations point of view, connected systems allow a degree of sophistication and efficiency not possible with data from individual data systems. This was touched upon briefly in Sect. 2.1; clearly, weather-responsive traffic management systems (Thakuriah and Tilahun 2013) and emergency response systems (Salasznyk et al. 2006) require extensive integration of very different streams of data, often in real time. This can be computationally challenging, but it can be equally challenging to get data owners to share information. These
types of linked data are likely to accrue a diverse user community including urban
planning and operations management researchers, as well as the economic and
social policy community, in addition to public and private data organizations.
3 Urban Informatics
Overall, developments with urban Big Data have opened up several opportunities
for urban analysis. Building on previous definitions (Foth et al. 2011; Bays and
Callanan 2012; Batty 2013; Zheng et al. 2014), we view Urban Informatics as the
exploration and understanding of urban patterns and processes. It involves analyzing, visualizing, understanding, and interpreting structured and unstructured urban Big Data for four primary objectives:
1. Dynamic resource management: developing strategies for managing scarce urban resources effectively and efficiently, often making real-time decisions regarding competing uses of resources;
2. Knowledge discovery and understanding: discovering patterns in, and relationships among, urban processes, and developing explanations for such trends;
3. Urban engagement and civic participation: developing practices, technologies
and other processes needed for an informed citizenry and for their effective
involvement in social and civic life of cities;
4. Urban planning and policy analysis: developing robust approaches for urban planning, service delivery, policy evaluation and reform, and for infrastructure and urban design decisions.
The overall framework used here, in terms of the objectives, research approach and applications, and their interdependencies, is shown in Fig. 1.
Fig. 1 Relationships among Urban Informatics objectives, research approaches and applications
Large-scale urban modeling practices also use complex systems approaches utilizing Agent-based Models (ABM) and myriad forms of specialized survey, administrative, synthetic and other data sources, to study outcomes that emerge from individual agent action in interaction with other agents and the environment.
A vast body of empirical work embedded in the urban disciplines is among the most active consumers of urban data, for better understanding, hypothesis testing and inference regarding urban phenomena. Among these, one vast research area with
requirements for specialized data sources, models and tools is that of environmental
sustainability and issues relating to clean air, non-renewable energy dependence
and climate change. While difficult to generalize, a recent OECD report identified
gaps in quantitative urban and regional modeling tools to systematically assess the
impacts of urban systems on climate change and sustainability (OECD 2011).
Significant developments in sensor technology have led to the emergence of
smart commodities ranging from household appliances to smart buildings. This
has in turn led to cost-efficiencies and energy savings, and to the design of Vehicle-
to-Grid (V2G) systems (Kempton et al. 2005), and to personal carbon trading
(Bottrill 2006) and vehicular cap-and-trade systems (Lundquist 2011), with data-
analytic research around technology adoption, and behavioral and consumption
patterns.
Urban models that detect disparities relating to social justice and distributional
aspects of transportation, housing, land-use, environment and public health are
other consumers of such data. These approaches provide an empirical understand-
ing of the social inclusion and livability aspects of cities, and operational decisions
and policy strategies needed to address disparities. This line of work has focused on
social exclusion and connections to work and social services (Kain and Persky
1969; Wilson 1987; Krivo and Peterson 1996; Thakuriah et al. 2013), issues of
importance to an aging society (Federal Interagency Forum on Aging-Related
Statistics 2010), health and aging in place (Black 2008; Thakuriah et al. 2011)
and needs of persons with disabilities (Reinhardt et al. 2011). Administrative data
has played a significant role in this type of research leading to knowledge discovery
about urban processes as well as in evaluation of governmental actions such as
welfare reform and post-recession austerity measures. Linked and longitudinal
administrative data can support understanding of complex aspects of social justice
and changes in urban outcomes over time. For example, Evans et al. (2010)
highlighted the importance of using longitudinal administrative data to understand
the long-term interplay of multiple events associated with substance abuse over
time, while Bottoms and Costello (2009) discuss the role that longitudinal police
crime records can play in studying repeat victimization.
New ICT-based solutions to track and monitor activities allow urban quality and
well-being to be assessed at more fine-grained levels. Personalized data generated
by assistive technologies, ambient assisted living (Abascal et al. 2008) and related
ICT applications can contribute to urban quality of life as well as to design solutions
supporting urban wellness (e.g., hybrid qualitative-GPS data as described by Huang
et al. 2012 to understand barriers to accessing food by midlife and older adults with
mobility disability). Mobile health and awareness technologies (Consolvo
et al. 2006) particularly those embedded within serious medical pervasive gaming
environments (e.g., DiabetesCity—Collect Your Data, Knoll 2008) and numerous
mobile, wearable and other sensor-based physical health recommender systems,
one example of which is Lin et al. (2011), open up possibilities for urban
researchers to tap into a wealth of data to understand overall built environment
and activity-based conditions fostering health and well-being.
The discussion above shows that urban information generation and strategies to
analyze the data increasingly involve ICT solutions and the active participation of
users. Strategies such as focus groups, SWOT, Strategic Approach, Future Work-
shops and other approaches have been extensively used in the past as a part of urban
participatory practice to generate ideas and find solutions to problems. However,
advances in ICT solutions have led to the emergence of new models of citizen
input into problem solving, plan and design sourcing, voting on projects, and
sharing of ideas. Examples range from civic hackers analyzing data
from Open Data portals to generate ideas about fixing urban problems to using
serious games and participatory simulations for the ideation process (Poplin 2014;
Zellner et al. 2012).
As noted earlier, citizens may also engage by generating content through human
computation, or by performing tasks that are natural for humans but difficult for
machines to automatically carry out (von Ahn et al. 2008). Human computation
approaches provide structured ways for citizens to engage in play, to provide input
and to interact with, and learn about the urban environment. For example, citizens
may be able to judge different proposed urban designs, or they may be asked to
assess the quality of urban spaces where objective metrics derived through
machine vision algorithms are not accurate. Celino et al. (2012) give the example
of UrbanMatch, a location-based Game with a Purpose (GWAP), which is aimed at
exploiting the experience that players have of the urban environment to make
judgments that correctly link points of interest in the city with the most
representative photos retrieved from the Internet. There are multiple variants of
human computation including social annotations (where users tag or annotate
photos or real-world objects), information extraction (e.g., where users are asked
to recognize objects in photos), and others.
By “sensing” the city and its different behavioral and use patterns, data-driven
models have stimulated research into a broad range of social issues relevant to
understanding cities, including building participatory sensing systems for urban
engagement, location-based social networks, active travel and health and wellness
applications, and mobility and traffic analytics. Other objectives include dynamic
resource management of urban assets and infrastructure, assisted living and social
inclusion in mobility, and community and crisis informatics. For example, one of
the major cited benefits of social media analysis has been the ability to instanta-
neously and organically sense sentiments, opinions and moods to an extent not
previously possible, and ways in which these diffuse over space and time, thereby
enabling the policy community to monitor public opinion, and predict social trends.
A part of this trend is being stimulated by major governmental agencies, which
are increasingly realizing the power of social media for understanding where
needs are, how the public is reacting to major policy changes and political
events, and people's political preferences (Golbeck and Hansen 2013).
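To make this concrete, the following is a minimal, illustrative sketch in Python of how sentiment in such posts might be aggregated over time. The lexicon is a toy stand-in and the input layout is assumed; production systems would use trained classifiers that handle negation, slang and emoji.

from collections import defaultdict

POSITIVE = {"good", "great", "happy", "love", "safe"}      # toy lexicon
NEGATIVE = {"bad", "angry", "afraid", "unsafe", "delayed"}  # toy lexicon

def sentiment(text):
    # Crude lexicon score in [-1, 1]; illustrative only.
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def daily_mood(posts):
    """posts: iterable of (date, text) pairs; returns mean sentiment per day."""
    by_day = defaultdict(list)
    for day, text in posts:
        by_day[day].append(sentiment(text))
    return {day: sum(s) / len(s) for day, s in by_day.items()}

Trends in such a series could then be compared across neighborhoods or against known policy events.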
A data-driven focus is also being seen in learning analytics (e.g., Picciano 2012),
location-based social networks (Zheng and Xie 2011), recommender systems based
on collaborative filtering for travel information (Ludwig et al. 2009) and
approaches to detect disruptions from social media (Sasaki et al. 2012). If
these information streams were collected over time and linked to other socio-
demographic data, it would presumably become possible to examine variation in
the outcomes measured by socially generated data and so capture urban dynamics
to a greater degree.
Overall, Big Data is being increasingly utilized for a range of Urban Informatics
research and applications. By using existing urban models with new forms of data,
or through data-driven modeling, urban processes and behaviors can be studied in a
timely manner and contextual peculiarities of urban processes and local experiences
can be examined in greater detail. Yet significant challenges arise in their use, which
are addressed next.
4 Challenges in Using Big Data for Urban Informatics
The challenges associated with the use of Big Data for Urban Informatics are:
(1) technological, (2) methodological, (3) theoretical and epistemological, and
(4) related to the political economy that arises from accessing and using the
data. These challenges are summarized in Table 2, along with their
characteristics and examples of the complexities involved with different types
of Big Data.
Table 2 Challenges in using Big Data for Urban Informatics and illustrative topics

Technological
  Characteristics: Urban information management challenges: (1) information
  generation and capture; (2) management; (3) processing; (4) archiving,
  curation and storage; (5) dissemination and discovery.
  Challenges by type of data: Information management challenges are likely to
  be very high with real-time, high-volume sensor and UGC data, which require
  specific IT infrastructure development and information management solutions.

Methodological
  Characteristics: (1) Data preparation challenges: (a) information retrieval
  and extraction; (b) data linkage/information integration; (c) data cleaning,
  anonymization and quality assessment. (2) Urban analysis challenges:
  (a) developing methods for data-rich urban modeling and data-driven modeling;
  (b) ascertaining uncertainty, biases and error propagation.
  Challenges by type of data: Data preparation challenges are likely to be very
  high with unstructured or semi-structured sensor, UGC and arts and humanities
  data, and with data from real-time private-sector and administrative
  transactional systems. All types of observational Big Data pose significant
  methodological challenges in deriving generalizable knowledge, requiring
  specialist knowledge to assess and address measurement issues and error
  structures.

Theoretical and epistemological
  Characteristics: (1) Understanding metrics, definitions, concepts and
  changing ideologies and methods for understanding the "urban";
  (2) determining the validity of approaches and the limits to knowledge;
  (3) deriving visions of future cities and the links to sustainability and
  social justice.
  Challenges by type of data: All types of observational Big Data pose
  limitations in deriving theoretical insights and in hypothesis generation
  without adequate cross-fertilization of knowledge between the data sciences
  and the urban disciplines, but the challenges are greater with certain forms
  of UGC and sensor data, which yield high-value descriptions but are less
  amenable to explanations and explorations of causality.

Political economy
  Characteristics: (1) Data entrepreneurship, innovation networks and power
  structures; (2) value propositions and economic issues; (3) data access,
  governance frameworks and provenance; (4) data confidentiality, security and
  trust management; (5) responsible innovation and emergent ethics.
  Challenges by type of data: Data confidentiality and power structures pose
  significant challenges to the use of administrative data in open government
  and program evaluation, while access to private-sector transactions data and
  privately-controlled sensor and UGC data is potentially susceptible to
  changing innovation and profitability motivations; challenges to ethics and
  responsible innovation are significantly high for certain sensor-based
  (e.g., IoT) applications.
There are many approaches to data privacy, ranging from technological
encryption and anonymization solutions to design, access and rights management
solutions. A vast
range of Privacy Enhancing Technologies (PETs) (Beresford and Stajano 2003;
Gruteser and Grunwald 2003) that are relevant to urban Big Data focus on
anonymization of GPS data, images and so on. In the case of administrative
micro-data, many approaches to ensure confidentiality are used, including
de-identified data, simulated micro-data (called synthetic data) that is constructed
to mimic some features of the actual data using micro-simulation methods
(Beckman et al. 1996; Harland et al. 2012) and utilization of Trusted Third Party
(TTP) mechanisms to minimize the risks of the disclosure of an individual’s
identity or loss of the data (Gowans et al. 2012).
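To give a flavor of such PETs, the sketch below coarsens reported GPS coordinates until every grid cell contains at least k distinct users. This is a simplification written for illustration; practical cloaking schemes such as that of Gruteser and Grunwald (2003) adapt the cell size per request rather than using one global precision.

from collections import Counter

def cloak(points, k):
    """points: list of (user_id, lat, lon). Returns (user_id, cell) pairs,
    where each cell is a rounded coordinate pair coarse enough that every
    reported cell contains at least k distinct users."""
    for precision in range(4, -1, -1):   # ~10 m cells down to ~100 km cells
        pairs = {(uid, (round(lat, precision), round(lon, precision)))
                 for uid, lat, lon in points}
        users_per_cell = Counter(cell for _uid, cell in pairs)
        if min(users_per_cell.values()) >= k:
            return sorted(pairs)
    return None  # k-anonymity unattainable even at the coarsest grid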
One major capability needed to progress from data-poor to data-rich urban
models is that data should be archived over time, enabling storage of very high-
resolution and longitudinal spatio-temporal data. The linkage to other socio-
economic, land-use and other longitudinal data opens up additional avenues for
in-depth exploration of changes in urban structure and dynamics. Although this
was previously a challenge, decreasing storage costs and increasing linkage
capacity have made such archiving possible.
Another important determinant in data access is having access to high-quality
resource discovery tools for urban researchers to find and understand data, ontol-
ogies for knowledge representation, and a data governance framework that includes
harmonization of standards, key terms and operational aspects. Given the vast and
dispersed sources of urban Big Data, resource discovery mechanisms to explore and
understand data are critical. This includes metadata, or data about the data,
containing basic to advanced information describing the data and the management
rights to it, including archiving and preservation, in a consistent, standardized
manner so that it is understandable and usable by others. Other issues are data
lifecycle management (the strategic and operational principles underpinning long-
term publication and archiving), access to necessary para-data (i.e., data about the
processes used to collect data), and social annotations (i.e., social bookmarking that
allows users to annotate and share metadata about various information sources).
These issues not only have significant technical requirements in terms of integrating
urban data from different sources; they also have legal (e.g., licensing, terms of
service, non-disclosure), ethical (e.g., regarding lack of informed consent in some
cases, or use by secondary organizations which did not seek consent), and research
culture implications (e.g., establishing a culture of reanalysis of evidence, repro-
duction and verification of results, minimizing duplication of effort, and building on
the work of others, as in Thanos and Rauber 2015).
The above technology issues are not likely to be directly relevant to urban
researchers in many cases. However, methodological aspects of Big Data, such as
information retrieval, linkage and curation, and aspects of its political
economy, including data access, governance, privacy, and trust management, may have IT
requirements that could limit the availability of data for urban research.
Some recent work has considered sampling issues relating to social networks
(Gjoka et al. 2010) and social media (Culotta 2014). However, this work is in
its infancy, and further developments are necessary before highly unstructured
forms of data can be used for urban inference.
Using Big Data for Urban Informatics requires methods for information
retrieval, information extraction, GIS technologies, and multidisciplinary modeling
and simulation methods from urban research as well as the data sciences (e.g.,
machine learning and tools used to analyze text, image and other unstructured
sources of data). Methods of visualization and approaches to understanding uncer-
tainty, error propagation and biases in naturally occurring forms of data are
essential in order to use and interpret Big Data for urban policy and planning.
The theoretical and epistemological challenges pertain to the potential for insights
and hypothesis generation about urban dynamics and processes, as well as validity
of the approaches used, and the limits to knowledge discovery about urban systems
derived from a data focus. As noted earlier, Big Data for Urban Informatics has two
distinct roots: quantitative urban research and data science. Although the walls
surrounding what may be considered “urban models and simulations” are pervious,
these are typically analytical, simulation-based or empirical approaches that are
derived from diverse conceptual approaches (e.g., queuing theory, multi-agent
systems) and involve strong traditions of using specialist data for calibration. These
models support the understanding of urban systems using theory-driven forecasting
of urban resources, simulation of alternative investment scenarios, strategies for
engagement with different communities, and evaluation of planning and policy, as
well as efficient operation of transportation and environmental systems. Such
models are now using Big Data in varying degrees.
At the same time, exploratory data-driven research is largely devoid of theoret-
ical considerations but is necessary to fully utilize emerging data sources to better
discover and explore interesting aspects of various urban phenomena. Social data
streams and the methods that are rapidly building around them to extract, analyze
and interpret information are active research areas, as are analytics around data-
driven geography that may be emerging in response to the wealth of geo-referenced
data flowing from sensors and people in the environment (e.g., Miller and
Goodchild 2014). The timely discovery and continuous detection of interesting
urban patterns made possible by Big Data, and the adoption of innovative
data-driven urban management, are important steps forward and serve useful
operational purposes. The knowledge discovery aspects of data-driven models are important
to focus the attention of citizens and decision-makers on urban problems and to
stimulate new hypotheses about urban phenomena, which could potentially be
rigorously tested using inferential urban models.
The limitation of the data-driven research stream is that there is less of a focus on
the “why” or “how” of urban processes and on complex cause-and-effect type
relationships. In general, data-driven methods have been the subject of interesting
debates regarding the scope, limitations and possibility of such approaches to
provide solutions to complex problems beyond pattern detection, associations,
and correlations. The current focus on data-driven science and the advocacy for it
has in some cases led to rather extreme proclamations to the effect that the data
deluge means the “end of theory” and that it will render the scientific process of
hypothesizing, modeling, testing, and determining causation obsolete (Anderson
2008). Quantitative empirical research has always been a mainstay for many urban
researchers but there is inevitably some conceptual underpinning or theoretical
framing which drives such research. Long before Big Data and data science became
options to derive knowledge, the well-known statistician, Tukey (1980), noted in an
article titled “We Need Both Exploratory and Confirmatory” that “to try to replace
one by the other is madness”, while also noting that “ideas come from previous
exploration more often than from lightning strikes”.
A part of the debate is being stimulated by the fact that data-driven models have
tended to focus on the use of emerging sources of sensor or socially co-created data,
and is closely connected to the data science community. At the time of writing this
paper, entering “GPS data” into the Association for Computing Machinery (ACM)
Digital Library, a major computer science paper repository, returns about
11,750 papers, while entering the same term into the IEEE Xplore Digital
Library, another such source, returns a further 6727 papers; these numbers are
in fact higher than the
counts obtained when the first author was writing her book “Transportation and
Information: Trends in Technology and Policy” (Thakuriah and Geers 2013),
indicating not just a voluminous literature on these topics in the data sciences but
one that continues to grow very fast. Such data sources have become most closely
associated with the term Big Data in the urban context, to the exclusion of
administrative data, hybrids of social survey and sensing data, humanities reposi-
tories and other novel data sources, which play an important role in substantive,
theoretically-informed urban inquiry, beyond detection, correlations and
association.
Sensor and ICT-based UGC has also become closely associated with smart
cities, or the use of ICT-based intelligence as a development strategy mostly
championed and driven by large technology companies for efficient and cost-
effective city management, service delivery and economic development in cities.
There are numerous other definitions of smart cities, as noted by Hollands (2008).
The smart cities movement has been noted to have several limitations, including
having “a one-size fits all, top-down strategic approach to sustainability, citizen
well-being and economic development” (Haque 2012) and for being “largely
ignorant of this (existing and new) science, of urbanism in general, and of how
computers have been used to think about cities since their deployment in the
mid-20th century” (Batty 2013), a point also made by others such as Townsend
(2013). It needs to be pointed out that the scope of smart cities has expanded
over time to include the optimal delivery of public services to citizens. A
related limitation is that data-driven approaches cannot by themselves derive
explanations for parts of the urban environment not measured by data. The
linked data hybrids suggested previously potentially offer a way to address
these limitations.
The political economy of Big Data arises due to the agendas and actions of the
institutions, stakeholders and processes involved with the data. Many of the chal-
lenges facing urban researchers in using Big Data stem from complexities with data
access, data confidentiality and security, and responsible innovation and emergent
ethics. Access and use conditions are in turn affected by new types of data
entrepreneurship and innovation networks, which make access easier in some
cases, through advocacy for Open Data, or more difficult through conditions
imposed as a result of commercialization. Such conditions are generally
underpinned by power structures and value propositions arising from Big Data.
The economic, legal and procedural issues that relate to data access and gover-
nance are non-trivial and despite the current rhetoric around the open data move-
ment, vast collections of data that are useful for urban analysis are locked away in a
mix of legacy and siloed systems owned and operated by individual agencies and
private organizations, with their own internal data systems, metadata, semantics
and so on. Retrieving information from social media and other online content
databases, and the analytics of the resulting retroactive UGC either in real-time or
from historical archives have mushroomed into a significant specialized data
industry, but the data availability itself is dictated by the terms of service agree-
ments required by the private companies which own the system or which provide
access, giving rise to a new political economy of Big Data. User access is provided
in some cases using an API, but often there are limits on how much data can be
accessed at any one time by the researcher and the linkage of a company’s data to
other data. Others may mandate user access under highly confidential and secure
access conditions requiring users to navigate a complex legal landscape of data
confidentiality, and special end-user licensing and terms of service and
non-disclosure agreements. Data users may also be subject to potentially changing
company policy regarding data access and use. There are also specific restrictions
on use, including on data storage in some cases, requiring analytics to be performed in real time.
Undoubtedly, a part of the difficulty in access stems from data confidentiality
and the need to manage trust with citizens, clients and the like. Privacy, trust and
security are concepts that are essential to societal interactions. Privacy is a funda-
mental human right and strategies to address privacy involve privacy-enhancing
technology, the legal framework for data protection, as well as consumer awareness
of the privacy implications of their activities (Thakuriah and Geers 2013), espe-
cially as users leave a digital exhaust with their everyday activities. However,
privacy is also not a static, immutable constant. People are likely to trade off
some privacy protection in return for utility gained from information, benefits
received, or risks minimized (Cottrill and Thakuriah 2015). Aside from technolog-
ical solutions to maintain privacy, a process of user engagement is necessary to
raise consumer awareness, in addition to having the legal and ethical processes in
place in order to offer reassurance about the confidential use of data.
Further, many private data owners may withhold data in order to preserve the
competitive advantage they gain through data analytics. Uncertainty about the
fast-moving legal landscape with regard to data confidentiality, copyright
violations and other unintended consequences of releasing data is another
central element of the political economy of Big Data.
The social arguments for and against Big Data, connected systems and IoT are
similar to other technology waves that have been previously witnessed, and these
considerations also generate multiple avenues for research. Increasingly pervasive
sensing and connectivity associated with IoT, and the emphasis on large-scale,
highly coupled systems that favor removing human input and intervention, have
been seen to increase exposure to hacking and major system crashes (BCS 2013).
Aside from security, the risks for privacy are greatly enhanced as the digital trail left
by human activities may be masked under layers of connected systems. Even those
systems that explicitly utilize privacy by design are potentially susceptible to
various vulnerabilities and unanticipated consequences since the technological
landscape is changing very rapidly and the full implications cannot be thought
through in their entirety. This has prompted the idea of “responsible innovation”,
which “seeks to promote creativity and opportunities for science and innovation
that are socially desirable and undertaken in the public interest” and which makes
clear that “innovation can raise questions and dilemmas, is often ambiguous in
terms of purposes and motivations and unpredictable in terms of impacts, beneficial
or otherwise. Responsible Innovation creates spaces and processes to explore these
aspects of innovation in an open, inclusive and timely way” (Engineering and
Physical Sciences Research Council n.d.).
Against this backdrop of complex data protection and governance challenges,
and the lure of a mix of objectives such as creating value and generating profit as
well as public good, a significant mix of private, public, non-profit and
informal infomediaries leveraging urban Big Data, ranging from very large
organizations to independent developers, has emerged. Using a mixed-methods
approach, Thakuriah et al. (2015) identified four major groups of organizations
within this dynamic and diverse sector: general-purpose ICT companies, urban
information service providers, open and civic data infomediaries, and independent
and open source developer infomediaries. The political economy implication of
these developments is that publicly available data may become private as value
is added to it, and that the publicly-funded data infrastructure, due to its
complexity and technical demands, may increasingly be managed by private
companies, which may in turn restrict access and use.
5 Conclusions
In this paper, we discussed the major sources of urban Big Data, their benefits and
shortcomings, and ways in which they are enriching Urban Informatics research.
The use of Big Data in urban research is not a distinct phase of a technology but
rather a continuous process of seeking novel sources of information to address
concerns emerging from high costs, or from design or operational limitations. Although
Big Data has often been used quite narrowly to include sensor or socially generated
data, there are many other forms that are meaningful to different types of urban
researchers and user communities, and we include administrative data and other
data sources to capture these lines of scholarship. Even more importantly, it
is necessary to bring together (through data linkage or otherwise) data that
have existed in fragmented ways in different domains, for a holistic approach
to urban analysis.
We note that both theory-driven as well as data-driven approaches are important
for Urban Informatics but that retrofitting urban models to reflect developments in a
data-rich world is a major requirement for comprehensive understanding of urban
processes. Urban Informatics in our view is the study of urban patterns using novel
sources of urban Big Data that is undertaken from both a theory-driven empirical
perspective as well as a data-driven perspective for the purpose of: urban resource
management, knowledge discovery and understanding, urban engagement and civic
participation, and planning and policy implementation. The research approaches
utilized to progress these objectives are a mix of enriched urban models,
underpinned by theoretical principles and retrofitted to accommodate emerging
forms of data, and data-driven models that are largely theory-agnostic and
emerge bottom-up from the data. The resulting Urban Informatics research applications have focused on
revisiting classical urban problems using urban modeling frameworks but with new
forms of data; evaluation of behavioral and structural interactions within an
enriched complex systems approach; empirical research on sustainable, socially-just and
engaged cities; and applications to engage and collaboratively sense cities.
The use of Big Data poses a considerable challenge for Urban Informatics
research. These include technology-related challenges that impose requirements
for special information management approaches; methodological challenges to
retrieve, curate and draw knowledge from the data; theoretical or epistemological
challenges to frame modes of inquiry to derive knowledge and understand the limits
of Urban Informatics research; and finally, an issue that is likely to play an
increasingly critical role for urban research—the emerging political economy of
urban Big Data, arising from complexities associated with data governance and
ownership, privacy and information security, and new modes of data entrepreneur-
ship and power structures emerging from the economic and political value of data.
From the perspective of urban analysts, the use of sensor data, socially generated
data, and certain forms of arts and humanities and private sector data may pose
significant technical and methodological challenges. With other sources, such
as administrative micro-data, the challenges of data access and issues relating
to the political economy of the data are likely to be more salient.
References
Abascal J, Bonail B, Marco Á, Casas R, Sevillano JL (2008) AmbienNet: an intelligent environ-
ment to support people with disabilities and elderly people. In: Proceedings of tenth interna-
tional ACM SIGACCESS conference on computers and accessibility (Assets ’08), pp 293–294
Abowd JM, Stephens BE, Vilhuber L, Andersson F, McKinney KL, Roemer M, Woodcock S
(2005) The LEHD infrastructure files and the creation of the quarterly workforce indicators. In:
Producer dynamics: new evidence from micro data. Published 2009 by University of Chicago
Press, pp 149–230. http://www.nber.org/chapters/c0485.pdf. Accessed 1 March 2014
Federal Interagency Forum on Aging-Related Statistics (2010) Older Americans 2010: key indi-
cators of well-being. http://www.agingstats.gov/agingstatsdotnet/Main_Site/Data/2010_Docu
ments/Docs/OA_2010.pdf. Accessed 31 July 2010
Aguilera O, Fernández AF, Muñoz A, Fraga MF (2010) Epigenetics and environment: a complex
relationship. J Appl Physiol 109(1):243–251
Alonso W (1960) A theory of the urban land market. Pap Reg Sci 6(1):149–157
American Academy of the Arts and Sciences (2013) The heart of the matter. http://www.amacad.
org. Accessed 1 April 2015
Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete. Wired
Magazine, 23 June 2008. http://www.wired.com/science/discoveries/magazine/16-07/pb_the
ory. Accessed 10 Feb 2012
Antenucci D, Cafarella M, Levenstein MC, Ré C, Shapiro MD (2014) Using social media to
measure labor market flows. Report of the University of Michigan node of the NSF-Census
Research Network (NCRN) supported by the National Science Foundation under Grant
No. SES 1131500
Ashton K (2009) That “Internet of Things” Thing. RFID Journal, May/June 2009
Auto-id Labs. http://www.autoidlabs.org
Balmer M, Rieser M, Meister K, Charypar D, Lefebvre N, Nagel K, Axhausen K (2009) MATSim-
T: architecture and simulation times. In: Multi-agent systems for traffic and transportation
engineering. IGI Global, Hershey, pp 57–78
Batty M (2013) Urban informatics and Big Data: a report to the ESRC Cities Expert Group. http://
www.smartcitiesappg.com/wp-content/uploads/2014/10/Urban-Informatics-and-Big-Data.
pdf. Accessed 15 Dec 2014
Bays J, Callanan L (2012) ‘Urban informatics’ can help cities run more efficiently. McKinsey on
Society. http://mckinseyonsociety.com/emerging-trends-in-urban-informatics/. Accessed
1 July 2014
BCS, The Chartered Institute for IT (2013) The societal impact of the internet of things. www.bcs.
org/upload/pdf/societal-impact-report-feb13.pdf. Accessed 10 April 2015
Beckman MJ, McGuire CB, Winston CB (1956) Studies in the economics of transportation. Yale
University Press, Connecticut
Beckman R, Baggerly KA, McKay MD (1996) Creating synthetic baseline populations. Transp
Res A Policy Pract 30(6):415–429
Beckmann MJ (1973) Equilibrium models of residential location. Reg Urban Econ 3:361–368
Ben-Akiva M, Lerman SR (1985) Discrete choice analysis: theory and application to travel
demand. MIT Press, Cambridge
Beresford AR, Stajano F (2003) Location privacy in pervasive computing. IEEE Pervasive
Comput 2(1):46–55
Bijwaard GE, Schluter C, Wahba J (2011) The impact of labour market dynamics on the return
migration of immigrants. CReAM Discussion Paper No. 27/12
Black K (2008) Health and aging-in-place: implications for community practice. J Commun Pract
16(1):79–95
Bonney R, Ballard H, Jordan R, McCallie E, Phillips T, Shirk J, Wilderman CC (2009) Public
participation in scientific research: defining the field and assessing its potential for informal
science education. Technical report, Center for Advancement of Informal Science Education
Bottoms AE, Costello A (2009) Crime prevention and the understanding of repeat victimization: a
longitudinal study. In: Knepper P, Doak J, Shapland J (eds) Urban crime prevention, surveil-
lance, and restorative justice: effects of social technologies. CRC, Boca Raton, pp 23–54
Bottrill C (2006) Understanding DTQs and PCAs. Technical report, Environmental Change.
Institute/UKERC, October
Burgess EW (1925) The growth of the city: an introduction to a research project. In: Park RE,
Burgess EW, Mackenzie RD (eds) The city. University of Chicago Press, Chicago, pp 47–62
Card D, Chetty R, Feldstein M, Saez E (n.d.) Expanding access to administrative data for research
in the United States. NSF-SBE 2020 White Paper. http://www.nsf.gov/sbe/sbe_2020/all.cfm.
Accessed 10 April 2015
Celino I, Contessa S, Della Valle E, Krüger T, Corubolo M, Fumeo S (2012) UrbanMatch—
linking and improving Smart Cities Data. LDOW2012, Lyon, France
U.S. Census Bureau (2012) Press release, 22 May 2012. https://www.census.gov/2010census/
news/releases/operations/cb12-95.html
CERI/OECD (1992) City strategies for lifelong learning. In: A CERI/OECD study No. 3 in a series
of publications from the Second congress on educating cities, Gothenburg, November
Christakis NA, Fowler JH (2007) The spread of obesity in a large social network over 32 years.
N Engl J Med 357(4):370–379
Consolvo S, Everitt K, Smith I, Landay JA (2006) Design requirements for technologies that
encourage physical activity. In: Proceedings of SIGCHI conference on human factors in
computing systems (CHI ’06), pp 457–466
Conway J (1970) The game of life. Sci Am 223(4):4
Cottrill CD, Thakuriah P (2015) Location privacy preferences: a survey-based analysis of con-
sumer awareness, trade-off and decision-making. Transp Res C Emerg Technol 56:132–148
Culotta A (2014) Reducing sampling bias in social media data for county health inference. In: JSM
proceedings
Davenport TH, Patil DJ (2012) Data scientist: the sexiest job of the 21st century. Harvard Business
Review, October, pp 70–76
Dekel O, Shamir O (2009) Vox Populi: collecting high-quality labels from a crowd. In: Pro-
ceedings of the 22nd annual conference on learning theory (COLT), pp 377–386. http://www.
cs.mcgill.ca/~colt2009/proceedings.html
De Mauro A, Greco M, Grimaldi M (2015) What is big data? A consensual definition and a review
of key research topics. In: AIP conference proceedings, 1644, pp 97–104
NAREC Distributed Energy (2013) ERDF social housing energy management project—final
project report. UK National Renewable Energy Centre. https://ore.catapult.org.uk/docu
ments/10619/127231/Social%20Housing%20final%20report/6ca05e01-49cc-43ca-a78c-
27fe0e2dd239. Accessed 1 April 2015
Drake JS, Schofer JL, May AD (1965) A statistical analysis of speed-density hypotheses:
a summary. Report, Chicago Area Expressway Surveillance Project (Ill.)
Ellis RH (1967) Modelling of household location: a statistical approach. Highw Res Rec
207:42–51
Engineering and Physical Sciences Research Council (n.d) Framework for responsible innovation.
https://www.epsrc.ac.uk/research/framework/. Accessed 10 April 2015
Erlander S (1980) Optimal spatial interaction and the gravity model. Lecture notes in economics
and mathematical systems, vol 173. Springer, Berlin
European Commission (2015) Digital agenda for Europe: a Europe 2020 initiative: European
Innovation Partnership (EIP) on Smart Cities and Communities. https://ec.europa.eu/digital-
agenda/en/smart-cities. Accessed 1 Aug 2015
Evans TP, Kelley H (2004) Multi-scale analysis of a household level agent-based model of
landcover change. J Environ Manage 72(1):57–72
Evans E, Grella CE, Murphy DA, Hser Y-I (2010) Using administrative data for longitudinal
substance abuse research. J Behav Health Serv Res 37(2):252–271
Foth M, Choi JH, Satchell C (2011) Urban informatics. In: Proceedings of the ACM 2011
conference on computer supported cooperative work (CSCW ’11). ACM, New York, pp 1–8
Fujita M (1988) A monopolistic competition model of spatial agglomeration: differentiated
product approach. Reg Sci Urban Econ 18(1):87–124
Fujita M, Ogawa H (1982) Multiple equilibria and structural transition of non-monocentric urban
configurations. Reg Sci Urban Econ 12(2):161–196
Fujita M, Krugman P, Venables AJ (1999) The spatial economy: cities, regions, and international
trade. MIT Press, Cambridge
Ghosh D, Guha R (2013) What are we tweeting about obesity? Mapping tweets with topic
modeling and geographic information system. Cartogr Geogr Inf Sci 40(2):90–102
Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl
Acad Sci 99(12):7821–7826
Gjoka M, Kurant M, Butts CT, Markopoulou A (2010) Walking in Facebook: a case study of
unbiased sampling of OSNs. In: Proceedings of IEEE 2010 INFOCOM 2010
OECD Global Science Forum (2013) New data for understanding the human condition: interna-
tional perspectives. Report on data and research infrastructure for the social sciences. http://
www.oecd.org/sti/sci-tech/new-data-for-understanding-the-human-condition.pdf. Accessed
1 April 2015
Golbeck J, Hansen D (2013) A method for computing political preference among Twitter
followers. Soc Netw 36:177–184
Gowans H, Elliot M, Dibben C, Lightfoot D (2012) Accessing and sharing administrative data and
the need for data security, Administrative Data Liaison Service
Gruteser M, Grunwald D (2003) Anonymous usage of location-based services through spatial and
temporal cloaking. In: Proceedings of first international conference on mobile systems, appli-
cations and services, MobiSys ’03, pp 31–42
Gurstein M (2011) Open data: empowering the empowered or effective data use for everyone?
First Monday, vol 16, no 2, 7 Feb 2011. http://firstmonday.org/ojs/index.php/fm/article/view/
3316/2764. Accessed 1 July 2013
Haque U (2012) Surely there’s a smarter approach to smart cities? Wired Magazine. 17 April 2012.
http://www.wired.co.uk/news/archive/2012-04/17/potential-of-smarter-cities-beyond-ibm-
and-cisco. Accessed 10 April 2012
Thorhildur J, Avital M, Bjørn-Andersen N (2013) The generative mechanisms of open govern-
ment data. In: ECIS 2013 proceedings, Paper 179
Tilahun N, Levinson D (2013) An agent-based model of origin destination estimation (ABODE). J
Transp Land Use 6(1):73–88
Townsend A (2013) Smart cities: Big Data, civic hackers and the quest for a New Utopia. W. W.
Norton, New York
Tukey JW (1980) We need both exploratory and confirmatory. Am Stat 34(1):23–25
United Nations, Department of Economic and Social Affairs, Population Division (2014) World
Urbanization prospects: the 2014 revision, highlights (ST/ESA/SER.A/352)
Urban Big Data Centre (n.d.) Integrated Multimedia City Data (iMCD). http://ubdc.ac.uk/
research/research-projects/methods-research/integrated-multimedia-city-data-imcd/.
Accessed 18 September 2016
Vitrano FA, Chapin MM (2010) Possible 2020 census designs and the use of administrative
records: what is the impact on cost and quality? U.S. Census Bureau, Suitland. https://fcsm.
sites.usa.gov/files/2014/05/Chapin_2012FCSM_III-A.pdf
Vitrano FA, Chapin MM (2014) Possible 2020 census designs and the use of administrative
records: what is the impact on cost and quality? https://fcsm.sites.usa.gov/files/2014/05/
Chapin_2012FCSM_III-A.pdf. Accessed 1 March 2015
von Ahn L, Blum M, Hopper NJ, Langford J (2003) CAPTCHA: using hard AI problems for
security. Technical Report 136
von Ahn L, Maurer B, McMillen C, Abraham D, Blum M (2008) reCAPTCHA: human-based
character recognition via web security measures. Science 321(5895):1465–1468
Wauthier FL, Jordan MI (2011) Bayesian bias mitigation for crowdsourcing. In: Proceedings of
the conference on neural information processing system, no 24, pp 1800–1808. http://
machinelearning.wustl.edu/mlpapers/papers/NIPS2011_1021
Wegener M (1994) Operational urban models state of the art. J Am Plann Assoc 60(1):17–29
Weil R, Wootton J, García-Ortiz A (1998) Traffic incident detection: sensors and algorithms.
Math Comput Model 27(9–11):257–291
Williams JD (2012) The 2010 decennial census: background and issues. Congressional Research
Service R40551. http://www.fas.org/sgp/crs/misc/R40551.pdf. Accessed 1 March 2015
Wilson AG (1971) A family of spatial interaction models, and associated developments. Environ
Plann 3(1):1–32
Wilson WJ (1987) The truly disadvantaged: the inner city, the underclass and public policy.
University of Chicago Press, Chicago
Wilson AG (2013) A modeller’s utopia: combinatorial evolution. Commentary. Environ Plann A
45:1262–1265
Wu G, Talwar S, Johnsson K, Himayat N, Johnson KD (2011) M2M: from mobile to embedded
internet. IEEE Commun Mag 49(4):36–43
Zellner ML, Reeves HW (2012) Examining the contradiction in ‘sustainable urban growth’: an
example of groundwater sustainability. J Environ Plann Manage 55(5):545–562
Zellner ML, Page SE, Rand W, Brown DG, Robinson DT, Nassauer J, Low B (2009) The
emergence of zoning games in exurban jurisdictions. Land Use Policy 26(2009):356–367
Zellner ML, Lyons L, Hoch CJ, Weizeorick J, Kunda C, Milz D (2012) Modeling, learning and
planning together: an application of participatory agent-based modeling to environmental
planning. URISA J (GIS in Spatial Planning Issue) 24(1):77–92
Zellner M, Massey D, Shiftan Y, Levine J, Arquero M (2016) Overcoming the last-mile problem
with transportation and land-use improvements: an agent-based approach. Int J Transport 4
(1):1–26
Zhang X, Qin S, Dong B, Ran B (2010) Daily OD matrix estimation using cellular probe data. In:
Proceedings of ninth annual meeting Transportation Research Board
Zheng Y, Xie X (2011) Location-based social networks: locations. In: Zheng Y, Zhou X (eds)
Computing with spatial trajectories. Springer, New York, pp 277–308
Part I
Analytics of User-Generated Content
Using User-Generated Content to Understand Cities
1 Introduction
Over half of the world’s population now lives in cities (Martine et al. 2007), and
understanding the cities we live in has never been more important. Urban planners
need to plan future developments, transit authorities need to optimize routes, and
people need to effectively integrate into their communities.
Currently, a number of methods are used to collect data about people, but these
methods tend to be slow, labor-intensive, expensive, and lead to relatively sparse
data. For example, the US census cost $13 billion in 2010 (Costing the Count 2011),
and is only collected once every 10 years. The American Community Survey is
collected annually, and cost about $170 million in 2012, but only samples around
1 % of households in any given year (Griffin and Hughes 2013). While data like this
can benefit planners, policy makers, researchers, and businesses in understanding
changes over time and how to allocate resources, today’s methods for understand-
ing people and cities are slow, expensive, labor-intensive, and do not scale well.
Researchers have looked at using proprietary call detail records (CDRs) from
telecoms to model mobility patterns (Becker et al. 2011; González et al. 2008;
Isaacman et al. 2012) and other social patterns, such as the size of one’s social
network and one’s relationship with others (Palchykov et al. 2012). These studies
leverage millions of data points; however, these approaches also have coarse
location granularity (up to 1 sq mile), are somewhat sparse (CDRs are recorded
only when a call or SMS is made), have minimal context (location, date, caller,
callee), and use data not generally available to others. Similarly, researchers have
also looked at having participants install custom apps. However, this approach has
challenges in scaling up to cities, given the large number of app users needed to get
useful data. Corporations have also surreptitiously installed software on people’s
smartphones (such as Carrier IQ (McCullagh 2011) and Verizon’s Precision Market
Insights (McCullagh 2012)), though this has led to widespread outcry due to
privacy concerns.
We argue that there is an exciting opportunity for creating new ways to concep-
tualize and visualize the dynamics, structure, and character of a city by analyzing
the social media its residents already generate. Millions of people already use
Twitter, Instagram, Foursquare, and other social media services to update their
friends about where they are, communicate with friends and strangers, and record
their actions. The sheer quantity of data is also tantalizing: Twitter claims that its
users send over 500 million tweets daily, and Instagram claims its users share about
60 million photos per day (Instagram 2014). Some of this media is geotagged with
GPS data, making it possible to start inferring people’s behaviors over time. In
contrast to CDRs from telcos, we can get fine-grained location data, and at times
beyond when people make phone calls. In contrast to having people install custom
apps (which is hard to persuade people to do), we can leverage social media data
that millions of people are already creating every day.
We believe that this kind of geotagged social media data, combined with new
kinds of analytics tools, will let urban planners, policy analysts, social scientists,
and computer scientists explore how people actually use a city, in a manner that is
cheap, highly scalable, and insightful. These tools can shed light onto the factors
that come together to shape the urban landscape and the social texture of city life,
including municipal borders, demographics, economic development, resources,
geography, and planning.
As such, our main question here is, how can we use this kind of publicly visible,
geotagged social media data to help us understand cities better? In this position
paper, we sketch out several opportunities for new kinds of analytics tools based on
geotagged social media data. We also discuss some longer-term challenges in using
this kind of data, including biases in this kind of data, issues of privacy, and
fostering a sustainable ecosystem where the value of this kind of data is shared
with more people.
2 Opportunities
In this section, we sketch out some design and research opportunities, looking at
three specific application areas. Many of the ideas we discuss below are speculative.
We use these ideas as a way of describing the potential of geotagged social media
data, as well as offering possible new directions for the research community.
First, and perhaps most promisingly, we believe that geotagged social media data
can offer city planners and developers better information that can be used to
improve planning and quality of life in cities. This might include new kinds of
metrics for understanding people’s interactions in different parts of a city, new
methods of pinpointing problems that people are facing, and new ways of identi-
fying potential opportunities for improving things.
Fig. 1 Left: accurate data about socio-economic deprivation in England; darker indicates more
deprivation. Right: much coarser, out-of-date information about the same index. Figure from
(Smith-Clarke et al. 2014)
Socioeconomic status, however, is not the only metric that matters. A community
can be poor but flourishing, or rich but suffering. Other metrics like violence,
pollution, location efficiency, and even community coherence are important for
cities to track. Some of these are even more difficult to track than socioeconomic
status.
We believe some aspects of quality of life can be modeled using geotagged
social media data. For example, approximating violence may be possible by
analyzing the content of posts. Choudhury et al. (2014) showed that psychological
features associated with desensitization appeared over time in tweets by people
affected by the Mexican Drug War. Other work has found that sentiments in tweets
are correlated with general socio-economic wellbeing (Quercia et al. 2011). Mea-
suring and mapping posts that contain these emotional words may help us find high
crime areas and measure the change over time.
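As a rough sketch of such a mapping (in Python; the word list and grid size are purely illustrative assumptions, not a validated lexicon), one might bin geotagged posts into grid cells and compute the share containing distress-related words:

from collections import defaultdict

DISTRESS = {"shooting", "afraid", "unsafe", "robbed"}  # illustrative only

def distress_map(posts, cell_deg=0.01):
    """posts: iterable of (lat, lon, text). Returns, per grid cell of
    roughly 1 km, the share of posts containing a distress word."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lat, lon, text in posts:
        cell = (round(lat / cell_deg) * cell_deg,
                round(lon / cell_deg) * cell_deg)
        totals[cell] += 1
        if DISTRESS & set(text.lower().split()):
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}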
As another example, location efficiency, or the total cost of transportation for
someone living in a certain location, can be approximated by sites like Walkscore.
com. However, Walkscore currently only relies on the spaces, that is, where
services are on the map. It does not take into account the places, the ways that
people use these services. A small market may be classified as a “grocery store”, but
if nobody goes to it for groceries, maybe it actually fails to meet people’s grocery
needs. We believe geotagged social media data can be used as a new way of
understanding how people actually use places, and thereby offer a better measure
of location efficiency.
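As a toy illustration of that idea (Python; the radius and the input layout are invented for the sketch), one could score a home location by how intensively people actually post from nearby service venues, rather than by the venues' mere presence on the map:

import math

def usage_weighted_access(home, venue_posts, radius_km=1.0):
    """home: (lat, lon); venue_posts: list of (lat, lon, n_posts) per venue.
    Sums posts observed at venues within walking distance, a crude
    usage-weighted alternative to counting venues alone."""
    lat0, lon0 = home
    km_per_deg_lat = 111.0
    km_per_deg_lon = 111.0 * math.cos(math.radians(lat0))
    score = 0
    for lat, lon, n_posts in venue_posts:
        dy = (lat - lat0) * km_per_deg_lat
        dx = (lon - lon0) * km_per_deg_lon
        if math.hypot(dx, dy) <= radius_km:
            score += n_posts
    return score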
One more analysis that could be useful for city planners is understanding the
mobility patterns of people in different parts of different cities. This kind of
information can help, for example, in planning transportation networks (Kitamura
et al. 2000). Mobility can help planners with social information as well, such as how
public or private people feel a place is (Toch et al. 2010). Previously, mobility
information has been gathered from many sources, but they all lack the granularity
and ease of collection of social media data. Cell tower data has been used to
estimate the daily ranges of cell phone users (Becker et al. 2013). At a larger
scale, data from moving dollar bills has been used to understand the range of human
travel (Brockmann et al. 2006). Among other applications, this data could be used
for economic purposes, such as understanding the value of centralized business
districts like the Garment District in New York (Williams and Currid-Halkett
2014). It seems plausible that pre-existing social media data could help us find
similar information without needing people to enter dollar bill serial numbers or
phone companies to grant access to expensive and sensitive call logs. Geotagged
social media data is also more fine-grained, allowing us to pinpoint specific venues
that people are going to.
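One widely used summary from this literature is the radius of gyration (González et al. 2008). A minimal Python sketch, assuming each user's geotagged posts have been reduced to a list of (lat, lon) pairs:

import math

EARTH_KM = 6371.0

def haversine(p, q):
    """Great-circle distance in km between (lat, lon) pairs in degrees."""
    (la1, lo1), (la2, lo2) = (map(math.radians, p), map(math.radians, q))
    a = (math.sin((la2 - la1) / 2) ** 2 +
         math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2)
    return 2 * EARTH_KM * math.asin(math.sqrt(a))

def radius_of_gyration(points):
    """RMS distance (km) of a user's points from their centroid,
    a standard summary of how far an individual typically roams."""
    lat_c = sum(p[0] for p in points) / len(points)
    lon_c = sum(p[1] for p in points) / len(points)
    sq = [haversine(p, (lat_c, lon_c)) ** 2 for p in points]
    return math.sqrt(sum(sq) / len(sq))

Comparing the distribution of this measure across neighborhoods would give planners a first view of how travel ranges vary within a city.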
Originating in the field of architecture, design patterns are good and reusable
solutions to common design problems. Geotagged social media data offers new
ways of analyzing physical spaces and understanding how the design of those
spaces influences people’s behaviors.
For example, in his book A Pattern Language (1977), Alexander and colleagues
present several kinds of patterns characterizing communities and neighborhoods.
These patterns include Activity Nodes (community facilities should not be scattered
individually through a city, but rather clustered together), Promenades (a center for
its public life, a place to see people and to be seen), Shopping Streets (shopping
centers should be located near major traffic arteries, but should be quiet and
comfortable for pedestrians), and Night Life (places that are open late at night
should be clustered together).
By analyzing geotagged social media data, we believe it is possible to extract
known design patterns. One possible scenario is letting people search for design
patterns in a given city, e.g. “where is Night Life in this city?” or “show the major
Promenades”. Another possibility is to compare the relationship of different pat-
terns in different cities, as a way of analyzing why certain designs work well and
others do not. For example, one might find that areas that serve both as Shopping
Streets and as Night Life are well correlated with vibrant communities and general
well-being.
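As a rough sketch of what such a search might look like computationally (Python; the late-night window, grid size and threshold are arbitrary assumptions), one could surface Night Life areas by finding cells with dense late-night check-ins:

from collections import Counter

def night_life_hotspots(checkins, cell_deg=0.005, min_count=50):
    """checkins: iterable of (hour, lat, lon) with hour in 0-23.
    Returns (lat, lon, count) for grid cells with heavy late-night
    activity, a crude stand-in for Alexander's Night Life pattern."""
    late = Counter()
    for hour, lat, lon in checkins:
        if hour >= 22 or hour < 4:          # "late night": an assumption
            late[(round(lat / cell_deg), round(lon / cell_deg))] += 1
    return [(i * cell_deg, j * cell_deg, n)
            for (i, j), n in late.most_common() if n >= min_count]

Comparing the resulting cells with Shopping Streets extracted the same way would support the kind of cross-pattern analysis suggested above.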
Small businesses cannot easily compete with big-box stores in terms of data and
analytics about existing customers. This makes it difficult for small businesses to
tailor their services and advertisements effectively.
Businesses can already check their reviews on Yelp or Foursquare. We believe
that geotagged social media data can offer different kinds of insights about the
behaviors and demographics of customers. One example would be knowing what
people do before and after visiting a given venue. For example, if a coffee shop
owner finds that many people go to a sandwich shop after the coffee shop, they may
want to partner with those kinds of stores or offer sandwiches themselves. This
same analysis could be done with classes of venues, for example, cafés or donut
shops.
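A minimal sketch of such a before/after analysis (Python; it assumes check-ins have already been grouped per user and sorted by time, and the three-hour gap is an arbitrary choice):

from collections import Counter

def next_stops(checkins_by_user, venue, max_gap_hours=3):
    """checkins_by_user: dict user_id -> time-sorted list of
    (datetime, venue_id). Counts venues visited shortly after `venue`."""
    following = Counter()
    for seq in checkins_by_user.values():
        for (t1, v1), (t2, v2) in zip(seq, seq[1:]):
            if v1 == venue and (t2 - t1).total_seconds() <= max_gap_hours * 3600:
                following[v2] += 1
    return following.most_common()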
As another example, an owner may want to do retail trade analysis (Huff 1963),
which is a kind of marketing research for understanding where a store’s customers
are coming from, how many potential customers are in a given area, and where one
can look for more potential customers. Some examples include quantifying and
visualizing the flow and movement of customers in the area around a given store.
Using this kind of analysis, a business can select potential store locations, identify
likely competitors, and pinpoint ideal places for advertisements.
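For reference, a common statement of Huff's model (our notation; Huff used store floor area as the attractiveness measure) gives the probability that a consumer at location i patronizes store j as

P_{ij} = \frac{S_j / T_{ij}^{\lambda}}{\sum_{k=1}^{n} S_k / T_{ik}^{\lambda}}

where S_j is the attractiveness of store j, T_{ij} is the travel time or distance from i to j, \lambda is an empirically estimated decay parameter, and k indexes the n competing stores. Geotagged check-ins could, in principle, supply the observed visit shares needed to estimate \lambda.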
Currently, retail trade analysis is labor intensive, consisting of numerous obser-
vations by field workers (e.g. watching where customers come from and where they
go, or shadowing customers) or surveys given to customers. Publicly visible social
media data offers a way of scaling up this kind of process, and extending the kind of
analysis beyond just the immediate area. For instance, one could analyze more
general kinds of patterns: what are the most popular stores in this area,
and how does the store in question stack up? How does the store in question
compare against competitors in the same city?
Knowing more about the people themselves would be useful as well. For
example, in general, what kinds of venues are most popular for people who come
to this store? If a business finds that all of its patrons come from neighborhoods
where live music is popular, they may want to consider hosting musicians them-
selves. All of the information offered by a service like this would have to be rather
coarse, but it could provide new kinds of insights for small businesses.
New businesses often have many different potential locations, and evaluating them
can be difficult. Public social media data could give these business owners more
insight into advantages and disadvantages of their potential sites. For example, if
they find that a certain neighborhood has many people who visit Thai restaurants in
other parts of the city, they could locate a new Thai restaurant there.
There are also many opportunities to use geotagged social media to benefit individuals. Below, we sketch out a few themes.
Moving to a new city can make it hard for people to be part of a community. The
formally defined boundaries of neighborhoods may help people understand the
spaces where they live, but not so much the socially constructed places (Harrison
and Dourish 1996). Currently it is difficult for non-locals to know the social
constructs of a city as well as locals do. This is particularly important when
someone changes places, either as a tourist or a new resident.
Imagine a new person arriving in a diverse city like San Francisco, with multiple
neighborhoods and sub-neighborhoods. It would be useful for that person to know
the types of people who live in the city, and where each group goes: students go to
this neighborhood in the evening, members of the Italian community like to spend
time in this area, families often live in this neighborhood but spend time in that
neighborhood on weekends.
Some work has been done in this area, but we believe it could be extended.
Komninos et al. collected data from Foursquare to examine patterns over times of
day and days of the week (Komninos et al. 2013), showing the daily variations of
people’s activity in a city in Greece. They showed when people are checking in, and
at what kind of venue, but not who was checking in. Cheng et al. (2011), too, showed general patterns of mobility and times of checkins, but these statistics remain difficult for individuals to interpret.
Related, the Livehoods project (Cranshaw et al. 2012) and Hoodsquare (Zhang
et al. 2013) both look at helping people understand their cities by clustering nearby
places into neighborhoods. Livehoods used Foursquare checkins, clustering nearby
places where the same people often checked in. Hoodsquare considered not only
checkins but also other factors including time, location category, and whether
tourists or locals attended the place. Both of these projects would be helpful for people finding their way in a new city, but even their output could be richer.
Tools like Yelp and Urbanspoon already exist to help people find places they would like to go or discover new places that they didn’t know about. Previous work like Livehoods (Cranshaw et al. 2012) also enabled a richer understanding of those places. The benefit of incorporating social media data, though, is that users can start to understand the people who go to places, not just the places themselves.
In this section we sketch out some potentially interesting research projects using geotagged social media data. Our goal here is to map out different points in the overall design space, which can be useful in understanding the range of applications as well as the pros and cons of various techniques.
3.2 Groceryshed
Watershed maps show where water drains off to lakes and oceans. Researchers have
extended this metaphor to map “laborsheds” and “paradesheds” (Becker et al. 2013)
to describe where people who work in a certain area come from, or people who
attend a certain parade. We could extend this metaphor even further to describe
“sheds” of smaller categories of business, such as Thai restaurants.
More interestingly, we could map change over time in various “sheds”. This
could be particularly important for grocery stores. Places that are outside any
“groceryshed” could be candidate areas for a new store, and showing the change
in people’s behavior after a new store goes in could help measure the impact of that
store. The content of Tweets or other social data could also show how people’s
behavior changed after a new grocery store was put in.
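One rough sketch of how a "groceryshed" might be derived from geotagged data: estimate each visitor's home area from their night-time check-ins, then count the home grid cells of everyone who visited a given store. The grid size, the night-time window, and the data layout here are illustrative assumptions.

```python
from collections import Counter, defaultdict

def home_cell(checkins, cell=0.01):
    """Crude home estimate: the most frequent lat/lon grid cell among a
    user's night-time check-ins (20:00-08:00). `checkins` is a list of
    (lat, lon, timestamp) tuples for one user."""
    cells = Counter(
        (round(lat / cell), round(lon / cell))
        for lat, lon, t in checkins
        if t.hour >= 20 or t.hour < 8
    )
    return cells.most_common(1)[0][0] if cells else None

def groceryshed(user_checkins, store_visitors):
    """Map a store's 'shed': how many of its visitors live in each grid
    cell. `user_checkins` maps user_id -> [(lat, lon, timestamp), ...];
    `store_visitors` is the set of user_ids seen at the store."""
    shed = defaultdict(int)
    for user in store_visitors:
        home = home_cell(user_checkins.get(user, []))
        if home is not None:
            shed[home] += 1
    return shed  # grid cell -> number of visitors living there
```

Computing the shed before and after a store opens, as proposed above, is then a matter of splitting the check-ins by date.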
We envision a system that can convey not only what a place is, but also what it
means. Imagine a user looking up a particular coffee shop. Currently, they can look
up the coffee shop’s web site, find basic information like store hours, and find
reviews on sites like Yelp. Using geotagged social media data, however, we could
surface information like:
• Your friends (or people you follow on Twitter) go here five times per week.
• Friends of your friends go here much more than nearby coffee shops.
• People who are music enthusiasts like you (using topic modeling as in Joseph
et al. (2012)) often go to this coffee shop.
• You’ve been to three other coffee shops that are very similar to this one.
• People who tweet here show the same profiles of emotions as your tweets.
These could help people form a deeper relationship with a place than one based
on locality or business type alone. In addition, we could pre-compute measures of
relevance for a particular user, giving them a map of places that they might enjoy.
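As a sketch of what such pre-computation might look like, the function below blends two of the signals listed above: friend visits, and the overlap between a venue's visitors and the crowd at venues the user already frequents. The data structures and the blending weight are hypothetical.

```python
def venue_relevance(venue_visitors, friends, user_history, alpha=0.5):
    """Score each venue for one user. `venue_visitors` maps venue ->
    set of user_ids who check in there; `friends` is the user's friend
    set; `user_history` is the set of venues the user has visited."""
    # Everyone who visits the same venues this user does.
    my_crowd = set().union(*(venue_visitors[v] for v in user_history)) \
        if user_history else set()
    scores = {}
    for venue, visitors in venue_visitors.items():
        friend_score = len(visitors & friends)          # friends go here
        overlap = len(visitors & my_crowd) / (len(visitors) or 1)
        scores[venue] = alpha * friend_score + (1 - alpha) * overlap
    return scores
```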
We can go beyond assigning people to groups by topic, also showing where they tweet over time. This could help people understand the dynamics of neighborhoods
where, for example, one group of more affluent people are pricing out a group of
previous residents. One interesting work in this area is the Yelp word maps (2013),
which show where people write certain words, like “hipster”, in reviews of busi-
nesses. However, this still describes the places; using social media data, we could
show maps that describe the people. Instead of a map of locations tagged as
“hipster”, we could identify groups of people based on their check-in patterns and
tag where they go during the day. Perhaps the hipsters frequent certain coffee shops
in the morning and certain bars at night, but during the day hang out in parks where
they do not check in.
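A minimal sketch of this people-first mapping, assuming check-ins have already been aggregated into one count vector per user over (venue category, time-of-day) buckets; scikit-learn's k-means stands in for whatever clustering method one might actually choose.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_users(user_vectors, k=5):
    """Cluster users by their check-in signatures. `user_vectors` is an
    (n_users x n_features) array counting each user's check-ins per
    (venue category, hour-of-day) bucket. The returned label per user
    lets us aggregate each group's check-ins by hour and map where the
    group spends its mornings, afternoons, and nights."""
    X = np.asarray(user_vectors, dtype=float)
    # Normalize rows so prolific users don't dominate the clustering.
    X = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```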
For our research, we are currently using data from Twitter, due to its richness and volume. Twitter claims over 500 million tweets are posted per day (Krikorian 2013), and this data is publicly available. While only a small fraction of these tweets are geotagged, even a small fraction of tweets from any given day forms a large and rich data set, and past work suggests that the sampling bias from selecting only these tweets is limited (Priedhorsky et al. 2014).
Fig. 2 Several livehoods in Pittsburgh. Each dot represents one Foursquare location. Venues with
the same color have been clustered into the same livehood (the colors are arbitrary). The center of
the map is a residential area where people with relatively low socioeconomic status live. There is
also a notable lack of Foursquare data in this area
Privacy is also a clear concern in using geotagged social media to understand cities.
From an Institutional Review Board (IRB) perspective, much social media data is considered exempt, because the researchers do not directly interact with participants, the data already exist, and the data are often publicly visible. However, as researchers, we need to go beyond the IRB and offer stronger privacy protections, especially if we make our analyses available as interactive tools.
Here, there are at least two major privacy concerns. The first is making it easy to
access detailed information about specific individuals. Even if a person’s social
media data is public data, interactive tools could make a person’s history and
inferences on that history more conspicuously available. Some trivial examples include algorithms for determining a user’s home and work locations based on their tweets (Mahmud et al. 2013). More involved examples might include other
aspects of their behaviors, such as their activities, preferences, and mobility pat-
terns. In the Livehoods project, we mitigated this aspect of user privacy by only
presenting information about locations, not people. We also removed all venues
labeled as private homes.
Second, we need to be more careful and more thoughtful about the kinds of
inferences that algorithms can make about people, as these inferences can have
far-reaching effects, regardless of whether they are accurate or not.
In our work, we hope to find ways to co-create value for social media users as well as for those who analyze their data. As it exists now, value flows only from users to marketers and analysts.
To create a more sustainable tool, and to avoid impinging on users’ freedoms, it is
important that the users gain some benefit from any system we create as well. Some
of our projects point in this direction, especially the ones aimed at individual users.
People may be more amenable to a tool that offers businesses insights based on their
public tweets if they can have access to those insights as well.
A successful example of co-creation of value is Tiramisu (Zimmerman
et al. 2011). This app aimed to provide real-time bus timing information by asking
people to send messages to a server when they were on a bus. Users were allowed to
get more information if they shared more information. In contrast, OneBusAway
(Ferris et al. 2010) provides real-time bus information using sensors installed on buses. With a collaborative approach, such a costly instrumentation project may not be necessary. In addition, people may feel more
ownership of a system if they contribute to it.
5 Conclusion
References
Ferris B, Watkins K, Borning A (2010) OneBusAway: results from providing real-time arrival
information for public transit. In: Proceedings of CHI, pp 1–10
González MC, Hidalgo CA, Barabási A-L (2008) Understanding individual human mobility
patterns. Nature 453(7196):779–782
Griffin D, Hughes T (2013) Projected 2013 costs of a voluntary American Community Survey.
United States Census Bureau
Harrison S, Dourish P (1996) Re-place-ing space: the roles of place and space in collaborative
systems. In: Proceedings of CSCW, pp 67–76
Hecht B, Stephens M (2014) A tale of cities: urban biases in volunteered geographic information.
ICWSM
Huff DL (1963) A probabilistic analysis of shopping center trade areas. Land Econ 39(1):81–90
Instagram (2014) Our story. http://www.instagram.com/press. Retrieved 24 May 2014
Isaacman S, Becker R, Cáceres R et al (2012) Human mobility modeling at metropolitan scales. In:
Proceedings of MobiSys
Jernigan C, Mistree BFT (2009) Gaydar: Facebook friendships expose sexual orientation. First Monday 14(10)
Joseph K, Tan CH, Carley KM (2012) Beyond “Local”, “Categories” and “Friends”: clustering
foursquare users with latent “Topics.” In: Proceedings of Ubicomp
Kitamura R, Chen C, Pendyala RM, Narayanan R (2000) Micro-simulation of daily activity-travel patterns for travel demand forecasting. Transportation 27:25–51
Komninos A, Stefanis V, Plessas A, Besharat J (2013) Capturing urban dynamics with scarce
check-in data. IEEE Pervasive Comput 12(4):20–28
Krikorian R (2013) New tweets per second record, and how! Twitter Engineering Blog. 16 Aug 2013. Retrieved from https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
Lindqvist J, Cranshaw J, Wiese J, Hong J, Zimmerman J (2011) I’m the Mayor of my house:
examining why people use foursquare - a social-driven location sharing application. In: Pro-
ceedings of CHI
Lowenthal TA (2006) American Community Survey: evaluating accuracy
Mahmud J, Nichols J, Drews C (2013) Home location identification of Twitter users. ACM Trans
Intell Syst Technol. doi:10.1145/2528548
Martine G et al (2007) The state of world population 2007: unleashing the potential of urban growth. United Nations Population Fund, New York. Retrieved from https://www.unfpa.org/public/home/publications/pid/408
McCullagh D (2011) Carrier IQ: more privacy alarms, more confusion. CNET News. http://www.cnet.com/news/carrier-iq-more-privacy-alarms-more-confusion/
McCullagh D (2012) Verizon draws fire for monitoring app usage, browsing habits. CNET News. http://www.cnet.com/news/verizon-draws-fire-for-monitoring-app-usage-browsing-habits/
Palchykov V, Kaski K, Kertész J, Barabási A-L, Dunbar RIM (2012) Sex differences in intimate
relationships. Sci Rep 2:370
Pew Internet (2014) Mobile technology fact sheet. http://www.pewinternet.org/fact-sheets/
mobile-technology-fact-sheet/. January 2014
Priedhorsky R, Culotta A, Del Valle SY (2014) Inferring the origin locations of tweets with
quantitative confidence. In: Proceedings of CSCW, pp 1523–1536
Quercia D, Ellis J, Capra L et al (2011) Tracking “Gross Community Happiness” from Tweets.
CSCW
Rost M, Barkhuus L, Cramer H, Brown B (2013) Representation and communication: challenges
in interpreting large social media datasets. In: Proceedings of CSCW
Smith-Clarke C, Mashhadi A, Capra L (2014) Poverty on the cheap: estimating poverty maps
using aggregated mobile communication networks. In: Proceedings of CHI
Toch E, Cranshaw J, Drielsma PH et al (2010) Empirical models of privacy in location sharing. In:
Proceedings of Ubicomp, pp 129–138
Williams S, Currid-Halkett E (2014) Industry in motion: using smart phones to explore the spatial
network of the garment industry in New York City. PLoS One 9(2):1–11
Yelp Word Maps (2013) http://www.yelp.com/wordmap/sf/
Zhang AX, Noulas A, Scellato S, Mascolo C (2013) Hoodsquare: modeling and recommending
neighborhoods in location-based social networks. In: Proceedings of SocialCom, pp 1–15
Zimmerman J et al (2011) Field trial of tiramisu: crowd-sourcing bus arrival times to spur
co-design. In: Proceedings of CHI
Developing an Interactive Mobile
Volunteered Geographic Information
Platform to Integrate Environmental Big
Data and Citizen Science in Urban
Management
Zhenghong Tang, Yanfu Zhou, Hongfeng Yu, Yue Gu, and Tiantian Liu
Abstract A significant technical gap exists between the large amount of complex
scientific environmental big data and the limited accessibility to these datasets.
Mobile platforms are increasingly becoming important channels through which
citizens can receive and report information. Mobile devices can be used to report
Volunteered Geographic Information (VGI), which can be useful data in environ-
mental management. This paper evaluates the strengths, weaknesses, opportunities,
and threats for the selected real cases: “Field Photo,” “CoCoRaHS,” “OakMapper,”
“What’s Invasive!”, “Leafsnap,” “U.S. Green Infrastructure Reporter”, and
“Nebraska Wetlands”. Based on these case studies, the results indicate that active, loyal, and committed users are key to ensuring the success of citizen science projects. Online and off-line activities should be integrated to promote the effectiveness of public engagement in environmental management. There is an urgent need to bring complex environmental big data to citizens’ everyday mobile devices, which will then allow them to participate in urban environmental management. A technology framework is provided to improve existing mobile-based environmental engagement initiatives.
1 Introduction
access the data, they must follow specific instructions on the websites, download
the data and then open it using professional tools.
Authoritative information and tools, such as remote sensing data from NASA (the National Aeronautics and Space Administration) or NWI (National Wetlands Inventory) data from the USFWS (U.S. Fish & Wildlife Service), are reliable and accurate to a high resolution. However, they are expensive and time-consuming to use, relatively inaccessible, and geographically limited. In contrast, using VGI to
aid environmental monitoring and management does not require expensive
updating and maintenance for high-resolution remote sensing data. This is because
users, instead of agencies and corporations, update the maps. VGI also does not
involve the obvious costs of building large databases, because once a VGI system is built, it functions as a data-driven website (one type of dynamic web page). However, there are still certain costs associated with maintenance and monitoring, security, privacy or trust management, obtaining open source licenses, updating underlying databases, and linking to new applications. Other expenses include general start-up costs and the cost of forming strategies for growing the user community. In addition, VGI can also bridge the gap between
citizens, scientists, and governments. The application of VGI in environmental
monitoring emphasizes the importance of participants in knowledge production
and reduction of existing gaps between the public, researchers, and policymakers
(Peluso 1995; Bailey et al. 2006; Mason and Dragicevic 2006; Parker 2006; Walker
et al. 2007). Applying VGI in environmental monitoring enables access to potential
public knowledge (Connors et al. 2012). It is only limited by a user’s spatial
location, and therefore it is more flexible than traditional methods in certain cases
such as wetland locations and water quality management.
While some users may just want to browse rather than modify data, the learning process required to access existing databases may be very time-consuming and therefore off-putting. For example, if a user wants to check whether his or her property is in a national wetland inventory area, the user shouldn’t have to do extensive research on federal websites and open the data in ESRI’s ArcMap program (ESRI, Environmental Systems Research Institute). Instead, the user can simply access wetland data on a smartphone anytime, anywhere. Web 2.0 technologies and VGI methods can resolve these issues, make websites compatible with smart devices, and bring databases to crowdsourcing clients, which would significantly benefit environmental monitoring and management.
In order to ensure the objectivity of the case study methodology, this paper adopts six criteria for case selection: information platform, issue addressed, data collection method, data presentation, service provider, and coverage. The information platform of the target cases should be an interactive, mobile-accessible platform that
allows citizens to view spatial information and participate through mobile devices. The selected cases should address environmentally-related topics. The data collection should rely on the public’s submissions as primary data sources. The data presentation should include geospatial maps. The service providers should be research institutes, and the projects chosen should have been designed for research purposes. The coverage criterion requires each case to address a different topic, to avoid repetition and ensure the diversity of the selected cases. Based on the above criteria, seven real cases were selected: “Field Photo”, “CoCoRaHS,” “OakMapper,” “What’s Invasive!”, “Leafsnap,” “U.S. Green Infrastructure Reporter”, and “Nebraska Wetlands”.
4 Case Studies
“Field Photo” was developed by the Earth Observation Modeling Facility at the
University of Oklahoma. It has a mobile system to enable citizens to share their
field photos, show footprints of travel, support monitoring of earth conditions, and
verify satellite image data. Field Photo can document citizen observations of
landscape conditions (e.g., land use types, natural disasters, and wildlife).
Citizen-reported photos are included in the Geo-Referenced Field Photo Library, which is an open-source data archive. Researchers, land managers, and citizens
can share, visualize, edit, query, and download the field photos. More importantly,
these datasets provide crowdsourcing geospatial datasets for research on land use
and changes to land coverage, the impacts of extreme weather events, and envi-
ronmental conservation. Registered users have greater access to the photo library than guest users do. Both the iOS and Android versions of the “Field Photo” application have been available since 2014 for public download. Potential users
include researchers, students and citizens. They can use GPS-enabled cameras,
smartphones, or mobile devices to take photos to document their observations of
landscape conditions and events.
“CoCoRaHS” represents the Community Collaborative Rain, Hail and Snow
Network that was set up and launched by three high school students with local
funding. CoCoRaHS is a non-profit, community-based network of volunteers who
collaboratively measure and map precipitation (rain, hail and snow) (Cifelli
et al. 2005). Beginning with several dozen enthusiastic volunteers in 1998, the number of participants has increased every year. Besides active volunteers, there are some people who have participated in the program for a few weeks but have not remained active over the long term. In 2000, CoCoRaHS received funding
from the National Science Foundation’s Geoscience Education program and was
operated by the Colorado Climate Center at Colorado State University. Based on
real-time statistical data, around 8000 daily precipitation reports were received in
2013 across the United States and Canada. Mobile applications of CoCoRaHS
Observer for iOS and Android systems were provided by Steve Woodruff and
Appcay Software (not CoCoRaHS) to allow registered volunteers to submit their
daily precipitation reports via their mobile devices. The potential users are members of the general public who are interested in measuring precipitation.
“OakMapper” was developed by the University of California-Berkeley in 2001
(Kelly and Tuxen 2003). Sudden oak death is a serious problem in California and
Oregon forests. Because there are so many people who walk or hike in these forests,
“OakMapper” extended its original site, which further allowed communities to
monitor sudden oak death. “OakMapper” has successfully explored the potential
synergy of citizen science and expert science efforts for environmental monitoring
in order to provide timely detection of large-scale phenomena (Connors et al. 2012).
By 2014, it had collected 3246 reports, most of which came from California.
However, “OakMapper” is not a full real-time reporting system. Submitted data
can only be displayed if it contains a specific address. As of 2014, OakMapper only
had an iOS-based mobile application. Users can view the data, but they need to
register in order to submit volunteered reports. The potential users include either
general citizens or professional stakeholders (e.g. forest managers, biologists, etc.).
The project provides a unique data source for examining sudden oak death in California.
“What’s Invasive!” is a project that attempts to get volunteer citizens to locate invasive species anywhere in the United States by making geo-tagged observations and taking photos that provide alerts of habitat-destroying invasive
plants and animals. This project is hosted and supported by the Center for Embed-
ded Networked Sensing (CENS) at the University of California, the Santa Monica
Mountains National Recreation Area, and Invasive Species Mapping Made Easy, a
web-based mapping system for documenting invasive species distribution devel-
oped by the University of Georgia’s Center for Invasive Species and Ecosystem Health. Any user who registers as a project participant must provide an accurate email address. Users can self-identify as a beginner, as someone with some species identification training, or as an expert. This project only tracks statistical data, such
as the frequency with which users log in for research use. No personal background
data on users are collected. Both the iOS and Android versions of mobile applica-
tions have been available since 2013 for citizen download. The potential users are
general citizens who are interested in invasive species.
“Leafsnap” is a pioneer in a series of electronic field guides being developed by
Columbia University, the University of Maryland, and the Smithsonian Institution.
It uses visual recognition software to help identify tree species from photographs of
their leaves. Leafsnap provides high-resolution images of leaves, flowers, fruit,
petiole, seed, and bark in locations that span the entire continental United States.
Leafsnap allows users to share images, species identifications, and geo-coded stamps of species locations, and to map and monitor the ebb and flow of flora nationwide. Both iOS and Android versions of the mobile application are available
for citizen use. Citizens with interests in trees and flowers are the potential users for
this application.
“U.S. Green Infrastructure Reporter” was developed in the Volunteered
Geographic Information Lab at the University of Nebraska-Lincoln in 2012. Its
main purpose is to allow stakeholders and citizens to report green infrastructure
sites and activities through their mobile devices. This mobile information system
has a GPS-synchronous real-time reporting function with its own geospatial data-
base that can be used for analysis. It provides both iOS and Android versions of
mobile applications. More than 6700 reports were collected across the United States
by 2013. Both professional stakeholders and general citizens are the potential users.
“Nebraska Wetlands” was developed in the Volunteered Geographic Informa-
tion Lab at the University of Nebraska-Lincoln in 2014. This application brings wetland datasets such as NWI and SSURGO data to mobile devices, allowing citizens to easily access these complex environmental datasets. In addition, the application incorporates a GPS-synchronous real-time reporting system to allow citizens to upload their observations. The system has both iOS and Android versions of its mobile application. More than 600 reports were collected in Nebraska by 2015. Both professional stakeholders and general citizens are the potential users.
5 SWOT Analysis
This study adopts a qualitative method to analyze the selected cases. SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis is a structured method used to evaluate them. Strengths are the characteristics that give one case an advantage over the others; weaknesses are the characteristics that place a case at a disadvantage. Opportunities are elements in the environment that a project can exploit to its advantage, while threats are elements in the environment that could cause trouble or uncertainty for its development. SWOT analysis results can provide an overview of these cases in terms of their existing problems, barriers, and future directions. Through identifying the strengths, weaknesses, opportunities, and threats of the seven real cases: “Field Photo”, “CoCoRaHS,” “OakMapper,” “What’s Invasive!”, “Leafsnap,” “U.S. Green Infrastructure Reporter”, and “Nebraska Wetlands”, this paper proposes a technical framework to guide future
mobile application development.
6 Results
6.1 Strengths
Mobile-based information platforms have unique strengths that web-based platforms lack. All the cases were built on mobile platforms: citizens can use their portable mobile devices to participate in these projects and do not need to rely on special equipment. Projects such as “Field Photo”, “CoCoRaHS”, and “OakMapper” can be incorporated into citizens’ daily activities.
6.2 Weaknesses
The weaknesses of these projects include limited availability of mobile devices, age gaps, data quality issues, and data verification issues. Some citizens, particularly minority groups, still cannot afford GPS-enabled mobile devices with expensive data plans (Abascal et al. 2015). Current patterns of mobile device usage also suggest that the aging population may have relatively lower levels of knowledge of, and/or interest in, using mobile devices. The quality of citizen-reported data is a fundamental challenge for
any citizen science project, including our seven cases. For example, data verification for citizen-reported invasive species needs a significant amount of time and resources. In addition, experience with the “U.S. Green Infrastructure Reporter” also bears out that the number of users does not equal the number of active users. Most green infrastructure reports were submitted by a small number of active users. Thus, the total number of users only provides an overview of project coverage; the number of active users is a better indicator of who contributes to citizen science projects. More importantly, a high level of participation can bring a large amount of citizen-submitted data, but this does not mean that citizens provide a large amount of useful data, as with the rain, hail, and snow data from the “CoCoRaHS” project. Data quality depends on users’ understanding, judgment, and operational experience, while quality verification depends on experts’ knowledge and experience. For example, a manual verification procedure for green infrastructure sites can reduce the probability of error, but needs more time and money to implement. Compared to the authoritative data from agencies or corporations, the quality of data collected from VGI systems is typically a concern. However, it has been shown that there is no significant difference between data collected by scientists and by volunteers and, in fact, volunteers can make valuable contributions to the data collection process (Engel and Voshell 2002; Harvey et al. 2001).
The VGI method shares problems with other crowdsourcing practices. It still has occasional data quality problems, especially when volunteers are not motivated to contribute their knowledge to the VGI system. For example, in the “Field Photo” and “CoCoRaHS” projects, some volunteers may not be willing to finish all the inventory questions when the results of the VGI methods cannot be seen immediately. In addition to data quality issues, crowdsourcing has no time constraint on the data collection process. When using the VGI method, it is hard for planners to define or identify citizens’ motivations and interests, and whether those motivations and interests will have a positive or a negative impact on the planning process. In projects such as “Field Photo” and “Leafsnap”, how to prevent intellectual property leakage on sensitive topics is another concern. There are currently no standards for designing a VGI system, so there is not much control over the development or the ultimate products, as with “U.S. Green Infrastructure Reporter” and “Nebraska Wetlands”. Thus, when using the VGI method, planners also need to think about what can be crowdsourced and what cannot, such as content that has copyright issues.
6.3 Opportunities
Mobile devices are becoming an increasingly important information channel for people to receive and report information. Many of these projects, such as “Field Photo”, “Leafsnap,” “OakMapper,” and “What’s Invasive!”, have successfully combined online activities and off-line engagement activities to attract new participants and retain existing users. The case study of the “CoCoRaHS” project also found that informal communication is a valuable part of mobile engagement. From the experiences in the “CoCoRaHS” project, the integration of nonverbal communication and personal touches can improve the effectiveness of mobile engagement. In addition, volunteer coordinators are helpful in promoting volunteer activities. From the “U.S. Green Infrastructure Reporter” and “Nebraska Wetlands”, we found that social media and social networks are promising channels through which to attract people in the virtual community, although these mobile VGI projects are not yet fully empowered by social media and social networks. In general, applying VGI methods to planning means more than building a VGI system; it is both a planning process and a crowdsourcing practice. Crowdsourcing can be considered an emerging planning issue, especially as e-government practices become more popular.
6.4 Threats
The first threat is limited motivation from citizens (Agostinho and Paço 2012).
Projects without adequate motivation are the greatest threat to long-term success.
Most citizens have no interest in scientific research projects that cannot bring any
direct economic or social benefits to them (Tang and Liu 2016). A successful
engagement should build a shared vision and devote responsibility and commitment
in specific tasks, such as invasive species observation and green infrastructure site
report. From the seven cases, we found that a two-way communication platform
cannot be fully engaged due to lack of timely feedback from the organizers. None of
the cases has provided timely online feedbacks for the users. The reported data can
only be directly used by the project organizers for analysis. Current databases
(e.g. “Field Photo”, “U.S. Green Infrastructure Reporter”, and “Nebraska Wet-
lands”) do not have efficient approaches to manage citizen science data such as
photos and videos. The analysis of the collected data is still very superficial and
does not have any in-depth data mining. The collected data types mainly only
include text and photos and do not include more scientific data. These projects still
lack strategic management and sustainability. Most of these projects depend on
specific funding to support their daily operations. Both the “CoCoRaHS” and
“What’s Invasive!” projects seek for alternative funding resources to reduce the
threats and to ensure effective implementation.
Although mobile device technology has improved greatly, there are still two technology problems that affect the adoption of VGI methods: signal coverage and quality for mobile devices, and battery life. Although most mobile devices can get good signals in urban areas, in rural areas mobile devices
can disconnect suddenly due to inadequate signal range or poor signal quality, which can result in missing or erroneous data. For example, “Field Photo” and “Nebraska Wetlands” were not accessible in some rural areas. In addition, information service providers have different signal coverage. Even in the same city, signal quality and strength vary greatly, and VGI users may lose patience if their phone service isn’t available in a particular area. Signal quality and strength not only affect the adoption of VGI methods; they also affect the volunteers themselves. When we used “Nebraska Wetlands”, the mobile maps loaded very slowly in some areas. Adopting VGI methods also requires users to have better mobile devices, although VGI developers and planners will try to cover as many popular devices as they can and reduce the memory requirements of mobile applications.
There are still some hardware limits that cannot be solved, however, such as battery life. If the VGI users are a group of professionals, they may use VGI apps on their mobile devices to collect data around the clock, with the screen constantly refreshing, which drains the battery. Another problem with adopting VGI methods is covering the entire mobile phone market. It is in VGI developers’ best interest to have their apps deployed on both Android and iOS phones. This is not a simple task, because Android and iOS have different policies for reviewing a developer’s app. Google’s platform is open source and allows more freedom to access phone capabilities, whereas Apple’s is closed and is sensitive about access to phone capabilities such as sensors. Keeping VGI apps working on both Google and Apple phones is a challenge for developers and planners. In general, Google and Apple both provide a good development space for promoting VGI methods, and each has its own advantages and disadvantages.
Hackers remain a threat, and the same concerns extend to the mobile era. From a technological viewpoint, some smartphone systems do have security issues, and hacking software targeting smartphones is common. Utilizing VGI methods can also become a new opportunity for hackers to invade public privacy. With these risks in mind, users question the necessity of installing software on their smartphones that they do not need to rely on or use every day.
7 Discussion
Based on the findings from the selected cases, this paper suggests a technical development framework that can be used to improve existing mobile-based environmental information systems. The sections below explain this framework for future VGI system development. The proposed framework aims to build on the strengths, overcome the weaknesses, expand the opportunities, and reduce the threats.
From the seven selected cases, we analyzed the strengths, weaknesses, opportunities, and threats, and on that basis propose a new technical development framework that can maximize VGI applications in environmental management. Understanding users’ needs is essential for using VGI methods in the environmental management field; when designing mobile information platforms, designers also need to think about what kinds of needs the public has. System deployment includes two kinds of tasks: front-end tasks and back-end tasks. Front ends represent the client side of the system, such as desktop users and mobile users. The back end is the server side of the system; it includes two web servers and a GIS server. There are two types of
front ends: web front ends and mobile front ends. Web front ends are web applications that allow Internet clients to request back-end services through a URL (Uniform Resource Locator) via web browsers. Mobile front ends are mobile applications that allow mobile clients to request back-end services through smartphones or tablets; they can be downloaded and installed via the App Store or Google Play. Typically, multi-requests occur when too many front-end users access the server simultaneously. The number of multi-requests from mobile clients is usually higher than from desktop clients, which means that the server back end should be optimized to account for these differences. Besides the REST (Representational State Transfer) service hosted on a GIS server, educational information is also hosted on the back end. Educational information is typically placed on static web pages, which do not require communication with a GIS server. By adding an additional web server, the system can split clients into two groups: those requiring GIS services and those who do not. Using this deployment method, the total number of multi-requests to the GIS server can be reduced significantly when a certain share of clients only browse the education web pages, while those who search on the map or report data can still have a smooth experience. In short, this deployment method reduces the peak number of multi-requests from clients, especially when many mobile users are only viewing static information through the front end. A workstation was also added to the back end to give data access to professional users such as GIS analysts (Fig. 1). In addition, by publishing web maps through ArcGIS Online, data can also be shared with other professionals such as environmental scientists and planners (Fig. 1).
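As a minimal sketch of this client-filtering idea, the front-facing web server below answers education pages itself and forwards only map queries to the GIS server's REST endpoint. The endpoint paths, the GIS server URL, and the choice of Flask are illustrative assumptions, not part of the deployed system.

```python
from flask import Flask, Response, request
import requests

app = Flask(__name__)
GIS_REST = "http://gis-server.example.org/arcgis/rest"  # hypothetical back end

@app.route("/education/<page>")
def education(page):
    # Static educational content is answered here and never reaches the
    # GIS server, reducing its peak number of multi-requests.
    return app.send_static_file(f"education/{page}.html")

@app.route("/gis/<path:service>")
def gis_proxy(service):
    # Only map queries and report submissions are forwarded onward.
    resp = requests.get(f"{GIS_REST}/{service}", params=request.args)
    return Response(resp.content, status=resp.status_code,
                    content_type=resp.headers.get("Content-Type"))

if __name__ == "__main__":
    app.run()
```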
The key point of the system architecture is the use of embedded JavaScript in HTML (HyperText Markup Language) (Fig. 2). JavaScript is a client-side scripting language, and embedding it is an affordable way to deploy small business applications on clients, because most current browsers, such as Firefox, Safari, or Chrome, support JavaScript well. Thus, it is possible to develop only one application and make it executable on different platforms. PhoneGap (a mobile development framework produced by Nitobi) realizes this possibility; developers do not need to update their code frequently or
develop separate applications for each platform, which can significantly reduce the total cost of the system. The HTML code with embedded JavaScript can be wrapped into the native environment on the client side, because PhoneGap provides a bridge between the native environment and the web environment (Fig. 2).
The design framework for mobile front ends, including visualization and user
interfaces, includes five key features (Fig. 3): (1) GPS reporting and mapping
features will enable users to browse maps, send geocoded photos or data, and
query attributes of geometric objects. (2) Publication and education features will be designed for novice technicians and students to study green infrastructure strategies; third-party publications are also posted as links in this feature. (3) With the linkage to social media, users, experts, and advocates can share their ideas through
social networks. Several popular social networks will be included, such as Twitter
and Facebook. (4) News and research progress will be posted through a research
exhibition feature for environmental experts. (5) Users can also find reviews and
contact information through feedback and contacts features.
A mobile application, the front end of the VGI system, has a great impact on
attracting volunteers and promoting VGI concepts to the public. Issues to consider
include whether it appears user-friendly, and whether the key functionality
enhances the user experience. There are three different types of mobile applica-
tions: native apps, hybrid apps, and web apps. All of these applications have their
advantages and disadvantages when building a VGI system. Developers and plan-
ners also need to choose a suitable application type for their projects. It is difficult to
assess which type is the perfect option for building a VGI system; planners and
developers need to balance development costs and time, as well as key functional
features.
Native applications are those developed in native programming languages, which are specified by mobile Operating Systems (OS). A native application can only run on the mobile OS that supports it. One immediate advantage of a native application is that all documentation can be done with or without Internet connectivity. A native application can also run smoothly on the specified OS, and it usually has fewer bugs. However, a native application also has disadvantages. Since
the native programming language is used to develop native apps, it is hard for a
native app to be cross-platform. If planners or developers choose to develop a native
application, they need to program on every OS platform by using different native
Fig. 4 Comparison of the development workflows between a native app and a hybrid app: a hybrid app is written and tested once, then built for Android, iOS, BlackBerry, and Windows 8
OS coding languages, such as Objective-C and Java (Fig. 4). In addition, updating
the native app is also a problem since it requires knowledge of different program-
ming languages. It can be expensive and time-consuming to develop a native app
for the VGI concept, but in some cases, such as local data analysis, choosing a
native application is a wise solution.
Web applications are actually websites in mobile form, which rely on a browser and Internet connectivity. Web applications don’t have a native look, but they are much cheaper to develop than native apps and hybrid apps. Web apps are not popular, because they require users to remember website links. In general, web apps are not suitable for promoting VGI concepts because they don’t look user-friendly and they often deliver a blurred user experience. Hybrid apps are part native app and part web app. Hybrid app development requires web experience, such as HTML5 and JavaScript. Although hybrid apps are not developed using native programming languages, they can have a native look and can access the functionality of the smartphone. The native appearance of a hybrid app relies on an open-source or third-party user interface framework, such as jQuery Mobile or Sencha Touch. Thus, its performance may not be as smooth as a native app’s, but a hybrid app is usually cheaper to build and easier to maintain than a native app. In addition, choosing a hybrid app doesn’t require planners and developers to have native development experience; it only requires web development experience, making it easier for a hybrid app to be cross-platform. Most social media apps are hybrid apps, because they are easily distributed on different mobile OS platforms. In general, choosing the right type of app for building a VGI system is very important, because it has a direct impact on the volunteers. It is hard to say that one type of app workflow is better than the others; all have their own advantages and disadvantages.
9 Conclusions
Acknowledgements This paper has been funded wholly or partially by the United States Envi-
ronmental Protection Agency (EPA) under an assistance agreement (CD-977422010;
UW-97735101). The contents do not necessarily reflect the views and policies of the EPA, nor
does the mention of trade names or commercial products constitute endorsement or recommen-
dation for use.
References
Abascal J, Barbosa SDJ, Nicolle C, Zaphiris P (2015) Rethinking universal accessibility: a broader
approach considering the digital gap. Univ Access Inf Soc. doi:10.1007/s10209-015-0416-1
Agostinho D, Paço A (2012) Analysis of the motivations, generativity and demographics of the
food bank volunteer. Int J Nonprofit Volunt Sect Mark 17:249–261
Bailey C, Convery I, Mort M, Baxter J (2006) Different public health geographies of the 2001 foot
and mouth disease epidemic: “Citizen” versus “professional” epidemiology. Health Place
12:157–166
Brabham DC (2009) Crowdsourcing the public participation process for planning projects. Plan
Theory 8(3):242–262
Cifelli R, Doesken N, Kennedy P, Carey LD, Rutledge SA, Gimmestad C, Depue T (2005) The
community collaborative rain, hail, and snow network. Am Meteorol Soc 86(8):1069–1077
Connors JP, Lei S, Kelly M (2012) Citizen science in the age of neogeography: utilizing
volunteered geographic information for environmental monitoring. Ann Assoc Am Geogr
102(X):1–23
Conrad CC, Hilchey KG (2011) A review of citizen science and community-based environmental
monitoring: issues and opportunities. Environ Monit Assess 176(1-4):273–291
Devictor V, Whittaker RJ, Beltrame C (2010) Beyond scarcity: citizen science programmes as
useful tools for conservation biogeography. Divers Distrib 16:354–362
J. Yin • Y. Gao
CyberGIS Center for Advanced Digital and Spatial Studies, University of Illinois at Urbana-
Champaign, Urbana, IL 61801, USA
CyberInfrastructure and Geospatial Information Laboratory, University of Illinois at Urbana-
Champaign, Urbana, IL 61801, USA
Department of Geography and Geographic Information Science, University of Illinois at
Urbana-Champaign, Urbana, IL 61801, USA
e-mail: [email protected]
S. Wang (*)
CyberGIS Center for Advanced Digital and Spatial Studies, University of Illinois at Urbana-
Champaign, Urbana, IL 61801, USA
CyberInfrastructure and Geospatial Information Laboratory, University of Illinois at Urbana-
Champaign, Urbana, IL 61801, USA
Department of Geography and Geographic Information Science, University of Illinois at
Urbana-Champaign, Urbana, IL 61801, USA
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
61801, USA
Department of Urban and Regional Planning, University of Illinois at Urbana-Champaign,
Urbana, IL 61801, USA
National Center for Supercomputing Applications, University of Illinois at Urbana-
Champaign, Urbana, IL 61801, USA
e-mail: [email protected]
1 Introduction
(Maisonneuve et al. 2009, 2010). This effort also showed some promising results
regarding the effectiveness of participatory noise mapping. Compared to the tradi-
tional noise monitoring approach that relies on centralized sensor networks, the
mobile approach is less costly; and with collective efforts, this approach using
humans as sensors can potentially reach a significantly larger coverage of the city.
With integrated environmental sensors,1 the new generation mobile devices can
instrument comprehensive environmental properties, such as ambient temperature,
air pressure, humidity, and sound pressure level (i.e., noise level). However, when large numbers of participants engage in crowdsourcing activities, a large volume of near real-time, unstructured data is produced. Conventional end-to-end computational infrastructures have difficulty managing, processing, and analyzing such datasets (Bryant 2009), and require support from more advanced cyberinfrastructure for data storage and computational capabilities.
This paper describes a CyberGIS-enabled urban sensing framework to facilitate
the participation of volunteered citizens in monitoring urban environmental pollu-
tion using mobile devices. CyberGIS represents a new-generation GIS (Geographic
Information System) based on the synthesis of advanced cyberinfrastructure, GIS
and spatial analysis (Wang 2010). It provides abundant cyberinfrastructure
resources and toolkits to facilitate the development of applications that require
access to, for example, high performance and distributed computing resources and
massive data storage. This framework enables scalable data management, analysis,
and visualization intended for massive spatial data collected by mobile devices. To
demonstrate its functionality, we focus on the case of noise mapping. In general,
this framework integrates a MongoDB2 cluster for data storage, a MapReduce
approach (Dean and Ghemawat 2008) to extracting and aggregating noise records
collected and uploaded by mobile devices, and a parallel kernel smoothing algorithm using graphics processing units (GPUs) for efficiently creating noise pollution maps from massive collections of records. This framework also implements a mobile
application for capturing geo-tagged and time-stamped noise level measurements
as users move around in urban settings.
The remainder of this paper is organized as follows: Section “Participatory Urban Sensing and CyberGIS” describes related work in the context of volunteered participation of citizens in sensing the urban environment, focusing on the research challenges in data management, processing, analysis, and visualization; in particular, CyberGIS is argued to be suitable for addressing these challenges. Section “System Design and Implementation” illustrates the details of the design and implementation of the CyberGIS-enabled urban sensing framework. Section “User Case Scenario” details a user case scenario for noise mapping using mobile devices. Section “Conclusions and Future Work” concludes the paper and discusses future work.
1 http://developer.android.com/guide/topics/sensors/index.html
2 http://www.mongodb.org/
3 http://play.google.com/store/apps/details?id=com.julian.apps.SPLMeter&hl=en
4 http://www.patrick-wied.at/static/heatmapjs/
5 http://geoserver.org/
GIS and spatial analysis (Wang 2010). As illustrated in the overview of the CyberGIS architecture in Fig. 1, CyberGIS provides a range of capabilities for tackling data- and computation-intensive challenges, where the embedded middleware links different components to form a holistic platform tailored to specific requirements.
In particular, our framework utilizes several components within this architec-
ture. In the “distributed data management” component, we deploy a MongoDB
cluster over multiple computing nodes for monitoring data intake and storage,
which is scalable to the growth of collected data volume. Compared to a relational
database, the NoSQL database supports more flexible data models with easy scale-
out ability and high performance advantages (Han et al. 2011; Wang et al. 2013b).
In the “high performance support” layer, we rely on the MapReduce functionality
of the MongoDB cluster for data processing, such as individual user trajectory extraction, which is used to visualize the pollution exposure of a particular participant, and aggregation of data provided by all participants into a 1-h (this value is
defined for the ease of implementation and can be changed according to user
specifications) time window. This is then used to dynamically produce noise
maps for the monitored environment. And finally, in the “data and visualization”
layer, we apply a parallel kernel smoothing algorithm for rapid noise map gener-
ation using GPUs. Specific design and implementation details will be discussed in
the following section.
The workflow first filters out invalid data records (e.g., records without valid coordinates) and then parses each record as a JSON object before saving it to the MongoDB cluster. The MongoDB cluster is chained in a master-slave style in order to achieve scalability as datasets accumulate to a significant size, which is one of its significant advantages over existing relational databases. Another advantage brought by the MongoDB cluster is the embedded mechanism for performing MapReduce tasks. Since there is no predefined data schema and the input data are simply raw documents with the only structure being <key, value> pairs, the MapReduce function can efficiently sort the “unstructured” records based on specified keys, e.g. timestamp, unique user ID, or even geographical coordinates (or a combination of these). More importantly, the data are stored in a distributed fashion, meaning multiple computing nodes can perform such tasks simultaneously, which is otherwise nearly impossible for conventional database users.
6 http://json.org/
7 http://en.wikipedia.org/wiki/GeoTIFF
8 http://www.esri.com/software/arcgis/arcgisonline/maps/maps-and-map-layers
Fig. 2 System workflow: mobile devices capture GPS and sensor measurements and upload them over the Internet through RESTful web service interfaces to a web server; the MongoDB cluster performs data processing with MapReduce (data aggregation, trajectory extraction, time windows), feeding pollution map visualization
The aggregation is also implemented using the MapReduce method, where
the device ID is treated as the map key and the reduction process is based on the
timestamps that fall in a specified 1-h time window.
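A minimal sketch of this aggregation step using pymongo is shown below; the database, collection, and field names are assumptions about the record schema rather than the project's actual ones, and the timestamp is assumed to be stored as a BSON date.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # cluster URI is hypothetical
records = client.urban_sensing.noise                # one document per measurement

# Group measurements into per-device, per-hour buckets, mirroring the
# MapReduce step: device ID is the map key and the reduction runs over
# timestamps falling in each 1-h window.
pipeline = [
    {"$match": {"lat": {"$ne": None}, "lon": {"$ne": None}}},
    {"$group": {
        "_id": {
            "device": "$device_id",
            "hour": {"$dateToString": {"format": "%Y-%m-%d %H",
                                       "date": "$timestamp"}},
        },
        "mean_db": {"$avg": "$noise_db"},
        "n": {"$sum": 1},
    }},
]
for bucket in records.aggregate(pipeline):
    print(bucket["_id"], round(bucket["mean_db"], 1), bucket["n"])
```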
The pollution map is dynamically generated by using a kernel smoothing
method. Kernel smoothing is used to estimate a continuous surface of environmen-
tal measures (e.g. noise level) from point observations. The estimated measurement
at each location (target location) is calculated as a weighted average of the
observations within a search window (or bandwidth). The weight of each observa-
tion is decided by applying a kernel function to the distance between the target
location and that observation. The kernel function is typically a distance decay
function with a maximum value when the distance is zero and with a zero value
when the distance exceeds the bandwidth. The formula of kernel smoothing is
shown below, where K( ) is the kernel function, h is the bandwidth, (Xi, Yi) is the
location of observation i, and Zi is the environmental measures of observation i.
$$\hat{Z}(x, y) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}, \frac{y - Y_i}{h}\right) Z_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}, \frac{y - Y_i}{h}\right)}$$
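A straightforward NumPy rendering of this estimator is sketched below. Because every target location is computed independently, the same structure maps naturally onto GPU threads, which is what makes the parallel implementation described earlier effective. The quartic kernel is one common distance-decay choice and is our own assumption here.

```python
import numpy as np

def kernel_smooth(obs_xy, obs_z, grid_xy, h):
    """Kernel smoothing of point observations onto target locations,
    following the formula above. obs_xy: (n, 2) observation coordinates;
    obs_z: (n,) measured values; grid_xy: (m, 2) target locations;
    h: bandwidth in the same distance units."""
    # Pairwise distances between every target location and observation.
    d = np.linalg.norm(grid_xy[:, None, :] - obs_xy[None, :, :], axis=2)
    # Distance-decay kernel: quartic inside the bandwidth, zero outside.
    u = np.clip(d / h, 0.0, 1.0)
    w = (1.0 - u ** 2) ** 2
    num = (w * obs_z[None, :]).sum(axis=1)
    den = w.sum(axis=1)
    # Locations with no observation inside the bandwidth stay undefined.
    return np.where(den > 0, num / np.maximum(den, 1e-12), np.nan)
```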
A noise mapping user case is investigated by collecting sound pressure data using a mobile application. The application utilizes the microphone of a mobile device to measure sound pressure, with the noise level calculated in decibels (dB) using the following equation (Bies and Hansen 2009; Maisonneuve et al. 2009):
9 http://www.nvidia.com/object/cuda_home_new.html
$$L_p = 10 \log_{10}\left(\frac{p_{rms}^2}{p_{ref}^2}\right) = 20 \log_{10}\left(\frac{p_{rms}}{p_{ref}}\right) \text{ dB}$$
where p_ref is the reference sound pressure, with a value of 0.0002 dynes/cm², and p_rms is the measured (root-mean-square) sound pressure. According to the World Health
Organization Night Noise Guidelines (NNGL) for Europe,10 the annual average
noise level of 40 dB is considered as equivalent to the lowest observed adverse
effect level (LOAEL) for night noise, whereas a noise level above 55 dB can
become a major public health concern and over 70 dB can cause severe health
problems. This calculated value is also calibrated by users according to physical
environment conditions and the type of mobile device.
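Worked through in code, the formula is a one-liner; the 0.0002 dynes/cm² reference value is taken from the text, and the example reading is invented.

import math

P_REF = 0.0002                      # reference sound pressure (dynes/cm^2)

def sound_pressure_level(p_rms):
    # L_p = 20 * log10(p_rms / p_ref), in decibels
    return 20.0 * math.log10(p_rms / P_REF)

print(sound_pressure_level(0.02))   # 40.0 dB, the WHO NNGL night-time LOAEL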
The mobile application assigns a pair of geographic coordinates (in the format of
latitude and longitude) to each measured value together with a timestamp. The
update interval for each recording is set to 5 s. The recorded measure-
ments are saved directly on the mobile device and we let users decide when to
upload their data to the server, whether immediately after taking the measurements
or at a later time. An example of the data format of the measurements is shown in
Fig. 3. Note that the measurements of other sensors on a mobile device can be
included. Given the diversity of sensors on different devices, we use a flexible data
management approach based on MongoDB.
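A record for a single reading might look like the following JSON document; the field names here are illustrative only (the actual format is shown in Fig. 3), and the coordinates are a point near the study area.

{
  "device_id": "device-042",
  "timestamp": "2014-04-18T14:37:05Z",
  "location": {"lat": 40.1106, "lon": -88.2284},
  "noise_db": 67.4,
  "battery": 0.82
}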
In this use case, we chose the campus of the University of Illinois at Urbana-Champaign and its surroundings as the study area and asked the participants to go around the campus collecting noise level measurements. The user
interface of the mobile application is shown in Fig. 4, where users have the options
to record, upload and interact with noise maps. The mobile application is
implemented as a background service on the device and therefore participants are
free to engage in other activities.
From a generated noise map, we can identify the spots at which the noise level exceeds these guideline ranges. Figure 5 visualizes the noise exposure of an individual participant along their trajectory. At the current stage, we have not quantitatively estimated accumulated noise exposure; this will be taken into account in our future work. Figure 6 shows the noise map of a specified hour using a 50-m kernel bandwidth, generated from the measurements uploaded by all participants during that period. From the visualized results, we can identify the spots where noise pollution occurs (shown in red) within the specified hour. A feature to be evaluated for providing in-depth information about the causes of such noise pollution is to allow users to append descriptive text while they carry out monitoring on their mobile devices (Maisonneuve et al. 2009). Figure 7 is the noise map of the same hour but using a 100-m kernel bandwidth, which demonstrates the effect of choosing a different sound decay distance, since the bandwidth value can be changed in the framework.
10 http://www.euro.who.int/data/assets/pdf_file/0017/43316/E92845.pdf
Fig. 6 The generated noise map using a 50-m kernel bandwidth during a specified hour
Fig. 7 The generated noise map using a 100-m kernel bandwidth during the same specified hour
Acknowledgements This material is based in part upon work supported by the U.S. National
Science Foundation under grant numbers: 0846655, 1047916, and 1354329. Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the National Science Foundation. The authors are
grateful for insightful comments on earlier drafts from Yan Liu and Anand Padmanabhan.
References
Bies DA, Hansen CH (2009) Engineering noise control: theory and practice. CRC Press, Boca
Raton, FL
Bryant RE (2009) Data-intensive scalable computing: harnessing the power of cloud computing. CMU technical report. Retrieved from http://www.cs.cmu.edu/~bryant/pubdir/disc-overview09.pdf
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun
ACM 51(1):107–113
Directive E (2002) Directive 2002/49/EC of the European Parliament and the Council of 25 June 2002 relating to the assessment and management of environmental noise. Off J Eur Communities 189(12):12–26
Ghosh S, Raju PP, Saibaba J, Varadan G (2012) CyberGIS and crowdsourcing: a new approach in e-governance. In: Geospatial Communication Network. Retrieved from http://www.geospatialworld.net/article/cybergis-and-crowdsourcing-a-new-approach-in-e-governance/
Goodchild MF (2007) Citizens as sensors: the world of volunteered geography. GeoJournal 69
(4):211–221
The Potential for Big Data to Improve Neighborhood-Level Census Data

Seth E. Spielman
Abstract The promise of "big data" for those who study cities is that it offers new ways of understanding urban environments and processes. Big data exists within broader national data economies, and these data economies have changed in ways that are both poorly understood by the average data consumer and of significant consequence for the application of data to urban problems. For example, since 2010 high-resolution demographic and economic data from the United States Census Bureau have declined by some key measures of data quality. For some policy-relevant variables, like the number of children under 5 in poverty, the estimates are almost unusable. Of the 56,204 census tracts for which a childhood poverty estimate was available in the 2007–2011 American Community Survey (ACS), 40,941 (72.8 % of tracts) had a margin of error greater than the estimate. For example, the ACS indicates that Census Tract 196 in Brooklyn, NY has 169 children under 5 in poverty, with a margin of error of ±174 children, suggesting somewhere between 0 and 343 children in the area live in poverty. While big data is exciting and novel, basic questions about American cities are all but unanswerable in the current data economy. Here we highlight the potential for data fusion strategies, leveraging novel forms of big data and traditional federal surveys, to develop usable data that allow effective understanding of intra-urban demographic and economic patterns. This paper outlines the methods used to construct neighborhood-level census data and suggests key points of technical intervention where "big" data might be used to improve the quality of neighborhood-level statistics.
1 Introduction
The promise of "big data1" for those who study cities is that it offers new ways of understanding urban environments and their effect on human behavior. Big data lets one see urban dynamics at much higher spatial and temporal resolutions than more traditional sources of data, such as survey data collected by national statistical agencies. Some see the rise of big data as a revolutionary mode of understanding cities; this "revolution" holds particular promise for academics because, as argued by Kitchin (2014), revolutions in science are often preceded by revolutions in measurement. That is, big data could give rise to something even bigger, a new science of cities. Others, such as Greenfield (2013), argue that real urban problems cannot be solved by data and are deeply skeptical of the potential for information technologies to have meaningful impacts on urban life. Here, we aim to contextualize the enthusiasm about urban big data within broader national data economies, particularly focusing on the US case. This paper argues that changes to national data infrastructures, particularly in the US, have led to the decline of important sources of neighborhood-level demographic and economic data, and that these changes complicate planning and policymaking, even in a big data economy. We argue that in spite of the shortcomings of big data, such as uncertainty over who or what is being measured (or not measured), it is possible to leverage these new forms of data to improve traditional survey-based data from national statistical agencies.
The utility of data is contingent upon the context within which the data exist. For example, one might want to know the median income and the crime rate in an area; in isolation these data are far less useful than they are in combination. Knowing that a place is high-crime might help guide policy. However, policy might be much more effectively targeted if one knew the context within which crime was occurring. In a general sense we might refer to the context within which data exist as the "data economy." For those who work on urban problems, the data economy has changed in ways that are both poorly understood by the average data consumer and of consequence to the application of big data to urban problems. Traditional sources of information about cities in the US have recently changed in profound ways. We argue that these changes create potential, and problems, for the application of data to urban questions.
In particular, the data collected by the US Census Bureau have recently undergone a series of dramatic changes. Some of these changes are the result of the gradual accrual of broader social changes, and some have been abrupt, the result of changes to federal policy. The National Research Council (2013) documents a gradual long-term national trend of increases in the number of people who refuse to respond to public (and private) surveys. Geographic and demographic patterns in survey nonresponse compound this trend.
1 Defining big data is difficult; most existing definitions include some multiple of V's (see Laney 2001). All are satisfactory for our purposes here. We use the term to distinguish between census/survey data, which we see as "designed" measurement instruments, and big data, which we see as "accidental" measurement instruments.
2 We use the terms "fine" and "high" resolution to refer to census tract or smaller geographies; these data are commonly conceived of as "neighborhood-scale" data. We conceive of resolution in the spatial sense: higher/finer resolution means a smaller census tabulation unit. However, the geographic scale of high-resolution census units is a function of population density.
The decline in the quality of neighborhood scale data in the United States began
in 2010, the year the American Community Survey (ACS) replaced the long form of
the United States decennial census as the principal source of high-resolution
geographic information about the U.S. population. The ACS fundamentally
changed the way data about American communities are collected and produced.
The long form of the decennial census was a large-sample, low frequency national
survey; the ACS is a high-frequency survey, constantly measuring the American
population using small monthly samples. One of the primary challenges for users of
the ACS is that the margins of error are on average 75 % larger than those of the
corresponding 2000 long-form estimate (Alexander 2002; Starsinic 2005). This loss
in precision was justified by the increase in timeliness of ACS estimates, which are
released annually (compared to the once a decade long form). This tradeoff
prompted MacDonald (2006) to call the ACS a "warm" (current) but "fuzzy"
(imprecise) source of data. While there are clear advantages to working with
“fresh” data, the ACS margins of error are so large that for many variables at the
census tract and block group scales the estimates fail to meet even the loosest
standards of data quality.
Many of the problems of the American Community Survey are rooted in data limitations. That is, at critical stages in the creation of neighborhood-level estimates the Census Bureau lacks sufficient information and has to make assumptions and/or use data from a coarser level of aggregation (municipality or county). We argue that one of the major potential impacts of big data for the study of cities is the reduction of variance in more traditional forms of demographic and economic information. To support this claim, we describe the construction of the ACS in some detail, with the hope that these details illuminate the potential for big data to improve federal and/or state statistical programs.
Like the decennial long form before it, the ACS is a sample survey. Unlike
complete enumerations, sample surveys do not perfectly measure the characteristics
of the population—two samples from the same population will yield different
estimates. In the ACS, the margin of error for a given variable expresses a range
of values around the estimate within which the true value is expected to lie. The
margin of error reflects the variability that could be expected if the survey were
repeated with a different random sample of the same population. The statistic used
to describe the magnitude of this variability is referred to as standard error (SE).
Calculating standard errors for a complex survey like the ACS is not a trivial task;
the USCB uses a procedure called Successive Differences Replication to produce
variance estimates (Fay and Train 1995). The margins of error reported by the
USCB with the ACS estimates are simply 1.645 times the standard errors.
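The arithmetic behind the published figures is simple to reproduce. The sketch below (Python) converts a published margin of error back to a standard error and applies the root-sum-of-squares approximation commonly used for the MOE of a sum of independent ACS estimates; the tract values repeat the Brooklyn example from the abstract.

import math

def se_from_moe(moe, z=1.645):
    # published ACS margins of error are 1.645 times the standard error
    return moe / z

def moe_of_sum(moes):
    # root-sum-of-squares approximation for a sum of independent estimates
    return math.sqrt(sum(m ** 2 for m in moes))

est, moe = 169, 174                  # the Brooklyn tract from the abstract
print(se_from_moe(moe))              # ~105.8
print(max(0, est - moe), est + moe)  # interval: 0 to 343 children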
One easy way to understand the ACS margin of error is to consider the simple case, in which errors are purely a function of the random nature of the sampling procedure. Such sampling error has two main causes. The first is sample size: the larger the sample, the smaller the standard error; intuitively, more data about a population leads to less uncertainty about its true characteristics. The second main cause of sampling error is heterogeneity in the population being measured (Rao 2003). Consider two jars of U.S. coins: one contains U.S. pennies and the other contains a variety of coins from all over the world. If one randomly selected five coins from each jar, and used the average of these five to estimate the average value of the coins in each jar, then there would be more uncertainty about the average value in the jar that contained a diverse mixture of coins. If one took repeated random samples of five coins from each jar, the result would always be the same for the jar of pennies but would vary substantially in the diverse jar; this variation would create uncertainty about the true average value.3 In addition, a larger handful
of coins would reduce uncertainty about the value of coins in the jar. In the extreme
case of a 100 % sample the uncertainty around the average value would be zero.
What is important to realize is that in sample surveys the absolute number of
samples is much more important than the relative proportion of people sampled: a
5 % sample of an area with a large population will provide a much better estimate
than a 5 % sample of a small population. While the ACS is much more complicated
than pulling coins from a jar, this analogy helps to understand the standard error of
ACS estimates. Census Tracts (and other geographies) are like jars of coins. If a
tract is like the jar of pennies, then the estimates will be more precise, whereas if a
tract is like the jar of diverse coins or has a small population, then the estimate will
be less precise.
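A quick Monte Carlo rendering of the two-jar analogy makes the point concrete; the coin denominations and sample sizes below are invented for illustration.

import random

random.seed(1)
pennies = [0.01] * 1000                                   # homogeneous jar
coins = [0.01, 0.05, 0.10, 0.25, 1.00, 2.00]
mixed = [random.choice(coins) for _ in range(1000)]       # heterogeneous jar

def sample_means(jar, n=5, reps=10000):
    # repeatedly draw n coins without replacement and record the sample mean
    return [sum(random.sample(jar, n)) / n for _ in range(reps)]

for name, jar in [("pennies", pennies), ("mixed", mixed)]:
    means = sample_means(jar)
    mu = sum(means) / len(means)
    sd = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
    print(name, round(mu, 3), round(sd, 3))   # spread is 0 for the penny jar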
While the simple example is illustrative of important concepts, it overlooks the central challenge in conducting surveys: many people included in a sample will choose not to respond to the survey.
3 The Census Bureau generally is not actually estimating the "average" value; they are estimating the "total" value of coins in the jar. Repeatedly grabbing five coins and computing the average will, over many samples, give a very precise estimate of the average value, but it will give no information on the total value. To get the total value, you need a good estimate of the average AND a good estimate of the total number of coins in the jar. The loss of contemporaneous population controls caused by decoupling the ACS from the decennial enumeration means that the census does not have information about the number of coins in the jar. This is discussed in more detail later.
While a group's odds of being included in the ACS sample are proportional to its population size, different groups of people have different probabilities of responding. Only 65 % of the people contacted by the ACS actually complete the survey (in 2011, 2.13 million responses were collected from 3.27 million samples). Some groups are more likely to respond than others; this means that a response collected from a person in a hard-to-count group is worth more than a response from an easy-to-count group. Weighting each response controls for these differential response rates. In the ACS each completed survey is assigned a single weight through a complex procedure involving dozens of steps. The important point, as far as this paper is concerned, is that these weights are estimated, and uncertainty about the appropriate weight to give each response is an important source of uncertainty in the published data.
3 Sampling
Before 1940, the census was a complete enumeration; each and every housing
unit (HU) received the same questionnaire. By 1940 the census forms had become
a long, complicated set of demographic and economic questions. In response, the
questionnaire was split in 1940 into a short set of questions asked of 100 % of the
population and an additional “long form” administered to a subset of the popu-
lation. Originally, this long form was administered to a 5 % random sample, but
in later years it was sent to one HU in six (Anderson et al. 2011). Before 1940
any error in the data could be attributed to missing or double counting a
HU, to incorrect transcription of a respondent’s answer, or to intentional/
unintentional errors by the respondent. After 1940, however, the adoption of
statistical sampling introduced new sources of uncertainty for those questions
on the long form.
Up until 2010 the sample-based (long form) and complete enumeration (short form) components of the census were administered at the same time. In 2010 the ACS replaced the sample-based long form. The American Community Survey constantly measures the population; it does not co-occur with a complete census. The lack of concurrent complete-count population data from the short form is a key source of uncertainty in the ACS. Prior to the rise of the ACS, short-form population counts could serve as controls for long-form based estimates. The decoupling of the sample from the complete enumeration accounts for 15–25 % of the difference in margin of error between the ACS and the decennial long form (Navarro 2012). Population controls are essential to the ACS sample weighting process, but population controls are now only available for relatively large geographic areas such as municipalities and counties. This is a key data gap which, as discussed later, might be addressed with big data.
Prior to the advent of sampling, the complete-count census data could, in principle, be tabulated using any sort of geographic zone. Tract-based census data from the decennial census had become a cornerstone of social science and policymaking by the late twentieth century. However, users of the once-a-decade census were increasingly concerned about the timeliness of the data (Alexander 2002). A solution to this problem was developed by Leslie Kish, a statistician who developed the theory and methods for "rolling" surveys (Kish 1990).
Kish’s basic idea was that a population could be divided into a series of
non-overlapping annual or monthly groups called subframes. Each subframe
would then be enumerated or sampled on a rolling basis. If each subframe were
carefully constructed so as to be representative of the larger population, then the
annual estimates would also be representative, and eventually, the entire population
would be sampled. The strength of this rolling framework is its efficient use of
surveys. The decennial census long form had to sample at a rate appropriate to make
reasonable estimates for small geographic areas such as census tracts, which
contain on average 4000 people. Therefore, citywide data released for a munici-
pality of, say, one million people would be based on considerably more samples
than necessary. Spreading the samples over time lets larger areas receive reasonable
estimates annually, while smaller areas wait for more surveys to be collected. The
rolling sample therefore increases the frequency of data on larger areas. The
primary cost comes in the temporal blurring of data for smaller areas. The advent
of sampling made census data for small geographic areas less precise. Since there
are a finite number of samples in any geographic area, as tabulation zones become
smaller, sample sizes decline, making estimates more uncertain. The rise in uncertainty is greater for small populations; for instance, the effect of reducing a sample size from 200 to 100 is much greater than the effect of reducing a sample size from
20,000 to 10,000. The USCB counteracts this decline in sample size by pooling
surveys in a given area over multiple years, thus diluting the temporal resolution of
the estimates.
Rolling sampling is straightforward in the abstract. For example, suppose that there are K = 5 annual subframes, that the population in a tract is known (N = 1000), that the sampling rate is r = 1/6, and that the response rate is 100 %; then one would sample n = N/(K · (1/r)) people per year. Over a 5-year period 1/6 of the population would be sampled and each returned survey would represent w = (N/n)/K people, where w is the weight used to scale survey responses up to a population estimate. In this simple case, the weight assigned to each survey would be the same. For any individual attribute y, the tract-level estimate would be y_t = \sum_i w_i y_i (Eq. 1), a weighted summation of all i surveys collected in tract t. If the weights are further adjusted by ancillary population controls X, then the variance of the estimate is \sum_i w_i^2 \mathrm{VAR}[y_i | X] (Eq. 2; Fuller 2011, assuming independence). If the rolling sample consisting of long-form-type questions were administered simultaneously with a short-form census, then all the parameters in our simple example (N, K, X) would be known.
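The simple example can be worked through directly in code, using the parameters given in the text; the attribute responses at the end are invented.

K, N, r = 5, 1000, 1 / 6
n = N / (K * (1 / r))      # surveys collected per year: ~33.3
w = (N / n) / K            # weight per returned survey: 6.0
total = n * K              # ~166.7 surveys over 5 years = N * r
print(round(n, 1), w, round(total, 1))

# tract-level estimate for an attribute y (Eq. 1): y_t = sum(w_i * y_i)
ys = [1, 0, 1, 1]          # invented indicator responses
y_t = sum(w * y for y in ys)
print(y_t)                 # 18.0 people estimated to have the attribute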
However, in the ACS good population controls are not available for small areas (N and X are unknown) because, unlike the long form, the survey is not contemporaneous with the complete-enumeration decennial census. Thus the weight (w) for each response must be estimated, and this is an important source of uncertainty in the ACS.
5 Weighting
In the ACS each completed survey is assigned a weight (w) that quantifies the number of persons in the total population represented by a sampled household/individual. For example, a survey completed by an Asian male earning $45,000 per year and assigned a weight of 50 would, in the final tract-level estimates, represent 50 Asian men and $2.25 million in aggregate income. The lack of demographically detailed population controls and variations in response rates necessitate a complex method to estimate w. The construction of ACS weights is described in the ACS technical manual (which runs hundreds of pages; U.S. Census Bureau 2009a). Individually these steps make sense, but they are so numerous and technically complex that in the aggregate they make the ACS estimation process nearly impenetrable for even the most sophisticated data users. The cost of extensive tweaking of weights is more than just lack of transparency and complexity. Reducing bias by adjusting weights carries a cost: any procedure that increases the variability in the survey weights also increases the uncertainty in tract-level estimates (Kish 2002). Embedded in this process is a trade-off between estimate accuracy (bias) and precision (variance/margin of error); refining the survey weights reduces bias in the ACS, but it also increases the variance of the sample weights.
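One standard way to summarize this cost, not stated in the chapter but in the spirit of Kish's work, is the "effective" sample size implied by unequal weights, (Σw)²/Σw²; the sketch below compares equal and uneven weights.

import numpy as np

def kish_effective_n(w):
    # effective sample size implied by unequal weights: (sum w)^2 / sum w^2
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

print(kish_effective_n(np.ones(100)))                                  # 100.0
print(kish_effective_n(np.random.default_rng(0).uniform(1, 10, 100)))  # < 100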
Without traditional survey data from national statistical agencies like the USCB, it is difficult to contextualize big data; it is hard to know who is (and who is not) represented in big data. It is difficult to know if there are demographic, geographic, and/or economic biases in the coverage of big data without traditional census data as a baseline. Ironically, as this baseline data declines in quality, many of the populations most in need of urban services are least well served by the traditional census data and quite possibly by big data as well—consider the example of young children in poverty discussed in the introduction.
In the preceding sections we identified several key data gaps and methodological decisions that might be addressed with big data:
1. Sampling is constrained by a lack of population data at high geographic and demographic resolution.
2. Small area geographies are not "designed," and this leads to degradation in the quality of estimates and the utility of the published data.
3. Weights are complex and difficult to accurately estimate without additional data.
In this section we outline how big data might be used to address these issues. This section is by no means exhaustive; the aim is more to draw attention to the potential for new forms of data to mitigate emerging problems with neighborhood statistics. It is also important to note that, for reasons discussed in the conclusion, this section is largely speculative; very few of the ideas we propose have seen implementation.
So far this paper has emphasized the mechanics of the construction of the ACS—sampling, the provision of small area estimates, the provision of annual estimates, and the estimation of survey weights. The prior discussion had a fair amount of
technical detail because such detail is necessary in order to understand how novel
forms of “big” data might be integrated into the production process. Directly
integrating big data into the production of estimates is not the only way to use
new forms of data concurrently with traditional national statistics, but in this paper
the emphasis is on such an approach.
It should be apparent that the data gaps and methodological choices we have
identified thus far are intertwined. For example, the use of sampling necessitates the
estimation of survey weights which are complicated to estimate when very little is
known about the target population in the areas under investigation. Spatial and
temporal resolution are related because the reliability of the estimate depends on the
number of surveys, which accrue over time, and the size (population) and compo-
sition of the area under investigation.
The lack of detailed small-area population controls makes it very difficult to estimate the weight for each survey. Since the US Census Bureau does not know how many low-income Caucasian males live in each census tract, it is difficult to know whether the number of surveys returned by low-income Caucasian males is higher or lower than expected—this affects the weight assigned to a response. For example, imagine a hypothetical census tract with 2000 housing units and a population of 4000 people. If 10 % of the population is low-income white males and this tract was sampled at a 5 % rate, one would expect 10 % of the completed surveys to be filled in by low-income white males. However, if this group is less likely than others to respond, perhaps only 2 % of the completed surveys would be completed by low-income white males. If the number of low-income white males was known in advance, one could "up-weight" these responses to make sure that in the final data low-income white males represented 10 % of the population. However, the Census Bureau has no idea how many low-income white males are in each census tract. This is where big data might help.
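The hypothetical tract can be worked through in code; all numbers repeat the example above, and the benchmark share is exactly the quantity a big data source would need to supply.

population = 4000
group_share_true = 0.10        # known only if a benchmark (e.g., big data) exists
surveys = int(population * 0.05)               # 200 completed surveys
group_observed = int(surveys * 0.02)           # 4 responses instead of 20

base_weight = population / surveys             # 20 people per response
adjustment = (group_share_true * surveys) / group_observed   # 5.0
group_weight = base_weight * adjustment        # 100 people per group response
print(base_weight, adjustment, group_weight)   # 4 * 100 = 400 = 10 % of 4000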
If, for example, the number of low-income white males could be estimated by
using credit reporting data, social media profiles, or administrative records from
other government agencies, then a lot of the guesswork in deciding how to weight
survey responses could be eliminated. It’s important to realize that these forms of
“big” data might not be of the highest quality. However, they could be used to
establish meaningful benchmarks for sub-populations making simple comparisons
of “big” and traditional data possible. While it would be difficult to say which data
was “correct” it is reasonable to suggest that large discrepancies would warrant
closer inspection and would highlight key differences in the coverage of the various
data sets. These coverage differences are not especially well understood at the time
of writing.
A more sophisticated strategy would be to employ what are called "model-assisted estimation" strategies (see Särndal 1992). Model-assisted estimation is a set of strategies for using ancillary data and regression models to estimate survey weights. Currently, the ACS uses a model-assisted strategy called the "Generalized Regression Estimator" (GREG). In the ACS, GREG takes advantage of person-level administrative data on the age, race, and gender of residents from auxiliary sources such as the Social Security Administration, the Internal Revenue Service, and previous decennial census tabulations. The procedure builds two parallel datasets for each census tract: one using the administrative data on all people in the tract, and the second using administrative data for only the surveyed housing units. The second dataset can be viewed, and tested, as an estimate of the demographic attributes of the first—e.g., proportions of males aged 30–44, non-Hispanic blacks, etc. A weighted least squares regression is then run on the second dataset, in which the dependent variable is weighted HU counts and the independent variables are the various weighted attribute counts.
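A stylized sketch of the calibration idea underlying GREG: adjust design weights so that weighted sample totals of auxiliary variables match known population totals (linear calibration). The data and benchmark totals below are invented, and the actual ACS weighting involves many additional steps.

import numpy as np

rng = np.random.default_rng(42)
n = 200
# auxiliary variables for each sampled unit: an intercept and a 0/1 indicator
X = np.column_stack([np.ones(n), rng.integers(0, 2, n)]).astype(float)
d = np.full(n, 20.0)             # initial design weights (population ~4000)
T = np.array([4000.0, 1900.0])   # known population totals: persons, indicator

# linear calibration: w = d * (1 + X @ lam) chosen so that X.T @ w == T
D = np.diag(d)
lam = np.linalg.solve(X.T @ D @ X, T - X.T @ d)
w = d * (1 + X @ lam)
print(X.T @ w)                   # recovers T up to floating point error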
The strength of a model-assisted estimation procedure depends entirely on the quality of the regression. A well-fit regression should reduce overall uncertainty in the final ACS estimates by reducing the variance of the weights, while a poorly fit regression can actually increase the margin of error. The data used in model-assisted estimation in the ACS are terrible for their intended purpose; that is, age, sex, and race are only loosely correlated with many of the economic and demographic characteristics of most interest to urban planners and policy makers. In spite of these weaknesses, age, sex, and race data are used because they are available to the USCB from other federal agencies; more sensitive data, like income, are not incorporated into the estimates.
However, data on homeownership, home values, spending patterns, employment, education, and many other attributes may be obtainable through big data sets, and these could be used to improve the quality of estimates through model-assisted estimation. For example, housing data from cadastral records and home sales could be (spatially) incorporated into the ACS weighting strategy. The exact value of each house is unknown, so home values are unusable as hard benchmarks. But it is possible to approximate the value of each house based upon location, characteristics, and nearby sales. Even if it were not possible to directly match survey respondents to records in other datasets, it might be possible to geospatially impute such characteristics. For example, recent nearby home sales might be used to estimate the value of a respondent's home. This kind of approximation is used to great effect by the mortgage industry and by local governments for property tax assessments. Since these models are approximations, the data may enter the weighting phase as "soft" benchmarks (i.e., implemented as mixed-effects models). It is not appropriate for the weights to exactly duplicate the estimated home value, but it is appropriate for the weights to approximate it. In a related vein, Porter et al. (2014) use the prevalence of Spanish-language Google queries to improve census estimates of the Hispanic population. Carefully chosen controls have the potential to dramatically reduce the bias and margin of error in ACS estimates for certain variables. The estimates most likely to be impacted are socioeconomic variables, which are poorly correlated with the currently available demographic benchmarks and thus have a disproportionately large margin of error.
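A toy version of this geospatial imputation, averaging the k nearest recent sales; the locations and prices are invented.

import numpy as np

sales_xy = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 0.9], [5.0, 5.0]])
sales_price = np.array([210_000, 195_000, 240_000, 480_000])

def impute_value(xy, k=3):
    # approximate a home's value as the mean of its k nearest recent sales
    dist = np.linalg.norm(sales_xy - xy, axis=1)
    nearest = np.argsort(dist)[:k]
    return sales_price[nearest].mean()

print(impute_value(np.array([0.3, 0.4])))   # mean of the three nearby sales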
A second mechanism for using big data to improve estimates is through zone design. Census geographies are designed to be stable over time; local committees designed them at some point in the past (often 30 years ago), and they have evolved only through splits and merges with other census tracts. Splits and merges can only occur when the tract population crosses some critical threshold. The size and shape of census geographies fundamentally affect the quality of estimates. Census tracts with larger populations generally have more surveys supporting their estimates and thus higher quality data. However, as geographies grow in size there is potential to lose information on intra-urban variation. Information loss, however, does not necessarily occur as a result of changes in zone size. Consider two adjacent census tracts that are very similar to each other in terms of ethnic composition, housing stock, and economic characteristics. The cost of combining these two census tracts into a single area is very small. That is, on a thematic map these two adjacent areas would likely appear as a single unit (they would likely have the same value and hence the same legend color). Combining similar places boosts the number of completed surveys and thus reduces the margin of error. The challenge is how one tells whether adjacent places are similar (or not) when the margins of error on key variables are very large. Again, if big data provides a reasonable approximation of the characteristics of places at high spatial resolution, it may be possible to combine lower-level census geographies into units large enough to provide high quality estimates. For example, Spielman and Folch (2015) develop an algorithm to combine existing lower-level census geographies, like tracts and block groups, into larger geographies while producing new estimates for census variables, such that the new estimates leverage the larger population size and have smaller margins of error. They demonstrate that even for variables like childhood poverty, it is possible to produce usable estimates for the city of Chicago by intelligently combining census geographies into new "regions". This strategy results in some loss of geographic detail, but the loss is minimized by ensuring that only similar and proximal geographies are merged together.
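A sketch of the core arithmetic of such merging (the Spielman and Folch algorithm itself is more sophisticated): sum the estimates of two similar adjacent tracts and combine their margins of error by root-sum-of-squares, which shrinks the relative margin of error. The second tract's values are invented.

import math

def merge(est1, moe1, est2, moe2):
    # sum the estimates; combine margins of error by root-sum-of-squares
    return est1 + est2, math.sqrt(moe1 ** 2 + moe2 ** 2)

est, moe = merge(169, 174, 150, 160)    # two similar adjacent tracts
print(est, round(moe, 1))               # 319, 236.4
print(round(moe / est, 2))              # 0.74, down from 174/169 = 1.03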
7 Conclusion
Little (2012) argues that a fundamental philosophical shift is necessary within both federal statistical agencies and among data users: "we should see the traditional survey as one of an array of data sources, including administrative records, and other information gleaned from cyberspace. Tying this information together to yield cost-effective and reliable estimates..." However, Little also notes that for the Census "combining information from a variety of data sources is attractive in principle, but difficult in practice" (Little 2012, p. 309). By understanding the causes of uncertainty in the ACS, the implications of Little's statement become clear: there is enormous potential to mash up multiple forms of information to provide a more detailed picture of US cities.
However, there are major barriers to incorporating non-traditional forms of data into official neighborhood statistics. The reasons for this range from the organizational to the technical. Institutionally, there is resistance to the adoption of non-standard forms of data in public statistics. This resistance stems from the fact that such data sources are outside the control of the agencies producing the estimates; relying on data that may be subject to changes in quality and availability poses a problem for the tight production schedules faced by national statistical agencies. Technically, it is often unclear how to best leverage such information; while we have outlined some possibilities, they are difficult to test given the sensitive and protected nature of census/survey data itself. Very few people have access to this protected data; it is protected by statute and thus must be handled in very cumbersome secure computing environments. This makes it difficult to "prove" or "test" concepts. In the US and UK there are some efforts underway to publish synthetic data to allow research on/with highly detailed microdata without releasing the data itself. The barriers to innovative data fusion are unlikely to be resolved until clear and compelling examples are developed that push national statistical agencies away from their current practices.
To summarize, the growing enthusiasm over big data makes it easy to disregard the decline of traditional forms of public statistics. As these data decline in quality it becomes difficult to plan, provide services, or understand changes in cities. The enthusiasm over big data should be tempered by a holistic view of the current data economy. While it is true that many new data systems have come online in the last 10 years, it is also true that many critical public data sources are withering. Is big data a substitute for the carefully constructed, nationally representative, high-resolution census data that many practicing planners and policymakers rely upon? I think not; and while federal budgets are unlikely to change enough to improve the quality of federal statistical programs, the use of new forms of data to improve old forms of data is a promising avenue for investigation.
References
Alexander CH (2002) Still rolling: Leslie Kish’s “rolling samples” and the American Community
Survey. Surv Methodol 28(1):35–42
Anderson MJ, Citro C, Salvo JJ (2011) Encyclopedia of the US Census: from the Constitution to
the American Community Survey. CQ Press, Washington, DC
Fay RE, Train GF (1995) Aspects of survey and model-based postcensal estimation of income and
poverty characteristics for states and counties. In Proceedings of the Government Statistics
Section, American Statistical Association, pp 154–159
Fuller WA (2011) Sampling statistics. Wiley, Hoboken, NJ
Greenfield A (2013) Against the smart city (The city is here for you to use). Kindle Edition, 152 pages
National Research Council (2013) Nonresponse in social science surveys: a research agenda. In:
Tourangeau R, Plewes TJ (eds) Panel on a research agenda for the future of social science data
collection, Committee on National Statistics. Division of Behavioral and Social Sciences and
Education. The National Academies Press, Washington, DC
Kish L (1990) Rolling samples and censuses. Surv Methodol 16(1):63–79
Kish L (2002) Combining multipopulation statistics. J Stat Plan Inference 102(1):109–118
Kitchin R (2014) Big Data, new epistemologies and paradigm shifts. Big Data Soc 1(1). doi:10.1177/2053951714528481
Little RJ (2012) Calibrated Bayes: an alternative inferential paradigm for official statistics. J Off
Stat 28(3):309–372
MacDonald H (2006) The American community survey: warmer (more current), but fuzzier (less
precise) than the decennial census. J Am Plan Assoc 72(4):491–503
Navarro F (2012) An introduction to ACS statistical methods and lessons learned. Measuring
people in place conference, Boulder, CO. http://www.colorado.edu/ibs/cupc/workshops/mea
suring_people_in_place/themes/theme1/asiala.pdf. Accessed 30 Dec 2012
Porter AT, Holan SH, Wikle CK, Cressie N (2014) Spatial Fay-Herriot models for small area
estimation with functional covariates. Spat Stat 10:27–42
Rao JNK (2003) Small area estimation, vol 327. Wiley-Interscience, New York
Särndal C-E (1992) Model assisted survey sampling. Springer Science & Business Media,
New York
Spielman SE, Folch DC (2015) Reducing uncertainty in the American Community Survey through
data-driven regionalization. PLoS One 10(2):e0115626
Starsinic M (2005) American Community Survey: improving reliability for small area estimates.
In Proceedings of the 2005 Joint Statistical Meetings on CD-ROM, pp 3592–3599
Starsinic M, Tersine A (2007) Analysis of variance estimates from American Community Survey
multiyear estimates. In: Proceedings of the section on survey research methods. American
Statistical Association, Alexandria, VA, pp 3011–3017
U.S. Census Bureau (2009a) Design and methodology. American Community Survey.
U.S. Government Printing Office, Washington, DC
Big Data and Survey Research: Supplement or Substitute?

T.P. Johnson and T.W. Smith
Abstract The increasing availability of organic Big Data has prompted questions
regarding its usefulness as an auxiliary data source that can enhance the value of
design-based survey data, or possibly serve as a replacement for it. Big Data’s
potential value as a substitute for survey data is largely driven by recognition of the
potential cost savings associated with a transition from reliance on expensive and
often slow-to-complete survey data collection to reliance on far less-costly and
readily available Big Data sources. There may be, of course, serious methodolog-
ical costs of doing so. We review and compare the advantages and disadvantages of
survey-based vs. Big Data-based methodologies, concluding that each data source
has unique qualities and that future efforts to find ways of integrating data obtained
from varying sources, including Big Data and survey research, are most likely to be
fruitful.
1 Introduction
As response rates and survey participation continue to decline, and as costs of data
collection continue to grow, researchers are increasingly looking for alternatives to
traditional survey research methods for the collection of social science information.
One approach has involved modifying scientific survey research methods through
the abandonment of probability sampling techniques in favor of less expensive
non-probability sampling methodologies (cf. Cohn 2014). This strategy has
become popular enough that the American Association for Public Opinion
Research (AAPOR) recently felt it necessary to appoint a Task Force to investigate
the issue and release a formal report (Baker et al. 2013). Others have explored the
usefulness of supplementing, or replacing completely, surveys with information
captured efficiently and inexpensively via “Big Data” electronic information sys-
tems. In this paper, we explore the advantages and disadvantages of using survey
data versus Big Data for purposes of social monitoring and address the degree to
which Big Data can become a supplement to survey research or a complete
alternative or replacement for it.
Survey research originally evolved out of social and political needs for better understandings of human populations and social conditions (Converse 1987). Its genesis considerably predates the electronic era, reaching back to a time when there were few alternative sources of systematically collected information. Over the past 80 years, survey research has grown and diversified, and complex modern societies have come to increasingly rely on survey statistics for a variety of public and private purposes, including public administration and urban planning, consumer and market research, and academic investigations, to name a few. In contrast, Big Data became possible only recently with the advent of reliable, high-speed and relatively inexpensive electronic systems capable of prospectively capturing vast amounts of seemingly mundane process information. In a very short period of time, Big Data has demonstrated its potential value as an alternative method of social analysis (Goel et al. 2010; Mayer-Schönberger and Cukier 2013).
Before proceeding further, however, it is important to define what we mean exactly by survey research and "Big Data." Vogt (1999: 286) defines a survey as "a research design in which a sample of subjects is drawn from a population and studied (often interviewed) to make inferences about the population." Groves (2011) classifies surveys as forms of inquiry that are "design-based," as the specific methodology implemented for any given study is tailored (or designed) specifically to address research questions or problems of interest. In contrast, Webopedia (2014) defines Big Data as "a buzzword...used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques." Thakuriah et al. (2016) more carefully define Big Data as "structured and unstructured data generated naturally as a part of transactional, operational, planning and social activities, or the linkage of such data to purposefully designed data." In addition to these attributes, Couper (2013) observes that Big Data is produced at a rapid pace. In contrast to design-based data, Groves classifies Big Data as being organic in nature. Although similar to survey data in the systematic manner in which it is collected, organic data is not typically designed to address specific research questions. Rather, such data, referred to by Harford (2014) as "digital exhaust," is a by-product of automated processes that can be quantified and reused for other purposes. There are, of course, exceptions, such as the National Weather Service's measurements, which are design-based and otherwise fit the definition of Big Data.
Although they do not fit today's electronic-based definitions of Big Data, there are several examples of survey-based data sets that are uncharacteristically "big" by these standards.
While censuses and the Literary Digest examples share with today’s Big Data large
observation-to-variable ratios, they do not have Big Data’s electronic-based longi-
tudinal velocity, or rate of data accumulation. Rather, even Big Surveys are only
snapshots that represent at best a brief moment in time. Perhaps even more
importantly, the structures of these design-based data sources are carefully
constructed, unlike many sources of Big Data, which are known for their “messy”
nature (Couper 2013). Hence, there are several important differences between
design-based survey data, and the organic data sources that represent Big Data.
These include differences in volume, data structures, the velocity and chronicity
with which data are accumulated, and the intended purposes for which the data are
collected.
2.1 Volume
Big Data is big by definition. As Webopedia (2014) suggests, Big Data represents
“a massive volume of both structured and unstructured data that is so large that it’s
difficult to process using traditional database and software techniques.” Most of the
information generated in the history of our planet has probably been produced in the
past several years by automated Big Data collection systems. Google’s search
database alone collects literally billions of records on a daily basis and will
presumably continue to do so into the foreseeable future, accumulating an almost
impossibly large amount of organic information. Prewitt (2013: 229) refers to this
as a “digital data tsunami.” Survey data, by contrast, is many orders of magnitude
more modest in volume, and as mentioned earlier, is becoming more expensive and
difficult to collect.
2.3 Velocity
Data velocity is the speed with which data is accumulated. Big Data’s velocity, of
course, means that it can be acquired very quickly. Not so with surveys, which
require greater planning and effort, depending on mode. Well-done telephone
surveys can take weeks to complete, and well-done face-to-face and mail surveys
can require months of effort. Even online surveys require at least several days of
effort to complete all “field” work. Where government and business decisions must
be made quickly, Big Data may increasingly become the most viable option for
instant analysis. Indeed, many complex organizations now employ real-time “dash-
boards” that display up-to-the-minute sets of indicators of organizational function-
ing and activity to be used for this purpose, and one of the stated advantages of
Google’s Flu Index (to be discussed below) and similar efforts has been the almost
real-time speed with which the underlying data become available, vastly
outperforming surveys, as well as most other forms of data collection. Big Data is
collected so quickly, without much in the way of human intervention or mainte-
nance, that its velocity is sometimes compared to that of water emitting from a fire
hose. Survey research will continue to have difficulty competing in this arena.
Data chronicity refers to the time dimension of data. The chronicity of Big Data is much more continuous (or longitudinal) than that of most common cross-sectional surveys. With few exceptions, survey data are almost invariably collected over relatively short time intervals, typically over a matter of days, weeks or months. Some data collection systems for Big Data, in contrast, are now systematically collecting information on an ongoing, more or less permanent basis. There is an often incorrect assumption that the methods, coverage and content of Big Data remain static over time. In fact, Big Data systems are often quite changeable, and hence there is a danger that time series measurements may not always be comparable.
Design-based survey data are collected to address specific research questions. There are few examples of Big Data being intentionally constructed for research purposes; most such examples come from governmental agencies interested in taking, for example, continuous weather or other environmental or economic measurements. Most Big Data initiatives, rather, seem driven by commercial interests. Typically, researchers have a good deal of control over the survey data they collect, whereas most analysts of Big Data are dependent on the cooperative spirit and benevolence of the large corporate enterprises who collect and control the data that the researchers seek to analyze.
The main advantages of Big Data over survey data collection systems are costs,
timeliness and data completeness.
3.2 Timeliness
As discussed earlier, the velocity of Big Data greatly exceeds that of traditional
survey research. As such, it theoretically provides greater opportunities for the real-
time monitoring of social, economic and environmental processes. It has been
noted, however, that the processing of Big Data can in some cases be a lengthy
and time-consuming process (Japec et al. 2015). In addition, the original collectors of this information do not always grant real-time access.
Missing data at both the item and unit levels is a difficult problem in survey research, and the errors associated with it preoccupy many researchers. Big Data sets do not typically share this problem. Because most Big Data sets are based on varied data collection systems that do not rely directly on the participation of volunteers, and subjects are typically not even aware that they are contributing information to Big Data systems (on this point, see the section on Ethical Oversight below), non-observations due to failure to contact individuals, or to their unwillingness or inability to answer certain questions, or to participate at all, are not a problem. But Big Data is also not perfect: we would expect, for example, that monitors and other recording devices will occasionally malfunction, rendering data streams incomplete. As with surveys, the information missing from Big Data sets may also be biased in multiple ways.
Advantages of survey research data over Big Data include survey research's emphasis on theory, ease of analysis, error assessment, population coverage, ethical oversight, and transparency.
Some have argued that we are facing "the end of theory," as the advent of Big Data will make "the scientific method obsolete" (Anderson 2008). Although some of the survey research reported in the popular news media is descriptive only, much of the research conducted using survey methods is theory-driven. Survey data are routinely employed to test increasingly sophisticated and elaborate theories of the workings of our social world. Rather than allowing theory to direct their analyses, Big Data users tend to inductively search for patterns in the data, a behavior that repeats some earlier criticisms of empirical survey research and that left earlier generations of survey researchers vulnerable to accusations of using early high-speed computers for "fishing expeditions." Fung (2014) criticizes Big Data as being observational (without design) and lacking in the controls that design-based data typically collect and employ to rule out competing hypotheses.
The sheer size of many Big Data sets and their often unstructured nature make them
much more difficult to analyze, compared to typical survey data files. There are
numerous packaged data management and statistical analysis systems readily
available to accommodate virtually any survey data set. Big Data, in contrast,
typically requires large, difficult-to-access computer systems to process, and there
is a shortage of experts with the knowledge and experience to manage and analyze
Big Data (Ovide 2013). The time necessary to organize and clean Big Data sets may
offset, to some extent, the speed advantage with which Big Data is accumulated.
The error sources associated with survey data are reasonably well understood and
have been the subject of robust, ongoing research initiatives for many decades
(Groves et al. 2009; Schuman and Presser 1981; Sudman and Bradburn 1974). We
know that the Literary Digest poll was discredited by several error sources, includ-
ing coverage and nonresponse errors that have been well documented (Lusinchi
2012; Squire 1988). Errors associated with Big Data, however, are currently not
well understood and efforts to systematically investigate them are only now begin-
ning. Prewitt (2013: 230) observes that “there is no generally accepted understand-
ing of what constitutes errors when it is machines collecting data from other
machines.” Measurement error is an important example. Survey measures are
typically the subject of considerable research and refinement, with sophisticated
methodologies readily available for the design, testing, and assessment of measure-
ment instruments (Madans et al. 2011; Presser et al. 2004). Big Data shares many of
the challenges of secondary analyses of survey data in which specific indicators of
the construct(s) of interest may not always be available, challenging the analyst’s
creativity and cleverness to sometimes “weave a silk purse from a sow’s ear.”
Indeed, those analyzing Big Data must work with what is available to them and
there is seldom an opportunity to allow theory to drive the design of Big Data
collection systems. There is also concern that those who generate Big Data are
sometimes unwilling to share details of how their data are collected, to provide
definitions of the terms and measures being used, and to allow replication of
measurements and/or analyses based on their measurements.
One interesting example is the Google Flu Index. In 2009, a team from Google
Inc. and the Centers for Disease Control and Prevention (CDC) published a paper in
Nature that described the development of a methodology for examining billions of
Google search queries in order to monitor influenza in the general population
(Ginsberg et al. 2009).1 They described a non-theoretical procedure that involved
1 In 2008, a team of academic investigators and Yahoo! employees published a similar paper (Polgreen et al. 2008). That team, however, did not continue to report on this topic.
identifying those Google search queries that were most strongly correlated with
influenza data from the CDC; a large number of models were fit during the
development of the flu index. They reported the ability to accurately estimate
weekly influenza within each region of the U.S. and to do so with only a very
short time lag. Shortly thereafter, the flu index underestimated a non-seasonal
outbreak, and researchers speculated that changes in the public’s online search
behaviors, possibly due to seasonality, might be responsible (Cook et al. 2011).
Despite an ongoing effort to revise, update and improve the predictive power of
Google Flu Trends, it also greatly overestimated influenza at the height of the flu
season in 2011–2012 (Lazer et al. 2014a) and especially in 2012–2013 (Butler
2013). Lazer et al. (2014a) also demonstrated that Google Flu Trends had essen-
tially overestimated flu prevalence during 100 of 108 weeks (starting with August
2011). A preliminary analysis of the 2013–2014 season suggests some improve-
ment, although it is still overestimating flu prevalence (Lazer et al. 2014b).
Couper (2013) has made the interesting point that many users of social media,
such as Facebook, are to some extent motivated by impression management, and we
can thus not be certain of the extent to which information derived from these
sources accurately represents the individuals who post information there. Social
desirability bias would thus appear to be a threat to the quality of Big Data as well as
survey data. The fact that a significant proportion of all Facebook accounts, for
example, are believed to represent fictitious individuals is another cause for con-
cern. One estimate from 2012 suggests the number of fake Facebook accounts may
be as many as 83 million (Kelly 2012). Hence, concerns with data falsification also
extend to Big Data.
The Literary Digest Poll was big, but many believe it did not provide adequate
coverage of the population to which it was attempting to make inferences. Rather, it
likely over-represented upper income households with political orientations decid-
edly unrepresentative of the Depression Era U.S. citizenry. Clearly, volume could
not compensate for or fix coverage error. Big Data faces similar problems. For Big Data that captures online activities, it is important to remember that not everyone is connected to the internet, that not everyone on the web uses Google search, Twitter, or Facebook, and that those who do engage with these services do so in very diverse ways. The elderly, who are less likely to use the internet, are
particularly vulnerable to influenza, yet none of the Google Flu Index papers
referenced here address this issue. A related concern is the problem of selection
bias. As Couper (2013) has observed, Big Data tends to focus on society’s “haves”
and less so on the “have-nots.” In addition, in Big Data there can be a problem with
potential violations of the “one-person-one-vote” rule. As Smith (2013) has
commented, a large share of the activity on some social media platforms, such as Twitter and Facebook, is produced by relatively small concentrations of individuals, further calling into question the adequacy of their coverage. Indeed,
many Big Data systems have what Tufekci (2014) refers to as a denominator
problem “created by vague, unclear or unrepresentative sampling.” Others have
expressed concerns regarding the danger that Big Data “can be easily gamed”
(Marcus and Davis 2014). Campbell (1979) wrote nearly four decades ago about the corruptibility of social data as it becomes more relevant to resource allocation decisions; Marcus and Davis (2014) discuss several Big Data examples of this.
Design-based, “small data” surveys, in comparison, go to great lengths to ensure that their samples adequately cover the population of interest.
(See datasociety.net/initiatives/council-for-big-data-ethics-and-society/.) There is no consensus, however, regarding the ethical issues surrounding cases such as the Facebook experiments (Puschmann and Bozdag 2014).
4.6 Transparency
5 Supplement or Substitute?
Lazer and colleagues (2014a: 1203) have coined the term “Big Data Hubris” to
refer to “the often implicit assumption that big data are a substitute for, rather than a
supplement to, traditional data collection and analysis.” Others share this sentiment.
The British sociologists Savage and Burrows (2007: 890) have considered the
historicity of survey research and suggest that its “glory years” were between
1950 and 1990. Taking the long view, one has to wonder whether surveys might merely represent one of the first generations of social research methods, destined to be replaced by more efficient methodologies in an increasingly digital world. Just as the horse-drawn carriage was replaced by more advanced modes of transportation, surveys may in time give way to newer methods of social measurement.
References
Albergotti R, Dwoskin E (2014) Facebook study sparks soul-searching and ethical questions. Wall Street J. [Online] 30th June. http://online.wsj.com/articles/facebook-study-sparks-ethical-questions-1404172292. Accessed 3 Aug 2014
Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete. Wired
Magazine, 23rd June. http://archive.wired.com/science/discoveries/magazine/16-07/pb_the
ory. Accessed 29 Jul 2014
Baker R, Brick JM, Bates NA, Battaglia M, Couper MP, Dever JA, Gile KJ, Tourangeau R (2013) Report of the AAPOR task force on non-probability sampling. http://www.aapor.org/AM/Template.cfm?Section=Reports1&Template=/CM/ContentDisplay.cfm&ContentID=5963. Accessed 1 Aug 2014
Butler D (2013) When Google got flu wrong. Nature 494:155–156
Campbell DT (1979) Assessing the impact of planned social change. Eval Program Plann 2:67–90
Cohn N (2014) Explaining online panels and the 2014 midterms. New York Times. [Online] 27 July. http://www.nytimes.com/2014/07/28/upshot/explaining-online-panels-and-the-2014-midterms.html?_r=0. Accessed 1 Aug 2014
Converse JM (1987) Survey research in the United States: roots and emergence 1890-1960.
University of California Press, Berkeley
Cook S, Conrad C, Fowlkes AL, Mohebbi MH (2011) Assessing Google flu trends performance in
the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One 6:e23610
Couper MP (2013) Is the sky falling? New technology, changing media, and the future of surveys.
Surv Res Methods 7:145–156
Fung K (2014) Google flu trends’ failure shows good data > big data. Harvard Business Review/
HBR Blog Network. [Online] 25 March. http://blogs.hbr.org/2014/03/google-flu-trends-fail
ure-shows-good-data-big-data/. Accessed 17 Jun 2014
Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinsji MS, Brilliant L (2009) Detecting
influenza epidemics using search engine query data. Nature 457:1012–1014
Goel S, Hofman JM, Lahaie S, Pennock DM, Watts DJ (2010) Predicting consumer behavior with
web search. PNAS 107:17486–17490, http://www.pnas.org/content/107/41/17486. Accessed
15 Aug 2015
Groves RM (2011) Three eras of survey research. Public Opin Quart 75:861–871
Groves RM, Fowler FJ, Couper MP, Lepkowski JM, Singer E, Tourangeau R (2009) Survey
methodology, 2nd edn. Wiley, New York
Harford T (2014) Big data: are we making a big mistake? Financial Times. [Online] 26 March.
http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.
html#axzz39NlqxnU8. Accessed 17 Jun 2014
Humphreys L (1970) Tearoom trade: impersonal sex in public places. Duckworth, London
Japec L, Kreuter F, Berg M, Biemer P, Decker P, Lampe C, Lane J, O'Neil C, Usher A (2015) AAPOR report on big data. http://www.aapor.org/AAPORKentico/AAPOR_Main/media/Task-Force-Reports/BigDataTaskForceReport_FINAL_2_12_15.pdf. Accessed 27 Jul 2015
Kelly H (2012) 83 million Facebook accounts are fakes and dupes. CNN Tech. [Online] 2 August.
http://www.cnn.com/2012/08/02/tech/social-media/facebook-fake-accounts/. Accessed 3 Aug
2014
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional
contagion through social networks. PNAS 111:8788–8790, http://www.pnas.org/content/111/
24/8788.full. Accessed 2 Aug 2014
Kreuter F (2013) Improving surveys with paradata: analytic uses of process information. Wiley,
New York
Lazer D, Kennedy R, King G, Vespignani A (2014a) The parable of Google flu: traps in big data
analysis. Science 343:1203–1205
Lazer D, Kennedy R, King G, Vespignani A (2014b) Google flu trends still appears sick: an evaluation of the 2013-2014 flu season. [Online] http://ssrn.com/abstract=2408560. Accessed 26 Jul 2014
Lusinchi D (2012) “President” Landon and the 1936 Literary Digest Poll: were automobile and
telephone owners to blame? Soc Sci Hist 36:23–54
Madans J, Miller K, Maitland A, Willis G (2011) Question evaluation methods: contributing to the
science of data quality. Wiley, New York
Marcus G, Davis E (2014) Eight (no, nine!) problems with big data. New York Times. [Online] 7 April. http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html?_r=0. Accessed 7 Aug 2014
Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, New York
Milgram S (1974) Obedience to authority: an experimental view. Harper, New York
Ovide S (2013) Big data, big blunders. Wall Street Journal. [Online] 10 March. http://online.wsj.
com/news/articles/SB10001424127887324196204578298381588348290. Accessed: 30 Jul
2014
Polgreen PM, Chen Y, Pennock DM, Nelson FD (2008) Using internet searches for influenza
surveillance. Clin Infect Dis 47:1443–1448
Presser S, Rothgeb JM, Couper MP, Lessler JT, Martin E, Martin J, Singer E (2004) Methods for
testing and evaluating survey questionnaires. Wiley, New York
Prewitt K (2013) The 2012 Morris Hansen lecture: Thank you Morris, et al., for Westat, et al. J Off
Stat 29:223–231
Puschmann C, Bozdag E (2014) Staking out the unclear ethical terrain of online social experi-
ments. Internet Policy Review 3(4). http://policyreview.info/articles/analysis/staking-out-
unclear-ethical-terrain-online-social-experiments. Accessed: 5 Aug 2016
Rayport JF (2011) What big data needs: a code of ethnical practices. MIT Technology Review,
May 26. http://www.technologyreview.com/news/424104/what-big-data-needs-a-code-of-ethi
cal-practices/. Accessed 2 Aug 2014
Savage M, Burrows R (2007) The coming crisis of empirical sociology. Sociology 41:885–899
Schuman H, Presser S (1981) Questions and answers in attitude surveys. Wiley, New York
Smith TW (2011) The report of the international workshop on using multi-level data from sample
frames, auxiliary databases, paradata and related sources to detect and adjust for nonresponse
bias in surveys. Int J Public Opin Res 23:389–402
Smith TW (2013) Survey-research paradigms old and new. Int J Public Opin Res 25:218–229
Smith TW, Kim J (2014) The multi-level, multi-source (ML-MS) approach to improving survey
research. GSS Methodological Report 121. NORC at the University of Chicago, Chicago
Squire P (1988) Why the 1936 Literary Digest poll failed. Public Opin Quart 52:125–133
Sudman S, Bradburn NM (1974) Response effects in surveys: a review and synthesis. Aldine
Press, Chicago
Thakuriah P, Tilahun N, Zellner M (2016) Big data and urban informatics: innovations and
challenges to urban planning and knowledge discovery. In: Thakuriah P, Tilahun N, Zellner
M (eds) Seeing cities through big data: research methods and applications in urban informatics.
Springer, New York
Tufekci Z (2014) Big questions for social media big data: representativeness, validity and other
methodological pitfalls. In ICWSM’14: Proceedings of the 8th international AAAI conference
on weblogs and social media, forthcoming. http://arxiv.org/ftp/arxiv/papers/1403/1403.7400.
pdf. Accessed 26 Jul 2014
Vogt WP (1999) Dictionary of statistics & methodology, 2nd edn. Sage, Thousand Oaks, CA
Webopedia (2014) Big data. http://www.webopedia.com/TERM/B/big_data.html. Accessed
29 Jul 2014
Big Spatio-Temporal Network Data Analytics
for Smart Cities: Research Needs
1 Introduction
Smart cities are characterized by a variety of urban infrastructure networks that are either physically embedded in space (e.g., road, water, and sewage-disposal networks) or use space to increase their reach (e.g., transmission towers and repeaters for electricity and telecommunication networks).
Increasingly, a large number of urban sensors in major cities (Boyle et al. 2013)
have started to produce a variety of datasets representing both historic and evolving
characteristics of these smart-city networks. Each of these datasets records a certain property or phenomenon of the smart-city network, spread over space and time. A collection of such datasets is referred to as Big Spatio-Temporal Network (BSTN) data. On the transportation network alone we have emerging datasets such as tempo-
rally detailed (TD) roadmaps (Shekhar et al. 2012) that provide travel-time for
every road-segment at 5 min granularity, traffic signal timings and coordination
(Liu and Hu 2013), GPS tracks (Shekhar et al. 2012; Yuan et al. 2010), vehicle
measurements (Ali et al. 2015) and traffic information from vehicle-to-vehicle
communications (Lee et al. 2009; Wolfson and Xu 2014). Other sample BSTN datasets include meter readings from water (Yang et al. 2011) and energy distribution (electricity and gas) networks (Boyle et al. 2013), and traffic information from communication networks (internet and telecom). To maintain coherence of the
paper, we will focus only on BSTN data generated from transportation networks.
The goal of this paper is to highlight some of the open computer science research
questions for a potential realization of analytics on BSTN data.
Realization of BSTN analytics is important due to its value-addition potential in several smart-city application use cases, e.g., urban navigation and transportation infrastructure management and planning. For instance, BSTN data can significantly extend the traditional routing query "what is the shortest route between the University of Minnesota and the airport?" through novel preference metrics such as "which route has the least amount of waiting at traffic signals?", "which route typically produces the least amount of greenhouse gas emissions?", or "which route do commuters typically prefer?" BSTN data can also allow for exploratory analysis through queries like "which parts of the transportation network typically tend to have more greenhouse gas emissions than others?"
This paper categorizes BSTN data into the following two types: (a) vehicle measurement big data and (b) travel-time big data. We define vehicle measurement big data (VMBD) to contain only data coming from sensors inside a vehicle. VMBD allows us to study how vehicles typically perform in a real-world transportation network. We define travel-time big data as containing information on typical travel-times and traffic signal delays observed by the traffic. Note that this segregation is made for presentation purposes only; both VMBD and travel-time big data have spatio-temporal network semantics and can be referred to as BSTN data. Also, we take a broad definition of the term "data analytics", which includes both knowledge discovery and query processing.
This chapter is organized as follows: Section "Vehicle Measurement Big Data (VMBD)" describes VMBD in detail and presents some open research questions for realizing statistically significant pattern mining on it. Section "Travel-Time Big Data" presents travel-time big data and discusses some open research questions which could pave the way towards scalable query processing on this kind of big data. We conclude in section "Conclusion".
2 Vehicle Measurement Big Data (VMBD)

Rich instrumentation (e.g., GPS receivers and engine sensors) in modern fleet vehicles allows us to periodically measure vehicle sub-system properties (Kargupta et al. 2006, 2010; Ali et al. 2015). These datasets, which we refer to as vehicle measurement big data (VMBD), contain a collection of trips on a transportation network. Here, each trip is a time-series of attributes such as vehicle
location, fuel consumption, vehicle speed, odometer values, engine speed in revo-
lutions per minute (RPM), engine load, and emissions of greenhouse gases (e.g.,
CO2 and NOx). Figure 1 illustrates a sample VMBD from a trip using a plot
(Fig. 1a) and tabular representations (Fig. 1b). The geographic route associated
with the trip is shown as a map in the lower part of Fig. 1a, where color indicates the
value of a vehicle measurement, e.g., NOx emission.
Computationally, VMBD may be modeled as a spatio-temporal network (STN)
(George and Shekhar 2008; George et al. 2007), which is a generalization of
location-aware adjacency graph representation for roadmaps (Fig. 2), where road
intersections are modeled as vertices and the road segments connecting adjacent
intersections are represented as edges. For example, the intersection of SE 5th Ave
and SE University Ave is modeled as node N1 and the segment between SE
University Ave and SE 4th Street is represented by the edge N1-N4 with the
direction information on the segment. Note that location-aware adjacency graphs are richer than the graphs of classical graph theory (Trudeau 2013; Agnarsson and Greenlaw 2006; Ahuja et al. 1993), since their vertices and edges are associated with their geographic locations (e.g., the Map-matched edge-id and Location columns in Fig. 1b) to facilitate modeling spatial relationships (e.g., left-turn, right-turn) that are not
explicitly modeled via edges. Location-aware adjacency graph representations are
generalized to spatio-temporal networks (STN) representations (Fig. 2b, c), where
nodes or edges or their properties vary over time. Figure 2b shows the snapshot model of STNs across four different time points (t = 1, 2, 3 and 4) to represent the time-dependence of edge weights. Figure 2c shows a time-expanded graph (TEG)
Fig. 1 Vehicle measurement data and its tabular representation. (a) Vehicle measurement data
with route map. (b) Tabular representation
Fig. 2 Spatio-temporal networks (STN): snapshot and time-expanded models. (a) Example roadmap and its adjacency representation. (b) Snapshot model. (c) Time-expanded graph
(Köhler et al. 2002) representation, which stitches all snapshots together via edges
spanning multiple snapshots. For example, the edge (A1–B3) in Fig. 2c represents a
trip starting at node A at time 1 and reaching node B at time 3. Even though not
explicit in Fig. 2c, properties of edge (A1–B3) may record fuel-consumption, NOx
emissions and other vehicle measurements made during the trip.
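To make the time-expanded representation concrete, the following minimal Python sketch (node names, time stamps and measurement attributes are illustrative, not taken from any dataset discussed here) stores trip edges that may span several snapshots, like the edge (A1–B3) above:

```python
# Minimal sketch of a time-expanded graph (TEG) for an STN.
# Node names, times and trip attributes below are illustrative only.
from collections import defaultdict

class TimeExpandedGraph:
    def __init__(self):
        # adjacency: (node, t) -> list of ((node', t'), attributes)
        self.adj = defaultdict(list)

    def add_trip_edge(self, u, t_start, v, t_end, **attrs):
        """Edge (u, t_start) -> (v, t_end): traverse segment u-v departing at
        t_start and arriving at t_end; attrs may hold vehicle measurements
        (e.g., fuel use, NOx) recorded during the traversal."""
        self.adj[(u, t_start)].append(((v, t_end), attrs))

teg = TimeExpandedGraph()
# The edge (A1-B3) of Fig. 2c: leave node A at time 1, reach node B at time 3.
teg.add_trip_edge("A", 1, "B", 3, nox_g=0.8, fuel_l=0.12)
# Waiting in place is modeled as an edge to the same node one step later.
teg.add_trip_edge("A", 1, "A", 2)
print(teg.adj[("A", 1)])
```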
Fig. 4 (a) Two routes between A and D for the bus. (b) STN based representation of the two bus
routes between A and D
Figure 4b highlights the edges of the time-expanded graph that were traveled in some journey. Each journey is shown using a different color. The number mentioned in "[ ]" on the edges denotes the number of discrepancies in the
target variable observed on that edge. The journeys colored red and purple (both
over the route with left turn during “rush hours”) can be seen to have the highest
number (six each, compared to others with only two) of discrepancies, making them
a potential output for the hot route discovery problem.
Mining statistically significant hot routes is challenging due to the large data volume and the STN data semantics, which violate critical assumptions underlying traditional techniques for identifying statistically significant hotspots (Kulldorff 1997; Neill and Moore 2004; Anselin and Getis 2010), namely that data samples are embedded in isotropic Euclidean space and that the spatial footprints of hotspots are simple geometric shapes (e.g., circles, or rectangles with sides parallel to the coordinate
axes). This is illustrated in Fig. 5 using a transportation safety dataset on pedestrian fatalities. SaTScan (Kulldorff 1997, 1999), a popular hotspot detection technique, finds a few circular hotspots (Fig. 5a) with statistical significance of around 0.1, even though many highway stretches (shown in color in Fig. 5b) have higher statistical significance (0.035–0.052) and can potentially be detected by non-geometric methods such as those of Oliver et al. (2014) and Ernst et al. (2011).
For our proposed hot route discovery problem, one needs to generalize spatial statistics beyond isotropic Euclidean space to model STN semantics such as linear routes. Computationally, the size of the set of candidate patterns (all possible directed paths in the given spatio-temporal network) is potentially exponential. In addition, statistical interest measures (e.g., likelihood ratio, p-value) may lack monotonicity; in other words, the likelihood ratio of a directed path may either increase or decrease as the path is extended. We now describe some potential research questions for the problem of hot route discovery.
Step-1: Design interest measure for statistically significant hot routes: The goal
here would be to develop interest measures for identifying statistically significant
hot routes. This would require balancing conflicting requirements of statistical
interpretation and support for computationally efficient algorithms. A statistically
interpretable measure is one whose values would typically conform to a known
distribution. On the other hand, computational properties which are known to
facilitate efficient algorithms include properties such as monotonicity. For this
task, one could start by exploring a ratio-based interest measure comparing the density of non-compliance inside the path to that outside the path. We could investigate the merit of this measure through the following research questions: What are the characteristics of this interest measure? Does it have monotonicity? If not, can we design an upper bound which does? Do the values of this measure (or its upper bound) follow a known distribution? Does it belong to one of the general classes of distributions? If not, how would an appropriate null hypothesis be designed for Monte Carlo simulations? What distribution does the underlying VMBD normally tend to follow (needed for testing the null hypothesis)? Can we derive an interest measure based on the underlying distribution of the data (similar to the log likelihood ratio (Kulldorff et al. 1998; Kulldorff 1997))?
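As a purely illustrative sketch of such a ratio-based interest measure, the following Python fragment (edge identifiers, discrepancy counts and lengths are invented) computes the density of non-compliance on a candidate route relative to the rest of the network; whether this ratio, or an upper bound on it, is monotone is precisely the open question raised above:

```python
# Sketch of a ratio-based interest measure for a candidate route:
# density of discrepancies inside the route vs. outside it. All data invented.
def route_interest(route_edges, discrepancies, length):
    """discrepancies[e]: count of target-variable discrepancies on edge e;
    length[e]: length (or exposure) of edge e."""
    inside = set(route_edges)
    n_in = sum(discrepancies[e] for e in inside)
    l_in = sum(length[e] for e in inside)
    n_out = sum(c for e, c in discrepancies.items() if e not in inside)
    l_out = sum(l for e, l in length.items() if e not in inside)
    d_in = n_in / l_in if l_in else 0.0
    d_out = n_out / l_out if l_out else float("inf")
    return d_in / d_out if d_out else float("inf")

discrepancies = {"A-B": 4, "B-D": 2, "A-C": 1, "C-D": 1}
length = {"A-B": 1.0, "B-D": 1.0, "A-C": 1.0, "C-D": 1.0}
print(route_interest(["A-B", "B-D"], discrepancies, length))  # 3.0
```

Statistical significance of such a score could then be assessed, for example, by Monte Carlo permutation of the discrepancies over the network, which is where the distributional questions above become critical.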
Step-2: Design a computational structure for exploring the space of STN shortest paths: As a first step towards addressing the challenge of the large number of candidate paths, one would have to investigate scalable techniques for computing all-pair STN shortest paths. This requires addressing the challenge of non-stationary ranking among alternative paths between two points in an STN. This non-stationary ranking precludes the possibility of a dynamic programming
(DP) based approach. A potential approach to this challenge could involve a divide-and-conquer strategy called critical-time-point (CTP) based approaches (Gunturi et al. 2011, 2015). A CTP based approach addresses the challenge of non-stationary ranking by efficiently dividing the given time interval (over which we observe non-stationary ranking among alternative paths) into a set of disjoint time intervals over which the ranking is stationary. One can then use a DP based technique over these sub-intervals. Using the concept of CTPs, one could adapt algorithms for the all-pair shortest path problem (e.g., Floyd-Warshall's, Johnson's).
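The toy Python fragment below illustrates the non-stationarity that CTP-based approaches exploit; the two candidate paths and their time-dependent travel-times are invented, and the departure times at which the preferred path flips play the role of critical time points:

```python
# Toy illustration of non-stationary ranking over departure times.
# Travel-times (in minutes) for two fixed candidate paths are made up.
travel_time = {
    "P1": [10, 10, 14, 16, 16, 12],  # e.g., via the freeway
    "P2": [13, 13, 13, 13, 13, 13],  # e.g., via arterials
}
best = [min(travel_time, key=lambda p: travel_time[p][t]) for t in range(6)]
critical = [t for t in range(1, 6) if best[t] != best[t - 1]]
print(best)      # ['P1', 'P1', 'P2', 'P2', 'P2', 'P1']
print(critical)  # [2, 5]: within each sub-interval the ranking is stationary
```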
Step-3: Algorithm for exploring the space of all simple paths in a spatio-temporal graph: The third step for hot route discovery would be to investigate the computational challenges of enumerating all possible paths in a spatio-temporal network (STN). Clearly, in any real STN, there would be an exponential number of candidate paths. A naive approach could involve an incremental spatio-temporal join based strategy, which starts off with singleton edges in the STN and iteratively joins them (based on spatio-temporal neighborhood constraints) to create longer paths, as sketched below. This naive algorithm raises several unanswered questions: Would paths enumerated this way still be simple, i.e., with no repeated nodes? How can we avoid creating loops?
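A minimal Python sketch of such an incremental join on a small, non-temporal toy graph follows; the node-membership test is one possible answer to the loop-avoidance question, though the candidate set still grows exponentially in general:

```python
# Naive incremental-join enumeration of simple paths: start from single
# edges and extend a path only with nodes it does not already contain.
def enumerate_simple_paths(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    paths = [[u, v] for u, v in edges]   # singleton-edge paths
    frontier = paths[:]
    while frontier:
        extended = []
        for path in frontier:
            for w in adj.get(path[-1], []):
                if w not in path:        # keeps paths simple (no repeated node)
                    extended.append(path + [w])
        paths.extend(extended)
        frontier = extended
    return paths

print(enumerate_simple_paths([("A", "B"), ("B", "C"), ("C", "A"), ("B", "D")]))
```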
3 Travel-Time Big Data

Figure 6 shows a sample transportation network in the Minneapolis area (on the left). On the right is its simplified version, where arrows represent road segments and labels (in circles) represent intersections between road segments. Locations of traffic signals are also annotated in the figure.
We consider an instance of travel-time big data on the sample network shown in Fig. 6 by considering the following three datasets: (a) temporally detailed (TD) roadmaps (Shekhar et al. 2012), (b) traffic signal data (Liu and Hu 2013) and (c) map-matched GPS traces (Yuan et al. 2010; Shekhar et al. 2012). Each of these measures either historical or evolving aspects of certain travel-related phenomena on
our transportation network. TD roadmaps store historical travel-time on road
segments for several departure-times (at 5 min intervals) in a typical week. The
essence of TD roadmaps and traffic signal data is illustrated in Fig. 7. For simplic-
ity, TD roadmaps are illustrated by highlighting the morning (7:00 am to 11:00 am)
travel time only on segments A-F, F-D and S-A (7 min, 11 min and 9 min
respectively). The travel-times of other road segments in the figure (shown next
to arrows representing roads) are assumed to remain constant. The figure also shows
the traffic signal delays during the 7:00 am to 11:00 am period. Additionally, the
traffic signals SG1, SG2, SG3 are coordinated such that in a journey towards D
(from S within certain speed-limits), one would typically wait only at SG1. Simi-
larly, a traveler starting on segment B-C (after SG1) would have to wait only at
SG2.
Fig. 6 Sample transportation network for illustrating travel-time big data (best in color)
Fig. 7 [Figure: TD roadmap travel-times and traffic signal data for the sample network. Rush-hour (7:00–11:00 am) travel-times: S-A 7 min, A-F 11 min, F-D 9 min; ramp meters RM1 (red 2½ min) and RM2 (red 1 min) active 6:00–11:00 am; coordinated traffic signals SG1, SG2 and SG3, each with a 90 s red light during 6:00–11:00 am]
Map-matched and pre-processed (Zheng and Zhou 2011) GPS tracks, another component of our travel-time big data, consist of a sequence of road-segments traversed in a journey, along with a schedule denoting the exact time when the traversal of each segment began (via map-matching the GPS points). GPS traces can potentially capture the evolving aspects of the system. For instance, if segment E-D in Fig. 7 is congested due to an event (a non-equilibrium phenomenon), the travel-time denoted by TD roadmaps may no longer be accurate. In such a case, one
may prefer to follow another route (say C-F-D) which other commuters may be
taking to reach D.
As a first step towards building a navigation system which can harness travel-time big data for recommending routes, we need a unified querying framework across the previously described TD roadmaps, traffic signal data and GPS traces. This querying
framework would integrate all the datasets into one model such that routing
algorithms can access and compare information from multiple datasets. This prob-
lem is formally defined below:
Given TD roadmaps, traffic signal data and annotated GPS traces, and a set of
use-case queries, the goal is to build a unified logical data-model across these
datasets which can express travel related concepts explicitly while supporting
efficient algorithms for given use-cases. The objective here would be to balance
the trade-off between expressiveness and computational efficiency.
Challenges: Designing a logical data-model for travel-time big data is
non-trivial. For instance, the model should be able to conveniently express all the
properties of the n-ary relations in the data. For example, consider a typical journey
along S-B-C-E-D through a series of coordinated traffic signals SG1, SG2 and SG3
in Fig. 7. Here, the red-light durations and phase gaps between the traffic signals
SG1, SG2 and SG3 are set in a way that a traveler starting at S and going towards D
(within certain speed-limits) would typically wait only at SG1, before being
smoothly transferred through intersections C and E with no waiting at SG2 or
SG3. In other words, in this journey, the initial wait at SG1 renders SG2 and SG3 wait-free. If the effects of the immediate spatial neighborhood are referred to as local interactions (e.g., waiting at SG1 delaying entry into segment B-C), then the effect of SG1 on SG2 would be referred to as a non-local interaction, since SG1 is not in the immediate neighborhood of SG2.
Typical signal delays (and travel-time) measured under signal coordination
cannot be decomposed to get correct, reliable information. For instance, if we
decompose the above mentioned journey through S-B-C-E-D into experiences at
individual road-segments and traffic signals, we would note that we typically never
wait at SG2 or SG3. However, this is not true as any typical journey starting at
intersection B (after SG1) would have to wait at SG2. This kind of behavior where
properties measured over larger instances lose their semantic meaning when
decomposed into smaller instances is called holism. And we refer to such properties
as holistic properties. Travel-time observed in GPS traces also show such holism as
time spent on a road segment depends on the initial speed attained before entering
the segment (Gunturi and Shekhar 2014).
Limitations of Current Logical Data-models: Current approaches for modeling
travel-time data such as time aggregated graphs (George and Shekhar 2006, 2007),
time expanded graphs (Köhler et al. 2002; Kaufman and Smith 1993) and related
problems in modeling and querying graphs in databases (Hoel et al. 2005; Güting
1994), are not suitable for modeling the mentioned travel-time big data. They are
most convenient when the property being represented can be completely
decomposed into properties of binary relations. In other words, they are not suitable
for representing the previously described holistic properties. Current related work
would represent our previous signal coordination scenario using the following two
data structures (see Fig. 8): one containing travel-time on individual segments
(binary relations) S-B, B-C, C-E, and E-D; the second containing the delays and
the traffic controlled by signals SG1, SG2, and SG3 (also binary). However, this is
not convenient as non-local interactions affecting travel-times on some journeys
(e.g. S-B-C-E-D) are not expressed explicitly. Note that this representation would
have been good enough if SG1, SG2 and SG3 had not been coordinated.
Ideally, the representation model should express the non-local interactions more
explicitly. Figure 9 illustrates our proposal (Gunturi and Shekhar 2014) for the
previous signal coordination scenario. Here, we propose to represent the journey as
a series of overlapping sub-journeys, each accounting for a non-local interaction.
The first entry in the figure corresponds to travel-time experienced on the
sub-journey containing road segment S-B (3 min) and delay at SG1 (max delay
90 s). This would be between 3 min and 4 min 30 s (no non-local interactions in this
case). Next we would store travel-time experienced on the sub-journey containing
road segment S-B (3 min), delay at SG1 (max delay 90 s), segment B-C (8 min) and
delay at SG2 (max 90 s) as between 11 min and 12 min 30 s. Note that we did not
consider the delay caused by SG2 due to non-local interaction from SG1. This
process continues until all the possible non-local interactions are covered.
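The small Python sketch below mimics this overlapping sub-journey representation for the coordinated-signal example; the travel-time ranges follow the numbers quoted above, and the data structure is only one plausible encoding of the proposal:

```python
# One plausible encoding of the overlapping sub-journey representation of
# Fig. 9: each entry stores the end-to-end travel-time range of a journey
# prefix, so non-local signal interactions are never decomposed away.
# Times follow the example in the text (S-B: 3 min; SG1: up to 90 s red;
# B-C: 8 min; SG2 rendered wait-free by coordination with SG1).
sub_journeys = [
    # (covered prefix, min minutes, max minutes)
    (("S-B", "SG1"), 3.0, 4.5),
    (("S-B", "SG1", "B-C", "SG2"), 11.0, 12.5),  # no extra SG2 delay added:
                                                 # non-local interaction w/ SG1
]
for prefix, lo, hi in sub_journeys:
    print(" -> ".join(prefix), f"takes between {lo} and {hi} minutes")
```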
A key task while developing a logical model for travel-time big data would be to
balance the tradeoff between computational scalability and accuracy of represen-
tation of important travel related concepts such as non-local interactions. Further,
the proposed model should provide a seamless integration of TD roadmaps, traffic
signal data and experiences of commuters through GPS traces. This would allow the
routing framework to explore candidate travel itineraries across multiple sources to
get richer results. For example, routes from TD roadmaps, which recommend based
on historic congestion patterns, can be compared (during the candidate exploration
phase itself rather than offline) against commuter preferences for factors like fuel
efficiency and “convenience”.
4 Conclusion
BSTN data analytics has value addition potential for societal applications such as
urban navigational systems and transportation management and planning. It does
however present challenges for the following reasons: First, the linear nature of the patterns in BSTN data (e.g., vehicle measurement big data) raises significant semantic challenges for the current state-of-the-art in mining statistically significant spatial patterns, which mostly focuses on geometric shapes, e.g., circles and rectangles. Second, the holistic properties increasingly being captured in
BSTN data (travel-time big data) raise representational challenges to the current
state-of-the-art data-models for spatio-temporal networks. In future work, we plan to explore the research questions raised in this paper towards formalizing the notion of BSTN data analytics.
Acknowledgment This work was supported by: NSF IIS-1320580 and 0940818; USDOD
HM1582-08-1-0017 and HM0210-13-1-0005; IDF from UMN. We would also like to thank
Prof William Northrop and Andrew Kotz of University of Minnesota for providing visualizations
of the vehicle measurement big data and an initial insight into interpreting it. The content does not
necessarily reflect the position or the policy of the government and no official endorsement should
be inferred.
References
Agnarsson G, Greenlaw R (2006) Graph theory: modeling, applications, and algorithms. Prentice
Hall, Upper Saddle River, NJ. ISBN 0131423843
Ahuja RK, Magnanti TL, Orlin JB (1993) Network flows: theory, algorithms, and applications.
Prentice Hall, Upper Saddle River, NJ
Ali RY et al (2015) Discovering non-compliant window co-occurrence patterns: a summary of
results. In: International symposium on spatial and temporal databases. Springer, New York
Anselin L, Getis A (2010) Spatial statistical analysis and geographic information systems. In:
Perspectives on spatial data analysis. Springer, New York, pp 35–47
Boyle DE, Yates DC, Yeatman EM (2013) Urban sensor data streams: London 2013. IEEE
Internet Comput 17(6):12–20
Ernst M, Lang M, Davis S (2011) Dangerous by design: solving the epidemic of preventable
pedestrian deaths. Transportation for America: Surface transportation policy partnership. The
National Academies of Sciences, Engineering, and Medicine, Washington, DC
George B, Shekhar S (2006) Time-aggregated graphs for modeling spatio-temporal networks. In:
Advances in conceptual modeling-theory and practice. Springer, New York, pp 85–99
George B, Shekhar S (2007) Time-aggregated graphs for modelling spatio-temporal networks. J
Data Semantics XI:191
George B, Shekhar S (2008) Time-aggregated graphs for modeling spatio-temporal networks. J
Data Semantics XI:191–212
George B, Kim S, Shekhar S (2007) Spatio-temporal network databases and routing algorithms: a
summary of results. In: Advances in spatial and temporal databases. Springer, New York, pp
460–477
Gunturi VMV, Shekhar S (2014) Lagrangian Xgraphs: a logical data-model for spatio-temporal
network data: a summary. In: Advances in conceptual modeling. Springer, New York, pp
201–211
Gunturi VMV et al (2011) A critical-time-point approach to all-start-time Lagrangian shortest
paths: a summary of results. In: Advances in spatial and temporal databases. Springer,
New York, pp 74–91
Gunturi VM, Shekhar S, Yang K (2015) A critical-time-point approach to all-departure-time
Lagrangian shortest paths. IEEE Trans Knowl Data Eng 27(10):2591–2603
Güting RH (1994) GraphDB: modeling and querying graphs in databases. In Proceedings of the
20th international conference on very large data bases, pp 297–308
Hoel EG, Heng W-L, Honeycutt D (2005) High performance multimodal networks. In: Advances
in spatial and temporal databases. Springer, New York, pp 308–327
Kargupta H et al (2006) On-board vehicle data stream monitoring using minefleet and fast resource
constrained monitoring of correlation matrices. N Gener Comput 25(1):5–32
Kargupta H, Gama J, Fan W (2010) The next generation of transportation systems, greenhouse
emissions, and data mining. In Proceedings of the 16th ACM SIGKDD international confer-
ence on knowledge discovery and data mining, pp 1209–1212
Kaufman DE, Smith RL (1993) Fastest paths in time-dependent networks for intelligent vehicle-
highway systems application. I V H S J 1(1):1–11
Kelmelis J, Loomer S (2003) The geographical dimensions of terrorism, Chapter 5.1, Table 5.1.1.
Routledge, New York
Köhler E, Langkau K, Skutella M (2002) Time-expanded graphs for flow-dependent transit times.
In: Proceedings of the 10th annual European symposium on algorithms. ESA’02. Springer,
London, pp 599–611
Kulldorff M (1997) A spatial scan statistic. Commun Stat Theory Methods 26(6):1481–1496
Kulldorff M (1999) Spatial scan statistics: models, calculations, and applications. In: Scan
statistics and applications. Springer, New York, pp 303–322
Kulldorff M et al (1998) Evaluating cluster alarms: a space-time scan statistic and brain cancer in
Los Alamos, New Mexico. Am J Public Health 88(9):1377–1380
Lee U et al (2009) Dissemination and harvesting of urban data using vehicular sensing platforms.
IEEE Trans Veh Technol 58(2):882–901
Liu H, Hu H (2013) SMART-signal phase II: arterial offset optimization using archived high-
resolution traffic signal data. Technical report # CTS 13-19. Center for Transportation Studies,
University of Minnesota
Neill DB, Moore AW (2004) Rapid detection of significant spatial clusters. In: KDD, pp 256–265
Oliver D et al (2014) Significant route discovery: a summary of results. In: Geographic informa-
tion science - 8th International conference, GIScience 2014 proceedings, vol 8728, LNCS.
Springer, New York, pp 284–300
Shekhar S et al (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the 11th ACM international workshop on data engineering for wireless and mobile access, pp 1–6. http://dl.acm.org/citation.cfm?id=2258058. Accessed 5 Jun 2014
Trudeau RJ (2013) Introduction to graph theory. Courier Corporation, North Chelmsford. ISBN
0486678709
US Environmental Protection Agency (2015) Heavy-duty highway compression-ignition engines
and urban buses -- exhaust emission standards. US Environmental Protection Agency,
Washington, DC
Wolfson O, Xu B (2014) A new paradigm for querying blobs in vehicular networks. IEEE
MultiMedia 21(1):48–58
Yang K et al (2011) Smarter water management: a challenge for spatio-temporal network
databases. In: Advances in spatial and temporal databases, Lecture notes in computer science.
Springer, Berlin, pp 471–474
Yuan J et al (2010) T-drive: driving directions based on taxi trajectories. In Proceedings of the
SIGSPATIAL international conference on advances in geographic information systems.
GIS’10, pp 99–108
Zheng Y, Zhou X (2011) Computing with spatial trajectories. Springer, New York
A Review of Heteroscedasticity Treatment
with Gaussian Processes and Quantile
Regression Meta-models
F. Antunes (*)
Center of Informatics and Systems of the University of Coimbra, Coimbra, Portugal
e-mail: [email protected]
A. O’Sullivan
Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
F. Rodrigues • F. Pereira
Technical University of Denmark, Lyngby, Denmark
1 Introduction
Throughout, we consider regression models of the form

$y = f(x) + \varepsilon,$

where the behavior of the error term $\varepsilon$, and in particular whether its variance is constant over the input space, is the central concern of this chapter.
2 Multiple Linear Regression

We now review the traditional Multiple Linear Regression (MLR) model and its problems with heteroscedasticity. This method has been widely studied and continues to be a statistical analysis standard in most fields, mainly because of its interpretational simplicity and computational speed. The MLR model is defined as

$y_i = w_0 + w_1 x_{i1} + \dots + w_D x_{iD} + \varepsilon_i,$
or, in matrix notation,

$y = f(X, w) + \varepsilon = X^T w + \varepsilon,$

where the error term is assumed to follow a Gaussian distribution with fixed variance $\sigma^2$, that is, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Under this assumption, maximizing the (log) likelihood function w.r.t. the parameters $w$ is equivalent to minimizing the least-squares function (Bishop 2006), that is,

$\hat{w} = \arg\max_w \sum_{i=1}^{N} \log \mathcal{N}\big(y_i \mid x_i^T w, \sigma^2\big) = \arg\min_w \sum_{i=1}^{N} \big(y_i - w^T x_i\big)^2. \quad (1)$
The solution of (1) is called the Ordinary Least-Squares (OLS) estimator and is given by

$\hat{w} = (X X^T)^{-1} X y. \quad (2)$

This estimator has the following properties for the mean and variance:

$\mathbb{E}[\hat{w} \mid X] = (X X^T)^{-1} X \, \mathbb{E}[y] = (X X^T)^{-1} X \big(X^T w + \mathbb{E}[\varepsilon]\big) = w, \quad (3)$

$\mathrm{Var}[\hat{w} \mid X] = (X X^T)^{-1} X \, \Phi \, X^T (X X^T)^{-1}, \quad (4)$

where $\Phi$ is a diagonal matrix with $\Phi_{ii} = \mathrm{Var}[\varepsilon_i] = \sigma^2$. Since the error term is homoscedastic, i.e., $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$, we have $\Phi = \sigma^2 I$, where $I$ is the identity matrix, so (4) simplifies to

$\mathrm{Var}[\hat{w} \mid X] = \sigma^2 (X X^T)^{-1}.$
Under the assumptions of the Gauss-Markov theorem, (2) is the best linear unbiased estimator of $w$. From (3) and (4) we can see that $\hat{w}$ is unbiased and that its variance is driven by the error variance. Notice that, in the presence of heteroscedastic residuals, $\hat{w}$ itself remains unbiased, but its variance becomes a function of the individual $\sigma_i^2$. Hence, the associated significance tests and confidence intervals for the predictions will no longer be valid. In Osborne and Waters (2002) the reader can find a brief but practical guide to some of the most common violations of linear model assumptions, including the usual assumption of homoscedastic residuals.
To overcome this problem, White (1980) suggested a heteroscedasticity-consistent covariance matrix. In its simplest form, this amounts to setting $\Phi$ to be an $N \times N$ diagonal matrix whose elements are $\Phi_{ii} = \hat{\varepsilon}_i^2$, an estimate of $\mathrm{Var}[\varepsilon_i]$. The problem here is that we do not know the form of $\mathrm{Var}[\varepsilon_i]$, so we need to estimate it. Under certain conditions, this can be achieved by constructing, from the squared residuals

$\hat{\varepsilon}_i^2 = \big(y_i - x_i^T \hat{w}\big)^2,$

the consistent estimator

$\frac{1}{N} \sum_{i=1}^{N} \hat{\varepsilon}_i^2 \, x_i x_i^T, \quad (5)$

which, plugged into (4), yields heteroscedasticity-consistent ("sandwich") standard errors.
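As a hedged numerical illustration of (2)–(5), the following numpy sketch (with synthetic heteroscedastic data; all names and settings are ours, not from the chapter) contrasts the classical OLS covariance with White's sandwich estimator:

```python
# Sketch: OLS with White's heteroscedasticity-consistent ("sandwich")
# covariance, using the row-wise design-matrix convention; synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.uniform(0, 10, N)])
# Heteroscedastic noise: standard deviation grows with the regressor.
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5 * X[:, 1])

w_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimate, cf. (2)
resid = y - X @ w_hat
XtX_inv = np.linalg.inv(X.T @ X)
# Classical covariance: sigma^2 (X'X)^{-1}, valid only under homoscedasticity.
cov_ols = resid.var(ddof=X.shape[1]) * XtX_inv
# White/HC0 sandwich: (X'X)^{-1} X' diag(resid_i^2) X (X'X)^{-1}, cf. (4)-(5).
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(cov_ols)))   # classical SEs (understate slope SE here)
print(np.sqrt(np.diag(cov_hc0)))   # heteroscedasticity-robust SEs
```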
3 Gaussian Processes

Gaussian Processes (GPs) offer a very different approach to regression from MLR, allowing for the modeling of non-linear relationships and online learning. Their non-parametric Bayesian nature gives them sufficient flexibility to be used in a wide variety of regression and classification problems. In the following sections we present the standard GP model and a heteroscedastic variant.
3.1 Standard GP
The standard GP regression model is

$y = f(x) + \varepsilon, \quad (7)$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and $f(x) \sim \mathcal{GP}\big(m_f(x), k_f(x, x')\big)$. Assuming $m_f(x) = 0$, the prior over the latent function values is then given by

$p\big(f \mid x_1, x_2, \dots, x_n\big) = \mathcal{N}(0, K_f),$

and the hyper-parameters $\theta$ are typically learned by maximizing the log marginal likelihood

$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{N}{2} \log(2\pi),$

where $K_y = K_f + \sigma^2 I$ and $K_f$ are, respectively, the covariance matrices for the noisy targets $y$ and the noise-free latent variable $f$.
Having set the covariance function and its corresponding hyper-parameters, the predictive distribution at a new test point $x_*$ is given by

$p\big(f_* \mid X, y, x_*\big) = \mathcal{N}\big(\mathbb{E}[f_*], \mathbb{V}[f_*]\big),$

with

$\mathbb{E}[f_*] = k_{f*}^T \big(K_f + \sigma^2 I\big)^{-1} y,$

$\mathbb{V}[f_*] = k_{f**} - k_{f*}^T \big(K_f + \sigma^2 I\big)^{-1} k_{f*},$

where $k_{f*}$ denotes the covariances between $x_*$ and the training inputs and $k_{f**} = k_f(x_*, x_*)$.
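A minimal numpy transcription of these predictive equations, with an RBF covariance and synthetic data (all settings illustrative), might look as follows:

```python
# Sketch of standard GP prediction (zero mean, RBF kernel); synthetic data.
import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    # k_f(x, x') = sf2 * exp(-0.5 (x - x')^2 / ell^2)
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, 30))
y = np.sin(x) + 0.1 * rng.standard_normal(30)
xs = np.linspace(0, 6, 5)          # test points
sigma2 = 0.1 ** 2                  # assumed (fixed) noise variance

K = rbf(x, x) + sigma2 * np.eye(len(x))   # K_y = K_f + sigma^2 I
Ks = rbf(x, xs)                           # k_{f*}
Kss = rbf(xs, xs)                         # k_{f**}

mean = Ks.T @ np.linalg.solve(K, y)                   # E[f_*]
var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))    # V[f_*]
print(np.round(mean, 2), np.round(var, 3))
```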
3.2 Heteroscedastic GP
In this section we closely follow the work of Gredilla and Titsias (2012). As seen in the previous section, the standard GP assumes a constant variance, $\sigma^2$, throughout the input space. We now relax this assumption.
To define a Heteroscedastic GP (HGP), besides placing a GP prior on $f(x)$ as in (7), we also place a GP prior on the error term, so that we have

$y = f(x) + \varepsilon,$

with $f(x) \sim \mathcal{GP}\big(0, k_f(x, x')\big)$ and $\varepsilon \sim \mathcal{N}(0, r(x))$, where $r(x)$ is an unknown function. To ensure positivity, and without loss of generality, we can define $r(x) = e^{g(x)}$, where

$g(x) \sim \mathcal{GP}\big(\mu_0, k_g(x, x')\big).$
After fixing both covariance functions, $k_f$ and $k_g$, the HGP is fully specified and depends only on its hyper-parameters. Unfortunately, exact inference in the HGP is no longer tractable. To overcome this issue, the authors propose a variational inference algorithm which establishes a Marginalized Variational (MV) bound on the likelihood function, given by

$F(\mu, \Sigma) = \log \mathcal{N}(y \mid 0, K_f + R) - \tfrac{1}{4}\mathrm{tr}(\Sigma) - \mathrm{KL}\big(\mathcal{N}(g \mid \mu, \Sigma)\,\|\,\mathcal{N}(g \mid \mu_0 \mathbf{1}, K_g)\big),$

where $R$ is a diagonal matrix with $R_{ii} = e^{\mu_i - \Sigma_{ii}/2}$, $K_g$ is the covariance matrix resulting from the evaluation of $k_g(x, x')$, $\mu$ and $\Sigma$ are the parameters of the variational distribution $q(g) = \mathcal{N}(g \mid \mu, \Sigma)$, and KL is the Kullback-Leibler divergence. The hyper-parameters of this Variational Heteroscedastic GP (VHGP) are then learned by solving the stationary equations

$\dfrac{\partial F(\mu, \Sigma)}{\partial \mu} = 0, \qquad \dfrac{\partial F(\mu, \Sigma)}{\partial \Sigma} = 0.$
The solution of this system is a local or global maximum and is given by

$\mu = K_g \big(\Lambda - \tfrac{1}{2} I\big) \mathbf{1} + \mu_0 \mathbf{1}, \qquad \Sigma^{-1} = K_g^{-1} + \Lambda,$

for some positive semidefinite diagonal matrix $\Lambda$. We can see that both $\mu$ and $\Sigma^{-1}$ depend on $\Lambda$, letting us rewrite the MV bound as a function of $\Lambda$. The resulting predictive quantities at a test point $x_*$ are

$a_* = k_{f*}^T (K_f + R)^{-1} y, \qquad c_*^2 = k_{f**} - k_{f*}^T (K_f + R)^{-1} k_{f*},$

$\mu_* = k_{g*}^T \big(\Lambda - \tfrac{1}{2} I\big) \mathbf{1} + \mu_0, \qquad \sigma_*^2 = k_{g**} - k_{g*}^T (K_g + \Lambda)^{-1} k_{g*}.$

For a more detailed derivation, please refer to Gredilla and Titsias (2012). For an introduction to the theory of variational inference, see, e.g., Fox and Roberts (2012).
with $\alpha \in (0, 1)$.
The biggest drawback of Gaussian Processes is that they suffer from computational intractability for very large datasets. As seen in Rasmussen and Williams (2006), the complexity typically scales as $O(N^3)$, so for $N > 10{,}000$ both storing and inverting the covariance matrix proves to be a prohibitive task on most modern machines. More recently, and specifically concerning the treatment of Big Data, Hensman et al. (2013) proposed a stochastic variational inference algorithm that is capable of handling millions of data points. By variationally decomposing the GP and making it depend only on a specific set of $M$ relevant inducing points, the authors lowered the complexity to $O(M^3)$. Also concerning the application of GPs to large datasets, Snelson and Ghahramani (2007) suggested a sparse approximation combining global and local approaches. There is a vast literature on GP approximations; for an overview of some of these methods we suggest Quinonero-Candela et al. (2007).
In this work we address the challenge posed by Big Data from the variance-treatment point of view. Non-constant variance, usually associated with the presence of a great amount of unstructured noise, is an intrinsic characteristic of Big Data that should be taken into account and treated with appropriate heteroscedastic models.
4 Quantile Regression
Since, in some cases, integrating a heteroscedastic component in the model may not
be practical due to overwhelming complexity or computational intractability, an
alternative is to do post-model analysis. In this case, we analyze the output
performance of a certain prediction process, whose internal details are not neces-
sarily known or understood, and model its observed error. This is the approach
proposed by Taylor and Bunn (1999) and Lee and Scholtes (2014), where the
authors focus on time series prediction. In Pereira et al. (2014), we extended this
approach with more complex Quantile Regression (QR) models applied to general
predictors. Our algorithm treats the original prediction model as a black box and
generates functions for the lower and upper quantiles, respectively providing the
lower and upper bounds of the prediction interval. Despite this flexibility, being
downstream to the model itself comes with a cost: this approach will hardly correct
earlier modeling mistakes and will only uncover such limitations (by presenting
very wide prediction bounds).
In contrast to least-squares-based regression approaches, QR fits the regression parameters to conditional quantiles instead of the conditional mean. By doing so, this type of regression tends to be more robust against outliers and does not depend on the common assumptions of normality or symmetry of the errors. It is
however worth mentioning that this approach can suffer from inconsistency as the
resulting quantiles may cross. In practical terms this is not usually a problem,
although it is important to be aware of it. Following Koenker and Hallock (2001)
and Fargas et al. (2014), we can denote the conditional quantile function as:
$Q_y(\tau \mid X),$

where $\tau \in (0, 1)$ is the quantile of interest. The regression parameters $w$ are estimated by minimizing the tilted loss,

$\min_w \sum_{i=1}^{N} \rho_\tau\big(y_i - \xi(X_i; w)\big),$

where $\xi(X; w)$ is the quantile model and $\rho_\tau(u) = u\,(\tau - \mathbf{1}_{u<0})$ is the tilted loss function. In the Gaussian Process Quantile Regression (GPQR) of Boukouvalas et al. (2012), a GP prior is placed over the quantile function, where $k$ is the covariance function. Contrary to the standard QR, the GPQR is, of course, a Bayesian approach to quantile estimation.
Fig. 1 [Figure: the tilted loss function, with slopes τ − 1 and τ]
Within the GPQR approach, the posterior distribution of the quantile function $Q_y$ is

$p(Q_y \mid D, \theta) = \frac{1}{z}\, p(D \mid Q_y, \theta)\, p(Q_y \mid \theta),$

where $z$ is a normalizing constant. The hyper-parameters can be estimated via

$\arg\max_\theta \, p(Q_y \mid D, \theta),$

which is proved to be equivalent to minimizing the tilted loss function. Hence, the likelihood becomes

$p(D \mid Q_y) = \frac{\tau^N (1-\tau)^N}{\sigma^N}\, e^{-\sum_{i=1}^{N} \mu_i (\tau - \mathbf{1}_{\mu_i < 0}) / \sigma},$

where $\mu_i = y_i - Q_y(x_i)$ and $\sigma$ is a scale parameter.
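As a sketch of quantile estimation via the tilted loss (minimized here directly with a generic optimizer rather than the GPQR machinery), with synthetic heteroscedastic data and a simple linear quantile model:

```python
# Linear quantile regression by direct minimization of the tilted
# (pinball) loss rho_tau; data and model are synthetic/illustrative.
import numpy as np
from scipy.optimize import minimize

def pinball(u, tau):
    return np.where(u >= 0, tau * u, (tau - 1) * u)   # rho_tau(u) >= 0

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = 2 * x + rng.normal(0.0, 0.3 * x)                  # spread grows with x
X = np.column_stack([np.ones_like(x), x])

def fit_quantile(tau):
    obj = lambda w: pinball(y - X @ w, tau).sum()
    return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

lower, upper = fit_quantile(0.025), fit_quantile(0.975)
print(lower, upper)   # bounds of a ~95% prediction band that widens with x
```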
5 Experiments
In this section we compare the performances of the GP, VHGP and GPQR approaches to the problem of heteroscedasticity, using two one-dimensional
datasets. From now on we will refer to the GPQR only by QR. More than assessing
the pointwise quality of the predictions using the standard performance measures,
we are mostly interested in evaluating how accurately the constructed CIs and QR
prediction intervals handled the volatility in the data.
5.1 Performance Measures

For the single-point predictions we will use the following standard and well-accepted measures: RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), RAE (Root Absolute Error), RRSE (Root Relative Squared Error) and COR (Pearson's linear correlation coefficient).
For the interval quality evaluation we will consider some of the performance measures suggested in Pereira et al. (2014), with slight modifications to extend their applicability to both kinds of intervals, CI and PI. Let $l_i$ and $u_i$ be the series of lower and upper bounds of a confidence/prediction interval for $\hat{y}_i$, respectively, and let $y_i$ be the series of real target values. We can then define the following measures: the interval coverage percentage,

$\mathrm{ICP} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{y_i \in [l_i, u_i]};$

the relative mean interval length,

$\mathrm{RMIL} = \frac{1}{N} \sum_{i=1}^{N} \frac{u_i - l_i}{|y_i - \hat{y}_i|};$

and the combined measure

$\mathrm{CLC2} = e^{\mathrm{RMIL}(\mathrm{ICP} - \mu)},$

where $\mu$ is the target coverage level.
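A direct Python transcription of the three measures (assuming, as in the reconstruction above, that μ denotes the target coverage level) might be:

```python
# Interval-quality measures: coverage (ICP), relative mean interval
# length (RMIL), and the combined criterion (CLC2); mu is the target
# coverage, e.g. 0.95. Sign conventions follow the formulas above.
import numpy as np

def interval_measures(y, y_hat, lower, upper, mu=0.95):
    icp = np.mean((y >= lower) & (y <= upper))
    rmil = np.mean((upper - lower) / np.abs(y - y_hat))
    clc2 = np.exp(rmil * (icp - mu))
    return icp, rmil, clc2
```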
5.2 Datasets
We used the motorcycle dataset from Silverman (1985), composed of 133 data points, and a synthetically generated toyset consisting of 1000 data points exhibiting a linear relationship with local instant volatility. For each dataset we ran three models: GP, VHGP and QR. The QR model used as input the predictions from the two others. For both GP and VHGP we generated predictions in a tenfold cross-validation fashion, using the code freely available from Rasmussen and Williams (2006) and Gredilla and Titsias (2012). For the QR, we used the code available from Boukouvalas et al. (2012). We fixed $\alpha = 0.05$, $\tau = 0.025$ and $\tau = 0.975$ so that we have equivalent interval coverages.
5.3 Results
Table 1 summarizes the overall results obtained. The variance evolution of the GP and VHGP models is plotted in Figs. 2 and 3. Figures 4, 5, 6, and 7 compare the three approaches within each dataset.
Fig. 2 [Figure: variance of the GP vs. VHGP models over x for the motorcycle dataset]
Fig. 3 [Figure: variance of the GP vs. VHGP models over x for the toyset]
Fig. 4 Comparison between GP and QR over GP predictions for the motorcycle dataset
Fig. 5 Comparison between VHGP and QR over VHGP predictions for the motorcycle dataset
In terms of interval coverage, we can see that all models performed well,
although the VHGP and QR had an ICP slightly closer to the ideal value, except
for the toyset. It is clear that the homoscedastic model (GP) was able to reach such
Fig. 6 [Figure: comparison between GP and QR over GP predictions for the toyset]
Fig. 7 Comparison between VHGP and QR over VHGP predictions for the toyset
ICP values because of its large, but constant, variance (see Figs. 2 and 3). However, ICP cannot be regarded independently of the other measures: high values of ICP do not necessarily imply good intervals; they merely mean that the constant variance assumed by the model is large enough for the CIs to cover almost every data point. So the GP did achieve reasonably good marks for ICP, but it did so at the expense of larger (and constant) interval ranges. As a result, these values are best considered in combination with RMIL, which relates the interval bounds to the actual observed error. Unsurprisingly, the RMIL values for the GP are the largest ones, which means that, on average, the considered intervals were larger than
required to cover the data or too narrow to handle its volatility. On the other hand,
the heteroscedasticity-consistent approaches, VHGP and QR, were able to dynam-
ically adapt their error variances to the data volatility, leading to tighter intervals
that more accurately reflect the uncertainty in the data, as we can see from the
values of RMIL and CLC2.
The results obtained for the toyset are consistent with, and show a similar pattern to, those of the motorcycle dataset. The values of RMIL and CLC2 obtained for the VHGP show a great improvement over the standard homoscedastic GP, as would be expected given the heteroscedasticity present in the data. Again, the addition of a quantile regression meta-model to the standard homoscedastic GP results in much better performance in terms of RMIL and CLC2; the RMIL in particular is now comparable to the value obtained using VHGP. There is some reduction in ICP, which is undesirable, but we still obtain a high value, albeit slightly less than desired. The QR-VHGP combination again exhibits the best overall performance in comparison to the three previous methods.
6 Conclusion

As mentioned in this paper, within the transportation system context and from the user's perspective, it is often of extreme importance to have precise estimates which can incorporate the instant volatility that can occur at any moment. Hence, as a future line of work, we will apply these procedures to real data, such as public transport travel-time data.
Part III
Changing Organizational and Educational
Perspectives with Urban Big Data
Urban Informatics: Critical Data
and Technology Considerations
Abstract Cities around the world are investing significant resources toward
making themselves smarter. In most cases, investments focus on leveraging data
through emerging technologies that enable more real-time, automated, predictive,
and intelligent decision-making by agents (humans) and objects (devices) within
the city. Increasing the connectivity between the various systems and sub-systems
of the city through integrative data and information management is also a critical
undertaking towards making cities more intelligent. In this chapter, we frame cities
as platforms. Specifically, we focus on how data and technology management is
critical to the functioning of a city as an agile, adaptable, and scalable platform. The
objective of this chapter is to raise the reader's awareness of critical data and technology considerations that must still be addressed if we are to realize the full potential of urban informatics.
Keywords Smart cities • Data • Technology • Urban informatics • Open data • Big
data • Mobile data • Intelligent cities • Platform
1 Introduction
Urban planners around the world are investing significant resources towards making
their city’s infrastructure ‘smarter’ (or ‘more intelligent’) through leveraging data. By
2020, $400 billion a year will be invested in building smart cities (Marr 2015); and it is
estimated that by 2016, smart cities will become a $39.5 billion market (Stewart 2014)
and a $1 trillion market by 2020 (Perlroth 2015). Leveraging data and information—
the informatics element—is critical towards creating smarter cities. Urban informatics
is “the study, design, and practice of urban experiences across different urban contexts
that are created by new opportunities for real-time, ubiquitous technology and the
augmentation that mediates the physical and digital layers of people, networks, and
urban infrastructures” (Foth et al. 2011, p. 4).
A focus on urban informatics requires us to consider how the planning, design, governance, and management of cities can be advanced through the innovative use of information technologies (ITs). Towards this end, it is important to focus on
both the information and the technical elements of cities. Cities generate and
disseminate large quantities of data and information. While most of the data and
information flows through technical systems, the social (human) component is also
critical. Hollands (2008) argues that smart cities are ‘wired cities’ that include both
human and technical infrastructures. In this regard, employing ITs alone is not
sufficient to transform cities; rather, how humans (social) leverage technologies for
managing and governing in the urban sphere makes cities smarter (Kitchin 2014).
The concept of smart cities promotes the idea of creating a city where diverse stakeholders leverage interconnected technologies to create innovative solutions and address complex urban problems. Today, anyone can participate in the design and building of solutions for cities. For example, Deck5, a software company, created the 312 app to help people new to Chicago navigate the city. The app uses GPS and open source data from Chicago's city portal to let users obtain information about city neighborhoods; additionally, and more importantly, it was created to promote culture and engagement by letting people know about interesting things happening in other neighborhoods (Pollock 2013).
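As a sketch of the mechanics involved, the snippet below shows how an app of this kind might pull neighborhood records from a Socrata-style open data portal such as Chicago's. The dataset identifier and field names are placeholders for illustration, not the actual resources the 312 app queries.

```python
import requests

# Hypothetical Socrata-style endpoint; the dataset id ("EXAMPLE-ID") and
# the field names below are placeholders, not the portal's real resources.
URL = "https://data.cityofchicago.org/resource/EXAMPLE-ID.json"

def fetch_events(neighborhood, limit=20):
    """Return up to `limit` records, filtered server-side by neighborhood."""
    params = {"$limit": limit, "$where": f"community_area = '{neighborhood}'"}
    resp = requests.get(URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

for event in fetch_events("HYDE PARK"):
    print(event.get("title", "(untitled)"), event.get("date", ""))
```

The point is how low the barrier is: a public endpoint, a filter, and a few lines of client code are enough to build a civic application on top of a city's data.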
Technology enthusiasm has opened up new avenues for citizens and organiza-
tions to actively participate in the transformation of their urban spaces. However, as
we know, technology is no panacea. Several challenges remain in our quest towards
having smarter cities. First, from a fabric and structural perspective, cities need to be open and capable of integrating multiple solutions. This is no easy feat, as issues of standardization, alignment of interests, and even compatibility of solutions play major roles here. Second, cities are comprised of multiple systems (and sub-systems) that are often not optimally integrated. Given the significant legacy issues that need to be managed (i.e. historic investments made in the design and operation of these systems) and the pace of technological innovation, cities often face a conundrum: whether to completely abandon existing solutions and design with a clean slate, or to retrofit and tweak old systems. Third, new technology-enabled economic models are being
introduced into cities that fundamentally impact how assets and infrastructure
within cities are utilized. Consider the impact of new technologies such as sharing
economy mobile apps like Airbnb and Uber that modify longstanding, traditional
government roles such as tax collecting, regulating, and protecting citizens
(Desouza et al. 2015). Cities have to continuously adapt their governance and business models. However, most cities have age-old bureaucracies that must be navigated and that stifle rapid innovation. Fourth, we must always remember that significant access issues, and variances in how citizens use technology, need to be accounted for when designing solutions for urban management.
In this chapter, we bring to the forefront several issues that lie at the intersection of data and technology as they pertain to urban informatics. To do so, we frame cities as platforms.¹

¹ For more information, see Desouza (2012a, b, 2014a, b, c); Desouza and Bhagwatwar (2012a, b, 2014); Desouza and Flanery (2013); Desouza and Schilling (2012); Desouza and Simons (2014); Desouza and Smith (2014a); Desouza et al. (2014); Desouza et al. (2015).
2 Emerging Technologies
Emerging technologies will increasingly connect cities and citizens through more
access and more data. For instance, Facebook recently announced that they are
piloting a drone that will provide Internet access around the world, even in its most remote pockets (Goel and Hardy 2015). Through greater access, more
data will be generated and collected on citizens’ actions, by both private and public
organizations. The joining of access and data will enable cities to become ‘smarter’
and even more interconnected. In addition, the rise of automation, greater real-time data collection, and predictive analytic capabilities is significantly modifying who makes decisions and how decisions are made.
Let us consider one emerging technology, automated vehicles (AVs). Car industry leaders predict that AVs will enter the market within the next 10 years (Mack 2014; Cheng 2014). These vehicles are driven entirely by monitors, sensors, and an intelligent system that decides, without human input, what speed to drive, when to turn, and when to stop: all actions that, in a vehicle, can be a matter of life and death. For instance, an AV that Google has been testing demonstrated the power of removing human decision-making from driving. At a light, three cars, two human-driven and one AV, were preparing to proceed through a green light when a cyclist coming from another direction ignored a red light and tried to speed through it (Gardner 2015). The AV's sensors picked up the cyclist's speed, and the AV held back at the green light while the two human drivers proceeded through the intersection, forcing the cyclist to swerve. AVs are designed to override human decisions and, more importantly, human mistakes. As a technology, AVs will impact not only how we traverse physical spaces within cities but also their social and economic dimensions. With the adoption of AVs, the number of cars on the road is likely to fall significantly because, in a sharing economy, fewer people will own cars, opting instead for more economical options such as ridesharing. Morgan Stanley estimated that AVs will help the U.S. realize $1.3 trillion in annual savings by decreasing accident and fuel consumption costs while increasing productivity (Morgan Stanley 2013). PricewaterhouseCoopers predicts that when AVs are commonplace in the U.S., the number of cars will fall from 245 million to 2.4 million (PWC 2013).
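The cyclist scenario reduces to a time-to-conflict comparison, which a toy sketch can illustrate. This is in no way Google's actual control logic, merely the shape of the rule: withhold proceeding on green whenever a crossing object's predicted arrival window at the conflict point overlaps the vehicle's own.

```python
def arrival_window(distance_m, speed_ms, margin_s=1.5):
    """Predicted (earliest, latest) arrival time at the conflict point,
    padded by a safety margin. Returns None if not approaching."""
    if speed_ms <= 0:
        return None
    t = distance_m / speed_ms
    return (t - margin_s, t + margin_s)

def safe_to_proceed(ego, crossing_objects):
    """Proceed on green only if no crossing object's window overlaps ours."""
    ego_window = arrival_window(*ego)
    for obj in crossing_objects:
        w = arrival_window(*obj)
        if w and ego_window and w[0] < ego_window[1] and ego_window[0] < w[1]:
            return False      # overlapping arrival windows: yield
    return True

# Ego car 10 m from the crossing at 5 m/s; a cyclist 12 m away at 6 m/s,
# running the red light -- the arrival windows overlap, so the car waits.
print(safe_to_proceed((10, 5), [(12, 6)]))   # False
```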
Further, vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) technolo-
gies offer the capacity for vehicles to communicate with other vehicles and infra-
structure to increase safety, improve mobility, and lessen environmental
degradation (U.S. Department of Transportation n.d.). For instance, V2I capabili-
ties could alert drivers to crashes or closed roads while redirecting them to other
paths, reducing traffic congestion, lowering carbon emissions, and increasing fuel-efficiency. It could, in turn, also reduce the need for public services such as police, fire, and rescue to manage accidents and traffic. Since excess fuel costs and road congestion cost upwards of $100 billion in the U.S. annually, this type of efficiency creates important benefits for cities (Dobbs et al. 2013). In
Glasgow, Scotland, intelligent streetlights and networks of sensors and CCTV cameras have been installed around the city to increase wireless monitoring and connectivity (Macdonell 2015). Data from road sensors feed into analytical engines that adjust traffic lights to reduce bottlenecks. Further, detailed maps of the city, with outlined areas for walking tours and cycling, are made available. Glasgow's smart city
technology is even used by citizens for social services. Glasgow's citizens have low life expectancy: a 2011 study found that they are 30 % more likely to die young than their counterparts in Manchester and Liverpool, with 60 % of these premature deaths due to excesses of alcohol, violence, and suicide (Glasgow Centre for Population Health 2011). One example of smart city connectivity helping here is giving recovering alcoholics access to city smart maps that navigate them away from harmful places like bars and clubs.
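The feedback loop from road sensors to signal timing can be pictured with a deliberately simplified rule: allocate the green time within a fixed cycle in proportion to sensed queue lengths. Real signal-control systems are far more sophisticated; this sketch only illustrates the idea.

```python
def green_splits(queue_counts, cycle_s=90, min_green_s=10):
    """Split a fixed signal cycle among approaches in proportion to the
    vehicle counts reported by road sensors, respecting a minimum green."""
    n = len(queue_counts)
    spare = cycle_s - n * min_green_s     # time left after minimum greens
    total = sum(queue_counts) or 1        # avoid divide-by-zero
    return [min_green_s + spare * q / total for q in queue_counts]

# Sensors report queue lengths on four approaches of a junction:
print(green_splits([24, 6, 12, 3]))   # the longest queue gets the most green
```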
Emerging technologies are fundamentally altering the economic models within
cities (Desouza et al. 2015). For instance, peer-to-peer apps and on-demand services are designed to exchange goods directly without an intermediary. The on-demand economy allows consumers to place demands for goods and/or services that a freelancer immediately fulfills. For instance, TaskRabbit is an on-demand app that lets consumers post small tasks they need completed (e.g. house cleaning, handyman work, delivery, parties and events) and the price they are willing to pay, and freelancers bid to complete the job. Peer-to-peer
and on-demand technologies have already spurred a new direction in employment.
A recent report revealed that 53 million Americans are working as freelancers
(Horowitz and Rosati 2014). Many of these freelancers are people that use online
platforms such as Uber, TaskRabbit, and Amazon Mechanical Turk to find new
contracts and tasks. However, despite their benefits, these jobs lack employee benefits such as insurance and retirement plans, which leaves urban planners to figure out how to operate amidst these new realities (Howard 2015).
Concerns are plentiful as cities are becoming more connected and guided by
emerging technologies. Issues of dependency, security, safety, and privacy are
non-trivial. In 2015, it was discovered that several Chrysler car models were vulnerable to hacking over the Internet. That is, hackers could remotely cut the brakes, shut down a car's engine, and even drive a car off the road (Pagliery 2015). Hackers gained this access to the vehicles through a wireless service called Uconnect that connects the cars to the Sprint network. This happened because Chrysler unwittingly left a communication channel open that granted outside access to car controls. As a result, Chrysler recalled 1.4 million cars and trucks to update software protection to prevent hacking.
Additionally, emerging technologies are released and consumed so quickly by
citizens that governance and regulatory procedures have not kept pace with them.
One such technology is the drone, or unmanned aerial vehicle (UAV). Traditionally, drones have been used in military and special operations for purposes such as reconnaissance, armed attacks, and research. Now, drones are available to businesses, media, and citizens for varied purposes: in 2014, $14 million worth of drones were traded on eBay alone (Alistar 2014; Desouza et al. 2015). Drones are now being used in urban spaces and causing problems for citizens and cities. For
instance, the U.S. Federal Aviation Administration revealed that there were 25 near collisions between drones and aircraft in 2014. However, the agency has fined only five people for illegal use, owing to undefined regulations (Whitlock 2015). In another incident, a Seattle woman claimed that a drone was looking into her window, but law enforcement could not legally do anything because she was unable to prove it (Bever 2014). Because adequate rules and regulations for drones have not been developed, city administrators' and urban planners' hands are tied when attempting to protect citizens' privacy (Queally 2014); an issue sure to persist as more technology becomes accessible to the masses.
3 Data
Data are invaluable aspects of smart city functioning. Data can originate from
varied sources in both predictable and emergent manners with both intended and
unintended consequences. Three data movements have become essential for smart
cities: open, big, and mobile data.
The focus of open data efforts is to promote transparency and increase civic
participation (Noveck 2012). Open data movements are becoming more prevalent
all across the globe (Davies 2013; Manyika et al. 2013). Public agencies are
increasingly moving towards ‘open by default’ mandates for their public data.
During the 2013 G8 summit, the US, the UK, France, Canada, Germany, Russia,
Italy, and Japan signed the Open Data Charter to increase data sharing efforts,
improve the quality of data, and consolidate efforts to build a data repository (Sinai
and Martin 2013). Agencies are making data available to the public about all facets of a city, from transit to crime, and are in addition liberating data that were traditionally locked up within administrative systems. Open data efforts have
spurred the development of (a) mobile applications to navigate urban spaces and
public institutions (Desouza and Bhagwatwar 2012a), (b) incentivized innovation
competitions/contests (Mergel and Desouza 2013; Desouza 2012a), and
(c) information system innovations to realize administrative and process
efficiencies.
However, open data has its drawbacks. Even though open data policies have largely been developed to enhance transparency, open data can misguide the public, either purposely or inadvertently. Open data that misguides can often be traced to incomplete data or data that is poorly curated for the reader. Desouza and Smith (2014c) discuss this as the 'transparency tragedy', where data is of such low quality that drawing any conclusions from it is extremely difficult. Further, the release and use of open data can create or sustain inequalities. For instance, predictive policing is the use of data and analytics to find patterns in criminal activity in order to proactively fight crime. While decent in its intention, the ethicality of using data to profile individuals who have previously committed crimes (especially without their knowledge) is questionable at best and ethically wrong at worst (Desouza and Smith 2014a).
The primary focus of big data efforts is to build a capacity for predictive analytics that promotes real-time sensing of environments to increase situational awareness, enables automated and intelligent decision-making, and secures administrative and process efficiencies through information management. Public agencies around the world are investing in big data analytics (Manyika et al. 2011). In 2012, the U.S. announced plans to invest $200 million in developing big data analytics in the public sector (U.S. Office of Science and Technology Policy 2012). The U.K. Government has entered into a partnership with IBM to invest £113 million (approximately $175.8 million) into the Hartree Centre for big data analytics over the next 5 years (Consultancy.uk 2015). Further, the widespread use of social networking sites such as Facebook, Twitter, and YouTube is also producing unstructured data
streams. Individuals and organizations are increasingly using these sites to connect,
exchange, share, and learn about information. Cities are increasingly using social
media sites to disseminate information and engage people (Oliveira and Welch
2013). Data produced during this process provides rich situational and behavioral
information in real-time (Harrison et al. 2010).
Big data efforts have led to innovations in (a) urban modeling of systems,
(b) algorithmic regulations, and (c) management of public agencies. Big data
analytics helped the New York City Department of Environmental Protection crack down on a long-standing problem of restaurants illegally dumping cooking oils and grease into neighborhood sewers. To find the culprits, the research team used data from a city agency that certifies whether local restaurants employ a service to haul away their grease. With a few calculations, the team cross-referenced restaurants that did not employ a grease hauler with geo-spatial sewer data to come up with a list of possible culprits. The analysis resulted in a 95 % success rate in collaring dumpers (Feuer 2013).
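Analytically, this is essentially an anti-join followed by a spatial match: restaurants with no certified hauler on file, intersected with sewer segments showing grease blockages. A rough pandas sketch with hypothetical column names (the published account does not give the actual schema):

```python
import pandas as pd

# Hypothetical schemas for illustration
restaurants = pd.DataFrame({
    "permit_id": [1, 2, 3, 4],
    "sewer_segment": ["A", "A", "B", "C"],
})
haulers = pd.DataFrame({"permit_id": [1, 3]})            # certified grease haulers
blockages = pd.DataFrame({"sewer_segment": ["A", "C"]})  # greasy sewer segments

# Anti-join: restaurants with no hauler certificate on file
no_hauler = restaurants[~restaurants["permit_id"].isin(haulers["permit_id"])]

# Intersect with segments where grease blockages were reported
suspects = no_hauler.merge(blockages, on="sewer_segment")
print(suspects)   # permit 2 (segment A) and permit 4 (segment C)
```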
The focus of mobile data efforts is to improve connectivity and reach remote
populations. Public agencies are leveraging mobile technologies to provide
e-services, engage citizens, and track urban movement (OECD 2011). In the first quarter of 2015, worldwide mobile subscriptions
reached 7.2 billion; Africa and Asia account for three quarters of added mobile
subscriptions (Pornwasin 2015). Mobile phones are increasingly becoming an
integral part of individuals’ lives, especially in developing countries. By tapping
into mobile networks, cities can reach people at a lower cost and even in areas that
lack infrastructure (Rotberg and Aker 2013). Mobile data efforts provide opportunities to (a) improve connectivity between agents and organizations, (b) empower greater public participation, (c) enable effective navigation of the urban space, both physically and in terms of resources, and (d) personalize services based on time, location, and behavioral contexts (Desouza and Smith 2014b; Desouza 2014c). The State of California is using mobile data to encourage citizens to drive safely and avoid driving while impaired. Statewide, it released the Be My Designated Driver (BeMyDD) app, which gives discounts and exclusive offers to designated drivers as they wait to drive inebriated persons home (Tatro 2014). The app's developers plan to expand their services by offering cheap driver deals not just for late-night partying but also for other needs such as airport transportation and medical visits.
Fig. 1 Conceptualizing city as a platform

In New York City, the waste management company Bigbelly is turning trashcans and recycling bins into Wi-Fi hotspots. The company upgraded two of its recycling stations in Manhattan into Wi-Fi hotspots, and city recycling bins and trashcans have been converted into free public Wi-Fi hotspots with speeds ranging from 50 to 75 megabits per second. The company plans to convert recycling stations into Internet hotspots in all five New York boroughs, especially in low-income neighborhoods. Further, it is anticipated that, by doubling as advertising displays, the bins will generate revenue to cover the Internet access. Moreover, Bigbelly's recycling bins are already equipped with sensors that communicate with the city's department of waste management and with solar panels that power the compaction of certain materials (McDonald 2015).
Thus, a city as an architectural platform provides avenues to integrate and build
on existing technical components.
Further, a city as a platform has the capacity to exploit positive interactions between social and technical systems to create network effects (Evans et al. 2006). Cities can open up their data reservoirs to encourage people to create apps that help solve urban challenges. By providing APIs, cities can generate
economic and social value. For instance, BrightScope, a financial information
company, helps individuals and corporations make better decisions about wealth
management and retirement plans. Using 401(k) plan data made available by the Department of Labor and other government agencies, BrightScope advises the public on retirement plans. Using this business model, BrightScope has raised
venture capital and expanded its workforce (Noveck 2011). Thus, cities can provide
avenues for people to leverage technologies for creating economic value. Further,
cities can nurture and develop linkages between all three spheres for generating new
products and accommodating diverse user needs.
There is no doubt that emerging technologies and the deluge of data have the potential to transform cities into smart, livable, sustainable, and resilient places.
Policymakers, big technology firms, and NGOs are excited about the potential of
emerging technologies in addressing complex urban challenges. While significant
resources are invested in deploying technologies for building and transforming
cities, it is important to consider how interaction between social and technical
systems will present emergent challenges and issues for cities. Further, cities
have legacies of governance, management, and culture that create path dependen-
cies that dictate how emerging technologies will be embedded into the urban
sphere. To truly realize the objective of developing smart cities, we need to
critically evaluate, understand, and address challenges associated with data and
technology. These challenges can be broadly classified into technical, social, and
socio-technical. Table 1 highlights these key challenges faced by cities as they
transform into becoming smart.
Table 1 (continued)

Issue: Do not forget small innovation and developing countries
Data and technology challenges: Developing solutions that match the needs of the local marketplace is critical. Cities in developing countries are experiencing challenges when it comes to adopting solutions designed for smart cities in developed economies. While frugal innovation offers solutions based on environmental constraints, cities have yet to understand the value of frugal innovation. Gaining social support for frugal innovation projects is critical.
An assault on a California power station raised alarm that, if this type of attack happened across the country, it could potentially take down the electric grid and cause a nationwide blackout (Smith 2014). Moreover, pervasive and interconnected technologies make cities more vulnerable to cyber attacks. In 2014, Cesar Cerrudo, Chief Security Officer at IOActive Labs, showed that 200,000 traffic sensors installed in major cities in the U.S., Australia, and France were unprotected and vulnerable to cyber attacks. He demonstrated that these sensors could be intercepted from 1500 ft away and that, because they were unencrypted, a hacker could potentially use drones to attack these systems. When he conducted a test in San Francisco in 2015, he found that the city had yet to encrypt its traffic signals (Perlroth 2015).
In addition to cities acquiring new technologies, citizens are also acquiring tech-
nologies of their own that are creating new challenges for cities. The unprecedented
growth of information and computation technologies has reduced the cost and
barriers to accessing emerging technologies.
Through the provision of data on activities and behaviors, we might increase the level of intelligent decision-making by individuals and organizations in a city. For example, the energy company OPOWER sends households energy bills that include information about the household's energy consumption in comparison to other similar households. The company found that this simple intervention, providing information about neighbors' energy consumption, resulted on average in a 2 % reduction in energy consumption. If scaled across the U.S., this program could reduce energy consumption and provide net benefits of $2.2 billion per year (Allcott and Mullainathan 2010).
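Part of the appeal of this intervention is that the underlying computation is trivial: compare each household's usage to the mean of similar households. A hedged sketch of that comparison (not OPOWER's proprietary method):

```python
import pandas as pd

# Monthly kWh for households grouped by a similarity key (e.g. home size band)
usage = pd.DataFrame({
    "household": ["h1", "h2", "h3", "h4"],
    "band":      ["small", "small", "large", "large"],
    "kwh":       [620, 540, 980, 1210],
})

usage["peer_mean"] = usage.groupby("band")["kwh"].transform("mean")
usage["vs_peers_pct"] = 100 * (usage["kwh"] - usage["peer_mean"]) / usage["peer_mean"]

for _, row in usage.iterrows():
    tag = "above" if row.vs_peers_pct > 0 else "below"
    print(f"{row.household}: {abs(row.vs_peers_pct):.0f}% {tag} similar homes")
```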
Yet, we have to guard against the unintended consequences of publicizing and sharing data. For example, in the U.S., homeowners are increasingly seeking out private firms that offer one-on-one advice when it comes to buying homes. Individuals report that these firms provide accurate information about localities based on customer preferences. While this is good news for people searching for homes, it raises challenges for cities. Residents have a fundamental right to information; however, some information may create situations where residents are likely to choose "people like us" (Pervost 2014). This was illustrated in London, when a map was created that overlaid the last names of residents on top of the city's geography; the map revealed a clear picture of the racial landscape of London (Alistar 2014).
While planners can’t manage all of the uses of the data it releases, they must
think deeply about how to mitigate the challenges that might weaken the social
system through data. Open data was created for transparency purposes but that
transparency can also create brand new challenges for cities. For instance, the
public transportation system BART in the Bay Area has a mobile security app
that allows riders to send text message and photo alerts to BART police about
crimes and non-crimes happening on their ride. It was found that riders were
disproportionately sending messages to report African Americans and the home-
less. In further analysis of the complaints, out of 763 alerts, 198 included some
mention of the alleged offender’s African American race while only 37 alerts
mentioned the race of white alleged offenders (BondGraham 2015). The criminal-
ization of the homeless and African Americans could result in more arrests based on
these text alerts.
The City as Living Laboratory for Sustainability in Urban Design (CaLL) piloted a project in New York City to assess how people connect with their urban environment. The project explored how art can facilitate conversations between citizens and scientists about sustainable choices. The CaLL team installed 'tactic art' in Montefiore Park, New York, encouraging people to ponder the connections between the city and its critical infrastructures, such as streetlights, hydrants, and manhole covers. The project aimed to assess how the people who shape and use the urban environment think about their role in building sustainable cities (Fraser and Miss 2012).
In many cases, cities lack the capacity to effectively plan, manage, and implement
large scale IT projects. Public agencies have poor track records when it comes to
planning and implementing large scale IT projects. Many projects have overrun
costs and time, and worse, some are abandoned after spending significant public
resources. Consider these examples: the Boston Big Dig is one of the most
expensive highway projects, costing $14 billion and taking 32 years to complete.
Initially, the project was estimated to cost $2.4 billion; in addition to cost and time overruns, it experienced several construction flaws (Hofherr 2015). The Seattle Monorail Project (SMP) was shut down in 2005 after 3 years of planning and research (Yuttapongsontorn et al. 2008). In 2005, Victoria State in Australia began a project to develop smartcard-ticketing systems. The government awarded $500 million to the Keane Australia Micropayment Consortium (Kamco) and expected to launch the new system on March 1, 2007; the project experienced delays and ran $500 million over budget (Charette 2010). We need to improve our track record with urban IT projects if we are to stand any chance of realizing the true potential of technological and social innovations at scale.
Citizens are increasingly concerned about privacy and security issues; at the same time, people are willing to share personal information in return for personalized services, such as through apps and Internet usage. The critical question for urban planners is how to manage the trade-off between offering personalized services and respecting privacy concerns. Are people willing to share information in return for personalized services? How much personal information can be collected?
can be collected? For instance, smart infrastructures such as smart metering,
electronic tolling, and smart parking offer several benefits to citizens. Citizens
can use these infrastructures to manage travel, monitor consumption, save time,
etc. At the same time, this information can be used to predict patterns about user
behavior. Electronic toll collection programs such as EZPass allow users to purchase toll passes online, avoiding hassle and improving the travel experience. However, the information collected about user travel can reveal where people live, how often they travel, and so on (Humphries 2013). While concerns about security and privacy
are critical, collecting more information and data about people and their activities is
also critical for developing personalized solutions.
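To see how easily such records reveal residence patterns, consider a hedged sketch: the toll plaza a transponder crosses most often in the early morning is a strong hint of where its owner lives. The log format below is hypothetical.

```python
import pandas as pd

# Hypothetical toll-crossing log for one transponder
log = pd.DataFrame({
    "plaza": ["P12", "P12", "P7", "P12", "P7"],
    "timestamp": pd.to_datetime([
        "2015-06-01 07:40", "2015-06-02 07:35", "2015-06-02 18:10",
        "2015-06-03 07:50", "2015-06-03 18:05",
    ]),
})

morning = log[log["timestamp"].dt.hour < 9]          # pre-9am crossings
likely_home_plaza = morning["plaza"].mode().iloc[0]  # most frequent plaza
print("first plaza crossed on most mornings:", likely_home_plaza)  # P12
```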
We are witnessing a shift in data ownership in which private companies collect, manage, and analyze urban data. Inrix, a global transportation provider, has launched a program to track the movement of connected cars. The company uses GPS data from 250 million cars and devices to collect data about people's movement around a city. Inrix data can provide insight into how many vehicles pass through a location and at what time. Cities can use this data to (a) understand and predict population movement across urban space and (b) plan and prioritize transit and their city's transportation infrastructure (Traffic Technology Today.com 2015). Waze is
a mobile application that allows users to create and use live maps and real-time
traffic updates for navigation. Waze collects a wide range of data about its users, including date, origin and destination, route, and speed. Additionally, users can report accidents, road conditions, and speed traps. The City of Rio de Janeiro,
Brazil and the City of Jakarta are combining their own traffic data with Waze data to
gain better situational awareness of their roads and citizen safety. Cities use these
incident reports and real-time updates to repair roads, divert traffic flows, and
prepare emergency responses (Dembo 2014).
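Once probe pings are map-matched to road segments, the volume-by-time insight described above is a one-line aggregation. A sketch with hypothetical fields:

```python
import pandas as pd

# Hypothetical map-matched probe records: one row per vehicle pass
pings = pd.DataFrame({
    "segment": ["Main&5th"] * 5 + ["Main&6th"] * 3,
    "hour":    [8, 8, 9, 17, 17, 8, 9, 17],
})

volumes = pings.groupby(["segment", "hour"]).size().rename("vehicles")
print(volumes)   # vehicle passes per segment per hour of day
```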
For smart cities, access to reliable and accurate data about people’s movement
offers an unprecedented advantage to design urban spaces and address complex
urban challenges such as transportation, energy, and water consumption. At the
same time, these partnerships raise critical questions about data ownership and
access (Smith and Desouza 2015). Traditionally, public agencies collected information and data about citizens for urban planning and governing purposes. However, we are now witnessing a trend in which private companies such as mobile operators, social networking sites, and smartphone apps collect more data and information about people's everyday activities (Leber 2013). For instance, whereas cities could once tap into taxi cab operations to understand travel patterns through the reports taxi drivers must file, they now miss a whole segment of the population, because many citizens use Uber or Lyft to get around town, and these private companies are not required to share travel information with government. Cities need to collaborate with private companies to obtain data about urban use. In some cases these partnerships have led to improvements in service delivery, while in others they have increased concerns about data insecurity. The success of these efforts will
depend upon developing rules, norms, practices, and new partnerships.
Urban planners are increasingly developing partnerships with businesses to
provide urban services that meet the demands of citizens. Clearly, developing data partnerships and designing data-sharing contracts is new territory for planners. These partnerships present several opportunities and challenges to urban planners, and we do not yet know how cities can develop lasting partnerships and avoid losses (Smith and Desouza 2015). Urban planners will need to share their experiences of developing and managing partnerships with private companies.
While cities are making significant efforts towards developing partnerships with private companies, issues of inequality and access become central concerns for cities investing in technology-driven solutions to urban challenges. Access to technologies often dictates whose voices are heard. Many smart city initiatives are built on the assumption that citizens have access to smartphones or the Internet. Yet one has to be mindful when infusing technology without care for issues such as access, the knowledge needed to use it, adoption rates across segments of the population, and so on. Incidents such as Boston's Street Bump app clearly indicate that access to technologies determines whose complaints are heard. When the city of Boston introduced the Street Bump app, which automatically detects potholes and sends reports to city administrators, it found that the program directed crews mostly to wealthy neighborhoods because those residents were more likely to have access to smartphones (Rampton 2014).
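Street Bump's core idea is simple enough to sketch: flag a pothole candidate whenever the phone's vertical acceleration spikes past a threshold, and geo-tag the event. The threshold and data layout below are assumptions for illustration, not the app's actual parameters.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    lat: float
    lon: float
    accel_z: float   # vertical acceleration, m/s^2, gravity removed

BUMP_THRESHOLD = 4.0  # assumed; the real app tunes this empirically

def pothole_candidates(readings):
    """Geo-tag readings whose vertical jolt exceeds the threshold."""
    return [(r.lat, r.lon) for r in readings if abs(r.accel_z) > BUMP_THRESHOLD]

stream = [Reading(42.36, -71.06, 0.8), Reading(42.37, -71.05, 5.1)]
print(pothole_candidates(stream))   # [(42.37, -71.05)]
```

Note that nothing in this logic corrects for who carries the phone, which is precisely why the reports skewed toward smartphone-rich neighborhoods.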
The considerations outlined above raise several questions for future research. Exploring these questions will provide further insights into how cities can leverage emerging technologies to be transformed into smart cities. Urban planners need to pay attention to, and experiment with, different approaches rather than focusing on technologies as the solution to urban challenges. For instance, the concept of the smart grid has gained traction around the world as a means of reducing energy consumption. However, recent research has revealed that installing cogeneration technologies and solar heat pumps, and improving building insulation, can potentially reduce New York's carbon emissions by 50 %. Old-school solutions may provide outcomes similar to those of expensive smart technologies (Hammer 2010).
Further, evidence from behavioral economics experiments such as the OPOWER example suggests that simple behavioral tweaks can produce returns on investment comparable to R&D subsidies (Allcott and Mullainathan 2010). As this evidence suggests, blindly investing in technologies will not by itself produce cities that are livable, sustainable, and resilient. Further research is needed to explore and examine how these behavioral insights can be scaled up to national levels to promote alternative measures for making cities livable and sustainable.
Given the magnitude of urban challenges, cities need to develop partnerships
with businesses, universities, and NGOs for effectively addressing pressing chal-
lenges of urban areas. Private companies are increasingly developing novel solutions. For example, the environmental health startup Aclima partnered with Google Earth Outreach and the Environmental Protection Agency (EPA) to equip Google Street View cars with sensors for collecting air quality data. In 2014, they piloted the program in Denver, where three cars collected over 150 million air quality data points in a month. Dan Costa, who directs the EPA's relevant research, noted that this initiative provides a perfect opportunity to update and advance the science of monitoring air quality (CNNMoney 2015). Further research is needed to understand how partnerships with private companies can help cities upgrade their traditional data collection strategies. Additionally, cities need to experiment with analytics to understand and explore how this improved information will aid urban planning and the development of sustainable cities.
Urban planners also need to think critically about the challenges associated with
developing technology-based solutions to urban challenges. They should also
carefully consider the issues of inequality and access that can arise when deploying
urban solutions. How can cities promote inclusive urban growth? One viable
solution could be investment in frugal innovation. Frugal innovations pay particular attention to environmental constraints and are well suited to addressing challenges faced by low-income families. As cities increasingly experience resource constraints (e.g. financial), they need to develop their capacity for finding inexpensive solutions with sustainable outcomes. They also need to be mindful of investing massive resources in technologies that could potentially disrupt cities' well-being (Roche et al. 2012).
7 Conclusion
In this chapter, we outlined the city as a platform consisting of social and technical
components and the interactions between them. Data and information act as the glue
that enables interaction, communication, and exchange between these components.
We have enumerated several critical considerations that merit further discussion,
debate, and scientific investigations to advance the field of urban informatics.
Urban planners must develop sophisticated technologies and enhance human
capacity for proactively addressing challenges associated with the application of
urban informatics.
References
Alistar (2014) Big data is our generation’s civil rights issue, and we don’t know it. [WWW
Document]. http://solveforinteresting.com/big-data-is-our-generations-civil-rights-issue-and-
we-dont-know-it/. Accessed 27 Jul 2015
Allcott H, Mullainathan S (2010) Behavioral science and energy policy. Science 327:1204–1205
Badger E (2015) Uber’s war with New York is so serious it’s giving out free hummus. [WWW
Document]. The Washington Post. http://www.washingtonpost.com/news/wonkblog/wp/2015/
07/21/ubers-war-with-new-york-is-so-serious-its-giving-out-free-hummus/. Accessed 27 Jul
2015
Baldwin CY, Woodard CJ (2009) The architecture of platforms: a unified view. Harvard Business
School Finance Working Paper
Batty M (2007) Cities and complexity: understanding cities with cellular automata, agent-based
models, and fractals. The MIT press, Cambridge
Bever L (2014) Seattle woman spots drone outside her 26th-floor apartment window, feels
“violated.” The Washington Post. [WWW Document]. http://www.washingtonpost.com/
news/morning-mix/wp/2014/06/25/seattle-woman-spots-drone-outside-her-26th-floor-apart
ment-window-feels-violated/. Accessed 27 Jul 2015
BondGraham D (2015) BART riders racially profile via smartphone app [WWW Document]. East
Bay Express. URL http://www.eastbayexpress.com/oakland/bart-riders-racially-profile-via-
smartphone-app/Content?oid=4443628. Accessed 5 Aug 2015
Boyd D, Crawford K (2011) Six provocations for big data, a decade in internet time: symposium
on the dynamics of the internet and society. [WWW Document]. http://ssrn.com/
abstract=1926431. Accessed 27 Jul 2015
Charette R (2010) Australia’s AU$1.3 Billion Myki ticketing system introduction marred by
multiple missteps [WWW Document]. http://spectrum.ieee.org/riskfactor/computing/it/
australias-au13-billion-myki-ticketing-system-introduction-marred-by-multiple-missteps.
Accessed 27 Jul 2015
Cheng R (2014) General Motors President sees self-driving cars by 2020 [WWW Document].
CNET. http://www.cnet.com/news/general-motors-president-sees-self-driving-cars-by-2020/.
Accessed 5 Aug 2015
CNNMoney (2015) Google Street View cars will soon measure pollution [WWW Document].
CNNMoney. http://money.cnn.com/2015/07/30/technology/google-aclima-air-pollution/index.
html. Accessed 5 Aug 2015
Consultancy.uk (2015) UK GOV partners with IBM to boost Big Data research [WWW Document].
Consultancy.uk. http://www.consultancy.uk/news/2128/uk-gov-partners-with-ibm-to-boost-
big-data-research. Accessed 5 Aug 2015
Davies T (2013) Open Data Barometer 2013 global report. Open Data Barometer
De Blasio B (2015) A fair ride for New Yorkers: how the city should respond to the rapid rise of
Uber [WWW Document]. NY Daily News. http://www.nydailynews.com/opinion/bill-de-
blasio-fair-ride-new-yorkers-article-1.2296041. Accessed 25 Jul 2015
Dembo M (2014) The power of public-private partnerships: mobile phone apps and municipalities
[WWW Document]. Planetizen: The Urban Planning, Design, and Development Network.
http://www.planetizen.com/node/70934. Accessed 25 Jul 2015
Desouza KC (2012a) Leveraging the wisdom of crowds through participatory platforms: designing
and planning smart cities. Planetizen: planning, design & development
Desouza KC (2012b) Designing and planning for smart(er) cities. Pract Plann 10:12
Desouza KC (2014a) Our fragile emerging megacities: a focus on resilience [WWW Document].
Planetizen: the urban planning, design, and development network. http://www.planetizen.com/
node/67338. Accessed 27 Jul 2015
Desouza KC (2014b) Realizing the promise of big data | IBM Center for the Business of
Government. IBM Center for the Business of Government, Washington, DC
Desouza KC (2014c) Intelligent cities. In: Atlas of cities. Princeton University Press, Princeton, NJ
Desouza KC, Bhagwatwar A (2012a) Citizen apps to solve complex urban problems. J Urban
Technol 19:107–136
Desouza KC, Bhagwatwar A (2012b) Leveraging technologies in public agencies: the case of the
US Census Bureau and the 2010 Census. Public Adm Rev 72:605–614
Desouza KC, Bhagwatwar A (2014) Technology-enabled participatory platforms for civic engage-
ment: the case of US cities. J Urban Technol 21:25–50
Desouza KC, Flanery TH (2013) Designing, planning, and managing resilient cities: a conceptual
framework. Cities 35:89–99
Desouza KC, Schilling J (2012) Local sustainability planning: harnessing the power of informa-
tion technologies. PM Magazine 94
Desouza KC, Simons P (2014) Society for Information Management. Society for Information
Management - Advanced Practices Council, Mount Laurel, NJ
Desouza KC, Smith K (2014a) Big data for social innovation (SSIR). Stanford Soc Sci Rev
12:38–43
Desouza KC, Smith K (2014b) Finding a fair and equitable use of citizen data: the case of
predictive policing [WWW Document]. The Brookings Institution. http://www.brookings.
edu/blogs/techtank/posts/2014/10/15-police-citizens-data. Accessed 5 Aug 2015
Desouza KC, Smith K (2014c) The transparency tragedy of open data [WWW Document]. The
Brookings Institution. http://www.brookings.edu/blogs/techtank/posts/2014/11/5-transpar
ency-tragedy. Accessed 5 Aug 2015
Desouza KC, Swindell D, Koppell J, Smith K (2014) Funding smart technologies: tools for
analyzing strategic options. Smart Cities Council, Washington, DC
Desouza KC, Swindell D, Smith KL, Sutherl A, Fedorschak K, Coronel C (2015) Local govern-
ment 2035: strategic trends and implications of new technologies (No. 27). Issues in Technol-
ogy Innovation. Brookings, Washington, DC
Dobbs R, Pohl H, Lin D-Y, Mischke J, Garemo N, Hexter J, Matzinger S, Palter R, Nanavatty R
(2013) Infrastructure productivity: how to save $1 trillion a year. McKinsey Global Institute,
London
Evans DS, Hagiu A, Schmalensee R (2006) Invisible engines: how software platforms drive
innovation and transform industries. MIT Press, Cambridge, MA
Feuer A (2013) Mayor Bloomberg’s Geek Squad. The New York Times. [WWW Document].
http://www.nytimes.com/2013/03/24/nyregion/mayor-bloombergs-geek-squad.html?pagewanted=all. Accessed 27 Jul 2015
Florida R (2004) The rise of the creative class and how it’s transforming work, leisure, community
and everyday life (Paperback Ed.). Basic Books, New York
Foth M, Choi JH, Satchell C (2011) Urban informatics. In: Proceedings of the ACM 2011
conference on computer supported cooperative work. ACM, pp 1–8
Fraser J, Miss M (2012) City as living laboratory for sustainability in urban design. New
Knowledge Organization, New York
Gardner G (2015) Google tests self-driving cars in tricky situations [WWW Document].
Detroit Free Press. http://www.freep.com/story/money/2015/07/22/google-car-self-driving/
30514747/. Accessed 27 Jul 2015
Glasgow Centre for Population Health (2011) Scottish “excess” mortality: comparing Glasgow
with Liverpool and Manchester [WWW Document]. http://www.gcph.co.uk/work_themes/
theme_1_understanding_glasgows_health/excess_mortality_comparing_glasgow. Accessed
27 Jul 2015
Goel V, Hardy Q (2015) A Facebook project to beam data from drones is a step closer to flight. The
New York Times. [WWW Document]. http://www.nytimes.com/2015/07/31/technology/
facebook-drone-project-is-a-step-closer-to-flight.html?_r=0. Accessed 5 Aug 2015
Hammer S (2010) The smart grid may not be the smartest way to make cities sustainable [WWW
Document]. Harvard Business Review. https://hbr.org/2010/09/smart-energy-for-smart-cities.
Accessed 5 Aug 2015
Hardekopf B (2014) This week in credit card news: massive data breach at chase, the value of
stolen medical data [WWW Document]. Forbes. URL http://www.forbes.com/sites/
moneybuilder/2014/10/03/this-week-in-credit-card-news-massive-data-breach-at-chase-the-
value-of-stolen-medical-data/. Accessed 5 Aug 2015
Harrison C, Eckman B, Hamilton R, Hartswick P, Kalagnanam J, Paraszczak J, Williams P (2010)
Foundations for smarter cities. IBM J Res Dev 54:1–16
Heinimann HR (2015) Strengthening a city’s “backbone” [WWW Document]. The Straits Times.
http://www.straitstimes.com/opinion/strengthening-a-citys-backbone. Accessed 27 Jul 2015
Hofherr J (2015) Can we talk rationally about the big dig yet? [WWW Document]. Boston.com.
http://www.boston.com/cars/news-and-reviews/2015/01/05/can-talk-rationally-about-the-big-
dig-yet/0BPodDnlbNtsTEPFFc4i1O/story.html. Accessed 27 Aug 2015
Hollands RG (2008) Will the real smart city please stand up? Intelligent, progressive or entrepre-
neurial? City 12:303–320
Horowitz S, Rosati F (2014) 53 million Americans are freelancing, new survey finds [WWW
Document]. Freelancers Union. https://www.freelancersunion.org/blog/dispatches/2014/09/
04/53million/. Accessed 27 Jul 2015
Howard A (2015) How digital platforms like LinkedIn, Uber and TaskRabbit are changing the
on-demand economy [WWW Document]. The Huffington Post. http://www.huffingtonpost.
com/entry/online-talent-platforms_55a03545e4b0b8145f72ccf6. Accessed 25 Jul 2015
Humphries C (2013) The too-smart city [WWW Document]. The Boston Globe. https://www.
bostonglobe.com/ideas/2013/05/18/the-too-smart-city/q87J17qCLwrN90amZ5CoLI/story.
html. Accessed 25 Jul 2015
Jaconi M (2014) The “On-Demand Economy” is revolutionizing consumer behavior — here’s how
[WWW Document]. Business Insider. http://www.businessinsider.com/the-on-demand-econ
omy-2014-7. Accessed 25 Jul 2015
Keil P (2013) Data-driven science is a failure of imagination. [WWW Document]. http://www.
petrkeil.com/?p=302. Accessed 25 Jul 2015
Kitchin R (2014) The real-time city? Big data and smart urbanism. GeoJournal 79:1–14
Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google flu: traps in big data
analysis. Science 343:1203–1205. doi:10.1126/science.1248506
Leber J (2013) How Verizon and other wireless carriers are mining customer data [WWW
Document]. MIT Technology Review. http://www.technologyreview.com/news/513016/how-
wireless-carriers-are-monetizing-your-movements/. Accessed 25 Jul 2015
Loukides M (2010) What is data science? [WWW Document]. O’Reilly Media. https://beta.
oreilly.com/ideas/what-is-data-science. Accessed 5 Aug 2015
Macdonell H (2015) Glasgow: the making of a smart city [WWW Document]. The Guardian.
http://www.theguardian.com/public-leaders-network/2015/apr/21/glasgow-the-making-of-a-
smart-city. Accessed 27 Jul 2015
Mack E (2014) Elon Musk: don’t fall asleep at the wheel for another 5 years [WWW Document].
CNET. http://www.cnet.com/news/elon-musk-sees-autonomous-cars-ready-sooner-than-previ
ously-thought/. Accessed 5 Aug 2015
Madden M, Rainie L (2015) Americans’ attitudes about privacy, security and surveillance. Pew
Research Center: Internet, Science & Tech, Washington, DC
Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the
next frontier for innovation, competition, and productivity. McKinsey Global Institute,
London, [WWW Document] http://www.mckinsey.com/insights/mgi/research/technology_
and_innovation/big_data_the_next_frontier_for_innovation. Accessed 5 Aug 2015
Manyika J, Chui M, Farrell D, Kuiken SV, Groves P, Doshi EA (2013) Open data: unlocking
innovation and performance with liquid information. McKinsey Global Institute, London,
[WWW Document]. http://www.mckinsey.com/insights/business_technology/open_data_
unlocking_innovation_and_performance_with_liquid_information. Accessed 5 Aug 2015
Manyika J, Lund S, Robinson K, Valentino J, Dobbs R (2015) Connecting talent with opportunity
in the digital age. McKinsey & Company, London
Marr B (2015) How big data and the internet of things create smarter cities [WWW Document].
Forbes. http://www.forbes.com/sites/bernardmarr/2015/05/19/how-big-data-and-the-internet-
of-things-create-smarter-cities/. Accessed 23 Jul 2015
McDonald G (2015) NYC is turning trash cans into Wi-Fi hotspots [WWW Document].
DNews. http://news.discovery.com/tech/gear-and-gadgets/nyc-is-turning-trash-cans-into-wi-
fi-hotspots-150717.htm. Accessed 5 Aug 2015
Mergel I, Desouza KC (2013) Implementing open innovation in the public sector: the case of
Challenge.gov. Public Adm Rev 73:882–890
Morgan Stanley (2013) Autonomous cars: self-driving the new auto industry paradigm. Morgan
Stanley blue paper
Nam T, Pardo TA (2011) Conceptualizing smart city with dimensions of technology, people, and
institutions. In: Proceedings of the 12th annual international digital government research
conference: digital government innovation in challenging times. ACM, pp 282–291
Noveck B (2011) Why cutting E-Gov funding threatens American jobs [WWW Document].
Huffington Post. http://www.huffingtonpost.com/beth-simone-noveck/why-cutting-egov-
funding-_b_840430.html. Accessed 5 Aug 2015
Noveck B (2012) Open data - the democratic imperative [WWW Document]. Crooked Timber.
http://crookedtimber.org/2012/07/05/open-data-the-democratic-imperative/. Accessed 28 Jul
2015
OECD, International Telecommunication Union (2011) M-Government: mobile technologies for responsive governments and connected societies. OECD Publishing, Paris [WWW Document]. http://dx.doi.org/10.1787/9789264118706-en. Accessed 27 Jul 2015
Oliveira GHM, Welch EW (2013) Social media use in local government: linkage of technology,
task, and organizational context. Gov Inf Q 30:397–405
Pagliery J (2015) Chryslers can be hacked over the Internet [WWW Document]. CNN. http://
money.cnn.com/2015/07/21/technology/chrysler-hack/index.html?iid=ob_homepage_deskrecommended_pool&iid=obnetwork. Accessed 27 Jul 2015
Perlroth N (2015) Smart city technology may be vulnerable to hackers [WWW Document]. Bits
Blog. http://bits.blogs.nytimes.com/2015/04/21/smart-city-technology-may-be-vulnerable-to-
hackers/. Accessed 25 Jul 2015
Pervost L (2014) The data-driven home search. [WWW Document]. The New York Times. http://
www.nytimes.com/2014/07/20/realestate/using-data-to-find-a-new-york-suburb-that-fits.html.
Accessed 25 Jul 2015
Pollock M (2013) Five Chicago Apps that make city life a little less annoying [WWW Document].
Chicago magazine. http://www.chicagomag.com/city-life/October-2013/Chicago-App-Roundup/.
Accessed 5 Aug 2015
Urban Informatics: Critical Data and Technology Considerations 187
Pornwasin A (2015) The world is becoming more mobile and networked [WWW Document]. The
Nation. http://www.nationmultimedia.com/politics/The-world-is-becoming-more-mobile-and-
networked-30265200.html. Accessed 27 Jul 2015
PWC (2013) Autofacts. [WWW Document]. http://www.detroitchamber.com/wp-content/
uploads/2012/09/AutofactsAnalystNoteUSFeb2013FINAL.pdf. Accessed 5 Aug 2015
Queally J (2014) Seattle woman says drone wasn’t spying on her after all. Los Angeles Times.
WWW Document]. http://www.latimes.com/nation/nationnow/la-na-nn-seattle-drone-update-
20140625-story.html. Accessed 27 Jul 2015
Rampton R (2014) White House looks at how “Big Data” can discriminate. [WWW Document]
http://uk.reuters.com/article/2014/04/27/uk-usa-obama-privacy-idUKBREA3Q00S20140427.
Accessed 5 Aug 2015
Rotberg RI, Aker JC (2013) Mobile phones: uplifting weak and failed states. Wash Q 36:111–125
Roche S, Nabian N, Kloeckl K, Ratti C (2012) Are “smart cities” smart enough. Presented at the
Global Geospatial Conference 2012, Spatially Enabling Government, Industry and Citizens,
Québec City, Canada, pp. 215–235
Sarver R (2013) What is a Platform? Ryan Sarver
Sinai N, Martin M (2013) Open data going global [WWW Document]. The White House. http://
www.whitehouse.gov/blog/2013/06/19/open-data-going-global Accessed 28 Jul 2015
Smith R (2014) Assault on California Power Station raises alarm on potential for terrorism. [WWW
Document]. Wall Street J. http://www.wsj.com/articles/SB10001424052702304851104579
359141941621778. Accessed 23 Jul 2015
Smith K, Desouza KC (2015) How data privatization will change planning practice [WWW
Document]. Planetizen: the urban planning, design, and development network. http://www.
planetizen.com/node/79680/how-data-privatization-will-change-planning-practice. Accessed
25 Jul 2015
Stewart E (2014) A truly smart city is more than sensors big an all-seeing internet [WWW
Document]. The Guardian. http://www.theguardian.com/sustainable-business/2014/nov/21/
smart-city-sensors-big-data-internet. Accessed 23 Jul 2015
Tatro S (2014) New App rewards designated sober drivers [WWW Document]. NBC 7 San Diego.
http://www.nbcsandiego.com/news/local/New-App-Rewards-Designated-Drivers-286583491.
html. Accessed 5 Aug 2015
Tiwana A, Konsynski B, Bush AA (2010) Research commentary-platform evolution: coevolution
of platform architecture, governance, and environmental dynamics. Inf Syst Res 21:675–687
Townsend AM (2013) SMART CITIES: big data, civic hackers, and the quest for a new utopia
[WWW Document]. Stanford Social Science Review. http://www.ssireview.org/articles/entry/
smart_cities_big_data_civic_hackers_and_the_quest_for_a_new_utopia. Accessed 27 Jul
2015
Traffic Technology Today.com (2015) New “Big Data” analytics platform can aid urban planning
[WWW Document]. Traffic Technology Today.com. http://www.traffictechnologytoday.com/
news.php?NewsID¼68829. Accessed 25 Jul 2015
U.S. Department of Transportation (n.d.) The vehicle-to-vehicle and vehicle-to-infrastructure
technology Test Bed – Test Bed 2.0: Available for Device and Application Development
U.S. Office of Science and Technology Policy (2012) Obama administration unveils “Big Data”
initiative: announces $200 million in new R&D investments [WWW Document]. https://www.
whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release.pdf. Accessed 27 Jul
2015
Whitlock C (2015) How crashing drones are exposing secrets about U.S. war operations. The -
Washington Post. [WWW Document]. https://www.washingtonpost.com/world/national-secu
rity/how-crashing-drones-are-exposing-secrets-about-us-war-operations/2015/03/24/e89
ed940-d197-11e4-8fce-3941fc548f1c_story.html. Accessed 27 Jul 2015
188 R. Krishnamurthy et al.
Williams A, Robles E, Dourish P (2009) Urbane-ing the city: examining and refining the
assumptions behind urban informatics. In: Handbook of research on urban informatics: the
practice and promise of the real-time city. IGI, Hershey, PA, pp 1–20
Yigitcanlar T, Velibeyoglu K (2008) Knowledge-based urban development: the local economic
development path of Brisbane, Australia. Local Econ 23:195–207
Yuttapongsontorn N, Desouza KC, Braganza A (2008) Complexities of large-scale technology
project failure: a forensic analysis of the Seattle popular monorail authority. Public Perform
Manage Rev 31:443–478
Digital Infomediaries and Civic Hacking
in Emerging Urban Data Initiatives
Abstract This paper assesses non-traditional urban digital infomediaries who are
pushing the agenda of urban Big Data and Open Data. Our analysis identified a mix
of private, public, non-profit and informal infomediaries, ranging from very large
organizations to independent developers. Using a mixed-methods approach, we
identified four major groups of organizations within this dynamic and diverse
sector: general-purpose ICT providers, urban information service providers, open
and civic data infomediaries, and independent and open source developers. A total
of nine types of organizations are identified within these four groups.
We align these nine organizational types along five dimensions that account for
their mission and major interests, products and services, as well as the activities they undertake: techno-managerial, scientific, business and commercial, urban engagement, and openness and transparency. We discuss urban ICT entrepreneurs, and the
role of informal networks involving independent developers, data scientists and
civic hackers in a domain that historically involved professionals in the urban
planning and public management domains.
Additionally, we examine convergence in the sector by analyzing overlaps in
their activities, as determined by a text mining exercise of organizational webpages.
We also consider increasing similarities in products and services offered by the
infomediaries, while highlighting ideological tensions that might arise given the
overall complexity of the sector, and differences in the backgrounds and end-goals
of the participants involved. There is much room for creation of knowledge and
value networks in the urban data sector and for improved cross-fertilization among
bodies of knowledge.
P. Thakuriah (*)
Urban Studies and Urban Big Data Centre, University of Glasgow, Glasgow, UK
e-mail: [email protected]
L. Dirks
Urban Transportation Center, University of Illinois at Chicago, Chicago, IL, USA
e-mail: [email protected]
Y.M. Keita
Department of Urban Planning and Policy, University of Illinois at Chicago, Chicago, IL, USA
e-mail: [email protected]
Keywords Digital infomediaries • Civic hacking • Urban Big Data • Open data •
Text mining
1 Introduction
There has been a surge of interest recently in urban data, both “Big Data” in the
sense of large volumes of data from highly diverse sources, as well as “Open Data”,
or data that are being released by government agencies as a part of Open Govern-
ment initiatives. These sources of data, together with analysis methods that aim to
extract knowledge from the data, have attracted significant interest in policy and
business communities. While public and private organizations involved in planning
and service delivery in the urban sectors have historically been users of urban data,
recent developments outlined below have opened up opportunities for policy and
planning reform in public agencies and for business innovation by private entities.
Such opportunities have also attracted innovative new organizations and facilitated
new modes of ICT entrepreneurship. The objective of this paper is to examine the
diverse organizations and networks around such emerging sources of urban “Big
Data” and “Open Data”.
While there are many definitions of Big Data, the term is generally applied to very large volumes of data that are difficult to handle using traditional data management and analysis methods (Thakuriah and Geers 2013; Batty 2013), and that can be differentiated from other data in terms of their “volume, velocity and variety” (Beyer and Laney 2012). Big Data has also stimulated an emphasis on data-driven
decision making based on analytics and data science (Provost and Fawcett 2013)
that has the potential to add to the value brought about by traditional urban and
regional modeling approaches.
Urban Big Data can be generated from several sources such as sensors in the
transportation, utility, health, energy, water, waste and environmental management
infrastructure, and the Machine-to-Machine (M2M) communications thereby generated. The increasing use of social media, Web 2.0 technologies, personal mobile devices and other ways to connect and share information has added to the vast amount of user-generated content on cities. Open Data
initiatives adopted by city governments are leading to an increasing availability of
administrative and other governmental “open data” from urban management and
monitoring processes in a wide variety of urban sectors. These initiatives have the
potential to lead to innovations and value-generation (Thorhildur et al. 2013). Pri-
vately-held business transactions and opinion-monitoring systems (for example,
real estate, food, or durable goods transactions data, data on household energy or
water consumption, or customer reviews and opinions) can yield significant insights
on urban patterns and dynamics.
The increasing availability of such data has generated new modes of enquiry on
cities, and has raised awareness regarding a data-driven approach for planning and
decision-making in the public, private and non-profit sectors. Urban Big Data has
Our objective is to examine organizations that are involved in urban data and the
myriad activities that relate to the urban data infrastructure. A mixed-methods
approach is utilized, consisting of a qualitative examination of the literature, drawing on personal experience and communication with experts, as well as a quantitative analysis of material from the websites of selected organizations using text mining. The
qualitative assessment helped to identify the major groups of stakeholders in the
urban data landscape, types of products or services generated, skillsets of pro-
fessionals involved and evolving professional networks.
In order to understand the work of the urban digital infomediaries in greater
detail, a database of webpages of 139 public, private and non-profit ICT-focused
organizations and informal ICT entities was constructed. The webpages were
collected using snowballing techniques, starting with a list of organizations
involved in city-related activities known to the authors. Additional organizations
were identified through Internet search using keywords such as “open data”, “smart
cities”, “big data”, “civic”, “open source”, “advocacy” and so on, as well as with
keywords relating to modes such as “participatory sensing”, “crowdsourcing”,
“civic engagement”, “public engagement” and related terms. Pages within websites
retrieved for this purpose include, among others: (1) about us, mission statement,
products or services, or similar page(s) of organizations which describes the
organization; and (2) terms of service, privacy policy or related pages which
describe the organization’s terms or policies regarding information use and data
sharing.
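As a rough illustration of this snowball-style collection, the following is a minimal sketch in Python, assuming the requests and BeautifulSoup libraries; the seed URL, keyword list and page cap are hypothetical stand-ins rather than the authors' actual procedure:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SEED_URLS = ["https://example-civic-data.org"]  # hypothetical starting organization
KEYWORDS = ["open data", "smart cities", "big data", "civic", "open source"]

def snowball(seeds, max_pages=139):
    # Breadth-first collection: keep pages whose text matches the keywords,
    # then queue the links those pages point to.
    queue, seen, corpus = list(seeds), set(), {}
    while queue and len(corpus) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ", strip=True).lower()
        if any(k in text for k in KEYWORDS):
            corpus[url] = text
            queue.extend(urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    return corpus

pages = snowball(SEED_URLS)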
The first step was to manually label and categorize the type of organizations,
sector (public, private, non-profit), major functional interest or domain area and
types of services offered, as well as policies and markets. It also allowed us to make
an assessment of the skill sets of the urban-centric workforce involved in the
organizations, although this aspect was informed by additional reviews and judg-
ment. This led to the identification of four major groups of infomediaries, which
were then organized into nine subgroups based on the stated missions and interests
of the organizations.
One use of the database was to understand, using text mining, the emphasis of
the organizations with regard to their activities and processes that may not be
apparent from their stated mission and objectives. For example, an ICT-focused
organization may indicate that it is in the business of “smart cities”; it is possible
that it is involved in the smart cities agenda by helping to empower residents to
connect to city governments or it may be focused to a greater extent on building the
In our approach, there are three aspects to understanding the emerging urban data
sector: identifying the types of entities that are active in this space; understanding
the scope of what the organizations do; and, assessing the extent to which these
boundaries are blurring over time, with the goal of inferring trends towards
convergence. We identify four major groups of urban digital infomediaries
consisting of nine specific organizational types, based on their stated mission and
objectives, and the products and services delivered. Table 1 shows these four
groups, with a description of the specific types of organizations within each
group, and the sectors (public, private, non-profit, informal) to which they belong.
The table also displays the percentage of the total sample that each specific type of organization comprises.
Table 1 (continued)
Organization type | Description | Dominant sector | % of total sample (N = 139)
Independent and open source applications, software and content developer infomediaries:
IAD Independent App Developers | Individuals primarily focused on developing software and apps to link citizens to information | Informal | 3
OSD Open Source Developers | Organizations and entities creating open source software, social coding accounts, developer networks and other open source ways to allow access to Big Data and Open Data | Private, informal | 4
their overall product–service mix (for example, the percentage of total business
geared to urban ICT products and services).
(1) Smart City Companies (or Units): These entities are focused on improved
performance and efficiency of city ICT systems. While there are many defini-
tions of smart cities, the overall vision is one of having ICT-focused solutions
be an integral part of urban development, driving economic competitiveness,
environmental sustainability, and general livability. As noted by Thakuriah and
Geers (2013), the term “smart city” is championed by commercial entities and
the expectation is that the networking and integration of multiple urban sectors
will enable cross-agency efficiencies for a range of services (such as traffic
management, utilities, law enforcement, garbage disposal, emergency services,
aged care, etc.). In some cases, the entire SCC business is focused on city-
centric applications of intelligent infrastructure, data management and analyt-
ics, and in other cases, the smart city business is handled by specific units within
comprehensive ICT businesses which are involved in many other ICT sectors
such as health, finance, energy and so on. This group also includes consulting
firms which have historically offered services in Intelligent Transportation
Systems, smart energy and water management and other sectors. As shown in
Table 2, SCCs are high on data management and communications tools and in
developing platforms and tools for further processing of data. They are likely to
be involved in analytics and knowledge discovery processes for the purpose of
business solutions, but are likely to be involved in end-user services to a lesser
extent.
(2) Multiple-Service ICT Companies: These are business organizations providing
foundational, general-purpose hardware, software, and communications ser-
vices targeted to location-based information, information-sharing, search
engines, map databases, web services, sensing technologies, collaborative
tools, social media, Web 2.0 and related products. They provide general ICT
services in the telecommunications and information technology sector, without
being focused on urban applications such as smart cities, or are focused on them
only in incidental ways, without making the urban focus a core aspect of
business; yet, the information infrastructure they provide is vital to urban
data initiatives. For example, they are owners of cell phone data which has a
wide variety of urban mobility analysis applications. As in the case of SCC,
MSICTC can range from small private firms to large multinationals.
Urban Information Service Infomediaries: The second group consists of organiza-
tions which deliver end-user ICT services to urban residents to explore cities and
communities, and to connect citizens to social, entertainment, economic and com-
mercial opportunities. The services provided are connected to specific business
models such as mobile commerce and location-based advertising or banner ads.
While the majority of the organizations we examined are established private
businesses, some of these services are also being offered by informal, independent
developers. Two specific digital infomediaries can be differentiated by the types of
information they provide, the degree to which location and real-time information
streams are explicit and front-and-center in their products, and the extent to which
social networking processes are utilized.
(3) City Information Services (CIS): CIS organizations provide directory services,
question and answer databases or recommender systems for residents to engage
in social, commercial, entertainment and other aspects of cities. As shown in
Table 2, CIS are generally focused on end-user services. These organizations
often utilize crowdsourced information by means of user reviews and ratings of
businesses, entertainment services, restaurants and other retail and commercial
entities, thereby creating user communities who may form a social network.
CIS infomediaries tend to be private companies or informal organizations and
independent developers. They are distinguished from Location-Based Services
by not requiring explicit positioning and navigation capability, which calls for
additional (positioning and sensor) technologies.
(4) Location-Based Services (LBS): The LBS industry is primarily private and, to a limited degree, informal, and has been studied extensively. LBS are informa-
tion services that capitalize on the knowledge of, and are relevant to, the mobile
user’s current or projected location. Examples of LBS include resource
One of the questions raised earlier is the potential blurring of activities among
different types of infomediaries with different mission/objectives. The text mining
exercise described in Sect. 2 was used to extract information on the types of
activities undertaken by the 139 organizations.
This led to the identification of the seven Activity Clusters given in Table 3, which describe the processes and approaches undertaken to operationalize organizational objectives: (1) data, computation and tool-building; (2) accessible, advisory, citizen-oriented; (3) economically efficient and resilient urban communities; (4) ICT-focused urban systems management; (5) smart and sustainable communities; (6) community information and location services; and (7) accountability, advocacy and data activism.
[Fig. 1 Percentage of organizations of each of the nine types (Smart City Companies; Multiple-Service ICT Companies; City Information Services; Location-Based Services; Open Data Organizations; Civic Hacking Organizations; Community-Based Information Service Organizations; Independent App Developers; Open Source Developers) dominant in each Activity Cluster, on a 0–80 % scale]
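One plausible way to derive such activity clusters from webpage text, sketched below in Python with scikit-learn, is to convert each organization's pages into TF-IDF term weights and apply k-means with seven clusters; the toy corpus and parameter choices are illustrative assumptions, not the authors' actual pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the collected webpage texts, one string per organization
corpus = [
    "open data portal datasets api transparency",
    "smart city sensors traffic management analytics",
    "community neighborhood residents engagement reporting",
    "location maps navigation mobile services",
    "civic hacking open source volunteers apps",
    "broadband cloud hardware telecommunications services",
    "reviews restaurants recommendations city guide",
    "advocacy accountability government transparency data",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)               # one TF-IDF row per organization

km = KMeans(n_clusters=7, random_state=0).fit(X)   # seven clusters, as in Table 3

# Inspect the highest-weight centroid terms to attach a label to each cluster
terms = vectorizer.get_feature_names_out()
for c, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]
    print(c, [terms[i] for i in top])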
The spread of the nine different organizational types across these seven Activity
Clusters is given in Fig. 1. Given the overall focus on technology, it is not surprising
that all nine organizational types have at least some organizations that are dominant
in Activity Cluster 1, “Data, Computational and Tool Savviness”. However, for
close to 35 % of ODO organizations, this is the dominant activity, as determined by hypernyms, followed by LBS and MSICTCs, indicating that such organizational
types with their interest in data sharing and publication are pushing the envelope
in technology development around urban data.
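The hypernym procedure itself is not spelled out in the text; one plausible reading, sketched below with NLTK's WordNet interface, is to generalize specific activity terms up the concept hierarchy so that differently worded but functionally similar activities can be compared:

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def hypernym_chain(word):
    # Walk up the hypernym hierarchy for the word's first noun sense
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    chain, s = [], synsets[0]
    while s.hypernyms():
        s = s.hypernyms()[0]
        chain.append(s.lemma_names()[0])
    return chain

# Both chains climb toward very general shared ancestors (ultimately "entity"),
# which lets specific wording be grouped under broader activity concepts
print(hypernym_chain("software"))
print(hypernym_chain("database"))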
LBS and SCCs dominate in the “ICT-focused urban systems management
cluster”. In the case of this activity cluster, as in the first cluster, it is
somewhat surprising that Community-Based Information Service Organizations
dominate quite strongly in terms of overall focus, indicating that very different
types of organizations are carrying out functionally similar activities. The domi-
nance of SCC, ODO and CBISO organizations within the “Accessible, advisory and
citizen-oriented” activity cluster similarly suggests increased convergence in the
use of data and technology to serve cities.
Similar trends regarding the dominance of very different types of organizations
within other activity clusters seem to indicate increasing convergence of focus.
Studies of convergence go back to the analysis of the machine tool industry by Rosenberg (1963); more recently, the ICT sector has received considerable interest, with convergence not only in the technologies used on the supply side but also in the products offered on the demand side (Stieglitz 2003). It is likely that as urban digital infomediaries
in historically different industries increasingly use urban data to produce similar
products and services, convergence may be stimulated, leading to potentially useful
networks, alliance strategies and partnerships.
are then shared with other service users. Urban Open and Civic Data Infomediaries
are likely to employ personnel formally trained in disciplines in the urban and
regional planning domains, whether as methodologists or generalists interested in
civic and community issues. Within this group, informal civic hackers with no
formal ties to any of the organizations discussed here may be actively involved.
This group is also likely to utilize the work of ordinary citizens in the monitoring
and reporting of events within their communities.
Independent App Developers or Open Source Developers are very likely to be
developers or data scientists. It is also possible that within the IAD group, citizens
with no formal training in informatics or the urban disciplines, but who are self-taught in the use of social software and open source tools, are playing a part. Data-
centric managers, as described above, are likely to be employed in all infomediary
groups considered, since broad-based ICT technical, managerial and entrepreneur-
ial skills are needed in virtually all urban data sectors.
6 Role of Networks
As noted previously, many of the activities driving urban data are not occurring
within traditional organizational boundaries, but through formal and informal
networks involving the actors discussed above. We discuss this aspect here very
briefly, with the note that this topic is a significant research area in its own right.
Informal networks of informed citizens, civic technologists and civic hackers
have become an important aspect of urban data and it is possible that they are
attracting individuals from a wide variety of ICT organizations, although to the best of our knowledge no study has examined this particular topic.
In the area of civic hacking, ongoing networks are emerging through the use of social networking (both online and in face-to-face meetings via Meetup groups) to exchange knowledge between civic hackers who are more tech-savvy developers and those who are less technically savvy, and to discuss developments in software, data, policy and civic issues. “Hackathons” that are being
sponsored by government agencies and non-profits, as well as design and
crowdsourced competitions for citizen apps, are giving increasing identity, visibil-
ity and legitimacy to these activities. A slightly different type of network is that spearheaded primarily by established ICT companies and focused to a greater degree on urban data and urban management applications, in contrast to civic and government transparency issues.
As indicated earlier, members of the organizations we discussed tend to populate
ongoing networks relating to data, communication and other standards, and other
technical aspects related to urban data. In contrast to ongoing networks, there are
also project-based networks in the urban data sector. These are formed primarily
around technology-focused city projects (smart city projects, field operational tests
of intelligent transportation, smart energy systems and so on) primarily by members
of government agencies who are sponsoring the project, businesses that are
providing the service, and affiliate members consisting of other planning and
administrative agencies, non-profits, higher education and research institutions.
These types of networks are often formed as a result of project requirements for partnerships or public participation, and their work typically ends when the project is
complete, although they may continue to come together well after that to follow up
on evaluation results and to develop strategies regarding lessons learned.
to operational and policy use. This dimension also views data on city dynamics as
offering interesting opportunities and a test-bed for addressing communications, infor-
mation processing, data management, computational, and analytics challenges
posed by urban Big Data. Another aspect of the scientific interest is to make
advances in open source and social software, technologies for privacy preservation
and information security, and other challenges associated with urban data.
A third dimension is business and commercial, where previous modes of
e-commerce are being augmented with location-based social networks for mobile
commerce, user-generated content for reviews and recommender systems,
crowdsourcing of input for idea generation and other business product develop-
ment, and other commercial purposes in cities, ultimately leading to participant-
generated information on the social, recreational and entertainment aspects of
cities.
A fourth dimension of interest is urban engagement and community well-being,
with a focus on civic participation and citizen involvement. One aspect of this
strand is community-based information and monitoring, which may involve tech-
nologies similar to those of the techno-managerial strand.
A fifth dimension is openness and transparency of government information
towards more interactive, participatory urban governance, bearing close simi-
larities to other “open” movements including open access, open source, open
knowledge and others. One aspect of this strand of interest is likely to overlap
with the second aspect of the scientific dimension, i.e., on computational and data
management aspects.
8 Conclusions
The urban data sector is highly dynamic and involves a significant informal
entrepreneurial sector which is entering the domain given the opportunities and the
overall challenges involved. However, for these opportunities to be realized, a
broad-based strategy is needed that reflects research and policy deliberations
regarding the social, policy, behavioral and organizational implications of the
data management and dissemination processes. This entrepreneurial sector is itself
highly diverse and includes developers passionate about computational and tech-
nological challenges, and data scientists who are interested in analyzing complex
data, as well as in open source software.
Urban data are also increasingly being seen by professionals in the urban
planning and public management domains as being important for urban management. The sector also includes civic hackers and civic technologists
who value openness and transparency and open source technologies, and are
interested in data and analytics to address civic and urban challenges. While
some entrepreneurs work in private firms, civic and public agencies, or in
non-profits, others freelance in ICT development work.
Using all the different measures, the dimensions of interest addressed by urban
data infomediaries are fivefold: techno-managerial, scientific, business and com-
mercial, urban engagement, and openness and transparency. Using these dimen-
sions, it may be possible to predict aspects of city operations and management
where innovations may result from the work of urban data infomediaries.
Informal networks are playing an important role in creating and sharing knowl-
edge regarding technical skills and urban and governance issues. Social networks
that have formed among developers, data scientists and others involved in civic
hacking are important in this regard, but require greater involvement of profes-
sionals from the urban research community. The participation of the latter group is
particularly important, since what are sometimes presented as novel digital modes of urban planning are effectively practices and strategies that are well established in
the urban domain. In much the same way, much can be learned about emerging
technology and analytics solutions by the urban community. This indicates that
there should be greater overall cross-fertilization of knowledge among the various
professional domains involved.
One issue we were interested in is the idea of convergence, which is a general
trend in the ICT sector, as noted by many authors. We found evidence of technical
convergence because many different types of organizations use similar technolo-
gies and offer similar urban data products and services. We also found through our
clustering analysis that several different types of organizations across the four
infomediary groups are focused on similar ICT-focused activities relating to
urban management, accountability and data activism.
However, less evident is ideological convergence. Ideological tensions occur
with differences in viewpoints regarding the way things should be, or should be
done. There are several examples in the urban data landscape. One, far from a complete list, concerns ideological tensions regarding what open government should mean: transparent government, innovative government, or collaborative government. These questions
are important to consider because they have implications for policy and stakeholder
generation, as well as for practice and bodies of knowledge in these areas.
The study has several limitations. The emerging nature of the sector necessitated
an exploratory study. Aside from the informal nature of the sample of organizations
examined, another limitation is that it consists of only those entities that have a
formal presence (websites) on the Internet. Informal digital infomediaries who do
not have websites but are active through blogs, social networking sites such as
Facebook, or have a social web presence via social coding services such as GitHub, SourceForge and so on, or who have no presence on the Internet at all, are not
included in this study. This potentially excludes a significant share of informal
digital infomediaries including civic hackers, many of whom are one-person enti-
ties contributing without an established organizational presence. However, we were
able to identify some organizations which are undertaking civic hacking activities
and these are included in the sample. At the time of writing this paper, we are
administering a survey instrument to gather data on such independent data
activists and civic hackers. Another limitation of the sample is that it excludes
higher education and research institutions, some of which are focusing heavily on
Big Data and Open Data research.
References
Batty M (2013) Big data: big issues. Geographical Magazine of the Royal Geographical Society, p 75
Beyer MA, Laney D (2012) The importance of ‘Big Data’: a definition. Gartner
Ferraro R, Aktihanoglu M (2011) Location-aware applications. Manning, Greenwich
Gil-García JR, Pardo TA (2005) E-government success factors: mapping practical tools to theo-
retical foundations. Gov Inf Q 22:187–216
Lam W (2005) Barriers to e-government integration. J Enterp Inf Manag 18:511–530
Patil DJ, Hammerbacher J (n.d.) Building data science teams. http://radar.oreilly.com/2011/09/
building-data-science-teams.html#what-makes-data-scientist. Accessed 1 Aug 2014
Provost F, Fawcett T (2013) Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Sebastopol
Rosenberg N (1963) Technological change in the machine tool industry, 1840–1910. J Econ Hist
23(4):414–443
Stieglitz N (2003) Digital dynamics and types of industry convergence: the evolution of the
handheld computers market. In: Christensen JF, Maskell P (eds) The industrial dynamics of
the new digital economy. Edward Elgar, Cheltenham, pp 179–208
Thakuriah P, Geers DG (2013) Transportation and information: trends in technology and policy.
Springer, New York. ISBN 9781461471288
Thorhildur J, Avital M, Bjørn-Andersen N (2013) The generative mechanisms of open government data. In: ECIS 2013 proceedings, paper 179
Xing X, Ye L-K (2011) Measuring convergence of China’s ICT industry: an input–output analysis.
Telecommun Policy 35(4):301–313
Zheng Y (2011) Location-based social networks: users. In: Zheng Y, Zhou X (eds) Computing
with spatial trajectories. Springer, New York, pp 243–276
How Should Urban Planners Be Trained
to Handle Big Data?
Abstract Historically urban planners have been educated and trained to work in a
data poor environment. Urban planning students take courses in statistics, survey
research and projection and estimation that are designed to fill in the gaps in this
environment. For decades they have learned how to use census data, which is
comprehensive on several basic variables, but is only conducted once per decade
so is almost always out of date. More detailed population characteristics are based
on a sample and are only available in aggregated form for larger geographic areas.
But new data sources, including distributed sensors, infrastructure monitoring,
remote sensing, social media and cell phone tracking records, can provide much
more detailed, individual, real time data at disaggregated levels that can be used at a
variety of scales. We have entered a data rich environment, where we can have data
on systems and behaviors for more frequent time increments and with a greater
number of observations on a greater number of factors (The Age of Big Data, The
New York Times, 2012; Now you see it: simple visualization techniques for
quantitative analysis, Berkeley, 2009). Planners are still being trained in methods
that are suitable for a data poor environment (J Plan Educ Res 6:10–21, 1986;
Analytics over large-scale multidimensional data: the big data revolution!,
101–104, 2011; J Plan Educ Res 15:17–33, 1995). In this paper we suggest that
visualization, simulation, data mining and machine learning are the appropriate
tools to use in this new environment and we discuss how planning education can
adapt to this new data rich landscape. We will discuss how these methods can be
integrated into the planning curriculum as well as planning practice.
Planning methods have been the source of much discussion over the past few
decades. Practitioners and researchers have examined what methods planning
schools teach and how these methods are used in practice. The suite of traditional
methods courses taught in planning programs—inferential statistics, economic
cost-benefit analysis, sampling, and research design for policy evaluation—remains
largely stagnant, despite the rapidly changing reality in which planners are expected
to work. Although the focus of this paper is on the impact of big data for planning
methods, other variables have also contributed to the need for additional methods to
tackle planning problems. The rise of ubiquitous computing and a hyper-connected
communication network as well as new private investment in data collection have
created an environment in which greater amounts of data exist than ever before. The
ability of the planner to analyze and use this data is no longer limited by computing
power or the cost of data collection, but by the knowledge that planners possess to
employ data analytics and visualization techniques.
Educating planners with skills that are useful for practice has been a key tenet
of many planning programs over the years. Several studies have been conducted to
understand how well planning programs are succeeding at this goal. Surpris-
ingly, the most recent comprehensive investigation of planning education and skills
demanded by practitioners was conducted in 1986. In this survey, four important
conclusions were identified as relevant to how planners were being educated and
the professional skills they would be required to use (Contant and Forkenbrock
1986). They found that the methods taught in planning programs remained highly
relevant to the methods needed for practicing planners, and the authors concluded
based on their survey results that planning educators were adequately preparing
their students to solve planning problems in practice. They cited communication
skills (writing and speaking) and analysis and research design as critical compo-
nents of planning education and practice, but noted that educators needed to remain
vigilant on seeking relevance (Contant and Forkenbrock 1986). The article also
identified several changes that were occurring throughout the 1980s that affected
the planning profession—the rise of micro-computing and the expansion of
methods being offered by planning schools. Contant and Forkenbrock (1986)
wrote “. . .there is little to suggest that planning schools are overemphasizing
analytic methods, nor do they appear to be failing to any real extent in meeting
the demands of practitioners interviewed. While more techniques are required than
these practitioners feel that all planners should understand, it certainly is arguable
that this situation is not at all bad.” That survey of methods is now nearly 30 years
old, and new realities exist that require educators to revise and expand the scope of
methods taught in planning schools (Sawicki and Craig 1996; Goodspeed 2012).
Despite wide acknowledgement of the changing data landscape, planning cur-
ricula still resemble their traditional form. Kaufman and Simons completed a
follow-up to this investigation which surveyed planning programs specifically on
methods and research design. This 1995 study, with its more limited focus, “revealed a
rather surprising lack of responsiveness among planning programs over time to
practitioner demand for [quantitative research methods]” and that “planning pro-
grams do not seem to teach what practitioners practice, and not even what
practitioners should practice” (Kaufman and Simons 1995). In a 2002 study focused
on the use of technology within planning programs, Urey claims that the haphazard
approach with which planning programs have introduced the use of technology to
serve larger goals (research, analysis, modeling) might be problematic as increased
microcomputing power becomes more widespread. While manual techniques serve
learning objectives within planning methods courses, the use of technology is now
required (Urey 2002). This leaves planning educators today with two questions
relevant to big data and methods: what new methods must we now include in our
curriculum, and what technology must students understand to employ these
methods in an ethical, accurate, and precise way? Given these questions, we
reviewed current methods requirements at planning schools to assess whether or
not planning programs have begun to respond to these questions and adapt to the
changing data landscape.
In a non-scientific review of methods taught at the top ten planning schools
(as listed by Planetizen in 2014 [http://www.planetizen.com/education/planning]),
we discovered that almost all programs require that planners be trained in statistics,
economic cost-benefit analysis, and research design. Of the programs reviewed,
including MIT, Cornell, Rutgers, UC Berkeley, University of Illinois at Urbana-Champaign, UNC Chapel Hill, University of Southern California, Georgia Institute of
Technology, UCLA, and University of Pennsylvania, none required students to
seek additional data analysis courses outside of the planning department. Although
the review of these programs was not scientific and limited to information published
online for prospective students, it does suggest that planning education has yet to
see value in teaching planners methods widely adopted in the fields of computer
science and engineering. We argue, as Contant and Forkenbrock did 30 years ago, that maintaining the relevance of planning education to planning practice is important. Contant and Forkenbrock reminded educators to be vigilant in their understanding of skills that are in demand among practitioners; yet we have failed to do this in regard to our methods curricula.
The one big exception to the static nature of planning methods offerings is
geographic information systems (GIS). Almost all of the top programs include a
required course on GIS or include a significant section on GIS as a portion of a
required methods course. This technology, once the province of a subset of com-
puting nerds, has spilled out of the methods sequence and permeated the curricu-
lum. It is now common to see planning students using GIS as a part of land use,
housing, transportation and economic development courses. The adoption and use
of GIS has been the most sweeping change in planning methods curriculum over the
past 30 years. For a discussion of this history and how this technology is evolving,
see Drummond and French (2008).
Big data, although currently a popular topic, is not new; the concept dates back to 2001, when industry analyst Doug Laney defined big data as any data set characterized by the three Vs: Volume, Velocity and Variety (Laney 2001). Big data sets contain a large number of observations, arrive as fast-moving streams, and require real-time analytics. Big data sets are also usually mixed format, combining both structured
and unstructured data, joined by a common field such as time or location. In sum,
any data sets that are too large and complex to process using conventional data
processing applications can be defined as big data.
Several pioneers in the industry have already started to process and analyze big
data (Lohr 2012; Cuzzocrea et al. 2011). For instance, UPS now tracks 16.3 million
packages per day for 8.8 million customers, with an average of 39.5 million
tracking requests from customers per day. The company stores more than
16 petabytes of data. By analyzing those datasets, UPS is able to identify real-time on-road traffic conditions and daily package distribution patterns; together with the latest real-time GIS mapping technology, the company is able to optimize its daily freight routes. With this information, in 2011 UPS saved more than 8.4 million gallons of fuel by cutting 85 million miles from its daily routes (Davenport and Dyché 2013). IBM teamed up
with researchers from the health care field to use big data to predict outbreaks of
dengue fever and malaria (Schneider 2013). It seems that big data, together with
advanced analysis and visualization tools, can help people from a wide variety of
industries explore large, complex data sets and reveal patterns that were once very
difficult to discover. Given the increasing use of big data across fields that share
interests with the field of city planning, planners should more deliberately explore
and develop methods for using big data to develop insights about cities, transpor-
tation patterns and the basic patterns of urban metabolism.
Data analytics, as a powerful tool to investigate big data, is becoming an
interdisciplinary field. There are new programs at universities across the United
States that aim to teach students how to grapple with big data and analyze it using
various analytic tools. For this paper, we collected and reviewed some common
tools and skills that are taught in data analytics courses. We gathered course
information from Johns Hopkins, Massachusetts Institute of Technology, University
of Washington, and Georgia Institute of Technology. We noted that machine
learning/data mining and data visualization are the tools that are frequently taught
in these programs to prepare students to handle big data, and some of these tools are
actually quite new to urban planners.
Machine learning is a core subarea of artificial intelligence. Machine learning
uses computer algorithms to create explanatory models. There are different types of
learning approaches, including supervised learning, unsupervised learning, and
reinforcement learning. Although some of the terminologies may be completely
new to planners, the actual methods turn out to be quite familiar. For example, the
regression model is one of the methods that is frequently used in supervised
learning. Planners who work with remote sensing images often apply
supervised classification methods to reclassify the images into land cover images
based on various color bands in the image. However, planners may not be familiar
with other machine learning methodologies or algorithms, such as unsupervised
learning and reinforcement learning. Unsupervised learning tries to identify regu-
larities (or clusters or groupings) in the input datasets without correct output values
provided by a supervisor. Reinforcement learning is primarily used in applications where the output of the system is a sequence of actions (e.g. playing chess).
In this case, what’s important is not a single action, but a sequence of actions that
will achieve the ultimate goal. When machine learning methods are applied to large
databases, such as big data, it is often called data mining. Data mining tries to
identify and construct a simple model with high predictive accuracy, based on the
large volume of data. The model is then applied to predict future values. This is the
kind of projection that planners have been doing for years with less sophisticated
methods.
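As a small, self-contained illustration of the supervised/unsupervised distinction described above (synthetic data and hypothetical variables, not an example drawn from the chapter), the sketch below fits a regression model where correct outputs are provided, and a k-means model where only inputs are given:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Supervised learning: correct outputs are provided. Here a toy model of
# household energy use (output) from floor area and household size (inputs).
X = rng.uniform(low=[50, 1], high=[300, 6], size=(500, 2))
y = 20 * X[:, 0] + 400 * X[:, 1] + rng.normal(0, 200, size=500)
model = LinearRegression().fit(X, y)
print("recovered coefficients:", model.coef_)  # close to the true [20, 400]

# Unsupervised learning: no outputs, only regularities to find. Here k-means
# groups synthetic neighborhoods by density (per sq km) and median income ($k).
centers = np.array([[3000, 40], [12000, 65], [800, 90]])
Z = np.vstack([rng.normal(c, [400, 5], size=(300, 2)) for c in centers])
labels = KMeans(n_clusters=3, random_state=0).fit_predict(Z)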
Most of the programs we reviewed also include data visualization components to
help identify patterns in the data and communicate the results of data analysis.
Some data visualization techniques, such as multivariate data representations, table
and graph designs are quite conventional. However, those techniques may also be
applied in innovative ways to help convey information behind data in a clearer
manner. One example is information graphics, or infographics, which support human cognition by exploiting the visual system’s ability to extract patterns and trends (Smiciklas 2012; Few 2009). The latest trend in data visualization is to take advantage of the web to present data in an interactive way. To effectively present big data interactively, the designer needs to be equipped with knowledge of how human beings interact with computers, and how different interaction types (e.g. filtering, zooming, linking, and brushing) affect human cognition. In the example below, viewers can interact with data generated from Foursquare check-ins across Manhattan (Williams 2015). These interactive visualizations can be used on both big and small data, but interactivity allows more data to be presented to viewers (Fig. 1).
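A minimal sketch of such an interactive view, using the plotly library on synthetic check-in points rather than the actual Foursquare data, is shown below; hovering, zooming and legend filtering come built in, and the hour slider lets viewers step through the day:

import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(1)
checkins = pd.DataFrame({
    "lon": rng.normal(-73.98, 0.02, 2000),   # roughly Manhattan longitudes
    "lat": rng.normal(40.75, 0.03, 2000),
    "hour": rng.integers(0, 24, 2000),
    "venue": rng.choice(["food", "office", "park", "nightlife"], 2000),
})

# animation_frame adds a time slider so viewers can explore the daily rhythm
fig = px.scatter(checkins.sort_values("hour"), x="lon", y="lat",
                 color="venue", animation_frame="hour", opacity=0.5)
fig.show()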
In addition to the core courses, these new interdisciplinary programs require the
students to master at least one programming or query language. SQL is a popular
requisite and, in a survey on tools for data scientists, over 71 % of respondents used
SQL (King and Magoulas 2013). Some programs also require students to under-
stand and use open statistics software, such as R and RStudio.
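For illustration, the kind of grouping query taught in such courses can be run against a hypothetical trip table using Python's built-in sqlite3 module (the table and values are invented for the example):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (origin_tract TEXT, mode TEXT, miles REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    ("17031010100", "transit", 4.2),
    ("17031010100", "car", 9.7),
    ("17031020200", "bike", 1.8),
    ("17031020200", "car", 12.4),
])

# Aggregate trip mileage by origin tract and mode -- the bread-and-butter
# grouping query that scales up, in spirit, to big data SQL engines
for row in con.execute("""
        SELECT origin_tract, mode, COUNT(*) AS n, AVG(miles) AS avg_miles
        FROM trips
        GROUP BY origin_tract, mode
        ORDER BY origin_tract"""):
    print(row)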
While these methods for analyzing data may seem somewhat out of place within
a planning methods framework, they actively seek to create ways in which
researchers can describe, explore, and explain data. These categories of data
analysis are described in depth in Earl Babbie’s Survey Research Methods (1990).
This text serves as one of many fundamental introductions to methods for planners,
and by grouping the new suite of tools available to planners and data scientists
within these categories, planners can see how these tools might be useful to them.
For example, data visualization is one of the key ways in which data scientists are
exploring big data sets (Few 2009). Data visualization acknowledges that our
typical methods of data exploration (descriptive statistics, graphing, and the like)
are ill-equipped to handle larger data sets, and even less equipped to communicate
information derived from those data sets to the public and to decision makers. By
introducing planners to the growing field of data visualization, we can expand their
ability not only to use larger data sets but also to communicate the information
garnered from those data sets. As the basis for research, exploration of data sets will
allow planners to ask additional questions. These additional questions will require
explanatory analysis, and within this group of methods, tools such as machine
learning and data mining can help planners generate predictive models from larger
data sets.
Many of the data sets that planners will deal with in the future will be big data.
Credit card data or web browsing histories may help planners to predict the focus of
emerging public concerns. In fact, MIT’s big data courses include a case study on using Google search records to estimate trends within the real estate industry (MIT 2014). Social media, such as Twitter and
Facebook, have already become powerful information sources regarding almost
every aspect of social life. Analysis of Twitter feeds can help to identify the extent and intensity of hazard events. There are already studies on how to utilize information extracted from Facebook friend lists to forecast air travel. GPS or
real time transportation information can help planners to calibrate and develop
more accurate activity based travel demand models to forecast future travel pat-
terns. Moreover, real-time information about resource flows such as water, sewer, and electricity may equip planners with critical information to design more energy-efficient and sustainable cities and to make the built environment more resilient to natural hazards and climate change. Planning is characterized by its special affinity
for place-based issues, and this focus on place will be one of the critical ways in
which typical data sets can become “big data.” Location is the ultimate relational
field, and our ability to link data sets through location will create big data sets that
are especially useful to planners. If location is the ultimate relational connector,
then planning data sets will only continue to increase in size, speed, and complexity
in the future. The importance of teaching planners how to effectively and accurately examine and explore this data cannot be overstated; yet our work to prepare this
paper leads us to believe that planning programs have not yet taken the steps
required to introduce these methods to planning students.
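As a hedged sketch of location as the relational field discussed above (the complaint and ridership records below are invented for illustration), two otherwise unrelated data sets can be linked by snapping coordinates to a shared grid-cell key:

import pandas as pd

def grid_key(df, cell=0.01):
    # Snap lat/lon to a coarse grid cell id usable as a join key
    return (df["lat"].floordiv(cell).astype(int).astype(str) + "_"
            + df["lon"].floordiv(cell).astype(int).astype(str))

complaints = pd.DataFrame({"lat": [41.881, 41.892], "lon": [-87.623, -87.641],
                           "type": ["noise", "pothole"]})
boardings = pd.DataFrame({"lat": [41.8812, 41.8925], "lon": [-87.6229, -87.6405],
                          "daily_riders": [5200, 830]})

complaints["cell"] = grid_key(complaints)
boardings["cell"] = grid_key(boardings)
linked = complaints.merge(boardings, on="cell", suffixes=("_c", "_b"))
print(linked[["cell", "type", "daily_riders"]])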
Big data analysis tools, such as machine learning and data visualization, can help
planners to make better use of the big data sets. The Memphis Police Department
has used machine learning and data mining approaches to predict potential crime
based on past crime events. As a result, the serious crime rate was reduced by
approximately 30 %. The city of Portland, Oregon optimized its traffic signals based on big traffic data and was able to reduce CO2 emissions by more than 157,000 metric tons over 6 years (Hinssen 2012). In sum, machine learning techniques can help planners to analyze the future development of urban areas more accurately, to solve current problems, and to eliminate or at least ease some of the impacts of new development. The explanatory power of machine learning will be
critical for planners seeking to use big data to solve long-term challenges in cities
and communities.
Data visualization has always been considered useful in the planning process,
primarily as a communication method. However, it is now a critical tool for
exploring large, complex data sets. Data visualization can help planners better
understand how people live, work and behave within the urban context. When paired
with more explanatory tools such as machine learning, data visualization becomes a
critical tool in the planning process. Visualization can also continue to be used as a
way for planners to convey their planning concepts to corresponding stakeholders
during the public participation process. In this way, visualization is used as an
interpretation toolkit to help people digest the complex analysis results from big
data. Planners continue to be more comfortable using traditional graphs, tables, and
animation images to visualize their results. However, some planners are now using
more advanced web based tools to display the information in interactive ways to
encourage public participation. This trend has been on the rise for some time, and
the demand for practitioners with visualization skills continues to increase (Few
2009; Sawicki and Craig 1996; Goodspeed 2012).
We argue in this paper that planners would benefit greatly from the introduction
of more advanced methods of descriptive, exploratory, and explanatory data anal-
ysis in order to more effectively use an ever increasing amount of available data.
When considering adding new methods to the planning curriculum, there is always
the question of what will be displaced from the existing curriculum. We would urge
planning educators to review their current methods carefully to see if the current offerings are suitable as we move from a data-poor environment to one of data abundance. At the very least, planning programs should strive to make all students
aware of big data and give them some introduction to the means and methods of
analyzing this data. This basic overview may be sufficient for the generalist planner,
with more in-depth training in big data available to those who want it. This is similar to
the model that was initially followed with respect to GIS—all planning students
were given some basic GIS skills and vocabulary so they could communicate with
spatial analysis specialists. All planning students should get some exposure to big
data and its analytical techniques, but some should be able to develop more depth
and the ability to collaborate with data scientists.
Two key issues for additional research emerged as we prepared this paper. The
field of planning is inherently place-based, and it, therefore, has the potential to take
many types of data and transform them into big data by linking mixed-format information into databases based on location. This suggests that planning can draw upon all types of location-based data, including cell phone locations, license plate
readers, infrastructure sensors, drone videos, and building performance data. The
challenge will be how to build a theoretical framework that will allow planners to
use this wealth of information. Second, the field of planning is predominantly
concerned with the long-term. To date most big data applications have been used
to provide insights into short term challenges. As planners, we need to be asking a
larger question that relates to not just what methods can be used to analyze this data,
but how this data can be employed in our search for long-term solutions. How can
minute-by-minute Twitter text analysis related to planning issues allow us to
reframe planning issues for years to come? How does real time transportation
data help us understand how to shape transportation systems for the next genera-
tion? We did not set out to answer these questions in this paper, but we do believe
that posing them will help frame the discussion of planning methods for the next
generation of planning students and practitioners.
Big data represents an exciting new asset for planners who have always strug-
gled to explore and explain patterns and trends based on limited observations of
discrete data. We should make the best use of this data by giving planners the tools
with which to analyze it, understand it and communicate it. Like others who have
written on the topic of big data in cities, we do caution that data should not be used
for data’s sake. Planners face a more complex task than our data science
colleagues: we must find ways in which to use the data to make existing commu-
nities better and to provide better solutions than were previously available (Sawicki
and Craig 1996; Mattern 2013). In order to help planners achieve these goals, we
must revamp the methods offerings in our planning programs to take full advantage
of the new world of large, fast moving, ubiquitous data.
References
Few S (2009) Now you see it: simple visualization techniques for quantitative analysis. Analytics
Press, Berkeley
Goodspeed R (2012) The democratization of big data. http://www.planetizen.com/node/
54832. Accessed 2014
Hinssen P (2012) Open data, power, smart cities. How big data turns every city into a data capital.
Across Technology
Kaufman S, Simons R (1995) Quantitative and research methods in planning: are schools teaching
what practitioners practice? J Plan Educ Res 15:17–33
King R, Magoulas R (2013) Data science salary survey: tools, trends, what pays (and what
doesn’t) for data professionals. O’Reilly Media, Inc
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META
Group Research Note, 6
Lohr S (2012) The age of big data. The New York Times, February 11, 2012
Mattern S (2013) Methodolatry and the art of measure [Online]. Design Observer. http://places.
designobserver.com/feature/methodolatry-in-urban-data-science/38174/. Accessed 2014
Sawicki DS, Craig WJ (1996) The democratization of data: bridging the gap for community
groups. J Am Plan Assoc 62:512–523
Schneider S (2013) 3 examples of big data making a big impact on healthcare this week.
http://blog.gopivotal.com/pivotal/p-o-v/3-examples-of-big-data-making-a-big-impact-on-
healthcare-this-week. Accessed 2 Oct 2013
Smiciklas M (2012) The power of infographics. Using pictures to communicate and connect with
your audience. Que, Indiana
Urey G (2002) A critical look at the use of computing technologies in planning education: the case
of the spreadsheet in introductory methods. J Plan Educ Res 21:406–418
Williams S (2015) Here now: social media and the psychological city. http://www.
spatialinformationdesignlab.org/projects/here-now-social-media-and-psychological-city.
Accessed 14 July 2015
Energy Planning in a Big Data Era: A Theme
Study of the Residential Sector
Hossein Estiri
Abstract With a focus on planning for urban energy demand, this chapter
re-conceptualizes the general planning process in the big data era based on the
improvements that non-linear modeling approaches provide over mainstream tra-
ditional linear approaches. First, it demonstrates challenges of conventional linear
methodologies in modeling complexities of residential energy demand. Suggesting
a non-linear modeling schema for analyzing household energy demand, the chapter
develops its discussion around repercussions of the use of non-linear modeling in
energy policy and planning. Planners and policy-makers are not often equipped
with the tools needed to translate complex scientific outcomes into policies. To fill
this gap, this chapter proposes modifications to the traditional planning process that
will enable planning to benefit from the abundance of data and advances in
analytical methodologies in the big data era. The conclusion section introduces
short-term implications of the proposed process for energy planning (and planning,
in general) in the big data era around three topics of: tool development, data
infrastructures, and planning education.
1 Introduction
Fig. 1 U.S. energy demand and CO2 emissions by sector, 2013, 2020, 2030, and 2040. Data
source: U.S. Energy Information Administration (2013b)
Fig. 2 World residential sector delivered energy demand, 2010–2040. Data Source: U.S. Energy
Information Administration (2013a)
First, mainstream traditional linear approaches have oversimplified the modeling of
energy use in the residential sector, failing to account for its complexities. Second,
lack of publicly available energy demand data for research has intensified the
methodological issues in studying residential energy demand.
A major problem in residential energy demand research is that “the data do not
stand up to close scrutiny” (Kriström 2006, p. 96). Methodological approaches lag
behind theoretical advances, partly because data used for quantitative analysis often
do not include the necessary socio-demographic, cultural, and economic informa-
tion (Crosbie 2006). In addition, the absence of publicly available high-resolution
energy demand data has hindered development of effective energy research and
policy (Min et al. 2010; Kavgic et al. 2010; Pérez-Lombard et al. 2008; Lutzenhiser
et al. 2010; Hirst 1980).
Even though relevant data are being regularly collected by different organiza-
tions, such data sources do not often become publicly known (Hirst 1980).
Conventional wisdom and modeling practices of energy demand are often based on
“averages” derived from aggregated data (e.g. average energy demand of an
appliance, a housing type, a car, etc.), which do not explicitly reflect human choice
of housing and other energy consumptive goods (Lutzenhiser and Lutzenhiser
2006).
4 Non-linear Modeling
Fig. 4 A non-linear
conceptual model of the
impact of the household and
the housing unit on energy
demand. Source: Estiri
(2014a)
This difference in the two approaches can be game changing, as the non-linear
approach can reveal an often hidden facet of effects on the outcome, the “indirect”
effects. Research has shown that, for example, linear approaches significantly
underestimate the role of household characteristics on energy demand in residential
buildings, as compared with the role of housing characteristics (Estiri 2014b; Estiri
2014a). This underestimation has shaped the conventional understanding of resi-
dential energy and guided current policies that are “too” focused on improving
buildings’ energy efficiency.
Figure 4 illustrates a non-linear conceptualization of energy demand in the
residential sector. According to the figure, households have a direct effect on energy
use through their appliance use behaviors. Housing characteristics, such as size,
quality, and density also influence energy use directly. Household characteristics,
however, influence the characteristics of the housing unit significantly—which is
labeled as housing choice. In addition to their direct effect, households therefore have
an indirect effect on energy demand through their housing choice, an effect that linear
methodologies dismiss and that conventional thinking and current policies have
consequently overlooked.
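To make this distinction concrete, the following minimal sketch (Python, with entirely synthetic data and hypothetical variable names, not the estimation procedure of the studies cited above) recovers the direct, indirect, and total effects in a simple household income, housing choice, and energy demand path model.

```python
import numpy as np

# Synthetic illustration: household income -> housing size -> energy demand.
rng = np.random.default_rng(0)
n = 5000
income = rng.normal(50, 10, n)                         # household characteristic
size = 0.6 * income + rng.normal(0, 8, n)              # housing choice
energy = 0.3 * income + 0.5 * size + rng.normal(0, 5, n)

def slopes(y, X):
    """Ordinary least-squares slope coefficients (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]

a = slopes(size, income[:, None])[0]                   # income -> size path
b, direct = slopes(energy, np.column_stack([size, income]))
total = slopes(energy, income[:, None])[0]

# In a linear path model, total effect = direct + indirect (a * b).
print(f"direct {direct:.2f}, indirect {a * b:.2f}, total {total:.2f}")
```

A single-equation specification reports only the total (or only the direct) coefficient, which is precisely how the indirect pathway through housing choice comes to be overlooked.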
Energy use in the residential sector is a function of local climate, the housing unit,
energy markets, and household characteristics and behaviors. A conventional linear
approach to household energy use correlates all of the predictors to the dependent
variable (Fig. 5). Figure 6, instead, illustrates a non-linear model that incorporates
multiple interactions between individual determinants of energy demand in the
residential sector. Results of the non-linear model will be of more use for energy
policy.
Fig. 5 Graphical model based on the linear approach. All predictors correlate with the dependent
variable, while mediations and interactions among variables are neglected
Fig. 6 Proposed graphical model based on a non-linear approach. Predictors impact both: the
outcome variable and other variables
There are five exogenous variables in this model: age, gender, race/ethnicity,
local climate, and energy price. All housing-related characteristics can be predicted
“For the theory-practice iteration to work, the scientist must be, as it were, mentally
ambidextrous; fascinated equally on the one hand by possible meanings, theories, and
tentative models to be induced from data and the practical reality of the real world, and on
the other with the factual implications deducible from tentative theories, models and
hypotheses.” (Box 1976, p. 792)
Fig. 7 The proposed modified planning process for the big data era
statistical denotation). There are several modern analytical approaches that can
analyze more complexities and can provide simulations. I suggest that the tradi-
tional analysis in the planning process (step 2) should be enhanced or replaced by
incorporating advanced modeling algorithms that are trainable and connected to live
data. This process involves scientific discovery.
Yet, planners and policy-makers should not be expected to translate complex
modeling results directly into planning and policy-making. The findings of
such analyses and simulations need to be made explicit via a policy interface. Using
the policy interface, planners and policy-makers would be able: (1) to explicitly
monitor the effects of various variables on energy demand and results of a simu-
lated intervention, and (2) to modify the analytical algorithms, if needed, to
improve the outcomes. The interface should provide explicit goals for planners
and policy-makers, making it easier to reach conclusions and make assumptions explicit.
From the explicit goals, designing smart policies is only a function of the
planners’/policy-makers’ innovativeness in finding the best ways (i.e., smartest
policies) for their respective localities to achieve their goals. Smart policies are
context-dependent and need to be designed in close cooperation with local stake-
holders, as all “good” policies should be. For example, if reducing the impact
of income on housing size by X% is the goal, then changes in property taxes might
be the best option in one region, while in another region changes in design codes
could be the solution. Once smart policies are implemented, the results will be
captured in the data infrastructure and used for further re-iterations of the planning
process.
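As a purely hypothetical illustration of what such a policy interface might expose, the fragment below reuses the fitted coefficients (a, b, direct) and the income sample from the earlier sketch to simulate the example intervention above; the function name and the 10 % reduction are invented for illustration.

```python
def simulate_intervention(a, b, direct, income, scale=1.0):
    """Mean energy demand (up to a constant) when the income -> housing-size
    path is rescaled by `scale`; scale=1.0 reproduces the fitted model."""
    size = scale * a * income              # counterfactual housing choice
    energy = direct * income + b * size    # structural equations of the model
    return energy.mean()

baseline = simulate_intervention(a, b, direct, income)
policy = simulate_intervention(a, b, direct, income, scale=0.9)  # X = 10 %
print(f"simulated change in mean energy demand: {policy - baseline:.2f}")
```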
8 Conclusion
This chapter built upon a new approach to energy policy research: accounting for
more complexities of the energy demand process can improve conventional under-
standing and produce results that are useful for policy. I suggested that in order for
planners and policy makers to benefit from the incorporation of complex modeling
practices and the abundance of data, modifications are essential in the traditional
planning process. Further elaboration of the proposed modified planning pro-
cess will require additional work and collaboration within the urban planning and big
data communities. Regarding the modified planning process, in the short-run, three
areas of further research can be highlighted.
First is developing prototype policy interfaces. The non-linear modeling that I
proposed in this work can be operationalized and estimated using a variety of
software packages. More important, however, is the integration of the proposed
non-linear model into the corresponding policy interface. More work needs to be
done in this area using different methodologies, as well as developing more
complex algorithms to understand more of the complexities in energy use in the
residential sector—and perhaps, in other sectors.
References
Kavgic M et al (2010) A review of bottom-up building stock models for energy consumption in the
residential sector. Build Environ 45(7):1683–1697. http://linkinghub.elsevier.com/retrieve/pii/
S0360132310000338. Accessed 7 Nov 2013
Kelly S (2011) Do homes that are more energy efficient consume less energy?: a structural
equation model of the English residential sector. Energy 36(9):5610–5620. http://linkinghub.
elsevier.com/retrieve/pii/S0360544211004579. Accessed 18 Nov 2013
Kriström B (2006) Residential energy demand. In: Household behaviour and the environment;
reviewing the evidence. Organisation for Economic Co-Operation and Development, Paris, pp
95–115. http://www.oecd.org/environment/consumption-innovation/42183878.pdf
Lutzenhiser L (1992) A cultural model of household energy consumption. Energy 17(1):47–60.
http://linkinghub.elsevier.com/retrieve/pii/036054429290032U
Lutzenhiser L (1993) Social and behavioral aspects of energy use. Annu Rev Energy Environ
18:247–289
Lutzenhiser L (1994) Sociology, energy and interdisciplinary environmental science. Am Sociol
25(1):58–79
Lutzenhiser L (1997) Social structure, culture, and technology: modeling the driving forces of
household energy consumption. In: Stern PC et al (eds) Environmentally significant consump-
tion: research directions. pp 77–91
Lutzenhiser L, Lutzenhiser S (2006) Looking at lifestyle: the impacts of American ways of life on
energy/resource demands and pollution patterns. In: ACEEE summer study on energy effi-
ciency in buildings. pp 163–176
Lutzenhiser L et al (2010) Sticky points in modeling household energy consumption. In: ACEEE
summer study on energy efficiency in buildings, American Council for an Energy Efficient
Economy, Washington, DC, pp 167–182
MacKay RS (2008) Nonlinearity in complexity science. Nonlinearity 21(12):T273–T281. http://
stacks.iop.org/0951-7715/21/i=12/a=T03?key=crossref.126ca54ea24c1878bf924facc7197105.
Accessed 31 Dec 2013
Min J, Hausfather Z, Lin QF (2010) A high-resolution statistical model of residential energy end
use characteristics for the United States. J Ind Ecol 14(5):791–807. http://doi.wiley.com/10.
1111/j.1530-9290.2010.00279.x. Accessed 7 Nov 2013
Moezzi M, Lutzenhiser L (2010) What’s missing in theories of the residential energy user. In:
ACEEE summer study on energy efficiency in buildings. pp 207–221
Norman J, MacLean HL, Kennedy CA (2006) Comparing high and low residential density: life-
cycle analysis of energy use and greenhouse gas emissions. J Urban Plann Dev 132:10–21
O’Neill BC, Chen BS (2002) Demographic determinants of household energy use in the United
States. Popul Dev Rev 28:53–88. http://www.jstor.org/stable/3115268
Pérez-Lombard L, Ortiz J, Pout C (2008) A review on buildings energy consumption
information. Energy Build 40(3):394–398. http://linkinghub.elsevier.com/retrieve/pii/S0378
778807001016. Accessed 6 Nov 2013
Phillips JD (2003) Sources of nonlinearity and complexity in geomorphic systems. Prog Phys
Geogr 27:1–23
Roaf S, Crichton D, Nicol F (2005) Adapting buildings and cities for climate change: a 21st
century survival guide. Architectural Press, Burlington. http://llrc.mcast.edu.mt/digitalversion/
Table_of_Contents_134820.pdf
Swan LG, Ugursal VI (2009) Modeling of end-use energy consumption in the residential sector: a
review of modeling techniques. Renew Sustain Energy Rev 13(8):1819–1835. http://
linkinghub.elsevier.com/retrieve/pii/S1364032108001949. Accessed 11 Nov 2013
Yu Z et al (2011) A systematic procedure to study the influence of occupant behavior on building
energy consumption. Energy Build 43(6):1409–1417. http://linkinghub.elsevier.com/retrieve/
pii/S0378778811000466. Accessed 8 Nov 2013
Part IV
Urban Data Management
Using an Online Spatial Analytics
Workbench for Understanding Housing
Affordability in Sydney
Abstract In 2007 the world’s population became more urban than rural, and,
according to the United Nations, this trend is to continue for the foreseeable future.
With the increasing trend of people moving to urban localities—predominantly
cities—additional pressures on services, infrastructure and housing is affecting the
overall quality of life of city dwellers. City planners, policy makers and researchers
more generally need access to tools and diverse and distributed data sets to help
tackle these challenges.
In this paper we focus on the online analytical AURIN (Australian Urban
Research Infrastructure Network) workbench, which provides a data driven
approach for informing such issues. The workbench provides machine to machine
(programmatic) online access to large scale distributed and heterogeneous data
resources from the definitive data providers across Australia. This includes a rich
repository of data which can be used to understand housing affordability in
Australia. For example, there are more than 20 years of longitudinal housing data
nationwide, with information on each housing sales transaction at the property
level. For the first time researchers can now systematically access this ‘big’ housing
data resource to run spatial-statistical analysis to understand the driving forces
behind a myriad of issues facing cities, including housing affordability which is a
significant issue across many of Australia’s cities.
1 Introduction
The share of the world’s population living in urban areas is projected
to rise to 59.9 % and 67.2 % in 2030 and 2050 respectively (Heilig 2012). The total
population is projected to be 8.3 billion people and 9.3 billion people in 2030 and
2050 respectively (Heilig 2012). This growth and the increasing trend of people
moving to urban localities, predominantly cities, brings about additional pressures
on services, infrastructure, housing, transport and the overall quality of life of city
dwellers. Issues such as housing affordability are becoming increasingly pressing
for city planners and policy makers to address.
Over the last 50 years, an increasing array of digital data has been produced
relating to urban settlements, resulting in the rise of the ‘information city’ as referred
to by Castells (1989). Some 80 % of this data can be given a location attribute and
be used to create spatial databases which can then be analyzed and visualized to
help understand urban growth and development and to plan for sustainable urban
futures. However, as Townsend (2013) points out, we must be conscious of the
limitations of our ability to predict the future and how we use such information to
engage our communities and support bottom-up participatory planning of our cities.
In this paper we introduce an online spatial analysis workbench where data
representing Australian cities can be accessed, analyzed and visualized. Since 2011,
the Australian Urban Research Infrastructure Network (AURIN) has made avail-
able an online workbench, which provides access to over 1800 spatial datasets from
over 30 data providers from across Australia. As of August 2015 there are over
eight billion data elements that cover all major cities of Australia, crossing health,
housing, transport, demographics and other essential characteristics of cities. This
includes historical data, current data and future data, which can be used, for
example, to assess the expected population growth for major cities. In this chapter
we will focus on the issue of housing affordability and how the data and analytical
tools available via the AURIN online workbench (hereby referred to as ‘the
Workbench’) can be used to understand this in the situational context of Sydney
where unprecedented levels of housing unaffordability are currently being experi-
enced (Committee for Sydney 2015).
2 Housing Affordability
Housing affordability is a persistent policy issue and has been the focus of interest
to Australian housing researchers for many years (Yates et al. 2013). The recent
escalation in property prices in Australia’s major cities is an all too familiar feature
of both popular media reports (e.g. Ting 2015) and conversations around the garden
barbecue. Yet it remains a policy Cinderella, with much wringing of hands accom-
panied by little coherent policy intervention. The complexity of the drivers of
housing markets and the range of approaches to explaining these processes militate
against a coherent understanding of these drivers or prescribing policy solutions to
improve affordability outcomes (Economic Review Committee 2015).
Defining what is or is not an affordable home, or how this might be best
measured, is also problematic and subject to longstanding debates, both in
Australia and elsewhere (Hulchanski 1995; Stone 2006; Gabriel et al. 2005). The
two basic components of any housing affordability assessment, namely housing
costs and household income and wealth, defy easy measurement, especially over
time. While housing affordability measures vary in detail and construction, there
are two basic approaches: the housing cost to household income ratio approach in
which a measure of housing costs (rent, mortgage payment, etc.) is compared to a
measure of household income, and the income after housing cost or ‘residual
income’ approach, which seeks to define the household income remaining after
housing costs are deducted and then comparing this to a low income benchmark of
some kind. The prevalent use of ratio measures (housing cost as a percentage of
household income) stems from the more ready availability of such data and the ease
of calculation and interpretation. However, the residual income method might be
more appropriate for lower income households for whom the absolute gap between
housing costs and household income matters the most.
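The two approaches can be stated precisely; the short sketch below (Python, with invented dollar figures purely for illustration) contrasts them for a single household.

```python
def cost_to_income_ratio(housing_cost, income):
    """Ratio approach: housing cost as a share of gross household income."""
    return housing_cost / income

def residual_income(income, housing_cost):
    """Residual income approach: income left after housing costs, to be
    compared against a low-income benchmark budget."""
    return income - housing_cost

# Hypothetical household: A$1,400/week income, A$450/week rent.
print(f"ratio: {cost_to_income_ratio(450, 1400):.0%}")    # ~32 %
print(f"residual: A${residual_income(1400, 450)}/week")   # vs. benchmark
```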
The role of data in systematically analysing the issue of housing affordability is
critical. Understanding how and why housing markets function and whether or not
housing is affordable to various sections of the population is a fundamental
prerequisite for coherent policy development. The theme of ‘Urban Housing’ exists
as one of seven activated ‘lenses’ which were created by AURIN to define and then
collate a range of urban data sets and develop appropriate analytical tools (Pettit
et al. 2013a). AURIN’s Urban Housing Lens focused on delivering the kinds of data
to support systematic monitoring of housing affordability. It also aimed to
provide the base data for developing much better understanding of how the housing
market works more generally and where pressures of housing costs, and therefore
housing affordability, occur and for whom.
Ironically, it is not that we do not have data to help us. In fact, in Australia,
details of every residential property sale and rental agreement are gathered by State
and Territory governments as on-going statutory administrative requirements.
These are all recorded at address level and therefore can be geo-coded to allow a
precise spatial matching to the land use property cadastre. The potential for detailed
spatial analysis of these data is therefore significant. In the case of property sales,
these data are sold to commercial companies for on-sale, once suitably processed, to
the real estate industry, media, insurance industry, banking sector and others to
assist in their business activities. There are several major national private firms that
gather and disseminate sales data from around Australia on a for-profit basis.
However, access to property sales data by the research and policy community has
been far more limited and often incurs significant expense or time-consuming
negotiation, often stymied by data protection concerns. This is a paradox—while
the local estate agent can tell you the sales history of your house in some detail, a
university researcher has significant difficulty in obtaining the same data for
research purposes, unless he or she has the cash to buy it. For the first time,
AURIN has developed a machine to machine data feed directly into the
Australian Property Monitors housing sales databases. This is a significant step
forward in supporting urban housing research across Australia in accessing this
‘big’ housing data asset.
For rental data, the problem is more intractable as it is held by the various
jurisdictional Rental Bond Boards who have no direct interest in the market
information the records contain and, to date, have shown little interest in dissem-
inating it as useable data assets. Also, unlike the house price data, no one as yet has
attempted to assemble a nationally consistent dataset on rents from the various
jurisdictional departments that gather this information. Other key housing datasets,
from Federal rent assistance and first home owner payments to mortgage and
foreclosure data, are all collected by various government and private agencies at
address level. Assembling these data together in a nationally consistent manner,
geocoded to match with the property cadastre, would provide researchers and
policy makers with a vastly greater capacity to study and better understand how
housing markets function. This has also yet to be achieved. However, in this chapter
we illustrate how such comparable data can be brought together, analysed and
visualized using the AURIN workbench in the context of one jurisdiction, Sydney,
for understanding various dimensions of housing affordability. It is to be hoped that
a broader range of administratively and commercially held housing market data will
become available for research in the not too distant future.
Fig. 1 Systems view of housing market in Australia with the central issue of affordability (Pettit
et al. 2013a)
machine (programmatic) access to these data sets (and the data remains as close to
the data custodians as technically possible). This data has been curated and the
associated metadata is used to register the data feeds into the Portal. There is also a
CKAN metadata search and discovery tool as part of the wider workbench. The
workbench, comprising the Portal, is conceptually illustrated in Fig. 2. It has been
implemented using an open source federated technical architecture (Sinnott
et al. 2015). The federated data structure enables datasets from across different
cities, government agencies and the private sector to feed into the workbench. A
key component to the success of AURIN has been its engagement with data pro-
viders and associated stakeholders from government, industry and academia. Many
of these data sets have been siloed behind organizational firewalls and are not
easily discoverable or accessible. As of August 2015 AURIN has federated over
1800 critical datasets which can be used to support evidence-based research and
data-driven policy and decision-making around the challenges facing our cities.
Some of this data can be considered ‘big’ data, such as the Australian Property
Monitors (APM) data which is discussed later in this paper.
As part of the broader workbench there are also a number of spatial statistical
routines and modelling tools to support urban informatics including, for example, a
Walkability Toolkit and a What if? Scenario Planning Tool. The Walkability toolkit
includes an agent based approach for calculating pedsheds (Badland et al. 2013).
Pedsheds are commonly referred to as the area within walking distance from a
particular destination such as a town centre or a train station. The What if? Scenario
Planning Tool integrates the well-known What if? GIS-based
planning support system (PSS) (Klosterman 1999) into the AURIN workbench (Pettit et al. 2013b,
2015b). Other tools within the workbench include an employment clustering tool,
and a suite of spatial and statistical routines, charting and mapping visualization
Fig. 2 System architecture for the AURIN workbench (Pettit et al. 2015a), adapted from Sinnott
et al. (2015)
capabilities. All of these tools require access to a rich tapestry of datasets which can
be shopped for via the Portal. In this paper we focus on the application of a number
of spatial statistical routines which are available in the AURIN Portal.
The AURIN project has also established a number of Data Hubs and feeds across
the country to programmatically access data required to support urban researchers
(Delaney and Pettit 2014). These data hubs are essentially a series of distributed
computer servers which reside across jurisdictions where jurisdictional data resides
and where an AURIN client has been developed to be able to provide data through
to the Portal. These Data Hubs close the loop between data owners and data users by
ensuring each hub is established in alignment with a set of core principles, defined as:
(1) facilitating collaboration and interaction between end users and data custodians;
(2) being held as close to the source as possible; (3) set up to serve a broad end user
community, not a single project; and (4) sufficient information (including metadata)
being provided for users to understand the data.
The housing affordability analysis undertaken in this research draws upon data at
aggregate census geographies from across a number of data hubs throughout
Australia. However, housing data is primarily drawn from the Sydney Housing data
hub where the NSW Rental Bond Board data resides and through the property data
made available through the Australian Property Monitors (APM) which is a Sydney
based company. At present, the NSW Rental Bond Board data available through the
Portal is limited to a select number of years’ worth of aggregated rental bond data,
but provides median rent data for each year at the ABS’s Statistical Areas 2 (SA2)
level by broad dwelling type and bedroom number. Licensing agreements preclude
the release of these data at the unit record level. The SA2 census tract comprises an
aggregate spatial unit of approximately 15,000 properties. The APM dataset is
spatial-temporal and goes back over 20 years, with monthly updates and includes
descriptive information such as dwelling type (house, unit, land), number of
bathrooms and bedrooms and dwelling features (including more than 20 associated
dwelling characteristics such as laundry room availability, garage, BBQ facilities,
harbour or beach view, etc.). This data is longitudinal and exists across 14 million
land parcels across all of Australia comprising approximately 20 columns of
attributes. There are in the vicinity of 5.6 billion records in total. The data has been
aggregated to 12 levels of geography to align with standard geographies supported
by the Australian Bureau of Statistics and Australia Post. Each level of aggregation
has been created as its own unique data product to support multiple level analysis
and ensure the data is in a compatible geography required to run the AURIN portal
spatial-statistical tools. The APM property data has been judiciously collated,
cleaned internally by APM and made available spatially-temporally through a
hosted Postgres/PostGIS database for distribution via GeoServer. Researchers
whose credentials have been authenticated through the Australian Access Federa-
tion (Sinnott et al. 2015) can then access the APM data services via the AURIN
portal.
The data also includes comprehensive information on sales, auction and rent
economic cycles, including total transaction prices with corresponding statistical
analysis, for example, median price, detailed price breakdown at each fifth percen-
tile, standard deviation price, geometric mean price, as well as first and final
advertisement prices and dates, settlement dates, and auction clearance rate
among others. In this instance, the point to stress is that the APM data, in this
form, is a highly detailed disaggregated resource drawing on a wealth of statistical
assessments from a pool of much larger data. It is this form of ‘big data’ which is
used for analysing housing affordability and trends in Australia using the suite of
spatial-statistical tools available via the Portal and will be discussed further in the
Sydney housing affordability use case.
This section utilises the range of data currently available within the Portal to show
how a much more spatially disaggregated analysis of local housing affordability
and related housing market statistics can be derived, using Sydney as a case study.
As of 2014, the Australian Bureau of Statistics (ABS) reports that Sydney’s
population stands at some 4.76 million, growing by 15 % since 2001. The NSW
Government Department of Planning and Environment projects the city’s population
to grow by an additional one million people over the next 10 years. Between 2001
and 2014, Sydney’s median house price more than doubled, from A$338,000
to A$730,000 (APM). Median weekly incomes grew from $988 (2001) to $1444
(2011), a 46 % increase.
Use is made of AURIN’s access to detailed sales and rental information and also,
where feasible, the Portal’s current analytical techniques. The case study also
identifies a few of the current technical limitations and reflects on challenges
relating to data access. Underlying the case study, however, is a simple narrative
concerning conceptual debates about housing affordability, noted above, and how
these can be tested and assessed as more detailed data becomes available.
One of the central difficulties in constructing any meaningful local index of
housing affordability has been a paucity of accessible information. Most urban
researchers rely, for better or worse, on forms of Census materials for local level
analysis. In the Australian context, the quinquennial Census undertaken by the
Australian Bureau of Statistics (ABS) collects information on household, family
and individual incomes and provides considerable geographic resolution. Unfortu-
nately, housing cost information that is collected by the Census focuses on weekly
rent and weekly mortgage payments. Whilst the former provides a relatively sound
proxy for calculating rent affordability, the latter cannot be readily used to consider
affordability in the owner occupied sector. Mortgages, of course, decrease over the
life of repayments with many locations containing households who have been in
residence for considerable periods of time. This serves to produce a substantial
mismatch between stated mortgage payments and the actual current market values
of properties.
Firstly, the following capitalises on the local level household incomes data from
the Census and local house sales data (provided through the Portal by APM) in
order to begin to bridge this gap. Secondly, focusing on the rental market, an
example of a workflow to identify local changes in rental affordability is set out.
Finally, a more detailed assessment of affordability is set out. In this final example,
some of the current limitations of the Portal, both in terms of data availability and
analytical capacity are discussed.
One of the most widely cited affordability indexes is the Demographia affordability
index (Performance Urban Planning 2015). Under this methodology Sydney has
consistently been identified as one of the top ten most unaffordable cities in the
world. As of 2015, Sydney sits third in the rankings behind Hong Kong and
Vancouver.
Underneath these macro city-wide scale assessments, the index itself is very
simple in construction; comparing city wide median incomes to city wide median
sales prices to define a ratio. Typically data is sourced from national statistical
agencies in aggregate form. Although lacking any theoretical justification, ratios of
3 or below are classified as ‘Affordable’ and 5.1 or over as ‘Severely Unaffordable’
(see Table 1). Sydney’s current median multiple is 9.8.
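For concreteness, the calculation and the two thresholds quoted above can be sketched as follows (Python; the income figure is the 2011 Census weekly median cited later in this section, annualised, so the resulting multiple is indicative only).

```python
def median_multiple(median_price, median_annual_income):
    """Demographia-style ratio of median sale price to median income."""
    return median_price / median_annual_income

def classify(multiple):
    if multiple <= 3.0:
        return "Affordable"
    if multiple >= 5.1:
        return "Severely Unaffordable"
    return "intermediate band (see Table 1)"

m = median_multiple(730_000, 1_444 * 52)   # 2014 median price, 2011 income
print(f"{m:.1f} -> {classify(m)}")         # ~9.7 -> Severely Unaffordable
```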
Despite its limitations (Phibbs and Gurran 2008), for broad comparisons of
different cities the Demographia approach is both robust and replicable. However,
headline multiples belie underlying spatial complexity. The Portal provides access
to prepared APM sales data for the period 1993 through 2015, broken down by
percentile ranges. Such provision enables a more spatially disaggregated analysis to
be derived through the application of a standard median multiple approach, but in
this case, at the local (SA2) level.
Figure 3 presents the median multiple classifications derived across Sydney
using local median incomes and reported median sales values for 2011 at the SA2
level. From Fig. 3 it is apparent that not even Sydney’s suburban fringe locations
are affordable under the Demographia classification: in these more peripheral
locations, multiples between 4.3 and 5.9 are common. In the inner city, and
particularly the more affluent eastern coastal suburbs, the multiple between median
prices and median incomes is well over double (11+) the Severely Unaffordable
classification applied by Demographia.
The case study now shifts focus to rental affordability. Australia has a relatively
large and well developed private rental sector in comparison to many other western
economies. In 2011 around 32 % of all residential dwellings were rented, a signif-
icant increase from the 25 % recorded in 2006 (source: ABS).
Rents are perhaps one of the most dynamic measures of housing affordability.
Whilst house prices tend to garner the most attention in debates in this area,
property sales themselves represent around half of the overall level of
housing transactions occurring. This is especially true in Sydney, which is the focus
of this case study. In 2011 there were a little over 120,000 sales of residential
properties. In comparison, 130,000 new rental contracts were signed. For context,
the average rental contract in Sydney lasts for a little over 2 years.
One of the most widely utilised assessments of rental affordability is a strict
assessment of median rents to median incomes, with a benchmark of 30 % of
household income used as the cut-off threshold for affordability. Such calculations
can be carried out utilising a number of resources, most typically, Census data.
Ultimately, however, Census data tend not to provide useful contextual information
about the scale of housing market activity, a key qualifying component for more
Fig. 7 Scatter plot of percentage change in rents and rent to income increases
Utilising the Portal’s spatial selection tools (currently limited to bounding box
selection—Fig. 9) it is possible to drill down inside locations of interest and extract
local level information. Figure 10 illustrates the application of the bounding box
selection technique to define a case study area within the significant clusters of
rental affordability deterioration identified using the Getis-Ord assessment. Using
these selection criteria it is then possible to run multiple queries on the individual
property level records of rental lettings provided by APM. Figure 10 sets out the
distribution of individual lettings for houses made in 2006 within the case study
area.
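Outside the Portal, the Getis-Ord hot-spot step can be reproduced with open-source spatial tools; the sketch below uses the PySAL ecosystem on an assumed SA2 polygon layer (the file path and attribute name are placeholders, not AURIN outputs).

```python
import geopandas as gpd
from libpysal.weights import Queen
from esda.getisord import G_Local

# Assumed SA2 layer carrying the change in rent-to-income ratios per tract.
sa2 = gpd.read_file("sa2_rental_change.gpkg")

w = Queen.from_dataframe(sa2)   # contiguity-based spatial weights
w.transform = "r"               # row-standardise

g = G_Local(sa2["rent_share_change"].values, w, star=True)
sa2["gi_z"] = g.Zs              # large positive z-scores mark clusters of
sa2["gi_p"] = g.p_sim           # deteriorating rental affordability
```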
Being able to further quantify and detail not just a basic metric of affordability
but how this metric relates to local level availability offers greater sophistication to
debates that often overlook the complexity of the situation. The ability to generate
realistic schedules of how many, or how few, properties were actually affordable
(even at an abstract definition of household income) is only made feasible through
the provision of individual information. For the case study location, the story is
stark. In 2006 a little over 32 % of all rental properties advertised were affordable
at the local median income level. In 2011 this had declined to 2.5 %. Whilst in 2006
the profile of affordable rental properties was split equally between one and two bed
properties, in 2011 the affordable component only contained one bed properties.
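A minimal version of this schedule-of-affordability count, using a toy stand-in for the individual APM letting records (column names and rents invented for illustration), might look like the following.

```python
import pandas as pd

# Toy stand-in for individual letting records inside the bounding box.
lettings = pd.DataFrame({"weekly_rent": [380, 410, 450, 520, 640, 700],
                         "bedrooms":    [1,   1,   2,   2,   3,   3]})

median_income = 1444                  # local median income, A$/week (Census)
cutoff = 0.30 * median_income         # 30 % rent-to-income line (~A$433)

affordable = lettings[lettings["weekly_rent"] <= cutoff]
print(f"{len(affordable) / len(lettings):.0%} of lettings affordable")
print(affordable["bedrooms"].value_counts())
```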
Up to this stage, the case study has utilised some basic concepts of housing
affordability that could be readily applied in the current version of the Portal. Being
able to replicate these metrics at the local scale has demonstrated that Sydney
appears to face severe housing unaffordability problems. However, as stated at the
outset, the strict median multiple definitions of housing affordability were derived
in earlier times, periods where there was a paucity of data available (and especially
availability of data at a local level). Such calculations are useful at the macro
(or city level), but begin to raise questions concerning their relative utility at the
local scale.
In the face of these questions, more nuanced calculations have been developed
over time. One of these is an assessment of household incomes translated into
current borrowing capacity, allowing comparisons to be made between open market
sales prices and the amount a typical household could reasonably be expected to
afford to pay. Such a metric enables ratios between affordable levels of mortgage
repayments (nominally, 30 % of family income) and observed sales prices to be
made. For this, a well-established price threshold using the ‘30:40 rule’ is used.
Here, comparison is made on the basis of the price affordable at 30 % of income to
households earning at the bottom 40th income percentile (Gabriel et al. 2005). This
value can then be compared to different sales price positions.
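Translating an income into a price threshold rests on the standard annuity formula; the sketch below is one plausible implementation, in which the interest rate, loan term, and 40th-percentile income are illustrative assumptions rather than figures from the chapter (deposits and transaction costs are ignored).

```python
def affordable_price(annual_income, rate=0.05, years=30, share=0.30):
    """Loan principal serviceable at `share` of gross income under a
    standard fixed-repayment (annuity) mortgage."""
    payment = annual_income * share / 12     # affordable monthly repayment
    r, n = rate / 12, years * 12
    return payment * (1 - (1 + r) ** -n) / r

threshold = affordable_price(70_000)         # hypothetical 40th percentile
print(f"threshold A${threshold:,.0f}")       # ~A$326,000 at these assumptions
print(f"30:40 ratio: {threshold / 730_000:.2f}")  # vs. a median sale price
```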
In the following final example, the 30:40 definition of affordability is compared
to median recorded sales price from each SA2. This has been made possible through
the pre-prepared provision of the APM data set broken down by detailed prices for
each percentile (5th through 95th). Whilst there are over 500 ABS data products
directly accessible via the AURIN portal that urban researchers can access and
interrogate, for more customized data products other software is currently preferred
for extraction and manipulation. Subsequently, data on incomes has been extracted
from the ABS Table Builder product and mortgage calculations have been
conducted in Excel prior to uploading to the Portal. One of the strengths of the
Portal is users can upload their own data into their cloud based project space. This
customized data can then be joined to one of the 1800 AURIN data products to
perform further analysis and visualize the results.
As Fig. 11 sets out, the distribution of the ratio between the 30:40 calculated
affordability thresholds and local median sales prices presents a different picture to
that discussed previously. Values below 1 identify locations where the affordable
threshold is considerably below the achieved median prices, values above 1 indicate
locations where incomes can comfortably afford monthly repayments. It is evident
that more peripheral suburban locations are considerably more affordable for
households at the 40th income percentile than inner city locations.
Whilst this more nuanced take on housing affordability has produced a geogra-
phy broadly comparable to that seen in Fig. 3, it should be noted that the resulting
analysis provides a more realistic picture. For example, whilst sales across much of
the inner city remains severely unaffordable at the 40th income percentile, key
pockets exist where ratios are only marginally above the 30:40 definition.
This final stage of the case study has demonstrated how differing aspects of
housing affordability can be highlighted through the application of more complex
data sets. Critically, due to the nature of the ABS Census derived income variables,
only a partial picture could be developed. Income variables are a product of all stated
household incomes and are therefore influenced by homeowners, renters, young
families and retirees alike: they do not specifically reflect the incomes of
those trying to enter home ownership or trade up or down within the housing market.
In part, this limitation is due to the manner in which Census data has been tradition-
ally provided, namely through pre-existing flat file tabulations. A more meaningful
income profile (or indeed groups of profiles) cannot be readily developed from such
resources. Income data for these kinds of disaggregated cohorts derived from
individual Census records at the local scale would be more rewarding. However,
currently the costs of obtaining such data preclude this more detailed analysis.
This latter point suggests that attempting to harmonise all possible data resources
across all possible domains whilst retaining the flexibility to interrogate data fully is
a challenging goal. Different data resources have different requirements and to
maximise their utility requires a significant level of technical expertise and domain
knowledge, not to mention cost-effective access to data. Such approaches suggest
hybrid working practices where the formats of outputs are harmonised, either within
comparable formats or spatial frameworks (and increasingly both). Where the
Portal fits within such considerations is in its ability to provide machine to machine
access to significant data resources across Australia and the ability to also allow
users to import resources derived from other sources in a standardised manner. The
Portal does provide access to urban ‘big’ data but is only exposed to the end user
after it has been curated and cleaned. Providing access to such valuable data and a
suite of analytical tools via a one-stop-shop interface, whether via a portal or
dashboard, is an important step in realising the utility of big data. Similar observations
have recently been raised by various commentators (Wilson 2015; Batty 2015; Rae and
Singleton 2015).
Fig. 11 Ratio between 30:40 mortgage calculations and median house sales prices
The final case study points towards a tantalizing possibility for further
development. Applying mortgage calculations to individual income profiles quan-
tifies the level of affordable housing loans that different cohorts can achieve.
Assessing these against detailed house price transaction data would allow a quan-
tification of affordability: how many properties (if any) could have been purchased?
This possibility could be used to extend affordability debates through adding the
context of availability. Instead of classifying groups of areas (or whole cities) as
unaffordable, more attuned analysis could be undertaken in order to capture short-
falls in affordable properties, or indeed the financial gaps between prices and
different cohorts.
In the context of the Portal, specifically, the basics for such analysis could be
provided through the development of a simple tool into which affordability assump-
tions could be entered (base rates, lifetime of loan, etc.), much in the same manner
as are available on many mortgage lender websites currently. This housing afford-
ability tool could be developed to calculate the value of home loans achievable
under various assumptions at the local level for different household cohorts. These
can be compared to the APM data in order to begin to track not just affordability but
levels of availability within the housing market over time. The development of such a
tool could be implemented via the Portal’s object-oriented modelling framework
(Sinnott et al. 2015).
5 Conclusions
Affordable housing is part of a larger systems view of affordable living (Kvan and
Karakiewicz 2012). The elements of the latter are necessarily more diverse than
those that contribute to an analysis of affordable housing alone and would include
factors such as accessibility to transportation, employment, education and recrea-
tional facilities. Many of these data themes exist within the AURIN data sets. The
potential of the workbench is to deliver access to a rich variety of data sets such that
the broader questions of affordable living can be examined with a robust model of
affordable housing underpinning the analysis.
In this paper we have discussed the workbench in the context of accessing big
data from across the AURIN federated data architecture, specifically from the
Australian Property Monitors Geoserver web service. We have focused on the
issue of housing affordability in the context of Sydney. Specifically, we have
shown how the data and the analytical tools within the Portal (the flagship product
within the AURIN workbench) can be used to understand the spatial-temporal
patterns of housing affordability across a city. However, whilst the Portal is a
powerful vehicle for data access, with over 1800 datasets, and analytics with over
100 spatial-statistical tools for analysis, there are limitations in attempting to
provide a technical solution that endeavours to integrate geoportal functionality
References
Badland H, White M, Macaulay G, Eagleson S, Mavoa S, Pettit C, Giles-Corti B (2013) Using simple
agent-based modeling to inform and enhance neighborhood walkability. Int J Health Geogr 12:58
Batty M (2015) A perspective on city dashboards. Reg Stud Reg Sci 2:29–32
Castells M (1989) The information city. Blackwell, Oxford
Committee for Sydney (2015) A city for all: five game-changers for affordable housing in Sydney.
Sydney issues Paper No. 8
Delaney P, Pettit CJ (2014) Urban data hubs supporting smart cities. In: Proceedings of the
Research@Locate’14. 7–9 Apr 2014
Economic Review Committee (2015) Out of reach? The Australian housing affordability chal-
lenge. Australian Senate
Gabriel M, Jacobs K, Arthurson K, Burke T, Yates J (2005) Conceptualising and measuring the
housing affordability problem. Australian Housing and Urban Research Institute, Sydney
Getis A, Ord JK (1992) The analysis of spatial association by use of distance statistics. Geogr Anal
24:189–206
Heilig GK (2012) World urbanization prospects: the 2011 revision. United Nations, Department of
Economic and Social Affairs (DESA), Population Division, Population Estimates and Pro-
jections Section, New York
Hulchanski JD (1995) The concept of housing affordability: six contemporary uses of the housing
expenditure‐to‐income ratio. Hous Stud 10:471–491
Klosterman RE (1999) The what if? Collaborative planning support system. Environ Plan B: Plan
Des 26:393–408
Kvan T, Karakiewicz J (2012) Affordable living. In: Pearson C (ed) 2020 vision for a sustainable
society. Melbourne Sustainable Society Institute, Melbourne, Victoria
Ord JK, Getis A (1995) Local spatial autocorrelation statistics: distributional issues and an
application. Geogr Anal 27:286–306
Performance Urban Planning (2015) 11th Annual demographia international housing affordability
survey: 2015. Christchurch
Pettit CJ, Stimson R, Tomko M, Sinnott R (2013a) Building an e-infrastructure to support urban
and built environment research in Australia: a Lens-centric view. In: Surveying & spatial
sciences conference, 2013
Pettit CJ, Klosterman RE, Nino-Ruiz M, Widjaja I, Russo P, Tomko M, Sinnott R, Stimson R
(2013b) The online what if? Planning support system. In: Planning support systems for
sustainable urban development. Springer
Pettit CJ, Klosterman RE, Delaney P, Whitehead AL, Kujala H, Bromage A, Nino-Ruiz M (2015a)
The online what if? Planning support system: a land suitability application in Western
Australia. Appl Spatial Anal Policy 8:93–112
Pettit CJ, Barton J, Goldie X, Sinnott R, Stimson R, Kvan T (2015b) The Australian urban
intelligence network supporting smart cities. In: Geertman S, Ferreira JJ, Goodspeed R,
Stillwell J (eds) Planning support systems and smart cities. Springer International Publishing
Phibbs P, Gurran N (2008) Demographia housing affordability surveys: an assessment of the
methodology. Shelter NSW, Sydney
Rae A, Singleton A (2015) Putting big data in its place: a Regional Studies and Regional Science
perspective. Reg Stud Reg Sci 2:1–5
Sinnott RO, Bayliss C, Bromage A, Galang G, Grazioli G, Greenwood P, Macaulay A,
Morandini L, Nogoorani G, Nino‐Ruiz M, Tomko M, Pettit CJ, Sarwar M, Stimson R,
Voorsluys W, Widjaja I (2015) The Australia urban research gateway. Concur Comput Pract
Exp 27:358–375
Stimson R, Tomko M, Sinnott R (2011) The Australian Urban Research Infrastructure Network
(AURIN) initiative: a platform offering data and tools for urban and built environment
researchers across Australia. In: State of Australian Cities (SOAC) conference 2011
Stone ME (2006) What is housing affordability? The case for the residual income approach. Hous
Policy Debate 17:151–184
Ting I (2015) The Sydney suburbs where minimum wage workers can afford to rent. Sydney
Morning Herald, June 2015
Townsend AM (2013) Smart cities: big data, civic hackers, and the quest for a new utopia. WW
Norton & Company, New York
Wilson MW (2015) Flashing lights in the quantified self-city-nation. Reg Stud Reg Sci 2:39–42
Yates J, Milligan V, Berry M, Burke T, Gabriel M, Phibbs P, Pinnegar S, Randolph B (2013)
Housing affordability: a 21st century problem, national research venture 3: housing afford-
ability for lower income Australians Final Report No. 105. Australian Housing and Urban
Research Institute, Melbourne
A Big Data Mashing Tool for Measuring
Transit System Performance
Abstract This research aims to develop software tools to support the fusion and
analysis of large, passively collected data sources for the purpose of measuring and
monitoring transit system performance. This study uses San Francisco as a case
study, taking advantage of the automated vehicle location (AVL) and automated
passenger count (APC) data available on the city transit system. Because the
AVL-APC data are only available on a sample of buses, a method is developed to
expand the data to be representative of the transit system as a whole. In the
expansion process, the General Transit Feed Specification (GTFS) data are used
as a measure of the full set of scheduled transit service.
The data mashing tool reports and tracks transit system performance in these key
dimensions:
• Service Provided: vehicle trips, service miles;
• Ridership: boardings, passenger miles, passenger hours, wheelchairs served,
bicycles served;
• Level-of-service: speed, dwell time, headway, fare, waiting time;
• Reliability: on-time performance, average delay; and
• Crowding: volume-capacity ratio, vehicles over 85 % of capacity, passenger
hours over 85 % of capacity.
An important characteristic of this study is that it provides a tool for analyzing
the trends over significant time periods—from 2009 through the present. The tool
allows data for any two time periods to be queried and compared at the analyst’s
request, and puts the focus specifically on the changes that occur in the system, and
not just observing current conditions.
1 Introduction
between APC data which was used in its archived form and AVL data which was
often designed for real-time analysis and not archived or analyzed retrospectively
(Furth et al. 2006). More complete data systems have since been developed that
encapsulate the data processing and reporting (Liao 2011; Liao and Liu 2010),
apply data mining methods in an effort to improve operational performance
(Cevallos and Wang 2008), and examine bus bunching (Byon et al. 2011; Feng
and Figliozzi 2011). Initial attempts have been made to visualize the data at a
network level (Berkow et al. 2009; Mesbah et al. 2012).
Two important characteristics distinguish this study from previous work.
First, it operates on a sample of AVL-APC data, and a methodology is
established to expand the data to the schedule as a whole and weight the data to
represent total ridership. This is in contrast to the examples given above which
generally assume full data coverage. Establishing expansion and weighting
methods is important because it allows Big Data analysis to be applied in a wider
range of locations with lower expenditure on data collection equipment.
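A simplified version of this expansion step might look like the following (Python/pandas, with toy data; the column names are illustrative rather than the actual SFMTA or GTFS field names). Each observed trip receives a weight equal to the ratio of scheduled to observed trips within its route and time period.

```python
import pandas as pd

# Toy stand-ins: one GTFS row per scheduled trip, one APC row per observed trip.
gtfs = pd.DataFrame({"route": ["1"] * 8 + ["5"] * 4,
                     "period": ["AM"] * 4 + ["PM"] * 4 + ["AM"] * 4})
apc = pd.DataFrame({"route": ["1", "1", "5"],
                    "period": ["AM", "PM", "AM"],
                    "boardings": [40, 25, 30]})

keys = ["route", "period"]
scheduled = gtfs.groupby(keys).size().rename("scheduled")
observed = apc.groupby(keys).size().rename("observed")
weights = (pd.concat([scheduled, observed], axis=1)
             .assign(weight=lambda d: d["scheduled"] / d["observed"]))

apc = apc.merge(weights["weight"].reset_index(), on=keys)
apc["expanded_boardings"] = apc["boardings"] * apc["weight"]
print(apc)  # each observed trip now stands in for `weight` scheduled trips
```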
Second, this study develops a tool to analyze the trends over a significant time
period—from 2009 through the present—as opposed to many applications which
focus on using the data to understand a snapshot of current operations (Liao and Liu
2010; Feng and Figliozzi 2011; Wang et al. 2013; Chen and Chen 2009). The tool
allows data for any two time periods to be queried and compared at the analyst’s request,
and puts the focus specifically on the changes that occur in the system, and not just
observing current conditions. For example, changes that occur in a specific portion of
the city may be traceable to housing developments or roadway projects at that location,
trends that may go unnoticed given only aggregate measures or cross-sectional totals.
The remainder of this chapter is structured as follows: Section 2 describes the
data sources used in this study. Section 3 covers the methodology for data
processing, including the approach used to expand and weight the data to be
representative of the system as a whole. Section 4 presents example outputs to
demonstrate the types of performance reports that the data mashing tool can
produce. Section 5 presents conclusions and expected future work.
2 Data Sources
This research uses two primary data sources provided by the San Francisco Municipal
Transportation Agency (SFMTA): automated vehicle location/automated passenger
count (AVL-APC) data, and archived General Transit Feed Specification (GTFS)
data. A third data set, from the Clipper transit smartcard system, has recently been
released and work is currently underway to validate and incorporate these data.
The AVL-APC data are formatted with one record each time a transit vehicle
makes a stop. At each stop, the following information is recorded:
• Vehicle location;
• Arrival time;
• Departure time;
• Time with door open;
• Time required to pull out after the door closes;
• Maximum speed since last stop;
• Distance from last stop;
• Passengers boarding;
• Passengers alighting;
• Rear door boardings;
• Wheelchair movements; and
• Bicycle rack usage.
In addition, identifiers are included to track the route, direction, trip, stop,
sequence of stops, and vehicle number. The vehicle locations reflect some noise,
both due to GPS measurement error and due to variation in the exact location at
which the vehicle stops. However, because the stop is identified, those locations can
be mapped to the physical stop location, providing consistency across trips. The
count data become less reliable as the vehicle becomes more crowded, but the data
are biased in a systematic way, and SFMTA makes an adjustment in the data set to
compensate for this bias. The data are not currently available on rail or cable car,
only on the buses. Equipment is installed on about 25 % of the bus fleet, and those
buses are allocated randomly to routes and drivers each day at the depot. These data
are available from 2008 to the present.
Because the AVL-APC data are available for only a sample of bus trips, the
GTFS data are used to measure the scheduled universe of bus trips. GTFS is a data
specification that allows transit agencies to publish their schedule information in a
standard format. It was initially used to feed the Google Maps transit routing, and is
now used by a wide range of applications. The data are in a hierarchical format and
provide the scheduled time at which each vehicle is to make each stop. The full
specification is available from (“General Transit Feed Specification Reference -
Transit — Google Developers,” n.d.). The data used in this study were obtained
from the GTFS archive (“GTFS Data Exchange - San Francisco Municipal Trans-
portation Agency,” n.d.), from 2009 to present.
In addition, data from the Bay Area’s transit smartcard system, Clipper Card, have
recently been made available. These data provide records of fare transactions made with
the cards. Clipper Card was introduced in 2010, and currently has a penetration rate
of approximately 50 % of riders on SFMTA buses. The data provide value over the
above sources because they allow transfers to be identified. The data are subject to
California’s laws governing personally identifiable information (Harris 2014),
making data privacy and protection issues of particular importance. Therefore,
they have been released with a multi-step anonymization and data obfuscation
process (Ory 2015).
3 Methodology
The GTFS data are first cleaned and converted, assigning route IDs consistent with the equivalency used for the AVL-APC data. This step calculates
the scheduled headway of each trip, the scheduled runtime from the previous stop,
and the distance traveled from the last stop, and along the route shape as a whole.
After the initial cleaning and conversion, the data are joined to create an
expanded data store. The goal of this expansion is to identify exactly what is
missing from the sampled data, so they can be factored up to be representative of
the universe as a whole. The relationship between the data sets is that transit
smartcard data provides a sample of about 50 % of riders, the AVL-APC data
provides 100 % of riders on a sample of about 25 % of vehicle trips, and the
GTFS data identifies 100 % of vehicle trips. Therefore, the expansion chain allows
the more information-rich data sets to be combined with the more complete, but less
rich data, much like a household travel survey would be expanded to match Census
control totals. In this case, the expansion is a left join of the AVL-APC data records
onto the GTFS records. Rail does not have AVL-APC equipment installed, so it is
excluded from the GTFS records as well. Note that this process is not able to
account for scheduled trips that are not run, due to driver or equipment availability
or other operational issues. The resulting datastore has the full enumeration of
service, but ridership and actual time information attached to only a portion of
records. Without this step, it would not be possible to differentiate between trips
that are missing because of a service change or those that are missing because they
were simply not sampled. In a setting where we are explicitly interested in exam-
ining service changes, this distinction is important.
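As a minimal sketch of this expansion step, the left join might look like the following in pandas. The DataFrame names and the join keys, based on the index fields in Table 1, are assumptions for illustration, not the tool's actual code.

import pandas as pd

def expand_to_schedule(gtfs: pd.DataFrame, avl_apc: pd.DataFrame) -> pd.DataFrame:
    # Fields that jointly identify a trip-stop; names follow Table 1 but
    # are assumptions about the implementation.
    keys = ["DATE", "AGENCY_ID", "ROUTE_SHORT_NAME", "DIR", "TRIP", "SEQ"]
    # A left join keeps every scheduled trip-stop from GTFS; ridership and
    # actual-time fields remain missing (NaN) for unsampled trips.
    expanded = gtfs.merge(avl_apc, on=keys, how="left")
    # Flag observed records, so that a trip absent from GTFS (a service
    # change) can be distinguished from one that was simply not sampled.
    expanded["OBSERVED"] = expanded["ON"].notna().astype(int)
    return expanded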
The output of this expansion process is a datastore whose structure is shown in
Table 1. The table also shows the source and data type of each field. These
disaggregate records are referred to as trip-stop records because there is one record
for each time a bus trip makes a stop (even if that stop is bypassed due to no
passengers boarding or alighting). More specifically, records are defined by a
unique combination of values in those fields identified as an index in the
source column. A related set of trip records is generated that aggregates across the
SEQ field such that there is a single record for each time a bus makes a trip.
While only about one in five trips is observed, the datastore at this root level is
suitable for making comparisons of individual trips or trip-stops. However, sum-
ming values across trips to generate time-of-day, daily, or system totals would
result in an under-estimate of the total ridership because of the missing values.
Therefore, a set of weights is developed to factor up the records to estimate the
totals at these more aggregate levels for each day.
Because an entire trip is observed together, weights are calculated for trips and
then broadcast to all stops in that trip. At the root level, the TRIP_WEIGHT is set
equal to 1 if the trip is observed and 0 otherwise. While not entirely necessary,
including this TRIP_WEIGHT allows the remaining calculations to be performed
consistently at all levels of aggregation.
The weights are calculated by grouping the trips to the level of aggregation of
interest, and within the group, applying the formula:
Table 1 (continued)

Category | Field | Description | Type | Source
Times | ARRIVAL_TIME_S | Scheduled arrival time | Datetime | GTFS
Times | ARRIVAL_TIME | Actual arrival time | Datetime | AVL/APC
Times | ARRIVAL_TIME_DEV | Deviation from arrival schedule (min) | Float | Calculated
Times | DEPARTURE_TIME_S | Scheduled departure time | Datetime | GTFS
Times | DEPARTURE_TIME | Actual departure time | Datetime | AVL/APC
Times | DEPARTURE_TIME_DEV | Deviation from departure schedule (min) | Float | Calculated
Times | DWELL_S | Scheduled dwell time (min) | Float | GTFS
Times | DWELL | Actual dwell time (min) | Float | AVL/APC
Times | RUNTIME_S | Scheduled running time (min), excludes dwell time | Float | GTFS
Times | RUNTIME | Actual running time (min), excludes dwell time | Float | AVL/APC
Times | TOTTIME_S | Scheduled total time (min), runtime + dwell time | Float | GTFS
Times | TOTTIME | Actual total time (min), runtime + dwell time | Float | AVL/APC
Times | SERVMILES_S | Scheduled service miles | Float | GTFS
Times | SERVMILES | Service miles from AVL/APC data | Float | AVL/APC
Times | RUNSPEED_S | Scheduled running speed (mph), excludes dwell time | Float | Calculated
Times | RUNSPEED | Actual running speed (mph), excludes dwell time | Float | Calculated
Times | ONTIME5 | Vehicle within -1 to +5 min of schedule (1 = yes, 0 = no) | Float | Calculated
(continued)
Table 1 (continued)

Category | Field | Description | Type | Source
Ridership | ON | Boardings | Float | AVL/APC
Ridership | OFF | Alightings | Float | AVL/APC
Ridership | LOAD_ARR | Passenger load upon arrival | Float | AVL/APC
Ridership | LOAD_DEP | Passenger load upon departure | Float | AVL/APC
Ridership | PASSMILES | Passenger miles | Float | Calculated
Ridership | PASSHOURS | Passenger hours, including both runtime and dwell time | Float | Calculated
Ridership | WAITHOURS | Passenger waiting hours, with wait as 1/2 headway | Float | Calculated
Ridership | PASSDELAY_DEP | Delay to passengers boarding at this stop | Float | Calculated
Ridership | PASSDELAY_ARR | Delay to passengers alighting at this stop | Float | Calculated
Ridership | RDBRDNGS | Rear door boardings | Float | AVL/APC
Ridership | CAPACITY | Vehicle capacity | Float | AVL/APC
Ridership | DOORCYCLES | Number of times door opens and closes at this stop | Float | AVL/APC
Ridership | WHEELCHAIR | Number of wheelchairs boarding at this stop | Float | AVL/APC
Ridership | BIKERACK | Bikerack used at this stop | Float | AVL/APC
Crowding | VC | Volume-capacity ratio | Float | Calculated
Crowding | CROWDED | Volume > 0.85 * capacity | Float | Calculated
Crowding | CROWDHOURS | Passenger hours when volume > 0.85 * capacity | Float | Calculated
Additional ID fields | ROUTE_ID | Route ID in GTFS | Integer | GTFS
Additional ID fields | ROUTE_AVL | Route ID in AVL/APC | Integer | AVL/APC
Additional ID fields | TRIP_ID | Trip ID in GTFS | Integer | GTFS
Additional ID fields | STOP_ID | Stop ID in GTFS | Integer | GTFS
Additional ID fields | STOP_AVL | Stop ID in AVL/APC | Float | AVL/APC
Additional ID fields | BLOCK_ID | Block ID in GTFS | Integer | GTFS
Additional ID fields | SHAPE_ID | Shape ID in GTFS | Integer | GTFS
Additional ID fields | SHAPE_DIST | Distance along shape (m) | Float | GTFS
Additional ID fields | VEHNO | Vehicle number | Float | AVL/APC
Additional ID fields | SCHED_DATES | Dates when this schedule is in operation | String | GTFS
(continued)
Table 1 (continued)

Category | Field | Description | Type | Source
Weights | TRIP_WEIGHT | Weight applied when summarizing data at trip level | Float | Calculated
Weights | TOD_WEIGHT | Weight applied when calculating time-of-day totals | Float | Calculated
Weights | DAY_WEIGHT | Weight applied when calculating daily totals | Float | Calculated
Weights | SYSTEM_WEIGHT | Weight applied when calculating system totals | Float | Calculated
W_t = \frac{N}{\sum_t w_t} \, w_t

where:
W_t is the weight for trip t,
N is the number of trips in the group, and
w_t is the base weight for trip t.
These weights are built hierarchically, such that the higher-level weights incor-
porate the lower-level weights.
The first calculation is for the time-of-day weight, TOD_WEIGHT, in which the
trips are grouped by DATE, TOD, AGENCY_ID, ROUTE_SHORT_NAME and
DIR, and the TRIP_WEIGHT serves as the base weight. Because the
TRIP_WEIGHT is a mask for the observed trips, the formula simply gives the
ratio of the total trips to observed trips in the group. So if a particular route makes
ten trips in the inbound direction during the AM peak and two of those trips are
observed, the resulting weight of five is used to scale up the observations to
represent the total ridership in the AM peak.
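A compact sketch of this weighting step, assuming a trip-level pandas DataFrame named trips with the columns described above (a simplification for illustration, not the production code):

import pandas as pd

def add_group_weight(trips: pd.DataFrame, group_cols: list,
                     base_col: str, out_col: str) -> pd.DataFrame:
    # Implements W_t = (N / sum of w_t in the group) * w_t.
    grouped = trips.groupby(group_cols)[base_col]
    n = grouped.transform("size")       # N: number of trips in the group
    w_sum = grouped.transform("sum")    # sum of the base weights w_t
    ratio = (n / w_sum).where(w_sum > 0, 0.0)  # guard groups with no observations
    trips[out_col] = ratio * trips[base_col]
    return trips

# TOD_WEIGHT: TRIP_WEIGHT (1 if observed, 0 otherwise) is the base weight,
# so ten trips with two observed yields a weight of 5 on each observed trip.
trips = add_group_weight(trips, ["DATE", "TOD", "AGENCY_ID",
                                 "ROUTE_SHORT_NAME", "DIR"],
                         base_col="TRIP_WEIGHT", out_col="TOD_WEIGHT")

The DAY_WEIGHT and SYSTEM_WEIGHT described below reuse the same function with coarser groupings and the previous level's weight as the base weight.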
One level up is the DAY_WEIGHT, used for expanding to daily totals by route.
The trips are grouped by DATE, AGENCY_ID, ROUTE_SHORT_NAME and
DIR, and the base weight is the TOD_WEIGHT. If a route has some observations
in each time period for which there is service, the DAY_WEIGHT will be equal to
the TOD_WEIGHT because the base weight already factors up to the total trips
during the day. The difference occurs when there are some time periods for which
there is service, but zero trips are observed. In this situation, the data from the
remaining time periods are scaled up to account for this missing period. This weight
should not be applied when summarizing data at the time period level because
doing so would over-state the ridership in non-missing time periods, but it provides
a better estimate of the daily total by route.
Moving another level up, it is also possible that some routes will not be observed
at all during a day, so adding up the total system ridership using the
DAY_WEIGHT would miss the ridership on those unobserved routes. To account
for this problem, the SYSTEM_WEIGHT is calculated using the DAY_WEIGHT
as a base weight and grouping by DATE, TOD and AGENCY_ID. The time-of-day
grouping ensures the result is representative of the total number of trips in each time
period. As long as some trips are observed in each time period, which is true for all
days that have been inspected, another higher-level weight is not needed.
After calculating these weights, they are assigned to the disaggregate records,
and the data are aggregated with the weights applied to calculate route-stop, route,
stop and system totals by time-of-day and for the daily total. This is done separately
for each day, providing an estimate of the state of the system on each day for which
data are available. The weighted data are then aggregated by month used to
calculate conditions for an average weekday, an average Saturday and an average
Sunday/holiday in each month. These monthly average datastores are the primary
source of information for the system performance reports, discussed in the next
section, although the daily data remain available for more detailed analysis.
The estimates resulting from this process will be more reliable if there is
reasonably good coverage of observations across routes. To examine the route
coverage, Table 2 shows the percent of trips observed on each route for each
weekday in July 2010. Twenty-two percent of trips are observed, although this
varies somewhat by route. The weighting process should do a good job of account-
ing for these varying penetration rates. More limiting are the cases where zero trips
are observed on a route, which are highlighted with red cells. In these cases, the
weighting process scales up the ridership on other routes to account for the missing
values on that route. The missing values tend to occur on the routes that make fewer
trips. Overall, 93 % of routes are observed at least once during the month, with
those routes covering 96 % of trips.
One of the challenges in this effort is that the sampling of trips is not entirely
random. There are operational constraints, such as certain types of buses (motor bus
versus trolley bus, and articulated versus standard length) being needed on certain
routes, and the fact that once a bus is assigned it tends to drive the same route back
and forth. The result is that the data will not be as reliable as could be achieved with
a well-designed sampling plan, but with good overall coverage they can be expected to
provide good estimates of the state of the system.
To evaluate the magnitude of the error that can be expected from the sampling
and weighting process, the number of service miles is used as an indicator. Service
miles serves as a useful indicator because it is calculated from the GTFS data, so the
enumerated value for the system as a whole is known. For comparison, the service
miles are also calculated from the subset of observed records, with the weights
applied to scale up those observed records to the system total. These calculations
reveal that for months from 2009 through 2013, the average magnitude of the
weighting error at the system level is 1.0 %, and the maximum magnitude is 3.3 %.
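Continuing the column names of the earlier sketch, this weighting-error check reduces to a comparison of two totals:

# Scheduled service miles are fully enumerated in the GTFS-based records.
scheduled = expanded["SERVMILES_S"].sum()
# Weighted service miles from the observed subset only.
obs = expanded[expanded["OBSERVED"] == 1]
weighted = (obs["SERVMILES"] * obs["SYSTEM_WEIGHT"]).sum()
# Percent weighting error at the system level (about 1.0 % on average
# for months from 2009 through 2013, per the text).
weighting_error_pct = 100.0 * (weighted - scheduled) / scheduled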
The software was developed in an open-source framework in the Python environment. It is available for download under the GNU General Public License Version 3 (Erhardt 2014).
Table 2 Percent of trips observed on each route on weekdays in July 2010 (excerpt; each row gives the route, its number of trips, the percent of trips observed on each weekday of the month, and the monthly average)

Route | Trips | Percent observed by weekday | Month
88 B.A.R.T. SHUTTLE | 16 | 0% 50% 50% 50% 50% 0% 50% 50% 50% 50% 0% 0% 50% 50% 19% 0% 50% 100% 0% 0% 0% | 32%
90 OWL | 13 | 0% 0% 0% 0% 0% 54% 0% 0% 46% 0% 54% 46% 100% 0% 0% 0% 46% 0% 0% 46% 0% | 19%
91 OWL | 15 | 40% 0% 0% 47% 20% 0% 0% 0% 27% 33% 40% 0% 20% 0% 13% 13% 13% 0% 47% 40% 67% | 20%
95 INGLESIDE APTOS | 2 | 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% | 0%
108 TREASURE ISLAND | 150 | 21% 28% 25% 33% 55% 16% 31% 63% 23% 31% 21% 28% 24% 29% 35% 0% 20% 7% 0% 7% 7% | 24%
KM BUS | 262 | 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% | 0%
K-OWL | 3 | 33% 33% 0% 0% 0% 0% 33% 0% 0% 33% 0% 0% 33% 33% 0% 0% 33% 0% 0% 0% 0% | 11%
L-OWL | 9 | 56% 56% 0% 0% 0% 0% 44% 0% 0% 44% 0% 0% 44% 44% 0% 0% 44% 0% 0% 44% 0% | 18%
N-OWL | 12 | 25% 0% 25% 0% 0% 0% 0% 25% 33% 33% 0% 58% 58% 0% 33% 33% 0% 17% 0% 0% 33% | 18%
Total | 8,092 | 19% 22% 23% 21% 22% 21% 21% 23% 21% 24% 21% 23% 22% 25% 23% 23% 21% 23% 21% 23% 21% | 22%
4 Sample Results
This section presents sample results from the data mashing tool. The purpose of this
section is to illustrate the types of performance measures the tool is capable of reporting, and how those measures might be useful in planning. In all cases, the performance reports present information that is both relevant to the planning process and readily explainable to policy makers. The tool further puts the focus of
the analysis on the changes that occur over time, rather than a single snapshot of the
system.
Table 3 shows a sample of the monthly transit performance report. It consoli-
dates the core performance measures onto a single page, and compares them to
performance from another period, often the month before. The measures are
grouped in the following categories:
• Input Specification: Attributes selected by the user to define the scope of the
report. The geographic extent can be the bus system as a whole, a route or an
individual stop, with some minor differences for the route or stop reports. The
day-of-week is weekday, Saturday or Sunday/holiday. Time-of-day can be
specified for the daily total, or for individual time periods allowing for evalua-
tion of peak conditions. The report generation date and a comments section are
provided. The notes in this case indicate that system-wide service cuts occurred in spring 2010. Performance reports can easily be generated for each time period, allowing for monitoring of crowding during the peak periods.
• Observations: The report includes the percent of trips observed, the total number
of days and the number of days with observations. At a system level, there will
generally be observations on each day, but specific routes or stops may not be
observed on some days. The measurement error calculates the percent difference
between the total boardings and alightings, providing an indication of the level
of error that can be expected from the APC technology. The weighting error
calculates the percent difference between the scheduled service miles and the
weighted and expanded service miles, giving an indication of the error that can
be expected as a result of the sampling and weighting process (both are sketched below).
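A minimal sketch of these diagnostics, again with illustrative column names, where df stands for the weighted records in the report's scope:

# Share of trips observed (TRIP_WEIGHT is 1 for observed trips, 0 otherwise).
pct_observed = 100.0 * df["TRIP_WEIGHT"].mean()
# Measurement error: total boardings vs total alightings.
measurement_error_pct = 100.0 * (df["ON"].sum() - df["OFF"].sum()) / df["OFF"].sum()
# Weighting error: scheduled vs weighted-and-expanded service miles, as in
# the service-miles comparison sketched in Sect. 3.
scheduled = df["SERVMILES_S"].sum()
weighted = (df["SERVMILES"] * df["SYSTEM_WEIGHT"]).sum()
weighting_error_pct = 100.0 * (weighted - scheduled) / scheduled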
This performance report provides an overview allowing planners to quickly scan
a range of indicators for changes that might be occurring.
While the numeric performance measures provide valuable information, their
aggregate nature can wash out change that may be occurring in one portion of the
city. Therefore, an interactive mapping tool was developed to plot key metrics in
their geographic context. Figure 2 shows a screenshot from this tool. The left map
shows a before period, the middle map an after period, and the right map shows
either the absolute or relative change between the two periods. In this case, the
comparison is between July 2009 and July 2010, before and after the service cuts in
spring 2010. The user can select which time-of-day, which performance measure
and which direction to plot. In this instance, the user has chosen to map the degree
of crowdedness in the outbound direction during the 4–7 pm time period. The warm
colors on the left two maps indicate more crowding, as measured by the average
volume-capacity ratio during the period. The results are logical, with reasonably
full buses moving west from the central business district towards residential areas
of the city, as well as north-south on Van Ness Avenue. The map on the right shows
the relative change in the metric between the two periods, with the warm colors
indicating an increase in crowdedness and the cool colors indicating a decrease. In
this instance, the change is concentrated in about three specific routes.
To accommodate further analysis of the changes that occur to specific routes, the
software generates route profiles as shown in Fig. 3. In this example, average
weekday ridership on the 1-California route is plotted in the inbound direction
during the AM peak. The x-axis is the sequence of stops along the route. The line
charts show the number of passengers on the bus between each stop. The bar charts
show the number of passengers boarding and alighting at each stop, with positive
bars indicating boardings and negative bars indicating alightings. In all cases, the
blue colors indicate the July 2009 period, and the red colors indicate the July 2010
period. The pattern of ridership remains similar between the two periods, with
riders accumulating through the residential portions of the route, and passengers
getting off the bus when it reaches the central business district, starting at the Clay
Street and Stockton Street stop. The PM peak ridership profile would show the
reverse. The route was shortened by the July 2010 period, with service no longer
provided to the last three stops. Therefore, in the July 2010 period there are no
alightings at these stops, and an increase in alightings at the new end-of-line stop.
The overall volume on this route during the AM peak is lower after these changes.
These boarding profiles are useful when evaluating service changes made to
specific routes, or the ridership resulting from newly opened land developments.
Finally, line plots are output, as in Fig. 4, to show the trends over a longer period of
time, rather than just for two periods. This particular example shows the on-time
performance, defined as the share of buses arriving no more than 1 min early or 5 min
late. This is plotted for the daily totals, the AM peak and the PM peak. The results
show the on-time performance is generally 60–70 %, with higher values in the AM
peak and lower values in the PM peak. Any of the performance measures can be
easily plotted in this way, and doing so is an important step to understanding whether
the changes observed are real, or simply within the natural variation of the data.
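The indicator behind these plots is a simple threshold on the schedule deviation. A sketch, assuming the deviations are stored in minutes with positive values meaning late (the sign convention is an assumption):

# ONTIME5: 1 if the bus arrives no more than 1 min early or 5 min late.
dev = df["ARRIVAL_TIME_DEV"]  # minutes; positive = late (assumed)
df["ONTIME5"] = ((dev >= -1.0) & (dev <= 5.0)).astype(float)
# Weighted on-time share, e.g. within one time-of-day period:
on_time_share = (df["ONTIME5"] * df["TOD_WEIGHT"]).sum() / df["TOD_WEIGHT"].sum()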
The software can automatically generate each of the performance reports
described above, allowing for core analysis of the most important measures. In
addition, the full weighted and imputed datastore is available for advanced users
who seek to conduct further in-depth analysis or custom queries.
5 Conclusions

The product of this research is a Big Data mashing tool that can be used to measure
transit system performance over time. The software is implemented for San
Francisco, but can be adapted for use in other regions with similar data.
The paper addressed some of the methodological and mechanical challenges
faced in managing these large data sets and translating them into meaningful
planning information. One such challenge was the sampled nature of the data,
where not all vehicles have AVL-APC equipment installed. To make these data
more representative of the system as a whole, the vehicle trips in the AVL-APC
data are expanded to match the universe of vehicle trips identified by the GTFS data
and weights are developed to scale up the observed records to compensate for data that remain
unobserved.
where a small but rich data set is expanded to match a less rich but more complete
data set. Such strategies are key to spreading the use of Big Data for urban analysis
beyond the first tier of cities that have near-complete data sets to those that are
constrained by partial or incomplete data.
The software is available under an open-source license from (Erhardt 2014). For
working with these large data sets, an important decision was to use libraries that allow fast querying of on-disk data while retaining the ability to easily modify the data structure.
The data mashing tool reports and tracks transit system performance in the core
dimensions of: service provided, ridership, level-of-service, reliability and
crowding. The performance measures are reported for the system, by route and by stop, and can also be mapped using an interactive tool. The focus of the tool is on providing the ability to monitor trends and changes over time, as opposed to simply analyzing current operations. By making performance reports readily available at varying levels of resolution, the data mashing tool encourages planners to engage in data-driven analysis on an ongoing basis.
Several extensions of this research are currently underway. First, the transit
smartcard data has been obtained, and work is underway to evaluate the data set and
incorporate it into the current tool. Doing so will provide additional value by
allowing transfers and linked trips to be monitored. Second, parallel tools have
been developed to monitor highway speeds and plans are in place to incorporate
highway performance measures into a combined tool, allowing both to be tracked in
concert.
Ultimately, the data mashing tool will be applied to measure the change in
performance before and after changes to the transportation system. The study
period covers a time with important changes to the transit system, such as the
service cut discussed in the test results (Gordon et al. 2010), and several
pilot studies aimed at improving the speed and reliability of transit service in
specific corridors (City and County of San Francisco Planning Department 2013).
Evaluating these changes will provide planners and researchers with greater insight
into the effects of transportation planning decisions.
Acknowledgement The authors would like to thank the San Francisco County Transportation
Authority (SFCTA) for funding this research, the San Francisco Municipal Transportation Agency
(SFMTA) for providing data, and both for providing valuable input and advice.
References
Kittelson & Associates, Urbitran, Inc., LKC Consulting Services, Inc., MORPACE International,
Inc., Queensland University of Technology, Yuko Nakanishi (2003) A guidebook for devel-
oping a transit performance-measurement system (Transit Cooperative Research Program
No. TCRP 88). Transportation Research Board of the National Academies, Washington, DC
Kittelson & Associates, Parsons Brinckerhoff, KFH Group, Inc., Texas A&M Transportation
Institute, Arup (2013) Transit capacity and quality of service manual (Transit Cooperative
Research Program No. TCRP 165). Transportation Research Board, Washington, DC
Benson JR, Perrin R, Pickrell SM (2013) Measuring transportation system and mode performance.
In: Performance measurement of transportation systems: summary of the fourth international
conference, 18–20 May 2011, Irvine, California. Transportation Research Board, Irvine, CA
Berkow M, El-Geneidy A, Bertini R, Crout D (2009) Beyond generating transit performance
measures: visualization and statistical analysis with historic data. Transp Res Rec J Transp Res
Board 2111:158–168. doi:10.3141/2111-18
Byon Y-J, Cortés CE, Martinez C, Javier F, Munizaga M, Zuniga M (2011) Transit performance
monitoring and analysis with massive GPS bus probes of transantiago in Santiago, Chile:
emphasis on development of indices for bunching and schedule adherence. Presented at the
transportation research board 90th annual meeting
Cevallos F, Wang X (2008) Adams: data archiving and mining system for transit service improve-
ments. Transp Res Rec J Transp Res Board 2063:43–51. doi:10.3141/2063-06
Chen W-Y, Chen Z-Y (2009) A simulation model for transit service unreliability prevention based
on AVL-APC data. Presented at the international conference on measuring technology and
mechatronics automation, 2009, ICMTMA’09, pp 184–188. doi:10.1109/ICMTMA.2009.77
City and County of San Francisco Planning Department (2013) Transit effectiveness project draft
environmental impact report (No. Case No. 2011.0558E, State Clearinghouse
No. 2011112030)
Erhardt GD (2014) sfdata_wrangler [WWW Document]. GitHub. https://github.com/UCL/sfdata_wrangler. Accessed 15 July 2014
Feng W, Figliozzi M (2011) Using archived AVL/APC bus data to identify spatial-temporal causes
of bus bunching. Presented at the 90th annual meeting of the transportation research board,
Washington, DC
Furth PG (2000) TCRP synthesis 34: data analysis for bus planning and monitoring, Transit
Cooperative Research Program. National Academy Press, Washington, DC
Furth PG, Hemily B, Muller THJ, Strathman JG (2006) TCRP Report 113: using archived
AVL-APC data to improve transit performance and management (No. 113), Transit Cooper-
ative Research Program. Transportation Research Board of the National Academies,
Washington, DC
General Transit Feed Specification Reference - Transit — Google Developers [WWW Document]
(n.d.) https://developers.google.com/transit/gtfs/reference. Accessed 15 July 2014
Gordon R, Cabanatuan M, Chronicle Staff Writers (2010) Muni looks at some of deepest service
cuts ever. San Francisco Chronicle, San Francisco
Grant M, D’Ignazio J, Bond A, McKeeman A (2013) Performance-based planning and program-
ming guidebook (No. FHWA-HEP-13-041). United States Department of Transportation
Federal Highway Administration
GTFS Data Exchange - San Francisco Municipal Transportation Agency [WWW Document]
(n.d.) http://www.gtfs-data-exchange.com/agency/san-francisco-municipal-transportation-
agency/. Accessed 15 July 2014
Harris KD (2014) Making your privacy practices public: recommendations on developing a
meaningful privacy policy. Attorney General, California Department of Justice
Liao C-F (2011) Data driven support tool for transit data analysis, scheduling and planning.
Intelligent Transportation Systems Institute, Center for Transportation Studies, University of
Minnesota
Liao C-F, Liu H (2010) Development of data-processing framework for transit performance
analysis. Transp Res Rec J Transp Res Board 2143:34–43. doi:10.3141/2143-05
Lomax T, Blankenhorn RS, Watanabe R (2013) Clash of priorities. In: Performance measurement
of transportation systems: summary of the fourth international conference, 18–20 May 2011,
Irvine, California. Transportation Research Board, Irvine, CA
Mesbah M, Currie G, Lennon C, Northcott T (2012) Spatial and temporal visualization of transit
operations performance data at a network level. J Transp Geogr 25:15–26. doi:10.1016/j.jtrangeo.2012.07.005
Moving Ahead for Progress in the 21st Century Act (2012)
Ory D (2015) Lawyers, big data, (more lawyers), and a potential validation source: obtaining smart
card and toll tag transaction data. Presented at the 94th transportation research board annual
meeting, Washington, DC
Pack M (2013) Asking the right questions: timely advice for emerging tools, better data, and
approaches for systems performance measures. In: Performance measurement of transportation
systems: summary of the fourth international conference, 18–20 May 2011, Irvine, California.
Transportation Research Board, Irvine, CA
Price TJ, Miller D, Fulginiti C, Terabe S (2013) Performance-based decision making: the buck
starts here. In: Performance measurement of transportation systems: summary of the fourth
international conference, 18–20 May 2011, Irvine, California. Transportation Research Board,
Irvine, CA
Turnbull KF (ed) (2013) Performance measurement of transportation systems: summary of the
fourth international conference, 18–20 May 2011, Irvine, California. Transportation Research
Board, Washington, DC
Wang J, Li Y, Liu J, He K, Wang P (2013) Vulnerability analysis and passenger source prediction
in urban rail transit networks. PLoS One 8:e80178. doi:10.1371/journal.pone.0080178
Winick RM, Bachman W, Sekimoto Y, Hu PS (2013) Transforming experiences: from data to
measures, measures to information, and information to decisions with data fusion and visual-
ization. In: Performance measurement of transportation systems: summary of the fourth
international conference, 18–20 May 2011, Irvine, California. Transportation Research
Board, Irvine, CA
Zmud J, Brush AJ, Choudhury MD (2013) Digital breadcrumbs: mobility data capture with social
media. In: Performance measurement of transportation systems: summary of the fourth inter-
national conference, 18–20 May 2011, Irvine, California. Transportation Research Board,
Irvine, CA
Developing a Comprehensive U.S. Transit
Accessibility Database
Abstract This paper discusses the development of a national public transit job
accessibility evaluation framework, focusing on lessons learned, data source eval-
uation and selection, calculation methodology, and examples of accessibility eval-
uation results. The accessibility evaluation framework described here builds on
methods developed in earlier projects, extended for use on a national scale and at
the Census block level. Application on a national scale involves assembling and
processing a comprehensive national database of public transit network topology
and travel times. This database incorporates the computational advancement of
calculating accessibility continuously for every minute within a departure time
window of interest. This increases computational complexity, but provides a very
robust representation of the interaction between transit service frequency and
accessibility at multiple departure times.
1 Introduction
Earlier efforts in this vein include Levine et al. (2012), which collected zone-to-zone travel time information from 38 metro-
politan planning organizations to implement a cross-metropolitan evaluation of
accessibility by car.
The goal of this project is to combine the lessons learned from these earlier
works with recent advances in transit schedule data format and availability to
produce a new, comprehensive dataset of accessibility to jobs by transit.
3 Data Sources
There is no guarantee, however, that a GTFS dataset obtained from this source was originally published by the actual
transit operator, or that it has not been modified in some way. For this project,
schedules downloaded from this web site are used only when they cannot be
obtained directly from a transit operator.
4 Software
All of the major components of this evaluation system are open source. While this
was not a specific goal or requirement, experience from earlier projects suggested
some important benefits of using open source tools. First, open source software
often provided greater flexibility in input and output data formats. This is an
important consideration when a project involves multiple stages of data transfor-
mation and processing, each performed with a separate tool. Second, open source
software can be rapidly customized to fit the project needs. In this project, local
customizations to OpenTripPlanner provided more efficient parallelization and
allowed for better data interoperability. Finally, open source approaches reduce
barriers to replication and validation. Because the output of this project is itself a
dataset designed for use in research and practice, it is important that all parts of the
methodology—including those implemented using existing software—are thor-
oughly transparent and understandable.
This project makes use of the following major software packages:
• OpenTripPlanner (OTP), an open-source platform for multi-modal journey
planning and travel time calculation.
• PostgreSQL, an open-source SQL database engine.
• PostGIS, a PostgreSQL extension that allows efficient storage and querying of
spatial data.
Additionally, numerous smaller scripts and tools for data collection and
processing were developed specifically for this project.
Figure 1 illustrates the basic project architecture and workflow, which is described
in the following sections.
4.1.1 Inputs
The project inputs are stored primarily in a single SQL database. PostgreSQL is
used along with the PostGIS extension; this combination allows spatial and
non-spatial data in a single database, automated spatial queries (e.g. to select all
origins within a given analysis zone), and spatial indexing methods that accelerate these queries.
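For instance, selecting the origins of one analysis zone can be a single spatial query. The sketch below assumes psycopg2 and hypothetical table and column names (block_centroids, analysis_zones), not this project's actual schema:

import psycopg2

def fetch_zone_origins(conn, zone_id):
    # Return (block_id, lon, lat) for every Census block centroid inside
    # the given analysis zone; ST_Within can use the spatial index.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT b.block_id, ST_X(b.geom), ST_Y(b.geom)
            FROM block_centroids b
            JOIN analysis_zones z ON ST_Within(b.geom, z.geom)
            WHERE z.zone_id = %s
            """,
            (zone_id,))
        return cur.fetchall()

# conn = psycopg2.connect("dbname=transit_access")  # connection string illustrative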
4.1.2 Calculation
Fig. 2 A metropolitan area divided into analysis zones. Each zone contains a maximum of 5000
Census block centroids
The core unit of work—calculating travel times from a single origin at a single
departure time—is provided by existing OpenTripPlanner capabilities. The param-
eters and assumptions involved in these calculations are described in following
sections. OTP is natively multithreaded and can efficiently parallelize its work
across multiple processors. To achieve efficient parallelization without requiring
dedicated supercomputing techniques, the total computation workload is divided
into “analysis bundles” which include all information necessary to compute a
defined chunk of the final data. Each analysis bundle includes origin locations
and IDs; destination locations, IDs, and opportunity (job) counts; and a unified
pedestrian-transit network created by OTP.
The scope of origins included in each bundle is arbitrary; a useful value of 5000
origins per bundle was found through trial and error. Figure 2 illustrates the division
of a single county into analysis zones, each containing no more than 5000 Census
block centroids. Too-small bundles erode overall efficiency by increasing the
overhead costs of job tracking and data transfer, while too-big bundles suffer
reliability issues: errors do occur, and when they do it is preferable to lose a
small amount of completed work rather than a large amount.
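The bundling itself reduces to simple chunking; a minimal sketch using the 5000-origin cap reported above:

def make_bundles(origin_ids, max_origins=5000):
    # Split a zone's origins into bundles of at most max_origins, the
    # trial-and-error balance between tracking overhead and rework risk.
    return [origin_ids[i:i + max_origins]
            for i in range(0, len(origin_ids), max_origins)]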
Destinations, on the other hand, are selected geographically. Because travel
times are by definition not known until the calculations are complete, it is necessary
to include in each bundle all destinations that might be reached from any of the
included origins within some maximum time threshold. A buffer of 60 km from the
border of the origin zone is used, based on 1 h of travel at an estimated 60 km/h
upper limit of the average speed of transit trips. This 1-h limit only applies to the
extent of the graph; using such a graph, accessibility metrics can be reported for any time threshold of 1 h or less. Figure 3 illustrates the spatial selection of destinations for a given set of origins.

Fig. 3 A single origin zone (blue) and its corresponding 60-km destination zone buffer (red). Travel times are calculated from each centroid in the origin zone to each centroid in the destination zone
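Destination selection can likewise be expressed as a spatial predicate. The sketch below assumes the same hypothetical schema as above and a meter-based projection, so the 60-km buffer is passed directly to ST_DWithin:

def fetch_buffered_destinations(conn, zone_id, buffer_m=60000):
    # Destinations within buffer_m of the origin zone: 60 km corresponds
    # to 1 h of travel at the assumed 60 km/h ceiling on transit speeds.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT d.block_id, d.jobs
            FROM block_centroids d
            JOIN analysis_zones z ON ST_DWithin(d.geom, z.geom, %s)
            WHERE z.zone_id = %s
            """,
            (buffer_m, zone_id))
        return cur.fetchall()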
OTP’s Analyst module provides a graph builder function that combines pedes-
trian and transit network data from the input database into a single graph, and
locally-developed software merges the graph into an analysis bundle with the
appropriate origins and destinations. The bundle is queued in a cloud storage
system, making it available for computation.
Computations take place on a variable number of cloud computing nodes that are
temporarily leased while calculation is in progress. (Currently, computing nodes are
leased from Amazon Web Services (AWS).) Each node is prepared with OTP
Analyst software as well as custom software that retrieves available analysis
bundles, initiates accessibility calculations, and stores the results.
4.1.3 Outputs
The processing of each analysis bundle results in a single data file that records
accessibility values for each origin in the bundle. For each origin, this includes an
accessibility value for each departure time and for each travel threshold between
5 and 60 min, in 5-min increments. These values are stored individually and
disaggregated to facilitate a wide range of possible analyses. Each result file is
tagged with the ID of the analysis zone and range of departure times for which it
contains results, and then stored in a compressed format in the cloud storage
system.
Because analysis typically takes place at the metropolitan level or smaller, it is
rarely necessary to have the entire national result dataset available at once. Instead,
custom scripts automate the download of relevant data from the cloud storage
system.
This analysis makes the assumption that all access portions of the trip—initial,
transfer(s), and destination—take place by walking at a speed of 1.38 m/s along
designated pedestrian facilities such as sidewalks, trails, etc. On-vehicle travel time
is derived directly from published transit timetables, under an assumption of perfect
schedule adherence. Transfers are not limited.
Just as there is no upper limit on the number of vehicle boardings, there is no
lower limit either. Transit and walking are considered to effectively be a single
mode. The practical implication of this is that the shortest path by “transit” is not
required to include a transit vehicle. This may seem odd at first, but it allows the
most consistent application and interpretation of the travel time calculation meth-
odology. For example, the shortest walking path from an origin to a transit station
often passes through destinations where job opportunities exist. In other cases, the
shortest walking path from an origin to a destination might pass through a transit
access point which provides no trips that would reduce the origin–destination travel
time. In these situations, enforcing a minimum number of transit boardings would
artificially inflate the shortest-path travel times. To avoid this unrealistic require-
ment, the transit travel times used in this analysis are allowed to include times
achieved only by walking.
Transit accessibility is computed for every minute of the day, as described in
Owen and Levinson (2015), which demonstrates that continuous accessibility
metrics can provide a better description of the variation in transit commute mode
share than do metrics evaluated at a single or optimal departure time.
5 Visualization
This project produces highly detailed accessibility datasets, and some level of
aggregation is typically needed to produce easily understandable summary maps.
Figures 4, 5, 6 and 7 provide examples of block-level accessibility results mapped
at a constant data scale across four major metropolitan areas: Washington, DC; Atlanta, GA; Seattle, WA; and Minneapolis–Saint Paul, MN.

Fig. 4 Map of job accessibility by transit in the Washington, DC metropolitan area
Fig. 7 Map of job accessibility by transit in the Minneapolis–Saint Paul, MN metropolitan area

In these maps,
accessibility for each Census block has been averaged over the 7–9 AM period.
The resulting average accessibility value indicates the number of jobs that a
resident of each block could expect to be able to reach given a randomly-selected
departure time between 7 and 9 AM.
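A minimal numpy sketch of this averaging, with an assumed travel-time matrix standing in for real OpenTripPlanner output:

import numpy as np

def avg_accessibility(travel_times, jobs, threshold_min=30):
    # travel_times: (departure minutes x destinations) array of minutes,
    # e.g. 120 rows for the 7-9 AM window; jobs: jobs at each destination.
    reachable = (travel_times <= threshold_min).astype(float)
    jobs_per_departure = reachable @ np.asarray(jobs, dtype=float)
    # The mean over departure minutes is the expected number of jobs
    # reachable for a randomly selected departure time in the window.
    return jobs_per_departure.mean()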
6 Conclusion
Acknowledgements The project described in this article was sponsored by the University of
Minnesota’s Center for Transportation Studies. Many of the employed tools and methodological
approaches were developed during earlier projects sponsored by the Minnesota Department of
Transportation.
References
Catala M, Downing S, Hayward D (2011) Expanding the Google transit feed specification to
support operations and planning. Technical Report BDK85 997-15, Florida Department of
Transportation
Delling D, Pajor T, Werneck RF (2014) Round-based public transit routing. Transp Sci 49
(3):591–604
Google, Inc. (2013) General transit feed specification reference. [Online]. https://developers.google.com/transit/gtfs/reference
Jariyasunant J, Mai E, Sengupta R (2011) Algorithm for finding optimal paths in a public transit
network with real-time data. Transp Res Rec J Transp Res Board 2256:34–42
Krizek K, El-Geneidy A, Iacono M, Horning J (2007) Refining methods for calculating non-auto
travel times. Technical Report 2007-24, Minnesota Department of Transportation
Krizek K, Iacono M, El-Geneidy A, Liao C-F, Johns R (2009) Application of accessibility
measures for non-auto travel modes. Technical Report 2009-24, Minnesota Department of
Transportation
Levine J, Grengs J, Shen Q, Shen Q (2012) Does accessibility require density or speed? A
comparison of fast versus close in getting where you want to go in US metropolitan regions.
J Am Plan Assoc 78(2):157–172
Owen A, Levinson D (2012) Annual accessibility measure for the Twin Cities metropolitan area.
Technical Report 2012-34, Minnesota Department of Transportation
Owen A, Levinson DM (2015) Modeling the commute mode share of transit using continuous
accessibility to jobs. Transp Res A Policy Pract 74:110–122
Puchalsky CM, Joshi D, Scherr W (2012) Development of a regional forecasting model based on
Google transit feed. In: 91st annual meeting of the transportation research board, Washington,
DC
Wong J (2013) Leveraging the general transit feed specification for efficient transit analysis.
Transp Res Rec J Transp Res Board 2338:11–19
Seeing Chinese Cities Through Big Data
and Statistics
1 Introduction
China has historically been an agricultural nation. Its urbanization rate was reported
to be about 11 % in 1949 and 18 % in 1978. Subject to differences in definition (Qiu
2012), the U.S. urbanization rate was estimated to be at 74 % in 1980 (U.S. Census
Bureau 1990). China began its economic reforms, known as “Socialism with Chinese characteristics,” in 1978. It introduced market principles and opened the country to foreign
investment, followed by privatization of businesses and loosening of state control in
the 1980s.
In the last 36 years, China leapfrogged from the ninth to the second largest
economy in the world in gross domestic product (GDP), surpassing all other
countries except the U.S. (Wikipedia, “List of countries by GDP (nominal)”). The
poverty rate in China dropped from 85 % in 1981 to 13 % in 2008 (World Bank,
“Poverty headcount ratio at $1.25 a day”).
Through expanding population and land annexation, urbanization in China
increased dramatically in support of this economic growth. The objective of this
paper is to document the need for a proactive data-driven approach to meet
challenges posed by China’s urbanization. This strategy will require a number of
technological and data-oriented solutions, but also a change in culture towards
statistical thinking, quality management, and data integration. New investments in
smart cities have the potential to produce systems whose data can drive much-needed governmental innovation.
The paper is structured as follows: in Sect. 2, we provide background information
on urbanization trends in China and the challenges that have come with it. In Sect. 3,
we describe current policy goals and approaches to meet these challenges. The
emerging role of statistics and technology, including investments of smart cities
and much-needed transformations towards data and their use are discussed in
Sect. 4. Section 5 describes progress with smart city development in
Zhangjiagang in Jiangsu Province, as an example of the types of potential benefits
that can be accrued with technology for cities. Conclusions are given in Sect. 6.
2 Background
Table 1 reproduces the State Council of China (2014) report about the growth of
Chinese cities from 193 in 1978 to 658 in 2010. Wuhan became the seventh
megacity in 2011. No U.S. city qualified to be a megacity in 2012 (U.S. Census
Bureau, “City and Town Totals: Vintage 2012”). The urbanization rate in China
tripled from 18 % in 1978 to 54 % by the end of 2013 (National Bureau of Statistics
of China 2014).
Migration of rural workers to meet the urban labor needs accounted for most of
the growth of the Chinese urban population to 711 million. However, under the
unique Chinese household registration system known as Hukou (Wikipedia,
“Hukou System”), the registered rural residents living in the city are not entitled to the government benefits of the city, such as health care, housing subsidy, education for children, job training, and unemployment insurance. Conversion from the rural to urban registration status has been practically impossible.

Fig. 1 Urban residency rate and urban hukou rate (1978–2012). (Source: The State Council of China (2014))
This disparity has become a major source of social discontent in a nation of
almost 1.4 billion people. Although 52.6 % of the Chinese population lived in cities
in 2012, only 35.3 % were registered urban residents. The gap of 17.3 % is known as
the “floating population,” amounting to 234 million people and well exceeding the
entire U.S. labor force of 156 million people. Figure 1 shows that this gap has been
widening since 1978.
In addition to the social inequity caused by the Hukou system, there is an
increasing geographical divide. Figure 2 shows that the eastern region of China is
more densely populated than the rest of the nation (Beijing City Lab 2014).
Five out of the six megacities in 2010 are located on the east coast. These
megacity clusters occupy only 2.8 % of the nation’s land area, but contain 18 %
Fig. 2 Population density of China in 2010 (Source: Beijing City Lab (2014))
of the population and 36 % of the GDP. While the east coast is increasingly
suffocated by people and demand for resources, the central and the western regions
lag behind in economic development and income. Reliance on urban land sale and
development to generate local revenue during the reform process has led to high
real estate and housing prices and conflicts with the preservation of historical and
cultural sites in all regions. Conversion of land from agricultural use is also raising
concerns about future food supply.
At the same time, “big city diseases” surfaced and became prevalent in China,
including environmental degradation, inadequate housing, traffic congestion, treat-
ment of sewage and garbage, food security, and rising demand for energy, water
and other resources. Many of these issues have been discussed domestically and
internationally (e.g., Henderson 2009; Zhang 2010a, b; United Nations Develop-
ment Program 2013).
So, where is China heading in terms of economic growth and urbanization? The
answer to this question is unambiguous. The urbanization goal of 51.5 % for the
12th Five-Year Plan (2011–2015) has already been exceeded (National People’s
Congress 2011). The 2014 Report on the Work of the Chinese Government
(Li 2014b), which is similar to the annual State of the Union in the U.S., states
that “economic growth remains the key to solving all (of China’s) problems” and
urbanization is “the sure route to modernization and an important basis for
integrating the urban and rural structures.” On March 16, 2014, the State Council of
China (2014) released its first 6-year plan on urbanization for 2014–2020. The
comprehensive plan covers 8 parts, 31 chapters, and 27,000 words, providing
guiding principles, priorities for development, and numerical and qualitative
goals. Under this plan, China sets a goal of 60 % for its urbanization by 2020.
In the early days of reform, China took the trial-and-error approach of “feeling the
rocks to cross the river” when infrastructure and options were lacking. Over time,
the original simple economic goals were challenged by conflicting cultural and
social values. More scientific evaluations are needed to minimize costly mistakes
made by instinctive decisions.
After 30-plus years of reform, Chinese President Xi Jinping (2014) acknowl-
edged that “. . .the easier reforms that could make everyone happy—have already
been completed.” Chinese Premier Li Keqiang (2014b) pledged to carry out
“people-centered” urbanization and cited three priorities on three groups of 100 mil-
lion people each:
• Granting official urban Hukou status to 100 million rural people who have
already moved to cities;
• Rebuilding rundown, shanty city areas and villages inside cities where 100 mil-
lion people currently live;
• Guiding the urbanization of 100 million rural residents of the central and western
regions into cities in their regions.
Table 2 reproduces the 18 key numerical goals for 2020 along with the 2012
benchmarks under the national urbanization plan.
There are now two key goals with regard to the urban population: raising the
level of residents living in cities to 60 % and the level of registered urban residents
to 45 %, thereby reducing the floating population from the current 17.3 to 15 % in
6 years. The other key goals promote the assimilation of migrant rural workers into
city life, improving urban public service and quality of life, and protecting land use
and the environment.
The national urbanization plan also contains less specific, qualitative goals. For
example, it mandates “Three Districts and Four Lines” in each city. The
three districts are defined as areas forbidden from, restricted from, and suitable for
construction respectively. Four types of zones will be drawn by color lines: green
line for ecological land control; blue line for protection of water resources and
swamps; purple line for preservation of historical and cultural sites; and yellow line
for urban planning and development. Yet how these districts and zones will be
created and sustained has not been specified.
China is pressing forward with concurrent modernization in agriculture, indus-
trialization, information technology, and urbanization. Under the urbanization plan,
the central government is responsible for strategic planning and guidance. Author-
ity is delegated to the provincial and municipal levels through political reform.
Local administrators are encouraged to innovate, build coalitions, undertake pilot
tests, formulate action plans, and implement orderly modern urbanization under
local conditions.
Conversion from rural to urban Hukou registration is now officially allowed and
encouraged, but the process will be defined by individual cities, under the general
rule that the conversion will be more restrictive as the population of the city
increases.
The national urbanization plan provides an unprecedented opportunity for the role
of statistics and technology to support and monitor the implementation of policies
in China. Chapter 31 prescribes the role of defining metrics, standards, and methods
to establish a sound statistical system, monitoring the activities dynamically, and
performing longitudinal analysis and continuing assessment of the progress of
urbanization according to the development trends.
The specification of dynamic monitoring and longitudinal analysis reflects
advanced thinking, compared to the current static, cross-sectional reports. Yet
how the statistical monitoring system will be implemented also remains unclear
at this stage.
Many developed nations have been using a data-driven approach to manage
knowledge for their businesses (e.g., Sain and Wilde 2014), and this paper assumes that China will also take up this approach for governance. Figure 4 shows a
Data-Information-Knowledge-Wisdom (DIKW) hierarchy model for this process.
The foundation of scientific knowledge and wisdom is to observe facts and collect
data. However, data in their raw form have little or no meaning by themselves. Not
all data have adequate information value or are useful for effective decision
making. Statistics, both as a branch of science for knowledge discovery and a set
of measurements, provides context and value by converting useful data into rele-
vant information. Knowledge is gained and accumulated from information, and
used as the basis for making wise decisions. The decisions will not be correct all the
time, but the scientific process promotes efficiency and minimizes errors, especially
when conducted with integrity, objectivity, and continuous improvement (Fig. 3).
Although technology is not explicitly shown in the DIKW model, today the base
of the pyramid is greatly expanded by technology, and the transformation of data
into information has been accelerated. However, the process is also contaminated
by hype, useless data, and misinformation (Harford 2014; Wu 2014). Traditional
sources of data such as the Census have been used for governance of nations for
centuries. Random surveys were later introduced based on probability theory to
produce scientifically reliable information with proper design and a relatively small
amount of data.
Together censuses and random surveys form the statistical foundation based on
structured data (Webopedia, “Structured data”) in the twentieth century. Developed
nations have used them effectively for policy and decision making, with design and
purpose, over the past 100 years.
At the turn of this century, massive amounts of data began to appear in digital form
or were converted from analog to digital form, allowing direct machine processing
(Hilbert and Lopez 2011); much of these data are unstructured text, map, image,
sound, and multimedia data. Big Data was not a well-known term in China until Tu (2012)
published the first Chinese-language book on the topic. Although data mining is
commonly mentioned as a promising approach to extract information from such
data for commercial purposes, their reliability and value can be suspect, especially
for the purpose of governance (e.g., Marcus and Davis 2014; Lazer et al. 2014). Few
of the key numerical goals in the national urbanization plan can be measured
meaningfully or reliably by unstructured data alone.
Other approaches to Big Data arise due to the integration of structured data
derived from administrative records to create longitudinal data systems. This
approach was the first realized benefit of Big Data for government statistics. For
example, the Longitudinal Employer-Household Dynamics (LEHD) program of the
U.S. Census Bureau merges unemployment insurance data, social security records,
tax filings and other data sources with census and survey data to create a longitu-
dinal frame of jobs. It is designed to track every worker and every business in the
nation dynamically through the relationship of a job connecting them, with data
updated every quarter, while protecting confidentiality. The data provide insights
about patterns and transitions over time, which are not available from the traditional
cross-sectional statistics. Similar efforts to build longitudinal data systems for
education (Data Quality Campaign n.d.) and health care (Wikipedia, “Health
Information Technology for Economic and Clinical Health Act”) are underway in
the U.S. The 2020 U.S. census will also be supplemented by the integration of
administrative records (Morello 2014).
More than a decade ago, the State Council of China (2002) issued guidance to
create four National Basic Data Systems as part of e-Government—longitudinal
frames of people, enterprises, and environment/geography, respectively, with the
fourth system integrating the first three to form a unified macroeconomic data
system. These nationwide data systems possess the desired characteristics of a
twenty-first-century statistical system (Groves 2012; Wu 2012; Wu and Guo 2013).
They help to transition the Chinese government’s role from central control to
service for citizens and to establish a foundation for data sharing and one-stop
integrated service nationwide. Heavy investment followed to establish and imple-
ment definitions, identification codes, standards, and related infrastructure.
Identification codes are the keys to unlocking the enormous power in Big Data
(Wu and Ding 2013). A well-designed code matches and merges electronic records,
offers protection of identity, provides basic description and classification, performs
initial quality check, and facilitates the creation of dynamic frames. As early as
1984, China began to build an infrastructure with its citizen identification system
(Wikipedia, “Resident Card System”). A sample Chinese citizen card (Fig. 4)
displays the citizen identification code, name, gender, ethnicity, birthdate, address,
issuing agency, dates of issuance and expiration, and a photograph.
The 18-digit citizen identification code, introduced in 1999, includes a Hukou
address code, birthdate, gender, and a check digit. It is issued and administered by
the Ministry of Public Security. The citizen code is uniquely and permanently
assigned to the cardholder. The card is capable of storing biometric information.
It is increasingly required for multiple purposes, such as the purchase of a train
ticket for travel. In contrast, the U.S. does not have a comparable national citizen
card system. Recent renewed discussions about adding an image of the cardholder
to the Social Security card were again met with controversy (e.g., Bream 2014;
Eilperin and Tumulty 2014).
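The code's structure lends itself to automated validation. Below is a minimal Python sketch of the check-digit test (the ISO 7064 MOD 11-2 scheme used by the GB 11643-1999 standard for this code); the sample number is fabricated for illustration.

```python
# Minimal sketch: validate the check digit of an 18-digit Chinese citizen
# identification code (ISO 7064 MOD 11-2, as specified by GB 11643-1999).
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"  # weighted sum mod 11 -> check character

def is_valid_citizen_id(code: str) -> bool:
    """Return True if the 18-character code carries a correct check digit."""
    if len(code) != 18 or not code[:17].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(code[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == code[17].upper()

# Digits 1-6 encode the Hukou address, digits 7-14 the birthdate (YYYYMMDD),
# digit 17 the gender (odd = male, even = female), digit 18 the check digit.
print(is_valid_citizen_id("11010519491231002X"))  # fabricated example -> True
```

Because the check digit catches most transcription errors before any record linkage is attempted, a code of this design can perform the initial quality check mentioned above without consulting any database.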
China has also established a system of National Organization Codes under the
National Administration for Code Allocation to Organizations. The nine-digit
organization code includes a check digit and serves as a unique identification and linking key for organizations.
The term “Smart City” began to appear globally around 2008 as an extension of
earlier developments in e-Government and digital cities. Data collection,
processing, integration, analysis and application are at the core of constructing
smart cities. In practical terms, embracing the concept of smart city will allow
China to downscale the original national approach to the more manageable city
level, while protecting past investments and permitting aggregation to the provin-
cial or regional level.
Table 3 describes the direction for developing smart cities as outlined in the national
urbanization plan. At the end of 2013, the Chinese Ministry of Housing and Urban-
Rural Development designated 193 locations as smart city test sites (baidu.
com, “National Smart City Test Sites”). They are expected to undergo 3–5 years of
experimental development. The Chinese Ministry of Science and Technology has
also named 20 smart city test sites (Xinhuanet.com 2013). They are expected to
spend 3 years to develop templates of cloud computing, mobile networks, and
related technologies for broad implementation.
The issuance of a City Resident Card is a concrete first step for aspiring smart
cities to provide one-stop service and to consolidate data collection. The multi-functional
card may be used for social security or medical insurance purposes, as
well as a debit card for banking and small purchases. Depending on the city, the
City Resident Card may also be used for transportation, public library, bicycle
rentals, and other governmental and commercial functions yet to be developed.
During the application process, the citizen identification code is collected along
with identification codes for social security and medical insurance, residence
address, demographic data, and family contact information, facilitating linkage to
other data systems and records. The current smart resident cards in use in China
(Fig. 5) vary from city to city, but they typically contain two chips and a magnetic
memory strip. One example of smart city applications utilizing this card system is a
one-stop service platform by Digital China (2013). This is an additional channel of
service for millions of card-carrying residents, who can use the secured smart card
to perform previously separate functions.
Fig. 5 Sample of smart city resident card. (Source: Baike.baidu.com, “Resident Card”)
While the aforementioned activities are modest, they have the potential to lay the
foundation for urban informatics in China, particularly as they represent the very
early results of China’s total investment into smart city development, which is
estimated to exceed ¥2 trillion ($322 billion) by 2025 (Yuan 2014). Urban infor-
matics, meaning the scientific use of data and technology to study the status, needs,
challenges, and opportunities for cities, is presently not a well-known concept in
China. It uses both unstructured and structured data, collected with and without
design or purpose. The defining characteristics of urban informatics will be the
sophisticated application of massive longitudinal data, integration of multiple data
sources, and rapid and simple delivery of results, while strictly protecting confi-
dentiality and data security and assuring accuracy and reliability.
However, there are many challenges and needs in establishing urban informatics
as a mature field of study in China. These are discussed next.
There is no assurance that internal resistance to data sharing and standards can be
overcome in China despite mandates, political reforms, downscaling, and cloud
computing (e.g., UPnews.cn 2014). A major risk of a de-centralized approach is the
formation of incompatible “information silos” such that the systems cannot
interoperate within or between cities. This challenge is not unique to China. The
U.S. had more than 7000 data centers in the federal government alone in 2013;
about 6000 of them were considered “noncore.” Many of them do not communicate
with each other and are costly to maintain. Although a major consolidation initia-
tive was started by the White House in 2010, progress has been slow (CIO.Gov
2014; Konkel 2014).
However, open-data-based governance and research are relatively new concepts
in China. Although their value is recognized and advocated in the central plans, thus
far there has been little support for open data policy and data sharing, nor has
a full awareness of modern statistical or environmental issues been demonstrated. Much
is also needed regarding statistical quality control (Shewhart 1924; Deming 1994)
and quality standards (e.g., International Organization for Standardization 9000).
While these statistical principles and thinking originated in the context of industrial
production, they are equally applicable to governance.
The National Bureau of Statistics of China relies heavily on data supplied by
provincial and local governments. Intervention and data falsification by local
authorities are occasionally reported in China (e.g., Wang 2013), including the
famed GDP. For example, the incomplete 2013 GDP of 28 out of 31 provinces and
cities already exceeded the preliminary 2013 total national GDP by ¥2 trillion or
3.6 % (e.g., Li 2014a). Due to these issues, the credibility and public confidence in
China’s statistics are not high. Tu (2014) observed that China has not yet developed
a culture of understanding and respect for data. Changing this culture is a challenge
without historical precedent in China.
The Chinese statistical infrastructure is relatively new and fragile. The first Chinese
decennial population census was conducted in 1990, 200 years after the U.S. began
its census; the first consolidated Chinese economic census was conducted only
in 2004. Random surveys seldom include detailed documentation of
methodology.
The national urbanization plan's requirement of dynamic monitoring and longitudinal
analysis by the Chinese government is refreshing. Its implementation faces
many statistical and technological issues, including record linkage and integration,
treatment of missing or erroneous data, ensuring data quality and integrity, retrieval
and extraction of data, scope for inference, and rapid delivery of results. Some of
the terms in use, such as “talented persons,” “green buildings,” and “information
level,” do not have commonly accepted definitions or standard meanings.
In the collection of data about a person, some of the characteristics such as
gender and ethnicity remain constant over time; some change infrequently or in a
predictable manner such as age, Hukou, and family status; some change more
frequently such as education level, income level, employment, and locations of
home and work; and others change rapidly such as nutritional intake, use of water
and electricity, or opinion about service rendered. Measurement of these
304 J.S. Wu and R. Zhang
Yuan (2014) quoted the research firm IDC that “roughly 70 % of government
investments went to hardware installation in China, way higher than the global
average of 16 %.” While China may be strong at hardware, service and software
tend to lag behind. Technology and statistics are in many cases disconnected.
For effective administration and rapid information delivery, the underlying data
need to be representative and quality-assured. This would facilitate easy extraction,
transformation and loading (ETL), as well as dynamic visualization and longitudi-
nal reporting of the status and assessment of progress on the urbanization plan, and
overall system performance and customer satisfaction.
Online services based on smart resident cards and one-stop centers have already
led to relief in labor-intensive administrative functions and reduction of long
queues, but the current static monitoring reports are not connected to data collected
from online services in concept or in operation. Although statistical yearbooks are
beginning to appear online, interactive queries and dynamic visualization similar to
the American FactFinder (U.S. Census Bureau n.d.a) are not yet available. Intelli-
gent mapping applications similar to OnTheMap (Wu and Graham 2009; U.S.
Census Bureau n.d.c) have also not been introduced to deliver custom maps and
statistical reports based on the most recent data in real time. Such fragmentation
may well hamper data-driven impact even though considerable investment may
have gone into systems and hardware.
In this section, we discuss current smart city development by considering the case of
Zhangjiagang, a port city of 1.5 million people located on the Yangtze River
in Jiangsu Province, eastern China (Fig. 6). The urbanization rate for Zhangjiagang
was 63 % in 2010, which is higher than the current average in China, exerting high
pressure on its city administrators to manage its population, environment, and
economic development.
Zhangjiagang's 12th Five-Year Plan (2011–2015) emphasizes the support of infor-
mation technology for e-Government by raising the level of government applica-
tions; accelerating the construction of the Basic Data Systems; focusing on public
and government data sharing and exchange systems; and further improving the
level of inter-departmental applications.
The Zhangjiagang public website was launched in October 2013 with the above
goals in mind. The front page of the website (Fig. 7) contains three channels—My
Service, My Voice, and My Space. My Service provides government and public
services; My Voice connects the government and residents through online surveys
and microblogging; My Space contains the user's “digital life footprint,”
such as personal information and record of use.
The public website combines online and offline services through the use of the
smart resident card, desktop and mobile devices, and government and community
service centers, offering 621 types of services by 31 collaborating government and
community organizations.
The services vary by type of access device. Desktop computers offer the most
comprehensive services, including queries, more than 240 online applications, and
over 130 online transactions. Mobile device users may check on the progress of
their applications, using General Packet Radio Service (GPRS) positioning and
speech recognition technology to obtain 56 types of efficiency services such as
travel and transportation.
The one-stop service platform attempts to provide a unified, people-centric,
complete portal, eliminating the territorial tendency of individual government
agencies to build their own websites and service stations, and consolidating separate
developments such as smart transportation (e.g., showing the location and avail-
ability of rental bicycles) and smart health care. The website also combines a
variety of existing and future smart city proposals and services. Developers will
be able to link to the platform to provide their services with lower operating costs.
The city government wants to have a platform to showcase information technology,
introduce business services, and assist economic development, especially in
e-Commerce.
The Zhangjiagang public website is designed to be an open platform for pro-
gressive development. All the applications will be dynamically loaded and flexible
to expand or contract. Existing services will be continuously improved, and new
functionalities added. It aims to improve public satisfaction with government services
and to broaden agency participation, facilitating future data sharing and data mining.
Resident participation in the platform will determine whether its goals are
achieved. In the 6-month period since its launch in October 2013,
there have been 15,518 total users through real-name system certification and
online registration, 31,956 visitors, and 198,227 page views. The average visit time
was 11 min and 7 s. Among all the users, real-name registrants accounted for 67 %,
and mobile end users accounted for 44 %.
Online booking of sports venues, event tickets, and long-distance travel are the
most popular services to date. They show the value of convenience to the residents,
who had to make personal visits in the past.
Although there is no current example of data sharing between government
departments, the public website is beginning to integrate information for its resi-
dents. A user can view his/her records in a secured My Space.
It is already possible in the Zhangjiagang platform to create a consolidated bill of
natural gas, water, electricity, and other living expenses to provide a simple analysis
of household spending. Although this is elementary data analysis, it foretells the
delivery of more precise future services as online activities and records expand and
accumulate over time.
6 Summary
China is in the early stage of its 6-year national urbanization plan, extending its
economic development to also address rising social and environmental concerns.
There is a defined role for statistics and urban informatics to establish norms and
conduct dynamic monitoring and longitudinal analysis. Small steps have been taken
to begin data consolidation in some smart city test sites, and modest progress is
beginning to appear.
In the next 6 years, cultural changes towards an objective data-driven approach,
integration of statistical design and thinking into the data systems, and innovative
statistical theories and methods to fully deploy meaningful Big Data will be needed
to grow urban informatics in China and to achieve balanced success in its urban-
ization efforts. China will undoubtedly continue to advance towards building
smarter cities with Chinese characteristics, and we will be able to understand
more of Chinese cities through statistics and Big Data.
Acknowledgements This research was supported in part by Digital China Holdings Limited and
East China Normal University. Correspondence concerning this article should be addressed to
Jeremy S. Wu, 1200 Windrock Drive, McLean Virginia 22012. We wish to thank Dr. Carson
Eoyang and Dr. Xiao-Li Meng for their valuable comments, corrections, and edits.
References
Bream S (2014) Proposal to add photos to Social Security cards meets resistance. Fox News. http://
fxn.ws/1lnumHU
U.S. Census Bureau (1990) 1990 census of population and housing, Table 4. http://1.usa.gov/
1lnA24E
CIO.Gov (2014) Data center consolidation. http://1.usa.gov/1xKcUTN
Data Quality Campaign (n.d.) Why education data? http://bit.ly/1l1iQnb
Deming WE (1994) The new economics for industry, government, education. The Massachusetts
Institute of Technology, Cambridge
Digital China (2013) Release of first Chinese city public information service platform. http://bit.ly/
1q47SPx
Eilperin J, Tumulty K (2014) Democrats embrace adding photos to Social Security cards’.
Washington Post. http://wapo.st/1l8TOfM
Groves RM (2012) National Statistical Offices: Independent, identical, simultaneous actions
thousands of miles apart. U.S. Census Bureau. http://1.usa.gov/1xJbEQL
Harford T (2014) Big data: are we making a big mistake? FT Magazine. http://on.ft.com/1xJbnx0
Henderson JV (2009) Urbanization in China: policy issues and options. Brown University and
NBER. http://bit.ly/1laGvkP
Hilbert M, Lopez P (2011) The world’s technological capacity to store, communicate, and
compute information. Sci Mag 332(6025):60–65. doi:10.1126/science.1200970. http://bit.ly/
1oUvAKm
International Organization for Standardization (n.d.) About ISO. http://bit.ly/1jRdtSs
Jiang C (2014) How to wake the slumbering “Scientific Big Data.” Digital Paper of China. http://
bit.ly/1p54gcn
Konkel F (2014) Is data center consolidation losing steam? FCW Magazine. http://bit.ly/1q4TdDU
Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google flu: traps in big data
analysis. Sci Mag 343(6176):1203–1205. doi:10.1126/science.1248506. http://bit.ly/1s5RW1h
Li D (2014a) Sum of GDP from 28 provinces already exceeds the national GDP by two trillion
yuan. The Beijing News. http://bit.ly/1lqfR7Y
Li K (2014b) 2014 Report on the Work of the Government, delivered at the Second Session of the
Twelfth National People’s Congress. http://bit.ly/1jkDKIr
Marcus G, Davis E (2014) Eight (No, Nine!) problems with big data. The opinion pages, The
New York Times. http://nyti.ms/1q4a9KK
Ministry of Science and Technology (2013) Science and technology results to support our nation’s
smart city development. http://bit.ly/1l9BROg
Morello C (2014) Census hopes to save $5 billion by moving 2020 surveys online. The
Washington Post. http://wapo.st/TCJvKP
National Bureau of Statistics of China (2014) Statistical Communiqué of the People’s Republic of
China on the 2013 national economic and social development. http://bit.ly/1oeqVXM
National People’s Congress (2011) The twelfth five-year plan. http://bit.ly/1lHHcm9
Nie H, Jiang T, Yang R (2012) State of China’s industrial enterprises database usage and potential
problems. World Econ 2012(5). http://bit.ly/1oXtLwp
Qi F (2014) How to wake up the slumbering “Scientific Big Data.” Guangming Daily. http://bit.ly/
1p54gcn
Qiu A (2012) How to understand the urbanisation rate in China? China Center for Urban
Development, National Development and Reform Commission, PRC. http://bit.ly/1lnwsaQ
Sain S, Wilde S (2014) Customer knowledge management. Springer Science + Business Media.
http://bit.ly/1qsiyqf
Shen Y (2008) Analysis of the bottlenecks in constructing the four basic data system. Cooper Econ
Sci. http://bit.ly/1oUwyq4
The State Council of China (2002) Guidance on development of e-government (17), national
information automation leading group. http://bit.ly/1lnx2Wa
The State Council of China (2014) The national new-type urbanization plan (2014–2020). http://
bit.ly/1lo1K1g
Abstract In California, one of the greatest concerns about global climate change is sea
level rise (SLR) associated with extreme storm events. Several studies have
statically mapped SLR and storm inundation, but the dynamics of inundation have
been studied less. This study argues that it is important to conduct dynamic simulation with
high-resolution data, and employs a 3Di hydrodynamic model to simulate the
inundation of Sherman Island, California. The big data source, a high-resolution digital
surface model (DSM) derived from Light Detection and Ranging (LiDAR), was used to
model the ground surface. The results include a series of simulated inundations,
which show that when the sea level rises more than 1 m, there are major impacts on
Sherman Island. In all, this study provides a fine-grained database for better planning,
management, and governance under future scenarios.
1 Introduction
In California's coastal areas, one of the great concerns about global climate change is
sea level rise (SLR) associated with extreme high tides (Heberger et al. 2009). By
2100, mean sea level (MSL) is projected to rise between 1.2 and 1.6 m (Bromirski et al. 2012),
and this will cause a series of impacts along coastal areas, such as inundation and
flooding of coastal land, salt water intrusion, increased erosion, and the decline of
coastal wetlands (Nicholls and Cazenave 2010; Titus et al. 1991). Among all
the impacts, flood risk is likely the most immediate concern for coastal regions.
This threat can be more severe in the Sacramento-San Joaquin Delta (the Delta),
as many of its islands are 3–8 m below sea level (Ingebritsen et al. 2000). These
islands are protected by more than 1700 km of levees (Mount and Twiss 2005), with
standard cross sections at a height of 0.3 m above the estimated 100-year flood
elevation (Ingebritsen et al. 2000). However, with a projected SLR between 1.2 and
1.6 m, these current levees can be easily overtopped, and the islands can be flooded.
Several efforts have been made in the adjacent San Francisco Bay area (the Bay Area)
to measure and understand the impact of SLR and storm inundation (Biging
et al. 2012; Heberger et al. 2009; Knowles 2009, 2010). Using computer models,
these studies intersected a water surface with a ground surface to identify inundated
areas. The water surface could be interpolated from measured water level data at
existing gauges (Biging et al. 2012), while the ground surface was usually obtained
from LiDAR, which provides fine resolution from 1 to 5 m. It should be noted that the
interpolated water surface is static, since it only describes the water surface condition
at a particular water level, such as MSL or mean higher high water (MHHW) level.
However, real tides and storm events are dynamic processes. The Bay Area and the
Delta experience semi-diurnal tides, with two high tides and two low tides of unequal
height each day, and should therefore be modeled dynamically to simulate all
stages in the tidal cycle and the movement of tides during a storm event.
Therefore, a 3Di hydrodynamic model (Stelling 2012) was used in this study to
better simulate the dynamics of tidal interaction during an extreme storm event. In
addition, a 1 m resolution digital surface model (DSM) was generated from LiDAR
in order to accurately describe the ground surface and to indicate the water-flow
pathway for 3Di simulations. A near 100-year storm with various scenarios of SLR
was simulated in the Delta’s Sherman Island, where significant critical infrastruc-
ture existed. Inundation extent, frequency, and average depth were mapped and
analyzed based on the model outputs. Finally, a spatial resolution sensitivity
analysis of DSM was conducted. Through this entire exercise, this study hopes to
build a fine database for better planning, management, and governance to under-
stand future scenarios.
2 Study Area
The study area, Sherman Island and its adjacent regions, is located at the confluence
of Sacramento River and San Joaquin River (Fig. 1). Sherman Island is one of the
major islands in the Delta, and is located at the transition from an estuarine system to a riverine system.
3.1 Overview
To understand the impact of SLR inundation, a water surface and a ground surface
are required to identify the spatial extent of inundated areas. A hydrodynamic
model, 3Di (Stelling 2012), was employed in this study to simulate the water
surface from a 72-h, near 100-year storm associated with 0, 0.5, 1.0, 1.41 m SLR
scenarios. The ground surface, or a DSM, was generated from airborne LiDAR to
capture the terrain and important ground objects, e.g. levees. The model output is a
time-series of inundations, providing spatial extent, inundation depth, and water
level. The workflow is shown in Fig. 2. All elevation data use NAVD 88.
The first input of the 3Di model is water level data. To estimate the impact of a
worst case scenario, a near 100-year storm event was used as the baseline, and SLR
increments were added to the baseline. Being a dynamic model, 3Di requires time-
series water level data for the entire storm event as input. However, existing
100-year storm calculation methods and studies (Zervas 2013) only provide esti-
mates of water levels for a single stage such as MSL, MHHW, and mean lower low
water (MLLW) level. Therefore, a historic storm whose peak water level was close
to the 100-year storm was used as the water level input. As shown in Table 1, two
storms that occurred in 1983 exceeded the estimated 100-year storm at the San Francisco
NOAA tide station (NOAA ID: 9414290), and the third-highest storm, on
Feb. 6, 1998, had a peak water level close to the estimated 100-year storm
(Zervas 2013). All three of these extreme storms occurred during El Niño years.
Considering the storm’s peak water level and availability of data, the Feb.
6, 1998 storm was selected as the storm to be simulated. More specifically, this
study simulated this storm event over 72-h, from Feb 5 to Feb 7, 1998, to allow the
Table 1 Estimated and historic extreme storms at the San Francisco and Port Chicago NOAA gauges

| Station name  | Date       | Estimated 100-year storm (m) | Peak water level (m) |
|---------------|------------|------------------------------|----------------------|
| San Francisco | 01/27/1983 | 2.64                         | 2.707                |
| San Francisco | 12/03/1983 | 2.64                         | 2.674                |
| San Francisco | 02/06/1998 | 2.64                         | 2.587                |
| Port Chicago  | 02/06/1998 | Not available                | 2.729                |
model simulation to capture the complete storm movements through the study area.
The water level data used for the 3Di simulation was retrieved from the nearby
NOAA Port Chicago gauge (NOAA ID: 9415144), which provided measured water
levels with 6-min intervals during the storm event.
As for the SLR scenarios, Cayan et al. (2009) and Cloern et al. (2011) studied the
projected water level at the Golden Gate in the Bay Area, and found that the time
sea levels exceed the 99.99th historical percentile of water elevation would increase
to 15,000 h per decade by 2100; that percentile is 1.41 m above the year 2000 sea
level. This study therefore assumed 1.41 m as the maximum SLR by 2100, and also
analyzed scenarios of 0, 0.5, and 1.0 m SLR to show how the impact changes with
rising sea level. SLR was added on top of the baseline water level to simulate each
scenario.
The second input for the 3Di model is a fine spatial resolution DSM. The DSM was
constructed based on LiDAR data, which used active remote sensing technology to
measure the distance to target by illuminating the target with light pulses from a
laser (Wehr and Lohr 1999). The density of the LiDAR data used in this study is
1 point per 0.7 m², and there are approximately 140 million points covering the
study area. The DSM obtained from LiDAR in this study was originally 1 m
resolution, and was resampled to 4 m resolution by the maximum aggregation
method (ESRI 2015) to meet the computing limitations. In this method, a coarse
grid cell obtains the maximum value of the fine grid cells in the coarse grid cell’s
spatial extent. Even though the DSM was aggregated to 4 m, this spatial resolution
still accurately described the actual ground surface by precisely delineating objects
such as levees, ditches, buildings, and the pathways that water moved through.
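The maximum-aggregation step amounts to a block-maximum over the fine grid. The following minimal numpy sketch assumes the fine grid's dimensions are exact multiples of the aggregation factor:

```python
import numpy as np

def max_aggregate(dsm: np.ndarray, factor: int) -> np.ndarray:
    """Resample a fine DSM to a coarser grid; each coarse cell takes the
    maximum of the fine cells it covers."""
    rows, cols = dsm.shape
    assert rows % factor == 0 and cols % factor == 0
    return dsm.reshape(rows // factor, factor,
                       cols // factor, factor).max(axis=(1, 3))

fine_dsm = np.random.rand(1000, 1000)    # stand-in for the 1 m LiDAR DSM
coarse_dsm = max_aggregate(fine_dsm, 4)  # 1 m -> 4 m resolution
```

Taking the block maximum rather than the mean preserves narrow high features such as levee crests, which mean-aggregation would smear and which would otherwise be overtopped spuriously in the simulation.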
As aforementioned, the 3Di model has a limitation in the total number of grid
cells that it can process, and it uses the quad-tree approach to reduce the total
number of grid cells for model computation. The quad-tree is a data structure that is
based on the regular decomposition of a square region into quadrants and
sub-quadrants (Mark et al. 1989). 3Di draws finer quadrants/grid cells when
elevation changes greatly within a short x, y distance, which then preserves detailed
information while reducing the amount of data. Considering Sherman Island is
relatively flat and the only abrupt change in topography is due to the levees, only
levee data were used in the model to create finer grid cells, and coarser grid cells
were generated for the rest of the study area, which is more homogeneous. The DSM
and the quad-tree grid for Sherman Island are shown in Fig. 3.

Fig. 3 DSM and quad-tree grid, showing that 3Di draws finer grid cells along the levees and
coarser grid cells for the other areas
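The refinement rule can be illustrated in a few lines: split a cell whenever its elevation range exceeds a tolerance, and stop at a minimum cell size. This toy Python sketch stands in for 3Di's actual quad-tree and assumes a square grid with power-of-two size:

```python
import numpy as np

def quadtree_cells(dsm, r0, c0, size, min_size, max_range, out):
    """Recursively split a square cell while elevation varies too much;
    append (row, col, size) of each final cell to `out`."""
    block = dsm[r0:r0 + size, c0:c0 + size]
    if size <= min_size or block.max() - block.min() <= max_range:
        out.append((r0, c0, size))   # flat enough: keep one coarse cell
        return
    half = size // 2                 # too rough: split into four quadrants
    for dr in (0, half):
        for dc in (0, half):
            quadtree_cells(dsm, r0 + dr, c0 + dc, half,
                           min_size, max_range, out)

dsm = np.zeros((64, 64))
dsm[30:34, :] = 6.0                  # toy terrain crossed by a "levee"
cells = []
quadtree_cells(dsm, 0, 0, 64, min_size=4, max_range=0.5, out=cells)
# Fine cells cluster along the levee; flat areas remain large cells.
```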
4 Results
Fig. 4 Examples of simulated inundation from the 72-h, 100-year storm associated with 0 m (a),
0.5 m (b), 1.0 m (c), and 1.41 m (d) SLR, showing inundation extent and depth at hours 1, 24, 48,
and 72, respectively
The results show that during a 100-year storm, a total of 14.68 km² of land is
inundated in the study area. With 0.5 m SLR, a total of 20.67 km² of land is inundated;
with 1.0 m SLR, a total of 57.87 km²; and with 1.41 m SLR, a
total of 72.43 km² (Table 2). The inundation extent for the different
SLR scenarios is mapped in Fig. 5. The western end of Sherman Island is constantly
underwater in all the modeled SLR scenarios, as it is not protected by levees. In the
0.5 m SLR scenario, only minor inundation occurs in the rest of the island. In the
1.0 m SLR scenario, over half of the remaining Sherman Island is inundated, and in
the 1.41 m SLR scenario, the entire island is inundated. This progression
shows that when the sea level rises above 1.0 m, it will cause major flood impacts on
Sherman Island.
Table 2 Statistical summary of inundation by a 100-year storm with different levels of SLR

| SLR (m) | Inundated area (km²) | Freq. low (0.00–0.21) | Freq. medium (0.22–0.64) | Freq. high (0.65–1.00) | Depth low (0.00–1.98 m) | Depth medium (1.99–4.01 m) | Depth high (4.02–13.22 m) |
|---------|----------------------|-----------------------|--------------------------|------------------------|-------------------------|----------------------------|---------------------------|
| 0       | 14.68                | 4.40                  | 4.99                     | 5.30                   | 14.64                   | 0.04                       | 0.00                      |
| 0.5     | 20.67                | 3.26                  | 7.33                     | 10.08                  | 20.51                   | 0.15                       | 0.00                      |
| 1.0     | 57.87                | 6.82                  | 23.93                    | 27.11                  | 49.55                   | 8.31                       | 0.01                      |
| 1.41    | 72.43                | 2.37                  | 15.52                    | 54.54                  | 23.91                   | 32.67                      | 15.86                     |

Areas by inundation frequency and by average inundation depth are in km².
A storm is a dynamic process, and impacted areas are not permanently under water
during the entire storm event. Thus, this study analyzed inundation frequency
using (1) and (2):

$$I_{x,y,i} = \begin{cases} 1, & \text{inundated} \\ 0, & \text{not inundated} \end{cases} \qquad (1)$$

$$F_{x,y} = \frac{\sum_{i=1}^{n} I_{x,y,i}}{n} \qquad (2)$$
where I_{x,y,i} indicates whether the grid cell in column x, row y is inundated at hour i,
F_{x,y} is the inundation frequency for the grid cell in column x, row y, and n is the total
number of outputs, which equals 72 in this study since a 72-h event was simulated.
The inundation frequency calculated here is the proportion of hours each
piece of land (i.e. a 4 m × 4 m grid cell) is inundated over the entire 72-h storm
event. This study then classified the inundation frequency in the 1.41 m SLR
scenario using a natural breaks method, which minimizes the variance within
classes and maximizes the variance between classes (ESRI 2015). From this
classification, low frequency is 0.00–0.21, medium frequency is 0.22–0.64, and
high frequency is 0.65–1.00. The results from other scenarios were classified
using the 1.41 m SLR scenario’s classification in order to compare between the
scenarios. The inundation frequency is shown in Fig. 6 for the four scenarios, and
a statistical summary is shown in Table 2. From the results, it is observed that
when the sea level rises, low frequency areas decrease while high frequency
areas increase, therefore showing that more land will be permanently inundated
in the future with such rises.
Fig. 6 Inundation frequency during the 3-day 100-year storm associated with 0 m (a), 0.5 m (b),
1.0 m (c), and 1.41 m (d) SLR
This study also analyzed average inundation depth, as the inundation depth on top
of each piece of land varied in a storm event. Average inundation depth was
calculated by (3):
$$D_{x,y} = \frac{\sum_{i=1}^{n} d_{x,y,i}}{n} \qquad (3)$$
where D_{x,y} is the average inundation depth (m) at the grid cell in column x, row y,
d_{x,y,i} is the inundation depth at that grid cell at hour i, and n is the total number of
outputs, which equals 72 in this study.
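Given the stack of hourly depth grids, Eqs. (1)–(3) reduce to a few array operations. A minimal numpy sketch follows; the array and file names are hypothetical:

```python
import numpy as np

# Hypothetical stack of hourly inundation depths, shape (72, rows, cols),
# with 0 where a cell is dry (one grid per hour of the 72-h simulation).
depth = np.load("depth_stack.npy")

inundated = depth > 0                           # I_{x,y,i} of Eq. (1)
frequency = inundated.mean(axis=0)              # F_{x,y} of Eq. (2)
avg_depth = depth.sum(axis=0) / depth.shape[0]  # D_{x,y} of Eq. (3)

# Classify frequency with the breaks reported for the 1.41 m scenario:
# 0 = low (< 0.22), 1 = medium, 2 = high (>= 0.65)
freq_class = np.digitize(frequency, bins=[0.22, 0.65])
```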
Similarly, this study classified average inundation depth in the 1.41 m SLR
scenario by the natural breaks method. From this classification, low depth is
0–1.98 m, medium depth is 1.99–4.01 m, and high depth is 4.02–13.22 m. The
Fig. 7 Average inundation depth during the 3-day 100-year storm associated with 0 m (a), 0.5 m
(b), 1.0 m (c), and 1.41 m (d) SLR
results from other scenarios were classified using the 1.41 m SLR scenario classi-
fication, in order to compare between the scenarios. The average inundation depth is
shown in Fig. 7, and a statistical summary is shown in Table 2. The results show that
in the 0.5 and 1.0 m SLR scenarios, the majority of inundated areas are under low
inundation depth, while in the 1.41 m SLR scenario, more areas are under
medium and even high inundation depth.
While 4 m resolution was used in the simulation because it was the finest resolution
possible, it is important to test the spatial resolution sensitivity of the DSM and its
effect on inundation extent, frequency, and average depth. Surfaces at other resolutions (5, 6,
10, 20, and 30 m) were used for the sensitivity analysis, while all other model
parameters remained the same. The same maximum aggregation method was
used here for generating those surfaces from the original, 1 m resolution surface,
and the results from the 4 m simulation were used as the baseline for comparison.
To quantify the sensitivity, the percentage difference in area, Eq. (4), was calculated for
extent, and the Root Mean Square Difference (RMSD) and Coefficient of Variation
(CV), Eqs. (5) and (6), were calculated for depth and frequency. When calculating
RMSD, the coarse resolution's results were first resampled to 4 m resolution by the
nearest neighbor method (ESRI 2015), and RMSD was calculated over the 4 m
resolution grid cells.
$$\mathrm{Diff} = \frac{A_i - A_4}{A_4} \times 100\% \qquad (4)$$

where A_i is the inundated area at another resolution i (5, 6, 10, 20, or 30 m), and A_4 is
the inundated area at 4 m resolution.
$$\mathrm{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} (y_{i,t} - y_{4,t})^2}{n}} \qquad (5)$$

where y_{i,t} is the value of grid cell t simulated at another resolution i, y_{4,t} is that grid cell's
value simulated at 4 m resolution, and n is the total number of grid cells compared.
$$\mathrm{CV} = \frac{\mathrm{RMSD}}{\bar{y}_4} \qquad (6)$$

where ȳ_4 is the mean grid-cell value at 4 m resolution.
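Under assumed array names, the three metrics of Eqs. (4)–(6) also reduce to a few numpy lines; nearest-neighbor resampling is emulated by integer repetition, which is valid only for integer resolution ratios:

```python
import numpy as np

def sensitivity_metrics(base4, coarse, factor, cell_area_m2=16.0):
    """Compare a coarse-resolution depth grid against the 4 m baseline.
    `factor` is the integer ratio of the two resolutions (e.g. 5 for 20 m)."""
    up = coarse.repeat(factor, axis=0).repeat(factor, axis=1)
    up = up[:base4.shape[0], :base4.shape[1]]  # nearest-neighbor to 4 m grid
    a4 = (base4 > 0).sum() * cell_area_m2      # inundated area, baseline
    ai = (coarse > 0).sum() * cell_area_m2 * factor**2
    diff = (ai - a4) / a4 * 100.0              # Eq. (4)
    rmsd = np.sqrt(np.mean((up - base4) ** 2)) # Eq. (5)
    cv = rmsd / base4.mean()                   # Eq. (6)
    return diff, rmsd, cv
```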
This study creates an SLR and storm inundation dataset for Sherman Island and its
adjacent areas. This is an important first step for policy makers, planners, and
the public to understand the magnitude and spatial distribution of SLR and storm
inundation. The big data source, a high-resolution DSM, was used to accurately model the
ground surface. The results show that with more than 0.5 m SLR, the levees
protecting Sherman Island start to be overtopped. With 1.0 m SLR, nearly half of
Sherman Island is inundated, and with 1.41 m SLR, the entire island is inundated.
Based on this study, SLR impacts are significant, especially when SLR is greater
than 1.0 m. Local governments can use the inundation water level to improve the
levee system and construct new levees to protect areas with high inundation
frequency. This dataset can also be employed in a suitability analysis for Sherman
Island to identify areas with higher inundation risks, and to improve infrastructure
planning and/or adopt different planning strategies for the rising sea level. Finally,
the DSM’s spatial resolution sensitivity analysis shows the importance of fine
resolution data. Local governments should collect the best quality data possible to
inform more accurate decision making.
Our future work will further study the SLR impacts on infrastructure, such as
pipelines and roads. These infrastructures are designed to tolerate a certain level of
inundation, but this tolerance is limited. To better understand SLR and storm
inundation impacts on these critical infrastructures, it is beneficial to know the
duration and the depth of water sitting on top of any infrastructure. Static models
have limitations, as they only depict one stage of inundation, so information
on duration and flood dynamics is lost. The dynamic model implemented here
provides the additional dimension of time. Subsequent studies can intersect the
inundation dataset with infrastructure datasets to calculate the duration of impact,
as well as the amount of water sitting on top of impacted infrastructure. To
conclude, future researchers should identify when the infrastructure gets impacted,
estimate the cost, and provide more detailed planning suggestions.
5.2 Limitations
This study has several limitations. First, the Sacramento-San Joaquin Delta region
has a complex hydrologic system which is influenced by both the ocean and the
rivers, making it difficult to conduct hydrologic modeling. Considering that Sher-
man Island is close to the mouth of the Delta, this study simplified the actual process
and assumed that the island is only affected by tidal surges from the ocean. With the
discharge from the Sacramento and San Joaquin River, the simulated process could
be different. Second, the model did not incorporate other factors, such as subsi-
dence, sediment deposition, wind, and rainfall. Being an “artificial” system, the
Sacramento-San Joaquin Delta region has limited sedimentation and significant
subsidence issues that would further exacerbate the impact of SLR inundation. As a
result, the 3Di model might underestimate the SLR impacts on Sherman Island.
Third, the water level data used in this study could be inaccurate, as there is no
gauge currently available in the immediate region of the study area. Finally, the 3Di
model has computing limitations that limit the number of grid cells processed to
approximately 125,000 for each simulation. Our study lowered the DSM resolution,
from the original 1 m resolution to 4 m, to accommodate the computing limitations.
As a result, some topographic information, such as smaller ditches and roads, might
not be reflected in the model.
5.3 Conclusions
No GIS model perfectly represents reality (Fazal 2008), and inundation models are
usually a simple but effective method that identifies inundated areas (Tian
et al. 2010). They provide the possibility to incorporate different datasets and
generate models for planners, policy makers and the public to clearly see potential
impacts. Compared to previous studies, our study provides a more detailed level of
dynamic inundation information.
References
Biging GS, Radke JD, Lee JH (2012) Impacts of predicted sea-level rise and extreme storm events
on the transportation in the San Francisco Bay Region. California Energy Commission,
Sacramento
Bromirski PD et al (2012) Coastal flooding-potential projections 2000–2100. California Energy
Commission, Sacramento
Cayan D et al (2009) Climate change scenarios and sea level rise estimates for the California 2008
Climate Change Scenarios Assessment. California Energy Commission, Sacramento
Cloern JE et al (2011) Projected evolution of California’s San Francisco bay-delta-river system in
a century of climate change. PLoS One 6(9):e24465
NOAA Tide and Currents (2010) Datums for 9415144, Port Chicago CA. http://tidesandcurrents.
noaa.gov/datums.html?units=1&epoch=0&id=9415144&name=Port+Chicago&state=CA.
Accessed 3 Aug 2015
Dahm R et al (2014) Next generation flood modelling using 3Di: a case study in Taiwan. In: DSD
international conference 2014, sustainable stormwater and wastewater management, Hong
Kong
ESRI (2015) Nearest neighbor resampling—GIS dictionary. http://support.esri.com/en/
knowledgebase/GISDictionary/term/nearest%20neighbor%20resampling. Accessed 4 Aug 2015
Fazal S (2008) GIS basics. New Age International, New Delhi
Hanson JC (2009) Reclamation District 341, Sherman Island Five Year Plan 2009. http://ccrm.
berkeley.edu/resin/pdfs_and_other_docs/background-lit/hanson_5yr-plan.pdf
Heberger M et al (2009) The impacts of sea-level rise on the California coast. California Energy
Commission, Sacramento
Ingebritsen SE et al (2000) Delta subsidence in California; the sinking heart of the state.
U.S. Geological Survey. http://pubs.usgs.gov/fs/2000/fs00500/. Accessed 3 Aug 2015
Knowles N (2009) Potential inundation due to rising sea levels in the San Francisco Bay Region.
California Climate Change Center. http://www.energy.ca.gov/2009publications/CEC-500-
2009-023/CEC-500-2009-023-D.PDF
Knowles N (2010) Potential inundation due to rising sea levels in the San Francisco Bay Region.
San Francisco Estuar Watershed Sci 8(1). http://escholarship.org/uc/item/8ck5h3qn. Accessed
18 Dec 2014
Mark DM, Lauzon JP, Cebrian JA (1989) A review of quadtree-based strategies for interfacing
coverage data with digital elevation models in grid form. Int J Geogr Inf Syst 3(1):3–14
Mount J, Twiss R (2005) Subsidence, sea level rise, and seismicity in the Sacramento–San Joaquin
Delta. San Francisco Estuar Watershed Sci 3(1). http://escholarship.org/uc/item/4k44725p.
Accessed 3 Sept 2014
Nicholls RJ, Cazenave A (2010) Sea-level rise and its impact on coastal zones. Science 328
(5985):1517–1520
Stelling GS (2012) Quadtree flood simulations with sub-grid digital elevation models. Proc ICE
Water Manag 165(10):567–580
Tian B et al (2010) Forecasting the effects of sea-level rise at Chongming Dongtan Nature Reserve
in the Yangtze Delta, Shanghai, China. Ecol Eng 36(10):1383–1388
Titus JG et al (1991) Greenhouse effect and sea level rise: the cost of holding back the sea. Coast
Manag 19(2):171–204
Van Leeuwen E, Schuurmans W (2012) 10 questions to professor Guus Stelling about 3Di water
management. Hydrolink 3:80–82
Wehr A, Lohr U (1999) Airborne laser scanning—an introduction and overview. ISPRS J
Photogramm Remote Sens 54(2–3):68–82
Zervas C (2013) Extreme water levels of the United States 1893–2010. Center for Operational
Oceanographic Products and Services, National Ocean Service, National Oceanic and Atmo-
spheric Administration
The Impact of Land-Use Variables
on Free-Floating Carsharing Vehicle
Rental Choice and Parking Duration
M. Khan (*)
HDR, 504 Lavaca St #1175, Austin, TX 78701, USA
Department of Civil, Architectural and Environmental Engineering, The University of Texas at
Austin, 301 E. Dean Keeton St. Stop C1761, Austin, TX 78712-1172, USA
e-mail: [email protected]
R. Machemehl
Department of Civil, Architectural and Environmental Engineering, The University of Texas at
Austin, 301 E. Dean Keeton St. Stop C1761, Austin, TX 78712-1172, USA
e-mail: [email protected]
1 Introduction
Urban development in the United States depends heavily on an automobile-based
mobility system. This automobile dependency has led to many transportation
and environmental problems, including traffic congestion, greenhouse gas emissions,
air and noise pollution, and foreign oil dependency (Kortum and Machemehl
2012). In addition, personal vehicle ownership costs its owners more than $9000
per year (American Automobile Association 2013). Carsharing programs are a
novel substitute for personal vehicle ownership in urban areas. Carsharing provides
the mobility of a private car without the burden of car ownership costs and
responsibilities (Shaheen et al. 1998). Such innovative programs facilitate reductions
in household vehicle ownership by encouraging road users to treat personal
transportation as on-demand mobility rather than as an owned asset. The carsharing
concept serves as a sustainable mobility solution: recent research has shown its
positive environmental impact in at least three ways (Firnkorn and Müller 2011): first,
a reduction of total carbon dioxide emissions (Firnkorn and Müller 2011; Haefeli
et al. 2006); second, a reduction in household vehicle holdings (Shaheen and Cohen
2012); and third, a reduction of vehicle miles traveled (Shaheen et al. 2010).
Carsharing service providers allow members on-demand vehicle rental on a
basic “pay-as-you-drive” basis. Members of the carsharing program have access to a fleet
of vehicles in the service network and pay per use. Currently, traditional/station-
based carsharing systems and free-floating/one-way carsharing systems are the two
programs in practice. Generally, traditional carsharing programs (such as ZipCar)
offer short-term rental with an hourly pricing option and require users to return
vehicles to the original location of renting. On the other hand, free-floating
carsharing programs (such as car2go and DriveNow) allow users one-way car rental
where cars can be rented and dropped-off at any location within a specified service
area. The main advantage of free-floating carsharing programs over traditional
station-based carsharing programs is flexibility because it overcomes the limiting
requirement of dropping off at the same station where it was rented.
German carmaker Daimler’s car2go program was the first major initiative that
allowed users one-way rental within a city’s operating area. In 2008, car2go was
first launched in Ulm, Germany and later expanded its service to 28 cities in
8 different countries across Europe and North America. Recently the German
carmaker company BMW started offering one-way free-floating carsharing pro-
gram DriveNow in five cities in Germany and in one city in the U.S. In the U.S.,
car2go was first launched in Austin, Texas in November, 2009 as a pilot carsharing
project for city of Austin employees and later the service was opened to the general
public in May, 2010 (Kortum and Machemehl 2012). Presently, over 16,000 car2go
members use 300 identical car2go vehicles in Austin and all vehicles can be parked
for free at any City of Austin or State of Texas controlled meter, parking space,
or designated parking space for car2go vehicles within its operating area
(car2go 2014; Greater Austin 2013 Economic Development Guide 2013).
The Austin area is also used as a data source for the study. The organization of this paper
is explained below.
The following section presents a brief review of the literature on the topic of this
paper. The third section presents the modeling methodology, while the fourth
section presents a description of the data set used. The fifth section presents
model estimation results and the sixth section offers concluding thoughts.
Members can check each vehicle's location and fuel or electric charge level from either the Internet, a hotline, or the car2go smartphone
application. Users can rent (41 cents per minute, as of April 2014) an available
vehicle through the mobile app or from the car2go website in advance. The rent
includes costs of parking, fuel, insurance, maintenance, cleaning, GPS navigation,
24/7 customer support, and roadside assistance (car2go 2014).
affect the usage of carsharing vehicles. The number of studies investigating the
effect of land-use variables on carsharing vehicle usage is very limited because
most of the studies are based on station-based carsharing services. Moreover,
despite the recognition that parking cost and transit service can affect the usage
of carsharing services, most of the earlier studies on free-floating carsharing did not
consider their effects on carsharing vehicle usage. In this study, we identify the
impact of land-use variables on free-floating carsharing vehicle rental choice and
parking duration (vehicle unused duration). Two different methodological
approaches are used to investigate free-floating carsharing vehicle usage. The first
approach is a logistic regression model approach where the dependent variable is a
binary outcome that indicates whether or not renting of an available carsharing
vehicle occurred, and the second is a duration model approach where the dependent
variable is a continuous variable representing the unused duration of carsharing
vehicles’ fleet time.
3 Methodology
Two different methodologies are adopted to identify land-use factors affecting the
usage of car2go carsharing vehicles. The first approach is a logistic regression
model approach where the dependent variable is a binary outcome representing the
choice of rental of an available free-floating vehicle within a given time period.
This approach will be helpful to identify land-use factors that affect the usage of
available carsharing vehicles in a 3-h time window. The second approach is a
duration model where each available vehicle observed at a specific time period is
observed for 3 h or until it is rented again. The methodologies are presented in the
following subsections.
In the logistic regression model, the response variable (y_j) is binary or dichotomous
in nature: y_j takes the value 1 for a given available car2go vehicle j if the
vehicle becomes unavailable (because of renting) during the observation period,
and the value 0 if the vehicle remains available to rent at the end of the
observation period.
observation period. The equation for the standard logistic regression model is:
where, G(.) is the cumulative density function for the error term, which is assumed
to be logistically distributed as
exj β
G xj β ¼ :
1 þ exj β
To estimate the model by maximum likelihood, we use the likelihood function for
each j. The parameters to be estimated in the logistic regression model are the β
parameters.
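Such a model can be estimated with standard maximum-likelihood routines. A sketch using statsmodels follows; the file and covariate names are hypothetical stand-ins for the land-use variables described later:

```python
import pandas as pd
import statsmodels.api as sm

# One row per vehicle observed available at 9:00 AM (hypothetical columns)
df = pd.read_csv("vehicles_9am.csv")
y = df["rented_within_3h"]              # y_j: 1 if rented, 0 otherwise
X = sm.add_constant(df[["employment_density", "parking_cost",
                        "transit_stops", "median_income"]])

model = sm.Logit(y, X).fit()            # maximum likelihood estimation of beta
print(model.summary())                  # coefficients and standard errors
```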
The instantaneous renting rate per unit of time is λ(t), commonly referred to as the
instantaneous hazard rate in the duration-model literature. The mathematical definition
of the hazard in terms of probabilities is

$$\lambda(t) = \lim_{h \to 0^{+}} \frac{P(t \le T < t + h \mid T \ge t)}{h}.$$
The hazard function can be expressed simply in terms of the density and the cdf:

$$\lambda(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)} = \frac{dF/dt}{S(t)} = -\frac{dS/dt}{S(t)} = -\frac{d \ln S(t)}{dt},$$
where S(t) is referred to as a “survivor function” in the duration literature and as a
“reliability function” in the reliability literature. In this study the
authors prefer the term “availability function”.
The shape of the hazard function (instantaneous renting rate) has important
implications for duration analysis. Two parametric shapes for instantaneous renting
rate λ(t) are considered here. In the first case, the instantaneous renting rate is
assumed to be constant, implying that there is no duration dependence or duration
dynamics, mathematically,
$$\lambda(t) = \lambda, \quad \forall\, t \ge 0.$$
The conditional probability of being rented does not depend on the elapsed time
since it has become available to rent. The constant-hazard assumption corresponds
to an exponential distribution for the duration distribution. The instantaneous
renting rate function with covariates is
$$\lambda(t; x) = \exp(x\beta).$$
In the second case, duration dependence is accommodated through a Weibull specification, whose instantaneous renting rate with covariates is

$$\lambda(t; x) = \exp(x\beta)\,\alpha t^{\alpha - 1},$$

where α is the Weibull shape parameter.
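Right-censored observations (vehicles still unrented at the end of the observation window) enter the likelihood through the availability function rather than the density. The following sketch estimates the constant-hazard (exponential) case with scipy on toy data; the Weibull case adds the shape parameter α:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, t, rented):
    """Exponential duration model with right censoring: rented vehicles
    contribute log f(t) = log(lam) - lam*t, censored ones log S(t) = -lam*t."""
    lam = np.exp(X @ beta)              # lambda(t; x) = exp(x beta)
    return -np.sum(rented * np.log(lam) - lam * t)

# Toy data: a constant plus one covariate, durations in hours, and an
# indicator that is 0 for vehicles still available at the end (censored)
X = np.column_stack([np.ones(5), [0.2, 1.5, 0.7, 2.0, 0.1]])
t = np.array([0.5, 2.9, 1.2, 3.0, 3.0])
rented = np.array([1, 1, 1, 0, 0])
result = minimize(neg_loglik, x0=np.zeros(X.shape[1]), args=(X, t, rented))
print(result.x)                         # estimated beta parameters
```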
4 Data Description
Five different data sources are used in this study. The first dataset, as discussed
earlier, is the real-time 24-h car2go vehicle location and condition data in 5-min
intervals obtained from car2go. The data provide robust information on
vehicle rentals, movement, and availability across the Austin, TX area. This
primary data enables one to identify the usage of free-floating carsharing vehicles.
The second dataset is the transportation analysis zone (TAZ) level land-use data
provided by CAMPO for year 2005. The CAMPO data provided a host of land-use
level socio-demographic information including population, number of households,
household size, 2005 median household income, autos owned, and total employment.
The third dataset is land-use level demographic data based on the 2010
census obtained from the Capital Area Council of Governments (CAPCOG).
Total population, race/ethnicity, as well as other items, at block level are obtained
from the CAPCOG data source. The fourth dataset is a parking survey conducted in
2010 provided by CAMPO that encompasses the study area. The fifth dataset
describes transit stops of the study area as an open data source created by Capital
Metro, the public transit service provider (Capital Metropolitan Transportation
Authority 2012).
Car2go vehicle location data was collected for a typical weekday (Tuesday) on
February 25th, 2014 from 12:00 AM to 12:00 PM in 5-min intervals. The data
collection effort was automated by setting up a Windows Task Scheduler operation
that collected and stored data thereby reducing possibilities of human error. Car2go
API data provides information about available vehicles only. Therefore, if a vehicle
(each car2go vehicle in service has a unique vehicle identification number) is
observed at time t and after some time h the vehicle is no longer found in the
dataset, then it implies that the vehicle was rented within the t and t þ h time
intervals. In this study, each of car2go vehicles observed available at 9:00 AM is
tracked every 5 min until 12:00 PM to see if they were rented. For instance if one
car2go vehicle (say, vehicle 1) is available at 9:00 AM and 9:05 AM but it does not
appear in the dataset at 9:10 AM, then it is assumed that the vehicle became
unavailable at 9:10 AM because it was rented sometime between 9:05 AM and
9:10 AM. The availability status (rented vs. not rented) of each observed vehicle is
used as the dependent variable of the logistic regression model.
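The detection rule just described can be expressed compactly in Python. The sketch below is a minimal illustration under an assumed data layout (a dict mapping each snapshot time to the set of visible vehicle IDs); it is not the authors' actual collection script, which ran under the Windows Task Scheduler.

from datetime import datetime, timedelta

def detect_rentals(snapshots):
    # snapshots: {datetime: set of vehicle IDs available at that time}
    rentals = []
    times = sorted(snapshots)
    for prev_t, next_t in zip(times, times[1:]):
        for vid in snapshots[prev_t] - snapshots[next_t]:
            # vehicle disappeared: assumed rented in (prev_t, next_t]
            rentals.append((vid, prev_t, next_t))
    return rentals

t0 = datetime(2014, 2, 25, 9, 0)
snaps = {t0: {"v1", "v2"}, t0 + timedelta(minutes=5): {"v2"}}
print(detect_rentals(snaps))   # v1 was rented between 9:00 and 9:05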
The dependent variable for the duration model analysis is continuous. If a
vehicle was found to become available at 7:00 AM and then was rented again at
9:10 AM, then the total unused duration is 130 min (in fact, the unused duration is only approximate, because a vehicle observed as available at 7:00 AM could have become available anytime between 6:55 AM and 7:00 AM). The availability status and the
duration unused (available duration prior to next rental) of each vehicle was
recorded for further analysis.
Figure 1 shows the characterization of car2go vehicles unused duration. Vehicle
1 corresponds to the sample observation described earlier. The dataset also contains
observations represented by Vehicle 2, 3, and 4, for which the beginning or ending
times are synonymous with the sampling beginning or ending times. Such observa-
tions are referred to as right or left censored observations. The duration models used
in this study are capable of accounting for the right-censored observations. Vehicle-
type 3 illustrates the case for which the beginning time when it became available is
unknown and is referred to as a left censored observation. Vehicle-type 4 represents
those cases having both left and right censoring. By following the history of the
vehicles represented by Types 3 and 4 one can trace the start time of the unused
duration. However, this paper focuses only on the unused duration in a given
weekday and therefore limits the analysis to midnight to noon on a typical weekday.
The final sample was assembled in a number of steps. First, a total of 240 available vehicles observed at 9:00 AM were mapped to one of the 172 TAZs within the car2go operating area.
A total of 240 car2go vehicles were observed in the 172 TAZs encompassing the car2go operating area, which covers 50 mi² in Austin, TX, at 9:00 AM on February 25th, 2014. At the time of observation the 240 available vehicles were spread across 100 TAZs. The CAMPO data divide all TAZs into five area types, namely CBD, Urban Intense, Urban, Suburban, and Rural; the car2go operating area comprises the first four. The total number of zones in each area type, the distribution of vehicle locations across the four area types, and the average population and employment in the corresponding area types are presented in Table 1. Although higher numbers of available vehicles are observed in the area types with higher population (i.e. the Urban Intense and Urban area types), the average number of available vehicles per zone is highest in the CBD area.
Summary statistics of land-use variables for the car2go service area, weighted by the number of available cars, are presented in Table 2. As Table 2 shows, in the locations of car2go vehicles at 9:00 AM within the service area, on average more than 11 % of households have no car. Similarly, available vehicles are located mostly in neighborhoods where the mean percentage of adult population is about 87 %. This is not surprising because the central part of the city, including the downtown area and the neighborhoods (CBD and Urban Intense area types) around the University of Texas at Austin, has higher concentrations of car2go members (Kortum and Machemehl 2012). The average
Table 2 Summary statistics of the land use variables used in the models
Land use variables Mean Min. Max.
Percent of households having no car (year 2010) 11.01 0.87 46.75
Percent of over 18 population (year 2010) 86.91 65.65 100.00
Average household size (year 2005) 2.11 0.00 3.66
Number of transit stops (year 2010) 11.02 1.00 31.00
Employment density (year 2005: # of total employment/acre) 24.35 0.07 199.47
household size is 2.11, which is slightly lower than the 2005 average of 2.40 in
Austin (City of Austin 2009).
All 100 TAZs are divided into two categories based on the median income of the TAZ. A total of 33 (14 %) vehicles are located in high-income TAZs (income > $60,000). Based on parking policy, all 100 TAZs are divided into two groups, and only seven TAZs are found to charge parking fees. A total of 29 (12 %) of the available vehicles were observed in TAZs where parking is not free.
A total of 110 (46 %) of the observed vehicles were rented between 9:00 AM and
12:00 PM. The total unused duration of the rented vehicles ranges from 5 to
705 min with an average unused duration of 331 min. The unused duration of the
130 unrented vehicles during the period of observation (12:00 AM to 12:00 PM)
ranges from 180 to 720 min with an average of 480 min.
5 Model Estimation Results

This section presents a discussion of the logistic regression model estimation results for carsharing vehicle rental choice and the duration model estimation results for carsharing vehicle unused duration (Tables 3 and 4, respectively).
Increasing household size in a TAZ increases the propensity of the vehicles located
in those TAZs to be rented. On the other hand, as the employment density increases
in a TAZ, the propensity to rent an available car2go vehicle from that TAZ
decreases. This result is not surprising because free-floating carsharing vehicles
are less likely to be used on a daily basis as a commuting solution.
As expected, there is a higher renting propensity of a car2go vehicle in a TAZ
with parking charges compared to a TAZ with free parking. This finding has
important policy implications in that parking policy directly affects the usage of
such services. To manage higher parking demand, the City of Austin enforces
metered parking. The result suggests that allowing free parking at the metered
parking spaces for carsharing vehicles can positively affect the usage of carsharing
vehicles. The result is also in line with the earlier research findings showing that
metropolitan areas with parking policies favoring carsharing have stronger
carsharing services (Kortum 2012). The result may also imply that availability of and easy access to carsharing vehicles help their potential usage. The result associated with the number of transit stops variable shows that as the number of transit stops increases, the propensity to rent an available car2go vehicle from that TAZ increases. This supports the assumption that carsharing services in areas with good transit service make intermodal trips easier.
The CAMPO area type definition is also used as an explanatory variable in the carsharing vehicle rental choice model; however, its effect was found to be non-significant. Perhaps the disaggregated land-use variables included in the model explain the data variability more precisely than the aggregated land-use classification.
Table 4 shows the estimated covariate effects for the two parametric duration model specifications. The estimated value of the alpha parameter for the Weibull distribution of the hazard function (instantaneous renting rate) is 1.33 > 1.0 and is statistically significant at the 0.05 level. An alpha parameter greater than 1 indicates a monotonically increasing hazard (probability of being rented) over time, so the Weibull specification describes the data appropriately.
The effect of land-use level socio-demographic characteristics indicates that locations with higher percentages of households having no car have a higher hazard (i.e., a smaller unused duration) than locations with lower percentages of such households, probably because carsharing expands the mobility options available in these locations. A 1 % increase in 0-car households in a TAZ increases the probability of the vehicle(s) located in that TAZ being rented by 5.2 %. Vehicles
located in areas with higher percentages of adult population (age >18 years) have a
higher hazard (i.e., a smaller unused duration). A 1 % increase in the percentage of
6 Conclusion
metropolitan planning organizations for travel demand models, which are also accessible in most cases. Often the expense of data collection makes research difficult. The innovative dataset used in this research sheds light on alternative data sources available for transportation planning research.
Admittedly, there are several limitations associated with this study that require
further research. First, the data used in this study does not provide information
about the trip purposes of the carsharing vehicle trips. Therefore, the study considers a 3-h period during the AM off-peak, primarily to focus on trips made for discretionary activities. However, future research should investigate other time periods to determine the effect of time of day on carsharing vehicle usage and to compare those findings with the current study. Second, the study suggests
that vehicles located in high median income TAZs are more likely to be rented
compared to vehicles located in low-income TAZs. This result may have two different implications. The carsharing alternative may replace an automobile trip, a walk/bike trip, or a transit trip. From a transportation planning perspective, it would be desirable for carsharing vehicles to replace personal auto trips. However, the availability of carsharing may instead replace transit trips or non-motorized trips and thereby increase overall auto trips. Future research is required to carefully investigate the effect of income on the
choice decisions considering carsharing as an alternative travel mode. Third, the
absence of trip purpose data also limits the analysis effort for considering different
trip distribution patterns at different times of day. Future research should develop
other data sources to investigate the variation in carsharing vehicle usage by trip
purpose and time of day. Finally, the actual locations of the carsharing vehicles are used to study their usage in terms of rental choice and parking duration. These locations may reflect either (1) where the last users left them or (2) where car2go staff relocated them within the study area. Future studies
should explore the destination choice of carsharing vehicles during different times
of the day.
References
Bardhi F, Eckhardt GM, Arnould EJ (2012) Liquid relationship to possessions. J Consum Res 39
(3):510–529
Capital Metropolitan Transportation Authority (2012) https://data.texas.gov/capital-metro
Car2go Austin Parking FAQs (2014) https://www.car2go.com/common/data/locations/usa/austin/
Austin_Parking_FAQ.pdf
City of Austin (2009) Community inventory demographics: demographic & household trends. ftp://
ftp.ci.austin.tx.us/GIS-Data/planning/compplan/community_inventory_Demographcs_v1.pdf.
Accessed 23 Apr 2014
Crain & Associates (1984) STAR project: report of first 250 days. U.S. Department of Transpor-
tation, Urban Mass Transportation Administration, Washington, DC
Firnkorn J, Müller M (2011) What will be the environmental effects of new free-floating
carsharing systems? The case of car2go in Ulm. Ecol Econ 70(8):1519–1528
Fricker JD, Cochran JK (1982) A queuing demand model to optimize economic policy-making
for a shared fleet enterprise. In: Proceedings of the October ORSA/TIMS conference in
San Diego, CA
Greater Austin 2013 Economic Development Guide (2013) Car2go. http://www.businessintexas.
org/austin-economic-infrastructure/car2go/
Haefeli U, Matti D, Schreyer C, Maibach M (2006) Evaluation car-sharing. Federal Department of
the Environment, Transport, Energy and Communications, Bern, Switzerland
Katzev R (2003) Car sharing: a new approach to urban transportation problems. Anal Soc Issues
Public Policy 3(1):65–86
Kortum K (2012) Free-floating carsharing systems: innovations in membership prediction, mode share, and vehicle allocation optimization methodologies. Ph.D. dissertation, The University of Texas at Austin
Kortum K, Machemehl RB (2012) Free-floating carsharing systems: innovations in membership
prediction, mode share, and vehicle allocation optimization methodologies. Report SWUTC/
12/476660-00079-1. Project 476660-00079
Millard-Ball A (2005) Car-sharing: where and how it succeeds, vol 108. Transportation Research
Board, Washington
Shaheen SA, Cohen AP (2012) Carsharing and personal vehicle services: worldwide market
developments and emerging trends. Int J Sustain Transp 7(1):5–34
Shaheen SA, Cohen AP (2013) Innovative mobility carsharing outlook. Transportation Sustain-
ability Research Center, University of California at Berkeley
Shaheen S, Sperling D, Wagner C (1998) Carsharing in Europe and North America: past, present,
and future. Transp Q 52(3):35–52
Shaheen SA, Schwartz A, Wipyewski K (2004) Policy considerations for carsharing and station
cars: monitoring growth, trends, and overall impacts. Transp Res Rec 1887:128–136
Shaheen SA, Cohen AP, Chung MS (2009) North American carsharing: 10-year retrospective.
Transp Res Rec 2110:35–44
Shaheen SA, Cohen AP, Martin E (2010) Carsharing parking policy. Transp Res Rec
2187:146–156
Dynamic Agent Based Simulation of an
Urban Disaster Using Synthetic Big Data
Abstract This paper illustrates how synthetic big data can be generated from
standard administrative small data. Small areal statistical units are decomposed into
households and individuals using a GIS buildings data layer. Households and indi-
viduals are then profiled with socio-economic attributes and combined with an agent
based simulation model in order to create dynamics. The resultant data is ‘big’ in
terms of volume, variety and versatility. It allows for different layers of spatial
information to be populated and embellished with synthetic attributes. The data
decomposition process involves moving from a database describing only hundreds
or thousands of spatial units to one containing records of millions of buildings and
individuals over time. The method is illustrated in the context of a hypothetical
earthquake in downtown Jerusalem. Agents interact with each other and their built
environment. Buildings are characterized in terms of land-use, floor-space and value.
Agents are characterized in terms of income and socio-demographic attributes and
are allocated to buildings. Simple behavioral rules and a dynamic house pricing
system inform residential location preferences and land use change, yielding a
detailed account of urban spatial and temporal dynamics. These techniques allow
for the bottom-up formulation of the behavior of an entire urban system. Outputs
relate to land use change, change in capital stock and socio-economic vulnerability.
1 Introduction
behavior, and residential location preferences, along with a dynamic house pricing
system that informs land-use dynamics, a detailed description of urban spatial and
temporal dynamics is presented.
The paper proceeds as follows. After reviewing AB modeling applications for
urban disasters, we outline the modeling framework and context of the study. We
then describe how the big data is generated and coupled with the AB model.
Simulation results are presented relating to change in land use, capital stock and
socio-economic structure of the study area. To embellish the visualization potential
of this data, we present some output in the form of dynamic web-based maps.
Finally, we speculate on further developments derived from this approach.
and variety.1 At the broader, system-wide level of the urban area, Zou et al. (2012)
argue that the bottom-up dynamics of AB simulation become homogenized when
looking at complex urban processes such as sprawl or densification. They propose a
different simulation strategy to that commonly used in AB modeling. This involves
‘short-burst experiments’ within a meta-simulation framework. It makes for more
efficient and accelerated AB simulation and allows for the easy transfer across
different spatial and temporal scales. Elsewhere, we have also illustrated that
complex macroscopic urban change such as land use rejuvenation and morphology
change in the aftermath of an earthquake, can be suitably analyzed in an AB
framework (Grinberger and Felsenstein 2014, 2017).
While the literature addressing urban outcomes of disasters and using agent-
based modeling is limited, there is a larger ancillary literature that indirectly
touches on this issue from a variety of traditions. Chen and Zhan (2008) use the
commercial Paramics simulation system to evaluate different evacuation tech-
niques under different road network and population density regimes. Another
approach is to couple GIS capabilities to an existing analytic tool such as remote
sensing and to identify disaster hot spots in this way (Rashed et al. 2007). Alterna-
tively, GIS can be integrated with network analysis and 3D visualization tools to
provide real-time micro-scale simulation tools for emergency response at the individual building or city block level (Kwan and Lee 2005). At a more rudimentary
level, Chang (2003) has suggested the use of accessibility indicators as a tool for
assessing the post disaster effects of earthquakes on transportation systems.
A particular challenge to all forms of simulation modeling comes from the
inherent dynamics of AB simulation and the visualization of results that has to
capture both spatial and temporal dimensions. In the context of big data, this
challenge is amplified as real time processing needs to also deal with large quan-
tities of constantly changing data. These are state of the art challenges that require
the judicious use of computational techniques, relational databases and effective
visualization. The literature in this area is particularly thin. In a non-AB environ-
ment, Keon et al. (2014) provide a rare example of how such integration could be
achieved. They illustrate an automated geo-computation system in which tsunami
inundation and the resultant human movement in its aftermath are simulated. They
couple the simulation model with a web-based mapping capability thereby allowing
the users to specify input parameters of their choosing, run the simulation and
visualize the results using dynamic mapping via a standard web browser. A mix of
server-side and client-side programming is invoked that allows the user all the
standard functionality of web-based mapping.
Our approach takes this integration one stage further. In addition to using AB
simulation we do not just combine a simulation model with a web-based visuali-
zation capacity but also generate the synthetic big data that drives the model. Once
derived, this data needs to be spatially allocated. The literature provides various
1 While not calling this ‘big data’ as such, Torrens (2014) notes that the volume of locations/vectors to resolve for each object moved in the simulation is of the order of 10¹²–10¹⁴.
methods such as population gridding (Linard et al. 2011), areal interpolation which
calls for ‘creating’ locations (Reibel and Bufalino 2005) and dasymetric represen-
tation which uses ancillary datasets such as roads or night-time lights imagery to
approximate population location (Eicher and Brewer 2001; Mennis 2003). An
alternative approach, adopted here, is to distribute the data discretely without
resorting to interpolation or dasymetric mapping. This calls for combining different
sources of survey, administrative and non-structured data to create population count
data. In this spirit, Zhu and Ferreira (2015) have recently illustrated how spatially
detailed synthetic data from different sources can be input into a microsimulation
framework. In our context, the result is detailed spatially-referenced local popula-
tion count data that uses an appropriate spatial anchor. We use detailed building-
level data in order to accurately allocate populations. This method has been adopted
elsewhere, for example by Harper and Mayhew (2012a, b) who geo-reference
administrative data using a local source of land and property information. Similarly,
Ogawa et al. (2015) use building floorspace data to spatially allocate population for
earthquake damage assessment.
As behooves big data, the modeling framework used here is data-driven. The
process is outlined in Fig. 1. Socio-economic data for coarse administrative units
(small data) is disaggregated into buildings and individuals on the basis of a GIS
building layer and then recombined into households. The resultant synthetic data
gives an accurate socio-economic profiling of the population in the study area.
Coupling this data with an AB simulation model adds both temporal and spatial
dynamics to the data. The result is a multi-dimensional big data set that affords
flexibility in transcending conventional administrative boundaries. Outputs relate to
socio-economic change, change in land use and capital stock in the aftermath of an
earthquake. To fully capture the richness of the data generated we use web-based
mapping to generate extra visual outputs.
The study area houses 22,243 inhabitants, covers 1.45 km² and is characterized by low-rise buildings punctuated by high-rise structures. A heterogeneous mix of land uses exists, represented by residential buildings (243,000 m², 717 structures), commercial buildings (505,000 m², 119 structures) and government/public use buildings (420,000 m², 179 structures). The area encompasses two major commercial spaces: the Machaneh Yehuda enclosed street market and the CBD. Three major transportation arteries traverse the area and generate heavy traffic volumes: Agripas and Jaffa Streets (the light railway route) run north-west to south-east and King George Street runs north-south. The area thus exhibits a heterogeneous mix of residential, commercial, governmental and public land uses and high traffic volumes.
2 A statistical area (SA) is a uniform administrative spatial unit defined by the Israeli Central Bureau of Statistics (CBS) corresponding to a census tract. It has a relatively homogenous population of roughly 3000 persons. Municipalities of over 10,000 population are subdivided into SAs.
3 We also use coarser, regional data on non-residential plant and equipment stock to calculate non-residential building value. The estimating procedure for this data is presented elsewhere (Beenstock et al. 2011).
set of units. This transfer can be effected in a variety of ways such as using spatial
algorithms, GIS techniques, weighting systems etc. (Reibel and Bufalino 2005).
The GIS layer provides the distribution of all buildings nationally with their aerial
footprint, height and land use. We derive the floor-space of each building and
populate it with individuals to which we allocate the relevant socio-economic
attributes of the SA to which they belong, according to the original distribution of
these attributes in the SA. In this way, synthetic big data is created from spatial
aggregates. The mechanics of the derivation are described in Appendix 1. The
variables used to populate the buildings and drive the model are:
• Building level: land-use, floor-space, number of floors, building value,
households
• Household level: inhabitants, earnings, car ownership
• Individual level: household membership, disability, participation in the work force, employment sector, age, workplace location.
The variables used in the model, their sources and level of disaggregation appear
in Table 1.
Building-level disaggregation: The basis of the disaggregation procedure is calculating the floor-space of each building using height and land-use. We assume an average floor height of 5 m for residential buildings and 7 m for non-residential buildings (see Appendix 1). These figures are derived by comparing official totals of national built floor-space (for each land-use) with the floor-space calculated from the building layer.
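A minimal sketch of this floor-space derivation (field names and the at-least-one-floor rounding rule are assumptions; the paper does not publish the exact code):

def floor_space(height_m, footprint_m2, land_use):
    # Average floor height: 5 m residential, 7 m non-residential.
    floor_height = 5.0 if land_use == "residential" else 7.0
    floors = max(1, round(height_m / floor_height))   # assume at least one floor
    return floors * footprint_m2

print(floor_space(15.0, 200.0, "residential"))   # 3 floors -> 600.0 m2
print(floor_space(21.0, 350.0, "commercial"))    # 3 floors -> 1050.0 m2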
The entire data disaggregation procedure is automated using SQL and Python code and the results at each stage are stored in a spatial database. The process entails first allocating inhabitants to buildings and then assigning them socio-economic attributes. These inhabitants are then grouped into households and a further allocation of household attributes is performed. This necessarily involves loss of data, because whole numbers (integers) such as households and inhabitants are divided by fractions such as building floor-space (for density calculations) and percentages (for socio-economic attribute distributions). In order to avoid loss of data and to meet the national control totals in each calculation, the SQL code compensates for data losses (or gains) in the transition from floating-point to integer values and ensures that the original control totals are always met. This is done by automatically adjusting the rounding threshold of the floating-point figures for each variable separately, to offset the loss or gain in counts.
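The authors implement this threshold adjustment in SQL; an equivalent way to guarantee that integer allocations meet a control total is the largest-remainder rounding sketched below (an illustrative substitute, not the authors' code):

import math

def allocate(control_total, weights):
    # Distribute an integer total across units proportionally to weights,
    # so that the rounded counts sum exactly to the control total.
    total_w = sum(weights)
    raw = [control_total * w / total_w for w in weights]
    counts = [math.floor(r) for r in raw]
    # hand remaining units to the largest fractional remainders
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: control_total - sum(counts)]:
        counts[i] += 1
    return counts

print(allocate(100, [10.2, 35.7, 54.1]))   # [10, 36, 54], sums to 100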
Disaggregation at the level of the individual: The disaggregated building-level data serves as the basis for further disaggregation at the level of the individual. The building database includes a total of 1,075,904 buildings. A total of 7,354,200 inhabitants are allocated to 771,226 residential buildings. Disaggregation of the data to the individual begins with assigning each individual in the database an id, so that each is represented as a unique entity tied to a building. Next, each person is allocated a random point location (a lat/lon coordinate) within the building with which it is associated. In each building, demographic attributes
(labor force participation, employment sector, disabilities and age group) are
allocated to each individual so that they comprise the entire distribution in the
building which in turn gets its distribution from the SA in which it is located. In the
same way, the distribution of work locations of inhabitants by employment sector
is derived from a GPS-based transport survey carried by the Jerusalem Transport
Master Plan Team (Oliveira et al. 2011). This is used to determine the distribution
of inhabitants working inside or outside the study area according to their sector of
employment and to assign the corresponding binary values to individuals.
Household level clustering and attribute allocation: Individuals are clustered into households by the size of the household in each building. This creates new unique entities in the database representing households. Households are associated with buildings, and inhabitants are assigned to them. The clustering introduces heterogeneity in the age distribution within the household, so as to closely represent a “traditional household” containing both adults and children where these are present. This is achieved by an algorithm that iterates through the inhabitants of each building, sorted by age, and assigns them to households. Depending on the age distribution in the building, this algorithm clusters the inhabitants of a building into households with similar, though not identical, age profiles. Each household is assigned the SA average household earnings value. Other attributes such as car ownership are assigned to households in the same way.
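The published description leaves the exact assignment order open; one plausible reading is the round-robin deal sketched below, which distributes age-sorted inhabitants so that each household mixes adults and children where present:

def cluster_households(ages, n_households):
    # Deal inhabitants, oldest first, round-robin into households.
    households = [[] for _ in range(n_households)]
    for i, age in enumerate(sorted(ages, reverse=True)):
        households[i % n_households].append(age)
    return households

print(cluster_households([2, 5, 8, 34, 36, 40, 42, 70], 3))
# -> [[70, 36, 5], [42, 34, 2], [40, 8]]: each household gets a spread of ages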
The high resolution data detailed above is combined with an agent-based model. As
agents represent the focal catalysts of change and aggregate patterns are
decomposed into actions of individual agents, this further unshackles the constraints
imposed by data collected on the basis of arbitrary administrative borders. In the
context of the current study this allows us to relate to the specific spatio-temporal
nature of the event and its long-term impacts. To do this we characterize the three
basic elements of an AB model: agents, their environment and rules governing
agent-to-agent and agent-to-environment interactions (Macal and North 2005).
The data provides the first two with individuals and households as agents and
buildings as the urban environment. The model itself reflects the dynamics of the
urban system via a collection of simple behavioral rules. These govern the interactions within the system at each iteration (Fig. 4), where each iteration is set to represent one day, and also include an exogenous shock in the form of the earthquake. These rules are described in Appendix 2.
A simulation entity is created to represent each individual, household and
building along with its spatial and socio-economic characteristics. Identifying the
unique workplace for each employed individual in the study area is done on the
basis of satisficing behavior. The first building which satisfies randomly generated
preferences in terms of distance from home location and floor-space size (assuming
larger functions attract more employees) and is of the land-use associated with the
individual’s employment sector is designated as the agent’s workplace.
Agent behavior (bottom-up procedures): This simulation characterizes the city as a
spatial entity whose organization emerges from the aggregate behavior of all its
citizens (individuals). Individuals and households are therefore identified as the
agents of the urban system. Their behavior is simplified into two decision sets: the
decisions of households about place of residence and the decisions of individuals
about choice and sequence of daily activities.
The decision to change place of residence is probabilistic and based on compar-
ing a randomly drawn value to exogenous probabilities of migrating out of the study
area (city-level probability), or relocating to another part of the study area
(SA specific probability).4 Choosing a new place of residence location follows
two decision rules: a willingness to commit up to one third of monthly household
earnings to housing and preferences regarding the socio-economic characteristics of
the residential environment. This follows a probabilistic and satisficing procedure
similar to that described for selection of work place. If a household fails to find an
alternative place of residence after considering 100 possible locations, it leaves the
study area. Individual agents that are members of the household relocate/migrate
with the household. In-migration is treated in the model by having the number of
potential migrating households dependent on the number of available housing units
and an exogenous in-migration/out-migration ratio. New households with up to two
members are comprised of adults only. Those with more members include at least
one non-adult agent and their socioeconomic status reflects the urban average for
key socio-economic attributes.
At each iteration individuals conduct a variety of activities within the study area.
These are important for land-use dynamics and the mobility paths between land uses
(see below). The number of activities undertaken ranges from 0 to 11 and varies by
socio-economic characteristics (age, car ownership of household, disability,
employment status, location of employment) and randomly generated preferences.
The location of each activity (with the exception of work activity for agents
employed within the study area) is determined by the attractiveness of different
locations. This in turn is dependent on floor-space size, environment, distance from
previous activity, and the mobility capability of the individual. These choice criteria are, again, probabilistic and satisficing. Movement between each pair of activities is not necessarily by shortest path; a simpler aerial-distance-based algorithm is used to reduce computing demands and, again, to reflect the satisficing nature of agents.5
4 Calculated from 2012 immigration data in the Statistical Yearbook for Jerusalem 2014 (Jerusalem Institute for Israel Studies).
5 The algorithm works as follows: at each step, junctions adjacent to the current junction are scanned and the junction with the shortest aerial distance to the destination is flagged as the current junction for the next step. If a loop is encountered, it is deleted from the path. The algorithm ends when agents arrive at the junction closest to the destination or when all junctions accessible from the origin are scanned.
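A minimal sketch of the aerial-distance heuristic in footnote 5 (junction IDs and coordinates are illustrative; skipping already-visited junctions is used here as a loop-avoiding shortcut in place of the paper's after-the-fact loop deletion):

import math

def greedy_route(adjacency, coords, origin, dest):
    dist = lambda a, b: math.dist(coords[a], coords[b])
    path, visited = [origin], {origin}
    while path[-1] != dest:
        # scan junctions adjacent to the current one
        candidates = [j for j in adjacency[path[-1]] if j not in visited]
        if not candidates:
            break   # all junctions accessible from here already scanned
        nxt = min(candidates, key=lambda j: dist(j, dest))
        path.append(nxt)
        visited.add(nxt)
    return path

adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
xy = {"A": (0, 0), "B": (1, 1), "C": (1, -1), "D": (2, 0)}
print(greedy_route(adj, xy, "A", "D"))   # ['A', 'B', 'D']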
Results are presented relating to three main themes: land use change, change in
value of capital stock and socio-economic change. These reflect three dimensions of
urban vulnerability to unanticipated shocks: functional, economic and social vul-
nerability, respectively. The simulation is run 25 times, each with a duration of 1000 iterations (i.e. days). The earthquake occurs on day 51 in each simulation and is located randomly within the study area. The 50-day run-in time is heuristically derived, comprising a 30-day period for stochastic oscillation and a further 20-day period for ‘settling down’. Like any catastrophic change, the initial impact
infrastructure. Yet as the results illustrate, these lead to a second, indirect round of
impacts induced by the way in which agents react to the new conditions created.
Sensitivity tests for key model parameters are presented in Appendix 3.
Traffic Patterns and Land Use Change: we present a series of maps that illustrate
change in land use (from residential to commercial and vice versa) and the
concomitant dispersal of traffic activity at discrete time points (Figs. 5 and 6). As
can be seen, in the period following the disaster the main commercial thoroughfares running N-S and E-W across the area lose their prominence as movement patterns change. Within 50 days of the event, major traffic loads shift from the
north-west part of the study area into the south and to a lesser extent to the north-
east and central sections. Commercial activity responds to this change and a cluster
of medium-sized commercial uses appears in the south-west. However, by day
500 we observe a reversal of this pattern. Evidently, the recovery of the traffic
network, along with the anchoring effect of large commercial land uses, helps
Agripas St. to regain its position causing a new commercial cluster to develop in
its vicinity. The immediate, post disaster change in traffic pattern does however
leave a permanent footprint. Commercial land use becomes more prevalent in the
north-east and CBD centrality seems to be slightly reduced. The buildings
containing these activities have larger floor-space area than those located in the
new emerging commercial clusters. This implies a potentially large addition of
dwelling units to the residential stock in the case that these buildings transform into
residences. One year after the shock these patterns hardly change as the new
transportation pattern gets locked in, empty buildings become occupied and the
potential for land use change is reduced. From the 3D representation of frequency
of land use change (Fig. 6) we can identify the time juncture where buildings that
were previously empty become occupied. Between day 500 and day 1000 land use
tends to become rejuvenated in the sense that unoccupied buildings become
populated.
(Fig. 6 legend: frequency of land use change is represented by building height; building color represents initial use; the colored section represents the share of total simulations in which the building was in a use other than the original at that time point, and grey the share of times the building was unoccupied; road height represents absolute traffic volume, while shading represents relative volume within the distribution of all roads.)
Change in the Value of Capital Stock: standard urban economic theory suggests that
demand for capital stock (residential and non residential) is inversely related to
price while the supply of capital stock is positively related to price. In our AB world
with dynamic pricing for both residential and non-residential stock, demand
changes through population change and supply changes through either building
destruction or change in land use as a result of changing traffic loads and accessi-
bility to services. Aggregate simulation results for residential capital stock show
that the number of buildings drops to about 600 after about 100 days and in the long
run never recovers (Fig. 7). However average residential values tend to rise only
after 500 days to about 90 % of their pre-shock values. This lagged recovery may be
attributed to supply shortage and increased access to services as suggested by the
changing land use patterns noted above along with increasing demand from a
growing population. Yet, the fact that a reduced residential stock recovers to almost
the same value suggests that this is due to rising average floor-space volumes. This
would point to buildings that were initially large commercial spaces becoming
residential. Non residential stock behaves rather differently. Long-term increase in
stock is accompanied by lower average values. In contrast to residential stock, the
turning point in these trends is after roughly one year (Fig. 8). Elsewhere we have
identified this with the dispersal of commercial activities from large centers to
smaller neighborhood units (Grinberger and Felsenstein 2014). The current results
reflect a similar picture. The number of units grows but their average value
decreases as the buildings in the south-west and north-east have smaller floor space.
Population Dynamics: The initial impact causes a population loss of about 4000
residents (Fig. 9). After about one year population size recovers, and it continues to grow to about 29,000 by the end of the simulation period. This increase is the result of in-migration of a heterogeneous population with stochastic earnings. The
ability of this extra population to find residence in the area is due to the process
of land use change described above. The new, large residential buildings (previ-
ously commercial spaces) contain many individual dwelling units. While the
average price of buildings rises, the rising average floor-space of buildings pushes
down the average cost per dwelling unit within the building and consequently the
monthly cost of housing services. As a result, lower-income households find suitable lower-cost residential locations, making for an increased population that is poorer on average. Since the average income of new in-migrants is slightly higher than the average income of the area, the lower average income must result from the out-migration of wealthier households that accompanies in-migration.
A composite indicator of social, functional and economic vulnerability can be obtained by looking at the flow-through of households through buildings. The simple ratio of in-coming to out-going households per building at each discrete time step gives an indication of the amount of through-traffic per building and of its population stability. Figure 10 gives a summary account of this
ratio. A high ‘pull’ factor is not necessarily a sign of stability or even attractiveness.
It may be an indicator of transience and instability. The overall picture is one of
unstable population dynamics. The simulations suggest that none of the buildings
that were initially residential are consistently attractive to population. Most of them
have difficulty maintaining their population size, post earthquake. For many build-
ings this is due to either physical damage or change in function for example, from
residential to commercial use. It seems that only new potential residential spaces
that start initially as commercial uses, consistently succeed in attracting population.
The direct and indirect effects of the shock generate much household turnover
(or ‘churning’) through buildings but without any indications of location prefer-
ences for specific buildings or parts of the study area. Total floor space that registers
net positive household movement (strong/weak pull) amounts to only 75 % of the
floor space that registers net negative turnover (strong/weak push). This under-
scores the high state of flux in the study area.
Social Vulnerability: a more in-depth examination of population movement is
presented in Fig. 11 where snapshots of the spatial distribution of social vulnera-
bility at different time steps are presented. We follow Lichter and Felsenstein
(2012) and use a composite index of social vulnerability.6 Green indicates less
vulnerable population and red more vulnerable. The average index value for each building is calculated, disaggregated into households, and used to generate continuous value surfaces using Inverse Distance Weighting (IDW). The parameters used for the interpolation are: pixels of 10 × 10 m, a 100 m search radius and a second-order power function.
6 Social vulnerability by household ($V_{hh}$) is defined as $V_{hh} = 0.5\,Z_{i,hh} - 0.2\,Z_{age,hh} - 0.2\,Z_{\%dis,hh} + 0.1\,Z_{car,hh}$, where Z is the normalized value of a variable, i is household income for household hh, age is the average age group of members of household hh, %dis is the percent of disabled members among all members of household hh, and car is car ownership for household hh.
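A minimal sketch of this index as reconstructed above (the sign pattern follows the printed coefficients and should be checked against Lichter and Felsenstein (2012); the inputs are synthetic):

import numpy as np

def z(v):
    # normalized (z-score) value of a variable across all households
    return (v - v.mean()) / v.std()

def vulnerability(income, age_group, pct_disabled, car_owned):
    return (0.5 * z(income) - 0.2 * z(age_group)
            - 0.2 * z(pct_disabled) + 0.1 * z(car_owned))

rng = np.random.default_rng(0)
n = 1000
v = vulnerability(rng.lognormal(9, 0.5, n),
                  rng.integers(1, 5, n).astype(float),
                  rng.random(n),
                  rng.integers(0, 2, n).astype(float))
print(v[:5])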
Fig. 13 Time lapse visualization of the change in the number of passengers along roads on a
dynamic web-map (see http://ccg.huji.ac.il/AgentBasedUrbanDisaster/index.html)
results for different points in time and for different areas without prior knowledge
of handling spatial data or GIS. They can choose a variable of interest, visualize its
change over time and space and generate the relevant location specific information
they need. To this end, we create a dedicated database for the output results of time
series from the model simulation. This database needs to be carefully constructed and sometimes does not follow strict DB design, but rather contains some flat tables of lateral data so that they can be displayed in pop-up graphs and charts. The visualization includes time-lapse representations of human mobility (household level), changes in passengers along roads, changes in buildings’ land use and value, changes in households’ socio-economic attributes, etc. in the study area (Figs. 12 and 13).
We use the Google Maps API as the mapping platform and add middleware functionalities. These are functions that are not provided by the mapping API but interact with it to provide ancillary capabilities (Batty et al. 2010) such as time-lapse animation, sliders, interactive graphs, etc. These middleware functionalities are User Interface (UI) features that allow for different ways of data querying and interactive engagement, using a variety of JavaScript libraries and APIs. Utilizing this mashup of visualization tools is merely the final stage in the development of the web-map. It is preceded first by extensive analysis and manipulation of vast model output spatial data and second by creating a dedicated database in order to allow easy, intuitive and sometimes lateral querying.
7 Conclusions
This paper makes both a methodological and substantive contribution to the study
of urban dynamics using recent advances in urban informatics and modeling. In
terms of method we illustrate how an agent based model can be coupled with a data
disaggregation process in order to produce synthetic big data with accurate socio-
economic profiling. This fusion adds both temporal and spatial dynamics to the
data. The simulation model uniquely treats the built environment as a quasi-agent in
urban growth. Consequently, more attention is paid to the supply side dynamics of
urban change than generally practiced and the result is a modeling system with
dynamic pricing and an active supply side. We also illustrate how outputs can be
suitably communicated to practitioners and the informed public. Dynamic
web-based mapping is used to enhance civic engagement and public awareness as
to the possible implications of a large scale exogenous shock.
On the substantive side the results of the simulation highlight some interesting
urban processes at work. We speculate about three of these and their implications
for the ability of cities to rejuvenate in the aftermath of an unanticipated event. The
first relates to the permanent effects of temporary shocks. Our results have shown
that temporary shocks to movement and traffic patterns can generate longer term
lock-in effects. In our simulations these have a structural effect on reduction of
CBD commercial activity. The issue arising here is the ability to identify when this
fossilization takes place and when a temporary shock has passed the point of no
return.
The second process relates to the large level of household turnover and
‘churning’ through the built fabric of the city in the aftermath of an earthquake.
Obviously, a traumatic event serves to undermine population stability as housing
stock is destroyed and citizens have to find alternative residence. However, high
turnover levels of buildings point to a waste of resources, material, human and
emotional. In other markets such as the labor market, ‘churning’ might be consid-
ered a positive feature pointing to economic and occupational mobility (Schettkat
1996). However, in the context of a disaster, this would seem to be a process that
judicious public policy should attempt to minimize. The welfare costs of the effort needed to search for new accommodation and the dislocation associated with changing place of residence are likely to fall hardest on weaker and more vulnerable populations (Felsenstein and Lichter 2014).
Finally, our findings shed new light on the familiar concept of ‘urban vulnera-
bility’. Our simulated results show that less vulnerable socio-economic groups
‘weather the storm’ by dispersing and then re-clustering over time. This points to
their higher adaptive capacities. Stronger populations have the resources to accom-
modate the negative impacts of a disaster. Urban vulnerability is thus as much an economic welfare issue as it is an engineering or morphological concept. From a socioeconomic perspective, it is not the magnitude of the event that is important but the ability to cope with its results. This makes urban vulnerability a relative term: a shock of a given magnitude will affect diverse population groups differentially. Vulnerable populations or communities can be disproportionately affected by unanticipated disasters, which are more likely to push them into crisis relative to the general population. Much of this can only be detected at the micro level, such as the household, and is often obscured in studies dealing with aggregate city-wide impacts. The use of highly disaggregated and accurately profiled data would seem to be critical in understanding urban vulnerability.
Acknowledgements This research is partially based on work done in the DESURBS (Designing
Safer Urban Spaces) research project funded by the European Commission FP7 Program under
Grant Agreement # 261652. The authors thank the JTMT for granting access to the HTS database.
For residential buildings, the number of floors (F_R) is estimated as the building height (H_B) divided by an average floor height of 5 m:

$$F_R = \frac{H_B}{5}$$

In the case of non-residential buildings, the number of floors (F_N) is estimated as the building height divided by an average floor height of 7 m:

$$F_N = \frac{H_B}{7}$$

Floor space for each building (S_B) is then calculated by multiplying the number of floors in each building by its polygon area representing roof space:

$$S_B = S_R \cdot F$$
where:
S_R = building polygon footprint
F = building number of floors
The GIS buildings layer and building type serve as the basis for the calculation of residential building value and of non-residential building and equipment value. To create estimates of residential building value we use average house prices per m² for 2008–2013 (in real 2009 prices). In cases where no transactions exist in a specific SA over that period we use a regional estimate for residential property prices.
1. Value of residential buildings (P_BR) is calculated as follows:

$$P_{BR} = P_{SR} \cdot S_{BR}$$

where:
P_SR = average SA price per m²
S_BR = residential building floor space.
2. Value of non-residential buildings is calculated as follows. Non-residential value per m² by region (P_RN):

$$P_{RN} = \frac{V_{RN}}{S_{RN}}$$

where:
S_RN = total regional non-residential floor space
V_RN = total regional non-residential building stock
The non-residential building value per m² for each region is multiplied by the floor space of each non-residential building to produce non-residential building values (P_BN):

$$P_{BN} = P_{RN} \cdot S_{BN}$$

3. Equipment stock per m² by region (P_RE):

$$P_{RE} = \frac{V_{RE}}{S_{RN}}$$

where:
V_RE = total regional non-residential equipment stock
The equipment stock per m² for each region is multiplied by the floor space of each non-residential building to produce equipment stock totals by building (P_BE):

$$P_{BE} = P_{RE} \cdot S_{BN}$$

where:
S_BN = non-residential building floor space.
The source for regional estimates of equipment and machinery stock is as above (Beenstock et al. 2011).
The buildings layer also allows for the spatial allocation of aggregated household and population counts (see Table 1) into building-level household and inhabitant totals. Given these spatial estimates, the distribution of aggregate average monthly earnings, participation in the work force, employment sector, disabilities and age (see Table 1) into a building-level distribution is implemented.
4. Household density by SA (households per m² of residential floor space in a statistical area, H_SR) is calculated as follows:

$$H_{SR} = \frac{H_S}{S_{SR}}$$

where:
H_S = total households per statistical area (IV in Table 1)
S_SR = total statistical area residential floor space.
The number of households per building (H_B) is calculated as follows:

$$H_B = H_{SR} \cdot S_{BR}$$

5. Population density by SA (I_SR) is calculated as follows:

$$I_{SR} = \frac{I_S}{S_{SR}}$$

where:
S_SR = total statistical area residential floor space
I_S = total population per statistical area.
Population counts per building (I_B) are then calculated as follows:

$$I_B = I_{SR} \cdot S_{BR}$$

6. Total monthly earnings per building (M_B) are calculated as follows:

$$M_B = M_{SI} \cdot H_B$$

where:
M_SI = average monthly earnings per household by SA
H_B = total number of households in a building.
7. The number of inhabitants in each building participating in the labor force in 2008 (I_W) is calculated by multiplying the number of inhabitants in a building by the labor participation rate in the corresponding SA:

$$I_W = W_S \cdot I_B$$

where:
W_S = % of inhabitants participating in the labor force in an SA
I_B = population count per building
8. The number of inhabitants per building by employment sector (I_O) is calculated by multiplying the percentage of inhabitants employed by sector (commercial, governmental, industrial or home-based) per statistical area by the number of inhabitants in each building:

$$I_O = O_S \cdot I_B$$

where:
O_S = % of inhabitants employed in an employment category
I_B = population count per building
9. The number of disabled inhabitants in each building (I_D) is calculated by multiplying the number of inhabitants in a building by the percentage of disabled in the corresponding SA:

$$I_D = D_S \cdot I_B$$

where:
D_S = % of disabled inhabitants in an SA
I_B = population count per building
10. The number of inhabitants in each age category (I_A) is calculated by multiplying the number of inhabitants in a building by the percentage of inhabitants in each age category in the corresponding SA:

$$I_A = A_S \cdot I_B$$

where:
A_S = % of inhabitants in each age group category in an SA
I_B = population count per building.
where:
hh is the new residential location for household h,
b_j is the building considered,
[·] is a binary expression with value 1 if true and 0 otherwise,
I_h is household h’s monthly income,
HP_j is the monthly housing cost of an average apartment in building j,
where:
#Ac_i is the number of activities for resident i,
k_i is a randomly drawn number in [0, 1] reflecting preferences regarding the number of activities,
car_h is a binary variable equal to 1 if household h owns a car and 0 otherwise,
dis_i is a binary variable equal to 1 if individual i is disabled and 0 otherwise,
age_i is the age group of individual i,
employed_i is a binary variable equal to 1 if i is employed and 0 otherwise,
here_i is a binary variable equal to 1 when i’s workplace is located within the study area and 0 otherwise,
‖x‖ indicates the integer nearest to x,
a is the average number of activities based on employment status; it equals 2.5 for employed residents and 3 for non-employed.
$$a_{t+1,i} = b_j \Rightarrow \left[b_j \ne a_{t,i}\right] \cdot \left[k_i < Att_t\left(b_j\right)\right] = 1$$
where:
at,i is the current location of individual i,
atþ1, i is the next location of activity of individual i,
where:
ΣEj is the number of non occupied buildings within a 100 m buffer of building j,
ΣBj is the number of all buildings within a 100 m buffer of building j,
Dij is the distance of building j from the current location of individual i,
max Di is the distance of the building within the study area furthest away from
the current location of individual i,
LUj is the land-use of building j,
nonRes is non-residential use,
FSj is the floor-space volume of building j,
maxFS is the floor-space volume of the largest non-residential building within
the study area.
3. Choice of workplace location is calculated similarly to the choice of activity location:

$$WP_i = b_j \Rightarrow \left[LU_j = ELU_i\right] \cdot \left[k_i > \frac{D_{ij}/\max D_i + 1 - FS_j/\max FS}{2}\right] = 1$$
where:
WPi is the workplace location of individual i,
ELUi is the employment-sector-related land-use for individual i,
ki is a randomly drawn number between [0, 1] representing workplace location
preferences,
Dij is the distance between building j and individual i’s place of residence,
max Di is the distance of the building within the study area furthest away from
individual i’s place of residence.
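A minimal sketch of this satisficing rule (the building records and the scan order are assumptions; the first acceptable building wins, mirroring the formula above):

import random

def choose_workplace(agent_lu, k_i, home_xy, buildings, max_d, max_fs):
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    for b in buildings:              # b: dict with "xy", "land_use", "fs"
        if b["land_use"] != agent_lu:
            continue                 # land use must match employment sector
        threshold = (dist(home_xy, b["xy"]) / max_d + 1 - b["fs"] / max_fs) / 2
        if k_i > threshold:          # satisficing: accept the first match
            return b
    return None                      # no acceptable workplace found

blds = [{"xy": (1, 1), "land_use": "commercial", "fs": 5000},
        {"xy": (5, 5), "land_use": "commercial", "fs": 800}]
print(choose_workplace("commercial", random.random(), (0, 0), blds, 10.0, 5000.0))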
4. Building values and the monthly cost of living in a dwelling unit are derived in a
3-stage process. First, daily change in average house price per SA is calculated.
Then, values of individual buildings are derived and finally the price of the
single, average dwelling unit is calculated. For non-residential buildings, the
calculation of individual building values is similar.
$$AHP_{z,t+1} = AHP_{z,t} \cdot \left(1 + \log\left(\frac{pop_{z,t+1}/pop_{z,t} + res_{z,t}/res_{z,t+1} + nRes_{z,t+1}/nRes_{z,t}}{3}\right)\right)$$

$$ANRV_{z,t+1} = ANRV_{z,t} \cdot \left(1 + \log\left(\frac{nRes_{z,t}}{nRes_{z,t+1}}\right)\right)$$
where:
AHPz,t is average housing price per meter in SA z at time t,
popz,t is population in SA z at time t,
resz,t is the number of residential buildings in SA z at time t,
nResz,t is the number of non-residential buildings in SA z at time t,
ANRVz,t is the average non-residential value per meter in SA z at time t,
$$HP_{j,t} = AHP_{z,t} \cdot FS_j \cdot \frac{SL_{j,t}}{SL_{z,t}}$$

$$V_{j,t} = ANRV_{z,t} \cdot FS_j$$
where:
HPj,t is the house price of a dwelling unit in building j at time t,
SLs,t is the service level within area s at time t—the ratio of non-residential
buildings to residential buildings in this perimeter,
Vj,t is the non-residential value of building j.
$$P_{du,t} = \frac{\bar{I}_t \left(1 + \dfrac{HP_{j,t}/\Sigma Ap_j \;-\; \sum_{l=1}^{L_t} HP_{l,t} \Big/ \sum_{l=1}^{L_t} \Sigma Ap_l}{P\sigma_t}\right)}{c}$$
where:
Pdu,t is the monthly cost of living in dwelling unit du at time t,
Īt is the average household income in the study area at time t,
ΣAp is the number of dwelling units within a building. If the building is initially
of residential use, this is equal to its initial population size, otherwise it is the
floor-space volume of the building divided by 90 (assumed to be average
dwelling unit size in meters),
Lt is the number of residential buildings in the study area at time t,
Pσt is the standard deviation of dwelling unit prices within the study area at
time t,
c is a constant.
5. Land-use changes, from residential to commercial and from commercial to unoccupied, are based on the congruence between the building floor-space volume and the average intensity of traffic on roads within a 100 m radius over the preceding 30 days. Both values are compared with the (assumed) exponential distribution of all values in the study area. This is done by computing the logistic probability of the relative difference in their positions in the distribution:
$$P_{j,t}\!\left(\Delta x_{j,t}\right) = \frac{e^{\Delta x_{j,t}}}{1 + e^{\Delta x_{j,t}}}$$

$$\Delta x_{j,t} = \frac{zTR_{j,t} - zFS_{j,t}}{\left| zFS_{j,t} \right|}$$

$$zy_{j,t} = \frac{e^{-y_{j,t}/\bar{y}_t} - e^{-ymed_t/\bar{y}_t}}{\bar{y}_t}$$
where:
Pj,t is the probability of land-use change for building j at time t,
Δxj,t is the relative difference in position of traffic load and floor-space for
building j at time t,
zy_{j,t} is the position of value y in the exponential distribution, relative to the median, for building j at time t,
e^{-y_{j,t}/\bar{y}_t} is the exponential probability density value for y, with rate \hat{\lambda}_t = 1/\bar{y}_t, for building j at time t,
ymed_t is the median of y at time t.
If P > 0.99 for a residential use, it changes to commercial. If the value falls in the range [P(1) - 0.01, P(1)] for a commercial use, the building becomes unoccupied. This functional form and these criterion values reduce the sensitivity of large commercial uses and small residential uses to traffic volume. Consequently, the process of traffic-related land-use change is not biased by a tendency to inflate initial land uses. This rule, together with the earthquake impact formula below, is illustrated in a code sketch after this list.
6. Earthquake impact is calculated as follows:

$$Im_j = \frac{c \cdot 10^{mag}}{D_j \cdot \log\!\left(D_j\right) \cdot F_j}$$

where:
Im_j is the impact suffered by building j, c is a constant, mag is the earthquake magnitude (similar to the Richter scale), D_j is the distance of building j from the earthquake epicenter, and F_j is the number of floors in building j.
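The two rules above lend themselves to a compact illustration. The following Python sketch implements the logistic land-use change probability (rule 5) and the earthquake impact formula (rule 6) as reconstructed from the garbled originals; all function and variable names are ours, and the numbers in the example are invented, so this is a reading of the model rather than the authors' implementation.

import math

def z_position(y, y_mean, y_median):
    """Position of value y in the assumed exponential distribution, relative
    to the median (density taken as exp(-y / mean) with rate 1 / mean)."""
    return (math.exp(-y / y_mean) - math.exp(-y_median / y_mean)) / y_mean

def land_use_change_probability(z_traffic, z_floorspace):
    """Rule 5: logistic probability of the relative difference between the
    positions of traffic load and floor-space volume in the distribution."""
    dx = (z_traffic - z_floorspace) / abs(z_floorspace)
    return math.exp(dx) / (1.0 + math.exp(dx))

def earthquake_impact(mag, dist, floors, c=1.0):
    """Rule 6: impact grows exponentially with magnitude and decays with
    log-weighted distance; dist must exceed 1 for log(dist) to stay positive.
    c is a calibration constant."""
    return c * 10.0 ** mag / (dist * math.log(dist) * floors)

# A residential building switches to commercial only when P > 0.99.
p = land_use_change_probability(z_position(450.0, 300.0, 250.0),
                                z_position(1200.0, 900.0, 700.0))
becomes_commercial = p > 0.99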
willingness to pay for housing parameter is allowed. A unique value is drawn for each household from a normal distribution centered on the average expenditure on housing in Israel (23.4 % of total income); see http://www1.cbs.gov.il/reader/newhodaot/hodaa_template_eng.html?hodaa=201415290.
Each scenario is simulated 25 times, with extreme results discarded (two cases for the lu scenario and one for the hb scenario); all other parameters remain unchanged. To obtain high-resolution results we compute the most frequent land-use and the average residential value, non-residential value and vulnerability index for each building at discrete time points. For land-use we compute the share of times a different land-use is registered relative to the baseline scenario. For the other variables we compute the Median Absolute Percentage Error (MAPE). The results are presented in Fig. 14.
The results illustrate morphological stability, with the same land-use registered across scenarios for almost 90 % of the buildings at all time points. The other variables also exhibit parameter stability (greater in the lu scenario). The only sensitive variable is the average non-residential value. This is due to two changes exerting influence on the spatial distribution of commercial functions: the lu scenario constrains agglomeration tendencies, and the hb scenario affects residential, and hence traffic, patterns. As the commercial stock is small relative to the residential stock, these changes do not exert a strong morphological influence but register larger and more fluctuating differences than the other variables. While the general patterns indicate parameter stability, micro-level differences are still observed.
Estimation of Urban Transport Accessibility at the Spatial Resolution of an Individual Traveler
1 Introduction
relate accessibility to the notion of consumer surplus and net benefits to the users of
the transportation system (e.g. Ben-Akiva and Lerman 1979). (d) Measures of the
spatio-temporal fit—emphasize the range and frequency of the activities in which a
person takes part and whether it is possible to sequence them so that all can be
undertaken within given space-time constraints (e.g. Neutens et al. 2010; Miller
1999).
Accessibility is usually a relative measure. The growing interest in the interdependence between sustainable development and mobility has emphasized the importance of properly estimating various dimensions of public transport accessibility relative to private car or bicycle accessibility (Kaplan et al. 2014; Tribby and Zanderbergen 2012; Martin et al. 2008; O'Sullivan et al. 2000). Since the disparity of accessibility between cars and public transport provides important information about the degree of car dependence in urban areas, the relative accessibility of public transport versus car has recently been analyzed in many urban regions (Benenson et al. 2011; Ferguson et al. 2013; Grengs et al. 2010; Mao and Nekorchuk 2013; Mavoa et al. 2012; Blumenberg and Ong 2001; Hess 2005; Martin et al. 2008; Kawabata 2009; Salonen and Toivenen 2013). The importance of these studies is straightforward: in the majority of them, public transport accessibility is lower, often significantly, than that of the private car.
Although many kinds of measures have been developed in the literature, little guidance is provided on how to choose or apply them in policy assessments. More importantly, a key gap exists between how accessibility is primarily addressed in the literature, as a physical-financial construct that can be improved by proper design or costing, and how human travelers actually conceive of it in their day-to-day experiences. This point was well summarized by Kwan (1999: 210): "the accessibility experience of individuals in their everyday lives is much more complex than that which can be measured with conventional measures of accessibility".
A main hurdle in measuring or modeling accessibility is fitting the spatial resolution of the analysis to the scale at which real human travelers make travel decisions. An adequate view of accessibility demands analysis at the spatial resolution of human activity: moving from one building as an origin to a destination in another building by navigating the transportation network, comprised of different modes, lines and stops.
This human viewpoint is difficult to model due to the big data and heavy computation requirements it entails. Consequently, accessibility has mainly been evaluated at the coarse scale of municipalities (Ivan et al. 2013); counties (Karner and Niemeier 2013); transport analysis zones (Black and Conroy 1977; Bhandari et al. 2009; Ferguson et al. 2013; Foth et al. 2013; Rashidi and Mohammadian 2011; Haas et al. 2008; Burkey 2012; Lao and Liu 2009; Grengs et al. 2010; Kaplan et al. 2014); or neighborhoods (Witten et al. 2011). Aggregate analysis assumes that the centers of the zones (centroids) are the origin and destination points, and eventually results in a discontinuity when evaluating two adjacent zones. Few studies have attempted to model accessibility at the parcel level, and more often disaggregate data is eventually aggregated for analysis (Mavoa et al. 2012; Tribby and Zanderbergen 2012; Welch and Mishra 2013; Salonen and Toivenen 2013). The exception is Owen and Levinson (2015), who measure accessibility by public transport at the resolution of zipcodes. While aggregate estimates are often sufficient for car-based accessibility, where travel between traffic zones takes time comparable to the imprecision of the speed estimate, they tend to either over- or under-estimate public transport accessibility, where walking or waiting time can be critical for personal mode choice. Moreover, important components of public transport accessibility, such as walking times to embarkation stops, to destinations, and between stops when changing lines, as well as waiting times at transfers, are usually not considered explicitly in calculations of total travel time between aggregate units.
While existing tools used in transportation practice (such as TRANSCAD or EMME) are able to calculate detailed public transport-based accessibility, they were not designed for this purpose. High-resolution calculation of accessibility with these and similar general-purpose transport analysis tools is extremely time-consuming, and it is thus prohibitively expensive to generate accurate estimates of public transport accessibility for an entire metropolitan area.
While conceptually evident, spatially explicit high-resolution measurement of accessibility raises severe computational problems. A typical metropolitan area with a population of several million contains 0.5–1 million buildings and demands the processing of hundreds of thousands of origins and destinations. Trips themselves cover tens of thousands of street segments and hundreds of public transport lines of different kinds (Benenson et al. 2010, 2011). Thus, computing accessibility at the resolution of a human traveler involves processing huge volumes of raw data. For example, the metropolitan area of Tel Aviv has a population of 2.5 million, ca. 300,000 buildings and over 300 bus lines. Until recently, attempting such an endeavor seemed impossible, and there was no option but to aggregate. However, recent developments in graph databases and algorithms (Buerli 2012) offer a solution to these problems.
The aim of this paper is twofold. First, we describe a new GIS-based computer application that performs fast accessibility calculations at the resolution of individual buildings; it exploits a high-resolution urban GIS and provides precise estimates, in space and in time, of public transport-based accessibility. We are thus able to assess the transportation system at every location and for every hour of the day. Our application implements the ideas proposed by Benenson et al. (2011) and their recent development (Benenson et al. 2014), running on an SQL server engine. Second, we apply the application to a real case study involving the implementation of a new Light Rail Transit (LRT) line in the Metropolitan Area of Tel Aviv. We use this case study to illustrate the importance of high-resolution accessibility estimates for obtaining unbiased estimates of the relative accessibility of public transport versus car in two scenarios, with and without the proposed LRT line.
The rest of the paper is organized as follows. Section 2 presents the operational approach we applied to measure accessibility. Section 3 presents the computer application. Section 4 presents the case study analysis. Section 5 concludes and suggests future research directions.
Accessibility is a function of all three components of an urban system: (a) the land use pattern that defines the distribution of people's activities; (b) the residential distribution and socio-economic characteristics of people; and (c) the transportation system, with its road configuration, modes of travel, and the time, cost and impedance of travel from and to any place within the metropolitan area. All these define the satisfaction or benefits that individuals obtain from traveling from their homes to different destinations. Ideally, the urban system evolves and adapts, i.e., a fit between the land use and transportation components on the one hand, and individual needs on the other, is maintained. However, even in an ideal setting, different components of the urban system commonly have different rates of change and adaptation. It takes years and sometimes decades for street configurations, land-use and buildings to change. Therefore, for this research we can safely assume they are constant. Conversely, residential and activity patterns, which are related to people, change and adapt quickly to changes in the other subsystems. Consequently, we can safely assume that accessibility is governed by the land-use and transportation subsystems, and measure it as representing people's potential mobility. Moreover, we consider the level of service of the transportation system to be more sensitive to policy changes and investment in projects than the land-use system, and concentrate on evaluating the impacts of changes in this subsystem on people's potential accessibility to a fixed set of destinations. At this stage, we avoid entering the quagmire of predicting people's actual spatiotemporal movements.
Our measure of accessibility is itself relational. We measure the accumulated value of a selected characteristic (e.g., jobs, commercial area) available to a person at destinations within a given timeframe (between 30 min and an hour), using a chosen transportation mode, usually comparing private car and public transport.1 We then compare these two values and estimate a relative measure of accessibility.
The new application implements these measures of accessibility at a high spatial
resolution approximating a single building. These measures are based on a precise
estimate of the travel time between (O)rigin and (D)estination and are defined for a
given transportation (M)ode. For example, for (B)us and private (C)ar:
– Bus travel time (BTT):
BTT = Walk time from origin to a stop of Bus #1 + Waiting time for Bus #1 + Travel time of Bus #1 + [Transfer walk time to Bus #2 + Waiting time for Bus #2 + Travel time of Bus #2] + [Transfer components related to additional buses] + Walk time from the final stop to the destination (square brackets denote optional components).
1 There is no major restriction to accounting in the future for bicycle and pedestrian movements.
Given the destination D, the Bus to Car (B/C) Service area ratio is defined analogously, as the ratio of the number of origins within the bus service area, BSA_D(τ), to the number within the car service area, CSA_D(τ).2
2 Benenson et al. (2011) implemented quite similar accessibility measures at a resolution of Traffic Analysis Zones (TAZ).
destinations beyond the region boundary will be excluded from the calculations for
all modes. This is especially important for the case of high MTT for one or both
modes.
Equations (1) and (2) can easily be specified for any particular type (k) of destinations D_k or origins O_k and, further, weighted to include the destinations' and origins' capacities, D_{k,Capacity} and O_{k,Capacity}. Examples of capacity are the number of jobs at high-tech enterprises as destinations, low-wage jobs at industrial buildings as destinations, or affordable dwellings as origins, with capacity defined as the number of dwellings, and so forth.
Equations (1) and (2) can then be generalized to include the ratio of the sums of capacities of the destinations that can be accessed during time τ by Bus and Car:

$$AA_{O,k}(\tau) = \sum_{D_k \in BAA_O(\tau)} D_{k,Capacity} \Big/ \sum_{D_k \in CAA_O(\tau)} D_{k,Capacity} \qquad (3)$$

$$SA_{D,k}(\tau) = \sum_{O_k \in BSA_D(\tau)} O_{k,Capacity} \Big/ \sum_{O_k \in CSA_D(\tau)} O_{k,Capacity} \qquad (4)$$
The numerator of each fraction in (3) and (4) is the overall capacity of the access/service area estimated for the bus mode, while the denominator is the overall capacity of the access/service area estimated for the car mode.
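As a minimal illustration of Eq. (3), the sketch below computes the capacity-weighted bus-to-car ratio for one origin, assuming the sets of destinations reachable within τ by each mode have already been produced by the shortest-path computation described in the next section; all names and numbers are hypothetical.

def access_area_ratio(bus_reachable, car_reachable, capacity):
    """Eq. (3): total capacity (e.g., jobs) of destinations inside the bus
    access area divided by the total capacity inside the car access area."""
    bus_total = sum(capacity[d] for d in bus_reachable)
    car_total = sum(capacity[d] for d in car_reachable)
    return bus_total / car_total if car_total else float("nan")

# Example: 1200 jobs reachable by bus vs. 5400 by car gives a ratio of ~0.22.
jobs = {"d1": 700, "d2": 500, "d3": 4200}
ratio = access_area_ratio({"d1", "d2"}, {"d1", "d2", "d3"}, jobs)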
3 Computer Application
To enable the use of graph theory in our calculations, we translate the road (R) and public transport (PT) networks into directed graphs. The road network is translated into a graph (RGraph) in a standard fashion: junctions become nodes, street sections become links (two-way segments are translated into two links), and travel time becomes the link impedance. The graph of the public transport network (PTGraph) depends on the timetable and is thus constructed for a given time interval.
Given the PT lines, stops and timetable, a node N of the public transport network graph is defined by the quadruple

N = <PT_LINE_ID, TERMINAL_DEPARTURE_TIME, STOP_ID, STOP_ARRIVAL_TIME>

i.e., by a line, a specific departure of that line, a stop, and the arrival time of that departure at the stop.
The impedance of a link between a B-node and a node N of the PT-graph is the walking time between B and N plus the waiting time for the arriving line. The impedance of a link between a node N of the PT-graph and a B-node is just the walking time from the corresponding stop to the building.
Figure 1 presents a description of the translation of a typical bus trip from origin
to destination into a sequence of connected links.
Representing the PT network as a graph enables the application of standard shortest-path algorithms (Dijkstra 1959) and, thus, estimation of the service and access areas for a given building. However, this cannot be done with the help of standard GIS software, such as ArcGIS, which starts from roads and junctions as GIS layers. Moreover, calculation of the service/access areas for all urban buildings, which number in the hundreds of thousands, is an extremely time-consuming operation. For comparison, calculating car accessibility for the entire Tel Aviv Metropolitan Area, organized with the help of the ArcGIS ModelBuilder (Allen 2011) and employing the standard ArcGIS Network Analyst procedure, took us several days.
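To make the graph computation concrete, here is a minimal Dijkstra-style scan with an early cut-off at the time budget τ. The encoding follows the construction described above (nodes are buildings or timetable quadruples; link impedances are walk, wait, or ride minutes), but the data structures and names are our own simplification, not the application's actual code.

import heapq
from itertools import count

def reachable_within(graph, origin, tau):
    """Return {node: minimal travel time} for every node reachable from
    `origin` within tau minutes; `graph` maps a node to a list of
    (neighbor, minutes) pairs with non-negative impedances."""
    best = {origin: 0.0}
    tie = count()  # tie-breaker so heterogeneous node types never compare
    heap = [(0.0, next(tie), origin)]
    while heap:
        t, _, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue  # stale entry superseded by a shorter path
        for neighbor, minutes in graph.get(node, ()):
            t_new = t + minutes
            if t_new <= tau and t_new < best.get(neighbor, float("inf")):
                best[neighbor] = t_new
                heapq.heappush(heap, (t_new, next(tie), neighbor))
    return best

Run from a building node, the keys of the result that are themselves buildings constitute the access area for the time budget τ.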
Our goal is to investigate and compare the travel opportunities supplied by public transport. That is why the car accessibility calculation is done only once and subsequently used as the denominator in all the rest of the calculations of relative accessibility.
The use of a Relational Database Management System (RDBMS) for calculating PT accessibility is based on the observation that the degree of most quadruple nodes within the PTGraph is exactly 2. That is, most quadruple nodes of the PTGraph are connected, by just two (directed) links, to the previous and the next stops of the same PT vehicle traversing the same line.
Let us, for convenience, use the identifiers (ID) of bus lines and stops as their names. Formally, a quadruple node

N1 = <PT_LINE_ID, TERMINAL_DEPARTURE_TIME, STOP1_ID, STOP1_ARRIVAL_TIME>

of the PTGraph is connected, by the bus line PT_LINE_ID, to two nodes only:

N0 = <PT_LINE_ID, TERMINAL_DEPARTURE_TIME, STOP0_ID, STOP0_ARRIVAL_TIME>, and
N2 = <PT_LINE_ID, TERMINAL_DEPARTURE_TIME, STOP2_ID, STOP2_ARRIVAL_TIME>,

where STOP0_ID is the stop previous to STOP1_ID on PT_LINE_ID for the bus starting its travel at TERMINAL_DEPARTURE_TIME, and STOP2_ID is the stop next to STOP1_ID on PT_LINE_ID for this bus.
Conversely, the degree of a quadruple node at which transfers take place is higher than 2: in addition to being connected to the quadruples denoting the previous and next stops of the line, such a node is connected to the quadruples that can be reached by foot, as defined in Sect. 3.1.
In what follows, we use BuildingID for the identifier of the building B, and
NodeID for the identifier of the quadruple node N. Full RDBMS representation of
the PT graph consists of four tables. We present them below as TableName
(Meaning of field1, Meaning of field2, etc.):
– Building-PTNode (BuildingID, trip start quadruple NodeID, walk time between building B and the stop of quadruple N + waiting time at the stop of quadruple N);
– DirectPTTravel (quadruple NodeID, ID of the quadruple N that can be directly reached from NodeID by the line of quadruple NodeID, travel time between the stop of quadruple NodeID and the stop of quadruple N);
– Transfer (NodeID of the transfer start quadruple N1, NodeID of the transfer end quadruple N2, walk time between the stops of N1 and N2 + waiting time at the stop of N2 for the line of quadruple N2);
– PTNode-Building (trip final quadruple NodeID, BuildingID, walk time between the stop of quadruple N and building B).
The idea of our approach is expressed by the one-to-many DirectPTTravel table: it presents all possible direct trips from a certain stop with a certain line. Standard SQL queries that join this and the other tables are sufficient for estimating access or service areas.
To illustrate the idea, we build the occurrences of the DirectPTTravel and Transfer tables for the example presented in Fig. 1 (the number in a circle is used as a quadruple's ID):
DirectPTTravel
StartNode_ID ArrivalNode_ID TravelTime(mins)
1 1 0
1 2 4
1 3 6
1 4 8
1 5 11
1 6 12
1 7 15
1 8 18
2 2 0
2 3 2
2 4 4
2 5 7
... ... ...
11 11 0
11 12 6
... ... ...
12 12 0
12 13 6
... ... ...
Transfer
StartNode_ID ArrivalNode_ID TransferTime(mins)
4 15 5
... ... ...
The result above provides the trips between buildings and the final stops of a trip, and a further join with the PTNode-Building table finally provides the trips between buildings. Most of the computation time is spent on the two last queries, which include a GroupBy clause, as we need the minimal travel time between two buildings.
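In Python terms, the direct (no-transfer) part of that join chain looks roughly as follows; we use pandas instead of T-SQL and our own column names, so treat it as a paraphrase of the queries rather than the production code. The transfer case adds one more join through the Transfer table before the final grouping.

import pandas as pd

def building_to_building_times(building_ptnode, direct_pt_travel, ptnode_building):
    """Join chain building -> boarding quadruple -> alighting quadruple ->
    building, keeping the minimal total time per origin-destination pair
    (the GroupBy step mentioned above). Assumed columns:
      building_ptnode:  BuildingID, NodeID, AccessTime   (walk + wait)
      direct_pt_travel: StartNode_ID, ArrivalNode_ID, TravelTime
      ptnode_building:  NodeID, BuildingID, EgressTime   (walk)"""
    trips = (building_ptnode
             .merge(direct_pt_travel, left_on="NodeID", right_on="StartNode_ID")
             .merge(ptnode_building, left_on="ArrivalNode_ID", right_on="NodeID",
                    suffixes=("_board", "_alight")))
    trips["TotalTime"] = (trips["AccessTime"] + trips["TravelTime"]
                          + trips["EgressTime"])
    return (trips.groupby(["BuildingID_board", "BuildingID_alight"])["TotalTime"]
                 .min().reset_index())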
The output table contains three major fields: ID of the origin building, ID of the destination building, and total travel time. For further analysis, we also store the components of the trip time: walk times at the beginning and end of a trip and at transfers if a trip includes them, bus travel time(s), and waiting times at the beginning of a trip and at transfers. Based on the output table we construct a table that contains the number of buildings, total population and total number of jobs that can be accessed within a given trip time, usually 15, 30, 45 and 60 min. This table is further joined with the layer of buildings, thus enabling the map presentation of accessibility. The output of the car accessibility computations is organized in the same way and joined with the layer of buildings as well. Relative accessibility, the ratio of PT-based to car-based accessibility, is calculated within ArcGIS. These maps of relative and absolute accessibility are those presented in Sect. 4.
It is important to note that all four tables are built based on the exact timetables of all lines and change whenever the timetable changes. For the case of ~300 bus lines and several thousand stops in the Tel Aviv Metropolitan Area, the number of rows in the tables, for bus trips shorter than 1 h, is several million, well within the abilities of a modern SQL RDBMS. All our calculations were thus performed with the free version of Microsoft SQL Server. The total computation time for the full PT accessibility map of the Tel Aviv metropolitan area is 2 h.
– Layer of origins and destinations with the use and capacity given. Typically, this
is a layer of buildings, while parks and leisure locations may comprise other
layers.
The input is defined by the travel start/arrival time and additional parameters as presented in Table 1. The output is a new layer detailing accessibility indices, in which every origin O or destination D is characterised by the values of the index AA_{O,k}(τ) or SA_{D,k}(τ) for the given range of τ, e.g. maps for τ = 15, 30, 45 and 60 min of travel. These results are captured in shape files that are easily exported to any GIS software.
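For the packaging step just described, a short geopandas sketch along the following lines would join per-building indices to the buildings layer and write the shapefile; the paths and column names are hypothetical.

import geopandas as gpd

def export_accessibility_layer(buildings_path, indices, out_path):
    """Join per-building accessibility indices (e.g., AA for tau = 15, 30,
    45, 60 min) to the buildings polygon layer and export a shapefile
    readable by any GIS software; `indices` is keyed by BuildingID."""
    buildings = gpd.read_file(buildings_path)           # polygon layer
    layer = buildings.merge(indices, on="BuildingID", how="left")
    layer.to_file(out_path)                             # ESRI shapefile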
Figure 3 shows the access areas to jobs by car (15–45 min) and by bus (45 min) for the morning peak hour. It is clear that the car access area in the metropolitan area is substantially larger than the bus access area, which is limited to the urban core for a 45-min trip. In 45 min, any area in the metropolitan region is accessible by car, whereas only areas in or near the core are accessible by bus.

Fig. 2 Tel Aviv metropolitan area (a) and its core (b) at a resolution of 60 × 60 m cells, which corresponds to the typical distance between the centers of two building footprints
Figure 4 illustrates why we consider high resolution an important aspect of accessibility estimation. In the first row is the accessibility index calculated at the TAZ level in Benenson et al. (2011); in the second is the relative accessibility calculated at the resolution of buildings and then aggregated to transportation analysis zones (TAZ). Both maps were calculated based on the numbers of jobs obtained from the metropolitan mass transit planning authority at the resolution of statistical areas (approximating TAZ). For the high-resolution calculation, the number of jobs in each statistical area was uniformly distributed over the area.
Despite very close averages over the entire metropolitan area (0.356 for the calculation at the resolution of TAZ and 0.336 for the calculation at the resolution of buildings with further aggregation), it is easy to verify that the two levels of resolution produce contradictory results. On the right, the map built from the high-resolution results presents seamless changes in accessibility between neighboring zones, with higher accessibility in the center compared to lower in the suburbs. Conversely, on the left, the low resolution results in the familiar patchwork discontinuity in accessibility levels, with suburban areas often having higher accessibility relative to the center.
Fig. 3 Access areas by car and bus starting from the Tel Aviv University terminal, marked by the orange circle; metropolitan built area (a), zoom on the city core (b)
Figure 5 presents the relative accessibility levels obtained when the Red LRT line is running. The assumption is that the LRT operates in parallel to the existing road and bus networks; we did not have the information needed to update these to a future configuration at this stage of the analysis.
Figure 6 shows the absolute number of additional accessible jobs due to the introduction of the Red LRT line. The configuration of the LRT is based on an average peak-hour frequency of 5 min in both directions.
As one can see, most of the benefit is concentrated around the LRT corridor, clearly visible in darker colors, and the change is most noticeable for short trips. For longer trips, the impact of the LRT is less evident. That is, buildings close to the LRT line enjoy an improvement in accessibility to all other areas even for short trips. However, with a bus network that is not adjusted to the LRT, this benefit dissipates as journey time and distance from the LRT line increase. It should also be noted that since the analysis is based on access areas, the main LRT function, which is to provide accessibility to job destinations in the core area (i.e. the service area), is not visible in these maps. Naturally, at the morning peak hour most of the traffic on the LRT is bound for the core and not the other way. A complete analysis of the LRT benefit would also require looking at the complementary service-area view of accessibility. Moreover, to be effective the LRT introduction should be supported
Fig. 4 Access Area Ratio AAO(60) for a 60-min trip starting at 7:00 AM, with zero or one transfer between public transport lines. Directly calculated at the resolution of TAZ for the entire metropolitan area (a) and the metropolitan center (c), from Benenson et al. (2011); calculated at the resolution of buildings and aggregated by TAZ for the entire metropolitan area (b) and the metropolitan center (d)
Fig. 5 Access Area AAO(τ) with “Red” LRT line (30–60 min, starting 7:00 AM, 0 or 1 transfer)
Fig. 6 Number of accessible jobs with LRT (30–60 min. trip, starting 7:00 AM, 1 transfer)
by essential planned changes in the bus network. Otherwise, the higher speed of the LRT will not be sufficient to compensate for the transfer time between bus and rail, especially for longer trips.
3 The interested reader who wishes to make use of the application is invited to contact the corresponding author. The application can be used for scientific purposes without restrictions, provided that data is given in the requested format and bearing in mind the costs of cluster operation time.
Acknowledgments Parts of this research were funded by the Chief Scientist Office of the Israeli
Ministry of Transport. The second author kindly thanks the Returning Scientist fellowship (2014)
of the Israeli Ministry of Immigration and Absorption and support of Tel Aviv University. Part of
this work is based on the Master thesis of Amit Rosenthal supervised by the corresponding author.
The authors would like to thank two anonymous reviewers for constructive and helpful comments
on the draft version.
References
Alam BM, Thompson GL, Brown JR (2010) Estimating transit accessibility with an alternative
method. Transp Res Rec J Transp Res Board 2144(1):62–71
Allen DW (2011) Getting to know ArcGIS ModelBuilder. ESRI Press, 336 p
Ben-Akiva M, Lerman SR (1979) Disaggregate travel and mobility choice models and measures of
accessibility. In: Hensher DA, Stopher PR (eds) Behavioural travel modelling. Croom Helm,
London, pp 654–679
Benenson I, Martens K, Rofé Y, Kwartler A (2010) Measuring the gap between car and transit
accessibility estimating access using a high-resolution transit network geographic information
system. Transp Res Rec J Transp Res Board N2144:28–35
Benenson I, Martens K, Rofé Y, Kwartler A (2011) Public transport versus private car: GIS-based
estimation of accessibility applied to the Tel Aviv metropolitan area. Ann Reg Sci 47:499–515
Benenson I, Geyzersky D, et al (2014) Transport accessibility from a human point of view. In: Key
presentation at the geoinformatics for intelligent transportation conference, Ostrava, 27 Jan
2014
Bertolini L, le Clercq F, Kapoen L (2005) Sustainable accessibility: a conceptual framework to
integrate transport and land use plan-making. Two test-applications in the Netherlands and a
reflection on the way forward. Transp Policy 12(3):207–220
Bhandari K, Kato H, Hayashi Y (2009) Economic and equity evaluation of Delhi Metro. Int J
Urban Sci 13(2):187–203
Black J, Conroy M (1977) Accessibility measures and the social evaluation of urban structure.
Environ Plan A 9(9):1013–1031
Blumenberg EA, Ong P (2001) Cars, buses, and jobs: welfare participants and employment access
in Los Angeles. Transp Res Rec J Transp Res Board 1756:22–31
Bristow G, Farrington J, Shaw J, Richardson T (2009) Developing an evaluation framework for
crosscutting policy goals: the Accessibility Policy Assessment Tool. Environ Plan A 41(1):48
Liu S, Zhu X (2004) Accessibility analyst: an integrated GIS tool for accessibility analysis in urban
transportation planning. Environ Plan B Plan Des 31(1):105–124. doi:10.1068/b305
Lucas K (2012) Transport and social exclusion: where are we now? Transp Policy 20:105–113
Mao L, Nekorchuk D (2013) Measuring spatial accessibility to healthcare for populations with
multiple transportation modes. Health Place 24:115–122
Martens K (2012) Justice in transport as justice in access: applying Walzer’s ‘Spheres of Justice’ to
the transport sector. Transportation 39(6):1035–1053, Online 21 February 2012
Martin D, Jordan H, Roderick P (2008) Taking the bus: incorporating public transport timetable
data into health care accessibility modelling. Environ Plan A 40(10):2510
Mavoa S, Witten K, McCreanor T, O’Sullivan D (2012) GIS based destination accessibility via
public transit and walking in Auckland, New Zealand. J Transp Geogr 20(1):15–22
Miller HJ (1999) Measuring space-time accessibility benefits within transportation networks: basic
theory and computational procedures. Geogr Anal 31:187–212
Minocha I, Sriraj PS, Metaxatos P, Thakuriah PV (2008) Analysis of transport quality of service
and employment accessibility for the Greater Chicago, Illinois, Region. Transp Res Rec J
Transp Res Board 2042(1):20–29
Neutens T et al (2010) Equity of urban service delivery: a comparison of different accessibility
measures. Environ Plan A 42(7):1613
O’Sullivan D, Morrison A, Shearer J (2000) Using desktop GIS for the investigation of accessi-
bility by public transport: an isochrone approach. Int J Geogr Inf Sci 14(1):85–104
Owen A, Levinson DM (2015) Modeling the commute mode share of transit using continuous
accessibility to jobs. Transp Res Part A Policy Pract 74:110–122
Rashidi TH, Mohammadian AK (2011) A dynamic hazard-based system of equations of vehicle
ownership with endogenous long-term decision factors incorporating group decision making. J
Transp Geogr 19(6):1072–1080
Salonen M, Toivenen T (2013) Modelling travel time in urban networks: comparable measures for
private car and public transport. J Transp Geogr 31:143–153
Tribby CP, Zanderbergen PA (2012) High-resolution spatio-temporal modeling of public transit
accessibility. Appl Geogr 34:345–355
Welch TF, Mishra S (2013) A measure of equity for public transit connectivity. J Transp Geogr
33:29–41
Witten K, Exeter D, Field A (2003) The quality of urban environments: mapping variation in access to community resources. Urban Stud 40(1):161–177
Witten K, Pearce J, Day P (2011) Neighbourhood Destination Accessibility Index: a GIS tool for
measuring infrastructure support for neighbourhood physical activity. Environ Plan Part A 43
(1):205
Modeling Taxi Demand and Supply in New York City Using Large-Scale Taxi GPS Data
Abstract Data from taxicabs equipped with Global Positioning Systems (GPS) are
collected by many transportation agencies, including the Taxi and Limousine
Commission in New York City. The raw data sets are too large and complex to
analyze directly with many conventional tools, but when the big data are appropri-
ately processed and integrated with Geographic Information Systems (GIS), sophis-
ticated demand models and visualizations of vehicle movements can be developed.
These models are useful for providing insights about the nature of travel demand as
well as the performance of the street network and the fleet of vehicles that use
it. This paper demonstrates how big data collected from GPS in taxicabs can be
used to model taxi demand and supply, using 10 months of taxi trip records from
New York City. The resulting count models are used to identify locations and times
of day when there is a mismatch between the availability of taxicabs and the
demand for taxi service in the city. The findings are useful for making decisions
about how to regulate and manage the fleet of taxicabs and other transportation
systems in New York City.
Keywords Big data • Taxi demand modeling • Taxi GPS data • Transit
accessibility • Count regression model
1 Introduction
Spatially referenced big data provides opportunities to obtain new and useful
insights on transportation markets in large urban areas. One such source is the set
of trip records that are collected and logged using in-vehicle Global Positioning
Systems (GPS) in taxicab fleets. In large cities, tens of thousands of records are
collected every day, amounting to data about millions of trips per year. The raw
data sets are too large to analyze with conventional tools, and the insights that are
gained from looking at descriptive statistics or visualizations of individual vehicle
trajectories are limited. A great opportunity exists to improve our understanding of
transportation in cities and the specific role of the taxicab market within the
transportation system by processing and integrating the data with a Geographic
Information System (GIS). Moving beyond simple descriptions and categorizations
of the taxi trip data, the development of sophisticated models and visualizations of
vehicle movements and demand patterns can provide insights about the nature of
urban travel demand, the performance of the street network, and operation of the
taxicab fleet that uses it.
Taxicabs are an important mode of public transportation in many urban areas,
providing service in the form of a personalized curb-to-curb trip. At times, taxicabs
compete with public transit systems including bus, light rail, subway, and com-
muter trains. At other times, taxis complement transit by carrying passengers from a
transit station to their final destination—serving the so-called “last mile.” In the
United States, the largest fleet of taxis is operated in New York City (NYC), where
yellow medallion taxicabs generated approximately $1.8 billion revenue carrying
240 million passengers in 2005 (Schaller 2006). All taxicabs in NYC are regulated
by the Taxi and Limousine Commission (TLC), which issues medallions and sets
the fare structure. As of 2014, there are 13,437 medallions for licensed taxicabs in
NYC (Bloomberg and Yassky 2014), which provide service within the five bor-
oughs but focus primarily on serving demand in Manhattan and at the city’s
airports; John F. Kennedy International Airport and LaGuardia Airport. Since
2013, a fleet of Street Hail Livery vehicles, known as green cabs, have been issued
medallions to serve street hails in northern Manhattan and the Outer Boroughs, not
including the airports.
The TLC requires that all yellow taxicabs be equipped with GPS through the Taxicab Passenger Enhancements Project (TPEP), which records trip data, collects information on fare and payment type, and communicates a trace of the route being traveled to passengers via a backseat screen. This paper makes use of a detailed set of data that includes records for all 147 million trips served by taxicabs in NYC in the 10-month period from February 1, 2010 to November 28, 2010. Each record includes the date, time, and location of the pick-up and drop-off as well as information about payment, the driver, medallion, and shift. This dataset provides a rich source of information for analyzing the variation of taxi pick-ups and drop-offs across space and time.
In order to effectively plan and manage the fleet of taxicabs, it is necessary to
understand what factors drive demand for taxi service, how the use of taxicabs
relates to the availability of public transit, and how these patterns vary across
different locations in the city and at different times of day. A trip generation
model that relates taxi demand to observable characteristics of a neighborhood
(e.g., demographics, employment, and transit accessibility) is developed with high
temporal and spatial resolution. This paper demonstrates how GPS data from a
large set of taxicab data can be used to model demand and supply and how these
models can be used to identify locations and times of day when there is a mismatch
between the availability of taxicabs and the demand for taxi service. The models are
useful for making decisions about how to manage the transportation systems,
including the fleet of taxicabs themselves.
Recent work has been done to identify the factors that influence demand for
taxicabs within each census tract in NYC based on observable characteristics of
each census tract (Yang and Gonzales 2014). Separate models were developed
to estimate the number of taxicab pick-ups and drop-offs within each census tract
during each hour of the day, and six important explanatory variables were identi-
fied: population, education (percent of population with at least a bachelor’s degree),
median age, median income per capita, employment by industry sector, and transit
accessibility. Yang and Gonzales (2014) specifically developed a technique to
measure and map transit accessibility based on the time that it takes to walk to a
transit station and wait to board the next departing vehicle. By modeling taxi
demand based on spatially specific information about population characteristics,
economic activities as indicated by employment, and the availability of public
transit services, the models showed how the influence of various relevant factors
changes over different times during the day.
This paper builds on existing research by introducing a novel method to quantify
the supply of available taxicabs in a neighborhood based on where passengers are
dropped off and the vehicle becoming available for hire. Although the total supply
of taxicabs is itself of interest to policymakers and regulators, the spatial distribu-
tion of this supply has a big effect on where customers are able to hail taxicabs on
the street and how long they can expect to wait for an available vehicle. Thus,
accounting for the supply of taxis in models of the number of taxicab pick-ups
provides additional insights about where taxi demand is being served and where
there may be latent or underserved demand for taxicab services. The models that are
developed in this paper present additional improvements over previous models by
explicitly acknowledging that the number of taxi pick-ups in a census tract is a
count process and should be modeled with a count distribution such as a Poisson or
negative binomial regression. Both the inclusion of an independent variable for
taxicab supply and the use of a count data regression yield detailed models that
provide improved insights to the factors that drive taxi demand and affect taxi
supply. Furthermore, the visualizations of the modeled results provide greater
insights than common techniques that merely plot raw data or show simple aggre-
gations. By developing sophisticated models of supply and demand using the
extensive set of data of NYC taxicabs, the underlying patterns in the data reveal
how the mode is used and how it may be managed to serve the city better.
2 Literature Review
There are a number of studies of taxicabs in the literature from the fields of policy
and economics. Earlier theoretical models developed for taxicab demand are mainly economic models of the taxicab market (Orr 1969). Although classic economic theory states that demand and supply will reach equilibrium in a free market, most taxi markets are not actually free, and the roles of regulations that either support or constrain the taxicab industry need to be considered. Furthermore,
it has been argued that the price generated by “competitive equilibrium” may be
insufficient to cover the social welfare costs (Douglas 1972) or too high to fully
realize certain social benefits (Arnott 1996). Based on a study of taxicabs in
London, England, Beesley (1973) argued that five contributing factors account
for the number of taxis per head: (1) independent regulations, (2) the proportion
of tourists, (3) income per capita (especially in the center of London), (4) a highly
developed radially-oriented railway system, and (5) car ownership. Although these
classic papers build a theoretical foundation for modeling taxicab demand, they are
based only on aggregated citywide data, such as the medallion price by year
(Schreiber 1975), occurrence of taxicab monopolies by city (Eckert 1973; Frankena
and Pautler 1984), and the total number of taxicabs by city (Gaunt 1995).
More recently, attention has been directed toward identifying the factors that
influence the generation of taxicab demand. Schaller (1999) developed an empirical
time series regression model of NYC to understand the relationship between
taxicab revenue per mile and economic activity in the city (measured by employ-
ment at eating and drinking places), taxi supply, taxi fare, and bus fare. However,
Schaller’s (1999) model is not spatially specific, and is based only on the evolution
of citywide totals and averages over time. Other studies compare the supply of taxis
in different cities in order to investigate the relationships between taxi demand and
factors such as city size, the availability and cost of privately owned cars, the cost of
taxi usage, population, and presence of competing modes (Schaller 2005; Maa
2005). These studies provide comparisons across different locations, but they do
not account for changes with time.
There have been many technology developments that are beneficial to modeling
taxicab demand. Examples include in-vehicle Global Positioning Systems (GPS)
implanted in modern taxicabs and analytical tools like Geographic Information
Systems (GIS), which facilitate analysis of spatially referenced data (Girardin and
Blat 2010; Balan et al. 2011; Bai et al. 2013). As a result, a massive amount of
detailed data is recorded automatically for trips served by modern taxicab fleets,
such as pick-up locations, drop-off locations, and in some cities a complete track of
the route connecting the two (Liang et al. 2013). These large-scale taxicab data make it possible to build empirical models to understand how taxi trips are generated and distributed across space and time, and how they compete with other transportation modes.
The potential for extracting useful information about taxicab demand and the role of taxis in the broader transportation system has just begun to be tapped.
3 Data
The database consists of complete information for all 147 million taxicab trips
made in NYC between February 1, 2010 and November 28, 2010. Between 5.5 and
5.8 million taxi trips are made each day in New York City. Each record includes
information about when and where a trip was made, the distance traveled, and the
fare paid. Specifically, the dataset includes the following fields for each record:
1. Taxi Medallion Number, Shift Number, Trip Number, and Driver Name;
2. Pickup Location (latitude and longitude), Date, and Time;
3. Drop-off Location (latitude and longitude), Date, and Time;
4. Distance Travelled from Pickup to Drop-Off;
5. Number of Passengers;
6. Total Fare Paid, including breakdown by Fare, Tolls, and Tips;
7. Method of Payment (e.g., cash, credit card).
These data are collected by the Taxi & Limousine Commission (TLC) using the GPS and meter devices that are installed in every licensed (yellow medallion) taxi. The data were received in TXT file format with a size of 40 GB. This dataset is too large to manage with traditional tools such as Excel or Access, so a database management system, an SQL server whose primary querying language is T-SQL, is used instead.
Three steps make this large dataset more manageable for regression analysis: first, the raw data are imported into the SQL server; second, the less than 2 % of original records that are obviously erroneous are eliminated, e.g., records without valid locational information, with zero distance traveled, or with zero fare paid; finally, the locations of pick-ups and drop-offs are aggregated by NYC census tract, and the times are aggregated by hour of the day. A sketch of the latter two steps appears below.
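As an illustration of the cleaning and aggregation steps outside of T-SQL, the pandas sketch below filters the obviously erroneous records and counts pick-ups by tract and hour. The column names are hypothetical, and the tract assignment is assumed to already exist (in practice it would come from a spatial join of pick-up coordinates against census geography).

import pandas as pd

def clean_and_aggregate(trips):
    """Drop records with missing coordinates, zero distance, or zero fare,
    then count pick-ups per census tract per hour of day. `trips` carries
    (assumed) columns: pickup_lat, pickup_lon, trip_distance, fare_amount,
    pickup_datetime, tract."""
    valid = trips[trips["pickup_lat"].notna()
                  & trips["pickup_lon"].notna()
                  & (trips["trip_distance"] > 0)
                  & (trips["fare_amount"] > 0)].copy()
    valid["hour"] = pd.to_datetime(valid["pickup_datetime"]).dt.hour
    return (valid.groupby(["tract", "hour"]).size()
                 .rename("pickup_count").reset_index())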
The response variable is the number of taxicab pick-ups in each census tract per hour. Six explanatory variables are included, which have been identified as important in a previous study with the same data set (Yang and Gonzales 2014):
population, education (percent of population with at least a bachelor’s degree),
median age, median income per capita, employment by industry sector, and transit
accessibility. The number of taxicab drop-offs in each census tract during each hour
is added as an additional explanatory variable representing the immediately avail-
able supply of taxicabs at each location and time. The sources of data for the
explanatory factors considered in this study include:
• Drop-off taxi demand per hour, aggregated by NYC census tract (DrpOff)
• 2010 total population, aggregated by NYC census tract (Pop)
• Median age, aggregated by NYC census tract (MedAge)
• Percent of the population with education of a bachelor's degree or higher, aggregated by NYC census tract (EduBac)
• Transit Access Time (TAT): the combined estimated walking time a person must spend to access the nearest station (transit accessibility) and the estimated time that person will wait for transit service (transit level of service)
• Total jobs, aggregated by NYC census tract (TotJob)
• Per capita income, aggregated by NYC census tract (CapInc)
Since the model can only be estimated for census tracts with valid data for the
response and explanatory variables, the data set is cleaned to eliminate census tracts
that do not contain population or employment data. Of the 2143 census tracts in
NYC, 17 census tracts are deleted from the data set.
4 Methodology
Linear regression models are inadequate for count data, because the response
variable is a count of random events, which cannot be negative. As a result, the
models that are developed and compared in this study are count models that are
specifically designed to represent count processes. First, the Poisson regression model is introduced, following Ismail and Jemain (2007). In order to account for the varying effects that each of the explanatory variables has on the response variable at different times of the day, a separate model is estimated for the data in each hour.
Let Y_i be the independent Poisson random variable for the count of taxicab trips in census tract i = 1, ..., 2126. The probability density function of Y_i is defined as:

$$\Pr(Y_i = y_i) = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \qquad (1)$$

with mean

$$E(Y_i) = \mu_i \qquad (3)$$

Under the quasi-Poisson model the variance is proportional to the mean,

$$Var(Y_i) = \theta \mu_i \qquad (4)$$

while under the negative binomial model

$$Var(Y_i) = \mu_i + \mu_i^2 v_i^{-1} = \mu_i + \mu_i^2 \alpha \qquad (6)$$
The mean and the variance will be equal if α = 0, so the Poisson distribution is also a special case of the negative binomial distribution. Values of α > 0 indicate that the variance exceeds the mean, and the observed distribution is overdispersed.
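A sketch of this comparison for one hourly slice, using statsmodels (our choice of tool, not necessarily the authors'); the dispersion parameter estimated by the negative binomial model corresponds to α in Eq. (6), and a smaller AIC flags the better model.

import statsmodels.api as sm

def fit_count_models(y, X):
    """Fit Poisson and negative binomial regressions of hourly pick-up
    counts y on explanatory variables X (DrpOff, Pop, MedAge, EduBac, TAT,
    TotJob, CapInc) and report log-likelihood and AIC for each."""
    X = sm.add_constant(X)                          # add intercept column
    poisson = sm.Poisson(y, X).fit(disp=0)
    negbin = sm.NegativeBinomial(y, X).fit(disp=0)  # also estimates alpha
    return {"poisson": (poisson.llf, poisson.aic),
            "negbin": (negbin.llf, negbin.aic)}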
In order to select the most appropriate model specification for the count regression,
we must compare the mean and variance of the taxicab pick-up counts, which are
the response variable for the proposed model. Table 1 presents a summary by hour
of the day of the mean and variance of the total number of taxicab pick-ups per
census tract in the 10-month data sample.
The variance of taxicab pick-ups per census tract in the 10-month dataset greatly
exceeds the mean, as shown in Table 1, which provides an indication that the data is
overdispersed. This pattern holds whether all counts from all hours of the day are
considered together or the records are broken down by hour of the day. The
implication is that the count model for the regression should be appropriate for
overdispersed data. To choose between the quasi-Poisson distribution and the
negative binomial distribution, it is necessary to look at how the mean and variance
appear to be related. Since a goal of this study is to consider how the effect of
explanatory variables changes with the hour of the day, a separate model is estimated
for each hour, and the comparison of mean and variance must be considered within
each hourly aggregation. Figure 1 presents separate plots comparing count mean and
variance for three representative hours: hour 0 is 12:00 A.M.–1:00 A.M. (midnight);
hour 8 is 8:00 A.M.–9:00 A.M. (morning peak); and hour 17 is 5:00 P.M.–6:00
P.M. (evening peak).
In order to choose the distribution that most appropriately represents the
response variable, the data within each hourly aggregation are divided in 100 sub-
sets using the quantiles of the taxicab pick-up counts. The first category includes
taxicab pick-ups for census tracts whose counts fall between the 0 quantile and 0.01
quantile, the second category includes census tract data in the range of the 0.01
quantile and 0.02 quantile, and so on. Within each quantile category, the mean and
variance of the included data are calculated, and plotted in Fig. 1. A linear function
of the form shown in (4) is fitted to estimate θ and see how well the data matches the
assumed relationship for a quasi-Poisson regression model. A quadratic function of
the form shown in (6) is fitted to estimate α and see how well the data matches the
Modeling Taxi Demand and Supply in New York City Using Large-Scale. . . 413
Table 1  Mean and variance of response variable by hour of day

Hour of day   Mean of pickup counts   Variance of pickup counts
0             2446                    100,804,416
1             1827                    63,070,167
2             1373                    41,751,337
3             994                     24,875,367
4             711                     10,338,265
5             558                     4,547,384
6             1202                    25,984,332
7             2213                    79,611,311
8             2862                    128,381,917
9             2948                    137,735,882
10            2801                    122,508,279
11            2857                    131,214,089
12            3063                    153,234,156
13            3030                    149,228,639
14            3133                    161,516,405
15            2976                    143,190,381
16            2586                    106,107,666
17            3115                    151,429,667
18            3759                    224,925,568
19            3913                    244,880,173
20            3615                    209,472,196
21            3469                    195,600,486
22            3360                    186,065,870
23            3017                    147,131,917
All hours     61,828                  57,071,324,807
assumed relationship for a negative binomial regression model. The goodness-of-fit parameter, R², is used to identify which specification fits the data better. A value of
R2 closer to 1 indicates a better fit. It can be seen in the examples for hours 0, 8, and
17 (Fig. 1) that the quadratic function provides a better fit for relating the variance
and mean, indicating that the negative binomial distribution is more appropriate for
the counts of taxicab pick-ups.
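A minimal sketch of this quantile-binning comparison follows; the synthetic counts and the closed-form least-squares estimates of θ in (4) and α in (6) are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for one hour of per-tract pick-up counts,
# deliberately overdispersed.
counts = pd.Series(rng.negative_binomial(n=0.5, p=0.001, size=2126))

# 100 quantile bins; mean and variance of the counts inside each bin (Fig. 1).
bins = pd.qcut(counts, 100, duplicates="drop")
stats = counts.groupby(bins, observed=True).agg(["mean", "var"]).dropna()
m, v = stats["mean"].to_numpy(), stats["var"].to_numpy()

# Quasi-Poisson hypothesis, eq. (4): Var = theta * mean (line through origin).
theta = (m * v).sum() / (m * m).sum()
# Negative binomial hypothesis, eq. (6): Var = mean + alpha * mean^2.
alpha = ((m ** 2) * (v - m)).sum() / (m ** 4).sum()

def r2(fitted):
    return 1 - ((v - fitted) ** 2).sum() / ((v - v.mean()) ** 2).sum()

print("quasi-Poisson R^2:", r2(theta * m))
print("neg. binomial R^2:", r2(m + alpha * m ** 2))
```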
Although the overdispersed taxicab pick-up data appear to show that a negative
binomial regression is a more appropriate model than a Poisson regression, it is also
necessary to compare the fit of the models with the explanatory variables that have
been identified in the Data section. Several methods can be used to compare the fit
Fig. 1 Plot of the variance vs. mean of the aggregated hourly taxicab pick-up counts for (a) hour
0 (midnight), (b) hour 8 (morning peak), and (c) hour 17 (evening peak). Data are grouped into
100 categories by quantile. The linear equation for the quasi-Poisson model is shown with the blue
dotted line. The quadratic equation for the negative binomial model is shown with the red dashed line
1. Akaike Information Criterion (AIC)

The AIC is computed as:

$$\mathrm{AIC} = 2p - 2\,\mathrm{LL} \quad (7)$$

where p is the number of parameters in the model and LL is the log likelihood of the model. A smaller AIC value represents a better model, and the measure is used to ensure that the model is not overfitted to the data.
2. Goodness-of-Fit Test

The goodness-of-fit test is an analysis of variance (ANOVA) test based on calculating the Pearson residuals. The Pearson test statistic is (Cameron and Windmeijer 1996):

$$\chi^2 = \sum_{i=1}^{n} e_i^2 \quad (8)$$

where the definition of the Pearson residual, e_i, depends on whether the regression is a Poisson model or a negative binomial model:

Poisson model:

$$e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\mathrm{Var}(Y_i)}} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}} \quad (9)$$

Negative binomial model:

$$e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\mathrm{Var}(Y_i)}} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i + \hat{\mu}_i^2 \alpha}} \quad (10)$$

In both of these cases, Y_i is the observed count of taxicab pick-ups in census tract i, μ̂_i is the modeled count, and there are a total of n census tracts included in the dataset. The Pearson statistic, χ², is approximately distributed as chi-square with n − p degrees of freedom, where n is the number of observations and p is the number of estimated parameters (i.e., one parameter per explanatory variable included in the model). If $\chi^2 > \chi^2_{n-p,\,0.05}$, or equivalently if the P-value of the χ² test is less than 0.05, then the model is statistically different from the observed data at the 95 % confidence level. Therefore, we seek a model with a P-value of the χ² test greater than 0.05.
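A short sketch of how the statistics in (8)–(10) might be evaluated is given below; the observed and fitted counts, the dispersion parameter α, and the parameter count p are made-up values for illustration only:

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts (y) and fitted counts (mu_hat) for eight
# census tracts; alpha and p are likewise illustrative values.
y      = np.array([12.0, 85.0, 3000.0, 450.0, 9.0, 130.0, 770.0, 55.0])
mu_hat = np.array([10.0, 90.0, 2800.0, 500.0, 15.0, 120.0, 800.0, 60.0])
alpha  = 0.8   # estimated NB dispersion parameter
p      = 6     # one estimated parameter per explanatory variable

# Pearson residuals: eq. (9) for the Poisson model, eq. (10) for the NB model.
e_pois = (y - mu_hat) / np.sqrt(mu_hat)
e_nb   = (y - mu_hat) / np.sqrt(mu_hat + alpha * mu_hat ** 2)

for name, e in [("Poisson", e_pois), ("Negative binomial", e_nb)]:
    chi2 = np.sum(e ** 2)            # eq. (8)
    dof = len(y) - p                 # n - p degrees of freedom
    pval = stats.chi2.sf(chi2, dof)  # want p > 0.05 for an adequate fit
    print(name, round(chi2, 2), round(pval, 4))
```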
3. Sum of Squared Deviances

The deviance statistic is:

$$G^2 = 2 \sum_{i=1}^{n} y_i \ln\!\left(\frac{y_i}{\hat{\mu}_i}\right) \quad (11)$$

where smaller values indicate a better fit.

4. Likelihood Ratio Test

The likelihood ratio test compares two nested models using the statistic χ²_LL = 2(LL₁ − LL₂), where LL₁ and LL₂ are the log likelihoods of models 1 and 2. The χ²_LL statistic is chi-squared distributed with degrees of freedom equal to the difference of the number of parameters estimated in model 1 and model 2. If χ²_LL is larger than the critical value for the 95 % confidence level, then model 1 is said to be statistically different from model 2.
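The deviance in (11) and the likelihood ratio comparison can be sketched in the same way; the counts and the two log likelihoods below are hypothetical placeholders, not values from the study:

```python
import numpy as np
from scipy import stats

y      = np.array([12.0, 85.0, 3000.0, 450.0, 9.0, 130.0])
mu_hat = np.array([10.0, 90.0, 2800.0, 500.0, 15.0, 120.0])

# Sum of squared deviances, eq. (11); zero counts would need the usual
# convention 0 * ln(0) = 0.
g2 = 2.0 * np.sum(y * np.log(y / mu_hat))
print("G^2 =", g2)

# Likelihood ratio test between a Poisson fit (model 2) and an NB fit
# (model 1) that adds one dispersion parameter.
ll_nb, ll_pois = -2100.0, -2_050_000.0   # hypothetical log likelihoods
chi2_ll = 2.0 * (ll_nb - ll_pois)
print("LR p-value:", stats.chi2.sf(chi2_ll, df=1))  # df = 1 extra parameter
```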
5 Results
Having identified that the negative binomial distribution is more appropriate for the
taxicab pick-up data than the quasi-Poisson distribution (as illustrated in Fig. 1), it is useful to compare the results of a conventional Poisson regression model with the results of the negative binomial regression in order to demonstrate the effect of accounting for the overdispersed response variable. In order to make a comparison
between the Poisson and negative binomial regressions, a separate model has been
estimated for each hour of the day. The coefficients of the explanatory variables for
both models are shown in Table 2 and Fig. 2. In order to show all of the coefficients
on the same graph, we have normalized the scale such that a magnitude of 1 is the
magnitude of the coefficient at midnight (hour 0). This allows us to see the relative
changes in each coefficient value, and the sign for positive and negative values is
not changed.
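This normalization amounts to dividing each hourly coefficient by the absolute value of its hour-0 counterpart. A small sketch, using the DrpOff and TAT columns of Table 2 for hours 0–2:

```python
import numpy as np

# Rows = hours 0-2; columns = (DrpOff, TAT), values taken from Table 2.
coefs = np.array([
    [0.000048, -0.158105],
    [0.000081, -0.145743],
    [0.000103, -0.115985],
])

# Divide each column by the magnitude of its hour-0 value, so +/-1 means
# "same magnitude as at midnight"; signs are preserved.
scaled = coefs / np.abs(coefs[0, :])
print(scaled)
```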
Table 2  Parameters of the Poisson and negative binomial regressions for taxicab trip generation

Poisson model coefficients:

Hour  DrpOff    Pop        MedAge     EduBac    TAT        TotJob
0     0.000048  0.000056   0.021619   0.035982  -0.158105  0.000005
1     0.000081  0.000026   0.026725   0.031175  -0.145743  0.000001
2     0.000103  0.000014   0.021769   0.035571  -0.115985  -0.000008
3     0.000194  -0.000039  0.030535   0.032231  -0.095762  -0.000020
4     0.000160  0.000081   -0.006056  0.039077  -0.120689  -0.000004
5     0.000075  0.000162   -0.007020  0.047770  -0.129105  0.000005
6     0.000031  0.000185   0.007363   0.050621  -0.126587  0.000003
7     0.000021  0.000183   0.013675   0.054706  -0.107053  0.000000
Fig. 2 Scaled coefficients of parameters of the Poisson and negative binomial regressions for
taxicab trip generation by hour of the day (light shading indicates coefficients that are not significant at the 0.05 level)
The six explanatory variables that are significant are the number of drop offs,
which represents the available supply of empty taxicabs in a census tract (DrpOff);
population (Pop); median age of residents (MedAge); percent of population
attaining at least a bachelor’s degree (EduBac); transit access time (TAT); and
the total number of jobs located in the census tract (TotJob). Although per capita
income had been identified as an important explanatory variable in a previous
model based on linear regression (Yang and Gonzales 2014), income is highly correlated with the measure of educational attainment (EduBac) in NYC. Therefore, to avoid problems associated with collinearity among the explanatory variables, income has been omitted, and the level of education is kept in the models.
The coefficients for all variables in the Poisson regression are statistically
significant at the 0.05 level (see the top half of Table 2). While most of the
coefficients remain significant when the negative binomial regression is used,
median age fails to exhibit significance at the 0.05 level for most hours of the day
(see lower part of Table 2 and Fig. 2). The magnitude of the model parameters is
more stable across models for some explanatory variables than others (Fig. 2). Also,
as shown in Fig. 2, the variables TAT, EduBac, and MedAge exhibit larger magnitudes and more variability throughout the day than the other three variables: DrpOff, TotJob, and Pop. The results show that regardless of model specification,
the taxi supply (DrpOff), education (EduBac), and transit accessibility (TAT) are
always significant determinants of taxicab pick-up demand at all times of the day.
In order to determine whether a Poisson regression or a negative binomial
regression fits the observed hourly taxicab pick-ups better, the four goodness-of-fit tests introduced previously are used to assess the fit of the two models. These
statistics are summarized in Table 3 for each of the 24 Poisson models and
24 negative binomial models (i.e., one for each hour of the day). The results in
Table 3 show through many methods of comparison that the negative binomial
regression provides a better fit for the data than the Poisson regression. The
interpretations of the statistics are as follows:
1. The AIC values for the negative binomial regression models are much lower than
for the Poisson regression models.
2. Both models’ specifications suffer from low p-values for the χ 2 test, so there is
substantial variation in the observed data that is not explained by the models.
These errors will be reflected in the residuals, so an analysis of the residuals is
valuable and necessary.
3. The sums of squared deviances for the negative binomial regression models are
much smaller than for the Poisson regression models.
4. The very low p-values of likelihood ratio test statistics suggest that the negative
binomial regression model is very different from the Poisson regression model.
In light of the numerous differences between the models, the better fit and more
appropriate model is the negative binomial regression. That said, the model is not
perfect, and although a number of statistically significant explanatory variables and
parameters have been identified, these are not sufficient to fully explain the
Table 3  Goodness-of-fit statistics for the Poisson (POI) and negative binomial (NB) models

Interpretation: AIC, the smaller the better; P-value (χ² test), significant if less than 0.05; G², the smaller the better; likelihood ratio test between POI and NB, significant if P-value is less than 0.05.

        AIC                   P-value (χ² test)   G²                  LR test
Hour    POI        NB         POI    NB           POI        NB       P-value (χ²_LL)
0 4,139,707 17,535 0 0 4,132,262 2293 0
1 3,095,928 17,005 0 0 3,088,663 2267 0
2 3,196,864 16,254 0 0 3,189,866 2248 0
3 2,600,073 15,684 0 0 2,593,272 2233 0
4 2,165,955 15,695 0 0 2,159,067 2249 0
5 1,398,907 15,302 0 0 1,392,502 2187 0
6 2,866,840 16,003 0 0 2,860,328 2153 0
7 4,510,050 16,852 0 0 4,503,196 2206 0
8 4,949,851 16,851 0 0 4,943,023 2180 0
9 4,703,359 16,006 0 0 4,696,921 2073 0
10 4,188,069 15,906 0 0 4,181,654 2118 0
11 4,357,659 15,656 0 0 4,351,337 2075 0
12 4,609,746 15,734 0 0 4,603,387 2081 0
13 4,620,149 15,732 0 0 4,613,788 2088 0
14 5,142,257 16,043 0 0 5,135,735 2125 0
15 5,072,322 16,034 0 0 5,065,810 2122 0
16 4,448,535 15,912 0 0 4,442,054 2123 0
17 5,147,389 16,414 0 0 5,140,683 2154 0
18 5,958,635 16,625 0 0 5,951,825 2151 0
19 6,493,192 16,740 0 0 6,486,305 2173 0
20 5,795,703 16,793 0 0 5,788,756 2194 0
21 5,569,018 17,090 0 0 5,561,899 2249 0
22 5,591,048 17,466 0 0 5,583,733 2280 0
23 4,965,376 17,625 0 0 4,957,950 2288 0
variation in the number of taxicab trips that are generated in census tracts
across NYC.
The interpretation of the model parameters is the same for the negative binomial
and Poisson regressions, because both models employ a log link function. With every unit increase of an explanatory variable x, the predicted response variable increases by a multiplicative factor exp(β). For example, the parameter for population in hour 8 of the negative binomial model is 0.000331, so an increase in the population of a census tract by one inhabitant will tend to increase demand by a factor of 1.00033. A positive parameter indicates that the explanatory variable is
associated with increased numbers of taxicab pick-ups, and a negative parameter
indicates an effect of decreased taxicab pick-ups.
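The arithmetic behind this multiplicative interpretation is direct; for instance:

```python
import math

beta = 0.000331               # NB coefficient for population in hour 8 (from the text)
print(math.exp(beta))         # ~1.00033: one extra resident
print(math.exp(beta * 1000))  # ~1.39: a 1000-resident increase, ~39% more pick-ups
```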
In the negative binomial model, the model parameters all have the expected
signs. The number of observed taxicab pick-ups increases with taxi supply
(DrpOff), population (Pop), education (EduBac), and the total number of jobs
within a census tract (TotJob). The effect of transit access time (TAT) is negative,
which means that more taxi pick-ups are made in places that have shorter or faster
access to the subway service. There are a couple of possible reasons why taxis tend to be used in the same places that have good transit service. One reason is that people may take taxis to get to or from transit services, so a subway station is a place where a traveler exits the transit system and may look for a taxicab to reach his or her final destination. Another reason is that the types of travelers and trip purposes that favor taxis (e.g., high value of time, unwillingness to search or pay for parking) also tend to be concentrated in parts of the city that have a lot of transit service. The negative parameter value for TAT is consistent
for every hour of the day in the Poisson and negative binomial models, but the
precise reason cannot be determined from this regression alone.
One objective of this study is to identify the locations and times of day when
there may be a mismatch between taxi demand and supply. One way to investigate
this is to look specifically at the Pearson residuals from the models as defined in
(10). For a single hour of the day, the residuals for each census tract in the city can
be mapped in order to visualize the spatial distribution of the model errors. Maps
are presented in Fig. 3 for hour 0 (midnight), hour 8 (morning peak), and hour
17 (evening peak), and the color indicates where the model overestimates taxicab
pick-ups (i.e., negative residual shown in green) and where the model underesti-
mates taxicab pick-ups (i.e., positive residuals shown in red). The Pearson residual
is calculated by dividing the raw residual by the assumed standard deviation, which, for the negative binomial model, increases as a quadratic function of the mean. This standardization shows the magnitude of error in a normalized manner so that busier census tracts do not dominate the figure, since larger observed and fitted counts will tend to have errors that are larger in magnitude even if those errors are small relative to the expected variance.
Taxicab supply is included as an explanatory variable in the model, and the
availability of cabs is shown to increase the number of realized taxicab pick-ups
(because the parameter value for DrpOff is positive). A negative residual, which
represents an overestimate from the model, provides an indication that there are
relatively fewer taxicab pick-ups being demanded relative to the supply of empty
cabs available, controlling for the characteristics of the neighborhood. Conversely,
a positive residual, which represents an underestimate from the model, provides an
indication that there are relatively more taxicab pick-ups being demanded relative
to the supply of empty taxicabs. Census tracts that fall into this second condition are
of interest, because these are the locations, during each hour of the day, that appear
to have insufficient taxicab service relative to similar neighborhoods in other parts
of the city.
In hour 0 (12:00 A.M.–1:00 A.M., midnight), the central part of Manhattan and
most of the census tracts in the Outer Boroughs have negative Pearson residuals
(colored green in Fig. 3a), and the model overestimates the realized count of
Fig. 3 Pearson residuals of the negative binomial regression models for (a) hour 0, (b) hour 8, and
(c) hour 17
taxicab pick-ups. The census tracts with positive Pearson residuals (colored red in
Fig. 3a) indicate that there are more taxicab pick-ups than the model predicts in
northern Manhattan, the Lower East Side, western Queens, and the downtown and
Williamsburg parts of Brooklyn. These are neighborhoods where there tends to be
more night activity than indicated by the explanatory variables and thus more taxi
demand. These are the neighborhoods where there is likely to be the largest
mismatch between the supply of available taxicabs and the number of people
who seek to be picked up by a taxicab.
It is useful to compare the patterns from hour 0 with other hours of the day,
because activity patterns in NYC change over the course of the day. In hour 8 (8:00
A.M.–9:00 A.M., morning peak), the negative Pearson residuals in central Man-
hattan and much of the Outer Boroughs reduce in magnitude (yellow or light orange
in Fig. 3b). This suggests that the negative binomial regression model provides a better fit
during the morning, and taxicab pick-up counts are estimated with less error. One
reason for this may be that data associated with the residents of a census tract are
most relevant for predicting the number of trips that these residents are likely to
make from their homes in the morning. In hour 17 (5:00 P.M.–6:00 P.M., evening
peak), the magnitudes of the Pearson residuals become larger again.
Despite the variations, there are consistent patterns in the maps of the Pearson
residuals across all hours of the day. The locations where the model underestimates
trips at midnight also tend to have underestimated trips in the morning and evening.
Many of these neighborhoods, such as Harlem, the Lower East Side, Astoria,
Williamsburg, and Downtown Brooklyn, are dense residential neighborhoods
with vibrant local businesses but without the same level of large commercial and
tourist activities as are concentrated in much of Manhattan. This may be a reason
why these inner neighborhoods are associated with high demand for taxicab pick-
ups, while the taxicab fleet has a tendency to focus service in more central parts of Manhattan. Many of the farther outlying neighborhoods in the Bronx, Queens,
Brooklyn, and Staten Island tend to have demand overestimated by the model. This
is likely because the populations in those areas either have lower incomes, which
make them less likely to choose to pay for taxicab service, or the neighborhood
development is at lower densities, which are more conducive to travel by private car
than by hailing a taxicab.
The TLC has already changed policies to address some of the mismatch between
taxicab supply and demand in NYC. The green Street Hail Livery vehicles (Boro
taxi) are allowed to pick up passengers only in Manhattan above 96th Street and in the
Outer Boroughs, not including the airports. This coverage area overlaps with many
of the underestimated (and potentially underserved) neighborhoods identified in
Fig. 3. One part of the city that is consistently underestimated in the models but is
not within the green cab’s pick-up area is Manhattan’s Lower East Side. One reason
for this may be the recent growth that has occurred in the neighborhood, which has
increased activity but may not be reflected in the service provided by taxis.
Nevertheless, the Lower East Side is an example of a neighborhood area that this
modeling approach can identify as being in need of additional taxicab service.
These models and figures could be useful tools for transportation planners who
want to understand where taxicab service is used, and where more taxicab supply is
needed.
6 Conclusion
This study made use of a negative binomial regression model to interpret 10 months
of overdispersed taxicab demand data in NYC. Negative binomial regressions have
been broadly applied in biology, biochemistry, insurance, and finance, and this paper shows that the modeling approach is also well suited to demand modeling for
taxicabs. The raw taxicab dataset includes 147 million records, and in order to make
sense of the patterns, the records are aggregated by census tract and hour of the day
in order to develop meaningful models of the way that taxicab demand varies across
space and time. A number of count regression models have been considered to
model the number of taxicab pick-ups per census tract within an hour of the day,
including the Poisson model, quasi-Poisson model, and negative binomial model.
By a series of statistical tests, the negative binomial model is shown to be most
appropriate for the overdispersed count data. An analysis of the residuals provides
useful insights about where taxicab demand appears to be adequately served by the
existing supply of taxicabs, and where there is a need for more taxicab services.
The modeling approach started by using important explanatory variables that
were identified in a previous modeling effort that used the same dataset (Yang and
Gonzales 2014). An additional explanatory variable was added to represent the
taxicab supply, and this is the number of taxicab drop-offs in each census tract
during each hour of the day, because each drop-off corresponds to a taxicab
becoming available for another customer. The negative binomial regression
shows that three explanatory variables are significant during every hour of the
day (drop-offs, educational attainment, and transit access time), and two others are
significant during most waking hours of the day (population and total number
of jobs).
The residual graphs suggest that central Manhattan and most of the Outer
Boroughs have at least enough taxi supply for the demand that is observed,
controlling for neighborhood characteristics. The northern part of Manhattan, the
Lower East Side, and the western parts of the Queens and Brooklyn all have more
observed taxicab pick-ups than the model predicts. The fleet of green Street Hail
Livery vehicles serves some of these neighborhoods but not the Lower East Side.
The maps of residuals provide some useful insights for transportation planners to understand when and where more taxis are needed.
The taxicab data used to create these models has both spatial and temporal
dimensions. The effect of time is accounted for by separating the data by hour of the
day, and fitting a negative binomial regression for each hour. The effect of the time
of the day that each explanatory variable has on the number of taxicab pick-ups can
be observed by comparing the parameter values in Table 2. The variability of the
coefficients over time is shown in Fig. 2. Additional modeling effort is needed to
also account for the spatial correlations in the dataset. It is clear from the maps of residuals that adjacent census tracts exhibit correlated errors. One way to account for these correlations is with a Generalized
Linear Mixed Model. Other efforts to improve the model would be to consider
additional explanatory variables to account for the activities or popularity of a
census tract or to account for the movement of empty taxicabs in search of
customers.
Large datasets, such as the records of taxicab trips in NYC, present some
challenges, because the raw data are too big to be analyzed directly by conventional
methods. By processing the data, and developing models that relate the taxicab data
to other sources of information about the characteristics of different parts of the city
at different times of day, it is possible to gain useful insights about the role that
taxicabs play in the broader transportation system. More importantly, these insights
can be used to plan and improve the transportation system to meet the needs of
users.
References

Detecting Stop Episodes from GPS Trajectories with Gaps

S. Hwang (*)
Geography, DePaul University, Chicago, IL, USA
e-mail: [email protected]

C. Evans • T. Hanke
Physical Therapy, Midwestern University, Downers Grove, IL, USA
e-mail: [email protected]; [email protected]

1 Introduction
GPS trajectory data, a temporally ordered sequence of GPS track logs with (x, y, t)
(or coordinates at a given time interval), have been found useful in many applica-
tion domains. GPS trajectories have been used to complement personal travel
surveys (Stopher et al. 2008), detect activities of individuals (Eagle and (Sandy)
Pentland 2006; Liao et al. 2007; Rodrigues et al. 2014), improve measurement of
physical activity and community mobility (Krenn et al. 2011; Hwang et al. 2013),
examine effects of transportation infrastructure on physical activity (Duncan
et al. 2009), understand the role of environmental factors in occurrence of diseases
(Richardson et al. 2013), recommend locations for location-based social network
services (Zheng et al. 2011), and detect anomalous traffic patterns in urban settings
(Pang et al. 2013).
The utility of GPS trajectory data will only grow as a greater amount of GPS data
become readily available from location-aware devices. With increasing population
in urban areas, the ability to track human movements in high spatiotemporal
resolution offers much potential in urban informatics—“the study, design, and
practice of urban experiences across different urban contexts that are created by
new opportunities of real-time, ubiquitous technology and the augmentation that
mediates the physical and digital layers of people networks and urban infrastruc-
tures” (Foth et al. 2011). GPS data integrated with other data such as survey, census,
and GIS data can lead to a better understanding of “circulatory and nervous
systems” of cities (Marcus 2008). Aided by patterns extracted from GPS trajecto-
ries (e.g., patterns of vehicle traffic or human movement), one can examine travel
behavior, traffic congestion, community participation, and human impacts on the
environment at a fine granularity (Zheng et al. 2014).
Whether this potential can be realized, however, rests on the accuracy of the algorithms that detect those patterns from GPS trajectory data. Unfortunately, it is difficult to detect patterns reliably from GPS trajectories, which are typically voluminous and noisy. To illustrate, hundreds of thousands of data points are generated by GPS tracking of a single individual over the period of a week at 5-s recording intervals. This renders manual processing of GPS trajectories nearly infeasible, especially when dealing with a collection of individual GPS trajectories over an extended period of time. GPS trajectories are often interspersed with spatial outliers, where the coordinates of track logs deviate from those of temporally neighboring track logs. Further, GPS trajectories contain a fair number of gaps (time periods when track logs are not recorded), caused when GPS satellite signals cannot be received indoors or when the GPS logger battery runs out. While there is a clear need for automated procedures for processing GPS data, developing algorithms that are robust to uncertainty remains a challenging task.
This paper presents a method for identifying stay points from individual GPS
trajectories based on spatiotemporal criteria while treating gaps in GPS trajectories.
Operationally, stay points are where an individual stays (or does not move) for a
minimal time duration (Li et al. 2008). It is important to detect stay points because they may indicate where individuals are likely to conduct activities (e.g., work in the office, shop at a store, or gather for social occasions) that are meaningful to them. Stay points are fine-grained snapshots of the activity
space where individuals work, live, and play. Once stay points or personally
important places are identified, one can infer semantics of those places and activ-
ities that might have occurred in those places with the aid of analytics and
contextual data (like POIs) (Liao et al. 2007; Ye et al. 2009).
Stay point detection is a crucial step in trajectory data analysis from a method-
ological standpoint. Based on a stay point, trajectories can be divided into a set of
homogeneous segments (or episodes), which concisely capture semantics of trajec-
tories that are of interest to application domains (Parent et al. 2013). The most
commonly used criterion for segmenting trajectories is whether data points are
considered to consist of stop (a region where activities are conducted) or move
(a route taken from stop A to stop B). For instance, on such segments it is possible
to reliably infer mode of transportation (motorized or not), types of activities,
purpose of trips (discretionary or not), and vehicle emissions in the context of
formulating policies toward a healthy or sustainable city. Practitioners in disaster
management can learn about the most frequent sub-trajectories or trajectories with
similar behaviors from collective trajectories (like hurricanes or tornados) for
disaster preparedness (Dodge et al. 2012).
This paper intends to highlight stay point detection as a preliminary step to
segmentation of individual trajectories. We extend the previous work by Hwang
et al. (2013) that proposes a method for segmenting trajectories into stop or move
(trip) based on a density-based spatial clustering algorithm (like DBSCAN) adapted
to temporal criteria. DBSCAN detects spatial clusters by checking whether the
minimum number of data points (MinPts) is within a search radius (eps) from a
randomly chosen core point to detect initial clusters, and aggregating those initial
clusters that are density connected (Ester et al. 1996). Spatial clusters with high
density and arbitrary shape can indicate a stay point, and those clusters can be
detected reliably by DBSCAN. Finally, those spatial clusters that last for the
minimal time duration are identified as a stay point. The previous work, however,
does not consider cases when gaps might represent a stop. Satellite signals are lost
when one enters a building, and this can result in falling short of MinPts necessary
to be detected as a spatial cluster. The same problem occurs when one exits a
building because the time required for a GPS receiver to acquire satellite signals
may take a while. In this paper we present a new method for stay point detection
that addresses the limitation above. The proposed method fills gaps such that
DBSCAN can detect stay points reliably even in the presence of gaps.
The remaining part of this paper is organized as follows. In Sect. 2, we review
related work focusing on stay point detection from GPS trajectories in relation to
the field of trajectory data analysis. Section 2 intends to highlight diversity and
limitations in methods for detecting a stay point. We describe the proposed method
in Sect. 3. The proposal combines density-based clustering with gap treatment to
achieve improved results. We implement the proposed method as part of measuring
community mobility, more specifically to infer the number of stops made. We
compare the performance of methods proposed in the previous and current work in
terms of how accurate the number of stops predicted is. These results are presented
in Sect. 4. We conclude the study by discussing lessons learned and limitations of
the proposed method in Sect. 5.
2 Related Work
stores in the mall) cannot be reliably detected as a stay point by the algorithm
above.
A density-based spatial clustering algorithm can be used to overcome the
limitation above. In other words, spatially connected GPS logs that span beyond a
distance threshold can be detected using the clustering algorithm. Yuan et al. (2013)
use the clustering algorithm to recommend where taxi drivers are likely to find
customers by detecting parking places where taxi drivers are stationary from
collective GPS trajectories of vehicles. Schoier and Borruso (2011) use DBSCAN
to detect significant places from GPS trajectories. Spatial clusters detected with
these techniques, however, lack temporal dimension (such as time duration between
the beginning and end times).
Hwang et al. (2013) detect a stay point by checking whether data points
constituting a spatial cluster (detected by DBSCAN) meet temporal criteria.
ST-DBSCAN, which extends DBSCAN to consider non-spatial attributes (including time), is not well suited to handling time expressed on a categorical (binary) scale (i.e., whether a time threshold is exceeded or not) by an unknown number of spatially adjacent data points (Birant and Kut 2007). Finally, additional features
like change of direction and sinuosity can be considered in stay point detection
given that a stay point can be characterized by higher rate of direction change and
higher sinuosity than a move episode (Rocha et al. 2010; Dodge et al. 2012). In the
following, we review related work on how gaps are treated.
A gap is usually defined as a stop if the gap duration is above a threshold ranging
from 1 to 10 min (depending on study area) (Ashbrook and Starner 2002). Data
points can be added for the gap using linear interpolation; that is, location and time
of data points added is linearly interpolated between a data point right before the
gap, and a data point after the gap (Winter and Yin 2010). Speed for the gap can be
estimated based on gap distance and gap time, and if the estimated gap speed
matches the speed of the previous trip and of the following trip then the gap is set to
a stop (Stopher et al. 2008). This work suggests that successful stay point
detection involves constructing empirically-based rules that consider a combination
of features described above in conjunction with reducing uncertainty of GPS
trajectories.
3 Proposed Method

Following on from the previous research, we propose a framework that consists of
four modules: data cleaning, gap treatment, stay point detection, and post-
processing. The data cleaning module deletes spatial outliers. The gap treatment
module fills gaps whose duration is at least 3 min using a linear interpolation. The
stay point detection module identifies stay points based on spatial density
(eps = 50 m; MinPts = 5) and the minimal time duration t (3 min) using DBSCAN
adapted to temporal criteria. The post-processing module removes noise that
remains after the previous module, based on the moving window that consists of
five data points. The main difference in this method from previous work (Hwang
et al. 2013) is in the gap treatment prior to the stay point detection. By treating gaps,
it is possible to address a limitation of DBSCAN that does not detect gaps as a
potential stay point. All other parameters remain the same, which allows us to
examine effects of gap treatment on results. In the following, we describe the
proposed method in detail.
3.1 Data Cleaning

The quality of the raw GPS data was inspected and treated as necessary to minimize
effects of uncertainty. GPS track logs were overlaid on data of higher accuracy and
independent source (ArcGIS 10.3 Map Service World Imagery). It was shown that
most of the GPS data were well aligned with reference data, but a few spatial
outliers were present in trajectories recorded over the 1-week period. Outliers
are represented by an abrupt change in position, and can be detected by abrupt
changes in speed. More specifically, change in speed between two consecutive
track logs was calculated as change in location divided by change in time. A track
log is deleted if the elapsed speed for consecutive data points is greater than 130 km
per hour. It is possible that data points that are not actually outliers (e.g., as a result
of cars speeding) will be deleted if the elapsed speed is greater than or equal to
130 kph, but in any case, those data points would not be used for the purposes of
stay point detection.
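A minimal sketch of this speed-based cleaning rule is shown below, assuming time-ordered (t, lat, lon) fixes; the function names and the haversine helper are illustrative, not the authors' code:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drop_speed_outliers(logs, max_kph=130.0):
    """logs: time-ordered list of (t_seconds, lat, lon); drops any fix whose
    elapsed speed from the previously kept fix reaches max_kph."""
    kept = [logs[0]]
    for t, lat, lon in logs[1:]:
        t0, lat0, lon0 = kept[-1]
        dt = t - t0
        kph = (haversine_m(lat0, lon0, lat, lon) / dt) * 3.6 if dt > 0 else float("inf")
        if kph < max_kph:
            kept.append((t, lat, lon))
    return kept
```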
3.2 Gap Treatment

The gap treatment module adds a given number of data points (k) for a gap whose
time duration exceeds a certain threshold (q), where parameters k and q are chosen
such that a gap that is likely to represent a stop episode is detected as a stay point,
and a gap that is likely to represent a move episode is not detected as a stay point.
More specifically, k data points were added just before any gap whose elapsed time
from the previous track log is at least q seconds, where k = MinPts + 1 and q = t + r, where t is the minimum time duration in seconds of a stay point and r is
a recording time interval in seconds for GPS logging. In this study, k is 6 and q is
210 s as t is 180 s and r is 30 s. Time and position of those data points added are
linearly interpolated.
The idea is that if the spatial distance between a data point before the gap (pi) and a data point after the gap (pj) is sufficiently large (i.e., the distance is greater than eps, and pi and pj are not density connected), then the six data points that were added are unlikely to be detected as a stay point due to low density. Good examples of this are where an individual takes the subway, a vehicle enters a tunnel, or a GPS logger runs out of battery. If the spatial distance between pi and pj is sufficiently small (i.e., the distance is less than eps, and pi and pj are density connected), then the six data points that were added are likely to be detected as a stay point. An example of this would be where GPS satellite signals are lost when one enters a building and the subsequent data point (pj) after the gap is close to pi (when one exits the building), which most likely indicates that a stop is made at a place.
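A sketch of the gap-filling step follows, under the assumption that the k interpolated fixes are spread evenly between the two fixes bounding the gap (the text describes adding them just before the gap, so treat this as one plausible reading rather than the authors' exact procedure):

```python
def fill_gaps(logs, q=210, k=6):
    """Insert k linearly interpolated fixes into any gap of at least q
    seconds. logs: time-ordered list of (t_seconds, lat, lon)."""
    out = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(logs, logs[1:]):
        out.append((t0, la0, lo0))
        if t1 - t0 >= q:
            for j in range(1, k + 1):
                f = j / (k + 1)  # fraction of the gap covered so far
                out.append((t0 + f * (t1 - t0),
                            la0 + f * (la1 - la0),
                            lo0 + f * (lo1 - lo0)))
    out.append(logs[-1])
    return out
```

If pi and pj are close, the interpolated fixes pile up near the stay location and can reach MinPts; if they are far apart, the fixes spread out at low density and are left as noise.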
3.3 Stay Point Detection

For DBSCAN, eps is set to 50 m, and MinPts is set to 5 based on observed spatial
accuracy of data and extent of spatial clusters. Obviously, eps and MinPts should be
chosen in relation to t (the minimum time duration of a stay point) and r (recording
time interval). Spatial clusters are identified from DBSCAN given parameters
MinPts and eps. With DBSCAN, track logs that are scattered around a stay point
are treated as noise, and therefore those track logs (such as beginning of a new trip
before or after making a stop) do not form part of a stay point. Staying at home is
not included in stay point detection to reduce processing time, where staying at
home is identified as track logs within 100 m from the home location provided by
study participants.
Three minutes is chosen as t because this allows for capturing short duration
activities (such as running an errand). Previous studies and observation of the study
area indicate that the minimum time duration for stay points or important places is
2–30 min (Ashbrook and Starner 2002; Stopher et al. 2008; Ye et al. 2009). It was
then checked to see if track logs constituting a spatial cluster are consecutive for the
minimal duration of time t. If this condition is met, a spatial cluster is disaggregated
into one or more stay points so that a place visited more than once is identified with
multiple stay points and different time stamps. Track logs that are identified as a
stay point are flagged as “stop”, and track logs that are not identified as a stay point
are flagged as “move”.
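A simplified sketch of this detection step is shown below, using scikit-learn's DBSCAN on projected coordinates in meters; it applies the 180-s duration check per cluster but omits the disaggregation of repeat visits and the home-location exclusion described above, and all data are synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic projected coordinates (meters): 12 fixes jittered around one
# spot (a candidate stay) followed by 10 widely spaced fixes (a move).
rng = np.random.default_rng(1)
stay = rng.normal(loc=(0.0, 0.0), scale=5.0, size=(12, 2))
move = np.column_stack([np.linspace(100, 1000, 10)] * 2)
xy = np.vstack([stay, move])
times = np.arange(len(xy)) * 30.0      # 30-s recording interval, in seconds

labels = DBSCAN(eps=50.0, min_samples=5).fit(xy).labels_   # -1 = noise

flags = np.full(len(xy), "move", dtype=object)
for c in set(labels) - {-1}:
    idx = np.flatnonzero(labels == c)
    # Temporal criterion: the cluster must persist for at least t = 180 s
    # to qualify as a stay point.
    if times[idx].max() - times[idx].min() >= 180.0:
        flags[idx] = "stop"
print(flags)
```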
3.4 Post-processing
Some track logs can still remain misclassified in the presence of both spatial and
non-spatial outliers after the stay point detection module. For instance, anomalies in
GPS measurement cause some track logs that are actually part of a stay point to be
classified as “move” due to low density around those spatial outliers. Conversely,
track logs that are not semantically a stop episode (such as waiting for a traffic light
for longer than t) can be falsely classified as “stop”. What is common in those
misclassified track logs is that they are surrounded by track logs that are classified
otherwise. Filtering stop/move values in the moving window with a given size can
fix this problem. In this study, stop/move values are replaced with the most
common value of five consecutive track logs.
For example, a series of track points might have an array of values [stop, move,
stop, stop, stop] although they should be classified as [stop, stop, stop, stop, stop].
The program calculates a majority value (the most common value) in the moving
window that consists of five consecutive track points, and replaces the binary value
of each track log (stop or move) with the majority value. This process is repeated for
all track logs. The size of the moving window is calculated as t/r − 1 (that is, 5 in this study) to make the filtering results balanced between too-noisy and too-smooth
outcomes. That way, an array of five track logs [stop, move, stop, stop, stop] can be
classified into a stop as a whole since the second value (“move”) will be replaced
with “stop”. This effectively removes noise that remains after clustering-based stay
point detection.
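A minimal sketch of this majority filter over a centered five-point window (a hypothetical helper, not the authors' code):

```python
from collections import Counter

def majority_filter(flags, window=5):
    """Replace each stop/move flag with the most common value in a
    centered window of `window` consecutive track logs."""
    half = window // 2
    out = []
    for i in range(len(flags)):
        lo, hi = max(0, i - half), min(len(flags), i + half + 1)
        out.append(Counter(flags[lo:hi]).most_common(1)[0][0])
    return out

print(majority_filter(["stop", "move", "stop", "stop", "stop"]))
# -> ['stop', 'stop', 'stop', 'stop', 'stop']
```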
Unique identifiers (IDs) are assigned to stops and moves based on temporal order
and a rule that a stop is followed by a trip, and vice versa. In operational terms, a
stop is a sequence of track logs that are spatially clustered and temporally contin-
uous; a move is a sequence of track logs that are not spatially clustered but is
temporally continuous. Track logs that are marked as a stop with the unique ID are
aggregated (or dissolved) into a geographic object called a “stop center” that
possesses spatiotemporal attributes, including coordinates of representative loca-
tion (mean center), time duration, start time and end time.
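Dissolving flagged track logs into stop centers is a straightforward group-by; the DataFrame below, with hypothetical stop IDs already assigned, illustrates the attributes described:

```python
import pandas as pd

# Hypothetical track logs already flagged as stops and given stop IDs.
logs = pd.DataFrame({
    "stop_id": [1, 1, 1, 2, 2],
    "x": [10.0, 11.0, 9.0, 500.0, 502.0],
    "y": [20.0, 21.0, 19.0, 700.0, 699.0],
    "t": pd.to_datetime(
        ["2014-01-01 09:00", "2014-01-01 09:01", "2014-01-01 09:02",
         "2014-01-01 12:00", "2014-01-01 12:01"]),
})

# Dissolve each stop into a "stop center": mean location plus start time,
# end time, and duration.
centers = logs.groupby("stop_id").agg(
    x=("x", "mean"), y=("y", "mean"),
    start=("t", "min"), end=("t", "max"))
centers["duration"] = centers["end"] - centers["start"]
print(centers)
```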
4 Results

The stay point detection algorithm described above was applied to data collected
for the study by Evans et al. (2012). The ultimate goal of this work is to monitor
how patients return to the community after rehabilitation treatment following a
stroke. Objective measures of community mobility (i.e., ability to get around in the
community) allow health care practitioners to infer patients’ functional status and
monitor how patients respond to clinical intervention. One way to measure com-
munity mobility or community participation is to calculate the number of stop
episodes where one or more activities are likely to be conducted.
Subjects provided informed consent to participate in the study, involving carry-
ing a GPS logger (GlobalSat DG-100 Data Logger or QStarz Travel recorder XT)
during their waking hours for 1 week. One week was chosen as a monitoring period
because similar types of activities are typically repeated over a 1-week period,
which allows for capturing a minimally distinct set of activities. Recording interval
(r) was set to 30 s, and data were collected in a passive and continuous mode.
Therefore, one set of GPS trajectory data is comprised of track logs continuously
recorded for 1 week. Data were collected from May 2009 to December 2014. We collected data from two sets of subjects (control subjects and patients) at 1 week, 5 weeks, 9 weeks, 6 months, and 1 year after a baseline time (such as the time of rehabilitation treatment). We processed a total of 78 weeks' worth of data from 17 subjects.
Fig. 2 Evaluation of the proposed method. (a) Comparison of # stops detected by the previous
work against # stops validated. (b) Comparison of # stops detected by the current work against #
stops validated
5 Conclusion
The proposed method combines gap treatment with DBSCAN adapted to temporal
criteria to detect stay points from GPS trajectories. DBSCAN does not detect stay
points when gaps are present and the number of data points near gaps is less than
MinPts. For temporal DBSCAN to detect a stay point even in the presence of gaps,
qualifying gaps (i.e., those longer than t) are filled first by adding k (= MinPts + 1) data points whose position and time are linearly interpolated. Then track logs that meet the spatiotemporal constraints (i.e., high density, minimal time duration) are detected as a stay point. If the two data points bounding a qualifying gap are far apart, the gap will be detected as a move episode. If the two data points bounding a qualifying gap are close to each other (less than eps apart), the gap will be detected as a stop episode. Finally, majority filtering was performed with a moving window consisting of t/r − 1 (five in this study) consecutive data points to smooth out any misclassified values (stop or move) marked for track logs.
We evaluated the performance of the proposed method on nine trajectory datasets. Performance was measured by how closely the number of stops detected by the proposed method aligns with the number of stops validated. When the performance of the previous work (Hwang et al. 2013) is compared with that of the current work, the current work outperforms the previous work. This
implies that gap treatment (which is new in the current work compared to the
previous work) contributes to improved performance of the stay point detection
method based on spatiotemporal criteria. Further, the statistically significant corre-
lation between the number of stops validated through visual inspection and the
number of stops detected by the proposed method indicates that the proposed
method detects stay points to a satisfactory level of accuracy. The contribution of
this study is that stay point detection is also assessed with respect to gaps in the data.
This study also explores the effect of uncertainty handling (such as gap treatment)
on performance, which is often overlooked in trajectory data analysis.
Several limitations of this research should be acknowledged. The performance
of the proposed method can be better evaluated. One way to improve evaluation is
to investigate the degree of match between processed data and validation data at the
unit of track logs (rather than at the unit of stops). Another way to improve
evaluation is to ask participants to annotate spatiotemporal attributes of sample
stops, and use those annotated data as baseline data for determining the perfor-
mance of the proposed method. Although a density-based clustering algorithm has
an advantage that differentiates stops from noises (moves) reliably, appropriate
values of parameters (such as search radius, the minimum number of data points,
and the minimum time duration) should be empirically determined in relation to
recording time interval, logging mode, and spatial accuracy of GPS trajectories.
This can be done in future research.
The accuracy of stay point detection is crucial to segmenting trajectories into
meaningful elements that are of value to a given application. For instance, seman-
tics of stop episodes (e.g., type of location, time of visits, and characteristics of their
surroundings) can be inferred with further analysis, which provides clues to under-
standing activity patterns of individuals. Similarly, inferring semantics of move
episodes (e.g., mode of transportation, level of physical activities, and routes taken)
will be useful in understanding transportation mode choice, determining features of
the built environment that promote physical activity, and estimating transportation-
related carbon footprints of individuals. It is hoped that this paper demonstrates the
importance of, and various strategies for, segmenting trajectories as a preliminary
to furthering knowledge and discovery in Big Data.
References
Ashbrook D, Starner T (2002) Learning significant locations and predicting user movement with
GPS. In: Proceedings of the sixth international symposium on wearable computers, 2002
(ISWC 2002). pp 101–108
Birant D, Kut A (2007) ST-DBSCAN: an algorithm for clustering spatial–temporal data. Data
Knowl Eng 60(1):208–221, http://www.sciencedirect.com/science/article/pii/S0169023X06
000218. Accessed 10 July 2015
Dodge S, Laube P, Weibel R (2012) Movement similarity assessment using symbolic representa-
tion of trajectories. Int J Geogr Inf Sci 26(9):1563–1588, http://dx.doi.org/10.1080/13658816.
2011.630003. Accessed 10 July 2015
Duncan MJ, Badland HM, Mummery WK (2009) Applying GPS to enhance understanding of
transport-related physical activity. J Sci Med Sport 12(5):549–556, http://www.sciencedirect.
com/science/article/pii/S1440244008002107. Accessed 10 July 2015
Eagle N, (Sandy) Pentland A (2006) Reality mining: sensing complex social systems. Pers
Ubiquitous Comput 10(4):255–268, http://dx.doi.org/10.1007/s00779-005-0046-3. Accessed
10 July 2015
Ester M, et al (1996) A density-based algorithm for discovering clusters in large spatial databases
with noise. AAAI Press, pp 226–231
Evans CC et al (2012) Monitoring community mobility with global positioning system technology
after a stroke: a case study. J Neurol Phys Ther 36(2):68–78, http://content.wkhealth.com/
linkback/openurl?sid¼WKPTLP:landingpage&an¼01253086-201206000-00004. Accessed
10 July 2015
Foth M, Choi JH, Satchell C (2011) Urban informatics. In: Proceedings of the ACM 2011
conference on computer supported cooperative work, CSCW’11. ACM, New York, pp 1–8.
http://doi.acm.org/10.1145/1958824.1958826. Accessed 10 July 2015
Hwang S, Hanke T, Evans C (2013) Automated extraction of community mobility measures from
GPS stream data using temporal DBSCAN. In: Murgante B, et al (eds) Computational science
and its applications—ICCSA 2013 (Lecture notes in computer science). Springer, Berlin, pp
86–98. http://link.springer.com/chapter/10.1007/978-3-642-39643-4_7. Accessed 10 July
2015
Krenn PJ et al (2011) Use of global positioning systems to study physical activity and the
environment: a systematic review. Am J Prev Med 41(5):508–515, http://www.sciencedirect.
com/science/article/pii/S0749379711005460. Accessed 10 July 2015
Li Q, et al (2008) Mining user similarity based on location history. In: Proceedings of the 16th
ACM SIGSPATIAL international conference on advances in geographic information systems,
GIS’08. ACM, New York, pp 34:1–34:10. http://doi.acm.org/10.1145/1463434.1463477.
Accessed 10 July 2015
Liao L, Fox D, Kautz H (2007) Extracting places and activities from GPS traces using hierarchical
conditional random fields. Int J Rob Res 26(1):119–134, http://ijr.sagepub.com/content/26/1/
119. Accessed 10 July 2015
Marcus F (2008) Handbook of research on urban informatics: the practice and promise of the real-time city. IGI Global, Hershey, PA
Pang LX et al (2013) On detection of emerging anomalous traffic patterns using GPS data. Data
Knowl Eng 87:357–373, http://www.sciencedirect.com/science/article/pii/S0169023X13000475.
Accessed 10 July 2015
Parent C, et al (2013) Semantic trajectories modeling and analysis. ACM Comput Surv 45(4):
42:1–42:32. http://doi.acm.org/10.1145/2501654.2501656. Accessed 10 July 2015
Richardson DB et al (2013) Spatial turn in health research. Science 339(6126):1390–1392, http://
www.ncbi.nlm.nih.gov/pmc/articles/PMC3757548/. Accessed 10 July 2015
Rocha JAMR, et al (2010) DB-SMoT: a direction-based spatio-temporal clustering method. In:
Intelligent systems (IS), 2010 5th IEEE international conference. pp 114–119
Rodrigues A, Damásio C, Cunha JE (2014) Using GPS logs to identify agronomical activities. In:
Huerta J, Schade S, Granell C (eds) Connecting a digital Europe through location and place
(Lecture notes in geoinformation and cartography). Springer, pp 105–121. http://link.springer.
com/chapter/10.1007/978-3-319-03611-3_7. Accessed 10 July 2015
Schoier G, Borruso G (2011) Individual movements and geographical data mining. Clustering
algorithms for highlighting hotspots in personal navigation routes. In: Murgante B, et al (eds)
Computational science and its applications—ICCSA 2011 (Lecture notes in computer sci-
ence). Springer, Berlin, pp 454–465. http://link.springer.com/chapter/10.1007/978-3-642-
21928-3_32. Accessed 10 July 2015
Schuessler N, Axhausen K (2009) Processing raw data from global positioning systems without
additional information. Transp Res Rec J Transp Res Board 2105:28–36, http://
trrjournalonline.trb.org/doi/abs/10.3141/2105-04. Accessed 10 July 2015
Stopher P, FitzGerald C, Zhang J (2008) Search for a global positioning system device to measure
person travel. Transp Res Part C Emerg Technol 16(3):350–369, http://www.sciencedirect.
com/science/article/pii/S0968090X07000836. Accessed 10 July 2015
Winter S, Yin Z-C (2010) Directed movements in probabilistic time geography. Int J Geogr Inf Sci
24(9):1349–1365, http://dx.doi.org/10.1080/13658811003619150. Accessed 10 July 2015
Ye Y, et al (2009) Mining individual life pattern based on location history. In: Tenth international
conference on mobile data management: systems, services and middleware, 2009, MDM’09.
pp 1–10
Yuan NJ et al (2013) T-finder: a recommender system for finding passengers and vacant taxis.
IEEE Trans Knowl Data Eng 25(10):2390–2403
Zheng Y (2015) Trajectory data mining: an overview. ACM Trans Intell Syst Technol 6(3):
29:1–29:41. http://doi.acm.org/10.1145/2743025. Accessed 10 July 2015
Zheng Y, et al (2011) Recommending friends and locations based on individual location history.
ACM Trans Web 5(1): 5:1–5:44. http://doi.acm.org/10.1145/1921591.1921596. Accessed
10 July 2015
Zheng Y, et al (2014) Urban computing: concepts, methodologies, and applications. ACM Trans
Intell Syst Technol 5(3): 38:1–38:55. http://doi.acm.org/10.1145/2629592. Accessed 10 July
2015
Part VI
Emergencies and Crisis
Using Social Media and Satellite Data
for Damage Assessment in Urban Areas
During Emergencies
G. Cervone (*)
GeoInformatics and Earth Observation Laboratory, Department of Geography and Institute
for CyberScience, The Pennsylvania State University, University Park, PA, USA
Research Application Laboratory, National Center for Atmospheric Research,
Boulder, CO, USA
e-mail: [email protected]
E. Schnebele • N. Waters • M. Moccaldi • R. Sicignano
GeoInformatics and Earth Observation Laboratory, Department of Geography and Institute
for CyberScience, The Pennsylvania State University, University Park, PA, USA
1 Introduction
Every year natural hazards are responsible for powerful and extensive damage to
people, property, and the environment. Drastic population growth, especially along
coastal areas or in developing countries, has increased the risk posed by natural
hazards to large, vulnerable populations at unprecedented levels (Tate and
Frazier 2013). Furthermore, unusually strong and frequent weather events are
occurring worldwide, causing floods, landslides, and droughts affecting thousands
of people (Smith and Katz 2013). A single catastrophic event can claim thousands
of lives, cause billions of dollars of damage, trigger a global economic depression,
destroy natural landmarks, render a large territory uninhabitable, and destabilize the
military and political balance in a region (Keilis-Borok 2002). Furthermore, the
increasing urbanization of human society, including the emergence of megacities,
has led to highly interdependent and vulnerable social infrastructure that may lack
the resilience of a more agrarian, traditional society (Wenzel et al. 2007). In urban
areas, it is crucial to develop new ways of assessing damage in real-time to aid in
mitigating the risks posed by hazards. Annually, the identification, assessment, and
repair of damage caused by hazards requires thousands of work hours and billions
of dollars.
Remote sensing data are of paramount importance during disasters and have
become the de-facto standard for providing high resolution imagery for damage
assessment and the coordination of disaster relief operations (Cutter 2003; Joyce
et al. 2009). First responders rely heavily on remotely sensed imagery for coordi-
nation of relief and response efforts as well as the prioritizing of resource allocation.
Determining the location and severity of damage to transportation infrastructure
is particularly critical for establishing evacuation and supply routes as well as repair
and maintenance agendas. Following the Colorado floods of September 2013, over 1000 bridges required inspection, and approximately 200 miles of highway and 50 bridges were destroyed.1 A variety of assessment techniques were utilized following Hurricane Katrina in 2005 to evaluate transportation infrastructure, including visual, non-destructive, and remote sensing methods. However, the assessment
of transportation infrastructure over such a large area could have been accelerated
through the use of high resolution imagery and geospatial analysis
(Uddin 2011; Schnebele et al. 2015).
Despite the wide availability of large remote sensing datasets from numerous
sensors, specific data might not be collected in the time and space most urgently
required. Geo-temporal gaps result due to satellite revisit time limitations, atmo-
spheric opacity, or other obstructions. However, aerial platforms, especially
Unmanned Aerial Vehicles (UAVs), can be quickly deployed to collect data
about specific regions and be used to complement satellite imagery. UAVs are
capable of providing high resolution, near real-time imagery often with less
expense than manned aerial- or space-borne platforms. Their quick response
times, high maneuverability, and resolution make them important tools for disaster assessment (Tatham 2009).

1 http://www.denverpost.com/breakingnews, http://www.thedenverchannel.com/news/local-news.
Contributed data that contain spatial and temporal information can provide
valuable Volunteered Geographic Information (VGI), harnessing the power of
‘citizens as sensors’ to provide a multitude of on-the-ground data, often in real
time (Goodchild 2007). Although these volunteered data are often published without scientific intent, and usually carry little scientific merit, it is still possible to mine mission-critical information. For example, during Hurricane Katrina,
geolocated pictures and videos searchable through Google provided early emer-
gency response with ground-view information. These data have been used during
major events, with the capture, in near real-time, of the evolution and impact of
major hazards (De Longueville et al. 2009; Pultar et al. 2009; Heverin and
Zach 2010; Vieweg et al. 2010; Acar and Muraki 2011; Verma et al. 2011; Earle
et al. 2012; Tyshchuk et al. 2012).
Volunteered data can be employed to provide timely damage assessments, to aid rescue and relief operations, and to optimize engineering reconnaissance (Laituri and Kodrich 2008; Dashti et al. 2014; Schnebele and Cervone 2013; Schnebele et al. 2014a, b). While the quantity and real-time avail-
ability of VGI make it a valuable resource for disaster management applications,
data volume, as well as its unstructured, heterogeneous nature, make the effective
use of VGI challenging. Volunteered data can be diverse, complex, and overwhelm-
ing in volume, velocity, and in the variety of viewpoints they offer. Negotiating
these overwhelming streams is beyond the capacity of human analysts. Current
research offers some novel capabilities to utilize these streams in new, ground-
breaking ways, leveraging, fusing and filtering this new generation of air-, space-
and ground-based sensor-generated data (Oxendine et al. 2014).
2 Data
Multiple sources of contributed, remote sensing, and open source geospatial data were
collected and utilized during this research. All the data relate to the 2013 Colorado floods and were collected between September 11 and 17, 2013. Most of the data were
collected in real time, as the event was unfolding. A summary of the sources and
collection dates of the contributed and remote sensing data is available in Table 1.
Table 1 Sources and collection dates of contributed and remote sensing data

Data               9.11  9.12  9.13  9.14  9.15  9.16  9.17
Contributed
  Twitter           X     X     X     X     X     X     X
  Photos            X     X     X     X
Remote sensing
  Landsat-8                                             X
  Civil Air Patrol                    X     X     X     X
2.1 Contributed Data

2.1.1 Twitter
The social networking site Twitter is used by the public to share information about
their daily lives through micro-blogging. These micro-blogs, or ‘tweets’, are limited
to 140 characters, so abbreviations and colloquial phrasing are common, making
the automation of filtering by content challenging. Different criteria are often
applied for filtered and directed searches of Twitter content. For example, a
hashtag2 is an identifier unique to Twitter and is frequently used as a search tool.
The creation and use of a hashtag can be established by any user and may develop a
greater public following if it is viewed as useful, popular, or providing current
information. Other search techniques may use keywords or location for filtering.
Twitter data were collected using a geospatial database set up at The Pennsylvania State University (PSU) GeoVISTA center. Although the volume of the Twitter data stream is huge, only a small percentage of tweets contain geolocation information. For this particular study, about 2000 geolocated tweets with the hashtag #boulderflood were collected.
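As an illustration, the following minimal Python sketch filters geolocated tweets carrying a target hashtag from a collected stream; the record schema (fields such as text and coordinates) is hypothetical and stands in for whatever the collection database actually stores.

```python
# Minimal sketch: keep only geolocated tweets that carry a target hashtag.
# The record layout below is hypothetical; adapt it to the fields actually
# stored by the collection database.

def filter_geolocated(tweets, hashtag="#boulderflood"):
    """Return (lon, lat, text) triples for geolocated tweets with the hashtag."""
    kept = []
    for t in tweets:
        coords = t.get("coordinates")          # None for non-geolocated tweets
        if coords is None:
            continue
        if hashtag.lower() in t["text"].lower():
            lon, lat = coords                  # assumed (lon, lat) ordering
            kept.append((lon, lat, t["text"]))
    return kept

sample = [
    {"text": "Creek rising fast #boulderflood", "coordinates": (-105.27, 40.015)},
    {"text": "Sunny day in Denver", "coordinates": None},
]
print(filter_geolocated(sample))
```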
2.1.2 Photos
In addition, a total of 80 images relating to the period from 11–14 September 2013 were downloaded through the website of the city of Boulder (https://bouldercolorado.gov/flood). These images did not contain geolocation information, but they included a description of when and where they were acquired. The Google geocoding API was used to convert each spatial description to precise longitude and latitude coordinates.
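A sketch of this georectification step is shown below, assuming the Google Geocoding web service; the endpoint is real, but the API key and the example description are placeholders.

```python
# Sketch: convert a textual location description to coordinates using the
# Google Geocoding API. GOOGLE_API_KEY is a placeholder; requests must be
# installed (pip install requests).
import requests

GOOGLE_API_KEY = "YOUR_KEY_HERE"

def geocode(description):
    """Return (lat, lng) for a free-form place description, or None."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": description, "key": GOOGLE_API_KEY},
        timeout=10,
    )
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Example (hypothetical photo description):
# geocode("Broadway and Canyon Blvd, Boulder, CO")
```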
While in the current research the data were semi-manually georectified, it is
possible to use services such as Flickr or Instagram to automatically download
geolocated photos. These services can provide additional crucial information dur-
ing emergencies because photos are easily verifiable and can contain valuable
information for transportation assessment.
2 A word or an unspaced phrase prefixed with the sign #.
2.2 Remote Sensing

2.2.1 Satellite

Landsat 8 OLI/TIRS imagery was downloaded from the USGS Hazards Data Distribution System (HDDS). Landsat 8 provides eight spectral bands with a resolution of 30 m: Band 1 (coastal aerosol, useful for coastal and aerosol studies, 0.43–0.45 μm); Bands 2–4 (optical, 0.45–0.51, 0.53–0.59, 0.64–0.67 μm); Band 5 (near-IR, 0.85–0.88 μm); Bands 6 and 7 (shortwave-IR, 1.57–1.65, 2.11–2.29 μm); and Band 9 (cirrus, useful for cirrus cloud detection, 1.36–1.38 μm). In addition, a 15 m panchromatic band (Band 8, 0.50–0.68 μm) and two 100 m thermal-IR bands (Bands 10 and 11, 10.60–11.19, 11.50–12.51 μm) were also collected.
2.2.2 Aerial
Aerial photos collected by the Civil Air Patrol (CAP), the civilian auxiliary of the US Air Force, captured from 14–17 September 2013 in the areas surrounding Boulder (105.5364–104.9925° W longitude and 40.26031–39.93602° N latitude), provided an additional source of remotely sensed data. The georeferenced Civil Air Patrol RGB composite photos were downloaded from the USGS Hazards Data Distribution System (HDDS).
2.3 Open Source Geospatial Data

Shapefiles defining the extent of the City of Boulder and Boulder County were downloaded from the City of Boulder3 and the Colorado Department of Local Affairs4 websites, respectively. In addition, a polyline shapefile of the road network for Boulder County was downloaded from the US Census Bureau.5
3 Methods
The proposed methodology is based on the fusion of contributed data with remote sensing imagery for damage assessment of transportation infrastructure. For the Colorado floods of 2013, supervised machine learning classifications were employed to identify water in each of the satellite images; the resulting water pixels delineate the observed flood extent (Fig. 1).
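The chapter's water classification is supervised; as a simpler, illustrative alternative, the sketch below thresholds the Normalized Difference Water Index (NDWI), computed from Landsat 8's green (Band 3) and near-IR (Band 5) reflectance. The file names and the threshold are placeholders, not the authors' actual pipeline.

```python
# Illustrative alternative to the supervised classifier: flag water pixels by
# thresholding NDWI = (green - NIR) / (green + NIR), using Landsat 8 Band 3
# (green) and Band 5 (near-IR). File paths and the 0.0 threshold are
# placeholders; rasterio must be installed (pip install rasterio).
import numpy as np
import rasterio

with rasterio.open("LC08_B3.TIF") as g, rasterio.open("LC08_B5.TIF") as n:
    green = g.read(1).astype("float64")
    nir = n.read(1).astype("float64")

ndwi = (green - nir) / (green + nir + 1e-9)   # small epsilon avoids 0/0
water = ndwi > 0.0                            # boolean water mask
print("water pixels:", int(water.sum()))
```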
3 https://bouldercolorado.gov.
4 http://www.colorado.gov.
5 http://www.census.gov.
$$\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right) \qquad (1)$$
where K(u) is the kernel function and h is the bandwidth (Raykar and Duraiswami 2006). There are different kinds of kernel density estimators, such as the Epanechnikov, triangular, Gaussian, and rectangular kernels. The density estimator chosen for this work is a Gaussian kernel with zero mean and unit variance, having the following form:
$$\hat{p}(x) = \frac{1}{N\sqrt{2\pi h^2}} \sum_{i=1}^{N} e^{-(x - x_i)^2 / 2h^2} \qquad (2)$$
In the multivariate case, the estimator generalizes to

$$\hat{p}(x; H) = N^{-1} \sum_{i=1}^{N} K_H(x - X_i) \qquad (3)$$
where:

$x = (x_1, x_2, \ldots, x_d)^T$ is the evaluation point,
$X_i = (X_{i1}, X_{i2}, \ldots, X_{id})^T$, $i = 1, 2, \ldots, N$, are the data points, and
$K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$.
In this case K(x) is the spatial kernel and H is the bandwidth matrix, which is symmetric and positive-definite. As in the uni-dimensional case, an optimal bandwidth matrix has to be chosen, for example using the method illustrated in Duong and Hazelton (2005). In this project, data have been interpolated using the R command smoothScatter of the package graphics, which is based on the Fast Fourier transform. It is a variation of regular kernel interpolation that reduces the computational complexity from O(N²) to O(N). The O (‘big O’) notation is a Computer Science metric used to quantify and compare the complexity of algorithms (Knuth 1976): O(N) indicates linear complexity, whereas O(N²) indicates a much higher, quadratic, complexity. Generally, the bandwidth is calculated automatically using the R command bkde2D of the R package KernSmooth. However, the bandwidth for the tweet interpolation was specified explicitly because the information tweets carry is given a lower weight.
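The text performs this step in R; as a language-agnostic illustration, the minimal Python sketch below evaluates a 2D Gaussian kernel density surface on a regular grid by direct summation (not the FFT variant used by smoothScatter), with an explicitly chosen bandwidth. The grid size, bandwidth, and synthetic point locations are illustrative only.

```python
# Minimal 2D Gaussian kernel density surface (direct evaluation, not the FFT
# variant used by R's smoothScatter). Bandwidth h and grid size are
# illustrative choices.
import numpy as np

def kde2d(xs, ys, h=0.01, grid=100):
    """Evaluate a Gaussian KDE for points (xs, ys) on a grid x grid lattice."""
    gx = np.linspace(xs.min(), xs.max(), grid)
    gy = np.linspace(ys.min(), ys.max(), grid)
    X, Y = np.meshgrid(gx, gy)
    density = np.zeros_like(X)
    for x0, y0 in zip(xs, ys):
        density += np.exp(-((X - x0) ** 2 + (Y - y0) ** 2) / (2 * h ** 2))
    density /= len(xs) * 2 * np.pi * h ** 2   # normalize the Gaussian kernels
    return gx, gy, density

rng = np.random.default_rng(0)
lon = -105.27 + 0.02 * rng.standard_normal(200)   # synthetic tweet locations
lat = 40.015 + 0.02 * rng.standard_normal(200)
gx, gy, d = kde2d(lon, lat)
```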
Fig. 1 Water pixel classification using Landsat 8 data collected 12 May 2013 (a) and
17 September 2013 (b). The background images are the Landsat band 5 for each of the 2 days
[Fig. 2 map; legend layers: CAP, Photos, Tweets, Landsat, BoulderCity, BoulderCounty]
Fig. 2 Classified Landsat 8 image collected 17 September 2013 image and interpolated ancillary
data give an indication of flood activity around the Boulder, CO area. While some data sources
overlap, others have a very different spatial extent
Using contributed data points (Fig. 3a), a flooding and damage surface is inter-
polated using a kernel density smoothing application as discussed in Sect. 3.2 for
the downtown Boulder area. After an interpolated surface is created from each data
set (tweets and photos), they are combined using a weighted sum overlay approach.
The tweets layer is assigned a weight of 1 and the photos layer a weight of 2. A higher weight is assigned to the photos layer because information in photos can be more easily verified; there is therefore a higher level of confidence in this data set. The weighted layers are summed, yielding a flooding and damage assess-
ment surface created solely from contributed data (Fig. 3b). This surface is then
paired with a high resolution road network layer. Roads are identified as potentially
compromised or impassable based on the underlying damage assessment (Fig. 3c).
In a final step, the classified roads are compared to roads closed by the Boulder
Emergency Operations Center (EOC) from 11–15 September 2013 (Fig. 3d).
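A minimal sketch of this weighted sum overlay is given below, assuming the two interpolated surfaces are already co-registered numpy grids. The weights (1 and 2) follow the text, while the road-hazard threshold and toy grids are placeholders.

```python
# Weighted sum overlay of two co-registered damage surfaces (numpy grids).
# Weights follow the text: tweets = 1, photos = 2. The hazard threshold used
# to flag road cells is a placeholder.
import numpy as np

def damage_surface(tweet_density, photo_density, w_tweets=1.0, w_photos=2.0):
    """Combine interpolated surfaces into a single damage-assessment surface."""
    return w_tweets * tweet_density + w_photos * photo_density

def flag_roads(surface, road_mask, threshold=0.5):
    """Mark road cells whose underlying damage value exceeds the threshold."""
    return road_mask & (surface > threshold)

# Toy example: 4x4 grids standing in for the interpolated layers.
tweets = np.random.default_rng(1).random((4, 4))
photos = np.random.default_rng(2).random((4, 4))
roads = np.zeros((4, 4), dtype=bool)
roads[2, :] = True                      # one road row crossing the grid
print(flag_roads(damage_surface(tweets, photos), roads))
```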
Fig. 3 Using contributed data geolocated in the downtown Boulder area (a), an interpolated damage surface is created (b), which, when paired with a road network, classifies potentially compromised roads (c). Panel (d) shows roads closed by the Boulder Emergency Operations Center (EOC) that were also classified using this approach
5 Conclusions
Big data, such as those generated through social media platforms, provide unprec-
edented access to real-time, on-the-ground information during emergencies,
supplementing traditional, standard data sources such as space- and aerial-based
remote sensing. In addition to inherent gaps in remote sensing data due to platform
or revisit limitations or atmospheric interference, obtaining data and information
for urban areas can be especially challenging because of resolution requirements.
Identifying potential damage at the street or block level provides an opportunity for
triaging site visits for evaluation. However, utilizing big data efficiently and
effectively is challenging owing to its complexity, size, and in the case of social
media, heterogeneity (variety). New algorithms and techniques are required to
harness the power of contributed data in real-time for emergency applications.
This paper presents a new methodology for locating natural hazards using
contributed data, in particular Twitter. Once remote sensing data are collected,
they, in combination with contributed data, can be used to provide an assessment of
the ensuing damage. While Twitter is effective at identifying ‘hot spots’ at the city level, other sources (e.g. photos) provide supplemental information with finer detail at the street level. In addition, remote sensing data may be limited by
revisit times or cloud cover, so contributed ground data provide an additional
source of information.
Challenges associated with utilizing contributed data, such as questions of producer anonymity and geolocation accuracy, as well as differing levels of data confidence, make the application of these data during hazard events especially demanding. In addition to identifying a particular hazard, in this case flood waters, pairing the interpolated damage assessment surface with a road network creates a classified ‘road hazards map’ which can be used to triage and optimize site inspections or tasks for additional data collection.
Acknowledgements Work performed under this project has been partially funded by the Office
of Naval Research (ONR) award #N00014-14-1-0208 (PSU #171570).
References
Acar A, Muraki Y (2011) Twitter for crisis communication: lessons learned from Japan’s tsunami
disaster. Int J Web Based Communities 7(3):392–402
Cutter SL (2003) Giscience, disasters, and emergency management. Trans GIS 7(4):439–446
Dashti S, Palen L, Heris M, Anderson K, Anderson S, Anderson T (2014) Supporting disaster
reconnaissance with social media data: a design-oriented case study of the 2013 Colorado
floods. In: Proceedings of the 11th international conference on information systems for crisis
response and management. ISCRAM, pp 630–639
De Longueville B, Smith R, Luraschi G (2009) OMG, from here, I can see the flames!: a use case
of mining location based social networks to acquire spatio-temporal data on forest fires. In:
Proceedings of the 2009 international workshop on location based social networks. ACM,
New York, pp 73–80
Duong T, Hazelton ML (2005) Cross-validation bandwidth matrices for multivariate kernel
density estimation. Scand J Stat 32(3):485–506
Earle P, Bowden D, Guy M (2012) Twitter earthquake detection: earthquake monitoring in a social
world. Ann Geophys 54(6):708–715
Goodchild M (2007) Citizens as sensors: the world of volunteered geography. GeoJournal 69
(4):211–221
Heverin T, Zach L (2010) Microblogging for crisis communication: examination of Twitter use in
response to a 2009 violent crisis in the Seattle-Tacoma, Washington, area. In: Proceedings of
the 7th international conference on information systems for crisis response and management.
ISCRAM, pp 1–5
Joyce KE, Belliss SE, Samsonov SV, McNeill SJ, Glassey PJ (2009) A review of the status of
satellite remote sensing and image processing techniques for mapping natural hazards and
disasters. Prog Phys Geogr 33(2):183–207
Keilis-Borok V (2002) Earthquake prediction: state-of-the-art and emerging possibilities. Annu
Rev Earth Planet Sci 30:1–33
Knuth DE (1976) Big omicron and big omega and big theta. ACM Sigact News 8(2):18–24
Laituri M, Kodrich K (2008) On line disaster response community: people as sensors of high
magnitude disasters using internet GIS. Sensors 8(5):3037–3055
Mühlenstädt T, Kuhnt S (2011) Kernel interpolation. Comput Stat Data Anal 55(11):2962–2974
Oxendine C, Schnebele E, Cervone G, Waters N (2014) Fusing non-authoritative data to improve
situational awareness in emergencies. In: Proceedings of the 11th international conference on
information systems for crisis response and management. ISCRAM, pp 760–764
Pultar E, Raubal M, Cova T, Goodchild M (2009) Dynamic GIS case studies: wildfire evacuation
and volunteered geographic information. Trans GIS 13(1):85–104
Raykar VC, Duraiswami R (2006) Fast optimal bandwidth selection for kernel density estimation.
In: SDM. SIAM, Philadelphia, pp 524–528
Ripley B (2008) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Schnebele E, Cervone G (2013) Improving remote sensing flood assessment using volunteered
geographical data. Nat Hazards Earth Syst Sci 13:669–677
Schnebele E, Cervone G, Kumar S, Waters N (2014a) Real time estimation of the Calgary floods
using limited remote sensing data. Water 6:381–398
Schnebele E, Cervone G, Waters N (2014b) Road assessment after flood events using
non-authoritative data. Nat Hazards Earth Syst Sci 14(4):1007–1015
Schnebele E, Tanyu B, Cervone G, Waters N (2015) Review of remote sensing methodologies for
pavement management and assessment. Eur Transp Res Rev 7(2):1–19
Smith L (1997) Satellite remote sensing of river inundation area, stage, and discharge: a review.
Hydrol Process 11(10):1427–1439
Smith AB, Katz RW (2013) US billion-dollar weather and climate disasters: data sources, trends,
accuracy and biases. Nat Hazards 67(2):387–410
Tate C, Frazier T (2013) A GIS methodology to assess exposure of coastal infrastructure to storm
surge & sea-level rise: a case study of Sarasota County, Florida. J Geogr Nat Disasters
1:2167–0587
Tatham P (2009) An investigation into the suitability of the use of unmanned aerial vehicle
systems (UAVs) to support the initial needs assessment process in rapid onset humanitarian
disasters. Int J Risk Assess Manage 13(1):60–78
Tobler WR (1970) A computer movie simulating urban growth in the Detroit region. Econ Geogr
46:234–240
Tyshchuk Y, Hui C, Grabowski M, Wallace W (2012) Social media and warning response impacts
in extreme events: results from a naturally occurring experiment. In: Proceedings of the 45th
annual Hawaii international conference on system sciences (HICSS). IEEE, New York, pp
818–827
Uddin W (2011) Remote sensing laser and imagery data for inventory and condition assessment of
road and airport infrastructure and GIS visualization. Int J Roads Airports (IJRA) 1(1):53–67
Verma S, Vieweg S, Corvey W, Palen L, Martin J, Palmer M, Schram A, Anderson K (2011)
Natural language processing to the rescue? Extracting “situational awareness” Tweets during
mass emergency. In: Proceedings of the 5th international conference on weblogs and social
media. AAAI, Palo Alto
Vieweg S, Hughes A, Starbird K, Palen L (2010) Microblogging during two natural hazards
events: what twitter may contribute to situational awareness. In: Proceedings of the 28th
international conference on human factors in computing systems. ACM, New York, pp
1079–1088
Waters N (2017) Tobler’s first law of geography. In: Richardson D, Castree N, Goodchild MF,
Kobayashi A, Liu W, Marston R (eds) The international encyclopedia of geography. Wiley,
New York, In press
Wenzel F, Bendimerad F, Sinha R (2007) Megacities–megarisks. Nat Hazards 42(3):481–491
Part VII
Health and Well-Being
‘Big Data’: Pedestrian Volume Using Google
Street View Images
L. Yin (*) • L. Wu
Department of Urban and Regional Planning, University at Buffalo, The State University of
New York, Buffalo, NY 14214, USA
e-mail: [email protected]; [email protected]
Q. Cheng
Department of Electronics and Information Engineering, Huazhong University of Science and
Technology, Wuhan 430074, China
Z. Shao
The State Key Laboratory of Information Engineering on Surveying Mapping and Remote
Sensing, Wuhan University, Wuhan 430079, China
Z. Wang
Department of Urban and Regional Planning, University at Buffalo, The State University of
New York, Buffalo, NY 14214, USA
Center for Human-Engaged Computing, Kochi University of Technology, 185 Miyanokuchi,
Tosayamada-Cho, Kami-Shi, Kochi 782-8502, Japan
1 Introduction
Findings from recent research have suggested a link between the built environment
and physical activity. Responding to the widespread and growing interest in walkable, transit-oriented development and healthy communities, many recent studies in planning and public health have focused on improving the pedestrian environment
(Ewing and Bartholomew 2013; Yin 2014; Yin 2013; Clifton et al. 2007). Pedestrian counts are used as a quantitative measure of pedestrian volume to help evaluate walkability and how it correlates with land use and other built environment characteristics. They can also be used as baseline data to help inform planning and decision-making.
Pedestrians have been the subject of increasing attention among planners,
engineers and public health officials. There is, however, inadequate research on
pedestrian volume and movement. In addition, methods of collecting detailed information about pedestrian activity have been insufficient and inefficient (Hajrasouliha and Yin 2014). Pedestrian count data have been collected through field work, self-reported surveys, and automated counting. Field work and self-reported surveys are more subjective than automatic counts using video-taping or sensors.
Most pedestrian counts are done manually because of the high cost associated with
using technologies such as laser scanners and infrared counters.
With the recent rapid development of internet and cloud computing, we are
entering the era of ‘big data’ with the “Internet of People and Things” and the
“Internet of Everything” (O’Leary 2013). The term ‘big data’ describes both the process of making the rapidly expanding amount of digital information analyzable and “the actual use of that data as a means to improve productivity, generate and facilitate innovation and improve decision making” (O’Leary 2013).
Google Street View provides panoramic views along many streets of the
U.S. and around the world and is readily available to anyone with access to the
internet. Although some parts of the images such as people’s faces or automobile
license plates are blurred to protect privacy, they are still potentially useful to
identify the number of pedestrians on a particular street for generating patterns of
walkability across a city. Human detection based on images and videos has had a
wide range of applications in a variety of fields including robotics and intelligent
transportation for collision prediction, driver assistance, and demographic recogni-
tion (Prioletti et al. 2013; Gallagher and Chen 2009). This study introduces an
image-based machine learning method to planners for detecting pedestrian activity
from Google Street View images, aiming to provide and recommend to future research an alternative method for collecting pedestrian counts more consistently and objectively, and to stimulate discussion of the use of ‘big data’ for planning and
design. The results can help researchers to analyze built environment walkability
and livability.
2 Pedestrian Detection
The detection objective follows the scoring function used in Felzenszwalb et al. (2008), which takes the following form:

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z) \qquad (1)$$

where β is a vector of model parameters, z are latent values (such as part placements), and Φ(x, z) is a feature vector.
2.1 Validation
The detection algorithm discussed above was used on images of over 300 street
blocks scattered across the City of Buffalo, where field pedestrian counts were
collected by urban planning graduate students at the University at Buffalo, The
State University of New York. Depending on the block size, two to five images
were used to assemble one image for each street block. Every street block has
tabular information to be used to add up the number of pedestrians detected and
visualize the pedestrian counts, including block ID, starting and ending coordinates,
linked Google Street View image number, etc.
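The chapter's detector is a cascade deformable part model; as an easy-to-reproduce stand-in, the sketch below counts pedestrians with OpenCV's stock HOG + linear SVM people detector, which is a different (and generally less accurate) detector than the one described here. The image path is a placeholder.

```python
# Stand-in pedestrian counter using OpenCV's stock HOG + linear SVM people
# detector (Dalal-Triggs), NOT the cascade deformable part model described in
# the text. The image path is a placeholder; requires opencv-python.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def count_pedestrians(image_path):
    """Return the number of detected pedestrians in one street-block image."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    rects, _weights = hog.detectMultiScale(
        img, winStride=(8, 8), padding=(8, 8), scale=1.05
    )
    return len(rects)

# Per-block totals could then be summed across the assembled images:
# total = sum(count_pedestrians(p) for p in block_image_paths)
```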
The model results were validated against two sets of data. One is the pedestrian
count data collected in the spring and fall seasons from 2011 to 2013 by the students
at University at Buffalo. The second is a walkability index assigned to streets,
which was collected from WalkScore, a private company that provides walkability
services to promote walkable neighborhoods. Pedestrian count data was collected
by counting the number of pedestrians on sample street blocks at non-intersection
locations, in a 15-min interval. Walk scores were calculated based on distance to the
closest amenity such as parks, schools, etc. The Google Street View image data are
from static shots for every street. Even though these data sets were collected using different methods, the patterns of walkability represented by these data should be a reasonable reflection of how streets in Buffalo are really used by pedestrians.
3 Findings
The results of multi-scale image segmentation are shown in the first two columns of Fig. 3. The images in the first row contain a single pedestrian, those in the second row contain multiple pedestrians, and both show that pedestrians can be segmented effectively. At the end of this primary stage, the background was removed, some of the environmental interference was reduced, and the effective area for the fine detection stage was retained. This helps reduce the false detection rate and increase accuracy, because unnecessary complex calculations are avoided at the fine detection stage. The detection rate was thus significantly enhanced.
Images in the rightmost column of Fig. 3 show the final detection results. After sparse multi-scale segmentation removes the background area, green rectangles mark the regions detected as pedestrians. As can be seen from the figure, pedestrians can be detected by the algorithm used. Experimental results show that detection accuracy is increased compared with traditional pedestrian detection algorithms, and that detection speed is improved.
Figure 4 shows the pedestrian counts based on the pedestrian detection method
proposed in this paper (Fig. 4a) in comparison with the counts obtained from the
field work (Fig. 4b) and patterns of walkability from WalkScore (Fig. 4c). All three
maps show similar patterns of walkability in the City of Buffalo, with the
highest concentration of pedestrians in downtown and Elmwood village areas.
Google Street View data has a lower number of pedestrians than the counts from
the field work because Google Street View captures only the number of pedestrians
at one point in time during one static shot while field work usually captures counts
over a period of time.
This paper used a pedestrian detection algorithm based on sparse multi-scale image
segmentation and a cascade deformable model to extract pedestrian counts from the
Google Street View images. The experimental results showed that this algorithm
performs reasonably fast in comparison to the traditional DPM method. The
detection results were compared with, and shown to resemble, the pedestrian counts
collected by field work. The patterns of walkability in the City also resemble the
WalkScore data. Future work includes further pedestrian characteristics analysis,
combined with a pedestrian tracking algorithm to accelerate the detection effi-
ciency, and robust real-time pedestrian detection.
Google Street View images provide a valuable source for compiling data for
design and planning. However, in the form in which it is currently published and made available to the public, there are two main limitations for pedestrian detection:
Fig. 4 Pedestrian counts comparison: Google Street View image pedestrian detection (a) vs. field
work (b) vs. WalkScore (c)
References
O’Leary DE (2013) ‘Big data’, the ‘internet of things’ and the ‘internet of signs’. Intell Syst Account Finance Manage 20:53–65
Prioletti A, Mogelmose A, Grisleri P et al (2013) Part-based pedestrian detection and feature-based
tracking for driver assistance: real-time, robust algorithms, and evaluation. IEEE Trans Intell
Transp Syst 14(3):1346–1359
Yin L (2013) Assessing walkability in the city of Buffalo: an application of agent-based simula-
tion. J Urban Plan Dev 139(3):166–175
Yin L (2014) Review of Ewing, Reid and Otto Clemente. 2013. Measuring urban design: metrics for livable places. J Plan Lit 29(3):273–274
Yu CN, Joachims T (2008) Learning structural SVMs with latent variables. In: Neural information
processing systems
Zhu Q, Avidan S, Yeh MC, Cheng KT (2006) Fast human detection using a cascade of histograms of oriented gradients. In: Proceedings of IEEE conference on computer vision and pattern recognition, New York, pp 1491–1498
Learning from Outdoor Webcams:
Surveillance of Physical Activity Across
Environments
Abstract Publicly available, outdoor webcams continuously view the world and
share images. These cameras include traffic cams, campus cams, ski-resort cams,
etc. The Archive of Many Outdoor Scenes (AMOS) is a project aiming to geolocate,
annotate, archive, and visualize these cameras and images to serve as a resource for
a wide variety of scientific applications. The AMOS dataset has archived over
750 million images of outdoor environments from 27,000 webcams since 2006. Our
goal is to utilize the AMOS image dataset and crowdsourcing to develop reliable
and valid tools to improve physical activity assessment via online, outdoor webcam
capture of global physical activity patterns and urban built environment
characteristics.
This project’s grand scale-up of capturing physical activity patterns and built
environments is a methodological step forward in advancing a real-time, non-labor
intensive assessment using webcams, crowdsourcing, and eventually machine
learning. The combined use of webcams capturing outdoor scenes every 30 min
and crowdsources providing the labor of annotating the scenes allows for acceler-
ated public health surveillance related to physical activity across numerous built
environments. The ultimate goal of this public health and computer vision collab-
oration is to develop machine learning algorithms that will automatically identify
and calculate physical activity patterns.
1 Introduction
Kevin Lynch’s 1960 book, ‘The Image of the City’, was one of the first to
emphasize the importance of social scientists and design professionals in signifying
ways that urban design and built environments can be quantitatively measured and
improved (Lynch 1960). ‘The Image of the City’ led to enormous efforts to investigate the structure and function of cities, to characterize the perception of neighborhoods (Jacobs 1961; Xu et al. 2012), and to promote social interactions (Milgram et al. 1992; Oldenburg 1989). To date, large scale studies seeking to
understand and quantify how specific features or changes in the built environment
impact individuals, their behavior, and interactions have required extensive in-the-
field observation and/or expensive and data-intensive technology including accel-
erometers and GPS (Adams et al. 2015; Sampson and Raudenbush 1999). Such
studies only provide a limited view of behaviors, their context, and how each may
change as a function of the built environment. These studies are time intensive and
expensive, deploying masses of graduate students to conduct interviews about
people’s daily routines (Milgram et al. 1992) or requiring hand-coding of thousands
of hours of video (Whyte 1980) to characterize a few city plazas and parks. Even
current state-of-the art technology to investigate associations between behavior and
the urban built environment uses multiple expensive devices at the individual level
(GPS and accelerometer) and connects this data to Geographic Information System
(GIS) layers known to often be unreliable (James et al. 2014; Kerr et al. 2011;
Schipperijn et al. 2014).
A key population behavior of interest to our research team is physical activity
(Hipp et al. 2013a, b; Adlakha et al. 2014). Physical activity is associated with many
health outcomes including obesity, diabetes, heart disease, and some cancers (Office of the Surgeon General 2011). Over 30 % of adults and 17 % of children and
adolescents in the US are obese (CDC 2009), with lack of physical activity due to
constraints in the built environment being an important influence (Ferdinand
et al. 2012). Lack of safe places to walk and bicycle and lack of access to parks
and open space can impact the frequency, duration, and quality of physical activity
available to residents in urban settings (Jackson 2003; Jackson et al. 2013;
Brownson et al. 2009). Physical activity may be purposive such as a jog in a
park, or incidental such as a 10 min walk from home to a public transit stop. In
both purposive and incidental cases the design of urban built environments influ-
ences the decisions and experience of physical activity behaviors. As such, the US
Guide to Community Preventive Services (Community Guide) currently recom-
mends the following built environment interventions to increase physical activity
behaviors and reduce obesity: (1) community and street-scale urban design and land
use policies; (2) creation of, or enhanced access to places for physical activity; and
(3) transportation policies and practices (CDC 2011).
Physical Activity Assessment. Physical activity and built environment research
has expanded during the past 20 years (Handy et al. 2002; Ferdinand et al. 2012;
Harris et al. 2013). The research has followed traditional patterns of growth
beginning with ecological studies of association (Ewing et al. 2003), then local
validation of associations via retrospective surveys and researcher-present obser-
vation (Bedimo-Rung et al. 2006; Mckenzie and Cohen 2006). For example, the
System for Observing Physical Activity and Recreation in Communities
(SOPARC) (Mckenzie and Cohen 2006) was developed to understand physical
activity in context with the environment while being unobtrusive. SOPARC con-
tinues to be a popular method of assessing physical activity with pairs of
researchers positioning themselves in numerous target areas to scan the environ-
ment for the number of people participating in sedentary to vigorous physical
activity (Reed et al. 2012; Baran et al. 2013; Cohen et al. 2012). Presently, natural
experiments related to physical activity patterns and built environments are grow-
ing in popularity (Cohen et al. 2012). These studies have been of great benefit to the
field by informing public health and urban design. While there is now a substantial
body of evidence to inform local interventions and policies (Ding and Gebel 2012;
Saelens and Handy 2008; Kaczynski and Henderson 2007; Feng et al. 2010;
Renalds et al. 2010; Sandercock et al. 2010), currently used methodologies and
the use of small, local samples limit the external validity and dissemination of many
results, interventions, and policies. There is a need for large-scale, evidence-
informed evaluations of physical activity to increase external validity as evident
in recent calls for more studies across a greater variety of environments (Dyck
et al. 2012; Cerin et al. 2009).
Big Data Opportunities. Big data and modern technology have opened up several opportunities to obtain new insights on cities and offer the potential for
dramatically more efficient measurement tools (Hipp 2013; Graham and Hipp
2014). The relative ease of capturing large sample data has led to amazing results
that highlight how people move through cities based on check-ins (Naaman 2011;
Silva et al. 2012) or uploaded photos (Crandall et al. 2009). In addition, GIS, GPS,
accelerometers, smart phone applications (apps), and person-point-of-view cameras
are each being used in many studies, often in combination (Graham and Hipp 2014;
Kerr et al. 2011; Hurvitz et al. 2014). Apps that track running and walking routes
are being investigated for where populations move and how parks and other built
environment infrastructure may be associated with such movement (Adlakha
et al. 2014; Hirsch et al. 2014).
Though these big data sources offer important contributions to the field of
physical activity and built environment research, they are each dependent on
individuals to upload data, allow access to data, and/or agree to wear multiple
devices. This is the epitome of the quantified-self movement (Barrett et al. 2013). A
complementary alternative big data source is the pervasive capture of urban envi-
ronments by traffic cameras and other public, online webcams. This environmental-
point-of-view imaging also captures human behavior and physical activity as
persons traverse and use urban space.
The Archive of Many Outdoor Scenes (AMOS) has been archiving one image
each half hour from most online, publicly available webcams since 2006, creating
an open and widely distributed research resource (Pless and Jacobs 2006). AMOS
began to collect images from these 27,000 webcams mapped in Fig. 1 to understand
Fig. 1 Map of cameras captured by the Archive of Many Outdoor Scenes (AMOS)
the local effects of climate variations on plants (Jacobs et al. 2007). Other dataset
uses include large collections of up-close, on the ground measurements to suggest
corrections to standard satellite data products such as NASA’s Moderate Resolution
Imaging Spectroradiometer (MODIS) estimates of tree growing seasons (Ilushin
et al. 2013; Jacobs et al. 2009; Richardson et al. 2011). This global network of
existing cameras also captures images of public spaces—plazas, parks, street
intersections, waterfronts—creating an archive of how public spaces have changed
over time and what behaviors are being performed within these spaces.
With its archive of over 750 million captured images, AMOS not only represents 27,000 unique environments, but also captures concurrent behaviors in and across these environments.
2 Data Sources
For each camera, AMOS generates a summary visualization in which each cell is a representation of the image at that time of year and time of day. Pixels can also be
represented using principal component analysis to quickly identify images that
differ based on precipitation, snowfall, dusk, dawn, etc. This summary serves
several purposes. First, it is a data availability visualization, where dark red
highlights when the camera was down and did not capture images. Second, it
highlights annual patterns such as the summer nights being shorter than winter
nights. Third, it reveals changes in camera angle. Fourth, changes in precipitation,
plant growth, and/or shadows are easily recognizable. Finally, data capture errors
are quickly visible. This data visualization is “clickable”: a user can click any cell to view the image from a particular time of day and time of year.
Each camera also contains extensive metadata, as outlined in Fig. 2:
(D) Shows the geolocation of the camera; (E) Shows free form text tags that our
research team and other groups use to keep track of and search for cameras with
particular properties; (F) Allows the tagging of specific images (instead of cam-
eras), and (G) Is a pointer to zip-files for data from this camera or a python script to
allow selective downloading. When exact camera locations are known, the cameras
may be geo-oriented and calibrated relative to global coordinates as shown in
Fig. 1.
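The chapter notes that per-camera data can be fetched as zip files or via a provided python script; the sketch below is a generic stand-in that downloads images from a manifest of URLs. The CSV schema (camera_id, image_url) and the URLs are hypothetical, not AMOS's actual layout; only the download mechanics are shown.

```python
# Generic image downloader standing in for AMOS's per-camera python script.
# The CSV schema (camera_id, image_url) is hypothetical; requires requests
# (pip install requests).
import csv
import pathlib
import requests

def download_images(manifest_csv, out_dir="amos_images"):
    """Fetch each image listed in the manifest into out_dir/<camera_id>/."""
    out = pathlib.Path(out_dir)
    with open(manifest_csv, newline="") as fh:
        for row in csv.DictReader(fh):          # expects camera_id, image_url
            cam_dir = out / row["camera_id"]
            cam_dir.mkdir(parents=True, exist_ok=True)
            resp = requests.get(row["image_url"], timeout=30)
            resp.raise_for_status()
            name = row["image_url"].rsplit("/", 1)[-1]
            (cam_dir / name).write_bytes(resp.content)

# download_images("camera_manifest.csv")
```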
Amazon.com’s Mechanical Turk Crowdsource. Virtual audits have emerged
as a reliable method to process the growing volume of web-based data on the
physical environment (Badland et al. 2010; Clarke et al. 2010; Odgers et al. 2012;
Bader et al. 2015). Research has also turned to virtual platforms as a way to recruit
study participants and complete simple tasks (Kamel Boulos et al. 2011; Hipp
et al. 2013a; Kaczynski et al. 2014). The Amazon.com Mechanical Turk (MTurk)
website outsources Human Intelligence Tasks (HITs), or tasks that have not yet
been automated by computers. Workers may browse available HITs and are paid
for every HIT completed successfully (Buhrmester et al. 2011). MTurk workers are
paid a minimum of US$0.01 per HIT, making them a far less expensive option than
traditional research assistant annotators (Berinsky et al. 2012). MTurk has been
found to be an effective method for survey participant recruitment, with more
representative and valid results than the convenience sampling often used for social
science research (Bohannon 2011). MTurk has also been used for research task
completion such as transcription and annotation. These have generally been small
in scale and MTurk reliability for larger scale data analysis has not been established
(Hipp et al. 2013a). Within MTurk, our team has designed a unique web-form used
with the MTurk HIT that allows workers to annotate images by demarcating each
pedestrian, bicyclist, and vehicle located in a photograph.
Trained Research Assistants. Trained undergraduate and graduate Research
Assistants from the computer science and public health departments at Washington
University in St. Louis have annotated images for physical activity behaviors and
built environment attributes. For both behaviors and environments, Research
Assistants were provided with example captured scenes. Project Principal Investi-
gators supervised the scene annotation process and provided real-time feedback on
uncertain scenes. Difficult or exceptional scenes and images were presented to the
research group to ensure that all behaviors and environments were annotated in a
consistent manner.
3 Methods
We reviewed two commonly used built environment audit tools to establish a list of potential built
environment tags. These were the Environmental Assessment of Public Recreation
Spaces (EAPRS) (Saelens et al. 2006) and the Irvine-Minnesota Inventory (Day
et al. 2006). We began with an initial list of 73 built environment items that we believed could be annotated using captured images. Following the combination of similar terms, we reduced this list using the inclusion criterion that each tag must be theoretically related to human behaviors, narrowing the final list down to 21 built environment tags.
To establish which of the 27,000 AMOS webcams are at an appropriate urban
built environment scale, i.e., those with the potential of capturing physical activity,
our team designed an interface that selects a set of camera IDs, and displays
25 cameras per screen. This internal (available only to research staff) HIT was
created to populate a webpage with the 25 unique camera images. Below each
image were a green checkmark and a red x-mark. If physical activity behaviors
could be captured in the scene, the green checkmark was selected and this tag
automatically added to a dataset of physical activity behavior cameras. This process
was repeated with trained Research Assistants for reliability and resulted in a set of
1906 cameras at an appropriate built environment and physical activity scale. In
addition to the above inclusion criteria, selected cameras must have captured scenes
from at least 12 consecutive months. The final 21 built environment tags are
presented in Table 1.
To tag each camera, Research Assistants were provided a one-page written and
photographic example (from AMOS dataset) of each built environment tag. For
example, a written description for a crosswalk was provided along with captured
images of different styles of crosswalks from across the globe. A second internal
HIT was created similar to the above that populated a webpage with 20 unique
camera images, each marked with a green checkmark and a red x-mark. If the
provided built environment tag (e.g., crosswalk) was present in the image then the
green checkmark was selected and this tag automatically added to the camera
annotation. If a Research Assistant was unsure they could click on the image to
review other images captured by the same camera or could request the assistance of
other Research Assistants or Principal Investigators to verify their selection. This
process was completed for all 21 built environment tags across all 1906 cameras in
the AMOS physical activity and built environment dataset. To date, the built
environment tags have only been annotated by trained Research Assistants. Reli-
ability and validity of tags is a future step of this research agenda. This initial step
provided the team a workable set of publicly available webcams to address our two
study aims.
Table 1 List of built environment tags used to annotate AMOS webcam images
No. Built environment tag Number of cameras with built environment tag present
1. Open space 769
2. Sidewalk 825
3. Plaza/square 174
4. Residential/homes 97
5. Trees 1058
6. Buildings 1245
7. Street, intersection 621
8. Bike lane 71
9. Athletic fields 60
10. Speed control 185
11. Trail path 154
12. Street, road 1029
13. Signage 59
14. Commerce retail 382
15. Play features 42
16. Sitting features 166
17. Motor vehicles 1048
18. Crosswalk 576
19. Bike racks 27
20. Water 326
21. Snow/Ski 169
4 Data Analysis
Physical Activity Behaviors. In the pilot project we used t-tests and logistic
regressions to analyze the difference in physical activity behaviors before and
after the addition of the bike lane along Pennsylvania Avenue. T-tests were used
for pedestrians and vehicles, where the data were on a continuous scale from 0 to 20 (20 being the most captured in any one scene). Logistic regression was used for
the presence or absence of a bicyclist in each image.
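A minimal sketch of these two tests with scipy and statsmodels is shown below, run on synthetic before/after counts; the variable names and data are illustrative only, not the study's data.

```python
# Illustrative before/after tests: t-test on pedestrian counts per scene and
# logistic regression on bicyclist presence. All data below are synthetic.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
before = rng.integers(0, 8, size=120)        # pedestrians per scene, pre
after = rng.integers(0, 12, size=120)        # pedestrians per scene, post
t, p = stats.ttest_ind(before, after)
print(f"t = {t:.2f}, p = {p:.3f}")

# Logistic regression: bicyclist present (0/1) ~ period (0 = pre, 1 = post)
period = np.r_[np.zeros(120), np.ones(120)]
bike = rng.binomial(1, np.where(period == 0, 0.05, 0.15))
model = sm.Logit(bike, sm.add_constant(period)).fit(disp=0)
print("odds ratio:", float(np.exp(model.params[1])))
```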
Reliability and Validity. Inter-rater reliability (IRR) and validity statistics
(Pearson’s R, Inter-Class Correlations, and Cohen’s Kappa) were calculated within
and between the five MTurk workers and between the two trained Research
Assistants. The appropriate statistic was calculated for two, three, four, or five
MTurk workers to determine the optimal number of workers necessary to capture a
reliable and valid count of pedestrians, bicyclists, and vehicles in a scene. Due to
each scene being annotated by five unique MTurk workers we were able to test the
reliability of ten unique combinations of workers; that is, Worker 1 and Worker
2, Worker 1 and Worker 3, Worker 1 and Worker 4, etc. Similar combinations were
used with three workers (ten unique combinations) and four workers (five unique
combinations). Each combination was compared to the trained Research Assistants’ results to measure validity. For all tests we used Landis and Koch’s magnitudes of agreement to interpret the statistics (Landis and Koch 1977).
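The following sketch illustrates this combination scheme: the counts of each worker subset are averaged and correlated with the trained Research Assistant counts. The arrays are synthetic stand-ins for the real annotations.

```python
# Sketch of the worker-combination reliability check: average each subset of
# MTurk workers' counts and correlate with the trained RA counts. Synthetic
# data stand in for the real annotations.
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
truth = rng.integers(0, 20, size=720)              # RA counts for 720 scenes
workers = [np.clip(truth + rng.integers(-2, 3, 720), 0, 20) for _ in range(5)]

for k in (2, 3, 4, 5):
    rs = []
    for subset in combinations(range(5), k):
        avg = np.mean([workers[i] for i in subset], axis=0)
        r, _p = pearsonr(avg, truth)
        rs.append(r)
    print(f"{k} workers: mean Pearson r = {np.mean(rs):.3f}")
```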
5 Results
Pilot Project. Previously published results reveal that publicly available, online
webcams are capable of capturing physical activity behavior and are capable of
capturing changes in these behaviors pre and post built environment changes (Hipp
et al. 2013a). Of the 240 images captured by the camera located at Pennsylvania
Avenue NW and 9th Street NW (7 am–7 pm for 1 week prior to built environment
change and 7 am–7 pm for 1 week post change), 237 (99 %) had at least one
pedestrian annotated, and 38 (16 %) had at least one bicyclist annotated. Table 2
presents the number and percent of images with pedestrian, bicyclist and motor
vehicle annotation by at least one MTurk worker, by intersection.
The odds of the traffic webcam at Pennsylvania Avenue NW and 9th Street NW capturing a bicyclist present in the scene in 2010 increased 3.5 times, compared to 2009 (OR = 3.57, p < 0.001). The number of bicyclists present per scene increased four-fold between 2009 (mean = 0.03; SD = 0.20) and 2010 (0.14; 0.90; F = 36.72, 1198; p = 0.002). Both results are associated with the addition of the new bike lane.
There was no associated increase in the number of pedestrians at the street inter-
section following the addition of the bike lane, as may be theoretically expected
with a bicycle-related built environment change, not a pedestrian-related change.
Table 2 Number and percent of images with at least one bicycle, pedestrian, or motor vehicle annotated by at least one MTurk worker, by intersection/camera ID

Camera ID   N images collected   N with pedestrians (%)a   N with bicyclists (%)a   N with motor vehicles (%)a
919         244                  42 (17 %)                 7 (3 %)                  162 (66 %)
920         241                  119 (49 %)                18 (7 %)                 237 (98 %)
923         241                  78 (33 %)                 9 (4 %)                  217 (90 %)
925         240                  86 (36 %)                 24 (10 %)                215 (90 %)
929         236                  204 (86 %)                40 (17 %)                236 (100 %)
930         245                  186 (76 %)                27 (11 %)                242 (99 %)
935         240                  133 (55 %)                15 (6 %)                 193 (80 %)
942         242                  225 (93 %)                32 (13 %)                236 (98 %)
950         240                  213 (89 %)                21 (9 %)                 235 (98 %)
968         240                  133 (55 %)                8 (3 %)                  209 (87 %)
984         240                  237 (99 %)                38 (16 %)                240 (100 %)
996         246                  131 (53 %)                28 (11 %)                245 (100 %)
Total       2895                 1787 (62 %)               267 (9.2 %)              2667 (92 %)

a Percent rounded to nearest whole number
Fig. 3 Reliability results for annotation of pedestrians in 720 webcam scenes. Lines represent the
average reliability score across five unique cameras
0.841–0.922). The reliability statistics for four and five MTurk workers again
showed substantial rater/annotator agreement, and near perfect agreement between
the two Research Assistants.
Validity Assessment. From reliability estimates, we concluded that using four
MTurk workers was the most reliable and cost-efficient method. Next, validity
statistics were calculated for four MTurk workers and two trained RAs. Validity
statistics (Pearson’s R) for pedestrians (0.846–0.901) and vehicles (0.753–0.857)
showed substantial to near perfect agreement. Validity (Cohen’s kappa) for cyclists
(0.361–0.494) were in the fair-moderate agreement range.
Built Environment Tags. As provided in Table 1, our final list of built envi-
ronment tags includes 21 unique items. The number of cameras with the tag present
is also presented. ‘Buildings’ were found the most frequently; at a scale to capture
human behavior across 1245 webcams. ‘Bike racks’ was annotated the fewest
times, only occurring in 27 scenes. Figure 4 shows an example map of where
each of the cameras with the built environment tag of ‘open space’ is located.
6 Discussion
The use of public, captured imagery to annotate built environments for public
health research is an emerging field. To date the captured imagery has been static
and only available via Google Street View and Google Satellite imagery (Charreire
et al. 2014; Edwards et al. 2013; Wilson et al. 2012; Taylor and Lovell 2012;
Odgers et al. 2012; Kelly et al. 2013, 2014; Wilson and Kelly 2011; Taylor
et al. 2011; Rundle et al. 2011). There have been no attempts to crowdsource this
image annotation, nor combine annotation of built environments and images cap-
turing physical activity behaviors. Using an 8-year archive of captured webcam
images and crowdsources, we have demonstrated that improvements in urban built
environments are associated with subsequent and significant increases in physical
activity behaviors. Webcams are able to capture a variety of built environment
attributes and our study shows webcams are a reliable and valid source of built
environment information. As such, the emerging technology of publicly available
webcams facilitates both consistent uptake and potentially timely dissemination of
physical activity and built environment behaviors across a variety of outdoor
environments. The AMOS webcams have the potential to serve as an important
and cost-effective part of urban environment and public health surveillance to
evaluate patterns and trends of population-level physical activity behavior in
diverse built environments.
In addition to presenting a new way to study physical activity and the built
environment, our findings contribute to novel research methodologies. The use of
crowdsources (Amazon’s MTurk) proved to be a reliable, valid, inexpensive, and
quick method for annotating street scenes captured by public, online webcams.
While MTurk workers have previously been found to be a valid and reliable source
of participant recruitment for experimental research, this is the first research agenda
that has found MTurk to be a valid and reliable method for content analysis
(Buhrmester et al. 2011; Berinsky et al. 2012; Hipp et al. 2013a, b). Our results
indicate that taking the average annotation of four unique MTurk workers is
the optimal threshold. Our results also show that across each mode of transportation
assessed, the average reliability score with four unique workers was 0.691, which is
considered substantial agreement (Landis and Koch 1977).
In addition to substantial agreement between the MTurk workers, the trained
RAs yielded substantial agreement with vehicles, near perfect agreement with
pedestrians, but only fair agreement with cyclists. The bicyclist statistics were the
least reliable, primarily due to the low number of images (only 10 % of captured
scenes) with a cyclist present. Similar to reliability statistics, validity was near
perfect for pedestrians and vehicles, but only fair to moderate for cyclists. These
results suggest MTurk workers are a quick, cheap annotation resource for com-
monly captured image artifacts. However, MTurk is not yet primed to capture rare
events in captured scenes without additional instruction or limitations to the type of
workers allowed to complete tasks.
Our current big data and urban informatics research agenda shows that publicly
available, online webcams offer a reliable and valid source for measuring physical
activity behavior in urban settings. Our findings lay the foundation for studying
physical activity and built environment characteristics using the magnitude of
globally-available recorded images as measurements. The research agenda is inno-
vative in: (1) its potential to characterize physical activity patterns over the time-
scale of years and with orders of magnitude more measurements than would be
feasible by standard methods, (2) the ability to use the increase in data to
the jurisdiction of an Institutional Review Board and therefore falls outside the
purview of the Human Research Protection Office. AMOS is an archival dataset of
publicly available photos. The photos are being used for counts and annotation of
physical activity patterns and built environment attributes and are not concerned
with individual or identifiable information. To date, no camera has been identified
that is at an angle and height so as to distinguish an individual’s face. The use of
publicly available webcams fits with the ‘Big Sister’ approach to the use of cameras
for human-centered design and social values (Stauffer and Grimson 2000). Related,
recent research utilizing Google Street View and Google Earth images have also
been HRPO-exempt (Sequeira et al. 2013; Sadanand and Corso 2012; National
Center for Safe Routes to School 2010; Saelens et al. 2006).
Finally, AMOS is quite literally “Seeing Cities Through Big Data” with appli-
cations for research methods and urban informatics. With thoughtful psychometrics and application of this growing dataset of over 750 million images, we believe perva-
sive webcams can assist researchers and urban practitioners alike in better under-
standing how we use place and how the shape and context of urban places influence
our movement and behavior.
Acknowledgements This work was funded by the Washington University in St. Louis University
Research Strategic Alliance pilot award and the National Cancer Institute of the National Institutes
of Health under award number 1R21CA186481. The content is solely the responsibility of the
authors and does not necessarily represent the official views of the National Institutes of Health.
References
Adams MA, Todd M, Kurka J, Conway TL, Cain KL, Frank LD, Sallis JF (2015) Patterns of
walkability, transit, and recreation environment for physical activity. Am J Prev Med 49
(6):878–887
Adlakha D, Budd EL, Gernes R, Sequeira S, Hipp JA (2014) Use of emerging technologies to
assess differences in outdoor physical activity in St. Louis, Missouri. Front Public Health 2:41
Bader MDM, Mooney SJ, Lee YJ, Sheehan D, Neckerman KM, Rundle AG, Teitler JO (2015)
Development and deployment of the Computer Assisted Neighborhood Visual Assessment
System (CANVAS) to measure health-related neighborhood conditions. Health Place
31:163–172
Badland HM, Opit S, Witten K, Kearns RA, Mavoa S (2010) Can virtual streetscape audits reliably
replace physical streetscape audits? J Urban Health 87:1007–1016
Baran PK, Smith WR, Moore RC, Floyd MF, Bocarro JN, Cosco NG, Danninger TM (2013) Park
use among youth and adults: examination of individual, social, and urban form factors. Environ
Behav. doi:10.1177/0013916512470134
Barrett MA, Humblet O, Hiatt RA, Adler NE (2013) Big data and disease prevention: from
quantified self to quantified communities. Big Data 1:168–175
Bedimo-Rung A, Gustat J, Tompkins BJ, Rice J, Thomson J (2006) Development of a direct
observation instrument to measure environmental characteristics of parks for physical activity.
J Phys Act Health 3:S176–S189
Berinsky AJ, Huber GA, Lenz GS (2012) Evaluating online labor markets for experimental
research: Amazon.com’s Mechanical Turk. Polit Anal 20:351–368
Bohannon J (2011) Social science for pennies. Science 334:307
Brownson RC, Hoehner CM, Day K, Forsyth A, Sallis JF (2009) Measuring the built environment
for physical activity: state of the science. Am J Prev Med 36: S99–S123.e12
Buhrmester M, Kwang T, Gosling SD (2011) Amazon’s Mechanical Turk: a new source of
inexpensive, yet high-quality, data? Perspect Psychol Sci 6:3–5
CDC (2009) Division of Nutrition, Physical Activity and Obesity. http://www.cdc.gov/nccdphp/
dnpa/index.htm
CDC (2011) Guide to community preventive services. Epidemiology Program Office, CDC,
Atlanta, GA
Cerin E, Conway TL, Saelens BE, Frank LD, Sallis JF (2009) Cross-validation of the factorial
structure of the Neighborhood Environment Walkability Scale (NEWS) and its abbreviated
form (NEWS-A). Int J Behav Nutr Phys Act 6:32
Charreire H, Mackenbach JD, Ouasti M, Lakerveld J, Compernolle S, Ben-Rebah M, McKee M,
Brug J, Rutter H, Oppert JM (2014) Using remote sensing to define environmental character-
istics related to physical activity and dietary behaviours: a systematic review (the SPOTLIGHT
project). Health Place 25:1–9
Clarke P, Ailshire J, Melendez R, Bader M, Morenoff J (2010) Using Google Earth to conduct a
neighborhood audit: reliability of a virtual audit instrument. Health Place 16:1224–1229
Cohen DA, Marsh T, Williamson S, Golinelli D, McKenzie TL (2012) Impact and cost-
effectiveness of family Fitness Zones: a natural experiment in urban public parks. Health
Place 18:39–45
Crandall DJ, Backstrom L, Huttenlocher D, Kleinberg J (2009) Mapping the world’s photos. In:
Proceedings of the 18th international conference on World Wide Web
Day K, Boarnet M, Alfonzo M, Forsyth A (2006) The Irvine–Minnesota inventory to measure built
environments: development. Am J Prev Med 30:144–152
Ding D, Gebel K (2012) Built environment, physical activity, and obesity: what have we learned
from reviewing the literature? Health Place 18:100–105
Dyck DV, Cerin E, Conway TL, Bourdeaudhuij ID, Owen N, Kerr J, Cardon G, Frank LD, Saelens
BE, Sallis JF (2012) Perceived neighborhood environmental attributes associated with adults’
transport-related walking and cycling: findings from the USA, Australia and Belgium. Int J
Behav Nutr Phys Act 9:70
Edwards N, Hooper P, Trapp GSA, Bull F, Boruff B, Giles-Corti B (2013) Development of a
Public Open Space Desktop Auditing Tool (POSDAT): a remote sensing approach. Appl
Geogr 38:22–30
Ewing R, Meakins G, Hamidi S, Nelson A (2003) Relationship between urban sprawl and physical
activity, obesity, and morbidity. Am J Health Promot 18:47–57
Feng J, Glass TA, Curriero FC, Stewart WF, Schwartz BS (2010) The built environment and
obesity: a systematic review of the epidemiologic evidence. Health Place 16:175–190
Ferdinand AO, Sen B, Rahurkar S, Engler S, Menachemi N (2012) The relationship between built
environments and physical activity: a systematic review. Am J Public Health 102:e7–e13
Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2009) Detecting
influenza epidemics using search engine query data. Nature 457:1012–1014
Graham DJ, Hipp JA (2014) Emerging technologies to promote and evaluate physical activity:
cutting-edge research and future directions. Front Public Health 2:66
Handy S, Boarnet M, Ewing R, Killingsworth R (2002) How the built environment affects physical
activity: views from urban planning. Am J Prev Med 23:64–73
Harris JK, Lecy J, Parra DC, Hipp A, Brownson RC (2013) Mapping the development of research
on physical activity and the built environment. Prev Med 57:533–540
Hipp JA (2013) Physical activity surveillance and emerging technologies. Braz J Phys Act Health
18:2–4
Hipp JA, Adlakha D, Eyler AA, Chang B, Pless R (2013a) Emerging technologies: webcams and
crowd-sourcing to identify active transportation. Am J Prev Med 44:96–97
Hipp JA, Adlakha D, Gernes R, Kargol A, Pless R (2013b) Do you see what I see: crowdsource
annotation of captured scenes. In: Proceedings of the 4th international SenseCam &
pervasive imaging conference. San Diego, CA: ACM
Hirsch JA, James P, Robinson JR, Eastman KM, Conley KD, Evenson KR, Laden F (2014) Using
MapMyFitness to place physical activity into neighborhood context. Front Public Health 2:19
Hurvitz PM, Moudon AV, Kang B, Saelens BE, Duncan GE (2014) Emerging technologies for
assessing physical activity behaviors in space and time. Front Public Health 2:2
Ilushin D, Richardson A, Toomey M, Pless R, Shapiro A (2013) Comparing the effects of different
remote sensing techniques for extracting deciduous broadleaf phenology. In: AGU Fall
Meeting abstracts
Jackson RJ (2003) The impact of the built environment on health: an emerging field. Am J Public
Health 93:1382–1384
Jackson RJ, Dannenberg AL, Frumkin H (2013) Health and the built environment: 10 years after.
Am J Public Health 103:1542–1544
Jacobs J (1961) The death and life of great American cities. Random House LLC, New York
Jacobs N, Roman N, Pless R (2007) Consistent temporal variations in many outdoor scenes. In:
IEEE conference on computer vision and pattern recognition, 2007, CVPR ’07, 17–22 June
2007. pp 1–6
Jacobs N, Roman N, Pless R (2008) Toward fully automatic geo-location and geo-orientation of
static outdoor cameras. In: Proc. IEEE workshop on video/image sensor networks
Jacobs N, Burgin W, Fridrich N, Abrams A, Miskell K, Braswell BH, Richardson AD, Pless R
(2009) The global network of outdoor webcams: properties and applications. In: ACM
international conference on advances in geographic information systems (SIGSPATIAL GIS)
James P, Berrigan D, Hart JE, Hipp JA, Hoehner CM, Kerr J, Major JM, Oka M, Laden F (2014)
Effects of buffer size and shape on associations between the built environment and energy
balance. Health Place 27:162–170
Kaczynski AT, Henderson KA (2007) Environmental correlates of physical activity: a review of
evidence about parks and recreation. Leis Sci 29:315–354
Kaczynski AT, Wilhelm Stanis SA, Hipp JA (2014) Point-of-decision prompts for increasing park-
based physical activity: a crowdsource analysis. Prev Med 69:87–89
Kamel Boulos MN, Resch B, Crowley DN, Breslin JG, Sohn G, Burtner R, Pike WA, Jezierski E,
Chuang KY (2011) Crowdsourcing, citizen sensing and sensor web technologies for public and
environmental health surveillance and crisis management: trends, OGC standards and appli-
cation examples. Int J Health Geogr 10:67
Kelly C, Wilson J, Baker E, Miller D, Schootman M (2013) Using Google Street View to audit the
built environment: inter-rater reliability results. Ann Behav Med 45(Suppl 1):S108–S112
Kelly C, Wilson JS, Schootman M, Clennin M, Baker EA, Miller DK (2014) The built environ-
ment predicts observed physical activity. Front Public Health 2:52
Kerr J, Duncan S, Schipperijn J (2011) Using global positioning systems in health research: a
practical approach to data collection and processing. Am J Prev Med 41:532–540
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174
Lynch K (1960) The image of the city. MIT Press, Cambridge
Mckenzie TL, Cohen DA (2006) System for observing play and recreation in communities
(SOPARC). In: Center for Population Health and Health Disparities (ed) RAND
Milgram S, Sabini JE, Silver ME (1992) The individual in a social world: essays and experiments.
McGraw-Hill Book Company, New York
Naaman M (2011) Geographic information from georeferenced social media data. SIGSPATIAL
Special 3:54–61
National Center for Safe Routes to School (2010) Retrieved from http://www.saferoutesinfo.org/.
Odgers CL, Caspi A, Bates CJ, Sampson RJ, Moffitt TE (2012) Systematic social observation of
children’s neighborhoods using Google Street View: a reliable and cost-effective method. J
Child Psychol Psychiatry 53:1009–1017
Office of the Surgeon General (2011) Overweight and obesity: at a glance. Retrieved from http://www.cdc.gov/nccdphp/sgr/ataglan.htm
Oldenburg R (1989) The great good place: cafés, coffee shops, community centers, beauty parlors,
general stores, bars, hangouts, and how they get you through the day. Paragon House,
New York
Pless R, Jacobs N (2006) The archive of many outdoor scenes, media and machines lab,
Washington University in St. Louis and University of Kentucky
Reed JA, Price AE, Grost L, Mantinan K (2012) Demographic characteristics and physical activity
behaviors in sixteen Michigan parks. J Community Health 37:507–512
Renalds A, Smith TH, Hale PJ (2010) A systematic review of built environment and health. Fam
Community Health 33:68–78
Richardson A, Friedl M, Frolking S, Pless R, Collaborators P (2011) PhenoCam: a continental-
scale observatory for monitoring the phenology of terrestrial vegetation. In: AGU Fall Meeting
abstracts
Rundle AG, Bader MDM, Richards CA, Neckerman KM, Teitler JO (2011) Using Google Street
View to audit neighborhood environments. Am J Prev Med 40:94–100
Sadanand S, Corso JJ (2012) Action bank: a high-level representation of activity in video. In: IEEE
conference on computer vision and pattern recognition (CVPR), 2012
Saelens BE, Handy S (2008) Built environment correlates of walking: a review. Med Sci Sports
Exerc 40:S550–S566
Saelens BE, Frank LD, Auffrey C, Whitaker RC, Burdette HL, Colabianchi N (2006) Measuring
physical environments of parks and playgrounds: EAPRS instrument development and inter-
rater reliability. J Phys Act Health 3:S190–S207
Sampson RJ, Raudenbush SW (1999) Systematic social observation of public spaces: a new look
at disorder in urban neighborhoods. Am J Sociol 105:603
Sandercock G, Angus C, Barton J (2010) Physical activity levels of children living in different
built environments. Prev Med 50:193–198
Schipperijn J, Kerr J, Duncan S, Madsen T, Klinker CD, Troelsen J (2014) Dynamic accuracy of
GPS receivers for use in health research: a novel method to assess GPS accuracy in real-world
settings. Front Public Health 2:21
Sequeira S, Hipp A, Adlakha D, Pless R (2013) Effectiveness of built environment interventions
by season using web cameras. In: 141st APHA annual meeting, 2–6 Nov 2013
Silva TH, Melo PO, Almeida JM, Salles J, Loureiro AA (2012) Visualizing the invisible image of
cities. In: IEEE international conference on green computing and communications
(GreenCom), 2012
Stauffer C, Grimson WEL (2000) Learning patterns of activity using real-time tracking. IEEE
Trans Pattern Anal Mach Intell 22:747–757
Taylor JR, Lovell ST (2012) Mapping public and private spaces of urban agriculture in Chicago
through the analysis of high-resolution aerial images in Google Earth. Landsc Urban Plan
108:57–70
Taylor BT, Peter F, Adrian EB, Anna W, Jonathan CC, Sally R (2011) Measuring the quality of
public open space using Google Earth. Am J Prev Med 40:105–112
Whyte WH (1980) The social life of small urban spaces. The Conservation Foundation,
Washington, DC
Wilson JS, Kelly CM (2011) Measuring the quality of public open space using Google Earth: a
commentary. Am J Prev Med 40:276–277
Wilson JS, Kelly CM, Schootman M, Baker EA, Banerjee A, Clennin M, Miller DK (2012)
Assessing the built environment using omnidirectional imagery. Am J Prev Med 42:193–199
Xu Z, Weinberger KQ, Chapelle O (2012) Distance metric learning for kernel machines. arXiv
preprint arXiv:1208.3422
Mapping Urban Soundscapes via Citygram
The study and practice of environmental cartography has a long and rich human
history with early examples found in ancient fifth century Babylonian and ancient
Chinese cultures (Horowitz 1988; Berendt 1992). Much more recently, with the
1 http://www.sfu.ca/~truax/wsp.html
2 http://www.portlandonline.com/auditor/index.cfm?c=28714#cid_18511
3 http://www.amny.com/news/noise-is-city-s-all-time-top-311-complaint-1.7409693
4 https://nycopendata.socrata.com/

2 Related Work
A variety of mobile and web-based applications have been explored to capture and measure soundscapes and environmental noise. Many of these applications tap into the somewhat new phenomenon of the “citizen-scientist” to help solve problems hindered by spatio-temporal constraints. In
2008, an application called NoiseTube was developed at Sony Computer Science
Laboratory to measure noise via smartphones, enabling the development of crowd-
sourced noise maps of urban areas (Maisonneuve et al. 2009). NoiseTube attempted
to address the “lack of public involvement in the management of the commons” by
empowering the public with smartphones to measure “personal” noise pollution
exposure via mean dB SPL levels. WideNoise, a 2009 Android/iOS application, is another citizen-science noise-metering example that includes an active world map of current noise levels (Anon n.d.a, b, c, d). WideNoise adds a social user-experience component, which encourages active participation, as well as a sound sample-tagging feature that allows users to annotate sounds with source IDs and mood labels. Another related project is TenderNoise (2010), which employs a small number of stationary decibel meters at key intersections in the Tenderloin neighborhood of San Francisco.5 Developed as an acoustic ecology project to demonstrate the efficacy of noise metering in a high-traffic area, it uses an instrumentation system consisting of fixed microphones with embedded computing systems placed at intersections within a 25-block area. As with the other projects, TenderNoise and WideNoise both use the one-size-fits-all SPL metric to evaluate noise. In the area of preserving and capturing dying soundscapes such as those of rainforests, Global Soundscape is a Purdue University project that offers simple tagging options and additional free-form descriptors entered via a custom smartphone application. One of the goals of this project is to
collect over “one million natural soundscapes” as part of a special Earth Day
experience on April 22, 2014. Like many of the other software solutions, Global
Soundscape also provides a mapping interface with “nodes” that represent
geo-tagged citizen-scientist6 contributed audio snapshots—short audio recordings
frozen in time and space. The Locustream SoundMap is another soundscape-based project, built on a so-called “networked open mic” streaming concept. In essence, Locustream aims to broadcast site-specific, unmodified audio through an Internet mapping interface via participants referred to as “streamers.” Streamers are persons who deploy custom-made Locustream devices, provided by the developers to potential users in order to share the “non-spectacular or non-event based quality of the streams.” This project was one of the many sources of inspiration for our own Citygram project, and we have taken many of its concepts a few steps further by providing means for anyone with a computer and microphone to serve as a “streamer,” as further discussed in Sect. 3.1.2. Other examples include remote environmental sensing projects such as Sensor City (Steele et al. 2013), NoiseMap (Schweizer et al. 2011), and Tmote Invent (So-In et al. 2012).
5 http://tendernoise.movity.com
6 https://www.globalsoundscapes.org/
The latter two utilize a periodic record-and-upload SPL strategy for noise monitoring, while Sensor City is a Netherlands-based project that aims to deploy “hundreds” of fixed sensors, equipped with high-end, calibrated acoustic monitoring hardware and a dedicated fiber-optic network, around a small city.
A final example is the Center for Urban Science and Progress (CUSP) SONYC
Project (2015–), which is built on Citygram’s core fixed and citizen-science sensor
network designs and technologies to capture urban noise pollution.
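Since several of the projects above reduce sound to periodic mean dB SPL readings, a minimal sketch of that style of metering may help fix ideas. The sketch below is our own Python illustration, not code from NoiseTube, WideNoise, or TenderNoise; in particular, the calibration offset that maps digital full-scale level to physical SPL is an assumed placeholder that real meters determine per device against a reference source.

```python
import numpy as np

def mean_db_spl(samples, calibration_offset_db=90.0, frame_len=4410):
    """Estimate mean dB SPL from normalized samples in [-1, 1].

    calibration_offset_db is a hypothetical value mapping digital
    full scale to a physical level; real meters calibrate per device.
    """
    spl = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))       # root-mean-square level
        dbfs = 20.0 * np.log10(max(rms, 1e-12))  # dB relative to full scale
        spl.append(dbfs + calibration_offset_db)
    return float(np.mean(spl))

# One second of synthetic noise at 44.1 kHz stands in for a recording
print(round(mean_db_spl(0.05 * np.random.randn(44100)), 1))
```

A record-and-upload client would transmit one such number (plus a timestamp and location) per interval, which is exactly why SPL-only strategies discard the spectral detail needed for source identification.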
In 2011, Citygram (Park et al. 2012, 2013, 2014b, c, d; Shamoon and Park 2014) was launched by Tae Hong Park to address two inadequacies of past and present digital maps: their absence of dynamicity and their focus on ocularity. Traditional digital maps are typically built on static landscapes characterized by slowly changing landmarks such as buildings, avenues, train tracks, lakes, forests, and other visible objects. Ocular, physical objects, although critical in any mapping model, are not the only elements that define environments; various energy types, including acoustic energy, are also important factors that make up our environments. Noticing the underrepresentation of sound in modern interactive mapping practices, we began to explore and develop concepts that would enable spatio-temporal mapping via real-time capture, streaming, analysis, and human-computer interaction (HCI) technologies. Although it is difficult to pinpoint why meaningful soundmaps have not yet been developed, it is not difficult to observe that (1) our society is visually oriented and (2) technical issues related to real-time sensor networks and spatio-temporal resolution have likely played a role. Google Maps, for example, updates spatial images every 1–3 years,7 reflecting the low sampling rate needed to capture slowly changing landscapes. This is clearly inadequate for capturing sound: the standard sampling rate for full-spectrum audio is 44.1 kHz. Our first iteration, Citygram One (Park et al. 2012), focuses on exploring spatio-temporal acoustic energy to reveal meaningful information including spatial loudness, noise pollution, and spatial emotion/mood. The backbone of Citygram includes a sensor network, server technology, edge-compute models, visualizations, interaction technologies, data archives, and machine learning techniques, as further summarized in the following sub-sections.
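As a back-of-the-envelope illustration of this sampling-rate contrast (our own arithmetic, not figures reported in the Citygram papers), continuous audio capture implies raw data volumes that make edge computing on the sensing devices attractive:

```python
# Raw volume of one sensor streaming uncompressed 16-bit mono audio
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2
SECONDS_PER_DAY = 86_400

bytes_per_day = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_DAY
print(f"one sensor:  {bytes_per_day / 1e9:.2f} GB/day")         # ~7.62 GB/day
print(f"100 sensors: {100 * bytes_per_day / 1e12:.2f} TB/day")  # ~0.76 TB/day
```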
Our sensor network design philosophy is based on adopting scalable, robust, cost-
effective, and flexible remote sensing device (RSD) technologies that communicate
7 https://sites.google.com/site/earthhowdoi/Home/ageandclarityofimagery

Fig. 2 (a) Android mini PC and audio interface and (b) single MEMS microphone

8 e.g. NABEL and OstLuft have a combined sensor node count of five for the entire city of Zurich.
9 New York University’s Center for Urban Science and Progress (CUSP).
10 http://business.time.com/2013/01/09/google-brings-free-public-wifi-to-its-new-york-city-neighborhood/
11 http://www.nydailynews.com/new-york/brooklyn/40-public-pay-phones-broken-city-records-article-1.1914079
12 http://www.link.nyc
The Big Data analytics component of our project involves developing a fundamental understanding of soundscapes through semantic, emotive, and acoustic descriptors. In the case of semantic soundscape descriptors, much of the work originated from research in acoustic ecology (Schafer 1977), where the identity of the sound source and the notions of signal (foreground), keynote (background), soundmarks (symbolically important), geophony (natural), biophony (biological), and anthrophony (human-generated) play important roles. However, as our immediate focus lies in the automatic classification of urban noise-polluting agents, a first step towards this goal is the development of an agreed-upon urban soundscape taxonomy. That is: (1) determining what important sound classes occupy urban soundscapes, (2) developing an agreed-upon soundscape namespace, and (3) establishing organizational and relational insights into its classes. This area, however, is still underexplored and a standardized taxonomy is yet to be established (Marcell et al. 2000; Guastavino 2007; Brown et al. 2011). In an effort to develop an urban environmental noise taxonomy, we have prototyped software tools for soundscape annotation that follow an open-ended labeling paradigm similar to work by Marcell et al. (2000). We are also developing an urban soundscape
taxonomy that reflects the notion of “collective listening” rather than relying on the opinions of a few (Foale and Davies 2012). We believe that our methodology of inviting both researchers and the public to define and refine the pool of sono-semantic concepts has the potential to contribute to a pluralistic soundscape taxonomy. Of critical importance in this approach is ensuring a sufficiently large collection of data. As an initial proof of concept, we developed an urban soundscape mining methodology using the open, crowd-sourced database Freesound,13 a methodology adopted in Salamon et al. (2014). Although Freesound is a rich resource for “annotated” audio files, initial semantic analysis proved less than ideal due to the amount of noise from unrelated words present in the data—each soundfile, no matter how long or short, includes a set of labels that describe the entire recording. After employing a number of de-noising filters, including tag normalization, spelling correction, lemmatization (Jurafsky and Martin 2009), and histogram pruning, we were able to substantially clean the original 2203 tags down to 230 tags obtained from 1188 annotated audio files (Park et al. 2014d); a minimal sketch of such a pipeline follows below. An analysis of the filtered tags suggested that, with additional filtering techniques, we could gain insights into hierarchical and taxonomical information in addition to our current conditional probability techniques. In addition to semantic descriptor analysis, we are currently investigating research efficacy in spatio-acoustic affective
13 A crowd-sourced online sound repository containing user-specified metadata, including tags and labels.
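The sketch below illustrates the tag de-noising sequence described above—normalization, spelling correction, lemmatization, and histogram pruning. The correction and lemma tables are toy stand-ins for the full components used in Park et al. (2014d); only the control flow is meant to mirror the described pipeline.

```python
from collections import Counter

# Toy stand-ins; the real pipeline used full spelling correction and
# lemmatization (Jurafsky and Martin 2009) rather than lookup tables.
CORRECTIONS = {"trafic": "traffic"}          # hypothetical entries
LEMMAS = {"dogs": "dog", "barking": "bark"}  # hypothetical entries

def denoise_tags(raw_tags, min_count=2):
    """Normalize, correct, and lemmatize tags, then prune rare ones."""
    cleaned = []
    for tag in raw_tags:
        tag = tag.strip().lower()          # tag normalization
        tag = CORRECTIONS.get(tag, tag)    # spelling correction
        tag = LEMMAS.get(tag, tag)         # lemmatization (stubbed)
        cleaned.append(tag)
    counts = Counter(cleaned)
    # Histogram pruning: discard tags too rare to support taxonomy-building
    return {t: c for t, c in counts.items() if c >= min_count}

print(denoise_tags(["Dogs", "dog", "barking", "Trafic", "traffic", "horn"]))
# -> {'dog': 2, 'traffic': 2}
```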
The field of automatic soundscape classification is still in its nascent stages due to a number of factors, including: (1) the lack of ground-truth datasets (Giannoulis et al. 2013), (2) the underexplored state of the soundscape namespace, (3) the overwhelming emphasis on speech recognition (Valero Gonzalez and Alías Pujol 2013; Tur and Stolcke 2007; Gygi 2001), and (4) the sonic complexity/diversity of soundscape classes. A soundscape can literally contain any sound, making the sound classification task fundamentally difficult (Duan et al. 2012). That is not to say that research in this field—something we refer to as Soundscape Information Retrieval (SIR)—is inactive: research publications related to music, speech, and environmental sound as a whole have increased more than four-fold between 2003 and 2010 (Valero Gonzalez and Alías Pujol 2013), and numerous research subfields exist today, including projects related to monitoring bird species, monitoring traffic, and detecting gunshots (Clavel et al. 2005; Cai et al. 2007; Mogi and Kasai 2012; Van der Merwe and Jordaan 2013).
An important research area in audio classification is the engineering of quantifiable audio descriptors (Keim et al. 2004; Lerch 2012; Müller 2007; Peeters 2004) that are extracted every few hundred milliseconds and commonly subjected to statistical summaries (Aucouturier 2006; Cowling and Sitte 2003; Meng et al. 2007), which are then fed to a classifier. Standard features for representing audio include spectral centroid (SC), spectral spread (SS), spectral flatness measure (SFM), spectral flux (SF), and Mel-Frequency Cepstral Coefficients (MFCCs) in the frequency domain; and attack time, amplitude envelope, Linear Predictive Coding (LPC) coefficients, and zero-crossing rate in the time domain. What makes automatic classification particularly interesting is that noise perception is not entirely objective and feature vectors are not spectro-temporally invariant: depending on environmental conditions, a sound source may not be perceived the same way, and a sound’s feature vectors may change even within the same sound class. Sound class variance may also be influenced by the notion of presence in the form of foreground, middle-ground, and background sound. In a sense, it could be argued that traditional noise measurement practices primarily focus on the foreground characteristic of noise quantified via dB levels—sounds produced in the background would yield low noise rankings and thus go unnoticed, although in reality they may still irritate nearby listeners.
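As a deliberately simplified illustration of this descriptor-plus-classifier pipeline, the sketch below extracts frame-level MFCCs with librosa, collapses each clip to a statistical summary (per-coefficient mean and standard deviation), and trains a support vector machine with scikit-learn. The file names and class labels are hypothetical, and real SIR systems use far richer feature sets and curated datasets.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(path):
    """Frame-level MFCCs summarized to one fixed-length vector per clip."""
    y, sr = librosa.load(path, sr=44100, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: 13 x frames
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled clips; a real system needs a large curated dataset
paths = ["siren_01.wav", "siren_02.wav", "drill_01.wav", "drill_02.wav"]
labels = ["siren", "siren", "drill", "drill"]

X = np.vstack([clip_features(p) for p in paths])
classifier = SVC(kernel="rbf").fit(X, labels)
print(classifier.predict(X[:1]))  # sanity check on a training example
```

Summarizing frames with means and standard deviations is the simplest of the statistical-summary strategies cited above; it also illustrates the invariance problem, since the same source recorded as background rather than foreground will shift the summary vector.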
One of the notable initiatives in soundscape classification research began in 2013 with the creation of the IEEE D-CASE Challenge for Computational Audio Scene Analysis (CASA) (Giannoulis et al. 2013). Although training
and evaluation of SIR systems were primarily focused on indoor office sounds,14 it is still worthwhile to note some of the SIR techniques presented at the Challenge. In the area of feature extraction, MFCCs were widely used, although in some studies a case was made for time-domain and computer-vision approaches realized via matching pursuit (MP) (Chu et al. 2008, 2009) and a k-NN-based spectrogram image feature (Dennis 2011). The former used a dictionary of atoms for feature representation and the latter exploited spectrogram images for acoustic event detection (AED) and acoustic event classification (AEC); both were demonstrated as alternatives to MFCCs and showed robust performance in the presence of background noise. Classifiers that were omnipresent included k-NNs, GMMs, SVMs, HMMs, and SOFMs based on expert-engineered feature vectors, as also reported in Duan et al. (2012). The road ahead, however, is still largely uncharted: fundamental topics concerning taxonomy and soundscape semantics, soundscape dataset availability, and the development of comprehensive and robust classification models adaptable to the diversity, dynamicity, and sonic uncertainty of outdoor soundscapes remain challenging (Peltonen et al. 2002).
14 Dissimilar to music and speech sounds, although arguably a type of “environmental sound.”
15 www.sf11.org lists barking dogs, people talking, and car alarms as the top three examples of noise complaints.
16 The Houston noise code lists vehicles, amplified sound from vehicles, and noisy animals as top noise examples.
17 A crime prevention theory that has seen substantial success in crime control in large cities including New York, Chicago, Los Angeles, Baltimore, Boston, and Albuquerque, as well as in Massachusetts and Holland.
opportunity to provide immersive, interactive, and real-time maps that will improve our understanding of our cyber-physical society. This research represents the collaborative efforts of faculty, staff, and students from a number of organizations, including NYU Steinhardt, NYU ITP, and NYU CUSP; the California Institute of the Arts; and the NYC DOT and the NYC Department of Environmental Protection.
References
Mydlarz C, Nacach S, Park T, Roginska A (2014a) The design of urban sound monitoring devices.
In: Audio Engineering Society convention 137. Audio Engineering Society
Mydlarz C, Nacach S, Roginska A (2014b) The implementation of MEMS microphones for urban
sound sensing. In: Audio Engineering Society convention 137. Audio Engineering Society
Mydlarz C, Shamoon C, Baglione M, Pimpinella M (2015) The design and calibration of low cost
urban acoustic sensing devices [Online]. http://www.conforg.fr/euronoise2015/proceedings/
data/articles/000497.pdf. Accessed 15 Sept 2015
Park TH, Cook P (2005) Radial/elliptical basis function neural networks for timbre classification.
In: Proceedings of the Journées d’Informatique Musicale, Paris
Park TH, Miller B, Shrestha A, Lee S, Turner J, Marse A (2012) Citygram one: visualizing urban
acoustic ecology. In: Proceedings of the conference on digital humanities 2012, Hamburg
Park TH, Turner J, Jacoby C, Marse A, Music M, Kapur A, He J (2013) Locative sonification:
playing the world through citygram. In: International computer music conference proceedings
(ICMC), 2013. Perth: ICMA, pp 11–17
Park TH, Lee J, You J, Yoo M-J, Turner J (2014a) Towards soundscape information retrieval
(SIR). In: Proceedings of the international computer music conference 2014, Athens, Greece
Park TH, Musick M, Turner J, Mydlarz C, Lee JH, You J, DuBois L (2014b) Citygram: one year
later . . .. In: International computer music conference proceedings, 2014, Athens, Greece
Park TH, Turner J, Musick M, Lee JH, Jacoby C, Mydlarz C, Salamon J (2014c) Sensing urban
soundscapes. In: Workshop on mining urban data, 2014
Park TH, Turner J, You J, Lee JH, Musick M (2014d) Towards soundscape information retrieval
(SIR). In: International computer music conference proceedings (ICMC), 2014. Athens,
Greece: ICMA
Passchier-Vermeer W, Passchier WF (2000) Noise exposure and public health. Environ Health
Perspect 108(Suppl 1):123
Peeters G (2004) A large set of audio features for sound description (similarity and classification) in the CUIDADO project
Peltonen V, Tuomi J, Klapuri A, Huopaniemi J, Sorsa T (2002) Computational auditory scene recognition. In: 2002 IEEE international conference on acoustics, speech, and signal processing (ICASSP). pp II–1941
Ramsay T, Burnett R, Krewski D (2003) Exploring bias in a generalized additive model for spatial air pollution data. Environ Health Perspect 111(10):1283–1288. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1241607&tool=pmcentrez&rendertype=abstract. Accessed 22 May 2014
Roads C (1988) Introduction to granular synthesis. Comput Music J 12:11–13
Salamon J, Jacoby C, Bello JP (2014) A dataset and taxonomy for urban sound research. In:
Proceedings of 22nd ACM international conference on multimedia, Orlando, USA
Saukh O, Hasenfratz D, Noori A, Ulrich T, Thiele L (n.d.) Demo abstract: route selection of
mobile sensors for air quality monitoring
Schafer RM (1977) The soundscape: our sonic environment and the tuning of the world [Online].
http://philpapers.org/rec/SCHTSO-15. Accessed 16 Apr 2014
Schafer RM (1993) The soundscape: our sonic environment and the tuning of the world [Online]. Inner Traditions/Bear. http://books.google.com/books/about/The_Soundscape.html?id=_N56QgAACAAJ&pgis=1
Schweizer I, Bärtl R, Schulz A, Probst F, Mühlhäuser M (2011) NoiseMap—real-time participatory noise maps (to appear)
Shamoon C, Park T (2014) New York City’s new noise code and NYU’s Citygram-sound project.
In: INTER-NOISE and NOISE-CON congress and conference proceedings, vol 249(5), pp
2634–2640. Institute of Noise Control Engineering
Skinner C, Grimwood C (2005) The UK noise climate 1990–2001: population exposure and
attitudes to environmental noise. Appl Acoust 66(2):231–243
Slapper G (1996) Let’s try to keep the peace. The Times [Online]. http://scholar.google.com/scholar?q=%22let%27s+try+to+keep+the+peace%22&btnG=&hl=en&as_sdt=0%2C33#0. Accessed 18 July 2014
So-In C, Weeramongkonlert N, Phaudphut C, Waikham B, Khunboa C, Jaikaeo C (2012) Android
OS mobile monitoring systems using an efficient transmission technique over Tmote sky
WSNs. In: Proceedings of the 2012 8th international symposium on communication systems,
networks and digital signal processing, CSNDSP 2012
Steele D, Krijnders D, Guatavino C (2013) The sensor city initiative: cognitive sensors for
soundscape transformations. In: Proceedings of GIS Ostrava 2013: geoinformatics for city
transformations, 2013, Ostrava, Czech Republic
Tur G, Stolcke A (2007) Unsupervised language model adaptation for meeting recognition. In: 2007 IEEE international conference on acoustics, speech and signal processing—ICASSP ’07 [Online], IEEE. pp IV–173–IV–176
Valero Gonzalez X, Alías Pujol F (2013) Automatic classification of road vehicles considering their pass-by acoustic signature. J Acoust Soc Am 133(5):3322
Van der Merwe JF, Jordaan JA (2013) Comparison between general cross correlation and a
template-matching scheme in the application of acoustic gunshot detection. In: 2013 Africon,
September 2013, IEEE [Online]. pp 1–5
Van Dijk FJH, Souman AM, De Vries FF (1987) Non-auditory effects of noise in industry. Int
Arch Occup Environ Health 59(2):133–145
Ward WD (1987) Noise and human efficiency. Ear Hear 8(4):254–255
Washington SE (n.d.) The daily decibel
Woolner P, Hall E (2010) Noise in schools: a holistic approach to the issue. Int J Environ Res
Public Health 7(8):3255–3269
Wrightson K (2000) An introduction to acoustic ecology. Soundscape J Acoust Ecol 1(1):10–13
Zhao YM, Zhang SZ, Selvin S, Spear RC (1991) A dose response relation for noise induced
hypertension. Br J Ind Med 48(3):179–184
Part VIII
Social Equity and Data Democracy
Big Data and Smart (Equitable) Cities
Abstract Elected officials and bureaucrats claim that Big Data is dramatically
changing city hall by allowing more efficient and effective decision-making. This
has sparked a rise in the number of “Offices of Innovation” that collect, manage, use, and share Big Data in major cities throughout the U.S. This paper seeks to answer
two questions. First, is Big Data changing how decisions are made in city hall?
Second, is Big Data being used to address social equity and how? This study
examines Offices of Innovation that use Big Data in five major American cities:
New York, Chicago, Boston, Philadelphia, and Louisville, focusing specifically on
three dimensions of Big Data and social equity: data democratization, digital access
and literacy, and promoting equitable outcomes. Furthermore, this study highlights
innovative practices that address social problems in order to provide directions for
future research and practice on the topic of Big Data and social equity.
1 Introduction
Elected officials and bureaucrats claim that Big Data is dramatically changing city
hall by allowing more efficient and effective decision-making. This has sparked a
rise in the number of “Offices of Innovation” that collect, manage, use, and share Big Data in major cities throughout the United States. A watershed moment for Big
Data and cities was President Obama’s Open Government Initiative announced in
January of 2009, which provided a Federal directive to establish deadlines for
action on open data (Orzag 2009). Shortly thereafter, a number of local municipal-
ities in the U.S. began making data more accessible, developed policies around
open data, and made government services and civic engagement easier through the
use of new technologies and Big Data.
San Francisco launched the first open data portal for a U.S. city in 2009 and
opened the first Mayor’s Office of Civic Innovation in January 2012 (Appallicious
2014). As of July 2013, at least ten cities had Chief Innovation Officers, and a
survey conducted in Spring 2013 found that “44 % of cities of populations of more
than 300,000 and 10 % cities of populations between 50,000 and 100,000 had
offices of innovation” (Burstein 2013). While San Francisco had the first office,
Mayor Bloomberg’s dedication to opening data in New York has been heralded by
civic innovators as one of the driving forces behind the open data movement and the
national trend towards greater civic entrepreneurship (Appallicious 2014).
Offices of Innovation have become popular because they offer the promise of
using Big Data for predictive analytics, streamlining local government processes,
and reducing costs. Yet, very little research has been conducted on what data is
being harnessed, how it is organized and managed, who has access, and how its use
affects residents. Even less attention has been paid to the relationship between Big
Data and equity.
This paper seeks to answer two questions. First, is Big Data changing how
decisions are made in city hall? Second, is Big Data being used to address social
equity and how? This paper seeks to answer these questions by examining Offices
of Innovation that use Big Data in five major American cities: New York, Chicago,
Boston, Philadelphia, and Louisville. In particular, this study examines three
dimensions of Big Data and social equity: data democratization, digital access
and literacy, and promoting equitable outcomes. Furthermore, this study highlights
innovative practices that address social problems in order to provide directions for
future research and practice on the topic of Big Data and social equity.
Although the private sector has, for some time now, been highly sophisticated at culling Big Data to shape business practices and planning, the use of Big Data in the public sector is a relatively new phenomenon. Much of the academic literature to
date on Big Data and cities has largely focused on the historical evolution of Big
Data and smart cities (Batty 2012, 2013; Kitchin 2014; Batty et al. 2012; Chourabi
et al. 2012), the potential impact that Big Data can have on the future of citizens’
lives (Domingo et al. 2013; Batty 2013; Chen and Zhang 2014; Wigan and Clarke
2013; Hemerly 2013), and the challenges the public sector faces integrating Big
Data into existing processes and strategies (Batty 2012; Joseph and Johnson 2013;
Vilajosana et al. 2013; Almirall et al. 2014; Chen and Zhang 2014; Cumbley and
Church 2013; Wigan and Clarke 2013; Kim et al. 2014; Hemerly 2013). However,
less research has centered on the relationships between Big Data, local governance,
and social equity.
Social equity in governance is defined by the Standing Panel on Social Equity in Governance as “The fair, just and equitable management of all institutions serving the public directly or by contract; the fair, just and equitable distribution of public
While scholars have paid attention to the digital divide in relation to digital access
over the past 15 years, recent research has emphasized growing concerns over
digital literacy—the skills, knowledge, or familiarity with digital technology (Gil-
bert et al. 2008; Gilmour 2007; Lee et al. 2015; Correa 2010; Hargittai 2002). This
form of digital divide has been referred to as the “participatory gap” (Fuentes-
Bautista 2013) and signifies that even if individuals have access to computers,
smartphones, or the Internet, they may lack the skills, education, or familiarity to
take advantage of the opportunities that information and communications technol-
ogies (ICTs) can provide (Warren 2007; Gilbert et al. 2008; Kvasny and Keil 2006;
Looker and Thiessen 2003; DiMaggio et al. 2004; Light 2001). Differences in
levels of accessibility and digital literacy are found to be correlated with typical
patterns of social exclusion in society (Warren 2007; Lee et al. 2015; Mossberger
et al. 2012; DiMaggio et al. 2004). In particular, socioeconomic status is considered
the leading cause of the new digital literacy divide (Guillen and Suarez 2005).
As municipal governments become increasingly reliant on digital technology,
the ability to navigate public agency websites, download and submit forms elec-
tronically, scan documents, and a host of other digital skills are increasingly
becoming important. Digital illiteracy will undoubtedly limit the ability of individ-
uals, organizations, and local businesses to access resources and opportunities.
Community-based organizations, for example, often have low capacity to perform sophisticated studies or evaluations using Big Data and, therefore, cannot provide the quantitative analyses that may be required to apply for and receive government or philanthropic funding. Thus,
understanding the barriers to digital literacy and the characteristics of groups and
organizations that are persistently illiterate will allow local governments to adopt
policies and practices to address it.
The first generation of initiatives developed to address the digital divide proposed that improving digital accessibility would benefit disadvantaged groups and reduce gaps in access and usage (Azari and Pick 2005). The idea behind these
initiatives relied on the assumption that closing gaps in technological access would
mitigate broader inequalities, including literacy. These initiatives also assumed that
providing access to ICTs would improve disadvantaged groups’ social statuses
(e.g. income) (Light 2001). However, studies found confounding factors associated
with digital inequality, including available equipment, autonomy in using ICTs,
skills, support (e.g. technical assistance), and variations in purposes (e.g. using ICTs to obtain jobs vs. social networking) (DiMaggio et al. 2004). Increasing digital access does not adequately address these five issues and, therefore, may be ineffective at reducing the digital divide (Looker and Thiessen 2003). For example,
Kvasny and Keil’s (2006) study found that providing computers, Internet access,
and basic computer training was helpful, but not sufficient at eliminating the digital
divide for low-income families in high-poverty areas. This study pointed to the
intersection between digital inequities and other social structural inequities, such as
lack of access to high-quality schools, limited public investment, and pervasive
poverty. Even when digital divide initiatives do help low-income Americans living
in poor neighborhoods to gain digital literacy, these programs often do not mitigate
inequities caused by disparities in transportation access or educational status (Light
2001; Tapia et al. 2011). Thus, the literature suggests that in order to be effective in
closing the digital gap, digital programs and policies must also be coordinated with
other social policies that address the root causes of the digital divide: poverty, poor
education, economic residential segregation, and public and private sector disin-
vestment in poor neighborhoods.
Technological innovations have become an integral part of America’s commu-
nication, information, and educational culture over the past decade. Access to
information and computer technologies is increasingly considered a “necessity”
to participate in many daily functions (Light 2001). As ICTs have become more
integrated into daily life, populations that have historically not had access to or
familiarity with how to use ICTs may become increasingly disadvantaged without
improved access and literacy (Tapia et al. 2011). This may also exacerbate other
forms of social, economic, and political marginalization for excluded groups
(Gilbert et al. 2008; Lee et al. 2015). Furthermore, growing disparities between
the digital “haves” and “have-nots” can have lasting negative social and economic consequences for neighborhoods and cities.
Recognizing how data is collected and who provides the data has important
implications for democracy and the distribution of government resources. For
example, crowdsourcing—a citizen engagement platform—will disproportionately
benefit individuals and groups that provide data through mobile applications or
web-based applications. Without an understanding of how to use ICTs, disadvan-
taged groups will not have their voices heard (Jenkins et al. 2009; Bailey and
Ngwenyama 2011; Lee et al. 2015). However, being digitally illiterate may be of
greater concern for more common procedures, such as job applications or qualify-
ing for federal assistance programs. Today, even some minimum wage jobs require
job applications to be filed online. Individuals without access to a computer or the Internet, and/or without familiarity with completing paperwork or forms online, may experience significant difficulty completing the application,
which may further exacerbate existing economic inequities. Data and digital ineq-
uities, in terms of both access and literacy, compound issues that disadvantaged
populations face. As a result, it is important to continually address inequity in
digital access and literacy in order to prevent populations from becoming increas-
ingly disenfranchised and for inequalities to be exacerbated as technological inno-
vations continue to develop.
One of the primary ways in which Big Data is changing City Halls nationwide is by increasing the number of data sources (e.g. administrative records, mobile application data, social media) available to develop data analytics and predictive processes to inform
decision-making. These types of systems are being developed to save money,
allowing municipal governments to stretch budgets, improve efficiency, and
develop new methods of communication and networking internally in order to be
more innovative in delivering public services. Two of our case study cities offer
insights into how Big Data is used to inform local government decision-making:
New York and Chicago. New York has been approaching Big Data from a problem
solving approach, whereas Chicago is working to infuse data analytics into their
existing governmental structure in a comprehensive way.
4.1.1 Big Data and Predictive Processes: New York and Chicago
New York City’s Mayor’s Office of Data Analytics (MODA) has been widely
recognized for using predictive analytics in government decision-making over the
past several years. Since 2012, New York City has approached predictive analytics
as a way to “evolve government” to improve the efficient allocation of resources
and develop a better response to the real-time needs of citizens (Howard 2011). The
City’s repository for administrative data is called DataBridge and was designed to
perform cross-agency data analysis utilizing data from 45 city agencies simulta-
neously. According to Nicholas O’Brien, the Acting Chief Analytics Officer of
New York, the main challenge the office has had to overcome has been working
with Big Data from “45 mayoral agencies spread out in a distributed system. . .[this
has been a challenging and arduous process because] each city department has a
different anthology for how they characterize the data, so it’s important to under-
stand the overlaps and the exceptions” (Personal Communication, February
24, 2014). Thus, matching the data across administrative units and ensuring the quality of the data are extremely important, as the decisions made through the predictive analytics process are only as valid and reliable as the data utilized to predict the outcomes.
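To make the matching problem concrete, the toy sketch below joins two hypothetical agency extracts on a normalized address key; the column names, data, and normalization rule are our own assumptions for illustration, not DataBridge’s actual schema or method.

```python
import pandas as pd

# Hypothetical extracts from two agencies describing the same buildings
fire = pd.DataFrame({"address": ["100 Gold St.", "1 Centre Street"],
                     "violations": [3, 0]})
buildings = pd.DataFrame({"addr": ["100 GOLD ST", "1 CENTRE ST"],
                          "year_built": [1972, 1914]})

def normalize(addr):
    """Crude canonical form so differing address styles can join."""
    return addr.upper().replace(".", "").replace("STREET", "ST").strip()

fire["key"] = fire["address"].map(normalize)
buildings["key"] = buildings["addr"].map(normalize)

# Cross-agency join; production pipelines add geocoding, fuzzy matching,
# and data-quality checks before trusting such a merge
merged = fire.merge(buildings, on="key", how="inner")
print(merged[["address", "violations", "year_built"]])
```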
Some of MODA’s most lauded successes include: “(1) a five-fold return on the
time that building inspectors spend on looking for illegal apartments, (2) an
increase in the rate of detection for dangerous buildings prone to fire-related issues,
(3) more than a doubling of the hit rate for discovering stores selling bootlegged
cigarettes, and (4) a five-fold increase in the detection of business licenses being
flipped” (Howard 2012). MODA’s quantifiable successes using predictive analytics have inspired other cities nationwide to create these types of processes to improve
their own internal productivity and decision-making. Although the benefits have
been well documented, there is no account of how much it costs to collect, manage,
and analyze the data. Therefore, this does not allow for a critical examination of
whether Big Data predictive analytics is a more cost-effective problem solving tool
than other methods.
In January 2014, Chicago received a $1 million grant from Bloomberg Philan-
thropies to create the first open-source, predictive analytics platform, called SmartData (Ash Center Mayors Challenge Research Team 2014). Chicago collects seven million rows of data each day that are automatically populated and gathered in varying formats through separate systems. The SmartData platform will be able to analyze millions of lines of data in real time to improve Chicago’s predictive
processes and according to Brenna Berman, will “develop a new method of data-
driven decision making that can change how cities across the country operate” (Ash
Center Mayors Challenge Research Team 2014).
The SmartData platform will have the power to transform predictive analytics
for cities nationwide through its open-source technology. If successful, the devel-
opment of this replicable model for predictive processes can potentially change
decision-making processes for every municipal government nationwide that utilizes
While Big Data and predictive analytics offer the potential for greater efficiency and cost savings, they can also do harm. Users of Big Data should be careful to ensure the accuracy and completeness of the data used in predictive models. There is also the potential for human error or misinterpretation of the results; thus it is important to cross-check the findings from predictive models with individuals in the field—including staff who work within communities or the public at large. While efficiency is important, accuracy and transparency are equally important when using Big Data for predictive modeling or forecasting.
Our case studies revealed that Big Data and new technologies have tackled tame problems (Rittel and Webber 1973), such as improving infrastructure, allocating staff time, and making city hall run more efficiently and proactively, rather than the more intractable problems of inequality, poverty, and social equity. Based on our research, we find that there are three ways that cities are
addressing social equity with Big Data: democratizing data, improving digital
access and literacy, and promoting equitable outcomes using Big Data. We discuss
each of these topics in turn below.
Enacted in March 2012, New York City’s landmark Open Data law—Local Law
11—was the first of its kind at the local U.S. municipal level (NYC DoITT 2012).
Under Local Law 11, New York City established a plan with yearly milestones to release all of the city’s data from city agencies by 2018. When finished, it will become the first U.S. local municipality with a comprehensive public agency dataset inventory (Williams, “NYC’s Plan to Release All-ish of their data,” 2013a). According to Gale Brewer, Manhattan Borough President,
New York City’s open data law was more significant and transformative than the
federal directive because it demonstrated how this type of work could be
implemented at the local level (Goodyear 2013).
The Mayor’s Office of Data Analytics (MODA) operates New York City’s open
data portal and works closely with NYC DoITT to populate the data portal and
pursue other projects relating to data innovation and analytics (Feuer 2013). MODA
has been successful at procuring 1,500 datasets from the city’s public agencies thus
far. Yet, there remain many challenges to completing this task, including the cost,
organizational capacity, data management skills, and ongoing maintenance and
upkeep of the data. What is also not clear is who uses the data and for what purpose,
which raises questions about data formatting and requisite skills and education of
users. Nicholas O’Brien, Acting Director of MODA explains,
We’re also really starting to understand our audience. The customers of our open data portal
are primarily non-profits, who we considered mid-tier data users that have some digital and
data expertise but aren’t necessarily writing code or programming. We also know for our
tech-savvy users, we have to direct them to our developer portal for more robust resources,
and we have a third level of users that have very limited skills with data analysis.
Understanding what each of these audiences want and need is an ongoing process for
us. (Personal Communication, February 24, 2014).
Since 2012, Boston, Chicago, Louisville, and Philadelphia have established open
data executive orders. These cities have largely developed open data portals and
created new executive positions to manage data initiatives. Philadelphia is the only
municipal government in the country that does not “unilaterally control” the city’s
open data portal (Wink, “What Happens to OpenDataPhilly Now?,” 2013). Instead,
Philadelphia’s portal is managed by a non-profit and contains both municipal and
non-municipal data (that users can submit directly).
In December 2012, Mayor Emanuel in Chicago established an open data exec-
utive order and created a position for a Chief Data Officer (CDO) to speed up the
development of an open data portal. In order to improve transparency and build
working relationships between departments with regard to Big Data, the executive
order required an Open Data Advisory Group, which includes representatives from each agency, to convene in order to discuss the portal’s ongoing development
(Thornton, “How open data is transforming Chicago”, 2013a). According to Brenna
Berman, Commissioner and Chief Information Officer, “meeting participants pri-
oritize what datasets should be developed and identifies cross agency collaborations
for data analytics” (Personal Communication, March 21, 2014). To support
Chicago’s open data portal, the city established an accompanying data dictionary
for information about all data being published (Thornton, “How Chicago’s Data
Dictionary is Enhancing Open Government”, 2013b). The Data Dictionary takes
transparency to another level and enhances the open data experience beyond what
the other major American cities are doing.
In October 2013, Louisville announced an executive order for creating an open
data plan. At that time, Louisville’s open data policy was the first U.S. municipal
policy that stated open data will be the “default mode” for how government
electronic information will be formatted, stored, and made available to the public
(Williams, “New Louisville’s Open Data Policy Insists Open by Default is the
Future”, 2013b). The implications are that data that is legally accessible will be
proactively disclosed online through the city’s open data portal. Since January
2014, Louisville’s open data portal has been in development and operated by a
small team working within the Louisville Metro Technology Services department
(Personal Communication, February 25, 2014).
Among our case study cities, Louisville has the lowest population with almost
600,000 residents and the smallest city government. Currently, the city has a
“homegrown portal” that the city staff developed. The current process for this
homegrown portal requires a data specialist to determine (with a small team)
which datasets should be prioritized based on volume of requests and ease of
“cleaning” the data. Louisville hopes to eventually publish between 500 and 1000 datasets.
While the open data movement has generated excitement and support from municipal governments, civic hackers, and tech-savvy citizens, these innovative applications typically provide benefits or services to those who already utilize data and technology in their everyday lives. Citizens who have access to and understand these systems are able to receive benefits in terms of cost, efficiency, and decision-making.
Despite the publicity surrounding open data, providing data does not mean that
every citizen will directly receive or experience a benefit or improve their quality of
life. Truly innovative municipal governments should aim to provide widespread
access and understanding of data and technologies to their citizenry (McAuley
et al. 2011). The following analogy between libraries and open data portals is
instructive for how data portals should be conceived: “we didn’t build libraries
for an already literate citizenry. We built libraries to help citizens become literate.
Today we build open data portals not because we have a data or public policy
literate citizenry, we build them so citizens may become literate in data, visualiza-
tion, and public policy” (Eaves 2010). Nigel Jacob of Boston’s MONUM echoes
these sentiments by saying “open data is a passive role for the government. . . fine for software development, but it does not actively engage with citizens.” Thus, in
order for cities to develop a democratic data system, they need to make the data
usable and provide supplementary resources and training to ensure widespread use
and impact.
In the last few decades, local governments have been engaged in activities to reduce
the digital divide by increasing access to broadband, Wi-Fi, ICTs, and computer
centers. Coupled with these programs and the increasing affordability of acquiring
technology, digital access is becoming less of a problem.
However, a new digital divide is emerging between individuals who can effectively access and use digital resources and data to improve their well-being and those who cannot. For example, individuals unable to download, fill out, and submit a job application online will face severely limited job opportunities. If digital literacy is low among groups that are traditionally disadvantaged, this may exacerbate social inequality.
5.3.1 Chicago and New York: Digital Access and Literacy Initiatives
In 2009, every city in our study except Louisville received funding from the Broadband Technology Opportunities Program (BTOP), a federal program designed to expand access to broadband services nationwide. New York City
received $42 million from BTOP and developed the NYC Connected Communities
program, which focused on broadband adoption and improving access to computer
centers in low-income and limited-English neighborhoods throughout the five boroughs (Personal Communication, March 10, 2014). Through this program,
100 computing centers were opened at local public libraries, public housing devel-
opments, community centers and senior centers. The majority of these centers have
remained open as more funding was acquired in 2013 when the BTOP funding
expired. NYC Connected Communities included computer training and digital
literacy programs designed to meet community needs (Personal Communication,
March 10, 2014). Since 2010, the NYC Connected Communities program has
hosted more than three million user sessions citywide, approximately 100,000
residents have participated in training classes, and over 4.7 million residents have
attended open lab sessions (NYC DoITT, “Technology & Public Service Innova-
tion: Broadband Access” n.d.; City of New York Public Computing Centers 2014).
In Chicago, the Smart Communities program received $7 million in federal funding in 2010 to develop training and outreach initiatives centered in five low-income communities (Tolbert et al. 2012). The program created a master plan that incorporated considerable input from community members, who determined program priorities to address challenges specific to their communities (Deronne and Walek 2010). Thus, in Chicago, the design of the programs emerged from a “bottom up” participatory process that produced unique programmatic components built on the idea that the “community knows best” (Personal Communication, February 25, 2014).
Through early 2013, the Smart Communities program had trained approximately 20,000 people in computer and digital literacy skills (City of Chicago Public Computing Centers 2014).
The programs mentioned above are supported by large sums of federal funds, which are not available to the vast majority of cities throughout the country. Thus, we offer examples of smaller-scale initiatives to improve digital access and literacy found in our case study cities. In Chicago, LISC, a non-profit community-based organization, has built upon the Smart Communities program and developed community-focused initiatives that provide training for residents from a diversity of demographic backgrounds and offer an online presence for low-income neighborhoods. Boston and New York have installed computers in vans that serve as mobile city halls, bringing public staff into the field to offer services and provide access to technology for residents of concentrated-poverty and minority neighborhoods. Boston operates “Tech Goes Home” (TGH), an initiative that provides digital literacy courses, subsidized computer software, and broadband access to school-age children and their families. Louisville’s focus on digital literacy comes from a workforce development perspective: the city has invested in developing high-level data analysis skills in residents, which can improve employment opportunities while simultaneously making the city more attractive to businesses.
The chief limitation of these smaller-scale approaches is the difficulty of tackling both digital access and digital literacy on a limited budget. The initiatives in Chicago have managed to sustain this dual emphasis through LISC’s partnerships with other organizations. Boston’s Tech Goes Home program has expanded since its launch in 2000 and has evolved into one of the more sustainable models of digital literacy by providing a computer and low-cost broadband access to those who complete the program.
All the cities in this study have worked to close the digital divide in terms of access
and literacy. However, the innovativeness and diversity of Boston and Chicago’s
programs demonstrate the significant investment and local resources required, both
financially and in terms of coordination between local stakeholders. Justin Holmes of Boston’s Office of Innovation & Technology describes the complexity of approaching issues of data access and literacy in diverse communities:
“Our engagement approach is multichannel. . .we need to be mobile, move beyond call centers and traditional centers, and use social media as a ‘value add’ to reach people. We’re working to meet people where they are comfortable” (Personal Communication, March 19, 2014).
The main commonality between most cities was the development of public
computing centers to improve access. Dan O’Neil, Executive Director of the
Smart Chicago Collaborative, believes that “public computing centers are the
most essential building block in providing access to technology” (Personal Com-
munication, March 6, 2014). However, the programs with the potential for long-
lasting impacts appear to be those with a concentrated effort on providing extensive
on-site training through a site-specific curriculum tailored to the wants and needs of
the community. Andrew Buss, the Director of Innovation Management in the Philadelphia Office of Innovation & Technology, noted that the key to Philadelphia’s KEYSPOT computing center initiative was having an instructor on site:
“You can’t just have a room with a bunch of technology. . .you need to have a person onsite
at each location for assistance on how to use the equipment and to solve minor tech issues
[which creates] a guided experience” (Personal Communication, February 28, 2014).
The availability and expertise of on-site instructors was also evident in each of the mobile van initiatives and has proven crucial for digital literacy programs. Furthermore, establishing a high level of trust among program providers, teachers, and participants appears integral to a program’s success and to achieving positive outcomes for students.
Improving data literacy is important for a diversity of users, not only for groups that lack access to new technologies. Brenna Berman, the Chief Information
Officer at Chicago’s Department of Innovation & Technology, spoke about the
importance of non-profits accessing and utilizing data:
“We’ve been creating a partnership between commercial organizations and the philan-
thropic community to make sure non-profits are benefiting from Big Data and using some
indirect organizations that have been addressing the gap. . .we know non-profits were not
embracing Big Data and weren’t using data to inform decisions. They needed representa-
tives from communities to teach them how to do this, so we’ve run education workshops to
collaborate and educate. . .like that saying, a rising tide raises all ships” (Personal Com-
munication, March 21, 2014).
Therefore, closing the digital divide may not simply be a matter of providing access and training to individuals, but also to low-capacity organizations.
The third dimension of social equity relates to the promotion of equitable outcomes using Big Data. This can be conceived in two ways: first, directing Big Data analysis toward reducing disparities across social dimensions (e.g., income, race, ethnicity, and gender); second, targeting disadvantaged or underserved groups and using Big Data to improve their quality of life. The city of Louisville offers an example of the latter approach.
Among government agencies, public health agencies have been leading the way in
using Big Data to reduce health disparities. New technologies offer innovative ways
to assist low-income individuals to manage their healthcare and improve their
health. In 2010, Louisville launched a program built around Asthmapolis, an inhaler sensor paired with a mobile application that allows asthma patients and their doctors to understand asthma triggers and control the disease more effectively, while simultaneously generating data for public health researchers (Propeller Health 2013). More than 500 sensors have been deployed to low-income residents suffering from asthma in Louisville. While the program is
still in the early stages, the benefits for residents have been notable. Interviews with
program participants highlight their increased confidence in their disease manage-
ment due to the “smart inhalers.” Furthermore, participants are happy to be part of
the program because the inhaler sensor is provided free of charge through funding
from philanthropic grants (Runyon 2013).
This program in Louisville is believed to be truly transformative in “breaking
down data silos in the public sector. . .[and] is a model project for how the public
sector and communities should start working with informatics” (RWJF Public
Health Blog 2012). Utilizing this technology benefits users, as their disease management improves. It is also useful to doctors and public health officials, because the geo-tagged, individual-level data generated can inform future public health decisions. The City of Louisville has pushed
to incorporate innovations that improve public health because of the belief that
having a healthy population contributes to regional and economic competitiveness,
which encourages businesses to locate in the city (RWJF Public Health Blog 2012).
This mindset is consistent with Louisville’s strategy of improving data literacy as a workforce development tool to increase the city’s economic competitiveness. Thus, for smaller cities such as Louisville, innovations in technology and Big Data that promote the city’s image and reputation as cutting edge, with a good quality of life, can improve economic competitiveness.
In 2009, HHS-Connect was developed in New York to collect all data relating to social services in one digital repository in order to streamline the intake process for clients visiting different social service agencies. HHS-Connect has transformed service delivery for social services into a client-centric model. The increased coordination between city agencies has improved case management processes and provided clients with one access point to self-screen for over 30 benefit programs.
These types of internal innovations can make the experience easier for clients while
helping overburdened agencies detect fraud, improve service delivery, and reduce
costs (Goldsmith 2014).
These programs emphasize the need to develop partnerships among social service providers that allow data sharing between agencies and streamline intake services. This creates organizational efficiencies and makes receiving services easier and more efficient for socially vulnerable populations, saving individuals time and money. However, a variety of issues limit the power of
Big Data. On the federal level, statutes vary about what health records, educational
transcripts, and data related to homelessness, child welfare, drug abuse and mental
health can be collected, published, or shared (Goldsmith and Kingsley 2013). On
the state and local level, many laws were written prior to the digital age and can
create conflict and confusion, thereby slowing down the adoption of innovations in
these fields.
This case study of five U.S. cities with Offices of Innovation sought to answer two
primary research questions. First, is Big Data changing decision-making in city
hall? Second, is Big Data being used to address social equity and how? To varying
degrees, Big Data in all of our case study cities is altering the way decisions are made in local government by supplying more data sources, integrating cross-agency data, and supporting predictive rather than reactive analytics. This has the potential to improve administrative efficiency and reduce the staff hours spent on tasks, thereby saving time, energy, and money. While this may
be true, cities often do not calculate the costs associated with collecting, cleaning,
managing, and updating Big Data. No study to date has examined the cost-
effectiveness of these programs to determine the return on investment. Furthermore, local governments’ focus on tame problems, approached through a rational framework that promotes efficiency in government systems, raises long-standing concerns about “problem definition” within government (Rittel and Webber 1973; Dery 1984; Baumgartner and Jones 1993; Rochefort and Cobb 1994). In particular, top-down
models of decision-making that use technologies accessible by groups that are
already advantaged may exacerbate social inequalities and inhibit democratic
processes.
While major cities, such as New York, have high-capacity public agencies that can populate a centralized repository with the required data, smaller cities may not. Louisville’s open data portal, for example, relies on one staff member populating data into a portal that was developed in-house. What happens if this staff member leaves the post? New York City’s MODA provides the needed expertise and capacity to assist public agencies and departments to conduct predictive analytics. Through initiatives such as the mobile city halls, residents of underserved neighborhoods can interact with decision-makers, and benefit from spending less time and effort to receive municipal services when the van arrives in their neighborhood.
Understanding how Big Data can be used to address issues of equity is complex, given the various dimensions of equity that can be considered. Each of the cities studied has focused on some issues of equity, but none has taken a comprehensive, multi-faceted approach to social equity. Big Data and new technologies could address thornier, wicked issues if different policy questions and priorities were raised and if there were political support for doing so. Our study suggests that cities using Big Data have more frequently opted to tackle questions of system optimization rather than social inequality.
The city of Boston has a unique structure for its office of innovation. The office, called the Boston Department of Innovation and Technology (DoIT), is housed in Boston’s City Hall. Its primary role is to collect, manage, and organize Big Data. DoIT is also the city’s internal social media team and operates a coordinated, data-driven strategy across social media platforms such as Facebook and Twitter, with the goal of curating daily engagement to help improve residents’ quality of life (“Boston’s Mayoral Transition”, NextBoston). The city has a social media policy and an organizational strategy to support this work, with a social media liaison positioned in each of the city’s departments. Due to these efforts, Boston’s social media strategy has received national recognition and has seen rapid growth in public engagement. For example, between 2012 and 2013, the City of Boston’s Facebook page followers grew by 200 % and the page’s reach grew by 400 % (“Boston’s Mayoral Transition”, NextBoston).
DoIT collaborates very closely with the Mayor’s Office for New Urban Mechanics (MONUM). According to the Co-Founder of MONUM, the department “serves as a complementary force for city departments to innovate city services and we’re there to support them. . .[and unlike DoIT], MONUM has a great deal of independence and the ability to be innovative while not being encumbered by maintaining and supporting the innovation” (Personal Communication, March 19, 2014). Because MONUM does not manage the Big Data, the department focuses on piloting innovative, and sometimes risky, programs that, if successful, will be scaled up within a city department or city-wide. Thus, it has the freedom and flexibility to be creative and innovative. In 2013, Boston was named the #1 Digital City in America by the Center for Digital Government’s annual Digital Cities Survey. Between the efforts of DoIT and MONUM, and their social media strategy, the city of Boston is widely regarded as one of the leading Big Data innovators in municipal government.
New York City’s innovation office is known as the Mayor’s Office of Data
Analytics (MODA). MODA works in coordination with New York City’s Depart-
ment of Information Technology & Communications (DoITT). NYC DoITT is
primarily responsible for managing and improving the city government’s IT infra-
structure and telecommunication services to enhance service delivery for
New York’s residents and businesses. MODA was officially established by an
executive order from Mayor Bloomberg in April 2013, but the agency had been
working informally within New York City government for several years previously
under the name “Financial Crime Task Force” (Personal Communication, February
24, 2014). MODA manages the city’s Open Data portal and works extensively on
data management and analytics using their internally developed data platform,
known as DataBridge. In order to establish DataBridge, MODA collaborated with DoITT to consolidate the references to each building address used across all of the city’s agencies into one database, so that a search by address returns the information from every department in one place (Nicholas O’Brien, Personal Communication, February 24, 2014). MODA operates specific
projects to improve processes or gather more information about the city’s opera-
tions. MODA’s projects typically fall into one of these four categories: (1) aiding
disaster response and recovery through improved information, (2) assisting NYC
agencies with data analysis and delivery of their services, (3) using analytics to
deliver insights for economic development, and (4) encouraging transparency of
data between the city’s agencies, as well as to the general public.
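The consolidation behind DataBridge can be illustrated with a toy example. The sketch below is not MODA’s actual schema or code: the table names, columns, and the simplistic address normalization are hypothetical, intended only to show how records from different agencies become jointly searchable once they share a canonical address key.

import sqlite3

# Toy illustration of address-keyed consolidation (not MODA's DataBridge schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical per-agency tables, each referencing buildings by street address.
cur.executescript("""
    CREATE TABLE buildings (addr_id INTEGER PRIMARY KEY, address TEXT UNIQUE);
    CREATE TABLE fire_inspections (address TEXT, result TEXT);
    CREATE TABLE tax_records (address TEXT, assessed_value INTEGER);
""")

def canonical(address):
    """Naive normalization: uppercase and collapse whitespace. Real systems
    rely on far more elaborate geocoding and record matching."""
    return " ".join(address.upper().split())

# Load sample agency records, registering each canonical address once.
samples = {
    "fire_inspections": [("123 Main St", "passed")],
    "tax_records": [("123  main st", 450000)],
}
for table, rows in samples.items():
    for address, value in rows:
        key = canonical(address)
        cur.execute("INSERT OR IGNORE INTO buildings (address) VALUES (?)", (key,))
        cur.execute(f"INSERT INTO {table} VALUES (?, ?)", (key, value))

# One lookup by address now reaches both agencies' records.
cur.execute("""
    SELECT b.address, f.result, t.assessed_value
    FROM buildings b
    LEFT JOIN fire_inspections f ON f.address = b.address
    LEFT JOIN tax_records t ON t.address = b.address
    WHERE b.address = ?
""", (canonical("123 Main St"),))
print(cur.fetchone())  # ('123 MAIN ST', 'passed', 450000)

The design point mirrors the one O’Brien describes: once every agency’s records resolve to the same address key, a single query spans departments that previously kept separate ledgers.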
services, bring modern technology to cities, or change the way citizens interact with
city hall” (p. 4). MONUM’s Philadelphia office is also located in city hall, and its mission is to transform city services and engage citizens and institutions throughout the city in addressing the needs of city residents.
References
City of New York Public Computing Centers (2014) National Telecommunications and Informa-
tion Administration: annual performance progress report for Public Computing Centers. http://
www2.ntia.doc.gov/files/grantees/17-43-b10507_apr2013.pdf. Accessed 18 June 2014
Correa T (2010) The participation divide among “online experts”: experience, skills and psycho-
logical factors as predictors of college students’ web content creation. J Comput Mediat
Commun 16(1):71–92, http://dx.doi.org/10.1111/j.1083-6101.2010.01532.x
Cumbley R, Church P (2013) Is “Big Data” creepy? Comput Law Secur Rev 29(5):601–609, http://
dx.doi.org/10.1016/j.clsr.2013.07.007. Accessed 9 Mar 2014
Deronne J, Walek G (2010) Neighborhoods get smart about technology. Smart Communities.
http://www.smartcommunitieschicago.org/news/2443. Accessed 9 Apr 2014
Dery D (1984) Problem definition in policy analysis. University Press of Kansas, Lawrence
DiMaggio P, Hargittai E, Celeste C, Shafer S (2004) Digital inequality: from unequal access to
differentiated use. In: Neckerman K (ed) Social inequality. Russell Sage, New York, pp
355–400
Domingo A, Bellalta B, Palacin M, Oliver M, Almirall E (2013) Public open sensor data:
revolutionizing smart cities. IEEE Technol Soc Mag 32(4):50–56, http://dx.doi.org/10.1109/
MTS.2013.2286421. Accessed 4 Mar 2014
Eaves D (2010) Learning from libraries: the literacy challenge of open data. Eaves.ca. http://eaves.
ca/2010/06/10/learning-from-libraries-the-literacy-challenge-of-open-data/. Accessed 13 July
2014
Feuer A (2013) The Mayor’s geek squad. The New York Times. http://www.nytimes.com/2013/
03/24/nyregion/mayor-bloombergs-geek-squad.html?pagewanted=all. Accessed 8 Apr 2014
Fuentes-Bautista M (2013) Rethinking localism in the broadband era: a participatory community
development approach. Gov Inf Q 31(1):65–77, http://dx.doi.org/10.1016/j.giq.2012.08.007.
Accessed 20 Mar 2014
Gilbert M, Masucci M, Homko C, Bove A (2008) Theorizing the digital divide: information and
communication technology use frameworks among poor women using a telemedicine system.
Geoforum 39(2):912–925, http://dx.doi.org/10.1016/j.geoforum.2007.08.001. Accessed 4 Mar
2014
Gilmour JA (2007) Reducing disparities in the access and use of Internet health information. A
discussion paper. Int J Nurs Stud 44(7):1270–1278, http://dx.doi.org/10.1016/j.ijnurstu.2006.
05.007. Accessed 10 Mar 2014
Goldsmith S (2014) Unleashing a community of innovators. Data-Smart City Solutions. http://datasmart.ash.harvard.edu/news/article/unleashing-a-community-of-innovators-399. Accessed 1 May 2014
Goldsmith S, Kingsley C (2013) Getting big data to the good guys. Data-Smart City Solutions. http://datasmart.ash.harvard.edu/news/article/getting-big-data-to-the-good-guys-140. Accessed 1 May 2014
Goodyear S (2013) Why New York City’s open data law is worth caring about. The Atlantic Cities.
http://www.theatlanticcities.com/technology/2013/03/why-new-york-citys-open-data-law-
worth-caring-about/4904/. Accessed 8 Apr 2014
Guillen MF, Suarez SL (2005) Explaining the global digital divide: economic, political and
sociological drivers of cross-national internet use. Soc Forces 84(2):681–708
Hargittai E (2002) Second-level digital divide: differences in people’s online skills. First Monday
7(4)
Hemerly J (2013) Public policy considerations for data-driven innovation. Computer 46(6):25–31,
http://dx.doi.org/10.1109/MC.2013.186. Accessed 7 Mar 2014
Hilbert M (2011) The end justifies the definition: the manifold outlooks on the digital divide and
their practical usefulness for policy-making. Telecomm Policy 35(8):715–736, http://dx.doi.
org/10.1016/j.telpol.2011.06.012. Accessed 14 Mar 2014
Hollywood JS, Smith SC, Price C, McInnis B, Perry W (2012) Predictive policing: what it is, what
it isn’t, and where it can be useful. NLECTC Information and Geospatial Technologies Center
Propeller Health (2013) Wyckoff Heights Medical Center is first New York Hospital to offer
Asthmapolis Mobile Asthma Management Program. Propeller Health. http://propellerhealth.
com/press-releases/. Accessed 9 Apr 2014
Revenaugh M (2000) Beyond the digital divide: pathways to equity. Technol Learn 20(10). http://eric.ed.gov/?id=EJ615183. Accessed 2 Mar 2014
Reyes J (2014) “Smart policing” movement training Philly cops to be data scientists. Technically Philly. http://technical.ly/philly/2014/02/18/philadelphia-police-smart-policing-crime-scientists/. Accessed 30 Apr 2014
Rich S (2012) E-government. Chicago’s data brain trust tells all. http://www.govtech.com/e-
government/Chicagos-Data-Brain-Trust-Tells-All.html. Accessed 8 Apr 2014
Rittel HW, Webber MM (1973) Dilemmas in a general theory of planning. Policy Sci 4
(2):155–169
Rochefort DA, Cobb RW (1994) Problem definition: an emerging perspective. In the politics of
problem definition: shaping the policy agenda. University Press of Kansas, Lawrence
Runyon K (2013) Louisville premieres new program to fight pulmonary disease. The Huffington
Post. http://www.huffingtonpost.com/keith-runyon/louisville-chooses-asthma_b_4086297.
html. Accessed 9 Apr 2014
RWJF Public Health Blog (2012) Asthmapolis: public health data in action. Robert Wood Johnson
Foundation. http://www.rwjf.org/en/blogs/new-public-health/2012/08/asthmapolis_public.
html. Accessed 9 Apr 2014
Shueh J (2014a) Big data could bring governments big benefits. Government Technology. http://
www.govtech.com/data/Big-Data-Could-Bring-Governments-Big-Benefits.html. Accessed
6 Apr 2014
Shueh J (2014b) 3 Reasons Chicago’s data analytics could be coming to your city. Government
Technology. http://www.govtech.com/data/3-Reasons-Chicagos-Analytics-Could-be-Com
ing-to-Your-City.html. Accessed 6 Apr 2014
Tapia AH, Kvasny L, Ortiz JA (2011) A critical discourse analysis of three US municipal wireless
network initiatives for enhancing social inclusion. Telemat Inform 28(3):215–226, http://dx.
doi.org/10.1016/j.tele.2010.07.002. Accessed 4 Mar 2014
Thornton S (2013a) How open data is transforming Chicago. Digital Communities. http://www.
digitalcommunities.com/articles/How-Open-Data-is-Transforming-Chicago.html. Accessed
8 Apr 2014
Thornton S (2013b) How Chicago’s data dictionary is enhancing open government. Government
Technology. http://www.govtech.com/data/How-Chicagos-Data-Dictionary-is-Enhancing-
Open-Government.html. Accessed 8 Apr 2014
Tolbert C, Mossberger K, Anderson C (2012) Measuring change in internet use and broadband
adoption: comparing BTOP smart communities and other Chicago neighborhoods. http://www.
lisc-chicago.org/uploads/lisc-chicago-clone/documents/measuring_change_in_internet_use_
full_report.pdf. Accessed 8 Apr 2014
Velaga NR, Beecroft M, Nelson JD, Corsar D, Edwards P (2012) Transport poverty meets the
digital divide: accessibility and connectivity in rural communities. J Transp Geogr
21:102–112, http://dx.doi.org/10.1016/j.jtrangeo.2011.12.005. Accessed 8 Mar 2014
Vilajosana I, Llosa J, Martinez B, Domingo-Prieto M, Angles A, Vilajosana X (2013)
Bootstrapping smart cities through a self-sustainable model based on big data flows. IEEE
Commun Mag 51(6):128–134, http://dx.doi.org/10.1109/MCOM.2013.6525605. Accessed
20 Mar 2014
Warren M (2007) The digital vicious cycle: links between social disadvantage and digital
exclusion in rural areas. Telecomm Policy 31(6–7):374–388, http://dx.doi.org/10.1016/j.
telpol.2007.04.001. Accessed 8 Mar 2014
Wigan MR, Clarke R (2013) Big data’s big unintended consequences. Computer 46(6):46–53,
http://dx.doi.org/10.1109/MC.2013.195. Accessed 8 Mar 2014
Williams R (2013a) NYC’s plan to release all-ish of their data. Sunlight Foundation Blog. http://
sunlightfoundation.com/blog/2013/10/11/nycs-plan-to-release-all-ish-of-their-data/. Accessed
8 Apr 2014
Williams R (2013b) New Louisville open data policy insists open by default is the future. Sunlight
Foundation Blog. http://sunlightfoundation.com/blog/2013/10/21/new-louisville-open-data-
policy-insists-open-by-default-is-the-future/. Accessed 8 Apr 2014
Williams R (2014) Boston: the tale of two open data policies. Sunlight Foundation Blog. http://
sunlightfoundation.com/blog/2014/04/11/boston-the-tale-of-two-open-data-policies/.
Accessed 13 July 2014
Wink C (2013) What happens to OpenDataPhilly now? Technically Philly. http://technical.ly/philly/2013/09/18/what-happens-to-opendataphilly-now/. Accessed 13 July 2014
Big Data, Small Apps: Premises and Products of the Civic Hackathon

S.J. Carr and A. Lassiter
Abstract Connections and feedback among urban residents and the responsive
city are critical to Urban Informatics. One of the main modes of interaction between
the public and Big Data streams is the ever-expanding suite of urban-focused
smartphone applications. Governments are joining the app trend by hosting civic
hackathons focused on app development. For all the attention and effort spent on
app production and hackathons, however, a closer examination reveals a glaring
irony of the Big Data age: to date, the results have been remarkably small in both scope and user base. In this paper, we critically analyze the structure of The White
House Hackathon, New York City BigApps, and the National Day of Civic
Hacking, which are three recent, high-publicity hackathons in the United States.
We propose a taxonomy of civic apps, analyze hackathon models and results
against the taxonomy, and evaluate how the hackathon structure influences the
apps produced. In particular, we examine problem definitions embedded in the
different models and the issue of sustaining apps past the hackathon. We question
the effectiveness of apps as the interface between urban data and urban residents,
asking who is represented by and participates in the solutions offered by apps. We
determine that the transparency, collaboration and innovation that hackathons
aspire to are not yet fully realized, leading to the question: can civic Big Data
lead to big impacts?
1 Introduction
In the age of Big Data, mobile technology is one of the most crucial sources of data
exchange. Analysts are examining the preferences, behaviors and opinions of the
public through status updates, tweets, photos, videos, GPS tracks and check-ins. In
turn, urban residents are accessing the same data as they view restaurant reviews
with Yelp, find a ride with Uber, and stream Instagram photos. Untethered devices
such as smartphones and tablets are critical to real-time, on-the-go data uploading
and access. On these mobile devices, the public is connecting to data through apps.
Apps are becoming the primary interface between data and the public.
At present, apps are primarily created by private companies seeking to profit
from granular knowledge of urban behaviors. Yet, the allure and potential of apps are increasingly recognized by non-profit and government organizations, with development encouraged from the federal government all the way down to local munic-
ipalities. In the private tech industry, the “hackathon,” a short and intense period of
collaborative brainstorming, development, and coding, is a standard model of app
development, and now the public sector is following suit. So-called “civic
hackathons” are rapidly proliferating. Many view civic hackathons and app devel-
opment as an exciting indicator of a new era of collaborative, open governance and
bottom-up engagement. In the words of the promoters behind the National Day of
Civic Hacking,
Civic hackers are community members (engineers, software developers, designers, entre-
preneurs, activists, concerned citizens) who collaborate with others, including government,
to invent ways to improve quality of life in their communities. . .Participants will use
technology, publicly available data, and entrepreneurial thinking to tackle some of our
most pressing social challenges such as coordination of homeless shelters or access to
fresh, local, affordable food. (Hack for Change 2014a)
In private industry, the utility of the hackathon is usually clear: employees work
to innovate new products that will keep the company on the cutting edge of the
market, often with potential shared profits (Krueger 2012). The goals of civic
hackathons are less clear. Ostensibly, they consist of citizen developers and represen-
tatives donating their time to create apps that address community wants and needs.
The structure of the events, however, heavily influences the kind of apps produced,
their intended users, and their long-term sustainability. In this study, we examine
three models of civic hacking, develop a taxonomy of civic apps based on their
structure, and offer cautions and suggestions for future civic hackathons.
As Open and Big Data proliferate, there is seemingly little reason not to hold a
hackathon. The stakes can be very low. However, to use a tech industry term, are
hackathons “disrupting” models of governance by widening participation? Is app
generation leading to more efficient problem solving?
Since taking office, the Obama Administration has supported open data and open governance. Its agenda was formalized through the 2009 Open Government Directive, which required all 143 United States federal agencies to upload their non-confidential data sets to a newly created sharing platform, data.gov, by November 9, 2013 (The White House, Office of Management and Budget 2009).
Disseminating newly disclosed data through a hackathon is attractive because of the
transparency and participation associated with the model. Hackathon proponents
assert that creating civic apps is critical to maximally leveraging Open Data.
The 2013 White House Hackathon focused on an Application Programming Interface (API), which gives programs immediate access to data as it is updated. The API, We the People, provides data on federal petitions, such as when a petition was created and how many people have signed it. From this API, civic hackers developed apps that made the data accessible in other formats, showed the number of signatures in real time, and mapped spatial patterns of support (The White House We the People 2013). Given the limited data at hand, the fact that over 30 apps were developed could be considered a success. However, of the 66,146 available
datasets on data.gov, which range from environmental to budget information, the
choice to make petition information the focus of the first hackathon is puzzling.
Possibly, petition data were selected as the locus because petitions indicate an open,
collaborative government, but creating better access to petition data does little to
change the minimal impact of petitions in federal governance. The second White
House hackathon held in 2014 (results still pending at the writing of this paper)
continued to focus on petitions with the Petitions Write API, which is intended to
expand the platforms and sites through which people can submit petitions (Heyman
2014).
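To give a sense of what hackathon participants worked with, the sketch below queries a petitions endpoint and prints signature counts. It is a minimal sketch, not code from the event: the v1 URL and the field names (results, title, signatureCount) are assumptions based on the Read API’s historical documentation, and the endpoint may no longer be live.

import json
from urllib.request import urlopen

# Minimal sketch of polling the We the People Read API; the URL and field
# names below are assumptions based on the API's historical documentation.
URL = "https://api.whitehouse.gov/v1/petitions.json?limit=10&offset=0"

with urlopen(URL) as response:
    payload = json.load(response)

# Each petition record carries metadata such as its title and signature count.
for petition in payload.get("results", []):
    print(petition.get("title"), "-", petition.get("signatureCount"), "signatures")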
The 2013 White House Hackathon produced many prototypes, but few enduring
projects. Since the competition, none of the apps have been institutionalized by the
government, nor does it appear the apps have been updated or distributed. Video
demonstrations of the apps are available, but few of the apps are directly accessible
or downloadable. One downloadable program, a code library that extends analytic
possibilities by porting petition data into the statistical program R, does not work
with the current version of R.
While the initial tenets of Obama’s Open Government Directive were “transparency, participation, and collaboration” (The White House, Office of Management and Budget 2009), the collaboration component has largely fallen by the wayside (Peled 2011). Spokespeople for the US government’s Open Data Initiative now choose to highlight the potential of transparency and downplay the failure of federal agencies to use their data. The White House has made a very public push to tout transparency as a virtue in its own right. While the hackathon may widen the range of participants in governance, it falls short of deeply collaborative, open governance.
One of the tangible successes of the hackathon is that all the code produced was made available through GitHub, a public code-hosting service. Everything that was
finalized in the hackathon is now a public resource, so it could potentially be
accessed and built upon in the future. The optimistic view is that “[w]ith each
hackathon, some of the detritus—bits of code, training videos, documentation, the
right people trading email addresses—becomes scaffolding for the attendees of
later ones” (Judd 2011). It may be left to the developer community, however, and
not the White House, to expand the results.
New York City’s BigApps contest is one of the longest-running civic hackathons. It
is also considered one of the most successful, in terms of apps sustained past the
contest period. The competition originated with former Mayor Michael Bloomberg,
who also pioneered some of the first Open Data legislation in the country and was
the first to appoint a Chief Information and Innovation Officer, Rahul N. Merchant,
in 2012 (New York City Office of the Mayor 2013). Both Bloomberg and Merchant
brought considerable experience in the business, financial and technology sectors.
They were able to secure major sponsors such as Facebook, eBay, Microsoft, and
Google for the contest.
Some believe this kind of private-public partnership is key to garnering talent
and funding apps that will sustain beyond the competition. In 2013, a panel of
judges awarded $55,000 to the BigApps winner, and amounts ranging from $5000
to $25,000 to runners up in several categories. In all 3 years of the competition,
BigApps has also always awarded an “Investor’s Choice” prize, underscoring its
focus on apps that offer a financial return.
Yet, BigApps does not explicitly require financial returns. Instead, it asks that
participants explore ways to use technology to make New York City a better place
to “live, work, learn, or play” (NYCEDC 2016). It also offers participants a chance
to tackle known challenges supplied by 30 private and public entities. These range
from improving health access, finding charging stations for electric cars, and
helping network parents of students in the New York City school system. Past
participants have occasionally taken on these challenges, but most winners of the
contest come up with their own ideas, usually with the intent of monetization. The
first place winner of the 2013 contest, HealthyOut, uses a Yelp API to help connect
people with healthier food delivery options. HealthyOut has since raised $1.2
million in venture capital, beyond the $55,000 in prize money earned from BigApps
(Lawler 2013).
The apps that have endured since the 2013 competition were able to secure
substantial funding and develop a revenue stream. Hopscotch, an educational
coding iPad app for children, also raised $1.2 million in seed funding (Lomas
2014) and is a top seller in the iPad App Store. The majority of BigApps winners with more community-minded goals are no longer functional, since they were unable to
create a sustainable financial model. HelpingHands, one of the prize winning apps
that helped NYC residents enroll in social services, was not available in app stores
at the time of this writing and the domain name was for sale.
It could be argued that BigApps encourages entrepreneurship and feeds money
back into the city. The leaders of BigApps claim successful ideas will be rewarded
with the resources to sustain them (Brustein 2012). However, its additional claim that it “empowers the sharpest minds in tech, design, and business to solve NYC’s toughest challenges” (NYCEDC) rings hollow. The slogan itself recognizes that
only a select group is empowered. Perhaps as a course correction to this issue and
response to several public criticisms of the non-civic goals of the winners (Brustein
2012), the 2014 and 2015 contests were considerably more focused in scope and
engaged tech and civic organizations for mentorship as well as offered contestants a
chance to work directly with public agencies to solve their specific, pre-defined
problems. Most notably, the 2014 contest expanded the scope of deliverables past smartphone apps to device design and data tools, among others. The contest has continued past the Bloomberg administration, and in 2015, it attempted to link to policy initiatives by responding directly to Mayor de Blasio’s
OneNYC plan by asking participants to specifically address issues of Affordable
Housing, Zero Waste, Connected Cities, and Civic Engagement (NYCEDC 2016).
However, the contest still stops at the early idea stage, and as of this writing most
winners were still seeking further development and financial support.
Unlike the White House Hackathon and BigApps, the National Day of Civic
Hacking is not a government-driven initiative. The hackathon was organized by
consultant Second Muse with non-profits Code for America, Innovation Endeavors
and Random Hacks of Kindness. It is sponsored by Intel and the Knight Foundation,
with support from the White House Office of Science and Technology Policy,
several federal and state agencies, and other private companies. NDoCH aims to
address hyperlocal issues by promoting a coordinated set of nation-wide
hackathons hosted by individual localities on their own terms (Hack for Change
2014c; SecondMuse 2013).
NDoCH has few rules and the hackathon is interpreted broadly by participating
cities and states. The full set of projects from NDoCH is messy, but successfully
conveys the varying interests in participating areas. Most groups created apps or
websites, such as Hack 4 Colorado’s FloodForecast, which notifies users if their
home address is in danger of flooding. Other groups worked on alternative technical
projects. Maine Day of Civic Hacking, for example, focused on repairing a stop
motion animation film in a local museum.
Secondary to its local focus, NDoCH foregrounds some national issues that
participants can choose to address. In 2014, formal Challenges were advertised by
federal agencies like the Consumer Financial Protection Bureau and the Federal
Highway Administration. The Peace Corps, for example, requested “a fun, engag-
ing and easy-to-use interface with the numerous and diverse Peace Corps volunteer
opportunities that helps the user find the right opportunity for them” (Hack for
Change 2014d), which was subsequently prototyped at the San Francisco Day of
Civic Hacking (Hack for Change 2014e). NASA’s Challenge to increase awareness
of coastal inundation spurred several related projects.
Like the White House Hackathon, the route between hack and implementation is
unclear. NDoCH’s guiding principles, however, are oriented toward the process of the
hackathon, rather than the results. Stated goals include “Demonstrate a commitment
to the principles of transparency, participation, and collaboration” and “Promote
Science, Technology, Engineering and Mathematics (STEM) education by encour-
aging students to utilize open technology for solutions to real challenges”
(SecondMuse 2013). Participants may not clearly understand, however, that most
apps will not survive past the hackathon. The tension between process and product
is exacerbated by reports from the NDoCH that tout the number of apps produced,
not just process-related goals. Ensuring authentic, collaborative processes over app
development remains a challenge.
How do civic apps propose to solve problems? In order to better understand the products of the civic hackathon, we examined the results of the White House 2013 Hackathon, the New York BigApps Contest 2013, and the National Day of Civic Hacking 2014. We evaluated the apps’ descriptions and demonstrations, and located them in the Apple and Android stores, as applicable. For White House and BigApps, we evaluated information on the winning entries. For NDoCH, there were no selected winners, so we evaluated all of the posted projects.

4.1 Spatial Customization and Personal Services
Many of our most-used commercial apps, such as Google Maps and Yelp, specialize in the spatial customization of individual daily routines. These apps, powered by ever-expanding GPS technology, have advanced the “spatially enabled society,” in which citizens are better able to communicate with the world around them (Roche et al. 2012: 222). In the spatially enabled society, “the question ceases to be simply ‘Where am I?’ and becomes: ‘What is around me?’ (as in services, people, and traffic), ‘What can I expect?’ and ‘How do I get there?’” These apps center on easing mobility, and in many cases consumption, in the city: finding parking, giving real-time transit alerts, or customizing personal routes given a set of favorable inputs.
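At its simplest, the “What is around me?” question is a nearest-neighbor search over geotagged points of interest. The sketch below is a generic illustration of that pattern rather than any particular app’s code; the sample coordinates and service names are invented, and a production app would use a spatial index instead of a linear scan.

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Invented sample points of interest; a real app would query a live city feed.
POIS = [
    ("Subway entrance", 40.7590, -73.9845),
    ("Bike share dock", 40.7570, -73.9860),
    ("Public library", 40.7532, -73.9822),
]

def whats_around_me(lat, lon, radius_km=0.5):
    """Linear scan for POIs within radius_km of the user, nearest first."""
    hits = [(haversine_km(lat, lon, plat, plon), name) for name, plat, plon in POIS]
    return sorted(hit for hit in hits if hit[0] <= radius_km)

for dist, name in whats_around_me(40.7580, -73.9855):
    print(f"{name}: {dist * 1000:.0f} m away")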
It is no surprise that many civic apps have also seized on expanding the suite of spatial customization apps, as mobility and movement in the city are a continuing challenge, while self-locating with GPS remains a relatively new possibility. The majority of the spatial customization apps provide real-time transit alerts. In addition, services like HealthyOut tell users which nearby restaurants are best suited to their personal diets, while another app, Poncho, tells users what the weather will be at every location in their daily routine. These spatial customization
apps are focused on the desires of the individual, harnessing open data to ease
everyday life.
Some of these apps also crowdsource data from users, aggregate the data, and
then provide users with continually updated, socially-derived urban data. Some of
these apps are oriented toward typically underserved populations. Ability Anyware Assistive Technology Survey and Enabled City, both built during the NDoCH, identify accessible routes and buildings for people with disabilities.
We also include personal services in this category. Even though this information is sometimes aspatial, these services help make urban life more efficient for the individual. Two apps built at NYC BigApps, ChildcareDesk and HiredinNY, were intended to connect users to child care centers and jobs, respectively.
4.2 Spatial Awareness and Data Communication
The spatially enabled society is also at the heart of many apps that are built not to ease individual routines but simply to visualize otherwise invisible information and increase awareness among app users. Apps produced at the Virginia Beach hackathon under NDoCH mapped the effects of potential sea level rise. Others mapped child hunger statistics (Maine Child Hunger Cartogram Viewer) and SNAP benefits (SNAPshot) with the professed intent of spurring empathy.
In addition, this category of apps uses spatial information to encourage individ-
uals to act in their own community, by bringing visibility to difficult-to-perceive
issues. Freewheeling NC, an app built at a North Carolina hackathon during the
National Day of Civic Hacking, crowdsources bike routes with the intent of
influencing urban planning. Many of these apps use public data on underutilized vacant land to help community members convert parcels into gathering places, urban agriculture sites, or new development (Minimum Adaptable Viable Urban Space (MAVUS); [freespace] ATX; Abandoned STL).
Some apps similarly raise awareness through aspatial data communication, often aimed at government transparency: several apps developed at the White House Hackathon with the We the People API, for example, or apps surfacing campaign finance information and city council agendas. At times, data communication and spatial awareness come together, as in Flood Forecast’s flood notifications.
4.3 Community Building

Many apps built during the National Day of Civic Hacking focused on community building through peer-to-peer communication. These apps help people in niche groups, such as pet owners and teens, find each other. Community building apps also help pair volunteers with nonprofit organizations such as Habitat for Humanity, help people find places to donate leftover food, and give citizens direct access to government officials.
4.4 Educational
The smallest group of apps is educational; these are often aimed at youth and incorporate a gaming component. The aforementioned Hopscotch from NYC BigApps teaches children to code, and two apps built under NDoCH aimed to teach users about watersheds and urban geography using a Minecraft-like interface.
4.5 Data Gateways
Lastly, some apps simply focus on making data accessible in a different machine-readable format or on providing analytic environments for the data, without a specific end use. These are often interfaces geared toward developers to create even more apps. Almost a third of the apps from the White House Hackathon fall into this category; other data gateways were created for Peace Corps project data, hospital discharge costs, and even presidential inaugural addresses for textual analysis.
Table 1 presents a summary of the results from each hackathon against the app
taxonomy. Because of the large variations in end product numbers—17 for the
White House Hackathon, 7 in the BigApps contest, and 71 during the National Day
of Civic Hacking—the percentage of the total apps is given in Table 1.
The structure, funding, and data released in each type of hackathon influenced not only the scale and number of apps but also the predominant type of apps produced. At the White House Hackathon, no winners were declared, and results and code were posted for only 17 of the 30 projects completed at the hackathon. Of the 17 apps, 12 focused on Spatial Awareness and Data Communication, while 5 were Data Gateways. This is no doubt related to the focus on only one dataset. It is also notable that the White House hackathon’s goal was not to solve any particular urban challenge, but rather simply to see what technological expertise could do with a set of open data.
Table 1 Results from the White House Hackathon 2013, New York City’s BigApps Contest 2013, and the National Day of Civic Hacking 2014

White House Hackathon 2013 (17 apps): Spatial customization and personal services, 0 % (0); Spatial awareness and data communication, 71 % (12); Data gateways, 29 % (5); Community building, 0 % (0); Educational, 0 % (0); Other, 0 % (0)

NYC BigApps 2013 (7 apps): Spatial customization and personal services, 86 % (6); Spatial awareness and data communication, 0 % (0); Data gateways, 0 % (0); Community building, 0 % (0); Educational, 14 % (1); Other, 0 % (0)

National Day of Civic Hacking 2014 (71 apps): Spatial customization and personal services, 13 % (9); Spatial awareness and data communication, 34 % (24); Data gateways, 10 % (7); Community building, 28 % (20); Educational, 8 % (6); Other, 7 % (5)

The total number of apps produced is given in parentheses
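The percentages in Table 1 are simply each category’s share of the event’s total; a quick check using the table’s own counts reproduces them:

# Category counts from Table 1, in column order; shares are counts over totals.
counts = {
    "White House Hackathon 2013": [0, 12, 5, 0, 0, 0],
    "NYC BigApps 2013": [6, 0, 0, 0, 1, 0],
    "National Day of Civic Hacking 2014": [9, 24, 7, 20, 6, 5],
}
for event, row in counts.items():
    total = sum(row)
    print(event, total, [round(100 * n / total) for n in row])
# The last line prints 71 and [13, 34, 10, 28, 8, 7], matching Table 1.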
The total number of entries in New York City’s BigApps is not available, but all
seven winning entries earned prize money. It was the only hackathon of the three to
offer prize money and the only to boast significant private business partners and
potential for investors. Among the apps, six out of seven focused on personal mobility and personal services; the remaining app was educational. While the contest
claims that it is bringing together experts and developers to “solve New York’s
toughest challenges,” its results thus far indicate that it is more focused on apps that
ease individual consumption and mobility; instead of public data sets, the winning
entry used the commercial API from Yelp.
While the White House Hackathon and BigApps each produced narrow results,
the National Day of Civic Hacking generated many apps, crossing all five catego-
ries. Of the 71 products from the competition, some of the results are as technically
sophisticated as those developed in BigApps and the White House Hackathon.
Other entries are not apps at all—they are requests for apps or brainstorming
sessions regarding the potential for apps. These “Other” results make up 7 % of all
the entries into NDoCH. Notably, however, 28 % of NDoCH projects focused on
Community Building, which was absent from the other two hackathons. Of Spatial
Customization and Personal Services apps, several focused on finding accessible
facilities for disabled citizens or improving public amenities such as bike lanes,
trails, and parks. As the NDoCH organizers hoped, the apps associated with
NDoCH reveal community-driven, locally-specific issues and civic innovation.
However, they also reveal the technological limits of many localities. It is not surprising that the greatest number of apps, and the most sophisticated ones, came from technology hubs such as Palo Alto and Austin, while some of the more distant outposts lacked the expertise to produce an app at all by the end of the event.
6 Future Hackathons
The landscape of civic data and civic apps is rapidly changing, corresponding with
the rise of Open Data and Big Data, expanding mobile technology, and trends
toward technocracy (Mattern 2013). The 2014 National Day of Civic Hacking, then
in its second year, claimed a 30 % increase in events from its first hackathon
(Llewellyn 2014). Promoting civic hackathons is not only a low-risk, adaptable method of embracing contemporary problem solving amidst change; previous events have also generated enough success to keep attempting them. Hackathons have
fostered new types of civic participation, created enthusiasm among some commu-
nity members, developed some new apps, and may broadly encourage more tech-
nological innovation in government. Yet, while there is evidence of successes, civic
hackathons face unique challenges that must be addressed in order to deepen their
impact.
The examples proffered here show three common and interrelated issues with the hackathon model: first, defining problems that are meaningful for the community; second, aligning the goals of market-ready apps with civic services; and third, and most importantly, ensuring that the civic hackathon addresses the needs of its full constituency, not simply the smartphone-owning, tech-literate public.
Many experts have noted that the lack of structure in hackathons, meant to encourage out-of-the-box approaches, can lead to unfocused results. NASA’s open data portal, for example, states that the key to implementing a successful hackathon “is
to invest the effort to identify the right problem statements, provide the supporting
data, and get a good mix of people in the room” (Llewellyn 2012). As data scientist
Jake Porway warns, “They are not easy to get right. . . You need to have a clear
problem definition, include people who understand the data not just data analysis,
and be deeply sensitive with the data you’re analyzing” (Porway 2013).
When any local government is faced with a challenge, deeply understanding the
data being analyzed, the physical and social context, and possible opportunities is
crucial. Developing the web of knowledge necessary to solve most issues takes
time. While the public can offer needed fresh perspectives and local insight, it can
be difficult for outsiders to come in and hit the ground running, which is necessary
in a short-lived, intense hackathon. Many of the apps that come out of these contests
help users navigate the city, rate local places, or plan itineraries—all services that
are already well-covered and arguably better developed by large tech companies
(Brustein 2012).
Yet, “problem solving” is often touted as the key tenet of many hackathons.
BigApps seeks to “solve specific New York City challenges, known as BigIssues”
(New York City Office of the Mayor 2012). BigApps identified four BigIssues for
the 2013 competition: Jobs and Economic Mobility, Healthy Living, Lifelong
Learning, and Cleanweb: Energy, Environment, and Resilience. Of course, it is
nearly impossible to solve any sort of complex issue, like a BigIssue, in the context
of a hackathon. Creating nuanced solutions requires both quantitative expertise and
experience. If hackathons are going to meaningfully address difficult city problems,
it is likely necessary to create hybrid teams of public participants and government
employees, as the NYC BigApps contest has started to do. However, most crucially,
the contests need to consider a structure that can commit to working through
proposals beyond the short-lived timeframe of the hackathon.
Typically, the only apps that survive past a hackathon are market-ready and able to attract venture capital during or shortly after the hackathon.
At NDoCH, the organizers recognize that governments will not bear the respon-
sibility for apps after their creation:
“Each new technology has a unique path to implementation. The key to the development of
technologies that make their way out of the hackathon environment and into your commu-
nity are public and private partnerships. One path to sustainability is that a group of
volunteers develops a new app to connect low-income residents to the nearest free tax
preparation site over the course of National Day of Civic Hacking. Following the event, the
volunteers reach out to economic justice groups in their community so they can promote
their services using the app, seek a sponsor to offset the cost of the text message usage, and
work with government officials to promote the app as well as the availability of free tax
prep in your city.” (Hack for Change 2014b)
Not only is the onus of innovation and development shifted to a narrow swath of the data-literate public, but so is the burden of growth and sustainability.
Sustainability is identified as the primary issue by many technologists. Code for
America’s Dan Melton writes, “. . .some of the biggest examples of disconnect and
potential opportunity come out of app contests or hackathons. Policy makers/
political leaders champion city or social contests, to which, developers respond
with dozens or even hundreds of submissions. So far so good. When the app contest
is over, often too is the partnership” (Melton 2011). O’Reilly Media editor Andy
Oram adds, “. . .how could one expect a developer to put in the time to maintain an
app, much less turn it into a robust, broadly useful tool for the general public?. . .
The payoff for something in the public sphere just isn’t there” (Oram 2011).
Organizations like CivicApps (http://civicapps.org/) help to overcome sustainability issues by promoting apps for wider distribution, but nonetheless few are self-sustaining. Even the more civic-minded winning apps at NYC BigApps 2013, such as Helping Hands and HiredinNY, are nowhere to be seen a year later, despite funding awards.
There is some implication that failure may stem, in part, from app quality.
Refining existing ideas could help improve sustainability. Joshua Brustein of the
New York Times (2012) says, “Inevitably, most of these projects will fail. Start-ups
usually do. And considering the modest price of the program—BigApps costs the
city about $100,000 a year, according to the city’s Economic Development Corpo-
ration—the bar for success should be set low. But it seems that a better investment
might be to spend more time working with a few developers on specific ideas, rather
than continually soliciting new ones” (Brustein 2012). Tackling this issue will
require hackathon organizers to turn an eye to building communities before build-
ing apps. Clay Johnson of the Sunlight Foundation, a nonprofit dedicated to government transparency, notes that the foundation sees its hackathons as only the beginning of its engagement with both developers and volunteers (Johnson 2010). There are some cases of longer-term partnerships, such as the Federal Register's partnership with the winners of the 2010 Apps for America civic hackathon to create a
redesigned data distribution portal (Oram 2011), but these examples of government
commitment are rare.
Revenue and consumption are the bases of most hackathon propositions that manage to sustain themselves, which makes it challenging to address problems that are not profitable. Issues of the "unexotic underclass," such as veterans and welfare recipients, often go unaddressed (Nnaemeka 2013), despite Open Data's promise of egalitarianism and the participatory goals of hackathons.
One piece of this challenge is the demographic of participants that hackathons
typically attract. The majority of hackathon developers are young, well-educated,
and relatively affluent. Unsurprisingly, the majority of apps cater to this demo-
graphic, even in a civic context. As technologist Anthony Townsend (2013: 166)
writes, “. . .should we be surprised when they solve their own problems first?. . .Not
only do they not represent the full range of the city’s people; often these hackers
lack a sense that it's even their duty to help others. . .". Porway (2013) notes the same representation bias in writing about a New York City hackathon focused on greening the city: ". . .as a young affluent hacker, my problem isn't improving the city's recycling programs, it's finding kale on Saturdays."
Organizers should particularly look to increase the participation of women and
low-income communities. The all-night structure and the lack of a code of conduct can be intimidating to women (Rinearson 2013). The National Center for Women and Information Technology notes the importance of specifically recruiting girls for events, not only as coders but as judges and mentors (NCWIT n.d.). One organization, Yes We Code, is responding by hosting its own hackathons that support ideas from low-income teens (Yes We Code 2013). Rather than relying on such separate events, however, cities should take responsibility for ensuring that these organizations and their constituents have a voice in city-sponsored hackathons.
Representation bias also extends to hardware. Only 56 % of U.S. adults own smartphones, and owners are primarily people under 35, well educated, and affluent (Smith 2013). Apps are, themselves, a limiting format. Though apps continue to grow in popularity, many difficult-to-reach groups cannot be accessed through them. In a world where mobile data confers visibility and voice, those who are unable to partake become invisible.
Recent language in the NYC BigApps contest and the National Day of Civic Hacking acknowledges that apps can be exclusive. Beginning in 2014, BigApps accepted competition entries that use a wider array of technology products (NYCEDC 2016). Expanding the technologies used to gather data and input may have surprising outcomes. When New York City's non-emergency reporting system, 311, added a website and then an app to its telephone reporting system, officials expected call volumes to go down. Instead, they saw an overall increase in reporting, showing that different media reach different sectors of the population. Moving away from the app interface could encourage broader engagement.
7 Conclusion
As of this writing, the White House has not held another hackathon, but NYC
BigApps and the National Day of Civic Hacking persist into 2016. While these
contests have been a reasonably convenient, low-investment way for governments to appear to engage in innovative, technology-enabled problem solving, the existing hackathon models make it difficult to address truly complex, non-monetizable issues. As governments grapple with what to do with their newly opened datasets and how to handle much of their Big Data, they are left with some difficult challenges. How should governments ensure that civic innovations are institutionalized and sustained without becoming dependent on private backers? How can they ensure everyone is fairly represented, so that those on the far side of the digital divide are not left behind in the wake of technological progress?
The first step to improving the civic hackathon is to subject it to the same
scrutiny as any other urban practice. This includes clarifying the goals of
hackathons and developing associated metrics. If apps can create efficiency gains, the results should be internalized: governments should commit personnel resources to hackathons, support scaling, and dedicate money for ongoing operation. Alternatively, are hackathons intended to kickstart for-profit app businesses? If so, the role of public money in this process should be made clear, and the superficial claims of solving complex urban issues should be dropped. Or are hackathons a method of signaling open governance? If so, this form of participation should be examined against existing models of collaborative governance. Although the analytical literature surrounding hackathons is scarce, it is necessary to develop best practices for running a hackathon and building on the results. In doing so, the hackathon may have the opportunity to become everything it wants to be: transparent, collaborative, and innovative. For now, however, the lofty goals remain unmet.
References
Brustein J (2012) Contest whose winners may not succeed. New York Times. http://www.nytimes.
com/2012/03/04/nyregion/new-yorks-bigapps-contest-has-mixed-results.html
Greenfield A (2013) Against the smart city (the city is here for you to use) [Kindle eBook ed.].
Amazon Digital Services
Hack for Change (2014a) Key highlights. http://hackforchange.org/about/key-highlights/
Hack for Change (2014b) FAQ. http://hackforchange.org/about/faq/
Hack for Change (2014c) About. http://hackforchange.org/about/
Hack for Change (2014d) Challenges. http://hackforchange.org/challenges/
Hack for Change (2014e) Peace corps peace. http://hackforchange.org/projects/peace-corps-
peace-2/
Heyman L (2014) Announcing the White House’s second annual civic hackathon [Blog post]. The
White House Blog. http://www.whitehouse.gov/blog/2014/05/01/announcing-white-houses-
second-annual-civic-hackathon
Johnson C (2010) Build communities not contests [Blog post]. The Information Diet. http://www.
informationdiet.com/blog/read/build-communities-not-contests
Judd N (2011) Code for America’s chief geek says civic hackers should fix hackathons next.
TechPresident. http://techpresident.com/short-post/code-americas-chief-geek-says-civic-
hackers-should-fix-hackathons-next
Keyani P (2012) Stay focused and keep hacking [Blog post]. Facebook Engineering Notes. https://
www.facebook.com/notes/facebook-engineering/stay-focused-and-keep-hacking/
10150842676418920
Krueger A (2012) Hackathons aren’t just for hacking. Wired Magazine. http://www.wired.com/
2012/06/hackathons-arent-just-for-hacking/
Lawler R (2013) HealthyOut is like a personal nutritionist for healthy food deliveries.
TechCrunch. http://techcrunch.com/2013/04/30/healthyout/
Llewellyn A (2012) The power of hackathons in government [Blog post]. NASA Blog. http://open.
nasa.gov/blog/2012/06/29/the-power-of-hackathons-in-government/
Llewellyn A (2014) National day by the numbers. Hack for Change Blog. http://hackforchange.
org/national-day-by-the-numbers/
Lomas N (2014) Hopscotch, an iPad app that helps kids learn to code, raises $1.2M. TechCrunch.
http://techcrunch.com/2014/05/08/hopscotch-seed/
Mattern S (2013) Methodolatry and the art of measure. Places: Design Observer. http://places.
designobserver.com/feature/methodolatry-in-urban-data-science/38174/
Melton D (2011) Scaling our movement [Blog post]. Code for America Blog. http://www.
codeforamerica.org/blog/2011/08/17/scaling-our-movement/
National Center for Women and Information Technology (NCWIT) (n.d.) Top 10 ways to increase girls' participation in computing competitions. http://www.ncwit.org/resources/top-10-ways-increase-girls-participation-computing-competitions/top-10-ways-increase-girls
New York City Economic Development Corporation (NYCEDC) (2016) NYC BigApps past
competitions. http://www.nycedc.com/services/nyc-bigapps/past-competitions
New York City Office of the Mayor (2012) Mayor Bloomberg appoints Rahul N. Merchant as the
city’s first chief information and innovation officer [Press Release]. News from the Blue Room.
http://www.nyc.gov/cgi-bin/misc/pfprinter.cgi?action=print&sitename=OM&p=1405371522000
New York City Office of the Mayor (2013) Mayor Bloomberg announces winners of NYC
BigApps, fourth annual competition to create apps using city data [Blog Post]. http://www1.
nyc.gov/office-of-the-mayor/news/215-13/mayor-bloomberg-winners-nyc-bigapps-fourth-
annual-competition-create-apps-using
Nnaemeka C (2013) The unexotic underclass. MIT Entrepreneurship Review. http://miter.mit.edu/
the-unexotic-underclass/
Oram A (2011) App outreach and sustainability: lessons learned by Portland, Oregon. Radar Blog
(O’Reilly Media). http://radar.oreilly.com/2011/07/app-outreach-and-sustainabilit.html
Peled A (2011) When transparency and collaboration collide: the USA open data program. J Am
Soc Inf Sci Technol 62(11):2085–2094
Porway J (2013) You can’t just hack your way to social change [Blog post]. Harvard Business
Review Blog Network. http://blogs.hbr.org/2013/03/you-cant-just-hack-your-way-to/
Rinearson T (2013) Running an inclusive hackathon. Medium. https://medium.com/hackers-and-
hacking/running-an-inclusive-hackathon-630f3f2e5e71
Roche S, Nabian N, Kloeckl K, Ratti C (2012) Are 'smart cities' smart enough? In: Global geospatial conference, Quebec, Canada, 14–17 May
SecondMuse (2013) National day of civic hacking. http://secondmuse.com/project3.html
Smith A (2013) Smartphone ownership—2013 update. Pew Internet & American Life Project.
Washington, DC. http://www.pewinternet.org/~/media/Files/Reports/2013/PIP_Smartphone_
adoption_2013.pdf
The White House, Office of Management and Budget (2009) Open government directive [Press
release]. http://www.whitehouse.gov/open/documents/open-government-directive
The White House, We the People (2013) We the people API gallery. https://petitions.whitehouse.
gov/how-why/api-gallery
Townsend A (2013) Smart cities: big data, civic hackers, and the quest for a new utopia.
W.W. Norton, New York
Yes We Code (2013) Homepage. http://www.yeswecode.org/