Big Data Challenges Practices and Technologies NIST Big Data Public Working Group Workshop at IEEE Big Data 2014
II. THE STATE OF BIG DATA TECHNOLOGY
The term Big Data has come to mean many things, and communication about new approaches has been hindered by the conflicting vocabulary and definitions. To better understand this emerging discipline, this panel discussed frameworks for understanding the new architectures, use cases and requirements, and benchmarks.

C. NIST - Use Cases and Requirements – Geoffrey Fox, Indiana University
The NIST Big Data Public Working Group collected an extensive catalog of use cases, reflecting Big Data applications in public health, epidemiology, U.S. census, cargo shipping, geointelligence, defense, genomics, recommendation engines and media. These applications were then examined in the
Authorized licensed use limited to: University of Reading. Downloaded on March 21,2024 at 17:51:23 UTC from IEEE Xplore. Restrictions apply.
efficiently utilizing large quantities of data. It supports formation of working groups to tackle specific problems.
• BDA aims to clarify some foundational terminologies in the context of data analytics, understanding differences/overlaps with terms like data science, data analysis, data mining, etc.
• BDA will develop a recommendation document with a systematic classification of feasible combinations of analysis algorithms, analytical tools, data and resource characteristics and scientific queries. These recommendation documents can serve as a best practice guide for scientific groups/communities interested in investing in Big Data technologies.
• BDA works to develop a consensus amongst its members to achieve this desired goal.
• BDA collaborates with external bodies and initiatives such as NIST, OGC, ISO, EarthServer and others.

D. Next-Generation Computing Systems for Big Data Machine Learning and Graph Analytics – H. Howie Huang, George Washington University
Big data machine learning and graph analytics have been widely used in industry, academia and government. Continued advances in this area are critical to business success, scientific discovery, and cybersecurity. In this position paper, we present the current state of the art, and propose that next-generation computing systems for Big Data machine learning and graph analytics need innovative designs in both hardware and software that provide a good match between Big Data algorithms and the underlying computing and storage resources.

IV. BIG DATA SHARING AND COLLABORATION
Critical to moving Big Data forward as a discipline are the methods needed for improving both collaboration and data sharing. We are familiar with the cooperation for open source technology development and in online courses, but how do we cooperatively move forward and put these technologies into practice? How do we better provision data frameworks to promote technology adoption, data sharing and data reuse?

A. Public Private Collaboration – Johan Bos-Beier, ACT/IAC
The ACT-IAC Big Data Committee (BDC) seeks to enable government agencies to make better data-driven decisions through the analysis, management, integration, and representation of large and complex data stores. The BDC seeks to:
• Provide a forum for information sharing and collaboration between federal, state, and local government agencies seeking to leverage their data for better informed decision-making.
• Advise or recommend approaches to developing Big Data technical frameworks and capability maturity model assessments.
• Promote Big Data best practices through increasing awareness of Big Data research, technologies, use cases, and high performance computing within the Federal Government.

B. Implementation of Big Data Applications in Government and Science Communities – Joan L. Aron, Federal Big Data Working Group
A conceptual overview sets the context for the uses of Big Data for knowledge discovery and decision support and the challenges in developing applications. The federation of use cases, data publications, solutions & technologies provides examples. Semantic analysis is the basis of solutions for many applications for government and science communities. The federal government has greater needs for aggregating data while maintaining compliance with privacy and security requirements. Cognitive metadata, which is the metadata coming from enhancing machine learning with our human perception, reasoning or intuition, can be used for personalization purposes and conversely for protecting personally identifiable information (PII). A new technology for natural language understanding can be used to find high-value information in a large body of texts, such as a collection of agency reports, with little specialized training. Advances in high-performance computational hardware are also important. A semantic MEDLINE for searching biomedical research literature uses hardware built for Resource Description Framework (RDF) triples in a graph database and semantic processing developed at the National Library of Medicine. A high-performance computing cluster environment is in use for searching public records, patent data, case law and news articles. Use cases with a focus on environment and Earth system science illustrate achievements and challenges for the use of Big Data in data publishing and data access, data discovery and decision support, and workforce development for the scientific community and decision-makers to work with data science.

C. Data-Intensive Science Challenges – Thomas Huang, NASA Earth Science Data Systems Data-Intensive Architecture Working Group
Data-Intensive Science defines three high-level activities: capture, curation, and analysis of data. Tackling Big Science Data requires more than just infusing Cloud Computing, Hadoop, and NoSQL. Science data system architecture is an orchestration of people, process, policies, and technologies. It requires a thorough understanding of the problem space, an assessment of the technologies available, a process that is repeatable and traceable, and an adaptable architecture. This session focuses on architectural discussion and enabling technologies for tackling data-intensive science. The discussion should be supported by use cases as the instrument to facilitate review of current science data systems and assessment of some of the enabling technologies.

D. Big Data Provenance and Metadata – Rajeev Agrawal, North Carolina A&T State University
With the progress of new technology, the volume and complexity of data produced and processed in scientific research are increasing remarkably. This data is growing so fast that existing resources have difficulty analyzing the data
properly. It is important to properly track scientific workflows to provide context and reproducibility. Provenance deals with this need and assists scientists by delivering the lineage, or history, of how data is generated, used and modified. We discuss a complete workflow for tracking the provenance information of Big Data.

V. BIG DATA SECURITY AND PRIVACY
The distribution of data across resources, and the involvement of a number of organizations in one system, open up new concerns for security and privacy. This panel will focus on the areas that are new and different because of the Big Data architectures. The panel will discuss the state of the art in security and privacy enhancing technologies, Big Data privacy concerns and the over-arching challenge of deriving knowledge from Big Data while preserving privacy.

A. Big Data Analytics for Security – Pratyusa Manadhata, HP and Computer Security Alliance
Enterprises routinely collect terabytes of security relevant data (e.g., network events, software application events, and people action events) for several reasons, including the need for regulatory compliance and post-hoc forensic analysis. We estimate that large enterprises may generate 10-100 billion events per day depending on their size. These numbers will grow as enterprises enable event logging in more sources, hire more employees, deploy more devices, and run more software. Unfortunately, this volume of data quickly becomes overwhelming. Existing analytical techniques do not work well at this scale and typically produce so many false positives that their efficacy is undermined. The problem becomes worse as enterprises move to cloud architectures and collect much more data. We will discuss techniques to mitigate this problem.

B. Cyber Security and the Industrial Internet – Stephen Mellor, Industrial Internet Consortium
Through its public-private partnership, the IIC is committed to working with public and private partnerships to ensure that security and privacy are integral parts of Industrial Internet products and services. The IIC is working with its ecosystem to identify the requirements for communication protocols and create mechanisms to enhance rapid discovery, mitigation, and remediation of vulnerabilities in near real-time. This session will be an open discussion on how the IIC is defining future requirements and recommendations to ensure the Industrial Internet is private and secure.

C. NIST Big Data Security and Privacy – Mark Underwood, Krypton Brothers
The NIST Big Data Interoperability Framework Volume 4: Security and Privacy Requirements was prepared by the NBD-PWG’s Security and Privacy Subgroup to identify security and privacy issues particular to Big Data. Big Data application domains include health care, drug discovery, finance and many others from both the private and public sectors. Among the scenarios within these application domains are health exchanges, clinical trials, mergers and acquisitions, device telemetry, and international anti-piracy. Security technology domains include identity, authorization, audit, network and device security, and federation across trust boundaries.
Clearly, the advent of Big Data has necessitated paradigm shifts in the understanding and enforcement of security and privacy (S&P) requirements. Significant changes are evolving, notably in scaling existing solutions to meet the volume, variety, and velocity of Big Data, and re-targeting security solutions amid shifts in technology infrastructure, e.g., distributed computing systems and non-relational data storage. In addition, as diverse datasets become ever-easier to access, many are increasingly personal in nature. Thus, a whole new set of emerging issues must be addressed, including balancing privacy and utility, enabling analytics and governance on encrypted data, and reconciling authentication and anonymity. Working with other subgroups in the NBD-PWG, this subgroup has begun to expand the distributed computing concept of a Big Data security fabric.
With the key Big Data characteristics of variety, volume, and velocity in mind, the subgroup gathered use cases from volunteers, developed a consensus security and privacy taxonomy and reference architecture, and validated it by mapping the use cases to the reference architecture.

D. Education Data Privacy and State Boards of Education – Amelia Vance, National Association of State Boards of Education
Big data has the potential to revolutionize education, allowing for more efficient and effective schools. It can allow every teacher to personalize every element of instruction, and enable policymakers to see exactly which elements of each educational policy are successful in helping ensure students are college- and career-ready. However, while many technologists believe that the benefits of Big Data in education are self-evident and outweigh any dangers of collecting sensitive student information, many parents, teachers, and policymakers do not feel the same way. Only now are parents learning about the data schools are collecting about their children. They are justly concerned about how it is used and shared. The fact that data collection is often outsourced to third-party vendors only adds to their skepticism and concerns for their child's privacy. This has led to an instinctual response by many policymakers and others to work against the use of Big Data in education, despite the potential benefits it may have for education. In 2014, state legislatures introduced 110 bills in 36 states regarding student data privacy. Seventy-nine of the 2014 bills have at least some elements that would restrict the use of data in education. For example, New Hampshire's bill, which was passed into law, likely prevents predictive analytics. A bill in Missouri would have defunded their statewide longitudinal data system. In all, 28 of the 110 bills introduced passed into law this year. The number of student data privacy bills is expected to double in the 2015 legislative session.
Many of the bills introduced, and the laws passed, give state boards of education (SBEs) a key role in the data privacy discussion. Eighteen SBEs are tasked by statute with writing their state's student data management policy or have oversight authority for the agency that is writing the policy. Thirteen SBEs are members of their state's data management team.
Seven SBEs are required by statute to ensure FERPA compliance. Fifty-five bills introduced in 2014 would give SBEs some authority in regulating student data privacy. Existing state privacy laws give many SBEs authority over various things to help secure data privacy, including appointing a chief privacy officer, adopting and/or implementing state privacy policies, and providing oversight of vendor contracts. SBEs have also independently passed rules for their states to protect data privacy. Unfortunately, like many other policymakers, many SBE members are unaware of the potential benefits of Big Data in education. Education data privacy requires knowledge of privacy law, a basic understanding of Big Data, and a great deal of time to learn about the ins and outs of today's education data privacy debate. The National Association of State Boards of Education (NASBE) is helping SBEs understand and pass effective policies on these issues that will protect data privacy while supporting educational innovation through the use of Big Data. In this panel, Amelia Vance from NASBE will discuss the role SBEs play in education data collection, the questions they are asking as they put together state privacy policies (particularly those dealing with third party use of data), and what information policymakers need from technology providers in order to trust the use of Big Data in education.
We consider the perspectives and recommendations from multiple organizations and experts, including the Data Quality Campaign, the Electronic Privacy Information Center, and the Pioneer Institute, as well as examine the lessons learned thus far in states from failed attempts in responsible data collection and privacy security.

ACKNOWLEDGMENT
The authors wish to thank the panelists for their time and efforts to share their expertise and further the dialog for clarifying the new discipline of Big Data. The authors also wish to acknowledge the contributions of the large group of participants in the NBD-PWG, who have discussed at length the emerging discipline of Big Data, and have helped form a collective understanding of this new paradigm.

REFERENCES
[1] N. Grady, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 1, Definitions,” NIST, unpublished.
[2] N. Grady, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 2, Taxonomy,” NIST, unpublished.
[3] G. Fox, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 3, Use Cases and Requirements,” NIST, unpublished.
[4] A. Roy, M. Underwood, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 4, Security and Privacy Requirements,” NIST, unpublished.
[5] S. Mishra, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 5, Architectures White Paper Survey,” NIST, unpublished.
[6] O. Levin, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 6, Reference Architecture,” NIST, unpublished.
[7] D. Boyd, C. Buffington, W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 7, Taxonomy,” NIST, unpublished.