
2014 IEEE International Conference on Big Data

Big Data: Challenges, Practices and Technologies


NIST Big Data Public Working Group Workshop at IEEE Big Data 2014

Nancy W. Grady1, Mark Underwood2, Arnab Roy3, Wo L. Chang4


1 Science Applications International Corporation, [email protected]
2 Krypton Brothers, LLC, [email protected]
3 Fujitsu Laboratories of America, [email protected]
4 National Institute of Standards and Technology, [email protected]

Abstract—Big Data has changed both technologies and practices for building data analytics systems. A number of working groups have been discussing the recent changes along a number of dimensions. The NIST Big Data Public Working Group organized a workshop to promote communication among working groups, technologists, and practitioners to come to an understanding of the current state of the Big Data discipline, collaboration best practices, future directions for this emerging specialization, and to identify security and privacy concerns.

Keywords—Big Data; reference architecture; collaboration; security; privacy; metadata; standards

I. INTRODUCTION

NIST has been facilitating a Big Data Public Working Group (NBD-PWG) to form a community of interest from industry, academia, and government to promote better understanding of this new discipline. The aim has been the development of consensus definitions, taxonomies, reference architectures, and technology roadmaps based on an understanding of use cases and requirements. The goal is to create vendor-neutral, technology- and infrastructure-agnostic vocabulary and descriptions. This will enable Big Data stakeholders to better understand the emerging discipline, and to choose the best-suited analytics tools for their processing and visualization requirements on the most appropriate computing platforms and clusters. By providing a framework for communication, needs and capabilities can be better matched between technologists and practitioners.

To further the NBD-PWG goals, a workshop was staged at the IEEE Big Data 2014 conference to bring together technologists and practitioners, to understand what has changed with Big Data, assess the current state of the art, identify lessons learned, and surface known challenges. To span this new discipline, four panels were organized: The State of Big Data Technology, Big Data Future Trends, Big Data Sharing and Collaboration, and Big Data Security and Privacy.

II. THE STATE OF BIG DATA TECHNOLOGY

The term Big Data has come to mean many things, and communication about new approaches has been hindered by conflicting vocabulary and definitions. To better understand this emerging discipline, this panel discussed frameworks for understanding the new architectures, use cases and requirements, and benchmarks.

A. Data Consistency Issues in Big Data Systems – Jianmin Wang, Tsinghua Big Data Research Center

Distributed storage systems are required to guarantee data reliability, fault-tolerance and accessibility for users. Besides hardware configuration, the design and implementation of distributed systems is very important to reach these goals. The most common solution is to store multiple copies of the same data in different storage devices. The multiple copies are called data replicas.

We take two popular distributed storage systems as examples to analyze the working mechanism and replica consistency. The first one is Cassandra, propagating data in a star model, and the second one is HDFS, propagating data in a chain model.

B. NIST Big Data Interoperability Framework – Orit Levin, Microsoft

The National Institute of Standards and Technology (NIST) Big Data Interoperability Framework, Volume 6: Reference Architecture is one of seven volumes in the roadmap, whose overall aims are to define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytic techniques, and technology infrastructure in order to support secure and effective adoption of Big Data. The Reference Architecture is dedicated to developing a vendor-neutral, technology- and infrastructure-agnostic conceptual model and examining related issues. Created by the NIST Big Data Public Working Group (NBD-PWG) Reference Architecture Subgroup, the conceptual model is based on the analysis of public Big Data material and inputs from the other NBD-PWG subgroups. The NIST Big Data Reference Architecture (NBD-RA) is applicable to a variety of business environments including tightly-integrated enterprise systems, as well as loosely-coupled vertical industries that rely on the cooperation of independent stakeholders.

C. NIST Use Cases and Requirements – Geoffrey Fox, Indiana University

The NIST Big Data Public Working Group collected an extensive catalog of use cases, reflecting Big Data applications in public health, epidemiology, U.S. census, cargo shipping, geointelligence, defense, genomics, recommendation engines and media. These applications were then examined in the

978-1-4799-5666-1/14/$31.00 ©2014 IEEE


Authorized licensed use limited to: University of Reading. Downloaded on March 21,2024 at 17:51:23 UTC from IEEE Xplore. Restrictions apply.
context of the NIST group’s reference architecture to identify recurring patterns thought to be specific to Big Data applications. These patterns were further explored in light of current Apache stack offerings. These insights will likely be useful to prospective system designers.

D. Introducing TPCx-HS – first Industry Standard for Benchmarking Big Data Systems – Raghunath Nambiar, Cisco

Over the past quarter century, industry standard benchmarks have had a significant impact on the computing industry. Vendors use benchmark standards to illustrate performance competitiveness for their existing products, and to improve and monitor the performance of their products under development. Many buyers use the results as points of comparison when purchasing new computing systems. Continuing on the Transaction Processing Performance Council’s commitment to bring relevant benchmarks to industry, the TPC announced TPCx-HS – the first standard that provides verifiable performance, price/performance and energy consumption metrics for Big Data systems. TPCx-HS can be used to assess a broad range of system topologies and implementation methodologies for Hadoop, in a technically rigorous and directly comparable, vendor-neutral manner. And while modeling is based on a simple application, the results are highly relevant to Big Data hardware and software systems.

III. BIG DATA FUTURE DIRECTIONS

Is volume, velocity, variety, veracity or some other facet of Big Data most critical for planning a particular Big Data project? Will a given deployment, even if well considered, find itself overtaken by a superseding technology? What are the emerging trends and technologies to be aware of? These are questions practitioners must entertain now as new commercial releases are transforming the capabilities of widely used Big Data software. The Future Directions panel considers likely Big Data trends in hardware, computing models, analytics and measurement.

A. InfoSymbiotics/DDDAS and the Next Generation of Big Data and Big Computing – Frederica Darema, Air Force Office of Scientific Research

We describe DDDAS (Dynamic Data Driven Applications Systems), a new paradigm unifying systems modeling and systems instrumentation. DDDAS can facilitate new capabilities for advanced modeling/simulation and intelligent exploitation of data of engineered, natural, and societal multi-entity systems. Results may include improved understanding, analysis, and optimized, autonomic management and decision support of operational conditions of these systems.

The key underlying concept in DDDAS is the dynamic integration between data and computation, whereby instrumentation data and executing models of systems become a feedback control loop. On-line data are dynamically incorporated into executing models of the system to improve the accuracy or speed up the simulation, and in reverse the executing model controls the instrumentation to selectively target the data collection process to improve accuracy and measurability.

This paradigm, unifying modeling and instrumentation, is timely with the advent of large-scale dynamic data and large-scale Big Computing. Large-scale dynamic data is the next wave of Big Data, namely dynamic data arising from ubiquitous sensing and control in engineered, natural, and societal systems. Numerous heterogeneous sensors and controllers will instrument these systems. The opportunities and challenges at these “large-scales” relate not only to the size of the data but also to the heterogeneity in data, data collection modalities, data fidelities, and timescale, ranging from real-time data moving in microseconds to data at rest (archive). In tandem with this important dimension of dynamic data is an extended view of Big Computing, which includes a new dimension of distributed computing; that is, the range of computing from the high-end to computing at the sensor and controller levels, and in particular the collections of networked assemblies of sensors and controllers.

The DDDAS paradigm, driving and exploiting notions of large-scale dynamic data and large-scale Big Computing, is shaping research directions and transforming a range of application areas. Examples of advances and new capabilities are presented. These include analysis and decision support for structural systems, manufacturing, environmental and critical infrastructure (such as urban and air transportation), and power grids.

B. NIST Roadmap and Standards – David Boyd, L-3 Data Tactics

The NIST Big Data Interoperability Framework: Volume 7, Technology Roadmap was prepared by the NBD-PWG’s Technology Roadmap Subgroup. It addresses the overarching information and context about key questions such as:

• When is data considered “Big”?
• How did Big Data evolve?
• What will it evolve to?
• How is technology developing to deal with Big Data in terms of storage, organization, processing, and resource management?
• What standards are needed and evolving to deal with Big Data? and,
• How might organizations address their Big Data challenges?

This presentation will discuss the issues of organizational readiness, technology readiness, technology features, standards initiatives and strategies.

C. Big Data Analytics Interest Group (BDA IG) of Research Data Alliance (RDA) – Kwo-Sen Kuo, Bayesics

The Big Data Analytics (BDA) Interest Group was formed to develop community-based recommendations for viable data analytics approaches to address scientific community needs of

efficiently utilizing large quantities of data. It supports formation of working groups to tackle specific problems.

• BDA aims to clarify some foundational terminologies in the context of data analytics, understanding differences/overlaps with terms like data science, data analysis, data mining, etc.
• BDA will develop a recommendation document with a systematic classification of feasible combinations of analysis algorithms, analytical tools, data and resource characteristics and scientific queries. These recommendation documents can serve as a best practice guide for scientific groups/communities interested in investing in Big Data technologies.
• BDA works to develop a consensus amongst its members to achieve this desired goal.
• BDA collaborates with external bodies and initiatives such as NIST, OGC, ISO, EarthServer and others.

D. Next-Generation Computing Systems for Big Data Machine Learning and Graph Analytics – H. Howie Huang, George Washington University

Big Data machine learning and graph analytics have been widely used in industry, academia and government. Continuous advance in this area is critical to business success, scientific discovery, as well as cybersecurity. In this position paper, we present the current state of the art, and propose that next-generation computing systems for Big Data machine learning and graph analytics need innovative designs in both hardware and software that provide a good match between Big Data algorithms and the underlying computing and storage resources.

IV. BIG DATA SHARING AND COLLABORATION

Critical to moving Big Data forward as a discipline are the methods needed for improving both collaboration and data sharing. We are familiar with the cooperation for open source technology development and in online courses, but how do we cooperatively move forward and put these technologies into practice? How do we better provision data frameworks to promote technology adoption, data sharing and data reuse?

A. Public Private Collaboration – Johan Bos-Beier, ACT/IAC

The ACT-IAC Big Data Committee (BDC) seeks to enable government agencies to make better data-driven decisions through the analysis, management, integration, and representation of large and complex data stores. The BDC seeks to:

• Provide a forum for information sharing and collaboration between federal, state, and local government agencies seeking to leverage their data for better informed decision-making.
• Advise or recommend approaches to developing Big Data technical frameworks and capability maturity model assessments.
• Promote Big Data best practices through increasing awareness of Big Data research, technologies, use cases, and high-performance computing within the Federal Government.

B. Implementation of Big Data Applications in Government and Science Communities – Joan L. Aron, Federal Big Data Working Group

A conceptual overview sets the context for the uses of Big Data for knowledge discovery and decision support and the challenges in developing applications. The federation of use cases, data publications, solutions & technologies provides examples. Semantic analysis is the basis of solutions for many applications for government and science communities. The federal government has greater needs for aggregating data while maintaining compliance with privacy and security requirements. Cognitive metadata, which is the metadata coming from enhancing machine learning with our human perception, reasoning or intuition, can be used for personalization purposes and conversely for protecting personally identifiable information (PII). A new technology for natural language understanding can be used to find high-value information in a large body of texts, such as a collection of agency reports, with little specialized training. Advances in high-performance computational hardware are also important. A semantic MEDLINE for searching biomedical research literature uses hardware built for Resource Description Framework (RDF) triples in a graph database and semantic processing developed at the National Library of Medicine. A high-performance computing cluster environment is in use for searching public records, patent data, case law and news articles. Use cases with a focus on environment and Earth system science illustrate achievements and challenges for the use of Big Data in data publishing and data access, data discovery and decision support, and workforce development for the scientific community and decision-makers to work with data science.

C. Data-Intensive Science Challenges – Thomas Huang, NASA Earth Science Data Systems Data-Intensive Architecture Working Group

Data-Intensive Science defines three high-level activities: capture, curation, and analysis of data. Tackling Big Science Data requires more than just infusing Cloud Computing, Hadoop, and NoSQL. Science data system architecture is an orchestration of people, process, policies, and technologies. It requires a thorough understanding of the problem space, assessment of the technologies available, a process that is repeatable and traceable, and an adaptable architecture. This session focuses on architectural discussion and enabling technologies for tackling data-intensive science. The discussion should be supported by use cases as the instrument to facilitate review of current science data systems and assessment of some of the enabling technologies.

D. Big Data Provenance and Metadata – Rajeev Agrawal, North Carolina A&T State University

With the progress of new technology, the volume and complexity of data produced and processed in scientific research is increasing remarkably. This data is growing so fast that existing resources are facing difficulty to analyze data

properly. It is important to properly track scientific workflows to provide context and reproducibility. Provenance deals with this need and assists scientists by delivering the lineage or history of how data is generated, used and modified. We discuss a complete workflow of tracking provenance information of Big Data.

V. BIG DATA SECURITY AND PRIVACY

The distribution of data across resources, and the involvement of a number of organizations in one system, open up new concerns for security and privacy. This panel will focus on the areas that are new and different because of the Big Data architectures. The panel will discuss the state of the art in security and privacy enhancing technologies, Big Data privacy concerns and the over-arching challenge of deriving knowledge from Big Data while preserving privacy.

A. Big Data Analytics for Security – Pratyusa Manadhata, HP and Computer Security Alliance

Enterprises routinely collect terabytes of security-relevant data (e.g., network events, software application events, and people action events) for several reasons, including the need for regulatory compliance and post-hoc forensic analysis. We estimate that large enterprises may generate 10-100 billion events per day depending on their size. These numbers will grow as enterprises enable event logging in more sources, hire more employees, deploy more devices, and run more software. Unfortunately, this volume of data quickly becomes overwhelming. Existing analytical techniques do not work well at this scale and typically produce so many false positives that their efficacy is undermined. The problem becomes worse as enterprises move to cloud architectures and collect much more data. We will discuss techniques to mitigate this problem.

B. Cyber Security and the Industrial Internet – Stephen Mellor, Industrial Internet Consortium

Through its public-private partnership, the Industrial Internet Consortium (IIC) is committed to working with public and private partnerships to ensure that security and privacy are integral parts of Industrial Internet products and services. The IIC is working with its ecosystem to identify the requirements for communication protocols and create mechanisms to enhance rapid discovery, mitigation, and remediation of vulnerabilities in near real-time. This session will be an open discussion on how the IIC is defining future requirements and recommendations to ensure the Industrial Internet is private and secure.

C. NIST Big Data Security and Privacy – Mark Underwood, Krypton Brothers

The NIST Big Data Interoperability Framework Volume 4: Security and Privacy Requirements was prepared by the NBD-PWG’s Security and Privacy Subgroup to identify security and privacy issues particular to Big Data. Big Data application domains include health care, drug discovery, finance and many others from both the private and public sectors. Among the scenarios within these application domains are health exchanges, clinical trials, mergers and acquisitions, device telemetry, and international anti-piracy. Security technology domains include identity, authorization, audit, network and device security, and federation across trust boundaries.

Clearly, the advent of Big Data has necessitated paradigm shifts in the understanding and enforcement of security and privacy (S&P) requirements. Significant changes are evolving, notably in scaling existing solutions to meet the volume, variety, and velocity of Big Data, and re-targeting security solutions amid shifts in technology infrastructure, e.g., distributed computing systems and non-relational data storage. In addition, as diverse datasets become ever-easier to access, many are increasingly personal in nature. Thus, a whole new set of emerging issues must be addressed, including balancing privacy and utility, enabling analytics and governance on encrypted data, and reconciling authentication and anonymity. Working with other subgroups in the NBD-PWG, this subgroup has begun to expand the distributed computing concept of a Big Data security fabric.

With the key Big Data characteristics of variety, volume, and velocity in mind, the subgroup gathered use cases from volunteers, developed a consensus security and privacy taxonomy and reference architecture, and validated it by mapping the use cases to the reference architecture.

D. Education Data Privacy and State Boards of Education – Amelia Vance, National Association of State Boards of Education

Big Data has the potential to revolutionize education, allowing for more efficient and effective schools. It can allow every teacher to personalize every element of instruction, and enable policymakers to see exactly which elements of each educational policy are successful in helping ensure students are college- and career-ready. However, while many technologists believe that the benefits of Big Data in education are self-evident and outweigh any dangers of collecting sensitive student information, many parents, teachers, and policymakers do not feel the same way. Only now are parents learning about the data schools are collecting about their children. They are justly concerned about how it is used and shared; the fact that data collection is often outsourced to third-party vendors only adds to their skepticism and concerns for their child's privacy. This has led to an instinctual response by many policymakers and others to work against the use of Big Data in education, despite the potential benefits it may have for education. In 2014, state legislatures introduced 110 bills in 36 states regarding student data privacy. Seventy-nine of the 2014 bills have at least some elements that would restrict the use of data in education. For example, New Hampshire's bill, which was passed into law, likely prevents predictive analytics. A bill in Missouri would have defunded the state's longitudinal data system. In all, 28 of the 110 bills introduced passed into law this year. The number of student data privacy bills is expected to double in the 2015 legislative session.

Many of the bills introduced, and the laws passed, give state boards of education (SBEs) a key role in the data privacy discussion. Eighteen SBEs are tasked by statute with writing their state's student data management policy or have oversight authority for the agency that is writing the policy. Thirteen SBEs are members of their state's data management team.

Seven SBEs are required by statute to ensure FERPA compliance. Fifty-five bills introduced in 2014 would give SBEs some authority in regulating student data privacy. Existing state privacy laws give many SBEs authority over various measures that help secure data privacy, including appointing a chief privacy officer, adopting and/or implementing state privacy policies, and providing oversight of vendor contracts. SBEs have also independently passed rules for their states to protect data privacy. Unfortunately, like many other policymakers, many SBE members are unaware of the potential benefits of Big Data in education. Education data privacy requires knowledge of privacy law, a basic understanding of Big Data, and a great deal of time to learn about the ins and outs of today's education data privacy debate. The National Association of State Boards of Education (NASBE) is helping SBEs understand and pass effective policies on these issues that will protect data privacy while supporting educational innovation through the use of Big Data. In this panel, Amelia Vance from NASBE will discuss the role SBEs play in education data collection, the questions they are asking as they put together state privacy policies (particularly those dealing with third-party use of data), and what information policymakers need from technology providers in order to trust the use of Big Data in education.

We consider the perspectives and recommendations from multiple organizations and experts, including the Data Quality Campaign, the Electronic Privacy Information Center, and the Pioneer Institute, as well as examine the lessons learned thus far in states from failed attempts at responsible data collection and privacy security.

ACKNOWLEDGMENT

The authors wish to thank the panelists for their time and efforts to share their expertise and further the dialog for clarifying the new discipline of Big Data. The authors also wish to acknowledge the contributions of the large group of participants in the NBD-PWG, who have discussed at length the emerging discipline of Big Data, and have helped form a collective understanding of this new paradigm.

REFERENCES

[1] N. Grady and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 1, Definitions,” NIST, unpublished.
[2] N. Grady and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 2, Taxonomy,” NIST, unpublished.
[3] G. Fox and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 3, Use Cases and Requirements,” NIST, unpublished.
[4] A. Roy, M. Underwood, and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 4, Security and Privacy Requirements,” NIST, unpublished.
[5] S. Mishra and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 5, Architectures White Paper Survey,” NIST, unpublished.
[6] O. Levin and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 6, Reference Architecture,” NIST, unpublished.
[7] D. Boyd, C. Buffington, and W. Chang, eds., “NIST Big Data Interoperability Framework: Volume 7, Technology Roadmap,” NIST, unpublished.
