DZone's 2015 Guide to Big Data, Business Intelligence, and Analytics
2015 Edition
[…] the "V" dimensions of Big Data—volume, velocity, and variety. Of course, say data scientists, that picture of Big Data doesn't care about the meaning of the data, so we need another two V's—maybe veracity and value. The first three describe the data itself. The second two are about how that data relates to reality.

Enjoy the Guide and let us know what you think.

John Esposito
Editor-in-Chief, DZone Research
research@dzone.com

In this guide: 16 Mining the Big Data Gold Rush [infographic] · 18 Executive Insights on Big Data, by Tom Smith · 22 How Streaming Is Shaking Up the Data Analytics Landscape, by Justin Langseth

Want your solution to be featured in coming guides? Please contact [email protected] for submission information.
Like to contribute content to coming guides? Please contact [email protected] for consideration.
Interested in becoming a DZone research partner? Please contact [email protected] for information.
Summary

DZone surveyed over 400 IT professionals to better understand how individuals and organizations are currently managing the collection, storage, and analysis of Big Data. The results provide a snapshot of the landscape of Big Data in the IT industry and, set against results from last year's Big Data survey, reveal recent trends and trajectories of Big Data technologies and strategies.

Research Takeaways

01. Big Data is increasingly about more than just Hadoop

Data: While Apache Hadoop remains popular (31% of respondents reported that their company uses Hadoop), its usage has dropped 5% from last year's survey. Of the tools being used with Hadoop, the two leading technologies—Hive and Pig—show usage decreases of 8% and 11% respectively from last year's results. Other tools being used with Hadoop, particularly Spark and Flume, have increased in popularity (with respective increases of 15% and 8%). Overall, the standard deviation of usage among these tools (Hive, Pig, Spark, Flume, Sqoop, Drill, Tez, and Impala) has decreased incrementally from last year (4%).

Implications: Organizational data needs are changing, and tools once considered useful for all Big Data applications no longer suffice in every use case. When batch operations were predominant, Hadoop could handle most organizations' needs. Advances in other areas of the IT world (think IoT) have changed the ways in which data needs to be collected, distributed, stored, and analyzed. Real-time data complicates these tasks and requires new tools to handle these complications efficiently. While Big Data seemed, for a while, to emphasize the volume of data involved, velocity and variety are becoming increasingly vital to many Big Data applications.

Recommendations: Don't rely on hype to make decisions regarding how you work with Big Data. As discussed in this guide's Key Findings, many organizations use Hadoop even when the volume of data they work with doesn't require Hadoop's distributed storage or processing. There are many tools available for your particular Big Data needs; to learn more about some of them, take a look at the article "Workload and Resource Management: YARN, Mesos, and Myriad" by Adam Diaz in this guide, or browse our Solutions Directory for a more comprehensive listing of platforms and tools.

02. Data volume estimates are essentially unchanged from last year

Data: Last year we asked respondents to estimate the volume of data their organizations stored and used (<1 TB, 1-9 TB, 10-49 TB, etc.), as well as to estimate storage and usage volumes for the coming year. We asked the same of respondents this year. Results from last year to this year, in almost every case, were within about 1% of each other. For example: last year, 23.6% of respondents estimated their organization used less than one terabyte of data, while only 9.4% of respondents estimated that they would be using less than one terabyte of data this year. In this year's results, 24% of respondents estimated they are currently using less than one terabyte of data, and only 10% estimate they will be using that little data next year.

Implications: Some organizations may be interested in (and ambitious about) increasing the amount of data they are able to analyze and use for their business applications, but unable to devote resources to the tools and people necessary to store and use that data properly. Projections for future data storage may be reconsidered when the cost of analysis is taken into account. It's also possible that a greater number of respondents last year focused on how they would deal with larger volumes of data, and that increasingly those people tasked with planning Big Data applications within their organizations are appreciating the other V's.

Recommendations: First, reconsider the volume of data you actually need to store. It's possible that storing and analyzing data at petabyte levels could provide great value to your organization, but you don't want to waste resources on storage for data you will never use. If you do need the volume, and storage is an issue, consider how best to scale your current datastores—or look into other types of datastores completely. Michael Hausenblas's article in this guide, "Five Ways Big Data Has Changed IT Operations," discusses these options.

03. Streaming data and Spark usage increase

Data: Spark is catching fire. In last year's survey, 24% of Hadoop users reported using Apache Spark, an open-source, in-memory cluster computing framework. This year's results show a full 15% increase of that figure, the largest year-over-year growth in Big Data tool usage by far.

Implications: Spark offers the first successful post-Hadoop general-purpose distributed computing platform, with a persistence model that facilitates massive performance gains over Hadoop for many job types. As real-time analytics (velocity) and flexibility of data models (variety) grow increasingly important, more developers are turning to Spark to handle large-scale data processing. (Read the DZone Apache Spark Refcard to get started with Spark right away.)

Recommendations: The impressive growth of Spark usage within the past year does not, of course, make it the only tool you need. In fact, as Dean Wampler discusses in his article "Why Developers Are Flocking to Fast Data and the Spark, Kafka & Cassandra Stack," it's one tool of many that can be combined for high performance in data storage, management, and analysis in certain use cases. As new opportunities arise within Big Data, new tools will become available and new stacks created. Use multiple tools to get the most out of your data.
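To make the persistence point in Takeaway 03 concrete, here is a minimal Scala sketch of Spark's in-memory model, assuming a local Spark 1.x setup; the input path and log format are hypothetical. Caching an RDD pins it in cluster memory after the first action, which is what lets repeated queries skip re-reading the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    // Hypothetical input path; any large text dataset works.
    val events = sc.textFile("hdfs:///data/events/*.log")

    // cache() keeps the filtered RDD in memory after the first action,
    // so subsequent queries skip re-reading and re-parsing the input.
    val errors = events.filter(_.contains("ERROR")).cache()

    println(s"total errors:   ${errors.count()}") // first pass reads from disk
    println(s"timeout errors: ${errors.filter(_.contains("timeout")).count()}") // served from memory

    sc.stop()
  }
}
```

The second count runs against the cached partitions, which is the source of the "massive performance gains" the takeaway describes for iterative and interactive workloads.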
Key Research Findings

74% of respondents are either exploring and learning tools and technology for large-scale data gathering and analysis or currently building a proof-of-concept. Furthermore, of the remaining 26% (developers whose Big Data projects are beyond proof-of-concept), just over half are beyond the testing stage. These numbers suggest that development patterns have considerable room to mature in order to shrink the pre-production R&D cycle for applications that handle Big Data.

Nevertheless, it is significant that over half of developers surveyed are exploring and/or prototyping systems for large-scale data gathering and analysis. This is the first time a majority of respondents reported a practical, forward-looking embrace of Big Data, although the change from last year is incremental (up 6%). Big Data is now less hype than substance, even for development initiatives not yet yielding business value.
[Survey chart 01: "What is the status of your organization's large-scale data gathering and analysis efforts?" Response options visible in the chart: currently building a proof-of-concept; first solution is deployed; deploying new revisions / supporting multiple analytics; we are a Big Data toolmaker. Values visible: 5%, 8%, 9%, 13%, 31%, 44%.]

[Survey chart 02: "What data sources does your organization analyze?" Leading answers: server logs (55%); files (47%); ERP, CRM, and other enterprise system data; user-generated data.]
Flume, which handles large-scale collection, aggregation, and transport of log data in particular, has grown more quickly (up 8% this year) than any other tool for Big Data—except for the more general-purpose Spark framework.

Interestingly, the share of organizations that analyze sensor data did not change from last year, despite evidence of increased adoption of IoT (as surveyed in our 2015 Guide to the Internet of Things, released just last month).

04. Developers are moving away from Hadoop for real-time (BI, search, optimization) applications

As developers build new Big Data tools optimized for non-batch operations, Hadoop usage is becoming more specialized. One of Hadoop's strengths is its ease of use: the MapReduce abstraction is conceptually simple and suitable for many existing applications, and it saves many hours of costly re-implementation of the same logic (see the word-count sketch after the survey charts below). Where real-time streaming and iterative algorithms are not needed, Hadoop usage remains strong. In fact, the percentage of users who use Hadoop for ETL/ELT and data preparation has actually increased over the past year (65% versus 59% last year).

On the other hand, where real-time and (sometimes) iterative processing provides significant value, Hadoop usage has dropped. Developers are now using Hadoop less for reporting/BI (down 4%), search and pattern analysis (down 6%), and optimization analytics (down 3%) than a year ago—all use cases that benefit from the real-time capabilities of frameworks that avoid MapReduce and/or persist distributed datasets in memory. Given the sharp rise in Spark usage, additional research is needed to discover which tools and techniques developers are actually using for real-time work. It would be particularly interesting to see how Spark and Storm carve up the set of developers leaving Hadoop for developing real-time applications.

05. Most data processing clusters remain small, perhaps too small to need Hadoop

Presumably the need for distributed computation decreases as per-node capabilities (hardware, DBMS technology, load-balancing algorithms, point-of-origin data processing, data management practices, etc.) increase, all other things being equal. But as distributed computing frameworks grow increasingly powerful and easy to use, developers and systems engineers are encountering fewer and fewer difficulties in running and building software for distributed systems. The fundamental complexities generated by process and resource isolation, however, do not disappear. This makes the ease of use offered by modern frameworks for distributed computing a potentially hazardous temptation where a single node will do.

In fact, cluster sizes typically remain far below the scales for which Hadoop was originally built. 51% of respondents' organizations process data on clusters containing fewer than five nodes, well below the eight-node floor in Hortonworks' Hadoop cluster sizing guide. Moreover, only 9% of respondents' clusters exceed 9 nodes. While Hadoop MapReduce and HDFS are perfectly able to run on a single node, developers and IT managers should weigh the drawbacks of multi-node complexity against the advantages offered by lower-end commodity hardware.

[Survey chart 03: "Which tools does your company use to help manage data with Hadoop?" Labels visible include Pig, Spark, Flume, Impala, Drill, None, and Other; values visible include 63%, 49%, 28%, 27%, 22%, 13%, 9%, 7%, and 6%.]

[Survey chart 04: "What does your company use Hadoop for?" Answers visible: statistics, text, search, and pattern analytics (57%); reporting/BI (50%); data provisioning (31%); storing new data types (30%); other.]

[Survey chart 05: cluster sizes, with bands for 5-9, 10-19, 20-49, and 50+ nodes.]
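As an illustration of Finding 04's point that the MapReduce abstraction is conceptually simple, here is a minimal word-count sketch written against Spark's RDD API, which expresses the same map-then-reduce shape; the input path is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("hdfs:///data/docs/*.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))    // "map" phase: emit one record per word
      .map(word => (word, 1))      // key each word with a count of 1
      .reduceByKey(_ + _)          // "reduce" phase: sum counts per key

    counts.take(20).foreach(println)
    sc.stop()
  }
}
```

The same two-phase shape (emit keyed records, then combine per key) is what classic Hadoop MapReduce jobs implement, which is why so many existing batch applications port to it with little redesign.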
Workload and Resource Management: YARN, Mesos, and Myriad

01  YARN and Mesos both have a container-based architecture.
02  Mesos was meant as a more generic solution, inclusive of Hadoop and YARN.
03  Myriad is meant for scheduling of YARN jobs via Mesos.
04  All of these technologies have to do with building a distributed architecture.

by Adam Diaz
Workload management on distributed systems tends to be an afterthought in implementation. Anyone who has used a compute cluster for a job-dependent function quickly discovers that placement of work and its prioritization are paramount not only in daily operations but also in individual success. Furthermore, many organizations quickly find out that even with robust job placement policies to allow for dynamic resource sharing, multiple clusters become a requirement for individual lines of business based upon their requirements. As these clusters grow, so does so-called data siloing.

Over time, the amount of company computing power acquired could eclipse any reasonable argument for individual systems if used in a programmatically shareable way. A properly shared technology allows for greater utilization across all business units, allowing some organizations to consume beyond their reasonable share during global lulls in workloads across the organization. The unique requirements of people, business units, and the business as a whole make the development of such global sharing difficult—if not impossible. This brings the need for workload and resource management into sharp relief. This article will describe the latest advances in Big Data-based workload and resource management.

YARN

Much has been written about YARN, and it is well described in many places. For context, I offer a high-level overview: YARN is essentially a container system and scheduler designed primarily for use with a Hadoop-based cluster. The containers in YARN are capable of running many types of tasks, including MPI, web servers, and virtually anything else one would like. YARN containers might be seen as difficult to write, giving rise to other projects like what was once called HOYA (HBase on YARN, eventually renamed Apache Slider) as an attempt at providing a generic implementation of easy-to-use YARN containers. This allowed a much easier entry point for those wishing to use YARN over a distributed file system. Slider has been used mainly for longer-running services like HBase and Accumulo, but will likely support other services as it moves out of incubator status.

YARN, then, was expected to be a cornerstone of Hadoop as an operating system for data. This would include HDFS as the storage layer along with YARN scheduling processes for a wider variety of applications well beyond MapReduce.
[…] framework, including a REST API for scaling up or down. The system also includes a Mesos executor for launching the node manager, as shown in the following diagram.

[Figure 5: Mesos/YARN Interaction—YARN Container Launching via Mesos Management. The diagram shows a Myriad Scheduler registered with the Mesos Master and the YARN ResourceManager; on each node, a Mesos Slave launches a Myriad Executor running the YARN NodeManager, which in turn hosts YARN containers (e.g., C1 at 2.0 CPU / 2.0 GB and C2 at 2.5 CPU / 2.5 GB).]

"In order to build a novel distributed computer, one needs more than just a CPU/scheduler and data storage."

From a high level, this type of architectural abstraction might seem like common sense. It should be noted, however, that it has taken some time to evolve and mature each of these systems, which are in themselves fields of study worthy of extensive analysis. This new Myriad (Mesos-based) architecture allows for multiple benefits over YARN-based resource management. It makes resource management generic and, therefore, the use of the overall system more flexible. This includes running multiple versions of Hadoop and other applications using the same hardware, as well as operational flexibility for sizing. It also allows for use cases such as using the same hardware for development, testing, and production. Ultimately, this type of flexibility has the greatest promise to fulfill the ever-changing needs of the modern data-driven architecture.

A generic, multi-use architecture encompassing a distributed persistence layer, along with the ability to engage in use cases from batch to streaming (real time/near real time) scheduled dynamically yet discretely over a commodity compute infrastructure, seems to be the holy grail of many a project today. Mesos and Myriad seem to be two projects well on their way to fulfilling that dream. There is still a great deal of work to be done, including use cases of jobs that span geography, along with associated challenges such as cross-geography replication and high availability.

[Figure: Mesos frameworks sharing one cluster—the YARN ResourceManager (with Myriad), Chronos, Jenkins, Spark, and Marathon all scheduled above the Mesos Master.]

Conclusion

The discussion of high-level architectures really starts to bring overall solution architectures into play. Candidates include the Lambda Architecture and its more modern successor, the Zeta Architecture. Generic components like a distributed persistence layer, containerization, and the handling of both solution and enterprise architecture are hallmarks of these advanced architectures. The question of how best to use resources in terms of first-principle componentry is being formed and reformed by a daily onslaught of new technology. What is the best storage-layer tech? What is the best tech for streaming applications? All of these questions are commonly asked and hotly debated. This author would argue that "best" in this case is the tool or technology that fits the selection criteria specific to your use case. Sometimes there is such a thing as "good enough" when engineering a solution for any problem. Not all of the technologies described above are needed by every organization, but building upon a solid foundational framework that is dynamic and pluggable in its higher layers will be the solution that eventually wins the day.

References:
mesos.apache.org
mesosphere.github.io/marathon
nerds.airbnb.com/introducing-chronos
apache-myriad.org
events.linuxfoundation.org/sites/events/files/slides/aconna15_bordelon.pdf
mapr.com/developer-preview/apache-myriad
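The article notes that Myriad exposes a REST API for flexing a YARN cluster up or down. As a rough illustration only, here is a minimal Scala sketch of such a call; the host, port, endpoint path, and JSON payload are assumptions rather than a documented contract, so check the Myriad REST documentation for your version before relying on them.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Sketch of calling a Myriad-style flex-up endpoint to add NodeManagers.
// Host, port, path, and payload below are illustrative assumptions.
object FlexUp {
  def main(args: Array[String]): Unit = {
    val url  = new URL("http://myriad-host:8192/api/cluster/flexup") // hypothetical
    val body = """{"instances": 2, "profile": "medium"}"""           // hypothetical

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))

    println(s"Myriad responded: ${conn.getResponseCode}")
    conn.disconnect()
  }
}
```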
Why Developers Are Flocking to Fast Data and the Spark, Kafka & Cassandra Stack

01  The first generation of Big Data was primarily focused on data capture and offline batch-mode analysis.
03  Fast Data is a new movement that describes new systems and approaches focused on timely, cost-efficient data processing, as well as higher developer productivity.

by Dean Wampler
One of the most noteworthy trends for Big Data developers today is the growing importance of speed and flexibility for data pipelines in the enterprise.

Big Data got its start in the late 1990s when the largest Internet companies were forced to invent new ways to manage data of unprecedented volumes. Today, when most people think of Big Data, they think of Hadoop [1] or NoSQL databases. However, the original core components of Hadoop—HDFS (Hadoop Distributed File System, for storage), MapReduce (the compute engine), and the resource manager now called YARN (Yet Another Resource Negotiator)—were, until recently, rooted in the "batch mode" or "offline" processing then commonplace. Data was captured to storage and then processed periodically with batch jobs. Most search engines worked this way in the beginning; the data gathered by web crawlers was periodically processed into updated search results.

At the other extreme, real-time systems require that events be processed as soon as they arrive, with tight time constraints, often microseconds to milliseconds. High-frequency trading systems are one example, where market prices move quickly and real-time adjustments control who wins and who loses. Between the extremes of batch and real-time are more general stream processing models with less stringent responsiveness guarantees. A popular example is the mini-batch model, where data is captured in short time intervals and then processed as small batches, usually within time frames of seconds to minutes.

Fast Data Defined

The phrase "fast data" captures this range of new systems and approaches, which balance various tradeoffs to deliver timely, cost-efficient data processing, as well as higher developer productivity. Let's begin by discussing an emerging architecture for fast data.

What high-level requirements must a Fast Data architecture satisfy? They form a triad: […] (failures limit the ability to deliver services); and driven by events from the world around them.
The New Stack Emerges for Fast Data

For applications where real-time, per-event processing is not needed—where mini-batch streaming is all that's required—research [3] shows that the following core combination of tools is emerging as very popular: Spark Streaming, Kafka, and Cassandra. According to a recent Typesafe survey, 65% of respondents use or plan to use Spark Streaming, 40% use Kafka, and over 20% use Cassandra.

Spark Streaming [4] ingests data from Kafka, databases, and sometimes directly from incoming streams and file systems. The data is captured in mini-batches, which have fixed time intervals on the order of seconds to minutes. At the end of each interval, the data is processed with Spark's full suite of APIs, from simple ETL to sophisticated queries, even machine learning algorithms. These other APIs are essentially batch APIs, but the mini-batch model allows them to be used in a streaming context.
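As a rough sketch of the mini-batch model just described, the following Scala snippet wires Spark Streaming to Kafka using the receiver-based API from the Spark 1.x spark-streaming-kafka module; the ZooKeeper address, consumer group, topic name, and record format are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second mini-batches

    // Receiver-based Kafka stream; ZooKeeper address, consumer group,
    // and topic are hypothetical placeholders.
    val lines = KafkaUtils
      .createStream(ssc, "zk-host:2181", "click-group", Map("clicks" -> 1))
      .map(_._2) // drop the Kafka message key, keep the payload

    // Each mini-batch is an ordinary RDD, so batch-style APIs apply unchanged.
    lines.map(line => (line.split(",")(0), 1)) // count events per (assumed) user-id field
         .reduceByKey(_ + _)
         .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Every ten seconds the accumulated records become one small batch, which is exactly how the mini-batch model lets "essentially batch" APIs run in a streaming context.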
Kafka [7] provides very scalable and reliable ingestion of streaming data organized into user-defined topics. By focusing on a relatively narrow range of capabilities, it does what it does very well. Hence, it makes a great buffer between downstream tools like Spark and upstream sources of data, especially those sources that can't be queried again in the event that data is, for some reason, lost downstream.

Finally, records can be written to a scalable, resilient database, like Cassandra [8], Riak [9], or HBase [10]; or to a distributed filesystem, like HDFS [11] or S3 [12]. Kafka might also be used as a temporary store of processed data, depending on the downstream access requirements.

This architecture makes Spark a great tool for implementing the Lambda Architecture [5], where separate batch and streaming pipelines are used. The batch pipeline processes historical data periodically, while the streaming pipeline processes incoming events. The result sets are integrated in a view that provides an up-to-date picture of the data. A common problem with this architecture is that domain logic is implemented twice [6], once for the streaming pipeline and once for the batch pipeline. But code written with Spark for the batch pipeline can also be used in the streaming pipeline, using Spark Streaming, thereby eliminating the duplication.
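A minimal sketch of that reuse, assuming log records are pipe-delimited strings (a made-up format): the domain logic is written once against RDDs and then applied to the streaming pipeline through DStream.transform, so nothing is implemented twice.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Domain logic written once against RDDs; the field layout is an assumption.
object SessionMetrics {
  def errorCountsByService(logs: RDD[String]): RDD[(String, Int)] =
    logs.filter(_.contains("ERROR"))
        .map(line => (line.split("\\|")(1), 1)) // field 1 = service name (assumed)
        .reduceByKey(_ + _)
}

object Pipelines {
  // Batch pipeline: apply the function to historical data.
  def batch(history: RDD[String]): RDD[(String, Int)] =
    SessionMetrics.errorCountsByService(history)

  // Streaming pipeline: DStream.transform applies the same RDD-level
  // function to every mini-batch, eliminating the duplicated logic.
  def streaming(live: DStream[String]): DStream[(String, Int)] =
    live.transform(SessionMetrics.errorCountsByService _)
}
```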
The Business Case for Fast Data

Most enterprises today are really wrangling data sets in the multi-terabyte size range, rather than the petabyte size range typical of the large, well-known Internet companies. They want to manipulate and integrate different data sources in a wide variety of formats, a strength of Big Data technologies. Overnight batch processing of large datasets was the start, but it only touched a subset of the market requirements for processing data.

Now, speed is a strong driver for an even broader range of use cases. For most enterprises, that generally means reducing the time between receiving data and when it can […] with great efficiency at all size scales. […] narrowing the time gap between […] that data. You tend to think of open source disrupting existing markets, but this streaming data / Fast Data movement was […] the Spark / Kafka / Cassandra "stack" is the best place to start.

Resources:
[1] hadoop.apache.org
[2] reactivemanifesto.org
[3] typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
[4] spark.apache.org/streaming
[5] lambda-architecture.net
[6] radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
[7] kafka.apache.org
[8] cassandra.apache.org
[9] github.com/basho/riak
[10] hbase.apache.org
[11] hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[12] aws.amazon.com/s3

Dean Wampler, Ph.D., is the Architect for Big Data Products and Services in the Office of the CTO at Typesafe. He specializes in scalable, distributed "Big Data" application development using Spark, Mesos, Hadoop, Scala, and other tools. Dean is a co-organizer of and frequent speaker at conferences worldwide, the co-organizer of several Chicago-area Meetup groups, and a contributor to several open source projects. He is the author of "Programming Scala, 2nd Edition" and "Functional Programming for Java Developers," and the co-author of "Programming Hive," all from O'Reilly.
ThingSpan by Objectivity®

ThingSpan combines object data modeling with the parallel processing of Hadoop and Spark to support applications at Big Data scale. It is a platform for developing and managing advanced information fusion solutions. It combines the power of object data modeling technology with the high-performance, parallel processing of Hadoop and Spark to deliver an easier, more effective way of supporting mission-critical applications at Big Data scale.

"A technically superior approach to application development."
"High availability with zero administration effort."

A New Approach

Big Data, including T-S (time-series) data, can be analyzed through the Hadoop Distributed File System (HDFS) with a traditional or NoSQL database management system, but because such systems typically rely on batch processing of data, this type […]

Case Study: CGG, a leader in fully integrated geoscience services, has been working with Objectivity to develop a common platform for its major geoscience analytical software. The data challenge comes from having to integrate diverse geoscience data, and then providing reservoir modeling and simulation data on top of this integrated data set. This common data model provided by CGG and Objectivity enables analysts to work on thousands of wells with data physically located anywhere in the world, thereby resulting in quick collaboration on important drilling decisions. The scale-out computing model of ThingSpan and native support of Hadoop allow higher performance at a lower cost for analysis of large, multi-dimensional data associated with geoscience.

Category: Data Management, Data Integration; On-Premise
Features: Built on Hadoop • Native Stream Processing Capabilities • Monitoring Tool Included • Supports Spark
Notable customers: CGG • Siemens • Drager
Five Ways Big Data Has Changed IT Operations

01  Today's hottest Big Data technologies each carry their own specific operational considerations that must be mastered.
02  Almost all Big Data solutions assume a scale-out architecture that leads enterprises into advanced cluster scheduling and orchestration.
03  There are five key trends that operations teams should anticipate related to packaging, testing, and evolving Big Data platforms.

by Michael Hausenblas
With the uptick of Big Data technologies such as Hadoop, Spark, Kafka, and Cassandra, we are witnessing a fundamental change in how IT operations are carried out. Most, if not all, of said Big Data technologies are inherently distributed systems, and many of them have their roots in one of the nowadays dominating Web players, especially Google. But how does using these Big Data technologies impact the daily IT operations of a company aiming to benefit from them? To address this question, we take a deeper look at five trends you and your Ops team should be aware of when employing Big Data technologies. These trends emerged in the past 15 years and are of a technological as well as organizational nature.

01. From Scale-Up to Scale-Out

There is a strong tendency throughout all verticals to deploy clusters of commodity machines connected with low-cost networking gear rather than specialized, proprietary, and typically expensive supercomputers. While the movement is likely older than 15 years, Google has spearheaded it with its warehouse-scale computing study. Almost all of the currently available Big Data solutions (especially those that are open source, but more on this point below) implicitly assume a scale-out architecture. Need to crunch more data? Add a few machines. Want to process the data faster? Add a few machines.

The adoption of the "commodity cluster" paradigm, however, has two implications that are sometimes overlooked by organizations starting to roll out solutions:

1. With the ever-growing number of machines, sooner or later the question arises if a pure on-premise deployment is sustainable, as you will need the space and pay hefty energy bills while typically seeing cluster utilizations of less than 10%.

2. The current best practice is to effectively create a dedicated cluster for each technology. This means you have a Hadoop cluster, a Kafka cluster, a Storm cluster, a Cassandra cluster, etc.—not only does this silo issues (in terms of being able to swiftly react to business needs; for example, to accommodate different seasons), but the overall TCO also tends to increase.

The issues discussed above do not mean you can't successfully deploy Big Data solutions in your organization at scale; it simply means that you need to be prepared for the long-term operational consequences, such as opex vs. capex, as well as migration scenarios.

02. Open Source Rulez

Open source plays a fundamental role in Big Data technologies. Organizations adopt it to avoid vendor lock-in
and to be less dependent on external entities for bug fixes, or simply to adapt software to their specific needs. The open and usually community-defined APIs ensure transparency; and various bodies, such as the Apache Software Foundation or the Eclipse Foundation, provide guidelines, infrastructure, and tooling for the fair and sustainable advancement of these technologies. Lately, we have also witnessed the rise of foundations such as the Open Data Platform, the Open Container Initiative, or the Cloud Native Computing Foundation, aiming to harmonize and standardize the interplay and packaging of infrastructure and components.

As in the previous case of the commodity clusters, there is a gotcha here: there ain't no such thing as a free lunch. That is, while the software might be open source and free to use, one still needs the expertise to efficiently and effectively use it. You'll find yourself in one of two camps: either you're willing to invest the time and money to build this expertise in-house—for example, hire data engineers and roll your own Hadoop stack—or you externalize it by paying a commercial entity (such as a Hadoop vendor) for packaging, testing, and evolving your Big Data platform.

03. The Diversification of Datastores

When Martin Fowler started to talk about polyglot persistence in 2011, the topic was still a rather abstract one for many people—although Turing Award recipient Michael Stonebraker made this point already in his 2005 paper "'One Size Fits All': An Idea Whose Time Has Come and Gone." The omnipotent and dominant era of the relational database is over, and we see more and more NoSQL systems gaining mainstream traction.

What this means for your operations: anticipate the increased usage of different kinds of NoSQL datastores throughout the datacenter, and be ready to deal with the consequences. Challenges that typically come up include:

• Determining the system of record
• Synchronizing different stores
• Selecting the best-fit datastore for a certain use case—for example, a multi-model database like ArangoDB for rich relationship analysis, or a key-value store such as Redis for holding shopping basket data

04. Data Gravity & Locality

In your IT operations, you'll usually find two sorts of services: stateless and stateful. The former includes things like a Web server, while the latter almost always is, or at least contains, a datastore. Now, the insight that data has gravity is especially relevant for stateful services. The implication here is to consider the overall cost associated with transferring data, both in terms of volume and in tooling, if you were to migrate for disaster recovery reasons or to a new datastore altogether (ever tried to restore 700TB of backup from S3?).

Another aspect of data gravity in the context of crunching data is known as data locality: the idea of bringing the computation to the data […] is a step in the right direction; using appropriate networking gear (like 10GE) is another. As a general note: the more you can multiplex […]

05. DevOps

There is one practice that is surprisingly often overlooked in a Big Data context: DevOps. As it was aptly described in the book The Phoenix Project, DevOps refers to the best practices for collaboration between the software development and operational sides of an organization. But what does this mean for Big Data technologies?

It means that you need to ensure that your data engineer and data scientist teams use the same environment for local testing as is used in production. For example, Spark does a great job of allowing you to go from local testing to cluster submission (a sketch of that pattern follows below). In addition, for the mid-to-long run, you should containerize the entire production pipeline.
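One minimal way to keep local testing and cluster submission on the same code path is sketched below; the environment variable names are assumptions for illustration. Only the master URL and the input location vary between a laptop run and a production run, so the logic the data engineers test locally is byte-for-byte what ships.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One code path for laptop tests and cluster runs: the master URL and
// input location are injected from the environment (variable names assumed).
object Job {
  def main(args: Array[String]): Unit = {
    val master = sys.env.getOrElse("SPARK_MASTER", "local[*]") // e.g. a Mesos or YARN URL in prod
    val input  = sys.env.getOrElse("INPUT_PATH", "src/test/resources/sample.log")

    val sc = new SparkContext(new SparkConf().setAppName("parity-demo").setMaster(master))
    println(sc.textFile(input).count()) // identical logic in both environments
    sc.stop()
  }
}
```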
Conclusion

With the introduction of Big Data technologies in your organization, you can quickly gain actionable business insights from raw data. However, there are a few things you should plan for from the IT operations point of view, including where to operate the cluster (on-premise, public cloud, hybrid), what your strategy is concerning open source, how to deal with different datastores as well as data gravity, and, last but not least, how to set up the pipeline and the organization in a way that developers, data engineers, data scientists, and operations folks can and will want to work together to reap the benefits of Big Data.

Michael Hausenblas is a Datacenter Application Architect at Mesosphere. His background is in large-scale data integration research, the Internet of Things, and Web applications. He's experienced in advocacy and standardization (World Wide Web Consortium, IETF). Michael frequently shares his experiences with the Lambda Architecture and distributed systems through blog posts and public speaking engagements, and is a contributor to Apache Drill. Prior to Mesosphere, Michael was Chief Data Engineer EMEA at MapR Technologies, and prior to that a Research Fellow at the National University of Ireland, Galway, where he acquired research funding totalling over €4M, working with multinational corporations such as Fujitsu, Samsung, and Microsoft, as well as governmental agencies in Ireland.
[Infographic: Mining the Big Data Gold Rush]

To show how these practices and tools connect, we've illustrated a basic model of the Big Data pipeline. And, we know… there's so much more than what we can show here on the page! Beyond the segment of the pipeline we show here is a whole other territory of predictive analytics and business intelligence worth exploring. For now, let's take a look at five important stages for dealing with Big Data.

01 COLLECT: This stage deals with streaming, real-time, and bulk import data (data that has been previously collected). Data volume and velocity are major concerns at this stage.

DARK DATA: Data that's been previously collected, stored, and processed, and which occurs at many stages. Dark data is often thought to be useless, or just not being used. New technologies and methodologies are allowing for this data to be processed and made useful.

02 STORE: […] (NoSQL, HDFS, etc.), but also real-time and streaming data. Data has to be stored and made available for early usage in other stages.
Executive Insights on Big Data

01  Big Data isn't going anywhere. It's just going to get bigger. Unfortunately, so are expectations.
02  Ask plenty of questions to understand what needs to be accomplished with the data.
03  Failure to set realistic expectations upfront can lead to wasted effort and disappointed clients.

by Tom Smith
To more thoroughly understand the state of Big Data, and where it's going, we interviewed 14 executives with diverse backgrounds and experience with Big Data technologies, projects, and clients.

Specifically, we spoke to: Margaret Roth, Co-Founder and CMO, Yet Analytics • Dr. Greg Curtin, CEO and Founder, Civic Resource Group • Guy Kol, Founder and V.P. R&D, NRGene, Ness Ziona, Israel • Gena Rotstein, CEO and Founder, Dexterity Ventures, Inc. • Scott Sundvor, CTO, 6SensorLabs • Ray Kingman, CEO, Semcasting • Puneet Pandit, Founder and CEO, Glassbeam • Mikko Jarva, CTO Intelligent Data, Comptel Corporation • Vikram Gaitonde, Vice President Products, Solix Technologies • Dan Potter, CMO, Datawatch • Paul Kent, SVP Big Data, SAS • Matt Pfeil, CCO and Co-Founder, DataStax • Philip Rathle, VP Products, Neo Technology, Inc. • Hari Sankar, VP of Product Management, Oracle

There is alignment with regards to what Big Data is, how it can be used, and its future. Discrepancy lies in the perception of the state of Big Data today. Some companies have been working with Big Data for years; others feel unable to perform "real" analytics work due to the data hygiene required, as well as the necessary integration of disparate databases.

Here's what we learned from the conversations.

01

The definition of Big Data is consistent across executives and industries—volume, velocity, and variety of data that is always changing and growing exponentially beyond what companies can traditionally handle. Scalability is critical, as are data management and retention policies. Data collection requires a more strategic approach. Companies will evolve from collecting/storing every piece of data to collecting and storing data based on need.

02

Executives stay abreast of industry trends by meeting with clients and prospects, learning pain points, and determining which data is available to solve the problem. Just as there's a tsunami of data, there's also a tsunami of information and hype about Big Data. Stay above the noise by having a "big picture" perspective of the problem you are solving.

03

Real-world problems solved by Big Data are myriad. I spoke with companies sequencing the wheat genome; enabling smart cities; and evolving healthcare, automotive, retail, education, media, and beyond. Every initiative is using data to help clients move from being reactive to proactive. Accessing data, integrating multiple sources, and providing the analysis to solve problems requires patience, vision, and knowledge. You will gain all three by working in a real-world environment solving real problems. None will come from contemplating Big Data in the abstract. Once you appreciate the amount of time spent on data hygiene—an absolute requirement before any analysis can take place—you'll structure data collection and integration so hygiene is less tedious and time-consuming.

04

The composition of a data analytics team requires a number of skills: development of algorithms and software implementations, data science, design, engineering, and input from analysts with domain expertise. The most important qualities for team members are creativity, collaboration, and curiosity. No one person or skill-set is the solution to every Big Data project. Big Data provides an opportunity for developers to contribute beyond their typical scope of influence. It's best for developers to have a broad range of interests and expertise. The more perspectives they can bring to bear on the problem, the better.

05

According to Ginni Rometty, CEO of IBM, Big Data is the "next oil." Several executives pointed out that this oil will be "unrefined" for the next 10 to 20 years. The future of Big Data is in providing real-time data to connect people, machines, experiences, and environments to improve life in a more personal way—from fewer traffic jams to more sustainable agriculture. Some executives I spoke with believe no one is really dealing with Big Data yet. There is so much data in repositories that the challenge is to figure out how to aggregate data so it can be analyzed. We also need to determine the right questions to ask, and the right data to store, to transform business and the customer experience. Other executives are already doing these things for their clients; however, even these executives see unrealized possibilities. Demand for Big Data services is growing quickly as the business world sees the possibilities. Once you empower business people, they ask for more information. They ask smarter questions. Speed and agility gain importance. Real-time operational and business data allows people to make well-informed decisions quickly, thereby saving time and money. Effective use of Big Data is becoming an expectation.

06

Hadoop was the most frequently mentioned software, with Cloudera and Hortonworks being the most frequently mentioned management applications. However, many other solutions—including Cassandra, Clojure, Datomic, Hive, NoSQL, PostgreSQL, SQL Server, and Tableau for visualization—were discussed. There's enormous demand for Big Data developers, so you don't need to know all of the software and applications. Pick what you want to become an expert in and write your ticket with that software. Taking the time to learn Hadoop is a good place to start.

07

[…] be aware of them as you get involved with specific projects. Ask the right questions up front, set the right expectations, and save a lot of time and rework.

08

Concerns around Big Data are similar to IoT, except privacy is more important than security. Industrial data is one thing; personal data is a whole other animal. As long as personal data is used for good, people will get comfortable as they benefit from Big Data. While Big Data will result in greater knowledge, it should also result in greater transparency—by governments, companies, and advertisers—and help prevent fraud and identity theft. "Data lakes" should be weighed against the danger to privacy posed by centralized data stores. Before we build huge data repositories, we need to know how we're going to use and safeguard that data. As the data infrastructure matures, these problems will become increasingly easy to solve.

09

The future of Big Data is the ability to make well-informed decisions quicker and easier than ever before. People will not be doing what a machine can do, so they'll be free to use their minds to do creative things. The blue-sky vision is: Big Data will be the central technology to human existence, since it will affect all aspects of life (e.g., weather, transportation, healthcare, nutrition, energy, etc.).

Based on the vision above, following are three takeaways for developers:

• Big Data is evolutionary, not revolutionary. Big Data problems are similar to problems you've faced before. Leverage and improve upon the knowledge you already have by learning new architectures and languages.

• Be prepared to be part of the bigger picture. Become more well-rounded and more prepared to collaborate with, and contribute to, your team.

• Understand the real-world problems you are solving. (It may help to think in terms of Domain-Driven Design.) Think about creating the next disruptive idea that's lurking in the open source community. Share more, borrow more, be more open-minded about the possibilities of what you are working on.
How Streaming Is Shaking Up the Data Analytics Landscape

01  Innovation in open source and at the network and I/O layers has broken down previous speed and performance barriers for visualization and analytics on big data sets.
02  Analytics are being pushed into the stream (via Spark), which is emerging as the de facto approach for sub-second query response times across billions of rows of data.
03  Working with data streams ensures the timely and accurate analysis that enables enterprises to harness the value of the data they work so hard to collect.

by Justin Langseth
The rise of Apache Spark and the general shift from batch to real-time has disrupted the traditional data stack over the last two years. But one of the last hurdles to getting actual value out of Big Data is on the analytics side, where the speed of querying and visualizing Big Data (and the effectiveness of those visualizations translated into actual business value) is still a relatively young conversation, despite the fact that 87% of enterprises believe Big Data analytics will redefine the competitive landscape of their industries within the next three years [1].

Most engineers who are using legacy business intelligence tools are finding them woefully unprepared to handle the performance load of Big Data, while others who may be writing their own analytics with D3.js or similar tools are wrestling with the new backend challenges of fusing real-time data with other datastores.

Let's take a look at the megatrend toward streaming architectures, and how it is shaking up analytics requirements for developers.

Data Naturally Exists in Streams

All commerce, whether conducted online or in person, takes place as a stream of events and transactions. In the beginning, the stream was recorded in a book—an actual book that held inventories and sales, with each transaction penned in on its own line on the page. Over time, this practice evolved. Books yielded to computers and databases, but practical limitations still constrained data processing to local operations. Later on, data was packaged, written to disk, and shipped between locations for further processing and analysis. Grouping the data stream into batches made it easier to store and transport.

Technology marches on, and it has now evolved to the point that, in many cases, batching is no longer necessary. Systems are faster, networks are faster and more reliable, and programming languages and databases have evolved to accommodate a more distributed streaming architecture. For example, physical retail stores used to close for a day each quarter to conduct inventory. Then they evolved to batch analysis of various locations on a weekly basis, and then a daily basis. Now they keep a running inventory that is accurate through the most recent transaction. There are countless similar examples across every industry.

So Why Are Analytics and Visualizations Still in Batch Mode?

Traditional, batch-oriented data warehouses pull data from multiple sources at regular periods, bringing it to a central location and assembling it for analysis. This practice causes data management and security headaches that grow larger over time as the number of data sources and the size of each batch grow. It takes a lot of time to export batches from the data source and import them into the data warehouse. In very large organizations, for which time is of the essence, batching can cause conflicts with backup operations. And the process of batching, transporting, and analysis often takes so much time that it becomes impossible for a complex business to know what happened yesterday or even last week.

By contrast, with streaming-data analysis, organizations know they are working with the most recent—and timely—version of data because they stream the data on demand. By tapping into data sources only when they need the data, organizations eliminate the problems presented by storing and managing multiple versions of data. Data governance and security are simplified; working with streaming data means not having to track and secure multiple batches.
We live in an on-demand world. It's time to leave behind the model of the monolithic, complex, batch-oriented data warehouse and move toward a flexible architecture built for streaming-data analysis. Working with data streams ensures the timely and accurate analysis that enables enterprises to harness the value of the data they work so hard to collect, and tap into it to build competitive advantage.

Breaking Down the Barriers to Real-Time Data Analysis

Previously, building streaming-data analysis environments was complex and costly. It took months or even years to deploy. It required expensive, dedicated infrastructure; it suffered from a lack of interoperability; it required specialized developers and data architects; and it failed to adapt to rapid changes in the database world, such as the rise of unstructured data.

In the past few years we have witnessed a flurry of activity in the streaming-data analysis space, both in terms of the development of new software and in the evolution of hardware and networking technology. Always-on, low-latency, high-bandwidth networks are less expensive and more reliable than ever before. Inexpensive and fast memory and storage allow for more efficient data analysis. In the past few years, we've witnessed the rise of many easy-to-use, inexpensive, open-source streaming-data platform components. Apache Storm [2], a Hadoop-compatible add-on (developed by Twitter) for rapid data transformation, has been implemented by The Weather Channel, Spotify, WebMD, and Alibaba.com. Apache Spark [3], a fast and general engine for large-scale data processing, supports SQL, machine learning, and streaming-data analysis. Apache Kafka [4], an open-source message broker, is widely used for consumption of streaming data. And Amazon Kinesis [5], a fully managed, cloud-based service for real-time data processing over large, distributed data streams, can continuously capture large volumes of data from streaming sources.

Checklist for Developers Building Analytics and Visualizations on Top of Big Data

The past few years have been witness to explosive growth in the number of streaming data sources and the volume of streaming data. It's no longer enough to look to historical data for business insight. Organizations require timely analysis of streaming data from such sources as the Internet of Things (IoT), social media, location, market feeds, news feeds, weather feeds, website clickstream analysis, and live transactional data. Examples of streaming-data analytics include telecommunications companies optimizing mobile networks on the fly using network device log and subscriber location data, hospitals decreasing the risk of nosocomial (hospital-originated) infections by capturing and analyzing real-time data from monitors on newborn babies, and office equipment vendors alerting service technicians to respond to impending equipment failures.

As the charge to streaming analytics continues and the focus becomes the "time to analytics gap" (how long it takes from arrival of data to business value being realized), I see three primary ways that developers should rethink how they embed analytics into their applications:

• Simplicity of Use: Analytics are evolving beyond the data scientist workbench, and must be accessible to broad business users. With streaming data, the visual interface is critical to make the data more accessible to a non-developer audience. From allowing them to join different data sources, to interacting with that data at the speed of thought—any developer bringing analytics into a business application is being forced to deal with the "consumerization of IT" trend, such that business users get the same convenience layers and intuitiveness that any mobile or web application affords.

• Speed / Performance First: With visualization come requirements to bring query results to users in near real-time. Business users won't tolerate spinning pinwheels while the queries get resolved (as was the case with old approaches to running Big Data queries against JDBC connectors). Today we're seeing analytics pushed into the stream (via Spark), which is emerging as the de facto approach for sub-second query response times across billions of rows of data, without having to move data before it's queried (see the sketch after this list).

• Data Fusion: Embedded analytics capabilities must make multiple data sources appear as one. Businesses shouldn't have to distinguish between "Big Data" and other forms of data. There's just data, period (including non-streaming and static data)—and it needs to be visualized, and available within business applications for sub-second, interactive consumption.
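As a minimal sketch of "analytics pushed into the stream," the Scala snippet below maintains a sliding-window aggregate with Spark Streaming, so consumers read pre-computed results instead of re-scanning raw data on every query; the socket source, host, and port are placeholders standing in for a real feed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("windowed-counts").setMaster("local[2]"), Seconds(5))
    // Checkpointing is only strictly required for the inverse-reduce variant,
    // but it is harmless and good practice for long-running streams.
    ssc.checkpoint("/tmp/wc-checkpoint")

    // A plain socket source stands in for a real feed; host/port are placeholders.
    val pages = ssc.socketTextStream("localhost", 9999).map(url => (url, 1))

    // Maintain counts over a sliding 60-second window, refreshed every 5 seconds,
    // so dashboards query pre-aggregated results instead of raw events.
    pages.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))
         .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```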
[1] forbes.com/sites/louiscolumbus/2014/10/19/84-of-enterprises-see-big-data-analytics-changing-their-industries-competitive-landscapes-in-the-next-year
[2] storm.apache.org
[3] spark.apache.org
[4] kafka.apache.org
[5] aws.amazon.com/kinesis

Justin Langseth is an expert in Big Data, business intelligence, text analytics, sentiment analytics, and real-time data processing. He graduated from MIT with a degree in Management of Information Technology, and holds 14 patents. Zoomdata is Justin's 5th startup—he previously founded Strategy.com, Claraview, Clarabridge, and Augaroo. He is eagerly awaiting the singularity to increase his personal I/O rate, which is currently frustratingly pegged at about 300 baud.
Checklist: Big Data Management Requirements

Big Data management is the genus to which Big Data processing and analytics are species. For developers, data scientists, and other data professionals, keeping every data management requirement in mind is not an easy task. This checklist will help you navigate critical requirements for consistently and flexibly curating raw sources of Big Data into trusted assets for next-generation analytics.

• Mass ingestion of high-volume and changed data
• Purpose-built components for data parsing and transformation
• Resource-optimized pipeline execution
• Team-based data publishing and sharing
• Data relationship tracking and inference
• Data provenance through end-to-end lineage
• Automated policy-based profiling and discovery
• Data proliferation tracking and monitoring
• Centralized security policy management
Diving Deeper

The Big Data/Analytics Zone is a prime resource and community for Big Data professionals of all types. We're on top of all the best tips and news for Hadoop, R, and data visualization technologies. Not only that, but we also give you advice from data science experts on how to understand and present that data.

The Database Zone is DZone's portal for following the news and trends of the database ecosystems, which include relational (SQL) and non-relational (NoSQL) solutions such as MySQL, PostgreSQL, SQL Server, NuoDB, Neo4j, MongoDB, CouchDB, Cassandra, and many others.

The Internet of Things (IoT) Zone features all aspects of this multifaceted technology movement. Here you'll find information related to IoT, including Machine to Machine (M2M), real-time data, fog computing, haptics, open distributed computing, and other hot topics. The IoT Zone goes beyond home automation to include wearables, business-oriented technology, and more.
Solutions Directory

This directory contains two solution types: (1) Big Data platforms that provide analytics, data management, data visualization, business intelligence, and more; and (2) open-source frameworks for a variety of lower-level Big Data needs. It provides feature data and product category information gathered from vendor websites and project pages. Solutions are selected for inclusion based on several impartial criteria, including solution maturity, technical innovativeness, relevance, and data availability.
Actian Analytics Platform | Big Data Analytics | Yes | Hadoop Integrations Available | actian.com
Alpine Chorus | Predictive Analytics | No | Yes | alpinenow.com/use-case-big-data-business
Amazon Kinesis | Big Data Analytics PaaS | Yes | Hadoop Integrations Available | aws.amazon.com
Datawatch Designer | Data Visualization | Yes | Hadoop Integrations Available | datawatch.com
Infobright Enterprise | Big Data Analytics | No | Hadoop Integrations Available | infobright.com
Necto by Panorama Software | Business Intelligence, Consulting | No | Hadoop Connector Available | panorama.com
Oracle Big Data Cloud Service | Data Management | No | Hadoop Connectors Available | oracle.com
Pivotal Big Data Suite | Data Management Platform | Yes | Yes | pivotal.io
RedPoint Data Management | Data Management | No | Hadoop Integrations Available | redpoint.net
Sumo Logic | Data Management | Yes | Hadoop Integrations Available | sumologic.com
Tableau Desktop | Big Data Analytics | Yes | Hadoop Integrations Available | tableausoftware.com
TIBCO Spotfire | Big Data Analytics | Yes | Hadoop Integrations Available | spotfire.tibco.com
Frameworks

[Feature matrix for featured Big Data solutions—Hortonworks Data Platform, iHub by OpenText Analytics, and Informatica Intelligent Data Platform—with attributes including High Availability, ETL and ELT, and JDBC Connectors.]

If you'd like to share data about these or other related solutions, please email us at [email protected].
Glossary

Batch Processing: Doing work in chunks. More formally: the execution of a series of programs (jobs) that process sets of records as complete units (batches). Commonly used for processing large sets of data offline for fast analysis later.

Big Data: Data whose size makes people use it differently. Describes the entire process of discovering, collecting, managing, and analyzing datasets too massive and/or unstructured to be handled efficiently by traditional database tools and methods.

Business Intelligence (BI): The use of tools and systems for the identification and analysis of business data to provide historical and predictive insights.

Complex Event Processing: Treating events as sets of data points from multiple sources. For example: three simple events 'stubbed toe,' 'fell forward,' and 'crushed nose' might constitute the single complex event 'injured by falling.' A useful concept for transitioning from 'dumb' input streams to 'smart' business analysis.

Data Analytics: The process of harvesting, managing, and analyzing large sets of data to identify patterns and insights. Exploratory, confirmatory, and qualitative data analytics require different application-level feature sets.

Data Management: The complete lifecycle of how an organization handles storing, processing, and analyzing datasets.

Data Mining: The process of discovering patterns in large sets of data and transforming that information into an understandable format. Often involves deeper scientific and technical work than analytics.

Data Munging/Wrangling: The process of converting raw mapping data into other formats using automated tools to create visualizations, aggregations, and models. Often addresses data that straight-up ETL discards.

Data Science: The field of study broadly related to the collection, management, and analysis of raw data by various means of tools, methods, and technologies.

Data Warehouse: A collection of accumulated data from multiple streams within a business, aggregated for the purpose of business management—where ready-to-use data is stored.

Database Clustering: Making the same dataset available on multiple nodes (physical or logical), often for the advantages of fault tolerance, load balancing, and parallel processing.

Distributed System: A set of networked computers solving a single problem together. Coordination is accomplished via messages rather than shared memory. Concurrency is managed by software that functions like an OS over a set of individual machines.

Eventual Consistency: The idea that databases will contain data that becomes consistent over time.

Extract Transform Load (ETL): Taking data from one source, changing it into a more useful form, and relocating it to a data warehouse, where it can be useful. Often involves extraction from heterogeneous sources and transformation functions that result in similarly structured endpoint datasets.

Hadoop: An Apache Software Foundation distributed framework developed specifically for high-scalability, data-intensive, distributed computing. The most popular implementation of MapReduce.

Hadoop Distributed File System (HDFS): A distributed file system created by Apache Hadoop to handle data throughput and access from the MapReduce algorithm.

High Availability: Ability of a system to keep functioning despite component failure.

MapReduce: A programming model created by Google for high scalability and distribution on multiple clusters for the purpose of data processing.

Mesos: An open-source cluster manager developed by the Apache Software Foundation. Abstracts basic compute resources and uses a two-level scheduling mechanism. Like a kernel for distributed systems.

Myriad: An Apache project used to integrate YARN with Mesos.

Online Analytical Processing (OLAP): A concept that refers to tools which aid in the processing of complex queries, often for the purpose of data mining.

Online Transaction Processing (OLTP): A type of system that supports the efficient processing of large numbers of database transactions; used heavily for business client services.

Predictive Analytics: The determination of patterns and possible outcomes using existing datasets.

R (Language): An interpreted language used in statistical computing and visualization.

Replication: Storing multiple copies of the same data so as to ensure consistency, availability, and fault tolerance. Data is typically stored on multiple storage devices.

Scalability: Ability of a system to accommodate increasing load.

Spark: An in-memory, open-source cluster computing framework by the Apache Software Foundation. Able to load data into a cluster's memory and query it numerous times.

Streaming Data: Data that becomes available continuously rather than discretely.

YARN (Yet Another Resource Negotiator): Also called MapReduce 2.0. Built for Hadoop by the Apache Software Foundation for cluster resource management. Decouples job tracking from resource management.