
White Paper, July 2014

Optimize Your Enterprise Architecture with Apache™ Hadoop® and NoSQL
Add Big Data Technologies to Get More Value from Your Stack

MapR Technologies, Inc.
Introduction

Relational database management systems (RDBMSs) have been the foundational technology for enterprise IT architectures for several decades, and for good reason. The relational model is well suited to the data types used in business applications such as enterprise resource planning (ERP), customer relationship management (CRM), and enterprise performance management (EPM). Extensions for complex data types have enabled management of media such as images, video, audio, and document files. Enterprise-grade reliability and data protection let organizations run safe, 24x7 business operations. The rich ecosystem of tools and related technologies, including data integration and application development environments, further extends the value of RDBMSs. And the large available talent pool ensures a path to project success.
With the wealth of advantages that RDBMSs bring, they will continue to play a significant role in organizations today and in the future. And since most business operations, along with the most commonly used business data, have fundamentally remained the same over the years, organizations can continue leveraging RDBMS-based technologies. So why is there so much buzz around newer technologies like Apache Hadoop and NoSQL databases?

The Challenges of Big Data

As seen over the years, certain trends in data management often require complementary technologies that coexist with RDBMSs. For example, the growth of important unstructured data types—documents, emails, web content—gave rise to the prevalence of search engine technology. Search engines were never a threat to replace RDBMSs, but rather were the perfect add-on technology to enable new types of retrieval on different types of information.
Today, big data is both an opportunity and a challenge that is beginning to overwhelm an organization's ability to manage data. More volumes of data, higher speed requirements of data, and greater varieties of data all put pressure on existing RDBMS deployments. That pressure results in unacceptably slow performance, higher technology expenditures, and greater personnel effort. Organizations might try to solve the volume and velocity pressures on RDBMSs by either upgrading their computers ("scale up") or spreading the work across multiple computers ("scale out"), but both options are often prohibitively expensive. Scaling up RDBMSs requires new expenditures that make existing hardware obsolete, and scaling out RDBMSs requires expensive specialized systems or manually intensive techniques like "sharding." Organizations are also burdened with handling many varieties of data in RDBMSs, which typically requires significant manual effort on both the data modeling/re-modeling and query optimization sides. Organizations want to maintain data agility, simplify application development, and lighten the data modeling effort with technologies designed for varying data structures.

This does not mean that organizations should explore replacing their significant investment in RDBMS-based technologies. Nor does it mean that the huge ecosystem of tools and talent has lost value. Everything in place can be enhanced with newer technologies like Hadoop and NoSQL. Both of these technologies, borne out of the growing big data challenges, are being investigated as the next big components to optimize existing enterprise architectures.

Why Hadoop and NoSQL

Apache Hadoop is a framework for running huge, distributed processing jobs across a cluster of hardware servers. It is used for large-scale analytics that require massive parallel processing power, and has its roots in internet search companies that needed to index millions of web pages. NoSQL refers to the class of databases that were invented to handle the large volumes of data with varying and changing structures that organizations deal with today.
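To make the parallel processing model concrete, here is a minimal, self-contained sketch of a Hadoop MapReduce job in Java (the classic word count over text files in HDFS). The class names and input/output paths are illustrative assumptions, not drawn from this paper.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal MapReduce sketch: count occurrences of each word across a large input set.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each mapper processes one split of the input, in parallel across the cluster.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Reducers aggregate the partial counts shuffled from all mappers.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/in
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```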
Since Hadoop and NoSQL each address different needs and complement each other with regard to big data, they are often used together. Organizations that use these technologies are embracing two major trends in enterprise architectures today. The first entails deploying an incrementally and horizontally scalable cluster of commodity hardware servers. The second entails handling many different, and oftentimes changing, data formats in a single system. Both of these trends pertain to keeping the growing cost and effort of managing big data under control. Hadoop, and NoSQL databases like Apache HBase™, were designed for incremental, horizontal scaling and for a wide variety of data formats.

Use of these technologies with existing enterprise architectures typically involves offloading excessive volumes of data to reduce the storage and processing burden. This sometimes means that certain applications that struggle with huge volumes of rapidly growing data should be migrated to Hadoop or NoSQL. It also means that new projects with big data characteristics are deployed on Hadoop or NoSQL. With big data challenges offloaded, the RDBMS-based technologies are more focused on time-sensitive, business-critical analytics. In a closed-loop analytical deployment, data in Hadoop and NoSQL is transformed into an analytics-ready format and then moved to the original analytics system.

Hadoop Use Cases

Hadoop is ideal for two broad use cases. The first is huge, data-intensive analytical jobs that are best handled in a parallel processing environment. The value to an organization involves reducing the workload of its high-end analytics engines by offloading certain data. This represents a separation of the differing value of data, which is often time-based. For example, an organization might want to keep three months' worth of data in its primary analytics system, and then run long-tail analytics on older data. The older data is still valuable enough to keep and analyze, but is not time-sensitive, and can be deployed onto more economical systems like Hadoop.
The second use case is for storing huge volumes of data, often to create "warm archives" of data that are readily available for access when the need arises. This can significantly reduce the storage burden on a high-end analytics system. Since Hadoop provides a distributed storage layer, organizations essentially use Hadoop in this situation as a low-cost network-attached storage device.
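As a rough illustration of the warm-archive pattern, the sketch below uses the standard Hadoop FileSystem API to copy an aged data extract into HDFS, where it remains available to batch tools on demand. The file and directory paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: archive an aged data extract into the Hadoop distributed storage layer.
public class WarmArchive {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/exports/orders_2013_q4.csv");  // hypothetical local extract
        Path archive = new Path("/archive/orders/2013/q4/");   // hypothetical HDFS location

        fs.mkdirs(archive);                    // create the archive directory if needed
        fs.copyFromLocalFile(local, archive);  // the file is replicated across the cluster

        // The archived data stays readable by Hive, Drill, or MapReduce jobs when needed.
        System.out.println("Archived to " + fs.makeQualified(archive));
    }
}
```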

NoSQL Use Cases

NoSQL is used for a variety of low-latency, operational workloads for big data. While NoSQL technologies have significant overlap with RDBMSs with regard to the frequent read/write/update of data, the use cases diverge around the growth and variety of data. The horizontal scalability of NoSQL alleviates the need to predict data growth and plan for hardware upgrades, since increasing data loads can be accommodated by incrementally growing a cluster of servers. The flexibility to handle different data types is useful for environments where the data model can often change due to changing business requirements. Log files, customer profile information, web pages, and product catalog data are only some examples of the growing and changing data types that NoSQL can handle.
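As a concrete example of a low-latency operational workload, here is a minimal sketch using the HBase client API (of the 0.98-era generation current when this paper was written) to write and read a customer profile. The table, column family, and qualifier names are illustrative assumptions; note that new attributes can be added as columns without a schema migration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: low-latency write and read of a customer profile in a NoSQL table.
public class ProfileStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customer_profiles");  // hypothetical table
        try {
            // Columns live in a family; adding a new attribute needs no DDL change.
            Put put = new Put(Bytes.toBytes("cust-10042"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("jane@example.com"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("segment"),
                    Bytes.toBytes("premium"));
            table.put(put);

            // Point lookup by row key: millisecond-class operational access.
            Result r = table.get(new Get(Bytes.toBytes("cust-10042")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        } finally {
            table.close();
        }
    }
}
```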
Considering the range of capabilities that business users need, organizations should rarely try to
standardize on only one class of technologies. The goal should be to take advantage of any technology
that fits the organization’s information strategy. Generally, this points to deploying an integrated
combination of RDBMSs, Hadoop, and NoSQL.

Example Deployment Models with an Enterprise Data Hub

When optimizing an existing enterprise architecture with Hadoop and NoSQL, a popular architectural model used today is known as the enterprise data hub (EDH). An EDH, sometimes referred to as a "data lake," is a large central repository of multi-structured data, including structured, semi-structured, and unstructured data. Choice of Hadoop or NoSQL depends on the access pattern. When there could be multiple or changing access patterns to the data—fast look-ups, long reports, full-text search, etc.—storing a canonical version of the data in Hadoop makes sense. This leaves the opportunity for data transformations in Hadoop that can then be queried with tools like Apache Hive or Apache Drill, or be loaded into NoSQL. For consistent, unchanging access patterns on structured data, loading into NoSQL is often the right choice, especially if the data is expected to be updated in place.
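To illustrate querying the canonical data held in Hadoop, here is a minimal sketch using the Hive JDBC driver against a HiveServer2 endpoint; Drill exposes a similar JDBC interface. The host name, credentials, and table are hypothetical assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run a SQL query over canonical data stored in Hadoop via HiveServer2.
public class HubQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://edh-node1:10000/default", "analyst", "");  // hypothetical host
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) AS events " +
                     "FROM web_logs GROUP BY region")) {  // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("events"));
            }
        }
    }
}
```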
A successful EDH deployment has the same enterprise-grade characteristics that organizations already expect from RDBMS-based deployments. Features such as standards-based data access and interoperability, multi-tenancy, high availability, disaster recovery, security, and high performance are all necessary in an EDH.

An EDH is a core component of an overall solution that addresses a wide range of specific business concerns. The cross-industry applicability of EDHs makes them a suitable architecture for any organization. Example deployment models include:
Data Warehouse Optimization
A common trend in data warehouse (DW) deployments is the need to analyze larger time windows of
data. Analysts know that querying larger data sets can help to obtain more accurate and useful insights.
However, most organizations find that expanding their time windows is not a simple task, especially
when their DW is approaching full capacity. Analyzing even a few more months of data can be a huge
maintenance burden. Organizations previously had to choose between discarding older data and making larger investments in DW infrastructure to handle the additional data volumes. The default action is typically the former, so analysts are relegated to smaller data sets.


With a DW optimization strategy, organizations move older and less frequently used data from the DW to Hadoop and/or NoSQL. Hadoop/NoSQL then becomes the analytics engine for analysis that is less time-sensitive. Organizations run analytics on Hadoop, typically as batch jobs, and thus continue gaining value from historical data. Fast look-ups, including report generation, can be done with data stored in NoSQL. This model is very cost-effective, as it can leverage commodity hardware. At the same time, analysts can continue their standard analytics processes on recent data in their DW. The combination of analytical insights from both recent data and historical data gives analysts a more complete picture of their enterprise data.
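One hedged sketch of the offload pattern, assuming historical DW extracts have already been landed in HDFS as delimited files: a Hive external table partitioned by month exposes them to batch SQL without reloading the warehouse. All table names and paths are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: expose historical DW extracts in HDFS to batch SQL via a Hive external table.
public class HistoricalOffload {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://edh-node1:10000/default", "etl", "");  // hypothetical host
             Statement stmt = conn.createStatement()) {

            // The external table points at files already archived in HDFS;
            // dropping the table later would not delete the underlying data.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS orders_history (" +
                "  order_id BIGINT, customer_id BIGINT, amount DOUBLE)" +
                " PARTITIONED BY (order_month STRING)" +
                " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','" +
                " LOCATION '/archive/orders/'");  // hypothetical path

            // Register one month of offloaded data as a new partition.
            stmt.execute(
                "ALTER TABLE orders_history ADD IF NOT EXISTS" +
                " PARTITION (order_month='2013-12')" +
                " LOCATION '/archive/orders/2013-12/'");
        }
    }
}
```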
Data Consolidation
Data silos arise for many reasons, and they often remain in place for very good reasons. Instead of trying to tear down the silos, organizations can consolidate views of their disparate data sets by deploying an EDH with Hadoop and NoSQL. In some cases, redundant silos can be eliminated by migrating the data to an EDH. By creating a centralized layer, analysts can more efficiently "connect the dots" between the valuable data spread across an organization. With Hadoop and NoSQL, organizations can support not only the volume and velocity of data, but also store the many varieties of data from across the different silos without extensive data modeling efforts.
Data consolidation is often described using different terms based on industry- or department-specific deployments. For example, customer service departments use the term "360-degree customer view" to describe the ability to immediately see all customer profile information along with the interactions the customers have with the company. Also, variants of asset management systems, including IT asset management and digital asset management, are data consolidation solutions that centralize information about enterprise assets that is originally stored in disparate locations. Regardless of the type of data that is collected and centralized, a data consolidation deployment helps users uncover information that would otherwise not be obvious if retrieved from separate locations. The ability to link related information gives users an added advantage for discovering useful insights. The combination of Hadoop and NoSQL gives organizations flexibility with regard to data structures, enabling efficient collection of data from a wide variety of remote sources.
Internet of Things and Sensor Data Analysis
Collecting sensor data from a large set of remote devices is an increasingly popular use case in big data.
The large number of sources, along with the potentially large number of data elements, results in a speed
and scale problem. While organizations might otherwise consider RDBMS technologies for collecting
and analyzing this data—typically because the data can be transformed into a highly structured format—
the use of an EDH will alleviate the data burden on RDBMSs. An EDH leverages the fast data loading
capabilities of Hadoop, tied with the fast querying capabilities of NoSQL, to become the entry point for
data from remote devices. Within Hadoop and NoSQL, data can be analyzed to deliver insights about
how the devices are operating, or the data can be transformed to be loaded into another analytical system
like a DW.
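As a sketch of the entry-point pattern, sensor readings can be written to a NoSQL table with a row key that groups each device's readings together, newest first, so that recent readings are retrievable with a short scan. The key layout, table, and device names below are illustrative assumptions, not a prescribed schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: ingest a sensor reading under a device-prefixed, reverse-timestamp row key,
// so a scan starting at the device prefix returns the most recent readings first.
public class SensorIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sensor_readings");  // hypothetical table
        try {
            String deviceId = "turbine-0042";                // hypothetical device
            long ts = System.currentTimeMillis();
            // Reverse the timestamp so newer readings sort before older ones.
            byte[] rowKey = Bytes.add(
                    Bytes.toBytes(deviceId + "|"),
                    Bytes.toBytes(Long.MAX_VALUE - ts));

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("m"), Bytes.toBytes("vibration"), Bytes.toBytes("0.87"));
            put.add(Bytes.toBytes("m"), Bytes.toBytes("temp_c"), Bytes.toBytes("61.4"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```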


Many devices are creating data today, typically taking measurements while used in real-world situations. Equipment such as assembly line robots, oil drilling rigs, and wind turbines can take hundreds or even thousands of measurements per second to accurately and immediately identify anomalous behavior that signals an impending problem that can be proactively addressed. Products such as automobiles, printers, and network routers track usage behavior to let manufacturers analyze what usage conditions lead to premature product failure, as a means of improving product quality. With so many incoming data points, organizations that need to analyze sensor data can free their existing systems from the added storage and processing burden by leveraging Hadoop and NoSQL.

Why MapR

The MapR Distribution for Hadoop is a software product that can solve enterprise architecture needs on both the Hadoop and NoSQL sides. While MapR is well known for its performance, scalability, and reliability advantages in the Hadoop community, it also provides strong capabilities with regard to multi-tenancy, security, and interoperability. A key innovation in MapR is the integrated NoSQL database, MapR-DB. Similar to Apache HBase, MapR-DB is an in-Hadoop database that runs in the same cluster as a Hadoop deployment. MapR-DB is designed to run HBase applications using the same standard API, thus leveraging the skillset of the large, existing HBase developer network.
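Since the paper states that MapR-DB accepts the standard HBase API, a hedged illustration follows: existing HBase client code pointed at a MapR-DB table, which in MapR's model is addressed by a filesystem-style path rather than a flat table name. The table path is hypothetical, and the path-addressing detail is an assumption about MapR's conventions rather than something stated in this paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: unmodified HBase client code targeting a MapR-DB table by path (assumed convention).
public class MaprDbExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The same HTable/Put/Get calls used against Apache HBase apply unchanged;
        // only the table identifier differs ("/apps/profiles" is a hypothetical path).
        HTable table = new HTable(conf, "/apps/profiles");
        try {
            Put put = new Put(Bytes.toBytes("cust-10042"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("active"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```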


Figure: The Optimized Data Architecture Using MapR




One major reason why an integrated Hadoop/NoSQL solution is important is the deployment benefit of data locality, one of the tenets of Hadoop processing. Since data stored in an EDH is continually analyzed, it is important to keep the data where the analysis is run. In a conventional Hadoop/NoSQL deployment, the two technologies run on separate clusters. This means that if large-scale, parallel-processed analytics is required with Hadoop and MapReduce, the NoSQL data must first be copied to the Hadoop cluster. Since the goal of optimizing an enterprise architecture is to alleviate big data pressures, copying large volumes of data from one cluster to another is antithetical to that goal. Moreover, if new capabilities can be added in a single cluster rather than multiple clusters, organizations can reduce the risk of error, the duplication of effort, and the total cost of ownership.

The advantage of MapR-DB is not simply that it is an integrated NoSQL database. It provides the high performance, massive scalability, low maintenance overhead, and enterprise-grade reliability to run business-critical environments. MapR-DB is proven in production to help some of the biggest companies in the world optimize their existing enterprise architectures. These customers handle huge workloads that help them with business-focused goals such as identifying incremental revenue opportunities, creating upsell/cross-sell recommendations, and even detecting fraudulent activity.

Conclusion

Organizations are seeing a plethora of innovative technologies that can help with their big data challenges. Most of these innovations should be viewed as complementary to existing architectures, not as outright replacements. RDBMS technologies will continue to play a vital role in enterprises, and new challenges with big data will lead organizations to turn to Hadoop and NoSQL for help. By taking an optimization-focused approach, organizations can continue gaining value from their data and their existing IT infrastructure, while also expanding their capabilities for future business requirements.

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease of use, and world-record speed to Hadoop, NoSQL, database, and streaming applications in one unified distribution for Hadoop. MapR is used by more than 500 customers across financial services, government, healthcare, manufacturing, media, retail, and telecommunications, as well as by leading Global 2000 and Web 2.0 companies. Amazon, Cisco, Google, and HP are part of the broad MapR partner ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm Ventures, and Redpoint Ventures.

© 2014 MapR Technologies, Inc.
