Mapr Informatica Whitepaper


White paper

Hadoop in the Enterprise:


Maximizing Big Data Benefits with MapR and
Informatica on Cisco Unified Computing System

Table of Contents
Introduction
Hadoop: A Strategic Data Analytics Platform
Informatica with MapR
A Better Hadoop: Additional Enhancements in MapR's Distribution
Summary
Introduction
The volume, velocity, and variety of data are all growing relentlessly. Organizations are struggling to find the tools, talent,
and time to harness the value and intelligence from this growth. The need to integrate big transaction data with big
interaction data while using big data processing technologies such as Apache™ Hadoop is particularly challenging, and
achieving this integration at a big scale is an onerous uphill battle.
Informatica offers the industry's leading independent data integration platform, which uniquely enables organizations to
maximize the return on big data and support top business imperatives. Informatica is also integrated with Hadoop, which
is purpose-built for processing big data effectively and affordably, and specifically with MapR Technologies' Distribution
for Hadoop, which improves performance, scalability, reliability, and ease of data access.
The Cisco Unified Computing System (Cisco UCS) Common Platform Architecture (CPA) for Big Data provides a
highly scalable platform that can be optimized and easily scaled for any size of Hadoop cluster and compute-intensive
applications. Cisco UCS CPA for Big Data comes with prevalidated configurations that allow organizations to select
performance and capacity as their needs dictate. Unique to Cisco UCS CPA for Big Data is its embedded extensible unified
management for managing all computing, networking, and storage resources. MapR's Distribution for Hadoop has been
extensively tested and validated on Cisco UCS CPA for Big Data.
This paper outlines how the combination of Informatica's Data Integration platform, the MapR Distribution for Hadoop,
and Cisco UCS CPA for Big Data offers powerful new capabilities for integrating and processing big data more efficiently
and cost-effectively than ever before.

Hadoop: A Strategic Data Analytics Platform


Hadoop provides a way to capture, organize, store, search, share, and analyze data from disparate sources across a large
cluster of commodity servers. Hadoop is designed to scale up from dozens to thousands of servers, each offering local
computation and storage.
MapR Technologies has advanced the Hadoop state of the art with major enhancements that overcome significant
limitations of other Hadoop distributions, making Hadoop more enterprise-class in its operation, performance,
scalability, and reliability, as well as in its ease of integration into the enterprise.
One major enhancement MapR has made involves rearchitecting the Hadoop Distributed File System (HDFS) to provide
full random read/write semantics, high availability, and direct access through NFS. These innovations overcome the many
limitations of HDFS, including its batch-oriented data management and movement, lack of random read/write file access
by multiple users and processes, and the requirement that files be closed before new updates can be read.
In addition to overcoming HDFS limitations and improving data protection, MapR's Direct Access NFS affords some other
significant advantages. Lockless storage with random reads and writes enables simultaneous access to data in near real
time, substantially improving performance. Any remote client can simply mount the cluster, and application servers
can then write their data and log files directly into the cluster, rather than writing first to direct- or network-attached
storage. Existing applications and workflows can use standard NFS to access the Hadoop cluster to manipulate data, and
optionally take advantage of the MapReduce framework for parallel processing. In addition, files in the cluster can be
modified directly using ordinary text editors, command-line tools, and UNIX applications and utilities, as well as other
development environments.
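As a minimal sketch of what direct NFS access means for an application, the Python below uses nothing but ordinary file I/O: it appends log lines and then overwrites bytes in the middle of the file in place. The mount point is an assumption. The path `/mapr/my.cluster.com/logs` mentioned in the comment is a hypothetical name, and the demo falls back to a temporary local directory so that it runs anywhere. No MapR-specific API is involved, which is the point of the NFS interface described above.

```python
import os
import tempfile


def append_log(path, line):
    """Append one line to a log file using plain file I/O."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def overwrite_at(path, offset, data):
    """Overwrite bytes in place at a given offset (a random write)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_all(path):
    """Read the whole file back as text."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def demo(mount_dir=None):
    # Hypothetical: in a real deployment mount_dir would be a directory on
    # the NFS-mounted cluster, e.g. "/mapr/my.cluster.com/logs" (assumed name).
    # A temporary local directory stands in for it here.
    base = mount_dir or tempfile.mkdtemp()
    path = os.path.join(base, "app.log")
    append_log(path, "status=PENDING order=1001")
    append_log(path, "status=PENDING order=1002")
    # Rewrite the first record's status in place; the new value has the
    # same byte length, so the rest of the file is untouched.
    overwrite_at(path, len("status="), b"SHIPPED")
    return read_all(path)
```

Calling `demo()` with a directory on a mounted cluster instead of the temporary directory would exercise the same code path against cluster storage, assuming the mount permits writes.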
Cisco UCS Common Platform Architecture for Big Data
The Cisco UCS solution for MapR and Informatica is based on the Cisco UCS CPA for Big Data, a highly scalable
architecture designed to meet a variety of scale-out application demands with transparent data and management
integration capabilities via the Cisco UCS management software suite, such as Cisco UCS Manager, Cisco UCS Central, and
Cisco UCS Director. This platform, built from the ground up, delivers scale without complexity through automated server
deployment using service profiles, improving IT efficiency and lowering TCO relative to legacy infrastructure.
These reference architecture blueprints offer a choice of high-performance and high-capacity options. Reference
architectures are available in both single-rack and multi-rack configurations, with considerable capacity built into the
Cisco Unified Fabric.
Informatica with MapR on Cisco UCS CPA for Big Data
The combination of Informatica's Data Integration platform with MapR's Distribution for Hadoop on Cisco UCS CPA for
Big Data enables organizations to access, ingest, parse, and process the full range of structured and unstructured data
(including messaging streams) with greater performance, scale, and dependability than ever before.
Using MapR's Direct Access NFS, Informatica's Ultra Messaging can stream messages directly into the MapR cluster to be
retained and processed via MapReduce. Both Ultra Messaging and MapR feature parallel architectures with high
availability (no single points of failure) and best-in-class performance, making the combination ideal for production
deployments. Due to the limitations of HDFS, only the MapR Distribution for Hadoop can support Ultra Messaging
streaming; no other distribution can do so.
Informatica's Data Replication and FastClone provide high-performance transaction updates and data loading from
different hardware platforms and data sources into the MapR cluster running on Cisco UCS CPA for analysis via
MapReduce or Hive. The data is loaded into the MapR cluster in near real time or on a scheduled basis, whereas other
Hadoop distributions and database connectors provide much lower throughput and are limited to one-time table dumps
and batch loading.

Commercial integration of the Informatica Data Integration platform with MapR's Distribution for Hadoop includes:
Bidirectional data integration with Informatica PowerExchange
Near-real-time and snapshot replication using Informatica Data Replication and Informatica FastClone
Parallel parsing and transformation on MapR using Informatica HParser
Data streaming using Informatica Ultra Messaging
Page 2 | Maximizing Big Data Benefits with MapR and Informatica

MapR, Inc. All Rights Reserved.

Cisco UCS CPA for Big Data facilitates fast data movement through its low-latency, lossless 10-Gbps unified fabric,
which is fully redundant due to its active-active (high-availability) configuration, delivering higher performance than
other vendor solutions.
Informatica's HParser helps create an easy-to-use integrated data environment (IDE) that enables customers to visually
design data parsing transformations for industry-standard formats (for example, FIX, SWIFT, ACORD, HL7, EDI, and many
more) and popular document formats (for example, MS Office and PDF), as well as complex files (such as logs, Omniture,
XML, and JSON), which can then be executed in parallel in the Hadoop cluster. The performance advantages of
MapR, combined with the efficiency of HParser, allow users to perform data parsing and transformations with higher
performance and lower hardware costs compared to other options.
PowerExchange for Hadoop makes it easier for nonprogrammers to move transaction and interaction data between a
MapR cluster and other databases and data warehouses, without the use of hand-coding. MapR's Direct Access NFS
interface also enables users to take advantage of Informatica's full range of data sources and transformations with the
Hadoop environment.
Automating End-to-end Workloads
By configuring, managing, and automating end-to-end workloads that include API connections to both MapR
and Informatica PowerCenter, the burden of Day 2 operations administration is significantly reduced for Hadoop
administrative staff, DBAs, applications architects, and data scientists. After the logical data flows are designed, Cisco
Tidal Enterprise Scheduler (TES) does the rest. Cisco TES eliminates the bottlenecks and errors associated with script
management for extract, transform, and load (ETL) processes, data movement, MapReduce jobs and output to analytics
applications.
Cisco TES also automates existing operational data center workloads and aids with pre-validation, sizing, and
performance optimization, which reduces workload deployment errors and reduces the risks of missing SLAs.
A Better Hadoop: Additional Enhancements in MapR's Distribution
Direct Access NFS also facilitates support for volumes, snapshots, and mirroring for all data contained within the Hadoop
cluster, further improving reliability without requiring any extraordinary measures. Volumes make clustered data easier
to both access and manage by grouping related files and directories into a single tree structure that can be more readily
organized, administered, and secured. Snapshots can be taken periodically to create drag-and-drop recovery points, and
mirroring extends data protection to satisfy recovery time objectives. Local mirroring provides high performance for
frequently accessed data, while remote mirroring provides business continuity across multiple data centers, as well as
integration between on-premises and private clouds.
Another major enhancement MapR made was to eliminate single points of failure in the critical NameNode and
JobTracker functions. MapR's Distributed NameNode HA (High Availability) distributes the file metadata on ordinary
DataNodes throughout the cluster. In the extreme, all DataNodes might store and serve a portion of the file metadata.
Every portion is then persisted to disk (with the node's data) and also replicated to at least two other nodes to increase
tolerance to multiple simultaneous node failures. This eliminates the need with other distributions to continuously back
up the Primary NameNode to either a Checkpoint Node (previously called the Secondary NameNode) or a Backup Node.
MapR's JobTracker offers similar resiliency, with the ability to continue all tasks with no interruption or data loss in the
event of a failure. Without such transparent failover, it is necessary to restart the affected job(s) from the beginning.
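The replication claim above can be checked with a small model. The sketch below is an illustration under stated assumptions, not MapR's actual placement algorithm: it places each metadata portion (called a shard here, a hypothetical name) on three distinct nodes using a simple ring layout, which corresponds to "one copy plus at least two other nodes", and then tests every possible combination of simultaneous node failures.

```python
from itertools import combinations


def place_replicas(num_shards, nodes, copies=3):
    """Map each metadata shard to `copies` distinct nodes (ring placement).

    Ring placement is an assumption made for this sketch only.
    """
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for shard in range(num_shards):
        start = shard % len(nodes)
        placement[shard] = [nodes[(start + i) % len(nodes)] for i in range(copies)]
    return placement


def survives(placement, failed):
    """True if every shard still has at least one replica on a live node."""
    failed = set(failed)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


def tolerates_any_failures(placement, nodes, count):
    """Check every possible set of `count` simultaneous node failures."""
    return all(survives(placement, combo) for combo in combinations(nodes, count))
```

With three copies of every shard, any two simultaneous node failures leave at least one copy of each shard reachable. In this model, losing three nodes that happen to hold all copies of one shard does not.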
MapR's Distributed NameNode HA architecture also improves scalability and performance compared to configurations
with a single Primary NameNode. Even in a server configured with copious amounts of memory, a single NameNode
is normally limited to only about 70 million files. With MapR's Distributed NameNode HA architecture, in contrast,
the cluster scales in a linear fashion with the number of DataNodes, and can therefore contain a virtually unlimited
number of files. The performance advantage derives from the elimination of a Primary NameNode, which can become
a bottleneck even in relatively small clusters. With file metadata distributed across multiple DataNodes throughout the
cluster, performance also scales in a linear fashion with the size of the cluster.
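The contrast between a fixed ceiling and linear scaling can be written as simple arithmetic. The sketch below is a rough model only: the 70 million figure is the approximate limit quoted above, while `files_per_node` is a hypothetical parameter that the paper does not specify, so any value passed in is an assumption.

```python
# Approximate single-NameNode limit quoted in the text.
SINGLE_NAMENODE_FILE_LIMIT = 70_000_000


def distributed_capacity(num_nodes, files_per_node):
    """Linear model: total file capacity grows with the node count.

    files_per_node is a hypothetical per-node metadata capacity.
    """
    return num_nodes * files_per_node


def nodes_to_exceed_single_namenode(files_per_node):
    """Smallest node count whose linear capacity exceeds the fixed ceiling."""
    return SINGLE_NAMENODE_FILE_LIMIT // files_per_node + 1
```

For example, if each node could hold metadata for 10 million files (an illustrative number, not a measured one), `nodes_to_exceed_single_namenode(10_000_000)` gives 8 nodes, and every node added beyond that raises the total by the same amount.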


Summary
By using Informatica with MapR's Distribution for Hadoop on Cisco UCS CPA for Big Data, organizations are now able
to achieve high-performance data integration, replication, and messaging. Together, the three companies are pushing
the limits of high-performance networks to move many terabytes per hour of transaction, interaction, and streaming
data into the MapR cluster, as well as to parse and process a broad range of structured and unstructured data natively
in Hadoop, all without coding. The combination also gives organizations a more affordable way to archive data in
applications, data warehouses, and/or legacy systems to Hadoop, or to archive data to Hadoop's lower-cost storage.
Together, Informatica, MapR, and Cisco provide cost-effective, high-performance analytic-ready data storage and
processing with enterprise-class high availability and business continuity.
To learn more, call 855-NOW-MAPR (855-669-6277) or visit the respective companies online at www.mapr.com,
www.informatica.com, or www.cisco.com/go/bigdata.

For more information, please visit www.mapr.com

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and
real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL,
database and streaming applications in one unified Big Data platform. MapR is used across financial services, retail, media, healthcare,
manufacturing, telecommunications and government organizations as well as by leading Fortune 100 and Web 2.0 companies.
Amazon, Cisco, and Google are part of MapR's broad partner ecosystem. Investors include Lightspeed Venture Partners, Mayfield
Fund, NEA, and Redpoint Ventures.
© 2013 MapR Technologies. All rights reserved. Apache Hadoop, HBase and Hadoop are trademarks of the Apache Software
Foundation and not affiliated with MapR Technologies.
