Mapr Informatica Whitepaper
Table of Contents
Introduction
Hadoop: A Strategic Data Analytics Platform
Informatica with MapR
A Better Hadoop: Additional Enhancements in MapR's Distribution
Summary
Introduction
The volume, velocity, and variety of data are all growing relentlessly. Organizations are struggling to find the tools, talent,
and time to harness the value and intelligence from this growth. The need to integrate big transaction data with big
interaction data while using big data processing technologies such as Apache™ Hadoop is particularly challenging, and
achieving this integration at scale is an uphill battle.
Informatica offers the industry's leading independent data integration platform, which uniquely enables organizations to
maximize the return on big data and support top business imperatives. Informatica is also integrated with Hadoop, which
is purpose-built for processing big data effectively and affordably, and specifically with MapR Technologies' Distribution
for Hadoop, which improves performance, scalability, reliability, and ease of data access.
The Cisco Unified Computing System (Cisco UCS) Common Platform Architecture (CPA) for Big Data provides a
highly scalable platform that can be optimized and easily scaled for any size of Hadoop cluster and compute-intensive
applications. Cisco UCS CPA for Big Data comes with prevalidated configurations that allow organizations to select
performance and capacity as their needs dictate. Unique to Cisco UCS CPA for Big Data is its embedded extensible unified
management for managing all computing, networking, and storage resources. MapR's Distribution for Hadoop has been
extensively tested and validated on Cisco UCS CPA for Big Data.
This paper outlines how the combination of Informatica's Data Integration platform, the MapR Distribution for Hadoop,
and Cisco UCS CPA for Big Data offers powerful new capabilities for integrating and processing big data more efficiently
and cost-effectively than ever before.
Cisco UCS CPA for Big Data facilitates fast data movement through its low-latency and lossless 10-Gbps unified fabric
that is fully redundant due to its active-active (high-availability) configuration, delivering higher performance than other
vendor solutions.
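As a rough back-of-the-envelope check on what a 10-Gbps link can carry, the following Python sketch converts a line rate into terabytes per hour. It assumes the full theoretical line rate is sustained with no protocol overhead, and it uses decimal units (1 TB = 10^12 bytes). The function name is illustrative and is not part of any Cisco tool.

```python
def line_rate_tb_per_hour(gbps: float) -> float:
    """Convert a link's line rate in gigabits per second to terabytes per hour.

    Assumes the full theoretical line rate is sustained (no protocol
    overhead) and decimal units: 1 Gb = 1e9 bits, 1 TB = 1e12 bytes.
    """
    bits_per_hour = gbps * 1e9 * 3600   # bits transferred in one hour
    bytes_per_hour = bits_per_hour / 8  # 8 bits per byte
    return bytes_per_hour / 1e12        # bytes -> terabytes


if __name__ == "__main__":
    print(line_rate_tb_per_hour(10))
```

Under these assumptions a single 10-Gbps link tops out at 4.5 TB per hour; sustained real-world rates per link will be lower.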
Informatica's HParser helps create an easy-to-use integrated data environment (IDE) that enables customers to visually
design data parsing transformations for industry-standard formats (for example, FIX, SWIFT, ACORD, HL7, EDI, and many more)
and popular document formats (for example, MS Office and PDF), as well as complex files (such as logs, Omniture,
XML, and JSON), which can then be executed in parallel in the Hadoop cluster. The performance advantages of
MapR, combined with the efficiency of HParser, allow users to perform data parsing and transformations with higher
performance and lower hardware costs compared to other options.
PowerExchange for Hadoop makes it easier for nonprogrammers to move transaction and interaction data between a
MapR cluster and other databases and data warehouses, without hand-coding. MapR's Direct Access NFS
interface also enables users to take advantage of Informatica's full range of data sources and transformations with the
Hadoop environment.
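Because an NFS interface presents the cluster as a mounted file system, ordinary file APIs can read and write cluster data without Hadoop-specific client code. The Python sketch below illustrates the idea, assuming the cluster has already been NFS-mounted on the client. The mount path mentioned in the comment is a hypothetical placeholder, the helper names are illustrative, and the demonstration runs against a temporary directory rather than a real cluster.

```python
import tempfile
from pathlib import Path


def append_record(data_dir: Path, filename: str, record: str) -> None:
    """Append one line to a file using ordinary file I/O."""
    with open(data_dir / filename, "a", encoding="utf-8") as f:
        f.write(record + "\n")


def count_records(data_dir: Path, filename: str) -> int:
    """Count the lines in a file using ordinary file I/O."""
    with open(data_dir / filename, encoding="utf-8") as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    # On a real client, `landing` would point at the NFS mount of the
    # cluster, e.g. Path("/mnt/cluster/landing"). That path is a
    # hypothetical placeholder, not a documented default.
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp)
        append_record(landing, "events.log", "2013-01-01,login,alice")
        append_record(landing, "events.log", "2013-01-01,logout,alice")
        print(count_records(landing, "events.log"))
```

The same functions would work unchanged against a mounted cluster directory, which is the practical benefit of exposing Hadoop storage through a standard file system interface.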
Automating End-to-end Workloads
By configuring, managing, and automating end-to-end workloads that include API connections to both MapR
and Informatica PowerCenter, the burden of Day 2 operations administration is significantly reduced for Hadoop
administrative staff, DBAs, applications architects and data scientists. After the logical data flows are designed, Cisco
Tidal Enterprise Scheduler (TES) does the rest. Cisco TES eliminates the bottlenecks and errors associated with script
management for extract, transform, and load (ETL) processes, data movement, MapReduce jobs and output to analytics
applications.
Cisco TES also automates existing operational data center workloads and aids with pre-validation, sizing, and
performance optimization, which reduces workload deployment errors and the risk of missing SLAs.
A Better Hadoop: Additional Enhancements in MapR's Distribution
Direct Access NFS also facilitates support for volumes, snapshots, and mirroring for all data contained within the Hadoop
cluster, further improving reliability without requiring any extraordinary measures. Volumes make clustered data easier
to both access and manage by grouping related files and directories into a single tree structure that can be more readily
organized, administered, and secured. Snapshots can be taken periodically to create drag-and-drop recovery points, and
mirroring extends data protection to satisfy recovery time objectives. Local mirroring provides high performance for
frequently accessed data, while remote mirroring provides business continuity across multiple data centers, as well as
integration between on-premises and private clouds.
Another major enhancement MapR made was to eliminate single points of failure in the critical NameNode and
JobTracker functions. MapR's Distributed NameNode HA (High Availability) distributes the file metadata on ordinary
DataNodes throughout the cluster. In the extreme, all DataNodes might store and serve a portion of the file metadata.
Every portion is then persisted to disk (with the node's data) and also replicated to at least two other nodes to increase
tolerance to multiple simultaneous node failures. This eliminates the need, present in other distributions, to continuously back
up the Primary NameNode to either a Checkpoint Node (previously called the Secondary NameNode) or a Backup Node.
MapR's JobTracker offers similar resiliency, with the ability to continue all tasks with no interruption or data loss in the
event of a failure. Without such transparent failover, it is necessary to restart the affected job(s) from the beginning.
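The replication idea described above can be illustrated with a toy model. The Python sketch below is not MapR's implementation. It assumes a simple round-robin placement in which each metadata portion is stored on one node plus two further nodes (three copies in total), then checks that every portion remains available after any two nodes fail. All names (`place_portions`, `survives`) are hypothetical.

```python
from itertools import combinations


def place_portions(nodes: list[str], num_portions: int, copies: int = 3) -> dict[int, set[str]]:
    """Assign each metadata portion to `copies` distinct nodes, round-robin.

    Illustrative placement only; a real system would use its own policy.
    """
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for p in range(num_portions):
        placement[p] = {nodes[(p + i) % len(nodes)] for i in range(copies)}
    return placement


def survives(placement: dict[int, set[str]], failed: set[str]) -> bool:
    """Return True if every portion still has at least one copy on a live node."""
    return all(holders - failed for holders in placement.values())


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(6)]
    placement = place_portions(nodes, num_portions=12)
    # With three copies per portion, any two simultaneous failures are tolerated.
    print(all(survives(placement, set(pair)) for pair in combinations(nodes, 2)))
    # Losing all three holders of a portion does make that portion unavailable.
    print(survives(placement, placement[0]))
```

In this model, keeping three copies of each portion is what allows the cluster to tolerate two simultaneous node failures without a dedicated backup node for the metadata.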
MapR's Distributed NameNode HA architecture also improves scalability and performance compared to configurations
with a single Primary NameNode. Even in a server configured with copious amounts of memory, a single NameNode
is normally limited to only about 70 million files. With MapR's Distributed NameNode HA architecture, in contrast,
the cluster scales in a linear fashion with the number of DataNodes, and can therefore contain a virtually unlimited
number of files. The performance advantage derives from the elimination of a Primary NameNode, which can become
a bottleneck even in relatively small clusters. With file metadata distributed across multiple DataNodes throughout the
cluster, performance also scales in a linear fashion with the size of the cluster.
Summary
By using Informatica with MapR's Distribution for Hadoop on Cisco UCS CPA for Big Data, organizations are now able
to achieve high-performance data integration, replication, and messaging. Together, the three companies are pushing
the limits of high-performance networks to move many terabytes per hour of transaction, interaction, and streaming
data into the MapR cluster, as well as to parse and process a broad range of structured and unstructured data natively
in Hadoop, all without coding. The combination also gives organizations a more affordable way to archive data in
applications, data warehouses, and/or legacy systems to Hadoop, or to archive data to Hadoop's lower-cost storage.
Together, Informatica, MapR, and Cisco provide cost-effective, high-performance analytic-ready data storage and
processing with enterprise-class high availability and business continuity.
To learn more, call 855-NOW-MAPR (855-669-6277) or visit the respective companies online at www.mapr.com,
www.informatica.com, or www.cisco.com/go/bigdata.
MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and
real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL,
database and streaming applications in one unified Big Data platform. MapR is used across financial services, retail, media, healthcare,
manufacturing, telecommunications and government organizations as well as by leading Fortune 100 and Web 2.0 companies.
Amazon, Cisco, and Google are part of MapR's broad partner ecosystem. Investors include Lightspeed Venture Partners, Mayfield
Fund, NEA, and Redpoint Ventures.
© 2013 MapR Technologies. All rights reserved. Apache Hadoop, HBase and Hadoop are trademarks of the Apache Software
Foundation and not affiliated with MapR Technologies.