
Big Data-11-20

This document provides information about IBM products and services, emphasizing that offerings may vary by country and that users should verify compatibility with non-IBM products. It includes disclaimers regarding warranties and patent rights, as well as details about the authors and contributors of the IBM Redbooks publication on big data analytics. The publication aims to assist technical professionals in managing big data environments using IBM's infrastructure solutions.

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results
obtained in other operating environments may vary significantly. Some measurements may have been made
on development-level systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.

© Copyright IBM Corp. 2015. All rights reserved.


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX® IBM Spectrum™ PowerPC®
Algo® IBM Watson™ PureSystems®
Algo Market® InfoSphere® QRadar®
Algo Risk® Insight™ Redbooks®
Algorithmics® Insights™ Redbooks (logo) ®
BigInsights™ Jazz™ Smarter Planet®
Bluemix™ LSF® SPSS®
Cognos® OpenPages® Symphony®
DataStage® Power Systems™ Tealeaf®
DB2® POWER6® Tivoli®
developerWorks® POWER7® WebSphere®
GPFS™ POWER8™
IBM® PowerLinux™

The following terms are trademarks of other companies:

SoftLayer, and SoftLayer device are trademarks or registered trademarks of SoftLayer, Inc., an IBM Company.

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.

Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.

IBM Software Defined Infrastructure for Big Data Analytics Workloads


Preface

This IBM® Redbooks® publication documents how IBM Platform Computing, with its IBM
Platform Symphony® MapReduce framework, IBM Spectrum™ Scale (based on IBM
GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony,
work together as an infrastructure to manage not just Hadoop-related offerings, but many
popular industry offerings, such as Apache Spark, Storm, MongoDB, Cassandra, and so on.

It describes the different ways to run Hadoop in a big data environment, and demonstrates
how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with
its MapReduce Accelerator, can improve performance and agility when running Hadoop on
distributed workload managers offered by IBM. This information is for technical professionals
(consultants, technical support staff, IT architects, and IT specialists) who are responsible for
delivering cost-effective cloud services and big data solutions on IBM Power Systems™ to
help uncover insights in clients' data so that they can optimize product development and
business results.

Authors
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization (ITSO), Austin Center.

Dino Quintero is a Complex Solutions Project Leader and an IBM Senior Certified IT
Specialist with the ITSO in Poughkeepsie, NY. His areas of knowledge include enterprise
continuous availability, enterprise systems management, system virtualization, technical
computing, and clustering solutions. He is an Open Group Distinguished IT Specialist. Dino
holds a Master of Computing Information Systems degree and a Bachelor of Science degree
in Computer Science from Marist College.

Daniel de Souza Casali is IBM Cross Systems Senior Certified and has been working at
IBM for 11 years. Daniel works for Systems and Technology Group in Latin America as a
Software Defined Infrastructure IT Specialist. Daniel holds an Engineering degree in Physics
from the Federal University of São Carlos (UFSCar). His areas of expertise include UNIX,
SAN networks, IBM Disk Subsystems, clustering, cloud, and analytics solutions.

Marcelo Correia Lima is a Business Intelligence Architect at IBM. He has 17 years of
experience in development and integration of Enterprise Applications. His current area of
expertise is Business Analytics Optimization (BAO) Solutions. He has been planning and
managing Solutions lifecycle implementation, involving Multidimensional Modeling, IBM
InfoSphere® Data Architect, IBM InfoSphere DataStage®, IBM Cognos® Business
Intelligence, and IBM DB2®. In addition, Marcelo has added Hadoop, big data, IBM
InfoSphere BigInsights™, cloud computing, and IBM Platform Computing to his background.
Marcelo holds Level 1 IT Specialist Certification Experienced Level in the Services and
Business Information Management specializations (by The Actualizing IT Solutions Capability
Center - IBM Integrated Technology Delivery Headquarters). Before working as a Business
Intelligence Architect, he was involved in the design and implementation of IBM WebSphere®
and Java Enterprise Edition Applications for IBM Data Preparation/Data Services.

Istvan Gabor Szabo is an Infrastructure Architect and Linux Subject Matter Expert at IBM
Hungary (IBM DCCE SFV). He joined IBM in 2010 after receiving his bachelor's degree in
Engineering Information Technology from the University of Obudai - John von Neumann
Faculty of Informatics. Most of the time, he works on projects as a Linux technical lead. His
areas of expertise are configuring and troubleshooting complex environments, and building
automation methodologies for server builds. In his role as an Infrastructure Architect, he
works on the IBM Standard Software Installer (ISSI) environment, where he designs new
environments based on customer requirements.

Maciej Olejniczak is a Cross-functional Software Support Team Leader in a collaborative
environment. He works internationally with external and internal clients, IBM Business
Partners, services, labs, and research teams. He is a dedicated account advocate for large
customers in Poland. Maciej is an IBM Certified Expert in Actualizing IT Solutions: Software
Enablement. He achieved a master level in implementing all activities that transform
information technology from a vision to an actual working solution. Maciej is an Open Group
Master Certified IT Specialist.

Tiago Rodrigues de Mello is a Staff Software Engineer in Brazil with more than 10 years of
experience. Tiago’s areas of expertise include Linux system administration, software
development, and cloud computing. He is an OpenStack developer and a Continuous
Integration engineer at the IBM Linux Technology Center. Tiago holds a bachelor's degree in
Computer Science from the Federal University of São Carlos, Brazil.

Nilton Carlos dos Santos is an IT Architect and a Certified IT Specialist. He has been with
IBM since 2007 and has 18 years of experience in the IT industry. Before joining IBM, he worked
in several different areas of technology, including Linux and UNIX administration, database
management, development in many different languages, and network administration. Nilton
Carlos also has deep expertise in messaging, automation, monitoring, and reporting system
tools. He enjoys working with open source software.

Thanks to the following people for their contributions to this project:

Richard Conway, David Bennin


International Technical Support Organization, Austin Center

Now you can become a published author, too


Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us.

We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
• Use the online Contact us review Redbooks form:
  ibm.com/redbooks
• Send your comments by email:
  [email protected]
• Mail your comments:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks


• Find us on Facebook:
  http://www.facebook.com/IBMRedbooks
• Follow us on Twitter:
  http://twitter.com/ibmredbooks
• Look for us on LinkedIn:
  http://www.linkedin.com/groups?home=&gid=2130806
• Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
  weekly newsletter:
  https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
• Stay current on recent Redbooks publications with RSS Feeds:
  http://www.redbooks.ibm.com/rss.html


Chapter 1. Introduction to big data


As our planet becomes more integrated, the rate of data growth is increasing exponentially.
This data explosion is rendering commonly accepted practices of data management
inadequate. This has resulted in a new wave of business challenges related to data
management and analytics. The term big data now comes up everywhere, in conversation
and in print.

This chapter provides a foundational understanding of big data: What it is, and why you need
to care about it. It also describes IBM technology to handle these data management
challenges.

The following topics are covered in this chapter:


• 1.1, “Evolution and characteristics of big data”
• 1.2, “What’s in a bite”
• 1.3, “Is the demand for a big data solution real?”
• 1.4, “What is Hadoop?”
• 1.5, “Hadoop Distributed File System in more detail”
• 1.6, “MapReduce in more detail”
• 1.7, “The changing nature of distributed computing”
• 1.8, “IBM Platform Symphony grid manager”


1.1 Evolution and characteristics of big data
If your profession is heavily based in the realm of information management, there is a good
chance that you have heard the term big data at least once over the past week or so. It has
become increasingly popular to incorporate “big data” in data management, cloud computing,
and application development discussions.

In a similar way, it was previously popular to bring up service-oriented architecture
(SOA) and Web 2.0, to name two examples. “Big data” is a trendy talking point at many companies,
but few people understand exactly what it means.

Rather than volunteering an arbitrary definition of the term, we believe that a better approach
is to explore the evolution of data, along with enterprise data management systems. This
approach ultimately arrives at a clear understanding of what big data is and why we need to
care about it.

The IBM Smarter Planet® initiative was launched during a speech to the Council on Foreign
Relations in New York in 2008. Smarter Planet focuses on development of technologies that
are advancing everyday experiences. A large part of developing such technology is
dependent on the collection and analysis of data from as many sources as possible. This
process is difficult, because the number and variety of sources continue to grow. The planet
is exponentially more instrumented, intelligent, and integrated, and it will continue to expand,
with better and faster capabilities. The World Wide Web is truly living up to its name. Through
its continued expansion, cloud computing and the web are driving our ability to generate and
have access to virtually unlimited amounts of data.

The statistics that are presented in Figure 1-1 support the claim that the world is becoming
exponentially more instrumented.

Figure 1-1 Predicted worldwide data growth

There was an earlier point in history when only home computers and web-hosting servers
were connected to the web. If you had a connection to the web and ventured into the world of
chat rooms, you were able to communicate by instant messaging with someone in another
part of the world. Hard disk drives were 256 MB, CD players were top-shelf technology, and
cell phones were as large as lunch boxes. We are far from those days.
