Big Data-11-20
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Any performance data contained herein was determined in a controlled environment. Therefore, the results
obtained in other operating environments may vary significantly. Some measurements may have been made
on development-level systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX® IBM Spectrum™ PowerPC®
Algo® IBM Watson™ PureSystems®
Algo Market® InfoSphere® QRadar®
Algo Risk® Insight™ Redbooks®
Algorithmics® Insights™ Redbooks (logo) ®
BigInsights™ Jazz™ Smarter Planet®
Bluemix™ LSF® SPSS®
Cognos® OpenPages® Symphony®
DataStage® Power Systems™ Tealeaf®
DB2® POWER6® Tivoli®
developerWorks® POWER7® WebSphere®
GPFS™ POWER8™
IBM® PowerLinux™
SoftLayer, and SoftLayer device are trademarks or registered trademarks of SoftLayer, Inc., an IBM Company.
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redbooks® publication documents how IBM Platform Computing offerings, namely
the IBM Platform Symphony® MapReduce framework, IBM Spectrum™ Scale (based on IBM
GPFS™), IBM Platform LSF®, and the Advanced Service Controller for Platform Symphony,
work together as an infrastructure to manage not just Hadoop-related offerings, but many
popular industry offerings, such as Apache Spark, Storm, MongoDB, Cassandra, and so on.
It describes the different ways to run Hadoop in a big data environment, and demonstrates
how IBM Platform Computing solutions, such as Platform Symphony and Platform LSF with
its MapReduce Accelerator, can help improve performance and agility when running Hadoop
on the distributed workload managers offered by IBM. This information is for technical
professionals (consultants, technical support staff, IT architects, and IT specialists) who are
responsible for delivering cost-effective cloud services and big data solutions on IBM Power
Systems™ to help uncover insights in clients’ data so that they can optimize product
development and business results.
Authors
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization (ITSO), Austin Center.
Dino Quintero is a Complex Solutions Project Leader and an IBM Senior Certified IT
Specialist with the ITSO in Poughkeepsie, NY. His areas of knowledge include enterprise
continuous availability, enterprise systems management, system virtualization, technical
computing, and clustering solutions. He is an Open Group Distinguished IT Specialist. Dino
holds a Master of Computing Information Systems degree and a Bachelor of Science degree
in Computer Science from Marist College.
Daniel de Souza Casali is an IBM Cross Systems Senior Certified and has been working at
IBM for 11 years. Daniel works for Systems and Technology Group in Latin America as a
Software Defined Infrastructure IT Specialist. Daniel holds an Engineering degree in Physics
from the Federal University of São Carlos (UFSCar). His areas of expertise include UNIX,
SAN networks, IBM Disk Subsystems, clustering, cloud, and analytics solutions.
Istvan Gabor Szabo is an Infrastructure Architect and Linux Subject Matter Expert at IBM
Hungary (IBM DCCE SFV). He joined IBM in 2010 after receiving his bachelor's degree in
Engineering Information Technology from the University of Obudai - John von Neumann
Tiago Rodrigues de Mello is a Staff Software Engineer in Brazil with more than 10 years of
experience. Tiago’s areas of expertise include Linux system administration, software
development, and cloud computing. He is an OpenStack developer and a Continuous
Integration engineer at the IBM Linux Technology Center. Tiago holds a Bachelor's degree in
Computer Science from the Federal University of São Carlos, Brazil.
Nilton Carlos dos Santos is an IT Architect and a Certified IT Specialist. He has been with
IBM since 2007 and has 18 years of experience in the IT industry. Before joining IBM, he worked
in several different areas of technology, including Linux and UNIX administration, database
management, development in many different languages, and network administration. Nilton
Carlos also has deep expertise in messaging, automation, monitoring, and reporting system
tools. He enjoys working with open source software.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us.
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form:
xiv IBM Software Defined Infrastructure for Big Data Analytics Workloads
ibm.com/redbooks
Send your comments by email:
[email protected]
Mail your comments:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
1
This chapter provides a foundational understanding of big data: what it is and why you need
to care about it. It also describes IBM technology for handling the data management
challenges that big data creates.
“Big data” is a trendy talking point at many companies, much as service-oriented architecture
(SOA) and Web 2.0 were before it, but few people understand exactly what it means.
Rather than volunteering an arbitrary definition of the term, we believe that a better approach
is to explore the evolution of data, along with enterprise data management systems. This
approach ultimately arrives at a clear understanding of what big data is and why we need to
care about it.
The IBM Smarter Planet® initiative was launched during a speech to the Council on Foreign
Relations in New York in 2008. Smarter Planet focuses on the development of technologies
that advance everyday experiences. A large part of developing such technology
depends on the collection and analysis of data from as many sources as possible. This
process is difficult because the number and variety of sources continue to grow. The planet
is exponentially more instrumented, intelligent, and integrated, and it will continue to expand,
with better and faster capabilities. The World Wide Web is truly living up to its name. Through
its continued expansion, cloud computing and the web are driving our ability to generate and
have access to virtually unlimited amounts of data.
The statistics presented in Figure 1-1 support the claim that the world is becoming
exponentially more instrumented.
There was an earlier point in history when only home computers and web-hosting servers
were connected to the web. If you had a connection to the web and ventured into the world of
chat rooms, you were able to communicate by instant messaging with someone in another
part of the world. Hard disk drives were 256 MB, CD players were top-shelf technology, and
cell phones were as large as lunch boxes. We are far from those days.