IBM Introduction To Big Data

Unit 1 of IBM big data course

Unit 1 Introduction to Big Data

Introduction to Big Data

IBM BigInsights v4.0

© Copyright IBM Corporation 2015


Course materials may not be reproduced in whole or in part without the written permission of IBM.

This material is meant for IBM Academic Initiative use only. NOT FOR RESALE

Unit objectives
• Understand when and why you would use big data
• Explain the perception gap
• Explain the difference between data-at-rest and data-in-motion
• Describe the 3 Vs

System of Units / Binary System of Units

SI unit           SI value   Binary usage (deprecated)   IEC unit (1999)   IEC value
kilobyte  (KB)    10^3       2^10                        kibibyte (KiB)    2^10
megabyte  (MB)    10^6       2^20                        mebibyte (MiB)    2^20
gigabyte  (GB)    10^9       2^30                        gibibyte (GiB)    2^30
terabyte  (TB)    10^12      2^40                        tebibyte (TiB)    2^40
petabyte  (PB)    10^15      2^50                        pebibyte (PiB)    2^50
exabyte   (EB)    10^18      2^60                        exbibyte (EiB)    2^60
zettabyte (ZB)    10^21      2^70                        zebibyte (ZiB)    2^70
yottabyte (YB)    10^24      2^80                        yobibyte (YiB)    2^80

SI = International System of Units; IEC = International Electrotechnical Commission.

Source: Wikipedia, http://en.wikipedia.org/wiki/Kibibytes

When dealing with big data, we speak of numbers that are not part of our everyday
conversations. Terms like kilobytes, megabytes, and gigabytes are commonly known.
The term terabyte has been added to our discussions in the past couple of years. But to
most people, the terms petabyte, exabyte, zettabyte, and yottabyte sound foreign. Like
it or not, those terms are necessary when dealing with big data. Some of these terms
will be used in this course. You should at least have a basic understanding of what they
mean.
• kilobyte (KB) 10 to the 3rd power
• megabyte (MB) 10 to the 6th power
• gigabyte (GB) 10 to the 9th power
• terabyte (TB) 10 to the 12th power
• petabyte (PB) 10 to the 15th power
• exabyte (EB) 10 to the 18th power
• zettabyte (ZB) 10 to the 21st power
• yottabyte (YB) 10 to the 24th power
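
As a rough illustration of the decimal (SI) and binary (IEC) prefixes listed above, the following Python sketch converts a byte count into both notations. The helper name `format_bytes` and the sample value are assumptions made for this example; they are not part of the course material.

```python
# Decimal (SI) prefixes step by 10**3; binary (IEC) prefixes step by 2**10.
SI_UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]


def format_bytes(n_bytes, binary=False):
    """Format n_bytes with the largest prefix that fits (hypothetical helper)."""
    base = 1024 if binary else 1000
    units = IEC_UNITS if binary else SI_UNITS
    value = float(n_bytes)
    index = 0
    # Divide by the base until the value is below one step or the units run out.
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {units[index]}"


if __name__ == "__main__":
    one_petabyte = 10**15
    print(format_bytes(one_petabyte))               # decimal (SI) view
    print(format_bytes(one_petabyte, binary=True))  # binary (IEC) view
    # One zettabyte expressed in terabytes: 10**21 / 10**12 = 10**9 (one billion).
    print(10**21 // 10**12)
```

The same byte count reads differently in the two systems, which is why it helps to state which prefix family is meant when quoting very large volumes.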

The scale
• 2.5 petabytes
  o Memory capacity of the human brain
• 13 petabytes
  o Amount that could be downloaded from the internet in two minutes, if every
    American (300M) was on a computer at the same time
• 4.75 exabytes
  o Total genome sequences of all people on the Earth
• 422 exabytes
  o Total digital data created in 2008
• 1 zettabyte
  o World’s current digital storage capacity
• 1.8 zettabytes
  o Total digital data expected to be created in 2011

It is hard for most people to grasp the concept of how large a petabyte or an exabyte is.
For a long time, people thought that a billion was a large number. But given how quickly most
governments spend a billion dollars or euros, it evidently is not that large a
number. To better understand extremely large numbers, it is best to view them in
comparison to something that you can understand. The capacity of the human brain is
about 2.5 petabytes. (This is also the estimated size of Walmart databases that handle
1 million customer transactions a day.) The total genome sequences of all people on
the Earth is 4.75 exabytes. The total amount of digital data created in 2008 was 422
exabytes. And the total that was expected to be created in 2011 was 1.8 zettabytes.
In 2000 the Sloan Digital Sky Survey began collecting astronomical data. In the first few
weeks it amassed more data than was collected in the history of astronomy. And the
total amount of data collected by the SDSS is the amount that its successor, the Large
Synoptic Survey Telescope, is expected to collect every 5 days, when it comes online
in 2016.

There is an explosion in data and real world events

• 2 billion Internet users by 2011
• 1.3 billion RFID tags in 2005; 30 billion RFID tags today
• 4.6 billion mobile phones worldwide
• Capital market data volumes grew 1,750%, 2003-06
• Twitter processes 7 terabytes of data every day
• Facebook processes 10 terabytes of data every day
• World Data Centre for Climate: 220 terabytes of Web data and 9 petabytes of additional data

The amount of data that gets created every day is of mind-boggling proportions and will
only continue to increase. Moore's law, the observation that the number of transistors on a
chip doubles roughly every two years, is often paraphrased as computer processing power
doubling every two years. It seems that there is some sort of a corollary to this law when
it comes to data as well. The problem that we have when dealing with so much data is
that it becomes almost impossible to separate the important facts from the unimportant
ones. So we need computer programs to help us distill the data. But a single program
working with terabytes of data requires a lot of time to process it. By the time the
processing has been completed, the answer may no longer be relevant. For example, knowing
the traffic patterns of the last three days does not help you determine how to cross a
street at this instant.
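
To make the timing concern concrete, here is a back-of-the-envelope sketch in Python. The throughput of 100 MB/s per worker, the 10 TB data size, and the worker counts are assumed round numbers chosen only for illustration; none of them come from the course.

```python
TB = 10**12  # bytes in one terabyte (SI)
MB = 10**6   # bytes in one megabyte (SI)


def scan_hours(data_bytes, workers=1, bytes_per_second_per_worker=100 * MB):
    """Estimate hours to read data_bytes once, assuming the work splits evenly."""
    seconds = data_bytes / (workers * bytes_per_second_per_worker)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed scenario: one pass over 10 TB with 1, 10, or 100 parallel workers.
    for workers in (1, 10, 100):
        hours = scan_hours(10 * TB, workers=workers)
        print(f"{workers:>3} worker(s): {hours:.2f} hours to scan 10 TB")
```

Under these assumptions a single worker needs more than a day for one pass over 10 TB, so the answer can easily be stale by the time it arrives. Real systems rarely scale this cleanly, so the parallel figures should be read as a lower bound.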
The other thing that makes working with all of this data so difficult is the fact that most of
the data is unstructured. Computer programs work well with structured data. If there is a
data field that only has postal code values, then it is easy to search on that field and get
all of the stores within a particular geographical area. But what if you are looking for
some key words in recorded conversations? The data is there but accessing it
becomes significantly harder.
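
The contrast between the two cases can be sketched in a few lines of Python. The store records, postal codes, and transcripts below are invented sample data, and the function names are hypothetical. The text search is a deliberately naive substring scan; recorded conversations would first need to be transcribed.

```python
# Structured data: every record has a dedicated postal_code field.
stores = [
    {"name": "Store A", "postal_code": "10001"},
    {"name": "Store B", "postal_code": "94105"},
    {"name": "Store C", "postal_code": "10001"},
]


def stores_in_postal_code(records, postal_code):
    """Exact match on a known field; an index could turn this into a direct lookup."""
    return [r["name"] for r in records if r["postal_code"] == postal_code]


# Unstructured data: free text with no fields, so every document must be scanned.
transcripts = [
    "Hi, I would like to cancel my subscription because of the price.",
    "The delivery was late but the support agent was very helpful.",
    "Can you tell me the price of the premium plan?",
]


def transcripts_mentioning(texts, keyword):
    """Naive case-insensitive substring scan; it misses synonyms and misspellings."""
    needle = keyword.lower()
    return [i for i, text in enumerate(texts) if needle in text.lower()]


if __name__ == "__main__":
    print(stores_in_postal_code(stores, "10001"))
    print(transcripts_mentioning(transcripts, "price"))
```

The structured query relies on a field whose meaning is known in advance, while the text scan has to guess at meaning from raw characters, which is where the additional difficulty comes from.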

Some examples of big data


• Science
  o Astronomy
  o Atmospheric science
  o Genomics
  o Biogeochemical
  o Biological
  o and other complex and/or interdisciplinary scientific research
• Social
  o Social networks
  o Social data
    − Person to person (P2P, C2C): Wish Lists on Amazon.com, Craig’s List
    − Person to world (P2W, C2W): Twitter, Facebook, LinkedIn
• Commercial
  o Web / event / database logs
  o "Digital exhaust" (result of human interaction with the Internet)
  o Sensor networks
  o RFID
  o Internet text and documents
  o Internet search indexing
  o Call detail records (CDR)
  o Medical records
  o Photographic archives
  o Video / audio archives
  o Large scale eCommerce
• Government
  o Regular government business and commerce needs
  o Military and homeland security surveillance

Some examples of big data are social networks, web logs, RFID information, video and
audio archives, sensor data, military surveillance, astronomy, genomics and internet
search indexing.

The growth of data


• 2.5 million items per minute
• 300,000 tweets per minute
• 200 million emails per minute
• 220,000 photos per minute
• 5 TB per flight
• > 1 PB per day, gas turbines
• 2012: 2.8 zettabytes
• 2020: 40 zettabytes
• 80% of world’s data is unstructured

Organizations need deeper insights
• 1 in 3 business leaders frequently make decisions based on information they don’t trust, or
  don’t have
• 1 in 2 business leaders say they don't have access to the information they need to do their
  jobs
• 83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to
  enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in
  order to make swift business decisions

Information is at the center of a new wave of opportunity



The growth of data is staggering. Just look at some of the statistics shown in the
diagram. The tremendous volume and variety of data being generated at an
accelerated velocity creates tremendous opportunity for organizations. Unfortunately,
organizations are struggling to gain deeper insights from this data. Business leaders
continue to make decisions without access to the trusted information they need.
We all understand the well-organized structured data world. We've dealt with it for
decades. It's at the very core of what information technology and the advancement of
programmable computing has brought to us over the last 60 years. But a lot of data is
unstructured or semi-structured. We're really just at the very beginning in terms of
what the possibilities are, how we can get at that information, and what we can do with
it. Think about all the information being generated by social networking sites (like
Twitter, Facebook, LinkedIn, and so on), web logs, click streams, instant messages,
emails, electronic sensor data, and so on. How would it change your business if you
could efficiently filter through that data, aggregate the right bits and pieces, combine it
with your operational data, and analyze it effectively?

Sources:
• The Guardian, May 2010
• IDC Digital Universe, 2010
• IBM Institute for Business Value, 2009
• IBM CIO Study 2010
• TDWI: Next Generation Data Warehouse Platforms Q4 2009
• https://blog.kissmetrics.com/facebook-statistics/
• http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
• http://www.computerworlduk.com/news/infrastructure/3433595/boeing-787s-to-create-half-a-terabyte-of-data-per-flight-says-virgin-atlantic/
• http://www.forbes.com/sites/maribellopez/2013/05/10/ge-speaks-on-the-business-value-of-the-internet-of-things/
• http://www.idc.com/prodserv/4Pillars/bigdata;jsessionid=94A407E4522FB407627ECEBBAAA90A24
• http://www.digitalbuzzblog.com/infographic-24-hours-on-the-internet/
• ZB = 1 billion TB
• IDC reference:
  o http://idcdocserv.com/925
  o http://www.computer.org/portal/web/news/home/-/blogs/2613266;jsessionid=abbfded1402383e107abfa2641d6

Example: The perception gap surrounding social media


IBM 2010 CEO Study: 88 percent of CEOs said "getting closer to customers" was top priority over next 5 years and
viewed social media as a core part of that strategy
However, a March 2011 IBM study identified that companies fail to understand what customers want from social
advertising and outreach

Chart: Social media and social networking will increase customer advocacy?
• Agree: 70%
• Neutral: 23%
• Disagree: 7%

Sources:
• "Capitalizing on Complexity: Insights from the Global Chief Executive Officer Study,"
  IBM Institute for Business Value, 2010
• "What Customers Want," first in a two-part series, IBM Institute for Business Value,
  published March 2011


Let's look into one facet of the big data challenge a bit further. An IBM CEO study
revealed that social media was a core part of many firms' strategies for getting closer to
customers. However, a separate study showed that companies often don't understand
what consumers really want from them on social media sites. Indeed, the top two
consumer choices, discounts and purchases, didn't even make the top 10 list of what
companies thought their consumers wanted. Wouldn't it be great if companies didn't
have to guess? What if they could look at consumer behavior and sentiment expressed
on social media sites and really know what people wanted? This is one aspect of the big
data challenge.

Streams and oceans of information


Information streams
• High speed information flowing in real-time, often transient
  o Information from sensors and instruments
  o Information flowing from real-time logs and activity monitors
  o Streaming content like audio and video
  o High speed transactions like tickers, trades, or traffic systems

Information oceans
• Information stored outside conventional systems. Data may originate from the Web or
  from different internal systems
  o Collection of what has streamed
  o Information from social media, logs, click streams, and emails
  o Unstructured or mixed schema documents like claims, forms, and desktop applications
  o Structured data from disparate systems

There are two groups of big data. Some data flows in real time, for
example, information coming from sensors or video feeds. Sometimes the real-time
data can have very high volumes, like stock tickers or patient monitoring systems in a
hospital. This type of data cannot use a "store and access" method. Knowing the
volume of trades for a particular stock or a patient's vitals from two days ago does not
help you make a decision right now. IBM's InfoSphere Streams was developed to handle
this type of data, which is referred to as information streams.
On the other hand, we can have massive amounts of stored data, like emails, web logs,
and click streams, that need to be analyzed. This data can consist of both structured
and unstructured data. The question then becomes: how can we process this large
amount of data in a timely manner? We refer to data of that type as information
oceans, which IBM BigInsights was designed to address.
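
The difference between the two groups can be sketched in plain Python. This is only an illustration of the idea under simple assumptions: it does not use InfoSphere Streams or BigInsights, the function names are hypothetical, and the heart-rate readings are invented sample values.

```python
from collections import deque


def rolling_average(readings, window=3):
    """Data in motion: emit a result per arriving reading, keeping only recent values."""
    recent = deque(maxlen=window)  # older readings are discarded automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


def batch_average(stored_readings):
    """Data at rest: the full data set is already stored, so analyze it in one pass."""
    return sum(stored_readings) / len(stored_readings)


if __name__ == "__main__":
    heart_rates = [72, 75, 71, 110, 115, 118]  # invented sample values
    # Streaming view: a fresh answer after every reading, based on the last three.
    for latest in rolling_average(iter(heart_rates)):
        print(f"rolling average: {latest:.1f}")
    # Stored view: a single answer over everything collected so far.
    print(f"batch average: {batch_average(heart_rates):.1f}")
```

The rolling version reacts as soon as the recent readings change and never needs the full history, while the batch version can only answer after the data has been collected.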

Big data presents big opportunities


• Extract insight from a high volume, variety and velocity of data in a
  timely and cost-effective manner
  o Variety: Manage and benefit from diverse data types and data structures
  o Velocity: Analyze streaming data and large volumes of persistent data
  o Volume: Scale from terabytes to zettabytes


ZB = 1 billion TB
We believe that big data presents organizations with a big opportunity to extract new
insights that can improve their decision-making process and business plans. Massive
volume, variety and velocity are defining characteristics of big data, and IBM has built
its platform to address these characteristics.
In order to capitalize on big data, firms must be able to analyze a wide variety of data,
including text, sensor data, audio, video, transactional data, and others.
Sometimes, getting an edge over your competition can mean identifying a trend,
problem, or opportunity seconds, or even microseconds, before someone else. More
and more of the data being produced today has a very short half-life. Organizations
must be able to analyze this data in real time if they are to find insights in
it.
And, as implied by the term big data, organizations are facing massive volumes of data.
Organizations that don't know how to manage this data are overwhelmed by it. But
wouldn't it be great if the right technology was available to analyze all of the data so that
you could gain a better understanding of your business, your customers, and the
marketplace?

Merging the traditional and big data approaches


                 Traditional Approach                 Big Data Approach
                 Structured & Repeatable Analysis     Iterative & Exploratory Analysis

Business users   Determine what question to ask       Explore what questions could be asked
IT               Structures the data to answer        Delivers a platform to enable
                 that question                        creative discovery
Examples         Monthly sales reports                Brand sentiment
                 Profitability analysis               Product strategy
                 Customer surveys                     Maximum asset utilization


The big data approach complements the traditional approach.
The traditional approach calls for business users to determine what questions to ask
and for IT to structure the data to answer those questions. This is well suited to many common
business processes, such as monitoring sales by geography, product, or channel;
extracting insight from customer surveys; and performing cost and profitability analyses.
The big data approach is a bit different. With this approach, IT delivers a platform that
consolidates data sources of interest and enables creative discovery. Then the
business users use the platform to explore data for ideas and questions to ask.
On the left, the traditional approach allows organizations to answer questions that will be
asked time and time again. On the right, users have the ability to explore their data in a
more creative way. Before finding the answer, they must first define the question. Are
my customers starting to change their preferences? What is the best way to measure
brand health?

What we hear from customers


• Lots of potentially valuable data is dormant or discarded due to size/performance
  considerations
• Large volume of unstructured or semi-structured data is not worth integrating fully
  (such as Tweets, logs, etc.)
• Not clear what should be analyzed (exploratory, iterative)
• Information distributed across multiple systems and/or Internet
• Some information has a short useful lifespan
• Volumes can be extremely high
• Analysis needed in the context of existing information (not stand alone)

We've been working with a number of customers who tell us about the kinds of
challenges they're facing with big data. In many cases, they're uncertain exactly what
needs to be analyzed - that is, they need to explore the volumes of data they have to
discover what might be of value. Quite often, large volumes of information are currently
lying dormant in their firms or are discarded completely due to size or performance
considerations.
In addition, potentially interesting business data is seldom in one place and much of the
unstructured data that may contain useful tidbits of information isn't worth fully
integrating into a data warehouse or in-house operational system. Examples
include posts to social media sites, such as Facebook, Twitter, or Yelp. Some of the
information has a short useful lifespan (sensor feeds, news feeds, and web logs are
examples of this), and volumes can be extremely high. Firms are looking for an
efficient, cost-effective way to address these issues within the context of their existing
businesses.

Big data scenarios span many industries

• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
• Predict weather patterns to plan optimal wind turbine usage, and optimize capital
  expenditure on asset placement
• Make risk decisions based on real-time transactional data
• Identify criminals and threats from disparate video, audio, and data feeds


The need to cope with and leverage big data spans many industries and application
domains.
• Imagine if you could analyze all the tweets being created each day to figure out
what people are saying about your products and who the key influencers are
within your target demographics. Imagine being able to mine this data to identify
new market opportunities.
• What if hospitals could take the thousands of sensor readings collected every
hour per patient in ICUs to identify subtle indications that the patient is becoming
unwell, days earlier than is allowed by traditional techniques?
• Imagine if a green energy company could use PBs of weather data along with
massive volumes of operational data to optimize asset location and utilization,
making these environmentally friendly energy sources more cost competitive with
traditional sources.

• Imagine if you could make risk decisions, such as whether or not someone
qualifies for a mortgage, in minutes, by analyzing many sources of data, including
real-time transactional data, while the client is still on the phone or in the office.
• Imagine if law enforcement agencies could analyze audio and video feeds in real-
time without human intervention to identify suspicious activity.
As these new sources of data continue to grow in volume, variety and velocity, so too
does the potential of this data to revolutionize the decision-making processes in every
industry.

Big data use study

2012 Big Data @ Work Study surveying 1144 business and IT professionals in 95 countries

Gartner Sept. 2014 report: 13% of surveyed organizations have deployed big data solutions, while 73%
have invested in big data or plan to do so.


While some organizations have been very successful launching production big data
projects, studies show that the majority of organizations are still in the early adoption
stages, as shown in a study conducted by the University of Oxford and IBM in 2012.
Source: Analytics: The real-world use of big data, How innovative enterprises extract
value from uncertain data, IBM Institute for Business Value and Saïd Business School
at the University of Oxford, 2012
Link: http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF
Following are excerpts from this report related to the chart shown. Additionally, in
September 2014 Gartner released a study showing similar results. IBM does not have
the right to redistribute this report, titled Major Myths About Big Data's Impact on
Information Infrastructure, G00269433.

In the Educate stage, the primary focus is on awareness and knowledge development.
Almost 25 percent of respondents indicated that they are not yet using big data within
their organizations. While some remain relatively unaware of the topic of big data, our
interviews suggest that most organizations in this stage are studying the potential
benefits of big data technologies and analytics, and trying to better understand how big
data can help address important business opportunities in their own industries or
markets.
The focus of the Explore stage is to develop an organization's roadmap for big data
development. Almost half of respondents reported formal, ongoing discussions within
their organizations about how to use big data to solve important business challenges.
Key objectives of these organizations include developing a quantifiable business case
and creating a big data blueprint.
In the Engage stage, organizations begin to prove the business value of big data, as
well as perform an assessment of their technologies and skills. More than one in five
respondent organizations is currently developing proofs-of-concept (POCs) to validate
the requirements associated with implementing big data initiatives, as well as to
articulate the expected returns.
In the Execute stage, big data and analytics capabilities are more widely
operationalized and implemented within the organization. However, only 6 percent of
respondents reported that their organizations have implemented two or more big data
solutions at scale, the threshold for advancing to this stage.

Big data use: focus areas and data sources


When asked to rank their top three objectives for big data, nearly half of the
respondents identified customer-centric objectives as their organization's top priority
(see figure on the left). Companies clearly see big data as providing the ability to better
understand and predict customer behaviors, and by doing so, improve the customer
experience. Transactions, multi-channel interactions, social media, syndicated data
through sources like loyalty cards, and other customer-related information have
increased the ability of organizations to create a complete picture of customers'
preferences and demands, a goal of marketing, sales and customer service for
decades.
Many big data initiatives begin with untapped sources of internal information. In the figure
on the right, you can also see that external data sources (such as social media) factor
significantly in big data strategies as well.
Source: Analytics: The real-world use of big data, How innovative enterprises extract
value from uncertain data, IBM Institute for Business Value and Saïd Business School
at the University of Oxford, 2012
Link: http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF

Unit summary
• Understand when and why you would use big data
• Explain the perception gap
• Explain the difference between data-at-rest and data-in-motion
• Describe the 3 Vs
