Week-1 Lecture Notes
Introduction to Big Data
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss a brief introduction to Big Data: why Big Data and where it came from, the challenges and applications of Big Data, and the characteristics of Big Data, i.e. Volume, Velocity, Variety, and more V's.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable
from analysis of a single large set of related data, as compared to separate
smaller sets with the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-time
roadway traffic conditions.”
Facts and Figures
Walmart handles 1 million customer transactions/hour.
Facebook handles 40 billion photos from its user base!
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
A flight generates 240 terabytes of flight data in 6-8 hours of flight.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second
largest number of rows in a unique database (1.9 trillion), which
comprises AT&T’s extensive calling records.
A scale of data sizes (rice analogy):
GB (10^9): 3 semi trucks of rice
TB (10^12): 2 container ships of rice (the Internet)
PB (10^15): blankets 1/2 of Jaipur
Exabyte (10^18): blankets the West coast, or 1/4th of India (Big Data)
Zettabyte (10^21): fills the Pacific Ocean (the Future)
Yottabyte (10^24): an earth-sized rice bowl
Brontobyte (10^27): astronomical size
What’s making so much data?
Sources: people, machines, organizations: ubiquitous computing.
More people carrying data-generating devices (mobile phones with Facebook, GPS, cameras, etc.)
Data on the Internet: Internet Live Stats
http://www.internetlivestats.com/
Examples of data sources and growth (figure):
? TBs of data every day
100s of millions of GPS-enabled devices sold annually
25+ TBs of log data every day
2+ billion people on the Web by end 2011
76 million smart meters in 2009… 200M by 2014
With traditional tools, determining whether a particular topic was trending would take so long that the result would be meaningless by the time it was computed.
Big Data came up with a solution: store this data in novel ways in order to make it more accessible, and also come up with methods of performing analysis on it.
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
Convert 350 billion annual meter readings to better predict power consumption.
Data Volume
44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB.
Data volume is increasing exponentially: an exponential increase in collected/generated data.
Example (EarthScope): the observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Velocity (Speed)
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
Scrutinize 5 million trade events created each day to identify potential fraud.
Analyze 500 million daily call detail records in real-time to predict customer churn faster.
Online Data Analytics
Late decisions ➔ missing opportunities
Examples
E-Promotions: Based on your current location, your purchase history, what you like ➔ send promotions right now for the store next to you.
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-time analytics for influencing customer behavior (figure):
Product Recommendations that are Relevant & Compelling
Learning why Customers Switch to competitors and their offers, in time to Counter
Improving the Marketing Effectiveness of a Promotion while it is still in Play
Friend Invitations to join a Game or Activity that expands business
Preventing Fraud as it is Occurring & preventing more proactively
Variety: data comes in many formats and structures.
Unstructured: text, sensor data, audio, video
Semi-structured: web data, log files
Graph data: Social Network, Semantic Web (RDF), …
Streaming Data
The 3 V's (Volume, Velocity, Variety), plus 1: Value.
More V's: Variability, Viscosity & Volatility, Viability, Venue, Vocabulary, Vagueness, …
Unify your data systems.
All three above will lead to increased data collaboration -> add value to your big data.
Veracity: 1 in 3 business leaders don't trust the information they use to make decisions.
How can you act upon information if you don't trust it?
Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
Examples: satellite imagery vs social media posts; prediction quality vs human impact.
Variability:
Language evolution
Data availability
Sampling processes
Changes in characteristics of the data source
Volatility: rate of data loss and stable lifetime of data.
Scientific data often has a practically unlimited lifespan, but social / business data may evaporate in finite time.
Venue
Where does the data live and how do you get it?
Vocabulary
Metadata describing structure, content, & provenance
Schemas, semantics, ontologies, taxonomies, vocabularies
Vagueness
Confusion about what “Big Data” means
Standardize by reducing Variety:
Format
Standards
Structure
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
- All types of data, and many sources
- Very large datasets
- More of real-time apps
Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps.
Conclusion
In this lecture, we discussed a brief introduction to Big Data: why Big Data, where it came from, and the challenges and applications of Big Data.
We have also described the characteristics of Big Data, i.e. Volume, Velocity, Variety and more V's, Big Data Analytics, the Big Data Landscape, and Big Data Technology.
Big Data Enabling Technologies
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss a brief introduction to
Big Data Enabling Technologies.
A recent survey says that 80% of the data created in the world is unstructured.
One challenge is how we can store and process this huge amount of data. In this lecture, we will discuss the top technologies used to store and analyse Big Data.
HDFS (Hadoop Distributed File System): the storage system of Hadoop, which splits big data and distributes it across many nodes in a cluster.
a. Scaling out of H/W resources
b. Fault tolerant
MapReduce: a programming model that simplifies parallel programming.
a. Map -> apply()
b. Reduce -> summarize()
c. Google used MapReduce for indexing websites.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
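To make the map/reduce contract concrete, here is a minimal sketch of the classic word-count job in plain, self-contained Python. It only simulates the model in a single process; real Hadoop jobs implement Mapper and Reducer classes against the Hadoop Java API, and the names map_fn, reduce_fn, and run_mapreduce are illustrative only.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle": group values by key
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

lines = enumerate(["big data is big", "data is everywhere"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# -> {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```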
YARN (Yet Another Resource Negotiator) is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.
YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster, and for scheduling tasks to be executed on different cluster nodes.
Hive: provides an SQL-like language (HiveQL) to access big data.
Apache Spark has gained a lot of attraction both in academia and in industry.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
ZooKeeper: a coordination service for distributed applications.
Key attributes of such data:
Small size
Dynamic
Performance sensitive
Critical
https://zookeeper.apache.org/
In very simple words, it is a central store of key-value pairs, using which distributed systems can coordinate. Since it needs to be able to handle the load, Zookeeper itself runs on many machines.
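To illustrate the "central key-value store for coordination" idea, here is a minimal sketch using the third-party kazoo Python client; the ensemble address and znode paths are assumptions for the example.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is an assumption).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Coordination data is small, dynamic, and critical: here, a leader marker.
zk.ensure_path("/app")
if not zk.exists("/app/leader"):
    # An ephemeral znode disappears when this client's session ends.
    zk.create("/app/leader", b"node-1", ephemeral=True)

data, stat = zk.get("/app/leader")
print(data.decode(), "version:", stat.version)

zk.stop()
```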
NoSQL
While traditional SQL can be effectively used to handle large amounts of structured data, we need NoSQL (Not Only SQL) to handle unstructured data.
NoSQL databases store unstructured data with no particular schema.
Each row can have its own set of column values. NoSQL gives better performance in storing massive amounts of data.
Cassandra handles huge amounts of data with its distributed architecture.
Data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure.
In the original figure, circles are Cassandra nodes, lines between the circles show the distributed architecture, and the client sends data to a node.
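As a sketch of how a client talks to such a cluster, here is the DataStax cassandra-driver for Python; the contact point, keyspace, and table are hypothetical, and the replication factor of 3 mirrors the multi-replica point above.

```python
from cassandra.cluster import Cluster

# Connect to one or more Cassandra nodes (address is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A replication factor > 1 gives high availability, no single point of failure.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Asha"))
print(session.execute("SELECT name FROM users WHERE id = 1").one().name)

cluster.shutdown()
```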
HBase: modeled on Google's Bigtable, it is primarily written in Java.
HBase can store massive amounts of data, from terabytes to petabytes.
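As a sketch, HBase's key-value, column-family model can be exercised from Python via the third-party happybase library, which talks to HBase's Thrift gateway; the host, the table name, and the assumption that a "cf" column family already exists are all illustrative.

```python
import happybase

# Connect through the HBase Thrift server (host is an assumption).
connection = happybase.Connection("localhost")
table = connection.table("sensor_readings")  # assumes table with family "cf"

# Rows are keyed; each row can carry its own set of columns in a family.
table.put(b"station-7#2024-01-01", {b"cf:temp": b"21.5", b"cf:unit": b"C"})
table.put(b"station-8#2024-01-01", {b"cf:temp": b"19.0"})  # different columns

print(table.row(b"station-7#2024-01-01"))
connection.close()
```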
Spark Streaming: streaming data input from HDFS, Kafka, Flume, TCP sockets, Kinesis, etc.
Spark ML (Machine Learning) functions and GraphX graph processing algorithms are fully applicable to streaming data.
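Here is a minimal PySpark sketch of a Spark Streaming job, counting words that arrive on a TCP socket in 1-second micro-batches; the host and port are placeholders (e.g. fed locally by `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# One of several possible inputs: a TCP socket (HDFS, Kafka, etc. also work).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```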
Apache Kafka is an open source distributed streaming platform capable of handling trillions of events a day. Kafka is based on the abstraction of a distributed commit log.
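As a sketch of the commit-log abstraction, here is a producer and consumer using the third-party kafka-python client; the broker address and the "events" topic are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish an event to a topic (broker address is an assumption).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user-42 clicked checkout")
producer.flush()

# Read the log back from the beginning; offsets reflect the commit log.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.offset, record.value)
    break
```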
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
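A small sketch with the DataFrame-based MLlib API (pyspark.ml), clustering four 2-D points with k-means; the data is made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Two obvious clusters of 2-D points.
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)
print(model.clusterCenters())  # roughly [0.5, 0.5] and [8.5, 8.5]

spark.stop()
```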
GraphX is Apache Spark's API for graphs and graph-parallel computation, built around a property-graph abstraction.
GraphX reuses the Spark RDD concept, simplifies graph analytics tasks, and provides the ability to perform operations on a directed multigraph with properties attached to each vertex and edge.
GraphX is a thin layer on top of the Spark general-purpose dataflow framework (lines of code).
Conclusion
In this lecture, we discussed the following Big Data enabling technologies:
YARN
NoSQL
Hive
MapReduce
Apache Spark
Zookeeper
Cassandra
HBase
Spark Streaming
Kafka
Spark MLlib
GraphX
Big Data Hadoop Stack
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss Hadoop, the enabling technology for Big Data.
We will also look into the Hadoop stack and the applications and technologies associated with Big Data solutions.
Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005 to support distribution for the Nutch Search Engine Project.
Doug, who was working at Yahoo at the time, and who is now chief architect at Cloudera, named this project after his son's toy elephant, Hadoop.
Hadoop started out as a simple batch processing framework.
We can distribute and scale across many nodes very easily, in a very cost-effective manner.
If we think about an individual machine or rack of machines, or a large cluster or supercomputer, they all fail at some point of time, or some of their components will fail. These failures are so common that we have to account for them ahead of time.
All of these failures are handled within the Hadoop framework. Apache's Hadoop MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System. Another very interesting thing that Hadoop brings is a new approach to data.
New Approach to Data: Keep all data
A new approach is: we can keep all the data that we have, and we can take that data and analyze it in new, interesting ways, in a schema-on-read style.
This allows new analysis. We can bring more data into simple algorithms; it has been shown that, with more granularity, this often achieves better results than taking a small amount of data and running really complex analytics on it.
Apache Hadoop Framework
& its Basic Modules
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the entire cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in the cluster and using them in order to schedule users and applications.
Hadoop MapReduce: a programming model that scales data processing across many different processes.
The two major pieces of Hadoop are the Hadoop Distributed File System and MapReduce, a parallel processing framework that will map and reduce data. These are both open source and inspired by the technologies developed at Google.
If we talk about this high-level infrastructure, we start talking about things like TaskTrackers and JobTrackers, NameNodes and DataNodes.
HDFS
Hadoop Distributed File System
A Hadoop instance typically has a single NameNode, and a cluster of DataNodes that form the HDFS cluster.
HDFS stores large files, typically in the range of gigabytes to terabytes, and now petabytes, across multiple machines. It achieves reliability by replicating the data across multiple hosts, and therefore does not require RAID storage on hosts.
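As a sketch of interacting with HDFS programmatically, the third-party hdfs (hdfscli) Python package talks to the NameNode's WebHDFS interface; the URL, user name, and paths below are assumptions.

```python
from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (URL and user are assumptions).
client = InsecureClient("http://localhost:9870", user="hadoop")

# Write a file; HDFS splits it into blocks and replicates them across
# DataNodes according to the cluster's replication factor.
client.write("/user/hadoop/notes.txt", data=b"hello hdfs", overwrite=True)

print(client.list("/user/hadoop"))
with client.read("/user/hadoop/notes.txt") as reader:
    print(reader.read())
```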
MapReduce Engine
The typical MapReduce engine consists of a JobTracker, to which client applications can submit MapReduce jobs; the JobTracker pushes work out to all the available TaskTrackers in the cluster, striving to keep the work as close to the data as possible, and as balanced as possible.
Hadoop 2.0 provides a more general processing platform that is not constrained to map-and-reduce kinds of processes.
The fundamental idea behind MapReduce 2.0 is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into two separate units. The idea is to have a global ResourceManager and a per-application ApplicationMaster.
What is YARN?
YARN enhances the power of a Hadoop compute cluster, without being limited by the MapReduce kind of framework.
Its scalability is great. The processing power in data centers continues to grow quickly, and because the YARN ResourceManager focuses exclusively on scheduling, it can manage those very large clusters quite quickly and easily.
YARN is completely compatible with MapReduce. Existing MapReduce application end users can run on top of YARN without disrupting any of their existing processes.
It has improved cluster utilization as well. The ResourceManager is a pure scheduler: it optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and different SLAs or service-level agreements.
YARN allows multiple access engines, either open source or proprietary, to use Hadoop as a common standard for batch or interactive processing, and even real-time engines that can simultaneously access a lot of different data. So you can put streaming kinds of applications on top of YARN inside a Hadoop architecture, and seamlessly work and communicate between these environments.
Key features: Scalability, MapReduce Compatibility, Improved Cluster Utilization, Fairness, Supports Other Workloads (Iterative Modeling, Machine Learning), Multiple Access Engines.
The Hadoop “Zoo”
Google had their original MapReduce, and they were storing and processing large amounts of data.
They wanted to be able to access that data in a SQL-like language, so they built a SQL gateway to ingest data into the MapReduce cluster and be able to query some of that data as well.
Then, they realized they needed a high-level specific language to access MapReduce in the cluster and submit some of those jobs. So Sawzall came along.
Then, Evenflow came along and allowed chaining together complex workflows and coordinating events and services across this kind of framework, or the specific cluster they had at the time.
Original Google Stack
Then, Dremel came along. Dremel was a columnar storage and metadata manager that allows us to manage the data, and it is able to process a very large amount of unstructured data.
Then Chubby came along as a coordination system that would manage all of the products in this one unit, or ecosystem, that could process all these large amounts of structured data seamlessly.
Major Components
HBase:
Column-oriented database management system
Key-value store
Based on Google Big Table
Can hold extremely large data
Dynamic data model
Not a Relational DBMS
Pig:
Data analysis problems as data flows
Originally developed at Yahoo in 2006
A good example of a PIG application is the ETL transaction model, which describes how a process will extract data from a source, transform it according to a rule set that we specify, and then load it into a data store.
PIG can ingest data from files, streams, or any other sources using UDFs: user-defined functions that we can write ourselves.
When it has all the data, it can perform select, iterate, and other kinds of transformations.
Hive:
SQL-like language!
Facilitates querying and managing large datasets in HDFS
Mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
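To give the flavor of HiveQL, here is a sketch that submits a query from Python via the third-party PyHive client; the HiveServer2 host/port and the web_logs table are hypothetical.

```python
from pyhive import hive

# Connect to HiveServer2 (host and port are assumptions).
conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# HiveQL: SQL-like syntax projected over files stored in HDFS.
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)
```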
Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Coordinator jobs!
Supports MapReduce, Pig, Apache Hive, and Sqoop, etc.
Zookeeper: a centralized service for maintaining configuration information and naming services, and for providing distributed synchronization and group services.
Flume: a distributed service for efficiently collecting, aggregating, and moving large amounts of log data. It has tunable reliability mechanisms, failover, recovery, and all the other mechanisms that keep the cluster safe and reliable.
It uses a simple extensible data model that allows us to apply all kinds of online analytic applications.
Impala: Cloudera's query engine that runs on top of Apache Hadoop. It allows users to submit low-latency queries to data stored in HDFS or HBase without requiring a lot of data movement and manipulation.
Impala is integrated with Hadoop, and it works with the same file formats, metadata, security, and resource management frameworks.
It brings scalable parallel database technology on top of Hadoop. It allows us to submit SQL-like queries at much faster speeds, with a lot less latency.
In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark takes a multi-stage, in-memory approach. It is implemented in, and supports, the Scala language, and provides a unique environment for data processing.
Spark is really great for more complex kinds of analytics, and it's great at supporting machine learning libraries.
It is yet again another open-source computing framework, and it was originally developed at the AMPLab at the University of California, Berkeley; it was later donated to the Apache Software Foundation, where it remains today.
Spark is really well suited for machine learning kinds of applications, which oftentimes have iterative, in-memory kinds of computation.
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark clusters, or you can run Spark on top of Hadoop YARN, or via Apache Mesos.
For distributed storage, Spark can interface with a variety of storage systems, including HDFS and Amazon S3.
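A minimal PySpark sketch of the in-memory idea: one cached RDD is reused by several computations without re-reading from disk between stages, in contrast to a chain of disk-based MapReduce jobs (the dataset is generated for illustration).

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "spark-sketch")

# Load (here: generate) a dataset and keep it in memory across actions.
nums = sc.parallelize(range(1_000_000)).cache()

# Several computations reuse the same cached data.
print(nums.filter(lambda x: x % 2 == 0).count())
print(nums.map(lambda x: x * x).take(3))
print(nums.sum())

sc.stop()
```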