Week-1 Lecture Notes
Introduction to Big Data
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss a brief introduction to Big Data: why Big Data and where it came from, the challenges and applications of Big Data, and the characteristics of Big Data, i.e. Volume, Velocity, Variety, and more V's.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable
from analysis of a single large set of related data, as compared to separate
smaller sets with the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-time
roadway traffic conditions.”
Facts and Figures
Walmart handles 1 million customer transactions/hour.
Facebook handles 40 billion photos from its user base!
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
A flight generates 240 terabytes of flight data in 6-8 hours of flight.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second
largest number of rows in a unique database (1.9 trillion), which
comprises AT&T’s extensive calling records.
A scale of data sizes (rice analogy):
GB (10^9): 3 semi trucks of rice
TB (10^12): 2 container ships of rice (the Internet)
PB (10^15): blankets 1/2 of Jaipur
Exabyte (10^18): blankets the West coast, or 1/4th of India (Big Data)
Zettabyte (10^21): fills the Pacific Ocean (the Future)
Yottabyte (10^24): an earth-sized rice bowl
Brontobyte (10^27): astronomical size
What’s making so much data?
Sources: people, machines, organizations: ubiquitous computing.
More people carrying data-generating devices (mobile phones with Facebook, GPS, cameras, etc.)
Data on the Internet: Internet Live Stats
http://www.internetlivestats.com/
Examples of data sources and growth (figure):
? TBs of data every day
100s of millions of GPS-enabled devices sold annually
25+ TBs of log data every day
2+ billion people on the Web by end 2011
76 million smart meters in 2009… 200M by 2014
With traditional tools, determining whether a particular topic was trending would take so long that the result would be meaningless by the time it was computed.
Big Data came up with a solution: store this data in novel ways in order to make it more accessible, and also come up with methods of performing analysis on it.
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
Convert 350 billion annual meter readings to better predict power consumption.
Data Volume
44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB.
Data volume is increasing exponentially: an exponential increase in collected/generated data.
Example (EarthScope): the observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Velocity (Speed)
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
Scrutinize 5 million trade events created each day to identify potential fraud.
Analyze 500 million daily call detail records in real-time to predict customer churn faster.
Online Data Analytics
Late decisions ➔ missing opportunities
Examples
E-Promotions: Based on your current location, your purchase history, what you like ➔ send promotions right now for the store next to you.
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-time analytics for influencing customer behavior (figure):
Product Recommendations that are Relevant & Compelling
Learning why Customers Switch to competitors and their offers, in time to Counter
Improving the Marketing Effectiveness of a Promotion while it is still in Play
Friend Invitations to join a Game or Activity that expands business
Preventing Fraud as it is Occurring & preventing more proactively
Variety: data comes in many formats and structures.
Unstructured: text, sensor data, audio, video
Semi-structured: web data, log files
Graph data: Social Network, Semantic Web (RDF), …
Streaming Data
The 3 V's (Volume, Velocity, Variety), plus 1: Value.
More V's: Variability, Viscosity & Volatility, Viability, Venue, Vocabulary, Vagueness, …
Unify your data systems.
All three above will lead to increased data collaboration -> add value to your big data.
Veracity: 1 in 3 business leaders don't trust the information they use to make decisions.
How can you act upon information if you don't trust it?
Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
Examples: satellite imagery vs social media posts; prediction quality vs human impact.
Variability:
Language evolution
Data availability
Sampling processes
Changes in characteristics of the data source
Volatility: rate of data loss and stable lifetime of data.
Scientific data often has a practically unlimited lifespan, but social / business data may evaporate in finite time.
Venue
Where does the data live and how do you get it?
Vocabulary
Metadata describing structure, content, & provenance
Schemas, semantics, ontologies, taxonomies, vocabularies
Vagueness
Confusion about what “Big Data” means
Standardize by reducing Variety:
Format
Standards
Structure
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
- All types of data, and many sources
- Very large datasets
- More of real-time apps
Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps.
Conclusion
In this lecture, we discussed a brief introduction to Big Data: why Big Data, where it came from, and the challenges and applications of Big Data.
We have also described the characteristics of Big Data, i.e. Volume, Velocity, Variety and more V's, Big Data Analytics, the Big Data Landscape, and Big Data Technology.
Big Data Enabling Technologies
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss a brief introduction to
Big Data Enabling Technologies.
A recent survey says that 80% of the data created in the world is unstructured.
One challenge is how we can store and process this huge amount of data. In this lecture, we will discuss the top technologies used to store and analyse Big Data.
HDFS (Hadoop Distributed File System): the storage system of Hadoop, which splits big data and distributes it across many nodes in a cluster.
a. Scaling out of H/W resources
b. Fault tolerant
MapReduce: a programming model that simplifies parallel programming.
a. Map -> apply()
b. Reduce -> summarize()
c. Google used MapReduce for indexing websites.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
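To make the map/reduce contract concrete, here is a minimal sketch of the classic word-count job in plain, self-contained Python. It only simulates the model in a single process; real Hadoop jobs implement Mapper and Reducer classes against the Hadoop Java API, and the names map_fn, reduce_fn, and run_mapreduce are illustrative only.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle": group values by key
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

lines = enumerate(["big data is big", "data is everywhere"])
print(run_mapreduce(lines, map_fn, reduce_fn))
# -> {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```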
YARN (Yet Another Resource Negotiator) is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.
YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster, and for scheduling tasks to be executed on different cluster nodes.
Hive: provides an SQL-like language (HiveQL) to access big data.
Apache Spark has gained a lot of attraction both in academia and in industry.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
ZooKeeper: a coordination service for distributed applications.
Key attributes of such data:
Small size
Dynamic
Performance sensitive
Critical
https://zookeeper.apache.org/
In very simple words, it is a central store of key-value pairs, using which distributed systems can coordinate. Since it needs to be able to handle the load, Zookeeper itself runs on many machines.
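To illustrate the "central key-value store for coordination" idea, here is a minimal sketch using the third-party kazoo Python client; the ensemble address and znode paths are assumptions for the example.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is an assumption).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Coordination data is small, dynamic, and critical: here, a leader marker.
zk.ensure_path("/app")
if not zk.exists("/app/leader"):
    # An ephemeral znode disappears when this client's session ends.
    zk.create("/app/leader", b"node-1", ephemeral=True)

data, stat = zk.get("/app/leader")
print(data.decode(), "version:", stat.version)

zk.stop()
```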
NoSQL
While traditional SQL can be effectively used to handle large amounts of structured data, we need NoSQL (Not Only SQL) to handle unstructured data.
NoSQL databases store unstructured data with no particular schema.
Each row can have its own set of column values. NoSQL gives better performance in storing massive amounts of data.
Cassandra handles huge amounts of data with its distributed architecture.
Data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure.
In the original figure, circles are Cassandra nodes, lines between the circles show the distributed architecture, and the client sends data to a node.
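As a sketch of how a client talks to such a cluster, here is the DataStax cassandra-driver for Python; the contact point, keyspace, and table are hypothetical, and the replication factor of 3 mirrors the multi-replica point above.

```python
from cassandra.cluster import Cluster

# Connect to one or more Cassandra nodes (address is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A replication factor > 1 gives high availability, no single point of failure.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "Asha"))
print(session.execute("SELECT name FROM users WHERE id = 1").one().name)

cluster.shutdown()
```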
HBase: modeled on Google's Bigtable, it is primarily written in Java.
HBase can store massive amounts of data, from terabytes to petabytes.
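As a sketch, HBase's key-value, column-family model can be exercised from Python via the third-party happybase library, which talks to HBase's Thrift gateway; the host, the table name, and the assumption that a "cf" column family already exists are all illustrative.

```python
import happybase

# Connect through the HBase Thrift server (host is an assumption).
connection = happybase.Connection("localhost")
table = connection.table("sensor_readings")  # assumes table with family "cf"

# Rows are keyed; each row can carry its own set of columns in a family.
table.put(b"station-7#2024-01-01", {b"cf:temp": b"21.5", b"cf:unit": b"C"})
table.put(b"station-8#2024-01-01", {b"cf:temp": b"19.0"})  # different columns

print(table.row(b"station-7#2024-01-01"))
connection.close()
```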
Spark Streaming: streaming data input from HDFS, Kafka, Flume, TCP sockets, Kinesis, etc.
Spark ML (Machine Learning) functions and GraphX graph processing algorithms are fully applicable to streaming data.
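Here is a minimal PySpark sketch of a Spark Streaming job, counting words that arrive on a TCP socket in 1-second micro-batches; the host and port are placeholders (e.g. fed locally by `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# One of several possible inputs: a TCP socket (HDFS, Kafka, etc. also work).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```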
Apache Kafka is an open source distributed streaming platform capable of handling trillions of events a day. Kafka is based on the abstraction of a distributed commit log.
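As a sketch of the commit-log abstraction, here is a producer and consumer using the third-party kafka-python client; the broker address and the "events" topic are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish an event to a topic (broker address is an assumption).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user-42 clicked checkout")
producer.flush()

# Read the log back from the beginning; offsets reflect the commit log.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.offset, record.value)
    break
```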
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
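A small sketch with the DataFrame-based MLlib API (pyspark.ml), clustering four 2-D points with k-means; the data is made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Two obvious clusters of 2-D points.
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)
print(model.clusterCenters())  # roughly [0.5, 0.5] and [8.5, 8.5]

spark.stop()
```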
GraphX is Apache Spark's API for graphs and graph-parallel computation, built around a property-graph abstraction.
GraphX reuses the Spark RDD concept, simplifies graph analytics tasks, and provides the ability to perform operations on a directed multigraph with properties attached to each vertex and edge.
GraphX is a thin layer on top of the Spark general-purpose dataflow framework (lines of code).
Conclusion
In this lecture, we discussed the following Big Data enabling technologies:
YARN
NoSQL
Hive
MapReduce
Apache Spark
Zookeeper
Cassandra
HBase
Spark Streaming
Kafka
Spark MLlib
GraphX
Big Data Hadoop Stack
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss Hadoop, the enabling technology for Big Data.
We will also look into the Hadoop stack and the applications and technologies associated with Big Data solutions.
Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005 to support distribution for the Nutch Search Engine Project.
Doug, who was working at Yahoo at the time, and who is now chief architect at Cloudera, named this project after his son's toy elephant, Hadoop.
Hadoop started out as a simple batch processing framework.
We can distribute and scale across many nodes very easily, in a very cost-effective manner.
If we think about an individual machine or rack of machines, or a large cluster or supercomputer, they all fail at some point of time, or some of their components will fail. These failures are so common that we have to account for them ahead of time.
All of these failures are handled within the Hadoop framework. Apache's Hadoop MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System. Another very interesting thing that Hadoop brings is a new approach to data.
New Approach to Data: Keep all data
A new approach is: we can keep all the data that we have, and we can take that data and analyze it in new, interesting ways, in a schema-on-read style.
This allows new analysis. We can bring more data into simple algorithms; it has been shown that, with more granularity, this often achieves better results than taking a small amount of data and running really complex analytics on it.
Apache Hadoop Framework
& its Basic Modules
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the entire cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in the cluster and using them in order to schedule users and applications.
Hadoop MapReduce: a programming model that scales data processing across many different processes.
The two major pieces of Hadoop are the Hadoop Distributed File System and MapReduce, a parallel processing framework that will map and reduce data. These are both open source and inspired by the technologies developed at Google.
If we talk about this high-level infrastructure, we start talking about things like TaskTrackers and JobTrackers, NameNodes and DataNodes.
HDFS
Hadoop Distributed File System
A Hadoop instance typically has a single NameNode, and a cluster of DataNodes that form the HDFS cluster.
HDFS stores large files, typically in the range of gigabytes to terabytes, and now petabytes, across multiple machines. It achieves reliability by replicating the data across multiple hosts, and therefore does not require RAID storage on hosts.
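As a sketch of interacting with HDFS programmatically, the third-party hdfs (hdfscli) Python package talks to the NameNode's WebHDFS interface; the URL, user name, and paths below are assumptions.

```python
from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (URL and user are assumptions).
client = InsecureClient("http://localhost:9870", user="hadoop")

# Write a file; HDFS splits it into blocks and replicates them across
# DataNodes according to the cluster's replication factor.
client.write("/user/hadoop/notes.txt", data=b"hello hdfs", overwrite=True)

print(client.list("/user/hadoop"))
with client.read("/user/hadoop/notes.txt") as reader:
    print(reader.read())
```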
MapReduce Engine
The typical MapReduce engine consists of a JobTracker, to which client applications can submit MapReduce jobs; the JobTracker pushes work out to all the available TaskTrackers in the cluster, striving to keep the work as close to the data as possible, and as balanced as possible.
Hadoop 2.0 provides a more general processing platform that is not constrained to map-and-reduce kinds of processes.
The fundamental idea behind MapReduce 2.0 is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into two separate units. The idea is to have a global ResourceManager and a per-application ApplicationMaster.
What is YARN?
YARN enhances the power of a Hadoop compute cluster, without being limited by the MapReduce kind of framework.
Its scalability is great. The processing power in data centers continues to grow quickly, and because the YARN ResourceManager focuses exclusively on scheduling, it can manage those very large clusters quite quickly and easily.
YARN is completely compatible with MapReduce. Existing MapReduce application end users can run on top of YARN without disrupting any of their existing processes.
It has improved cluster utilization as well. The ResourceManager is a pure scheduler: it optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and different SLAs or service-level agreements.
YARN allows multiple access engines, either open source or proprietary, to use Hadoop as a common standard for batch or interactive processing, and even real-time engines that can simultaneously access a lot of different data. So you can put streaming kinds of applications on top of YARN inside a Hadoop architecture, and seamlessly work and communicate between these environments.
Key features: Scalability, MapReduce Compatibility, Improved Cluster Utilization, Fairness, Supports Other Workloads (Iterative Modeling, Machine Learning), Multiple Access Engines.
The Hadoop “Zoo”
Google had their original MapReduce, and they were storing and processing large amounts of data.
They wanted to be able to access that data in a SQL-like language, so they built a SQL gateway to ingest data into the MapReduce cluster and be able to query some of that data as well.
Then, they realized they needed a high-level specific language to access MapReduce in the cluster and submit some of those jobs. So Sawzall came along.
Then, Evenflow came along and allowed chaining together complex workflows and coordinating events and services across this kind of framework, or the specific cluster they had at the time.
Original Google Stack
Then, Dremel came along. Dremel was a columnar storage and metadata manager that allows us to manage the data, and it is able to process a very large amount of unstructured data.
Then Chubby came along as a coordination system that would manage all of the products in this one unit, or ecosystem, that could process all these large amounts of structured data seamlessly.
Major Components
HBase:
Column-oriented database management system
Key-value store
Based on Google Big Table
Can hold extremely large data
Dynamic data model
Not a Relational DBMS
Pig:
Data analysis problems as data flows
Originally developed at Yahoo in 2006
A good example of a PIG application is the ETL transaction model, which describes how a process will extract data from a source, transform it according to a rule set that we specify, and then load it into a data store.
PIG can ingest data from files, streams, or any other sources using UDFs: user-defined functions that we can write ourselves.
When it has all the data, it can perform select, iterate, and other kinds of transformations.
Hive:
SQL-like language!
Facilitates querying and managing large datasets in HDFS
Mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
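To give the flavor of HiveQL, here is a sketch that submits a query from Python via the third-party PyHive client; the HiveServer2 host/port and the web_logs table are hypothetical.

```python
from pyhive import hive

# Connect to HiveServer2 (host and port are assumptions).
conn = hive.Connection(host="localhost", port=10000)
cur = conn.cursor()

# HiveQL: SQL-like syntax projected over files stored in HDFS.
cur.execute("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status
""")
for status, hits in cur.fetchall():
    print(status, hits)
```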
Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Coordinator jobs!
Supports MapReduce, Pig, Apache Hive, and Sqoop, etc.
Zookeeper: a centralized service for maintaining configuration information and naming services, and for providing distributed synchronization and group services.
Flume: a distributed service for efficiently collecting, aggregating, and moving large amounts of log data. It has tunable reliability mechanisms, failover, recovery, and all the other mechanisms that keep the cluster safe and reliable.
It uses a simple extensible data model that allows us to apply all kinds of online analytic applications.
Impala: Cloudera's query engine that runs on top of Apache Hadoop. It allows users to submit low-latency queries to data stored in HDFS or HBase without requiring a lot of data movement and manipulation.
Impala is integrated with Hadoop, and it works with the same file formats, metadata, security, and resource management frameworks.
It brings scalable parallel database technology on top of Hadoop. It allows us to submit SQL-like queries at much faster speeds, with a lot less latency.
In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark takes a multi-stage, in-memory approach. It is implemented in, and supports, the Scala language, and provides a unique environment for data processing.
Spark is really great for more complex kinds of analytics, and it's great at supporting machine learning libraries.
It is yet again another open-source computing framework, and it was originally developed at the AMPLab at the University of California, Berkeley; it was later donated to the Apache Software Foundation, where it remains today.
Spark is really well suited for machine learning kinds of applications, which oftentimes have iterative, in-memory kinds of computation.
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark clusters, or you can run Spark on top of Hadoop YARN, or via Apache Mesos.
For distributed storage, Spark can interface with a variety of storage systems, including HDFS and Amazon S3.
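A minimal PySpark sketch of the in-memory idea: one cached RDD is reused by several computations without re-reading from disk between stages, in contrast to a chain of disk-based MapReduce jobs (the dataset is generated for illustration).

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "spark-sketch")

# Load (here: generate) a dataset and keep it in memory across actions.
nums = sc.parallelize(range(1_000_000)).cache()

# Several computations reuse the same cached data.
print(nums.filter(lambda x: x % 2 == 0).count())
print(nums.map(lambda x: x * x).take(3))
print(nums.sum())

sc.stop()
```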