
DS 644: Introduction to Big Data

Chapter 5. Big Data Computing and Processing

Yijie Zhang
New Jersey Institute of Technology

Some of the slides were provided courtesy of Dr. Ching-Yung Lin at Columbia University

1
Reminder -- Apache Hadoop

The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The project includes these modules:


• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

http://hadoop.apache.org
2
Reminder -- Hadoop-related Apache Projects

• Ambari : A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig, and Hive applications visually.
• Avro : A data serialization system.
• Cassandra : A scalable multi-master database with no single points of failure.
• Chukwa : A data collection system for managing large distributed systems.
• HBase : A scalable, distributed database that supports structured data storage
for large tables.
• Hive : A data warehouse infrastructure that provides data summarization and ad
hoc querying.
• Mahout : A scalable machine learning and data mining library.
• Pig : A high-level data-flow language and execution framework for parallel
computation.
• Spark : A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez : A generalized data-flow programming framework, built on Hadoop YARN,
which provides a powerful and flexible engine to execute an arbitrary DAG of
tasks to process data for both batch and interactive use-cases.
• ZooKeeper : A high-performance coordination service for distributed applications.
3
Four distinctive layers of Hadoop 1

4
Hadoop 1 execution

1. The client application submits an application request to the JobTracker.
2. The JobTracker determines how many processing resources are needed to execute the entire application.
3. The JobTracker looks at the state of the slave nodes and queues all the map tasks and reduce tasks for execution.
4. As processing slots become available on the slave nodes, map tasks are deployed to the slave nodes. Map tasks are assigned to nodes where the data they process is stored.
5. The TaskTracker monitors task progress. If a task fails, it is restarted on the next available slot.
6. After the map tasks are finished, reduce tasks process the interim result sets from the map tasks.
7. The result set is returned to the client application.
5
Limitation of Hadoop 1

• MapReduce is a successful batch-oriented programming model.

• A glass ceiling in terms of wider use:


– Exclusive tie to MapReduce, which means it could be used only for batch-style
workloads and for general-purpose analysis.

• Triggered demands for additional processing modes:


– Stream data processing (Storm)
– Message passing (MPI)
– Graph analysis

➔ Demand is growing for real-time and ad-hoc analysis
➔ Analysts ask many smaller questions against subsets of data and need a near-instant response.
➔ Some analysts are more used to SQL & relational databases

YARN was created to move beyond the limitations of a Hadoop 1 / MapReduce world.
6
YARN: Resource Management to Support Parallel Computing

• YARN – Yet Another Resource Negotiator

– A resource management tool that enables the other parallel processing frameworks to
run on Hadoop.

– A general-purpose resource management facility that can schedule and assign CPU
cycles and memory (and in the future, other resources, such as network bandwidth)
from the Hadoop cluster to waiting applications.

➔Starting from Hadoop 2, YARN has converted Hadoop from simply a batch
processing engine into a platform for many different modes of data
processing
• From traditional batch to interactive queries to streaming analysis.

7
Hadoop 2 Data Processing Architecture

8
YARN’s application execution

• The client submits an application to the Resource Manager.
• The Resource Manager asks a Node Manager to create an Application Master instance and start it up.
• The Application Master initializes itself and registers with the Resource Manager.
• The Application Master figures out how many resources are needed to execute the application.
• The Application Master then requests the necessary resources from the Resource Manager. It sends heartbeat messages to the Resource Manager throughout its lifetime.
• The Resource Manager accepts the request and queues it up.
• As the requested resources become available on the slave nodes, the Resource Manager grants the Application Master leases for containers on specific slave nodes.
• …. ➔ Applications only need to decide how much memory their tasks can have.
9
MapReduce WordCount revisit

http://www.alex-hanna.com
10
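Since the WordCount code itself is not reproduced on the slide above, here is a minimal sketch of the classic Hadoop WordCount mapper, reducer, and driver. It is a standard illustration rather than the exact code from the referenced source; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}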
MapReduce Data Flow (Hadoop 1)

http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
11
Spark

Fast, Interactive, Language-Integrated Cluster Computing

Download source release: www.spark-project.org
12
Spark Goals

Extend the MapReduce model to better support two common classes of analytics applications:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining (user query)

Enhance programmability:
• Integrate into Scala programming language
• Allow interactive use from Scala interpreter

13
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: input is read from stable storage, flows through map and reduce tasks, and is written back to stable storage as output]

14
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

[Diagram: the same acyclic map/reduce data flow as on the previous slide]

15
Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)

With current frameworks, applications must reload data from stable storage on each query, which is time consuming!

16
Solution:
Resilient Distributed Datasets (RDDs)

• Allow apps to keep working sets in memory for efficient reuse

• Retain the attractive properties of MapReduce
  • Fault tolerance, data locality, scalability

• Support a wide range of applications

17
Programming Model
Two stages: transformations followed by actions
•Core structure: Resilient distributed datasets (RDDs)
 • Immutable, partitioned collections of objects
 • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
 • Can be cached for efficient reuse
•Perform various actions on RDDs
 • Count, reduce, collect, save, …

Note that
• Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD)
• After Spark 2.0, RDDs were replaced by Datasets
 • Strongly typed like an RDD, but with richer optimizations
 • The RDD interface is still supported
18
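To make the note about Datasets concrete, the sketch below uses Spark's Java API to build a Dataset, apply a transformation, and then run an action. The file path and application name are illustrative assumptions, not taken from the slides.

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class DatasetExample {
  public static void main(String[] args) {
    // Since Spark 2.0, SparkSession is the entry point; Datasets are strongly typed like RDDs
    SparkSession spark = SparkSession.builder().appName("dataset-example").getOrCreate();

    // Transformations: lazily define the Datasets (nothing executes yet)
    Dataset<String> lines = spark.read().textFile("hdfs://.../logs.txt");   // path is illustrative
    Dataset<String> errors = lines.filter((FilterFunction<String>) s -> s.startsWith("ERROR"));

    // Action: triggers execution and returns a result to the driver
    long numErrors = errors.count();
    System.out.println("ERROR lines: " + numErrors);

    spark.stop();
  }
}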
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns (Scala):

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads a block of the file (Block 1-3), caches the filtered messages (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
19
RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

Lineage: HDFS File → [filter (func = _.contains(...))] → Filtered RDD → [map (func = _.split(...))] → Mapped RDD

20
Example: Logistic Regression

Goal: find the best line separating two sets of points

The found line can be used to classify new points.

[Diagram: scatter of two point classes with a random initial line and the target separating line]

21
Example: Logistic Regression

Scala

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

22
Logistic Regression Performance

[Chart: running time per iteration - 127 s / iteration for the Hadoop baseline; with Spark, the first iteration takes 174 s and further iterations take 6 s]

23
Spark Applications

In-memory data mining on Hive data (Conviva)


Predictive analytics (Quantifind)
City traffic prediction (Mobile Millennium)
Twitter spam classification (Monarch)
Collaborative filtering via matrix factorization

24
Data Processing Frameworks Built on Spark

Pregel on Spark (Bagel)
 • Google's message passing model for graph computation
 • 200 lines of code

Hive on Spark (Shark)
 • 3000 lines of code
 • Compatible with Apache Hive
 • ML operators in Scala

25
Implementation

• Runs on Apache Mesos to share resources with Hadoop & other apps
• Can read from any Hadoop input source (e.g. HDFS)
• No changes to the Scala compiler

[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across the cluster nodes]

26
Spark Scheduler

• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: example DAG of RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join); shaded boxes mark cached data partitions]


27
Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line.

Required two changes:
• Modified wrapper code generation so that each line typed has references to objects for its dependencies
• Distribute generated classes over the network

28
Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

A small example combining several of these operations is sketched below.
29
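As a rough illustration of chaining several of the transformations above and then triggering execution with an action, here is a small word count using Spark's Java RDD API (Spark 2.x assumed; the input path and application name are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddOperationsExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("rdd-operations-example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Transformations: each call defines a new RDD without executing anything yet
    JavaRDD<String> lines = sc.textFile("hdfs://.../input.txt");   // path is illustrative
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    // Action: collect() returns the results to the driver program
    List<Tuple2<String, Integer>> result = counts.collect();
    result.forEach(t -> System.out.println(t._1() + "\t" + t._2()));

    sc.stop();
  }
}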
Apache Tez
The Apache TEZ® project is aimed at building an application framework
which allows for a complex directed-acyclic-graph (DAG) of tasks for
processing data. It is currently built atop Apache Hadoop YARN.

30
Tez’s characteristics

• Dataflow graph with vertices to express, model, and execute data processing logic
• Performance via dynamic graph reconfiguration
• Flexible Input-Processor-Output task model
• Optimal resource management

31
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data that previously required multiple MapReduce jobs in a single Tez job, as shown below.

32
Apache Storm
Stream Processing
-- On Hadoop, you run MapReduce jobs; on Storm, you run topologies.
-- Two kinds of nodes on a Storm cluster:
   -- the master node runs a daemon called “Nimbus”
   -- each worker node runs a daemon called the “Supervisor”

33
How Storm processes data?

34
Storm’s Goals and Plans

35
Oozie Workflow Scheduler for Hadoop
• Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well as jobs coming from Java programs and shell scripts.

[Diagram: sample Oozie workflow defined in an XML file - start → firstJob → OK → secondJob → OK → end, with one Pig job and one MapReduce job; an error from either job leads to kill]
36
Action and Control Nodes

Sample workflow definition:

<workflow-app name="foo-wf" ...>
  <start to="[NODE-NAME]"/>
  <map-reduce>
    ...
  </map-reduce>
  <kill name="[NODE-NAME]">
    <message>Error occurred</message>
  </kill>
  <end name="[NODE-NAME]"/>
</workflow-app>

[Diagram: START → MapReduce action → OK → END; ERROR → KILL. The map-reduce element is an action node; start, end, and kill are control nodes.]

• Control Flow
  – start, end, kill
  – decision
  – fork, join
• Actions
  – MapReduce
  – Java
  – Pig
  – Hive
  – HDFS commands
• Workflows begin with the START node
• Workflows succeed with the END node
• Workflows fail with the KILL node
• Several actions support the JSP Expression Language (EL)
• The Oozie Coordination Engine can trigger workflows by
  – Time (periodically)
  – Data availability (data appears in a directory)

37
Schedule Workflow by Time

38
Schedule Workflow by Time and Data Availability

39
Install Oozie
• $ mkdir <OOZIE_HOME>/libext
• Download ExtJS and place under
<OOZIE_HOME>/libext
– ext-2.2.zip
• Place Hadoop libs under libext
– $ cd <OOZIE_HOME>
– $ tar xvf oozie-hadooplibs-3.1.3-cdh4.0.0.tar.gz
– $ cp oozie-3.1.3-cdh4.0.0/hadooplibs/hadooplib-2.0.0-cdh4.0.0/*.jar libext/
• Configure Oozie with components under libext
– $ bin/oozie-setup.sh

• Create environment variable for the default URL
  – export OOZIE_URL=http://localhost:11000/oozie
  – This allows you to use the $ oozie command without providing the URL
• Update oozie-site.xml to point to Hadoop configuration
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/hadoop/Training/CDH4/hadoop-2.0.0-cdh4.0.0/conf</value>
</property>

• Set up the Oozie database
  – $ ./bin/ooziedb.sh create -sqlfile oozie.sql -run

40
Install Oozie

•Update core-site.xml to allow Oozie to act as the “hadoop” proxy user and for that user to connect from any host
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
<description>Allow the superuser oozie to impersonate any members of the group group1 and
group2</description>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
<description>The superuser can connect only from host1 and host2 to impersonate a
user</description>
</property>

41
Start Oozie
$ oozie-start.sh

Setting OOZIE_HOME: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0


Setting OOZIE_CONFIG: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/conf
Sourcing: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/conf/oozie-env.sh
setting OOZIE_LOG=/home/hadoop/Training/logs/oozie
setting CATALINA_PID=/home/hadoop/Training/hadoop_work/pids/oozie.pid
Setting OOZIE_CONFIG_FILE: oozie-site.xml
Setting OOZIE_DATA: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/data Using
OOZIE_LOG: /home/hadoop/Training/logs/oozie
Setting OOZIE_LOG4J_FILE: oozie-log4j.properties Setting OOZIE_LOG4J_RELOAD:
10
Setting OOZIE_HTTP_HOSTNAME: localhost Setting OOZIE_HTTP_PORT: 11000
Setting OOZIE_ADMIN_PORT: 11001
...
...
...
$ oozie admin -status http://localhost:11000/oozie/
System mode: NORMAL

42
Running Oozie Examples

• Extract examples packaged with Oozie
  – $ cd $OOZIE_HOME
  – $ tar xvf oozie-examples.tar.gz
• Copy examples to HDFS from the user’s home directory
  – $ hdfs dfs -put examples examples
• Run an example
  – $ oozie job -config examples/apps/map-reduce/job.properties -run
• Check the web console
  – http://localhost:11000/oozie/

43
An example workflow

[Diagram: START → Count Each Letter (MapReduce action) → OK → Find Max Letter (MapReduce action) → OK → END; an ERROR from either action leads to KILL; a Clean Up step is also shown. Legend: the MapReduce boxes are action nodes, the OK/ERROR arrows are control flow nodes, and START/END/KILL are control nodes.]

This source is in the HadoopSamples project under /src/main/resources/mr/workflows

44
Workflow definition

<workflow-app xmlns="uri:oozie:workflow:0.2" name="most-seen-letter">
  <start to="count-each-letter"/>                  <!-- START points to the count-each-letter action node -->
  <action name="count-each-letter">                <!-- MapReduce action -->
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>                                    <!-- MapReduce actions have an optional prepare section -->
        <delete path="${nameNode}${outputDir}"/>
        <delete path="${nameNode}${intermediateDir}"/>
      </prepare>
      <configuration>
        ...
        <property>                                 <!-- property passed to the MapReduce job's Configuration object -->
          <name>mapreduce.job.map.class</name>
          <value>mr.wordcount.StartsWithCountMapper</value>
        </property>
        ...
      </configuration>
    </map-reduce>
    <ok to="find-max-letter"/>                     <!-- in case of success, go to the next job -->
    <error to="fail"/>                             <!-- in case of failure, go to the fail node -->
  </action>
  ...

45
Package and Run Your Workflow
1. Create application directory structure with
workflow definitions and resources
– Workflow.xml, jars, etc..
2. Copy application directory to HDFS
3. Create application configuration file
– specify location of the application directory on HDFS
– specify location of the namenode and resource manager
4. Submit workflow to Oozie
– Utilize oozie command line
5. Monitor running workflow(s)
   • Two options
      – Command line ($ oozie)
      – Web interface (http://localhost:11000/oozie)

Oozie application directory structure (the application/workflow root directory must comply with the directory structure spec):

mostSeenLetter-oozieWorkflow
|-- lib/
|   |-- HadoopSamples.jar
|-- workflow.xml

Libraries should be placed under the lib directory; workflow.xml defines the workflow.
46
Public datasets available from Statistical Computing

http://stat-computing.org/dataexpo/

47
Airline On-time Performance Dataset

• Data Source: Airline On-time Performance data set (flight data set).
  – All the logs of domestic flights from the period of October 1987 to April 2008.
  – Each record represents an individual flight where various details are captured:
    • Time and date of arrival and departure
    • Originating and destination airports
    • Amount of time taken to taxi from the runway to the gate
  – Download it from:
    – http://stat-computing.org/dataexpo/2009/
    – https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7

48
Flight Data Schema

49
MapReduce Use Case Example – flight data

• Problem: count the number of flights for each carrier

• Solution using a serial approach (not MapReduce):

50
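The serial solution appears as a screenshot in the original slides; the following is only a sketch of what such a serial counter might look like, assuming comma-separated flight records with the carrier code at field index 8 (an assumption, not taken from the slides):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SerialFlightsByCarrier {
  public static void main(String[] args) throws Exception {
    Map<String, Integer> countsByCarrier = new HashMap<>();

    // Read the flight records one line at a time and tally carriers in memory
    try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        String carrier = fields[8];   // assumed position of the carrier code
        countsByCarrier.merge(carrier, 1, Integer::sum);
      }
    }

    countsByCarrier.forEach((carrier, count) -> System.out.println(carrier + "\t" + count));
  }
}

This works on a single machine, but every record passes through one process, which is exactly the bottleneck the MapReduce version on the next slide avoids.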
MapReduce Use Case Example – flight data
• Problem: count the number of flights for each carrier
• Solution using MapReduce (parallel way):

51
MapReduce steps for
flight data computation

52
FlightsByCarrier application
Create FlightsByCarrier.java:

53
FlightsByCarrier application

54
FlightsByCarrier Mapper

55
FlightsByCarrier Reducer

56
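The FlightsByCarrier driver, mapper, and reducer are shown as screenshots in the original slides; the sketch below is a plausible reconstruction rather than the actual course code, and it assumes the carrier code sits at field index 8 of each comma-separated record:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlightsByCarrier {

  // Mapper: emit (carrier code, 1) for every flight record
  public static class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text carrier = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      carrier.set(fields[8]);            // assumed position of the carrier code
      context.write(carrier, ONE);
    }
  }

  // Reducer: sum the flight counts for each carrier
  public static class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text carrier, Iterable<IntWritable> counts, Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(carrier, new IntWritable(total));
    }
  }

  // Driver: wires the mapper and reducer into a job; input and output paths come from the command line
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "FlightsByCarrier");
    job.setJarByClass(FlightsByCarrier.class);
    job.setMapperClass(FlightsByCarrierMapper.class);
    job.setReducerClass(FlightsByCarrierReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}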
Run the code

57
See Result

58
Using Pig Script for fast application development

• Problem: calculate the total miles flown for all flights flown in one year
• How much work is needed using MapReduce?
• What if we use Pig?
• totalmiles.pig

• Execute it: pig totalmiles.pig
• See the result: hdfs dfs -cat /user/root/totalmiles/part-r-00000
  ➔ 775009272
59
Pig: a data flow language

Data flow language
•Define a data stream
 • Typically using “LOAD …”
•Apply a series of transformations to the data
 • FILTER, GROUP, COUNT, DUMP, etc.
60
Pig example for WordCount

61
Characteristics of Pig

62
Pig vs. SQL

In comparison to SQL, Pig:
1. uses lazy evaluation,
2. uses ETL (Extract-Transform-Load),
3. is able to store data at any point during a pipeline,
4. declares execution plans,
5. supports pipeline splits.

On the other hand, it has been argued DBMSs are substantially faster than the MapReduce
system once the data is loaded, but that loading the data takes considerably longer in the
database systems. It has also been argued RDBMSs offer out of the box support for
column-storage, working with compressed data, indexes for efficient random data
access, and transaction-level fault tolerance.

Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead
declarative. In SQL users can specify that data from two tables must be joined, but not
what join implementation to use. Pig Latin allows users to specify an implementation or
aspects of an implementation to be used in executing a script in several ways.

Pig Latin programming is similar to specifying a query execution plan.

63
Pig Data Types and Syntax

Atom: An atom is any single value, such as a string or number


Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type.
Bag: A bag is a collection of non-unique tuples.
Map: A map is a collection of key-value pairs.

64
Pig Latin Operators

65
Pig Latin Expressions

66
Pig User-Defined Functions (UDFs)

[Screenshots: the command to run the script, and the .java file that contains the UPPER UDF]

https://pig.apache.org/docs/r0.7.0/udf.html
67
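For reference, a UDF along the lines of the UPPER example in the linked documentation looks roughly like this (a sketch assuming Pig's EvalFunc API; the package name is illustrative):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple Pig UDF that upper-cases its single string argument
public class UPPER extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw new IOException("Caught exception processing input row", e);
    }
  }
}

In a Pig script, such a UDF would typically be registered with REGISTER myudfs.jar; and invoked as myudfs.UPPER(field) — the jar and package names here are illustrative.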
Apache Hive

68
Using Hive to Create a Table

69
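The Hive table creation is shown as a screenshot in the original slides; as a hedged alternative illustration, the sketch below issues a CREATE TABLE statement through Hive's JDBC interface (the connection URL, credentials, and table schema are illustrative assumptions, not the example from the slide):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCreateTableExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver and connect (URL and credentials are assumptions)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hadoop", "");
    try (Statement stmt = conn.createStatement()) {
      // Create a table over comma-delimited flight records (schema is illustrative)
      stmt.execute("CREATE TABLE IF NOT EXISTS flights (" +
          "flight_year INT, flight_month INT, day_of_month INT, carrier STRING, distance INT) " +
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
    } finally {
      conn.close();
    }
  }
}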
Another Hive Example

70
Exercises

1. Download the Airline Data and one of your own selected datasets from Stat-Computing.org
3. Learn to use Pig. You can try the example in the reference
4. Use Oozie to schedule a few jobs
5. Try HBase. Use your own example
6. Try Hive. Use your own example

http://stat-computing.org/dataexpo/

71
Questions?

72
