Chap5_BigDataComputingAndProcessing
Yijie Zhang
New Jersey Institute of Technology
1
Remind -- Apache Hadoop
The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
http://hadoop.apache.org
2
Remind -- Hadoop-related Apache Projects
4
Hadoop 1 execution
YARN
– A resource management tool that enables other parallel processing frameworks to run on Hadoop.
– A general-purpose resource management facility that can schedule and assign CPU
cycles and memory (and in the future, other resources, such as network bandwidth)
from the Hadoop cluster to waiting applications.
➔Starting from Hadoop 2, YARN has converted Hadoop from simply a batch
processing engine into a platform for many different modes of data
processing
• From traditional batch to interactive queries to streaming analysis.
7
Hadoop 2 Data Processing Architecture
8
YARN’s application execution
http://www.alex-hanna.com
10
MapReduce Data Flow (Hadoop 1)
http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
11
Spark
Enhance programmability:
• Integrated into the Scala programming language
• Allows interactive use from the Scala interpreter
13
Motivation
14
Motivation
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
[Figure: input flowing through Map and Reduce tasks to output]
15
Motivation
16
Solution:
Resilient Distributed Datasets (RDDs)
17
Programming Model
Two kinds of operations: transformations followed by actions
•Core structure: Resilient distributed datasets (RDDs)
• Immutable, partitioned collections of objects
• Created through parallel transformations (map,
filter, groupBy, join, …) on data in stable storage
• Can be cached for efficient reuse
• Perform various actions on RDDs
• Count, reduce, collect, save, …
Note that
• Before Spark 2.0, the main programming interface of Spark was
the Resilient Distributed Dataset (RDD)
• After Spark 2.0, RDDs are replaced by Datasets
• Strongly typed like an RDD, but with richer optimizations
• The RDD interface is still supported
18
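To make the transformation/action split on this slide concrete, here is a minimal sketch in Scala for the Spark shell; the SparkContext `sc`, the HDFS path, and the word-length filter are illustrative assumptions, not from the slides.

// Assumes the Spark shell, which provides a SparkContext as `sc`.
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")   // RDD backed by stable storage

// Transformations: lazily define new RDDs; nothing runs yet.
val words     = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 3)
val cached    = longWords.cache()              // mark for in-memory reuse across actions

// Actions: trigger execution and return results to the driver.
val total  = cached.count()                    // materializes (and caches) the RDD
val sample = cached.take(5)                    // reuses the cached data
println(s"total=$total, sample=${sample.mkString(", ")}")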
Example: Log Mining
Actions on the cached RDD:
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
[Figure: worker nodes, each holding a cached partition (Cache 2, Cache 3) of its data block (Block 2, Block 3)]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
19
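The fragments above come from the classic Spark log-mining example; a hedged reconstruction of the full driver code is sketched below in Scala (the SparkContext `sc`, the log path, and the tab-separated log format are assumptions).

// Load error messages from a log into memory, then query them interactively.
val lines      = sc.textFile("hdfs://namenode:9000/logs/app.log")
val errors     = lines.filter(_.startsWith("ERROR"))
val messages   = errors.map(_.split('\t')(2))   // keep the message field
val cachedMsgs = messages.cache()               // cached on first use

cachedMsgs.filter(_.contains("foo")).count()    // first action loads and caches the data
cachedMsgs.filter(_.contains("bar")).count()    // later actions are served from the cache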
RDD Fault Tolerance
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))
[Lineage diagram: filter (func = _.contains(...)) → map (func = _.split(...))]
20
Example: Logistic Regression
21
Example: Logistic Regression
Scala
var w = Vector.random(D)
println("Final w: " + w)
22
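The slide shows only fragments of the classic Spark logistic regression example ("var w = Vector.random(D)" and the final println). A hedged reconstruction is sketched below in Scala; the SparkContext `sc`, the input path, the CSV layout (label first, then D features), and the values of D and ITERATIONS are all assumptions.

import scala.math.exp
import scala.util.Random

case class Point(x: Array[Double], y: Double)

val D = 10
val ITERATIONS = 20

// Parse and cache the points once; they are reused on every iteration.
val points = sc.textFile("hdfs://namenode:9000/data/points.csv").map { line =>
  val v = line.split(",").map(_.toDouble)
  Point(v.tail, v.head)
}.cache()

var w = Array.fill(D)(2 * Random.nextDouble() - 1)   // random initial separating plane
for (_ <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    val margin = p.x.zip(w).map { case (xi, wi) => xi * wi }.sum
    val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
println("Final w: " + w.mkString(", "))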
Logistic Regression Performance
127 s / iteration
23
Spark Applications
24
Data Processing Frameworks Built on Spark
25
Implementation
• Runs on Apache Mesos to share resources with Hadoop & other apps
[Diagram: Spark, Hadoop, MPI, … running on top of Mesos]
26
Spark Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles
[DAG figure: RDDs A–G grouped into stages; Stage 1 ends in a groupBy, with later map and join operations feeding the final stage]
28
Spark Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey,
flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
29
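As a small, hedged illustration combining several of the operations listed above (flatMap, map, reduceByKey, and sortByKey as transformations; collect as an action), here is a word count in Scala for the Spark shell; `sc` and the input path are assumptions.

val counts = sc.textFile("hdfs://namenode:9000/data/books.txt")
  .flatMap(_.split("\\s+"))          // transformation: split lines into words
  .map(word => (word, 1))            // transformation: key-value pairs
  .reduceByKey(_ + _)                // transformation: per-word counts
  .sortByKey()                       // transformation: order alphabetically

counts.collect().foreach { case (word, n) => println(s"$word\t$n") }   // action: bring results to the driver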
Apache Tez
The Apache TEZ® project is aimed at building an application framework
which allows for a complex directed-acyclic-graph (DAG) of tasks for
processing data. It is currently built atop Apache Hadoop YARN.
30
Tez’s characteristics
31
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data that previously required multiple MR jobs in a single Tez job, as shown below.
32
Apache Storm
Stream Processing
-- On Hadoop, you run MapReduce jobs; On Storm, you run Topologies.
-- Two kinds of nodes on a Storm cluster:
-- the master node runs a daemon called “Nimbus”
-- each worker node runs a daemon called the “Supervisor”
33
How does Storm process data?
34
Storm’s Goals and Plans
35
Oozie Workflow Scheduler for Hadoop
• Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well
as jobs coming from Java programs and Shell scripts.
[Workflow diagram: start → firstJob (Pig) — OK → secondJob (MapReduce) — OK → end; an error in either job leads to kill]
Sample Oozie XML file
36
Action and Control Nodes
[Diagram: START → MapReduce action — OK → END; ERROR → KILL. START, END, and KILL are control nodes; MapReduce is an action.]

<workflow-app name="foo-wf" ...>
  <start to="[NODE-NAME]"/>
  <map-reduce>
    ...
  </map-reduce>
  <kill name="[NODE-NAME]">
    <message>Error occurred</message>
  </kill>
  <end name="[NODE-NAME]"/>
</workflow-app>

• Control Flow
  – start, end, kill
  – decision
  – fork, join
• Actions
  – MapReduce
  – Java
  – Pig
  – Hive
  – HDFS commands
• Workflows begin with the START node
• Workflows succeed with the END node
• Workflows fail with the KILL node
• Several actions support JSP Expression Language (EL)
• Oozie Coordination Engine can trigger workflows by
  – Time (Periodically)
  – Data Availability (Data appears in a directory)
37
Schedule Workflow by Time
38
Schedule Workflow by Time and Data Availability
39
Install Oozie
• $ mkdir <OOZIE_HOME>/libext
• Download ExtJS and place under
<OOZIE_HOME>/libext
– ext-2.2.zip
• Place Hadoop libs under libext
– $ cd <OOZIE_HOME>
– $ tar xvf oozie-hadooplibs-3.1.3-cdh4.0.0.tar.gz
– $ cp oozie-3.1.3-cdh4.0.0/hadooplibs/hadooplib-2.0.0-cdh4.0.0/*.jar libext/
• Configure Oozie with components under libext
– $ bin/oozie-setup.sh
40
Install Oozie
• Update core-site.xml so that the “hadoop” proxy user (which Oozie runs as) may impersonate other users and connect
from any host
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
<description>Allow the superuser oozie to impersonate any members of the group group1 and
group2</description>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
<description>The superuser can connect only from host1 and host2 to impersonate a
user</description>
</property>
41
Start Oozie
$ oozie-start.sh
42
Running Oozie Examples
43
An example workflow
[Workflow diagram: ERROR transitions lead to the KILL node; the successful path reaches END]
44
Workflow definition
45
Package and Run Your Workflow
1. Create application directory structure with
workflow definitions and resources
– workflow.xml, jars, etc.
2. Copy application directory to HDFS
3. Create application configuration file
– specify location of the application directory on HDFS
– specify location of the namenode and resource manager
4. Submit workflow to Oozie
– Utilize oozie command line
5. Monitor running workflow(s)
• Two options
– Command line ($oozie)
– Web Interface (http://localhost:11000/oozie)
Oozie application / workflow root directory:
mostSeenLetter-oozieWorkflow/
|-- lib/
|   |-- HadoopSamples.jar
|-- workflow.xml
http://stat-computing.org/dataexpo/
47
Airline On-time Performance Dataset
– Download it from:
– http://stat-computing.org/dataexpo/2009/
– https://dataverse.harvard.edu/dataset.xhtml?persist
entId=doi:10.7910/DVN/HG7NV7
48
Flight Data Schema
49
MapReduce Use Case Example – flight data
50
MapReduce Use Case Example – flight data
• Problem: count the number of flights for each carrier
• Solution using MapReduce (parallel way):
51
MapReduce steps for
flight data computation
52
FlightsByCarrier application
Create FlightsByCarrier.java:
53
FlightsByCarrier application
54
FlightsByCarrier Mapper
55
FlightsByCarrier Reducer
56
Run the code
57
See Result
58
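The MapReduce solution above requires a mapper, a reducer, and a driver class. Purely for comparison (this is not part of the original slides), a hedged Spark sketch in Scala of the same per-carrier count is shown below; it assumes the airline on-time CSV with UniqueCarrier as the 9th field (index 8) and a header row beginning with "Year".

val flights = sc.textFile("hdfs://namenode:9000/data/2008.csv")
  .filter(!_.startsWith("Year"))               // drop the header row

val flightsByCarrier = flights
  .map(line => (line.split(",")(8), 1))        // (carrier, 1)
  .reduceByKey(_ + _)                          // count flights per carrier

flightsByCarrier.collect().foreach { case (carrier, n) => println(s"$carrier\t$n") }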
Using Pig Script for fast application development
• Problem: calculate the total miles flown for all flights flown in one year
• How much work is needed using MapReduce?
• What if we use Pig?
• totalmiles.pig
61
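Again only as a hedged point of comparison, and not a reproduction of totalmiles.pig, here is the total-miles computation as a Spark sketch in Scala; it assumes Distance is the 19th CSV field (index 18) and may be "NA" for some rows.

val totalMiles = sc.textFile("hdfs://namenode:9000/data/2008.csv")
  .filter(!_.startsWith("Year"))               // drop the header row
  .map(_.split(",")(18))                       // Distance column
  .filter(d => d.nonEmpty && d != "NA")        // skip missing values
  .map(_.toLong)
  .sum()                                       // action: total miles as a Double

println(s"Total miles flown: ${totalMiles.toLong}")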
Characteristics of Pig
62
Pig vs. SQL
Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but not what join implementation to use. Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways.
On the other hand, it has been argued that DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.
63
Pig Data Types and Syntax
64
Pig Latin Operators
65
Pig Latin Expressions
66
Pig User-Defined Functions (UDFs)
https://pig.apache.org/docs/r0.7.0/udf.html
67
Apache Hive
68
Using Hive to Create a Table
69
Another Hive Example
70
Exercises
http://stat-computing.org/dataexpo/
71
Questions?
72