
DS 644: Introduction to Big Data

Chapter 5. Big Data Computing and Processing

Yijie Zhang
New Jersey Institute of Technology

Some of the slides were provided courtesy of Dr. Ching-Yung Lin at Columbia University

1
Reminder -- Apache Hadoop

The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The project includes these modules:


• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

http://hadoop.apache.org
2
Reminder -- Hadoop-related Apache Projects

• Ambari : A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig, and Hive applications visually.
• Avro : A data serialization system.
• Cassandra : A scalable multi-master database with no single points of failure.
• Chukwa : A data collection system for managing large distributed systems.
• HBase : A scalable, distributed database that supports structured data storage
for large tables.
• Hive : A data warehouse infrastructure that provides data summarization and ad
hoc querying.
• Mahout : A scalable machine learning and data mining library.
• Pig : A high-level data-flow language and execution framework for parallel
computation.
• Spark : A fast and general compute engine for Hadoop data. Spark provides a
simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph
computation.
• Tez : A generalized data-flow programming framework, built on Hadoop YARN,
which provides a powerful and flexible engine to execute an arbitrary DAG of
tasks to process data for both batch and interactive use-cases.
• ZooKeeper : A high-performance coordination service for distributed applications.
3
Four distinctive layers of Hadoop 1

4
Hadoop 1 execution

1. The client application submits an application request to the JobTracker.
2. The JobTracker determines how many processing resources are needed to execute the entire application.
3. The JobTracker looks at the state of the slave nodes and queues all the map tasks and reduce tasks for execution.
4. As processing slots become available on the slave nodes, map tasks are deployed to the slave nodes. Map tasks are assigned to nodes where the data they process is stored.
5. The TaskTracker monitors task progress. If a task fails, it is restarted on the next available slot.
6. After the map tasks are finished, reduce tasks process the interim result sets from the map tasks.
7. The result set is returned to the client application.
5
Limitation of Hadoop 1

• MapReduce is a successful batch-oriented programming model.

• A glass ceiling in terms of wider use:


– Exclusive tie to MapReduce, which means it could be used only for batch-style
workloads and for general-purpose analysis.

• Triggered demands for additional processing modes:


– Stream data processing (Storm)
– Message passing (MPI)
– Graph analysis

➔ Demand is growing for real-time and ad-hoc analysis
➔ Analysts ask many smaller questions against subsets of data and need a near-instant response.
➔ Some analysts are more used to SQL & relational databases

YARN was created to move beyond the limitations of a Hadoop 1 / MapReduce world.
6
YARN: Resource Management to Support Parallel Computing

• YARN – Yet Another Resource Negotiator

– A resource management tool that enables the other parallel processing frameworks to
run on Hadoop.

– A general-purpose resource management facility that can schedule and assign CPU
cycles and memory (and in the future, other resources, such as network bandwidth)
from the Hadoop cluster to waiting applications.

➔Starting from Hadoop 2, YARN has converted Hadoop from simply a batch
processing engine into a platform for many different modes of data
processing
• From traditional batch to interactive queries to streaming analysis.

7
Hadoop 2 Data Processing Architecture

8
YARN’s application execution

• The client submits an application to the Resource Manager.
• The Resource Manager asks a Node Manager to create an Application Master instance and start it up.
• The Application Master initializes itself and registers with the Resource Manager.
• The Application Master figures out how many resources are needed to execute the application.
• The Application Master then requests the necessary resources from the Resource Manager. It sends heartbeat messages to the Resource Manager throughout its lifetime.
• The Resource Manager accepts the request and queues it up.
• As the requested resources become available on the slave nodes, the Resource Manager grants the Application Master leases for containers on specific slave nodes.
• …. ➔ Applications only need to decide how much memory their tasks can have.
9
MapReduce WordCount revisit

http://www.alex-hanna.com
10
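Since the WordCount code itself is not reproduced on the slide above, here is a minimal sketch of the classic Hadoop WordCount mapper, reducer, and driver. It is a standard illustration rather than the exact code from the referenced source; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}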
MapReduce Data Flow (Hadoop 1)

http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
11
Spark

Fast, Interactive, Language-Integrated Cluster Computing

Download source release: www.spark-project.org
12
Spark Goals

Extend the MapReduce model to better support two common classes of analytics applications:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining (user query)

Enhance programmability:
• Integrate into Scala programming language
• Allow interactive use from Scala interpreter

13
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: input is read from stable storage, flows through map and reduce tasks, and is written back to stable storage as output]

14
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

[Diagram: the same acyclic map/reduce data flow as on the previous slide]

15
Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)

With current frameworks, applications must reload data from stable storage on each query, which is time consuming!

16
Solution:
Resilient Distributed Datasets (RDDs)

• Allow apps to keep working sets in memory for efficient reuse

• Retain the attractive properties of MapReduce
  • Fault tolerance, data locality, scalability

• Support a wide range of applications

17
Programming Model
Two stages: transformations followed by actions
•Core structure: Resilient distributed datasets (RDDs)
 • Immutable, partitioned collections of objects
 • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
 • Can be cached for efficient reuse
•Perform various actions on RDDs
 • Count, reduce, collect, save, …

Note that
• Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD)
• After Spark 2.0, RDDs were replaced by Datasets
 • Strongly typed like an RDD, but with richer optimizations
 • The RDD interface is still supported
18
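To make the note about Datasets concrete, the sketch below uses Spark's Java API to build a Dataset, apply a transformation, and then run an action. The file path and application name are illustrative assumptions, not taken from the slides.

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class DatasetExample {
  public static void main(String[] args) {
    // Since Spark 2.0, SparkSession is the entry point; Datasets are strongly typed like RDDs
    SparkSession spark = SparkSession.builder().appName("dataset-example").getOrCreate();

    // Transformations: lazily define the Datasets (nothing executes yet)
    Dataset<String> lines = spark.read().textFile("hdfs://.../logs.txt");   // path is illustrative
    Dataset<String> errors = lines.filter((FilterFunction<String>) s -> s.startsWith("ERROR"));

    // Action: triggers execution and returns a result to the driver
    long numErrors = errors.count();
    System.out.println("ERROR lines: " + numErrors);

    spark.stop();
  }
}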
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns (Scala):

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads a block of the file (Block 1-3), caches the filtered messages (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
19
RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

Lineage: HDFS File → [filter (func = _.contains(...))] → Filtered RDD → [map (func = _.split(...))] → Mapped RDD

20
Example: Logistic Regression

Goal: find the best line separating two sets of points

The found line can be used to classify new points.

[Diagram: scatter of two point classes with a random initial line and the target separating line]

21
Example: Logistic Regression

Scala

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

22
Logistic Regression Performance

[Chart: running time per iteration - 127 s / iteration for the Hadoop baseline; with Spark, the first iteration takes 174 s and further iterations take 6 s]

23
Spark Applications

In-memory data mining on Hive data (Conviva)


Predictive analytics (Quantifind)
City traffic prediction (Mobile Millennium)
Twitter spam classification (Monarch)
Collaborative filtering via matrix factorization

24
Data Processing Frameworks Built on Spark

Pregel on Spark (Bagel)
 • Google's message passing model for graph computation
 • 200 lines of code

Hive on Spark (Shark)
 • 3000 lines of code
 • Compatible with Apache Hive
 • ML operators in Scala

25
Implementation

• Runs on Apache Mesos to share resources with Hadoop & other apps
• Can read from any Hadoop input source (e.g. HDFS)
• No changes to the Scala compiler

[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across the cluster nodes]

26
Spark Scheduler

• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: example DAG of RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join); shaded boxes mark cached data partitions]


27
Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line.

Required two changes:
• Modified wrapper code generation so that each line typed has references to objects for its dependencies
• Distribute generated classes over the network

28
Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program): collect, reduce, count, save, lookupKey

A small example combining several of these operations is sketched below.
29
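As a rough illustration of chaining several of the transformations above and then triggering execution with an action, here is a small word count using Spark's Java RDD API (Spark 2.x assumed; the input path and application name are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddOperationsExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("rdd-operations-example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Transformations: each call defines a new RDD without executing anything yet
    JavaRDD<String> lines = sc.textFile("hdfs://.../input.txt");   // path is illustrative
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    // Action: collect() returns the results to the driver program
    List<Tuple2<String, Integer>> result = counts.collect();
    result.forEach(t -> System.out.println(t._1() + "\t" + t._2()));

    sc.stop();
  }
}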
Apache Tez
The Apache TEZ® project is aimed at building an application framework
which allows for a complex directed-acyclic-graph (DAG) of tasks for
processing data. It is currently built atop Apache Hadoop YARN.

30
Tez’s characteristics

• Dataflow graph with vertices to express, model, and execute data processing logic
• Performance via dynamic graph reconfiguration
• Flexible Input-Processor-Output task model
• Optimal resource management

31
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data that previously required multiple MapReduce jobs in a single Tez job, as shown below.

32
Apache Storm
Stream Processing
-- On Hadoop, you run MapReduce jobs; on Storm, you run topologies.
-- Two kinds of nodes on a Storm cluster:
   -- the master node runs a daemon called “Nimbus”
   -- each worker node runs a daemon called the “Supervisor”

33
How Storm processes data?

34
Storm’s Goals and Plans

35
Oozie Workflow Scheduler for Hadoop
• Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well as jobs coming from Java programs and shell scripts.

[Diagram: sample Oozie workflow defined in an XML file - start → firstJob → OK → secondJob → OK → end, with one Pig job and one MapReduce job; an error from either job leads to kill]
36
Action and Control Nodes

Sample workflow definition:

<workflow-app name="foo-wf" ...>
  <start to="[NODE-NAME]"/>
  <map-reduce>
    ...
  </map-reduce>
  <kill name="[NODE-NAME]">
    <message>Error occurred</message>
  </kill>
  <end name="[NODE-NAME]"/>
</workflow-app>

[Diagram: START → MapReduce action → OK → END; ERROR → KILL. The map-reduce element is an action node; start, end, and kill are control nodes.]

• Control Flow
  – start, end, kill
  – decision
  – fork, join
• Actions
  – MapReduce
  – Java
  – Pig
  – Hive
  – HDFS commands
• Workflows begin with the START node
• Workflows succeed with the END node
• Workflows fail with the KILL node
• Several actions support the JSP Expression Language (EL)
• The Oozie Coordination Engine can trigger workflows by
  – Time (periodically)
  – Data availability (data appears in a directory)

37
Schedule Workflow by Time

38
Schedule Workflow by Time and Data Availability

39
Install Oozie
• $ mkdir <OOZIE_HOME>/libext
• Download ExtJS and place under
<OOZIE_HOME>/libext
– ext-2.2.zip
• Place Hadoop libs under libext
– $ cd <OOZIE_HOME>
– $ tar xvf oozie-hadooplibs-3.1.3-cdh4.0.0.tar.gz
– $ cp oozie-3.1.3-cdh4.0.0/hadooplibs/hadooplib-2.0.0-cdh4.0.0/*.jar libext/
• Configure Oozie with components under libext
– $ bin/oozie-setup.sh

• Create environment variable for the default URL
  – export OOZIE_URL=http://localhost:11000/oozie
  – This allows you to use the $ oozie command without providing the URL
• Update oozie-site.xml to point to Hadoop configuration
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/hadoop/Training/CDH4/hadoop-2.0.0-cdh4.0.0/conf</value>
</property>

• Set up the Oozie database
  – $ ./bin/ooziedb.sh create -sqlfile oozie.sql -run

40
Install Oozie

•Update core-site.xml to allow Oozie to act as the “hadoop” proxy user and for that user to connect from any host
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
<description>Allow the superuser oozie to impersonate any members of the group group1 and
group2</description>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
<description>The superuser can connect only from host1 and host2 to impersonate a
user</description>
</property>

41
Start Oozie
$ oozie-start.sh

Setting OOZIE_HOME: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0


Setting OOZIE_CONFIG: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/conf
Sourcing: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/conf/oozie-env.sh
setting OOZIE_LOG=/home/hadoop/Training/logs/oozie
setting CATALINA_PID=/home/hadoop/Training/hadoop_work/pids/oozie.pid
Setting OOZIE_CONFIG_FILE: oozie-site.xml
Setting OOZIE_DATA: /home/hadoop/Training/CDH4/oozie-3.1.3-cdh4.0.0/data Using
OOZIE_LOG: /home/hadoop/Training/logs/oozie
Setting OOZIE_LOG4J_FILE: oozie-log4j.properties Setting OOZIE_LOG4J_RELOAD:
10
Setting OOZIE_HTTP_HOSTNAME: localhost Setting OOZIE_HTTP_PORT: 11000
Setting OOZIE_ADMIN_PORT: 11001
...
...
...
$ oozie admin -status http://localhost:11000/oozie/
System mode: NORMAL

42
Running Oozie Examples

• Extract examples packaged with Oozie
  – $ cd $OOZIE_HOME
  – $ tar xvf oozie-examples.tar.gz
• Copy examples to HDFS from the user’s home directory
  – $ hdfs dfs -put examples examples
• Run an example
  – $ oozie job -config examples/apps/map-reduce/job.properties -run
• Check the web console
  – http://localhost:11000/oozie/

43
An example workflow

[Diagram: START → Count Each Letter (MapReduce action) → OK → Find Max Letter (MapReduce action) → OK → END; an ERROR from either action leads to KILL; a Clean Up step is also shown. Legend: the MapReduce boxes are action nodes, the OK/ERROR arrows are control flow nodes, and START/END/KILL are control nodes.]

This source is in the HadoopSamples project under /src/main/resources/mr/workflows

44
Workflow definition

<workflow-app xmlns="uri:oozie:workflow:0.2" name="most-seen-letter">
  <start to="count-each-letter"/>                  <!-- START points to the count-each-letter action node -->
  <action name="count-each-letter">                <!-- MapReduce action -->
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>                                    <!-- MapReduce actions have an optional prepare section -->
        <delete path="${nameNode}${outputDir}"/>
        <delete path="${nameNode}${intermediateDir}"/>
      </prepare>
      <configuration>
        ...
        <property>                                 <!-- property passed to the MapReduce job's Configuration object -->
          <name>mapreduce.job.map.class</name>
          <value>mr.wordcount.StartsWithCountMapper</value>
        </property>
        ...
      </configuration>
    </map-reduce>
    <ok to="find-max-letter"/>                     <!-- in case of success, go to the next job -->
    <error to="fail"/>                             <!-- in case of failure, go to the fail node -->
  </action>
  ...

45
Package and Run Your Workflow
1. Create application directory structure with
workflow definitions and resources
– Workflow.xml, jars, etc..
2. Copy application directory to HDFS
3. Create application configuration file
– specify location of the application directory on HDFS
– specify location of the namenode and resource manager
4. Submit workflow to Oozie
– Utilize oozie command line
5. Monitor running workflow(s)
   • Two options
      – Command line ($ oozie)
      – Web interface (http://localhost:11000/oozie)

Oozie application directory structure (the application/workflow root directory must comply with the directory structure spec):

mostSeenLetter-oozieWorkflow
|-- lib/
|   |-- HadoopSamples.jar
|-- workflow.xml

Libraries should be placed under the lib directory; workflow.xml defines the workflow.
46
Public datasets available from Statistical Computing

http://stat-computing.org/dataexpo/

47
Airline On-time Performance Dataset

• Data Source: Airline On-time Performance data set (flight data set).
  – All the logs of domestic flights from the period of October 1987 to April 2008.
  – Each record represents an individual flight where various details are captured:
    • Time and date of arrival and departure
    • Originating and destination airports
    • Amount of time taken to taxi from the runway to the gate
  – Download it from:
    – http://stat-computing.org/dataexpo/2009/
    – https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7

48
Flight Data Schema

49
MapReduce Use Case Example – flight data

• Problem: count the number of flights for each carrier

• Solution using a serial approach (not MapReduce):

50
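The serial solution appears as a screenshot in the original slides; the following is only a sketch of what such a serial counter might look like, assuming comma-separated flight records with the carrier code at field index 8 (an assumption, not taken from the slides):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SerialFlightsByCarrier {
  public static void main(String[] args) throws Exception {
    Map<String, Integer> countsByCarrier = new HashMap<>();

    // Read the flight records one line at a time and tally carriers in memory
    try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        String carrier = fields[8];   // assumed position of the carrier code
        countsByCarrier.merge(carrier, 1, Integer::sum);
      }
    }

    countsByCarrier.forEach((carrier, count) -> System.out.println(carrier + "\t" + count));
  }
}

This works on a single machine, but every record passes through one process, which is exactly the bottleneck the MapReduce version on the next slide avoids.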
MapReduce Use Case Example – flight data
• Problem: count the number of flights for each carrier
• Solution using MapReduce (parallel way):

51
MapReduce steps for
flight data computation

52
FlightsByCarrier application
Create FlightsByCarrier.java:

53
FlightsByCarrier application

54
FlightsByCarrier Mapper

55
FlightsByCarrier Reducer

56
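The FlightsByCarrier driver, mapper, and reducer are shown as screenshots in the original slides; the sketch below is a plausible reconstruction rather than the actual course code, and it assumes the carrier code sits at field index 8 of each comma-separated record:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlightsByCarrier {

  // Mapper: emit (carrier code, 1) for every flight record
  public static class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text carrier = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      carrier.set(fields[8]);            // assumed position of the carrier code
      context.write(carrier, ONE);
    }
  }

  // Reducer: sum the flight counts for each carrier
  public static class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text carrier, Iterable<IntWritable> counts, Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(carrier, new IntWritable(total));
    }
  }

  // Driver: wires the mapper and reducer into a job; input and output paths come from the command line
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "FlightsByCarrier");
    job.setJarByClass(FlightsByCarrier.class);
    job.setMapperClass(FlightsByCarrierMapper.class);
    job.setReducerClass(FlightsByCarrierReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}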
Run the code

57
See Result

58
Using Pig Script for fast application development

• Problem: calculate the total miles flown for all flights flown in one year
• How much work is needed using MapReduce?
• What if we use Pig?
• totalmiles.pig

• Execute it: pig totalmiles.pig
• See the result: hdfs dfs -cat /user/root/totalmiles/part-r-00000
  ➔ 775009272
59
Pig: a data flow language

Data flow language
•Define a data stream
 • Typically using “LOAD …”
•Apply a series of transformations to the data
 • FILTER, GROUP, COUNT, DUMP, etc.
60
Pig example for WordCount

61
Characteristics of Pig

62
Pig vs. SQL

In comparison to SQL, Pig:
1. uses lazy evaluation,
2. uses ETL (Extract-Transform-Load),
3. is able to store data at any point during a pipeline,
4. declares execution plans,
5. supports pipeline splits.

On the other hand, it has been argued DBMSs are substantially faster than the MapReduce
system once the data is loaded, but that loading the data takes considerably longer in the
database systems. It has also been argued RDBMSs offer out of the box support for
column-storage, working with compressed data, indexes for efficient random data
access, and transaction-level fault tolerance.

Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead
declarative. In SQL users can specify that data from two tables must be joined, but not
what join implementation to use. Pig Latin allows users to specify an implementation or
aspects of an implementation to be used in executing a script in several ways.

Pig Latin programming is similar to specifying a query execution plan.

63
Pig Data Types and Syntax

Atom: An atom is any single value, such as a string or number


Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type.
Bag: A bag is a collection of non-unique tuples.
Map: A map is a collection of key-value pairs.

64
Pig Latin Operators

65
Pig Latin Expressions

66
Pig User-Defined Functions (UDFs)

[Screenshots: the command to run the script, and the .java file that contains the UPPER UDF]

https://pig.apache.org/docs/r0.7.0/udf.html
67
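For reference, a UDF along the lines of the UPPER example in the linked documentation looks roughly like this (a sketch assuming Pig's EvalFunc API; the package name is illustrative):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple Pig UDF that upper-cases its single string argument
public class UPPER extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw new IOException("Caught exception processing input row", e);
    }
  }
}

In a Pig script, such a UDF would typically be registered with REGISTER myudfs.jar; and invoked as myudfs.UPPER(field) — the jar and package names here are illustrative.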
Apache Hive

68
Using Hive to Create a Table

69
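The Hive table creation is shown as a screenshot in the original slides; as a hedged alternative illustration, the sketch below issues a CREATE TABLE statement through Hive's JDBC interface (the connection URL, credentials, and table schema are illustrative assumptions, not the example from the slide):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCreateTableExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver and connect (URL and credentials are assumptions)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hadoop", "");
    try (Statement stmt = conn.createStatement()) {
      // Create a table over comma-delimited flight records (schema is illustrative)
      stmt.execute("CREATE TABLE IF NOT EXISTS flights (" +
          "flight_year INT, flight_month INT, day_of_month INT, carrier STRING, distance INT) " +
          "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
    } finally {
      conn.close();
    }
  }
}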
Another Hive Example

70
Exercises

1. Download the Airline Data and one of your own selected datasets from Stat-Computing.org
3. Learn to use Pig. You can try the example in the reference
4. Use Oozie to schedule a few jobs
5. Try HBase. Use your own example
6. Try Hive. Use your own example

http://stat-computing.org/dataexpo/

71
Questions?

72
