1 - HADOOP Crash Course

The document provides an introduction to batch processing systems, focusing on the Hadoop ecosystem, including its components like HDFS and the MapReduce programming model. It outlines the differences between centralized and distributed data processing, the lifecycle of batch data processing, and various tools for data ingestion, storage, processing, and access. Additionally, it discusses the architecture and functionalities of Hadoop, emphasizing its scalability, fault tolerance, and resource management capabilities.


CIT650 Introduction to Big Data

Batch Processing Systems


Part 1

1
Objectives

By the end of the lecture, you should be able to understand:

The difference between centralized and distributed data processing
The lifecycle of batch data processing
The Hadoop ecosystem
HDFS
The MapReduce programming model

2
What Computing Model Would Stand?

A single machine cannot handle that amount of data

3
Big Data System Requirements

We need distributed computing frameworks, like Hadoop

4
Two Ways to Build a System

Monolithic: paying twice the cost for double the processing power and memory does not mean twice the performance
Distributed: many small, cheap machines come together to act as a single entity. Scales linearly as you add more nodes.
5
Server Farms
All these machines need to be coordinated by a single piece of software that takes care of:
Data partitioning
Coordinating computing tasks
Handling fault tolerance and recovery
Scheduling tasks and allocating resources

6
Cluster Management
Back in 2000, Google realized that traditional data management approaches would not scale with big data, and introduced:
The Google File System, to manage distributed storage of data
MapReduce, to abstract a distributed, parallel data processing model

7
Hadoop

An open-source software framework that:
Implements a distributed file system, HDFS, following Google's file system design
Supports MapReduce as the programming model
Manages the cluster and schedules data processing jobs
Scales horizontally as new nodes join the cluster
Provides fault tolerance and transparent rescheduling of failed jobs/tasks
Manages resource allocation

8
Hadoop v1.x vs. v2.x

9
Core Hadoop

10
Hadoop Ecosystem

For storage: Hadoop Distributed File System (HDFS)
For computation: the MapReduce model
Other tools:
Data ingestion: Sqoop, Flume
Storage: HBase
Processing: MapReduce, Spark, Hive
Resource management: YARN

Source: https://www.edureka.co/blog/hadoop-ecosystem

11
Ingest Data

• Flume
• Collects, aggregates, and transfers big data
• Has a simple and flexible architecture based on streaming data flows
• Uses a simple, extensible data model that allows for online analytic applications

• Sqoop
• Designed to transfer data between relational database systems and
Hadoop
• Accesses the database to understand the schema of the data
• Generates a MapReduce application to import or export the data

12
Store Data

• HBase
• A non-relational database that runs on top of HDFS
• Provides real-time wrangling of data
• Stores data as indexes to allow random, faster access to data

• Cassandra
• A scalable, NoSQL database designed to have no single point of
failure

13
Process and Analyze Data

• Pig
• Analyzes large amounts of data
• Operates on the client side of a cluster
• A procedural data flow language

• Hive
• Used for creating reports
• Operates on the server side of a cluster
• A declarative programming language

14
Access Data

• Elasticsearch
• Provides real-time search and analytics capabilities

• Apache Drill
• A schema-free SQL query engine for big data
• Provides a unified view of different data sources, allowing users to query across them seamlessly

15
Access Data

• Impala
• A scalable and easy-to-use platform for everyone
• No programming skills required

• Hue
• Stands for Hadoop user experience
• Allows you to upload, browse, and query data
• Runs Pig jobs and workflow
• Provides editors for several SQL Query languages like Hive and MySQL

16
Processing Lifecycle

17
Apache Sqoop

A data exchange tool between Hadoop and relational databases
Relies on the RDBMS for the data schema
Imports one table at a time
Uses MapReduce jobs to parallelize data exchange
Map-only jobs, discussed later
Commands:
sqoop import --connect XXX --table MY_TABLE --num-mappers 8 ...
sqoop export ...

18
Apache Sqoop
sqoop-env.sh

19
Apache Flume
A distributed, reliable service to collect and transform (unstructured) data:
Logs
Data streams
Components:
Client: hosted at the origin of the data; delivers events to the source
Source: receives events from their origin and delivers them to one or more channels
Channel: a transient store where events are persisted until consumed by sink(s)
Sink: consumes events from a channel and delivers them to the next destination
20
Apache Pig

A data processing platform with a high-level language
Language: Pig Latin
An alternative to writing low-level MapReduce jobs
Good at transformation and joining
The Pig interpreter runs on a client machine:
Turns Pig Latin scripts into MapReduce/Spark jobs
Submits those jobs to the Hadoop cluster

orders = LOAD '/user/training/orders' AS (ordId, custId, cost);
groups = GROUP orders BY custId;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY custId;
DUMP result;

21
Apache Hive

A data warehouse on top of Hadoop
Originally started at Facebook
SQL-like interface for data processing (HiveQL)
Folder structure for data storage can be leveraged in Hive commands
HiveQL queries are translated into MapReduce jobs transparently

22
Example Big Data Applications

• Social media marketing analysis
  • Customer Segmentation and Targeting
  • Campaign Performance Analytics

• Shopping pattern analysis
  • Customer Journey Mapping
  • Inventory Management Optimization

• Traffic pattern recognition
  • Smart Transportation Systems
  • Public Safety and Emergency Response
  • Urban Planning and Infrastructure Optimization

23
Example Big Data Applications

• Large data transformation
  • Batch Processing for Data Transformation
  • Streamlining Data Pipelines

• Credit card fraud detection
  • Anomaly Detection with Machine Learning
  • Data Processing for Large Transaction Volumes

24
Hadoop vs. Other Distributed Systems

Data is distributed in advance, at its arrival (scalability)
Data is replicated throughout the cluster (availability/reliability)
Data transfer is minimized (locality)
A unified programming model that abstracts:
Data distribution
Node communication
Fault tolerance
Clean separation between business needs (what you write) and the handling of distributed data processing (what is hidden)

25
Core Hadoop

26
Storing Data in Hadoop

Efficient data storage is the foundation for efficient processing
Data can be stored in:
Raw format in HDFS
A columnar view in HBase
Data serialization formats:
Raw
Avro
We will talk about these later

27
Hadoop Distributed File System (HDFS)

The lowest level of data storage
Yet at a higher abstraction level than the OS-specific file system
Partitions files at ingestion time
Replicates data for high availability and fault tolerance
Abstracts the physical partition location (which node in the cluster) from the application
Supports serving several applications in parallel on the same file partition
Inspired by the Google File System

28
HDFS vs. Other Distributed File Systems

HDFS was designed with the following objectives in mind:
Partition and distribute a single file across different machines
Favor larger partition sizes
Replicate data
Process data locally (as much as possible)
HDFS is optimized for:
Sequential reads, rather than random access and writes
No in-place updates to files
No local caching

29
HBase vs. HDFS

30
HDFS Architecture

Source: Figure 2-1 in book Professional Hadoop Solutions

31
Name Node

A single node that keeps the metadata of HDFS
In some high-availability settings, there is a secondary name node
The whole catalog (metadata) is kept in memory for fast access
A copy of the in-memory data is periodically flushed to disk (the FsImage file)
Clients need to contact the name node first to get information about the actual blocks
Since the name node can be accessed concurrently, a logging mechanism similar to that of databases is used to track updates to the catalog
The name node runs a daemon process to handle requests and to receive heartbeats from the data nodes

32
HDFS files

A single large file is partitioned into several blocks
Block size of either 64 MB or 128 MB
Compare that to block sizes on ordinary file systems
This is why sequential access is much better: the disk makes far fewer seeks
What would be the costs/benefits of using smaller block sizes?
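Whatever the answer, the block size itself is just a configuration value (dfs.blocksize, in bytes) that applies to files created by the client. A minimal sketch, assuming the cluster settings are otherwise picked up from the classpath; the printed value depends on the Hadoop version and configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 128 MB blocks for files created with this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
    }
}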

33
Writing a File to HDFS

When a client is writing data to an HDFS file, this data is first written to a
local file.
When the local file accumulates a full block of data, the client consults the
NameNode to get a list of DataNodes that are assigned to host replicas of
that block.
The client then writes the data block from its local storage to the first
DataNode in 4K portions.
The DataNode stores the received blocks in a local file system, and forwards
that portion of data to the next DataNode in the list.
The same operation is repeated by the next receiving DataNode until the
last node in the replica set receives data.

34
Writing a File to HDFS (Cont.)

This DataNode stores data locally without sending it any further


If one of the DataNodes fails while the block is being written, it is removed
from the pipeline
The NameNode re-replicates it to make up for the missing replica caused by
the failed DataNode
When a file is closed, the remaining data in the temporary local file is
pipelined to the DataNodes
If the NameNode dies before the file is closed, the file is lost.
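The whole write path above is hidden behind the HDFS client library. Below is a minimal sketch using the Java FileSystem API (the path is illustrative); local buffering, NameNode block allocation, and the DataNode pipeline all happen inside create(), write(), and close():

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/training/example.txt");  // illustrative path

        // The client buffers locally; replicas are pipelined to the DataNodes
        // chosen by the NameNode, and close() completes the file.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}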

35
Replica Placement

Replica placement is crucial for the reliability of HDFS
Replicas should not all be placed on the same rack
All decisions about the placement of partitions/replicas are made by the NameNode
The NameNode tracks the availability of data nodes by means of heartbeats
Every 3 seconds, the NameNode should receive a heartbeat and a block report from each data node
The block report allows verifying the list of blocks stored on the data node
A data node with a missing heartbeat is declared dead; based on the catalog, the replicas missing on that node are made up for through the NameNode sending replicas to other available data nodes

36
HDFS Federation

By default, HDFS has a single NameNode. What is wrong with that? If the NameNode daemon process goes down, the cluster becomes inaccessible
A solution: HDFS Federation
Namespace scalability: horizontal scalability for accessing metadata, just as for accessing the data itself
Performance: higher throughput, as NameNodes can be queried concurrently
Isolation: blocking applications can be served by different NameNodes
Is it more reliable?
37
HDFS High-availability

Each NameNode is backed up with another, slave NameNode that keeps a copy of the catalog
The slave node provides a failover replacement for the primary NameNode
Both nodes must have access to a shared storage area
Data nodes have to send heartbeats and block reports to both the master and slave NameNodes

38
Core Hadoop

39
Processing Data with MapReduce

A simple abstraction for distributed data processing
Invented by Google in 2004
Two operations:
Map: think of it as a transformation, either 1:1 or 1:M, applied to each element of your data
Reduce: think of it as a form of aggregating or compacting the data, M:1
Hadoop provides the open-source implementation of this model (a minimal job driver is sketched below)
It was believed that all sorts of large-scale data processing could be tackled by MapReduce; this is not true, as you will see in future lectures
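A minimal, hedged sketch of a MapReduce job driver using the Hadoop Java API. WordCountMapper and WordCountReducer refer to the word-count classes sketched on the upcoming slides; input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Business logic lives in the Mapper/Reducer classes;
        // distribution, scheduling, and fault tolerance are handled by Hadoop.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}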

40
MapReduce Overview

You need to implement map, reduce, or both
Map:
Can be used to split elements
Can be used to filter elements
Reduce:
To compute aggregates
To combine results

41
MapReduce Overview

You need to implement map, reduce, or both
Map:
Can be used to split elements
Can be used to filter elements
Reduce:
To compute aggregates
To combine results

42
MapReduce Workflow

43
Word Count: Mapper

44
Word Count: Mapper
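The mapper code shown on these slides is not reproduced in the extracted text; the following is a minimal sketch against the standard Hadoop Java API (class and field names are illustrative). For each input line, it emits a (word, 1) pair per token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit (token, 1) for each one.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}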

45
Word Count: Combiner (optional) (local reducer)

Combiners run locally on the node where the mapper runs. They can be
used to reduce the number of records emitted to the next phase
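For word count, the reduce logic (integer summation) is associative and commutative, so the reducer class itself can serve as the combiner. A minimal sketch of the relevant driver line, assuming the WordCountReducer class sketched on the reducer slide:

// Reuse the reducer as a local combiner; safe here because summation is
// associative and commutative. Hadoop may run the combiner zero or more
// times, so it must never change the final result.
job.setCombinerClass(WordCountReducer.class);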

46
Word Count: Partitioning/Shuffling/Sorting

Partitioning
Distributes the key space over the number of available reducers
The number of reducers need not be the same as the number of mappers
Assume two reducers
Makes sure that the same key always goes to the same reducer
Usually implemented via a hash function (see the sketch after this list)
Shuffling
Prepares the partitioned keys for the reduce step
Sorting
Transforms (k, v1), (k, v2), (k, v3) → (k, {v1, v2, v3})
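A minimal sketch of hash-based partitioning, in the spirit of Hadoop's default HashPartitioner (illustrative, not the library source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // The same key always hashes to the same reducer;
        // masking with Integer.MAX_VALUE keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

With the two reducers assumed above, the driver would call job.setNumReduceTasks(2).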

47
Word Count: Reducer

There will be two files written to the specified output path; the number of files equals the number of reducers.
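The reducer code from the slide is likewise not in the extracted text; a minimal sketch, assuming the same (Text, IntWritable) types as the mapper sketched earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word and emit the total.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

With job.setNumReduceTasks(2) in the driver, the two reducers produce two output files, conventionally named part-r-00000 and part-r-00001.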

48
Other Uses for MapReduce

The reduce phase is optional in a MapReduce job
Example: running face recognition on millions of images
A map-only job
Input: (image ID, image)
Output: (image ID, list of features)
Features are loaded into the distributed cache
The results of each mapper will be written to a separate file in the output folder
What to do if we want all results in a single file?
Add a single reducer
However, this comes with a large overhead, as shuffle and sort will be invoked (see the driver sketch below)
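A hedged sketch of the relevant driver lines; FaceFeatureMapper is a hypothetical mapper class for the image example above:

// Map-only job: with zero reduce tasks there is no shuffle/sort phase and
// each mapper writes its own output file (part-m-xxxxx) to the output folder.
job.setMapperClass(FaceFeatureMapper.class);  // hypothetical mapper
job.setNumReduceTasks(0);

// Alternative: a single reducer yields one combined output file,
// at the cost of shuffling all mapper output to one node.
// job.setNumReduceTasks(1);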

49
Core Hadoop

50
R.I.P MapReduce?

51
R.I.P MapReduce?

52
