1 - HADOOP Crash Course
Objectives
What Computing Model Would Stand?
Big Data System Requirements
Two Ways to Build a System
Cluster Management
Back in 2000, Google realized that traditional data management
approaches would not scale with Big Data:
Google File System (GFS): to manage distributed storage of data
MapReduce: to abstract a distributed, parallel data-processing model
Hadoop
Hadoop V 1.X Versus V 2.X
Core Hadoop
Hadoop Ecosystem
Ingest Data
• Flume
• Collects, aggregates, and transfers big data
• Has a simple and flexible architecture based on streaming data flows
• Uses a simple, extensible data model that allows for online analytic applications
• Sqoop
• Designed to transfer data between relational database systems and Hadoop
• Accesses the database to understand the schema of the data
• Generates a MapReduce application to import or export the data
Store Data
• HBase
• A non-relational database that runs on top of HDFS
• Provides real-time wrangling of data
• Stores data indexed by row key to allow random, fast access (see the sketch after this list)
• Cassandra
• A scalable, NoSQL database designed to have no single point of failure
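As a rough illustration of that row-key based random access, here is a minimal sketch using the HBase Java client API; the table name, column family, and connection details are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one cell: row key "user-42", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read of the same row by its key, without scanning the table.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println(name);
        }
    }
}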
Process and Analyze Data
• Pig
• Analyzes large amounts of data
• Operates on the client side of a cluster
• A procedural data flow language
• Hive
• Used for creating reports
• Operates on the server side of a cluster
• A declarative programming language
Access Data
• Elasticsearch
• Provides real-time search and analytics capabilities
• Apache Drill
• Schema-free SQL query engine for big data
• Unified view of different data sources, allowing users to query data across various sources seamlessly
Access Data
• Impala
• Scalable and easy-to-use platform for everyone
• No programming skills required
• Hue
• Stands for Hadoop user experience
• Allows you to upload, browse, and query data
• Runs Pig jobs and workflows
• Provides editors for several SQL query languages, such as Hive and MySQL
Processing Lifecycle
Apache Sqoop
sqoop-env.sh
Apache Flume
A distributed, reliable service to collect and transform (unstructured) data:
Logs
Data streams
Components:
Client: hosted on the origin of the data to deliver events to the source
Source: receives events from their origin and delivers them to one or more channels
Channel: a transient store where events are persisted until consumed by sink(s)
Sink: consumes events from a channel and delivers them to the next destination
Apache Pig
Apache Hive
Example Big Data Applications
Hadoop vs. Other Distributed Systems
Core Hadoop
Storing Data in Hadoop
Hadoop Distributed File System (HDFS)
HDFS vs. Other Distributed File Systems
HBase vs. HDFS
HDFS Architecture
NameNode
HDFS files
Writing a File to HDFS
When a client is writing data to an HDFS file, the data is first written to a local file.
When the local file accumulates a full block of data, the client consults the NameNode to get a list of DataNodes that are assigned to host replicas of that block.
The client then writes the data block from its local storage to the first DataNode in 4 KB portions.
The DataNode stores the received data in its local file system and forwards each portion to the next DataNode in the list.
The same operation is repeated by the next receiving DataNode until the last node in the replica set receives the data.
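To make the client side of this flow concrete, here is a minimal sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem); the NameNode URI and file path are hypothetical, and the block replication described above is handled transparently by the client library:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            // The client buffers this data locally; once a full block accumulates,
            // the HDFS client pipeline replicates it across the DataNodes
            // chosen by the NameNode.
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}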
Writing a File to HDFS Cont.
Replica Placement
HDFS Federation
Core Hadoop
Processing Data with MapReduce
MapReduce Overview
MapReduce Workflow
Word Count: Mapper
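The mapper slide itself is pictorial; as a rough sketch, a word-count mapper written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce) could look like the following, with the class name and tokenization chosen for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one record per word occurrence
            }
        }
    }
}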
Word Count: Combiner (optional) (local reducer)
Combiners run locally on the node where the mapper runs. They can be used to reduce the number of records emitted to the next phase, as in the sketch below.
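For word count, the combiner can reuse the reducer's summing logic on the mapper's local output; a minimal sketch, assuming the mapper emits (Text word, IntWritable 1) pairs as above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pre-aggregates (word, 1) pairs on the mapper's node so fewer
// records cross the network during the shuffle.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

In the job driver, it would typically be registered with job.setCombinerClass(WordCountCombiner.class).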
Word Count: Partitioning/Shuffling/Sorting
Partitioning
Distributes the key space across the number of reducers available.
The number of reducers need not be the same as the number of mappers.
Assume two reducers.
Ensures that identical keys go to the same reducer.
Usually implemented via a hash function (see the sketch after this list).
Shuffling
Prepares partitioned keys for the reduce step.
Sorting
Transforms (k, v1), (k, v2), (k, v3) → (k, {v1, v2, v3})
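Hadoop's default HashPartitioner behaves as just described: it hashes the key and takes the result modulo the number of reducers. A minimal custom partitioner along the same lines, for illustration only:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each key to one of numReduceTasks partitions; identical keys
// always hash to the same partition, so they meet at the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit before taking the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}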
Word Count: Reducer
There will be two files written to the specified output path; the number of output files equals the number of reducers.
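A minimal word-count reducer sketch, under the same assumptions as the mapper and combiner above; with two reducers configured (e.g., via job.setNumReduceTasks(2)), each reducer writes its own part-r-NNNNN file to the output path:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (word, {counts...}) after the shuffle/sort and writes the
// final (word, total) pairs to its own output file.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(word, new IntWritable(total));
    }
}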
Other Uses for MapReduce
Core Hadoop
R.I.P. MapReduce?