
DSCI 5350 – Big Data Analytics

Lecture 2 - Introduction to Hadoop

Kashif Saeed
Lecture Outline

• Big Data Deployment Scenarios
• History: Distributed Systems
• What is Hadoop
• Hadoop Architecture

Why Do Organizations Need Big Data?

• Handling the volume and variety of data
  - IoT, streaming data, weblogs, sensor/machine data
  - Unstructured data: PDFs, voice, text, email, social media
• Cheap storage
  - Cost per TB in Hadoop is significantly lower than in traditional databases
• Need for advanced analytics
  - More business decisions are data-driven
  - More data is needed to run sophisticated and accurate models

Big Data != Hadoop
  - Big Data is a concept; Hadoop is one implementation of that concept

Big Data Deployment Scenarios

• On-premises scenarios
  - Hadoop with MapReduce as the processing engine – already legacy
  - Hadoop with Spark as the processing engine
  - Spark with NoSQL (or other data stores)
• Cloud scenarios
  - Persistent cluster on the cloud
  - Non-persistent cluster on the cloud
  - Managed cluster on the cloud

Big Data Cloud Deployment – Concepts

Non-Persistent Cluster
  - Pay as you go
  - No investment in hardware, setup, or maintenance
  - Set up the cluster, process the data, and terminate the cluster
  - The entire cycle – setup, processing, and termination – can be automated (see the sketch below)
  - Since you terminate the cluster, you should not store the results on the cluster itself; AWS uses S3 for storage, GCP uses Cloud Storage

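As a concrete sketch of the non-persistent pattern on GCP (the cluster name, region, bucket, and script below are hypothetical placeholders), the whole setup-process-terminate cycle can be scripted with the gcloud CLI:

    # create a short-lived Dataproc cluster
    gcloud dataproc clusters create demo-cluster --region=us-central1

    # run the actual work; input and output live in Cloud Storage, not on the cluster
    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py \
        --cluster=demo-cluster --region=us-central1

    # terminate the cluster once the job finishes
    gcloud dataproc clusters delete demo-cluster --region=us-central1 --quiet
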
What Will Be Covered in This Class

• We will use the Cloudera VM to learn and practice HDFS, Hadoop ecosystem tools (Hive, Sqoop), and Spark
  - The learning applies to both on-premises and cloud deployments
  - Spark coding and practice
• Spark standalone deployment overview (can be used for practicing Spark)
• Overview of AWS Cloud
• Overview of GCP (with hands-on)
• GCP Big Data and ML (with hands-on)

The Origin of Hadoop

Distributed Systems

• A distributed system is a cluster of multiple machines working together at the software level to store and process data
• Multiple machines = more processing power
• The software enables recovery from failures and exceptions, and handles storage and the reads/writes to/from storage
• Traditional distributed systems work with central storage
  - All machines read from and write to the same storage

Challenges with Distributed Systems

• Programming complexity
  - The software needs to be sophisticated enough to handle everything that can go wrong (exception and error handling)
• Centralized data storage
  - With hundreds of computers reading from and writing to the same storage, the read/write speed of that storage becomes the bottleneck
• Data processing and storage at different layers
  - With data centralized and processing localized, you have to bring the data to the processing engine by copying it
  - Replicating/moving data to the processing engine is not feasible when dealing with petabytes or exabytes of data

Figure 1: Distributed System
Google’s Solution for Distributed Systems

• GFS (Google File System) modified the distributed-system design to address the storage bottleneck by decentralizing data storage
  - Distributed storage – data is processed where it is stored
  - GFS can store very large volumes of data because data does not have to be moved to a central processing engine
• Distributed processing of data – MapReduce
• Hadoop is based on the solutions Google developed internally and published in 2003-2004 (the GFS and MapReduce papers)
• Doug Cutting was working on an open-source project called Nutch and used Google's publications to solve the same problems; that work eventually became Hadoop

What is Hadoop

What is Hadoop

• Hadoop is infrastructure software for processing, storing, and analyzing large volumes of data
  - The software handles the distribution of data, failure recovery, etc., while the hardware provides the storage and processing power
• Hadoop is distributed – a Hadoop cluster consists of several machines
• Hadoop is scalable – you can add more machines to the cluster, which proportionally adds capacity
• Hadoop is fault-tolerant – it can recover from hardware failures
  - The master reassigns work
  - Data is replicated on 3 machines by default
  - Nodes that recover rejoin the cluster automatically
• Hadoop is open source
  - Overseen by the Apache Software Foundation
  - Close to 100 committers from companies like Cloudera, Hortonworks, Facebook, LinkedIn, Yahoo, and Google contribute to the ecosystem

Differences between RDBMS & Hadoop

• RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval
• Hadoop suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated

                Traditional RDBMS            Hadoop
  Data size     Gigabytes/Terabytes          Petabytes+
  Access        Interactive and batch        Batch
  Updates       Read and write many times    Write once, read many times
  Structure     Static schema                Dynamic schema
  Scaling       Vertical                     Horizontal

Hadoop Distributions

• Hadoop is open source and can be downloaded from Apache's website
• What is a distribution?
  - A distribution is an enterprise-ready version of an open-source software package
  - It is tested thoroughly, supported, and integrates well with other software
  - Most companies prefer distributions over the raw open-source releases
• Cloudera, Hortonworks, and MapR are the most widely used enterprise-ready Hadoop distributions
  - We will use Cloudera's distribution for parts of this class
  - Cloudera and Hortonworks merged in 2018 – see the link below

https://www.cloudera.com/about/news-and-blogs/press-releases/2018-10-03-cloudera-and-hortonworks-announce-merger-to-create-worlds-leading-next-generation-data-platform-and-deliver-industrys-first-enterprise-data-cloud.html

Hadoop Architecture

Hadoop Cluster Terminologies

• Cluster – a group of computers working together
• Node – an individual computer in the cluster
  - There are two kinds of nodes: master nodes (NameNodes) and worker nodes (DataNodes)
• Master node – a node that manages the distribution of work across worker nodes and the distribution/retrieval of data to/from worker nodes
• Daemon – a program running on a node
• Commodity hardware – a term for affordable hardware that generally lacks bells and whistles like RAID or hot-swappable CPUs
  - Applies mainly to data nodes; name nodes should be highly reliable

Master Node (Name Node)

• The master node manages the work
  - It coordinates the work and the data storage on the cluster
  - Daemons on master nodes manage the entire cluster
  - A failed daemon on a master node may make the entire cluster unavailable
• Master nodes are configured for high availability in an active-passive mode
• Master nodes use carrier-class hardware
  - Dual power supplies and dual Ethernet cards
  - Hard drives use RAID (redundant array of inexpensive disks) to protect against data loss
• Reasonable RAM and CPUs are required
  - 64 GB for 20 nodes or fewer
  - 96 GB for up to 300 nodes

Worker Nodes (Data Nodes)

• Worker nodes perform the work
  - They can be scaled horizontally
  - Daemons on worker nodes handle the data processing on that node
  - A failed worker node does not bring the cluster down, thanks to data replication and high availability
• Worker node hardware
  - Midrange CPUs are okay
  - The more RAM, the better
    - Memory-intensive processing frameworks and tools (e.g., Spark and Impala) are increasingly used
    - HDFS caching can take advantage of extra RAM
    - 512 GB of RAM per node or better is not uncommon
  - More disk is needed
    - By default, HDFS data is replicated 3 times
    - 20-30% of cluster capacity is needed for temporary raw storage
    - 4x your data storage need is a good number for estimation (see the worked example below)
    - A good practical maximum is 36 TB per worker node (12 x 3 TB drives)
    - 7200 RPM SATA/SATA II drives are fine; no need to buy 15000 RPM drives

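A quick worked example of the 4x estimation rule (the 100 TB figure is hypothetical): to hold 100 TB of raw data with 3x replication you need 300 TB, and reserving 20-30% of total capacity for temporary storage pushes that to roughly 375-430 TB – about 4x the raw data. At the practical maximum of 36 TB per worker node, that works out to roughly 11-12 worker nodes for storage alone.
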
NameNode High Availability

• In Hadoop versions prior to 2.0, the NameNode was a single point of failure
• Hadoop 2.0 introduced two additional configurations: High Availability (HA) and Federated NameNodes (FNN)
• HA allows using two NNs in an active-passive relationship (see the command sketch below). The two NNs rely on a shared information space using either shared storage (NFS) or the Quorum Journal Manager
• FNN introduced the concept of a namespace composed of a NN and its dependent data nodes; for example, NN1 manages DN1-3, NN2 manages DN4-6, and NN3 manages DN7-9. HA and FNN are not incompatible, so it is possible to have, for example, three active NNs and three standby NNs managing three namespaces
• Hadoop version 3 supports more than one standby NN in an HA configuration

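In an HA deployment you can ask each NameNode for its current state from the command line; a small sketch (the service IDs nn1 and nn2 are whatever names the cluster's configuration defines, assumed here):

    # query the state of each configured NameNode (IDs are cluster-specific)
    hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
    hdfs haadmin -getServiceState nn2
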
Hadoop Core Components

• HDFS for storage
• Processing frameworks
  - MapReduce
  - Spark
  - Impala (on Cloudera distributions only)
• Resource manager
  - YARN

Hadoop Distributed File System (HDFS)

• HDFS (the storage layer for Hadoop) is a filesystem written in Java that sits on top of the native filesystem
  - The directory structure in HDFS has nothing to do with the directory structure of the local filesystem
• It can store any type of data, including text files, PDFs, images, video, etc.
• Files in HDFS are immutable
  - New data can be appended at the end of a file, but existing data cannot be modified (see the sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

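A small illustration of immutability from the command line (all paths are hypothetical): appending is allowed, editing in place is not:

    # append local records to an existing HDFS file (allowed)
    hdfs dfs -appendToFile new_records.log /user/training/logs/app.log

    # there is no edit-in-place; to change content you replace the file
    hdfs dfs -rm /user/training/logs/app.log
    hdfs dfs -put fixed.log /user/training/logs/app.log
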
Hadoop Distributed File System (HDFS)

• HDFS works better with a few very large files than with many small files
  - Fewer files mean less time spent seeking to data locations
  - Seeks are expensive operations, needed only when you analyze a subset of the data
  - Large files minimize seeks
  - HDFS is best suited for sequential access and streaming rather than random access, because sequential access requires fewer seeks
• Files in HDFS can be hundreds of MB or even GB in size
• When importing data into Hadoop, it is better to combine the data and ingest one large file than to ingest 50 smaller files

HDFS - Blocks

• HDFS uses blocks to store a file or part of a file
  - The default block size is 64 MB in older Hadoop versions (128 MB in Hadoop 2.x and later); 128 MB is the recommended size
  - Compare this to 4 KB in a UNIX filesystem
  - One HDFS block comprises many OS blocks
• Blocks only use the space needed to store a file
  - E.g., a 280 MB file uses 2 x 128 MB blocks plus 24 MB of a 3rd block, rather than consuming the whole 3rd block
• Blocks are replicated to multiple nodes, which makes the system fault-tolerant if a node fails
  - The replication factor is configurable and determines how many copies of each block are kept (see the command sketch below)
  - The default is 3 copies of each block, on different nodes

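A sketch of inspecting blocks and replication from the command line (the path is hypothetical):

    # show a file's blocks, their sizes, and which DataNodes hold the replicas
    hdfs fsck /user/training/weblogs/big_file.log -files -blocks -locations

    # change the replication factor for one file; -w waits until it takes effect
    hdfs dfs -setrep -w 2 /user/training/weblogs/big_file.log
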
[Figure omitted. Image source: Cloudera Academic Alliance]

Processing Framework: MapReduce

• MapReduce was originally the only processing framework for Hadoop
• It is still available as a processing engine, though most clients already consider it legacy
• MapReduce is:
  - Platform- and language-independent
  - Record-oriented data processing (key-value pairs)
  - Able to distribute tasks across multiple nodes
  - Typically written in Java (see the runnable example below)

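To get a feel for MapReduce without writing Java, the examples jar that ships with Hadoop can run a word count from the shell (the jar path is typical of CDH installs but varies by distribution; input/output paths are hypothetical):

    # run the bundled WordCount job; the output directory must not already exist
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount /user/training/input /user/training/wc_output

    # reducer output lands in part-r-* files
    hdfs dfs -cat /user/training/wc_output/part-r-00000
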
Processing Framework: Others

• Multiple frameworks may exist on a Hadoop cluster
  - MapReduce
  - Spark
  - Impala
• Each framework is designed to consume all of the cluster's resources
  - Different frameworks have no knowledge of the other frameworks active on the cluster
  - Different frameworks compete for resources
• YARN (Yet Another Resource Negotiator) was developed to manage resources across frameworks
  - It allocates resources to the different frameworks based on system settings and demand (see the command sketch below)

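A quick sketch of watching YARN from the shell (the application ID is a made-up placeholder):

    # list applications currently running under YARN, with queue and progress
    yarn application -list

    # show the status of one application
    yarn application -status application_1526900291234_0001
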
[Figures omitted. Image source: Cloudera Academic Alliance]

Getting Data In and Out of Hadoop

Storing and Retrieving Files

[Figures omitted. Image source: Cloudera Academic Alliance]

Note: the data itself is NEVER retrieved via the NameNode – clients read file blocks directly from the DataNodes; the NameNode only supplies block locations.

Accessing Hadoop System

• From the command line
  - hdfs dfs (examples below)
• Using Spark
  - HDFS is accessed via URL (hdfs://...)
• Using ecosystem projects
  - Flume – data from websites, system logs
  - Sqoop – transfers between HDFS and RDBMSs
  - Hue – web interface for analyzing data
• Using BI tools

[Figure omitted. Image source: Cloudera Academic Alliance]

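A few common hdfs dfs commands as a sketch (paths and filenames are hypothetical); they deliberately mirror the familiar UNIX file commands:

    hdfs dfs -ls /user/training                               # list a directory
    hdfs dfs -mkdir /user/training/weblogs                    # create a directory
    hdfs dfs -put access.log /user/training/weblogs/          # copy a local file into HDFS
    hdfs dfs -cat /user/training/weblogs/access.log | head    # peek at the contents
    hdfs dfs -get /user/training/weblogs/access.log copy.log  # copy back to local disk
    hdfs dfs -rm -r /user/training/weblogs                    # remove recursively
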
The Hadoop Ecosystem
and additional tools integrated in CDH

Data Storage: Apache HBase

• Apache HBase – the Hadoop NoSQL database (see the shell sketch below)
  - A NoSQL distributed database built on HDFS
  - Designed for applications requiring fast, random access to large volumes of data
  - Scales to support very large amounts of data and high throughput
  - Can handle tables with millions of columns and billions of rows
  - Modeled after Google's BigTable

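A minimal sketch of HBase's shell driven from the command line (the table, column family, and values are hypothetical):

    # create a table with one column family, insert a cell, then read the row back
    echo "create 'users', 'info'"                    | hbase shell
    echo "put 'users', 'row1', 'info:name', 'Ana'"   | hbase shell
    echo "get 'users', 'row1'"                       | hbase shell
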
Apache Sqoop

• High-speed import from an RDBMS to HDFS (and vice versa)
• Supports many databases, such as Netezza, MongoDB, MySQL, Teradata, and Oracle
• Covered later in this course (see the sketch below)

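As a sketch, a typical Sqoop import looks like this (host, database, credentials, and table are hypothetical placeholders):

    # pull one table from MySQL into HDFS using 2 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost/loudacre \
        --username training --password training \
        --table accounts \
        --target-dir /user/training/accounts \
        --num-mappers 2
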
Streaming Systems: Flume, Kafka, Flink, and Others
• Apache Flume
  - A distributed service for ingesting streaming data
  - Ideally suited for event data from multiple systems – for example, log files
  - Covered later in this course
• Apache Kafka
  - A high-throughput, scalable messaging system
  - A distributed, reliable publish-subscribe system
  - Integrates with Flume and Spark Streaming
  - Use cases: large-scale messaging, log aggregation, customer activity tracking, etc. (see the console sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

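A sketch of Kafka's console tools (the broker address and topic are hypothetical; on some installs the scripts carry a .sh suffix, and older versions create topics via --zookeeper rather than --bootstrap-server):

    # create a topic, type messages into it, and read them back
    kafka-topics --create --topic weblogs --partitions 1 --replication-factor 1 \
        --bootstrap-server localhost:9092
    kafka-console-producer --broker-list localhost:9092 --topic weblogs
    kafka-console-consumer --bootstrap-server localhost:9092 --topic weblogs --from-beginning
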
Data Processing: Apache Spark

• Spark is a large-scale data processing engine
  - It can load datasets into the memory of the data nodes
  - Spark provides a high-level programming API that lets programmers focus on logic as opposed to plumbing
  - Spark programs work with cluster resource management frameworks/tools
  - Runs on the cluster against data stored in HDFS
• Supports a wide range of workloads
  - Machine learning
  - Business intelligence
  - Streaming
  - Batch processing
• Will be covered in detail later in the course (a quick shell sketch follows)

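On the VM, Spark is usually reached from the shell in one of two ways (the script name and HDFS path are hypothetical):

    # interactive exploration: pyspark for the Python API, spark-shell for Scala
    pyspark

    # run a standalone application on the cluster under YARN
    spark-submit --master yarn wordcount.py hdfs:///user/training/weblogs
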
Data Processing: MapReduce

• MapReduce is the original processing framework in Hadoop
  - Java-based
  - It was the core processing engine before Spark was introduced
  - Many existing tools are built on MapReduce; however, it is not used much in industry anymore

Data Processing: Apache Pig

• Apache Pig is a scripting platform built on Hadoop for high-level data processing
  - An alternative to writing MapReduce code
  - Especially good for joining and transforming data
• The Pig interpreter runs on the client machine
  - It turns Pig Latin scripts into MapReduce or Spark jobs (see the sketch below)

[Figure omitted. Image source: Cloudera Academic Alliance]

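A minimal, hypothetical Pig Latin snippet run from the shell (the input path, field layout, and filter are made up for illustration):

    # keep only server-error records from a tab-delimited web log
    pig -e "
      logs = LOAD '/user/training/weblogs' USING PigStorage('\t')
             AS (ip:chararray, url:chararray, status:int);
      errs = FILTER logs BY status >= 500;
      STORE errs INTO '/user/training/error_logs';
    "
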
High Performance SQL: Cloudera Impala

• Impala is a high-performance SQL engine
  - Developed by Cloudera
  - Open source, released under the Apache license
  - Inspired by Google's Dremel project
  - Used for interactive analysis (see the sketch below)
• Impala supports SQL (Impala SQL)
  - Data is stored in HDFS as database tables

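A sketch of an interactive query through the impala-shell CLI (the table is hypothetical):

    # run one query and exit; add -i <host> if the impalad is not local
    impala-shell -q "SELECT status, COUNT(*) FROM weblogs GROUP BY status"
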
SQL on MapReduce: Apache Hive

• Hive is an abstraction layer on top of Hadoop
  - It uses a SQL-like language called HiveQL, similar to Impala SQL
  - Useful for data processing and ETL
• Hive executes queries using MapReduce (see the sketch below)
  - Hive on Spark is available as well (Power Hive)

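A sketch of running HiveQL from the shell (the table and columns are hypothetical; beeline is the newer client, but the classic hive CLI on the VM works the same way):

    # Hive compiles this into MapReduce (or Spark) jobs behind the scenes
    hive -e "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url ORDER BY hits DESC LIMIT 10"
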
User Interface: Hue

• Hue = Hadoop User Experience
• Provides a web front-end to Hadoop
  - Upload and browse data
  - Query tables in Impala and Hive
  - Run Spark and Pig jobs
  - Search
• Created by Cloudera; now a 100% open-source project released under the Apache license

Workflow Management: Apache Oozie

• Oozie is the workflow engine for Hadoop
  - Defines dependencies between jobs
  - Provides an ETL-like interface for workflow management
  - The Oozie server submits the jobs to the cluster in the correct sequence (see the CLI sketch below)

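A sketch of submitting a predefined workflow with the Oozie CLI (the server URL, properties file, and job ID are hypothetical; the workflow itself is an XML definition stored in HDFS):

    # submit and start a workflow whose parameters live in job.properties
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run

    # check on it later, using the job ID printed by the previous command
    oozie job -oozie http://localhost:11000/oozie -info 0000001-200101000000000-oozie-oozi-W
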
Apache Incubator Projects

• The Apache Incubator contains Apache projects that are still under development
  - It contains both Hadoop and non-Hadoop projects
  - Graduates from the Incubator are added to the corresponding stack
• http://incubator.apache.org/projects/#current

Cloudera Labs

• The class will use a VM provided by Cloudera
• The class will use labs provided by the Cloudera Academic Alliance
• All labs refer to a fictitious company – Loudacre Mobile
• All files needed for the labs are available on the VM

Cloudera Labs - continued

• Your virtual machine
  - Log in as user training (password: training)
  - Pre-installed and configured with CDH and Spark

• Training material
  - ~/training_materials/dev1 folder on the VM

• Course data
  - ~/training_materials/data