PPT04-Hadoop Infrastructure Layer

The document describes Hadoop's infrastructure layer and processing concepts. It discusses Hadoop architecture, deployment models including on-premise, virtual, private cloud and public cloud. It outlines Hadoop's assumptions about its infrastructure including using local disks, non-RAID storage, and assumptions about network topology. It also describes Hadoop implementation details like how the JobTracker and TaskTracker work to process jobs in a distributed manner.

COMP6725 - Big Data Technologies

TOPIC 4
HADOOP INFRASTRUCTURE LAYER
LEARNING OUTCOMES

At the end of this session, students will be able to:


o LO1. Describe big data architecture layers and processing concepts
OUTCOMES

Students are able to describe big data architecture layers and processing concepts
OUTLINE

1. Hadoop Architecture
2. Hadoop Infrastructure Layer
3. 6 Reasons Why Hadoop on the Cloud
4. Hadoop’s Assumptions about its Infrastructure
5. Hadoop’s Implementation
6. Virtual Infrastructure VS Physical DataCenter
7. Virtual Infrastructure Implications
8. Hadoop on Cloud Infrastructures Reason
9. Hosting on local VMs
HADOOP ARCHITECTURE
HADOOP

o Hadoop is an open-source software framework for storing data and running


applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
o Apache Hadoop is an open-source framework written in Java that supports
processing of large data sets in a streaming access pattern across clusters in a
distributed computing environment. It can store a large volume of structured,
semi-structured, and unstructured data in a distributed file system (DFS) and
process it in parallel.
WHY IS HADOOP IMPORTANT?

o Ability to store and process huge amounts of any kind of data, quickly.
o Computing power.
o Fault tolerance.
o Flexibility.
o Low cost.
o Scalability.
HADOOP ARCHITECTURE

Pic 4.1. Hadoop architecture.


Source : Big Data Concepts, Technology, and Architecture. 2021
HADOOP ECOSYSTEM

Hadoop ecosystem comprises four


different layers
1. Data storage layer;
2. Data Processing layer;
3. Data access layer;
4. Data management layer.

Pic 4.2. Hadoop ecosystem.


Source : Big Data Concepts, Technology, and Architecture. 2021
HADOOP INFRASTRUCTURE LAYER
DEPLOYMENT MODELS

o On-premise (full custom, bare metal)
o Hadoop appliance
o Hadoop hosting
o Hadoop on the cloud
o Hadoop-as-a-Service (cloud)
VIRTUAL HADOOP

Virtualization
• Pure physical to hypervisor
• Abstracts the physical hardware
• Provides virtual resources to higher-level services

A Private Cloud
• A collection of virtualized physical hardware
• Catalogs of software
• Owned and managed by the same company or group

A Public Cloud
• Like a private cloud, but owned by an outside entity
• Risks loss of control, intermingling of data, or other undesirable issues
STRENGTHS OF VM-HOSTED HADOOP

o A single image can be cloned, lowering operations costs.


o Hadoop clusters can be set up on demand.
o Physical infrastructure can be reused.
o You only pay for the CPU time you need.
o The cluster size can be expanded or contracted on demand.
BENEFITS OF A PRIVATE CLOUD HADOOP

o A cluster can be set up in minutes


o It can flexibly use a variety of hardware (DAS, SAN, NAS)
o It is cost effective (lower capital expenses than physical deployment and lower
operating expenses than public cloud deployment)
o Streamlined management tools lower the complexity of initial configuration and
maintenance
o High availability and fault tolerance increase uptime
ADVANTAGES OF A PUBLIC CLOUD
HADOOP

o If you use a turnkey solution or Hadoop-as-a-Service, there is very little setup to


perform.
o Hadoop-as-a-Service requires no maintenance.
o If you lack the on-premise computing power to host a Hadoop cluster big
enough to meet your needs, running Hadoop in the cloud will give you what you
want without requiring new hardware purchases.
o When using Hadoop in the cloud, you generally pay only for the time you use.
That beats paying to maintain local Hadoop servers 24/7 if you only use them
some of the time.
o If the data you analyze is stored in the cloud, running Hadoop in the same cloud
eliminates the need to perform large data transfers over the network when
ingesting data into Hadoop.
6 REASONS WHY HADOOP ON THE CLOUD
6 REASONS WHY HADOOP ON THE CLOUD

o Lowering the cost of innovation


o Procuring large scale resources quickly
o Handling Batch Workloads Efficiently
o Handling Variable Resource Requirements
o Running Closer to the Data
o Simplifying Hadoop Operations
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE (1/2)
o A large cluster of physical servers, which may reboot but generally recover, with all their
local on-server HDD storage.
o Non-RAIDed hard disks in the servers. This is the lowest cost per terabyte of any storage.
It has good (local) bandwidth when retrieving sequential data, but once the disks start
seeking for the next blocks, performance suffers badly.
o Dedicated CPUs; the CPU types are known, and clusters are (usually) built from
homogeneous hardware.
o Servers with monotonically increasing clocks, roughly synchronized via an NTP server. That
is: time goes forward, on all servers simultaneously.
o A dedicated network with exclusive use of a high-performance switch, fast 1-10 Gb/s server
Ethernet, and a faster 10+ Gb/s "backplane" interconnect between racks.
o A relatively static data network topology: DataNodes do not move around.
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE (2/2)
o Exclusive use of the network by trusted users.
o High performance infrastructure services (DNS, reverse DNS, NFS storage for
NameNode snapshots)
o The primary failure modes of machines are HDD failures, recurring memory
failures, or overheating damage caused by fan failures.
o Machine failures are normally independent, with the exception of the failure of Top
of Rack switches, which can take a whole rack offline. Router/Switch misconfiguration
can have a similar effect.
o If the entire datacenter restarts, almost all the machines will come back up along
with their data.
HADOOP’S IMPLEMENTATION
HADOOP’S IMPLEMENTATION DETAILS
(1/2)
o HDFS uses local disks for storage, replicating data across machines.
o The MR engine scheduler assumes that the Hadoop workload has exclusive use of
the server and tries to keep the disks and CPU as busy as possible.
o Leases and timeouts are based on local clocks, not complex distributed system clocks
such as Lamport Clocks. That is in the Hadoop layer, and in the entire network stack -
TCP also uses local clocks.
o Topology scripts can be written to describe the network topology; these are used to
place data and work.
o Data is usually transmitted between machines unencrypted.
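The topology scripts mentioned above can be written in any language: Hadoop invokes the script named by the `topology.script.file.name` property (`net.topology.script.file.name` in Hadoop 2+) with host names or IP addresses as arguments and expects one rack path per argument on stdout. A minimal sketch in Python, with a purely hypothetical subnet-to-rack mapping:

```python
#!/usr/bin/env python3
# Minimal topology-script sketch: Hadoop passes one or more host names
# or IPs as command-line arguments and reads one rack path per argument
# from stdout.
import sys

# Hypothetical mapping; a real deployment would derive this from its
# own inventory (subnets, CMDB, switch configuration, etc.).
RACK_BY_SUBNET = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
}

def rack_of(host: str) -> str:
    # Treat the first three octets as the "subnet" key, e.g.
    # "10.1.1.42" -> "10.1.1"; unknown hosts fall back to /default-rack.
    subnet = host.rsplit(".", 1)[0]
    return RACK_BY_SUBNET.get(subnet, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_of(host))
```

Because HDFS uses the returned rack paths for both replica placement and task placement, a wrong mapping is worse than no script at all, which is why the script only matters where the real topology is actually known.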
HADOOP’S IMPLEMENTATION DETAILS
(2/2)
o Code running on machines in the cluster (including user-supplied MR jobs) can usually be
assumed not to be deliberately malicious, except in secure setups.
o Missing hard disks are usually missing because they have failed, so the data stored on them
should be replicated and the disk left alone.
o Servers that are consistently slow to complete jobs should be blacklisted: no new work should
be sent to them.
o The JobTracker should try and keep the cluster as busy as possible, to maximize ROI on the
servers and datacenter.
o When a JobTracker has no work to perform, the servers are left idle.
o If the entire datacenter restarts, the filesystem can recover, provided you have set up the
NameNode and Secondary NameNode properly.
JOBTRACKER

o Service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster,
ideally the nodes that have the data, or at least are in the same rack.
• Client applications submit jobs to the Job tracker.
• The JobTracker talks to the NameNode to determine the location of the data
• The JobTracker locates TaskTracker nodes with available slots at or near the data
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they are deemed to have failed, and the work is scheduled on a different
TaskTracker.
• A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what
to do then: it may resubmit the job elsewhere, it may mark that specific record as
something to avoid, and it may even blacklist the TaskTracker as unreliable.
• When the work is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for information.
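The heartbeat-and-reschedule behavior described above can be sketched as a toy model. This is not the real JobTracker API; the class, method names, and timeout value are all illustrative:

```python
# Sketch of the JobTracker's heartbeat bookkeeping (simplified model):
# trackers that miss heartbeats past a timeout are deemed failed, and
# their in-flight tasks become eligible for rescheduling elsewhere.
HEARTBEAT_TIMEOUT = 600.0  # seconds; hypothetical value

class JobTrackerSketch:
    def __init__(self):
        self.last_heartbeat = {}   # tracker name -> last heartbeat time
        self.running_tasks = {}    # tracker name -> set of task ids

    def heartbeat(self, tracker, now):
        # Called whenever a TaskTracker reports in.
        self.last_heartbeat[tracker] = now

    def assign(self, tracker, task_id):
        # Record that a task is running on this tracker.
        self.running_tasks.setdefault(tracker, set()).add(task_id)

    def failed_trackers(self, now):
        # A tracker is deemed failed once its heartbeat is too old.
        return [t for t, last in self.last_heartbeat.items()
                if now - last > HEARTBEAT_TIMEOUT]

    def tasks_to_reschedule(self, now):
        # Collect tasks from failed trackers so they can be resubmitted
        # on healthy nodes.
        tasks = []
        for t in self.failed_trackers(now):
            tasks.extend(self.running_tasks.pop(t, set()))
        return tasks
```

The key property the sketch preserves is that failure detection is purely heartbeat-driven: the JobTracker never probes a tracker, it only notices silence.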
TASKTRACKER

o A node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from
a JobTracker.
o Every TaskTracker is configured with a set of slots, which indicate the number of tasks
that it can accept.
o When the JobTracker tries to find somewhere to schedule a task within the
MapReduce operations, it first looks for an empty slot on the same server that hosts
the DataNode containing the data, and if not, it looks for an empty slot on a machine
in the same rack.
o It spawns a separate JVM process to do the actual work; this ensures that a
process failure does not take down the TaskTracker.
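The locality preference described above (node-local first, then rack-local, then anywhere) can be sketched as a small selection function. This is a simplified model, not Hadoop's actual scheduler; the dictionary shape of the tracker records is an assumption for illustration:

```python
# Sketch of locality-aware slot selection: prefer a free slot on a node
# that holds the data block, then a free slot in the same rack, then
# any free slot at all.
def pick_tracker(trackers, block_hosts, block_rack):
    """trackers: list of dicts with 'host', 'rack', 'free_slots' keys
    (hypothetical shape). block_hosts: set of hosts holding the block.
    block_rack: rack path of the block's replicas."""
    candidates = [t for t in trackers if t["free_slots"] > 0]
    for t in candidates:
        if t["host"] in block_hosts:       # node-local: best case
            return t["host"]
    for t in candidates:
        if t["rack"] == block_rack:        # rack-local: second best
            return t["host"]
    # Off-rack fallback: any tracker with a free slot, else nothing.
    return candidates[0]["host"] if candidates else None
```

Note how the rack-local tier depends entirely on the topology information discussed earlier: without a topology script, every non-local assignment is effectively random.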
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (1/3)
o Storage could be one or more of transient virtual drives, transient local physical
drives, persistent local virtual drives, or remote SAN-mounted block stores or file
systems.
o Storage in virtual hard drives might cause a lot of seeking if they share the same
physical hard drive, even when access appears sequential to the VM.
o Networking may be slower and throttled by the infrastructure provider.
o Virtual Machines are requested on demand from the infrastructure: the machines
could be allocated anywhere in the infrastructure, possibly on servers running other
VMs at the same time.
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (2/3)

o The other VMs may be heavy resource (CPU, IO and network) users, which could
cause the Hadoop jobs to suffer. OTOH, the heavy load of Hadoop could cause
problems for the other users of the server, if the underlying hypervisor lacks
proper isolation features and/or policies.
o VMs can be suspended and restarted without OS notification; this can cause
clocks to move forward in jumps of many seconds.
o If the Hadoop clusters share the VLAN with other users (which is not
recommended), other users on the network may be able to listen to traffic, to
disrupt it, and to access ports that are not authenticating all access.
o Some infrastructures may move VMs around; this can actually move clocks
backwards when the new physical host's clock is behind that of the original host.
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (3/3)

o Replication to transient hard drives is no longer a reliable way to persist data.


o On some cloud providers, the network topology may not be visible to the Hadoop
cluster, though latency and bandwidth tests may be used to infer "closeness" and
build a de-facto topology.
o The correct way to deal with a VM that shows recurring failures is to
release the VM and ask for a new one, instead of blacklisting it.
o The JobTracker may want to request extra VMs when there is extra demand.
o The JobTracker may want to release VMs when there is idle time.
o Like all hosted services, a failure of the hosting infrastructure could lose all
machines simultaneously though not necessarily permanently.
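The earlier point about inferring "closeness" from latency tests can be sketched as a simple grouping pass over pairwise measurements. The threshold, data shape, and algorithm are illustrative assumptions, not anything Hadoop ships:

```python
# Sketch of inferring a de-facto "rack" grouping from pairwise latency
# when the cloud hides physical topology: hosts within a latency
# threshold of a seed host are grouped together.
def infer_groups(latency_ms, threshold=0.5):
    """latency_ms: dict mapping (a, b) host pairs to measured latency
    in milliseconds (hypothetical measurement format)."""
    hosts = sorted({h for pair in latency_ms for h in pair})
    groups, assigned = [], set()
    for seed in hosts:
        if seed in assigned:
            continue
        group = {seed}
        for other in hosts:
            if other not in assigned and other != seed:
                # Measurements may be recorded in either pair order.
                lat = latency_ms.get((seed, other),
                                     latency_ms.get((other, seed)))
                if lat is not None and lat <= threshold:
                    group.add(other)
        assigned |= group
        groups.append(sorted(group))
    return groups
```

The resulting groups could then feed a topology script of the kind shown earlier, but any such inferred topology is a heuristic and can be invalidated whenever the provider migrates VMs.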
VIRTUAL INFRASTRUCTURE IMPLICATIONS
VIRTUAL INFRASTRUCTURE
IMPLICATIONS (1/2)
o When you request a VM, its performance may vary from previous requests
(when isolation features/policies are missing). This can be due to CPU differences or
to other workloads.
o There is no point writing topology scripts if the cloud vendor doesn't expose the
physical topology to you in some way. OTOH, Project Serengeti configures the
topology script automatically for Apache Hadoop 1.2+ on vSphere.
o All network ports must be closed by way of firewall and routing information,
apart from those ports critical for Hadoop, which must then run with security on.
o All data you wish to keep must be kept on permanent storage: mounted block
stores, remote filesystems or external databases. This goes for both input and
output.
VIRTUAL INFRASTRUCTURE
IMPLICATIONS (2/2)
o People or programs need to track machine failures and react to them by releasing those
machines and requesting new ones.
o If the cluster is idle, some machines can be decommissioned.
o If the cluster is overloaded, some temporary TaskTracker-only servers can be brought up
for short periods of time and killed when no longer needed.
o If the cluster needs to be expanded for a longer duration, worker nodes acting as both a
DataNode and TaskTracker can be brought up.
o If the entire cluster goes down or restarts, all transient hard disks will be lost (some cloud
vendors treat VM disks as transient and provide a separate reliable storage service, while
others do not; this note applies only to the former), and all data stored within the HDFS
cluster with it.
HADOOP ON CLOUD INFRASTRUCTURES REASON
HADOOP ON CLOUD
INFRASTRUCTURES REASON (1/2)
o For private cloud, where the admins can properly provision virtual infrastructure for Hadoop:
• HDFS is as reliable and efficient as in a physical deployment, with dedicated and/or shared
local storage depending on the isolation requirements.
• Virtualization can provide higher hardware utilization by consolidating multiple Hadoop
clusters and other workloads on the same physical cluster.
• Higher performance for some workloads (including TeraSort) than physical for multi-CPU-
socket machines (typically recommended for Hadoop deployment), due to better NUMA
control at the hypervisor layer and reduced OS cache and IO contention with multiple VMs
per host, compared with a physical deployment where there is only one OS per host.
• Per tenant VLAN (VXLAN) can provide better security than typical shared physical
Hadoop cluster, especially for YARN (in Hadoop 2+), where new non-MR workloads pose
challenges to security.
HADOOP ON CLOUD
INFRASTRUCTURES REASON (2/2)
o Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoop is
compelling.
o Using Apache Hadoop as your MapReduce infrastructure gives you Cloud vendor
independence, and the option of moving to a permanent physical deployment later.
o It is the only way to execute the tools that work with Hadoop and the layers above it
in a Cloud environment.
o If you store your persistent data in a cloud-hosted storage infrastructure, analyzing
the data in the provider's computation infrastructure is the most cost-effective way
to do so.
HOSTING ON LOCAL VMS
HOSTING ON LOCAL VMS

o This is a good tactic if your physical machines run Windows and you need to bring up
a Linux system running Hadoop, and/or you want to simulate the complexity of a
small Hadoop cluster.
• Have enough RAM for the VM to not swap.
• Don't try to run more than one VM per physical host with fewer than 2 CPU
cores or limited memory; it will only make things slower.
• Use host shared folders to access persistent input and output data.
• Consider making the default filesystem a file: URL so that all storage is really on
the physical host. It's often faster (for Linux guests) and preserves data better.
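The last point, making the default filesystem a file: URL, can be expressed as a core-site.xml fragment. The property name `fs.defaultFS` is the Hadoop 2+ form (Hadoop 1.x used `fs.default.name`); the fragment below is a sketch, not a complete configuration:

```xml
<!-- core-site.xml fragment: use the local filesystem of the physical
     host as the default filesystem instead of HDFS. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>
```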
THANK YOU...
SUMMARY
o Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
o Why is Hadoop important? Ability to store and process huge amounts of any
kind of data quickly, computing power, fault tolerance, flexibility, low cost, and
scalability.
o Hadoop ecosystem comprises four different layers: data storage layer, data
processing layer, data access layer, and data management layer.
o You can bring up Hadoop in virtualized infrastructures with many benefits.
Sometimes it even makes sense for public cloud, for development and
production. For production use, be aware that the differences between physical
and virtual infrastructures could pose additional gotchas to your data integrity
and security without proper planning and provisioning.
REFERENCES

o Balusamy, Balamurugan, Abirami, Nandhini, Kadry, Seifedine, & Gandomi, Amir H. (2021). Big Data
Concepts, Technology, and Architecture. 1st Edition. Wiley. ISBN: 978-1-119-70182-8. Chapter 5.
o Arshdeep Bahga & Vijay Madisetti. (2016). Big Data Science & Analytics: A Hands-On Approach. 1st Edition.
VPT. India. ISBN: 9781949978001. Chapter 3.
o Accenture Technology Labs. 2014. Cloud-based Hadoop Deployments: Benefits and Considerations.
https://www.yumpu.com/en/document/view/30180663/accenture-cloud-based-hadoop-
deployments-benefits-and-considerations/21
o https://www.youtube.com/watch?v=aReuLtY0YMI&t=78s
o https://www.youtube.com/watch?v=mzZj2AJ6uz8
o https://www.sas.com/en_us/insights/big-data/hadoop.html
o https://cwiki.apache.org/confluence/display/HADOOP2/Virtual+Hadoop
o https://cwiki.apache.org/confluence/display/HADOOP2/JobTracker
o https://www.thoughtworks.com/insights/blog/6-reasons-why-hadoop-cloud-makes-sense
o https://blog.syncsort.com/2017/06/big-data/5-reasons-hadoop-in-the-cloud/
o https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
o Webster, C., 2015. Hadoop Virtualization. O’Reilly Media, Inc.,
