PPT04-Hadoop Infrastructure Layer
TOPIC 4
HADOOP INFRASTRUCTURE LAYER
LEARNING OUTCOMES
Students are able to describe the big data architecture layers and processing concepts
OUTLINE
1. Hadoop Architecture
2. Hadoop Infrastructure Layer
3. 6 Reasons Why Hadoop on the Cloud Makes Sense
4. Hadoop’s Assumptions about its Infrastructure
5. Hadoop’s Implementation
6. Virtual Infrastructure VS Physical Datacenter
7. Virtual Infrastructure Implications
8. Reasons for Hadoop on Cloud Infrastructures
9. Hosting on local VMs
HADOOP ARCHITECTURE
HADOOP
o Ability to store and process huge amounts of any kind of data, quickly.
o Computing power.
o Fault tolerance.
o Flexibility.
o Low cost.
o Scalability.
HADOOP ARCHITECTURE
o Hadoop Appliance
o Hadoop Hosting
o Hadoop-as-a-service
o Cloud
VIRTUAL HADOOP
JOBTRACKER
o The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific
nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
• Client applications submit jobs to the JobTracker.
• The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at or near the data.
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they are deemed to have failed, and the work is scheduled on a different
TaskTracker.
• A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what
to do then: it may resubmit the job elsewhere, it may mark that specific record as
something to avoid, and it may even blacklist the TaskTracker as unreliable.
• When the work is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for information.
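The heartbeat monitoring and blacklisting steps above can be illustrated with a toy model. This is a minimal sketch, not Hadoop's actual implementation: the class `ToyJobTracker`, its methods, and the two threshold values are hypothetical names and numbers chosen for illustration only.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 10.0  # assumed: seconds of silence before a tracker is presumed failed
MAX_FAILURES = 3          # assumed: task failures tolerated before blacklisting


@dataclass
class TrackerState:
    last_heartbeat: float
    failures: int = 0
    blacklisted: bool = False
    tasks: set = field(default_factory=set)


class ToyJobTracker:
    """Hypothetical, heavily simplified stand-in for the JobTracker's bookkeeping."""

    def __init__(self):
        self.trackers = {}

    def heartbeat(self, name, now):
        # Register the tracker on first contact, then refresh its timestamp.
        state = self.trackers.setdefault(name, TrackerState(last_heartbeat=now))
        state.last_heartbeat = now

    def assign(self, task, name):
        self.trackers[name].tasks.add(task)

    def report_failure(self, task, name):
        # A TaskTracker reports a failed task; repeated failures blacklist it.
        state = self.trackers[name]
        state.tasks.discard(task)
        state.failures += 1
        if state.failures >= MAX_FAILURES:
            state.blacklisted = True
        return task  # the caller reschedules it elsewhere

    def expire(self, now):
        """Drop trackers whose heartbeats stopped and return their orphaned tasks."""
        orphaned = []
        for name, state in list(self.trackers.items()):
            if now - state.last_heartbeat > HEARTBEAT_TIMEOUT:
                orphaned.extend(sorted(state.tasks))
                del self.trackers[name]
        return orphaned

    def usable(self):
        return sorted(n for n, s in self.trackers.items() if not s.blacklisted)


jt = ToyJobTracker()
jt.heartbeat("tt1", now=0.0)
jt.heartbeat("tt2", now=0.0)
jt.assign("map-0", "tt1")
jt.heartbeat("tt2", now=8.0)         # tt1 stays silent
lost = jt.expire(now=12.0)           # tt1 has exceeded the timeout
for task in lost:
    jt.assign(task, jt.usable()[0])  # reschedule on a surviving tracker
```

In this run `tt1` misses its heartbeat window, so its task `map-0` is handed to `tt2`. A real scheduler would also weigh data locality when choosing the replacement tracker.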
TASKTRACKER
o A node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from
a JobTracker.
o Every TaskTracker is configured with a set of slots; these indicate the number of tasks
that it can accept.
o When the JobTracker tries to find somewhere to schedule a task within the
MapReduce operations, it first looks for an empty slot on the same server that hosts
the DataNode containing the data, and if not, it looks for an empty slot on a machine
in the same rack.
o Spawns separate JVM processes to do the actual work; this is to ensure that
process failure does not take down the TaskTracker.
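The slot search described above (same server as the data first, then same rack) can be sketched as a ranking over trackers with free slots. This is an illustrative sketch under stated assumptions: the `Tracker` class and `pick_tracker` function are hypothetical, and the final "any rack" fallback is an assumption added so the function always has an answer when some slot is free.

```python
from dataclasses import dataclass


@dataclass
class Tracker:
    name: str        # host name, assumed to match the DataNode's host name
    rack: str
    free_slots: int


def pick_tracker(trackers, data_hosts):
    """Pick a tracker for a task whose input block is stored on `data_hosts`.

    Preference order: same host as the data, then same rack, then anywhere.
    Returns None when no tracker has a free slot.
    """
    candidates = [t for t in trackers if t.free_slots > 0]
    if not candidates:
        return None
    # Racks that hold a copy of the data, even if those hosts are fully busy.
    data_racks = {t.rack for t in trackers if t.name in data_hosts}

    def locality(t):
        if t.name in data_hosts:
            return 0  # node-local
        if t.rack in data_racks:
            return 1  # rack-local
        return 2      # off-rack (assumed fallback)

    chosen = min(candidates, key=locality)
    chosen.free_slots -= 1  # the task now occupies one slot
    return chosen


cluster = [
    Tracker("node1", "rackA", free_slots=0),
    Tracker("node2", "rackA", free_slots=2),
    Tracker("node3", "rackB", free_slots=1),
]
first = pick_tracker(cluster, data_hosts={"node1"})   # node1 is full, so fall back to its rack
second = pick_tracker(cluster, data_hosts={"node3"})  # node3 has a free slot and holds the data
```

Here the first task lands on `node2` (rack-local) because `node1` has no free slot, while the second runs directly on `node3` (node-local).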
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (1/3)
o Storage could be one or more of transient virtual drives, transient local physical
drives, persistent local virtual drives, or remote SAN-mounted block stores or file
systems.
o Storage in virtual hard drives might cause a lot of seeking if they share the same
physical hard drive, even if it appears to be sequential access to the VM.
o Networking may be slower and throttled by the infrastructure provider.
o Virtual Machines are requested on demand from the infrastructure: the machines
could be allocated anywhere in the infrastructure, possibly on servers running other
VMs at the same time.
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (2/3)
o The other VMs may be heavy resource (CPU, IO and network) users, which could
cause the Hadoop jobs to suffer. On the other hand, the heavy load of Hadoop could
cause problems for the other users of the server, if the underlying hypervisor lacks
proper isolation features and/or policies.
o VMs could be suspended and restarted without OS notification; this can cause
clocks to move forward in jumps of many seconds.
o If the Hadoop clusters share the VLAN with other users (which is not
recommended), other users on the network may be able to listen to traffic, to
disrupt it, and to access ports that are not authenticating all access.
o Some infrastructures may move VMs around; this can actually move clocks
backwards when the new physical host's clock is behind that of the original host.
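Clock jumps matter for anything that measures elapsed time, such as a heartbeat timeout. The sketch below shows one common mitigation: measuring intervals with a monotonic clock (`time.monotonic` in Python), which is documented never to go backwards. The `HeartbeatTimer` class and the simulated clock readings are hypothetical and are not taken from Hadoop.

```python
import time


class HeartbeatTimer:
    """Tracks time since the last heartbeat using an injectable clock."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last = clock()

    def beat(self):
        self.last = self.clock()

    def expired(self):
        return self.clock() - self.last > self.timeout_s


def make_clock(readings):
    """Return a fake clock that replays the given readings (for simulation only)."""
    it = iter(readings)
    return lambda: next(it)


# Simulated scenario: 15 s really pass, but the VM migrates to a host whose
# wall clock is 20 s behind, so the wall-clock reading goes from 100 to 95.
wall_timer = HeartbeatTimer(timeout_s=10, clock=make_clock([100.0, 95.0]))
mono_timer = HeartbeatTimer(timeout_s=10, clock=make_clock([100.0, 115.0]))

wall_missed = not wall_timer.expired()  # elapsed looks like -5 s: the timeout is missed
mono_caught = mono_timer.expired()      # elapsed is 15 s: the timeout is detected
```

A monotonic clock only guarantees that readings never decrease. It does not by itself reveal how long a VM was suspended, so forward jumps may still need separate handling.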
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (3/3)
o Hosting Hadoop on local VMs is a good tactic if your physical machines run Windows and
you need to bring up a Linux system running Hadoop, and/or you want to simulate the
complexity of a small Hadoop cluster.
• Have enough RAM for the VM to not swap.
• Don't try to run more than one VM per physical host with fewer than 2 CPU
cores or limited memory; it will only make things slower.
• Use host shared folders to access persistent input and output data.
• Consider making the default filesystem a file: URL so that all storage is really on
the physical host. It's often faster (for Linux guests) and preserves data better.
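As a rough illustration of the last tip, the default filesystem is set in Hadoop's `core-site.xml`. The sketch below generates a minimal file with that one property. It assumes the property name `fs.defaultFS` used by Hadoop 2 and later (older 1.x releases used `fs.default.name`); the helper `core_site_xml` is a hypothetical name, and a real configuration would normally carry more properties.

```python
import xml.etree.ElementTree as ET


def core_site_xml(default_fs="file:///"):
    """Build a minimal core-site.xml body that sets the default filesystem."""
    conf = ET.Element("configuration")
    prop = ET.SubElement(conf, "property")
    # Assumed property name for Hadoop 2+; Hadoop 1.x used fs.default.name.
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = default_fs
    return ET.tostring(conf, encoding="unicode")


xml_text = core_site_xml()
print(xml_text)
```

With a `file:` URL as the default filesystem, job input and output paths resolve against the local (or host-shared) filesystem rather than HDFS running inside the VM.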
SUMMARY
o Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
o Why is Hadoop important? It offers the ability to store and process huge amounts of
any kind of data quickly, along with computing power, fault tolerance, flexibility,
low cost, and scalability.
o The Hadoop ecosystem comprises four layers: the data storage layer, the data
processing layer, the data access layer, and the data management layer.
o You can bring up Hadoop in virtualized infrastructures with many benefits.
Sometimes it even makes sense for public cloud, for development and
production. For production use, be aware that the differences between physical
and virtual infrastructures could pose additional gotchas to your data integrity
and security without proper planning and provisioning.