PPT04-Hadoop Infrastructure Layer

The document describes Hadoop's infrastructure layer and processing concepts. It discusses Hadoop architecture, deployment models including on-premise, virtual, private cloud and public cloud. It outlines Hadoop's assumptions about its infrastructure including using local disks, non-RAID storage, and assumptions about network topology. It also describes Hadoop implementation details like how the JobTracker and TaskTracker work to process jobs in a distributed manner.

COMP6725 - Big Data Technologies

TOPIC 4
HADOOP INFRASTRUCTURE LAYER
LEARNING OUTCOMES

At the end of this session, students will be able to:


o LO1. Describe big data architecture layers and processing concepts
OUTCOMES

Students are able to describe big data architecture layers and processing concepts
OUTLINE

1. Hadoop Architecture
2. Hadoop Infrastructure Layer
3. 6 Reasons Why Hadoop on the Cloud
4. Hadoop’s Assumptions about its Infrastructure
5. Hadoop’s Implementation
6. Virtual Infrastructure VS Physical DataCenter
7. Virtual Infrastructure Implications
8. Hadoop on Cloud Infrastructures Reason
9. Hosting on local VMs
HADOOP ARCHITECTURE
HADOOP

o Hadoop is an open-source software framework for storing data and running


applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
o Apache Hadoop is an open-source framework written in Java that supports
processing of large data sets in a streaming access pattern across clusters in a
distributed computing environment. It can store a large volume of structured,
semi-structured, and unstructured data in a distributed file system (DFS) and
process it in parallel.
WHY IS HADOOP IMPORTANT?

o Ability to store and process huge amounts of any kind of data, quickly.
o Computing power.
o Fault tolerance.
o Flexibility.
o Low cost.
o Scalability.
HADOOP ARCHITECTURE

Pic 4.1. Hadoop architecture.


Source : Big Data Concepts, Technology, and Architecture. 2021
HADOOP ECOSYSTEM

Hadoop ecosystem comprises four


different layers
1. Data storage layer;
2. Data Processing layer;
3. Data access layer;
4. Data management layer.

Pic 4.2. Hadoop ecosystem.


Source : Big Data Concepts, Technology, and Architecture. 2021
HADOOP INFRASTRUCTURE LAYER
DEPLOYMENT MODELS

o On-premise (full custom, bare metal)
o Hadoop appliance
o Hadoop hosting
o Hadoop on the cloud
o Hadoop-as-a-Service (cloud)
VIRTUAL HADOOP

Virtualization
• Pure physical to hypervisor
• Abstracts the physical hardware
• Provides virtual resources to higher-level services

A Private Cloud
• A collection of virtualized physical hardware
• Catalogs of software
• Owned and managed by the same company or group

A Public Cloud
• Like a private cloud, but owned by an outside entity
• Risks loss of control, intermingling of data, or other undesirable issues
STRENGTHS OF VM-HOSTED HADOOP

o A single image can be cloned, lowering operations costs.


o Hadoop clusters can be set up on demand.
o Physical infrastructure can be reused.
o You only pay for the CPU time you need.
o The cluster size can be expanded or contracted on demand.
BENEFITS OF A PRIVATE CLOUD HADOOP

o A cluster can be set up in minutes


o It can flexibly use a variety of hardware (DAS, SAN, NAS)
o It is cost effective (lower capital expenses than physical deployment and lower
operating expenses than public cloud deployment)
o Streamlined management tools lower the complexity of initial configuration and
maintenance
o High availability and fault tolerance increase uptime
ADVANTAGES OF A PUBLIC CLOUD
HADOOP

o If you use a turnkey solution or Hadoop-as-a-Service, there is very little setup to


perform.
o Hadoop-as-a-Service requires no maintenance.
o If you lack the on-premise computing power to host a Hadoop cluster big
enough to meet your needs, running Hadoop in the cloud will give you what you
want without requiring new hardware purchases.
o When using Hadoop in the cloud, you generally pay only for the time you use.
That beats paying to maintain local Hadoop servers 24/7 if you only use them
some of the time.
o If the data you analyze is stored in the cloud, running Hadoop in the same cloud
eliminates the need to perform large data transfers over the network when
ingesting data into Hadoop.
6 REASONS WHY HADOOP ON THE CLOUD
6 REASONS WHY HADOOP ON THE CLOUD

o Lowering the cost of innovation


o Procuring large scale resources quickly
o Handling Batch Workloads Efficiently
o Handling Variable Resource Requirements
o Running Closer to the Data
o Simplifying Hadoop Operations
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE (1/2)
o A large cluster of physical servers, which may reboot but generally recover, with all their
local on-server HDD storage.
o Non-RAIDed hard disks in the servers. This is the lowest cost per terabyte of any storage.
It has good (local) bandwidth when retrieving sequential data, but once the disks start
seeking for the next blocks, performance suffers badly.
o Dedicated CPUs; the CPU types are known, and clusters are (usually) built from
homogeneous hardware.
o Servers with monotonically increasing clocks, roughly synchronized via an NTP server. That
is: time goes forward, on all servers simultaneously.
o A dedicated network with exclusive use of a high-performance switch, fast 1-10 Gb/s server
Ethernet, and a faster 10+ Gb/s "backplane" interconnect between racks.
o A relatively static data network topology: DataNodes do not move around.
HADOOP’S ASSUMPTIONS ABOUT ITS
INFRASTRUCTURE (2/2)
o Exclusive use of the network by trusted users.
o High performance infrastructure services (DNS, reverse DNS, NFS storage for
NameNode snapshots)
o The primary failure modes of machines are HDD failures, recurring memory
failures, or overheating damage caused by fan failures.
o Machine failures are normally independent, with the exception of the failure of Top
of Rack switches, which can take a whole rack offline. Router/Switch misconfiguration
can have a similar effect.
o If the entire datacenter restarts, almost all the machines will come back up along
with their data.
HADOOP’S IMPLEMENTATION
HADOOP’S IMPLEMENTATION DETAILS
(1/2)
o HDFS uses local disks for storage, replicating data across machines.
o The MR engine scheduler assumes that the Hadoop workload has exclusive use of
the server and tries to keep the disks and CPU as busy as possible.
o Leases and timeouts are based on local clocks, not complex distributed system clocks
such as Lamport Clocks. That is in the Hadoop layer, and in the entire network stack -
TCP also uses local clocks.
o Topology scripts can be written to describe the network topology; these are used to
place data and work.
o Data is usually transmitted between machines unencrypted.
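The topology scripts mentioned above can be written in any language: Hadoop invokes the script named by the `topology.script.file.name` property (`net.topology.script.file.name` in Hadoop 2+) with host names or IP addresses as arguments and expects one rack path per argument on stdout. A minimal sketch in Python, with a purely hypothetical subnet-to-rack mapping:

```python
#!/usr/bin/env python3
# Minimal topology-script sketch: Hadoop passes one or more host names
# or IPs as command-line arguments and reads one rack path per argument
# from stdout.
import sys

# Hypothetical mapping; a real deployment would derive this from its
# own inventory (subnets, CMDB, switch configuration, etc.).
RACK_BY_SUBNET = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
}

def rack_of(host: str) -> str:
    # Treat the first three octets as the "subnet" key, e.g.
    # "10.1.1.42" -> "10.1.1"; unknown hosts fall back to /default-rack.
    subnet = host.rsplit(".", 1)[0]
    return RACK_BY_SUBNET.get(subnet, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_of(host))
```

Because HDFS uses the returned rack paths for both replica placement and task placement, a wrong mapping is worse than no script at all, which is why the script only matters where the real topology is actually known.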
HADOOP’S IMPLEMENTATION DETAILS
(2/2)
o Code running on machines in the cluster (including user-supplied MR jobs) can usually be
assumed not to be deliberately malicious, except in secure setups.
o Missing hard disks are usually missing because they have failed, so the data stored on them
should be replicated and the disk left alone.
o Servers that are consistently slow to complete jobs should be blacklisted: no new work should
be sent to them.
o The JobTracker should try and keep the cluster as busy as possible, to maximize ROI on the
servers and datacenter.
o When a JobTracker has no work to perform, the servers are left idle.
o If the entire datacenter restarts, the filesystem can recover, provided you have set up the
NameNode and Secondary NameNode properly.
JOBTRACKER

o Service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster,
ideally the nodes that have the data, or at least are in the same rack.
• Client applications submit jobs to the Job tracker.
• The JobTracker talks to the NameNode to determine the location of the data
• The JobTracker locates TaskTracker nodes with available slots at or near the data
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they are deemed to have failed, and the work is scheduled on a different
TaskTracker.
• A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what
to do then: it may resubmit the job elsewhere, it may mark that specific record as
something to avoid, and it may even blacklist the TaskTracker as unreliable.
• When the work is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for information.
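The heartbeat-and-reschedule behavior described above can be sketched as a toy model. This is not the real JobTracker API; the class, method names, and timeout value are all illustrative:

```python
# Sketch of the JobTracker's heartbeat bookkeeping (simplified model):
# trackers that miss heartbeats past a timeout are deemed failed, and
# their in-flight tasks become eligible for rescheduling elsewhere.
HEARTBEAT_TIMEOUT = 600.0  # seconds; hypothetical value

class JobTrackerSketch:
    def __init__(self):
        self.last_heartbeat = {}   # tracker name -> last heartbeat time
        self.running_tasks = {}    # tracker name -> set of task ids

    def heartbeat(self, tracker, now):
        # Called whenever a TaskTracker reports in.
        self.last_heartbeat[tracker] = now

    def assign(self, tracker, task_id):
        # Record that a task is running on this tracker.
        self.running_tasks.setdefault(tracker, set()).add(task_id)

    def failed_trackers(self, now):
        # A tracker is deemed failed once its heartbeat is too old.
        return [t for t, last in self.last_heartbeat.items()
                if now - last > HEARTBEAT_TIMEOUT]

    def tasks_to_reschedule(self, now):
        # Collect tasks from failed trackers so they can be resubmitted
        # on healthy nodes.
        tasks = []
        for t in self.failed_trackers(now):
            tasks.extend(self.running_tasks.pop(t, set()))
        return tasks
```

The key property the sketch preserves is that failure detection is purely heartbeat-driven: the JobTracker never probes a tracker, it only notices silence.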
TASKTRACKER

o A node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from
a JobTracker.
o Every TaskTracker is configured with a set of slots, which indicate the number of tasks
that it can accept.
o When the JobTracker tries to find somewhere to schedule a task within the
MapReduce operations, it first looks for an empty slot on the same server that hosts
the DataNode containing the data, and if not, it looks for an empty slot on a machine
in the same rack.
o It spawns a separate JVM process to do the actual work; this ensures that a
process failure does not take down the TaskTracker.
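The locality preference described above (node-local first, then rack-local, then anywhere) can be sketched as a small selection function. This is a simplified model, not Hadoop's actual scheduler; the dictionary shape of the tracker records is an assumption for illustration:

```python
# Sketch of locality-aware slot selection: prefer a free slot on a node
# that holds the data block, then a free slot in the same rack, then
# any free slot at all.
def pick_tracker(trackers, block_hosts, block_rack):
    """trackers: list of dicts with 'host', 'rack', 'free_slots' keys
    (hypothetical shape). block_hosts: set of hosts holding the block.
    block_rack: rack path of the block's replicas."""
    candidates = [t for t in trackers if t["free_slots"] > 0]
    for t in candidates:
        if t["host"] in block_hosts:       # node-local: best case
            return t["host"]
    for t in candidates:
        if t["rack"] == block_rack:        # rack-local: second best
            return t["host"]
    # Off-rack fallback: any tracker with a free slot, else nothing.
    return candidates[0]["host"] if candidates else None
```

Note how the rack-local tier depends entirely on the topology information discussed earlier: without a topology script, every non-local assignment is effectively random.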
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (1/3)
o Storage could be one or more of transient virtual drives, transient local physical
drives, persistent local virtual drives, or remote SAN-mounted block stores or file
systems.
o Storage in virtual hard drives might cause a lot of seeking if they share the same
physical hard drive, even when access appears sequential to the VM.
o Networking may be slower and throttled by the infrastructure provider.
o Virtual Machines are requested on demand from the infrastructure: the machines
could be allocated anywhere in the infrastructure, possibly on servers running other
VMs at the same time.
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (2/3)

o The other VMs may be heavy resource (CPU, IO and network) users, which could
cause the Hadoop jobs to suffer. OTOH, the heavy load of Hadoop could cause
problems for the other users of the server, if the underlying hypervisor lacks
proper isolation features and/or policies.
o VMs can be suspended and restarted without OS notification; this can cause
clocks to move forward in jumps of many seconds.
o If the Hadoop clusters share the VLAN with other users (which is not
recommended), other users on the network may be able to listen to traffic, to
disrupt it, and to access ports that are not authenticating all access.
o Some infrastructures may move VMs around; this can actually move clocks
backwards when the new physical host's clock is behind that of the original host.
VIRTUAL INFRASTRUCTURE VS PHYSICAL
DATACENTER (3/3)

o Replication to transient hard drives is no longer a reliable way to persist data.


o On some cloud providers, the network topology may not be visible to the Hadoop
cluster, though latency and bandwidth tests may be used to infer "closeness" and
build a de-facto topology.
o The correct way to deal with a VM that shows recurring failures is to
release the VM and ask for a new one, instead of blacklisting it.
o The JobTracker may want to request extra VMs when there is extra demand.
o The JobTracker may want to release VMs when there is idle time.
o Like all hosted services, a failure of the hosting infrastructure could lose all
machines simultaneously though not necessarily permanently.
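The earlier point about inferring "closeness" from latency tests can be sketched as a simple grouping pass over pairwise measurements. The threshold, data shape, and algorithm are illustrative assumptions, not anything Hadoop ships:

```python
# Sketch of inferring a de-facto "rack" grouping from pairwise latency
# when the cloud hides physical topology: hosts within a latency
# threshold of a seed host are grouped together.
def infer_groups(latency_ms, threshold=0.5):
    """latency_ms: dict mapping (a, b) host pairs to measured latency
    in milliseconds (hypothetical measurement format)."""
    hosts = sorted({h for pair in latency_ms for h in pair})
    groups, assigned = [], set()
    for seed in hosts:
        if seed in assigned:
            continue
        group = {seed}
        for other in hosts:
            if other not in assigned and other != seed:
                # Measurements may be recorded in either pair order.
                lat = latency_ms.get((seed, other),
                                     latency_ms.get((other, seed)))
                if lat is not None and lat <= threshold:
                    group.add(other)
        assigned |= group
        groups.append(sorted(group))
    return groups
```

The resulting groups could then feed a topology script of the kind shown earlier, but any such inferred topology is a heuristic and can be invalidated whenever the provider migrates VMs.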
VIRTUAL INFRASTRUCTURE IMPLICATIONS
VIRTUAL INFRASTRUCTURE
IMPLICATIONS (1/2)
o When you request a VM, its performance may vary from previous requests
(when isolation features/policies are missing). This can be due to CPU differences or
to other workloads.
o There is no point writing topology scripts if the cloud vendor doesn't expose the
physical topology to you in some way. OTOH, Project Serengeti configures the
topology script automatically for Apache Hadoop 1.2+ on vSphere.
o All network ports must be closed by way of firewall and routing information,
apart from those ports critical for Hadoop, which must then run with security on.
o All data you wish to keep must be kept on permanent storage: mounted block
stores, remote filesystems or external databases. This goes for both input and
output.
VIRTUAL INFRASTRUCTURE
IMPLICATIONS (2/2)
o People or programs need to track machine failures and react to them by releasing those
machines and requesting new ones.
o If the cluster is idle, some machines can be decommissioned.
o If the cluster is overloaded, some temporary TaskTracker-only servers can be brought up
for short periods of time and killed when no longer needed.
o If the cluster needs to be expanded for a longer duration, worker nodes acting as both a
DataNode and TaskTracker can be brought up.
o If the entire cluster goes down or restarts, all transient hard disks will be lost (some cloud
vendors treat VM disks as transient and provide a separate reliable storage service, while
others do not; this note applies only to the former), and all data stored within the HDFS
cluster with it.
HADOOP ON CLOUD INFRASTRUCTURES REASON
HADOOP ON CLOUD
INFRASTRUCTURES REASON (1/2)
o For private cloud, where the admins can properly provision virtual infrastructure for Hadoop:
• HDFS is as reliable and efficient as in a physical deployment, with dedicated and/or shared
local storage depending on the isolation requirements.
• Virtualization can provide higher hardware utilization by consolidating multiple Hadoop
clusters and other workloads on the same physical cluster.
• Higher performance for some workloads (including TeraSort) than physical for multi-CPU-
socket machines (typically recommended for Hadoop deployment), due to better NUMA
control at the hypervisor layer and reduced OS cache and IO contention with multiple VMs
per host, compared with a physical deployment where there is only one OS per host.
• Per tenant VLAN (VXLAN) can provide better security than typical shared physical
Hadoop cluster, especially for YARN (in Hadoop 2+), where new non-MR workloads pose
challenges to security.
HADOOP ON CLOUD
INFRASTRUCTURES REASON (2/2)
o Given the choice between a virtual Hadoop and no Hadoop, virtual Hadoop is
compelling.
o Using Apache Hadoop as your MapReduce infrastructure gives you Cloud vendor
independence, and the option of moving to a permanent physical deployment later.
o It is the only way to execute the tools that work with Hadoop and the layers above it
in a Cloud environment.
o If you store your persistent data in a cloud-hosted storage infrastructure, analyzing
the data in the provider's computation infrastructure is the most cost-effective way
to do so.
HOSTING ON LOCAL VMS
HOSTING ON LOCAL VMS

o This is a good tactic if your physical machines run Windows and you need to bring up
a Linux system running Hadoop, and/or you want to simulate the complexity of a
small Hadoop cluster.
• Have enough RAM for the VM to not swap.
• Don't try to run more than one VM per physical host with fewer than 2 CPU
cores or limited memory; it will only make things slower.
• Use host shared folders to access persistent input and output data.
• Consider making the default filesystem a file: URL so that all storage is really on
the physical host. It's often faster (for Linux guests) and preserves data better.
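The last point, making the default filesystem a file: URL, can be expressed as a core-site.xml fragment. The property name `fs.defaultFS` is the Hadoop 2+ form (Hadoop 1.x used `fs.default.name`); the fragment below is a sketch, not a complete configuration:

```xml
<!-- core-site.xml fragment: use the local filesystem of the physical
     host as the default filesystem instead of HDFS. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>
```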
THANK YOU...
SUMMARY
o Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for
any kind of data, enormous processing power and the ability to handle virtually
limitless concurrent tasks or jobs.
o Why is Hadoop important? Ability to store and process huge amounts of any
kind of data quickly, computing power, fault tolerance, flexibility, low cost, and
scalability.
o Hadoop ecosystem comprises four different layers: data storage layer, data
processing layer, data access layer, and data management layer.
o You can bring up Hadoop in virtualized infrastructures with many benefits.
Sometimes it even makes sense for public cloud, for development and
production. For production use, be aware that the differences between physical
and virtual infrastructures could pose additional gotchas to your data integrity
and security without proper planning and provisioning.
REFERENCES

o Balusamy, Balamurugan, Abirami, Nandhini, Kadry, Seifedine, & Gandomi, Amir H. (2021). Big Data
Concepts, Technology, and Architecture. 1st Edition. Wiley. ISBN: 978-1-119-70182-8. Chapter 5.
o Arshdeep Bahga & Vijay Madisetti. (2016). Big Data Science & Analytics: A Hands-On Approach. 1st Edition.
VPT. India. ISBN: 9781949978001. Chapter 3.
o Accenture Technology Labs. 2014. Cloud-based Hadoop Deployments: Benefits and Considerations.
https://www.yumpu.com/en/document/view/30180663/accenture-cloud-based-hadoop-
deployments-benefits-and-considerations/21
o https://www.youtube.com/watch?v=aReuLtY0YMI&t=78s
o https://www.youtube.com/watch?v=mzZj2AJ6uz8
o https://www.sas.com/en_us/insights/big-data/hadoop.html
o https://cwiki.apache.org/confluence/display/HADOOP2/Virtual+Hadoop
o https://cwiki.apache.org/confluence/display/HADOOP2/JobTracker
o https://www.thoughtworks.com/insights/blog/6-reasons-why-hadoop-cloud-makes-sense
o https://blog.syncsort.com/2017/06/big-data/5-reasons-hadoop-in-the-cloud/
o https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
o Webster, C., 2015. Hadoop Virtualization. O’Reilly Media, Inc.,
