CC Unit 5
5.1 HADOOP
• Hadoop is an Apache open-source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models.
• The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers.
• Hadoop is designed to scale up from a single server to thousands of machines,
each offering local computation and storage.
5.1.1 Hadoop Architecture
At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
[Figure: Hadoop architecture. Labels recovered: MapReduce (Distributed Computation), (Distributed Storage), Common, Name Node, Read.]
In addition to these two layers, the Hadoop framework includes two further
modules:
• Hadoop Common: Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN: a framework for job scheduling and cluster resource
management.
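The division of work in the MapReduce layer can be sketched in a few lines. The following is a minimal single-process illustration, not Hadoop's Java API: the function names map_words, reduce_counts, and run_mapreduce are hypothetical, and the "shuffle" step is simulated with an in-memory dictionary rather than a cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["big data big clusters", "data moves to compute"]))
```

In a real cluster the map and reduce calls would run on different machines against data stored in HDFS; here they run sequentially only to show the data flow.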
Hadoop Common
• Hadoop Common refers to the collection of common utilities and libraries that
support other Hadoop modules. It is an essential part or module of the Apache
Hadoop Framework, along with the Hadoop Distributed File System (HDFS),
Hadoop YARN and Hadoop MapReduce.
• Like all other modules, Hadoop Common assumes that hardware failures are
common and that these should be automatically handled in software by the
Hadoop Framework.
• Hadoop Common is also known as Hadoop Core.
Hadoop YARN
• The fundamental idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate daemons. The idea
is to have a global ResourceManager (RM) and a per-application
ApplicationMaster (AM). An application is either a single job or a DAG of
jobs.
• The ResourceManager and the NodeManager form the data-computation
framework.
• The ResourceManager is the ultimate authority that arbitrates resources among
all the applications in the system.
• The NodeManager is the per-machine framework agent that is responsible for
containers, monitoring their resource usage (CPU, memory, disk, network), and
reporting it to the ResourceManager/Scheduler.
• The per-application ApplicationMaster is, in effect, a framework-specific
library and is tasked with negotiating resources from the ResourceManager
and working with the NodeManager(s) to execute and monitor the tasks.
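The roles described above can be sketched as a toy simulation. This is an illustration under stated assumptions, not the YARN API: the class and method names are hypothetical, only memory is tracked, and the first-fit placement policy is chosen purely for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id: str, mb: int) -> str:
        """Reserve memory and return an identifier for the new container."""
        self.free_mb -= mb
        container_id = f"{app_id}@{self.name}#{len(self.containers)}"
        self.containers.append(container_id)
        return container_id


class ResourceManager:
    """Global arbiter: grants containers on nodes with spare capacity."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, mb: int):
        # First-fit policy, an assumption made only for this sketch.
        for node in self.nodes:
            if node.free_mb >= mb:
                return node.launch(app_id, mb)
        return None  # the request cannot be satisfied right now


class ApplicationMaster:
    """Per-application negotiator: requests one container per task."""

    def __init__(self, app_id: str, rm: ResourceManager):
        self.app_id = app_id
        self.rm = rm

    def run(self, task_sizes_mb):
        return [self.rm.allocate(self.app_id, mb) for mb in task_sizes_mb]
```

For example, an ApplicationMaster asking for four 512 MB containers on nodes with 1024 MB and 512 MB free would receive three containers and one refusal (None).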
[Figure: YARN architecture. Labels recovered: Client, Resource Manager, Node Manager, Container; arrows: MapReduce Status, Job Submission, Node Status, Resource Request.]
[Figure: MapReduce data flow. Labels recovered: Input (key, value) pairs, Copy, Part 0, HDFS, Replication.]
FIGURE 5.6 Google cloud platform and major building blocks. The blocks shown
are large clusters of low-cost servers.
• Google is one of the larger cloud application providers, although its
fundamental service program is private and outsiders cannot use the
Google infrastructure to build their own services.
• The building blocks of Google’s cloud computing application include the
Google File System for storing large amounts of data, the MapReduce
programming framework for application developers, Chubby for distributed
application lock services, and BigTable as a storage service for accessing
structured or semistructured data. With these building blocks, Google has built
many cloud applications.
• Figure 5.6 shows the overall architecture of the Google cloud infrastructure. A
typical cluster configuration can run the Google File System, MapReduce
jobs, and BigTable servers for structured data. Extra services such as Chubby
for distributed locks can also run in the clusters. GAE runs the user program
on Google’s infrastructure.
• As it is a platform running third-party programs, application developers now
do not need to worry about the maintenance of servers. GAE can be thought of
as the combination of several software components. The frontend is an
application framework which is similar to other web application frameworks
such as ASP, J2EE, and JSP. Applications run much as they would in a web
application container. The frontend can be used as the dynamic web serving
infrastructure, which provides full support for common technologies.
5.4.3 Functional Modules of GAE
• The GAE platform comprises the following five major components. The GAE
is not an infrastructure platform, but rather an application development
platform for users. We describe the component functionalities separately.
a. The datastore offers object-oriented, distributed, structured data storage
services based on BigTable techniques. The datastore secures data
management operations.
b. The application runtime environment offers a platform for scalable web
programming and execution. It supports two development languages:
Python and Java.
c. The software development kit (SDK) is used for local application
development. The SDK allows users to execute test runs of local
applications and upload application code.
d. The administration console is used for easy management of user application
development cycles, instead of for physical resource management.
e. The GAE web service infrastructure provides special interfaces to guarantee
flexible use and management of storage and network resources by GAE.
• Google offers essentially free GAE services to all Gmail account owners. You
can register for a GAE account or use your Gmail account name to sign up for
the service. The service is free within a quota. If you exceed the quota, the
page instructs you on how to pay for the service. Then you download the SDK
and read the Python or Java guide to get started.
• Note that GAE only accepts Python, Ruby, and Java programming languages.
The platform does not provide any IaaS services, unlike Amazon, which offers
IaaS and PaaS. This model allows the user to deploy user-built applications on
top of the cloud infrastructure that are built using the programming languages
and software tools supported by the provider (e.g., Java, Python). Azure does
this similarly for .NET. The user does not manage the underlying cloud
infrastructure. The cloud provider facilitates support of application
development, testing, and operation support on a well-defined service
platform.
5.4.4 GAE Applications
• Well-known GAE applications include the Google Search Engine, Google
Docs, Google Earth, and Gmail. These applications can support large numbers
of users simultaneously. Users can interact with Google applications via the
web interface provided by each application. Third-party application providers
can use GAE to build cloud applications for providing services. The
applications are all run in the Google data centers. Inside each data center,
there might be thousands of server nodes to form different clusters. (See the
previous section.) Each cluster can run multipurpose servers.
• GAE supports many web applications. One is a storage service to store
application-specific data in the Google infrastructure. The data can be
persistently stored in the backend storage server while still providing the
facility for queries, sorting, and even transactions similar to traditional
database systems.
• GAE also provides Google-specific services, such as the Gmail account
service (which is the login service, that is, applications can use the Gmail
account directly). This can eliminate the tedious work of building customized
user management components in web applications. Thus, web applications
built on top of GAE can use these APIs to authenticate users and send e-mail
using Google accounts.
5.5 PROGRAMMING SUPPORT OF GOOGLE APP ENGINE
5.5.1 Programming the Google App Engine
• Figure 5.7 summarizes some key features of the GAE programming model for two
supported languages: Java and Python. A client environment that includes an
Eclipse plug-in for Java allows you to debug your GAE application on your
local machine. Also, the Google Web Toolkit (GWT) is available for Java web
application developers. Developers can use this, or any other language using
a JVM-based interpreter or compiler, such as JavaScript or Ruby. Python is
often used with frameworks such as Django and CherryPy, but Google also
supplies a built-in webapp Python environment.
• There are several powerful constructs for storing and accessing data. The data
store is a NoSQL data management system for entities that can be, at most, 1
MB in size and are labeled by a set of schema-less properties. Queries can
retrieve entities of a given kind filtered and sorted by the values of the
properties.
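The entity-and-query model described above can be illustrated with a small in-memory stand-in. This is only a sketch: ToyDatastore and its methods are hypothetical names rather than the GAE datastore API, and the size check approximates an entity's size by its JSON encoding.

```python
import json

MAX_ENTITY_BYTES = 1_000_000  # stand-in for the "at most 1 MB" limit


class ToyDatastore:
    """In-memory model of schema-less entities grouped by kind."""

    def __init__(self):
        self._entities = []  # list of (kind, properties) pairs

    def put(self, kind: str, **properties) -> None:
        """Store one entity; each entity may carry different properties."""
        size = len(json.dumps(properties).encode("utf-8"))
        if size > MAX_ENTITY_BYTES:
            raise ValueError("entity exceeds size limit")
        self._entities.append((kind, properties))

    def query(self, kind, where=None, order_by=None):
        """Return entities of one kind, filtered and sorted by property values."""
        where = where or {}
        rows = [props for k, props in self._entities
                if k == kind
                and all(props.get(f) == v for f, v in where.items())]
        if order_by is not None:
            # Schema-less: skip entities that lack the sort property.
            rows = [r for r in rows if order_by in r]
            rows.sort(key=lambda r: r[order_by])
        return rows
```

A call such as query("Book", where={"year": 2010}, order_by="title") returns only "Book" entities whose year property equals 2010, ordered by title.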
• Java offers Java Data Objects (JDO) and Java Persistence API (JPA) interfaces
implemented by the open source DataNucleus Access Platform, while Python
has a SQL-like query language called GQL. The data store is strongly
consistent and uses optimistic concurrency control.
Table 5.1 Comparison of MapReduce++ Subcategories along with the Loosely
Synchronous Category Used in MPI
• An update of an entity occurs in a transaction that is retried a fixed number
of times if other processes are trying to update the same entity simultaneously.
Your application can execute multiple data store operations in a single
transaction which either all succeed or all fail together. The data store
implements transactions across its distributed network using “entity groups.”
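Optimistic concurrency with a bounded number of retries can be sketched as follows. This is a minimal illustration, not GAE's implementation: VersionedStore, ConflictError, and update_with_retry are hypothetical names, and the limit of three attempts is an assumed value.

```python
class ConflictError(Exception):
    """Raised when another writer committed first."""


class VersionedStore:
    """Key-value store that keeps a version number for every entity."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key has version 0."""
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        """Write only if nobody changed the entity since it was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(key)
        self._data[key] = (current_version + 1, new_value)


def update_with_retry(store, key, fn, max_attempts=3):
    """Apply fn to the current value, retrying a fixed number of times."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            store.commit(key, version, fn(value))
            return True
        except ConflictError:
            continue  # someone else won the race; re-read and try again
    return False
```

No lock is held while fn runs; a conflicting write is detected only at commit time, which is what makes the scheme optimistic.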
• A transaction manipulates entities within a single group. Entities of the same
group are stored together for efficient execution of transactions. Your GAE
application can assign entities to groups when the entities are created. The
performance of the data store can be enhanced by in-memory caching using
the memcache, which can also be used independently of the data store.
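One common way to pair an in-memory cache with a persistent store is the cache-aside pattern, sketched below. The class CachedReader is hypothetical and uses plain dictionaries in place of memcache and the data store; it is meant only to show the read and invalidate paths.

```python
class CachedReader:
    """Cache-aside reads: consult the cache first, then the backing store."""

    def __init__(self, backing_store: dict):
        self.backing_store = backing_store
        self.cache = {}
        self.store_reads = 0  # counts how often the slow path was taken

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no store access
        self.store_reads += 1
        value = self.backing_store.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.backing_store[key] = value
        self.cache.pop(key, None)  # invalidate so readers do not see stale data
```

Repeated reads of the same key touch the backing store once; a write invalidates the cached copy so the next read fetches the new value.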
• Recently, Google added the blobstore, which is suitable for large files as its
size limit is 2 GB. There are several mechanisms for incorporating external
resources. The Google Secure Data Connection (SDC) can tunnel through the
Internet and link your intranet to an external GAE application.
• The URL Fetch operation provides the ability for applications to fetch
resources and communicate with other hosts over the Internet using HTTP
and HTTPS requests. There is a specialized mail mechanism to send e-mail
from your GAE application.
• Applications can access resources on the Internet, such as web services or other
data, using GAE’s URL fetch service. The URL fetch service retrieves web
resources using the same high-speed Google infrastructure that retrieves web
pages for many other Google products. There are dozens of Google
“corporate” facilities including maps, sites, groups, calendar, docs, and
YouTube, among others. These support the Google Data API which can be
used inside GAE.
• An application can use Google Accounts for user authentication. Google
Accounts handles user account creation and sign-in, and a user that already has
a Google account (such as a Gmail account) can use that account with your
app. GAE provides the ability to manipulate image data using a dedicated
Images service which can resize, rotate, flip, crop, and enhance images. An
application can perform tasks outside of responding to web requests. Your
application can perform these tasks on a schedule that you configure, such as
on a daily or hourly basis using “cron jobs,” handled by the Cron service.
• Alternatively, the application can perform tasks added to a queue by the
application itself, such as a background task created while handling a request.
A GAE application is configured to consume resources up to certain limits or
quotas. With quotas, GAE ensures that your application won’t exceed your
budget, and that other applications running on GAE won’t impact the
performance of your app. In particular, GAE use is free up to certain quotas.
5.5.2 Google File System (GFS)
• GFS was built primarily as the fundamental storage service for Google’s
search engine. As the size of the web data that was crawled and saved was
quite substantial, Google needed a distributed file system to redundantly store
massive amounts of data on cheap and unreliable computers.
• None of the traditional distributed file systems can provide such functions and
hold such large amounts of data. In addition, GFS was designed for Google
applications, and Google applications were built for GFS. In traditional file
system design, such a philosophy is not attractive, as there
should be a clear interface between applications and the file system, such as
a POSIX interface.
• There are several assumptions concerning GFS. One is related to the
characteristic of the cloud computing hardware infrastructure (i.e., the high
component failure rate). As servers are composed of inexpensive commodity
components, it is the norm rather than the exception that concurrent failures
will occur all the time. Another concerns the file size in GFS. GFS typically
will hold a large number of huge files, each 100 MB or larger, with files that
are multiple GB in size quite common.
• Thus, Google has chosen its file data block size to be 64 MB instead of the
4 KB in typical traditional file systems. The I/O pattern in the Google
application is also special. Files are typically written once, and write
operations often append data blocks to the end of files. Multiple
appending operations might be concurrent. There will be a lot of large
streaming reads and only a little random access. As for large streaming reads,
highly sustained throughput is much more important than low latency.
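The effect of the larger block size on the number of blocks to track is simple arithmetic. The sketch below uses a hypothetical helper, block_count, and assumes binary units (1 MB = 1024 × 1024 bytes).

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold a file, rounding up."""
    return -(-file_bytes // block_bytes)  # integer ceiling division


if __name__ == "__main__":
    one_gb_file = 1 * GB
    print("64 MB blocks:", block_count(one_gb_file, 64 * MB))
    print("4 KB blocks: ", block_count(one_gb_file, 4 * KB))
```

A 1 GB file needs 16 blocks at 64 MB but 262,144 blocks at 4 KB, so the larger block size greatly reduces the number of per-block records that have to be kept.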
• Thus, Google made some special decisions regarding the design of GFS. As
noted earlier, a 64 MB block size was chosen. Reliability is achieved by using
replication (i.e., each chunk or data block of a file is replicated across at
least three chunk servers). A single master coordinates access as well as keeps
the metadata. This decision simplified the design and management of the
whole cluster.
• Developers do not need to consider many difficult issues in distributed
systems, such as distributed consensus. There is no data cache in GFS, as large
streaming reads and writes exhibit neither temporal nor spatial locality. GFS
provides a file system interface that is similar, but not identical, to POSIX.
The distinct difference is that the application can even see the physical
location of file blocks. Such a scheme can benefit upper-layer applications.
The customized API can simplify the problem and focus on Google applications.
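How a single master can map a byte offset to a chunk and its replica locations is sketched below. This is a toy model, not the GFS protocol: ToyMaster is a hypothetical class, three replicas per chunk is an assumed setting, and the round-robin placement is chosen only to keep the example deterministic.

```python
CHUNK_BYTES = 64 * 1024 * 1024  # 64 MB chunk size, as stated in the text
REPLICAS = 3                    # assumed replication factor for this sketch


class ToyMaster:
    """Single master holding all chunk metadata in memory."""

    def __init__(self, chunk_servers):
        self.chunk_servers = list(chunk_servers)
        if len(self.chunk_servers) < REPLICAS:
            raise ValueError("need at least as many servers as replicas")
        self.chunks = {}  # (path, chunk_index) -> list of server names
        self._next = 0    # round-robin cursor for placement

    def _place(self):
        """Pick REPLICAS distinct servers, rotating the starting point."""
        n = len(self.chunk_servers)
        picked = [self.chunk_servers[(self._next + i) % n]
                  for i in range(REPLICAS)]
        self._next = (self._next + 1) % n
        return picked

    def locate(self, path: str, byte_offset: int):
        """Translate (path, offset) into (chunk index, replica locations)."""
        index = byte_offset // CHUNK_BYTES
        key = (path, index)
        if key not in self.chunks:
            self.chunks[key] = self._place()
        return index, self.chunks[key]
```

The client receives the replica locations themselves, which mirrors the point above that applications can see where file blocks physically reside; the data transfer would then go directly to the chunk servers.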
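[Figure: GFS data flow for the file /foo/bar. Labels recovered: GFS client, GFS master, Primary, Secondary replica A, Secondary replica B; legend: Control, Data.]
[Figure: BigTable data model. Labels recovered: row key "com.cnn.www", cell content beginning "<html", BigTable cell.]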
FIGURE 5.11 Tablet location hierarchy in BigTable.
5.5.4 Chubby, Google’s Distributed Lock Service
• Chubby is intended to provide a coarse-grained locking service. It can store
small files in Chubby storage, which provides a simple namespace organized as
a file system tree.
• The files stored in Chubby are quite small compared to the huge files in GFS.
Based on the Paxos agreement protocol, the Chubby system can be quite
reliable despite the failure of any member node. Figure 5.12 shows the
overall architecture of the Chubby system.
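A coarse-grained lock service over a path-like namespace can be sketched in a few lines. This is an in-memory illustration only: ToyLockService and the example path are hypothetical and do not reflect Chubby's actual API. The helper tolerated_failures states the general property of majority-quorum protocols such as Paxos, namely that a group of n replicas stays available while more than half of them are alive.

```python
class ToyLockService:
    """Exclusive locks identified by file-system-like paths."""

    def __init__(self):
        self._holders = {}  # lock path -> client currently holding it

    def acquire(self, path: str, client: str) -> bool:
        """Grant the lock if it is free or already held by this client."""
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            return False
        self._holders[path] = client
        return True

    def release(self, path: str, client: str) -> bool:
        """Only the current holder may release a lock."""
        if self._holders.get(path) == client:
            del self._holders[path]
            return True
        return False


def tolerated_failures(replicas: int) -> int:
    """Failures a majority quorum of the given size can survive."""
    return (replicas - 1) // 2
```

For example, with five replicas a majority is three, so the group can lose two members and still agree on who holds a lock.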
[Figure 5.12: Chubby architecture. Labels recovered: Master, Chubby, application, Client processes.]
• Multi-tenancy
• Cluster Utilization
• Scalability
• Compatibility
• The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work
across the machines and, in turn, utilizes the underlying parallelism of the
CPU cores.
17. How is the Google File System used for the Google search engine?
• GFS was built primarily as the fundamental storage service for Google’s search
engine.
• As the size of the web data that was crawled and saved was quite substantial,
Google needed a distributed file system to redundantly store massive amounts of
data on cheap and unreliable computers.
PART B
1. Explain Hadoop architecture with neat diagram.
2. Discuss Hadoop YARN Architecture with its diagram.
3. How is MapReduce implemented in Hadoop?
4. What is GAE? Explain GAE architecture and its functional modules.
5. Explain Google File System (GFS) architecture.
6. Discuss BigTable data model and its system structure in detail.
7. Explain OpenStack Nova system architecture.
8. Discuss the federated services and their applications.