CCDT Unit 2

Chapter 2 of the CCDT outlines the cloud computing architecture, detailing the cloud reference model which includes three service models: SaaS, PaaS, and IaaS, along with their respective features and use cases. It also discusses various cloud reference models such as NIST, CSA, and OCCI, emphasizing their roles in standardizing cloud services and ensuring security. Additionally, the chapter covers the layered architecture of cloud computing, data centers, and the importance of interconnection networks in facilitating efficient data transfer and communication within cloud environments.

CCDT CHAPTER 2

Cloud Computing Architecture


Cloud Reference Model
The cloud computing reference model is an abstract model that divides a cloud computing environment into
abstraction layers and cross-layer functions to characterize and standardize its functions. This reference
model divides cloud computing activities into three cross-layer functions and five logical layers.
Each layer describes different things that might be present in a cloud computing environment, such as
computing systems, networking, storage equipment, virtualization software, security measures, control and
management software, and so forth. It also explains the connections between these entities. The five
layers are the physical layer, the virtual layer, the control layer, the service orchestration layer, and the service layer.
The Cloud Computing reference model is divided into 3 major service models:
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
The cloud computing reference model arranges these three service models as layers, with SaaS at the top, PaaS in the middle, and IaaS at the bottom.

1. SaaS
Software as a Service (SaaS) is a form of application delivery that relieves users of the burden of software
maintenance while making development and testing easier for service providers.
Applications sit at the top layer of the cloud delivery model. End customers access the services this tier offers via web portals. Because online software services provide the same functionality as locally installed programs, consumers are rapidly switching to them from desktop software. Today, ILMS and other application software can be accessed over the web as a service.
For data access, collaboration, editing, storage, and document sharing, SaaS is unquestionably a crucial service. Web-based email is the most well-known and widely used example of SaaS, and SaaS applications are becoming steadily more collaborative and advanced.
Features of SaaS are as follows:
The cloud provider has full control over all the cloud services.
The provider has full control over software application-based services.
The cloud provider has partial control over the implementation of cloud services.
The consumer has limited control over the implementation of these cloud services.
2. PaaS
Platform as a Service (PaaS) offers a higher level of abstraction that makes a cloud readily programmable, going beyond infrastructure-oriented clouds that offer only basic compute and storage capabilities. Developers can construct and deploy apps on a cloud platform without necessarily needing to know how many processors or how much memory their applications will use. Google App Engine, for instance, is a PaaS offering that provides a scalable environment for creating and hosting web applications.
Features of PaaS layer are as follows:
The cloud provider has entire rights or control over the provision of cloud services to consumers.
The cloud consumer has selective control based on the resources they need or have opted for on the application
server, database, or middleware.
Consumers get environments in which they can develop their applications or databases. These environments are
usually very visual and very easy to use.
Provides options for scalability and security of the user’s resources.
Services to create workflows and websites.
Services to connect users’ cloud platforms to other external platforms.
3. IaaS
Infrastructure as a Service (IaaS) offers storage and computer resources that developers and IT organizations
use to deliver custom/business solutions. IaaS delivers computer hardware (servers, networking technology,
storage, and data center space) as a service. It may also include the delivery of OS and virtualization
technology to manage the resources. Here, the more important point is that IaaS customers rent computing
resources instead of buying and installing them in their data centers. The service is typically
paid for on a usage basis. The service may include dynamic scaling so that if the customers need more
resources than expected, they can get them immediately.
The control of the IaaS layer is as follows:
The consumer has full or partial control over the cloud infrastructure, servers, and databases.
The consumer has control over the implementation and maintenance of the virtual machines.
The consumer has a choice of ready-made virtual machine images with pre-installed operating systems.
The cloud provider has full control over the data centers and the other hardware involved in them.
The provider can scale resources based on users' usage.
The provider can also replicate data across the world so that it can be accessed from anywhere with minimal delay.

TYPES OF CLOUD REFERENCE MODELS:
The NIST Cloud Reference Model
Various cloud computing reference models are used to represent consumers’ different requirements.
The National Institute of Standards and Technology (NIST) is an American organization responsible for
adopting and developing cloud computing standards. The NIST cloud computing model comprises five
crucial features:
Measured Service
On-demand self-service
Resource pooling
Rapid elasticity
Broad network access
They follow the same three service models defined earlier: SaaS, PaaS, and IaaS, and mention four
deployment models: i.e., Private, Community, Public, and Hybrid cloud.
The CSA Cloud Reference Model
Security in the cloud is a rising concern. With so much data being available and distributed on the cloud,
vendors must establish proper controls and boundaries. The Cloud Security Alliance (CSA) reference
model defines these responsibilities. It states that IaaS is the most basic level of service, followed by PaaS
and then SaaS. Each of them inherits the security intricacies of the predecessor, which also means that any
concerns are propagated forward. The proposal from the CSA is that any cloud computing model should
include the below-mentioned security mechanisms:
Access control
Audit trail
Certification
Authority
The OCCI Cloud Reference Model
The Open Cloud Computing Interface (OCCI) is a set of specifications and standards that defines how
various cloud vendors deliver services to their customers. It helps streamline the creation of system calls and
APIs for every provider. This model not only helps with security but also helps create managed services,
monitoring, and other system management tasks that can be beneficial. The main pillars of the OCCI cloud
computing reference model are:
Interoperability – Enable diverse cloud providers to operate simultaneously without data translation between
multiple API calls
Portability – Move away from vendor lock-in and allow customers to move among providers depending on
their business objectives with limited technical expenses, thus fostering competition in the market
Integration – Services can be offered to customers on top of any underlying infrastructure
Extensibility – Using the meta-model and discovery features, OCCI servers can interact with other OCCI
servers through extensions.
The CIMI & DMTF Cloud Reference Model
The Cloud Infrastructure Management Interface (CIMI) Model is an open standard specification for
APIs to manage cloud infrastructure. CIMI aims to ensure users can manage the cloud infrastructure simply
by standardizing interactions between the cloud environment and the developers. The CIMI standard is
defined by the Distributed Management Task Force (DMTF), which specifies the interactions over the
Representational State Transfer (REST) architectural style using HTTP, although the model can be mapped to other
protocols.
Each resource in the model has a MIME type that contextualizes the request and response payloads. URIs
identify resources, and each resource's representation contains an ID attribute. The root resource is known as the
'Cloud Entry Point'; all other resources in the environment are reachable through links from it.
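To make the REST interaction concrete, here is a minimal Python sketch using the requests library; the base URL, resource names, and JSON structure are assumptions for illustration rather than any specific provider's API.

```python
import requests

# Hypothetical CIMI endpoint; a real provider would publish its own base URL.
BASE_URL = "https://cloud.example.com/cimi"
HEADERS = {"Accept": "application/json"}

# Fetch the Cloud Entry Point, the root resource whose representation links
# to every other resource in the environment.
entry_point = requests.get(f"{BASE_URL}/cloudEntryPoint", headers=HEADERS, timeout=10).json()

# Follow the link advertised for the machine collection, if one is exposed.
machines_link = entry_point.get("machines", {}).get("href")
if machines_link:
    machines = requests.get(machines_link, headers=HEADERS, timeout=10).json()
    print(machines)
```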
NIST Cloud Computing Reference Model: Developed by the National Institute of Standards and Technology
(NIST), this model is one of the most widely recognized and used. It defines five essential characteristics, three
service models, and four deployment models of cloud computing. The essential characteristics include on-
demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
IBM Cloud Computing Reference Architecture: IBM's model provides a comprehensive framework for
designing and implementing cloud solutions. It covers various aspects such as infrastructure, platforms,
applications, services, security, management, and governance.
The Open Group Cloud Computing Reference Architecture: This reference model focuses on providing a
standardized approach to architecting cloud solutions. It emphasizes interoperability, portability, and scalability
across different cloud environments.
Cloud Security Alliance (CSA) Cloud Control Matrix (CCM): While not a traditional reference model, the
CCM provides a structured framework for assessing the security controls of cloud service providers. It helps
organizations evaluate the security posture of cloud services based on various criteria and compliance
requirements.
ISO/IEC 17788 Cloud Computing Overview and Vocabulary: This ISO standard provides a comprehensive
overview of cloud computing concepts, terminology, and relationships between different cloud components. It
serves as a foundational reference for understanding cloud computing principles and practices.
These reference models serve as guidelines for organizations to understand, design, and implement cloud
solutions effectively. They provide a common language and framework for discussing and evaluating cloud
computing concepts, helping organizations make informed decisions about cloud adoption and deployment
strategies.

Layered Architecture of Cloud

Application Layer
The application layer, which is at the top of the stack, is where the actual cloud apps are located. Cloud
applications, as opposed to traditional applications, can take advantage of automatic scaling to achieve
higher performance and availability at lower operational cost.
This layer consists of different Cloud Services which are used by cloud users. Users can access these
applications according to their needs. Applications are divided into Execution layers and Application
layers.
For an application to transfer data, the application layer determines whether communication partners are
available. Whether enough cloud resources are accessible for the required communication is decided at the
application layer. Applications must cooperate to communicate, and an application layer is in charge of this.
The application layer, in particular, is responsible for handling application protocols that run over IP, such as
Telnet and FTP. Other examples of application-layer systems and protocols include web browsers, SNMP, HTTP,
and HTTPS, which is HTTP secured with TLS.

Platform Layer
The operating system and application software make up this layer.
Users should be able to rely on the platform to provide scalability, dependability, and security protection,
which gives them a space to create their apps, test operational processes, and keep track of execution outcomes
and performance. This layer is the foundation on which the application layer's SaaS offerings are implemented.
The objective of this layer is to allow applications to be deployed directly onto virtual machines.
Operating systems and application frameworks make up the platform layer, which is built on top of the
infrastructure layer. The platform layer's goal is to lessen the difficulty of deploying programs directly
into VM containers.
By way of illustration, Google App Engine operates at the platform layer to provide API support for
implementing the storage, databases, and business logic of typical web apps.

Infrastructure Layer
It is a layer of virtualization where physical resources are divided into a collection of virtual resources using
virtualization technologies like Xen, KVM, and VMware.
This layer serves as the Central Hub of the Cloud Environment, where resources are constantly added
utilizing a variety of virtualization techniques.
It provides the base upon which the platform layer is built, constructed from virtualized network, storage, and
computing resources, and gives users the flexibility they want.
Automated resource provisioning is made possible by virtualization, which also improves infrastructure
management.
The infrastructure layer, sometimes referred to as the virtualization layer, partitions the physical resources
using virtualization technologies like Xen, KVM, Hyper-V, and VMware to create a pool of compute and
storage resources.
The infrastructure layer is crucial to cloud computing since virtualization technologies are the only ones that
can provide many vital capabilities, like dynamic resource assignment.
Data centre Layer
In a cloud environment, this layer is responsible for Managing Physical Resources such as servers, switches,
routers, power supplies, and cooling systems.
Providing end users with services requires all resources to be available and managed in data centers.
Physical servers are connected to the data center network through high-speed devices such as routers and switches.
In software application designs, the division of business logic from the persistent data it manipulates is well-
established. This is due to the fact that the same data cannot be incorporated into a single application because
it can be used in numerous ways to support numerous use cases. The requirement for this data to become a
service has arisen with the introduction of microservices.
A single database used by many microservices creates a very close coupling. As a result, it is hard to deploy
new or emerging services separately if such services need database modifications that may have an impact on
other services. A data layer containing many databases, each serving a single microservice or perhaps a few
closely related microservices, is needed to break complex service interdependencies.

Cloud Service Models


There are the following three types of cloud service models -
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
Cloud Computing Reference Model Use Cases and Examples
IaaS use case
A business needs to provision VMs to meet rapidly growing demand. They can utilise an IaaS provider
such as Amazon Web Services (AWS) to scale their infrastructure without needing physical hardware.
PaaS use case
A software development team wants to focus on building applications rather than worrying about the
underlying system and architecture. They can choose a PaaS provider, like Google App Engine (By Google
Cloud Platform – GCP) or Heroku, to deploy and manage their applications.
SaaS use case
A business requires a Customer Relationship Management (CRM) tool to manage its sales and customer
interactions. Instead of spending time and resources building in-house software, the business can opt for
SaaS offerings such as Salesforce or HubSpot, delivered via the cloud, without worrying about maintenance.
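As a rough illustration of the IaaS use case above, the following Python sketch uses the boto3 library to launch a single VM on AWS EC2; it assumes credentials and a region are already configured, and the AMI ID shown is a placeholder.

```python
import boto3

# Assumes AWS credentials are configured (environment variables,
# ~/.aws/credentials, or an attached IAM role).
ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one small instance; replace the placeholder AMI ID with a real
# image ID for your region.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
print("Launched instance:", instances[0].id)
```

Because the instance is rented rather than owned, it can be terminated as soon as demand subsides, which is what makes the pay-per-use model attractive.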
What is a Data Center?
A data center (also written as datacentre or data centre) is a facility made up of networked computers,
storage systems, and computing infrastructure that businesses and other organizations use to organize, process,
store, and disseminate large amounts of data. A business typically relies heavily on the applications, services,
and data within a data center, making it a focal point and critical asset for everyday operations.
Enterprise data centers increasingly incorporate cloud computing resources and facilities to secure and protect
in-house, onsite resources. As enterprises increasingly turn to cloud computing, the boundaries between cloud
providers' data centers and enterprise data centers become less clear.
How do Data Centers work?
A data center facility enables an organization to assemble its resources and infrastructure for data processing,
storage, and communication, including:
systems for storing, sharing, accessing, and processing data across the organization;
physical infrastructure to support data processing and data communication; and
utilities such as cooling, electricity, network access, and uninterruptible power supplies (UPS).
Gathering all these resources in one data center enables the organization to:
protect proprietary systems and data;
centralize IT and data processing employees, contractors, and vendors;
enforce information security controls on proprietary systems and data; and
realize economies of scale by integrating sensitive systems in one place.
Why are data centers important?
Data centers support almost all enterprise computing, storage, and business applications. To the extent that the
business of a modern enterprise runs on computers, the data center is business.
Data centers enable organizations to concentrate their processing power, which in turn enables the organization
to focus its attention on:
IT and data processing personnel;
computing and network connectivity infrastructure; and
computing facility security.
What are the main components of Data Centers?
Elements of a data center are generally divided into three categories:
compute
enterprise data storage
networking
A modern data center concentrates an organization's data systems in a well-protected physical
infrastructure, which includes:
servers;
storage subsystems;
networking switches, routers, and firewalls;
cabling; and
physical racks for organizing and interconnecting IT equipment.
Datacenter Resources typically include:
power distribution and supplementary power subsystems;
electrical switching;
UPS;
backup generators;
ventilation and data center cooling systems, such as in-row cooling configurations and computer room air
conditioners; and
adequate provision for network carrier (telecom) connectivity.
All of this demands a physical facility with physical security access controls and sufficient square footage to hold
the entire collection of infrastructure and equipment.
Data Center Design:
Choosing the right location for a data center is crucial. Factors such as power availability, connectivity options,
proximity to users, and environmental considerations like climate and natural disaster risk must be evaluated.
Once the location is determined, the physical layout of the data center should be designed to optimize space
utilization, airflow management, and accessibility for maintenance and expansion. Robust power and cooling
systems, including redundant power sources, backup generators, UPS, and efficient cooling mechanisms, are
essential to ensure continuous operation and prevent overheating. Moreover, installing a reliable networking
infrastructure, including switches, routers, and firewalls, is crucial for facilitating high-speed data transfer and
secure connectivity within the data center and to external networks.
INTERCONNECTION NETWORK
The interconnection network, also known as the data center network or fabric, refers to the infrastructure that
connects servers, storage devices, networking equipment, and other components within a data center. It plays a
critical role in enabling communication and data transfer between these devices, ensuring efficient operation and
high performance of data center services and applications.
Components of an Interconnection Network:
Switches and Routers:
Switches and routers are the backbone of the interconnection network. Switches facilitate communication
between devices within the same network segment, while routers handle traffic between different network
segments or across multiple networks.
These devices are responsible for forwarding data packets based on destination addresses, ensuring that data is
delivered to the intended recipient efficiently.
Cabling Infrastructure:
High-quality cabling infrastructure, including copper or fiber-optic cables, is essential for connecting devices
within the data center. The choice of cabling depends on factors such as bandwidth requirements, distance, and
environmental conditions.
Proper cable management and organization are critical to maintaining signal integrity, minimizing interference,
and ensuring the reliability and scalability of the interconnection network.
Network Interfaces:
Network interfaces, such as Ethernet and Fibre Channel adapters, enable devices to connect to the
interconnection network. These interfaces may be integrated into servers, storage arrays, and networking
equipment or added as expansion cards or modules.
Network Protocols and Standards:
Various network protocols and standards govern the communication and data transfer within the interconnection
network. Examples include Ethernet, Fibre Channel, InfiniBand, and TCP/IP.
Each protocol has its own characteristics, advantages, and use cases, and the choice of protocol depends on
factors such as performance requirements, scalability, and compatibility with existing infrastructure.
Key Considerations for Interconnection Network Design:
Topology:
The network topology defines the physical and logical arrangement of network devices and connections within
the data center. Common topologies include star, mesh, tree, and hybrid configurations.
The choice of topology depends on factors such as scalability, fault tolerance, latency, and cost. For example, a
mesh topology provides redundancy and fault tolerance but may require more cabling and higher complexity
compared to a star topology.
Bandwidth and Throughput:
Adequate bandwidth and throughput are essential for supporting the data transfer requirements of applications
and services within the data center. High-speed interconnections, such as 10Gbps, 25Gbps, 40Gbps, or 100Gbps
links, may be necessary to meet the performance demands of modern applications.
Network capacity planning should take into account current and future data traffic patterns, application
requirements, and growth projections to ensure that the interconnection network can scale to meet evolving
needs.
Latency and Packet Loss:
Minimizing latency and packet loss is critical for maintaining the responsiveness and reliability of data center
applications and services. Low-latency interconnections are particularly important for real-time applications
such as video streaming, online gaming, and financial trading.
Factors such as network congestion, distance, and the quality of network equipment and infrastructure can affect
latency and packet loss. Designing the network with redundant paths, traffic prioritization, and Quality of
Service (QoS) mechanisms can help mitigate these issues.
Redundancy and Fault Tolerance:
Redundancy is essential for ensuring high availability and fault tolerance within the interconnection network.
Redundant switches, routers, links, and power supplies help prevent single points of failure and minimize
downtime.
Techniques such as link aggregation (LACP), Spanning Tree Protocol (STP), and Virtual Router Redundancy
Protocol (VRRP) can be used to create redundant paths and automatically reroute traffic in case of failures.
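Protocols such as VRRP perform this rerouting transparently at the network layer; purely to illustrate the same redundancy principle at application level, the Python sketch below tries a primary endpoint and falls back to a standby (both URLs are hypothetical).

```python
import requests

# Hypothetical primary and standby endpoints for the same service.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://standby.example.com/health",
]

def fetch_with_failover(urls):
    """Return the first successful response, trying redundant endpoints in order."""
    for url in urls:
        try:
            resp = requests.get(url, timeout=2)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue  # this path failed; try the next redundant endpoint
    raise RuntimeError("all redundant endpoints are unavailable")

print(fetch_with_failover(ENDPOINTS))
```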
Security:
Security measures such as firewalls, intrusion detection/prevention systems (IDS/IPS), encryption, access
controls, and authentication mechanisms are essential for protecting the interconnection network from
unauthorized access, data breaches, and cyber threats.
Segmenting the network into virtual LANs (VLANs) or using virtual private networks (VPNs) can help isolate
and secure traffic between different user groups, departments, or applications.
Scalability:
The interconnection network should be designed to scale and grow with the evolving needs of the data center.
Modular switches, expandable chassis, and flexible cabling infrastructure facilitate easy expansion and upgrades
as demand increases.
Scalability considerations should encompass not only the physical infrastructure but also network management
and provisioning processes to ensure smooth and efficient scaling of resources.
Management and Monitoring:
Effective management and monitoring tools are essential for maintaining visibility, control, and performance
optimization of the interconnection network. Network management software, monitoring systems, and analytics
platforms help administrators identify and troubleshoot issues, optimize resource utilization, and enforce
security policies.
Automated provisioning, configuration management, and predictive analytics can streamline network operations
and improve responsiveness to changing conditions and requirements.
In summary, the interconnection network is a critical component of a data center infrastructure, enabling
efficient communication and data transfer between devices and facilitating the delivery of services and
applications. By carefully considering factors such as topology, bandwidth, latency, redundancy, security,
scalability, and management, organizations can design a robust and reliable interconnection network that meets
their performance, availability, and security requirements.

Architectural Design of Compute and Storage Clouds –

Cloud Programming and Software: Fractures of cloud programming

Cloud programming refers to the development of software applications specifically designed to run on cloud
computing platforms. These applications leverage the scalability, flexibility, and cost-effectiveness of cloud
infrastructure to deliver services over the Internet. However, like any form of software development, cloud
programming has its challenges and fractures, which are areas of difficulty or complexity. Here are some
fractures commonly encountered in cloud programming:

1. Distributed Computing Complexity:


- Cloud applications often involve distributed computing, where components of the application run on
different servers or even different data centers. Managing communication between distributed components,
handling data consistency, and ensuring fault tolerance can be complex and challenging.

2. Scalability and Performance:


- While cloud platforms offer scalability, designing applications that can effectively scale horizontally to
handle increasing loads can be challenging. Ensuring that the application maintains performance under varying
workloads and effectively utilizes cloud resources requires careful design and optimization.

3. Data Management:
- Managing data in the cloud, including storage, retrieval, and processing, can be complex. Issues such as data
consistency, data integrity, data partitioning, and data privacy must be carefully addressed to ensure the
reliability and security of cloud-based applications.

4. Security and Compliance:


- Security is a significant concern in cloud programming, as applications are exposed to potential threats such
as unauthorized access, data breaches, and denial-of-service attacks. Ensuring proper authentication,
authorization, encryption, and compliance with regulatory requirements is essential but can be challenging.

5. Cost Optimization:
- While cloud computing offers cost advantages such as pay-as-you-go pricing and resource elasticity,
optimizing costs can be challenging. Balancing performance requirements with cost considerations, optimizing
resource utilization, and avoiding unexpected expenses requires careful monitoring and management.

6. Vendor Lock-In:
- Cloud platforms often offer proprietary services and APIs, which can lead to vendor lock-in. Developers
must carefully consider the trade-offs between leveraging platform-specific features for convenience and the
risk of being tied to a specific cloud provider, which can limit flexibility and portability.

7. Integration and Interoperability:


- Cloud applications often need to integrate with other systems, services, and data sources, both within and
outside the cloud environment. Ensuring seamless integration and interoperability between disparate systems
can be complex, especially when dealing with legacy systems or heterogeneous environments.

8. Resource Management:
- Effectively managing cloud resources, including compute instances, storage, and networking, requires
careful planning and optimization. Overprovisioning can lead to unnecessary costs, while underprovisioning
can result in performance issues or service disruptions.

9. Development and Deployment Workflow:


- Cloud programming involves new development and deployment workflows compared to traditional software
development. Adopting cloud-native development practices, such as continuous integration/continuous
deployment (CI/CD), microservices architecture, and containerization, may require changes in development
processes and tooling.

10. Monitoring and Debugging:


- Monitoring and debugging cloud applications can be challenging due to the distributed and dynamic nature
of cloud environments. Tools and techniques for monitoring performance, identifying bottlenecks, and
troubleshooting issues in distributed systems are essential but may require specialized knowledge and skills.

Addressing these fractures requires a combination of technical expertise, careful design, robust architecture, and
effective management practices. By understanding and mitigating these challenges, developers can build
resilient, scalable, and cost-effective cloud applications that meet the needs of modern businesses and users.
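As a small example of coping with the distributed-computing fracture above, the sketch below retries a transient remote failure with exponential backoff and jitter; `call_remote_service` is a hypothetical placeholder for any cloud API call.

```python
import random
import time

def with_retries(call_remote_service, max_attempts=5):
    """Retry a flaky remote call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_remote_service()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait 1s, 2s, 4s, ... plus random jitter so many clients
            # do not retry in lockstep.
            time.sleep(2 ** (attempt - 1) + random.random())
```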

Parallel Computing and Distributed Computing


Parallel Computing:
In parallel computing, multiple processors perform multiple tasks assigned to them simultaneously. Memory
in parallel systems can either be shared or distributed. Parallel computing provides concurrency and saves
time and money.
Distributed Computing:
In distributed computing, we have multiple autonomous computers which appear to the user as a single
system. In distributed systems there is no shared memory, and computers communicate with each other
through message passing. In distributed computing, a single task is divided among different computers.
Difference between Parallel Computing and Distributed Computing:
1. Parallel computing: many operations are performed simultaneously. Distributed computing: system components are located at different locations.
2. Parallel computing: a single computer is required. Distributed computing: uses multiple computers.
3. Parallel computing: multiple processors perform multiple operations. Distributed computing: multiple computers perform multiple operations.
4. Parallel computing: may have shared or distributed memory. Distributed computing: has only distributed memory.
5. Parallel computing: processors communicate with each other through a bus. Distributed computing: computers communicate with each other through message passing.
6. Parallel computing: improves system performance. Distributed computing: improves system scalability, fault tolerance, and resource-sharing capabilities.
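A minimal Python sketch of the parallel-computing model from the comparison above: several worker processes on one machine operate on pieces of the same task simultaneously.

```python
from multiprocessing import Pool

def square(n):
    # The work each processor performs on its share of the data.
    return n * n

if __name__ == "__main__":
    # Four worker processes compute the squares in parallel on a single computer.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```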

MapReduce
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to
use. MapReduce is a programming model used for efficient parallel processing over large data sets in a
distributed manner. The data is first split and then combined to produce the final result. Libraries for
MapReduce have been written in many programming languages, with various optimizations. The
purpose of MapReduce in Hadoop is to map each job into smaller, equivalent tasks and then reduce their
results, which lowers the overhead on the cluster network and the processing power required. A MapReduce
task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:


Client: The MapReduce client is the one who brings the Job to the MapReduce for processing. There can be
multiple clients available that continuously send jobs for processing to the Hadoop MapReduce Manager.
Job: The MapReduce job is the actual work that the client wants to do, which is composed of many
smaller tasks that the client wants to process or execute.
Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are
combined to produce the final output.
Input Data: The data set that is fed to the MapReduce for processing.
Output Data: The final result is obtained after the processing.
In MapReduce, we have a client. The client will submit the job of a particular size to the Hadoop
MapReduce Master. Now, the MapReduce master will divide this job into further equivalent job-parts. These
job-parts are then made available for the Map and Reduce Task. This Map and Reduce task will contain the
program as per the requirement of the use-case that the particular company is solving. The developer writes
their logic to fulfill the requirement that the industry requires. The input data which we are using is then fed
to the Map Task and the Map will generate intermediate key-value pair as its output. The output of Map i.e.
these key-value pairs are then fed to the Reducer and the final output is stored on the HDFS. There can be n
number of Map and Reduce tasks made available for processing the data as per the requirement. The
algorithm for Map and Reduce is made in a very optimized way such that the time complexity or space
complexity is minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
Map: As the name suggests its main use is to map the input data in key-value pairs. The input to the map
may be a key-value pair where the key can be the ID of some kind of address and value is the actual value
that it keeps. The Map() function will be executed in its memory repository on each of these input key-value
pairs and generates the intermediate key-value pair which works as input for the Reducer
or Reduce() function.

Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled, sorted, and sent to
the Reduce() function. The Reducer aggregates or groups the data based on its key as per the reducer
algorithm written by the developer.
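To make the two phases concrete, here is the classic word-count example written as Hadoop Streaming scripts in Python; this is a sketch, and in practice the two files are submitted to the cluster with the hadoop-streaming jar (the file names are illustrative).

```python
#!/usr/bin/env python3
# mapper.py -- Map phase: emit one (word, 1) key-value pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce phase: input arrives sorted by key, so all pairs for
# one word are adjacent and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```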
How Job tracker and the task tracker deal with MapReduce:
Job Tracker: The work of Job tracker is to manage all the resources and all the jobs across the cluster and
also to schedule each map on the Task Tracker running on the same data node since there can be hundreds of
data nodes available in the cluster.

Task Tracker: The Task Tracker can be considered as the actual slaves that are working on the instruction
given by the Job Tracker. This Task Tracker is deployed on each of the nodes available in the cluster that
executes the Map and Reduce task as instructed by Job Tracker.
There is also one important component of the MapReduce architecture known as the Job History Server. The Job
History Server is a daemon process that saves and stores historical information about tasks or applications,
such as the logs generated during or after job execution.

Hadoop
Hadoop is an open-source software framework that is used for storing and processing large amounts of data in
a distributed computing environment. It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large datasets.
What is Hadoop?
Hadoop is an open source software programming framework for storing a large amount of data and
performing the computation. Its framework is based on Java programming with some native code in C and
shell scripts.

Hadoop has two main components:


HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the
storage of large amounts of data across multiple machines. It is designed to work with commodity hardware,
which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which
manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
Hadoop also includes several additional modules that provide additional functionality, such as Hive (a SQL-
like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-
relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine
learning. It’s also used for data processing, data analysis, and data mining. It enables the distributed
processing of large data sets across clusters of computers using a simple programming model.
History of Hadoop
Apache Software Foundation is the developer of Hadoop, and its co-founders are Doug Cutting and Mike
Cafarella. Its co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google published
the Google File System paper. In January 2006, MapReduce development started on Apache Nutch, with around
6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was
released.
Features of Hadoop:
1. It is fault-tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and
processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more
capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in
the presence of hardware failures.
Data locality: Hadoop provides a data locality feature, in which data is stored on the same node where it will
be processed; this helps to reduce network traffic and improve performance.
High Availability: Hadoop provides a High Availability feature, which helps to make sure that the data is
always available and is not lost.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a
distributed fashion, making it easy to implement a wide variety of data processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the data stored is
consistent and correct.
Data Replication: Hadoop provides a data replication feature, which replicates data across the
cluster for fault tolerance.
Data Compression: Hadoop provides a built-in data compression feature, which helps to reduce storage
space and improve performance.
YARN: A resource management platform that allows multiple data processing engines like real-time
streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS, which splits files into blocks and distributes them across
the various nodes of large clusters. In case of a node failure, the system continues to operate, and data transfer
between the nodes is facilitated by HDFS.
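A small sketch of interacting with HDFS from Python over WebHDFS, using the third-party "hdfs" (HdfsCLI) package; the NameNode address, port, user, and paths below are assumptions for illustration, and WebHDFS must be enabled on the cluster.

```python
from hdfs import InsecureClient  # third-party "hdfs" (HdfsCLI) package

# Hypothetical NameNode web address.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a small file; HDFS splits large files into blocks and replicates
# them across DataNodes behind the scenes.
client.write("/user/hadoop/hello.txt", data=b"hello from hdfs\n", overwrite=True)

# Read the file back.
with client.read("/user/hadoop/hello.txt") as reader:
    print(reader.read().decode())
```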

HDFS
Advantages of HDFS: it is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable
and block-structured, can process large amounts of data simultaneously, and more. Disadvantages of HDFS:
its biggest disadvantage is that it is not suited to small quantities of data; it also has issues related to stability
and can be restrictive and rough in nature. Hadoop also supports a wide range of software packages such as
Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig,
Apache Hive, Apache Phoenix, and Cloudera Impala.
Some common frameworks of Hadoop
Hive- It uses HiveQL for data structuring and for writing complicated MapReduce programs over data in HDFS.
Drill- It consists of user-defined functions and is used for data exploration.
Storm- It allows real-time processing and streaming of data.
Spark- It contains a Machine Learning Library(MLlib) for providing enhanced machine learning and is
widely used for data processing. It also supports Java, Python, and Scala.
Pig- It has Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
Tez- It reduces the complexities of Hive and Pig and helps in the running of their codes faster.
Hadoop framework is made up of the following modules:
Hadoop MapReduce- a MapReduce programming model for handling and processing large data.
Hadoop Distributed File System- distributed files in clusters among nodes.
Hadoop YARN- a platform which manages computing resources.
Hadoop Common- it contains packages and libraries which are used for other modules.
Advantages and Disadvantages of Hadoop
Advantages:
Ability to store a large amount of data.
High flexibility.
Cost effective.
High computational power.
Tasks are independent.
Linear scaling.
High level Language for Cloud
High-level languages for cloud computing are programming languages specifically designed to simplify the
development of cloud-native applications and services. These languages provide abstractions, libraries, and
frameworks that enable developers to leverage cloud infrastructure and services effectively. Here are some high-
level languages commonly used in cloud computing:
JavaScript/Node.js:
JavaScript, particularly with Node.js runtime, is widely used for building server-side applications and
microservices in the cloud. It offers an asynchronous event-driven programming model, which is well-suited for
handling I/O-intensive operations common in cloud environments. Node.js also has a rich ecosystem of libraries
and frameworks (e.g., Express.js, Nest.js) for building scalable and efficient cloud applications.
Python:
Python is a popular language for cloud development due to its simplicity, readability, and versatility. It has
extensive libraries and frameworks (e.g., Django, Flask) for web development, data processing, machine
learning, and automation, making it suitable for a wide range of cloud applications. Python is commonly used
for writing serverless functions, web APIs, and data processing pipelines in the cloud.
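For instance, a minimal Python web API of the kind often deployed to a PaaS or serverless platform might look like the following Flask sketch (the route and port are arbitrary choices for illustration).

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # A simple endpoint a load balancer or orchestrator could poll.
    return jsonify(status="ok")

if __name__ == "__main__":
    # In the cloud this would normally run behind a WSGI server such as gunicorn.
    app.run(host="0.0.0.0", port=8080)
```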
Java:
Java is a robust and widely adopted language for building enterprise-grade cloud applications. It has a mature
ecosystem of libraries, frameworks (e.g., Spring Boot), and tools for developing scalable, reliable, and
performant applications. Java's platform independence and strong typing make it well-suited for building cloud-
native microservices, APIs, and backend systems.
Go (Golang):
Go is a statically typed, compiled language developed by Google, designed for building fast and efficient
software. It has built-in support for concurrency and provides a simple and efficient runtime environment,
making it well-suited for building cloud-native applications, especially microservices and distributed systems.
Go's simplicity, performance, and ease of deployment make it a popular choice for cloud development.
C#/.NET:
C# and the .NET framework are widely used for building cloud applications on the Microsoft Azure
platform. .NET Core, the cross-platform and open-source version of .NET, enables developers to build and
deploy cloud-native applications on various cloud environments. C# is commonly used for developing web
applications, APIs, and microservices in the cloud.
Ruby:
Ruby is a dynamic, object-oriented language known for its simplicity and productivity. It has a rich ecosystem
of libraries and frameworks (e.g., Ruby on Rails) for building web applications, APIs, and microservices.
Ruby's expressive syntax and convention-over-configuration approach make it well-suited for rapid
development of cloud-based applications.
Scala:
Scala is a statically typed language that combines object-oriented and functional programming paradigms. It is
often used with the Akka framework for building highly concurrent and distributed applications in the cloud.
Scala's interoperability with Java and its support for functional programming concepts make it suitable for
building scalable and resilient cloud applications.
These high-level languages provide developers with the tools and abstractions necessary to build cloud-native
applications efficiently. Each language has its own strengths, ecosystem, and community support, allowing
developers to choose the one that best fits their requirements and preferences.
What is Google App Engine (GAE)?
Google App Engine is a scalable runtime environment mostly used to run web applications. These applications
scale dynamically as demand changes over time, thanks to Google's vast computing infrastructure. Because it
offers a secure execution environment in addition to a number of services, App Engine makes it easier to develop
scalable and high-performance web apps. Google's applications will scale up and down in response to
shifting demand. Cron tasks, communications, scalable data stores, work queues, and in-memory caching
are some of these services.
The App Engine SDK facilitates the development and testing of applications by emulating the
production runtime environment, allowing developers to design and test applications on their own PCs.
When an application is finished, developers can quickly migrate it to App Engine, put quotas in
place to control the cost that is generated, and make the program available to everyone. Python,
Java, and Go are among the languages that are currently supported.
The development and hosting platform Google App Engine, which powers anything from web programming
for huge enterprises to mobile apps, uses the same infrastructure as Google’s large-scale internet services. It
is a fully managed PaaS (platform as a service) cloud computing platform that uses in-built services to run
your apps. You can start creating almost immediately after receiving the software development kit (SDK).
You may immediately access the Google app developer’s manual once you’ve chosen the language you wish
to use to build your app.
After creating a Cloud account, you may Start Building your App
Using the Go template/HTML package
Python-based webapp2 with Jinja2
PHP and Cloud SQL
using Java’s Maven
The App Engine runs programs on various servers while "sandboxing" them. The App Engine allows
a program to use more resources in order to handle increased demand. The App Engine powers programs
like Snapchat, Rovio, and Khan Academy.
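A minimal sketch of an App Engine application in Python (standard environment); it assumes an accompanying app.yaml that declares the runtime (for example, runtime: python39) and deployment with the gcloud app deploy command.

```python
# main.py -- minimal App Engine (standard environment) application.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # App Engine automatically scales instances of this app up and down
    # in response to incoming traffic.
    return "Hello from Google App Engine!"
```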
Features of App Engine
Runtimes and Languages
To create an application for App Engine, you can use Go, Java, PHP, or Python. You can develop and test
an app locally using the SDK's deployment toolkit. Each language's SDK and runtime are unique. Your
program runs in a Java Runtime Environment (version 7), a Python 2.7 runtime environment, a PHP 5.4
runtime environment, or a Go 1.2 runtime environment.
Generally Usable Features
These are protected by the service-level agreement and deprecation policy of App Engine. The
implementation of such a feature is often stable, and any changes made to it are backward-compatible. These
features include communications, process management, computing, data storage, retrieval, and search, as well as
app configuration and management. Features like the HRD migration tool, Google Cloud SQL, logs, datastore,
dedicated Memcached, blob store, Memcached, and search are included in the categories of data storage,
retrieval, and search.
Features in Preview
In a later iteration of the app engine, these functions will undoubtedly be made broadly accessible. However,
because they are in the preview, their implementation may change in ways that are backward-incompatible.
Sockets, MapReduce, and the Google Cloud Storage Client Library are a few of them.
Experimental Features
These might or might not be made broadly accessible in later App Engine updates, and they might be changed
in ways that are incompatible with earlier versions. The "trusted tester" features, however, are only accessible to a
limited user base and require registration in order to use them. The experimental features include
Prospective Search, Page Speed, OpenID, Restore/Backup/Datastore Admin, Task Queue Tagging,
MapReduce, Task Queue REST API, OAuth, and app metrics analytics.
Third-Party Services
Because Google provides documentation and helper libraries to expand the capabilities of the App Engine platform,
your app can perform tasks that are not built into the core App Engine product. To do
this, Google collaborates with other organizations. Along with the helper libraries, the partners frequently
provide exclusive deals to App Engine users.
Advantages of Google App Engine
The Google App Engine has a lot of benefits that can help you advance your app ideas. This comprises:
Infrastructure for Security: The Internet infrastructure that Google uses is arguably the safest in the world.
Since the application data and code are hosted on extremely secure servers, there has rarely been any
kind of unauthorized access to date.
Faster Time to Market: For every organization, getting a product or service to market quickly is crucial.
When it comes to quickly releasing the product, encouraging the development and maintenance of an app is
essential. A firm can grow swiftly with Google Cloud App Engine’s assistance.
Quick to Start: You don’t need to spend a lot of time prototyping or deploying the app to users because there
is no hardware or product to buy and maintain.
Easy to Use: The tools that you need to create, test, launch, and update the applications are included in
Google App Engine (GAE).
Rich set of APIs & Services: Several built-in APIs and services in Google App Engine enable developers to
create strong, feature-rich apps.
Scalability: This is one of the deciding variables for the success of any software. When using the Google app
engine to construct apps, you may access technologies like GFS, Big Table, and others that Google uses to
build its apps.
Performance and Reliability: Among international brands, Google ranks among the top ones. Therefore,
you must bear that in mind while talking about performance and reliability.
Cost Savings: To administer your servers, you don’t need to employ engineers or even do it yourself. The
money you save might be put toward developing other areas of your company.
Platform Independence: Since the app engine platform only has a few dependencies, you can easily relocate
all of your data to another environment.

Architecture of GAE
[Diagram: programming support in Google App Engine]
