CC MQP Solutions


Q.01 a. Explain the Platform Evolution of different computer technologies with a neat diagram.
The Platform Evolution
Computer technologies have evolved over five generations, with each lasting 10 to 20 years. The
transitions were not sudden; there was often a 10-year overlap between generations.

Generation-wise Platform Evolution:

1. Mainframe Era (1950–1970):

o Built to serve large businesses and governments.

o Examples: IBM 360, CDC 6400.

o Large centralized computers with limited access.

2. Minicomputer Era (1960–1980):

o Lower-cost systems for smaller businesses and colleges.

o Examples: DEC PDP 11, VAX Series.

o More interactive and affordable.

3. Personal Computer (PC) Era (1970–1990):

o Rise of personal computing with VLSI microprocessors.

o Widespread usage in homes, schools, and offices.

4. Portable Devices Era (1980–2000):

o Growth of laptops, PDAs, and wireless devices.

o Enabled mobility and ubiquitous computing.


5. HPC and HTC Era (1990–Present):

o Use of High-Performance Computing (HPC) and High-Throughput Computing (HTC).

o Technologies: Clusters, Grids, Cloud Computing.

o Used in both scientific and commercial web-scale applications.

Current Trends in Computing:

• Focus on web-based shared resources and big data.

• Supercomputers (MPPs) are gradually being replaced by clusters of homogeneous nodes.

• HTC systems emphasize peer-to-peer (P2P) sharing and cloud/web services.

Q.01 b. Outline eight reasons to adapt the cloud for upgraded Internet
applications and web services.

1. Desired location in areas with protected space and higher energy efficiency
2. Sharing of peak-load capacity among a large pool of users, improving overall utilization
3. Separation of infrastructure maintenance duties from domain-specific application
development
4. Significant reduction in cloud computing cost, compared with traditional computing paradigms
5. Cloud computing programming and application development
6. Service and data discovery and content/service distribution
7. Privacy, security, copyright, and reliability issues
8. Service agreements, business models, and pricing policies
Q.01 c. Briefly explain Message Passing Interface (MPI).

❖ It is a standard library used to allow communication between multiple processes in parallel computing.
❖ It is commonly used in supercomputers and clusters for high-performance tasks.
❖ Processes work independently and exchange data through message passing.
❖ Functions like MPI_Send and MPI_Recv are used to send and receive messages.
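A minimal point-to-point sketch is shown below using the mpi4py Python binding (the answer names the C calls MPI_Send and MPI_Recv; comm.send and comm.recv are their mpi4py equivalents). This is an illustrative sketch only, assuming mpi4py and an MPI runtime are installed.

```python
# Minimal sketch of MPI point-to-point messaging via the mpi4py binding.
# The C API uses MPI_Send/MPI_Recv; mpi4py exposes comm.send/comm.recv.
# Run with (for example): mpiexec -n 2 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # unique ID of this process

if rank == 0:
    comm.send({"msg": "hello from rank 0"}, dest=1, tag=11)   # like MPI_Send
elif rank == 1:
    data = comm.recv(source=0, tag=11)                        # like MPI_Recv
    print("rank 1 received:", data)
```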
OR
Q.2 a. Summarize VM Primitive Operations with relevant diagram.
• The VMM (Virtual Machine Monitor) gives a virtual machine (VM) view to the guest operating
system.
• With full virtualization, the VMM gives a VM that looks exactly like a real machine.

• This allows standard operating systems like Windows 2000 or Linux to run as if they are on actual
hardware.

• Mendel Rosenblum explained the low-level VMM operations, as shown in Figure 1.13.

Basic VM Operations:

1. A VM can be shared (multiplexed) between different physical hardware machines. (Figure 1.13(a))

2. A VM can be paused (suspended) and saved to stable storage. (Figure 1.13(b))

3. A suspended VM can be resumed or moved to a new hardware platform. (Figure 1.13(c))

4. A VM can be moved (migrated) from one hardware machine to another. (Figure 1.13(d))
Benefits:

• VMs can be used on any available hardware platform.

• Easy to move distributed applications.

• Helps use server resources better.

• Many server functions can run on one hardware machine.

• Avoids too many physical servers (server sprawl).

• VMware said this method can increase server use from 5–15% to 60–80%.

Q.02 b. Illustrate various system attacks and network threats to cyberspace, resulting in four types of losses, with a neat diagram.
Threats to Systems and Networks

• Clusters, grids, clouds, and P2P systems must be protected to be trusted.

• Network viruses have caused major damage to routers and servers.

• These attacks have caused large financial losses in business and government.

• Information leaks cause loss of confidentiality.

• Data integrity may be lost due to user changes, Trojan horses, and spoofing attacks.

• Denial of Service (DoS) attacks stop systems from working and disrupt Internet connections.

• Attackers may misuse systems when there is no proper authentication.

• Open systems like data centers and P2P networks are easy targets for attackers.
• Attacks can damage computers, networks, and storage systems.

• Network issues in routers and gateways reduce trust in public systems.

Loss of Confidentiality

• Happens when private information is exposed without permission.

• Caused by actions like eavesdropping, traffic analysis, or EM/RF interception.

Loss of Integrity

• Occurs when data is modified, tampered, or misused.

• Caused by penetration, masquerade, bypassing controls, and no proper authorization.

Loss of Availability

• Happens when systems or services become unavailable to users.

• Caused by DoS (Denial of Service), Trojan Horses, or service spoofing attacks.

Improper Authentication / Illegitimate Use

• Happens when attackers gain access without proper login or rights.

• Leads to misuse of resources and data theft through weak or missing authentication.
MODULE-02
Q.03 a. Demonstrate the architecture of a computer system before and
after virtualization.

Before Virtualization:

Only one operating system runs directly on the hardware.

All applications run on this single OS.

If the OS crashes, all applications stop working.

Hardware usage is low and not efficient.

Cannot run multiple OS (like Windows and Linux) on the same system.

No separation between apps one faulty app can affect others.

Software testing or running different environments is not possible.

Adding new apps or OS requires system reboot or reinstall.

Not suitable for cloud, data centers, or multi-user environments.

Overall, less flexible and harder to manage.

After Virtualization:

A Hypervisor (Virtual Machine Monitor) runs on top of hardware.

You can create multiple Virtual Machines (VMs) on one system.

Each VM has its own OS and apps, and runs independently.

If one VM fails, others are not affected.

Hardware is used better by sharing across VMs.


You can run Windows, Linux, etc. together on the same machine.

Easy to move, copy, or back up VMs.

Ideal for testing, cloud services, and hosting.

Server load is balanced and managed well.

Overall, more flexible, cost-saving, and easy to manage.

Q.03 b. Compare Physical versus Virtual Clusters.

1. Physical clusters are made of real physical machines connected through a network; virtual clusters are made of virtual machines running on one or more physical servers.

2. In a physical cluster each node needs separate hardware (CPU, RAM, storage); in a virtual cluster multiple VMs share the same hardware resources.

3. Physical clusters have a high cost due to more physical equipment; virtual clusters are cost-effective as fewer physical machines are needed.

4. Physical clusters require large physical space and power supply; virtual clusters save space and energy as many VMs run on fewer machines.

5. Physical clusters are difficult to scale, since new hardware must be added manually; virtual clusters are easy to scale, since a new VM can simply be created.

6. Setup and maintenance of a physical cluster are time-consuming and complex; setup of a virtual cluster is faster using virtualization tools.

7. If one node of a physical cluster fails, it may impact the entire cluster; VM failures in a virtual cluster are isolated, so other VMs remain unaffected.

8. Physical clusters are less efficient, as some machines may stay idle; virtual clusters have high efficiency, since hardware is shared among multiple VMs.

9. Moving data or applications between physical nodes is harder; VMs can be moved easily between servers.

10. Physical clusters need manual monitoring and control of each physical server; virtual clusters are easier to manage using virtualization software (e.g., VMware, VirtualBox).

OR
Q.04 a. Construct the Live migration process of a VM from one host to
another.

VM States Before Migration

Active: the VM is running and performing tasks.
Paused: the VM is created but currently not processing.
Suspended: the VM's data is stored to disk and the VM is inactive.

Steps in Live VM Migration

1. Pre-Migration (Step 0)
VM runs on Host A.
Destination Host B is selected and prepared.

2. Reservation (Step 1)
A container for the VM is initialized on Host B.

3. Iterative Pre-Copy (Step 2)


Memory is copied in multiple rounds to Host B.
Changed memory (dirty pages) is re-copied.
VM continues running during this time.
4. Stop and Copy (Step 3)
VM on Host A is suspended.
Final memory and CPU/network state is transferred.
Network is redirected to Host B.
This short delay is called downtime.
5. Commit (Step 4)
VM's data is released from Host A.
Host A no longer controls the VM.
6. Activation (Step 5)
VM is started on Host B.
It resumes work and connects to local devices.
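To make the iterative pre-copy idea concrete, a small, purely illustrative Python simulation is given below; the page counts and the dirtying model are assumptions for demonstration, not a hypervisor API. It shows that most memory is transferred while the VM keeps running, and only the last, small dirty set is copied during the stop-and-copy downtime.

```python
import random

def simulate_precopy(num_pages=1000, rounds=5, threshold=20):
    """Toy simulation of iterative pre-copy live migration (Steps 2-3).
    Page counts and the dirtying model are invented for illustration only."""
    pages = set(range(num_pages))
    dirty = set(pages)                       # round 0: every page must be sent
    sent = 0
    for r in range(rounds):                  # Step 2: iterative pre-copy
        sent += len(dirty)                   # copy this round's dirty pages to Host B
        # while copying, the still-running VM dirties a shrinking set of pages
        dirty = set(random.sample(sorted(pages), max(1, num_pages // (10 * (r + 1)))))
        if len(dirty) < threshold:           # dirty set small enough to stop early
            break
    # Step 3: stop-and-copy -- VM is suspended, remaining dirty pages sent (downtime)
    sent += len(dirty)
    return sent, len(dirty)

total_sent, downtime_pages = simulate_precopy()
print(f"pages sent in total: {total_sent}, pages sent during downtime: {downtime_pages}")
```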

Q.04 b. Develop an architecture of Livewire for intrusion detection using a dedicated VM.

Intrusion means unauthorized access to a system by local or network users.

Intrusion Detection System (IDS) is used to detect such unauthorized activities.

IDS can be of two types:

o HIDS (Host-based IDS): Runs on the same machine it's monitoring but is at
risk if the system is attacked.

o NIDS (Network-based IDS): Monitors network traffic but can't detect fake
(spoofed) actions.
In a virtualized system, guest VMs are isolated, so even if one VM is attacked, the others are not affected, similar to a NIDS.
The Virtual Machine Monitor (VMM) monitors and audits access requests, which helps detect faked actions and gives it the merits of a HIDS.

VM-based IDS can be implemented in two ways:

1. As a separate process in each VM or a high-privileged VM.

2. Integrated directly into the VMM with full hardware access.

A VM-based IDS includes:

o Policy Engine: Analyzes and applies security rules.

o Policy Module: Uses tools like PTrace to trace and enforce policies in guest
VMs.

It's hard to prevent intrusions immediately, so post-attack analysis is important.

Logs are used to study attack behavior, but if the OS is compromised, logs may be
untrustworthy.

Honeypots and Honeynets are also used:

o They attract attackers with fake systems to protect real systems.

o Can be physical or virtual.

o Virtual honeypots must ensure the VM can't attack the host or VMM.
MODULE-03
Q.05 a. Outline six design objectives for cloud computing.

1. Shift from desktop to data center


– Computing, storage, and software are moved from personal desktops to centralized data centers
via the Internet.

2. Service provisioning and cloud economics


– Cloud services are provided through SLAs (Service Level Agreements) with users.
– Pricing is based on a pay-as-you-go model, and services should use power and resources
efficiently.

3. Scalability in performance
– Cloud systems must support more users by scaling up performance as needed.

4. Data privacy protection


– Users should feel confident that cloud providers can keep their private data safe and secure.

5. High quality of cloud services


– Quality of Service (QoS) should be standardized to allow smooth working across different
providers.

6. New standards and interfaces


– Universal APIs and access protocols are needed to avoid data lock-in and ensure flexibility in
moving apps between cloud platforms.

Q.05 b. With a neat diagram, build a cloud ecosystem with a private cloud.
❖ A cloud ecosystem includes cloud providers, users, and technologies working together.

❖ Public clouds are commonly used and form the base of the cloud ecosystem.

❖ Private and hybrid clouds allow organizations to use both internal and public cloud resources.

❖ Users want flexible platforms to run services like websites and databases.

❖ Cloud management provides virtual resources over an IaaS platform.

❖ Virtual infrastructure management allocates virtual machines across server clusters.

❖ VM managers handle and control VMs running on physical machines like Xen, KVM, and VMware.

❖ Tools like OpenNebula, vSphere, Eucalyptus, and Nimbus are used to manage cloud systems.

❖ Many startup companies use cloud resources instead of building their own IT setups.
❖ Interfaces like Amazon EC2WS, Nimbus WSRF, and ElasticHosts REST are used to access cloud
services.

❖ VI tools also support load balancing, dynamic resizing, and efficient use of server resources.

Q.05 c. Organize Functional Modules of GAE.


• GAE is a Platform-as-a-Service (PaaS) used to build and run web applications on Google’s
cloud.

• It provides several important modules to support app development and hosting.

Functional Modules

1. Runtime Environment
– Runs applications written in Java, Python, Go, or PHP.

2. Datastore
– NoSQL database service for storing structured data.

3. Task Queues
– Handles background tasks without blocking user requests.

4. Memcache
– Provides fast, in-memory caching for frequently accessed data.

5. User Authentication
– Offers APIs to manage user login and identity.

6. App Versioning and Deployment


– Supports multiple versions of the app; easy deployment and rollback.
OR
Q.06 a. Identify basic requirements for managing the resources of a data center.
❖ Managing a data center means handling all its operations smoothly, securely, and efficiently.
❖ These management issues are based on real experiences in IT and cloud service industries.

1. User Satisfaction
– The system should give good service to users for many years (minimum 30 years).
– Quality of service (QoS) must be maintained always.

2. Controlled Information Flow


– Data must flow properly between systems without delay.
– The system must provide high availability (HA) and continuous service.

3. Multiuser Management
– The data center should support many users at the same time.
– It should handle activities like traffic control, database updates, and server monitoring.

4. Scalability
– As more users or data come in, the system should be ready to grow.
– Storage, processing power, I/O, power supply, and cooling must be easily expandable.

5. Reliability in Virtualized Systems


– The system must support fault tolerance, failover, and live migration of virtual machines.
– This helps recover quickly from hardware failure or disasters.

6. Cost Efficiency
– The total cost must be low for both cloud providers and users.
– This includes hardware, electricity, staff, and maintenance.

7. Security and Data Protection


– Strong security must be there to protect data from hackers or internal misuse.
– Data privacy and integrity must be maintained at all times.

8. Green Computing (Energy Efficiency)


– Power-saving systems should be used to reduce energy use.
– Eco-friendly designs help in saving operational costs and reducing pollution.

9. Service Automation
– Automated tools should manage routine tasks like backups, load balancing, and patch updates.
– This improves speed, accuracy, and reduces manual errors.

10. Monitoring and Reporting


– The system must continuously monitor performance and usage.
– Reports help in planning for upgrades, detecting issues, and ensuring smooth operation.
Q.06 b. Summarize six open challenges in cloud architecture development.

Challenge 1 – Service Availability and Data Lock-in Problem

1. If a cloud service fails, the whole system may stop, especially if run by a single company.

2. Using multiple cloud providers can increase service availability.

3. Proprietary APIs cause "lock-in" — users can't easily move apps/data between clouds.

Challenge 2 – Data Privacy and Security Concerns

1. Cloud systems are open to cyberattacks like DDoS, malware, and VM hijacking.

2. Data can be stolen or misused if not properly encrypted or protected.

3. Some countries require data to stay within their borders, adding legal issues.

Challenge 3 – Unpredictable Performance and Bottlenecks

1. VMs share CPU/memory well, but I/O (like disk access) causes slowdowns.

2. Large-scale applications face data transfer delays and traffic problems.

3. Bottlenecks must be avoided using better placement and hardware upgrades.

Challenge 4 – Distributed Storage and Software Bugs

1. Cloud systems need storage that can grow and shrink with demand.

2. Debugging cloud errors is hard because bugs appear only at a large scale.

3. Virtual machines and simulators can help collect useful debugging info.

Challenge 5 – Cloud Scalability, Interoperability, and Standardization

1. Cloud services must scale quickly without breaking SLAs.

2. Standard VM formats (like OVF) help run apps on different platforms.

3. Cross-platform migration between Intel/AMD is still difficult.

Challenge 6 – Software Licensing and Reputation Sharing

1. Cloud needs flexible software licenses (e.g., pay-per-use or bulk licensing).

2. One bad user can damage the whole cloud's reputation (e.g., IP blacklisting).

3. Legal liability between provider and customer must be handled in SLAs.


Q.06 c. Summarize the cloud services offered by AWS (Amazon Web Services).

1. AWS uses the IaaS (Infrastructure as a Service) model


– It gives virtual machines (VMs) through EC2 to run cloud applications.

2. S3 and EBS for storage


– S3 (Simple Storage Service) is used for object-based storage.
– EBS (Elastic Block Store) gives block-level storage for regular apps.

3. SQS and SNS for messaging


– SQS (Simple Queue Service) stores messages even if the receiver is offline.
– SNS (Simple Notification Service) is used for sending notifications.

4. Extra services for performance


– ELB (Elastic Load Balancer) spreads traffic across EC2 instances.
– CloudWatch monitors resources like CPU, memory, and network use.
– Auto Scaling adds or removes instances based on demand.
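As a small illustration of how these services are consumed programmatically, the sketch below uses the boto3 Python SDK (assuming it is installed and AWS credentials are configured); the bucket name, queue name, and file names are hypothetical.

```python
import boto3

# S3: object storage -- upload a local file as an object
s3 = boto3.client("s3")
s3.upload_file("report.csv", "example-bucket", "reports/report.csv")

# SQS: queued messaging -- messages are stored even if the receiver is offline
sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="job-42 ready")
reply = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
print(reply.get("Messages", []))
```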
MODULE-04
Q.07 a. Demonstrate surfaces of attacks in a cloud computing environment with neat
diagram.

1. Starting cloud use is too easy


– Many users start using cloud services without understanding the security risks or ethics.
– This creates chances for misuse or unsafe actions.

2. Clouds can be used for large attacks


– Cloud systems can be misused to launch big cyber-attacks on other systems.
– Preventing such misuse is an important challenge.

3. Three types of risks in cloud


– (i) Traditional threats, (ii) Availability issues, (iii) Third-party data control risks.
– These problems make cloud computing risky without proper care.

4. Traditional threats
– These include DDoS attacks, phishing, SQL injection, cross-site scripting, etc.
– In clouds, these threats affect many users because resources are shared.

5. Authentication & user access issues


– Users from the same company may need different access levels.
– Mixing company security rules with cloud rules is not easy.

6. Cloud attacks are hard to trace


– It’s difficult to identify how a cloud system is attacked.
– Traditional investigation methods like logs don’t work well in clouds.
7. Service availability problems
– Power failure, hardware crash, or natural disasters can shut down cloud services.
– If data is locked inside the cloud, it affects companies badly.

8. Third-party trust issues


– Cloud providers may use untrusted vendors or low-quality hardware.
– Users lose data because they have no control or transparency.

9. Cloud providers are not responsible


– For example, AWS terms say they are not liable for data loss or failure.
– This creates risk for users because there’s no strong guarantee.

10. Top 7 threats (CSA Report 2010)


– Abuse of cloud, Insecure APIs, Malicious insiders, Shared tech issues, Account hijacking,
Data loss/leakage, and Unknown risk profile.
– IaaS is affected by all 7, PaaS and SaaS are affected by fewer.

Q.07 b. List out the top cloud security threats of CSA2016.

1. Data Breaches
– Unauthorized access to sensitive or confidential data.

2. Weak Identity, Credential and Access Management


– Poor password practices or stolen credentials lead to unauthorized access.

3. Insecure APIs (Application Programming Interfaces)


– Vulnerable APIs can be exploited by attackers to gain access.

4. System Vulnerabilities
– Bugs or flaws in software can allow attackers to exploit systems.

5. Account Hijacking
– Attackers use stolen credentials to take over accounts and services.

6. Malicious Insiders
– Employees or partners with access misuse their privileges.

7. Advanced Persistent Threats (APTs)


– Long-term targeted attacks aimed at stealing data or damaging systems.

8. Data Loss
– Accidental deletion, system failure, or lack of backups leads to permanent data loss.

9. Insufficient Due Diligence


– Organizations adopt cloud without fully understanding responsibilities or risks.
10. Abuse and Nefarious Use of Cloud Services
– Attackers use cloud resources for spamming, malware hosting, or launching DDoS attacks.

11. Denial of Service (DoS)


– Attackers overload cloud services, making them unavailable to legitimate users.

12. Shared Technology Vulnerabilities


– Risks from the multi-tenant model where many users share the same infrastructure.

Q.07 c. Select four widely accepted fair information practices with which "consumer-oriented commercial web sites that collect personal identifying information from or about consumers online" would be required to comply.
1. Notice

o Websites must clearly inform users about their data collection practices.

o This includes what data is collected, how it is collected (e.g., cookies), how it’s used, and if it is
shared with other entities.

2. Choice

o Users must be given choices on how their personal data is used.

o This includes both internal use (like marketing) and external use (sharing with third parties).

3. Access

o Users must be allowed to view, correct, or delete their personal information.

o This ensures transparency and user control over their data.

4. Security

o Websites must take reasonable steps to protect user data from theft or misuse.

o The approach should be technologically neutral and flexible for future developments.

OR
Q.08 a. Summarize the design goals of Xoar.

• Xoar is a modified version of Xen, designed to improve system security using microkernel
principles.

• It assumes trusted system administrators manage the system and threats mainly come from guest VMs or
bugs in the management code.

• It maintains all Xen functionalities while controlling privileges tightly—each component gets
only what it needs.

• Interfaces are minimized to reduce attack surfaces, and sharing is avoided or explicitly logged.

• Components run only when needed to reduce the time window for attacks.

• Xoar allows secure audit logging for better traceability.

• There are four types of components:


– Always running (e.g., XenStore-State)
– Used at boot time and then removed
– Loaded when requested
– Restarted on a timer

• Modular design reduces the risk and footprint of the system, with only a small performance
impact.

• Examples include: Builder (starts VMs), QEMU (device emulation), and drivers like PCIBack and
NetBack.
Q.08 b. Explain mobile devices and cloud security.

1. Mobile Cloud Ecosystem

o Mobile apps use cloud services for data storage, backups, and processing because
devices have limited CPU, memory, and storage.

2. Security Challenges on Mobile Devices

o Mobile devices often connect over public or untrusted Wi-Fi networks, which can be
intercepted by attackers.

o They are frequently lost or stolen, increasing the risk of unauthorized data access.

3. Authentication and Identity Management

o Weak device-side authentication (like reused or weak passwords) can let attackers
access cloud accounts.

o Modern best practices recommend using multi-factor authentication (MFA).

4. Data Protection in Transit and at Rest

o Data must be encrypted both while traveling (e.g., TLS/SSL) and while stored in the cloud.

o End-to-end encryption ensures only device users can read sensitive data.

5. App and API Security

o Malicious or vulnerable apps might access or leak user data saved in the cloud.

o Secure mobile apps need trustworthy APIs with proper access controls.

6. Device and App Management

o Enterprises use Mobile Device Management (MDM) tools to enforce encryption, require strong passcodes, push updates, and enable remote wipe capabilities.

o This protects data even if a device is lost or stolen.

Q.08 c. Model an overview of reputation system design options.

1. Centralized Reputation System


o One central authority collects, evaluates, and manages reputation scores.

o Easy to control and monitor but can be a single point of failure or target for attack.

2. Distributed Reputation System

o Reputation data is shared among peers or systems.


o No single control point, but more complex to manage consistency and trust.

3. User-Based Ratings

o Users give direct feedback (e.g., stars, likes, reviews) after a service or transaction.

o Simple to implement but can be manipulated using fake reviews or Sybil attacks.
4. Behavior-Based Monitoring

o The system monitors actual behavior (e.g., uptime, response time, data accuracy).

o More reliable and objective, but needs complex tracking and analytics.

5. Context-Aware Reputation

o Reputation scores depend on the specific service or environment (e.g., reliability in


storage vs. computation).

o Allows fine-grained trust decisions for different scenarios.

6. Time-Based Reputation

o Reputation fades over time if not updated, encouraging ongoing good behavior.

o Prevents users from building high scores and then acting maliciously later.

7. Incentive-Driven Models

o Users are rewarded (credits, trust scores) for providing accurate feedback or behaving
well.

o Encourages honesty but needs protection against misuse of incentives.
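Relating to option 6 above (time-based reputation), a tiny illustrative sketch of score decay is given below; the half-life and blending rule are assumptions chosen for demonstration, not a standard design.

```python
import time

HALF_LIFE = 30 * 24 * 3600      # assumed: reputation halves every 30 days

def decayed_score(score, last_update, now=None):
    """Score fades exponentially if no new feedback arrives."""
    age = (now or time.time()) - last_update
    return score * 0.5 ** (age / HALF_LIFE)

def record_feedback(score, last_update, feedback, weight=0.1):
    """Blend the decayed old score with new feedback in the range [0, 1]."""
    current = decayed_score(score, last_update)
    return (1 - weight) * current + weight * feedback, time.time()

# A score of 0.9 left unrefreshed for 60 days (two half-lives) decays to ~0.225.
print(round(decayed_score(0.9, time.time() - 60 * 24 * 3600), 3))
```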


MODULE-05
Q.09 a. Outline Important Cloud Platform Capabilities.

1. On-Demand Self-Service

o Users can access computing resources (like servers, storage) whenever they need, without
human help.

2. Broad Network Access

o Services are available over the internet and can be used from laptops, phones, or tablets.
3. Resource Pooling

o Cloud providers share resources (like storage, memory) among many users using
virtualization.

4. Rapid Elasticity

o Resources can be increased or decreased quickly based on need (auto-scaling).

5. Measured Service (Pay-as-You-Go)

o Users only pay for what they use (like mobile recharge) — helps save money.
6. High Availability

o Cloud platforms make sure services run 24/7 without downtime using backup and load
balancing.

7. Security and Compliance

o Provides user authentication, data encryption, firewalls, and follows government


policies.

8. Automation

o Many tasks like backups, updates, scaling can be done automatically without manual work.

9. Multi-Tenancy

o Multiple users can use the same cloud system securely and privately.

10. APIs and Developer Tools

• Easy-to-use tools for developers to build, test, and deploy applications on the cloud.

Q.09 b. Organize the steps involved in MapReduce.


1. Input Splitting

o The input data is split into small parts (blocks) for processing.

2. Map Function

o Each part is processed in parallel by the Map function.

o It converts data into key-value pairs.


(Example: “apple” becomes <apple, 1>)

3. Shuffling

o The system groups all values with the same key together.
(All <apple, 1> pairs are brought together.)

4. Sorting

o Keys are sorted before passing them to the reducer.

5. Reduce Function

o All grouped key-value pairs are processed by the Reduce function.

o It performs operations like counting, summing, or averaging.


(Example: <apple, [1,1,1]> becomes <apple, 3>)

6. Output Generation

o The final output is stored in the file system in key-value format.

Simple Example

Input: A list of words → ["apple", "apple", "banana"]

Map Output:
<apple, 1>, <apple, 1>, <banana, 1>

Shuffle + Reduce Output:


<apple, 2>, <banana, 1>
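A tiny, self-contained Python sketch of this word-count flow is shown below; it illustrates the map/shuffle/reduce idea in plain Python and is not the Hadoop API.

```python
from collections import defaultdict

def map_fn(word):
    return (word, 1)                       # Map: emit <word, 1>

def reduce_fn(word, counts):
    return (word, sum(counts))             # Reduce: sum all counts for the word

words = ["apple", "apple", "banana"]       # input split

mapped = [map_fn(w) for w in words]        # [('apple', 1), ('apple', 1), ('banana', 1)]

groups = defaultdict(list)                 # Shuffle/sort: group values by key
for key, value in sorted(mapped):
    groups[key].append(value)

output = [reduce_fn(k, v) for k, v in groups.items()]
print(output)                              # [('apple', 2), ('banana', 1)]
```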

OR
Q.10 a. Explain with a neat diagram how data flows in running a MapReduce job at
various task trackers using the Hadoop library.

Phase 1: Setup and Input

1. Data Partitioning

o Input file stored in HDFS is split into M pieces (Input Splits).

o Each split is assigned to a Map Task.

2. Computation Partitioning

o User writes Map() and Reduce() functions.

o Hadoop system forks user programs and distributes to workers.

3. Master and Workers Setup

o One instance becomes the Master (JobTracker).

o Others become Workers (TaskTrackers or NodeManagers).

o Master assigns Map/Reduce Tasks to workers.


Phase 2: Map Side Processing

4. Input Reading

o Each Map worker reads its split and passes it to the Map() function.

5. Map Function Execution

o Produces intermediate (key, value) pairs.

6. Combiner (Optional)

o Combines values locally (e.g., local sum) to reduce network data.

7. Partitioning Function

o Intermediate data is split into R partitions (one per Reduce task) using Hash(key) mod R (see the small sketch below).
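A minimal illustration of that partitioning rule follows; Python's built-in hash is used purely for demonstration (Hadoop uses its own partitioner class).

```python
R = 3                                        # number of Reduce tasks

def partition(key, num_reducers=R):
    # Hash(key) mod R decides which Reduce task receives this key
    return hash(key) % num_reducers

for key in ["apple", "banana", "cherry", "apple"]:
    print(key, "-> reduce task", partition(key))
# Within a run, the same key always lands in the same partition, so all of its
# intermediate values reach a single reducer.
```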
Phase 3: Shuffle and Reduce

8. Synchronization

o Reduce workers wait for all Map tasks to complete.

9. Communication

o Reduce workers fetch partitions from all Map workers using RPC.

10. Sorting & Grouping

• Keys are sorted and grouped (all values with same key together).

11. Reduce Function Execution

• Final results are written to HDFS output files.

Q.10 b. Explain the data mutation sequence in GFS with a diagram.


1. Client Requests Chunk Info

o Client contacts Master to ask which chunk server has the lease for the chunk and where other
replicas are.

2. Master Responds

o Master replies with:

▪ Identity of Primary replica

▪ Locations of Secondary replicas

o Client caches this info for future requests.

3. Client Pushes Data

o Client sends the data to all replicas (primary + secondary).

o Data is stored in a buffer cache at each chunk server.

o This step is decoupled from control flow for better performance.

4. Client Sends Write Request to Primary

o After all servers receive the data, the client informs the Primary to begin mutation.
o Primary assigns serial numbers to maintain write order.

5. Primary Forwards to Secondaries

o Primary sends write request to all secondary replicas, enforcing the same
serial order.

6. Secondaries Acknowledge

o Secondaries confirm mutation is applied successfully.

7. Primary Responds to Client

o After all secondaries reply, primary responds to the client.

o If any error occurs, the client:

▪ Marks write as failed

▪ Retries the mutation (steps 3–7 or restart from step 1 if needed)
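The control flow of steps 3-7 can be sketched as a toy, in-memory simulation; the class and method names below are invented for illustration and are not the real GFS API.

```python
# Toy simulation of the GFS write (mutation) control flow, steps 3-7.
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}          # data pushed by the client (step 3)
        self.chunk = []           # mutations applied in serial order

    def push_data(self, data_id, data):
        self.buffer[data_id] = data                              # step 3

    def apply(self, serial, data_id):
        self.chunk.append((serial, self.buffer.pop(data_id)))    # apply mutation
        return True                                              # step 6: acknowledge

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial                  # step 4: assign serial number
        self.next_serial += 1
        self.apply(serial, data_id)
        acks = [s.apply(serial, data_id) for s in self.secondaries]   # step 5
        return all(acks)                           # step 7: success only if all ack

secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("primary", secondaries)

data_id, data = "d1", b"hello gfs"
for replica in [primary, *secondaries]:
    replica.push_data(data_id, data)               # step 3: client pushes data to all
print("write succeeded:", primary.write(data_id))  # step 4: client asks primary to commit
```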
