
UDS24201J - UNIT IV

SQL TO MONGODB MAPPING


Introduction
A Database Management System (DBMS) is system software used to create, manage, and work with databases. MongoDB is a non-relational database management system. Unlike SQL (Structured Query Language) databases, which are based on tables, MongoDB is a document-based database.
MongoDB is often an integral part of the JavaScript web ecosystem, as in the MERN or MEAN stack.
An analogy to RDBMS:
In traditional Relational Database Management Systems, data is stored in tables made up of rows and columns.
MongoDB is a NoSQL DBMS. In a NoSQL DBMS, information is stored in BSON format, which is simply a binary version of JSON. Data is stored in the form of collections.
If we draw a diagrammatic analogy between SQL tables and NoSQL collections, we get the following:

Comparison between NoSQL and SQL DBMS


In every NoSQL collection, BSON documents contain individual units of data called fields, just as rows in SQL tables contain columns.


Concepts in SQL and MongoDB


Before we compare SQL terminology to MongoDB terminology, let us first understand some
basic concepts of SQL and MongoDB.

Field – a name and value pair.

Document – a group of fields.

Collection – a group of documents clubbed together.

Database – a physical container that holds multiple collections.
Now, let us see how these MongoDB concepts match up with SQL terminology.

SQL Terminology            MongoDB Terms

Database                   Database

Table                      Collection

Row                        Document

Column                     Field

Index                      Index

Primary Key                Primary Key

Aggregation                Aggregation Pipeline

SELECT INTO NEW_TABLE      $out

MERGE INTO TABLE           $merge

Table joins                $lookup, embedded documents

As discussed in the first table, a field is the smallest unit of a MongoDB document, just as a column is the smallest unit of a SQL table. Just as columns make up rows and rows make up tables, fields make up documents and documents make up collections.
Another important analogy that needs to be discussed here is that of the primary key. In SQL systems we can assign any unique column as the primary key. In MongoDB, every document has an _id field, which is automatically used as the primary key.
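For example, if we insert a document without specifying _id, MongoDB generates one automatically. A small mongosh illustration (the Demo collection matches the example used in the next section):

db.Demo.insertOne( { user_id: "xyz", age: 19 } )
db.Demo.findOne()
// returns something like: { _id: ObjectId("..."), user_id: "xyz", age: 19 }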


Mapping SQL Commands to MongoDB


To learn the differences between SQL and MongoDB, we will take an example. Let us assume a table called Demo in SQL. For the analogy, let us assume a collection called Demo in MongoDB which contains documents of the following type:
{
_id: ObjectId("50646isz351sd"),
user_id: "xyz",
age: 19,
status: 'Y'
}
Create and Alter Commands
The following table includes Create and Alter commands in SQL and MongoDB.

SQL:
CREATE TABLE Demo (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    user_id Varchar(20),
    age Number,
    status char(1),
    PRIMARY KEY (id)
)
MongoDB:
db.createCollection("Demo")

SQL:
ALTER TABLE Demo ADD join_date DATETIME
MongoDB:
db.Demo.updateMany(
    { },
    { $set: { join_date: new Date() } }
)

SQL:
ALTER TABLE Demo DROP COLUMN join_date
MongoDB:
db.Demo.updateMany(
    { },
    { $unset: { "join_date": "" } }
)

SQL:
CREATE INDEX idx_user_id_asc ON Demo (user_id)
MongoDB:
db.Demo.createIndex( { user_id: 1 } )

SQL:
CREATE INDEX idx_user_id_asc_age_desc ON people (user_id, age DESC)
MongoDB:
db.people.createIndex( { user_id: 1, age: -1 } )

SQL:
DROP TABLE people
MongoDB:
db.people.drop()


Insert Statements
The following table includes insertion commands in SQL and MongoDB.

SQL:
INSERT INTO Demo (user_id, age, status)
VALUES ("ninja", 19, "N")
MongoDB:
db.Demo.insertOne(
    { user_id: "ninja", age: 19, status: "N" }
)

Delete Statements
The following table includes delete commands in SQL and MongoDB.

SQL:
DELETE FROM Demo WHERE status = "N"
MongoDB:
db.Demo.deleteMany( { status: "N" } )

SQL:
DELETE FROM Demo
MongoDB:
db.Demo.deleteMany( { } )


Update Statements
The following table includes update commands in SQL and MongoDB.

SQL:
UPDATE Demo SET status = "Y" WHERE age > 18
MongoDB:
db.Demo.updateMany( { age: { $gt: 18 } }, { $set: { status: "Y" } } )

SQL:
UPDATE Demo SET age = age + 2 WHERE status = "N"
MongoDB:
db.Demo.updateMany( { status: "N" }, { $inc: { age: 2 } } )

Selection Commands
The following table includes selection commands in SQL and MongoDB.

SQL:
SELECT * FROM Demo
MongoDB:
db.Demo.find()

SQL:
SELECT id, user_id, status FROM Demo
MongoDB:
db.Demo.find( { }, { user_id: 1, status: 1 } )

SQL:
SELECT user_id, status FROM Demo
MongoDB:
db.Demo.find( { }, { user_id: 1, status: 1, _id: 0 } )

SQL:
SELECT * FROM Demo WHERE status = "B"
MongoDB:
db.Demo.find( { status: "B" } )

SQL:
SELECT user_id, status FROM Demo WHERE status = "A"
MongoDB:
db.Demo.find( { status: "A" }, { user_id: 1, status: 1, _id: 0 } )

SQL:
SELECT * FROM Demo WHERE status != "A"
MongoDB:
db.Demo.find( { status: { $ne: "A" } } )

SQL:
SELECT * FROM Demo WHERE status = "A" AND age = 50
MongoDB:
db.Demo.find( { status: "A", age: 50 } )

SQL:
SELECT * FROM Demo WHERE status = "A" OR age = 50
MongoDB:
db.Demo.find( { $or: [ { status: "A" }, { age: 50 } ] } )

SQL:
SELECT * FROM Demo WHERE age > 25
MongoDB:
db.Demo.find( { age: { $gt: 25 } } )

SQL:
SELECT * FROM Demo WHERE age < 25
MongoDB:
db.Demo.find( { age: { $lt: 25 } } )

SQL:
SELECT * FROM Demo WHERE age > 25 AND age <= 50
MongoDB:
db.Demo.find( { age: { $gt: 25, $lte: 50 } } )

SQL:
SELECT COUNT(*) FROM Demo
MongoDB:
db.Demo.count()

SQL:
SELECT COUNT(user_id) FROM Demo
MongoDB:
db.Demo.count( { user_id: { $exists: true } } )

SQL:
SELECT COUNT(*) FROM Demo WHERE age > 30
MongoDB:
db.Demo.count( { age: { $gt: 30 } } )

SQL:
SELECT DISTINCT(status) FROM Demo
MongoDB:
db.Demo.aggregate( [ { $group: { _id: "$status" } } ] )

SQL:
SELECT * FROM Demo LIMIT 1
MongoDB:
db.Demo.find().limit(1)

SQL:
SELECT * FROM Demo LIMIT 5 SKIP 10
MongoDB:
db.Demo.find().limit(5).skip(10)

SQL:
EXPLAIN SELECT * FROM Demo WHERE status = "A"
MongoDB:
db.Demo.find( { status: "A" } ).explain()


HOW TO OPTIMIZE QUERY PERFORMANCE


MongoDB is a popular NoSQL database known for its flexibility, scalability, and powerful
capabilities in handling large datasets. However, to fully use MongoDB's potential,
optimizing query performance is essential. Efficient MongoDB queries not only reduce
response times but also help in lowering resource consumption, making our application
faster and more responsive.
MongoDB Query Optimization
MongoDB query optimization involves improving the efficiency and speed of database queries to enhance application performance. By applying specific techniques and best practices, and by refining how data is retrieved, processed, and stored, we can ensure that queries execute faster, minimizing execution time, reducing resource consumption, and lowering the load on the system.
Table of Contents
 Indexing
 Query Optimization
 Projection
 Pagination
 Caching
 Proper Data Modeling
 Use of Proper Data Types
 Monitoring and Maintenance
1. Indexing
Indexing is the process of creating data structures that improve the speed of data retrieval operations on a database. In MongoDB, indexes are created on specific fields to accelerate query execution by reducing the number of documents that need to be scanned. It is one of the most powerful techniques for optimizing query performance in MongoDB and can significantly boost performance, especially for frequently accessed fields. MongoDB supports various types of indexes, including single-field, compound, multi-key, and text indexes.
Types of indexes in MongoDB
MongoDB supports various types of indexes, each serving different use cases:
Single-field index: Indexes created on a single field.


Compound index: Indexes that include multiple fields to handle complex queries more
efficiently.
Multi-key index: Indexes for fields that store arrays, indexing each element of the array.
Text index: Optimized indexes for full-text search operations.
Creating indexes in MongoDB
Indexes can be created using the createIndex() method, and proper indexing ensures faster data retrieval, especially on frequently queried fields. Here is an example of creating a single-field index on field and a compound index on field1 and field2:
// Create a single-field index
db.collection.createIndex({ field: 1 });

// Create a compound index
db.collection.createIndex({ field1: 1, field2: -1 });
Explanation: This compound index will help improve the performance of queries that filter
or sort by both field1 and field2.
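The multi-key and text index types listed above use the same createIndex() call; a brief sketch (the field names tags and description are illustrative, not taken from the example schema):

// Multi-key index: created automatically when the indexed field holds arrays
db.collection.createIndex({ tags: 1 });

// Text index for full-text search on string content
db.collection.createIndex({ description: "text" });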
2. Optimizing Queries with the explain() Method
Query optimization involves analyzing and refining queries to minimize execution time and
resource consumption. MongoDB provides the explain() method to analyze query
performance and identify potential optimization opportunities.
Using explain() method for query analysis
The explain() method provides detailed information about query execution, including query
plan, index usage, and execution statistics. This tool helps identify
performance bottlenecks and areas for improvement by showing whether the query is using
an index and how many documents are scanned.
// Analyze query performance
db.collection.find({ field: value }).explain();
Explanation: The explain() method returns execution details, including whether the query
used an index and the number of documents scanned. This helps identify inefficient queries
and optimize them.
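explain() also accepts a verbosity mode. Passing "executionStats", for instance, reports how many documents were actually examined (a short sketch; the output fields named in the comment are the ones typically inspected):

db.collection.find({ field: value }).explain("executionStats");
// Compare executionStats.totalDocsExamined with executionStats.nReturned:
// a large gap usually means the query scans documents an index could skip.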
3. Projection
Projection in MongoDB allows us to limit the fields returned by a query to only the necessary
ones. By limiting the number of fields returned, we reduce the data transfer overhead,
improving query execution time and response time, especially with large documents.


Importance of projection in query optimization


Projection helps in minimizing network overhead by transmitting only required data fields.
This reduces the overall data transfer size and improves query performance, especially
when dealing with large datasets.
// Retrieve specific fields
db.collection.find({ field: value }, { field1: 1, field2: 1 });
Explanation: This query will return only the field1 and field2 values from the documents
that match the field condition, minimizing the amount of data transferred.
4. Pagination
Pagination helps manage large result sets by breaking them into smaller, more manageable
chunks. By using limit() and skip() methods, we can retrieve data in manageable chunks,
reducing resource consumption and improving response times.
Implementing pagination in MongoDB
// Retrieve data in chunks
db.collection.find({}).skip(10).limit(10);
Explanation: skip(10) skips the first 10 documents, while limit(10) ensures only the next 10 documents are returned. This combination allows for efficient pagination.
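Because skip() still iterates over the skipped documents, very deep pages can become slow. A common alternative, sketched below under the assumption that results are sorted by _id and that lastSeenId holds the last _id from the previous page, is range-based pagination:

// First page
db.collection.find({}).sort({ _id: 1 }).limit(10);

// Next page: continue after the last _id returned by the previous page
db.collection.find({ _id: { $gt: lastSeenId } }).sort({ _id: 1 }).limit(10);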
5. Caching: Speed Up Repeated Queries
Caching is another powerful method to improve MongoDB query performance. Frequently
accessed data can be cached in memory, reducing the need for repeated queries to
the database. This minimizes the load on the MongoDB server and speeds up response times.
Role of caching in query optimization
Caching minimizes redundant database calls by storing frequently accessed data in memory.
This reduces the load on the database and enhances overall application performance.
Integrating MongoDB with an external caching layer, such as Redis or Memcached, can
significantly reduce the number of database queries. By storing frequently accessed data in
memory, caching ensures faster access to commonly requested data.
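As a minimal sketch of such a read-through cache, assuming a Node.js application with the official mongodb and redis drivers on their default local ports (the function name, cache key format, and TTL are illustrative, not part of MongoDB itself):

const { MongoClient } = require("mongodb");
const { createClient } = require("redis");

const TTL_SECONDS = 60;                          // illustrative cache lifetime

async function findUserCached(db, redis, userId) {
  const cacheKey = "user:" + userId;
  const cached = await redis.get(cacheKey);      // 1. try the cache first
  if (cached) return JSON.parse(cached);

  const doc = await db.collection("Demo").findOne({ user_id: userId }); // 2. fall back to MongoDB
  if (doc) {
    // 3. populate the cache (JSON round-trip is a simplification; ObjectId becomes a string)
    await redis.set(cacheKey, JSON.stringify(doc), { EX: TTL_SECONDS });
  }
  return doc;
}

// Wiring, done once at application startup:
// const mongo = await MongoClient.connect("mongodb://localhost:27017");
// const redis = createClient(); await redis.connect();
// const doc = await findUserCached(mongo.db("test"), redis, "ninja");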
6. Proper Data Modeling
Efficient data modeling ensures that queries retrieve data with minimal overhead. Proper
structuring of data can significantly reduce the need for complex queries and joins, leading to
faster data retrieval.
Strategies for Efficient data modeling
Some best practices for efficient data modeling include:


Nested documents: Embed related data within a single document to minimize joins.
Arrays: Use arrays to represent one-to-many relationships efficiently.
Denormalization: Duplicate data when necessary to avoid complex joins.
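As a hedged illustration of embedding and arrays (the orders collection and its fields are invented for this sketch), a document that avoids a join by nesting its related data might look like this:

db.orders.insertOne({
    _id: 1001,
    customer: { name: "xyz", city: "Delhi" },        // embedded (nested) document
    items: [                                         // array for a one-to-many relationship
        { product: "pen", qty: 2, price: 10 },
        { product: "notebook", qty: 1, price: 60 }
    ]
});

A single read of this document returns the order, its customer details, and all its line items, whereas a fully normalized relational design would need joins across several tables.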
7. Use of Proper Data Types
Choosing the right data types for fields is critical for query performance, storage efficiency, and data integrity. For example, storing dates as Date types enables efficient date range queries, while using the correct numeric types can reduce memory usage and improve the speed of arithmetic operations.
Best practices for selecting data types:
Use Date types for date fields.
Use appropriate numeric types (Int32, Double, etc.) to save space.
Avoid storing large strings or objects in fields that don't require it.
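A small mongosh sketch of these practices (the events collection and its fields are illustrative; NumberInt() and ISODate() are the shell helpers for 32-bit integers and dates):

db.events.insertOne({
    name: "login",
    createdAt: new Date(),        // stored as a BSON Date, enabling efficient range queries
    retries: NumberInt(3)         // 32-bit integer instead of the shell's default 64-bit double
});

// A date range query that benefits from the Date type (and an index on createdAt)
db.events.find({ createdAt: { $gte: ISODate("2024-01-01"), $lt: ISODate("2024-02-01") } });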
8. Monitoring and Maintenance
Regular monitoring of MongoDB is essential to ensure that it performs optimally over time.
MongoDB provides several tools for monitoring, such as the db.serverStatus() method and
the MongoDB Atlas dashboard.
Set up alerting for critical metrics such as CPU usage, memory consumption, and query performance, so that issues are identified early.
Monitor query performance and identify slow-running queries for optimization.
Maintenance tasks to ensure optimal performance
Regular index optimization: Rebuild indexes periodically to ensure they remain efficient.
Data compaction: Regularly compact data files to reclaim disk space and optimize storage.
Index updates: Create and update indexes based on query patterns to ensure efficient index utilization.
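A few mongosh commands commonly used for these monitoring and maintenance tasks (db.serverStatus() is the method mentioned above; the 100 ms profiler threshold is an illustrative value):

db.serverStatus().mem                               // memory usage of the mongod process
db.setProfilingLevel(1, { slowms: 100 })            // record operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)  // inspect the most recent slow operations
db.collection.getIndexes()                          // review existing indexes against query patterns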
Conclusion
Optimizing MongoDB queries is crucial for improving application performance and
ensuring that our database can handle large datasets efficiently. By implementing techniques
such as proper indexing, query analysis with explain(), efficient pagination, caching, and
using the right data types, we can drastically reduce query execution time and resource
consumption. Additionally, maintaining good data modeling practices and regularly monitoring our MongoDB deployment will keep it performing at its best.


BENEFITS OF MONGODB QUERIES

MongoDB queries offer several key benefits for developers and businesses, particularly for
handling large amounts of unstructured or semi-structured data. Here are some of the benefits
of using MongoDB queries:
1. Flexible Schema Design: MongoDB is a NoSQL database, so its schema can be
dynamic. Queries can handle documents with different structures, making it easier to
work with varying data types.
2. High Performance: MongoDB is optimized for speed, particularly with its ability to
index and perform real-time queries on large datasets. Its support for in-memory
storage can also significantly boost query performance.
3. Powerful Querying: MongoDB offers rich querying capabilities such as:
o Text search: MongoDB supports full-text search, allowing you to search for
words and phrases within string fields.
o Aggregation: MongoDB’s aggregation framework enables complex data
transformation and analysis, such as grouping, sorting, and performing
calculations on data.
o Geospatial queries: MongoDB supports geospatial indexing and queries,
allowing you to efficiently query location-based data.
4. Scalability: MongoDB's horizontal scaling capabilities allow it to scale across many
servers to handle increasing workloads. It supports sharding (splitting data across
multiple machines), which is beneficial for applications with high data volumes.
5. ACID Transactions (from 4.0 onward): MongoDB supports multi-document ACID
transactions, which means you can run queries that need multiple operations to
complete atomically, ensuring data consistency.
6. Real-Time Data Access: MongoDB allows for real-time data updates and instant
querying, which is useful for applications that require real-time analytics or live data
processing.
7. Rich Aggregation Framework: MongoDB's aggregation framework provides a
powerful set of operations for transforming and combining data, such as $match,
$group, $sort, and $project. This allows developers to build complex queries with ease (a short pipeline sketch follows this list).


8. Ease of Integration: MongoDB's query syntax is similar to JSON, making it easy to integrate with modern web technologies and applications. It's especially useful for developers familiar with JavaScript and Node.js.
9. Indexing: MongoDB allows you to index fields in documents to speed up query
operations. You can create different types of indexes, such as compound indexes, text
indexes, and geospatial indexes, to optimize query performance.
10. Replication and Fault Tolerance: MongoDB supports replica sets, which are copies
of data stored on multiple machines. This provides high availability and fault
tolerance, ensuring queries can be answered even when some parts of the system fail.
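As an illustration of the aggregation framework from point 7, a minimal pipeline over the Demo collection used earlier in this unit might filter, group, sort, and reshape documents (a hedged sketch, not taken from the list above):

db.Demo.aggregate([
    { $match: { age: { $gt: 18 } } },                       // filter documents first
    { $group: { _id: "$status", users: { $sum: 1 } } },     // count documents per status
    { $sort: { users: -1 } },                               // largest groups first
    { $project: { status: "$_id", users: 1, _id: 0 } }      // reshape the output
]);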
These benefits make MongoDB a powerful tool for building scalable, high-performance
applications, especially when working with large amounts of data or data that is subject to
frequent changes.
DISTRIBUTED OPERATING SYSTEMS
A Distributed Operating System refers to a model in which applications run on multiple
interconnected computers, offering enhanced communication and integration capabilities
compared to a network operating system.
What is a Distributed Operating System?
In a Distributed Operating System, multiple CPUs are utilized, but for end-users, it appears
as a typical centralized operating system. It enables the sharing of various resources such as
CPUs, disks, network interfaces, nodes, and computers across different sites, thereby
expanding the available data within the entire system.
Effective communication channels like high-speed buses and telephone lines connect all
processors, each equipped with its own local memory and other neighboring processors.
Due to its characteristics, a distributed operating system is classified as a loosely coupled
system. It encompasses multiple computers, nodes, and sites, all interconnected
through LAN/WAN lines. The ability of a Distributed OS to share processing resources
and I/O files while providing users with a virtual machine abstraction is an important
feature.
The diagram below illustrates the structure of a distributed operating system:


Applications of Distributed Operating System


Distributed operating systems find applications across various domains where distributed
computing is essential. Here are some notable applications:
 Cloud Computing Platforms:
o Distributed operating systems form the backbone of cloud computing
platforms like Amazon Web Services (AWS), Microsoft Azure, and Google
Cloud Platform (GCP).
o These platforms provide scalable, on-demand computing resources
distributed across multiple data centers, enabling organizations to deploy and
manage applications, storage, and services in a distributed manner.
 Internet of Things (IoT):
o Distributed operating systems play a crucial role in IoT networks, where
numerous interconnected devices collect and exchange data.
o These operating systems manage communication, coordination, and data
processing tasks across distributed IoT devices, enabling applications such as
smart home automation, industrial monitoring, and environmental sensing.
 Distributed Databases:
o Distributed operating systems are used in distributed database management
systems (DDBMS) to manage and coordinate data storage and processing
across multiple nodes or servers.
o These systems ensure data consistency, availability, and fault tolerance in
distributed environments, supporting applications such as online transaction
processing (OLTP), data warehousing, and real-time analytics.
 Content Delivery Networks (CDNs):

o CDNs rely on distributed operating systems to deliver web content, media, and applications to users worldwide.
o These operating systems manage distributed caching, content replication,
and request routing across a network of edge servers, reducing latency and
improving performance for users accessing web content from diverse
geographic locations.
 Peer-to-Peer (P2P) Networks:
o Distributed operating systems are used in peer-to-peer networks to enable
decentralized communication, resource sharing, and collaboration among
distributed nodes.
o These systems facilitate file sharing, content distribution, and decentralized
applications (DApps) by coordinating interactions between peers without
relying on centralized servers.
 High-Performance Computing (HPC):
o Distributed operating systems are employed in HPC clusters and
supercomputers to coordinate parallel processing tasks across multiple nodes
or compute units.
o These systems support scientific simulations, computational modeling, and
data-intensive computations by distributing workloads and managing
communication between nodes efficiently.
 Distributed File Systems:
o Distributed operating systems power distributed file systems like Hadoop
Distributed File System (HDFS), Google File System (GFS), and CephFS.
o These file systems enable distributed storage and retrieval of large-scale data
sets across clusters of machines, supporting applications such as big data
analytics, data processing, and content storage.
Security in Distributed Operating system
Protection and security are crucial aspects of a Distributed Operating System, especially in
organizational settings. Measures are employed to safeguard the system from potential
damage or loss caused by external sources. Various security measures can be implemented,
including authentication methods such as username/password and user key. One Time
Password (OTP) is also commonly utilized in distributed OS security applications.


SHELL COMMANDS TO MANAGE HDFS


HDFS is the primary storage component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes and maintains the metadata in the form of log files. To use the HDFS commands, first you need to start the Hadoop services using the following command:
sbin/start-all.sh
To check the Hadoop services are up and running use the following command:
jps
Commands:
1. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we want the hdfs executable, particularly its dfs (Distributed File System) commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So
let’s first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /sample => '/' means absolute path
bin/hdfs dfs -mkdir sample2 => Relative path -> the folder will be
created relative to the home directory.
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:


bin/hdfs dfs -touchz /sample/myfile.txt


4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store.
This is the most important command. Local filesystem means the files present on the
OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder sample present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /sample
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /sample
5. cat: To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside sample folder.
bin/hdfs dfs -cat /sample/AI.txt
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /sample ../Desktop/hero
(OR)
bin/hdfs dfs -get /sample/myfile.txt ../Desktop/hero
myfile.txt from sample folder will be copied to folder hero present on Desktop.
Note: Observe that we don’t write bin/hdfs while checking the things present on local
filesystem.
7. moveFromLocal: This command will move a file from the local filesystem to hdfs.
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /sample


8. cp: This command is used to copy files within hdfs. Let's copy the folder sample to sample_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /sample /sample_copied
9. mv: This command is used to move files within hdfs. Let's cut-paste the file myfile.txt from the sample folder to sample_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <src(on hdfs)>
Example:
bin/hdfs dfs -mv /sample/myfile.txt /sample_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /sample_copied -> It will delete all the content inside the directory and then the directory itself.
11. du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /sample
12. dus: This command will give the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /sample
13. stat: It will give the last modified time of a directory or path. In short, it will give the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>


Example:
bin/hdfs dfs -stat /sample
14. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set in hdfs-site.xml).
Example 1: To change the replication factor to 6 for sample.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 sample.txt
Example 2: To change the replication factor to 4 for the directory sample stored in HDFS.
bin/hdfs dfs -setrep -R 4 /sample
Note: The -w flag means wait till the replication is completed, and -R means recursively; we use it for directories as they may also contain many files and folders inside them.

TYPES OF DISTRIBUTED OPERATING SYSTEM


There are many types of Distributed Operating System, some of them are as follows:
1. Client-Server Systems
In a client-server system within a distributed operating system, clients request services or
resources from servers over a network. Clients initiate communication, send requests, and
handle user interfaces, while servers listen for requests, perform tasks, and manage
resources.
 This model allows for scalable resource utilization, efficient sharing, modular
development, centralized control, and fault tolerance.
 It facilitates collaboration between distributed entities, promoting the development of
reliable, scalable, and interoperable distributed systems.
2. Peer-to-Peer(P2P) Systems
In peer-to-peer (P2P) systems, interconnected nodes directly communicate and collaborate
without centralized control. Each node can act as both a client and a server, sharing
resources and services with other nodes. P2P systems enable decentralized resource
sharing, self-organization, and fault tolerance.
 They support efficient collaboration, scalability, and resilience to failures without
relying on central servers.


 This model facilitates distributed data sharing, content distribution, and computing
tasks, making it suitable for applications like file sharing, content delivery, and
blockchain networks.
3. Middleware
Middleware acts as a bridge between different software applications or components,
enabling communication and interaction across distributed systems. It abstracts
complexities of network communication, providing services like message passing, remote
procedure calls (RPC), and object management.
 Middleware facilitates interoperability, scalability, and fault tolerance by decoupling
application logic from underlying infrastructure.
 It supports diverse communication protocols and data formats, enabling seamless
integration between heterogeneous systems.
 Middleware simplifies distributed system development, promotes modularity, and
enhances system flexibility, enabling efficient resource utilization and improved system
reliability.
4. Three-Tier
In a distributed operating system, the three-tier architecture divides tasks into presentation,
logic, and data layers. The presentation tier, comprising client machines or devices, handles
user interaction. The logic tier, distributed across multiple nodes or servers, executes
processing logic and coordinates system functions.
 The data tier manages storage and retrieval operations, often employing distributed
databases or file systems across multiple nodes.
 This modular approach enables scalability, fault tolerance, and efficient resource
utilization, making it ideal for distributed computing environments.
5. N-Tier
In an N-tier architecture, applications are structured into multiple tiers or layers beyond the
traditional three-tier model. Each tier performs specific functions, such as presentation,
logic, data processing, and storage, with the flexibility to add more tiers as needed. In a
distributed operating system, this architecture enables complex applications to be divided
into modular components distributed across multiple nodes or servers.
 Each tier can scale independently, promoting efficient resource utilization, fault
tolerance, and maintainability.


 N-tier architectures facilitate distributed computing by allowing components to run on separate nodes or servers, improving performance and scalability.
 This approach is commonly used in large-scale enterprise systems, web applications,
and distributed systems requiring high availability and scalability.
FEATURES OF DISTRIBUTED OPERATING SYSTEM
A Distributed Operating System manages a network of independent computers as a unified
system, providing transparency, fault tolerance, and efficient resource management. It
integrates multiple machines to appear as a single coherent entity, handling complex
communication, coordination, and scalability challenges to optimize performance
and reliability.
Fundamental Features of Distributed Operating System
The fundamental features of a Distributed Operating System (DOS) are designed to manage
multiple interconnected computers as a unified system. Below is a detailed look at these
core features:
1. Transparency
Transparency in a distributed operating system means that the system hides the
complexities of the underlying network and distributed architecture from users and
applications. This includes:
 Access Transparency: Ensures that users and applications can access resources (e.g.,
files, devices) without needing to know their physical location or the details of the
network. Accessing a remote file appears the same as accessing a local file.
 Location Transparency: Users and applications are unaware of the physical location
of resources. For example, a file or service might be located on any node in the
network, but it appears as if it is on the local machine.
 Migration Transparency: Resources can be moved from one node to another without
affecting the user’s perception of the resource. This allows for dynamic load balancing
and resource management.
 Replication Transparency: Users and applications are unaware of the replication of
resources across multiple nodes for fault tolerance and load balancing. They interact
with a single logical resource.


 Concurrency Transparency: Ensures that multiple users or applications accessing the same resource simultaneously do not interfere with each other, providing a consistent view of the resource.
2. Scalability
Scalability refers to the system’s ability to handle growing amounts of work or to be
expanded to accommodate more nodes. This includes:
 Horizontal Scalability: Adding more nodes to the system to increase capacity and
performance. A scalable distributed operating system can efficiently integrate new
nodes with minimal disruption.
 Vertical Scalability: Enhancing the capacity of existing nodes (e.g., upgrading
hardware) to handle increased load. Although less common in distributed contexts, it is
still a relevant aspect.
3. Fault Tolerance and Reliability
Fault tolerance ensures that the system continues to operate correctly even in the presence
of hardware or software failures. Key aspects include:
 Redundancy: Duplication of critical components or services to ensure that if one fails,
another can take over. This might involve replicating data or having backup nodes.
 Failover Mechanisms: Automatic switching to backup systems or nodes when a failure
occurs, ensuring continuity of service and minimizing downtime.
 Fault Detection and Recovery: Mechanisms for detecting failures and initiating
recovery processes, such as reassigning tasks or recovering lost data, to maintain system
reliability.
4. Resource Management
Efficient resource management involves coordinating and allocating resources across
multiple nodes in the distributed system:
 Distributed Resource Allocation: Managing the allocation of resources such as CPU
time, memory, and storage across different nodes. This includes load balancing to
distribute workloads evenly.
 Scheduling and Load Balancing: Techniques for managing the execution of tasks and
balancing the load to prevent any single node from becoming a bottleneck. This ensures
optimal performance and resource utilization.


 Resource Virtualization: Abstracting the underlying hardware resources to provide a virtualized view of resources, making them available to applications in a consistent manner.
5. Communication and Coordination
Effective communication and coordination are essential for the operation of a distributed
system:
 Inter-Process Communication (IPC): Mechanisms for processes running on different
nodes to communicate with each other. This can involve message passing, remote
procedure calls (RPCs), or other communication methods.
 Synchronization: Techniques to ensure that processes or threads accessing shared
resources do so in a coordinated manner, avoiding conflicts and ensuring data
consistency. This may involve distributed locking or consensus protocols.
Security Features in Distributed Operating System
Below are the security features in a distributed operating system:
1. Authentication and Authorization
 Authentication: This process verifies the identity of users or systems attempting to
access the distributed system. Techniques include username/password combinations,
multi-factor authentication (MFA), and digital certificates. Authentication ensures that
only authorized entities can access the system.
 Authorization: After authentication, authorization determines what resources or actions
the authenticated entity is permitted to access or perform. This involves defining and
enforcing permissions and access rights, often through access control lists (ACLs) or
role-based access control (RBAC).
2. Data Encryption and Integrity
 Data Encryption: Encryption protects data by converting it into a format that can only
be read by someone with the appropriate decryption key. This ensures confidentiality
both in transit (e.g., using TLS/SSL) and at rest (e.g., using AES encryption).
 Data Integrity: Techniques like hashing (e.g., SHA-256) and checksums ensure that
data remains unchanged and uncorrupted during transmission or storage. Integrity
checks help detect unauthorized modifications or corruption.


3. Access Control Mechanisms


 Access Control: Access control mechanisms manage how resources are accessed and
by whom. They can be based on policies that determine access based on user roles,
attributes, or other criteria.
o Discretionary Access Control (DAC): Resource owners decide who can
access their resources and what operations they can perform.
o Mandatory Access Control (MAC): Access decisions are made based on
predefined policies, often enforced by the operating system or security
software.
o Role-Based Access Control (RBAC): Access is granted based on the roles
assigned to users, with permissions associated with each role.
Consistency and Data Management Features in Distributed Operating System
Below are the consistency and data management features in a distributed operating system:
1. Consistency Models
 Strong Consistency: Guarantees that once a write operation completes, all subsequent
reads will reflect that write. This model ensures that all nodes see the same data at all
times but may incur higher latency and reduced availability.
 Eventual Consistency: Allows for temporary inconsistencies between nodes, with the
guarantee that, over time, all nodes will converge to the same state. This model
prioritizes availability and partition tolerance but may lead to stale reads.
 Causal Consistency: Ensures that operations that are causally related are seen by all
nodes in the same order. Operations that are not causally related can be seen in different
orders, balancing between strong and eventual consistency.
2. Data Replication and Synchronization
 Data Replication: Involves creating and maintaining multiple copies of data across
different nodes to enhance availability and fault tolerance. Replication strategies can be
synchronous (updates are propagated immediately) or asynchronous (updates are
propagated later).
 Data Synchronization: Ensures that all replicas of data are kept up-to-date and
consistent. Techniques for synchronization include two-phase commit (2PC) and
quorum-based approaches.


3. Distributed Databases
 Distributed Databases: Databases that store data across multiple nodes or locations.
They provide a unified view of the data despite its physical distribution. Key features
include support for distributed transactions, replication, and consistent querying.
Fault Tolerance Mechanisms in Distributed Operating System
Below are the fault tolerance mechanisms in a distributed operating system:
1. Redundancy Strategies
 Redundancy: Involves duplicating critical components or systems to ensure reliability
and availability. Strategies include:
o Data Redundancy: Multiple copies of data are stored across different nodes.
o Hardware Redundancy: Using backup hardware components (e.g., servers,
disks) to take over in case of failure.
2. Recovery Techniques
 Recovery Techniques: Methods for restoring the system to a stable state after a failure.
Techniques include:
o Checkpointing: Periodically saving the state of a system so that it can be
restored to a recent, consistent point in case of failure.
o Rollback and Replay: Reverting to a previous state and reapplying
operations to recover from failures.
3. Error Handling and Detection
 Error Handling: Mechanisms to manage and mitigate the effects of errors or failures.
This includes retrying operations, compensating for errors, and using error recovery
procedures.
 Error Detection: Techniques for identifying errors or anomalies, such as using error
logs, monitoring systems, and health checks to detect and address issues promptly.
EXAMPLES OF DISTRIBUTED OPERATING SYSTEM
There are various examples of distributed operating systems. Some of them are as follows:
Solaris
It is designed for SUN multiprocessor workstations.
OSF/1
It is compatible with Unix and was designed by the Open Software Foundation.
Micros
The MICROS operating system ensures a balanced data load while allocating jobs to all nodes in the system.
DYNIX
It was developed for the Symmetry multiprocessor computers.
Locus
It can access local and remote files at the same time without any location hindrance.
Mach
It supports multithreading and multitasking features.
Real-life Example of Distributed Operating System
1. Web search
We have different web pages, multimedia content, and scanned documents that we need to search. The purpose of web search is to index the content of the web. To help us, we use different search engines like Google, Yahoo, and Bing. These search engines use a distributed architecture.
2. Banking system
Suppose there is a bank whose headquarters is in New Delhi. That bank has branch offices
in cities like Ludhiana, Noida, Faridabad, and Chandigarh. You can operate your bank by
going to any of these branches. How is this possible? It’s because whatever changes you
make at one branch office are reflected at all branches. This is because of the distributed
system.
3. Massively multiplayer online games
Nowadays, you can play online games where you can play games with a person sitting in
another country in a real-time environment. How’s it possible? It is because of distributed
architecture.
ADVANTAGES AND DISADVANTAGES OF DISTRIBUTED OPERATING
SYSTEM
Advantages of Distributed Systems
Below are the key advantages of Distributed Systems:
Scalability:
 Horizontal Scaling: To support growth and manage higher loads, additional nodes
can be added with simplicity.
 Load Balancing: To increase responsiveness and performance, divide workloads
among several servers.


Fault Tolerance and Reliability:


 Redundancy: Duplicate critical components and services to prevent single points of
failure.
 Failover Mechanisms: Automatically switch to backup systems or nodes in case of
failure, ensuring continuous operation.
Resource Sharing:
 Distributed Resources: Utilize diverse resources (e.g., storage, processing power)
across different locations.
 Collaboration: Allow multiple users or systems to work together and share
resources efficiently.
Performance Improvement:
 Parallel Processing: Increase processing speed by executing jobs concurrently
across several nodes.
 Reduced Latency: Reduce response times by serving requests from nodes that are
closer to you geographically.
Flexibility and Adaptability:
 Modular Design: Add or remove parts of the system without impacting the system
as a whole to enable system upgrades and updates.
 Easy Upgrades: Gradually add additional features or enhancements without causing
major problems.
Geographical Distribution:
 Global Reach: Deploy apps and services closer to users worldwide to improve
performance and accessibility.
 Disaster Recovery: Store and manage data across different regions to protect
against localized disasters or outages.
Cost Efficiency:
 Utilization of Commodity Hardware: Use off-the-shelf hardware to reduce costs
compared to specialized systems.
 Dynamic Resource Allocation: Optimize resource usage based on demand,
potentially lowering operational costs.
Disadvantages of Distributed Systems
Below are the disadvantages of distributed systems:
Complexity:


 System Design: Compared to centralized systems, distributed systems are naturally more complex to design and administer.
 Maintenance and Troubleshooting: Because of its distributed nature and the
possibility of inconsistencies, debugging and maintaining a distributed system can
be difficult.
Network Dependency:
 Latency Issues: Network delays and bandwidth limitations can impact system
performance and responsiveness.
 Network Failures: Dependence on network connectivity means that network
outages or failures can disrupt system operations.
Data Consistency Challenges:
 Synchronization: Ensuring that all nodes have a consistent view of data can be
difficult, especially in the presence of network partitions or delays.
 Consistency Models: Different nodes might have varying data consistency levels,
leading to potential conflicts or outdated information.
Security Concerns:
 Data Protection: Securing data transmitted over networks and stored across
multiple nodes can be complex and requires robust encryption and access controls.
 Authentication and Authorization: Managing authentication and authorization
across distributed components can be more challenging than in centralized systems.
Increased Overhead:
 Communication Overhead: Inter-node communication adds latency and consumes
bandwidth, which can affect overall system performance.
 Resource Management: Managing state and allocating resources among distributed
nodes can add complexity and overhead.
Cost Considerations:
 Infrastructure Costs: Setting up and maintaining a distributed system can involve
significant costs related to hardware, networking, and software.
 Operational Costs: Ongoing costs for network management, system administration,
and monitoring can be higher compared to centralized systems.
Data Integrity Issues:
 Replication Lag: Data replication across nodes can lead to inconsistencies if
updates are not propagated promptly.


 Conflict Resolution: Handling data conflicts and ensuring data integrity in the face
of concurrent updates can be complex.
DESIGN AND IMPLEMENTATION OF DISTRIBUTED OPERATING SYSTEM
Designing and implementing a distributed operating system (DOS) involves creating a
system that coordinates the interaction of multiple machines or nodes in a network, providing
the appearance of a single coherent operating system to the user. These systems offer various
advantages such as fault tolerance, resource sharing, and scalability. Below is an overview of
the design principles and steps involved in implementing a distributed operating system.
Design Principles of a Distributed Operating System:
1. Transparency:
o Access Transparency: Hides the differences in data access mechanisms.
o Location Transparency: The location of resources should be hidden from
users and applications.
o Replication Transparency: The user should not be aware if a resource is
replicated across nodes.
o Concurrency Transparency: Multiple users can concurrently access
resources without interference.
o Failure Transparency: The system should continue to work despite partial
failures.
2. Scalability:
o The system should be able to scale by adding more machines without
significant changes in architecture or performance degradation.
o The load should be distributed efficiently among nodes.
3. Fault Tolerance:
o The system must tolerate failures in individual machines or networks.
o Redundancy and replication should be employed to ensure data availability
and minimize downtime.
4. Resource Management:
o Resources such as CPU, memory, disk, and network bandwidth need to be
allocated and managed across nodes in a fair and efficient manner.
o Centralized management (single node manages resources) or decentralized
management (no central authority, resources managed locally by each node)
can be used.


5. Synchronization and Communication:


o Clock synchronization: Ensuring that all nodes have synchronized clocks is
crucial for consistency and coordination.
o Message passing: Nodes in a distributed system need to communicate using a
reliable messaging protocol.
o Distributed mutual exclusion: Mechanisms are required to ensure that only
one process can access a shared resource at a time.
6. Security:
o Authentication, encryption, and secure communication protocols are needed to
protect data and user interactions.
7. Heterogeneity:
o The system should support heterogeneous hardware and software platforms, so
the operating system needs to handle different processor architectures,
operating systems, and networking technologies.
Key Components in the Design of Distributed Operating Systems:
1. Process Management:
o Processes must be managed across multiple machines.
o Process migration: A process may move from one machine to another for
load balancing or fault tolerance.
o Distributed scheduling: The operating system must decide where and when
processes will run on the network.
o Inter-process communication (IPC): Distributed processes must be able to
communicate via message passing, shared memory, or RPC (Remote
Procedure Calls).
2. Memory Management:
o Distributed shared memory (DSM): Allows processes to access memory
across different nodes as though it were local.
o Virtual memory: The operating system must manage distributed virtual
memory, possibly using paging or segmentation techniques.
3. File System:
o The file system must be distributed across multiple nodes.
o Distributed file systems (DFS): Provides a unified file system view to users
despite physical file distribution.


o File replication: Files may be replicated across several machines to ensure fault tolerance and quick access.
o Examples: NFS (Network File System), Coda, Google File System (GFS).
4. Networking:
o Reliable communication protocols: Ensure messages are delivered
accurately and in order (e.g., TCP/IP).
o Routing: Decides how messages are sent between nodes, ensuring that
communication is efficient and reliable.
o Middleware: Software layers that facilitate communication between
applications and underlying system resources.
5. Security and Authentication:
o Authentication of users and services to ensure that only authorized entities can
access distributed resources.
o Encryption to protect data in transit and at rest.
o Techniques for secure communication include SSL/TLS and secure message
passing protocols.
Steps in Implementing a Distributed Operating System:
1. System Requirements Analysis:
o Define the goals and the problem that the distributed system aims to solve.
o Determine the system architecture (e.g., client-server, peer-to-peer).
o Analyze the hardware and network resources available.
2. Design the Architecture:
o Choose an architecture: Centralized, decentralized, or hybrid.
o Design the communication protocols: Message passing, remote procedure
calls (RPC), or shared memory.
o Define the resource management strategies: Load balancing, fault tolerance,
process migration.
o Select the appropriate synchronization and consistency algorithms.
3. Implementation:
o Develop the core components such as:
 Process Management: Implement scheduling, process migration, and
inter-process communication mechanisms.


 Memory Management: Implement distributed shared memory, paging, and virtual memory techniques.
 File System: Develop a distributed file system with support for file
replication and fault tolerance.
 Networking: Implement reliable communication protocols, routing
algorithms, and middleware.
o Ensure secure communication with encryption and authentication.
o Implement logging, debugging, and monitoring tools for the system.
4. Testing and Debugging:
o Test individual components for correctness.
o Perform integration testing to ensure all components work together.
o Simulate failures to check the fault tolerance and recovery mechanisms.
o Load testing to ensure scalability.
5. Optimization:
o Monitor performance and identify bottlenecks.
o Optimize memory, CPU, and network usage for better efficiency.
o Improve fault tolerance by implementing better recovery algorithms and data
replication techniques.
6. Deployment:
o Deploy the system in stages, starting with a smaller setup and scaling
gradually.
o Monitor performance in real-world scenarios and address any issues that arise.
HADOOP DISTRIBUTED FILE SYSTEM
What is Hadoop Distributed File System (HDFS)?
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. Hadoop itself is an open source distributed processing framework for handling data processing, managing pools of big data, and storing and supporting related big data analytics applications.
HDFS employs a NameNode and DataNode architecture to implement a distributed file
system that provides high-performance access to data across highly scalable Hadoop clusters.
It's designed to run on commodity hardware and is a key part of the many Hadoop ecosystem
technologies.


How does HDFS work?


HDFS is built using the Java language and enables the rapid transfer of data between
compute nodes. At its outset, it was closely coupled with MapReduce, a framework for data
processing that filters and divides up work among the nodes in a cluster and organizes and
condenses the results into a cohesive answer to a query. Similarly, when HDFS takes in data,
it breaks the information down into separate blocks and distributes them to different nodes in
a cluster.
The following describes how HDFS works:
 With HDFS, data is written on the server once and read and reused numerous times.
 HDFS has a primary NameNode, which keeps track of where file data is kept in the
cluster.
 HDFS has multiple DataNodes on a commodity hardware cluster -- typically one per
node in a cluster. The DataNodes are generally organized within the same rack in the data
center. Data is broken down into separate blocks and distributed among the various
DataNodes for storage. Blocks are also replicated across nodes, enabling highly
efficient parallel processing.
 The NameNode knows which DataNode contains which blocks and where the DataNodes
reside within the machine cluster. The NameNode also manages access to the files,
including reads, writes, creates, deletes and the data block replication across the
DataNodes.
 The NameNode operates together with the DataNodes. As a result, the cluster can
dynamically adapt to server capacity demands in real time by adding or subtracting nodes
as necessary.
 The DataNodes are in constant communication with the NameNode to determine if the
DataNodes need to complete specific tasks. Consequently, the NameNode is always
aware of the status of each DataNode. If the NameNode realizes that one DataNode isn't
working properly, it can immediately reassign that DataNode's task to a different node
containing the same data block. DataNodes also communicate with each other, which
enables them to cooperate during normal file operations.
 The HDFS is designed to be highly fault tolerant. The file system replicates -- or copies --
each piece of data multiple times and distributes the copies to individual nodes, placing at
least one copy on a different server rack than the other copies.


HDFS architecture centers on commanding NameNodes that hold metadata and DataNodes that store information in blocks.
HDFS ARCHITECTURE, NAMENODE AND DATANODES
HDFS uses a primary/secondary architecture where each HDFS cluster is comprised of many
worker nodes and one primary node or the NameNode. The NameNode is the controller node,
as it knows the metadata and status of all files including file permissions, names and location
of each block. An application or user can create directories and then store files inside these
directories. The file system namespace hierarchy is like most other file systems, as a user can
create, remove, rename or move files from one directory to another.
The HDFS cluster's NameNode is the primary server that manages the file system namespace
and controls client access to files. As the central component of the Hadoop Distributed File
System, the NameNode maintains and manages the file system namespace and provides
clients with the right access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
NameNode
The NameNode performs the following key functions:
 The NameNode performs file system namespace operations, including opening, closing
and renaming files and directories.
 The NameNode governs the mapping of blocks to the DataNodes.
 The NameNode records any changes to the file system namespace or its properties. An
application can stipulate the number of replicas of a file that the HDFS should maintain.


 The NameNode stores the number of copies of a file, called the replication factor of that
file.
 To ensure that the DataNodes are alive, the NameNode gets block reports
and heartbeat data.
 In case of a DataNode failure, the NameNode selects new DataNodes for replica creation.
DataNodes
In HDFS, DataNodes function as worker nodes, or Hadoop daemons, and typically run on
low-cost off-the-shelf hardware. A file is split into one or more blocks that are stored on a
set of DataNodes, and each block is replicated across DataNodes according to the file's
replication factor.
The DataNodes perform the following key functions:
 The DataNodes serve read and write requests from the clients of the file system.
 The DataNodes perform block creation, deletion and replication when the NameNode
instructs them to do so.
 The DataNodes transfer periodic heartbeat signals to the NameNode to help keep HDFS
health in check.
 The DataNodes provide block reports to the NameNode to help keep track of the blocks
stored on each DataNode. For redundancy and higher availability, each block is
copied onto two extra DataNodes by default (a short sketch of inspecting and changing replication follows below).
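To make the block and replication mechanics concrete, here is a minimal Scala sketch using the Hadoop FileSystem client API; the NameNode URI and the file path are placeholders, not values from this unit.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumed NameNode URI; replace with your cluster's fs.defaultFS
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020")

    val fs = FileSystem.get(conf)
    val file = new Path("/user/guest/data/sample.csv") // hypothetical file

    // The NameNode answers this metadata query: block size and replication factor
    val status = fs.getFileStatus(file)
    println(s"block size = ${status.getBlockSize} bytes, replication = ${status.getReplication}")

    // Ask the NameNode to raise the replication factor of this file to 3
    fs.setReplication(file, 3.toShort)
    fs.close()
  }
}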
FEATURES OF HDFS
There are several features that make HDFS particularly useful, including the following:
 Data replication. Data replication ensures that the data is always available and prevents
data loss. For example, when a node crashes or there's a hardware failure, replicated data
can be pulled from elsewhere within a cluster, so processing continues while data is being
recovered.
 Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them
across nodes in a large cluster ensures fault tolerance and reliability.
 High availability. Because of replication across nodes, data is available even if the
NameNode or DataNode fails.
 Scalability. HDFS stores data on various nodes in the cluster, so as requirements
increase, a cluster can scale to hundreds of nodes.


 High throughput. Because HDFS stores data in a distributed manner, the data can be
processed in parallel on a cluster of nodes. This, plus data locality, cuts the processing
time and enables high throughput.
 Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. Minimizing
the distance between the data and the computing process decreases network
congestion and boosts a system's overall throughput.
 Snapshots. HDFS supports snapshots, which capture point-in-time copies of the file
system and protect critical data from user or application errors.
What are the benefits of using HDFS?
There are seven main advantages to using HDFS, including the following:
 Cost effective. The DataNodes that store the data rely on inexpensive off-the-shelf
hardware, which reduces storage costs. Also, because HDFS is open source, there's no
licensing fee.
 Large data set storage. HDFS stores a variety of data of any size and large files -- from
megabytes to petabytes -- in any format, including structured and unstructured data.
 Fast recovery from hardware failure. HDFS is designed to detect faults and
automatically recover on its own.
 Portability. HDFS is portable across all hardware platforms and compatible with several
operating systems, including Windows, Linux and macOS.
 Streaming data access. HDFS is built for high data throughput, which is best for access
to streaming data.
 Speed. Because of its cluster architecture, HDFS is fast and can handle 2 GB of data per
second.
 Diverse data formats. Hadoop data lakes support a wide range of data formats, including
unstructured such as movies, semistructured such as XML files and structured data
for Structured Query Language databases. Data retrieved via Hadoop is schema-free, so it
can be parsed into any schema and can support diverse data analysis in various ways.
HADOOP – CLUSTER
A cluster is a group of computers connected to each other, typically through a LAN (Local
Area Network). The nodes in a cluster share data, work on the same task and cooperate
closely enough to function as a single unit.


Similarly, a Hadoop cluster is a collection of commodity hardware (devices that are
inexpensive and widely available) that works together as a single unit. A Hadoop cluster
contains many nodes (computers and servers) organized into masters and slaves: the
NameNode and the ResourceManager act as masters, while the DataNodes and the
NodeManagers act as slaves. The purpose of the master nodes is to coordinate the slave
nodes in the cluster. We design Hadoop clusters for storing, analyzing and understanding
data, and for finding the facts hidden behind datasets that contain crucial information. The
Hadoop cluster stores and processes different types of data:
 Structured data: data with a well-defined schema, such as tables in MySQL.
 Semi-structured data: data that has structure but no fixed schema, such as XML or JSON
(JavaScript Object Notation).
 Unstructured data: data without any predefined structure, such as audio and video.
Hadoop Cluster Schema:

Hadoop Clusters Properties


1. Scalability: Hadoop clusters are capable of scaling the number of nodes (servers or
commodity hardware) up or down. For example, suppose an organization wants to analyze
and maintain around 5 PB of data for the upcoming 2 months and uses 10 nodes (servers) in
its Hadoop cluster to hold all of this data. If the organization then receives an extra 2 PB of
data during that period, it has to upgrade the cluster from 10 to, say, 12 servers in order to
maintain it. This process of increasing or decreasing the number of servers in a Hadoop
cluster is called scalability.
2. Flexibility: This is one of the important properties of a Hadoop cluster. A Hadoop cluster
is flexible in that it can handle any type of data irrespective of its type and structure, which
allows Hadoop to process data from any online web platform.
3. Speed: Hadoop clusters work at high speed because the data is distributed across the
cluster and because of the MapReduce architecture, which maps work onto the data using
the master-slave model.
4. No data loss: There is little risk of losing data from any node in a Hadoop cluster,
because a Hadoop cluster replicates each node's data onto other nodes. If a node fails, no
data is lost, since a backup copy is always available.
5. Economical: Hadoop clusters are very cost-efficient because they use distributed storage:
the data is spread across all the nodes of the cluster. To increase storage capacity, we only
need to add another inexpensive commodity storage device.
Types of Hadoop clusters
1. Single Node Hadoop Cluster
2. Multiple Node Hadoop Cluster


1. Single Node Hadoop Cluster: As the name suggests, the cluster consists of only a single
node, which means all the Hadoop daemons -- NameNode, DataNode, Secondary NameNode,
ResourceManager and NodeManager -- run on the same system or machine. It also means
that all processes are handled by a single JVM (Java Virtual Machine) instance.
2. Multiple Node Hadoop Cluster: As the name suggests, this cluster contains multiple
nodes, and the Hadoop daemons run on different nodes within the same cluster setup. In
general, in a multiple node Hadoop cluster we use higher-capacity machines for the master
daemons, i.e. the NameNode and the ResourceManager, and cheaper systems for the slave
daemons, i.e. the NodeManager and the DataNode.

HADOOP MAP-REDUCE
Hadoop – Mapper In MapReduce
Map-Reduce is a programming model that is divided into two phases, the Map phase and the
Reduce phase. It is designed to process data in parallel across various machines (nodes). A
Hadoop MapReduce Java program consists of a Mapper class and a Reducer class along with
a driver class. The Hadoop Mapper is the task that processes all input records from a file and
generates output that serves as input for the Reducer. It produces this output as new key-value
pairs. The input data has to be converted into key-value pairs, because the Mapper cannot
process raw input records directly. The Mapper also generates small blocks of intermediate
data while processing the input records as key-value pairs. This section discusses the
processes that occur in the Mapper, its key features and how key-value pairs are generated.
Let’s understand the Mapper in Map-Reduce:


A Mapper is a user-defined program that performs operations on input splits according to its
design. Mapper is a base class that the developer extends according to the organization's
requirements; the input and output types must be declared as type arguments of the Mapper
class.
For example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
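As an illustrative sketch only (not the unit's own example), a word-count Mapper written in Scala against the org.apache.hadoop.mapreduce API could look like this; it emits a (word, 1) pair for every token of each input line:

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class TokenCounterMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  // Called once per input record: key = byte offset of the line, value = the line of text
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one) // intermediate key-value pair consumed later by the Reducer
    }
  }
}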
The Mapper is the first code that interacts with the input dataset. Suppose the dataset we are
analyzing has 100 data blocks; in that case 100 Mapper processes run in parallel on the
machines (nodes), each producing its own output, known as intermediate output, which is
stored on local disk rather than on HDFS. The output of the Mapper acts as input for the
Reducer, which performs sorting and aggregation operations on the data and produces the
final output.
The Mapper mainly consists of 5 components: Input, Input Splits, Record Reader, Map and
Intermediate output disk. The Map task is completed with the contribution of all these
components.
1. Input: Input is records or the datasets that are used for analysis purposes. This Input data
is set out with the help of InputFormat. It helps in identifying the location of the Input
data which is stored in HDFS(Hadoop Distributed File System).
2. Input-Splits: These convert the physical input data into a logical form that the Hadoop
Mapper can easily handle. Input-splits are generated with the help of InputFormat. A large
dataset is divided into many input-splits, depending on the size of the input dataset, and a
separate Mapper is assigned to each input-split. Input-splits only reference the input data;
they are not the actual data. Data blocks are not the only factor that decides the number of
input-splits in a Map-Reduce job; we can manually configure the maximum split size with
the mapred.max.split.size property when the job executes. The size of an input-split is
measured in bytes, and each input-split records the locations (hostname strings) where its
data resides. Map-Reduce places map tasks as close to the location of the split as possible,
and larger input-splits are executed first so that the overall job runtime can be minimized.
3. Record-Reader: The Record-Reader is the process that deals with the data referenced by
the input-splits and generates its own output as key-value pairs until the end of the file.
Each line in the file is assigned a byte offset by the Record-Reader. By default, the
Record-Reader uses TextInputFormat to convert the data obtained from the input-splits
into key-value pairs, because the Mapper can only handle key-value pairs.
4. Map: The key-value pairs obtained from the Record-Reader are then fed to the Map
function, which generates a set of intermediate key-value pairs.
5. Intermediate output disk: Finally, the intermediate key-value pairs are stored on the local
disk as intermediate output. There is no need to store this data on HDFS, because it is only
intermediate output; writing it to HDFS would cost more because of HDFS' replication
feature and would increase execution time, and if the job were terminated, cleaning up
intermediate output left on HDFS would also be difficult. The intermediate output is
therefore always stored on local disk and is cleaned up once the job completes its execution.
On local disk, the Mapper output is first written to a buffer whose default size is 100 MB,
configurable with the io.sort.mb property. The output of the Mapper can be written to HDFS
only when the job is a map-only job; in that case there is no Reducer task, so the intermediate
output is the final output and can be written to HDFS. The number of Reducer tasks can be
set to zero with job.setNumReduceTasks(0). Otherwise, the Mapper output is of no use to the
end user, as it is a temporary output useful only to the Reducer.

Hadoop – Reducer in Map-Reduce


Map-Reduce is a programming model that is divided into two phases, the Map phase and the
Reduce phase. It is designed to process data in parallel across various machines (nodes). A
Hadoop MapReduce Java program consists of a Mapper class and a Reducer class along with
a driver class. The Reducer is the second part of the Map-Reduce programming model. The
Mapper produces output in the form of key-value pairs, which serve as input for the Reducer.
Before these intermediate key-value pairs reach the Reducer, they are shuffled and sorted
according to their keys, which means the key is the deciding factor for sorting. The output
generated by the Reducer is the final output, which is then stored on HDFS (Hadoop
Distributed File System). The Reducer mainly performs computation operations such as
addition, filtering and aggregation. By default, the number of Reducers used to process the
Mapper output is 1, which is configurable and can be changed by the user according to the
requirement.
Let’s understand the Reducer in Map-Reduce:

Here, in the diagram above, we can observe that multiple Mappers generate key-value pairs
as output. The output of each Mapper is sent to the sorter, which sorts the key-value pairs by
key. Shuffling also takes place during the sorting process, and the result is sent to the
Reducer, which produces the final output.
Let's take an example to understand how the Reducer works. Suppose the data of a college's
faculty across all departments is stored in a CSV file. If we want to find the sum of faculty
salaries per department, we can use the department title as the key and the salary as the
value. The Reducer then performs a summation over this dataset and produces the desired
output.
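A possible sketch of such a summation Reducer, written in Scala against the org.apache.hadoop.mapreduce API and assuming the Mapper emits (department, salary) pairs, is shown below:

import org.apache.hadoop.io.{DoubleWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

class SalarySumReducer extends Reducer[Text, DoubleWritable, Text, DoubleWritable] {
  // Called once per key (department) with all of that department's salary values
  override def reduce(key: Text, values: java.lang.Iterable[DoubleWritable],
                      context: Reducer[Text, DoubleWritable, Text, DoubleWritable]#Context): Unit = {
    var sum = 0.0
    values.forEach(v => sum += v.get()) // add up every salary for this department
    context.write(key, new DoubleWritable(sum))
  }
}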
Increasing the number of Reducers in a Map-Reduce job has the following effects:
1. Framework overhead increases.
2. The cost of a failure decreases.
3. Load balancing improves.


One thing to remember is that each key is always processed by exactly one Reducer. Once
the whole Reduce process is done, the output is stored in part files (the default names) on
HDFS (Hadoop Distributed File System). In the output directory on HDFS, Map-Reduce
always creates a _SUCCESS file and part-r-NNNNN files. The number of part files depends
on the number of Reducers; if we have 5 Reducers, the part files range from part-r-00000 to
part-r-00004. By default these files follow the part-r-NNNNN naming pattern, which can be
changed by setting the property below in the driver code of our Map-Reduce job.
// Here we are changing the output file base name from the default "part" to "Welcome"
job.getConfiguration().set("mapreduce.output.basename", "Welcome");
The Reducer in Map-Reduce consists of 3 main processes/phases:
1. Shuffle: Shuffling carries data from the Mapper to the appropriate Reducer. Using HTTP,
the framework fetches the relevant partition of the output from every Mapper.
2. Sort: In this phase, the Mapper output (the key-value pairs) is sorted on the basis of its
keys.
3. Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results
and performs the required computation. The OutputCollector.collect() method is used to
write the output to HDFS. Keep in mind that the output of the Reducer is not sorted.
Note: Shuffling and Sorting both execute in parallel.
Setting Number Of Reducers In Map-Reduce:
1. With the command line: While executing our Map-Reduce program, we can manually set
the number of Reducers with the mapred.reduce.tasks property.
2. With the Job instance: In our driver class, we can specify the number of Reducers using
job.setNumReduceTasks(int).
For example, job.setNumReduceTasks(2) gives us 2 Reducers. We can also set the number of
Reducers to 0 if we need only a map job.

MIGRATING DATA FROM RDBMS TO HDFS EQUIVALENT USING SPARK:


Let's consider a scenario where the project stack does not include the Hadoop framework,
but the user wants to migrate data from an RDBMS to an HDFS-equivalent system, for
example Amazon S3. In this scenario, Apache Spark SQL can be used.
Apache Spark SQL offers two components for such a migration: the JDBC RDD and the
JDBC DataFrame.
To connect Spark with any RDBMS, the JDBC type 4 driver jar file needs to be added to the
/lib directory. The following code can be used to check JDBC connectivity:
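The original snippet is a screenshot that is not reproduced here. The following Scala sketch shows a DataFrame-based JDBC read; the connection URL, credentials and table name are placeholder values.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RdbmsToS3").getOrCreate()

// Read the whole employee table over JDBC into a DataFrame
val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")   // placeholder connection URL
  .option("driver", "com.mysql.jdbc.Driver")           // driver class for MySQL Connector/J 5.x
  .option("dbtable", "employee")
  .option("user", "root")                              // placeholder credentials
  .option("password", "secret")
  .load()

employees.show(5)   // quick connectivity check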

The above code accesses the MySQL database and reads all the data from the employee table.
The below-mentioned code can be used to achieve Parallelism by fetching the data:
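Again as a sketch (the original snippet is not reproduced), parallelism is usually achieved by partitioning the JDBC read on a numeric column; the column name and bounds below are assumptions, and the spark session from the previous sketch is reused.

// Split the read into 4 partitions on a numeric column
val employeesParallel = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "employee")
  .option("user", "root")
  .option("password", "secret")
  .option("partitionColumn", "id")   // assumed numeric primary key
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")
  .load()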

If you want to write data in the database, the following code can be used:
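A corresponding write sketch, under the same placeholder assumptions, pushes the DataFrame back over JDBC or out to an HDFS-equivalent store such as S3:

// Write back to a table over JDBC
employeesParallel.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "employee_copy")   // assumed target table
  .option("user", "root")
  .option("password", "secret")
  .mode("append")
  .save()

// Or export to an HDFS-equivalent store such as S3 (bucket name is a placeholder)
employeesParallel.write.parquet("s3a://my-bucket/employee/")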

Using the above code snippets, you can import and export data between an RDBMS and an
HDFS-equivalent system.

MIGRATING DATA FROM HDFS TO RDBMS EQUIVALENT USING SPARK:


Procedure
1. Clone the GitHub repository containing the test data.
language-bash
git clone https://github.com/brianmhess/DSE-Spark-HDFS.git
2. Load the maximum temperature test data into the Hadoop cluster using WebHDFS.
In this example, the Hadoop node has a hostname of hadoopNode.example.com. Replace it
with the hostname of a node in your Hadoop cluster.
language-bash
hadoop fs -mkdir webhdfs://hadoopNode.example.com:50070/user/guest/data &&
hadoop fs -copyFromLocal data/sftmax.csv
webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv
3. Create the keyspace and table and load the minimum temperature test data using cqlsh.
language-cql
CREATE KEYSPACE IF NOT EXISTS spark_ex2 WITH REPLICATION = {
'class':'SimpleStrategy', 'replication_factor':1};
DROP TABLE IF EXISTS spark_ex2.sftmin;
CREATE TABLE IF NOT EXISTS spark_ex2.sftmin(location TEXT, year INT, month INT,
day INT, tmin DOUBLE, datestring TEXT, PRIMARY KEY ((location), year, month, day))
WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
COPY spark_ex2.sftmin(location, year, month, day, tmin, datestring) FROM 'data/sftmin.csv';
4. Ensure that we can access the HDFS data by interacting with the data using hadoop fs.
The following command counts the number of lines of HDFS data.
language-bash
hadoop fs -cat webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv | wc -l
You should see output similar to the following:
results
16/05/10 11:21:51 INFO snitch.Workload: Setting my workload to Cassandra 3606
5. Start the Spark console and connect to the DataStax Enterprise cluster.


language-bash
dse spark
Import the Spark Cassandra connector and create the session.
language-scala
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector(csc.conf)
val session = connector.openSession()
6. Create the table to store the maximum temperature data.
language-scala
session.execute(s"DROP TABLE IF EXISTS spark_ex2.sftmax")
session.execute(s"CREATE TABLE IF NOT EXISTS spark_ex2.sftmax(location TEXT,
year INT, month INT, day INT, tmax DOUBLE, datestring TEXT, PRIMARY KEY
((location), year, month, day)) WITH CLUSTERING ORDER BY (year DESC, month
DESC, day DESC)")
7. Create a Spark RDD from the HDFS maximum temperature data and save it to the table.
First create a case class representing the maximum temperature sensor data:
language-scala
case class Tmax(location: String, year: Int, month: Int, day: Int, tmax: Double, datestring:
String)
Read the data into an RDD.
language-scala
val tmax_raw =
sc.textFile("webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv")
Transform the data so each record in the RDD is an instance of the Tmax case class.
language-scala
val tmax_c10 = tmax_raw.map(x=>x.split(",")).map(x => Tmax(x(0), x(1).toInt, x(2).toInt,
x(3).toInt, x(4).toDouble, x(5)))


Count the case class instances to make sure it matches the number of records.
language-scala
tmax_c10.count
res11: Long = 3606
Save the case class instances to the database.
language-scala
tmax_c10.saveToCassandra("spark_ex2", "sftmax")
9. Verify the records match by counting the rows using CQL.
language-scala
session.execute("SELECT COUNT(*) FROM spark_ex2.sftmax").all.get(0).getLong(0)
res23: Long = 3606
10. Join the maximum and minimum data into a new table.
Create a Tmin case class to store the minimum temperature sensor data.
language-scala
case class Tmin(location: String, year: Int, month: Int, day: Int, tmin: Double, datestring:
String)
val tmin_raw = sc.cassandraTable("spark_ex2", "sftmin")
val tmin_c10 = tmin_raw.map(x => Tmin(x.getString("location"), x.getInt("year"),
x.getInt("month"), x.getInt("day"), x.getDouble("tmin"), x.getString("datestring")))
In order to join RDDs, they need to be PairRDDs, with the first element in the pair being the
join key.
language-scala
val tmin_pair = tmin_c10.map(x=>(x.datestring,x))
val tmax_pair = tmax_c10.map(x=>(x.datestring,x))
Create a THiLoDelta case class to store the difference between the maximum and minimum
temperatures.
language-scala
case class THiLoDelta(location: String, year: Int, month: Int, day: Int, hi: Double, low:
Double, delta: Double, datestring: String)
Join the data using the join operation on the PairRDDs. Convert the joined data to the
THiLoDelta case class.
language-scala
val tdelta_join1 = tmax_pair.join(tmin_pair)
val tdelta_c10 = tdelta_join1.map { case (date, (tmaxRec, tminRec)) =>
  THiLoDelta(tmaxRec.location, tmaxRec.year, tmaxRec.month, tmaxRec.day,
    tmaxRec.tmax, tminRec.tmin, tmaxRec.tmax - tminRec.tmin, date) }
Create a new table within Spark using CQL to store the temperature difference data.
language-scala
session.execute(s"DROP TABLE IF EXISTS spark_ex2.sftdelta")
session.execute(s"CREATE TABLE IF NOT EXISTS spark_ex2.sftdelta(location TEXT,
year INT, month INT, day INT, hi DOUBLE, low DOUBLE, delta DOUBLE, datestring
TEXT, PRIMARY KEY ((location), year, month, day)) WITH CLUSTERING ORDER BY
(year DESC, month DESC, day DESC)")
Save the temperature difference data to the table.
language-scala
tdelta_c10.saveToCassandra("spark_ex2", "sftdelta")
KAFKA STREAM
What is Kafka?
Apache Kafka is a distributed event streaming framework designed to manage fault-tolerant,
high-throughput data streams. It offers a centralized platform for developing real-time data
pipelines and applications, enabling smooth connection between data producers and
consumers.
What is Kafka Streams API?
The Kafka Streams API can be used to simplify stream processing across various disparate
topics. It provides distributed coordination, data parallelism, scalability and fault tolerance.
This API uses tasks and partitions as logical units that communicate with the cluster and are
closely tied to the topic partitions.


One of its unique features is that the apps you create with the Kafka Streams API are regular
Java apps that can be packaged, deployed and monitored like any other Java application.
Primary Terminologies Related to Kafka Streams API
 Tasks: Within the Kafka Streams API, tasks are logical processing units that take in
input data, process it, and then output the results.
 Partitions: Segments of Kafka topics that allow applications using Kafka Streams to
scale and process data in parallel.
 Stateful Processing: This refers to the Kafka Streams API's capacity to save and update
state data across stream processing operations, enabling intricate analytics and
transformations.
 Windowing is a method for processing and aggregating data streams in predetermined
time frames, making windowed joins and aggregation possible.
How Kafka Streams API Works?
1. Initialization: Include the kafka-streams dependency in your project in order to start
using the Kafka Streams API.
2. Topology construction: Use the Processor API or the Streams DSL to specify the
application's processing logic. This entails defining the input topics, the data
transformations and the output topics.
3. Configuration: Create an instance of the Kafka Streams topology and configure
properties such as state stores, input/output serializers (Serdes) and processing semantics.
4. Deployment: Deploy your Kafka Streams application in a runtime environment, such as
a standalone Java process or a containerized environment.
5. Scaling: To provide higher throughput and fault tolerance, Kafka Streams applications
scale horizontally by dividing work across several instances.
Kafka Stream API Workflow With a Diagram
The following diagram illustrates the workflow of Kafka Stream APIs in between
producers and consumers:


Usecases of Kafka Streams API


Here are a few handy Kafka Streams examples that leverage Kafka Streams API to simplify
operations:
 Finance Industry can build applications to accumulate data sources for real-time views
of potential exposures. It can also be leveraged for minimizing and detecting fraudulent
transactions.
 It can also be used by logistics companies to build applications to track their shipments
reliably, quickly, and in real-time.
 Travel companies can build applications with the API to help them make real-time
decisions to find the best suitable pricing for individual customers. This allows them to
cross-sell additional services and process reservations and bookings.
 Retailers can leverage this API to decide in real-time on the next best offers, pricing,
personalized promotions, and inventory management.
Working With Kafka Streams API
 To start working with the Kafka Streams API, you first need to add the kafka-streams
dependency to your application. It is available from Maven:


<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>1.1.0</version>
</dependency>
 A unique feature of the Kafka Streams API is that the applications you build with it are
normal Java applications. These applications can be packaged, deployed and monitored
like any other Java application – there is no need to install separate processing clusters
or similar special-purpose and expensive infrastructure. A minimal topology sketch is
shown below.
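The sketch below is a minimal topology written in Scala (which runs on the same JVM as Java); the broker address and the topic names input-topic and output-topic are placeholders, not values from this unit.

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.KStream

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-demo")      // identifies this app's consumer group
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker address
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
val source: KStream[String, String] = builder.stream[String, String]("input-topic")
source.mapValues(value => value.toUpperCase)   // a trivial per-record transformation
      .to("output-topic")

val streams = new KafkaStreams(builder.build(), props)
streams.start()
sys.addShutdownHook(streams.close())           // close cleanly on JVM shutdown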
Advantages of Kafka Stream APIs
The following are the advantages of Kafka Stream APIs:
 Simplified Stream Processing: The Kafka Streams API allows developers to
concentrate on application logic by abstracting away the intricacies of stream
processing.
 Seamless Integration: Its smooth integration with the current Kafka infrastructure is
due to its membership in the Kafka ecosystem.
 Scalability: Because of the horizontal scalability provided by the Kafka Streams API,
applications can manage growing data loads.
 Fault Tolerance: Fault tolerance is ensured by built-in processes, which provide
dependable stream processing even in the event of malfunctions.
Disadvantages of Kafka Stream APIs
The following are the disadvantages of Kafka Stream APIs:
 Java-centric: Mostly focused on Java, which can be difficult for developers who are
more familiar with other languages.
 Learning Curve: While streamlining many parts of stream processing, there is some
learning involved in understanding the ideas and APIs of Kafka Streams.
 Complexity: Especially for inexperienced users, managing stateful processing and
windowed processes might be complicated.
 Resource Consumption: Kafka Streams applications have the potential to use a large
amount of memory and compute power, depending on their size.


Applications of Kafka Stream APIs


The adaptability of the Kafka Streams API makes it possible to use it in a wide range of
sectors, such as retail, banking, logistics, and travel. The possibilities are infinite, ranging
from dynamic pricing optimisation to real-time fraud detection.
 Real-time analytics: Organizations can analyze streaming data in real time for insights
and decision-making.
 Fraud detection: Offers a platform for identifying and addressing fraudulent activity in
online and financial transactions.
 Supply chain management: Makes it easier to track and monitor shipments, inventories
and logistics processes in real time.
 Personalized marketing: Enables real-time analysis of consumer behaviour and
preferences to power customized marketing initiatives.
APACHE-SPARK
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
SPARK-SQL
Spark introduces a programming module for structured data processing called Spark SQL.
It provides a programming abstraction called DataFrame and can act as a distributed SQL
query engine.
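As a brief illustration (the JSON path is a placeholder), a DataFrame can be registered as a temporary view and queried with SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlDemo").getOrCreate()

// Load structured data (placeholder path) into a DataFrame
val people = spark.read.json("examples/people.json")

// Mix the DataFrame API and SQL over the same data
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()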
Features of Spark SQL
The following are the features of Spark SQL −
Integrated − Seamlessly mix SQL queries with Spark programs. Spark SQL lets you query
structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python,
Scala and Java. This tight integration makes it easy to run SQL queries alongside complex
analytic algorithms.


Unified Data Access − Load and query data from a variety of sources. Schema-RDDs
provide a single interface for efficiently working with structured data, including Apache
Hive tables, parquet files and JSON files.
Hive Compatibility − Run unmodified Hive queries on existing warehouses. Spark SQL
reuses the Hive frontend and MetaStore, giving you full compatibility with existing Hive
data, queries, and UDFs. Simply install it alongside Hive.
Standard Connectivity − Connect through JDBC or ODBC. Spark SQL includes a server
mode with industry standard JDBC and ODBC connectivity.
Scalability − Use the same engine for both interactive and long queries. Spark SQL takes
advantage of the RDD model to support mid-query fault tolerance, letting it scale to large
jobs too. Do not worry about using a different engine for historical data.
Spark SQL Architecture
The following illustration explains the architecture of Spark SQL −

This architecture contains three layers namely, Language API, Schema RDD, and Data
Sources.
Language API − Spark SQL is compatible with different languages and is exposed through
language APIs in Python, Scala, Java and HiveQL.
Schema RDD − Spark Core is designed around a special data structure called the RDD.
Spark SQL generally works on schemas, tables and records, so we can use a Schema RDD
as a temporary table. This Schema RDD is also called a DataFrame.


Data Sources − Usually the Data source for spark-core is a text file, Avro file, etc.
However, the Data Sources for Spark SQL is different. Those are Parquet file, JSON
document, HIVE tables, and Cassandra database.
SPARK – RDD
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain any
type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD is a
fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
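Both creation routes are shown in the short sketch below; it assumes an existing SparkContext sc (as in the Spark shell), and the HDFS path is a placeholder.

// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(numbers.map(_ * 2).reduce(_ + _))   // transformations are lazy; reduce() triggers the job

// 2. Reference a dataset in external storage such as HDFS
val lines = sc.textFile("hdfs:///user/guest/data/input.txt")
println(lines.filter(_.contains("error")).count())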
Spark makes use of the concept of RDD to achieve faster and efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are not
so efficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set
of high-level operators, without having to worry about work distribution and fault
tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between
computations (Ex: between two MapReduce jobs) is to write it to an external stable storage
system (Ex: HDFS). Although this framework provides numerous abstractions for
accessing a cluster’s computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs.
Data sharing is slow in MapReduce due to replication, serialization and disk IO; in terms of
storage, most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.
Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization and disk IO; most
Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache
Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports
in-memory processing: it keeps the state of a job as objects in memory, and those objects can
be shared between jobs. Data sharing in memory is 10 to 100 times faster than going through
the network and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.
Iterative Operations on Spark RDD
The illustration given below shows the iterative operations on Spark RDD. It will store
intermediate results in a distributed memory instead of Stable storage (Disk) and make the
system faster.
Note − If the Distributed memory (RAM) is not sufficient to store intermediate results
(State of the JOB), then it will store those results on the disk

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different queries are run on
the same set of data repeatedly, this particular data can be kept in memory for better
execution times.

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access, the next time you query it. There is
also support for persisting RDDs on disk, or replicated across multiple nodes.
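For example, in the following sketch (the log path is a placeholder) an RDD that is reused across several actions is persisted so it is not recomputed:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app-logs")        // placeholder path
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)               // keep the filtered RDD in memory
println(errors.count())                                // first action computes and caches it
println(errors.filter(_.contains("timeout")).count())  // later actions reuse the cached data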
SPARK MLLIB
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of
popular algorithms and utilities. MLlib is a scalable machine learning library that combines
high-quality algorithms with high speed. It includes machine learning algorithms such as
regression, classification, clustering, pattern mining and collaborative filtering. Lower-level
machine learning primitives, such as the generic gradient descent optimization algorithm,
are also present in MLlib.
Spark.ml is the primary Machine Learning API for Spark. The library Spark.ml offers a
higher-level API built on top of DataFrames for constructing ML pipelines.
Spark MLlib tools are given below:
ML Algorithms
Featurization
Pipelines
Persistence
Utilities
ML Algorithms
ML Algorithms form the core of MLlib. These include common learning algorithms such as
classification, regression, clustering, and collaborative filtering.
MLlib standardizes APIs to make it easier to combine multiple algorithms into a single
pipeline, or workflow. The key concepts are the Pipelines API, where the pipeline concept
is inspired by the scikit-learn project.
Transformer:
A Transformer is an algorithm that can transform one DataFrame into another DataFrame.
Technically, a Transformer implements a method transform(), which converts one
DataFrame into another, generally by appending one or more columns. For example:
A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new
column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.
A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted labels
appended as a column.


Estimator:
An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.
Technically, an Estimator implements a method fit(), which accepts a DataFrame and
produces a Model, which is a Transformer. For example, a learning algorithm such as
LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel,
which is a Model and hence a Transformer.
Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful
algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying
parameters (discussed below).
Featurization
Featurization includes feature extraction, transformation, dimensionality reduction, and
selection.
Feature Extraction is extracting features from raw data.
Feature Transformation includes scaling, renovating, or modifying features
Feature Selection involves selecting a subset of necessary features from a huge set of
features.
Pipelines:
A Pipeline chains multiple Transformers and Estimators together to specify an ML
workflow. It also provides tools for constructing, evaluating and tuning ML Pipelines.
In machine learning, it is common to run a sequence of algorithms to process and learn
from data. MLlib represents such a workflow as a Pipeline, which consists of a sequence of
Pipeline Stages (Transformers and Estimators) to be run in a specific order. We will use
this simple workflow as a running example in this section.
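The running example itself is not reproduced in this unit; the sketch below follows the standard spark.ml pattern, chaining a Tokenizer and a HashingTF (Transformers) with LogisticRegression (an Estimator) into a Pipeline. It assumes an existing SparkSession named spark, and the tiny training set is made up purely for illustration.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Tiny illustrative training set: id, text, label
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The Pipeline (an Estimator) is fit to produce a PipelineModel (a Transformer)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

Calling model.transform(newData) then applies the whole sequence of stages to new data.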
Dataframe
Dataframes provide a more user-friendly API than RDDs. The DataFrame-based API for
MLlib provides a uniform API across ML algorithms and across multiple languages.
Dataframes facilitate practical ML Pipelines, particularly feature transformations.
Persistence:
Persistence helps in saving and loading algorithms, models, and Pipelines. This helps in
reducing time and efforts as the model is persistence, it can be loaded/ reused any time
when needed.
Utilities:


Utilities for linear algebra, statistics and data handling. For example, mllib.linalg provides
MLlib's utilities for linear algebra.
MLFLOW.SPARK
The mlflow.spark module provides an API for logging and loading Spark MLlib models.
This module exports Spark MLlib models with the following flavors:
Spark MLlib (native) format
Allows models to be loaded as Spark Transformers for scoring in a Spark session. Models
with this flavor can be loaded as PySpark PipelineModel objects in Python. This is the main
flavor and is always produced.
mlflow.pyfunc
Supports deployment outside of Spark by instantiating a SparkContext and reading input
data as a Spark DataFrame prior to scoring. Also supports deployment in Spark as a Spark
UDF. Models with this flavor can be loaded as Python functions for performing inference.
This flavor is always produced.
mlflow.mleap
Enables high-performance deployment outside of Spark by leveraging MLeap’s custom
dataframe and pipeline representations. Models with this flavor cannot be loaded back as
Python objects. Rather, they must be deserialized in Java using the mlflow/java package.
This flavor is produced only if you specify MLeap-compatible arguments.
SPARK STRUCTURED STREAMING
Apache Spark Streaming is a real-time data processing framework that enables developers
to process streaming data in near real-time. It is a legacy streaming engine in Apache
Spark that works by dividing continuous data streams into small batches and processing
them using batch processing techniques.
However, Spark Streaming has some limitations, such as lack of fault-tolerance guarantees,
limited API, and lack of support for many data sources. It has also stopped receiving
updates.
Spark Structured Streaming is a newer and more powerful streaming engine that provides a
declarative API and offers end-to-end fault tolerance guarantees. It leverages the power of
Spark’s DataFrame API and can handle both streaming and batch data using the same
programming model. Additionally, Structured Streaming offers a wide range of data
sources, including Kafka, Azure Event Hubs, and more.
Benefits of Spark Streaming


The key benefits of Spark Streaming include:


Low latency: Spark Structured Streaming achieves low latency through its micro-batching
approach. This technique divides the incoming data stream into small, manageable batches
that are processed in near real-time. By processing these micro-batches quickly and
efficiently, Spark Structured Streaming minimizes the time between data ingestion and
result generation.
Flexibility: Spark Structured Streaming accommodates a wide range of data sources and
formats. It supports integration with popular data sources such as Kafka, Kinesis, Azure
Event Hubs, and various file systems. The unified API for batch and stream processing
means that developers can maintain a single codebase for both historical and real-time data.
Real-time processing: Structured Streaming continuously processes data as it arrives,
providing organizations with the ability to gain immediate insights and take action based on
current information. For example, in IoT applications, Spark Structured Streaming can
process and analyze data from sensors in real-time, enabling predictive maintenance and
optimization of industrial processes.
Integration with other Spark components: Spark Structured Streaming is designed to
work seamlessly with other components of the Apache Spark ecosystem. This includes
Spark SQL for querying structured data, MLlib for machine learning tasks, and GraphX for
graph processing.
Spark Structured Streaming Operating Model
Spark Structured Streaming handles live data streams by dividing them into micro-batches
and processing them as if they were a batch query on a static table. The resulting output is
continuously updated as new data arrives, providing real-time insights on the data. This
streaming data processing model is similar to batch processing.
Handling Input and Output Data
Here are some of the basic concepts in Structured Streaming:
Input Table
In Spark Structured Streaming, the input data stream is treated as an unbounded table that
can be queried using Spark’s DataFrame API. Each micro-batch of data is treated as a new
“chunk” of rows in the unbounded table, and the query engine can generate a result table by
applying operations to the unbounded table, just like a regular batch query. The result table
is continuously updated as new data arrives, providing a real-time view of the streaming
data.


Output
In Structured Streaming, the output is defined by specifying a mode for the query. There
are three output modes available:
Complete mode: In this mode, the output table contains the complete set of results for all
input data processed so far. Each time the query is executed, the entire output table is
recomputed and written to the output sink. This mode is useful when you need to generate a
complete snapshot of the data at a given point in time.
Update mode: In this mode, the output table contains only the changed rows since the last
time the query was executed. This mode is useful when you want to track changes to the
data over time and maintain a history of the changes. The update mode requires that the
output sink supports atomic updates and deletes.
Append mode: In this mode, the output table contains only the new rows that have been
added since the last time the query was executed. This mode is useful when you want to
continuously append new data to an existing output table. The append mode requires that
the output sink supports appending new data without modifying existing data.
The choice of mode depends on the use case and the capabilities of the output sink. Some
sinks, such as databases or file systems, may support only one mode, while others may
support multiple modes.
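A minimal sketch of this model, assuming a socket source on localhost:9999 purely for illustration, shows the unbounded-table idea and an explicit output mode:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// The stream of input lines is treated as an unbounded table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder source
  .option("port", 9999)
  .load()

// A normal DataFrame query over the unbounded table
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Complete mode: the full result table is rewritten on every trigger
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()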
Handling Late and Event-Time Data
Event-time data is a concept in stream processing that refers to the time when an event
actually occurred. It is usually different from the processing time, which is the time when
the system receives and processes the event. Event-time data is important in many use
cases, including IoT device-generated events, where the timing of events is critical.
Late data is the data that arrives after the time window for a particular batch of data has
closed. It can occur due to network delays, system failures, or other factors. Late data can
cause issues in processing if not handled correctly, as it can result in incorrect results and
data loss.
Using an event-time column to track IoT device-generated events allows the system to
accurately process events based on when they actually occurred, rather than when they
were received by the system.
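A sketch of event-time handling with a watermark, assuming a streaming DataFrame named events with eventTime (timestamp) and deviceId columns (placeholder names; its construction from a source is omitted):

import org.apache.spark.sql.functions.{col, window}

// Tolerate data arriving up to 10 minutes late, then count readings per device
// in 5-minute event-time windows
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("deviceId"))
  .count()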
Fault Tolerance Semantics


Fault tolerance semantics refers to the guarantees that a streaming system provides to
ensure that data is processed correctly and consistently in the presence of failures, such as
network failures, node failures, or software bugs.
Idempotent streaming sinks are a feature of fault-tolerant streaming systems that ensure that
data is written to the output sink exactly once, even in the event of failures. An idempotent
sink can be called multiple times with the same data without causing duplicates in the
output. Fault-tolerance semantics – such as end-to-end exactly-once semantics – provide a high
degree of reliability and consistency in streaming systems.
