UDS24201J - UNIT IV
SQL        MongoDB
Database   Database
Table      Collection
Row        Document
Column     Field
Index      Index
As discussed in the first table, a field is the smallest part of a MongoDB document, just like a
column is the smallest part of a SQL table. Just as columns grouped together form rows and
rows form tables, fields grouped together form documents and documents form collections.
Another important analogy that needs to be discussed here is that of the primary key. In SQL
systems we can assign any unique column as the primary key. In MongoDB every collection
has an _id field, which is automatically taken as the primary key.
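For instance, a minimal sketch (the users collection and the ObjectId value are hypothetical): a document inserted without an explicit _id automatically receives a generated ObjectId, which then serves as its primary key.
// Insert without specifying _id; MongoDB generates an ObjectId automatically
db.users.insertOne({ name: "Asha", age: 30 })
// The stored document now carries the generated primary key, e.g.:
// { "_id" : ObjectId("65a1f2c3d4e5f6a7b8c9d0e1"), "name" : "Asha", "age" : 30 }
// Look the document up by its primary key
db.users.findOne({ _id: ObjectId("65a1f2c3d4e5f6a7b8c9d0e1") })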
Insert Statements
The following examples compare insertion commands in SQL and MongoDB.
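A representative pair is sketched below, using the same Demo table/collection that appears in the selection examples later in this unit (the status and age fields are taken from those examples):
-- SQL
INSERT INTO Demo (status, age) VALUES ('A', 50);
// MongoDB
db.Demo.insertOne({ status: "A", age: 50 })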
Delete Statements
The following examples compare delete commands in SQL and MongoDB.
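A representative pair is sketched below (again using the Demo table/collection; the filter value "D" is purely illustrative):
-- SQL
DELETE FROM Demo WHERE status = 'D';
// MongoDB
db.Demo.deleteMany({ status: "D" })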
Selection Commands
The following examples compare selection commands in SQL and MongoDB.
SQL:     SELECT * FROM Demo WHERE status != "A"
MongoDB: db.Demo.find( { status: { $ne: "A" } } )

SQL:     SELECT * FROM Demo WHERE status = "A" AND age = 50
MongoDB: db.Demo.find( { status: "A", age: 50 } )

SQL:     SELECT * FROM Demo WHERE status = "A" OR age = 50
MongoDB: db.Demo.find( { $or: [ { status: "A" }, { age: 50 } ] } )

SQL:     SELECT * FROM Demo WHERE age > 25
MongoDB: db.Demo.find( { age: { $gt: 25 } } )

SQL:     SELECT * FROM Demo WHERE age < 25
MongoDB: db.Demo.find( { age: { $lt: 25 } } )
Compound index: Indexes that include multiple fields to handle complex queries more
efficiently.
Multi-key index: Indexes for fields that store arrays, indexing each element of the array.
Text index: Optimized indexes for full-text search operations.
Creating indexes in MongoDB
Indexes can be created using the createIndex() method, and proper indexing ensures faster
data retrieval, especially on frequently queried fields. Here are examples of creating a
single-field index and a compound index on field1 and field2:
// Create a single-field index
db.collection.createIndex({ field: 1 });
// Create a compound index on field1 and field2
db.collection.createIndex({ field1: 1, field2: 1 });
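Similarly, as a brief sketch (the fields tags and description are hypothetical), indexing an array field automatically produces a multi-key index, and a text index is declared with the "text" keyword:
// Multi-key index: created automatically when the indexed field (tags) holds arrays
db.collection.createIndex({ tags: 1 });
// Text index for full-text search on the description field
db.collection.createIndex({ description: "text" });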
6. Efficient Data Modeling
Nested documents: Embed related data within a single document to minimize joins.
Arrays: Use arrays to represent one-to-many relationships efficiently.
Denormalization: Duplicate data when necessary to avoid complex joins.
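As an illustrative sketch (the orders collection and its fields are hypothetical), embedding, arrays, and a denormalized copy keep related data in one document instead of spreading it across joined tables:
db.orders.insertOne({
  orderId: 1001,
  customer: { name: "Asha", city: "Chennai" },   // nested document with related data embedded
  items: [                                        // one-to-many relationship stored as an array
    { sku: "A1", qty: 2 },
    { sku: "B7", qty: 1 }
  ],
  customerCity: "Chennai"                         // denormalized duplicate to avoid a join/lookup
})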
7. Use of Proper Data Types
Choosing the right data types for fields is critical for query performance and storage
efficiency. For example, storing dates as Date types enables efficient date range queries,
while using the correct numeric types can reduce memory usage and improve the speed of
arithmetic operations. Using proper data types also ensures data integrity.
Best practices for selecting data types:
Use Date types for date fields.
Use appropriate numeric types (Int32, Double, etc.) to save space.
Avoid storing large strings or objects in fields that don't require it.
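A small sketch of these practices in the mongo shell (the events collection and its fields are hypothetical):
db.events.insertOne({
  createdAt: new Date("2024-01-15T10:00:00Z"),   // Date type enables efficient range queries
  views: NumberInt(42),                           // 32-bit integer instead of the default double
  totalBytes: NumberLong("1073741824"),           // 64-bit integer for large counts
  price: NumberDecimal("199.99")                  // exact decimal for monetary values
})
// A date range query benefits from the proper Date type
db.events.find({ createdAt: { $gte: new Date("2024-01-01"), $lt: new Date("2024-02-01") } })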
8. Monitoring and Maintenance
Regular monitoring of MongoDB is essential to ensure that it performs optimally over time.
MongoDB provides several tools for monitoring, such as the db.serverStatus() method and
the MongoDB Atlas dashboard. Setting up alerting for critical metrics, like memory usage or
query performance, helps identify issues early.
Set up alerting for critical metrics like CPU usage and memory consumption.
Monitor query performance and identify slow-running queries for optimization.
Maintenance tasks to ensure optimal performance
Regular index optimization: Rebuild indexes periodically to ensure they remain efficient.
Data compaction: Regularly compact data files to reclaim disk space and optimize storage.
Index updates: Create and update indexes based on query patterns to ensure efficient index
utilization.
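A few hedged shell-level sketches of such monitoring and maintenance tasks (the orders collection is hypothetical):
// Overall server health and resource metrics
db.serverStatus()
// Spot currently running (potentially slow) operations
db.currentOp()
// Inspect how a frequently run query uses indexes
db.orders.find({ status: "A" }).explain("executionStats")
// Reclaim disk space for a collection (requires appropriate privileges)
db.runCommand({ compact: "orders" })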
Conclusion
Optimizing MongoDB queries is crucial for improving application performance and
ensuring that our database can handle large datasets efficiently. By implementing techniques
such as proper indexing, query analysis with explain(), efficient pagination, caching, and
using the right data types, we can drastically reduce query execution time and resource
consumption. Additionally, maintaining good data modeling practices and regularly
monitoring your MongoDB deployment will keep it performing at its best.
MongoDB queries offer several key benefits for developers and businesses, particularly for
handling large amounts of unstructured or semi-structured data. Here are some of the benefits
of using MongoDB queries:
1. Flexible Schema Design: MongoDB is a NoSQL database, so its schema can be
dynamic. Queries can handle documents with different structures, making it easier to
work with varying data types.
2. High Performance: MongoDB is optimized for speed, particularly with its ability to
index and perform real-time queries on large datasets. Its support for in-memory
storage can also significantly boost query performance.
3. Powerful Querying: MongoDB offers rich querying capabilities such as:
o Text search: MongoDB supports full-text search, allowing you to search for
words and phrases within string fields.
o Aggregation: MongoDB’s aggregation framework enables complex data
transformation and analysis, such as grouping, sorting, and performing
calculations on data.
o Geospatial queries: MongoDB supports geospatial indexing and queries,
allowing you to efficiently query location-based data.
4. Scalability: MongoDB's horizontal scaling capabilities allow it to scale across many
servers to handle increasing workloads. It supports sharding (splitting data across
multiple machines), which is beneficial for applications with high data volumes.
5. ACID Transactions (from 4.0 onward): MongoDB supports multi-document ACID
transactions, which means you can run queries that need multiple operations to
complete atomically, ensuring data consistency.
6. Real-Time Data Access: MongoDB allows for real-time data updates and instant
querying, which is useful for applications that require real-time analytics or live data
processing.
7. Rich Aggregation Framework: MongoDB's aggregation framework provides a
powerful set of operations for transforming and combining data, such as $match,
$group, $sort, and $project. This allows developers to build complex queries with
ease.
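For instance, a minimal aggregation sketch that combines these stages (the orders collection and its fields custId, status, and amount are hypothetical):
db.orders.aggregate([
  { $match: { status: "A" } },                                 // filter documents
  { $group: { _id: "$custId", total: { $sum: "$amount" } } },  // group and sum per customer
  { $sort: { total: -1 } },                                    // order by the computed total
  { $project: { _id: 0, customer: "$_id", total: 1 } }         // reshape the output
])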
HDFS COMMANDS
9. cp: This command is used to copy files within HDFS. Let's copy the folder sample to sample_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /sample /sample_copied
10. mv: This command is used to move files within HDFS. Let's move the file myfile.txt from the sample folder to sample_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /sample/myfile.txt /sample_copied
11. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /sample_copied -> It will delete all the content inside the directory and then the directory itself.
12. du: It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /sample
13. dus: This command will give the total size of the directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /sample
14. stat: It will give the last modified time of a directory or path. In short, it gives the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /sample
15. setrep: This command is used to change the replication factor of a file/directory in
HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for sample.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 sample.txt
Example 2: To change the replication factor to 4 for the directory /sample stored in HDFS.
bin/hdfs dfs -setrep -R 4 /sample
Note: The -w flag means wait till the replication is completed, and -R means recursively; we
use it for directories, as they may also contain many files and folders inside them.
This model facilitates distributed data sharing, content distribution, and computing
tasks, making it suitable for applications like file sharing, content delivery, and
blockchain networks.
3. Middleware
Middleware acts as a bridge between different software applications or components,
enabling communication and interaction across distributed systems. It abstracts
complexities of network communication, providing services like message passing, remote
procedure calls (RPC), and object management.
Middleware facilitates interoperability, scalability, and fault tolerance by decoupling
application logic from underlying infrastructure.
It supports diverse communication protocols and data formats, enabling seamless
integration between heterogeneous systems.
Middleware simplifies distributed system development, promotes modularity, and
enhances system flexibility, enabling efficient resource utilization and improved system
reliability.
4. Three-Tier
In a distributed operating system, the three-tier architecture divides tasks into presentation,
logic, and data layers. The presentation tier, comprising client machines or devices, handles
user interaction. The logic tier, distributed across multiple nodes or servers, executes
processing logic and coordinates system functions.
The data tier manages storage and retrieval operations, often employing distributed
databases or file systems across multiple nodes.
This modular approach enables scalability, fault tolerance, and efficient resource
utilization, making it ideal for distributed computing environments.
5. N-Tier
In an N-tier architecture, applications are structured into multiple tiers or layers beyond the
traditional three-tier model. Each tier performs specific functions, such as presentation,
logic, data processing, and storage, with the flexibility to add more tiers as needed. In a
distributed operating system, this architecture enables complex applications to be divided
into modular components distributed across multiple nodes or servers.
Each tier can scale independently, promoting efficient resource utilization, fault
tolerance, and maintainability.
3. Distributed Databases
Distributed Databases: Databases that store data across multiple nodes or locations.
They provide a unified view of the data despite its physical distribution. Key features
include support for distributed transactions, replication, and consistent querying.
Fault Tolerance Mechanisms in Distributed Operating System
Below is the fault tolerance mechanism in distributed operating system:
1. Redundancy Strategies
Redundancy: Involves duplicating critical components or systems to ensure reliability
and availability. Strategies include:
o Data Redundancy: Multiple copies of data are stored across different nodes.
o Hardware Redundancy: Using backup hardware components (e.g., servers,
disks) to take over in case of failure.
2. Recovery Techniques
Recovery Techniques: Methods for restoring the system to a stable state after a failure.
Techniques include:
o Checkpointing: Periodically saving the state of a system so that it can be
restored to a recent, consistent point in case of failure.
o Rollback and Replay: Reverting to a previous state and reapplying
operations to recover from failures.
3. Error Handling and Detection
Error Handling: Mechanisms to manage and mitigate the effects of errors or failures.
This includes retrying operations, compensating for errors, and using error recovery
procedures.
Error Detection: Techniques for identifying errors or anomalies, such as using error
logs, monitoring systems, and health checks to detect and address issues promptly.
EXAMPLES OF DISTRIBUTED OPERATING SYSTEM
There are various examples of the distributed operating system. Some of them are as
follows:
Solaris
It is designed for the SUN multiprocessor workstations
OSF/1
It is compatible with Unix and was designed by the Open Software Foundation.
Micros
The MICROS operating system ensures a balanced data load while allocating jobs to all
nodes in the system.
DYNIX
It was developed for Symmetry multiprocessor computers.
Locus
It can access local and remote files at the same time without any location hindrance.
Mach
It supports multithreading and multitasking features.
Real-life Example of Distributed Operating System
1. Web search
We have different web pages, multimedia content, and scanned documents that we need to
search. The purpose of web search is to index the content of the web. So to help us, we use
different search engines like Google, Yahoo, Bing, etc. These search engines use
distributed architecture.
2. Banking system
Suppose there is a bank whose headquarters is in New Delhi. That bank has branch offices
in cities like Ludhiana, Noida, Faridabad, and Chandigarh. You can operate your bank by
going to any of these branches. How is this possible? It’s because whatever changes you
make at one branch office are reflected at all branches. This is because of the distributed
system.
3. Massively multiplayer online games
Nowadays, you can play online games where you can play games with a person sitting in
another country in a real-time environment. How’s it possible? It is because of distributed
architecture.
ADVANTAGES AND DISADVANTAGES OF DISTRIBUTED OPERATING
SYSTEM
Advantages of Distributed Systems
Below are the key advantages of Distributed Systems:
Scalability:
Horizontal Scaling: To support growth and manage higher loads, additional nodes
can be added with simplicity.
Load Balancing: To increase responsiveness and performance, divide workloads
among several servers.
Conflict Resolution: Handling data conflicts and ensuring data integrity in the face
of concurrent updates can be complex.
DESIGN AND IMPLEMENTATION OF DISTRIBUTED OPERATING SYSTEM
Designing and implementing a distributed operating system (DOS) involves creating a
system that coordinates the interaction of multiple machines or nodes in a network, providing
the appearance of a single coherent operating system to the user. These systems offer various
advantages such as fault tolerance, resource sharing, and scalability. Below is an overview of
the design principles and steps involved in implementing a distributed operating system.
Design Principles of a Distributed Operating System:
1. Transparency:
o Access Transparency: Hides the differences in data access mechanisms.
o Location Transparency: The location of resources should be hidden from
users and applications.
o Replication Transparency: The user should not be aware if a resource is
replicated across nodes.
o Concurrency Transparency: Multiple users can concurrently access
resources without interference.
o Failure Transparency: The system should continue to work despite partial
failures.
2. Scalability:
o The system should be able to scale by adding more machines without
significant changes in architecture or performance degradation.
o The load should be distributed efficiently among nodes.
3. Fault Tolerance:
o The system must tolerate failures in individual machines or networks.
o Redundancy and replication should be employed to ensure data availability
and minimize downtime.
4. Resource Management:
o Resources such as CPU, memory, disk, and network bandwidth need to be
allocated and managed across nodes in a fair and efficient manner.
o Centralized management (single node manages resources) or decentralized
management (no central authority, resources managed locally by each node)
can be used.
The HDFS architecture centers on a commanding NameNode that holds metadata and DataNodes that store information in blocks.
HDFS ARCHITECTURE, NAMENODE AND DATANODES
HDFS uses a primary/secondary architecture where each HDFS cluster is comprised of many
worker nodes and one primary node or the NameNode. The NameNode is the controller node,
as it knows the metadata and status of all files including file permissions, names and location
of each block. An application or user can create directories and then store files inside these
directories. The file system namespace hierarchy is like most other file systems, as a user can
create, remove, rename or move files from one directory to another.
The HDFS cluster's NameNode is the primary server that manages the file system namespace
and controls client access to files. As the central component of the Hadoop Distributed File
System, the NameNode maintains and manages the file system namespace and provides
clients with the right access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
NameNode
The NameNode performs the following key functions:
The NameNode performs file system namespace operations, including opening, closing
and renaming files and directories.
The NameNode governs the mapping of blocks to the DataNodes.
The NameNode records any changes to the file system namespace or its properties. An
application can stipulate the number of replicas of a file that the HDFS should maintain.
The NameNode stores the number of copies of a file, called the replication factor of that
file.
To ensure that the DataNodes are alive, the NameNode gets block reports
and heartbeat data.
In case of a DataNode failure, the NameNode selects new DataNodes for replica creation.
DataNodes
In HDFS, DataNodes function as worker nodes or Hadoop daemons and are typically made
of low-cost off-the-shelf hardware. A file is split into one or more of the blocks that are
stored in a set of DataNodes. Based on their replication factor, the files are internally
partitioned into many blocks that are kept on separate DataNodes.
The DataNodes perform the following key functions:
The DataNodes serve read and write requests from the clients of the file system.
The DataNodes perform block creation, deletion and replication when the NameNode
instructs them to do so.
The DataNodes transfer periodic heartbeat signals to the NameNode to help keep HDFS
health in check.
The DataNodes provide block reports to NameNode to help keep track of the blocks
included within the DataNodes. For redundancy and higher availability, each block is
copied onto two extra DataNodes by default.
FEATURES OF HDFS
There are several features that make HDFS particularly useful, including the following:
Data replication. Data replication ensures that the data is always available and prevents
data loss. For example, when a node crashes or there's a hardware failure, replicated data
can be pulled from elsewhere within a cluster, so processing continues while data is being
recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them
across nodes in a large cluster ensures fault tolerance and reliability.
High availability. Because of replication across nodes, data is available even if the
NameNode or DataNode fails.
Scalability. HDFS stores data on various nodes in the cluster, so as requirements
increase, a cluster can scale to hundreds of nodes.
High throughput. Because HDFS stores data in a distributed manner, the data can be
processed in parallel on a cluster of nodes. This, plus data locality, cuts the processing
time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. Minimizing
the distance between the data and the computing process decreases network
congestion and boosts a system's overall throughput.
Snapshots. HDFS supports snapshots, which capture point-in-time copies of the file
system and protect critical data from user or application errors.
What are the benefits of using HDFS?
There are seven main advantages to using HDFS, including the following:
Cost effective. The DataNodes that store the data rely on inexpensive off-the-shelf
hardware, which reduces storage costs. Also, because HDFS is open source, there's no
licensing fee.
Large data set storage. HDFS stores a variety of data of any size and large files -- from
megabytes to petabytes -- in any format, including structured and unstructured data.
Fast recovery from hardware failure. HDFS is designed to detect faults and
automatically recover on its own.
Portability. HDFS is portable across all hardware platforms and compatible with several
operating systems, including Windows, Linux and macOS.
Streaming data access. HDFS is built for high data throughput, which is best for access
to streaming data.
Speed. Because of its cluster architecture, HDFS is fast and can handle 2 GB of data per
second.
Diverse data formats. Hadoop data lakes support a wide range of data formats, including
unstructured such as movies, semistructured such as XML files and structured data
for Structured Query Language databases. Data retrieved via Hadoop is schema-free, so it
can be parsed into any schema and can support diverse data analysis in various ways.
HADOOP – CLUSTER
A cluster is a collection of things; a simple computer cluster is a group of computers that are
connected with each other through a LAN (Local Area Network). The nodes in a cluster share
data and work on the same task, and these nodes are good enough to work as a single unit,
meaning all of them work together.
1. Scalability: Hadoop clusters are very capable of scaling up and scaling down the number of
nodes, i.e. servers or commodity hardware. Let's see with an example what this scalable
property actually means. Suppose an organization wants to analyze or maintain around 5 PB of
data for the upcoming 2 months, so it uses 10 nodes (servers) in its Hadoop cluster to maintain
all of this data. But if, during this period, the organization receives an extra 2 PB of data, it has
to upgrade the number of servers in its Hadoop cluster from 10 to, say, 12 in order to maintain
it. This process of scaling up or scaling down the number of servers in the Hadoop cluster is
called scalability.
2. Flexibility: This is one of the important properties that a Hadoop cluster possesses.
According to this property, the Hadoop cluster is very flexible, which means it can handle any
type of data irrespective of its type and structure. With the help of this property, Hadoop can
process any type of data from online web platforms.
3. Speed: Hadoop clusters can work at very high speed because the data is distributed across
the cluster and also because of their data mapping capability, i.e. the MapReduce architecture,
which works on the master-slave model.
4. No Data-loss: There is no chance of losing data from any node in a Hadoop cluster because
Hadoop clusters have the ability to replicate the data on some other node. So in case of failure
of any node, no data is lost, as a backup of that data is kept.
5. Economical: Hadoop clusters are very cost-efficient because they use a distributed storage
technique, i.e. the data is distributed across all the nodes in a cluster. So to increase the storage
capacity, we only need to add more commodity storage hardware, which is not very costly.
Types of Hadoop clusters
1. Single Node Hadoop Cluster
2. Multiple Node Hadoop Cluster
1. Single Node Hadoop Cluster: In a Single Node Hadoop Cluster, as the name suggests, the
cluster consists of only a single node, which means all our Hadoop daemons, i.e. Name Node,
Data Node, Secondary Name Node, Resource Manager, and Node Manager, run on the same
system or on the same machine. It also means that all of our processes are handled by a single
JVM (Java Virtual Machine) process instance.
2. Multiple Node Hadoop Cluster: A multiple node Hadoop cluster, as the name suggests,
contains multiple nodes. In this kind of cluster setup, the Hadoop daemons run on different
nodes within the same cluster. In general, in a multiple node Hadoop cluster setup we try to
utilize our higher-processing nodes for the masters, i.e. the Name Node and Resource Manager,
and we utilize the cheaper systems for the slave daemons, i.e. the Node Manager and Data
Node.
HADOOP MAP-REDUCE
Hadoop – Mapper In MapReduce
Map-Reduce is a programming model that is mainly divided into two phases: the Map Phase
and the Reduce Phase. It is designed for processing data in parallel, with the data divided
across various machines (nodes). Hadoop Java programs consist of a Mapper class and a
Reducer class along with a driver class. The Hadoop Mapper is a function or task which is
used to process all input records from a file and generate output which works as input for the
Reducer. It produces the output by returning new key-value pairs. The input data has to be
converted to key-value pairs, as the Mapper cannot process raw input records or tuples
(key-value pairs). The Mapper also generates some small blocks of data while processing the
input records as key-value pairs. We will discuss the various processes that occur in the
Mapper, their key features, and how the key-value pairs are generated in the Mapper.
Let’s understand the Mapper in Map-Reduce:
Mapper is a simple user-defined program that performs some operations on input-splits as it is
designed to. Mapper is a base class that needs to be extended by the developer or programmer
in his lines of code according to the organization's requirements. The input and output types
need to be specified in the Mapper class arguments, which the developer must set accordingly.
For Example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
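As a hedged sketch, a classic word-count style Mapper could look like the following, assuming TextInputFormat so that the input key is the line's byte offset and the value is one line of text (the class and variable names are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = one line of text (TextInputFormat)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit intermediate (word, 1) key-value pairs
            }
        }
    }
}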
Mapper is the initial piece of code that interacts with the input dataset. Suppose we have 100
data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper programs or
processes that run in parallel on machines (nodes) and produce their own output, known as
intermediate output, which is then stored on local disk, not on HDFS. The output of the Mapper
acts as input for the Reducer, which performs some sorting and aggregation operations on the
data and produces the final output.
The Mapper mainly consists of 5 components: Input, Input Splits, Record Reader, Map, and
Intermediate output disk. The Map Task is completed with the contribution of all of these
components.
1. Input: Input is records or the datasets that are used for analysis purposes. This Input data
is set out with the help of InputFormat. It helps in identifying the location of the Input
data which is stored in HDFS(Hadoop Distributed File System).
2. Input-Splits: These are responsible for converting the physical input data to some logical
form so that Hadoop Mapper can easily handle it. Input-Splits are generated with the help
of InputFormat. A large data set is divided into many input-splits which depend on the
size of the input dataset. There will be a separate Mapper assigned for each input-splits.
Input-Splits are only referencing to the input data, these are not the actual
data. DataBlocks are not the only factor that decides the number of input-splits in a Map-
Reduce. we can manually configure the size of input-splits
in mapred.max.split.size property while the job is executing. All of these input-splits are
utilized by each of the data blocks. The size of input splits is measured in bytes. Each
input-split is stored at some memory location (Hostname Strings). Map-Reduce places
38
UDS24201J - UNIT IV
map tasks near the location of the split as close as it is possible. The input-split with the
larger size executed first so that the job-runtime can be minimized.
3. Record-Reader: The Record-Reader is the process which deals with the output obtained
from the input-splits and generates its own output as key-value pairs until the file ends. Each
line present in a file is assigned a byte offset with the help of the Record-Reader. By default,
the Record-Reader uses TextInputFormat for converting the data obtained from the
input-splits to key-value pairs, because the Mapper can only handle key-value pairs.
4. Map: The key-value pairs obtained from the Record-Reader are then fed to the Map, which
generates a set of intermediate key-value pairs.
5. Intermediate output disk: Finally, the intermediate key-value pair output is stored on the
local disk as intermediate output. There is no need to store the data on HDFS, as it is an
intermediate output; if we stored this data on HDFS, the writing cost would be higher because
of its replication feature, and it would also increase the execution time. Moreover, if the
executing job is terminated, cleaning up intermediate output left on HDFS is difficult. The
intermediate output is therefore always stored on the local disk, which is cleaned up once the
job completes its execution. On the local disk, this Mapper output is first stored in a buffer
whose default size is 100 MB, which can be configured with the io.sort.mb property. The
output of the Mapper can be written to HDFS if and only if the job is a map-only job; in that
case there is no Reducer task, so the intermediate output is our final output, which can be
written to HDFS. The number of Reducer tasks can be set to zero manually with
job.setNumReduceTasks(0). This Mapper output is of no use to the end user, as it is a
temporary output useful only for the Reducer.
Hadoop – Reducer In MapReduce
The Reducer processes the intermediate key-value pairs produced by the Mapper after they
have been shuffled and sorted by key, which means the value of the key is the main decisive
factor for sorting. The output generated by the Reducer is the final output, which is then stored
on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation
operations like addition, filtration, and aggregation. By default, the number of reducers utilized
for processing the output of the Mapper is 1, which is configurable and can be changed by the
user according to the requirement.
Let’s understand the Reducer in Map-Reduce:
There are multiple Mappers generating key-value pairs as output. The output of each Mapper is
sent to the sorter, which sorts the key-value pairs according to their key values. Shuffling also
takes place during the sorting process, and the sorted output is sent to the Reducer, which
produces the final output.
Let's take an example to understand the working of the Reducer. Suppose we have the data of
the faculty of all departments of a college stored in a CSV file. If we want to find the sum of
the salaries of faculty according to their department, then we can make the department title the
key and the salaries the values. The Reducer will perform the summation operation on this
dataset and produce the desired output.
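A hedged sketch of such a Reducer for the salary example, assuming the Mapper emits (department, salary) pairs as Text and DoubleWritable (the class and variable names are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalarySumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text department, Iterable<DoubleWritable> salaries, Context context)
            throws IOException, InterruptedException {
        // All salaries for one department arrive grouped together
        double total = 0.0;
        for (DoubleWritable salary : salaries) {
            total += salary.get();
        }
        context.write(department, new DoubleWritable(total)); // final (department, total salary) output
    }
}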
The number of Reducers in a Map-Reduce task also affects the features below:
1. Framework overhead increases.
2. Cost of failure reduces.
Let's consider a possible scenario where the project stack does not include the Hadoop
Framework, but the user wants to migrate the data from an RDBMS to an HDFS-equivalent
system, for example Amazon S3. In this scenario, Apache Spark SQL can be used.
Apache Spark SQL has two types of RDBMS components for such a migration, known as
JDBCRDD and JDBCDATAFRAME.
If you need to connect Spark with any RDBMS, then the JDBC type 4 driver jar file needs to
be added to the /lib directory. The following code can be used to check for JDBC connectivity:
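A minimal Scala sketch of such a connectivity check (the SparkSession name spark, the MySQL URL, and the credentials are assumptions; only the employee table name comes from the text):
// Read the employee table over JDBC into a DataFrame
val employeeDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/company")   // hypothetical connection URL
  .option("dbtable", "employee")
  .option("user", "root")                                  // placeholder credentials
  .option("password", "secret")
  .load()
employeeDF.show(5)   // quick check that the connection and read work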
The above code will access the MySQL database and read all the data from the employee table.
The below-mentioned code can be used to achieve parallelism when fetching the data:
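A sketch of a partitioned (parallel) read, assuming employee has a numeric id column whose approximate bounds are known:
val parallelDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/company")
  .option("dbtable", "employee")
  .option("user", "root")
  .option("password", "secret")
  .option("partitionColumn", "id")    // column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "100000")     // assumed range of id values
  .option("numPartitions", "8")       // eight concurrent JDBC connections
  .load()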
If you want to write data in the database, the following code can be used:
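A sketch of writing a DataFrame back to the database (the target table employee_backup is hypothetical):
employeeDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/company")
  .option("dbtable", "employee_backup")
  .option("user", "root")
  .option("password", "secret")
  .mode("append")                     // or "overwrite", depending on the use case
  .save()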
Using the above code snippets, you can import and export data between an RDBMS and an HDFS-equivalent system.
Procedure
1. Clone the GitHub repository containing the test data.
git clone https://github.com/brianmhess/DSE-Spark-HDFS.git
2. Load the maximum temperature test data into the Hadoop cluster using WebHDFS.
In this example, the Hadoop node has a hostname of hadoopNode.example.com. Replace it
with the hostname of a node in your Hadoop cluster.
hadoop fs -mkdir webhdfs://hadoopNode.example.com:50070/user/guest/data &&
hadoop fs -copyFromLocal data/sftmax.csv webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv
3. Create the keyspace and table and load the minimum temperature test data using cqlsh.
CREATE KEYSPACE IF NOT EXISTS spark_ex2 WITH REPLICATION = { 'class':'SimpleStrategy', 'replication_factor':1 };
DROP TABLE IF EXISTS spark_ex2.sftmin;
CREATE TABLE IF NOT EXISTS spark_ex2.sftmin(location TEXT, year INT, month INT, day INT, tmin DOUBLE, datestring TEXT, PRIMARY KEY ((location), year, month, day)) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
COPY spark_ex2.sftmin(location, year, month, day, tmin, datestring) FROM 'data/sftmin.csv';
4. Ensure that we can access the HDFS data by interacting with the data using hadoop fs.
The following command counts the number of lines of HDFS data.
hadoop fs -cat webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv | wc -l
You should see output similar to the following:
16/05/10 11:21:51 INFO snitch.Workload: Setting my workload to Cassandra
3606
5. Start the Spark console and connect to the DataStax Enterprise cluster.
dse spark
Import the Spark Cassandra connector and create the session.
import com.datastax.spark.connector.cql.CassandraConnector
val connector = CassandraConnector(csc.conf)
val session = connector.openSession()
6. Create the table to store the maximum temperature data.
session.execute(s"DROP TABLE IF EXISTS spark_ex2.sftmax")
session.execute(s"CREATE TABLE IF NOT EXISTS spark_ex2.sftmax(location TEXT,
year INT, month INT, day INT, tmax DOUBLE, datestring TEXT, PRIMARY KEY
((location), year, month, day)) WITH CLUSTERING ORDER BY (year DESC, month
DESC, day DESC)")
7. Create a Spark RDD from the HDFS maximum temperature data and save it to the table.
First create a case class representing the maximum temperature sensor data:
case class Tmax(location: String, year: Int, month: Int, day: Int, tmax: Double, datestring:
String)
Read the data into an RDD.
val tmax_raw = sc.textFile("webhdfs://hadoopNode.example.com:50070/user/guest/data/sftmax.csv")
Transform the data so each record in the RDD is an instance of the Tmax case class.
val tmax_c10 = tmax_raw.map(x=>x.split(",")).map(x => Tmax(x(0), x(1).toInt, x(2).toInt,
x(3).toInt, x(4).toDouble, x(5)))
Count the case class instances to make sure it matches the number of records.
tmax_c10.count
res11: Long = 3606
8. Save the case class instances to the database.
tmax_c10.saveToCassandra("spark_ex2", "sftmax")
9. Verify the records match by counting the rows using CQL.
session.execute("SELECT COUNT(*) FROM spark_ex2.sftmax").all.get(0).getLong(0)
res23: Long = 3606
10. Join the maximum and minimum data into a new table.
Create a Tmin case class to store the minimum temperature sensor data.
case class Tmin(location: String, year: Int, month: Int, day: Int, tmin: Double, datestring:
String)
val tmin_raw = sc.cassandraTable("spark_ex2", "sftmin")
val tmin_c10 = tmin_raw.map(x => Tmin(x.getString("location"), x.getInt("year"),
x.getInt("month"), x.getInt("day"), x.getDouble("tmin"), x.getString("datestring")))
In order to join RDDs, they need to be PairRDDs, with the first element in the pair being the
join key.
val tmin_pair = tmin_c10.map(x=>(x.datestring,x))
val tmax_pair = tmax_c10.map(x=>(x.datestring,x))
Create a THiLoDelta case class to store the difference between the maximum and minimum
temperatures.
case class THiLoDelta(location: String, year: Int, month: Int, day: Int, hi: Double, low:
Double, delta: Double, datestring: String)
Join the data using the join operation on the PairRDDs. Convert the joined data to the
THiLoDelta case class.
val tdelta_join1 = tmax_pair.join(tmin_pair)
val tdelta_c10 = tdelta_join1.map { case (date, (hi, lo)) =>
  THiLoDelta(hi.location, hi.year, hi.month, hi.day, hi.tmax, lo.tmin, hi.tmax - lo.tmin, date) }
Create a new table within Spark using CQL to store the temperature difference data.
session.execute(s"DROP TABLE IF EXISTS spark_ex2.sftdelta")
session.execute(s"CREATE TABLE IF NOT EXISTS spark_ex2.sftdelta(location TEXT,
year INT, month INT, day INT, hi DOUBLE, low DOUBLE, delta DOUBLE, datestring
TEXT, PRIMARY KEY ((location), year, month, day)) WITH CLUSTERING ORDER BY
(year DESC, month DESC, day DESC)")
Save the temperature difference data to the table.
tdelta_c10.saveToCassandra("spark_ex2", "sftdelta")
KAFKA STREAM
What is Kafka?
Apache Kafka is a distributed event streaming framework designed to manage fault-tolerant,
high-throughput data streams. It offers a centralized platform for developing real-time data
pipelines and applications, enabling smooth connections between data producers and
consumers.
What is Kafka Stream API?
The Kafka Streams API can be used to simplify stream processing across various disparate
topics. It provides distributed coordination, data parallelism, scalability, and fault tolerance.
This API makes use of the ideas of tasks and partitions as logical units that communicate
with the cluster and are closely related to the topic partitions.
46
UDS24201J - UNIT IV
The fact that the apps you create with the Kafka Streams API are regular Java apps that can be
packaged, deployed, and monitored like any other Java application is one of its unique
features.
Primary Terminologies Related to Kafka Streams API
Tasks: Within the Kafka Streams API, tasks are logical processing units that take in
input data, process it, and then output the results.
Partitions: Segments of Kafka topics that allow applications using Kafka Streams to
scale and process data in parallel.
Stateful Processing: This refers to the Kafka Streams API's capacity to save and update
state data across stream processing operations, enabling intricate analytics and
transformations.
Windowing: A method for processing and aggregating data streams in predetermined
time frames, making windowed joins and aggregations possible.
How Kafka Streams API Works?
1. Initialization: Include the kafka-streams dependency in your project in order to start
using the Kafka Streams API.
2. Topology Construction: Use the Processor API or Streams API DSL to specify the
application's processing logic. This entails defining the data transformations, input topics,
and output topics.
3. Implementation: Create an instance of the Kafka Streams Topology object and set up
characteristics like state storage, input/output serializers, and processing semantics.
4. Deployment: Deploy your Kafka Streams application in a runtime environment, like a
containerised environment or a standalone Java process.
5. Scaling: To provide higher throughput and fault tolerance, Kafka Streams applications
automatically scale horizontally by dividing work across several instances.
Kafka Stream API Workflow
The Kafka Streams API workflow sits between producers and consumers: a Streams
application reads records from input topics, processes them through its tasks, and writes the
results to output topics.
To start using the Kafka Streams API, include the kafka-streams dependency in your Maven project:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>1.1.0</version>
</dependency>
A unique feature of the Kafka Streams API is that the applications you build with it are
normal Java applications. These applications can be packaged, deployed, and monitored
like any other Java application – there is no need to install separate processing clusters
or similar special-purpose and expensive infrastructure.
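A minimal Java sketch of such an application (the topics input-topic and output-topic and the local broker address are assumptions); it simply upper-cases each record value:
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");     // identifies this Streams application
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");    // simple stateless transformation

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                                      // runs until the JVM is stopped
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}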
Advantages of Kafka Stream APIs
The following are the advantages of Kafka Stream APIs:
Simplified Stream Processing: The Kafka Streams API allows developers to
concentrate on application logic by abstracting away the intricacies of stream
processing.
Seamless Integration: Its smooth integration with the current Kafka infrastructure is
due to its membership in the Kafka ecosystem.
Scalability: Because of the horizontal scalability provided by the Kafka Streams API,
applications can manage growing data loads.
Fault Tolerance: Fault tolerance is ensured by built-in processes, which provide
dependable stream processing even in the event of malfunctions.
Disadvantages of Kafka Stream APIs
The following are the disadvantages of Kafka Stream APIs:
Java-Centric: Mostly concentrated on Java, which could be difficult for developers
who are familiar with other languages.
Learning Curve: While streamlining many parts of stream processing, there is some
learning involved in understanding the ideas and APIs of Kafka Streams.
Complexity: Especially for inexperienced users, managing stateful processing and
windowed processes might be complicated.
Resource Consumption: Kafka Streams applications have the potential to use a large
amount of memory and compute power, depending on their size.
SPARK SQL
The following are some notable features of Spark SQL:
Unified Data Access − Load and query data from a variety of sources. Schema-RDDs
provide a single interface for efficiently working with structured data, including Apache
Hive tables, parquet files and JSON files.
Hive Compatibility − Run unmodified Hive queries on existing warehouses. Spark SQL
reuses the Hive frontend and MetaStore, giving you full compatibility with existing Hive
data, queries, and UDFs. Simply install it alongside Hive.
Standard Connectivity − Connect through JDBC or ODBC. Spark SQL includes a server
mode with industry standard JDBC and ODBC connectivity.
Scalability − Use the same engine for both interactive and long queries. Spark SQL takes
advantage of the RDD model to support mid-query fault tolerance, letting it scale to large
jobs too. Do not worry about using a different engine for historical data.
Spark SQL Architecture
The architecture of Spark SQL contains three layers, namely Language API, Schema RDD, and
Data Sources.
Language API − Spark is compatible with different languages, and Spark SQL is supported
through these language APIs: Python, Scala, Java, and HiveQL.
Schema RDD − Spark Core is designed with a special data structure called RDD. Generally,
Spark SQL works on schemas, tables, and records. Therefore, we can use the Schema RDD as
a temporary table. We can call this Schema RDD a DataFrame.
Data Sources − Usually the data source for Spark Core is a text file, an Avro file, etc.
However, the data sources for Spark SQL are different: Parquet files, JSON documents,
Hive tables, and the Cassandra database.
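A small Scala sketch tying these layers together (the SparkSession name spark and the people.json file are hypothetical):
// Load structured data (Data Sources layer) into a DataFrame (Schema RDD layer)
val people = spark.read.json("people.json")
// Register it as a temporary view and query it through the Language API
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 25").show()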
SPARK – RDD
Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain any
type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD is a
fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs − parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.
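A brief Scala sketch of both creation paths (the SparkContext name sc is the usual shell default; the HDFS path is hypothetical):
// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
// 2. Reference a dataset in an external storage system
val lines = sc.textFile("hdfs:///user/guest/data/sample.txt")
// RDDs are operated on in parallel; persist() keeps the data in memory for reuse
numbers.persist()
println(numbers.reduce(_ + _))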
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are not
so efficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set
of high-level operators, without having to worry about work distribution and fault
tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between
computations (Ex: between two MapReduce jobs) is to write it to an external stable storage
system (Ex: HDFS). Although this framework provides numerous abstractions for
accessing a cluster’s computational resources, users still want more.
Both Iterative and Interactive applications require faster data sharing across parallel jobs.
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Regarding
the storage system, most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most Hadoop
applications spend more than 90% of their time doing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache
Spark. The key idea of Spark is Resilient Distributed Datasets (RDD); it supports in-memory
processing computation. This means it stores the state of memory as an object across the jobs,
and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster
than over the network or from disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.
Iterative Operations on Spark RDD
In iterative operations on Spark RDD, intermediate results are stored in distributed memory
instead of stable storage (disk), which makes the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the
state of the job), then those results are stored on the disk.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access, the next time you query it. There is
also support for persisting RDDs on disk, or replicated across multiple nodes.
SPARK MLLIB
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular
algorithms and utilities. MLlib in Spark is a scalable machine learning library that provides
both high-quality algorithms and high speed. It includes machine learning algorithms like
regression, classification, clustering, pattern mining, and collaborative filtering. Lower-level
machine learning primitives, like the generic gradient descent optimization algorithm, are also
present in MLlib.
Spark.ml is the primary Machine Learning API for Spark. The library Spark.ml offers a
higher-level API built on top of DataFrames for constructing ML pipelines.
Spark MLlib tools are given below:
ML Algorithms
Featurization
Pipelines
Persistence
Utilities
ML Algorithms
ML Algorithms form the core of MLlib. These include common learning algorithms such as
classification, regression, clustering, and collaborative filtering.
MLlib standardizes APIs to make it easier to combine multiple algorithms into a single
pipeline, or workflow. The key concepts are the Pipelines API, where the pipeline concept
is inspired by the scikit-learn project.
Transformer:
A Transformer is an algorithm that can transform one DataFrame into another DataFrame.
Technically, a Transformer implements a method transform(), which converts one
DataFrame into another, generally by appending one or more columns. For example:
A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new
column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.
A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted labels
appended as a column.
Estimator:
An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.
Technically, an Estimator implements a method fit(), which accepts a DataFrame and
produces a Model, which is a Transformer. For example, a learning algorithm such as
LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel,
which is a Model and hence a Transformer.
Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful
algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying
parameters (discussed below).
Featurization
Featurization includes feature extraction, transformation, dimensionality reduction, and
selection.
Feature Extraction is extracting features from raw data.
Feature Transformation includes scaling, renovating, or modifying features
Feature Selection involves selecting a subset of necessary features from a huge set of
features.
Pipelines:
A Pipeline chains multiple Transformers and Estimators together to specify an ML
workflow. It also provides tools for constructing, evaluating and tuning ML Pipelines.
In machine learning, it is common to run a sequence of algorithms to process and learn
from data. MLlib represents such a workflow as a Pipeline, which consists of a sequence of
Pipeline Stages (Transformers and Estimators) to be run in a specific order. We will use
this simple workflow as a running example in this section.
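A condensed Scala sketch of such a pipeline (the training and test DataFrames with text and label columns are hypothetical):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Two Transformers and an Estimator chained into a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)        // Estimator.fit() returns a PipelineModel (a Transformer)
val predictions = model.transform(test)   // Transformer.transform() appends prediction columns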
Dataframe
Dataframes provide a more user-friendly API than RDDs. The DataFrame-based API for
MLlib provides a uniform API across ML algorithms and across multiple languages.
Dataframes facilitate practical ML Pipelines, particularly feature transformations.
Persistence:
Persistence helps in saving and loading algorithms, models, and Pipelines. This helps in
reducing time and effort, as a persisted model can be loaded and reused any time it is needed.
Utilities:
Utilities for linear algebra, statistics, and data handling. Example: mllib.linalg is MLlib
utilities for linear algebra.
MLFLOW.SPARK
The mlflow.spark module provides an API for logging and loading Spark MLlib models.
This module exports Spark MLlib models with the following flavors:
Spark MLlib (native) format
Allows models to be loaded as Spark Transformers for scoring in a Spark session. Models
with this flavor can be loaded as PySpark PipelineModel objects in Python. This is the main
flavor and is always produced.
mlflow.pyfunc
Supports deployment outside of Spark by instantiating a SparkContext and reading input
data as a Spark DataFrame prior to scoring. Also supports deployment in Spark as a Spark
UDF. Models with this flavor can be loaded as Python functions for performing inference.
This flavor is always produced.
mlflow.mleap
Enables high-performance deployment outside of Spark by leveraging MLeap’s custom
dataframe and pipeline representations. Models with this flavor cannot be loaded back as
Python objects. Rather, they must be deserialized in Java using the mlflow/java package.
This flavor is produced only if you specify MLeap-compatible arguments.
SPARK STRUCTURED STREAMING
Apache Spark Streaming is a real-time data processing framework that enables developers
to process streaming data in near real-time. It is a legacy streaming engine in Apache
Spark that works by dividing continuous data streams into small batches and processing
them using batch processing techniques.
However, Spark Streaming has some limitations, such as lack of fault-tolerance guarantees,
limited API, and lack of support for many data sources. It has also stopped receiving
updates.
Spark Structured Streaming is a newer and more powerful streaming engine that provides a
declarative API and offers end-to-end fault tolerance guarantees. It leverages the power of
Spark’s DataFrame API and can handle both streaming and batch data using the same
programming model. Additionally, Structured Streaming offers a wide range of data
sources, including Kafka, Azure Event Hubs, and more.
Benefits of Spark Streaming
Output
In Structured Streaming, the output is defined by specifying a mode for the query. There
are three output modes available:
Complete mode: In this mode, the output table contains the complete set of results for all
input data processed so far. Each time the query is executed, the entire output table is
recomputed and written to the output sink. This mode is useful when you need to generate a
complete snapshot of the data at a given point in time.
Update mode: In this mode, the output table contains only the changed rows since the last
time the query was executed. This mode is useful when you want to track changes to the
data over time and maintain a history of the changes. The update mode requires that the
output sink supports atomic updates and deletes.
Append mode: In this mode, the output table contains only the new rows that have been
added since the last time the query was executed. This mode is useful when you want to
continuously append new data to an existing output table. The append mode requires that
the output sink supports appending new data without modifying existing data.
The choice of mode depends on the use case and the capabilities of the output sink. Some
sinks, such as databases or file systems, may support only one mode, while others may
support multiple modes.
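A compact Scala sketch showing where the output mode is specified (the SparkSession name spark and the socket source host/port are assumptions):
import org.apache.spark.sql.functions._

// Streaming word count over a socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // "update" or "append" would be chosen for other query shapes
  .format("console")
  .start()
query.awaitTermination()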
Handling Late and Event-Time Data
Event-time data is a concept in stream processing that refers to the time when an event
actually occurred. It is usually different from the processing time, which is the time when
the system receives and processes the event. Event-time data is important in many use
cases, including IoT device-generated events, where the timing of events is critical.
Late data is the data that arrives after the time window for a particular batch of data has
closed. It can occur due to network delays, system failures, or other factors. Late data can
cause issues in processing if not handled correctly, as it can result in incorrect results and
data loss.
Using an event-time column to track IoT device-generated events allows the system to
accurately process events based on when they actually occurred, rather than when they
were received by the system.
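A short Scala sketch of event-time windowing with a watermark that bounds how late data may arrive (the streaming DataFrame events with eventTime and deviceId columns is hypothetical):
import org.apache.spark.sql.functions._

// Tolerate events up to 10 minutes late, then count per device in 5-minute event-time windows
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("deviceId"))
  .count()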
Fault Tolerance Semantics
Fault tolerance semantics refers to the guarantees that a streaming system provides to
ensure that data is processed correctly and consistently in the presence of failures, such as
network failures, node failures, or software bugs.
Idempotent streaming sinks are a feature of fault-tolerant streaming systems that ensure that
data is written to the output sink exactly once, even in the event of failures. An idempotent
sink can be called multiple times with the same data without causing duplicates in the
output. Fault-tolerance semantics – such as end-to-end exactly-once semantics – provide a
high degree of reliability and consistency in streaming systems.