Cloud Computing Unit 3

Data in the cloud: Relational databases

A cloud database is a database service built and accessed through a cloud platform.
It serves many of the same functions as a traditional database with the added
flexibility of cloud computing. Users install software on a cloud infrastructure to
implement the database.
Key features:
 A database service built and accessed through a cloud platform
 Enables enterprise users to host databases without buying dedicated hardware
 Can be managed by the user or offered as a service and managed by a provider
 Can support relational databases (including MySQL and PostgreSQL) and NoSQL
databases (including MongoDB and Apache CouchDB)
 Accessed through a web interface or vendor-provided API
Why cloud databases

Ease of access
Users can access cloud databases from virtually anywhere, using a vendor’s API or
web interface.

Scalability
Cloud databases can expand their storage capacity at run time to accommodate
changing needs. Organizations only pay for what they use.

Disaster recovery
In the event of a natural disaster, equipment failure or power outage, data is kept
secure through backups on remote servers.
Nowadays, data is increasingly stored in the cloud, also known as a virtual
environment, whether in a hybrid, public, or private cloud. A cloud database
is a database that has been optimized or built for such a virtualized environment.
Benefits of a cloud database include the ability to pay for storage capacity and
bandwidth on a per-user basis, scalability on demand, and high availability.
A cloud database also gives enterprises the opportunity to support business
applications in a software-as-a-service deployment.
These databases organize data into a set of tables, where data fits into predefined
categories. Each table consists of rows and columns: a column holds the entries for
one category of data, and each row contains one instance of data for those categories.
The Structured Query Language (SQL) is the standard user and application program
interface for a relational database.
A set of simple operations can be applied over these tables, which makes relational
databases easy to extend, allows two tables to be joined on a common relation, and
lets existing applications be modified, as the sketch below illustrates.
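As an illustration, here is a minimal sketch of creating and querying such a table through SQL from Java using JDBC. The connection URL, credentials, table, and data are hypothetical; any cloud-hosted relational database (for example a managed MySQL or PostgreSQL service) would work similarly.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RelationalDbExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical cloud-hosted MySQL endpoint; replace with your provider's URL and credentials.
        String url = "jdbc:mysql://example-cloud-db.example.com:3306/shop";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret");
             Statement stmt = conn.createStatement()) {

            // Each column is one category of data; each row is one instance.
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS customers ("
                    + "id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(100))");
            stmt.executeUpdate("INSERT INTO customers VALUES (1, 'Asha', 'Pune')");

            // SQL is the standard interface for querying the relational table.
            try (ResultSet rs = stmt.executeQuery("SELECT name, city FROM customers WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " - " + rs.getString("city"));
                }
            }
        }
    }
}
```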
HDFS is the storage unit of Hadoop, used to store and process huge volumes of data
across multiple DataNodes. It is designed to run on low-cost hardware and provides
high fault tolerance and high throughput.
A large file is broken down into small blocks of data. HDFS has a default block size of
128 MB, which can be increased as per requirement. Multiple copies of each block are
stored in the cluster in a distributed manner on different nodes, as the sketch below shows.
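A minimal sketch of writing a file to HDFS through the Hadoop FileSystem API, explicitly passing a replication factor and block size; the NameNode URI and file path are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/example.txt");
        // bufferSize 4096, replication factor 3, block size 128 MB (the HDFS default).
        try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("HDFS splits this file into blocks and replicates each block.");
        }
        fs.close();
    }
}
```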

As the number of internet users grew in the early 2000s, Google faced the problem of
storing increasing user data on its traditional data servers. Thousands of search queries
were raised per second, and there was a need for a large, distributed, highly fault-tolerant
file system to store and process the data. The solution was the Google File System (GFS).

GFS consists of a single master and multiple chunk servers.


Files are divided into fixed-size chunks of 64 MB each. Each chunk is replicated on
multiple chunk servers (three by default), so even if a chunk server crashes, the data
file is still available on the other chunk servers.

This helped Google to store and process huge volumes of data in a distributed manner.
Differences between HBase and Cloud Bigtable
One way to access Cloud Bigtable is to use a customized version of the Apache
HBase client for Java. In general, the customized client exposes the same API as a
standard installation of HBase. This section describes the differences between the
Bigtable HBase client for Java and a standard HBase installation. Many of these
differences are related to management tasks that Bigtable handles automatically.

Column families
When you create a column family, you cannot configure the block size or
compression method, either with the HBase shell or through the HBase API. Bigtable
manages the block size and compression for you.

In addition, if you use the HBase shell to get information about a table, the HBase
shell will always report that each column family does not use compression. In reality,
Bigtable uses proprietary compression methods for all of your data.

Bigtable requires that column family names follow the regular expression
[_a-zA-Z0-9][-_.a-zA-Z0-9]*. If you are importing data into Bigtable from HBase, you
might need to first change the family names to follow this pattern.
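As a sketch, creating a table through the Bigtable HBase client looks like an ordinary HBase Admin call, except that only the column family name is supplied; the project ID, instance ID, and table name below are hypothetical, and no block size or compression is configured because Bigtable manages both.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CreateBigtableTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical project and instance IDs.
        try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
             Admin admin = connection.getAdmin()) {

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("greetings"));
            // Family name must match [_a-zA-Z0-9][-_.a-zA-Z0-9]* ; no block size or
            // compression is configured here -- Bigtable handles both automatically.
            table.addFamily(new HColumnDescriptor("cf1"));
            admin.createTable(table);
        }
    }
}
```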

Rows and cells


 You cannot define an ACL for an individual row.

 You cannot set the visibility of individual cells.

 Tags are not supported. You cannot use the class org.apache.hadoop.hbase.Tag to add metadata
to individual cells.

Mutations and deletions


 Append operations in Bigtable are fully atomic for both readers and writers. Readers will never
be able to read a partially applied Append operation.

 Deleting a specific version of a specific column based on its timestamp is supported, but
deleting all values with a specific timestamp in a given column family or row is not
supported (a sketch of a supported delete follows this list). The following methods in the
class org.apache.hadoop.hbase.client.Delete are not supported:
 new Delete(byte[] row, long timestamp)

 addColumn(byte[] family, byte[] qualifier)

 addFamily(byte[] family, long timestamp)


 addFamilyVersion(byte[] family, long timestamp)

 In HBase, deletes mask puts, but Bigtable does not mask puts after deletes when put
requests are sent after deletion requests. This means that in Bigtable, a write request sent to
a cell is not affected by a previously sent delete request to the same cell.
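For example, deleting one specific version of one column by its timestamp is supported, while the constructors and methods listed above are not. A minimal sketch, assuming an already-open Table and hypothetical row, family, and qualifier names:

```java
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteCellVersion {
    // Deletes one specific version of one cell; this form is supported on Bigtable.
    static void deleteVersion(Table table, long timestamp) throws Exception {
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("greeting"), timestamp);
        // Not supported on Bigtable (do not use):
        //   new Delete(row, timestamp)
        //   delete.addFamily(family, timestamp)
        table.delete(delete);
    }
}
```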

Gets and scans


 Reverse scans are not supported. You cannot call the
method org.apache.hadoop.hbase.client.Scan#setReversed(boolean reversed).

 Querying versions of column families within a timestamp range is not supported. You cannot
call the following methods:
 org.apache.hadoop.hbase.client.Query#setColumnFamilyTimeRange(byte[] cf, long minStamp, long
maxStamp)

 org.apache.hadoop.hbase.client.Get#setColumnFamilyTimeRange(byte[] cf, long minStamp, long maxStamp)

 org.apache.hadoop.hbase.client.Scan#setColumnFamilyTimeRange(byte[] cf, long minStamp, long maxStamp)

 Limiting the number of values per row per column family is not supported. You cannot call
the method org.apache.hadoop.hbase.client.Scan#setMaxResultsPerColumnFamily(int limit) .

 Setting the maximum number of cells to return for each call to next() is not supported. Calls
to the method org.apache.hadoop.hbase.client.Scan#setBatch(int batch) are ignored.

 Setting the number of rows for caching is not supported. Calls to the
method org.apache.hadoop.hbase.client.Scan#setCaching(int caching) are ignored.
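Within these limits, an ordinary forward scan over a row range behaves as it does in HBase. A minimal sketch, assuming an already-open Table and the hypothetical cf1 column family used earlier:

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ForwardScanExample {
    static void scanRange(Table table) throws Exception {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("row-0"));
        scan.setStopRow(Bytes.toBytes("row-9"));
        scan.addFamily(Bytes.toBytes("cf1"));
        // scan.setReversed(true) and setColumnFamilyTimeRange(...) would not work here;
        // setBatch(...) and setCaching(...) are silently ignored by the Bigtable client.
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}
```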

Coprocessors
Coprocessors are not supported. You cannot create classes that implement the
interface org.apache.hadoop.hbase.coprocessor.

Filters
The following lists show which filters are currently supported. All of these filters
are in the package org.apache.hadoop.hbase.filter.

Supported:
ColumnPrefixFilter, FamilyFilter, FilterList, FuzzyRowFilter, MultipleColumnPrefixFilter,
MultiRowRangeFilter, PrefixFilter (6), RandomRowFilter, TimestampsFilter

Supported, with limitations:
ColumnCountGetFilter (1), ColumnPaginationFilter (1), ColumnRangeFilter (1),
FirstKeyOnlyFilter (1), KeyOnlyFilter (2), PageFilter (5), QualifierFilter (3),
RowFilter (1, 4), SingleColumnValueExcludeFilter (1, 4, 7), SingleColumnValueFilter (4, 7),
ValueFilter (4)

Not supported:
DependentColumnFilter, FirstKeyValueMatchingQualifiersFilter, InclusiveStopFilter,
ParseFilter, SkipFilter, WhileMatchFilter

Notes on the limitations:

1. Supports only a single column family.

2. Calling setLenAsVal(true) is not supported.

3. Supports only the BinaryComparator comparator. If any operator other than EQUAL is used, only a single column family is supported.

4. Supports only the following comparators:

 BinaryComparator

 RegexStringComparator with no flags (flags are ignored) and the EQUAL operator

5. If a PageFilter is in a FilterList, PageFilter will only work similarly to HBase when the FilterList is set to MUST_PASS_ALL, which is the default behavior. If the FilterList is set to MUST_PASS_ONE, Cloud Bigtable treats the PageFilter as MUST_PASS_ALL and only returns a number of rows corresponding to the PageFilter's pageSize.

6. PrefixFilter scans for rows in the PrefixFilter in most cases. However, if PrefixFilter is part of a FilterList and has the operator MUST_PASS_ONE, Bigtable cannot determine the implied range and instead performs an unfiltered scan from the start row to the stop row. Use PrefixFilter with BigtableExtendedScan or a combination of filters to optimize performance in this case.

7. Relies on the Bigtable condition filter, which can be slow. Supported but not recommended.

In addition, the following differences affect Bigtable filters:

 In filters that use the regular expression comparator
(org.apache.hadoop.hbase.filter.RegexStringComparator), regular expressions use RE2 syntax, not
Java syntax.

 Custom filters are not supported. You cannot create classes that inherit
from org.apache.hadoop.hbase.filter.Filter.

 Reverse scans are not supported. You cannot call the
method org.apache.hadoop.hbase.filter.Filter#setReversed(boolean reversed).

 There is a size limit of 20 KB on filter expressions. As a workaround to reduce the size of a
filter expression, use a supplementary column that stores the hash value of the filter criteria.
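As an example of staying inside the supported subset, a ValueFilter combined with the EQUAL operator and a BinaryComparator works on Bigtable. The sketch below is illustrative; the value being matched is hypothetical.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SupportedFilterExample {
    static Scan scanForExactValue() {
        // EQUAL + BinaryComparator is within Bigtable's supported subset for ValueFilter.
        // A RegexStringComparator would also be accepted, but its pattern must use
        // RE2 syntax with no flags; custom Filter subclasses are not supported at all.
        ValueFilter exactMatch = new ValueFilter(
                CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("hello")));

        Scan scan = new Scan();
        scan.setFilter(exactMatch);
        return scan;
    }
}
```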

Timestamps
Bigtable stores timestamps in microseconds, while HBase stores timestamps in
milliseconds. This distinction has implications when you use the HBase client library
for Bigtable and you have data with reversed timestamps.

The client library converts between microseconds and milliseconds, but because the
largest HBase timestamp that Bigtable can store is Long.MAX_VALUE/1000, any value
larger than that is converted to Long.MAX_VALUE/1000. As a result, large reversed
timestamp values might not convert correctly.
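The sketch below illustrates the arithmetic described above; the method is illustrative and not part of the client library's API.

```java
public class TimestampConversion {
    // Largest HBase (millisecond) timestamp that still fits in Bigtable's
    // microsecond representation without overflowing a signed 64-bit value.
    static final long MAX_HBASE_TIMESTAMP_MS = Long.MAX_VALUE / 1000;

    // Illustrative conversion: HBase milliseconds -> Bigtable microseconds.
    static long toBigtableMicros(long hbaseMillis) {
        if (hbaseMillis > MAX_HBASE_TIMESTAMP_MS) {
            hbaseMillis = MAX_HBASE_TIMESTAMP_MS;  // larger values are clamped
        }
        return hbaseMillis * 1000;
    }
}
```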

Administration
This section describes methods in the interface org.apache.hadoop.hbase.client.Admin that
are not available on Bigtable, or that behave differently on Bigtable than on HBase.
These lists are not exhaustive, and they might not reflect the most recently added
HBase API methods.

Most of these methods are unnecessary on Bigtable, because management tasks are
handled automatically. A few methods are not available because they relate to
features that Bigtable does not support.

General maintenance tasks

Bigtable handles most maintenance tasks automatically. As a result, the following
methods are not available:

 abort(String why, Throwable e)

 balancer()

 enableCatalogJanitor(boolean enable)

 getMasterInfoPort()

 getOperationTimeout()

 isCatalogJanitorEnabled()

 rollWALWriter(ServerName serverName)

 runCatalogScan()

 setBalancerRunning(boolean on, boolean synchronous)

 shutdown()

 stopMaster()

 updateConfiguration()

 updateConfiguration(ServerName serverName)

Locality groups

Bigtable does not allow you to specify locality groups for column families. As a
result, you cannot call HBase methods that return a locality group.

Namespaces

Bigtable does not use namespaces. You can use row key prefixes to simulate
namespaces. The following methods are not available:

 createNamespace(NamespaceDescriptor descriptor)
 deleteNamespace(String name)

 getNamespaceDescriptor(String name)

 listNamespaceDescriptors()

 listTableDescriptorsByNamespace(String name)

 listTableNamesByNamespace(String name)

 modifyNamespace(NamespaceDescriptor descriptor)

Region management

Bigtable uses tablets, which are similar to regions. Bigtable manages your tablets
automatically. As a result, the following methods are not available:

 assign(byte[] regionName)

 closeRegion(byte[] regionname, String serverName)

 closeRegion(ServerName sn, HRegionInfo hri)

 closeRegion(String regionname, String serverName)

 closeRegionWithEncodedRegionName(String encodedRegionName, String serverName)

 compactRegion(byte[] regionName)

 compactRegion(byte[] regionName, byte[] columnFamily)

 compactRegionServer(ServerName sn, boolean major)

 flushRegion(byte[] regionName)

 getAlterStatus(byte[] tableName)

 getAlterStatus(TableName tableName)

 getCompactionStateForRegion(byte[] regionName)

 getOnlineRegions(ServerName sn)

 majorCompactRegion(byte[] regionName)

 majorCompactRegion(byte[] regionName, byte[] columnFamily)

 mergeRegions(byte[] encodedNameOfRegionA, byte[] encodedNameOfRegionB, boolean forcible)

 move(byte[] encodedRegionName, byte[] destServerName)

 offline(byte[] regionName)

 splitRegion(byte[] regionName)

 splitRegion(byte[] regionName, byte[] splitPoint)

 stopRegionServer(String hostnamePort)

 unassign(byte[] regionName, boolean force)


Snapshots

The following methods are not available.

 deleteSnapshots(Pattern pattern)

 deleteSnapshots(String regex)

 isSnapshotFinished(HBaseProtos.SnapshotDescription snapshot)

 restoreSnapshot(byte[] snapshotName)

 restoreSnapshot(String snapshotName)

 restoreSnapshot(byte[] snapshotName, boolean takeFailSafeSnapshot)

 restoreSnapshot(String snapshotName, boolean takeFailSafeSnapshot)

 snapshot(HBaseProtos.SnapshotDescription snapshot)

Table management

Tasks such as table compaction are handled automatically. As a result, the following
methods are not available:

 compact(TableName tableName)

 compact(TableName tableName, byte[] columnFamily)

 flush(TableName tableName)

 getCompactionState(TableName tableName)

 majorCompact(TableName tableName)

 majorCompact(TableName tableName, byte[] columnFamily)

 modifyTable(TableName tableName, HTableDescriptor htd)

 split(TableName tableName)

 split(TableName tableName, byte[] splitPoint)

Coprocessors

Bigtable does not support coprocessors. As a result, the following methods are not
available:

 coprocessorService()

 coprocessorService(ServerName serverName)

 getMasterCoprocessors()

Distributed procedures

Bigtable does not support distributed procedures. As a result, the following methods
are not available:

 execProcedure(String signature, String instance, Map<String, String> props)


 execProcedureWithRet(String signature, String instance, Map<String, String> props)

 isProcedureFinished(String signature, String instance, Map<String, String> props)

Some similarities:

1. Both are NoSQL, which means neither supports joins, transactions, typed columns, etc.
2. Both can handle significant amounts of data - petabyte scale! This is achieved through
support for linear horizontal scaling.
3. Both emphasize high availability - through replication and versioning.
4. Both are schema-free: you can create a table and add column families or columns later.
5. Both have APIs for the most popular languages - Java, Python, C#, C++. The complete
lists of supported languages differ a bit.
6. Both support the Apache HBase Java API: after Apache HBase's success, Google added
support for an HBase-like API for Bigtable, but with some limitations - see the API
differences above.
Some differences:

1. Apache HBase is an open source project, while Bigtable is not.
2. Apache HBase can be installed in any environment; it uses Apache Hadoop's HDFS as
its underlying storage. Bigtable is available only as a cloud service from Google.
3. Apache HBase is free, while Bigtable is not.
4. While some APIs are common, others are not - Bigtable supports a gRPC (protobuf-based)
API, while Apache HBase has Thrift and REST APIs.
5. Apache HBase supports server-side scripting (e.g. triggers) and in general is more open to
extensions due to its open source nature.
6. Bigtable supports multi-cluster replication.
7. Apache HBase always has immediate consistency, while Bigtable has eventual consistency
in worst-case scenarios.
8. Different security models - Apache HBase uses Access Control Lists, while Bigtable relies
on Google's Cloud Identity and Access Management.
DynamoDB allows users to create databases capable of storing and
retrieving any amount of data and comes in handy while serving any amount
of traffic. It dynamically manages each customer’s requests and provides
high performance by automatically distributing data and traffic over servers.
It is a fully managed NoSQL database service that is fast, predictable in
terms of performance, and seamlessly scalable. It relieves the user of the
administrative burdens of operating and scaling a distributed database, as the
user doesn’t have to worry about hardware provisioning, patching software,
or cluster scaling. It also eliminates the operational burden and complexity
involved in protecting sensitive data by providing encryption at rest.

DynamoDB Vs RDBMS
The table below summarizes the core differences between a conventional relational
database management system and AWS DynamoDB:

Operation | DynamoDB | RDBMS
Source connection | Uses HTTP requests and API operations. | Uses a persistent connection and SQL commands.
Create Table | Mainly requires the primary key and no schema on creation; can have various data sources. | Requires a well-defined table for its operations.
Getting Table Information | Only primary keys are revealed. | All data inside the table is accessible.
Loading Table Data | Uses items made of attributes. | Uses rows made of columns.
Reading Table Data | Uses GetItem, Query, and Scan operations. | Uses SELECT statements and filtering statements.
Managing Indexes | Uses a secondary index to achieve the same function; requires specifications (partition key and sort key). | Standard indexes created with SQL are used.
Modifying Table Data | Uses an UpdateItem operation. | Uses an UPDATE statement.
Deleting Table Data | Uses a DeleteItem operation. | Uses a DELETE statement.
Deleting Table | Uses a DeleteTable operation. | Uses a DROP TABLE statement.
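To make the DynamoDB side of the comparison concrete, here is a sketch using the AWS SDK for Java document API. The table name, key, and attributes are hypothetical, and the table is assumed to already exist with "id" as its partition key.

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class DynamoDbExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        DynamoDB dynamoDB = new DynamoDB(client);
        Table table = dynamoDB.getTable("Orders");   // hypothetical table

        // Loading data: an item is a set of attributes (no fixed schema beyond the key).
        table.putItem(new Item()
                .withPrimaryKey("id", "order-1001")
                .withString("customer", "Asha")
                .withNumber("amount", 499));

        // Reading data: GetItem fetches a single item by its primary key
        // (the RDBMS equivalent would be a SELECT ... WHERE id = ... statement).
        Item item = table.getItem("id", "order-1001");
        System.out.println(item.toJSONPretty());
    }
}
```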

Advantage of DynamoDB:
The main advantages of opting for Dynamodb are listed below:
 It has fast and predictable performance.
 It is highly scalable.
 It offloads the administrative burden of operating and scaling.
 It offers encryption at rest for data protection.
 Its scalability is highly flexible.
 AWS Management Console can be used to monitor resource utilization
and performance metrics.
 It provides on-demand backups.
 It enables point-in-time recovery for your Amazon DynamoDB tables.
Point-in-time recovery helps protect your tables from accidental write or
delete operations. With point-in-time recovery, you can restore that table
to any point in time during the last 35 days.
 It can be highly automated.

Limitations of DynamoDB –

The list below summarizes the limitations of Amazon DynamoDB:


 It has a low read capacity unit of 4 KB per second and a low write capacity unit
of 1 KB per second.
 All tables and global secondary indexes must have a minimum of one
read and one write capacity unit.
 Table sizes have no limits, but accounts have a 256 table limit unless you
request a higher cap.
 Only five local and twenty global secondary indexes (default quota) are permitted
per table.
 DynamoDB does not prevent the use of reserved words as names.
 The partition key has a minimum length of 1 byte and a maximum length of
2048 bytes; however, DynamoDB places no limit on the values themselves.
MapReduce and HDFS are the two major components of Hadoop that make it so
powerful and efficient to use. MapReduce is a programming model used for efficient
parallel processing of large data sets in a distributed manner. The data is first split
and then combined to produce the final result. Libraries for MapReduce have been
written in many programming languages, with various optimizations. In Hadoop,
MapReduce maps each job and then reduces it to equivalent tasks, which lowers
overhead on the cluster network and reduces the processing power required. The
MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do
which is comprised of so many smaller tasks that the client wants to
process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main
job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the
Hadoop MapReduce Master. The MapReduce master then divides this job into further
equivalent job-parts. These job-parts are then made available for the Map and Reduce
tasks. The Map and Reduce tasks contain the program written for the particular use
case the company is solving; the developer writes the logic that fulfills that
requirement. The input data is fed to the Map task, and the Map generates intermediate
key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then
fed to the Reducer, and the final output is stored on HDFS. There can be any number of
Map and Reduce tasks made available for processing the data, as per the requirement.
The Map and Reduce algorithms are written in an optimized way so that time and
space complexity are kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs.
The input to the Map may be a key-value pair where the key can be the id of some
kind of address and the value is the actual value that it keeps. The Map() function is
executed in its memory repository on each of these input key-value pairs and
generates the intermediate key-value pairs, which work as input for the Reducer
or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups
the data based on its key as per the reducer algorithm written by the developer, as the
word-count sketch after this list illustrates.
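The classic word count is a compact illustration of the two phases. The sketch below follows the standard Hadoop MapReduce API; the class names are illustrative, and the driver (job configuration) code is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce phase: the shuffled and sorted counts for each word are summed.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // final output stored on HDFS
        }
    }
}
```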
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs
across the cluster, and also to schedule each map task on a Task Tracker running on the
same data node, since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Trackers can be considered the workers that act on the
instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes
available in the cluster and executes the Map and Reduce tasks as instructed by the
Job Tracker.
There is also one important component of the MapReduce architecture known as the Job
History Server. The Job History Server is a daemon process that saves and stores
historical information about tasks or applications; for example, the logs generated
during or after job execution are stored on the Job History Server.
Introduction to Batch Processing
A batch job can be completed without user intervention. For example, consider a telephone billing
application that reads phone call records from the enterprise information systems and generates a
monthly bill for each account. Since this application does not require any user interaction, it can run
as a batch job.

The phone billing application consists of two phases: The first phase associates each call from the
registry with a monthly bill, and the second phase calculates the tax and total amount due for each
bill. Each of these phases is a step of the batch job.

Batch applications specify a set of steps and their execution order. Different batch frameworks may
specify additional elements, like decision elements or groups of steps that run in parallel. The
following sections describe steps in more detail and provide information about other common
characteristics of batch frameworks.

55.1.1 Steps in Batch Jobs


A step is an independent and sequential phase of a batch job. Batch jobs contain chunk-oriented
steps and task-oriented steps.

 Chunk-oriented steps (chunk steps) process data by reading items from a data source, applying
some business logic to each item, and storing the results. Chunk steps read and process one item at
a time and group the results into a chunk. The results are stored when the chunk reaches a
configurable size. Chunk-oriented processing makes storing results more efficient and facilitates
transaction demarcation.

Chunk steps have three parts.

o The input retrieval part reads one item at a time from a data source, such as entries on a database,
files in a directory, or entries in a log file.

o The business processing part manipulates one item at a time using the business logic defined by the
application. Examples include filtering, formatting, and accessing data from the item for computing a
result.

o The output writing part stores a chunk of processed items at a time.

Chunk steps are often long-running because they process large amounts of data. Batch frameworks
enable chunk steps to bookmark their progress using checkpoints. A chunk step that is interrupted
can be restarted from the last checkpoint. The input retrieval and output writing parts of a chunk step
save their current position after the processing of each chunk, and can recover it when the step is
restarted.

Figure 55-1 shows the three parts of two steps in a batch job.

Figure 55-1 Chunk Steps in a Batch Job



 Task-oriented steps (task steps) execute tasks other than processing items from a data source.
Examples include creating or removing directories, moving files, creating or dropping database tables,
configuring resources, and so on. Task steps are not usually long-running compared to chunk steps.

For example, the phone billing application consists of two chunk steps.

 In the first step, the input retrieval part reads call records from the registry; the business processing
part associates each call with a bill and creates a bill if one does not exist for an account; and the
output writing part stores each bill in a database.

 In the second step, the input retrieval part reads bills from the database; the business processing part
calculates the tax and total amount due for each bill; and the output writing part updates the database
records and generates printable versions of each bill.

This application could also contain a task step that cleaned up the files from the bills generated for the
previous month.
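As a sketch, the business processing part of the first billing step could be written as a JSR 352 (Java Batch) ItemProcessor. The CallRecord and Bill types below are simplified placeholders for illustration, not part of any framework API.

```java
import java.util.HashMap;
import java.util.Map;
import javax.batch.api.chunk.ItemProcessor;
import javax.inject.Named;

// Business-processing part of the first chunk step: associate each call with a bill.
@Named
public class CallBillingProcessor implements ItemProcessor {

    // Simplified placeholder types standing in for the application's domain classes.
    public static class CallRecord {
        String accountId;
        long durationSeconds;
    }

    public static class Bill {
        String accountId;
        long totalSeconds;
    }

    // One bill per account; created lazily if it does not exist yet.
    private final Map<String, Bill> billsByAccount = new HashMap<>();

    @Override
    public Object processItem(Object item) throws Exception {
        CallRecord call = (CallRecord) item;          // one item from the input retrieval part
        Bill bill = billsByAccount.computeIfAbsent(call.accountId, id -> {
            Bill b = new Bill();
            b.accountId = id;
            return b;
        });
        bill.totalSeconds += call.durationSeconds;    // business logic applied to the item
        return bill;                                  // handed to the output writing part
    }
}
```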

55.1.2 Parallel Processing


Batch jobs often process large amounts of data or perform computationally expensive operations.
Batch applications can benefit from parallel processing in two scenarios.

 Steps that do not depend on each other can run on different threads.

 Chunk-oriented steps where the processing of each item does not depend on the results of
processing previous items can run on more than one thread.

Batch frameworks provide mechanisms for developers to define groups of independent steps and to
split chunk-oriented steps in parts that can run in parallel.

55.1.3 Status and Decision Elements


Batch frameworks keep track of a status for every step in a job. The status indicates if a step is
running or if it has completed. If the step has completed, the status indicates one of the following.

 The execution of the step was successful.

 The step was interrupted.

 An error occurred in the execution of the step.

In addition to steps, batch jobs can also contain decision elements. Decision elements use the exit
status of the previous step to determine the next step or to terminate the batch job. Decision elements
set the status of the batch job when terminating it. Like a step, a batch job can terminate successfully,
be interrupted, or fail.
Figure 55-2 shows an example of a job that contains chunk steps, task steps and a decision element.

Figure 55-2 Steps and Decision Elements in a Job

55.1.4 Batch Framework Functionality


Batch applications have the following common requirements.

 Define jobs, steps, decision elements, and the relationships between them.

 Execute some groups of steps or parts of a step in parallel.

 Maintain state information for jobs and steps.

 Launch jobs and resume interrupted jobs.

 Handle errors.

Batch frameworks provide the batch execution infrastructure that addresses the common
requirements of all batch applications, enabling developers to concentrate on the business logic of
their applications. Batch frameworks consist of a format to specify jobs and steps, an application
programming interface (API), and a service available at runtime that manages the execution of batch
jobs.

Application Of MapReduce
E-commerce: MapReduce is used to build item recommendation mechanisms for
e-commerce inventories by examining website records, purchase history, user
interaction logs, etc.
Data warehouse: MapReduce can be used to analyze large data volumes in data
warehouses while implementing specific business logic for data insights.
