Cloud Computing Unit 3
A cloud database is a database service built and accessed through a cloud platform.
It serves many of the same functions as a traditional database with the added
flexibility of cloud computing. Users install software on a cloud infrastructure to
implement the database.
Key features:
A database service built and accessed through a cloud platform
Enables enterprise users to host databases without buying dedicated hardware
Can be managed by the user or offered as a service and managed by a provider
Can support relational databases (including MySQL and PostgreSQL) and NoSQL
databases (including MongoDB and Apache CouchDB)
Accessed through a web interface or vendor-provided API
Why cloud databases
Ease of access
Users can access cloud databases from virtually anywhere, using a vendor’s API or
web interface.
Scalability
Cloud databases can expand their storage capacity at run time to accommodate
changing needs, and organizations pay only for what they use.
Disaster recovery
In the event of a natural disaster, equipment failure or power outage, data is kept
secure through backups on remote servers.
Nowadays, data is increasingly stored in the cloud, a virtual environment that may be
a hybrid, public, or private cloud. A cloud database is a database that has been
optimized or built for such a virtualized environment.
A cloud database offers several benefits, including the ability to pay for storage
capacity and bandwidth on a per-user basis, scalability on demand, and high
availability.
A cloud database also gives enterprises the opportunity to support business
applications in a software-as-a-service deployment.
Relational cloud databases are organized as a set of tables in which data fits into
predefined categories. Each table consists of rows and columns: a column holds the
entries for one category of data, and each row is one instance of data described by
those categories. The Structured Query Language (SQL) is the standard user and
application program interface for a relational database.
Simple operations can be applied to these tables, which makes such databases easy
to extend and to join on a common relation, and new categories can be added
without modifying all existing applications.
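To make these operations concrete, here is a minimal sketch that joins two tables over JDBC; the connection URL, credentials, and the customers/orders tables with their customer_id column are hypothetical, and a JDBC driver for the chosen database must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; any SQL database reachable over JDBC works the same way.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "user", "password");
             Statement stmt = conn.createStatement();
             // Join two tables on a common relation (customer_id).
             ResultSet rs = stmt.executeQuery(
                 "SELECT c.name, o.total FROM customers c " +
                 "JOIN orders o ON o.customer_id = c.customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getDouble("total"));
            }
        }
    }
}
```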
HDFS is the storage unit of Hadoop, used to store and process huge volumes of data
across multiple DataNodes. It is designed to run on low-cost hardware and provides
high fault tolerance and high throughput.
Each large file is broken down into small blocks of data. HDFS uses a default block size
of 128 MB, which can be increased as required. Multiple copies of each block are stored
in the cluster in a distributed manner on different nodes.
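As a rough illustration, the sketch below writes a file to HDFS through the Hadoop FileSystem API with an explicit 128 MB block size and a replication factor of 3; the NameNode URI and file path are placeholders, and in practice these values usually come from the cluster configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS in practice.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/data/sample.txt");
        // create(path, overwrite, bufferSize, replication, blockSize):
        // 3 replicas, 128 MB blocks (the HDFS default, shown here explicitly).
        try (FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```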
As the number of internet users grew in the early 2000s, Google faced the problem of
storing its ever-increasing user data on traditional data servers while handling
thousands of search queries per second. It needed a large, distributed, highly
fault-tolerant file system to store and process this data. The solution was the Google
File System (GFS).
GFS allowed Google to store and process huge volumes of data in a distributed manner;
HDFS follows the same design.
Differences between HBase and Cloud Bigtable
One way to access Cloud Bigtable is to use a customized version of the Apache
HBase client for Java. In general, the customized client exposes the same API as a
standard installation of HBase. This section describes the differences between the
Bigtable HBase client for Java and a standard HBase installation. Many of these
differences are related to management tasks that Bigtable handles automatically.
Column families
When you create a column family, you cannot configure the block size or
compression method, either with the HBase shell or through the HBase API. Bigtable
manages the block size and compression for you.
In addition, if you use the HBase shell to get information about a table, the HBase
shell will always report that each column family does not use compression. In reality,
Bigtable uses proprietary compression methods for all of your data.
Bigtable requires that column family names follow the regular expression
[_a-zA-Z0-9][-_.a-zA-Z0-9]*. If you are importing data into Bigtable from HBase, you
might need to first change the family names to follow this pattern.
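A minimal sketch of creating a table with one column family through the Bigtable HBase client, assuming placeholder project, instance, table, and family names; note that no block size or compression is specified, since Bigtable manages both.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CreateFamilyExample {
    public static void main(String[] args) throws Exception {
        // Placeholder project and instance IDs.
        try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
             Admin admin = connection.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("calls"));
            // Family name matches [_a-zA-Z0-9][-_.a-zA-Z0-9]*; no block size or
            // compression settings -- Bigtable handles those automatically.
            table.addFamily(new HColumnDescriptor("cf1"));
            admin.createTable(table);
        }
    }
}
```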
Tags are not supported. You cannot use the class org.apache.hadoop.hbase.Tag to add metadata
to individual cells.
Deleting a specific version of a specific column based on its timestamp is supported, but
deleting all values with a specific timestamp in a given column family or row is not
supported. The following methods in the class org.apache.hadoop.hbase.client.Delete are not
supported:
new Delete(byte[] row, long timestamp)
In HBase, deletes mask puts, but Bigtable does not mask puts after deletes: if a put
request is sent after a delete request, the put is not masked. This means that in Bigtable,
a write request sent to a cell is not affected by a previously sent delete request for the same cell.
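The sketch below contrasts the two Delete forms discussed above; the table handle, row key, column family, qualifier, and timestamp are placeholders.

```java
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    // Deletes one specific version of one column -- supported by Bigtable.
    static void deleteOneVersion(Table table) throws Exception {
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        delete.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("qual"), 1700000000000L);
        table.delete(delete);
    }

    // new Delete(row, timestamp) deletes everything in the row at or before that
    // timestamp -- this constructor is NOT supported by the Bigtable HBase client.
    static void deleteRowAtTimestamp(Table table) throws Exception {
        Delete delete = new Delete(Bytes.toBytes("row-1"), 1700000000000L);
        table.delete(delete);
    }
}
```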
Querying versions of column families within a timestamp range is not supported. You cannot
call the following methods:
org.apache.hadoop.hbase.client.Query#setColumnFamilyTimeRange(byte[] cf, long minStamp, long
maxStamp)
Limiting the number of values per row per column family is not supported. You cannot call
the method org.apache.hadoop.hbase.client.Scan#setMaxResultsPerColumnFamily(int limit).
Setting the maximum number of cells to return for each call to next() is not supported. Calls
to the method org.apache.hadoop.hbase.client.Scan#setBatch(int batch) are ignored.
Setting the number of rows for caching is not supported. Calls to the
method org.apache.hadoop.hbase.client.Scan#setCaching(int caching) are ignored.
Coprocessors
Coprocessors are not supported. You cannot create classes that implement the
interface org.apache.hadoop.hbase.coprocessor.
Filters
The filters listed below are all in the package org.apache.hadoop.hbase.filter; they fall
into three groups according to their current level of support.
Supported: ColumnPrefixFilter, FamilyFilter, FilterList, FuzzyRowFilter, MultipleColumnPrefixFilter, MultiRowRangeFilter, PrefixFilter (6), RandomRowFilter, TimestampsFilter
Supported, with limitations: ColumnCountGetFilter (1), ColumnPaginationFilter (1), ColumnRangeFilter (1), FirstKeyOnlyFilter (1), KeyOnlyFilter (2), PageFilter (5), QualifierFilter (3), RowFilter (1, 4), SingleColumnValueExcludeFilter (1, 4, 7), SingleColumnValueFilter (4, 7), ValueFilter (4)
Not supported: DependentColumnFilter, FirstKeyValueMatchingQualifiersFilter, InclusiveStopFilter, ParseFilter, SkipFilter, WhileMatchFilter
Notes on the limited filters:
3. Supports only the BinaryComparator comparator. If any operator other than EQUAL is used, only a single column family is supported.
4. Supports only the BinaryComparator comparator and the RegexStringComparator with no flags (flags are ignored), with the EQUAL operator.
5. If a PageFilter is in a FilterList, the PageFilter will only work similarly to HBase when the FilterList is set to MUST_PASS_ALL, which is the default behavior. If the FilterList is set to MUST_PASS_ONE, Cloud Bigtable treats the PageFilter as MUST_PASS_ALL and returns only a number of rows corresponding to the PageFilter's pageSize.
6. PrefixFilter scans for rows in the PrefixFilter in most cases. However, if PrefixFilter is part of a FilterList and has the operator MUST_PASS_ONE, Bigtable cannot determine the implied range and instead performs an unfiltered scan from the start row to the stop row. Use PrefixFilter with BigtableExtendedScan or a combination of filters to optimize performance in this case.
7. Relies on the Bigtable condition filter, which can be slow. Supported but not recommended.
Custom filters are not supported. You cannot create classes that inherit
from org.apache.hadoop.hbase.filter.Filter.
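As an example of using one of the supported filters, the sketch below runs a Scan with a PrefixFilter through a standard HBase Connection; the table name and prefix are placeholders.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanExample {
    // Scans all rows whose keys start with the given prefix, using a supported filter.
    static void scanByPrefix(Connection connection, String prefix) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("calls"))) {
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}
```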
Timestamps
Bigtable stores timestamps in microseconds, while HBase stores timestamps in
milliseconds. This distinction has implications when you use the HBase client library
for Bigtable and you have data with reversed timestamps.
The client library converts between microseconds and milliseconds, but because
the largest HBase timestamp that Bigtable can store is Long.MAX_VALUE/1000,
any value larger than that is converted to Long.MAX_VALUE/1000. As a result, large
reversed timestamp values might not convert correctly.
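To make the limit concrete, this small snippet computes the largest HBase (millisecond) timestamp that Bigtable can store and checks whether a typical reversed timestamp exceeds it.

```java
public class TimestampLimitExample {
    public static void main(String[] args) {
        // Bigtable stores microseconds in a signed 64-bit value, so the largest
        // HBase millisecond timestamp it can represent is Long.MAX_VALUE / 1000.
        long maxHBaseMillis = Long.MAX_VALUE / 1000;   // 9223372036854775 ms
        System.out.println("Largest storable HBase timestamp: " + maxHBaseMillis);

        // A reversed timestamp such as Long.MAX_VALUE - System.currentTimeMillis()
        // is far larger than this limit, so it would be clamped on conversion.
        long reversed = Long.MAX_VALUE - System.currentTimeMillis();
        System.out.println("Reversed timestamp " + reversed
                + " exceeds the limit: " + (reversed > maxHBaseMillis));
    }
}
```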
Administration
This section describes methods in the interface org.apache.hadoop.hbase.client.Admin that
are not available on Bigtable, or that behave differently on Bigtable than on HBase.
These lists are not exhaustive, and they might not reflect the most recently added
HBase API methods.
balancer()
enableCatalogJanitor(boolean enable)
getMasterInfoPort()
getOperationTimeout()
isCatalogJanitorEnabled()
rollWALWriter(ServerName serverName)
runCatalogScan()
shutdown()
stopMaster()
updateConfiguration()
updateConfiguration(ServerName serverName)
Locality groups
Bigtable does not allow you to specify locality groups for column families. As a
result, you cannot call HBase methods that return a locality group.
Namespaces
Bigtable does not use namespaces. You can use row key prefixes to simulate
namespaces. The following methods are not available:
createNamespace(NamespaceDescriptor descriptor)
deleteNamespace(String name)
getNamespaceDescriptor(String name)
listNamespaceDescriptors()
listTableDescriptorsByNamespace(String name)
listTableNamesByNamespace(String name)
modifyNamespace(NamespaceDescriptor descriptor)
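A small sketch of simulating a namespace with a row key prefix, as suggested above; the shared table name, the billing# prefix, and the column names are placeholders.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespacePrefixExample {
    // Writes a row under a "billing#" key prefix instead of an HBase namespace.
    static void putInLogicalNamespace(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("shared-table"))) {
            Put put = new Put(Bytes.toBytes("billing#account-42"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes("19.99"));
            table.put(put);
        }
    }
}
```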
Region management
Bigtable uses tablets, which are similar to regions. Bigtable manages your tablets
automatically. As a result, the following methods are not available:
assign(byte[] regionName)
compactRegion(byte[] regionName)
flushRegion(byte[] regionName)
getAlterStatus(byte[] tableName)
getAlterStatus(TableName tableName)
getCompactionStateForRegion(byte[] regionName)
getOnlineRegions(ServerName sn)
majorCompactRegion(byte[] regionName)
offline(byte[] regionName)
splitRegion(byte[] regionName)
stopRegionServer(String hostnamePort)
Snapshots
The following snapshot-related methods are also not available:
deleteSnapshots(Pattern pattern)
deleteSnapshots(String regex)
isSnapshotFinished(HBaseProtos.SnapshotDescription snapshot)
restoreSnapshot(byte[] snapshotName)
restoreSnapshot(String snapshotName)
snapshot(HBaseProtos.SnapshotDescription snapshot)
Table management
Tasks such as table compaction are handled automatically. As a result, the following
methods are not available:
compact(TableName tableName)
flush(TableName tableName)
getCompactionState(TableName tableName)
majorCompact(TableName tableName)
split(TableName tableName)
Coprocessors
Bigtable does not support coprocessors. As a result, the following methods are not
available:
coprocessorService()
coprocessorService(ServerName serverName)
getMasterCoprocessors()
Distributed procedures
Bigtable does not support distributed procedures, so the related methods in the
Admin interface (such as execProcedure and isProcedureFinished) are not available.
Some similarities:
1. Both are NoSQL databases: neither supports joins, transactions, typed columns, and so on.
2. Both can handle significant amounts of data, at petabyte scale. This is achieved through
support for linear horizontal scaling.
3. Both emphasize high availability, through replication and versioning.
4. Both are schema-free: you can create a table and add column families or columns later.
5. Both have APIs for the most popular languages: Java, Python, C#, C++. The complete lists
of supported languages differ slightly.
6. Both support the Apache HBase Java API: after Apache HBase's success, Google added
support for an HBase-like API for Bigtable, with some limitations (see the API differences above).
Some differences:
DynamoDB vs RDBMS
The table below summarizes the core differences between a conventional
relational database management system and AWS DynamoDB:
Source connection
DynamoDB: uses HTTP requests and API operations.
RDBMS: uses a persistent connection and SQL commands.
Reading table data
DynamoDB: uses GetItem, Query, and Scan.
RDBMS: uses SELECT statements with filtering conditions.
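As a sketch of the request/response access model shown in the table, the following reads a single item with GetItem using the AWS SDK for Java v2; the table name, key attribute, and region are placeholders.

```java
import java.util.Map;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

public class DynamoGetItemExample {
    public static void main(String[] args) {
        // Every operation is an HTTPS request/response -- no persistent connection or SQL.
        try (DynamoDbClient dynamo = DynamoDbClient.builder().region(Region.US_EAST_1).build()) {
            GetItemRequest request = GetItemRequest.builder()
                    .tableName("Accounts")   // placeholder table name
                    .key(Map.of("AccountId", AttributeValue.builder().s("42").build()))
                    .build();
            GetItemResponse response = dynamo.getItem(request);
            System.out.println(response.item());
        }
    }
}
```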
Advantages of DynamoDB:
The main advantages of opting for DynamoDB are listed below:
It has fast and predictable performance.
It is highly scalable.
It offloads the administrative burden of operating and scaling.
It offers encryption at rest for data protection.
Its scalability is highly flexible.
The AWS Management Console can be used to monitor resource utilization
and performance metrics.
It provides on-demand backups.
It enables point-in-time recovery for your Amazon DynamoDB tables.
Point-in-time recovery helps protect your tables from accidental write or
delete operations and lets you restore a table to any point in time during
the last 35 days.
It can be highly automated.
Limitations of DynamoDB –
MapReduce Architecture
Components of the MapReduce architecture:
1. Client: The MapReduce client is the one that submits the job to MapReduce for
processing. There can be multiple clients that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants done; it is
composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides a particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results
of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, a client submits a job of a particular size to the Hadoop MapReduce
Master. The MapReduce Master divides this job into equivalent job-parts, which are
then made available to the Map and Reduce tasks. These Map and Reduce tasks
contain the program written by the developer for the use case the particular
company is solving. The input data is fed to the Map task, and the Map generates
intermediate key-value pairs as its output. These key-value pairs are then fed to the
Reducer, and the final output is stored on HDFS. Any number of Map and Reduce
tasks can be made available for processing the data, as required. The Map and
Reduce logic is written in an optimized way so that time and space complexity are
kept to a minimum.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs.
The input to the Map may be a key-value pair where the key is the ID of some kind
of address and the value is the actual content it holds. The Map() function is
executed in its memory repository on each of these input key-value pairs and
generates intermediate key-value pairs, which serve as input for the Reducer
or Reduce() function.
2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the
data based on its key-value pairs according to the reducer algorithm written by the
developer, as in the word-count sketch below.
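The canonical word-count example below sketches both phases: the Mapper emits intermediate (word, 1) pairs and the Reducer sums them per word; the driver class and job configuration are omitted for brevity.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emits (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce phase: sums the counts grouped by word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```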
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The Job Tracker manages all the resources and all the jobs across the
cluster, and schedules each Map task on a Task Tracker running on the same data
node, since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers are the workers that act on the instructions given
by the Job Tracker. A Task Tracker is deployed on each node of the cluster and
executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one important component of the MapReduce architecture known as the
Job History Server. The Job History Server is a daemon process that saves and stores
historical information about tasks and applications, such as the logs generated during
or after job execution.
Introduction to Batch Processing
A batch job can be completed without user intervention. For example, consider a telephone billing
application that reads phone call records from the enterprise information systems and generates a
monthly bill for each account. Since this application does not require any user interaction, it can run
as a batch job.
The phone billing application consists of two phases: The first phase associates each call from the
registry with a monthly bill, and the second phase calculates the tax and total amount due for each
bill. Each of these phases is a step of the batch job.
Batch applications specify a set of steps and their execution order. Different batch frameworks may
specify additional elements, like decision elements or groups of steps that run in parallel. The
following sections describe steps in more detail and provide information about other common
characteristics of batch frameworks.
Chunk-oriented steps (chunk steps) process data by reading items from a data source, applying
some business logic to each item, and storing the results. Chunk steps read and process one item at
a time and group the results into a chunk. The results are stored when the chunk reaches a
configurable size. Chunk-oriented processing makes storing results more efficient and facilitates
transaction demarcation.
Chunk steps have three parts:
o The input retrieval part reads one item at a time from a data source, such as entries in a database,
files in a directory, or entries in a log file.
o The business processing part manipulates one item at a time using the business logic defined by the
application. Examples include filtering, formatting, and accessing data from the item to compute a
result.
o The output writing part stores a chunk of processed items at a time.
Chunk steps are often long-running because they process large amounts of data. Batch frameworks
enable chunk steps to bookmark their progress using checkpoints. A chunk step that is interrupted
can be restarted from the last checkpoint. The input retrieval and output writing parts of a chunk step
save their current position after the processing of each chunk, and can recover it when the step is
restarted.
Figure 55-1 shows the three parts of two steps in a batch job.
Task-oriented steps (task steps) execute tasks other than processing items from a data source.
Examples include creating or removing directories, moving files, creating or dropping database tables,
configuring resources, and so on. Task steps are not usually long-running compared to chunk steps.
For example, the phone billing application consists of two chunk steps.
In the first step, the input retrieval part reads call records from the registry; the business processing
part associates each call with a bill and creates a bill if one does not exist for an account; and the
output writing part stores each bill in a database.
In the second step, the input retrieval part reads bills from the database; the business processing part
calculates the tax and total amount due for each bill; and the output writing part updates the database
records and generates printable versions of each bill.
This application could also contain a task step that cleaned up the files from the bills generated for the
previous month.
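As a sketch of the business-processing part of the second billing step, here is a minimal Jakarta Batch (JSR-352) ItemProcessor; the PhoneBill class, its fields, and the flat 10% tax rate are assumptions made up for illustration.

```java
import javax.batch.api.chunk.ItemProcessor;
import javax.inject.Named;

// Hypothetical bill record used only for this sketch.
class PhoneBill {
    double subtotal;
    double tax;
    double total;
}

// Business-processing part of a chunk step: computes tax and total for each bill.
@Named
public class BillTaxProcessor implements ItemProcessor {
    private static final double TAX_RATE = 0.10;   // assumed flat rate for illustration

    @Override
    public Object processItem(Object item) {
        PhoneBill bill = (PhoneBill) item;
        bill.tax = bill.subtotal * TAX_RATE;
        bill.total = bill.subtotal + bill.tax;
        return bill;   // handed to the output-writing part (ItemWriter) in chunks
    }
}
```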
Steps that do not depend on each other can run on different threads.
Chunk-oriented steps where the processing of each item does not depend on the results of
processing previous items can run on more than one thread.
Batch frameworks provide mechanisms for developers to define groups of independent steps and to
split chunk-oriented steps into parts that can run in parallel.
In addition to steps, batch jobs can also contain decision elements. Decision elements use the exit
status of the previous step to determine the next step or to terminate the batch job. Decision elements
set the status of the batch job when terminating it. Like a step, a batch job can terminate successfully,
be interrupted, or fail.
Figure 55-2 shows an example of a job that contains chunk steps, task steps and a decision element.
Batch applications have some common requirements; among other things, they must:
Define jobs, steps, decision elements, and the relationships between them.
Handle errors.
Batch frameworks provide the batch execution infrastructure that addresses the common
requirements of all batch applications, enabling developers to concentrate on the business logic of
their applications. Batch frameworks consist of a format to specify jobs and steps, an application
programming interface (API), and a service available at runtime that manages the execution of batch
jobs.
Application Of MapReduce
E-commerce: MapReduce is used to build product recommendation mechanisms for
e-commerce inventories by examining website records, purchase history, user
interaction logs, and so on.
Data Warehouse: MapReduce can be used to analyze large data volumes in data
warehouses while implementing specific business logic for data insights.