Unit 4 BDTT
APACHE OOZIE
Definition
Apache Oozie is an open-source workflow scheduling tool that helps manage and organize data processing tasks across Hadoop-based infrastructure.
Users can create, schedule, and control workflows that contain a coordinated series of Hadoop jobs, Pig scripts, Hive queries, and other operations.
Oozie handles task dependencies, manages retry mechanisms, and supports a variety of workflow types, from simple to sophisticated processes.
Evolution of Oozie
Yahoo initially created Apache Oozie in 2008 as an internal tool for managing Hadoop operations. In 2011 it was released as an open-source project under the Apache Software Foundation.
Oozie is a critical Hadoop ecosystem component for managing and scheduling massive data processing workloads and is frequently used in production settings. Its community has expanded, with developers contributing to its continual development and advancement.
The Oozie Workflow Manager and the Oozie Coordinator are the two main workflow management components of Apache Oozie.
The Oozie Workflow Manager manages and executes workflows. A workflow is defined in a workflow definition language (WDL) that outlines the order in which actions must be carried out, the input and output data required by each action, and how errors are handled. The Workflow Manager parses the WDL and carries out the actions in the defined order.
The Oozie Coordinator schedules workflows. A coordinator definition specifies when workflows should run, the data input for each instance of the workflow, and the dependencies between instances of the workflow. The Coordinator operates periodically, triggering workflow instances once their time and data dependencies are satisfied.
Oozie allows users to create, organize, and carry out collections of tasks or actions, ensuring that activities are executed in the proper order and that their dependencies are respected.
Oozie is highly scalable and designed for large-scale data processing tasks in Hadoop clusters.
Oozie offers a web-based graphical user interface and a RESTful API for submitting, monitoring, and managing workflows.
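As an illustration of the RESTful API mentioned above, the sketch below submits and starts a workflow through Oozie's web-services endpoint from Python. It is a minimal sketch only: the host/port, the user name, and the HDFS application path are placeholder assumptions, not values from this text.

import requests

# Placeholder values; adjust for your own cluster.
OOZIE_URL = "http://localhost:11000/oozie/v1/jobs"
APP_PATH = "hdfs://namenode:8020/user/alice/my-workflow"

# Oozie expects the job properties as an XML <configuration> document.
config = (
    "<configuration>"
    "<property><name>user.name</name><value>alice</value></property>"
    "<property><name>oozie.wf.application.path</name>"
    "<value>" + APP_PATH + "</value></property>"
    "</configuration>"
)

# action=start submits the workflow and starts it immediately;
# the JSON response contains the id of the new workflow job.
resp = requests.post(OOZIE_URL, params={"action": "start"}, data=config,
                     headers={"Content-Type": "application/xml;charset=UTF-8"})
print(resp.json())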
APACHE SPARK
Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.
The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in February 2014.
Spark can be built with Hadoop components in three ways: as a standalone deployment, on Hadoop YARN, or as Spark in MapReduce (SIMR).
Components of Spark
Apache Spark Core
Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.
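In current Spark releases the SchemaRDD abstraction has evolved into the DataFrame API. The minimal PySpark sketch below shows the idea; the file name people.json is an assumed example input, not something from this text.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read semi-structured JSON into a DataFrame (the successor of SchemaRDD).
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")          # expose it to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()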
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is roughly nine times as fast as the Hadoop disk-based version of Apache Mahout.
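To make the ALS reference concrete, here is a minimal PySpark sketch of training an ALS recommender; the tiny in-line ratings and the column names are made-up illustrations, not benchmark data.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

# A few toy (user, item, rating) rows purely for illustration.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5)
model = als.fit(ratings)
model.transform(ratings).show()   # predictions next to the original ratings

spark.stop()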
GraphX
It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
LIMITATIONS OF HADOOP
1. Issue with small files
Solution: Merge the small files to create bigger files and then copy the bigger files to HDFS. Sequence files work very well in practice to overcome the 'small file problem'; we use the filename as the key and the file contents as the value. Storing files in HBase is another very common design pattern to overcome the small-file problem with HDFS.
2. Slow processing speed
Solution: In-memory processing is faster, as no time is spent moving the data/processes in and out of the disk. Spark is up to 100 times faster than MapReduce because it processes everything in memory. Flink processes even faster than Spark because of its streaming architecture, and Flink is instructed to process only the parts of the data that have actually changed.
3. Latency
Solution: Apache Spark is yet another batch system, but it is relatively faster since it caches much of the input data in memory using RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory itself.
4. Security
Solution: Spark provides a security bonus to overcome these limitations of Hadoop. If we run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
5. No real-time data processing
Solution: Apache Spark supports stream processing, which involves continuous input and output of data. Apache Flink provides a single run-time for streaming as well as batch processing, so one common run-time is used for both data streaming applications and batch processing applications.
6. Support for batch processing only
Solution: Flink improves overall performance as it provides a single run-time for streaming as well as batch processing. Flink uses native closed-loop iteration operators, which make machine learning and graph processing faster.
7. Uncertainty
Hadoop only ensures that the data job is complete, but it is unable to guarantee when the job will be complete.
8. Lengthy lines of code
Solution: Spark and Flink are written in Scala and Java, but the implementation is in Scala, so the number of lines of code is smaller than in Hadoop. It therefore also takes less time to execute a program, overcoming Hadoop's lengthy-lines-of-code limitation.
9. No caching
Solution: Spark and Flink overcome this limitation of Hadoop, as both cache data in memory for further iterations, which enhances overall performance (see the caching sketch after this list).
10. Not easy to use
Solution: Spark has an interactive mode, so developers and users alike can get intermediate feedback for queries and other activities. Spark is easy to program as it has tons of high-level operators.
11. No delta iteration
Solution: Spark iterates its data in batches; for iterative processing in Spark, we schedule and execute each iteration separately.
12. Vulnerable by nature
Hadoop is entirely written in Java, one of the most widely used languages; Java has therefore been heavily exploited by cyber criminals and, as a result, implicated in numerous security breaches.
13. No abstraction
Solution: Spark provides the RDD abstraction for batch processing, while Flink has the DataSet abstraction.
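The caching point in item 9 can be illustrated with a short PySpark sketch; the file name events.csv and the status column are assumed examples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingExample").getOrCreate()

events = spark.read.csv("events.csv", header=True)
events.cache()                                   # keep the data in memory after first use

print(events.count())                            # first action reads from disk and fills the cache
print(events.filter("status = 'ok'").count())    # later actions reuse the cached data

spark.stop()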
APACHE FLINK
Apache Flink is a real-time processing framework which can process streaming data.
It has a true streaming model and does not take input data as batches or micro-batches.
Apache Flink was founded by the company Data Artisans and is now developed under the Apache License by the Apache Flink community.
Storage
Apache Flink has multiple options from which it can read/write data, including HDFS, the local file system, S3, relational databases, MongoDB, HBase, Apache Kafka, and Apache Flume.
Deploy
You can deploy Apache Flink in local mode, cluster mode, or on the cloud. Cluster mode can be standalone, YARN, or Mesos.
Kernel
This is the runtime layer, which provides distributed processing, fault tolerance,
reliability, native iterative processing capability and more.
APIs & Libraries
This is the top layer and the most important layer of Apache Flink.
It has the DataSet API, which takes care of batch processing, and the DataStream API, which takes care of stream processing.
There are other libraries like FlinkML (for machine learning), Gelly (for graph processing), and Table for SQL.
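The Scala/Java DataSet and DataStream APIs above also have a Python counterpart (PyFlink) in recent Flink releases; the following is a minimal DataStream sketch under that assumption.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A small bounded collection used as a source, mapped to (word, 1) pairs.
words = env.from_collection(["flink", "spark", "flink"])
pairs = words.map(lambda w: (w, 1))
pairs.print()                     # writes each record to the task output

env.execute("word-pairs")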
To set up a local Flink cluster, download the binary release, extract it, and start the cluster:
wget http://mirrors.estointernet.in/apache/flink/flink-1.7.1/flink-1.7.1-bin-scala_2.11.tgz
tar -xzf flink-1.7.1-bin-scala_2.11.tgz
cd flink-1.7.1/
./bin/start-cluster.sh
Open a web browser and go to the URL below; it will open the Flink Web Dashboard.
http://localhost:8081
This is how the user interface of the Apache Flink Dashboard looks.
BATCH ANALYTICS USING FLINK
Apache Flink’s unified approach to stream and batch processing means that a DataStream
application executed over bounded input will produce the same final results regardless of the
configured execution mode.
It is important to note what final means here: a job executing in STREAMING mode might
produce incremental updates (think upserts in a database) while a BATCH job would only
produce one final result at the end. The final result will be the same if interpreted correctly but
the way to get there can be different.
As a rule of thumb, you should be using BATCH execution mode when your program is
bounded because this will be more efficient.
You have to use STREAMING execution mode when your program is unbounded because
only this mode is general enough to be able to deal with continuous data streams.
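In recent Flink releases the execution mode can be chosen when submitting the job (for example on the command line) or, as in the PyFlink sketch below, directly on the environment; the sketch assumes a PyFlink version that exposes RuntimeExecutionMode.

from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.BATCH)   # or STREAMING / AUTOMATIC

# A bounded source, so BATCH execution is a valid and more efficient choice.
data = env.from_collection([1, 2, 3, 4])
data.map(lambda x: x * 2).print()

env.execute("bounded-job")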
In BATCH execution mode, the tasks of a job can be separated into stages that
can be executed one after another.
We can do this because the input is bounded and Flink can therefore fully process
one stage of the pipeline before moving on to the next.
For example, a job whose tasks are separated by two shuffle barriers would have three stages that are executed one after another.
Because Flink materializes the intermediate results at these stage boundaries, this brings several benefits. For one, it allows Flink to backtrack to the latest available results when a failure happens instead of restarting the whole job.
Another side effect is that BATCH jobs can execute on fewer resources (in terms
of available slots at TaskManagers) because the system can execute tasks
sequentially one after the other.
Task Managers will keep intermediate results at least as long as downstream tasks
have not consumed them.
After that, they will be kept for as long as space allows in order to allow the
aforementioned backtracking to earlier results in case of a failure.
BIG DATA MINING WITH NO SQL
Datasets that are difficult to store and analyze by any software database tool are
referred to as big data. Due to the growth of data, an issue arises that based on
recent fads in the IT region, how the data will be effectively processed. A
requirement for ideas, techniques, tools, and technologies is been set for handling
and transforming a lot of data into business value and knowledge. The major
features of NoSQL solutions are stated below that help us to handle a large amount
of data.
1. The queries should be moved to the data rather than moving the data to the queries:
When a client needs to send a general query to all nodes holding data, it is more efficient to send the query to every node than to move a large set of data to a central processor. This simple rule helps explain why NoSQL databases have dramatic performance advantages over systems that were not designed to distribute queries to nodes. The data is kept inside each node in document form, which means only the query and the result need to move over the network, keeping big data queries fast.
2. Hash rings should be used for the even distribution of data:
Figuring out a reliable way to assign a record to a processing node is perhaps the most difficult issue with distributed databases. With the help of a randomly generated 40-character key, the hash ring technique distributes a large amount of data evenly across numerous servers, and this is a good way to spread the network load uniformly.
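The hash ring idea can be sketched in a few lines of plain Python; this toy version hashes node names and document keys onto the same ring and is only an illustration of the principle, not any particular database's implementation.

import hashlib
from bisect import bisect

def ring_position(key):
    # 160-bit SHA-1 value used as a position on the ring
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)

def node_for(doc_key):
    # walk clockwise to the first node at or after the key's position
    positions = [p for p, _ in ring]
    idx = bisect(positions, ring_position(doc_key)) % len(ring)
    return ring[idx][1]

print(node_for("order-12345"))   # the node that should store this document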
3. For scaling read requests, replication should be used:
Databases use replication to make real-time backup copies of data. Read requests can be scaled horizontally with the help of replication. This replication strategy works well in most cases.
4. Distribution of queries to nodes should be done by the database:
Separating the concerns of query evaluation from query execution is important for getting higher performance from queries traversing many nodes. The NoSQL database moves the query to the data instead of moving the data to the query.
NOSQL DATABASES
A database is a collection of structured data or information which is stored in a
computer system and can be accessed easily. A database is usually managed by a
Database Management System (DBMS).
NoSQL is a non-relational database that is used to store data in non-tabular form. NoSQL stands for Not only SQL. The main types are document, key-value, wide-column, and graph databases.
Document-based databases
Key-value stores
Column-oriented databases
Graph-based databases
Document-Based Databases:
Document-based databases store data in documents similar to JSON objects; each document contains fields and values, and documents in the same collection can have different structures.
Key-Value Stores:
Key-value stores are the simplest type of NoSQL database; each item is stored as a key together with its value, and data is retrieved by looking up the key.
Graph-Based Databases:
Graph-based databases focus on the relationship between the elements. It stores the
data in the form of nodes in the database. The connections between the nodes are
called links or relationships.
Key features of graph databases:
In a graph-based database, it is easy to identify the relationships between data items by following the links.
Query output is available in real time.
The speed depends upon the number of relationships among the database elements.
Updating data is also easy, as adding a new node or edge to a graph database is a straightforward task that does not require significant schema changes.
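The node-and-link model can be illustrated with a tiny Python sketch; the nodes, relationship names, and traversal helper are made up for illustration and do not correspond to any particular graph database.

# Nodes with properties, and edges stored as (source, relationship, target) links.
nodes = {"alice": {"type": "person"},
         "bob": {"type": "person"},
         "mongodb": {"type": "database"}}
edges = [("alice", "KNOWS", "bob"), ("alice", "USES", "mongodb")]

def neighbours(node, relation):
    # follow links of one kind starting from a node
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbours("alice", "KNOWS"))   # -> ['bob']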
MONGODB
MongoDB, the most popular NoSQL database, is an open-source document-
oriented database.
It means that MongoDB isn’t based on the table-like relational database structure
but provides an altogether different mechanism for storage and retrieval of data.
For example, a record in MongoDB is a document that might look like this:
{
  title: 'Geeksforgeeks',
  url: 'https://www.geeksforgeeks.org',
  type: 'NoSQL'
}
In relational databases, by contrast, data is stored in a predefined data model (tables), which is not very flexible for today's real-world, fast-growing applications.
Modern applications are more networked, social and interactive than ever.
Applications are storing more and more data and are accessing it at higher rates.
If the database runs on a single server, then it will reach a scaling limit. NoSQL
databases are more scalable and provide superior performance.
MongoDB is such a NoSQL database that scales by adding more and more servers
and increases productivity with its flexible document model.
Where do we use MongoDB?
Installing MongoDB:
For Windows, a few options for 64-bit operating systems drop down.
If you are running Windows 7, 8 or newer versions, select Windows 64-bit 2008 R2+. If you are using Windows XP or Vista, select Windows 64-bit 2008 R2+ legacy.
BASIC QUERIES IN MONGODB
Creating/switching to a database:
Syntax
use DATABASE_NAME
Dropping a database:
Syntax
db.dropDatabase()
This will delete the selected database. If you have not selected any database, then it will delete the default 'test' database.
Creating a collection:
Syntax
db.createCollection(name, options)
Dropping a collection:
Syntax
db.COLLECTION_NAME.drop()
Inserting a document:
Syntax
db.COLLECTION_NAME.insert(document)
Querying documents in a collection:
Syntax
db.COLLECTION_NAME.find()
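The same operations can be issued from Python with the PyMongo driver; the sketch below assumes a MongoDB server on localhost:27017, and the database and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]                                   # use DATABASE_NAME

# insert(document)
db["sites"].insert_one({"title": "Geeksforgeeks",
                        "url": "https://www.geeksforgeeks.org",
                        "type": "NoSQL"})

# find()
for doc in db["sites"].find():
    print(doc)

db["sites"].drop()                                    # drop the collection
client.drop_database("mydb")                          # dropDatabase()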
APACHE CASSANDRA
Apache Cassandra was originally developed at Facebook; it was open-sourced in 2008 and became one of the top-level Apache projects in 2010.
Features of Cassandra:
1. It is scalable.
2. It is flexible (can accept structured , semi-structured and unstructured data).
3. It has limited transaction support: it offers atomicity, isolation, and durability at the row level and lightweight transactions, but not full ACID semantics.
4. It is highly available and fault tolerant.
5. It is open source.
Apache Cassandra is a highly scalable, distributed database designed around the trade-offs described by the CAP (Consistency, Availability, and Partition tolerance) theorem: it favors availability and partition tolerance while offering tunable consistency.
In Apache Cassandra, there is no master-client architecture. It has a peer-to-peer
architecture.
In Apache Cassandra, we can create multiple copies of data at the time of keyspace
creation.
Example:
CREATE KEYSPACE Example
WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': '3'};
In this example, we define the RF (Replication Factor) as 3, which simply means that we are creating 3 copies of the data across multiple nodes, placed clockwise around the ring.
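For completeness, the keyspace above can also be created from Python with the DataStax Cassandra driver; this is a minimal sketch that assumes a Cassandra node on localhost (and a release recent enough to accept the 'replication_factor' shorthand with NetworkTopologyStrategy), with the table and column names made up for illustration.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Same keyspace definition as in the example above (three replicas).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS example
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}
""")

session.execute("CREATE TABLE IF NOT EXISTS example.users (id int PRIMARY KEY, name text)")
session.execute("INSERT INTO example.users (id, name) VALUES (%s, %s)", (1, "alice"))

for row in session.execute("SELECT id, name FROM example.users"):
    print(row.id, row.name)

cluster.shutdown()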