Unit 4 BDTT

INTRODUCTION TO OOZIE

Definition
An open-source workflow scheduling tool, Apache Oozie helps handle and
organize data processing tasks across Hadoop-based infrastructure.
Users can create, schedule, and control workflows that contain a coordinated series
of Hadoop jobs, Pig scripts, Hive queries, and other operations.
Oozie can handle task dependencies, manage retry mechanisms, and support a
variety of workflow types, from simple to sophisticated processes.

Evolution of Oozie

Yahoo initially created Apache Oozie in 2008 as an internal tool for managing its
Hadoop jobs. In 2011, it was released as an open-source project under the Apache
Software Foundation.
Oozie is now a critical Hadoop ecosystem component for managing and scheduling
large-scale data processing jobs and is frequently used in production settings. Its
community has expanded, with developers contributing to its continued
development and improvement.

Main Components of Apache Oozie

The Oozie Workflow Manager and Oozie Coordinators are the two main
workflow management components of Apache Oozie.

 The Oozie Workflow Manager manages and executes workflows, i.e. sequences of actions that must be conducted in a specific order. Workflows are defined in the Workflow Definition Language (WDL), an Extensible Markup Language (XML)-based language. The WDL outlines the order in which activities must be carried out, the input and output data required by each action, and their interdependencies. In addition to managing dependencies between actions and handling errors, the Workflow Manager parses the WDL and carries out the steps in the predetermined order.

 Oozie Coordinators are responsible for organizing and overseeing recurring workflows. Coordinators are defined in the Coordinator Application Language (CAL), an XML-based language. A coordinator describes a schedule for running workflows, the data input for each instance of the workflow, and the dependencies between workflow instances. The Coordinator runs periodically and generates workflow instances according to the schedule and the supplied data.
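For illustration, a minimal workflow definition in WDL might look like the sketch below. This is a hedged example rather than a definitive one: the workflow name, action name, shell command and schema versions are illustrative assumptions, not taken from these notes.

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="first-action"/>
    <action name="first-action">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>hello</argument>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>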

Key Features of Oozie

 Oozie allows users to create, organize, and carry out collections of tasks or actions as workflows.

 Oozie supports the scheduling of recurring processes using coordinators, which let users provide a schedule for when workflows will execute.

 Oozie supports the management of dependencies between tasks and workflows, ensuring that activities are executed in the proper order and that workflows are completed correctly.

 Oozie is built on a modular, extensible architecture that enables users to customize and extend its features.

 Oozie is highly scalable and designed for large-scale data processing tasks in distributed computing environments.

 Oozie offers a web-based graphical user interface and a RESTful API for controlling and monitoring workflows and coordinators.


APACHE SPARK

Spark was introduced by the Apache Software Foundation to speed up Hadoop's
computational processing.

Spark uses Hadoop in two ways: one is storage and the second is processing. Since
Spark has its own cluster-management computation, it uses Hadoop for storage
purposes only.

Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation.

It is based on Hadoop MapReduce and extends the MapReduce model to use it
efficiently for more types of computations, including interactive queries
and stream processing.

The main feature of Spark is its in-memory cluster computing, which increases the
processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries and streaming.
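As a minimal, hedged illustration of Spark's in-memory model, the PySpark sketch below caches an RDD and reuses it across two actions; the input path is a placeholder, not something referenced in these notes.

from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryExample")

# Load a text file into an RDD and keep it in memory for reuse
lines = sc.textFile("hdfs:///data/input.txt").cache()

# First action: count the lines (triggers computation and caching)
print(lines.count())

# Second action: a word count that reuses the cached RDD instead of re-reading the file
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

sc.stop()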

Evolution of Apache Spark

Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab
by Matei Zaharia.

It was open-sourced in 2010 under a BSD license.

It was donated to the Apache Software Foundation in 2013, and Apache Spark became
a top-level Apache project in February 2014.

Features of Apache Spark

 Speed − Spark helps run an application in a Hadoop cluster up to 100 times
faster in memory and 10 times faster when running on disk.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala,
and Python. Spark also comes with around 80 high-level operators for interactive
querying.
 Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms.

Spark Built on Hadoop

Spark can be deployed together with Hadoop components in three ways: as a
standalone deployment, on Hadoop YARN, or as Spark in MapReduce (SIMR).

Components of Spark
Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform upon
which all other functionality is built.

It provides in-memory computing and the ability to reference datasets in external
storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and
semi-structured data.
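A small, hedged PySpark sketch of this idea follows; in current Spark versions the structured abstraction is exposed as a DataFrame (the successor of SchemaRDD), and the table and column names below are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Build a small DataFrame of structured data from in-memory rows
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# Query the structured data with SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()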

Spark Streaming

Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics.

It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
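A hedged PySpark Streaming sketch of mini-batch processing is shown below; the socket source at localhost:9999 is a placeholder and the 5-second batch interval is an arbitrary choice.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MiniBatchExample")

# Group the incoming stream into 5-second mini-batches
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("localhost", 9999)

# Apply RDD-style transformations to each mini-batch
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()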

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework above Spark because of the
distributed memory-based Spark architecture.

According to benchmarks run by the MLlib developers against Alternating Least
Squares (ALS) implementations, Spark MLlib is about nine times as fast as the
Hadoop disk-based version of Apache Mahout.

GraphX

GraphX is a distributed graph-processing framework on top of Spark.

It provides an API for expressing graph computation that can model user-defined
graphs by using the Pregel abstraction API.
LIMITATIONS OF HADOOP
Each limitation below is listed together with the solutions commonly used to overcome it.

1. Issue with small files
 Merge the small files to create bigger files and then copy the bigger files to HDFS.
 Sequence files work very well in practice to overcome the 'small file problem'; the filename is used as the key and the file contents as the value.
 Storing files in HBase is a very common design pattern for overcoming the small file problem with HDFS.

2. Slow processing speed
 In-memory processing is faster, as no time is spent moving the data/processes in and out of the disk.
 Spark is up to 100 times faster than MapReduce as it processes everything in memory.
 Flink processes data even faster than Spark because of its streaming architecture, and Flink can be instructed to process only the parts of the data that have actually changed.

3. Latency
 Apache Spark is yet another batch system, but it is relatively faster since it caches much of the input data in memory using RDDs (Resilient Distributed Datasets) and keeps intermediate data in memory itself.

4. Security
 Spark provides a security bonus that helps overcome this limitation of Hadoop.
 If Spark runs on HDFS, it can use HDFS ACLs and file-level permissions.
 Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.

5. No real-time data processing
 Apache Spark supports stream processing, which involves continuous input and output of data.
 Apache Flink provides a single run-time for streaming as well as batch processing, so one common run-time is used for both data streaming applications and batch processing applications.

6. Support for batch processing only
 Flink improves overall performance as it provides a single run-time for streaming as well as batch processing.
 Flink uses native closed-loop iteration operators, which make machine learning and graph processing faster.

7. Uncertainty
 Hadoop only ensures that a data job completes; it cannot guarantee when the job will complete.

8. Lengthy lines of code
 Spark and Flink expose APIs in Scala and Java, and their core implementation is in Scala, so the number of lines of code is far smaller than in Hadoop MapReduce.
 Programs therefore also take less time to write and execute, which addresses Hadoop's lengthy-line-of-code limitation.

9. No caching
 Spark and Flink overcome this limitation of Hadoop, as both cache data in memory for further iterations, which enhances overall performance.

10. Not easy to use
 Spark has an interactive mode so that developers and users alike can get intermediate feedback for queries and other activities.
 Spark is easy to program as it has tons of high-level operators.

11. No delta iteration
 Spark iterates its data in batches; for iterative processing in Spark, each iteration is scheduled and executed separately.

12. Vulnerable by nature
 Hadoop is written entirely in Java, one of the most widely used languages; Java has been heavily exploited by cyber criminals and, as a result, implicated in numerous security breaches.

13. No abstraction
 Spark provides the RDD abstraction for batch processing.
 Flink provides the DataSet abstraction.
APACHE FLINK
Apache Flink is a real-time processing framework which can process streaming
data.

It is an open-source stream processing framework for high-performance, scalable,
and accurate real-time applications.

It has a true streaming model and does not take input data as batches or micro-batches.

Apache Flink was originally created by the team behind the company Data Artisans
and is now developed under the Apache License by the Apache Flink community.

Ecosystem on Apache Flink

Storage

Apache Flink has multiple options from which it can read/write data. Below is a
basic storage list −

 HDFS (Hadoop Distributed File System)
 Local File System
 S3
 RDBMS (MySQL, Oracle, MS SQL etc.)
 MongoDB
 HBase
 Apache Kafka
 Apache Flume

Deploy

You can deploy Apache Flink in local mode, cluster mode, or on the cloud. Cluster
mode can be standalone, YARN, or Mesos.

On cloud, Flink can be deployed on AWS or GCP.

Kernel

This is the runtime layer, which provides distributed processing, fault tolerance,
reliability, native iterative processing capability and more.

APIs & Libraries

This is the top layer and most important layer of Apache Flink.

It has the DataSet API, which takes care of batch processing, and the DataStream API,
which takes care of stream processing.

There are other libraries like Flink ML (for machine learning), Gelly (for graph
processing), and the Table API (for SQL).

This layer provides diverse capabilities to Apache Flink.


INSTALLING FLINK
First, check whether Java 8 is installed on your system.

Then proceed by downloading Apache Flink.

wget http://mirrors.estointernet.in/apache/flink/flink-1.7.1/flink-1.7.1-bin-scala_2.11.tgz

Now, uncompress the tar file.

tar -xzf flink-1.7.1-bin-scala_2.11.tgz

Go to Flink's home directory.

cd flink-1.7.1/

Start the Flink Cluster.

./bin/start-cluster.sh
Open a browser and go to the URL below; it will open the Flink Web Dashboard.

http://localhost:8081

The Apache Flink Dashboard user interface will then be displayed.
BATCH ANALYTICS USING FLINK
Apache Flink’s unified approach to stream and batch processing means that a DataStream
application executed over bounded input will produce the same final results regardless of the
configured execution mode.
It is important to note what final means here: a job executing in STREAMING mode might
produce incremental updates (think upserts in a database) while a BATCH job would only
produce one final result at the end. The final result will be the same if interpreted correctly but
the way to get there can be different.

As a rule of thumb, you should be using BATCH execution mode when your program is
bounded because this will be more efficient.

You have to use STREAMING execution mode when your program is unbounded because
only this mode is general enough to be able to deal with continuous data streams.

The execution mode can be configured via the execution.runtime-mode setting.


There are three possible values:

 STREAMING: The classic DataStream execution mode (default)
 BATCH: Batch-style execution on the DataStream API
 AUTOMATIC: Let the system decide based on the boundedness of the sources
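As a hedged example, the execution mode can be passed on the command line when submitting a job, or set programmatically; the jar path below is illustrative, and the Python snippet assumes PyFlink 1.12 or later, where RuntimeExecutionMode is exposed by the DataStream API.

./bin/flink run -Dexecution.runtime-mode=BATCH examples/streaming/WordCount.jar

# Programmatic equivalent (PyFlink, assumed API)
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
# Run this bounded DataStream program with batch-style, staged execution
env.set_runtime_mode(RuntimeExecutionMode.BATCH)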

In BATCH execution mode, the tasks of a job can be separated into stages that
can be executed one after another.

We can do this because the input is bounded and Flink can therefore fully process
one stage of the pipeline before moving on to the next.

For example, a job whose tasks are separated by two shuffle barriers would have
three stages corresponding to the three tasks separated by those barriers.

Instead of sending records immediately to downstream tasks, as explained above
for STREAMING mode, processing in stages requires Flink to materialize
intermediate results of tasks to some non-ephemeral storage, which allows
downstream tasks to read them after upstream tasks have already gone offline.
This will increase the latency of processing but comes with other interesting
properties.

For one, this allows Flink to backtrack to the latest available results when a failure
happens instead of restarting the whole job.

Another side effect is that BATCH jobs can execute on fewer resources (in terms
of available slots at TaskManagers) because the system can execute tasks
sequentially one after the other.

Task Managers will keep intermediate results at least as long as downstream tasks
have not consumed them.

After that, they will be kept for as long as space allows in order to allow the
aforementioned backtracking to earlier results in case of a failure.
BIG DATA MINING WITH NOSQL
Datasets that are too large or complex to store and analyze with traditional database
software tools are referred to as big data. As data keeps growing, the question arises
of how, given recent trends in IT, this data can be processed effectively. Ideas,
techniques, tools, and technologies are therefore required for handling and
transforming large amounts of data into business value and knowledge. The major
features of NoSQL solutions that help us handle a large amount of data are stated
below.

NoSQL databases that are best for big data are:


 MongoDB
 Cassandra
 CouchDB
 Neo4j

Different ways to handle Big Data problems:

1. Move the queries to the data rather than moving data to the queries:
When a client needs to send a general query to all nodes holding data, it is more
efficient to send the query to each node than to move a large data set to a central
processor. This simple rule helps explain why NoSQL databases have dramatic
performance advantages on systems that were not built to distribute queries to nodes.
Data is kept inside each node in document form, so only the query and the result
need to move over the network, keeping big data queries fast.
2. Use hash rings to distribute data evenly:
Finding a reliable way to assign a record to a processing node is one of the most
difficult problems in distributed databases. Using a randomly generated 40-character
key, the hash-ring technique helps distribute a large amount of data evenly across
many servers, and this is a good way to spread the network load uniformly. A
minimal sketch of this idea appears after this list.
3. Use replication to scale read requests:
Databases use replication to make real-time backup copies of data. Read requests
can then be scaled horizontally across the replicas. This replication strategy works
well in most cases.
4. Let the database distribute queries to nodes:
Separating query evaluation from query execution is important for getting higher
performance from queries that traverse many nodes. The NoSQL database moves
the query to the data instead of moving the data to the query.
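Below is a minimal, hedged Python sketch of the hash-ring idea from point 2 (consistent hashing with SHA-1 keys); real systems add virtual nodes and replication on top of this.

import bisect
import hashlib

def hash_key(key):
    # 40-character hexadecimal SHA-1 digest, interpreted as a position on the ring
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Place each node on the ring at the position given by its hashed name
        self.ring = sorted((hash_key(node), node) for node in nodes)

    def node_for(self, doc_id):
        # Walk clockwise from the document's position to the next node on the ring
        position = hash_key(doc_id)
        index = bisect.bisect(self.ring, (position, "")) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("document-42"))   # documents spread roughly evenly across nodes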
NOSQL DATABASES
A database is a collection of structured data or information which is stored in a
computer system and can be accessed easily. A database is usually managed by a
Database Management System (DBMS).
NoSQL is a non-relational database that is used to store data in non-tabular
form. NoSQL stands for "Not only SQL". The main types are document, key-value,
wide-column, and graph databases.

Types of NoSQL Database:

 Document-based databases
 Key-value stores
 Column-oriented databases
 Graph-based databases

Document-Based Database:

The document-based database is a non-relational database. Instead of storing data
in rows and columns (tables), it uses documents to store the data. A document
database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications, which means less translation is required to use the data
in the applications. In a document database, particular elements can be accessed
using an index value assigned to them for faster querying.
Key features of document databases:
 Flexible schema: Documents in the database have a flexible schema, meaning
documents in the same database need not share the same schema.
 Faster creation and maintenance: Creating documents is easy, and minimal
maintenance is required once a document is created.
 No foreign keys: There is no dynamic relationship between two documents,
so documents can be independent of one another; there is no requirement for
a foreign key in a document database.
 Open formats: Documents are built using XML, JSON, and other open formats.

Key-Value Stores:

A key-value store is a non-relational database. The simplest form of a NoSQL
database is a key-value store. Every data element in the database is stored as a
key-value pair. The data can be retrieved using a unique key allotted to each element
in the database. The values can be simple data types like strings and numbers or
complex objects.
A key-value store is like a relational database with only two columns: the key and
the value.

Column Oriented Databases:

A column-oriented database is a non-relational database that stores data in
columns instead of rows. This means that when you want to run analytics on a small
number of columns, you can read those columns directly without consuming
memory with unwanted data.
Columnar databases are designed to read data more efficiently and retrieve it with
greater speed. A columnar database is used to store large amounts of data.

Graph-Based databases:

Graph-based databases focus on the relationship between the elements. It stores the
data in the form of nodes in the database. The connections between the nodes are
called links or relationships.
Key features of graph database:
 In a graph-based database, it is easy to identify the relationship between
the data by using the links.
 Query results are available in real time.
 The speed depends upon the number of relationships among the database
elements.
 Updating data is also easy, as adding a new node or edge to a graph
database is a straightforward task that does not require significant schema
changes.
MONGODB
MongoDB, the most popular NoSQL database, is an open-source document-oriented
database.

The term 'NoSQL' means 'non-relational'.

It means that MongoDB isn't based on the table-like relational database structure
but provides an altogether different mechanism for the storage and retrieval of data.

This format of storage is called BSON (similar to the JSON format).

A simple MongoDB document Structure:

{
    title: 'Geeksforgeeks',
    by: 'Harshit Gupta',
    url: 'https://www.geeksforgeeks.org',
    type: 'NoSQL'
}

SQL databases store data in tabular format.

This data is stored in a predefined data model, which is not very flexible for today's
rapidly growing real-world applications.

Modern applications are more networked, social and interactive than ever.

Applications are storing more and more data and are accessing it at higher rates.

A Relational Database Management System (RDBMS) is not the correct choice when
it comes to handling big data, by virtue of its design, since RDBMSs are not
horizontally scalable.

If the database runs on a single server, it will reach a scaling limit. NoSQL
databases are more scalable and provide superior performance.

MongoDB is such a NoSQL database that scales by adding more and more servers
and increases productivity with its flexible document model.
Where do we use MongoDB?

MongoDB is preferred over RDBMS in the following scenarios:


 Big Data: If you have a huge amount of data to be stored in tables, think
of MongoDB before RDBMS databases. MongoDB has a built-in
solution for partitioning and sharding your database.
 Unstable Schema: Adding a new column in an RDBMS is hard, whereas
MongoDB is schema-less. Adding a new field does not affect old
documents and is very easy.
 Distributed data: Since multiple copies of data are stored across
different servers, recovery of data is instant and safe even if there is a
hardware failure.

Language Support by MongoDB:

 MongoDB currently provides official driver support for all popular
programming languages like C, C++, Rust, C#, Java, Node.js, Perl, PHP,
Python, Ruby, Scala, Go, and Erlang.
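As a hedged example using the official Python driver (pymongo), with placeholder database, collection and field names, and assuming a MongoDB server running on localhost:

from pymongo import MongoClient

# Connect to a local MongoDB instance (placeholder host and port)
client = MongoClient("mongodb://localhost:27017")
db = client["blogdb"]      # hypothetical database name
posts = db["posts"]        # hypothetical collection name

# Documents are schema-less: these two documents have different fields
posts.insert_one({"title": "Geeksforgeeks", "type": "NoSQL"})
posts.insert_one({"title": "Unit 4 BDTT", "tags": ["oozie", "spark"]})

# Query matching documents back
for doc in posts.find({"type": "NoSQL"}):
    print(doc)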

Installing MongoDB:

 Just go to http://www.mongodb.org/downloads and select your operating
system out of Windows, Linux, Mac OS X and Solaris. A detailed
explanation of the installation of MongoDB is given on their site.

 For Windows, a few options for the 64-bit operating systems drop down.
When you're running Windows 7, 8 or newer versions, select Windows
64-bit 2008 R2+. When you're using Windows XP or Vista, select
Windows 64-bit 2008 R2+ legacy.
BASIC QUERIES IN MONGODB

The use Command

The MongoDB use DATABASE_NAME command is used to create a database. The
command will create a new database if it doesn't exist; otherwise it will return the
existing database.

Syntax

Basic syntax of use DATABASE statement is as follows −

use DATABASE_NAME

The dropDatabase() Method

The MongoDB db.dropDatabase() command is used to drop an existing database.

Syntax

Basic syntax of dropDatabase() command is as follows −

db.dropDatabase()

This will delete the selected database. If you have not selected any database, it
will delete the default 'test' database.

The createCollection() Method

MongoDB's db.createCollection(name, options) method is used to create a collection.

Syntax

Basic syntax of createCollection() command is as follows −

db.createCollection(name, options)

In the command, name is the name of the collection to be created. options is a
document used to specify the configuration of the collection.

The drop() Method

MongoDB's db.collection.drop() is used to drop a collection from the database.


Syntax

Basic syntax of drop() command is as follows −

db.COLLECTION_NAME.drop()

The insert() Method

To insert data into a MongoDB collection, you need to use
MongoDB's insert() or save() method.

Syntax

The basic syntax of insert() command is as follows −

>db.COLLECTION_NAME.insert(document)

The find() Method

To query data from a MongoDB collection, you need to use
MongoDB's find() method.

Syntax

The basic syntax of find() method is as follows −

>db.COLLECTION_NAME.find()

The find() method will display all the documents in a non-structured way.
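Putting these commands together, a short mongo shell session might look like the following sketch; the database, collection and field names are illustrative only.

> use mydb
> db.createCollection("posts")
> db.posts.insert({ title: "Geeksforgeeks", type: "NoSQL" })
> db.posts.find()
> db.posts.drop()
> db.dropDatabase()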


INTRODUCTION TO CASSANDRA
Apache Cassandra is an open-source NoSQL database that is used for handling big
data.

Apache Cassandra has the capability to handle structured, semi-structured, and
unstructured data.

Apache Cassandra was originally developed at Facebook; it was open-sourced in
2008 and became one of the top-level Apache projects in 2010.

Features of Cassandra:
1. It is scalable.
2. It is flexible (can accept structured, semi-structured and unstructured data).
3. It has transaction support as it follows ACID properties.
4. It is highly available and fault tolerant.
5. It is open source.

Apache Cassandra is a highly scalable, distributed database that follows the
principles of the CAP (Consistency, Availability and Partition tolerance) theorem.
In Apache Cassandra, there is no master-slave architecture; it has a peer-to-peer
architecture.

In Apache Cassandra, we can create multiple copies of data at the time of keyspace
creation.

We can simply define the replication strategy and RF (Replication Factor) to create
multiple copies of data.

Example:
CREATE KEYSPACE Example
WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': '3'};
In this example, we define the RF (Replication Factor) as 3, which simply means that
three copies of the data are created across multiple nodes, placed in a clockwise
direction around the ring.
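As a hedged continuation of this example, a table could then be created and queried inside the keyspace as follows; the table and column names are illustrative, not part of the original notes.

USE Example;

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
);

INSERT INTO users (user_id, name, email)
VALUES (uuid(), 'Ashwath', 'ashwath@example.com');

SELECT name, email FROM users;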
