
Unit 6

HBase, Sqoop and Apache Spark: Hadoop Projects
What is Apache HBase?
• HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS), a main component of Apache Hadoop.

• HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing and random read/write access to large volumes of data.

• Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java, much like a typical Apache MapReduce application. HBase does support writing applications in Apache Avro, REST and Thrift.
• An HBase system is designed to scale linearly. It comprises a set of standard tables with rows and columns, much like a traditional database. Each table must have an element defined as a primary key, and all access attempts to HBase tables must use this primary key.

• Avro, as a component, supports a rich set of primitive data types including numeric types, binary data and strings, and a number of complex types including arrays, maps, enumerations and records. A sort order can also be defined for the data.

• HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase, but if you're running a production cluster, it's suggested that you have a dedicated ZooKeeper cluster that's integrated with your HBase cluster.

• HBase works well with Hive, a query engine for batch processing of big data, to enable fault-tolerant big data applications.


An example of HBase
• An HBase column represents an attribute of an object; if the table is storing diagnostic logs from servers in your environment, each row might be a log record, and a typical column could be the timestamp of when the log record was written, or the server name where the record originated.
• HBase allows many attributes to be grouped together into column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. With HBase you must predefine the table schema and specify the column families. However, new columns can be added to families at any time, making the schema flexible and able to adapt to changing application requirements (a short code sketch below illustrates the idea).

• Just as HDFS has a NameNode and slave nodes, and MapReduce has JobTracker and TaskTracker slaves, HBase is built on similar concepts. In HBase a master node manages the cluster, and region servers store portions of the tables and perform the work on the data. Just as HDFS has some enterprise concerns because of its reliance on the NameNode, HBase is also sensitive to the loss of its master node.
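A minimal sketch of the server-log example above, using the third-party happybase Python client. The host, table name, column family and row key are illustrative, and a running HBase Thrift gateway is assumed; this is not the only way to talk to HBase (applications are typically written in Java).

    import happybase

    # Connect to the HBase Thrift gateway (host name is an assumption).
    connection = happybase.Connection('hbase-thrift-host')

    # The table schema and its column families must be defined up front;
    # individual columns within a family can be added later at write time.
    connection.create_table('server_logs', {'log': dict()})
    table = connection.table('server_logs')

    # Every read and write is addressed by the row key (the "primary key").
    table.put(b'web01-2024-05-01T12:00:00', {
        b'log:timestamp': b'2024-05-01T12:00:00Z',   # a column in the 'log' family
        b'log:server_name': b'web01',                # another column, added on the fly
        b'log:message': b'disk usage at 85%',
    })

    row = table.row(b'web01-2024-05-01T12:00:00')    # random read by row key
    print(row[b'log:server_name'])
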
HBase Vs RDBMS
HBase Shell Commands
Apache Sqoop
What is Sqoop and Why Use Sqoop?

• Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).

• To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several sources. However, loading data from several heterogeneous sources turned out to be extremely challenging. The problems administrators encountered included:
  • Maintaining data consistency
  • Ensuring efficient utilization of resources
  • Loading bulk data into Hadoop was not possible
  • Loading data using scripts was slow

The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of the traditional approach: it could load bulk data from an RDBMS into Hadoop with ease.
Sqoop Features

Parallel Import/Export
• Sqoop uses the YARN framework to import and export data. This provides fault tolerance on top of parallelism.

Import Results of an SQL Query
• Sqoop enables us to import the results returned by an SQL query into HDFS.

Connectors for All Major RDBMS Databases
• Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.

Kerberos Security Integration
• Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.

Provides Full and Incremental Load
• Sqoop can load a whole table with a single command, and it also supports incremental loads that bring in only the rows added since the last import.

Sqoop Architecture

1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.
3. Multiple mappers perform map tasks to load the data onto HDFS.
4. Similarly, numerous map tasks export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop Import

1. In this example, a company's data is present in the RDBMS. All this metadata is sent to the Sqoop import tool. Sqoop then performs an introspection of the database to gather metadata (primary key information).
2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
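A minimal sketch of such an import, invoking the sqoop command line from Python. The JDBC URL, credentials, table and directory names are illustrative.

    import subprocess

    # "sqoop import" introspects the source table's metadata (including its
    # primary key) and runs a map-only job that writes splits into HDFS.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/company",   # JDBC URL of the source RDBMS
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",     # keeps the password off the command line
        "--table", "orders",                          # table whose primary key drives the splits
        "--target-dir", "/data/orders",               # HDFS directory for the imported files
        "--num-mappers", "4",                         # number of parallel map tasks
    ], check=True)
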
Sqoop Export
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses
individual map tasks to push the splits to RDBMS.
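A matching sketch for the export direction, again with illustrative names; the target table is assumed to already exist in the RDBMS with a compatible schema.

    import subprocess

    # "sqoop export" reads the files under --export-dir from HDFS and writes
    # the rows into the existing RDBMS table.
    subprocess.run([
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/company",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",
        "--table", "daily_summary",        # target table in the RDBMS
        "--export-dir", "/data/summary",   # HDFS directory holding the data to export
        "--num-mappers", "4",
    ], check=True)
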
Sqoop Processing

• Processing takes place step by step, as shown below:

1. Sqoop runs in the Hadoop cluster.
2. It imports data from the RDBMS or NoSQL database to HDFS.
3. It uses mappers to slice the incoming data into multiple formats and loads the data into HDFS.
4. It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.

• Typical uses include transferring an entire table, specifying a target directory, importing only a subset of data, and incremental uploads (importing only new data).
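A sketch of those last two uses, importing only a subset of rows and an incremental upload. The check column and last value below are illustrative.

    import subprocess

    # Incremental append: only rows whose check column is greater than the
    # previously recorded --last-value are imported on this run.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/company",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--where", "region = 'APAC'",      # import only a subset of the rows
        "--incremental", "append",         # incremental upload mode
        "--check-column", "order_id",      # column used to detect new rows
        "--last-value", "100000",          # high-water mark from the previous import
    ], check=True)
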
Apache Spark:
What is Apache Spark?

• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.


Features of Apache Spark

• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.

• Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes with 80 high-level operators for interactive querying.

• Advanced analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark

Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
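A minimal PySpark sketch of that model: an RDD is built from a dataset in external storage (the HDFS path below is illustrative), cached in memory, and reused across computations.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("core-sketch").getOrCreate()
    sc = spark.sparkContext

    # Reference a dataset in external storage (an HDFS path here) as an RDD.
    lines = sc.textFile("hdfs:///data/server_logs")

    # Cache the intermediate RDD in memory so repeated actions avoid re-reading disk.
    words = lines.flatMap(lambda line: line.split()).cache()

    print(words.count())                                  # first action materialises and caches the RDD
    print(words.filter(lambda w: w == "ERROR").count())   # reuses the in-memory data

    spark.stop()
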
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (the forerunner of today's DataFrame), which provides support for structured and semi-structured data.
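A small sketch of that abstraction using the DataFrame API; the column names and values are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

    # A DataFrame: structured data with a schema, queryable via SQL.
    df = spark.createDataFrame(
        [("web01", 85), ("web02", 40), ("web03", 91)],
        ["server_name", "disk_usage"],
    )
    df.createOrReplaceTempView("servers")

    spark.sql("SELECT server_name FROM servers WHERE disk_usage > 80").show()
    spark.stop()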

Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
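A minimal sketch of the mini-batch model using the classic DStream API; the socket source, host and port are illustrative.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=5)   # one mini-batch every 5 seconds

    # Each mini-batch of lines from the socket becomes an RDD; ordinary RDD
    # transformations are applied to every batch.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
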
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework above Spark, thanks to the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
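A minimal sketch of the ALS recommender mentioned above, using the pyspark.ml API; the tiny (user, item, rating) data set and parameter values are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # (user id, item id, rating) triples; real workloads would load these from storage.
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0)],
        ["user", "item", "rating"],
    )

    als = ALS(userCol="user", itemCol="item", ratingCol="rating", rank=5, maxIter=5)
    model = als.fit(ratings)

    # Recommend two items for every user.
    model.recommendForAllUsers(2).show(truncate=False)
    spark.stop()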

GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
• Domain Scenarios of Apache Spark
