
Unit 6

HBase, Sqoop and Apache Spark: Hadoop Projects
What is Apache HBase?
• HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS), a main component of Apache Hadoop.

• HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing and random read/write access to large volumes of data.

• Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java, much like a typical Apache MapReduce application. HBase does support writing applications in Apache Avro, REST and Thrift.
• An HBase system is designed to scale linearly. It comprises a set of standard tables with rows and columns, much like a traditional database. Each table must have an element defined as a primary key, and all access attempts to HBase tables must use this primary key.

• Avro, as a component, supports a rich set of primitive data types including numeric types, binary data and strings, and a number of complex types including arrays, maps, enumerations and records. A sort order can also be defined for the data.

• HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase, but if you're running a production cluster, it's suggested that you have a dedicated ZooKeeper cluster that's integrated with your HBase cluster.

• HBase works well with Hive, a query engine for batch processing of big data, to enable fault-tolerant big data applications.


An example of HBase
• An HBase column represents an attribute of an object; if the table is storing diagnostic logs from servers in your environment, each row might be a log record, and a typical column could be the timestamp of when the log record was written, or the server name where the record originated.
• HBase allows many attributes to be grouped together into column families, such that the elements of a column family are all stored together. This is different from a row-oriented relational database, where all the columns of a given row are stored together. With HBase you must predefine the table schema and specify the column families. However, new columns can be added to families at any time, making the schema flexible and able to adapt to changing application requirements (a short code sketch below illustrates the idea).

• Just as HDFS has a NameNode and slave nodes, and MapReduce has JobTracker and TaskTracker slaves, HBase is built on similar concepts. In HBase a master node manages the cluster, and region servers store portions of the tables and perform the work on the data. Just as HDFS has some enterprise concerns because of its reliance on the NameNode, HBase is also sensitive to the loss of its master node.
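A minimal sketch of the server-log example above, using the third-party happybase Python client. The host, table name, column family and row key are illustrative, and a running HBase Thrift gateway is assumed; this is not the only way to talk to HBase (applications are typically written in Java).

    import happybase

    # Connect to the HBase Thrift gateway (host name is an assumption).
    connection = happybase.Connection('hbase-thrift-host')

    # The table schema and its column families must be defined up front;
    # individual columns within a family can be added later at write time.
    connection.create_table('server_logs', {'log': dict()})
    table = connection.table('server_logs')

    # Every read and write is addressed by the row key (the "primary key").
    table.put(b'web01-2024-05-01T12:00:00', {
        b'log:timestamp': b'2024-05-01T12:00:00Z',   # a column in the 'log' family
        b'log:server_name': b'web01',                # another column, added on the fly
        b'log:message': b'disk usage at 85%',
    })

    row = table.row(b'web01-2024-05-01T12:00:00')    # random read by row key
    print(row[b'log:server_name'])
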
HBase Vs RDBMS
HBase Shell Commands
Apache Sqoop
What is Sqoop and Why Use Sqoop?

• Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).

• To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several sources. However, loading data from several heterogeneous sources turned out to be extremely challenging. The problems administrators encountered included:
  • Maintaining data consistency
  • Ensuring efficient utilization of resources
  • Loading bulk data into Hadoop was not possible
  • Loading data using scripts was slow

The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of the traditional approach: it could load bulk data from an RDBMS into Hadoop with ease.
Sqoop Features

Parallel Import/Export
• Sqoop uses the YARN framework to import and export data. This provides fault tolerance on top of parallelism.

Import Results of an SQL Query
• Sqoop enables us to import the results returned by an SQL query into HDFS.

Connectors for All Major RDBMS Databases
• Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.

Kerberos Security Integration
• Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.

Provides Full and Incremental Load
• Sqoop can load a whole table with a single command, and it also supports incremental loads that bring in only the rows added since the last import.

Sqoop Architecture

1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.
3. Multiple mappers perform map tasks to load the data onto HDFS.
4. Similarly, numerous map tasks export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop Import

1. In this example, a company's data is present in the RDBMS. All this metadata is sent to the Sqoop import tool. Sqoop then performs an introspection of the database to gather metadata (primary key information).
2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
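A minimal sketch of such an import, invoking the sqoop command line from Python. The JDBC URL, credentials, table and directory names are illustrative.

    import subprocess

    # "sqoop import" introspects the source table's metadata (including its
    # primary key) and runs a map-only job that writes splits into HDFS.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/company",   # JDBC URL of the source RDBMS
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",     # keeps the password off the command line
        "--table", "orders",                          # table whose primary key drives the splits
        "--target-dir", "/data/orders",               # HDFS directory for the imported files
        "--num-mappers", "4",                         # number of parallel map tasks
    ], check=True)
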
Sqoop Export
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses
individual map tasks to push the splits to RDBMS.
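A matching sketch for the export direction, again with illustrative names; the target table is assumed to already exist in the RDBMS with a compatible schema.

    import subprocess

    # "sqoop export" reads the files under --export-dir from HDFS and writes
    # the rows into the existing RDBMS table.
    subprocess.run([
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/company",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",
        "--table", "daily_summary",        # target table in the RDBMS
        "--export-dir", "/data/summary",   # HDFS directory holding the data to export
        "--num-mappers", "4",
    ], check=True)
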
Sqoop Processing

• Processing takes place step by step, as shown below:

1. Sqoop runs in the Hadoop cluster.
2. It imports data from the RDBMS or NoSQL database to HDFS.
3. It uses mappers to slice the incoming data into multiple formats and loads the data into HDFS.
4. It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.

• Typical uses include transferring an entire table, specifying a target directory, importing only a subset of data, and incremental uploads (importing only new data).
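A sketch of those last two uses, importing only a subset of rows and an incremental upload. The check column and last value below are illustrative.

    import subprocess

    # Incremental append: only rows whose check column is greater than the
    # previously recorded --last-value are imported on this run.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/company",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop_pw",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--where", "region = 'APAC'",      # import only a subset of the rows
        "--incremental", "append",         # incremental upload mode
        "--check-column", "order_id",      # column used to detect new rows
        "--last-value", "100000",          # high-water mark from the previous import
    ], check=True)
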
Apache Spark:
What is Apache Spark?

• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.


Features of Apache Spark

• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.

• Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes with 80 high-level operators for interactive querying.

• Advanced analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark

Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform, upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
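A minimal PySpark sketch of that model: an RDD is built from a dataset in external storage (the HDFS path below is illustrative), cached in memory, and reused across computations.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("core-sketch").getOrCreate()
    sc = spark.sparkContext

    # Reference a dataset in external storage (an HDFS path here) as an RDD.
    lines = sc.textFile("hdfs:///data/server_logs")

    # Cache the intermediate RDD in memory so repeated actions avoid re-reading disk.
    words = lines.flatMap(lambda line: line.split()).cache()

    print(words.count())                                  # first action materialises and caches the RDD
    print(words.filter(lambda w: w == "ERROR").count())   # reuses the in-memory data

    spark.stop()
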
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD (the forerunner of today's DataFrame), which provides support for structured and semi-structured data.
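A small sketch of that abstraction using the DataFrame API; the column names and values are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

    # A DataFrame: structured data with a schema, queryable via SQL.
    df = spark.createDataFrame(
        [("web01", 85), ("web02", 40), ("web03", 91)],
        ["server_name", "disk_usage"],
    )
    df.createOrReplaceTempView("servers")

    spark.sql("SELECT server_name FROM servers WHERE disk_usage > 80").show()
    spark.stop()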

Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
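A minimal sketch of the mini-batch model using the classic DStream API; the socket source, host and port are illustrative.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=5)   # one mini-batch every 5 seconds

    # Each mini-batch of lines from the socket becomes an RDD; ordinary RDD
    # transformations are applied to every batch.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
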
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework above Spark, thanks to the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
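A minimal sketch of the ALS recommender mentioned above, using the pyspark.ml API; the tiny (user, item, rating) data set and parameter values are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # (user id, item id, rating) triples; real workloads would load these from storage.
    ratings = spark.createDataFrame(
        [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0)],
        ["user", "item", "rating"],
    )

    als = ALS(userCol="user", itemCol="item", ratingCol="rating", rank=5, maxIter=5)
    model = als.fit(ratings)

    # Recommend two items for every user.
    model.recommendForAllUsers(2).show(truncate=False)
    spark.stop()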

GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
• Domain Scenarios of Apache Spark
