CSE3002 Big Data Module 2

Apache Hive is a data warehousing tool built on Hadoop, designed for SQL users to manage and analyze structured data using HiveQL. It supports large-scale data storage and processing, with features like partitioning and bucketing to optimize query performance. Hive is widely used at Facebook, handling massive datasets and enabling efficient data querying without requiring Java knowledge.


Introduction to Apache Hive

• Apache Hive is a Data Warehousing tool built on top of Hadoop and is used for data
analysis.

• Hive is targeted towards users who are comfortable with SQL.

• Its query language, called HiveQL, is similar to SQL and is used for managing and querying structured data.

• This language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers.

• A popular feature of Hive is that there is no need to learn Java.

• Hive, an open-source data warehousing framework based on Hadoop, was developed by the
Data Infrastructure Team at Facebook.
• Hive is also one of the technologies being used to address these requirements at Facebook.

• Hive is very popular internally at Facebook, where hundreds of users run thousands of jobs on the cluster for a wide variety of applications.

• The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and loads about 15 TB of data on a daily basis.
Hive Architecture:
Where to Use Hive
Limitations of Hive:
SQL
• SQL stands for Structured Query Language.
• SQL is the language we use to work with databases; a database does not understand English or any other natural language.
• Just as we use Java or C# to create software, we use SQL to work with databases.
• SQL is the standard language of databases and is also pronounced "Sequel" by many people.
• SQL itself is a declarative language.
• SQL deals with structured data and is designed for RDBMSs (relational database management systems).
• SQL supports a schema for data storage.
• We use SQL when records need frequent modification and when better performance is required.

HiveQL
• Hive's SQL dialect is known as HiveQL; it is a combination of SQL-92, Oracle's SQL dialect, and MySQL.

• HiveQL also provides some features from later SQL standards, such as the analytic functions of SQL:2003.

• Hive adds its own extensions, such as multi-table inserts and the TRANSFORM, MAP, and REDUCE clauses.
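
As an illustration of one of these extensions, here is a minimal sketch of a multi-table insert; the table and column names (src, dest1, dest2, key, value) are hypothetical.

-- The source table src is scanned once, and both destination tables are written in the same query
FROM src
INSERT OVERWRITE TABLE dest1 SELECT key, value WHERE key < 100
INSERT OVERWRITE TABLE dest2 SELECT key, count(*) WHERE key >= 100 GROUP BY key;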

Hive Data Types
• Hive data types are categorized into numeric types, string types, misc types, and complex types.
Integer Types

Type       Size                    Range
TINYINT    1-byte signed integer   -128 to 127
SMALLINT   2-byte signed integer   -32,768 to 32,767
INT        4-byte signed integer   -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer   -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types

Type       Size      Description
FLOAT      4-byte    Single-precision floating-point number
DOUBLE     8-byte    Double-precision floating-point number
Date/Time Types

TIMESTAMP
•It supports traditional UNIX timestamp with optional nanosecond precision.
•As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
•As Floating point numeric type, it is interpreted as UNIX timestamp in
seconds with decimal precision.
•As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)

DATES
The Date value is used to specify a particular year, month, and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the
Date type lies between 0000-01-01 and 9999-12-31.
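
As a small illustration (the table events and its columns are hypothetical), DATE and TIMESTAMP values can be produced by casting strings in the formats above:

-- DATE stores year-month-day only; TIMESTAMP also stores the time of day
CREATE TABLE events (event_name STRING, event_date DATE, created_at TIMESTAMP);

SELECT event_name
FROM events
WHERE event_date = CAST('2015-06-01' AS DATE)
  AND created_at >= CAST('2015-06-01 10:15:30.123' AS TIMESTAMP);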

String Types
• STRING
• The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
• Varchar
• The varchar is a variable-length type whose declared length lies between 1 and 65535; the length specifies the maximum number of characters allowed in the string.
• CHAR
• The char is a fixed-length type whose maximum length is
fixed at 255.

Complex Types

Type     Description                                                   Example
Struct   Similar to a C struct or an object; fields are accessed       struct('James','Roy')
         using "dot" notation.
Map      A collection of key-value tuples; fields are accessed         map('first','James','last','Roy')
         using array notation.
Array    A collection of values of the same type that are              array('James','Roy')
         indexable using zero-based integers.
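
A minimal sketch of how these complex types are declared and accessed (the table employee_info and its columns are hypothetical):

CREATE TABLE employee_info (
  name       STRUCT<first:STRING, last:STRING>,
  phones     ARRAY<STRING>,
  attributes MAP<STRING,STRING>
);

SELECT name.first,           -- struct field via dot notation
       phones[0],            -- array element via zero-based index
       attributes['dept']    -- map value looked up by key
FROM employee_info;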

Hive DDL Commands
Hive DDL commands are the statements used for defining and changing the
structure of a table or database in Hive. It is used to build or modify the tables
and other objects in the database.

DDL Command   Use With
CREATE        Database, Table
SHOW          Databases, Tables, Table Properties, Partitions, Functions, Index
DESCRIBE      Database, Table, View
USE           Database
DROP          Database, Table
ALTER         Database, Table
TRUNCATE      Table
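
For example, a few of these DDL statements in use (the database retail_db and table customers are hypothetical):

CREATE DATABASE IF NOT EXISTS retail_db;
USE retail_db;
CREATE TABLE customers (id INT, name STRING, city STRING);
SHOW TABLES;                              -- list tables in the current database
DESCRIBE customers;                       -- show the table's columns and types
ALTER TABLE customers RENAME TO clients;  -- change the table definition
TRUNCATE TABLE clients;                   -- remove all rows but keep the table
DROP TABLE clients;
DROP DATABASE retail_db;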

Hive DML Commands
• Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data from a Hive table once the table and database schema have been defined using Hive DDL commands.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
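
A minimal sketch of these commands (the file path and the customers table are hypothetical; INSERT ... VALUES needs a reasonably recent Hive version, and UPDATE/DELETE additionally require transactional (ACID) tables):

LOAD DATA LOCAL INPATH '/tmp/customers.csv' INTO TABLE customers;
INSERT INTO TABLE customers VALUES (101, 'Asha', 'Chennai');
SELECT city, count(*) FROM customers GROUP BY city;
UPDATE customers SET city = 'Mumbai' WHERE id = 101;   -- ACID table only
DELETE FROM customers WHERE id = 101;                  -- ACID table only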

Hive Joining tables
• The HiveQL JOIN clause is used to combine the data of two or more tables based on a related column between them. The various types of HiveQL joins are:
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join

Here, we are going to execute the join clauses on the records of two sample tables, employee and employee_department.

Inner Join in HiveQL
The HiveQL inner join is used to return the rows of multiple tables where the join condition is satisfied. In other words, the join criteria find the matching records in every table being joined.

Example of Inner Join in Hive


In this example, we take two tables, employee and employee_department. The primary key (empid) of the employee table corresponds to the foreign key (depid) of the employee_department table. Let's perform the inner join operation by using the following steps:
•Select the database in which we want to create the tables, then run the join query:

select e1.empname, e2.department_name
from employee e1
join employee_department e2 on e1.empid = e2.depid;
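
The other join types follow the same pattern; a minimal sketch on the same two tables:

-- Left outer join: all rows from employee, matching rows from employee_department
select e1.empname, e2.department_name from employee e1 left outer join employee_department e2 on e1.empid = e2.depid;

-- Right outer join: all rows from employee_department, matching rows from employee
select e1.empname, e2.department_name from employee e1 right outer join employee_department e2 on e1.empid = e2.depid;

-- Full outer join: all rows from both tables, matched where possible
select e1.empname, e2.department_name from employee e1 full outer join employee_department e2 on e1.empid = e2.depid;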

What are the Hive Partitions?
• Apache Hive organizes
tables into partitions.
Partitioning is a way of
dividing a table into related
parts based on the values of
particular columns like date,
city, and department.
• Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
Why is Partitioning Important?
• Today, huge amounts of data, in the range of petabytes, are stored in HDFS, which makes it very difficult for Hadoop users to query all of this data.
• Hive was introduced to lower this burden of data querying. Apache Hive converts SQL queries into MapReduce jobs and submits them to the Hadoop cluster. Without partitions, when we submit a SQL query, Hive reads the entire data set.
• Running MapReduce jobs over an entire large table is therefore inefficient. This is resolved by creating partitions in tables. Apache Hive makes implementing partitions easy through the partition scheme declared at table-creation time.
• In the partitioning method, all the table data is divided into multiple partitions. Each partition corresponds to specific value(s) of the partition column(s) and is stored as a subdirectory inside the table's directory in HDFS.
• Therefore, when a particular table is queried, only the partitions that contain the query value are read. This decreases the I/O required by the query and hence increases performance.
How to Create Partitions in
Hive?
• To create data partitioning in Hive, the following command is used:

CREATE TABLE table_name (column1 data_type, column2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
Hive Data Partitioning Example
• Now let's understand data partitioning in Hive with an example. Consider a table named Tab1 that contains client details such as id, name, dept, and yoj (year of joining).
• Suppose we need to retrieve the details of all the clients who joined in 2012. Without partitions, the query searches the whole table for the required information. But if we partition the client data by year and store each year in a separate file, this reduces the query processing time. The example below shows how to partition a file and its data.
The file named file1 contains the client data table:

tab1/clientdata/file1
id, name, dept, yoj
1, sunny, SC, 2009
2, animesh, HR, 2009
3, sumeer, SC, 2010
4, sarthak, TP, 2010

Now, let us partition the above data into two files using the year:

tab1/clientdata/2009/file2
1, sunny, SC, 2009
2, animesh, HR, 2009

tab1/clientdata/2010/file3
3, sumeer, SC, 2010
4, sarthak, TP, 2010

• Now, when we retrieve data from the table, only the data of the specified partition will be queried. Creating and loading the partitioned table is done as follows:
• CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY (year STRING);
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' OVERWRITE INTO TABLE table_tab1 PARTITION (year='2009');
• LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' OVERWRITE INTO TABLE table_tab1 PARTITION (year='2010');
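
When a query filters on the partition column, Hive reads only the matching partition directory, for example:

SELECT id, name, dept
FROM table_tab1
WHERE year = '2009';   -- only the year=2009 partition is scanned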
Types of Hive Partitioning :
• Static Partitioning
• Dynamic Partitioning
Hive Static Partitioning
• Inserting input data files individually into a partitioned table is static partitioning.
• Usually, when loading big files into Hive tables, static partitions are preferred.
• Static partitioning saves time in loading data compared to dynamic partitioning.
• You "statically" add a partition to the table and move the file into that partition of the table (see the sketch after this list).
• We can alter partitions when using static partitioning.
• You can take the partition column value from the filename, day of date, etc., without reading the whole big file.
• If you want to use static partitioning in Hive, you should set the property hive.mapred.mode = strict; this property is set in hive-site.xml.
• Static partitioning is used in strict mode.
• In strict mode, a query on a partitioned table should include a partition filter in the WHERE clause.
• You can perform static partitioning on Hive managed tables or external tables.
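
A minimal sketch of a static-partition load, reusing the table_tab1 table from the earlier example (the file path is hypothetical):

-- Statically add a partition, then load a file directly into it
ALTER TABLE table_tab1 ADD PARTITION (year='2011');
LOAD DATA LOCAL INPATH 'tab1/clientdata/2011/file4' INTO TABLE table_tab1 PARTITION (year='2011');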
Hive Dynamic Partitioning
• Loading a partitioned table with a single INSERT statement, where Hive determines the partition values from the data, is known as dynamic partitioning.
• Usually, a dynamic-partition load takes its data from a non-partitioned table.
• Dynamic partitioning takes more time in loading data compared to static partitioning.
• When you have a large amount of data stored in a table, dynamic partitioning is suitable.
• If you want to partition on columns whose values are not known in advance, dynamic partitioning is also suitable.
• With dynamic partitioning, no WHERE-clause partition filter is required.
• We cannot alter dynamic partitions.
• You can perform dynamic partitioning on Hive external tables and managed tables.
• If you want to use dynamic partitioning in Hive, the partition mode must be non-strict.
• The Hive dynamic partition properties you should enable are sketched below.
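
A minimal sketch of the commonly used dynamic-partition settings and a dynamic-partition insert (the staging table tab1_staging is hypothetical):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition value (year) is taken from the last column of the SELECT
INSERT OVERWRITE TABLE table_tab1 PARTITION (year)
SELECT id, name, dept, yoj AS year
FROM tab1_staging;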
Hive Partitioning – Advantages and
Disadvantages
• a) Hive Partitioning Advantages
• Partitioning in Hive distributes execution load
horizontally.
• With partitions, queries over a low volume of data execute faster. For example, searching the population of Vatican City returns very fast instead of searching the entire world population.
• b) Hive Partitioning Disadvantages
• There is the possibility of creating too many small partitions, i.e. too many directories.
• Partitioning is effective for low-volume data, but some queries, such as GROUP BY over a high volume of data, still take a long time to execute.
Bucketing in Hive
• What is Bucketing in Hive
• Basically, for decomposing table data sets into
more manageable parts, Apache Hive offers
another technique. That technique is what we
call Bucketing in Hive.
• Why Bucketing?
• Basically, the concept of Hive partitioning provides a way of segregating Hive table data into multiple files/directories. However, it only gives effective results in a few scenarios, such as:
– When there is a limited number of partitions.
– Or, when partitions are of comparatively equal size.
• However, this is not possible in all scenarios. For example, when we partition our tables based on a geographic location such as country, some bigger countries will have large partitions (e.g., 4-5 countries by themselves contributing 70-80% of the total data).
• Meanwhile, data for small countries will create small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). Hence, at that point, partitioning alone will not be ideal.
• Then, to solve that problem of over-partitioning, Hive offers the bucketing concept. It is another effective technique for decomposing table data sets into more manageable parts.
• Features of Bucketing in Hive
• Basically, this concept is based on a hashing function applied to the bucketed column, along with a modulo by the total number of buckets.
• i. The hash_function depends on the type of the bucketing column.
ii. Records with the same value in the bucketed column will always be stored in the same bucket.
iii. To divide the table into buckets, we use the CLUSTERED BY clause.
iv. Bucketing can be done together with partitioning on Hive tables, or even without partitioning.
v. Bucketed tables create almost equally distributed data file parts.
• Advantages of Bucketing in Hive
• i. Compared with non-bucketed tables, bucketed tables offer efficient sampling.
ii. Similar to partitioning, bucketed tables also offer faster query responses than non-bucketed tables.
iii. This concept offers the flexibility to keep the records in each bucket sorted by one or more columns.

Limitations of Bucketing in Hive
i. However, bucketing doesn't ensure that the table is properly populated.
ii. So, we need to handle loading data into the buckets ourselves.
• Example Use Case for Bucketing in Hive
• To understand the remaining features of Hive bucketing, let's look at an example use case: creating buckets for a sample file of user records such as the following.

first_name, last_name, address, country, city, state, post, phone1, phone2, email, web
Rebbecca, Didio, 171 E 24th St, AU, Leith, TA, 7315, 03-8174-9123, 0458-665-290, [email protected], http://www.brandtjonathanfesq.com.au

Hence, let's create the table partitioned by country and bucketed by state, and sorted in ascending order of cities.
• Creation of Bucketed Tables
• We can create bucketed tables with the help of the CLUSTERED BY clause and the optional SORTED BY clause in the CREATE TABLE statement. We can create a bucketed_user table meeting the above requirement with the help of the HiveQL below.
• CREATE TABLE bucketed_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
COMMENT 'A bucketed sorted user table'
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
As shown in the code, the bucketed column (state) and the sort column (city) are included in the table column definitions, unlike the partitioned column (country), which is not included in the table's column definitions.
• Inserting Data Into Bucketed Tables
• However, we cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH command the way we can with partitioned tables. Instead, to populate a bucketed table we need to use an INSERT OVERWRITE TABLE ... SELECT ... FROM clause reading from another table.
• Hence, we will create one temporary table in Hive with all the columns of the input file, and from that table we will copy the data into our target bucketed table.
• i. The property hive.enforce.bucketing = true plays a role similar to the hive.exec.dynamic.partition = true property used for partitioning. By setting this property, we enable bucket enforcement while loading data into the Hive table.
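
A minimal sketch of this two-step load (the staging table temp_user and the input path are hypothetical):

-- 1. Staging table that mirrors the input file's columns
CREATE TABLE temp_user (
  firstname VARCHAR(64), lastname VARCHAR(64), address STRING,
  country VARCHAR(64), city VARCHAR(64), state VARCHAR(64),
  post STRING, phone1 VARCHAR(64), phone2 STRING, email STRING, web STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/tmp/user_data.csv' INTO TABLE temp_user;

-- 2. Copy into the bucketed, partitioned target table
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
       phone1, phone2, email, web, country
FROM temp_user;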
INTRODUCTION TO
HBASE
HBASE INTRODUCTION
• HBase is a distributed column-oriented database built on top of the Hadoop file system.

• It is an open-source project and is horizontally scalable.

• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.

• One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
HDFS vs HBase
• HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
• HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access to data, whereas HBase internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.
Logically, an HBase table is laid out as a row id (rowid) plus several column families, with each column family containing its own columns (col1, col2, col3, ...).
Column Oriented and Row Oriented
• Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they have column families.

Row-Oriented Database vs Column-Oriented Database
• A row-oriented database is suitable for Online Transaction Processing (OLTP), whereas a column-oriented database is suitable for Online Analytical Processing (OLAP).
• Row-oriented databases are designed for a small number of rows and columns, whereas column-oriented databases are designed for huge tables.
• Tables: Data is stored in a table format in HBase, but here the tables are in a column-oriented format.
• Row Key: Row keys are used to search records, which makes searches fast.
• Column Families: Various columns are combined into a column family. These column families are stored together, which makes the searching process faster because data belonging to the same column family can be accessed together in a single seek.
• Column Qualifiers: Each column's name is known as its column qualifier.
• Cell: Data is stored in cells. The data is dumped into cells, which are specifically identified by row key and column qualifier.
• Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
• The HBase Architecture:
• -1. HBase consists of servers in a Master-Slave relationship. The master node is called HMaster, and the multiple Region Servers (slaves) are called HRegionServers. Each Region Server contains multiple Regions (HRegions).
• -2. Data is stored in Tables, and Tables are stored in Regions (collections of column families). When a Table becomes too big, the Table is partitioned into multiple Regions.
• -3. Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFiles). The data lives in these StoreFiles in the form of Column Families. The MemStore holds in-memory modifications to the Store (data).
• -4. A system table called .META. keeps the mapping of Regions to Region Servers. Clients read the required Region information from the .META. table and then communicate directly with the appropriate Region Server.
HBase vs RDBMS
• HBase is schema-less; it doesn't have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
• There are no transactions in HBase. An RDBMS is transactional.
• HBase has de-normalized data. An RDBMS has normalized data.
• HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big
Data.

• It hosts very large tables on top of clusters of commodity hardware.

• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase

• It is used whenever there is a need for write-heavy applications.

• HBase is used whenever we need to provide fast random access


to available data.

• Companies such as Facebook, Twitter, Yahoo, and Adobe use


HBase internally.
HBase History
Year        Event
Nov 2006    Google released the paper on BigTable.
Feb 2007    The initial HBase prototype was created as a Hadoop contribution.
Oct 2007    The first usable HBase, along with Hadoop 0.15.0, was released.
Jan 2008    HBase became a subproject of Hadoop.
Oct 2008    HBase 0.18.1 was released.
Jan 2009    HBase 0.19.0 was released.
Sept 2009   HBase 0.20.0 was released.
May 2010    HBase became an Apache top-level project.
Working with HBase commands
1) HBase general commands – opening the HBase shell from the terminal
2) create table
3) list
4) disable
5) is_disabled
6) enable
7) is_enabled
8) describe
9) drop
10) put
11) get
12) delete
13) deleteall
14) scan
15) count
16) truncate
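
A minimal sketch of a few of these commands in the HBase shell (the table emp and its column family personal are hypothetical):

create 'emp', 'personal'                   # create a table with one column family
put 'emp', '1', 'personal:name', 'raju'    # insert a cell (rowkey, family:qualifier, value)
get 'emp', '1'                             # read one row
scan 'emp'                                 # read all rows
count 'emp'                                # count rows
disable 'emp'                              # a table must be disabled before it is dropped
drop 'emp'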
Overview of SQOOP in Hadoop

• Previously, when there was no Hadoop and no concept of big data, all data used to be stored in relational database management systems.
• But nowadays, after the introduction of big data concepts, the data needs to be stored in a more concise and effective way. Thus Sqoop comes into existence.
• All the data stored in relational database management systems needs to be transferred into the Hadoop structure.
• Transferring this large amount of data manually is not possible, but with the help of Sqoop we are able to do it.
• Thus Sqoop is defined as the tool used to perform data transfer operations from relational database management systems to the Hadoop server.
Features of Sqoop
1.Parallel Import/Export
Sqoop uses the YARN framework to import and export
data. This provides fault tolerance on top of parallelism.

2. Import Results of an SQL Query


Sqoop enables us to import the results returned from an
SQL query into HDFS.

3. Connectors For All Major RDBMS Databases


Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.

4. Kerberos Security Integration
Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.

5. Provides Full and Incremental Load
Sqoop can load the entire table, or parts of the table, with a single command.
Sqoop Architecture

1. The client submits the import/ export command to


import or export data.

2. Sqoop fetches data from different databases. Here, we have


an enterprise data warehouse, document-based systems, and
a relational database. We have a connector for each of these;
connectors help to work with a range of accessible databases.

3. Multiple mappers perform map tasks to load the data onto HDFS.

4. Similarly, numerous map tasks will export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop - Import All Tables

• A tool which imports a set of tables from an RDBMS to


HDFS is what we call the Sqoop import all tables.

• Basically, here in HDFS, data from each table is stored in a


separate directory.

• The data of each table is stored in a separate directory, and the directory name is the same as the table name.
Syntax

The following syntax is used to import all tables.

$ sqoop import-all-tables (generic-args) (import-args)


$ sqoop-import-all-tables (generic-args) (import-args)

Example
• Let us take an example of importing all the tables from the userdb database.
If you are using import-all-tables, it is mandatory that every table in that database has a primary key field.
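
A minimal sketch of such an import (the connection string, username, and password are hypothetical):

$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root \
--password cloudera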

The following command is used to verify the table data imported from the userdb database into HDFS.

$ $HADOOP_HOME/bin/hadoop fs -ls

Sqoop Export
• A tool which exports a set of files from HDFS back to RDBMS.
That tool is what we call a Sqoop Export Tool.

• There is one condition for it: the target table must already exist in the database.

• However, the input files are read and parsed according to the
user-specified delimiters into a set of records.

The export command works in two modes: insert mode and update mode.

1. Insert mode: It is the default mode. In this mode, the


records from the input files are inserted into the
database table by using the INSERT statement.

2. Update mode: In update mode, Sqoop generates an UPDATE statement that replaces existing records in the database.
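
A minimal sketch of an export in update mode (the table employee, the HDFS directory, and the key column id are hypothetical):

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data \
--update-key id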

The syntax for Sqoop export is:

$ sqoop export (generic-args) (export-args)

$ sqoop-export (generic-args) (export-args)

• Example
• Let us take an example of employee data in a file in HDFS. The employee data is available in the emp_data file in the 'emp/' directory in HDFS.

It is mandatory that the table to be exported is created manually and is present in the database to which the data has to be exported.

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data

Importing data from MySQL to HDFS
In order to store the data into HDFS, we make use of Apache Hive, which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). We perform the following steps:

Step 1: Login into MySQL


Step 2: Create a database and table and insert data.
Step 3: Create a database and table in Hive into which the data should be imported.

Step 4: Run the import command below on Hadoop.

sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1

Step 5: Check in Hive whether the data has been imported successfully or not.

SQOOP VS FLUME
• Apache Sqoop is basically designed to work with any type of relational database system (RDBMS) that has basic JDBC connectivity; it can also import data from NoSQL databases like MongoDB and Cassandra, and it allows data transfer to Apache Hive or HDFS. Apache Flume works well with streaming data sources that are generated continuously in Hadoop environments, such as log files.

• Apache Sqoop's load is not driven by events, whereas Apache Flume's data loading is completely event-driven.

• Apache Sqoop is considered an ideal fit if the data is available in Teradata, Oracle, MySQL, PostgreSQL, or any other JDBC-compatible database, whereas Apache Flume is considered the best choice for moving bulk streaming data from sources like JMS or spooling directories.

• In Apache Sqoop, HDFS is the destination for importing data, whereas in Apache Flume, data is said to flow to HDFS through channels.

• Apache Sqoop has a connector-based architecture, which means the connectors know how to connect to the various data sources and fetch the data correspondingly. Apache Flume has an agent-based architecture: code written in Flume is known as an agent, which is responsible for fetching the data.

• Apache Sqoop connectors are designed specifically to work with structured data sources and to fetch data from them alone, whereas Apache Flume is specifically designed to fetch streaming data such as tweets from Twitter or log files from web servers or application servers.

• Apache Sqoop is specifically used for parallel data transfers and data imports, as it copies the data quickly, whereas Apache Flume is specifically used for collecting and aggregating data because of its distributed, reliable nature and its highly available backup routes.
