Presentation of Big Data

This document discusses how large amounts of data can be collected, aggregated, and moved between the Hadoop Distributed File System and relational database management systems using tools like Sqoop and Flume. It explains that Flume is used for ingesting streaming data into Hadoop, while Sqoop can import data from relational databases into Hadoop.


Big Data refers to information that cannot be processed or analyzed using traditional processes or tools such as a relational database management system (RDBMS). Analytical processing requires large amounts of data, and this data is loaded from different sources into Hadoop clusters (i.e. clusters used for storing and analyzing huge amounts of data in a distributed manner). Sourcing this bulk data into Hadoop clusters from different sources raises problems such as maintaining and ensuring data consistency, since each data source may hold data in a different form and structure. The best way of collecting, aggregating, and moving large amounts of data between the Hadoop Distributed File System (HDFS) and an RDBMS is to use tools such as Sqoop or Flume.

On the one hand, you may have a Hadoop cluster used for processing and storing large amounts of data; on the other hand, you have an application that produces a large amount of data, or a legacy system that stores its data in a relational database. How do you connect the two? That is exactly where Flume and Sqoop come in.
Flume is used for ingesting streaming data into Hadoop; it has three major components: sources, channels, and sinks.
Sqoop is used to port data from an existing relational database into Hadoop; for example, we can use Sqoop to import data from MySQL into Hadoop.
Generally, the data comes from two kinds of sources: either an application that produces data on a regular basis, or a traditional relational database management system such as Oracle DB or SQL Server. In both cases you have sources that contain data and a destination that is a Hadoop ecosystem data store.

The question now is: how do we get our data from these sources into Hadoop?
Of course, after the introduction you will say Flume and Sqoop, but let us explain how this process is done, or rather, what the steps would be in the absence of these tools.
Normally, every Hadoop ecosystem technology exposes Java APIs (application programming interfaces), and you can use these APIs directly to write data to, for example, HDFS, HBase, or Cassandra.
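
As a rough illustration, here is a minimal sketch of writing a record to HDFS through its Java FileSystem API; the namenode URI, output path, and record content are placeholder assumptions, not values from this document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // Create a new file in HDFS and write one line to it.
        try (FSDataOutputStream out = fs.create(new Path("/data/events/sample.txt"))) {
            out.writeBytes("one event record\n");
        }
        fs.close();
    }
}

Writing a single file like this is straightforward; the difficulties described next come from doing it continuously and reliably for a stream of small events.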
But there are a few reasons why that can be harder than it looks, depending on whether you are transferring data from an application or bulk-transferring data from an RDBMS.
Let us start with the application case. Suppose that a number of events produce data for this application, and the data needs to be stored as the events occur; this is called streaming data. To handle it yourself you would have to:
1- First integrate your application with HDFS's Java API.
2- Create a mechanism to buffer your data, because HDFS files have to be large to take advantage of its distributed architecture. This means buffering the data in memory or in an intermediate file before writing it to HDFS.
3- Make the buffer layer fault-tolerant and non-lossy: you should not lose any data even if there is a crash, so you need a guarantee that no data will be lost.
All of these difficulties and problems are taken care of by Flume.
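
As a sketch of what this looks like from the application side, the snippet below hands an event to a Flume agent through Flume's client SDK instead of writing to HDFS directly; the agent hostname and port are placeholder assumptions and presume an agent configured with an Avro source.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Placeholder host/port of a Flume agent running an Avro source.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);
        // Build one event and send it; the agent's channel and sink handle
        // buffering and reliable delivery to the store (for example HDFS).
        Event event = EventBuilder.withBody("one application event", StandardCharsets.UTF_8);
        client.append(event);
        client.close();
    }
}

The buffering and fault tolerance listed above are handled by the agent's channel and sink rather than by the application itself.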

There are also a few problems with taking data from an RDBMS and integrating directly with a Java API.
Let us say that you have a legacy system that uses an RDBMS and you want to port its data to HDFS.
The first option is to:
1/ Dump all your tables into large files and then manually copy these files to HDFS.
The second option is to:
2/ Use scripts that read the data from the RDBMS and then write it out to HDFS, as in the sketch below.
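
To make the second option concrete, here is a minimal sketch of such a hand-written transfer: it reads rows over JDBC and writes them to HDFS as CSV. The connection URL, credentials, table, columns, and output path are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManualRdbmsToHdfs {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC connection details for a MySQL database.
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/salesdb", "user", "password");
        FileSystem fs = FileSystem.get(new Configuration());
        try (Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT id, amount FROM orders");
             FSDataOutputStream out = fs.create(new Path("/warehouse/orders.csv"))) {
            // Write each row as a comma-separated line.
            while (rows.next()) {
                out.writeBytes(rows.getLong("id") + "," + rows.getDouble("amount") + "\n");
            }
        }
        db.close();
    }
}

Such a script has to be maintained, parallelized, and restarted by hand whenever it fails, which is exactly the work Sqoop automates.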
Fortunately, thanks to Sqoop, we do not need to implement or choose either of these options by hand.

So Flume and Sqoop are technologies developed to isolate and abstract the transport of data between a source and the data store.
Both are open-source technologies developed by Apache. Their roles in the Hadoop ecosystem are somewhat similar, but the use cases for each are slightly different.

The first difference is that Flume is designed for continuously ingesting streaming event data, while Sqoop is designed for bulk transfers between relational databases and Hadoop.
