Unit 6
Apache HBase
• HBase is an open-source, column-oriented, non-relational (NoSQL) database that runs on top of Apache Hadoop.
• HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases. It is well suited for real-time data processing
or random read/write access to large volumes of data.
• Unlike relational database systems, HBase does not support a structured query
language like SQL; in fact, HBase isn’t a relational data store at all. HBase
applications are written in Java, much like a typical MapReduce application.
HBase also supports writing applications in Apache Avro, REST and Thrift.
• An HBase system is designed to scale linearly. It comprises a set of standard tables
with rows and columns, much like a traditional database. Each table must have an
element defined as a primary key, and all access attempts to HBase tables must use
this primary key.
• Avro, as a component, supports a rich set of primitive data types including
numeric, binary data and strings, and a number of complex types including arrays,
maps, enumerations and records. A sort order can also be defined for the data.
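As a small illustration of those Avro types, a hypothetical log-record schema (the record and field names here are invented for the sketch) could look like:

    {
      "type": "record",
      "name": "LogRecord",
      "fields": [
        {"name": "host",  "type": "string"},
        {"name": "bytes", "type": "long"},
        {"name": "tags",  "type": {"type": "array", "items": "string"}},
        {"name": "level", "type": {"type": "enum", "name": "Level",
                                   "symbols": ["INFO", "WARN", "ERROR"]}}
      ]
    }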
• HBase relies on Apache ZooKeeper for coordination. ZooKeeper is built
into HBase, but if you’re running a production cluster, it’s suggested that you have a
dedicated ZooKeeper cluster that is integrated with your HBase cluster.
• HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications.
• An HBase column represents an attribute of an object. For example, if a table
stores diagnostic logs from servers, each row might be a log record, and a typical
column could be the timestamp of when the log record was written, or the server
name where the record originated.
• HBase allows for many attributes to be grouped together into column families, such
that the elements of a column family are all stored together. This is different from a
row-oriented relational database, where all the columns of a given row are stored
together. With HBase you must predefine the table schema and specify the column
families. However, new columns can be added to families at any time, making the
schema flexible and able to adapt to changing application requirements.
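A minimal HBase shell sketch of this model (the table and family names are invented): the column families are declared when the table is created, but new columns inside a family can be added on any write, with no schema change:

    create 'logs', 'meta', 'payload'             # table with two predefined column families
    put 'logs', 'row1', 'meta:host', 'server1'   # write one cell
    put 'logs', 'row1', 'meta:level', 'ERROR'    # a brand-new column in 'meta', added on the fly
    scan 'logs'                                  # list every row and cell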
• Just as HDFS has a NameNode and slave nodes, and MapReduce has JobTracker
and TaskTracker slaves, HBase is built on similar concepts. In HBase a master node
manages the cluster and region servers store portions of the tables and perform the
work on the data. In the same way that HDFS has some enterprise concerns due to the
availability of the NameNode, HBase is also sensitive to the loss of its master node.
HBase vs. RDBMS
• An RDBMS has a fixed, row-oriented schema, scales vertically, and supports SQL,
joins, and ACID transactions; it suits structured, normalized data of moderate volume.
• HBase has a flexible, column-family-oriented schema, scales horizontally on
commodity hardware, and offers no SQL or joins (access is by row key); it suits
sparse, denormalized data at very large volume.
HBase Shell Commands
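A few of the most common shell commands, as a sketch ('mytable' and the column family 'cf' are placeholder names):

    status                               # cluster status
    list                                 # list all tables
    create 'mytable', 'cf'               # create a table with one column family
    describe 'mytable'                   # show the table's schema
    put 'mytable', 'r1', 'cf:c1', 'v1'   # insert or update a cell
    get 'mytable', 'r1'                  # read a single row
    scan 'mytable'                       # read all rows
    count 'mytable'                      # count the rows
    disable 'mytable'                    # tables must be disabled before dropping
    drop 'mytable'                       # delete the table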
Apache Sqoop
What is Sqoop and Why Use Sqoop?
• Sqoop is a tool used to transfer bulk data between Hadoop and external
datastores such as relational databases and enterprise data warehouses.
• Before Sqoop, data had to be moved with hand-written scripts or custom
MapReduce jobs, which was slow and error-prone. The solution was Sqoop. Using
Sqoop in Hadoop helped to overcome all the challenges of this traditional approach,
and it could load bulk data from an RDBMS to Hadoop with ease.
Sqoop Features
• Parallel import/export − Sqoop uses the YARN framework to import and export
data. This provides fault tolerance on top of parallelism.
• Import results of an SQL query − Sqoop enables us to import the results returned
from an SQL query into HDFS (see the free-form query example after this list).
• Connectors for all major RDBMSs − Sqoop provides connectors for multiple
RDBMSs, such as MySQL and Microsoft SQL Server.
• Kerberos security integration − Sqoop supports the Kerberos computer network
authentication protocol, which enables nodes communicating over an insecure
network to prove their identity to one another in a secure manner.
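For instance, a free-form query import could look like the sketch below (the JDBC URL, credentials, and column names are placeholders). Sqoop requires the literal $CONDITIONS token in the WHERE clause so that each parallel mapper can substitute its own split predicate:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username dbuser -P \
      --query 'SELECT id, name, amount FROM orders WHERE $CONDITIONS' \
      --split-by id \
      --target-dir /user/hadoop/orders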
1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data
warehouse, document-based systems, and a relational database. We have a
connector for each of these; connectors help to work with a range of accessible
databases.
3. Multiple mappers perform map tasks to load the data onto HDFS.
4. Similarly, numerous map tasks export the data from HDFS to the RDBMS
using the Sqoop export command.
Sqoop Import
1. In this example, a company’s data is present in the RDBMS. Sqoop first
performs an introspection of the database to gather the metadata it needs
(such as primary key information).
2. It then submits a map-only job. Sqoop divides the input dataset into splits and
uses individual map tasks to push the splits to HDFS.
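A typical import command for this flow might be the following sketch (host, database, credentials, and table names are placeholders); --split-by names the column used to divide the table into splits, and -m sets the number of parallel map tasks:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/company \
      --username dbuser -P \
      --table employees \
      --split-by emp_id \
      --target-dir /user/hadoop/employees \
      -m 4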
Sqoop Export
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses
individual map tasks to push the splits to the RDBMS.
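A matching export sketch (placeholder names again); the target table must already exist in the database, and --export-dir is the HDFS directory whose files are written back:

    sqoop export \
      --connect jdbc:mysql://dbhost:3306/company \
      --username dbuser -P \
      --table employees_summary \
      --export-dir /user/hadoop/employees_summary \
      -m 4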
Sqoop Processing
• Sqoop processing runs in the Hadoop cluster as map-only jobs: in both directions,
mappers work on splits of the data in parallel, so the transfer between the RDBMS
and HDFS is distributed and fault tolerant.

Apache Spark
• Apache Spark is a lightning-fast cluster computing technology designed for fast
computation. It extends the MapReduce model to efficiently support more types of
computations, including interactive queries and stream processing. The main feature
of Spark is its in-memory cluster computing, which increases the processing speed
of an application.
• Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries and streaming. Apart from supporting all
these workloads in a single system, it reduces the management burden of
maintaining separate tools.
Features of Apache Spark
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster
in memory, and 10 times faster when running on disk. This is possible by reducing
the number of read/write operations to disk; the intermediate processing data is
stored in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala and
Python. Therefore, you can write applications in different languages. Spark also
comes with 80 high-level operators for interactive querying.
• Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but also
SQL queries, streaming data, machine learning (ML), and graph algorithms.
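The short PySpark sketch below illustrates these points (it assumes a local Spark installation; the application name and data are invented). The RDD is cached in memory, so the second pass over it avoids recomputation, and the same data is then queried with SQL:

    # in-memory caching, map/reduce, and SQL over one dataset
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unit6-demo").getOrCreate()
    sc = spark.sparkContext

    nums = sc.parallelize(range(1, 1001)).cache()                  # kept in memory after first use
    total = nums.map(lambda x: x * x).reduce(lambda a, b: a + b)   # classic map/reduce
    evens = nums.filter(lambda x: x % 2 == 0).count()              # second pass reuses the cache

    df = nums.map(lambda x: (x,)).toDF(["n"])                      # expose the same data to SQL
    df.createOrReplaceTempView("nums")
    spark.sql("SELECT COUNT(*) AS evens FROM nums WHERE n % 2 = 0").show()

    print(total, evens)
    spark.stop()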
Components of Spark
Spark Streaming
• Spark Streaming leverages Spark Core’s fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Dataset) transformations on those mini-batches of data.
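A brief sketch of the mini-batch model using the classic DStream API (it assumes text lines arriving on a local socket; the host and port are placeholders):

    # word count over 10-second micro-batches
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, 10)                     # each 10-second batch becomes an RDD
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                    # print each batch's counts
    ssc.start()
    ssc.awaitTermination()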
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework on top of Spark, which benefits
from Spark’s distributed, memory-based architecture.
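A tiny illustration with Spark’s DataFrame-based ML API (the two training points are invented toy data):

    # fit a logistic regression on two hand-made examples
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
    train = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.0]), 0.0),
         (Vectors.dense([1.0, 0.0]), 1.0)],
        ["features", "label"])
    model = LogisticRegression(maxIter=10).fit(train)
    print(model.coefficients)
    spark.stop()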
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides
an API for expressing graph computation that can model user-defined graphs
by using the Pregel abstraction API. It also provides an optimized runtime for this
abstraction.
Domain Scenarios of Apache Spark