Cse3002 Big Data m2
• Apache Hive is a data warehousing tool built on top of Hadoop and is used for data
analysis.
• Its query language, HiveQL, is similar to SQL and is used for managing and querying structured data.
• This language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers.
• Hive, an open source data warehousing framework based on Hadoop, was developed by the
Data Infrastructure Team at Facebook.
• Hive is also one of the technologies being used to address the data requirements at
Facebook.
• Hive is very popular with users internally at Facebook and is used to run
thousands of jobs on the cluster, by hundreds of users, for a wide variety of
applications.
• The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and regularly
loads 15 TB of data on a daily basis.
Hive Architecture:
Where to Use Hive
Limitations of Hive:
SQL
• SQL stands for Structured Query Language.
• SQL is the language that lets us work with databases; a database does not
understand English or any other natural language.
• Just as we use Java or C# to create software, we use SQL to work with databases.
• SQL is the standard language of databases and is also pronounced "Sequel" by many
people.
• SQL itself is a declarative language.
• SQL deals with structured data and is used with RDBMSs, that is, relational database
management systems.
• SQL supports a schema for data storage.
• We use SQL when we need frequent modification of records; SQL is used for better
performance in such cases.
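A minimal illustration of this declarative style, using a hypothetical employees table (the table and column names are assumptions for illustration only):
-- Create a simple table with a fixed schema.
CREATE TABLE employees (
  id     INT PRIMARY KEY,
  name   VARCHAR(64),
  salary DECIMAL(10, 2)
);
-- Declarative query: we state WHAT we want, not HOW to retrieve it.
SELECT name, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;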
TIMESTAMP
• It supports the traditional UNIX timestamp with optional nanosecond precision.
• As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
• As a floating-point numeric type, it is interpreted as a UNIX timestamp in
seconds with decimal precision.
• As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal places of precision).
DATES
The DATE value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. It does not, however, carry the time of day. The range of the
DATE type lies between 0000-01-01 and 9999-12-31.
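As a small HiveQL sketch of these two types (the event_log table and its columns are illustrative assumptions, not from the notes):
-- DATE stores only year, month and day; TIMESTAMP stores date and time
-- with up to 9 digits of fractional seconds.
CREATE TABLE event_log (
event_name STRING,
event_date DATE,
event_time TIMESTAMP
);
-- String literals are interpreted in the formats described above.
SELECT event_name
FROM event_log
WHERE event_date = CAST('2021-06-15' AS DATE)
AND event_time >= CAST('2021-06-15 10:30:45.123' AS TIMESTAMP);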
Hence, let us create the table partitioned by country, bucketed by state,
and sorted in ascending order of city.
• Creation of Bucketed Tables
• With the help of the CLUSTERED BY clause and the optional SORTED BY clause in the
CREATE TABLE statement, we can create bucketed tables. The bucketed_user table
meeting the above requirement can be created with the HiveQL below.
CREATE TABLE bucketed_user(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
state VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
COMMENT 'A bucketed sorted user table'
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
As shown in the code for the state and city columns, bucketed columns are included in the
table column definitions, unlike partitioned columns, which are not included in the
table column definitions.
• Inserting Data into Bucketed Tables
• However, we cannot directly load bucketed tables with the LOAD DATA (LOCAL) INPATH
command (this is similar to the situation with partitioned tables). Instead, to populate a
bucketed table we need to use an INSERT OVERWRITE TABLE … SELECT … FROM clause reading
from another table.
• Hence, we will create one temporary table in Hive with all the columns of the input file,
and from that table we will copy the data into our target bucketed table (see the sketch after this list).
• The property hive.enforce.bucketing = true plays the same role for bucketing that
hive.exec.dynamic.partition = true plays for partitioning. By setting this property, we enable
bucketing to be enforced while loading data into the Hive table.
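A minimal sketch of this two-step load, assuming a staging table named temp_user and a local input file path (both are illustrative assumptions):
-- Step 1: staging table matching the layout of the input file.
CREATE TABLE temp_user(
firstname VARCHAR(64), lastname VARCHAR(64), address STRING,
country VARCHAR(64), city VARCHAR(64), state VARCHAR(64),
post STRING, phone1 VARCHAR(64), phone2 STRING,
email STRING, web STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/user/user_table.txt' INTO TABLE temp_user;
-- Step 2: enable bucketing (and dynamic partitioning), then copy the data
-- into the bucketed, partitioned target table.
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
phone1, phone2, email, web, country
FROM temp_user;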
INTRODUCTION TO HBASE
• HBase is a distributed, column-oriented database built on top of the Hadoop file system.
• HBase is a data model, similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop file system.
• One can store data in HDFS either directly or through HBase. Data consumers
read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
file system and provides read and write access.
HDFS vs HBase
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
• HDFS provides high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
HBase table layout: each row, identified by a Rowid (row key), spans several column families, and each column family contains its own columns (col1, col2, col3, …).
Column Oriented and Row Oriented
• Column-oriented databases are those that store data tables as sections of
columns of data, rather than as rows of data. In short, they have column
families.
• Row-oriented databases are suitable for Online Transaction Processing (OLTP); column-oriented databases are suitable for Online Analytical Processing (OLAP).
• Row-oriented databases are designed for a small number of rows and columns; column-oriented databases are designed for huge tables.
• Tables: Data is stored in a table format in HBase, but here the tables are
in column-oriented format.
• Row Key: Row keys are used to search records, which makes searches
fast.
• Column Families: Various columns are combined in a column family.
These column families are stored together, which makes the
searching process faster because data belonging to the same column
family can be accessed together in a single seek.
• Column Qualifiers: Each column's name is known as its column
qualifier.
• Cell: Data is stored in cells. The data is dumped into cells, which are
uniquely identified by the row key and column qualifier.
• Timestamp: A timestamp is a combination of date and time.
Whenever data is stored, it is stored with its timestamp. This makes it
easy to search for a particular version of the data.
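A brief HBase shell sketch of these concepts; the table name customer and the column families personal and sales are illustrative assumptions:
# Create a table 'customer' with two column families.
create 'customer', 'personal', 'sales'
# Put cells: row key 'row1', column family:qualifier, value.
put 'customer', 'row1', 'personal:name', 'Alice'
put 'customer', 'row1', 'sales:order_total', '250'
# Read the row back; each cell is returned with its timestamp (version).
get 'customer', 'row1'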
• The HBase Architecture:
• 1. It consists of servers in a master-slave relationship. The master node is
called the HMaster, and the multiple Region Servers (slaves), each holding a collection of
column families, are called HRegionServers. Each Region Server
contains multiple Regions (HRegions).
• 2. Data is stored in Tables, which are stored in Regions (collections of
column families). When a Table becomes too big, the Table is
partitioned into multiple Regions.
• 3. Each Region Server contains a Write-Ahead Log (called the HLog) and
multiple Regions. Each Region, in turn, is made up of a MemStore and
multiple StoreFiles (HFiles). The data lives in these StoreFiles in the
form of Column Families. The MemStore holds in-memory
modifications to the Store (data).
• 4. A system table called .META. keeps the mapping of Regions to
Region Servers. Clients read the required Region information from
the .META. table and then communicate directly with the appropriate Region
Server.
HBase vs RDBMS
• HBase is schema-less; it doesn't have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for small tables, and is hard to scale.
Oct 2007: The first usable HBase, along with Hadoop 0.15.0, was released.
Features of Sqoop
1. Parallel Import/Export
Sqoop uses the YARN framework to import and export
data, which provides fault tolerance on top of parallelism (see the sketch below).
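For instance, a hedged sketch of a parallel import; the connection string and table name are assumptions, and -m sets the number of parallel map tasks:
$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  -m 4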
4. Kerberos Security Integration
Sqoop supports the Kerberos computer-network
authentication protocol, which enables nodes
communicating over an insecure network to
authenticate users securely.
Provides Full and Incremental Load
Sqoop can load the entire table or parts of the
table with a single command.
Sqoop Architecture
4. Similarly, numerous map tasks will export the
data from HDFS onto the RDBMS when the Sqoop
export command is used.
Sqoop - Import All Tables
Syntax
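As a hedged sketch, the general form of the command (exact options depend on the Sqoop version) is:
$ sqoop import-all-tables (generic-args) (import-args)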
Example
• Let us take an example of importing all tables from the userdb
database. The list of tables that the database userdb contains is as
follows.
If you are using import-all-tables, it is
mandatory that every table in that database
has a primary key field.
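A hedged example of the command for the userdb database described above (the connection string and credentials are assumptions):
$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root
The hadoop fs -ls command that follows can then be used to verify the per-table directories created in HDFS.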
$ $HADOOP_HOME/bin/hadoop fs -ls
Sqoop Export
• The tool that exports a set of files from HDFS back to an RDBMS
is what we call the Sqoop export tool.
• There is one condition for it: the target table must already exist
in the database.
• The input files are read and parsed according to the
user-specified delimiters into a set of records.
The export command works in two modes: insert mode
and update mode.
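As a hedged illustration of update mode, Sqoop's --update-key option selects the column used to match existing rows (the table, column, and paths here are assumptions):
# Update mode: rows already present in the target table are updated by
# matching on the 'id' column; without --update-key, insert mode is used.
$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data \
  --update-key id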
The syntax for Sqoop export is:
• Example
• Let us take an example of employee data in a file in HDFS. The employee
data is available in the emp_data file in the 'emp/' directory in HDFS. The emp_data
file is as follows.
It is mandatory that the table to be exported is created manually and is present in the
database into which it has to be exported.
The following command is used to export the table data (which is in the
emp_data file on HDFS) to the employee table in the db database of the MySQL
database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
Importing data from MySQL to HDFS
In order to store data in HDFS, we make use of Apache Hive, which
provides an SQL-like interface between the user and the Hadoop
Distributed File System (HDFS) and integrates with Hadoop. We perform the
following steps:
Step 4: Run the import command below on Hadoop.
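A hedged sketch of such an import command; the database, table, and credentials are placeholders, and --hive-import loads the imported data into a matching Hive table:
$ sqoop import \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --hive-import \
  -m 1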
SQOOP VS FLUME
• Apache Sqoop is considered an ideal fit if the data is available in Teradata, Oracle, MySQL, PostgreSQL or any other JDBC-compatible database. Apache Flume is considered the best choice when we are talking about moving bulk streaming data from sources like JMS or spooling directories.
• In Apache Sqoop, HDFS is the destination for importing data. In Apache Flume, data flows to HDFS through channels.
• Apache Sqoop has a connector-based architecture, which means the connectors know a great deal about connecting with the various data sources and fetch data correspondingly. Apache Flume has an agent-based architecture; code written in Flume is known as an agent, which is responsible for fetching the data.
• Apache Sqoop connectors are designed specifically to work with structured data sources and to fetch data from them alone. Apache Flume is specifically designed to fetch streaming data, such as tweets from Twitter or log files from web servers or application servers.