Big Data Analytics
By
S.Manasha
II MSc Computer Science
HIVE
Hive is a data warehouse and ETL tool that provides an SQL-like interface
between the user and the Hadoop Distributed File System (HDFS), and it
integrates with Hadoop.
It facilitates reading, writing and managing large datasets that are stored in
distributed storage and queried using Structured Query Language (SQL) syntax. It
is not built for Online Transaction Processing (OLTP) workloads.
It is designed to enhance scalability, extensibility, performance, fault tolerance
and loose coupling with its input formats.
HIVE DATA MODELLING
Tables - Tables in Hive are created the same way as in an RDBMS.
Partitions - Tables are organized into partitions that group similar
types of data based on the partition key.
Buckets - Data in a partition can be further divided into buckets for
efficient querying.
HIVE INTERNAL TABLES VS EXTERNAL TABLES
Internal:
Data is stored in the Hive data warehouse. The data warehouse is located at
/hive/warehouse/ on the default storage for the cluster.
Use internal tables when one of the following conditions applies:
Data is temporary.
You want Hive to manage the lifecycle of the table and data.
External:
Data is stored outside the data warehouse. The data can be stored on any
storage accessible by the cluster.
Use external tables when one of the following conditions applies:
The data is also used outside of Hive. For example, the data files are updated
by another process (that doesn't lock the files).
Data needs to remain in the underlying location, even after dropping the table.
You need a custom location, such as a non-default storage account.
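The difference shows up mainly in the CREATE and DROP statements. The following is a minimal HiveQL sketch, not a definitive setup: the table names (managed_logs, external_logs) and the path /data/shared/logs are hypothetical and used only for illustration.

```sql
-- Internal (managed) table: Hive stores the data in its warehouse directory.
CREATE TABLE managed_logs (loaded INT, logerror STRING);

-- External table: Hive records only the metadata; the data stays at LOCATION.
-- '/data/shared/logs' is a hypothetical path, assumed for this example.
CREATE EXTERNAL TABLE external_logs (loaded INT, logerror STRING)
LOCATION '/data/shared/logs';

-- Dropping the managed table removes both its metadata and its data.
DROP TABLE managed_logs;

-- Dropping the external table removes only the metadata; the files remain.
DROP TABLE external_logs;
```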
PARTITIONS
A table may be partitioned in multiple dimensions.
For example, in addition to partitioning logs by date, we might also subpartition each date
partition by country to permit efficient queries by location.
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes
a list of column definitions.
When a large amount of data has to be searched, dividing it into partitions lets a query
read only the partitions it needs.
hive> CREATE TABLE party_table (loaded INT, logerror STRING)
      PARTITIONED BY (logdt STRING, country STRING);
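Once a partitioned table exists, each load names the partition it targets, and a query that filters on a partition column only has to read the matching partitions. The following is a sketch under stated assumptions: the table name logs, its columns, and the input path are hypothetical.

```sql
-- Hypothetical table partitioned by date and then by country.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

-- Load a local file into one specific partition.
-- The input path is an assumption made for this example.
LOAD DATA LOCAL INPATH 'input/logs/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

-- List the partitions that currently exist for the table.
SHOW PARTITIONS logs;

-- Filtering on a partition column restricts the scan to matching partitions.
SELECT ts, dt, line
FROM logs
WHERE country = 'GB';
```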
BUCKETS
There are two reasons to bucket a table:
To enable more efficient queries.
To make sampling more efficient.
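Both uses can be sketched in HiveQL. The table name bucketed_users, its columns, and the bucket count of 4 are assumptions chosen for illustration.

```sql
-- Hypothetical table whose rows are distributed into 4 buckets by a hash of id.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Sampling: read roughly one quarter of the rows by selecting a single bucket.
SELECT *
FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
```

Because the sample clause uses the same column and bucket count as the table definition, Hive can answer it from one bucket instead of scanning the whole table.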
HIVE MODES
Local Mode - Used when Hadoop has one data node and the amount of data is
small. Processing is very fast on smaller datasets that are present on the
local machine.
MapReduce Mode - Used when the data in Hadoop is spread across multiple data
nodes. Processing large datasets can be more efficient in this mode.
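The mode can be influenced from a Hive session with SET commands. The following is a configuration sketch only. It assumes a Hive installation that recognizes these two properties, and defaults may differ between versions and distributions.

```sql
-- Run jobs locally instead of submitting them to the cluster.
SET mapreduce.framework.name=local;

-- Alternatively, let Hive decide automatically to run small jobs in local mode.
SET hive.exec.mode.local.auto=true;
```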
ADVANTAGES OF HIVE
Scalability
Familiar SQL-like interface
Supports partitioning and bucketing
User-defined functions
HIVE QL
HiveQL is the Hive Query Language.
DDL and DML are parts of HiveQL.
Data Definition Language (DDL) is used for creating, altering and dropping
databases, tables, views, functions and indexes.
Data Manipulation Language (DML) is used to put data into Hive tables, to
extract data to the file system, and to explore and manipulate data
with queries, grouping, filtering, joining, etc.
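The split between DDL and DML can be illustrated with a short sketch. The table employees, its columns, the sample row, and the output directory are all hypothetical. INSERT ... VALUES is assumed to be available (Hive 0.14 or later).

```sql
-- DDL: create and alter a table.
CREATE TABLE employees (name STRING, dept STRING, salary DOUBLE);
ALTER TABLE employees ADD COLUMNS (hired STRING);

-- DML: put data into the table (illustrative row only).
INSERT INTO TABLE employees VALUES ('Asha', 'eng', 50000.0, '2020-01-01');

-- DML: explore the data with filtering and grouping.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 0
GROUP BY dept;

-- DML: extract query results to the file system (hypothetical directory).
INSERT OVERWRITE DIRECTORY '/tmp/avg_salary'
SELECT dept, AVG(salary)
FROM employees
GROUP BY dept;

-- DDL: drop the table when it is no longer needed.
DROP TABLE employees;
```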
COMMANDS
CREATE DATABASE db_name; -- To create a database in Hive.
USE db_name; -- To use the database in Hive.
DROP DATABASE db_name; -- To delete the database in Hive.
SHOW DATABASES; -- To see the list of databases.
THANK YOU