Module-4
Introduction to Hive
WHAT IS HIVE?
• Hive is a Data Warehousing tool that sits on top of Hadoop.
History of Hive
Recent releases of Hive
Hive Features
1. Its query language, HQL, is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps, as illustrated in the sketch after this list.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom types and custom functions can be defined.
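A minimal HiveQL sketch of features 3 and 4, assuming a hypothetical employee table (the table and column names below are illustrative, not from the notes):
CREATE TABLE employee (
name STRING,
salary FLOAT,
dept STRING,
skills ARRAY<STRING>,
contact MAP<STRING, STRING>,
address STRUCT<city:STRING, pin:INT>
);
SELECT dept, COUNT(*)
FROM employee
WHERE salary > 50000
GROUP BY dept
ORDER BY dept;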
HIVE ARCHITECTURE
• Hive Architecture is depicted in Figure.
Hive architecture
• The various parts are as follows:
1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to
interact with Hive.
2. Hive Web Interface: It is a simple Graphical User Interface to interact with Hive and to
execute queries.
3. Hive Server: This is an optional server. This can be used to submit Hive Jobs from a
remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to
connect to Hive and submit jobs to it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A
Metastore consists of the following:
➢ Metastore service: Offers an interface to Hive.
➢ Database: Stores data definitions, mappings to the data and others.
• The metadata which is stored in the metastore includes IDs of Database, IDs of Tables,
IDs of Indexes, etc., the time of creation of a Table, the Input Format used for a Table, the
Output Format used for a Table, etc.
• The metastore is updated whenever a table is created or deleted from Hive.
• There are three kinds of metastore.
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process
is allowed to connect to the metastore at a time. This is the default metastore for Hive and uses
an embedded Apache Derby database. In this mode, both the database and the metastore service
run embedded in the main Hive Server process. Figure shows an Embedded Metastore.
Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. Local
metastore allows multiple connections at a time. In this mode, the Hive metastore service runs
in the main Hive Server process, but the metastore database runs in a separate process, and
can be on a separate host. Figure shows a Local Metastore.
Local Metastore
3. Remote Metastore: In this mode, the Hive driver and the metastore interface run in different
JVMs (which can run on different machines as well) as shown in Figure. This way the database can
be firewalled from the Hive user, and the database credentials are completely isolated from
the users of Hive.
Remote Metastore
HIVE FILE FORMATS
Text File
The default file format is text file. In this format, each record is a line in the file. In a text file,
different control characters are used as delimiters: ^A (octal 001) separates fields, ^B (octal 002)
separates the elements of an array or struct, ^C (octal 003) separates the key from the value in a
map, and \n terminates a record. The term field is used when overriding the default delimiter.
CSV and TSV files are supported as text files, and JSON or XML documents can also be stored
as text files.
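As an illustration, a table stored as a text file with the default field delimiter overridden to a tab (as in a TSV file) might be declared as follows; this is a sketch, and the table name is illustrative:
CREATE TABLE student_text (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;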
Sequential File
Sequential files are flat files that store binary key-value pairs. They include compression support,
which reduces CPU and I/O requirements.
RCFile (Record Columnar File)
RCFile stores data in a column-oriented manner, which makes aggregation operations
inexpensive. For example, consider a table which contains four columns, as shown in the
Table below.
Table: A table with four columns
Instead of only partitioning the table horizontally like a row-oriented DBMS (row-store),
RCFile partitions the table first horizontally and then vertically to serialize the data. Based on
a user-specified value, the table is first partitioned horizontally into multiple row groups.
As depicted in the Table below, the above table is partitioned into two row groups, with three
rows as the size of each row group.
Table: Table with two row groups
Next, within every row group, RCFile partitions the data vertically like a column-store, so the
table will be serialized as shown in the Table below.
Table: Table in RCFile Format
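The file format of a Hive table is chosen through the STORED AS clause. A minimal sketch for the sequential and RCFile formats (table names are illustrative):
CREATE TABLE student_seq (rollno INT, name STRING, gpa FLOAT)
STORED AS SEQUENCEFILE;
CREATE TABLE student_rc (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;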
Database
A database is like a container for data. It has a collection of tables which houses the data.
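The clauses described below are easiest to read against a concrete statement. The following sketch is representative; the comment string is illustrative, while the "Creator"/"JOHN" property matches the description that follows:
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'Database to hold student details'
WITH DBPROPERTIES ('Creator' = 'JOHN');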
IF NOT EXISTS: It is an optional clause. The create database statement with the "IF NOT
EXISTS" clause creates the database only if it does not already exist. If a database with the
same name already exists, the statement simply succeeds without creating anything and without
raising an error.
COMMENT: This is to provide short description about the database.
WITH DBPROPERTIES: It is an optional clause. It is used to specify properties of the
database in the form of (key, value) pairs. In the above example, "Creator" is the
key and "JOHN" is the value.
We can use "SCHEMA" in place of "DATABASE" in this command.
Note: We have not specified the location where the Hive database will be created. By default,
all Hive databases are created under the default warehouse directory (set by the property
hive.metastore.warehouse.dir) as /user/hive/warehouse/database_name.db. But if we want to
specify our own location, then the optional LOCATION clause can be specified.
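For instance, a one-line sketch with an illustrative HDFS path (the path is hypothetical):
CREATE DATABASE STUDENTS LOCATION '/user/hive/mydatabases/students.db';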
By default, SHOW DATABASES lists all the databases available in the metastore. We can
use "SCHEMAS" in place of "DATABASES" in this command. The command has an
optional "Like" clause. It can be used to filter the database names using regular expressions
such as "*", "?", etc.
SHOW DATABASES LIKE "Stu*"
SHOW DATABASES like "Stud???s"
There is no command to show the current database, but the following setting keeps printing the
current database name as a suffix in the command-line prompt.
set hive.cli.print.current.db=true;
Now assume that the database "STUDENTS" has 10 tables within it. How do we delete the
complete database along with the tables contained therein? Use the command:
DROP DATABASE STUDENTS CASCADE;
By default the mode is RESTRICT which implies that the database will NOT be dropped if it
contains tables.
Note: The complete syntax is as follows:
DROP DATABASE [IF EXISTS] database_name [RESTRICT | CASCADE]
Tables
Hive provides two kinds of table:
1. Internal or Managed Table
2. External Table
Managed Table
1. Hive stores managed tables under the Hive warehouse folder.
2. The complete life cycle of table and data is managed by Hive.
3. When the internal table is dropped, it drops the data as well as the metadata.
When you create a table in Hive, by default it is an internal or managed table. If one needs to
create an external table, one will have to use the keyword "EXTERNAL".
To check whether an existing table is managed or external, use the below syntax:
DESCRIBE FORMATTED tablename;
It displays the complete metadata of a table. You will see a row called Table Type which will
display either MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED STUDENT
External or Self-Managed Table
1. When the table is dropped, it retains the data in the underlying location.
2. External keyword is used to create an external table.
3. Location needs to be specified to store the dataset in that particular location; see the sketch below.
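A minimal sketch of an external table definition; the table name EXT_STUDENT is reused from the loading example that follows, while the HDFS location is hypothetical:
CREATE EXTERNAL TABLE EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/external/student';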
Loading Data into Table from File
Let us understand the difference between INTO TABLE and OVERWRITE TABLE with an
example: Assume the "EXT_STUDENT" table already had 100 records and the "student.tsv"
file has 10 records. After issuing the LOAD DATA statement with the INTO TABLE clause,
the table "EXT_STUDENT" will contain 110 records; however, the same LOAD DATA
statement with the OVERWRITE clause will wipe out all the former content from the table
and then load the 10 records from the data file.
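The two variants, as a sketch (the local path to student.tsv is illustrative):
LOAD DATA LOCAL INPATH '/home/hduser/student.tsv' INTO TABLE EXT_STUDENT; -- appends: 100 + 10 = 110 records
LOAD DATA LOCAL INPATH '/home/hduser/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT; -- replaces: table holds only the 10 new records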
Collection Data Types
Querying Table
Partitions
In Hive, the query reads the entire dataset even though a WHERE clause filter is specified on a
particular column. This becomes a bottleneck in most MapReduce jobs as it involves a huge
amount of I/O. So it is necessary to reduce the I/O required by the MapReduce job to
improve the performance of the query. A very common method to reduce I/O is data
partitioning.
Partitions split the larger dataset into more meaningful chunks.
Partition is of two types:
1. STATIC PARTITION: It is up to the user to specify the partition (the segregation unit)
into which the data from the file is to be loaded.
2. DYNAMIC PARTITION: The user is required to simply state the column based on which
the partitioning will take place. Hive will then create partitions based on the unique values in
that column.
Points to consider as you create partitions:
1. STATIC PARTITIONING implies that the user controls everything from defining the
PARTITION column to loading data into the various partitioned folders.
2. DYNAMIC PARTITIONING means Hive will intelligently get the distinct values of the
partitioned column and segregate data into respective partitions. There is no manual
intervention.
By default, dynamic partitioning is enabled in Hive. Also, by default it is strict, implying that
one is required to do one level of STATIC partitioning before Hive can perform DYNAMIC
partitioning inside this STATIC segregation unit.
In order to go with fully dynamic partitioning, we have to set the below property to nonstrict in
Hive.
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Static Partition
Static partitions comprise columns whose values are known at compile time.
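A hedged sketch of static partitioning, where the user names the partition while loading; the table, the state column and the local file path are illustrative:
CREATE TABLE student_part (rollno INT, name STRING, gpa FLOAT)
PARTITIONED BY (state STRING);
LOAD DATA LOCAL INPATH '/home/hduser/student_ca.tsv'
INTO TABLE student_part PARTITION (state = 'CA');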
Dynamic Partition
Dynamic partitions have columns whose values are known only at execution time.
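With dynamic partitioning, Hive creates one partition per distinct value of the partition column. A sketch continuing the hypothetical student_part table above; the staging table student_stage is also hypothetical, and the partition column must come last in the SELECT list:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE student_part PARTITION (state)
SELECT rollno, name, gpa, state FROM student_stage;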
Bucketing
Bucketing is similar to partitioning; however, there is a subtle difference between the two. With
partitioning, you need to create a partition for each unique value of the column. This may lead
to situations where you end up with thousands of partitions. This can be avoided by using
bucketing, in which you can limit the number of buckets that will be created.
A bucket is a file, whereas a partition is a directory.
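A minimal sketch of a bucketed table; the table name and the choice of four buckets are illustrative:
set hive.enforce.bucketing=true;
CREATE TABLE student_bucketed (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (rollno) INTO 4 BUCKETS;
INSERT OVERWRITE TABLE student_bucketed
SELECT rollno, name, gpa FROM EXT_STUDENT;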
Views
In Hive, view support is available only from version 0.6 onwards. Views are purely logical
objects.
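A sketch of a view over the STUDENT table; the view name and the GPA filter are illustrative:
CREATE VIEW topper_student AS
SELECT rollno, name, gpa FROM STUDENT WHERE gpa > 4.0;
SELECT * FROM topper_student;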
Sub-Query
In Hive, sub-queries are supported only in the FROM clause (as of Hive 0.12). You need to specify
a name for the sub-query because every table in a FROM clause must have a name. The columns in
the sub-query select list should have unique names. The columns in the sub-query select list are
available to the outer query just like the columns of a table.
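A sketch of a FROM-clause sub-query over the STUDENT table; the alias t and the GPA threshold are illustrative:
SELECT t.name, t.avg_gpa
FROM (SELECT name, AVG(gpa) AS avg_gpa FROM STUDENT GROUP BY name) t
WHERE t.avg_gpa > 4.0;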
Joins
Joins in Hive are similar to joins in SQL.
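A sketch of an inner join, assuming a hypothetical MARKS table keyed on rollno (the MARKS table and its columns are illustrative):
SELECT s.rollno, s.name, m.subject, m.score
FROM STUDENT s JOIN MARKS m ON (s.rollno = m.rollno);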
Aggregation
RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store
relational tables on computer clusters.
USER-DEFINED FUNCTION (UDF)
In Hive, you can use custom functions by defining the User-Defined Function (UDF).
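The UDF itself is written in Java and packaged as a jar; the HiveQL side of registering and calling it might look like the following sketch (the jar path, class name and function name are all hypothetical):
ADD JAR /home/hduser/udfs/myudfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
SELECT to_upper(name) FROM STUDENT;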
Introduction to Pig
WHAT IS PIG?
• Apache Pig is a platform for data analysis.
• It is an alternative to MapReduce Programming.
• Pig was developed as a research project at Yahoo.
Refer Figure
PIG ON HADOOP
• Pig runs on Hadoop.
• Pig uses both Hadoop Distributed File System and MapReduce Programming.
• By default, Pig reads input files from HDFS.
• Pig stores the intermediate data (data produced by MapReduce jobs) and the output in
HDFS.
• However, Pig can also read input from and place output to other sources.
• Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
PIG PHILOSOPHY
Pig philosophy.
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and
unstructured data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the
same can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.
PIG LATIN OVERVIEW
1. Pig Latin statements are basic constructs to process data using Pig.
2. Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.
The following is a simple Pig Latin script to load, filter, and store "student" data.
A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';
1. Keywords are not case sensitive such as LOAD, STORE, GROUP, FOREACH, DUMP,
etc.
2. Relations and paths are case-sensitive.
3. Function names are case sensitive such as PigStorage, COUNT.
Table describes the simple data types supported in Pig. In Pig, fields of unspecified type are
treated as an array of bytes, which is known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or non-existent.
Table:Simple data types supported in Pig
RUNNING PIG
Interactive Mode
Pig can be run in interactive mode by invoking the grunt shell: type pig to get the grunt prompt.
Once you get the grunt prompt, you can type Pig Latin statements directly. Paths refer to HDFS
paths, and DUMP displays the result on the console, as in the sketch below.
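A representative grunt session; the HDFS path reuses the student file from the FILTER example later in these notes:
pig
grunt> A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
grunt> DUMP A;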
Batch Mode
"Pig Script" need to be created to run pig in batch mode. Write Pig Latin statements in a file
and save it with .pig extension.
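For example, the statements could be saved in a file such as student.pig (a hypothetical name, with an illustrative output path) and run as a batch job:
-- student.pig
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
STORE B INTO '/pigdemo/highgpa';
The script is then executed with: pig student.pig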
Local Mode
To run pig in local mode, you need to have your files in the local file system.
Syntax:
pig -x local filename
MapReduce Mode
To run Pig in MapReduce mode, you need access to a Hadoop cluster to read/write files. This is
the default mode of Pig.
Syntax:
pig filename
HDFS COMMANDS
We can work with all HDFS commands in Grunt shell. For example, you can create a
directory as shown below.
grunt> fs -mkdir /piglatindemos;
grunt>
RELATIONAL OPERATORS
FILTER
FILTER operator is used to select tuples from a relation based on specified conditions.
Objective: Find the tuples of those students whose GPA is greater than 4.0.
Input:
Student (rollno:int,name:chararray,gpa:float)
Act:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = filter A by gpa > 4.0;
DUMP B;
FOREACH
Use FOREACH when you want to do data transformation based on columns of data.
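A sketch that projects and transforms columns of the student relation (UPPER is a built-in Pig function):
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate rollno, UPPER(name);
DUMP B;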
GROUP
DISTINCT
DISTINCT operator is used to remove duplicate tuples. In Pig, DISTINCT operator works on
the entire tuple and NOT on individual fields.
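A sketch, reusing the relation A loaded in the FILTER example above:
B = distinct A;
DUMP B;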
LIMIT
ORDER BY
JOIN
It is used to join two or more relations based on values in the common field. It always
performs an inner join.
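A sketch of a join in Pig, assuming a hypothetical second relation holding department data keyed on rollno (the department.tsv file and its columns are illustrative):
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = load '/pigdemo/department.tsv' as (rollno:int, deptname:chararray);
C = join A by rollno, B by rollno;
DUMP C;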
UNION
SAMPLE
It is used to select a random sample of data based on the specified sample size.
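For example, to pick roughly 10% of the tuples of A (the sampling fraction is illustrative):
B = sample A 0.1;
DUMP B;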
EVAL FUNCTION
AVG
AVG is used to compute the average of numeric values in a single column bag.
MAX
MAX is used to compute the maximum of numeric values in a single column bag.
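Both functions operate on a bag, so the relation is typically grouped first. A sketch over the student relation:
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = group A all;
C = foreach B generate AVG(A.gpa), MAX(A.gpa);
DUMP C;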
COUNT
TUPLE
PIGGY BANK
Pig users can use Piggy Bank functions in their Pig Latin scripts, and they can also share their
own functions via Piggy Bank.
Pig also allows you to create your own functions for complex analysis, as in the sketch below.
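A hedged sketch of registering a jar and calling a function from it; the jar path and alias are hypothetical, and the class path shown is only the typical Piggy Bank location for a string UPPER function, so treat it as an assumption:
REGISTER /home/hduser/udfs/piggybank.jar;
DEFINE MyUpper org.apache.pig.piggybank.evaluation.string.UPPER();
A = load '/pigdemo/student.tsv' as (rollno:int, name:chararray, gpa:float);
B = foreach A generate MyUpper(name);
DUMP B;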
PIG versus HIVE