Unit-5
Exploring Hive: Introducing Hive, Hive Service, Built-In Functions in Hive, Hive DDL,
Data Manipulation in Hive, Data Retrieval Queries, Using JOINS in Hive.
Introducing Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up
and developed it further as an open-source project under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schemas in a database and processes data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
The architecture of Hive consists of various components. These components are described as
follows:
User Interface (UI)—Allows you to submit queries to the Hive system for execution.
Driver—Receives the submitted queries. This driver component creates a session handle
for the submitted query and then sends the query to the compiler to generate an execution
plan.
Compiler—Parses the query, performs semantic analysis on different query blocks and
query expressions, and generates an execution plan.
Metastore—Stores all the information related to the structure of the various tables and
partitions in the data warehouse. It also includes column and column type information and the
serializers and deserializers necessary to read and write data. It also contains information
about the corresponding HDFS files where your data is stored.
Execution Engine—Executes the execution plan created by the compiler. The plan is in the
form of a Directed Acyclic Graph (DAG) to be executed in various stages. This engine
manages the dependencies between the different stages of a plan and is also responsible for
executing these stages on the appropriate system components.
Hive Services
You can list all the Hive services by typing hive --service help. Some Hive services
are as follows:
CLI—The command-line interface is the shell window that we get after the installation of
Hive. It is the inbuilt default service of Hive.
Hive server—Runs Hive as a server exposing a Thrift service, which enables access for a
number of clients written in different languages. Applications that use JDBC or ODBC
connectors need a running Hive server to communicate with Hive.
Hive Web Interface (HWI)—The Hive Web Interface is the GUI of Hive on which we
can execute queries. It is an alternative to the shell.
Metastore—This service runs along with the main Hive service whenever Hive starts; this is
the default behavior. It is also possible to run the metastore as a standalone (remote)
process. For this, you just need to set the “METASTORE_PORT” environment variable so
that the server listens on the specified port.
Hive client—There are many different mechanisms for applications to connect to Hive
when you run it as a server, that is, hiveserver. Following is one of the clients of Hive:
Thrift Client—It is very easy to run Hive commands from a wide variety of
programming languages through the Hive Thrift client. Thrift clients are available
for numerous languages, such as C++, Java, PHP, Python, and Ruby.
Similarly, there are JDBC and ODBC drivers that are compatible with Hive. These drivers
connect Hive with the metastores and help run applications.
Hive Variables:
Hive allows you to set variables that can be referred to in a Hive script. For this purpose, you
need to use the -d or --define option, as shown in the following commands:
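For example, a command of the following form can be used (the variable name tname is illustrative; sampletable and sampledatabase follow the explanation below):

```shell
# Define a variable on the command line and refer to it inside the HiveQL
hive -d tname=sampletable -e 'USE sampledatabase; CREATE TABLE ${tname} (id INT);'
```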
In the preceding commands, a table named sampletable is created in the database
sampledatabase.
By default, the variable substitution option is enabled. However, you can disable this option
by using the following command:
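A sketch of the command, using the standard property that controls substitution:

```sql
set hive.variable.substitute=false;
```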
In general, there are three namespaces for defining variables: hiveconf, system, and env. You
can also define custom variables in a separate namespace by using the define or hivevar options.
Hive Properties:
The hive-site.xml file stores the configuration properties of Hive. These properties can be
overwritten by the developers. To overwrite the properties of the hive-site.xml file, the set
command is used.
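For example, a property from hive-site.xml can be overridden for the current session as follows (the property and value here are illustrative):

```sql
set hive.exec.scratchdir=/tmp/mydir;
```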
Hive Queries:
Hive allows you to simultaneously execute one or more queries. These queries can be stored
in and executed from files. The extension of a query file in Hive is .hql or .q. Let’s take an
example of a Hive query file, named ourquery.hql, stored in the …/home/weusers/
queries/folder.
Now, type the following commands to execute the query stored in the ourquery.hql file:
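A sketch of the command, assuming the file sits in the queries folder mentioned above:

```shell
hive -f /home/weusers/queries/ourquery.hql
```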
You can also execute a Hive query in the background, as shown in the following command:
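A sketch of running the same file in the background (the log file name is illustrative):

```shell
hive -f /home/weusers/queries/ourquery.hql > query.log 2>&1 &
```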
To list the directory content in the Hadoop dfs, type the following command:
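A sketch of such a command (the warehouse path is the usual default and is assumed here):

```shell
hadoop fs -ls /user/hive/warehouse
```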
Hive supports two kinds of data types: primitive type and complex type. Primitive data types
are built-in data types, which also act as basic structures for building more sophisticated data
types. Primitive data types are associated with columns of a table. Figure 12.2 shows a list of
primitive data types available in Hive:
Complex data types are customized data types that can be created by using primitive data
types. The following types of complex data types are available in Hive:
Structs—The elements within this type can be accessed by using the Dot (.) operator.
Maps—The elements in a map are accessed by using the ['element name'] notation.
Arrays—The elements in an array have to be of the same data type. Elements can be
accessed by using the [n] notation where n represents the index of the array.
Built-In Functions in Hive
You must already be aware of the concept of functions and the role they play in the
development of an application. A function is a group of commands used to perform a
particular task in a program and return an outcome. Like every programming language, Hive
also has its set of built-in functions (also known as pre-defined functions). Table 12.3 lists the
built-in functions available in Hive:
To perform various types of computations, Hive provides aggregate functions. Table 12.4
lists the aggregate functions available in Hive:
Hive DDL
Data Definition Language (DDL) is used to describe data and data structures of a database.
Hive has its own DDL, similar to SQL DDL, which is used for managing, creating, altering,
and dropping databases, tables, and other objects in a database. Similar to other SQL
databases, Hive databases also contain namespaces for tables. If the name of the database is
not specified, the table is created in the default database. Some of the main commands used in
DDL are as follows:
Let’s learn more about these commands and the operations that can be performed by using
them.
Creating Databases:
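A database can be created with the CREATE DATABASE statement; the name temp_database matches the examples that follow:

```sql
CREATE DATABASE IF NOT EXISTS temp_database;
```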
You can also add a table in a particular database by using the following command:
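A sketch of such a command (the column list is illustrative):

```sql
CREATE TABLE temp_database.temp_table (id INT);
```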
In the preceding command, the name of the database is added before the name of the table.
Therefore, temp_table gets added in the temp_database. In addition, you can also create a
table in the database by using the following commands:
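A sketch of these commands (the column list is illustrative):

```sql
USE temp_database;
CREATE TABLE temp_table (id INT);
```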
In the preceding commands, the USE statement is used for setting the current database to
execute all the subsequent HiveQL statements. In this way, you do not need to add the name
of the database before the table name. The table temp_table is created in the database
temp_database. Furthermore, you can specify DBPROPERTIES in the form of key-value
pairs in the following manner:
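For example (the property keys and values are illustrative):

```sql
CREATE DATABASE temp_database
WITH DBPROPERTIES ('creator' = 'admin', 'created-on' = '2015-01-01');
```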
Viewing a Database:
You can view all the databases present in a particular path by using the following command:
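The command referred to is:

```sql
SHOW DATABASES;
```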
The preceding command shows all the databases present at a particular location.
Dropping a Database:
Dropping a database means deleting it from its storage location. The database can be deleted
by using the following command:
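A sketch of the command, using the database name from the earlier examples:

```sql
DROP DATABASE temp_database;
```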
You must note that if the database to be deleted contains any tables, you first need to delete
tables from the database by using the DROP command. Deleting each table individually from
a database is a tedious task. Hive allows you to delete tables along with the database by using
the following command:
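The CASCADE keyword drops the tables along with the database:

```sql
DROP DATABASE temp_database CASCADE;
```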
Altering Databases:
Altering a database means making changes in the existing database. You can alter a database
by using the following command:
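A sketch of such a command (the property key and value are illustrative):

```sql
ALTER DATABASE temp_database SET DBPROPERTIES ('edited-by' = 'admin');
```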
Creating Tables:
You can create a table in a database by using the CREATE command, as discussed earlier.
Now, let’s learn how to provide the complete definition of a table in a database by using the
following commands:
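A sketch of the commands described below (the comment strings and property values are illustrative):

```sql
USE temp_database;
CREATE TABLE employee (
  ename STRING COMMENT 'employee name',
  salary FLOAT,
  designation STRING
)
COMMENT 'Details of employees'
TBLPROPERTIES ('created_by' = 'admin');
```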
In the preceding commands, temp_database is first set and then the table is created with the
name employee. In the employee table, the columns (ename, salary, and designation) are
specified with their respective data types. TBLPROPERTIES is a set of key-value
properties. Comments are also added in the table to provide more details.
External Table:
External tables are created over data files that reside outside Hive's warehouse directory,
for example elsewhere in HDFS. You can specify the location of a data file by using the
LOCATION keyword while creating the external table. Execute the following commands for creating an
external table:
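A sketch of such commands (the table name, columns, and location path are illustrative):

```sql
CREATE EXTERNAL TABLE ext_employee (
  ename STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employee';
```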
The concept of external table proves to be very useful when HDFS has huge data files,
making copies of which is not feasible due to space constraints.
You can also create a table having the same schema as another table by using the following
command:
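Following the description below, the new table goes before LIKE and the source table after it:

```sql
CREATE TABLE details.employee LIKE details.student;
```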
In the preceding command, the schema of the student table present in the details database gets
copied in the employee table present in the same database. The name of the table to which the
schema gets copied is specified before the LIKE keyword, and the name of the table from
which the schema gets copied is specified after the LIKE keyword.
You can know the structure of the existing table by using the following command:
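The command referred to is:

```sql
DESCRIBE details.student;
```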
The DESCRIBE command shows the structure of the student table present in the details
database.
You can also find the details about any column present in a table by using the following
command:
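As described below, the column name is joined to the table name with a dot:

```sql
DESCRIBE details.student.name;
```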
In the preceding command, the name of the column is specified with the table name using a
dot between them. This will return the data type of the name column of the student table.
Partitioning Tables:
Partitioning of a table means dividing a table into different sections (also called partitions) in
order to view the required data. It helps you to create and view only a particular set of data of
a table. You can create partitions in a table by using the PARTITIONED BY keyword.
Consider the following command:
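A sketch of the command, using the column names given below (data types are assumed):

```sql
CREATE TABLE employee (
  ename STRING,
  sal FLOAT
)
PARTITIONED BY (designation STRING);
```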
Here, we are creating a table named employee with the ename and sal columns. The table is
partitioned on the designation column.
Dropping Tables:
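A table can be removed with the DROP TABLE statement, for example:

```sql
DROP TABLE IF EXISTS employee;
```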
Altering Tables:
Altering a table means modifying or changing an existing table. By altering a table, you can
modify the metadata associated with the table. The table can be modified by using the
ALTER TABLE statement. The altering of a table allows you to:
Rename tables
Modify columns
Rename Tables:
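A table can be renamed with the ALTER TABLE ... RENAME TO statement (the new name here is illustrative):

```sql
ALTER TABLE employee RENAME TO employee_details;
```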
Modify Columns:
You can modify the name of a column in Hive by using the following command:
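A sketch of such a command (the old and new column names are illustrative):

```sql
ALTER TABLE employee CHANGE ename emp_name STRING;
```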
Add Columns:
In Hive, you can also add an extra column to a given table by using the ADD COLUMNS
statement:
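Following the description below, the command adds a sales_month column to the employees table:

```sql
ALTER TABLE employees ADD COLUMNS (sales_month STRING);
```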
The preceding command adds an extra column with the name sales_month in the employees
table. By default, the new column is added at the last position before the partitioned columns.
Replace Columns:
You can replace all the existing columns with new columns in a table by using the following
command:
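A sketch of such a command (the new column names are illustrative):

```sql
ALTER TABLE employees REPLACE COLUMNS (emp_id INT, emp_name STRING);
```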
The REPLACE COLUMNS statement is used to delete all the existing columns and add new
columns in their place.
Data Manipulation in Hive
While loading data into tables, Hive does not perform any type of transformations. The data
load operations in Hive are, at present, pure copy/move operations, which move data files
from one location to another. You can upload data into Hive tables from the local file system
as well as from HDFS. The syntax of loading data from files into tables is as follows:
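The syntax is:

```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1 = val1, partcol2 = val2, ...)];
```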
When the LOCAL keyword is specified in the LOAD DATA command, Hive searches for
the local directory. If the LOCAL keyword is not used, Hive checks the directory on HDFS.
On the other hand, when the OVERWRITE keyword is specified, it deletes all the files under
Hive’s warehouse directory for the given table. After that, the latest files get uploaded. If you
do not specify the OVERWRITE keyword, the latest files are added in the already existing
folder.
You can also insert data into tables through queries by using the INSERT command. The
syntax of using the INSERT command is as follows:
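The syntax of the two forms described below is:

```sql
INSERT OVERWRITE TABLE tablename
  [PARTITION (partcol1 = val1, ...) [IF NOT EXISTS]]
  select_statement FROM from_statement;

INSERT INTO TABLE tablename
  [PARTITION (partcol1 = val1, ...)]
  select_statement FROM from_statement;
```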
In the preceding syntax, the INSERT OVERWRITE statement overwrites the current data in
the table or partition. The IF NOT EXISTS statement is given for a partition. On the other
hand, the INSERT INTO statement either appends the table or creates a partition without
modifying the existing data. The insert operation can be performed on a table or a partition.
You can also specify multiple insert clauses in the same query.
Consider two tables, T1 and T2. We want to copy the sal column from T2 to T1 by using the
INSERT command. It can be done as follows:
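The command can be sketched as:

```sql
INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;
```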
Here, we use the keyword OVERWRITE. It means that any previous data in table T1 will be
deleted, and new data from table T2 will be inserted. If we want to retain the previous data in
table T1, we can use the following command:
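The non-destructive form uses INSERT INTO:

```sql
INSERT INTO TABLE T1 SELECT sal FROM T2;
```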
Static partition insertion refers to the task of inserting data into a table by specifying a
partition column value. Consider the following example:
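A sketch of such a command (the partition column name month and the selected columns are assumptions):

```sql
INSERT OVERWRITE TABLE T1 PARTITION (month = 'JANUARY')
SELECT ename, sal FROM T2;
```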
In the preceding command, we are inserting data into the JANUARY partition of the T1
table.
Dynamic Partition Insertion:
In dynamic partition insertion, you need to specify a list of partition column names in the
PARTITION() clause along with the optional column values. A dynamic partition column
always has a corresponding input column in the SELECT statement. If the SELECT
statement has multiple column names, the dynamic partition columns must be specified at the
end of the columns and in the same order in which they appear in the PARTITION() clause.
By default, the feature of dynamic partition is disabled.
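A sketch of dynamic partition insertion over the ds and hr columns discussed below (the data columns key and value are assumptions; the two set commands enable the feature):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE T1 PARTITION (ds, hr)
SELECT key, value, ds, hr FROM T2;
```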
In the preceding command, dynamic partition insertion takes place by including all ds and hr
columns. In the SELECT statement, these columns are listed at the end, in the same order in
which they appear in the PARTITION clause.
Sometimes, you might want to save the result of a SELECT query in flat files so that you
do not have to execute the queries again and again. Consider the following example:
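A sketch of such a command (the local path is illustrative; myfile and employee follow the text below):

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/home/weusers/myfile'
SELECT * FROM employee;
```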
The preceding command inserts the SELECT query result of the employee table in the myfile
file of the local directory.
In Hive, you can create and load data into a table using a single query as follows:
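The create-table-as-select (CTAS) query described below can be sketched as:

```sql
CREATE TABLE T1 AS
SELECT name, sal, month FROM T2;
```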
Here, a new table called T1 would be created, and the schema for the table would be three
columns named name, sal, and month, with the same data types as mentioned in the T2 table.
Update in Hive:
The update operation is available in Hive from version 0.14 onward. The update operation can
only be performed on tables that support ACID properties. The syntax for performing the
update operation in Hive is as follows:
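The syntax is:

```sql
UPDATE tablename SET column = value [, column = value ...]
[WHERE expression];
```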
In the preceding syntax, the name of the table will be followed by the UPDATE statement.
Only those rows that match with the WHERE clause will be updated. You must note that the
partitioning and bucketing columns cannot be updated.
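A sketch of the command discussed below (the table name employee and the new salary value are assumptions; empid 10001 follows the text):

```sql
UPDATE employee SET salary = 50000 WHERE empid = 10001;
```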
In the preceding command, the salary of the employee whose empid is 10001 is updated.
Delete in Hive:
The delete operation is available in Hive from version 0.14 onward. The syntax for
performing the delete operation is as follows:
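The syntax is:

```sql
DELETE FROM tablename [WHERE expression];
```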
In the preceding syntax, only those rows that match the WHERE clause will be deleted. The
table name from which the records are deleted is specified after the DELETE FROM
statement. To delete the records of an employee whose empid is 10001, type the following
command:
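A sketch of the command (the table name employee is an assumption):

```sql
DELETE FROM employee WHERE empid = 10001;
```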
When the preceding command is executed, the records of the employee whose employee id is
10001 are deleted.
Data Retrieval Queries
Hive allows you to perform data retrieval queries by using the SELECT command along with
various types of operators and clauses.
The SELECT statement is the most common operation in SQL. You can filter the required
columns, rows, or both. The syntax for using the SELECT command is as follows:
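The general syntax is:

```sql
SELECT [DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[LIMIT number];
```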
The following query retrieves all the columns and rows from table, mydemotable:
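The query is:

```sql
SELECT * FROM mydemotable;
```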
The WHERE clause is used to search records of a table on the basis of a given condition.
This clause returns a boolean result. For example, a query can return only those sales records
that have an amount greater than 15000 from the US region. Hive also supports a number of
operators (such as > and <) in the WHERE clause. The following query shows an example of
using the WHERE clause:
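A sketch of such a query (the table and column names are assumptions based on the sales example above):

```sql
SELECT * FROM sales WHERE amount > 15000 AND region = 'US';
```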
For example, if you wish to calculate the average marks obtained by the students from all
semesters, you may use the GROUP BY clause as shown in the following example:
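A sketch of the query (the table and column names are assumptions):

```sql
SELECT semester, AVG(marks) FROM students GROUP BY semester;
```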
The following example helps you to count the number of distinct users by gender:
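A sketch of the query (the table and column names are assumptions):

```sql
SELECT gender, COUNT(DISTINCT userid) FROM users GROUP BY gender;
```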
You can also use aggregate functions at the same time, but no two aggregations can have
different DISTINCT columns.
It should be noted that when the GROUP BY clause is used with the SELECT query, the
SELECT statement can include only the columns that appear in the GROUP BY clause and aggregate functions.
The following queries create a table and use the GROUP BY clause on it:
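A sketch of the queries, using the column names clmn1 and clmn2 that appear in the HAVING example below (data types are assumptions):

```sql
CREATE TABLE mydemotable (clmn1 INT, clmn2 INT);
SELECT clmn1, SUM(clmn2) FROM mydemotable GROUP BY clmn1;
```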
The HAVING clause is used to specify a condition on the use of the GROUP BY clause. The
use of the HAVING clause is added in the version 0.7.0 of Hive. The following query shows
an example of using the HAVING clause:
The preceding command displays the clmn1 column of the mydemotable table grouped by
clmn1 where the sum of clmn2 is greater than 15.
The LIMIT clause is used to restrict the number of rows to be returned. Without an ORDER
BY clause, the returned rows are arbitrary. The following query returns 8 arbitrary rows from
mydemotable:
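The query is:

```sql
SELECT * FROM mydemotable LIMIT 8;
```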
Using JOINS in Hive
Hive supports joining of one or more tables to aggregate information. The various joins
supported by Hive are:
Inner joins
Outer joins
Inner Joins:
In case of inner joins, only the records satisfying the given condition get selected. All the
other records get discarded. Figure 12.11 illustrates the concept of inner joins:
Let’s take an example to describe the concept of inner joins. Consider two tables, order and
customer. Table 12.6 lists the data of the order table:
The above table contains the order id (order_id) and the corresponding customer id (cust_id).
Table 12.7 lists the data of the customer table:
The above table contains the customer id (cust_id) and the corresponding customer name
(cust_name).
To know the names of the customers who have placed orders, we need to take the inner join
of the two tables, as follows:
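A sketch of the inner join (order is a reserved word in HiveQL, so it is written in backticks; the column names follow the tables described above):

```sql
SELECT c.cust_id, c.cust_name, o.order_id
FROM customer c JOIN `order` o ON (c.cust_id = o.cust_id);
```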
Outer Joins:
Sometimes, you need to retrieve all the records from one table and only some records from
the other table. In such cases, you have to use the outer join. Figure 12.12 illustrates the
concept of outer joins:
Right Outer Joins:
In this type of join, all the records from the table on the right side of the join are retained.
Figure 12.13 illustrates the concept of right outer join:
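A sketch of the right outer join that produces the result discussed below:

```sql
SELECT o.order_id, c.cust_id, c.cust_name
FROM `order` o RIGHT OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```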
You can see that the order_id field for the customer ‘Monty’ is marked as NULL. It is
because there is no related record matching the given customer id in the order table.
Left Outer Joins:
In this type of join, all the records from the table on the left side of the join are retained.
Figure 12.14 illustrates the concept of left outer joins:
The query involving left outer joins can be written as follows:
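A sketch of the query:

```sql
SELECT o.order_id, o.cust_id, c.cust_name
FROM `order` o LEFT OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```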
Here, all the entries from the order table are present in the output. A field is kept blank as
there is no corresponding value for the key in the customer table. The preceding query gives
the following output:
Full Outer Joins:
In this case, all the fields from both tables are included. For the entries that do not have any
match, a NULL value would be displayed.
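A sketch of the full outer join:

```sql
SELECT o.order_id, c.cust_id, c.cust_name
FROM `order` o FULL OUTER JOIN customer c ON (o.cust_id = c.cust_id);
```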
Cartesian Product Joins:
In cartesian product joins, all the records of one table are combined with another table in all
possible combinations. This type of join does not involve any key column to join the tables.
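A sketch of such a join; note that no ON condition is given:

```sql
SELECT * FROM `order` CROSS JOIN customer;
```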
Map-Side Joins:
In a map-side join operation in Hive, the job is assigned to a Map Reduce task that consists of
two stages: Map and Reduce. At the Map stage, the data is read from the join tables, and the
‘join key’ and ‘join value’ pairs are written to an intermediate file. This intermediate file is then
sorted and merged in the shuffle stage. At the Reduce stage, the sorted result is taken as input,
and the joining task is completed.
A map-side join is similar to the normal join, but here, all the tasks are performed by the
mapper alone. The map-side join is preferred when one of the tables is small. Suppose you have two tables,
out of which one is a small table. Now, when a map reduce task is submitted, a Map Reduce
local task will be created to read the data of the small table from HDFS and store it into an in-
memory hash table. After reading the data, the Map Reduce local task serializes the in-
memory hash table into a hash table file.
In the next stage, the main join Map Reduce task runs and moves the data from the hash table
file to the Hadoop distributed cache, which supplies these files to each mapper’s local disk. It
means that all the mappers can load this hash table file back into the memory and continue
with the join work.
Joining Tables:
You can combine the data of two or more tables in Hive by using HiveQL queries. For this,
we need to create tables and load them into Hive from HDFS.
Let us create two tables, Post_data_uk and Post_data_us, in HDFS through HQL.
1. To create the table Post_data_uk, type the following command on the terminal:
2. Type the following command on the terminal to create the table Post_data_us:
3. To load the two tables, Post_data_uk and Post_data_us, type the following commands:
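A sketch of steps 1–3 (the column list and the HDFS paths of the post_uk and post_us files are assumptions based on the surrounding text):

```sql
CREATE TABLE Post_data_uk (first_name STRING, last_name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE Post_data_us (first_name STRING, last_name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/post_uk' INTO TABLE Post_data_uk;
LOAD DATA INPATH '/user/hive/post_us' INTO TABLE Post_data_us;
```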
4. To perform join operations on tables, we have specific HQL queries. To join the
Post_data_uk table with the Post_data_us table on the basis of the first_name column, type
the following command:
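A sketch of the join command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```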
5. Type the following command to select five records after performing the left outer join
operation on the two tables, namely, Post_data_uk and Post_data_us:
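A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk LEFT OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name)
LIMIT 5;
```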
6. To perform the right outer join on the tables, type the following command on the terminal:
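A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk RIGHT OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```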
7. To perform a full outer join operation, type the following command on the terminal:
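A sketch of the command:

```sql
SELECT uk.*, us.*
FROM Post_data_uk uk FULL OUTER JOIN Post_data_us us
ON (uk.first_name = us.first_name);
```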
You can also create partitions on the records stored in HDFS tables through HQL queries. For
this, let’s create a table named Post_data with partitions.
8. Type the following command to create the Post_data table with partitions:
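A sketch of the command (the data columns are assumptions; country is the partition column named below):

```sql
CREATE TABLE Post_data (first_name STRING, last_name STRING, city STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```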
The preceding command creates a table that would be partitioned on the basis of a string,
country. To load data from the post_uk file in the first partition of the Post_data table, type:
To load data from the post_us file in the second partition of the Post_data table, type:
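Sketches of the two load commands (the HDFS paths and partition values are assumptions):

```sql
LOAD DATA INPATH '/user/hive/post_uk' INTO TABLE Post_data PARTITION (country = 'UK');
LOAD DATA INPATH '/user/hive/post_us' INTO TABLE Post_data PARTITION (country = 'US');
```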
The post_data file now contains the data given in both the post_uk and post_us datasets.
Bucketing:
Bucketing is another method for dividing data, but it differs from partitioning. In
partitioning, a separate partition is created for each unique value of the partition column.
Sometimes, however, creating partitions is a problem: the column may have so many unique
values that a large number of small partitions would be needed. With bucketing, you can
instead limit the data to a fixed number of buckets of your choice and distribute the data
across those buckets.
In Hive, partitions are basically directories, whereas a bucket is a file. In Hive, bucketing is
not enabled automatically; the bucketing property needs to be set before using it. The
property hive.enforce.bucketing should be set to “true” (set hive.enforce.bucketing=true;)
to use bucketing. Let’s perform the following steps to insert data into a bucketed table:
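A sketch of the steps (the table name Post_bucket, the bucketing column, and the bucket count are assumptions; Post_data is the partitioned table created earlier):

```sql
set hive.enforce.bucketing=true;

CREATE TABLE Post_bucket (first_name STRING, last_name STRING, city STRING)
CLUSTERED BY (first_name) INTO 4 BUCKETS;

INSERT OVERWRITE TABLE Post_bucket SELECT first_name, last_name, city FROM Post_data;
```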