
IBM Software An IBM Proof of Technology

Accessing Hadoop Data Using Hive


Unit 2: Working with Hive DDL
© Copyright IBM Corporation, 2013


US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Lab 2: Working with Hive DDL
  1.1 Accessing the Hive Beeline CLI
  1.2 Working with Databases in Hive
  1.3 Exploring Our Sample Dataset
    1.3.1 Finding the Sample Data
    1.3.2 Sample Data Descriptions
  1.4 Tables in Hive
    1.4.1 Managed Non-Partitioned Tables
    1.4.2 Managed Partitioned Tables
    1.4.3 External Table
  1.5 Summary


Lab 2 Working with Hive DDL


Before we can begin working with and analyzing data in Hive, we must first use Hive’s data definition
language (DDL). Using Hive DDL enables us to create databases, tables, partitions, and more, which we
can later load with data that can be queried and manipulated.

After completing this hands-on lab, you will be able to:

• Create, alter, and drop databases in Hive.
• Create managed, external, and partitioned tables in Hive.
• Locate databases and tables in HDFS.

Allow 1 to 1.5 hours to complete this section of the lab.

This version of the lab was designed using the InfoSphere BigInsights 3.0 Quick Start Edition.
Throughout this lab it is assumed that you will be using the following account login information:

                        Username   Password
VM image setup screen   root       password
Linux                   biadmin    biadmin

If you are continuing this series of hands-on labs immediately after completing Accessing Hadoop Data
Using Hive Unit 1: Exploring Hive, you may move on to section 1.1 of this lab. Otherwise, please refer to
Accessing Hadoop Data Using Hive Unit 1: Exploring Hive, Section 1.1, to get started. (All Hadoop
components should be running.)


1.1 Accessing the Hive Beeline CLI


In this section we will navigate to the Hive Beeline CLI and start an interactive CLI session.

__1. Open the Linux terminal by double clicking the Terminal icon from within the BigInsights Shell
directory on the desktop.

Note: You could also directly start the “original Hive CLI shell” by clicking the Hive shell icon
within the BigInsights Shell directory. This lab however uses the newer Beeline CLI.

__2. In the Linux terminal change to the Hive home bin directory

~> cd $HIVE_HOME/bin

__3. Start an interactive Hive shell session.

~> ./beeline -u jdbc:hive2://bivm.ibm.com:10000 -n biadmin -p biadmin

__4. Run the SHOW DATABASES statement from within the interactive Hive session.

hive> SHOW DATABASES;
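On a fresh system you should see only Hive’s built-in default database, with output similar to the
following (the exact formatting depends on your Beeline settings):

+----------------+
| database_name  |
+----------------+
| default        |
+----------------+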


1.2 Working with Databases in Hive


If we neglect to create a new database in Hive, then the “default” database will be used. Let’s create a
new database and work with it. In this exercise we will create two databases in the Hive system. One of
them will be used for future exercises. The other will be deleted.

__1. In the Hive shell create a database called testDB.

hive> CREATE DATABASE testDB;

__2. Let’s confirm that the new database was added to Hive’s catalog.

hive> SHOW DATABASES;

Notice that Hive converted “testDB” to lowercase.
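As an aside, CREATE DATABASE also accepts optional clauses such as IF NOT EXISTS (which suppresses
the error when the database already exists) and COMMENT. A quick sketch, not needed for this lab:

hive> CREATE DATABASE IF NOT EXISTS testdb COMMENT 'Scratch database for DDL practice';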

__3. Now that we have created a new database, let’s describe it.

hive> DESCRIBE DATABASE testdb;

The DESCRIBE DATABASE output shows us the location of testdb on HDFS. Notice that the testdb.db
schema is stored inside HDFS in the /biginsights/hive/warehouse directory.

__4. Let’s confirm that the new testdb.db directory was in fact created on HDFS. Open a second
Linux terminal by double clicking the Terminal icon from within the BigInsights Shell directory on
the desktop.


__5. Check HDFS to confirm our new database directory was created.

~> hadoop fs -ls /biginsights/hive/warehouse

The testdb.db directory WAS created.

Keep this second Linux console open for the rest of this lab. We will continue to use it to
look at HDFS. You will be going back and forth between the Hive console and this Linux
console.

__6. Add some information to the DBPROPERTIES metadata for the testdb database. We do this by
using the ALTER DATABASE syntax.

hive> ALTER DATABASE testdb SET DBPROPERTIES ('creator' = 'bigdatarockstar');

__7. Let’s view the extended details of our testdb database.

hive> DESCRIBE DATABASE EXTENDED testdb;

Notice the updated database properties.
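The output should now include a properties entry along the lines of {creator=bigdatarockstar}.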

__8. Go ahead and delete the testdb database.

hive> DROP DATABASE testdb CASCADE;

Notice the CASCADE keyword. That is optional. Using it will cause Hive to delete all the tables in
your database (if there are any) before dropping the database. If you try to delete a database
that has tables without the CASCADE keyword, Hive won’t let you.
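To make the point concrete: had testdb still contained tables, the plain form below (without
CASCADE) would have been rejected with an error along the lines of “Database testdb is not empty”
(the exact wording varies by Hive version):

hive> DROP DATABASE testdb;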

__9. Confirm that testdb is no longer in the Hive metastore catalog.

hive> SHOW DATABASES;


__10. Now we are going to create a database that will house the tables we will use for many of the
exercises in this course. This database will be called “computersalesdb”.

hive> CREATE DATABASE computersalesdb;

__11. Verify the DB was created. Note the location of the database directory in HDFS.

hive> DESCRIBE DATABASE computersalesdb;

We can see that computersalesdb does in fact exist and that its directory was created on HDFS at
/biginsights/hive/warehouse/computersalesdb.db.

__12. Tell Hive to Use the computersalesdb (we will use this database for the rest of this interactive
session).

hive> USE computersalesdb;

Keep your CLI open – we will be using it in the upcoming exercises.

1.3 Exploring Our Sample Dataset


Before we begin creating tables in our new database it is important to understand what data is in our
sample files and how that data is structured.

Earlier in this course you placed the lab data in the /home/biadmin directory. Now we will take a look at
the contents of those files.

1.3.1 Finding the Sample Data

__1. Click the biadmin’s Home shortcut on the desktop


A new window will open that shows us the contents of our /home/biadmin directory.

__2. Navigate to the following directory sampleData->Computer_Business

__3. In Hive there is not an easy way to remove the header row from a file. To make things easy we
have two directories – WithoutHeaders and WithRowHeaders.


WithRowHeaders - Contains 3 data files in csv format. The first row in each file is a header row.
We created this directory just so you can see the metadata of the table (what data each column
holds). You will only be using this directory in this exercise – and only to examine what the data
looks like.

WithoutHeaders – Contains the same 3 data files that the WithRowHeaders directory has,
EXCEPT the first rows (header data) are removed from this data. This data is ready to be used
with Hive.
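If you ever need to produce a header-free copy of a file yourself, a standard shell one-liner does the
job. A sketch, assuming you are in a directory containing a local Customer.csv (the file names here are
illustrative):

~> tail -n +2 Customer.csv > Customer_noheader.csv

The tail -n +2 option prints everything from the second line onward, dropping the header row.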

__4. Let’s check out our data. Navigate to the WithRowHeaders directory.

__5. To view the contents of one of the files, right-click the file and then click from the menu. Then
click the “Display” button on the pop up box. Examine each of the 3 files.


1.3.2 Sample Data Descriptions

Our sample data is from a fictitious computer retailer. The company sells computer parts and
generally serves a single State in the country.

Customer.csv:

Purpose: Hold customer records.

Columns:

FNAME        Customer's first name
LNAME        Customer's last name
STATUS       Active or Inactive status
TELNO        Telephone #
CUSTOMER_ID  Customer's unique ID
CITY|ZIP     City and zip code separated by the "|" character

Example of contents:
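A hypothetical row in this format (all values invented for illustration) might look like:

Bob,Smith,Active,555-1234,C0001,Springfield|62704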


Product.csv:

Purpose: Hold product records.

Columns:

PROD_NAME      Name of product
DESCRIPTION    Description of computer product
CATEGORY       Category the product belongs to
QTY_ON_HAND    Quantity of product in warehouse
PROD_NUM       Unique product number
PACKAGED_WITH  Colon-separated list of things that come in the package with the product

Example of contents:
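A hypothetical row in this format (all values invented for illustration, with a colon-separated
PACKAGED_WITH list) might look like:

Hard Drive,3TB SATA hard drive,Storage,125,P0042,satacable:manual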


Sales.csv:

Purpose: Holds all historical sales records. Company updates once a month.

Columns:

CUST_ID    ID of the customer who made the purchase
PROD_NUM   ID of the product that was purchased
QTY        Quantity purchased
DATE       Date of sale
SALES_ID   Unique sale ID

Example of contents:
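A hypothetical row in this format (all values invented for illustration) might look like:

C0001,P0042,2,2013-05-14,S00017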


1.4 Tables in Hive

1.4.1 Managed Non-Partitioned Tables

The first table we will create in Hive is the products table. This table will be fully managed by Hive and
will not contain any partitions.

__1. In the CLI, create the new products table in Hive.


hive> CREATE TABLE products
(
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE;

Note the data types we have assigned to the different columns. The packaged_with column is of
special interest – it is designated as an Array of Strings. The array will hold data that is separated
by the colon “:” character - e.g. satacable:manual. We also tell Hive that the columns in our rows
are delimited by commas “,”. The last line tells Hive that our data file is a plain text file.
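Once the table is loaded with data, elements of the packaged_with array can be addressed by zero-based
index in queries. A quick sketch:

hive> SELECT prod_name, packaged_with[0] FROM products;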

__2. Ask Hive to show us the tables in our database.


hive> SHOW TABLES IN computersalesdb;


We can see that only one table exists in our database and it is the new products table we just
created.

__3. Add a note to the TBLPROPERTIES for our new products table.
hive> ALTER TABLE products SET TBLPROPERTIES (
'details' = 'This table holds products');
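You can verify that the property was recorded by listing the table properties:

hive> SHOW TBLPROPERTIES products;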

__4. List the extended details of the products table.


hive> DESCRIBE EXTENDED products;

Beeline’s default settings truncate all the output. Let’s adjust Beeline to display data in a different
format so we can see all of the output.

Inside of Beeline:
hive> !set outputformat vertical

Now, rerun the DESCRIBE EXTENDED products; command.


That is a lot of details! Notice there is some interesting info including the location of this table
within HDFS: /biginsights/hive/warehouse/computersalesdb.db/products

__5. Let’s verify that the products directory was created on HDFS in the location listed above. Run the
HDFS ls command from within the Linux console. First list the contents of the database directory,
then list the contents of the products table directory.

~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db

~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db/products

The first command confirms that there is in fact a products table directory on HDFS. The second
command shows that there are no files within the products directory yet. This directory will be
empty until we load data into the products table in a later exercise.

__6. Imagine that our fictitious computer company adds sales data to a “sales_staging” table at the
end of each month. From this sales_staging table they then move the data they want to analyze
into a partitioned “sales” table. The partitioned sales table is the one they actual use for their
analysis.


Now that we know how to create tables, we will create one more managed non-partitioned table
called “sales_staging”. This table will hold ALL of the sales data from the sales.csv file. In later
exercises we will actually split this sales_staging data into a partitioned table called “sales”.

In the CLI, create the new sales_staging table in Hive.


hive> CREATE TABLE sales_staging
(
cust_id STRING,
prod_num STRING,
qty INT,
sale_date STRING,
sales_id STRING
)
COMMENT 'Staging table for sales data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

__7. We can now assume that the new sales_staging table directory is on HDFS in the following
folder: /biginsights/hive/warehouse/computersalesdb.db/sales_staging. Let’s quickly confirm by
entering the following command in the Linux console:

~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db

Sure enough, the sales_staging directory was created and is now being managed by Hive.

__8. Ask Hive to show us the tables in our database. Confirm your new sales_staging table is in the
Hive catalog.
hive> SHOW TABLES;


__9. Let’s pretend that we have decided we want to update some of our column metadata. We will
change the sale_date column in the sales_staging table from a STRING type to a DATE type.
hive> ALTER TABLE sales_staging CHANGE sale_date sale_date DATE;
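You can confirm that the column type change took effect by describing the table:

hive> DESCRIBE sales_staging;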

1.4.2 Managed Partitioned Tables

__1. Now we will create a partitioned table. This table will be a managed table – Hive will manage the
metadata and lifecycle of this table, just like the tables we previously created.

In the CLI create the sales table. This table will be partitioned on the sales date.
hive> CREATE TABLE sales
(
cust_id STRING,
prod_num STRING,
qty INT,
sales_id STRING
)
COMMENT 'Table for analysis of sales data'
PARTITIONED BY (sales_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Notice that we list the sales_date in the PARTITIONED BY clause instead of listing it in the data
column metadata. Since we are partitioning on sales_date, Hive will keep track of the dates for
us outside of the actual data.
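Once data has been loaded, each distinct sales_date value will get its own subdirectory under the
sales table directory (named in the form sales_date=<value>), and you will be able to list the
partitions:

hive> SHOW PARTITIONS sales;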


__2. Let’s view the extended details of our new table.


hive> DESCRIBE EXTENDED sales;

We can see a directory was created at /biginsights/hive/warehouse/computersalesdb.db/sales.

When we later put data into this table, a new directory will be created inside the sales directory
for EACH partition.

The following line in the details shows us how our table is partitioned: partitionKeys:
[FieldSchema(name:sales_date, type:string, comment:null)]

1.4.3 External Table

Another department in our fictitious computer company would like to be able to analyze the customer
data. It therefore makes sense to set up the customer table as EXTERNAL so that they can use their
tools on the data and we can use ours (Hive). We will place a copy of the Customer.csv file in HDFS and
then create a new table in Hive that points to this data.

__1. First we need to create a new directory – let’s call it “shared_hive_data” - on HDFS that can
house our Customer.csv data file. Let’s put this in the /user/biadmin directory. We will run the
command to make the new directory from the Linux console.

~> hadoop fs -mkdir /user/biadmin/shared_hive_data


__2. Now we will move a copy of the Customer.csv file into the /user/biadmin/shared_hive_data
directory. We can run the command to do this from within the Linux console.

~> hadoop fs -put /home/biadmin/sampleData/Computer_Business/WithoutHeaders/Customer.csv /user/biadmin/shared_hive_data/Customer.csv

__3. Confirm that Customer.csv has been copied successfully into HDFS. Enter the following
command in the Linux console.

~> hadoop fs -ls /user/biadmin/shared_hive_data/

If your output lists the Customer.csv file, then that is good!

If you’d like, run the “cat” command to verify the data is in the Customer.csv file on HDFS.

~> hadoop fs -cat /user/biadmin/shared_hive_data/Customer.csv

__4. Now we just need to define our external customer table.

hive> CREATE EXTERNAL TABLE customer
(
fname STRING,
lname STRING,
status STRING,
telno STRING,
customer_id STRING,
city_zip STRUCT<city:STRING, zip:STRING>
)
COMMENT 'External table for customer data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LOCATION '/user/biadmin/shared_hive_data/';

There are a few things to note here. First, we use the EXTERNAL keyword in the CREATE line.
We leave out the STORED AS line when creating this external table, since the default format is
already TEXTFILE. We also add the LOCATION ‘location/of/datadirectory’ clause to the end of the
statement.

Hive expects that LOCATION will be a directory, not a file. For our exercises we will only have a
single file in the shared_hive_data directory. However, you could put multiple customer data files
into the shared_hive_data directory and Hive would use them all for your EXTERNAL table! That
is a common scenario.
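Because the city_zip column is a STRUCT, its members can be addressed with dot notation once you
start querying. A quick sketch:

hive> SELECT fname, lname, city_zip.city, city_zip.zip FROM customer;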

__5. Let’s view the extended details of our new table.


hive> DESCRIBE EXTENDED customer;


You can see that the location points to the /user/biadmin/shared_hive_data directory we
designated on HDFS. Also notice, towards the end of the output, the entry
tableType:EXTERNAL_TABLE.

__6. Let’s set the Beeline output format back to the table style. This will give us cleaner output when
we run queries.
hive> !set outputformat table

__7. Since customer is an External table and Hive already knows where the data is sitting, you can
already begin to run queries on this table. Reward yourself for completing this lab by running a
simple select query as proof.
hive> SELECT * FROM customer LIMIT 5;

This Hive query didn’t run any MapReduce jobs. Hive is able to read the data file and write the
results to the CLI without using MapReduce, since this is a simple SELECT and LIMIT
statement.

1.5 Summary
Congratulations! You now know how to create, alter, and remove databases in Hive. You can create
managed, external, and partitioned Hive tables. You are also familiar with the sample data that will be
used in this course. You may move on to the next unit.

© Copyright IBM Corporation 2013.

The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty
of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these
materials. Nothing contained in these materials is intended to, nor
shall have the effect of, creating any warranties or representations
from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of
IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all
countries in which IBM operates. This information is based on
current IBM product plans and strategy, which are subject to change
by IBM without notice. Product release dates and/or capabilities
referenced in these materials may change at any time at IBM’s sole
discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability
in any way.

IBM, the IBM logo and ibm.com are trademarks of International
Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
