Accessing Hadoop Data Using Hive: Unit 2: Working With Hive DDL
Contents
Lab 2: Working With Hive DDL
1.1 Accessing the Hive Beeline CLI
1.2 Working with Databases in Hive
1.3 Exploring Our Sample Dataset
    1.3.1 Finding the Sample Data
    1.3.2 Sample Data Descriptions
1.4 Tables in Hive
    1.4.1 Managed Non-Partitioned Tables
    1.4.2 Managed Partitioned Tables
    1.4.3 External Table
1.5 Summary
This version of the lab was designed using the InfoSphere BigInsights 3.0 Quick Start Edition.
Throughout this lab it is assumed that you will be using the following account login information:
Username: biadmin
Password: (the password assigned to the biadmin account on your image)
If you are continuing this series of hands-on labs immediately after completing Accessing Hadoop Data Using Hive, Unit 1: Exploring Hive, you may move on to section 1.1 of this lab. Otherwise, please refer to Section 1.1 of Accessing Hadoop Data Using Hive, Unit 1: Exploring Hive to get started. (All Hadoop components should be running.)

1.1 Accessing the Hive Beeline CLI
__1. Open the Linux terminal by double-clicking the Terminal icon from within the BigInsights Shell directory on the desktop.

Note: You could also start the original Hive CLI shell directly by clicking the Hive shell icon within the BigInsights Shell directory. This lab, however, uses the newer Beeline CLI.
__2. In the Linux terminal, change to the Hive home bin directory:
~> cd $HIVE_HOME/bin
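__3. Start the Beeline CLI and connect to Hive. The connection URL and user below are examples; adjust them to match your environment:

~> ./beeline -u jdbc:hive2://localhost:10000 -n biadmin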
__4. Run the SHOW DATABASES statement from within the interactive Hive session.
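For example:

hive> SHOW DATABASES;

You should see at least the built-in default database in the list.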
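1.2 Working with Databases in Hive

__1. Create a new database named testdb. A minimal form of the statement (the lab's exact statement may include additional clauses) is:

hive> CREATE DATABASE testdb;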
__2. Let’s confirm that the new database was added to Hive’s catalog.
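For example:

hive> SHOW DATABASES;

The testdb database should now appear in the list.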
__3. Now that we have created a new database, let’s describe it.
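For example:

hive> DESCRIBE DATABASE testdb;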
The DESCRIBE DATABASE output shows us the location of testdb on HDFS. Notice that the testdb.db directory is stored inside HDFS in the /biginsights/hive/warehouse directory.
__4. Let's confirm that the new testdb.db directory was in fact created on HDFS. Open a second Linux terminal by double-clicking the Terminal icon from within the BigInsights Shell directory on the desktop.
__5. Check HDFS to confirm our new database directory was created.
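For example:

~> hadoop fs -ls /biginsights/hive/warehouse

The testdb.db directory should appear in the listing.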
Keep this second Linux console open for the rest of this lab. We will continue to use it to
look at HDFS. You will be going back and forth between the Hive console and this Linux
console.
__6. Add some information to the DBPROPERTIES metadata for the testdb database. We do this by
using the ALTER DATABASE syntax.
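For example (the property name and value here are only illustrative):

hive> ALTER DATABASE testdb SET DBPROPERTIES ('edited-by' = 'biadmin');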
When you are finished with a database, you can remove it with DROP DATABASE, optionally adding the CASCADE keyword. Using CASCADE causes Hive to delete all the tables in your database (if there are any) before dropping the database. If you try to drop a database that has tables without the CASCADE keyword, Hive won't let you.
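For example, to drop testdb along with any tables it contains:

hive> DROP DATABASE testdb CASCADE;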
__10. Now we are going to create a database that will house the tables we will use for many of the
exercises in this course. This database will be called “computersalesdb”.
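A minimal form of the statement is:

hive> CREATE DATABASE computersalesdb;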
__11. Verify the DB was created. Note the location of the database directory in HDFS.
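For example:

hive> DESCRIBE DATABASE computersalesdb;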
We can see that the computersalesdb does in fact exist and the new directory was created on
HDFS in the /biginsights/hive/warehouse/computersalesdb.db folder.
__12. Tell Hive to use computersalesdb. We will use this database for the rest of this interactive session.
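For example:

hive> USE computersalesdb;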
1.3 Exploring Our Sample Dataset

1.3.1 Finding the Sample Data

Earlier in this course you placed the lab data in the /home/biadmin directory. Now we will take a look at the contents of those files. Open a file browser window; it will show the contents of our /home/biadmin directory.
__3. In Hive there is no easy way to remove the header row from a file. To make things easy, we have two directories: WithoutHeaders and WithRowHeaders.
WithRowHeaders – Contains 3 data files in CSV format. The first row in each file is a header row. We created this directory just so you can see the metadata of the table (what data each column holds). You will only be using this directory in this exercise, and only to examine what the data looks like.

WithoutHeaders – Contains the same 3 data files as the WithRowHeaders directory, EXCEPT that the first rows (header data) have been removed. This data is ready to be used with Hive.
__4. Let’s check out our data. Navigate to the WithRowHeaders directory.
__5. To view the contents of one of the files, right-click the file, choose the viewer option from the menu, and then click the "Display" button in the pop-up box. Examine each of the 3 files.
1.3.2 Sample Data Descriptions

Our sample data is from a fictitious computer retailer. The company sells computer parts and generally serves a single state in the country.
Customer.csv:

Columns: customer first name, last name, status, telephone number, unique customer ID, and city/ZIP code (the city and ZIP values are separated by the "|" character).

Example of contents:
Product.csv:

Columns: name of product, description of computer product, category the product belongs to, quantity of product in warehouse, unique product number, and a colon-separated list of things that come in the package with the product.

Example of contents:
Sales.csv:

Purpose: Holds all historical sales records. The company updates it once a month.

Columns: ID of the customer who made the purchase, ID of the product that was purchased, quantity (QTY) purchased, date of sale, and a unique sale ID.

Example of contents:
1.4 Tables in Hive

1.4.1 Managed Non-Partitioned Tables

The first table we will create in Hive is the products table. This table will be fully managed by Hive and will not contain any partitions.
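__1. In the CLI, create the products table. The statement below is a sketch: the column names are assumptions based on the Product.csv description above, while the delimiters and file format follow the note below:

hive> CREATE TABLE products
(
cust_name STRING
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE;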
Note the data types we have assigned to the different columns. The packaged_with column is of
special interest – it is designated as an Array of Strings. The array will hold data that is separated
by the colon “:” character - e.g. satacable:manual. We also tell Hive that the columns in our rows
are delimited by commas “,”. The last line tells Hive that our data file is a plain text file.
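__2. List the tables in our new database:

hive> SHOW TABLES;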
We can see that only one table exists in our database and it is the new products table we just
created.
__3. Add a note to the TBLPROPERTIES for our new products table.
hive> ALTER TABLE products SET TBLPROPERTIES (
'details' = 'This table holds products');
Beeline’s default settings truncate all the output. Let’s adjust Beeline to display data in a different
format so we can see all of the output.
Inside of Beeline:
hive> !set outputformat vertical
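Then describe the products table in detail. DESCRIBE EXTENDED is an assumption here, consistent with the output discussed below:

hive> DESCRIBE EXTENDED products;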
That is a lot of details! Notice there is some interesting info including the location of this table
within HDFS: /biginsights/hive/warehouse/computersalesdb.db/products
__5. Let’s verify that the products directory was created on HDFS in the location listed above. Run the
HDFS ls command from within the Linux console. First list the contents of the database directory,
then list the contents of the products table directory.
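For example:

~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db
~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db/products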
The first command confirms that there is in fact a products table directory on HDFS. The second
command shows that there are no files within the products directory yet. This directory will be
empty until we load data into the products table in a later exercise.
__6. Imagine that our fictitious computer company adds sales data to a "sales_staging" table at the end of each month. From this sales_staging table they then move the data they want to analyze into a partitioned "sales" table. The partitioned sales table is the one they actually use for their analysis.
Now that we know how to create tables, we will create one more managed non-partitioned table
called “sales_staging”. This table will hold ALL of the sales data from the sales.csv file. In later
exercises we will actually split this sales_staging data into a partitioned table called “sales”.
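A sketch of the statement, assuming column names that mirror the sales table created in the next section (with sale_date as a regular STRING column rather than a partition):

hive> CREATE TABLE sales_staging
(
cust_id STRING,
prod_num STRING,
qty INT,
sale_date STRING,
sales_id STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;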
__7. We can now assume that the new sales_staging table directory is on HDFS in the following
folder: /biginsights/hive/warehouse/computersalesdb.db/sales_staging. Let’s quickly confirm by
entering the following command in the Linux console:
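~> hadoop fs -ls /biginsights/hive/warehouse/computersalesdb.db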
Sure enough, the sales_staging directory was created and is now being managed by Hive.
__8. Ask Hive to show us the tables in our database. Confirm your new sales_staging table is in the
Hive catalog.
hive> SHOW TABLES;
__9. Let's pretend we have decided to update some of our column metadata. We will change the sale_date column in the sales_staging table from a STRING type to a DATE type.
hive> ALTER TABLE sales_staging CHANGE sale_date sale_date DATE;
1.4.2 Managed Partitioned Tables

__1. Now we will create a partitioned table. This table will be a managed table – Hive will manage the metadata and lifecycle of this table, just like the tables we previously created.
In the CLI create the sales table. This table will be partitioned on the sales date.
hive> CREATE TABLE sales
(
cust_id STRING,
prod_num STRING,
qty INT,
sales_id STRING
)
COMMENT 'Table for analysis of sales data'
PARTITIONED BY (sales_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Notice that we list the sales_date in the PARTITIONED BY clause instead of listing it in the data
column metadata. Since we are partitioning on sales_date, Hive will keep track of the dates for
us outside of the actual data.
When we later put data into this table, a new directory will be created inside the sales directory
for EACH partition.
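__2. Describe the sales table to see its partition metadata. The statement is an assumption consistent with the output discussed below:

hive> DESCRIBE EXTENDED sales;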
The following line in the details shows us how our table is partitioned:
partitionKeys:[FieldSchema(name:sales_date, type:string, comment:null)]
1.4.3 External Table

Another department in our fictitious computer company would like to be able to analyze the customer data. It therefore makes sense to set up the customer table as EXTERNAL, so they can use their tools on the data and we can use ours (Hive). We will place a copy of the Customer.csv file in HDFS and then create a new table in Hive that points to this data.
__1. First we need to create a new directory – let’s call it “shared_hive_data” - on HDFS that can
house our Customer.csv data file. Let’s put this in the /user/biadmin directory. We will run the
command to make the new directory from the Linux console.
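For example:

~> hadoop fs -mkdir /user/biadmin/shared_hive_data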
__2. Now we will move a copy of the Customer.csv file into the /user/biadmin/shared_hive_data
directory. We can run the command to do this from within the Linux console.
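A sketch, assuming the header-less lab files sit in a WithoutHeaders directory under /home/biadmin (adjust the local path to wherever your lab data actually sits):

~> hadoop fs -put /home/biadmin/WithoutHeaders/Customer.csv /user/biadmin/shared_hive_data/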
__3. Confirm that Customer.csv has been copied successfully into HDFS. Enter the following
command in the Linux console.
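For example:

~> hadoop fs -ls /user/biadmin/shared_hive_data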
If Customer.csv appears in the listing, then that is good!
If you’d like, run the “cat” command to verify the data is in the Customer.csv file on HDFS.
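For example:

~> hadoop fs -cat /user/biadmin/shared_hive_data/Customer.csv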
__4. Now create the external customer table:

hive> CREATE EXTERNAL TABLE customer
(
fname STRING,
lname STRING,
status STRING,
telno STRING,
customer_id STRING,
city_zip STRUCT<city:STRING, zip:STRING>
)
COMMENT 'External table for customer data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LOCATION '/user/biadmin/shared_hive_data/';
There are a few things to note here. First, we use the EXTERNAL keyword in the CREATE line. Second, we leave out the STORED AS clause when creating this external table, since the default format is already TEXTFILE. Finally, we add the LOCATION 'location/of/datadirectory' clause to the end of the statement.
Hive expects that LOCATION will be a directory, not a file. For our exercises we will only have a
single file in the shared_hive_data directory. However, you could put multiple customer data files
into the shared_hive_data directory and Hive would use them all for your EXTERNAL table! That
is a common scenario.
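__5. Describe the new table. DESCRIBE EXTENDED is an assumption here, consistent with the output discussed below:

hive> DESCRIBE EXTENDED customer;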
You can see that the location points to the /user/biadmin/shared_hive_data directory we designated on HDFS. Also notice, towards the end of the output, that the tableType is EXTERNAL_TABLE.
__6. Let’s set the Beeline output format back to the table style. This will give us cleaner output when
we run queries.
hive> !set outputformat table
__7. Since customer is an External table and Hive already knows where the data is sitting, you can
already begin to run queries on this table. Reward yourself for completing this lab by running a
simple select query as proof.
hive> SELECT * FROM customer LIMIT 5;
This Hive query didn’t run any MapReduce jobs. Hive is able to read the data file and write the
results to the CLI without using MapReduce, since this is a simple SELECT and LIMIT
statement.
1.5 Summary
Congratulations! You now know how to create, alter, and remove databases in Hive. You can create managed, external, and partitioned Hive tables. You are also familiar with the sample data that will be used in this course. You may move on to the next Unit.
© Copyright IBM Corporation 2013.