
L6F - Creating Hive Table with Complex Data Type

Outlines • Concept
• Scenario 1 - Creating a table with a String Array data type and loading data into the table
• Scenario 2 - Creating a table with a Map data type and loading data into the table
• Scenario 3 - Creating a table with a Struct data type and loading data into the table
• Scenario 4 - Processing values from an Array data type
• Scenario 5 - Creating a table with a Struct data type, loading data into the table, and performing a calculation

concept Hive Data Types

These can be classified into two categories:

1) Primitive data types
2) Collection data types

Reference
• https://www.educba.com/hive-data-types/
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes
• https://impala.apache.org/docs/build/html/topics/impala_array.html#array

Primitive data types These include the numeric types (tinyint, smallint, int, bigint, float, double, decimal), string types (string, varchar, char), date/time types (date, timestamp), boolean, and binary.

Collection data types There are:

1) Array - a sequence of elements of a common type that can be accessed by index, where the index starts from zero

2) Map - a set of key-value pair elements

3) Struct - a data type that comprises a set of named fields, which may be of different data types

4) Uniontype - can hold a value of any one of its specified data types (beyond the scope of this lab)
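
To get a feel for these collection types before creating any tables, Hive's built-in constructor functions array(), map(), and named_struct() can be used directly in a query; a minimal sketch (all literal values here are illustrative):

select arr[0] as first_item, mp['2015'] as value_for_2015, st.city as city
from (select array('scifi','drama') as arr,
             map('2015', 120) as mp,
             named_struct('city','Shah Alam','state','Selangor') as st) t;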
Scenario 1 • Creating a table with a String Array data type and loading data into the table

reference:
• https://stackoverflow.com/questions/33984794/loading-csv-file-on-hive-table-with-string-array

Steps • understand data and plan for the table structure


• create a directory and upload the file into HDFS
• construct and execute a command to create an external table
• check the output

data understanding and table planning The file: scenario1.txt

The data:

Plan for the table structure:

1) determine the datatypes


• the primitive data type
o id as int
o title as string
o author as string
• the collection data type
o genre as array<string>

2) determine the delimiter
• we need to separate each field by ','
• we need to separate each collection by '|'

upload into HDFS Create a directory and then upload the file, as follows: (*Note: replace student30 with your student access number)
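
For example, a minimal sketch assuming scenario1.txt is in your current local working directory:

hdfs dfs -mkdir -p /user/student30/scenario1
hdfs dfs -put scenario1.txt /user/student30/scenario1/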

create the table Note:

• We can use either an internal or an external table
• For this exercise, an external table is chosen
• make sure you have selected your database, such as: use student30;

Run this command:

create external table if not exists article (
id int,
title string,
author string,
genre array<string> )
row format delimited
fields terminated by ','
collection items terminated by '|'
location '/user/student30/scenario1';

check the output run this command to check the table structure:
• describe article;
run this command to check the location of the data:
• show create table article;

• notice that the location is the HDFS directory you specified, not the default Hive warehouse directory

run this command to check the data:


• select * from article;

what if you want to list only the genre column?


• select genre from article;

What if you want to retrieve the first genre of each record?


• select genre[0] as first_genre from article;

Exploration - What if you decide to create an internal table?


internal table
Steps:

1) you need to create the table

create table if not exists article_int (
id int,
title string,
author string,
genre array<string> )
row format delimited
fields terminated by ','
collection items terminated by '|';

2) check out the expected location of the data:


• show create table article_int;

3) you need to load the data into the table, which moves the file into the Hive warehouse directory
• load data inpath '/user/student30/scenario1/scenario1.txt' overwrite into table article_int;

4) check out the data


• select * from article_int;

Exploration 1) As you use the same data source for the internal table, what happens to the data of the previously created external
table?
2) How do you address that problem?
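
Hint: load data inpath moves the file within HDFS rather than copying it, so the external table's directory is left empty and select * from article; returns no rows. Assuming the original file is still on your local machine, one way to restore the external table's data:

hdfs dfs -put scenario1.txt /user/student30/scenario1/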

Scenario 2 • Creating a table with a Map data type and loading data into the table

reference:
• https://acadgild.com/blog/hive-complex-data-types-with-examples

Steps • understand data and plan for the table structure


• create a directory and upload the file into HDFS
• construct and execute a command to create an internal table
• load the data into the table
• check the output

data understanding the file: scenario2.txt

the data:

Plan for the table structure:

1) determine the datatypes


• the primitive data type
o school level as string
o state as string
o gender as string
• the collection data type
o total student as map<int,int> where:
▪ year as the key
▪ total as the value

2) determine the delimiter
• we need to separate each field by space
• we need to separate each collection by ','
• we need to separate each map by ':'

upload into HDFS Create a directory and then upload the file, as follows:
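
For example, assuming scenario2.txt is in your current local working directory:

hdfs dfs -mkdir -p /user/student30/scenario2
hdfs dfs -put scenario2.txt /user/student30/scenario2/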

create the table Run this command:

create table if not exists school_info (
school_level string,
state string,
gender string,
total_stud map<int,int> )
row format delimited
fields terminated by ' '
collection items terminated by ','
map keys terminated by ':';

check the created table structure:
• describe school_info;


load the data run this command:

load data inpath '/user/student30/scenario2/scenario2.txt' overwrite into table school_info;

Note:
• you will need to adjust the path according to your own directory

check the output run this command to check the location of data:
• show create table school_info;

run this command to check the data:


• select * from school_info;

what if you want to list only the total_stud column?


• select total_stud from school_info;

What if you want to retrieve the 2015 value from each record?


• select total_stud[2015] as total from school_info;
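
What if the years are not known in advance? Hive's built-in map_keys and map_values functions list a map's keys and values; for example:

• select map_keys(total_stud) as years from school_info;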

Exploration What if we want to count the total number of students for each year?

1) identify how to access a key in the map

• mapname[key]
2) construct and execute this command
• select sum(total_stud[2015]) as total_2015, sum(total_stud[2016]) as total_2016, sum(total_stud[2017]) as total_2017 from
school_info;
3) you should get the following output:
Scenario 3 • Creating a table with a Struct data type and loading data into the table

references:
• https://acadgild.com/blog/hive-complex-data-types-with-examples
• http://myitlearnings.com/complex-data-type-in-hive-struct/

Steps • understand data and plan for the table structure


• create a directory and upload the file into HDFS
• construct and execute a command to create an internal table
• load the data into the table
• check the output

data understanding the file: scenario3.txt

the data:

Plan for the table structure:

1) determine the datatypes


• the primitive data type
o firstname as string
o lastname as string
• the collection data type
o address as struct where:
▪ house number as int
▪ road name as string
▪ city as string
▪ state as string

2) determine the delimiter
• we need to separate each field by '\t'
• we need to separate each collection by ','

upload into HDFS Create a directory and then upload the file, as follows:
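
For example, assuming scenario3.txt is in your current local working directory:

hdfs dfs -mkdir -p /user/student30/scenario3
hdfs dfs -put scenario3.txt /user/student30/scenario3/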

create the table Run this command:

create table if not exists address_info (
firstname string,
lastname string,
address struct<num:int, road:string, city:string, state:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

check the created table structure:
• describe address_info;


load the data run this command:

load data inpath '/user/student30/scenario3/scenario3.txt' overwrite into table address_info;

Note:
• you will need to adjust the path according to your own directory

check the output run this command to check the location of data:
• show create table address_info;

run this command to check the data:


• select * from address_info;

what if you want to list the address only?


• select address from address_info;

What if you want to retrieve the city only from each record?
• select address.city as city from address_info;
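
Struct fields can also be combined with the primitive columns in a single query; for example:

• select firstname, lastname, address.road as road, address.state as state from address_info;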

Scenario 4 • Processing the values of an Array data type

Steps • understand data and plan for the table structure


• create a directory and upload the file into HDFS
• construct and execute a command to create an internal table
• load the data into the table
• check the output

the dataset The file: scenario4.txt

The data:

Plan for the table structure:


1) determine the datatypes
• the primitive data type
o class name as string
• the collection data type
o mark as array<int>
2) determine the delimiter
• we need to separate each field by '\t'
• we need to separate each collection by ','

upload into HDFS Create a directory and then upload the file, as follows:
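
For example, assuming scenario4.txt is in your current local working directory:

hdfs dfs -mkdir -p /user/student30/scenario4
hdfs dfs -put scenario4.txt /user/student30/scenario4/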

create the table Run this command:

create table if not exists classmark (
classname string,
mark array<int> )
row format delimited
fields terminated by '\t'
collection items terminated by ',';

check the created table structure:
• describe classmark;

load the data run this command:

load data inpath '/user/student30/scenario4/scenario4.txt' overwrite into table classmark;

Note:
• you will need to adjust the path according to your own directory

check the output run this command to check the location of data:
• show create table classmark;

run this command to check the data:


• select * from classmark;

what if you want to list the mark only?


• select mark from classmark;
What if you want to count the size of the array?
• select classname, size(mark) as num_of_mark from classmark;

What if you want to sum the total mark for index 0?


• select sum(mark[0]) as total_index0 from classmark;
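
What if you want to process every mark instead of a single index? Hive's lateral view with the explode function turns each array element into its own row, which can then be aggregated; a minimal sketch:

• select classname, avg(m) as avg_mark from classmark lateral view explode(mark) marks as m group by classname;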

Scenario 5 • Creating a table with a Struct data type, loading data into the table, and performing a calculation

Steps • understand data and plan for the table structure


• create a directory and upload the file into HDFS
• construct and execute a command to create an internal table
• load the data into the table
• check the output

the dataset The file: region.csv

The data:

Plan for the table structure:

1) determine the datatypes


• the primitive data type
o r_regionkey as smallint
o r_name as string
o r_comment as string
• the collection data type
o r_nations as struct where:
▪ n_nationkey as smallint
▪ n_name as string
▪ n_comment as string

2) determine the delimiter
• we need to separate each field by '|'
• we need to separate each collection by ','

upload into HDFS Create a directory and then upload the file, as follows:
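
For example, assuming region.csv is in your current local working directory:

hdfs dfs -mkdir -p /user/student30/scenario5
hdfs dfs -put region.csv /user/student30/scenario5/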
create the table Run this command:

create table if not exists region (
r_regionkey smallint,
r_name string,
r_comment string,
r_nations struct<n_nationkey:smallint, n_name:string, n_comment:string> )
row format delimited
fields terminated by '|'
collection items terminated by ','
tblproperties("skip.header.line.count"="1");

Note:
• we need skip.header.line.count because the dataset contains a header row
• alternatively, we can manually delete the header row from the dataset

check the created table structure:
• describe region;

load the data run this command:

load data inpath '/user/student30/scenario5/region.csv' overwrite into table region;

Note:
• you will need to adjust the path according to your own directory

check the output run this command to check the location of data:
• show create table region;

run this command to check the data:


• select * from region;

What if you want to count the total number of nation keys, grouped by the region name?
• select r_name, count(r_nations.n_nationkey) as nation_num from region group by r_name;
Accessing HUE • to access HUE, go to https://bigdatalab-rm-en1.uitm.edu.my:8889/hue/accounts/login?next=/
• then log in using the given account

Accessing Hive • to access Hive, execute the following command:


o beeline -u jdbc:hive2://bigdatalab-cdh-mn1.uitm.edu.my:10000 -n yourusername -p yourpassword
• then type in:
o use yourdatabasename;
• then, you can browse the available tables by typing in:
o show tables;

Accessing MariaDB • Type in the following:


o mysql -ustudent -pp@ssw0rd retail_db

YARN monitoring tools • To view the monitored applications (Note: you must access this within the UiTM network), go to http://10.5.19.231:8088/cluster/apps
• To view the monitored jobs (Note: you must access this within the UiTM network), go to http://10.5.19.231:19888/jobhistory/app
