Slide 5 High-Level Data Process Components Tutorial
S3Lab
Smart Software System Laboratory
1
“Without big data, you are blind and deaf
and in the middle of a freeway.”
– Geoffrey Moore
2
Hadoop Ecosystem
3
Apache Hive Tutorial
4
High-Level Data Process Components
Hive
6
High-Level Data Process Components
Hive
● HiveQL example:
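The query itself appears only as a screenshot on the slide; a representative HiveQL query of this kind (the employee table and its columns are assumptions, not taken from the slide) might be:
Command: SELECT dept, COUNT(*) FROM employee GROUP BY dept;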
7
Hive tutorial
8
Hive tutorial
Create and show databases
Command: create database newdb;
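To verify the result and switch to the new database (standard HiveQL; the exact listing depends on your setup):
Command: show databases;
Command: use newdb;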
9
Hive tutorial
Two types of tables in Hive: managed tables and external tables
● For a managed table, Hive is responsible for managing the table’s data. If you load the data from a file present in HDFS into a Hive managed table and issue a DROP command on it, the table will be deleted along with its metadata, so the data belonging to the dropped managed table no longer exists anywhere in HDFS and you can’t retrieve it by any means. Essentially, when you issue the LOAD command, you are moving the data from its HDFS file location to the Hive warehouse directory.
● For an external table, Hive is not responsible for managing the data. In this case, when you issue the LOAD command, Hive moves the data into its warehouse directory and then creates the metadata information for the external table. If you now issue a DROP command on the external table, only the metadata information regarding the external table will be deleted. Therefore, you can still retrieve the data of that external table from the warehouse directory using HDFS commands.
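A minimal sketch of the two variants (table names, columns, and the location path are assumptions, not taken from the slides):
Command: create table managed_emp (id int, name string) row format delimited fields terminated by ',';
Command: create external table external_emp (id int, name string) row format delimited fields terminated by ',' location '/user/cloudera/external_emp';
Dropping managed_emp removes both its data and its metadata; dropping external_emp removes only the metadata, and the files under /user/cloudera/external_emp remain in HDFS.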
10
Hive tutorial
Create managed table (internal table)
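The slide shows this command as a screenshot; a plausible version, matching the employee table loaded later in this tutorial (the column list is an assumption), is:
Command: create table employee (id int, name string, salary float);
Note that no ROW FORMAT clause is given here, so the table uses Hive’s default field delimiter (Ctrl-A); this matters for the “Why NULL?” question later.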
11
Hive tutorial
Describe table
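The standard commands for this step:
Command: describe employee;
Command: describe formatted employee;
describe formatted additionally shows the table type (MANAGED_TABLE or EXTERNAL_TABLE) and the table’s location in HDFS.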
12
Hive tutorial
Create external table
● Let’s try to create an external table
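A sketch of the command (the table name employee2 is used later in this tutorial; the columns are assumptions):
Command: create external table employee2 (id int, name string, salary float) row format delimited fields terminated by ',';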
13
Hive tutorial
14
Hive tutorial
15
Hive tutorial
● Let’s create a new external table and store its data in the home directory of HDFS
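A sketch using the LOCATION clause (table name, columns, and path are assumptions; /user/cloudera is the usual home directory on a Cloudera setup):
Command: create external table employee3 (id int, name string) row format delimited fields terminated by ',' location '/user/cloudera/employee3';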
16
Hive tutorial
17
Hive tutorial
● Let’s add one more column to the table and check the result
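A plausible version of this step (the new column name is an assumption):
Command: alter table employee add columns (department string);
Command: describe employee;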
18
Hive tutorial
● Let’s change one column of the table and check the result
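A plausible version using Hive’s CHANGE clause, which can rename a column and alter its type (the names are assumptions):
Command: alter table employee change department dept string;
Command: describe employee;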
19
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee;
Why NULL?
20
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Check the schema of the table and the .csv file
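A likely explanation, assuming the table was created without a ROW FORMAT clause: Hive’s default field delimiter is the Ctrl-A character, so each comma-separated line of Employee.csv is read as a single field that cannot be split into the declared columns, and the columns come back as NULL. Recreating the table with a matching delimiter fixes this:
Command: create table employee (id int, name string, salary float) row format delimited fields terminated by ',';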
21
Hive tutorial
LOAD Data from Local into Hive Managed Table
22
Hive tutorial
LOAD Data from Local into Hive Managed Table
23
Hive tutorial
LOAD Data from Local into Hive External Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee2;
24
Hive tutorial
Difference between managed and external table
25
Hive tutorial
Difference between managed and external table
● Let’s drop the external table
● Then check the directory associated with this external table in HDFS
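The corresponding commands (the warehouse path is an assumption; use the location reported by describe formatted):
Command: drop table employee2;
Command: hdfs dfs -ls /user/hive/warehouse/employee2
The data files should still be listed, since dropping an external table removes only its metadata.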
26
Hive tutorial
Difference between managed and external table
● Let’s drop the internal table
● Then check the directory associated with this internal table in HDFS
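The same check for the managed table (path again an assumption):
Command: drop table employee;
Command: hdfs dfs -ls /user/hive/warehouse
This time the table’s directory disappears as well, because dropping a managed table deletes its data along with its metadata.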
27
Hive tutorial
LOAD Data from HDFS to Hive Table
● Let’s create an internal table
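A sketch consistent with the Student.csv file and the columns used later in the partitioning section (ID, Name, Age, Course):
Command: create table student (id int, name string, age int, course string) row format delimited fields terminated by ',';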
28
Hive tutorial
LOAD Data from HDFS to Hive Table
● Put file Student.csv to HDFS
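The standard HDFS shell command for this (the local path mirrors the Employee.csv example and is an assumption):
Command: hdfs dfs -put /home/cloudera/Desktop/Student.csv /user/cloudera/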
29
Hive tutorial
LOAD Data from HDFS to Hive Table
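Without the LOCAL keyword, the path is resolved in HDFS and the file is moved (not copied) into the table’s directory:
Command: LOAD DATA INPATH '/user/cloudera/Student.csv' INTO TABLE student;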
30
Hive tutorial
Hive command using HUE
● Login to HUE
31
Hive tutorial
Hive command using HUE
32
Hive tutorial
Hive command using HUE
● Let’s check the file Employee.csv
33
Hive tutorial
Hive command using HUE
● Let’s check the file Student.csv
34
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries
35
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries
36
Hive tutorial
Hive command using HUE
● Check newly created file in HDFS
37
Hive tutorial
Hive command using HUE
● Let’s create a table in HUE
● Command: create table department (DepartmentID int, DepartmentName string) row format delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');
(The tblproperties clause tells Hive to skip the header row of the CSV file when reading.)
38
Hive tutorial
Hive command using HUE
● Upload the file Department.csv into the department directory (use the + button)
39
Hive tutorial
Hive command using HUE
● Query this newly created table
40
Hive tutorial
Partition in Hive
● Hive organizes tables into partitions to group similar types of data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. This allows us to run faster queries on slices of the data.
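For example, a partitioned version of the student table used below (a sketch; the exact columns are assumptions):
Command: create table student_partition (id int, name string, age int) partitioned by (course string) row format delimited fields terminated by ',';
The partition column course is not part of the regular column list; each distinct value gets its own subdirectory in HDFS, e.g. .../student_partition/course=Hadoop.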
41
Hive tutorial
Static partition
● Create new database
42
Hive tutorial
Static partition
● Check the format of the table
43
Hive tutorial
Static partition
● Load data from file StudentHadoop.csv to partition (course=Hadoop)
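The command for a static partition load (the local path is an assumption, mirroring the earlier examples):
Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/StudentHadoop.csv' INTO TABLE student_partition PARTITION (course='Hadoop');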
44
Hive tutorial
Static partition
● Check new directory course=Hadoop in HDFS
45
Hive tutorial
Static partition
● Continue to load data into the other partitions
46
Hive tutorial
Static partition
● Check all created directories in HDFS
47
Hive tutorial
Dynamic partition
48
Hive tutorial
Dynamic partition
49
Hive tutorial
Dynamic partition
50
Hive tutorial
Dynamic partition
51
Hive tutorial
Dynamic partition
● Command: insert into student_partition partition(Course) select ID, Name, Age, Course from student;
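Dynamic partitioning must be enabled first, otherwise Hive rejects this insert; these are the standard settings:
Command: set hive.exec.dynamic.partition=true;
Command: set hive.exec.dynamic.partition.mode=nonstrict;
Hive then creates one course=... directory per distinct value found in the selected data.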
52
Hive tutorial
Dynamic partition
● Check created directories in HDFS
53
Apache Pig Tutorial
54
High-Level Data Process Components
Pig
DUMP X;
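The full example on the slide was a screenshot and only the final statement survives; a minimal Pig Latin sketch consistent with it (file name and schema are assumptions) would be:
X = load 'input.txt' using PigStorage(',') as (f1:int, f2:chararray);
DUMP X;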
56
Pig tutorial
Pig data model
60
Pig tutorial
Pig data model
61
Pig tutorial
Pig data model
62
Pig tutorial
Tuple and Bag
63
Pig tutorial
Tuple
● A tuple is an ordered set of fields, which may contain different data types for each field. You can think of it as a record stored in a row of a relational database: a tuple is a set of cells from a single row, as shown in the image above. The elements inside a tuple do not necessarily need to have a schema attached to them.
● A tuple is represented by the ‘()’ symbol.
● Example of a tuple − (1, Linkin Park, 7, California)
● Since tuples are ordered, we can access fields in each tuple using the indexes of the fields; for example, $1 on the above tuple returns the value ‘Linkin Park’. Notice that the above tuple does not have any schema attached to it.
64
Pig tutorial
Bag
● A bag is a collection of tuples, and these tuples can be a subset of the rows or entire rows of a table. A bag can contain duplicate tuples; it is not mandatory that they be unique.
● A bag has a flexible schema, i.e. tuples within the bag can have a different number of fields. A bag can also have tuples with different data types.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)}
65
Pig tutorial
Bag
● For Apache Pig to effectively process bags, the fields and their respective data types need to be in the same sequence.
● An outer bag, or relation, is nothing but a bag of tuples; here, relations are similar to the relations in relational databases. To understand this better, let us take an example:
● {(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}
● The above bag expresses the relation between a band and its place of origin.
67
Pig tutorial
Pig data model
● Here, the first field type is a string, while the second field type is a bag; a bag nested inside a tuple like this is called an inner bag.
69
Pig tutorial
Map
● A map is a set of key-value pairs used to represent data elements. The key must be a chararray and should be unique, like a column name, so that it can be indexed and the value associated with it can be accessed on the basis of the key. The value can be of any data type.
● Maps are represented by the ‘[]’ symbol, and key and value are separated by the ‘#’ symbol, as you can see in the image above.
● Example of maps − [band#Linkin Park, members#7], [band#Metallica, members#8]
70
Pig tutorial
Schema
● A schema assigns a name to each field and declares the data type of the field. Schemas are optional in Pig Latin, but Pig encourages you to use them whenever possible, as error checking becomes more efficient while parsing the script, which results in more efficient execution of the program. A schema can be declared with both simple and complex data types. If a schema is declared in the LOAD function, it is also attached to the data.
71
Pig tutorial
Schema
● A few points on schemas in Pig:
● If the schema only includes the field name, the data type of the field is considered to be bytearray.
● If you assign a name to the field, you can access the field by both the field name and positional notation; if the field name is missing, we can only access it by positional notation, i.e. $ followed by the index number.
● If you perform any operation that is a combination of relations (like JOIN, COGROUP, etc.) and any of the relations is missing a schema, the resulting relation will have a null schema.
● If the schema is null, Pig will consider the fields as bytearray and the real data type of each field will be determined dynamically.
72
Pig tutorial
Open Pig grunt shell
● Command: pig
73
Pig tutorial
Open Pig grunt shell
74
Pig tutorial
Load data into Apache Pig from the file system (HDFS/Local)
75
Pig tutorial
Load data from HDFS into Pig relation
● Open gedit and create a txt file and save it in home directory
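A plausible content for the file (the actual records are not preserved in the deck; these are hypothetical rows with four space-separated fields matching the schema used on the next slides), followed by copying it to HDFS:
1 John 45000.0 developer
2 Mary 52000.0 tester
Command: hdfs dfs -put /home/cloudera/employeeDetails.txt /user/cloudera/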
76
Pig tutorial
Load data from HDFS into Pig relation
77
Pig tutorial
Load data from HDFS into Pig relation
● Let’s load the data into a Pig relation, using Pig data types.
● Command: employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
78
Pig tutorial
Load data from HDFS into Pig relation
79
Pig tutorial
Load data from HDFS into Pig relation
80
Pig tutorial
Load data from Local File System into Pig relation
● Load file
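One common way is to run Pig in local mode, where paths are resolved against the local file system (the path is an assumption):
Command: pig -x local
Command: employee = load '/home/cloudera/employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);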
81
Pig tutorial
Load data from Local File System into Pig relation
82
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.
● Let us consider that we have the Hive table called student with some data in it
83
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.
● The command below will load the data from the Hive table into a Pig relation called pigdataStudent
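This is typically done through HCatalog: start the shell with pig -useHCatalog, then load via HCatLoader (the database name newdb is an assumption; on older distributions the class is org.apache.hcatalog.pig.HCatLoader):
Command: pigdataStudent = LOAD 'newdb.student' USING org.apache.hive.hcatalog.pig.HCatLoader();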
84
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.
85
Pig tutorial
Filter operation
● Now that the pigdataStudent relation has all the data, let us try to filter only the records where age > 23.
● Command: plus23 = filter pigdataStudent by age > 23;
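To inspect the filtered relation:
Command: dump plus23;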
86
Pig tutorial
Filter operation
87
Pig tutorial
Filter operation
88
Pig tutorial
Storing Data from PIG Relation
89
Pig tutorial
Store PIG Relation into HDFS
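A sketch with the relation from the previous step (the output path is an assumption):
Command: STORE plus23 INTO '/user/cloudera/plus23_output' USING PigStorage(',');
Pig writes the result as part-* files inside that directory.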
90
Pig tutorial
Store PIG Relation into HDFS
91
Pig tutorial
Store PIG Relation into HDFS
92
Pig tutorial
STORE Data from PIG Relation Into HIVE Table
93
Pig tutorial
STORE Data from PIG Relation Into HIVE Table
● Store data from Pig relation into the newly created Hive table
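Using HCatalog again (the target table name is an assumption; the table must already exist in Hive with a matching schema):
Command: STORE plus23 INTO 'newdb.student_plus23' USING org.apache.hive.hcatalog.pig.HCatStorer();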
94
Pig tutorial
STORE Data from PIG Relation Into HIVE Table
95
Pig tutorial
Create Your First Apache Pig Script
● Create and open an Apache Pig script file in an editor (e.g. gedit)
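A minimal script along the lines of the earlier steps (file name and contents are assumptions), saved e.g. as firstScript.pig:
employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
high_paid = filter employee by salary > 50000.0;
dump high_paid;
Run it from the command line with: pig firstScript.pig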
96
Pig tutorial
Create Your First Apache Pig Script
97
Pig tutorial
Create Your First Apache Pig Script
98
Pig tutorial
Create Your First Apache Pig Script
99
Pig tutorial
Create Your First Apache Pig Script
100
Pig tutorial
Positional notation reference
● So far, fields in a Pig relation have been referred to by name (e.g. id, name, salary, task, etc.)
● Names are assigned by you using schemas
● Positional notation is generated by the system. Positional notation is indicated with the
dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
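For example, with the employee relation loaded earlier, these two statements are equivalent, since name is the second field ($1):
Command: names = foreach employee generate name;
Command: names = foreach employee generate $1;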
101
Pig tutorial
Positional notation reference
102
Pig tutorial
Positional notation reference
103
Pig tutorial
Schema Handling
● You can define a schema that includes both the field name and field type.
● You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
● You can choose not to define a schema; in this case, the field is unnamed and the field type defaults to bytearray.
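The three cases side by side (the file name is an assumption):
Command: a = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray);
Command: b = load 'employeeDetails.txt' using PigStorage(' ') as (id, name);
Command: c = load 'employeeDetails.txt' using PigStorage(' ');
In b the field types default to bytearray; in c the fields are unnamed and can only be referenced positionally as $0, $1.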
104
Pig tutorial
Schema Handling
● The field data types are not specified (the default type is bytearray)
105
Pig tutorial
Schema Handling
● Unknown schema
106
Pig tutorial
Schema Handling
107
Pig tutorial
Schema Handling
108
High-Level Data Process Components
Hive & Pig
109
High-Level Data Process Components
Impala
110