
Big Data

Data Loading Tools


Trong-Hop Do

S3Lab
Smart Software System Laboratory

1
“Without big data, you are blind and deaf
and in the middle of a freeway.”
– Geoffrey Moore

Big Data 2
Hadoop Ecosystem

3
Big Data
Apache Hive Tutorial

4
High-Level Data Process Components
Hive

● An SQL-like interface to Hadoop.


● Data warehouse infrastructure built on top of Hadoop
● Provides data summarization, query, and analysis
● Query execution via MapReduce
● The Hive interpreter converts HiveQL queries into MapReduce jobs.
● Open source project.
● Developed by Facebook
● Also used by Netflix, CNET, Digg, eHarmony, etc.
5
Big Data
High-Level Data Process Components
Hive - architecture

6
Big Data
High-Level Data Process Components
Hive

● HiveQL example:

SELECT customerId, max(total_cost) FROM hive_purchases
GROUP BY customerId HAVING count(*) > 3;

7
Big Data
Hive tutorial

Let us get started with the Command Line Interface (CLI)


Command: hive

8
Hive tutorial
Create and show databases
Command: create database newdb;

Command: show databases;

9
Hive tutorial
Two types of table in Hive: managed table and external table
● For a managed table, Hive is responsible for managing the table’s data. If you load the
data from a file present in HDFS into a Hive managed table and then issue a DROP command on it, the
table along with its metadata will be deleted. The data belonging to the
dropped managed table no longer exists anywhere in HDFS, and you can’t retrieve it by any means.
Essentially, when you issue the LOAD command, you are moving the data from its HDFS file location
to the Hive warehouse directory.
● For an external table, Hive is not responsible for managing the data. In this case, when you issue the
LOAD command, Hive moves the data into its warehouse directory and then creates the
metadata information for the external table. If you issue a DROP command on the external
table, only the metadata information about the external table is deleted. Therefore, you can still
retrieve the data of that external table from the warehouse directory using HDFS commands.
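● A brief HiveQL sketch of the two behaviours described above (the table and column names here are
assumptions, not the deck’s actual tables):
Command: create table managed_demo (id int, name string);
Command: create external table external_demo (id int, name string);
Dropping managed_demo deletes both its metadata and its data directory, while dropping external_demo
deletes only the metadata and leaves the files in the warehouse directory.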

10
Hive tutorial
Create managed table (internal table)
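● The slide shows the command as a screenshot; a minimal sketch of a managed-table DDL (the employee
columns are assumptions):
Command: create table employee (id int, name string, salary float);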

11
Hive tutorial
Describe table
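● A sketch of the command (assuming the managed table above is named employee); describe formatted
also reports whether the table is MANAGED or EXTERNAL:
Command: describe employee;
Command: describe formatted employee;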

12
Hive tutorial
Create external table
● Let’s try to create an external table
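● A minimal sketch (the name employee2 matches the external table loaded later; its columns are
assumptions):
Command: create external table employee2 (id int, name string, salary float) row format delimited
fields terminated by ',';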

13
Hive tutorial

14
Hive tutorial

● Let’s check the directory /user/hive/warehouse in HDFS

15
Hive tutorial

● Let’s create a new external table and store its data in the home directory of HDFS
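● A sketch of the command (the table name and HDFS path are assumptions; the LOCATION clause is what
keeps the data outside the warehouse directory):
Command: create external table employee3 (id int, name string, salary float) row format delimited
fields terminated by ',' location '/user/cloudera/employee3';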

16
Hive tutorial

● Let’s rename the table and check the result
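● A sketch of the command (table names are assumptions):
Command: alter table employee2 rename to employee_new;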

17
Hive tutorial

● Let’s add one more column to the table and check the result
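● A sketch of the command (the new column name and type are assumptions):
Command: alter table employee_new add columns (department string);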

18
Hive tutorial

● Let’s change one column of the table and check the result
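● A sketch of the command (column names and type are assumptions); CHANGE renames a column and can
also alter its type:
Command: alter table employee_new change department dept string;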

19
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee;

Why NULL?

20
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Check the schema of the table and the .csv file
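● A likely explanation (an assumption, since the slides show the answer only as screenshots): Employee.csv
is comma-separated, but if the table was created without a row format clause, Hive splits each line with its
default field delimiter (Ctrl-A), so the values cannot be mapped to columns and come back as NULL.
Recreating the table with an explicit delimiter avoids this, for example:
Command: create table employee (id int, name string, salary float) row format delimited fields
terminated by ',';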

21
Hive tutorial
LOAD Data from Local into Hive Managed Table

22
Hive tutorial
LOAD Data from Local into Hive Managed Table

23
Hive tutorial
LOAD Data from Local into Hive External Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee2;

24
Hive tutorial
Difference between managed and external table

25
Hive tutorial
Difference between managed and external table
● Let’s drop the external table

● Then check the directory associated with this external table in HDFS

26
Hive tutorial
Difference between managed and external table
● Let’s drop the internal table

● Then check the directory associated with this internal table in HDFS

27
Hive tutorial
LOAD Data from HDFS to Hive Table
● Let’s create an internal table

Declare this Hive property to skip the header row in the Student.csv file
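● A sketch of the command (the column names follow the student table used later in the partitioning
examples):
Command: create table student (id int, name string, age int, course string) row format delimited
fields terminated by ',' tblproperties('skip.header.line.count'='1');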

28
Hive tutorial
LOAD Data from HDFS to Hive Table
● Put file Student.csv to HDFS
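● A sketch of the shell command (the local path is an assumption):
Command: hdfs dfs -put /home/cloudera/Desktop/Student.csv /user/cloudera/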

29
Hive tutorial
LOAD Data from HDFS to Hive Table
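● A sketch of the command (the HDFS path is an assumption); without the LOCAL keyword, LOAD moves the
file from its HDFS location into the table’s warehouse directory:
Command: LOAD DATA INPATH '/user/cloudera/Student.csv' INTO TABLE student;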

30
Hive tutorial
Hive command using HUE
● Login to HUE

31
Hive tutorial
Hive command using HUE

32
Hive tutorial
Hive command using HUE
● Let’s check the file Employee.csv

33
Hive tutorial
Hive command using HUE
● Let’s check the file Student.csv

34
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries

35
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries

36
Hive tutorial
Hive command using HUE
● Check newly created file in HDFS

37
Hive tutorial
Hive command using HUE
● Let’s create a table in HUE
● Command: create table department (DepartmentID int, DepartmentName string) row format
delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');

38
Hive tutorial
Hive command using HUE
● Upload the file Department.csv into the department directory (use the + button)

39
Hive tutorial
Hive command using HUE
● Query this newly created table

40
Hive tutorial
Partition in Hive
● Hive organizes tables into partitions, grouping similar types of data together based on
a column or partition key. Each table can have one or more partition keys to identify a
particular partition. This allows faster queries on slices of the data.

41
Hive tutorial
Static partition
● Create new database

● Create new table with partition
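● A sketch of the DDL (the table name is an assumption); note that the partition column is declared in the
PARTITIONED BY clause, not in the column list:
Command: create table student_course (id int, name string, age int) partitioned by (course string)
row format delimited fields terminated by ',';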

42
Hive tutorial
Static partition
● Check the format of the table

43
Hive tutorial
Static partition
● Load data from file StudentHadoop.csv to partition (course=Hadoop)
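● A sketch of the command (the local path and table name are assumptions):
Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/StudentHadoop.csv' INTO TABLE
student_course PARTITION (course='Hadoop');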

44
Hive tutorial
Static partition
● Check new directory course=Hadoop in HDFS

45
Hive tutorial
Static partition
● Continue loading data into the other partitions

46
Hive tutorial
Static partition
● Check all created directories in HDFS

47
Hive tutorial
Dynamic partition

● Create new database

48
Hive tutorial
Dynamic partition

● Create table student (same as before)

49
Hive tutorial
Dynamic partition

● Load data into table student (same as before)

50
Hive tutorial
Dynamic partition

● Create table student_partition

51
Hive tutorial
Dynamic partition

● Command: insert into student_partition partition(Course) select ID, Name, Age, Course from student;
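● Note (an assumption about the environment, since the slide shows only a screenshot): dynamic
partitioning normally requires these settings before the INSERT:
Command: set hive.exec.dynamic.partition=true;
Command: set hive.exec.dynamic.partition.mode=nonstrict;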

52
Hive tutorial
Dynamic partition
● Check created directories in HDFS

53
Apache Pig Tutorial

54
High-Level Data Process Components
Pig

● A scripting platform for processing and analyzing large data sets


● Apache Pig allows you to write complex MapReduce programs using a
simple scripting language.
● Made of two components:
○ High-level language: Pig Latin (a data flow language).
○ Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop.

● Open source project


● Developed by Yahoo
55
Big Data
High-Level Data Process Components
Pig

● Pig Latin example:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

X = FOREACH A GENERATE name, $2;

DUMP X;

56
Big Data
Pig tutorial
Pig data model

60
Big Data
Pig tutorial
Pig data model

61
Big Data
Pig tutorial
Pig data model

62
Big Data
Pig tutorial
Tuple and Bag

63
Big Data
Pig tutorial
Tuple

● A tuple is an ordered set of fields which may contain different data types for each field. You
can understand it as a record stored in a row of a relational database. A tuple is a set of
cells from a single row, as shown in the above image. The elements inside a tuple do not
necessarily need to have a schema attached to them.
● A tuple is represented by the ‘()’ symbol.
● Example of a tuple − (1, Linkin Park, 7, California)
● Since tuples are ordered, we can access the fields in each tuple using the indexes of the fields;
for example, $1 from the above tuple will return the value ‘Linkin Park’. Notice that the above
tuple doesn’t have any schema attached to it.
64
Big Data
Pig tutorial
Bag

● A bag is a collection of tuples, and these tuples can be a subset of the rows or
entire rows of a table. A bag can contain duplicate tuples, and it is not
mandatory that they be unique.
● A bag has a flexible schema, i.e. tuples within the bag can have a different
number of fields. A bag can also have tuples with different data types.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California), (Metallica, 8), (Mega Death,
Los Angeles)}
65
Big Data
Pig tutorial
Bag

● For Apache Pig to effectively process bags, the fields and their respective
data types need to be in the same sequence.

● Set of bags −

● {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)},

● {(Metallica, 8, Los Angeles), (Mega Death, 8), (Linkin Park, California)}


66
Big Data
Pig tutorial
Two types of Bag: Outer Bag and Inner Bag.

● An outer bag, or relation, is nothing but a bag of tuples. Relations here are similar to
relations in relational databases. To understand it better, let us take an example:

● {(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}

● The above bag describes the relation between the bands and their place of origin.

67
Big Data
Pig tutorial
Pig data model

● An inner bag contains a bag inside a tuple. For example, if we group the
Band tuples based on the band’s origin, we will get:

● (Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})

● (California, {(Linkin Park, California)})

● Here, the first field type is a string, while the second field type is a bag,
which is an inner bag within a tuple.
68
Big Data


Pig tutorial
Map

69
Big Data
Pig tutorial
Map

● A map is a set of key-value pairs used to represent data elements. The key must be a
chararray and should be unique, like a column name, so it can be indexed
and the value associated with it can be accessed on the basis of the key. The value
can be of any data type.
● Maps are represented by the ‘[]’ symbol, and key and value are separated by the ‘#’
symbol, as you can see in the above image.
● Example of maps − [band#Linkin Park, members#7], [band#Metallica,
members#8]
70
Big Data
Pig tutorial
Schema

● A schema assigns a name to each field and declares the data type of the
field. Schemas are optional in Pig Latin, but Pig encourages you to use
them whenever possible, as error checking becomes more efficient
while parsing the script, which results in more efficient execution of the
program. A schema can be declared with both simple and complex data
types. If a schema is declared in the LOAD function, it is also
attached to the data.
71
Big Data
Pig tutorial
Schema
● A few points on schemas in Pig:
● If the schema only includes the field name, the data type of the field is considered to be
bytearray.
● If you assign a name to a field, you can access the field by both the field name and the
positional notation, whereas if the field name is missing you can only access it by the
positional notation, i.e. $ followed by the index number.
● If you perform any operation which is a combination of relations (like JOIN, COGROUP,
etc.) and any of the relations is missing a schema, the resulting relation will have a null
schema.
● If the schema is null, Pig will treat the field as bytearray, and the real data type of the field
will be determined dynamically.
72
Big Data
Pig tutorial
Open Pig grunt shell

● Command: pig

73
Big Data
Pig tutorial
Open Pig grunt shell

● Invoke the ls command of the Linux shell from the Grunt shell


● Command: sh ls

74
Big Data
Pig tutorial
Load data into Apache Pig from the file system (HDFS/Local)

75
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Open gedit, create a txt file, and save it in the home directory

101 Anto 20000 Architect
102 Bob 7000 SoftwareEngineer
103 Jack 4000 Programmer
104 Bil 3000 ITConsultant
105 Henry 5000 Manager
106 Isac 9000 Sr.Manager
107 David 7000 VP
108 Kingston 9000 Sr.VP
109 Balmer 19923 CEO

76
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Put the file to HDFS
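● A sketch of the shell command (the target HDFS directory is an assumption):
Command: hdfs dfs -put employeeDetails.txt /user/cloudera/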

77
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s load data into a Pig relation using Pig data types.
● Command: employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int,
name:chararray, salary:float, task:chararray);

78
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s DESCRIBE the relation to see the Data type names.
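● A sketch of the command (using the employee relation loaded above):
Command: describe employee;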

79
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s use dump operator to display the result
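● A sketch of the command (using the employee relation loaded above):
Command: dump employee;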

80
Big Data
Pig tutorial
Load data from Local File System into Pig relation

● Open pig shell in local mode by pig -x local

● Load file
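● A sketch of the load statement in local mode (the relation name and local path are assumptions; in local
mode the path refers to the local file system):
Command: employee_local = load '/home/cloudera/employeeDetails.txt' using PigStorage(' ') as
(id:int, name:chararray, salary:float, task:chararray);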

81
Big Data
Pig tutorial
Load data from Local File System into Pig relation

● Check the result

82
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● Let us consider that we have a Hive table called student with some data in it

83
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● The command below will load the data from the Hive table into a Pig relation called pigdataStudent

● Command: pigdataStudent = load 'student' using org.apache.hive.hcatalog.pig.HCatLoader();
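● Note (an assumption about the setup, since the slide shows only a screenshot): for HCatLoader to be
available, the Grunt shell is typically started with HCatalog support:
Command: pig -useHCatalog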

84
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● Check the content of the relation


● Command: dump pigdataStudent;

85
Big Data
Pig tutorial
Filter operation

● Now that the pigdataStudent relation has all the data, let us try to filter only the rows where age > 23.
● Command: plus23 = filter pigdataStudent by age > 23;

86
Big Data
Pig tutorial
Filter operation

● Let’s DESCRIBE the relation to see the Data type names

87
Big Data
Pig tutorial
Filter operation

● Check the result

88
Big Data
Pig tutorial
Storing Data from PIG Relation

89
Big Data
Pig tutorial
Store PIG Relation into HDFS
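● A sketch of the command (the output directory name follows the plus23 relation checked on the next
slide):
Command: store plus23 into 'plus23' using PigStorage(',');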

90
Big Data
Pig tutorial
Store PIG Relation into HDFS

● Check if the plus23 directory has been created in HDFS

91
Big Data
Pig tutorial
Store PIG Relation into HDFS

● Check the content of the file

92
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Create a new Hive table

93
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Store data from Pig relation into the newly created Hive table
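● A sketch of the command (the Hive table name student_new is an assumption):
Command: store plus23 into 'student_new' using org.apache.hive.hcatalog.pig.HCatStorer();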

94
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Check the Hive table

95
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Create and open an Apache Pig script file in an editor (e.g. gedit)
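● A sketch of what such a script might contain (the file name, relation names, and threshold are
assumptions):
employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
highpaid = filter employee by salary > 7000;
dump highpaid;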

96
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Run the script in the Linux terminal
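● A sketch of the command (assuming the script was saved as test1.pig):
Command: pig test1.pig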

97
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Create file test2.pig

98
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Run the script in the Grunt shell
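● A sketch of the command (assuming test2.pig is in the current directory; exec runs a script from within
the Grunt shell):
Command: exec test2.pig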

99
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Check the result

100
Big Data
Pig tutorial
Positional notation reference

● So far, fields in a Pig relation have been referred to by name (e.g. id, name, salary, task, etc.)
● Names are assigned by you using schemas
● Positional notation is generated by the system. Positional notation is indicated with the
dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
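● For example, a sketch reusing the employee relation loaded earlier (the new relation name is an
assumption):
Command: tasks = foreach employee generate $3;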

101
Big Data
Pig tutorial
Positional notation reference

● In this example, the field task is referenced by the positional notation $3

102
Big Data
Pig tutorial
Positional notation reference

● Check the result

103
Big Data
Pig tutorial
Schema Handling

● You can define a schema that includes both the field name and field type.
● You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
● You can choose not to define a schema; in this case, the field is un-named and the field
type defaults to bytearray.
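● A sketch of the three cases (the file and field names are assumptions):
Command: A = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
Command: B = load 'employeeDetails.txt' using PigStorage(' ') as (id, name, salary, task);
Command: C = load 'employeeDetails.txt' using PigStorage(' ');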

104
Big Data
Pig tutorial
Schema Handling

● The field data types are not specified (the default type is bytearray)

105
Big Data
Pig tutorial
Schema Handling

● Unknown schema

106
Big Data
Pig tutorial
Schema Handling

● Check the result

107
Big Data
Pig tutorial
Schema Handling

● Declare the schema of the result

108
Big Data
High-Level Data Process Components
Hive & Pig

● Both require a compiler to generate MapReduce jobs


● Hence, queries have high latency when used for real-time responses to
ad-hoc queries
● Both are good for batch processing and ETL jobs
● Fault tolerant

109
Big Data
High-Level Data Process Components
Impala

● Cloudera Impala is a query engine that runs on Apache Hadoop.


● Its query language is similar to HiveQL.
● Does not use MapReduce
● Optimized for low-latency queries
● Open source Apache project
● Developed by Cloudera
● Much faster than Hive or Pig

110
Big Data