
Big Data

Data Loading Tools


Trong-Hop Do

S3Lab
Smart Software System Laboratory

1
“Without big data, you are blind and deaf
and in the middle of a freeway.”
– Geoffrey Moore

Big Data 2
Hadoop Ecosystem

3
Big Data
Apache Hive Tutorial

4
High-Level Data Process Components
Hive

● An SQL-like interface to Hadoop.


● Data warehouse infrastructure built on top of Hadoop
● Provides data summarization, query, and analysis
● Query execution via MapReduce
● The Hive interpreter converts HiveQL queries into MapReduce jobs.
● Open source project.
● Developed by Facebook
● Also used by Netflix, CNET, Digg, eHarmony, etc.
5
Big Data
High-Level Data Process Components
Hive - architecture

6
Big Data
High-Level Data Process Components
Hive

● HiveQL example:

SELECT customerId, max(total_cost) FROM hive_purchases
GROUP BY customerId HAVING count(*) > 3;

7
Big Data
Hive tutorial

Let us get started with the Command Line Interface (CLI)


Command: hive

8
Hive tutorial
Create and show databases
Command: create database newdb;

Command: show databases;

9
Hive tutorial
Two types of table in Hive: managed table and external table
● For a managed table, Hive is responsible for managing the table’s data. If you load the
data from a file present in HDFS into a Hive managed table and then issue a DROP command on it, the
table along with its metadata will be deleted. The data belonging to the
dropped managed table no longer exists anywhere in HDFS, and you can’t retrieve it by any means.
Essentially, when you issue the LOAD command, you are moving the data from its HDFS file location
to the Hive warehouse directory.
● For an external table, Hive is not responsible for managing the data. In this case, when you issue the
LOAD command, Hive moves the data into its warehouse directory and then creates the
metadata information for the external table. If you issue a DROP command on the external
table, only the metadata information about the external table is deleted. Therefore, you can still
retrieve the data of that external table from the warehouse directory using HDFS commands.
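● A brief HiveQL sketch of the two behaviours described above (the table and column names here are
assumptions, not the deck’s actual tables):
Command: create table managed_demo (id int, name string);
Command: create external table external_demo (id int, name string);
Dropping managed_demo deletes both its metadata and its data directory, while dropping external_demo
deletes only the metadata and leaves the files in the warehouse directory.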

10
Hive tutorial
Create managed table (internal table)
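● The slide shows the command as a screenshot; a minimal sketch of a managed-table DDL (the employee
columns are assumptions):
Command: create table employee (id int, name string, salary float);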

11
Hive tutorial
Describe table
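● A sketch of the command (assuming the managed table above is named employee); describe formatted
also reports whether the table is MANAGED or EXTERNAL:
Command: describe employee;
Command: describe formatted employee;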

12
Hive tutorial
Create external table
● Let’s try to create an external table
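● A minimal sketch (the name employee2 matches the external table loaded later; its columns are
assumptions):
Command: create external table employee2 (id int, name string, salary float) row format delimited
fields terminated by ',';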

13
Hive tutorial

14
Hive tutorial

● Let’s check the directory /user/hive/warehouse in HDFS

15
Hive tutorial

● Let’s create a new external table and store its data in the home directory of HDFS
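● A sketch of the command (the table name and HDFS path are assumptions; the LOCATION clause is what
keeps the data outside the warehouse directory):
Command: create external table employee3 (id int, name string, salary float) row format delimited
fields terminated by ',' location '/user/cloudera/employee3';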

16
Hive tutorial

● Let’s rename the table and check the result
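● A sketch of the command (table names are assumptions):
Command: alter table employee2 rename to employee_new;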

17
Hive tutorial

● Let’s add one more column to the table and check the result
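● A sketch of the command (the new column name and type are assumptions):
Command: alter table employee_new add columns (department string);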

18
Hive tutorial

● Let’s change one column of the table and check the result
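● A sketch of the command (column names and type are assumptions); CHANGE renames a column and can
also alter its type:
Command: alter table employee_new change department dept string;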

19
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee;

Why NULL?

20
Hive tutorial
LOAD Data from Local into Hive Managed Table
● Check the schema of the table and the .csv file
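● A likely explanation (an assumption, since the slides show the answer only as screenshots): Employee.csv
is comma-separated, but if the table was created without a row format clause, Hive splits each line with its
default field delimiter (Ctrl-A), so the values cannot be mapped to columns and come back as NULL.
Recreating the table with an explicit delimiter avoids this, for example:
Command: create table employee (id int, name string, salary float) row format delimited fields
terminated by ',';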

21
Hive tutorial
LOAD Data from Local into Hive Managed Table

22
Hive tutorial
LOAD Data from Local into Hive Managed Table

23
Hive tutorial
LOAD Data from Local into Hive External Table
● Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Employee.csv' INTO TABLE employee2;

24
Hive tutorial
Difference between managed and external table

25
Hive tutorial
Difference between managed and external table
● Let’s drop the external table

● Then check the directory associated with this external table in HDFS

26
Hive tutorial
Difference between managed and external table
● Let’s drop the internal table

● Then check the directory associated with this internal table in HDFS

27
Hive tutorial
LOAD Data from HDFS to Hive Table
● Let’s create an internal table

Declare this Hive property to skip the header row in the Student.csv file
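● A sketch of the command (the column names follow the student table used later in the partitioning
examples):
Command: create table student (id int, name string, age int, course string) row format delimited
fields terminated by ',' tblproperties('skip.header.line.count'='1');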

28
Hive tutorial
LOAD Data from HDFS to Hive Table
● Put file Student.csv to HDFS
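● A sketch of the shell command (the local path is an assumption):
Command: hdfs dfs -put /home/cloudera/Desktop/Student.csv /user/cloudera/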

29
Hive tutorial
LOAD Data from HDFS to Hive Table
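● A sketch of the command (the HDFS path is an assumption); without the LOCAL keyword, LOAD moves the
file from its HDFS location into the table’s warehouse directory:
Command: LOAD DATA INPATH '/user/cloudera/Student.csv' INTO TABLE student;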

30
Hive tutorial
Hive command using HUE
● Login to HUE

31
Hive tutorial
Hive command using HUE

32
Hive tutorial
Hive command using HUE
● Let’s check the file Employee.csv

33
Hive tutorial
Hive command using HUE
● Let’s check the file Student.csv

34
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries

35
Hive tutorial
Hive command using HUE
● Let’s run some Hive queries

36
Hive tutorial
Hive command using HUE
● Check newly created file in HDFS

37
Hive tutorial
Hive command using HUE
● Let’s create a table in HUE
● Command: create table department (DepartmentID int, DepartmentName string) row format
delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');

38
Hive tutorial
Hive command using HUE
● Upload the file Department.csv into the department directory (use the + button)

39
Hive tutorial
Hive command using HUE
● Query this newly created table

40
Hive tutorial
Partition in Hive
● Hive organizes tables into partitions, grouping similar types of data together based on
a column or partition key. Each table can have one or more partition keys to identify a
particular partition. This allows faster queries on slices of the data.

41
Hive tutorial
Static partition
● Create new database

● Create new table with partition
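● A sketch of the DDL (the table name is an assumption); note that the partition column is declared in the
PARTITIONED BY clause, not in the column list:
Command: create table student_course (id int, name string, age int) partitioned by (course string)
row format delimited fields terminated by ',';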

42
Hive tutorial
Static partition
● Check the format of the table

43
Hive tutorial
Static partition
● Load data from file StudentHadoop.csv to partition (course=Hadoop)
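● A sketch of the command (the local path and table name are assumptions):
Command: LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/StudentHadoop.csv' INTO TABLE
student_course PARTITION (course='Hadoop');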

44
Hive tutorial
Static partition
● Check new directory course=Hadoop in HDFS

45
Hive tutorial
Static partition
● Continue loading data into the other partitions

46
Hive tutorial
Static partition
● Check all created directories in HDFS

47
Hive tutorial
Dynamic partition

● Create new database

48
Hive tutorial
Dynamic partition

● Create table student (same as before)

49
Hive tutorial
Dynamic partition

● Load data into table student (same as before)

50
Hive tutorial
Dynamic partition

● Create table student_partition

51
Hive tutorial
Dynamic partition

● Command: insert into student_partition partition(Course) select ID, Name, Age, Course from student;
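● Note (an assumption about the environment, since the slide shows only a screenshot): dynamic
partitioning normally requires these settings before the INSERT:
Command: set hive.exec.dynamic.partition=true;
Command: set hive.exec.dynamic.partition.mode=nonstrict;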

52
Hive tutorial
Dynamic partition
● Check created directories in HDFS

53
Apache Pig Tutorial

54
High-Level Data Process Components
Pig

● A scripting platform for processing and analyzing large data sets


● Apache Pig allows you to write complex MapReduce programs using a
simple scripting language.
● Made of two components:
○ High-level language: Pig Latin (a data flow language).
○ Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop.

● Open source project


● Developed by Yahoo
55
Big Data
High-Level Data Process Components
Pig

● Pig Latin example:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

X = FOREACH A GENERATE name, $2;

DUMP X;

56
Big Data
Pig tutorial
Pig data model

60
Big Data
Pig tutorial
Pig data model

61
Big Data
Pig tutorial
Pig data model

62
Big Data
Pig tutorial
Tuple and Bag

63
Big Data
Pig tutorial
Tuple

● A tuple is an ordered set of fields which may contain different data types for each field. You
can understand it as a record stored in a row of a relational database. A tuple is a set of
cells from a single row, as shown in the above image. The elements inside a tuple do not
necessarily need to have a schema attached to them.
● A tuple is represented by the ‘()’ symbol.
● Example of a tuple − (1, Linkin Park, 7, California)
● Since tuples are ordered, we can access the fields in each tuple using the indexes of the fields;
for example, $1 from the above tuple will return the value ‘Linkin Park’. Notice that the above
tuple doesn’t have any schema attached to it.
64
Big Data
Pig tutorial
Bag

● A bag is a collection of tuples, and these tuples can be a subset of the rows or
entire rows of a table. A bag can contain duplicate tuples, and it is not
mandatory that they be unique.
● A bag has a flexible schema, i.e. tuples within the bag can have a different
number of fields. A bag can also have tuples with different data types.
● A bag is represented by the ‘{}’ symbol.
● Example of a bag − {(Linkin Park, 7, California), (Metallica, 8), (Mega Death,
Los Angeles)}
65
Big Data
Pig tutorial
Bag

● For Apache Pig to effectively process bags, the fields and their respective
data types need to be in the same sequence.

● Set of bags −

● {(Linkin Park, 7, California), (Metallica, 8), (Mega Death, Los Angeles)},

● {(Metallica, 8, Los Angeles), (Mega Death, 8), (Linkin Park, California)}


66
Big Data
Pig tutorial
Two types of Bag: Outer Bag and Inner Bag.

● An outer bag, or relation, is nothing but a bag of tuples. Relations here are similar to
relations in relational databases. To understand it better, let us take an example:

● {(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}

● The above bag describes the relation between the bands and their place of origin.

67
Big Data
Pig tutorial
Pig data model

● An inner bag contains a bag inside a tuple. For example, if we group the
Band tuples based on the band’s origin, we will get:

● (Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})

● (California, {(Linkin Park, California)})

● Here, the first field type is a string, while the second field type is a bag,
which is an inner bag within a tuple.
68
Big Data


Pig tutorial
Map

69
Big Data
Pig tutorial
Map

● A map is a set of key-value pairs used to represent data elements. The key must be a
chararray and should be unique, like a column name, so it can be indexed
and the value associated with it can be accessed on the basis of the key. The value
can be of any data type.
● Maps are represented by the ‘[]’ symbol, and key and value are separated by the ‘#’
symbol, as you can see in the above image.
● Example of maps − [band#Linkin Park, members#7], [band#Metallica,
members#8]
70
Big Data
Pig tutorial
Schema

● A schema assigns a name to each field and declares the data type of the
field. Schemas are optional in Pig Latin, but Pig encourages you to use
them whenever possible, as error checking becomes more efficient
while parsing the script, which results in more efficient execution of the
program. A schema can be declared with both simple and complex data
types. If a schema is declared in the LOAD function, it is also
attached to the data.
71
Big Data
Pig tutorial
Schema
● A few points on schemas in Pig:
● If the schema only includes the field name, the data type of the field is considered to be
bytearray.
● If you assign a name to a field, you can access the field by both the field name and the
positional notation, whereas if the field name is missing you can only access it by the
positional notation, i.e. $ followed by the index number.
● If you perform any operation which is a combination of relations (like JOIN, COGROUP,
etc.) and any of the relations is missing a schema, the resulting relation will have a null
schema.
● If the schema is null, Pig will treat the field as bytearray, and the real data type of the field
will be determined dynamically.
72
Big Data
Pig tutorial
Open Pig grunt shell

● Command: pig

73
Big Data
Pig tutorial
Open Pig grunt shell

● Invoke the ls command of the Linux shell from the Grunt shell


● Command: sh ls

74
Big Data
Pig tutorial
Load data into Apache Pig from the file system (HDFS/Local)

75
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Open gedit, create a txt file, and save it in the home directory

101 Anto 20000 Architect
102 Bob 7000 SoftwareEngineer
103 Jack 4000 Programmer
104 Bil 3000 ITConsultant
105 Henry 5000 Manager
106 Isac 9000 Sr.Manager
107 David 7000 VP
108 Kingston 9000 Sr.VP
109 Balmer 19923 CEO

76
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Put the file to HDFS
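● A sketch of the shell command (the target HDFS directory is an assumption):
Command: hdfs dfs -put employeeDetails.txt /user/cloudera/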

77
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s load data into a Pig relation using Pig data types.
● Command: employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int,
name:chararray, salary:float, task:chararray);

78
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s DESCRIBE the relation to see the Data type names.
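● A sketch of the command (using the employee relation loaded above):
Command: describe employee;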

79
Big Data
Pig tutorial
Load data from HDFS into Pig relation

● Let’s use dump operator to display the result
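● A sketch of the command (using the employee relation loaded above):
Command: dump employee;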

80
Big Data
Pig tutorial
Load data from Local File System into Pig relation

● Open pig shell in local mode by pig -x local

● Load file
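● A sketch of the load statement in local mode (the relation name and local path are assumptions; in local
mode the path refers to the local file system):
Command: employee_local = load '/home/cloudera/employeeDetails.txt' using PigStorage(' ') as
(id:int, name:chararray, salary:float, task:chararray);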

81
Big Data
Pig tutorial
Load data from Local File System into Pig relation

● Check the result

82
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● Let us consider that we have a Hive table called student with some data in it

83
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● The command below will load the data from the Hive table into a Pig relation called pigdataStudent

● Command: pigdataStudent = load 'student' using org.apache.hive.hcatalog.pig.HCatLoader();
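● Note (an assumption about the setup, since the slide shows only a screenshot): for HCatLoader to be
available, the Grunt shell is typically started with HCatalog support:
Command: pig -useHCatalog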

84
Big Data
Pig tutorial
LOAD Data from HIVE Table into PIG Relation.

● Check the content of the relation


● Command: dump pigdataStudent;

85
Big Data
Pig tutorial
Filter operation

● Now that the pigdataStudent relation has all the data, let us try to filter only the rows where age > 23.
● Command: plus23 = filter pigdataStudent by age > 23;

86
Big Data
Pig tutorial
Filter operation

● Let’s DESCRIBE the relation to see the Data type names

87
Big Data
Pig tutorial
Filter operation

● Check the result

88
Big Data
Pig tutorial
Storing Data from PIG Relation

89
Big Data
Pig tutorial
Store PIG Relation into HDFS
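● A sketch of the command (the output directory name follows the plus23 relation checked on the next
slide):
Command: store plus23 into 'plus23' using PigStorage(',');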

90
Big Data
Pig tutorial
Store PIG Relation into HDFS

● Check if the plus23 directory has been created in HDFS

91
Big Data
Pig tutorial
Store PIG Relation into HDFS

● Check the content of the file

92
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Create a new Hive table

93
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Store data from Pig relation into the newly created Hive table
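● A sketch of the command (the Hive table name student_new is an assumption):
Command: store plus23 into 'student_new' using org.apache.hive.hcatalog.pig.HCatStorer();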

94
Big Data
Pig tutorial
STORE Data from PIG Relation Into HIVE Table

● Check the Hive table

95
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Create and open an Apache Pig script file in an editor (e.g. gedit)
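● A sketch of what such a script might contain (the file name, relation names, and threshold are
assumptions):
employee = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
highpaid = filter employee by salary > 7000;
dump highpaid;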

96
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Run the script in the Linux terminal
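● A sketch of the command (assuming the script was saved as test1.pig):
Command: pig test1.pig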

97
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Create file test2.pig

98
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Run the script in the Grunt shell
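● A sketch of the command (assuming test2.pig is in the current directory; exec runs a script from within
the Grunt shell):
Command: exec test2.pig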

99
Big Data
Pig tutorial
Create Your First Apache Pig Script

● Check the result

100
Big Data
Pig tutorial
Positional notation reference

● So far, fields in a Pig relation have been referred to by name (e.g. id, name, salary, task, etc.)
● Names are assigned by you using schemas
● Positional notation is generated by the system. Positional notation is indicated with the
dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
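● For example, a sketch reusing the employee relation loaded earlier (the new relation name is an
assumption):
Command: tasks = foreach employee generate $3;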

101
Big Data
Pig tutorial
Positional notation reference

● In this example, the field task is referenced by the positional notation $3

102
Big Data
Pig tutorial
Positional notation reference

● Check the result

103
Big Data
Pig tutorial
Schema Handling

● You can define a schema that includes both the field name and field type.
● You can define a schema that includes the field name only; in this case, the field type
defaults to bytearray.
● You can choose not to define a schema; in this case, the field is un-named and the field
type defaults to bytearray.
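● A sketch of the three cases (the file and field names are assumptions):
Command: A = load 'employeeDetails.txt' using PigStorage(' ') as (id:int, name:chararray, salary:float, task:chararray);
Command: B = load 'employeeDetails.txt' using PigStorage(' ') as (id, name, salary, task);
Command: C = load 'employeeDetails.txt' using PigStorage(' ');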

104
Big Data
Pig tutorial
Schema Handling

● The field data types are not specified (the default type is bytearray)

105
Big Data
Pig tutorial
Schema Handling

● Unknown schema

106
Big Data
Pig tutorial
Schema Handling

● Check the result

107
Big Data
Pig tutorial
Schema Handling

● Declare the schema of the result

108
Big Data
High-Level Data Process Components
Hive & Pig

● Both require a compiler to generate MapReduce jobs


● Hence, queries have high latency when used for real-time responses to
ad-hoc queries
● Both are good for batch processing and ETL jobs
● Fault tolerant

109
Big Data
High-Level Data Process Components
Impala

● Cloudera Impala is a query engine that runs on Apache Hadoop.


● Its query language is similar to HiveQL.
● Does not use MapReduce
● Optimized for low-latency queries
● Open source Apache project
● Developed by Cloudera
● Much faster than Hive or Pig

110
Big Data