BDA Unit-4
Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification, Cluster Setup and Installation,
Hadoop Configuration, Security.
Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin, User Defined Functions,
Data Processing Operators.
To build a Hadoop cluster yourself, there are still a number of installation options:
Apache tarballs
• The Apache Hadoop project and related projects provide binary (and source) tarballs for each release.
• Installation from binary tarballs gives you the most flexibility but requires the most work, since you need to decide where the installation files, configuration files, and log files are located on the filesystem, set their file permissions correctly, and so on.
Packages
• RPM and Debian packages are available from the Apache Bigtop project, as well as from all the
Hadoop vendors.
• Packages bring a number of advantages over tarballs: they provide a consistent filesystem layout, they are tested together as a stack (so you know that the versions of Hadoop and Hive, say, will work together), and they work well with configuration management tools like Puppet (a tool for managing the configuration of many servers at once).
• Cloudera Manager and Apache Ambari are examples of dedicated tools for installing and managing
a Hadoop cluster over its whole lifecycle. They provide a simple web UI, and are the recommended
way to set up a Hadoop cluster for most users and operators.
Cluster specification
A typical choice of machine for running an HDFS datanode and a YARN node manager in 2014 would have been a commodity server with multicore processors, tens to hundreds of gigabytes of ECC RAM, and a dozen or more multi-terabyte disks.
Cluster Sizing
How large should your cluster be? There isn’t an exact answer to this question, but the beauty of Hadoop is
that you can start with a small cluster (say, 10 nodes) and grow it as your storage and computational needs
grow. In many ways, a better question is this: how fast does your cluster need to grow? You can get a good feel
for this by considering storage capacity.
For example, if your data grows by 1 TB a day and you have three-way HDFS replication, you need an
additional 3 TB of raw storage per day. Allow some room for intermediate files and log files (around 30%,
say), and this is in the range of one (2014-vintage) machine per week. In practice, you wouldn’t buy a new
machine each week and add it to the cluster. The value of doing a back-of-the-envelope calculation like this is
that it gives you a feel for how big your cluster should be. In this example, a cluster that holds two years’ worth
of data needs 100 machines.
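A rough back-of-the-envelope version of that calculation (the per-machine capacity of about 30 TB of raw disk is an assumed, 2014-vintage figure, not a value from the text):
1 TB/day of new data x 3 (replication) = 3 TB/day of raw storage
3 TB/day x 1.3 (about 30% overhead) = roughly 4 TB/day
2 years = about 730 days x 4 TB/day = roughly 2.9 PB of raw storage
2.9 PB / 30 TB per machine = roughly 100 machines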
Depending on the size of the cluster, there are various configurations for running the master daemons: the
namenode, secondary namenode, resource manager, and history server.
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the resource
manager on a single master machine. However, as the cluster gets larger, there are good reasons to separate
them.
• The namenode has high memory requirements, as it holds file and block metadata for the entire
namespace in memory. The secondary namenode, although idle most of the time, has a comparable
memory footprint to the primary when it creates a checkpoint. For file systems with a large number
of files, there may not be enough physical memory on one machine to run both the primary and
secondary namenode.
• Aside from simple resource requirements, the main reason to run masters on separate machines is
for high availability. Both HDFS and YARN support configurations where they can run masters in
active-standby pairs. If the active master fails, then the standby, running on separate hardware, takes
over with little or no interruption to the service.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 10-1.
Typically there are 30 to 40 servers per rack (only 3 are shown in the diagram), with a 10 Gb switch for the
rack and an uplink to a core switch or router (at least 10 Gb or better).
Rack awareness:
• To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the
topology of your network. If your cluster runs on a single rack, then there is nothing more to do, since
this is the default.
• However, for multirack clusters, you need to map nodes to racks. This allows Hadoop to prefer within-
rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce
tasks on nodes.
• HDFS will also be able to place replicas more intelligently to trade off performance and resilience.
• Network locations such as nodes and racks are represented in a tree, which reflects the network
“distance” between locations.
• The namenode uses the network location when determining where to place block replicas; the
MapReduce scheduler uses network location to determine where the closest replica is for input to a map
task.
• For the network in Figure 10-1, the rack topology is described by two network locations, say /switch1/rack1 and /switch1/rack2. Because there is only one top-level switch in this cluster, the locations can be simplified to /rack1 and /rack2 (a configuration sketch for supplying such a mapping is given below).
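One common way to supply this node-to-rack mapping is a topology script referenced from core-site.xml via the net.topology.script.file.name property. The script path below is an assumption for illustration; the script receives host names or IP addresses as arguments and prints a network location such as /rack1 or /rack2 for each one.
<property>
<name>net.topology.script.file.name</name>
<value>/home/hadoop/hadoop-3.2.2/etc/hadoop/topology.sh</value>
</property>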
Install JDK
$ java -version
$ javac -version
$ su - hadoop
Hadoop Installation
Hadoop installation/configuration
• bashrc
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
bashrc
$cd /home/hadoop/hadoop-3.2.2
$sudo nano ~/.bashrc
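In ~/.bashrc, append the Hadoop environment variables. The values below are a typical sketch assuming the installation directory /home/hadoop/hadoop-3.2.2 used in this unit:
export HADOOP_HOME=/home/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"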
$source ~/.bashrc
hadoop-env.sh
$which javac
Note the output of the above command which will be like /usr/bin/javac
$readlink -f /usr/bin/javac
Note the output of the above command to be used as java home path in the next command.
Add a line of the form export JAVA_HOME=<path noted above, without the trailing /bin/javac> after the commented line "#export JAVA_HOME=" in this file.
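For example, if readlink reported /usr/lib/jvm/java-8-openjdk-amd64/bin/javac (a hypothetical path; use the output from your own system), the line to add to hadoop-env.sh would be:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64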
core-site.xml
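core-site.xml identifies the default filesystem. A minimal single-node sketch (the port and temporary directory are typical values, not mandated ones) is:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmpdata</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>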
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/hadoop-3.2.2/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
mapred-site.xml
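mapred-site.xml tells MapReduce which framework to run on; a minimal sketch sets it to YARN:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>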
yarn-site.xml
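yarn-site.xml configures the node manager; a minimal sketch enables the MapReduce shuffle auxiliary service:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Before starting the daemons for the first time, the HDFS filesystem is normally formatted once:
$ hdfs namenode -format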
$sbin/start-dfs.sh
$sbin/start-yarn.sh
$jps
Hadoop Security
Hadoop Security is generally defined as the procedure for securing the Hadoop data storage unit against potential cyber threats. Hadoop builds this security wall by following the security protocol below.
1. Authentication
2. Authorization
3. Auditing
Authentication:
Authentication is the first stage, where the user's credentials are verified. The credentials typically include the user's dedicated username and a secret password.
The entered credentials are checked against the details available in the security database. If valid, the user is authenticated.
Authorization:
Authorization is the second stage, where the system decides whether to grant the user permission to access the data.
It is based on the predesignated access control list. The Confidential information is kept secure and only
authorized personnel can access it.
Auditing:
Auditing is the last stage; it keeps track of the operations performed by the authenticated user during the period in which the user was logged into the cluster. This is done solely for security purposes.
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Programmers who are not comfortable with Java often struggle when working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex
codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time by almost 16 times.
• Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc.
In addition, it also provides nested data types like tuples, bags, and maps that are missing from
MapReduce.
Features of Pig
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good
at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the
programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process, and
write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming languages
such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded).
After execution, these scripts will go through a series of transformations applied by the Pig Framework, to
produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the
programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the
major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and
other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents
the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented
as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, install Hadoop and Java prior to installing Apache Pig.
Step 1
First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under
the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on
the link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.
Step 4
Within these folders, you will have the source and binary files of Apache Pig in various distributions.
Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop,
Java, and other software were installed.
$ mkdir Pig
Step 2
$ cd Downloads/
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of the extracted pig-0.15.0 directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0/* /home/Hadoop/Pig/
After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and
pig.properties.
.bashrc file
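In the .bashrc file, append lines of the following form. The paths are a sketch assuming the /home/Hadoop/Pig directory created above and an existing HADOOP_HOME:
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf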
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
pig -h properties
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will
get the version of Apache Pig as shown below.
$ pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need of
Hadoop or HDFS. This mode is generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using
Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job
is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In
this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a
single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our script.
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as shown below.
Command (Local mode): $ ./pig -x local
Command (MapReduce mode): $ ./pig -x mapreduce
Output − either command starts the Grunt shell, which displays the grunt> prompt.
Pig Latin Editors
There are Pig Latin syntax highlighters available for a variety of editors, including Eclipse, IntelliJ IDEA, Vim, Emacs, and TextMate.
Apache Pig vs MapReduce
• Apache Pig: Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. MapReduce: Exposure to Java is a must to work with MapReduce.
• Apache Pig: The multi-query approach reduces the length of the code to a great extent. MapReduce: Almost 20 times more lines of code are required to perform the same task.
Pig vs SQL
• Pig: Pig Latin is a procedural language. SQL: SQL is a declarative language.
• Pig: The data model in Apache Pig is nested relational. SQL: The data model used in SQL is flat relational.
Pig vs Hive
• Pig: Apache Pig uses a language called Pig Latin; it was originally created at Yahoo. Hive: Hive uses a language called HiveQL; it was originally created at Facebook.
• Pig: Apache Pig can handle structured, unstructured, and semi-structured data. Hive: Hive is mostly for structured data.
Pig Latin
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
The data model of Pig Latin is fully nested and it allows complex non-atomic data types such
as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any
type. A tuple is similar to a row in a table of RDBMS.
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known
as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by
‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields in the same position (column)
have the same type.
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
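The following values (invented purely for illustration) show how each of these types is written in Pig Latin notation:
An atom (field): 'Raja' or 30
A tuple: (Raja, 30)
A bag: {(Raja, 30), (Mohammad, 45)}
A map: [name#Raja, age#30]
A relation: a bag of tuples such as the one above, referred to by a name (alias) in a Pig script.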
Relational Operations
The following table describes the relational operators of Pig Latin.
LOAD − To load data from the file system (local/HDFS) into a relation.
Filtering − FILTER, DISTINCT, FOREACH...GENERATE (select or transform rows).
Sorting − ORDER, LIMIT.
Diagnostic operators − DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE (inspect a relation or its execution plan).
Functions
Eval function
A function that takes one or more expressions and returns another expression. An example of a
built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some
eval functions are aggregate functions.
Filter function
A special type of eval function that returns a logical Boolean result. As the name suggests, filter
functions are used in the FILTER operator to remove unwanted rows. They can also be used in
other relational operators that take Boolean conditions, and in general, in expressions using
Boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which
tests whether a bag or a map contains any items.
Load function
A function that specifies how to load data into a relation from external storage (local/HDFS).
Store function
A function that specifies how to save the contents of a relation to external storage. Often, load
and store functions are implemented by the same type. For example, PigStorage, which loads
data from delimited text files, can store data in the same format.
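A short sketch tying these function types together (the file name, delimiter, and field names are assumptions): PigStorage acts as the load and store function, and MAX as a built-in eval function.
grunt> records = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> grouped = GROUP records ALL;
grunt> max_age = FOREACH grouped GENERATE MAX(records.age);
grunt> STORE max_age INTO 'output_dir' USING PigStorage(',');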
User-Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using these UDFs, we can define our own functions and use them.
UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby and Groovy.
Types of UDF's in Java
While writing UDFs using Java, we can create and use the following three types of functions.
Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.
Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
Algebraic Functions − The algebraic functions act on inner bags in a FOREACH-GENERATE statement; they are used to perform full MapReduce operations on an inner bag.
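As a minimal sketch of a Java eval UDF (the class name and behaviour are invented for illustration), a UDF extends org.apache.pig.EvalFunc and implements exec():
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Upper-cases the chararray passed as the first field of the input tuple.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for missing input so Pig treats the result as a null field.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}
After packaging the class into a jar (say myudfs.jar, a hypothetical name), it would be registered and used in a script as:
grunt> REGISTER myudfs.jar;
grunt> upper_names = FOREACH student GENERATE UpperCase(firstname);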
You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator
of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator. On the left-hand side, we
need to mention the name of the relation where we want to store the data, and on the right-hand
side, we have to define how we store the data. Given below is the syntax of the Load operator.
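The general form of the load statement is:
grunt> Relation_name = LOAD 'Input file path' USING function as schema;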
Where,
• relation_name − We have to mention the relation in which we want to store the data.
• Input file path − We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
• function − We have to choose a function from the set of load functions provided by
Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required schema
as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − We can load the data without specifying the schema. In that case, the columns will be addressed positionally as $0, $1, $2, and so on.
Example
As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown
below.
$ pig -x mapreduce
grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following Pig Latin
statement in the Grunt shell.
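A statement of the following form does this (the HDFS path is an assumption; adjust it to wherever student_data.txt actually resides):
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);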
Storage function − We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated; by default, the delimiter is '\t'.
Datatypes − int, chararray, chararray, chararray, chararray (for the id, firstname, lastname, phone, and city fields respectively).
You can store the loaded data in the file system using the store operator. This chapter explains
how to store data in Apache Pig using the Store operator.
Syntax
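The general form of the store statement is:
grunt> STORE Relation_name INTO 'required_directory_path' [USING function];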
Example
Assume that we have the file student_data.txt in the HDFS directory /pig_data/ with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.
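Assuming the same path and schema as in the previous example:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);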
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the following output. A directory is created
with the specified name and the data will be stored in it.
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
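The general form of the FILTER statement is:
grunt> Relation2_name = FILTER Relation1_name BY (condition);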
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as
shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
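The path and schema below are assumptions consistent with the sample data:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);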
Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.
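The filtering statement, following the relation name used in the verification step below, is:
grunt> filter_data = FILTER student_details BY city == 'Chennai';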
Verify the relation filter_data using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation filter_data as
follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
The FOREACH operator is used to generate specified data transformations based on the column
data.
Syntax
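The general form of the FOREACH statement is:
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);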
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as
shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
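Assuming the same path and schema as in the FILTER example:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);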
Let us now get the id, age, and city values of each student from the relation student_details and
store it into another relation named foreach_data using the foreach operator as shown below.
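The statement, matching the columns shown in the output below, is:
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;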
Verify the relation foreach_data using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Grouping and Joining Data
The JOIN operator is used to combine records from two or more relations. While performing a
join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these
keys match, the two particular tuples are matched, else the records are dropped. Joins can be of
the following types −
• Self-join
• Inner-join
• Outer-join − left join, right join, and full join
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory
of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown
below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
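The corresponding load for orders.txt; the field names (oid, date, customer_id, amount) are assumptions chosen to match the sample rows:
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);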
Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily
renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under
different aliases (names). Therefore let us load the contents of the file customers.txt as two
tables as shown below.
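Assuming the same path and schema as the customers relation above:
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);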
Given below is the syntax of performing self-join operation using the JOIN operator.
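Using placeholder relation and key names:
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;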
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
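The join statement, producing the relation verified below, is:
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;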
Verify the relation customers3 using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows
when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon
the join-predicate. The query compares each row of A with each row of B to find all pairs of
rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for
each matched pair of rows of A and B are combined into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
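Using placeholder names:
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;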
Let us perform inner join operation on the two relations customers and orders as shown below.
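The join statement (customer_id refers to the assumed field name in the orders relation loaded above):
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;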
Verify the relation coustomer_orders using the DUMP operator as shown below.
You will get the following output, displaying the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations.
An outer join operation is carried out in three ways −
• Left outer join
• Right outer join
• Full outer join
Left Outer Join
The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using the JOIN operator.
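Using placeholder names:
grunt> Relation3_name = JOIN Relation1_name BY key LEFT OUTER, Relation2_name BY key;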
Let us perform left outer join operation on the two relations customers and orders as shown
below.
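The statement (customer_id again being the assumed orders field):
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;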
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using the JOIN operator.
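Using placeholder names:
grunt> Relation3_name = JOIN Relation1_name BY key RIGHT OUTER, Relation2_name BY key;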
Let us perform right outer join operation on the two relations customers and orders as shown
below.
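The statement, using the same assumed field names:
grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;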
Verify the relation outer_right using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
The full outer join operation returns rows when there is a match in one of the relations.
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
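Using placeholder names:
grunt> Relation3_name = JOIN Relation1_name BY key FULL OUTER, Relation2_name BY key;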
Let us perform full outer join operation on the two relations customers and orders as shown
below.
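The statement, using the same assumed field names:
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;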
Verify the relation outer_full using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Using Multiple Keys
Syntax
Here is how you can perform a JOIN operation on two tables using multiple keys.
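Using placeholder names:
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);
Example − assume that we have two files, employee.txt and employee_contact.txt, in the /pig_data/ directory of HDFS, as shown below.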
employee.txt
001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001
employee_contact.txt
001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001
And we have loaded these two files into Pig with relations employee and employee_contact as
shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
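The corresponding load for employee_contact.txt; the field names are assumptions matching the sample rows:
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);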
Now, let us join the contents of these two relations using the JOIN operator as shown below.
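Joining on the composite key (id, jobid):
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);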
Verify the relation emp using the DUMP operator as shown below.
It will produce the following output, displaying the contents of the relation named emp as shown
below.
(1,Rajiv,Reddy,21,programmer,113,1,9848022337,[email protected],Hyderabad,113)
(2,siddarth,Battacharya,22,programmer,113,2,9848022338,[email protected],Kolka ta,113)
(3,Rajesh,Khanna,22,programmer,113,3,9848022339,[email protected],Delhi,113)
(4,Preethi,Agarwal,21,programmer,113,4,9848022330,[email protected],Pune,113)
(5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,[email protected],Bhuwaneshw
ar,113)
(6,Archana,Mishra,23,programmer,113,6,9848022335,[email protected],Chennai,113)
(7,Komal,Nayak,24,teamlead,112,7,9848022334,[email protected],trivendram,112)
(8,Bharathi,Nambiayar,24,manager,111,8,9848022333,[email protected],Chennai,111)