Pig SKB
Pig scripts are internally converted to MapReduce jobs and executed on
data stored in HDFS. Apart from that, Pig can also run its jobs on
Apache Tez or Apache Spark.
1) Ease of programming
Writing complex Java MapReduce programs is quite difficult for non-
programmers. Pig makes this process easy: queries written in Pig are
internally converted to MapReduce jobs.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than
efficiency.
3) Extensibility
Users can write user-defined functions (UDFs) containing their own logic
to execute over the data set.
4) Flexible
5) In-built operators
It provides built-in operators to perform data operations like union,
sorting, and ordering, whereas such operations are difficult to perform
in raw MapReduce.
o Less code - Pig requires fewer lines of code to perform an
operation.
o Nested data types - Pig provides useful nested data types such as
tuple, bag, and map.
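As a rough sketch of these nested types, their literal forms in Pig Latin look like this (the values shown are illustrative, not from the source):

```pig
-- tuple: an ordered set of fields
(15, 12)
-- bag: an unordered collection of tuples
{(15, 12), (12)}
-- map: a set of key#value pairs
[name#Pig, version#0.17]
```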
Apache Pig executes in two modes: Local Mode and MapReduce Mode.
Local Mode
1. $ pig -x local
MapReduce Mode
o In this mode, Pig translates Pig Latin into MapReduce jobs and
executes them on the cluster.
1. $ pig
Or,
1. $ pig -x mapreduce
A Pig program can be executed in the following way in both local and
MapReduce mode:
o Batch Mode - In this mode, we run a script file with a .pig
extension. These files contain Pig Latin commands.
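For example, assuming a script file named wordcount.pig (the file name here is hypothetical), batch mode would be invoked like this:

```
$ pig -x local wordcount.pig
```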
Pig Latin
Pig Latin is the data flow language used by Apache Pig to analyze data
in Hadoop. It is a textual language that abstracts the programming from
the Java MapReduce idiom into a higher-level notation.
Complex Types
Type Description
tuple Defines an ordered set of fields. Example - (15,12)
Pig Example
Use case: Using Pig, find the most frequently occurring starting letter.
Solution:
Case 1: Load the data into a bag named "lines". Each entire line is
bound to the element line of type character array.
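A sketch of this load step (the input path below is an assumption; adjust it to your own HDFS location):

```pig
-- the file path here is illustrative, not from the source
lines = LOAD 'hdfs://localhost:9000/pig_data/data.txt' AS (line:chararray);
```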
Case 2: The text in the bag lines needs to be tokenized, producing one
word per row.
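The tokenize step might be sketched with the built-in TOKENIZE and FLATTEN operators (relation names here are assumed for illustration):

```pig
-- split each line into words, one word per row
tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token:chararray;
```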
Case 3: To retain the first letter of each word, type the command
below. It uses the SUBSTRING function to take the first character.
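A sketch of this step using Pig's built-in SUBSTRING function (start index 0, end index 1; relation names are assumed):

```pig
-- keep only the first character of each word
letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter:chararray;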
Case 4: Create a bag for each unique character, where the grouped bag
contains that character once for each of its occurrences.
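The grouping step might be sketched like this (relation names are assumed):

```pig
-- one group per distinct starting letter
lettergrp = GROUP letters BY letter;
```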
Case 8: Store the result in HDFS. The result is saved in the output
directory under the sonoo folder.
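The source omits the intermediate cases between 4 and 8 (counting the occurrences per letter and picking the maximum). One plausible completion, followed by the store step, assuming the relation names from the earlier cases and an illustrative output path under the sonoo folder:

```pig
-- count occurrences per starting letter (assumed intermediate step)
countletter = FOREACH lettergrp GENERATE group AS letter, COUNT(letters) AS cnt;
-- order by count descending and keep the most frequent letter
ordered = ORDER countletter BY cnt DESC;
result = LIMIT ordered 1;
-- store the result in HDFS under the sonoo folder (path is illustrative)
STORE result INTO 'hdfs://localhost:9000/user/sonoo/output';
```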