Pig SKB

Apache Pig is a high-level data flow platform for executing MapReduce programs in Hadoop using the Pig Latin language, which simplifies complex programming tasks. It supports various data types and execution modes, including Local Mode for development and MapReduce Mode for production. Key features include ease of programming, optimization opportunities, extensibility, and built-in operators for data manipulation.


What is Apache Pig

Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used by Pig is Pig Latin.

Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS. Apart from that, Pig can also execute its jobs on Apache Tez or Apache Spark.

Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS). Every task that can be achieved using Pig can also be achieved using Java in MapReduce.

Features of Apache Pig

Let's see the various features of Pig.

1) Ease of programming

Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce jobs internally.

2) Optimization opportunities

The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

3) Extensibility

Users can write user-defined functions (UDFs) containing their own logic to execute over the data set.

4) Flexible

It can easily handle structured as well as unstructured data.

5) In-built operators

It contains various types of operators such as sort, filter, and join.

Differences between Apache MapReduce and PIG

o Apache MapReduce is a low-level data processing tool, whereas Apache Pig is a high-level data flow tool.

o In MapReduce, it is required to develop complex programs using Java or Python; in Pig, it is not required to develop complex programs.

o It is difficult to perform data operations in MapReduce, whereas Pig provides built-in operators to perform data operations like union, sorting, and ordering.

o MapReduce doesn't allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.

Advantages of Apache Pig

o Less code - Pig requires fewer lines of code to perform any operation.

o Reusability - Pig code is flexible enough to be reused.

o Nested data types - Pig provides useful nested data types like tuple, bag, and map.

Apache Pig Run Modes

Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode

o It executes in a single JVM and is used for development, experimenting, and prototyping.

o Here, files are installed and run from the local host.

o Local mode works on the local file system; the input and output data are stored in the local file system.

The command for the local mode grunt shell:

$ pig -x local

MapReduce Mode

o The MapReduce mode is also known as Hadoop Mode.

o It is the default mode.

o In this mode, Pig translates Pig Latin statements into MapReduce jobs and executes them on the cluster.

o It can be executed against a pseudo-distributed or fully distributed Hadoop installation.

o Here, the input and output data are present on HDFS.

The command for MapReduce mode:

$ pig

Or,

$ pig -x mapreduce

Ways to execute Pig Program

These are the ways of executing a Pig program in local and MapReduce mode:

o Interactive Mode - In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, Pig Latin statements and commands can be entered interactively at the command line.

o Batch Mode - In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.

o Embedded Mode - In this mode, we can define our own functions, called UDFs (User Defined Functions). Here, we use programming languages like Java and Python.
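As a hedged sketch of the embedded workflow (the jar name, class name, and relation names below are hypothetical, not from an actual project), a Java UDF is typically packaged in a jar, registered, and then invoked inside a Pig Latin statement:

```pig
-- Hypothetical: myudfs.jar contains a Java class myudfs.MyUpper that extends EvalFunc<String>
REGISTER myudfs.jar;
upper_names = FOREACH students GENERATE myudfs.MyUpper(name);
```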

Pig Latin

Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.

Pig Latin Statements


Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as input and generates another relation as output.

o A statement can span multiple lines.

o Each statement must end with a semicolon.

o Statements may include expressions and schemas.

o By default, statements are processed using multi-query execution.
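To illustrate the first two points (the file name and schema here are hypothetical), a single statement may span several lines but is terminated by exactly one semicolon:

```pig
-- Hypothetical example: one LOAD statement spanning two lines,
-- ended by a single semicolon
student = LOAD 'student_data.txt' USING PigStorage(',')
          AS (id:int, name:chararray, gpa:float);
DUMP student;
```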

Pig Latin Conventions

Convention	Description

()	The parentheses enclose one or more items. They can also be used to indicate the tuple data type.
Example - (10, xyz, (3,6,9))

[]	The straight brackets enclose one or more optional items. They can also be used to indicate the map data type.
Example - [INNER | OUTER]

{}	The curly brackets enclose two or more items, one of which is required. They can also be used to indicate the bag data type.
Example - { block | nested_block }

...	The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type	Description

int	Defines a signed 32-bit integer.
Example - 2

long	Defines a signed 64-bit integer.
Example - 2L or 2l

float	Defines a 32-bit floating point number.
Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F

double	Defines a 64-bit floating point number.
Example - 2.5 or 2.5e2 or 2.5E2

chararray	Defines a character array in Unicode UTF-8 format.
Example - javatpoint

bytearray	Defines a byte array (blob).

boolean	Defines boolean values.
Example - true/false

datetime	Defines a date-time value.
Example - 1970-01-01T00:00:00.000+00:00

biginteger	Defines Java BigInteger values.
Example - 5000000000000

bigdecimal	Defines Java BigDecimal values.
Example - 52.232344535345

Complex Types

Type	Description

tuple	Defines an ordered set of fields.
Example - (15,12)

bag	Defines a collection of tuples.
Example - {(15,12), (12,15)}

map	Defines a set of key-value pairs.
Example - [open#apache]
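A hedged sketch of how these complex types can appear in a LOAD schema (the file name and field names are hypothetical):

```pig
-- Hypothetical schema declaring a tuple, a bag, and a map field
records = LOAD 'complex_data.txt'
          AS (t:tuple(a:int, b:int),           -- e.g. (15,12)
              bg:bag{row:tuple(a:int, b:int)}, -- e.g. {(15,12),(12,15)}
              m:map[chararray]);               -- e.g. [open#apache]
```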

Pig Example

Use case: Using Pig, find the most frequently occurring start letter.

Solution:

Step 1: Load the data into a bag named "lines". The entire line is assigned to the element line of type chararray.

grunt> lines = LOAD '/user/Desktop/data.txt' AS (line: chararray);

Step 2: The text in the bag lines needs to be tokenized; this produces one word per row.

grunt> tokens = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token: chararray;

Step 3: To retain the first letter of each word, type the command below. It uses the SUBSTRING function to take the first character.

grunt> letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter: chararray;

Step 4: Group by letter, so that each grouped bag contains every occurrence of that character.

grunt> lettergrp = GROUP letters BY letter;

Step 5: The number of occurrences is counted in each group.

grunt> countletter = FOREACH lettergrp GENERATE group, COUNT(letters);

Step 6: Arrange the output according to count in descending order using the command below.

grunt> OrderCnt = ORDER countletter BY $1 DESC;

Step 7: Limit the output to one row to give the result.

grunt> result = LIMIT OrderCnt 1;

Step 8: Store the result in HDFS. The result is saved in the output directory under the sonoo folder.

grunt> STORE result INTO 'home/sonoo/output';
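To make the dataflow above concrete, here is a plain-Python sketch of the same logic (the sample lines are invented for illustration; this is not how Pig executes it, only an equivalent computation):

```python
from collections import Counter

def most_common_start_letter(lines):
    # TOKENIZE + FLATTEN: split each line into one word per row
    tokens = [word for line in lines for word in line.split()]
    # SUBSTRING(token, 0, 1): keep the first letter of each word
    letters = [t[0] for t in tokens if t]
    # GROUP + COUNT + ORDER ... DESC + LIMIT 1, rolled into one call
    return Counter(letters).most_common(1)[0]

# Hypothetical sample data
print(most_common_start_letter(["pig processes petabytes",
                                "pig programs run on hadoop"]))
# → ('p', 5)
```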
