
Big Data Analytics

Unit 4
SURESH BABU M
ASST PROF
IT DEPT
UNIT-IV
Hadoop Eco System-I
 Pig: Introduction to PIG, Execution Modes of Pig,
Comparison of Pig with Databases, Grunt, Pig Latin, User
Defined Functions, Data Processing operators.
 Hive: Hive Shell, Hive Services, Hive Metastore,
Comparison with Traditional Databases, HiveQL, Tables,
Querying Data and User Defined Functions.
4.1 Introduction to Pig
 Apache Pig raises the level of abstraction for processing large
datasets.
 With Pig, the data structures are much richer, typically being
multivalued and nested, and the transformations you can
apply to the data are much more powerful.
 Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
 Pig is a scripting language for exploring large
datasets.
 Pig was designed to be extensible.
 As another benefit, UDFs tend to be more
reusable than the libraries developed for writing
MapReduce programs.
 Pig is an open-source high level data flow system.
It provides a simple language called Pig Latin, for
queries and data manipulation, which are then
compiled into MapReduce jobs that run on
Hadoop.
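As a quick illustration, here is a minimal Pig Latin sketch of such a data flow; the file name, field names, and the 9999 sentinel value are hypothetical:
-- load a tab-separated file of (year, temperature, quality) records
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;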
4.2 Execution Types
 Pig has two execution types or modes: local mode and MapReduce
mode.
1) Local mode
• In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and for trying out Pig.
% pig -x local
2) MapReduce mode
• The MapReduce mode is also known as Hadoop mode.
• It is the default mode. In this mode, Pig translates Pig Latin into MapReduce jobs and executes them on the cluster.
• It can be run against a pseudo-distributed or fully distributed Hadoop installation. Here, the input and output data are present on HDFS.
$ pig
4.3 Comparison of Pig with Databases
4.4 Grunt
 Grunt is an interactive shell for running Pig commands. Grunt
is started when no file is specified for Pig to run and the -e
option is not used. It is also possible to run Pig scripts from
within Grunt using run and exec.
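For example, assuming a script file named script.pig (a hypothetical name), run executes it within the current Grunt session so its aliases remain accessible afterwards, while exec runs it in a separate batch-style session:
grunt> run script.pig
grunt> exec script.pig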
4.5 Pig Latin
4.5.1 Structures
4.5.2 Statements
4.5.3 Expressions
4.5.4 Types
4.5.5 Schemas
4.5.6 Functions
4.5.7 Macros
4.5.1 Structures
• A Pig Latin program consists of a collection of statements.
• For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
• Statements are usually terminated with a semicolon.
 Pig Latin has mixed rules on case sensitivity. Operators
and commands are not case sensitive (to make interactive
use more forgiving); however, aliases and function names
are case sensitive.
4.5.2 Statements
• When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file.
 Pig validates the GROUP and FOREACH...GENERATE
statements, and adds them to the logical plan without
executing them. The trigger for Pig to start execution is
the DUMP statement. At that point, the logical plan is
compiled into a physical plan and executed.
The physical plan that Pig prepares is a series of MapReduce jobs, which Pig runs in the local JVM in local mode and on a Hadoop cluster in MapReduce mode.
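To inspect these plans without triggering execution, the EXPLAIN command can be applied to a relation (max_temp here is a hypothetical alias); it prints the logical, physical, and MapReduce plans:
grunt> EXPLAIN max_temp;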
Multiquery Execution
• Consider a script in which relations B and C are both derived from A. To save reading A twice, Pig can run the script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.
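A minimal sketch of such a script, with hypothetical file names and filter conditions; because both outputs derive from A, Pig reads A only once:
A = LOAD 'input/data.txt' AS (name:chararray, fruit:chararray);
B = FILTER A BY fruit == 'banana';
C = FILTER A BY fruit != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';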
4.5.3 Expressions
4.5.4 Types
4.5.5 Schemas
• A LOAD statement can be used to attach a schema to a relation:
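(A sketch with a hypothetical file and field names; DESCRIBE prints the schema that the AS clause attached, though the exact output format varies slightly between Pig versions.)
grunt> records = LOAD 'input/sample.txt' AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}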
4.5.6 Functions
Functions in Pig come in four types:
1) Eval function: A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag.
2) Filter function: A special type of eval function that returns a Boolean result. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.
3) Load function: A function that specifies how to load data into a relation from external storage.
4) Store function: A function that specifies how to save the contents of a relation to external storage.
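A brief sketch of an eval function and a filter function in use, assuming the grouped_records relation from the earlier GROUP example and a temperature field in records:
max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);
non_empty = FILTER grouped_records BY NOT IsEmpty(records);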
4.5.7 Macros
 Macros provide a way to package reusable pieces of Pig Latin
code from within Pig Latin itself.
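A sketch of a macro that packages the group-and-aggregate pattern used earlier; the macro name and parameter names are illustrative, and the records relation is assumed to have year and temperature fields:
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
  A = GROUP $X BY $group_key;
  $Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
max_temp = max_by_group(records, year, temperature);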
4.6 User-Defined Functions
4.7 Data Processing Operators
Data Processing Operators
4.7.1 Loading and Storing Data
4.7.2 Filtering Data
4.7.3 Grouping and Joining Data
4.7.4 Sorting Data
4.7.5 Combining and Splitting Data
4.7.1 Loading and Storing Data
• Storing the results is straightforward, too. Here's an example of using PigStorage to store tuples as plain-text values separated by a colon character:
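(A minimal sketch, assuming a relation A has already been defined and writing to a hypothetical output directory named out.)
grunt> STORE A INTO 'out' USING PigStorage(':');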
4.7.2 Filtering Data
 Once you have some data loaded into a relation,
often the next step is to filter it to remove the
data that you are not interested in.
 By filtering early in the processing pipeline, you
minimize the amount of data flowing through the
system, which can improve efficiency.
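For example, assuming a records relation with a quality field, rows with unwanted quality codes can be dropped early in the pipeline (the codes shown are hypothetical):
good_records = FILTER records BY quality == 0 OR quality == 1;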
FOREACH...GENERATE
 The FOREACH...GENERATE operator is used
to act on every row in a relation. It can be used to
remove fields or to generate new ones.
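A small sketch, assuming a relation prices whose third field is numeric; the statement projects the first field, derives a new field from the third, and adds a constant:
grunt> B = FOREACH prices GENERATE $0, $2 + 1, 'Constant';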
4.7.3 Grouping and Joining Data
• Pig has very good built-in support for join operations.
• JOIN: Let's look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
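Joining A on its first field and B on its second field (assuming these numeric IDs are the intended join keys) produces the following tuples; the order of the output rows may differ:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)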
COGROUP: The COGROUP statement is similar to JOIN, but
instead creates a nested set of output tuples. COGROUP
generates a tuple for each unique grouping key.
GROUP: Where COGROUP groups the data in two or more relations, the GROUP statement groups the data in a single relation. GROUP supports grouping by more than equality of keys: you can use an expression or user-defined function as the group key.
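For example, relation A from above can be grouped by an expression rather than a plain key, here the number of characters in the item name (using the built-in SIZE function and assuming the second field is a chararray):
grunt> D = GROUP A BY SIZE($1);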
4.7.4 Sorting Data
 Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
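A can be sorted with ORDER; for example, ordering on the first field ascending and the second field descending gives:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)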
4.7.5 Combining and Splitting Data
Hive: Hive Shell, Hive Services,
Hive Metastore, Comparison with
Traditional Databases, HiveQL,
Tables, Querying Data and User
Defined Functions.
4.8 HIVE
 Hive was created to make it possible for analysts with strong
SQL skills (but meager Java programming skills) to run
queries on the huge volumes of data that Facebook stored in
HDFS.
 Today, Hive is a successful Apache project used by many
organizations as a general-purpose, scalable data processing
platform.
• Of course, SQL isn't ideal for every big data problem—it's not a good fit for building complex machine-learning algorithms, for example.
What is HIVE
 Hive is a data warehouse system which is used to
analyze structured data. It is built on the top of
Hadoop. It was developed by Facebook.
 Hive provides the functionality of reading,
writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
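For example, a query over a hypothetical records table is written in ordinary SQL style, and Hive turns it into one or more MapReduce jobs behind the scenes:
hive> SELECT year, MAX(temperature)
    > FROM records
    > GROUP BY year;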
Features of Hive
These are the following features of Hive:
 Hive is fast and scalable.
 It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs.
 It is capable of analyzing large datasets stored in HDFS.
 It allows different storage types such as plain text, RCFile, and
HBase.
 It uses indexing to accelerate queries.
 It can operate on compressed data stored in the Hadoop
ecosystem.
• It supports user-defined functions (UDFs), through which users can plug in their own functionality.
4.8 The Hive Shell
 The shell is the primary way that we will interact with Hive,
by issuing commands in HiveQL. HiveQL is Hive’s query
language, a dialect of SQL.
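For example, the shell is started with the hive command, and HiveQL statements are terminated with a semicolon:
% hive
hive> SHOW TABLES;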
4.9 Hive Services
The Hive shell is only one of several services that you can
run using the hive command.
1) cli: The command-line interface to Hive (the shell). This is the default service.
2) hiveserver2: HiveServer2 improves on the original HiveServer by supporting authentication and multiuser concurrency.
3) beeline: A command-line interface to Hive that works in embedded mode (like the regular CLI), or by connecting to a HiveServer2 process using JDBC.
4) hwi: The Hive Web Interface, a simple web interface that can be used as an alternative to the CLI without having to install any client software.
5) jar: The Hive equivalent of hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.
6) metastore: By default, the metastore is run in the same process as the Hive service. Using this service, it is possible to run the metastore as a standalone (remote) process. Set the METASTORE_PORT environment variable to specify the port the server will listen on.
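Each of these services is started with hive --service, for example:
% hive --service hiveserver2
% hive --service metastore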
Hive clients
 Thrift Client The Hive server is exposed as a Thrift service, so
it’s possible to interact with it using any programming
language that supports Thrift. There are third-party projects
providing clients for Python and Ruby.
 JDBC driver: a Java application will connect to a Hive server
running in a separate process at the given host and port.
 ODBC driver:An ODBC driver allows applications that
support the ODBC protocol (such as business intelligence
software) to connect to Hive.
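For example, Beeline can connect to a HiveServer2 process through the JDBC driver mentioned above, using a URL of the form jdbc:hive2://host:port (10000 is the conventional default port; the host name below is a placeholder):
% beeline -u jdbc:hive2://localhost:10000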
4.10 Hive Metastore
 The metastore is the central repository of Hive metadata.
 The metastore is divided into two pieces: a service and the
backing store for the data.
1) Embedded metastore: By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database. Only one embedded Derby database can access the database files on disk at any one time, which means you can have only one Hive session open at a time that accesses the same metastore.
2) Local metastore: The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. The metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine.
3) Remote metastore: One or more metastore servers run in separate processes to the Hive service. This brings better manageability and security because the database tier can be completely firewalled off, and the clients no longer need the database credentials.
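A sketch of the client-side setting for a remote metastore in hive-site.xml; the host name is a placeholder and 9083 is the conventional metastore port:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>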
4.11 Comparison with Traditional
Databases
Schema on Read Versus Schema on Write
 schema on write :The data is checked against the schema when
it is written into the database.
 schema on read :Hive, on the other hand, doesn’t verify the
data when it is loaded, but rather when a query is issued.
 Schema on read makes for a very fast initial load, since the
data does not have to be read, parsed, and serialized to disk
in the database’s internal format.
 Schema on write makes query time performance faster
because the database can index columns and perform
compression on the data. The trade-off, however, is that it
takes longer to load data into the database.
Updates, Transactions, and Indexes
 Hive has long supported adding new rows in bulk to an
existing table by using INSERT INTO to add new data
files to a table.
 HDFS does not provide in-place file updates, so changes
resulting from inserts, updates, and deletes are stored in
small delta files.
 Delta files are periodically merged into the base table
files by MapReduce jobs that are run in the background
by the metastore.
 Hive also has support for table- and partition-level
locking.
 There are currently two index types: compact and bitmap.
 Compact indexes store the HDFS block numbers of each
value, rather than each file offset, so they don’t take up
much disk space but are still effective for the case where
values are clustered together in nearby rows.
 Bitmap indexes use compressed bitsets to efficiently
store the rows that a particular value appears in, and they
are usually appropriate for low-cardinality columns (such
as gender or country).
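As a hedged illustration of the DDL (valid in Hive releases prior to 3.0, which still supported indexing; the table and column names are hypothetical):
CREATE INDEX records_year_idx
ON TABLE records (year)
AS 'COMPACT'
WITH DEFERRED REBUILD;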
4.12 HiveQL
Operators and Functions
 The usual set of SQL operators is provided by Hive:
relational operators ,arithmetic operators and logical
operators.
 Hive comes with a large number of built-in functions—
too many to list here—divided into categories that
include mathematical and statistical functions, string
functions, date functions (for operating on string
representations of dates), conditional functions,
aggregate functions, and functions for working with
XML (using the xpath function) and JSON.
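A few representative built-ins, shown here in a quick query with no FROM clause (supported in recent Hive versions):
hive> SELECT round(3.14159, 2), concat('big', ' ', 'data'), year('2024-01-15');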
Conversions:
Hive performs some implicit conversions between types. For example, a TINYINT will be converted to an INT if an expression expects an INT; however, the reverse (narrowing) conversion will not occur, and Hive will return an error unless the CAST operator is used.
 Any numeric type can be implicitly converted to a wider
type, or to a text type (STRING,VARCHAR, CHAR).
 All the text types can be implicitly converted to another text
type.
 TIMESTAMP and DATE can be implicitly converted to a text
type.
 BOOLEAN types cannot be converted to any other type, and
they cannot be implicitly converted to any other type in
expressions.
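For example, CAST requests an explicit conversion; casting a non-numeric string such as 'X' to INT yields NULL rather than an error:
hive> SELECT CAST('1' AS INT) + 2;
hive> SELECT CAST('X' AS INT);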
4.13 Tables
4.13.1 Managed Tables and External Tables
4.13.2 Partitions and Buckets
4.13.3 Storage Formats
4.13.4 Importing Data
4.13.5 Altering Tables
4.13.6 Dropping Tables
4.13.1 Managed Tables and External
Tables
 A Hive table is logically made up of the data being stored and
the associated metadata describing the layout of the data in
the table.
Managed Tables and External Tables
 Managed Tables : When you create a table in Hive, by default
Hive will manage the data, which means that Hive moves the
data into its warehouse directory.
• External table: an external table tells Hive to refer to data that is at an existing location outside the warehouse directory.
 When you load data into a managed table, it is moved into
Hive’s warehouse directory.
 If the table is later dropped, using:
DROP TABLE managed_table;
Managed table: the table, including its metadata and its data, is
deleted.
 External Table: When you drop an external table, Hive will
leave the data untouched and only delete the metadata.
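A hedged sketch of both table types (paths and names are placeholders); note the EXTERNAL keyword and the LOCATION clause:
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/hive/data.txt' INTO TABLE managed_table;

CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/data/external_table';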
4.13.2 Partitions and Buckets
Partitions and Buckets
 Hive organizes tables into partitions—a way of dividing a table
into coarse-grained parts based on the value of a partition column,
such as a date. Using partitions can make it faster to do queries
on slices of the data.
 Tables or partitions may be subdivided further
into buckets to give extra structure to the data that
may be used for more efficient queries.
 For example, bucketing by user ID means we can
quickly evaluate a user-based query by running it
on a randomized sample of the total set of users.
 A table may be partitioned in multiple
dimensions. For example, in addition to
partitioning logs by date, we might also
subpartition each date partition by country to permit
efficient queries by location.
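A sketch of such a doubly partitioned table (names are illustrative); the partition columns are declared separately from the data columns, and each load names its target partition:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/file1'
INTO TABLE logs
PARTITION (dt='2024-01-01', country='GB');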
Buckets
 There are two reasons why you might want to organize
your tables (or partitions) into buckets. The first is to
enable more efficient queries.
 Bucketing imposes extra structure on the table, which
Hive can take advantage of when performing certain
queries.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket
4.13.3 Storage Formats
Storage Formats
 There are two dimensions that govern table storage in Hive:
the row format and the file format.
 In Hive parlance, the row format is defined by a SerDe, a
portmanteau word for a Serializer-Deserializer.
 When acting as a deserializer, which is the case when
querying a table, a SerDe will deserialize a row of data from
the bytes in the file to objects used internally by Hive to
operate on that row of data.
 When used as a serializer, which is the case when performing
an INSERT or CTAS (see “Importing Data” on page 500), the
table’s SerDe will serialize Hive’s internal representation of a
row of data into the bytes that are written to the output file.
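A typical delimited-text declaration, which selects the default text SerDe with explicit delimiters (the table name and delimiters are illustrative):
CREATE TABLE my_table (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;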
4.13.4 Importing Data
• You can populate a table with data files using the LOAD DATA statement, with the results of a query on another Hive table using an INSERT statement, or at creation time using the CTAS construct, which is an abbreviation for CREATE TABLE...AS SELECT.
Multitable insert
 multitable insert is more efficient than multiple INSERT
statements because the source table needs to be scanned
only once to produce the multiple disjoint outputs.
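A hedged sketch with hypothetical source and target tables (the targets must already exist); the single FROM clause feeds both INSERT clauses in one scan of records:
FROM records
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1)
  GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
  SELECT year, COUNT(1)
  WHERE quality = 0
  GROUP BY year;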
CREATE TABLE...AS SELECT
CREATE TABLE target AS SELECT col1, col2 FROM source;
 A CTAS operation is atomic, so if the SELECT query fails for
some reason, the table is not created.
4.13.5 Altering Tables
 You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
 Hive allows you to change the definition for columns, add
new columns, or even replace all existing columns in a table
with a new set.
ALTER TABLE target ADD COLUMNS (col3 STRING);
4.13.6 Dropping Tables
 The DROP TABLE statement deletes the data and metadata
for a table. In the case of external tables, only the metadata is
deleted; the data is left untouched.
 If you want to delete all the data in a table but keep the table
definition, use TRUNCATE
TABLE. For example:
TRUNCATE TABLE my_table;
 In a similar vein, if you want to create a new, empty table
with the same schema as another table, then use the LIKE
keyword:
CREATE TABLE new_table LIKE existing_table;
4.14 Querying Data
• 4.14.1 Sorting and Aggregating
• 4.14.2 MapReduce Scripts
• 4.14.3 Joins
• 4.14.4 Subqueries
• 4.14.5 Views
4.14.1 Sorting and Aggregating
• Sorting data in Hive can be achieved by using a standard ORDER BY clause.
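For example, over the hypothetical records table used earlier:
hive> SELECT year, temperature
    > FROM records
    > ORDER BY year ASC, temperature DESC;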
4.14.2 MapReduce Scripts
 Using an approach like Hadoop Streaming, the
TRANSFORM, MAP, and REDUCE clauses make it possible
to invoke an external script or program from Hive.
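A hedged sketch: the script name and what it does (filtering rows and emitting two columns) are hypothetical, but ADD FILE followed by TRANSFORM ... USING ... AS is the standard pattern:
hive> ADD FILE /path/to/is_good_quality.py;
hive> FROM records
    > SELECT TRANSFORM(year, temperature, quality)
    > USING 'is_good_quality.py'
    > AS year, temperature;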
4.14.3 Joins
Inner joins
Outer joins
 Outer joins allow you to find nonmatches in the tables being
joined. In the current example, when we performed an inner
join, the row for Ali did not appear in the output, because the
ID of the item she purchased was not present in the things
table.
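With a left outer join, Ali's row is returned with NULLs for the missing things columns (this sketch assumes both tables carry an id join column):
hive> SELECT sales.*, things.*
    > FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);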
Semi joins
4.14.4 Subqueries
 A subquery is a SELECT statement that is embedded in
another SQL statement.
 Hive has limited support for subqueries, permitting a
subquery in the FROM clause of a SELECT statement, or in
the WHERE clause in certain cases.
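A sketch of a subquery in the FROM clause (the records table and its columns are hypothetical); the inner query computes a per-station maximum and the outer query averages it by year:
SELECT mt.year, AVG(mt.max_temperature)
FROM (
  SELECT year, station, MAX(temperature) AS max_temperature
  FROM records
  GROUP BY year, station
) mt
GROUP BY mt.year;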
4.14.5 Views
 A view is a sort of “virtual table” that is defined by a SELECT
statement.
 Views in Hive are read-only, so there is no way to load or
insert data into an underlying base table via a view.
 Views can be used to present data to users in a way that
differs from the way it is actually stored on disk.
 Views may also be used to restrict users’ access to particular
subsets of tables that they are authorized to see.
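For example, a view can hide a data-quality filter from users (the table, columns, and filter condition are illustrative):
CREATE VIEW valid_records AS
SELECT * FROM records
WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9);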
4.15 User-Defined Functions
 UDFs have to be written in Java, the language that Hive itself
is written in.
 There are three types of UDF in Hive: (regular) UDFs,
user-defined aggregate functions (UDAFs), and user-defined
table-generating functions (UDTFs).
 They differ in the number of rows that they accept as input
and produce as output.
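A minimal sketch of a regular UDF in Java, modelled on the common "strip whitespace" example; the package name, class name, and jar path below are hypothetical:
package com.example.hive;   // hypothetical package

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A regular UDF: it operates on a single row and returns a single value.
public class Strip extends UDF {
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
}

Once the class is compiled and packaged into a jar, it is registered and used from the Hive shell:
hive> ADD JAR /path/to/hive-udfs.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.example.hive.Strip';
hive> SELECT strip('  banana  ');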
Writing a UDF
Writing a UDAF
An evaluator must implement five methods:
1) init(): The init() method initializes the evaluator and resets its internal state.
2) iterate(): The iterate() method is called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of performing the aggregation.
3) terminatePartial(): The terminatePartial() method is called when Hive wants a result for the partial aggregation.
4) merge(): The merge() method is called when Hive decides to combine one partial aggregation with another.
5) terminate(): The terminate() method is called when the final result of the aggregation is needed.