Predicate Pushdown in Hive

Predicate pushdown is an optimization technique that pushes predicates (filters) from a SQL query down to where the data resides, reducing the amount of data processed. The predicates are evaluated earlier, for example by filtering data before it is transferred over the network or loaded into memory, or by skipping entire files or chunks. In Hive, for example, predicates in the WHERE clause can be pushed to the map phase to filter data before it is sent to the reduce phase.

Uploaded by

Pranoy


WHAT IS PREDICATE PUSHDOWN?

The basic idea of predicate pushdown is that certain parts of SQL queries (the
predicates) can be “pushed” to where the data lives. This optimization can
drastically reduce query/processing time by filtering out data earlier rather than
later. Depending on the processing framework, predicate pushdown can optimize
your query by doing things like filtering data before it is transferred over the
network, filtering data before loading into memory, or skipping reading entire
files or chunks of files.

A “predicate” (in mathematics and functional programming) is a function that
returns a Boolean (true or false). In SQL queries, predicates are usually
encountered in the WHERE clause and are used to filter data.
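
As a minimal sketch of the definition above, a predicate can be written as an ordinary function that returns True or False for each row. The sample rows and the name is_argentina are hypothetical and serve only as illustration.

```python
# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"id": 1, "country": "Argentina"},
    {"id": 2, "country": "Brazil"},
    {"id": 3, "country": "Argentina"},
]


def is_argentina(row):
    """Predicate: returns True or False for a single row."""
    return row["country"] == "Argentina"


# Similar in spirit to: SELECT id FROM rows WHERE country = 'Argentina'
matching_ids = [row["id"] for row in rows if is_argentina(row)]
print(matching_ids)
```

Only the rows for which the predicate returns True survive the filter, which is what a WHERE clause does.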

Predicate Pushdown in Hive

Conceptually, when a SQL query is executed, the JOIN is performed before the
filtering specified in the WHERE clause. In Hive (MapReduce), predicate pushdown is
used to filter data in the map phase before it is sent over the network to the
reduce phase.

For example, in this query the predicate a.country = 'Argentina' will be evaluated in the
map phase, reducing the amount of data sent over the network:

SELECT
  a.*
FROM
  table1 a
JOIN
  table2 b ON a.id = b.id
WHERE
  a.country = 'Argentina';

Predicate pushdown in Hive is a feature that moves your predicate (the WHERE condition) as close to the table scan as
possible, so that the expression is evaluated as early as possible in the plan.

Let’s try to understand this with an example. Suppose we have two tables, product and sales, and we
want to answer the following question.

How many products of brand Washington have been sold so far?

Non-Optimized Query
The following query answers the question above. However, if you are familiar with SQL, you will notice
that the query is not optimized: it first joins the two tables and only then applies the condition
(predicate).

SELECT sum(s.unit_sales)
FROM foodmart.product p
JOIN foodmart.sales_fact_dec_1998 s
ON
p.product_id = s.product_id
WHERE
p.brand_name = "Washington"

Optimized Query
We can easily optimize the query above by applying the condition to the product table first and then joining the
result to the sales table, as shown below.

SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
  SELECT product_id, brand_name
  FROM foodmart.product
  WHERE
  brand_name = "Washington"
) p
ON
p.product_id = s.product_id

This is what PPD (predicate pushdown) does internally. If you have PPD enabled, your first query is
automatically planned as if it had been written like the second, optimized query.
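
The effect of filtering before the join, rather than after it, can be sketched in plain Python. This is a toy simulation, not Hive's implementation: the miniature tables, the join_and_sum helper, and the row counts are all hypothetical. It only shows that both orderings give the same answer while the pushed-down version feeds fewer product rows into the join.

```python
# Hypothetical miniature versions of the product and sales tables.
products = [
    {"product_id": 1, "brand_name": "Washington"},
    {"product_id": 2, "brand_name": "Other"},
    {"product_id": 3, "brand_name": "Washington"},
    {"product_id": 4, "brand_name": "Other"},
]
sales = [
    {"product_id": 1, "unit_sales": 2},
    {"product_id": 2, "unit_sales": 5},
    {"product_id": 3, "unit_sales": 1},
    {"product_id": 1, "unit_sales": 4},
]


def is_washington(product):
    """Predicate corresponding to: brand_name = "Washington"."""
    return product["brand_name"] == "Washington"


def join_and_sum(product_rows, sale_rows, predicate=None):
    """Join products to sales on product_id, with an optional post-join filter.

    Returns (sum of unit_sales, number of product rows fed into the join).
    """
    by_id = {p["product_id"]: p for p in product_rows}
    total = 0
    for sale in sale_rows:
        product = by_id.get(sale["product_id"])
        if product is not None and (predicate is None or predicate(product)):
            total += sale["unit_sales"]
    return total, len(product_rows)


# Without pushdown: every product row reaches the join; the filter runs afterwards.
late_total, late_rows = join_and_sum(products, sales, predicate=is_washington)

# With pushdown: filter first, so only matching product rows reach the join.
filtered = [p for p in products if is_washington(p)]
early_total, early_rows = join_and_sum(filtered, sales)

print(late_total, late_rows)
print(early_total, early_rows)
```

Both calls return the same total, but the second one passes only the Washington rows to the join, which mirrors the reduction in rows sent to the reducer.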

Let’s see this in action. The product table has 1,560 rows (products) in total, of which only 11 have the brand
name Washington.

To make the plans easier to follow, I have disabled vectorization. If you are not sure what vectorization is,
please read the blog post “What is vectorization?”.

Running Query with PPD Disabled
The following is the DAG of the first query with PPD disabled.
To disable PPD, set the following parameter to false.

set hive.optimize.ppd=false;

Notice that it reads all rows from the product table and then passes them to the reducer for the join.

DAG for first query when PPD is disabled

Running Query with PPD Enabled

The following is the DAG of the same query with PPD enabled.
To enable PPD, set the following parameter to true.

set hive.optimize.ppd=true;

Once PPD is enabled, Hive first applies the condition to the product table and sends only 11 rows to the
reducer for the join.

DAG for first query when PPD is enabled

Predicate Pushdown in Parquet/ORC files

Parquet and ORC files maintain statistics about each column for each
chunk of data (such as min and max values). Programs reading these files can
use these statistics to determine whether certain chunks, or even entire files, need to
be read at all. This allows programs to potentially skip over huge portions of the
data during processing.
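
The idea of skipping chunks by their min and max values can be sketched as follows. This is a simplified illustration in plain Python, not the Parquet or ORC reader API: the chunk layout and the names make_chunk and scan_equal are assumptions made for the example.

```python
# Each hypothetical "chunk" keeps its values plus min/max statistics,
# loosely modelled on the per-column stats described above.
def make_chunk(values):
    return {"min": min(values), "max": max(values), "values": values}


chunks = [
    make_chunk([1, 4, 7]),
    make_chunk([10, 12, 19]),
    make_chunk([20, 25, 31]),
]


def scan_equal(chunks, target):
    """Return the matching values and how many chunks were actually read."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        # The stats prove the target cannot be in this chunk, so skip it.
        if target < chunk["min"] or target > chunk["max"]:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v == target)
    return matches, chunks_read


matches, chunks_read = scan_equal(chunks, 12)
print(matches, chunks_read)
```

For the predicate "value = 12", only the middle chunk has a min/max range that can contain the target, so the other two chunks are never read.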

Predicate Pushdown in Spark

Spark will attempt to move filtering of data as close to the source as possible to
avoid loading unnecessary data into memory.

Predicate Pushdown in Amazon Redshift Spectrum

Amazon Redshift Spectrum resides on dedicated servers separate from actual
Redshift clusters. Redshift Spectrum will use predicate pushdown to filter data at
the Redshift Spectrum layer to reduce data transfer, storage, and compute
resources on the Redshift cluster itself.
