
Predicate Pushdown in Hive

Predicate pushdown is an optimization technique that pushes predicates (filters) from a SQL query down to where the data resides to reduce the amount of data processed. This is done by evaluating the predicates earlier, such as filtering data before it is transferred over the network, loaded into memory, or entire files/chunks are read. For example, in Hive, predicates in the WHERE clause can be pushed to the map phase to filter data before it is sent to the reduce phase.

Uploaded by

Pranoy

WHAT IS PREDICATE PUSHDOWN?

The basic idea of predicate pushdown is that certain parts of SQL queries (the
predicates) can be “pushed” to where the data lives.  This optimization can
drastically reduce query/processing time by filtering out data earlier rather than
later. Depending on the processing framework, predicate pushdown can optimize
your query by doing things like filtering data before it is transferred over the
network, filtering data before loading into memory, or skipping reading entire
files or chunks of files.

A “predicate” (in mathematics and functional programming) is a function that
returns a Boolean (true or false). In SQL queries, predicates are usually
encountered in the WHERE clause and are used to filter data.
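
As a minimal illustration, a predicate can be written as an ordinary function that returns True or False and is then used to filter rows. The row layout and the name is_argentina below are hypothetical, chosen only for this sketch.

```python
# Hypothetical rows; the column names are illustrative only.
rows = [
    {"id": 1, "country": "Argentina"},
    {"id": 2, "country": "Brazil"},
    {"id": 3, "country": "Argentina"},
]


def is_argentina(row):
    """Predicate: returns True or False for a single row."""
    return row["country"] == "Argentina"


# Similar in spirit to: SELECT * FROM rows WHERE country = 'Argentina'
filtered = [row for row in rows if is_argentina(row)]
print([row["id"] for row in filtered])
```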

Predicate Pushdown in Hive


Logically, a SQL query performs the JOIN before applying the filter in the
WHERE clause. In Hive (MapReduce), predicate pushdown is used to filter data
in the map phase, before it is sent over the network to the reduce phase.

For example, in this query the predicate a.country = 'Argentina' will be evaluated in the
map phase, reducing the amount of data sent over the network:

SELECT
  a.*
FROM
  table1 a
JOIN
  table2 b ON a.id = b.id
WHERE
  a.country = 'Argentina';
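
The effect on the shuffle can be sketched with a toy map/reduce join in plain Python. This is only a simulation under stated assumptions: the table contents, the helper names map_phase, reduce_phase and run_join, and the counting of shuffled records are all hypothetical and are not Hive internals.

```python
from collections import defaultdict

# Hypothetical toy data standing in for table1 and table2.
table1 = [
    {"id": 1, "country": "Argentina"},
    {"id": 2, "country": "Brazil"},
    {"id": 3, "country": "Argentina"},
    {"id": 4, "country": "Chile"},
]
table2 = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]


def is_argentina(row):
    return row["country"] == "Argentina"


def map_phase(rows, tag, predicate=None):
    """Emit (join_key, (tag, row)) pairs; drop rows that fail the predicate."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row["id"], (tag, row)


def reduce_phase(shuffled):
    """Group records by join key and pair rows from side a with side b."""
    groups = defaultdict(lambda: {"a": [], "b": []})
    for key, (tag, row) in shuffled:
        groups[key][tag].append(row)
    return [a for g in groups.values() for a in g["a"] for _ in g["b"]]


def run_join(pushdown):
    # With pushdown, the filter runs inside the map phase of table1.
    shuffled = list(map_phase(table1, "a", is_argentina if pushdown else None))
    shuffled += list(map_phase(table2, "b"))
    joined = reduce_phase(shuffled)
    if not pushdown:
        # Without pushdown, the filter runs only after the join.
        joined = [row for row in joined if is_argentina(row)]
    return joined, len(shuffled)


rows_late, shuffled_late = run_join(pushdown=False)
rows_early, shuffled_early = run_join(pushdown=True)
print(shuffled_late, shuffled_early)
```

Both runs return the same joined rows; the difference is only in how many records cross the simulated shuffle.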

Predicate pushdown in Hive is a feature that moves your predicate (the WHERE condition) closer to the point
where the data is read, so that the expression is evaluated as early as possible in the query plan.

Let’s try to understand this by example. Suppose we have two tables, product and sales, and we
want to answer the following question.

How many products of the brand Washington have been sold so far?

Non-Optimized Query
The following query answers the question above. However, if you are familiar with SQL you will notice
that it is not optimized: it first joins the two tables and only then applies the condition
(predicate).

SELECT sum(s.unit_sales)
FROM foodmart.product p
JOIN foodmart.sales_fact_dec_1998 s
ON
p.product_id = s.product_id
WHERE
p.brand_name = "Washington"

Optimized Query
We can easily optimize this query by applying the condition to the product table first and then joining the
result to the sales table, as shown below.

SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
  SELECT product_id, brand_name
  FROM foodmart.product
  WHERE
  brand_name = "Washington"
) p
ON
p.product_id = s.product_id

This is what PPD (predicate pushdown) does internally. If you have PPD enabled, your first query is
in effect rewritten into the second, optimized form automatically.
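
That the two forms are equivalent can be checked on toy data. The sketch below uses Python's built-in sqlite3 module rather than Hive, drops the foodmart schema prefix, and uses made-up rows, so it only demonstrates that both queries return the same result, not how Hive plans them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER, brand_name TEXT);
    CREATE TABLE sales_fact_dec_1998 (product_id INTEGER, unit_sales REAL);
    -- Made-up rows for illustration; not the foodmart data.
    INSERT INTO product VALUES (1, 'Washington'), (2, 'Acme'), (3, 'Washington');
    INSERT INTO sales_fact_dec_1998 VALUES (1, 2.0), (1, 3.0), (2, 5.0), (3, 4.0);
""")

# Filter written after the join (the non-optimized form).
late_filter = """
    SELECT sum(s.unit_sales)
    FROM product p
    JOIN sales_fact_dec_1998 s ON p.product_id = s.product_id
    WHERE p.brand_name = 'Washington'
"""

# Filter applied to product before the join (the optimized form).
early_filter = """
    SELECT sum(s.unit_sales)
    FROM sales_fact_dec_1998 s
    JOIN (
        SELECT product_id, brand_name
        FROM product
        WHERE brand_name = 'Washington'
    ) p ON p.product_id = s.product_id
"""

total_late = conn.execute(late_filter).fetchone()[0]
total_early = conn.execute(early_filter).fetchone()[0]
print(total_late, total_early)
```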

Let’s see this in action. The product table has 1560 rows (products) in total, of which only 11 have the
brand name Washington.

To make the plans easier to read, I have disabled vectorization. If you are not sure what vectorization is,
please read the blog post “What is vectorization?”.



Running Query with PPD Disabled
The following is the DAG of the first query with PPD disabled.
To disable PPD, set the following parameter to false.

set hive.optimize.ppd=false;

Notice that it reads all rows from the product table and then passes them to the reducer for the join.

DAG for first query when PPD is disabled

Running Query with PPD Enabled

The following is the DAG of the same query with PPD enabled.
To enable PPD, set the following parameter to true.

set hive.optimize.ppd=true;

Once we enable PPD, Hive first applies the condition to the product table and sends only 11 rows to the
reducer for the join.



DAG for first query when PPD is enabled

Predicate Pushdown in Parquet/ORC files


Parquet and ORC files maintain statistics about each column in different
chunks of data (such as min and max values). Programs reading these files can
use these statistics to determine whether certain chunks, and even entire files,
need to be read at all. This allows programs to potentially skip over huge
portions of the data during processing.
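
The idea of skipping chunks by their min/max statistics can be sketched in a few lines of Python. This is not the Parquet or ORC reader API; the Chunk class and the functions make_chunk and scan_equal are hypothetical names used only to show the principle for an equality predicate.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A hypothetical chunk of one column with its min/max statistics."""
    values: list
    min_value: int
    max_value: int


def make_chunk(values):
    # Statistics are computed once, when the chunk is written.
    return Chunk(values, min(values), max(values))


def scan_equal(chunks, target):
    """Return matching values and how many chunks were actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        # If target is outside [min, max], no value in the chunk can match.
        if target < chunk.min_value or target > chunk.max_value:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v == target)
    return matches, chunks_read


chunks = [make_chunk([1, 4, 7]), make_chunk([10, 12, 15]), make_chunk([20, 25, 12])]
matches, chunks_read = scan_equal(chunks, 12)
print(matches, chunks_read)
```

Only the statistics of the first chunk are consulted; its values are never scanned, which is where the savings come from when chunks are large.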

Predicate Pushdown in Spark


Spark will attempt to move filtering of data as close to the source as possible to
avoid loading unnecessary data into memory.

Predicate Pushdown in Amazon Redshift Spectrum


Amazon Redshift Spectrum resides on dedicated servers separate from actual
Redshift clusters. Redshift Spectrum will use predicate pushdown to filter data at
the Redshift Spectrum layer to reduce data transfer, storage, and compute
resources on the Redshift cluster itself.
