Predicate Pushdown in Hive
The basic idea of predicate pushdown is that certain parts of SQL queries (the
predicates) can be “pushed” to where the data lives. This optimization can
drastically reduce query/processing time by filtering out data earlier rather than
later. Depending on the processing framework, predicate pushdown can optimize
your query by doing things like filtering data before it is transferred over the
network, filtering data before loading into memory, or skipping reading entire
files or chunks of files.
Predicate pushdown in Hive is a feature that moves your predicate (the WHERE condition) earlier in the query plan. Hive tries
to evaluate the expression as early as possible in the plan, ideally while the table is being scanned.
Let’s try to understand this with an example. Suppose we have two tables, product and sales, and we
want to answer the following question: what are the total unit sales of products with the brand name Washington?
Non-Optimized Query
The straightforward query answers this question. However, if you are familiar with SQL you will notice
that it is not optimized: it first joins the two tables and only then applies the condition
(the predicate).
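The non-optimized query itself is not reproduced at this point. A likely form, reconstructed from the optimized version and therefore an assumption rather than the author's exact text, joins the full product table and filters afterwards:

```sql
-- Reconstructed sketch (assumption): join first, then filter on brand_name.
-- Table and column names are taken from the optimized query in this post.
SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN foodmart.product p
  ON p.product_id = s.product_id
WHERE p.brand_name = "Washington";
```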
Optimized Query
We can easily optimize this query by applying the condition to the product table first and then joining the filtered result to the sales
table, as shown below.
SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN (
    SELECT product_id, brand_name
    FROM foodmart.product
    WHERE brand_name = "Washington"
) p
ON p.product_id = s.product_id;
This is what PPD (predicate pushdown) does internally. If you have PPD enabled, Hive will
automatically rewrite the plan of the first query so that it behaves like the second, optimized query.
Let’s see this in action. The product table has 1560 rows (products) in total, of which only 11 have the brand
name Washington.
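To check these figures on your own copy of the data, a query along the following lines could be used. It is only a sketch and assumes the same foodmart.product table and brand_name column as the queries in this post:

```sql
-- Count all products and, in the same pass, those with brand name Washington.
SELECT count(*) AS total_products,
       sum(CASE WHEN brand_name = "Washington" THEN 1 ELSE 0 END) AS washington_products
FROM foodmart.product;
```

If the table matches the one described here, the two columns should come back as 1560 and 11.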
To keep the query plans easier to read, I have disabled vectorization. If you are not sure what vectorization is,
please read the following blog post: What is vectorization?
set hive.optimize.ppd=false;
Notice that with PPD disabled, Hive reads all rows from the product table and passes them to the reducer for the join.
set hive.optimize.ppd=true;
Once we enable PPD, Hive first applies the condition to the product table and sends only 11 rows to the
reducer for the join.
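To reproduce the comparison yourself, you can ask Hive for the plan under both settings with EXPLAIN. The sketch below assumes the join-then-filter form of the query (a reconstruction, not necessarily the exact query used for this post) and uses the standard hive.vectorized.execution.enabled property to turn vectorization off:

```sql
-- Turn vectorization off so the plan is easier to read.
SET hive.vectorized.execution.enabled=false;

-- Plan with predicate pushdown disabled.
SET hive.optimize.ppd=false;
EXPLAIN
SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN foodmart.product p
  ON p.product_id = s.product_id
WHERE p.brand_name = "Washington";

-- Plan with predicate pushdown enabled.
SET hive.optimize.ppd=true;
EXPLAIN
SELECT sum(s.unit_sales)
FROM foodmart.sales_fact_dec_1998 s
JOIN foodmart.product p
  ON p.product_id = s.product_id
WHERE p.brand_name = "Washington";
```

In the two plans, compare where the Filter Operator on brand_name appears relative to the TableScan of the product table and the join. The exact operator names and row estimates depend on your Hive version and execution engine.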