Implement aggregation and grouping pushdown #1
The current implementation provides a mechanism for pushing down aggregation and/or grouping queries into the foreign data source. The Python side of the implementation can now receive two new kwargs, `aggs` and `group_clauses`; when they are present, it should return the corresponding aggregation result. Still left to implement is consulting the Python side on whether remote aggregation is possible at all and, if so, which aggregation functions are valid. Also missing are some more advanced aggregation cases (for example, aggregating multiple functions or handling a `HAVING` clause). These are to be implemented separately.
Add a method to FDW Python instance that provides info on whether the pushdown is supported at all, and if so gives data for more granular decisions (for now only list of aggregation functions). Consult this method in `multicornGetForeignUpperPaths`.
Currently the parsing is incomplete for simple WHERE clauses due to the lack of T_OpExpr and T_Const cases in multicorn_foreign_expr_walker. Therefore, all WHERE clauses will be treated as local conditions, and not pushed down.
For the first iteration, disable pushdown of `COUNT(*)`, as is done for `DISTINCT` clauses. These can be added later on, and tested against their equivalents in ES, `doc_count` and `cardinality`.
mildbyte
left a comment
I took a first pass at this and left some comments; will try understanding it deeper in the morning. Pretty impressive!
python/multicorn/__init__.py
Outdated
    The FDW has to inspect every sort, and respond which one are handled.
    The sorts are cumulatives.
    [...]
    Return:
        None if pushdown not supported, otherwise a dictionary containing
        more granular details for the planning phase, in the form:
Needs docs on the expected dict output
Adding docs for it in the next commit.
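For illustration, the return contract discussed in this thread could look roughly like the sketch below. The method name `can_pushdown_upperrel`, the class, and the contents of the mapping are assumptions made for the example; only the "None or a dictionary" contract and the `agg_functions` key appear in the diff under review.

```python
from typing import Optional


class ExampleFDW:
    """Stand-in for a ForeignDataWrapper subclass (hypothetical, not Multicorn's class)."""

    # Assumed mapping of PostgreSQL aggregate names to their remote equivalents.
    SUPPORTED_AGGS = {"avg": "avg", "max": "max", "min": "min", "sum": "sum"}

    def __init__(self, remote_aggregation: bool = True):
        self.remote_aggregation = remote_aggregation

    def can_pushdown_upperrel(self) -> Optional[dict]:
        """Return None if pushdown is not supported, else details for the planner."""
        if not self.remote_aggregation:
            return None
        # The C side reads the keys of "agg_functions" to learn which
        # aggregation functions may be pushed down.
        return {"agg_functions": dict(self.SUPPORTED_AGGS)}
```

Returning a mapping rather than a plain list matches the `PyMapping_Keys` call in the hunk from `src/python.c`, which only makes sense if `agg_functions` is dict-like.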
python/multicorn/__init__.py
Outdated
        column to be used in the aggregation operation. Result should be
        returned under the provided aggregation key.
    group_clauses (list): A list of columns used in GROUP BY statements.
        The result should be returned for each column name provided.
What does this mean -- does every row we return need to have an entry for everything in columns + aggs?
What I meant to say is that whenever there is a group_clauses kwarg, then for each column specified there the returned response should have a corresponding value for each row using that column name as the key.
I re-worded the docstring as above, hopefully this clarifies it.
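A minimal sketch of that row shape follows, assuming `aggs` is a dict of `{"function.column": {"function": ..., "column": ...}}` as read by the reviewer later in this conversation. The sample data, the standalone `execute` function, and the reducer table are hypothetical and stand in for a real remote source.

```python
from collections import defaultdict

# Hypothetical in-memory "remote" data set.
ROWS = [
    {"city": "Oslo", "temp": 4},
    {"city": "Oslo", "temp": 8},
    {"city": "Rome", "temp": 18},
]

# Aggregation functions this sketch knows how to evaluate.
REDUCERS = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": lambda xs: sum(xs) / len(xs),
}


def execute(quals, columns, aggs=None, group_clauses=None):
    """Yield plain rows, or one aggregated row per group when asked to.

    quals are ignored here to keep the sketch short.
    """
    if not aggs and not group_clauses:
        for row in ROWS:
            yield {c: row[c] for c in columns}
        return

    # Bucket the source rows by the values of the GROUP BY columns.
    groups = defaultdict(list)
    for row in ROWS:
        key = tuple(row[c] for c in (group_clauses or []))
        groups[key].append(row)

    for key, members in groups.items():
        # Every grouping column is returned under its own column name.
        out = dict(zip(group_clauses or [], key))
        # Every aggregate is returned under the key it was requested with.
        for agg_key, spec in (aggs or {}).items():
            values = [m[spec["column"]] for m in members]
            out[agg_key] = REDUCERS[spec["function"]](values)
        yield out
```

Calling `execute([], [], aggs={"max.temp": {"function": "max", "column": "temp"}}, group_clauses=["city"])` yields one dict per city, each carrying both a `city` entry and a `max.temp` entry.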
src/python.c
Outdated
    p_object = PyMapping_GetItemString(p_upperrel_pushdown, "agg_functions");
    if (p_object != NULL && p_object != Py_None)
    {
        state->agg_functions = PyMapping_Keys(p_object);
I don't think you ever DECREF state->agg_functions, so this will slowly leak. I'd extract the contents into a separate List here and get rid of the PyObject here so that you also don't have to mess with the Python API in foreign_expr_walker.
Good catch. I initially tried the route you mentioned but got stuck extracting Python Unicode objects into a PG List, so I went with this instead. Let me get back to this.
Ok, I've now added storing of supported agg functions to a List.
    foreach(lc_groupc, state->group_clauses)
    {
        PyObject *column = PyUnicode_FromString(strVal(lfirst(lc_groupc)));
        PyList_Append(group_clauses, column);
I think (but I'm not entirely sure, since https://docs.python.org/3/c-api/list.html#c.PyList_Append doesn't mention it; there is some evidence in https://stackoverflow.com/questions/3512414/does-this-pylist-appendlist-py-buildvalue-leak) that PyList_Append increments the refcount, so you need to DECREF the column here.
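One way to sanity-check this from the Python side is to watch the reference count around an append. This is only a CPython-specific illustration using `list.append` as the Python-level analogue of `PyList_Append`; it does not exercise the C code in this PR.

```python
import sys

column = object()      # fresh object, so nothing else holds a reference to it
group_clauses = []

before = sys.getrefcount(column)
group_clauses.append(column)   # the list takes its own reference to the item
after = sys.getrefcount(column)

delta = after - before         # extra references held because of the append
```

On CPython `delta` is 1: the list holds its own reference and the caller still holds the original one. In C, that original reference is the one returned by `PyUnicode_FromString`, which is why it has to be released with `Py_DECREF` after the append.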
    }
    #endif
    [...]
    /*
That's a lot of code! Can you mark the parts taken from other FDWs (here and in deparse) and parts that you added yourself so that I know where to concentrate the review? Currently it kind of makes sense to me but knowing where it came from would make it clearer.
Sure, I can do that. I can add some comments like // MY CODE START and // MY CODE END if that helps. Just keep in mind that the parts taken from other FDWs are also trimmed down, i.e. I've thrown away the irrelevant stuff so it's not 1-1.
Done - I've enclosed all deviations from common FDW code (as used in postgres_fdw and other implementations) with the above comments in multicorn.c and deparse.c (other files should be easier to parse, I think).
Again, it's worth mentioning that the common FDW code that I've "appropriated" was pruned.
    [...]
    initStringInfo(agg_key);
    appendStringInfoString(agg_key, strVal(function));
    appendStringInfoString(agg_key, ".");
Just to check my understanding, does the Python FDW get a dict of {"functionname.colname": {"function": "functionname", "column": "colname"}} and is it then expected to return a surrogate functionname.colname column in its response? e.g. https://github.com/splitgraph/postgres-elasticsearch-fdw/pull/1/files#diff-45ed0634a3ed30705f0b30dce58a096decc81bdf04af2df3906bc56d692c3de4R88-R92
src/python.c
Outdated
        pushdown_upperrel = true;
    }
    [...]
    Py_DECREF(p_upperrel_pushdown);
Should this DECREF be inside of the if (p_upperrel_pushdown != NULL && p_upperrel_pushdown != Py_None) like other decrefs?
We still need to decrement the Py_None reference, which I believe would leak if the Py_DECREF were inside the if statement.
That said, in case p_upperrel_pushdown is a null pointer, it seems the proper approach is to use Py_XDECREF, as in the pythonDictToTuple function.
It also seems like I should do something similar for p_object inside the outer if statement.
Adding those changes now.
Multicorn support for pushing down an arbitrary combination of bare aggregations and/or groupings to Python FDW instances.
`HAVING` clauses or `WHERE` clauses in case of aggregations. This case results in full record fetch and then subsequent filtering/aggregation on the PG side.
`ORDER BY` clauses, but in this case it does push down the aggregation, and performs only the ordering of returned aggregations on the PG side (so it's an improvement, albeit there's still some work to be done on doing sorting on the remote server).
`DISTINCT` or `COUNT(*)` for the time being (defaults to full record fetch and subsequent processing on PG side).
`postgres_fdw` and other FDW implementations.
CU-1x57q56