
ENH: merge_asof threshold minimum #61164

Open
Lituchy opened this issue Mar 21, 2025 · 6 comments
Labels
Enhancement · Needs Discussion (Requires discussion from core team before further action) · Needs Info (Clarification about behavior needed to assess issue) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

Lituchy commented Mar 21, 2025

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I often find myself using the merge_asof function on time series data. The tolerance and allow_exact_matches fields are very useful in filtering data, but it would be helpful to have more granular control over this tolerance. Being able to supply a minimum tolerance, in addition to the currently existing maximum tolerance, would give the user much more control over this function.

Feature Description

A current example for this function, taken from the documentation, is the following:

We only asof within 10ms between the quote time and the trade time and we exclude exact matches on time. However prior data will propagate forward.

>>> pd.merge_asof(
...     trades,
...     quotes,
...     on="time",
...     by="ticker",
...     tolerance=pd.Timedelta("10ms"),
...     allow_exact_matches=False
... )
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100     NaN     NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

I am envisioning a version where we could have:

We only asof within 10ms between the quote time and the trade time, but more than 2ms between the quote time and the trade time, and we exclude exact matches on time. However prior data will propagate forward.

>>> pd.merge_asof(
...     trades,
...     quotes,
...     on="time",
...     by="ticker",
...     tolerance=pd.Timedelta("10ms"),
...     min_tolerance=pd.Timedelta("2ms"),
...     allow_exact_matches=False
... )
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100     NaN     NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

Alternative Solutions

Another solution to this problem is to augment the currently existing tolerance argument to accept either a single datetimelike object (as it does today) or a tuple of datetimelike objects, which would act as a lower and upper bound, respectively.
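For illustration, a call under that alternative might look like the following (the tuple form of tolerance is purely hypothetical and not currently supported):

>>> pd.merge_asof(
...     trades,
...     quotes,
...     on="time",
...     by="ticker",
...     tolerance=(pd.Timedelta("2ms"), pd.Timedelta("10ms")),  # hypothetical (lower, upper) bounds
...     allow_exact_matches=False
... )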

Additional Context

No response

Lituchy added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Mar 21, 2025
rhshadrach (Member) commented
Thanks for the request!

However prior data will propagate forward

Can you provide a full example here, namely the DataFrames for trades and quotes?

rhshadrach added the Needs Discussion (Requires discussion from core team before further action), Needs Info (Clarification about behavior needed to assess issue), and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Mar 22, 2025
Lituchy (Author) commented Mar 22, 2025

Thanks for the quick response! That phrasing was just taken from the examples section of the current reference for this function; I don't think it's actually important for my specific request. Is there any other info I can provide?

rhshadrach (Member) commented Mar 22, 2025

As is, I do not understand the issue with the current features of merge_asof, I think because I'm not understanding the source data that goes into the computation. Providing this would help.

Lituchy (Author) commented Mar 22, 2025

Ah I see, here's an example with the current usage:

quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.037   MSFT   51.98   51.99
4 2016-05-25 13:30:00.041   MSFT   51.99   52.00
5 2016-05-25 13:30:00.047   GOOG  720.50  720.93
6 2016-05-25 13:30:00.051   AAPL   97.99   98.01
7 2016-05-25 13:30:00.072   GOOG  720.50  720.88
8 2016-05-25 13:30:00.075   MSFT   52.01   52.03

trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.049   GOOG  720.92       100
4 2016-05-25 13:30:00.050   AAPL   98.00       100
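
For reference, here is one way to construct these two frames directly from the tables above, so the calls below can be run as-is (a minimal sketch):

import pandas as pd

quotes = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "2016-05-25 13:30:00.023", "2016-05-25 13:30:00.023",
            "2016-05-25 13:30:00.030", "2016-05-25 13:30:00.037",
            "2016-05-25 13:30:00.041", "2016-05-25 13:30:00.047",
            "2016-05-25 13:30:00.051", "2016-05-25 13:30:00.072",
            "2016-05-25 13:30:00.075",
        ]),
        "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
        "bid": [720.50, 51.95, 51.97, 51.98, 51.99, 720.50, 97.99, 720.50, 52.01],
        "ask": [720.93, 51.96, 51.98, 51.99, 52.00, 720.93, 98.01, 720.88, 52.03],
    }
)

trades = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "2016-05-25 13:30:00.023", "2016-05-25 13:30:00.038",
            "2016-05-25 13:30:00.048", "2016-05-25 13:30:00.049",
            "2016-05-25 13:30:00.050",
        ]),
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    }
)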

pd.merge_asof(
    trades,
    quotes,
    on="time",
    by="ticker",
    tolerance=pd.Timedelta("10ms"),
    allow_exact_matches=False
)
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.98   51.99  <---- this bid and ask come from row 3 in quotes
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93  <---- these come from row 5 in quotes
3 2016-05-25 13:30:00.049   GOOG  720.92       100  720.50  720.93  <---- these come from row 5 in quotes
4 2016-05-25 13:30:00.050   AAPL   98.00       100     NaN     NaN

This effectively acts like a group-by on "ticker": each trade is matched to the most recent prior quote with the same ticker whose "time" is within 10ms but not identical (the default direction="backward" with allow_exact_matches=False).
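
To make that concrete, here is a rough per-row equivalent of what the current call does for a single trade (asof_backward is just an illustrative helper, not pandas API; it assumes quotes is sorted by "time"):

def asof_backward(trade_time, ticker, quotes, tol=pd.Timedelta("10ms")):
    # Candidate quotes: same ticker, strictly earlier than the trade
    # (allow_exact_matches=False) and at most `tol` earlier.
    cand = quotes[
        (quotes["ticker"] == ticker)
        & (quotes["time"] < trade_time)
        & (trade_time - quotes["time"] <= tol)
    ]
    # direction="backward" (the default) picks the most recent such quote.
    return cand.iloc[-1] if len(cand) else None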

I am proposing that by specifying a min tolerance as well, we'd have

pd.merge_asof(
    trades,
    quotes,
    on="time",
    by="ticker",
    tolerance=pd.Timedelta("10ms"),
    min_tolerance=pd.Timedelta("2ms"),
    allow_exact_matches=False
)
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98  <---- since row 3 in quotes is within the min tolerance, it is skipped over and this row is merged with row 2 of quotes instead, since that one is within [2ms, 10ms] of this time
2 2016-05-25 13:30:00.048   GOOG  720.77       100     NaN     NaN  <---- row 5 in quotes is only 1ms away, less than our min tolerance, and nothing else is within [2ms, 10ms], so this no longer matches
3 2016-05-25 13:30:00.049   GOOG  720.92       100  720.50  720.93  <---- row 5 in quotes is 2ms away, so it matches
4 2016-05-25 13:30:00.050   AAPL   98.00       100     NaN     NaN
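
In terms of the helper above, the proposed behavior would just add a lower bound to the candidate filter (min_tolerance / min_tol is hypothetical):

def asof_backward_min(trade_time, ticker, quotes,
                      tol=pd.Timedelta("10ms"), min_tol=pd.Timedelta("2ms")):
    # Same as before, but quotes closer than `min_tol` are skipped entirely,
    # so the match can fall back to an older quote inside [min_tol, tol].
    cand = quotes[
        (quotes["ticker"] == ticker)
        & (trade_time - quotes["time"] >= min_tol)
        & (trade_time - quotes["time"] <= tol)
    ]
    return cand.iloc[-1] if len(cand) else None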

Let me know if that makes sense as to why this feature is desired -- if not I can craft more examples or would be happy to have more discussion about this. Thanks!

snitish (Member) commented Mar 22, 2025

@Lituchy I believe you can achieve the same behavior with an auxiliary column time_adj defined as below:

trades['time_adj'] = trades['time'] - pd.Timedelta("2ms")
pd.merge_asof(
    trades,
    quotes,
    left_on="time_adj",
    right_on="time",
    by="ticker",
    tolerance=pd.Timedelta("8ms"),
)
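
For completeness, applying this end to end on the frames above might look like the following (a sketch; renaming the quote timestamp to the illustrative name quote_time avoids the _x/_y suffixing of the overlapping "time" column):

trades["time_adj"] = trades["time"] - pd.Timedelta("2ms")
pd.merge_asof(
    trades,
    quotes.rename(columns={"time": "quote_time"}),
    left_on="time_adj",
    right_on="quote_time",
    by="ticker",
    tolerance=pd.Timedelta("8ms"),
).drop(columns=["time_adj"])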

Lituchy (Author) commented Mar 22, 2025

I think that could work in some cases -- like my posted example above -- however there are a few cases where this would not work as intended:

  • allow_exact_matches=False would no longer work as intended, and you may get an exact match without intending to
  • If direction="nearest" is used instead of the default direction="backward", this could actually lead to an incorrect merge (see the sketch below)

If you have any other ideas on how we might deal with the above, I'd be very curious to hear them; I haven't been able to think of a way to get all of this behavior without a min_tolerance.
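
To make the second point concrete, here is a small sketch with made-up timestamps showing how the shifted key can change which quote is "nearest" (illustrative data only, not from the example above):

trade = pd.DataFrame(
    {"time": pd.to_datetime(["2025-01-01 00:00:00.010"]), "ticker": ["X"]}
)
quote = pd.DataFrame(
    {
        "time": pd.to_datetime([
            "2025-01-01 00:00:00.007",  # 3ms before the trade
            "2025-01-01 00:00:00.012",  # 2ms after the trade
        ]),
        "ticker": ["X", "X"],
        "bid": [1.0, 2.0],
    }
)

# Under the proposal (direction="nearest", tolerance=10ms, min_tolerance=2ms),
# both quotes clear the 2ms minimum and the 2ms-after quote (bid=2.0) is nearest.
# With the shift workaround the trade key becomes 00:00:00.008, so the
# 3ms-before quote (bid=1.0) now looks only 1ms away and wins instead.
trade["time_adj"] = trade["time"] - pd.Timedelta("2ms")
pd.merge_asof(
    trade,
    quote,
    left_on="time_adj",
    right_on="time",
    by="ticker",
    tolerance=pd.Timedelta("8ms"),
    direction="nearest",
)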
