Skip to content

Fixes log deduplication when mutating Labels using LogQL #5289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Feb 4, 2022

Conversation

cyriltovena
Copy link
Contributor

@cyriltovena cyriltovena commented Feb 1, 2022

What this PR does / why we need it:

Labels for each sample and log were used to find and remove duplicates when iterating over them. However since those labels can be mutated by the LogQL engine for optimization or by the users ( label_format, sum()) the deduplication wasn't consistent anymore.

This introduce a new StreamHash function that contains the original stream label hash and can be used for deduplication purpose. This ensure the deduplication is consistent and independent of the queries made.

This PR is built on #5281 which separate the sorting logic to the dedupe logic allowing to have different behaviour, this was required to make sharding work.

Which issue(s) this PR fixes:
Fixes #4500

Example of metric queries with duplicates before and after:

image

The last hour is still buggy because I did not deploy the ingesters during my test.

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@cyriltovena cyriltovena requested a review from a team as a code owner February 1, 2022 12:33
Signed-off-by: Cyril Tovena <[email protected]>
Copy link
Contributor

@sandeepsukhani sandeepsukhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work!

@cyriltovena cyriltovena merged commit e3b7415 into grafana:main Feb 4, 2022
KMiller-Grafana pushed a commit to KMiller-Grafana/loki that referenced this pull request Feb 4, 2022
* Changes the iterator interface to include labelshash.

Signed-off-by: Cyril Tovena <[email protected]>

* Improve unused args and fixes build

Signed-off-by: Cyril Tovena <[email protected]>

* Implement the new hash in the memchunk.

Signed-off-by: Cyril Tovena <[email protected]>

* Add the hash in logproto

Signed-off-by: Cyril Tovena <[email protected]>

* Insert the hash from ingester-querier

Signed-off-by: Cyril Tovena <[email protected]>

* Rename LabelsHash to StreamHash

Signed-off-by: Cyril Tovena <[email protected]>

* Working through fixing tests.

Signed-off-by: Cyril Tovena <[email protected]>

* Refactor HeapIterator into merge and sort Iterator.

Signed-off-by: Cyril Tovena <[email protected]>

* lint

Signed-off-by: Cyril Tovena <[email protected]>

* Order alphabetically when ordering samples.

Signed-off-by: Cyril Tovena <[email protected]>

* Do not JSON encode the hash for legacy model.

Signed-off-by: Cyril Tovena <[email protected]>

* Add more regression tests.

Signed-off-by: Cyril Tovena <[email protected]>

* Update CHANGELOG.md

Signed-off-by: Cyril Tovena <[email protected]>

* Review feedback
@cyriltovena cyriltovena mentioned this pull request Feb 24, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduping with label_format or metrics aggregation can be wrong
5 participants