
Pivot function is returning None values when reading DB "too fast" #25136

Open
rusnackor opened this issue Jul 10, 2024 · 1 comment

Hello, I have already been through this issue on influxdb-client-python, but from the communication with Jakub and additional tests, it seems to be an issue of the DB itself, not the client.
Here is the link to the related issue on the python-client page: influxdata/influxdb-client-python#662

Steps to reproduce:

  1. A process/thread that is writing about 200 data points into bucket "test_bucket" every second. Each data point has 5 tag and 7 field "columns". One of the fields is a unique sequence integer (row_id). It is generated externally, since InfluxDB does not have such a capability, and is added to the data points before the batch is sent to the DB (python-client). A minimal sketch of this writer follows the list.
  2. A second thread/process that is reading from "test_bucket" every second. No stop time is given, just a start time; we are basically querying "give me the data that were added since my last query". To receive each data point only once, I use |> filter(fn: (r) => r["row_id"] > <last_received_id>)
  3. Even though the "writer" never stores any null values in the DB, sometimes the "reader" receives null in a random field value of a random data point. Often more than one null column is present, but it is never a tag.
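
For context, a minimal sketch of the writer described in step 1, assuming influxdb-client-python with synchronous writes; the URL, token, org and the tag/field layout are placeholders, not taken from our actual setup:

from itertools import count

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)
row_ids = count(1)  # stand-in for the externally generated unique sequence

def write_batch(samples):
    """Write one batch (~200 samples per second); each point gets 5 tags, 6 data fields and the row_id field."""
    points = []
    for sample in samples:
        point = Point("measurement_name")
        for name, value in sample["tags"].items():    # 5 tag columns
            point = point.tag(name, value)
        for name, value in sample["fields"].items():  # 6 field columns
            point = point.field(name, value)
        point = point.field("row_id", next(row_ids))  # 7th field: unique sequence integer
        points.append(point)
    write_api.write(bucket="test_bucket", record=points)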

Full query:

f'from(bucket: "{self.bucket_name}")'
f' |> range(start: _start)'
f' |> filter(fn: (r) => r["_measurement"] == "measurement_name")'
f' |> drop(columns: ["_start", "_stop"])'
f' |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")'
f' |> filter(fn: (r) => r["row_id"] > {id_start})'

What we have observed so far is that those nulls happen very close to now(). Jakub suggested that this is caused by the pivot function. To test that, I added a stop date, i.e. a delay; Jakub suggested this: |> range(start: _start, stop: -10ms). I tested with a 1-minute delay, where the start and stop datetimes were calculated in Python and passed to the query as parameters.
This works: once I added this delay to the reading, I haven't seen any null values extracted from the DB.
(It was about 5 nulls per hour before the delay; no nulls were observed after 24 h of running the test with the delay.)
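
A rough sketch of such a delayed window, assuming the start/stop values are computed in Python and substituted into the Flux text as described; the offsets and variable names are illustrative:

from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
start = (now - timedelta(minutes=2)).strftime("%Y-%m-%dT%H:%M:%SZ")
stop = (now - timedelta(minutes=1)).strftime("%Y-%m-%dT%H:%M:%SZ")  # read only data older than one minute

range_stage = f' |> range(start: {start}, stop: {stop})'  # replaces range(start: _start) in the query above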

BUT, this is not a solution for us. The data we are receiving (and writing) are not chronologically ordered, and it is impossible to have them chronologically ordered (on input). We also need to extract each new data point exactly once, not more; otherwise the next layer above us will see duplicates and that will cause errors.
That is why we added the row_id for filtering. And because we also use row_id to filter data, it is possible that we will be losing data if we add a stop time to the range() function.
Basically, the chronologically "younger" data point that is excluded by the stop parameter can have a lower row_id, because it was received earlier than the "older" data points. In that case, this data point will be excluded in the first query by the stop range parameter. Then, in the next query, it will be filtered out again by the r["row_id"] > {id_start} filter, and we will never extract it from the DB.
(I don't want to go into too much detail about this here, since it is a different topic, but I can explain it in the comments if asked.)
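
To make that scenario concrete (an illustration with made-up numbers): point A arrives first and gets row_id = 100, but carries timestamp 12:00:05; point B arrives later, gets row_id = 101, but carries the older timestamp 12:00:01. A query with stop = 12:00:03 returns only B, so the next query filters on r["row_id"] > 101. By the time A's timestamp falls inside the queried range, the row_id filter already excludes it, and A is never extracted.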

Expected behaviour:
Once data points are written, they are fully usable. So if I ask the DB "give me all data from 1 min ago until now", it will give me all data that are fully available. I know that in the internal structure of the DB each field is stored separately, assigned to its tags. But if I write a data point with 7 field values, none of those 7 field values should be available for reading until all of them are "processed" correctly. It should be one transaction.

Actual behaviour:
If I write a data point with 7 field values, some of the values become available for reading before the others. And if I read the data point "too fast", I can receive only partial information for it. This is unacceptable. It would be no issue if I did not receive the data point at all, since it is not yet fully "processed", but returning partial data is causing us big issues in our implementation.

Environment info:

  • Client Version: influxdb-client-python/1.40.0
  • InfluxDB Version: 2.7.3
  • Platform: alpine-linux

Config:
No modification in config.

Logs:
Unfortunately, I cannot share any details publicly, because of corporate cybersecurity rules. However, I have permission to share details needed to reproduce the issue privately (direct messages on Slack, for example).

davidby-influx (Contributor) commented Jul 22, 2024

InfluxDB is a columnar, schema-less database. There is no way for InfluxDB to know when a point is fully written, because new fields can be added at any time. When processing a write operation, each field is written separately (which is how fields can be added to a point via multiple writes).

Writes to different fields for the same point can happen at different times. Here are two points being inserted. One has two values written in a single INSERT statement, then the other is written with two INSERT statements separated by a SELECT, and then the first point has a third field written:

> INSERT foo,tagone=t1 v1=34,v2=35 1721684170301626470
> select * from foo
name: foo
time                tagone v1 v2
----                ------ -- --
1721684170301626470 t1     34 35
>
> INSERT foo,tagone=t2 v1=13 1721684170301626480
> select * from foo
name: foo
time                tagone v1 v2
----                ------ -- --
1721684170301626470 t1     34 35
1721684170301626480 t2     13
>
> INSERT foo,tagone=t2 v2=15 1721684170301626480
> select * from foo
name: foo
time                tagone v1 v2
----                ------ -- --
1721684170301626470 t1     34 35
1721684170301626480 t2     13 15
>
> INSERT foo,tagone=t1 v3=67 1721684170301626470
> select * from foo
name: foo
time                tagone v1 v2 v3
----                ------ -- -- --
1721684170301626470 t1     34 35 67
1721684170301626480 t2     13 15
>

As a user, if you add all fields of a point in one write operation, you can be assured that all fields are written when that operation finishes and returns a success code. So perhaps you can query for data by row ID only after you are sure that the write for that row ID has completed. In your filter you could say something like

<omitted code>
|> filter(fn: (r) => r["row_id"] > {id_start} and r["row_id"] <= {id_last_written})
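
A minimal sketch of one way to apply this suggestion, extending the writer/reader sketches above; the shared last_written_id watermark is purely application-side bookkeeping, not an InfluxDB feature:

last_written_id = 0  # highest row_id whose synchronous write has already returned successfully

def write_batch_with_watermark(write_api, points, max_row_id):
    """Write the batch, then advance the watermark so readers may trust everything up to it."""
    global last_written_id
    write_api.write(bucket="test_bucket", record=points)  # returns only after the whole batch is written
    last_written_id = max(last_written_id, max_row_id)

def row_id_filter(id_start):
    """Filter stage bounded by the watermark, so rows still being written never show up partially pivoted."""
    return f' |> filter(fn: (r) => r["row_id"] > {id_start} and r["row_id"] <= {last_written_id})'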

davidby-influx self-assigned this Jul 22, 2024