Hello, I have already been through this issue on influxdb-client-python, but based on the communication with Jakub and additional tests, it seems to be an issue with the DB itself, not the client.
Here is the link to related issue on python-client page: influxdata/influxdb-client-python#662
Steps to reproduce:
A process/thread that writes about 200 data points into bucket "test_bucket" every second. Each data point has 5 tag and 7 field "columns". One of the fields is a unique sequence integer. It is generated externally, since InfluxDB has no such capability, and added to the data points before the batch is sent to the DB (python-client).
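The writer side can be sketched roughly like this (a minimal sketch: the measurement, tag, and field names and the `itertools.count` sequence are illustrative assumptions; the real setup uses influxdb-client-python batch writes rather than raw line protocol):

```python
import itertools
import time

# Externally generated, monotonically increasing sequence number
# (InfluxDB has no auto-increment, so the writer must supply it).
row_id_gen = itertools.count(1)

def make_line(measurement, tags, fields, ts_ns):
    """Render one data point as InfluxDB line protocol:
    measurement,tag=... field=...,row_id=...i timestamp"""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_items = dict(fields)
    field_items["row_id"] = next(row_id_gen)
    field_part = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}"
        for k, v in field_items.items()
    )
    return f"{measurement},{tag_part} {field_part} {ts_ns}"

# One batch: ~200 points per second, each with 5 tags and 7 fields
# (6 payload fields plus the row_id field).
batch = [
    make_line(
        "sensors",
        {"t1": "a", "t2": "b", "t3": "c", "t4": "d", "t5": "e"},
        {f"f{i}": float(i) for i in range(1, 7)},
        time.time_ns(),
    )
    for _ in range(200)
]
```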
A second thread/process that reads from "test_bucket" every second. No stop time is given, just a start time. We are basically querying "give me the data that was added since my last query". To receive each data point only once, I use |> filter(fn: (r) => r["row_id"] > <last_received_id>)
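The reader's query can be sketched as follows (a hedged reconstruction: the bucket parameterization and measurement name are assumptions; the pivot step is included because the thread later attributes the nulls to the pivot function, so the real query presumably pivots fields into columns):

```python
def build_read_query(bucket, start, last_received_id):
    # Open-ended range (no stop): read everything since the last poll,
    # pivot fields into columns, then keep only rows with a row_id
    # we have not yet received.
    return f'''
from(bucket: "{bucket}")
  |> range(start: {start})
  |> filter(fn: (r) => r["_measurement"] == "sensors")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> filter(fn: (r) => r["row_id"] > {last_received_id})
'''

query = build_read_query("test_bucket", "-1s", 12345)
```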
Even if "writter" does not store any null values in DB, sometimes "reader" receives null in random field value of random data point. Often more than one null column is present, but it is never tag.
What we have observed so far is that those nulls happen very close to now(). Jakub suggested it is caused by the pivot function. To test that, I added a stop date, i.e. a delay. Jakub suggested this: |> range(start: _start, stop: -10ms). I tested with a 1-minute delay, where the start and stop datetimes were calculated in Python and passed to the query as parameters.
This works: once I added this delay to the reads, I haven't seen any null values extracted from the DB.
(It was about 5 nulls per hour before the delay; no nulls were observed after 24 h of running the test with the delay.)
BUT this is not a solution for us. The data we are receiving (and writing) is not chronologically ordered, and it is impossible to have it chronologically ordered (on input). We also need to extract each new data point exactly once, not more; otherwise the next layer above us will see duplicates, and that will cause errors.
That is why we added row_id for filtering. And because we also use row_id to filter the data, it is possible that we will lose data if we add a stop time to the range() function.
Basically, a chronologically "younger" data point that is excluded by the stop parameter can have a lower row_id, because it was received earlier than "older" data points. In that case, the data point is excluded from the first query by the stop range parameter. Then in the next query, it is filtered out again by the r["row_id"] > {id_start} filter, and we will never extract it from the DB.
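The loss scenario described above can be reproduced with a small simulation (a sketch; the timestamps and ids are made up for illustration):

```python
# Points as (timestamp, row_id). row_id reflects arrival order, which
# does not match the timestamp order of the out-of-order input stream:
# the point with ts=105 arrived first, so it got row_id=1.
points = [(100, 2), (105, 1)]

def poll(points, start, stop, last_id):
    """One reader poll: range(start, stop) plus the row_id > last_id filter."""
    return [
        (ts, rid) for ts, rid in points
        if start <= ts < stop and rid > last_id
    ]

# Poll 1: stop=103 excludes the point at ts=105 (row_id=1),
# but returns the point at ts=100 (row_id=2).
seen = poll(points, start=0, stop=103, last_id=0)
last_id = max(rid for _, rid in seen)

# Poll 2: the range now covers ts=105, but row_id=1 <= last_id,
# so that point is filtered out -- and will be on every later poll too.
lost = poll(points, start=103, stop=200, last_id=last_id)
```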
(I don't want to go into too much detail about this here, since it is a different topic, but I can explain it in the comments if asked.)
Expected behaviour:
Once data points are written, they are fully usable. So if I ask the DB "give me all data from 1 min ago until now", it gives me all data that is fully available. I know that the internal structure of the DB stores each field separately per tag set. But if I write a data point with 7 field values, none of those 7 field values should be available for reading until all of them have been "processed" correctly. It should be one transaction.
Actual behaviour:
If I write a data point with 7 field values, some of the values become available for reading before others. And if I read the data point "too fast", I can receive only partial information for it. This is unacceptable. It would be no issue if I did not receive the data point at all, since it is not yet fully "processed", but returning partial data is causing big issues in our implementation.
Environment info:
Client Version: influxdb-client-python/1.40.0
InfluxDB Version: 2.7.3
Platform: alpine-linux
Config:
No modification in config.
Logs:
Unfortunately, I cannot share any details publicly because of corporate cybersecurity rules. However, I have permission to share the details needed to reproduce the issue privately (direct messages on Slack, for example).
InfluxDB is a columnar, schema-less database. There is no way for InfluxDB to know when a point is fully written, because new fields can be added at any time. When processing a write operation, each field is written separately (which is how fields can be added to a point via multiple writes).
Writes to different fields for the same point can happen at different times. Here are two points being inserted. One has two values written in a single INSERT statement, then the other is written with two INSERT statements separated by a SELECT, and then the first point has a third field written.
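The demonstration referenced above did not survive in this copy, but the merge behavior it describes can be modeled with a toy in-memory version (a simplified model of per-field storage, not InfluxDB's actual engine; measurement, tag, and field names are made up):

```python
# Toy model: each field is stored independently, keyed by
# (measurement, tags, timestamp). Separate writes to the same point
# key merge into one logical row, field by field.
storage = {}

def write(measurement, tags, timestamp, fields):
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    # Each field is written separately -- there is no transaction
    # spanning all fields of a point.
    for field, value in fields.items():
        storage.setdefault(key, {})[field] = value

# Point A: two fields written in a single operation.
write("m", {"t": "x"}, 1000, {"f1": 1.0, "f2": 2.0})
# Point B: built up across two separate write operations.
write("m", {"t": "y"}, 1000, {"f1": 10.0})
write("m", {"t": "y"}, 1000, {"f2": 20.0})
# Point A gains a third field in a later write.
write("m", {"t": "x"}, 1000, {"f3": 3.0})

# A reader that queried between B's two writes would have seen
# only {"f1": 10.0} -- a partially visible point.
```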
As a user, if you add all fields of a point in one write operation, you can be assured that all fields are written when that operation finishes and returns a success code. So perhaps you can query by looking for data by row ID only after you are sure that the write for that row ID has completed. So in your filter you could say something like
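The inline filter example after "something like" was lost in this copy. A hedged sketch of the idea (all names here are illustrative assumptions): have the writer record the highest row_id whose batch write has returned success, and cap the reader's filter at that value so partially visible points near now() are never returned:

```python
def build_safe_query(bucket, start, last_received_id, max_confirmed_id):
    # Only rows whose write has been confirmed complete are eligible:
    # last_received_id < row_id <= max_confirmed_id.
    return f'''
from(bucket: "{bucket}")
  |> range(start: {start})
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> filter(fn: (r) => r["row_id"] > {last_received_id} and r["row_id"] <= {max_confirmed_id})
'''

# max_confirmed_id would come from the writer, updated after each
# successful batch write returns.
q = build_safe_query("test_bucket", "-1m", 100, 180)
```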