How do we identify entity end-dates across datasets? #392
Replies: 1 comment
Heritage at Risk is one example, where this has been raised as an issue with the data supplier (Historic England). In their case, they remove any records that are no longer valid from each release iteration.
The data we hold in the
The least worst option, in my view, is to treat the last known
Presenting old records as valid is significantly worse from a data quality point of view and undermines trust in the platform with suppliers and users.
Challenge
Entities have an `end-date` field which should record the date at which an entity is no longer valid or in effect. Some data providers publish their data in ways that don't completely map to this model, for instance:
`end-date`.
This means that in some cases, where we have continued to create new entities as new data releases are published, we have retained entities that the provider would consider "ended" but we don't have an `end-date` value for them.
We could work out and add an `end-date` value for these entities ourselves, however:
`brownfield-land` we experimented with self-hosting an `end-date` dataset, but it has not worked consistently, as for some entities our `end-date` value is de-prioritised against a blank `end-date` fact from the data provider)
Task
Agree which datasets need end-dates, and which have a separate data source for `end-date` vs. which may need an `end-date` inferred.
Define standard methods we can use to identify end-dates:
We should document patterns of data publishing and what they mean for sourcing or inferring an entity `end-date` (e.g. provider publishes releases to new endpoints and removes entities). Factors to consider might be: whether the dataset is single-source or compiled, how frequently endpoints are updated, whether new releases are published to new or existing endpoints, and whether the identifier is persistent between releases.
If we identify common patterns we can close the loop with data design's dataset investigation process, and be confident we can reliably maintain a dataset once it is published.
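As a rough illustration of one such pattern (provider publishes full releases and drops ended records), the sketch below infers an end-date for any reference that is missing from the latest release. It assumes identifiers are persistent between releases, and it assumes the inferred end-date is the date of the first release published after the reference last appeared; both are assumptions for illustration, not an agreed method. The function name `infer_end_dates` and the sample references are hypothetical.

```python
from datetime import date


def infer_end_dates(releases):
    """Infer end-dates for references that drop out of later releases.

    `releases` maps a release date to the set of references published in
    that release. A reference missing from the latest release is given the
    date of the first release published after its last appearance.
    """
    ordered = sorted(releases)
    latest = releases[ordered[-1]]
    inferred = {}
    for index, release_date in enumerate(ordered[:-1]):
        # References in this release that the latest release no longer has.
        for reference in releases[release_date] - latest:
            # Later iterations overwrite earlier ones, so the surviving
            # value is based on the reference's last appearance.
            inferred[reference] = ordered[index + 1]
    return inferred


# Hypothetical sample data: three annual releases of one dataset.
releases = {
    date(2022, 11, 1): {"A", "B", "C"},
    date(2023, 11, 1): {"A", "B"},
    date(2024, 11, 1): {"A"},
}
inferred = infer_end_dates(releases)
```

If the date of the last release that still contained the record is preferred instead, replacing `ordered[index + 1]` with `release_date` would give that behaviour. Datasets without persistent identifiers would need a different matching step before anything like this could apply.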
Work out how we store and add end-dates we infer ourselves. We've experimented with self-hosting but there are challenges. Could we potentially store inferred end-dates in config data, like the `old-entity.csv`?
Explore methods for automating this where possible.
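To make the config-data idea concrete, here is a minimal sketch of how inferred end-dates might be kept in a CSV and applied only where the provider's `end-date` is blank, which is the precedence problem described under Challenge. The CSV layout (`entity,end-date`), the entity numbers, and the function names are all hypothetical; this is not how the existing pipeline resolves facts.

```python
import csv
import io

# Hypothetical config file contents; the column names are assumptions.
INFERRED_END_DATES_CSV = """entity,end-date
1001,2023-11-01
1002,2024-11-01
"""


def load_inferred_end_dates(handle):
    """Read an entity -> inferred end-date mapping from a CSV file handle."""
    return {row["entity"]: row["end-date"] for row in csv.DictReader(handle)}


def resolve_end_date(provider_value, inferred_value):
    """Prefer a non-blank provider end-date, else fall back to the inferred one."""
    if provider_value and provider_value.strip():
        return provider_value.strip()
    return inferred_value or ""


inferred_by_entity = load_inferred_end_dates(io.StringIO(INFERRED_END_DATES_CSV))

# Provider facts keyed by entity; an empty string stands for a blank end-date.
provider = {"1001": "", "1002": "2024-06-30", "1003": ""}

resolved = {
    entity: resolve_end_date(value, inferred_by_entity.get(entity))
    for entity, value in provider.items()
}
```

The point of `resolve_end_date` is that a blank provider value never overrides an inferred one, while a real provider value always wins. Entities with neither source simply stay blank.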