How do we identify entity end-dates across datasets? #392

greg-slater · 2025-04-23T10:49:36Z

Challenge

Entities have an end-date field, which should record the date on which an entity stops being valid or in effect. Some data providers publish their data in ways that don't map completely to this model, for instance:

- they may remove entities from later data releases, rather than retaining them with an end-date.
- they may publish information about entity end-dates in a separate dataset.

This means that, in some cases where we have continued to create new entities as new data releases are published, we have retained entities that the provider would consider "ended" but for which we have no end-date value.

We could work out and add an end-date value for these entities ourselves, however:

- Identifying the end-date for an entity ourselves is potentially a time-consuming task; the method can vary based on the dataset and how it is published.
- We don't have a reliable method for adding an end-date to entities ourselves. (For brownfield-land we experimented with self-hosting an end-date dataset, but it has not worked consistently: for some entities our end-date value is de-prioritised against a blank end-date fact from the data provider.)
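
To illustrate the brownfield-land problem, here is a minimal sketch of how a blank provider fact could mask a self-hosted end-date. It assumes a simple "highest priority wins" resolver; the `Fact` class, the priority numbers, and the example date are all hypothetical and are not taken from the actual pipeline, whose prioritisation rules are not described in this thread.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    source: str    # who asserted the value (hypothetical labels)
    priority: int  # higher wins; the ordering is an assumption
    end_date: str  # "" means the source gave no end-date


def resolve_by_priority(facts):
    """Take the end-date from the highest-priority fact, blank or not."""
    return max(facts, key=lambda f: f.priority).end_date


def resolve_ignoring_blanks(facts):
    """Take the end-date from the highest-priority fact that has a value."""
    non_blank = [f for f in facts if f.end_date]
    if not non_blank:
        return ""
    return max(non_blank, key=lambda f: f.priority).end_date


# Illustrative values only: the provider publishes a blank end-date,
# while our self-hosted dataset holds an end-date for the same entity.
facts = [
    Fact(source="provider", priority=2, end_date=""),
    Fact(source="self-hosted", priority=1, end_date="2024-03-31"),
]

blank_wins = resolve_by_priority(facts)        # the blank provider fact wins
value_wins = resolve_ignoring_blanks(facts)    # the self-hosted value survives
```

Whether ignoring blank facts is acceptable in general is an open question, since a provider may sometimes blank an end-date deliberately.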

Task

- Agree which datasets need end-dates and, for each, whether a separate data source supplies the end-date or whether it may need to be inferred.
- Define standard methods we can use to identify end-dates:
  - We should document patterns of data publishing and what they mean for sourcing or inferring an entity end-date (e.g. the provider publishes releases to new endpoints and removes entities). Factors to consider might be: whether the dataset is single-source or compiled, how frequently endpoints are updated, whether new releases are published to new or existing endpoints, and whether the identifier is persistent between releases.
  - If we identify common patterns we can close the loop with data design's dataset investigation process, and be confident we can reliably maintain a dataset once it is published.
- Work out how we store and add end-dates we infer ourselves. We've experimented with self-hosting but there are challenges. Could we potentially store inferred end-dates in config data, like the old-entity.csv?
- Explore methods for automating this where possible.
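
As a starting point for documenting publishing patterns, the factors listed above could be recorded per dataset and mapped to a candidate method. This is only a sketch: the `PublishingPattern` fields, the method names, and the decision order are assumptions for discussion, not an agreed design.

```python
from dataclasses import dataclass


@dataclass
class PublishingPattern:
    compiled: bool                  # compiled from many providers vs. single-source
    update_frequency: str           # e.g. "annual"; recorded only, limits how precise an inferred date can be
    new_endpoint_per_release: bool  # releases go to new endpoints rather than an existing one
    removes_ended_entities: bool    # ended entities are dropped from later releases
    persistent_identifier: bool     # the identifier is stable between releases
    separate_end_date_source: bool  # provider publishes end-dates in a separate dataset


def end_date_method(p: PublishingPattern) -> str:
    """Suggest how end-dates might be obtained (hypothetical method names)."""
    if p.separate_end_date_source:
        return "source-from-separate-dataset"
    if p.removes_ended_entities and p.persistent_identifier:
        return "infer-from-last-seen-release"
    if p.removes_ended_entities:
        return "manual-investigation"
    return "use-provider-end-date"


# Example: a single-source dataset released annually, where ended
# entities are removed but the identifier persists between releases.
example = PublishingPattern(
    compiled=False,
    update_frequency="annual",
    new_endpoint_per_release=True,
    removes_ended_entities=True,
    persistent_identifier=True,
    separate_end_date_source=False,
)
method = end_date_method(example)
```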

drewhardy77 · 2025-07-17T08:13:37Z

Maintainer

Heritage at Risk is one example where this has been raised as an issue with the data supplier (Historic England). In their case, they remove any records that are no longer valid from each release iteration. The value we hold in the entry-date field is the date of the last iteration in which the record was present. It therefore represents the last date on which the record was known to be valid, which is of course not the same as a definitive end date. However, without an end date, these records imply the existence of an entity that no longer exists.

The least worst option, in my view, is to treat the last known entry-date as the end date for records that are no longer in the latest version of the data supply, using the persistent identifier (reference) to look them up in older versions. This should be done retrospectively as well as going forward.

Presenting old records as valid is significantly worse from a data quality point of view and undermines trust in the platform with suppliers and users.
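
The approach of treating the last known entry-date as the end date could be sketched roughly as follows. It assumes each release can be reduced to an entry-date plus the set of reference values it contains; the function name, the data shape, and the example references and dates are hypothetical.

```python
def infer_end_dates(releases):
    """Infer end-dates for references missing from the latest release.

    releases: list of (entry_date, references) pairs, where entry_date is an
    ISO 8601 date string and references is a set of persistent identifiers.
    Returns {reference: last entry-date at which it was present}.
    """
    if not releases:
        return {}
    # ISO 8601 date strings sort chronologically as plain strings.
    ordered = sorted(releases, key=lambda r: r[0])
    latest_refs = ordered[-1][1]
    last_seen = {}
    for entry_date, refs in ordered:
        for ref in refs:
            last_seen[ref] = entry_date  # later releases overwrite earlier ones
    # Only references absent from the latest release get an inferred end-date.
    return {ref: d for ref, d in last_seen.items() if ref not in latest_refs}


# Made-up releases: 1002 drops out after the first, 1003 after the second.
releases = [
    ("2022-11-10", {"1001", "1002", "1003"}),
    ("2023-11-09", {"1001", "1003"}),
    ("2024-11-14", {"1001"}),
]
inferred = infer_end_dates(releases)
# inferred == {"1002": "2022-11-10", "1003": "2023-11-09"}
```

Because it processes every historic release, the same function covers both the retrospective case and new releases as they arrive. The inferred value is the last date the record was known to be valid, not a definitive end date.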
