There is no question that caching data in CI is the right pattern: it improves efficiency and avoids re-acquiring the data on every run.
However, another best practice (I'd argue) is that data access be done programmatically in the tutorial itself. The combination of these two patterns leads to cases where changes to the code that accesses/acquires data go untested, because CI serves the data from cache instead.
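To make the failure mode concrete, here is a minimal sketch of the "programmatic access plus cache" pattern; `fetch_dataset` and the URL are hypothetical, not from any tutorial in question. Once the cached file exists, the download branch never executes in CI, so a bug introduced there (a moved URL, a changed API) goes unnoticed until the cache is cleared.

```python
import os
import urllib.request


def fetch_dataset(url, cache_path):
    """Download ``url`` to ``cache_path`` unless a cached copy exists.

    Hypothetical helper illustrating the pattern under discussion:
    on a cache hit, the acquisition code below is never exercised.
    """
    if os.path.exists(cache_path):
        return cache_path  # cache hit: download code is skipped entirely
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, cache_path)  # only runs on a cold cache
    return cache_path
```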
I think this case should be addressed in the "how-to"/"FAQ" section of this site. Off the top of my head, the pattern that makes the most sense is a scheduled CI job that does not use data caching. This job should be triggered only by cron, with an option for manual triggering as well, for cases when reviewers identify that data access has changed[^1].

xref numpy/numpy-tutorials#255

[^1]: This could of course be extended to be made automatic, e.g. with notebook metadata, but IMO that's too involved for a high-level recommendation, at least at this stage!
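For GitHub Actions specifically, the proposed pattern might be sketched as below. The workflow name, schedule, and test commands are placeholders, not taken from any existing repo; the key points are the `schedule` and `workflow_dispatch` triggers and the deliberate absence of a cache step.

```yaml
# Hypothetical workflow: re-run the tutorial suite with no data cache,
# so the data-acquisition code is exercised end to end.
name: test-without-cache
on:
  schedule:
    - cron: "0 3 * * 1"   # weekly; frequency is a placeholder
  workflow_dispatch: {}    # manual trigger for reviewers
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      # Note: no actions/cache step here, unlike the regular CI job.
      - run: pip install -r requirements.txt
      - run: pytest --nbval notebooks/   # placeholder test command
```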
This feels like a large can of worms if we go into the details of what programmatic access means (can it be a data grabber from another GitHub repo, or from Google Drive? in practice that's what failed in the referenced numpy-tutorials case), and I'm not sure there is a generalised, one-size-fits-all solution.
OTOH, I can't argue with any of the basic principles you listed above, so if we keep the details at the above level, I would be happy to have some sections/FAQ about it.
> if we keep the details on the above level ... I would be happy to have some sections/FAQ about it.
Yeah, this is where I'm at: I don't think we should prescribe how to do "advanced" things or cover every possible corner case, but rather collect patterns that tutorial-infra hosts are likely to come across and highlight ways to deal with them. That's why I envision information like this living somewhere like an FAQ or how-to collection, rather than in the flow of the main documentation.