Best practices for testing changes to data sources #67

Open
rossbar opened this issue May 20, 2025 · 2 comments
Labels
type: Enhancement New feature or request

Comments

@rossbar
Member

rossbar commented May 20, 2025

There is no question that caching data is the right pattern in CI in order to improve efficiency and prevent data having to be re-acquired for every run.

However, another best-practice (I'd argue) is that the data access be done programmatically in the tutorial itself. The combination of these two patterns leads to cases where changes to the code that accesses/acquires data may not be tested due to data caching in CI.
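To make the failure mode concrete, here is a minimal sketch of how a cache hit can hide broken acquisition code. The names `load_dataset`, `working_download`, and `broken_download` are hypothetical and not taken from any tutorial; the cache is just a file in a temporary directory.

```python
import tempfile
from pathlib import Path


def load_dataset(cache_dir, download):
    """Return the dataset bytes, calling `download` only on a cache miss."""
    cached = Path(cache_dir) / "dataset.bin"
    if cached.exists():
        # Cache hit: the acquisition code below is never exercised.
        return cached.read_bytes()
    data = download()
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data


calls = []  # records which (hypothetical) downloader actually ran


def working_download():
    calls.append("working")
    return b"payload"


def broken_download():
    # Stands in for acquisition code that was changed and no longer works.
    calls.append("broken")
    raise RuntimeError("data source moved")


with tempfile.TemporaryDirectory() as tmp:
    first = load_dataset(tmp, working_download)  # populates the cache
    second = load_dataset(tmp, broken_download)  # cache hit, breakage unnoticed
```

The second call succeeds even though its downloader would raise, which is the situation a cached CI run is in after the data-access code changes.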

I think this case should be addressed in the "how-to"/"faq" section of this site. Off the top of my head, the pattern that makes the most sense is to have a scheduled CI job that doesn't incorporate data caching. This job should be triggered only by cron and have an option for manual triggering as well, for cases when reviewers identify that data access has changed [1].
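One possible way to wire this up, assuming the CI runs on GitHub Actions: the runner sets the `GITHUB_EVENT_NAME` environment variable, which is `schedule` for cron-triggered runs and `workflow_dispatch` for manual ones. The helper name `use_data_cache` and the idea of consulting it from the tutorial's data-loading code are illustrative assumptions, not something prescribed in this thread.

```python
import os

# Events for which the data should be re-acquired from scratch.
# Assumption: the uncached job is triggered only by cron or manually.
UNCACHED_EVENTS = {"schedule", "workflow_dispatch"}


def use_data_cache(environ=None):
    """Return False for scheduled/manual CI runs, True otherwise.

    `environ` defaults to the process environment; passing a plain dict
    makes the decision easy to check without a CI runner.
    """
    env = os.environ if environ is None else environ
    return env.get("GITHUB_EVENT_NAME", "") not in UNCACHED_EVENTS
```

With this sketch, pull-request and push runs (and local runs, where the variable is unset) keep using the cache, while the cron and manually dispatched jobs exercise the real acquisition code.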

xref numpy/numpy-tutorials#255

Footnotes

  1. This could of course be extended to be made automatic, e.g. with notebook metadata, but IMO that's too involved for a high-level recommendation, at least at this stage!

@bsipocz
Member

bsipocz commented May 20, 2025

This feels like a large can of worms (if we go into the details of what programmatic access means: can it be a data grabber from another GitHub repo or Google Drive? In practice, that's what failed in the referenced numpy-tutorials case), and I'm not sure it will have a generalised, one-size-fits-all solution.

OTOH, I cannot argue with any of the basic principles you listed above, so if we keep the details at that level, I would be happy to have some sections/FAQ about it.

bsipocz added the "type: Enhancement" (New feature or request) label May 20, 2025
@rossbar
Member Author

rossbar commented May 20, 2025

> This feels like a large can of worms

Agreed 🙃

> if we keep the details at the above level, I would be happy to have some sections/FAQ about it.

Yeah this is where I'm at - I don't think we should prescribe how to do "advanced" things/every possible corner case, but rather collect patterns that tutorial-infra-hosts are likely to come across and highlight ways to deal with them. That's why I envision information like this living somewhere like an FAQ or How-to collection, rather than in the flow of the main documentation.
