There is no question that caching data in CI is the right pattern: it improves efficiency and avoids re-acquiring the data on every run.
However, another best practice (I'd argue) is that data access be done programmatically in the tutorial itself. The combination of these two patterns leads to cases where changes to the code that accesses/acquires data go untested, because CI serves the data from cache instead.
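To make the failure mode concrete, here is a minimal sketch of the "programmatic access plus cache" pattern; `fetch_dataset` and the URL are hypothetical, not from any tutorial in question. Once the cached file exists, the download branch never executes in CI, so a bug introduced there (a moved URL, a changed API) goes unnoticed until the cache is cleared.

```python
import os
import urllib.request


def fetch_dataset(url, cache_path):
    """Download ``url`` to ``cache_path`` unless a cached copy exists.

    Hypothetical helper illustrating the pattern under discussion:
    on a cache hit, the acquisition code below is never exercised.
    """
    if os.path.exists(cache_path):
        return cache_path  # cache hit: download code is skipped entirely
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, cache_path)  # only runs on a cold cache
    return cache_path
```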
I think this case should be addressed in the "how-to"/"FAQ" section of this site. Off the top of my head, the pattern that makes the most sense is a scheduled CI job that does not use data caching. This job should be triggered only by cron, with an option for manual triggering as well, for cases when reviewers identify that data access has changed[^1].

xref numpy/numpy-tutorials#255

[^1]: This could of course be extended to be made automatic, e.g. with notebook metadata, but IMO that's too involved for a high-level recommendation, at least at this stage!
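For GitHub Actions specifically, the proposed pattern might be sketched as below. The workflow name, schedule, and test commands are placeholders, not taken from any existing repo; the key points are the `schedule` and `workflow_dispatch` triggers and the deliberate absence of a cache step.

```yaml
# Hypothetical workflow: re-run the tutorial suite with no data cache,
# so the data-acquisition code is exercised end to end.
name: test-without-cache
on:
  schedule:
    - cron: "0 3 * * 1"   # weekly; frequency is a placeholder
  workflow_dispatch: {}    # manual trigger for reviewers
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      # Note: no actions/cache step here, unlike the regular CI job.
      - run: pip install -r requirements.txt
      - run: pytest --nbval notebooks/   # placeholder test command
```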
This feels like a large can of worms if we go into the details of what programmatic access means (can it be a data grabber from another GitHub repo, or from Google Drive? in practice that's what failed in the referenced numpy-tutorials case), and I'm not sure there is a generalised, one-size-fits-all solution.
OTOH, I can't argue with any of the basic principles you listed above, so if we keep the details at the above level, I would be happy to have some sections/FAQ about it.
> if we keep the details on the above level ... I would be happy to have some sections/FAQ about it.
Yeah, this is where I'm at: I don't think we should prescribe how to do "advanced" things or cover every possible corner case, but rather collect patterns that tutorial-infra hosts are likely to come across and highlight ways to deal with them. That's why I envision information like this living somewhere like an FAQ or how-to collection, rather than in the flow of the main documentation.