BUG: data access issue for the DL tutorial #254

Closed
bsipocz opened this issue May 19, 2025 · 4 comments · Fixed by #255
Labels
bug (Something isn't working), infrastructure (Issues relevant to infrastructure, rather than content)

Comments

@bsipocz
Member

bsipocz commented May 19, 2025

The CI cron run hit data access issues with the DL notebook. A series of restarts fixed most of the jobs, but not all. In any case, this should not be a problem when only a handful of runners are trying to grab the same data (the problem will be much worse with a full room of workshop attendees).

So I'm opening this issue as a reminder and, if this turns out not to be a one-off problem, as a prompt to do something about it.

@bsipocz bsipocz added the bug (Something isn't working) and infrastructure (Issues relevant to infrastructure, rather than content) labels May 19, 2025
@rossbar
Collaborator

rossbar commented May 19, 2025

Yeah, looks like 429 errors, i.e. server request limits. Right now the data is hosted on GitHub (in a personal repo of mine 😱 ), which was our "temporary" solution to the 503 errors we got when pinging the server on which the data was originally hosted.
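One client-side way to soften 429 (and 503) responses is to retry with exponential backoff. The sketch below is only an illustration using the Python standard library. `fetch_with_backoff` and its parameters are hypothetical names, not something taken from the tutorial, and retrying does not fix the underlying hosting problem.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, retries=4, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying on HTTP 429/503 with exponential backoff.

    `opener` and `sleep` are injectable so the logic can be exercised
    without network access (hypothetical helper, for illustration only).
    """
    for attempt in range(retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only rate-limit / unavailable responses are worth retrying;
            # give up once the retry budget is spent.
            if err.code not in (429, 503) or attempt == retries:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... seconds.
            sleep(base_delay * 2 ** attempt)
```

With a full room of attendees hitting the same host, backoff mostly spreads the load out over time, so a more sustainable data host is still the real fix.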

I would be very surprised if the MNIST data weren't already hosted somewhere publicly and more sustainably, so we should investigate + switch to that!

@melissawm
Member

Total overcomplication, but if we can't find another alternative, torchvision packages a version of MNIST: https://docs.pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html

@bsipocz
Member Author

bsipocz commented May 19, 2025

If I see correctly they are still just grabbing from these two addresses:


    mirrors = [
        "https://ossci-datasets.s3.amazonaws.com/mnist/",
        "http://yann.lecun.com/exdb/mnist/",
    ]

@bsipocz
Member Author

bsipocz commented May 19, 2025

(But at least the S3 one should be resilient to multiple downloads, so I would be +1 for swapping over the URIs. However, I would not like to add PyTorch as a dependency just to make use of this function.)
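To make the "swap the URIs without pulling in PyTorch" idea concrete, here is a minimal sketch that tries each mirror in order using only the standard library. The two base URLs are the ones quoted above. `download_from_mirrors` and its `opener` parameter are hypothetical names for illustration, and this is not necessarily how the fix in #255 was implemented.

```python
import urllib.request

# Mirror base URLs as quoted earlier in this thread (S3 first).
MIRRORS = [
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
    "http://yann.lecun.com/exdb/mnist/",
]


def download_from_mirrors(filename, mirrors=MIRRORS,
                          opener=urllib.request.urlopen):
    """Return the bytes of `filename` from the first mirror that responds.

    Hypothetical helper: `opener` is injectable so the fallback logic
    can be tested offline.
    """
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            with opener(url) as resp:
                return resp.read()
        except OSError as err:
            # urllib's URLError and HTTPError are both OSError subclasses.
            errors.append(f"{url}: {err}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A call such as `download_from_mirrors("train-images-idx3-ubyte.gz")` would then fetch the training images from S3 and only fall back to the second address if that request fails.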
