FIX Make fetch_openml atomically cache the download #21833
I would rather download the file into a temporary folder using the tempfile.TemporaryDirectory context manager:
with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
    # call the private function that downloads the file before moving it to its final location
    # tmpdir is automatically deleted when exiting the context manager
EDIT: let's pass
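A minimal sketch of the pattern suggested above, not the code from this PR. The names `cache_atomically` and `write_payload` are hypothetical; `write_payload` stands in for the private download function and only has to write the complete content to the path it receives.

```python
import os
import shutil
import tempfile


def cache_atomically(data_home, file_name, write_payload):
    """Write a file into a temporary subfolder of data_home, then move it into place.

    write_payload is a hypothetical callable taking a destination path; it
    stands in for the private function that downloads the file.
    """
    os.makedirs(data_home, exist_ok=True)
    local_path = os.path.join(data_home, file_name)
    with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
        tmp_path = os.path.join(tmpdir, file_name)
        write_payload(tmp_path)
        # Move only after the payload has been written completely.
        shutil.move(tmp_path, local_path)
    # tmpdir has been deleted at this point.
    return local_path
```

Because the temporary folder is created with `dir=data_home`, an interrupted download leaves nothing behind at `local_path`; only the temporary folder is affected, and it is cleaned up by the context manager.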
This looks good to me. Just a quick suggestion below.
Could you please document this enhancement in the doc/whats_new/v1.1.rst file?
LGTM. Thanks!
Thank you for working on this, @siavrez!
doc/whats_new/v1.1.rst
:mod:`sklearn.datasets`
.......................

- |Fix| The fetch_* functions are now thread safe. Data is first downloaded
_open_openml_url is only used in fetch_openml, so I would change
- |Fix| The fetch_* functions are now thread safe. Data is first downloaded
to
- |Fix| :func:`datasets.fetch_openml` is now thread safe. Data is first downloaded
opener = gzip.GzipFile
with opener(os.path.join(tmpdir, file_name), "wb") as fdst:
    shutil.copyfileobj(fsrc, fdst)
shutil.move(fdst.name, local_path)
I think this is mostly safe as long as the temporary directory and the destination are on the same filesystem. According to the shutil.move docs, shutil.move is the same as os.rename on the same filesystem.
If there is actually a copy, I can see issues such as:
- If two threads try to copy to the same location at the same time, I think the file can end up corrupted.
- If one thread is in the process of copying over the file, another thread will see that the file exists and attempt to read it, which would fail since the file is not fully copied yet.
I do not know a great way forward. Using os.rename directly may break use cases where the temporary directory is not on the same filesystem as the cache. And copying itself may lead to errors as stated above.
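The same-filesystem condition discussed above can be probed with a small heuristic. This is only a sketch: `same_filesystem` is a hypothetical helper, not part of scikit-learn or the standard library, and comparing `st_dev` is an assumption that holds on typical POSIX setups rather than a guarantee on every platform.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Heuristic: two existing paths share a filesystem if their device IDs match."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# A temporary folder created inside the cache folder sits on the cache's
# filesystem, unless something is mounted in between.
with tempfile.TemporaryDirectory() as cache_dir:
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmpdir:
        print(same_filesystem(tmpdir, cache_dir))
```

When the check is false, shutil.move falls back to copying the file and then removing the source, which is the case where the two issues listed above can appear.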
My original idea was to use a naming scheme based on the process ID or thread ID and to create a symlink instead of renaming. If the same process wants to clean the cache, it removes the original file and the symlink. If another process tries to clean the cache, it creates a new file based on its own ID, and for reading it can use the original symlink. What do you think, @thomasjpfan?
I think the tempfile is guaranteed to be in the same filesystem and therefore the renaming to be atomic because we create the tempfolder as a subfolder.
So I think the current code is fine. Maybe @siavrez you can add an inline comment to explain this.
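To illustrate the argument, here is a hedged sketch in which several threads publish to the same cache entry. It is not the PR's code: `publish` and `run_concurrent_writers` are hypothetical names, and the sketch uses os.replace rather than shutil.move so that the rename is explicit. It assumes a POSIX-like filesystem, where a rename within one filesystem is atomic.

```python
import os
import tempfile
import threading


def publish(cache_dir, file_name, payload):
    """Write payload into a private temp subfolder, then rename it into place."""
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmpdir:
        tmp_path = os.path.join(tmpdir, file_name)
        with open(tmp_path, "wb") as f:
            f.write(payload)
        # tmpdir is a subfolder of cache_dir, so source and destination
        # are on the same filesystem and the rename does not copy data.
        os.replace(tmp_path, os.path.join(cache_dir, file_name))


def run_concurrent_writers(cache_dir, file_name, payloads):
    """Start one writer thread per payload and return the final file content."""
    threads = [
        threading.Thread(target=publish, args=(cache_dir, file_name, p))
        for p in payloads
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(os.path.join(cache_dir, file_name), "rb") as f:
        return f.read()
```

Whichever thread renames last wins, and the final file is always one complete payload, never a mix of several.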
I think the tempfile is guaranteed to be in the same filesystem and therefore the renaming to be atomic because we create the tempfolder as a subfolder.
Ah I see it now. Thank you for the explanation.
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
LGTM
Adding a tempfile name generator
change _retry_with_clean_cache function
Reference Issues/PRs
#21798
What does this implement/fix? Explain your changes.
Any other comments?