
FIX Make fetch_openml atomically cache the download #21833


Merged
merged 10 commits into scikit-learn:main on Dec 4, 2021

Conversation

siavrez
Contributor

@siavrez siavrez commented Nov 30, 2021

Add a tempfile name generator.
Change the _retry_with_clean_cache function.

Reference Issues/PRs

#21798


@ogrisel
Member

ogrisel commented Nov 30, 2021

I would rather download the file into a temporary folder using the tempfile module from the standard library:

with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
    # call the private function that downloads the file before moving it to its final location

# tmpdir is automatically deleted when exiting the context manager

EDIT: let's pass dir=data_home to make sure we are on the same filesystem so that the move operation is guaranteed to be atomic.
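
A minimal sketch of that pattern, assuming a hypothetical download callable (the helper name and signature below are illustrative, not the actual scikit-learn internals):

import os
import shutil
import tempfile

def _cache_atomically(download, data_home, file_name):
    # `download` is a hypothetical callable that writes the remote file to
    # the path it is given.
    final_path = os.path.join(data_home, file_name)
    # dir=data_home keeps the temporary directory on the same filesystem as
    # the cache, so the final move is an atomic rename rather than a copy.
    with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
        tmp_path = os.path.join(tmpdir, file_name)
        download(tmp_path)
        shutil.move(tmp_path, final_path)
    # tmpdir, and any partial file left behind by a failed download, is
    # deleted when the context manager exits.
    return final_path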

Member

@ogrisel ogrisel left a comment


This looks good to me. Just a quick suggestion below.

Could you please document this enhancement in the doc/whats_new/v1.1.rst file?

Member

@ogrisel ogrisel left a comment


LGTM. Thanks!

@ogrisel ogrisel changed the title adding tempfile name genrator Make fetch_openml atomicly cache the donwloaded file to ensure concurrence-safety Dec 2, 2021
@ogrisel ogrisel changed the title Make fetch_openml atomicly cache the donwloaded file to ensure concurrence-safety Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety Dec 2, 2021
Member

@thomasjpfan thomasjpfan left a comment


Thank you for working on this @siavrez!

:mod:`sklearn.datasets`
.......................

- |Fix| The fetch_* functions are now thread safe. Data is first downloaded
Member


_open_openml_url is only used in fetch_openml:

Suggested change
- |Fix| The fetch_* functions are now thread safe. Data is first downloaded
- |Fix| :func:`datasets.fetch_openml` is now thread safe. Data is first downloaded

opener = gzip.GzipFile
with opener(os.path.join(tmpdir, file_name), "wb") as fdst:
    shutil.copyfileobj(fsrc, fdst)
shutil.move(fdst.name, local_path)
Member


I think this is mostly safe as long as the temporary directory and the destination are in the same filesystem. According to shutil.move docs, shutil.move is the same as os.rename on the same filesystem.

If there is actually a copy, I can see issues such as:

  • If two threads try to copy to the same location at the same time, I think the file can end up corrupted.
  • If one thread is in the process of copying over the file, another thread will see that file exist and attempt to read it, which would fail since the file is not fully copied yet.

I do not know a great way forward. Using os.rename directly may break use cases where the temporary directory is not on the same filesystem as the cache. And copying itself may lead to errors as stated above.
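
For illustration only, a sketch of the rename-based publish step this discussion converges on (a hypothetical helper; os.replace, like os.rename, is atomic only when source and destination are on the same filesystem, and unlike shutil.move it raises OSError instead of silently falling back to a copy):

import os

def _publish(tmp_path, final_path):
    # Atomic rename on the same filesystem: concurrent readers see either no
    # cached file or the complete file, never a partially written one.
    # Raises OSError (EXDEV) if the two paths are on different filesystems.
    os.replace(tmp_path, final_path)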

Contributor Author

@siavrez siavrez Dec 3, 2021


My original idea was to use a naming scheme based on the process id or thread id and to create a soft symlink instead of renaming. If the same process wants to clean the cache, it removes the original file and the symlink; if another process tries to clean the cache, it creates a new file based on its own id, and readers can keep using the original symlink. What do you think?
@thomasjpfan

Member


I think the tempfile is guaranteed to be on the same filesystem, and the rename therefore guaranteed to be atomic, because we create the temp folder as a subfolder of data_home.

So I think the current code is fine. Maybe @siavrez you can add an inline comment to explain this.
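
A possible wording for such an inline comment (illustrative, not the committed text):

# The temporary directory is created inside data_home, i.e. on the same
# filesystem as the final cache location, so shutil.move below reduces to
# an atomic os.rename and readers never observe a partially written file.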

Member


I think the tempfile is guaranteed to be on the same filesystem, and the rename therefore guaranteed to be atomic, because we create the temp folder as a subfolder of data_home.

Ah I see it now. Thank you for the explanation.

Co-authored-by: Thomas J. Fan <[email protected]>
Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan changed the title Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety FIX Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety Dec 4, 2021
@thomasjpfan thomasjpfan changed the title FIX Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety FIX Make fetch_openml atomically cache the download Dec 4, 2021
@thomasjpfan thomasjpfan merged commit b1202af into scikit-learn:main Dec 4, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021