FIX Make fetch_openml atomically cache the download #21833

siavrez · 2021-11-30T10:00:45Z

Add a tempfile name generator.
Change the _retry_with_clean_cache function.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

ogrisel · 2021-11-30T10:56:08Z

I would rather download the file into a temporary folder using the tempfile module from the standard library:

with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
    # call the private function that downloads the file before moving it to its final location
    ...

# tmpdir is automatically deleted when exiting the context manager

EDIT: let's pass dir=data_home to make sure we are on the same filesystem so that the move operation is guaranteed to be atomic.
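
A minimal sketch of this suggestion, not the code merged in the PR. The `cache_atomically` helper and its `fetch_to` argument are hypothetical names: `fetch_to` stands in for the private download function and is assumed to write the complete payload to the path it receives.

```python
import os
import shutil
import tempfile


def cache_atomically(fetch_to, data_home, file_name):
    """Write a file into data_home without exposing a partial copy.

    fetch_to is a hypothetical callable that writes the full payload to
    the path it is given; it stands in for the private download helper.
    """
    os.makedirs(data_home, exist_ok=True)
    local_path = os.path.join(data_home, file_name)
    # dir=data_home keeps the temporary folder on the cache's filesystem,
    # so the final move can be a rename rather than a copy.
    with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
        tmp_path = os.path.join(tmpdir, file_name)
        fetch_to(tmp_path)
        shutil.move(tmp_path, local_path)
    # tmpdir has been deleted at this point; only the finished file remains.
    return local_path
```

A reader that checks for `local_path` either finds no file or finds the finished one, because the partially written data only ever lives inside the temporary folder.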

ogrisel

This looks good to me. Just a quick suggestion below.

Could you please document this enhancement in the doc/whats_new/v1.1.rst file?

sklearn/datasets/_openml.py

doc/whats_new/v1.1.rst

ogrisel

LGTM. Thanks!

thomasjpfan

Thank you for working on this @siavrez !

thomasjpfan · 2021-12-03T16:35:52Z

doc/whats_new/v1.1.rst

+:mod:`sklearn.datasets`
+...................
+
+- |Fix| The fetch_* functions are now thread safe. Data is first downloaded


_open_openml_url is only used in fetch_openml:

Suggested change

- |Fix| The fetch_* functions are now thread safe. Data is first downloaded

- |Fix| :func:`datasets.fetch_openml` is now thread safe. Data is first downloaded

thomasjpfan · 2021-12-03T16:53:21Z

sklearn/datasets/_openml.py

+                        opener = gzip.GzipFile
+                    with opener(os.path.join(tmpdir, file_name), "wb") as fdst:
+                        shutil.copyfileobj(fsrc, fdst)
+                shutil.move(fdst.name, local_path)


I think this is mostly safe as long as the temporary directory and the destination are in the same filesystem. According to shutil.move docs, shutil.move is the same as os.rename on the same filesystem.

If there is actually a copy, I can see issues such as:

If two threads try to copy to the same location at the same time, I think the file can end up corrupted.

If one thread is in the process of copying over the file, another thread will see that the file exists and attempt to read it, which would fail since the file is not fully copied yet.

I do not know a great way forward. Using os.rename directly may break use cases where the temporary directory is not on the same filesystem as the cache. And copying itself may lead to errors as stated above.
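
One way to check the same-filesystem condition, as a sketch only: compare the device ids reported by `os.stat`. The `on_same_filesystem` helper is a hypothetical name and is not part of the PR. Unusual mount setups can make this comparison misleading, so treat it as an illustration of the concern rather than a guarantee.

```python
import os
import tempfile


def on_same_filesystem(path_a, path_b):
    """Return True when both existing paths report the same device id."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Illustration: a temporary folder created inside a stand-in cache directory.
with tempfile.TemporaryDirectory() as data_home:
    with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
        same_fs = on_same_filesystem(data_home, tmpdir)
```

When the temporary folder is created with the default `dir=None`, it lives in the system temporary location, which may sit on a different device from the cache. That is the case where a move falls back to a copy.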

My original idea was to use a naming scheme based on the process ID or thread ID and to create a symlink instead of renaming. If the same process wants to clean the cache, it removes the original file and the symlink. If another process tries to clean the cache, it creates a new file based on its own ID, and for reading it can use the original symlink. What do you think, @thomasjpfan?

I think the tempfile is guaranteed to be in the same filesystem and therefore the renaming to be atomic because we create the tempfolder as a subfolder.

So I think the current code is fine. Maybe @siavrez you can add an inline comment to explain this.

I think the tempfile is guaranteed to be in the same filesystem and therefore the renaming to be atomic because we create the tempfolder as a subfolder.

Ah I see it now. Thank you for the explanation.
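
The conclusion of this thread can be illustrated with a small concurrency sketch. It is an assumption-laden example, not the PR's code: `populate_cache` and `run_concurrently` are hypothetical helpers, the payload is written locally instead of being downloaded, and `os.replace` is swapped in for `shutil.move` so that overwriting an existing destination is explicit.

```python
import os
import tempfile
import threading

PAYLOAD = b"x" * 1_000_000


def populate_cache(data_home, file_name):
    """Write PAYLOAD to a private temporary folder, then rename it into place."""
    local_path = os.path.join(data_home, file_name)
    # The temporary folder is a subfolder of data_home, so source and
    # destination of the rename are on the same filesystem.
    with tempfile.TemporaryDirectory(dir=data_home) as tmpdir:
        tmp_path = os.path.join(tmpdir, file_name)
        with open(tmp_path, "wb") as fdst:
            fdst.write(PAYLOAD)
        # Swapped-in call: os.replace overwrites an existing destination.
        os.replace(tmp_path, local_path)


def run_concurrently(data_home, file_name, n_threads=8):
    """Let several threads fill the same cache entry, then read it back."""
    threads = [
        threading.Thread(target=populate_cache, args=(data_home, file_name))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with open(os.path.join(data_home, file_name), "rb") as f:
        return f.read()
```

Each thread writes only inside its own temporary folder, so no thread ever writes into the destination file directly. On POSIX systems the rename is atomic, so the destination always holds one complete copy. Behavior on Windows can differ when another process has the destination open.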

Co-authored-by: Thomas J. Fan <[email protected]>

sklearn/datasets/_openml.py

Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan

LGTM

Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>

adding tempfile skip 4 test cases

e128f8f

github-actions bot added the module:datasets label Nov 30, 2021

change get_native_id with get_ident

33c76ff

siavrez added 3 commits November 30, 2021 17:14

using tempfile module

1ff7959

path corrected

ea50158

changed fileobj with filename

a84d920

ogrisel approved these changes Dec 2, 2021

sklearn/datasets/_openml.py (outdated)

siavrez added 2 commits December 2, 2021 16:07

removed exeption handing from os.makedirs, added what's new entry

ef25e95

changed pr no in what's new

5a44fb0

ogrisel reviewed Dec 2, 2021

doc/whats_new/v1.1.rst (outdated)

Typo [ci skip]

8364494

ogrisel approved these changes Dec 2, 2021

ogrisel changed the title ~~adding tempfile name genrator~~ Make fetch_openml atomicly cache the donwloaded file to ensure concurrence-safety Dec 2, 2021

ogrisel changed the title ~~Make fetch_openml atomicly cache the donwloaded file to ensure concurrence-safety~~ Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety Dec 2, 2021

thomasjpfan reviewed Dec 3, 2021

Update doc/whats_new/v1.1.rst

b27ca97

Co-authored-by: Thomas J. Fan <[email protected]>

ogrisel reviewed Dec 4, 2021

sklearn/datasets/_openml.py

Update sklearn/datasets/_openml.py [ci skip]

aaf5842

Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan approved these changes Dec 4, 2021

thomasjpfan changed the title ~~Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety~~ FIX Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety Dec 4, 2021

thomasjpfan changed the title ~~FIX Make fetch_openml atomically cache the donwloaded file to ensure concurrence-safety~~ FIX Make fetch_openml atomically cache the download Dec 4, 2021

thomasjpfan merged commit b1202af into scikit-learn:main Dec 4, 2021

ogrisel mentioned this pull request Dec 10, 2021

ENH Add a retry mechanism in fetch_openml #21901

Merged

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021

FIX Make fetch_openml atomically cache the download (scikit-learn#21833)

a836bca

Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>

glemaitre pushed a commit that referenced this pull request Dec 25, 2021

FIX Make fetch_openml atomically cache the download (#21833)

93ec1d7

Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan mentioned this pull request Jan 26, 2022

fetch_openml can raise "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process" #21798

Closed

This was referenced Apr 5, 2022

Covtype dataset raises error when fetching #23048

Closed

ENH improve ARFF parser using pandas #21938

Merged

iasoon mentioned this pull request Apr 11, 2022

FIX: fetch covtype dataset concurrent-safe #23113

Merged
