[MRG+2] Fix PermissionError in datasets fetchers on Windows #9847
@lesteve what I don't see is why I'll find myself a Windows machine to ensure that everything works, but I don't see the issue. |
No suggestion off the top of my head, I'll try to take a look shortly. Welcome to the beautiful world of file locks on Windows in any case. My personal setup for testing on Windows is VirtualBox, and there was an official Windows 10 ISO that you could download for free at some point. Not sure whether that's still the case, google it. |
Just tested this solution on Windows for
|
With this patch it seems to work:
diff --git a/sklearn/datasets/california_housing.py b/sklearn/datasets/california_housing.py
index 581e962..727a9cb 100644
--- a/sklearn/datasets/california_housing.py
+++ b/sklearn/datasets/california_housing.py
@@ -100,9 +100,10 @@ def fetch_california_housing(data_home=None, download_if_missing=True):
         archive_path = _fetch_remote(ARCHIVE, dirname=data_home)
-        with tarfile.open(mode="r:gz", name=archive_path).extractfile(
-                'CaliforniaHousing/cal_housing.data') as f:
-            cal_housing = np.loadtxt(f, delimiter=',')
+        with tarfile.open(mode="r:gz", name=archive_path) as f:
+            cal_housing = np.loadtxt(
+                f.extractfile('CaliforniaHousing/cal_housing.data'),
+                delimiter=',')
             # Columns are not in the same order compared to the previous
             # URL resource on lib.stat.cmu.edu
             columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0] |
Basically the context manager was not closing the right file ... |
OK I think everything should be fixed now. |
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process. This was happening when trying to remove the downloaded archive because the archive was not properly closed.
I ran the script I used for the figshare work on Windows to make sure I could download all the datasets from scratch and everything went fine. |
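For readers following along, here is a minimal sketch of why the chained form leaves the archive open. It is not the scikit-learn code: the helper names (`make_archive`, `read_chained`, `read_fixed`) and the member name `data/values.csv` are made up for illustration, and the archive is a tiny one built locally. In the chained form the `with` statement manages the object returned by `extractfile`, so only the extracted member gets closed; the `TarFile` itself stays open.

```python
import io
import os
import tarfile
import tempfile

MEMBER = "data/values.csv"  # hypothetical member name, for illustration only


def make_archive(path):
    """Write a small .tar.gz with a single CSV member."""
    payload = b"1,2,3\n4,5,6\n"
    info = tarfile.TarInfo(name=MEMBER)
    info.size = len(payload)
    with tarfile.open(path, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def read_chained(path):
    """Equivalent to `with tarfile.open(...).extractfile(...) as f:`.

    The TarFile is bound to a name here only so that its state can be
    inspected afterwards; the `with` statement still manages the member.
    """
    tar = tarfile.open(path, mode="r:gz")
    with tar.extractfile(MEMBER) as member:
        data = member.read()
    archive_closed = tar.closed  # still False: only the member was closed
    tar.close()  # explicit cleanup so the file can be removed afterwards
    return data, archive_closed


def read_fixed(path):
    """The `with` statement manages the TarFile itself, as in the patch."""
    with tarfile.open(path, mode="r:gz") as tar:
        data = tar.extractfile(MEMBER).read()
    return data, tar.closed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "demo.tar.gz")
        make_archive(archive)
        print("chained form, archive closed:", read_chained(archive)[1])
        print("fixed form, archive closed:", read_fixed(archive)[1])
        os.remove(archive)  # succeeds because both readers closed the archive
```

On Linux and macOS an open handle does not prevent deleting the file, which is presumably why the problem only showed up on Windows.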
There's something I don't understand though. My first solution was:
with tarfile.open(...) as f:
    fileobj = f.extractfile(...)
    cal_housing = np.loadtxt(fileobj, ...)
And this was not working for me. So why would adding it in the call work? |
Weird, oh well, it's this kind of stuff you want to not ever touch again and forget about when everything starts working ;-). And just a side-comment, I seem to remember (but maybe it's my imagination) that you added this |
This is exactly what I was thinking, that we should find a way to test all this, either with an overnight process or with mock objects or something. I'll review all this when we retake the partial fetcher. |
nilearn has some testing of the fetchers. From what I remember it was not pretty (monkey-patching of some urllib functions maybe?) but it was doing its job. |
Having said that this problem may not have been spotted with mocks. Basically you need a real file to realise that on Windows (tough luck) the file is locked. All I am saying is that mocks may not cover all the edge cases. It is better than no tests as we currently have of course. |
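One possible middle ground, sketched below under stated assumptions: patch only the download step, but let the fake write a real archive to disk, so that the read-then-delete path runs against an actual file. Everything here is hypothetical (`fetch_values`, `fake_urlretrieve`, the URL and the member name); it is not how scikit-learn or nilearn test their fetchers. Run on Windows, a test like this should fail with the PermissionError if the fetcher left the archive open.

```python
import io
import os
import tarfile
import tempfile
import urllib.request
from unittest import mock

MEMBER = "data/values.csv"  # hypothetical member name


def fetch_values(url, data_home):
    """Hypothetical fetcher: download an archive, parse it, delete it."""
    archive_path = os.path.join(data_home, "archive.tgz")
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        text = tar.extractfile(MEMBER).read().decode("ascii")
    # On Windows this raises PermissionError if the archive is still open.
    os.remove(archive_path)
    return [[float(v) for v in line.split(",")] for line in text.splitlines()]


def fake_urlretrieve(url, filename):
    """Stand-in for the download: writes a real .tar.gz instead of fetching."""
    payload = b"1,2,3\n4,5,6\n"
    info = tarfile.TarInfo(name=MEMBER)
    info.size = len(payload)
    with tarfile.open(filename, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return filename, None


def test_fetch_values_removes_archive():
    with tempfile.TemporaryDirectory() as data_home:
        with mock.patch("urllib.request.urlretrieve",
                        side_effect=fake_urlretrieve):
            values = fetch_values("https://example.invalid/archive.tgz",
                                  data_home)
        assert values == [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
        assert os.listdir(data_home) == []  # the archive was cleaned up
```

This still would not catch the bug on a Linux-only CI, since the lock is specific to Windows.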
LGTM |
Reference Issue
Fixes #9820,
What does this implement/fix? Explain your changes.
Any other comments?