[MRG+2] Fix PermissionError in datasets fetchers on Windows #9847

Merged
merged 1 commit into from
Oct 3, 2017

Conversation

massich
Contributor

@massich massich commented Sep 28, 2017

Reference Issue

Fixes #9820.

What does this implement/fix? Explain your changes.

Any other comments?

@massich
Contributor Author

massich commented Sep 28, 2017

@lesteve what I don't see is why fetch_rcv1 and fetch_species_distributions are affected. Both fetchers do remove temporary files, but neither of them uses a reference to the content after deleting the temporary files, as was the case for California housing.

I'll find myself a Windows machine to ensure that everything works, but I don't see the issue.

@lesteve
Member

lesteve commented Sep 28, 2017

No suggestion off the top of my head, I'll try to take a look shortly. Welcome to the beautiful world of file locks in Windows in any case.

My personal setup for testing on Windows is VirtualBox, and there was an official Windows 10 ISO that you could download for free at some point. Not sure whether that's still the case, google it.

@lesteve
Member

lesteve commented Sep 28, 2017

Just tested this solution on Windows for fetch_california_housing and nope this does not work, i.e. I still get:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\lesteve\\scikit_learn_data\\cal_housing.tgz'

@lesteve
Member

lesteve commented Sep 28, 2017

With this patch it seems to work:

diff --git a/sklearn/datasets/california_housing.py b/sklearn/datasets/california_housing.py
index 581e962..727a9cb 100644
--- a/sklearn/datasets/california_housing.py
+++ b/sklearn/datasets/california_housing.py
@@ -100,9 +100,10 @@ def fetch_california_housing(data_home=None, download_if_missing=True):

         archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

-        with tarfile.open(mode="r:gz", name=archive_path).extractfile(
-                'CaliforniaHousing/cal_housing.data') as f:
-            cal_housing = np.loadtxt(f, delimiter=',')
+        with tarfile.open(mode="r:gz", name=archive_path) as f:
+            cal_housing = np.loadtxt(
+                f.extractfile('CaliforniaHousing/cal_housing.data'),
+                delimiter=',')
             # Columns are not in the same order compared to the previous
             # URL resource on lib.stat.cmu.edu
             columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]

@lesteve
Member

lesteve commented Sep 28, 2017

Basically the context manager was not closing the right file ...
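
To illustrate the point, here is a minimal sketch (not the scikit-learn code itself) that contrasts the two patterns from the diff on a throwaway archive. The archive name, member name, and helper `make_archive` are hypothetical; `TarFile.closed` is used only to inspect state. In the old pattern the `with` statement manages the object returned by `extractfile`, so the archive handle stays open; in the new pattern the `with` statement manages the archive itself.

```python
import io
import os
import tarfile
import tempfile


def make_archive(path):
    """Write a tiny gzip-compressed tar archive (hypothetical test data)."""
    payload = b"1,2,3\n4,5,6\n"
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(payload)
    with tarfile.open(path, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


tmpdir = tempfile.mkdtemp()
archive_path = os.path.join(tmpdir, "example.tgz")
make_archive(archive_path)

# Old pattern: the context manager wraps the extracted member, so leaving
# the block closes the member but not the archive it came from.
# (A reference to the archive is kept here only so its state can be checked.)
old_archive = tarfile.open(archive_path, mode="r:gz")
with old_archive.extractfile("data.csv") as member:
    old_content = member.read()
archive_closed_after_old = old_archive.closed
old_archive.close()  # explicit cleanup so the file can be removed below

# New pattern: the context manager wraps the archive, so leaving the block
# closes the archive handle.
with tarfile.open(archive_path, mode="r:gz") as new_archive:
    new_content = new_archive.extractfile("data.csv").read()
archive_closed_after_new = new_archive.closed

# With every handle closed, removing the archive should not raise.
os.remove(archive_path)
os.rmdir(tmpdir)
```

On Linux an open file can still be unlinked, which is presumably why the old pattern only failed on Windows.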

@lesteve
Member

lesteve commented Sep 29, 2017

OK I think everything should be fixed now.

@lesteve lesteve changed the title [WIP] Add context manager to california housing fetcher [MRG] FIX PermissionError in datasets fetchers on Windows Sep 29, 2017
PermissionError: [WinError 32] The process cannot access the file
because it is being used by another process. This was happening when
trying to remove the downloaded archive because the archive was not
properly closed.
@lesteve lesteve added this to the 0.19.1 milestone Sep 29, 2017
@lesteve lesteve changed the title [MRG] FIX PermissionError in datasets fetchers on Windows [MRG] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017
@lesteve
Member

lesteve commented Sep 29, 2017

I ran the script I used for the figshare work on Windows to make sure I could download all the datasets from scratch and everything went fine.

@massich
Contributor Author

massich commented Sep 29, 2017

There's something I don't understand though. My first solution was:

with tarfile.open(...) as f:
    fileobj = f.extractfile(...)
    cal_housing = np.loadtxt(fileobj, ...)

And this was not working for me. So why would adding it in the call work?
Anyhow. I just checked your commit and everything looks good.

@massich massich changed the title [MRG] Fix PermissionError in datasets fetchers on Windows [MRG+1] Fix PermissionError in datasets fetchers on Windows Sep 29, 2017
@lesteve
Member

lesteve commented Sep 29, 2017

> And this was not working for me. So why would adding it in the call work?
> Anyhow. I just checked your commit and everything looks good.

Weird, oh well, it's this kind of stuff you want to not ever touch again and forget about when everything starts working ;-).

And just a side comment: I seem to remember (but maybe it's my imagination) that you added this remove for consistency between fetchers (in some fetchers the archives were removed, in others they were not). Personally I took it as a cautionary tale that tells us that it is very easy to break something even if the change looks really innocuous. I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

@massich
Contributor Author

massich commented Sep 29, 2017

> I guess this is even worse in this case because sklearn.datasets coverage is close to non-existent (on Linux too, i.e. not Windows-specific).

This is exactly what I was thinking: we should find a way to test all this, either with an overnight process or with mock objects or something. I'll review all this when we get back to the partial fetcher.

@lesteve
Member

lesteve commented Sep 29, 2017

> This is exactly what I was thinking: we should find a way to test all this, either with an overnight process or with mock objects or something. I'll review all this when we get back to the partial fetcher.

nilearn has some testing of the fetchers. From what I remember it was not pretty (monkey-patching of some urllib functions maybe?) but it was doing its job.

@lesteve
Member

lesteve commented Sep 29, 2017

Having said that, this problem may not have been spotted with mocks. Basically you need a real file to realise that on Windows (tough luck) the file is locked. All I am saying is that mocks may not cover all the edge cases. It is better than no tests, which is what we currently have, of course.
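
One possible middle ground, sketched under assumptions: patch only the download step, but have the stub write a real archive to disk, so the open/read/remove sequence still runs against an actual file. Everything named here (`toy_fetcher`, `fake_urlretrieve`, the URL, the member name) is hypothetical and is not taken from scikit-learn's or nilearn's test suites. Such a test would still need to run on Windows to catch a file lock, since Linux allows removing a file that is still open.

```python
import io
import os
import shutil
import tarfile
import tempfile
from unittest import mock
from urllib import request


def toy_fetcher(data_home):
    """Hypothetical fetcher: download an archive, read a member, remove it."""
    archive_path = os.path.join(data_home, "toy.tgz")
    # Placeholder URL; the call is patched below and never hits the network.
    request.urlretrieve("https://example.invalid/toy.tgz", archive_path)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        content = archive.extractfile("toy.csv").read()
    # On Windows this raises PermissionError if the archive is still open.
    os.remove(archive_path)
    return content


def fake_urlretrieve(url, filename):
    """Stand-in for urlretrieve that writes a real archive to `filename`."""
    payload = b"1,2\n3,4\n"
    info = tarfile.TarInfo(name="toy.csv")
    info.size = len(payload)
    with tarfile.open(filename, mode="w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    return filename, None


data_home = tempfile.mkdtemp()
try:
    with mock.patch("urllib.request.urlretrieve",
                    side_effect=fake_urlretrieve) as fake:
        result = toy_fetcher(data_home)
    # The fetcher is expected to clean up after itself.
    leftover_files = os.listdir(data_home)
finally:
    shutil.rmtree(data_home)
```

Because the stub produces a real file, the removal step is exercised rather than mocked away, which is the part a pure mock would miss.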

@jnothman
Member

jnothman commented Oct 3, 2017

LGTM

@jnothman jnothman changed the title [MRG+1] Fix PermissionError in datasets fetchers on Windows [MRG+2] Fix PermissionError in datasets fetchers on Windows Oct 3, 2017
@jnothman jnothman merged commit 534f68b into scikit-learn:master Oct 3, 2017
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
@massich massich deleted the 9820 branch June 6, 2018 13:06
Development

Successfully merging this pull request may close these issues.

[Windows] PermissionError in datasets fetchers when trying to remove the downloaded archive
3 participants