@ogrisel ogrisel commented Oct 10, 2022

Archives that previously lacked sha256 checks:

  • 20Newsgroups/20news-bydate.tar.gz (update: fetch_data.py script deleted since actually unused)
  • reuters21578-mld/reuters21578.tar.gz
  • movie-review-data/review_polarity.tar.gz

Not checking the digests of tarballs before extracting them is a security concern, because a tampered archive could overwrite sensitive system files such as /etc/hosts.
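As a rough illustration of the idea (not the exact code from this PR), the sketch below computes the sha256 digest of a downloaded archive and only extracts it when the digest matches a known value. The helper names `sha256_of_file` and `extract_if_digest_matches` are hypothetical, and the expected digest is assumed to come from a trusted source such as a value hard-coded in the script.

```python
import hashlib
import tarfile


def sha256_of_file(path, chunk_size=8192):
    """Return the hex sha256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_if_digest_matches(archive_path, expected_sha256, dest):
    """Extract `archive_path` into `dest` only if its digest matches.

    Hypothetical helper: `expected_sha256` is assumed to be a trusted,
    hard-coded hex digest of the archive.
    """
    actual = sha256_of_file(archive_path)
    if actual != expected_sha256:
        # Refuse to touch the archive contents when the digest differs.
        raise ValueError(
            f"sha256 mismatch for {archive_path}: "
            f"expected {expected_sha256}, got {actual}"
        )
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest)
```

The check runs before `tarfile` opens the archive at all, so a modified download is rejected without any of its members being written to disk.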

Note that this PR only fixes code in documentation (tutorial and examples) and not library code. Our dataset fetchers under the sklearn namespace are already checking the digests systematically. Still, fixing those code snippets in the documentation is good from an educational point of view.

The scikit-learn-1.1.2.tar.gz source tarball contains those code snippets because it includes the documentation and example files. But this is not the case for the wheel files and conda packages. Not sure if this warrants a security bugfix release or not.

…action

- 20Newsgroups/20news-bydate.tar.gz
- reuters21578-mld/reuters21578.tar.gz
- movie-review-data/review_polarity.tar.gz
@ogrisel ogrisel marked this pull request as ready for review October 10, 2022 09:47

@thomasjpfan thomasjpfan left a comment


Given that we are close to 1.2 and the attack surface is low, I say we do not need a bug fix release. The attack surface is low because the fix is in documentation code and not library code.

This is a big enough change to warrant a what's new entry.


@thomasjpfan thomasjpfan left a comment


I had an alternative solution that checks the paths directly:

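import os

def safe_tar_extractall(tar, path):
    path_abs = os.path.abspath(path)

    for member in tar.getmembers():
        member_abs = os.path.abspath(os.path.join(path, member.name))
        # commonpath compares whole path components; commonprefix compares
        # characters and would accept a sibling directory such as "<path>_evil".
        if os.path.commonpath([path_abs, member_abs]) != path_abs:
            raise IOError(f"Unsafe path in archive: {member.name}")
    tar.extractall(path=path)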

but I think checking the checksum in this PR is better.
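To make the path-checking idea concrete, here is a minimal, self-contained sketch that builds a small in-memory archive and lists the members whose names would resolve outside the destination directory. The helpers `is_within_directory`, `build_archive`, and `unsafe_members` are hypothetical names introduced for illustration. The comparison uses `os.path.commonpath`, which works on whole path components, and the sketch only inspects member names, so it does not cover symlink or hardlink members.

```python
import io
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a location inside `directory`."""
    directory = os.path.abspath(directory)
    target = os.path.abspath(target)
    # commonpath compares path components, not raw characters.
    return os.path.commonpath([directory, target]) == directory


def build_archive(member_names):
    """Build an in-memory tar archive with one small file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in member_names:
            data = b"payload"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def unsafe_members(archive, dest):
    """Return names of members that would land outside `dest`."""
    with tarfile.open(fileobj=archive, mode="r") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if not is_within_directory(dest, os.path.join(dest, member.name))
        ]


if __name__ == "__main__":
    archive = build_archive(["ok.txt", "../evil.txt", "sub/fine.txt"])
    print(unsafe_members(archive, "data"))
```

A check like this only validates where names point. It says nothing about whether the archive is the one the author intended, which is what the digest comparison addresses.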

I ran the two updated files locally and they work. LGTM

@adrinjalali adrinjalali changed the title Check sha256 digests of tarballs in tutorial and examples before extraction DOC Check sha256 digests of tarballs in tutorial and examples before extraction Oct 10, 2022
@adrinjalali adrinjalali merged commit e9cf0c9 into scikit-learn:main Oct 10, 2022
@ogrisel ogrisel deleted the check-digest-before-extracting-tarballs branch October 11, 2022 08:15
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022