DOC Check sha256 digests of tarballs in tutorial and examples before extraction #24617
…action - 20Newsgroups/20news-bydate.tar.gz - reuters21578-mld/reuters21578.tar.gz - movie-review-data/review_polarity.tar.gz
Given we are close to 1.2 and the attack surface of this is low, I say we do not need a bug fix release. I think the attack surface is low, because it is a bug fix in documentation code and not library code.
This is a big enough change to have a what's new entry.
doc/tutorial/text_analytics/data/twenty_newsgroups/fetch_data.py
I had an alternative solution that checks the paths directly:
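    import os

    def safe_tar_extractall(tar, path):
        # Refuse to extract if any member would land outside `path`.
        path_abs = os.path.abspath(path)
        for member in tar.getmembers():
            member_abs = os.path.abspath(os.path.join(path, member.name))
            # commonpath compares whole path components; commonprefix compares
            # characters and would accept a sibling directory such as "data_evil".
            if os.path.commonpath([path_abs, member_abs]) != path_abs:
                raise IOError(f"Unsafe path in archive: {member.name}")
        tar.extractall(path=path)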
but I think checking the checksum in this PR is better.
I ran the two updated files locally and it works. LGTM
…extraction (scikit-learn#24617) Co-authored-by: Thomas J. Fan <[email protected]>
Archives with previously lacking sha256 checks:
- 20Newsgroups/20news-bydate.tar.gz (update: fetch_data.py script deleted since actually unused)
- reuters21578-mld/reuters21578.tar.gz
- movie-review-data/review_polarity.tar.gz

Not checking the digests of tarballs before extracting them is a security concern, because a tampered archive could overwrite sensitive system files such as /etc/hosts.

Note that this PR only fixes code in documentation (tutorial and examples) and not library code. Our dataset fetchers under the sklearn namespace already check the digests systematically. Still, fixing those code snippets in the documentation is good from an education point of view.

The scikit-learn-1.1.2.tar.gz source tarball contains those code snippets because it includes the documentation and example files. But this is not the case for the wheel files and conda packages. Not sure if this warrants a security bugfix release or not.
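For readers who want to see the general idea, here is a minimal sketch of checking a sha256 digest before extraction. It is not the code from this PR: the helper names `sha256_of_file` and `extract_if_digest_matches` are hypothetical, and the expected digest is assumed to come from a trusted source (for example, hard-coded next to the download URL).

```python
import hashlib
import tarfile


def sha256_of_file(path, chunk_size=8192):
    """Return the hex sha256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_if_digest_matches(archive_path, expected_sha256, target_dir):
    """Extract `archive_path` into `target_dir` only if its digest matches.

    Hypothetical helper: the digest is compared before the archive is
    opened, so a mismatching file is never extracted.
    """
    actual = sha256_of_file(archive_path)
    if actual != expected_sha256:
        raise ValueError(
            f"{archive_path}: sha256 mismatch, "
            f"expected {expected_sha256}, got {actual}"
        )
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=target_dir)
```

A matching digest only shows that the downloaded file is the archive that was originally vetted, so the approach relies on that original archive being benign.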