
Maintain a list of reverse dependencies of sklearn #6

Closed
mirekphd opened this issue Dec 31, 2022 · 7 comments

@mirekphd

We have just had our first container build broken by this error. The containers' package lists are very large (hundreds of packages, mostly secondary and tertiary dependencies), with many data scientists contributing their desired packages to the installation list. Yet pip is very uninformative as to the source of the problem, failing to show which package has the deprecated sklearn in its requirements.

Can you perhaps start a package blacklist of primary packages that still require sklearn and let GitHub users maintain it?

@lesteve
Member

lesteve commented Jan 9, 2023

> Yet pip is very uninformative as to the source of the problem, failing to show which package has the deprecated sklearn in its requirements.

There may be a way to use pip options to get more information about where dependencies come from during the install, but I did not find anything convincing in a quick five-minute search.

If you have a working environment with sklearn installed, you may be able to use pipdeptree to figure out which package requires sklearn, something like:

pipdeptree -r -p sklearn
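
If pipdeptree is not available, a similar reverse lookup can be sketched with the standard library's `importlib.metadata`. This is only an illustration, not what pipdeptree does internally: the helper names (`normalize`, `requirement_name`, `reverse_dependencies`, `installed_packages`) are made up for this example, and the name extraction is a simple regex rather than a full requirement parser.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a project name the way PEP 503 does (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else None


def reverse_dependencies(target, installed):
    """Return sorted names of packages whose requirements mention `target`.

    `installed` is an iterable of (package_name, requirement_strings) pairs;
    requirement_strings may be None for packages without dependencies.
    """
    wanted = normalize(target)
    return sorted(
        name
        for name, requirements in installed
        if any(requirement_name(req) == wanted for req in requirements or [])
    )


def installed_packages():
    """Yield (name, requirements) for each distribution in this environment."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            yield name, dist.requires


if __name__ == "__main__":
    print(reverse_dependencies("sklearn", installed_packages()))
```

Like the pipdeptree command, this only sees packages that are already installed, so it needs an environment where the install succeeded.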

> Can you perhaps start a package blacklist of primary packages that still require sklearn and let GitHub users maintain it?

This does not seem like a very workable approach: there are likely many thousands of packages depending on sklearn. For example, GitHub reports more than 3,000, although this is probably an order-of-magnitude estimate rather than an exact number: https://github.com/scikit-learn/scikit-learn/network/dependents?package_id=UGFja2FnZS01MjU4ODg0Mw%3D%3D

My hope so far is that people can identify which packages still depend on sklearn and open an issue in the relevant repository, so that the situation gradually improves.

@hugovk

hugovk commented Jan 9, 2023

Here's a list of 1,605 PyPI packages that depend on sklearn:

Taken from the database dump from https://github.com/sethmlarson/pypi-data via:

sqlite3 pypi.db "SELECT package_name FROM deps WHERE dep_name LIKE 'sklearn' GROUP BY package_name;" > deps.txt
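
The same lookup can be run from Python with the standard `sqlite3` module. This is a minimal sketch that assumes only what the command shows: a `deps` table with `package_name` and `dep_name` columns in a local `pypi.db`. The function names are hypothetical.

```python
import sqlite3
from contextlib import closing


def packages_depending_on(conn, dep_name):
    """Return distinct package names that list `dep_name` as a dependency."""
    rows = conn.execute(
        "SELECT DISTINCT package_name FROM deps "
        "WHERE dep_name LIKE ? ORDER BY package_name",
        (dep_name,),  # bound parameter, so no quoting issues in the SQL text
    )
    return [name for (name,) in rows]


def main(path="pypi.db"):
    """Print every package in the dump at `path` that depends on sklearn."""
    with closing(sqlite3.connect(path)) as conn:
        for name in packages_depending_on(conn, "sklearn"):
            print(name)
```

Calling `main()` in the directory that holds the dump prints one package name per line. `LIKE` is kept from the original query; in SQLite it matches ASCII letters case-insensitively by default.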

@lesteve
Member

lesteve commented Jan 9, 2023

Nice, thanks a lot! I guess it could also be useful to have it ordered by number of downloads (descending), which seems doable if I read the project README correctly. This would make it possible to open issues/PRs in the most-downloaded repos first.

Note that there are likely some caveats with this kind of analysis:

  • I guess dependencies computed by Python logic in setup.py may not always be detected.
  • If a package pins sklearn==0.0, you will never get an error. It would still be nice to tell the project that using scikit-learn is recommended.
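
To illustrate the second caveat, here is a rough sketch that separates requirement strings pinning `sklearn==0.0` from other references to sklearn. It assumes the requirement strings have already been extracted somehow; the `classify` helper and its three labels are invented for this example, and the regex is a simplification rather than a complete requirement parser.

```python
import re

# Project name, optional [extras], then the version specifier up to any ';'.
REQUIREMENT = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[[^\]]*\])?\s*(?P<spec>[^;]*)"
)


def classify(requirement):
    """Classify one requirement string relative to the sklearn package.

    Returns "unrelated", "pinned-0.0" or "unpinned" (labels are illustrative).
    """
    match = REQUIREMENT.match(requirement)
    if not match or match.group("name").lower() != "sklearn":
        return "unrelated"
    # Drop optional parentheses and spaces: "(== 0.0)" becomes "==0.0".
    spec = match.group("spec").strip().strip("()").replace(" ", "")
    return "pinned-0.0" if spec == "==0.0" else "unpinned"
```

Packages in the "pinned-0.0" group are the ones that would stay silent, so they are worth contacting even though their users never see an install error.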

@hugovk

hugovk commented Jan 9, 2023

sqlite3 pypi.db "SELECT DISTINCT downloads, package_name FROM deps INNER JOIN packages ON deps.package_name = packages.name WHERE dep_name LIKE 'sklearn' ORDER BY downloads DESC;" > deps-by-downloads.txt

deps-by-downloads.txt

Here are the top 50 (downloads|package); the counts tail off quickly:

13133|statsforecast
4675|mmdet
3173|anndata
2854|scrubadub
1686|fn-graph
1448|miceforest
1296|astro-ghost
1075|fastestimator-nightly
684|tabpy
668|atlantis
570|psmpy
544|fairdynamicrec
542|sciann
532|spatialcluster
494|junky
481|gps-building-blocks
458|python-video-annotator
377|mlrose
369|tfidf-matcher
340|biosaur2
334|mlbench-core
318|hicstuff
313|textpack
302|accuinsight
288|autoads-test
287|pykeen
275|deep-training
233|autooptimizer
227|lepmlutils
226|paddleseg
225|pydelling
222|chronometry
222|iacs-ipac-reader
220|palmari
219|napari-filament-annotator
212|sherlockpipe
209|arbok
209|fast-scores
205|pysurvival
201|segsrgan
200|hivecode
198|pydatamodel
189|ai-graphics
189|lytools
188|nolds
187|augraphy
186|acmecontentcollectors-pkg-rioatmadja2018
184|auquan-toolbox
184|nerda
181|catsim

@lesteve
Member

lesteve commented Jan 9, 2023

Also, as a side comment, packages depending on sklearn seem to account for only a small portion of all sklearn downloads (this was already noted in previous attempts to figure out where sklearn downloads were coming from ...).

For sklearn, ~332k downloads per day:

❯ sqlite3 pypi.db "SELECT DISTINCT downloads, name FROM packages WHERE name LIKE 'sklearn';"
331695|sklearn

Summing the downloads of the top 50 packages depending on sklearn listed in #6 (comment), I get ~42k, so roughly 13% of the total sklearn downloads.
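
As a sanity check, the share can be recomputed directly from the numbers pasted in this thread. The snippet below hard-codes the 50 download counts from the list above and the sklearn count from the query output; the variable names are arbitrary.

```python
# Download counts of the 50 packages listed earlier in this thread, in order.
TOP_50_DOWNLOADS = [
    13133, 4675, 3173, 2854, 1686, 1448, 1296, 1075, 684, 668,
    570, 544, 542, 532, 494, 481, 458, 377, 369, 340,
    334, 318, 313, 302, 288, 287, 275, 233, 227, 226,
    225, 222, 222, 220, 219, 212, 209, 209, 205, 201,
    200, 198, 189, 189, 188, 187, 186, 184, 184, 181,
]
SKLEARN_DOWNLOADS = 331695  # from the `packages` table query above

total = sum(TOP_50_DOWNLOADS)
share = total / SKLEARN_DOWNLOADS
print(f"{total} downloads, {share:.1%} of sklearn downloads")
# prints "42232 downloads, 12.7% of sklearn downloads"
```

On these figures the top 50 dependents cover only about one eighth of sklearn downloads, which supports the same conclusion: most sklearn installs do not come from packages that declare it as a dependency.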

@lesteve
Member

lesteve commented Nov 29, 2023

Closing this one: the brownout period ends in a few days (December 1st), and we are not planning to do anything more about this.

lesteve closed this as not planned on Nov 29, 2023