I would like to propose that scipy sparse arrays follow the array-api.
A big part of the push to move from the matrix API to a numpy-array-like API for sparse data structures is to have better interoperability with dense numpy arrays. The array-api, and the broader data-api consortium behind it, generalizes this goal across a variety of array implementations, and has support from much of the numerical-Python ecosystem.
Benefits
- Downstream libraries can support a single code path for dense and sparse arrays in more cases (see the sketch after this list)
- Key use case: support in array wrapper libraries like xarray and dask
- It comes with a test suite!
- Easier decision-making around the API, since many decisions have already been made
- scipy itself is already adopting the array-api
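To make the first benefit concrete, here is a minimal sketch of what a single dense/sparse code path could look like. It assumes sparse arrays would expose the standard __array_namespace__ entry point (they don't today; that is the proposal), and uses array_api_compat, the helper library that scipy's own array-api support builds on:

import array_api_compat

def center_columns(x):
    # Fetch the array-api namespace from the array itself, then use only
    # standard functions from it. The same code runs unchanged for numpy
    # arrays or torch tensors, and would run for scipy sparse arrays under
    # this proposal, with no isinstance checks on the input type.
    xp = array_api_compat.array_namespace(x)
    return x - xp.mean(x, axis=0, keepdims=True)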
Big caveat: parts of the array-api don't make sense for sparse arrays
There are cases where the array-api isn't a good match for sparse data. I'll give some examples below, but would propose that partial API support is reasonable.
Off the top of my head:
dlpack-based interchange
Since sparse arrays are meant to be an efficient encoding of matrices with large numbers of zeros, it does not necessarily make sense to materialize all those missing values as a default interchange mechanism.
Surely we can achieve more reasonable interchange of sparse matrices between devices.
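As a rough illustration of the scale mismatch (the matrix size here is arbitrary, chosen only to make the point):

from scipy import sparse

# A 100_000 x 100_000 identity matrix stores just 100_000 nonzeros.
eye = sparse.eye(100_000, format="csr")
print(eye.data.nbytes)  # 800_000 bytes of stored float64 values

# A dense-buffer interchange like dlpack would have to materialize
# 8 * 100_000 ** 2 bytes, i.e. roughly 80 GB of mostly zeros.
# (So don't call eye.toarray() on this one.)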
nD support
While 1d support may be reasonable, nD support, especially for non-COO formats, is likely out of scope for this library. See also:
This means that specific concatenation, indexing, and reshaping operations may not work. Arguably, the reshape operation may not even make sense.
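For instance, a sketch of the dimensionality limit (exact behavior depends on your scipy version):

from scipy import sparse
import numpy as np

sparse.coo_array(np.eye(3))  # fine: the formats are built around 2d
sparse.coo_array(np.ones((2, 2, 2)))  # raises on versions without n-D support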
Broadcasting with null values
Sparse libraries often don't play very well with null values. The optimization of skipping the 0 values often means that entries which should become null (e.g. 0 * nan == nan) are silently left as 0.
Example
from scipy import sparse
import numpy as np
coo = sparse.coo_array(([1, 1, 1, 1, 1], ([0, 0, 1, 2, 2], [0, 1, 1, 0, 2])))
coo.toarray()
# array([[1, 1, 0],
#        [0, 1, 0],
#        [1, 0, 1]])
(coo * np.array([np.nan, np.nan, 2.])).toarray()
# array([[nan, nan, 0.],
#        [ 0., nan, 0.],
#        [nan, 0., 2.]])
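Note that under IEEE semantics 0 * nan == nan, so a dense computation would also produce nan at positions (1, 0) and (2, 1); the sparse multiply never visits the unstored zeros, so they silently stay 0.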
Alternative: this is for pydata/sparse to do
While I like pydata/sparse, I think its adoption by the broader ecosystem has some major barriers.
First is the use of numba as a dependency. I quite like numba, and have made a number of contributions there with the specific goal of making operations on sparse arrays work better (a lot of slicing and indexing). However, it is not as friendly a runtime dependency as scipy. It has had multiple compatibility issues with libraries I'd like to share my sparse data with (like pytorch and jax), and frequently pins both numpy and python.
A second reason is that it would be strange to end up at "you should use pydata/sparse" but still have sparse linear algebra and IO libraries included in scipy. This could make sense if scipy depended on pydata/sparse, though I understand this to be a non-starter due to numba.
If pydata/sparse and scipy.sparse both need to exist, it would be really nice if they could exist largely interchangeably, e.g. with one array-api code path.
Aside: it could be interesting to explore AOT compiling a subset of pydata/sparse (COO, CSR, CSC) and distributing the compiled implementations through scipy.