
ENH: array-api for sparse arrays (as much as possible) #18915

@ivirshup

I would like to propose that scipy sparse arrays follow the array-api.

A big part of the push to move sparse data structures from the matrix API to a numpy-array-like API is better interoperability with dense numpy arrays. The array-api, and the broader data-api consortium behind it, generalizes this goal to a variety of array implementations, and has the support of much of the numeric-Python ecosystem.

Benefits

Big caveat: parts of the array api don't make sense for sparse arrays

There are cases where the array-api isn't a good match for sparse data. I'll give some examples below, but would propose that partial API support is a reasonable goal.

Off the top of my head:

dlpack-based interchange

Since sparse arrays are meant to be an efficient encoding of matrices that are mostly zeros, it does not necessarily make sense to allocate all of those missing values as a default interchange mechanism.

Surely we can achieve more reasonable interchange of sparse matrices between devices.
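As a rough illustration of the cost, using only existing scipy APIs (and computing, rather than allocating, what a densified copy would take):

import numpy as np
from scipy import sparse

# 100_000 x 100_000 matrix at 0.1% density: ~10 million stored values
x = sparse.random(100_000, 100_000, density=1e-3, format="csr", random_state=0)

sparse_bytes = x.data.nbytes + x.indices.nbytes + x.indptr.nbytes
dense_bytes = np.prod(x.shape) * x.dtype.itemsize  # size of a dense copy

print(f"sparse: {sparse_bytes / 1e9:.2f} GB, dense: {dense_bytes / 1e9:.2f} GB")
# sparse: ~0.12 GB, dense: 80.00 GB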

nD support

While 1d support may be reasonable, nD support, especially for non-COO formats, is likely out of scope for this library. See also:

This means that specific concatenation, indexing, and reshaping operations may not work. Arguably, the reshape operation may not even make sense.
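For example, with scipy's current 2-D-only sparse arrays, a reshape to three dimensions fails outright:

from scipy import sparse
import numpy as np

x = sparse.csr_array(np.eye(4))
x.reshape((2, 8))     # fine: 2-D to 2-D
x.reshape((2, 2, 4))  # raises ValueError (only 2-D shapes are supported)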

Broadcasting with null values

Sparse libraries often don't play very well with null values. The optimization of skipping the 0 values often means that nulls fail to propagate to positions holding implicit zeros, giving results that differ from the dense computation.

Example
from scipy import sparse
import numpy as np

coo = sparse.coo_array(([1, 1, 1, 1, 1], ([0, 0, 1, 2, 2], [0, 1, 1, 0, 2])))  # (data, (row, col))
coo.toarray()
# array([[1, 1, 0],
#        [0, 1, 0],
#        [1, 0, 1]])

(coo * np.array([np.nan, np.nan, 2.])).toarray()  # elementwise product, broadcasting the 1-D array over rows
# array([[nan, nan,  0.],
#        [ 0., nan,  0.],
#        [nan,  0.,  2.]])
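For contrast, the equivalent dense computation propagates nan into the implicitly-zero positions (0 * nan is nan), so the two results disagree:

coo.toarray() * np.array([np.nan, np.nan, 2.])
# array([[nan, nan,  0.],
#        [nan, nan,  0.],
#        [nan, nan,  2.]])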

Alternative: this is for pydata/sparse to do

While I like pydata/sparse, I think its adoption by the broader ecosystem faces some major barriers.

First is the use of numba as a dependency. I quite like numba, and have made a number of contributions there with the specific goal of making operations on sparse arrays work better (a lot of slicing and indexing). However, it is not as friendly a runtime dependency as scipy. It has had multiple compatibility issues with libraries I'd like to share my sparse data with (like pytorch and jax), and frequently pins both numpy and Python versions.

A second reason is that it would be strange to end up at "you should use pydata/sparse" but still have sparse linear algebra and IO libraries included in scipy. This could make sense if scipy depended on pydata/sparse, though I understand this to be a non-starter due to numba.

If pydata/sparse and scipy.sparse both need to exist, it would be really nice if they could exist largely interchangeably, e.g., a single array-api code path, as sketched below.
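As a minimal sketch of what a single code path could look like (assuming both libraries expose the standard __array_namespace__ entry point; row_l1_normalize is a hypothetical helper, not an existing API):

def row_l1_normalize(x):
    # Fetch the array-api namespace from the array itself, so the same
    # code serves dense numpy arrays and sparse arrays alike.
    xp = x.__array_namespace__()
    norms = xp.sum(xp.abs(x), axis=1, keepdims=True)
    return x / norms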

Aside: it could be interesting to explore AOT compiling a subset of pydata/sparse (COO, CSR, CSC) and distributing the compiled implementations through scipy.

cc: @perimosocordiae @dschult @jjerphan @rossbar @stefanv

Labels

SciPEP (SciPy Enhancement Proposal) · array types (items related to array API support and input array validation; see gh-18286) · enhancement (a new feature or improvement) · scipy.sparse
