ENH: argmax and argmin methods for sparse matrices #6761

Merged: 6 commits, Jan 16, 2017

Conversation

nmayorov
Contributor

@nmayorov nmayorov commented Nov 4, 2016

Issue #5883

I think my approach is reasonable and efficient in terms of algorithmic complexity, but maybe it can be done with fewer Python loops.

For now I have put the tests in a separate file (instead of test_base.py), which makes it easier to test and demonstrate what's going on. At the end we can move them to test_base.py.

@perimosocordiae please take a look when you can.

@nmayorov nmayorov added the enhancement and scipy.sparse labels Nov 4, 2016
Member

@perimosocordiae perimosocordiae left a comment

Looks reasonable, overall. It might be nice to follow the pattern used by _min_or_max, rather than handling the axis=None and row/column-wise cases all together.

mat.sum_duplicates()

line_size = mat.shape[axis]
ret = np.empty(mat.shape[1 - axis], dtype=int)
Member

ret_size, line_size = mat._swap(mat.shape)
ret = np.zeros(ret_size, dtype=int)

line_size = mat.shape[axis]
ret = np.empty(mat.shape[1 - axis], dtype=int)

for i in range(ret.shape[0]):
Member

I don't think we can avoid the loop entirely, but we can at least vectorize the first condition:

nz_lines, = np.nonzero(np.diff(mat.indptr))
for i in nz_lines:
    p, q = mat.indptr[i:i+2]
    data = mat.data[p:q]
    # etc...
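
The loop suggested above can be sketched on raw CSR arrays using NumPy only. This is a minimal illustration, not the code merged in this PR: the helper name `argmax_stored_per_row` and the sample arrays are hypothetical, and the sketch compares only explicitly stored entries (implicit zeros are ignored).

```python
import numpy as np

def argmax_stored_per_row(indptr, indices, data, n_rows):
    """For each row with stored entries, return the column of its largest stored value.

    Hypothetical helper for illustration. Rows without stored entries keep
    the default index 0. Implicit zeros are NOT compared here.
    """
    ret = np.zeros(n_rows, dtype=int)
    # Vectorized first condition: indices of rows holding at least one entry.
    nz_lines, = np.nonzero(np.diff(indptr))
    for i in nz_lines:
        p, q = indptr[i:i + 2]
        row_data = data[p:q]
        row_cols = indices[p:q]
        ret[i] = row_cols[np.argmax(row_data)]
    return ret

# CSR arrays for the 3x4 matrix
# [[0, 5, 0, 2],
#  [0, 0, 0, 0],
#  [7, 0, 0, 9]]
indptr = np.array([0, 2, 2, 4])
indices = np.array([1, 3, 0, 3])
data = np.array([5, 2, 7, 9])

result = argmax_stored_per_row(indptr, indices, data, n_rows=3)
```

Because `np.diff(indptr)` gives the number of stored entries per row, `np.nonzero` on it lets the Python loop skip empty rows entirely.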


D2 = D1.transpose()

classes = [bsr_matrix, coo_matrix, csr_matrix, csc_matrix]
Member

When you integrate this with test_base.py, you can add the argmin/argmax tests to the _TestMinMax class and follow the existing tests as a template.

for axis in [None, 0, 1]:
    mat = spmatrix(D)
    assert_raises(ValueError, mat.argmax, axis=axis)
    assert_raises(ValueError, mat.argmin, axis=axis)
Member

Some of these cases actually don't raise an error when mat is a numpy array:

In [1]: x = np.ones((5,0))

In [2]: x.argmax(axis=0)
Out[2]: array([], dtype=int64)
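
To make the distinction concrete, here is a small check of how a dense NumPy array of shape (5, 0) behaves for each axis. The helper `raises_value_error` is hypothetical and exists only for this illustration.

```python
import numpy as np

x = np.ones((5, 0))

# Reducing along axis 0 yields one result per column; there are no columns,
# so NumPy returns an empty index array instead of raising.
along_axis0 = x.argmax(axis=0)

def raises_value_error(func, **kwargs):
    """Return True if calling func(**kwargs) raises ValueError."""
    try:
        func(**kwargs)
    except ValueError:
        return True
    return False

# Each of the 5 rows is an empty sequence, so there is nothing to pick from.
axis1_raises = raises_value_error(x.argmax, axis=1)
# The flattened array is empty as well.
flat_raises = raises_value_error(x.argmax)
```

Only the reductions that must pick an element from an empty sequence raise, so a sparse implementation that mirrors NumPy would raise for `axis=1` and `axis=None` here, but not for `axis=0`.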

@nmayorov
Contributor Author

nmayorov commented Nov 5, 2016

@perimosocordiae great advice, I think I have addressed all of it now.

If you are fine with the updated state, I will move tests to test_base.py.

Another thing I forgot to mention: I decided that returning an ndarray is more convenient than any form of sparse matrix, because I expect that in the majority of situations people will want to work with an ndarray eventually. Is that the right decision?

@perimosocordiae
Member

Looks good to me. I agree that dense results are reasonable, considering that an argmin/argmax of zero doesn't necessarily indicate missing data.

If we want to follow the numpy matrix convention (which spmatrix mimics), the result should be a matrix (row matrix for axis=0, column matrix for axis=1). On the other hand, argmax/argmin results are usually then used for indexing, where a flat ndarray is the most useful. I'm leaning toward the matrix return type for now, but I could be convinced otherwise.
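
The trade-off described above can be seen with a dense `np.matrix`. This is only an illustration of the convention on a made-up array `D`; it does not use the sparse methods from this PR.

```python
import numpy as np

D = np.array([[1, 9, 3],
              [8, 2, 7]])
M = np.asmatrix(D)

# The matrix convention keeps results two-dimensional.
col_result = M.argmax(axis=0)   # row matrix, shape (1, 3)
row_result = M.argmax(axis=1)   # column matrix, shape (2, 1)

# For indexing, the result has to be flattened to a 1-D ndarray first.
flat = np.asarray(row_result).ravel()
row_maxima = D[np.arange(D.shape[0]), flat]
```

With a matrix return type, callers who want to index need the extra flattening step; with a flat ndarray they would not.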

@nmayorov
Contributor Author

nmayorov commented Nov 5, 2016

@perimosocordiae

Maybe I'm wrong about that, but it seems to me that people usually avoid using numpy matrices. Leaving consistency aside, I think having an ndarray right away is more practical. At least that is very true for me. I'll leave it to you to decide.

@nmayorov
Contributor Author

nmayorov commented Nov 9, 2016

I made sure that the minimum possible index is always returned and moved tests to test_base.py.

Could you please make the final decision about whether to return an array or a matrix?
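
Returning the minimum possible index means implicit zeros have to take part in tie-breaking. The following is a rough sketch of that idea for a single row, assuming sorted column indices without duplicates; the function `argmax_row` is hypothetical and is not the implementation in this PR.

```python
import numpy as np

def argmax_row(cols, vals, line_size):
    """Smallest-index argmax of one sparse row (hypothetical helper).

    cols: sorted column indices of the stored entries, no duplicates
    vals: the stored values; every other position is an implicit zero
    line_size: total number of positions in the row
    """
    cols = np.asarray(cols)
    vals = np.asarray(vals)
    if len(vals) == 0:
        return 0  # the row is all implicit zeros
    k = int(np.argmax(vals))  # np.argmax returns the first maximum
    best_val, best_col = vals[k], int(cols[k])
    if len(vals) == line_size or best_val > 0:
        return best_col  # no implicit zeros, or they cannot win
    # The first position where the sorted cols deviate from 0, 1, 2, ...
    # is the first implicit zero.
    mismatch = np.nonzero(cols != np.arange(len(cols)))[0]
    first_zero = int(mismatch[0]) if len(mismatch) else len(cols)
    if best_val < 0:
        return first_zero  # an implicit zero beats every stored value
    return min(best_col, first_zero)  # tie between explicit and implicit zero

# Row [-1, 0, -3, 0]: the implicit zero at column 1 is the maximum.
example = argmax_row([0, 2], [-1, -3], line_size=4)
```

The same logic with the comparisons reversed would give the smallest-index argmin.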


mat = self.spmatrix(D)

assert_equal(mat.argmax(), argmax)
Member

Nitpick: I think it's clearer to have tests of the form:

assert_equal(mat.argmax(), np.argmax(D))

Rather than computing all the expected results first.
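
A short sketch of the test style suggested here, comparing each sparse result directly against NumPy on the dense array. The sample matrix `D` is made up, and the sketch assumes a SciPy version that provides `argmax`/`argmin` on sparse matrices (0.19 or later); axis-wise results are flattened so the comparison does not depend on the matrix return type.

```python
import numpy as np
from scipy.sparse import csr_matrix

D = np.array([[0, 5, 0, 2],
              [0, 0, 0, 0],
              [7, 0, 0, 9]])
mat = csr_matrix(D)

# Compare against NumPy directly instead of precomputing expected values.
assert mat.argmax() == np.argmax(D)
assert mat.argmin() == np.argmin(D)

for axis in (0, 1):
    got = np.asarray(mat.argmax(axis=axis)).ravel()
    assert np.array_equal(got, np.argmax(D, axis=axis))
```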

@perimosocordiae
Member

Looks good to me. I'm +1 to merge.

I'll defer the final choice about array vs matrix return types to another reviewer. @pv, @rgommers, others: what do you think?

@rgommers
Member

rgommers commented Dec 6, 2016

> I'll defer the final choice about array vs matrix return types to another reviewer. @pv, @rgommers, others: what do you think?

Not a strong preference, but despite my dislike of matrix I think we should go for consistency and return a matrix here.

@nmayorov
Contributor Author

nmayorov commented Dec 6, 2016

> Not a strong preference, but despite my dislike of matrix I think we should go for consistency and return a matrix here.

OK, maybe later we can switch to arrays everywhere (for example, in the 1.0 release).

I have changed the return type to matrix for now.

@rgommers rgommers added this to the 0.19.0 milestone Dec 21, 2016
@perimosocordiae perimosocordiae merged commit 56f2045 into scipy:master Jan 16, 2017
@perimosocordiae
Member

I want this in version 0.19, so merging now. Thanks, @nmayorov!

perimosocordiae added a commit that referenced this pull request Jan 16, 2017
The methods were added in gh-6761.
perimosocordiae added a commit to perimosocordiae/scipy that referenced this pull request Jan 16, 2017
@nmayorov nmayorov deleted the sparse_argmax branch March 15, 2017 00:05