ENH: ARM Neon implementation with intrinsic for np.argmax. #16375

Merged: 23 commits, Jun 1, 2020

Qiyu8 (Member) commented May 26, 2020

I measured the execution time of np.argmax using the asv benchmark suite. For the bool dtype, execution time drops by about 70% (161 μs to 47.7 μs, ratio 0.30); float32 is unchanged within noise. Here are the detailed results on a Kunpeng 920 aarch64 machine running openEuler Linux:

After (neon-argmax, 00b21d1b):
============================
        dtype
 ---------------------------
        numpy.float32    217±20μs
        bool             47.7±0.5μs
============================

Before (master, 3f11db40):
============================
        dtype
 ---------------------------
        numpy.float32    220±20μs
        bool             161±0.3μs
============================

Comparison:
       before           after         ratio
     [3f11db40]       [00b21d1b]
     <master>         <neon-argmax>
-       161±0.3μs       47.7±0.5μs     0.30  bench_reduce.ArgMax.time_argmax(<class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
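
As a quick sanity check on the figures above, the following sketch recomputes the asv ratio from the two reported bool timings. The variable names are illustrative only; the two input numbers are taken from the comparison line, and everything else is plain arithmetic.

```python
# Timings reported above for bench_reduce.ArgMax.time_argmax(bool), in microseconds.
before_us = 161.0   # master
after_us = 47.7     # neon-argmax

ratio = after_us / before_us            # asv's "ratio" column (after / before)
reduction_pct = (1.0 - ratio) * 100.0   # how much less time the new code takes
speedup = before_us / after_us          # how many times faster the new code is

print(f"ratio          = {ratio:.2f}")
print(f"time reduction = {reduction_pct:.0f}%")
print(f"speedup        = {speedup:.1f}x")
```

A ratio of 0.30 therefore corresponds to roughly a 70% reduction in execution time, or about a 3.4x speedup, for the bool case.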

Reference:

  1. SSE2Neon : https://github.com/jratcliff63367/sse2neon
  2. http://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon
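
The second reference is about emulating SSE2's `_mm_movemask_epi8` on NEON. The sketch below is a pure-Python model of how a movemask-style test can speed up argmax on a bool array: whole 16-byte chunks that contain only zeros are skipped, and a scalar scan finishes the job. It is not the code from this PR; `movemask_eq_zero` and `bool_argmax` are hypothetical names, and the 16-byte chunk size is an assumption matching a 128-bit vector register.

```python
CHUNK = 16  # assumed vector width in bytes (one 128-bit register)


def movemask_eq_zero(chunk: bytes) -> int:
    """Model of 'compare each byte with zero, then movemask':
    bit i of the result is set when byte i of the chunk is zero."""
    mask = 0
    for i, b in enumerate(chunk):
        if b == 0:
            mask |= 1 << i
    return mask


def bool_argmax(data: bytes) -> int:
    """Index of the first non-zero byte, or 0 if every byte is zero.
    This mirrors what argmax returns for a bool array."""
    n = len(data)
    all_zero = (1 << CHUNK) - 1  # mask value when all 16 bytes are zero
    i = 0
    # "Vector" phase: skip full chunks that hold no True value.
    while i + CHUNK <= n:
        if movemask_eq_zero(data[i:i + CHUNK]) != all_zero:
            break
        i += CHUNK
    # Scalar phase: locate the first True in the remaining bytes.
    while i < n:
        if data[i]:
            return i
        i += 1
    return 0


print(bool_argmax(bytes([0] * 37 + [1] + [0] * 5)))
```

In a real SIMD implementation the chunk test is a handful of instructions, so arrays with a late (or no) True value are scanned far faster than with a byte-by-byte loop.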

Qiyu8 changed the title from "ENT: ARM Neon implementation with intrinsic for np.argmax." to "ENH: ARM Neon implementation with intrinsic for np.argmax." on May 26, 2020
mattip (Member) commented May 26, 2020

Nice. This will need refactoring once the Universal Intrinsics code goes in. It might provide a nice benchmark for the conversion (to prove performance on non-x86 hardware is not degraded), so I think we should put it in. Do we have any other non-X86 intrinsic use?

Qiyu8 (Member, Author) commented May 26, 2020

@mattip You are right. This will be a good benchmark case to demonstrate the power of Universal Intrinsics: one piece of code yields performance improvements across all platforms (X86/ARM/Power). More non-X86 intrinsic usage will follow later.

Qiyu8 requested a review from mattip on May 28, 2020 03:09
mattip (Member) commented May 31, 2020

LGTM. It is strange that argmax is not a ufunc. Just for completeness, how much does this change the size of _multiarray_umath.so?

Qiyu8 (Member, Author) commented Jun 1, 2020

@mattip The size of _multiarray_umath.so at each branch is:
master: 16423KB
neon-argmax: 16423KB
The size remains the same. :)

mattip merged commit bdd4e2e into numpy:master on Jun 1, 2020
mattip (Member) commented Jun 1, 2020

Thanks @Qiyu8. I am a little concerned that this does no runtime detection of the NEON instructions. I assume the code is valid across all processors of the targeted architecture (just as we require SSE2 on X86).
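
For illustration only, here is one way a runtime check could look on Linux, where the kernel reports Advanced SIMD (NEON) support on aarch64 as the `asimd` flag in the `Features` line of `/proc/cpuinfo`. This is a sketch, not something the PR does: `has_asimd` is a hypothetical helper, and the sample text is made up for the example rather than captured from the Kunpeng 920 machine.

```python
def has_asimd(cpuinfo_text: str) -> bool:
    """Return True if a 'Features' line in the given /proc/cpuinfo text
    lists the 'asimd' flag (Advanced SIMD on aarch64 Linux)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            return "asimd" in value.split()
    return False


# Illustrative sample, not real output from any particular machine.
sample = "processor\t: 0\nFeatures\t: fp asimd evtstrm aes pmull sha1 sha2 crc32\n"
print(has_asimd(sample))

# On an actual aarch64 Linux system one could read the real file instead:
# with open("/proc/cpuinfo") as f:
#     print(has_asimd(f.read()))
```

A compile-time guard on the target architecture, which is what the comment above assumes is sufficient, avoids this kind of check entirely when every supported processor provides the instructions.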
