ENH: ARM Neon implementation with intrinsic for np.argmax. #16375
Merge branch 'master' of https://github.com/numpy/numpy into neon-argmax
Nice. This will need refactoring once the Universal Intrinsics code goes in. It might provide a nice benchmark for the conversion (to prove performance on non-x86 hardware is not degraded), so I think we should put it in. Do we have any other non-x86 intrinsic use?
@mattip You are right, this will be a good benchmark case to demonstrate the power of Universal Intrinsics: one piece of code leads to performance improvements across all platforms (x86/ARM/Power). More non-x86 intrinsic usage will come later.
LGTM. It is strange that
@mattip The size of
Thanks @Qiyu8. I am a little concerned that this does no runtime detection of the NEON instructions. I assume this is valid across all processors of a certain architecture (like we require SSE2 for x86).
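To make the compile-time versus runtime detection point concrete, here is a minimal sketch of an argmax that is guarded only by the compiler-defined `__ARM_NEON` macro, with a scalar fallback elsewhere. It is not the code from this PR: the function names `argmax_f32` and `argmax_scalar` are hypothetical, the two-pass approach (vector maximum first, then a scan for its first position) is chosen for brevity, and the input is assumed to contain no NaN values, which NumPy's real argmax handles specially.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* only pulled in when the compiler targets NEON */
#endif

/* Scalar reference: index of the first maximum (requires n > 0). */
static size_t argmax_scalar(const float *a, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (a[i] > a[best]) {   /* strict '>' keeps the first occurrence */
            best = i;
        }
    }
    return best;
}

/* Hypothetical wrapper: NEON path chosen at compile time, no runtime check.
 * Assumption: the input holds no NaN values. */
static size_t argmax_f32(const float *a, size_t n)
{
#if defined(__ARM_NEON)
    if (n >= 8) {
        /* Pass 1: running lane-wise maximum over blocks of 4 floats. */
        float32x4_t vmax = vld1q_f32(a);
        size_t i = 4;
        for (; i + 4 <= n; i += 4) {
            vmax = vmaxq_f32(vmax, vld1q_f32(a + i));
        }
        /* Reduce the 4 lanes to a single maximum. */
        float lanes[4];
        vst1q_f32(lanes, vmax);
        float m = lanes[0];
        for (int k = 1; k < 4; k++) {
            if (lanes[k] > m) m = lanes[k];
        }
        /* Fold in the leftover tail elements. */
        for (; i < n; i++) {
            if (a[i] > m) m = a[i];
        }
        /* Pass 2: first index holding the maximum. */
        for (size_t j = 0; j < n; j++) {
            if (a[j] == m) return j;
        }
        /* Reached only if the NaN assumption is violated: fall through. */
    }
#endif
    return argmax_scalar(a, n);
}
```

With this layout the decision is made entirely by the build target: a binary compiled with NEON enabled assumes every machine it runs on has NEON, which is the same kind of baseline assumption as requiring SSE2 on x86. Whether that assumption holds for every supported ARM target is exactly the question raised above.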
I measured the execution time of np.argmax using the asv benchmark suite. Performance increased by 70%. Here are the detailed results on a Linux OpenEuler OS + Kunpeng 920 aarch64 machine:
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Reference: