ENH: ARM Neon implementation with intrinsic for np.argmax. #16375

Qiyu8 · 2020-05-26T03:31:45Z

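I measured the execution time of np.argmax using the asv benchmark suite. For bool input the execution time dropped by about 70% (161 μs to 47.7 μs, a ratio of 0.30); float32 is unchanged within the measurement noise. Here are the detailed results on a Linux OpenEuler OS + Kunpeng 920 aarch64 machine: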

After (neon-argmax):
============================
        dtype
 ---------------------------
        numpy.float32    217±20μs
        bool             47.7±0.5μs
============================

Before (master):
============================
        dtype
 ---------------------------
        numpy.float32    220±20μs
        bool             161±0.3μs
============================

       before           after         ratio
     [3f11db40]       [00b21d1b]
     <master>         <neon-argmax>
-       161±0.3μs       47.7±0.5μs     0.30  bench_reduce.ArgMax.time_argmax(<class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Reference:

Merge branch 'master' of https://github.com/numpy/numpy into neon-argmax

…ON.h

mattip · 2020-05-26T08:43:26Z

Nice. This will need refactoring once the Universal Intrinsics code goes in. It might provide a nice benchmark for the conversion (to prove performance on non-x86 hardware is not degraded), so I think we should put it in. Do we have any other non-X86 intrinsic use?

Qiyu8 · 2020-05-26T09:18:48Z

@mattip You are right, this will be a good benchmark case to demonstrate the power of Universal Intrinsics: one piece of code leads to performance improvements across all platforms (x86/ARM/Power). More non-x86 intrinsic usage will follow later.
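
To make the "one piece of code, several platforms" idea concrete, here is a minimal sketch. It is not the code from this PR and not the Universal Intrinsics API: the names `demo_block_any` and `demo_bool_argmax` are hypothetical, and the NEON branch uses the AArch64-only `vmaxvq_u8` reduction rather than the movemask emulation this PR borrows from sse2neon. The kernel is written once against a tiny per-platform helper, with a scalar fallback when neither SSE2 nor AArch64 NEON is available.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero if any of the 16 bytes at p is nonzero.
 * Exactly one backend is selected at compile time. */
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
static int demo_block_any(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* All 16 mask bits set means every byte compared equal to zero. */
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
}
#elif defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
static int demo_block_any(const uint8_t *p)
{
    /* Horizontal max over the 16 lanes; nonzero if any lane is nonzero. */
    return vmaxvq_u8(vld1q_u8(p)) != 0;
}
#else
static int demo_block_any(const uint8_t *p)
{
    for (int i = 0; i < 16; i++) {
        if (p[i]) {
            return 1;
        }
    }
    return 0;
}
#endif

/* Platform-independent kernel: index of the first true element, or 0 if
 * every element is false (the same answer np.argmax gives for bool input). */
size_t demo_bool_argmax(const uint8_t *data, size_t n)
{
    size_t i = 0;
    /* Skip whole 16-byte blocks that contain only zeros. */
    for (; i + 16 <= n; i += 16) {
        if (demo_block_any(data + i)) {
            break;
        }
    }
    /* Scan the block that hit (or the tail) one byte at a time. */
    for (; i < n; i++) {
        if (data[i]) {
            return i;
        }
    }
    return 0;
}
```

Only the helper differs per architecture, so a benchmark like the bool case above can be rerun unchanged on each platform to check that none of them regresses.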

numpy/core/src/multiarray/arraytypes.c.src

mattip · 2020-05-31T10:12:56Z

LGTM. It is strange that argmax is not a ufunc. Just for completeness, how much does this change the size of _multiarray_umath.so?

Qiyu8 · 2020-06-01T04:06:57Z

@mattip The size of _multiarray_umath.so at each branch is:
master: 16423KB
neon-argmax: 16423KB
The size remains the same. :)

mattip · 2020-06-01T06:20:29Z

Thanks @Qiyu8. I am a little concerned that this does no runtime detection of the NEON instructions. I assume the code is valid across all processors of a given architecture (just as we require SSE2 for x86).
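
For reference, the difference between a compile-time check and runtime detection can be sketched as below. This illustrates the general technique only and is not what this PR does: the function names are hypothetical, and the runtime branch assumes Linux on AArch64, where `getauxval(AT_HWCAP)` reports the `HWCAP_ASIMD` bit. As far as I know, mainstream AArch64 toolchains treat Advanced SIMD (NEON) as part of the baseline, which is why a compile-time guard is usually considered sufficient there.

```c
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ASIMD */
#endif

/* Compile-time answer: was this translation unit built with NEON enabled? */
int demo_neon_compiled_in(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Runtime answer: does the CPU we are running on report Advanced SIMD?
 * Assumption: only Linux/AArch64 is queried here; every other platform
 * simply falls back to the compile-time answer. */
int demo_neon_at_runtime(void)
{
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#else
    return demo_neon_compiled_in();
#endif
}
```

A dispatcher would call `demo_neon_at_runtime()` once at import time and choose between the NEON and scalar loops; with only the compile-time guard, that choice is fixed when the extension module is built.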

Qiyu8 and others added 13 commits March 4, 2020 10:57

Neon implementation with intrinsic for bool argmax

70264f6

Neon implementation with intrinsic for bool argmax

2acb5d9

Neon implementation with intrinsic for bool argmax

0da814a

Neon implementation with intrinsic for bool argmax

b4d85f3

Neon implementation with intrinsic for bool argmax

6de7de7

Neon implementation with intrinsic for bool argmax

90e9c35

Neon implementation with intrinsic for bool argmax

c6c067f

Neon implementation with intrinsic for bool argmax

34feb50

update branch

a875ed6

Merge branch 'master' of https://github.com/numpy/numpy into neon-argmax

modify the name

2c8e9ff

update

00b21d1

Merge branch 'master' of https://github.com/numpy/numpy into neon-argmax

modify class name

d9b7351

add ref https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NE…

eeaac71

…ON.h

Qiyu8 changed the title ~~ENT: ARM Neon implementation with intrinsic for np.argmax.~~ ENH: ARM Neon implementation with intrinsic for np.argmax. May 26, 2020

charris added 01 - Enhancement component: numpy._core labels May 26, 2020

seiko2plus reviewed May 27, 2020

numpy/core/src/multiarray/arraytypes.c.src (resolved)

seiko2plus reviewed May 27, 2020

numpy/core/src/multiarray/arraytypes.c.src (outdated, resolved)

Qiyu8 added 6 commits May 28, 2020 10:18

optimize _mm_movemask_epi8_neon

be683b0

optimize _mm_movemask_epi8_neon

e01e223

change macro define

ccf8f81

change macro define

2a556d1

Merge branch 'neon-argmax' of github.com:Qiyu8/numpy into neon-argmax

32fd0c3

change macro define

7d55ba3

Qiyu8 requested a review from mattip May 28, 2020 03:09

Qiyu8 added 3 commits May 28, 2020 11:41

add macro

002c21f

change macro define

4d0bcd7

change macro define

27620e3

change macro define

ddfdf1c

mattip merged commit bdd4e2e into numpy:master Jun 1, 2020
