Add AdvSimd optimised GetPointerToFirstInvalidChar #121383

ylpoonlg · 2025-11-05T16:46:51Z

The intrinsic codepath was temporarily disabled (#42052) for AdvSimd due to a performance regression (#41699), but was never re-enabled. Since AdvSimd has no equivalent instruction for ExtractMostSignificantBits, using the same algorithm as SSE2 is inherently slower.

This PR adds optimizations specific to arm64 AdvSimd, based on the generic Vector128 algorithm. It uses the Unzip instruction to convert vectors into scalars for processing, which offers some speedup over the generic Vector128 path.
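For readers unfamiliar with the operation at issue, the following is a minimal scalar model of what ExtractMostSignificantBits computes: it gathers the top bit of every lane into one scalar bitmask. This is only an illustrative sketch in C, assuming 16 byte-sized lanes; the function name `extract_msb_mask` is hypothetical and this is not the runtime's implementation. SSE2 can do this in a single instruction (`pmovmskb`), whereas on AdvSimd it has to be built from several operations, which is the cost the description refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-lane byte "movemask" (hypothetical helper name).
   Bit i of the result is the most significant bit of lanes[i]. */
static uint32_t extract_msb_mask(const uint8_t lanes[16])
{
    uint32_t mask = 0;
    for (size_t i = 0; i < 16; i++) {
        /* lanes[i] >> 7 is 1 when the top bit is set, otherwise 0. */
        mask |= (uint32_t)(lanes[i] >> 7) << i;
    }
    return mask;
}
```

In a vectorized search, the resulting mask is typically tested against zero and then scanned for its lowest set bit to locate the first matching lane.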

Performance results

Run on Neoverse-V2 (lower is better)

| Method       | Input              | Version |      Mean |     Error | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  4.414 us | 0.0852 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  4.420 us | 0.0858 us | 1.001 |
| GetByteCount | EnglishMostlyAscii | Before  | 20.168 us | 0.0613 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 18.366 us | 0.1092 us | 0.911 |
| GetByteCount | Chinese            | Before  |  9.132 us | 0.0052 us | 1.000 |
| GetByteCount | Chinese            | After   |  8.157 us | 0.0342 us | 0.893 |
| GetByteCount | Cyrillic           | Before  |  7.929 us | 0.0069 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  7.157 us | 0.0521 us | 0.903 |
| GetByteCount | Greek              | Before  | 10.042 us | 0.0093 us | 1.000 |
| GetByteCount | Greek              | After   |  9.109 us | 0.0646 us | 0.907 |

cc @dotnet/arm64-contrib @SwapnilGaikwad

The intrinsic codepath was disabled for AdvSimd due to slower performance than the Vector128 codepath. Since there is no equivalent instruction for ExtractMostSignificantBits in AdvSimd, using the SSE2 algorithm will be slow. Therefore a new specialized algorithm is added to optimise for AdvSimd, based on the generic Vector128 algorithm.

Fixes dotnet#41699 and dotnet#42052.

Performance results on Neoverse-V2 (lower is better):

| Method       | Input              | Version |      Mean |     Error | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  4.414 us | 0.0852 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  4.420 us | 0.0858 us | 1.001 |
| GetByteCount | EnglishMostlyAscii | Before  | 20.168 us | 0.0613 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 18.366 us | 0.1092 us | 0.911 |
| GetByteCount | Chinese            | Before  |  9.132 us | 0.0052 us | 1.000 |
| GetByteCount | Chinese            | After   |  8.157 us | 0.0342 us | 0.893 |
| GetByteCount | Cyrillic           | Before  |  7.929 us | 0.0069 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  7.157 us | 0.0521 us | 0.903 |
| GetByteCount | Greek              | Before  | 10.042 us | 0.0093 us | 1.000 |
| GetByteCount | Greek              | After   |  9.109 us | 0.0646 us | 0.907 |

SwapnilGaikwad · 2025-11-05T16:51:40Z

cc: @EgorBo @a74nh @JulieLeeMSFT

EgorBo · 2025-11-06T15:53:15Z

I think @tannergooding is a better person to review this as it's Libraries code.

Can any of this be replaced with the newly added "IndexOf"-like Vector APIs?

tannergooding · 2025-11-06T16:05:21Z

> Can any of this be replaced with the newly added "IndexOf"-like Vector APIs?

👍. This is not a large improvement and makes the code harder to maintain. I'd personally lean more towards removing the Sse2 specialized path and instead just having the Vector128.IsHardwareAccelerated path. Ideally utilizing the newer xplat APIs, like Count(...) and IndexOf(...), to handle such optimizations (and tweaking the JIT if things aren't quite right).

The general goal is to reduce the amount of platform-specific code we have to maintain over time. This is a goal even if there are relatively minor regressions in doing so. We really only want to have platform-specific paths if we know there is a "significant" advantage, such as much higher throughput numbers and/or real-world benchmarks (not microbenchmarks) showing the gains.

-- Not all 10% is created equal. 10% of 1us (100ns) is significantly less than 10% of 1ms (100us), for example. And while a 10% gain on 10us can be impactful to some apps, it is unlikely to be the bottleneck or to have any measurable impact on typical workloads. This is particularly true when things like Tiered Compilation don't kick in until there's a 100ms gap between Tier 0 compilations after startup. So we typically want to see a reduction in code complexity, or something showing that the complexity increase is worthwhile.

SwapnilGaikwad · 2025-11-07T12:49:01Z

> This is not a large improvement and makes the code harder to maintain.

Thanks @tannergooding for your comment and explanation. I certainly agree; we will close this PR.

ylpoonlg · 2025-11-07T12:53:20Z

Thanks all, that makes sense. I will bear that in mind and explore the suggestion of using the IndexOf APIs. Closing this for now.

Re-attempt at #121383.

Refactor the vectorized code path by combining the SSE2 "intrinsified" path with the original Vector128 algorithm. There is still some platform-specific code (for AdvSimd), as it is difficult to rely fully on Vector128 APIs without sacrificing too much performance. The main issue is the lack of an instruction for `Vector128.ExtractMostSignificantBits` on Arm, so it is significantly slower when forced to use the same mask format as the SSE2 algorithm. I have looked into the possibility of using `IndexOf`, `Count`, etc., but they also use `ExtractMostSignificantBits`, so they pose the same problem. This PR tries to encapsulate this difference in a few helper methods so that both platforms can share the same code path for the main algorithm.

Performance-wise, the improvement is smaller, but hopefully the code will be easier to maintain.

Arm Neoverse-V2:

| Method       | Input              | Version |      Mean |     Error | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  4.437 us | 0.0437 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  4.475 us | 0.1618 us | 1.009 |
| GetByteCount | EnglishMostlyAscii | Before  | 20.387 us | 0.1744 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 19.941 us | 0.1079 us | 0.978 |
| GetByteCount | Chinese            | Before  |  9.145 us | 0.0072 us | 1.000 |
| GetByteCount | Chinese            | After   |  8.992 us | 0.0069 us | 0.983 |
| GetByteCount | Cyrillic           | Before  |  7.936 us | 0.0095 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  7.812 us | 0.0056 us | 0.984 |
| GetByteCount | Greek              | Before  | 10.077 us | 0.0106 us | 1.000 |
| GetByteCount | Greek              | After   |  9.952 us | 0.0120 us | 0.988 |

Intel Sapphire Rapids:

| Method       | Input              | Version |      Mean |     Error | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  8.144 us | 0.3398 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  8.126 us | 0.2759 us | 0.998 |
| GetByteCount | EnglishMostlyAscii | Before  | 22.971 us | 0.4046 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 22.155 us | 0.9902 us | 0.964 |
| GetByteCount | Chinese            | Before  | 10.582 us | 0.3425 us | 1.000 |
| GetByteCount | Chinese            | After   | 10.048 us | 0.2135 us | 0.950 |
| GetByteCount | Cyrillic           | Before  |  9.222 us | 0.1874 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  9.100 us | 0.2704 us | 0.987 |
| GetByteCount | Greek              | Before  | 11.802 us | 0.3551 us | 1.000 |
| GetByteCount | Greek              | After   | 11.224 us | 0.3505 us | 0.951 |
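The idea of hiding a per-platform mask format behind small helpers can be illustrated with a scalar sketch in C. Everything here is an assumption made for illustration: the names `lowest_set_bit` and `first_flagged_lane`, the two mask layouts (one bit per lane, as a movemask-style instruction would give, and four bits per lane, a layout sometimes used when no such instruction exists), and the choice of 8 lanes. It is not the code from the PR; it only shows how a shared caller can stay unaware of the layout.

```c
#include <stdint.h>

/* Index of the lowest set bit. The caller guarantees mask != 0. */
static unsigned lowest_set_bit(uint32_t mask)
{
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* Hypothetical helper: index of the first flagged lane, or -1 if none.
   bits_per_lane is 1 for a movemask-style layout (8 lanes -> bits 0..7)
   and 4 for a nibble-per-lane layout (8 lanes -> 32 bits). The shared
   algorithm only ever sees the returned lane index. */
static int first_flagged_lane(uint32_t mask, unsigned bits_per_lane)
{
    if (mask == 0) {
        return -1;
    }
    return (int)(lowest_set_bit(mask) / bits_per_lane);
}
```

With this shape, only the code that builds the mask and the `bits_per_lane` constant differ per platform, while the loop that consumes lane indices is shared.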

dotnet-policy-service bot added the community-contribution label Nov 5, 2025

github-actions bot added the needs-area-label label Nov 5, 2025

SwapnilGaikwad requested a review from EgorBo November 5, 2025 16:54

EgorBo requested a review from tannergooding November 6, 2025 15:53

ylpoonlg closed this Nov 7, 2025

ylpoonlg mentioned this pull request Nov 26, 2025

Simplify UTF-16 validation Vector128 codepath #121981

Merged

github-actions bot locked and limited conversation to collaborators Dec 8, 2025


5 participants
