The reason for the latter is rooted in the "Processor Group" architecture of Windows operating systems:
* On a traditional dual-CPU system Windows will generally assign all the cores from the first CPU to "processor group 1" and all the cores from the second CPU to "processor group 2". An application unaware of processor groups would be confined to only run on the cores of one of the processor groups. This would ensure that each application only accesses the RAM of the same CPU and not the RAM of the other CPU, which would have a higher latency.
* A processor group can contain up to 64 logical CPUs.
* Nowadays, when even a single CPU can have more than 32 cores / 64 threads, Windows just fills up additional processor groups, even if all cores are located on the same physical CPU.
* More information: https://learn.microsoft.com/en-us/windows/win32/procthread/processor-groups
In the above example with 96 cores / 192 threads, Windows would create 3 processor groups with 64 logical CPUs each. (By modifying the motherboard's BIOS settings, it would also be possible to create 4 processor groups with 48 logical CPUs each.)
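The arithmetic behind those numbers can be sketched in a few lines of C. The helper name `groups_needed` is made up for illustration, and the sketch only models the simple case where logical CPUs are packed into groups of a fixed maximum size; a real system may split groups along NUMA boundaries instead.

```c
#include <assert.h>

/* Hypothetical helper: number of processor groups needed when logical
   CPUs are packed into groups of at most `group_size` CPUs each.
   Simplified model, the NUMA layout is ignored. */
unsigned groups_needed(unsigned logical_cpus, unsigned group_size)
{
    /* A processor group can hold at most 64 logical CPUs. */
    assert(group_size >= 1 && group_size <= 64);
    /* Ceiling division: a partially filled group still counts. */
    return (logical_cpus + group_size - 1) / group_size;
}
```

With 192 logical CPUs, `groups_needed(192, 64)` gives 3 and `groups_needed(192, 48)` gives 4, matching the two layouts mentioned above.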
In older Windows versions (Windows 7 to 10), an application unaware of processor groups would always be confined to running on at most 64 cores of one processor group, no matter how many threads it creates.
However, in Windows 11 this restriction was removed: now, even if an application does not support processor groups at all, it can still use all available cores, as long as it creates enough threads.
For this reason, 7-Zip will utilize 128 cores on my CPU if I set the number of threads to 128. (but as said before, it won't allow me to set 192 threads)
This leaves me with two Feature Requests:
1. Instead of using the old APIs for detecting how many cores are available on a system, 7-Zip should use the newer APIs and then set the default/maximum threads for compression based on that value. (this one should be very easy to implement and would allow Windows 11+ and Windows Server 2022+ users to utilize all available CPU cores)
2. Add "real" support for processor groups to 7-Zip. This would be much more complex to implement and would basically only benefit Windows 7 to 10 users (who most likely will not be using CPUs with such high core counts). While it would be nice to have, it's probably not worth the time.
Processor groups can be simulated on any system, even if it has fewer than 32 cores / 64 threads: https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/boot-parameters-to-test-drivers-for-multiple-processor-group-support
There are two things to note:
1) Actually, 7-Zip can't get a big performance gain from a big number of working threads when compressing real data. Only the benchmark command can get a big gain from more than 64-128 threads.
2) Now it's difficult for me to debug that >64 threads feature, because I have no Windows 11 system with a big number of threads for debugging.
So now I'm not ready to change that code. Maybe in the future it will be simpler for me to implement it.
You don't need a machine with > 64 cores to test the code. You can simulate processor groups:
I just quickly cooked up the following code:
On my current system (16 cores / 32 threads), the output would be:
When using the above `bcdedit.exe` commands and setting `groupsize 8`, the output changes to:

So the total number of available logical CPUs would be detected, even if spread amongst multiple processor groups.
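A minimal sketch of that kind of counting in C (not the code originally posted here): it sums `GetActiveProcessorCount` over all groups reported by `GetActiveProcessorGroupCount`. The function name `count_logical_cpus` and the non-Windows fallback branch are assumptions added so the sketch builds anywhere. Both Win32 calls require Windows 7 or later, so code that must still load on older systems would have to resolve them with `GetProcAddress` instead of linking to them directly.

```c
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601 /* processor group APIs need Windows 7+ */
#endif
#include <windows.h>

/* Sum the active logical CPUs of every processor group. */
unsigned count_logical_cpus(void)
{
    unsigned total = 0;
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; g++)
        total += GetActiveProcessorCount(g); /* active CPUs in group g */
    return total;
}
#else
#include <unistd.h>

/* Assumed fallback for non-Windows builds: ask the OS for online CPUs. */
unsigned count_logical_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (unsigned)n : 1;
}
#endif
```

Note that this counts all logical CPUs in the system, regardless of which groups or affinity mask the current process is actually allowed to use.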
Also we need to detect that system can use more than one group.
I adjusted the code a bit to make it backwards compatible:
`GetActiveProcessorCount` instead of `GetMaximumProcessorCount`. If the groups don't have equal size, `GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS)` will assume all groups to have the maximum size and return the wrong value.
`GetActiveProcessorGroupCount` could be used)

I tried multiple configurations on my 96C/192T system:
With 4 NUMA nodes per processor (system acts as a quad-CPU system):
The two nonsensical groupsizes (29 and 61) also cause some other strange output, for example, the Windows Task Manager reports 98 or even 100 cores available.
7-zip process can have affinity mask set to non-default value.
For example, affinity mask can be set for 4 cores from 8. And we will create only 4 threads in 7-zip for that case.
So we must know how many threads in all groups are available for current running process.
And it must work in any system from Windows XP to Windows 11.
We need affinity for process.
For example, if we have 192 threads in the system, we must have a 192-bit affinity vector. How to get that affinity vector?
Note: I use Windows 10, and I have no Windows 11 for testing.
But I can still try to implement required code for groups, if it will be not too difficult.
Last edit: Igor Pavlov 2024-12-03
There is no 192-bit affinity vector, only the normal 64-bit affinity vector.
When a process is created, Windows 10 and 11 will always assign it a primary processor group (Round-Robin) and an affinity mask including all available CPUs in that processor group (up to 64). Alternatively, the user creating the process may specify the primary processor group and the affinity mask for that processor group.
All new threads in that process not using the processor group specific APIs will always be created in the primary processor group. If using the new APIs, the process may also specify that the new thread should be created in a different processor group.
The big change with Windows 11 is that when a process is unaware of processor groups (and uses the old APIs), Windows 11 may nevertheless create new threads in a different processor group.
However, this will only happen if all available logical CPUs in the primary processor group are already running a thread of that process:
The last code fragment I posted would return:
`GetProcessGroupAffinity` returns multiple processor groups and the code sums up their individual active CPU counts (the primary processor group also only has 64 CPUs, but Windows can create new threads in any available groups)

To get back to your remark:
If a user specifies that 7-Zip should only use 4 out of 8 logical CPUs by specifying a matching affinity mask when creating the 7-Zip process, my code posted above would nevertheless return "8" on a system with 8 logical CPUs.
However, if it were a 192 CPU system, the user could only specify "use 4 out of 64 CPUs". The user could not specify "use 4 out of 192 CPUs" since the primary processor group only has 64 CPUs.
Of course, the user is still free to run 7-Zip with 4 cores in total by specifying to use 4 threads instead of the default value.
My primary concern is just that the selectable default/maximum number of threads is too low and should be based on all available CPUs, not just on the ones in the primary processor group.
If we use Windows 10, `CreateThread` will use one group, as I suppose.
But your counting function in Windows 10 can still show a big number of cores (192) under some conditions, for example, if `SetThreadGroupAffinity()` was used before for some reason.
For example, we count cores in a DLL, but the main EXE already used all groups before.
So we can have a counted value of 192 cores in Windows 10, and we will try to call `CreateThread` 192 times. But all 192 threads will be placed in one group (64 cores), as I suppose.
So if we want to use only `CreateThread`, maybe we need some additional check that shows that we have the new `CreateThread` in Windows 11 instead of the old `CreateThread` in Windows 10?
Can you suggest the simplest way to check it?
Is there some flag or function that shows that we have new "Windows 11 groups"?
And some question about threads in Windows 11.
For example, we create thread with `CreateThread`. It can start new thread in any group in Windows 11.
And then the system can move that thread to any another group at any time?
Or the system can move the thread only within the group in which the thread was started?
Last edit: Igor Pavlov 2024-12-04
I see a new function `SetThreadSelectedCpuSetMasks` that must work only in Windows 11.
So probably we can use `GetProcAddress("SetThreadSelectedCpuSetMasks"` to check that we have new "Windows 11" groups?
Also I am still thinking about what is better: the default multi-group threads of Windows 11, or single-group `SetThreadGroupAffinity` threads like in Windows 10.
For example, if Windows 11 can move a thread from one group to another, it can be slow in some cases. For example, a thread was created in processor-1 (group-1), but when the primary group (group-0) in processor-0 becomes free, the thread will be moved to that primary group (processor-0). But NUMA memory for that thread could already have been allocated by processor-1.
Last edit: Igor Pavlov 2024-12-04
Maybe a picture tells more than 1000 words.
This is a screenshot of running `start 7zFM.exe` without any other parameters, going to the menu, selecting "benchmark" and then selecting to use 100 threads.
The primary processor group 7-Zip was assigned to appears to be the middle group (group 1) with all 64 logical CPUs at 100% utilization.
Since there are still 36 threads missing to get to 100, the remaining threads were put into processor groups 0 and 2. First 32 threads were put into group 2. Windows even took care of "SMT" by only using every second core (since the two SMT threads of one core share their cache). The remaining 4 threads were then put into group 0.
Of course the picture is far from perfect (additional threads only at 40-70% utilization), but it is what we would get on Windows 11 with an application completely unaware of processor groups.
Last edit: Alice 2024-12-04
This one is the same test, but this time started with `start /affinity FFFFFFFFFFEFFBFE 7zFM.exe`.
This time the primary group is group 0. Since the affinity is set to exclude CPUs 0, 10 and 20, they are at 0% and the other 61 CPUs in that group run at 100% utilization. All 100 threads are now in processor group 0.
However, this time Windows did not add new threads to other processor groups, since setting an affinity tells Windows that somebody is taking care of the scheduling already.
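The mask used in this test can be decoded with a few lines of portable C. This is only an illustrative sketch; the helper names `mask_cpu_count` and `mask_excluded_cpus` are hypothetical and not part of any Windows API.

```c
#include <stdint.h>

/* Count the logical CPUs enabled in a 64-bit affinity mask. */
unsigned mask_cpu_count(uint64_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1) /* clears the lowest set bit */
        n++;
    return n;
}

/* Store the indices of excluded CPUs (cleared bits) among the first
   `group_size` CPUs in `out` and return how many were stored. */
unsigned mask_excluded_cpus(uint64_t mask, unsigned group_size, unsigned out[64])
{
    unsigned count = 0;
    for (unsigned cpu = 0; cpu < group_size && cpu < 64; cpu++)
        if (((mask >> cpu) & 1u) == 0)
            out[count++] = cpu;
    return count;
}
```

For `0xFFFFFFFFFFEFFBFE`, `mask_cpu_count` yields 61 and `mask_excluded_cpus` reports CPUs 0, 10 and 20, which matches the utilization pattern described above.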
Using `SetThreadSelectedCpuSetMasks` allows defining which processor groups a process should run on. On Windows 11, the default value is "all available groups". (or maybe "all groups of one NUMA node", but I don't have a dual-CPU system, so I can't test)
The `CreateThread` function does not allow specifying processor groups. One would need to use `CreateRemoteThreadEx` (yes, "remote" to create a thread in the current process) and provide processor group information in the `lpAttributeList` parameter.
No, Windows 11 will only move around threads within one group (affinity mask restrictions apply). Moving an existing thread to a different group requires the application explicitly telling Windows where to move the thread (and what affinity to apply in the new group). Only a newly created thread would be placed in a new processor group, if the old one is at max capacity.
Actually I didn't see information about that thing in docs.
And what does your code that counts cores return for the `/affinity FFFFFFFFFFEFFBFE` case (it showed 192 cores without `/affinity`)?
I suppose it was so before Windows 11.
But if Windows 11 allows `SetThreadSelectedCpuSetMasks`, then there actually is a full affinity vector for all 192 cores. And by default it's allowed to start a thread on any core from all these 192 cores. So maybe Windows 11 can move a thread from one group to another group because the default affinity list allows it?
I don't want to use the `SetThreadSelectedCpuSetMasks` function.
I just want to use `GetProcAddress("SetThreadSelectedCpuSetMasks"` to detect what Windows we are running: Windows 11 or pre-Windows 11, because we can use 192 calls of `CreateThread` in Windows 11, and only 64 calls of `CreateThread` in Windows 10.
Last edit: Igor Pavlov 2024-12-04
I did not code anything for this, `start` is a built-in command of the Windows Command Processor (`cmd.exe`) which allows setting the affinity of the process it launches with `/affinity <mask>`. (it also has a parameter `/node` to select the NUMA node, but doesn't appear to work for selecting processor groups)
If you don't want to use `SetThreadSelectedCpuSetMasks`, maybe checking the Windows build number would be better?
So using the `GetVersion` API call and checking `dwBuild >= 20348` would be the most compatible thing to do.
But it should not really be necessary, since `GetProcessGroupAffinity` returns whatever settings the OS or the user has applied to the current process.
Last edit: Alice 2024-12-05
You have code examples above for core counting, but you called it without `/affinity FFFFFFFFFFEFFBFE`.
The core counting function can return different results in Windows 10. It can return 64 cores, and it can return 192 cores in other cases.
But `CreateThread` will always be in one group in Windows 10.
If we count 192 cores in Windows 10, we don't want to call `CreateThread` 192 times. Instead we want to call `CreateThread` only 64 times.
Last edit: Igor Pavlov 2024-12-05
I don't understand. There is no "192-bit affinity mask". The affinity mask will always be 64-bit since one processor group can only contain up to 64 logical CPUs. There is no hypothetical affinity mask a process would have if one of its threads would get moved to a different processor group. (the user just cannot set such a value)
The feature to spread the threads to different processor groups only exists on Windows 11+ and Windows Server 2022+. On Windows 10, there is no way to ever have the code return more than 64 since `SetThreadSelectedCpuSetMasks` is not available and all processes are confined into their primary processor group.
Windows 10 supports 192 cores.
So the code that counts the cores can return 192 cores in Windows 10 at some conditions.
Example:
There is some EXE application that knows about Windows 10 groups. It counts 192 cores and uses all these 192 cores with the `CreateRemoteThreadEx` function.
Then this application calls a future version of `7z.dll`.
`7z.dll` counts the cores and also sees 192 cores. So `7z.dll` calls `CreateThread` 192 times in Windows 10.
We don't want it. So we must know that we are running in `Windows 10` instead of `Windows 11`. That is why `GetProcAddress("SetThreadSelectedCpuSetMasks"` can help us.
Last edit: Igor Pavlov 2024-12-06
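The decision described in this post can be written down as a small C sketch. The function and parameter names are hypothetical, and how `os_spreads_threads` gets determined (a `GetProcAddress` probe, a build number check, or something else) is deliberately left open.

```c
/* Hypothetical helper: upper bound for threads created with plain CreateThread.
   total_cpus:         logical CPUs counted across all processor groups
   primary_group_cpus: logical CPUs in the primary processor group (at most 64)
   os_spreads_threads: nonzero if the OS itself places new threads in other
                       groups (the Windows 11 behavior discussed in this thread) */
unsigned max_plain_threads(unsigned total_cpus,
                           unsigned primary_group_cpus,
                           int os_spreads_threads)
{
    if (os_spreads_threads)
        return total_cpus;
    /* Older systems: plain threads stay inside the primary group. */
    return primary_group_cpus < total_cpus ? primary_group_cpus : total_cpus;
}
```

For the 192-thread machine from this thread, `max_plain_threads(192, 64, 0)` gives 64 and `max_plain_threads(192, 64, 1)` gives 192.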
I think there is some misunderstanding.
No, it can not. It only could return 64 on Windows 10.
If an application wants to use 192 cores on Windows 10, then that application has to manually do all the scheduling work and manually move around its own threads to the processor groups it wants them to run on. (otherwise it will be limited to at most 64 cores in its primary processor group)
If any of those 192 manually scheduled threads calls `GetProcessGroupAffinity`, it will only get its own one (and only) processor group as result, which can contain at most 64 logical CPUs. So the code above would only return up to 64, not up to 192.
If one of those 192 threads would call a `7z.dll` export, then 7-Zip would also only get up to 64 as result, not up to 192. If `7z.dll` wanted to use 192 cores, it would have to manually implement its own scheduling and manually move its threads to the correct processor groups. If `7z.dll` just starts 192 threads, all of them will run only on up to 64 cores in the thread's processor group.
This is what I initially listed as:
On Windows 11 it is still possible to manually schedule everything (and you would probably get better performance out of it), but it is no longer necessary in order to use more than 64 cores.
An application could just create 192 threads and have Windows 11 automatically distribute them among the 192 cores. However, for this to work two conditions need to be true:
I suppose you are wrong.
`GetProcessGroupAffinity` was supported in Windows 10.
So `GetProcessGroupAffinity` can return all groups in Windows 10.
If `GetProcessGroupAffinity` can't return several groups, then why Windows 10 needs `GroupCount` and `GroupArray` variable in `GetProcessGroupAffinity` call?

Ok, now I see where my error of thoughts was:
In both cases it is the same OS and the API calls return the same values, but there is no way to differentiate them. `7z.dll` would not know if the process calling the DLL has already changed affinity masks or not, so it would not know if it should create 64 or 192 threads.
Unfortunately, there is no way I see for a DLL to get that information. There is no documented API for it and no way to know what a third party process has done before calling the DLL. If the DLL would now by itself start scheduling threads, this might influence the calling process who might be relying on Windows 11 doing all the scheduling work.
So `CreateThread` will be ineffective in Windows 11, if the process used `CreateRemoteThreadEx` for some group before?
If it's so, then it's not good for main process too. If DLL calls `CreateRemoteThreadEx` for best affinity, main process will be ineffective with `CreateThread` later.
The things can be simpler for Windows 10, because only `CreateRemoteThreadEx` is effective in Windows. So exe and dll can try to use `CreateRemoteThreadEx` for more threads.
BTW, you still can call 7-zip benchmark with 192 cores from command line:
Last edit: Igor Pavlov 2024-12-07
I did some tests:
* `SetProcessAffinityMask` (available Windows XP+) and `SetProcessDefaultCpuSetMasks` (available Windows 11+) kill Windows 11 automatic scheduling. However, if executed again and only passing `0`/`NULL` parameters, the original state can be restored.
* `SetThreadAffinityMask`, `SetThreadGroupAffinity`, `SetProcessDefaultCpuSets`, `SetThreadSelectedCpuSets`, `SetThreadSelectedCpuSetMasks` don't seem to disable Windows 11 automatic scheduling.
* `CreateRemoteThreadEx` does not influence Windows 11 automatic scheduling, if used to launch a thread in a specific processor group.

So, in theory it is possible to restore Windows 11 automatic thread scheduling if some third party code has disabled it. Of course, that application probably had some reason to do it.
Last edit: Alice 2024-12-07