Finally!
More than a year after Torch and months after PlentyChess, the Stockfish team has now also managed to integrate the verbatim net patch, which promises a very significant speedup in rating-list test runs, and thus (hopefully) measurable Celo progress for Stockfish. I have just started the test run for my UHO Top 15 rating list.
Author: Tomasz Sobczyk
Date: Sun Nov 2 16:04:09 2025 +0100
Timestamp: 1762095849
Use shared memory for network weights
This enables different Stockfish processes that use the same weights to use the
same memory. The approach establishes equivalence by memory content, and is
compatible with NUMA replication. The benefit of sharing is reduced memory usage
and a speedup thanks to improved (inter-process) caching of the network in the
CPU's cache, and thus reduced bandwidth usage to main memory. Even though this
change doesn't benefit a user running a single process, it helps on fishtest
or, for example, on Lichess, where multiple games run concurrently or multiple
positions are analyzed in parallel.
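As a rough illustration (a minimal POSIX/Linux sketch, not the actual Stockfish implementation): equivalence by memory content can be established by naming the shared-memory object after a hash of the weight blob, so every process that loads byte-identical weights attaches to the same physical pages. The helper and hash below are purely illustrative.
```
// Minimal sketch (POSIX/Linux only, illustrative): share network weights
// between processes by naming the shared-memory object after the weight
// contents. Creator/consumer synchronisation and cleanup are omitted.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Any stable 64-bit content hash works; FNV-1a is used here only for brevity.
static std::uint64_t content_hash(const void* data, std::size_t len) {
    auto p = static_cast<const unsigned char*>(data);
    std::uint64_t h = 14695981039346656037ull;
    for (std::size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Returns a read-only view of the weights, shared by every process whose
// weights have identical contents (they all compute the same object name).
const void* map_shared_weights(const void* weights, std::size_t size) {
    std::string name = "/nnue-" + std::to_string(content_hash(weights, size));

    int fd = shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd >= 0) {  // first process with these weights: create and fill the object
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return nullptr; }
        void* w = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (w == MAP_FAILED) { close(fd); return nullptr; }
        std::memcpy(w, weights, size);
        munmap(w, size);
    } else {        // object already exists: just attach to it
        fd = shm_open(name.c_str(), O_RDONLY, 0600);
        if (fd < 0) return nullptr;
    }

    void* view = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);      // the mapping stays valid after the descriptor is closed
    return view == MAP_FAILED ? nullptr : view;
}
```
A real implementation additionally has to synchronise the process that fills the object with later processes attaching to it, and to unlink the object once the last user exits; NUMA replication can be accommodated by keeping one such object per NUMA node, e.g. by including the node index in the name.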
This concept was probably first introduced in the Monty engine
(https://github.com/official-monty/Monty/pull/62), after a discussion in
https://github.com/official-stockfish/fishtest/issues/2077 on the issue of
memory pressure. Measurements based on Torch
(https://github.com/user-attachments/files/21386224/verbatim.pdf) further
suggested that large gains were possible. Multiple other engines have
adopted this 'verbatim' format as well.
The implementation here adds the flexibility needed for SF: for example, it retains
the ability to bundle compressed networks with the binary, to load nets via UCI
option, and to distribute the shared nets to the proper NUMA region. This
flexibility comes with a fair amount of implementation complexity, such as
OS-specific code and fallback code.
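A sketch of what such a fallback can look like (again POSIX-only and not taken from the patch): try the shared, named mapping first and silently revert to the old private per-process copy if anything on that path fails, e.g. because shared memory is unavailable or too small.
```
// Illustrative fallback policy (POSIX): prefer a shared, named mapping of the
// weights; on any failure fall back to a private copy, i.e. pre-patch behaviour.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

const void* load_weights(const void* weights, std::size_t size, const char* shm_name) {
    // Preferred path: shared, named mapping (see the previous sketch for how
    // the name can be derived from the weight contents).
    int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
    if (fd >= 0) {
        void* w = MAP_FAILED;
        if (ftruncate(fd, static_cast<off_t>(size)) == 0)
            w = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (w != MAP_FAILED) {
            std::memcpy(w, weights, size);  // a real loader fills the object only once
            return w;
        }
    }

    // Fallback path: ordinary private per-process allocation (pre-patch behaviour).
    void* priv = std::malloc(size);
    if (priv) std::memcpy(priv, weights, size);
    return priv;
}
```
On Windows, named file mappings (CreateFileMapping / MapViewOfFile) play the role that shm_open/mmap play here, which is presumably where much of the OS-specific code lives.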
For most users this should be transparent. However, those running Docker
containers, for example, should ensure that the `--ipc` flag is set correctly
and that `--shm-size` is sufficiently large.
The benefits of this patch depend significantly on the hardware, with systems
that have many cores and a large L3 cache (O(150MB), the size of the net)
typically benefitting the most. On such systems SF speedups (as measured via
nps while playing games with large concurrency but just 1 thread) can reach
38%, which translates into a gain of about 25 Elo for the patch over master:
```
   # PLAYER          :  RATING  ERROR   POINTS  PLAYED  (%)
   1 shared_memoryPR :    24.8    1.9  39432.0   73728   53
   2 master          :     0.0   ----  34296.0   73728   47
```
In a multithreaded setup, where the weights are already shared between threads,
the benefit is smaller, for example on the same HW as above but with 8 threads
for each side:
```
   # PLAYER          :  RATING  ERROR   POINTS  PLAYED  (%)
   1 shared_memoryPR :     5.2    3.5   9351.0   18432   51
   2 master          :     0.0   ----   9081.0   18432   49
```
On fishtest with a typical hardware mix of our contributors, the following was measured:
STC, 60k games
https://tests.stockfishchess.org/tests/view/69074a49ea4b268f1fac236c
Elo: 4.69 ± 1.4 (95%) LOS: 100.0%
Total: 60000 W: 16085 L: 15275 D: 28640 Elo +4.69
Ptnml(0-2): 154, 6440, 16053, 7148, 205
nElo: 9.38 ± 2.8 (95%) PairsRatio: 1.12
To verify correctness with a single process on a NUMA architecture,
speedtest was used, confirming near equivalence:
```
master: Average (over 10): 296236186
shared_memory: Average (over 10): 295769332
```
Currently, using large pages for the shared network weights is not always
possible, which can lead to a small slowdown (1-2%) when only a single process
is run.
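One plausible reason, under the assumption (not stated in the patch) that the shared weights live in a named POSIX shared-memory object on Linux: such an object cannot be mapped with MAP_HUGETLB, whereas a private anonymous mapping can request explicit huge pages directly, as in the sketch below.
```
// Illustrative contrast only (Linux): a private anonymous mapping can ask for
// explicit huge pages, something a named POSIX shm object cannot do.
#include <cstddef>
#include <sys/mman.h>

// Private weight buffer with explicit huge pages, falling back to normal pages.
// Callers typically round `size` up to a multiple of the huge-page size first.
void* alloc_private_weights(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  // no huge pages reserved/available: use normal pages
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```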
closes https://github.com/official-stockfish/Stockfish/pull/6173

No functional change