Stockfish net size 1024 is officially the new No. 1, including a new green net

By Max Siegfried Date 2021-08-16 14:39

https://abrok.eu/stockfish/
Author: Tomasz Sobczyk
Date: Sun Aug 15 12:05:43 2021 +0200
Timestamp: 1629021943

New NNUE architecture and net

Introduces a new NNUE network architecture and associated network parameters

The summary of the changes:

* Position for each perspective mirrored such that the king is on e..h files. Cuts the feature transformer size in half, while preserving enough knowledge to be good. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.b40q4rb1w7on.
* The number of neurons after the feature transformer increased two-fold, to 1024x2. This is possibly mostly due to the now very optimized feature transformer update code.
* The number of neurons after the second layer is reduced from 16 to 8, to reduce the speed impact. This, perhaps surprisingly, doesn't harm the strength much. See https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit#heading=h.6qkocr97fezq
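The mirroring in the first point can be pictured with a minimal sketch. It assumes squares are numbered 0..63 with a1 = 0 and h1 = 7; that numbering and the function names are assumptions made for illustration, not taken from the commit. Whenever the king stands on files a..d, every square is flipped horizontally, so only 32 distinct king squares remain.

```python
def file_of(sq):
    """File index 0..7 (a..h) of a square numbered 0..63, a1 = 0 (assumed layout)."""
    return sq & 7


def orient(king_sq, sq):
    """Mirror sq horizontally if the king is on files a..d, else leave it alone."""
    if file_of(king_sq) < 4:
        return sq ^ 7  # XOR with 7 maps file f to file 7 - f, rank unchanged
    return sq


def king_squares_after_mirroring():
    """All king squares that can still occur once mirroring is applied."""
    return {orient(k, k) for k in range(64)}


if __name__ == "__main__":
    print(len(king_squares_after_mirroring()))  # half of the 64 squares
```

Because the king can only end up on 32 squares, a feature set indexed by king square needs half as many blocks, which is the halving of the feature transformer mentioned above.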

The AffineTransform code did not work out of the box with the smaller number of neurons after the second layer, so some temporary changes have been made to add a special case for InputDimensions == 8. Also additional 0 padding is added to the output for some archs that cannot process inputs by <=8 (SSE2, NEON). VNNI uses an implementation that can keep all outputs in the registers while reducing the number of loads by 3 for each 16 inputs, thanks to the reduced number of output neurons. However, GCC is particularly bad at optimization here (and perhaps why the current way the affine transform is done even passed sprt) (see https://docs.google.com/document/d/1gTlrr02qSNKiXNZ_SuO4-RjK4MXBiFlLE6jvNqqMkAY/edit# for details) and more work will be done on this in the following days. I expect the current VNNI implementation to be improved and extended to other architectures.

The network was trained with a slightly modified version of the pytorch trainer (https://github.com/glinscott/nnue-pytorch); the changes are in https://github.com/glinscott/nnue-pytorch/pull/143

The training utilized 2 datasets.

dataset A - https://drive.google.com/file/d/1VlhnHL8f-20AXhGkILujnNXHwy9T-MQw/view?usp=sharing
dataset B - as described in https://github.com/official-stockfish/Stockfish/commit/ba01f4b95448bcb324755f4dd2a632a57c6e67bc

The training process was as follows:

train on dataset A for 350 epochs, take the best net in terms of elo at 20k nodes per move (it's fine to take anything from later stages of training).
convert the .ckpt to .pt
--resume-from-model from the .pt file, train on dataset B for <600 epochs, take the best net. Lambda=0.8, applied before the loss function.
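One plausible reading of "Lambda, applied before the loss function" is that the training target is a blend of the search evaluation and the game result, and the blend is computed before any loss is taken. The sketch below illustrates only that reading. The sigmoid mapping, the `scale` value and all names are assumptions for illustration; the trainer's actual formula is in the pull request linked above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def blended_target(score_cp, outcome, lam, scale=400.0):
    """Blend an evaluation-based target with the game result.

    score_cp: search score in centipawns from the side to move.
    outcome:  game result from the side to move (1.0 win, 0.5 draw, 0.0 loss).
    lam:      1.0 uses only the evaluation, 0.0 uses only the game result.
    scale:    hypothetical constant mapping centipawns to a win probability.
    """
    eval_target = sigmoid(score_cp / scale)
    return lam * eval_target + (1.0 - lam) * outcome


if __name__ == "__main__":
    # First stage (lambda=1.0) ignores the result, second stage (lambda=0.8) mixes it in.
    print(blended_target(0.0, 1.0, 1.0))
    print(blended_target(0.0, 1.0, 0.8))
```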

The first training command:

python3 train.py \
../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
../nnue-pytorch-training/data/large_gensfen_multipvdiff_100_d9.binpack \
--gpus "$3," \
--threads 1 \
--num-workers 1 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--smart-fen-skipping \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=1.0 \
--max_epochs=600 \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2

The second training command:

python3 serialize.py \
--features=HalfKAv2_hm^ \
../nnue-pytorch-training/experiment_131/run_6/default/version_0/checkpoints/epoch-499.ckpt \
../nnue-pytorch-training/experiment_$1/base/base.pt

python3 train.py \
../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
../nnue-pytorch-training/data/michael_commit_b94a65.binpack \
--gpus "$3," \
--threads 1 \
--num-workers 1 \
--batch-size 16384 \
--progress_bar_refresh_rate 20 \
--smart-fen-skipping \
--random-fen-skipping 3 \
--features=HalfKAv2_hm^ \
--lambda=0.8 \
--max_epochs=600 \
--resume-from-model ../nnue-pytorch-training/experiment_$1/base/base.pt \
--default_root_dir ../nnue-pytorch-training/experiment_$1/run_$2

STC: https://tests.stockfishchess.org/tests/view/611120b32a8a49ac5be798c4

LLR: 2.97 (-2.94,2.94) <-0.50,2.50>
Total: 22480 W: 2434 L: 2251 D: 17795 Elo +2.83
Ptnml(0-2): 101, 1736, 7410, 1865, 128

LTC: https://tests.stockfishchess.org/tests/view/611152b32a8a49ac5be798ea

LLR: 2.93 (-2.94,2.94) <0.50,3.50>
Total: 9776 W: 442 L: 333 D: 9001 Elo +3.87
Ptnml(0-2): 5, 295, 4180, 402, 6
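The Elo figures in both results can be reproduced from the win/loss/draw counts with the usual logistic Elo formula. The sketch below is only a cross-check; the function names are made up for illustration, and it does not reproduce the LLR or the SPRT bounds. It also checks that the pentanomial counts add up to one entry per game pair.

```python
import math


def logistic_elo(wins, losses, draws):
    """Elo difference implied by the score fraction under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


def pairs_match(total_games, ptnml):
    """Pentanomial counts are per game pair, so they should sum to games / 2."""
    return 2 * sum(ptnml) == total_games


stc = logistic_elo(2434, 2251, 17795)
ltc = logistic_elo(442, 333, 9001)

if __name__ == "__main__":
    print(f"STC {stc:+.2f}, LTC {ltc:+.2f}")  # prints "STC +2.83, LTC +3.87"
    print(pairs_match(22480, [101, 1736, 7410, 1865, 128]))
    print(pairs_match(9776, [5, 295, 4180, 402, 6]))
```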

closes https://github.com/official-stockfish/Stockfish/pull/3646

bench: 5189338
see source

https://tests.stockfishchess.org/nns
21-08-09 12:18:12 nn-e8321e467bf6.nnue Sopel 2021-08-09 12:39:01 2021-08-16 12:03:51 1358

Have you already run the test suites?

By Thomas Zipproth Date 2021-08-16 18:39

Max Siegfried wrote:

Have you already run the test suites?

On NextChessMove it has the best rating so far.
https://nextchessmove.com/dev-builds

By Peter Martan Date 2021-08-17 11:11 Edited 2021-08-17 12:06 Upvotes 1

Tactically it doesn't exactly knock my socks off:

Analysis by Stockfish 150821avx2

Code:

Solving: C:\...\Testsets\HTC114.cbh
Maximum solution time = 60s.

1. Hard-Talkchess-2020.001,  HTC114   Solved in 18.02s/51; Solved: 1
2. Hard-Talkchess-2020.002,  HTC114   Solved in 16.05s/32; Solved: 2
3. Hard-Talkchess-2020.003,  HTC114   > 60s.
4. Hard-Talkchess-2020.007,  HTC114   > 60s.
5. Hard-Talkchess-2020.008,  HTC114   > 60s.
6. Hard-Talkchess-2020.010,  HTC114   Solved in 0.25s/17; Solved: 3
7. Hard-Talkchess-2020.011,  HTC114   Solved in 30.34s/34; Solved: 4
8. Hard-Talkchess-2020.012,  HTC114   Solved in 3.27s/25; Solved: 5
9. Hard-Talkchess-2020.013,  HTC114   Solved in 24.97s/39; Solved: 6
10. Hard-Talkchess-2020.014,  HTC114   > 60s.
11. Hard-Talkchess-2020.016,  HTC114   Solved in 18.20s/24; Solved: 7
12. Hard-Talkchess-2020.019,  HTC114   > 60s.
13. Hard-Talkchess-2020.020,  HTC114   Solved in 7.80s/27; Solved: 8
14. Hard-Talkchess-2020.021,  HTC114   Solved in 9.14s/38; Solved: 9
15. Hard-Talkchess-2020.023,  HTC114   > 60s.
16. Hard-Talkchess-2020.028,  HTC114   Solved in 0.17s/14; Solved: 10
17. Hard-Talkchess-2020.029,  HTC114   Solved in 3.11s/21; Solved: 11
18. Hard-Talkchess-2020.031,  HTC114   Solved in 14.03s/48; Solved: 12
19. Hard-Talkchess-2020.034,  HTC114   Solved in 40.20s/45; Solved: 13
20. Hard-Talkchess-2020.035,  HTC114   Solved in 28.20s/39; Solved: 14
21. Hard-Talkchess-2020.036,  HTC114   Solved in 1.91s/22; Solved: 15
22. Hard-Talkchess-2020.038,  HTC114   > 60s.
23. Hard-Talkchess-2020.039,  HTC114   > 60s.
24. Hard-Talkchess-2020.043,  HTC114   > 60s.
25. Hard-Talkchess-2020.046,  HTC114   Solved in 0.11s/10; Solved: 16
26. Hard-Talkchess-2020.047,  HTC114   > 60s.
27. Hard-Talkchess-2020.049,  HTC114   Solved in 0.20s/16; Solved: 17
28. Hard-Talkchess-2020.050,  HTC114   > 60s.
29. Hard-Talkchess-2020.052,  HTC114   Solved in 6.45s/27; Solved: 18
30. Hard-Talkchess-2020.053,  HTC114   Solved in 30.73s/46; Solved: 19
31. Hard-Talkchess-2020.054,  HTC114   > 60s.
32. Hard-Talkchess-2020.056,  HTC114   Solved in 22.67s/33; Solved: 20
33. Hard-Talkchess-2020.058,  HTC114   Solved in 0.16s/10; Solved: 21
34. Hard-Talkchess-2020.059,  HTC114   Solved in 11.75s/32; Solved: 22
35. Hard-Talkchess-2020.061,  HTC114   Solved in 9.09s/33; Solved: 23
36. Hard-Talkchess-2020.065,  HTC114   Solved in 0.53s/17; Solved: 24
37. Hard-Talkchess-2020.066,  HTC114   Solved in 0.09s/10; Solved: 25
38. Hard-Talkchess-2020.067,  HTC114   > 60s.
39. Hard-Talkchess-2020.068,  HTC114   Solved in 24.45s/43; Solved: 26
40. Hard-Talkchess-2020.069,  HTC114   > 60s.
41. Hard-Talkchess-2020.070,  HTC114   Solved in 19.03s/45; Solved: 27
42. Hard-Talkchess-2020.072,  HTC114   Solved in 3.30s/49; Solved: 28
43. Hard-Talkchess-2020.074,  HTC114   > 60s.
44. Hard-Talkchess-2020.078,  HTC114   > 60s.
45. Hard-Talkchess-2020.081,  HTC114   > 60s.
46. Hard-Talkchess-2020.083,  HTC114   Solved in 4.64s/29; Solved: 29
47. Hard-Talkchess-2020.087,  HTC114   Solved in 31.14s/66; Solved: 30
48. Hard-Talkchess-2020.089,  HTC114   Solved in 1.20s/17; Solved: 31
49. Hard-Talkchess-2020.090,  HTC114   Solved in 0.08s/16; Solved: 32
50. Hard-Talkchess-2020.091,  HTC114   > 60s.
51. Hard-Talkchess-2020.093,  HTC114   > 60s.
52. Hard-Talkchess-2020.094,  HTC114   Solved in 53.59s/57; Solved: 33
53. Hard-Talkchess-2020.095,  HTC114   > 60s.
54. Hard-Talkchess-2020.096,  HTC114   Solved in 2.98s/26; Solved: 34
55. Hard-Talkchess-2020.097,  HTC114   Solved in 6.83s/29; Solved: 35
56. Hard-Talkchess-2020.099,  HTC114   > 60s.
57. Hard-Talkchess-2020.101,  HTC114   Solved in 0.27s/18; Solved: 36
58. Hard-Talkchess-2020.103,  HTC114   > 60s.
59. Hard-Talkchess-2020.105,  HTC114   Solved in 7.73s/48; Solved: 37
60. Hard-Talkchess-2020.108,  HTC114   > 60s.
61. Hard-Talkchess-2020.109,  HTC114   Solved in 18.89s/45; Solved: 38
62. Hard-Talkchess-2020.110,  HTC114   Solved in 1.34s/27; Solved: 39
63. Hard-Talkchess-2020.113,  HTC114   > 60s.
64. Hard-Talkchess-2020.114,  HTC114   Solved in 0.42s/29; Solved: 40
65. Hard-Talkchess-2020.116,  HTC114   > 60s.
66. Hard-Talkchess-2020.117,  HTC114   > 60s.
67. Hard-Talkchess-2020.119,  HTC114   Solved in 17s/43; Solved: 41
68. Hard-Talkchess-2020.121,  HTC114   Solved in 10.41s/48; Solved: 42
69. Hard-Talkchess-2020.122,  HTC114   Solved in 13.38s/34; Solved: 43
70. Hard-Talkchess-2020.125,  HTC114   Solved in 0.17s/20; Solved: 44
71. Hard-Talkchess-2020.126,  HTC114   Solved in 28.88s/50; Solved: 45
72. Hard-Talkchess-2020.127,  HTC114   Solved in 1.05s/20; Solved: 46
73. Hard-Talkchess-2020.129,  HTC114   > 60s.
74. Hard-Talkchess-2020.130,  HTC114   > 60s.
75. Hard-Talkchess-2020.131,  HTC114   Solved in 6.58s/30; Solved: 47
76. Hard-Talkchess-2020.132,  HTC114   Solved in 1.09s/22; Solved: 48
77. Hard-Talkchess-2020.133,  HTC114   Solved in 3.11s/34; Solved: 49
78. Hard-Talkchess-2020.135,  HTC114   > 60s.
79. Hard-Talkchess-2020.140,  HTC114   Solved in 0.23s/28; Solved: 50
80. Hard-Talkchess-2020.144,  HTC114   > 60s.
81. Hard-Talkchess-2020.146,  HTC114   Solved in 57.83s/29; Solved: 51
82. Hard-Talkchess-2020.147,  HTC114   > 60s.
83. Hard-Talkchess-2020.153,  HTC114   > 60s.
84. Hard-Talkchess-2020.155,  HTC114   Solved in 0.52s/19; Solved: 52
85. Hard-Talkchess-2020.156,  HTC114   > 60s.
86. Hard-Talkchess-2020.158,  HTC114   > 60s.
87. Hard-Talkchess-2020.159,  HTC114   Solved in 4.81s/24; Solved: 53
88. Hard-Talkchess-2020.160,  HTC114   Solved in 4.09s/23; Solved: 54
89. Hard-Talkchess-2020.164,  HTC114   Solved in 3.30s/25; Solved: 55
90. Hard-Talkchess-2020.166,  HTC114   > 60s.
91. Hard-Talkchess-2020.169,  HTC114   > 60s.
92. Hard-Talkchess-2020.170,  HTC114   > 60s.
93. Hard-Talkchess-2020.171,  HTC114   > 60s.
94. Hard-Talkchess-2020.177,  HTC114   > 60s.
95. Hard-Talkchess-2020.179,  HTC114   Solved in 25.36s/29; Solved: 56
96. Hard-Talkchess-2020.181,  HTC114   Solved in 22.91s/38; Solved: 57
97. Hard-Talkchess-2020.182,  HTC114   Solved in 13.86s/31; Solved: 58
98. Hard-Talkchess-2020.183,  HTC114   Solved in 16.81s/29; Solved: 59
99. Hard-Talkchess-2020.184,  HTC114   Solved in 41.28s/33; Solved: 60
100. Hard-Talkchess-2020.185,  HTC114   Solved in 0.22s/15; Solved: 61
101. Hard-Talkchess-2020.186,  HTC114   Solved in 1.73s/22; Solved: 62
102. Hard-Talkchess-2020.190,  HTC114   Solved in 59.03s/35; Solved: 63
103. Hard-Talkchess-2020.191,  HTC114   Solved in 28.56s/29; Solved: 64
104. Hard-Talkchess-2020.194,  HTC114   Solved in 0.50s/19; Solved: 65
105. Hard-Talkchess-2020.195,  HTC114   Solved in 9.75s/26; Solved: 66
106. Hard-Talkchess-2020.196,  HTC114   > 60s.
107. Hard-Talkchess-2020.198,  HTC114   Solved in 11.81s/35; Solved: 67
108. Hard-Talkchess-2020.200,  HTC114   Solved in 21.20s/29; Solved: 68
109. Hard-Talkchess-2020.203,  HTC114   Solved in 46.39s/35; Solved: 69
110. Hard-Talkchess-2020.208,  HTC114   Solved in 1.17s/19; Solved: 70
111. Hard-Talkchess-2020.209,  HTC114   Solved in 10.45s/37; Solved: 71
112. Hard-Talkchess-2020.210,  HTC114   Solved in 18.61s/29; Solved: 72
113. Hard-Talkchess-2020.211,  HTC114   > 60s.
114. Hard-Talkchess-2020.213,  HTC114   Solved in 23.11s/37; Solved: 73

Result: 73 of 114 = 64.0%. Average time = 13.40s / 30.54

On Vincent Lejeune's ranking list, good results start at 80. (He collected the positions; the Hard Talkchess set has been around for years, and the small 114-position subset is taken from it. He tests at 30 minutes per position on a single core for the sake of reproducibility.) My best score, with ShashChess and with Crystal SMP at MultiPV=4, is a tie at 102/114 with 60 s per position, 32 threads and 8 GB hash. Apart from MultiPV, which in this suite does not help nearly as much as it does in the Eret, for example, those were the same settings as in this SF dev run.
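For readers who want to reproduce such a run outside a GUI, the settings described here correspond to a handful of plain UCI commands. The sketch below only builds the command strings; the helper names are made up for illustration, the start position stands in as a placeholder for a test position, and sending the commands to an engine process is left out.

```python
# Placeholder position: the standard start position, not one of the HTC114 positions.
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def uci_setup(threads, hash_mb, multipv):
    """UCI commands that configure an engine before a position test."""
    return [
        "uci",
        f"setoption name Threads value {threads}",
        f"setoption name Hash value {hash_mb}",
        f"setoption name MultiPV value {multipv}",
        "isready",
    ]


def uci_search(fen, seconds):
    """UCI commands that analyse one position for a fixed time."""
    return [f"position fen {fen}", f"go movetime {seconds * 1000}"]


if __name__ == "__main__":
    # 32 threads, 8 GB hash, MultiPV=4 and 60 s per position, as in the post.
    for line in uci_setup(32, 8192, 4) + uci_search(START_FEN, 60):
        print(line)
```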

By Peter Martan Date 2021-08-17 12:24 Upvotes 2

Peter Martan wrote:

Apart from MultiPV, which in this suite does not help nearly as much as it does in the Eret, for example, those were the same settings as in this SF dev run.

I take it all back and claim the opposite: MultiPV=4 does noticeably help this SF 150821 with the new net architecture:

Solved so far: 86 of 114; 43:58m

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   1   3   -   -   -  10   2   2   -   -   4  22  10  13   -   0   5   1  59   2
  20 |   4   -   8  21   0   -   0   -  12   -   -   2   0   -  12   0   0   -  27   1
  40 |  14   0   -   -   9  17   4   0   0   -   -  34   -   1   6   0   0   1   8   6
  60 |   3  13   -   1   -   0   7   2   1   0   8   1   5   1   -   0   1  11   0  48
  80 |   9  13   -   0   -  28   2   -   3   4   -   -   -  46  22  37   1  43  46   1
 100 |   3  38   4   1   5   -  35   6  56   2   7  38  20  14

Otherwise all settings are the same as above.
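The results grid above can be read mechanically: each cell is the solution time in whole seconds, and "-" marks a position that was not solved within the limit. A minimal sketch that recounts the solved positions from the grid (the parsing approach and names are illustrative, not part of the GUI):

```python
GRID = """\
   0 |   1   3   -   -   -  10   2   2   -   -   4  22  10  13   -   0   5   1  59   2
  20 |   4   -   8  21   0   -   0   -  12   -   -   2   0   -  12   0   0   -  27   1
  40 |  14   0   -   -   9  17   4   0   0   -   -  34   -   1   6   0   0   1   8   6
  60 |   3  13   -   1   -   0   7   2   1   0   8   1   5   1   -   0   1  11   0  48
  80 |   9  13   -   0   -  28   2   -   3   4   -   -   -  46  22  37   1  43  46   1
 100 |   3  38   4   1   5   -  35   6  56   2   7  38  20  14
"""


def parse_grid(text):
    """Return all cells in reading order, dropping the row labels before '|'."""
    cells = []
    for line in text.splitlines():
        _, _, row = line.partition("|")
        cells.extend(row.split())
    return cells


cells = parse_grid(GRID)
solved_times = [int(c) for c in cells if c != "-"]

if __name__ == "__main__":
    print(f"{len(solved_times)} of {len(cells)}")  # prints "86 of 114"
```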

By Lothar Jung Date 2021-08-17 12:50

Thanks Peter,

a very important test and finding.

From now on I will additionally test all Lc0 and Ceres engines with MultiPV=4.
Let's see what that gives.
Because of MCTS, Ceres in particular might find more that way.
Unfortunately I can't check that yet, because Ceres has a UCI quirk that still needs to be patched.

Best regards

Lothar

By Peter Martan Date 2021-08-17 13:05 Edited 2021-08-17 13:47

Yes, I simply find it interesting how much each engine benefits from MultiPV, since it is an important feature for analysis. The results you get from position tests say more about analysis ability than about game playing, at least if you only look at the tactical best-move suites and at short-TC engine-engine matches.

Just don't be too disappointed if the gain from MultiPV is smaller with LC0 and Ceres than with the A-B engines. Even with those it depends more on how much pruning, above all how much null move and LMR, they use in the primary and non-primary lines; in Multi Primary V. mode several lines are promoted to primary as far as the search parameters are concerned.

Real MCTS should lose less time to depth from having several primaries in the output. My experience, including with Komodo MCTS, points exactly that way: as Larry Kaufman writes, it is "free" in terms of TC cost. Of course that is not quite true either, and it simply helps much less than with A-B.
Just my two cents.

And with Ceres 0.93 in particular I got exactly the same results with MultiPV=1 and 4, if you remember:

https://forum.computerschach.de/cgi-bin/mwf/topic_show.pl?pid=146914#pid146914

LC0 was also almost the same with and without Multi.

In Shredder, putting
MultiPV=4
as syntax into the .uci file worked well with Ceres 0.93, as far as displaying 4 lines and sorting them by evaluation is concerned. (It cannot be set via the GUI for the automatic position test. In Arena it can, but the scoring is wrong there: the order of the solution moves is reversed in the Arena analysis file when the best move is scored.) The only things missing are the
---
lines between the packets of 4 that you see in the GUI's MV mode, but you always get 4 lines with the same time stamp. The top one is reported as the best move, on the board and in the log file as well.
The engines I have run in Shredder this way recently have all worked. There are others that don't: Eman, for example, never worked with it (the GUI scores correct solutions wrongly), and SlowChess doesn't work either. To be safe, watch for a while at the start; if the first 10 are right, you can rely on the rest being right as well.

At the end of the log file you also get the handy results diagram. If you read it from this .txt file, it contains no tabs except at the line breaks and can be inserted nicely with the tt icon of the forum software.

By dkappe Date 2021-08-17 14:04 Upvotes 1

Lothar Jung wrote:

From now on I will additionally test all Lc0 and Ceres engines with MultiPV=4.

With MCTS, multipv=4 does nothing except show the additional moves from the search tree. The search runs exactly as before.
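A toy model makes the point concrete. If the reported lines are simply the most visited root moves of a tree that is searched the same way regardless of the setting, then the first line is identical for any MultiPV value. The sketch below is an illustration under that assumption, not code from Lc0 or Ceres, and the visit counts are invented.

```python
def multipv_lines(root_visits, multipv):
    """Pick the multipv most visited root moves from an already searched tree."""
    ranked = sorted(root_visits.items(), key=lambda kv: kv[1], reverse=True)
    return [move for move, _ in ranked[:multipv]]


# Hypothetical visit counts at the root after a finished search.
visits = {"e2e4": 5200, "d2d4": 4100, "g1f3": 900, "c2c4": 650, "b1c3": 40}

if __name__ == "__main__":
    print(multipv_lines(visits, 1))
    print(multipv_lines(visits, 4))
```

An alpha-beta engine behaves differently, because each additional line is searched with its own full window, which changes what the search spends its time on.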

By Peter Martan Date 2021-08-17 17:12

Just as I thought (more or less). Just don't let Larry Kaufman read that.

By dkappe Date 2021-08-17 18:36

With AB engines it does help. With lc0 it doesn't. With Dragon MCTS? No idea.

By Peter Martan Date 2021-08-17 19:03 Edited 2021-08-17 19:13

Well, if you ask me, little to nothing, again speaking only of positions with a clear tactical single best move.

The problem is that Komodo MCTS is much weaker than non-MCTS in tactical position tests anyway. As a result the currently popular suites such as the HTC take considerably longer and are less reproducible, because a single find among only a few makes a much bigger relative difference. Single core is too boring for me and would need an even longer TC, and with avx2 I haven't even tried to measure it objectively yet.

HTC114 is currently running with Dragon2 MCTS at MultiPV=4; I will only publish the result if it is better than with a single primary variation.

By Max Siegfried Date 2021-08-20 08:04

The new net architecture has been tested:
https://www.sp-cc.de/