Lc0 with a Double Heart

By Lothar Jung Date 2025-01-06 09:44 Edited 2025-01-06 09:49 Upvotes 1

The developer of KataGo ported his Go engine to the Mac and had the AI compute in parallel on the GPU and the Neural Engine (NE).
This approach is now being implemented for Lc0 as well.
Multiplexing across the GPU and the Neural Engine brings a clear performance gain.
It is also a blueprint for other processors with a dedicated AI unit.
I am curious to see what level this will lift Lc0 to.

Here is his post on Discord from today:

„This is absolutely crazy, but also incredibly exciting! I just have to share it with you all. The GPU and Neural Engine (NE) can run in parallel with a modified demux backend, and the performance is impressive. From my benchmarking, the combined GPU+NE setup reaches a throughput of 587.922 nps, which is a huge step up from the GPU-only performance of 434.691 nps.

That said, there’s still room for improvement in the source code to handle the imbalance in processing rates between the GPU and NE. While the NE achieves a throughput of 210.861 nps, the GPU ends up waiting for the NE to finish if the workload is evenly distributed. To solve this, I’ve modified the source code so the GPU processes two-thirds of the data, while the NE handles one-third. This way, both GPU and NE finish their batches almost simultaneously, minimizing the time either one spends waiting for the other.

It’s an exciting optimization that brings everything closer to peak performance!
```
       _
|   _ | |
|_ |_ |_| v0.32.0-dev+git.dirty built Jan 6 2025
Found pb network file: /Users/chinchangyang/Code/lc0-ccy/xcuserdata/lc0/DerivedData/lc0/Build/Products/Release/BT4-1024x15x32h-swa-6147500.pb.gz
Weights file has multihead format, updating format flag
Creating backend [demux]...
Creating backend [metal]...
Initialized metal backend on device Apple M3 Max
Creating backend [coreml]...
Compiling model: lc0.mlpackage/ -- file:///Users/chinchangyang/Code/lc0-ccy/xcuserdata/lc0/DerivedData/lc0/Build/Products/Release/
Compiled model URL: file:///var/folders/dv/kdr9x4yn4s106_94ydk5jnjc0000gn/T/lc0_43964B54-0554-4B72-B151-911B2941BDC8.mlmodelc
Initializing model with the compiled model URL...
Model successfully initialized
Benchmark batch size 30 with inference average time 53.6801ms - throughput 558.867 nps.
Benchmark batch size 33 with inference average time 56.8639ms - throughput 580.333 nps.
Benchmark batch size 36 with inference average time 61.2326ms - throughput 587.922 nps.
```
“
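
The two-thirds/one-third split described in the quoted post can be checked against the throughputs it reports. The following is only a minimal sketch, not the actual change to the demux backend: it assumes that each device's processing time grows linearly with the number of positions it receives, and the helper names `split_batch` and `finish_time_ms` are hypothetical.

```python
# Throughputs quoted in the post, in nodes per second.
GPU_NPS = 434.691
NE_NPS = 210.861


def split_batch(batch_size, gpu_nps, ne_nps):
    """Split a batch in proportion to device throughput (hypothetical helper).

    Assumes processing time is linear in the number of positions,
    which real hardware only approximates.
    """
    gpu_share = gpu_nps / (gpu_nps + ne_nps)
    gpu_items = round(batch_size * gpu_share)
    return gpu_items, batch_size - gpu_items


def finish_time_ms(items, nps):
    """Time in milliseconds to process `items` positions at `nps`."""
    return 1000.0 * items / nps


gpu_items, ne_items = split_batch(36, GPU_NPS, NE_NPS)
print(gpu_items, ne_items)  # 24 12

# Proportional split: both devices finish at nearly the same time.
print(finish_time_ms(gpu_items, GPU_NPS), finish_time_ms(ne_items, NE_NPS))

# Even split: the GPU finishes early and waits for the NE.
print(finish_time_ms(18, GPU_NPS), finish_time_ms(18, NE_NPS))
```

For a batch of 36, the proportional split comes out at 24 positions for the GPU and 12 for the NE, which is the two-thirds/one-third ratio from the post. Under the same linear assumption, an even split would leave the GPU idle for a large part of every batch.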

By Max Siegfried Date 2025-01-07 08:27

Roughly speaking, the progression went like this:
150.000 nps... 300.000 nps... 450.000 nps... and now, with the Neural Engine added, 600.000 nps in total.
+150.000 nps, just like that, is a massive jump.
He still has to adjust and optimize various things, and at least another +60.000 nps (150.000 + 60.000 = 210.000 nps) should be possible.
So instead of a GPU alone at 450.000 nps, the GPU and the Neural Engine (+210.000 nps) together give at least 660.000 nps.
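
As a rough cross-check of the estimate above, the exact figures from the quoted benchmark can be added up directly. This is a sketch under the assumption that, in the ideal case, the standalone throughputs of the GPU and the Neural Engine simply add. The variable names are illustrative.

```python
# Figures from the quoted Discord post, in nodes per second.
GPU_NPS = 434.691       # GPU only
NE_NPS = 210.861        # Neural Engine only
COMBINED_NPS = 587.922  # measured with GPU and NE in parallel

# Assumed ideal: both devices run at full speed with no waiting.
ideal = GPU_NPS + NE_NPS
headroom = ideal - COMBINED_NPS
efficiency = COMBINED_NPS / ideal

print(f"ideal:      {ideal:.3f} nps")
print(f"headroom:   {headroom:.3f} nps")
print(f"efficiency: {efficiency:.1%}")
```

With the quoted figures the ideal sum is 645.552 nps, so the measured 587.922 nps leaves about 57.6 nps of headroom, close to the additional 60 estimated in this post.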

And he only used the considerably weaker M3 Max, not the M4 Max.
It has long been clear that the Apple M-series machines were better from the start than others claimed, provided the software actually uses the available Apple hardware.

This will bring enormous speed gains for Lc0 and Stockfish.

By Lothar Jung Date 2025-01-07 08:52 Upvotes 1

Good, a clear improvement, but not for Stockfish, because its NNUE evaluation runs on the CPU.