By Lothar Jung
Date 2025-01-06 09:44
Edited 2025-01-06 09:49
Upvotes 1
The developer of KataGo ported his Go engine to the Mac and had the neural network run in parallel on the GPU and the Neural Engine (NE).
This approach is now being implemented for Lc0 as well.
Multiplexing across the GPU and the NE yields a significant performance gain.
It is also a blueprint for other processors with a dedicated AI unit.
I am curious to see what level this will lift Lc0 to.
Here is his post on Discord from today:
“This is absolutely crazy, but also incredibly exciting! I just have to share it with you all. The GPU and Neural Engine (NE) can run in parallel with a modified demux backend, and the performance is impressive. From my benchmarking, the combined GPU+NE setup reaches a throughput of 587.922 nps, which is a huge step up from the GPU-only performance of 434.691 nps.
That said, there’s still room for improvement in the source code to handle the imbalance in processing rates between the GPU and NE. While the NE achieves a throughput of 210.861 nps, the GPU ends up waiting for the NE to finish if the workload is evenly distributed. To solve this, I’ve modified the source code so the GPU processes two-thirds of the data, while the NE handles one-third. This way, both GPU and NE finish their batches almost simultaneously, minimizing the time either one spends waiting for the other.
It’s an exciting optimization that brings everything closer to peak performance!
```
       _
|   _ | |
|_ |_ |_| v0.32.0-dev+git.dirty built Jan 6 2025
Found pb network file: /Users/chinchangyang/Code/lc0-ccy/xcuserdata/lc0/DerivedData/lc0/Build/Products/Release/BT4-1024x15x32h-swa-6147500.pb.gz
Weights file has multihead format, updating format flag
Creating backend [demux]...
Creating backend [metal]...
Initialized metal backend on device Apple M3 Max
Creating backend [coreml]...
Compiling model: lc0.mlpackage/ -- file:///Users/chinchangyang/Code/lc0-ccy/xcuserdata/lc0/DerivedData/lc0/Build/Products/Release/
Compiled model URL: file:///var/folders/dv/kdr9x4yn4s106_94ydk5jnjc0000gn/T/lc0_43964B54-0554-4B72-B151-911B2941BDC8.mlmodelc
Initializing model with the compiled model URL...
Model successfully initialized
Benchmark batch size 30 with inference average time 53.6801ms - throughput 558.867 nps.
Benchmark batch size 33 with inference average time 56.8639ms - throughput 580.333 nps.
Benchmark batch size 36 with inference average time 61.2326ms - throughput 587.922 nps.
```
”
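To make the two-thirds/one-third split from the quoted post concrete, here is a minimal Python sketch that divides a batch between the two backends in proportion to the throughputs he measured. The function names `split_batch` and `finish_times_ms` are my own and are not part of Lc0, and the sketch assumes that each backend's throughput stays constant regardless of the batch size it receives, which is only a rough approximation on real hardware.

```python
def split_batch(batch_size, throughputs):
    """Split batch_size positions across backends in proportion to throughput.

    Hypothetical helper for illustration only; not taken from the Lc0 source.
    """
    total = sum(throughputs.values())
    # Ideal fractional share of the batch for each backend.
    shares = {name: batch_size * t / total for name, t in throughputs.items()}
    # Round down first, then hand the leftover positions to the backends
    # with the largest fractional parts (largest-remainder rounding).
    counts = {name: int(share) for name, share in shares.items()}
    leftover = batch_size - sum(counts.values())
    by_fraction = sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


def finish_times_ms(counts, throughputs):
    """Estimated time per backend in ms, assuming constant throughput (nps)."""
    return {name: 1000.0 * counts[name] / throughputs[name] for name in counts}


# Throughputs quoted in the post (nodes per second).
measured = {"gpu": 434.691, "ne": 210.861}

split = split_batch(36, measured)
print(split)  # {'gpu': 24, 'ne': 12}

# Compare how long each backend would take with the proportional split
# versus an even 18/18 split.
print(finish_times_ms(split, measured))
print(finish_times_ms({"gpu": 18, "ne": 18}, measured))
```

With the proportional split, both estimated finish times land close together, whereas with an even split the GPU would finish long before the NE and then sit idle, which is the imbalance described in the post.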