Multi-stream boosts Lc0 performance by 20%

By Lothar Jung Date 2021-06-01 10:56 Edited 2021-06-01 11:04

…on 2 RTX 3080 GPUs.

Here are the related posts from Discord:

With work from multiple streams in the queue, the GPU can opportunistically schedule some work from the other queue when SMs are idle. The benefit is greater for bigger GPUs and smaller batch sizes, where a single batch can't fill the GPU, and the 20% improvement I got on GA100 is actually more than I was expecting.
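
The idea can be illustrated with a deliberately simplified toy model (an assumption for illustration only, not how Lc0 or the CUDA scheduler actually works): suppose a kernel's thread blocks run in "waves" of at most `sm_count` blocks, one block per SM. Run back to back, each kernel's last wave may leave SMs idle; with two streams, blocks from the other queue can fill those slots. The function names and all numbers below are hypothetical.

```python
import math

def waves_serial(blocks_a, blocks_b, sm_count):
    """Waves needed when the two kernels run one after the other."""
    return math.ceil(blocks_a / sm_count) + math.ceil(blocks_b / sm_count)

def waves_concurrent(blocks_a, blocks_b, sm_count):
    """Waves needed if idle SMs can pick up blocks from the other stream (idealized)."""
    return math.ceil((blocks_a + blocks_b) / sm_count)

SM_COUNT = 100  # hypothetical GPU size, not a real device specification

for blocks in (40, 1040):
    serial = waves_serial(blocks, blocks, SM_COUNT)
    concurrent = waves_concurrent(blocks, blocks, SM_COUNT)
    print(f"{blocks} blocks per kernel: {serial} waves serial, {concurrent} concurrent")
```

In this model two small kernels (40 blocks each) need half as many waves when overlapped, while two large ones (1040 blocks each) save only one wave out of 22, which is in line with the claim that the benefit is largest when a single batch cannot fill the GPU.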

For testing, I just tried a batch size of 1024, and it's pretty surprising that I still get a decent speedup with multi-stream.
```
lc0 benchmark -t 2 --minibatch-size=1024 --backend-opts=multi_stream=false
===========================
Total time (ms) : 341722
Nodes searched : 27518102
Nodes/second : 80528

lc0 benchmark -t 4 --minibatch-size=1024 --backend-opts=multi_stream=false
===========================
Total time (ms) : 343309
Nodes searched : 25945394
Nodes/second : 75574

lc0 benchmark -t 2 --minibatch-size=1024 --backend-opts=multi_stream=true
===========================
Total time (ms) : 341139
Nodes searched : 28915422
Nodes/second : 84761

lc0 benchmark -t 4 --minibatch-size=1024 --backend-opts=multi_stream=true
===========================
Total time (ms) : 341830
Nodes searched : 30176533
Nodes/second : 88279
```
This indicates that even with a bigger batch size there are enough gaps in GPU utilization for a second concurrent execution to fill.
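
As a quick check on these figures, the reported nodes/second can be recomputed from the node count and total time, and the multi-stream gain at batch size 1024 expressed as a percentage. The numbers are copied from the benchmark output above; the helper names are only illustrative.

```python
# (threads, multi_stream) -> (total time in ms, nodes searched), copied from the runs above
runs = {
    (2, False): (341722, 27518102),
    (4, False): (343309, 25945394),
    (2, True): (341139, 28915422),
    (4, True): (341830, 30176533),
}

def nodes_per_second(time_ms, nodes):
    """Throughput recomputed from the raw totals."""
    return nodes / (time_ms / 1000.0)

def gain_percent(threads):
    """Relative speedup of multi_stream=true over multi_stream=false, in percent."""
    base = nodes_per_second(*runs[(threads, False)])
    multi = nodes_per_second(*runs[(threads, True)])
    return (multi / base - 1.0) * 100.0

for threads in (2, 4):
    print(f"{threads} threads: {gain_percent(threads):.1f}% faster with multi_stream=true")
```

For these runs the gain works out to roughly 5% with 2 threads and roughly 17% with 4 threads.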

Also note that multi-stream should work reliably on Linux, but on Windows you may need to enable hardware-accelerated GPU scheduling to see benefits (<https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/>), unless you are running a Tesla card in TCC driver mode.

From ankan's branch, or you can download it from AppVeyor:
https://ci.appveyor.com/project/LeelaChessZero/lc0/builds/39392346

With one 3080:
```
master 2 threads
===========================
Total time (ms) : 340559
Nodes searched : 9101864
Nodes/second : 26726

master 4 threads
===========================
Total time (ms) : 340960
Nodes searched : 9752158
Nodes/second : 28602

multi_stream 2 threads
===========================
Total time (ms) : 340503
Nodes searched : 10086106
Nodes/second : 29621

multi_stream 4 threads
===========================
Total time (ms) : 340872
Nodes searched : 10719261
Nodes/second : 31446

winograd-opts 2 threads
===========================
Total time (ms) : 340470
Nodes searched : 10017434
Nodes/second : 29422

winograd-opts 4 threads
===========================
Total time (ms) : 340845
Nodes searched : 10708551
Nodes/second : 31418
```
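
For comparing builds, result blocks in this format are regular enough to parse mechanically. The following is a minimal sketch, assuming each result consists of a label line followed by the three `key : value` lines shown above; `parse_results` is a hypothetical helper, not part of Lc0.

```python
def parse_results(text):
    """Map each label line to its reported nodes/second."""
    results = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and the ===== separator
        if " : " in line:
            key, value = line.rsplit(" : ", 1)
            if key.strip() == "Nodes/second" and label is not None:
                results[label] = int(value)
        else:
            label = line  # e.g. "master 4 threads"
    return results

# Two of the single-3080 results from above, copied verbatim
sample = """\
master 4 threads
===========================
Total time (ms) : 340960
Nodes searched : 9752158
Nodes/second : 28602

multi_stream 4 threads
===========================
Total time (ms) : 340872
Nodes searched : 10719261
Nodes/second : 31446
"""

nps = parse_results(sample)
gain = nps["multi_stream 4 threads"] / nps["master 4 threads"] - 1.0
print(f"multi_stream vs master, 4 threads: {gain:.1%}")
```

On the single 3080 this gives a gain of roughly 10% at 4 threads.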
With two 3080s:
```
master 2 threads
===========================
Total time (ms) : 340262
Nodes searched : 18132533
Nodes/second : 53290

master 4 threads
===========================
Total time (ms) : 340501
Nodes searched : 20663398
Nodes/second : 60685

multi_stream 2 threads
===========================
Total time (ms) : 340229
Nodes searched : 19783269
Nodes/second : 58147

multi_stream 4 threads
===========================
Total time (ms) : 340491
Nodes searched : 22705293
Nodes/second : 66684

winograd-opts 2 threads
===========================
Total time (ms) : 340221
Nodes searched : 19668991
Nodes/second : 57812

winograd-opts 4 threads
===========================
Total time (ms) : 340474
Nodes searched : 22520779
Nodes/second : 66145
```
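
To put the two sets of results side by side, the sketch below computes how throughput scales from one to two 3080s for each build, using the 4-thread nodes/second figures reported above. The dictionary layout is just for illustration.

```python
# nodes/second at 4 threads, copied from the two result blocks above
one_gpu = {"master": 28602, "multi_stream": 31446, "winograd-opts": 31418}
two_gpu = {"master": 60685, "multi_stream": 66684, "winograd-opts": 66145}

def scaling(build):
    """Throughput ratio of the two-GPU run to the one-GPU run."""
    return two_gpu[build] / one_gpu[build]

for build in one_gpu:
    gain_vs_master = two_gpu[build] / two_gpu["master"] - 1.0
    print(f"{build}: {scaling(build):.2f}x from 1 to 2 GPUs, "
          f"{gain_vs_master:.1%} vs master on 2 GPUs")
```

At 4 threads all three builds scale by a little over 2x, and multi_stream stays roughly 10% ahead of master on two GPUs.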

Lothar

PS: If anything still needs clarification, please reply!