Here is the current blog post:
Big Transformer 3. A tentative new network architecture using smolgen-augmented self-attention (see
https://github.com/Ergodice/lczero-training/blob/attention-net-body/README.md), carried over from BT2. We will probably switch the main activation from squared ReLU to Mish. The plan is an embedding size of 1024, an FFN projection size of 1024, 32 heads per layer, and 10 layers in total. We have also found a way of preprocessing the inputs that prevents the early layers from doing nothing, which should slightly improve performance. Removing the layer norms after attention also improves performance and slightly decreases latency. Experiments are currently in progress to quantize the dense layers to int8 precision to improve speed. CUDA optimizations are also available, which should reduce latency by 10 to 15%.
This announces several performance improvements.