Here is some information that a friend sent me, hence in English:
Big Transformer 5: tentative network architecture. The only real improvements we have are removing the biases in the QKV projections and replacing the layer normalizations with RMS norms, which improves training speed by 10%. BT5 is likely to be substantially larger than BT4.
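As a rough illustration of those two tweaks, here is a minimal PyTorch sketch of an RMS norm module (used where a LayerNorm would normally sit) and a bias-free QKV projection; module names and shapes are assumptions for this example, not the actual Lc0/BT5 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root mean square, no mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SelfAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # biases removed from the QKV projections
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, d // self.heads).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v)      # standard scaled dot-product attention
        return self.out(y.transpose(1, 2).reshape(b, n, d))

class Block(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = RMSNorm(dim)                         # RMS norm in place of layer norm
        self.attn = SelfAttention(dim, heads)

    def forward(self, x):
        return x + self.attn(self.norm(x))
```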
Happy to present some new stuff ready for BT5. The first is relative positional encodings (https://arxiv.org/pdf/1803.02155.pdf), which, when replacing smolgen, add roughly 0.5% policy accuracy without much throughput loss. In the accompanying plot, the green line is a compact version of RPE with 15x15 params per channel, blue is a larger version with 64x64 params per channel, and orange is just smolgen.
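As a hedged sketch of what such a relative positional bias can look like on an 8x8 board, the following assumes the compact "15x15 params per channel" variant means one learned scalar per relative (dy, dx) offset, here kept per attention head; it is an illustration only, not the BT5 implementation.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, heads, board=8):
        super().__init__()
        span = 2 * board - 1                                   # 15 possible offsets per axis on an 8x8 board
        self.table = nn.Parameter(torch.zeros(heads, span, span))
        # Precompute, for every pair of squares, the index of their (dy, dx) offset.
        coords = torch.arange(board)
        ys, xs = torch.meshgrid(coords, coords, indexing="ij")
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (64, 2) square coordinates
        rel = pos[:, None, :] - pos[None, :, :] + (board - 1)    # (64, 64, 2), values in [0, 14]
        self.register_buffer("idx", rel, persistent=False)

    def forward(self, logits):
        # logits: (batch, heads, 64, 64) attention scores before softmax
        bias = self.table[:, self.idx[..., 0], self.idx[..., 1]]  # (heads, 64, 64) learned offsets
        return logits + bias.unsqueeze(0)
```

The larger 64x64 variant described above would instead learn one parameter per ordered pair of squares rather than per relative offset.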
We're also finding that large MLP hidden depths greatly improve performance when integrated with RPE, consistent with other transformers; why large hidden depths didn't work before we have yet to explain. We've also been experimenting with mixture of experts (specifically expert choice, https://arxiv.org/pdf/2202.09368v1.pdf) and found around a 1% gain without a FLOPs increase.
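For context, here is a simplified sketch of expert-choice routing as described in the linked paper, where each expert selects its own top-k tokens so every expert sees a fixed load; the shapes, names, and expert MLPs are assumptions for illustration, not the actual Lc0 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertChoiceMoE(nn.Module):
    def __init__(self, dim, hidden, num_experts, capacity):
        super().__init__()
        self.capacity = capacity                              # tokens each expert processes (fixed load)
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (tokens, dim) with batch and sequence flattened; capacity must be <= tokens
        scores = F.softmax(self.router(x), dim=-1)            # token-to-expert affinities
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            weight, idx = scores[:, e].topk(self.capacity)    # expert e picks its own top-k tokens
            out[idx] += weight.unsqueeze(-1) * expert(x[idx])
        return out

# e.g. layer = ExpertChoiceMoE(dim=256, hidden=1024, num_experts=8, capacity=16)
```

Because each expert's capacity is fixed, the compute per layer does not depend on how tokens are routed, which is consistent with the reported gain coming without a FLOPs increase.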
Will it be possible to push policy accuracy to 100%?