Big transformer 5. Improvements over BT4 include removing the biases from the QKV projections, removing both the centering and the bias terms from the layer normalizations, and replacing the previous "smolgen" module with a new relative position encoding. Training started in late June 2024 and is expected to take 6 months. The network has 15 layers with an embedding size of 1024, 32 heads per layer, and a feed-forward (dff) size of 4096, roughly tripling the dff size over BT4 and doubling the overall model size. A minimal sketch of one such layer appears below.
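The following is a minimal sketch, in PyTorch, of a single encoder layer with the settings described above: bias-free QKV projections, RMS-style normalization (no centering, no bias), 1024 embedding size, 32 heads, and a 4096 dff. The block layout (pre-norm residual blocks), class names, and the omission of the relative position encoding are assumptions for illustration only; this is not the actual BT5 training code, whose details may differ.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Normalization without centering and without a bias term (gain only)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale


class EncoderLayer(nn.Module):
    """Illustrative BT5-style block: bias-free QKV, RMS-style norms, 4096 dff."""

    def __init__(self, d_model: int = 1024, n_heads: int = 32, d_ff: int = 4096):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # QKV and output projections carry no bias terms
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 64 board squares, d_model)
        b, t, d = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # the new relative position encoding would act on the attention
        # logits here; omitted in this sketch
        attn = nn.functional.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.proj(attn)
        x = x + self.ffn(self.norm2(x))
        return x


# 15 layers stacked, matching the stated depth
body = nn.Sequential(*[EncoderLayer() for _ in range(15)])
```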