Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo
Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count
Does SGD really happen in tiny subspaces?
Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
Linear attention is (maybe) all you need (to understand transformer optimization)