Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo
Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count
Does SGD really happen in tiny subspaces?