Provable Memorization via Deep Neural Networks using Sub-linear Parameters
A Unifying View on Implicit Bias in Training Linear Neural Networks
Minimum Width for Universal Approximation
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Low-Rank Bottleneck in Multi-head Attention Models
Are Transformers universal approximators of sequence-to-sequence functions?
Are deep ResNets provably better than linear predictors?