$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Low-Rank Bottleneck in Multi-head Attention Models
Are Transformers universal approximators of sequence-to-sequence functions?
Are deep ResNets provably better than linear predictors?
Global optimality conditions for deep neural networks
Face detection using Local Hybrid Patterns