Publications

Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count. arXiv preprint, 2024.
Stochastic Extragradient with Flip-Flop Shuffling & Anchoring: Provable Improvements. NeurIPS 2024, 2024.
DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity. NeurIPS 2024, 2024.
Provable Benefit of Cutout and CutMix for Feature Learning. NeurIPS 2024 (Spotlight), 2024.
Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure. NeurIPS 2024, 2024.
Does SGD really happen in tiny subspaces? ICML 2024 Workshop on High-dimensional Learning Dynamics: The Emergence of Structure and Reasoning, 2024.
Gradient Descent with Polyak's Momentum Finds Flatter Minima via Large Catapults. ICML 2024 Workshop on High-dimensional Learning Dynamics: The Emergence of Structure and Reasoning, 2024.
Fundamental Benefit of Alternating Updates in Minimax Optimization. ICML 2024 (Spotlight), 2024.
Linear attention is (maybe) all you need (to understand transformer optimization). ICLR 2024, 2024.
Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint. NeurIPS 2023, 2023.
PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning. NeurIPS 2023, 2023.
Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima. NeurIPS 2023 (Spotlight), 2023.
Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond. ICML 2023 (Oral), 2023.
On the Training Instability of Shuffling SGD with Batch Normalization. ICML 2023, 2023.
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond. ICLR 2022 (Oral), 2022.
Provable Memorization via Deep Neural Networks using Sub-linear Parameters. COLT 2021, 2021.
A Unifying View on Implicit Bias in Training Linear Neural Networks. ICLR 2021, 2021.
Minimum Width for Universal Approximation. ICLR 2021 (Spotlight), 2021.
SGD with shuffling: optimal rates without component convexity and large epoch requirements. NeurIPS 2020 (Spotlight), 2020.
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers. NeurIPS 2020, 2020.
Low-Rank Bottleneck in Multi-head Attention Models. ICML 2020, 2020.
Are Transformers universal approximators of sequence-to-sequence functions? ICLR 2020, 2020.
Are deep ResNets provably better than linear predictors? NeurIPS 2019, 2019.
Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. NeurIPS 2019 (Spotlight), 2019.
Efficiently testing local optimality and escaping saddles for ReLU networks. ICLR 2019, 2019.
Minimax Bounds on Stochastic Batched Convex Optimization. COLT 2018, 2018.
Global optimality conditions for deep neural networks. ICLR 2018, 2018.
Face detection using Local Hybrid Patterns. ICASSP 2015, 2015.