Publications

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study. arXiv preprint, early version at NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (Oral), 2024.
Fundamental Benefit of Alternating Updates in Minimax Optimization. ICML 2024, early version at ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning, 2024.
Linear attention is (maybe) all you need (to understand transformer optimization). ICLR 2024, short version at NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning (Oral), 2023.
Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint. NeurIPS 2023, 2023.
PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning. NeurIPS 2023, 2023.
Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima. NeurIPS 2023 (Spotlight), 2023.
Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond. ICML 2023 (Oral), 2023.
On the Training Instability of Shuffling SGD with Batch Normalization. ICML 2023, 2023.
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond. ICLR 2022 (Oral), 2022.
Provable Memorization via Deep Neural Networks using Sub-linear Parameters. COLT 2021, 2021.
A Unifying View on Implicit Bias in Training Linear Neural Networks. ICLR 2021, short version at NeurIPS 2020 Workshop on Optimization for Machine Learning (OPT 2020), 2021.
Minimum Width for Universal Approximation. ICLR 2021 (Spotlight), 2021.
SGD with shuffling: optimal rates without component convexity and large epoch requirements. NeurIPS 2020 (Spotlight), 2020.
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers. NeurIPS 2020, 2020.
Low-Rank Bottleneck in Multi-head Attention Models. ICML 2020, 2020.
Are Transformers universal approximators of sequence-to-sequence functions? ICLR 2020, short version at NeurIPS 2019 Workshop on Machine Learning with Guarantees, 2020.
Are deep ResNets provably better than linear predictors? NeurIPS 2019, 2019.
Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. NeurIPS 2019 (Spotlight), 2019.
Efficiently testing local optimality and escaping saddles for ReLU networks. ICLR 2019, 2019.
Minimax Bounds on Stochastic Batched Convex Optimization. COLT 2018, 2018.
Global optimality conditions for deep neural networks. ICLR 2018, short version at NIPS 2017 Workshop on Deep Learning: Bridging Theory and Practice, 2018.
Face detection using Local Hybrid Patterns. ICASSP 2015, 2015.