Hayden Prairie

CSE PhD @ University of California, San Diego

Hi, I’m Hayden! I am currently a PhD student at UCSD studying Computer Science and Engineering. I am part of the Sandy Research Lab and am advised by Dan Fu and Taylor Berg-Kirkpatrick. I am originally from Austin, Texas, and previously did my undergraduate degree at the University of Texas at Austin. My primary interests are core machine learning and computer systems!

My research mostly covers the intersection of ML and systems, including SSMs, structured sparsity, and all things GPU. I am broadly interested in developing an understanding of how we can better interpret and exploit sparsity to improve the efficiency and expressivity of large models.

Please check out my GitHub to see what I am currently working on and the projects I have contributed to!

Updates

Jul 2025

I started working part-time @together.ai as a research intern with the kernels team.

Apr 2025

I will be starting my PhD this September at UCSD, working with Dan Fu.

Publications

Parcae: Scaling Laws For Stable Looped Language Models

Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu

ICLR 2026, LIT Workshop (Keynote)

Abstract

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity over prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs in training and test-time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test-time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
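As a rough illustration of why a negative diagonal parameterization followed by discretization keeps a loop stable, here is a minimal sketch on a toy linear, elementwise recurrence. It is not the Parcae architecture or the authors' code: the function names, the step size `delta`, and the toy update `h <- decay * h + injection` are all assumptions made for this example.

```python
import math

def discretize_negative_diagonal(log_a, delta):
    """Map unconstrained parameters to diagonal decay factors in (0, 1).

    Each a_i = -exp(log_a_i) is strictly negative, so exp(delta * a_i)
    lies in (0, 1) for any positive step size delta (hypothetical name).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return [math.exp(-delta * math.exp(x)) for x in log_a]

def run_loop(decay, injection, steps):
    """Iterate the toy update h <- decay * h + injection, elementwise."""
    h = [0.0] * len(decay)
    for _ in range(steps):
        h = [d * v + u for d, v, u in zip(decay, h, injection)]
    return h

# Arbitrary unconstrained values; any real numbers give factors in (0, 1).
log_a = [-2.0, 0.0, 3.0]
decay = discretize_negative_diagonal(log_a, delta=0.5)

# For a diagonal matrix the spectral norm is the largest absolute entry,
# so it is below 1 here and the looped state stays bounded.
stable = run_loop(decay, injection=[1.0, 1.0, 1.0], steps=200)

# An unconstrained factor above 1 makes the same loop blow up.
unstable = run_loop([1.1, 1.1, 1.1], injection=[1.0, 1.0, 1.0], steps=200)
```

In this toy setting each stable coordinate is bounded by `1 / (1 - decay_i)`, whereas the unconstrained loop grows geometrically with the number of iterations.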

arXiv GitHub Hugging Face

Search Your Block Floating Point Scales!

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar, Qingyang Wu, Austin Silveria, Pragaash Ponnusamy, Jue Wang, Ben Athiwaratkun, Shuaiwen Song, Tri Dao, Daniel Fu, Christopher De Sa

MLSys 2026

Abstract

Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
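The contrast between a max-based block scale and a searched scale can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not the ScaleSearch algorithm itself: the multiplier grid, the squared-error objective, and all function names are assumptions for this example, and the scale is kept as an ordinary float rather than a low-precision value. The value grid is the set of FP4 (E2M1) magnitudes.

```python
import math

# Representable magnitudes of the FP4 E2M1 format.
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, scale):
    """Round each value to the nearest signed FP4 level after dividing by scale.

    Values beyond the largest level saturate to it.
    """
    out = []
    for x in block:
        mag = min(FP4_LEVELS, key=lambda q: abs(abs(x) / scale - q))
        out.append(math.copysign(mag * scale, x))
    return out

def sq_error(block, scale):
    """Sum of squared differences between a block and its quantized version."""
    return sum((x - y) ** 2 for x, y in zip(block, quantize_block(block, scale)))

def max_scale(block):
    """Baseline: map the largest magnitude in the block to the largest level."""
    amax = max(abs(x) for x in block)
    return amax / FP4_LEVELS[-1] if amax > 0 else 1.0

def search_scale(block, multipliers):
    """Try scaled-down variants of the baseline and keep the lowest-error one."""
    base = max_scale(block)
    return min((base * m for m in multipliers), key=lambda s: sq_error(block, s))

block = [0.9, -0.2, 0.45, 0.1]
# Hypothetical candidate grid; it includes 1.0, so the baseline is always tried.
multipliers = [m / 100 for m in range(70, 101, 5)]
baseline = max_scale(block)
best = search_scale(block, multipliers)
```

Because the candidate set contains the baseline scale, the searched scale can never have a higher squared error on the block than the max-based one; how much it helps depends on the distribution of values in the block.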

OpenReview

Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting

Sunny Sanyal*, Hayden Prairie*, Rudrajit Das*, Ali Kavis*, Sujay Sanghavi

ICML 2025, Spotlight

Abstract

Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving more accuracy on the pre-training datasets.
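A sample weighting scheme of the kind described above can be sketched as follows. The abstract only says that samples with low pre-trained loss are upweighted, so the exponential form, the `temperature` parameter, the normalization, and the function names here are assumptions made for illustration, not necessarily the paper's exact recipe.

```python
import math

def easy_sample_weights(pretrained_losses, temperature=1.0):
    """Return normalized weights that decrease as the pre-trained loss grows.

    Assumed form: w_i proportional to exp(-loss_i / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the minimum loss leaves the normalized weights unchanged
    # and avoids underflow when every loss is large.
    lo = min(pretrained_losses)
    raw = [math.exp(-(l - lo) / temperature) for l in pretrained_losses]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_loss(current_losses, weights):
    """Fine-tuning objective for a batch: weighted sum of per-sample losses."""
    return sum(w * l for w, l in zip(weights, current_losses))

# Per-sample losses of the frozen pre-trained model, computed once up front.
losses = [0.1, 1.0, 3.0]
weights = easy_sample_weights(losses, temperature=1.0)
```

The weights depend only on the pre-trained model's losses on the fine-tuning data, so they can be computed before fine-tuning starts and do not require access to the pre-training data.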

arXiv GitHub

Blog Posts

No blog posts available yet.