CompreSSM: MIT Technique Compresses AI Models Mid-Training Using Control Theory
New method from MIT CSAIL uses Hankel singular values to identify and prune dead weight during training, compressing state-space models more than 40x faster than a competing regularisation approach without sacrificing accuracy.

Something quietly extraordinary happened at MIT last week. Researchers at CSAIL developed a compression technique that makes AI models shed their dead weight while they're still learning. The approach, called CompreSSM, sidesteps a fundamental trade-off in model development: traditionally, you either train a massive model and trim it later (expensive), or start small and accept weaker performance.
The Control Theory Insight
The breakthrough comes from borrowing mathematical tools from control systems engineering. Lead author Makram Chahine and his team use Hankel singular values—a measure from classical control theory that quantifies how much each internal state contributes to a system's overall behaviour. Think of it as identifying which gears in a complex machine are actually doing work versus just spinning along for the ride.
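For readers unfamiliar with the control-theory machinery, Hankel singular values can be computed for a small linear system from its controllability and observability Gramians. The sketch below is illustrative only (the toy matrices are invented, and the paper's own computation applied to trained SSM layers may differ), but it shows the core idea: each value ranks how much one internal direction contributes to input-output behaviour.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable discrete-time LTI system
    x[t+1] = A x[t] + B u[t],  y[t] = C x[t].

    Large values mark states doing real work; tiny values mark
    states that are "just spinning along for the ride".
    """
    # Controllability Gramian P solves: A P A^T - P + B B^T = 0
    P = solve_discrete_lyapunov(A, B @ B.T)
    # Observability Gramian Q solves: A^T Q A - Q + C^T C = 0
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    # HSVs are the square roots of the eigenvalues of P @ Q
    eigs = np.linalg.eigvals(P @ Q)
    return np.sort(np.sqrt(np.abs(eigs.real)))[::-1]

# Toy 3-state system: one mode (a = 0.01) decays almost instantly
# and should receive a near-zero Hankel singular value.
A = np.diag([0.9, 0.5, 0.01])
B = np.ones((3, 1))
C = np.ones((1, 3))
hsv = hankel_singular_values(A, B, C)
```

Sorting the values in descending order makes the eventual truncation decision a simple cut-off on the tail.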
These singular values reveal something remarkable about state-space models during training: the relative importance of different internal dimensions stabilises surprisingly early. After only about 10 percent of the training process, the researchers found, they could reliably predict which parts of the model would remain critical and which would become redundant.
The technique targets state-space models, architectures like Mamba that have emerged as efficient alternatives to transformers for long sequences. Unlike transformers, whose attention costs grow quadratically with sequence length, state-space models carry a fixed-size internal state, so their memory stays constant as sequences grow. The dimension of that internal state is their bottleneck.
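The constant-memory property comes straight from the recurrence itself. This minimal sketch (a generic linear scan, not Mamba's actual selective-scan implementation) processes a 1,000-token sequence while holding only a single fixed-size state vector, whereas attention would keep all 1,000 keys and values:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Linear state-space recurrence over a 1-D input sequence.

    h[t+1] = A h[t] + B u[t],  y[t] = C h[t+1]
    Memory is one state vector of size n, independent of sequence
    length T; attention would instead store all T keys/values.
    """
    n = A.shape[0]
    h = np.zeros(n)            # the only per-sequence memory
    ys = []
    for u_t in u:              # one fixed-cost update per token
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)            # simple stable state matrix (assumed)
B = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(A, B, C, rng.standard_normal(1000))
```

The state dimension `n` (here 4) is exactly the quantity CompreSSM shrinks: the per-token cost scales with `n`, not with how long the sequence is.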
Engineering Results
CompreSSM delivered substantial gains across benchmarks. On CIFAR-10 image classification, a compressed model reduced to roughly a quarter of its original state dimension achieved 85.7% accuracy compared to just 81.8% for a model trained small from scratch. For Mamba architectures specifically, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
The speed advantage is dramatic. Compared to Hankel nuclear norm regularisation—another recent technique for compact state-space models—CompreSSM runs more than 40 times faster while achieving higher accuracy. The regularisation approach requires expensive eigenvalue computations at every gradient step, slowing training by roughly 16x.
"You get the performance of the larger model because you capture most of the complex dynamics during the warm-up phase, then only keep the most-useful states," explains Chahine.
Mathematical Foundation
The theoretical grounding relies on Weyl's theorem, which the researchers applied to prove that the importance of individual model states changes smoothly during training. This mathematical guarantee means dimensions identified as negligible early on won't suddenly become critical later—a property that makes mid-training compression safe.
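The role Weyl's theorem plays can be sketched with its standard perturbation form for singular values (the paper's exact lemma may differ): for any matrices \(A\) and \(E\),

```latex
\left| \sigma_i(A + E) - \sigma_i(A) \right| \;\le\; \lVert E \rVert_2
```

Since each gradient step perturbs the model's system matrices only slightly, the singular values can drift only slightly per step, so a dimension flagged as negligible cannot abruptly become important.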
The approach works by monitoring the Hankel singular values throughout training. When these values fall below a predefined relative threshold, the corresponding dimensions get pruned. The method includes a pragmatic safety net: if compression causes unexpected performance drops, practitioners can revert to previously saved checkpoints.
Production Implications
For engineering teams, CompreSSM represents a shift from post-hoc optimisation to architecture-aware training. Instead of planning for compression after expensive pre-training, teams can build efficiency directly into the training pipeline. This matters particularly for state-space models, which are gaining traction in applications where transformer inference costs become prohibitive.
Antonio Orvieto from ELLIS Institute Tübingen, who wasn't involved in the research, notes that "the proposed algorithm has the potential to become a standard approach when pre-training large SSM-based models." The technique, presented at ICLR 2026, offers a theoretically principled path to efficiency without sacrificing model capability.
As AI infrastructure costs compound, techniques that embed efficiency into training rather than retrofitting it afterward become competitive advantages. CompreSSM provides exactly this—a mathematically rigorous way to train lean without starting small.