Qwen 3.5's Gated DeltaNet: How a 9B Model Is Embarrassing 120B Giants
Alibaba's Qwen 3.5 Small series ships a hybrid attention architecture that beats OpenAI's GPT-OSS-120B on graduate-level reasoning — while running on your phone. If you build anything that touches edge inference, this changes your arithmetic.

Something quietly extraordinary happened on 2 March. While most of the AI commentariat was busy arguing about whether Gemini 3.1 Flash-Lite's pricing constitutes dumping, Alibaba's Qwen team dropped a family of four small dense models — 0.8B, 2B, 4B, and 9B parameters — that casually rewrite the rules of what 'small' means in 2026.
The headline number: Qwen3.5-9B scores 82.5 on MMLU-Pro and 81.7 on GPQA Diamond. OpenAI's GPT-OSS-120B, a model more than 13 times larger, manages 80.8 and 80.1 respectively. Yes, you read that correctly. A model that fits in 6GB of RAM is outperforming one that needs a data centre rack.
But the benchmarks are not even the interesting part. The interesting part is how they did it.
The Architecture: Nobody Agrees on Attention Anymore
Qwen 3.5 introduces a hybrid architecture that interleaves two fundamentally different approaches to token mixing. Every fourth layer uses traditional full softmax attention with grouped query attention (GQA) — the mechanism you know from GPT-4, Claude, and every other transformer since 2017. The other 75% of layers use something called Gated DeltaNet, a linear attention mechanism whose per-token cost is constant, rather than growing with sequence length as softmax attention's does (quadratic over the whole sequence).
The layer configuration is elegant in its specificity: 60 layers organised as 15 repetitions of a four-layer block. Each block runs three Gated DeltaNet layers followed by one full attention layer. The DeltaNet layers use 64 linear attention heads for values and 16 for query-key pairs, all at dimension 128.
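The stated schedule is simple enough to write down directly. A minimal sketch of the 60-layer layout described above — the layer names are illustrative labels, not Qwen's internal identifiers:

```python
# Illustrative layer schedule for the described 60-layer stack:
# 15 repeats of [3x Gated DeltaNet, 1x full attention].
BLOCK = ["gated_deltanet"] * 3 + ["full_attention"]
LAYERS = BLOCK * 15

assert len(LAYERS) == 60
assert LAYERS.count("gated_deltanet") == 45   # 75% linear-attention layers
assert LAYERS.count("full_attention") == 15   # every fourth layer
```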
Gated DeltaNet itself is a fascinating chimera. It combines the delta rule — an error-correcting memory update mechanism from classical neural network theory — with exponential gating borrowed from Mamba2's state-space architecture. Add causal Conv1D for local context capture and L2 normalisation on queries and keys (replacing the softmax that transformers have relied on since Vaswani et al.), and you get a layer that maintains a compressed memory state updated at each token rather than attending over the entire sequence.
The practical consequence: full attention layers provide global context and strong retrieval capability for the 25% of layers that need it, while the linear layers handle the other 75% of computation at O(1) per token during inference. The model supports 262,144 tokens of native context, extensible to over one million tokens via YaRN scaling of its rotary position embeddings — and it does so without the memory wall that pure transformer architectures hit at those lengths.
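To make the mechanism concrete, here is a toy, dependency-free sketch of a gated delta-rule memory update, assuming the simplified form S ← α·S + β·k(v − Sᵀk)ᵀ. This is not Qwen's implementation — the real layer adds the Conv1D path, multi-head structure, and learned per-token gates described above — but it shows why the state is O(1) per token: the memory is a fixed-size matrix, updated once per step, regardless of sequence length.

```python
# Toy gated delta-rule update (illustrative, not Qwen's actual code).
# State S is a d_k x d_v matrix; one update costs O(d_k * d_v),
# independent of how many tokens came before.

def l2_normalize(x):
    norm = sum(v * v for v in x) ** 0.5 or 1.0
    return [v / norm for v in x]

def gated_delta_step(S, k, v, alpha, beta):
    """One token's update: decay the old memory (alpha), then apply an
    error-correcting delta write (beta) toward the new key/value pair."""
    k = l2_normalize(k)                      # L2 norm on keys, as in the text
    d_k, d_v = len(S), len(S[0])
    # What the current memory predicts for this key: S^T k
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # The "delta": error between the true value and the prediction
    err = [v[j] - pred[j] for j in range(d_v)]
    return [[alpha * S[i][j] + beta * k[i] * err[j]
             for j in range(d_v)] for i in range(d_k)]

def read_out(S, q):
    """Query the compressed memory: o = S^T q (queries also L2-normalised)."""
    q = l2_normalize(q)
    return [sum(S[i][j] * q[i] for i in range(len(S))) for j in range(len(S[0]))]
```

With no decay (alpha=1) and a full write (beta=1), storing a key/value pair and querying with the same key recovers the value exactly — the error-correcting property the delta rule is known for.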
What Shipped: Not Just Language
Here is where Qwen 3.5 Small gets properly interesting for anyone building production systems. All four models — from the tiny 0.8B to the 9B flagship — process text, images, and video from a single unified architecture. Vision is not bolted on as a separate encoder; it is native to the model's processing pipeline.
The multimodal results are startling. On MMMU-Pro (visual reasoning), the 9B model scores 70.1 — beating Google's Gemini 2.5 Flash-Lite at 59.7 and even the specialised Qwen3-VL-30B-A3B at 63.0. On Video-MME with subtitles, it hits 84.5 versus Flash-Lite's 74.6. A sub-10B model is outperforming dedicated vision models three times its size.
For agentic tasks — the category every platform company is betting their roadmap on — the 9B variant scores 66.1 on BFCL-V4 (function calling), 79.1 on TAU2-Bench (tool use), and 41.8 on OSWorld-Verified (desktop automation). These numbers beat Qwen's own 80B model on all three agentic benchmarks. Read that again: 9B parameters, beating 80B, on the tasks that matter most for AI agent deployment.
The Deployment Story: Actually Runs on Real Hardware
Four-bit quantisation reduces VRAM requirements by roughly 75%, making the 9B model runnable on an 8GB GPU or a laptop with 16GB of RAM. The 2B variant runs on any recent iPhone with just 4GB of RAM. The 0.8B model can target microcontrollers and IoT devices.
The deployment stack is mature. Day-zero support exists for llama.cpp, Ollama, and MLX (Apple Silicon). AMD published day-zero support for Instinct GPUs via ROCm. You can have this model running locally within minutes of reading this article — ollama run qwen3.5:9b and you are in business.
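Once the model is pulled, talking to it programmatically is a few lines against Ollama's local HTTP API. A minimal sketch using only the standard library — the `qwen3.5:9b` tag is the one from the article, and the endpoint is Ollama's default:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model, prompt):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the model pulled via `ollama run qwen3.5:9b`, this round-trips locally:
# req = build_generate_request("qwen3.5:9b", "Summarise GQA in one sentence.")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

No SDK, no API key, no data leaving the machine — which is the point.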
For those counting tokens per second: developers report the 9B model generates code at 30+ tokens per second on consumer hardware like an RTX 4090. That is fast enough for interactive coding assistance, real-time function calling, and responsive chat — the use cases where latency matters more than raw capability.
The Community Verdict
The Hacker News thread pulled 363 points and 173 comments within hours of release — significant engagement for a model launch. One founder reported being blown away by coding performance on the larger 27B variant running locally on a 4090; another developer called the 35B model the most capable agentic coding model at that size.
The measured takes are worth noting too. Several developers pointed out that benchmark performance does not always translate to production vibes — the subjective quality of responses in real workflows. The 9B model is not replacing Claude Sonnet 4.6 or GPT-5.3 for complex multi-step reasoning. But that is not the point. The point is that it runs on your hardware, for free, under Apache 2.0.
That licence deserves emphasis. Unlike Meta's Llama (community licence with restrictions above 700M monthly users), Google's Gemma (restricted redistribution), and Microsoft's Phi (MIT, but with fine print), Qwen 3.5 Small ships under Apache 2.0. Full commercial deployment, fine-tuning, redistribution, no royalties, no usage caps. For any startup building on-device AI products, this removes an entire category of legal risk.
Why Your CTO Should Care
The strategic implications are threefold.
First, the inference cost equation just changed. If a 9B model can handle 80% of the tasks you are currently routing to a 120B cloud API, your per-query cost drops from fractions of a penny to effectively zero on owned hardware. For high-volume applications — customer support, document processing, code review — the savings compound aggressively.
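The compounding is easy to put numbers on. A quick sketch with hypothetical figures — the workload size and the $1-per-million-tokens blended rate are illustrative assumptions, not any provider's actual pricing:

```python
def monthly_token_cost(queries_per_day, tokens_per_query, usd_per_million_tokens):
    """Monthly spend for a cloud API billed per token (30-day month)."""
    monthly_tokens = queries_per_day * 30 * tokens_per_query
    return monthly_tokens * usd_per_million_tokens / 1e6

# Hypothetical workload: 50k queries/day, ~1,500 tokens each, at an assumed
# blended $1 per million tokens -- numbers for illustration only.
cloud = monthly_token_cost(50_000, 1_500, 1.0)   # $2,250/month, every month
local = 0.0                                      # marginal cost on owned hardware
```

Route even 80% of that traffic to a local 9B model and the line item shrinks accordingly, with the hardware amortising as a one-off cost rather than a recurring one.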
Second, data sovereignty becomes trivially achievable. The entire model runs on a laptop. No API calls, no data leaving your network, no third-party processing agreements. For regulated industries — healthcare, finance, legal, government — this is not a nice-to-have. It is a compliance requirement that just became dramatically easier to meet.
Third, the hybrid attention architecture signals where the field is heading. Pure transformers are becoming the exception, not the rule. Mamba, RWKV, Gated DeltaNet — the zoo of sub-quadratic architectures is producing models that are not just cheaper but better at specific tasks. If your team is still assuming 'transformer = best', it is time to update that assumption.
The practical next step for any technical leader: download Qwen3.5-9B today, run it on your existing hardware, and benchmark it against whatever cloud API you are currently paying for on your actual workloads. Not synthetic benchmarks — your real queries, your real documents, your real code. The results might surprise you. They certainly surprised me.