Mamba2 is Literally Attention. Here’s How.
Last Updated on June 22, 2026 by Editorial Team
Author(s): Jun Nishimura
Originally published on Towards AI.
Part 2 of 3 — Mamba-2 and the State Space Duality. How the recurrence becomes a matmul, why Nemotron-H ships this one, for ML engineers who already speak fluent Transformer.

1. Why this one, why now
Part 1 of this series showed that any trained Mamba-1 layer can be rewritten as an attention matrix — the recurrence’s output decomposes into a content-weighted sum of past inputs, exactly the shape attention has. But that was an equivalence of functions, proved on paper after the fact. Mamba-1’s forward pass is a recurrence: carry a state, update it token by token, read it out. No L×L matrix is ever allocated. The attention matrix is something you write down to interpret the model — not a tensor the GPU ever holds.
In Mamba-2 that rewrite isn’t a description. It’s the forward computation. The training kernel literally builds the matrices — CBᵀ (the scores), Λ (the decay mask), M = Λ ⊙ CBᵀ — as real tensors in memory, then multiplies y = M·x. That's an equivalence of computation, not just of functions: the model scores every pair of positions, masks, and takes a weighted sum of values because that is the sequence of ops it runs, not because of how we chose to read it.
Put differently, the arrow of derivation flips. In M1 the recurrence is primary and the matrix is derived; in M2 the matrix is primary — it’s how training computes — and the recurrence becomes the derived fast path for inference. (Two caveats keep this honest, both cashed out later: it’s structured masked attention, not softmax — there’s no exp-normalization, §7; and the literal construction is the training/dual form, while the inference recurrence computes the identical function by the SSD theorem, §6.)
That single shift is what made selective SSMs production-ready. Nemotron-H and IBM Granite-4 ship Mamba-2 layers, not Mamba-1 (AI21's Jamba, an earlier hybrid, predates Mamba-2 and was built on Mamba-1). The reason is unfussy: M2's SSD layer runs 2–8× faster than M1's selective scan, supports a much larger state dimension (d_state 64–256 vs M1's 16), and the kernel is plain tensor-core matmul rather than a custom CUDA selective-scan.
This article is about how one design choice — restricting the state matrix A from a diagonal of N distinct rates down to a single scalar — unlocks all of that, and how the “Mamba is attention” reading sharpens from rhyme into identity. Same TinyStories setup as Part 1: a small 6-layer M2 trained alongside the same parameter-matched Transformer, the same prompt for the comparison figure, so you can put the two articles’ heatmaps next to each other.
This post sticks to the math and the pictures. The notebooks and a fully-worked code tutorial are linked at the bottom.
2. The three moves (Mamba-2 in one paragraph)
Dao & Gu’s 2024 paper (Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality) makes three changes to Mamba-1:
- A becomes scalar. Where M1's A was a diagonal matrix with N distinct decay rates per channel (the HiPPO spectrum we plotted in Part 1's section 4), M2 sets A_t = a_t · I: one decay rate per head per token. All N state slots inside a head share that decay at each step.
- The recurrence becomes a structured matrix multiply. Unrolling the SSM with scalar A gives y = M · x, where M = Λ ⊙ (CBᵀ). Λ is a lower-triangular scalar decay mask with Λ[t,j] = ∏ ā_k for j+1 ≤ k ≤ t, and CBᵀ is an L×L content-similarity matrix, exactly the shape of attention scores. This is the State Space Duality: the same SSD computation can be evaluated as a recurrence (linear cost, fixed state memory) or as a matmul (quadratic cost, all dense matrix ops), and the two views agree by construction.
- The block reshapes into multi-head. Channels regroup into H heads of width P (so d_inner = H·P). Δ promotes from per-channel to per-head. B and C are shared across head groups, GQA-style. Per-channel timescale diversity is gone, but you get capacity back by having more heads at a larger d_state.
The payoff is hardware. Forming Λ, forming CBᵀ, taking their Hadamard product, and computing M · x are all dense matrix operations — exactly what GPUs and TPUs want. The linear recurrence still exists (and is what inference uses), but training rides the quadratic dual.
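To make the shapes in the three moves concrete, here is a minimal NumPy sketch of one multi-head layer in the dense dual form. Everything about it is an assumption for illustration: the toy sizes, the random stand-ins for the learned projections, and the choice of a single B/C group shared by all heads. It is not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, P, N = 8, 2, 4, 16          # sequence, heads, head width, state size (toy values)
x = rng.normal(size=(L, H, P))    # d_inner = H * P channels, regrouped into heads
B = rng.normal(size=(L, N))       # shared by all heads (one group, assumed)
C = rng.normal(size=(L, N))
a = -np.exp(rng.normal(size=H))   # one negative scalar rate per head
delta = np.logaddexp(0.0, rng.normal(size=(L, H)))   # softplus: one timestep per head per token

log_a_bar = delta * a                                 # (L, H): log of the per-step decay
prefix = np.cumsum(log_a_bar, axis=0)
diff = prefix[:, None, :] - prefix[None, :, :]        # (L, L, H), indexed [t, j, h]
causal = np.tril(np.ones((L, L), dtype=bool))[:, :, None]
Lam = np.exp(np.where(causal, diff, -np.inf))         # one decay mask per head

scores = C @ B.T                                      # (L, L), shared across heads
M = Lam * scores[:, :, None]                          # (L, L, H)
y = np.einsum("tjh,jhp->thp", M, x)                   # (L, H, P)
print(y.shape)
```

Note that only the decay mask carries a head index; the score matrix is computed once and reused by every head in the group.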
The rest of this post earns those three claims one at a time: §3 shows where the recurrence comes from and why “scalar A” is the load-bearing choice, §6 unrolls it into M, and §5/§7/§8 look at what the resulting matrix actually does.
3. From an ODE to a recurrence: the one scalar that changes everything
Before we can say “the recurrence becomes a matrix,” it’s worth being honest about where the recurrence itself comes from — because that’s where the scalar-A choice does its work.
A selective SSM begins life as a continuous system. For a single state channel, with state h(s) ∈ ℝ^N:
dh/ds = a · h(s) + B · x(s), y = C · h
To run it on a sequence of tokens you have to discretize — turn the continuous derivative into a per-step update. The standard move (zero-order hold) is to integrate the ODE across one step of size Δ, holding the input fixed across that step. The exact solution of a linear ODE over one step has two pieces: the old state, shrunk by an exponential of the elapsed "time," plus the new input, written in:
h_t = exp(Δ_t · a) · h_{t-1} + (input gain) · B_t · x_t
(the factor exp(Δ_t · a) multiplying h_{t-1} is the decay ā_t)
The decay factor ā_t = exp(Δ_t · a) is the entire story. Three observations:
- Δ_t is the selective timestep. It's a positive number the model emits per token (Δ_t = softplus(W_Δ x_t)). Because a is negative, large Δ_t → ā_t near 0 → forget the past aggressively; small Δ_t → ā_t near 1 → hold the state. This is the "selective" in selective SSM, and it survives untouched into M2.
- a is where M1 and M2 part ways. In Mamba-1, a is the n-th diagonal entry of A: a different rate for each of the N state slots. In Mamba-2, a is one scalar per head: every slot in a head decays at the same rate at each step. That is the whole of "A = a·I."
- The input gain quietly disappears. The exact discretization puts a gain on the input term too (it tends to Δ_t in the small-step limit; that's M1's familiar Δ_t B_t x_t). M2's clean presentation folds that gain into the already-learned B projection, so Δ ends up living only in the decay. It's a mild simplification, not an identity, but it's why every Δ you'll see from here on sits inside ā, never on the write. (The full zero-order-hold derivation, including why dropping the gain is safe, is in the code tutorial linked at the end.)
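The three observations can be checked numerically. This is a small sketch under stated assumptions: a single head with scalar rate a = -1, made-up pre-activation values standing in for W_Δ x_t, and helper names (softplus, decay, zoh_input_gain) that are illustrative rather than taken from any Mamba codebase.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the timestep positive.
    return np.logaddexp(0.0, z)

def decay(delta, a):
    # Zero-order-hold decay factor: a_bar = exp(delta * a), with a < 0.
    return np.exp(delta * a)

def zoh_input_gain(delta, a):
    # Exact zero-order-hold gain on the input term: (exp(delta * a) - 1) / a.
    return np.expm1(delta * a) / a

a = -1.0                                      # assumed scalar rate for one head
pre_activation = np.array([-6.0, 0.0, 6.0])   # hypothetical W_delta @ x_t values
delta = softplus(pre_activation)
a_bar = decay(delta, a)

for d, ab in zip(delta, a_bar):
    print(f"delta={d:.4f}  a_bar={ab:.4f}")

# Small-step limit: the exact input gain approaches delta itself.
small = 1e-3
print(zoh_input_gain(small, a), "vs", small)
```

The smallest Δ leaves ā close to 1 (hold the state) and the largest pushes it close to 0 (overwrite), while the last line shows the input gain collapsing to Δ when the step is small.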
Why does making a a scalar matter so much? Because of what it does to the next step. When a is a single number, ā_t is a single number — a plain scalar multiplier on the whole state. And a scalar multiplier is something you can pull out of a sum. That one algebraic fact is what lets the recurrence collapse into a clean matrix in §6. With M1's per-slot diagonal A, the decay is trapped inside the state and no such collapse exists. The scalar isn't a minor simplification of the recurrence; it's the hinge the entire matrix view swings on.
4. Selectivity survives

The thing that made Mamba Mamba — Δ as a content-aware write gate — is still there. This is a Δ heatmap from the same selective-copy task as Part 1’s fig 1, but the rows are now heads (M2 has H=8 here, sharing Δ within each head) instead of channels. The pattern is the same: most of the sequence sits at Δ ≈ 0, and at the few positions holding a real content token, certain heads light up.
What changed visually: only a couple of heads (4 and 7 in this run) carry the selective firing. The rest stay dark. M1 spread the same job across hundreds of channels; M2 concentrates it into a handful of heads, each operating on a much larger N-dimensional state. The mean-Δ panel at the bottom shows the per-head spikes still cleanly track real-token positions even after averaging — the same content-aware write gating, just coarser-grained and louder per head.
The selectivity that gave the family its name isn’t what M2 traded away.
5. What A = a·I actually costs

In Mamba-1, A^(n) was initialized to −n, giving an almost-linear staircase of half-lives from ~100 tokens at n=0 down to ~6 at n=15. That ladder inside a single head was M1's headline structural fact: one layer, many timescales, baked into A's diagonal.
In M2, that within-head ladder vanishes. With A_t = a_t · I, every slot inside one head decays at the same rate at every step. There is no longer a "long-horizon slot" and a "short-horizon slot" within a head; there is just one number, picked per token by the model.
The figure shows the half-life picture from the trained 6-layer M2 at layer 3. The ladder isn't gone; it has moved across heads. Head 1 holds ~10⁵ tokens of memory (the bright yellow stripe in the heatmap); head 3 forgets in a fraction of a token; heads 0 and 2 sit between at a few hundred to a few thousand tokens. Each head settles on a characteristic timescale and varies modestly around it per token; head 3 varies the most, its ā collapsing on specific tokens to overwrite the state. The diversity M1 baked into A's diagonal initialization, M2 distributes across heads.
That tradeoff is real. On paper, M2’s per-layer expressive class is strictly smaller than M1’s — you’ve lost the ability to hold multiple horizons in the same head simultaneously. In practice, two compensations close the gap: heads multiply the number of decay rates available at any given timestep, and d_state grows from 16 to 64–256, so each head carries much more capacity. Whether the cross-head ladder beats the per-slot ladder is an empirical question; the paper’s answer is “comfortably, once you scale heads × d_state past this toy regime.”
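For readers who want to turn a decay value into the half-lives quoted above, the conversion is one line: if ā were held constant, the half-life is the number of steps n with āⁿ = 0.5. The sketch below assumes a constant per-step decay, and the three ā values are hypothetical, picked only to span the orders of magnitude discussed in this section (they are not read from the trained model).

```python
import math

def half_life_tokens(a_bar):
    # Steps n such that a_bar ** n == 0.5, for a constant per-step decay in (0, 1).
    return math.log(0.5) / math.log(a_bar)

# Hypothetical per-head decays, not measurements from the trained model.
examples = [("long-memory head", 0.99999), ("mid head", 0.999), ("fast head", 0.05)]
for name, a_bar in examples:
    print(f"{name}: a_bar={a_bar} -> half-life ~ {half_life_tokens(a_bar):.3g} tokens")
```

Because ā = exp(Δ · a), the same quantity can be written as ln 2 / (Δ · |a|), which is why a head's timescale moves whenever the model changes Δ.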
6. The unroll: how a loop becomes a matrix
Here’s the move the whole paper is built on, and with §3 in hand it’s three lines of algebra. Take the discretized recurrence — decay the state, write the new input, read it out:
h_t = ā_t · h_{t-1} + B_t · x_t, y_t = C_t · h_t
Unroll it by back-substitution. Feed in an input at step j and follow it forward. It's written in fresh at j, then shrunk by ā_{j+1} at the next step, by ā_{j+2} after that, and so on, until you read out at step t. So the contribution of x_j to the state at time t is x_j multiplied by every decay in between:
h_t = Σ_{j ≤ t} (∏_{k=j+1}^{t} ā_k) · B_j · x_j
Now read it out with C_t. The decay factor is a scalar and C_t · is linear, so both slide inside the sum:
y_t = Σ_{j ≤ t} (∏_{k=j+1}^{t} ā_k) · (C_t · B_j) · x_j
That is an input→output map y_t = Σ_j M[t,j] · x_j. Read off the coefficient of x_j:
M[t,j] = (∏_{k=j+1}^{t} ā_k) · (C_t · B_j), for j ≤ t (and 0 for j > t)
The first factor is the scalar decay; the second is a dot product.
This is exactly where the scalar A pays off. Inside M1, the decay lives inside the sum over state slots — each slot n decays at its own rate A^(n), so you cannot separate "how much does t care about j" from "how aged is that signal." With scalar a, the decay is the same for every slot, so it factors straight out of the sum, and what's left, Σ_n C_t^(n) B_j^(n), collapses into a single inner product C_t · B_j — the literal analogue of a Q·K score. The scalar didn't simplify the recurrence so much as it untangled score from decay.
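The factoring argument can be tested directly. The sketch below builds the token-mixing matrix with the decay kept inside the sum over state slots (the general, M1-style case) and compares it with the factored form Λ ⊙ (CBᵀ). The function names and toy sizes are assumptions for illustration; the second comparison tries just one natural candidate mask (the per-step mean decay), so it illustrates the failure rather than proving that no mask works.

```python
import numpy as np

rng = np.random.default_rng(1)
L, N = 5, 3
B = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))

def token_mixing_matrix(a_bar):
    # a_bar has shape (L, N): one decay per step and per state slot.
    M = np.zeros((L, L))
    for t in range(L):
        for j in range(t + 1):
            aged = np.prod(a_bar[j + 1 : t + 1], axis=0)   # per-slot decay, shape (N,)
            M[t, j] = np.sum(C[t] * aged * B[j])           # decay sits inside the slot sum
    return M

def factored_form(a_scalar):
    # Lambda ⊙ (C B^T) with one scalar decay per step, shape (L,).
    Lam = np.array([[np.prod(a_scalar[j + 1 : t + 1]) if j <= t else 0.0
                     for j in range(L)] for t in range(L)])
    return Lam * (C @ B.T)

shared = rng.uniform(0.5, 1.0, size=L)
scalar_case = np.repeat(shared[:, None], N, axis=1)    # every slot shares the decay
diagonal_case = rng.uniform(0.5, 1.0, size=(L, N))     # every slot has its own decay

same_when_scalar = np.allclose(token_mixing_matrix(scalar_case), factored_form(shared))
same_when_diagonal = np.allclose(token_mixing_matrix(diagonal_case),
                                 factored_form(diagonal_case.mean(axis=1)))
print(same_when_scalar, same_when_diagonal)
```

With a shared decay the two constructions agree; with per-slot decays the mean-decay mask does not reproduce the matrix, because the score and the aging can no longer be separated.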
Stack M[t,j] over all pairs (t,j) and the two factors separate into two matrices:
M = Λ ⊙ (CBᵀ), Λ[t,j] = ∏_{k=j+1}^{t} ā_k
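The unroll is easy to verify on a toy example. The sketch below runs the recurrence token by token, then builds M entry by entry from the formula above and multiplies. Sizes, random inputs, and the single scalar input channel are assumptions chosen to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4                         # sequence length, state size (toy values)
a_bar = rng.uniform(0.5, 1.0, L)    # scalar decay per step
B = rng.normal(size=(L, N))         # write vectors B_t
C = rng.normal(size=(L, N))         # read vectors C_t
x = rng.normal(size=L)              # one scalar input channel

# View 1: the recurrence, one token at a time.
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = a_bar[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# View 2: the matrix M = Lambda ⊙ (C B^T), built directly from the definition.
Lam = np.zeros((L, L))
for t in range(L):
    for j in range(t + 1):
        Lam[t, j] = np.prod(a_bar[j + 1 : t + 1])   # empty product gives 1 on the diagonal
M = Lam * (C @ B.T)
y_mat = M @ x

print(np.allclose(y_rec, y_mat))
```

The two outputs match to floating-point tolerance, which is the duality in its smallest form: the same function, computed once as a loop and once as a masked matrix multiply.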
One more practical point, because it's why this is fast and not just elegant. You never compute that product directly. Since ā_k = exp(Δ_k a), a product of decays is the exponential of a sum of log-decays, and a sum over a range [j+1, t] is just a difference of running totals. One cumulative sum over the sequence, one outer-difference, one exp, and the entire L×L mask Λ falls out: no loops, no underflow from multiplying hundreds of sub-1 numbers. That cumulative-product-as-prefix-sum structure is also exactly what makes Λ a 1-semiseparable matrix (every block taken from on or below the diagonal has rank at most 1), the property that lets the same matrix multiply be re-evaluated as a linear-time recurrence. Duality, made of one cumsum.
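Here is the cumsum construction written out, as a sketch rather than the production kernel. The function name decay_mask is illustrative, and masking before the exp is one reasonable way (assumed here) to keep the upper-triangle differences, which are positive, from overflowing.

```python
import numpy as np

def decay_mask(log_a_bar):
    # log_a_bar: shape (L,), entries log(a_bar_k) <= 0.
    L = log_a_bar.shape[0]
    prefix = np.cumsum(log_a_bar)                 # prefix[t] = sum of log a_bar_k for k <= t
    diff = prefix[:, None] - prefix[None, :]      # diff[t, j] = sum over k = j+1..t
    causal = np.tril(np.ones((L, L), dtype=bool))
    # Mask before exp so the upper triangle becomes exactly 0.
    return np.exp(np.where(causal, diff, -np.inf))

rng = np.random.default_rng(0)
a_bar = rng.uniform(0.5, 1.0, size=8)
Lam = decay_mask(np.log(a_bar))

# Reference: the direct product from the definition.
ref = np.array([[np.prod(a_bar[j + 1 : t + 1]) if j <= t else 0.0
                 for j in range(8)] for t in range(8)])
print(np.allclose(Lam, ref))

# A block strictly below the diagonal is an outer product, so its rank is 1.
print(np.linalg.matrix_rank(Lam[4:, :4], tol=1e-10))
```

The rank check on the lower-left block is the semiseparable property in miniature: every entry there is exp(prefix[t]) times exp(−prefix[j]), an outer product of two vectors.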
7. The structured matrix view

The left panel is the decay mask Λ, with Λ[t,j] = ∏ ā_k. It is lower-triangular by construction (causality), its diagonal equals 1 (the current token enters un-decayed), and each subdiagonal extends the one above it by one more ā factor. The middle panel is CBᵀ, the content-similarity matrix: same shape as a Transformer's QKᵀ, notably not causal on its own, and shared across heads. The right panel is the Hadamard product M = Λ ⊙ (CBᵀ), the full structured "attention" matrix Mamba-2 multiplies against the input.
Read the three panels as a division of labor: CBᵀ decides what's relevant, Λ decides what's still in reach, and M is their elementwise product. The causal triangle comes entirely from Λ; the texture comes entirely from CBᵀ.
If you set every ā_k = 1 (no decay), Λ becomes the standard causal 0/1 mask and M reduces to plain linear attention. So Mamba-2 is literally "causal linear attention with a learned multiplicative decay mask" — and the mask is what gives it expressive power that plain linear attention does not have. Look at Λ on its own: that's the only piece without a Transformer analogue, and it's the entire structural prior that separates the two families.
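The no-decay limit can be checked in a few lines. In this sketch B, C, and x play the roles of keys, queries, and values; "linear attention" here means the unnormalized causal form (mask times scores times values), which is an assumption about which variant is meant.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4
B = rng.normal(size=(L, N))   # plays the role of keys
C = rng.normal(size=(L, N))   # plays the role of queries
x = rng.normal(size=L)        # plays the role of values (one channel)

a_bar = np.ones(L)                                   # no decay anywhere
prefix = np.cumsum(np.log(a_bar))
Lam = np.tril(np.exp(prefix[:, None] - prefix[None, :]))

causal_mask = np.tril(np.ones((L, L)))
y_ssd = (Lam * (C @ B.T)) @ x
y_linear_attention = (causal_mask * (C @ B.T)) @ x   # unnormalized causal linear attention

print(np.array_equal(Lam, causal_mask))
print(np.allclose(y_ssd, y_linear_attention))
```

With every ā_k equal to 1 the log-decays are all zero, so Λ is exactly the 0/1 causal mask and the two outputs coincide.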
8. The reveal: same prompt, three matrices

Figure: the same prompt seen through three matrices: Mamba-1's unrolled matrix, Mamba-2's M (built directly as Λ ⊙ (CBᵀ), no unrolling), and the Transformer's softmax attention.
Three things to notice.
M2 is actually softer than M1 here, and §5 says why. M1's diagonal trace is crisper; M2's pattern spreads more uniformly across the lower triangle. That's the opposite of what the averaging math alone would suggest: M1 averages |M| over 512 channels, M2 over only 4 heads, so fewer terms in the mean should give a sharper picture. The explanation is in §5: head 1 holds memory across the entire prompt (Λ ≈ 1 on the whole lower triangle), so its contribution to the head-mean M is essentially |CBᵀ| restricted to the lower triangle, a dense block with no decay falloff. That one long-memory head dominates the head-average and washes out the others. At production scale (M2 routinely runs 24–32 heads, not 4), no single head dominates the mean this way.
Both M1 and M2 are softer than the Transformer. Softmax produces hard, nearly one-hot weighting at peaks; M2’s mask is monotone-decaying but not normalised, so it spreads. The pattern rhymes with attention, but the type of weighting still differs — there is no exp-of-scores normalization anywhere in M2.
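The difference in the type of weighting shows up in the row sums. The sketch below compares a causal softmax over raw scores with the SSD-style masked product on the same random scores. It is deliberately simplified: there is no 1/√d scaling, and B and C stand in for keys and queries, both assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4
B = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
a_bar = rng.uniform(0.5, 1.0, size=L)

scores = C @ B.T
causal = np.tril(np.ones((L, L), dtype=bool))

# Transformer-style weights: causal mask, then softmax over each row.
masked = np.where(causal, scores, -np.inf)
attn = np.exp(masked - masked.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# SSD-style weights: multiplicative decay mask, no normalization.
prefix = np.cumsum(np.log(a_bar))
Lam = np.exp(np.where(causal, prefix[:, None] - prefix[None, :], -np.inf))
M = Lam * scores

print("softmax row sums:", attn.sum(axis=1).round(3))
print("SSD row sums:    ", M.sum(axis=1).round(3))
print("any negative SSD weight:", bool((M < 0).any()))
```

Softmax rows are non-negative and sum to 1 by construction. The SSD rows have no such constraint: they can sum to anything and can contain negative weights, since the raw dot products are never exponentiated.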
The bright leftmost column persists in all three. In M1 and M2 it’s partly structural (early positions accumulate the most decay-survival inside Λ); in the Transformer it’s the well-studied attention-sink phenomenon. Three different mechanisms, one converging visual.
The honest read is the same as Part 1: at this toy scale, on this prompt, the routing patterns rhyme. What Part 2 adds is the computational identification — M2 and the Transformer now agree not just in interpretation but in form. Both compute (structured mask) ⊙ (content score) · values. M2's mask is learned, soft, multiplicative; the Transformer's is a hard causal 0/1 followed by softmax normalization. Whether the rhyme survives long-context retrieval at production scale is the open question hybrid stacks are implicitly betting on.
9. Why production stacks ship Mamba-2
Three reasons, in order of how loudly they show up in benchmarks.
Wall-clock training. M1's selective-scan kernel is fast but does sequential scalar updates that don't use tensor cores well. M2's training reduces mostly to dense matmuls (CBᵀ and M·x, which sit in tensor-core sweet spots), with the cumulative log-sums that form Λ and the Hadamard product as cheap elementwise steps around them. The paper reports the SSD core layer is 2–8× faster than Mamba-1's optimized selective scan. That is a layer-level speedup, and it comes precisely because the work is now dense matmul on tensor cores. That's the line that turns a research architecture into something you can train at scale on a fixed budget.
Larger d_state. Dropping A from diagonal to scalar also drops the cost of growing N. M1 carried d_state = 16; M2 routinely runs at 64, 128, or 256. That’s the same trade Transformers made when going from MHA’s small d_head to GQA’s larger d_head per remaining head — more capacity per head, fewer total heads, simpler kernel.
Shared kernel surface with attention. Because the M2 forward is (Λ ⊙ CBᵀ) · x, a lot of the same chunking, recompute, and FlashAttention-style tiling tricks apply directly. In practice the real kernel never materializes the full L×L matrix at all: it cuts the sequence into chunks, does the dense dual form within each chunk, and passes a single fixed-size state across chunk boundaries (which is what the 1-semiseparable structure from §6 allows). The chunked block algorithm is essentially "FlashAttention with a multiplicative decay mask." The same engineers writing fast attention kernels can write fast SSD kernels.
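The chunking idea fits in a short NumPy sketch: dense dual form inside each chunk, one carried state between chunks. This is a single-head, unbatched illustration with hypothetical function names and an arbitrary chunk size of 4. The real kernel is fused, batched, and multi-head, and none of that is attempted here.

```python
import numpy as np

def ssd_recurrent(a_bar, B, C, x):
    # Reference: token-by-token recurrence with an (N, P) state.
    L, N = B.shape
    h = np.zeros((N, x.shape[1]))
    y = np.zeros_like(x)
    for t in range(L):
        h = a_bar[t] * h + np.outer(B[t], x[t])
        y[t] = C[t] @ h
    return y

def ssd_chunked(a_bar, B, C, x, chunk=4):
    L, N = B.shape
    h = np.zeros((N, x.shape[1]))          # state carried across chunk boundaries
    y = np.zeros_like(x)
    for s in range(0, L, chunk):
        e = min(s + chunk, L)
        prefix = np.cumsum(np.log(a_bar[s:e]))          # log-decay from chunk start
        diff = prefix[:, None] - prefix[None, :]
        causal = np.tril(np.ones((e - s, e - s), dtype=bool))
        Lam = np.exp(np.where(causal, diff, -np.inf))   # within-chunk decay mask
        # 1) Dense dual form inside the chunk.
        y[s:e] = (Lam * (C[s:e] @ B[s:e].T)) @ x[s:e]
        # 2) Contribution of the state that entered the chunk.
        y[s:e] += np.exp(prefix)[:, None] * (C[s:e] @ h)
        # 3) Advance the carried state to the end of the chunk.
        to_end = np.exp(prefix[-1] - prefix)            # decay from each position to chunk end
        h = np.exp(prefix[-1]) * h + (B[s:e] * to_end[:, None]).T @ x[s:e]
    return y

rng = np.random.default_rng(0)
L, N, P = 10, 4, 3
a_bar = rng.uniform(0.5, 1.0, size=L)
B, C = rng.normal(size=(L, N)), rng.normal(size=(L, N))
x = rng.normal(size=(L, P))
print(np.allclose(ssd_recurrent(a_bar, B, C, x), ssd_chunked(a_bar, B, C, x)))
```

The largest matrix ever built is chunk × chunk, and the only thing that crosses a boundary is the (N, P) state, so memory stays bounded while the inner work remains dense matmul.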
What’s next in the series
- Part 3 — Mamba-3. What the line does after SSD: the next iteration on the selective-SSM family. Covered when there’s enough public detail to say something visual about it.
References
- Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2). arXiv:2405.21060.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
- Ali, A., Zimerman, I. & Wolf, L. (2024). The Hidden Attention of Mamba Models. arXiv:2403.01590.
- Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.