Mamba2 is Literally Attention. Here’s How.

Last Updated on June 22, 2026 by Editorial Team

Author(s): Jun Nishimura

Originally published on Towards AI.

Part 2 of 3: Mamba-2 and the State Space Duality. How the recurrence becomes a matmul and why Nemotron-H ships this one, written for ML engineers who already speak fluent Transformer.

1. Why this one, why now

Part 1 of this series showed that any trained Mamba-1 layer can be rewritten as an attention matrix — the recurrence’s output decomposes into a content-weighted sum of past inputs, exactly the shape attention has. But that was an equivalence of functions, proved on paper after the fact. Mamba-1’s forward pass is a recurrence: carry a state, update it token by token, read it out. No L×L matrix is ever allocated. The attention matrix is something you write down to interpret the model — not a tensor the GPU ever holds.

In Mamba-2 that rewrite isn’t a description. It’s the forward computation. The training kernel literally builds the matrices CBᵀ (the scores), Λ (the decay mask), and M = Λ ⊙ CBᵀ as real tensors in memory (block by block rather than as one full L×L array, as §9 explains), then multiplies y = M·x. That's an equivalence of computation, not just of functions: the model scores pairs of positions, masks, and takes a weighted sum of values because that is the sequence of ops it runs, not because of how we chose to read it.

Put differently, the arrow of derivation flips. In Mamba-1 (M1 from here on) the recurrence is primary and the matrix is derived; in Mamba-2 (M2) the matrix is primary, because it’s how training computes, and the recurrence becomes the derived fast path for inference. (Two caveats keep this honest, both cashed out later: it’s structured masked attention, not softmax, so there’s no exp-normalization, §7 and §8; and the literal construction is the training/dual form, while the inference recurrence computes the identical function by the SSD theorem, §6.)

That single shift is what made selective SSMs production-ready. Recent hybrids such as Nemotron-H and IBM Granite-4 ship Mamba-2 layers, not Mamba-1 (AI21’s Jamba, an earlier hybrid whose paper predates Mamba-2, was built on Mamba-1). The reason is unfussy: M2’s core SSD layer runs 2–8× faster than M1’s optimized selective scan, it supports a much larger state dimension (d_state 64–256 vs M1’s 16), and the kernel is mostly plain tensor-core matmul rather than a custom CUDA selective-scan.

This article is about how one design choice — restricting the state matrix A from a diagonal of N distinct rates down to a single scalar — unlocks all of that, and how the “Mamba is attention” reading sharpens from rhyme into identity. Same TinyStories setup as Part 1: a small 6-layer M2 trained alongside the same parameter-matched Transformer, the same prompt for the comparison figure, so you can put the two articles’ heatmaps next to each other.

The focus here is the math and the pictures; the notebooks and a fully-worked code tutorial are linked at the bottom.

2. The three moves (Mamba-2 in one paragraph)

Dao & Gu’s 2024 paper (Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality) makes three changes to Mamba-1:

A becomes scalar. Where M1’s A was a diagonal matrix with N distinct decay rates per channel (the HiPPO spectrum we plotted in Part 1’s section 4), M2 sets A_t = a_t · I — one decay rate per head per token. All N state slots inside a head share that decay at each step.
The recurrence becomes a structured matrix multiply. Unrolling the SSM with scalar A gives y = M · x where M = Λ ⊙ (CBᵀ). Λ is a lower-triangular scalar decay mask with Λ[t,j] = ∏ ā_k for j+1 ≤ k ≤ t, and CBᵀ is an L×L content-similarity matrix — exactly the shape of attention scores. This is the State Space Duality: the same SSD computation can be evaluated as a recurrence (linear cost, fixed state memory) or as a matmul (quadratic cost, all dense matrix ops), and the two views agree by construction.
The block reshapes into multi-head. Channels regroup into H heads of width P (so d_inner = H·P). Δ promotes from per-channel to per-head. B and C are shared across head groups, GQA-style. Per-channel timescale diversity is gone, but you get capacity back by having more heads at a larger d_state.

The payoff is hardware. Forming Λ, forming CBᵀ, taking their Hadamard product, and computing M · x are all dense matrix operations — exactly what GPUs and TPUs want. The linear recurrence still exists (and is what inference uses), but training rides the quadratic dual.
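To make the shapes in those three moves concrete, here is a minimal NumPy sketch of the dual form for several heads at once. The sizes (L, H, P, N), the random inputs, the helper name decay_mask, and the choice of a single B/C group shared by every head are illustrative assumptions, not the configuration of any released model.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, P, N = 6, 4, 8, 16        # tokens, heads, head width, state size (all assumed)
d_inner = H * P

def decay_mask(abar_h):
    """Lam[t, j] = product of abar_h[k] for j+1 <= k <= t; zero above the diagonal."""
    T = len(abar_h)
    Lam = np.zeros((T, T))
    for t in range(T):
        for j in range(t + 1):
            Lam[t, j] = np.prod(abar_h[j + 1 : t + 1])   # empty product = 1 on the diagonal
    return Lam

x = rng.standard_normal((L, d_inner)).reshape(L, H, P)   # channels regrouped into H heads
abar = rng.uniform(0.5, 1.0, size=(L, H))                # one scalar decay per head per token
B = rng.standard_normal((L, N))                          # one B/C group shared by all heads
C = rng.standard_normal((L, N))

scores = C @ B.T                                            # (L, L), computed once
Lam = np.stack([decay_mask(abar[:, h]) for h in range(H)])  # (H, L, L), one mask per head
M = Lam * scores                                            # broadcasts to (H, L, L)
y = np.einsum("htj,jhp->thp", M, x)                         # (L, H, P)
print(y.reshape(L, d_inner).shape)
```

Note how the heads differ only through Λ in this sketch: the score matrix is computed once and reused, which is the sharing described above.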

The rest of this post earns those three claims one at a time: §3 shows where the recurrence comes from and why “scalar A” is the load-bearing choice, §6 unrolls it into M, and §5/§7/§8 look at what the resulting matrix actually does.

3. From an ODE to a recurrence: the one scalar that changes everything

Before we can say “the recurrence becomes a matrix,” it’s worth being honest about where the recurrence itself comes from — because that’s where the scalar-A choice does its work.

A selective SSM begins life as a continuous system. For a single state channel, with state h(s) ∈ ℝ^N:

dh/ds = a · h(s) + B · x(s), y = C · h

To run it on a sequence of tokens you have to discretize — turn the continuous derivative into a per-step update. The standard move (zero-order hold) is to integrate the ODE across one step of size Δ, holding the input fixed across that step. The exact solution of a linear ODE over one step has two pieces: the old state, shrunk by an exponential of the elapsed "time," plus the new input, written in:

h_t = exp(Δ_t · a) · h_{t-1} + (input gain) · B_t · x_t
      └ decay ā_t┘

The decay factor ā_t = exp(Δ_t · a) is the entire story. Three observations:

Δ_t is the selective timestep. It's a positive number the model emits per token (Δ_t = softplus(W_Δ x_t)). Because a is negative, large Δ_t → ā_t near 0 → forget the past aggressively; small Δ_t → ā_t near 1 → hold the state. This is the "selective" in selective SSM, and it survives untouched into M2.
a is where M1 and M2 part ways. In Mamba-1, a is the n-th diagonal entry of A — a different rate for each of the N state slots. In Mamba-2, a is one scalar per head: every slot in a head decays at the same rate at each step. That is the whole of "A = a·I."
The input gain quietly disappears. The exact discretization puts a gain on the input term too (it tends to Δ_t in the small-step limit — that's M1's familiar Δ_t B_t x_t). M2's clean presentation folds that gain into the already-learned B projection, so Δ ends up living only in the decay. It's a mild simplification, not an identity — but it's why every Δ you'll see from here on sits inside ā, never on the write. (The full zero-order-hold derivation, including why dropping the gain is safe, is in the code tutorial linked at the end.)
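The three observations can be checked numerically. The sketch below assumes a single negative rate a = -1 and three made-up pre-activation values standing in for W_Δ x_t; it computes the timestep, the decay ā_t = exp(Δ_t · a), and the exact zero-order-hold input gain (exp(Δ_t · a) − 1) / a, which should approach Δ_t when the step is small.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); always positive.
    return np.logaddexp(0.0, z)

a = -1.0                                      # assumed continuous-time rate (negative)
pre_activation = np.array([-6.0, 0.0, 6.0])   # stand-ins for W_delta x_t at three tokens

delta = softplus(pre_activation)              # selective timestep, one per token
abar = np.exp(delta * a)                      # decay applied to the old state
zoh_gain = (np.exp(delta * a) - 1.0) / a      # exact zero-order-hold gain on the input

for d, ab, g in zip(delta, abar, zoh_gain):
    print(f"delta={d:.4f}  abar={ab:.4f}  zoh_gain={g:.4f}")
```

The first token (small Δ) keeps ā close to 1 and has a gain nearly equal to Δ; the last token (large Δ) drives ā toward 0.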

Why does making a a scalar matter so much? Because of what it does to the next step. When a is a single number, ā_t is a single number — a plain scalar multiplier on the whole state. And a scalar multiplier is something you can pull out of a sum. That one algebraic fact is what lets the recurrence collapse into a clean matrix in §6. With M1's per-slot diagonal A, the decay is trapped inside the state and no such collapse exists. The scalar isn't a minor simplification of the recurrence; it's the hinge the entire matrix view swings on.

4. Selectivity survives

The thing that made Mamba Mamba — Δ as a content-aware write gate — is still there. This is a Δ heatmap from the same selective-copy task as Part 1’s fig 1, but the rows are now heads (M2 has H=8 here, sharing Δ within each head) instead of channels. The pattern is the same: most of the sequence sits at Δ ≈ 0, and at the few positions holding a real content token, certain heads light up.

What changed visually: only a couple of heads (4 and 7 in this run) carry the selective firing. The rest stay dark. M1 spread the same job across hundreds of channels; M2 concentrates it into a handful of heads, each operating on a much larger N-dimensional state. The mean-Δ panel at the bottom shows the per-head spikes still cleanly track real-token positions even after averaging — the same content-aware write gating, just coarser-grained and louder per head.

The selectivity that gave the family its name isn’t what M2 traded away.

5. What A = a·I actually costs

In Part 1, section 4 showed M1’s memory-horizon spectrum: each of the N=16 state slots inside a channel had its own A^(n), initialized to −(n+1), giving a steadily descending staircase of half-lives from ~100 tokens at n=0 down to ~6 at n=15. That ladder inside a single channel was M1's headline structural fact: one layer, many timescales, baked into A's diagonal.

In M2, that ladder vanishes. With A_t = a_t · I, every slot inside one head decays at the same rate at every step. There is no longer a "long-horizon slot" and a "short-horizon slot" within a head; there is just one number, picked per token by the model.

The figure shows the half-life picture from the trained 6-layer M2 at layer 3. The ladder isn’t gone; it has moved across heads. Head 1 holds ~10⁵ tokens of memory (the bright yellow stripe in the heatmap); head 3 forgets in a fraction of a token; heads 0 and 2 sit between at a few hundred to a few thousand tokens. Each head settles on a characteristic timescale and varies modestly around it per token; head 3 varies the most, its ā collapsing on specific tokens to overwrite the state. The diversity M1 baked into A's diagonal initialization, M2 distributes across heads.
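For readers who want to turn a decay factor into a half-life themselves, here is a small sketch. It assumes a constant per-step decay ā, so the half-life is the n solving āⁿ = 0.5. The M1 ladder assumes slot n has rate −(n+1) and a step size chosen so slot 0 has a 100-token half-life; the per-head ā values for M2 are made up to echo the orders of magnitude described above, not read off the trained model.

```python
import numpy as np

def half_life(abar):
    """Steps n until abar**n == 0.5, for a constant per-step decay 0 < abar < 1."""
    return np.log(0.5) / np.log(abar)

# M1-style ladder inside one channel: 16 slots, each with its own rate.
N = 16
delta = np.log(2.0) / 100.0                   # assumed step size (slot 0 -> 100 tokens)
rates = -(np.arange(N) + 1.0)                 # assumed rates -(n+1) for n = 0..15
m1_half_lives = half_life(np.exp(delta * rates))

# M2-style: one decay per head, shared by every slot in that head.
abar_per_head = np.array([0.999, 0.99999, 0.9998, 0.05])   # illustrative values only
m2_half_lives = half_life(abar_per_head)

print(np.round(m1_half_lives, 2))
print(np.round(m2_half_lives, 2))
```

Under these assumptions the M1 ladder runs from 100 tokens down to 6.25 inside one channel, while each M2 head has exactly one number.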

That tradeoff is real. On paper, M2’s per-layer expressive class is strictly smaller than M1’s — you’ve lost the ability to hold multiple horizons in the same head simultaneously. In practice, two compensations close the gap: heads multiply the number of decay rates available at any given timestep, and d_state grows from 16 to 64–256, so each head carries much more capacity. Whether the cross-head ladder beats the per-slot ladder is an empirical question; the paper’s answer is “comfortably, once you scale heads × d_state past this toy regime.”

6. The unroll: how a loop becomes a matrix

Here’s the move the whole paper is built on, and with §3 in hand it’s three lines of algebra. Take the discretized recurrence — decay the state, write the new input, read it out:

h_t = ā_t · h_{t-1} + B_t · x_t, y_t = C_t · h_t

Unroll it by back-substitution. Feed in an input at step j and follow it forward. It's written in fresh at j, then shrunk by ā_{j+1} at the next step, by ā_{j+2} after that, and so on, until you read out at step t. So the contribution of x_j to the state at time t is x_j multiplied by every decay in between:

h_t = Σ_{j ≤ t} (∏_{k=j+1}^{t} ā_k) · B_j · x_j

Now read it out with C_t. The decay factor is a scalar and the readout h ↦ C_t · h is linear, so C_t slides inside the sum and past the decay:

y_t = Σ_{j ≤ t} (∏_{k=j+1}^{t} ā_k) · (C_t · B_j) · x_j

That is an input→output map y_t = Σ_j M[t,j] · x_j. Read off the coefficient of x_j:

M[t,j] = (∏_{k=j+1}^{t} ā_k) · (C_t · B_j), for j ≤ t (and 0 for j > t)
         └── scalar decay ─┘   └ dot product ┘
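The unroll is easy to sanity-check in a few lines. This sketch assumes one scalar input channel, random ā, B, C, and small sizes; it runs the recurrence token by token, builds M entry by entry from the formula above, and compares the two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 4                                   # sequence length, state size (assumed)
abar = rng.uniform(0.3, 1.0, size=L)          # scalar decay per token
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
x = rng.standard_normal(L)                    # one scalar input channel

# Recurrent form: h_t = abar_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = abar[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Matrix form: M[t, j] = (prod_{k=j+1..t} abar_k) * (C_t . B_j) for j <= t, else 0
M = np.zeros((L, L))
for t in range(L):
    for j in range(t + 1):
        M[t, j] = np.prod(abar[j + 1 : t + 1]) * (C[t] @ B[j])
y_mat = M @ x

print(np.allclose(y_rec, y_mat))
```

The double loop is deliberately naive here; it mirrors the formula rather than the way a real kernel would build the matrix.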

This is exactly where the scalar A pays off. Inside M1, the decay lives inside the sum over state slots — each slot n decays at its own rate A^(n), so you cannot separate "how much does t care about j" from "how aged is that signal." With scalar a, the decay is the same for every slot, so it factors straight out of the sum, and what's left, Σ_n C_t^(n) B_j^(n), collapses into a single inner product C_t · B_j — the literal analogue of a Q·K score. The scalar didn't simplify the recurrence so much as it untangled score from decay.
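A tiny numeric contrast makes the "untangling" visible. The decay values, the gap of 5 steps, and the helper names below are all assumptions for illustration. With a scalar decay the factor leaves the sum over slots; with per-slot decays the effective decay depends on which slots C and B happen to use.

```python
import numpy as np

gap = 5                                        # steps between the write (j) and the read (t)
abar_scalar = 0.9                              # M2-style: one decay for the whole head
abar_slots = np.array([0.99, 0.9, 0.7, 0.4])   # M1-style: one decay per state slot

def entry_scalar(C_t, B_j):
    # sum_n abar^gap * C^(n) * B^(n): the decay is constant in n
    return np.sum(abar_scalar**gap * C_t * B_j)

def entry_diag(C_t, B_j):
    # sum_n abar_n^gap * C^(n) * B^(n): the decay reweights slots inside the sum
    return np.sum(abar_slots**gap * C_t * B_j)

rng = np.random.default_rng(0)
C_t, B_j = rng.standard_normal(4), rng.standard_normal(4)

# Scalar case: the decay factors out, leaving decay * (C_t . B_j).
assert np.isclose(entry_scalar(C_t, B_j), abar_scalar**gap * (C_t @ B_j))

# Diagonal case: the "effective decay" changes with the slots in play.
e0 = np.array([1.0, 0.0, 0.0, 0.0])            # a pair that only uses slot 0
e3 = np.array([0.0, 0.0, 0.0, 1.0])            # a pair that only uses slot 3
ratio_slot0 = entry_diag(e0, e0) / (e0 @ e0)   # equals 0.99**5
ratio_slot3 = entry_diag(e3, e3) / (e3 @ e3)   # equals 0.4**5
print(ratio_slot0, ratio_slot3)
```

Because the two ratios differ, no single mask entry Λ[t,j] can stand in for the diagonal case, which is why the clean Λ ⊙ (CBᵀ) split needs the scalar.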

Stack M[t,j] over all pairs (t,j) and the two factors separate into two matrices:

M = Λ ⊙ (CBᵀ), Λ[t,j] = ∏_{k=j+1}^{t} ā_k

One more practical point, because it’s why this is fast and not just elegant. You never compute that product directly. Since ā_k = exp(Δ_k a), a product of decays is the exponential of a sum of log-decays, and a sum over a range [j+1, t] is just a difference of running totals. One cumulative sum over the sequence, one outer-difference, one exp, and the entire L×L mask Λ falls out (with entries above the diagonal set to zero): no loops, no underflow from multiplying hundreds of sub-1 numbers. That cumulative-product-as-prefix-sum structure is also exactly what makes Λ a 1-semiseparable matrix (every block taken from on or below the diagonal has rank at most 1), the property that lets the same matrix multiply be re-evaluated as a linear-time recurrence. Duality, made of one cumsum.
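Here is the cumsum construction as a sketch, assuming 0 < ā_k ≤ 1 so the logs are finite. The function name decay_mask is hypothetical. It builds Λ with one cumulative sum, one outer difference, and one exp, then checks the result against the naive product and looks at the singular values of a below-diagonal block.

```python
import numpy as np

def decay_mask(abar):
    """Lam[t, j] = prod_{k=j+1..t} abar[k] for j <= t, and 0 above the diagonal."""
    T = abar.shape[0]
    csum = np.cumsum(np.log(abar))             # running totals of log-decays
    diff = csum[:, None] - csum[None, :]       # diff[t, j] = sum_{k=j+1..t} log abar[k]
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Mask before exp: above the diagonal the differences are positive and meaningless.
    return np.exp(np.where(causal, diff, -np.inf))

rng = np.random.default_rng(0)
L = 8
abar = rng.uniform(0.3, 1.0, size=L)
Lam = decay_mask(abar)

# Naive reference: multiply the decays one range at a time.
Lam_naive = np.zeros((L, L))
for t in range(L):
    for j in range(t + 1):
        Lam_naive[t, j] = np.prod(abar[j + 1 : t + 1])

# A block from below the diagonal should be (numerically) rank 1.
sing = np.linalg.svd(Lam[4:, :4], compute_uv=False)
print(np.allclose(Lam, Lam_naive), sing[1] / sing[0])
```

The rank-1 block is the outer product of exp(csum[t]) and exp(−csum[j]), which is the structure the text calls 1-semiseparable.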

7. The structured matrix view

§6 was the algebra; this is the picture. The left panel is Λ alone, the soft causal decay mask we just built, Λ[t,j] = ∏ ā_k. Lower-triangular by construction (causality), diagonal equal to 1 (the current token enters un-decayed), and each subdiagonal extends the one above it by one more ā factor. The middle panel is CBᵀ, the content-similarity matrix, same shape as a Transformer's QKᵀ, and notably not causal on its own and shared across heads. The right panel is the Hadamard product M = Λ ⊙ (CBᵀ), the full structured "attention" matrix Mamba-2 multiplies against the input.

Read the three panels as a division of labor: CBᵀ decides what’s relevant, Λ decides what’s still in reach, and M is their product. The causal triangle comes entirely from Λ; the texture comes entirely from CBᵀ.

If you set every ā_k = 1 (no decay), Λ becomes the standard causal 0/1 mask and M reduces to plain linear attention. So Mamba-2 is literally "causal linear attention with a learned multiplicative decay mask" — and the mask is what gives it expressive power that plain linear attention does not have. Look at Λ on its own: that's the only piece without a Transformer analogue, and it's the entire structural prior that separates the two families.
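That reduction can be verified directly. The sketch below assumes random B, C, x and a hypothetical helper ssd_matrix; it sets every ā_k = 1, confirms Λ is the plain causal 0/1 mask, and compares M·x against causal linear attention written as a running sum (no feature map and no normalization, which is the "plain" variant assumed here).

```python
import numpy as np

def ssd_matrix(abar, B, C):
    """Return (Lam, M) with M = Lam * (C B^T), built via cumulative log-decays."""
    T = len(abar)
    csum = np.cumsum(np.log(abar))
    causal = np.tril(np.ones((T, T), dtype=bool))
    Lam = np.exp(np.where(causal, csum[:, None] - csum[None, :], -np.inf))
    return Lam, Lam * (C @ B.T)

rng = np.random.default_rng(0)
L, N = 6, 4
B, C = rng.standard_normal((L, N)), rng.standard_normal((L, N))
x = rng.standard_normal(L)

# No decay: every abar_k = 1.
Lam_no_decay, M_no_decay = ssd_matrix(np.ones(L), B, C)

# Causal linear attention as a running sum: S_t = S_{t-1} + B_t x_t, y_t = C_t . S_t
S = np.zeros(N)
y_lin = np.zeros(L)
for t in range(L):
    S = S + B[t] * x[t]
    y_lin[t] = C[t] @ S

print(np.array_equal(Lam_no_decay, np.tril(np.ones((L, L)))))
print(np.allclose(M_no_decay @ x, y_lin))
```

Swap np.ones(L) for any decays below 1 and the only thing that changes is Λ, which is the point of the paragraph above.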

8. The reveal: same prompt, three matrices

The same prompt from Part 1, “Lily had a red ball. Tom had a blue ball. Lily wanted to play, so she gave her”, through three layer-3 representations: Mamba-1’s implicit |M| (the post-hoc decomposition from Part 1), Mamba-2’s M (computed directly as Λ ⊙ (CBᵀ), no unrolling), and the Transformer's softmax attention.

Three things to notice.

M2 is actually softer than M1 here, and §5 says why. M1’s diagonal trace is crisper; M2’s pattern spreads more uniformly across the lower triangle. That’s the opposite of what the averaging math alone would suggest: M1 averages |M| over 512 channels, M2 over only 4 heads, so fewer terms in the mean should give a sharper picture. The explanation is in §5: head 1 holds memory across the entire prompt (Λ ≈ 1 on the whole lower triangle), so its contribution to the head-mean M is essentially the causally masked |CBᵀ|, a dense lower triangle with no decay falloff. That one long-memory head dominates the head-average and washes out the others. At production scale (M2 routinely runs 24–32 heads, not 4), no single head dominates the mean this way.

Both M1 and M2 are softer than the Transformer. Softmax produces hard, nearly one-hot weighting at peaks; M2’s mask is monotone-decaying but not normalized, so it spreads. The pattern rhymes with attention, but the type of weighting still differs: there is no exp-of-scores normalization anywhere in M2.
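The difference in weighting type shows up even on random data. In this sketch Q and K play the roles of C and B, and all values are random stand-ins rather than trained weights. Softmax rows are non-negative and sum to 1; rows of M = Λ ⊙ (QKᵀ) are signed and unnormalized.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4
Q, K = rng.standard_normal((L, N)), rng.standard_normal((L, N))   # roles of C and B
abar = rng.uniform(0.5, 1.0, size=L)
causal = np.tril(np.ones((L, L), dtype=bool))

# Transformer: hard causal mask, then exp-normalize each row.
scores = np.where(causal, Q @ K.T / np.sqrt(N), -np.inf)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# Mamba-2 dual form: multiply raw scores by the decay mask, no normalization.
csum = np.cumsum(np.log(abar))
Lam = np.exp(np.where(causal, csum[:, None] - csum[None, :], -np.inf))
M = Lam * (Q @ K.T)

print("softmax row sums:", np.round(attn.sum(axis=1), 3))
print("M row sums:      ", np.round(M.sum(axis=1), 3))
```

Both matrices are lower-triangular, so the causal shape matches; what differs is that M can carry negative weights and rows of any total mass.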

The bright leftmost column persists in all three. In M1 and M2 it’s partly structural (early positions accumulate the most decay-survival inside Λ); in the Transformer it’s the well-studied attention-sink phenomenon. Three different mechanisms, one converging visual.

The honest read is the same as Part 1: at this toy scale, on this prompt, the routing patterns rhyme. What Part 2 adds is the computational identification — M2 and the Transformer now agree not just in interpretation but in form. Both compute (structured mask) ⊙ (content score) · values. M2's mask is learned, soft, multiplicative; the Transformer's is a hard causal 0/1 followed by softmax normalization. Whether the rhyme survives long-context retrieval at production scale is the open question hybrid stacks are implicitly betting on.

9. Why production stacks ship Mamba-2

Three reasons, in order of how loudly they show up in benchmarks.

Wall-clock training. M1’s selective-scan kernel is fast but does sequential scalar updates that don’t use tensor cores well. M2’s training reduces to a short sequence of dense operations: the matmuls CBᵀ and M·x, which sit in tensor-core sweet spots, plus the cheap cumulative log-sums that form Λ and the elementwise Hadamard product. The paper reports the SSD core layer is 2–8× faster than Mamba-1’s optimized selective scan, a layer-level speedup that comes precisely because the bulk of the work is now dense matmul on tensor cores. That’s the line that turns a research architecture into something you can train at scale on a fixed budget.

Larger d_state. Dropping A from diagonal to scalar also drops the cost of growing N. M1 carried d_state = 16; M2 routinely runs at 64, 128, or 256. The trade loosely echoes the move from MHA to GQA in Transformers: give up some per-head independence (there, separate K/V per query head; here, separate decay rates per state slot) in exchange for a cheaper, simpler kernel.

Shared kernel surface with attention. Because the M2 forward is (Λ ⊙ CBᵀ) · x, a lot of the same chunking, recompute, and FlashAttention-style tiling tricks apply directly. In practice the real kernel never materializes the full L×L matrix at all: it cuts the sequence into chunks, does the dense dual form within each chunk, and passes only the recurrent state across chunk boundaries (possible because the off-diagonal blocks of Λ are rank-1, the 1-semiseparable property from §6). The chunked block algorithm is essentially "FlashAttention with a multiplicative decay mask." The same engineers writing fast attention kernels can write fast SSD kernels.
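The chunking idea can be sketched in plain NumPy. This is a toy under stated assumptions (one scalar input channel, one head, hypothetical function names, and a division of cumulative products that a production kernel would replace with log-space arithmetic), not the paper's kernel. Inside each chunk it uses the small quadratic form; between chunks it carries only the N-dimensional state.

```python
import numpy as np

def ssd_quadratic(abar, B, C, x):
    """Full dual form: materialize M = Lam * (C B^T) and multiply."""
    T = len(abar)
    csum = np.cumsum(np.log(abar))
    causal = np.tril(np.ones((T, T), dtype=bool))
    Lam = np.exp(np.where(causal, csum[:, None] - csum[None, :], -np.inf))
    return (Lam * (C @ B.T)) @ x

def ssd_chunked(abar, B, C, x, chunk):
    """Dense dual form inside each chunk, recurrent state carried between chunks."""
    L, N = B.shape
    y = np.zeros(L)
    h = np.zeros(N)                              # state at the end of the previous chunk
    for s in range(0, L, chunk):
        a, Bc, Cc, xc = abar[s:s + chunk], B[s:s + chunk], C[s:s + chunk], x[s:s + chunk]
        decay_in = np.cumprod(a)                 # decay from chunk start through step t
        y_intra = ssd_quadratic(a, Bc, Cc, xc)   # 1) pairs inside the chunk
        y_inter = decay_in * (Cc @ h)            # 2) read the carried state, aged so far
        y[s:s + chunk] = y_intra + y_inter
        decay_out = decay_in[-1] / decay_in      # prod_{k=t+1..end of chunk} abar_k
        h = decay_in[-1] * h + (decay_out * xc) @ Bc   # 3) state at the end of this chunk
    return y

rng = np.random.default_rng(0)
L, N = 10, 4
abar = rng.uniform(0.5, 1.0, size=L)
B, C = rng.standard_normal((L, N)), rng.standard_normal((L, N))
x = rng.standard_normal(L)

y_full = ssd_quadratic(abar, B, C, x)
y_chunk = ssd_chunked(abar, B, C, x, chunk=4)    # chunks of 4, 4, and 2 tokens
print(np.allclose(y_full, y_chunk))
```

With chunk equal to L this collapses to the quadratic form, and with chunk equal to 1 it is the token-by-token recurrence, so the chunk size is a dial between the two views.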

What’s next in the series

Part 3 — Mamba-3. What the line does after SSD: the next iteration on the selective-SSM family. Covered when there’s enough public detail to say something visual about it.

References

Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2). arXiv:2405.21060.
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
Ali, A., Zimerman, I. & Wolf, L. (2024). The Hidden Attention of Mamba Models. arXiv:2403.01590.
Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.

Published via Towards AI
