VGGT: 5 Images → 3D in 62ms — TensorRT Optimization for CVPR 2025 Best Paper
Last Updated on June 14, 2026 by Editorial Team
Author(s): kyon
Originally published on Towards AI.

A hands-on benchmark, a COLMAP comparison, and a full TensorRT FP16 conversion of a 1.26B-parameter 3D reconstruction Transformer.
If you’ve spent any time with COLMAP for Structure-from-Motion, you know the part I mean: the waiting. Feature extraction, then pairwise matching, then bundle adjustment, and the matching step alone scales O(n²) in the number of images. You go make coffee.
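To put a number on that O(n²): if every image is matched against every other image, the number of pairs is n(n-1)/2. A minimal sketch, assuming exhaustive matching (COLMAP also ships other matching strategies, which this does not model); `exhaustive_pairs` is a name made up for this illustration.

```python
from math import comb


def exhaustive_pairs(num_images: int) -> int:
    """Number of unordered image pairs an exhaustive matcher has to compare."""
    return comb(num_images, 2)


for n in (10, 25, 100):
    print(n, exhaustive_pairs(n))  # 45, 300 and 4950 pairs respectively
```

Going from 10 to 25 images is 2.5x the input but more than 6x the pairs, which is why the matching stage dominates as the set grows.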
VGGT skips all of that. One forward pass and you get camera poses, depth maps, a point cloud, and point tracks, all at once. It picked up the CVPR 2025 Best Paper Award, and after using it for a while I get why.
So in this post I run it on an RTX 5090, put it head to head with COLMAP, and then spend most of my time on the thing I actually wanted to do: convert the whole model to TensorRT FP16. Jumping ahead to the result, that gets me 6.9x on real images with roughly 0.13% relative error on depth.
What Is VGGT?
VGGT, short for Visual Geometry Grounded Transformer, comes out of Meta and Oxford and weighs in at 1.26B parameters.
The old way (COLMAP) was a pipeline of separate stages: feature extraction, pairwise matching, bundle adjustment, dense reconstruction. Each stage stood on its own, errors from one fed into the next, and every one of them had knobs you had to tune.
VGGT collapses that into a single Transformer. A DINOv2 backbone turns the images into tokens, 24 layers of alternating self-attention (within each frame, then globally across frames) mix everything together, and a set of task-specific heads produce the outputs in parallel. Nothing after that. No iterative refinement, no cleanup pass.
DUSt3R and MASt3R from Naver Labs were already heading this direction, but they still split the work into pairwise inference plus a separate global optimization step. VGGT is feed-forward the whole way through, and it doesn’t even care what order you feed the images in.
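The alternating attention described above is mostly a question of how the tokens are grouped before attention runs. Here is a rough NumPy sketch of that grouping only. It is not VGGT's code: the attention is a toy single-head version with identity projections, and every function name is made up for the illustration.

```python
import numpy as np


def self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention with identity projections: (N, T, C) -> (N, T, C)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def frame_attention(tokens: np.ndarray) -> np.ndarray:
    """Each frame attends only to its own patch tokens."""
    B, S, P, C = tokens.shape  # batch, frames, patches per frame, channels
    x = tokens.reshape(B * S, P, C)  # frames become independent sequences
    return (x + self_attention(x)).reshape(B, S, P, C)


def global_attention(tokens: np.ndarray) -> np.ndarray:
    """All tokens of all frames in a scene attend to each other."""
    B, S, P, C = tokens.shape
    x = tokens.reshape(B, S * P, C)  # one long sequence per scene
    return (x + self_attention(x)).reshape(B, S, P, C)


def alternating_block(tokens: np.ndarray) -> np.ndarray:
    """One frame-level step followed by one global step."""
    return global_attention(frame_attention(tokens))


rng = np.random.default_rng(0)
tokens = rng.standard_normal((1, 3, 4, 8))  # 1 scene, 3 frames, 4 patches, 8 channels
out = alternating_block(tokens)
print(out.shape)
```

The only difference between the two steps is the reshape: `(B*S, P, C)` keeps frames isolated, `(B, S*P, C)` lets information flow across views.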
Running It

Environment
GPU : NVIDIA GeForce RTX 5090 (Blackwell, 32GB GDDR7)
PyTorch : 2.12.0.dev (nightly) + CUDA 12.8
Model : facebook/VGGT-1B (1.26B params)
OS : Windows 11 → WSL2 Ubuntu 24.04
One gotcha up front: the RTX 5090 (sm_120 / Blackwell) won’t run on stable PyTorch. You need the nightly with cu128. I lost some time to that before I thought to check.
Inference Code
import glob
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

model = VGGT.from_pretrained("facebook/VGGT-1B")  # https://huggingface.co/facebook/VGGT-1B
model.eval().to("cuda")
# load_and_preprocess_images takes a list of file paths, so expand the glob first
image_paths = sorted(glob.glob("path/to/images/*.jpg"))
images = load_and_preprocess_images(image_paths).to("cuda")
with torch.no_grad():
    predictions = model(images)
# Camera poses, depth maps, 3D points, confidence, all in one call
That’s the whole thing. No config file, no tuning pass, which still feels slightly off if you’re coming from COLMAP.
Scaling (RTX 5090, measured)
Images   Time    VRAM      Points (conf>0.5)
──────────────────────────────────────────────────
2        0.15s   7.3 GB    362,600
5        0.21s   8.3 GB    906,500
10       0.36s   9.8 GB    1,813,000
25       1.21s   12.5 GB   4,532,500
Time and VRAM both scale more or less linearly with image count. Two images, 0.15 seconds. The point cloud is dense (one point per pixel), so you get around 180,000 points per image, and that adds up fast.
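The point counts in the table can be reproduced with plain arithmetic. A small sketch, assuming the 350×518 input resolution that comes up later in the TensorRT section and one point per pixel; `dense_point_count` is a made-up helper name.

```python
HEIGHT, WIDTH = 350, 518  # input resolution quoted later in the post (assumed to apply here)


def dense_point_count(num_images: int) -> int:
    """One 3D point per pixel, per image."""
    return num_images * HEIGHT * WIDTH


for n in (2, 5, 10, 25):
    print(n, f"{dense_point_count(n):,}")
```

That works out to 181,300 points per image, which matches every row of the table.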
Quick note before anyone gets confused: these are the plain VGGT numbers, no TensorRT yet. The 62ms in the title is the same 5-image job after the TensorRT work further down.
COLMAP vs VGGT on the Same Images
Same kitchen-scene images for both, so it’s a fair comparison.
25 images: COLMAP succeeds
                    COLMAP    VGGT
─────────────────────────────────────────────
Processing time     9.05s     1.19s (7.6x faster)
3D points           2,075     4,532,500
Registered images   24/25     25/25
COLMAP spends about 75% of its runtime on feature matching, which is exactly the O(n²) part.
10 images: COLMAP fails
                    COLMAP    VGGT
─────────────────────────────────────────────
Processing time     5.75s     0.35s (16.2x faster)
3D points           42        1,813,000
Registered images   2/10      10/10
This is where it got interesting. COLMAP managed to register only 2 of the 10 images and gave me all of 42 points. Drop the viewpoint overlap and its matching just falls apart; the log was a wall of "Discarding reconstruction due to insufficient size."
VGGT took the same 10 images and produced 1.8 million points without complaint. No matching step means there’s nothing to fall apart, so sparse overlap doesn’t bother it.
To be fair to COLMAP: I ran it with CPU SIFT, so GPU SIFT would be quicker. And its sparse points are feature-based and genuinely accurate, so comparing raw point counts isn’t really apples to apples anyway. The two are solving different problems at different densities.
TensorRT Optimization
This is the part I actually cared about.
VGGT is fast, but ~400ms for 5 frames in eager FP32 isn’t fast enough for real-time work, or for chewing through a large batch. TensorRT (TRT) is NVIDIA’s inference engine: FP16, layer fusion, kernel autotuning, the usual. The catch was that when I went looking for prior art I came up empty. Nobody seems to have put a 1.26B-parameter Transformer like this through TRT, at least not anywhere I could find.
Monolithic export: impossible
torch.export.export(model, ...) # → Failed
List outputs, dynamic reshapes, conditional branches, take your pick. torch.export choked with both strict=True and strict=False. After enough failed attempts I stopped trying to export the model in one piece and split it apart instead.
Split into 5 Components
I broke VGGT into 5 independent pieces and ran each through ONNX → TRT FP16. Each piece had its own way of fighting back, so here’s the rundown, phase by phase.

Phase A: DINOv2 and the antialias wall
DINOv2 interpolates its positional encoding with F.interpolate(mode="bicubic", antialias=True). TensorRT doesn't support that op, so that's a dead end head-on.
The way around it: the input resolution is fixed at 350×518, which means the positional encoding never actually changes. So I computed it once and baked it into a buffer.
import torch.nn as nn

class DINOv2FixedPosEnc(nn.Module):
    def __init__(self, original_dinov2, H=350, W=518):
        super().__init__()  # must run before register_buffer
        pos_embed = interpolate_pos_encoding(...)  # compute once
        self.register_buffer('fixed_pos_embed', pos_embed)
Results (DINOv2 only):
Batch=1: TRT 4.73ms vs Eager 21.25ms 4.5x
Batch=5: TRT 13.65ms vs Eager 80.16ms 5.9x
Batch=10: TRT 27.16ms vs Eager 156.25ms 5.8x
Accuracy: rel_err 0.59%
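A note on rel_err, since it shows up in every accuracy line from here on. The sketch below uses one common definition, the relative L2 error against the eager output. That choice is an assumption made for illustration and may not be the exact metric behind the numbers in this post.

```python
import numpy as np


def rel_err(test: np.ndarray, reference: np.ndarray, eps: float = 1e-12) -> float:
    """Relative L2 error of `test` against `reference` (assumed metric)."""
    test = np.asarray(test, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    return float(np.linalg.norm(test - reference) / (np.linalg.norm(reference) + eps))


# Toy check: pushing FP32 values through FP16 and back gives a small, nonzero error.
rng = np.random.default_rng(0)
reference = rng.standard_normal(10_000).astype(np.float32)
fp16_roundtrip = reference.astype(np.float16).astype(np.float32)
print(f"{rel_err(fp16_roundtrip, reference):.4%}")
```

With a metric like this, a value well under 1% is in the range you would expect from FP16 rounding, while something like 50% points at a real bug rather than precision.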
Phase B: Aggregator and RoPE
The Aggregator is the big one, 48 Transformer blocks (24 frame-level, 24 global). The trouble was hiding in the RoPE (Rotary Position Embedding) code:
# Original VGGT, data-dependent
max_pos = int(positions.max()) + 1 # Python int pulled from a tensor value
That int(positions.max()) reads a Python int straight out of a tensor, and torch.export can't trace through it. Same flavor of problem as Phase A: something dynamic that didn't need to be.
The fix was the same idea, too. I wrote a FixedRoPE2D that precomputes the cos/sin tables up front and keeps them as buffers, then dropped it into all 48 blocks.
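import torch.nn as nn

class FixedRoPE2D(nn.Module):
    def __init__(self, original_rope, max_positions=1024):
        super().__init__()  # must run before register_buffer
        freqs = compute_all_freqs(...)  # all angles for positions 0..max_positions-1
        self.register_buffer('cos_cached', freqs.cos())
        self.register_buffer('sin_cached', freqs.sin())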
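To make the precompute-then-index idea concrete, here is a NumPy sketch of the same pattern in one dimension. It is an illustration under assumptions, not the FixedRoPE2D from this post: the base of 10000 is the usual RoPE default rather than a value taken from VGGT, and `rope_angles` and `CachedRoPETables` are made-up names.

```python
import numpy as np


def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles for each position, shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)


class CachedRoPETables:
    """Build cos/sin once for positions 0..max_positions-1, then only index into them."""

    def __init__(self, dim: int, max_positions: int = 1024):
        angles = rope_angles(np.arange(max_positions), dim)
        self.cos_cached = np.cos(angles)
        self.sin_cached = np.sin(angles)

    def lookup(self, positions: np.ndarray):
        # Pure indexing: nothing here depends on positions.max(),
        # so table shapes are fixed at construction time.
        return self.cos_cached[positions], self.sin_cached[positions]


tables = CachedRoPETables(dim=64)
positions = np.array([0, 3, 17, 1023])
cos, sin = tables.lookup(positions)
print(cos.shape, sin.shape)
```

The trade-off is a hard upper bound: any position at or above `max_positions` is an indexing error, so the bound has to cover the largest input you plan to run.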
Results (Aggregator only):
S=2: TRT 17.7ms vs Eager 90.0ms 5.1x
S=5: TRT 35.0ms vs Eager 235.8ms 6.7x
S=10: TRT 89.6ms vs Eager 591.0ms 6.6x
Engine: 1,232MB, 23,569 TRT layers, built in 54s
Accuracy: rel_err 0.16%
Phase C: the three heads
The heads each had their own thing going on.
CameraHead runs a 4-iteration refinement loop, and only the first iteration gets None as input. That None-then-Tensor switch is more control flow than torch.export will follow, so I just unrolled the four iterations by hand.
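The shape of that unrolling, in a toy form: plain floats stand in for tensors, and the update rule is invented purely so the two versions can be compared. None of this is the real CameraHead code, and all function names are hypothetical.

```python
from typing import Optional


def refine_step(tokens: float, prev_pose: Optional[float]) -> float:
    """Stand-in for one refinement iteration, with the None special case."""
    if prev_pose is None:  # first iteration: no previous estimate yet
        return 0.5 * tokens
    return prev_pose + 0.5 * (tokens - prev_pose)


def camera_head_loop(tokens: float, num_iterations: int = 4) -> float:
    """Original shape: one loop, and the argument type flips from None to a value."""
    pose = None
    for _ in range(num_iterations):
        pose = refine_step(tokens, pose)
    return pose


def first_step(tokens: float) -> float:
    """The None branch, split out into its own function."""
    return 0.5 * tokens


def later_step(tokens: float, prev_pose: float) -> float:
    """The branch that always receives a previous estimate."""
    return prev_pose + 0.5 * (tokens - prev_pose)


def camera_head_unrolled(tokens: float) -> float:
    """Unrolled shape: four straight-line calls, no None and no branch to trace."""
    pose = first_step(tokens)
    pose = later_step(tokens, pose)
    pose = later_step(tokens, pose)
    pose = later_step(tokens, pose)
    return pose


print(camera_head_loop(8.0), camera_head_unrolled(8.0))  # 7.5 7.5
```

The unrolled version hard-codes the iteration count, which is fine for a head that always runs four iterations.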
DepthHead and PointHead (both DPT) gave me three smaller ones:
- custom_interpolate used a data-dependent chunk, so I monkey-patched it back to F.interpolate
- there were leftover nn.quantized.FloatFunctional.add() calls, which I swapped for a plain +
- an einsum broke TRT's external-weight parsing, fixed the same way as before by precomputing cos/sin into buffers
Results (Heads, S=5):
Camera: TRT 2.1ms vs Eager 4.8ms 2.3x
Depth: TRT 5.3ms vs Eager 19.3ms 3.7x
Point: TRT 5.3ms vs Eager 19.3ms 3.6x
Phase D: full pipeline, end to end
With all 5 engines built, I wired them together in Python to get the full pipeline running. The Aggregator eats most of the budget, 59.5% of total time.
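The wiring itself is simple. Below is a sketch of the control flow with per-stage timing, where each stage is just a Python callable. The `run_pipeline` name and the dummy stages are placeholders for illustration; the real stages would wrap TRT engine execution, which is not shown here.

```python
import time
from typing import Any, Callable, Dict, Tuple

Stage = Callable[[Any], Any]


def run_pipeline(
    images: Any, backbone: Stage, aggregator: Stage, heads: Dict[str, Stage]
) -> Tuple[Dict[str, Any], Dict[str, float]]:
    """Run backbone -> aggregator -> heads and record wall-clock milliseconds per stage."""
    timings: Dict[str, float] = {}

    def timed(name: str, fn: Stage, x: Any) -> Any:
        # With asynchronous GPU execution you would have to synchronize
        # before reading the clock, otherwise the numbers are meaningless.
        start = time.perf_counter()
        out = fn(x)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return out

    tokens = timed("dinov2", backbone, images)
    features = timed("aggregator", aggregator, tokens)
    outputs = {name: timed(name, head, features) for name, head in heads.items()}
    return outputs, timings


# Dummy stages so the sketch runs on its own.
outputs, timings = run_pipeline(
    1,
    backbone=lambda x: x + 1,
    aggregator=lambda x: x * 10,
    heads={"camera": lambda f: f + 0.5, "depth": lambda f: f + 1, "point": lambda f: f + 2},
)
print(outputs)
```

All three heads read the same aggregator output, so they have no dependency on each other and could run in any order.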
End-to-End Benchmark (RTX 5090):
Frames   TRT FP16   Eager FP32   Speedup
─────────────────────────────────────────────
2        31.8ms     175.3ms      5.5x
5        61.7ms     401.9ms      6.5x
10       142.4ms    1,628.8ms    11.4x
So 5 images reconstruct in about 62ms, which is the number in the title. The 6.5x here is measured on clean synthetic input. The real-image figure comes out a little higher (6.9x), but that’s after I fixed a bug in Phase E, so hold that thought for a second.
At 10 frames the gap widens to 11.4x: eager takes 1.6 seconds, TRT does it in 142ms. The more frames you add, the more the memory-layout work pays off.
Accuracy (TRT FP16 vs Eager FP32):
Output         rel_err
──────────────────────────
Camera poses   1.03%
Depth maps     0.03%
3D points      0.12%
Phase E: the .contiguous() bug that cost me hours
Everything converted, all the piece-wise numbers looking good. Then I ran the full thing on real images (llff_flower, 5 frames) and got a depth rel_err of 51%.
Fifty-one percent. Not a rounding issue, not a precision wobble, just wrong.
First instinct was precision, so I toggled FP16, FP32, TF32, and the optimization level every which way. Nothing moved. So it wasn’t that. Tracing it stage by stage, the DINOv2 TRT step alone was off by 94%, and that blew up to 107% by the end. But, and this is the part that had me stuck, each head on its own came in within 0.05% when I fed it eager inputs.
The culprit turned out to be memory layout.
# BAD: numpy -> permute creates a non-contiguous tensor
images = torch.from_numpy(np_array).permute(0,3,1,2).cuda()
# .is_contiguous() == False
# TRT reads from data_ptr() assuming NCHW-contiguous, so it gets garbage
# FIX: force contiguous memory
images = images.contiguous()
Here’s what bit me. torch.randn() hands back contiguous memory by default, so through every synthetic benchmark the problem simply never existed. Real images take a different path: numpy, then permute, which leaves you with a non-contiguous tensor. TRT reads straight from data_ptr() expecting an NCHW-contiguous layout, gets something else, and quietly produces nonsense.
The fix is one line, .contiguous(), which felt almost insulting given how long it took to find. I haven't seen it written down anywhere, so for the record: if you're handing tensors to TRT, make them contiguous first.
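The same effect can be reproduced without a GPU. The sketch below uses NumPy as a stand-in for the torch tensors above: `np.transpose` plays the role of `.permute()` and `np.ascontiguousarray` the role of `.contiguous()`. The helper name `to_nchw_contiguous` is made up for the example.

```python
import numpy as np


def to_nchw_contiguous(nhwc: np.ndarray) -> np.ndarray:
    """NHWC -> NCHW with the data physically rearranged in memory."""
    nchw_view = np.transpose(nhwc, (0, 3, 1, 2))  # a view: only the strides change
    return np.ascontiguousarray(nchw_view)  # copy into C-contiguous NCHW order


# Tiny "image batch": N=2, H=2, W=3, C=3
nhwc = np.arange(2 * 2 * 3 * 3, dtype=np.float32).reshape(2, 2, 3, 3)
view = np.transpose(nhwc, (0, 3, 1, 2))
fixed = to_nchw_contiguous(nhwc)

print(view.flags["C_CONTIGUOUS"], fixed.flags["C_CONTIGUOUS"])  # False True

# Indexing hides the problem: both arrays agree element by element.
print(np.array_equal(view, fixed))  # True

# A consumer that walks the raw buffer linearly sees different data.
# The view still shares nhwc's buffer, which is laid out in NHWC order.
raw_from_view = nhwc.ravel()
raw_from_fixed = fixed.ravel()
print(np.array_equal(raw_from_view, raw_from_fixed))  # False
```

That is why the bug is so quiet: every shape check and every indexed comparison passes, and only code that reads the buffer by raw pointer sees the wrong layout.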
After the fix: real-image results
Output      Eager range      TRT range        rel_err
──────────────────────────────────────────────────────
Depth       [0.33, 2.72]     [0.33, 2.71]     0.13%
3D Points   [-1.10, 2.53]    [-1.11, 2.52]    0.21%
Camera      -                -                0.07%
Inference: Eager 404.7ms → TRT 58.7ms (6.9x)
On real images, 404.7ms eager drops to 58.7ms, so 6.9x. That’s the real-world number, and the one I’d actually quote. Depth, the point cloud (PLY), camera poses, all of it lines up with the eager output by eye.

Technical Takeaways
Why TRT conversion was feasible
I expected this to be worse than it was, and the reason it wasn’t comes down to one thing: VGGT has zero custom CUDA kernels. It’s all stock PyTorch ops (F.scaled_dot_product_attention, F.embedding, F.interpolate, and friends). Compare that to something like SLAM3R, which has 73 CuRoPE custom-kernel call sites. That's the kind of thing that turns a TRT conversion into a slog.
Three patterns covered everything
Looking back, every problem I hit was a variant of the same three patterns, and every fix was the same move: take whatever was being computed dynamically at runtime and precompute it at build time instead.

The bottleneck is the Aggregator
Component     Time     Share
──────────────────────────────────
DINOv2        22.6ms   20.7%
Aggregator    64.9ms   59.5%   (604M params, 48 blocks)
Camera Head   3.4ms    3.1%
Depth Head    9.1ms    8.3%
Point Head    9.0ms    8.2%
If I wanted to push further, the Aggregator is where to look, either by splitting it up or going to INT8. That’s the next thing on my list.
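For a rough sense of what that buys, the shares in the table can be recomputed and plugged into a what-if. The per-stage times below are the ones from the table; the 2x factor is purely hypothetical and is not a measured or predicted INT8 result.

```python
stage_ms = {
    "DINOv2": 22.6,
    "Aggregator": 64.9,
    "Camera Head": 3.4,
    "Depth Head": 9.1,
    "Point Head": 9.0,
}

total_ms = sum(stage_ms.values())
shares = {name: 100.0 * ms / total_ms for name, ms in stage_ms.items()}


def total_after_speedup(stage: str, factor: float) -> float:
    """Total time if one stage gets `factor` times faster and the rest stay put."""
    return total_ms - stage_ms[stage] + stage_ms[stage] / factor


print(f"Aggregator share: {shares['Aggregator']:.1f}%")

# Hypothetical: a 2x faster Aggregator, everything else unchanged.
what_if_ms = total_after_speedup("Aggregator", 2.0)
print(f"{total_ms:.1f}ms -> {what_if_ms:.2f}ms ({total_ms / what_if_ms:.2f}x overall)")
```

Even a 2x win on the Aggregator would only move the total by about 1.4x, since the other four stages still account for roughly 40% of the time.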
Summary
Performance:
- Eager FP32, 5 frames: 401.9ms
- TRT FP16, 5 frames: 61.7ms (6.5x, clean input)
- TRT FP16, 10 frames: 142.4ms (11.4x)
- Real images, 5 frames: 58.7ms (6.9x)
- Accuracy loss: depth 0.13%, points 0.21%, basically nothing
What I built:
- 5 TRT FP16 engines, 2,521 MB all together
- The full pipeline: DINOv2 → Aggregator → Camera/Depth/Point heads
- Real-image validation, including that one-line .contiguous() fix
What I’d take away:
- No custom CUDA kernels is what made this doable in the first place
- Every problem came down to moving work from runtime to build time
- The Aggregator (59.5%) is the bottleneck, and INT8 is next
VGGT is the paper that, to me, marks the shift in 3D reconstruction from iterative optimization to plain feed-forward inference. TensorRT just takes that and runs another 6 to 11x with it.
