VGGT: 5 Images → 3D in 62ms — TensorRT Optimization for CVPR 2025 Best Paper

Last Updated on June 14, 2026 by Editorial Team

Author(s): kyon

Originally published on Towards AI.

A hands-on benchmark, a COLMAP comparison, and a full TensorRT FP16 conversion of a 1.26B-parameter 3D reconstruction Transformer.

If you’ve spent any time with COLMAP for Structure-from-Motion, you know the part I mean: the waiting. Feature extraction, then pairwise matching, then bundle adjustment, and the matching step alone scales O(n²) in the number of images. You go make coffee.
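
To make the O(n²) concrete: an exhaustive matcher compares every unordered pair of images, so the pair count is n(n-1)/2. Here is a small sketch in plain Python; the function name is made up for illustration and is not part of COLMAP.

```python
from itertools import combinations


def exhaustive_pairs(n_images: int) -> int:
    """Number of unordered image pairs an exhaustive matcher has to compare."""
    return n_images * (n_images - 1) // 2


for n in (2, 5, 10, 25, 100):
    # Cross-check the closed form against an explicit enumeration of pairs.
    assert exhaustive_pairs(n) == sum(1 for _ in combinations(range(n), 2))
    print(f"{n:>3} images -> {exhaustive_pairs(n):>5} pairs")
```

Going from 10 to 25 images takes the pair count from 45 to 300, which is why the matching stage ends up dominating as the image set grows.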

VGGT skips all of that. One forward pass and you get camera poses, depth maps, a point cloud, and point tracks, all at once. It picked up the CVPR 2025 Best Paper Award, and after using it for a while I get why.

So in this post I run it on an RTX 5090, put it head to head with COLMAP, and then spend most of my time on the thing I actually wanted to do: convert the whole model to TensorRT FP16. Jumping ahead to the result, that gets me 6.9x on real images with roughly 0.13% accuracy loss.

What Is VGGT?

VGGT, short for Visual Geometry Grounded Transformer, comes out of Meta and Oxford and weighs in at 1.26B parameters.

The old way (COLMAP) was a pipeline of separate stages: feature extraction, pairwise matching, bundle adjustment, dense reconstruction. Each stage stood on its own, errors from one fed into the next, and every one of them had knobs you had to tune.

VGGT collapses that into a single Transformer. A DINOv2 backbone turns the images into tokens, 24 layers of alternating self-attention (within each frame, then globally across frames) mix everything together, and a set of task-specific heads produce the outputs in parallel. Nothing after that. No iterative refinement, no cleanup pass.

DUSt3R and MASt3R from Naver Labs were already heading this direction, but they still split the work into pairwise inference plus a separate global optimization step. VGGT is feed-forward the whole way through, and it doesn’t even care what order you feed the images in.

Running It

Environment

GPU : NVIDIA GeForce RTX 5090 (Blackwell, 32GB GDDR7)
PyTorch : 2.12.0.dev (nightly) + CUDA 12.8
Model : facebook/VGGT-1B (1.26B params)
OS : Windows 11 → WSL2 Ubuntu 24.04

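One gotcha up front: the RTX 5090 (sm_120 / Blackwell) needs a PyTorch build with CUDA 12.8 (cu128) support, because wheels built against older CUDA versions don’t include sm_120 kernels. I used the nightly with cu128. I lost some time to that before I thought to check.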

Inference Code

import glob

import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

model = VGGT.from_pretrained("facebook/VGGT-1B")  # https://huggingface.co/facebook/VGGT-1B
model.eval().to("cuda")

# load_and_preprocess_images takes a list of file paths, so expand the glob first
image_paths = sorted(glob.glob("path/to/images/*.jpg"))
images = load_and_preprocess_images(image_paths).to("cuda")
with torch.no_grad():
    predictions = model(images)
# Camera poses, depth maps, 3D points, confidence, all in one call

That’s the whole thing. No config file, no tuning pass, which still feels slightly off if you’re coming from COLMAP.

Scaling (RTX 5090, measured)

Images   Time     VRAM      Points (conf > 0.5)
──────────────────────────────────────────────────
2        0.15s    7.3 GB    362,600
5        0.21s    8.3 GB    906,500
10       0.36s    9.8 GB    1,813,000
25       1.21s    12.5 GB   4,532,500

Time and VRAM both scale more or less linearly with image count. Two images, 0.15 seconds. The point cloud is dense (one point per pixel), so you get around 180,000 points per image, and that adds up fast.
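
The point counts follow directly from the resolution. A quick check, assuming the 350×518 input size mentioned later in the post (the helper name is made up):

```python
H, W = 350, 518  # input resolution stated later in the post


def dense_points(n_images: int, height: int = H, width: int = W) -> int:
    """Upper bound on the point count when every pixel yields one 3D point."""
    return n_images * height * width


for n in (2, 5, 10, 25):
    print(f"{n:>2} images -> {dense_points(n):,} points")
```

The results match the table exactly (362,600 for 2 images up to 4,532,500 for 25), which suggests every pixel cleared the 0.5 confidence threshold in these runs.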

Quick note before anyone gets confused: these are the plain VGGT numbers, no TensorRT yet. The 62ms in the title is the same 5-image job after the TensorRT work further down.

COLMAP vs VGGT on the Same Images

Same kitchen-scene images for both, so it’s a fair comparison.

25 images: COLMAP succeeds

                    COLMAP     VGGT
─────────────────────────────────────────────
Processing time     9.05s      1.19s (7.6x faster)
3D points           2,075      4,532,500
Registered images   24/25      25/25

COLMAP spends about 75% of its runtime on feature matching, which is exactly the O(n²) part.

10 images: COLMAP fails

                    COLMAP     VGGT
─────────────────────────────────────────────
Processing time     5.75s      0.35s (16.2x faster)
3D points           42         1,813,000
Registered images   2/10       10/10

This is where it got interesting. COLMAP managed to register only 2 of the 10 images and gave me all of 42 points. Drop the viewpoint overlap and its matching just falls apart; the log was a wall of “Discarding reconstruction due to insufficient size” messages.

VGGT took the same 10 images and produced 1.8 million points without complaint. No matching step means there’s nothing to fall apart, so sparse overlap doesn’t bother it.

To be fair to COLMAP: I ran it with CPU SIFT, so GPU SIFT would be quicker. And its sparse points are feature-based and genuinely accurate, so comparing raw point counts isn’t really apples to apples anyway. The two are solving different problems at different densities.

TensorRT Optimization

This is the part I actually cared about.

VGGT is fast, but ~400ms for 5 frames in eager FP32 isn’t fast enough for real-time work, or for chewing through a large batch. TensorRT (TRT) is NVIDIA’s inference engine: FP16, layer fusion, kernel autotuning, the usual. The catch was that when I went looking for prior art I came up empty. Nobody seems to have put a 1.26B-parameter Transformer like this through TRT, at least not anywhere I could find.

Monolithic export: impossible

torch.export.export(model, ...) # → Failed

List outputs, dynamic reshapes, conditional branches, take your pick. torch.export choked with both strict=True and strict=False. After enough failed attempts I stopped trying to export the model in one piece and split it apart instead.

Split into 5 Components

I broke VGGT into 5 independent pieces and ran each through ONNX → TRT FP16. Each piece had its own way of fighting back, so here’s the rundown, phase by phase.

Phase A: DINOv2 and the antialias wall

DINOv2 interpolates its positional encoding with F.interpolate(..., mode="bicubic", antialias=True). TensorRT doesn't support the antialiased variant of that op, so that's a dead end head-on.

The way around it: the input resolution is fixed at 350×518, which means the positional encoding never actually changes. So I computed it once and baked it into a buffer.

class DINOv2FixedPosEnc(nn.Module):
    def __init__(self, original_dinov2, H=350, W=518):
        super().__init__()  # must run before register_buffer
        pos_embed = interpolate_pos_encoding(...)  # compute once, at build time
        self.register_buffer('fixed_pos_embed', pos_embed)

Results (DINOv2 only):
 Batch=1: TRT 4.73ms vs Eager 21.25ms 4.5x
 Batch=5: TRT 13.65ms vs Eager 80.16ms 5.9x
 Batch=10: TRT 27.16ms vs Eager 156.25ms 5.8x
 Accuracy: rel_err 0.59%
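
For readers who want to try the bake-it-into-a-buffer trick in isolation, here is a minimal, self-contained sketch. It is not the wrapper used above: the class name, the 64-dim embedding, the 37×37 source grid, and the 14-pixel patch size are all assumptions for illustration, and the class token is ignored to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FixedPosEmbed(nn.Module):
    """Adds a positional embedding that was resized once, at construction time."""

    def __init__(self, pos_embed: torch.Tensor, src_grid: int, dst_grid: tuple):
        super().__init__()
        dim = pos_embed.shape[-1]
        # (1, src*src, C) -> (1, C, src, src) so F.interpolate sees a 2D grid.
        grid = pos_embed.reshape(1, src_grid, src_grid, dim).permute(0, 3, 1, 2)
        # The antialiased bicubic resize runs here, in eager PyTorch, exactly once.
        resized = F.interpolate(grid, size=dst_grid, mode="bicubic", antialias=True)
        fixed = resized.permute(0, 2, 3, 1).reshape(1, dst_grid[0] * dst_grid[1], dim)
        # Stored as a buffer, so forward() contains no interpolate call at all.
        self.register_buffer("fixed_pos_embed", fixed)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return patch_tokens + self.fixed_pos_embed


# 350x518 input with 14-pixel patches gives a 25x37 patch grid (assumed patch size).
module = FixedPosEmbed(torch.randn(1, 37 * 37, 64), src_grid=37, dst_grid=(25, 37))
out = module(torch.zeros(2, 25 * 37, 64))
print(out.shape)  # torch.Size([2, 925, 64])
```

The trade-off is that the module now only works at the resolution it was built for, which is fine here because the engine is built for a fixed input size anyway.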

Phase B: Aggregator and RoPE

The Aggregator is the big one, 48 Transformer blocks (24 frame-level, 24 global). The trouble was hiding in the RoPE (Rotary Position Embedding) code:

# Original VGGT, data-dependent
max_pos = int(positions.max()) + 1 # Python int pulled from a tensor value

That int(positions.max()) reads a Python int straight out of a tensor, and torch.export can't trace through it. Same flavor of problem as Phase A: something dynamic that didn't need to be.

The fix was the same idea, too. I wrote a FixedRoPE2D that precomputes the cos/sin tables up front and keeps them as buffers, then dropped it into all 48 blocks.

class FixedRoPE2D(nn.Module):
    def __init__(self, original_rope, max_positions=1024):
        super().__init__()  # must run before register_buffer
        freqs = compute_all_freqs(...)  # every position up to max_positions, computed once
        self.register_buffer('cos_cached', freqs.cos())
        self.register_buffer('sin_cached', freqs.sin())

Results (Aggregator only):
 S=2: TRT 17.7ms vs Eager 90.0ms 5.1x
 S=5: TRT 35.0ms vs Eager 235.8ms 6.7x
 S=10: TRT 89.6ms vs Eager 591.0ms 6.6x
 Engine: 1,232MB, 23,569 TRT layers, built in 54s
 Accuracy: rel_err 0.16%
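
The same precompute-then-look-up idea, cut down to 1D and plain Python so it runs anywhere. This is only an illustration, not VGGT's 2D RoPE: the function names, the 8-dim vector, and the base of 10000 are assumptions; only max_positions=1024 comes from the snippet above.

```python
import math


def build_rope_tables(max_positions: int, dim: int, base: float = 10000.0):
    """Precompute cos/sin for every position in [0, max_positions) once."""
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def rotate(vec, pos, cos, sin):
    """Rotate consecutive (even, odd) pairs of vec by the cached angles for pos."""
    out = []
    for i in range(len(vec) // 2):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = cos[pos][i], sin[pos][i]
        out += [x * c - y * s, x * s + y * c]
    return out


# The table size is a fixed constant, so nothing depends on positions.max() at runtime.
COS, SIN = build_rope_tables(max_positions=1024, dim=8)
q = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0]
rotated = rotate(q, pos=37, cos=COS, sin=SIN)
```

At runtime the only position-dependent operation left is an index into a constant table, which is the kind of thing an exporter can trace without trouble.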

Phase C: the three heads

The heads each had their own thing going on.

CameraHead runs a 4-iteration refinement loop, and only the first iteration gets None as input. That None-then-Tensor switch is more control flow than torch.export will follow, so I just unrolled the four iterations by hand.
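
A toy version of that unrolling, to show the shape of the change. The update rule here (feed tokens plus the running prediction back in, accumulate the deltas) is an assumption for illustration and not VGGT's actual CameraHead; the point is only that the unrolled form has no loop and no None check.

```python
def refine_loop(tokens, step, n_iters=4):
    """Loop form: the first iteration sees pred=None, later ones see a value."""
    pred = None
    for _ in range(n_iters):
        inp = tokens if pred is None else tokens + pred  # None-vs-value branch
        delta = step(inp)
        pred = delta if pred is None else pred + delta
    return pred


def refine_unrolled(tokens, step):
    """Same computation as straight-line code: no loop, no None check."""
    pred = step(tokens)                # iteration 1, no previous prediction
    pred = pred + step(tokens + pred)  # iteration 2
    pred = pred + step(tokens + pred)  # iteration 3
    pred = pred + step(tokens + pred)  # iteration 4
    return pred


def toy_step(x):
    """Stand-in for the real refinement block."""
    return 0.5 * x + 1.0


assert refine_loop(2.0, toy_step) == refine_unrolled(2.0, toy_step)
```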

DepthHead and PointHead (both DPT) gave me three smaller problems:

custom_interpolate used a data-dependent chunk, so I monkey-patched it back to F.interpolate
there were leftover nn.quantized.FloatFunctional.add() calls, which I swapped for a plain +
an einsum broke TRT's external-weight parsing, fixed the same way as before by precomputing cos/sin into buffers

Results (Heads, S=5):
 Camera: TRT 2.1ms vs Eager 4.8ms 2.3x
 Depth: TRT 5.3ms vs Eager 19.3ms 3.7x
 Point: TRT 5.3ms vs Eager 19.3ms 3.6x

Phase D: full pipeline, end to end

With all 5 engines built, I wired them together in Python to get the full pipeline running. The Aggregator eats most of the budget, 59.5% of total time.

End-to-End Benchmark (RTX 5090):
Frames   TRT FP16    Eager FP32    Speedup
─────────────────────────────────────────────
2        31.8ms      175.3ms       5.5x
5        61.7ms      401.9ms       6.5x
10       142.4ms     1,628.8ms     11.4x

So 5 images reconstructs in about 62ms, which is the number in the title. The 6.5x here is measured on clean synthetic input. The real-image figure comes out a little higher (6.9x), but that’s after I fixed a bug in Phase E, so hold that thought for a second.

At 10 frames the gap widens to 11.4x: eager takes 1.6 seconds, TRT does it in 142ms. The more frames you add, the more the memory-layout work pays off.
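
The post doesn't show how these latencies were measured. For anyone reproducing them, here is a minimal timing sketch under the usual assumptions: warm up first, synchronize before reading the clock, and report the median. The helper names and defaults are made up for illustration.

```python
import statistics
import time


def benchmark_ms(fn, sync=lambda: None, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds.

    sync is called before each clock read; on a GPU, pass something like
    torch.cuda.synchronize so queued kernels are included in the timing.
    """
    for _ in range(warmup):  # let caches, autotuners, and allocators settle
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized path is than the baseline."""
    return baseline_ms / optimized_ms


print(round(speedup(401.9, 61.7), 1))  # 5-frame row from the table: 6.5
```

Without the synchronize call, GPU work is only queued when fn() returns, and the measured time can come out far too small.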

Accuracy (TRT FP16 vs Eager FP32):
 Output rel_err
 ──────────────────────────
 Camera poses 1.03%
 Depth maps 0.03%
 3D points 0.12%
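
The post doesn't spell out how rel_err is defined. One common choice, shown here purely as an assumption, is the mean absolute difference divided by the mean absolute value of the reference (eager) output:

```python
def rel_err(test, reference):
    """Sum of absolute errors, normalized by the summed magnitude of the reference."""
    if len(test) != len(reference):
        raise ValueError("inputs must have the same length")
    abs_diff = sum(abs(t - r) for t, r in zip(test, reference))
    abs_ref = sum(abs(r) for r in reference)
    return abs_diff / abs_ref


eager_depth = [0.33, 1.20, 2.72]  # made-up values, not measurements from the post
trt_depth = [0.33, 1.20, 2.71]
print(f"{rel_err(trt_depth, eager_depth):.2%}")
```

Other definitions (max error, per-element relative error) give different numbers for the same tensors, so it's worth pinning the metric down before comparing against the figures here.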

Phase E: the .contiguous() bug that cost me hours

Everything converted, all the piece-wise numbers looking good. Then I ran the full thing on real images (llff_flower, 5 frames) and got a depth rel_err of 51%.

Fifty-one percent. Not a rounding issue, not a precision wobble, just wrong.

First instinct was precision, so I toggled FP16, FP32, TF32, and the optimization level every which way. Nothing moved. So it wasn’t that. Tracing it stage by stage, the DINOv2 TRT step alone was off by 94%, and that blew up to 107% by the end. But, and this is the part that had me stuck, each head on its own came in within 0.05% when I fed it eager inputs.

The culprit turned out to be memory layout.

# BAD: numpy -> permute creates a non-contiguous tensor
images = torch.from_numpy(np_array).permute(0,3,1,2).cuda()
# .is_contiguous() == False
# TRT reads from data_ptr() assuming NCHW-contiguous, so it gets garbage

# FIX: force contiguous memory
images = images.contiguous()

Here’s what bit me. torch.randn() hands back contiguous memory by default, so through every synthetic benchmark the problem simply never existed. Real images take a different path: numpy, then permute, which leaves you with a non-contiguous tensor. TRT reads straight from data_ptr() expecting an NCHW-contiguous layout, gets something else, and quietly produces nonsense.

The fix is one line, .contiguous(), which felt almost insulting given how long it took to find. I haven't seen it written down anywhere, so for the record: if you're handing tensors to TRT, make them contiguous first.
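
The failure mode is easy to reproduce without a GPU. The sketch below uses NumPy as a stand-in for the tensor (in PyTorch the matching calls are .is_contiguous() and .contiguous()) and shows what a consumer sees if it walks the raw buffer while assuming NCHW-contiguous data. The tiny 2×2 "image" is made up for illustration.

```python
import numpy as np

# A tiny image batch in NHWC, the layout image loaders usually produce.
nhwc = np.arange(1 * 2 * 2 * 3, dtype=np.float32).reshape(1, 2, 2, 3)

# Transposing to NCHW only rewrites strides; the bytes in memory stay in NHWC order.
nchw_view = nhwc.transpose(0, 3, 1, 2)
print(nchw_view.flags["C_CONTIGUOUS"])  # False

# What you get by reading the raw buffer as if it were NCHW-contiguous.
raw_as_nchw = nhwc.ravel().reshape(1, 3, 2, 2)
print(np.array_equal(raw_as_nchw, nchw_view))  # False

# Forcing a contiguous copy makes memory order match the logical NCHW shape.
nchw_fixed = np.ascontiguousarray(nchw_view)
print(nchw_fixed.flags["C_CONTIGUOUS"])  # True
print(np.array_equal(nchw_fixed.ravel().reshape(1, 3, 2, 2), nchw_view))  # True
```

A cheap safeguard is to assert contiguity right before handing a pointer to the engine, so a bad layout fails loudly instead of producing plausible-looking garbage.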

After the fix: real-image results

Output Eager range TRT range rel_err
──────────────────────────────────────────────────────
Depth [0.33, 2.72] [0.33, 2.71] 0.13%
3D Points [-1.10, 2.53] [-1.11, 2.52] 0.21%
Camera — — 0.07%

Inference: Eager 404.7ms → TRT 58.7ms (6.9x)

On real images, 404.7ms eager drops to 58.7ms, so 6.9x. That’s the real-world number, and the one I’d actually quote. Depth, the point cloud (PLY), camera poses, all of it lines up with the eager output by eye.

Technical Takeaways

Why TRT conversion was feasible

I expected this to be worse than it was, and the reason it wasn’t comes down to one thing: VGGT has zero custom CUDA kernels. It’s all stock PyTorch ops (F.scaled_dot_product_attention, F.embedding, F.interpolate, and friends). Compare that to something like SLAM3R, which has 73 CuRoPE custom-kernel call sites. That's the kind of thing that turns a TRT conversion into a slog.

Three patterns covered everything

Looking back, every problem I hit was a variant of the same three patterns, and every fix was the same move: take whatever was being computed dynamically at runtime and precompute it at build time instead.

The bottleneck is the Aggregator

Component      Time      Share
──────────────────────────────────
DINOv2         22.6ms    20.7%
Aggregator     64.9ms    59.5%   (604M params, 48 blocks)
Camera Head    3.4ms     3.1%
Depth Head     9.1ms     8.3%
Point Head     9.0ms     8.2%

If I wanted to push further, the Aggregator is where to look, either by splitting it up or going to INT8. That’s the next thing on my list.

Summary

Performance:

Eager FP32, 5 frames: 401.9ms
TRT FP16, 5 frames: 61.7ms (6.5x, clean input)
TRT FP16, 10 frames: 142.4ms (11.4x)
Real images, 5 frames: 58.7ms (6.9x)
Accuracy loss: depth 0.13%, points 0.21%, basically nothing

What I built:

5 TRT FP16 engines, 2,521 MB all together
The full pipeline: DINOv2 → Aggregator → Camera/Depth/Point heads
Real-image validation, including that one-line .contiguous() fix

What I’d take away:

No custom CUDA kernels is what made this doable in the first place
Every problem came down to moving work from runtime to build time
The Aggregator (59.5%) is the bottleneck, and INT8 is next

VGGT is the paper that, to me, marks the shift in 3D reconstruction from iterative optimization to plain feed-forward inference. TensorRT just takes that and runs another 6 to 11x with it.
