
Splatt3R: Zero-Shot Gaussian Splatting from Uncalibrated Image Pairs

145 points by jasondavies 9 months ago

5 comments

refibrillator 9 months ago
Novel view synthesis via 3DGS requires knowledge of the camera pose for every input image, i.e. the camera's position and orientation in 3D space.

Historically, camera poses have been estimated via 2D image matching techniques like SIFT [1], through software packages like COLMAP.

These algorithms work well when you have many images that methodically cover a scene. However, they often struggle to produce accurate estimates in the few-image regime, or "in the wild" where photos are taken casually with less rigorous scene coverage.

To address this, a major trend in the field is to move away from classical 2D algorithms and instead leverage methods that incorporate 3D "priors", i.e. learned knowledge of the world.

To that end, this paper builds heavily on MASt3R [2], a vision transformer model trained to reconstruct a 3D scene from a pair of 2D images. The authors added another projection head to output the initial parameters for each Gaussian primitive. They further optimize the Gaussians through some clever use of the original image pair and randomly selected, rendered novel views, which is basically the original 3DGS algorithm but using synthesized target images instead (hence "zero-shot" in the title).

I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.

In particular, the quadratic time complexity of transformers is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, and to resolutions beyond 512 x 512.

Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can't be accelerated with k-d trees due to the curse of dimensionality. That's why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity with respect to image resolution.

[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_transform

[2] https://github.com/naver/mast3r
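To make the "extra projection head" idea above concrete, here is a minimal PyTorch-style sketch of a per-pixel Gaussian parameter head attached to a MASt3R-style decoder feature map. This is not the paper's actual code: the feature dimension, the 14-channel parameterization, and the activation/scaling choices are illustrative assumptions.

```python
# Hypothetical sketch: one 3D Gaussian predicted per pixel of the decoder
# feature map, anchored to the per-pixel 3D point map that a MASt3R-style
# branch already produces. Dimensions and activations are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # 3 (xyz offset) + 1 (opacity) + 3 (log scale) + 4 (rotation quat) + 3 (RGB) = 14
        self.proj = nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, feats: torch.Tensor, pointmap: torch.Tensor) -> dict:
        # feats:    (B, C, H, W) decoder features
        # pointmap: (B, 3, H, W) per-pixel 3D points from the reconstruction branch
        out = self.proj(feats)
        offset, opacity, log_scale, rot, rgb = torch.split(out, [3, 1, 3, 4, 3], dim=1)
        return {
            "mean": pointmap + 0.05 * torch.tanh(offset),  # small refinement around the point map
            "opacity": torch.sigmoid(opacity),
            "scale": torch.exp(log_scale.clamp(max=4.0)),
            "rotation": F.normalize(rot, dim=1),           # unit quaternion
            "color": torch.sigmoid(rgb),
        }

# Usage: for a 64 x 64 feature map this yields 4096 Gaussians per image,
# which a 3DGS renderer can then rasterize into the synthesized novel views.
head = GaussianHead(feat_dim=256)
gaussians = head(torch.randn(1, 256, 64, 64), torch.randn(1, 3, 64, 64))
```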
jonhohle 9 months ago
The mirror in the example with the washing machine is amazing. Obviously the model doesn't understand that it's a mirror, so it renders it as if it were a window with volume behind the wall. But it does this so realistically that it produces the same effect as a mirror when viewed from different angles. This feels like something out of a sci-fi detective movie.
Comment #41368494 not loaded
Comment #41374230 not loaded
Comment #41372857 not loaded
S0y 9 months ago
This is really awesome. A question for someone who knows more about this: How much harder would it be to make this work using any number of photos? I'm assuming this is the end goal for a model like this.

Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.
Comment #41368627 not loaded
Comment #41368608 not loaded
teqsun 9 months ago
Just to check my understanding: the novel part of this is that it generates the reconstruction from two pictures taken with any camera, without custom hand-calibration for that particular camera, and everything else involved is existing technology?
Comment #41375705 not loaded
rkagerer 9 months ago
What is a splat?
Comment #41375736 not loaded
Comment #41369665 not loaded
Comment #41369333 not loaded
Comment #41369971 not loaded