GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization

Abstract

Although various visual localization approaches exist, such as scene coordinate regression and camera pose regression, these methods often struggle with optimization complexity or limited accuracy. To address these challenges, we explore the use of novel view synthesis techniques, particularly 3D Gaussian Splatting (3DGS), which enables the compact encoding of both 3D geometry and scene appearance. We propose a two-stage procedure that integrates dense and robust keypoint descriptors from the lightweight XFeat feature extractor into 3DGS, enhancing performance in both indoor and outdoor environments. In the first stage, coarse pose estimates are obtained directly from 2D-3D correspondences between the 3DGS representation and query image descriptors. In the second stage, the initial pose estimate is refined by minimizing a rendering-based photometric warp loss. Benchmarking on widely used indoor and outdoor datasets demonstrates improvements over recent neural rendering-based localization methods, such as NeRFMatch and PNeRFLoc.

Video

Approach

We model the scene using a feature-based 3D Gaussian Splatting (3DGS) approach, grounding keypoint descriptors into a 3D representation for fast, reliable coarse pose estimation. Descriptors from the XFeat network enable localization in both static and dynamic environments. At test time, we estimate the initial coarse pose by matching sparse 2D keypoints from the query image to 3D points in the 3DGS model using a simple but effective greedy matching strategy, followed by a Perspective-n-Point (PnP) solver in a RANSAC loop. We then refine the pose through test-time optimization, aligning a rendered image with the input query using an RGB warping loss. The pipeline runs end-to-end without additional learnable modules or complex intermediate steps.
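The coarse matching step can be illustrated with a minimal sketch. It assumes one plausible reading of "greedy matching": each query descriptor is paired with its most similar 3D descriptor under cosine similarity, and weak pairs are dropped. The function name `greedy_match`, the array layout, and the `min_sim` threshold are illustrative assumptions, not details taken from the method itself.

```python
import numpy as np


def greedy_match(query_desc, point_desc, min_sim=0.8):
    """Pair each 2D query descriptor with its most similar 3D descriptor.

    query_desc: (N, D) descriptors at query-image keypoints.
    point_desc: (M, D) descriptors attached to 3D points of the scene model.
    Returns a (K, 2) integer array of (query_index, point_index) pairs.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    p = point_desc / np.linalg.norm(point_desc, axis=1, keepdims=True)

    sim = q @ p.T                       # (N, M) similarity matrix
    best = sim.argmax(axis=1)           # greedy choice: best 3D point per keypoint
    best_sim = sim[np.arange(len(q)), best]

    # Hypothetical threshold: discard keypoints whose best match is weak.
    keep = best_sim >= min_sim
    return np.stack([np.nonzero(keep)[0], best[keep]], axis=1)
```

In a pipeline of this kind, the surviving index pairs would select the 2D keypoint coordinates and the corresponding 3D point positions, which could then be passed to a PnP solver with RANSAC, for example OpenCV's `cv2.solvePnPRansac`, to obtain the coarse pose.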

7Scenes evaluation

We evaluate our method in indoor scenarios using the 7Scenes dataset. Each image below is divided by a slider. The left side shows the ground truth (GT) as a query image, while the right side displays the render from our estimated refined pose. By comparing these side-by-side visualizations, you can observe the high accuracy of our approach in matching the estimated pose to the ground truth.

Fire

Office

Cambridge Landmarks evaluation

We evaluate our method on the Cambridge Landmarks dataset for outdoor scenarios, addressing challenges like varying lighting and diverse architectural features. Our approach demonstrates robust performance in complex urban environments, accurately estimating camera poses.

KingsCollege

ShopFacade

Dynamic scenes evaluation

Additionally, we evaluate our method in challenging dynamic environments using the outdoor Phototourism dataset and the indoor Sitcoms3D dataset. These datasets let us assess our approach in more complex and variable scenes, demonstrating its robustness and versatility across different types of environments.

Phototourism / Brandenburg Gate

Sitcoms3D / TBBT Big living room

BibTeX


@misc{sidorov2025gsplatlocgroundingkeypointdescriptors,
  title={GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization},
  author={Gennady Sidorov and Malik Mohrat and Denis Gridusov and Ruslan Rakhimov and Sergey Kolyubin},
  year={2025},
  eprint={2409.16502},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.16502},
}