2022-06-21, 11:15–11:45, Room 1
Visual localization is a key technology for applications such as augmented, mixed, and virtual reality, as well as robotics and autonomous driving. It addresses the problem of estimating the 6-degree-of-freedom (DoF) pose of the camera that captured a given image or image sequence, relative to a reference scene representation, often a set of images with known poses. Although much research has been devoted to this area in recent years, large variations in appearance caused by season, weather, illumination, and man-made changes, as well as large-scale environments, remain challenging for visual localization solutions. To overcome the limitations caused by appearance changes, traditional hand-crafted local feature descriptors such as SIFT (Lowe, 2004) or SURF (Bay et al., 2008) are increasingly replaced by learned descriptors such as SuperPoint (DeTone et al., 2018), R2D2 (Revaud et al., 2019), ASLFeat (Luo et al., 2020), DISK (Tyszkiewicz et al., 2020), or ALIKE (Zhao et al., 2022). Hierarchical approaches that combine image retrieval with structure-based localization (Sarlin et al., 2019) have been developed to cope with large environments, both to keep the required computational resources low and to ensure that local feature matches remain distinctive.
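To make the detect-and-describe step concrete, the following minimal sketch extracts the classic hand-crafted SIFT features with OpenCV. The learned descriptors listed above replace this call with a network forward pass but produce the same kind of output, keypoint locations plus descriptors. The function name and parameter values are illustrative, not part of our library.

```python
# Minimal sketch of the hand-crafted baseline: SIFT keypoints and
# descriptors extracted with OpenCV. Learned alternatives such as
# SuperPoint or DISK swap detectAndCompute for a network forward pass
# but yield the same kind of output: keypoints plus descriptors.
import cv2

def extract_sift(image_path, max_features=4096):
    """Detect SIFT keypoints and compute their 128-D descriptors."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(image_path)
    sift = cv2.SIFT_create(nfeatures=max_features)
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return keypoints, descriptors
```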
In this talk we present our visual localization solution, which builds on a large database of images and associated poses obtained from mobile mapping campaigns. It follows a hierarchical approach and consists of three steps. Steps one and two are performed in our own Python library, while step three is performed in COLMAP, an open-source structure-from-motion (SfM) package that we bind to Python with PyBind11. First, to localize a query image, we select potential reference images from the database with a spatial query around the prior pose of the query image, and then apply image retrieval to find the 15 most similar reference images. Second, we extract local features from the query and reference images and match them pairwise. Third, we build a COLMAP database from the matches and image metadata and import it into COLMAP. Using the bound COLMAP functions, we then perform a geometric verification of the raw matches, reconstruct the 3D scene from the reference images, and register the query image to the 3D scene via 2D-3D matches. Finally, we obtain the pose of the query image together with its associated standard deviations. We tested our approach on accurately georeferenced street-level imagery provided by our project partner iNovitas AG. Experiments in road and railway environments demonstrated an accuracy potential of sub-decimeter in position and sub-degree in orientation.
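The candidate-selection part of step one can be sketched as follows, under simplifying assumptions: reference camera positions are stored as an (N, 3) NumPy array, global image descriptors (e.g. from a NetVLAD-style network) are precomputed and L2-normalized, and the search radius and top-k value are illustrative placeholders rather than the settings used in our pipeline.

```python
# Sketch of step one: spatial pre-selection around the prior pose,
# followed by image retrieval on precomputed global descriptors.
# Assumes ref_positions is (N, 3), ref_globals is (N, D) L2-normalized,
# and query_global is a (D,) L2-normalized descriptor.
import numpy as np
from scipy.spatial import cKDTree

def select_references(prior_position, ref_positions, ref_globals,
                      query_global, radius=50.0, top_k=15):
    """Spatial query around the prior pose, then rank by global similarity."""
    tree = cKDTree(ref_positions)
    candidates = tree.query_ball_point(prior_position, r=radius)
    # For L2-normalized descriptors, cosine similarity is a dot product.
    similarities = ref_globals[candidates] @ query_global
    best = np.argsort(-similarities)[:top_k]
    return [candidates[i] for i in best]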
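For step two, raw pairwise matches are typically produced by nearest-neighbor descriptor matching with Lowe's ratio test before COLMAP's geometric verification. The sketch below uses OpenCV's brute-force matcher as a generic stand-in; the actual matcher depends on the chosen descriptor.

```python
# Sketch of step two's raw matching: nearest-neighbor matching with
# Lowe's ratio test. NORM_L2 suits float descriptors such as SIFT;
# binary descriptors would use NORM_HAMMING instead.
import cv2

def match_pair(desc_query, desc_ref, ratio=0.8):
    """Return (query_idx, ref_idx) pairs that pass the ratio test."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_query, desc_ref, k=2)
    matches = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            matches.append((pair[0].queryIdx, pair[0].trainIdx))
    return matches
```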
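Step three runs inside our COLMAP bindings, whose exact interface is not reproduced here. Conceptually, registering the query image from 2D-3D matches amounts to solving a perspective-n-point (PnP) problem; as a self-contained stand-in, the sketch below uses OpenCV's RANSAC PnP solver, assuming known intrinsics K and already established correspondences. The reprojection threshold is an illustrative value.

```python
# Conceptual stand-in for step three's query registration: RANSAC PnP
# from 2D-3D matches. points_2d is (N, 2) pixel coordinates, points_3d
# is (N, 3) scene points, K is the 3x3 camera intrinsics matrix.
import cv2
import numpy as np

def register_query(points_2d, points_3d, K):
    """Estimate the 6-DoF world-to-camera pose of the query image."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, None, reprojectionError=4.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP registration failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation vector to 3x3 rotation matrix
    return R, tvec, inliers
```

In our pipeline, the final pose and its standard deviations come from COLMAP's adjustment of the reconstructed scene rather than from a single PnP solve as above.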