The source code can be downloaded here:

This page briefly summarizes what the benchmark can do and what it measures. For set-up and runtime instructions, please refer to the documentation on GitHub. For instructions about participating in the challenge (which does not actually require running the benchmark), please refer to the submission page.


Our pipeline contains the following blocks:

  1. Feature extraction
  2. Feature matching
  3. Outlier pre-filtering
  4. Performance evaluation on downstream tasks
    • Stereo task
    • Multi-view task

In the first stage, we extract features (keypoints and descriptors) for every image in a scene. In the second stage, we use them to generate a list of putative matches for every pair of images. The outlier filtering step embeds deep networks for correspondence estimation; currently we make available Context Networks inside the benchmark. These matches are then given to a stereo task (e.g. RANSAC) and a multi-view task (Bundle Adjustment), which run separately.

These components often interact in unforeseen ways. They need to be tuned one at a time and are hard to evaluate with intermediate metrics alone, so a meaningful evaluation requires implementing the entire pipeline and embedding baseline methods in it. We believe this is the first benchmark to provide an integrated solution to this problem. It can be used to evaluate any of the following:

Task 1: Wide-baseline stereo

In this task we simply match two images across wide baselines. Image pairs are selected according to (loose) co-visibility constraints so that at least part of the scene is guaranteed to overlap. To do so we compute a simple co-visibility estimation based on the size of the bounding box containing all the reconstructed points co-visible for every possible pair of images, which we then threshold to generate lists of valid pairs of increasing difficulty. We typically consider about 4-5k image pairs per scene.
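The thresholding step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the bounding-box size is normalized or how the two images of a pair are combined, so the area ratio, the `min` over both images, and all function names here are hypothetical.

```python
def bbox_area_ratio(points_2d, width, height):
    """Fraction of the image covered by the bounding box of the given 2D points.

    Assumption: the bounding-box size is normalized by the image area.
    """
    if not points_2d:
        return 0.0
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return bbox_area / float(width * height)


def pair_covisibility(points_a, size_a, points_b, size_b):
    """Co-visibility score for one image pair.

    points_a / points_b are the projections of the co-visible 3D points into
    each image; size_a / size_b are (width, height) tuples.
    Assumption: the pair is scored by the smaller of the two ratios.
    """
    return min(bbox_area_ratio(points_a, *size_a),
               bbox_area_ratio(points_b, *size_b))


def select_pairs(covisibility, threshold):
    """Keep the pairs whose co-visibility score reaches the threshold."""
    return sorted(pair for pair, score in covisibility.items()
                  if score >= threshold)
```

Under this sketch, lowering `threshold` admits pairs with less overlap, which is one way to obtain pair lists of increasing difficulty.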

Our main evaluation metric is the quality of the estimated poses. We measure them in relative terms, as stereo poses can only be recovered up to a scale factor. To measure the accuracy for a given pair of images, we compute the angular difference between the estimated and ground truth translation vectors, and between the estimated and ground truth rotation vectors, and take the larger of the two. We threshold this error to reach a binary decision (e.g. "is this estimate accurate at X degrees?") and accumulate it over multiple image pairs to compute the "average accuracy". We do this for multiple error thresholds, from 1 to 10 degrees, with a 1-degree resolution. Finally, we compute the area under this curve: this provides a single scalar value that we call "mean Average Accuracy" or mAA, analogously to the mean Average Precision or mAP commonly used in object detection.
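A minimal sketch of this metric follows. It is not the benchmark's implementation: it assumes rotations are given as 3x3 matrices and measures their difference as the angle of the relative rotation, assumes non-zero translation vectors, counts an error equal to the threshold as accurate, and approximates the area under the curve by the mean accuracy over the ten thresholds. All function names are illustrative.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3-vectors; magnitudes are ignored."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation R_gt^T R_est (assumed measure)."""
    # trace(R_gt^T R_est) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def pose_error_deg(R_est, t_est, R_gt, t_gt):
    """Larger of the rotation and translation angular errors, in degrees."""
    return max(rotation_error_deg(R_est, R_gt), angle_between_deg(t_est, t_gt))


def mean_average_accuracy(errors_deg, thresholds=range(1, 11)):
    """Mean over thresholds of the fraction of pairs with error <= threshold."""
    if not errors_deg:
        raise ValueError("need at least one pose error")
    accuracies = [
        sum(err <= th for err in errors_deg) / len(errors_deg)
        for th in thresholds
    ]
    return sum(accuracies) / len(accuracies)
```

For example, four pairs with errors of 0.5, 3.0, 7.5 and 20.0 degrees give accuracies of 0.25 at 1 and 2 degrees, 0.5 from 3 to 7 degrees, and 0.75 from 8 to 10 degrees, so `mean_average_accuracy([0.5, 3.0, 7.5, 20.0])` is 0.525.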

Additionally, we compute traditional metrics such as keypoint repeatability or descriptor matching score. As many modern local feature extractors do not have a clear notion of their support region, we compute them from keypoint locations only, using the ground truth depth maps to determine valid/invalid correspondences across images. Note that while the depth maps are generally accurate, we cannot rely on them for sub-pixel accuracy, and they are constrained to non-occluded areas.

[Figure: stereo reconstruction examples]
Stereo matching example: 2k SIFT features and DEGENSAC (Chum et al., CVPR'05). We show the inliers, color-coded using the ground truth depth: green to yellow indicates correct matches (with green encoding 0 error and yellow the maximum error allowed: 5 pixels), red indicates incorrect matches, and blue indicates that keypoints fall on areas without ground truth.

Task 2: Multi-view reconstruction from image subsets

Recent learned methods have shown promising results in stereo, but it is not clear whether this carries over to large-scale reconstruction with Bundle Adjustment. We thus propose to evaluate SfM directly, as previously done for instance by the Comparative Evaluation Benchmark. Unfortunately, it is not feasible to obtain truly accurate depth measurements for large-scale image collections from heterogeneous sensors, and under most circumstances the best we can do is collect statistics such as the number of landmarks, their track length, or the reprojection error.

By contrast, we propose to build SfM reconstructions with Colmap from small (5, 10, 25) subsets of images, which we call bags, and use the poses obtained from the full collections as ground truth. Specifically, we subsample each scene to 100 images and, from them, generate sets of 5 images (100 of them), 10 images (50 of them), and 25 images (25 of them), sampled at random from the 100-image subset (enforcing a minimum degree of co-visibility between the images).
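The sampling scheme above can be sketched as below. The bag sizes and counts come from the text; everything else is an assumption made for illustration: the function name, the fixed seed, the `max_tries` guard, and in particular the `is_covisible` predicate, which stands in for the minimum co-visibility check whose exact form is not described here.

```python
import random

# Bag size -> number of bags, as stated in the text.
BAG_PLAN = {5: 100, 10: 50, 25: 25}


def sample_bags(image_ids, plan=BAG_PLAN, subset_size=100,
                is_covisible=None, seed=0, max_tries=10000):
    """Subsample a scene, then draw random bags from that subset.

    is_covisible is a hypothetical callable taking a list of image ids and
    returning True when the bag is acceptable; None accepts every bag.
    """
    rng = random.Random(seed)
    subset = rng.sample(list(image_ids), subset_size)
    bags = {}
    for size, count in plan.items():
        accepted, tries = [], 0
        while len(accepted) < count:
            if tries >= max_tries:
                raise RuntimeError(f"could not sample {count} bags of size {size}")
            tries += 1
            bag = rng.sample(subset, size)  # images within a bag are distinct
            if is_covisible is None or is_covisible(bag):
                accepted.append(bag)
        bags[size] = accepted
    return subset, bags
```

Calling `sample_bags(range(200))` returns the 100-image subset together with a dictionary holding 100 bags of 5 images, 50 bags of 10, and 25 bags of 25.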

This task is evaluated with the same metric we proposed for stereo, averaged over every pair of images in a subset (e.g. 10 pairs for one bag of size 5, 45 for one bag of size 10, and 300 for one bag of size 25). We believe that this provides a better proxy metric to evaluate feature extractors and matching algorithms than what has been used in previous efforts. Note that this penalizes reconstructions that fail to register images. If Colmap generates multiple 3D models which cannot be co-registered, we consider the largest one (the one with the most images).
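The pair counts and the penalty for unregistered images can be illustrated with a short sketch. It assumes that any pair involving an image missing from the reconstruction is assigned an infinite error, so it fails every threshold; the text states that such failures are penalized but not how, so this choice and the function names are assumptions.

```python
import math
from itertools import combinations


def bag_pair_errors(bag, pair_error_deg):
    """Pose error for every image pair in a bag.

    pair_error_deg maps (i, j) with i < j to an error in degrees and contains
    only pairs whose two images were registered. Missing pairs get an infinite
    error (assumed penalty).
    """
    return [pair_error_deg.get(pair, math.inf)
            for pair in combinations(sorted(bag), 2)]


def bag_maa(bag, pair_error_deg, thresholds=range(1, 11)):
    """Mean over thresholds of the fraction of pairs with error <= threshold."""
    errors = bag_pair_errors(bag, pair_error_deg)
    accuracies = [sum(e <= th for e in errors) / len(errors) for th in thresholds]
    return sum(accuracies) / len(accuracies)


# Number of pairs per bag, matching the figures quoted in the text.
pairs_per_bag = {size: math.comb(size, 2) for size in (5, 10, 25)}
```

With a bag of 5 images in which only 3 were registered, at most 3 of the 10 pairs can count as accurate, so the score for that bag cannot exceed 0.3 regardless of how good the recovered poses are.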

[Figure: multi-view reconstruction examples]
Multi-view task example: Reconstruction with 2k SIFT features with Colmap and a subset of 25 images. We draw the keypoints in blue if they become a 3D landmark, and in red otherwise. We then compute the mean Average Accuracy for every possible pair of images using the same metric as for stereo.

Several stages in the benchmark pipeline, such as the robust stereo matchers or bundle adjustment with Colmap, are randomized, so their results vary from run to run. To account for this variability, we average the results across three runs. In total, a full evaluation round for a single submission requires computing ~150k stereo pairs and ~5k Structure-from-Motion reconstructions.