Dense Stereo Matching

This page explains the mathematical foundations of dense depth estimation in AquaMVS. The goal is to compute a dense depth map for each reference camera, assigning a depth value to every pixel. Two complementary approaches are available: plane sweep stereo (with optional sparse feature guidance) and dense feature matching.

Overview

Dense depth estimation is the core reconstruction step, transforming 2D images into metric 3D information. For each reference camera, we aim to produce:

  • Depth map: \(D(u, v)\) giving ray depth at each pixel

  • Confidence map: \(C(u, v)\) indicating reliability of the depth estimate

AquaMVS offers two main pathways:

  1. Sparse → Dense: Extract sparse features (SuperPoint + LightGlue), triangulate them into 3D points, and use those points to set the plane sweep depth range.

  2. Dense Matching: Use RoMa v2 for dense correspondence between camera pairs, triangulate dense matches directly.

Both approaches account for refractive ray geometry (see Refractive Geometry).

Sparse Feature Matching

Sparse features provide initial 3D information and depth range priors for plane sweep stereo.

Feature Extraction

SuperPoint detects keypoints and computes descriptors on undistorted images. For a camera with \(H \times W\) image:

  • Keypoints: \(\{(u_i, v_i)\}_{i=1}^N\) (sub-pixel locations)

  • Descriptors: \(\{\mathbf{f}_i \in \mathbb{R}^{256}\}_{i=1}^N\) (L2-normalized)

Feature Matching

LightGlue matches descriptors between reference and source cameras using a learned correspondence network. For cameras \(A\) and \(B\):

  • Input: Descriptors \(\{\mathbf{f}_i^A\}, \{\mathbf{f}_j^B\}\)

  • Output: Match set \(\mathcal{M} = \{(i, j) : \mathbf{f}_i^A \leftrightarrow \mathbf{f}_j^B\}\)

LightGlue uses attention mechanisms to refine matches and prune outliers.

Cross-Pair Triangulation

For each match \((i, j)\) with pixel coordinates \((u_i^A, v_i^A)\) and \((u_j^B, v_j^B)\):

  1. Cast rays through both pixels using the refractive ray model:

    \[\mathbf{r}_A: \mathbf{p}(t) = \mathbf{O}_A + t \, \mathbf{d}_A\]
    \[\mathbf{r}_B: \mathbf{p}(s) = \mathbf{O}_B + s \, \mathbf{d}_B\]

  2. Find the closest point of approach (the 3D point minimizing the distance to both rays).

  3. Compute ray depths \(t^*\) and \(s^*\) for the closest point.
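The closest point of approach has a closed-form solution, sketched below with NumPy. This is a minimal illustration, not the AquaMVS implementation: the function name `triangulate_midpoint` is hypothetical, and the ray origins and directions are assumed to have already been produced by the refractive ray model. The returned parameters \(t^*\) and \(s^*\) equal metric ray depths only when the directions are unit vectors.

```python
import numpy as np

def triangulate_midpoint(o_a, d_a, o_b, d_b, eps=1e-9):
    """Closest point of approach of rays o_a + t*d_a and o_b + s*d_b.

    Hypothetical helper for illustration. Returns (point, t, s), or None
    when the rays are (nearly) parallel and the problem is ill-conditioned.
    """
    o_a, d_a, o_b, d_b = (np.asarray(x, dtype=float) for x in (o_a, d_a, o_b, d_b))
    w = o_a - o_b
    a = d_a @ d_a
    b = d_a @ d_b
    c = d_b @ d_b
    d = d_a @ w
    e = d_b @ w
    # Setting the gradient of |w + t*d_a - s*d_b|^2 to zero gives a 2x2 system.
    denom = a * c - b * b
    if denom < eps * a * c:
        return None
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    p_a = o_a + t * d_a  # closest point on ray A
    p_b = o_b + s * d_b  # closest point on ray B
    return 0.5 * (p_a + p_b), t, s

# Two rays that intersect at (0, 0, 1).
result = triangulate_midpoint([0, 0, 0], [0, 0, 1],
                              [1, 0, 0], np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0))
point, t_star, s_star = result
```

The distance between `p_a` and `p_b` is a natural residual for rejecting poor matches, although whether AquaMVS applies such a test is not stated here.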

The sparse point cloud \(\{\mathbf{p}_k\}\) is used to:

  • Filter outliers: Statistical outlier removal (points with few neighbors)

  • Estimate depth range: Compute percentile-based depth bounds per camera (e.g., 5th to 95th percentile avoids outlier contamination)
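As a rough illustration of the outlier filtering step, the sketch below drops points that have too few neighbors within a fixed radius. This radius-based count is a simple stand-in chosen for brevity; the exact statistical criterion used by AquaMVS is not specified on this page, and the function name, `radius`, and `min_neighbors` are all hypothetical. The brute-force distance matrix is only practical for small sparse clouds.

```python
import numpy as np

def filter_sparse_outliers(points, radius, min_neighbors):
    """Keep points with at least `min_neighbors` other points within `radius`.

    Hypothetical helper; O(N^2) memory, intended for small sparse clouds.
    Returns (kept_points, keep_mask).
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Subtract 1 so a point does not count itself as a neighbor.
    counts = (dist2 <= radius ** 2).sum(axis=1) - 1
    keep = counts >= min_neighbors
    return points[keep], keep

cloud = np.array([[0.0, 0.0, 0.0],
                  [0.1, 0.0, 0.0],
                  [0.0, 0.1, 0.0],
                  [0.0, 0.0, 0.1],
                  [5.0, 5.0, 5.0]])  # last point is isolated
kept, mask = filter_sparse_outliers(cloud, radius=0.5, min_neighbors=2)
```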

Depth Range Computation

For reference camera \(R\), sparse points are projected onto reference rays to get ray depths \(\{d_1, \ldots, d_M\}\). The plane sweep depth range is:

\[d_{\min} = \text{percentile}(\{d_i\}, 5\%), \quad d_{\max} = \text{percentile}(\{d_i\}, 95\%)\]

This provides adaptive depth bounds without requiring manual specification.
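The percentile bounds above map directly onto `numpy.percentile`. The following is a minimal sketch; the function name and the filtering of non-finite or non-positive depths are assumptions made for the example, not documented AquaMVS behavior.

```python
import numpy as np

def depth_range_from_sparse(ray_depths, lo_pct=5.0, hi_pct=95.0):
    """Return (d_min, d_max) as percentiles of the sparse ray depths.

    Hypothetical helper. Non-finite and non-positive depths are discarded
    first (an assumption made for this sketch).
    """
    d = np.asarray(ray_depths, dtype=float)
    d = d[np.isfinite(d) & (d > 0)]
    if d.size == 0:
        raise ValueError("no valid sparse depths for this camera")
    d_min, d_max = np.percentile(d, [lo_pct, hi_pct])
    return float(d_min), float(d_max)

depths = np.arange(1.0, 102.0)  # 1, 2, ..., 101
d_min, d_max = depth_range_from_sparse(depths)
```

With these 101 evenly spaced depths, the bounds come out at 6.0 and 96.0, so the extreme 5% at each end is excluded from the sweep range.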

Plane Sweep Stereo

Plane sweep stereo evaluates photometric similarity at discrete depth hypotheses to build a cost volume, then extracts the best-matching depth per pixel.

Algorithm Overview

For each reference pixel \((u, v)\):

  1. Sample depth hypotheses \(\{d_1, \ldots, d_D\}\) uniformly in \([d_{\min}, d_{\max}]\).

  2. For each depth \(d_k\):

    1. Back-project to 3D: \(\mathbf{p}_k = \mathbf{O} + d_k \, \mathbf{d}\) (using the refractive ray model).

    2. Project into each source camera \(S_j\) to get pixel location \((u_j, v_j)\).

    3. Sample source image \(I_j\) at \((u_j, v_j)\) via bilinear interpolation.

    4. Compute photometric cost between reference and warped source patches.

  3. Aggregate costs across all source cameras.

  4. Select depth with minimum cost: \(\hat{d}(u, v) = \arg\min_k C(u, v, k)\).
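Three of the building blocks in the steps above (uniform hypothesis sampling, back-projection along a ray, and bilinear sampling) can be sketched in a few lines. All function names are hypothetical, the ray origin and unit direction are assumed to come from the refractive ray model, and projection into the source cameras is deliberately left out because it depends on that model.

```python
import numpy as np

def sample_depth_hypotheses(d_min, d_max, num_depths):
    """Uniformly spaced depth hypotheses in [d_min, d_max]."""
    return np.linspace(d_min, d_max, num_depths)

def backproject(origin, direction, depths):
    """Points p_k = O + d_k * d for one ray; returns an array of shape (D, 3)."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    depths = np.asarray(depths, dtype=float)
    return origin[None, :] + depths[:, None] * direction[None, :]

def bilinear_sample(image, u, v):
    """Sample a 2D image (at least 2x2) at sub-pixel (u, v) = (column, row).

    Returns NaN when the location falls outside the image.
    """
    h, w = image.shape
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return float("nan")
    # Clamp the top-left corner so the 2x2 neighborhood stays in bounds.
    u0 = min(int(np.floor(u)), w - 2)
    v0 = min(int(np.floor(v)), h - 2)
    fu, fv = u - u0, v - v0
    top = (1 - fu) * image[v0, u0] + fu * image[v0, u0 + 1]
    bottom = (1 - fu) * image[v0 + 1, u0] + fu * image[v0 + 1, u0 + 1]
    return float((1 - fv) * top + fv * bottom)

hyps = sample_depth_hypotheses(1.0, 3.0, 3)
pts = backproject([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], hyps)
img = np.array([[0.0, 1.0], [2.0, 3.0]])
center = bilinear_sample(img, 0.5, 0.5)
```

Returning NaN for out-of-bounds samples lets later stages distinguish "no observation" from a genuinely high cost.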

graph LR
  subgraph "Reference Camera"
    P["Pixel (u,v)"]
  end
  subgraph "Depth Hypotheses"
    D1["d₁"] --> X1["3D Point p₁"]
    D2["d₂"] --> X2["3D Point p₂"]
    DN["dₙ"] --> XN["3D Point pₙ"]
  end
  subgraph "Source Cameras"
    S1["Source 1<br/>Sample I₁(u₁,v₁)"]
    S2["Source 2<br/>Sample I₂(u₂,v₂)"]
  end
  P --> D1
  P --> D2
  P --> DN
  X1 --> S1
  X1 --> S2
  X2 --> S1
  X2 --> S2
  XN --> S1
  XN --> S2
  style X1 fill:#81c784,stroke:#388e3c
  style X2 fill:#81c784,stroke:#388e3c
  style XN fill:#81c784,stroke:#388e3c
    
Cost Volume

The cost volume is a 3D tensor:

\[\mathbf{C} \in \mathbb{R}^{H \times W \times D}\]

where \(C(u, v, k)\) is the aggregated photometric cost at pixel \((u, v)\) and depth hypothesis \(d_k\).

Photometric Cost Function

AquaMVS uses Normalized Cross-Correlation (NCC) to measure local patch similarity. For reference pixel \((u, v)\) and source pixel \((u', v')\), NCC over a \(w \times w\) window is:

\[\text{NCC}(u, v; u', v') = \frac{\sum_{(i,j) \in W} (I_R(u+i, v+j) - \bar{I}_R) (I_S(u'+i, v'+j) - \bar{I}_S)}{\sqrt{\sum_{(i,j) \in W} (I_R(u+i, v+j) - \bar{I}_R)^2} \sqrt{\sum_{(i,j) \in W} (I_S(u'+i, v'+j) - \bar{I}_S)^2}}\]

where \(W\) is the set of pixel offsets \((i, j)\) covering the \(w \times w\) window, and \(\bar{I}_R\) and \(\bar{I}_S\) are the local means over that window centered at \((u,v)\) and \((u',v')\), respectively. NCC takes values in \([-1, 1]\).

The cost is defined as:

\[\text{Cost} = 1 - \text{NCC}\]

so that:

  • Cost = 0: Perfect correlation (identical patches)

  • Cost = 1: Uncorrelated

  • Cost = 2: Perfect anti-correlation
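The cost definition can be checked numerically on small patches. The sketch below is illustrative only: `ncc_cost` is a hypothetical name (not the `aquamvs.dense.compute_ncc` signature), and returning a cost of 1 for textureless patches, where the denominator vanishes, is an assumption made for the example.

```python
import numpy as np

def ncc_cost(patch_ref, patch_src, eps=1e-8):
    """Cost = 1 - NCC between two equally sized patches (hypothetical helper)."""
    a = np.asarray(patch_ref, dtype=float).ravel()
    b = np.asarray(patch_src, dtype=float).ravel()
    a = a - a.mean()  # subtract local means
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    if denom < eps:
        # Textureless patch: NCC is undefined, treat as uncorrelated (assumption).
        return 1.0
    return 1.0 - float(a @ b) / denom

patch = np.arange(9, dtype=float).reshape(3, 3)
cost_same = ncc_cost(patch, patch)               # identical patches
cost_gain = ncc_cost(patch, 2.0 * patch + 5.0)   # gain and bias change
cost_anti = ncc_cost(patch, -patch)              # inverted patch
cost_flat = ncc_cost(patch, np.ones((3, 3)))     # textureless source
```

The second case shows why NCC suits multi-camera rigs: an affine change in brightness between cameras leaves the cost at 0.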

Cost Aggregation

When multiple source cameras are available, costs are combined via averaging:

\[C(u, v, k) = \frac{1}{M} \sum_{j=1}^{M} \text{Cost}_j(u, v, k)\]

where \(M\) is the number of source cameras.
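In array form the average is a mean over the source-camera axis. The sketch below adds one assumption that the formula above does not state: costs that are NaN (for example, where the projection falls outside a source image) are skipped, so the mean is taken over valid sources only. When every source is valid, this reduces to the \(1/M\) average. The function name and array layout are hypothetical.

```python
import numpy as np

def aggregate_costs(per_source_costs):
    """Average per-source costs of shape (M, H, W, D) into an (H, W, D) volume.

    Hypothetical helper. NaN entries are ignored (assumption); a cell with
    no valid source stays NaN.
    """
    costs = np.asarray(per_source_costs, dtype=float)
    valid = np.isfinite(costs)
    n_valid = valid.sum(axis=0)
    total = np.where(valid, costs, 0.0).sum(axis=0)
    mean = np.full(costs.shape[1:], np.nan)
    np.divide(total, n_valid, out=mean, where=n_valid > 0)
    return mean

# Two sources, one pixel, two depth hypotheses; source 2 misses hypothesis 2.
costs = np.array([[[[0.2, 0.8]]],
                  [[[0.4, np.nan]]]])
volume = aggregate_costs(costs)
```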

Winner-Take-All Depth Selection

The depth estimate at each pixel is:

\[\hat{d}(u, v) = d_{k^*}, \quad k^* = \arg\min_k C(u, v, k)\]

Confidence Estimation

Confidence is derived from the cost distribution. A sharp minimum indicates high confidence. AquaMVS uses the cost ratio:

\[\text{Confidence}(u, v) = 1 - \frac{C(u, v, k^*)}{C(u, v, k_2)}\]

where \(k^*\) is the index of the lowest-cost depth hypothesis and \(k_2\) is the index of the second-lowest. Confidence is high when the best cost is much lower than the second-best, and falls to 0 when the two costs are equal.

Confidence values are in \([0, 1]\), with 1 indicating high reliability.
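Winner-take-all selection and the cost-ratio confidence can be sketched together on a cost volume. This is an illustration under stated assumptions rather than the behavior of `aquamvs.dense.extract_depth()`: the function name is hypothetical, "second-best" is taken to be the second-lowest cost over all hypotheses (including neighbors of the minimum), and confidence is set to 0 when the second-best cost is numerically zero.

```python
import numpy as np

def wta_depth_and_confidence(cost_volume, depths, eps=1e-8):
    """Winner-take-all depth and cost-ratio confidence (hypothetical helper).

    cost_volume: (H, W, D) non-negative costs with D >= 2; depths: (D,).
    """
    cost = np.asarray(cost_volume, dtype=float)
    depths = np.asarray(depths, dtype=float)
    k_best = np.argmin(cost, axis=2)
    depth_map = depths[k_best]
    # After partitioning at index 1, slots 0 and 1 hold the two lowest costs in order.
    two_lowest = np.partition(cost, 1, axis=2)[..., :2]
    c_best, c_second = two_lowest[..., 0], two_lowest[..., 1]
    ratio = c_best / np.maximum(c_second, eps)
    # Zero second-best cost means an ambiguous minimum: confidence 0 (assumption).
    confidence = np.where(c_second > eps, 1.0 - ratio, 0.0)
    return depth_map, np.clip(confidence, 0.0, 1.0)

cv = np.array([[[0.8, 0.1, 0.5]],    # clear minimum at the second hypothesis
               [[0.4, 0.4, 0.9]]])   # tie between the first two hypotheses
depth_map, confidence = wta_depth_and_confidence(cv, [1.0, 2.0, 3.0])
```

In this example the first pixel gets depth 2.0 with confidence 0.8, while the tied pixel gets confidence 0.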

Dense Matching Alternative

As an alternative to plane sweep, AquaMVS supports RoMa v2 for dense correspondence estimation.

RoMa Overview

RoMa (Robust Matching) v2 is a learned dense matcher that produces per-pixel correspondence fields between image pairs. Unlike sparse matchers, it predicts a match for every pixel.

For cameras \(A\) and \(B\):

  • Input: Undistorted images \(I_A, I_B\)

  • Output: Dense correspondence map \(\mathbf{F}: (u_A, v_A) \to (u_B, v_B)\)

  • Confidence map indicating match reliability

Dense Triangulation

For each pixel \((u_A, v_A)\) in camera \(A\) with match \((u_B, v_B)\) in camera \(B\):

  1. Cast rays through both pixels.

  2. Triangulate to get 3D point \(\mathbf{p}\).

  3. Compute ray depth for camera \(A\).

This produces a dense point cloud directly, without plane sweep.
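For dense matches, the same closest-point-of-approach computation is applied to every pixel at once. The vectorized sketch below assumes the per-pixel ray origins and directions have already been computed by the refractive ray model and flattened to shape `(N, 3)`; the function name is hypothetical, and marking near-parallel ray pairs as NaN is a choice made for the example.

```python
import numpy as np

def dense_ray_depths(o_a, d_a, o_b, d_b, eps=1e-9):
    """Batched closest point of approach for N ray pairs (hypothetical helper).

    All inputs have shape (N, 3). Returns (points (N, 3), t_a (N,)), with NaN
    where the two rays are nearly parallel.
    """
    o_a, d_a, o_b, d_b = (np.asarray(x, dtype=float) for x in (o_a, d_a, o_b, d_b))
    w = o_a - o_b
    a = np.sum(d_a * d_a, axis=1)
    b = np.sum(d_a * d_b, axis=1)
    c = np.sum(d_b * d_b, axis=1)
    d = np.sum(d_a * w, axis=1)
    e = np.sum(d_b * w, axis=1)
    denom = a * c - b * b
    ok = denom > eps * a * c
    safe = np.where(ok, denom, 1.0)  # avoid dividing by ~0 for parallel rays
    t = np.where(ok, (b * e - c * d) / safe, np.nan)
    s = np.where(ok, (a * e - b * d) / safe, np.nan)
    points = 0.5 * ((o_a + t[:, None] * d_a) + (o_b + s[:, None] * d_b))
    return points, t

o_a = np.zeros((2, 3))
d_a = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
o_b = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d_b = np.array([[-1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])  # second pair is parallel
points, t_a = dense_ray_depths(o_a, d_a, o_b, d_b)
```

With unit-length directions for camera \(A\), reshaping `t_a` to the image size yields a depth map for camera \(A\), with NaN wherever triangulation was ill-conditioned.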

Comparison: Plane Sweep vs. Dense Matching

| Aspect | Plane Sweep | Dense Matching (RoMa) |
|---|---|---|
| Coverage | Full dense depth map | Full dense matches |
| Speed | Slower (evaluates all depth hypotheses) | Faster (single forward pass) |
| Accuracy | High (multi-view consensus) | Moderate (pairwise only) |
| Robustness | Robust to textureless regions (if enough views) | Struggles with large viewpoint changes |
| Use Case | High-quality reconstruction | Fast prototyping, preview reconstruction |

Connection to Code

The algorithms described here are implemented in:

  • aquamvs.dense.plane_sweep_stereo(): Main plane sweep function.

  • aquamvs.dense.compute_ncc(): NCC cost computation.

  • aquamvs.dense.extract_depth(): Winner-take-all depth selection and confidence estimation.

  • aquamvs.features.matching: Sparse feature matching (SuperPoint + LightGlue).

  • aquamvs.dense.roma_depth: Dense matching via RoMa v2.

For API details, see Reconstruction.

Next Steps

Once depth maps are computed for all cameras, they must be fused into a unified 3D representation. The next section, Multi-View Fusion and Surface Reconstruction, covers depth fusion across views, geometric consistency filtering, and surface reconstruction methods.