Dense Stereo Matching¶
This page explains the mathematical foundations of dense depth estimation in AquaMVS. The goal is to compute a dense depth map for each reference camera, assigning a depth value to every pixel. Two complementary approaches are available: plane sweep stereo (with optional sparse feature guidance) and dense feature matching.
Overview¶
Dense depth estimation is the core reconstruction step, transforming 2D images into metric 3D information. For each reference camera, we aim to produce:
Depth map: \(D(u, v)\) giving ray depth at each pixel
Confidence map: \(C(u, v)\) indicating reliability of the depth estimate
AquaMVS offers two main pathways:
Sparse → Dense: Extract sparse features (SuperPoint + LightGlue), triangulate them to get 3D points, and use those points to bound the plane sweep depth range.
Dense Matching: Use RoMa v2 for dense correspondence between camera pairs, then triangulate the dense matches directly.
Both approaches account for refractive ray geometry (see Refractive Geometry).
Sparse Feature Matching¶
Sparse features provide initial 3D information and depth range priors for plane sweep stereo.
- Feature Extraction
SuperPoint detects keypoints and computes descriptors on undistorted images. For a camera with an \(H \times W\) image, it produces:
Keypoints: \(\{(u_i, v_i)\}_{i=1}^N\) (sub-pixel locations)
Descriptors: \(\{\mathbf{f}_i \in \mathbb{R}^{256}\}_{i=1}^N\) (L2-normalized)
- Feature Matching
LightGlue matches descriptors between reference and source cameras using a learned correspondence network. For cameras \(A\) and \(B\):
Input: Descriptors \(\{\mathbf{f}_i^A\}, \{\mathbf{f}_j^B\}\)
Output: Match set \(\mathcal{M} = \{(i, j) : \mathbf{f}_i^A \leftrightarrow \mathbf{f}_j^B\}\)
LightGlue uses attention mechanisms to refine matches and prune outliers.
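To make the input and output of the matching step concrete, the sketch below builds a match set \(\mathcal{M}\) from two sets of L2-normalized descriptors. It deliberately swaps in plain mutual nearest-neighbor matching, which is not what LightGlue does; the function name `mutual_nearest_neighbors` and the `min_similarity` parameter are assumptions made for illustration only.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b, min_similarity=0.0):
    """Baseline matcher (NOT LightGlue): keep pairs that are each other's best match.

    desc_a: (N, C) L2-normalized descriptors from camera A.
    desc_b: (K, C) L2-normalized descriptors from camera B.
    Returns an (n_matches, 2) integer array of index pairs (i, j).
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)

    # For unit-length descriptors the dot product is the cosine similarity.
    sim = desc_a @ desc_b.T

    nn_ab = sim.argmax(axis=1)  # best B index for every A descriptor
    nn_ba = sim.argmax(axis=0)  # best A index for every B descriptor

    idx_a = np.arange(desc_a.shape[0])
    # A pair survives if the choice is mutual and the similarity is high enough.
    keep = (nn_ba[nn_ab] == idx_a) & (sim[idx_a, nn_ab] >= min_similarity)
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


# Toy descriptors: B holds the same unit vectors as A in a different order.
desc_a = np.eye(4)[:3]
desc_b = np.eye(4)[[2, 0, 1]]
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

A learned matcher replaces the similarity and pruning logic with a network, but the returned structure, a list of index pairs \((i, j)\), has the same form.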
- Cross-Pair Triangulation
For each match \((i, j)\) with pixel coordinates \((u_i^A, v_i^A)\) and \((u_j^B, v_j^B)\):
Cast rays through both pixels using the refractive ray model:
\[\mathbf{r}_A: \mathbf{p}(t) = \mathbf{O}_A + t \, \mathbf{d}_A\]
\[\mathbf{r}_B: \mathbf{p}(s) = \mathbf{O}_B + s \, \mathbf{d}_B\]
Find the closest point of approach (the 3D point minimizing the distance to both rays).
Compute ray depths \(t^*\) and \(s^*\) for the closest point.
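The closest point of approach has a closed-form solution. The following is a minimal sketch, assuming each ray is already given by an origin and a direction (for a refractive model these would describe the ray segment in water); the function name `closest_point_between_rays` and the choice of returning the midpoint of the two nearest points are assumptions, not a description of the AquaMVS implementation.

```python
import numpy as np


def closest_point_between_rays(o_a, d_a, o_b, d_b, eps=1e-9):
    """Return (point, t_star, s_star) for two rays, or None if they are near-parallel.

    Minimizes |o_a + t*d_a - (o_b + s*d_b)|^2 over t and s, then returns the
    midpoint of the two closest points together with both ray depths.
    """
    o_a, d_a = np.asarray(o_a, dtype=float), np.asarray(d_a, dtype=float)
    o_b, d_b = np.asarray(o_b, dtype=float), np.asarray(d_b, dtype=float)

    # Unit directions make t and s metric distances along each ray.
    d_a = d_a / np.linalg.norm(d_a)
    d_b = d_b / np.linalg.norm(d_b)

    w = o_a - o_b
    b = d_a @ d_b
    d = d_a @ w
    e = d_b @ w

    denom = 1.0 - b * b  # zero when the rays are parallel
    if denom < eps:
        return None

    t_star = (b * e - d) / denom
    s_star = (e - b * d) / denom

    p_a = o_a + t_star * d_a
    p_b = o_b + s_star * d_b
    return 0.5 * (p_a + p_b), t_star, s_star


# Two rays that intersect exactly at (0, 0, 1).
point, t_star, s_star = closest_point_between_rays(
    [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0], [-1.0, 0.0, 1.0],
)
```

In practice one would likely also reject matches whose rays pass far from each other or whose depths \(t^*\), \(s^*\) are negative; those checks are left out here for brevity.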
The sparse point cloud \(\{\mathbf{p}_k\}\) is used to:
Filter outliers: Statistical outlier removal (points with few neighbors)
Estimate depth range: Compute percentile-based depth bounds per camera (e.g., 5th to 95th percentile avoids outlier contamination)
- Depth Range Computation
For reference camera \(R\), sparse points are projected onto reference rays to get ray depths \(\{d_1, \ldots, d_M\}\). The plane sweep depth range is:
\[d_{\min} = \text{percentile}(\{d_i\}, 5\%), \quad d_{\max} = \text{percentile}(\{d_i\}, 95\%)\]
This provides adaptive depth bounds without requiring manual specification.
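As a small illustration of the percentile rule, the sketch below computes \(d_{\min}\) and \(d_{\max}\) from a set of ray depths with NumPy. The function name `depth_range_from_sparse` and the filtering of non-finite or non-positive depths are assumptions added for the example.

```python
import numpy as np


def depth_range_from_sparse(ray_depths, lo_pct=5.0, hi_pct=95.0):
    """Percentile-based depth bounds from sparse ray depths (hypothetical helper)."""
    depths = np.asarray(ray_depths, dtype=float)
    # Assumption: drop invalid depths before taking percentiles.
    depths = depths[np.isfinite(depths) & (depths > 0)]
    if depths.size == 0:
        raise ValueError("no valid sparse depths for this camera")
    d_min, d_max = np.percentile(depths, [lo_pct, hi_pct])
    return float(d_min), float(d_max)


# 101 depths spread evenly between 1 m and 2 m, plus one gross outlier.
ray_depths = np.concatenate([np.linspace(1.0, 2.0, 101), [100.0]])
d_min, d_max = depth_range_from_sparse(ray_depths)
```

Here the outlier at 100 has almost no effect on `d_max`, whereas a plain maximum would stretch the sweep range to 100 and waste most depth hypotheses on empty space.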
Plane Sweep Stereo¶
Plane sweep stereo evaluates photometric similarity at discrete depth hypotheses to build a cost volume, then extracts the best-matching depth per pixel.
- Algorithm Overview
For each reference pixel \((u, v)\):
Sample depth hypotheses \(\{d_1, \ldots, d_D\}\) uniformly in \([d_{\min}, d_{\max}]\).
For each depth \(d_k\):
Back-project to 3D: \(\mathbf{p}_k = \mathbf{O} + d_k \, \mathbf{d}\) (using refractive ray model).
Project into each source camera \(S_j\) to get pixel location \((u_j, v_j)\).
Sample source image \(I_j\) at \((u_j, v_j)\) via bilinear interpolation.
Compute photometric cost between reference and warped source patches.
Aggregate costs across all source cameras.
Select depth with minimum cost: \(\hat{d}(u, v) = \arg\min_k C(u, v, k)\).
graph LR
subgraph "Reference Camera"
P["Pixel (u,v)"]
end
subgraph "Depth Hypotheses"
D1["d₁"] --> X1["3D Point p₁"]
D2["d₂"] --> X2["3D Point p₂"]
DN["dₙ"] --> XN["3D Point pₙ"]
end
subgraph "Source Cameras"
S1["Source 1<br/>Sample I₁(u₁,v₁)"]
S2["Source 2<br/>Sample I₂(u₂,v₂)"]
end
P --> D1
P --> D2
P --> DN
X1 --> S1
X1 --> S2
X2 --> S1
X2 --> S2
XN --> S1
XN --> S2
style X1 fill:#81c784,stroke:#388e3c
style X2 fill:#81c784,stroke:#388e3c
style XN fill:#81c784,stroke:#388e3c
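The per-pixel steps above can be sketched for a single reference ray as follows. This is a simplified illustration, not the AquaMVS implementation: the helper names are hypothetical, images are assumed to be single-channel arrays indexed as `image[v, u]`, and the projection into each source camera is omitted because it depends on the refractive camera model.

```python
import numpy as np


def sample_depth_hypotheses(d_min, d_max, num_depths):
    """Uniformly spaced depth hypotheses d_1 .. d_D in [d_min, d_max]."""
    return np.linspace(d_min, d_max, num_depths)


def backproject(origin, direction, depths):
    """3D points p_k = O + d_k * d for one ray; returns an array of shape (D, 3)."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    return origin[None, :] + depths[:, None] * direction[None, :]


def bilinear_sample(image, u, v):
    """Sample a 2D image at sub-pixel (u, v); NaN if the location is outside."""
    h, w = image.shape
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return float("nan")
    # Clamp the top-left corner so the 2x2 neighborhood stays inside the image.
    u0 = min(int(np.floor(u)), w - 2)
    v0 = min(int(np.floor(v)), h - 2)
    au, av = u - u0, v - v0
    top = (1 - au) * image[v0, u0] + au * image[v0, u0 + 1]
    bottom = (1 - au) * image[v0 + 1, u0] + au * image[v0 + 1, u0 + 1]
    return float((1 - av) * top + av * bottom)


depths = sample_depth_hypotheses(1.0, 2.0, 5)
points = backproject([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], depths)
image = np.array([[0.0, 1.0], [2.0, 3.0]])
center_value = bilinear_sample(image, 0.5, 0.5)
```

Returning NaN for out-of-image samples is one possible convention; it lets later stages skip source views in which the hypothesized 3D point is not visible.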
- Cost Volume
The cost volume is a 3D tensor:
\[\mathbf{C} \in \mathbb{R}^{H \times W \times D}\]
where \(C(u, v, k)\) is the aggregated photometric cost at pixel \((u, v)\) and depth hypothesis \(d_k\).
- Photometric Cost Function
AquaMVS uses Normalized Cross-Correlation (NCC) to measure local patch similarity. For reference pixel \((u, v)\) and source pixel \((u', v')\), NCC in a \(w \times w\) window is:
\[\text{NCC}(u, v; u', v') = \frac{\sum_{(i,j) \in W} (I_R(i,j) - \bar{I}_R) (I_S(i,j) - \bar{I}_S)}{\sqrt{\sum_{(i,j) \in W} (I_R(i,j) - \bar{I}_R)^2} \sqrt{\sum_{(i,j) \in W} (I_S(i,j) - \bar{I}_S)^2}}\]
where \(\bar{I}_R\) and \(\bar{I}_S\) are the local means over the window \(W\) centered at \((u,v)\) and \((u',v')\), respectively, and the index \((i,j)\) runs over corresponding positions in the two windows.
The cost is defined as:
\[\text{Cost} = 1 - \text{NCC}\]
so that:
Cost = 0: Perfect correlation (identical patches)
Cost = 1: Uncorrelated
Cost = 2: Perfect anti-correlation
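The three reference values can be checked numerically. The sketch below computes \(1 - \text{NCC}\) for two patches that have already been extracted and aligned; the function name `ncc_cost` and the handling of textureless patches (returning a cost of 1) are assumptions for this example.

```python
import numpy as np


def ncc_cost(patch_ref, patch_src, eps=1e-8):
    """Photometric cost 1 - NCC between two equally sized patches."""
    a = np.asarray(patch_ref, dtype=float)
    b = np.asarray(patch_src, dtype=float)

    # Subtract the local means of each window.
    a = a - a.mean()
    b = b - b.mean()

    denom = np.sqrt((a * a).sum()) * np.sqrt((b * b).sum())
    if denom < eps:
        # Assumption: a constant patch carries no signal, treat it as uncorrelated.
        return 1.0
    return 1.0 - float((a * b).sum() / denom)


patch = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 10.0]])

cost_same = ncc_cost(patch, patch)                # identical patches
cost_gain = ncc_cost(patch, 2.0 * patch + 5.0)    # brightness gain and offset
cost_flip = ncc_cost(patch, -patch)               # inverted contrast
cost_flat = ncc_cost(patch, np.ones((3, 3)))      # textureless source patch
```

The second case shows why NCC is a common choice for multi-camera rigs: a per-camera gain and offset in brightness leaves the cost unchanged.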
- Cost Aggregation
When multiple source cameras are available, costs are combined via averaging:
\[C(u, v, k) = \frac{1}{M} \sum_{j=1}^{M} \text{Cost}_j(u, v, k)\]
where \(M\) is the number of source cameras.
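A short worked example of the averaging, assuming the per-source cost volumes are stacked along a leading axis of length \(M\) (the function name `aggregate_costs` and this array layout are assumptions):

```python
import numpy as np


def aggregate_costs(per_source_costs):
    """Average per-source cost volumes of shape (M, H, W, D) into (H, W, D)."""
    costs = np.asarray(per_source_costs, dtype=float)
    if costs.ndim != 4:
        raise ValueError("expected an array of shape (M, H, W, D)")
    return costs.mean(axis=0)


# Two source cameras, a 1x1 image, three depth hypotheses.
cost_source_1 = np.array([[[0.2, 0.8, 1.0]]])
cost_source_2 = np.array([[[0.4, 0.6, 1.0]]])
aggregated = aggregate_costs([cost_source_1, cost_source_2])
```

A plain mean weights every source camera equally, so a single occluded or out-of-view source can raise the cost at the correct depth.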
- Winner-Take-All Depth Selection
The depth estimate at each pixel is:
\[\hat{d}(u, v) = d_{k^*}, \quad k^* = \arg\min_k C(u, v, k)\]
- Confidence Estimation
Confidence is derived from the cost distribution. A sharp minimum indicates high confidence. AquaMVS uses the cost ratio:
\[\text{Confidence}(u, v) = 1 - \frac{C(u, v, k^*)}{C(u, v, k_2)}\]
where \(k^*\) is the best depth and \(k_2\) is the second-best. Confidence is high when the best cost is much lower than the second-best.
Confidence values are in \([0, 1]\), with 1 indicating high reliability.
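Winner-take-all selection and the cost-ratio confidence can be sketched together. This example assumes that "second-best" means the second-lowest cost over all depth hypotheses and guards against division by zero with a small `eps`; both choices, and the function name `winner_take_all`, are assumptions rather than a description of `aquamvs.dense.extract_depth()`.

```python
import numpy as np


def winner_take_all(cost_volume, depths, eps=1e-8):
    """Return (depth_map, confidence_map) from a cost volume of shape (H, W, D)."""
    cost_volume = np.asarray(cost_volume, dtype=float)
    depths = np.asarray(depths, dtype=float)

    # Index of the minimum-cost hypothesis at every pixel.
    k_best = np.argmin(cost_volume, axis=-1)
    depth_map = depths[k_best]

    # The two smallest costs per pixel (requires D >= 2).
    two_smallest = np.partition(cost_volume, 1, axis=-1)[..., :2]
    best, second = two_smallest[..., 0], two_smallest[..., 1]

    confidence = 1.0 - best / np.maximum(second, eps)
    return depth_map, np.clip(confidence, 0.0, 1.0)


depths = np.array([1.0, 1.5, 2.0, 2.5])
cost_volume = np.array([[[0.8, 0.2, 0.9, 0.4],     # clear minimum at d = 1.5
                         [0.5, 0.5, 0.5, 0.5]]])   # flat costs, ambiguous
depth_map, confidence = winner_take_all(cost_volume, depths)
```

The first pixel gets depth 1.5 with confidence \(1 - 0.2 / 0.4 = 0.5\); the second pixel has identical costs everywhere, so its confidence is 0 and its depth is simply the first hypothesis.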
Dense Matching Alternative¶
As an alternative to plane sweep, AquaMVS supports RoMa v2 for dense correspondence estimation.
- RoMa Overview
RoMa (Robust Matching) v2 is a learned dense matcher that produces per-pixel correspondence fields between image pairs. Unlike sparse matchers, it predicts a match for every pixel.
For cameras \(A\) and \(B\):
Input: Undistorted images \(I_A, I_B\)
Output: Dense correspondence map \(\mathbf{F}: (u_A, v_A) \to (u_B, v_B)\)
Confidence map indicating match reliability
- Dense Triangulation
For each pixel \((u_A, v_A)\) in camera \(A\) with match \((u_B, v_B)\) in camera \(B\):
Cast rays through both pixels.
Triangulate to get 3D point \(\mathbf{p}\).
Compute ray depth for camera \(A\).
This produces a dense point cloud directly, without plane sweep.
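The three steps above can be applied to all matches at once. The following is a minimal vectorized sketch, assuming the rays for every matched pixel pair have already been computed; per-ray origins are used so that the code also covers ray models in which each pixel's ray starts at a different point. The function name `triangulate_ray_depths` and the `min_conf` threshold are hypothetical.

```python
import numpy as np


def triangulate_ray_depths(o_a, d_a, o_b, d_b, conf=None, min_conf=0.5, eps=1e-9):
    """Triangulate N ray pairs; returns (points (N, 3), depth_a (N,)) with NaN if invalid."""
    o_a, d_a = np.asarray(o_a, dtype=float), np.asarray(d_a, dtype=float)
    o_b, d_b = np.asarray(o_b, dtype=float), np.asarray(d_b, dtype=float)

    d_a = d_a / np.linalg.norm(d_a, axis=-1, keepdims=True)
    d_b = d_b / np.linalg.norm(d_b, axis=-1, keepdims=True)

    w = o_a - o_b
    b = np.sum(d_a * d_b, axis=-1)
    d = np.sum(d_a * w, axis=-1)
    e = np.sum(d_b * w, axis=-1)

    denom = 1.0 - b * b
    valid = denom > eps                    # reject near-parallel rays
    safe = np.where(valid, denom, 1.0)     # avoid dividing by zero
    t = (b * e - d) / safe                 # ray depth along camera A's ray
    s = (e - b * d) / safe                 # ray depth along camera B's ray

    valid &= (t > 0) & (s > 0)             # point must lie in front of both cameras
    if conf is not None:
        valid &= np.asarray(conf) >= min_conf

    mid = 0.5 * ((o_a + t[..., None] * d_a) + (o_b + s[..., None] * d_b))
    points = np.where(valid[..., None], mid, np.nan)
    depth_a = np.where(valid, t, np.nan)
    return points, depth_a


# Two identical ray pairs meeting at (0, 0, 1); the second match has low confidence.
o_a = np.zeros((2, 3))
d_a = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
o_b = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d_b = np.array([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]])
points, depth_a = triangulate_ray_depths(o_a, d_a, o_b, d_b, conf=np.array([0.9, 0.1]))
```

Reshaping `depth_a` back to the image grid of camera \(A\) would give a depth map with NaN at pixels whose match was rejected.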
Comparison: Plane Sweep vs. Dense Matching

| Aspect | Plane Sweep | Dense Matching (RoMa) |
|---|---|---|
| Coverage | Full dense depth map | Full dense matches |
| Speed | Slower (evaluates all depth hypotheses) | Faster (single forward pass) |
| Accuracy | High (multi-view consensus) | Moderate (pairwise only) |
| Robustness | Robust to textureless regions (if enough views) | Struggles with large viewpoint changes |
| Use Case | High-quality reconstruction | Fast prototyping, preview reconstruction |
Connection to Code¶
The algorithms described here are implemented in:
aquamvs.dense.plane_sweep_stereo(): Main plane sweep function.
aquamvs.dense.compute_ncc(): NCC cost computation.
aquamvs.dense.extract_depth(): Winner-take-all depth selection and confidence estimation.
aquamvs.features.matching: Sparse feature matching (SuperPoint + LightGlue).
aquamvs.dense.roma_depth: Dense matching via RoMa v2.
For API details, see Reconstruction.
Next Steps¶
Once depth maps are computed for all cameras, they must be fused into a unified 3D representation. The next section covers Multi-View Fusion and Surface Reconstruction, which describes multi-view depth fusion, geometric consistency filtering, and surface reconstruction methods.