Dense Stereo Matching
=====================
This page explains the mathematical foundations of dense depth estimation
in AquaMVS. The goal is to compute a dense depth map for each reference
camera, assigning a depth value to every pixel. Two complementary approaches
are available: plane sweep stereo (with optional sparse feature guidance)
and dense feature matching.
Overview
--------
Dense depth estimation is the core reconstruction step, transforming 2D
images into metric 3D information. For each reference camera, we aim to
produce:
* **Depth map**: :math:`D(u, v)` giving ray depth at each pixel
* **Confidence map**: :math:`C(u, v)` indicating reliability of the depth estimate
AquaMVS offers two main pathways:
1. **Sparse → Dense**: Extract sparse features (SuperPoint + LightGlue),
triangulate to get 3D points, use them to guide plane sweep depth range.
2. **Dense Matching**: Use RoMa v2 for dense correspondence between camera
pairs, triangulate dense matches directly.
Both approaches account for refractive ray geometry (see :doc:`refractive_geometry`).
Sparse Feature Matching
------------------------
Sparse features provide initial 3D information and depth range priors for
plane sweep stereo.
**Feature Extraction**
SuperPoint detects keypoints and computes descriptors on undistorted images.
For a camera with :math:`H \times W` image:
* Keypoints: :math:`\{(u_i, v_i)\}_{i=1}^N` (sub-pixel locations)
* Descriptors: :math:`\{\mathbf{f}_i \in \mathbb{R}^{256}\}_{i=1}^N` (L2-normalized)
**Feature Matching**
LightGlue matches descriptors between reference and source cameras using
a learned correspondence network. For cameras :math:`A` and :math:`B`:
* Input: Descriptors :math:`\{\mathbf{f}_i^A\}, \{\mathbf{f}_j^B\}`
* Output: Match set :math:`\mathcal{M} = \{(i, j) : \mathbf{f}_i^A \leftrightarrow \mathbf{f}_j^B\}`
LightGlue uses attention mechanisms to refine matches and prune outliers.
**Cross-Pair Triangulation**
For each match :math:`(i, j)` with pixel coordinates :math:`(u_i^A, v_i^A)`
and :math:`(u_j^B, v_j^B)`:
1. Cast rays through both pixels using refractive ray model:
.. math::
\mathbf{r}_A: \mathbf{p}(t) = \mathbf{O}_A + t \, \mathbf{d}_A
.. math::
\mathbf{r}_B: \mathbf{p}(s) = \mathbf{O}_B + s \, \mathbf{d}_B
2. Find closest point of approach (3D point minimizing distance to both rays).
3. Compute ray depths :math:`t^*` and :math:`s^*` for the closest point.
The sparse point cloud :math:`\{\mathbf{p}_k\}` is used to:
* **Filter outliers**: Statistical outlier removal (points with few neighbors)
* **Estimate depth range**: Compute percentile-based depth bounds per camera
(e.g., 5th to 95th percentile avoids outlier contamination)
**Depth Range Computation**
For reference camera :math:`R`, sparse points are projected onto reference
rays to get ray depths :math:`\{d_1, \ldots, d_M\}`. The plane sweep depth
range is:
.. math::
d_{\min} = \text{percentile}(\{d_i\}, 5\%), \quad
d_{\max} = \text{percentile}(\{d_i\}, 95\%)
This provides adaptive depth bounds without requiring manual specification.
Plane Sweep Stereo
------------------
Plane sweep stereo evaluates photometric similarity at discrete depth
hypotheses to build a cost volume, then extracts the best-matching depth
per pixel.
**Algorithm Overview**
For each reference pixel :math:`(u, v)`:
1. Sample depth hypotheses :math:`\{d_1, \ldots, d_D\}` uniformly in
:math:`[d_{\min}, d_{\max}]`.
2. For each depth :math:`d_k`:
a. Back-project to 3D: :math:`\mathbf{p}_k = \mathbf{O} + d_k \, \mathbf{d}`
(using refractive ray model).
b. Project into each source camera :math:`S_j` to get pixel location
:math:`(u_j, v_j)`.
c. Sample source image :math:`I_j` at :math:`(u_j, v_j)` via bilinear
interpolation.
d. Compute photometric cost between reference and warped source patches.
3. Aggregate costs across all source cameras.
4. Select depth with minimum cost: :math:`\hat{d}(u, v) = \arg\min_k C(u, v, k)`.
.. mermaid::
graph LR
subgraph "Reference Camera"
P["Pixel (u,v)"]
end
subgraph "Depth Hypotheses"
D1["d₁"] --> X1["3D Point p₁"]
D2["d₂"] --> X2["3D Point p₂"]
DN["dₙ"] --> XN["3D Point pₙ"]
end
subgraph "Source Cameras"
S1["Source 1
Sample I₁(u₁,v₁)"]
S2["Source 2
Sample I₂(u₂,v₂)"]
end
P --> D1
P --> D2
P --> DN
X1 --> S1
X1 --> S2
X2 --> S1
X2 --> S2
XN --> S1
XN --> S2
style X1 fill:#81c784,stroke:#388e3c
style X2 fill:#81c784,stroke:#388e3c
style XN fill:#81c784,stroke:#388e3c
**Cost Volume**
The cost volume is a 3D tensor:
.. math::
\mathbf{C} \in \mathbb{R}^{H \times W \times D}
where :math:`C(u, v, k)` is the aggregated photometric cost at pixel
:math:`(u, v)` and depth hypothesis :math:`d_k`.
**Photometric Cost Function**
AquaMVS uses **Normalized Cross-Correlation (NCC)** to measure local
patch similarity. For reference pixel :math:`(u, v)` and source pixel
:math:`(u', v')`, NCC in an :math:`w \times w` window is:
.. math::
\text{NCC}(u, v; u', v') = \frac{\sum_{(i,j) \in W} (I_R(i,j) - \bar{I}_R) (I_S(i,j) - \bar{I}_S)}{\sqrt{\sum_{(i,j) \in W} (I_R(i,j) - \bar{I}_R)^2} \sqrt{\sum_{(i,j) \in W} (I_S(i,j) - \bar{I}_S)^2}}
where :math:`\bar{I}_R` and :math:`\bar{I}_S` are local means in the window
:math:`W` centered at :math:`(u,v)` and :math:`(u',v')`, respectively.
The cost is defined as:
.. math::
\text{Cost} = 1 - \text{NCC}
so that:
* **Cost = 0**: Perfect correlation (identical patches)
* **Cost = 1**: Uncorrelated
* **Cost = 2**: Perfect anti-correlation
**Cost Aggregation**
When multiple source cameras are available, costs are combined via
averaging:
.. math::
C(u, v, k) = \frac{1}{M} \sum_{j=1}^{M} \text{Cost}_j(u, v, k)
where :math:`M` is the number of source cameras.
**Winner-Take-All Depth Selection**
The depth estimate at each pixel is:
.. math::
\hat{d}(u, v) = d_{k^*}, \quad k^* = \arg\min_k C(u, v, k)
**Confidence Estimation**
Confidence is derived from the cost distribution. A sharp minimum indicates
high confidence. AquaMVS uses the **cost ratio**:
.. math::
\text{Confidence}(u, v) = 1 - \frac{C(u, v, k^*)}{C(u, v, k_2)}
where :math:`k^*` is the best depth and :math:`k_2` is the second-best.
High confidence when the best cost is much lower than the second-best.
Confidence values are in :math:`[0, 1]`, with 1 indicating high reliability.
Dense Matching Alternative
---------------------------
As an alternative to plane sweep, AquaMVS supports **RoMa v2** for dense
correspondence estimation.
**RoMa Overview**
RoMa (Robust Matching) v2 is a learned dense matcher that produces
per-pixel correspondence fields between image pairs. Unlike sparse
matchers, it predicts a match for every pixel.
For cameras :math:`A` and :math:`B`:
* Input: Undistorted images :math:`I_A, I_B`
* Output: Dense correspondence map :math:`\mathbf{F}: (u_A, v_A) \to (u_B, v_B)`
* Confidence map indicating match reliability
**Dense Triangulation**
For each pixel :math:`(u_A, v_A)` in camera :math:`A` with match
:math:`(u_B, v_B)` in camera :math:`B`:
1. Cast rays through both pixels.
2. Triangulate to get 3D point :math:`\mathbf{p}`.
3. Compute ray depth for camera :math:`A`.
This produces a dense point cloud directly, without plane sweep.
**Comparison: Plane Sweep vs. Dense Matching**
+-------------------+-------------------------+---------------------------+
| Aspect | Plane Sweep | Dense Matching (RoMa) |
+===================+=========================+===========================+
| Coverage | Full dense depth map | Full dense matches |
+-------------------+-------------------------+---------------------------+
| Speed | Slower (evaluates all | Faster (single forward |
| | depth hypotheses) | pass) |
+-------------------+-------------------------+---------------------------+
| Accuracy | High (multi-view | Moderate (pairwise only) |
| | consensus) | |
+-------------------+-------------------------+---------------------------+
| Robustness | Robust to textureless | Struggles with large |
| | regions (if enough | viewpoint changes |
| | views) | |
+-------------------+-------------------------+---------------------------+
| Use Case | High-quality | Fast prototyping, preview |
| | reconstruction | reconstruction |
+-------------------+-------------------------+---------------------------+
Connection to Code
------------------
The algorithms described here are implemented in:
* :py:func:`aquamvs.dense.plane_sweep_stereo`: Main plane sweep function.
* :py:func:`aquamvs.dense.compute_ncc`: NCC cost computation.
* :py:func:`aquamvs.dense.extract_depth`: Winner-take-all depth selection and
confidence estimation.
* :py:mod:`aquamvs.features.matching`: Sparse feature matching (SuperPoint + LightGlue).
* :py:mod:`aquamvs.dense.roma_depth`: Dense matching via RoMa v2.
For API details, see :doc:`/api/reconstruction`.
Next Steps
----------
Once depth maps are computed for all cameras, they must be fused into a
unified 3D representation. The next section covers :doc:`fusion`, which
describes multi-view depth fusion, geometric consistency filtering, and
surface reconstruction methods.