NaN-aware & weighted reduction¶
The single-matmul fast path assumes clean data: every cell is valid and every cell counts equally. Two situations break that assumption, and geohalo handles both by renormalising per slice instead of trusting a fixed denominator.
Why a fixed operator can't do it¶
The mean is \((\mathbf{W}\mathbf{x})_i\) divided by polygon \(i\)'s total overlap area, the row sum \(\sum_j W_{ij}\). That denominator is baked into the operator — correct only when every cell under the polygon contributes.
But if some cells are NaN (missing data — outside the model domain, masked ocean, etc.), those cells must drop out of both the numerator and the denominator, and which cells are NaN can differ from slice to slice. The same goes for per-cell weights: a population-weighted mean divides by the weighted valid area, not the geometric area. No single precomputed scalar captures a per-slice mask, so geohalo recomputes the denominator on the fly with a second matmul.
The masked path¶
reduce_with_stencil picks the path automatically:
flowchart TD
IN["reduce_with_stencil(grid, stencil)"] --> Q{"weight_key set<br/>OR any spatial NaN?"}
Q -->|no| CLEAN["build ReduceOperator<br/>→ one matmul (fast path)"]
Q -->|yes| MASK["masked path:<br/>numerator + denominator matmuls"]
classDef hl fill:#fef3c7,stroke:#b45309,color:#0b1220;
For the masked path, with the stencil's occupancy matrix \(\mathbf{W}\), a validity mask \(\mathbf{v}\) (1 where valid, 0 where NaN), and optional per-cell weights \(\mathbf{w}\):
Both sums are matmuls against \(\mathbf{W}\): the numerator projects the weight-and-mask-applied values, the denominator projects the weights-times-mask. The ratio is the renormalised mean, computed fresh for each slice's mask — without rebuilding \(\mathbf{W}\).
# from geohalo.api._project_masked (mean, unweighted)
valid = ~np.isnan(resampled)
numer = (occ @ np.where(valid, resampled, 0.0).T).T
denom = (occ @ valid.astype(np.float64).T).T
return np.where(denom > 0, numer / denom, np.nan)
A polygon all of whose cells are NaN gets denom == 0 and resolves to NaN rather than
a divide-by-zero — missing in, missing out.
An allocation geohalo skips
In the unweighted case the per-cell weight is just 1, so geohalo does not
materialise a full ones-array or run the two no-op multiplies. On a 50×4 batch
those skipped (batch, n_cells) allocations would run to gigabytes — visible in the
benchmarks as the masked-vs-clean memory gap.
how="sum" under masking¶
For a sum there is no denominator to renormalise — invalid cells simply contribute zero:
geohalo returns the numerator directly and skips the denominator matmul.
Per-cell weights¶
Name a weight variable with weight_key. It is resolved on the grid being reduced,
broadcast to match each data variable's batch shape, and (when refining) passed through
the same resampler as the data. Build a per-cell weight field, attach it to the data, and
reduce against it — a complete runnable example:
import numpy as np
import geopandas as gpd
import xarray as xr
from shapely.geometry import box
import geohalo as ghl
lats = np.arange(-25.0, -19.0, 0.25)
lons = np.arange(-50.0, -42.0, 0.25)
lon2d, lat2d = np.meshgrid(lons, lats)
da = xr.DataArray(
290.0 + 5.0 * np.cos(np.deg2rad(4 * lat2d)) + 0.1 * lon2d,
dims=("latitude", "longitude"),
coords={"latitude": lats, "longitude": lons},
name="t2m",
)
geoms = gpd.GeoSeries([box(-49, -24, -47, -22), box(-46, -22, -44, -20)], index=["SP", "MG"])
# a per-cell weight field (e.g. population) on the same grid, attached as a coordinate
weight = np.abs(np.random.default_rng(0).normal(1000.0, 400.0, size=da.shape))
da_w = da.assign_coords(population=(("latitude", "longitude"), weight))
out = ghl.reduce(da_w, geoms, weight_key="population") # population-weighted mean
print(out.values)
A NaN in the weight field drops that cell too — the validity mask is
~isnan(value) & ~isnan(weight).
Cost¶
The masked path is slower than the clean path — two matmuls instead of one, plus the
mask arithmetic — but still milliseconds. With a 1 % NaN mask over 5 571 polygons and a
batch of 50 slices it runs in ~14 ms, versus ~6 ms clean (see the
benchmarks). When a grid resample is also needed, the masked path
keeps the resampler factored (FactoredResampler) and applies it per slice, rather
than fusing it — because fusion would re-bake the denominator it is trying to keep live.