Performance¶

The headline claim — millisecond-scale aggregation in the hot path — is backed by a reproducible suite in benchmarks/. This page reads the numbers; the full generated report has every row.

uv run python -m benchmarks.run

The shape of the cost¶

geohalo deliberately moves all the expensive work into a one-time precompute, leaving a hot path that is a single sparse · dense matmul.

flowchart LR
    subgraph once ["once — seconds"]
        S["Stencil / ReduceOperator build<br/>exactextract + sparse assembly"]
    end
    subgraph each ["per grid slice — milliseconds"]
        M["one matmul<br/>flat @ M.T"]
    end
    S -.cache.-> M

Precompute — pay once¶

Building the stencil scales with polygon count and vertex complexity; the resulting matrix is small and cacheable.

n_polygons (GADM Brazil L2)	`Stencil.compute`	CSR size
50	~21 ms	6 KB
507	~170 ms	38 KB
5571	~2.1 s	430 KB

After the first build, a cache hit loads in milliseconds — a ~30–46× speedup for stencils, and thousands of times for a refined ReduceOperator.

Hot path — pay per slice¶

All rows below aggregate to GADM Brazil L2 (~5570 municipalities) on a 0.25° grid over Brazil (160×160 = 25 600 cells). The batch dims are arbitrary — any stacked non-spatial dims work; here they follow the ECMWF data the suite happens to use, where member=50 is a 50-member ensemble and step=N is N lead times.

n_polygons	batch	slices	factor	median
50	(member=50,)	50	1	3.6 ms
5571	(member=50,)	50	1	5.8 ms
5571	(member=50, step=10)	500	1	196 ms
5571	(member=50, step=40)	2 000	1	670 ms
5571	(member=50,)	50	4 (refined)	113 ms
5571	(member=50,)	50	1 (1 % NaN)	14 ms

A batch of 50 grid slices over 5 571 polygons reduces in single-digit milliseconds. The cost scales with the number of slices (it is one matmul over the flattened batch), and the NaN-aware path costs roughly 2–3× the clean path for its second matmul.

The fusion win¶

For a 0.25° → 0.05° refine (~3.2 M target cells) over 500 polygons, the materialised resampler is a 358 MB blob that cannot build at all at iterations=3. The fused ReduceOperator is a 0.40 MB blob, builds in ~0.5 s, and loads in ~0.5 ms — same answer, ~900× smaller.

Materialised resampler vs fused reduce operator

Rollups¶

Hierarchical rollups are another matmul. The full GADM Brazil muni → state hierarchy (5 571 leaves) rolls a batch of 50 slices up in ~5.6 ms; with a 500-slice batch, ~19 ms.

Caveats¶

Numbers are point-in-time on the author's machine and vary ±20 % by hardware. Cold-import overhead (~0.3 s for import geohalo) is excluded — the suite measures steady-state cost. The full report is regenerated after any perf-relevant change and records the exact environment, hardware, and methodology.