Quickstart¶
Every snippet on this page is self-contained and runnable — copy, paste, and it
works. We build a synthetic grid so there are no files to download; in your own code,
da comes from xr.open_dataset(...) (GRIB, NetCDF, Zarr, …) and geoms from your own
shapefile or GeoDataFrame.
Install¶
geohalo targets Python ≥ 3.12.
Two features live behind optional extras: redis (the
RedisCache backend) and matplotlib (the plotting helpers in
geohalo.plot).
Set up a grid and polygons¶
import numpy as np
import geopandas as gpd
import xarray as xr
from shapely.geometry import box
import geohalo as ghl
# a regular 0.25° lat/lon grid + a synthetic 2 m-temperature field (K)
lats = np.arange(-25.0, -19.0, 0.25)
lons = np.arange(-50.0, -42.0, 0.25)
lon2d, lat2d = np.meshgrid(lons, lats)
field = 290.0 + 5.0 * np.cos(np.deg2rad(4 * lat2d)) + 0.1 * lon2d
da = xr.DataArray(
field,
dims=("latitude", "longitude"),
coords={"latitude": lats, "longitude": lons},
name="t2m",
)
# three polygons (plain boxes here; any shapely geometry works)
geoms = gpd.GeoSeries(
[box(-49, -24, -47, -22), box(-47, -24, -45, -22), box(-46, -22, -44, -20)],
index=["SP", "RJ", "MG"], # the index holds the keys
)
The keys live on the index
geohalo never invents identifiers. Whatever you put on the GeoSeries index — a
string, a tuple, a pandas.MultiIndex — comes back as the geom coordinate, so the
result joins straight back to your metadata.
Reduce to per-polygon values¶
out = ghl.reduce(da, geoms)
print(out)
# <xarray.DataArray 't2m' (geom: 3)>
# array([286.02, 285.23, 285.03])
# Coordinates:
# * geom (geom) object 'MG' 'RJ' 'SP'
The output replaces (latitude, longitude) with a single geom dim and keeps every
other dim untouched. (Polygons are processed in sorted-key order, so geom comes back
sorted — sel by key whenever order matters.)
Batch dimensions come for free¶
Stack as many non-spatial dims as you like — time steps, ensemble members, bands, levels. They are flattened, run through one matmul, and reshaped back.
rng = np.random.default_rng(0)
ens = field[None] + rng.normal(0, 1.0, size=(50, *field.shape)) # 50 members
da_ens = xr.DataArray(
ens,
dims=("member", "latitude", "longitude"),
coords={"member": np.arange(50), "latitude": lats, "longitude": lons},
name="t2m",
)
out = ghl.reduce(da_ens, geoms)
print(out.dims, out.shape) # ('member', 'geom') (50, 3)
Datasets work too¶
Pass an xr.Dataset and every data variable carrying the spatial dims is reduced; the
rest pass through.
ds = xr.Dataset({"t2m": da, "tp": da * 0 + 3.0}) # two spatial vars
out = ghl.reduce(ds, geoms)
print(list(out.data_vars)) # ['t2m', 'tp']
Weighted means with weight_key¶
For a weighted mean — population-weighted temperature, area-weighted anything — attach a
per-cell weight field and name it with weight_key. Here we build a synthetic
population grid, add it to the data as a coordinate, and reduce against it:
# a per-cell weight field on the same grid (e.g. population density)
weight = np.abs(rng.normal(1000.0, 400.0, size=field.shape))
# attach it as a coordinate so geohalo can look it up by name
da_w = da.assign_coords(population=(("latitude", "longitude"), weight))
out = ghl.reduce(da_w, geoms, weight_key="population")
print(out.values) # population-weighted mean per polygon
Each polygon's value becomes \(\sum_j W_{ij}\,w_j\,x_j \big/ \sum_j W_{ij}\,w_j\) — the
cell weights multiply the geometric area weights. A NaN in the weight field drops that
cell. See NaN-aware & weighted reduction for the full story.
Weights on a Dataset
With a Dataset you can instead carry the weight as a data variable:
xr.Dataset({"t2m": da, "population": (("latitude", "longitude"), weight)}) and call
reduce(ds, geoms, weight_key="population"). The weight variable is reduced too and
appears in the output — select the variable you care about (out["t2m"]).
Missing data is handled automatically¶
When some cells are NaN (outside the domain, masked ocean, a sensor gap), geohalo drops
them from each polygon's average and renormalises by the valid area — no flag needed.
masked = field.copy()
masked[lat2d > -21.0] = np.nan # blank out the northern strip
da_nan = xr.DataArray(
masked,
dims=("latitude", "longitude"),
coords={"latitude": lats, "longitude": lons},
name="t2m",
)
out = ghl.reduce(da_nan, geoms) # NaN-aware; a polygon with no valid cell → NaN
print(out.values)
Sub-cell precision: refine first¶
Polygons smaller than a cell get a sharper answer if you refine the grid with mean-preserving downscaling before reducing — geohalo fuses the refine and the reduce into one operator, so the fine grid is never materialised.
out = ghl.reduce(da, geoms, target_resolution=0.05) # refine 0.25° → 0.05°
out = ghl.reduce(da, geoms, target_resolution=0.05, resample_iterations=3) # smoother
Separate the precompute from the hot path¶
reduce builds the operator and applies it in one call. For repeated calls (many grids
sharing the same grid + polygons), build it once, cache it, and
reuse:
cache = ghl.LocalCache("./.geohalo-cache")
stencil = cache.get_or_compute_stencil(da.latitude.values, da.longitude.values, geoms)
out = ghl.reduce_with_stencil(da_ens, stencil) # reuse for every slice
flowchart LR
subgraph precompute [precompute — once per grid+polygons]
G[grid coords] --> S[Stencil.compute]
P[polygons] --> S
end
subgraph hot [hot path — once per grid slice]
S --> R[reduce_with_stencil]
D[grid data] --> R
R --> O[per-polygon values]
end
Common options¶
reduce (and reduce_with_stencil) accept:
| Argument | Default | Effect |
|---|---|---|
how |
"mean" |
"mean" (area-weighted) or "sum" (area-weighted total) |
target_resolution |
None |
refine the grid to this step before reducing — sub-cell precision |
resample_iterations |
1 |
smoothing passes of the mean-preserving refine |
weight_key |
None |
name of a per-cell weight variable |
spherical_correction |
True |
divide each cell's weight by its true spherical area |
lat_dim / lon_dim |
"latitude" / "longitude" |
rename the spatial dims |
geom_dim |
"geom" |
rename the output polygon dim |
out = ghl.reduce(da, geoms, how="sum")
out = ghl.reduce(da, geoms, spherical_correction=False) # planar / equal-area weights
Roll up a hierarchy¶
If leaf polygons belong to a parent grouping (municipalities → states → country), one more
matmul rolls them up. The DataFrame index is the child; the parent column is its
parent.
import pandas as pd
out = ghl.reduce(da, geoms) # leaf values for SP, RJ, MG
edges = pd.DataFrame(
{"parent": ["BR", "BR", "BR"]},
index=pd.Index(["SP", "RJ", "MG"], name="geom"),
)
rolled = ghl.aggregate_bias(out, edges) # carries leaves AND the parent 'BR'
print(dict(zip(rolled.geom.values, rolled.values.round(2))))
# {'MG': 286.02, 'RJ': 285.23, 'SP': 285.03, 'BR': 285.43}
See Hierarchical rollups for weights, how="sum", and deeper
trees.