# Dissolve with dask-geopandas
Dask-geopandas offers a distributed version of the `dissolve` method based on the `dask.dataframe` `groupby` method. It shares all the benefits and caveats of the vanilla `groupby`, with a few minor differences.

The geometry column is aggregated using the `unary_union` operation, which tends to be relatively costly, so it is beneficial to use all workers in every step of the operation. By default, `groupby` returns the result as a single partition, which in practice means that the final aggregation step runs in a single thread. For geometries, that thread has to loop through the groups of geometries coming from the other partitions and call `unary_union` on each group, one by one.
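The cost of that single-partition finalisation can be illustrated with a conceptual sketch in plain Python (this is not the dask-geopandas internals): sets stand in for geometries and set union stands in for `unary_union`.

```python
# Partial results produced in parallel by three input partitions,
# keyed by group label. Sets are stand-ins for geometries.
partials = [
    {"Africa": {1, 2}, "Asia": {5}},
    {"Africa": {2, 3}, "Europe": {7}},
    {"Asia": {6}, "Europe": {8, 9}},
]

# Single-partition finalisation: one thread merges every group,
# sequentially. This loop is where the per-group "unary_union"
# happens when the result is a single partition.
merged = {}
for part in partials:
    for group, geoms in part.items():
        merged[group] = merged.get(group, set()) | geoms

print(merged)
# {'Africa': {1, 2, 3}, 'Asia': {5, 6}, 'Europe': {7, 8, 9}}
```

With real geometries each of those merge steps is an expensive union, and they all run on one thread.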
The number of output partitions can be specified using the `split_out` keyword passed to the aggregation. If we set `split_out=16`, the result comes back as 16 partitions (assuming there are at least 16 unique groups), each of which is processed by a different worker (assuming there are at least 16 workers). The final aggregation (`unary_union`) is then parallelised as well.
> **Note**
>
> Using `split_out > 1` may fail with newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.
Dask-geopandas `dissolve` uses the same default number of output partitions (1) as `dask.dataframe`, but it is recommended to change it to at least match the number of workers to get the full benefit of parallelised computation.
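One rough way to size `split_out` is sketched below. The helper `pick_split_out` is hypothetical (not part of the dask-geopandas API), and it assumes one worker per CPU core, as with the default threaded scheduler; on a distributed cluster you would use the actual worker count instead.

```python
import os

def pick_split_out(n_groups, n_workers=None):
    """Hypothetical helper: choose split_out for a dissolve."""
    if n_workers is None:
        # assumption: one worker per CPU core (default threaded scheduler)
        n_workers = os.cpu_count() or 1
    # at least 1 partition, and no more partitions than unique groups
    return max(1, min(n_groups, n_workers))

print(pick_split_out(8, n_workers=4))  # 8 continents, 4 workers -> 4
print(pick_split_out(2, n_workers=4))  # only 2 unique groups -> 2
```

Capping at the number of unique groups avoids empty output partitions, since dissolve can produce at most one row per group.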
```python
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
```
|  | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |  |
| 0 | int64 | object | object | object | float64 | geometry |
| 45 | ... | ... | ... | ... | ... | ... |
| 90 | ... | ... | ... | ... | ... | ... |
| 135 | ... | ... | ... | ... | ... | ... |
| 176 | ... | ... | ... | ... | ... | ... |
With the default settings, you get a single partition:
```python
dissolved = ddf.dissolve("continent")
dissolved
```
|  | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| npartitions=1 |  |  |  |  |  |
|  | int64 | object | object | float64 | geometry |
|  | ... | ... | ... | ... | ... |
You can specify the number of output partitions and get a parallelised implementation in every step:
> **Note**
>
> Using `split_out > 1` may fail with newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.
```python
dissolved_4parts = ddf.dissolve("continent", split_out=4)
dissolved_4parts
```
|  | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |
|  | int64 | object | object | float64 | geometry |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
```python
dissolved_4parts.compute()
```
| continent | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| Europe | 142257519 | Russia | RUS | 3745000.0 | MULTIPOLYGON (((-54.26971 2.73239, -54.18173 3... |
| North America | 35623680 | Canada | CAN | 1674000.0 | MULTIPOLYGON (((-156.07347 19.70294, -156.0236... |
| Antarctica | 4050 | Antarctica | ATA | 810.0 | MULTIPOLYGON (((-59.86585 -80.54966, -60.15966... |
| Oceania | 920938 | Fiji | FJI | 8374.0 | MULTIPOLYGON (((146.66333 -43.58085, 146.04838... |
| Asia | 18556698 | Kazakhstan | KAZ | 460700.0 | MULTIPOLYGON (((102.58426 -4.22026, 102.15617 ... |
| Seven seas (open ocean) | 140 | Fr. S. Antarctic Lands | ATF | 16.0 | POLYGON ((68.93500 -48.62500, 69.58000 -48.940... |
| South America | 44293293 | Argentina | ARG | 879400.0 | MULTIPOLYGON (((-69.95809 -55.19843, -71.00568... |
| Africa | 53950935 | Tanzania | TZA | 150600.0 | MULTIPOLYGON (((32.58026 -27.47016, 32.46213 -... |
## Alternative solution
In some specific cases, `groupby` may not be the most performant option. If your GeoDataFrame fits fully in memory on a single machine, it may be faster to shuffle the data based on the `by` column first, so that each partition holds complete groups, and then map the GeoPandas `dissolve` across the partitions. In that case, dask-geopandas calls `unary_union` only once per geometry, whereas with `groupby` each geometry passes through `unary_union` twice (once in the partial, per-partition aggregation and once in the final one). However, shuffling is an expensive operation, so for larger data this alternative is not feasible.
```python
def dissolve_shuffle(ddf, by=None, **kwargs):
    """Shuffle and map partition"""
    # metadata of the expected output, derived from an empty dissolve
    meta = ddf._meta.dissolve(by=by, as_index=False, **kwargs)
    # move all rows with the same ``by`` value to the same partition
    shuffled = ddf.shuffle(
        by, npartitions=ddf.npartitions, shuffle="tasks", ignore_index=True
    )
    # each partition now holds complete groups, so a plain GeoPandas
    # dissolve per partition yields the final result
    return shuffled.map_partitions(
        geopandas.GeoDataFrame.dissolve, by=by, as_index=False, meta=meta, **kwargs
    )
```
```python
shuffled = dissolve_shuffle(ddf, "continent")
shuffled
```
|  | continent | geometry | pop_est | name | iso_a3 | gdp_md_est |
|---|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |  |
|  | object | geometry | int64 | object | object | float64 |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
```python
shuffled.compute()
```
|  | continent | geometry | pop_est | name | iso_a3 | gdp_md_est |
|---|---|---|---|---|---|---|
| 0 | Africa | MULTIPOLYGON (((-11.43878 6.78592, -11.70819 6... | 53950935 | Tanzania | TZA | 150600.0 |
| 0 | Asia | MULTIPOLYGON (((48.67923 14.00320, 48.23895 13... | 18556698 | Kazakhstan | KAZ | 460700.0 |
| 1 | Seven seas (open ocean) | POLYGON ((68.93500 -48.62500, 69.58000 -48.940... | 140 | Fr. S. Antarctic Lands | ATF | 16.0 |
| 2 | South America | MULTIPOLYGON (((-68.63999 -55.58002, -69.23210... | 44293293 | Argentina | ARG | 879400.0 |
| 0 | Oceania | MULTIPOLYGON (((147.91405 -43.21152, 147.56456... | 920938 | Fiji | FJI | 8374.0 |
| 0 | Antarctica | MULTIPOLYGON (((-61.13898 -79.98137, -60.61012... | 4050 | Antarctica | ATA | 810.0 |
| 1 | Europe | MULTIPOLYGON (((-53.55484 2.33490, -53.77852 2... | 142257519 | Russia | RUS | 3745000.0 |
| 2 | North America | MULTIPOLYGON (((-155.22217 19.23972, -155.5421... | 35623680 | Canada | CAN | 1674000.0 |