Dissolve with dask-geopandas

Dask-geopandas offers a distributed version of the dissolve method, built on top of the dask.dataframe groupby machinery. It shares all the benefits and caveats of the vanilla groupby, with a few minor differences.

The geometry column is aggregated using the unary_union operation, which tends to be relatively costly, so it is beneficial to use all workers in every step of the operation. By default, GroupBy returns its result as a single partition, which in practice means that the final aggregation step runs in a single thread. For geometries, that means the operation has to loop through the groups of geometries coming from the other partitions and call unary_union on each group, one by one.
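
To make that serial bottleneck concrete, here is a simplified sketch of what the single-partition final step effectively does, written in plain GeoPandas and Shapely. This is not the actual Dask code path; the dataset and variable names are illustrative only.

import geopandas
from shapely.ops import unary_union

gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

# A single serial loop over the groups: each group's geometries are unioned
# one group at a time, with no parallelism involved.
merged = {
    continent: unary_union(list(group.geometry))
    for continent, group in gdf.groupby("continent")
}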

The number of output partitions can be specified using the split_out keyword passed to the aggregation. If we set split_out=16, the result comes back as 16 partitions (as long as there are 16 or more unique groups), each of which can be processed by a different worker (as long as there are 16 or more workers). The final aggregation (unary_union) is then parallelised as well.

Note

Using split_out > 1 may fail with the newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.

Dask-geopandas dissolve uses the same default number of output partitions (1) as dask.dataframe, but it is recommended to set it to at least the number of workers to get the full benefit of parallelised computation.

import geopandas
import dask_geopandas
df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
Dask-GeoPandas GeoDataFrame Structure:
pop_est continent name iso_a3 gdp_md_est geometry
npartitions=4
0 int64 object object object float64 geometry
45 ... ... ... ... ... ...
90 ... ... ... ... ... ...
135 ... ... ... ... ... ...
176 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks

Using default settings, you get a single partition:

dissolved = ddf.dissolve("continent")
dissolved
Dask-GeoPandas GeoDataFrame Structure:
pop_est name iso_a3 gdp_md_est geometry
npartitions=1
int64 object object float64 geometry
... ... ... ... ...
Dask Name: set_crs, 10 tasks

You can specify the number of output partitions and get a parallelised implementation in every step:

dissolved_4parts = ddf.dissolve("continent", split_out=4)
dissolved_4parts
Dask-GeoPandas GeoDataFrame Structure:
pop_est name iso_a3 gdp_md_est geometry
npartitions=4
int64 object object float64 geometry
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
Dask Name: set_crs, 36 tasks
dissolved_4parts.compute()
pop_est name iso_a3 gdp_md_est geometry
continent
Europe 142257519 Russia RUS 3745000.0 MULTIPOLYGON (((-54.26971 2.73239, -54.18173 3...
North America 35623680 Canada CAN 1674000.0 MULTIPOLYGON (((-156.07347 19.70294, -156.0236...
Antarctica 4050 Antarctica ATA 810.0 MULTIPOLYGON (((-59.86585 -80.54966, -60.15966...
Oceania 920938 Fiji FJI 8374.0 MULTIPOLYGON (((146.66333 -43.58085, 146.04838...
Asia 18556698 Kazakhstan KAZ 460700.0 MULTIPOLYGON (((102.58426 -4.22026, 102.15617 ...
Seven seas (open ocean) 140 Fr. S. Antarctic Lands ATF 16.0 POLYGON ((68.93500 -48.62500, 69.58000 -48.940...
South America 44293293 Argentina ARG 879400.0 MULTIPOLYGON (((-69.95809 -55.19843, -71.00568...
Africa 53950935 Tanzania TZA 150600.0 MULTIPOLYGON (((32.58026 -27.47016, 32.46213 -...
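
In practice, you would typically set split_out to at least the number of workers. The following is a minimal sketch assuming a local dask.distributed cluster; the client setup and variable names are illustrative and not part of the dissolve API.

from dask.distributed import Client

client = Client()  # local cluster; adapt to your own deployment
n_workers = len(client.scheduler_info()["workers"])

# One output partition per worker lets every worker take part in the
# final unary_union step.
dissolved_parallel = ddf.dissolve("continent", split_out=max(n_workers, 1))
result = dissolved_parallel.compute()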

Alternative solution

In some specific cases, groupby may not be the most performant option. If your GeoDataFrame fits fully into memory on a single machine, it may be faster to shuffle the data by the by column first and then map the GeoPandas dissolve across the partitions. In that case, each geometry passes through unary_union only once (with GroupBy, unary_union is done twice). However, shuffling is an expensive operation, and for larger data this alternative is not feasible.

def dissolve_shuffle(ddf, by=None, **kwargs):
    """Shuffle by ``by`` and dissolve each partition with GeoPandas."""

    # Infer the output schema from an (empty) GeoPandas dissolve.
    meta = ddf._meta.dissolve(by=by, as_index=False, **kwargs)

    # Shuffle so that all rows of a group end up in the same partition.
    shuffled = ddf.shuffle(
        by, npartitions=ddf.npartitions, shuffle="tasks", ignore_index=True
    )

    # Each partition now contains complete groups, so a plain GeoPandas
    # dissolve per partition produces the final result.
    return shuffled.map_partitions(
        geopandas.GeoDataFrame.dissolve, by=by, as_index=False, meta=meta, **kwargs
    )
shuffled = dissolve_shuffle(ddf, "continent")
shuffled
Dask-GeoPandas GeoDataFrame Structure:
continent geometry pop_est name iso_a3 gdp_md_est
npartitions=4
object geometry int64 object object float64
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Dask Name: dissolve, 32 tasks
shuffled.compute()
continent geometry pop_est name iso_a3 gdp_md_est
0 Africa MULTIPOLYGON (((-11.43878 6.78592, -11.70819 6... 53950935 Tanzania TZA 150600.0
0 Asia MULTIPOLYGON (((48.67923 14.00320, 48.23895 13... 18556698 Kazakhstan KAZ 460700.0
1 Seven seas (open ocean) POLYGON ((68.93500 -48.62500, 69.58000 -48.940... 140 Fr. S. Antarctic Lands ATF 16.0
2 South America MULTIPOLYGON (((-68.63999 -55.58002, -69.23210... 44293293 Argentina ARG 879400.0
0 Oceania MULTIPOLYGON (((147.91405 -43.21152, 147.56456... 920938 Fiji FJI 8374.0
0 Antarctica MULTIPOLYGON (((-61.13898 -79.98137, -60.61012... 4050 Antarctica ATA 810.0
1 Europe MULTIPOLYGON (((-53.55484 2.33490, -53.77852 2... 142257519 Russia RUS 3745000.0
2 North America MULTIPOLYGON (((-155.22217 19.23972, -155.5421... 35623680 Canada CAN 1674000.0
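
If you are unsure which approach suits your data, you can benchmark both on a representative sample. The loop below is a rough sketch of such a comparison; it reuses the objects defined above and implies no particular results. On this toy dataset, scheduling overhead dominates, so treat it only as a template for measuring your own data.

import time

for label, frame in [
    ("groupby dissolve (split_out=4)", ddf.dissolve("continent", split_out=4)),
    ("shuffle + map_partitions", dissolve_shuffle(ddf, "continent")),
]:
    start = time.perf_counter()
    frame.compute()
    print(f"{label}: {time.perf_counter() - start:.2f} s")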