# Dissolve with dask-geopandas
Dask-geopandas offers a distributed version of the `dissolve` method based on the `dask.dataframe` `groupby` method. It shares all the benefits and caveats of the vanilla `groupby`, with a few minor differences.

The geometry column is aggregated using the `unary_union` operation, which tends to be relatively costly, so it is beneficial to use all workers in every step of the operation. By default, `groupby` returns the result as a single partition, which in practice means that the final aggregation step runs in a single thread. For geometries, that thread has to loop through the groups of geometries coming from the other partitions and call `unary_union` on each group, one by one.
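The cost of that single-partition finalisation can be illustrated with a conceptual sketch in plain Python (this is not the dask-geopandas internals): sets stand in for geometries and set union stands in for `unary_union`.

```python
# Partial results produced in parallel by three input partitions,
# keyed by group label. Sets are stand-ins for geometries.
partials = [
    {"Africa": {1, 2}, "Asia": {5}},
    {"Africa": {2, 3}, "Europe": {7}},
    {"Asia": {6}, "Europe": {8, 9}},
]

# Single-partition finalisation: one thread merges every group,
# sequentially. This loop is where the per-group "unary_union"
# happens when the result is a single partition.
merged = {}
for part in partials:
    for group, geoms in part.items():
        merged[group] = merged.get(group, set()) | geoms

print(merged)
# {'Africa': {1, 2, 3}, 'Asia': {5, 6}, 'Europe': {7, 8, 9}}
```

With real geometries each of those merge steps is an expensive union, and they all run on one thread.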
The number of output partitions can be specified using the `split_out` keyword passed to the aggregation. If we set `split_out=16`, the result comes back as 16 partitions (assuming there are at least 16 unique groups), each of which is processed by a different worker (assuming there are at least 16 workers). The final aggregation (`unary_union`) is then parallelised as well.
> **Note**
>
> Using `split_out > 1` may fail with newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.
Dask-geopandas `dissolve` uses the same default number of output partitions (1) as `dask.dataframe`, but it is recommended to change it to at least match the number of workers to get the full benefit of parallelised computation.
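One rough way to size `split_out` is sketched below. The helper `pick_split_out` is hypothetical (not part of the dask-geopandas API), and it assumes one worker per CPU core, as with the default threaded scheduler; on a distributed cluster you would use the actual worker count instead.

```python
import os

def pick_split_out(n_groups, n_workers=None):
    """Hypothetical helper: choose split_out for a dissolve."""
    if n_workers is None:
        # assumption: one worker per CPU core (default threaded scheduler)
        n_workers = os.cpu_count() or 1
    # at least 1 partition, and no more partitions than unique groups
    return max(1, min(n_groups, n_workers))

print(pick_split_out(8, n_workers=4))  # 8 continents, 4 workers -> 4
print(pick_split_out(2, n_workers=4))  # only 2 unique groups -> 2
```

Capping at the number of unique groups avoids empty output partitions, since dissolve can produce at most one row per group.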
```python
import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
```
|  | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |  |
| 0 | int64 | object | object | object | float64 | geometry |
| 45 | ... | ... | ... | ... | ... | ... |
| 90 | ... | ... | ... | ... | ... | ... |
| 135 | ... | ... | ... | ... | ... | ... |
| 176 | ... | ... | ... | ... | ... | ... |
With the default settings, you get a single partition:
```python
dissolved = ddf.dissolve("continent")
dissolved
```
|  | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| npartitions=1 |  |  |  |  |  |
|  | int64 | object | object | float64 | geometry |
|  | ... | ... | ... | ... | ... |
You can specify the number of output partitions and get a parallelised implementation in every step:
> **Note**
>
> Using `split_out > 1` may fail with newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.
```python
dissolved_4parts = ddf.dissolve("continent", split_out=4)
dissolved_4parts
```
|  | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |
|  | int64 | object | object | float64 | geometry |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... |
```python
dissolved_4parts.compute()
```
| continent | pop_est | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|
| Europe | 142257519 | Russia | RUS | 3745000.0 | MULTIPOLYGON (((-54.26971 2.73239, -54.18173 3... |
| North America | 35623680 | Canada | CAN | 1674000.0 | MULTIPOLYGON (((-156.07347 19.70294, -156.0236... |
| Antarctica | 4050 | Antarctica | ATA | 810.0 | MULTIPOLYGON (((-59.86585 -80.54966, -60.15966... |
| Oceania | 920938 | Fiji | FJI | 8374.0 | MULTIPOLYGON (((146.66333 -43.58085, 146.04838... |
| Asia | 18556698 | Kazakhstan | KAZ | 460700.0 | MULTIPOLYGON (((102.58426 -4.22026, 102.15617 ... |
| Seven seas (open ocean) | 140 | Fr. S. Antarctic Lands | ATF | 16.0 | POLYGON ((68.93500 -48.62500, 69.58000 -48.940... |
| South America | 44293293 | Argentina | ARG | 879400.0 | MULTIPOLYGON (((-69.95809 -55.19843, -71.00568... |
| Africa | 53950935 | Tanzania | TZA | 150600.0 | MULTIPOLYGON (((32.58026 -27.47016, 32.46213 -... |
## Alternative solution
In some specific cases, `groupby` may not be the most performant option. If your GeoDataFrame fits fully in memory on a single machine, it may be faster to shuffle the data based on the `by` column first, so that each partition holds complete groups, and then map the GeoPandas `dissolve` across the partitions. In that case, dask-geopandas calls `unary_union` only once per geometry, whereas with `groupby` each geometry passes through `unary_union` twice (once in the partial, per-partition aggregation and once in the final one). However, shuffling is an expensive operation, so for larger data this alternative is not feasible.
```python
def dissolve_shuffle(ddf, by=None, **kwargs):
    """Shuffle and map partition"""
    # metadata of the expected output, derived from an empty dissolve
    meta = ddf._meta.dissolve(by=by, as_index=False, **kwargs)
    # move all rows with the same ``by`` value to the same partition
    shuffled = ddf.shuffle(
        by, npartitions=ddf.npartitions, shuffle="tasks", ignore_index=True
    )
    # each partition now holds complete groups, so a plain GeoPandas
    # dissolve per partition yields the final result
    return shuffled.map_partitions(
        geopandas.GeoDataFrame.dissolve, by=by, as_index=False, meta=meta, **kwargs
    )
```
```python
shuffled = dissolve_shuffle(ddf, "continent")
shuffled
```
|  | continent | geometry | pop_est | name | iso_a3 | gdp_md_est |
|---|---|---|---|---|---|---|
| npartitions=4 |  |  |  |  |  |  |
|  | object | geometry | int64 | object | object | float64 |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
```python
shuffled.compute()
```
|  | continent | geometry | pop_est | name | iso_a3 | gdp_md_est |
|---|---|---|---|---|---|---|
| 0 | Africa | MULTIPOLYGON (((-11.43878 6.78592, -11.70819 6... | 53950935 | Tanzania | TZA | 150600.0 |
| 0 | Asia | MULTIPOLYGON (((48.67923 14.00320, 48.23895 13... | 18556698 | Kazakhstan | KAZ | 460700.0 |
| 1 | Seven seas (open ocean) | POLYGON ((68.93500 -48.62500, 69.58000 -48.940... | 140 | Fr. S. Antarctic Lands | ATF | 16.0 |
| 2 | South America | MULTIPOLYGON (((-68.63999 -55.58002, -69.23210... | 44293293 | Argentina | ARG | 879400.0 |
| 0 | Oceania | MULTIPOLYGON (((147.91405 -43.21152, 147.56456... | 920938 | Fiji | FJI | 8374.0 |
| 0 | Antarctica | MULTIPOLYGON (((-61.13898 -79.98137, -60.61012... | 4050 | Antarctica | ATA | 810.0 |
| 1 | Europe | MULTIPOLYGON (((-53.55484 2.33490, -53.77852 2... | 142257519 | Russia | RUS | 3745000.0 |
| 2 | North America | MULTIPOLYGON (((-155.22217 19.23972, -155.5421... | 35623680 | Canada | CAN | 1674000.0 |