{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dissolve with dask-geopandas\n", "\n", "Dask-geopandas offers a distributed version of the `dissolve` method based on `dask.dataframe.groupby` method. It shares all benefits and caveats of the vanilla `groupby` with a few minor differences.\n", "\n", "The geometry column is aggregated using the `unary_union` operation, which tends to be relatively costly. Therefore it is beneficial to try to use all workers in every step of the operation. GroupBy by default returns a result as a single partition, which in reality means that the final aggregation step is done within a single thread. For geometries, that means that the operation needs to _loop_ through groups of geometries coming from other partitions and call `unary_union` on every group one by one.\n", "\n", "The number of output partitions can be specified using `split_out` keyword passed to aggregation. Therefore, if we set `split_out=16`, it will return 16 partitions (if there are 16 or more unique groups) each of which is processed by a different worker (if there are 16 or more workers). The final aggregation (`unary_union`) is then parallelised.\n", "\n", "```{note} \n", "Using `split_out` > 1 may fail with the newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.\n", "```\n", "\n", "Dask-geopandas `dissolve` uses the same default number of output partitions (1) as dask.dataframe but it is recommended to change it to match at least the number of workers to get all the benefits of parallelised computation." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import geopandas\n", "import dask_geopandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask-GeoPandas GeoDataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_estcontinentnameiso_a3gdp_md_estgeometry
npartitions=4
0int64objectobjectobjectfloat64geometry
45..................
90..................
135..................
176..................
\n", "
\n", "
Dask Name: from_pandas, 4 tasks
" ], "text/plain": [ "Dask GeoDataFrame Structure:\n", " pop_est continent name iso_a3 gdp_md_est geometry\n", "npartitions=4 \n", "0 int64 object object object float64 geometry\n", "45 ... ... ... ... ... ...\n", "90 ... ... ... ... ... ...\n", "135 ... ... ... ... ... ...\n", "176 ... ... ... ... ... ...\n", "Dask Name: from_pandas, 4 tasks" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))\n", "\n", "ddf = dask_geopandas.from_geopandas(df, npartitions=4)\n", "ddf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using default settings, you get a single partition:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask-GeoPandas GeoDataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_estnameiso_a3gdp_md_estgeometry
npartitions=1
int64objectobjectfloat64geometry
...............
\n", "
\n", "
Dask Name: set_crs, 10 tasks
" ], "text/plain": [ "Dask GeoDataFrame Structure:\n", " pop_est name iso_a3 gdp_md_est geometry\n", "npartitions=1 \n", " int64 object object float64 geometry\n", " ... ... ... ... ...\n", "Dask Name: set_crs, 10 tasks" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dissolved = ddf.dissolve(\"continent\")\n", "dissolved" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify the number of partitions and get parallelised implementation in every step.\n", "\n", "```{note} \n", "Using `split_out` > 1 may fail with the newer versions of Dask due to an underlying incompatibility between Dask and GeoPandas.\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask-GeoPandas GeoDataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_estnameiso_a3gdp_md_estgeometry
npartitions=4
int64objectobjectfloat64geometry
...............
...............
...............
...............
\n", "
\n", "
Dask Name: set_crs, 36 tasks
" ], "text/plain": [ "Dask GeoDataFrame Structure:\n", " pop_est name iso_a3 gdp_md_est geometry\n", "npartitions=4 \n", " int64 object object float64 geometry\n", " ... ... ... ... ...\n", " ... ... ... ... ...\n", " ... ... ... ... ...\n", " ... ... ... ... ...\n", "Dask Name: set_crs, 36 tasks" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dissolved_4parts = ddf.dissolve(\"continent\", split_out=4)\n", "dissolved_4parts" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pop_estnameiso_a3gdp_md_estgeometry
continent
Europe142257519RussiaRUS3745000.0MULTIPOLYGON (((-54.26971 2.73239, -54.18173 3...
North America35623680CanadaCAN1674000.0MULTIPOLYGON (((-156.07347 19.70294, -156.0236...
Antarctica4050AntarcticaATA810.0MULTIPOLYGON (((-59.86585 -80.54966, -60.15966...
Oceania920938FijiFJI8374.0MULTIPOLYGON (((146.66333 -43.58085, 146.04838...
Asia18556698KazakhstanKAZ460700.0MULTIPOLYGON (((102.58426 -4.22026, 102.15617 ...
Seven seas (open ocean)140Fr. S. Antarctic LandsATF16.0POLYGON ((68.93500 -48.62500, 69.58000 -48.940...
South America44293293ArgentinaARG879400.0MULTIPOLYGON (((-69.95809 -55.19843, -71.00568...
Africa53950935TanzaniaTZA150600.0MULTIPOLYGON (((32.58026 -27.47016, 32.46213 -...
\n", "
" ], "text/plain": [ " pop_est name iso_a3 gdp_md_est \\\n", "continent \n", "Europe 142257519 Russia RUS 3745000.0 \n", "North America 35623680 Canada CAN 1674000.0 \n", "Antarctica 4050 Antarctica ATA 810.0 \n", "Oceania 920938 Fiji FJI 8374.0 \n", "Asia 18556698 Kazakhstan KAZ 460700.0 \n", "Seven seas (open ocean) 140 Fr. S. Antarctic Lands ATF 16.0 \n", "South America 44293293 Argentina ARG 879400.0 \n", "Africa 53950935 Tanzania TZA 150600.0 \n", "\n", " geometry \n", "continent \n", "Europe MULTIPOLYGON (((-54.26971 2.73239, -54.18173 3... \n", "North America MULTIPOLYGON (((-156.07347 19.70294, -156.0236... \n", "Antarctica MULTIPOLYGON (((-59.86585 -80.54966, -60.15966... \n", "Oceania MULTIPOLYGON (((146.66333 -43.58085, 146.04838... \n", "Asia MULTIPOLYGON (((102.58426 -4.22026, 102.15617 ... \n", "Seven seas (open ocean) POLYGON ((68.93500 -48.62500, 69.58000 -48.940... \n", "South America MULTIPOLYGON (((-69.95809 -55.19843, -71.00568... \n", "Africa MULTIPOLYGON (((32.58026 -27.47016, 32.46213 -... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dissolved_4parts.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alternative solution\n", "\n", "In some specific cases, `groupby` may not be the most performant option. If your GeoDataFrame fits fully in memory on a single computer, it may be faster to sort data based on the `by` column first and then map geopandas dissolve across the partitions. In that case, dask-geopandas calls `unary_union` only once per each geometry (`unary_union` is done twice in case of GroupBy). However, shuffling is an expensive operation. For larger data, this alternative solution is not feasible." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def dissolve_shuffle(ddf, by=None, **kwargs):\n", " \"\"\"Shuffle and map partition\"\"\"\n", "\n", " meta = ddf._meta.dissolve(by=by, as_index=False, **kwargs)\n", "\n", " shuffled = ddf.shuffle(\n", " by, npartitions=ddf.npartitions, shuffle=\"tasks\", ignore_index=True\n", " )\n", "\n", " return shuffled.map_partitions(\n", " geopandas.GeoDataFrame.dissolve, by=by, as_index=False, meta=meta, **kwargs\n", " )" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask-GeoPandas GeoDataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continentgeometrypop_estnameiso_a3gdp_md_est
npartitions=4
objectgeometryint64objectobjectfloat64
..................
..................
..................
..................
\n", "
\n", "
Dask Name: dissolve, 32 tasks
" ], "text/plain": [ "Dask GeoDataFrame Structure:\n", " continent geometry pop_est name iso_a3 gdp_md_est\n", "npartitions=4 \n", " object geometry int64 object object float64\n", " ... ... ... ... ... ...\n", " ... ... ... ... ... ...\n", " ... ... ... ... ... ...\n", " ... ... ... ... ... ...\n", "Dask Name: dissolve, 32 tasks" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shuffled = dissolve_shuffle(ddf, \"continent\")\n", "shuffled" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continentgeometrypop_estnameiso_a3gdp_md_est
0AfricaMULTIPOLYGON (((-11.43878 6.78592, -11.70819 6...53950935TanzaniaTZA150600.0
0AsiaMULTIPOLYGON (((48.67923 14.00320, 48.23895 13...18556698KazakhstanKAZ460700.0
1Seven seas (open ocean)POLYGON ((68.93500 -48.62500, 69.58000 -48.940...140Fr. S. Antarctic LandsATF16.0
2South AmericaMULTIPOLYGON (((-68.63999 -55.58002, -69.23210...44293293ArgentinaARG879400.0
0OceaniaMULTIPOLYGON (((147.91405 -43.21152, 147.56456...920938FijiFJI8374.0
0AntarcticaMULTIPOLYGON (((-61.13898 -79.98137, -60.61012...4050AntarcticaATA810.0
1EuropeMULTIPOLYGON (((-53.55484 2.33490, -53.77852 2...142257519RussiaRUS3745000.0
2North AmericaMULTIPOLYGON (((-155.22217 19.23972, -155.5421...35623680CanadaCAN1674000.0
\n", "
" ], "text/plain": [ " continent geometry \\\n", "0 Africa MULTIPOLYGON (((-11.43878 6.78592, -11.70819 6... \n", "0 Asia MULTIPOLYGON (((48.67923 14.00320, 48.23895 13... \n", "1 Seven seas (open ocean) POLYGON ((68.93500 -48.62500, 69.58000 -48.940... \n", "2 South America MULTIPOLYGON (((-68.63999 -55.58002, -69.23210... \n", "0 Oceania MULTIPOLYGON (((147.91405 -43.21152, 147.56456... \n", "0 Antarctica MULTIPOLYGON (((-61.13898 -79.98137, -60.61012... \n", "1 Europe MULTIPOLYGON (((-53.55484 2.33490, -53.77852 2... \n", "2 North America MULTIPOLYGON (((-155.22217 19.23972, -155.5421... \n", "\n", " pop_est name iso_a3 gdp_md_est \n", "0 53950935 Tanzania TZA 150600.0 \n", "0 18556698 Kazakhstan KAZ 460700.0 \n", "1 140 Fr. S. Antarctic Lands ATF 16.0 \n", "2 44293293 Argentina ARG 879400.0 \n", "0 920938 Fiji FJI 8374.0 \n", "0 4050 Antarctica ATA 810.0 \n", "1 142257519 Russia RUS 3745000.0 \n", "2 35623680 Canada CAN 1674000.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shuffled.compute()" ] } ], "metadata": { "interpreter": { "hash": "9914e2881520d4f08a067c2c2c181121476026b863eca2e121cd0758701ab602" }, "kernelspec": { "display_name": "Python 3.8.10 64-bit ('geo_dev': conda)", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }