Parallel GeoPandas with Dask
Dask-GeoPandas combines the geospatial capabilities of GeoPandas with the scalability of Dask. GeoPandas is an open source project designed to make working with geospatial data in Python easier; it extends the datatypes used by pandas to allow spatial operations on geometric types. Dask provides parallel and distributed out-of-core computation, and its dask.dataframe module is designed to scale pandas. Because the GeoPandas GeoDataFrame is an extension of the pandas DataFrame, the approach Dask uses to scale pandas can also be applied to GeoPandas.
This project is a bridge between Dask and GeoPandas: it offers the geospatial capabilities of GeoPandas, backed by Dask.
Dask-GeoPandas depends on Dask and GeoPandas. It also requires either Shapely >= 2.0 or the PyGEOS package. We recommend installing it from the conda-forge channel, but you can also install it from PyPI:
conda install dask-geopandas -c conda-forge
pip install dask-geopandas
For more details, see the installation instructions.
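As with dask.dataframe and pandas, the API of dask_geopandas mirrors the one of geopandas.
import geopandas
import dask_geopandas
df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))  # geopandas.datasets was removed in GeoPandas 1.0
dask_df = dask_geopandas.from_geopandas(df, npartitions=4)  # split the GeoDataFrame into 4 partitions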
When should I use Dask-GeoPandas?
Dask-GeoPandas is useful when working with large GeoDataFrames that either do not fit comfortably in memory or require expensive computation that is easy to parallelise. It is not always faster than GeoPandas, because scheduling tasks and transferring data between threads and processes carries unavoidable overhead. When the workload parallelises well, however, performance gains can be almost linear in the number of threads.