Getting started#
The relationship between Dask-GeoPandas and GeoPandas is the same as the relationship between dask.dataframe and pandas. We recommend checking the Dask documentation to better understand how DataFrames are scaled before diving into Dask-GeoPandas.
Dask-GeoPandas basics#
Given a GeoPandas GeoDataFrame:
import geopandas
df = geopandas.read_file('...')
We can repartition it into a Dask-GeoPandas GeoDataFrame:
import dask_geopandas
ddf = dask_geopandas.from_geopandas(df, npartitions=4)
By default, this repartitions the data naively by rows. However, you can also provide spatial partitioning to take advantage of the spatial structure of the GeoDataFrame.
ddf = ddf.spatial_shuffle()
The familiar spatial attributes and methods of GeoPandas are also available and will be computed in parallel:
ddf.geometry.area.compute()
ddf.within(polygon)
Additionally, if you have a distributed dask.dataframe, you can pass columns of x-y points to the set_geometry method.
import dask.dataframe as dd
import dask_geopandas
ddf = dd.read_csv('...')
ddf = ddf.set_geometry(
dask_geopandas.points_from_xy(ddf, 'longitude', 'latitude')
)
Writing files (and reading them back) is currently supported for the Parquet and Feather file formats.
ddf.to_parquet("path/to/dir/")
ddf = dask_geopandas.read_parquet("path/to/dir/")
Traditional GIS file formats can be read into a partitioned GeoDataFrame (this requires pyogrio), but writing to them is not supported.
ddf = dask_geopandas.read_file("file.gpkg", npartitions=4)