Basic introduction to Dask-GeoPandas#

This notebook illustrates the basic API of Dask-GeoPandas and provides a simple timing comparison between operations on a geopandas.GeoDataFrame and their parallel counterparts on a dask_geopandas.GeoDataFrame.

import numpy as np
import geopandas

import dask_geopandas

Creating a parallelized dask_geopandas.GeoDataFrame#

There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create it from a geopandas.GeoDataFrame using the from_geopandas function:

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

When creating a dask_geopandas.GeoDataFrame, we have to specify how to partition the data, e.g. using the npartitions argument to split it into N roughly equal chunks:

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
Dask DataFrame Structure:
pop_est continent name iso_a3 gdp_md_est geometry
npartitions=4
0 int64 object object object float64 geometry
45 ... ... ... ... ... ...
90 ... ... ... ... ... ...
135 ... ... ... ... ... ...
176 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks
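
from_geopandas is not the only entry point: data that does not fit in memory can be read in parallel directly from disk instead. A minimal sketch using dask_geopandas.read_file (the file path here is hypothetical, and the pyogrio engine is assumed to be installed):

# hypothetical path; the file is read in parallel, split into 4 partitions
ddf_from_file = dask_geopandas.read_file("countries.gpkg", npartitions=4)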

Computation on a non-geometry column:

ddf.continent.value_counts().compute()
Africa                     51
Asia                       47
Europe                     39
North America              18
South America              13
Oceania                     7
Seven seas (open ocean)     1
Antarctica                  1
Name: continent, dtype: int64
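
Any other standard Dask DataFrame operation works the same lazy way. As a sketch, summing the pop_est column builds a task graph that only runs when computed:

# lazy aggregation on a regular (non-geometry) column
ddf.pop_est.sum().compute()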

And calling one of the geopandas-specific methods or attributes:

ddf.geometry.area
Dask Series Structure:
npartitions=4
0      float64
45         ...
90         ...
135        ...
176        ...
dtype: float64
Dask Name: getitem, 12 tasks

As you can see, without calling compute(), the resulting Series does not yet contain any values.

ddf.geometry.area.compute()
0         1.639511
1        76.301964
2         8.603984
3      1712.995228
4      1122.281921
          ...     
172       8.604719
173       1.479321
174       1.231641
175       0.639000
176      51.196106
Length: 177, dtype: float64
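
Geometry accessors can also be chained lazily, so a whole pipeline is executed in one parallel pass. A minimal sketch using centroid (note that geopandas warns that centroids computed in a geographic CRS are inexact):

# the chain stays lazy; only .compute() triggers execution
centroid_x = ddf.geometry.centroid.x
centroid_x.compute().head()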

Timing comparison: point-in-polygon with 10 million points#

The GeoDataFrame used above is too small to see any benefit from parallelization with Dask (the overhead of the task scheduler outweighs the actual operation on such a tiny dataframe), so let’s create a much bigger GeoDataFrame of random points:

N = 10_000_000
# 10 million random points with standard-normal x and y coordinates
points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N)))

And create the dask-geopandas version of this dataframe:

dpoints = dask_geopandas.from_geopandas(points, npartitions=16)

A single polygon against which we will check whether the points are located within it:

import shapely.geometry
# axis-aligned unit square with corners (0, 0) and (1, 1)
box = shapely.geometry.box(0, 0, 1, 1)

The within operation will result in a boolean Series:

dpoints.within(box)
Dask Series Structure:
npartitions=16
0          bool
625000      ...
           ... 
9375000     ...
9999999     ...
dtype: bool
Dask Name: within, 32 tasks
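
As in geopandas, the boolean Series can also be used as a mask to filter the points themselves, which again stays lazy (a minimal sketch):

inside = dpoints[dpoints.within(box)]  # lazy, filtered GeoDataFrame
# inside.compute() would return a plain geopandas.GeoDataFrame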

The fraction of points located within the polygon:

(dpoints.within(box).sum() / len(dpoints)).compute()
0.1162862
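
As a sanity check: the x and y coordinates are independent standard-normal draws, so the expected fraction inside the unit box is (Φ(1) − Φ(0))² ≈ (0.8413 − 0.5)² ≈ 0.1165, in line with the computed value.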

Let’s compare the time it takes to compute this:

%timeit points.within(box)
460 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
169 ms ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This was run on a laptop with 4 physical cores, giving roughly a 3x speed-up from multithreading.
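
By default, Dask runs DataFrame graphs on the threaded scheduler, which works well here because the vectorized GEOS operations (via pygeos / shapely 2.0) release the GIL. If you want to experiment, the scheduler can be chosen per call; a minimal sketch:

dpoints.within(box).compute(scheduler="threads")  # the default for DataFrames
# scheduler="processes" would use multiprocessing instead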