The basic introduction to Dask-GeoPandas

This notebook illustrates the basic API of Dask-GeoPandas and provides a timing comparison between operations on a geopandas.GeoDataFrame and a parallel dask_geopandas.GeoDataFrame.
import numpy as np
import geopandas
import dask_geopandas
Creating a parallelized dask_geopandas.GeoDataFrame

There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create it from a geopandas.GeoDataFrame using the from_geopandas function:
df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
| | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| 0 | 920938 | Oceania | Fiji | FJI | 8374.0 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | 53950935 | Africa | Tanzania | TZA | 150600.0 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
| 2 | 603253 | Africa | W. Sahara | ESH | 906.5 | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
| 3 | 35623680 | North America | Canada | CAN | 1674000.0 | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
| 4 | 326625791 | North America | United States of America | USA | 18560000.0 | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
When creating a dask_geopandas.GeoDataFrame we have to specify how to partition it, e.g. using the npartitions argument to split it into N roughly equal chunks.
ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
| | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | |
| 0 | int64 | object | object | object | float64 | geometry |
| 45 | ... | ... | ... | ... | ... | ... |
| 90 | ... | ... | ... | ... | ... | ... |
| 135 | ... | ... | ... | ... | ... | ... |
| 176 | ... | ... | ... | ... | ... | ... |
Computation on a non-geometry column:
ddf.continent.value_counts().compute()
Africa 51
Asia 47
Europe 39
North America 18
South America 13
Oceania 7
Seven seas (open ocean) 1
Antarctica 1
Name: continent, dtype: int64
And calling one of the geopandas-specific methods or attributes:
ddf.geometry.area
Dask Series Structure:
npartitions=4
0 float64
45 ...
90 ...
135 ...
176 ...
dtype: float64
Dask Name: getitem, 12 tasks
As you can see, without calling compute()
, the resulting Series does not yet contain any values.
ddf.geometry.area.compute()
0 1.639511
1 76.301964
2 8.603984
3 1712.995228
4 1122.281921
...
172 8.604719
173 1.479321
174 1.231641
175 0.639000
176 51.196106
Length: 177, dtype: float64
Timing comparison: Point-in-polygon with 10 million points
The GeoDataFrame used above is a bit small to see any benefit from parallelization using dask (as the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let’s create a bigger point GeoSeries:
N = 10_000_000
points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N)))
And we create the dask-geopandas version of this GeoDataFrame:
dpoints = dask_geopandas.from_geopandas(points, npartitions=16)
We define a single polygon and will check which points are located within it:
import shapely.geometry
box = shapely.geometry.box(0, 0, 1, 1)
The within operation will result in a boolean Series:
dpoints.within(box)
Dask Series Structure:
npartitions=16
0 bool
625000 ...
...
9375000 ...
9999999 ...
dtype: bool
Dask Name: within, 32 tasks
The fraction of points that fall within the polygon:
(dpoints.within(box).sum() / len(dpoints)).compute()
0.1162862
Let’s compare the time it takes to compute this:
%timeit points.within(box)
460 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
169 ms ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This was run on a laptop with 4 physical cores, giving roughly a 3x speed-up from multithreading.