Basic introduction to Dask-GeoPandas#

This notebook illustrates the basic API of Dask-GeoPandas and provides a simple timing comparison between operations on a geopandas.GeoDataFrame and their parallel counterparts on a dask_geopandas.GeoDataFrame.

import numpy as np
import geopandas

import dask_geopandas

Creating a parallelized dask_geopandas.GeoDataFrame#

There are many ways to create a parallelized dask_geopandas.GeoDataFrame. If your initial data fits in memory, you can create it from a geopandas.GeoDataFrame using the from_geopandas function:

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df.head()
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 53950935 Africa Tanzania TZA 150600.0 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253 Africa W. Sahara ESH 906.5 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

When creating a dask_geopandas.GeoDataFrame, we have to specify how to partition the data, e.g. using the npartitions argument to split it into N roughly equal chunks:

ddf = dask_geopandas.from_geopandas(df, npartitions=4)
ddf
Dask DataFrame Structure:
pop_est continent name iso_a3 gdp_md_est geometry
npartitions=4
0 int64 object object object float64 geometry
45 ... ... ... ... ... ...
90 ... ... ... ... ... ...
135 ... ... ... ... ... ...
176 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks
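
from_geopandas is not the only entry point: data that does not fit in memory can be read in parallel directly from disk instead. A minimal sketch using dask_geopandas.read_file (the file path here is hypothetical, and the pyogrio engine is assumed to be installed):

# hypothetical path; the file is read in parallel, split into 4 partitions
ddf_from_file = dask_geopandas.read_file("countries.gpkg", npartitions=4)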

Computation on a non-geometry column:

ddf.continent.value_counts().compute()
Africa                     51
Asia                       47
Europe                     39
North America              18
South America              13
Oceania                     7
Seven seas (open ocean)     1
Antarctica                  1
Name: continent, dtype: int64
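
Any other standard Dask DataFrame operation works the same lazy way. As a sketch, summing the pop_est column builds a task graph that only runs when computed:

# lazy aggregation on a regular (non-geometry) column
ddf.pop_est.sum().compute()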

And calling one of the geopandas-specific methods or attributes:

ddf.geometry.area
Dask Series Structure:
npartitions=4
0      float64
45         ...
90         ...
135        ...
176        ...
dtype: float64
Dask Name: getitem, 12 tasks

As you can see, without calling compute(), the resulting Series does not yet contain any values.

ddf.geometry.area.compute()
0         1.639511
1        76.301964
2         8.603984
3      1712.995228
4      1122.281921
          ...     
172       8.604719
173       1.479321
174       1.231641
175       0.639000
176      51.196106
Length: 177, dtype: float64
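
Geometry accessors can also be chained lazily, so a whole pipeline is executed in one parallel pass. A minimal sketch using centroid (note that geopandas warns that centroids computed in a geographic CRS are inexact):

# the chain stays lazy; only .compute() triggers execution
centroid_x = ddf.geometry.centroid.x
centroid_x.compute().head()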

Timing comparison: point-in-polygon with 10 million points#

The GeoDataFrame used above is too small to see any benefit from parallelization with Dask (the overhead of the task scheduler outweighs the actual operation on such a tiny dataframe), so let’s create a much bigger GeoDataFrame of random points:

N = 10_000_000
# 10 million random points with standard-normal x and y coordinates
points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N)))

And create the dask-geopandas version of this dataframe:

dpoints = dask_geopandas.from_geopandas(points, npartitions=16)

A single polygon against which we will check whether the points are located within it:

import shapely.geometry
# axis-aligned unit square with corners (0, 0) and (1, 1)
box = shapely.geometry.box(0, 0, 1, 1)

The within operation will result in a boolean Series:

dpoints.within(box)
Dask Series Structure:
npartitions=16
0          bool
625000      ...
           ... 
9375000     ...
9999999     ...
dtype: bool
Dask Name: within, 32 tasks
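
As in geopandas, the boolean Series can also be used as a mask to filter the points themselves, which again stays lazy (a minimal sketch):

inside = dpoints[dpoints.within(box)]  # lazy, filtered GeoDataFrame
# inside.compute() would return a plain geopandas.GeoDataFrame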

The fraction of points located within the polygon:

(dpoints.within(box).sum() / len(dpoints)).compute()
0.1162862
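
As a sanity check: the x and y coordinates are independent standard-normal draws, so the expected fraction inside the unit box is (Φ(1) − Φ(0))² ≈ (0.8413 − 0.5)² ≈ 0.1165, in line with the computed value.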

Let’s compare the time it takes to compute this:

%timeit points.within(box)
460 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dpoints.within(box).compute()
169 ms ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This was run on a laptop with 4 physical cores, giving roughly a 3x speed-up from multithreading.
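
By default, Dask runs DataFrame graphs on the threaded scheduler, which works well here because the vectorized GEOS operations (via pygeos / shapely 2.0) release the GIL. If you want to experiment, the scheduler can be chosen per call; a minimal sketch:

dpoints.within(box).compute(scheduler="threads")  # the default for DataFrames
# scheduler="processes" would use multiprocessing instead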