{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A basic introduction to Dask-GeoPandas\n", "\n", "This notebook illustrates the basic API of Dask-GeoPandas and provides a simple timing comparison between operations on a `geopandas.GeoDataFrame` and a parallel `dask_geopandas.GeoDataFrame`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import geopandas\n", "\n", "import dask_geopandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a parallelized `dask_geopandas.GeoDataFrame`\n", "\n", "There are many ways to create a parallelized `dask_geopandas.GeoDataFrame`. If your initial data fits in memory, you can create it from a `geopandas.GeoDataFrame` using the `from_geopandas` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pop_est</th><th>continent</th><th>name</th><th>iso_a3</th><th>gdp_md_est</th><th>geometry</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>920938</td><td>Oceania</td><td>Fiji</td><td>FJI</td><td>8374.0</td><td>MULTIPOLYGON (((180.00000 -16.06713, 180.00000...</td></tr>\n",
" <tr><th>1</th><td>53950935</td><td>Africa</td><td>Tanzania</td><td>TZA</td><td>150600.0</td><td>POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...</td></tr>\n",
" <tr><th>2</th><td>603253</td><td>Africa</td><td>W. Sahara</td><td>ESH</td><td>906.5</td><td>POLYGON ((-8.66559 27.65643, -8.66512 27.58948...</td></tr>\n",
" <tr><th>3</th><td>35623680</td><td>North America</td><td>Canada</td><td>CAN</td><td>1674000.0</td><td>MULTIPOLYGON (((-122.84000 49.00000, -122.9742...</td></tr>\n",
" <tr><th>4</th><td>326625791</td><td>North America</td><td>United States of America</td><td>USA</td><td>18560000.0</td><td>MULTIPOLYGON (((-122.84000 49.00000, -120.0000...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " pop_est continent name iso_a3 gdp_md_est \\\n", "0 920938 Oceania Fiji FJI 8374.0 \n", "1 53950935 Africa Tanzania TZA 150600.0 \n", "2 603253 Africa W. Sahara ESH 906.5 \n", "3 35623680 North America Canada CAN 1674000.0 \n", "4 326625791 North America United States of America USA 18560000.0 \n", "\n", " geometry \n", "0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... \n", "1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... \n", "2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... \n", "3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... \n", "4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When creating a `dask_geopandas.GeoDataFrame`, we have to specify how to partition it, e.g. using the `npartitions` argument to split it into N roughly equal chunks." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ddf = dask_geopandas.from_geopandas(df, npartitions=4)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div><strong>Dask DataFrame Structure:</strong></div>\n",
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>pop_est</th><th>continent</th><th>name</th><th>iso_a3</th><th>gdp_md_est</th><th>geometry</th></tr>\n",
" <tr><th>npartitions=4</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>int64</td><td>object</td><td>object</td><td>object</td><td>float64</td><td>geometry</td></tr>\n",
" <tr><th>45</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>90</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>135</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>176</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
"<div>Dask Name: from_pandas, 4 tasks</div>" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computation on a non-geometry column works as in plain Dask:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Africa 51\n", "Asia 47\n", "Europe 39\n", "North America 18\n", "South America 13\n", "Oceania 7\n", "Seven seas (open ocean) 1\n", "Antarctica 1\n", "Name: continent, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.continent.value_counts().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling one of the GeoPandas-specific methods or attributes works as well:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dask Series Structure:\n", "npartitions=4\n", "0 float64\n", "45 ...\n", "90 ...\n", "135 ...\n", "176 ...\n", "dtype: float64\n", "Dask Name: getitem, 12 tasks" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.geometry.area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, without calling `compute()`, the resulting Series does not yet contain any values: Dask only builds a task graph, and the actual computation runs when `compute()` is called." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.639511\n", "1 76.301964\n", "2 8.603984\n", "3 1712.995228\n", "4 1122.281921\n", " ... 
\n", "172 8.604719\n", "173 1.479321\n", "174 1.231641\n", "175 0.639000\n", "176 51.196106\n", "Length: 177, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.geometry.area.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timing comparison: Point-in-polygon with 10 million points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The GeoDataFrame used above is too small to show any benefit from parallelization with Dask (the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let's create a larger GeoDataFrame of points:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "N = 10_000_000" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N), np.random.randn(N)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And create the Dask-GeoPandas version of this GeoDataFrame:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "dpoints = dask_geopandas.from_geopandas(points, npartitions=16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define a single polygon; we will check which points lie within it:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import shapely.geometry\n", "box = shapely.geometry.box(0, 0, 1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `within` operation results in a boolean Series:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dask Series Structure:\n", "npartitions=16\n", "0 bool\n", "625000 ...\n", " ... 
\n", "9375000 ...\n", "9999999 ...\n", "dtype: bool\n", "Dask Name: within, 32 tasks" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dpoints.within(box)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fraction of points that lie within the polygon:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1162862" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(dpoints.within(box).sum() / len(dpoints)).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the time it takes to compute the `within` operation with GeoPandas and with Dask-GeoPandas:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "460 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit points.within(box)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "169 ms ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit dpoints.within(box).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was run on a laptop with 4 physical cores and gives a speed-up of roughly 2.7x (460 ms vs. 169 ms) using multithreading." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }