{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The basic introduction to Dask-GeoPandas\n",
    "\n",
    "This notebook illustrates the basic API of Dask-GeoPandas and provides a basic timing comparison between operations on `geopandas.GeoDataFrame` and parallel `dask_geopandas.GeoDataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import geopandas\n",
    "\n",
    "import dask_geopandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a parallelized `dask_geopandas.GeoDataFrame`\n",
    "\n",
    "There are many ways how to create a parallelized `dask_geopandas.GeoDataFrame`. If your initial data fits in memory, you can create if from a `geopandas.GeoDataFrame` using the `from_geopandas` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop_est</th>\n",
       "      <th>continent</th>\n",
       "      <th>name</th>\n",
       "      <th>iso_a3</th>\n",
       "      <th>gdp_md_est</th>\n",
       "      <th>geometry</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>920938</td>\n",
       "      <td>Oceania</td>\n",
       "      <td>Fiji</td>\n",
       "      <td>FJI</td>\n",
       "      <td>8374.0</td>\n",
       "      <td>MULTIPOLYGON (((180.00000 -16.06713, 180.00000...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>53950935</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Tanzania</td>\n",
       "      <td>TZA</td>\n",
       "      <td>150600.0</td>\n",
       "      <td>POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>603253</td>\n",
       "      <td>Africa</td>\n",
       "      <td>W. Sahara</td>\n",
       "      <td>ESH</td>\n",
       "      <td>906.5</td>\n",
       "      <td>POLYGON ((-8.66559 27.65643, -8.66512 27.58948...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>35623680</td>\n",
       "      <td>North America</td>\n",
       "      <td>Canada</td>\n",
       "      <td>CAN</td>\n",
       "      <td>1674000.0</td>\n",
       "      <td>MULTIPOLYGON (((-122.84000 49.00000, -122.9742...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>326625791</td>\n",
       "      <td>North America</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>USA</td>\n",
       "      <td>18560000.0</td>\n",
       "      <td>MULTIPOLYGON (((-122.84000 49.00000, -120.0000...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     pop_est      continent                      name iso_a3  gdp_md_est  \\\n",
       "0     920938        Oceania                      Fiji    FJI      8374.0   \n",
       "1   53950935         Africa                  Tanzania    TZA    150600.0   \n",
       "2     603253         Africa                 W. Sahara    ESH       906.5   \n",
       "3   35623680  North America                    Canada    CAN   1674000.0   \n",
       "4  326625791  North America  United States of America    USA  18560000.0   \n",
       "\n",
       "                                            geometry  \n",
       "0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...  \n",
       "1  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...  \n",
       "2  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...  \n",
       "3  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...  \n",
       "4  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When creating `dask_geopandas.GeoDataFrame` we have to specify how to partittion, e.g. using `npartitons` argument to split it into N equal chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "ddf = dask_geopandas.from_geopandas(df, npartitions=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><strong>Dask DataFrame Structure:</strong></div>\n",
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop_est</th>\n",
       "      <th>continent</th>\n",
       "      <th>name</th>\n",
       "      <th>iso_a3</th>\n",
       "      <th>gdp_md_est</th>\n",
       "      <th>geometry</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>npartitions=4</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>int64</td>\n",
       "      <td>object</td>\n",
       "      <td>object</td>\n",
       "      <td>object</td>\n",
       "      <td>float64</td>\n",
       "      <td>geometry</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>135</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>176</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "<div>Dask Name: from_pandas, 4 tasks</div>"
      ],
      "text/plain": [
       "<dask_geopandas.GeoDataFrame | 4 tasks | 4 npartitions>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computation on a non-geometry column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Africa                     51\n",
       "Asia                       47\n",
       "Europe                     39\n",
       "North America              18\n",
       "South America              13\n",
       "Oceania                     7\n",
       "Seven seas (open ocean)     1\n",
       "Antarctica                  1\n",
       "Name: continent, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.continent.value_counts().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And calling one of the geopandas-specific methods or attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dask Series Structure:\n",
       "npartitions=4\n",
       "0      float64\n",
       "45         ...\n",
       "90         ...\n",
       "135        ...\n",
       "176        ...\n",
       "dtype: float64\n",
       "Dask Name: getitem, 12 tasks"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.geometry.area"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, without calling `compute()`, the resulting Series does not yet contain any values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         1.639511\n",
       "1        76.301964\n",
       "2         8.603984\n",
       "3      1712.995228\n",
       "4      1122.281921\n",
       "          ...     \n",
       "172       8.604719\n",
       "173       1.479321\n",
       "174       1.231641\n",
       "175       0.639000\n",
       "176      51.196106\n",
       "Length: 177, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.geometry.area.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Timing comparison: Point-in-polygon with million points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The GeoDataFrame used above is a bit small to see any benefit from parallelization using dask (as the overhead of the task scheduler is larger than the actual operation on such a tiny dataframe), so let's create a bigger point GeoSeries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 10_000_000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "points = geopandas.GeoDataFrame(geometry=geopandas.points_from_xy(np.random.randn(N),np.random.randn(N)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And creating the dask-geopandas version of this series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "dpoints =  dask_geopandas.from_geopandas(points, npartitions=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A single polygon for which we will check if the points are located within this polygon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shapely.geometry\n",
    "box = shapely.geometry.box(0, 0, 1, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `within` operation will result in a boolean Series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dask Series Structure:\n",
       "npartitions=16\n",
       "0          bool\n",
       "625000      ...\n",
       "           ... \n",
       "9375000     ...\n",
       "9999999     ...\n",
       "dtype: bool\n",
       "Dask Name: within, 32 tasks"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dpoints.within(box)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The relative number of the points within the polygon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1162862"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(dpoints.within(box).sum() / len(dpoints)).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare the time it takes to compute this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "460 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit points.within(box)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "169 ms ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit dpoints.within(box).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is run on a laptop with 4 physical cores, and giving roughly a 3x speed-up using multithreading."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}