dask_geopandas.read_parquet#
- dask_geopandas.read_parquet(*args, **kwargs)#
Read a Parquet file into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.
- Parameters:
- pathstr or list
Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like
s3://
to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.- columnsstr or list, default None
Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.
- filtersUnion[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]], default None
List of filters to apply, like
[[('col1', '==', 0), ...], ...]
. Using this argument will NOT result in row-wise filtering of the final partitions unlessengine="pyarrow-dataset"
is also specified. For other engines, filtering is only performed at the partition level, i.e., to prevent the loading of some row-groups and/or files.For the “pyarrow” engines, predicates can be expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are combined with an AND conjunction into a larger predicate. The outer-most list then combines all of the combined filters with an OR disjunction.
Predicates can also be expressed as a List[Tuple]. These are evaluated as an AND conjunction. To express OR in predictates, one must use the (preferred for “pyarrow”) List[List[Tuple]] notation.
Note that the “fastparquet” engine does not currently support DNF for the filtering of partitioned columns (List[Tuple] is required).
- indexstr, list or False, default None
Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata (if present). Use False to read all fields as columns.
- categorieslist or dict, default None
For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.
- storage_optionsdict, default None
Key/value pairs to be passed on to the file-system backend, if any.
- open_file_optionsdict, default None
Key/value arguments to be passed along to
AbstractFileSystem.open
when each parquet data file is open for reading. Experimental (optimized) “precaching” for remote file systems (e.g. S3, GCS) can be enabled by adding{"method": "parquet"}
under the"precache_options"
key. Also, a custom file-open function can be used (instead ofAbstractFileSystem.open
), by specifying the desired function under the"open_file_func"
key.- enginestr, default ‘auto’
Parquet reader library to use. Options include: ‘auto’, ‘fastparquet’, ‘pyarrow’, ‘pyarrow-dataset’, and ‘pyarrow-legacy’. Defaults to ‘auto’, which selects the FastParquetEngine if fastparquet is installed (and ArrowDatasetEngine otherwise). If ‘pyarrow’ or ‘pyarrow-dataset’ is specified, the ArrowDatasetEngine (which leverages the pyarrow.dataset API) will be used. If ‘pyarrow-legacy’ is specified, ArrowLegacyEngine will be used (which leverages the pyarrow.parquet.ParquetDataset API). NOTE: The ‘pyarrow-legacy’ option (ArrowLegacyEngine) is deprecated for pyarrow>=5.
- gather_statisticsbool, default None
Gather the statistics for each dataset partition. By default, this will only be done if the _metadata file is available. Otherwise, statistics will only be gathered if True, because the footer of every file will be parsed (which is very slow on some systems).
- ignore_metadata_filebool, default False
Whether to ignore the global
_metadata
file (when one is present). IfTrue
, or if the global_metadata
file is missing, the parquet metadata may be gathered and processed in parallel. Parallel metadata processing is currently supported forArrowDatasetEngine
only.- metadata_task_sizeint, default configurable
If parquet metadata is processed in parallel (see
ignore_metadata_file
description above), this argument can be used to specify the number of dataset files to be processed by each task in the Dask graph. If this argument is set to0
, parallel metadata processing will be disabled. The default values for local and remote filesystems can be specified with the “metadata-task-size-local” and “metadata-task-size-remote” config fields, respectively (see “dataframe.parquet”).- split_row_groupsbool or int, default None
Default is True if a _metadata file is available or if the dataset is composed of a single file (otherwise defult is False). If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer).
- chunksizeint or str, default None
The desired size of each output
DataFrame
partition in terms of total (uncompressed) parquet storage space. If specified, adjacent row-groups and/or files will be aggregated into the same output partition until the cumulativetotal_byte_size
parquet-metadata statistic reaches this value. Use aggregate_files to enable/disable inter-file aggregation.- aggregate_filesbool or str, default None
Whether distinct file paths may be aggregated into the same output partition. This parameter requires gather_statistics=True, and is only used when chunksize is specified or when split_row_groups is an integer >1. A setting of True means that any two file paths may be aggregated into the same output partition, while False means that inter-file aggregation is prohibited.
For “hive-partitioned” datasets, a “partition”-column name can also be specified. In this case, we allow the aggregation of any two files sharing a file path up to, and including, the corresponding directory name. For example, if
aggregate_files
is set to"section"
for the directory structure below,03.parquet
and04.parquet
may be aggregated together, but01.parquet
and02.parquet
cannot be. If, however,aggregate_files
is set to"region"
,01.parquet
may be aggregated with02.parquet
, and03.parquet
may be aggregated with04.parquet
:dataset-path/ ├── region=1/ │ ├── section=a/ │ │ └── 01.parquet │ ├── section=b/ │ └── └── 02.parquet └── region=2/ ├── section=a/ │ ├── 03.parquet └── └── 04.parquet
Note that the default behavior of
aggregate_files
is False.- **kwargs: dict (of dicts)
Passthrough key-word arguments for read backend. The top-level keys correspond to the appropriate operation type, and the second level corresponds to the kwargs that will be passed on to the underlying
pyarrow
orfastparquet
function. Supported top-level keys: ‘dataset’ (for opening apyarrow
dataset), ‘file’ or ‘dataset’ (for opening afastparquet.ParquetFile
), ‘read’ (for the backend read function), ‘arrow_to_pandas’ (for controlling the arguments passed to convert from apyarrow.Table.to_pandas()
). Any element of kwargs that is not defined under these top-level keys will be passed through to the engine.read_partitions classmethod as a stand-alone argument (and will be ignored by the engine implementations defined indask.dataframe
).
See also
to_parquet
pyarrow.parquet.ParquetDataset
Examples
>>> df = dd.read_parquet('s3://bucket/my-parquet-data')