Dataset¶

The Plums dataflow module shares the main ideas of the PyTorch data pipeline API and is compatible with both PyTorch and TensorFlow’s Keras.

Base datasets¶

The main class to stream through data in Plums is the Dataset base class, which guarantees a sequence-like interface to manipulate data in an ordered fashion.

class plums.dataflow.dataset.Dataset[source]¶

Bases: object

Abstract base class for all Dataset to inherit from.

Subclasses must override the __getitem__() method to allow Dataset to act as a ‘non-sized’ Sequence.

Although subclasses should override the __len__() method, it is deliberately not implemented by default to ensure valid behaviour with non-sized samplers. To explicitly signal that __len__() is or should be present in subclasses, use the SizedDataset base class instead.

Hint

Insightful users might notice that the base dataset API closely mimics PyTorch’s own Dataset and Keras’s Sequence. This is deliberate, so that the provided Dataset can be used as a stand-in (although some care is needed to support the on_epoch_end() callback on the Keras side). However, to keep class inheritance clean and simple, Dataset will fail isinstance() checks, as it only shares an interface (duck typing) with its framework counterparts.
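As a rough illustration of the sequence-like interface described above, the sketch below defines two stand-alone classes that mimic it without importing Plums. The names RangeDataset and SizedRangeDataset are hypothetical; a real dataset would presumably inherit from Dataset or SizedDataset instead.

```python
class RangeDataset:
    """Hypothetical 'non-sized' dataset: only __getitem__() is defined."""

    def __init__(self, stop):
        self._stop = stop

    def __getitem__(self, index):
        # Raising IndexError past the end lets plain iteration terminate.
        if not 0 <= index < self._stop:
            raise IndexError(index)
        return index * index


class SizedRangeDataset(RangeDataset):
    """Hypothetical sized variant: adds __len__() like a regular sequence."""

    def __len__(self):
        return self._stop


non_sized = RangeDataset(4)
sized = SizedRangeDataset(4)

# Python's legacy iteration protocol only needs __getitem__().
squares = list(non_sized)
size = len(sized)
```

Calling len() on the non-sized variant raises a TypeError, which is the behaviour a non-sized sampler would have to cope with.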

class plums.dataflow.dataset.SizedDataset[source]¶

Bases: plums.dataflow.dataset.base.Dataset

Abstract base class for all Dataset with a known length to inherit from.

Subclasses must override the __getitem__() and __len__() methods to allow Dataset to act as a regular Sequence.

cat(datasets, *additional_datasets)[source]¶

Concatenate the dataset with any other dataset.

Parameters

datasets (Sequence[Dataset], Dataset) – Either a sequence of datasets or a single dataset to concatenate with self.
*additional_datasets (Dataset) – Datasets to concatenate with self.

Returns

The concatenation of self and other datasets.

Return type

ConcatDataset

Two utility Dataset classes are also provided to ease the creation of dataset partitions and compositions:

class plums.dataflow.dataset.Subset(dataset, indices)[source]¶

Bases: plums.dataflow.dataset.base.SizedDataset

Create a subset of a Dataset from a Dataset to wrap and a selector container.

Providing a list of indices is the preferred way to select a subset but any sequence-like or mapping-like object may work.

Parameters

dataset (Dataset) – A Dataset to wrap as a subset.
indices (Sequence, Mapping) – A selector container, mapping the subset items to the Dataset items.
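The index mapping performed by a subset can be pictured with the stand-alone sketch below. IndexSubset is a hypothetical stand-in written for illustration only; it is not the Plums implementation and only assumes that the wrapped object and the selector both support item access.

```python
class IndexSubset:
    """Hypothetical stand-in showing how a subset can map indices lazily."""

    def __init__(self, dataset, indices):
        self._dataset = dataset
        self._indices = indices

    def __getitem__(self, index):
        # Translate the subset index into an index of the wrapped dataset.
        return self._dataset[self._indices[index]]

    def __len__(self):
        return len(self._indices)


data = ["a", "b", "c", "d", "e"]

# A list selector keeps integer indexing on the subset.
evens = IndexSubset(data, [0, 2, 4])

# A mapping selector lets the subset be addressed by arbitrary keys.
named = IndexSubset(data, {"first": 0, "last": 4})
```

With a list selector the subset stays a regular sequence, which is likely why a list of indices is described as the preferred form.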

class plums.dataflow.dataset.ConcatDataset(datasets, *additional_datasets)[source]¶

Bases: plums.dataflow.dataset.base.SizedDataset

Create a Dataset as the concatenation of multiple Dataset.

The concatenation involves no copy of any sort as the reindexing happens on-the-fly in the __getitem__() method.

Warning

An explicit Dataset type check is performed on datasets to work out which signature the provided arguments correspond to. If using the class with non-Plums datasets (as may happen implicitly with dataset addition), ensure either that the arguments are in the correct order to pass the type check or that the non-expanded form is used to avoid ambiguities.

Parameters

datasets (Sequence[Dataset], Dataset) – Either a sequence of datasets or a single dataset to concatenate with self.
*additional_datasets (Dataset) – Datasets to concatenate with self.

Raises

ValueError – If no dataset or an empty dataset is provided in the constructor’s arguments.

cumulative_size¶

A tuple of the cumulative length of all enclosed Dataset.

Type: tuple
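One common way to reindex on the fly from cumulative lengths is a binary search, sketched below. ChainedDataset is a hypothetical stand-in, not the Plums class: it assumes plain sized sequences as inputs and rejects any empty input, which is only one possible reading of the ValueError condition above.

```python
from bisect import bisect_right
from itertools import accumulate


class ChainedDataset:
    """Hypothetical stand-in for a copy-free concatenation of sequences."""

    def __init__(self, *datasets):
        if not datasets or any(len(d) == 0 for d in datasets):
            raise ValueError("Expected at least one non-empty dataset.")
        self._datasets = datasets
        # Running totals of the lengths, e.g. (3, 5) for lengths 3 and 2.
        self.cumulative_size = tuple(accumulate(len(d) for d in datasets))

    def __len__(self):
        return self.cumulative_size[-1]

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the dataset holding `index`, then shift to its local index.
        position = bisect_right(self.cumulative_size, index)
        offset = self.cumulative_size[position - 1] if position else 0
        return self._datasets[position][index - offset]


chained = ChainedDataset([10, 11, 12], [20, 21])
```

No element is copied: each lookup costs one binary search over the cumulative lengths plus one item access on the underlying dataset.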

Pattern dataset¶

class plums.dataflow.dataset.PatternDataset(tile_pattern, annotation_pattern, tile_driver, annotation_driver, path=None, sort_key=None, strict=True, cache=False)[source]¶

Bases: plums.dataflow.dataset.base.SizedDataset

A SizedDataset whose tile/annotation pairs are globbed from a pair of matching dataset path patterns.

A path pattern is provided using a micro-language as described below:

A dataset pattern is a path-like string where path elements may be either “components” or “groups”.
A component designates an entity whose value is fixed, e.g. in /some/pattern, some and pattern are components. Components define exact matches where the name of the element to match is known in advance.
A group designates a named entity whose value is unknown. Groups are delimited by curly braces { and }, e.g. in /some/pattern/{with}/some/{groups}, {with} and {groups} are groups, and may additionally define constraints to limit or expand the group match capability. The text immediately following the opening brace is the group’s name and must be unique within the pattern. A forward slash / following the group name indicates a recursive group, i.e. a group which might span multiple folders. To constrain the group match, add a colon : after the group name (or recursive slash) followed by a regex against which all candidate entities will be matched (note that for recursive groups, the regex applies to each path entity, not to the whole group).
The last group or component of the pattern must be a file, indicated by an extension added at the end. Multiple extension alternatives may be provided, using brackets [ and ] delimiters and separating each alternative with a pipe | alternator.

In a more formal manner, the path pattern language EBNF grammar might look something like:

pattern = [ absolute ], { folder }, file ;
absolute = SEPARATOR ;
folder = entry, SEPARATOR ;
file = entry, ".", extension ;
entry = FSNAME | "{" IDENTIFIER, [ SEPARATOR ], [ ":", REGEX ], "}" ;
extension = EXTENSION | "[", EXTENSION, { "|", EXTENSION }, "]" ;
IDENTIFIER = ( "_" | LETTER ), { "_" | LETTER | NUMBER } ;
FSNAME = { LETTER | NUMBER | "_" | "-" | " " } ;

Hint

The annotation path pattern may be degenerate (i.e. point to a single, non-variable file), in which case the path matching every tile path will be the degenerate annotation path. In this case, a degenerate flag set to True is passed to the enclosed annotation driver when it is called, to allow for caching mechanisms and reduce the file-loading overhead.

The PatternDataset also expects a pair of callables, called the drivers, which will be fed a tuple of paths and the path pattern named-group matches as name: value pairs. They should return objects compatible with the Plums data-model, i.e. a Tile-like object for tiles and an Annotation-like object for annotations.
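The calling convention for a driver can be sketched as follows. The function example_tile_driver is hypothetical, and the dict it returns is only a placeholder: a real driver would open the files and return a Tile-like or Annotation-like object from the Plums data-model.

```python
import os


def example_tile_driver(path_tuple, **matched_groups):
    """Hypothetical driver: receives the matched paths and named groups."""
    # A real driver would open the files and build a Tile-like object;
    # a plain dict stands in for it here.
    return {
        "files": tuple(os.path.basename(str(path)) for path in path_tuple),
        "groups": dict(matched_groups),
    }


# The dataset is expected to call the driver once per data point,
# forwarding the group matches as keyword arguments.
result = example_tile_driver(
    ("/data/zone_a/tile_1.jpg",),
    zone="zone_a",
    tile="tile_1",
)
```

Accepting **matched_groups keeps the driver independent of the group names used in any particular pattern.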

Domain datasets¶

The following datasets are domain-specific datasets based on the PatternDataset.

Playground¶

class plums.dataflow.dataset.PlaygroundDataset(path, select_datasets=(), select_zones=(), select_images=(), select_tiles=(), exclude_datasets=(), exclude_zones=(), exclude_images=(), exclude_tiles=(), tile_driver=None, annotation_driver=None, use_taxonomy=True, strict=True, cache=False)[source]¶

Bases: plums.dataflow.dataset.pattern.PatternDataset

A Dataset as exported by the Intelligence Playground which loads data in the Plums data model.

A PlaygroundDataset has the following file structure:

├── <dataset_id_1>
│   ├── samples
│   │   ├── <zone_id_1>
│   │   │   ├── <image_id_1>
│   │   │   │   ├── <tile_id>.jpg
│   │   │   │   └── ...
│   │   │   ├── <image_id_2>
│   │   │   │   ├── <tile_id>.jpg
│   │   │   │   └── ...
│   │   │   └── ...
│   │   ├── <zone_id_2>
│   │   │   ├── samples
│   │   │   └── ...
│   │   └── ...
│   └── labels
│       ├── <zone_id_1>
│       │   ├── <tile_id>.json
│       │   └── ...
│       ├── <zone_id_2>
│       │   ├── <tile_id>.json
│       │   └── ...
│       └── ...
├── <dataset_id_2>
│   └── ...
└── ...

Where samples are projected JPEG tiles of imagery and annotations are GeoJSON FeatureCollection files.

Hint

The constructor arguments allow for explicit selection of datasets, zones, images or tiles and explicit exclusion of datasets, zones, images or tiles by providing lists of identifiers to select or exclude. If no such sequence is provided, valid data points will be automatically discovered from the filesystem.
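The interplay between selection and exclusion can be pictured with the small filter below. It is a sketch under stated assumptions: keep_identifier is a hypothetical helper, and the rule that exclusion is applied on top of selection is an assumption rather than documented behaviour.

```python
def keep_identifier(identifier, select=(), exclude=()):
    """Hypothetical filter combining a selection and an exclusion list."""
    # An empty selection means "accept anything discovered on disk".
    if select and identifier not in select:
        return False
    # Assumption: exclusion always wins over selection.
    return identifier not in exclude


discovered = ["zone_a", "zone_b", "zone_c"]

all_zones = [z for z in discovered if keep_identifier(z)]
selected = [
    z for z in discovered
    if keep_identifier(z, select=("zone_a", "zone_b"))
]
filtered = [
    z for z in discovered
    if keep_identifier(z, select=("zone_a", "zone_b"), exclude=("zone_b",))
]
```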

Parameters

path (PathLike) – The path to the dataset root, it may be either absolute or relative to the current working directory.
select_datasets (Sequence[str]) – Optional. If provided, it must be a sequence of uuid used to select the datasets in which data points will be fetched.
exclude_datasets (Sequence[str]) – Optional. If provided, it must be a sequence of uuid used to exclude datasets from the data point search.
select_zones (Sequence[str]) – Optional. If provided, it must be a sequence of uuid used to select the zones in which data points will be fetched.
exclude_zones (Sequence[str]) – Optional. If provided, it must be a sequence of uuid used to exclude zones from the data point search.
select_images (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to select the images in which data points will be fetched.
exclude_images (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to exclude images from the data point search.
select_tiles (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to select the tiles which will be fetched.
exclude_tiles (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to exclude tiles from the data point search.
tile_driver (callable) – Optional. Default to a TileDriver. A function(path_tuple, **matched_groups) callable, called for each data point, which returns a TileCollection-like object (see Drivers).
annotation_driver (callable) – Optional. Default to an AnnotationDriver. A function(path_tuple, **matched_groups) callable, called for each data point, which returns an Annotation-like object (see Drivers).
use_taxonomy (bool) – Optional. Default to True. If False, the global taxonomy will not be passed to the annotation driver and an implicit taxonomy will be used for each annotation file, with no interplay guarantee.
strict (bool) – If False, solitary tiles or annotations will be silently dropped instead of raising an error.
cache (bool) – If True, the dataset will be looked up in the user’s cache directory and, if found, loaded from there instead of walking the filesystem. Note that although this could speed up dataset loading several-fold for big datasets, one may load stale data when using the cache.

Warning

If providing a custom annotation driver, the use_taxonomy flag is not guaranteed to work and it is up to the provided driver to handle dataset taxonomies if needed (See also the TaxonomyReader helper class).

Raises

ValueError – If the requested playground datasets have mismatching taxonomies and global Taxonomy usage was requested.
ValueError – If a tile could not be matched to an annotation and strict is True.
ValueError – If no tile/annotation pair could be found.

Warns

UserWarning – If the requested playground datasets have mismatching taxonomies and global Taxonomy usage was not requested.

class plums.dataflow.dataset.playground.TileDriver(*names, ptype=ptype('RGB'), dtype=dtype('uint8'), fetch_ordering=True)[source]¶

Bases: object

A basic driver to open Intelligence Playground tiles as Tile instances.

It provides a basic level of customisation but heavy modification will require either subclassing and overriding or writing a new driver altogether.

Parameters

*names (str) – Optional. If provided, they will be used as keys in the TileCollection returned by the driver.
ptype (ptype) – Optional. Default to RGB. The image pixel-type (e.g. RGB, BGR or Grey).
dtype (dtype) – Optional. Default to uint8. The internal ndarray storage data type.
fetch_ordering (bool) –
If True, tiles will be ordered using the information stored in the dataset summary provided as a JSON file alongside each export.

Warning

If False, the TileCollection ordering will be entirely filesystem-dependent, which is no better than random.

__call__(path_tuple, **matched_groups)[source]¶

Open a set of tiles in a TileCollection.

Parameters

path_tuple (Tuple[PathLike]) – A tuple of paths pointing to the tiles to open.
**matched_groups (str) – A group_name: value mapping of the path pattern group match in the paths.

Returns

A TileCollection with the opened tiles. If names were provided in the constructor, they are used as keys in the collection; otherwise, the default applies.

Return type

TileCollection

Raises

ValueError – If the number of names provided in the constructor and the number of retrieved tiles mismatch.

class plums.dataflow.dataset.playground.AnnotationDriver(record_id_key='record_id', confidence_key='confidence', taxonomy=None, cache=False)[source]¶

Bases: object

A basic driver to open Intelligence Playground annotation GeoJSON FeatureCollection as Annotation.

It provides a basic level of customisation but heavy modification will require either subclassing and overriding or writing a new driver altogether.

Parameters

record_id_key (str) – The key used to find a record’s unique identifier in its properties mapping.
confidence_key (str) – The key used to find a record’s confidence score in its properties mapping.
taxonomy (Taxonomy) – If provided, a Taxonomy against which all records’ labels will be validated.
cache (bool) – Optional. Default to False. If True, all constructed Annotation will be cached in memory to speed up future retrieval.

__call__(path_tuple, **matched_groups)[source]¶

Open a Playground annotation GeoJSON file as an Annotation.

Parameters

path_tuple (Tuple[PathLike]) – A tuple containing a single path pointing to a valid GeoJSON file.
**matched_groups (str) – A group_name: value mapping of the path pattern group match in the paths.

Returns

An Annotation with Record in the tile and a VectorMask corresponding to the zone footprint in the tile.

Return type

Annotation

Raises

ValueError – If no valid Annotation could be constructed from the opened JSON file.
ValueError – If more than one path was provided.
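The role of record_id_key and confidence_key can be illustrated with the standard-library sketch below. It is not the AnnotationDriver implementation: read_records is a hypothetical helper, the sample document is made up for the example, and returning None for a missing key is an assumption.

```python
import json

# Made-up GeoJSON content, used only for this illustration.
GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [12.0, 34.0]},
      "properties": {"record_id": "r-1", "confidence": 0.9}
    },
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [56.0, 78.0]},
      "properties": {"record_id": "r-2"}
    }
  ]
}
"""


def read_records(text, record_id_key="record_id", confidence_key="confidence"):
    """Hypothetical reader returning (identifier, confidence) pairs."""
    collection = json.loads(text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("Expected a GeoJSON FeatureCollection.")
    records = []
    for feature in collection["features"]:
        properties = feature.get("properties") or {}
        # Missing keys yield None here; the real driver may behave differently.
        records.append(
            (properties.get(record_id_key), properties.get(confidence_key))
        )
    return records


records = read_records(GEOJSON)
```

Changing the two key arguments is enough to read exports that store identifiers or confidence scores under different property names.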

class plums.dataflow.dataset.playground.TaxonomyReader[source]¶

Bases: object

A callable class which loads and constructs a Taxonomy when provided with a valid Playground dataset path.

__call__(path)[source]¶

Construct a Taxonomy from the exported dataset taxonomy.json file.

Parameters: path (PathLike) – A path to a single Playground dataset.
Returns: The dataset taxonomy.
Return type: Taxonomy