Dataset
The Plums dataflow module borrows the main ideas of the PyTorch data pipeline API and is compatible with both PyTorch and TensorFlow’s Keras.
Base datasets
The main class to stream through data in Plums is the Dataset base class, which guarantees a sequence-like
interface to manipulate data in an ordered fashion.
class plums.dataflow.dataset.Dataset
Bases: object
Abstract base class for all Dataset to inherit from.
Subclasses must override the __getitem__() method to allow Dataset to act as a ‘non-sized’ Sequence.
Although subclasses should override the __len__() method, it is deliberately not implemented by default to ensure valid behaviour with non-sized samplers. To explicitly signal that __len__() is or should be present in subclasses, use the SizedDataset base class instead.
Hint
Insightful users might notice that the base dataset API closely mimics PyTorch’s own Dataset and Keras’s Sequence. This is deliberate so that the provided Dataset can be used as a stand-in (with some care needed to support the on_epoch_end() callback on the Keras side though). However, to keep class inheritance clean and simple, Dataset will fail isinstance() checks against its framework counterparts, as it only shares interfaces (duck-typing) with them.
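The ‘non-sized’ sequence idea can be illustrated without Plums itself. The sketch below is a minimal stand-in that relies only on standard Python behaviour: the Squares class is hypothetical and does not inherit from the Plums Dataset. It defines __getitem__() but no __len__(), and Python can still iterate over it through the legacy sequence protocol, stopping at the first IndexError.

```python
class Squares:
    """Hypothetical stand-in: sequence-like, with __getitem__ but no __len__."""

    def __init__(self, count):
        self._count = count

    def __getitem__(self, index):
        # Raising IndexError is what tells Python the sequence is exhausted.
        if not 0 <= index < self._count:
            raise IndexError(index)
        return index * index


squares = Squares(4)
items = list(squares)  # iterates by calling __getitem__(0), (1), ... until IndexError
has_len = hasattr(squares, "__len__")
```

A sized variant would simply add a __len__() method returning the item count, which is the contract SizedDataset makes explicit.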
class plums.dataflow.dataset.SizedDataset
Bases: plums.dataflow.dataset.base.Dataset
Abstract base class for all Dataset with a known length to inherit from.
Subclasses must override the __getitem__() and __len__() methods to allow Dataset to act as a regular Sequence.
Two utility Dataset classes are also provided to ease the creation of dataset partitions and compositions:
class plums.dataflow.dataset.Subset(dataset, indices)
Bases: plums.dataflow.dataset.base.SizedDataset
Create a subset of a Dataset from a Dataset to wrap and a selector container.
Providing a list of indices is the preferred way to select a subset, but any sequence-like or mapping-like object may work.
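As a rough illustration of how an index-based subset can work, the sketch below wraps any sequence and remaps indices on access. IndexSubset is a hypothetical stand-in written for this example; it is not the Plums Subset class and makes no claim about how Plums implements it.

```python
class IndexSubset:
    """Hypothetical stand-in for a subset wrapper; not the Plums implementation."""

    def __init__(self, dataset, indices):
        self._dataset = dataset
        self._indices = indices

    def __getitem__(self, index):
        # Translate the subset index into an index of the wrapped dataset.
        return self._dataset[self._indices[index]]

    def __len__(self):
        return len(self._indices)


letters = ["a", "b", "c", "d", "e"]
evens = IndexSubset(letters, [0, 2, 4])
subset_items = [evens[i] for i in range(len(evens))]

# A mapping-like selector also works here, since only selector[index] is used.
remapped = IndexSubset(letters, {0: 4, 1: 1})
```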
class plums.dataflow.dataset.ConcatDataset(datasets, *additional_datasets)
Bases: plums.dataflow.dataset.base.SizedDataset
Create a Dataset as the concatenation of multiple Dataset.
The concatenation involves no copy of any sort as the reindexing happens on-the-fly in the __getitem__() method.
Warning
An explicit Dataset type check is performed on datasets to work out the correct signature of the provided arguments. If using the class with non-Plums datasets (as may happen implicitly with dataset addition), ensure that the arguments are either in the correct order to pass the type check or passed in the non-expanded form to avoid ambiguities.
Raises
ValueError – If no dataset or an empty dataset is provided in the constructor’s arguments.
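Copy-free concatenation with on-the-fly reindexing can be sketched with cumulative lengths and a binary search. ConcatSketch below is a hypothetical stand-in for illustration only; the actual Plums implementation and its argument handling may differ.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatSketch:
    """Hypothetical stand-in showing on-the-fly reindexing; not the Plums implementation."""

    def __init__(self, *datasets):
        if not datasets or any(len(d) == 0 for d in datasets):
            raise ValueError("Expected at least one dataset and no empty dataset.")
        self._datasets = datasets
        # Cumulative lengths, e.g. lengths (2, 3) give offsets [2, 5].
        self._offsets = list(accumulate(len(d) for d in datasets))

    def __len__(self):
        return self._offsets[-1]

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find which dataset holds the index, then shift to a local index.
        which = bisect_right(self._offsets, index)
        start = self._offsets[which - 1] if which else 0
        return self._datasets[which][index - start]


both = ConcatSketch(["a", "b"], ["c", "d", "e"])
concat_items = [both[i] for i in range(len(both))]
```

Only references to the wrapped datasets are stored, so no item is copied until it is requested.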
Pattern dataset
class plums.dataflow.dataset.PatternDataset(tile_pattern, annotation_pattern, tile_driver, annotation_driver, path=None, sort_key=None, strict=True, cache=False)
Bases: plums.dataflow.dataset.base.SizedDataset
A SizedDataset whose tile/annotation pairs are globbed from a pair of matching dataset path patterns.
A path pattern is provided using a micro-language as described below:
A dataset pattern is a path-like string where path elements may be either “components” or “groups”.
A component designates an entity whose value is fixed, e.g. in /some/pattern, some and pattern are components. Components define exact matches where the name of the element to match is known in advance.
A group designates a named entity whose value is unknown. Groups are delimited by curly braces { and }, e.g. in /some/pattern/{with}/some/{groups}, {with} and {groups} are groups, and may additionally define constraints to limit or expand the group match capability. The text immediately following the opening brace is the group’s name and must be unique within the pattern. A forward slash / following the group name indicates a recursive group, i.e. a group which may span multiple folders. To constrain the group match, a colon : after the group name (or recursive slash) is used to add a regex against which all candidate entities will be matched (note that for recursive groups, the regex applies to each path entity, not to the whole group).
The last group or component of the pattern must be a file, indicated by an extension added at the end. Multiple extension alternatives may be provided, using square brackets [ and ] as delimiters and separating each alternative with a pipe |.
In a more formal manner, the path pattern language EBNF grammar might look something like:
pattern = [ absolute ], { folder }, file ;
absolute = SEPARATOR ;
folder = entry, SEPARATOR ;
file = entry, ".", extension ;
entry = FSNAME | "{" IDENTIFIER, [ SEPARATOR ], [ ":", REGEX ], "}" ;
extension = EXTENSION | "[", EXTENSION, { "|", EXTENSION }, "]" ;
IDENTIFIER = ( "_" | LETTER ), { "_" | LETTER | NUMBER } ;
FSNAME = { LETTER | NUMBER | "_" | "-" | " " } ;
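To give a feel for what such a pattern captures, the sketch below pairs one example pattern with a hand-written regular expression. Both the pattern string and the regex are illustrative assumptions: the regex is an approximate reading of the grammar for non-recursive groups, not something produced by Plums, and the real matcher may behave differently.

```python
import re

# Example pattern in the micro-language (chosen for illustration):
#   images/{dataset}/{tile:[0-9]+}.[jpg|png]
# Hand-written approximation: an unconstrained group matches one path element,
# a constrained group uses its regex, and the extension list becomes an alternation.
APPROXIMATION = re.compile(
    r"^images/(?P<dataset>[^/]+)/(?P<tile>[0-9]+)\.(?:jpg|png)$"
)


def match_groups(path):
    """Return the group name: value pairs for a path, or None if it does not match."""
    match = APPROXIMATION.match(path)
    return match.groupdict() if match else None


hit = match_groups("images/paris/0042.jpg")
miss = match_groups("images/paris/cover.jpg")  # "cover" fails the [0-9]+ constraint
```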
Hint
The annotation path pattern may be degenerate (i.e. point to a single, non-variable file), in which case the path matching every tile path will be the degenerate annotation path. A degenerate flag set to True is passed to the enclosed annotation driver when it is called, to allow for caching mechanisms and to reduce the file loading overhead in this case.
The PatternDataset also expects a pair of callables, called the drivers, which will be fed a tuple of paths and the path pattern named-group match name: value pairs. They should return objects compatible with the Plums data model, i.e. a Tile-like object for tiles and an Annotation-like object for annotations.
Parameters
tile_pattern (str) – The path pattern corresponding to the dataset tiles.
annotation_pattern (str) – The path pattern corresponding to the dataset annotations.
tile_driver (callable) – A function(path_tuple, **matched_groups) callable which returns a TileCollection-like object.
annotation_driver (callable) – A function(path_tuple, **matched_groups) callable which returns an Annotation-like object.
path (PathLike) – If the tile and annotation path patterns are relative, a folder from which to start discovering tile/annotation file pairs.
sort_key (callable) – Optional. If provided, it must be a function of one match group which returns a sorting key used to sort tile/annotation pairs.
Warning
Although the data points will be sorted, the matched file paths ordering will be entirely filesystem dependent, which is no better than random.
strict (bool) – If False, solitary tiles or annotations will be silently dropped instead of raising.
cache (bool) – If True, the dataset will be looked up in the user’s cache directory and, if found, loaded from there instead of walking the filesystem. Note that although this could speed up dataset loading severalfold for big datasets, one may load stale data when using the cache.
Raises
ValueError – If the provided tile path pattern is degenerate.
ValueError – If the provided tile path pattern has no named group in common with the provided annotation path pattern.
ValueError – If a tile could not be matched to an annotation and strict is True.
ValueError – If no tile/annotation pair could be found.
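The driver calling convention, function(path_tuple, **matched_groups), can be sketched with a plain function. The name describe_tile and its dictionary return value are hypothetical and used only to show the signature; a real driver should return a Tile-like or Annotation-like object from the Plums data model instead.

```python
from pathlib import Path


def describe_tile(path_tuple, **matched_groups):
    """Hypothetical driver: returns a plain dict, NOT a Plums Tile-like object."""
    return {
        # The positional argument is a tuple of paths.
        "files": [Path(path).name for path in path_tuple],
        # The keyword arguments are the named-group matches of the pattern.
        "groups": dict(sorted(matched_groups.items())),
    }


result = describe_tile(("/data/paris/0042.jpg",), dataset="paris", tile="0042")
```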
Domain datasets
The following datasets are domain-specific datasets based on the PatternDataset.
Playground
class plums.dataflow.dataset.PlaygroundDataset(path, select_datasets=(), select_zones=(), select_images=(), select_tiles=(), exclude_datasets=(), exclude_zones=(), exclude_images=(), exclude_tiles=(), tile_driver=None, annotation_driver=None, use_taxonomy=True, strict=True, cache=False)
Bases: plums.dataflow.dataset.pattern.PatternDataset
A Dataset as exported by the Intelligence Playground which loads data in the Plums data model.
A PlaygroundDataset has the following file structure:
├── <dataset_id_1>
│   ├── samples
│   │   ├── <zone_id_1>
│   │   │   ├── <image_id_1>
│   │   │   │   ├── <tile_id>.jpg
│   │   │   │   └── ...
│   │   │   ├── <image_id_2>
│   │   │   │   ├── <tile_id>.jpg
│   │   │   │   └── ...
│   │   │   └── ...
│   │   ├── <zone_id_2>
│   │   │   ├── samples
│   │   │   └── ...
│   │   └── ...
│   └── labels
│       ├── <zone_id_1>
│       │   ├── <tile_id>.json
│       │   └── ...
│       ├── <zone_id_2>
│       │   ├── <tile_id>.json
│       │   └── ...
│       └── ...
├── <dataset_id_2>
│   └── ...
└── ...
Where samples are projected jpg tiles of imagery and annotations are GeoJSON FeatureCollections.
Hint
The constructor arguments allow for explicit selection of datasets, zones, images or tiles and explicit exclusion of datasets, zones, images or tiles by providing lists of identifiers to select or exclude. If no such sequence is provided, valid data points will be automatically discovered from the filesystem.
Parameters
path (PathLike) – The path to the dataset root; it may be either absolute or relative to the current working directory.
select_datasets (Sequence[str]) – Optional. If provided, it must be a sequence of UUIDs used to select the datasets in which data points will be fetched.
exclude_datasets (Sequence[str]) – Optional. If provided, it must be a sequence of UUIDs used to exclude datasets from the data point search.
select_zones (Sequence[str]) – Optional. If provided, it must be a sequence of UUIDs used to select the zones in which data points will be fetched.
exclude_zones (Sequence[str]) – Optional. If provided, it must be a sequence of UUIDs used to exclude zones from the data point search.
select_images (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to select the images in which data points will be fetched.
exclude_images (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to exclude images from the data point search.
select_tiles (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to select the tiles which will be fetched.
exclude_tiles (Sequence[str]) – Optional. If provided, it must be a sequence of identifiers used to exclude tiles from the data point search.
tile_driver (callable) – Optional. Defaults to a TileDriver. A function(path_tuple, **matched_groups) callable which returns a TileCollection-like object, called for each data point (see Drivers).
annotation_driver (callable) – Optional. Defaults to an AnnotationDriver. A function(path_tuple, **matched_groups) callable which returns an Annotation-like object, called for each data point (see Drivers).
use_taxonomy (bool) – Optional. Defaults to True. If False, the global taxonomy will not be passed to the annotation driver and implicit taxonomies will be used for each annotation file, with no interplay guarantee.
strict (bool) – If False, solitary tiles or annotations will be silently dropped instead of raising.
cache (bool) – If True, the dataset will be looked up in the user’s cache directory and, if found, loaded from there instead of walking the filesystem. Note that although this could speed up dataset loading severalfold for big datasets, one may load stale data when using the cache.
Warning
If providing a custom annotation driver, the use_taxonomy flag is not guaranteed to work and it is up to the provided driver to handle dataset taxonomies if needed (see also the TaxonomyReader helper class).
Raises
ValueError – If the requested playground datasets have mismatching taxonomies and global Taxonomy usage was requested.
ValueError – If a tile could not be matched to an annotation and strict is True.
ValueError – If no tile/annotation pair could be found.
Warns
UserWarning – If the requested playground datasets have mismatching taxonomies and global Taxonomy usage was not requested.
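The select/exclude arguments of PlaygroundDataset can be pictured as a simple identifier filter. The sketch below shows one plausible reading, under two assumptions that the documentation does not confirm: an empty selection means no restriction, and exclusion takes precedence over selection. The keep function is hypothetical and is not part of Plums.

```python
def keep(identifier, select=(), exclude=()):
    """Hypothetical filter: empty `select` keeps everything; exclusion wins (assumed)."""
    if identifier in exclude:
        return False
    return not select or identifier in select


zones = ["zone-a", "zone-b", "zone-c"]

# No selection and no exclusion: every discovered identifier is kept.
everything = [zone for zone in zones if keep(zone)]

# Selecting two zones and excluding one of them leaves a single zone.
selected = [
    zone
    for zone in zones
    if keep(zone, select=("zone-a", "zone-b"), exclude=("zone-b",))
]
```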
class plums.dataflow.dataset.playground.TileDriver(*names, ptype=ptype('RGB'), dtype=dtype('uint8'), fetch_ordering=True)
Bases: object
A basic driver to open Intelligence Playground tiles as Tile instances.
It provides a basic level of customisation, but heavy modification will require either subclassing and overriding or writing a new driver altogether.
Parameters
*names (str) – Optional. If provided, they will be used as keys in the TileCollection returned by the driver.
ptype (ptype) – Optional. Defaults to RGB. The image pixel type (e.g. RGB, BGR or Grey).
dtype (dtype) – Optional. Defaults to uint8. The internal ndarray storage data type.
fetch_ordering (bool) – If True, tiles will be ordered using the information stored in the dataset summary provided as a JSON file alongside each export.
Warning
If False, the TileCollection ordering will be entirely filesystem dependent, which is no better than random.
__call__(path_tuple, **matched_groups)
Open a set of tiles in a TileCollection.
Parameters
path_tuple (Tuple[PathLike]) – A tuple of paths pointing to the tiles to open.
**matched_groups (str) – A group_name: value mapping of the path pattern group matches in the paths.
Returns
A TileCollection with the opened tiles. If names were provided in the constructor, they are used as keys in the collection; otherwise, the default applies.
Return type
TileCollection
Raises
ValueError – If the number of names provided in the constructor and the number of retrieved tiles mismatch.
class plums.dataflow.dataset.playground.AnnotationDriver(record_id_key='record_id', confidence_key='confidence', taxonomy=None, cache=False)
Bases: object
A basic driver to open an Intelligence Playground annotation GeoJSON FeatureCollection as an Annotation.
It provides a basic level of customisation, but heavy modification will require either subclassing and overriding or writing a new driver altogether.
Parameters
record_id_key (str) – The key used to find a record’s unique identifier in its properties mapping.
confidence_key (str) – The key used to find a record’s confidence score in its properties mapping.
taxonomy (Taxonomy) – If provided, a Taxonomy against which all records’ labels will be validated.
cache (bool) – Optional. Defaults to False. If True, all constructed Annotation instances will be cached in memory to speed up future retrieval.
__call__(path_tuple, **matched_groups)
Open a Playground annotation GeoJSON file as an Annotation.
Parameters
path_tuple (Tuple[PathLike]) – A tuple containing a single path pointing to a valid GeoJSON file.
**matched_groups (str) – A group_name: value mapping of the path pattern group matches in the paths.
Returns
An Annotation with the Record instances in the tile and a VectorMask corresponding to the zone footprint in the tile.
Return type
Annotation
Raises
ValueError – If no valid Annotation could be constructed from the opened JSON file.
ValueError – If more than one path was provided.