Choice of API#
Grain offers two different ways of defining data processing pipelines:
DataLoader and Dataset.
TL;DR: If you need to do one of the following:

- mix multiple data sources
- pack variable-length elements
- split dataset elements and globally shuffle the splits

then you should use Dataset, otherwise use the simpler DataLoader.
DataLoader#
DataLoader is a high-level API that uses the following abstractions to define
data processing:
- A RandomAccessDataSource that reads raw input data.
- A Sampler that defines the order in which the raw data should be read.
- A flat sequence of Transformations to apply to the raw data.
You can also specify execution parameters for asynchronous data processing,
sharding, and shuffling, and DataLoader will automatically insert them in the
right places between the data processing steps.
These are simple and usually general enough to cover most data processing use
cases. Prefer using DataLoader if your workflow can be described using the
abstractions above. See the tutorial for more details.
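To make the three abstractions concrete, here is a minimal pure-Python sketch of the pattern (hypothetical names for illustration only, not Grain's actual implementation): a random-access source, a sampler that produces the read order, and a flat list of transformations applied in sequence.

```python
import random

# Hypothetical stand-ins for the three DataLoader abstractions.

# RandomAccessDataSource: anything with __len__ and __getitem__.
source = ["a", "b", "c", "d"]

def shuffled_index_sampler(num_records, seed):
    """Sampler: yields record indices in a shuffled, reproducible order."""
    indices = list(range(num_records))
    random.Random(seed).shuffle(indices)
    yield from indices

# A flat sequence of Transformations applied to each raw element.
transformations = [str.upper, lambda s: s + "!"]

def load(source, sampler, transformations):
    """Reads records in sampler order and applies each transformation in turn."""
    for index in sampler:
        element = source[index]
        for transform in transformations:
            element = transform(element)
        yield element

result = list(load(source, shuffled_index_sampler(len(source), seed=0), transformations))
```

The real DataLoader additionally handles multiprocessing, sharding, and checkpointing around this same basic structure.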
Dataset#
Dataset is a lower-level API that uses chaining syntax to define data
transformation steps. It allows more general types of processing (e.g. dataset
mixing) and more control over the execution (e.g. different order of data
sharding and shuffling). Dataset transformations are composed in a way that
preserves the random-access property past the source and through some of the
transformations. This, among other things, can be used for debugging by
evaluating dataset elements at specific positions without processing the entire
dataset.
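The random-access idea can be sketched in plain Python (a conceptual sketch under assumed names, not Grain's implementation): a lazily mapped view still supports `__getitem__`, so element k can be computed without processing anything else.

```python
class LazyMapped:
    """Conceptual sketch of a mapped view that keeps random access.

    Not Grain's MapDataset; it only illustrates that applying a map
    lazily preserves __getitem__ past the source.
    """

    def __init__(self, parent, fn):
        self.parent = parent
        self.fn = fn

    def __len__(self):
        return len(self.parent)

    def __getitem__(self, index):
        # Only the requested element is computed.
        return self.fn(self.parent[index])

source = list(range(1000))
squared = LazyMapped(source, lambda x: x * x)

# Evaluate a single position for debugging; the other 999 elements
# are never touched.
value = squared[7]
```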
There are 3 main classes comprising the Dataset API:
- MapDataset defines a dataset that supports efficient random access. Think of it as an (infinite) Sequence that computes values lazily.
- IterDataset defines a dataset that does not support efficient random access and only supports iterating over it. It’s an Iterable. Any MapDataset can be turned into an IterDataset by calling to_iter_dataset().
- DatasetIterator defines a stateful iterator over an IterDataset. The state of the iterator can be saved and restored.
Most data pipelines will start with one or more MapDataset (often derived from
a RandomAccessDataSource) and switch to IterDataset late or not at all. See
the tutorial for more details.
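The lifecycle described above, random access, iteration, and a stateful iterator whose position can be saved and restored, can be sketched with stdlib pieces (hypothetical class names, not Grain's API):

```python
class SketchIterator:
    """Conceptual stateful iterator: its position can be checkpointed.

    Stands in for DatasetIterator; the dataset argument stands in for
    a random-access MapDataset.
    """

    def __init__(self, dataset, position=0):
        self.dataset = dataset
        self.position = position

    def __next__(self):
        if self.position >= len(self.dataset):
            raise StopIteration
        element = self.dataset[self.position]
        self.position += 1
        return element

    def get_state(self):
        return {"position": self.position}

    def set_state(self, state):
        self.position = state["position"]

dataset = [x * 10 for x in range(5)]  # stands in for a MapDataset

it = SketchIterator(dataset)
first_two = [next(it), next(it)]

state = it.get_state()  # checkpoint the iterator state
_ = next(it)            # consume one more element
it.set_state(state)     # restore: iteration resumes where we saved
resumed = next(it)
```

Being able to serialize `get_state()` is what lets a long-running input pipeline resume exactly where it left off after a preemption.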