grain.experimental.FirstFitPackIterDataset

class grain.experimental.FirstFitPackIterDataset(parent, *, length_struct, num_packing_bins, seed=0, shuffle_bins=True, shuffle_bins_group_by_feature=None, meta_features=(), pack_alignment_struct=None, padding_struct=None, max_sequences_per_bin=None)

Implements first-fit packing of sequences.

Unlike concat-and-split, packing never splits a sequence; it pads the unused space in each bin instead. A larger number of packing bins reduces the amount of padding. However, if the number of bins is large, this can cause epoch leakage (data points from multiple epochs being packed together).
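To illustrate the problem that packing avoids, here is a minimal pure-Python sketch of concat-and-split. It is not part of Grain; the helper names concat_and_split and split_sequence_ids are hypothetical and exist only for this illustration. Each token is tagged with the index of the sequence it came from, so it is easy to see which sequences end up straddling a chunk boundary.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[int, int]  # (sequence index, token value)


def concat_and_split(sequences: List[List[int]], length: int) -> List[List[Token]]:
    """Concatenates all tokens, then cuts fixed-size chunks (hypothetical helper)."""
    flat = [(seq_id, tok) for seq_id, seq in enumerate(sequences) for tok in seq]
    return [flat[i:i + length] for i in range(0, len(flat), length)]


def split_sequence_ids(chunks: List[List[Token]]) -> List[int]:
    """Returns the indices of sequences whose tokens landed in more than one chunk."""
    chunks_per_seq: Dict[int, Set[int]] = {}
    for chunk_idx, chunk in enumerate(chunks):
        for seq_id, _ in chunk:
            chunks_per_seq.setdefault(seq_id, set()).add(chunk_idx)
    return sorted(s for s, c in chunks_per_seq.items() if len(c) > 1)


sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = concat_and_split(sequences, length=4)
split = split_sequence_ids(chunks)  # sequences 1 and 2 are cut across chunks
```

With a target length of 4, concat-and-split produces no padding except in the last chunk, but two of the three sequences are cut in the middle. Packing keeps every sequence whole and pays for that with padding.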

This uses a simple first-fit packing algorithm that:
  1. Creates N bins.
  2. Adds elements (in the order coming from the parent) to the first bin that has enough space.
  3. Once an element doesn’t fit, emits all N bins as elements.
  4. (optional) Shuffles bins.
  5. Loops back to 1 and starts with the element that didn’t fit.
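The steps above can be sketched in plain Python. This is an illustration of the first-fit idea only, not Grain's implementation: it handles a single feature made of flat token lists, skips empty bins when emitting, and does not produce segment or position information. The function name first_fit_pack and its parameters (length, num_bins, pad_value, shuffle_seed) are assumptions made for this sketch.

```python
import random
from typing import Iterable, Iterator, List, Optional


def first_fit_pack(
    sequences: Iterable[List[int]],
    length: int,
    num_bins: int,
    pad_value: int = 0,
    shuffle_seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yields padded bins of exactly `length` tokens (illustrative sketch)."""
    rng = random.Random(shuffle_seed) if shuffle_seed is not None else None
    bins: List[List[int]] = [[] for _ in range(num_bins)]  # Step 1.

    def flush() -> List[List[int]]:
        # Steps 3 and 4: pad non-empty bins to `length`, optionally shuffle.
        out = [b + [pad_value] * (length - len(b)) for b in bins if b]
        if rng is not None:
            rng.shuffle(out)
        return out

    for seq in sequences:
        if len(seq) > length:
            raise ValueError("sequence is longer than the target length")
        # Step 2: pick the first bin with enough remaining space.
        target = next((b for b in bins if len(b) + len(seq) <= length), None)
        if target is None:
            yield from flush()
            # Step 5: start over with fresh bins and the element that did not fit.
            bins = [[] for _ in range(num_bins)]
            target = bins[0]
        target.extend(seq)
    yield from flush()  # Emit whatever is left when the input is exhausted.


seqs = [[1, 2, 3], [4, 5], [6], [7, 8]]
one_bin = list(first_fit_pack(seqs, length=4, num_bins=1))
# one_bin == [[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 0, 0]]  (4 padding tokens)
two_bins = list(first_fit_pack(seqs, length=4, num_bins=2))
# two_bins == [[1, 2, 3, 6], [4, 5, 7, 8]]  (no padding)
```

On this toy input, going from one bin to two removes all padding, because later short sequences can fill gaps in bins that are still open. The same mechanism is what allows elements from far apart in the input order to end up in one bin when the number of bins is large.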

Parameters:
  • parent (IterDataset)

  • length_struct (Any)

  • num_packing_bins (int)

  • seed (int)

  • shuffle_bins (bool)

  • shuffle_bins_group_by_feature (str | None)

  • meta_features (Sequence[str])

  • pack_alignment_struct (Any)

  • padding_struct (Any)

  • max_sequences_per_bin (int | None)

__init__(parent, *, length_struct, num_packing_bins, seed=0, shuffle_bins=True, shuffle_bins_group_by_feature=None, meta_features=(), pack_alignment_struct=None, padding_struct=None, max_sequences_per_bin=None)

Creates a dataset that packs sequences using the first-fit strategy.

Parameters:
  • parent (IterDataset) – Parent dataset with variable-length sequences.

  • length_struct (Any) – Target sequence length for each feature.

  • num_packing_bins (int) – Number of bins to pack sequences into.

  • seed (int) – Random seed for shuffling bins.

  • shuffle_bins (bool) – Whether to shuffle bins after packing.

  • shuffle_bins_group_by_feature (str | None) – Feature to group by for shuffling.

  • meta_features (Sequence[str]) – Meta features that do not need packing logic.

  • pack_alignment_struct (Any) – Optional per-feature alignment values.

  • padding_struct (Any) – Optional per-feature padding values.

  • max_sequences_per_bin (int | None) – Optional maximum number of input sequences that can be packed into a bin.

Methods

__init__(parent, *, length_struct, ...[, ...])

Creates a dataset that packs sequences using the first-fit strategy.

apply(transformations)

Returns a dataset with the given transformation(s) applied.

batch(batch_size, *[, drop_remainder, batch_fn])

Returns a dataset of elements batched along a new first dimension.

filter(transform)

Returns a dataset containing only the elements that match the filter.

map(transform)

Returns a dataset containing the elements transformed by transform.

map_with_index(transform)

Returns a dataset of the elements transformed by the transform.

mp_prefetch([options, worker_init_fn, ...])

Returns a dataset prefetching elements in multiple processes.

pipe(func, /, *args, **kwargs)

Syntactic sugar for applying a callable to this dataset.

prefetch(multiprocessing_options)

Deprecated, use mp_prefetch instead.

random_map(transform, *[, seed])

Returns a dataset containing the elements transformed by transform.

seed(seed)

Returns a dataset that uses the seed for default seed generation.

Attributes

parents