nalp.datasets

Because we need data, right? Datasets are composed of classes and methods that prepare encoded data for use with neural networks.

A dataset package to transform encoded data into real datasets.

class nalp.datasets.ImageDataset(images: numpy.array, batch_size: Optional[int] = 256, shape: Optional[Tuple[int, int]] = None, normalize: Optional[bool] = True, shuffle: Optional[bool] = True)

Bases: nalp.core.Dataset

An ImageDataset class is responsible for creating a dataset that encodes images for adversarial generation.

__init__(self, images: numpy.array, batch_size: Optional[int] = 256, shape: Optional[Tuple[int, int]] = None, normalize: Optional[bool] = True, shuffle: Optional[bool] = True)

Initialization method.

Parameters
  • images – An array of images.

  • batch_size – Size of batches.

  • shape – A tuple containing the target shape, used when the array should be reshaped.

  • normalize – Whether images should be normalized between -1 and 1.

  • shuffle – Whether batches should be shuffled or not.

_preprocess(self, images: numpy.array, shape: Tuple[int, int], normalize: bool)

Pre-process an array of images by reshaping and normalizing, if necessary.

Parameters
  • images – An array of images.

  • shape – A tuple containing the target shape, used when the array should be reshaped.

  • normalize – Whether images should be normalized between -1 and 1.

Returns

Slices of pre-processed tensor-based images.

Return type

(tf.data.Dataset)
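The pre-processing described above (an optional reshape followed by scaling pixel values into [-1, 1]) can be sketched in plain NumPy. This is a minimal illustration of the idea, not nalp's implementation, which returns a tf.data.Dataset:

```python
import numpy as np

def preprocess(images: np.ndarray, shape=None, normalize=True) -> np.ndarray:
    # Optionally force the array into a new shape, e.g. flattening 2x2 images
    if shape is not None:
        images = images.reshape(shape)
    # Scale raw pixel values from [0, 255] into [-1, 1]
    if normalize:
        images = images.astype("float32")
        images = (images / 127.5) - 1.0
    return images

# Four 2x2 grayscale "images" with raw uint8 pixel values
raw = np.array([[[0, 255], [128, 64]]] * 4, dtype="uint8")
out = preprocess(raw, shape=(4, 4), normalize=True)
print(out.min(), out.max())  # -1.0 1.0
```

Normalizing to [-1, 1] matches the tanh output range commonly used by GAN generators, which is why it is the default for adversarial generation.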

class nalp.datasets.LanguageModelingDataset(encoded_tokens: numpy.array, max_contiguous_pad_length: Optional[int] = 1, batch_size: Optional[int] = 64, shuffle: Optional[bool] = True)

Bases: nalp.core.Dataset

A LanguageModelingDataset class is responsible for creating a dataset that predicts the next timestep (t+1) given a timestep (t).

__init__(self, encoded_tokens: numpy.array, max_contiguous_pad_length: Optional[int] = 1, batch_size: Optional[int] = 64, shuffle: Optional[bool] = True)

Initialization method.

Parameters
  • encoded_tokens – An array of encoded tokens.

  • max_contiguous_pad_length – Maximum length to pad contiguous text.

  • batch_size – Size of batches.

  • shuffle – Whether batches should be shuffled or not.

_create_input_target(self, sequence: tensorflow.Tensor)

Creates input (t) and targets (t+1) using the next timestep approach.

Parameters

sequence – A tensor holding the sequence to be mapped.

Returns

Input and target tensors.

Return type

(Tuple[tf.Tensor, tf.Tensor])
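The next-timestep mapping is simple to illustrate: the input drops the last token and the target drops the first, so position t in the input lines up with position t+1 in the target. A minimal sketch in plain Python (nalp applies the same idea to tensors):

```python
def create_input_target(sequence):
    # Input drops the last timestep, target drops the first,
    # so target[t] is exactly the token that follows input[t]
    return sequence[:-1], sequence[1:]

encoded = [7, 2, 9, 4, 1]  # hypothetical encoded tokens
inp, tgt = create_input_target(encoded)
print(inp)  # [7, 2, 9, 4]
print(tgt)  # [2, 9, 4, 1]
```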

_create_sequences(self, encoded_tokens: numpy.array, rank: int, max_contiguous_pad_length: int)

Creates sequences of the desired length.

Parameters
  • encoded_tokens – An array of encoded tokens.

  • rank – Number of array dimensions (rank).

  • max_contiguous_pad_length – Maximum length of the sequences.

Returns

Slices of tensor-based sequences.

Return type

(tf.data.Dataset)
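Chunking a flat stream of encoded tokens into fixed-length training sequences can be sketched as follows. The assumption here (common in language-modeling pipelines, though not confirmed by this reference) is that each chunk carries one extra token, so that the (t) → (t+1) split performed by _create_input_target still yields sequences of the requested length:

```python
import numpy as np

def create_sequences(encoded_tokens, max_contiguous_pad_length):
    # Hypothetical sketch: each chunk is one token longer than the
    # requested length, leaving room for a later input/target shift
    chunk = max_contiguous_pad_length + 1
    n_chunks = len(encoded_tokens) // chunk
    return [encoded_tokens[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]

tokens = np.arange(10)  # hypothetical encoded token stream
seqs = create_sequences(tokens, max_contiguous_pad_length=4)
print(len(seqs))  # 2 chunks of 5 tokens; any trailing remainder is dropped
```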