nalp.datasets

Because we need data, right? Datasets are composed of classes and methods that prepare encoded data for use with neural networks.

A dataset package to transform encoded data into real datasets.

class nalp.datasets.ImageDataset(images: numpy.array, batch_size: Optional[int] = 256, shape: Optional[Tuple[int, int]] = None, normalize: Optional[bool] = True, shuffle: Optional[bool] = True)

Bases: nalp.core.Dataset

An ImageDataset class is responsible for creating a dataset that encodes images for adversarial generation.

__init__(self, images: numpy.array, batch_size: Optional[int] = 256, shape: Optional[Tuple[int, int]] = None, normalize: Optional[bool] = True, shuffle: Optional[bool] = True)

Initialization method.

Parameters
  • images – An array of images.

  • batch_size – Size of batches.

  • shape – A tuple containing the target shape, used when the array should be reshaped.

  • normalize – Whether images should be normalized between -1 and 1.

  • shuffle – Whether batches should be shuffled or not.

_preprocess(self, images: numpy.array, shape: Tuple[int, int], normalize: bool)

Pre-process an array of images by reshaping and normalizing, if necessary.

Parameters
  • images – An array of images.

  • shape – A tuple containing the target shape, used when the array should be reshaped.

  • normalize – Whether images should be normalized between -1 and 1.

Returns

Slices of pre-processed tensor-based images.

Return type

(tf.data.Dataset)
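The pre-processing described above (an optional reshape followed by scaling pixel values into [-1, 1]) can be sketched in plain NumPy. This is a minimal illustration of the idea, not nalp's implementation, which returns a tf.data.Dataset:

```python
import numpy as np

def preprocess(images: np.ndarray, shape=None, normalize=True) -> np.ndarray:
    # Optionally force the array into a new shape, e.g. flattening 2x2 images
    if shape is not None:
        images = images.reshape(shape)
    # Scale raw pixel values from [0, 255] into [-1, 1]
    if normalize:
        images = images.astype("float32")
        images = (images / 127.5) - 1.0
    return images

# Four 2x2 grayscale "images" with raw uint8 pixel values
raw = np.array([[[0, 255], [128, 64]]] * 4, dtype="uint8")
out = preprocess(raw, shape=(4, 4), normalize=True)
print(out.min(), out.max())  # -1.0 1.0
```

Normalizing to [-1, 1] matches the tanh output range commonly used by GAN generators, which is why it is the default for adversarial generation.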

class nalp.datasets.LanguageModelingDataset(encoded_tokens: numpy.array, max_contiguous_pad_length: Optional[int] = 1, batch_size: Optional[int] = 64, shuffle: Optional[bool] = True)

Bases: nalp.core.Dataset

A LanguageModelingDataset class is responsible for creating a dataset that predicts the next timestep (t+1) given a timestep (t).

__init__(self, encoded_tokens: numpy.array, max_contiguous_pad_length: Optional[int] = 1, batch_size: Optional[int] = 64, shuffle: Optional[bool] = True)

Initialization method.

Parameters
  • encoded_tokens – An array of encoded tokens.

  • max_contiguous_pad_length – Maximum length to pad contiguous text.

  • batch_size – Size of batches.

  • shuffle – Whether batches should be shuffled or not.

_create_input_target(self, sequence: tensorflow.Tensor)

Creates input (t) and targets (t+1) using the next timestep approach.

Parameters

sequence – A tensor holding the sequence to be mapped.

Returns

Input and target tensors.

Return type

(Tuple[tf.Tensor, tf.Tensor])
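The next-timestep mapping is simple to illustrate: the input drops the last token and the target drops the first, so position t in the input lines up with position t+1 in the target. A minimal sketch in plain Python (nalp applies the same idea to tensors):

```python
def create_input_target(sequence):
    # Input drops the last timestep, target drops the first,
    # so target[t] is exactly the token that follows input[t]
    return sequence[:-1], sequence[1:]

encoded = [7, 2, 9, 4, 1]  # hypothetical encoded tokens
inp, tgt = create_input_target(encoded)
print(inp)  # [7, 2, 9, 4]
print(tgt)  # [2, 9, 4, 1]
```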

_create_sequences(self, encoded_tokens: numpy.array, rank: int, max_contiguous_pad_length: int)

Creates sequences of the desired length.

Parameters
  • encoded_tokens – An array of encoded tokens.

  • rank – Number of array dimensions (rank).

  • max_contiguous_pad_length – Maximum length of the sequences.

Returns

Slices of tensor-based sequences.

Return type

(tf.data.Dataset)
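Chunking a flat stream of encoded tokens into fixed-length training sequences can be sketched as follows. The assumption here (common in language-modeling pipelines, though not confirmed by this reference) is that each chunk carries one extra token, so that the (t) → (t+1) split performed by _create_input_target still yields sequences of the requested length:

```python
import numpy as np

def create_sequences(encoded_tokens, max_contiguous_pad_length):
    # Hypothetical sketch: each chunk is one token longer than the
    # requested length, leaving room for a later input/target shift
    chunk = max_contiguous_pad_length + 1
    n_chunks = len(encoded_tokens) // chunk
    return [encoded_tokens[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]

tokens = np.arange(10)  # hypothetical encoded token stream
seqs = create_sequences(tokens, max_contiguous_pad_length=4)
print(len(seqs))  # 2 chunks of 5 tokens; any trailing remainder is dropped
```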