nalp.encoders

Text or numbers? Encodings turn raw text into numerical representations, which in turn can serve as embeddings fed into neural networks. Since networks cannot read raw data, you will usually want to pre-encode your data with a well-known encoder.

An encoding package, containing encoders, decoders, and all text-to-vector necessities.

class nalp.encoders.IntegerEncoder

Bases: nalp.core.encoder.Encoder

An IntegerEncoder class is responsible for encoding text into integers.

__init__(self)

Initialization method.

decode(self, encoded_tokens: numpy.array)

Decodes the encoding back to tokens.

Parameters

encoded_tokens – A numpy array containing the encoded tokens.

Returns

Decoded tokens.

Return type

(List[str])

property decoder(self)

Decoder dictionary.

encode(self, tokens: List[str])

Encodes new tokens based on previous learning.

Parameters

tokens – A list of tokens to be encoded.

Returns

Encoded tokens.

Return type

(np.array)

learn(self, dictionary: Dict[str, Any], reverse_dictionary: Dict[str, Any])

Learns an integer vectorization encoding.

Parameters
  • dictionary – The vocabulary to index mapping.

  • reverse_dictionary – The index to vocabulary mapping.
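The `learn` → `encode` → `decode` round trip above can be sketched in plain Python. This is a minimal standalone illustration of what an integer encoder does (a vocabulary-to-index lookup and its inverse), not NALP's actual implementation; the `dictionary` and `reverse_dictionary` mappings mirror the arguments of `learn()`:

```python
from typing import Dict, List

import numpy as np

# Toy corpus and its vocabulary/index mappings, analogous to the
# `dictionary` and `reverse_dictionary` arguments of `learn()`.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
dictionary: Dict[str, int] = {word: i for i, word in enumerate(vocab)}
reverse_dictionary: Dict[int, str] = {i: word for word, i in dictionary.items()}

def encode(tokens: List[str]) -> np.ndarray:
    # Maps each token to its integer index.
    return np.array([dictionary[t] for t in tokens])

def decode(encoded_tokens: np.ndarray) -> List[str]:
    # Maps each integer index back to its token.
    return [reverse_dictionary[i] for i in encoded_tokens]

encoded = encode(tokens)
assert decode(encoded) == tokens  # lossless round trip
```

Because the mapping is a plain bijection between words and indices, decoding always recovers the original tokens exactly.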

class nalp.encoders.Word2vecEncoder

Bases: nalp.core.encoder.Encoder

A Word2vecEncoder class is responsible for learning a Word2Vec encoding and for encoding new data with it.

__init__(self)

Initialization method.

decode(self, encoded_tokens: numpy.array)

Decodes the encoding back to tokens.

Parameters

encoded_tokens – A numpy array containing the encoded tokens.

Returns

Decoded tokens.

Return type

(List[str])

encode(self, tokens: List[str])

Encodes the data into a Word2Vec representation.

Parameters

tokens – Tokens to be encoded.

learn(self, tokens: List[str], max_features: Optional[int] = 128, window_size: Optional[int] = 5, min_count: Optional[int] = 1, algorithm: Optional[bool] = 0, learning_rate: Optional[float] = 0.01, iterations: Optional[int] = 1000)

Learns a Word2Vec representation based on its methodology.

One can use CBOW or Skip-gram algorithm for the learning procedure.

Parameters
  • tokens – A list of tokens.

  • max_features – Maximum number of features to be fitted.

  • window_size – Maximum distance between current and predicted word.

  • min_count – Minimum count of words for its use.

  • algorithm – 1 for skip-gram, while 0 for CBOW.

  • learning_rate – Value of the learning rate.

  • iterations – Number of iterations.
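Unlike the integer encoder, `encode` here maps each token to a dense vector, so `decode` has to recover the nearest vocabulary word for each vector. The library-free sketch below illustrates that round trip with hand-written toy embeddings standing in for a trained model; it is a conceptual illustration, not NALP's implementation (which fits the vectors via `learn()`):

```python
import math
from typing import Dict, List

# Toy word vectors standing in for a trained Word2Vec model
# (in NALP, `learn()` would fit these from the token data).
embeddings: Dict[str, List[float]] = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

def encode(tokens: List[str]) -> List[List[float]]:
    # Looks up each token's embedding vector.
    return [embeddings[t] for t in tokens]

def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def decode(encoded_tokens: List[List[float]]) -> List[str]:
    # Recovers each vector's nearest vocabulary word by cosine similarity.
    return [
        max(embeddings, key=lambda w: cosine(embeddings[w], vec))
        for vec in encoded_tokens
    ]

vectors = encode(["cat", "car"])
assert decode(vectors) == ["cat", "car"]
```

Note that nearest-vector decoding is only exact when the queried vectors come straight from the vocabulary; arbitrary vectors decode to their closest known word, which is why Word2Vec decoding is approximate in general.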