nalp.encoders

Text or numbers? Encodings turn raw text into numerical representations, which in turn can serve as embeddings fed into neural networks. Since networks cannot read raw data, you will usually want to pre-encode your data with a well-known encoder.

An encoding package, containing encoders, decoders, and all text-to-vector necessities.

class nalp.encoders.IntegerEncoder

Bases: nalp.core.encoder.Encoder

An IntegerEncoder class is responsible for encoding text into integers.

__init__(self)

Initialization method.

decode(self, encoded_tokens: numpy.array)

Decodes the encoding back to tokens.

Parameters

encoded_tokens – A numpy array containing the encoded tokens.

Returns

Decoded tokens.

Return type

(List[str])

property decoder(self)

Decoder dictionary.

encode(self, tokens: List[str])

Encodes new tokens based on previous learning.

Parameters

tokens – A list of tokens to be encoded.

Returns

Encoded tokens.

Return type

(np.array)

learn(self, dictionary: Dict[str, Any], reverse_dictionary: Dict[str, Any])

Learns an integer vectorization encoding.

Parameters
  • dictionary – The vocabulary to index mapping.

  • reverse_dictionary – The index to vocabulary mapping.
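The `learn` → `encode` → `decode` round trip above can be sketched in plain Python. This is a minimal standalone illustration of what an integer encoder does (a vocabulary-to-index lookup and its inverse), not NALP's actual implementation; the `dictionary` and `reverse_dictionary` mappings mirror the arguments of `learn()`:

```python
from typing import Dict, List

import numpy as np

# Toy corpus and its vocabulary/index mappings, analogous to the
# `dictionary` and `reverse_dictionary` arguments of `learn()`.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
dictionary: Dict[str, int] = {word: i for i, word in enumerate(vocab)}
reverse_dictionary: Dict[int, str] = {i: word for word, i in dictionary.items()}

def encode(tokens: List[str]) -> np.ndarray:
    # Maps each token to its integer index.
    return np.array([dictionary[t] for t in tokens])

def decode(encoded_tokens: np.ndarray) -> List[str]:
    # Maps each integer index back to its token.
    return [reverse_dictionary[i] for i in encoded_tokens]

encoded = encode(tokens)
assert decode(encoded) == tokens  # lossless round trip
```

Because the mapping is a plain bijection between words and indices, decoding always recovers the original tokens exactly.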

class nalp.encoders.Word2vecEncoder

Bases: nalp.core.encoder.Encoder

A Word2vecEncoder class is responsible for learning a Word2Vec encoding and for encoding new data with it.

__init__(self)

Initialization method.

decode(self, encoded_tokens: numpy.array)

Decodes the encoding back to tokens.

Parameters

encoded_tokens – A numpy array containing the encoded tokens.

Returns

Decoded tokens.

Return type

(List[str])

encode(self, tokens: List[str])

Encodes the data into a Word2Vec representation.

Parameters

tokens – Tokens to be encoded.

learn(self, tokens: List[str], max_features: Optional[int] = 128, window_size: Optional[int] = 5, min_count: Optional[int] = 1, algorithm: Optional[bool] = 0, learning_rate: Optional[float] = 0.01, iterations: Optional[int] = 1000)

Learns a Word2Vec representation based on its methodology.

One can use CBOW or Skip-gram algorithm for the learning procedure.

Parameters
  • tokens – A list of tokens.

  • max_features – Maximum number of features to be fitted.

  • window_size – Maximum distance between current and predicted word.

  • min_count – Minimum count of words for its use.

  • algorithm – 1 for skip-gram, while 0 for CBOW.

  • learning_rate – Value of the learning rate.

  • iterations – Number of iterations.
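Unlike the integer encoder, `encode` here maps each token to a dense vector, so `decode` has to recover the nearest vocabulary word for each vector. The library-free sketch below illustrates that round trip with hand-written toy embeddings standing in for a trained model; it is a conceptual illustration, not NALP's implementation (which fits the vectors via `learn()`):

```python
import math
from typing import Dict, List

# Toy word vectors standing in for a trained Word2Vec model
# (in NALP, `learn()` would fit these from the token data).
embeddings: Dict[str, List[float]] = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}

def encode(tokens: List[str]) -> List[List[float]]:
    # Looks up each token's embedding vector.
    return [embeddings[t] for t in tokens]

def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def decode(encoded_tokens: List[List[float]]) -> List[str]:
    # Recovers each vector's nearest vocabulary word by cosine similarity.
    return [
        max(embeddings, key=lambda w: cosine(embeddings[w], vec))
        for vec in encoded_tokens
    ]

vectors = encode(["cat", "car"])
assert decode(vectors) == ["cat", "car"]
```

Note that nearest-vector decoding is only exact when the queried vectors come straight from the vocabulary; arbitrary vectors decode to their closest known word, which is why Word2Vec decoding is approximate in general.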