nalp.encoders
Text or numbers? Encodings are used to build embeddings, and embeddings are what gets fed into neural networks. Because networks cannot read raw text, you might want to pre-encode your data using well-known encoders.
An encoding package, containing encoders, decoders and all text-to-vector necessities.
- class nalp.encoders.IntegerEncoder
Bases:
nalp.core.encoder.Encoder
An IntegerEncoder class is responsible for encoding text into integers.
- __init__(self)
Initialization method.
- decode(self, encoded_tokens: numpy.array)
Decodes the encoding back to tokens.
- Parameters
encoded_tokens – A numpy array containing the encoded tokens.
- Returns
Decoded tokens.
- Return type
(List[str])
- property decoder(self)
Decoder dictionary.
- encode(self, tokens: List[str])
Encodes new tokens based on previous learning.
- Parameters
tokens – A list of tokens to be encoded.
- Returns
Encoded tokens.
- Return type
(np.array)
- learn(self, dictionary: Dict[str, Any], reverse_dictionary: Dict[str, Any])
Learns an integer vectorization encoding.
- Parameters
dictionary – The vocabulary-to-index mapping.
reverse_dictionary – The index-to-vocabulary mapping.
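The learn, encode and decode round trip described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `SketchIntegerEncoder` is a hypothetical stand-in written for this example, it uses plain Python lists where the documented methods use numpy arrays, and the way the two mappings are built from the tokens is an assumption.

```python
from typing import Dict, List


class SketchIntegerEncoder:
    """Hypothetical stand-in mirroring the learn/encode/decode interface."""

    def __init__(self) -> None:
        self.encoder: Dict[str, int] = {}
        self.decoder: Dict[int, str] = {}

    def learn(self, dictionary: Dict[str, int],
              reverse_dictionary: Dict[int, str]) -> None:
        # Store the vocabulary-to-index and index-to-vocabulary mappings.
        self.encoder = dictionary
        self.decoder = reverse_dictionary

    def encode(self, tokens: List[str]) -> List[int]:
        # Look up each token's index; unknown tokens raise KeyError here.
        return [self.encoder[token] for token in tokens]

    def decode(self, encoded_tokens: List[int]) -> List[str]:
        # Map each index back to its token.
        return [self.decoder[index] for index in encoded_tokens]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Build both mappings from the unique tokens, in first-seen order (assumption).
vocab = list(dict.fromkeys(tokens))
dictionary = {word: index for index, word in enumerate(vocab)}
reverse_dictionary = {index: word for word, index in dictionary.items()}

encoder = SketchIntegerEncoder()
encoder.learn(dictionary, reverse_dictionary)

encoded = encoder.encode(tokens)
decoded = encoder.decode(encoded)
print(encoded)
print(decoded)
```

Repeated tokens share one index, so decoding the encoded sequence recovers the original token list.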
- class nalp.encoders.Word2vecEncoder
Bases:
nalp.core.encoder.Encoder
A Word2vecEncoder class is responsible for learning a Word2Vec encoding and then using it to encode new data.
- __init__(self)
Initialization method.
- decode(self, encoded_tokens: numpy.array)
Decodes the encoding back to tokens.
- Parameters
encoded_tokens – A numpy array containing the encoded tokens.
- Returns
Decoded tokens.
- Return type
(List[str])
- encode(self, tokens: List[str])
Encodes the data into a Word2Vec representation.
- Parameters
tokens – Tokens to be encoded.
- learn(self, tokens: List[str], max_features: Optional[int] = 128, window_size: Optional[int] = 5, min_count: Optional[int] = 1, algorithm: Optional[bool] = 0, learning_rate: Optional[float] = 0.01, iterations: Optional[int] = 1000)
Learns a Word2Vec representation following the Word2Vec methodology.
Either the CBOW or the skip-gram algorithm can be used for the learning procedure.
- Parameters
tokens – A list of tokens.
max_features – Maximum number of features to be fitted.
window_size – Maximum distance between current and predicted word.
min_count – Minimum number of occurrences a word needs in order to be used.
algorithm – 1 for skip-gram, 0 for CBOW.
learning_rate – Value of the learning rate.
iterations – Number of iterations.
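To make the roles of `window_size`, `min_count` and `algorithm` more concrete, here is a minimal sketch of how training examples could be formed under each setting. It is an illustration only, not the code this class runs: the helper names `filter_by_min_count`, `skipgram_pairs` and `cbow_pairs` are hypothetical, and the window is kept fixed, whereas practical Word2Vec implementations typically shrink it at random.

```python
from collections import Counter
from typing import List, Tuple


def filter_by_min_count(tokens: List[str], min_count: int = 1) -> List[str]:
    # Drop every token that occurs fewer than min_count times.
    counts = Counter(tokens)
    return [token for token in tokens if counts[token] >= min_count]


def skipgram_pairs(tokens: List[str], window_size: int) -> List[Tuple[str, str]]:
    # Skip-gram (algorithm=1): each center word predicts each word
    # that lies at most window_size positions away.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens: List[str], window_size: int) -> List[Tuple[List[str], str]]:
    # CBOW (algorithm=0): the surrounding context words jointly
    # predict the center word.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        context = tokens[start:i] + tokens[i + 1:end]
        if context:
            pairs.append((context, center))
    return pairs


kept = filter_by_min_count(["a", "b", "c", "a"], min_count=2)
sg = skipgram_pairs(["a", "b", "c"], window_size=1)
cb = cbow_pairs(["a", "b", "c"], window_size=1)
print(kept)
print(sg)
print(cb)
```

With the same window, skip-gram yields one example per (center, context) pair, while CBOW yields one example per position.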