nalp.corpus
Every pipeline has a first step. The corpus package provides the basic classes and functions to load raw text, audio and sentences.
- class nalp.corpus.AudioCorpus(from_file: str, min_frequency: Optional[int] = 1)
Bases: nalp.core.Corpus
An AudioCorpus class defines the first step of the workflow.
It loads the raw audio, pre-processes it and creates its tokens and vocabulary.
- __init__(self, from_file: str, min_frequency: Optional[int] = 1)
Initialization method.
- Parameters
from_file – An input file to load the audio.
min_frequency – Minimum frequency of individual tokens.
- class nalp.corpus.SentenceCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)
Bases: nalp.core.Corpus
A SentenceCorpus class defines the first step of the workflow.
It loads the raw sentences, pre-processes them and creates their tokens and vocabulary.
- __init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)
Initialization method.
- Parameters
tokens – A list of tokens.
from_file – An input file to load the sentences.
corpus_type – The desired type to tokenize the sentences. Should be char or word.
min_frequency – Minimum frequency of individual tokens.
max_pad_length – Maximum length to pad the tokens.
sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.
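The two values accepted by corpus_type can be illustrated with a minimal, standalone sketch. The `tokenize` helper below is hypothetical and is not part of nalp; it only assumes that 'char' means one token per character and 'word' means whitespace-separated tokens, which may differ from the tokenizer the library actually uses.

```python
from typing import List


def tokenize(text: str, corpus_type: str = "char") -> List[str]:
    """Hypothetical helper: split text into char-level or word-level tokens."""
    if corpus_type == "char":
        # One token per character, whitespace included
        return list(text)
    if corpus_type == "word":
        # Assumption: words are separated by whitespace
        return text.split()
    raise ValueError("corpus_type should be 'char' or 'word'")


print(tokenize("to be", "char"))  # ['t', 'o', ' ', 'b', 'e']
print(tokenize("to be", "word"))  # ['to', 'be']
```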
- _build(self)
Builds the vocabulary based on the tokens.
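As a rough illustration of what building a vocabulary from tokens can involve, the sketch below collects the unique tokens and creates lookups in both directions. The function name `build_vocab` and the returned structures are assumptions made for this example, not the library's internals.

```python
from typing import Dict, List, Tuple


def build_vocab(tokens: List[str]) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Hypothetical helper: derive a vocabulary and index mappings from tokens."""
    # Sorted unique tokens give a deterministic vocabulary order
    vocab = sorted(set(tokens))
    # Token -> index and index -> token lookups
    vocab_index = {token: i for i, token in enumerate(vocab)}
    index_vocab = {i: token for i, token in enumerate(vocab)}
    return vocab, vocab_index, index_vocab


vocab, vocab_index, index_vocab = build_vocab(["b", "a", "b", "c"])
print(vocab)             # ['a', 'b', 'c']
print(vocab_index["b"])  # 1
```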
- _check_token_frequency(self)
Cuts tokens that do not meet a minimum frequency value.
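One way to apply a minimum-frequency cut is sketched below. It assumes that "cutting" means dropping every occurrence of a token whose count is below min_frequency; the library may instead replace such tokens with an unknown marker, so treat `filter_by_frequency` as a hypothetical stand-in.

```python
from collections import Counter
from typing import List


def filter_by_frequency(tokens: List[str], min_frequency: int = 1) -> List[str]:
    """Hypothetical helper: keep only tokens that occur at least min_frequency times."""
    counts = Counter(tokens)
    # Assumption: rare tokens are removed rather than replaced
    return [token for token in tokens if counts[token] >= min_frequency]


tokens = ["a", "b", "a", "c", "a", "b"]
print(filter_by_frequency(tokens, min_frequency=2))  # ['a', 'b', 'a', 'a', 'b']
```

With the default min_frequency of 1, every token occurs often enough and the list is returned unchanged.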
- _pad_token(self, max_pad_length: int, sos_eos_tokens: bool)
Pads the tokens into a fixed length.
- Parameters
max_pad_length – Maximum length to pad the tokens.
sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.
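Padding to a fixed length can be sketched as follows. The marker strings `<SOS>`, `<EOS>` and `<PAD>`, the right-side padding and the truncation of longer sentences are all assumptions made for this example; the actual markers and padding strategy used by the library may differ.

```python
from typing import List


def pad_tokens(
    sentence: List[str],
    max_pad_length: int,
    sos_eos_tokens: bool = True,
    sos: str = "<SOS>",  # hypothetical start-of-sentence marker
    eos: str = "<EOS>",  # hypothetical end-of-sentence marker
    pad: str = "<PAD>",  # hypothetical padding marker
) -> List[str]:
    """Hypothetical helper: pad (or truncate) one tokenized sentence to a fixed length."""
    padded = list(sentence)
    if sos_eos_tokens:
        # Wrap the sentence with start and end markers before padding
        padded = [sos] + padded + [eos]
    # Assumption: sentences longer than max_pad_length are cut on the right,
    # which can also cut the end-of-sentence marker
    padded = padded[:max_pad_length]
    return padded + [pad] * (max_pad_length - len(padded))


print(pad_tokens(["a", "b"], max_pad_length=6))
# ['<SOS>', 'a', 'b', '<EOS>', '<PAD>', '<PAD>']
```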
- class nalp.corpus.TextCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)
Bases: nalp.core.Corpus
A TextCorpus class defines the first step of the workflow.
It loads the raw text, pre-processes it and creates its tokens and vocabulary.
- __init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)
Initialization method.
- Parameters
tokens – A list of tokens.
from_file – An input file to load the text.
corpus_type – The desired type to tokenize the text. Should be char or word.
min_frequency – Minimum frequency of individual tokens.