nalp.corpus

Every pipeline has its first step, right? The corpus package provides the basic classes used to load raw text, audio and sentences.

A corpus package containing the basic classes and functions to load text, audio and sentences.

class nalp.corpus.AudioCorpus(from_file: str, min_frequency: Optional[int] = 1)

Bases: nalp.core.Corpus

An AudioCorpus class is used to define the first step of the workflow.

It serves to load the raw audio, pre-process it and create its tokens and vocabulary.

__init__(self, from_file: str, min_frequency: Optional[int] = 1)

Initialization method.

Parameters

from_file – An input file to load the audio.
min_frequency – Minimum frequency of individual tokens.

class nalp.corpus.SentenceCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)

Bases: nalp.core.Corpus

A SentenceCorpus class is used to define the first step of the workflow.

It serves to load the raw sentences, pre-process them and create their tokens and vocabulary.

__init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)

Initialization method.

Parameters

tokens – A list of tokens.
from_file – An input file to load the sentences.
corpus_type – The desired type to tokenize the sentences. Should be char or word.
min_frequency – Minimum frequency of individual tokens.
max_pad_length – Maximum length to pad the tokens.
sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.
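To make the `corpus_type` choice concrete, the following is a minimal sketch of sentence-level tokenization. It is not the library's implementation: the helper name `tokenize_sentences` is hypothetical, and it assumes one sentence per line and whitespace splitting for `word`.

```python
def tokenize_sentences(text, corpus_type="char"):
    """Split raw text into sentences and tokenize each one separately."""
    # Assumption: one sentence per line; blank lines are ignored.
    sentences = [line for line in text.splitlines() if line.strip()]

    if corpus_type == "char":
        # Every character, including spaces, becomes a token.
        return [list(sentence) for sentence in sentences]
    if corpus_type == "word":
        # Whitespace-separated words become tokens.
        return [sentence.split() for sentence in sentences]
    raise ValueError("corpus_type should be 'char' or 'word'")


raw = "hi there\nbye"
word_tokens = tokenize_sentences(raw, corpus_type="word")
char_tokens = tokenize_sentences(raw, corpus_type="char")
print(word_tokens)
print(char_tokens)
```

With `word`, each sentence yields a short list of words; with `char`, the same sentence yields a much longer list drawn from a much smaller set of distinct tokens.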

_build(self)

Builds the vocabulary based on the tokens.
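As an illustration of what building a vocabulary from tokens can look like, here is a small sketch. The function and variable names (`build_vocab`, `vocab_index`, `index_vocab`) are assumptions made for this example and are not taken from the documentation above.

```python
def build_vocab(tokens):
    """Return the sorted vocabulary and both lookup tables for a flat token list."""
    # Sorting the unique tokens keeps the index assignment deterministic.
    vocab = sorted(set(tokens))
    # Token -> integer index, used when encoding.
    vocab_index = {token: i for i, token in enumerate(vocab)}
    # Integer index -> token, used when decoding.
    index_vocab = {i: token for i, token in enumerate(vocab)}
    return vocab, vocab_index, index_vocab


vocab, vocab_index, index_vocab = build_vocab(list("hello"))
print(vocab)             # ['e', 'h', 'l', 'o']
print(vocab_index["l"])  # 2
```

For nested tokens (one list per sentence), the lists would first need to be flattened, for example with `itertools.chain.from_iterable`.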

_check_token_frequency(self)

Cuts tokens that do not meet a minimum frequency value.
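A frequency cut of this kind might be sketched as follows. This is an illustration under assumptions: the hypothetical helper `cut_rare_tokens` counts tokens across all sentences and simply drops the rare ones. Whether the library drops them or maps them to an unknown-token placeholder is not stated here.

```python
from collections import Counter
from itertools import chain


def cut_rare_tokens(sentences, min_frequency=1):
    """Drop every token that occurs fewer than `min_frequency` times overall."""
    # Count occurrences across all sentences, not per sentence.
    counts = Counter(chain.from_iterable(sentences))
    return [
        [token for token in sentence if counts[token] >= min_frequency]
        for sentence in sentences
    ]


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
filtered = cut_rare_tokens(sentences, min_frequency=2)
print(filtered)  # [['the', 'sat'], ['the', 'sat']]
```

With the default `min_frequency=1`, every token that appears at all is kept, so the cut is a no-op.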

_pad_token(self, max_pad_length: int, sos_eos_tokens: bool)

Pads the tokens into a fixed length.

Parameters

max_pad_length – Maximum length to pad the tokens.
sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.
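The padding step can be pictured with the sketch below. The marker strings (`<SOS>`, `<EOS>`, `<PAD>`), the helper name `pad_tokens`, the fallback to the longest sentence when `max_pad_length` is `None`, and the truncation of overlong sentences are all assumptions made for illustration.

```python
SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"  # hypothetical marker strings


def pad_tokens(sentences, max_pad_length=None, sos_eos_tokens=True):
    """Pad (or truncate) every sentence to the same fixed length."""
    if sos_eos_tokens:
        # Wrap each sentence before measuring, so the markers count toward the length.
        sentences = [[SOS] + sentence + [EOS] for sentence in sentences]

    if max_pad_length is None:
        # Assumption: fall back to the longest sentence in the batch.
        max_pad_length = max((len(sentence) for sentence in sentences), default=0)

    padded = []
    for sentence in sentences:
        # Truncate overlong sentences, then fill the remainder with PAD.
        sentence = sentence[:max_pad_length]
        padded.append(sentence + [PAD] * (max_pad_length - len(sentence)))
    return padded


padded = pad_tokens([["a", "b", "c"], ["d"]], max_pad_length=5)
for row in padded:
    print(row)
```

Note that in this sketch a sentence longer than `max_pad_length` loses its trailing tokens, including the end-of-sentence marker.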

class nalp.corpus.TextCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)

Bases: nalp.core.Corpus

A TextCorpus class is used to define the first step of the workflow.

It serves to load the raw text, pre-process it and create its tokens and vocabulary.

__init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)

Initialization method.

Parameters

tokens – A list of tokens.
from_file – An input file to load the text.
corpus_type – The desired type to tokenize the text. Should be char or word.
min_frequency – Minimum frequency of individual tokens.
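To show how these four arguments could interact, the sketch below defines a hypothetical stand-in class, `ToyTextCorpus`, that mirrors the constructor signature. It is not the library's code. It assumes that explicit `tokens` take precedence over `from_file`, that `word` means whitespace splitting, and that rare tokens are dropped.

```python
import os
import tempfile
from collections import Counter


class ToyTextCorpus:
    """Hypothetical stand-in mirroring the TextCorpus constructor arguments."""

    def __init__(self, tokens=None, from_file=None, corpus_type="char", min_frequency=1):
        if tokens is None:
            # No tokens supplied: load the raw text and tokenize it.
            if from_file is None:
                raise ValueError("Provide either `tokens` or `from_file`.")
            with open(from_file, encoding="utf-8") as f:
                text = f.read()
            if corpus_type == "char":
                tokens = list(text)
            elif corpus_type == "word":
                tokens = text.split()
            else:
                raise ValueError("corpus_type should be 'char' or 'word'")

        # Keep only tokens that meet the minimum frequency, then build the vocabulary.
        counts = Counter(tokens)
        self.tokens = [t for t in tokens if counts[t] >= min_frequency]
        self.vocab = sorted(set(self.tokens))


# Write a small sample file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("to be or not to be")
    path = f.name

word_corpus = ToyTextCorpus(from_file=path, corpus_type="word", min_frequency=2)
char_corpus = ToyTextCorpus(tokens=list("abca"))
os.remove(path)

print(word_corpus.tokens)  # ['to', 'be', 'to', 'be']
print(char_corpus.vocab)   # ['a', 'b', 'c']
```

Passing pre-computed `tokens` skips file loading and tokenization entirely, which is convenient when the text has already been processed elsewhere.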