nalp.corpus

Every pipeline has its first step, right? The corpus package provides the basic classes for loading raw text, audio and sentences.

A corpus package containing the basic classes and functions to load text, audio and sentences.

class nalp.corpus.AudioCorpus(from_file: str, min_frequency: Optional[int] = 1)

Bases: nalp.core.Corpus

An AudioCorpus class is used to define the first step of the workflow.

It loads the raw audio, pre-processes it and creates its tokens and vocabulary.

__init__(self, from_file: str, min_frequency: Optional[int] = 1)

Initialization method.

Parameters
  • from_file – An input file to load the audio.

  • min_frequency – Minimum frequency of individual tokens.

class nalp.corpus.SentenceCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)

Bases: nalp.core.Corpus

A SentenceCorpus class is used to define the first step of the workflow.

It serves to load the raw sentences, pre-process them and create their tokens and vocabulary.

__init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1, max_pad_length: Optional[int] = None, sos_eos_tokens: Optional[bool] = True)

Initialization method.

Parameters
  • tokens – A list of tokens.

  • from_file – An input file to load the sentences.

  • corpus_type – The desired type to tokenize the sentences. Should be 'char' or 'word'.

  • min_frequency – Minimum frequency of individual tokens.

  • max_pad_length – Maximum length to pad the tokens.

  • sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.

_build(self)

Builds the vocabulary based on the tokens.
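The idea of building a vocabulary from tokens can be sketched as below. This is a minimal illustration, not the library's implementation: the helper name build_vocabulary and the choice of a sorted vocabulary with two lookup tables are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_vocabulary(tokens: List[str]) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Hypothetical helper: derives a vocabulary and its lookup tables from tokens."""
    # Unique tokens, sorted so that indices are reproducible between runs
    vocab = sorted(set(tokens))

    # Token-to-index and index-to-token mappings
    vocab_index = {token: i for i, token in enumerate(vocab)}
    index_vocab = {i: token for i, token in enumerate(vocab)}

    return vocab, vocab_index, index_vocab


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab, vocab_index, index_vocab = build_vocabulary(tokens)

# Encodes the tokens as integer indices and decodes them back
encoded = [vocab_index[token] for token in tokens]
decoded = [index_vocab[i] for i in encoded]
```

The two mappings make the encoding reversible, which is what a downstream model usually needs when converting predictions back to text.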

_check_token_frequency(self)

Cuts tokens that do not meet a minimum frequency value.
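A frequency cut of this kind might look like the sketch below. It assumes that sentences are held as lists of tokens, that frequencies are counted across all sentences, and that rare tokens are dropped; replacing them with an unknown-token marker would be an equally plausible design. The name cut_rare_tokens is hypothetical.

```python
from collections import Counter
from itertools import chain
from typing import List


def cut_rare_tokens(sentences: List[List[str]], min_frequency: int = 1) -> List[List[str]]:
    """Hypothetical helper: drops tokens occurring fewer than min_frequency times."""
    # Counts every token over the whole corpus, not per sentence
    counts = Counter(chain.from_iterable(sentences))

    # Keeps only the tokens that meet the minimum frequency (assumption: drop, not replace)
    return [[t for t in sentence if counts[t] >= min_frequency] for sentence in sentences]


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
filtered = cut_rare_tokens(sentences, min_frequency=2)
```

With min_frequency=1, the default shown in the signatures above, every token is kept.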

_pad_token(self, max_pad_length: int, sos_eos_tokens: bool)

Pads the tokens into a fixed length.

Parameters
  • max_pad_length – Maximum length to pad the tokens.

  • sos_eos_tokens – Whether start-of-sentence and end-of-sentence tokens should be used.
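Padding to a fixed length can be sketched as follows. The marker strings "<SOS>", "<EOS>" and "<PAD>", the helper name pad_sentence, and the decision to truncate the sentence body so the markers always fit are all assumptions for illustration; the library may order these steps differently.

```python
from typing import List

# Hypothetical marker strings, chosen only for this example
SOS, EOS, PAD = "<SOS>", "<EOS>", "<PAD>"


def pad_sentence(tokens: List[str], max_pad_length: int, sos_eos_tokens: bool = True) -> List[str]:
    """Hypothetical helper: returns exactly max_pad_length tokens."""
    # Reserves two positions when start/end markers are requested
    reserved = 2 if sos_eos_tokens else 0
    if max_pad_length < reserved:
        raise ValueError("max_pad_length is too small to hold the SOS and EOS markers")

    # Truncates the sentence body so that the markers still fit
    body = tokens[: max_pad_length - reserved]
    if sos_eos_tokens:
        body = [SOS] + body + [EOS]

    # Fills the remaining positions with the padding marker
    return body + [PAD] * (max_pad_length - len(body))


short = pad_sentence(["hi", "there"], max_pad_length=6)
long = pad_sentence(["a", "b", "c", "d"], max_pad_length=4)
plain = pad_sentence(["a", "b"], max_pad_length=4, sos_eos_tokens=False)
```

Every sentence then has the same length, so a batch of sentences can be stacked into a rectangular array.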

class nalp.corpus.TextCorpus(tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)

Bases: nalp.core.Corpus

A TextCorpus class is used to define the first step of the workflow.

It loads the raw text, pre-processes it and creates its tokens and vocabulary.

__init__(self, tokens: Optional[List[str]] = None, from_file: Optional[str] = None, corpus_type: Optional[str] = 'char', min_frequency: Optional[int] = 1)

Initialization method.

Parameters
  • tokens – A list of tokens.

  • from_file – An input file to load the text.

  • corpus_type – The desired type to tokenize the text. Should be 'char' or 'word'.

  • min_frequency – Minimum frequency of individual tokens.
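The difference between the two corpus_type values can be illustrated with a short sketch. It does not call the library: the tokenize function is hypothetical, and whitespace splitting stands in for whatever word tokenizer the package actually uses.

```python
from typing import List


def tokenize(text: str, corpus_type: str = "char") -> List[str]:
    """Hypothetical helper: splits text into character-level or word-level tokens."""
    if corpus_type == "char":
        # Every character, including spaces, becomes a token
        return list(text)
    if corpus_type == "word":
        # Assumption: plain whitespace splitting instead of a real word tokenizer
        return text.split()
    raise ValueError("corpus_type should be 'char' or 'word'")


chars = tokenize("to be", corpus_type="char")
words = tokenize("to be", corpus_type="word")
```

Character-level tokens give a small vocabulary and long sequences, while word-level tokens give the opposite trade-off.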