mindmeld.models.nn_utils.input_encoders module¶
This module consists of encoders whose outputs serve as inputs to pytorch modules
-
class
mindmeld.models.nn_utils.input_encoders.
AbstractEncoder
(**kwargs)[source]¶ Bases:
abc.ABC
Defines a stateful tokenizer. Unlike the tokenizer in the text_preparation_pipeline, tokenizers derived from this abstract class have a state, such as a vocabulary or a trained/pretrained model, that is used for encoding an input text string into a sequence of ids or a sequence of embeddings. These outputs are consumed by the initial layers of neural nets.
-
batch_encode
(examples: List[str], padding_length: int = None, add_terminals: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]¶ Method that encodes a list of texts into a list of sequences of ids
Parameters: - examples – List of text strings that will be encoded as a batch
- padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are truncated. If not specified, the maximum length of the tokenized examples is used as padding_length.
- add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns: A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.
Return type: BatchData
- Special note on add_terminals when used for sequence classification:
- This flag can be True or False in general. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to add terminal tokens along with pad tokens. Hence, the default value of add_terminals is False for encoders built on top of AbstractVocabLookupEncoder and True for Huggingface-based ones. For encoders based on AbstractVocabLookupEncoder, either value works for sequence classification.
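A minimal usage sketch, assuming a concrete subclass such as WhitespaceEncoder (documented below) can be constructed without arguments; the exact keys on the returned BatchData depend on the encoder:

```python
from mindmeld.models.nn_utils.input_encoders import WhitespaceEncoder

# Fit the encoder's vocabulary on training texts, then encode a new batch.
encoder = WhitespaceEncoder()
encoder.prepare(examples=["set an alarm", "cancel my alarm at 6 am"])

batch_data = encoder.batch_encode(
    examples=["set an alarm at 7 am", "cancel it"],
    padding_length=8,      # optional; defaults to the longest tokenized example
    add_terminals=False,   # default for vocab-lookup encoders
)
# batch_data is a BatchData (dictionary-like) holding the tensor inputs.
```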
-
dump
(path: str)[source]¶ Method that dumps the state (if any) of the tokenizer
Parameters: path – The folder where the state has to be dumped
-
get_pad_token_idx
() → Union[None, int][source]¶ Returns the index of the padding token in the vocab if one exists, else returns None; useful when initializing an embedding layer.
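For instance, the returned index is typically passed to an embedding layer. A sketch continuing the example above; using id2token as a vocabulary-size proxy is an assumption:

```python
import torch.nn as nn

pad_idx = encoder.get_pad_token_idx()   # None if the vocab has no pad token
vocab_size = len(encoder.id2token)      # assumption: id2token maps id -> token
embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=128,                  # arbitrary illustrative dimension
    padding_idx=pad_idx,                # nn.Embedding accepts None here
)
```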
-
load
(path: str)[source]¶ Method that loads the state (if any) of the tokenizer
Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump their state with the same file names, hence a folder path is used rather than a file name.
-
prepare
(examples: List[str])[source]¶ Method that fits the tokenizer and creates a state that can be dumped or used for encoding
Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
-
number_of_terminal_tokens
¶ Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
-
-
class
mindmeld.models.nn_utils.input_encoders.
AbstractHuggingfaceTrainableEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractEncoder
Abstract class, derived from AbstractEncoder, that is based on Huggingface’s tokenizers library for creating the stateful tokenizer model.
Reference: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html
-
batch_encode
(examples: List[str], padding_length: int = None, add_terminals: bool = True, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]¶ Example of the underlying Huggingface batch encoding:
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
Passing the argument padding_length to set the maximum length for batch encoding is not yet available for Huggingface tokenizers.
-
dump
(path: str)[source]¶ Method that dumps the state (if any) of the tokenizer
Parameters: path – The folder where the state has to be dumped
-
get_pad_token_idx
() → int[source]¶ Returns the index of the padding token in the vocab if one exists, else returns None; useful when initializing an embedding layer.
-
load
(path: str)[source]¶ Method that loads the state (if any) of the tokenizer
Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump their state with the same file names, hence a folder path is used rather than a file name.
-
prepare
(examples: List[str])[source]¶ References:
- Huggingface: tutorials/python/training_from_memory.html @ https://tinyurl.com/6hxrtspa
- https://huggingface.co/docs/tokenizers/python/latest/index.html
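The referenced training-from-memory tutorial reduces to roughly the following sketch, using the tokenizers library directly; it illustrates the underlying library calls rather than this class's exact internals:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

examples = ["Hello, y'all!", "How are you ?"]

# Build and train a BPE tokenizer from an in-memory iterator instead of files.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train_from_iterator(examples, trainer=trainer)

output = tokenizer.encode_batch(examples)
print(output[0].tokens)
```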
-
SPECIAL_TOKENS
= ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']¶
-
number_of_terminal_tokens
¶ Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
-
-
class
mindmeld.models.nn_utils.input_encoders.
AbstractVocabLookupEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractEncoder
Abstract class wrapped around AbstractEncoder that has a vocabulary lookup as the state.
-
batch_encode
(examples: List[str], padding_length: int = None, add_terminals: bool = False, _return_tokenized_examples: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]¶ Method that encodes a list of texts into a list of sequences of ids
Parameters: - examples – List of text strings that will be encoded as a batch
- padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are truncated. If not specified, the maximum length of the tokenized examples is used as padding_length.
- add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns: A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.
Return type: BatchData
- Special note on add_terminals when used for sequence classification:
- This flag can be True or False in general. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to add terminal tokens along with pad tokens. Hence, the default value of add_terminals is False for encoders built on top of AbstractVocabLookupEncoder and True for Huggingface-based ones. For encoders based on AbstractVocabLookupEncoder, either value works for sequence classification.
-
dump
(path: str)[source]¶ Method that dumps the state (if any) of the tokenizer
Parameters: path – The folder where the state has to be dumped
-
load
(path: str)[source]¶ Method that loads the state (if any) of the tokenizer
Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump their state with the same file names, hence a folder path is used rather than a file name.
-
prepare
(examples: List[str])[source]¶ Method that fits the tokenizer and creates a state that can be dumped or used for encoding
Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
-
SPECIAL_TOKENS_DICT
= {'end_token': '<END>', 'pad_token': '<PAD>', 'start_token': '<START>', 'unk_token': '<UNK>'}¶
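Conceptually, a vocab-lookup encoding with add_terminals=True wraps each tokenized example in the start/end tokens above and pads to a common length. A hand-worked illustration with a made-up vocabulary (the ids are illustrative, not actual library output):

```python
# Hypothetical vocabulary produced by prepare(); ids are illustrative only.
vocab = {"<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3, "set": 4, "an": 5, "alarm": 6}

# "set an alarm" with add_terminals=True and padding_length=6 becomes:
#   <START>  set  an  alarm  <END>  <PAD>
#      1      4    5    6      2      0
encoded = [1, 4, 5, 6, 2, 0]
```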
-
id2token
¶
-
number_of_terminal_tokens
¶ Returns the (maximum) number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
-
-
class
mindmeld.models.nn_utils.input_encoders.
BytePairEncodingEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractHuggingfaceTrainableEncoder
Encoder that fits a BPE model based on the input examples
-
class
mindmeld.models.nn_utils.input_encoders.
CharEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder
A simple tokenizer that tokenizes at the character level
-
class
mindmeld.models.nn_utils.input_encoders.
HuggingfacePretrainedEncoder
(pretrained_model_name_or_path=None, **kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractEncoder
-
batch_encode
(examples: List[str], padding_length: int = None, add_terminals: bool = True, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]¶ Method that encodes a list of texts into a list of sequences of ids
Parameters: - examples – List of text strings that will be encoded as a batch
- padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are truncated. If not specified, the maximum length of the tokenized examples is used as padding_length.
- add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns: A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.
Return type: BatchData
- Special note on add_terminals when used for sequence classification:
- This flag can be True or False in general. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to add terminal tokens along with pad tokens. Hence, the default value of add_terminals is False for encoders built on top of AbstractVocabLookupEncoder and True for Huggingface-based ones. For encoders based on AbstractVocabLookupEncoder, either value works for sequence classification.
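This encoder's constructor takes pretrained_model_name_or_path, so it is presumably backed by a pretrained Huggingface tokenizer. The terminal/pad behaviour described above can be seen by calling the transformers library directly; this is a sketch of the underlying library, not of this class's internals:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    ["set an alarm", "cancel my alarm at 6 am"],
    padding=True,       # pad the batch to a common length
    truncation=True,
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# ['[CLS]', 'set', 'an', 'alarm', '[SEP]', '[PAD]', ...]
```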
-
dump
(path: str)[source]¶ Method that dumps the state (if any) of the tokenizer
Parameters: path – The folder where the state has to be dumped
-
get_pad_token_idx
() → int[source]¶ Returns the index of the padding token in the vocab if one exists, else returns None; useful when initializing an embedding layer.
-
load
(path: str)[source]¶ Method that loads the state (if any) of the tokenizer
Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump their state with the same file names, hence a folder path is used rather than a file name.
-
prepare
(examples: List[str])[source]¶ Method that fits the tokenizer and creates a state that can be dumped or used for encoding
Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
-
number_of_terminal_tokens
¶ Overrides the parent class’ definition of the number of terminal tokens.
-
-
class
mindmeld.models.nn_utils.input_encoders.
InputEncoderFactory
[source]¶ Bases:
object
-
TOKENIZER_NAME_TO_CLASS
= mapping from TokenizerType to encoder class:
- TokenizerType.WHITESPACE_TOKENIZER ('whitespace-tokenizer') → WhitespaceEncoder
- TokenizerType.CHAR_TOKENIZER ('char-tokenizer') → CharEncoder
- TokenizerType.WHITESPACE_AND_CHAR_DUAL_TOKENIZER ('whitespace_and_char-tokenizer') → WhitespaceAndCharDualEncoder
- TokenizerType.BPE_TOKENIZER ('bpe-tokenizer') → BytePairEncodingEncoder
- TokenizerType.WORDPIECE_TOKENIZER ('wordpiece-tokenizer') → WordPieceEncoder
- TokenizerType.HUGGINGFACE_PRETRAINED_TOKENIZER ('huggingface_pretrained-tokenizer') → HuggingfacePretrainedEncoder¶
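For reference, the mapping can be used to resolve a configured tokenizer name to its encoder class. A sketch; the import path of the TokenizerType enum (assumed here to be mindmeld.models.nn_utils.helpers) may differ:

```python
# Assumption: TokenizerType is the enum keying the mapping above; adjust the
# import path if it lives elsewhere.
from mindmeld.models.nn_utils.helpers import TokenizerType
from mindmeld.models.nn_utils.input_encoders import InputEncoderFactory

encoder_cls = InputEncoderFactory.TOKENIZER_NAME_TO_CLASS[TokenizerType.CHAR_TOKENIZER]
encoder = encoder_cls()  # CharEncoder instance
```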
-
-
class
mindmeld.models.nn_utils.input_encoders.
WhitespaceAndCharDualEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder
-
batch_encode
(examples: List[str], char_padding_length: int = None, char_add_terminals: bool = True, add_terminals: bool = False, _return_tokenized_examples: bool = False, **kwargs) → mindmeld.models.nn_utils.helpers.BatchData[source]¶ Method that encodes a list of texts into a list of sequences of ids
Parameters: - examples – List of text strings that will be encoded as a batch
- padding_length – The maximum length of each encoded input. Sequences shorter than this length are padded to padding_length; longer sequences are truncated. If not specified, the maximum length of the tokenized examples is used as padding_length.
- add_terminals – A boolean flag that determines whether terminal special tokens are added to the tokenized examples.
Returns: A dictionary-like object for the supplied batch of data, consisting of various tensor inputs to the neural computation graph as well as any other inputs required during the forward computation.
Return type: BatchData
- Special note on add_terminals when used for sequence classification:
- This flag can be True or False in general. Setting it to False leads to errors with Huggingface tokenizers, as they are generally built to add terminal tokens along with pad tokens. Hence, the default value of add_terminals is False for encoders built on top of AbstractVocabLookupEncoder and True for Huggingface-based ones. For encoders based on AbstractVocabLookupEncoder, either value works for sequence classification.
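A call sketch for the dual (word- plus character-level) arguments, assuming the encoder can be constructed without arguments; the argument values are illustrative:

```python
from mindmeld.models.nn_utils.input_encoders import WhitespaceAndCharDualEncoder

encoder = WhitespaceAndCharDualEncoder()
encoder.prepare(examples=["set an alarm", "cancel my alarm"])

batch_data = encoder.batch_encode(
    examples=["set an alarm at 6 am"],
    char_padding_length=10,    # per-word character padding length
    char_add_terminals=True,   # char-level terminals (default True)
    add_terminals=False,       # word-level terminals (default False)
)
```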
-
dump
(path: str)[source]¶ Method that dumps the state (if any) of the tokenizer
Parameters: path – The folder where the state has to be dumped
-
get_char_pad_token_idx
() → Union[None, int][source]¶ Returns the index of the char padding token in the vocab if one exists, else returns None; useful when initializing an embedding layer.
-
load
(path: str)[source]¶ Method that loads the state (if any) of the tokenizer
Parameters: path – The folder where the dumped state can be found. Not all tokenizers dump their state with the same file names, hence a folder path is used rather than a file name.
-
prepare
(examples: List[str])[source]¶ Method that fits the tokenizer and creates a state that can be dumped or used for encoding
Parameters: examples – List of text strings that will be used for creating the state of the tokenizer
-
SPECIAL_CHAR_TOKENS_DICT
= {'char_end_token': '<CHAR_END>', 'char_pad_token': '<CHAR_PAD>', 'char_start_token': '<CHAR_START>', 'char_unk_token': '<CHAR_UNK>'}¶
-
char_id2token
¶
-
number_of_char_terminal_tokens
¶ Returns the number of char terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
-
number_of_terminal_tokens
¶ Returns the number of terminal tokens used by the encoder during batch encoding when add_terminals is set to True.
-
-
class
mindmeld.models.nn_utils.input_encoders.
WhitespaceEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractVocabLookupEncoder
Encoder that tokenizes at whitespace boundaries. Not suitable for languages that do not delimit words with whitespace, such as Chinese.
-
class
mindmeld.models.nn_utils.input_encoders.
WordPieceEncoder
(**kwargs)[source]¶ Bases:
mindmeld.models.nn_utils.input_encoders.AbstractHuggingfaceTrainableEncoder
Encoder that fits a WordPiece model based on the input examples