mindmeld.models.nn_utils.helpers module¶
Default params used by various sequence and token classification classes

class mindmeld.models.nn_utils.helpers.BatchData(**kwargs)[source]¶
Bases: mindmeld.core.Bunch

A dictionary-like object that exposes its keys as attributes and holds the various inputs and outputs of neural models related to a batch of data, such as tensor encodings, lengths of inputs, etc.
Following is a description of the keys that serve as inputs to neural models:

- seq_lengths: Number of tokens in each example before adding padding tokens. The number also includes terminal tokens if they are added before padding. If using an encoder that splits words into sub-words, seq_lengths still refers to the number of words (rather than the number of sub-words) along with any added terminal tokens; this number is useful for token classifiers, which require token-level (i.e. word-level) outputs, as well as for sequence classifier models such as LSTMs.
- split_lengths: The length of each subgroup (i.e. group of sub-words) in each example. By definition, it does not include any terminal tokens in its counts. This can be seen as fine-grained information complementing the seq_lengths values for encoders with sub-word tokenization. It is again useful for token classifiers, to flexibly choose between the representation of the first sub-word or a mean/max pool of the sub-words' representations when deriving word-level representations. For lookup-table-based encoders where words are not broken into sub-words, split_lengths is simply a sequence of ones whose sum equals the number of words without terminal and padding tokens.
- seq_ids (only for non-pretrained models that require training an embedding layer): The encoded ids used for embedding lookup, including terminal special tokens if requested, and with padding.
- attention_masks (only for huggingface trainable encoders): Boolean flags corresponding to each id in seq_ids, set to 0 for padding tokens and 1 otherwise.
- hgf_encodings (only for huggingface pretrained encoders): A dict of outputs from a pretrained language model encoder from Huggingface (hgf for short).
- char_seq_ids (only for dual tokenizers): Similar to seq_ids but produced by a char tokenizer in case of dual tokenization.
- char_seq_lengths (only for dual tokenizers): Similar to seq_lengths but produced by a char tokenizer in case of dual tokenization. Like seq_lengths, this also includes terminal special tokens from the char vocab in the length count whenever they are added.
Following is a description of the keys that are output by neural models (a usage sketch follows this list):

- seq_embs: The embeddings produced before the final classification (dense) layers by sequence-classification classes (generally of shape [batch_size, emb_dim]).
- token_embs: The embeddings produced before the final classification (dense) layers by token-classification classes (generally of shape [batch_size, seq_length, emb_dim]).
- logits: Classification scores (before softmax).
- loss: Classification loss object.
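Because BatchData inherits from Bunch, its keys can be read and written interchangeably as dictionary entries or as attributes. A minimal sketch of that behavior, with made-up tensor values (the exact keys present depend on the encoder in use; torch is assumed to be available):

    import torch
    from mindmeld.models.nn_utils.helpers import BatchData

    # Illustrative batch of 2 examples padded to length 5 (ids are made up).
    batch = BatchData(
        seq_ids=torch.tensor([[101, 7592, 2088, 102, 0],
                              [101, 2748, 102, 0, 0]]),
        seq_lengths=torch.tensor([4, 3]),        # word counts incl. terminal tokens
        attention_masks=torch.tensor([[1, 1, 1, 1, 0],
                                      [1, 1, 1, 0, 0]]),
    )

    # Dictionary-style and attribute-style access refer to the same values.
    assert batch["seq_ids"] is batch.seq_ids

    # Outputs can be attached to the same object; illustrative logits of shape
    # [batch_size, num_classes] here, read back as an attribute.
    batch["logits"] = torch.randn(2, 3)
    print(batch.logits.shape)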

class mindmeld.models.nn_utils.helpers.ClassificationType[source]¶
Bases: enum.Enum
An enumeration.

TAGGER = 'tagger'¶
TEXT = 'text'¶

class mindmeld.models.nn_utils.helpers.EmbedderType[source]¶
Bases: enum.Enum
An enumeration.

BERT = 'bert'¶
GLOVE = 'glove'¶
NONE = None¶

class mindmeld.models.nn_utils.helpers.SequenceClassificationType[source]¶
Bases: enum.Enum
An enumeration.

CNN = 'cnn'¶
EMBEDDER = 'embedder'¶
LSTM = 'lstm'¶

class mindmeld.models.nn_utils.helpers.TokenClassificationType[source]¶
Bases: enum.Enum
An enumeration.

CNN_LSTM = 'cnn-lstm'¶
EMBEDDER = 'embedder'¶
LSTM = 'lstm-pytorch'¶
LSTM_LSTM = 'lstm-lstm'¶

class mindmeld.models.nn_utils.helpers.TokenizerType[source]¶
Bases: enum.Enum
An enumeration.

BPE_TOKENIZER = 'bpe-tokenizer'¶
CHAR_TOKENIZER = 'char-tokenizer'¶
HUGGINGFACE_PRETRAINED_TOKENIZER = 'huggingface_pretrained-tokenizer'¶
WHITESPACE_AND_CHAR_DUAL_TOKENIZER = 'whitespace_and_char-tokenizer'¶
WHITESPACE_TOKENIZER = 'whitespace-tokenizer'¶
WORDPIECE_TOKENIZER = 'wordpiece-tokenizer'¶

class mindmeld.models.nn_utils.helpers.ValidationMetricType[source]¶
Bases: enum.Enum
An enumeration.

ACCURACY = 'accuracy'¶
F1 = 'f1'¶
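These enums simply map symbolic member names to the string values listed above, which is handy when translating between config strings and members. A small sketch using standard enum.Enum behavior:

    from mindmeld.models.nn_utils.helpers import TokenizerType, ValidationMetricType

    # Look up a member from its configured string value, or read the value back out.
    assert TokenizerType("whitespace-tokenizer") is TokenizerType.WHITESPACE_TOKENIZER
    assert TokenizerType.CHAR_TOKENIZER.value == "char-tokenizer"
    assert ValidationMetricType.F1.value == "f1"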

mindmeld.models.nn_utils.helpers.get_default_params(class_name: str)[source]¶
Returns all the default params for the given class name.

Parameters: class_name (str) -- A (child) class name from sequence_classification.py or token_classification.py
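A minimal sketch of calling this helper; the class name shown is a hypothetical placeholder and should be replaced with an actual (child) class name from sequence_classification.py or token_classification.py:

    from mindmeld.models.nn_utils.helpers import get_default_params

    # "LstmSequenceClassification" is a hypothetical name used only for illustration.
    defaults = get_default_params("LstmSequenceClassification")
    print(defaults)  # the default params registered for that class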

mindmeld.models.nn_utils.helpers.get_disk_space_of_model(pytorch_module)[source]¶
Returns the disk space of a pytorch module in MB. This includes all weights (trainable and non-trainable) of the module.

Parameters: pytorch_module -- a pytorch neural network module derived from torch.nn.Module
Returns: The size of the model when dumped
Return type: size (float)
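A minimal sketch, assuming torch is installed; any torch.nn.Module can be passed:

    import torch.nn as nn
    from mindmeld.models.nn_utils.helpers import get_disk_space_of_model

    module = nn.Linear(256, 128)               # any torch.nn.Module works here
    size_mb = get_disk_space_of_model(module)  # float, in MB
    print(f"{size_mb:.2f} MB")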

mindmeld.models.nn_utils.helpers.get_num_weights_of_model(pytorch_module)[source]¶
Returns the number of trainable parameters and the total number of parameters in a pytorch module. Returning both allows a sanity check of whether layers that are meant to be frozen are actually being trained.

Parameters: pytorch_module -- a pytorch neural network module derived from torch.nn.Module
Returns: A tuple of the number of trainable params and the total number of params of the pytorch module
Return type: number_of_params (tuple)
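A minimal sketch of the sanity check described above, assuming torch is installed; the frozen embedding layer should be excluded from the trainable count:

    import torch.nn as nn
    from mindmeld.models.nn_utils.helpers import get_num_weights_of_model

    module = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 8))
    module[0].weight.requires_grad = False     # freeze the embedding layer

    n_trainable, n_total = get_num_weights_of_model(module)
    assert n_trainable < n_total               # frozen weights count only toward the total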