Deep Neural Networks in MindMeld

Conversational AI, and Natural Language Processing more generally, have seen a boost in performance across a variety of tasks through the use of deep learning. In particular, deep neural models based on Convolutional neural networks (CNNs), Long short-term memory networks (LSTMs), and Transformer architectures have been widely adopted over more traditional approaches to NLP, to great success. MindMeld now extends its suite of traditional machine learning models (e.g. logistic regression, decision trees) with a variety of deep neural models and an array of configurable parameters.

Users can now train and use deep neural models for domain classification and intent classification (i.e., sequence classification) as well as for entity recognition (i.e., token classification) tasks.

Note

These models are implemented using the Pytorch framework and thus require extra installation steps before you can use them in your chatbot application. Make sure to install the Pytorch requirement by running the following in the shell:

pip install mindmeld[torch]

MindMeld supports pretrained transformer models such as BERT through the popular Huggingface Transformers library. Several pretrained models from the Huggingface Models Hub, suited to sequence classification or token classification, can be employed in your chatbot application.

Note

To use pretrained transformer models, install the extra transformers requirement by running in the shell:

pip install mindmeld[transformers]

Before proceeding to use the deep neural models, consider the following possible advantages and disadvantages of using them in place of traditional machine learning models.

  • Better overall performance on larger training sets. Deep models generally outperform traditional machine learning models on training sets with several hundred to several thousand queries when training from scratch, and with at least a few hundred queries when fine-tuning from a pretrained checkpoint.
  • Slower training and inference times on CPU devices but faster on GPU devices. Training and inference times for deep models on CPU-only machines can take longer than for traditional machine learning models. However, on GPU-enabled devices, the run times of the deep networks can be comparable to those of some of the traditional models in MindMeld.
  • Minimal feature engineering work but manual hyperparameter tuning. Unlike traditional machine learning models, deep models require little or no feature engineering work because they infer input features (such as word embeddings). Traditional models must take into account several hundred engineered features (n-grams, system entities, and so on), which require fine-grained tuning. On the flip side, MindMeld's deep models don't have automated hyperparameter tuning methods like sklearn.model_selection.GridSearchCV, which are available for their traditional counterparts. While the default hyperparameters for MindMeld's deep neural models work well across datasets, you can tune them further; a good starting point for understanding this subject is Andrej Karpathy's course notes from the Convolutional Neural Networks for Visual Recognition course at Stanford University.
  • Larger disk storage required. While deep neural models can have a disk storage footprint similar to their traditional counterparts, depending on your data, it is not uncommon for them to require more disk storage space.

Note

  • To use deep neural networks instead of traditional machine learning models in your MindMeld application, simply make a few modifications to the classifier configuration dictionaries for all or selected classifiers in your app's config.py.
  • To modify only selected domains or intents, recall that you can implement the get_intent_classifier_config() and get_entity_recognizer_config() functions in your app's config.py for finer-grained control, as sketched below.
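For illustration, here is a sketch of how such a function could look in config.py; the domain name 'store_info' and the fallback traditional configuration are hypothetical stand-ins, not values from this documentation:

# config.py (sketch): use a deep neural intent classifier for one domain only
DEEP_INTENT_CLASSIFIER_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {'embedder_type': 'glove'},
}

def get_intent_classifier_config(domain):
    # hypothetical domain name; only this domain gets the deep model
    if domain == 'store_info':
        return DEEP_INTENT_CLASSIFIER_CONFIG
    # all other domains keep a traditional classifier
    return {
        'model_type': 'text',
        'model_settings': {'classifier_type': 'logreg'},
    }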

In the following sections, different model architectures and their configurable parameters are outlined.

Domain and Intent classification

Using MindMeld’s deep neural models requires configuring only two keys in your classifier configuration dictionaries: 'model_settings' and 'params'. When working with the deep models, the 'features' and 'param_selection' keys in the classifier configuration are unused, since there is neither a handcrafted feature set to specify nor automated hyperparameter tuning.

This is a departure from other documentation on Working with the Domain Classifier and Working with the Intent Classifier, which outlines that a text classifier configuration requires two additional keys ('features' and 'param_selection').

'model_settings' is a dict with the single key 'classifier_type', whose value specifies the machine learning model to use. The allowed values of 'classifier_type' that are backed by deep neural nets and meant for sequence classification are:

  • 'embedder': Pooled Token Embeddings or Deep Contextualized Embeddings (see Embedder parameters)
  • 'cnn': Convolutional neural networks (CNN) (see CNN parameters)
  • 'lstm': Long short-term memory networks (LSTM) (see LSTM parameters)

'params' is also a dict with several configurable keys, some of which are specific to the choice of classifier type and others common across all the above classifier types. The following sections outline the allowed parameters for each choice of classifier type; see the Common Configurable Params section for the params that are shared across all classifier types.

1. 'embedder' classifier type

MindMeld's 'embedder' classifier type applies a pooling operation on top of token embeddings, which are based on either a lookup table or a deep neural model:

  • Lookup table embeddings can be derived based on a user-defined tokenization strategy: word-level, sub-word-level, or character-level tokenization (see Tokenization Choices below for more details). By default, the lookup table is randomly initialized, but with a word-level tokenization strategy it can instead be initialized from a pretrained checkpoint (such as GloVe).
  • Deep contextualized embedders are pretrained embedders in the style of BERT, each of which comes with its own tokenization strategy and neural embedding process.

In either case, all the underlying weights can be tuned to the provided training data, or kept frozen during the training process. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.

Note

Specify the embedding choice using the param embedder_type. Set it to None for a randomly initialized embedding lookup table, to 'glove' for a lookup table initialized with GloVe (or GloVe-formatted) pretrained embeddings, or to 'bert' for a BERT-like pretrained transformer-based deep contextualized embedder.

The following optional params are configurable alongside the chosen embedder_type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

1.1 Embedding Lookup Table (embedder_type: None)

Configuration Key Description
emb_dim

Number of dimensions for each token's embedding.

Type: int

Default: 256

Choices: Any positive integer

tokenizer_type

The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details.

Type: str

Default: 'whitespace-tokenizer'

Choices: See Tokenization Choices

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: True

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

embedder_output_pooling_type

Specifies the manner in which a query's token-wise embeddings are collated into a single embedding before being passed through the classification layer.

Type: str

Default: 'mean'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

Below is a minimal working example of a sequence classifier configuration for a classifier based on an embedding lookup table:

{
 'model_type': 'text',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': None,
     'emb_dim': 256,
 },
}
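To apply the configuration above to every intent classifier in the app, it can also be assigned to the standard config variable in config.py, as sketched here (the analogous DOMAIN_CLASSIFIER_CONFIG variable works the same way for the domain classifier):

# config.py (sketch): app-wide intent classifier configuration
INTENT_CLASSIFIER_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {
        'embedder_type': None,
        'emb_dim': 256,
    },
}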

1.2 Pretrained Embedding Lookup Table (embedder_type: glove)

Configuration Key Description
token_dimension

Specifies the dimension of the GloVe-6B pretrained word vectors. This key is only valid when using embedder_type as 'glove'.

Type: int

Default: 300

Choices: 50, 100, 200, 300

token_pretrained_embedding_filepath

Specifies a local file path to a pretrained embeddings file. This key is only valid when using embedder_type as 'glove'.

Type: Union[str, None]

Default: None

Choices: File path to a valid GloVe-style embeddings file

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: True

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

embedder_output_pooling_type

Specifies the manner in which a query's token-wise embeddings are collated into a single embedding before being passed through the classification layer.

Type: str

Default: 'mean'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

Below is a minimal working example of a sequence classifier configuration for a classifier based on a pretrained-initialized embedding lookup table:

{
 'model_type': 'text',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'glove',
     'update_embeddings': True,
 },
}
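If you have a custom GloVe-formatted embeddings file, a configuration pointing to it might look like the following sketch; the file path shown is a hypothetical placeholder:

{
 'model_type': 'text',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'glove',
     'token_dimension': 100,
     # hypothetical local path to a GloVe-formatted embeddings file
     'token_pretrained_embedding_filepath': '/path/to/custom_embeddings.txt',
 },
}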

1.3 Deep Contextualized Embeddings (embedder_type: bert)

Configuration Key Description
pretrained_model_name_or_path

Specifies a pretrained checkpoint's name or a valid file path to load a bert-like embedder. This key is only valid when using embedder_type as 'bert'.

Type: str

Default: 'bert-base-uncased'

Choices: Any valid name from Huggingface Models Hub or a valid folder path where the model's weights as well as its tokenizer's resources are present.

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

embedder_output_pooling_type

Specifies the manner in which a query's token-wise embeddings are collated into a single embedding before being passed through the classification layer.

Type: str

Default: 'mean'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

save_frozen_embedder

If set to False, the weights of the underlying bert-like embedder that are not being tuned are not dumped to disk upon calling a classifier's .dump() method. This boolean key is only valid when update_embeddings is set to False.

Type: bool

Default: False

Choices: True, False

Below is a minimal working example of a sequence classifier configuration for a classifier based on a BERT-like embedder:

{
 'model_type': 'text',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'bert',
     'pretrained_model_name_or_path': 'distilbert-base-uncased',
     'update_embeddings': True,
 },
}
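To use the pretrained embedder purely as a frozen feature extractor (for example, on a small dataset), a configuration along the following lines is one possible sketch, combining the params described above:

{
 'model_type': 'text',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'bert',
     'pretrained_model_name_or_path': 'bert-base-uncased',
     'update_embeddings': False,      # freeze the BERT weights during training
     'save_frozen_embedder': False,   # skip dumping the frozen weights to disk
 },
}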

2. 'cnn' classifier type

Convolutional neural network (CNN) based text classifiers are lightweight neural classifiers that have achieved remarkably strong performance on the practically important task of sentence classification.

Given a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table. The subsequent layer performs convolutions over the sequence of embedded word vectors using kernels (also called filters); kernels of different lengths capture different n-gram patterns from the input text. For each chosen length, several kernels are used to capture different patterns at the same receptive range, and each kernel produces one feature map.

Each feature map is reduced to the maximum value observed in that map, and the maximum values from all maps are concatenated to form a single long feature vector. This vector is analogous to an 'embedder' classifier's pooled output, and is passed through a classification layer. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.
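To make the convolution-and-max-pool computation concrete, below is a minimal, illustrative PyTorch sketch of this style of architecture; it is not MindMeld's internal implementation, and the class and argument names are assumptions for illustration only:

import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_dim=256,
                 window_sizes=(3, 4, 5), number_of_windows=(100, 100, 100)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # one 1D convolution per kernel length; each captures n-gram patterns
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n, kernel_size=w)
            for w, n in zip(window_sizes, number_of_windows)
        )
        self.dropout = nn.Dropout(p=0.3)  # output_keep_prob 0.7 -> dropout p=0.3
        self.classifier = nn.Linear(sum(number_of_windows), num_classes)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # reduce each feature map to its maximum value, then concatenate them all
        pooled = [conv(emb).relu().max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)  # (batch, sum(number_of_windows))
        return self.classifier(self.dropout(features))

logits = CNNTextClassifier(vocab_size=1000, num_classes=5)(
    torch.randint(0, 1000, (2, 12)))  # two queries of twelve tokens each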

The following are the different optional params that are configurable with the 'cnn' classifier type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

Configuration Key Description
embedder_type

The choice of embeddings to be used. Specifying None randomly initializes an embeddings lookup table whereas specifying 'glove' initializes the table with pretrained GloVe embeddings.

Type: Union[str, None]

Default: None

Choices: None, 'glove'

emb_dim

Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder.

Type: int

Default: 256

Choices: Any positive integer

tokenizer_type

The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details.

Type: str

Default: 'whitespace-tokenizer'

Choices: See Tokenization Choices

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: True

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

window_sizes

The lengths of 1D CNN kernels to be used for convolution on top of embeddings.

Type: List[int]

Default: [3,4,5]

Choices: A list of positive integers

number_of_windows

The number of kernels for each specified length of 1D CNN kernels.

Type: List[int]

Default: [100,100,100]

Choices: A list of positive integers; same length as window_sizes

Below is a minimal working example of a sequence classifier configuration for a classifier based on CNNs:

{
 'model_type': 'text',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'cnn'},
 'params': {
     'embedder_type': 'glove',
     'window_sizes': [3,4,5],
     'number_of_windows': [100,100,100],
 },
}

3. 'lstm' classifier type

Long short-term memory network (LSTM) based text classifiers use recurrent feedback connections to learn temporal dependencies in sequential data.

Given a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table. The subsequent layer applies an LSTM over the sequence of embedded word vectors. An LSTM's ability to maintain temporal information generally depends on its hidden dimension. The LSTM processes the text from left to right; a bi-directional LSTM (bi-LSTM) processes it both ways, left-to-right and right-to-left. This yields an output sequence with one vector per token of the input text. Optionally, several LSTMs can be stacked, with the output of one serving as the input to the next.

To obtain a single vector per input text, the token vectors can be pooled, or the last vector in the sequence can simply be used as the representative vector. This vector is analogous to an 'embedder' classifier's pooled output, and is passed through a classification layer. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.
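As an illustration of the recurrent encoding and pooling described above, here is a minimal PyTorch sketch; again, this is illustrative only and not MindMeld's internal implementation:

import torch
import torch.nn as nn

class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_dim=256,
                 lstm_hidden_dim=128, lstm_num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # dropout between stacked layers (cf. lstm_keep_prob 0.7 -> p=0.3)
        self.lstm = nn.LSTM(emb_dim, lstm_hidden_dim,
                            num_layers=lstm_num_layers, batch_first=True,
                            bidirectional=True, dropout=0.3)
        # a bi-LSTM concatenates forward and backward states, hence the 2x
        self.classifier = nn.Linear(2 * lstm_hidden_dim, num_classes)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        outputs, _ = self.lstm(self.embedding(token_ids))  # (batch, seq_len, 2*hidden)
        pooled = outputs[:, -1, :]  # simplified 'last' pooling: final time step
        return self.classifier(pooled)

logits = LSTMTextClassifier(vocab_size=1000, num_classes=5)(
    torch.randint(0, 1000, (2, 12)))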

The following are the different optional params that are configurable with the 'lstm' classifier type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

Configuration Key Description
embedder_type

The choice of embeddings to be used. Specifying None randomly initializes an embeddings lookup table whereas specifying 'glove' initializes the table with pretrained GloVe embeddings.

Type: Union[str, None]

Default: None

Choices: None, 'glove'

emb_dim

Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder.

Type: int

Default: 256

Choices: Any positive integer

tokenizer_type

The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details.

Type: str

Default: 'whitespace-tokenizer'

Choices: See Tokenization Choices

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: True

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_hidden_dim

Number of states in each LSTM layer.

Type: int

Default: 128

Choices: Any positive integer

lstm_num_layers

The number of LSTM layers that are to be stacked sequentially.

Type: int

Default: 2

Choices: Any positive integer

lstm_keep_prob

Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_bidirectional

If True, the LSTM layers will be bidirectional.

Type: bool

Default: True

Choices: True, False

lstm_output_pooling_type

Specifies the manner in which a query's token-wise embeddings are collated into a single embedding before being passed through the classification layer.

Type: str

Default: 'last'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

Below is a minimal working example of a sequence classifier configuration for a classifier based on LSTMs:

{
 'model_type': 'text',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'lstm'},
 'params': {
     'embedder_type': 'glove',
     'lstm_hidden_dim': 128,
     'lstm_bidirectional': True,
 },
}

Entity recognition

Using MindMeld’s deep neural models requires configuring only two keys in your classifier configuration dictionaries: 'model_settings' and 'params'. When working with the deep models, the 'features' and 'param_selection' keys in the classifier configuration are unused, since there is neither a handcrafted feature set to specify nor automated hyperparameter tuning.

This is a departure from other documentation on Working with the Entity Recognizer, which outlines that a text classifier configuration requires two additional keys ('features' and 'param_selection').

'model_settings' is a dict with the single key 'classifier_type', whose value specifies the machine learning model to use. The allowed values of 'classifier_type' that are backed by deep neural nets and meant for token classification are:

  • 'embedder': Pooled Token Embeddings or Deep Contextualized Embeddings (see Embedder parameters)
  • 'lstm-pytorch': Long short-term memory networks (LSTM) (see LSTM-PYTORCH parameters)
  • 'cnn-lstm': Character-level Convolutional neural networks (CNN) followed by word-level Long short-term memory networks (LSTM) (see CNN-LSTM parameters)
  • 'lstm-lstm': Character-level Long short-term memory networks (LSTM) followed by word-level Long short-term memory networks (LSTM) (see LSTM-LSTM parameters)
  • 'lstm': Long short-term memory networks (LSTM) coupled with gazetteer encodings, backed by Tensorflow (see LSTM parameters)

'params' is also a dict with several configurable keys, some of which are specific to the choice of classifier type and others common across all the above classifier types. The following sections outline the allowed parameters for each choice of classifier type; see the Common Configurable Params section for the params that are common across all classifier types.

1. 'embedder' classifier type

This classifier type includes neural models based on either an embedding lookup table or a deep contextualized embedder, whose outputs are passed through a Conditional Random Field (CRF) or a softmax layer that labels each target word as a particular entity.

  • Lookup table embeddings can be derived based on a user-defined tokenization strategy: word-level, sub-word-level, or character-level tokenization (see Tokenization Choices below for more details). By default, the lookup table is randomly initialized, but with a word-level tokenization strategy it can instead be initialized from a pretrained checkpoint (such as GloVe).
  • Deep contextualized embedders are pretrained embedders in the style of BERT, each of which comes with its own tokenization strategy and neural embedding process.

In either case, all the underlying weights can be tuned to the provided training data, or kept frozen during the training process. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.

The 'embedder' classifier type pools the vectors of all tokens corresponding to each word for which an entity tag is to be predicted, so as to obtain a single vector per word in the input text. This is unlike sequence classification models, where the tokens of all words are pooled together and then passed through a classification layer.
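For instance, a sub-word tokenizer may split a word such as "beethoven" into several sub-tokens. The short sketch below illustrates 'first' span pooling, which keeps the vector of each word's first sub-token; it is illustrative only, with made-up shapes and indices:

import torch

# hidden vectors for the sub-tokens of "play beet ##hoven": (3 tokens, 8 dims)
token_vectors = torch.randn(3, 8)
# the words "play" and "beethoven" begin at sub-token positions 0 and 1
word_start_indices = torch.tensor([0, 1])
# 'first' pooling: one vector per word, fed to the entity classification layer
per_word_vectors = token_vectors[word_start_indices]  # shape: (2, 8)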

Note

Specify the embedding choice using the param embedder_type. Set it to None for a randomly initialized embedding lookup table, to 'glove' for a lookup table initialized with GloVe (or GloVe-formatted) pretrained embeddings, or to 'bert' for a BERT-like pretrained transformer-based deep contextualized embedder.

The following optional params are configurable alongside the chosen embedder_type. See Common Configurable Params for additional configurable params that are common across classifiers.

1.1 Embedding Lookup Table (embedder_type: None)

Configuration Key Description
emb_dim

Number of dimensions for each token's embedding.

Type: int

Default: 256

Choices: Any positive integer

tokenizer_type

The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details.

Type: str

Default: 'whitespace-tokenizer'

Choices: See Tokenization Choices

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: None

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

token_spans_pooling_type

Specifies the manner in which a word's token-wise embeddings are collated into a single embedding before being passed through the entity classification layer.

Type: str

Default: 'first'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: True

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on an embedding lookup table:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': None,
     'emb_dim': 256,
 },
}

1.2 Pretrained Embedding Lookup Table (embedder_type: glove)

Configuration Key Description
token_dimension

Specifies the dimension of the GloVe-6B pretrained word vectors. This key is only valid when using embedder_type as 'glove'.

Type: int

Default: 300

Choices: 50, 100, 200, 300

token_pretrained_embedding_filepath

Specifies a local file path to a pretrained embeddings file. This key is only valid when using embedder_type as 'glove'.

Type: Union[str, None]

Default: None

Choices: File path to a valid GloVe-style embeddings file

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: None

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

token_spans_pooling_type

Specifies the manner in which a word's token-wise embeddings are collated into a single embedding before being passed through the entity classification layer.

Type: str

Default: 'first'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: True

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on a pretrained-initialized embedding lookup table:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'glove',
     'update_embeddings': True,
 },
}

1.3 Deep Contextualized Embeddings (embedder_type: bert)

Configuration Key Description
pretrained_model_name_or_path

Specifies a pretrained checkpoint's name or a valid file path to load a bert-like embedder. This key is only valid when using embedder_type as 'bert'.

Type: str

Default: 'bert-base-uncased'

Choices: Any valid name from Huggingface Models Hub or a valid folder path where the model's weights as well as its tokenizer's resources are present.

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 1.0

Choices: A float between 0 and 1

save_frozen_embedder

If set to False, the weights of the underlying bert-like embedder that are not being tuned are not dumped to disk upon calling a classifier's .dump() method. This boolean key is only valid when update_embeddings is set to False.

Type: bool

Default: False

Choices: True, False

token_spans_pooling_type

Specifies the manner in which a word's token-wise embeddings are collated into a single embedding before being passed through the entity classification layer.

Type: str

Default: 'first'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: False

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on a BERT embedder:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'embedder'},
 'params': {
     'embedder_type': 'bert',
     'pretrained_model_name_or_path': 'distilbert-base-uncased',
     'update_embeddings': True,
 },
}

2. 'lstm-pytorch' classifier type

Long short-term memory network (LSTM) based text classifiers use recurrent feedback connections to learn temporal dependencies in sequential data.

Given a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table. The subsequent layer applies an LSTM over the sequence of embedded word vectors. An LSTM's ability to maintain temporal information generally depends on its hidden dimension. The LSTM processes the text from left to right; a bi-directional LSTM (bi-LSTM) processes it both ways, left-to-right and right-to-left. This yields an output sequence with one vector per token of the input text. Optionally, several LSTMs can be stacked, with the output of one serving as the input to the next.

To obtain a single vector per word in the input text, the vectors of all tokens corresponding to each word (for which an entity tag is to be predicted) are pooled. This vector is analogous to an 'embedder' classifier's output, and is passed through a classification layer. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.

The following are the different optional params that are configurable with the 'lstm-pytorch' classifier type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

Configuration Key Description
embedder_type

The choice of embeddings to be used. Specifying None randomly initializes an embeddings lookup table whereas specifying 'glove' initializes the table with pretrained GloVe embeddings.

Type: Union[str, None]

Default: None

Choices: None, 'glove'

emb_dim

Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder.

Type: int

Default: 256

Choices: Any positive integer

tokenizer_type

The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details.

Type: str

Default: 'whitespace-tokenizer'

Choices: See Tokenization Choices

add_terminals

If set to True, terminal tokens (a start token and an end token) are added at the beginning and end of each input before applying any padding. If left unset or set to None, the value is set to True whenever the input text encoders (based on the choice of tokenization) require it.

Type: Union[bool, None]

Default: None

Choices: None, True, False

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_hidden_dim

Number of states in each LSTM layer.

Type: int

Default: 128

Choices: Any positive integer

lstm_num_layers

The number of LSTM layers that are to be stacked sequentially.

Type: int

Default: 2

Choices: Any positive integer

lstm_keep_prob

Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_bidirectional

If True, the LSTM layers will be bidirectional.

Type: bool

Default: True

Choices: True, False

token_spans_pooling_type

Specifies the manner in which a word's token-wise embeddings are collated into a single embedding before being passed through the entity classification layer.

Type: str

Default: 'first'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: False

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on LSTMs:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'lstm-pytorch'},
 'params': {
     'embedder_type': 'glove',
     'lstm_hidden_dim': 128,
     'lstm_bidirectional': True,
 },
}

3. 'cnn-lstm' classifier type

Long short-term memory network (LSTM) based text classifiers use recurrent feedback connections to learn temporal dependencies in sequential data. When coupled with Convolutional neural networks (CNN) that extract character-level features from the input text, the overall architecture can model the textual data better and is more robust to variations in spelling.

Given a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table. Each word's embedding is then concatenated with the outputs of character-level convolutions over that word, using kernels of different lengths to capture different patterns. These convolutions are similar to those of the 'cnn' classifier type, except that they are applied to each word in the input text separately, yielding one representation per word.

The subsequent layer applies an LSTM over the sequence of concatenated word vectors. An LSTM's ability to maintain temporal information generally depends on its hidden dimension. The LSTM processes the text from left to right; a bi-directional LSTM (bi-LSTM) processes it both ways, left-to-right and right-to-left. This yields an output sequence with one vector per token of the input text. Optionally, several LSTMs can be stacked, with the output of one serving as the input to the next.

The 'cnn-lstm' classifier type pools the vectors of all tokens corresponding to words for which an entity tag is to be predicted, so as to obtain a single vector per word in the input text. This vector is analogous to an 'embedder' classifier's output, and is passed through a classification layer. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.

The following are the different optional params that are configurable with the 'cnn-lstm' classifier type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

Configuration Key Description
embedder_type

The choice of embeddings to be used. Specifying None randomly initializes an embeddings lookup table whereas specifying 'glove' initializes the table with pretrained GloVe embeddings.

Type: Union[str, None]

Default: None

Choices: None, 'glove'

emb_dim

Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder.

Type: int

Default: 256

Choices: Any positive integer

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_hidden_dim

Number of states in each LSTM layer.

Type: int

Default: 128

Choices: Any positive integer

lstm_num_layers

The number of LSTM layers that are to be stacked sequentially.

Type: int

Default: 2

Choices: Any positive integer

lstm_keep_prob

Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_bidirectional

If True, the LSTM layers will be bidirectional.

Type: bool

Default: True

Choices: True, False

char_emb_dim

Number of dimensions for each character's embedding.

Type: int

Default: 50

Choices: Any positive integer

char_window_sizes

The lengths of 1D CNN kernels to be used for character-level convolution on top of character embeddings.

Type: List[int]

Default: [3,4,5]

Choices: A list of positive integers

char_number_of_windows

The number of kernels for each specified length of 1D CNN kernels in char_window_sizes.

Type: List[int]

Default: [100,100,100]

Choices: A list of positive integers; same length as char_window_sizes

char_cnn_output_keep_prob

Keep probability for the dropout layer placed on top of character CNN's output. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

char_proj_dim

The final dimension of each character after it is transformed by the character-level network. Usually greater than char_emb_dim, since it encodes more information about orthography and semantics. If unspecified or None, the dimension is the same as char_emb_dim.

Type: Union[int, None]

Default: None

Choices: Any positive integer, None

char_padding_length

The maximum number of characters allowed per word. If a word has more characters than char_padding_length, the surplus characters are discarded. If specified as None, the char_padding_length in a mini-batch of queries is simply the maximum length of all words in that mini-batch.

Type: Union[int, None]

Default: None

Choices: Any positive integer, None

char_add_terminals

If set to True, terminal character tokens (a start and an end character token) are added at the beginning and end of each word before applying any padding, when preparing inputs to the underlying character-level network.

Type: bool

Default: False

Choices: True, False

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: False

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on CNN-LSTM:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'cnn-lstm'},
 'params': {
     'embedder_type': 'glove',
     'lstm_hidden_dim': 128,
     'lstm_bidirectional': True,
     'char_emb_dim': 32
 },
}

4. 'lstm-lstm' classifier type

Long short-term memory network (LSTM) based text classifiers use recurrent feedback connections to learn temporal dependencies in sequential data. When coupled with a character-level LSTM that extracts character-level features from the input text, the overall architecture can model the textual data better and is more robust to variations in spelling.

Given a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table, and concatenates each word's embedding with the output of a character-level bi-LSTM run over that word individually to capture character-level patterns. This character-level bi-LSTM plays the same role as the character-level convolutions of the 'cnn-lstm' classifier type, producing one representation per word.

The subsequent layer applies an LSTM over the sequence of concatenated word vectors. An LSTM's ability to maintain temporal information generally depends on its hidden dimension. The LSTM processes the text from left to right; a bi-directional LSTM (bi-LSTM) processes it both ways, left-to-right and right-to-left. This yields an output sequence with one vector per token of the input text. Optionally, several LSTMs can be stacked, with the output of one serving as the input to the next.

The 'lstm-lstm' classifier type pools the vectors of all tokens corresponding to words for which an entity tag is to be predicted, so as to obtain a single vector per word in the input text. This vector is analogous to an 'embedder' classifier's output, and is passed through a classification layer. Dropout layers are used as regularizers to avoid over-fitting, a phenomenon that is more common when working with small datasets.

The following are the different optional params that are configurable with the 'lstm-lstm' classifier type. See the Common Configurable Params section for additional configurable params that are common across classifiers.

Configuration Key Description
embedder_type

The choice of embeddings to be used. Specifying None randomly initializes an embeddings lookup table whereas specifying 'glove' initializes the table with pretrained GloVe embeddings.

Type: Union[str, None]

Default: None

Choices: None, 'glove'

emb_dim

Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder.

Type: int

Default: 256

Choices: Any positive integer

update_embeddings

If set to False, the weights of the embedding table or the deep contextualized embedder will not be updated during back-propagation of gradients. This boolean key is only valid when using a pretrained embedder type.

Type: bool

Default: True

Choices: True, False

embedder_output_keep_prob

Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

output_keep_prob

Keep probability for the dropout layer placed on top of the classifier's penultimate layer (i.e. the layer before logits are computed). Dropout helps in regularization and reduces over-fitting.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_hidden_dim

Number of states in each LSTM layer.

Type: int

Default: 128

Choices: Any positive integer

lstm_num_layers

The number of LSTM layers that are to be stacked sequentially.

Type: int

Default: 2

Choices: Any positive integer

lstm_keep_prob

Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer.

Type: float

Default: 0.7

Choices: A float between 0 and 1

lstm_bidirectional

If True, the LSTM layers will be bidirectional.

Type: bool

Default: True

Choices: True, False

char_emb_dim

Number of dimensions for each character's embedding.

Type: int

Default: 50

Choices: Any positive integer

char_lstm_hidden_dim

Number of states in each character-level LSTM layer.

Type: int

Default: 128

Choices: Any positive integer

char_lstm_num_layers

The number of character-level LSTM layers that are to be stacked sequentially.

Type: int

Default: 2

Choices: Any positive integer

char_lstm_keep_prob

Keep probability for the nodes that constitute the outputs of each character-level LSTM layer except the last layer in the stack.

Type: float

Default: 0.7

Choices: A float between 0 and 1

char_lstm_bidirectional

If True, the character-level LSTM layers will be bidirectional.

Type: bool

Default: True

Choices: True, False

char_lstm_output_pooling_type

Specifies the manner in which a word's character-level embeddings are collated into a single embedding before being passed to subsequent layers.

Type: str

Default: 'last'

Choices: 'first', 'last', 'max', 'mean', 'mean_sqrt'

char_proj_dim

The final dimension of each character after it is transformed by the character-level network. Usually greater than char_emb_dim, since it encodes more information about orthography and semantics. If unspecified or None, the dimension is the same as char_emb_dim.

Type: Union[int, None]

Default: None

Choices: Any positive integer, None

char_padding_length

The maximum number of characters allowed per word. If a word has more characters than char_padding_length, the surplus characters are discarded. If specified as None, the char_padding_length in a mini-batch of queries is simply the maximum length of all words in that mini-batch.

Type: Union[int, None]

Default: None

Choices: Any positive integer, None

char_add_terminals

If set to True, terminal character tokens (a start and an end character token) are added at the beginning and end of each word before applying any padding, when preparing inputs to the underlying character-level network.

Type: bool

Default: False

Choices: True, False

use_crf_layer

If set to True, a CRF layer is used for entity classification instead of a softmax layer.

Type: bool

Default: False

Choices: True, False

Below is a minimal working example of a token classifier configuration for a classifier based on LSTM-LSTM:

{
 'model_type': 'tagger',
 'train_label_set': 'train.*\.txt',
 'test_label_set': 'test.*\.txt',
 'model_settings': {'classifier_type': 'lstm-lstm'},
 'params': {
     'embedder_type': 'glove',
     'lstm_hidden_dim': 128,
     'lstm_bidirectional': True,
     'char_emb_dim': 32,
 },
}

5. 'lstm' classifier type

A Tensorflow-backed implementation of a bi-directional Long Short-Term Memory (LSTM) network.

Note

To use this classifier type, make sure to install the Tensorflow requirement by running in the shell:

pip install mindmeld[tensorflow]

The MindMeld Bi-Directional LSTM network

  • encodes words as pre-trained word embeddings using Stanford's GloVe representation
  • encodes characters using a convolutional network trained on the training data
  • concatenates the word and character embeddings together and feeds them into the bi-directional LSTM
  • couples the forget and input gates of the LSTM using a peephole connection, to improve overall accuracies on downstream NLP tasks
  • feeds the output of the LSTM into a linear chain Conditional Random Field (CRF) or Softmax layer which labels the target word as a particular entity

The following are the different optional params that are configurable with the 'lstm' classifier type. Unlike the other classifier types, this classifier type does not share the common configurable params listed in the Addendum.

Parameter name Description
padding_length

The sequence model treats this as the maximum number of words in a query. If a query has more words than padding_length, the surplus words are discarded.

Typically set to the maximum query length (in words) expected at both train and predict time.

Default: 20

Example:

{'padding_length': 20}
  • a query can have a maximum of twenty words
batch_size

Size of each batch of training data to feed into the network (which uses mini-batch learning).

Default: 20

Example:

{'batch_size': 20}
  • feed twenty training queries to the network for each learning step
display_epoch

The network displays training accuracy statistics at this interval, measured in epochs.

Default: 5

Example:

{'display_epoch': 5}
  • display accuracy statistics every five epochs
number_of_epochs

Total number of complete iterations of the training data to feed into the network. In each iteration, the data is shuffled to break any prior sequence patterns.

Default: 20

Example:

{'number_of_epochs': 20}
  • iterate through the training data twenty times
optimizer

Optimizer to use to minimize the network's stochastic objective function.

Default: 'adam'

Example:

{'optimizer': 'adam'}
  • use the Adam optimizer to minimize the objective function
learning_rate

Parameter to control the size of weight and bias changes of the training algorithm as it learns.

This article explains Learning Rate in technical terms.

Default: 0.005

Example:

{'learning_rate': 0.005}
  • set learning rate to 0.005
dense_keep_prob

In the context of the "dropout" technique (a regularization method to prevent overfitting), keep probability specifies the proportion of nodes to "keep", that is, to exempt from dropout during the network's learning phase.

The dense_keep_prob parameter sets the keep probability of the nodes in the dense network layer that connects the output of the LSTM layer to the nodes that predict the named entities.

Default: 0.5

Example:

{'dense_keep_prob': 0.5}
  • 50% of the nodes in the dense layer will not be turned off by dropout
lstm_input_keep_prob

Keep probability for the nodes that constitute the inputs to the LSTM cell.

Default: 0.5

Example:

{'lstm_input_keep_prob': 0.5}
  • 50% of the nodes that are inputs to the LSTM cell will not be turned off by dropout
lstm_output_keep_prob

Keep probability for the nodes that constitute the outputs of the LSTM cell.

Default: 0.5

Example:

{'lstm_output_keep_prob': 0.5}
  • 50% of the nodes that are outputs of the LSTM cell will not be turned off by dropout
token_lstm_hidden_state_dimension

Number of states per LSTM cell.

Default: 300

Example:

{'token_lstm_hidden_state_dimension': 300}
  • an LSTM cell will have 300 states
token_embedding_dimension

Number of dimensions for word embeddings.

Allowed values: [50, 100, 200, 300].

Default: 300

Example:

{'token_embedding_dimension': 300}
  • each word embedding will have 300 dimensions
gaz_encoding_dimension

Number of nodes to connect to the gazetteer encodings in a fully-connected network.

Default: 100

Example:

{'gaz_encoding_dimension': 100}
  • 100 nodes will be connected to the gazetteer encodings in a fully-connected network
max_char_per_word

The sequence model treats this as the maximum number of characters in a word. If a word has more characters than max_char_per_word, the surplus characters are discarded.

Usually set to the size of the longest word in the training and test sets.

Default: 20

Example:

{'max_char_per_word': 20}
  • a word can have a maximum of twenty characters
use_crf_layer

If set to True, use a linear chain Conditional Random Field layer for the final layer, which predicts sequence tags.

If set to False, use a softmax layer to predict sequence tags.

Default: False

Example:

{'use_crf_layer': True}
  • use the CRF layer
use_character_embeddings

If set to True, use the character embedding trained on the training data using a convolutional network.

If set to False, do not use character embeddings.

Note: Using character embeddings significantly increases training time compared to using word embeddings alone.

Default: False

Example:

{'use_character_embeddings': True}
  • use character embeddings
char_window_sizes

List of window sizes for convolutions that the network should use to build the character embeddings. Usually in decreasing numerical order.

Note: This parameter is needed only if use_character_embeddings is set to True.

Default: [5]

Example:

{'char_window_sizes': [5, 3]}
  • first, use a convolution of size 5
  • next, feed the output of that convolution through a convolution of size 3
character_embedding_dimension

Initial dimension of each character before it is fed into the convolutional network.

Note: This parameter is needed only if use_character_embeddings is set to True.

Default: 10

Example:

{'character_embedding_dimension': 10}
  • initialize the dimension of each character to ten
word_level_character_embedding_size

The final dimension of each character after it is transformed by the convolutional network.

Usually greater than character_embedding_dimension since it encodes more information about orthography and semantics.

Note: This parameter is needed only if use_character_embeddings is set to True.

Default: 40

Example:

{'word_level_character_embedding_size': 40}
  • each character embedding will have a dimension of forty after passing through the convolutional network
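
A minimal sketch of how these parameters come together in an app's config.py, assuming the 'tagger' model type and 'lstm' classifier type used for this entity recognizer elsewhere in MindMeld; all values below are illustrative, not tuned recommendations:

ENTITY_RECOGNIZER_CONFIG = {
    'model_type': 'tagger',
    'model_settings': {'classifier_type': 'lstm'},
    'params': {
        'number_of_epochs': 20,           # iterate over the training data 20 times
        'optimizer': 'adam',              # minimize the objective with Adam
        'learning_rate': 0.005,
        'dense_keep_prob': 0.5,           # dropout keep probabilities
        'lstm_input_keep_prob': 0.5,
        'lstm_output_keep_prob': 0.5,
        'token_lstm_hidden_state_dimension': 300,
        'token_embedding_dimension': 300,
        'use_crf_layer': True,            # CRF layer instead of softmax for tag prediction
        'use_character_embeddings': True, # slower training, but captures orthography
        'char_window_sizes': [5, 3],
        'character_embedding_dimension': 10,
        'word_level_character_embedding_size': 40,
    },
}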

Addendum

Common Configurable Params

The following params are commonly configurable across all the classifier types described above, both for domain/intent classification and for entity recognition.

device

Name of the device on which torch tensors will be allocated. Specify 'cuda' only when building models in a GPU-enabled environment.

Type: str

Default: 'cuda' if torch.cuda.is_available() else 'cpu'

Choices: 'cuda', 'cpu'

number_of_epochs

The total number of complete iterations of the training data to feed into the network. In each iteration, the data is shuffled to break any prior sequence patterns.

Type: int

Default: 100 unless specified otherwise for the selected classifier type.

Choices: Any positive integer

patience

The number of epochs to wait without any improvement in the validation metric before terminating training early. For example, with patience set to 7, training stops if the validation metric has not improved for 7 consecutive epochs.

Type: int

Default: 10 if token classification else 7, unless specified otherwise for the selected classifier type.

Choices: Any positive integer

batch_size

Size of each batch of training data to feed into the network (which uses mini-batch learning).

Type: int

Default: 32 unless specified otherwise for the selected classifier type.

Choices: Any positive integer

gradient_accumulation_steps

Number of consecutive mini-batches for which gradients will be averaged and accumulated before updating the weights of the network.

Type: int

Default: 1 unless specified otherwise for the selected classifier type.

Choices: Any positive integer
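
Gradient accumulation is a way to emulate a larger effective batch size on memory-constrained hardware, since the effective batch size equals batch_size multiplied by gradient_accumulation_steps. For example (an illustrative combination):

{'batch_size': 16, 'gradient_accumulation_steps': 4}
  • average gradients over 4 consecutive mini-batches of 16 queries each, so weights are updated once per 64 queries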

max_grad_norm

Maximum norm to which the accumulated gradients' norm is to be clipped.

Type: Union[float, None]

Default: None

Choices: Any positive float, None

optimizer

Optimizer to use to minimize the network's stochastic objective function.

Type: str

Default: 'Adam'

Choices: A valid name from Pytorch optimizers

learning_rate

Parameter to control the size of weight and bias changes of the training algorithm as it learns. This article explains Learning Rate in technical terms.

Type: float

Default: 0.001

Choices: Any positive float

validation_metric

The metric used to track model improvements on the validation data split.

Type: str

Default: 'accuracy' for sequence classification and 'f1' for token classification

Choices: 'accuracy', 'f1'

dev_split_ratio

The fraction of samples in the training data that are to be used for validation; sampled randomly.

Type: float

Default: 0.2

Choices: A float between 0 and 1

padding_length

The maximum number of tokens (words, sub-words, or characters) allowed in a query. If a query has more tokens than padding_length, the surplus tokens are discarded. If specified as None, the padding_length of a mini-batch is simply the maximum length of the inputs in that mini-batch.

Type: Union[int, None]

Default: None

Choices: Any positive integer, None

query_text_type

Determines the choice of text that is fed into the neural model. This param is coupled with the Text Preparation Pipeline when using a choice other than 'text'. The following are the three available choices:

'text': Specifies that the raw text of the queries be used without any processing. This is an appropriate choice if using a Huggingface pretrained model as the pretrained model's tokenizer takes care of any processing, tokenization, and normalizations on top of the raw text. This choice is oblivious to the app's TEXT_PREPARATION_CONFIG.

'processed_text': Specifies that processed text of the Text Preparation Pipeline be used as input to the neural model.

'normalized_text': Specifies that the text resulting from the processing, tokenization, and normalization steps of the Text Preparation Pipeline be used as input to the neural model.

Type: str

Default: 'processed_text' for sequence classification and 'normalized_text' for token classification

Choices: 'text', 'processed_text', 'normalized_text'
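
As a concrete illustration, these common params can be combined with any of the deep model choices described earlier on this page. Below is a minimal sketch of a sequence classifier config for config.py; the classifier_type shown is only a placeholder for whichever deep model you select, and all values are illustrative:

INTENT_CLASSIFIER_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'cnn'},  # placeholder; pick any deep model choice
    'params': {
        'device': 'cpu',                     # or 'cuda' in a GPU environment
        'number_of_epochs': 100,
        'patience': 7,                       # stop early after 7 epochs without improvement
        'batch_size': 32,
        'gradient_accumulation_steps': 1,
        'max_grad_norm': None,
        'optimizer': 'Adam',
        'learning_rate': 0.001,
        'validation_metric': 'accuracy',
        'dev_split_ratio': 0.2,              # hold out 20% of training data for validation
        'padding_length': None,              # pad to the longest query in each mini-batch
        'query_text_type': 'processed_text',
    },
}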

Tokenization Choices

A noteworthy distinction between the traditional suite of models and the deep neural models suite is the way inputs are prepared for the underlying model. While inputs for the former are prepared based on the specifications provided in the 'features' key of the classifier's config, inputs to deep neural models are naive in the sense that they are simply the sequence of tokens in the input query; the deep models do the heavy lifting of discovering patterns to classify the text.

Broadly, tokens can be extracted from an input text as a sequence of individual characters, groups of characters (aka. sub-words), or the words themselves obtained by splitting the input text at whitespaces. Based on the choice of tokenization, a sequence of tokens is obtained from each input query and then converted into a sequence of ids for the neural model.

Note

  • To use a specific tokenization strategy, simply set the tokenizer_type param to one of the following choices (e.g. {'tokenizer_type': 'whitespace-tokenizer'}).
  • Note that some of the strategies are specific to the choice of embedder being used in the classifier.

Warning

The tokenization choices presented here shouldn't be confused with the Tokenizers in the text preparation pipeline. The tokenizers in the text preparation pipeline are used to prepare the text that is input to the neural models, while the choices below are used to produce the sequences of tokens for the underlying embedders.

The neural suite offers the following tokenization choices to prepare inputs for the neural models.

1. 'whitespace-tokenizer'

A Whitespace tokenizer tokenizes a query into a sequence of tokens by splitting it at whitespaces; the resulting tokens are simply the words present in the query. This tokenization strategy is stateless: the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data.

2. 'char-tokenizer'

A Character tokenizer tokenizes a query into the sequence of characters present in it. This tokenization strategy is stateless: the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data.

3. 'bpe-tokenizer'

A Byte-Pair Encoding (BPE) tokenizer tokenizes a query into a sequence of sub-words based on a vocabulary created from all of the queries in the training data. This tokenization strategy is stateful: the sequence of tokens produced for an input text might change if the queries in the training data change. This tokenizer is implemented using Huggingface's Tokenizers library.

4. 'wordpiece-tokenizer'

A Word-Piece tokenizer tokenizes a query into a sequence of sub-words based on a vocabulary created from all of the queries in the training data. This tokenization strategy is stateful: the sequence of tokens produced for an input text might change if the queries in the training data change. This tokenizer is implemented using Huggingface's Tokenizers library.

5. 'huggingface_pretrained-tokenizer'

A tokenizer pretrained and made available as part of the Huggingface transformers library. Although this tokenization strategy is stateful (due to its pretraining), the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data. To use this tokenizer, set the tokenizer_type and pretrained_model_name_or_path keys appropriately, as follows: {'tokenizer_type': 'huggingface_pretrained-tokenizer', 'pretrained_model_name_or_path': 'distilbert-base-uncased'}.
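
For instance, switching a classifier to learn sub-word tokens from your own training queries takes a single param (a minimal sketch; combine it with any of the classifier configs above):

{'tokenizer_type': 'bpe-tokenizer'}
  • learn a Byte-Pair Encoding vocabulary from the app's training queries and tokenize each query into sub-words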