Deep Neural Networks in MindMeld¶
Conversational AI, and Natural Language Processing more generally, have seen a boost in performance across a variety of tasks through the use of deep learning. In particular, deep neural models based on Convolutional neural networks (CNNs), Long short-term memory networks (LSTMs), and Transformer architectures have been widely adopted over more traditional approaches to NLP, to great success. MindMeld now extends its suite of traditional machine learning models (e.g. Logistic regression, Decision tree, etc.) with a variety of deep neural models and an array of configurable parameters.
Users can now train and use deep neural models for domain classification and intent classification (aka. sequence classification) as well as for entity recognition (or token classification) tasks.
Note
These models are implemented using the Pytorch framework and thus require an extra installation step before you can use them in your chatbot application. Please make sure to install the Pytorch requirement by running the following in the shell:
pip install mindmeld[torch]
MindMeld supports the use of pretrained transformer models such as BERT through the popular Huggingface Transformers library. Several pretrained models from the Huggingface Models Hub, suitable for sequence classification or token classification, can be employed in your chatbot application.
Note
To use pretrained transformer models, install the extra transformers requirement by running in the shell:
pip install mindmeld[transformers]
Before proceeding to use the deep neural models, consider the following possible advantages and disadvantages of using them in place of traditional machine learning models.
- Better overall performance on larger training sets. Deep models generally outperform traditional machine learning models on training sets with several hundreds or thousands of queries when training from scratch, and with at least a few hundred queries when fine-tuning from a pretrained checkpoint.
- Slower training and inference times on CPU devices but faster on GPU devices. Training and inference times for deep models on CPU-only machines can take longer than traditional machine learning models. However, on GPU-enabled devices, the run times of the deep networks can be comparable to some of the traditional models in MindMeld.
- Minimal feature engineering work but manual hyperparameter tuning. Unlike traditional machine learning models, deep models require little or no feature engineering work because they infer input features (such as word embeddings). Traditional models must take into account several hundred engineered features (n-grams, system entities, and so on), which requires fine-grained tuning. On the flip side, MindMeld's deep models don't have automated hyperparameter tuning methods like sklearn.model_selection.GridSearchCV, which are available for their traditional counterparts. While the default hyperparameters for MindMeld's deep neural models work well across datasets, you can tune them further; a good starting point to understand this subject better is Andrej Karpathy's course notes from the Convolutional Neural Networks for Visual Recognition course at Stanford University.
- Larger disk storage required. While deep neural models can have a similar disk storage footprint to their traditional counterparts, depending on your data, it is not uncommon for them to require more disk storage space.
Note
- To use deep neural networks instead of traditional machine learning models in your MindMeld application, simply make a few modifications to the classifier configuration dictionaries for all or selected classifiers in your app's config.py.
- To modify only selected domains or intents, recall that you can implement the get_intent_classifier_config() and get_entity_recognizer_config() functions respectively in your app's config.py for finer-grained control, as sketched below.
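For example, a config.py that applies deep models only to selected domains or intents might look like the following minimal sketch. The hook signatures shown here follow the Working with the Intent Classifier and Working with the Entity Recognizer chapters (check those chapters for the exact arguments), and the domain, intent, and classifier settings are purely illustrative:
# config.py -- a minimal sketch; domain and intent names below are illustrative.

DEEP_INTENT_CONFIG = {
    'model_type': 'text',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {'embedder_type': 'bert'},
}

DEEP_ENTITY_CONFIG = {
    'model_type': 'tagger',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {'embedder_type': 'glove', 'use_crf_layer': True},
}

def get_intent_classifier_config(domain):
    # Use a deep model for one domain and a traditional model elsewhere.
    if domain == 'store_info':
        return DEEP_INTENT_CONFIG
    return {'model_type': 'text', 'model_settings': {'classifier_type': 'logreg'}}

def get_entity_recognizer_config(domain, intent):
    # Use a deep model only for one intent's entities.
    if domain == 'store_info' and intent == 'get_store_hours':
        return DEEP_ENTITY_CONFIG
    return {'model_type': 'tagger', 'model_settings': {'classifier_type': 'memm'}}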
In the following sections, different model architectures and their configurable parameters are outlined.
Domain and Intent classification¶
Using MindMeld’s deep neural models requires configuring only two keys in your classifier configuration dictionaries: 'model_settings' and 'params'.
When working with the deep models, the 'features' and 'param_selection' keys in the classifier configuration are not used, since there is neither hand-crafted feature engineering nor automated hyperparameter tuning.
This is a departure from other documentation on Working with the Domain Classifier and Working with the Intent Classifier, which outlines that a text classifier configuration requires two additional keys ('features' and 'param_selection').
The 'model_settings' is a dict with the single key 'classifier_type', whose value specifies the machine learning model to use.
The allowed values of 'classifier_type' that are backed by deep neural nets and are meant for sequence classification are:
Value | Classifier | Reference for configurable parameters |
---|---|---|
'embedder' | Pooled Token Embeddings or Deep Contextualized Embeddings | Embedder parameters |
'cnn' | Convolutional neural networks (CNN) | CNN parameters |
'lstm' | Long short-term memory networks (LSTM) | LSTM parameters |
The 'params' is also a dict with several configurable keys, some of which are specific to the choice of classifier type and others common across all the above classifier types.
In the following sections, the allowed parameters for each choice of classifier type are outlined.
See the Common Configurable Params section for the params that are not specific to any one classifier type but are common across all of them.
1. 'embedder'
classifier type¶
MindMeld's 'embedder' classifier type uses a pooling operation on top of model embeddings, which are based on either a lookup table or a deep neural model:
- Lookup table embeddings can be derived based on a user-defined tokenization strategy: word-level, sub-word-level, or character-level tokenization (see Tokenization Choices below for more details). By default, the lookup table is randomly initialized, but it can instead be initialized from a pretrained checkpoint (such as GloVe) when using a word-level tokenization strategy.
- Deep contextualized embedders are pretrained embedders in the style of BERT, which come with their own tokenization strategy and neural embedding process.
In either case, all the underlying weights can be tuned to the training data provided, or can be kept frozen during the training process. Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small datasets.
Note
Specify the embedding choice using the param embedder_type. Set it to None, 'glove', or 'bert' to use, respectively, a randomly initialized embedding lookup table, a lookup table initialized with GloVe (or GloVe-like formatted) pretrained embeddings, or a BERT-like pretrained transformer-based deep contextualized embedder.
The following are the optional params that are configurable for each choice of the embedder_type param.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
1.1 Embedding Lookup Table (embedder_type
: None
)¶
Configuration Key | Description |
---|---|
emb_dim |
Number of dimensions for each token's embedding. Type: int Default: Choices: Any positive integer |
tokenizer_type |
The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details. Type: str Default: Choices: See Tokenization Choices |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
embedder_output_pooling_type |
Specifies the manner in which a query's token-wise embeddings are to be collated into a single embedding before passing through classification layer. Type: str Default: Choices: |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
Below is a minimal working example of a sequence classifier configuration for a classifier based on an embedding lookup table:
{
'model_type': 'text',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': None,
'emb_dim': 256,
},
}
1.2 Pretrained Embedding Lookup Table (embedder_type
: glove
)¶
Configuration Key | Description |
---|---|
token_dimension |
Specifies the dimension of the GloVe-6B pretrained word vectors. This key is only valid when using Type: int Default: Choices: |
token_pretrained_embedding_filepath |
Specifies a local file path for pretrained embedding file. This key is only valid when using Type: Union[str, None] Default: Choices: File path to a valid GloVe-style embeddings file |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
embedder_output_pooling_type |
Specifies the manner in which a query's token-wise embeddings are to be collated into a single embedding before passing through classification layer. Type: str Default: Choices: |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
Below is a minimal working example of a sequence classifier configuration for a classifier based on a pretrained-initialized embedding lookup table:
{
'model_type': 'text',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': 'glove',
'update_embeddings': True,
},
}
1.3 Deep Contextualized Embeddings (embedder_type
: bert
)¶
Configuration Key | Description |
---|---|
pretrained_model_name_or_path |
Specifies a pretrained checkpoint's name or a valid file path to load a bert-like embedder. This key is only valid when using Type: str Default: Choices: Any valid name from Huggingface Models Hub or a valid folder path where the model's weights as well as its tokenizer's resources are present. |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
embedder_output_pooling_type |
Specifies the manner in which a query's token-wise embeddings are to be collated into a single embedding before passing through classification layer. Type: str Default: Choices: |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
save_frozen_embedder |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a sequence classifier configuration for a classifier based on a BERT-like embedder:
{
'model_type': 'text',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': 'bert',
'pretrained_model_name_or_path': 'distilbert-base-uncased',
'update_embeddings': True,
},
}
2. 'cnn'
classifier type¶
Convolutional neural networks (CNN) based text classifiers are light-weight neural classifiers that have achieved remarkably strong performance on the practically important task of sentence classification.
Using a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those sequences into low-dimensional vectors using an embedding lookup table. The subsequent layer performs convolutions over the sequence of embedded word vectors using kernels (also called filters); kernels of different lengths capture different n-gram patterns from the input text. For each chosen length, several kernels are used to capture different patterns at the same receptive range. Finally, each kernel leads to one feature map.
Each feature map is reduced to the maximum value observed in that map, and maximum values from all maps are combined to form a long feature vector.
This vector is analogous to an 'embedder' classifier's pooled output, which is then passed through a classification layer.
Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small sized datasets.
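The following is a minimal, conceptual PyTorch sketch of the architecture described above (embedding lookup, multiple kernel widths, max-over-time pooling, and a classification layer). It is for illustration only and is not MindMeld's internal implementation:
# Conceptual sketch (not MindMeld's internal code) of a CNN text classifier
# with max-over-time pooling over multiple kernel widths.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, window_sizes=(3, 4, 5),
                 number_of_windows=(100, 100, 100), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 1D convolution per kernel width; each yields several feature maps.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, n, kernel_size=w)
            for w, n in zip(window_sizes, number_of_windows)
        ])
        self.classifier = nn.Linear(sum(number_of_windows), num_classes)

    def forward(self, token_ids):                       # [batch, seq_len]
        x = self.embedding(token_ids).transpose(1, 2)   # [batch, emb_dim, seq_len]
        # Max-over-time pooling: keep the strongest activation per feature map.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)             # [batch, sum(number_of_windows)]
        return self.classifier(features)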
The following are the optional params that are configurable with the 'cnn' classifier type.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
Configuration Key | Description |
---|---|
embedder_type |
The choice of embeddings to be used. Specifying Type: Union[str, None] Default: None Choices: |
emb_dim |
Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder. Type: int Default: Choices: Any positive integer |
tokenizer_type |
The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details. Type: str Default: Choices: See Tokenization Choices |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
window_sizes |
The lengths of 1D CNN kernels to be used for convolution on top of embeddings. Type: List[int] Default: Choices: A list of positive integers |
number_of_windows |
The number of kernels per each specified length of 1D CNN kernels. Type: List[int] Default: Choices: A list of positive integers; same length as |
Below is a minimal working example of a sequence classifier configuration for a classifier based on CNNs:
{
'model_type': 'text',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'cnn'},
'params': {
'embedder_type': 'glove',
'window_sizes': [3,4,5],
'number_of_windows': [100,100,100],
},
}
3. 'lstm'
classifier type¶
Long short-term memory networks (LSTM) based text classifiers utilize recurrent feedback connections to be able to learn temporal dependencies in sequential data.
Using a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those sequences into low-dimensional vectors using an embedding lookup table. The subsequent layer applies LSTM over the sequence of embedded word vectors. An LSTM's ability to maintain temporal information is generally dependent on its hidden dimension. The LSTM processes the text from left-to-right or in the case of a bi-directional LSTM (bi-LSTM), it can process the text both ways, from left-to-right and right-to-left. This yields an output sequence of one vector per token of the input text. Optionally, several LSTMs can then be stacked, with the output of one serving as the input to another.
To obtain a single vector per input text, the vectors for each token can be pooled or the last vector in the sequence can simply be used as the representative vector.
This vector is analogous to an 'embedder' classifier's pooled output, which is then passed through a classification layer.
Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small sized datasets.
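Analogously, the following is a minimal, conceptual PyTorch sketch of the bi-LSTM classifier described above (embedding lookup, a stacked bi-directional LSTM, pooling of the per-token outputs, and a classification layer). It is for illustration only and is not MindMeld's internal implementation:
# Conceptual sketch (not MindMeld's internal code) of a bi-LSTM text classifier.
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, lstm_hidden_dim=128,
                 lstm_num_layers=2, bidirectional=True, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, lstm_hidden_dim, num_layers=lstm_num_layers,
                            bidirectional=bidirectional, batch_first=True)
        out_dim = lstm_hidden_dim * (2 if bidirectional else 1)
        self.classifier = nn.Linear(out_dim, num_classes)

    def forward(self, token_ids):                 # [batch, seq_len]
        embedded = self.embedding(token_ids)      # [batch, seq_len, emb_dim]
        outputs, _ = self.lstm(embedded)          # [batch, seq_len, out_dim]
        # Mean-pool the per-token outputs into one vector per query;
        # using the last output vector is the other option mentioned above.
        pooled = outputs.mean(dim=1)
        return self.classifier(pooled)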
The following are the optional params that are configurable with the 'lstm' classifier type.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
Configuration Key | Description |
---|---|
embedder_type |
The choice of embeddings to be used. Specifying Type: Union[str, None] Default: None Choices: |
emb_dim |
Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder. Type: int Default: Choices: Any positive integer |
tokenizer_type |
The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details. Type: str Default: Choices: See Tokenization Choices |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
lstm_hidden_dim |
Number of states per each LSTM layer. Type: int Default: Choices: Any positive integer |
lstm_num_layers |
The number of LSTM layers that are to be stacked sequentially. Type: int Default: Choices: Any positive integer |
lstm_keep_prob |
Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer. Type: float Default: Choices: A float between 0 and 1 |
lstm_bidirectional |
If Type: bool Default: Choices: |
lstm_output_pooling_type |
Specifies the manner in which a query's token-wise embeddings are to be collated into a single embedding before passing through classification layer. Type: str Default: Choices: |
Below is a minimal working example of a sequence classifier configuration for a classifier based on LSTMs:
{
'model_type': 'text',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'lstm'},
'params': {
'embedder_type': 'glove',
'lstm_hidden_dim': 128,
'lstm_bidirectional': True,
},
}
Entity recognition¶
Using MindMeld’s deep neural models requires configuring only two keys in your classifier configuration dictionaries: 'model_settings' and 'params'.
When working with the deep models, the 'features' and 'param_selection' keys in the classifier configuration are not used, since there is neither hand-crafted feature engineering nor automated hyperparameter tuning.
This is a departure from other documentation on Working with the Entity Recognizer, which outlines that a text classifier configuration requires two additional keys ('features' and 'param_selection').
The 'model_settings' is a dict with the single key 'classifier_type', whose value specifies the machine learning model to use.
The allowed values of 'classifier_type' that are backed by deep neural nets and are meant for token classification are:
Value | Classifier | Reference for configurable parameters |
---|---|---|
'embedder' | Pooled Token Embeddings or Deep Contextualized Embeddings | Embedder parameters |
'lstm-pytorch' | Long short-term memory networks (LSTM) | LSTM-PYTORCH parameters |
'cnn-lstm' | Character-level Convolutional neural networks (CNN) followed by word-level Long short-term memory networks (LSTM) | CNN-LSTM parameters |
'lstm-lstm' | Character-level Long short-term memory networks (LSTM) followed by word-level Long short-term memory networks (LSTM) | LSTM-LSTM parameters |
'lstm' | Long short-term memory networks (LSTM) coupled with gazetteer encodings and backed by Tensorflow | LSTM parameters |
The 'params' is also a dict with several configurable keys, some of which are specific to the choice of classifier type and others common across all the above classifier types.
In the following sections, the allowed parameters for each choice of classifier type are outlined.
See the Common Configurable Params section for the params that are common across all classifier types.
1. 'embedder'
classifier type¶
This classifier type includes neural models that are based on either an embedding lookup table or a deep contextualized embedder, the outputs of which are then passed through a Conditional Random Field (CRF) or a Softmax layer that labels each target word as a particular entity.
- Lookup table embeddings can be derived based on a user-defined tokenization strategy: word-level, sub-word-level, or character-level tokenization (see Tokenization Choices below for more details). By default, the lookup table is randomly initialized, but it can instead be initialized from a pretrained checkpoint (such as GloVe) when using a word-level tokenization strategy.
- Deep contextualized embedders are pretrained embedders in the style of BERT, which come with their own tokenization strategy and neural embedding process.
In either case, all the underlying weights can be tuned to the training data provided, or can be kept frozen during the training process. Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small datasets.
The 'embedder' classifier type pools the vectors of all tokens corresponding to a word that has been assigned an entity tag, so as to obtain a single vector per word in an input text.
This is unlike sequence classification models, where all tokens of all words are pooled together and then passed through a classification layer.
Note
Specify the embedding choice using the param embedder_type. Set it to None, 'glove', or 'bert' to use, respectively, a randomly initialized embedding lookup table, a lookup table initialized with GloVe (or GloVe-like formatted) pretrained embeddings, or a BERT-like pretrained transformer-based deep contextualized embedder.
The following are the optional params that are configurable for each choice of the embedder_type param.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
1.1 Embedding Lookup Table (embedder_type
: None
)¶
Configuration Key | Description |
---|---|
emb_dim |
Number of dimensions for each token's embedding. Type: int Default: Choices: Any positive integer |
tokenizer_type |
The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details. Type: str Default: Choices: See Tokenization Choices |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
token_spans_pooling_type |
Specifies the manner in which a word's token-wise embeddings are to be collated into a single embedding before passing through entity classification layer. Type: str Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on an embedding lookup table:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': None,
'emb_dim': 256,
},
}
1.2 Pretrained Embedding Lookup Table (embedder_type
: glove
)¶
Configuration Key | Description |
---|---|
token_dimension |
Specifies the dimension of the GloVe-6B pretrained word vectors. This key is only valid when using Type: int Default: Choices: |
token_pretrained_embedding_filepath |
Specifies a local file path for pretrained embedding file. This key is only valid when using Type: Union[str, None] Default: Choices: File path to a valid GloVe-style embeddings file |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
token_spans_pooling_type |
Specifies the manner in which a word's token-wise embeddings are to be collated into a single embedding before passing through entity classification layer. Type: str Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on a pretrained-initialized embedding lookup table:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': 'glove',
'update_embeddings': True,
},
}
1.3 Deep Contextualized Embeddings (embedder_type
: bert
)¶
Configuration Key | Description |
---|---|
pretrained_model_name_or_path |
Specifies a pretrained checkpoint's name or a valid file path to load a bert-like embedder. This key is only valid when using Type: str Default: Choices: Any valid name from Huggingface Models Hub or a valid folder path where the model's weights as well as its tokenizer's resources are present. |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
save_frozen_embedder |
If set to Type: bool Default: Choices: |
token_spans_pooling_type |
Specifies the manner in which a word's token-wise embeddings are to be collated into a single embedding before passing through entity classification layer. Type: str Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on a BERT embedder:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'embedder'},
'params': {
'embedder_type': 'bert',
'pretrained_model_name_or_path': 'distilbert-base-uncased',
'update_embeddings': True,
},
}
2. 'lstm-pytorch'
classifier type¶
Long short-term memory networks (LSTM) based text classifiers utilize recurrent feedback connections to be able to learn temporal dependencies in sequential data.
Using a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those sequences into low-dimensional vectors using an embedding lookup table. The subsequent layer applies LSTM over the sequence of embedded word vectors. An LSTM's ability to maintain temporal information is generally dependent on its hidden dimension. The LSTM processes the text from left-to-right or in the case of a bi-directional LSTM (bi-LSTM), it can process the text both ways, from left-to-right and right-to-left. This yields an output sequence of one vector per token of the input text. Optionally, several LSTMs can then be stacked, with the output of one serving as the input to another.
To obtain a single vector per word of the input text, the vectors of all tokens corresponding to each word (for which an entity tag is to be ascertained) are pooled.
This vector is analogous to an 'embedder' classifier's output, which is then passed through a classification layer.
Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small sized datasets.
The following are the optional params that are configurable with the 'lstm-pytorch' classifier type.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
Configuration Key | Description |
---|---|
embedder_type |
The choice of embeddings to be used. Specifying Type: Union[str, None] Default: None Choices: |
emb_dim |
Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder. Type: int Default: Choices: Any positive integer |
tokenizer_type |
The choice of tokenization strategy to extract tokens from the training data. See Tokenization Choices section below for more details. Type: str Default: Choices: See Tokenization Choices |
add_terminals |
If set to Type: Union[bool, None] Default: Choices: |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
lstm_hidden_dim |
Number of states per each LSTM layer. Type: int Default: Choices: Any positive integer |
lstm_num_layers |
The number of LSTM layers that are to be stacked sequentially. Type: int Default: Choices: Any positive integer |
lstm_keep_prob |
Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer. Type: float Default: Choices: A float between 0 and 1 |
lstm_bidirectional |
If Type: bool Default: Choices: |
token_spans_pooling_type |
Specifies the manner in which a word's token-wise embeddings are to be collated into a single embedding before passing through entity classification layer. Type: str Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on LSTMs:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'lstm-pytorch'},
'params': {
'embedder_type': 'glove',
'lstm_hidden_dim': 128,
'lstm_bidirectional': True,
},
}
3. 'cnn-lstm'
classifier type¶
Long short-term memory networks (LSTM) based text classifiers utilize recurrent feedback connections to learn temporal dependencies in sequential data. When coupled with Convolutional neural networks (CNN) for extracting character-level features from the input text, the overall architecture can model the textual data better and is more robust to variations in spelling.
Using a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those sequences into low-dimensional vectors using an embedding lookup table. This is then concatenated with the outputs of each word's convolutions at the character-level using kernels of different lengths to capture different patterns. These convolutions are similar to those of CNN classifier type except they are applied for each word in the input text separately to obtain one representation for each word.
The subsequent layer applies LSTM over the sequence of concatenated word vectors. An LSTM's ability to maintain temporal information is generally dependent on its hidden dimension. The LSTM processes the text from left-to-right or in the case of a bi-directional LSTM (bi-LSTM), it can process the text both ways, from left-to-right and right-to-left. This yields an output sequence of one vector per token of the input text. Optionally, several LSTMs can then be stacked, with the output of one serving as the input to another.
The 'cnn-lstm' classifier type pools the vectors of all tokens corresponding to words that have been assigned an entity tag, so as to obtain a single vector per word in an input text.
This vector is analogous to an 'embedder' classifier's output, which is then passed through a classification layer.
Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small sized datasets.
The following are the optional params that are configurable with the 'cnn-lstm' classifier type.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
Configuration Key | Description |
---|---|
embedder_type |
The choice of embeddings to be used. Specifying Type: Union[str, None] Default: None Choices: |
emb_dim |
Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder. Type: int Default: Choices: Any positive integer |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
lstm_hidden_dim |
Number of states per each LSTM layer. Type: int Default: Choices: Any positive integer |
lstm_num_layers |
The number of LSTM layers that are to be stacked sequentially. Type: int Default: Choices: Any positive integer |
lstm_keep_prob |
Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer. Type: float Default: Choices: A float between 0 and 1 |
lstm_bidirectional |
If Type: bool Default: Choices: |
char_emb_dim |
Number of dimensions for each character's embedding. Type: int Default: Choices: Any positive integer |
char_window_sizes |
The lengths of 1D CNN kernels to be used for character-level convolution on top of character embeddings. Type: List[int] Default: Choices: A list of positive integers |
char_number_of_windows |
The number of kernels per each specified length of 1D CNN kernels in Type: List[int] Default: Choices: A list of positive integers; same length as |
char_cnn_output_keep_prob |
Keep probability for the dropout layer placed on top of character CNN's output. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
char_proj_dim |
The final dimension of each character after it is transformed by the character-level network.
Usually greater than the Type: Union[int, None] Default: Choices: Any positive integer, |
char_padding_length |
The maximum number of characters allowed per word.
If a word has more characters than Type: Union[int, None] Default: Choices: Any positive integer, |
char_add_terminals |
If set to Type: bool Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on CNN-LSTM:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'cnn-lstm'},
'params': {
'embedder_type': 'glove',
'lstm_hidden_dim': 128,
'lstm_bidirectional': True,
'char_emb_dim': 32
},
}
4. 'lstm-lstm'
classifier type¶
Long short-term memory networks (LSTM) based text classifiers utilize recurrent feedback connections to learn temporal dependencies in sequential data. When coupled with a second, character-level LSTM for extracting character-level features from the input text, the overall architecture can model the textual data better and is more robust to variations in spelling.
Using a sequence of textual tokens extracted from the input text, the first layer of this classifier type embeds those tokens into low-dimensional vectors using an embedding lookup table, and concatenates them with the outputs of a character-level bi-LSTM (run over each word individually) to capture character-level patterns.
The subsequent layer applies LSTM over the sequence of concatenated word vectors. An LSTM's ability to maintain temporal information is generally dependent on its hidden dimension. The LSTM processes the text from left-to-right or in the case of a bi-directional LSTM (bi-LSTM), it can process the text both ways, from left-to-right and right-to-left. This yields an output sequence of one vector per token of the input text. Optionally, several LSTMs can then be stacked, with the output of one serving as the input to another.
The 'lstm-lstm' classifier type pools the vectors of all tokens corresponding to words that have been assigned an entity tag, so as to obtain a single vector per word in an input text.
This vector is analogous to an 'embedder' classifier's output, which is then passed through a classification layer.
Dropout layers are used as regularizers to avoid over-fitting, which is a more common phenomenon when working with small sized datasets.
The following are the optional params that are configurable with the 'lstm-lstm' classifier type.
See the Common Configurable Params section for additional configurable params that are common across classifiers.
Configuration Key | Description |
---|---|
embedder_type |
The choice of embeddings to be used. Specifying Type: Union[str, None] Default: None Choices: |
emb_dim |
Number of dimensions for each token's embedding. This key is only valid when not using a pretrained embedder. Type: int Default: Choices: Any positive integer |
update_embeddings |
If set to Type: bool Default: Choices: |
embedder_output_keep_prob |
Keep probability for the dropout layer placed on top of embeddings. Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
output_keep_prob |
Keep probability for the dropout layer placed on top of classifier's penultimate layer (i.e the layer before logits are computed). Dropout helps in regularization and reduces over-fitting. Type: float Default: Choices: A float between 0 and 1 |
lstm_hidden_dim |
Number of states per each LSTM layer. Type: int Default: Choices: Any positive integer |
lstm_num_layers |
The number of LSTM layers that are to be stacked sequentially. Type: int Default: Choices: Any positive integer |
lstm_keep_prob |
Keep probability for the nodes that constitute the outputs of each LSTM layer except the last LSTM layer. Type: float Default: Choices: A float between 0 and 1 |
lstm_bidirectional |
If Type: bool Default: Choices: |
char_emb_dim |
Number of dimensions for each character's embedding. Type: int Default: Choices: Any positive integer |
char_lstm_hidden_dim |
Number of states per each character-level LSTM layer. Type: int Default: Choices: Any positive integer |
char_lstm_num_layers |
The number of character-level LSTM layers that are to be stacked sequentially. Type: int Default: Choices: Any positive integer |
char_lstm_keep_prob |
Keep probability for the nodes that constitute the outputs of each character-level LSTM layer except the last layer in the stack. Type: float Default: Choices: A float between 0 and 1 |
char_lstm_bidirectional |
If Type: bool Default: Choices: |
char_lstm_output_pooling_type |
Specifies the manner in which a word's character-level embeddings are to be collated into a single embedding before passing to subsequent layers. Type: str Default: Choices: |
char_proj_dim |
The final dimension of each character after it is transformed by the character-level network.
Usually greater than the Type: Union[int, None] Default: Choices: Any positive integer, |
char_padding_length |
The maximum number of characters allowed per word.
If a word has more characters than Type: Union[int, None] Default: Choices: Any positive integer, |
char_add_terminals |
If set to Type: bool Default: Choices: |
use_crf_layer |
If set to Type: bool Default: Choices: |
Below is a minimal working example of a token classifier configuration for a classifier based on LSTM-LSTM:
{
'model_type': 'tagger',
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt',
'model_settings': {'classifier_type': 'lstm-lstm'},
'params': {
'embedder_type': 'glove',
'lstm_hidden_dim': 128,
'lstm_bidirectional': True,
'char_emb_dim': 32,
},
}
5. 'lstm'
classifier type¶
A Tensorflow-backed implementation of a bi-directional Long Short-Term Memory (LSTM) network.
Note
To use this classifier type, make sure to install the Tensorflow requirement by running in the shell:
pip install mindmeld[tensorflow]
The MindMeld Bi-Directional LSTM network
- encodes words as pre-trained word embeddings using Stanford's GloVe representation
- encodes characters using a convolutional network trained on the training data
- concatenates the word and character embeddings together and feeds them into the bi-directional LSTM
- couples the forget and input gates of the LSTM using a peephole connection, to improve overall accuracies on downstream NLP tasks
- feeds the output of the LSTM into a linear chain Conditional Random Field (CRF) or Softmax layer which labels the target word as a particular entity
The following are the optional params that are configurable with the 'lstm' classifier type; a minimal example configuration follows the table.
Unlike the other classifier types, this classifier type does not use the common configurable params listed in the Addendum below.
Parameter name | Description |
---|---|
padding_length |
The sequence model treats this as the maximum number of words in a query.
If a query has more words than Typically set to the maximum word length of query expected both at train and predict time. Default: Example:
|
batch_size |
Size of each batch of training data to feed into the network (which uses mini-batch learning). Default: Example:
|
display_epoch |
The network displays training accuracy statistics at this interval, measured in epochs. Default: Example:
|
number_of_epochs |
Total number of complete iterations of the training data to feed into the network. In each iteration, the data is shuffled to break any prior sequence patterns. Default: Example:
|
optimizer |
Optimizer to use to minimize the network's stochastic objective function. Default: Example:
|
learning_rate |
Parameter to control the size of weight and bias changes of the training algorithm as it learns. This article explains Learning Rate in technical terms. Default: Example:
|
dense_keep_prob |
In the context of the ''dropout'' technique (a regularization method to prevent overfitting), keep probability specifies the proportion of nodes to "keep"—that is, to exempt from dropout during the network's learning phase. The Default: Example:
|
lstm_input_keep_prob |
Keep probability for the nodes that constitute the inputs to the LSTM cell. Default: Example:
|
lstm_output_keep_prob |
Keep probability for the nodes that constitute the outputs of the LSTM cell. Default: Example:
|
token_lstm_hidden_state_dimension |
Number of states per LSTM cell. Default: Example:
|
token_embedding_dimension |
Number of dimensions for word embeddings. Allowed values: [50, 100, 200, 300]. Default: Example:
|
gaz_encoding_dimension |
Number of nodes to connect to the gazetteer encodings in a fully-connected network. Default: Example:
|
max_char_per_word |
The sequence model treats this as the maximum number of characters in a word.
If a word has more characters than Usually set to the size of the longest word in the training and test sets. Default: Example:
|
use_crf_layer |
If set to If set to Default: Example:
|
use_character_embeddings |
If set to If set to Note: Using character embedding significantly increases training time compared to vanilla word embeddings only. Default: Example:
|
char_window_sizes |
List of window sizes for convolutions that the network should use to build the character embeddings. Usually in decreasing numerical order. Note: This parameter is needed only if Default: Example:
|
character_embedding_dimension |
Initial dimension of each character before it is fed into the convolutional network. Note: This parameter is needed only if Default: Example:
|
word_level_character_embedding_size |
The final dimension of each character after it is transformed by the convolutional network. Usually greater than Note: This parameter is needed only if Default: Example:
|
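Below is a minimal example of a token classifier configuration for this Tensorflow-backed classifier. This is a sketch: the parameter values shown are illustrative rather than recommended defaults, and they correspond to the keys documented in the table above:
{
    'model_type': 'tagger',
    'train_label_set': 'train.*\.txt',
    'test_label_set': 'test.*\.txt',
    'model_settings': {'classifier_type': 'lstm'},
    'params': {
        'padding_length': 19,
        'batch_size': 20,
        'number_of_epochs': 120,
        'token_embedding_dimension': 100,
        'use_crf_layer': True,
    },
}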
Addendum¶
Common Configurable Params¶
The following params are commonly configurable across all the classifier types described above, both for domain/intent classification and for entity recognition; an example combining them with classifier-specific params follows the table.
Parameter name | Description |
---|---|
device |
Name of the device on which torch tensors will be allocated. The Type: str Default: Choices: |
number_of_epochs |
The total number of complete iterations of the training data to feed into the network. In each iteration, the data is shuffled to break any prior sequence patterns. Type: int Default: Choices: Any positive integer |
patience |
The number of epochs to wait without any improvement on the validation metric before terminating training. Type: int Default: Choices: Any positive integer |
batch_size |
Size of each batch of training data to feed into the network (which uses mini-batch learning). Type: int Default: Choices: Any positive integer |
gradient_accumulation_steps |
Number of consecutive mini-batches for which gradients will be averaged and accumulated before updating the weights of the network. Type: int Default: Choices: Any positive integer |
max_grad_norm |
Maximum norm to which the accumulated gradients' norm is to be clipped. Type: Union[float, None] Default: Choices: Any positive float, |
optimizer |
Optimizer to use to minimize the network's stochastic objective function. Type: str Default: Choices: A valid name from Pytorch optimizers |
learning_rate |
Parameter to control the size of weight and bias changes of the training algorithm as it learns. This article explains Learning Rate in technical terms. Type: float Default: Choices: Any positive float |
validation_metric |
The metric used to track model improvements on the validation data split. Type: str Default: Choices: |
dev_split_ratio |
The fraction of samples in the training data that are to be used for validation; sampled randomly. Type: float Default: Choices: A float between 0 and 1 |
padding_length |
The maximum number of tokens (words, sub-words, or characters) allowed in a query. If a query has
more tokens than Type: Union[int, None] Default: Choices: Any positive integer, |
query_text_type |
Determines the choice of text that is fed into the neural model. This param is coupled with the
Text Preparation Pipeline when using a choice other than
Type: str Default: Choices: |
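These common params can be combined with the classifier-specific params described earlier. For instance, the following sequence classifier configuration (a sketch; all values shown are illustrative, not recommended defaults) overrides a few of the common training params alongside the CNN-specific ones:
{
    'model_type': 'text',
    'train_label_set': 'train.*\.txt',
    'test_label_set': 'test.*\.txt',
    'model_settings': {'classifier_type': 'cnn'},
    'params': {
        # classifier-specific params
        'embedder_type': 'glove',
        'window_sizes': [3, 4, 5],
        'number_of_windows': [100, 100, 100],
        # common params
        'device': 'cpu',
        'number_of_epochs': 50,
        'patience': 5,
        'batch_size': 32,
        'optimizer': 'Adam',
        'learning_rate': 0.001,
        'dev_split_ratio': 0.2,
    },
}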
Tokenization Choices¶
A noteworthy distinction between the traditional suite of models and the deep neural models is the way inputs are prepared for the underlying model.
While inputs for the former are prepared based on the specifications provided in the 'features' key of the classifier's config, inputs to the deep neural models are naive in the sense that they are simply a sequence of tokens from the input query; the deep models do the heavy lifting of discovering patterns to classify the text.
Broadly, tokens can be extracted from an input text as a sequence of individual characters, groups of characters (aka. sub-words), or the words themselves obtained by splitting the input text at whitespaces. Based on the choice of tokenization, a sequence of tokens is obtained from each input query, which is then converted into a sequence of ids for the neural model.
Note
- To use a specific tokenization strategy, simply set the tokenizer_type param to one of the following choices (e.g. {'tokenizer_type': 'whitespace-tokenizer'}); see the example configuration after this note.
- Note that some of the strategies are specific to the choice of embedder being used in the classifier.
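For instance, the following sequence classifier configuration (a sketch with illustrative values) trains a BPE sub-word vocabulary on your training queries and uses it with a randomly initialized embedding lookup table:
{
    'model_type': 'text',
    'train_label_set': 'train.*\.txt',
    'test_label_set': 'test.*\.txt',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {
        'embedder_type': None,
        'tokenizer_type': 'bpe-tokenizer',
        'emb_dim': 256,
    },
}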
Warning
The choices of tokenization presented here shouldn't be confused with the Tokenizers in the text preparation pipeline. The tokenizers in the text preparation pipeline are used to prepare the text that is fed to the neural models, while the following are used to produce the sequence of tokens for the underlying embedders.
The neural suite offers the following tokenization choices for preparing inputs to the neural models.
1. 'whitespace-tokenizer'
¶
A Whitespace tokenizer tokenizes a query into a sequence of tokens by splitting it at whitespaces. The resulting tokens are simply the words present in the query. This tokenization strategy is stateless, and the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data.
2. 'char-tokenizer'
¶
A Character tokenizer tokenizes a query into the sequence of characters present in it. This tokenization strategy is stateless, and the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data.
3. 'bpe-tokenizer'
¶
A Byte-Pair Encoding (BPE) tokenizer tokenizes a query into a sequence of sub-words based on a vocabulary created from all of the queries in the training data. This tokenization strategy is stateful, and the sequence of tokens produced for an input text might not be the same if the queries present in the training data change. This tokenizer is implemented using Huggingface's Tokenizers library.
4. 'wordpiece-tokenizer'
¶
A Word-Piece tokenizer tokenizes a query into a sequence of sub-words based on a vocabulary created from all of the queries in the training data. This tokenization strategy is stateful, and the sequence of tokens produced for an input text might not be the same if the queries present in the training data change. This tokenizer is implemented using Huggingface's Tokenizers library.
5. 'huggingface_pretrained-tokenizer'
¶
A tokenizer pretrained and available as part of the Huggingface Transformers library.
Although this tokenization strategy is stateful (due to its pretraining), the sequence of tokens produced for an input text is the same irrespective of the queries present in the training data.
To use this tokenizer, set the tokenizer_type and pretrained_model_name_or_path keys appropriately, for example: {'tokenizer_type': 'huggingface_pretrained-tokenizer', 'pretrained_model_name_or_path': 'distilbert-base-uncased'}.
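Put together, a sequence classifier configuration that pairs this pretrained tokenizer with a randomly initialized embedding lookup table (embedder_type set to None) might look like the following sketch; the checkpoint name and other values are illustrative:
{
    'model_type': 'text',
    'train_label_set': 'train.*\.txt',
    'test_label_set': 'test.*\.txt',
    'model_settings': {'classifier_type': 'embedder'},
    'params': {
        'embedder_type': None,
        'tokenizer_type': 'huggingface_pretrained-tokenizer',
        'pretrained_model_name_or_path': 'distilbert-base-uncased',
    },
}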