mindmeld.models.taggers.pytorch_crf module

class mindmeld.models.taggers.pytorch_crf.CRFModel
    Bases: torch.nn.modules.module.Module

    PyTorch model class for conditional random fields.

    build_params(num_features, num_classes)
        Sets the parameters for the layers in the PyTorch CRF model. The naming convention is kept consistent with the CRFSuite implementation.

    compute_marginal_probabilities(inputs, mask)
        Calculates the marginal probability of each tag for every token. The implementation is borrowed from https://github.com/kmkurn/pytorch-crf/pull/37.
        Parameters:
            - inputs (torch.Tensor) – Batch of padded input tensors.
            - mask (torch.Tensor) – Batch of mask tensors to account for padded inputs.
        Returns: Marginal probabilities of every tag for each token in every sequence.

    fit(X, y)
        Trains the entire PyTorch CRF model.
        Parameters:
            - X (list of list of dicts) – Generally a list of feature vectors, one for each training example.
            - y (list of lists) – A list of classification labels (encoded by the label_encoder, NOT MindMeld entity objects).

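As an illustration only (not part of the library's documentation), a minimal sketch of the expected X/y layout and a training call. It assumes CRFModel can be constructed with no arguments and that hyperparameters are set via set_params() before fitting; the feature names and label encodings are made up.

```python
from mindmeld.models.taggers.pytorch_crf import CRFModel

# X: one list of per-token feature dicts per training example (feature names are illustrative).
X = [
    [{"w": "play", "len": 4}, {"w": "jazz", "len": 4}],  # sequence 1: two tokens
    [{"w": "stop", "len": 4}],                           # sequence 2: one token
]
# y: the matching label-encoded tag sequences (illustrative integer encodings).
y = [
    [1, 2],
    [1],
]

model = CRFModel()   # assumption: the class constructs without arguments
model.set_params()   # assumption: hyperparameters are set/validated before fit()
model.fit(X, y)      # trains the CRF end to end
```
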
    forward(inputs, targets, mask, drop_input=0.0)
        The forward pass of the PyTorch CRF model. Returns predictions or the loss, depending on whether labels are passed.
        Returns: Loss from training, or predictions for the input sequence.
        Return type: torch.Tensor (loss) or list (predictions)

    get_dataloader(X, y, is_train)
        Creates and returns a PyTorch DataLoader instance for the training/test data.
        Returns: A PyTorch DataLoader that can be used to iterate over the data.
        Return type: torch_dataloader (torch.utils.data.dataloader.DataLoader)

    load_best_weights_path(path)
        Loads the best model weights from a path in the .generated folder.
        Parameters: path (str) – Path to load the best model weights from.

    predict(X)
        Gets predicted labels for the data.
        Parameters: X (list of list of dicts) – Feature vectors for the data to predict labels on.
        Returns: Predictions for each token in each sequence.
        Return type: preds (list of lists)

    predict_marginals(X)
        Gets the marginal probability of each tag per token for each sequence.
        Parameters: X (list of list of dicts) – Feature vectors for the data to predict marginal probabilities on.
        Returns: The probability of every tag for each token in a sequence.
        Return type: marginals_dict (list of list of dicts)

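Continuing the earlier fit() sketch (illustrative only; `model` is the fitted CRFModel and the feature dicts are hypothetical), the two prediction methods differ in what they return per token:

```python
X_test = [[{"w": "play", "len": 4}, {"w": "rock", "len": 4}]]

preds = model.predict(X_test)                # e.g. [[1, 2]] -- one label per token
marginals = model.predict_marginals(X_test)  # one {tag: probability} dict per token

token_dist = marginals[0][0]                    # tag distribution for the first token
best_tag = max(token_dist, key=token_dist.get)  # most probable tag for that token
```
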
    run_predictions(dataloader, calc_f1=False)
        Gets predictions for the data by running an inference pass of the model.
        Parameters:
            - dataloader (torch.utils.data.dataloader.DataLoader) – DataLoader for the test/validation data.
            - calc_f1 (bool) – Flag to return the dev F1 score instead of per-token predictions.
        Returns: Dev F1 score or predictions for each token in a sequence.

    save_best_weights_path(path)
        Saves the best model weights to a path in the .generated folder.
        Parameters: path (str) – Path to save the best model weights.

    set_params(feat_type='hash', feat_num=50000, stratify_train_val_split=True, drop_input=0.2, batch_size=8, number_of_epochs=100, patience=3, dev_split_ratio=0.2, optimizer='sgd', l1_weight=0, l2_weight=0, random_state=None, **kwargs)
        Sets and validates the parameters for the PyTorch CRF model.
        Parameters:
            - feat_type (str) – The type of feature extractor. Supported options are 'dict' and 'hash'.
            - feat_num (int) – The number of features to be used by the FeatureHasher. Not supported with the DictVectorizer.
            - stratify_train_val_split (bool) – Flag for whether inputs should be stratified during the train-dev split.
            - drop_input (float) – The dropout probability applied to the input features.
            - batch_size (int) – Training batch size for the model.
            - number_of_epochs (int) – The number of epochs (passes over the training data) to train the model for.
            - patience (int) – Number of epochs to wait before stopping training if the dev score does not improve.
            - dev_split_ratio (float) – Fraction of the training data to be used for validation.
            - optimizer (str) – Type of optimizer used for the model. Supported options are 'sgd' and 'adam'.
            - random_state (int) – Integer value to set random seeds for deterministic output.
            - l1_weight (float) – Regularization weight for the L1 penalty.
            - l2_weight (float) – Regularization weight for the L2 penalty.

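For illustration, a hedged configuration sketch that exercises only the parameters documented above; the values shown are arbitrary, not recommendations, and the no-argument constructor is an assumption:

```python
from mindmeld.models.taggers.pytorch_crf import CRFModel

model = CRFModel()                 # assumption: no-argument construction
model.set_params(
    feat_type="hash",              # 'hash' (FeatureHasher) or 'dict' (DictVectorizer)
    feat_num=50000,                # only used with the FeatureHasher
    stratify_train_val_split=True,
    drop_input=0.2,
    batch_size=8,
    number_of_epochs=100,
    patience=3,                    # early stopping if the dev score stalls
    dev_split_ratio=0.2,
    optimizer="adam",              # 'sgd' or 'adam'
    l1_weight=0,
    l2_weight=0,
    random_state=42,               # fixes seeds for deterministic runs
)
```
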
    set_random_states()
        Sets the random seeds across all libraries used, for deterministic output.

    train_one_epoch(train_dataloader)
        Contains the training code for one epoch.
        Parameters: train_dataloader (torch.utils.data.dataloader.DataLoader) – DataLoader for the training data.

    training_loop(train_dataloader, dev_dataloader, tmp_save_path)
        Contains the training loop, where the model is trained for the specified number of epochs.
        Parameters:
            - train_dataloader (torch.utils.data.dataloader.DataLoader) – DataLoader for the training data.
            - dev_dataloader (torch.utils.data.dataloader.DataLoader) – DataLoader for the validation data.

class mindmeld.models.taggers.pytorch_crf.Encoder(feature_extractor='hash', num_feats=50000)
    Bases: object

    Encoder class responsible for feature extraction and label encoding for the PyTorch model.

    encode_padded_input(current_seq_len, max_seq_len, x)
        Pads the input sequence feature vectors to the maximum sequence length and returns the sparse torch tensor representation.
        Returns: Sparse COO tensor representation of the padded input sequence.
        Return type: sparse_feat_tensor (torch.Tensor)

    encode_padded_label(current_seq_len, max_seq_len, y)
        Pads the label sequence to the maximum sequence length and returns the torch tensor representation.
        Returns: PyTorch tensor representation of the padded label sequence.
        Return type: label_tensor (torch.Tensor)

    get_padded_transformed_tensors(inputs_or_labels, seq_lens, is_label)
        Returns the encoded and padded sparse tensor representations of the inputs/labels.
        Returns: PyTorch tensor representations of the padded input sequences/labels.
        Return type: encoded_tensors (list of torch.Tensor)

    get_tensor_data(feat_dicts, labels=None, fit=False)
        Transforms the feature dicts and labels into padded PyTorch sparse tensor data.
        Parameters:
            - feat_dicts (list of list of dicts) – Generally a list of feature vectors, one for each training example.
            - labels (list of lists) – A list of classification labels.
            - fit (bool) – Flag for whether to fit the feature extractor and label encoder.
        Returns:
            - encoded_tensor_inputs (list of torch.Tensor) – Sparse COO tensor representations of the encoded, padded input sequences.
            - seq_lens (list of ints) – The actual (unpadded) length of each sequence.
            - encoded_tensor_labels (list of torch.Tensor) – Tensor representations of the encoded, padded label sequences.

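A hedged usage sketch of the Encoder pipeline, based only on the signature and return values documented above; the feature dicts and label values are illustrative:

```python
from mindmeld.models.taggers.pytorch_crf import Encoder

feat_dicts = [
    [{"w": "play", "len": 4}, {"w": "jazz", "len": 4}],
    [{"w": "stop", "len": 4}],
]
labels = [[1, 2], [1]]             # illustrative label values

encoder = Encoder(feature_extractor="hash", num_feats=50000)
inputs, seq_lens, label_tensors = encoder.get_tensor_data(feat_dicts, labels, fit=True)

# inputs: sparse COO tensors, one per sequence, padded to the max sequence length
# seq_lens: the true (unpadded) length of each sequence, e.g. [2, 1]
# label_tensors: padded label tensors aligned with the inputs
```
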
class mindmeld.models.taggers.pytorch_crf.TaggerDataset(inputs, seq_lens, labels=None)
    Bases: torch.utils.data.dataset.Dataset

    PyTorch Dataset class used to handle tagger inputs, labels, and masks.

mindmeld.models.taggers.pytorch_crf.collate_tensors_and_masks(sequence)
    Custom collate function that ensures proper batching of sparse tensors, labels, and masks.
    Parameters: sequence (list of tuples) – Each tuple contains one input tensor, one mask tensor, and one label tensor.
    Returns: Batched representations of the input, label, and mask sequences.

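Putting the two pieces together, an illustrative sketch (not taken from the library) that wraps the Encoder output from the previous example in a TaggerDataset and batches it with the custom collate function:

```python
from torch.utils.data import DataLoader
from mindmeld.models.taggers.pytorch_crf import TaggerDataset, collate_tensors_and_masks

# `inputs`, `seq_lens`, and `label_tensors` are assumed to come from
# Encoder.get_tensor_data as in the earlier sketch.
dataset = TaggerDataset(inputs, seq_lens, labels=label_tensors)
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    collate_fn=collate_tensors_and_masks)

for batch in loader:
    # Each batch holds the diagonally stacked sparse inputs plus the matching
    # label and mask tensors; the exact tuple order follows collate_tensors_and_masks.
    pass
```
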
mindmeld.models.taggers.pytorch_crf.diag_concat_coo_tensors(tensors)
    Concatenates sparse PyTorch COO tensors diagonally so that they can be processed in batches.
    Parameters: tensors (tuple of torch.Tensor) – Tuple of sparse COO tensors to diagonally concatenate.
    Returns: A single sparse COO tensor that acts as a single batch.
    Return type: stacked_tensor (torch.Tensor)

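To illustrate what diagonal concatenation means here, a self-contained sketch in plain PyTorch (this is not the module's internal implementation): stacking two sparse matrices diagonally produces one block-diagonal batch in which the sequences never overlap.

```python
import torch

a = torch.tensor([[1., 0.], [0., 2.]]).to_sparse()  # sequence 1, shape (2, 2)
b = torch.tensor([[3., 0., 0.]]).to_sparse()         # sequence 2, shape (1, 3)

a, b = a.coalesce(), b.coalesce()
# Shift b's indices past a's extent so it lands in the bottom-right block.
offset = torch.tensor([[a.shape[0]], [a.shape[1]]])
stacked = torch.sparse_coo_tensor(
    torch.cat([a.indices(), b.indices() + offset], dim=1),
    torch.cat([a.values(), b.values()]),
    size=(a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]),
)
print(stacked.to_dense())
# tensor([[1., 0., 0., 0., 0.],
#         [0., 2., 0., 0., 0.],
#         [0., 0., 3., 0., 0.]])
```
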
mindmeld.models.taggers.pytorch_crf.stratify_input(X, y)
    Prepares the inputs and labels for stratification into train and dev data. Stratification is based on the set of unique labels present in each sequence. Samples whose label set occurs only once are duplicated across the inputs and labels so that scikit-learn's train_test_split does not fail.
    Returns:
        - str_X (list) – List of feature vectors, ready for stratification.
        - str_y (list) – List of labels, ready for stratification.
        - stratify_tuples (list) – Unique label for each example, used as the stratification value.

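A conceptual sketch of the stratification idea described above, using scikit-learn directly rather than the module's own code; the tag names, split ratio, and duplication step are illustrative:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = [["play", "jazz"], ["stop"], ["play", "rock"], ["pause"], ["wake", "me", "up", "at", "nine"]]
y = [["O", "B|genre"], ["O"], ["O", "B|genre"], ["O"], ["O", "O", "O", "O", "B|time"]]

# One stratification key per sequence: the set of unique labels it contains.
keys = ["+".join(sorted(set(labels))) for labels in y]

# Duplicate sequences whose key occurs only once so that the stratified
# train_test_split does not fail on singleton strata.
counts = Counter(keys)
for i, key in enumerate(list(keys)):
    if counts[key] == 1:
        X.append(X[i]); y.append(y[i]); keys.append(key)

X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.5, stratify=keys, random_state=42
)
```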