mindmeld.active_learning.data_loading module
This module contains classes used to load queries for the Active Learning Pipeline.
-
class
mindmeld.active_learning.data_loading.
DataBucket
(label_map, resource_loader, test_queries: mindmeld.resource_loader.ProcessedQueryList, unsampled_queries: mindmeld.resource_loader.ProcessedQueryList, sampled_queries: mindmeld.resource_loader.ProcessedQueryList) Bases:
object
Class to hold data throughout the Active Learning training pipeline. Responsible for data conversion, filtration, and storage.
-
static
filter_queries_by_nlp_component
(query_list: mindmeld.resource_loader.ProcessedQueryList, component_type: str, component_name: str) Filter queries for training preparation.
Returns: - filtered_queries_indices (list) -- List of indices of filtered queries.
- filtered_queries (list) -- List of filtered queries.
-
get_queries
(query_ids) Method to get multiple queries from the QueryCache given a list of query ids.
Parameters: query_ids (List[int]) -- List of ids corresponding to queries in the QueryCache. Returns: List of processed queries from the cache. Return type: queries (List[ProcessedQuery])
-
sample_and_update
(sampling_size: int, confidences_2d: List[List[float]], confidences_3d: List[List[List[float]]], heuristic: mindmeld.active_learning.heuristics.Heuristic, confidence_segments: Dict = None, tuning_type: mindmeld.constants.TuningType = <TuningType.CLASSIFIER: 'classifier'>) Method to sample a DataBucket's unsampled_queries and update its sampled_queries and newly_sampled_queries.
Parameters: - sampling_size (int) -- Number of elements to sample in the next iteration.
- confidences_2d (List[List[float]]) -- Confidence probabilities per element.
- confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element (3d for tagger tuning).
- heuristic (Heuristic) -- Selection strategy.
- confidence_segments (Dict[(str, Tuple(int,int))]) -- A dictionary mapping segments to run KL Divergence.
- tuning_type (TuningType) -- Component to be tuned ("classifier" or "tagger")
Returns: List of ids corresponding to the newly sampled queries in the QueryCache.
Return type: newly_sampled_queries_ids (List[int])
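The sampling step described above can be illustrated with a minimal, standalone sketch. It does not use MindMeld's Heuristic classes; `least_confidence_ranking` and `sample_and_split` are hypothetical helpers, and least-confidence ranking is only one possible selection strategy, chosen here as an assumption.

```python
from typing import List, Tuple


def least_confidence_ranking(confidences_2d: List[List[float]]) -> List[int]:
    """Rank element indices from least to most confident.

    Hypothetical stand-in for a selection heuristic: an element's confidence
    is taken to be its highest class probability.
    """
    top_confidence = [max(row) for row in confidences_2d]
    return sorted(range(len(confidences_2d)), key=lambda i: top_confidence[i])


def sample_and_split(
    sampling_size: int, confidences_2d: List[List[float]]
) -> Tuple[List[int], List[int]]:
    """Return (newly_sampled_indices, remaining_indices) for one iteration."""
    ranked = least_confidence_ranking(confidences_2d)
    newly_sampled = ranked[:sampling_size]
    chosen = set(newly_sampled)
    # Remaining indices keep their original order.
    remaining = [i for i in range(len(confidences_2d)) if i not in chosen]
    return newly_sampled, remaining


# Four unsampled elements, two classes each (illustrative numbers only).
confidences = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
newly_sampled, remaining = sample_and_split(2, confidences)
```

In this sketch the two least confident elements are selected, and the remaining indices are what a caller would keep as the unsampled pool for the next iteration.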
-
update_sampled_queries
(newly_sampled_queries_ids) Update the current set of sampled queries by adding the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.
Parameters: newly_sampled_queries_ids (List[int]) -- List of ids corresponding to the newly sampled queries in the QueryCache.
-
update_unsampled_queries
(remaining_indices) Update the current set of unsampled queries by removing the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.
Parameters: remaining_indices (List[int]) -- List of ids corresponding to the remaining queries in self.unsampled_queries.
-
class
mindmeld.active_learning.data_loading.
DataBucketFactory
Bases:
object
Class to generate the initial data for experimentation (seed queries, remaining queries, and test queries). Handles initial sampling and data splitting based on configuration details.
-
static
get_data_bucket_for_query_selection
(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, unlabeled_logs_path: str, labeled_logs_pattern: str = None, log_usage_pct: float = 1.0) Creates a DataBucket to be used for log query selection.
Parameters: - app_path (str) -- Path to MindMeld application
- tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
- train_pattern (str) -- Regex pattern to match train files. For example, ".*train.*.txt"
- test_pattern (str) -- Regex pattern to match test files. For example, ".*test.*.txt"
- unlabeled_logs_path (str) -- Path to a logs text file with unlabeled queries
- labeled_logs_pattern (str) -- Pattern to obtain logs already labeled within a MindMeld app
- log_usage_pct (float) -- Percentage of the log data to use for selection
Returns: DataBucket for log query selection
Return type: query_selection_data_bucket (DataBucket)
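The pattern and percentage parameters above can be pictured with a small standalone sketch. It does not call MindMeld; `match_files` and `take_log_fraction` are hypothetical helpers. Two assumptions are made: that patterns are matched against the whole path, and that `log_usage_pct` is a fraction between 0.0 and 1.0 (suggested by the default of 1.0) applied by keeping the leading portion of the log. The actual matching mode and selection method are not specified in this reference.

```python
import re
from typing import List


def match_files(paths: List[str], pattern: str) -> List[str]:
    """Keep paths that fully match the regex pattern (assumed matching mode)."""
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.fullmatch(p)]


def take_log_fraction(log_queries: List[str], log_usage_pct: float) -> List[str]:
    """Keep the leading `log_usage_pct` fraction of the log queries."""
    if not 0.0 <= log_usage_pct <= 1.0:
        raise ValueError("log_usage_pct must be between 0.0 and 1.0")
    return log_queries[: int(len(log_queries) * log_usage_pct)]


# Illustrative file names only; these are not taken from a real application.
paths = ["greet/train.txt", "greet/test.txt", "exit/train_extra.txt", "exit/notes.md"]
train_files = match_files(paths, ".*train.*.txt")
test_files = match_files(paths, ".*test.*.txt")
subset = take_log_fraction(["a", "b", "c", "d"], 0.5)
```

Note that the unescaped `.` before `txt` in these example patterns matches any character, so a stricter pattern would escape it as `\.txt`.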
-
static
get_data_bucket_for_strategy_tuning
(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, train_seed_pct: float) Creates a DataBucket to be used for strategy tuning.
Parameters: - app_path (str) -- Path to MindMeld application
- tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
- train_pattern (str) -- Regex pattern to match train files. For example, ".*train.*.txt"
- test_pattern (str) -- Regex pattern to match test files. For example, ".*test.*.txt"
- train_seed_pct (float) -- Percentage of training data to use as the initial seed
Returns: DataBucket for tuning
Return type: strategy_tuning_data_bucket (DataBucket)
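As a rough illustration of what a seed percentage implies, the sketch below splits a list of query ids into an initial seed set and a remaining pool. `split_seed` is a hypothetical helper, and the uniform random shuffle is an assumption; the reference does not say whether the real split is random, stratified by class, or something else.

```python
import random
from typing import List, Tuple


def split_seed(
    query_ids: List[int], train_seed_pct: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split ids into (seed_ids, remaining_ids) using a seeded shuffle."""
    if not 0.0 <= train_seed_pct <= 1.0:
        raise ValueError("train_seed_pct must be between 0.0 and 1.0")
    shuffled = list(query_ids)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_seed_pct)
    return sorted(shuffled[:cutoff]), sorted(shuffled[cutoff:])


seed_ids, remaining_ids = split_seed(list(range(10)), train_seed_pct=0.2)
```

With ten ids and a seed percentage of 0.2, two ids land in the seed set and eight stay in the pool that later iterations sample from.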
-
class
mindmeld.active_learning.data_loading.
LabelMap
(query_tree: Dict) Bases:
object
Class that handles label encoding and mapping.
-
static
create_label_map
(app_path, file_pattern) Creates a label map.
Returns: A label map.
Return type: label_map (LabelMap)
-
static
get_class_labels
(tuning_level: list, query_list: mindmeld.resource_loader.ProcessedQueryList) → List[str] Creates class labels for a set of queries. These labels are used to split queries by type. Labels follow the format of "domain" or "domain|intent". For example, "date|get_date".
Parameters: - tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
- query_list (ProcessedQueryList) -- Data structure containing a list of processed queries.
Returns: List of labels for the classification task.
Return type: class_labels (List[str])
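The label format described above can be sketched without MindMeld. `ExampleQuery` and `build_class_labels` are hypothetical names, and the rule used here (emit "domain|intent" whenever "intent" is among the tuning levels, otherwise "domain") is an assumption about how the tuning level selects the format. The "entity" level is not covered by this sketch.

```python
from typing import List, NamedTuple


class ExampleQuery(NamedTuple):
    """Hypothetical stand-in for a processed query's domain and intent."""

    domain: str
    intent: str


def build_class_labels(
    tuning_level: List[str], queries: List[ExampleQuery]
) -> List[str]:
    """Build one "domain" or "domain|intent" label per query."""
    if "intent" in tuning_level:
        return [f"{q.domain}|{q.intent}" for q in queries]
    return [q.domain for q in queries]


queries = [ExampleQuery("date", "get_date"), ExampleQuery("greeting", "hello")]
intent_labels = build_class_labels(["domain", "intent"], queries)
domain_labels = build_class_labels(["domain"], queries)
```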
-
class
mindmeld.active_learning.data_loading.
LogQueriesLoader
(app_path: str, tuning_level: list, log_file_path: str) Bases:
object
-
convert_text_queries_to_processed
(text_queries: List[str]) → List[mindmeld.core.ProcessedQuery] Converts text queries to processed queries using an annotator.
Parameters: text_queries (List[str]) -- List of text queries. Returns: List of processed queries. Return type: queries (List[ProcessedQuery])
-
static
deduplicate_raw_text_queries
(log_queries_iter) → List[str] Removes duplicates in the text queries.
Parameters: log_queries_iter (generator) -- Log queries generator. Returns: List of filtered text queries. Return type: filtered_text_queries (List[str])
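A minimal sketch of deduplicating queries drawn from a generator follows. `deduplicate` and `read_log` are hypothetical helpers, and keeping the first occurrence in its original order is an assumption; the reference only states that duplicates are removed.

```python
from typing import Iterable, Iterator, List


def read_log(lines: List[str]) -> Iterator[str]:
    """Yield stripped, non-empty lines, mimicking a log queries generator."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


def deduplicate(log_queries_iter: Iterable[str]) -> List[str]:
    """Drop repeated queries, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text in log_queries_iter:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


raw_lines = ["hi\n", "what time is it\n", "hi\n", "\n"]
filtered_text_queries = deduplicate(read_log(raw_lines))
```

Tracking seen queries in a set keeps the pass linear in the number of log lines, which matters when the log is consumed lazily from a generator.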
-
queries
-