mindmeld.active_learning.data_loading module

This module contains classes used to load queries for the Active Learning Pipeline.

class mindmeld.active_learning.data_loading.DataBucket(label_map, resource_loader, test_queries: mindmeld.resource_loader.ProcessedQueryList, unsampled_queries: mindmeld.resource_loader.ProcessedQueryList, sampled_queries: mindmeld.resource_loader.ProcessedQueryList)[source]

Bases: object

Class to hold data throughout the Active Learning training pipeline. Responsible for data conversion, filtration, and storage.

static filter_queries_by_nlp_component(query_list: mindmeld.resource_loader.ProcessedQueryList, component_type: str, component_name: str)[source]

Filter queries for training preparation.

Parameters:
  • query_list (list) -- List of queries to filter
  • component_type (str) -- Component type of desired queries (e.g. "domain")
  • component_name (str) -- Component name of desired queries (e.g. "smart_home")
Returns:

List of indices of filtered queries, and the list of filtered queries.

Return type:

filtered_queries_indices (list), filtered_queries (list)
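
The filtering step can be sketched in plain Python. This is only an illustration under stated assumptions: the queries are stand-in dictionaries rather than a ProcessedQueryList, and the helper name filter_by_component is hypothetical, not part of MindMeld.

```python
from typing import Dict, List, Tuple


def filter_by_component(
    queries: List[Dict[str, str]], component_type: str, component_name: str
) -> Tuple[List[int], List[Dict[str, str]]]:
    """Hypothetical helper: keep queries whose `component_type` field
    equals `component_name`, returning both indices and queries."""
    indices, kept = [], []
    for i, query in enumerate(queries):
        if query.get(component_type) == component_name:
            indices.append(i)
            kept.append(query)
    return indices, kept


# Stand-in data; real queries would come from a ProcessedQueryList.
queries = [
    {"text": "turn on the lights", "domain": "smart_home", "intent": "turn_on"},
    {"text": "what day is it", "domain": "date", "intent": "get_date"},
    {"text": "close the door", "domain": "smart_home", "intent": "close_door"},
]
indices, kept = filter_by_component(queries, "domain", "smart_home")
```

Returning the indices alongside the queries lets a caller map the filtered subset back to positions in the original list.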

get_queries(query_ids)[source]

Method to get multiple queries from the QueryCache given a list of query ids.

Parameters:query_ids (List[int]) -- List of ids corresponding to queries in the QueryCache.
Returns:List of processed queries from the cache.
Return type:queries (List[ProcessedQuery])
sample_and_update(sampling_size: int, confidences_2d: List[List[float]], confidences_3d: List[List[List[float]]], heuristic: mindmeld.active_learning.heuristics.Heuristic, confidence_segments: Dict = None, tuning_type: mindmeld.constants.TuningType = <TuningType.CLASSIFIER: 'classifier'>)[source]

Method to sample a DataBucket's unsampled_queries and update its sampled_queries and newly_sampled_queries.

Parameters:
  • sampling_size (int) -- Number of elements to sample in the next iteration.
  • confidences_2d (List[List[float]]) -- Confidence probabilities per element.
  • confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element (3d for tagger tuning).
  • heuristic (Heuristic) -- Selection strategy.
  • confidence_segments (Dict[(str, Tuple(int,int))]) -- A dictionary mapping segments to run KL Divergence.
  • tuning_type (TuningType) -- Component to be tuned ("classifier" or "tagger")
Returns:

List of ids corresponding to the newly sampled queries in the QueryCache.

Return type:

newly_sampled_queries_ids (List[int])
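
A minimal sketch of one sample-and-update round follows. It assumes a least-confidence rule as a stand-in for the heuristic argument and uses plain lists of ids in place of ProcessedQueryList objects. The function name and the data are hypothetical, and the real Heuristic classes may rank elements differently.

```python
from typing import List, Tuple


def least_confidence_sample(
    confidences_2d: List[List[float]], sampling_size: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in heuristic: pick the positions whose top class
    probability is lowest. Returns (newly sampled, remaining) positions."""
    ranked = sorted(range(len(confidences_2d)), key=lambda i: max(confidences_2d[i]))
    newly_sampled = ranked[:sampling_size]
    remaining = sorted(ranked[sampling_size:])
    return newly_sampled, remaining


# Stand-in state: query ids held by the bucket before sampling.
unsampled_ids = [10, 11, 12, 13]
sampled_ids = [7]
confidences = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]

newly, remaining = least_confidence_sample(confidences, sampling_size=2)
newly_sampled_ids = [unsampled_ids[i] for i in newly]

# Move the selected ids from the unsampled pool to the sampled pool.
sampled_ids = sampled_ids + newly_sampled_ids
unsampled_ids = [unsampled_ids[i] for i in remaining]
```

The positions returned by the heuristic index into the unsampled pool, so they are translated to query ids before either pool is updated.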

update_sampled_queries(newly_sampled_queries_ids)[source]

Update the current set of sampled queries by adding the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.

Parameters:newly_sampled_queries_ids (List[int]) -- List of ids corresponding to the newly sampled queries in the QueryCache.
update_unsampled_queries(remaining_indices)[source]

Update the current set of unsampled queries by removing the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.

Parameters:remaining_indices (List[int]) -- List of ids corresponding to the remaining queries in self.unsampled_queries.
class mindmeld.active_learning.data_loading.DataBucketFactory[source]

Bases: object

Class to generate the initial data for experimentation (seed queries, remaining queries, and test queries). Handles initial sampling and data split based on configuration details.

static get_data_bucket_for_query_selection(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, unlabeled_logs_path: str, labeled_logs_pattern: str = None, log_usage_pct: float = 1.0)[source]

Creates a DataBucket to be used for log query selection.

Parameters:
  • app_path (str) -- Path to MindMeld application
  • tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
  • train_pattern (str) -- Regex pattern to match train files. For example, ".*train.*.txt"
  • test_pattern (str) -- Regex pattern to match test files. For example, ".*test.*.txt"
  • unlabeled_logs_path (str) -- Path to a logs text file with unlabeled queries
  • labeled_logs_pattern (str) -- Pattern to obtain logs already labeled within a MindMeld app
  • log_usage_pct (float) -- Percentage of the log data to use for selection
Returns:

DataBucket for log query selection

Return type:

query_selection_data_bucket (DataBucket)
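
To make the pattern arguments concrete, the sketch below applies the example train pattern to a few file paths with Python's re module. The paths are hypothetical, and the use of re.match is an assumption, since the documentation does not say how MindMeld applies the pattern internally.

```python
import re

train_pattern = ".*train.*.txt"

# Hypothetical file paths, for illustration only.
paths = [
    "domains/date/get_date/train.txt",
    "domains/date/get_date/test.txt",
    "domains/date/get_date/train_extra.txt",
]

# Keep the paths that the pattern matches from the start of the string.
matched = [p for p in paths if re.match(train_pattern, p)]

# The "." before "txt" is unescaped, so it matches any single character,
# not only a literal dot.
loose = re.match(train_pattern, "train_txt") is not None
```

If only files with a literal ".txt" extension should match, a stricter pattern such as ".*train.*\.txt" escapes the final dot.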

static get_data_bucket_for_strategy_tuning(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, train_seed_pct: float)[source]

Creates a DataBucket to be used for strategy tuning.

Parameters:
  • app_path (str) -- Path to MindMeld application
  • tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
  • train_pattern (str) -- Regex pattern to match train files. (".*train.*.txt")
  • test_pattern (str) -- Regex pattern to match test files. (".*test.*.txt")
  • train_seed_pct (float) -- Percentage of training data to use as the initial seed
Returns:

DataBucket for tuning

Return type:

strategy_tuning_data_bucket (DataBucket)
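
The role of train_seed_pct can be sketched as a simple split of training query ids into a seed set and a remaining set. This assumes the value is a fraction between 0 and 1 and that the split is uniformly random. Both are assumptions; the actual factory may sample differently (for example, per class label). The helper split_seed is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_seed(
    query_ids: Sequence[int], train_seed_pct: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle ids reproducibly, then take the first
    `train_seed_pct` fraction as the seed (sampled) set."""
    rng = random.Random(seed)
    shuffled = list(query_ids)
    rng.shuffle(shuffled)
    n_seed = int(len(shuffled) * train_seed_pct)
    return shuffled[:n_seed], shuffled[n_seed:]


sampled, unsampled = split_seed(range(10), train_seed_pct=0.2)
```

Every id ends up in exactly one of the two sets, which mirrors the sampled and unsampled pools a DataBucket holds.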

class mindmeld.active_learning.data_loading.LabelMap(query_tree: Dict)[source]

Bases: object

Class that handles label encoding and mapping.

static create_label_map(app_path, file_pattern)[source]

Creates a label map.

Parameters:
  • app_path (str) -- Path to MindMeld application
  • file_pattern (str) -- Regex pattern to match text files. (".*train.*.txt")
Returns:

A label map.

Return type:

label_map (LabelMap)

static get_class_labels(tuning_level: list, query_list: mindmeld.resource_loader.ProcessedQueryList) → List[str][source]
Creates class labels for a set of queries. These labels are used to split queries by type. Labels follow the format "domain" or "domain|intent", for example "date|get_date".
Parameters:
  • tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity")
  • query_list (ProcessedQueryList) -- Data structure containing a list of processed queries.
Returns:

List of labels for the classification task.

Return type:

class_labels (List[str])
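
The label format can be illustrated with stand-in dictionaries. The rule used here (emit "domain|intent" when "intent" is among the tuning levels, otherwise "domain") is an assumption made for the sketch, and build_class_labels is a hypothetical name.

```python
from typing import Dict, List


def build_class_labels(tuning_level: List[str], queries: List[Dict[str, str]]) -> List[str]:
    """Hypothetical helper: build one label per query in the documented
    "domain" or "domain|intent" format."""
    labels = []
    for query in queries:
        if "intent" in tuning_level:
            labels.append(f"{query['domain']}|{query['intent']}")
        else:
            labels.append(query["domain"])
    return labels


# Stand-in queries; real input would be a ProcessedQueryList.
queries = [
    {"domain": "date", "intent": "get_date"},
    {"domain": "smart_home", "intent": "turn_on"},
]
domain_labels = build_class_labels(["domain"], queries)
intent_labels = build_class_labels(["domain", "intent"], queries)
```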

static get_domain_to_intents(query_tree: Dict) → Dict[source]
Parameters:query_tree (dict) -- Nested Dictionary containing queries. Has the format: {"domain":{"intent":[Query List]}}
Returns:Dict mapping domains to a list of intents.
Return type:domain_to_intents (dict)
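
Given the documented query_tree layout, the mapping can be sketched as follows. The tree contents are made up for illustration, and the helper is a stand-in rather than the library's implementation.

```python
from typing import Dict, List


def domain_to_intents_sketch(query_tree: Dict[str, Dict[str, list]]) -> Dict[str, List[str]]:
    """Stand-in: map each domain to the list of its intent names."""
    return {domain: list(intents) for domain, intents in query_tree.items()}


# Hypothetical tree in the format {"domain": {"intent": [Query List]}}.
query_tree = {
    "date": {"get_date": ["what day is it"], "get_time": ["what time is it"]},
    "smart_home": {"turn_on": ["turn on the lights"]},
}
mapping = domain_to_intents_sketch(query_tree)
```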
class mindmeld.active_learning.data_loading.LogQueriesLoader(app_path: str, tuning_level: list, log_file_path: str)[source]

Bases: object

convert_text_queries_to_processed(text_queries: List[str]) → List[mindmeld.core.ProcessedQuery][source]

Converts text queries to processed queries using an annotator.

Parameters:text_queries (List[str]) -- a List of text queries.
Returns:List of processed queries.
Return type:queries (List[ProcessedQuery])
static deduplicate_raw_text_queries(log_queries_iter) → List[str][source]

Removes duplicates in the text queries.

Parameters:log_queries_iter (generator) -- Log queries generator.
Returns:a List of filtered text queries.
Return type:filtered_text_queries (List[str])
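
One common way to deduplicate text queries from a generator is shown below. It assumes exact string matching and that first-seen order should be kept. Neither detail is stated in the documentation, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, List


def dedupe_text_queries(log_queries_iter: Iterable[str]) -> List[str]:
    """Hypothetical helper: drop exact duplicates, keeping first-seen order.
    Relies on dict preserving insertion order (Python 3.7+)."""
    return list(dict.fromkeys(log_queries_iter))


def read_logs() -> Iterator[str]:
    # Stand-in for a generator that yields lines from a log file.
    yield from ["turn on the lights", "what day is it", "turn on the lights"]


filtered_text_queries = dedupe_text_queries(read_logs())
```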
queries