mindmeld.active_learning.data_loading module

This module contains classes used to load queries for the Active Learning Pipeline.

class mindmeld.active_learning.data_loading.DataBucket(label_map, resource_loader, test_queries: mindmeld.resource_loader.ProcessedQueryList, unsampled_queries: mindmeld.resource_loader.ProcessedQueryList, sampled_queries: mindmeld.resource_loader.ProcessedQueryList)

Bases: object

Class to hold data throughout the Active Learning training pipeline. Responsible for data conversion, filtration, and storage.

static filter_queries_by_nlp_component(query_list: mindmeld.resource_loader.ProcessedQueryList, component_type: str, component_name: str)

Filter queries for training preparation.

Parameters:

query_list (list) -- List of queries to filter.
component_type (str) -- Component type of desired queries (e.g. "domain").
component_name (str) -- Component name of desired queries (e.g. "smart_home").

Returns:

filtered_queries_indices (list): List of indices of filtered queries.
filtered_queries (list): List of filtered queries.

Return type:

tuple (list, list)
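As a rough illustration of the filtering contract (matching indices plus the matching queries), the sketch below uses plain dictionaries in place of a ProcessedQueryList. The `filter_by_component` helper and the dictionary keys are hypothetical stand-ins, not part of the MindMeld API.

```python
from typing import Dict, List, Tuple


def filter_by_component(
    queries: List[Dict[str, str]], component_type: str, component_name: str
) -> Tuple[List[int], List[Dict[str, str]]]:
    """Return the indices and the queries whose component matches.

    Hypothetical stand-in: each query is a dict such as
    {"text": ..., "domain": ..., "intent": ...}.
    """
    indices = [
        i
        for i, query in enumerate(queries)
        if query.get(component_type) == component_name
    ]
    return indices, [queries[i] for i in indices]


queries = [
    {"text": "turn on the lights", "domain": "smart_home", "intent": "turn_on"},
    {"text": "what day is it", "domain": "date", "intent": "get_date"},
    {"text": "close the door", "domain": "smart_home", "intent": "close_door"},
]
indices, filtered = filter_by_component(queries, "domain", "smart_home")
print(indices)  # [0, 2]
```

Returning the indices alongside the queries lets a caller map the filtered subset back to positions in the original list.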

get_queries(query_ids)

Method to get multiple queries from the QueryCache given a list of query ids.

Parameters:	query_ids (List[int]) -- List of ids corresponding to queries in the QueryCache.
Returns:	List of processed queries from the cache.
Return type:	queries (List[ProcessedQuery])

sample_and_update(sampling_size: int, confidences_2d: List[List[float]], confidences_3d: List[List[List[float]]], heuristic: mindmeld.active_learning.heuristics.Heuristic, confidence_segments: Dict = None, tuning_type: mindmeld.constants.TuningType = <TuningType.CLASSIFIER: 'classifier'>)

Method to sample a DataBucket's unsampled_queries and update its sampled_queries and newly_sampled_queries.

Parameters:

sampling_size (int) -- Number of elements to sample in the next iteration.
confidences_2d (List[List[float]]) -- Confidence probabilities per element.
confidences_3d (List[List[List[float]]]) -- Confidence probabilities per element (3d for tagger tuning).
heuristic (Heuristic) -- Selection strategy.
confidence_segments (Dict[(str, Tuple(int,int))]) -- A dictionary mapping segments to run KL Divergence.
tuning_type (TuningType) -- Component to be tuned ("classifier" or "tagger").

Returns:

List of ids corresponding to the newly sampled queries in the QueryCache.

Return type:

newly_sampled_queries_ids (List[int])
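The sample-then-update flow can be sketched with plain lists of query ids. This is a minimal sketch, not the library's implementation: `least_confidence_rank` is a hypothetical stand-in for a Heuristic, and `sample_and_update_sketch` is an invented name that only mimics the bookkeeping described above for the classifier (2d) case.

```python
from typing import List, Tuple


def least_confidence_rank(confidences_2d: List[List[float]]) -> List[int]:
    """Rank positions from least to most confident (hypothetical heuristic)."""
    return sorted(range(len(confidences_2d)), key=lambda i: max(confidences_2d[i]))


def sample_and_update_sketch(
    unsampled_ids: List[int],
    sampled_ids: List[int],
    confidences_2d: List[List[float]],
    sampling_size: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Pick the top-ranked elements, then move them from unsampled to sampled."""
    ranked = least_confidence_rank(confidences_2d)
    chosen_positions = ranked[:sampling_size]
    chosen = set(chosen_positions)
    newly_sampled = [unsampled_ids[i] for i in chosen_positions]
    remaining = [qid for i, qid in enumerate(unsampled_ids) if i not in chosen]
    return newly_sampled, sampled_ids + newly_sampled, remaining


# One confidence row per unsampled query, in the same order as unsampled_ids.
unsampled_ids = [10, 11, 12, 13]
confidences = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
newly, sampled, remaining = sample_and_update_sketch(
    unsampled_ids, [1, 2], confidences, sampling_size=2
)
print(newly)      # [11, 12]
print(sampled)    # [1, 2, 11, 12]
print(remaining)  # [10, 13]
```

The two queries with the lowest top-class probability are selected first, which is one common uncertainty-based strategy; other heuristics would only change the ranking function.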

update_sampled_queries(newly_sampled_queries_ids)

Update the current set of sampled queries by adding the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.

Parameters:	newly_sampled_queries_ids (List[int]) -- List of ids corresponding to the newly sampled queries in the QueryCache.

update_unsampled_queries(remaining_indices)

Update the current set of unsampled queries by removing the set of newly sampled queries. A new ProcessedQueryList object is created with the updated set of query ids.

Parameters:	remaining_indices (List[int]) -- List of ids corresponding to the remaining queries in self.unsampled_queries.

class mindmeld.active_learning.data_loading.DataBucketFactory

Bases: object

Class to generate the initial data for experimentation (seed queries, remaining queries, and test queries). Handles initial sampling and data split based on configuration details.

static get_data_bucket_for_query_selection(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, unlabeled_logs_path: str, labeled_logs_pattern: str = None, log_usage_pct: float = 1.0)

Creates a DataBucket to be used for log query selection.

Parameters:

app_path (str) -- Path to MindMeld application.
tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity").
train_pattern (str) -- Regex pattern to match train files. For example, ".*train.*.txt".
test_pattern (str) -- Regex pattern to match test files. For example, ".*test.*.txt".
unlabeled_logs_path (str) -- Path to a logs text file with unlabeled queries.
labeled_logs_pattern (str) -- Pattern to obtain logs already labeled within a MindMeld app.
log_usage_pct (float) -- Percentage of the log data to use for selection.

Returns:

DataBucket for log query selection.

Return type:

query_selection_data_bucket (DataBucket)
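To show how a regex file pattern and `log_usage_pct` could interact, here is a minimal sketch using only the standard library. The helper names, the example paths, the pattern `.*train.*\.txt`, and the "keep the leading fraction" rule are all assumptions made for illustration; the library may select files and logs differently.

```python
import re
from typing import List


def match_files(paths: List[str], pattern: str) -> List[str]:
    """Keep the paths that match a regex pattern from the start of the string."""
    regex = re.compile(pattern)
    return [path for path in paths if regex.match(path)]


def take_log_fraction(log_queries: List[str], log_usage_pct: float) -> List[str]:
    """Keep the leading fraction of the log queries (assumed selection rule)."""
    if not 0.0 <= log_usage_pct <= 1.0:
        raise ValueError("log_usage_pct must be between 0.0 and 1.0")
    return log_queries[: int(len(log_queries) * log_usage_pct)]


# Hypothetical file layout and log contents.
paths = [
    "domains/date/get_date/train.txt",
    "domains/date/get_date/test.txt",
    "domains/date/get_date/notes.md",
]
train_files = match_files(paths, r".*train.*\.txt")
logs = ["what day is it", "turn on the lights", "set an alarm", "play music"]
selected_logs = take_log_fraction(logs, 0.5)
print(train_files)    # ['domains/date/get_date/train.txt']
print(selected_logs)  # ['what day is it', 'turn on the lights']
```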

static get_data_bucket_for_strategy_tuning(app_path: str, tuning_level: list, train_pattern: str, test_pattern: str, train_seed_pct: float)

Creates a DataBucket to be used for strategy tuning.

Parameters:

app_path (str) -- Path to MindMeld application.
tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity").
train_pattern (str) -- Regex pattern to match train files. For example, ".*train.*.txt".
test_pattern (str) -- Regex pattern to match test files. For example, ".*test.*.txt".
train_seed_pct (float) -- Percentage of training data to use as the initial seed.

Returns:

DataBucket for tuning.

Return type:

strategy_tuning_data_bucket (DataBucket)
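The role of `train_seed_pct` can be illustrated with a simple seed/remaining split over query ids. This is a sketch under stated assumptions: `split_seed` is a hypothetical helper, and the uniform shuffle used here may differ from the library's actual split (which could, for example, be stratified by class label).

```python
import random
from typing import List, Tuple


def split_seed(
    query_ids: List[int], train_seed_pct: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split ids into an initial seed set and the remaining unsampled set."""
    if not 0.0 <= train_seed_pct <= 1.0:
        raise ValueError("train_seed_pct must be between 0.0 and 1.0")
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(query_ids)
    rng.shuffle(shuffled)
    n_seed = int(len(shuffled) * train_seed_pct)
    return shuffled[:n_seed], shuffled[n_seed:]


seed_ids, remaining_ids = split_seed(list(range(10)), train_seed_pct=0.2)
print(len(seed_ids), len(remaining_ids))  # 2 8
```

In this picture the seed ids play the role of the initial sampled queries, and the remaining ids are the pool that later sampling iterations draw from.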

class mindmeld.active_learning.data_loading.LabelMap(query_tree: Dict)

Bases: object

Class that handles label encoding and mapping.

static create_label_map(app_path, file_pattern)

Creates a label map.

Parameters:

app_path (str) -- Path to MindMeld application.
file_pattern (str) -- Regex pattern to match text files. For example, ".*train.*.txt".

Returns:

A label map.

Return type:

label_map (LabelMap)

static get_class_labels(tuning_level: list, query_list: mindmeld.resource_loader.ProcessedQueryList) → List[str]

Creates a class label for each query in a set of queries. These labels are used to split queries by type. Labels follow the format "domain" or "domain|intent". For example, "date|get_date".

Parameters:

tuning_level (list) -- The hierarchy levels to tune ("domain", "intent" or "entity").
query_list (ProcessedQueryList) -- Data structure containing a list of processed queries.

Returns:

List of labels for the classification task.

Return type:

class_labels (List[str])
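The label format described above can be sketched as follows. The `build_class_labels` helper and the dictionary-based queries are hypothetical, and the rule "use `domain|intent` whenever `intent` is among the tuning levels" is an assumption; the sketch does not cover the "entity" level.

```python
from typing import Dict, List


def build_class_labels(tuning_level: List[str], queries: List[Dict[str, str]]) -> List[str]:
    """Build one "domain" or "domain|intent" label per query."""
    use_intent = "intent" in tuning_level  # assumed rule, see lead-in
    labels = []
    for query in queries:
        if use_intent:
            labels.append(f"{query['domain']}|{query['intent']}")
        else:
            labels.append(query["domain"])
    return labels


queries = [
    {"text": "what day is it", "domain": "date", "intent": "get_date"},
    {"text": "turn on the lights", "domain": "smart_home", "intent": "turn_on"},
]
print(build_class_labels(["domain"], queries))  # ['date', 'smart_home']
print(build_class_labels(["intent"], queries))  # ['date|get_date', 'smart_home|turn_on']
```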

static get_domain_to_intents(query_tree: Dict) → Dict

Parameters:	query_tree (dict) -- Nested Dictionary containing queries. Has the format: {"domain":{"intent":[Query List]}}
Returns:	Dict mapping domains to a list of intents.
Return type:	domain_to_intents (dict)
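Given the documented query tree shape, {"domain": {"intent": [Query List]}}, the mapping can be derived with a dictionary comprehension. The `domain_to_intents` function and the sample tree below are illustrative only and use strings in place of query objects.

```python
from typing import Dict, List


def domain_to_intents(query_tree: Dict[str, Dict[str, list]]) -> Dict[str, List[str]]:
    """Map each domain to the list of intent names nested under it."""
    return {domain: list(intents) for domain, intents in query_tree.items()}


# Hypothetical query tree with strings standing in for query objects.
query_tree = {
    "date": {"get_date": ["what day is it"]},
    "smart_home": {"turn_on": ["turn on the lights"], "close_door": ["close the door"]},
}
print(domain_to_intents(query_tree))
# {'date': ['get_date'], 'smart_home': ['turn_on', 'close_door']}
```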

class mindmeld.active_learning.data_loading.LogQueriesLoader(app_path: str, tuning_level: list, log_file_path: str)

Bases: object

convert_text_queries_to_processed(text_queries: List[str]) → List[mindmeld.core.ProcessedQuery]

Converts text queries to processed queries using an annotator.

Parameters:	text_queries (List[str]) -- A list of text queries.
Returns:	List of processed queries.
Return type:	queries (List[ProcessedQuery])

static deduplicate_raw_text_queries(log_queries_iter) → List[str]

Removes duplicates in the text queries.

Parameters:	log_queries_iter (generator) -- Log queries generator.
Returns:	A list of filtered text queries.
Return type:	filtered_text_queries (List[str])
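Deduplicating a generator of raw log lines can be done in a single pass. The sketch below is one possible approach, not necessarily the library's: the `deduplicate` name is hypothetical, and keeping the first occurrence of each query in its original order is an assumption.

```python
from typing import Iterable, Iterator, List


def deduplicate(log_queries_iter: Iterable[str]) -> List[str]:
    """Drop repeated queries, keeping the first occurrence of each in order."""
    # dict preserves insertion order, so keys come back in first-seen order.
    return list(dict.fromkeys(log_queries_iter))


def read_logs() -> Iterator[str]:
    """Hypothetical generator standing in for lines read from a log file."""
    yield from ["play music", "set an alarm", "play music", "what day is it"]


print(deduplicate(read_logs()))  # ['play music', 'set an alarm', 'what day is it']
```

Because the input is consumed once, this works with generators as well as lists, at the cost of holding every distinct query in memory.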

queries