mindmeld.components.question_answerer module¶

This module contains the question answerer component of MindMeld.

class mindmeld.components.question_answerer.BaseQuestionAnswerer(app_path=None, config=None, app_namespace=None, **_kwargs)[source]¶

Bases: abc.ABC

build_search(index_name=None, ranking_config=None, app_namespace=None, **kwargs)[source]¶

Build a search object for advanced filtered search.

Parameters:

index_name (str) -- The index name of the knowledge base object.
ranking_config (dict, optional) -- Overriding ranking configuration parameters.
app_namespace (str, optional) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
Returns:	a Search object for filtered search.
Return type:	Search

get(index_name=None, size=10, query_type=None, app_namespace=None, **kwargs)[source]¶

Parameters:

index_name (str) -- The name of an index.
size (int) -- The maximum number of records; defaults to 10.
query_type (str) -- Whether the search is over structured or unstructured text, and whether to rank using text signals, embedder signals, or both.
id (str) -- The id of a particular document to retrieve.
_sort (str) -- The knowledge base field to use for custom sort.
_sort_type (str) -- The custom sort type. Valid values are 'asc', 'desc' and 'distance'.
_sort_location (dict) -- The origin location to be used when sorting by distance.
Returns:	A list of matching documents.
Return type:	list

load_kb(index_name, data_file, **kwargs)[source]¶

Loads documents from disk into the specified index in the knowledge base. If an index with the specified name doesn't exist, a new index with that name will be created in the knowledge base.

Parameters:

index_name (str) -- The name of the new index to be created; can be any valid string.
data_file (str) -- The path to the data file containing the documents to be imported into the knowledge base index. It can be either a json or a jsonl file.

Optional Args (used by all QA classes):

app_namespace (str): A custom namespace for the app. Used to prevent collisions between the indices of two apps with the same app name.
clean (bool): Set to True to delete an existing index and reindex it. If False (default), ElasticsearchQA updates its index with the new objects without deleting the old ones, whereas NativeQA replaces the old index with a new index built from the KB objects in the given data file.
embedding_fields (list): List of embedding fields for the given index that can be passed in directly instead of adding them to, or overriding, the QA config. Embedder information is generated and indexed only for the user-specified fields, not for all KB field names. If this list is empty, no fields have the embedder component even when the 'embedder' keyword is specified in 'model_type'.

Optional Args (Elasticsearch specific):

es_host (str): The Elasticsearch host server.
es_client (Elasticsearch): The Elasticsearch client.
connect_timeout (int): The amount of time for a connection to the Elasticsearch host.
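
As a rough illustration of the two data_file formats mentioned above, the sketch below reads documents from either a json file or a jsonl file using only the standard library. The helper name read_kb_documents is hypothetical and is not part of MindMeld; choosing the parser by file extension and expecting a top-level array in the json case are assumptions made for this example.

```python
import json
from pathlib import Path


def read_kb_documents(data_file):
    """Return a list of document dicts from a .json or .jsonl file.

    Hypothetical helper for illustration; not MindMeld's own loader.
    """
    path = Path(data_file)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # JSON: assumed here to be a single top-level array of objects.
    docs = json.loads(text)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON array of documents")
    return docs
```

Either format yields the same in-memory list of dictionaries, which is the shape a knowledge base index would be populated from.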

model_settings¶

model_type¶

query_type¶

class mindmeld.components.question_answerer.ElasticsearchQuestionAnswerer(**kwargs)[source]¶

Bases: mindmeld.components.question_answerer.BaseQuestionAnswerer

The question answerer is primarily an information retrieval system that provides all the necessary functionality for interacting with the application's knowledge base.

This class uses Elasticsearch in the backend to implement various underlying functionalities of question answerer.

class FieldInfo(name, field_type)[source]¶

Bases: object

This class models the metadata of a knowledge base field.

get_name()[source]¶: Returns knowledge base field name

get_type()[source]¶: Returns knowledge base field type

is_date_field()[source]¶: Returns True if the knowledge base field is a date field, otherwise returns False

is_location_field()[source]¶: Returns True if the knowledge base field is a location field, otherwise returns False

is_number_field()[source]¶: Returns True if the knowledge base field is a number field, otherwise returns False

is_text_field()[source]¶: Returns True if the knowledge base field is a text field, otherwise returns False

is_vector_field()[source]¶: Returns True if the knowledge base field is a vector field, otherwise returns False

DATE_TYPES = {'date'}¶

GEO_TYPES = {'geo_point'}¶

NUMBER_TYPES = {'long', 'double', 'short', 'integer', 'byte', 'scaled_float', 'half_float', 'float'}¶

TEXT_TYPES = {'text', 'keyword'}¶

VECTOR_TYPES = {'dense_vector'}¶
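
The relationship between the type constants and the is_*_field() predicates can be sketched as simple set membership checks. FieldInfoSketch below is an illustrative stand-in written for this page, not the MindMeld class; only the constant values and method names are taken from the listing above, and the method bodies are an assumption about how such predicates would typically work.

```python
class FieldInfoSketch:
    """Illustrative stand-in for FieldInfo; not the MindMeld class itself."""

    DATE_TYPES = {"date"}
    GEO_TYPES = {"geo_point"}
    NUMBER_TYPES = {"long", "double", "short", "integer", "byte",
                    "scaled_float", "half_float", "float"}
    TEXT_TYPES = {"text", "keyword"}
    VECTOR_TYPES = {"dense_vector"}

    def __init__(self, name, field_type):
        self.name = name
        self.type = field_type

    def get_name(self):
        return self.name

    def get_type(self):
        return self.type

    # Each predicate is a membership test against one constant set.
    def is_date_field(self):
        return self.type in self.DATE_TYPES

    def is_location_field(self):
        return self.type in self.GEO_TYPES

    def is_number_field(self):
        return self.type in self.NUMBER_TYPES

    def is_text_field(self):
        return self.type in self.TEXT_TYPES

    def is_vector_field(self):
        return self.type in self.VECTOR_TYPES
```

With this sketch, a field declared as 'double' answers True to is_number_field() and False to the other predicates.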

class Search(client, index, ranking_config=None, field_info=None)[source]¶

Bases: object

This class models a generic filtered search in knowledge base. It allows developers to construct more complex knowledge base search criteria based on the application requirements.

class Clause[source]¶

Bases: abc.ABC

This class models an abstract knowledge base clause.

build_query()[source]¶: Build knowledge base query.

get_type()[source]¶: Returns clause type

validate()[source]¶: Validate the clause.

class FilterClause(field, field_info=None, value=None, query_type='keyword', range_gt=None, range_gte=None, range_lt=None, range_lte=None)[source]¶

Bases: mindmeld.components.question_answerer.Clause

This class models a knowledge base filter clause.

build_query()[source]¶: build knowledge base query for filter clause

validate()[source]¶: Validate the clause.

class QueryClause(field, field_info, value, query_type='keyword', synonym_field=None)[source]¶

Bases: mindmeld.components.question_answerer.Clause

This class models a knowledge base query clause.

build_query()[source]¶: build knowledge base query for query clause

validate()[source]¶: Validate the clause.

DEFAULT_EXACT_MATCH_BOOSTING_WEIGHT = 100¶

class SortClause(field, field_info=None, sort_type=None, field_stats=None, location=None)[source]¶

Bases: mindmeld.components.question_answerer.Clause

This class models a knowledge base sort clause.

build_query()[source]¶: build knowledge base query for sort clause

validate()[source]¶: Validate the clause.

DEFAULT_SORT_WEIGHT = 30¶

SORT_DISTANCE = 'distance'¶

SORT_ORDER_ASC = 'asc'¶

SORT_ORDER_DESC = 'desc'¶

SORT_TYPES = {'distance', 'asc', 'desc'}¶

execute(size=10)[source]¶

Executes the knowledge base search with provided criteria and returns matching documents.

Parameters:	size (int) -- The maximum number of records to fetch, default to 10.
Returns:	a list of matching documents.

filter(query_type='keyword', **kwargs)[source]¶

Specify filter condition to be applied to specified knowledge base field. In MindMeld two types of filters are supported: text filter and range filters.

Text filters are used to apply hard filters on specified knowledge base text fields. The filter text value is normalized and matched using entire text span against the knowledge base field.

It's common to have filter conditions based on other resolved canonical entities. For example, in food ordering domain the resolved restaurant entity can be used as a filter to resolve dish entities. The exact knowledge base field to apply these filters depends on the knowledge base data model of the application. If the entity is not in the canonical form, a fuzzy filter can be applied by setting the query_type to 'text'.

Range filters are used to filter with a value range on specified knowledge base number or date fields. Example use cases include price range filters and date range filters.

Examples:

add text filter:

>>> s = question_answerer.build_search(index_name='menu_items')
>>> s.filter(restaurant_id='B01CGKGQ40')

add range filter:

>>> s = question_answerer.build_search(index_name='menu_items')
>>> s.filter(field='price', gte=1, lt=10)

Parameters:

query_type (str) -- Whether the filter is over structured or unstructured text.
kwargs -- A keyword argument to specify the filter text and the knowledge base text field.
field (str) -- knowledge base field name for range filter.
gt (number or str) -- range filter operator for greater than.
gte (number or str) -- range filter operator for greater than or equal to.
lt (number or str) -- range filter operator for less than.
lte (number or str) -- range filter operator for less than or equal to.
Returns:	A new Search object with added search criteria.
Return type:	Search
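
The difference between the two filter kinds can be sketched on plain Python dictionaries. The helpers below are hypothetical and only mimic the semantics described above (a whole-value match on a text field, and gt/gte/lt/lte bounds on a number field); lowercasing and whitespace stripping stand in for MindMeld's text normalization, which is an assumption, and the second restaurant id is made-up sample data.

```python
def text_filter(docs, field, value):
    """Keep docs whose field equals value after a simple normalization."""
    def norm(s):
        return str(s).strip().lower()
    return [d for d in docs if field in d and norm(d[field]) == norm(value)]


def range_filter(docs, field, gt=None, gte=None, lt=None, lte=None):
    """Keep docs whose field value satisfies every bound that is given."""
    def ok(v):
        return ((gt is None or v > gt) and (gte is None or v >= gte)
                and (lt is None or v < lt) and (lte is None or v <= lte))
    return [d for d in docs if field in d and ok(d[field])]


menu_items = [
    {"name": "Pad Thai", "restaurant_id": "B01CGKGQ40", "price": 9.5},
    {"name": "Green Curry", "restaurant_id": "B01CGKGQ40", "price": 12.0},
    {"name": "Fries", "restaurant_id": "B01DEEGQBK", "price": 4.0},
]
# Text filter first, then a price range of [1, 10).
hits = range_filter(text_filter(menu_items, "restaurant_id", "B01CGKGQ40"),
                    "price", gte=1, lt=10)
```

Chaining the two helpers mirrors how successive filter() calls narrow a search: each step returns a new, smaller candidate set.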

query(query_type='keyword', **kwargs)[source]¶

Specify the query text to match on a knowledge base text field. The query text is normalized and processed (based on query_type) to find matches in knowledge base using several text relevance scoring factors including exact matches, phrase matches and partial matches.

Examples

>>> s = question_answerer.build_search(index_name='dish')
>>> s.query(name='pad thai')

In the example above the query text "pad thai" will be used to match against document field "name" in knowledge base index "dish".

Parameters:	kwargs -- a keyword argument to specify the query text and the knowledge base document field, along with the query type.
Returns:	a new Search object with added search criteria.
Return type:	Search

sort(field, sort_type=None, location=None)[source]¶

Specify custom sort criteria.

Parameters:

field (str) -- knowledge base field for sort.
sort_type (str) -- sorting type. Valid values are 'asc', 'desc' and 'distance'. 'asc' and 'desc' can be used to sort numeric or date fields, and 'distance' can be used to sort by distance on geo_point fields. The default sort type is 'desc' if not specified.
location (str) -- location (lat, lon) in geo_point format to be used as the origin when sorting by 'distance'.
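
A minimal sketch of the sort_type rules listed above, written against plain dictionaries. The function sort_docs is hypothetical; it validates the sort type, falls back to 'desc' when none is given, and performs a plain ordering. It does not model any weighting of the sort against text relevance, and it leaves 'distance' sorting out entirely.

```python
SORT_TYPES = {"asc", "desc", "distance"}


def sort_docs(docs, field, sort_type=None):
    """Order docs by a numeric or date-like field; 'desc' when unspecified."""
    sort_type = sort_type or "desc"
    if sort_type not in SORT_TYPES:
        raise ValueError(f"invalid sort_type: {sort_type!r}")
    if sort_type == "distance":
        # Needs an origin location; intentionally outside this sketch.
        raise NotImplementedError("distance sorting is not covered here")
    return sorted(docs, key=lambda d: d[field], reverse=(sort_type == "desc"))
```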

SYN_FIELD_SUFFIX = '$whitelist'¶

class mindmeld.components.question_answerer.NativeQuestionAnswerer(*args, **kwargs)[source]¶

Bases: mindmeld.components.question_answerer.BaseQuestionAnswerer

The question answerer is primarily an information retrieval system that provides all the necessary functionality for interacting with the application's knowledge base.

This class uses Entity Resolvers in the backend to implement the underlying functionality of the question answerer. It consists of three important sub-classes: (1) Indices, which maintains the different indices, including the fit entity resolvers used for inference; (2) FieldResource, which forms the core of each index, encapsulating the fit resolvers and metadata related to each KB field; and (3) Search, which is used to build custom searches similar to what ElasticsearchQuestionAnswerer offers. In addition, NativeQuestionAnswerer offers the same APIs as the Elasticsearch one: .get(), .load_kb(), and .build_search().

The created resolvers are dumped at DEFAULT_APP_PATH, whose directory serves as a common site to host all indices, similar to how all the indices of Elasticsearch are stored in a common directory on the disk.

class FieldResource(index_name, field_name)[source]¶

Bases: mindmeld.components.question_answerer.FieldResourceHelper

An object encapsulating all resources necessary for searching, filtering, and sorting on any field in the Knowledge Base. This class should only be used as part of the Indices class and not in isolation.

This class currently supports:

- location strings,
- date strings,
- boolean,
- number,
- strings, and
- list of strings.

Any other data type (e.g., a dictionary) is currently not supported and is marked as an 'unknown' data type. Fields of an unknown data type do not have any associated resolvers.

static curate_docs_to_return(index_resources, _ids, _scores=None)[source]¶

Collates all field names into docs

Parameters:

index_resources -- a dict of field names and corresponding FieldResource instances
_ids (List[str]) -- if provided as a list of strings, only docs with those ids are obtained, in the same order as the ids; otherwise all ids are used
_scores (List[number], optional) -- if provided as a list of numbers of the same size as _ids, they will be attached to the curated results for the corresponding _ids
Returns:	compiled docs
Return type:	list[dict]

do_filter(allowed_ids, filter_text=None, gt=None, gte=None, lt=None, lte=None, et=None, boolean=None)[source]¶: Filters a list of docs to a subset based on some criteria such as a boolean value or {>,<,=} operations or a text snippet.

do_search(query_type, value, allowed_ids=None)[source]¶

Retrieves doc ids with corresponding similarity scores for the given query

Parameters:

query_type (str) -- one of ALL_QUERY_TYPES
value (str) -- A string to do similarity search
allowed_ids (iterable, optional) -- if not None, only docs containing these ids are populated in the results
Returns:	a mapping between _ids recorded in this field and their corresponding scores
Return type:	dict

do_sort(curated_docs, sort_type, location=None)[source]¶

classmethod from_metadata(cache_object: mindmeld.core.Bunch)[source]¶

Creates a fit resource from metadata

Parameters:	cache_object (Bunch) -- a Bunch dictionary object with attribute names and values
Returns:	FieldResource

load_resolvers(app_path='/Users/lucienc/.cache/mindmeld', resource_loader=None)[source]¶

Loads a field resource by fitting with latest data (if id2value is passed) or by loading already fit resolvers if no data changes take place.

Parameters:

app_path (str, optional) -- a path to create cache for embedder resolver
resource_loader (ResourceLoader, optional) -- a resource loader object

to_metadata()[source]¶

Returns a Bunch object consisting of various details about the resource: the (scoped) index name, the name of the KB field for which the resource is built, the field's data type, all the KB data associated with this field, a hash that uniquely identifies the data, the data processor type, and the presence of text as well as embedder resolvers. The returned metadata object does not contain any fit entity resolvers.

Returns:	various meta information of this field resource
Return type:	Bunch

update_resource(id2value, has_text_resolver, has_embedding_resolver, resolver_model_settings=None, clean=False, app_path='/Users/lucienc/.cache/mindmeld', processor_type='keyword', resource_loader=None)[source]¶

Updates a field resource by fitting with latest data (if id2value is passed) or by loading already fit resolvers if no data changes take place.

While loading from metadata, the self.id2value data is also loaded, which is in turn used to create the new_hash. Even if the new_hash is same as self.hash (implying no data changes), if the self._text_resolver is not fit but is required, the resolver is loaded instead of fitting.

Parameters:

id2value (dict) -- a mapping between document ids and values of the chosen KB field
has_text_resolver (bool) -- If a tfidf resolver is to be created
has_embedding_resolver (bool) -- If an embedder resolver is to be created
resolver_model_settings (dict) -- A dictionary consisting of model settings for the resolver models. Currently, the same setting is passed to all kinds of resolver models.
clean (bool, optional) -- if True, resolvers are fit with clean=True
app_path (str, optional) -- a path to create cache for embedder resolver
processor_type (str, optional, "text" or "keyword") -- processor for tfidf resolver
resource_loader (ResourceLoader, optional) -- a resource loader object

doc_ids¶

class FieldResourceDataHelper[source]¶

Bases: object

A class that holds methods to aid validating, formatting and scoring different data types in a FieldResource object.

static date_scorer(value)[source]¶: ascertains a suitable date format for input and returns number of days from origin date as score
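
The idea behind date_scorer can be sketched as trying candidate formats in turn and converting the first successful parse into a day count. The function days_from_origin is a hypothetical name; the format tuple is a small subset of DATE_FORMATS from this module, and the origin date of 1 January 1970 is an assumption made only for this example, since the origin used by MindMeld is not stated here.

```python
from datetime import datetime

# A small subset of the DATE_FORMATS tuple listed in this module.
FORMATS = ("%Y", "%b %d, %Y", "%d %B %Y", "%m/%d/%Y", "%d/%m/%Y")
# Assumed origin for this sketch only.
ORIGIN = datetime(1970, 1, 1)


def days_from_origin(value):
    """Return whole days between ORIGIN and the first format that parses value."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
        return (parsed - ORIGIN).days
    raise ValueError(f"no known date format matches {value!r}")
```

Because formats are tried in order, an ambiguous string such as "01/02/1970" is read with the first matching pattern (month/day/year in this sketch).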

static is_bool(value)[source]¶

static is_date(value)[source]¶

static is_list_of_strings(value)[source]¶

static is_location(value)[source]¶

static is_number(value)[source]¶

static is_string(value)[source]¶

static location_scorer(some_location, source_location)[source]¶

Uses the Haversine formula to find the distance between two coordinates. References: https://en.wikipedia.org/wiki/Haversine_formula and http://www.movable-type.co.uk/scripts/latlong.html

Parameters:

some_location (str) -- latitude and longitude supplied as comma separated strings, e.g. "37.77,122.41"
source_location (str) -- latitude and longitude supplied as comma separated strings, e.g. "37.77,122.41"
Returns:	distance between the coordinates in kilometers
Return type:	float

Example 1:

>>> point_1 = "37.78953146306901,-122.41160227491551" # SF in CA
>>> point_2 = "47.65182346406002, -122.36765696283909" # Seattle in WA
>>> location_scorer(point_1, point_2)
>>> # 1096.98 kms (approx. points on Google maps says 680.23 miles/1094.72 kms)

Example 2:

>>> point_3, point_4 = "52.2296756,21.0122287", "52.406374,16.9251681"
>>> location_scorer(point_3, point_4)
>>> # 278.54 kms
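
For reference, the Haversine computation itself can be written in a few lines of standard-library Python. This is a sketch of the general formula, not the MindMeld source; haversine_km is a hypothetical name, and the Earth radius of 6371 km is an assumed constant, so results can differ slightly from the figures quoted in the examples above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(some_location, source_location):
    """Great-circle distance in km between two "lat,lon" strings."""
    lat1, lon1 = (radians(float(x)) for x in some_location.split(","))
    lat2, lon2 = (radians(float(x)) for x in source_location.split(","))
    # Haversine of the central angle between the two points.
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

The function accepts the same comma separated "lat,lon" strings as the examples, including ones with a space after the comma.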

static min_max_normalizer(list_of_numbers)[source]¶

static number_scorer(some_number)[source]¶

DATA_TYPES = ['bool', 'number', 'string', 'date', 'location', 'unknown']¶

DATE_FORMATS = ('%Y', '%d %b', '%d %B', '%b %d', '%B %d', '%b %Y', '%B %Y', '%b, %Y', '%B, %Y', '%b %d, %Y', '%B %d, %Y', '%b %d %Y', '%B %d %Y', '%b %d,%Y', '%B %d,%Y', '%d %b, %Y', '%d %B, %Y', '%d %b %Y', '%d %B %Y', '%d %b,%Y', '%d %B,%Y', '%m/%d/%Y', '%m/%d/%y', '%d/%m/%Y', '%d/%m/%y')¶

data_types¶

date_formats¶

class FieldResourceHelper[source]¶

Bases: mindmeld.components.question_answerer.FieldResourceDataHelper

static get_resolvers_cname(value)[source]¶

class Indices(app_path)[source]¶

Bases: object

An object that holds all the indices for an app_path

'self._indices' has the following dictionary format, with keys as the index name and the value as the metadata of that index

'''
{
    index_name1: {key11: FieldResource11, key12: FieldResource12, ...},
    index_name2: {key21: FieldResource21, key22: FieldResource22, ...},
    index_name3: {...},
    ...
}
'''

Index metadata includes metadata of each field found in the KB. The metadata for each field is encapsulated in a FieldResource object, which in turn consists of metadata related to that specific field in the KB (across all ids in the KB) along with information such as the field's data type (number, date, etc.), the field name, and a hash of the stored data. See the FieldResource class docstrings for more details.

delete_index(index_name)[source]¶: Deletes the index both from memory as well as disk

get_all_ids(index_name)[source]¶

Returns all ids observed in the KB for the specified index name in chronological order. The specified index must already be loaded into the memory to obtain ids.

Parameters:	index_name (str) -- A scoped index name
Returns:	a list of ids observed in the KB during load-kb
Return type:	List[str]
Raises:	`KeyError` -- When the specified (scoped) index name is not found in memory

get_index(index_name)[source]¶

Returns the index corresponding to the specified index name. If the index is not found in memory, this method looks up the cache path to obtain the index.

Parameters:	index_name (str) -- A scoped index name
Returns:	a dictionary of FieldResource, one for each field name in the index
Return type:	index_resources (Dict[str, FieldResource])
Raises:	`KnowledgeBaseError` -- if the specified index name is unavailable both in memory as well as disk

get_metadata(index_name)[source]¶

Returns index's FieldResources' metadata objects if the index is available in memory.

Parameters:	index_name (str) -- A scoped index name
Returns:	metadata associated with each field name of the index
Return type:	metadata (Dict[str, Bunch])
Raises:	`KeyError` -- When the specified (scoped) index name is not found in memory

is_available(index_name)[source]¶

Checks for availability of a specified index name both in memory (i.e. if already loaded into memory at the time of checking) as well as in the cache path where all the indices' metadata are stored.

Parameters:	index_name (str) -- A scoped index name for checking its availability
Returns:	True if index name is available in memory or in cache directory, else False
Return type:	bool

update_index_and_persist(index_name, index_resources, index_all_ids)[source]¶

Updates the specified index's resources in memory and dumps the metadata to disk for faster loading later on. Note that this method is best used with a load_kb() method, wherein fit resources are first created before updating indices.

During reloading of an index, use self.get_index_metadata() as well as fieldResource.update_resource() to obtain back a fit index.

Parameters:

index_name (str) -- A scoped index name for loading metadata
index_resources (Dict[str, FieldResource]) -- Dict curated with all the KB fields and their FieldResources
index_all_ids (List[str]) -- List of all ids observed for this index in the order they are present in the KB.

class Search(index)[source]¶

Bases: object

Search class enabling functionality to query, filter and sort. Utilizes various methods from Indices and FieldResource to compute results.

Currently, the following data types are supported for each clause type:

Query -> "string", "date" (assumes that the "date" field exists as strings; both are supported through kwargs)
Filter -> "number" and "date" (through range parameters), "bool" (through the boolean parameter), "string" (through kwargs)
Sort -> "number" and "date" (by specifying sort_type=asc or sort_type=desc), "location" (by specifying sort_type=distance and passing the origin 'location' parameter)

Note: This Search class supports more items than Elasticsearch based QA

execute(size=10)[source]¶

filter(query_type='keyword', **kwargs)[source]¶

query(query_type='keyword', **kwargs)[source]¶

sort(field, sort_type=None, location=None)[source]¶

ALL_INDICES = <mindmeld.components.question_answerer.NativeQuestionAnswerer.Indices object>¶

RESOURCE_LOADER = None¶

class mindmeld.components.question_answerer.QuestionAnswerer[source]¶

Bases: object

Backwards compatible QuestionAnswerer class

old usages (allowed but will soon be deprecated)

# loading KB directly through class method
>>> QuestionAnswerer.load_kb(...)
# instantiating a QA object from QuestionAnswerer instead of QuestionAnswererFactory
>>> question_answerer = QuestionAnswerer(app_path, resource_loader, es_host, config)

new usages

>>> question_answerer = QuestionAnswererFactory.create_question_answerer(**kwargs)
# Use the QA object's methods to load KB and get search results, instead of class methods
>>> question_answerer.load_kb(...)
>>> question_answerer.get(...) # .get(...) and .build_search(...)

classmethod load_kb(app_namespace, index_name, data_file, es_host=None, es_client=None, connect_timeout=2, clean=False, app_path=None, config=None, **kwargs)[source]¶

Implemented to maintain backward compatibility. Should be removed in future versions.

Parameters:

app_namespace (str) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
index_name (str) -- The name of the new index to be created.
data_file (str) -- The path to the data file containing the documents to be imported into the knowledge base index. It could be either json or jsonl file.
es_host (str) -- The Elasticsearch host server.
es_client (Elasticsearch) -- The Elasticsearch client.
connect_timeout (int, optional) -- The amount of time for a connection to the Elasticsearch host.
clean (bool) -- Set to true if you want to delete an existing index and reindex it
app_path (str) -- The path to the directory containing the app's data
config (dict) -- The QA config if passed directly rather than loaded from the app config

DEPRECATION_MESSAGE = "Calling QuestionAnswerer class directly will be deprecated in future versions. To instantiate a QA instance, use the QuestionAnswererFactory by calling 'qa = QuestionAnswererFactory.create_question_answerer(**kwargs)'. An instantiated QA can then be used as 'qa.load_kb(...)', 'qa.get(...)', etc. See https://www.mindmeld.com/docs/userguide/kb.html for details about the various functionalities available with different question-answerers."¶

class mindmeld.components.question_answerer.QuestionAnswererFactory[source]¶

Bases: object

Factory class for creating QuestionAnswerers

usage

>>> question_answerer = QuestionAnswererFactory.create_question_answerer(**kwargs)
>>> question_answerer.load_kb(...)
>>> question_answerer.get(...) # .get(...) or .build_search(...)

classmethod create_question_answerer(app_path=None, config=None, app_namespace=None, **kwargs)[source]¶

Parameters:

app_path (str, optional) -- The path to the directory containing the app's data. If provided, used to obtain default 'app_namespace' and QA configurations
app_namespace (str, optional) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
config (dict, optional) -- The QA config if passed directly rather than loaded from the app config