mindmeld.components.question_answerer module¶
This module contains the question answerer component of MindMeld.
-
class
mindmeld.components.question_answerer.
BaseQuestionAnswerer
(app_path=None, config=None, app_namespace=None, **_kwargs)[source]¶ Bases:
abc.ABC
-
build_search
(index_name=None, ranking_config=None, app_namespace=None, **kwargs)[source]¶ Build a search object for advanced filtered search.
Parameters: Returns: a Search object for filtered search.
Return type: Search
-
get
(index_name=None, size=10, query_type=None, app_namespace=None, **kwargs)[source]¶ Parameters: - index_name (str) -- The name of an index.
- size (int) -- The maximum number of records, default to 10.
- query_type (str) -- Whether the search is over structured or unstructured text, and whether to use text signals, embedder signals, or both for ranking.
- id (str) -- The id of a particular document to retrieve.
- _sort (str) -- Specify the knowledge base field for custom sort.
- _sort_type (str) -- Specify custom sort type. Valid values are 'asc', 'desc' and 'distance'.
- _sort_location (dict) -- The origin location to be used when sorting by distance.
Returns: A list of matching documents.
Return type:
-
load_kb
(index_name, data_file, **kwargs)[source]¶ Loads documents from disk into the specified index in the knowledge base. If an index with the specified name doesn't exist, a new index with that name will be created in the knowledge base.
Parameters: Optional Args (used by all QA classes):
- app_namespace (str) -- A custom namespace of the app. Used to prevent collisions between the indices of two apps with the same app name.
- clean (bool) -- Set to True to delete an existing index and reindex it. If False (default), ElasticsearchQA updates its index with the new objects without deleting the old objects, whereas NativeQA replaces the old index with a new index consisting of the input data file's KB objects.
- embedding_fields (list) -- List of embedding fields for the given index that can be passed in directly instead of adding them to the QA config or overriding the QA config. Embedder information is generated and indexed only for the user-specified fields, not for all KB field names. If this list is empty, no fields have the embedder component even if the 'embedder' keyword is specified in 'model_type'.
Optional Args (Elasticsearch specific):
- es_host (str) -- The Elasticsearch host server.
- es_client (Elasticsearch) -- The Elasticsearch client.
- connect_timeout (int) -- The amount of time for a connection to the Elasticsearch host.
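The difference in how the two backends treat clean=False can be illustrated with plain dictionaries keyed by document id. This is only a toy sketch of the behavior described above, not MindMeld code; the function names es_style_load and native_style_load and the "id" key are hypothetical.

```python
def es_style_load(index, new_docs, clean=False):
    """Mimic the described Elasticsearch behavior: upsert unless clean=True."""
    result = {} if clean else dict(index)
    for doc in new_docs:
        result[doc["id"]] = doc  # add or overwrite; untouched old docs are kept
    return result


def native_style_load(index, new_docs, clean=False):
    """Mimic the described native behavior: the data file replaces the index."""
    # The old index contents are discarded regardless of `clean`.
    return {doc["id"]: doc for doc in new_docs}


old_index = {"1": {"id": "1", "name": "pad thai"}}
new_docs = [{"id": "2", "name": "green curry"}]

print(sorted(es_style_load(old_index, new_docs)))      # old and new ids
print(sorted(native_style_load(old_index, new_docs)))  # new ids only
```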
-
model_settings
¶
-
model_type
¶
-
query_type
¶
-
-
class
mindmeld.components.question_answerer.
ElasticsearchQuestionAnswerer
(**kwargs)[source]¶ Bases:
mindmeld.components.question_answerer.BaseQuestionAnswerer
The question answerer is primarily an information retrieval system that provides all the necessary functionality for interacting with the application's knowledge base.
This class uses Elasticsearch in the backend to implement various underlying functionalities of question answerer.
-
class
FieldInfo
(name, field_type)[source]¶ Bases:
object
This class models the metadata of a knowledge base field.
-
is_date_field
()[source]¶ Returns True if the knowledge base field is a date field, otherwise returns False
-
is_location_field
()[source]¶ Returns True if the knowledge base field is a location field, otherwise returns False
-
is_number_field
()[source]¶ Returns True if the knowledge base field is a number field, otherwise returns False
-
is_text_field
()[source]¶ Returns True if the knowledge base field is a text field, otherwise returns False
-
is_vector_field
()[source]¶ Returns True if the knowledge base field is a vector field, otherwise returns False
-
DATE_TYPES
= {'date'}¶
-
GEO_TYPES
= {'geo_point'}¶
-
NUMBER_TYPES
= {'long', 'double', 'short', 'integer', 'byte', 'scaled_float', 'half_float', 'float'}¶
-
TEXT_TYPES
= {'text', 'keyword'}¶
-
VECTOR_TYPES
= {'dense_vector'}¶
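The type sets above suggest that each is_*_field() predicate is a membership test of the field's Elasticsearch type against the corresponding set. The sketch below assumes exactly that; FieldInfoSketch is a hypothetical stand-in and not the library class.

```python
class FieldInfoSketch:
    """Hypothetical stand-in that mirrors the documented constants."""

    DATE_TYPES = {"date"}
    GEO_TYPES = {"geo_point"}
    NUMBER_TYPES = {"long", "double", "short", "integer", "byte",
                    "scaled_float", "half_float", "float"}
    TEXT_TYPES = {"text", "keyword"}
    VECTOR_TYPES = {"dense_vector"}

    def __init__(self, name, field_type):
        self.name = name
        self.field_type = field_type

    # Assumption: each predicate is a plain set-membership check.
    def is_date_field(self):
        return self.field_type in self.DATE_TYPES

    def is_location_field(self):
        return self.field_type in self.GEO_TYPES

    def is_number_field(self):
        return self.field_type in self.NUMBER_TYPES

    def is_text_field(self):
        return self.field_type in self.TEXT_TYPES

    def is_vector_field(self):
        return self.field_type in self.VECTOR_TYPES


price = FieldInfoSketch("price", "scaled_float")
print(price.is_number_field(), price.is_text_field())
```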
-
-
class
Search
(client, index, ranking_config=None, field_info=None)[source]¶ Bases:
object
This class models a generic filtered search in knowledge base. It allows developers to construct more complex knowledge base search criteria based on the application requirements.
-
class
FilterClause
(field, field_info=None, value=None, query_type='keyword', range_gt=None, range_gte=None, range_lt=None, range_lte=None)[source]¶ Bases:
mindmeld.components.question_answerer.Clause
This class models a knowledge base filter clause.
-
class
QueryClause
(field, field_info, value, query_type='keyword', synonym_field=None)[source]¶ Bases:
mindmeld.components.question_answerer.Clause
This class models a knowledge base query clause.
-
DEFAULT_EXACT_MATCH_BOOSTING_WEIGHT
= 100¶
-
-
class
SortClause
(field, field_info=None, sort_type=None, field_stats=None, location=None)[source]¶ Bases:
mindmeld.components.question_answerer.Clause
This class models a knowledge base sort clause.
-
DEFAULT_SORT_WEIGHT
= 30¶
-
SORT_DISTANCE
= 'distance'¶
-
SORT_ORDER_ASC
= 'asc'¶
-
SORT_ORDER_DESC
= 'desc'¶
-
SORT_TYPES
= {'distance', 'asc', 'desc'}¶
-
-
execute
(size=10)[source]¶ Executes the knowledge base search with provided criteria and returns matching documents.
Parameters: size (int) -- The maximum number of records to fetch, default to 10. Returns: a list of matching documents.
-
filter
(query_type='keyword', **kwargs)[source]¶ Specify filter condition to be applied to specified knowledge base field. In MindMeld two types of filters are supported: text filter and range filters.
Text filters are used to apply hard filters on specified knowledge base text fields. The filter text value is normalized and matched using entire text span against the knowledge base field.
It's common to have filter conditions based on other resolved canonical entities. For example, in food ordering domain the resolved restaurant entity can be used as a filter to resolve dish entities. The exact knowledge base field to apply these filters depends on the knowledge base data model of the application. If the entity is not in the canonical form, a fuzzy filter can be applied by setting the query_type to 'text'.
Range filters are used to filter with a value range on specified knowledge base number or date fields. Example use cases include price range filters and date range filters.
Examples:
- add text filter:
>>> s = question_answerer.build_search(index='menu_items')
>>> s.filter(restaurant_id='B01CGKGQ40')
- add range filter:
>>> s = question_answerer.build_search(index='menu_items')
>>> s.filter(field='price', gte=1, lt=10)
Parameters: - query_type (str) -- Whether the filter is over structured or unstructured text.
- kwargs -- A keyword argument to specify the filter text and the knowledge base text field.
- field (str) -- knowledge base field name for range filter.
- gt (number or str) -- range filter operator for greater than.
- gte (number or str) -- range filter operator for greater than or equal to.
- lt (number or str) -- range filter operator for less than.
- lte (number or str) -- range filter operator for less than or equal to.
Returns: A new Search object with added search criteria.
Return type: Search
-
query
(query_type='keyword', **kwargs)[source]¶ Specify the query text to match on a knowledge base text field. The query text is normalized and processed (based on query_type) to find matches in knowledge base using several text relevance scoring factors including exact matches, phrase matches and partial matches.
Examples
>>> s = question_answerer.build_search(index='dish')
>>> s.query(name='pad thai')
In the example above the query text "pad thai" will be used to match against document field "name" in knowledge base index "dish".
Parameters: kwargs -- A keyword argument to specify the query text and the knowledge base document field, along with the query type.
Returns: a new Search object with added search criteria.
Return type: Search
-
sort
(field, sort_type=None, location=None)[source]¶ Specify custom sort criteria.
Parameters: - field (str) -- knowledge base field for sort.
- sort_type (str) -- sorting type. valid values are 'asc', 'desc' and 'distance'. 'asc' and 'desc' can be used to sort numeric or date fields and 'distance' can be used to sort by distance on geo_point fields. Default sort type is 'desc' if not specified.
- location (str) -- location (lat, lon) in geo_point format to be used as origin when sorting by 'distance'
-
SYN_FIELD_SUFFIX
= '$whitelist'¶
-
class
mindmeld.components.question_answerer.
NativeQuestionAnswerer
(*args, **kwargs)[source]¶ Bases:
mindmeld.components.question_answerer.BaseQuestionAnswerer
The question answerer is primarily an information retrieval system that provides all the necessary functionality for interacting with the application's knowledge base.
This class uses Entity Resolvers in the backend to implement various underlying functionalities of question answerer. It consists of three important sub-classes: (1) Indices, which maintains the different indices including the fit entity resolvers used for inference, (2) FieldResource, which forms the core of each index, encapsulating the fit resolvers and metadata related to each KB field, and (3) Search, which is used to build custom searches similar to what ElasticsearchQuestionAnswerer offers. In addition, NativeQuestionAnswerer offers the same APIs as the Elasticsearch one: .get(), .load_kb(), and .build_search().
The created resolvers are dumped at DEFAULT_APP_PATH, whose directory serves as a common site to host all indices, similar to how all the indices of Elasticsearch are stored in a common directory on the disk.
-
class
FieldResource
(index_name, field_name)[source]¶ Bases:
mindmeld.components.question_answerer.FieldResourceHelper
An object encapsulating all resources necessary for searching, filtering, and sorting on any field in the Knowledge Base. This class should only be used as part of the Indices class and not in isolation.
This class currently supports:
- location strings,
- date strings,
- boolean,
- number,
- strings, and
- list of strings.
Any other data type (e.g., dictionary type) is currently not supported and is marked as an 'unknown' data type. Fields of an unknown data type do not have any associated resolvers.
-
static
curate_docs_to_return
(index_resources, _ids, _scores=None)[source]¶ Collates all field names into docs
Parameters: - index_resources -- a dict of field names and corresponding FieldResource instances
- _ids (List[str]) -- if provided as a list of strings, only docs with those ids are obtained in the same order of the ids, else all ids are used
- _scores (List[number], optional) -- if provided as a list of numbers and of same size as the _ids, they will be attached to the curated results for corresponding _ids
Returns: compiled docs
Return type:
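A rough picture of what "collating field names into docs" means: given per-field mappings from document id to value, rebuild one document per requested id, in the order of the ids, and attach scores when they line up with the ids. The sketch below uses plain dictionaries in place of FieldResource instances; the function name curate_docs and the "_score" key are assumptions made for illustration.

```python
def curate_docs(field_values, ids, scores=None):
    """Collate per-field {doc_id: value} mappings into a list of docs."""
    docs = []
    for position, doc_id in enumerate(ids):
        doc = {}
        for field_name, id2value in field_values.items():
            if doc_id in id2value:
                doc[field_name] = id2value[doc_id]
        # Scores are attached only when they match the ids one-to-one.
        if scores is not None and len(scores) == len(ids):
            doc["_score"] = scores[position]
        docs.append(doc)
    return docs


fields = {
    "name": {"a": "pad thai", "b": "green curry"},
    "price": {"a": 8, "b": 9},
}
print(curate_docs(fields, ["b", "a"], [0.9, 0.1]))
```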
-
do_filter
(allowed_ids, filter_text=None, gt=None, gte=None, lt=None, lte=None, et=None, boolean=None)[source]¶ Filters a list of docs to a subset based on some criteria such as a boolean value or {>,<,=} operations or a text snippet.
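The range part of this filtering can be pictured as follows. This is a minimal sketch over a plain {doc_id: value} mapping, assuming that all supplied operators must hold at once; the function name filter_ids is hypothetical and the library's actual return type is not stated here.

```python
def filter_ids(id2value, allowed_ids, gt=None, gte=None, lt=None, lte=None, et=None):
    """Keep the ids from allowed_ids whose value satisfies every given operator."""
    kept = []
    for doc_id in allowed_ids:
        value = id2value.get(doc_id)
        if value is None:
            continue  # no value recorded for this id
        if gt is not None and not value > gt:
            continue
        if gte is not None and not value >= gte:
            continue
        if lt is not None and not value < lt:
            continue
        if lte is not None and not value <= lte:
            continue
        if et is not None and value != et:
            continue
        kept.append(doc_id)
    return kept


prices = {"a": 8, "b": 12, "c": 1}
print(filter_ids(prices, ["a", "b", "c"], gte=1, lt=10))
```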
-
do_search
(query_type, value, allowed_ids=None)[source]¶ Retrieves doc ids with corresponding similarity scores for the given query
Parameters: Returns: a mapping between _ids recorded in this field and their corresponding scores
Return type:
-
classmethod
from_metadata
(cache_object: mindmeld.core.Bunch)[source]¶ Creates a fit resource from metadata
Parameters: cache_object (Bunch) -- a Bunch dictionary object with attribute names and values Returns: FieldResource
-
load_resolvers
(app_path='/Users/lucienc/.cache/mindmeld', resource_loader=None)[source]¶ Loads a field resource by fitting with latest data (if id2value is passed) or by loading already fit resolvers if no data changes take place.
Parameters: - app_path (str, optional) -- a path to create cache for embedder resolver
- resource_loader (ResourceLoader, optional) -- a resource loader object
-
to_metadata
()[source]¶ Returns a Bunch object consisting of various details about the resource: the (scoped) index name, the name of the KB field for which the resource is built, the field's data type, all the KB data associated with this field, a hash that uniquely identifies the data, the data processor type, and the presence of text as well as embedder resolvers. The returned metadata object does not contain any fit entity resolvers.
Returns: various meta information of this field resource Return type: Bunch
-
update_resource
(id2value, has_text_resolver, has_embedding_resolver, resolver_model_settings=None, clean=False, app_path='/Users/lucienc/.cache/mindmeld', processor_type='keyword', resource_loader=None)[source]¶ Updates a field resource by fitting with latest data (if id2value is passed) or by loading already fit resolvers if no data changes take place.
While loading from metadata, the self.id2value data is also loaded, which is in turn used to create the new_hash. Even if the new_hash is same as self.hash (implying no data changes), if the self._text_resolver is not fit but is required, the resolver is loaded instead of fitting.
Parameters: - id2value (dict) -- a mapping between document ids & values of the chosen KB field
- has_text_resolver (bool) -- If a tfidf resolver is to be created
- has_embedding_resolver (bool) -- If an embedder resolver is to be created
- resolver_model_settings (dict) -- A dictionary consisting of model settings for the resolver models. Currently, the same setting is passed to all kinds of resolver models.
- clean (bool, optional) -- if True, resolvers are fit with clean=True
- app_path (str, optional) -- a path to create cache for embedder resolver
- processor_type (str, optional, "text" or "keyword") -- processor for tfidf resolver
- resource_loader (ResourceLoader, optional) -- a resource loader object
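The fit-versus-load decision described above can be sketched as a comparison between a stored hash and a hash of the incoming data. The hashing scheme below (SHA-256 over sorted JSON) and the names data_hash and plan_update are assumptions for illustration only; the library's actual hash function is not documented here.

```python
import hashlib
import json


def data_hash(id2value):
    """Hash the id-to-value mapping in a key-order-independent way (assumed scheme)."""
    payload = json.dumps(id2value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_update(stored_hash, id2value, resolver_is_fit):
    """Return 'fit', 'load', or 'noop' following the description above."""
    if data_hash(id2value) != stored_hash:
        return "fit"   # data changed: fit the resolver on the latest data
    if not resolver_is_fit:
        return "load"  # same data, resolver missing: load the already fit one
    return "noop"      # same data and resolver already in memory


names = {"a": "pad thai", "b": "green curry"}
stored = data_hash(names)
print(plan_update(stored, names, resolver_is_fit=False))
```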
-
doc_ids
¶
-
class
FieldResourceDataHelper
[source]¶ Bases:
object
A class that holds methods to aid validating, formatting and scoring different data types in a FieldResource object.
-
static
date_scorer
(value)[source]¶ Ascertains a suitable date format for the input and returns the number of days from the origin date as the score.
-
static
location_scorer
(some_location, source_location)[source]¶ Uses Haversine formula to find distance between two coordinates references: https://en.wikipedia.org/wiki/Haversine_formula and
Parameters: Returns: distance between the coordinates in kilometers
Return type:
- Example 1:
>>> point_1 = "37.78953146306901,-122.41160227491551"  # SF in CA
>>> point_2 = "47.65182346406002, -122.36765696283909"  # Seattle in WA
>>> location_scorer(point_1, point_2)
>>> # 1096.98 kms (approx. points on Google maps says 680.23 miles/1094.72 kms)
- Example 2:
>>> point_3, point_4 = "52.2296756,21.0122287", "52.406374,16.9251681"
>>> location_scorer(point_3, point_4)
>>> # 278.54 kms
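For readers who want to see the Haversine computation itself, here is a self-contained sketch that accepts the same "lat,lon" string format used in the examples above. It is not the library's implementation; the function name haversine_km and the mean Earth radius of 6371 km are assumptions, so results should land near the quoted figures but will not match them to the last digit.

```python
import math


def haversine_km(point_a, point_b, radius_km=6371.0):
    """Great-circle distance in kilometers between two 'lat,lon' strings."""
    lat1, lon1 = (math.radians(float(part)) for part in point_a.split(","))
    lat2, lon2 = (math.radians(float(part)) for part in point_b.split(","))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


sf = "37.78953146306901,-122.41160227491551"
seattle = "47.65182346406002, -122.36765696283909"
print(round(haversine_km(sf, seattle), 2))
```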
-
DATA_TYPES
= ['bool', 'number', 'string', 'date', 'location', 'unknown']¶
-
DATE_FORMATS
= ('%Y', '%d %b', '%d %B', '%b %d', '%B %d', '%b %Y', '%B %Y', '%b, %Y', '%B, %Y', '%b %d, %Y', '%B %d, %Y', '%b %d %Y', '%B %d %Y', '%b %d,%Y', '%B %d,%Y', '%d %b, %Y', '%d %B, %Y', '%d %b %Y', '%d %B %Y', '%d %b,%Y', '%d %B,%Y', '%m/%d/%Y', '%m/%d/%y', '%d/%m/%Y', '%d/%m/%y')¶
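One plausible reading of how these formats feed into date_scorer: try each format in order until one parses, then count days from a fixed origin. The sketch below assumes in-order trial, uses only a subset of the formats, and picks 1970-01-01 as the origin; all three are assumptions, as is the name days_from_origin. Month names are parsed according to the current locale (English assumed).

```python
from datetime import datetime

# A subset of DATE_FORMATS, tried in order; the first match wins.
FORMATS = ("%Y", "%b %d, %Y", "%d %B %Y", "%m/%d/%Y")
ORIGIN = datetime(1970, 1, 1)  # assumed origin date


def days_from_origin(value):
    """Parse a date string with the first matching format and score it in days."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
        return (parsed - ORIGIN).days
    raise ValueError(f"no known date format matches {value!r}")


print(days_from_origin("Jan 02, 1970"))
```

Because '%m/%d/%Y' precedes '%d/%m/%Y' in the documented tuple, an ambiguous string such as "03/01/1970" would be read month-first under this in-order assumption.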
-
data_types
¶
-
date_formats
¶
-
class
FieldResourceHelper
[source]¶ Bases:
mindmeld.components.question_answerer.FieldResourceDataHelper
-
class
Indices
(app_path)[source]¶ Bases:
object
An object that holds all the indices for an app_path.
'self._indices' has the following dictionary format, with keys as the index name and the value as the metadata of that index:
'''
{
    index_name1: {key11: FieldResource11, key12: FieldResource12, ...},
    index_name2: {key21: FieldResource21, key22: FieldResource22, ...},
    index_name3: {...},
    ...
}
'''
Index metadata includes metadata of each field found in the KB. The metadata for each field is encapsulated in a FieldResource object, which in turn consists of metadata related to that specific field in the KB (across all ids in the KB) along with information such as the data type of that field (number, date, etc.), the field name, and a hash of the stored data. See the FieldResource class docstrings for more details.
-
get_all_ids
(index_name)[source]¶ Returns all ids observed in the KB for the specified index name in chronological order. The specified index must already be loaded into the memory to obtain ids.
Parameters: index_name (str) -- A scoped index name Returns: a list of ids observed in the KB during load-kb Return type: List[str] Raises: KeyError
-- When the specified (scoped) index name is not found in memory
-
get_index
(index_name)[source]¶ Returns the index corresponding to the specified index name. If the index is not found in memory, this method looks up the cache path to obtain the index.
Parameters: index_name (str) -- A scoped index name Returns: a dictionary of FieldResource, one for each field name in the index
Return type: index_resources (Dict[str, FieldResource]) Raises: KnowledgeBaseError -- if the specified index name is unavailable both in memory and on disk
-
get_metadata
(index_name)[source]¶ Returns index's FieldResources' metadata objects if the index is available in memory.
Parameters: index_name (str) -- A scoped index name Returns: metadata associated with each field name of the index Return type: metadata (Dict[str, Bunch]) Raises: KeyError
-- When the specified (scoped) index name is not found in memory
-
is_available
(index_name)[source]¶ Checks for availability of a specified index name both in memory (i.e. if already loaded into memory at the time of checking) as well as in the cache path where all the indices' metadata are stored.
Parameters: index_name (str) -- A scoped index name for checking its availability Returns: True if index name is available in memory or in cache directory, else False Return type: bool
-
update_index_and_persist
(index_name, index_resources, index_all_ids)[source]¶ Updates the specified index's resources in memory and dumps the metadata to disk for faster loading later on. Note that this method is best used with a load_kb() method, wherein fit resources are first created before updating indices.
During reloading of an index, use self.get_index_metadata() as well as fieldResource.update_resource() to obtain back a fit index.
Parameters:
-
class
Search
(index)[source]¶ Bases:
object
Search class enabling functionality to query, filter and sort. Utilizes various methods from Indices and FieldResource to compute results.
Currently, the following data types are supported for each clause type:
- Query -> "string", "date" (assumes that "date" field exists as strings; both are supported through kwargs)
- Filter -> "number" and "date" (through range parameters), "bool" (through boolean parameter), "string" (through kwargs)
- Sort -> "number" and "date" (by specifying sort_type=asc or sort_type=desc), "location" (by specifying sort_type=distance and passing origin 'location' parameter)
Note: This Search class supports more items than the Elasticsearch-based QA.
-
ALL_INDICES
= <mindmeld.components.question_answerer.NativeQuestionAnswerer.Indices object>¶
-
RESOURCE_LOADER
= None¶
-
class
mindmeld.components.question_answerer.
QuestionAnswerer
[source]¶ Bases:
object
Backwards compatible QuestionAnswerer class
- old usages (allowed but will soon be deprecated)
# loading KB directly through class method
>>> QuestionAnswerer.load_kb(...)
# instantiating a QA object from QuestionAnswerer instead of QuestionAnswererFactory
>>> question_answerer = QuestionAnswerer(app_path, resource_loader, es_host, config)
- new usages
>>> question_answerer = QuestionAnswererFactory.create_question_answerer(**kwargs)
# Use the QA object's methods to load KB and get search results, instead of class methods
>>> question_answerer.load_kb(...)
>>> question_answerer.get(...) # .get(...) and .build_search(...)
-
classmethod
load_kb
(app_namespace, index_name, data_file, es_host=None, es_client=None, connect_timeout=2, clean=False, app_path=None, config=None, **kwargs)[source]¶ Implemented to maintain backward compatibility. Should be removed in future versions.
Parameters: - app_namespace (str) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
- index_name (str) -- The name of the new index to be created.
- data_file (str) -- The path to the data file containing the documents to be imported into the knowledge base index. It could be either json or jsonl file.
- es_host (str) -- The Elasticsearch host server.
- es_client (Elasticsearch) -- The Elasticsearch client.
- connect_timeout (int, optional) -- The amount of time for a connection to the Elasticsearch host.
- clean (bool) -- Set to true if you want to delete an existing index and reindex it
- app_path (str) -- The path to the directory containing the app's data
- config (dict) -- The QA config if passed directly rather than loaded from the app config
-
DEPRECATION_MESSAGE
= "Calling QuestionAnswerer class directly will be deprecated in future versions. To instantiate a QA instance, use the QuestionAnswererFactory by calling 'qa = QuestionAnswererFactory.create_question_answerer(**kwargs)'. An instantiated QA can then be used as 'qa.load_kb(...)', 'qa.get(...)', etc. See https://www.mindmeld.com/docs/userguide/kb.html for details about the various functionalities available with different question-answerers."¶
-
class
mindmeld.components.question_answerer.
QuestionAnswererFactory
[source]¶ Bases:
object
Factory class for creating QuestionAnswerers
- usage
>>> question_answerer = QuestionAnswererFactory.create_question_answerer(**kwargs)
>>> question_answerer.load_kb(...)
>>> question_answerer.get(...) # .get(...) or .build_search(...)
-
classmethod
create_question_answerer
(app_path=None, config=None, app_namespace=None, **kwargs)[source]¶ Parameters: - app_path (str, optional) -- The path to the directory containing the app's data. If provided, used to obtain default 'app_namespace' and QA configurations
- app_namespace (str, optional) -- The namespace of the app. Used to prevent collisions between the indices of this app and those of other apps.
- config (dict, optional) -- The QA config if passed directly rather than loaded from the app config