mindmeld.components.entity_recognizer module
This module contains the entity recognizer component of the MindMeld natural language processor.
-
class mindmeld.components.entity_recognizer.EntityRecognizer(resource_loader, domain, intent)
Bases: mindmeld.components.classifier.Classifier
An entity recognizer which is used to identify the entities for a given query. It is trained using all the labeled queries for a particular intent. The labels are the entity annotations for each query.
-
domain
str -- The domain that this entity recognizer belongs to
-
intent
str -- The intent that this entity recognizer belongs to
-
entity_types
set -- A set containing the entity types which can be recognized
-
fit(queries=None, label_set=None, incremental_timestamp=None, load_cached=True, **kwargs)
Trains a statistical model for classification using the provided training examples and model configuration.
Parameters:
- queries (list(ProcessedQuery) or ProcessedQueryList, optional) -- A list of queries to train on. If not specified the queries will be loaded from the label_set.
- label_set (str) -- A label set to load. If not specified, the default training set will be loaded.
- incremental_timestamp (str, optional) -- The timestamp folder to cache models in
- model_type (str, optional) -- The type of machine learning model to use. If omitted, the default model type will be used.
- model_settings (dict) -- Settings specific to the model type specified
- features (dict) -- Features to extract from each example instance to form the feature vector used for model training. If omitted, the default feature set for the model type will be used.
- params (dict) -- Params to pass to the underlying classifier
- params_selection (dict) -- The grid of hyper-parameters to search, for finding the optimal hyper-parameter settings for the model. If omitted, the default hyper-parameter search grid will be used.
- param_selection (dict) -- Configuration for parameter selection (using cross-validation), for example: {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {'C': [100, 10000, 1000000]}}
- features -- The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used to extract features.
- load_cached (bool) -- If the model is cached on disk, load it into memory.
Returns: True if the model was loaded and fit; False if a valid cached model exists but was not loaded (controlled by the load_cached arg).
Examples
Fit using the default configuration.
>>> clf.fit()
Fit using a 'special' label set.
>>> clf.fit(label_set='special')
Fit using the given params, bypassing cross-validation. This is useful for speeding up training times if you are confident the params are optimized.
>>> clf.fit(params={'C': 10000000})
Fit using given parameter selection settings (also known as cross-validation settings).
>>> clf.fit(param_selection={})
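To make the parameter selection settings more concrete, the snippet below is a rough sketch that builds the example configuration from the parameter list and enumerates the candidate settings in its `grid`. The helper `expand_grid` is a hypothetical name introduced here for illustration; it is not part of MindMeld, and how MindMeld itself walks the grid is not described in this section.

```python
from itertools import product

# Example configuration copied from the `param_selection` parameter description.
param_selection = {
    'type': 'shuffle',
    'n': 3,
    'k': 10,
    'n_jobs': 2,
    'scoring': '',
    'grid': {'C': [100, 10000, 1000000]},
}


def expand_grid(grid):
    """Hypothetical helper: list every combination of values in a grid dict."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_selection['grid'])
for candidate in candidates:
    print(candidate)
```

A dict of this shape would then be passed as `clf.fit(param_selection=param_selection)`.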
Fit using a custom set of features, including a custom feature extractor. This is only for advanced users.
>>> clf.fit(features={
...     'in-gaz': {},  # gazetteer features
...     'contrived': lambda exa, res: {'contrived': len(exa.text) == 26}
... })
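The custom extractor in the last example can also be written as a named function, which is easier to test in isolation. This is only a sketch: it assumes, as the argument names `exa` and `res` suggest, that the callable receives the example and a resources object, and it uses a `SimpleNamespace` as a stand-in for a real query object. The name `contrived_extractor` is hypothetical.

```python
from types import SimpleNamespace


def contrived_extractor(example, resources):
    """Return one boolean feature: whether the example text is 26 characters long."""
    return {'contrived': len(example.text) == 26}


# Stand-in for a query object; only the `text` attribute is assumed here.
example = SimpleNamespace(text='abcdefghijklmnopqrstuvwxyz')
features_for_example = contrived_extractor(example, resources={})

# Feature configuration in the same shape as the example above.
features = {
    'in-gaz': {},  # gazetteer features
    'contrived': contrived_extractor,
}
```

The `features` dict would then be passed as `clf.fit(features=features)`.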
-
load(model_path)
Loads the trained entity recognition model from disk.
Parameters: model_path (str) -- The location on disk where the model is stored.
-
predict(query, time_zone=None, timestamp=None, dynamic_resource=None)
Predicts entities for the given query using the trained recognition model.
Parameters:
- query (Query, str) -- The input query.
- time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
- timestamp (long, optional) -- A Unix timestamp for the request (in seconds).
- dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference.
Returns: The predicted class label.
Return type: (str)
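Because `timestamp` is expected in seconds and `time_zone` as an IANA name, it can help to prepare both values explicitly before calling `predict`. The following is a minimal sketch using only the standard library; the variable name `predict_kwargs` is illustrative, and the call to the recognizer itself is omitted because it requires a trained model.

```python
from datetime import datetime, timezone

# A Unix timestamp in seconds for a fixed UTC instant.
request_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
timestamp = int(request_time.timestamp())

# Keyword arguments matching the parameter names documented above.
predict_kwargs = {
    'time_zone': 'America/Los_Angeles',
    'timestamp': timestamp,
}
print(predict_kwargs)
```

With a trained recognizer, these could be passed along as `predict(query, **predict_kwargs)`.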
-
predict_proba(query, time_zone=None, timestamp=None, dynamic_resource=None)
Runs prediction on a given query and generates multiple entity tagging hypotheses, with their associated probabilities, using the trained entity recognition model.
Parameters:
- query (Query, str) -- The input query.
- time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
- timestamp (long, optional) -- A Unix timestamp for the request (in seconds).
- dynamic_resource (optional) -- Dynamic resource, unused.
Returns: A list of tuples of the form (Entity list, float) grouping potential entity tagging hypotheses and their probabilities.
Return type: (list)
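Since the return value is a list of (Entity list, float) tuples, picking the most likely hypothesis is a matter of comparing the second element of each tuple. The sketch below uses placeholder data, with plain strings standing in for Entity objects; the values are invented for illustration and are not output from a real model.

```python
# Placeholder hypotheses: plain strings stand in for Entity objects.
hypotheses = [
    (['entity_a'], 0.21),
    (['entity_a', 'entity_b'], 0.72),
    ([], 0.07),
]

# The most probable tagging hypothesis and its probability.
best_entities, best_probability = max(hypotheses, key=lambda pair: pair[1])

# All hypotheses ordered from most to least probable.
ranked = sorted(hypotheses, key=lambda pair: pair[1], reverse=True)
```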
-
unload()
Unloads the model from memory. This helps reduce memory requirements while training other models.
-
CLF_TYPE = 'entity'
The classifier type.
-