mindmeld.components.classifier module¶

This module contains the base class for all the machine-learned classifiers in MindMeld.

class mindmeld.components.classifier.Classifier(resource_loader)[source]¶

Bases: abc.ABC

The base class for all the machine-learned classifiers in MindMeld. A classifier is a machine-learned model that categorizes input examples into one of the pre-determined class labels. Among other functionality, each classifier provides means by which to fit a statistical model on a given training dataset and then use the trained model to make predictions on new unseen data.

ready¶: bool -- Whether the classifier is ready.

dirty¶: bool -- Whether the classifier has unsaved changes to its model.

config¶: ClassifierConfig -- The classifier configuration.

hash¶: str -- A hash representing the inputs into the model.

dump(model_path, incremental_model_path=None)[source]¶

Persists the trained classification model to disk.

Parameters:	model_path (str) -- The location on disk where the model should be stored. incremental_model_path (str, optional) -- The timestamp folder where the cached models are stored.

evaluate(queries=None, label_set=None, fetch_distribution=False)[source]¶

Evaluates the trained classification model on the given test data

Parameters:	queries (Optional(list(ProcessedQuery))) -- optional list of queries to evaluate label_set (str) -- The label set to use for evaluation.
Returns:	A ModelEvaluation object that contains evaluation results
Return type:	ModelEvaluation

fit(queries=None, label_set=None, incremental_timestamp=None, load_cached=True, **kwargs)[source]¶

Trains a statistical model for classification using the provided training examples and model configuration.

Parameters:

queries (list(ProcessedQuery) or ProcessedQueryList, optional) -- A list of queries to train on. If not specified the queries will be loaded from the label_set.
label_set (str) -- A label set to load. If not specified, the default training set will be loaded.
incremental_timestamp (str, optional) -- The timestamp folder to cache models in
model_type (str, optional) -- The type of machine learning model to use. If omitted, the default model type will be used.
model_settings (dict) -- Settings specific to the model type specified
features (dict) -- Features to extract from each example instance to form the feature vector used for model training. If omitted, the default feature set for the model type will be used.
params (dict) -- Params to pass to the underlying classifier
params_selection (dict) -- The grid of hyper-parameters to search, for finding the optimal hyper-parameter settings for the model. If omitted, the default hyper-parameter search grid will be used.
param_selection (dict) -- Configuration for param selection (using cross-validation) {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': { 'C': [100, 10000, 1000000]}}
features -- The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used as to extract features.
load_cached (bool) -- If the model is cached on disk, load it into memory.

Returns:

True if model was loaded and fit, False if a valid cached model exists but was not loaded (controlled by the load_cached arg).

Examples

Fit using default the configuration.

>>> clf.fit()

Fit using a 'special' label set.

>>> clf.fit(label_set='special')

Fit using given params, bypassing cross-validation. This is useful for speeding up train times if you are confident the params are optimized.

>>> clf.fit(params={'C': 10000000})

Fit using given parameter selection settings (also known as cross-validation settings).

>>> clf.fit(param_selection={})

Fit using a custom set of features, including a custom feature extractor. This is only for advanced users.

>>> clf.fit(features={
        'in-gaz': {}, // gazetteer features
        'contrived': lambda exa, res: {'contrived': len(exa.text) == 26}
    })

inspect(query, gold_label=None, dynamic_resource=None)[source]¶

load(model_path)[source]¶

Loads the trained classification model from disk

Parameters:	model_path (str) -- The location on disk where the model is stored

predict(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]¶

Predicts a class label for the given query using the trained classification model

Parameters:	query (Query or str) -- The input query time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles', or 'Asia/Kolkata' See the [tz database](https://www.iana.org/time-zones) for more information. timestamp (long, optional) -- A unix time stamp for the request (in seconds). dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference
Returns:	The predicted class label
Return type:	str

predict_proba(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]¶

Runs prediction on a given query and generates multiple hypotheses with their associated probabilities using the trained classification model

Parameters:	query (Query) -- The input query time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles', or 'Asia/Kolkata' See the [tz database](https://www.iana.org/time-zones) for more information. timestamp (long, optional) -- A unix time stamp for the request (in seconds). dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference
Returns:	a list of tuples of the form (str, float) grouping predicted class labels and their probabilities
Return type:	list

unload()[source]¶: Unloads the model from memory. This helps reduce memory requirements while training other models.

view_extracted_features(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]¶

Extracts features for the given input based on the model config.

Parameters:	query (Query or str) -- The input query time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles', or 'Asia/Kolkata' See the [tz database](https://www.iana.org/time-zones) for more information. timestamp (long, optional) -- A unix time stamp for the request (in seconds). dynamic_resource (dict) -- Dynamic gazetteer to be included for feature extraction.
Returns:	The extracted features from the given input
Return type:	dict

CLF_TYPE = None¶: Classifier type (str).

class mindmeld.components.classifier.ClassifierConfig(model_type=None, features=None, model_settings=None, params=None, param_selection=None)[source]¶

Bases: object

A value object representing a classifier configuration

model_type¶: str -- The name of the model type. Will be used to find the model class to instantiate.

model_settings¶: dict -- Settings specific to the model type specified.

params¶: dict -- Params to pass to the underlying classifier.

param_selection¶: dict -- Configuration for param selection (using cross validation). For example: {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {} }

features¶: dict -- The keys are the names of feature extractors and the values are either a kwargs dict which will be passed into the feature extractor function, or a callable which will be used as to extract features.

classmethod from_model_config(model_config)[source]¶

to_dict()[source]¶

Converts the model config object into a dict.

Returns:	A dict version of the config.
Return type:	(dict)

to_json()[source]¶

Converts the model config object to JSON.

Returns:	JSON representation of the classifier.
Return type:	(str)

features

model_settings

model_type

param_selection

params