mindmeld.components.classifier module

This module contains the base class for all the machine-learned classifiers in MindMeld.

class mindmeld.components.classifier.Classifier(resource_loader)[source]

Bases: abc.ABC

The base class for all the machine-learned classifiers in MindMeld. A classifier is a machine-learned model that categorizes input examples into one of the pre-determined class labels. Among other functionality, each classifier provides means by which to fit a statistical model on a given training dataset and then use the trained model to make predictions on new unseen data.
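The fit-then-predict contract described above can be illustrated with a toy stand-in. This is a minimal sketch, not MindMeld's implementation: `ToyClassifier` and `MajorityClassifier` are hypothetical names, and training data is passed as plain (text, label) pairs rather than ProcessedQuery objects.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ToyClassifier(ABC):
    """Hypothetical stand-in showing a fit/predict contract (not MindMeld code)."""

    def __init__(self):
        self.ready = False  # becomes True once a model has been fit

    @abstractmethod
    def fit(self, examples):
        """Train on a list of (text, label) pairs."""

    @abstractmethod
    def predict(self, text):
        """Return one of the labels seen during training."""


class MajorityClassifier(ToyClassifier):
    """Always predicts the most frequent training label."""

    def fit(self, examples):
        counts = Counter(label for _, label in examples)
        self._label = counts.most_common(1)[0][0]
        self.ready = True

    def predict(self, text):
        if not self.ready:
            raise RuntimeError("fit() must be called before predict()")
        return self._label


clf = MajorityClassifier()
clf.fit([("hi", "greet"), ("hello", "greet"), ("bye", "exit")])
prediction = clf.predict("good morning")
```

The `ready` flag here only mimics the attribute of the same name documented below; how the real class maintains it is not shown on this page.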

ready

bool -- Whether the classifier is ready.

dirty

bool -- Whether the classifier has unsaved changes to its model.

config

ClassifierConfig -- The classifier configuration.

hash

str -- A hash representing the inputs into the model.

dump(model_path, incremental_model_path=None)[source]

Persists the trained classification model to disk.

Parameters:
  • model_path (str) -- The location on disk where the model should be stored.
  • incremental_model_path (str, optional) -- The timestamp folder where the cached models are stored.
evaluate(queries=None, label_set=None, fetch_distribution=False)[source]

Evaluates the trained classification model on the given test data.

Parameters:
  • queries (Optional(list(ProcessedQuery))) -- optional list of queries to evaluate
  • label_set (str) -- The label set to use for evaluation.
Returns:

A ModelEvaluation object that contains evaluation results

Return type:

ModelEvaluation

fit(queries=None, label_set=None, incremental_timestamp=None, load_cached=True, **kwargs)[source]

Trains a statistical model for classification using the provided training examples and model configuration.

Parameters:
  • queries (list(ProcessedQuery) or ProcessedQueryList, optional) -- A list of queries to train on. If not specified the queries will be loaded from the label_set.
  • label_set (str) -- A label set to load. If not specified, the default training set will be loaded.
  • incremental_timestamp (str, optional) -- The timestamp folder to cache models in
  • model_type (str, optional) -- The type of machine learning model to use. If omitted, the default model type will be used.
  • model_settings (dict) -- Settings specific to the model type specified
  • features (dict) -- Features to extract from each example instance to form the feature vector used for model training. If omitted, the default feature set for the model type will be used.
  • params (dict) -- Params to pass to the underlying classifier
  • params_selection (dict) -- The grid of hyper-parameters to search, for finding the optimal hyper-parameter settings for the model. If omitted, the default hyper-parameter search grid will be used.
  • param_selection (dict) -- Configuration for param selection (using cross-validation) {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': { 'C': [100, 10000, 1000000]}}
  • features (dict) -- The keys are the names of feature extractors and the values are either a kwargs dict, which will be passed into the feature extractor function, or a callable, which will be used to extract features.
  • load_cached (bool) -- If the model is cached on disk, load it into memory.
Returns:

True if model was loaded and fit, False if a valid cached model exists but was not loaded (controlled by the load_cached arg).
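To make the 'grid' entry of param_selection more concrete, the sketch below enumerates the candidate settings that a grid describes, assuming the usual grid-search meaning (every combination of the listed values is a candidate). The helper `expand_grid` is hypothetical and is not part of MindMeld; how the library interprets the 'type', 'n', and 'k' keys is not covered here.

```python
from itertools import product

# Same shape as the param_selection example in the parameter list above.
param_selection = {
    "type": "shuffle",
    "n": 3,
    "k": 10,
    "n_jobs": 2,
    "scoring": "",
    "grid": {"C": [100, 10000, 1000000]},
}


def expand_grid(grid):
    """Return every combination of hyper-parameter values in the grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_selection["grid"])
for candidate in candidates:
    print(candidate)
```

With a single hyper-parameter the grid yields one candidate per listed value; adding a second key multiplies the number of candidates, which is why passing fixed params can shorten training.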

Examples

Fit using the default configuration.

>>> clf.fit()

Fit using a 'special' label set.

>>> clf.fit(label_set='special')

Fit using the given params, bypassing cross-validation. This is useful for speeding up training times if you are confident the params are already optimized.

>>> clf.fit(params={'C': 10000000})

Fit using given parameter selection settings (also known as cross-validation settings).

>>> clf.fit(param_selection={})

Fit using a custom set of features, including a custom feature extractor. This is only for advanced users.

>>> clf.fit(features={
...     'in-gaz': {},  # gazetteer features
...     'contrived': lambda exa, res: {'contrived': len(exa.text) == 26}
... })
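The custom extractor in the last example can be tried in isolation. The sketch below assumes, based only on the lambda above, that an extractor is called with an example object exposing a `text` attribute plus a resources argument, and that it returns a dict of feature names to values. `contrived_extractor` and the `SimpleNamespace` stand-in are hypothetical.

```python
from types import SimpleNamespace


def contrived_extractor(example, resources):
    """Hypothetical extractor: flags inputs whose text is exactly 26 characters."""
    return {"contrived": len(example.text) == 26}


# Stand-in for the object passed as the first argument; only `.text` is assumed.
example = SimpleNamespace(text="abcdefghijklmnopqrstuvwxyz")
features = contrived_extractor(example, resources={})

# The same callable could then be supplied as a value in the features dict.
features_config = {"in-gaz": {}, "contrived": contrived_extractor}
```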
inspect(query, gold_label=None, dynamic_resource=None)[source]
load(model_path)[source]

Loads the trained classification model from disk.

Parameters:
  • model_path (str) -- The location on disk where the model is stored.
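The persist-then-restore cycle that dump and load provide can be sketched with the standard library. This is an illustration only: it uses pickle and a plain dict as a stand-in for a trained model, and makes no claim about the on-disk format MindMeld actually uses.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; the structure is hypothetical.
model = {"labels": ["greet", "exit"], "weights": [0.7, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Analogous to dump(model_path): write the model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Analogous to load(model_path): read it back into memory.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
```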
predict(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]

Predicts a class label for the given query using the trained classification model

Parameters:
  • query (Query or str) -- The input query
  • time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
  • timestamp (long, optional) -- A unix time stamp for the request (in seconds).
  • dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference
Returns:

The predicted class label

Return type:

str
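Because timestamp is expected in seconds rather than milliseconds, it can help to build it explicitly. The sketch below derives a Unix timestamp from a timezone-aware datetime and collects the optional arguments in a dict; `predict_kwargs` is a hypothetical name, and the final call is shown only as a comment because it needs a trained classifier.

```python
from datetime import datetime, timezone

# A fixed, timezone-aware moment: 2020-01-01 00:00:00 UTC.
request_time = datetime(2020, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# Unix time in whole seconds, as the timestamp parameter expects.
timestamp = int(request_time.timestamp())

predict_kwargs = {"time_zone": "America/Los_Angeles", "timestamp": timestamp}
# With a trained classifier `clf`, these could be passed as:
# clf.predict("some query text", **predict_kwargs)
```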

predict_proba(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]

Runs prediction on a given query and generates multiple hypotheses with their associated probabilities using the trained classification model

Parameters:
  • query (Query) -- The input query
  • time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
  • timestamp (long, optional) -- A unix time stamp for the request (in seconds).
  • dynamic_resource (dict, optional) -- A dynamic resource to aid NLP inference
Returns:

A list of tuples of the form (str, float), pairing each predicted class label with its probability.

Return type:

list
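A returned list of (label, probability) tuples can be post-processed with ordinary Python. The sketch below uses made-up values in the documented shape; the labels, probabilities, and the 0.1 threshold are all hypothetical, and no ordering of the returned list is assumed.

```python
# Hypothetical output in the documented (label, probability) form.
hypotheses = [("greet", 0.15), ("exit", 0.05), ("check_weather", 0.80)]

# Sort by probability, highest first, without assuming the input order.
ranked = sorted(hypotheses, key=lambda pair: pair[1], reverse=True)
top_label, top_prob = ranked[0]

# Keep only hypotheses at or above a confidence threshold (0.1 is arbitrary).
confident = [label for label, prob in ranked if prob >= 0.1]
```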

unload()[source]

Unloads the model from memory. This helps reduce memory requirements while training other models.

view_extracted_features(query, time_zone=None, timestamp=None, dynamic_resource=None)[source]

Extracts features for the given input based on the model config.

Parameters:
  • query (Query or str) -- The input query
  • time_zone (str, optional) -- The name of an IANA time zone, such as 'America/Los_Angeles' or 'Asia/Kolkata'. See the [tz database](https://www.iana.org/time-zones) for more information.
  • timestamp (long, optional) -- A unix time stamp for the request (in seconds).
  • dynamic_resource (dict) -- Dynamic gazetteer to be included for feature extraction.
Returns:

The extracted features from the given input

Return type:

dict

CLF_TYPE = None

Classifier type (str).

class mindmeld.components.classifier.ClassifierConfig(model_type=None, features=None, model_settings=None, params=None, param_selection=None)[source]

Bases: object

A value object representing a classifier configuration

model_type

str -- The name of the model type. Will be used to find the model class to instantiate.

model_settings

dict -- Settings specific to the model type specified.

params

dict -- Params to pass to the underlying classifier.

param_selection

dict -- Configuration for param selection (using cross validation). For example: {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {} }

features

dict -- The keys are the names of feature extractors and the values are either a kwargs dict, which will be passed into the feature extractor function, or a callable, which will be used to extract features.

classmethod from_model_config(model_config)[source]
to_dict()[source]

Converts the model config object into a dict.

Returns: A dict version of the config.
Return type: dict
to_json()[source]

Converts the model config object to JSON.

Returns: JSON representation of the classifier.
Return type: str
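The value-object pattern behind to_dict and to_json can be sketched as follows. `ToyConfig` is a hypothetical stand-in that reuses the documented attribute names; it is not the library's class, and the use of `sort_keys` is a choice made for this sketch.

```python
import json


class ToyConfig:
    """Hypothetical value object mirroring the documented attribute names."""

    __slots__ = ["model_type", "features", "model_settings", "params", "param_selection"]

    def __init__(self, model_type=None, features=None, model_settings=None,
                 params=None, param_selection=None):
        self.model_type = model_type
        self.features = features
        self.model_settings = model_settings
        self.params = params
        self.param_selection = param_selection

    def to_dict(self):
        # One entry per attribute, including those left as None.
        return {name: getattr(self, name) for name in self.__slots__}

    def to_json(self):
        # sort_keys gives the same string for configs with equal contents.
        return json.dumps(self.to_dict(), sort_keys=True)


config = ToyConfig(model_type="text", params={"C": 100})
as_dict = config.to_dict()
as_json = config.to_json()
```

In this sketch, a features dict containing a callable would make `json.dumps` raise a TypeError, since functions are not JSON-serializable.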