mindmeld.models package

class mindmeld.models.ModelFactory
Bases: object

Factory class that identifies the appropriate text or tagger model from text_models.py or tagger_models.py and loads it, based either on the configs passed in or on a previously dumped config file.

The .create_model_from_config() method loads the appropriate model when a ModelConfig is passed. The .create_model_from_path() method uses AbstractModel's load method to load a dumped config, which is then used to load the appropriate model and return it through a metadata dictionary object.

classmethod create_model_from_config(model_config: Union[dict, mindmeld.models.model.ModelConfig]) → Type[mindmeld.models.model.AbstractModel]
Instantiates and returns a valid model from the specified model config.

Parameters: model_config (Union[dict, ModelConfig]) -- the model config, passed either as a dict or as a ModelConfig instance
Returns: a text/tagger model instance
Return type: model (Type[AbstractModel])
Raises: ValueError -- when the config is invalid

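For instance, a text classification model might be instantiated from a config dict as in the sketch below; the model_type, model_settings, and feature names shown are illustrative assumptions and should be taken from your application's actual configuration:

    from mindmeld.models import ModelFactory

    # Illustrative config for a text classification model; the exact
    # model_settings and features depend on your application's config.
    config = {
        'model_type': 'text',
        'example_type': 'query',
        'label_type': 'class',
        'model_settings': {'classifier_type': 'logreg'},
        'features': {'bag-of-words': {'lengths': [1]}},
    }

    # Picks the matching model class from text_models.py/tagger_models.py
    # and returns a model instance; raises ValueError on an invalid config.
    model = ModelFactory.create_model_from_config(config)
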
classmethod create_model_from_path(path: str) → Union[None, Type[mindmeld.models.model.AbstractModel]]
Loads and returns a model from the specified path.

Parameters: path (str) -- a pickle file path from which a model can be loaded
Returns: None if the specified path is not found or if the model loaded from it is None. If a valid config and a valid model are found, the model is loaded by calling its .load() method and returned.
Return type: model (Union[None, Type[AbstractModel]])
Raises: ValueError -- when the path is invalid

class mindmeld.models.Embedder(app_path=None, cache_path=None, **kwargs)
Bases: abc.ABC

Base class for embedder models.

add_to_cache(mean_or_max_pooled_whitelist_embs)
Adds custom embeddings to the cache without triggering .encode(). For example, one can manually add max-pooled or mean-pooled embeddings to the cache. This method exists to store superficial text-encoding pairs (superficial because the encodings are not encodings of the text itself, but a combination of encodings of some list of texts from the same embedder model), such as adding superficial entity embeddings computed as the average of whitelist embeddings in entity resolution.

Parameters: mean_or_max_pooled_whitelist_embs (dict) -- texts mapped to their corresponding superficial embeddings, each a 1D numpy array with the same length as the embedder's emb_dim

encode(text_list)
Parameters: text_list (list) -- a list of text strings for which to generate the embeddings
Returns: a list of numpy arrays of the embeddings
Return type: list

find_similarity(src_texts: List[str], tgt_texts: List[str] = None, top_n: int = 20, scores_normalizer: str = None, similarity_function: Callable[[List[Any], List[Any]], numpy.ndarray] = None, _return_as_dict=False, _no_sort=False)
Computes the cosine similarity between the source texts and the target texts.

Parameters:
- src_texts (Union[str, list]) -- string or list of strings to obtain matching scores for
- tgt_texts (list, optional) -- list of strings to match against; if None, the existing cache is used as the target strings
- top_n (int, optional) -- maximum number of results to populate; if None, equals the length of tgt_texts
- scores_normalizer (str, optional) -- normalizer type used to normalize scores; allowed values are "min_max_scaler" and "standard_scaler"
- similarity_function (function, optional) -- if None, defaults to pytorch_cos_sim; if specified, must take two numpy-array or pytorch-tensor arguments for the similarity computation, with an optional argument to return the result as a numpy array or a tensor
- _return_as_dict (bool, optional) -- whether to return the results as a dictionary with target texts as keys and scores as values
- _no_sort (bool, optional) -- if True, results are returned without sorting; useful when you plan additional wrapper operations on top of the raw results and want to save the computational cost of sorting

Returns: if _return_as_dict is True, a dictionary of tgt_texts and their scores; otherwise a list of tuples, each pairing a src_text with its similarity scores against all tgt_texts as a numpy array, sorted in descending order

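As a sketch of how similarity lookup fits together, assuming an embedder created via create_embedder_model (documented further down this page) and an illustrative 'bert' embedder type:

    from mindmeld.models import create_embedder_model

    # Illustrative values: 'embedder_type' is the required key, 'bert' is
    # assumed to be an available embedder type in this installation, and
    # app_path should point at a MindMeld application directory.
    embedder = create_embedder_model(app_path='my_app',
                                     config={'embedder_type': 'bert'})

    # Populate the embedding cache with the candidate (target) texts.
    embedder.get_encodings(['share price', 'weather forecast', 'order pizza'])

    # Score a query against the cached targets; results come back sorted
    # by cosine similarity in descending order.
    results = embedder.find_similarity(['price of the stock'], top_n=2)
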
get_encodings(text_list, add_to_cache=True) → List[Any]
Fetches the encoded values from the cache, or generates them and adds them to the cache unless add_to_cache is set to False. This method wraps .encode() while maintaining an embedding cache.

Parameters:
- text_list (list) -- a list of text strings for which to fetch or generate embeddings
- add_to_cache (bool, optional) -- if False, newly generated embeddings are not added to the cache
Returns: a list of numpy arrays with the embeddings
Return type: list

static pytorch_cos_sim(src_vecs, tgt_vecs, return_tensor=False)
Computes the cosine similarity between 2D matrices.

Parameters:
- src_vecs -- a 2D numpy array or pytorch tensor
- tgt_vecs -- a 2D numpy array or pytorch tensor
- return_tensor -- if False, returns the cosine similarity as a 2D numpy array instead of a tensor; otherwise returns a 2D tensor

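A minimal sketch of this static helper on toy vectors; the pairwise layout of the output (one row per source vector, one column per target vector) is assumed:

    import numpy as np
    from mindmeld.models import Embedder

    # Toy 4-dimensional vectors: 2 source rows and 3 target rows.
    src_vecs = np.random.rand(2, 4)
    tgt_vecs = np.random.rand(3, 4)

    # With return_tensor=False (the default), the result is a 2D numpy
    # array of cosine similarities, assumed here to have shape (2, 3).
    sims = Embedder.pytorch_cos_sim(src_vecs, tgt_vecs)
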
model_id
Returns a unique hash representation of the embedder model, based on its name and configs.

class mindmeld.models.ModelConfig(model_type: str = None, example_type: str = None, label_type: str = None, features: Dict = None, model_settings: Dict = None, params: Dict = None, param_selection: Dict = None, train_label_set: Pattern[str] = None, test_label_set: Pattern[str] = None)
Bases: object

A value object representing a model configuration.

model_type
str -- The name of the model type. Will be used to find the model class to instantiate.

example_type
str -- The type of the examples which will be passed into fit() and predict(). Used to select the feature extractors.

label_type
str -- The type of the labels which will be passed into fit() and returned by predict(). Used to select the label encoder.

model_settings
dict -- Settings specific to the model type specified.

params
dict -- Params to pass to the underlying classifier.

param_selection
dict -- Configuration for parameter selection (using cross validation), e.g. {'type': 'shuffle', 'n': 3, 'k': 10, 'n_jobs': 2, 'scoring': '', 'grid': {}}.

features
dict -- The keys are the names of feature extractors and the values are either a kwargs dict, which will be passed into the feature extractor function, or a callable which will be used to extract features.

train_label_set
regex pattern -- The regex pattern for finding training file names.

test_label_set
regex pattern -- The regex pattern for finding testing file names.

get_ngram_lengths_and_thresholds(rname: str) → Tuple
Returns the n-gram lengths and thresholds to extract, to optimize resource collection.

Parameters: rname (str) -- name of the resource
Returns: a tuple containing:
- lengths (list of int): the n-gram lengths to be extracted
- thresholds (list of int): thresholds to be applied to the corresponding n-gram lengths
Return type: tuple

required_resources() → Set
Returns the resources this model requires.

Returns: the set of required resources for this model
Return type: set

resolve_config(new_config: mindmeld.models.model.ModelConfig)
Resolves any config incompatibility issues by loading the latest settings from the app config into the current config.

Parameters: new_config (ModelConfig) -- the ModelConfig representing the app's latest config

to_dict() → Dict
Converts the model config object into a dict.

Returns: a dict version of the config
Return type: dict

to_json() → str
Converts the model config object to JSON.

Returns: a JSON representation of the config
Return type: str

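A short sketch of constructing a config and serializing it; the field values below are illustrative and not specific to any particular application:

    from mindmeld.models import ModelConfig

    # Illustrative text-model configuration; values depend on your app.
    config = ModelConfig(
        model_type='text',
        example_type='query',
        label_type='class',
        model_settings={'classifier_type': 'logreg'},
        features={'bag-of-words': {'lengths': [1]}},
        params={'C': 10},
    )

    config_dict = config.to_dict()   # plain dict of the fields above
    config_json = config.to_json()   # the same fields as a JSON string
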
mindmeld.models.create_model(config)
Creates a model instance using the provided configuration.

Parameters: config (ModelConfig) -- a model configuration
Returns: a configured model
Return type: Model
Raises: ValueError -- when the model configuration is invalid

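For example, reusing the same illustrative config values as in the ModelConfig sketch above:

    from mindmeld.models import ModelConfig, create_model

    # Illustrative configuration; see the ModelConfig example above.
    config = ModelConfig(
        model_type='text',
        example_type='query',
        label_type='class',
        model_settings={'classifier_type': 'logreg'},
        features={'bag-of-words': {'lengths': [1]}},
    )

    model = create_model(config)  # raises ValueError if the config is invalid
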
mindmeld.models.load_model(path)
Loads a model from the specified path.

Parameters: path (str) -- a path where the model configuration is pickled along with other metadata
Returns: the metadata loaded from the path, which contains the configured model under the 'model' key and the model config under the 'model_config' key, along with other keys
Return type: dict
Raises: ValueError -- when the model configuration is invalid

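A sketch of consuming the returned metadata dictionary; the pickle path below is hypothetical and in practice is produced by your app's build step:

    from mindmeld.models import load_model

    # Hypothetical path to a pickled model produced by a MindMeld build.
    metadata = load_model('.generated/domain_model.pkl')

    model = metadata['model']                # the configured model
    model_config = metadata['model_config']  # its ModelConfig
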
mindmeld.models.create_embedder_model(app_path, config)
Creates and loads an embedder model.

Parameters: config (dict) -- model settings passed in as a dictionary, with 'embedder_type' being a required key
Returns: an instance of the appropriate embedder class
Return type: Embedder
Raises: ValueError -- when the model configuration is invalid or a required key is missing

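The sketch below ties this together with Embedder.add_to_cache (documented above) to store a superficial entity embedding; the 'bert' embedder type, the app path, and the texts are illustrative assumptions:

    import numpy as np
    from mindmeld.models import create_embedder_model

    # 'embedder_type' is the required key; 'bert' and the app path are
    # assumptions for illustration.
    embedder = create_embedder_model(app_path='my_app',
                                     config={'embedder_type': 'bert'})

    # Average the whitelist embeddings into one superficial entity
    # embedding and cache it without re-encoding the entity name itself.
    whitelist_embs = embedder.get_encodings(['big apple', 'new york city', 'nyc'])
    entity_emb = np.mean(np.stack(whitelist_embs), axis=0)
    embedder.add_to_cache({'New York': entity_emb})
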
Subpackages

Submodules
- mindmeld.models.containers module
- mindmeld.models.embedder_models module
- mindmeld.models.evaluation module
- mindmeld.models.helpers module
- mindmeld.models.labels module
- mindmeld.models.model module
- mindmeld.models.model_factory module
- mindmeld.models.tagger_models module
- mindmeld.models.text_models module