mindmeld.models.embedder_models module

This module contains the embedder model classes.

class mindmeld.models.embedder_models.BertEmbedder(app_path=None, cache_path=None, pretrained_name_or_abspath=None, **kwargs)

    Bases: mindmeld.models.embedder_models.Embedder

    Encoder class for BERT models, based on https://github.com/UKPLab/sentence-transformers.

    encode(phrases)

        Encodes the input text(s) into embeddings, one vector per phrase.

        Parameters:
            phrases (str or list[str]) -- textual inputs to be encoded using the sentence-transformers model

        Returns:
            By default, a numpy array is returned. If convert_to_tensor is set, a stacked tensor is returned; if convert_to_numpy, a numpy matrix.

        Return type:
            Union[List[Tensor], ndarray, Tensor]
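
        A minimal usage sketch. The app path and checkpoint name below are assumptions for illustration; any sentence-transformers checkpoint name or local path should work for pretrained_name_or_abspath:

            from mindmeld.models.embedder_models import BertEmbedder

            # Hypothetical app directory and an assumed sentence-transformers
            # checkpoint name; substitute your own values.
            embedder = BertEmbedder(
                app_path="my_app",
                pretrained_name_or_abspath="all-MiniLM-L6-v2",
            )

            # One embedding vector per input phrase; a numpy array by default.
            vectors = embedder.encode(["book a flight", "order a pizza"])
            print(vectors.shape)  # (2, emb_dim)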

    CACHE_MODELS = {}

    model_id

        Returns a unique hash representation of the embedder model based on its name and configs.

class mindmeld.models.embedder_models.Embedder(app_path=None, cache_path=None, **kwargs)

    Bases: abc.ABC

    Base class for embedder models.

    add_to_cache(mean_or_max_pooled_whitelist_embs)

        Adds custom embeddings to the cache without triggering .encode(). This supports storing synthetic text-to-encoding pairs, in which the stored vector is not the encoding of the text itself but a combination of encodings of other texts from the same embedder model. For example, in Entity Resolution an entity's embedding can be stored as the average of its whitelist embeddings, as sketched below.

        Parameters:
            mean_or_max_pooled_whitelist_embs (dict) -- texts and their corresponding synthetic embeddings, each a 1D numpy array whose length equals the embedder's emb_dim
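
        A sketch of the whitelist-averaging use case, reusing the embedder instance from the BertEmbedder example above (the entity name and whitelist are illustrative):

            import numpy as np

            # Encode the whitelist texts, then pool them into a single vector.
            whitelist = ["big apple", "new york city", "nyc"]
            whitelist_embs = np.array(embedder.get_encodings(whitelist))

            # Store the mean-pooled vector under the entity name, so a later
            # lookup of "New York" returns the pooled embedding rather than
            # an encoding of the literal string.
            embedder.add_to_cache({"New York": whitelist_embs.mean(axis=0)})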

    encode(text_list)

        Parameters:
            text_list (list) -- a list of text strings for which to generate the embeddings

        Returns:
            A list of numpy arrays of the embeddings.

        Return type:
            list
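
        encode() is the hook a concrete subclass implements (as BertEmbedder and GloveEmbedder do). A toy sketch of the contract, using deterministic pseudo-random vectors purely for illustration; depending on the MindMeld version, the base class may require additional hooks such as model loading:

            import numpy as np
            from mindmeld.models.embedder_models import Embedder

            class ToyEmbedder(Embedder):  # hypothetical subclass
                """Maps each text to a deterministic 16-dim vector."""

                def encode(self, text_list):
                    # One numpy array per input string, seeded by the text so
                    # repeated calls return identical vectors.
                    return [
                        np.random.default_rng(abs(hash(t)) % 2**32).standard_normal(16)
                        for t in text_list
                    ]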

    find_similarity(src_texts: List[str], tgt_texts: List[str] = None, top_n: int = 20, scores_normalizer: str = None, similarity_function: Callable[[List[Any], List[Any]], numpy.ndarray] = None, _return_as_dict=False, _no_sort=False)

        Computes the cosine similarity between source and target texts.

        Parameters:
            - src_texts (Union[str, list]) -- a string or list of strings to obtain matching scores for
            - tgt_texts (list, optional) -- the list of strings to match against; if None, the existing cache is used as the target strings
            - top_n (int, optional) -- the maximum number of results to populate; if None, equals the length of tgt_texts
            - scores_normalizer (str, optional) -- normalizer type used to normalize scores; allowed values are "min_max_scaler" and "standard_scaler"
            - similarity_function (function, optional) -- if None, defaults to pytorch_cos_sim; if specified, must take two numpy-array or pytorch-tensor arguments, with an optional argument controlling whether results are returned as numpy arrays or tensors
            - _return_as_dict (bool, optional) -- whether results should be returned as a dictionary with target texts as keys and scores as values
            - _no_sort (bool, optional) -- if True, results are returned without sorting; useful when wrapper operations are applied on top of the raw results and the cost of sorting can be saved

        Returns:
            If _return_as_dict, a dictionary of tgt_texts and their scores; otherwise a list of tuples, each pairing a src_text with its similarity scores against all tgt_texts as a numpy array, sorted in descending order.

        Return type:
            Union[dict, list]
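
        A usage sketch (the texts are illustrative; see the Returns note above for the result shape):

            results = embedder.find_similarity(
                src_texts=["new york"],
                tgt_texts=["new york city", "newark", "york"],
                top_n=2,
                scores_normalizer="min_max_scaler",
            )

            # Default form: one (src_text, scores) tuple per source string.
            src_text, scores = results[0]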

    get_encodings(text_list, add_to_cache=True) → List[Any]

        Fetches the encoded values from the cache, or generates them and adds them to the cache unless add_to_cache is set to False. This method wraps .encode() with an embedding cache.

        Parameters:
            - text_list (list) -- a list of text strings for which to fetch or generate embeddings
            - add_to_cache (bool, optional) -- if False, newly generated encodings are not added to the cache

        Returns:
            A list of numpy arrays with the embeddings.

        Return type:
            list
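
        A sketch of the caching behavior:

            # First call encodes both phrases and caches the results.
            first = embedder.get_encodings(["book a flight", "order a pizza"])

            # Second call is served from the cache; no re-encoding is needed,
            # and add_to_cache=False keeps any new texts out of the cache.
            second = embedder.get_encodings(["order a pizza"], add_to_cache=False)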

    static pytorch_cos_sim(src_vecs, tgt_vecs, return_tensor=False)

        Computes the cosine similarity for 2D matrices.

        Parameters:
            - src_vecs -- a 2D numpy array or pytorch tensor
            - tgt_vecs -- a 2D numpy array or pytorch tensor
            - return_tensor -- if False (the default), the cosine similarity is returned as a 2D numpy array instead of a tensor; otherwise a 2D tensor is returned
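
        For reference, a pure-numpy sketch of the pairwise cosine similarity this computes (a semantic illustration, not the method's implementation):

            import numpy as np

            def cos_sim_2d(src_vecs, tgt_vecs):
                """out[i, j] = cosine similarity of src_vecs[i] and tgt_vecs[j]."""
                src_norm = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
                tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
                return src_norm @ tgt_norm.T  # shape: (len(src_vecs), len(tgt_vecs))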

    model_id

        Returns a unique hash representation of the embedder model based on its name and configs.

class mindmeld.models.embedder_models.GloveEmbedder(app_path=None, cache_path=None, **kwargs)

    Bases: mindmeld.models.embedder_models.Embedder

    Encoder class for GloVe embeddings, as described at https://nlp.stanford.edu/projects/glove/.

    encode(text_list)

        Parameters:
            text_list (list) -- a list of text strings for which to generate the embeddings

        Returns:
            A list of numpy arrays of the embeddings.

        Return type:
            list
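
        A usage sketch (the app path is hypothetical, and additional kwargs may be needed to locate the GloVe files in your setup):

            from mindmeld.models.embedder_models import GloveEmbedder

            glove = GloveEmbedder(app_path="my_app")
            vecs = glove.encode(["weather in boston"])
            print(vecs[0].shape)  # (300,) by default, per DEFAULT_EMBEDDING_DIM below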

    DEFAULT_EMBEDDING_DIM = 300

    model_id

        Returns a unique hash representation of the embedder model based on its name and configs.