mindmeld.gazetteer module¶
-
class
mindmeld.gazetteer.
Gazetteer
(name, text_preparation_pipeline, exclude_ngrams=False)[source]¶ Bases:
object
This class holds the following fields, which are extracted and exported to file.
-
entity_count
¶ int -- Total entities in the file
-
pop_dict
¶ dict -- A dictionary with the entity name as the key and the popularity score as the value. If more than one entity shares the same name, the popularity is the maximum value across all duplicate entities.
-
index
¶ dict -- A dictionary containing the inverted index, which maps terms and n-grams to the set of documents which contain them
-
entities
¶ list -- A list of all entities
-
sys_types
¶ set -- The set of nested numeric types for this entity
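The pop_dict and index fields described above can be illustrated with a minimal plain-Python sketch. It does not use MindMeld and is not the library's implementation; the sample records, the whitespace tokenization, and the choice to index every contiguous n-gram are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (entity name, popularity score).
records = [("pizza", 0.4), ("pizza", 0.9), ("pasta salad", 0.6)]

pop_dict = {}
index = defaultdict(set)

for name, popularity in records:
    # Duplicate names keep the maximum popularity seen so far.
    pop_dict[name] = max(pop_dict.get(name, popularity), popularity)

    # Assumed tokenization: split on whitespace.
    tokens = name.split()

    # Map every contiguous n-gram of the name to the entities containing it.
    for n in range(1, len(tokens) + 1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            index[ngram].add(name)
```

Here the two "pizza" records collapse into a single pop_dict entry, and looking up a term such as "salad" in the index returns the set of entity names that contain it.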
-
dump
(gaz_path)[source]¶ Persists the gazetteer to disk.
Parameters: gaz_path (str) -- The location on disk where the gazetteer should be stored
-
from_dict
(serialized_gaz)[source]¶ Deserializes a gazetteer object from a dictionary using deep-copy operations.
Parameters: serialized_gaz (dict) -- The serialized gazetteer object
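To show why deep copying matters when restoring state from a dictionary, here is a small sketch using only the standard library. The dictionary keys are hypothetical and the code does not call MindMeld; it only demonstrates that a deep copy keeps the restored data independent of the serialized input.

```python
import copy

# Hypothetical serialized form; the real keys may differ.
serialized_gaz = {"pop_dict": {"pizza": 0.9}, "entities": ["pizza"]}

# A deep copy duplicates nested containers, so later edits to the copy
# do not leak back into the serialized dictionary.
restored = copy.deepcopy(serialized_gaz)
restored["entities"].append("pasta")
restored["pop_dict"]["pasta"] = 0.5
```

After these edits, serialized_gaz still holds only the original "pizza" entry, whereas a shallow copy would have shared the nested list and dict with the source.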
-
load
(gaz_path)[source]¶ Loads the gazetteer from disk.
Parameters: gaz_path (str) -- The location on disk where the gazetteer is stored
-
update_with_entity_data_file
(filename, popularity_cutoff, normalizer)[source]¶ Updates this gazetteer with data from an entity data file.
Parameters:
-
update_with_entity_map
(mapping, normalizer, update_if_missing_canonical=True)[source]¶ Updates the gazetteer with normalized key-value pairs from the input mapping list.
Parameters: - mapping (list) -- A list of dicts containing the canonical names and whitelists of a particular entity
- normalizer (func) -- A QueryFactory normalization function that is used to normalize the input mapping data before it is added to the gazetteer.
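The following sketch shows one possible reading of "normalized key-value pairs" built from a mapping list. It is not MindMeld's implementation: the "cname" and "whitelist" keys, the stand-in normalizer, and the pairing of each surface form with its canonical name are all assumptions for illustration.

```python
def normalizer(text):
    # Hypothetical stand-in for a QueryFactory normalization function:
    # lowercase the text and collapse runs of whitespace.
    return " ".join(text.lower().split())


# Assumed mapping shape: one dict per entity, holding a canonical name
# and a whitelist of synonyms.
mapping = [
    {"cname": "Margherita Pizza", "whitelist": ["Margherita", "plain  pizza"]},
]

pairs = []
for item in mapping:
    canonical = normalizer(item["cname"])
    # The canonical name maps to itself.
    pairs.append((canonical, canonical))
    # Each whitelist synonym maps to the canonical name.
    for synonym in item["whitelist"]:
        pairs.append((normalizer(synonym), canonical))
```

Normalizing both sides before pairing means lookups are insensitive to casing and stray whitespace in the source data.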
-
-
class
mindmeld.gazetteer.
NestedGazetteer
(start_token_index, end_token_index_plus_one, gaz_name, token_ngram, raw_ngram)[source]¶ Bases:
object
This class represents a gazetteer entry corresponding to a Query object
-
end_token_index_plus_one
¶
-
gaz_name
¶
-
raw_ngram
¶
-
start_token_index
¶
-
token_ngram
¶
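The attribute name end_token_index_plus_one suggests a half-open token span, where the end index is exclusive. The sketch below illustrates that convention with a stand-in namedtuple rather than the real class; the field types and sample values are assumptions.

```python
from collections import namedtuple

# Stand-in with the same field names as NestedGazetteer; not the real class.
Span = namedtuple(
    "Span",
    [
        "start_token_index",
        "end_token_index_plus_one",
        "gaz_name",
        "token_ngram",
        "raw_ngram",
    ],
)

tokens = ["order", "a", "large", "pizza"]

# Hypothetical entry covering tokens 2 and 3 ("large pizza").
span = Span(2, 4, "dish", ("large", "pizza"), "large pizza")

# With an exclusive end index, the span maps directly onto a Python slice.
covered = tokens[span.start_token_index:span.end_token_index_plus_one]
span_length = span.end_token_index_plus_one - span.start_token_index
```

Storing the end index plus one keeps the span length a simple subtraction and avoids off-by-one adjustments when slicing.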
-