Working with the Auto Annotator¶
The Auto Annotator
- is a tool to automatically annotate or unannotate select entities across all labelled data in an application.
- supports the development of custom Annotators.
Note
The examples in this section require the HR Assistant blueprint application. To get the app, open a terminal and run mindmeld blueprint hr_assistant.
Examples related to the MultiLingualAnnotator require the Health Screening blueprint application. To get the app, open a terminal and run mindmeld blueprint screening_app.
Warning
Changes made by an Auto Annotator cannot be undone, and MindMeld does not back up query data. We recommend using version control software such as Git.
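Where version control is not yet set up, a plain copy of the labelled data can serve as a stop-gap. The sketch below is not a MindMeld feature. It assumes the labelled queries live in a domains folder inside the app directory, and backup_domains is a hypothetical helper name.

```python
import shutil
import time
from pathlib import Path


def backup_domains(app_path, backup_root="annotation_backups"):
    """Copy <app_path>/domains to a timestamped folder and return its path."""
    source = Path(app_path) / "domains"
    if not source.is_dir():
        raise FileNotFoundError(f"No domains folder found at {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{Path(app_path).name}-{stamp}"
    # copytree creates the target folder (and any missing parents).
    shutil.copytree(source, target)
    return target
```

Calling backup_domains("hr_assistant") before annotate or unannotate leaves a copy that can be restored by hand if the result is not what you expected.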
Quick Start¶
This section briefly explains the use of the annotate and unannotate commands. For more details, please read the next section.
Before annotating the data, we will first remove all existing annotations using unannotate. Be sure to include the --unannotate_all flag when running the following command in the command-line.
Command-line:
mindmeld unannotate --app-path "hr_assistant" --unannotate_all
We can now proceed to annotate our data using the command below.
Command-line:
mindmeld annotate --app-path "hr_assistant"
The following section explains this same process in more detail.
Using the Auto Annotator¶
The Auto Annotator can be used by importing a class that implements the Annotator abstract class in the auto_annotator module or through the command-line.
We will demonstrate both approaches for annotation and unannotation using the MultiLingualAnnotator class.
Annotate¶
By default, all entity types supported by an Annotator will be annotated if they do not overlap with existing annotations.
You can annotate using the command-line.
To overwrite existing annotations that overlap with new annotations, pass in the optional param --overwrite.
mindmeld annotate --app-path "hr_assistant" --overwrite
Alternatively, you can annotate by creating an instance of the Annotator
class and running the Python code below.
An optional param overwrite
can be passed in here as well.
from mindmeld.auto_annotator import MultiLingualAnnotator

annotation_rules = [
    {
        "domains": ".*",
        "intents": ".*",
        "files": ".*",
        "entities": ".*",
    }
]
mla = MultiLingualAnnotator(
    app_path="hr_assistant",
    annotation_rules=annotation_rules,
    overwrite=True,
)
mla.annotate()
If you do not want to annotate all supported entities, you can specify annotation rules instead.
For example, let's annotate sys_person
entities from the get_hierarchy_up
intent in the hierarchy
domain.
To do this, we can add the following AUTO_ANNOTATOR_CONFIG
dictionary to config.py
.
Notice that we are setting overwrite
to True since we want to replace the existing custom entity label, name
.
AUTO_ANNOTATOR_CONFIG = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": True,
    "annotation_rules": [
        {
            "domains": "hierarchy",
            "intents": "get_hierarchy_up",
            "files": "train.txt",
            "entities": "sys_person",
        }
    ],
    "unannotate_supported_entities_only": True,
    "unannotation_rules": None,
}
Before running the annotation, let's take a look at the first four queries in the train.txt file for the get_hierarchy_up
intent:
I wanna get a list of all of the employees that are currently {manage|manager} {caroline|name}
I wanna know {Tayana Jeannite|name}'s person in {leadership|manager} of her?
is it correct to say that {Angela|name} is a {boss|manager}?
who all is {management|manager} of {tayana|name}
After running annotate, we find that instances of sys_person have been labelled and have overwritten previous instances of the custom entity, name.
I wanna get a list of all of the employees that are currently {manage|manager} {caroline|sys_person}
I wanna know {Tayana Jeannite|sys_person}'s person in {leadership|manager} of her?
is it correct to say that {Angela|sys_person} is a {boss|manager}?
who all is {management|manager} of {tayana|sys_person}
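To confirm what changed, it can help to list the entity labels in a query before and after running annotate. The sketch below is a standalone helper, not part of MindMeld. It assumes the flat {text|entity_type} markup shown in the examples (no nested entities or roles), and entity_labels and count_entity_types are hypothetical names.

```python
import re
from collections import Counter

# Matches flat {text|entity_type} annotations as shown in the examples.
ANNOTATION = re.compile(r"\{([^{}|]+)\|([^{}|]+)\}")


def entity_labels(query):
    """Return (text, entity_type) pairs found in one labelled query."""
    return ANNOTATION.findall(query)


def count_entity_types(queries):
    """Count how often each entity type appears across the given queries."""
    return Counter(etype for q in queries for _, etype in entity_labels(q))


before = "who all is {management|manager} of {tayana|name}"
after = "who all is {management|manager} of {tayana|sys_person}"
print(entity_labels(before))
print(entity_labels(after))
```

Running count_entity_types over every line of a train.txt file before and after annotation gives a quick summary of which labels were added or replaced.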
You can annotate with multiple annotation rules. For more details on annotation rules please read the "Auto Annotator Configuration" section below.
Unannotate¶
By default, only the entities that are supported by an Annotator will be unannotated.
You can unannotate
using the command-line. To unannotate all entities, pass in the optional param --unannotate_all
.
mindmeld unannotate --app-path "hr_assistant" --unannotate_all
To unannotate by creating an instance of the Annotator
class, run the Python code below.
To unannotate all annotations, use the unannotation_rules shown below and set unannotate_supported_entities_only to False.
from mindmeld.auto_annotator import MultiLingualAnnotator

unannotation_rules = [
    {
        "domains": ".*",
        "intents": ".*",
        "files": ".*",
        "entities": ".*",
    }
]
mla = MultiLingualAnnotator(
    app_path="hr_assistant",
    unannotation_rules=unannotation_rules,
    unannotate_supported_entities_only=False,
)
mla.unannotate()
If you see the following message, you need to update the unannotation rules in your custom AUTO_ANNOTATOR_CONFIG dictionary in config.py.
You can refer to the config specifications in the "Auto Annotator Configuration" section below.
'unannotate' field is not configured or misconfigured in the `config.py`. We can't find any file to unannotate.
If you do not want to unannotate all entities, you can specify the rules to be used for unannotation in your config.
For example, let's unannotate sys_time
entities from the get_date_range_aggregate
intent in the date
domain.
To do this, we can add the following AUTO_ANNOTATOR_CONFIG
dictionary to config.py
.
AUTO_ANNOTATOR_CONFIG = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": False,
    "annotation_rules": [{"domains": ".*", "intents": ".*", "files": ".*", "entities": ".*"}],
    "unannotate_supported_entities_only": True,
    "unannotation_rules": [
        {
            "domains": "date",
            "intents": "get_date_range_aggregate",
            "files": "train.txt",
            "entities": "sys_time",
        }
    ],
}
Note
The annotation rules in the config have no effect on unannotation. Similarly, the unannotation rules in the config have no effect on annotation. These processes are independent and are only affected by the corresponding parameter in the config.
Before running the unannotation, let's take a look at the first four queries in the train.txt file for the get_date_range_aggregate
intent:
{sum|function} of {non-citizen|citizendesc} people {hired|employment_action} {after|date_compare} {2005|sys_time}
What {percentage|function} of employees were {born|dob} {before|date_compare} {1992|sys_time}?
{us citizen|citizendesc} people with {birthday|dob} {before|date_compare} {1996|sys_time} {count|function}
{count|function} of {eligible non citizen|citizendesc} workers {born|dob} {before|date_compare} {1994|sys_time}
After running unannotate
we find that instances of sys_time
have been unannotated as expected.
{sum|function} of {non-citizen|citizendesc} people {hired|employment_action} {after|date_compare} 2005
What {percentage|function} of employees were {born|dob} {before|date_compare} 1992?
{us citizen|citizendesc} people with {birthday|dob} {before|date_compare} 1996 {count|function}
{count|function} of {eligible non citizen|citizendesc} workers {born|dob} {before|date_compare} 1994
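The text transformation behind this result can be pictured with a small regular-expression sketch. This is only an illustration of the effect on a single query, not MindMeld's implementation. It assumes flat {text|entity_type} markup, and strip_entity is a hypothetical helper name.

```python
import re


def strip_entity(query, entity_type):
    """Replace {text|entity_type} with text, leaving other annotations intact."""
    pattern = re.compile(r"\{([^{}|]+)\|" + re.escape(entity_type) + r"\}")
    return pattern.sub(r"\1", query)


labelled = (
    "{sum|function} of {non-citizen|citizendesc} people "
    "{hired|employment_action} {after|date_compare} {2005|sys_time}"
)
print(strip_entity(labelled, "sys_time"))
```

Only the sys_time braces and label are removed; the custom entities in the same query are left untouched, which mirrors the behaviour shown above.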
Default Auto Annotator: MultiLingual Annotator¶
The mindmeld.auto_annotator module contains an abstract Annotator class.
This class serves as a base class for any MindMeld Annotator, including the MultiLingualAnnotator class.
The MultiLingualAnnotator leverages Spacy's Named Entity Recognition system and Duckling to detect entities.
Supported Entities and Languages¶
Up to 21 entities are supported across 15 languages. The table below defines these entities and whether they are resolvable by duckling.
Supported Entities | Resolvable by Duckling | Examples or Definition |
---|---|---|
"sys_time" | Yes | "today", "Tuesday, Feb 18" , "last week" |
"sys_interval" | Yes | "from 9:30 to 11:00am", "Monday to Friday", "Tuesday 3pm to Wednesday 7pm" |
"sys_duration" | Yes | "2 hours", "15 minutes", "3 days" |
"sys_number" | Yes | "58", "two hundred", "1,394,345.45" |
"sys_amount-of-money" | Yes | "ten dollars", "seventy-eight euros", "$58.67" |
"sys_distance" | Yes | "500 meters", "498 miles", "47.5 inches" |
"sys_weight" | Yes | "400 pound", "3 grams", "47.5 mg" |
"sys_ordinal" | Yes | "3rd place" ("3rd"), "fourth street" ("fourth"), "5th" |
"sys_percent" | Yes | "four percent", "12%", "5 percent" |
"sys_org" | No | "Cisco", "IBM", "Google" |
"sys_loc" | No | "Europe", "Asia", "the Alps", "Pacific ocean" |
"sys_person" | No | "Blake Smith", "Julia", "Andy Neff" |
"sys_gpe" | No | "California", "FL", "New York City", "USA" |
"sys_norp" | No | Nationalities or religious or political groups. |
"sys_fac" | No | Buildings, airports, highways, bridges, etc. |
"sys_product" | No | Objects, vehicles, foods, etc. (Not services.) |
"sys_event" | No | Named hurricanes, battles, wars, sports events, etc. |
"sys_law" | No | Named documents made into laws. |
"sys_language" | No | Any named language. |
"sys_work-of-art" | No | Titles of books, songs, etc. |
"sys_other-quantity" | No | "10 joules", "30 liters", "15 tons" |
Supported languages include English (en), Spanish (es), French (fr), German (de), Danish (da), Greek (el), Portuguese (pt), Lithuanian (lt), Norwegian Bokmal (nb), Romanian (ro), Polish (pl), Italian (it), Japanese (ja), Chinese (zh), Dutch (nl). The table below identifies the supported entities for each language.
Supported Entities | EN | ES | FR | DE | DA | EL | PT | LT | NB | RO | PL | IT | JA | ZH | NL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sys_amount-of-money | y | y | y | n | n | n | y | n | y | y | n | n | y | y | y |
sys_distance | y | y | y | y | n | n | y | n | n | y | n | y | n | y | y |
sys_duration | y | y | y | y | y | y | y | y | y | y | y | y | y | y | y |
sys_event | y | n | n | n | n | y | n | n | n | y | n | n | y | y | y |
sys_fac | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
sys_gpe | y | n | n | n | n | y | n | y | n | y | y | n | y | y | y |
sys_interval | y | y | y | y | y | y | y | n | y | y | y | y | n | y | y |
sys_language | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
sys_law | y | n | n | n | n | n | n | n | n | n | n | n | y | y | y |
sys_loc | y | y | y | y | y | y | y | y | y | y | n | y | y | y | y |
sys_norp | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
sys_number | y | y | y | y | y | y | y | n | y | y | y | y | y | y | y |
sys_ordinal | y | y | y | y | y | y | y | n | y | y | y | y | y | y | y |
sys_org | y | y | y | y | y | y | y | y | y | y | y | y | y | y | y |
sys_other-quantity | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
sys_percent | y | n | n | n | n | n | n | n | n | n | n | n | y | y | y |
sys_person | y | y | y | y | y | y | y | y | y | y | y | y | y | y | y |
sys_product | y | n | n | n | n | y | n | n | n | y | n | n | y | y | y |
sys_time | y | y | y | y | y | y | y | y | y | y | y | y | n | y | y |
sys_weight | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
sys_work_of_art | y | n | n | n | n | n | n | n | n | y | n | n | y | y | y |
Working with English Sentences¶
To detect entities in a single sentence first create an instance of the MultiLingualAnnotator
class.
If a language is not specified in LANGUAGE_CONFIG
(config.py
) then by default English will be used.
from mindmeld.auto_annotator import MultiLingualAnnotator
mla = MultiLingualAnnotator(app_path="hr_assistant")
Then use the parse()
function.
mla.parse("Apple stock went up $10 last monday.")
Three entities are automatically recognized and a list of QueryEntity objects is returned. Each QueryEntity represents a detected entity:
[
<QueryEntity 'Apple' ('sys_org') char: [0-4], tok: [0-0]>,
<QueryEntity '$10' ('sys_amount-of-money') char: [20-22], tok: [4-4]>,
<QueryEntity 'last monday' ('sys_time') char: [24-34], tok: [5-6]>
]
The Auto Annotator detected "Apple" as sys_org. Moreover, it recognized "$10" as sys_amount-of-money and resolved its value as 10 and unit as "$".
Lastly, it recognized "last monday" as sys_time and resolved its value to be a timestamp representing the last Monday before the current date.
To restrict the types of entities returned from the parse()
method use the entity_types
parameter and pass in a list of entities to restrict parsing to. By default, all entities are allowed.
For example, we can restrict the output of the previous example by doing the following:
allowed_entities = ["sys_org", "sys_amount-of-money", "sys_time"]
sentence = "Apple stock went up $10 last monday."
mla.parse(sentence=sentence, entity_types=allowed_entities)
Working with Non-English Sentences¶
The MultiLingualAnnotator will use the language and locale specified in the LANGUAGE_CONFIG (config.py) if it is used through the command-line.
LANGUAGE_CONFIG = {'language': 'es'}
Many Spacy non-English NER models have limited entity support. To overcome this, in addition to the entities detected by non-English NER models, the MultiLingualAnnotator translates the sentence to English and detects entities using the English NER model. The entities detected in English are compared against Duckling candidates for the non-English sentence. Duckling candidates whose entity type and value, or translated body text, match an English-detected entity are selected. If a translation service is not available, the MultiLingualAnnotator selects the Duckling candidates with the largest non-overlapping spans. The sections below describe the steps to set up the annotator depending on whether a translation service is being used.
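The fallback of picking the largest non-overlapping spans can be illustrated with a greedy selection. This is one plausible reading of that rule, not necessarily how MindMeld implements it. The candidate dictionaries and the function name are hypothetical, and the offsets are inclusive to match the char ranges printed by parse().

```python
def select_largest_non_overlapping(candidates):
    """Greedy selection: prefer longer spans, skip any that overlap a chosen one.

    Each candidate is a dict with inclusive "start" and "end" character
    offsets (a structure invented for this sketch).
    """
    chosen = []
    # Longest spans first; ties are broken by earliest start.
    by_length = sorted(
        candidates,
        key=lambda c: (-(c["end"] - c["start"]), c["start"]),
    )
    for cand in by_length:
        overlaps = any(
            cand["start"] <= kept["end"] and kept["start"] <= cand["end"]
            for kept in chosen
        )
        if not overlaps:
            chosen.append(cand)
    # Return the survivors in reading order.
    return sorted(chosen, key=lambda c: c["start"])


candidates = [
    {"text": "10", "type": "sys_number", "start": 32, "end": 33},
    {"text": "$10", "type": "sys_amount-of-money", "start": 31, "end": 33},
    {"text": "el lunes pasado", "type": "sys_time", "start": 35, "end": 49},
    {"text": "lunes", "type": "sys_time", "start": 38, "end": 42},
]
selected = select_largest_non_overlapping(candidates)
print([c["text"] for c in selected])
```

In this made-up candidate list, "10" and "lunes" are dropped because each lies inside a longer candidate.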
Annotating with a Translation Service (Google)¶
The MultiLingualAnnotator
can leverage the Google Translation API to better detect entities in non-English sentences. To use this feature, export your Google application credentials.
export GOOGLE_APPLICATION_CREDENTIALS="/<YOUR_PATH>/google_application_credentials.json"
Install the extras requirements for annotators.
pip install mindmeld[language_annotator]
Finally, specify the translator in AUTO_ANNOTATOR_CONFIG
. Set translator
to GoogleTranslator
.
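As a rough illustration, a config.py for a Spanish app using the translation service might look like the sketch below. It only uses keys described in the "Auto Annotator Configuration" section; the rule values are the defaults and should be adapted to your app.

```python
# config.py (sketch)
LANGUAGE_CONFIG = {"language": "es"}

AUTO_ANNOTATOR_CONFIG = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": False,
    "annotation_rules": [
        {"domains": ".*", "intents": ".*", "files": ".*", "entities": ".*"}
    ],
    "unannotate_supported_entities_only": True,
    "unannotation_rules": None,
    # Use "NoOpTranslator" instead when no translation service is available.
    "translator": "GoogleTranslator",
}
```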
Annotating without a Translation Service¶
We can still use the MultiLingualAnnotator
without a translation service. To do so, set translator
to NoOpTranslator
in AUTO_ANNOTATOR_CONFIG
.
Spanish Sentence Example¶
Let's take a look at an example of the MultiLingualAnnotator
detecting entities in Spanish sentences.
To use a Spanish MindMeld application we can download the Screening App
blueprint with the following command:
mindmeld blueprint screening_app
We can now create our MultiLingualAnnotator object and pass in the app_path. If a Spanish Spacy model is not found in the environment, it will automatically be downloaded.
from mindmeld.auto_annotator import MultiLingualAnnotator
mla = MultiLingualAnnotator(
app_path="screening_app",
language="es",
locale=None,
)
Then use the parse()
function.
mla.parse("Las acciones de Apple subieron $10 el lunes pasado.")
Three entities are automatically recognized.
[
<QueryEntity 'Apple' ('sys_org') char: [16-20], tok: [3-3]>,
<QueryEntity 'el lunes pasado' ('sys_time') char: [35-49], tok: [6-8]>,
<QueryEntity '$10' ('sys_amount-of-money') char: [31-33], tok: [5-5]>
]
Auto Annotator Configuration¶
The DEFAULT_AUTO_ANNOTATOR_CONFIG
shown below is the default config for an Annotator.
A custom config can be included in config.py
by duplicating the default config and renaming it to AUTO_ANNOTATOR_CONFIG
.
Alternatively, a custom config dictionary can be passed in directly to MultiLingualAnnotator
or any Annotator class upon instantiation.
DEFAULT_AUTO_ANNOTATOR_CONFIG = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": False,
    "annotation_rules": [
        {
            "domains": ".*",
            "intents": ".*",
            "files": ".*",
            "entities": ".*",
        }
    ],
    "unannotate_supported_entities_only": True,
    "unannotation_rules": None,
}
Let's take a look at the allowed values for each setting in an Auto Annotator configuration.
'annotator_class'
(str
): The class in auto_annotator.py to use for annotation when invoked from the command line. By default, MultiLingualAnnotator
is used.
'overwrite'
(bool
): Whether new annotations should overwrite existing annotations in the case of a span conflict. False by default.
'annotation_rules'
(list
): A list of annotation rules where each rule is represented as a dictionary. Each rule must have four keys: domains
, intents
, files
, and entities
.
Annotation rules are combined internally to create Regex patterns to match selected files. The pattern '.*' can be used if all possibilities in a section are to be selected, while possibilities within a section are expressed with the usual Regex special characters, such as '.' for any single character and '|' to represent "or".
{
"domains": "(faq|salary)",
"intents": ".*",
"files": "(train.txt|test.txt)",
"entities": "(sys_amount-of-money|sys_time)",
}
The rule above would annotate all text files named "train" or "test" in the "faq" and "salary" domains. Only sys_amount-of-money and sys_time entities would be annotated. Internally, the above rule is combined to a single pattern: "(faq|salary)/.*/(train.txt|test.txt)" and this pattern is matched against all file paths in the domain folder of your MindMeld application.
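The combination step described above can be reproduced with the standard re module. This is a sketch of the idea rather than MindMeld's code. It assumes paths are expressed relative to the domains folder and that the whole path must match; the intent names in the sample paths are made up.

```python
import re


def rule_to_file_pattern(rule):
    """Join the domain, intent and file parts of a rule into one regex."""
    return "/".join([rule["domains"], rule["intents"], rule["files"]])


rule = {
    "domains": "(faq|salary)",
    "intents": ".*",
    "files": "(train.txt|test.txt)",
    "entities": "(sys_amount-of-money|sys_time)",
}
pattern = re.compile(rule_to_file_pattern(rule))

# Hypothetical file paths, relative to the domains folder.
paths = [
    "faq/generic/train.txt",
    "salary/get_salary/test.txt",
    "date/get_date/train.txt",
]
matched = [p for p in paths if pattern.fullmatch(p)]
print(matched)
```

Testing a rule this way against a listing of your own files is a cheap check before running annotate on data that cannot be restored.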
Warning
The order of the annotation rules matters. Each rule overwrites the list of entities to annotate for a file if the two rules include the same file. It is good practice to start with more generic rules first and then have more specific rules.
Be sure to use the regex "or" (|
) if applying rules at the same level of specificity. Otherwise, if written as separate rules, the latter will overwrite the former.
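The ordering behaviour in this warning can be pictured as a dictionary that later rules overwrite. The sketch below is an illustration under the same assumptions as the rule description (rules become domain/intent/file patterns), not MindMeld's implementation; entities_per_file and the sample paths are hypothetical.

```python
import re


def entities_per_file(rules, paths):
    """Map each path to the entity pattern of the last rule that matches it."""
    result = {}
    for rule in rules:  # later rules replace earlier ones for the same file
        file_pattern = re.compile(
            "/".join([rule["domains"], rule["intents"], rule["files"]])
        )
        for path in paths:
            if file_pattern.fullmatch(path):
                result[path] = rule["entities"]
    return result


rules = [
    # Generic rule first ...
    {"domains": ".*", "intents": ".*", "files": ".*", "entities": ".*"},
    # ... then a more specific rule that narrows one domain.
    {"domains": "salary", "intents": ".*", "files": "train.txt", "entities": "sys_time"},
]
paths = ["salary/get_salary/train.txt", "faq/generic/train.txt"]
resolved = entities_per_file(rules, paths)
print(resolved)
```

Reversing the two rules would leave every file with the generic ".*" entity pattern, which is why the generic rule goes first.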
Warning
By default, all files in all intents across all domains will be annotated with all supported entities. Before annotating consider including custom annotation rules in config.py
.
'language' (str): Language as specified using an ISO 639-1/2 code.
'locale' (str): The locale representing the ISO 639-1 language code and ISO 3166 alpha-2 country code separated by an underscore character.
'unannotate_supported_entities_only'
(boolean
): By default, when the unannotate command is used only entities that the Annotator can annotate will be eligible for removal.
'unannotation_rules' (list): List of annotation rules in the same format as those used for annotation. These rules specify which entities should have their annotations removed. By default, unannotation_rules is None.
'spacy_model_size'
(str
): lg
is used by default for the best performance. Alternative options are sm
and md
. This parameter is optional and is specific to the use of the SpacyAnnotator
and MultiLingualAnnotator
.
If the selected model is not in the current environment it will automatically be downloaded. Refer to Spacy's documentation to learn more about their NER models.
'translator' (str): This parameter is used by the MultiLingualAnnotator. If Google application credentials are available and have been exported, set this parameter to GoogleTranslator. Otherwise, set this parameter to NoOpTranslator.
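Because annotation cannot be undone, a quick sanity check of a custom config against the settings listed above may be worthwhile. The helper below is a hypothetical sketch, not part of MindMeld; it only checks the value types and options named in this section, and MindMeld itself may accept or reject other values.

```python
RULE_KEYS = {"domains", "intents", "files", "entities"}


def check_auto_annotator_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not isinstance(config.get("overwrite", False), bool):
        problems.append("'overwrite' must be a bool")
    for key in ("annotation_rules", "unannotation_rules"):
        rules = config.get(key)
        if rules is None:
            continue
        if not isinstance(rules, list):
            problems.append(f"'{key}' must be a list or None")
            continue
        for i, rule in enumerate(rules):
            if not isinstance(rule, dict):
                problems.append(f"{key}[{i}] must be a dict")
                continue
            missing = RULE_KEYS - set(rule)
            if missing:
                problems.append(f"{key}[{i}] is missing {sorted(missing)}")
    if config.get("spacy_model_size", "lg") not in {"sm", "md", "lg"}:
        problems.append("'spacy_model_size' must be 'sm', 'md' or 'lg'")
    if config.get("translator", "NoOpTranslator") not in {
        "GoogleTranslator",
        "NoOpTranslator",
    }:
        problems.append("'translator' must be 'GoogleTranslator' or 'NoOpTranslator'")
    return problems


config = {
    "annotator_class": "MultiLingualAnnotator",
    "overwrite": False,
    "annotation_rules": [{"domains": ".*", "intents": ".*", "entities": ".*"}],
    "unannotation_rules": None,
}
print(check_auto_annotator_config(config))
```

Here the single annotation rule lacks a files key, so the check reports it instead of letting the mistake surface during annotation.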
Using the Bootstrap Annotator¶
The BootstrapAnnotator speeds up the annotation of new queries. When a BootstrapAnnotator is instantiated, a NaturalLanguageProcessor is built for your app and, for each intent, an entity recognizer is trained on the existing labeled data.
The BootstrapAnnotator then uses the trained entity recognizer for each intent to predict and label the entities in that intent's new queries. This requires that your app already has labeled queries.
First, ensure that the files you would like to label share the same name or pattern. For example, you may name these files train_bootstrap.txt across all intents.
Update the annotator_class
field in your AUTO_ANNOTATOR_CONFIG
to be BootstrapAnnotator
and set your annotation rules to include your desired patterns.
You can optionally set the confidence_threshold
for labeling in the config as shown below. For this example, we will set it to 0.95. This means that entities will only be labeled if the entity recognizer assigns a confidence score over 95% to the entity.
AUTO_ANNOTATOR_CONFIG = {
"annotator_class": "BootstrapAnnotator",
"confidence_threshold": 0.95,
...
"annotation_rules": [
{
"domains": ".*",
"intents": ".*",
"files": ".*bootstrap.*\.txt",
"entities": ".*",
}
],
}
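The effect of the threshold can be shown with a few lines of plain Python. The prediction dictionaries and scores below are invented for illustration, and keep_confident is a hypothetical name; the real annotator works on the entity recognizer's own output.

```python
def keep_confident(predictions, confidence_threshold=0.95):
    """Keep only predictions whose score is strictly above the threshold."""
    return [p for p in predictions if p["confidence"] > confidence_threshold]


# Made-up predictions for one query.
predictions = [
    {"text": "Angela", "type": "name", "confidence": 0.99},
    {"text": "boss", "type": "manager", "confidence": 0.90},
]
labelled = keep_confident(predictions)
print([p["text"] for p in labelled])
```

A higher threshold labels fewer entities but leaves fewer wrong labels to correct by hand; a lower one does the opposite.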
Check your ENTITY_RECOGNIZER_CONFIG
in config.py
. Make sure that you explicitly specify the regex pattern for training and testing and that this pattern does not overlap with the pattern for your unlabeled data (E.g. train_bootstrap.txt
).
ENTITY_RECOGNIZER_CONFIG = {
...
'train_label_set': 'train.*\.txt',
'test_label_set': 'test.*\.txt'
}
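Whether two patterns overlap is easy to misjudge, so it may be worth checking them against your actual file names. The sketch below assumes that label-set patterns are applied to file names with re.match semantics, which is an assumption rather than something stated here; files_matching is a hypothetical helper.

```python
import re


def files_matching(pattern, file_names):
    """Return the file names that a label-set regex would pick up."""
    regex = re.compile(pattern)
    return [name for name in file_names if regex.match(name)]


file_names = ["train.txt", "test.txt", "train_bootstrap.txt"]
broad = files_matching(r"train.*\.txt", file_names)
narrow = files_matching(r"train\.txt", file_names)
print(broad)
print(narrow)
```

Under that assumption, a broad pattern such as train.*\.txt also picks up train_bootstrap.txt, so either narrow the training pattern or give the unlabeled files a name that the training and testing patterns cannot match.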
To run from the command line:
mindmeld annotate --app-path "hr_assistant"
Alternatively, you can annotate by creating an instance of the BootstrapAnnotator
class and running the Python code below.
An optional param overwrite
can be passed in here as well.
from mindmeld.auto_annotator import BootstrapAnnotator

# Only files whose names contain "bootstrap" are selected for annotation.
annotation_rules = [
    {
        "domains": ".*",
        "intents": ".*",
        "files": r".*bootstrap.*\.txt",
        "entities": ".*",
    }
]
ba = BootstrapAnnotator(
    app_path="hr_assistant",
    annotation_rules=annotation_rules,
    confidence_threshold=0.95,
)
ba.annotate()
Note
The Bootstrap Annotator is different from the predict command-line function. Running python -m hr_assistant predict train_bootstrap.txt -o labeled.tsv will output a TSV file with annotated queries.
Unlike the Bootstrap Annotator, the predict command only annotates a single file and does not use the entity recognizer of a specific intent. Instead, it uses the intent classified by nlp.process(query_text).
Creating a Custom Annotator¶
The MultiLingualAnnotator
is a subclass of the abstract base class Annotator
.
The functionality for annotating and unannotating files is contained in Annotator
itself.
A developer simply needs to implement two methods to create a custom annotator.
Custom Annotator Boilerplate Code¶
This section includes boilerplate code to build a CustomAnnotator class, which you can add to your own Python file; let's call it custom_annotator.py.
There are two "TODO"s. To implement a CustomAnnotator class, a developer has to implement the parse() and supported_entity_types() methods.
from mindmeld.auto_annotator import Annotator


class CustomAnnotator(Annotator):
    """ Custom Annotator class used to generate annotations.
    """

    def __init__(
        self,
        app_path,
        annotation_rules=None,
        language=None,
        locale=None,
        overwrite=False,
        unannotate_supported_entities_only=True,
        unannotation_rules=None,
        custom_param=None,
    ):
        super().__init__(
            app_path,
            annotation_rules=annotation_rules,
            language=language,
            locale=locale,
            overwrite=overwrite,
            unannotate_supported_entities_only=unannotate_supported_entities_only,
            unannotation_rules=unannotation_rules,
        )
        self.custom_param = custom_param
        # Add additional params to init if needed

    def parse(self, sentence, entity_types=None, **kwargs):
        """
        Args:
            sentence (str): Sentence to detect entities.
            entity_types (list): List of entity types to parse. If None, all
                possible entity types will be parsed.
        Returns:
            query_entities (list[QueryEntity]): List of QueryEntity objects.
        """
        # TODO: Add custom parse logic
        query_entities = []  # placeholder until the custom logic is added
        return query_entities

    @property
    def supported_entity_types(self):
        """
        Returns:
            supported_entity_types (list): List of supported entity types.
        """
        # TODO: Add the entities supported by CustomAnnotator to supported_entities (list)
        supported_entities = []
        return supported_entities


if __name__ == "__main__":
    annotation_rules = [
        {
            "domains": ".*",
            "intents": ".*",
            "files": ".*",
            "entities": ".*",
        }
    ]
    custom_annotator = CustomAnnotator(
        app_path="hr_assistant",
        annotation_rules=annotation_rules,
    )
    custom_annotator.annotate()
To run your custom Annotator, simply run in the command line: python custom_annotator.py.
To run unannotation with your custom Annotator, change the last line in your script to custom_annotator.unannotate().
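As a starting point for the first TODO, the detection logic inside parse() could be as simple as a regular expression. The sketch below finds email addresses and returns plain dictionaries with inclusive character offsets. The "email" entity type, the regex, and the function name are all hypothetical, and converting the results into QueryEntity objects is deliberately left out because it depends on MindMeld internals not covered here.

```python
import re

# A deliberately simple email pattern for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_email_spans(sentence, entity_types=None):
    """Return detected emails as dicts with inclusive character offsets."""
    # Honour the entity_types filter in the same spirit as parse().
    if entity_types is not None and "email" not in entity_types:
        return []
    return [
        {
            "text": m.group(),
            "type": "email",
            "start": m.start(),
            "end": m.end() - 1,  # inclusive, like the char ranges shown earlier
        }
        for m in EMAIL.finditer(sentence)
    ]


spans = find_email_spans("Send the report to jane.doe@example.com today")
print(spans)
```

With logic like this in place, supported_entity_types would return the matching list of entity types, here just the hypothetical "email".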
Getting Custom Parameters from the Config¶
spacy_model_size is an example of an optional parameter in the config that is relevant only for a specific Annotator class.
AUTO_ANNOTATOR_CONFIG = {
    ...
    "spacy_model_size": "md",
    ...
}
If a SpacyAnnotator is created using the command-line, it will use the value for spacy_model_size that exists in the config during instantiation.
A similar approach can be taken for custom Annotators.
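For a custom Annotator, one way to follow the same approach is to read the extra key from the config dictionary with a fallback. The sketch below is illustrative only: custom_param matches the boilerplate above, while the default value and the helper name are hypothetical, and how the config dictionary reaches your class is up to your script.

```python
DEFAULT_CUSTOM_PARAM = "default-value"  # hypothetical default


def get_custom_param(config):
    """Read a hypothetical 'custom_param' key, falling back to a default."""
    return (config or {}).get("custom_param", DEFAULT_CUSTOM_PARAM)


AUTO_ANNOTATOR_CONFIG = {
    "custom_param": "my-value",
}
print(get_custom_param(AUTO_ANNOTATOR_CONFIG))
print(get_custom_param(None))
```

The returned value can then be passed as custom_param when instantiating CustomAnnotator.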