mindmeld.text_preparation.tokenizers module¶
This module contains Tokenizers.
class mindmeld.text_preparation.tokenizers.CharacterTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that splits text at the character level.
tokenize(text)[source]¶
Split characters into separate tokens while skipping spaces.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
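A minimal sketch of this behavior (an illustrative re-implementation, not the MindMeld source): every non-space character becomes its own token dictionary, keyed by its index in the original string.

```python
# Illustrative sketch of character-level tokenization: each non-space
# character becomes one token dict with its start index and text.
def char_tokenize(text):
    """Split characters into separate tokens, skipping spaces."""
    return [
        {"start": i, "text": ch}
        for i, ch in enumerate(text)
        if not ch.isspace()
    ]

print(char_tokenize("hi u"))
# Spaces are skipped, but indices still refer to the original string,
# so 'u' keeps start index 3.
```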
class mindmeld.text_preparation.tokenizers.LetterTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that starts a separate token when a character follows a space, is a non-Latin character, or belongs to a different Unicode category than the previous character.
static create_tokens(text, token_num_by_char)[source]¶
Generate token dictionaries from the original text and the token numbers by character.
Parameters: text (str) -- The text to tokenize.
token_num_by_char (List[int]) -- Token number that each character belongs to. Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
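A sketch of the grouping step (an illustrative re-implementation, not the MindMeld source): consecutive characters sharing a token number are merged into one token, and None entries (spaces) are skipped.

```python
# Illustrative sketch: rebuild token dictionaries from per-character
# token numbers. Characters with the same number are grouped into one
# token; None entries (spaces) separate tokens.
def create_tokens(text, token_num_by_char):
    tokens = []
    for i, (ch, num) in enumerate(zip(text, token_num_by_char)):
        if num is None:
            continue  # spaces never join a token
        if tokens and i > 0 and token_num_by_char[i - 1] == num:
            tokens[-1]["text"] += ch  # same token number: extend current token
        else:
            tokens.append({"start": i, "text": ch})  # new token begins here
    return tokens

print(create_tokens("ab 12", [1, 1, None, 2, 2]))
```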
static get_token_num_by_char(text)[source]¶
Determine the token number for each character. More details about Unicode categories can be found here: http://www.unicode.org/reports/tr44/#General_Category_Values.
Parameters: text (str) -- The text to process.
Returns: Token number that each character belongs to. Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Return type: token_num_by_char (List[int])
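A simplified sketch of the numbering idea (an assumption-laden illustration, not the MindMeld source — it handles only the space and Unicode-category rules, not the non-Latin single-character splitting described above): walk the string, map spaces to None, and increment the token number whenever the Unicode general category changes.

```python
import unicodedata

# Simplified sketch: assign a token number to each character, starting a
# new token after a space or when the Unicode general category changes.
# Spaces map to None. (The real LetterTokenizer also splits non-Latin
# characters individually; that rule is omitted here for brevity.)
def get_token_num_by_char(text):
    token_nums = []
    token_num = 0
    prev_category = None
    for ch in text:
        if ch.isspace():
            token_nums.append(None)
            prev_category = None  # force a new token after the space
            continue
        category = unicodedata.category(ch)
        if category != prev_category:
            token_num += 1  # category boundary: start a new token
        token_nums.append(token_num)
        prev_category = category
    return token_nums

print(get_token_num_by_char("ab 12"))
```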
tokenize(text)[source]¶
Identify tokens in text and create normalized tokens that contain the text and start index.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
class mindmeld.text_preparation.tokenizers.NoOpTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A no-op tokenizer.
class mindmeld.text_preparation.tokenizers.SpacyTokenizer(language, spacy_model_size='sm')[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that uses spaCy to split text into tokens.
class mindmeld.text_preparation.tokenizers.Tokenizer[source]¶
Bases: abc.ABC
Abstract Tokenizer base class.
class mindmeld.text_preparation.tokenizers.TokenizerFactory[source]¶
Bases: object
Tokenizer factory class.
static get_default_tokenizer()[source]¶
Creates the default tokenizer (WhiteSpaceTokenizer) irrespective of the language of the current application.
Returns: Tokenizer class.
Return type: (Tokenizer)
class mindmeld.text_preparation.tokenizers.WhiteSpaceTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that splits text at spaces.
tokenize(text)[source]¶
Identify tokens in text and create token dictionaries that contain the text and start index.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
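A minimal sketch of whitespace tokenization (an illustrative re-implementation, not the MindMeld source): runs of non-space characters become tokens, with start indices recovered via re.finditer.

```python
import re

# Illustrative sketch: split on whitespace while recording each token's
# starting index in the original string via re.finditer.
def whitespace_tokenize(text):
    return [
        {"start": m.start(), "text": m.group()}
        for m in re.finditer(r"\S+", text)
    ]

print(whitespace_tokenize("hello world"))
```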