mindmeld.text_preparation.tokenizers module¶
This module contains Tokenizers.
class mindmeld.text_preparation.tokenizers.CharacterTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that splits text at the character level.
tokenize(text)[source]¶
Split characters into separate tokens while skipping spaces.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
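A minimal sketch of this behavior (an illustrative re-implementation, not the MindMeld source): every non-space character becomes its own token dictionary, keyed by its index in the original string.

```python
# Illustrative sketch of character-level tokenization: each non-space
# character becomes one token dict with its start index and text.
def char_tokenize(text):
    """Split characters into separate tokens, skipping spaces."""
    return [
        {"start": i, "text": ch}
        for i, ch in enumerate(text)
        if not ch.isspace()
    ]

print(char_tokenize("hi u"))
# Spaces are skipped, but indices still refer to the original string,
# so 'u' keeps start index 3.
```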
class mindmeld.text_preparation.tokenizers.LetterTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that starts a separate token when a character follows a space, is a non-Latin character, or belongs to a different Unicode category than the previous character.
static create_tokens(text, token_num_by_char)[source]¶
Generate token dictionaries from the original text and the token numbers by character.
Parameters: text (str) -- The text to tokenize.
token_num_by_char (List[int]) -- Token number that each character belongs to. Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
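A sketch of the grouping step (an illustrative re-implementation, not the MindMeld source): consecutive characters sharing a token number are merged into one token, and None entries (spaces) are skipped.

```python
# Illustrative sketch: rebuild token dictionaries from per-character
# token numbers. Characters with the same number are grouped into one
# token; None entries (spaces) separate tokens.
def create_tokens(text, token_num_by_char):
    tokens = []
    for i, (ch, num) in enumerate(zip(text, token_num_by_char)):
        if num is None:
            continue  # spaces never join a token
        if tokens and i > 0 and token_num_by_char[i - 1] == num:
            tokens[-1]["text"] += ch  # same token number: extend current token
        else:
            tokens.append({"start": i, "text": ch})  # new token begins here
    return tokens

print(create_tokens("ab 12", [1, 1, None, 2, 2]))
```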
static get_token_num_by_char(text)[source]¶
Determine the token number for each character. More details about Unicode categories can be found here: http://www.unicode.org/reports/tr44/#General_Category_Values.
Parameters: text (str) -- The text to process.
Returns: Token number that each character belongs to. Spaces are represented as None. For example: [1, 2, 2, 3, None, 4, None, 5, 5, 5]
Return type: token_num_by_char (List[int])
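A simplified sketch of the numbering idea (an assumption-laden illustration, not the MindMeld source — it handles only the space and Unicode-category rules, not the non-Latin single-character splitting described above): walk the string, map spaces to None, and increment the token number whenever the Unicode general category changes.

```python
import unicodedata

# Simplified sketch: assign a token number to each character, starting a
# new token after a space or when the Unicode general category changes.
# Spaces map to None. (The real LetterTokenizer also splits non-Latin
# characters individually; that rule is omitted here for brevity.)
def get_token_num_by_char(text):
    token_nums = []
    token_num = 0
    prev_category = None
    for ch in text:
        if ch.isspace():
            token_nums.append(None)
            prev_category = None  # force a new token after the space
            continue
        category = unicodedata.category(ch)
        if category != prev_category:
            token_num += 1  # category boundary: start a new token
        token_nums.append(token_num)
        prev_category = category
    return token_nums

print(get_token_num_by_char("ab 12"))
```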
tokenize(text)[source]¶
Identify tokens in text and create normalized tokens that contain the text and start index.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
class mindmeld.text_preparation.tokenizers.NoOpTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A no-op tokenizer.
class mindmeld.text_preparation.tokenizers.SpacyTokenizer(language, spacy_model_size='sm')[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that uses spaCy to split text into tokens.
class mindmeld.text_preparation.tokenizers.Tokenizer[source]¶
Bases: abc.ABC
Abstract Tokenizer base class.
class mindmeld.text_preparation.tokenizers.TokenizerFactory[source]¶
Bases: object
Tokenizer factory class.
static get_default_tokenizer()[source]¶
Creates the default tokenizer (WhiteSpaceTokenizer) irrespective of the language of the current application.
Returns: Tokenizer class.
Return type: (Tokenizer)
class mindmeld.text_preparation.tokenizers.WhiteSpaceTokenizer[source]¶
Bases: mindmeld.text_preparation.tokenizers.Tokenizer
A Tokenizer that splits text at spaces.
tokenize(text)[source]¶
Identify tokens in text and create token dictionaries that contain the text and start index.
Parameters: text (str) -- The text to tokenize.
Returns: List of tokens represented as dictionaries. Keys include "start" (token starting index) and "text" (token text). For example: [{"start": 0, "text": "hello"}]
Return type: tokens (List[Dict])
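A minimal sketch of whitespace tokenization (an illustrative re-implementation, not the MindMeld source): runs of non-space characters become tokens, with start indices recovered via re.finditer.

```python
import re

# Illustrative sketch: split on whitespace while recording each token's
# starting index in the original string via re.finditer.
def whitespace_tokenize(text):
    return [
        {"start": m.start(), "text": m.group()}
        for m in re.finditer(r"\S+", text)
    ]

print(whitespace_tokenize("hello world"))
```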