TokenMonster Documentation¶
Table of Contents¶
- Understanding the Purpose
- Overview and Introduction
- Class Definition
- Functionality and Usage
- Initializing TokenMonster
- Setting Local Directory
- Loading Vocabulary
- Creating a New Vocabulary
- Saving Vocabulary
- Exporting Vocabulary to YAML
- Tokenization
- Decoding Tokens
- Creating a Decoder Instance
- Getting Vocabulary Dictionary
- Getting Character Set
- Getting Normalization
- Getting Capcode Level
- Getting Optimization Mode
- Mapping Token ID to Token String
- Mapping Token ID to Token String (Decoded)
- Mapping Token String to Token ID
- Modifying Vocabulary
- Adding Regular Tokens
- Deleting Tokens
- Deleting Tokens by ID
- Adding Special Tokens
- Resizing Vocabulary
- Resetting Token IDs
- Enabling UNK Token
- Disabling UNK Token
- Disconnecting TokenMonster
- Serializing Tokens
- Deserializing Tokens
- Additional Information
- Examples
- Conclusion
1. Understanding the Purpose ¶
TokenMonster is a Python library designed to provide tokenization and vocabulary management functionalities. It allows you to tokenize text, manage vocabularies, modify vocabularies, and perform various operations related to tokenization and vocabulary handling.
Purpose and Functionality¶
TokenMonster serves the following purposes and functionalities:
- Tokenization: Tokenize text into tokens based on a specified vocabulary.
- Vocabulary Management: Load, create, save, and modify vocabularies.
- Token ID Mapping: Map tokens to token IDs and vice versa.
- Serialization: Serialize and deserialize tokens for storage or transmission.
- Configuration: Access and modify vocabulary settings like character set, normalization, capcode level, and optimization mode.
- Special Token Handling: Add, delete, or modify special tokens.
- Disconnecting: Gracefully disconnect from the TokenMonster server.
TokenMonster is useful in various natural language processing tasks, especially when working with custom vocabularies and tokenization requirements.
2. Overview and Introduction ¶
Overview¶
TokenMonster is a versatile library for tokenization and vocabulary management. It allows you to create, load, and modify vocabularies, tokenize text, and perform various operations related to tokenization.
Importance and Relevance¶
In the field of natural language processing, tokenization is a fundamental step in text preprocessing. TokenMonster provides a flexible and efficient way to tokenize text while also enabling users to manage custom vocabularies. The ability to add, delete, or modify tokens is crucial when working with specialized language models and text data.
Key Concepts and Terminology¶
Before diving into the details, let's clarify some key concepts and terminology used throughout the documentation:
- Tokenization: The process of breaking text into individual tokens (words, subwords, or characters).
- Vocabulary: A collection of tokens and their corresponding token IDs.
- Token ID: A unique identifier for each token in the vocabulary.
- Special Tokens: Tokens that have a specific role, such as padding, start of sentence, end of sentence, and unknown tokens.
- Normalization: Text processing operations like lowercasing, accent removal, and character set transformation.
- Capcode Level: The level of capcoding applied to tokens (0-2).
- Optimization Mode: The mode used for optimizing TokenMonster (0-5).
Now that we have an overview, let's proceed with a detailed class definition.
3. Class Definition ¶
Class: TokenMonster¶
The TokenMonster class encapsulates the functionality of the TokenMonster library.
Constructor¶
def __init__(self, path):
"""
Initializes the TokenMonster class and loads a vocabulary.
Args:
path (str): A filepath, URL, or pre-built vocabulary name.
"""
Methods¶
The TokenMonster class defines various methods to perform tokenization, vocabulary management, and configuration. Here are the key methods:
1. Setting Local Directory¶
def set_local_directory(self, dir=None):
"""
Sets the local directory for TokenMonster.
Args:
dir (str, optional): The local directory to use. Defaults to None.
"""
2. Loading Vocabulary¶
def load(self, path):
"""
Loads a TokenMonster vocabulary from file, URL, or by name.
Args:
path (str): A filepath, URL, or pre-built vocabulary name.
"""
3. Loading Vocabulary (Multiprocess Safe)¶
def load_multiprocess_safe(self, path):
"""
Loads a TokenMonster vocabulary from file, URL, or by name. It's safe for multiprocessing,
but vocabulary modification is disabled, and tokenization is slightly slower.
Args:
path (str): A filepath, URL, or pre-built vocabulary name.
"""
4. Creating a New Vocabulary¶
def new(self, yaml):
"""
Creates a new vocabulary from a YAML string.
Args:
yaml (str): The vocabulary specification in YAML format.
"""
5. Saving Vocabulary¶
def save(self, fname):
"""
Saves the current vocabulary to a file.
Args:
fname (str): The filename to save the vocabulary to.
"""
6. Exporting Vocabulary to YAML¶
def export_yaml(self, order_by_score=False):
"""
Exports the vocabulary as a YAML file, which is returned as a bytes string.
Args:
order_by_score (bool, optional): If true, the tokens are ordered by score instead of alphabetically. Defaults to False.
Returns:
bytes: The vocabulary in YAML format.
"""
7. Tokenization¶
def tokenize(self, text):
"""
Tokenizes a string into tokens according to the vocabulary.
Args:
text (str): A string or bytes string or a list of strings or bytes strings.
Returns:
numpy array: The token IDs.
"""
8. Tokenization (Count)¶
def tokenize_count(self, text):
"""
Same as tokenize, but it returns only the number of tokens.
Args:
text (str): A string or bytes string or a list of strings or bytes strings.
Returns:
int: The number of tokens for each input string.
"""
9. Decoding Tokens¶
def decode(self, tokens):
"""
Decodes tokens into a string.
Args:
tokens (int, list of int, or numpy array): The tokens to decode into a string.
Returns:
str: The composed string from the input tokens.
"""
10. Creating a Decoder Instance¶
def decoder(self):
"""
Returns a new decoder instance used for decoding tokens into text.
Returns:
tokenmonster.DecoderInstance: A new decoder instance.
"""
11. Getting Vocabulary Dictionary¶
def get_dictionary(self):
"""
Returns a dictionary of all tokens in the vocabulary.
Returns:
list: A list in which the index corresponds to the token ID and each element is a dictionary describing that token.
"""
12. Getting Character Set¶
def charset(self):
"""
Returns the character set used by the vocabulary.
Returns:
str: The character set used by the vocabulary. Possible values are "UTF-8" or "None".
"""
13. Getting Normalization¶
def normalization(self):
"""
Returns the normalization of the vocabulary.
Returns:
str: The normalization of the vocabulary. Possible values are "None", "NFD", "Lowercase", "Accents", "Quotemarks", "Collapse", "Trim", "LeadingSpace", or "UnixLines".
"""
14. Getting Capcode Level¶
def capcode(self):
"""
Returns the capcode level of the vocabulary.
Returns:
int: The capcode level (0-2).
"""
15. Getting Optimization Mode¶
def mode(self):
"""
Returns the optimization mode of the vocabulary.
Returns:
int: The optimization mode (0-5).
"""
16. Mapping Token ID to Token String¶
def id_to_token(self, id):
"""
Get the token string from a single token ID, in its capcode-encoded form.
Args:
id (int): The token ID.
Returns:
str or None: The token string corresponding to the input ID. None if the ID is not in the vocabulary.
"""
17. Mapping Token ID to Token String (Decoded)¶
def id_to_token_decoded(self, id):
"""
Get the token string from a single token ID, in its capcode-decoded form.
Args:
id (int): The token ID.
Returns:
str or None: The token string corresponding to the input ID. None if the ID is not in the vocabulary.
"""
18. Mapping Token String to Token ID¶
def token_to_id(self, token):
"""
Returns the ID of a single token.
Args:
token (str): The token to get the ID for.
Returns:
int or None: The ID of the token. None if the token is not in the vocabulary.
"""
19. Modifying Vocabulary¶
def modify(
self,
add_special_tokens=None,
add_regular_tokens=None,
delete_tokens=None,
resize=None,
change_unk=None,
):
"""
Modifies the vocabulary.
Args:
add_special_tokens (str or list of str, optional): Special tokens to add to the vocabulary.
add_regular_tokens (str or list of str, optional): Regular tokens to add to the vocabulary.
delete_tokens (str or list of str, optional): Regular or special tokens to delete.
resize (int, optional): Resizes the vocabulary to this size.
change_unk (bool, optional): If set, it enables or disables the UNK token.
Returns:
int: The new size of the vocabulary.
"""
20. Adding Regular Tokens¶
def add_token(self, token):
"""
Add one or more regular tokens.
Args:
token (str or list of str): The regular tokens to add.
Returns:
int: The new size of the vocabulary.
"""
21. Deleting Tokens¶
def delete_token(self, token):
"""
Delete one or more regular or special tokens.
Args:
token (str or list of str): The tokens to delete.
Returns:
int: The new size of the vocabulary.
"""
22. Deleting Tokens by ID¶
def delete_token_by_id(self, id):
"""
Delete one or more regular or special tokens by specifying the token ID.
Args:
id (int or list of int): The IDs of the tokens to delete.
Returns:
int: The new size of the vocabulary.
"""
23. Adding Special Tokens¶
def add_special_token(self, token):
"""
Add one or more special tokens.
Args:
token (str or list of str): The special tokens to add.
Returns:
int: The new size of the vocabulary.
"""
24. Resizing Vocabulary¶
def resize(self, size):
"""
Changes the size of the vocabulary.
Args:
size (int): The new size of the vocabulary.
Returns:
int: The new size of the vocabulary.
"""
25. Resetting Token IDs¶
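The signature below is a sketch inferred from section 4.25; the method is assumed to take no arguments:
def reset_token_ids(self):
    """
    Resets the token IDs to be sequential beginning from zero.
    """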
26. Enabling UNK Token¶
def enable_unk_token(self):
"""
Enables the UNK token.
Returns:
int: The new size of the vocabulary.
"""
27. Disabling UNK Token¶
def disable_unk_token(self):
"""
Disables the UNK token.
Returns:
int: The new size of the vocabulary.
"""
28. Disconnecting TokenMonster¶
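A sketch based on section 4.28; the method is assumed to take no arguments:
def disconnect(self):
    """
    Gracefully disconnects from the TokenMonster server.
    """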
29. Serializing Tokens¶
def serialize_tokens(self, integer_list):
"""
Serializes tokens from a list of ints or numpy array into a binary string.
Args:
integer_list (list of int or numpy array): The tokens to serialize.
Returns:
bytes: The serialized binary string.
"""
30. Deserializing Tokens¶
def deserialize_tokens(self, binary_string):
"""
Deserializes a binary string into a numpy array of token IDs.
Args:
binary_string (bytes): The binary string to deserialize.
Returns:
np.array: The deserialized tokens.
"""
This concludes the class definition. In the following sections, we will explore each method in detail and provide examples of their usage.
4. Functionality and Usage ¶
4.1. Initializing TokenMonster ¶
To get started with TokenMonster, you need to initialize an instance of the TokenMonster class. The constructor takes a single argument, path, which specifies the location of the vocabulary.
Example:
from zeta.tokenizers import TokenMonster
# Initialize TokenMonster with a vocabulary file
tokenizer = TokenMonster("path/to/vocabulary")
4.2. Setting Local Directory ¶
You can set the local directory for TokenMonster using the set_local_directory method. This directory is used for local caching of vocabulary files.
Example:
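A minimal sketch, reusing the tokenizer from section 4.1; the directory path is a placeholder:
# Set the local directory used for caching vocabulary files
tokenizer.set_local_directory("path/to/cache_dir")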
4.3. Loading Vocabulary ¶
TokenMonster allows you to load vocabularies from various sources, including file paths, URLs, or pre-built vocabulary names. Use the load method to load a vocabulary.
Example:
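For instance, loading a vocabulary from a local file (the path is a placeholder):
# Load a vocabulary from a file path, URL, or pre-built vocabulary name
tokenizer.load("path/to/vocabulary")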
4.4. Creating a New Vocabulary ¶
You can create a new vocabulary from a YAML string using the new method. This is useful when you want to define a custom vocabulary.
Example:
# Create a new vocabulary from a YAML string
yaml_string = """
- token: "[PAD]"
  id: 0
"""
tokenizer.new(yaml_string)
4.5. Saving Vocabulary ¶
TokenMonster allows you to save the current vocabulary to a file using the save method. This is useful for preserving custom vocabularies you've created.
Example:
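A short sketch; the filename is illustrative:
# Save the current vocabulary to a file
tokenizer.save("my_vocabulary.vocab")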
4.6. Exporting Vocabulary to YAML ¶
You can export the vocabulary as a YAML file using the export_yaml method. This method returns the vocabulary in YAML format as a bytes string.
Example:
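For example, exporting with tokens ordered by score and writing the result to disk (the output filename is arbitrary):
# Export the vocabulary as YAML, ordered by score, and write the bytes to a file
yaml_data = tokenizer.export_yaml(order_by_score=True)
with open("vocabulary.yaml", "wb") as file:
    file.write(yaml_data)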
4.7. Tokenization ¶
Tokenization is a core functionality of TokenMonster. You can tokenize text into tokens according to the loaded vocabulary using the tokenize method.
Example:
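A minimal sketch using the tokenizer from section 4.1:
# Tokenize a text string into a numpy array of token IDs
text = "Hello, world!"
token_ids = tokenizer.tokenize(text)
print(token_ids)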
4.8. Tokenization (Count) ¶
If you want to know the number of tokens without getting the token IDs, you can use the tokenize_count method.
Example:
# Count the number of tokens in a text string
text = "Hello, world!"
token_count = tokenizer.tokenize_count(text)
4.9. Decoding Tokens ¶
To decode token IDs back into a human-readable string, you can use the decode method.
Example:
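For instance (the token IDs shown are arbitrary and depend on the loaded vocabulary):
# Decode a list of token IDs back into a string
decoded_text = tokenizer.decode([1, 2, 3])
print(decoded_text)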
4.10. Creating a Decoder Instance ¶
TokenMonster allows you to create a decoder instance for decoding tokens into text. Use the decoder method to obtain a decoder instance.
Example:
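A sketch of stateful decoding, assuming the returned DecoderInstance exposes a decode method as in the upstream tokenmonster library:
# Create a decoder instance for decoding a token stream incrementally
decoder = tokenizer.decoder()
# Feed tokens to the decoder as they become available
partial_text = decoder.decode(tokenizer.tokenize("Hello, world!"))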
4.11. Getting Vocabulary Dictionary ¶
The get_dictionary method returns a list describing all tokens in the vocabulary. Each entry is a dictionary containing information about the token.
Example:
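For example, inspecting the first few entries of the returned list:
# Retrieve the token dictionary and look at the first five entries
dictionary = tokenizer.get_dictionary()
print(dictionary[:5])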
4.12. Getting Character Set ¶
You can retrieve the character set used by the vocabulary using the charset method.
Example:
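A minimal sketch:
# Get the character set of the vocabulary ("UTF-8" or "None")
charset = tokenizer.charset()
print(charset)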
4.13. Getting Normalization ¶
TokenMonster allows you to access the normalization settings applied to the vocabulary using the normalization method.
Example:
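A short sketch:
# Get the normalization applied to the vocabulary, e.g. "None" or "NFD"
normalization = tokenizer.normalization()
print(normalization)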
4.14. Getting Capcode Level ¶
The capcode method returns the capcode level of the vocabulary.
Example:
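For instance:
# Get the capcode level of the vocabulary (0-2)
capcode_level = tokenizer.capcode()
print(capcode_level)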
4.15. Getting Optimization Mode ¶
You can retrieve the optimization mode used for TokenMonster using the mode method.
Example:
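For instance:
# Get the optimization mode of the vocabulary (0-5)
optimization_mode = tokenizer.mode()
print(optimization_mode)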
4.16. Mapping Token ID to Token String ¶
Given a token ID, you can use the id_to_token method to get the token string in its capcode-encoded form.
Example:
# Get the token string from a token ID (capcode-encoded)
token_id = 42
token_string = tokenizer.id_to_token(token_id)
4.17. Mapping Token ID to Token String (Decoded) ¶
The id_to_token_decoded method is used to get the token string from a token ID in its capcode-decoded form.
Example:
# Get the token string from a token ID (capcode-decoded)
token_id = 42
decoded_token_string = tokenizer.id_to_token_decoded(token_id)
4.18. Mapping Token String to Token ID ¶
You can obtain the token ID of a given token string using the token_to_id method.
Example:
# Get the token ID from a token string
token_string = "apple"
token_id = tokenizer.token_to_id(token_string)
4.19. Modifying Vocabulary ¶
TokenMonster provides methods to modify the vocabulary. You can add special tokens, add regular tokens, delete tokens, resize the vocabulary, and enable or disable the UNK token.
Example:
# Example of modifying the vocabulary
# Add a special token
tokenizer.modify(add_special_tokens="[_START_]", add_regular_tokens=None, delete_tokens=None, resize=None, change_unk=None)
# Delete a regular token
tokenizer.modify(add_special_tokens=None, add_regular_tokens=None, delete_tokens=["apple"], resize=None, change_unk=None)
4.20. Adding Regular Tokens ¶
You can add one or more regular tokens to the vocabulary using the add_token method.
Example:
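A short sketch; the token strings are arbitrary examples:
# Add regular tokens; the new vocabulary size is returned
new_size = tokenizer.add_token(["apple", "banana"])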
4.21. Deleting Tokens ¶
To delete one or more regular or special tokens from the vocabulary, use the delete_token method.
Example:
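For instance (the token string is an arbitrary example):
# Delete a regular or special token by its string form
new_size = tokenizer.delete_token("apple")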
4.22. Deleting Tokens by ID ¶
You can delete one or more regular or special tokens by specifying their token IDs using the delete_token_by_id method.
Example:
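A short sketch; the IDs are arbitrary:
# Delete tokens by their IDs; the new vocabulary size is returned
new_size = tokenizer.delete_token_by_id([42, 43])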
4.23. Adding Special Tokens ¶
Special tokens play specific roles in tokenization. You can add one or more special tokens to the vocabulary using the add_special_token method.
Example:
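For instance (the token string "[_END_]" is an arbitrary example):
# Add a special token; the new vocabulary size is returned
new_size = tokenizer.add_special_token("[_END_]")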
4.24. Resizing Vocabulary ¶
To change the size of the vocabulary, you can use the resize method. This allows you to specify the desired size of the vocabulary.
Example:
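A minimal sketch; the target size is illustrative:
# Resize the vocabulary to the requested number of tokens
new_size = tokenizer.resize(32000)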
4.25. Resetting Token IDs ¶
The reset_token_ids method resets the token IDs to be sequential beginning from zero.
Example:
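A short sketch, assuming the method takes no arguments:
# Reset token IDs so they are sequential starting from zero
tokenizer.reset_token_ids()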
4.26. Enabling UNK Token ¶
You can enable the UNK (unknown) token in the vocabulary using the enable_unk_token method.
Example:
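For instance:
# Enable the UNK token; the new vocabulary size is returned
new_size = tokenizer.enable_unk_token()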
4.27. Disabling UNK Token ¶
The disable_unk_token method allows you to disable the UNK (unknown) token in the vocabulary.
Example:
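For instance:
# Disable the UNK token; the new vocabulary size is returned
new_size = tokenizer.disable_unk_token()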
4.28. Disconnecting TokenMonster ¶
To gracefully disconnect from the TokenMonster server, use the disconnect method.
Example:
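A minimal sketch, assuming the method takes no arguments:
# Gracefully disconnect from the TokenMonster server
tokenizer.disconnect()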
4.29. Serializing Tokens ¶
TokenMonster provides the serialize_tokens method to serialize tokens from a list of integers or a numpy array into a binary string.
Example:
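For instance (the token IDs are arbitrary):
# Serialize a list of token IDs into a binary string
binary_string = tokenizer.serialize_tokens([1, 2, 3])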
4.30. Deserializing Tokens ¶
You can use the deserialize_tokens method to deserialize a binary string into a numpy array of token IDs.
Example:
# Deserialize tokens
binary_string = b"\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00"
deserialized_tokens = tokenizer.deserialize_tokens(binary_string)
5. Additional Information ¶
5.1. Multiprocessing Safety¶
TokenMonster provides a load_multiprocess_safe method that is safe for multiprocessing. When using this method, vocabulary modification is disabled, and tokenization may be slightly slower compared to the regular load method.
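A brief sketch of the multiprocess-safe path, reusing the tokenizer from section 4.1 (the vocabulary path is a placeholder):
# Load a vocabulary in multiprocess-safe mode; modification is then disabled
tokenizer.load_multiprocess_safe("path/to/vocabulary")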
5.2. Supported Character Sets¶
TokenMonster supports two character sets: "UTF-8" and "None." You can check the character set used by the vocabulary using the charset method.
5.3. Supported Normalization Options¶
The vocabulary can have various normalization options applied, including "None," "NFD," "Lowercase," "Accents," "Quotemarks," "Collapse," "Trim," "LeadingSpace," and "UnixLines." You can access the normalization setting using the normalization method.
5.4. Capcode Levels¶
The capcode level of the vocabulary ranges from 0 to 2 and can be read with the capcode method. Capcode encodes capitalization and related case transformations with marker tokens, so a single lowercase token can stand in for several case variants, which keeps the vocabulary smaller.
5.5. Optimization Modes¶
TokenMonster supports optimization modes from 0 to 5, which affect the memory usage and performance of the library. You can check the optimization mode using the mode method.
6. Examples ¶
Let's explore some examples of how to use TokenMonster for tokenization and vocabulary management.
Example 1: Tokenizing Text¶
from zeta.tokenizers import TokenMonster
# Initialize TokenMonster with a vocabulary file
tokenizer = TokenMonster("path/to/vocabulary")
# Tokenize a text string
text = "Hello, world!"
token_ids = tokenizer.tokenize(text)
print(token_ids)
Example 2: Decoding Tokens¶
from zeta.tokenizers import TokenMonster
# Initialize TokenMonster with a vocabulary file
tokenizer = TokenMonster("path/to/vocabulary")
# Decode token IDs into a string
decoded_text = tokenizer.decode([1, 2, 3])
print(decoded_text)
Example 3: Modifying Vocabulary¶
from zeta.tokenizers import TokenMonster
# Initialize TokenMonster with a vocabulary file
tokenizer = TokenMonster("path/to/vocabulary")
# Add a special token
tokenizer.modify(
add_special_tokens="[_START_]",
add_regular_tokens=None,
delete_tokens=None,
resize=None,
change_unk=None,
)
# Delete a regular token
tokenizer.modify(
add_special_tokens=None,
add_regular_tokens=None,
delete_tokens=["apple"],
resize=None,
change_unk=None,
)
Example 4: Exporting Vocabulary to YAML¶
from zeta.tokenizers import TokenMonster
# Initialize TokenMonster with a vocabulary file
tokenizer = TokenMonster("path/to/vocabulary")
# Export the vocabulary to a YAML file
yaml_data = tokenizer.export_yaml()
with open("vocabulary.yaml", "wb") as file:
file.write(yaml_data)
7. Conclusion ¶
TokenMonster is a powerful Python module for tokenization and vocabulary management. Whether you're working on natural language processing tasks or need to create custom tokenization pipelines, TokenMonster provides the flexibility and functionality to handle tokenization efficiently. Use the examples and methods provided in this guide to leverage TokenMonster for your projects.