# Documentation for Zeta Library's MultiModalTokenizer Class

## Introduction and Overview

The `MultiModalTokenizer` class is part of the Zeta Library and provides tokenization for both text and image data. This enables seamless integration of multimodal (text and image) inputs, especially with models that can process both modalities simultaneously, such as CLIP.
Key Features:

- Multimodal Tokenization: Combines text and image tokenization within one unified class.
- Integration with Hugging Face Transformers: Utilizes the `CLIPProcessor` for image tokenization and `AutoTokenizer` for text tokenization.
- Special Tokens for Image Segmentation: Uses the special tokens `<image>` and `</image>` to denote image token boundaries within text (see the sketch after this list).
- Error Handling: Implements comprehensive error handling and logging to ensure robustness.
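For illustration only, here is a minimal, hypothetical sketch of how such boundary tokens can be registered with a Hugging Face `AutoTokenizer` and their indices retrieved; the exact wiring inside `MultiModalTokenizer` may differ (the pretrained model name is taken from the Additional Tips section below):

```python
from transformers import AutoTokenizer

# Hypothetical sketch: register the image-boundary special tokens and
# look up their vocabulary indices (the im_idx / im_end_idx attributes).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<image>", "</image>"]}
)

im_idx = tokenizer.convert_tokens_to_ids("<image>")
im_end_idx = tokenizer.convert_tokens_to_ids("</image>")
print(im_idx, im_end_idx)
```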
## Class Definition

### MultiModalTokenizer
```python
class MultiModalTokenizer:
    """
    A tokenizer class for the Kosmos model.

    Attributes:
        processor (CLIPProcessor): The processor used to tokenize images.
        tokenizer (AutoTokenizer): The tokenizer used to tokenize text.
        im_idx (int): The index of the "<image>" token.
        im_end_idx (int): The index of the "</image>" token.
    """
```
### Parameters

- `max_length` (int, optional): Maximum length of the tokenized sequence. Defaults to 8192.
### Attributes

- `processor` (CLIPProcessor): The processor used to tokenize images.
- `tokenizer` (AutoTokenizer): The tokenizer used to tokenize text.
- `im_idx` (int): Index of the `<image>` token.
- `im_end_idx` (int): Index of the `</image>` token.
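As a rough illustration (not the library's verbatim source), a constructor consistent with the parameters and attributes above might look like this; the pretrained model names come from the Additional Tips section below:

```python
from transformers import AutoTokenizer, CLIPProcessor

class MultiModalTokenizer:
    # Illustrative sketch consistent with the documented attributes;
    # the actual Zeta implementation may differ in detail.
    def __init__(self, max_length: int = 8192):
        self.max_length = max_length
        self.processor = CLIPProcessor.from_pretrained(
            "laion/CLIP-ViT-L-14-laion2B-s32B-b82K"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            "EleutherAI/gpt-neox-20b",
            additional_special_tokens=["<image>", "</image>"],
            model_max_length=max_length,
        )
        # Cache the vocabulary indices of the image boundary tokens.
        self.im_idx, self.im_end_idx = self.tokenizer.convert_tokens_to_ids(
            ["<image>", "</image>"]
        )
```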
## Methods

### 1. tokenize_texts
```python
def tokenize_texts(self, texts: str) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Tokenize the given texts.

    Args:
        texts (str): The text to be tokenized.

    Returns:
        A tuple containing the tokenized texts and the text-only tokens.
    """
```
### 2. tokenize_images

```python
def tokenize_images(self, images) -> torch.Tensor:
    """
    Tokenize the given images.

    Args:
        images: The images to be tokenized.

    Returns:
        The tokenized images.
    """
```
### 3. tokenize

```python
def tokenize(self, sample) -> Dict[str, torch.Tensor]:
    """
    Tokenize the given sample.

    Args:
        sample: The sample to be tokenized.

    Returns:
        A dictionary containing the tokenized text tokens, images, labels,
        and attention mask.
    """
```
## Usage Examples

### Example 1: Tokenizing Texts

```python
import torch
from zeta import MultiModalTokenizer

tokenizer = MultiModalTokenizer()
texts = ["Hello World", "Zeta Library is great!"]
tokenized_texts, only_texts = tokenizer.tokenize_texts(texts)
print(tokenized_texts)
print(only_texts)
```
### Example 2: Tokenizing Images

```python
import torch
from zeta import MultiModalTokenizer

tokenizer = MultiModalTokenizer()
images = torch.randn(2, 3, 224, 224)  # Two random images of shape 3x224x224
tokenized_images = tokenizer.tokenize_images(images)
print(tokenized_images)
```
### Example 3: Tokenizing Multimodal Data

```python
import torch
from zeta import MultiModalTokenizer

tokenizer = MultiModalTokenizer()
sample = {
    "target_text": ["Hello World", "Zeta Library is great!"],
    "image": torch.randn(2, 3, 224, 224),
}
tokenized_data = tokenizer.tokenize(sample)
print(tokenized_data)
```
## Mathematical Overview

Given a text sequence \( T \) of length \( n \) and an image \( I \) represented by a tensor of shape \( C \times H \times W \), where \( C \) is the number of channels, \( H \) is the height, and \( W \) is the width:

- The tokenized text \( T' \) is represented as \( T' = [\texttt{<image>},\ \texttt{</image>},\ T_{1}, T_{2}, \ldots, T_{n}] \), i.e., the image boundary tokens are prepended to the text tokens.
- The tokenized image \( I' \) is processed using the CLIP processor to obtain a tensor representation.
- When both text and image data are tokenized using the `tokenize` method, the output contains both \( T' \) and \( I' \) with their respective attention masks.
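If the boundary tokens are indeed prepended as described above, the relationship between \( T \) and \( T' \) can be sanity-checked directly (a minimal sketch, assuming a working `MultiModalTokenizer` install):

```python
from zeta import MultiModalTokenizer

tokenizer = MultiModalTokenizer()
tokenized, text_only = tokenizer.tokenize_texts(["Hello World"])
# T' should be exactly two tokens (<image> and </image>) longer than T.
assert tokenized.shape[-1] == text_only.shape[-1] + 2
```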
## Additional Tips

- Ensure you have the required model weights and configurations for the specified pretrained models ("laion/CLIP-ViT-L-14-laion2B-s32B-b82K" and "EleutherAI/gpt-neox-20b") downloaded or accessible from the Hugging Face Model Hub.
- Handle potential tokenization errors gracefully using try-except blocks, as demonstrated in the provided methods and in the pattern below.
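A minimal error-handling pattern (illustrative; the exception types raised depend on the underlying Hugging Face tokenizers):

```python
from zeta import MultiModalTokenizer

tokenizer = MultiModalTokenizer()
try:
    tokens, _ = tokenizer.tokenize_texts(["Hello World"])
except Exception as e:
    # Log the failure and re-raise (or substitute a fallback) instead of
    # letting the error propagate silently through the pipeline.
    print(f"Tokenization failed: {e}")
    raise
```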
## References and Resources

- CLIP: Connecting Text and Images (OpenAI)
- Hugging Face Transformers library
- Hugging Face Transformers documentation on special tokens