UTW 2025

ICU/CLDR
Internationalization

Breaking Text Barriers: Adobe's Unified Tokenization Engine for Web I18n

Shashank Oberoi

On Wed, 13:50, for 40 min

This session shares insights from Adobe’s development of a unified tokenization engine that addresses complex internationalization challenges for the diverse text found in web applications such as Adobe Express, including mixed-language content, URLs, email addresses, and emoji. We present our approach of using the ICU4C library's break iterator and related functionality through WebAssembly to enable advanced web features such as multilingual spellcheck. We show how we coupled tokenization with language and script detection at near-native performance, and how we overcame browser size limitations through optimized handling of the associated multilingual data.

We will demonstrate how this unified solution supports sophisticated segmentation of diverse web text through a layered approach: boundary detection across scripts, proper parsing of structured content such as web and email addresses, intelligent emoji handling, and context-aware processing for multilingual environments. The talk covers key architectural decisions, implementation strategies, the size constraints that multilingual data imposes in the browser, and lessons learned while building a comprehensive system that brings desktop-class Unicode processing to the web.

Attendees will gain practical insights into developing similar unified tokenization approaches for their own applications and learn about the trade-offs and benefits of our ICU-WebAssembly integration. We hope to evangelize this unified tokenization approach for broader web i18n adoption and to gather community feedback on extending it to similar text processing challenges, with the objective of enabling sophisticated multilingual features across platforms and use cases.
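To make the layered idea in the abstract concrete, here is a minimal sketch in Python. It is not Adobe's engine and does not use ICU: the names `tokenize` and `_segment_plain`, the two regular expressions, and the character-category rules are all assumptions chosen for illustration, built only on the standard library. It pulls out email and web addresses first, then groups letters, marks, and digits into words, then keeps emoji sequences (skin-tone modifiers, ZWJ joins) together. Dictionary-based segmentation for scripts such as Chinese or Thai, which a real break iterator provides, is out of scope here.

```python
import re
import unicodedata

# Layer 1: structured content that must survive as one token.
# Both patterns are simplified illustrations, not full URL/email grammars.
_STRUCTURED = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<url>(?:https?://|www\.)[^\s<>]*[^\s<>.,;:!?)])",
    re.IGNORECASE,
)

ZWJ = "\u200d"   # zero width joiner, glues emoji into one sequence
VS16 = "\ufe0f"  # variation selector-16, requests emoji presentation


def _is_word_start(ch):
    # Letters and numbers can begin a word.
    return unicodedata.category(ch)[0] in "LN"


def _is_word_part(ch):
    # Combining marks (e.g. Devanagari vowel signs) stay inside the word.
    return unicodedata.category(ch)[0] in "LMN"


def _is_emoji_base(ch):
    # Assumption: treat general category "So" as emoji-like. This is a
    # rough stand-in for the real Unicode emoji properties.
    return unicodedata.category(ch) == "So"


def _is_emoji_modifier(ch):
    # VS16 or a skin-tone modifier (U+1F3FB..U+1F3FF).
    return ch == VS16 or 0x1F3FB <= ord(ch) <= 0x1F3FF


def _segment_plain(text):
    """Layers 2 and 3: words, emoji sequences, everything else."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        start, ch = i, text[i]
        if ch.isspace():
            i += 1
            continue
        if _is_word_start(ch):
            kind = "word"
            while i < n and _is_word_part(text[i]):
                i += 1
        elif _is_emoji_base(ch):
            kind = "emoji"
            i += 1
            while i < n:
                if _is_emoji_modifier(text[i]):
                    i += 1
                elif text[i] == ZWJ and i + 1 < n and _is_emoji_base(text[i + 1]):
                    i += 2  # consume the joiner and the joined emoji
                else:
                    break
        else:
            kind = "other"  # punctuation and anything unclassified
            i += 1
        tokens.append((kind, text[start:i]))
    return tokens


def tokenize(text):
    """Return (kind, text) pairs; kind is email, url, word, emoji, or other."""
    tokens, pos = [], 0
    for m in _STRUCTURED.finditer(text):
        tokens.extend(_segment_plain(text[pos:m.start()]))
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    tokens.extend(_segment_plain(text[pos:]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Mail bob@example.com or see https://example.com/a?b=1, thanks!"):
        print(token)
```

Running the structured-content pass before word segmentation is the key ordering choice in this sketch: a generic word-boundary pass would otherwise split an address at its dots, slashes, and `@` sign, and the pieces would be hard to reassemble afterwards.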
