UTW 2025

ICU/CLDR
Internationalization

No Spaces? No Problem! Segmenting Complex Scripts with Machine Learning

Shane Carr

on Wed, 14:50 for 40 min

Many languages, including Thai and Japanese, do not use spaces between words. Have you ever thought about how a machine can figure out where word boundaries occur so that it can perform text layout and other tasks?

This talk will explore how machine learning and artificial intelligence have helped us build text segmentation models for these languages, with the help of passionate contributors from Google Summer of Code. It will cover the classical dictionary-based models and the progression to more sophisticated models, including LSTM (long short-term memory), CNN (convolutional neural network), and AdaBoost (adaptive boosting) models. It will show how these newer models have both improved accuracy and reduced model size, and how to use them in ICU4X and ICU.
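
For readers unfamiliar with the classical dictionary-based approach mentioned above, the following is a minimal sketch of one common variant, greedy longest-match segmentation. It is an illustration only, not the ICU or ICU4X implementation; the function name `segment` and the two-word toy dictionary are assumptions made for this example.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest match against a word set.

    Illustrative sketch only: real segmenters use much larger dictionaries
    and handle ambiguity, grapheme clusters, and unknown words more carefully.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit a single code point.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical toy dictionary: "ภาษา" (language) and "ไทย" (Thai).
toy_dictionary = {"ภาษา", "ไทย"}
print(segment("ภาษาไทย", toy_dictionary))
```

A greedy matcher like this commits to the longest word at each position and cannot recover from a wrong early choice or cope with words missing from the dictionary, which is the kind of limitation that motivates the learned models the talk describes.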
