Tailoring UAX#29 Word Breaking for Arabic Text From the Wild
Intended Audience: Software Engineers
Session Level: Intermediate
Unicode Standard Annex #29, Text Boundaries, provides guidelines for determining the boundaries between text elements in Unicode-encoded text. In particular, it describes a general-purpose (but tailorable) set of rules for determining word boundaries. These rules work well in many cases, but when faced with Arabic text found on the World Wide Web they show several limitations caused by the loose orthographic conventions seen in "real" text. This paper outlines the orthographic complexity of such text, presents some of the problems it causes, shows how the UAX #29 word boundary rules can be tailored to account for them, and shows how the tailored rules can be implemented efficiently with a state machine.
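As a rough illustration of the state-machine idea, the following Python sketch implements a heavily simplified word breaker with a tailoring hook. It is not the implementation from the paper and not the full UAX #29 rule set: the character classes are approximated from general categories, the names (char_class, split_words, the overrides parameter) are made up for this example, and the tailoring shown (always breaking at a full stop) is a hypothetical case chosen only because the default rules keep letter + full stop + letter together as one word.

```python
import unicodedata

# Simplified character classes; a stand-in for the real Word_Break property.
LETTER, EXTEND, MID, OTHER = "LETTER", "EXTEND", "MID", "OTHER"
MID_CHARS = {".", "'", "\u2019"}  # rough stand-in for MidNumLet / Single_Quote

def char_class(ch, overrides=None):
    """Classify one character; `overrides` is the (hypothetical) tailoring hook."""
    if overrides and ch in overrides:
        return overrides[ch]
    if ch in MID_CHARS:
        return MID
    cat = unicodedata.category(ch)
    if cat[0] in "LN":               # letters and numbers
        return LETTER
    if cat[0] == "M" or cat == "Cf":  # combining marks and format characters
        return EXTEND
    return OTHER

OUTSIDE, IN_WORD, AFTER_MID = 0, 1, 2

# (state, class) -> next state
TRANSITIONS = {
    (OUTSIDE, LETTER): IN_WORD,   (OUTSIDE, EXTEND): OUTSIDE,
    (OUTSIDE, MID): OUTSIDE,      (OUTSIDE, OTHER): OUTSIDE,
    (IN_WORD, LETTER): IN_WORD,   (IN_WORD, EXTEND): IN_WORD,
    (IN_WORD, MID): AFTER_MID,    (IN_WORD, OTHER): OUTSIDE,
    # A MID character joins two letters (compare WB6/WB7); anything else
    # ends the word just before the MID character.
    (AFTER_MID, LETTER): IN_WORD, (AFTER_MID, EXTEND): AFTER_MID,
    (AFTER_MID, MID): OUTSIDE,    (AFTER_MID, OTHER): OUTSIDE,
}

def split_words(text, overrides=None):
    """Return the words found in `text` using the table above."""
    words, state, start, mid_pos = [], OUTSIDE, 0, 0
    for i, ch in enumerate(text):
        nxt = TRANSITIONS[(state, char_class(ch, overrides))]
        if state == OUTSIDE and nxt == IN_WORD:
            start = i                          # a word begins here
        elif state == IN_WORD and nxt == AFTER_MID:
            mid_pos = i                        # remember where the MID char is
        elif state == IN_WORD and nxt == OUTSIDE:
            words.append(text[start:i])
        elif state == AFTER_MID and nxt == OUTSIDE:
            words.append(text[start:mid_pos])  # drop the trailing MID char
        state = nxt
    if state == IN_WORD:
        words.append(text[start:])
    elif state == AFTER_MID:
        words.append(text[start:mid_pos])
    return words

kitab = "\u0643\u062a\u0627\u0628"  # Arabic "book"
qalam = "\u0642\u0644\u0645"        # Arabic "pen"
text = kitab + "." + qalam          # full stop with no following space

default_words = split_words(text)                          # one joined word
tailored_words = split_words(text, overrides={".": OTHER})  # two words
```

Because the tailoring only changes the class assigned to a character, the transition table and the scanning loop stay untouched; each input character still costs one dictionary lookup.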