Chinese Word Breaking

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 21 Jul 2015 07:56:33 +0100

I'm puzzled by a statement in UAX #29 Unicode Text Segmentation:

"In particular, the characters with the Line_Break property values of
Contingent_Break (CB), Complex_Context (SA/Southeast Asian), and
Unknown (XX) are assigned word boundary property values based on
criteria outside of the scope of this annex. That means that
satisfactory treatment of languages like Chinese or Thai requires
special handling."

Is 'Contingent_Break (CB)' an error for 'Ideographic (ID)'? That would
make sense for Chinese, since some applications need to group
ideographs into words.
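
To make concrete what I mean by grouping ideographs into words, here is
a minimal sketch of greedy forward maximum matching against a word list.
It is only an illustration and is not anything UAX #29 specifies: the
function name, the toy dictionary and the max_len limit are all
assumptions of mine, and a real segmenter would need a far larger
lexicon and some way of resolving ambiguity.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch only).

    At each position, take the longest substring (up to max_len
    characters) found in `dictionary`; if nothing matches, fall back
    to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, invented for this example.
TOY_DICTIONARY = {"中文", "分词", "需要", "词典"}

print(segment("中文分词需要词典", TOY_DICTIONARY))
# → ['中文', '分词', '需要', '词典']
```

With an empty dictionary the same function yields one ideograph per
"word", which shows how much the result depends on information outside
the text itself.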

While I am on the topic, does anyone know of any character-level
mechanisms used to advise algorithms of the word boundaries (or lack
of boundaries) in Chinese text?

Richard.
Received on Tue Jul 21 2015 - 01:57:42 CDT
