From: Philippe Verdy (email@example.com)
Date: Thu Nov 20 2003 - 08:52:20 EST
From: "Peter Kirk" <firstname.lastname@example.org>
> As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ
> are not listed, and so as Cf characters are ignored in the line breaking
> algorithm. I note also that the combining mark CGJ is listed as GL and
> so is not CM. The descriptive text of rules LB7a-c implies that CM =
> combining mark whereas this is not in fact true; some combining marks
> are not CM and some CM are not combining marks. In rule LB7b the term
> "combining character sequence" is used, contrary to its regular defined
> meaning, for a sequence of CM characters and the preceding non-CM
This is further evidence that even Unicode's exact terminology must be used
with extreme care, as there are many exceptions, even in _standard_ technical
reports such as the UAXs.
If it were possible, I would suggest performing an audit of the terminology
and classification of all character categories, including in the UTSs. For
now it is just too complicated to comply with each UTR (or even only with the
UAXs and UTSs), as one needs to check simultaneously many sometimes
"conflicting" properties used by the various technical reports.
We need a comprehensive new technical report that lists all the exceptions
to the general category system, as these line-breaking or word-breaking or
grapheme cluster breaking properties are orthogonal to the basic GC system
and to the combining class system.
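
As a rough illustration of this orthogonality, here is a minimal Python
sketch that puts the general category, the canonical combining class and the
line-breaking class of a few characters side by side. The general category
and combining class come from Python's standard unicodedata module. The
line-breaking classes are NOT available there, so the small LINE_BREAK_CLASS
table (like the names describe and marks_not_cm) is a hand-written,
hypothetical excerpt for illustration only; a real tool would parse
LineBreak.txt from the Unicode Character Database.

```python
import unicodedata

# Hypothetical hand-written excerpt of UAX #14 line-breaking classes.
# Assumption: a real tool would load these from LineBreak.txt instead.
LINE_BREAK_CLASS = {
    0x0301: "CM",  # COMBINING ACUTE ACCENT
    0x034F: "GL",  # COMBINING GRAPHEME JOINER (GL, as noted in the quoted message)
    0x2060: "WJ",  # WORD JOINER
}


def describe(cp):
    """Collect the three independent classifications of one code point."""
    ch = chr(cp)
    gc = unicodedata.category(ch)
    return {
        "general_category": gc,
        "combining_class": unicodedata.combining(ch),
        "line_break": LINE_BREAK_CLASS.get(cp),
        "is_mark": gc.startswith("M"),  # Mn, Mc or Me
    }


def marks_not_cm():
    """Code points in the excerpt that are marks but whose class is not CM."""
    return [
        cp
        for cp in sorted(LINE_BREAK_CLASS)
        if describe(cp)["is_mark"] and LINE_BREAK_CLASS[cp] != "CM"
    ]


if __name__ == "__main__":
    for cp in sorted(LINE_BREAK_CLASS):
        print("U+%04X" % cp, describe(cp))
    print("marks that are not CM:", ["U+%04X" % cp for cp in marks_not_cm()])
```

With this excerpt, CGJ is the one character reported as a combining mark by
general category while not being CM for line breaking, which is exactly the
kind of exception such a report would have to list.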