From: Philippe Verdy (email@example.com)
Date: Wed Aug 06 2003 - 19:41:41 EDT
On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler wrote:
> Well, yes, which is why I have been advocating it as the
> solution to the Biblical Hebrew text representation problem.
> I agree with you about that. But it need not be characterized
> as "legal" in opposition to the other examples I cited above.
> All of these sequences are "legal" and allowed by the
Once again, sorry if I used the terms "ill-formed" or "well-formed"
instead of "defective" or "non-defective" (normal?). This distinction
in the standard does not make discussions of text-processing
interoperability any easier to follow: neither ill-formed nor
defective sequences should be used when interoperability is the
main focus (as it normally is in the design of Unicode).
Canonical equivalence (NFC, NFD, canonical ordering) is now
needed for XML processing, and in fact it greatly reduces
the number of ill-formed, invalid, defective, or otherwise
badly encoded sequences of actual text, which simplifies processing.
Still, these equivalences don't solve every issue, and they create
some of their own (which is now a good reason to use CGJ to override
the canonical ordering of combining diacritics).
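To make the reordering concrete, here is a minimal sketch using Python's standard unicodedata module. The choice of ring above and cedilla as the two diacritics, and the variable names, are only for illustration; the sketch just shows that normalization re-sorts the two marks by combining class unless a CGJ (combining class 0) sits between them.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
CEDILLA = "\u0327"     # COMBINING CEDILLA, combining class 202
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0

# Typed order: base, ring above, cedilla (not the canonical order).
plain = "a" + RING_ABOVE + CEDILLA
# Same order, with a CGJ between the two diacritics.
joined = "a" + RING_ABOVE + CGJ + CEDILLA

# Canonical ordering sorts adjacent marks by combining class,
# so the cedilla (202) moves in front of the ring above (230).
reordered = unicodedata.normalize("NFD", plain)

# The CGJ has class 0, so the marks on either side of it are
# never compared with each other and the typed order survives.
preserved = unicodedata.normalize("NFD", joined)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```
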
Of course, there may be many strings created with Unicode
that are not "ill-formed" and yet not canonically equivalent (per
NFC, NFD, canonical ordering), but I won't go into that area.
For XML, what is relevant is that it processes strings in NFC
form and thus involves only canonical equivalences, but XML
will still handle "defective" sequences by correctly
processing characters according to their canonical combining sequences.
I'd like to see a more formal rule for defective uses of CGJ when it
is used to fix canonical ordering. What I suggested was to specify that
only some sequences with CGJ would be "non-defective": those where
the CGJ appears before a base character or between two
combining characters. The character model then needs to be
refined to document precisely which uses are
considered non-defective and which are not.
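As a rough illustration of what such a rule could look like in code, here is a sketch in Python. It is only my reading of the suggestion above, not anything defined by the standard: the function name cgj_uses and the three labels are hypothetical, and "combining character" is approximated as any character whose general category starts with "M" (other than the CGJ itself).

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def is_mark(ch):
    # Assumption: a "combining character" is any mark (Mn, Mc, Me)
    # other than the CGJ itself.
    return ch != CGJ and unicodedata.category(ch).startswith("M")

def cgj_uses(text):
    """Label every CGJ in text according to the suggested rule.

    Returns a list of (index, label) pairs, where label is
    "before_base", "between_marks", or "other". Only "other" would
    be a candidate for "defective" under the suggestion.
    """
    found = []
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        prev = text[i - 1] if i > 0 else None
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt is not None and nxt != CGJ and not is_mark(nxt):
            label = "before_base"
        elif prev is not None and nxt is not None and is_mark(prev) and is_mark(nxt):
            label = "between_marks"
        else:
            label = "other"
        found.append((i, label))
    return found
```

For example, cgj_uses("a\u030A\u034F\u0327") reports the CGJ at index 2 as "between_marks", while a CGJ at the very end of a string is reported as "other".
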
So a sequence <..., ring above, CGJ, cedilla, ...> would
not be defective, as it fixes the canonical ordering, even if
in this case the diacritics do not interact graphically (note that this
statement supposes that the cedilla actually appears
below, which is not true for some languages,
where the cedilla in fact appears like an acute accent
The example of rendering diacritics at the
placement presupposed by their combining class
is significant: it shows that combining classes capture only
some common placement rules, not every case, and
a particular language or renderer may need to place
diacritics in other positions, in which case the canonical
ordering would affect rendering. That is a
good enough reason to justify and document the use of
CGJ as a combining class override for diacritics, although its
usage should be restricted for interoperability.
This has a consequence for input methods and editors:
users can type base characters and diacritics, and the
editor will, by default, use canonical ordering, which the user
may fix if needed for a particular language with a control
command that would "swap" two misplaced diacritics,
automatically inserting a CGJ only if needed because the two
diacritics have distinct combining classes. This editor control
command would have no other effect if executed after two
diacritics with identical combining classes, or after a single diacritic,
and the editor should make its best effort not to let the user
enter ill-formed or defective sequences.
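The "swap" command could be sketched roughly as follows in Python. This is a hypothetical illustration of one possible reading, not a description of any existing editor: the function name is made up, it only looks at the last two characters of the text, and it takes "only if needed" to mean "only if canonical ordering would otherwise undo the swap".

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def swap_last_two_diacritics(text):
    """Swap the two diacritics at the end of text, adding a CGJ if needed."""
    if len(text) < 2:
        return text
    first, second = text[-2], text[-1]
    ccc_first = unicodedata.combining(first)
    ccc_second = unicodedata.combining(second)
    # Fewer than two trailing diacritics (class 0 means a base
    # character, a CGJ, etc.): the command does nothing.
    if ccc_first == 0 or ccc_second == 0:
        return text
    head = text[:-2]
    if ccc_second > ccc_first:
        # After the swap the higher class comes first, so canonical
        # ordering would sort the marks back; the CGJ prevents that.
        return head + second + CGJ + first
    # Identical classes (normalization keeps their relative order) or
    # a swap that yields canonical order: no CGJ is needed.
    return head + second + first
```

Starting from the canonical order <a, cedilla, ring above>, the command yields <a, ring above, CGJ, cedilla>, which NFD leaves untouched. With two diacritics of the same class, such as acute and grave, it simply exchanges them. Running it a second time on a result that already contains a CGJ does nothing in this sketch; a real editor would probably want to remove the CGJ again.
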
-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
This archive was generated by hypermail 2.1.5 : Wed Aug 06 2003 - 21:37:27 EDT