From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 08 2004 - 17:53:31 CST
John Cowan <jcowan@reutershealth.com> writes:
>> String equality in a programming language should not treat composed
>> and decomposed forms as equal. Not this level of abstraction.
>
> Well, that assumes that there's a special "string equality" predicate,
> as distinct from just having various predicates that DWIM.
No, I meant the default generic equality predicate when applied to two
strings.
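
A minimal Python sketch of that distinction, offered only as an illustration: Python's built-in `==` on `str` compares code point sequences, so it happens to behave like the default predicate described above. The helper name `canonically_equal` is hypothetical, and the choice of NFC for the comparison is an assumption.

```python
import unicodedata

composed = "\u00C1"      # LATIN CAPITAL LETTER A WITH ACUTE, one code point
decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Default generic equality: compares the code point sequences as they are.
default_equal = composed == decomposed


def canonically_equal(a: str, b: str) -> bool:
    """Separate, explicitly requested predicate (hypothetical helper):
    compare the two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(default_equal)
print(canonically_equal(composed, decomposed))
```

The normalizing comparison is available, but it is something the caller asks for by name rather than the meaning of `==`.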
> It's a broken opening tag.
Ok, so it's the conversion from raw text to escaped character
references which should treat combining characters specially.
What about < with combining acute, which doesn't have a precomposed
form? A broken opening tag or a valid text character?
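
A small Python sketch of this boundary case, using only the standard library. It assumes a naive escaper (`html.escape`) that works code point by code point. After escaping, the accent no longer follows the `<` but the semicolon that ends `&lt;`.

```python
import html
import unicodedata

COMBINING_ACUTE = "\u0301"

# "<" followed by a combining acute accent.
raw = "<" + COMBINING_ACUTE

# There is no precomposed form, so NFC leaves both code points in place.
nfc = unicodedata.normalize("NFC", raw)

# A code-point-level escaper replaces only the "<" and leaves the accent,
# which now sits directly after the ";" of the entity reference.
escaped = html.escape(raw)

print(len(nfc))
print([f"U+{ord(c):04X}" for c in escaped])
```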
What about &#65;ACUTE where ACUTE stands for combining acute? Is this
A with acute, or a broken character reference which ends with an
accented semicolon?
If it's a broken character reference, then what about &#65;&#769; (769 is
the code for combining acute if I'm not mistaken)? If *this* is A with
acute, then it's inconsistent: here combining accents are processed
after resolving numeric character references, and previously it was
in the opposite order. OTOH if this is something else, then it's
impossible to represent letters without precomposed forms with numeric
character references.
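
The ordering problem can be sketched in Python. Everything here is an assumption made for illustration: `clusters` is a rough stand-in for grapheme segmentation (a base code point plus any following combining marks, per `unicodedata.combining`), and the two resolver functions are hypothetical, handling decimal references only. Neither is how any real XML parser works.

```python
import re
import unicodedata

DIGITS = "0123456789"


def clusters(text):
    """Group each base code point with the combining marks that follow it.
    This is an approximation, not full grapheme cluster segmentation."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def resolve_on_code_points(text):
    """Resolve decimal references while ignoring combining marks entirely."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


def resolve_on_clusters(text):
    """Resolve decimal references only when '&', '#', the digits and ';'
    are each a bare cluster, i.e. carry no combining marks."""
    cl = clusters(text)
    out, i = [], 0
    while i < len(cl):
        if cl[i] == "&" and i + 1 < len(cl) and cl[i + 1] == "#":
            j = i + 2
            while j < len(cl) and len(cl[j]) == 1 and cl[j] in DIGITS:
                j += 1
            if j > i + 2 and j < len(cl) and cl[j] == ";":
                out.append(chr(int("".join(cl[i + 2:j]))))
                i = j + 1
                continue
        out.append(cl[i])  # not a well-formed reference: keep as text
        i += 1
    return "".join(out)


accented_semicolon = "&#65;\u0301"  # reference whose ';' carries the accent
two_references = "&#65;&#769;"      # both code points written as references

print(ascii(resolve_on_code_points(accented_semicolon)))
print(ascii(resolve_on_clusters(accented_semicolon)))
print(ascii(resolve_on_clusters(two_references)))
```

Under these assumptions the cluster-level resolver leaves the first input unresolved, because the accented semicolon does not terminate the reference. It still turns the second input into A followed by a combining acute, so there the accent is effectively applied after the references are resolved.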
The general trouble is that numeric character references can only
encode individual code points rather than graphemes (is this a correct
term for a non-combining code point with a sequence of combining code
points?). So if XML is supposed to be treated as a sequence of
graphemes, weird effects arise in the above boundary cases...
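
As a rough Python illustration of the code-point granularity: the hypothetical helper `to_numeric_references` below writes one decimal reference per code point, and `html.unescape` from the standard library resolves them again. The letter q with a combining acute is used as an example of a sequence with no precomposed form.

```python
import html
import unicodedata


def to_numeric_references(text):
    """Hypothetical helper: write every code point as a decimal reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# "q" with combining acute: no precomposed form exists, so NFC keeps
# the two code points separate.
q_acute = unicodedata.normalize("NFC", "q\u0301")

encoded = to_numeric_references(q_acute)
decoded = html.unescape(encoded)

print(len(q_acute))
print(encoded)
print(decoded == q_acute)
```

One user-perceived letter needs two references here, and nothing in the reference syntax marks them as belonging together.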
-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/