Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk
Date: Wed Dec 08 2004 - 17:53:31 CST

    John Cowan writes:

    >> String equality in a programming language should not treat composed
    >> and decomposed forms as equal. Not this level of abstraction.
    > Well, that assumes that there's a special "string equality" predicate,
    > as distinct from just having various predicates that DWIM.

    No, I meant the default generic equality predicate when applied to two

    > It's a broken opening tag.

    Ok, so it's the conversion from raw text to escaped character
    references which should treat combining characters specially.

    What about < with combining acute, which doesn't have a precomposed
    form? A broken opening tag or a valid text character?
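The "<" case can be tried against a real parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree (backed by expat) as the parser under test; the helper name is_well_formed is made up for illustration. It only shows how one code-point-based parser behaves and says nothing about how a grapheme-based design ought to behave.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def is_well_formed(doc: str) -> bool:
    """Hypothetical helper: True if this parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A raw "<" followed by a combining acute inside text content.
raw = "<a>1 <" + COMBINING_ACUTE + " 2</a>"

# The same text with "<" escaped; the combining acute follows "&lt;".
escaped = "<a>1 &lt;" + COMBINING_ACUTE + " 2</a>"

print(is_well_formed(raw))
print(is_well_formed(escaped))
print(ascii(ET.fromstring(escaped).text))
```

With this parser the raw "<" still starts markup whatever follows it, so the first document is rejected, while the escaped form is accepted and yields "<" followed by the combining acute as text.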

    What about &#65;ACUTE where ACUTE stands for combining acute? Is this
    A with acute, or a broken character reference which ends with an
    accented semicolon?

    If it's a broken character reference, then what about A&#769; (769 is
    the code for combining acute if I'm not mistaken)? If *this* is A with
    acute, then it's inconsistent: here combining accents are processed
    after resolving numeric character references, and previously it was
    in the opposite order. OTOH if this is something else, then it's
    impossible to represent letters without precomposed forms with numeric
    character references.
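The two orderings above can be compared directly. This is a sketch under the assumption that the parser works on code points, using Python's standard xml.etree.ElementTree and unicodedata modules; the element name t and the variable names are arbitrary.

```python
import unicodedata
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"   # code point 769
A_WITH_ACUTE = "\u00c1"      # precomposed LATIN CAPITAL LETTER A WITH ACUTE

# "&#65;" followed by a raw combining acute.
ref_then_raw = ET.fromstring("<t>&#65;" + COMBINING_ACUTE + "</t>").text

# Raw "A" followed by "&#769;".
raw_then_ref = ET.fromstring("<t>A&#769;</t>").text

# Both spellings resolve to the same two code points.
print(ascii(ref_then_raw), ascii(raw_then_ref))
print(ref_then_raw == raw_then_ref)

# Plain string equality does not identify composed and decomposed forms.
print(raw_then_ref == A_WITH_ACUTE)

# They compare equal only after an explicit normalization step.
print(unicodedata.normalize("NFC", raw_then_ref) == A_WITH_ACUTE)
```

A code-point-level parser sidesteps the ordering question: both inputs give "A" plus a combining acute, and any composition is left to a separate, explicit normalization step.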

    The general trouble is that numeric character references can only
    encode individual code points rather than graphemes (is this a correct
    term for a non-combining code point with a sequence of combining code
    points?). So if XML is supposed to be treated as a sequence of
    graphemes, weird effects arise in the above boundary cases...
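To make the mismatch concrete, here is a rough sketch that groups a string into a base code point plus its trailing combining marks (Unicode calls this a combining character sequence) and then writes each group as numeric character references. The grouping rule is a deliberate simplification based on unicodedata.combining, not the full grapheme cluster algorithm of UAX #29, and the function names are hypothetical.

```python
import unicodedata


def combining_sequences(text: str) -> list[str]:
    """Split text into base code point + following combining marks.

    Simplified: a code point with a nonzero canonical combining class is
    attached to the preceding group. This is not full UAX #29 segmentation.
    """
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def to_char_refs(group: str) -> str:
    """Encode every code point of a group as a decimal character reference."""
    return "".join(f"&#{ord(ch)};" for ch in group)


text = "A\u0301b<\u0301"
groups = combining_sequences(text)
print([ascii(g) for g in groups])
print([to_char_refs(g) for g in groups])

# "<" plus combining acute has no precomposed form, so NFC leaves it as two
# code points and it still needs two character references.
print(len(unicodedata.normalize("NFC", "<\u0301")))
```

One perceived character can need several references, one per code point, which is why a reference on its own cannot name an accented letter that lacks a precomposed form.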

       __("<         Marcin Kowalczyk

    This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 17:58:07 CST