Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk
Date: Wed Dec 08 2004 - 16:41:25 CST

    "D. Starner" writes:

    > The semantics there are surprising, but that's true no matter what you
    > do. An NFC string + an NFC string may not be NFC; the resulting text
    > doesn't have N+M graphemes.

    Which implies that automatically NFC-ing strings as they are processed
    would be a bad idea. They can be NFC-ed at the end of processing if the
    consumer of this data will demand this. Especially if other consumers
    would want NFD.
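
    A minimal sketch of the quoted point, using Python's standard
    unicodedata module (the is_nfc helper and the sample strings are
    illustrative assumptions, not anything from this thread):

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

left = "e"          # U+0065, already in NFC
right = "\u0301x"   # isolated U+0301 COMBINING ACUTE ACCENT, then "x"; also NFC

joined = left + right                            # 3 code points
normalized = unicodedata.normalize("NFC", joined)

print(is_nfc(left), is_nfc(right))   # both halves are NFC
print(is_nfc(joined))                # the concatenation is not
print(len(joined), len(normalized))  # normalizing changes the length
```

    Normalizing once, at the boundary where a consumer demands a
    particular form, avoids re-normalizing after every concatenation.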

    String equality in a programming language should not treat composed
    and decomposed forms as equal. Not at this level of abstraction.
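
    For illustration, this is how Python already behaves: the built-in
    == compares code points, and canonical equivalence is a separate,
    explicit operation. The canonically_equal helper below is a
    hypothetical name for this sketch.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence via NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(composed == decomposed)                   # code point comparison
print(canonically_equal(composed, decomposed))  # explicit, higher-level comparison
```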

    IMHO splitting into graphemes is the job of a rendering engine, not of
    a function which extracts a part of a string which matches a regex.

    > If you do so with a language that includes <, you violate the Unicode
    > standard, because <&#824; (not <) and &#8814; are canonically equivalent.

    I think that Unicode tries to push implications of "equivalence"
    too far.

    They are supposed to be equivalent when they are actual characters.
    What if they are numeric character references? Should "<&#824;"
    (7 characters) represent a valid plain-text character or be a broken
    opening tag?

    Note that if it's a valid plain-text character, it's impossible
    to represent isolated combining code points in XML, and thus it's
    impossible to use XML for transportation of data which allows isolated
    combining code points (except by introducing custom escaping of
    course, e.g. transmitting decimal numbers instead of characters).
    I expect breakage of XML-based protocols if implementations are
    actually changed to conform to these rules (I bet they don't now).
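
    A small sketch of the problem in Python, using only the standard
    library. The sample strings and the element name "a" are assumptions
    made for illustration. It shows that the reference expands to an
    isolated combining code point, and that NFC-normalizing markup which
    contains such a code point as an actual character absorbs the
    adjacent markup delimiter.

```python
import html
import unicodedata
import xml.etree.ElementTree as ET

ncr_text = "<&#824;"                 # 7 characters: "<" plus a numeric character reference
decoded = html.unescape(ncr_text)    # "<" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY
composed = unicodedata.normalize("NFC", decoded)  # U+226E NOT LESS-THAN

# An XML parser expands the reference to an isolated combining code point.
element = ET.fromstring("<a>&#824;</a>")
isolated = element.text              # "\u0338"

# The same content written with the actual character: NFC merges the ">"
# that closes the start tag with the combining code point (U+226F).
raw_markup = "<a>\u0338</a>"
normalized_markup = unicodedata.normalize("NFC", raw_markup)

print(len(ncr_text), len(decoded), len(composed))
print(raw_markup != normalized_markup)
```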

    OTOH if it's not a valid plain-text character, then conversion between
    numeric character references and actual characters is getting more

    > I'll see if I have time after finals to pound out a basic API that
    > implements this, in Ada or Lisp or something.

    My language is quite similar to Lisp semantically.

    Implementing an API which works in terms of graphemes over an API
    which works in terms of code points is more sane than the converse,
    which suggests that the core API should use code points if both APIs
    are sometimes needed at all.
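
    A rough sketch of that layering in Python, where strings are
    sequences of code points. The clusters function is a hypothetical
    name, and its rule (a code point plus any following code points with
    a nonzero canonical combining class) is a deliberate simplification,
    not the full grapheme cluster algorithm of UAX #29.

```python
import unicodedata
from typing import Iterator, List

def clusters(s: str) -> Iterator[str]:
    """Yield approximate grapheme clusters built on top of code points.

    Simplified rule: a cluster is one code point followed by any code
    points whose canonical combining class is nonzero.
    """
    current = ""
    for ch in s:
        if current and unicodedata.combining(ch) == 0:
            # A new base code point starts a new cluster.
            yield current
            current = ch
        else:
            # First code point, or a combining mark attached to the current cluster.
            current += ch
    if current:
        yield current

def cluster_count(s: str) -> int:
    """Length in approximate grapheme clusters rather than code points."""
    return sum(1 for _ in clusters(s))

text = "e\u0301x"                      # 3 code points
parts: List[str] = list(clusters(text))
print(len(text), cluster_count(text))
```

    The code point API stays the primitive; the cluster view is an
    iterator layered over it and costs nothing for callers that never
    ask for it.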

    While I'm not obsessed with efficiency, it would be nice if changing
    the API would not slow down string processing too much.

       __("<         Marcin Kowalczyk
