Re: Nicest UTF

From: D. Starner (shalesller@writeme.com)
Date: Wed Dec 08 2004 - 15:51:47 CST

    "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> writes:
    > "D. Starner" <shalesller@writeme.com> writes:
    >
    > > You could hide combining characters, which would be extremely useful if we were just using Latin
    > > and Cyrillic scripts.
    >
    > It would need a separate API for examining the contents of a combining
    > character. You can't avoid the sequence of code points completely.

    Not a separate API; a function that takes a character and returns an array
    of integers, something like the sketch below.
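    For instance (a rough Python sketch of what I mean; the function names and
    the simplified base-plus-combining-marks clustering are placeholders of
    mine, not anything standardized, and real grapheme segmentation handles
    more cases):

        import unicodedata

        def graphemes(s):
            # Simplified clustering: a base code point plus any trailing
            # combining marks. Real Unicode text-boundary rules cover more
            # cases (Hangul jamo, etc.); this is only a sketch.
            cluster = ""
            for cp in s:
                if cluster and unicodedata.combining(cp) == 0:
                    yield cluster
                    cluster = ""
                cluster += cp
            if cluster:
                yield cluster

        def code_points(ch):
            # The proposed accessor: one "character" in, an array of
            # integers out.
            return [ord(cp) for cp in ch]

        for ch in graphemes("ya\u0301"):
            print(ch, code_points(ch))   # y [121], then á [97, 769]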

    > It would yield to surprising semantics: for example if you concatenate
    > a string with N+1 possible positions of an iterator with a string with
    > M+1 positions, you don't necessarily get a string with N+M+1 positions
    > because there can be combining characters at the border.

    The semantics there are surprising, but that's true no matter what you
    do. An NFC string + an NFC string may not be NFC, in which case the
    resulting text doesn't have N+M graphemes. Unless you're explicitly
    adding a combining character, a combining character should never start a
    string. This could be fixed in several ways, including by inserting a
    dummy character to hold the combining character and "normalizing" the
    string by removing the dummy characters. That would, for the most part,
    only hurt pathological cases.
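
    To make the border effect concrete (a minimal Python sketch, using only
    the standard unicodedata module):

        import unicodedata

        s1 = unicodedata.normalize("NFC", "a")       # NFC on its own
        s2 = unicodedata.normalize("NFC", "\u0301")  # lone combining acute;
                                                     # also NFC on its own
        joined = s1 + s2

        # The concatenation of two NFC strings need not be NFC:
        # "a" + U+0301 composes to the single code point U+00E1.
        print(unicodedata.normalize("NFC", joined) == joined)          # False
        print(len(joined), len(unicodedata.normalize("NFC", joined)))  # 2 1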

    > It would impose complexity in cases where it's not needed. Most of the
    > time you don't care which code points are combining and which are not,
    > for example when you compose a text file from many pieces (constants
    > and parts filled by users) or when parsing (if a string is specified
    > as ending with a double quote, then programs will in general treat a
    > double quote followed by a combining character as an end marker).

    If you do so with a language that includes <, you violate the Unicode
    standard, because ≮ (that is, < followed by combining U+0338, not a bare <)
    and ≮ are canonically equivalent. You've either got to decompose first or
    look at whole characters instead of individual code points.
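
    Here is the < case in a few lines of Python (again assuming the standard
    unicodedata module); two canonically equivalent texts give a naive
    code-point scanner different answers:

        import unicodedata

        # U+226E NOT LESS-THAN decomposes canonically to "<" + U+0338.
        assert unicodedata.normalize("NFD", "\u226e") == "<\u0338"

        text_nfc = "x \u226e y"
        text_nfd = unicodedata.normalize("NFD", text_nfc)

        # A scanner looking for "<" code point by code point "finds" one
        # inside the decomposed form but not the composed one.
        print("<" in text_nfc)   # False
        print("<" in text_nfd)   # True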

    Has anyone considered this while defining a language? How about the official
    standards bodies? Searching for XML in the archives is a bit unhelpful, and
    UTR #20 doesn't mention the issue. Your solution is just fine if you're
    considering the issue at the bit level, but it strikes me as the wrong answer,
    and I would think it would be surprising to a user who didn't understand
    Unicode, especially in the ≮ case. A warning either way would be nice.

    I'll see if I have time after finals to pound out a basic API that implements
    this, in Ada or Lisp or something. It's not going to be the most efficient thing,
    but I doubt it will make a big difference for most programs, and if you want
    C, you know where to find it.
