Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Fri Dec 10 2004 - 13:27:04 CST

    "D. Starner" <shalesller@writeme.com> writes:

    >> String equality in a programming language should not treat composed
    >> and decomposed forms as equal. Not at this level of abstraction.
    >
    > This implies that every programmer needs an in-depth knowledge of
    > Unicode to handle simple strings.

    There is no way to avoid that.

    If the runtime automatically performed NFC on input, then a part of a
    program which is supposed to pass a string unmodified would sometimes
    modify it. Similarly with NFD.
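
    A minimal sketch of the problem, assuming Python and its standard
    unicodedata module. The function passthrough_with_nfc is a
    hypothetical stand-in for a runtime that normalizes all input; it is
    not any real runtime's API.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Plain code point equality keeps the two forms distinct.
plain_equal = composed == decomposed


def passthrough_with_nfc(s: str) -> str:
    """Hypothetical runtime hook that applies NFC to every input string."""
    return unicodedata.normalize("NFC", s)


# A string that was meant to pass through unmodified comes out changed.
changed_by_nfc = passthrough_with_nfc(decomposed) != decomposed

# The same happens in the other direction with NFD.
changed_by_nfd = unicodedata.normalize("NFD", composed) != composed


def canonically_equal(a: str, b: str) -> bool:
    """Equality up to canonical equivalence, requested explicitly by the caller."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

    Keeping canonically_equal as a separate, explicit operation leaves
    the default comparison simple and lets a program choose when
    canonical equivalence matters.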

    You can't expect each and every program which compares strings to
    perform normalization (e.g. the Linux kernel with filenames).

    Perhaps if there were a single normalization form which everybody
    agreed on, and unnormalized strings were never used for data
    interchange (if UTF-8 were specified so as to disallow unnormalized
    data, etc.), things would be different. But Unicode treats both
    composed and decomposed representations as valid.

    >> IMHO splitting into graphemes is the job of a rendering engine, not of
    >> a function which extracts a part of a string which matches a regex.
    >
    > So S should _sometimes_ match an accented S? Again, I foresee extended
    > misery explaining to people why things aren't working right.

    Well, otherwise things get ambiguous, similarly to these XML issues.
    Does "\n" followed by a combining code point start a new line? Does
    a double quote followed by a combining code point start a string
    literal? Does a slash followed by a combining code point separate
    subdirectory names?
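
    The ambiguity can be sketched in Python as follows. The function
    split_on_bare_delimiter is a hypothetical illustration of the
    alternative rule, and it assumes a simplification: it treats a code
    point as combining when unicodedata.combining() returns a nonzero
    class, which does not cover every combining mark.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT

text = "first\n" + COMBINING_ACUTE + "second"
path = "usr/" + COMBINING_ACUTE + "local"

# Code point rule: a delimiter is always a delimiter.
lines_by_code_point = text.split("\n")
parts_by_code_point = path.split("/")


def split_on_bare_delimiter(s: str, delimiter: str) -> list:
    """Hypothetical rule: a delimiter followed by a combining mark does not split."""
    parts, current = [], []
    for i, ch in enumerate(s):
        followed_by_mark = (
            i + 1 < len(s) and unicodedata.combining(s[i + 1]) != 0
        )
        if ch == delimiter and not followed_by_mark:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


# The two rules disagree about how many lines and path components there are.
lines_by_sequence = split_on_bare_delimiter(text, "\n")
parts_by_sequence = split_on_bare_delimiter(path, "/")
```

    Two programs that each pick a different rule will disagree about
    where one line or one path component ends, which is the
    incompatibility in question.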

    An iterator which delivers whole combining character sequences out of
    a sequence of code points can be used. You can also manipulate strings
    as arrays of combining character sequences. But if you insist that
    this is the primary string representation, you become incompatible
    with most programs which have different ideas about delimited strings.
    You can't expect each and every program to check combining classes
    of processed characters. It's hard enough to convince them that a
    character is not the same as a byte.
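
    Such an iterator might look like the sketch below, written in Python
    under a simplifying assumption: a combining character is any code
    point whose general category starts with "M". The name
    combining_sequences is hypothetical, and the sketch ignores the
    finer rules for extended grapheme clusters.

```python
import unicodedata
from typing import Iterator


def combining_sequences(s: str) -> Iterator[str]:
    """Yield each base code point together with the combining marks that follow it.

    A combining mark with no preceding base (at the start of the string)
    is yielded as the start of its own sequence.
    """
    current = ""
    for ch in s:
        is_mark = unicodedata.category(ch).startswith("M")
        if current and is_mark:
            # Attach the mark to the sequence being built.
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


# The same string viewed as code points and as combining character sequences.
word = "S\u0301e\u0301a"
as_code_points = list(word)
as_sequences = list(combining_sequences(word))
```

    The code point view stays the primary representation here; the
    sequence view is derived on demand by whichever code needs it.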

    >> I expect breakage of XML-based protocols if implementations are
    >> actually changed to conform to these rules (I bet they don't now).
    >
    > Really? In what cases are you storing isolated combining code points
    > in XML as text?

    When I want to circumvent security or deliberately cause a piece of
    software to misbehave. Robustness requires unambiguous and simple rules.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

