Re: Nicest UTF

From: D. Starner
Date: Fri Dec 10 2004 - 19:10:16 CST

    "Marcin 'Qrczak' Kowalczyk" writes:

    > "D. Starner" writes:
    > > This implies that every programmer needs an in-depth knowledge of
    > > Unicode to handle simple strings.
    > There is no way to avoid that.

    Then there's no way that we're ever going to get reliable Unicode

    > If the runtime automatically performed NFC on input, then a part of a
    > program which is supposed to pass a string unmodified would sometimes
    > modify it. Similarly with NFD.

    No. By the same logic you used above, I can expect the programmer to
    understand their tools, and if they need to pass strings unmodified,
    they shouldn't load them using methods that normalize the string.

    > You can't expect each and every program which compares strings to
    > perform normalization (e.g. Linux kernel with filenames).

    As has been pointed out here, POSIX filenames are not character strings;
    they are byte strings. They quite likely aren't even valid UTF-8 strings.
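
    As a rough illustration of that point, here is a minimal Python sketch.
    The filename and the helper names (is_valid_utf8, to_text, to_bytes) are
    made up for the example; the only assumption is that a program wants to
    display such a name without losing the original bytes.

```python
# A byte string that is a legal POSIX filename but not valid UTF-8:
# 0xE9 is "e with acute" in Latin-1, but in UTF-8 it opens a multi-byte
# sequence that "." cannot continue.
raw_name = b"caf\xe9.txt"


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_text(name: bytes) -> str:
    """Decode for display; undecodable bytes become lone surrogates."""
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: recover the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


print(is_valid_utf8(raw_name))
print(to_bytes(to_text(raw_name)) == raw_name)
```

    The round trip only works because nothing in between normalizes or
    otherwise rewrites the decoded text.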

    > > So S should _sometimes_ match an accented S? Again, I feel extended misery
    > > of explaining to people why things aren't working right coming on.
    > Well, otherwise things get ambiguous, similarly to these XML issues.

    Don't things get ambiguous if one day ŝ is matched by s and the next
    day it isn't? That's absolutely wrong behavior; the program must serve
    the user, not the programmer. 's' cannot, should not, must not match 'ŝ';
    and if it must, then it absolutely always must match 'ŝ', and some way
    to write a regex that matches s but not ŝ must be designed. It doesn't
    matter what problems exist in the world of programming; that is the
    entirely reasonable expectation of the end user.
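
    To make the inconsistency concrete, here is a small Python sketch. It
    assumes a matcher that works code point by code point, which is what
    Python's re module does; naive_match and bare_positions are hypothetical
    helper names, and bare_positions is only one possible way to get
    "s but not ŝ".

```python
import re
import unicodedata

composed = unicodedata.normalize("NFC", "s\u0302")   # one code point, U+015D
decomposed = unicodedata.normalize("NFD", "\u015d")  # "s" followed by U+0302


def naive_match(text: str) -> bool:
    """Code-point-level search for a plain 's'."""
    return re.search("s", text) is not None


def bare_positions(text: str, letter: str) -> list:
    """Indices (in the NFD form of text) where letter carries no combining mark."""
    nfd = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(nfd):
        nxt = nfd[i + 1] if i + 1 < len(nfd) else ""
        has_mark = bool(nxt) and unicodedata.category(nxt).startswith("M")
        if ch == letter and not has_mark:
            hits.append(i)
    return hits


print(naive_match(composed), naive_match(decomposed))
print(bare_positions(composed, "s"), bare_positions(decomposed, "s"))
```

    The naive search gives different answers for two strings the user sees
    as identical; the second helper gives the same answer for both because
    it normalizes first and then looks at what follows the letter.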

    > Does "\n" followed by a combining code point start a new line?

    The Standard says no, that's a defective combining sequence.
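
    For what it's worth, such sequences are easy to detect. The sketch below
    is a simplified reading of the Standard's definition (a combining mark
    with no base character in front of it); it ignores ZWJ and ZWNJ, and the
    function names are invented for the example.

```python
import unicodedata


def is_base(ch: str) -> bool:
    """Graphic, non-mark character: letter, number, punctuation, symbol, space."""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def defective_marks(text: str) -> list:
    """Indices of combining marks that have no base character to attach to."""
    hits = []
    has_base = False
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith("M"):
            if not has_base:
                hits.append(i)
        else:
            # Controls such as "\n" (category Cc) are not base characters.
            has_base = is_base(ch)
    return hits


sample = "a\n\u0301b"  # newline followed by COMBINING ACUTE ACCENT
print(defective_marks(sample))
print(sample.splitlines())
```

    Python's own str.splitlines still breaks the line at the "\n" and leaves
    the stray mark at the start of the next line.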

    > Does
    > a double quote followed by a combining code point start a string
    > literal?

    That would depend on your language. I'd prefer no, but it's obvious
    many have made other choices.

    > Does a slash followed by a combining code point separate
    > subdirectory names?

    In Unix, yes; that's because filenames in Unix are byte streams with
    the byte 0x2F acting as a path separator.

    > It's hard enough to convince them that a
    > character is not the same as a byte.

    That contradicts your statement above, that every programmer needs an
    in-depth knowledge of Unicode.

    > In case I want to circumvent security or deliberately cause a piece of
    > software to misbehave. Robustness requires unambiguous and simple rules.

    The rules you are offering are only simple and unambiguous to the programmer;
    they appear completely random to the end user. To have ≮ sometimes start a
    tag means that a user can't look at the XML and tell whether something opens
    a tag or is just text. You might be able to expect all programmers to know
    these rules, but you can't expect all end users to.
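
    A short Python sketch of why the user can't tell: ≮ (U+226E) is
    canonically equivalent to "<" followed by U+0338, so a scanner that
    looks for "<" one code point at a time sees a tag start in one
    normalization form and not in the other. naive_tag_starts is a
    hypothetical stand-in for such a scanner, not any real XML parser.

```python
import unicodedata

text = "a \u226e b"  # "a ≮ b", with ≮ as the single code point U+226E
nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)  # ≮ becomes "<" + U+0338


def naive_tag_starts(s: str) -> list:
    """Indices where a code-point-level scanner would see '<'."""
    return [i for i, ch in enumerate(s) if ch == "<"]


print(naive_tag_starts(nfc))
print(naive_tag_starts(nfd))
```

    Both strings render the same on screen, which is exactly the problem.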

    This archive was generated by hypermail 2.1.5 : Fri Dec 10 2004 - 19:13:46 CST