Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk
Date: Sat Dec 11 2004 - 04:52:10 CST

    "D. Starner" writes:

    >> > This implies that every programmer needs an in-depth knowledge of
    >> > Unicode to handle simple strings.
    >> There is no way to avoid that.
    > Then there's no way that we're ever going to get reliable Unicode
    > support.

    This is probably true.

    I wonder whether things could have been done significantly better,
    or whether this is an inherent complexity of text. I'm just curious;
    the answer doesn't change the reality.

    >> If the runtime automatically performed NFC on input, then a part of a
    >> program which is supposed to pass a string unmodified would sometimes
    >> modify it. Similarly with NFD.
    > No. By the same logic you used above, I can expect the programmer to
    > understand their tools, and if they need to pass strings unmodified,
    > they shouldn't load them using methods that normalize the string.

    That's my point: if he normalizes, he does this explicitly.
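
    To make the concern concrete, here is a minimal sketch in Python
    (used only for illustration; it is not the language discussed in
    this thread). Both helper functions are hypothetical: one passes a
    string through untouched, the other mimics a runtime that applies
    NFC implicitly.

```python
import unicodedata

def pass_through(s: str) -> str:
    """Hypothetical helper that is supposed to return its input unmodified."""
    return s

def pass_through_normalizing(s: str) -> str:
    """The same helper under a hypothetical runtime that applies NFC implicitly."""
    return unicodedata.normalize("NFC", s)

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

unchanged = pass_through(decomposed)
changed = pass_through_normalizing(decomposed)

print(unchanged == decomposed)         # True: code points are preserved
print(changed == decomposed)           # False: NFC composed them into U+00E9
print([hex(ord(c)) for c in changed])  # ['0xe9']
```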

    If a standard (a programming language, XML, whatever) specifies that
    identifiers should be normalized before comparison, a program should
    do this. If it specifies that Cf characters are to be ignored, then a
    program should comply. A standard doesn't have to specify such things
    however, so a programming language shouldn't do too much automatically.
    It's easier to apply a transformation than to undo a transformation
    applied automatically.
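
    As an illustration of applying such a rule explicitly, the Python
    sketch below compares identifiers the way a hypothetical standard
    might require: drop Cf characters, then compare the NFC forms. The
    function names are made up for this example, and the rule itself is
    an assumption, not taken from any particular specification.

```python
import unicodedata

def strip_format_chars(s: str) -> str:
    """Drop characters whose general category is Cf (format)."""
    return "".join(c for c in s if unicodedata.category(c) != "Cf")

def identifiers_equal(a: str, b: str) -> bool:
    """Compare identifiers under the assumed rule: ignore Cf, then compare NFC."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", strip_format_chars(s))
    return norm(a) == norm(b)

# Precomposed vs. decomposed spelling of the same identifier.
same_after_nfc = identifiers_equal("caf\u00e9", "cafe\u0301")

# U+200D ZERO WIDTH JOINER has category Cf and is ignored.
same_ignoring_cf = identifiers_equal("ab", "a\u200db")

# A raw comparison, with no transformation applied, sees different strings.
raw_equal = "caf\u00e9" == "cafe\u0301"

print(same_after_nfc, same_ignoring_cf, raw_equal)  # True True False
```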

    > Sometimes things get ambiguous if one day ŝ is matched by s and one
    > day ŝ isn't? That's absolutely wrong behavior; the program must serve
    > the user, not the programmer.

    If I use grep to search for a combining acute, I bet it will currently
    match cases where it's a separate combining character but will not
    match precomposed characters.

    Do you say that this should be changed?

    Hey, Linux grep matches only a single byte by ".", even in UTF-8 locale.
    Now, I can agree that this should be changed.
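
    The difference between matching a byte and matching a code point can
    be sketched in Python, whose `re` module accepts both byte patterns
    and text patterns. This only illustrates the two behaviours; it says
    nothing about how any particular grep is implemented.

```python
import re

text = "\u00e9x"             # 'é' followed by 'x'
data = text.encode("utf-8")  # b'\xc3\xa9x': 'é' occupies two bytes

# Byte-oriented matching: "." consumes a single byte, i.e. half of 'é'.
byte_match = re.match(rb".", data).group()

# Code-point-oriented matching: "." consumes the whole 'é'.
char_match = re.match(r".", text).group()

print(byte_match)  # b'\xc3'
print(char_match)  # é
```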

    But demanding that each program which searches strings check for
    combining classes is, I'm afraid, too much.
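
    The combining acute case looks like this at the code point level. A
    small Python sketch: a plain substring search finds U+0301 only
    where it is stored as a separate code point, and finds it in the
    precomposed spelling only after an explicit NFD step.

```python
import unicodedata

ACUTE = "\u0301"           # COMBINING ACUTE ACCENT
decomposed = "cafe\u0301"  # 'e' + combining acute
precomposed = "caf\u00e9"  # single code point U+00E9

found_in_decomposed = ACUTE in decomposed
found_in_precomposed = ACUTE in precomposed
found_after_nfd = ACUTE in unicodedata.normalize("NFD", precomposed)

print(found_in_decomposed, found_in_precomposed, found_after_nfd)
# True False True
```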

    >> Does "\n" followed by a combining code point start a new line?
    > The Standard says no, that's a defective combining sequence.

    Is there *any* program which behaves this way?

    How useful is a rule in a standard which nobody obeys?
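
    For instance, Python's ordinary line splitting works on code points
    and pays no attention to what follows the newline. In this sketch
    the second line simply begins with the combining character:

```python
import unicodedata

text = "first\n\u0301second"  # newline followed by U+0301
lines = text.split("\n")

# The newline still ends the line; the next line starts with a combining mark.
starts_with_mark = unicodedata.category(lines[1][0]).startswith("M")

print(len(lines))        # 2
print(starts_with_mark)  # True
```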

    >> Does a double quote followed by a combining code point start a
    >> string literal?
    > That would depend on your language. I'd prefer no, but it's obvious
    > many have made other choices.

    Since my language is young and has almost no users, I can even
    change decisions made earlier: I'm not constrained by compatibility.

    But if the lexical structure of the program worked in terms of combining
    character sequences, it would have to be somehow supported by generic
    string processing functions, and it would have to work consistently for
    all lexical features. For example */ followed by a combining accent
    would not end a comment, accented backslash would not need escaping in
    a string literal, and something unambiguous would have to be done with
    an accented newline.

    Such rules would be harder to support with most text processing tools.
    I know no language in which searching for a backslash in a string would
    not find an accented backslash.

    It doesn't matter that accented backslashes don't occur in practice. I do
    care about unambiguous, consistent and simple rules.
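
    The two kinds of search can be contrasted in a short Python sketch.
    `find_unaccented` is a hypothetical function written for this
    example; it approximates "not part of a combining character
    sequence" by checking whether the next code point has a Mark
    general category, which is a simplification.

```python
import unicodedata

def find_unaccented(s: str, ch: str) -> int:
    """Index of the first `ch` not followed by a combining mark, or -1."""
    i = s.find(ch)
    while i != -1:
        nxt = s[i + 1:i + 2]
        if not nxt or not unicodedata.category(nxt).startswith("M"):
            return i
        i = s.find(ch, i + 1)
    return -1

# 'a', backslash, U+0301, 'b', backslash, 'c'
s = "a\\\u0301b\\c"

plain_index = s.find("\\")                  # ordinary code point search
sequence_index = find_unaccented(s, "\\")   # skips the accented backslash

print(plain_index, sequence_index)  # 1 4
```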

    >> Does a slash followed by a combining code point separate
    >> subdirectory names?
    > In Unix, yes; that's because filenames in Unix are byte streams with
    > the byte 0x2F acting as a path separator.

    My current implementation doesn't support filenames which can't be
    encoded in the current default encoding. The encoding can be changed
    from within a program (perhaps locally during execution of some code).
    So one can process any Unix filename by temporarily setting the
    encoding to Latin1. It's unfortunate that the default setting is more
    restrictive than the OS, but I have found no sensible alternative
    other than encouraging processing strings in their transportation
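
    The reason Latin1 works as an escape hatch is that it maps every
    byte value to exactly one code point, so any byte string survives a
    decode and encode round trip. A Python sketch, with a made-up
    filename that is not valid UTF-8:

```python
# Hypothetical Unix filename containing bytes that are invalid in UTF-8.
raw_name = bytes([0x66, 0xFF, 0xFE, 0x2E, 0x74, 0x78, 0x74])

try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 accepts every byte and maps it to the code point of the same value.
as_text = raw_name.decode("latin-1")
round_tripped = as_text.encode("latin-1")

print(utf8_ok)                    # False
print(round_tripped == raw_name)  # True
```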

    Anyway, if a string *is* accepted as a file name, the program's idea
    about directory separators is the same as the OS's (as long as we assume
    Unix; I don't yet provide any OS-generic pathname handling). If the
    program assumed that an accented slash is not a directory separator,
    I would expect possible security holes (the program thinks that a string
    doesn't include slashes, but from the OS's point of view it does).
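
    A sketch of that mismatch in Python. Both functions are hypothetical
    and exist only for this example: one applies the "accented slash is
    not a separator" rule, the other looks at the encoded bytes the way
    a Unix kernel would.

```python
import unicodedata

def has_separator_by_sequences(path: str) -> bool:
    """Assumed rule: '/' counts only if not followed by a combining mark."""
    for i, c in enumerate(path):
        if c == "/":
            nxt = path[i + 1:i + 2]
            if not nxt or not unicodedata.category(nxt).startswith("M"):
                return True
    return False

def has_separator_by_bytes(path: str) -> bool:
    """Byte-level view: any 0x2F byte in the encoded name is a separator."""
    return b"/" in path.encode("utf-8")

user_input = "uploads/\u0301secret"  # slash followed by U+0301

program_view = has_separator_by_sequences(user_input)
os_view = has_separator_by_bytes(user_input)

print(program_view, os_view)  # False True
```

    A check based on the first function would let this name through as a
    single path component, while the OS would still descend into a
    directory.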

    > The rules you are offering are only simple and unambiguous to the
    > programmer; they appear completely random to the end user.

    And yours are the opposite :-)

       __("<         Marcin Kowalczyk
