Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Dec 12 2004 - 05:53:33 CST

    "D. Starner" <shalesller@writeme.com> writes:

    >> But demanding that each program which searches strings checks for
    >> combining classes is I'm afraid too much.
    >
    > How is it any different from a case-insensitive search?

    We started from string equality, which somehow changed into searching.
    Default string equality is case-sensitive.

    Searching for an arbitrary substring entered by a user should use
    user-friendly rules which fold various minor differences like
    decomposition and case and soft hyphens, but it's a rare task and
    changing rules generally affects convenience rather than correctness.

    String equality is used for internal and important operations like
    lookup in a dictionary (not necessarily of strings ever viewed by
    the user), comparing XML tags, filenames, mail headers, program
    identifiers, hyperlink addresses etc. They should be unambiguous,
    simple and fast. Computing approximate equivalence by folding "minor"
    differences must be done explicitly when needed, as mandated by
    relevant protocols and standards, not forced as the default.
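
To make the distinction concrete, here is a minimal Python sketch. The names exact_equal, fold and folded_equal are hypothetical, and the particular folding (soft hyphens, case, composition form) is just one possible choice, not a rule taken from any protocol or standard.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def exact_equal(a: str, b: str) -> bool:
    """Default equality: plain code point comparison, case-sensitive."""
    return a == b

def fold(s: str) -> str:
    """Illustrative folding: drop soft hyphens, ignore case and composition form."""
    s = s.replace(SOFT_HYPHEN, "")
    s = unicodedata.normalize("NFD", s).casefold()
    # Case folding can disturb normalization, so normalize once more.
    return unicodedata.normalize("NFD", s)

def folded_equal(a: str, b: str) -> bool:
    """Approximate equivalence, requested explicitly by the caller."""
    return fold(a) == fold(b)

precomposed = "Caf\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(exact_equal(precomposed, decomposed))   # the default comparison
print(folded_equal(precomposed, decomposed))  # the explicit, folded comparison
```

The point of keeping two separate functions is that a dictionary lookup or an XML tag comparison would call the first one, and only code that really wants user-friendly matching would call the second.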

    >> >> Does "\n" followed by a combining code point start a new line?
    >> >
    >> > The Standard says no, that's a defective combining sequence.
    >>
    >> Is there *any* program which behaves this way?
    >
    > I misstated that; it's a new line followed by a defective combining
    > sequence.

    What is the definition of combining sequences?
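
As a data point on what an existing implementation does, the following Python snippet splits a string in which "\n" is immediately followed by U+0301 COMBINING ACUTE ACCENT. It only shows the behaviour of Python's str.splitlines, and says nothing about what the Standard defines.

```python
import unicodedata

# 'a', a newline, then a combining acute accent directly after the newline.
text = "a\n\u0301b"

lines = text.splitlines()

# A non-zero canonical combining class marks a combining character.
second_line_starts_with_combining = unicodedata.combining(lines[1][0]) != 0

print(len(lines))
print(second_line_starts_with_combining)
```

Here the newline still ends the first line, and the second line begins with a combining character that has no base character before it.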

    >> It doesn't matter that accented backslashes don't occur in practice.
    >> I do care for unambiguous, consistent and simple rules.
    >
    > So do I; and the only unambiguous, consistent and simple rule that
    > won't give users hell is that "ba" never matches "bä". Any programs
    > for end-users must follow that rule.

    Please give a precise definition of string equality. What representation
    of strings does it need: a sequence of code points or something else?
    Are all strings valid and comparable? Are there operations which give
    different results for "equal" strings?

    If string equality folded the difference between precomposed and
    decomposed characters, then the API should hide that difference in
    other places as well, otherwise string equality is not the finest
    distinction between string values but some arbitrary equivalence
    relation.
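
A small Python sketch of the problem, using a hypothetical FoldedStr wrapper (not any real library type) whose equality folds composition differences while the rest of its interface still exposes the underlying code points:

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

class FoldedStr:
    """Hypothetical string type: equality ignores precomposed vs. decomposed form."""

    def __init__(self, raw: str):
        self.raw = raw

    def __eq__(self, other):
        return isinstance(other, FoldedStr) and nfc(self.raw) == nfc(other.raw)

    def __hash__(self):
        # Must agree with __eq__, so hash the normalized form.
        return hash(nfc(self.raw))

    def __len__(self):
        # Still counts code points of the stored form, so the difference leaks.
        return len(self.raw)

a = FoldedStr("b\u00e4")     # precomposed a-umlaut
b = FoldedStr("ba\u0308")    # 'a' followed by U+0308 COMBINING DIAERESIS

print(a == b)
print(len(a), len(b))
```

The two values compare equal, yet len gives different answers for them, so equality is no longer the finest distinction the API makes.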

    >> My current implementation doesn't support filenames which can't be
    >> encoded in the current default encoding.
    >
    > The right thing to do, IMO, would be to support filenames as byte
    > strings, and let the programmer convert them back and forth between
    > character strings, knowing that it won't roundtrip.

    Perhaps. Unfortunately it makes filename processing harder, e.g.
    you can't store them in *text* files processed through a transparent
    conversion between its encoding and Unicode. In effect we must go
    back from manipulating context-insensitive character sequences to
    manipulating byte sequences with context-dependent interpretation.
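
The roundtrip problem can be shown in a few lines of Python. The filename below is made up for illustration; it is valid Latin-1 but not valid UTF-8, and UTF-8 is assumed to be the current default encoding.

```python
# Hypothetical filename bytes: "café.txt" encoded as Latin-1.
raw = b"caf\xe9.txt"

# Strict decoding as UTF-8 refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding succeeds, but replaces the bad byte with U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# Encoding the result again does not give back the original bytes.
roundtrips = lossy.encode("utf-8") == raw

print(strict_ok, repr(lossy), roundtrips)
```

Either the name is rejected or the original bytes are lost, which is why such a file cannot be reached again through the converted string alone.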

    We can't even sort filenames using Unicode algorithms for collation
    but must use some algorithms which are capable of processing both
    strings in the locale's encoding and arbitrary byte sequences at the
    same time. This is much more complicated than using Unicode algorithms
    alone.
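
A rough Python sketch of what such a mixed comparison might look like. The function name sort_key and the policy of placing undecodable names last are assumptions made for illustration, UTF-8 stands in for the locale's encoding, and plain code point order stands in for a real Unicode collation algorithm.

```python
def sort_key(name: bytes, encoding: str = "utf-8"):
    """Hypothetical key: decodable names sort as text, the rest as raw bytes."""
    try:
        # Group 0: names that are valid in the assumed encoding.
        return (0, name.decode(encoding), b"")
    except UnicodeDecodeError:
        # Group 1: arbitrary byte sequences, ordered bytewise after group 0.
        return (1, "", name)

names = [b"b.txt", b"\xff.bin", b"a.txt", b"\xfe.bin"]
ordered = sorted(names, key=sort_key)

print(ordered)
```

Even this toy version needs two orderings and a rule for combining them, and the result depends on which encoding is assumed.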

    What is worse, on Windows the primary representation of filenames is
    Unicode, so a program which carefully uses byte-sequence APIs for
    filenames will be less general than one using Unicode-based APIs
    once it is ported to Windows.

    The computing world is slowly migrating from processing byte sequences
    in ambiguous encodings to processing Unicode strings, often represented
    by byte sequences in explicitly labeled encodings. There are relics
    where the new paradigm doesn't fit well, like Unix filenames, but
    sticking to the old paradigm means that programs will continue to
    support mixing scripts poorly or not at all.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

