Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Dec 12 2004 - 05:53:33 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

Previous message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
In reply to: D. Starner: "Re: Nicest UTF"
Next in thread: Lars Kristan: "RE: Nicest UTF"

"D. Starner" <shalesller@writeme.com> writes:

>> But demanding that each program which searches strings checks for
>> combining classes is I'm afraid too much.
>
> How is it any different from a case-insensitive search?

We started from string equality, which somehow changed into searching.
Default string equality is case-sensitive.

Searching for an arbitrary substring entered by a user should use
user-friendly rules which fold various minor differences like
decomposition and case and soft hyphens, but it's a rare task and
changing rules generally affects convenience rather than correctness.

String equality is used for internal and important operations like
lookup in a dictionary (not necessarily of strings ever viewed by
the user), comparing XML tags, filenames, mail headers, program
identifiers, hyperlink addresses etc. They should be unambiguous,
simple and fast. Computing approximate equivalence by folding "minor"
differences must be done explicitly when needed, as mandated by
relevant protocols and standards, not forced as the default.
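
To make the distinction concrete, here is a minimal Python sketch. The
names exact_equal, fold and loosely_equal are hypothetical, and the
folding steps (drop soft hyphens, compose to NFC, fold case) are only
one assumed choice of "user-friendly" rules, not anything mandated by
a standard.

```python
import unicodedata

def exact_equal(a: str, b: str) -> bool:
    # Default equality: compares code point by code point, no folding.
    return a == b

def fold(s: str) -> str:
    # Hypothetical search key (an assumption for illustration):
    # drop soft hyphens, compose to NFC, then fold case.
    s = s.replace("\u00ad", "")
    s = unicodedata.normalize("NFC", s)
    return unicodedata.normalize("NFC", s.casefold())

def loosely_equal(a: str, b: str) -> bool:
    # Approximate equivalence, requested explicitly by the caller.
    return fold(a) == fold(b)

precomposed = "B\u00e4r"   # B, a-with-diaeresis, r
decomposed = "ba\u0308r"   # b, a, COMBINING DIAERESIS, r

print(exact_equal(precomposed, decomposed))
print(loosely_equal(precomposed, decomposed))
```

A dictionary lookup or tag comparison would use exact_equal; only an
interactive search would opt into loosely_equal.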

>> >> Does "\n" followed by a combining code point start a new line?
>> >
>> > The Standard says no, that's a defective combining sequence.
>>
>> Is there *any* program which behaves this way?
>
> I misstated that; it's a new line followed by a defective combining
> sequence.

What is the definition of combining sequences?
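
As a data point on what programs actually do, the following sketch
checks how one runtime (Python's str.splitlines) treats a line feed
followed by a combining mark. It shows that runtime's behaviour only
and says nothing about what the Standard requires.

```python
import unicodedata

# LF followed directly by U+0301 COMBINING ACUTE ACCENT.
text = "first\n\u0301second"

# str.splitlines breaks at the line feed regardless of what follows it.
lines = text.splitlines()

# A non-zero canonical combining class means the second line begins
# with a combining mark that has no base character on that line.
starts_with_mark = unicodedata.combining(lines[1][0]) != 0

print(len(lines), starts_with_mark)
```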

>> It doesn't matter that accented backslashes don't occur in practice.
>> I do care for unambiguous, consistent and simple rules.
>
> So do I; and the only unambiguous, consistent and simple rule that
> won't give users hell is that "ba" never matches "bä". Any programs
> for end-users must follow that rule.

Please give a precise definition of string equality. What representation
of strings does it need: a sequence of code points or something else?
Are all strings valid and comparable? Are there operations which give
different results for "equal" strings?

If string equality folded the difference between precomposed and
decomposed characters, then the API should hide that difference in
other places as well, otherwise string equality is not the finest
distinction between string values but some arbitrary equivalence
relation.
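
A small Python sketch of the problem, assuming strings are exposed as
sequences of code points (as Python's str is). The helper name
observable is hypothetical. The two strings are canonically equivalent,
yet ordinary operations tell them apart:

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

def observable(s: str):
    # Results of ordinary operations that an API built on code points
    # exposes: length, first element, and UTF-8 serialization.
    return (len(s), s[0], s.encode("utf-8"))

# Canonically equivalent: both normalize to the same NFC string.
equivalent = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(equivalent)
print(observable(precomposed))
print(observable(decomposed))
```

If equality reported these two as equal, length, indexing and encoding
would all have to hide the difference too, or "equal" strings would be
distinguishable.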

>> My current implementation doesn't support filenames which can't be
>> encoded in the current default encoding.
>
> The right thing to do, IMO, would be to support filenames as byte
> strings, and let the programmer convert them back and forth between
> character strings, knowing that it won't roundtrip.

Perhaps. Unfortunately it makes filename processing harder, e.g.
you can't store them in *text* files processed through a transparent
conversion between the file's encoding and Unicode. In effect we must go
back from manipulating context-insensitive character sequences to
manipulating byte sequences with context-dependent interpretation.
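
The roundtrip problem can be sketched in Python. The filename bytes
below are a made-up example, and the "surrogateescape" error handler is
a Python-specific mechanism used here only for illustration; it is not
something discussed in this thread.

```python
# Hypothetical filename bytes: Latin-1 "cafe-acute.txt", invalid as UTF-8.
raw = b"caf\xe9.txt"

def decodes_strictly(b: bytes) -> bool:
    # True if the bytes are valid in the assumed default encoding (UTF-8).
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Lossy conversion: the invalid byte becomes U+FFFD and cannot be recovered.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrips = lossy.encode("utf-8") == raw

# Escaping conversion: roundtrips in memory, but the result contains a
# lone surrogate, so it is not a valid Unicode string.
escaped = raw.decode("utf-8", errors="surrogateescape")
escaped_roundtrips = escaped.encode("utf-8", errors="surrogateescape") == raw

def storable_as_utf8_text(s: str) -> bool:
    # Could this string be written to an ordinary UTF-8 text file?
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

print(decodes_strictly(raw), lossy_roundtrips,
      escaped_roundtrips, storable_as_utf8_text(escaped))
```

Either the name is damaged on conversion, or it survives only as a
value that ordinary text files cannot carry.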

We can't even sort filenames using Unicode algorithms for collation
but must use some algorithms which are capable of processing both
strings in the locale's encoding and arbitrary byte sequences at the
same time. This is much more complicated than using Unicode algorithms
alone.

What is worse, in Windows the primary representation of filenames
is Unicode, so programs which carefully use APIs based on byte
sequences for processing filenames will be less general than programs
using Unicode-based APIs once they are ported to Windows.

The computing world is slowly migrating from processing byte sequences
in ambiguous encodings to processing Unicode strings, often represented
by byte sequences in explicitly labeled encodings. There are relics
where the new paradigm doesn't fit well, like Unix filenames, but
sticking to the old paradigm means that programs will continue to
support mixing scripts poorly or not at all.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Sun Dec 12 2004 - 05:56:02 CST