Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Fri Dec 10 2004 - 13:27:04 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

Previous message: Andy Heninger: "Re: When to validate?"
In reply to: D. Starner: "Re: Nicest UTF"
Next in thread: D. Starner: "Re: Nicest UTF"

"D. Starner" <shalesller@writeme.com> writes:

>> String equality in a programming language should not treat composed
>> and decomposed forms as equal. Not this level of abstraction.
>
> This implies that every programmer needs an in-depth knowledge of
> Unicode to handle simple strings.

There is no way to avoid that.

If the runtime automatically performed NFC on input, then a part of a
program which is supposed to pass a string unmodified would sometimes
modify it. Similarly with NFD.
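
To make the point above concrete, here is a minimal sketch using Python's
standard unicodedata module. The function passthrough_with_nfc is a
hypothetical stand-in for a runtime that normalizes on input; it is not
taken from any real runtime.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Plain code-point equality tells the two forms apart.
print(composed == decomposed)   # False


def passthrough_with_nfc(s):
    """Hypothetical runtime hook: normalize every incoming string to NFC."""
    return unicodedata.normalize("NFC", s)


# A component that only meant to pass the string along has changed it.
out = passthrough_with_nfc(decomposed)
print(out == decomposed)            # False
print(len(decomposed), len(out))    # 2 1

# The same happens in the other direction with NFD.
print(unicodedata.normalize("NFD", composed) == composed)   # False
```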

You can't expect each and every program which compares strings to
perform normalization (e.g. Linux kernel with filenames).

Perhaps if there were a single normalization form which everybody
agreed on, and unnormalized strings were never used for data
interchange (if UTF-8 were specified so as to disallow unnormalized
data, etc.), things would be different. But Unicode treats both
composed and decomposed representations as valid.

>> IMHO splitting into graphemes is the job of a rendering engine, not of
>> a function which extracts a part of a string which matches a regex.
>
> So S should _sometimes_ match an accented S? Again, I feel extended misery
> of explaining to people why things aren't working right coming on.

Well, otherwise things get ambiguous, similarly to these XML issues.
Does "\n" followed by a combining code point start a new line? Does
a double quote followed by a combining code point start a string
literal? Does a slash followed by a combining code point separate
subdirectory names?
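
A small sketch of what the code-point rule gives for two of these cases,
assuming Python strings (sequences of code points). The sample strings
are made up for illustration.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# A nonzero canonical combining class marks ACUTE as a combining code point.
assert unicodedata.combining(ACUTE) != 0

line_text = "first\n" + ACUTE + "second"   # "\n" followed by a combining code point
path_text = "usr/" + ACUTE + "local"       # "/" followed by a combining code point

# Splitting by code point treats the delimiter as a delimiter, regardless
# of what follows it; the combining code point starts the next piece.
lines = line_text.split("\n")
parts = path_text.split("/")

print(len(lines))   # 2
print(len(parts))   # 2
```

Under this rule the answer to each question is simply "yes"; a rule based
on combining character sequences would have to decide what "/" plus an
accent means.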

An iterator which delivers whole combining character sequences out of
a sequence of code points can be used. You can also manipulate strings
as arrays of combining character sequences. But if you insist that
this is the primary string representation, you become incompatible
with most programs which have different ideas about delimited strings.
You can't expect each and every program to check combining classes
of processed characters. It's hard enough to convince them that a
character is not the same as a byte.
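
Such an iterator might look like the sketch below. It approximates
"combining" as General_Category Mark (Mn, Mc, Me) via Python's
unicodedata module; the function name combining_sequences is made up
here, and full grapheme cluster rules (UAX #29) are deliberately ignored.

```python
import unicodedata


def combining_sequences(code_points):
    """Yield a base code point together with the marks that follow it.

    Simplification: a code point is treated as combining when its
    General_Category starts with "M". Leading marks with no base are
    yielded as a (defective) sequence of their own.
    """
    cluster = ""
    for ch in code_points:
        if cluster and not unicodedata.category(ch).startswith("M"):
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


text = "e\u0301/\u0301x"   # e + acute, slash + acute, x
seqs = list(combining_sequences(text))

print(len(text), len(seqs))   # 5 3
print("/" in text)            # True: the code point view sees a slash
print("/" in seqs)            # False: no sequence equals a bare slash
```

The last two lines show the incompatibility: a program looking for "/"
as a code point finds one, while a program looking for "/" as a whole
sequence does not.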

>> I expect breakage of XML-based protocols if implementations are
>> actually changed to conform to these rules (I bet they don't now).
>
> Really? In what cases are you storing isolated combining code points
> in XML as text?

In case I want to circumvent security or deliberately cause a piece of
software to misbehave. Robustness requires unambiguous and simple rules.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Fri Dec 10 2004 - 13:29:34 CST