From: D. Starner (shalesller@writeme.com)
Date: Fri Dec 10 2004 - 19:10:16 CST
"Marcin 'Qrczak' Kowalczyk" writes:
> "D. Starner" writes:
>
> > This implies that every programmer needs an in-depth knowledge of
> > Unicode to handle simple strings.
>
> There is no way to avoid that.
Then there's no way that we're ever going to get reliable Unicode
support.
> If the runtime automatically performed NFC on input, then a part of a
> program which is supposed to pass a string unmodified would sometimes
> modify it. Similarly with NFD.
No. By the same logic you used above, I can expect the programmer to
understand their tools, and if they need to pass strings unmodified,
they shouldn't load them using methods that normalize the string.
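To make that concrete, here is a minimal Python sketch of a loader that lets the caller choose. The name load_text and its normalize flag are hypothetical, not taken from any runtime discussed in this thread; only unicodedata.normalize is a real library call.

```python
import unicodedata

def load_text(raw: bytes, normalize: bool = True) -> str:
    """Decode UTF-8 input; apply NFC only when the caller asks for it."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text) if normalize else text

# 's' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
raw = "s\u0302".encode("utf-8")

normalized = load_text(raw)                    # composed to U+015D
untouched = load_text(raw, normalize=False)    # code points passed through as-is
```

A part of a program that must pass strings through unmodified would call the non-normalizing path; everything else gets the normalized form by default.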
> You can't expect each and every program which compares strings to
> perform normalization (e.g. Linux kernel with filenames).
As has been pointed out here, Posix filenames are not character strings;
they are byte strings. They quite likely aren't even valid UTF-8 strings.
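A small Python sketch of what that means in practice, using made-up filenames: two names that are canonically equivalent as text are still different byte strings, and a name need not decode as UTF-8 at all. The helper is_valid_utf8 is a hypothetical name for illustration.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Report whether a filename's bytes happen to decode as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same visible name in composed and decomposed form.
composed = "\u015d.txt".encode("utf-8")
decomposed = "s\u0302.txt".encode("utf-8")

# A name written by a Latin-1 program: a lone 0xE9 byte is not valid UTF-8.
latin1_name = "caf\u00e9".encode("latin-1")

same_bytes = composed == decomposed   # byte comparison, as a POSIX kernel would do
```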
> > So S should _sometimes_ match an accented S? Again, I feel extended misery
> > of explaining to people why things aren't working right coming on.
>
> Well, otherwise things get ambiguous, similarly to these XML issues.
So things aren't ambiguous if one day ŝ is matched by s and the next
day it isn't? That's absolutely wrong behavior; the program must serve
the user, not the programmer. 's' cannot, should not, must not match 'ŝ';
and if it must, then it must absolutely always match 'ŝ', and some way
to write a regex that matches s but not ŝ must be designed. It doesn't
matter what problems exist in the world of programming; that is the
entirely reasonable expectation of the end user.
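As an illustration of the inconsistency, and of one possible way around it, here is a Python sketch. The function find_bare is a hypothetical helper, not a feature of any regex engine; it treats any character in a Unicode Mark category as "an accent", which is an assumption of this sketch.

```python
import re
import unicodedata

# A code-point-level regex gives different answers for equivalent text.
matches_decomposed = re.search("s", "s\u0302") is not None
matches_composed = re.search("s", "\u015d") is not None

def find_bare(letter: str, text: str) -> list:
    """Indexes (in the NFD form) where letter occurs with no mark after it."""
    decomposed = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(decomposed):
        if ch != letter:
            continue
        following = decomposed[i + 1:i + 2]
        if following and unicodedata.category(following).startswith("M"):
            continue  # the letter carries a combining mark, so it is not bare
        hits.append(i)
    return hits
```

With this, find_bare("s", ...) gives the same answer for the composed and the decomposed spelling of ŝ, which is the behavior the end user expects.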
> Does "\n" followed by a combining code point start a new line?
The Standard says no, that's a defective combining sequence.
> Does
> a double quote followed by a combining code point start a string
> literal?
That would depend on your language. I'd prefer no, but it's obvious
many have made other choices.
> Does a slash followed by a combining code point separate
> subdirectory names?
In Unix, yes; that's because filenames in Unix are byte streams with
the byte 0x2F acting as a path separator.
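A minimal Python sketch of that byte-level view, with a made-up path; split_posix_path is a hypothetical helper that only mimics the splitting rule, not any actual kernel code.

```python
def split_posix_path(path: bytes) -> list:
    """Split on the byte 0x2F, ignoring whatever bytes come after it."""
    return path.split(b"/")

# "dir", a slash, then U+0301 COMBINING ACUTE ACCENT followed by "file"
path = "dir/\u0301file".encode("utf-8")
parts = split_posix_path(path)
```

The combining accent ends up as the first two bytes (0xCC 0x81) of the second component; the slash still separates.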
> It's hard enough to convince them that a
> character is not the same as a byte.
That contradicts your statement above that every programmer needs an
in-depth knowledge of Unicode.
> In case I want to circumvent security or deliberately cause a piece of
> software to misbehave. Robustness requires unambiguous and simple rules.
The rules you are offering are only simple and unambiguous to the programmer;
they appear completely random to the end user. To have ≮ sometimes start a
tag means that a user can't look at the XML and tell whether something opens
a tag or is just text. You might be able to expect that of all programmers,
but you can't expect it of all end users.
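The ≮ case can be checked directly in Python: U+226E NOT LESS-THAN is canonically equivalent to '<' followed by U+0338 COMBINING LONG SOLIDUS OVERLAY, so the two spellings look identical on screen but only one of them begins with a '<' code point. The variable names are just for this sketch.

```python
import unicodedata

composed = "\u226e"        # NOT LESS-THAN as a single code point
decomposed = "<\u0338"     # '<' plus COMBINING LONG SOLIDUS OVERLAY

nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# What a parser that scans code points for '<' would conclude.
composed_opens_tag = composed.startswith("<")
decomposed_opens_tag = decomposed.startswith("<")
```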