Re: Nicest UTF

From: D. Starner (shalesller@writeme.com)
Date: Fri Dec 10 2004 - 19:10:16 CST

Next message: Philippe Verdy: "Re: Nicest UTF"

Previous message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

"Marcin 'Qrczak' Kowalczyk" writes:

> "D. Starner" writes:
>
> > This implies that every programmer needs an in-depth knowledge of
> > Unicode to handle simple strings.
>
> There is no way to avoid that.

Then there's no way that we're ever going to get reliable Unicode
support.

> If the runtime automatically performed NFC on input, then a part of a
> program which is supposed to pass a string unmodified would sometimes
> modify it. Similarly with NFD.

No. By the same logic you used above, I can expect the programmer to
understand their tools, and if they need to pass strings unmodified,
they shouldn't load them using methods that normalize the string.
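To make the distinction concrete, here is a minimal Python sketch of two
loading methods, one that only decodes and one that also normalizes. The
function names load_raw and load_nfc are hypothetical, chosen only for
illustration; unicodedata.normalize is the standard-library call.

```python
import unicodedata

def load_raw(data: bytes) -> str:
    """Decode only; code points pass through unchanged (hypothetical helper)."""
    return data.decode("utf-8")

def load_nfc(data: bytes) -> str:
    """Decode, then normalize to NFC; the result may differ from the input."""
    return unicodedata.normalize("NFC", data.decode("utf-8"))

# "s" followed by U+0302 COMBINING CIRCUMFLEX ACCENT, encoded as UTF-8
decomposed = "s\u0302".encode("utf-8")

raw = load_raw(decomposed)   # still two code points: "s" + U+0302
nfc = load_nfc(decomposed)   # one code point: U+015D
```

A program that must pass the string through unmodified would pick the
first loader; round-tripping raw back to UTF-8 gives the original bytes,
while nfc does not.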

> You can't expect each and every program which compares strings to
> perform normalization (e.g. Linux kernel with filenames).

As has been pointed out here, Posix filenames are not character strings;
they are byte strings. They quite likely aren't even valid UTF-8 strings.
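As a rough illustration, assuming a filename that was created under a
Latin-1 locale, the same visible name can be a valid or an invalid UTF-8
byte string. The helper is_valid_utf8 is a hypothetical name, not part of
any library.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8 (hypothetical helper)."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same visible name, "caf" + U+00E9, as two different byte strings.
latin1_name = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
utf8_name = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

latin1_ok = is_valid_utf8(latin1_name)  # lone 0xE9 is a truncated sequence
utf8_ok = is_valid_utf8(utf8_name)
```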

> > So S should _sometimes_ match an accented S? Again, I feel extended misery
> > of explaining to people why things aren't working right coming on.
>
> Well, otherwise things get ambiguous, similarly to these XML issues.

And things don't get ambiguous if one day ŝ is matched by s and the
next day it isn't? That's absolutely wrong behavior; the program must
serve the user, not the programmer. 's' cannot, should not, must not
match 'ŝ'; and if it must, then it must always match 'ŝ', and some way
to write a regex that matches s but not ŝ must be designed. It doesn't
matter what problems exist in the world of programming; that is the
entirely reasonable expectation of the end user.
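A small Python sketch of the inconsistency, using the standard re and
unicodedata modules. The last pattern is only one possible way to match
a bare s; it assumes the accents of interest lie in the Combining
Diacritical Marks block U+0300..U+036F, which does not cover every
combining character.

```python
import re
import unicodedata

composed = "\u015d"                                  # ŝ as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "s" + U+0302

plain_s = re.compile("s")

# The same user-visible text gives two different answers.
matches_composed = plain_s.search(composed) is not None
matches_decomposed = plain_s.search(decomposed) is not None

# One way to be consistent: normalize to NFC before matching.
matches_after_nfc = plain_s.search(
    unicodedata.normalize("NFC", decomposed)) is not None

# Another: a pattern for "s not followed by a combining mark"
# (assumption: only U+0300..U+036F is checked).
bare_s = re.compile("s(?![\u0300-\u036f])")
bare_hits_accented = bare_s.search(decomposed) is not None
bare_hits_plain = bare_s.search("miss") is not None
```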

> Does "\n" followed by a combining code point start a new line?

The Standard says no, that's a defective combining sequence.

> Does
> a double quote followed by a combining code point start a string
> literal?

That would depend on your language. I'd prefer no, but it's obvious
many have made other choices.

> Does a slash followed by a combining code point separate
> subdirectory names?

In Unix, yes; that's because filenames in Unix are byte streams with
the byte 0x2F acting as a path separator.
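A sketch of what that means in practice, with a made-up path: splitting
on the byte 0x2F ignores whatever follows it, so the combining mark just
becomes the first bytes of the next component.

```python
# "/" followed by U+0301 COMBINING ACUTE ACCENT; the path is hypothetical.
path = "usr/\u0301local".encode("utf-8")

# Byte-level split on 0x2F, with no knowledge of characters.
components = path.split(b"\x2f")

# The combining mark (0xCC 0x81 in UTF-8) now starts the second component.
second = components[1]
```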

> It's hard enough to convince them that a
> character is not the same as a byte.

That contradicts your statement above, that every programmer needs an
in-depth knowledge of Unicode.

> In case I want to circumvent security or deliberately cause a piece of
> software to misbehave. Robustness requires unambiguous and simple rules.

The rules you are offering are only simple and unambiguous to the programmer;
they appear completely random to the end user. To have ≮ sometimes start a
tag means that a user can't look at the XML and tell whether something opens
a tag or is just text. You might be able to expect all programmers to know
these rules, but you can't expect all end users to.
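For reference, a short Python sketch of why ≮ is the troublesome case:
U+226E NOT LESS-THAN is canonically equivalent to "<" followed by U+0338
COMBINING LONG SOLIDUS OVERLAY, so the two forms look identical on
screen but differ in whether a literal "<" is present.

```python
import unicodedata

not_less_than = "\u226e"                          # ≮ as one code point
nfd = unicodedata.normalize("NFD", not_less_than)  # "<" + U+0338
nfc = unicodedata.normalize("NFC", nfd)            # back to U+226E

# A code-point-level scan for "<" sees a tag opener only in one form.
has_lt_composed = "<" in not_less_than
has_lt_decomposed = "<" in nfd
```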



This archive was generated by hypermail 2.1.5 : Fri Dec 10 2004 - 19:13:46 CST