Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sat Dec 11 2004 - 04:52:10 CST

Next message: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Previous message: Lars Kristan: "RE: When to validate?"
In reply to: D. Starner: "Re: Nicest UTF"
Next in thread: Philippe Verdy: "Re: Nicest UTF"

"D. Starner" <shalesller@writeme.com> writes:

>> > This implies that every programmer needs an in-depth knowledge of
>> > Unicode to handle simple strings.
>>
>> There is no way to avoid that.
>
> Then there's no way that we're ever going to get reliable Unicode
> support.

This is probably true.

I wonder whether things could have been done significantly better,
or whether this is an inherent complexity of text. I'm just curious;
it doesn't change the reality.

>> If the runtime automatically performed NFC on input, then a part of a
>> program which is supposed to pass a string unmodified would sometimes
>> modify it. Similarly with NFD.
>
> No. By the same logic you used above, I can expect the programmer to
> understand their tools, and if they need to pass strings unmodified,
> they shouldn't load them using methods that normalize the string.

That's my point: if he normalizes, he does this explicitly.

If a standard (a programming language, XML, whatever) specifies that
identifiers should be normalized before comparison, a program should
do this. If it specifies that Cf characters are to be ignored, then a
program should comply. A standard doesn't have to specify such things,
however, so a programming language shouldn't do too much automatically.
It's easier to apply a transformation than to undo a transformation
applied automatically.
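
To make the pass-through problem concrete, here is a minimal sketch in
Python (not the language discussed in this thread). Both function names
are hypothetical; the second one only simulates a runtime that applies
NFC to every string on input.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
original = "e\u0301"

def pass_through(s: str) -> str:
    """Hypothetical component that is supposed to forward a string unchanged."""
    return s

def pass_through_with_auto_nfc(s: str) -> str:
    """The same component if the runtime normalized every string on input."""
    return unicodedata.normalize("NFC", s)

unchanged = pass_through(original)
normalized = pass_through_with_auto_nfc(original)

print(unchanged == original)            # True
print(normalized == original)           # False: now the single code point U+00E9
print(len(original), len(normalized))   # 2 1
```

The explicit call to normalize() is easy to add where a standard
requires it; recovering the original code points after an automatic
NFC is not possible in general.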

> Sometimes things get ambiguous if one day ŝ is matched by s and one
> day ŝ isn't? That's absolutely wrong behavior; the program must serve
> the user, not the programmer.

If I use grep to search for a combining acute, I bet it will currently
match cases where it's a separate combining character but will not
match precomposed characters.

Are you saying that this should be changed?
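
The behavior I am betting on can be sketched with a plain code point
search in Python. This models what a search that works on code points
does; it is not a test of grep itself, and the sample words are only
illustrative.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"

decomposed = "cafe\u0301"    # 'e' followed by a separate combining acute
precomposed = "caf\u00e9"    # single precomposed code point U+00E9

# A code point search finds the accent only where it is a separate character.
naive_dec = COMBINING_ACUTE in decomposed
naive_pre = COMBINING_ACUTE in precomposed

# Matching the precomposed form requires an explicit decomposition first.
nfd_pre = COMBINING_ACUTE in unicodedata.normalize("NFD", precomposed)

print(naive_dec, naive_pre, nfd_pre)   # True False True
```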

Hey, in Linux grep "." matches only a single byte, even in a UTF-8 locale.
Now, I can agree that this should be changed.
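
The difference between matching a byte and matching a character can be
shown with Python's re module, which accepts both bytes and str
patterns. This is only an analogy for the byte-oriented behavior
described above, not a reproduction of grep.

```python
import re

text = "\u00e9"                  # one character, U+00E9
encoded = text.encode("utf-8")   # two bytes: 0xC3 0xA9

# A bytes pattern: "." consumes one byte, i.e. half of the character.
byte_match = re.match(b".", encoded).group()

# A str pattern: "." consumes one whole code point.
char_match = re.match(".", text).group()

print(byte_match)            # b'\xc3'
print(char_match == text)    # True
```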

But demanding that each program which searches strings check for
combining classes is, I'm afraid, too much.

>> Does "\n" followed by a combining code point start a new line?
>
> The Standard says no, that's a defective combining sequence.

Is there *any* program which behaves this way?

How useful is a rule in a standard if nobody obeys it?
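
As one data point, Python's str.splitlines() treats the newline as a
line break regardless of what follows it. The sample string is made up
for illustration; other tools may of course differ.

```python
# "a", newline, U+0301 COMBINING ACUTE ACCENT, "b"
sample = "a\n\u0301b"

lines = sample.splitlines()

# The newline still ends the first line; the combining mark simply
# becomes the first code point of the second line.
print(len(lines))                # 2
print(lines[1][0] == "\u0301")   # True
```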

>> Does a double quote followed by a combining code point start a
>> string literal?
>
> That would depend on your language. I'd prefer no, but it's obvious
> many have made other choices.

Since my language is young and has almost no users, I can even
change decisions made earlier: I'm not constrained by compatibility
yet.

But if the lexical structure of the program worked in terms of combining
character sequences, it would have to be somehow supported by generic
string processing functions, and it would have to work consistently for
all lexical features. For example, */ followed by a combining accent
would not end a comment, an accented backslash would not need escaping
in a string literal, and something unambiguous would have to be done
with an accented newline.

Such rules would be harder to support with most text processing tools.
I know of no language in which searching for a backslash in a string
would not find an accented backslash.
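
A rough sketch of the two behaviors in Python. The helper
find_unaccented() is hypothetical and simplified: it only consults
unicodedata.combining(), so combining marks with canonical combining
class 0 are not recognized.

```python
import unicodedata

def find_unaccented(haystack: str, needle: str) -> int:
    """Hypothetical search: index of the first occurrence of needle that
    is not immediately followed by a combining mark, or -1 if none."""
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return -1
        end = i + len(needle)
        # combining() returns 0 for characters with combining class 0.
        if end == len(haystack) or unicodedata.combining(haystack[end]) == 0:
            return i
        start = i + 1

# a, backslash, U+0301, b, backslash, c
s = "a\\\u0301b\\c"

print(s.find("\\"))               # 1: the ordinary search finds the accented backslash
print(find_unaccented(s, "\\"))   # 4: the sequence-aware search skips it
```

Every generic string function (find, split, replace, and so on) would
need a counterpart like this for the lexical rules to stay consistent.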

It doesn't matter that accented backslashes don't occur in practice.
I do care about unambiguous, consistent and simple rules.

>> Does a slash followed by a combining code point separate
>> subdirectory names?
>
> In Unix, yes; that's because filenames in Unix are byte streams with
> the byte 0x2F acting as a path separator.

My current implementation doesn't support filenames which can't be
encoded in the current default encoding. The encoding can be changed
from within a program (perhaps locally during execution of some code).
So one can process any Unix filename by temporarily setting the
encoding to Latin1. It's unfortunate that the default setting is more
restrictive than the OS, but I have found no sensible alternative
other than encouraging processing strings in their transportation
encoding.
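
The Latin1 trick works because every byte value maps to exactly one
code point, so any byte string decodes and re-encodes losslessly. Here
is a sketch in Python rather than in my language; the file name bytes
are made up for illustration.

```python
# A file name that is a valid Unix byte string but not valid UTF-8.
raw_name = b"caf\xe9_\xff\xfe.txt"

try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps each byte 0x00..0xFF to the code point with the same value.
as_text = raw_name.decode("latin-1")
round_tripped = as_text.encode("latin-1")

print(utf8_ok)                     # False
print(round_tripped == raw_name)   # True
```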

Anyway, if a string *is* accepted as a file name, the program's idea
about directory separators is the same as the OS's (as long as we assume
Unix; I don't yet provide any OS-generic pathname handling). If the
program assumed that an accented slash is not a directory separator,
I would expect possible security holes (the program thinks that a string
doesn't include slashes, but from the OS's point of view it does).
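
The mismatch can be sketched as follows in Python. The function
has_separator_by_sequence() is a hypothetical check that treats a slash
followed by a combining mark as part of a combining character sequence;
the file name is made up.

```python
import posixpath
import unicodedata

def has_separator_by_sequence(name: str) -> bool:
    """Hypothetical check: a slash counts as a separator only when it is
    not followed by a combining mark."""
    for i, ch in enumerate(name):
        if ch == "/":
            nxt = name[i + 1] if i + 1 < len(name) else ""
            if nxt == "" or unicodedata.combining(nxt) == 0:
                return True
    return False

name = "report/\u0301evil"   # slash followed by U+0301

program_view = has_separator_by_sequence(name)
os_view = b"/" in name.encode("utf-8")   # Unix looks for the byte 0x2F
head, tail = posixpath.split(name)

print(program_view)   # False: the program believes there is no separator
print(os_view)        # True: the OS still sees a path separator
print(head)           # report
```

A program that validated the name with the first check and then passed
it to the OS would be operating on a different directory than it
believes.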

> The rules you are offering are only simple and unambiguous to the
> programmer; they appear completely random to the end user.

And yours are the opposite :-)

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 04:52:51 CST