RE: Unicode in a URL

From: Martin Duerst (duerst@w3.org)
Date: Thu Apr 26 2001 - 22:00:45 EDT


At 15:02 01/04/26 -0700, Paul Deuter wrote:
>Based on the responses, I guess my original question/problem was not
>very well written.

>The %XX idea does not work because it is already in use by lots of
>software to encode many different character sets. So again we need
>something that identifies it as UTF-8.

It's used with lots of different encodings. Adding one more (UTF-8)
won't make things much worse, in the first place.

Second, it turns out that UTF-8 is extremely easy to detect/check,
the easiest of all encodings. For details, see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
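To illustrate the point (a sketch of my own, not from the paper): UTF-8's lead bytes announce how many continuation bytes follow, so text in a legacy encoding almost never passes a structural check by accident. A simplified validator in Python (a full one would also reject overlong forms and surrogates):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is structurally valid UTF-8 (simplified check)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # ASCII byte
            n = 0
        elif 0xC2 <= b <= 0xDF:      # lead byte of a 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:      # lead byte of a 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:      # lead byte of a 4-byte sequence
            n = 3
        else:                        # stray continuation byte or invalid lead
            return False
        i += 1
        for _ in range(n):           # each continuation byte must be 80..BF
            if i >= len(data) or not 0x80 <= data[i] <= 0xBF:
                return False
            i += 1
    return True
```

Bytes from Latin-1 or most other legacy encodings fail this check as soon as a non-ASCII byte appears outside a valid sequence.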

Apart from that, the HTTP protocol says exactly what you can send,
and so you can't just invent something new (such as %u....),
even though it might work 'sometimes'.
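For example (my sketch, not part of the original exchange), the existing %XX escaping carries UTF-8 perfectly well once both ends agree on the encoding, as Python's standard library shows:

```python
from urllib.parse import quote, unquote

# "Dürst" as UTF-8 bytes, each escaped as %XX -- plain HTTP-legal ASCII.
escaped = quote("Dürst", safe="")
print(escaped)                              # D%C3%BCrst

# The receiver unescapes and decodes the bytes as UTF-8.
print(unquote(escaped, encoding="utf-8"))   # Dürst
```

No new escape syntax such as %u.... is needed, and nothing outside what HTTP already permits is sent on the wire.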

>I see this as somewhat analogous to the invention of the U+XXXX notation
>in Unicode consortium writings. They needed a completely unambiguous way
>to tell their readers that the 16 bit value was not "any" 16 bit value
>but rather a specific Unicode codepoint. They invented a new kind of escape
>sequence that said two things: what follows is hex *and* Unicode.
>
>I see the BOM as filling the same need for text files. It was not enough
>to invent Unicode but also a way to identify the encoding.

The BOM for UTF-8 is doing a lot of damage. All the tools that
would work very nicely without the BOM stop working.

Regards, Martin.



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:16 EDT