RE: 32'nd bit & UTF-8

From: Lars Kristan (
Date: Thu Jan 20 2005 - 08:19:12 CST


    Hans Aberg wrote:
    > The situation is the same as that the values > 0x7F are illegal in ASCII.
    > When people made ASCII, they fantasized it was the end of it, and that the
    > full 8 bits would never be used. At least Don Knuth says so. Now the
    > Unicode people evidently wants people to pretend that the values >
    > 0x10FFFF don't exist.

    There are good reasons for both. I won't go into why ASCII was 7-bit. But
    the 0x10FFFF limitation is there because UTF-16 can't handle more. Indeed,
    UTF-16 is the least fortunate of the UTFs, but it won't go away any time
    soon.
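
    To make the UTF-16 ceiling concrete, here is a minimal sketch of the
    surrogate-pair arithmetic. The helper name combine_surrogates is made up
    for illustration; the ranges and the formula are the standard UTF-16 ones.

```python
# Sketch: why UTF-16 surrogate pairs top out at U+10FFFF.
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # 1024 high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # 1024 low (trail) surrogates


def combine_surrogates(high: int, low: int) -> int:
    """Map a surrogate pair to the code point it represents (hypothetical helper)."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the pair addresses 2**20 values above U+FFFF.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair gives the largest code point UTF-16 can express.
max_code_point = combine_surrogates(HIGH_END, LOW_END)
print(hex(max_code_point))  # 0x10ffff
```

    With 1024 x 1024 pairs there is simply nothing left to address beyond
    that value without changing UTF-16 itself.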

    I would say that the current limit is enough for many years. By the time we
    run out, not only will UTF-16 be gone, but perhaps also UTF-8. Text will be
    a drop in the ocean and will be transmitted and stored in UTF-128 :)

    If by any chance UTF-8 survives, it can be extended, probably more or less
    the way it was proposed. But it doesn't need to be extended today.
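
    As a rough illustration of what such an extension could look like, the
    sketch below continues the lead-byte pattern of the original 1 to 6 byte
    UTF-8 definition (RFC 2279), which covers 31 bits. This is an assumption
    on my part, not necessarily the proposal discussed in this thread, and
    the function name encode_extended_utf8 is hypothetical. The 5- and 6-byte
    forms are not valid UTF-8 under the current definition, and the sketch
    does not reject surrogate values.

```python
# (exclusive upper limit, sequence length, lead-byte marker) for multi-byte forms
_FORMS = [
    (0x800, 2, 0xC0),
    (0x10000, 3, 0xE0),
    (0x200000, 4, 0xF0),
    (0x4000000, 5, 0xF8),   # beyond today's UTF-8
    (0x80000000, 6, 0xFC),  # beyond today's UTF-8
]


def encode_extended_utf8(value: int) -> bytes:
    """Encode an integer of up to 31 bits with the 1-6 byte pattern (sketch)."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value does not fit the 6-byte scheme")
    if value < 0x80:
        return bytes([value])
    for limit, length, marker in _FORMS:
        if value < limit:
            out = []
            # Emit continuation bytes from the least significant 6 bits upward.
            for _ in range(length - 1):
                out.append(0x80 | (value & 0x3F))
                value >>= 6
            out.append(marker | value)  # remaining high bits go into the lead byte
            return bytes(reversed(out))
    raise AssertionError("unreachable: range was checked above")


print(encode_extended_utf8(0x20AC).hex())      # e282ac
print(encode_extended_utf8(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

    For values up to 0x10FFFF (surrogates aside) the output matches ordinary
    UTF-8, so existing data is unaffected.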

    I am not sure about what lexers are, but I gathered you want to convert all
    Unicode data to UTF-8 and process it in UTF-8, possibly directly process any
    8-bit stream.

    This is a good approach. In your case it came naturally since it simplifies
    whatever you are doing. But it is a very good approach in general. You would
    have far more problems if you wanted to convert everything to UTF-16 or
    UTF-32. Then you'd have the problem of invalid sequences, which Unicode says
    is not their problem. Unfortunately, invalid sequences cannot be solved
    efficiently and unambiguously without dedicating new codepoints. So it
    cannot be done without cooperation from Unicode. You, on the other hand, can
    use an extended definition of UTF-32 to UTF-8 conversion if you so choose
    and need no approval from Unicode. All you need is to be careful to select
    the best algorithm. I think at least three variants have already emerged. But
    even that decision will probably not be crucial. Persuading Unicode to
    recognise your algorithm is doomed to fail. Unicode has no need to define it
    and will not define it until there is need for it. Until then, they will
    observe what is going on and learn from your and other people's mistakes.
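
    The invalid-sequence problem can be shown in a few lines. The sketch
    below uses Python's built-in error handlers purely as an illustration.
    "surrogateescape" is Python's own convention, not something Unicode
    defines: it maps each undecodable byte to a lone surrogate in the range
    U+DC80..U+DCFF, which is one way of borrowing code points to get an
    unambiguous round trip. The sample bytes are invented.

```python
raw = b"caf\xe9 \xff"  # made-up sample: Latin-1 style bytes, not valid UTF-8

# Strict conversion simply refuses the data.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement is lossy: every bad byte becomes U+FFFD, so the original is gone.
lossy = raw.decode("utf-8", errors="replace")

# Escaping keeps each bad byte distinguishable and reversible.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok)           # False
print(round_trip == raw)   # True
```

    The price is that the intermediate string contains unpaired surrogates,
    so it is no longer well-formed Unicode text.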

    Not that I completely agree with that attitude. In some cases, yes, as is
    your case and the case of the UTF-8 BOM. But there are other cases where it
    would indeed be useful if Unicode sometimes addressed issues that fall
    slightly out of its domain. Like handling invalid sequences, actions on
    invalid sequences or invalid/non characters, transformation of invalid/non
    characters where this transformation can be done, and so on. Specifically, I
    think transforming an unpaired surrogate should be defined. On the other
    hand, I think it is a bit early to define transformation for a 32-bit value.
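
    For reference, finding unpaired surrogates in a sequence of UTF-16 code
    units is the easy part; what to transform them into is the open policy
    question. A minimal sketch, with a hypothetical function name and made-up
    sample data:

```python
def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate is only valid when a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here was not preceded by a high one.
            bad.append(i)
        i += 1
    return bad


sample = [0x41, 0xD800, 0xDC00, 0xD800, 0x42, 0xDC00]
print(find_unpaired_surrogates(sample))  # [3, 5]
```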

