Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Tue Jan 18 2005 - 13:27:18 CST


    On 2005/01/18 03:31, Philippe Verdy at verdy_p@wanadoo.fr wrote:

    >>> The old RFC you're referring to is not designating UTF-8, but UTF-BSS,
    >>> which is a transformation format,
    >>
    >> OK. Fine, so we have a name for it.
    >
    > I was not sure about the name of it when writing the message.

    According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, UTF is short for
    UCS Transformation Format, where UCS stands for Universal Character Set.
    The extensions I am talking about should certainly have a separate name,
    perhaps UTF-8X for "extended", or BTF-8 for "bit (byte) transformation
    format".

    I should mention that in the first version of the UTF-8 and UTF-32 regular
    expression generator functions for Unicode character classes that I wrote, I
    excluded the illegal Unicode numbers: the overloaded ones as well as
    U+D800-U+DFFF and U+FFFE-U+FFFF. But it turned out that the lexer generator
    then becomes more complicated. So I felt it prudent to add regular
    expression generator functions for the overloaded UTF-8 numbers as well, so
    as to make it convenient to generate error handling.
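
    To illustrate what "speaking about what is wrong" can look like in
    practice, here is a minimal Python sketch. It is not the regular
    expression generator described above. It assumes that "overloaded"
    covers overlong forms and values above U+10FFFF, and it decodes with the
    original 1 to 6 byte UTF-8 layout so that such sequences can be named
    rather than merely rejected. The names decode_extended, classify and
    MIN_VALUE are hypothetical.

```python
# Smallest value that legitimately needs a sequence of the given length.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

# (total length, lead-byte prefix, payload mask of the lead byte)
LEAD_FORMS = (
    (2, 0xC0, 0x1F),
    (3, 0xE0, 0x0F),
    (4, 0xF0, 0x07),
    (5, 0xF8, 0x03),
    (6, 0xFC, 0x01),
)


def decode_extended(seq: bytes):
    """Decode the first sequence in seq using the 1..6-byte layout.

    Returns (value, length); raises ValueError if the bytes are malformed.
    """
    if not seq:
        raise ValueError("empty input")
    lead = seq[0]
    if lead < 0x80:
        return lead, 1
    for length, prefix, mask in LEAD_FORMS:
        if lead & (0xFF ^ mask) == prefix:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(seq) < length:
        raise ValueError("truncated sequence")
    value = lead & mask
    for byte in seq[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value, length


def classify(seq: bytes) -> str:
    """Name the error category of the first sequence in seq, or 'ok'."""
    try:
        value, length = decode_extended(seq)
    except ValueError:
        return "malformed"
    if value < MIN_VALUE[length]:
        return "overlong"
    if value > 0x10FFFF:
        return "out-of-range"
    if 0xD800 <= value <= 0xDFFF:
        return "surrogate"
    if value in (0xFFFE, 0xFFFF):
        return "noncharacter"
    return "ok"
```

    A lexer built this way can attach a distinct error action to each
    category, for example classify(b"\xC0\x80") gives "overlong" and
    classify(b"\xED\xA0\x80") gives "surrogate".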

    The Unicode standard is like Big Brother in George Orwell's "1984", making
    it possible to only speak about what is right, but not what is wrong. The
    lexer generator needs to be able to speak about what is wrong as well, in
    order to handle it properly.

    Besides, even though Unicode has declared that it will never use more than
    21 bits, its track record shows that it has reneged on such promises
    before. It might be prudent to nail down a full 32-bit encoding, declaring
    UTF-8/32 to be subsets of that.
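
    As a rough sketch of the subset idea, the Python below encodes any
    32-bit number with the original 1 to 6 byte UTF-8 layout (which reaches
    31 bits), plus one assumed 7-byte form with lead byte 0xFE for the
    remaining values. That 7-byte form is purely an illustration chosen
    here, not something any standard or this thread defines, and the names
    LAYOUT and encode_extended are hypothetical. For valid Unicode scalar
    values the output coincides with ordinary UTF-8.

```python
# (largest value, lead-byte prefix, number of continuation bytes)
LAYOUT = (
    (0x7F, 0x00, 0),
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
    (0xFFFFFFFF, 0xFE, 6),  # ASSUMPTION: hypothetical 7-byte form
)


def encode_extended(value: int) -> bytes:
    """Encode a 32-bit number with a UTF-8-style variable-length layout."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Pick the shortest form that can hold the value.
    for limit, prefix, ncont in LAYOUT:
        if value <= limit:
            break
    # High bits go into the lead byte, then 6 bits per continuation byte.
    lead = prefix | (value >> (6 * ncont))
    tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(ncont - 1, -1, -1)]
    return bytes([lead] + tail)
```

    Under these assumptions, encode_extended(0x20AC) yields the same bytes
    as "\u20ac".encode("utf-8"), so standard UTF-8 stays a strict subset of
    the larger encoding.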

      Hans Aberg


