RE: Subject: Re: 32'nd bit & UTF-8

From: Richard T. Gillam (
Date: Fri Jan 21 2005 - 10:49:10 CST

    >> Good grief. We seem to be going through another round of "night of
    >> the living thread."
    >Have you found out first now. :-)

    No; I've been biting my tongue for several days.

    >> Depending on your particular situation,
    >> any of the three [UTFs] might be the best fit. There's a reason all
    >> exist.
    >At least for now. UTF-16 cannot be extended beyond the current range,
    >but UTF-8/32 can both be extended to 2^32
    >numbers, the size of a natural type. Even though UTF-16 has a distinct
    >legacy advantage, it likely does not have that
    >in the long run. So deprecating it seems to be a distinct possibility.

    I really wish you'd quit saying this. This simply isn't true. Or, at
    the very least, is EXTREMELY unlikely and very far into the future. As
    several other people have already pointed out to you, the Unicode
    codespace contains room for 1.1 million characters. 150,000 code
    positions have been set aside for private use or other special purposes,
    leaving room for 1 million actual characters. Right now, after 15 years
    of encoding, 95,000 of those spaces have been assigned to characters.
    At the current rate of encoding, it'll be centuries before the space
    fills up, if it ever does. The consensus seems to be that there just
    aren't that many things that will ever merit encoding.
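
    For anyone who wants to check the arithmetic, here is a rough sketch.
    The codespace size (17 planes of 65,536 code points) is exact; the
    reserved and assigned counts are the rounded figures quoted above, and
    the constant-rate extrapolation is an assumption of this sketch, not a
    forecast. Even this crude straight-line estimate gives well over a
    century of headroom.

```python
# Back-of-the-envelope check of the codespace figures quoted in the message.
# RESERVED and ASSIGNED are rounded numbers from the text, not exact values
# taken from the standard.

CODESPACE = 17 * 0x10000   # 17 planes of 65,536 code points: U+0000..U+10FFFF
RESERVED = 150_000         # private use and other special purposes (rounded)
ASSIGNED = 95_000          # characters assigned so far (rounded)
YEARS_SO_FAR = 15          # years of encoding work to date


def years_until_full(codespace: int, reserved: int, assigned: int, years: float) -> float:
    """Linear extrapolation: remaining room divided by the average yearly rate."""
    rate_per_year = assigned / years
    remaining = codespace - reserved - assigned
    return remaining / rate_per_year


if __name__ == "__main__":
    estimate = years_until_full(CODESPACE, RESERVED, ASSIGNED, YEARS_SO_FAR)
    print(f"codespace size:   {CODESPACE:,} code points")
    print(f"years until full: roughly {estimate:.0f} at a constant rate")
```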

    The only thing that would put the codespace in danger of filling up is a
    sudden loss of discipline on the part of the committees that maintain
    Unicode, one that turns Unicode into something other than what it's
    supposed to be: for example, if people tried to turn Unicode into a
    generic glyph registry, tried to extend it to do styled text, or started
    allocating code points to represent non-text data. The current
    committee is EXTREMELY vigilant and won't let these things happen.
    People suggest this kind of thing all the time and routinely get slapped
    down. It's not that there isn't a need for some of this stuff; it's
    just that Unicode isn't the thing that should fill this need. Unicode
    is a plain-text character encoding standard. Period. Trying to make it
    something else would destroy it.

    The space is not going to fill up, and UTF-16 will never have to be
    deprecated. Get that notion out of your head once and for all.
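
    To make the range concrete: UTF-16 reaches every code point from
    U+10000 through U+10FFFF with a pair of surrogates, so as long as the
    codespace stays at U+10FFFF there is nothing UTF-16 cannot represent.
    A minimal sketch of the mapping follows; the function name is just
    for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        high, low = to_surrogate_pair(cp)
        print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```

    The 1,024 high surrogates times 1,024 low surrogates give 1,048,576
    supplementary code points, which together with the 65,536 in the BMP
    is the whole 1,114,112-position codespace.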

    >Well, in UTF-8 it has to go away as a requirement to be ignored in
    >processes: Either Unicode removes it in the standard, or one will see
    >that people just don't bother following the
    >Unicode standard in that respect.

    Again, many people have addressed this point and you're ignoring them.
    UTF-8 HAS NO BOM. There is nothing in the Unicode standard mandating or
    even encouraging the use of EF BB BF at the beginning of a UTF-8 file.
    That sequence has no special meaning in UTF-8; it's just a zero-width
    no-break space (U+FEFF). FE FF at the top of a UTF-8 file is just flat

    The practice of using EF BB BF as a signature to indicate that a
    file is in UTF-8 is mentioned in one spot in the standard, but not
    encouraged. Some applications (notably Notepad) do this; many do not.
    You'll also see it from time to time coming out of an application that
    doesn't handle UTF-16 or UTF-32 properly. So EF BB BF at the top of the
    UTF-8 file does occur in practice and it's good for software to be aware
    of it (but relatively harmless if it isn't). But the fact that it
    occurs in practice is a VERY different thing from it being mandated by
    Unicode, which it absolutely isn't.
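
    Being "aware of it" can be as small as the sketch below: accept the
    three bytes if they are there, and don't require them if they aren't.
    The helper names are made up for illustration; codecs.BOM_UTF8 and
    the "utf-8-sig" codec are part of Python's standard library.

```python
import codecs

# The three-byte signature some applications write at the start of UTF-8 files.
UTF8_SIGNATURE = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; otherwise return the data unchanged."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 whether or not the optional signature is present."""
    return strip_utf8_signature(data).decode("utf-8")


if __name__ == "__main__":
    with_signature = UTF8_SIGNATURE + "hello".encode("utf-8")
    without_signature = "hello".encode("utf-8")
    print(decode_utf8_tolerant(with_signature))
    print(decode_utf8_tolerant(without_signature))
```

    Software that skips this step still decodes the file; it just ends
    up with a U+FEFF at the start of the text. Python's "utf-8-sig"
    codec does the same stripping on decode.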

    I'll respond to your more substantive note after I get back from

    --Rich Gillam
      Language Analysis Systems, Inc.
