Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Thu Jan 20 2005 - 06:51:11 CST

  • Next message: Christopher Fynn: "Re: Subject: Re: 32'nd bit & UTF-8"

    On 2005/01/20 09:40, Arcane Jill wrote:

    >... The truth is that the Unicode standard doesn't give a damn what
    > you do - it only cares about what you /call/ it. Thus, you can use any form of
    > encoding you like - so long as you don't call it "UTF-8". Similarly, you are
    > absolutely free to ignore BOMs - so long as you don't claim to be a Unicode
    > Conformant Process.

    The problem is that Unicode does require that UTF-8 programs ignore the
    BOM. One will then end up with a lot of programs that are formally not
    UTF-8, but will probably use that name anyhow. The whole point of Unicode
    as a universal encoding for everyone is then lost.

    > Of course, it would be the work of ten minutes to write a Unicode Conformant
    > version of "cat".

    If the changes were merely about a few simple programs, nobody would
    complain; one would quietly provide the upgrades for those programs. But
    evidently, UNIX uses a setup in which BOMs do not fit well. For example,
    scripts will not be executed properly. There are other major hurdles. Some
    of these problems are described in

    >> The problem is that UNIX software looks at the first bytes to determine if
    >> it is a shell script.
    > As noted above, so long as such software does not claim to be Unicode
    > Conformant, who cares? Ah - but wait. What if there are users out there
    > demanding Unicode Conformant software? Hmmm...

    This is another problem. For example, US Federal Agencies may be required
    to use only software that conforms to certain standards. Suppose at the
    same time that it is almost impossible to adapt UNIX to be strictly UTF-8
    conformant. Then one cannot formally use UNIX anymore on US federal
    government computers...
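    The shell-script point quoted above can be illustrated concretely. The
    exec machinery decides whether a file is an interpreter script by looking
    at the literal first two bytes; a BOM in front of the shebang defeats
    that check (a sketch, with a hypothetical helper name):

```python
# The kernel's script detection checks for the two bytes "#!" at offset 0.
# A UTF-8 BOM pushes "#!" to offset 3, so the file is no longer recognized.
def looks_like_script(first_bytes: bytes) -> bool:
    return first_bytes.startswith(b"#!")
```

    With this check, `b"#!/bin/sh\n..."` is a script, while the same file
    prefixed by `b"\xef\xbb\xbf"` is not.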

    >> And lexers that are made for ASCII
    >> data will most likely treat a BOM as an error.
    > Quite rightly so. A BOM /is/ an error in ASCII, as is /any/ character beyond
    > U+007F. Lexers, or indeed /any/ software made purely for the seven-bit-wide
    > standard that is ASCII, can't be expected to work correctly if bytes 0x80 to
    > 0xFF are present in the stream.
    > Or did you mean "lexers that are made for 8-bit character sets which are
    > supersets of ASCII and trivially encoded"?

    Yes. The old notion of strict 7-bit ASCII is pretty much outdated by now.
    These are programs that process 8-bit bytes, assuming that a leading 0 bit
    indicates ASCII. The other values are often reserved for some ISO Latin
    encoding.
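    A sketch of the byte check such a lexer effectively performs, assuming it
    simply rejects anything outside the 7-bit range; the BOM's three bytes
    0xEF 0xBB 0xBF would be the very first thing it chokes on:

```python
# A lexer built for ASCII (or a trivially encoded 8-bit superset it does
# not understand) typically errors out on any byte with the high bit set.
def check_seven_bit(data: bytes) -> None:
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            raise ValueError(f"unexpected byte 0x{byte:02x} at offset {offset}")
```

    Fed a BOM-prefixed file, this raises immediately at offset 0.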

    >> The main point is that BOM will not be specially treated in the UNIX world,
    >> regardless what Unicode says. So I guess MS does not want its text files to
    >> be read in the UNIX world. Unicode has made the mistake of favoring a
    >> special platform over all the others.
    > It would be more accurate to say that Unicode Conformant Processes often
    > do not care if non-Unicode-Conformant Processes can't read them. Unicode
    > has therefore "made the mistake" of favoring processes that conform to
    > the Unicode Standard over those that don't. And this is a problem
    > because...?

    It is hard to make UNIX processes into Unicode Conformant Processes as
    long as the BOM requirement is present.

    >> So it is clear that MS somehow has tricked Unicode into adopting an
    >> in-house file format as part of the UTF-8 encoding, at the expense of
    >> other platforms. Unicode might lose prestige here, favoring one
    >> platform over all others.
    > Er ... what?
    > Sorry, I don't understand. I'm certain you're wrong though. I find Windows
    > support of Unicode to be laughable.

    Perhaps the words were too strong. There is some MS Windows text editor
    that always generates BOMs, even for UTF-8 files. Somehow the practice of
    that particular piece of software has slipped into the Unicode UTF-8
    standard.

    >> The problem is that platforms such as UNIX use methods other than the
    >> file contents to determine file encodings, and there are other problems
    >> with it, see <>
    > I am not clear why you keep citing this web page. It is not definitive. This
    > one is:

    The webpage you mention is perhaps definitive for Unicode. But the other
    page deals with how to handle it in practice in the UNIX environment.
    Experience shows that standards that are not practically usable will die.
    So Unicode is at a practical disadvantage there, on the BOM question,
    regardless of what formal decision one makes.

    >> One might give a purely mathematical definition of a Unicode character,...

    > One might indeed, but, astonishingly, this has already been done, which is why
    > people are arguing with you on this one. I've been lurking on this newsgroup
    > for a while now, posting occasionally, and making a fool of myself more often
    > than not. And one thing I've learned is that YOU HAVE TO USE THE JARGON.
    > There's no way round it. You have to use the terms defined in the web page
    >, and the document
    >, or else you WILL be
    > misunderstood. UTF-8 is called a "Unicode Encoding Form". It is a mapping
    > between every "Unicode Scalar Value" and a finite subset of all "Encoded
    > Character Sequence"s. If you get the jargon right, you'll get a lot less
    > argument. Like I said at the start - what you call things is very important to
    > Unicode. Like any specialist jargon, the /intention/ is to enable people to be
    > clear, precise and unambiguous. No technical vocabulary ever /intends/ to
    > divide the world into those-who-know-it (one of us) and those-who-don't
    > (obviously an outsider), but it happens, as it does with medicine, physics,
    > biology, whatever. It's not on purpose, but that's life. On this list,
    > however,
    > I recommend that you take the trouble to read the definitions I just cited. It
    > will make your words clearer to others (as well as others' words clearer to
    > you).

    Thanks for the pointers. The problem is not only, as some other posters
    indicated, that Unicode defines a jargon and that the Unicode folks want
    one to use just that, but that Unicode has taken over jargon from other
    places and then made its own definitions, instead of giving them new
    names. So Unicode UTF-8 now means something other than what was originally
    meant in the RFCs. It is as if one formally defined "apples" to mean what
    others call "oranges", and so forth. The technical jargon so created will
    be fully valid, but rather prone to misunderstandings.
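    To make the quoted terminology concrete: as a "Unicode Encoding Form",
    UTF-8 maps each Unicode scalar value to one well-defined byte sequence.
    A small illustration (the sample values are my own choices):

```python
# UTF-8 assigns every Unicode scalar value a unique byte sequence;
# U+FEFF, the character used as a BOM, is just one more such mapping.
samples = {
    0x0041: b"A",             # one byte for ASCII "A"
    0x00E9: b"\xc3\xa9",      # two bytes for U+00E9
    0xFEFF: b"\xef\xbb\xbf",  # three bytes for U+FEFF
}
for scalar, expected in samples.items():
    assert chr(scalar).encode("utf-8") == expected
```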

      Hans Aberg

    This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 06:53:15 CST