Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 12:16:29 CST

     On 2005/01/20 14:15, Lars Kristan at lars.kristan@hermes.si wrote:
      
    > Hans Aberg wrote:
    >> The main point is that the BOM will not be specially treated in
    >> the UNIX world, regardless of what Unicode says.
    >
    > I won't say it won't, I won't say it will.
    >
    > The UTF-8 BOM breaks UNIX right at its foundation, because text
    > is often treated as opaque, almost binary data.

    Right.
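
To illustrate the point about opaque data, here is a minimal Python sketch. The two "files" are invented for illustration. It shows that byte-wise concatenation, which is roughly what a tool like cat does, leaves the second file's BOM in the middle of the result.

```python
import codecs

# Two hypothetical "files", each saved with a UTF-8 BOM (bytes EF BB BF).
file_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation: the data is treated as opaque, so nothing
# knows that the second BOM should be dropped.
combined = file_a + file_b

# The "utf-8-sig" codec strips only a leading BOM; the BOM that came
# from file_b survives as the character U+FEFF inside the text.
text = combined.decode("utf-8-sig")
stray_bom_count = text.count("\ufeff")
```

Any tool that wanted to avoid the stray U+FEFF would have to know it is handling text, which is the text-mode distinction discussed further down.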

    > UTF-8 BOM is in a way no worse than UTF-16 BOM. Except that UTF-16 plain text
    > is rare.

    The problem with the UTF-8 BOM is that with 8-bit data, one wants to reuse
    the 8-bit extended-ASCII handling already available. Since one does not
    expect to handle 16-bit data that way, the UTF-16 BOM is much less of a
    problem from the point of view of implementing the OS.

    > This is a lot of work and a lot of confusion.

    So it seems.

    >...
    > Examine the above lists and see how things are strongly related. It is
    > practically impossible to allow BOM on UNIX without introducing the text mode.

    It is clear that requiring a BOM for UTF-8 on UNIX will cause a lot of
    problems. And it is unclear how to find effective solutions.

    > If we think that UTF-8 will be THE encoding to be used for decades, then we
    > shouldn't burden it with the BOM.

    So I think too. The idea of having file markers tied to OS file handling
    seems to be an archaic one. Unicode, in effect, tries to turn the clock
    back.

    > If we think other formats will start
    > gaining, then we will need the mechanism to distinguish among them and text
    > mode is inevitable. But, introducing text mode on UNIX will be a pain. UNIX
    > would much rather go with the existing binary approach and stick with UTF-8 as the
    > format to stay.

    There are already UNIX versions, such as Mac OS X, making that distinction
    by introducing extra files, or "resource" files. It is easier at the basic
    OS level to make use of several binary files bundled together as one unit,
    rather than having a single file with all the information. Unicode breaks
    the possibility of developing the most efficient OS.

    >> So I guess MS does not want its text files to
    >> be read in the UNIX world.
    >
    > There are many possibilities. Maybe they are convinced their approach is
    > better. Maybe they know it is not, but are convinced this approach will
    > prevail. And want to be a part of it.
    >
    > Maybe it's a conspiracy. Maybe they really don't want their files to be useful
    > on UNIX. Funny, it's even worse: as I pointed out, Notepad doesn't even
    > display UNIX files properly unless LFs are extended into CRLF pairs. One would
    > expect at least a one way compatibility, to help users to 'move' to Windows.
    > But perhaps they think a clean cut is even more 'convincing'.
    >
    > I am not saying any of the above scenarios is true. Maybe it just happens
    > naturally. I suppose it does. UTF-8 BOM is simply useful on Windows. And
    > simply devastating on UNIX. It's the text mode that makes it that way.

    Conspiracy or not, it just seems to happen that big companies develop their
    own formats, incompatible with others' formats. Standards, which should
    properly help clear up the problem, get corrupted.

    >> Unicode has made the mistake of favoring a
    >> special platform over all the others.
    >
    > I am not sure what Unicode says about the UTF-8 BOM. I assume it is loosely
    > allowed.

    Folks here say that the BOM is actually required at the beginning of
    files, in order for them to be allowed to be called UTF-8. Then it is
    obvious that it is a file format, not a character encoding.

    > Which is actually the best Unicode can come up with. Deciding on one
    > approach over another before FULL implications are understood would be a
    > mistake.

    Unicode has evidently already made that mistake.

    > And it would hit one platform or the other. Waiting for the problem
    > to be fully understood is NOT a mistake. And loosely allowing it is waiting.

    Just dropping the BOM as a requirement in UTF-8 would remove the problem.
    BOMs need not even be recognized by UTF-8, because one can still use them
    to define a file format. Then it only means that the UTF-8 proper code is
    what one gets when the BOM has been removed.
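
As a rough sketch of that last point, treating the BOM as a file-format marker layered on top of UTF-8 could look like the following in Python. The function name strip_utf8_bom and the sample data are made up for illustration; only codecs.BOM_UTF8 comes from the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    What remains is the "UTF-8 proper" byte sequence; input that has
    no BOM is returned unchanged.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical sample inputs: the same text with and without the marker.
with_bom = codecs.BOM_UTF8 + b"hello\n"
without_bom = b"hello\n"

proper_a = strip_utf8_bom(with_bom)
proper_b = strip_utf8_bom(without_bom)
```

Under this reading, both inputs carry the same UTF-8 text, and only the file-format layer differs.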

      Hans Aberg


