Re: UTF-8 'BOM'

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 08:14:47 CST

    On 2005/01/20 13:31, Philippe Verdy at vpi92@yahoo.fr wrote:

    > Hans Aberg <haberg@math.su.se> wrote:
    >> In fact, one idea might be to add \xFFFE and \xFFFF as delimiters for
    >> file format markers. Then programs that do not need such markers need
    >> not deal with them. Other programs can make use of them, or simply
    >> remove them at will.
    >> Such markers could also be used to alter the format within the
    >> same stream.
    >
    > What a horrible idea! Not only are you rejecting the idea of the BOM,
    > but now you want to introduce reassignments (of code points that are
    > already immutably defined as NON-CHARACTERS) that will BREAK the
    > existing STANDARD, which DOES use the fact that FFFE and FFFF are
    > non-characters to reliably recognize byte-order marks and UTF
    > encoding forms!

    Sorry, there is a typo here: one will have to use \xFEFF in order to know
    that it is not byte swapped. See below though.
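
    To illustrate why \xFEFF works for this purpose: read with the wrong
    byte order, it comes out as \xFFFE, which is a non-character, so the two
    cases cannot be confused. The following is only a minimal sketch in
    Python; the names swap16 and detect_utf16_order are made up for this
    example and do not come from any standard or library.

```python
import struct

BOM = 0xFEFF  # U+FEFF, used as the byte order mark


def swap16(code_unit: int) -> int:
    """Swap the two bytes of a 16-bit code unit."""
    return ((code_unit & 0xFF) << 8) | (code_unit >> 8)


def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order from the first two bytes (hypothetical helper).

    Returns 'big', 'little', or 'unknown' when no mark is present.
    """
    if len(data) < 2:
        return "unknown"
    # Interpret the first two bytes as a big-endian 16-bit unit.
    (unit,) = struct.unpack(">H", data[:2])
    if unit == BOM:
        return "big"
    if unit == swap16(BOM):  # 0xFFFE: the mark seen byte swapped
        return "little"
    return "unknown"
```

    Here swap16(0xFEFF) gives 0xFFFE, so a reader that sees 0xFFFE first
    knows that the stream has the opposite byte order.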

    > I strongly reject such an idea. Accept the idea of BOMs as they are,
    > and then accept that they already expect that FFFF and FFFE won't EVER
    > be used within encoded texts.

    First of all, I want the BOM requirement to be dropped from UTF-8. Or one
    could invent a new variation of UTF-8 which does not have a BOM
    requirement. (This latter approach does not seem prudent, as one should
    keep down the number of encodings.)

    But then there seems to be a need for a method of indicating file
    encodings by means of the file contents. What one wants then is that this
    indicator should not be confused with the Unicode data proper. Further,
    this file contents indicator should ideally be independent of encodings
    like UTF-8, ... So then one might agree that it is admissible, but not
    required, to indicate not only the encoding of the whole file, but also
    to use the indicator to shift encodings within a stream. I do not push
    for this myself; I am only indicating that if one should admit file
    contents indicators, this might be a way to go.
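
    As a rough illustration of a file contents indicator that programs may
    use or simply remove, here is a minimal sketch in Python based on the
    existing BOM byte sequences. It is not a proposal for a new marker
    scheme. The function name sniff_and_strip and the UTF-8 fallback are
    assumptions made for this example; the BOM constants are those of the
    Python standard library codecs module.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_and_strip(data: bytes, default: str = "utf-8"):
    """Return (encoding, payload) with any leading mark removed.

    Hypothetical helper: if no mark is found, the data is returned
    unchanged and the encoding falls back to `default` (an assumption).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


# Usage: a program that does not care about the mark just drops it.
encoding, payload = sniff_and_strip(codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8"))
text = payload.decode(encoding)
```

    A program that has no use for the indicator can call this once and work
    on the payload only; data without any mark passes through untouched.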

      Hans Aberg


