Re: Names for UTF-8 with and without BOM

From: Tex Texin (tex@i18nguy.com)
Date: Sat Nov 02 2002 - 21:04:27 EST


    John Cowan wrote:
    >
    > Tex Texin scripsit:
    >
    > > So when the parser gets JOECODE, I can understand ignoring the signature
    > > and autodetection, but exactly how does it find the first "<"?
    >
    > Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
    > be UTF-32 big-endian, but we'll suppose the parser can't handle that).
    > JOECODE is what's left. At worst it is in some other encoding and/or
    > not well-formed, in which case you expect an error and you get one.
    > Of course the processor knows that "<" is encoded as 0xFF in JOECODE....
    >
    > The point is that signatures don't decode to a character: processors in
    > general, not just XML processors, are expected to skip them.
    >
    > > It must have to try all of the encodings known to it... ugh.
    >
    > In such a bad case, that's all you can do.

    John,

    The bad case is what I was whinging about, since more processors deal
    with more than 3 encodings. I do realize that, because the initial
    characters are fixed, autodetection is ultimately not as bad as it is
    for plain text.
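
    To illustrate why fixed initial characters make the job tractable, here
    is a minimal sketch of the "try all known encodings" approach. It is not
    any particular parser's algorithm: the candidate list and the function
    name sniff are assumptions made up for this example, and cp037 merely
    stands in for some non-ASCII-compatible encoding a processor might know.

```python
import codecs

# BOM signatures, longest first so a UTF-32LE BOM is not read as UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Hypothetical list of encodings this imaginary processor supports.
# "utf-8" here really means "some ASCII-compatible encoding".
CANDIDATES = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "cp037"]


def sniff(data: bytes):
    """Return (encoding guess, number of signature bytes to skip)."""
    # 1. A signature, if present, settles the encoding family.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # 2. No signature: see which candidate encodes "<?xml" as the
    #    bytes the document actually starts with.
    for name in CANDIDATES:
        if data.startswith("<?xml".encode(name)):
            return name, 0
    # 3. Nothing matched; the caller must report an error or use a default.
    return None, 0
```

    The cost grows with the number of candidates, but each test is only a
    short prefix comparison, which is much cheaper than guessing the
    encoding of arbitrary plain text.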

    Interestingly, although I didn't study it in detail, looking at RFC 2376
    for how it prioritizes charset conflicts, it seems to recommend (in
    section 5) stripping the BOM when converting from UTF-16 to other
    charsets, without considering that UCS-4 would want to keep it.

    Also, in considering charset conflicts, RFC 2376 fails to consider
    conflicts between the signature and the encoding declaration. (I have a
    UTF-16BE BOM and the encoding declaration is for UTF-8...)
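
    The conflict described above can be detected mechanically. The following
    is only a sketch under stated assumptions: the function name
    signature_vs_declaration is hypothetical, the regular expression is a
    simplification of the XML declaration grammar, and treating a declared
    "UTF-16" as consistent with either UTF-16 byte order is a choice made
    for this example. What a processor should do once it finds a conflict
    is exactly the question left open.

```python
import codecs
import re

# Signatures considered in this sketch (UTF-32 omitted for brevity).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def signature_vs_declaration(data: bytes):
    """Return (encoding implied by BOM, declared encoding, conflict flag)."""
    for bom, family in BOMS:
        if data.startswith(bom):
            break
    else:
        # No signature at all, so there is nothing to conflict with.
        return None, None, False
    # Decode the start of the document the way the signature says to.
    head = data[len(bom):len(bom) + 200].decode(family, errors="replace")
    match = DECL.match(head)
    if match is None:
        return family, None, False
    declared = match.group(1)
    label = declared.lower().replace("_", "-")
    # Assumption: a plain "utf-16" label agrees with either byte order.
    consistent = label == family or (
        label == "utf-16" and family.startswith("utf-16")
    )
    return family, declared, not consistent
```

    For the case mentioned above (a UTF-16BE BOM with a declaration of
    UTF-8), the conflict flag comes back true.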

    I'll have to check for a more up-to-date RFC.

    All in all, I agree with you and Michka (yes, you were right and I was
    wrong, Michael!) that it isn't that big a deal to support a variety of
    BOMs, but the world did not need yet another way to sometimes (maybe
    it's there), almost (maybe it's unique), redundantly (one hopes it's
    redundant and not conflicting) declare an encoding.

    tex

    -- 
    -------------------------------------------------------------
    Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
    Xen Master                          http://www.i18nGuy.com
                             
    XenCraft		            http://www.XenCraft.com
    Making e-Business Work Around the World
    -------------------------------------------------------------
    

