Re: Why is "endianness" relevant when storing data on disks but not when in memory?

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Mon, 07 Jan 2013 04:12:31 +0100

Doug Ewell, Sun, 6 Jan 2013 17:58:38 -0700:
> Leif Halvard Silli wrote:
>
>> I believe that even the U+FEFF *itself* is either UTF-32LE or UTF-32BE.
>> Thus, there is, per se, no implication of lack of byte-order mark in
>> Martin’s statement.
>
> By definition, data in the "UTF-nBE" or "UTF-nLE" encoding scheme
> (for whatever value of n) does not have a byte-order mark.

Sounds like you see "UTF-32BE data" as a synonym for "UTF-32BE encoding".
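
My point was only that, once serialized, even the U+FEFF signature
itself is big-endian or little-endian *data*. A small Python
illustration (the 'utf-32-be'/'utf-32-le' codec names are Python's,
standing in for the encoding schemes):

   # The signature character U+FEFF has no byte-order-free serialization:
   bom = '\ufeff'
   print(bom.encode('utf-32-be').hex())   # 0000feff - big-endian bytes
   print(bom.encode('utf-32-le').hex())   # fffe0000 - little-endian bytes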

>> Assuming that the label "UTF-32" is defined the
>> same way as the label "UTF-16", then it is an umbrella label or a
>> "macro label" (hint: macro language) which covers the two *real*
>> encodings - UTF-32LE and UTF-32BE.
>
> I've sometimes wished it were that way, that (for example) the
> "UTF-32BE" and "UTF-32LE" encoding schemes were defined as variations
> of "UTF-32"

The encompassed languages of a macrolanguage are not variations of the
macrolanguage. Likewise, the "UTF-32BE" label does not designate a
variant of "UTF-32". You may read me that way, but I have not meant
that "UTF-32BE" is a variant of "UTF-32".

If something is labelled "no" (for "Norwegian"), then one must "taste"
it to know whether the content is Norwegian Bokmål ("nb") or Norwegian
Nynorsk ("nn"). Likewise, if something is labelled, by default (as in
XML) or explicitly, as "UTF-16", then the parser must taste/sniff -
typically by sniffing the BOM - whether the document is big-endian or
little-endian.
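
A rough sketch, in Python, of the kind of tasting I mean (the function
name sniff_utf16 is just illustrative; the "no BOM means big-endian"
fallback is what RFC 2781 prescribes for data labelled "UTF-16"):

   def sniff_utf16(data: bytes) -> str:
       # "Taste" the first two octets of data labelled "UTF-16".
       if data.startswith(b'\xfe\xff'):
           return 'big-endian (BOM found)'
       if data.startswith(b'\xff\xfe'):
           return 'little-endian (BOM found)'
       return 'big-endian (no BOM - the fallback for the "UTF-16" label)'

   print(sniff_utf16('A'.encode('utf-16-be')))   # no BOM -> treated as big-endian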

When the U+FEFF is supposed to be interpreted as a BOM, we cannot
label the document as e.g. "UTF-16BE" but must use, by default or
explicitly, the label "UTF-16". But "UTF-16BE data" should be a valid
term in either case (provided the UTF-16 file is big-endian).
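
The difference between the two labels shows up when the same four bytes
are decoded under each of them - a small Python illustration (Python's
codec names again stand in for the MIME labels):

   data = b'\xfe\xff\x00\x41'             # U+FEFF followed by "A", big-endian
   print(repr(data.decode('utf-16')))     # 'A'       - FE FF consumed as a BOM
   print(repr(data.decode('utf-16-be')))  # '\ufeffA' - FE FF kept as character content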

> with special rules related to the BOM, not defined as
> completely separate encoding schemes. But that's not how the
> definitions are written.

A file labelled "UTF-16" is specified to contain BOM + big-endian data
or BOM + little-endian data or - third - just big-endian data, without
BOM. Thus, one of the encoding variants that can be legally labelled
"UTF-16", is inseparable from "UTF-16BE" in every way.

> The LE and BE versions are not at all "the two *real* encodings" when
> there is real-world data that contains an initial U+FEFF meant to be
> interpreted as a BOM or "signature."

The "UTF-16" label - whether set implicitly by the format (such as in
XML when the file is 16-bit and there is no declaration) or explicitly
(such as via HTTP), is mostly just a label (and not an encoding) whose
meaning is: "this is 16 bit, but please sniff the endianness [and thus
the encoding] of the data". Since "UTF-16" can cover 3 different ways
to do it, then that seems like the most sensible definition.

I think it is more fruitful to focus on the fact that "UTF-16" is a
distinct label rather than a distinct encoding. I don't agree that the
RFC that defines UTF-16/UTF-16LE/UTF-16BE describes the 3 as separate
encodings. In fact, RFC 2781 barely defines UTF-16LE and UTF-16BE as
separate encodings:[1]

   Appendix A of this specification contains registrations for three
   MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets
   represent the combination of a CCS (a coded character set) and a CES
   (a character encoding scheme). Here the CCS is Unicode/ISO 10646 and
   the CES is the same in all three cases, except for the serialization
   order of the octets in each character, and the external determination
   of which serialization is used.

The only way "UTF-16" differs from the other two is in "the external
determination of which serialization is used". Also, the title of the
RFC signals that it is a single encoding: "UTF-16, an encoding of ISO
1064". In fact, through out RFC 2781, it is clear that it talks about a
single encoding.

The "UTF-16" label does not mandate the use of the BOM. It is the *XML*
specification that mandates the use of the BOM. Citing the RFC
again:[2]

   An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
   would occur with document formats that mandate a BOM in UTF-16 text,
   thereby requiring the use of the "UTF-16" tag only.

It is also worth noting that the RFC says that if the author(ing tool)
doesn't know the endianness, then he/it "MUST" use the "UTF-16" label.
This, again, is a principle very reminiscent of how one picks a more
general language tag when one doesn't know the more specific one.
Clearly, the RFC would not have made such a rule if the three were
"completely separate encoding schemes".

Based on the RFC, we can say that the 3 UTF-16 labels are 3 "MIME
charsets".[1] I would say that the RFC uses "encoding" slightly
differently from you, and slightly differently from me as well.

[1] http://tools.ietf.org/html/rfc2781#section-3
[2] http://tools.ietf.org/html/rfc2781#section-3.3

-- 
leif halvard silli