Re: Why is "endianness" relevant when storing data on disks but not when in memory?

From: Doug Ewell <doug_at_ewellic.org>
Date: Sun, 6 Jan 2013 20:57:58 -0700

Leif Halvard Silli wrote:

>> By definition, data in the "UTF-nBE" or "UTF-nLE" encoding scheme
>> (for whatever value of n) does not have a byte-order mark.
>
> Sounds like you see "UTF-32BE data" as synonym for "UTF-32BE
> encoding".

"UTF-32BE data" is character data that is encoded according to
definition D99, which defines the UTF-32BE encoding scheme. "UTF-32BE"
is not merely a short way to say "big-endian UTF-32." As defined in TUS,
it has a specific meaning that goes beyond that.

> The encompassed languages of a macrolanguage are not variations of the
> macrolanguage. Likewise, the "UTF-32BE" label does not designate a
> variant of "UTF-32". You may read me that way, but I have not meant
> that "UTF-32BE" is a variant of "UTF-32".

The analogy with ISO 639-3 macrolanguages implies that under some
circumstances, it is appropriate to consider "UTF-32BE" and "UTF-32LE"
as separate encoding schemes, while under other circumstances, it is
more appropriate to lump them together under the single term "UTF-32".
But that isn't right; it still misses the point. "UTF-32BE" is not the
same as "UTF-32 that happens to be big-endian." The latter MAY begin
with a BOM; the former MUST NOT.
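
To make that concrete (my own illustration, not text from the standard):
Python's built-in codecs happen to model this distinction directly. The
sample bytes are mine; the point is that the "UTF-32" scheme consumes a
leading BOM as a byte-order signature, while "UTF-32BE" has no BOM at
all, so the same four bytes survive as a character.

    # BOM (00 00 FE FF) followed by big-endian U+0041.
    data = bytes.fromhex("0000FEFF00000041")

    # "UTF-32" encoding scheme: FE FF is read as a byte-order
    # signature and stripped from the text.
    print(repr(data.decode("utf-32")))      # 'A'

    # "UTF-32BE" encoding scheme: no BOM by definition, so U+FEFF
    # comes through as a zero-width no-break space.
    print(repr(data.decode("utf-32-be")))   # '\ufeffA'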

> If something is labelled "no" (for "Norwegian"), then one must "taste"
> it to know whether the content is Norwegian Bokmål ("nb") or Norwegian
> Nynorsk ("nn"). Likewise, if something is labelled, by default (as in
> XML) or explicitly, as "UTF-16", then the parser must taste/sniff -
> typically by sniffing the BOM - whether the document is big-endian or
> little-endian.

Data tagged as "UTF-16" might contain a BOM, or it might not. If it does
not, it is much more likely that platform or operating system
conventions will be used to determine the endianness of the data than
heuristics. There are comparatively few systems that will accept and
comprehend UTF-16 or UTF-32 data of the "wrong" endianness for the
platform. Andrew West's BabelPad is one tool that will sniff non-BOM
data, but the whole point of BabelPad is to be Unicode-aware and to help
the user be Unicode-aware; most systems and apps are not like that.
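
For what it's worth, "sniffing" need not be exotic. Here is a minimal
sketch of the kind of check a Unicode-aware tool might do; this is my
own illustration, not a description of BabelPad's actual algorithm, and
the fallback heuristic assumes mostly-ASCII text.

    def guess_utf16_byte_order(data: bytes) -> str:
        """Guess the byte order of untagged UTF-16 data (hypothetical helper)."""
        if data[:2] == b"\xfe\xff":
            return "be"                    # BOM says big-endian
        if data[:2] == b"\xff\xfe":
            return "le"                    # BOM says little-endian
        # No BOM: in mostly-ASCII text, the zero (high-order) bytes
        # fall at even offsets if big-endian, odd offsets if little-endian.
        even_zeros = data[0::2].count(0)
        odd_zeros = data[1::2].count(0)
        return "be" if even_zeros >= odd_zeros else "le"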

> When the BOM is supposed to be interpreted as the BOM, then we cannot
> label the document as e.g. "UTF-16BE" but must use, by default or
> explicitly, the label "UTF-16". But "UTF-16BE data" should be a valid
> term in either case (provided the UTF-16 file is big-endian).

Propose this change of terminology to the UTC. It is not consistent with
their existing use of the terms.

> A file labelled "UTF-16" is specified to contain BOM + big-endian data
> or BOM + little-endian data or - third - just big-endian data, without
> BOM. Thus, one of the encoding variants that can be legally labelled
> "UTF-16", is inseparable from "UTF-16BE" in every way.

That's correct. That situation is called out in the official
definitions; it doesn't imply a loosening of them.

> The "UTF-16" label does not mandate the use of the BOM.

I never said it did.

We are pretty much going round and round on this. The bottom line for me
is this: it would be nice if there were a shorthand way of saying
"big-endian UTF-16," and many people (including you?) feel that
"UTF-16BE" is that shorthand, but it is not. That term has a DIFFERENT
MEANING. The following stream:

FE FF 00 48 00 65 00 6C 00 6C 00 6F

is valid big-endian UTF-16, but it is NOT valid "UTF-16BE" unless the
leading U+FEFF is explicitly meant as a zero-width no-break space, which
may not be stripped.
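
The same stream, run through Python's standard codecs (again my own
illustration), shows the difference:

    data = bytes.fromhex("FEFF00480065006C006C006F")

    # "UTF-16" encoding scheme: FE FF is a byte-order signature,
    # not part of the text.
    print(repr(data.decode("utf-16")))      # 'Hello'

    # "UTF-16BE" encoding scheme: no BOM, so U+FEFF stays in the
    # text as a zero-width no-break space.
    print(repr(data.decode("utf-16-be")))   # '\ufeffHello'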

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell