Errors in FAQ on compression

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Mar 11 2002 - 00:26:29 EST


The FAQ on compression says:

<quote>
Q: Why not use UTF-8 as compressed format?
A: UTF-8 represents only the ASCII characters in less space than needed
in UTF-16, for <i>all</i> other characters it expands.
</quote>

The end of this sentence means "... it expands compared to UTF-16," and
of course that is not true. Code points from U+0080 through U+07FF take
two bytes in UTF-8, the same as in UTF-16. For an FAQ, this is an
unfortunate error.

Perhaps something along the lines of:

A: UTF-8 represents only the ASCII characters in less space than needed
in UTF-16; for all other characters it requires the same or more space.

would be more accurate.
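
To make the size comparison concrete, here is a minimal Python sketch
that prints the encoded length of a few sample characters in both
forms. The helper name encoded_sizes and the choice of sample
characters are mine, purely for illustration; "utf-16-be" is used so
that no byte order mark is counted.

```python
# Compare the encoded size, in bytes, of single characters in UTF-8 and UTF-16.
# The sample characters are illustrative picks, one per size boundary.
samples = {
    "U+0041 (ASCII)": "\u0041",
    "U+0080 (first two-byte UTF-8 code point)": "\u0080",
    "U+0416 (Cyrillic)": "\u0416",
    "U+07FF (last two-byte UTF-8 code point)": "\u07ff",
    "U+0800 (first three-byte UTF-8 code point)": "\u0800",
    "U+3042 (hiragana)": "\u3042",
    "U+10000 (first supplementary code point)": "\U00010000",
}

def encoded_sizes(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for the string ch."""
    # "utf-16-be" avoids the byte order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

for label, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {u8}, UTF-16 = {u16}")
```

Only the ASCII sample is smaller in UTF-8. U+0080 through U+07FF and
the supplementary code point come out the same size in both forms, and
U+0800 through U+FFFF are the cases where UTF-8 needs more.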

Later on...

<quote>
A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded
Unicode text, by removing the extra redundancy that is part of the
encoding (sequences of every other byte being null) and not a redundancy
in the content. The output of SCSU should be sent to LZW for block
compression where that's desired.
</quote>

The part about "sequences of every other byte being null" bothers me.
For one thing, this case is specific to Latin-1 text. In Cyrillic
text, every other byte is 0x04; in kana, it's 0x30; and so forth.
Then there's that word "null," which has a special meaning of
"nothing" or "unassigned" in many programming languages. The fact that
Latin-1 text encoded as UTF-16 results in every other byte being 0x00
has nothing to do with any of the symbolic meanings of "null."

How about:

(sequences of every other byte being the same)
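
As a rough illustration of the repeated byte, this Python sketch
collects the high-order byte of each UTF-16 code unit in a short
sample string. The function name high_bytes and the three sample
words are my own choices, and the sketch assumes text with no
surrogate pairs.

```python
def high_bytes(text):
    """Return the set of high-order bytes of the UTF-16 code units in text."""
    # Big-endian UTF-16 puts the high-order byte of each code unit first,
    # so the bytes at even offsets are the ones that repeat.
    data = text.encode("utf-16-be")
    return {data[i] for i in range(0, len(data), 2)}

samples = {
    "Latin-1": "r\u00e9sum\u00e9",
    "Cyrillic": "\u041f\u0440\u0438\u0432\u0435\u0442",
    "hiragana": "\u3072\u3089\u304c\u306a",
}

for script, word in samples.items():
    shown = ", ".join(f"0x{b:02X}" for b in sorted(high_bytes(word)))
    print(f"{script}: high bytes = {shown}")
```

Each sample yields a single high byte: 0x00, 0x04 and 0x30
respectively. Running text would be less uniform, since spaces and
ASCII punctuation mixed into Cyrillic or kana still carry a 0x00 high
byte.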

-Doug Ewell
 Fullerton, California


