L2/04-228
Source: Mark Davis
Subject: Types of Charsets
Date: June 9, 2004

Since UTR #17 is on the agenda at C.11.3 Issue 25: Review of Proposed Update Unicode Technical Report #17, Character Encoding Model, I'd like this added to the doc registry and to that agenda item.

====================

[This is a recast of private mail exchanged on the topic of UTR #17.]

UTR #17 has recently had some updates, thanks to Asmus. These are definite improvements, but as I look at the typology that it uses, the distinctions that we make between CES and TES are, to my mind, really not the most important distinctions to make when dealing with character encodings. I will use different terms below to describe what I think are the more important distinctions; I'll avoid the terminology in #17 just so there is no confusion in talking about one versus the other.

For Unicode, the most fundamental entity that all users of charset conversions are concerned about in practice is what I'll call a Charset Mapping (CM), which is a function that maps from byte sequences to code unit sequences (in some Unicode encoding form). The features of CMs that are important in practice are the following -- note that all of them are easily confused with one another! In the examples, I'll use lowercase variables for byte sequences and uppercase for Unicode code unit sequences.

A. Determinism

We say that a CM is deterministic if no two different byte sequences can ever map to the same code unit sequence. In that case, there is a well-defined inverse function CM' that maps the Unicode text back to bytes. Latin-1 and UTF-16BE are deterministic; any difference in a sequence of bytes will result in a different sequence of Unicode characters (or be invalid). BOCU-1 is also deterministic. SCSU and ISO-2022-JP are not deterministic; different byte sequences can result in the same Unicode text.

Determinism is important for a number of reasons, including security. If a CM is deterministic, then you can test two sequences of bytes for identity in order to establish whether the resulting Unicode texts will be identical.

B. Context-Independence

A CM is context-independent if for all byte sequences x and y, CM(x) + CM(y) = CM(x + y), where + indicates concatenation. ASCII, UTF-16BE, UTF-16LE, Latin-1, and so on qualify here. So does ISO-2022-JP, as commonly implemented and used: most converters require that a terminating sequence be emitted that returns to a 'ground' state, and any sequence without a terminating sequence is invalid. Any two such sequences can be concatenated to produce a valid result, which also has the same meaning when converted.

SCSU is not context-independent: if you concatenate two valid SCSU sequences, the result can map to different text. The UTF-16 CES is also not context-independent if the second piece starts with a BOM.

If a programmer knows that a CM is context-independent, then some common operations, such as concatenation of serialized text, can be performed without data corruption.

A CM can be context-independent without being deterministic (ISO-2022-JP, as commonly implemented), and a CM can be deterministic without being context-independent (BOCU-1). So these are two different beasts.
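To make the context-independence test concrete, here is a rough sketch (just an illustration, nothing normative) that spot-checks CM(x) + CM(y) = CM(x + y) with the java.nio charset API; the class name, the helper names, and the byte values are all made up for the example, and a check like this can only disprove the property, never prove it:

import java.nio.charset.Charset;

public class ContextIndependence {

    // CM(x): decode a byte sequence into Unicode (Java's UTF-16 code units).
    static String cm(Charset cs, byte[] x) {
        return new String(x, cs);
    }

    static byte[] concat(byte[] x, byte[] y) {
        byte[] r = new byte[x.length + y.length];
        System.arraycopy(x, 0, r, 0, x.length);
        System.arraycopy(y, 0, r, x.length, y.length);
        return r;
    }

    // Tests CM(x) + CM(y) = CM(x + y) for one pair of byte sequences.
    // A single failure disproves context-independence; passing pairs prove nothing.
    static boolean pairOk(Charset cs, byte[] x, byte[] y) {
        return (cm(cs, x) + cm(cs, y)).equals(cm(cs, concat(x, y)));
    }

    public static void main(String[] args) {
        // Latin-1: every byte maps independently, so any pair passes.
        byte[] a = { 0x41 };                                 // "A"
        byte[] e = { (byte) 0xE9 };                          // "é"
        System.out.println(pairOk(Charset.forName("ISO-8859-1"), a, e));  // true

        // UTF-16 CES where the second piece starts with a BOM: the JDK decoder
        // consumes a leading BOM, so CM(y) is just "B", but in CM(x + y) those
        // same two bytes are no longer leading and decode as U+FEFF.
        byte[] x = { 0x00, 0x41 };                           // "A"
        byte[] y = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x42 }; // BOM + "B"
        System.out.println(pairOk(Charset.forName("UTF-16"), x, y));      // false
    }
}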
C. Fixed-Width

A CM is fixed-width if there is some constant k such that whenever CM(x) = X, length(x) = k * length(X). It would theoretically be possible to have a fixed-width CM that is neither deterministic nor context-independent, but in practice such wacko charsets don't occur. A contrived example would be one that stored each successive Unicode character as 3 bytes, with the bottom 21 bits containing the *difference* from the previous character and the top 3 bits set to random values. That would be neither context-independent nor deterministic.

A CM that is fixed-width and context-independent permits direct random access, which is a pretty useful feature. ASCII, 8859-x, and UTF-32 all qualify.

D. Non-Overlap

The next best thing to fixed-width is non-overlap. A CM is non-overlapping if, for all cases where X = CM(x) and Y = CM(y), if X does not occur in Y, then x does not occur in y. The practical benefit is that you can search for any characters and never stumble across a false match. It also implies self-synchronization: it is easy to tell where the boundaries of text are without having to back up very far.

Shift-JIS, ISO-2022, etc. are overlapping CMs. All the Unicode CESs are non-overlapping; all fixed-width CMs are non-overlapping. While not quite as useful as direct random access, non-overlap does permit "nearly" random access: you can jump to a point, do some very minor adjustment to get synchronized, then iterate forward or backward to get characters. Overlapping CMs like Shift-JIS, ISO-2022, etc. don't permit this; you may have to scan arbitrarily far backwards to get a sync point. Note that when you include canonical equivalence, all the Unicode CESs *do* overlap, but the overlap is bounded.

E. Roundtripping

There are three kinds of mappings that a CM can contain:

CU <-> B: roundtrip mappings - a code unit sequence maps to a unique byte sequence, and vice versa.
CU -> B: fallback mappings - a code unit sequence maps to a unique byte sequence, but that sequence does not map back.
CU <- B: reverse-fallback mappings - a byte sequence maps to a unique code unit sequence, but that sequence does not map back.

Example: ISO-8859-1 consists of only CU <-> B mappings. windows-28591 includes all of those, but also has some fallback mappings such as <0100> -> <41>. If a charset had characters that were not in Unicode, it could have reverse fallbacks -- we don't recommend this, because it causes problems; it is better to map to private use characters -- but some CMs do it.

Where fallbacks or reverse fallbacks exist, they are typically dependent; you will have CU1 <-> B1 plus CU2 -> B1, for example (two different code unit sequences map to the same byte sequence). I don't know of any case with an independent fallback, e.g. A -> x but no A <-> a and no X <-> x; that would be pretty bizarre.

Roundtripping is very important to a programmer. If there are fallback mappings in use, then when mapping to XML (for example) there will be data corruption. You have to be able to catch the fallbacks and map them to NCRs instead (e.g. "&#x100;"). (If a CM does have fallbacks or reverse fallbacks, a good API will let you turn them off.) There are some subtleties with private use characters that I won't go into here.

A CM can be deterministic and context-independent, but not be roundtripping. This can happen where there are fallbacks, such as in windows-28591.
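To make the XML point concrete, here is a rough sketch (again just an illustration, with made-up names) of catching non-roundtripping characters and emitting NCRs: each code point is encoded and decoded back, and anything that does not survive the trip is written as a numeric character reference instead.

import java.nio.charset.Charset;

public class NcrFallbackFilter {

    // Does this one code point survive a trip through the CM and back?
    // A fallback fails here, because the fallback byte decodes to a different
    // character; an unmappable character also fails, because String.getBytes
    // substitutes a replacement byte such as '?'.
    static boolean roundtrips(Charset cs, String piece) {
        return new String(piece.getBytes(cs), cs).equals(piece);
    }

    // Serialize text for XML in the given charset: characters that roundtrip
    // pass through; everything else becomes a numeric character reference.
    static String toXmlSafe(Charset cs, String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String piece = new String(Character.toChars(cp));
            if (roundtrips(cs, piece)) {
                out.append(piece);
            } else {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0100 does not roundtrip through ISO-8859-1, so it comes out as an NCR.
        System.out.println(toXmlSafe(Charset.forName("ISO-8859-1"), "A\u0100B"));
        // prints: A&#x100;B
    }
}

With the JDK's built-in ISO-8859-1 table, U+0100 is simply unmappable rather than a fallback, but the same test would catch a real <0100> -> <41> fallback too, since <41> decodes back to U+0041 rather than U+0100.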
========

Interestingly, none of the above correspond exactly to the distinction that we currently have in #17, where we have two groups:

"CES": Latin-1, UTF-16, windows-28591, Shift-JIS, ISO-2022-JP
"TES": SCSU, BOCU-1, UTF-7, Punycode

I myself don't especially see the utility of this particular distinction in practice -- I will leave that for Asmus and Ken to describe. It would, however, be a more useful categorization in practice if we moved the compounds into the TES category, so that CES were "simple" and TES were "complicated":

"CES": Latin-1, UTF-16BE, windows-28591, Shift-JIS
"TES": SCSU, BOCU-1, UTF-7, Punycode, ISO-2022-JP, UTF-16, UTF-8