L2/04-228
Source: Mark Davis
Subject: Types of Charsets
Date: June 9, 2004

Since UTR #17 is on the agenda at C.11.3 Issue 25: Review of Proposed Update Unicode Technical Report #17, Character Encoding Model, I'd like this added to the doc registry and to that agenda item.

====================

[This is a recast of private mail exchanged on the topic of UTR #17.]

UTR #17 has recently had some updates, thanks to Asmus. These are definite improvements, but as I look at the typology that it uses, the distinctions that we make between CES and TES are, to my mind, really not the most important distinctions to make when dealing with character encodings. I will use different terms below to describe what I think are the more important distinctions; I'll avoid the terminology in #17 just so there is no confusion in talking about one versus the other.

For Unicode, the most fundamental entity that all users of charset conversions are concerned about in practice is what I'll call a Charset Mapping (CM), which is a function that maps from byte sequences to code unit sequences (in some Unicode encoding form). The features of CMs that are important in practice are the following -- note that all of them are easily confused with one another! In the examples, I'll use lowercase variables for byte sequences and uppercase for Unicode code unit sequences.

A. Determinism

We say that a CM is deterministic if no two different byte sequences can ever map to the same code unit sequence. In that case, there is a well-defined inverse function CM' that maps the Unicode text back to bytes. Latin-1 and UTF-16BE are deterministic; any difference in a sequence of bytes will result in a different sequence of Unicode characters (or be invalid). BOCU-1 is also deterministic. SCSU and ISO-2022-JP are not deterministic; different byte sequences can result in the same Unicode text.

Determinism is important for a number of reasons, including security. If a CM is deterministic, then you can test two sequences of bytes for identity in order to establish whether the resulting Unicode texts will be identical.

B. Context-Independence

A CM is context-independent if for all byte sequences x and y, CM(x) + CM(y) = CM(x + y), where + indicates concatenation. ASCII, UTF-16BE, UTF-16LE, Latin-1, and so on qualify here. So does ISO-2022-JP, as commonly implemented and used: most converters require that a terminating sequence be emitted that returns to a 'ground' state, and any sequence without a terminating sequence is invalid. Any two such sequences can be concatenated to produce a valid result, which also has the same meaning when converted.

SCSU is not context-independent: if you concatenate two valid SCSU sequences, the result can map to different text. The UTF-16 CES is also not context-independent if the second piece starts with a BOM.

If a programmer knows that a CM is context-independent, then some common operations, such as concatenation of serialized text, can be performed without data corruption.

A CM can be context-independent without being deterministic (ISO-2022-JP, as commonly implemented), and a CM can be deterministic without being context-independent (BOCU-1). So these are two different beasts.
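To make the context-independence test concrete, here is a rough sketch (just an illustration, nothing normative) that spot-checks CM(x) + CM(y) = CM(x + y) with the java.nio charset API; the class name, the helper names, and the byte values are all made up for the example, and a check like this can only disprove the property, never prove it:

import java.nio.charset.Charset;

public class ContextIndependence {

    // CM(x): decode a byte sequence into Unicode (Java's UTF-16 code units).
    static String cm(Charset cs, byte[] x) {
        return new String(x, cs);
    }

    static byte[] concat(byte[] x, byte[] y) {
        byte[] r = new byte[x.length + y.length];
        System.arraycopy(x, 0, r, 0, x.length);
        System.arraycopy(y, 0, r, x.length, y.length);
        return r;
    }

    // Tests CM(x) + CM(y) = CM(x + y) for one pair of byte sequences.
    // A single failure disproves context-independence; passing pairs prove nothing.
    static boolean pairOk(Charset cs, byte[] x, byte[] y) {
        return (cm(cs, x) + cm(cs, y)).equals(cm(cs, concat(x, y)));
    }

    public static void main(String[] args) {
        // Latin-1: every byte maps independently, so any pair passes.
        byte[] a = { 0x41 };                                 // "A"
        byte[] e = { (byte) 0xE9 };                          // "é"
        System.out.println(pairOk(Charset.forName("ISO-8859-1"), a, e));  // true

        // UTF-16 CES where the second piece starts with a BOM: the JDK decoder
        // consumes a leading BOM, so CM(y) is just "B", but in CM(x + y) those
        // same two bytes are no longer leading and decode as U+FEFF.
        byte[] x = { 0x00, 0x41 };                           // "A"
        byte[] y = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x42 }; // BOM + "B"
        System.out.println(pairOk(Charset.forName("UTF-16"), x, y));      // false
    }
}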
C. Fixed-Width

A CM is fixed-width if there is some constant k such that whenever CM(x) = X, length(x) = k * length(X). It would theoretically be possible to have a fixed-width CM that is neither deterministic nor context-independent, but in practice such wacko charsets don't occur. A contrived example would be one that stored each successive Unicode character as 3 bytes, with the bottom 21 bits containing the *difference* from the previous character and the top 3 bits set to random values. That would be neither context-independent nor deterministic.

A CM that is fixed-width and context-independent permits direct random access, which is a pretty useful feature. ASCII, 8859-x, and UTF-32 all qualify.

D. Non-Overlap

The next best thing to fixed-width is non-overlap. A CM is non-overlapping if, for all cases where X = CM(x) and Y = CM(y), if X does not occur in Y, then x does not occur in y. The practical benefit is that you can search for any characters and never stumble across a false match. It also implies self-synchronization: it is easy to tell where the boundaries of text are without having to back up very far.

Shift-JIS, ISO-2022, etc. are overlapping CMs. All the Unicode CESs are non-overlapping; all fixed-width CMs are non-overlapping. While not quite as useful as direct random access, non-overlap does permit "nearly" random access: you can jump to a point, do some very minor adjustment to get synchronized, then iterate forward or backward to get characters. Overlapping CMs like Shift-JIS, ISO-2022, etc. don't permit this; you may have to scan arbitrarily far backwards to get a sync point. Note that when you include canonical equivalence, all the Unicode CESs *do* overlap, but the overlap is bounded.

E. Roundtripping

There are three kinds of mappings that a CM can contain:

CU <-> B: roundtrip mappings - a code unit sequence maps to a unique byte sequence, and vice versa.
CU -> B: fallback mappings - a code unit sequence maps to a unique byte sequence, but that sequence does not map back.
CU <- B: reverse-fallback mappings - a byte sequence maps to a unique code unit sequence, but that sequence does not map back.

Example: ISO-8859-1 consists of only CU <-> B mappings. windows-28591 includes all of those, but also has some fallback mappings such as <0100> -> <41>. If a charset had characters that were not in Unicode, it could have reverse fallbacks -- we don't recommend this, because it causes problems; it is better to map to private use characters -- but some CMs do it.

Where fallbacks or reverse fallbacks exist, they are typically dependent; you will have CU1 <-> B1 plus CU2 -> B1, for example (two different code unit sequences map to the same byte sequence). I don't know of any case with an independent fallback, e.g. A -> x but no A <-> a and no X <-> x; that would be pretty bizarre.

Roundtripping is very important to a programmer. If there are fallback mappings in use, then when mapping to XML (for example) there will be data corruption. You have to be able to catch the fallbacks and map them to NCRs instead (e.g. "&#x100;"). (If a CM does have fallbacks or reverse fallbacks, a good API will let you turn them off.) There are some subtleties with private use characters that I won't go into here.

A CM can be deterministic and context-independent, but not be roundtripping. This can happen where there are fallbacks, such as in windows-28591.
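To make the XML point concrete, here is a rough sketch (again just an illustration, with made-up names) of catching non-roundtripping characters and emitting NCRs: each code point is encoded and decoded back, and anything that does not survive the trip is written as a numeric character reference instead.

import java.nio.charset.Charset;

public class NcrFallbackFilter {

    // Does this one code point survive a trip through the CM and back?
    // A fallback fails here, because the fallback byte decodes to a different
    // character; an unmappable character also fails, because String.getBytes
    // substitutes a replacement byte such as '?'.
    static boolean roundtrips(Charset cs, String piece) {
        return new String(piece.getBytes(cs), cs).equals(piece);
    }

    // Serialize text for XML in the given charset: characters that roundtrip
    // pass through; everything else becomes a numeric character reference.
    static String toXmlSafe(Charset cs, String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String piece = new String(Character.toChars(cp));
            if (roundtrips(cs, piece)) {
                out.append(piece);
            } else {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0100 does not roundtrip through ISO-8859-1, so it comes out as an NCR.
        System.out.println(toXmlSafe(Charset.forName("ISO-8859-1"), "A\u0100B"));
        // prints: A&#x100;B
    }
}

With the JDK's built-in ISO-8859-1 table, U+0100 is simply unmappable rather than a fallback, but the same test would catch a real <0100> -> <41> fallback too, since <41> decodes back to U+0041 rather than U+0100.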
========

Interestingly, none of the above correspond exactly to the distinction that we currently have in #17, where we have two groups:

"CES": Latin-1, UTF-16, windows-28591, Shift-JIS, ISO-2022-JP
"TES": SCSU, BOCU-1, UTF-7, Punycode

I myself don't especially see the utility of this particular distinction in practice -- I will leave that for Asmus and Ken to describe. It would, however, be a more useful categorization in practice if we moved the compounds into the TES category, so that CES were "simple" and TES were "complicated":

"CES": Latin-1, UTF-16BE, windows-28591, Shift-JIS
"TES": SCSU, BOCU-1, UTF-7, Punycode, ISO-2022-JP, UTF-16, UTF-8