Repertoire, encoding, and representation (Was: Charsets + encoding + codesets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Oct 06 1997 - 20:39:19 EDT


Keld responded to Yve:

>
> You can both have a 10646 encoding and an 10646 repertoire.
> The canonical encoding of 10646 is UCS-4. That means if you are not
> more specific than saying "10646 coded character set" then you mean
> UCS-4.

I can't let that one go by.

The "encoding" (sense 4 of my last note) is the specification of all
the numbers associated with the characters in the repertoire. It is
neither UCS-4 nor UCS-2.

UCS-2 and UCS-4 are defined in Clause 14 of 10646 as "coded representation
forms of the UCS". UCS-2 is called the "Two-octet BMP form", and
UCS-4 is called the "Four-octet canonical form." What is "canonical" about
the UCS-4 form is that it enables the representation of any character
encoded in 10646, whether or not it is encoded on the Basic Multilingual
Plane (BMP), whereas UCS-2 only enables the representation of characters
encoded on the BMP.
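
A minimal sketch of the difference in reach between the two forms. The
helper names ucs2_octets and ucs4_octets are hypothetical, and the
most-significant-octet-first order is an assumption made here for
illustration; the sketch also makes no attempt to treat any BMP code
positions specially.

```python
def ucs4_octets(code_point: int) -> bytes:
    """Four-octet form: any code position in the 31-bit code space fits."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS code space")
    return code_point.to_bytes(4, "big")


def ucs2_octets(code_point: int) -> bytes:
    """Two-octet BMP form: only code positions on the BMP fit."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("not on the BMP; the two-octet form cannot represent it")
    return code_point.to_bytes(2, "big")


print(ucs2_octets(0x00E1).hex())    # 00e1
print(ucs4_octets(0x00E1).hex())    # 000000e1
print(ucs4_octets(0x10000).hex())   # 00010000
```

Calling ucs2_octets(0x10000) raises ValueError, which is the whole
point: the number assigned to the character is the same in both forms,
and only the four-octet form can carry every such number.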

However, canonical form does *not* mean default form, as implied by
Keld's statement above. 10646 does not define any concept of default
form of use. Instead, 10646 defines the alternatives, and then states
that the mechanism for identifying it is outside the scope of the
standard:

   "The identification of ISO/IEC 10646 (including the form), the
    implementation level, and any subset of the coding space that
    have been adopted by the originator must also be available to
    the recipient. The route by which such identification is
    communicated is outside the scope of ISO/IEC 10646." -- 17.1

10646 then goes on to say that *if* you are using ISO/IEC 2022 escape
sequences, one of a specified list of escape sequences can be used
to identify the form and implementation level, and other escape
sequences can be used to identify designated subsets of the repertoire.

The same applies to the specification of one of the two "transformation
formats", UTF-8 or UTF-16 (both of which are "encoding schemes" in the
sense identified earlier), which can also be identified by ISO/IEC 2022
escape sequences. Either or both can, however, be designated by other
means not involving 2022.
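
To make the distinction between the encoding (the numbers) and an
encoding scheme (the octets) concrete, here is a small sketch using
Python's built-in codecs. The codec names "utf-8" and "utf-16-be" are
Python's, not 10646's, and the choice of big-endian order is an
assumption made only to keep the output unambiguous.

```python
# One code position, several octet sequences: the number assigned to the
# character does not change, only the scheme used to serialize it.
samples = {
    "U+00E1": "\u00e1",        # on the BMP
    "U+10000": "\U00010000",   # first code position beyond the BMP
}


def octets(text: str, scheme: str) -> str:
    """Return the hex octets produced by serializing text in the given scheme."""
    return text.encode(scheme).hex()


for label, text in samples.items():
    for scheme in ("utf-8", "utf-16-be"):
        print(label, scheme, octets(text, scheme))
# U+00E1 utf-8 c3a1
# U+00E1 utf-16-be 00e1
# U+10000 utf-8 f0908080
# U+10000 utf-16-be d800dc00
```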

The Unicode Standard can be considered a profile of 10646 that
designates UTF-16 as the preferred encoding scheme. In that sense it
clearly *does* designate a default encoding scheme, unlike 10646.

>
> The trouble is that the "repertoire" of Unicode and 10646 is different.
> 10646 is clear on what is the repertoire: it is the characters of all
> its code points. Unicode is clear on "abstract characters" that
> you can make abstract characters by combining a number of characters
> such as a base letter and then one or more combining accents.
> But the combinations are not defined or limited, so for Unicode
> you have an unlimited repertoire of Unicode abstract characters.
>

I'll state this one more time, because Keld keeps claiming it isn't
so:

   The repertoires of the Unicode Standard and of ISO/IEC 10646 are
   *exactly* the same.

WG2 and the Unicode Technical Committee go to great lengths to ensure
that this is and remains the case. Additions to the repertoire of 10646
are matched by additions to the repertoire of the Unicode Standard,
and the two standards groups work together to synchronize the various
steps of balloting and publication, so that publication of the Unicode
Standard can be directly correlated with a known sequence of approved
and published amendments to 10646.

So what is Keld talking about? Combining marks, of course.

So once more, into the breach.

The Unicode Standard talks about abstract characters. <a-acute> is an
example of an abstract character in the Latin script. <d-dental-voiceless>
is another example of an abstract character in the Latin script.

<a-acute> is a part of the repertoire of 10646 (and Unicode). It is
   encoded at U+00E1 (= U-000000E1). The name of this encoded character is
   LATIN SMALL LETTER A WITH ACUTE.

<d-dental-voiceless> is not part of the repertoire of 10646 (or of Unicode).
   It is not encoded. It has no name in 10646 (or Unicode).

<a-acute> can also be represented (note, *not* encoded) by a combining
   character sequence. In particular, it can be represented by:
   U+0061 LATIN SMALL LETTER A + U+0301 COMBINING ACUTE ACCENT

<d-dental-voiceless> can be represented (note, *not* encoded) by a
   combining character sequence. In particular, it can be represented by:
   U+0064 LATIN SMALL LETTER D + U+032A COMBINING BRIDGE BELOW +
   U+0325 COMBINING RING BELOW

The Unicode Standard recognizes that for most purposes, the two different
representations of <a-acute> should be treated as identical. Users neither
know nor care what the underlying representation is, and will expect
that any <a-acute> they see will be the same as any other <a-acute>.
Because that is the case, the Unicode Standard defines a concept of
canonically equivalent sequences. The two representations of <a-acute>
are an example of canonically equivalent sequences. The details for
Unicode conformance include treating canonically equivalent sequences
correctly. (Note that this is a stricter specification for conformance
with Unicode than for conformance with 10646 itself. 10646 does not define
canonical equivalence; nor does it specify many other aspects of the
"semantics" of the characters it encodes.)

Note that canonical equivalence does *not* mean duplicate encoding of
characters. It means two different representations of the same abstract
character--representations which under most circumstances should be
*interpreted* the same.
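
One way to sketch "interpreted the same" in code is to compare the two
representations after bringing both to a common decomposed form. This
assumes Python's standard unicodedata module and its "NFD"
normalization form, neither of which is mentioned in this note; the
function name canonically_equivalent is hypothetical.

```python
import unicodedata

precomposed = "\u00e1"        # U+00E1 LATIN SMALL LETTER A WITH ACUTE
combining = "\u0061\u0301"    # U+0061 + U+0301 COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(precomposed == combining)                        # False
print(canonically_equivalent(precomposed, combining))  # True
```

The first comparison looks at the code positions actually stored; the
second looks at what they are meant to represent.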

Note also that canonical equivalence does not mean exact identity.
If your software process is allocating buffer space, it had better not
treat U+00E1 the same as the sequence U+0061 + U+0301, or it will
overrun memory.
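
The size difference is easy to check. A small sketch, assuming Python's
built-in codecs and a hypothetical helper octet_count:

```python
precomposed = "\u00e1"       # one code position
sequence = "\u0061\u0301"    # two code positions, same abstract character


def octet_count(text: str, scheme: str) -> int:
    """Number of octets needed to store text in the given encoding scheme."""
    return len(text.encode(scheme))


for scheme in ("utf-8", "utf-16-be", "utf-32-be"):
    print(scheme, octet_count(precomposed, scheme), octet_count(sequence, scheme))
# utf-8 2 3
# utf-16-be 2 4
# utf-32-be 4 8
```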

Keld is, of course, correct that the repertoire of abstract characters
is open. I just gave an example of an abstract character that could have
meaningful use in the transcription of a language, but it has never (to
my knowledge) been brought up before or discussed as a candidate to
be *encoded* as a character in 10646. That is not because it has two
accents; there are already such characters encoded in 10646, e.g.
U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON. But the nature
of the Latin script is that it allows relatively free application of
accent marks to letter baseforms, either as diacritics to create new
"letters" for a particular orthography, or as accents to modify in various
ways the sounds represented by letters.

The reason for *encoding* all the combining marks is to allow the
*representation* (not encoding) of all these letter accent combinations,
including all the ones nobody has thought of or used yet.
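
A sketch of the difference between an abstract character that has been
encoded and one that can only be represented. It assumes Python's
standard unicodedata module and its "NFC"/"NFD" normalization forms,
which are not part of this note; the exact results depend on the
Unicode data shipped with the Python build.

```python
import unicodedata

# <d-dental-voiceless>: representable, but not encoded as one character.
d_dental_voiceless = "\u0064\u032a\u0325"
composed = unicodedata.normalize("NFC", d_dental_voiceless)
print(len(composed))   # 3: composition finds no single code position

# U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON: encoded, and
# also representable as a base letter plus two combining marks.
a_diaeresis_macron = "\u01df"
decomposed = unicodedata.normalize("NFD", a_diaeresis_macron)
print([f"U+{ord(ch):04X}" for ch in decomposed])
# ['U+0061', 'U+0308', 'U+0304']
print(unicodedata.normalize("NFC", decomposed) == a_diaeresis_macron)  # True
```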

As Tex pointed out, this kind of canonical equivalence is not the
only kind of equivalence which can (and must) be defined between
characters and sequences of characters in Unicode/10646. To add to
those cases already cited by Tex, consider the abstract character
<katakana-ga>.

In addition to the single encoded character form, and its canonically
equivalent combining character sequence:

U+30AC KATAKANA LETTER GA

U+30AB KATAKANA LETTER KA + U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

one must also consider the sequence

U+FF76 HALFWIDTH KATAKANA LETTER KA + U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK

If I am interpreting data expressed as the sequence of halfwidth forms,
as, for example, from a Japanese host database, it would be very wrong not
to equate it to U+30AC KATAKANA LETTER GA. This is an equivalence very
well understood and required for high-fidelity transforms of data between
IBM host databases and JIS-based systems that do not encode halfwidth
katakana as separate characters.

Just as for the canonically equivalent sequence, the existence of an
equivalent halfwidth sequence for representing the same abstract character
simply must be dealt with. It is not double encoding--but just one of
a very large number of equivalences that software must contend with when
using the universal character set.
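
The three representations of <katakana-ga> can be compared in a short
sketch. It assumes Python's standard unicodedata module and its "NFC"
and "NFKC" normalization forms as one possible way to express the two
kinds of equivalence; those forms are not part of this note, and a
host-to-JIS converter would more likely use its own mapping tables.

```python
import unicodedata

single = "\u30ac"                # U+30AC KATAKANA LETTER GA
combining_seq = "\u30ab\u3099"   # KA + combining voiced sound mark
halfwidth_seq = "\uff76\uff9e"   # halfwidth KA + halfwidth voiced sound mark

# Canonical equivalence: composition alone maps the combining sequence
# to the single encoded character.
print(unicodedata.normalize("NFC", combining_seq) == single)          # True

# The halfwidth sequence is not canonically equivalent, so canonical
# composition leaves it untouched.
print(unicodedata.normalize("NFC", halfwidth_seq) == halfwidth_seq)   # True

# A broader (compatibility) mapping is needed to equate it with U+30AC.
print(unicodedata.normalize("NFKC", halfwidth_seq) == single)         # True
```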

--Ken


