Re: Terminology verification

From: Philippe Verdy
Date: Thu Oct 30 2003 - 17:17:30 CST

From: "Lars Marius Garshol"

> Does this make sense? Is "code point" the right term, or should I say
> "scalar value"? And what about "abstract character"? Are two equal
> sequences of code points in NFC necessarily composed of the same
> sequence of abstract characters?

As Unicode and ISO/IEC 10646 assign the same code point to the same
abstract character, each code point should be unique (so there is a
bijection between them).
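To illustrate this (a small Python sketch using the standard `unicodedata` module; not part of the original mail), each assigned code point carries a unique character name in the Unicode Character Database, and that name maps back to the same code point:

```python
import unicodedata

# Each assigned code point has a unique name in the UCD,
# and looking that name up returns the same character.
ch = "\u00E9"  # U+00E9
name = unicodedata.name(ch)
print(name)                            # LATIN SMALL LETTER E WITH ACUTE
assert unicodedata.lookup(name) == ch  # name -> code point round-trips
```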

The issue is that Unicode strings are not required to be in any
normalized form. So one Unicode string may be distinct from another
even though both are "canonically equivalent", i.e. equal after
transformation to a standard normalized form (NFC, NFD).
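A minimal Python sketch of that distinction (my own example, not from the original mail): the composed and decomposed spellings of "é" are different code point sequences, yet canonically equivalent.

```python
import unicodedata

composed = "\u00E9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The raw strings are distinct...
assert composed != decomposed
# ...but they are canonically equivalent: equal after normalization.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```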

If you read this list, you'll see that some strings need to be encoded
with distinct sequences even though they are canonically equivalent.
This may cause interpretation problems, as the normalization process
(even if canonical and not "compatibility") may alter the semantics in
some cases (we discussed the issue for Traditional Hebrew, Arabic,
Tibetan, etc.), changing what is considered to be a string of abstract
characters (the canonicalization will alter the abstract characters in
some cases, even if it is not supposed to change the way they are
rendered to common readers).
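One concrete Tibetan case that can be checked in Python (my illustration, not from the original mail): U+0F73 TIBETAN VOWEL SIGN II is on the composition-exclusion list, so even canonical NFC rewrites it into two code points.

```python
import unicodedata

# U+0F73 canonically decomposes to U+0F71 + U+0F72, and it is
# excluded from recomposition, so NFC (not just NFD) replaces
# the original single code point with a two-code-point sequence.
s = "\u0F73"
assert unicodedata.normalize("NFD", s) == "\u0F71\u0F72"
assert unicodedata.normalize("NFC", s) == "\u0F71\u0F72"
```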

Note also that the term "scalar value" relates to the assignment of a
relative position in an ordered character set. The term "code point" is
to be interpreted symbolically, so that distinct code points have no
defined relative order (ordering code points is a question of
collation, and collation in Unicode is defined to act not on the
individual abstract characters that make up a string, but on the string
as a whole).

I would not use the term "scalar value" in your definition, even if
strings are normalized into a canonical composed form, where the
representation of the string is made of code points that each have an
inherent scalar value, which may be stored in memory as code units and
then serialized as sequences of bytes through an encoding scheme.
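The scalar value / code unit / byte distinction can be sketched in Python (my example, not from the original mail): the same scalar value is represented by different code units depending on the encoding form.

```python
s = "\u00E9"  # one code point, scalar value 0xE9

# The scalar value of the code point:
assert ord(s) == 0xE9

# The code units depend on the encoding form chosen:
assert list(s.encode("utf-8")) == [0xC3, 0xA9]      # two UTF-8 code units
assert list(s.encode("utf-16-be")) == [0x00, 0xE9]  # one UTF-16 code unit,
                                                    # serialized as two bytes
```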

In fact there may exist fully Unicode-compliant applications that do
not handle strings of abstract characters using the scalar values of
code points, but instead symbolically (think about a Lisp processor
that handles each abstract character as a symbolic node, or about SGML
applications that handle them by their names or by character entity
references): the scalar value of each code point is not required to
perform Unicode string handling, as strings may be serialized on input
and output only as sequences of bytes or code units in some encoding
scheme or coded charset.

If you think then about the normalization process, it can also be
performed symbolically, without using code points, and even when using
equivalent symbols to represent the same code point (for example, in
SGML or XML, the "numeric" character references or named character
entities).
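As a rough sketch of that symbolic view (using Python's `html` module as a stand-in for SGML/XML entity resolution; my example, not from the original mail): two different symbolic spellings resolve to the same code point, and a decomposed spelling still normalizes to the same composed form.

```python
import unicodedata
from html import unescape

# Two symbolic spellings of the same code point U+00E9:
assert unescape("&#233;") == unescape("&eacute;") == "\u00E9"

# A decomposed spelling via references is a different raw string...
assert unescape("e&#769;") != "\u00E9"
# ...but canonical normalization maps it to the same composed form.
assert unicodedata.normalize("NFC", unescape("e&#769;")) == "\u00E9"
```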

Am I wrong?

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:25 CST