From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 07 2003 - 10:27:35 EDT
On Thursday, August 07, 2003 2:40 AM, Doug Ewell <dewell@adelphia.net> wrote:
> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > But I challenge you to find anything in the standard that
> > *prohibits* such sequences from occurring.
>
> I've learned that this question of "illegal" or "invalid" character
> sequences is one of the main distinguishing factors between those who
> truly understand Unicode and those who are still on the Road to
> Enlightenment.
>
> Very, very few sequences of Unicode characters are truly "invalid" or
> "illegal." Unpaired surrogates are a rare exception.
>
> In almost all cases, a given sequence might give unexpected results
> (e.g. putting a combining diacritic before the base character) or
> might be ineffectual (e.g. putting a variation selector before an
> arbitrary character), but it is still perfectly legal to encode and
> exchange such a sequence.
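The quoted distinction can be illustrated with a quick Python sketch (the helper name here is mine, for illustration only): an unpaired surrogate cannot even be serialized to UTF-8, while a combining mark placed before its base is odd but perfectly encodable.

```python
def is_encodable(s: str) -> bool:
    """Return True if s can be serialized as UTF-8,
    i.e. it contains no unpaired surrogates."""
    try:
        s.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False

# Lone high surrogate: truly invalid, cannot be serialized.
print(is_encodable('\ud800'))    # False

# COMBINING ACUTE ACCENT before its base letter: unexpected
# rendering, but a perfectly legal, exchangeable sequence.
print(is_encodable('\u0301A'))   # True
```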
For Unicode itself this is true, but what users want is interoperability
of the encoded text with accurate rendering rules.
In practice, any undefined or unpredictable behavior means a lack of
interoperability and should be avoided.
The standard should therefore strongly promote what constitutes a /valid/
encoding of text with regard to interoperability for all text-processing
algorithms, including the parsing of combining sequences, collation, and
the computation of character properties from those /valid/ encoded sequences.
We need not care much if some encoded text considered valid under
Unicode/ISO-IEC 10646 is rendered or processed differently or
unpredictably, provided this does not affect common text in actual
languages.
In fact the standard specifies that ALL sequences made of code
points in U+0000 to U+10FFFF (excluding the per-plane noncharacters
U+xFFFE and U+xFFFF, and the surrogates U+D800 to U+DFFF) are valid
under ISO/IEC 10646, but it does not attempt to assign properties or
behavior to ALL of these characters or encoded sequences; specifying
that behavior is the job of Unicode.
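The code-point rule described above can be sketched as a small predicate (a sketch of the rule as stated in this message, not the full ISO/IEC 10646 definition; the function name is mine):

```python
def is_valid_code_point(cp: int) -> bool:
    """Code-point validity as described above: U+0000..U+10FFFF,
    excluding surrogates (U+D800..U+DFFF) and the per-plane
    noncharacters U+nFFFE / U+nFFFF."""
    if not 0x0000 <= cp <= 0x10FFFF:
        return False                     # outside the code space
    if 0xD800 <= cp <= 0xDFFF:
        return False                     # surrogate code points
    if cp & 0xFFFE == 0xFFFE:
        return False                     # nFFFE / nFFFF in every plane
    return True

print(is_valid_code_point(0x0041))   # True  (LATIN CAPITAL LETTER A)
print(is_valid_code_point(0xFFFE))   # False (noncharacter)
```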
If there's something to enhance in the Unicode standard (not in
ISO/IEC 10646), it is precisely the specification of interoperable encoded
sequences. This certainly means that concrete examples for actual
languages must be documented. Just assigning properties to individual
ISO/IEC 10646 characters is not enough; Unicode should concentrate
more effort on the actual encoding of text, and not only on individual
characters.
So for me, the "validity" of text is an ISO/IEC 10646 concept (now
shared with Unicode versions for the assignment of characters in the
repertoire), related only to the legally usable code points, whereas
Unicode speaks about "well-formed" or "ill-formed" sequences, or about
"normalized" sequences and the transformations that preserve the actual
text semantics.
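For instance, the NF* transformations map between canonically equivalent spellings without changing the text's meaning, which can be checked directly with Python's `unicodedata` module:

```python
import unicodedata

decomposed = 'e\u0301'   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed   = '\u00e9'    # LATIN SMALL LETTER E WITH ACUTE

# Both spellings are canonically equivalent: normalization maps
# between them while preserving the text semantics.
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
```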
There is no ambiguity in ISO/IEC 10646 about character assignments.
But composed sequences are the real problem, for which Unicode
must seek agreements: the W3C character model is based only on
simplified combining sequences, but Unicode should go further,
with much more precise rules for the encoding of actual text, even
before attempting to describe other transformation algorithms (only
the NF* transformations currently have a stability policy, but writers
of actual text also need stability for the text-composition rules of
actual languages).
We certainly don't need more assigned code points for existing
scripts, but rather more rules for the actual representation of text
using these scripts, and for how distinct scripts can interact and be
mixed. Some rules are already specified for conjoining jamos, for
combining Latin/Cyrillic/Greek alphabets, or for Hiragana/Katakana,
but we are still far from an agreement for Hebrew, and even for some
Han composed sequences, which still lack a specification needed
for interoperability.
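The conjoining-jamo rules mentioned above are a good example of a fully specified composition model: NFC composes a jamo sequence into a precomposed Hangul syllable, and NFD decomposes it back, with no ambiguity.

```python
import unicodedata

# HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A
jamos = '\u1100\u1161'

# The conjoining jamos compose deterministically into the
# precomposed syllable U+AC00 HANGUL SYLLABLE GA, and back.
print(unicodedata.normalize('NFC', jamos) == '\uac00')   # True
print(unicodedata.normalize('NFD', '\uac00') == jamos)   # True
```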
The current wording of "Unicode validity" is for me very weak, and
probably defective. What it designates is only an ISO 10646 validity
for the code points used, and the validity of their UTF-* transformations,
based on individual code points. The kind of validity rules users
want from Unicode is conformance of the actually encoded scripts
for actual languages, for interoperability and data exchange.
The fact that Unicode was born trying to maximize round-trip
convertibility with legacy code pages and encoded character sets has
introduced many difficulties: first, the base+combining-character
model was introduced as fundamental for alphabetic scripts with
separate letters for vowels. Then there is the case of the Brahmic
scripts, which complicates things, as Unicode has chosen to support
both the ISCII model, with nuktas and viramas in logical encoding
order, and the TIS-620 model for Thai and Lao, with a physical
(visual) order. By contrast, the conjoining-jamo model is remarkably
simple, and it still follows the logical model shared by alphabetic
scripts.
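The contrast between the two encoding orders can be seen directly in the code-point sequences (syllable choices here are mine, as minimal examples):

```python
import unicodedata

# Devanagari "ki" (ISCII logical order): the consonant KA is stored
# FIRST, even though the vowel sign I renders to its LEFT.
ki = '\u0915\u093f'
print([unicodedata.name(c) for c in ki])
# ['DEVANAGARI LETTER KA', 'DEVANAGARI VOWEL SIGN I']

# Thai (TIS-620 visual order): the prefixed vowel SARA E is stored
# BEFORE the consonant, matching its visual position.
ke = '\u0e40\u0e01'
print([unicodedata.name(c) for c in ke])
# ['THAI CHARACTER SARA E', 'THAI CHARACTER KO KAI']
```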
Looking at the difficulties of encoding Tengwar reveals most of
the difficulties that already exist for Thai, and now Hebrew, and the
subtle artefacts needed in existing scripts used to transliterate
foreign languages. Some of these difficulties now also affect the
general alphabetic scripts (Latin notably), showing that the
immutable model used to encode base letters and diacritics is not
universal. So Unicode will need to extend and specify its own
character model much further to support more scripts and languages,
including in the case of transliterations.
Maybe in the future this will lead to defining a new level of conformance,
with something more precise than just the basic canonical-equivalence
rules (for the NF* transforms and XML), including more precise
definitions of "ill-formed" or "defective" sequences (I confess that
I do not understand the need to differentiate the two concepts; the
current separation is really more confusing than helpful for
understanding the Unicode standard). What this means is that we need
something saying "Unicode valid text" and not just "Unicode encoded
text", which only relates to the shared assignment of code points to
individual characters. The current "valid" term should be left to the
ISO/IEC 10646 standard, and to the very few Unicode algorithms
that handle only individual code points (such as the UTF-* encoding
forms and schemes), but its current definition does not help
implementers and writers produce interoperable textual data.
If the term "valid" cannot be changed, then I suggest defining
"conforming" for encoded text independently of its validity (a
"conforming text" would still need to use a "valid encoding").
-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
This archive was generated by hypermail 2.1.5 : Thu Aug 07 2003 - 11:18:58 EDT