Valid encodings

From: Jony Rosenne (
Date: Thu Aug 07 2003 - 13:29:09 EDT


    We need an official Unicode Lint.


    > -----Original Message-----
    > From:
    > [] On Behalf Of Philippe Verdy
    > Sent: Thursday, August 07, 2003 4:28 PM
    > To:
    > Subject: SPAM: Re: Questions on ZWNBS - for line initial
    > holam plus alef
    > On Thursday, August 07, 2003 2:40 AM, Doug Ewell
    > <> wrote:
    > > Kenneth Whistler <kenw at sybase dot com> wrote:
    > >
    > > > But I challenge you to find anything in the standard that
    > > > *prohibits* such sequences from occurring.
    > >
    > > I've learned that this question of "illegal" or "invalid" character
    > > sequences is one of the main distinguishing factors between those who
    > > truly understand Unicode and those who are still on the Road to
    > > Enlightenment.
    > >
    > > Very, very few sequences of Unicode characters are truly "invalid" or
    > > "illegal." Unpaired surrogates are a rare exception.
    > >
    > > In almost all cases, a given sequence might give unexpected results
    > > (e.g. putting a combining diacritic before the base character) or
    > > might be ineffectual (e.g. putting a variation selector before an
    > > arbitrary character), but it is still perfectly legal to encode and
    > > exchange such a sequence.
    >
    > For Unicode itself this is true, but what users want is
    > interoperability of the encoded text with accurate rendering
    > rules. In practice, this means that any sequence with undefined
    > or unpredictable behavior undermines interoperability and
    > should not be used.
    >
    > The standard should then strongly promote what is a /valid/
    > encoding for text with regard to interoperability for all
    > text processing algorithms, including parsing combining
    > sequences, collation, and computing character properties from
    > those /valid/ encoded sequences.
    >
    > We don't have to care much if some encoded text considered
    > valid under Unicode/ISO/IEC 10646 is rendered or processed
    > differently or unpredictably, provided that this does not
    > affect common text for actual languages.
    >
    > In fact the standard specifies that ALL sequences made of
    > code points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF
    > and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC
    > 10646, but it does not attempt to assign properties or
    > behavior to ALL of these characters or encoded sequences, as
    > it is the job of Unicode to specify this behavior.
    >
    > If there's something to enhance in the Unicode standard (not
    > in ISO/IEC 10646), it's exactly the specification of
    > interoperable encoded sequences. This certainly means that
    > concrete examples for actual languages must be documented.
    > Just assigning properties to individual ISO/IEC 10646
    > characters is not enough, and Unicode should concentrate more
    > effort on the actual encoding of text and not only on
    > individual characters.
    >
    > So for me, the "validity" of text is an ISO/IEC 10646 concept
    > (shared now with Unicode versions for the assignment of
    > characters in the repertoire), related only to the legally
    > usable code points, and Unicode speaks about "well-formed" or
    > "ill-formed" sequences, or about "normalized" sequences and
    > transformations that preserve the actual text semantics.
    >
    > There is no ambiguity in ISO/IEC 10646 for the character
    > assignments. But composed sequences are the real problem, for
    > which Unicode must seek agreements: the W3C character model
    > is only based on the simplified combining sequences, but
    > Unicode should go further with much more precise rules for
    > the encoding of actual text, even before any attempt to
    > describe other transformation algorithms (only the NF*
    > transformations have a stability policy for now, but actual
    > text writers also need stability for the text composition
    > rules for actual languages).
    >
    > We certainly don't need more assigned code points for
    > existing scripts, but rather more rules for the actual
    > representation of text using these scripts, and for how
    > distinct scripts can interact and be mixed. There are some
    > rules already specified for combining jamos, for combining
    > Latin/Cyrillic/Greek alphabets, and for Hiragana/Katakana, but
    > we are still far from an agreement for Hebrew, and even for
    > some Han composed sequences, which still lack a specification
    > needed for interoperability.
    >
    > The current wording of "Unicode validity" is for me very
    > weak, and probably defective. What it designates is only
    > ISO/IEC 10646 validity for the code points used, and the
    > validity of their UTF* transformations, based on individual
    > code points. The kind of validity rules users want with
    > Unicode is a conformance of the actually encoded scripts for
    > actual languages, for interoperability and data exchange.
    >
    > The fact that Unicode was born from trying to maximize the
    > roundtrip convertibility with legacy codepages or encoded
    > character sets has introduced many difficulties: first the
    > base+combining characters model was introduced as fundamental
    > for alphabetic scripts with separate letters for vowels.
    > Then there's the case of Brahmic scripts, which complicates
    > things, as Unicode has chosen to support both the ISCII
    > standard model with nuktas and viramas in logical encoding
    > order, and the TIS-620 model for Thai and Lao with a physical
    > model. By contrast, the combining jamos model is
    > remarkably simple, and it still follows the logical model
    > shared by alphabetic scripts.
    >
    > Looking now at the difficulties of encoding Tengwar reveals
    > most of the difficulties that already exist for Thai, and now
    > Hebrew, and the subtle artefacts needed in existing
    > scripts used to transliterate foreign languages. Some of
    > these difficulties now also affect the general
    > alphabetic scripts (Latin notably), showing that the
    > immutable model used to encode base letters and diacritics is
    > not universal. So Unicode will need to extend its own
    > character model and specify it in much more detail to support
    > more scripts and languages, including for transliterations.
    >
    > Maybe in the future, this will lead to defining a new level
    > of conformance by defining something that is more precise
    > than just some basic canonical equivalence rules (for NF*
    > transforms and XML), with more precise definitions of
    > "ill-formed" or "defective" sequences (I confess that I do
    > not understand the need to differentiate both concepts, and
    > this current separation is really more confusing than helpful
    > for understanding the Unicode standard). What this means is
    > that we need something saying "Unicode valid text" and not just
    > "Unicode encoded text", which just relates to the shared
    > assignment of code points to individual characters. The
    > current "valid" term should be left to the ISO/IEC 10646
    > standard, and to the very few Unicode algorithms that handle
    > only individual code points (such as UTF* encoding forms and
    > schemes), but its current definition is not helping
    > implementers and writers to produce interoperable textual data.
    >
    > If the term "valid" cannot be changed, then I suggest
    > defining "conforming" for encoded text independently of its
    > validity (a "conforming text" would still need to use a
    > "valid encoding").
    > --
    > Philippe.
    > Spam not tolerated: any unsolicited message will be
    > reported to your Internet service providers.
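
A minimal sketch of the kind of check a "Unicode Lint" could perform, assuming the distinctions drawn in the quoted message: surrogate code points and noncharacters (U+FDD0..U+FDEF and U+xFFFE/U+xFFFF) are flagged at the code point level, while a combining mark with no preceding base character is flagged as a defective combining sequence even though it remains legal to encode and exchange. The names lint, is_surrogate and is_noncharacter are hypothetical, and the base-character test is a simplification of the standard's definition, not an official algorithm.

```python
import unicodedata


def is_surrogate(cp):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def lint(text):
    """Return a list of (index, code point, message) findings for text.

    Assumption: a Python str is a sequence of code points, so any
    surrogate found in it is unpaired by construction.
    """
    findings = []
    has_base = False  # True once a base character can carry following marks
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_surrogate(cp):
            findings.append((index, cp, "unpaired surrogate"))
            has_base = False
        elif is_noncharacter(cp):
            findings.append((index, cp, "noncharacter"))
            has_base = False
        else:
            category = unicodedata.category(ch)
            if category.startswith("M"):
                # Combining mark: legal to encode, but defective with no base.
                if not has_base:
                    findings.append((index, cp, "combining mark without base"))
            else:
                # Simplified base test: letters, numbers, punctuation,
                # symbols and space separators count as base characters.
                has_base = category[0] in "LNPS" or category == "Zs"
    return findings


if __name__ == "__main__":
    # Line-initial holam followed by alef, as in the thread's subject line.
    for index, cp, message in lint("\u05b9\u05d0"):
        print(f"index {index}: U+{cp:04X} {message}")
```

Under these assumptions, a line-initial holam (U+05B9) followed by alef is reported as a combining mark without a base, while U+FEFF passes the code point check because it is an assigned format character rather than a noncharacter.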
