From: Jony Rosenne (rosennej@qsm.co.il)
Date: Thu Aug 07 2003 - 13:29:09 EDT
We need an official Unicode Lint.
Jony
> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On Behalf Of Philippe Verdy
> Sent: Thursday, August 07, 2003 4:28 PM
> To: unicode@unicode.org
> Subject: SPAM: Re: Questions on ZWNBS - for line initial
> holam plus alef
>
>
> On Thursday, August 07, 2003 2:40 AM, Doug Ewell
> <dewell@adelphia.net> wrote:
>
> > Kenneth Whistler <kenw at sybase dot com> wrote:
> >
> > > But I challenge you to find anything in the standard that
> > > *prohibits* such sequences from occurring.
> >
> > I've learned that this question of "illegal" or "invalid" character
> > sequences is one of the main distinguishing factors between
> those who
> > truly understand Unicode and those who are still on the Road to
> > Enlightenment.
> >
> > Very, very few sequences of Unicode characters are truly
> "invalid" or
> > "illegal." Unpaired surrogates are a rare exception.
> >
> > In almost all cases, a given sequence might give unexpected results
> > (e.g. putting a combining diacritic before the base character) or
> > might be ineffectual (e.g. putting a variation selector before an
> > arbitrary character), but it is still perfectly legal to encode and
> > exchange such a sequence.
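
A minimal Python sketch of the distinction drawn above, assuming CPython's built-in UTF-8 codec and the standard `unicodedata` module. The sample strings are illustrative only: a combining mark placed before its base letter is unexpected but encodes without complaint, while an unpaired surrogate is rejected by the encoder.

```python
import unicodedata

# COMBINING ACUTE ACCENT placed *before* its base letter:
# unexpected for a renderer, but still legal to encode and exchange.
odd = "\u0301a"
encoded = odd.encode("utf-8")  # succeeds
starts_with_mark = unicodedata.combining(odd[0]) != 0

# An unpaired high surrogate is one of the rare truly ill-formed cases:
# the strict UTF-8 encoder refuses it.
lone = "\ud800"
try:
    lone.encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True
```

Here `starts_with_mark` and `surrogate_rejected` both end up true: the first case is merely odd, the second cannot be interchanged at all.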
>
> For Unicode itself this is true, but what users want is
> interoperability of the encoded text with accurate rendering
> rules. In practice, this means that any undefined or
> unpredictable behavior will mean lack of interoperability and
> should not be used.
>
> The standard should then strongly promote what is a /valid/
> encoding for text with regard to interoperability for all
> text processing algorithms including parsing combining
> sequences, collation, and computing character properties from
> those /valid/ encoded sequences.
>
> We don't have to care much if some encoded text considered
> valid under Unicode/ISO-IEC10646 is rendered or processed
> differently or unpredictably, provided that this does not
> affect common text for actual languages.
>
> In fact the standard specifies that ALL sequences made of
> code points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF
> and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC
> 10646, but it does not attempt to assign properties or
> behavior to ALL of these characters or encoded sequences, as
> it is the job of Unicode to specify this behavior.
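
The code-point-level check described in this paragraph can be sketched as follows. This is only an illustration: `is_interchange_code_point` is a hypothetical name, and the exclusions are assumed to be the surrogate range plus the two noncharacters at the end of each plane (U+nFFFE and U+nFFFF). The noncharacters U+FDD0..U+FDEF are ignored here for brevity.

```python
def is_interchange_code_point(cp: int) -> bool:
    """Hypothetical helper: code-point-level validity only, no sequence rules."""
    if not 0x0000 <= cp <= 0x10FFFF:
        return False  # outside the code space
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate code points
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False  # last two code points of every plane
    return True


def all_code_points_ok(text: str) -> bool:
    # Says nothing about whether the *sequence* makes sense for a language.
    return all(is_interchange_code_point(ord(ch)) for ch in text)
```

Note that a check of this kind accepts a Hebrew point with no base letter just as readily as a well-composed word, which is exactly the gap discussed here.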
>
> If there's something to enhance in the Unicode standard (not
> in the ISO/IEC 10646), it's exactly the specification of
> interoperable encoded sequences. This certainly means that
> concrete examples for actual languages must be documented.
> Just assigning properties to individual ISO/IEC 10646
> characters is not enough, and Unicode should concentrate more
> efforts in the actual encoding of text and not only on
> individual characters.
>
> So for me, the "validity" of text is an ISO/IEC 10646 concept
> (shared now with Unicode versions for the assignment of
> characters in the repertoire), related only to the legally
> usable code points, and Unicode speaks about "well-formed" or
> "ill-formed" sequences, or about "normalized" sequences and
> transformations that preserve the actual text semantics.
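
As a small illustration of "normalized" sequences that preserve the text, here is a sketch using Python's standard `unicodedata.normalize`. The helper name `canonically_equivalent` is hypothetical, and the Latin sample is chosen only because it is the simplest case.

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two strings differ code point by code point...
differ_raw = decomposed != precomposed
# ...but represent the same text once normalized.
same_text = canonically_equivalent(decomposed, precomposed)
```

Normalization answers whether two encodings are the same text. It does not say whether either one is a sensible encoding for a given language.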
>
> There is no ambiguity in ISO/IEC 10646 for the character
> assignments. But composed sequences are the real problem, for
> which Unicode must seek agreements: the W3C character model
> is only based on the simplified combining sequences, but
> Unicode should go further with much more precise rules for
> the encoding of actual text, even before any attempt to
> describe other transformation algorithms (only the NF*
> transformations have a stability policy for now, but actual
> text writers also need stability for the text composition
> rules for actual languages).
>
> We certainly don't need more assigned code points for
> existing scripts, but rather more rules for the actual
> representation of text using these scripts, and for how distinct
> scripts can interact and be mixed. There are some rules already
> specified for combining jamos, for combining
> Latin/Cyrillic/Greek alphabets, and for Hiragana/Katakana, but
> we are still far from an agreement for Hebrew, and even for
> some Han composed sequences, which still lack a specification
> needed for interoperability.
>
> The current wording of "Unicode validity" is for me very
> weak, and probably defective. What it designates is only an
> ISO/IEC 10646 validity for used code points, and the validity of
> their UTF* transformations, based on individual code points.
> The kind of validity rules users want with Unicode is a
> conformance of the actually encoded scripts for actual
> languages, for interoperability and data exchange.
>
> The fact that Unicode was born from an attempt to maximize
> roundtrip convertibility with legacy codepages or encoded
> character sets has introduced many difficulties: first the
> base+combining characters model was introduced as fundamental
> for alphabetic scripts with separate letters for vowels.
> Then there's the case of Brahmic scripts, which complicates
> things, as Unicode has chosen to support both the ISCII
> standard model with nuktas and viramas in logical encoding
> order, and the TIS-620 model for Thai and Lao with a physical
> model. By contrast, the combining jamos model is
> remarkably simple, and it still follows the logical model
> shared by alphabetic scripts.
>
> Looking now at the difficulties of encoding Tengwar reveals
> most of the difficulties that already exist for Thai, and now
> Hebrew, and the subtle artefacts needed in existing
> scripts used to transliterate foreign languages. Some of
> these difficulties now also affect the general
> alphabetic scripts (Latin notably), showing that the
> immutable model used to encode base letters and diacritics is
> not universal. So Unicode will need to extend its own character
> model and specify it in much more detail to support more scripts
> and languages, including in the case of transliterations.
>
> Maybe in the future, this will lead to defining a new level
> of conformance by defining something that is more precise
> than just some basic canonical equivalence rules (for NF*
> transforms and XML), with more precise definitions of
> "ill-formed" or "defective" sequences (I confess that I do
> not understand the need to differentiate the two concepts, and
> the current separation is really more confusing than helpful
> for understanding the Unicode standard). What this means is that
> we need something saying "Unicode valid text" and not just
> "Unicode encoded text", which just relates to the shared
> assignment of code points to individual characters. The
> current "valid" term should be left to the ISO/IEC 10646
> standard, and to the very few Unicode algorithms that handle
> only individual code points (such as UTF* encoding forms and
> schemes), but its current definition is not helping
> implementers and writers to produce interoperable textual data.
>
> If the term "valid" cannot be changed, then I suggest
> defining "conforming" for encoded text independently of its
> validity (a "conforming text" would still need to use a
> "valid encoding").
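
To make the "valid" versus "conforming" layering concrete, here is a rough sketch of what a lint pass over encoded text might look like. Everything in it is an assumption for illustration: the function name `lint` and both rules are invented here and are not taken from the Unicode Standard. It flags lone surrogates, and combining marks that appear at the start of the text or directly after a control, format, or separator character.

```python
import unicodedata


def lint(text: str) -> list:
    """Hypothetical lint pass; the two rules below are illustrative only."""
    warnings = []
    prev_cat = None
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat == "Cs":
            # A surrogate inside a Python str is always unpaired.
            warnings.append(f"{i}: unpaired surrogate U+{ord(ch):04X}")
        elif cat.startswith("M") and (prev_cat is None or prev_cat[0] in "CZ"):
            # Mark with nothing usable before it to attach to.
            warnings.append(f"{i}: combining mark U+{ord(ch):04X} has no base")
        prev_cat = cat
    return warnings


# HEBREW POINT HOLAM after ALEF passes; HOLAM at the start of the text is flagged.
ok = lint("\u05d0\u05b9")
flagged = lint("\u05b9\u05d0")
```

Language-specific rules, such as which Hebrew point sequences are meaningful, would have to be layered on top of generic checks like these.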
>
>
> --
> Philippe.
> Spam not tolerated: any unsolicited message will be
> reported to your Internet service providers.
>
>
>
>