Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler
Date: Thu Aug 07 2003 - 19:13:09 EDT


    Peter Kirk followed up:

    > On 07/08/2003 07:27, Philippe Verdy wrote:
    > >On Thursday, August 07, 2003 2:40 AM, Doug Ewell wrote:
    > >
    > >>Kenneth Whistler <kenw at sybase dot com> wrote:
    > >>
    > >>>But I challenge you to find anything in the standard that
    > >>>*prohibits* such sequences from occurring.
    > >>>
    > >>>
    > >>I've learned that this question of "illegal" or "invalid" character
    > >>sequences is one of the main distinguishing factors between those who
    > >>truly understand Unicode and those who are still on the Road to
    > >>Enlightenment.
    > >>
    > >>...
    > >>
    > >If the term "valid" cannot be changed, then I suggest defining
    > >"conforming" for encoded text independently of its validity (a
    > >"conforming text" would still need to use a "valid encoding").
    > >
    > As a very quick thought, maybe what we need is not restrictions to the
    > Unicode standard but a set of rules for each language or group of
    > languages, defining exactly how Unicode characters should be used to
    > write the words etc of that language. Such definitions might be
    > independent of the actual Unicode standard.

    I emphatically agree with Peter on this.

    The impulse to get the Unicode Standard to head down the road
    to becoming the "spelling standard" for all languages of the
    world has to be constrained, simply because there is not the
    expertise or the bandwidth in the UTC to accomplish this and
    because it isn't the business of the UTC in the first place.

    This is the kind of task which *must* be distributed to the
    relevant stakeholders around the world, wherever they may
    be and however their relevant jurisdictions are defined and

    The establishment of orthographic rules for a particular language in
    the context of the Unicode Standard means transferring the notion
    of what the printed conventions for that language are -- whatever
    they may be -- into a determination of exactly which Unicode
    characters are to be used to represent those conventions,
    including any constraints on cooccurrence with particular
    format control characters, and so on.

    The scope of the task of defining rendering rules in the
    Unicode Standard is generic to script behavior -- establishing
    the general rules of the road, as it were, for how the
    scripts behave in the encoding, so that people and implementations
    have a determinate sense of what order characters should be
    in, what it means for combining characters to "combine" with
    base characters, how format control characters may impact
    script rendering generically, and so on. But beyond that, one
    is getting into the realm of orthographic rules for particular
    languages or jurisdictions and the realm of typographic
    conventions for particular styles and regions. Making those
    determinations belongs to the stakeholders themselves: ministries,
    academies, associations, type designers, whoever.

    It is precisely because the developers of the Unicode Standard
    cannot foresee all possible orthographic conventions and
    uses to which the standard may be put in representing text
    that it is deliberately permissive: essentially any sequence
    of characters is "legal", and it is up to the users of
    the standard to determine, for them, what is a *sensible*
    sequence of characters for their multitudinous purposes.
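
    The distinction between a "legal" sequence and a *sensible* one
    can be sketched in a few lines of Python. This is only an
    illustration under stated assumptions: the function names are
    hypothetical, and the rule that a combining mark must follow a
    letter is an invented example of a language-level convention,
    not anything defined by the Unicode Standard.

```python
import unicodedata


def is_well_formed(text: str) -> bool:
    """Encoding-level check: can the text be serialized as UTF-8?

    Only unpaired surrogates fail here; any sequence of Unicode
    scalar values passes, however odd it looks.
    """
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def follows_hypothetical_rule(text: str) -> bool:
    """Hypothetical orthographic rule (an assumption for illustration):
    every combining mark must follow a letter or another mark.
    """
    previous_can_carry_mark = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            if not previous_can_carry_mark:
                return False
            # Marks may stack, so the flag stays True.
        else:
            previous_can_carry_mark = category.startswith("L")
    return True


# U+05D0 HEBREW LETTER ALEF, U+05B9 HEBREW POINT HOLAM
LETTER_THEN_MARK = "\u05D0\u05B9"
MARK_THEN_LETTER = "\u05B9\u05D0"
```

    Both sample strings pass the encoding-level check, but only the
    first satisfies the invented rule. A rule of this kind is layered
    on top of the standard by whoever defines the orthography, and a
    different community could choose a different one.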

