From: Kenneth Whistler (email@example.com)
Date: Thu Aug 07 2003 - 19:13:09 EDT
Peter Kirk followed up:
> On 07/08/2003 07:27, Philippe Verdy wrote:
> >On Thursday, August 07, 2003 2:40 AM, Doug Ewell <firstname.lastname@example.org> wrote:
> >>Kenneth Whistler <kenw at sybase dot com> wrote:
> >>>But I challenge you to find anything in the standard that
> >>>*prohibits* such sequences from occurring.
> >>I've learned that this question of "illegal" or "invalid" character
> >>sequences is one of the main distinguishing factors between those who
> >>truly understand Unicode and those who are still on the Road to
> >If the term "valid" cannot be changed, then I suggest defining
> >"conforming" for encoded text independently of its validity (a
> >"conforming text" would still need to use a "valid encoding").
> As a very quick thought, maybe what we need is not restrictions to the
> Unicode standard but a set of rules for each language or group of
> languages, defining exactly how Unicode characters should be used to
> write the words etc of that language. Such definitions might be
> independent of the actual Unicode standard.
I emphatically agree with Peter on this.
The impulse to get the Unicode Standard to head down the road
to becoming the "spelling standard" for all languages of the
world has to be constrained, simply because there is not the
expertise or the bandwidth in the UTC to accomplish this and
because it isn't the business of the UTC in the first place.
This is the kind of task which *must* be distributed to the
relevant stakeholders around the world, wherever they may
be and however their relevant jurisdictions are defined and
The establishment of orthographic rules for a particular language in
the context of the Unicode Standard means transferring the notion
of what the printed conventions for that language are -- whatever
they may be -- into a determination of exactly which Unicode
characters are to be used to represent those conventions,
including any constraints on cooccurrence with particular
format control characters, and so on.
The scope of the task of defining rendering rules in the
Unicode Standard is generic to script behavior -- establishing
the general rules of the road, as it were, for how the
scripts behave in the encoding, so that people and implementations
have a determinate sense of what order characters should be
in, what it means for combining characters to "combine" with
base characters, how format control characters may impact
script rendering generically, and so on. But beyond that, one
is getting into the realm of orthographic rules for particular
languages or jurisdictions and the realm of typographic
conventions for particular styles and regions. Making those
determinations belongs to the stakeholders themselves: ministries,
academies, associations, type designers, whoever.
It is precisely because the developers of the Unicode Standard
cannot foresee all possible orthographic conventions and
uses to which the standard may be put in representing text
that it is deliberately permissive: essentially any sequence
of characters is "legal", and it is up to the users of
the standard to determine, for them, what is a *sensible*
sequence of characters for their multitudinous purposes.
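
To make the distinction concrete, here is a minimal sketch in Python.
It separates two questions: whether a string can be encoded at all,
and whether it satisfies one orthographic rule. The rule, the function
names and the sample string are all invented for illustration; nothing
here comes from the Unicode Standard or from any real language's
orthography, and the UTF-8 round trip is only a rough stand-in for a
"valid encoding".

```python
import unicodedata

# An odd but encodable sequence: a combining acute accent with no base
# character before it, a ZERO WIDTH JOINER, then HEBREW LETTER ALEF
# followed by another combining acute accent.
ODD_SEQUENCE = "\u0301\u200D\u05D0\u0301"


def is_encodable(text: str) -> bool:
    """Return True if the text round-trips through UTF-8.

    This fails only for things like lone surrogate code points; it says
    nothing about whether the sequence is sensible for any language.
    """
    try:
        return text.encode("utf-8").decode("utf-8") == text
    except UnicodeEncodeError:
        return False


def follows_toy_orthography(text: str) -> bool:
    """Hypothetical rule for an imaginary language: every combining
    mark must directly follow a letter or another combining mark.

    A rule like this would be owned by the language's stakeholders,
    not by the character encoding.
    """
    previous_ok = False  # nothing precedes the first character
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") and not previous_ok:
            return False
        previous_ok = category.startswith("L") or category.startswith("M")
    return True


if __name__ == "__main__":
    print(is_encodable(ODD_SEQUENCE))
    print(follows_toy_orthography(ODD_SEQUENCE))
    print(follows_toy_orthography("e\u0301"))
```

The first check accepts the odd sequence while the second rejects it,
which is the point: the encoding layer stays permissive, and any
stricter notion of a "sensible" sequence lives in a separate,
replaceable rule set.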