From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 06 2003 - 22:41:48 EDT
John Cowan asked:
> > D17a Defective combining character sequence: A combining character
> > sequence that does not start with a base character.
> >
> > * Defective combining character sequences occur when a sequence
> > of combining characters appears at the start of a string or
> > follows a control or format character. Such sequences are
> > defective from the point of view of handling of combining
> > marks, but are not ill-formed.
> > ^^^^^^^^^^^^^^^^^^^^^^
>
> What, if anything, does the term "ill-formed" mean when attached to
> a sequence of characters?
Nothing, really. The bullet goes on to point to the definition
(D30) of "ill-formed", which applies to code *unit* sequences in
the context of the encoding forms.
The rewrite of Chapter 3 of the Unicode Standard dispensed with
the ill-advised ;-) and confusing distinction between "illegal",
"irregular", and "ill-formed" "code value sequences" in the
context of the discussion of "transformations", in favor of
a much starker and simpler distinction:
a code unit sequence is either well-formed or it is not.
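To make the code *unit* level concrete, here is a minimal sketch in Python. It assumes UTF-8 as the encoding form and leans on Python's strict UTF-8 decoder to decide well-formedness; the helper name is_well_formed_utf8 is made up for illustration and is not from the standard.

```python
def is_well_formed_utf8(code_units: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 under Python's strict decoder."""
    try:
        code_units.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A lone COMBINING ACUTE ACCENT (U+0301) has no base character, so it is
# "defective" in the D17a sense, but its UTF-8 encoding is well-formed.
lone_mark = "\u0301".encode("utf-8")
print(is_well_formed_utf8(lone_mark))        # True

# An overlong two-byte encoding of U+0000 is not well-formed.
print(is_well_formed_utf8(b"\xc0\x80"))      # False

# A three-byte encoding of the surrogate U+D800 is not well-formed either.
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # False
```

The check never looks at what the characters are or how they combine; it only asks whether the code units form valid UTF-8.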
> I understood that every sequence of
> characters whatsoever is permitted.
As regards code *point* sequences, these sequences can either
be conformant to the standard or not conformant to the standard.
They are conformant if they meet the conformance requirements
(the "C" clauses of Chapter 3). And as regards sequences of
characters, that basically comes down to not trying to
interchange reserved or noncharacter code points. So if you
include a reserved (unassigned) code point (for a particular version
of the Unicode Standard) in an interchanged data stream,
a recipient could claim that data stream is not conformant
to (that version of) the standard. Shorthand: the data contains
"illegal" characters. But even that is relative to the version
of the standard, since a recipient of reserved code points is
obliged to preserve their values -- they may, after all, be
"legal" assigned code points in a future version of the
standard that that particular implementation is not supporting.
So, yeah, basically every sequence of code points "assigned to
abstract characters" is "legal" for interchange. What you cannot
interchange are code points with gc=Cs (U+D800..U+DFFF) or
code points with gc=Cn (noncharacters and reserved).
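As a rough illustration of that rule, the following Python sketch lists the code points in a string whose General_Category is Cs or Cn. The function name non_interchangeable is hypothetical. The answer depends on the version of the Unicode Character Database shipped with the Python build, which mirrors the point above that "reserved" is relative to a version of the standard.

```python
import unicodedata


def non_interchangeable(text):
    """Return 'U+XXXX' labels for code points in text with gc=Cs or gc=Cn.

    Assumption: the unicodedata module's database stands in for "the
    version of the standard" that this implementation supports.
    """
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) in ("Cs", "Cn")
    ]


# Assigned characters only, including a combining mark: nothing is flagged.
print(non_interchangeable("a\u0301"))   # []

# U+FDD0 is a noncharacter (gc=Cn).
print(non_interchangeable("x\ufdd0"))   # ['U+FDD0']

# A Python str can hold a lone surrogate code point (gc=Cs).
print(non_interchangeable("\ud800"))    # ['U+D800']
```

A recipient running an older database would flag more code points as Cn than a newer one, which is why it should preserve such values rather than discard them.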
What D17a is trying to tell people is that while certain sequences
of Unicode characters may be "defective" from the point of
view of certain kinds of processing -- in this case rendering
of combining character sequences -- that does not make them
ill-formed (for which see the specification of encoding forms),
nor does it make them nonconformant to the standard.
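For anyone who wants to spot such sequences, here is a simplified sketch in Python. It makes two assumptions that are not spelled out in D17a: "combining character" is approximated as any code point whose General_Category starts with M, and "control or format character" as gc=Cc or gc=Cf. The name find_defective_starts is invented for the example.

```python
import unicodedata


def find_defective_starts(text):
    """Return indices where a defective combining character sequence begins.

    Simplified reading of D17a: a combining mark (gc=Mn, Mc or Me) begins a
    defective sequence if it is at the start of the string or directly
    follows a control (Cc) or format (Cf) character.
    """
    starts = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("M"):
            continue  # not a combining mark under this approximation
        if i == 0 or unicodedata.category(text[i - 1]) in ("Cc", "Cf"):
            starts.append(i)
    return starts


print(find_defective_starts("a\u0301"))         # [] : mark follows a base
print(find_defective_starts("\u0301abc"))       # [0]: mark at start of string
print(find_defective_starts("a\n\u0301"))       # [2]: mark follows a control
print(find_defective_starts("\u0301\u0302"))    # [0]: one defective sequence
```

Finding these positions is a processing concern only. Every string above is still a perfectly conformant sequence of characters.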
There are many sequences of Unicode characters that we could
dream up which would be abominable, distasteful, problematical,
defective, implementation-busting, or just plain screwy,
but the standard itself isn't prohibiting people from
conformantly creating such sequences and then challenging
Microsoft or anybody else to display them without
blowing a gasket.
One of the reasons why we have to be so incredibly careful now
before introducing conceptually new *types* of characters,
like the COMBINING GRAPHEME JOINER or such things as
INVISIBLE BASE CHARACTER or COMBINING CLASS CHANGER or whatnot,
is precisely that it gets harder and harder to program
defensively against all the possible combinations and interactions
that such beasties might have when mixed with everything else
that is available.
--Ken