Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler
Date: Wed Aug 06 2003 - 22:41:48 EDT

    John Cowan asked:

    > > D17a Defective combining character sequence: A combining character
    > > sequence that does not start with a base character.
    > >
    > > * Defective combining character sequences occur when a sequence
    > > of combining characters appears at the start of a string or
    > > follows a control or format character. Such sequences are
    > > defective from the point of view of handling of combining
    > > marks, but are not ill-formed.
    > > ^^^^^^^^^^^^^^^^^^^^^^
    > What, if anything, does the term "ill-formed" mean when attached to
    > a sequence of characters?

    Nothing, really. The bullet goes on to point to the definition
    (D30) of "ill-formed", which applies to code *unit* sequences in
    the context of the encoding forms.

    The rewrite of Chapter 3 of the Unicode Standard dispensed with
    the ill-advised ;-) and confusing distinction between "illegal",
    "irregular", and "ill-formed" "code value sequences" in the
    context of the discussion of "transformations", in favor of
    a much starker and simpler distinction:

       a code unit sequence is either well-formed or it is not
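
    As a rough illustration of that either/or distinction, here is a
    minimal sketch using UTF-8 as the encoding form. It assumes that
    Python's strict UTF-8 decoder is an adequate stand-in for the
    well-formedness check; the helper name is_well_formed_utf8 is made
    up for this example and is not anything from the standard.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 decoder.

    Assumption: the strict decoder is used here as a proxy for the
    well-formedness test on UTF-8 code unit sequences.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample data, chosen for illustration only.
samples = {
    "plain ASCII": b"abc",
    "line-initial holam plus alef": "\u05B9\u05D0".encode("utf-8"),
    "overlong encoding of '/'": b"\xC0\xAF",
    "encoded surrogate U+D800": b"\xED\xA0\x80",
    "truncated three-byte sequence": b"\xE2\x82",
}

for label, data in samples.items():
    print(f"{label}: {is_well_formed_utf8(data)}")
```

    Note that the holam-plus-alef sample passes: a combining mark with
    no base in front of it says nothing about whether the code units
    that encode it are well-formed.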

    > I understood that every sequence of
    > characters whatsoever is permitted.

    As regards code *point* sequences, these sequences can either
    be conformant to the standard or not conformant to the standard.
    They are conformant if they meet the conformance requirements
    (the "C" clauses of Chapter 3). And as regards sequences of
    characters, that basically comes down to not trying to
    interchange reserved or noncharacter code points. So if you
    include a reserved (unassigned) code point (for a particular version
    of the Unicode Standard) in an interchanged data stream,
    a recipient could claim that data stream is not conformant
    to (that version of) the standard. Shorthand: the data contains
    "illegal" characters. But even that is relative to the version
    of the standard, since a recipient of reserved code points is
    obliged to preserve their values -- they may, after all, be
    "legal" assigned code points in a future version of the
    standard that that particular implementation is not supporting.

    So, yeah, basically every sequence of code points "assigned to
    abstract characters" is "legal" for interchange. What you cannot
    interchange are code points with gc=Cs (U+D800..U+DFFF) or
    code points with gc=Cn (noncharacters and reserved).
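
    A small sketch of that check, assuming Python's unicodedata module:
    it reports the code points in a string whose General_Category is Cs
    or Cn. The function name not_for_interchange is hypothetical, and
    the answer for reserved code points depends on which version of the
    Unicode Character Database your Python build ships, which is the
    version-relativity described above.

```python
import unicodedata
from typing import List, Tuple


def not_for_interchange(text: str) -> List[Tuple[str, str]]:
    """List (code point, category) pairs for characters with gc=Cs or gc=Cn.

    Cs covers surrogate code points; Cn covers noncharacters and
    reserved (unassigned) code points, as seen by this Python build's
    Unicode data.
    """
    flagged = []
    for ch in text:
        category = unicodedata.category(ch)
        if category in ("Cs", "Cn"):
            flagged.append((f"U+{ord(ch):04X}", category))
    return flagged


# An ordinary string: nothing to report.
print(not_for_interchange("abc \u05D0\u05B9"))

# A lone surrogate and two noncharacters mixed into otherwise normal text.
print(not_for_interchange("a\uD800b\uFFFFc\uFDD0"))
```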

    What D17a is trying to tell people is that while certain sequences
    of Unicode characters may be "defective" from the point of
    view of certain kinds of processing -- in this case rendering
    of combining character sequences -- that does not make them
    ill-formed (for which see the specification of encoding forms),
    nor does it make them nonconformant to the standard.
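
    For anyone who wants to spot such sequences, here is a hedged
    sketch of a detector based on the D17a wording quoted above. It
    makes two simplifying assumptions that are mine, not the
    standard's: a combining character is anything with General_Category
    M*, and a base character is anything in L*, N*, P*, S* or Zs. The
    function name defective_sequence_starts is invented for the
    example.

```python
import unicodedata
from typing import List


def defective_sequence_starts(text: str) -> List[int]:
    """Return the index of the first mark of each defective sequence.

    Assumptions (approximations for illustration):
      - combining character: General_Category starts with "M"
      - base character: General_Category in L*, N*, P*, S*, or Zs
    A run of marks is reported when it sits at the start of the string
    or follows anything that is not a base (control, format, etc.).
    """
    starts = []
    attached = False  # True while a following mark would continue a sequence
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category.startswith("M"):
            if not attached:
                starts.append(index)
            attached = True  # later marks belong to the same sequence
        else:
            attached = category[0] in "LNPS" or category == "Zs"
    return starts


print(defective_sequence_starts("\u05D0\u05B9"))        # alef + holam
print(defective_sequence_starts("\u05B9\u05D0"))        # line-initial holam + alef
print(defective_sequence_starts("\uFEFF\u05B9\u05D0"))  # holam after a format character
```

    The detector only reports where such a sequence begins; it does not
    reject the string, in keeping with the point that these sequences
    are defective for rendering purposes but still conformant.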

    There are many sequences of Unicode characters that we could
    dream up which would be abominable, distasteful, problematical,
    defective, implementation-busting, or just plain screwy,
    but the standard itself isn't prohibiting people from
    conformantly creating such sequences and then challenging
    Microsoft or anybody else to display them without
    blowing a gasket.

    One of the reasons why we have to be so incredibly careful now
    before introducing conceptually new *types* of characters,
    like the COMBINING GRAPHEME JOINER or such things as
    is precisely that it gets harder and harder to program
    defensively against all the possible combinations and interactions
    that such beasties might have when mixed with everything else
    that is available.

