Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Aug 07 2003 - 20:21:25 EDT


    An anonymous wag who picks the nits even finer than I did
    wishes the following clarification to be posted regarding
    an assertion I made about what Unicode code points are
    interchangeable. ;-)

    ------------- Begin Forwarded Message -------------

    > So, yeah, basically every sequence of code points "assigned to
    > abstract characters" is "legal" for interchange. What you cannot
    > interchange are code points with gc=Cs (U+D800..U+DFFF) or
    > code points with gc=Cn (noncharacters and reserved).

    You *can* interchange reserved characters. You *should* not originate
    them, but if you are passed a string with them, you should preserve
    them, and pass them on. And in most circumstances you can depend on
    them being preserved. For noncharacters you can interchange, but
    should not depend on them being preserved.

    You *can* also interchange Cs characters; just not within conformant
    UTF encoding scheme/forms. But it is perfectly legal for me to have a
    record with a field containing an *arbitrary Unicode code point*,
    serialize that record, and send it off.

    ---------------End Forwarded Message ------------------

    I concur with the general intent of this clarification, but
    this is definitely in the gray area as regards exactly what
    the conformance claims for the standard mean.

    It is certainly good practice and the most robust approach
    to an implementation for it to behave the way suggested here,
    but note also the following letter of the law from 10646,
    to which the Unicode Standard itself claims conformance:

    <quote>
    2.2 Conformance of information interchange
    A code-character-data-element (CC-data-element) within coded
    information for interchange is in conformance with ISO/IEC
    10646 if

    a) all the coded representations of graphic characters
       within that CC-data-element conform to clauses 6 and
       7, ...
    b) all the graphic characters represented within that
       CC-data-element are taken from those within an identified
       subset (clause 12)
       
    ...

    7. General requirements for the UCS
    ...
    b. Code positions to which a character is not allocated,
       except for the positions reserved for private use characters
       or for transformation formats, are reserved for future
       standardization and shall not be used for any other
       purpose. ...
    </quote>

    2.2.a and 7.b imply that it is not conformant to interchange
    reserved code points, and 2.2.b implies that what you can
    interchange are only the assigned characters from a subset
    (in the Unicode case, of course, the subset of the whole).

    So the way I would summarize this is:

    I. Reserved code points

    A conformant implementation should not originate them, but
    because conformant implementations may be designed to work
    with multiple versions of the standard and may encounter
    uplevel data, good implementation practice is to follow the
    Unicode recommendations about not munging uninterpreted
    code points and about passing them along unharmed.

    II. Noncharacters

    These cannot be used in open interchange, although they can,
    of course, be used in "internal" interchange, which is
    essentially a private agreement (perhaps with oneself) regarding
    what noncharacter usage those code points have. No external
    recipient can interpret them, nor is an external recipient
    obliged to preserve them if received.
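
    Reserved code points and noncharacters both report gc=Cn, so an
    implementation that wants to treat them differently has to tell
    them apart itself. A minimal Python sketch of that classification
    follows. The names classify and is_noncharacter are hypothetical,
    and which code points count as "reserved" depends on the version
    of the Unicode data shipped with the interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the
    last two code points (xxFFFE, xxFFFF) of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Sort a code point into the categories discussed in this thread."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"      # gc=Cs
    if is_noncharacter(cp):
        return "noncharacter"   # gc=Cn, permanently unassigned
    if unicodedata.category(chr(cp)) == "Cn":
        return "reserved"       # gc=Cn, may be assigned in a later version
    return "assigned"           # everything else, private use included


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFE, 0x40000):
        print(f"U+{cp:04X}: {classify(cp)}")
```

    Only the "reserved" answer can change between Unicode versions,
    which is why passing such code points along unharmed is the
    robust choice for uplevel data.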

    III. Surrogate code points

    I would claim, contra the above, that these *cannot* be
    interchanged in conformance with the standard -- at all.
    If one is attempting to interchange arbitrary Unicode code
    points, including Cs code points (U-0000D800..U-0000DFFF),
    this cannot be done with a well-formed encoding form, and
    thus cannot be done in conformance with the standard.
    If one claims to be *interchanging* such code points in
    the context of a Unicode string (which does not, of course,
    have to be well-formed to constitute a "Unicode string" by
    the definition in the standard), then such interchange
    is effectively a protocol built on top of the standard,
    rather than something in conformance with the standard
    itself.
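
    The point about well-formed encoding forms can be observed in
    Python, whose strict UTF-8 codec refuses a lone surrogate. The
    sketch below is illustrative only: roundtrips_as_utf8 is a
    hypothetical helper, and the "surrogatepass" error handler stands
    in for the kind of private protocol layered on top of the standard.

```python
def roundtrips_as_utf8(s: str) -> bool:
    """True if s survives a strict UTF-8 encode/decode round trip."""
    try:
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeError:
        return False


lone_surrogate = "\ud800"

# Strict UTF-8 rejects the surrogate code point outright.
print(roundtrips_as_utf8(lone_surrogate))   # False

# Opting out of strictness yields bytes that are not well-formed UTF-8.
raw = lone_surrogate.encode("utf-8", errors="surrogatepass")
print(raw)                                  # b'\xed\xa0\x80'

# A strict receiver cannot decode those bytes ...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... but a receiver that shares the same private convention can.
recovered = raw.decode("utf-8", errors="surrogatepass")
```

    Both ends have to agree on the non-strict handler for the value
    to survive, which is exactly what makes it a private agreement
    rather than conformant interchange.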

    At any rate, that is how *I* would pick the nits.

    --Ken


