Re: Default properties for PUA characters???

From: Kenneth Whistler (
Date: Mon Dec 02 2002 - 21:21:52 EST


    Christian Wittern asked:

    > Leaving aside the red light that flashed in my head on the notion of
    > the W3C recommending PUA (for interchange?), I was wondering about the
    > notion of PUA characters being by "Unicode defaults" treated as
    > ideographs. Is there a canonical reference for this?
    > Just wondering,

    Many Unicode "character" properties are actually code point
    properties. They must partition the entire Unicode codespace,
    so that an API can return a meaningful value for any code
    point, including PUA and unassigned code points, not just
    for assigned characters.
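
As a rough illustration of that partition, Python's standard `unicodedata` module returns a General_Category for any code point, not only for assigned characters. The sample code points below are picked purely for illustration, and U+0378 is assumed to still be unassigned in the Unicode version bundled with the Python in use.

```python
import unicodedata

# Sample code points, chosen for illustration only.
samples = {
    "assigned letter": 0x0041,  # LATIN CAPITAL LETTER A
    "private use": 0xE000,      # first private use code point in the BMP
    "unassigned": 0x0378,       # assumed unassigned in the bundled UCD
}


def general_category(cp: int) -> str:
    """Return the General_Category of any code point, assigned or not."""
    return unicodedata.category(chr(cp))


for label, cp in samples.items():
    print(f"U+{cp:04X} ({label}): gc={general_category(cp)}")
```

The point of the sketch is only that the lookup never fails: every code point in the range 0..10FFFF gets some value.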

    Because of this, the Unicode Standard now has a concept of
    a default property value, which applies to code points that
    are not otherwise given an explicit value for that property.
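
One way to picture a default property value is a lookup that consults the explicit assignments first and falls back to the default otherwise. This is a minimal sketch, not how any particular library does it: the table `EXPLICIT_LB`, the constant `DEFAULT_LB`, and the function `line_break` are hypothetical names, and the table is a tiny hand-written stand-in for the real data file.

```python
# Hypothetical, heavily abridged table of explicit line-break values,
# as (first, last, value) ranges. A real implementation would load the
# full data from the Unicode Character Database instead.
EXPLICIT_LB = [
    (0x0041, 0x005A, "AL"),  # Latin capital letters A-Z
    (0x4E00, 0x9FA5, "ID"),  # a run of CJK unified ideographs
]

# Default value for code points with no explicit entry in the table.
DEFAULT_LB = "XX"


def line_break(cp: int) -> str:
    """Return a line-break value for any code point in the codespace."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    for first, last, value in EXPLICIT_LB:
        if first <= cp <= last:
            return value
    # No explicit assignment found, so the default applies.
    return DEFAULT_LB
```

With this toy table, `line_break(0x0041)` yields `AL`, while any code point the table does not mention yields the default `XX`.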

    In the case of PUA characters, the Unicode Character Database
    gives them all the same properties. Some of the most important of
    those properties are:

    gc=Co (general category = Private_Use)
    ccc=0 (combining class = 0, i.e. Not_Reordered)
    bc=L (bidi class = strong Left_To_Right)
    sc=Zyyy (script = Common)
    lb=XX (line break = Unknown)
    ea=A (east asian width = Ambiguous)

    For ideographs, which also all have the same properties, the
    relevant, corresponding properties are:

    gc=Lo (general category = Other_Letter)
    ccc=0 (combining class = 0, i.e. Not_Reordered)
    bc=L (bidi class = strong Left_To_Right)
    sc=Hani (script = Han)
    lb=ID (line break = Ideographic)
    ea=W (east asian width = Wide)

    Thus, while in some respects the PUA characters are, by default,
    like ideographs (they are all base characters and are treated
    as left-to-right for bidi purposes), in other respects, their
    properties differ.
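
The comparison can be spot-checked with Python's standard `unicodedata` module, which exposes four of the six properties listed above (it has no script or line-break accessor, so `sc` and `lb` are left out). This is a sketch under the assumption that U+E000 and U+4E00 are fair representatives of private use characters and ideographs; the helper name `core_properties` is made up for the example.

```python
import unicodedata


def core_properties(cp: int) -> dict:
    """Collect the properties that unicodedata exposes for a code point."""
    ch = chr(cp)
    return {
        "gc": unicodedata.category(ch),          # general category
        "ccc": unicodedata.combining(ch),        # canonical combining class
        "bc": unicodedata.bidirectional(ch),     # bidi class
        "ea": unicodedata.east_asian_width(ch),  # east asian width
    }


pua = core_properties(0xE000)  # a private use code point
han = core_properties(0x4E00)  # a CJK unified ideograph

# Split the property names into those that agree and those that differ.
shared = sorted(name for name in pua if pua[name] == han[name])
differing = sorted(name for name in pua if pua[name] != han[name])

print("same for both:", shared)
print("different:    ", differing)
```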

    In particular, with respect to line-breaking, UAX #14 currently
    states for lb=XX:

    "The default behavior for [XX] is identical to class AL.
    [i.e. alphabetic characters] ... In addition, implementations
    can override or tailor this default behavior, e.g. by
    assigning characters the property ID or another class, if that
    is likely to give the correct default behavior for their users,
    or use other means to determine the correct behavior. For example,
    one implementation might treat any private use character in
    ideographic context as ID, while another implementation
    might support a method for assigning specific properties to
    specific definitions of private use characters. The details of
    such use of private use characters are outside the scope of this
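
The kinds of tailoring the quoted text mentions could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `resolve_xx`, the `overrides` mapping, and in particular the reading of "ideographic context" as "an adjacent character has class ID". UAX #14 does not prescribe any of these, and the sketch folds the two example implementations from the quote into one function only for brevity.

```python
def resolve_xx(cp, lb_class, prev_class=None, next_class=None, overrides=None):
    """Resolve line-break class XX to a concrete class.

    cp          -- the code point being classified
    lb_class    -- its line-break class before tailoring
    prev_class  -- class of the preceding character, if any
    next_class  -- class of the following character, if any
    overrides   -- optional {code point: class} map for private
                   agreements (hypothetical mechanism)
    """
    if lb_class != "XX":
        # Only XX is resolved here; other classes pass through.
        return lb_class
    if overrides and cp in overrides:
        # Tailoring: a specific private use definition has its own class.
        return overrides[cp]
    if "ID" in (prev_class, next_class):
        # Tailoring: treat the character as ideographic when it sits
        # next to an ideograph (one possible meaning of "context").
        return "ID"
    # Untailored default: behave like an alphabetic character.
    return "AL"
```

Checking the explicit overrides before the context rule is a design choice in this sketch: a private agreement about a specific character is taken to be more reliable than a guess from its neighbors.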

    So I'd say that the XML Core WG has got the situation only
    partially correct for Unicode PUA characters.

