Mon Jul 27 2009

    Karl Williamson noted:

    > Apparently that is what Asmus and others think as well,

    Add me to that list.

    > and it certainly
    > is the data that comes in DerivedAge.txt,

    And in the XML data derived from it, as well -- which is Eric's
    point, I think.

    > and if that were truly the
    > case, I wouldn't have any problem with the term "Age".

    Well, then you're all set! ;-)

    > But let me quote
    > from the header of that file:
    > # Caution: When using the Age *property*, all assigned code points
    > # in each version are included, not just the newly assigned code points.
    > # For more information, see
    > And, if you look at tr18, it says:
    > "
    > Caution: The DerivedAge data file in the UCD provides the deltas between
    > versions, for compactness. However, when using the property all
    > characters included in that version are included. Thus \p{age=3.0}
    > includes the letter a, which was included in Unicode 1.0. To get
    > characters that are new in a particular version, subtract off the
    > previous version as described in 1.3 Subtraction and Intersection. For
    > example: [\p{age=3.1} -- \p{age=3.0}]
    > "
    > So either you guys are wrong, or the documentation is wrong in at least
    > two places.

    The documentation is wrong in two places -- or at least
    misleading. Note that it doesn't actually say the property
    is *defined* thus and such, but rather that "when using the
    property all characters included in that version are included."
    That amounts to a pocket definition of a new derived property
    (or actually set of properties) based on the use of the Age property
    per se.

    This is one of those cases where a property that is not documented
    carefully enough is trying to have it both ways.

    Age is an enumerated property in the UCD. Among other things, that
    means that its values constitute a codespace partition. Each
    code point has one and only one value of the property. Both
    the values in DerivedAge.txt and in the XML data files reflect
    that interpretation.
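
    The partition claim can be checked mechanically. What follows is a
    minimal sketch, not part of the UCD tooling: the function names are
    made up for illustration, and SAMPLE is a small hand-written excerpt
    in the DerivedAge.txt line format rather than the real file, so its
    values should be checked against the actual data before being
    relied on. Code points not listed at all are simply unassigned.

```python
import re

# Hand-written sample in the DerivedAge.txt line format (an assumption for
# illustration; this is NOT a copy of the real file).
SAMPLE = """\
# code point or range ; Age
0020..007E    ; 1.1
0F00..0F47    ; 2.0
20AC          ; 2.1
1780..17DC    ; 3.0
0220          ; 3.2
0221          ; 4.0
"""

ENTRY = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\S+)")


def parse_age_ranges(text):
    """Return a list of (start, end, age) tuples from DerivedAge-style text."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        m = ENTRY.match(line)
        if m is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        ranges.append((start, end, m.group(3)))
    return ranges


def age_values_per_code_point(ranges):
    """Map each listed code point to the set of Age values assigned to it."""
    seen = {}
    for start, end, age in ranges:
        for cp in range(start, end + 1):
            seen.setdefault(cp, set()).add(age)
    return seen


def is_partition(ranges):
    """True if no listed code point carries more than one Age value."""
    return all(len(v) == 1 for v in age_values_per_code_point(ranges).values())


ranges = parse_age_ranges(SAMPLE)
print(is_partition(ranges))
```

    Run against the full DerivedAge.txt, the same check is what the
    enumerated-property reading predicts will succeed.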

    The property defined that way is not, however, as useful as the
    property as it is used for regex matches in UTS #18. For regex
    matching, it is far more useful to know whether a character is
    included in Unicode Version X (that is, was encoded in Version X
    or any *earlier* version) than to know whether it was first
    encoded exactly in Version X. So the usage of the Age property in
    UTS #18 just blithely assumes that interpretation, and the caution
    at the top of DerivedAge.txt reflects that interpretation, even
    though it is in direct contradiction with the data itself.

    Note that there are no character properties in the UCD actually
    defined the way the Caution at the top of DerivedAge.txt currently
    implies Age is interpreted. If you think this through, for
    example, interpreted that way, U+0041 would have multiple
    Age property values: 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0,
    4.1, 5.0, 5.1, and soon, 5.2, because it would match a
    \p{age=n.n} expression for any of those values. Every character
    would continue to accumulate new Age values as future versions
    of the standard are published.
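
    To make the difference concrete, here is a small sketch of the two
    matching rules side by side. The function names are hypothetical,
    and the version list is just the one given above; nothing here is
    taken from an actual regex engine.

```python
# Versions listed in the message above, in release order.
VERSIONS = ["1.1", "2.0", "2.1", "3.0", "3.1", "3.2", "4.0", "4.1", "5.0", "5.1"]


def version_key(version):
    """Turn '3.1' into (3, 1) so versions compare numerically."""
    major, minor = version.split(".")
    return (int(major), int(minor))


def matches_exact(char_age, query):
    """Enumerated-property reading: match only the version of first encoding."""
    return char_age == query


def matches_cumulative(char_age, query):
    """UTS #18 reading: match if the character existed as of the queried version."""
    return version_key(char_age) <= version_key(query)


age_of_u0041 = "1.1"  # U+0041 has been encoded since the earliest listed version

exact = [v for v in VERSIONS if matches_exact(age_of_u0041, v)]
cumulative = [v for v in VERSIONS if matches_cumulative(age_of_u0041, v)]

print(exact)       # one matching version
print(cumulative)  # every listed version
```

    Under the exact reading U+0041 matches a single \p{age=n.n}
    expression; under the cumulative reading it matches all of them,
    which is why that reading cannot be a partition.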

    > I have to assume that the documentation is right until
    > shown otherwise; and if it is correct, I think that proves my point. If
    > experienced people who work with Unicode all the time don't understand
    > what this property is, then something is wrong, and at a minimum a new
    > alias is needed to clarify things.

    There is definitely a need for clarification here.

    > I also don't think that in these days of abundant cheap storage that the
    > Consortium should be worrying about compactness.

    Compactness is not the primary concern driving maintenance of
    UCD properties (and files), by the way.

    > I believe every
    > property that is exposed in the UCD should have a fully derived version
    > available, probably in the extracted directory. In 5.2 Beta, the only
    > properties and property values that the user has to derive (except for
    > defaults) are Age, gc=LC, gc=C, gc=L gc=M, gc=N, gc=P, gc=S, and gc=Z.

    However, none of those are actually property values per se. They
    are certainly not *extracted* values.

    Each of those is a different kind of derived property value.

    So gc=L (which I assume you meant, rather than "gc=LC") is actually
    not a value of General_Category proper at all, but rather the
    union of the set of characters with five different values:

       (gc=Lu) | (gc=Ll) | (gc=Lt) | (gc=Lo) | (gc=Lm)

    While it is certainly easy to derive such sets from the data, it
    is also perfectly reasonable to ask for pre-derived listings of
    such derived values in the UCD. It would be up to the UTC to
    decide whether the extra work to maintain additional derived
    values for each release is worth the benefit in such cases. Note
    that ICU provides a generic Unicode set notation that makes it
    trivial to construct such sets.
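
    As an illustration of how small that derivation is, here is a
    sketch using Python's standard unicodedata module rather than ICU.
    The helper names are made up for this example, and the results
    reflect whichever Unicode version ships with the Python interpreter
    in use (see unicodedata.unidata_version), not any particular UCD
    release.

```python
import unicodedata

# The five General_Category values whose union is written gc=L.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lo", "Lm"})


def is_gc_L(ch):
    """True if the character's General_Category is one of the five letter values."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def gc_L_in_range(start, end):
    """List the code points in [start, end] that belong to the gc=L union."""
    return [cp for cp in range(start, end + 1) if is_gc_L(chr(cp))]


# U+0040 '@' and U+005B '[' are punctuation; everything between is a letter.
letters = gc_L_in_range(0x40, 0x5B)
print(f"{letters[0]:04X}..{letters[-1]:04X}")
```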

    Also, regarding "Age", what you are asking for in this case would
    be not *one* derived property, but rather a distinct derived
    binary property for *each* Unicode version. I.e.:

    Included_In_Version_1_1 --> (Age=1.1)

    Included_In_Version_2_0 --> (Age=1.1) | (Age=2.0)

    Included_In_Version_2_1 --> (Age=1.1) | (Age=2.0) | (Age=2.1)

    Included_In_Version_3_0 --> (Age=1.1) | (Age=2.0) | (Age=2.1) | (Age=3.0)

    etc., etc., for each succeeding version.
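
    The pattern above is a running union over the Age values in version
    order. A minimal sketch follows, assuming the per-Age sets have
    already been read from the data; AGE_SETS holds one hand-picked
    code point per version purely for illustration, and the function
    name is hypothetical.

```python
from itertools import accumulate

# One illustrative code point per Age value (an assumption for the example,
# not the full data): U+0041, U+0F00, U+20AC, U+1780.
AGE_SETS = {
    "1.1": {0x0041},
    "2.0": {0x0F00},
    "2.1": {0x20AC},
    "3.0": {0x1780},
}


def version_key(version):
    """Turn '2.1' into (2, 1) so versions sort numerically."""
    major, minor = version.split(".")
    return (int(major), int(minor))


def included_in_version(age_sets):
    """Map each version to the union of all Age sets up to and including it."""
    ordered = sorted(age_sets, key=version_key)
    running = accumulate(
        (frozenset(age_sets[v]) for v in ordered),
        lambda so_far, new: so_far | new,
    )
    return dict(zip(ordered, running))


included = included_in_version(AGE_SETS)
for version, members in included.items():
    print(version, sorted(f"{cp:04X}" for cp in members))
```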

    IMO, it isn't actually worth the effort to define and maintain
    such a list of derived property values (or equivalently, just
    the sets of characters, without actually *naming* the properties
    they would constitute), when the derivations are so trivial based
    on the existing DerivedAge.txt file. This is especially true for
    that particular file, because its entries are listed in order of
    increasing Age: all you have to do is delete all the entries below
    those for the Age of concern, and the entries that remain define
    your set in question. No programming necessary. :-)
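
    For anyone who would rather script the deletion than do it in an
    editor, a rough equivalent is sketched below. It filters on the Age
    field of each entry instead of relying on the order of the file.
    The function name and SAMPLE text are assumptions made for the
    example; SAMPLE is a hand-written excerpt in the file's format, not
    the real file.

```python
# Hand-written sample in the DerivedAge.txt line format (illustration only).
SAMPLE = """\
# code point or range ; Age
0020..007E    ; 1.1
0F00..0F47    ; 2.0
20AC          ; 2.1
1780..17DC    ; 3.0
0220          ; 3.2
"""


def version_key(version):
    """Turn '2.1' into (2, 1) so versions compare numerically."""
    major, minor = version.split(".")
    return (int(major), int(minor))


def keep_entries_up_to(text, target):
    """Return the data lines whose Age is the target version or earlier."""
    kept = []
    for line in text.splitlines():
        data = line.split("#", 1)[0]  # ignore trailing comments
        if ";" not in data:
            continue  # blank or comment-only line
        age = data.split(";", 1)[1].strip()
        if version_key(age) <= version_key(target):
            kept.append(line)
    return kept


for entry in keep_entries_up_to(SAMPLE, "2.1"):
    print(entry)
```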

    > There should be files in the extracted directory that show the derived
    > values for all of them. There are bound to be mistakes made when
    > programmers re-derive them; and there is duplicated work as well. This
    > Age property is a case in point. I wonder how many implementations
    > there are out there that have it wrong.

    Not too many, I would wager -- since most of them would be using
    one or the other of the two interpretations, and would have picked
    the one they wanted to accomplish what they were after. It is
    rather unlikely that there are many applications out there that
    assume the interpretation "all characters included in Version 3.0"
    but then blindly use only the Age=3.0 values from DerivedAge.txt,
    ignoring all the characters with Age=1.1, 2.0, or 2.1, for example.


