Re: Wanted: synonyms for Age

From: Kenneth Whistler
Date: Thu Aug 06 2009 - 17:09:56 CDT

    Karl Williamson wrote:

    > ... I thought I should add some things I've been thinking about to
    > make sure I understand. Feel free to correct me.
    > Each Unicode property is defined on a subset of the Unicode code points.
    > Many are defined on the complete set, but some are not, such as Name,
    > as for example, surrogates and private use code points have no name.

    Actually Name *is* defined on the complete set. The values for
    the Name property are strings, and for reserved code points
    (and some other code point types), the value of the Name property
    is the null string.
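
    To make the point concrete, here is a minimal sketch in Python. It
    assumes the standard unicodedata module, which raises ValueError for
    unnamed code points unless a default is passed; the helper
    name_or_null is hypothetical and simply supplies "" as that default,
    so that Name behaves as a total function over code points.

```python
import unicodedata


def name_or_null(cp: int) -> str:
    """Return the character name for a code point, or "" when it has none."""
    # unicodedata.name() raises ValueError for unnamed code points unless a
    # default is supplied; "" stands in here for the null-string Name value.
    return unicodedata.name(chr(cp), "")


# A letter, a surrogate, a private use code point, and a noncharacter.
for cp in (0x0041, 0xD800, 0xE000, 0xFFFF):
    print(f"U+{cp:04X} -> {name_or_null(cp)!r}")
```

    Seen this way, "has no name" and "has the null string as its Name
    value" describe the same code points; only the second treats the
    property as defined everywhere.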

    Since this has been confusing to a lot of people, the Unicode 5.2
    text about Unicode character names has been substantially updated
    to clarify this. See Section 4.8 Name--Normative in the Chapter 4
    pdf posted for review. (Accessible from the Unicode 5.2 beta

    > It's unclear to me if in releases before the Unknown property value was
    > added to the Script property, what the definition was, if any, of code
    > points that didn't have any other of the Script property values (and
    > similarly for a number of other catalog properties).

    The issue of default values is explained now in more detail
    in Section 4.2.8 Default Values in UAX #44. See the Unicode 5.2
    proposed update:

    As far as the default value of the Script property is concerned,
    before Script=Unknown was introduced, the Scripts.txt file itself
    defined Script=Common as the default value. See, for example:

    # Scripts-4.0.0.txt
    # Date: 2003-03-20, 20:07:48 GMT [MD]
    # For documentation, see UCD.html
    # Note: Unassigned and Noncharacter codepoints may be omitted
    # if they have default property values.
    # ================================================

    # ================================================
    # Script
    # All code points not explicitly listed in this file have the property
    # value: COMMON.
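
    A rough sketch of how a parser might honor that header comment,
    in Python. The SAMPLE lines are illustrative entries written in the
    Scripts.txt layout, not a copy of the data file, and parse_ranges
    and script_of are hypothetical helper names. The only point is that
    a lookup falls back to the stated default for any code point the
    file does not list.

```python
import bisect
import re

# Illustrative entries in the Scripts.txt layout (not the full data file).
SAMPLE = """\
0041..005A    ; LATIN # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; LATIN # L&       FEMININE ORDINAL INDICATOR
0391..03A1    ; GREEK # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text):
    """Parse 'start[..end] ; VALUE' lines into a sorted list of ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = LINE_RE.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2) or match.group(1), 16)
            ranges.append((start, end, match.group(3)))
    return sorted(ranges)


def script_of(cp, ranges, default="COMMON"):
    """Look up a code point; unlisted code points get the default value."""
    starts = [start for start, _, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default


RANGES = parse_ranges(SAMPLE)
print(script_of(0x0041, RANGES))  # listed explicitly
print(script_of(0x0378, RANGES))  # not listed, so the default applies
```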

    > A property is a mapping from single code points to values. (Named
    > sequences and Standardized Variants, and I don't know about the Unihan
    > ones are anomalous.)

    Named Sequences and Standardized Variants are not character properties.
    Not every bit of data contained in the Unicode Character Database
    consists of character properties per se, even if ultimately
    it is all about characters in some sense or another.

    > Each code point that the property is defined for
    > has a single value.
    > This means that properties are true functions in the strict mathematical
    > sense, because the mapping for each code point is to a unique value.
    > However, when using a property as part of a regular expression pattern,
    > what is desired is essentially the reverse mapping, or 'inverse
    > relation' in mathematical terminology. For example (using Perl syntax)
    > 'A' =~ /\p{age=3.2}/, has us start with the property value 3.2, not the
    > letter 'A', and then see if age('A') is 3.2. This inverse mapping is
    > not necessarily a function; just a relation. For example, the property
    > value '3.2' can map to many code points, not just 'A'.

    The more usual way this is conceived is that a regular expression
    matches a set of code points that meet some criteria. And among
    those criteria may be specifications of particular property values
    according to their UCD definitions.
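
    The two views can be put side by side in a short Python sketch.
    Since the standard unicodedata module does not expose Age, this
    uses General_Category as a stand-in property, and limits itself to
    a small range of code points; inverse_relation and matches are
    hypothetical names. The property is a function from code point to
    value, and the "set of code points that meet some criteria" is just
    the preimage of a value under that function.

```python
import unicodedata
from collections import defaultdict


def general_category(cp: int) -> str:
    """The property itself: one value per code point."""
    return unicodedata.category(chr(cp))


def inverse_relation(prop, code_points):
    """Group code points by property value: value -> set of code points."""
    inverse = defaultdict(set)
    for cp in code_points:
        inverse[prop(cp)].add(cp)
    return inverse


# Restricted to U+0000..U+024F to keep the sketch small.
INVERSE = inverse_relation(general_category, range(0x0000, 0x0250))


def matches(ch: str, value: str) -> bool:
    """Set-membership view of a pattern such as \\p{gc=Lu}."""
    return ord(ch) in INVERSE.get(value, set())


print(matches("A", "Lu"), matches("a", "Lu"))
print(len(INVERSE["Lu"]))  # one value maps back to many code points
```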

    > This distinction between the property mapping and the inverse mapping
    > was lost on me until this issue came up.
    > TR18 appears to be requiring that regular expressions not use the true
    > inverse relation of Age, but a different one, one which makes more sense
    > for real-world applications. If one were to accurately name that
    > inverse relation, it wouldn't be 'Age', but something more like
    > 'Designated_As_Of'

    The UTC will be discussing that next week, and I think it likely
    that more clarification of documentation will result. I doubt
    it will result in changes of the names in use, but I do think
    it would help to clarify the difference between the Age property
    per se in the UCD and the use of the Age property as spelled
    out in UTS #18.
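
    As a sketch of the difference under discussion, in Python: the
    table SAMPLE_AGE holds a few Age values transcribed by hand for
    illustration and should be checked against DerivedAge.txt, and the
    function names age_is and assigned_as_of are hypothetical. The
    first tests the Age value itself; the second tests "assigned in
    this version or earlier", which is the reading Karl describes for
    regular expressions.

```python
# Hand-picked sample of Age values (illustrative; verify against DerivedAge.txt).
SAMPLE_AGE = {
    0x0041: "1.1",  # LATIN CAPITAL LETTER A
    0x20AC: "2.1",  # EURO SIGN
    0x0220: "3.2",  # LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
    0x0221: "4.0",  # LATIN SMALL LETTER D WITH CURL
}


def version_key(version: str):
    """Turn '3.2' into (3, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def age_is(cp: int, version: str) -> bool:
    """Strict reading: the Age value of cp equals the given version."""
    return SAMPLE_AGE.get(cp) == version


def assigned_as_of(cp: int, version: str) -> bool:
    """Cumulative reading: cp was assigned in the given version or earlier."""
    age = SAMPLE_AGE.get(cp)
    return age is not None and version_key(age) <= version_key(version)


for cp in sorted(SAMPLE_AGE):
    print(f"U+{cp:04X}", age_is(cp, "3.2"), assigned_as_of(cp, "3.2"))
```

    Under the strict reading 'A' does not match 3.2, because its Age
    value is 1.1; under the cumulative reading it does.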

    > karl williamson wrote:

    Other questions:

    > > Is there some way I could find out what other things might be like this
    > > that I've overlooked in trying to learn Unicode?

    Well, there's no one-stop shopping place for everything that
    one might have overlooked when trying to learn about Unicode.
    The editors just keep updating the documentation and the
    website as people ask questions and it becomes clear that
    one thing or another is missing or confusing in the documentation.

    Re listings of all derived properties:

    > >>> I believe every property that is exposed in the UCD should have a
    > >>> fully derived version available,

    > > And yes, they are trivial to derive, and at this stage probably not
    > > likely to change, but it's still work that has to be done, and it seems
    > > to me to be better done once, centrally, than many times. So I will
    > > submit a proposal for that. My request came about because I'm
    > > maintaining some code that hadn't kept up with the changing definitions
    > > of Case_Ignorable over the years; 5.2 Beta has that derived for us, and
    > > it occurred to me that why should I have to derive anything.

    The problem I see is that there are lots of potential derivations
    of properties, not all of which can be known ahead of time.
    Yes, there are some values, such as the General_Category superclasses
    (gc=L, gc=M, etc.) that could be derived, but I don't see much
    value in actually making such lists.
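
    For a sense of how little is involved, here is a minimal Python
    sketch that derives a General_Category superclass from the
    two-letter values. It assumes the standard unicodedata module, and
    gc_superclass and derive_superclass are hypothetical names; the
    superclass is taken to be the first letter of the gc value.

```python
import unicodedata


def gc_superclass(cp: int) -> str:
    """Return the one-letter superclass (L, M, N, ...) of a code point."""
    return unicodedata.category(chr(cp))[0]


def derive_superclass(letter: str, code_points):
    """List the code points in a range whose gc falls under the superclass."""
    return [cp for cp in code_points if gc_superclass(cp) == letter]


ascii_letters = derive_superclass("L", range(0x80))
print(len(ascii_letters))
print(gc_superclass(0x0301))  # COMBINING ACUTE ACCENT
```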

    But ultimately this is up to the UTC to decide what it wants
    to maintain explicitly in the UCD.

