alpha, print, graph, blank, etc.

From: Mark Davis
Date: Mon Apr 21 2003 - 21:30:58 EDT


    The POSIX/C-style property names (punct, alpha, lower, upper, digit, xdigit,
    alnum, cntrl, graph, print, space, blank) are not well specified, and don't
    really map well to the broader types of characters available in
    Unicode/10646. For example, there is no provision for titlecase, nor for a
    distinction between symbols and punctuation. These categories aren't really
    set up to make distinctions among combining marks, nor many of the other
    Unicode Properties.
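
    As a rough illustration of the mismatch (a sketch only, not part of any
    of the implementations compared here), the following Python snippet
    prints the Unicode General_Category of a few characters next to their
    membership in the ASCII "punct" class. It assumes that Python's
    string.punctuation is an acceptable stand-in for POSIX punct in the C
    locale; the helper name general_category is made up for this example.

```python
import string
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter Unicode General_Category of one character."""
    return unicodedata.category(ch)


# Assumption: string.punctuation approximates the POSIX "punct" class
# as defined for the C locale (ASCII only).
posix_punct = set(string.punctuation)

# A punctuation mark, a math symbol, a currency symbol,
# a titlecase letter, and a combining mark.
samples = ["!", "+", "$", "\u01C5", "\u0301"]

for ch in samples:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"gc={general_category(ch)}, posix_punct={ch in posix_punct}"
    )
```

    The first three characters all fall under punct, yet Unicode separates
    them into Po, Sm, and Sc. The titlecase letter (Lt) and the combining
    mark (Mn) have no POSIX class of their own.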

    However, many programs use the POSIX-style properties, so for compatibility
    it is best to come up with a uniform set of recommendations for how they
    should be interpreted in a Unicode context. This also relates to Java, since
    many of the methods on Character ultimately derive from trying to match some
    of the POSIX categories.

    The following compares current Perl, ICU, Java, Windows, and the POSIX spec,
    and tries to derive a recommendation for the best definition, given the way
    people use the properties in practice. Note that these are only current
    snapshots, since those environments may change their definitions, especially
    as they upgrade beyond Unicode 3.x.

    Feedback is welcome.

