Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Jun 05 2003 - 10:28:18 EDT

    The UCD has a property explicitly called "Alphabetic". That is the
    property to use when determining whether a character is, well,
    alphabetic. See http://www.unicode.org/Public/UNIDATA/UCD.html
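
    As a rough illustration of the properties involved for U+30FC, here is
    a minimal Python sketch. It rests on one assumption worth stating
    loudly: Python's standard library does not expose the UCD "Alphabetic"
    property itself. Its str.isalpha() is defined in terms of the General
    Category (Lm, Lt, Lu, Ll, Lo), so it only approximates the property.
    The helper name describe() is made up for this example.

```python
import unicodedata

# U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, the character under discussion.
PROLONGED_SOUND_MARK = "\u30fc"


def describe(ch):
    """Return a few UCD-derived facts about a single character.

    Note: "isalpha" here is Python's General Category based test,
    NOT the UCD Alphabetic property (an assumption of this sketch).
    """
    return {
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
    }


info = describe(PROLONGED_SOUND_MARK)
print(info)
```

    A library that exposes the Alphabetic property directly would be the
    more faithful check; the General Category test is only a stand-in.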

    However, in the past many people have misused functions like isAlpha()
    for doing more complicated processing like determining text boundaries
    (line and word breaks, for example). The function isAlpha() does not
    discriminate finely enough to be very accurate for processing like
    that. For more information, see
    http://www.unicode.org/reports/tr14/
    http://www.unicode.org/reports/tr29/
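
    To make the point concrete, here is a small Python sketch of the kind
    of misuse described above: a word splitter that simply collects runs
    of characters passing an isalpha-style test. The function name
    naive_words() is hypothetical, and the sketch is deliberately the
    wrong approach; it is not an implementation of TR14 or TR29.

```python
from itertools import groupby


def naive_words(text):
    """Split text into maximal runs of characters where str.isalpha() is true.

    This is the over-simple approach: it knows nothing about apostrophes,
    combining marks, or scripts written without spaces.
    """
    return ["".join(run) for is_alpha, run in groupby(text, key=str.isalpha) if is_alpha]


# The apostrophe is punctuation, so one word is cut in two.
contraction = naive_words("can't stop")

# U+0301 COMBINING ACUTE ACCENT is a nonspacing mark, not a letter,
# so the decomposed form of "cafés" is split at the accent.
decomposed = naive_words("cafe\u0301s")

print(contraction)
print(decomposed)
```

    Both results break inside what a reader would call a single word,
    which is why boundary detection needs finer-grained properties than
    an alphabetic test.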

    Also see the proposed update to Unicode Regular Expressions, for
    discussion of the use of Unicode properties in connection with alpha,
    punct, etc. (in the context of regular expressions, at least).
    http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, June 04, 2003 16:11
    Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND
    MARK

    > All,
    > I am investigating differing behavior in various environments of the
    > wide-character version of the C function isAlpha with respect to
    > character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
    > implementations indicate that it is alphabetic, some don't. I
    > suspect that other characters might be subject to the same
    > confusion.
    >
    > The Unicode documents seem ambiguous on this point: the General
    > Category is "Lm" which, although informative instead of normative,
    > would seem to imply that it is alphabetic; likewise
    > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
    > PropList-4.0.0.txt contains two records - one indicating that it is
    > a diacritic, one that indicates it is an extender.
    >
    > On to my questions:
    >
    > Q1: Can a character be both alphabetic and diacritic?
    >
    > Q2: Is there a definitive answer as to whether this is an alphabetic
    > character?
    >
    > Thanks in advance for answers to these questions and/or any
    > additional insight you can provide.
    >
    > Regards,
    > Rob Mount


