Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Jun 05 2003 - 15:47:30 EDT


    Ah, I see why you didn't find the Alphabetic property. It was added in
    Unicode 3.1.0 (March 2001), precisely to capture characters that are
    not L yet are still alphabetic. If you look at the derivation in
    C:\DATA\UCD\3.1.0-Update\DerivedCoreProperties-3.1.0.txt, you will
    see:

    # Derived Property: Alphabetic
    # Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

    So Alphabetic includes all L's, but also other characters. And, as I
    said, it alone is not sufficient for word breaks.
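
As a rough illustration of that derivation, here is a minimal Python sketch. It assumes Python's standard unicodedata module, which reflects whatever UCD version the interpreter ships with rather than 3.1.0. The names ALPHABETIC_CATEGORIES, SAMPLE_OTHER_ALPHABETIC, and is_alphabetic are made up for this example, and the Other_Alphabetic set is a one-entry stand-in, not the real list from PropList.txt.

```python
import unicodedata

# General categories named in the derivation line above.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical stand-in for the Other_Alphabetic list in PropList.txt.
# U+0345 COMBINING GREEK YPOGEGRAMMENI is one real member; the full list
# is much longer and would have to be loaded from the UCD for real use.
SAMPLE_OTHER_ALPHABETIC = {0x0345}


def is_alphabetic(ch, other_alphabetic=SAMPLE_OTHER_ALPHABETIC):
    """Approximate the derived Alphabetic property for one character."""
    if unicodedata.category(ch) in ALPHABETIC_CATEGORIES:
        return True
    return ord(ch) in other_alphabetic


# Print code point, General Category, and the approximated property.
for ch in ("\u30FC", "\u2160", "\u0345", "\u094D", "'"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_alphabetic(ch)}")
```

With this sample set, U+30FC qualifies through its Lm category and U+0345 only through the Other_Alphabetic stand-in, while the virama U+094D and the apostrophe qualify through neither.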

    > Is the omission of 30FC from the Alphabetic category of PropList.txt
    > an error?

    This is not an oversight. As I said, many characters are not
    Alphabetic and are still part of words. Examples include that
    character and many others. As a simple case, "can't" is a word in
    English, although the apostrophe is not alphabetic. There are many,
    many examples using combining marks, such as a virama (halant) in
    Hindi, which is not Alphabetic:

    http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=094D

    So if you want reasonable word breaks, you need to use more than the
    L category; you need to look at:
    http://www.unicode.org/reports/tr14/
    http://www.unicode.org/reports/tr29/
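
To make the point concrete, here is a small Python sketch of the naive splitting that an isAlpha-style test produces. It relies on Python's str.isalpha(), which is documented as testing the L general categories only. The function name naive_words is made up for this example, and nothing here implements the TR14 or TR29 rules.

```python
from itertools import groupby


def naive_words(text):
    """Split text into maximal runs of characters for which str.isalpha() is true."""
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


print(naive_words("can't"))
# U+0939 U+093F U+0928 U+094D U+0926 U+0940: the word "Hindi" in Devanagari
print(naive_words("\u0939\u093F\u0928\u094D\u0926\u0940"))
# U+30E9 U+30FC U+30E1 U+30F3: a katakana word containing U+30FC
print(naive_words("\u30E9\u30FC\u30E1\u30F3"))
```

Under these assumptions the apostrophe and the combining marks each end a run, so "can't" and the Devanagari word come back in pieces, while the katakana word survives only because U+30FC happens to be Lm.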

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
    To: "Mark Davis" <mark.davis@jtcsv.com>; <unicode@unicode.org>
    Sent: Thursday, June 05, 2003 11:57
    Subject: RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED
    SOUND MARK

    >
    > Thanks to all who responded. The insight you provided is invaluable.
    > And I appreciate your patience with a UNICODE beginner.
    >
    > Mark's reference to UCD.html, and by inference to
    > DerivedCoreProperties.txt, seems definitive. However, these are part
    > of the 4.0 spec. The suspect implementation of isalpha is based,
    > according to the vendor, on 3.0.1.
    >
    > The vendor relies, instead, on
    > http://www.unicode.org/Public/3.0-Update1/PropList-3.0.1.txt
    > which classifies 30FC as Diacritic, Extender, Bidi: Left-to-Right,
    > and Identifier Part, but not as Alphabetic. Is this an error in the
    > specification? I could find no reference to the Alphabetic property
    > in the 3.0.1 documentation except in, and with reference to,
    > PropList-3.0.1.txt.
    > However, it would seem, from the 4.0 documentation, that all
    > characters having a General Category beginning with "L" should be
    > considered as letters, and hence, implicitly, as Alphabetic.
    >
    > Is this, indeed, the intent of the General Category classifications
    > beginning with "L"?
    >
    > Is the omission of 30FC from the Alphabetic category of PropList.txt
    > an error?
    >
    > Rob
    >
    > -----Original Message-----
    > From: Mark Davis [mailto:mark.davis@jtcsv.com]
    > Sent: Thursday, June 05, 2003 9:28 AM
    > To: Mount, Rob (Robert F); unicode@unicode.org
    > Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED
    > SOUND MARK
    >
    >
    > The UCD has a property explicitly called "Alphabetic", so that
    > should be used when determining whether a character is, well,
    > alphabetic. See http://www.unicode.org/Public/UNIDATA/UCD.html
    >
    > However, in the past many people have misused functions like
    > isAlpha() for doing more complicated processing like determining
    > text boundaries (line and word breaks, for example). The function
    > isAlpha() does not discriminate finely enough to be very accurate
    > for processing like that. For more information, see
    > http://www.unicode.org/reports/tr14/
    > http://www.unicode.org/reports/tr29/
    >
    > Also see the proposed update to Unicode Regular Expressions, for
    > discussion of the use of Unicode properties in connection with
    > alpha, punct, etc. (in the context of regular expressions, at
    > least).
    >
    > http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties
    >
    > Mark
    > __________________________________
    > http://www.macchiato.com
    > ► “Eppur si muove” ◄
    >
    > ----- Original Message -----
    > From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
    > To: <unicode@unicode.org>
    > Sent: Wednesday, June 04, 2003 16:11
    > Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND
    > MARK
    >
    >
    > > All,
    > > I am investigating differing behavior in various environments of
    > > the wide-character version of the C function isAlpha with respect
    > > to character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
    > > implementations indicate that it is alphabetic, some don't. I
    > > suspect that other characters might be subject to the same
    > > confusion.
    > >
    > > The UNICODE documents seem ambiguous on this point: the General
    > > Category is "Lm" which, although informative instead of normative,
    > > would seem to imply that it is alphabetic; likewise
    > > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic;
    > > but PropList-4.0.0.txt contains two records - one indicating that
    > > it is a diacritic, one that indicates it is an extender.
    > >
    > > On to my questions:
    > >
    > > Q1: Can a character be both alphabetic and diacritic?
    > >
    > > Q2: Is there a definitive answer as to whether this is an
    > > alphabetic character?
    > >
    > > Thanks in advance for answers to these questions and/or any
    > > additional insight you can provide.
    > >
    > > Regards,
    > > Rob Mount
    > >


