RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Mount, Rob (Robert F) (rfmount@ingr.com)
Date: Thu Jun 05 2003 - 19:49:28 EDT

Mark,

Thanks again for your response.

I understand what you say about word formation and combining marks, and
that the Alphabetic classification should not be limited to the "L"
categories. But 30FC has General Category "Lm" (which should be
included) and, since version 3.1, has been listed explicitly as
Alphabetic in DerivedCoreProperties.txt. (It appears that the formal
expression of the Alphabetic property moved from PropList.txt to
DerivedCoreProperties.txt in 3.1.) I don't understand how its exclusion
from the Alphabetic category in 3.0.1 could be anything other than an
oversight. But if it was not, then either the Consortium's consensus on
the classification of this character has changed, or the current
classification is in error.

Here's a little more background on my motivation. The problem occurs in
a procedure that evaluates whether a user-supplied name can be used as
an identifier, for which identification of alphabetic characters is
important. One implementation of isalpha(), purportedly based on
Unicode 2.1, indicates that 30FC is an alpha character. The current
implementation from the same vendor, based on 3.0.1, classifies it as
non-alpha. Presumably the next one will be based on 3.1 or later and
will reclassify it, again, as alpha.

I have since discovered section 5.16 of the spec, which describes the
Unicode standard for identifier formation, and frankly, our validation
algorithm is a bit naive and will require some work. But our use of
isalpha() is not, I think, fundamentally flawed; the changes will
require only that we include some additional characters that are not
currently considered valid. Certainly, if the behavior of isalpha() did
not change, the existing algorithm would at least be stable across
different platforms, warts and all. If we can't depend on uniform
behavior of isalpha(), we will have to eliminate its use from our
validation function.

So I am trying to discover why the behavior of isalpha() has changed.
Here are the possibilities:

1) the previous implementation was incorrect and the current one is
   fixed;
2) the current implementation is flawed because it does not conform to
   the documented standard;
3) the current implementation is flawed because it's based on incorrect
   documentation of the standard;
4) both implementations are correct but are based on different,
   incompatible standards;
5) something else I don't yet understand.

The overriding assumption for this entire discussion is that the
behavior of isalpha() should be governed by the Unicode Alphabetic
property. That seems reasonable to me and is, in fact, the vendor's
claim. If not (or even if so), perhaps someone can recommend a better
(or more stable) API for discovering Unicode character properties, upon
which we might base our identifier validation and other character
processing logic.
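
One way to get behavior that does not shift with the platform's
isalpha() is to load the Alphabetic ranges from a single, pinned copy of
the data file and test against that. The following is only a rough
sketch of the idea, in Python for brevity. The embedded SAMPLE_DATA
lines are hand-written in the "range ; property # comment" layout of
DerivedCoreProperties.txt and are not copied from any release; the
names load_ranges, has_property and is_valid_name are made up for
illustration; and the naming rule in is_valid_name is a stand-in, not
the rule from section 5.16.

```python
import bisect

# Hand-written sample lines in the layout used by DerivedCoreProperties.txt.
# Illustrative only; a real program would read one pinned release of the file.
SAMPLE_DATA = """\
# Derived Property: Alphabetic
0041..005A    ; Alphabetic # sample: LATIN CAPITAL LETTER A..Z
0061..007A    ; Alphabetic # sample: LATIN SMALL LETTER A..Z
30A1..30FA    ; Alphabetic # sample: KATAKANA LETTER SMALL A..VO
30FC..30FE    ; Alphabetic # sample: PROLONGED SOUND MARK..VOICED ITERATION MARK
005F          ; Some_Other_Property # sample line that must be ignored
"""


def load_ranges(text, prop="Alphabetic"):
    """Return a sorted list of (first, last) code point ranges for prop."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    ranges.sort()
    return ranges


def has_property(ch, ranges):
    """Binary-search the sorted ranges for the code point of ch."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if none).
    i = bisect.bisect_right(ranges, (cp, 0x110000)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


def is_valid_name(name, ranges):
    """Hypothetical rule: a letter first, then letters, ASCII digits or '_'."""
    if not name or not has_property(name[0], ranges):
        return False
    return all(has_property(c, ranges) or c in "0123456789_" for c in name[1:])


if __name__ == "__main__":
    ranges = load_ranges(SAMPLE_DATA)
    for name in ("\u30c7\u30fc\u30bf_1", "1abc", "\u30fb"):
        print(repr(name), is_valid_name(name, ranges))
```

Because the table is loaded from a file we choose, the answer for 30FC
changes only when we deliberately move to a different release of the
data, not when the vendor's C library does.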

Comments anyone?

Rob

-----Original Message-----
From: Mark Davis [mailto:mark.davis@jtcsv.com]
Sent: Thursday, June 05, 2003 2:48 PM
To: Mount, Rob (Robert F); unicode@unicode.org
Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

Ah, I see why you didn't find the Alphabetic property. It was added in
Unicode 3.1.0 (March 2001), precisely to capture characters that are
not L yet are still alphabetic. If you look at the derivation in
C:\DATA\UCD\3.1.0-Update\DerivedCoreProperties-3.1.0.txt, you will
see:

# Derived Property: Alphabetic
# Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

So Alphabetic includes all L's, but also other characters. And, as I
said, it alone is not sufficient for word breaks.
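
To make the derivation line above concrete, here is a minimal Python
sketch that rebuilds an approximation of Alphabetic from General
Category values. It assumes Python's unicodedata module, whose data
reflects whatever Unicode version ships with the interpreter (not
3.1.0). Other_Alphabetic is taken as a caller-supplied set, because the
standard library does not expose that contributory property; it would
have to be loaded from PropList.txt. The names ALPHABETIC_CATEGORIES,
OTHER_ALPHABETIC_SAMPLE and is_alphabetic are invented for this example.

```python
import unicodedata

# General Category values named in the derivation line:
# Lu+Ll+Lt+Lm+Lo+Nl (Other_Alphabetic is handled separately below).
ALPHABETIC_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})

# Placeholder for Other_Alphabetic. Left empty here on purpose: the real
# contents would have to be read from PropList.txt.
OTHER_ALPHABETIC_SAMPLE = frozenset()


def is_alphabetic(ch, other_alphabetic=OTHER_ALPHABETIC_SAMPLE):
    """Approximate Alphabetic as (category in L*/Nl) or Other_Alphabetic."""
    return (
        unicodedata.category(ch) in ALPHABETIC_CATEGORIES
        or ch in other_alphabetic
    )


if __name__ == "__main__":
    # U+30FC (Lm), LATIN CAPITAL A (Lu), ROMAN NUMERAL ONE (Nl), apostrophe
    for ch in ("\u30fc", "A", "\u2160", "'"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_alphabetic(ch))
```

With an empty Other_Alphabetic set this undercounts, so it is a lower
bound on the real property rather than a replacement for it.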

> Is the omission of 30FC from the Alphabetic category of PropList.txt
> an error?

This is not an oversight. As I said, many characters are not
Alphabetic and are still part of words. Examples include that
character and many others. As a simple case, "can't" is a word in
English, although the apostrophe is not alphabetic. There are many,
many examples using combining marks, such as a virama (halant) in
Hindi, which is not Alphabetic:

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=094D
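
The General Category of the apostrophe and the virama can be inspected
directly. A short sketch, assuming Python's unicodedata module (its
data is whatever Unicode version the interpreter bundles, which may
differ from the versions discussed in this thread). Note that Python's
str.isalpha() is documented as a test on the letter categories, not on
the Alphabetic property, so it is shown only for comparison; describe
is a made-up helper name.

```python
import unicodedata


def describe(ch):
    """Return (code point label, General Category, str.isalpha() result)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())


if __name__ == "__main__":
    # Apostrophe, DEVANAGARI SIGN VIRAMA, and U+30FC for contrast.
    for ch in ("'", "\u094d", "\u30fc"):
        print(describe(ch))
```

Neither the apostrophe nor the virama falls in a letter category, yet
both occur inside words, which is the point being made about word
boundaries.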

So if you want reasonable word breaks, you need to use more than the L
category; you need to look at
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/

Mark
__________________________________
http://www.macchiato.com
"Eppur si muove"

----- Original Message -----
From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
To: "Mark Davis" <mark.davis@jtcsv.com>; <unicode@unicode.org>
Sent: Thursday, June 05, 2003 11:57
Subject: RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

>
> Thanks to all who responded. The insight you provided is invaluable,
> and I appreciate your patience with a Unicode beginner.
>
> Mark's reference to UCD.html, and by inference to
> DerivedCoreProperties.txt, seems definitive. However, these are part
> of the 4.0 spec. The suspect implementation of isalpha is based,
> according to the vendor, on 3.0.1.
>
> The vendor relies, instead, on
> http://www.unicode.org/Public/3.0-Update1/PropList-3.0.1.txt
> which classifies 30FC as Diacritic, Extender, Bidi: Left-to-Right,
> and Identifier Part, but not as Alphabetic. Is this an error in the
> specification? I could find no reference to the Alphabetic property
> in the 3.0.1 documentation except in, and with reference to,
> PropList-3.0.1.txt. However, it would seem, from the 4.0
> documentation, that all characters having a General Category
> beginning with "L" should be considered letters and hence,
> implicitly, Alphabetic.
>
> Is this, indeed, the intent of the General Category classifications
> beginning with "L"?
>
> Is the omission of 30FC from the Alphabetic category of PropList.txt
> an error?
>
> Rob
>
> -----Original Message-----
> From: Mark Davis [mailto:mark.davis@jtcsv.com]
> Sent: Thursday, June 05, 2003 9:28 AM
> To: Mount, Rob (Robert F); unicode@unicode.org
> Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
>
>
> The UCD has a property explicitly called "Alphabetic". So that
> should be used when determining whether a character is, well,
> alphabetic. See http://www.unicode.org/Public/UNIDATA/UCD.html
>
> However, in the past many people have misused functions like
> isAlpha() for more complicated processing, such as determining text
> boundaries (line and word breaks, for example). The function
> isAlpha() does not discriminate finely enough to be very accurate
> for processing like that. For more information, see
> http://www.unicode.org/reports/tr14/
> http://www.unicode.org/reports/tr29/
>
> Also see the proposed update to Unicode Regular Expressions, for
> discussion of the use of Unicode properties in connection with
> alpha, punct, etc. (in the context of regular expressions, at least).
> http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties
>
> Mark
> __________________________________
> http://www.macchiato.com
> "Eppur si muove"
>
> ----- Original Message -----
> From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
> To: <unicode@unicode.org>
> Sent: Wednesday, June 04, 2003 16:11
> Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
>
>
> > All,
> > I am investigating differing behavior in various environments of
> > the wide-character version of the C function isAlpha with respect
> > to character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
> > implementations indicate that it is alphabetic, some don't. I
> > suspect that other characters might be subject to the same
> > confusion.
> >
> > The Unicode documents seem ambiguous on this point: the General
> > Category is "Lm" which, although informative instead of normative,
> > would seem to imply that it is alphabetic; likewise
> > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic;
> > but PropList-4.0.0.txt contains two records, one indicating that
> > it is a diacritic and one indicating that it is an extender.
> >
> > On to my questions:
> >
> > Q1: Can a character be both alphabetic and diacritic?
> >
> > Q2: Is there a definitive answer as to whether this is an
> > alphabetic character?
> >
> > Thanks in advance for answers to these questions and/or any
> > additional insight you can provide.
> >
> > Regards,
> > Rob Mount


This archive was generated by hypermail 2.1.5 : Thu Jun 05 2003 - 20:42:16 EDT