From: Philippe Verdy
Date: Thu Jun 05 2003 - 05:35:07 EDT

    In my opinion it can be viewed, depending on the application, as a letter (for some transliteration purposes) or as a diacritic (for other transliterations). In practice, though, it is mostly a letter modifier. In the UCA it sorts mostly like the base letter that it modifies, and the UCA gives the most appropriate linguistic value for this character.

    This is not the only character of this type in Unicode. You'll find similar sound marks (length marks, repeat marks) in other scripts, including abjads, and in IPA (the colon-like length sign, for example).

    Japanese speakers treat this sign as a separate vowel whose phonetic value depends on that of the previous character (which may carry a point or double-point diacritic, the voice mark used to alter the consonant value of the base character). This is probably why the transliteration of this character to Latin generally doubles the previous Latin vowel.
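
    As a minimal sketch of that vowel-doubling convention (the kana table below is a hypothetical, deliberately tiny mapping made up for illustration, and real romanization schemes differ in how they write long vowels):

```python
# Hypothetical, deliberately tiny kana-to-Latin table (illustration only).
KANA_TO_LATIN = {
    "コ": "ko", "ヒ": "hi", "ラ": "ra", "メ": "me", "ン": "n",
}
PROLONGED_SOUND_MARK = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
VOWELS = "aeiou"


def transliterate(text):
    """Map kana to Latin, turning the prolonged sound mark into a doubled vowel."""
    out = ""
    for ch in text:
        if ch == PROLONGED_SOUND_MARK:
            if out and out[-1] in VOWELS:
                # The mark has no value of its own: it repeats the previous vowel.
                out += out[-1]
            else:
                # No preceding vowel to prolong: keep the mark unchanged.
                out += ch
        else:
            # Unknown characters pass through untouched.
            out += KANA_TO_LATIN.get(ch, ch)
    return out
```

    With this table, transliterate("コーヒー") gives "koohii": each mark takes its value from the vowel before it.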

    However, this character is not strictly a diacritic: there are some uses of it (according to grammatical rules) after a punctuation sign that separates it from an imported foreign word (most often a proper name), sometimes written in another script. So the sign has its own lexical and grammatical semantics, and does not really combine like other diacritics.
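
    One way to see the difference is to compare the character's properties with those of a true combining mark. The sketch below assumes Python's standard unicodedata module (whatever Unicode version ships with the interpreter) and uses U+3099, the combining voiced sound mark, as the point of comparison; the describe() helper is just an illustrative name.

```python
import unicodedata

PROLONGED_SOUND_MARK = "\u30fc"   # the character under discussion
COMBINING_VOICED_MARK = "\u3099"  # a genuine combining diacritic, for contrast


def describe(ch):
    """Return (name, general category, canonical combining class) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# The prolonged sound mark is a modifier letter (Lm) with combining class 0,
# while the voiced sound mark is a nonspacing mark (Mn) with a non-zero class.
print(describe(PROLONGED_SOUND_MARK))
print(describe(COMBINING_VOICED_MARK))
```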

    You had better handle it as alphabetic (this is reflected by its general category, which indicates that it is a letter). In an application like yours, the isalpha() C function is generally used to create word tokens. Word tokenization usually requires at least grouping letters and diacritics, without creating a break between the previous character and the prolonged sound mark. Because the character is not combining (it can be used after a punctuation mark, separator, or symbol to prolong the sound before that punctuation), it needs to be handled as alphabetic.
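
    A rough sketch of such a tokenizer, assuming Python's unicodedata module rather than the C library (is_word_char() and tokenize() are illustrative names, and real Japanese text is not space-separated, so this only shows the classification step):

```python
import unicodedata


def is_word_char(ch):
    """Treat letters (L*) and combining marks (M*) as part of a word."""
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenize(text):
    """Split text into runs of word characters, dropping everything else."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character closes the token in progress.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

    Because the prolonged sound mark has general category Lm, it stays attached to the preceding kana here without any special case: tokenize("コーヒー と ラーメン") yields the three tokens コーヒー, と and ラーメン.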

    Another case to consider is line breaking: a line break can occur before that character, something that would not be permitted if it were handled as a combining character.

    If your isAlpha() function doesn't do that, you would have to handle this character as an exception in almost all cases to respect its linguistic value. Do you need this complication in your application code?

    -- Philippe.
    ----- Original Message -----
    From: "Mount, Rob (Robert F)" <>
    To: <>
    Sent: Thursday, June 05, 2003 1:11 AM

    > All,
    > I am investigating differing behavior in various environments of the
    > wide-character version of the C function isAlpha with respect to
    > implementations indicate that it is alphabetic, some don't. I
    > suspect that other characters might be subject to the same confusion.
    > The UNICODE documents seem ambiguous on this point: the General
    > Category is "Lm" which, although informative instead of normative,
    > would seem to imply that it is alphabetic; likewise
    > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
    > PropList-4.0.0.txt contains two records - one indicating that it is
    > a diacritic, one that indicates it is an extender.
    > On to my questions:
    > Q1: Can a character be both alphabetic and diacritic?
    > Q2: Is there a definitive answer as to whether this is an alphabetic
    > character?
    > Thanks in advance for answers to these questions and/or any
    > additional insight you can provide.
    > Regards,
    > Rob Mount