Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Allen Haaheim (haaheima@interchange.ubc.ca)
Date: Sat Jun 21 2003 - 19:38:56 EDT

    Philippe,

    Sorry to reopen a (closed?) case. The points below look like loose ends to me.

    >Japanese speakers consider this sign a separate vowel whose
    >phonetic value depends on the phonetic value of the previous character
    >(which may have a point or double-point diacritic, for the voice mark used
    >to alter the consonant value of the base character). This is probably why
    >the transliteration of this character to Latin generally doubles the
    >previous Latin vowel.

    "Separate" doesn't seem right. In my understanding it's an "extender" (as
    Andrew notes) of the final vowel sound of the previous kana (so mentioning
    diacritics, which affect only the initial consonant, is irrelevant). To be
    more exact, it doubles the length of the final vowel.
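
A minimal sketch of the "extender" reading, in Python. The function name `romanize` and the tiny kana table are hypothetical, chosen only for illustration, and the doubled-vowel spelling is just one of several romanization conventions. The point it shows is that the mark repeats the final vowel of whatever precedes it, and that a voiced kana such as ビ behaves exactly like its unvoiced counterpart ヒ in this respect.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
VOWELS = set("aeiou")

# Toy table for illustration only (an assumption, nowhere near complete).
KANA_TO_ROMAJI = {
    "コ": "ko",
    "ヒ": "hi",
    "ビ": "bi",  # voiced counterpart of ヒ: the consonant changes, the vowel does not
    "ル": "ru",
}


def romanize(text):
    """Romanize text, treating U+30FC as an extender of the preceding vowel."""
    pieces = []
    for ch in text:
        if ch == PROLONGED_SOUND_MARK and pieces and pieces[-1][-1:] in VOWELS:
            # Repeat the final vowel of the previous piece.
            pieces.append(pieces[-1][-1])
        else:
            # Unknown characters (and a mark with nothing to extend) pass through.
            pieces.append(KANA_TO_ROMAJI.get(ch, ch))
    return "".join(pieces)


print(romanize("コーヒー"))
print(romanize("ビール"))
```

A mark with no preceding vowel is passed through unchanged here, since there is nothing for it to extend.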

    >However, this character is not strictly a diacritic, as there are some uses
    >of the character (according to grammatical rules) after a punctuation sign
    >used to separate it from an imported foreign word (most often a proper
    >name), sometimes written with another script.

    We can't think of any instances of such a use here. Can you give an example?

    Allen

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Mount, Rob (Robert F)" <rfmount@ingr.com>
    Cc: <unicode@unicode.org>
    Sent: Thursday, June 05, 2003 2:35 AM
    Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

    > My opinion is that it can be viewed, depending on its application, as a
    letter (for some transliteration purposes), or as a diacritic (for other
    transliterations). But in reality it is mostly a letter modifier. For UCA,
    it sorts mostly like the base letter that it modifies, and UCA gives the
    most appropriate linguistic value for this character.
    >
    > This is not the only character of this type in Unicode. You'll find
    similar sound marks (length marks, repeat marks) in other scripts, including
    abjads, and in IPA (the IPA colon-like sign, for example).
    >
    > Japanese speakers consider this sign a separate vowel whose
    phonetic value depends on the phonetic value of the previous character
    (which may have a point or double-point diacritic, for the voice mark used
    to alter the consonant value of the base character). This is probably why
    the transliteration of this character to Latin generally doubles the
    previous Latin vowel.
    >
    > However, this character is not strictly a diacritic, as there are some uses
    of the character (according to grammatical rules) after a punctuation sign
    used to separate it from an imported foreign word (most often a proper
    name), sometimes written with another script. So the sign has its own lexical
    and grammatical semantics, and does not really combine like other diacritics.
    >
    > You had better handle it as alphabetic (and this is reflected by its
    general category, which indicates it is a letter). For your application, the
    isalpha() C function is generally used to create word tokens. Word
    tokenization often requires grouping at least letters and diacritics,
    without creating a break between a previous character and the prolonged
    sound mark. Because the character is not combining (it can be used after a
    punctuation mark, separator, or symbol to prolong the sound before this
    punctuation), it needs to be handled as alphabetic.
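
The tokenization point above can be sketched as follows, in Python rather than C so that the result does not depend on the C locale. The helpers `is_word_char` and `tokenize` are hypothetical names, and treating every character whose General Category starts with L or M as a word character is a simplifying assumption, not the rule any particular isalpha() implementation uses.

```python
import unicodedata


def is_word_char(ch):
    """True for letters (L*) and combining marks (M*), by General Category."""
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenize(text):
    """Split text into runs of word characters, dropping everything else."""
    tokens = []
    current = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# U+3001 IDEOGRAPHIC COMMA separates the two words; U+30FC stays inside them.
print(tokenize("コーヒー、ビール"))
```

Because U+30FC has General Category Lm, it is grouped with the kana around it and no token boundary appears before it. A classifier that rejected it would split コーヒー into コ and ヒ.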
    >
    > Another case to consider is line-breaking: a line break can occur before
    that character, something that would not be permitted if it were handled as a
    combining character.
    >
    > If your isalpha() function doesn't do that, it would require you to handle
    this character as an exception in almost all cases to respect its linguistic
    value. Do you need this complication in your application code?
    >
    > -- Philippe.
    > ----- Original Message -----
    > From: "Mount, Rob (Robert F)" <rfmount@ingr.com>
    > To: <unicode@unicode.org>
    > Sent: Thursday, June 05, 2003 1:11 AM
    > Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
    >
    >
    > > All,
    > > I am investigating differing behavior in various environments of the
    > > wide-character version of the C function isalpha (iswalpha) with respect to
    > > character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
    > > implementations indicate that it is alphabetic, some don't. I
    > > suspect that other characters might be subject to the same confusion.
    > >
    > > The Unicode documents seem ambiguous on this point: the General
    > > Category is "Lm" which, although informative instead of normative,
    > > would seem to imply that it is alphabetic; likewise
    > > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
    > > PropList-4.0.0.txt contains two records - one indicating that it is
    > > a diacritic, one that indicates it is an extender.
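
For readers who want to check the properties named above, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is hypothetical. Two caveats, stated as assumptions: the values come from whatever Unicode version the running Python ships, not from the 4.0.0 files cited above, and the module does not expose the Diacritic or Extender properties from PropList.txt, so those two records cannot be confirmed this way. Note also that `str.isalpha()` is defined in terms of General Category (Lm, Lt, Lu, Ll, Lo), which is not the same thing as the derived Alphabetic property.

```python
import unicodedata

PROLONGED_SOUND_MARK = "\u30fc"


def describe(ch):
    """Collect the properties the standard library exposes for one character."""
    return {
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "str_isalpha": ch.isalpha(),
    }


for key, value in describe(PROLONGED_SOUND_MARK).items():
    print(f"{key}: {value}")
```

A canonical combining class of 0 together with a letter category is consistent with treating the character as a spacing modifier letter rather than as a combining mark.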
    > >
    > > On to my questions:
    > >
    > > Q1: Can a character be both alphabetic and diacritic?
    > >
    > > Q2: Is there a definitive answer as to whether this is an alphabetic
    > > character?
    > >
    > > Thanks in advance for answers to these questions and/or any
    > > additional insight you can provide.
    > >
    > > Regards,
    > > Rob Mount