From: Allen Haaheim (email@example.com)
Date: Sat Jun 21 2003 - 19:38:56 EDT
Sorry to reopen a (closed?) case. The below look like loose ends to me.
>For Japanese people, they consider this sign as a separate vowel whose
>phonetic value depends on the phonetic value of the previous character
>(which may have a point or double-point diacritic, for the voice mark used
>to alter the consonant value of the base character). This is probably why
>the transliteration of this character to Latin generally doubles the
>previous Latin vowel.
"Separate" doesn't seem right. In my understanding it's an "extender" (as
Andrew notes) of the final vowel sound of the previous kana (so mentioning
diacritics, which affect only the initial consonant, is irrelevant). To be
more exact, it doubles the length of the vowel final.
>However, this character is not strictly a diacritic, as there are some uses
>of the character (according to grammatical rules) after a punctuation sign
>used to separate it from an imported foreign word (most often a proper
>name), sometimes written with another script.
We can't think of any instances of such a use here. Can you give an example?
----- Original Message -----
From: "Philippe Verdy" <firstname.lastname@example.org>
To: "Mount, Rob (Robert F)" <email@example.com>
Sent: Thursday, June 05, 2003 2:35 AM
Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
> My opinion is that it can be viewed, depending on its application, as a
letter (for some transliteration purpose), or as a diacritic (for some other
transliterations). But in reality it is mostly a letter modifier. For UCA,
it sorts mostly like the base letter that it modifies, and UCA gives the
most appropriate linguistic value of this character.
> This is not the only character of this type in Unicode. You'll find
similar sound marks (length marks, repeat marks) in other scripts, including
abjads, and IPA (the IPA colon-like sign for example).
> For Japanese people, they consider this sign as a separate vowel whose
phonetic value depends on the phonetic value of the previous character
(which may have a point or double-point diacritic, for the voice mark used
to alter the consonant value of the base character). This is probably why
the transliteration of this character to Latin generally doubles the
previous Latin vowel.
> However, this character is not strictly a diacritic, as there are some uses
of the character (according to grammatical rules) after a punctuation sign
used to separate it from an imported foreign word (most often a proper
name), sometimes written with another script. So the sign has its own lexical
and grammatical semantics, and does not really combine like other diacritics.
> It is better to handle it as alphabetic (and this is reflected by its
general category which indicates it is a letter). For your application, the
isalpha() C function is generally used to create word tokens. The word
tokenization often requires grouping letters and diacritics at least,
without creating a break between a previous character and the prolonged
sound mark. Because the character is not combining (it can be used after a
punctuation mark, separator, or symbol to prolong the sound before that
punctuation), it needs to be handled as alphabetic.
> Another case to consider is line-breaking: a line break can occur before
that character, something that would not be permitted if it was handled as a
> If your isAlpha() function doesn't do that, it would require you to handle
this character as an exception in almost all cases to respect its linguistic
value. Do you need this complication in your application code?
> -- Philippe.
> ----- Original Message -----
> From: "Mount, Rob (Robert F)" <firstname.lastname@example.org>
> To: <email@example.com>
> Sent: Thursday, June 05, 2003 1:11 AM
> Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
> > All,
> > I am investigating differing behavior in various environments of the
> > wide-character version of the C function isAlpha with respect to
> > character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
> > implementations indicate that it is alphabetic, some don't. I
> > suspect that other characters might be subject to the same confusion.
> > The Unicode documents seem ambiguous on this point: the General
> > Category is "Lm" which, although informative instead of normative,
> > would seem to imply that it is alphabetic; likewise
> > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
> > PropList-4.0.0.txt contains two records - one indicating that it is
> > a diacritic, one that indicates it is an extender.
> > On to my questions:
> > Q1: Can a character be both alphabetic and diacritic?
> > Q2: Is there a definitive answer as to whether this is an alphabetic
> > character?
> > Thanks in advance for answers to these questions and/or any
> > additional insight you can provide.
> > Regards,
> > Rob Mount
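
For anyone wanting to check the classification described in the question, here is a small sketch using Python's standard unicodedata module (an illustration only; describe is a made-up helper name). It reports the General Category and canonical combining class as recorded in the Unicode data bundled with the interpreter. The Diacritic and Extender properties from PropList are not exposed by this module, so they are not shown.

```python
import unicodedata

def describe(ch):
    """Collect the properties of ch that the standard library exposes."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),    # General Category
        "combining": unicodedata.combining(ch),  # canonical combining class
        "isalpha": ch.isalpha(),                 # Python's own letter test
    }

info = describe("\u30fc")
print(info["name"])       # KATAKANA-HIRAGANA PROLONGED SOUND MARK
print(info["category"])   # Lm
print(info["combining"])  # 0
print(info["isalpha"])    # True
```

Python's str.isalpha() is defined in terms of the letter categories (Lm, Lt, Lu, Ll, Lo), so it says nothing about how a given C library's wide-character classification behaves in a particular locale.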
This archive was generated by hypermail 2.1.5 : Sat Jun 21 2003 - 20:11:03 EDT