feedback on UAX #29 : word breaks with hiragana and voiced marks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 22 2007 - 07:32:05 CDT

    I originally meant to post this to the public list, but I typed "unicore" instead of "unicode" before the @ in the address. I have noticed something on which I would welcome comments:
    ------------
    I see that my proposal to have the left single (6-shaped) quotation mark (U+2018) treated as "MidLetter" in UAX #29 for word boundaries was accepted, but I have noticed another thing:

    This UAX treats Katakana specially: it avoids breaking between two Katakana letters, but still breaks between Hiragana letters. However, breaking is probably not right in every case, notably within a sequence of a Hiragana letter followed by a voiced or semi-voiced sound mark:

    U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
    U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

    and possibly after the other characters currently listed under the Katakana value in table 3:

    U+3031 (〱) VERTICAL KANA REPEAT MARK
    U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
    U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF
    U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
    U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF
    U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
    U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
    U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
    U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK
    U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
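
    As a quick cross-check of the list above, here is a minimal Python sketch that looks up the character names for the quoted code points in the Unicode Character Database bundled with the interpreter. The names MARKS and describe are illustrative only and are not part of UAX #29.

```python
import unicodedata

# Code points quoted in the message, all listed under the Katakana value
# in table 3 according to the message. MARKS is a hypothetical name.
MARKS = [0x309B, 0x309C, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
         0x30A0, 0x30FC, 0xFF70, 0xFF9E, 0xFF9F]

def describe(cp):
    """Return 'U+XXXX NAME' for a code point, using the bundled UCD."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in MARKS:
    print(describe(cp))
```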

    Do word breaks really occur between Hiragana letters and these marks when the marks are encoded after them (note that Hiragana letters are excluded from "ALetter" in table 3)? If not, then:

    (1) the characters listed above would be better placed under a
        separate value (say "ExtendKana") and removed from Katakana in table 3.

    (2) a new value "Hiragana" should be created for Hiragana letters in table
        3, like this:

            Katakana script="KATAKANA" (rewritten first row in table 3)
            Hiragana script="HIRAGANA" (new inserted row in table 3)
            ExtendKana (the list of characters above) (new row in table 3)

    (3) the existing rule WB13 (Katakana × Katakana) should be rewritten
        equivalently as:

            WB13. (Katakana | ExtendKana) × (Katakana | ExtendKana)

    (4) the following rules WB13a and WB13b should be rewritten equivalently as:

            WB13a. (ALetter | Numeric | Katakana | ExtendKana | ExtendNumLet)
                    × ExtendNumLet

            WB13b. ExtendNumLet × (ALetter | Numeric | Katakana | ExtendKana)

    (5) another rule should be added:

            WB13c. (Hiragana | ExtendKana) × ExtendKana

    No other change is needed, because a word break will still occur either between two Hiragana letters, or after an ExtendKana and before a Hiragana letter, under the next rule:

            WB14. Any ÷ Any
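
    To make the effect of the proposed rules concrete, here is a rough Python sketch that applies only the proposed WB13 and WB13c, followed by WB14. It is an illustration under assumptions, not an implementation of UAX #29: every other rule (including WB13a/WB13b and ExtendNumLet) is left out, so any non-kana character simply breaks on both sides, and the script property is approximated from character names because Python's standard library does not expose it. The names EXTEND_KANA, kana_class, no_break and segments are hypothetical.

```python
import unicodedata

# Proposed "ExtendKana" value: the characters listed in the message.
EXTEND_KANA = {chr(cp) for cp in (
    0x309B, 0x309C, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
    0x30A0, 0x30FC, 0xFF70, 0xFF9E, 0xFF9F)}

def kana_class(ch):
    """Classify a character under the proposed table 3 values.

    Assumption: script=HIRAGANA / script=KATAKANA is approximated by the
    character name prefix, since the stdlib has no script property.
    """
    if ch in EXTEND_KANA:
        return "ExtendKana"
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA "):
        return "Hiragana"
    if name.startswith(("KATAKANA ", "HALFWIDTH KATAKANA ")):
        return "Katakana"
    return "Other"

def no_break(prev, nxt):
    """True if proposed WB13 or WB13c forbids a break between the classes."""
    kata = ("Katakana", "ExtendKana")
    if prev in kata and nxt in kata:                             # WB13
        return True
    if prev in ("Hiragana", "ExtendKana") and nxt == "ExtendKana":  # WB13c
        return True
    return False                                                 # WB14: Any ÷ Any

def segments(text):
    """Split text at every position where none of the sketched rules applies."""
    if not text:
        return []
    out = [text[0]]
    for prev, nxt in zip(text, text[1:]):
        if no_break(kana_class(prev), kana_class(nxt)):
            out[-1] += nxt
        else:
            out.append(nxt)
    return out

# HIRAGANA KA + U+309B + HIRAGANA KI: the mark stays attached to KA.
print(segments("\u304b\u309b\u304d"))
```

    With these rules, two plain Hiragana letters are still separated, while a Katakana run such as U+30AB U+30BF U+30AB U+30CA stays in one piece.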

    Or am I missing something?


