L2/04-152R

Source:  Mark Davis
Subject: Katakana_Or_Hiragana and UTR#29
Date:    2004-06-17

This is a revised document that incorporates L2/04-152 (first version), L2/04-124, L2/04-160, and L2/04-125.

A. Changes to UAX #29

When we added the new script value Katakana_Or_Hiragana, we didn't adjust Table 2 in http://www.unicode.org/reports/tr29/tr29-6.html#Word_Boundaries.

Here are the relevant pieces of UAX #29:

Katakana:
Script = KATAKANA, or
Any of the following:
U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 (ー) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E (゙) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F (゚) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

and the rule:

Do not break between Katakana

Katakana × Katakana (13)

Here are the contents of the script value:

3031..3035    ; Katakana_Or_Hiragana
# Lm   [5] VERTICAL KANA REPEAT MARK..~ ~ ~ MARK LOWER HALF

309B..309C    ; Katakana_Or_Hiragana
# Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..~-~ SEMI-~ ~ ~

FF70          ; Katakana_Or_Hiragana
# Lm       HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK

FF9E..FF9F    ; Katakana_Or_Hiragana
# Lm   [2] HALFWIDTH KATAKANA VOICED SOUND MARK..~-~ SEMI-~ ~ ~
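
For anyone who wants to check the listing mechanically, here is a minimal sketch of reading lines in this format. The function names parse_script_lines and expand are made up for illustration, and the sample text contains only the data lines quoted above, not the full Scripts.txt file.

```python
from typing import Dict, Iterator, Tuple

# Data lines copied from the listing above; comment lines are omitted here.
SAMPLE = """\
3031..3035    ; Katakana_Or_Hiragana
309B..309C    ; Katakana_Or_Hiragana
FF70          ; Katakana_Or_Hiragana
FF9E..FF9F    ; Katakana_Or_Hiragana
"""


def parse_script_lines(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line; '#' starts a comment."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        field, value = (part.strip() for part in line.split(";", 1))
        if ".." in field:
            first, last = field.split("..")
        else:
            first = last = field
        yield int(first, 16), int(last, 16), value


def expand(text: str) -> Dict[int, str]:
    """Map every code point in the listed ranges to its property value."""
    table: Dict[int, str] = {}
    for first, last, value in parse_script_lines(text):
        for cp in range(first, last + 1):
            table[cp] = value
    return table
```

Expanding the four ranges gives the ten characters that the comment in section C refers to.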

To address this, the proposal is:

A.1 Change TR#29 - the class Katakana to:

Katakana:
Script = Katakana, or
Script = Katakana_Or_Hiragana

A.2 And add new rules:

Katakana × Katakana_Or_Hiragana
Katakana_Or_Hiragana × Katakana_Or_Hiragana
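
To make the effect of A.1 concrete, here is a rough sketch of the amended class test and of the "do not break between Katakana" rule. It is an illustration only, not the TR's algorithm: the names are hypothetical, KATAKANA_SAMPLE covers just U+30A1..U+30FA rather than the whole Katakana script, and the Katakana_Or_Hiragana set is the listing given above.

```python
# Illustrative subset of Script=Katakana (assumption: only U+30A1..U+30FA).
KATAKANA_SAMPLE = range(0x30A1, 0x30FB)

# Script=Katakana_Or_Hiragana, as listed earlier in this document.
KATAKANA_OR_HIRAGANA = (
    set(range(0x3031, 0x3036)) | {0x309B, 0x309C, 0xFF70, 0xFF9E, 0xFF9F}
)


def is_katakana_class(ch: str) -> bool:
    """Class Katakana per A.1: Script=Katakana or Script=Katakana_Or_Hiragana."""
    cp = ord(ch)
    return cp in KATAKANA_SAMPLE or cp in KATAKANA_OR_HIRAGANA


def katakana_rule_forbids_break(before: str, after: str) -> bool:
    """True when both neighbors are in class Katakana, so no break is allowed."""
    return is_katakana_class(before) and is_katakana_class(after)
```

For example, katakana_rule_forbids_break("\u30AB", "\uFF9E") is true under this sketch, while a Katakana letter followed by a Latin letter is not covered by the rule.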

B. Adding to Katakana_Or_Hiragana

It seems surprising that the following characters don't have the value Katakana_Or_Hiragana. The proposal is to add them:

30FC          ; Katakana_Or_Hiragana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK
30A0          ; Katakana_Or_Hiragana # Pd       KATAKANA-HIRAGANA DOUBLE HYPHEN

I'm guessing we just missed these in:

http://www.unicode.org/L2/L2003/03427-script-codes.txt 
http://www.unicode.org/L2/L2004/04083-muller-scripts.html 
http://www.unicode.org/L2/L2004/04096-script-changes.txt

C. Removing Katakana_Or_Hiragana?

John Cowan writes the following (L2/04-160):

Date/Time:    Wed May 12 16:17:06 CDT 2004
Contact:      cowan@ccil.org
Report Type:  Submission (FAQ, Tech Note)
Opt Subject:  Hiragana_or_Katakana value of Script property should die

For the UTC:

I urge the UTC to change the 10 characters currently having the
Hiragana_or_Katakana value of the Script property to have the Inherited
value instead.  This eliminates all questions of the relationship
between Hiragana_or_Katakana and ISO 15924 Hrkt, and has no practical
effect, since all these characters are normally used only following a
Hiragana or Katakana character anyhow and are understood to have the
same script as that character, which is exactly how Inherited works.

Doug Ewell has expressed the same sentiments on the Unicode list.

I agree with John and Doug that there really was very little value in adding this script value; all of its characters should really just inherit their status from the previous character, which is what Script=Inherited is meant to do. But it is probably too late to make that change; instead, we might just document that an acceptable implementation can remap Katakana_Or_Hiragana to Inherited.
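
A minimal sketch of what such a remapping could look like in an implementation follows. Everything here is hypothetical: toy_script stands in for a real Script property lookup and knows only a small sample of Hiragana (U+3041..U+3096) and Katakana (U+30A1..U+30FA) plus the Katakana_Or_Hiragana characters listed in section A.

```python
from typing import List, Optional

KATAKANA_OR_HIRAGANA = (
    set(range(0x3031, 0x3036)) | {0x309B, 0x309C, 0xFF70, 0xFF9E, 0xFF9F}
)


def toy_script(ch: str) -> str:
    """Hypothetical stand-in for a Script lookup; covers sample ranges only."""
    cp = ord(ch)
    if cp in KATAKANA_OR_HIRAGANA:
        return "Katakana_Or_Hiragana"
    if 0x3041 <= cp <= 0x3096:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FA:
        return "Katakana"
    return "Common"


def resolve_scripts(text: str) -> List[str]:
    """Remap Katakana_Or_Hiragana to Inherited, then inherit from the left."""
    resolved: List[str] = []
    previous: Optional[str] = None
    for ch in text:
        script = toy_script(ch)
        if script == "Katakana_Or_Hiragana":
            script = "Inherited"  # the remapping suggested above
        if script == "Inherited" and previous is not None:
            script = previous  # take the resolved script of the previous character
        resolved.append(script)
        previous = script
    return resolved
```

With this sketch, a voiced sound mark after a Hiragana letter resolves to Hiragana and the same mark after a Katakana letter resolves to Katakana; a mark with nothing before it stays Inherited.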

D. Formal Properties for TR #29

Right now, people have to dig the definitions of the properties out of the TR text. It would be better both for them and for our maintenance if they were treated like Line Break, as enumerated properties, with the values as given by TR #29 (as amended by the above). The current approach also caused a problem in the last release, for which we issued the following erratum. The issue is that the character classes are supposed to be disjoint, and because of changes in property values they were not.

Also in Table 1. Default Grapheme Cluster Boundaries, the definition of the value Control is incorrect. It needed to have been adjusted for the change in status of the Joiner characters.

After the line:

and not U+000A LINE FEED (LF)

the following text is missing:

and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
and not U+200D ZERO WIDTH JOINER (ZWJ)
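
As a rough illustration of the corrected definition, the sketch below tests membership in Control with the exclusions applied. The set of General_Category values (Zl, Zp, Cc, Cf) and the CR exclusion are my recollection of Table 1 and are not quoted in this document, so treat them as assumptions; the function name is hypothetical.

```python
import unicodedata

# Assumption: Control is built from these General_Category values in Table 1.
CONTROL_CATEGORIES = {"Zl", "Zp", "Cc", "Cf"}

# Exclusions: CR and LF, plus the two Joiner characters added by the erratum.
CONTROL_EXCLUSIONS = {"\u000D", "\u000A", "\u200C", "\u200D"}


def is_control_class(ch: str) -> bool:
    """True if ch falls in the corrected Control class of this sketch."""
    return (
        unicodedata.category(ch) in CONTROL_CATEGORIES
        and ch not in CONTROL_EXCLUSIONS
    )
```

Without the two added lines, ZWNJ and ZWJ (both gc=Cf) would fall into Control as well as any other class they belong to, which is the overlap the erratum removes.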

Note: I am including comments from Asmus, indented.

    I'm guardedly in favor of pulling such data out of the text. It depends on how useful the information is to the implementor. If every implementation has to re-analyze the issue, so that these properties are merely examples, then offering them in list form does not add much value. If, on the other hand, we expect that most implementations need to tweak at most a few of the values then I see a lot of value added by making the list machine readable.

By providing separate enumerated properties, we can avoid these problems. Here are the suggested names.

Grapheme_Cluster_Class
Word_Class
Sentence_Class
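
The disjointness problem can be checked mechanically. The sketch below reports code points that fall into more than one class; the function name and the two hand-picked sample sets are hypothetical and are not the TR data. An enumerated property avoids the issue by construction, since each code point carries exactly one value.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_overlaps(classes: Dict[str, Set[int]]) -> Dict[Tuple[str, str], Set[int]]:
    """Return, for each pair of classes, the code points they share (if any)."""
    overlaps: Dict[Tuple[str, str], Set[int]] = {}
    for (name_a, set_a), (name_b, set_b) in combinations(classes.items(), 2):
        common = set_a & set_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hand-picked sample sets for illustration only (assumption, not TR data).
SAMPLE_CLASSES = {
    "Control": {0x0000, 0x200C, 0x200D},
    "Extend": {0x0300, 0x200C, 0x200D},
}
```

Here find_overlaps(SAMPLE_CLASSES) flags U+200C and U+200D as belonging to both sample classes; removing them from one of the sets makes the result empty.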

Note: My original put "Default_" on the front of each, to emphasize that these are the default values, but that we expect tailoring, much like what we do with the DUCET (Default Unicode Collation Element Table). I also originally used "_Type" instead of "_Class". Asmus's comments on those original names were:

    That would be inconsistent. We have other properties that are subject to tailoring that do not use the 'default' as part of their name, but spell that out in documentation. By doing this, we'd be implying that *all* other properties that are defaults have that in their name - which is not true. Therefore, I'd rather be consistent and continue to assign to the documentation the task of providing such information about properties. Besides, it would leave the names shorter.

    The proposed names have another shortcoming in that they are slightly misleading. Word_type is not about a type of word, but a classification of a character to be used in determining word boundaries. So I suggest word_boundary_class etc. for the names. The three names need not be constructed on the same principle, since the type of composite is different.

E. Relation to L2/04-415

1. There is one point of overlap with this document. Point D of that document mentions Katakana in rules. In accordance with A, the rules should have Katakana | Katakana_Or_Hiragana instead.

2. While researching that document, we got the feedback that the following should not be General_Category=Connector_Punctuation (gc=Pc). The characters don't connect other elements; they separate them.

U+30FB KATAKANA MIDDLE DOT and
U+FF65 HALFWIDTH KATAKANA MIDDLE DOT

The proposal is to change those characters from Pc to Po (Other_Punctuation).
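
For reference, here is a small sketch of how one might check the General_Category of these two characters with the Python standard library. The helper name is hypothetical, and the values reported depend on the version of the Unicode Character Database bundled with the interpreter.

```python
import unicodedata
from typing import Dict, Iterable

# The two characters discussed above.
MIDDLE_DOTS = ("\u30FB", "\uFF65")


def report_general_categories(chars: Iterable[str] = MIDDLE_DOTS) -> Dict[str, str]:
    """Map 'U+XXXX' labels to the General_Category reported by unicodedata."""
    return {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in chars}
```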