L2/04-125

From: Asmus Freytag
Subject: Comments on Script Values
Date: April 6, 2004

Comments

** At 12:59 PM 4/6/2004, Mark Davis wrote:

1. In 4.0.1 we added a new script value:

3031..3035 ; Katakana_Or_Hiragana # Lm [5] VERTICAL KANA REPEAT MARK..VERTICAL KANA REPEAT MARK LOWER HALF
309B..309C ; Katakana_Or_Hiragana # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FF70 ; Katakana_Or_Hiragana # Lm HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
FF9E..FF9F ; Katakana_Or_Hiragana # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

** I assume the theory for including the two last-named characters had to do with attempting not to change properties when mapping between fullwidth and halfwidth forms?

A.

It appears that we missed some other characters that should have an explicit value:

Proposed:

30FC ; Katakana_Or_Hiragana # Lm KATAKANA-HIRAGANA PROLONGED SOUND MARK

Maybe also:

30A0 ; Katakana_Or_Hiragana # Pd KATAKANA-HIRAGANA DOUBLE HYPHEN

** In this case, the fact that they are named "KATAKANA-HIRAGANA" is a good indicator that we (and WG2) considered them to be used by both scripts.

30FB ; Katakana # Pc KATAKANA MIDDLE DOT
FF65 ; Katakana # Pc HALFWIDTH KATAKANA MIDDLE DOT

** I'm less sure about this pair. The overwhelming usage will be with Katakana, but as you write, we wouldn't expect it to occur without an adjacent, and usually a preceding, Kana character, which would allow one to correctly determine the overall script membership for the run. On the theory that it's better to err on the side of inclusiveness in this case, I'd support this.

2. Right now, people have to dig the definition of the properties out of the TR. It would be better both for them and for our maintenance if they were treated like Line Break, as enumerated properties, with the values as given by TR #29 (as amended by the above). Here are suggested names.

** I'm guardedly in favor of pulling such data out of the text.
It depends on how useful the information is to the implementor. If every implementation has to re-analyze the issue, so that these properties are merely examples, then offering them in list form does not add much value. If, on the other hand, we expect that most implementations need to tweak at most a few of the values, then I see a lot of value added by making the list machine readable.

Default_Grapheme_Cluster_Type (DGCT)
Default_Word_Type (DWT)
Default_Sentence_Type (DST)

The "Default" is explicit in the name, so that people are clear that these are expected to be overridden.

** That would be inconsistent. We have other properties that are subject to tailoring that do not use 'default' as part of their name, but spell that out in documentation. By doing this, we'd be implying that *all* other properties that are defaults have that in their name - which is not true. Therefore, I'd rather be consistent and continue to assign to the documentation the task of providing such information about properties. Besides, it would leave the names shorter.

** The proposed names have another shortcoming in that they are slightly misleading. Word_type is not about a type of word, but a classification of a character to be used in determining word boundaries. So I suggest word_boundary_class etc. for the names. The three names need not be constructed on the same principle, since the type of composite is different:

grapheme element class
word boundary class
sentence delimiter class

For example, sentences have delimiters, whereas grapheme clusters do not. Most characters in a typical text are part of a sentence, while, except for Jamos, few are part of grapheme clusters. Getting the names of the properties to match the kinds of distinctions that they make will improve their usability. (I know, I wish some of the linebreaking classes had better names).

A./
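
To illustrate the point about a machine-readable list, here is a minimal Python sketch of how an implementation might load property lines in the "range ; value # comment" layout quoted earlier and then tweak a few values. The function names parse_property_lines and apply_overrides are hypothetical, chosen only for this illustration; the sample data is copied from the lines quoted in this message, and the override shown is just one possible tailoring, not a recommendation.

```python
import re
from typing import Dict

# Sample lines copied from the script values quoted in this message.
SAMPLE = """\
3031..3035 ; Katakana_Or_Hiragana # Lm [5] VERTICAL KANA REPEAT MARK..VERTICAL KANA REPEAT MARK LOWER HALF
309B..309C ; Katakana_Or_Hiragana # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FF70 ; Katakana_Or_Hiragana # Lm HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
30FB ; Katakana # Pc KATAKANA MIDDLE DOT
"""

# One code point or a "first..last" range, then ";" and the property value.
LINE_RE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)$"
)


def parse_property_lines(text: str) -> Dict[int, str]:
    """Hypothetical helper: map each listed code point to its value."""
    table: Dict[int, str] = {}
    for raw in text.splitlines():
        # Everything after "#" is a comment (category, count, names).
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        for code_point in range(first, last + 1):
            table[code_point] = match.group(3)
    return table


def apply_overrides(defaults: Dict[int, str],
                    overrides: Dict[int, str]) -> Dict[int, str]:
    """Hypothetical helper: return a tailored copy, leaving defaults intact."""
    tailored = dict(defaults)
    tailored.update(overrides)
    return tailored


defaults = parse_property_lines(SAMPLE)
# Example tailoring (an assumption for illustration): treat the middle dot
# as shared between the two kana scripts.
tailored = apply_overrides(defaults, {0x30FB: "Katakana_Or_Hiragana"})
```

If most implementations only need an override table this small, the published list does the bulk of the work; if every entry has to be revisited, the list is little more than an example.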