L2/04-125

From: Asmus Freytag
Subject: Comments on Script Values
Date: April 6, 2004

Comments are marked with **

At 12:59 PM 4/6/2004, Mark Davis wrote:
1. In 4.0.1 we added a new script value:

3031..3035    ; Katakana_Or_Hiragana # Lm   [5] VERTICAL KANA REPEAT MARK..VERTICAL KANA REPEAT MARK LOWER HALF
309B..309C    ; Katakana_Or_Hiragana # Sk   [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FF70          ; Katakana_Or_Hiragana # Lm       HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
FF9E..FF9F    ; Katakana_Or_Hiragana # Lm   [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

** I assume the theory for including the last two named characters had to do
with attempting not to change properties when mapping between fullwidth and
halfwidth forms?


A. It appears that we missed some other characters that should have an explicit
value:

Proposed:
30FC          ; Katakana_Or_Hiragana # Lm       KATAKANA-HIRAGANA PROLONGED SOUND MARK

Maybe also:

30A0          ; Katakana_Or_Hiragana # Pd       KATAKANA-HIRAGANA DOUBLE HYPHEN

** In this case, the fact that they are named "KATAKANA-HIRAGANA" is a good indicator
that we (and WG2) considered them to be used by both scripts.


30FB          ; Katakana # Pc       KATAKANA MIDDLE DOT
FF65          ; Katakana # Pc       HALFWIDTH KATAKANA MIDDLE DOT

** I'm less sure about this pair. The overwhelming usage will be with Katakana, but
as you write, we wouldn't expect it to occur without an adjacent (usually preceding)
Kana character, which would allow one to determine the overall script membership
of the run correctly.

On the theory that it's better to err on the side of inclusiveness in this case,
I'd support this.
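
The run-resolution idea above can be sketched as follows. This is only an
illustration under stated assumptions: the lookup table is a hand-built toy
covering the code points discussed in this message (it is not real Scripts.txt
data), it treats the two middle dots as Katakana_Or_Hiragana purely for the
sake of the example, and the names script_of and resolve_scripts are hypothetical.

```python
SHARED = "Katakana_Or_Hiragana"
KANA = ("Hiragana", "Katakana")

# Toy set of shared code points (assumption: includes the characters
# discussed in this message, with both middle dots treated as shared).
SHARED_CPS = {0x309B, 0x309C, 0x30A0, 0x30FB, 0x30FC,
              0xFF65, 0xFF70, 0xFF9E, 0xFF9F} | set(range(0x3031, 0x3036))


def script_of(cp):
    """Toy script lookup for a code point; not a full implementation."""
    if cp in SHARED_CPS:
        return SHARED
    if 0x3041 <= cp <= 0x3096:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FA:
        return "Katakana"
    return "Common"


def resolve_scripts(text):
    """Give each shared character the script of an adjacent kana run.

    A preceding kana character wins; a following one is used only when
    nothing precedes. Any non-kana character ends the run.
    """
    scripts = [script_of(ord(ch)) for ch in text]
    resolved = list(scripts)

    last = None  # script of the kana run we are currently inside, if any
    for i, s in enumerate(scripts):
        if s in KANA:
            last = s
        elif s == SHARED:
            if last is not None:
                resolved[i] = last
        else:
            last = None

    following = None  # same idea, scanning right to left
    for i in range(len(scripts) - 1, -1, -1):
        s = scripts[i]
        if s in KANA:
            following = s
        elif s == SHARED:
            if resolved[i] == SHARED and following is not None:
                resolved[i] = following
        else:
            following = None
    return resolved
```

With this sketch, a middle dot or prolonged sound mark between Katakana letters
resolves to Katakana, the same prolonged sound mark after Hiragana letters
resolves to Hiragana, and an isolated mark keeps the shared value.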

2. Right now, people have to dig the definition of the properties out of the TR.
It would be better both for them and for our maintenance if they were treated
like Line Break, as enumerated properties, with the values as given by TR #29
(as amended by the above). Here are suggested names.

** I'm guardedly in favor of pulling such data out of the text. It depends on how
useful the information is to the implementor. If every implementation has to
re-analyze the issue, so that these properties are merely examples, then offering
them in list form does not add much value.

If, on the other hand, we expect that most implementations need to tweak at most
a few of the values then I see a lot of value added by making the list machine
readable.
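
To illustrate the "tweak at most a few of the values" case, here is a minimal
sketch that reads lines in the semicolon-delimited layout quoted earlier in this
message and overlays a small tailoring table on the defaults. The function names
parse_property_lines and tailor are hypothetical, and the sample override is
made up for the example.

```python
def parse_property_lines(lines):
    """Map each code point to its property value.

    Expects lines such as '3031..3035 ; Value # comment'. Blank lines
    and comment-only lines are skipped.
    """
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code_range, value = (field.strip() for field in data.split(";", 1))
        start, _, end = code_range.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        for cp in range(first, last + 1):
            table[cp] = value
    return table


def tailor(defaults, overrides):
    """Return a copy of the defaults with a few values replaced."""
    tailored = dict(defaults)
    tailored.update(overrides)
    return tailored


SAMPLE = [
    "3031..3035    ; Katakana_Or_Hiragana # Lm   [5] VERTICAL KANA REPEAT MARK..",
    "FF70          ; Katakana_Or_Hiragana # Lm       HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK",
    "30FB          ; Katakana # Pc       KATAKANA MIDDLE DOT",
]

defaults = parse_property_lines(SAMPLE)
# Hypothetical tailoring: one implementation overrides a single value.
custom = tailor(defaults, {0x30FB: "Katakana_Or_Hiragana"})
```

The point of the sketch is that the defaults stay untouched and the tailoring is
a short, separate table, which is only practical when the defaults are
published in a machine-readable list.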


Default_Grapheme_Cluster_Type (DGCT)
Default_Word_Type (DWT)
Default_Sentence_Type (DST)

The "Default" is explicit in the name, so that people are clear that these are
expected to be overridden.

** That would be inconsistent. We have other properties that are subject to
tailoring that do not have 'default' in their names, but spell that out in the
documentation. By doing this, we'd be implying that *all* other properties that
are defaults have that in their names, which is not true. Therefore, I'd rather
be consistent and continue to leave it to the documentation to provide such
information about properties. Besides, it would keep the names shorter.

** The proposed names have another shortcoming in that they are slightly misleading.
Word_type is not about a type of word, but a classification of a character to be
used in determining word boundaries. So I suggest word_boundary_class etc.
for the names. The three names need not be constructed on the same principle, since
the type of composite is different:

grapheme element class
word boundary class
sentence delimiter class

For example, sentences have delimiters, whereas grapheme clusters do not. Most
characters in a typical text are part of a sentence, while, except for Jamos,
few are part of grapheme clusters.

Getting the names of the properties to match the kinds of distinctions that they
make will improve their usability. (I know; I wish some of the line breaking classes
had better names.)

A./