L2/08-072

Date/Time:    Mon Jan 28 06:34:59 CST 2008
Contact:      www-international@w3.org (Richard Ishida)

Comments from the W3C i18n review of:  
http://www.unicode.org/reports/tr29/tr29-12.html

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 1
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
"To avoid ambiguity with the computer use of the term character, this is  
called a user-perceived character or a grapheme cluster.".


Section 1 para 1 replaces 'grapheme clusters ("user-perceived  
characters")' with 'user-perceived characters', but should probably say  
'grapheme clusters (also known as user-perceived characters)'.


S1 para 4 replaces 'grapheme clusters (what end users usually think of as  
characters)' with just 'characters'.  This is incorrect.


S2 para 1 deletes 'grapheme clusters' and leaves 'user-perceived characters'.


Later we read:


"Note: Default grapheme clusters have been referred to as"


This could point to a problem with terminology.  Is 'default grapheme  
clusters' meant to include default grapheme clusters of the extended and  
existing types? I would have thought so, but the meaning of the text is not  
clear. You'd need to say 'default grapheme clusters and extended default  
grapheme clusters' here to be clear (and elsewhere in the text, eg. 4 paras  
later).  We could rename the current 'default grapheme cluster' to 'minimal
default grapheme cluster' and define 'default grapheme cluster' to refer
to both the minimal and extended varieties, or you could simply use
'grapheme cluster' when you want to be non-specific.


This is very inconsistent.


We would like to see some rationalization of the terminology used  
throughout the section, and consistency in its application.


Terms should be clearly defined, and only one term should be used for one  
concept. The definitions should be easy for the reader to locate visually,  
and compare. We suggest a mini-glossary internal to section 3 or links on  
terms to a glossary at the end of the document.


In particular, the replacement of the term "grapheme cluster" with the term
"character", starting in the introduction and proceeding through the
document, seems to fly in the face of standard Unicode terminology and  
produces a significant problem. The term "character", as usually understood  
in Unicode contexts, refers to a logical character i.e. a code point. By  
using the term interchangeably with "grapheme cluster", we introduce  
confusion.
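
As a rough illustration of the distinction (our own sketch, not text from
the reviewed document; Python is used only for convenience), the sequence
below is two characters in the code point sense but a single user-perceived
character:

```python
import unicodedata

# "e" followed by a combining acute accent: displayed as one letter.
text = "e\u0301"

# Python strings are sequences of code points, so len() counts code points.
code_point_count = len(text)
names = [unicodedata.name(ch) for ch in text]

print(code_point_count)
print(names)
```

A reader who sees "character" used for both meanings cannot tell which of
these two counts is intended.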

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 2
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:

2 [http://www.unicode.org/reports/tr29/tr29-12.html#Conformance]

Comment:
The document calls out Thai and Lao in addition to Chinese and Japanese,  
due to the fact that they don't use spaces between words. Other similar  
scripts like Khmer and Myanmar should be added to the list, or it should be  
made clear that this is a non-exhaustive list.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 3
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: S/E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]


Comment:
para starting
"Grapheme clusters are important for..."


We would like to see this para significantly expanded to provide a more  
complete list of potential applications for the grapheme cluster. This  
information is rather scattered around the section. Eg. mouse selection,
cursor movement and backspace (and presumably delete) are mentioned later.


We feel that this will help readers understand the concepts in the
section. In addition, formally listing the intended applications of these
rules before defining a solution for them will help establish the required
features of the default grapheme clusters that need to be defined.


At the moment the document reads as if we have a solution looking for an  
application, rather than a set of use cases for which we are providing a
solution.


Note that applications we have come across recently include segmentation
for vertical text and identification of boundaries for first-letter styling
(which could be said to be a type of highlighting).  (Segmentation of
Indic and South-East Asian scripts for these applications is done on a
syllabic basis.  See examples at
http://www.flickr.com/photos/ishida/2212584968/
and
http://www.w3.org/International/notes/firstletter.html )

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 4
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]


Comment:
The sentence starting
"Historically, the Unicode Standard originally provided for grapheme clusters"
 is redundant. Either say "historically" or say "originally".

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 5
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3.1  
[http://www.unicode.org/reports/tr29/tr29-12.html#Default_Grapheme_Cluster_Table]

Comment:
"Extended  default grapheme clusters should be used in implementations in  
preference to default grapheme clusters, because it provides better results  
for Indic scripts such as Tamil."


This should come much earlier and be easier to find.  We would suggest  
that very near the beginning of section three the document states that it  
defines two types of default grapheme cluster, and that the extended one is  
the preferred.

There also needs to be a separate section and heading for the definition  
of XDGCs.  The current definition is difficult to find because it is just a  
small adjunct to the section about default grapheme clusters.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 6
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E/S?
Owner: RI

Location in reviewed document:

3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
'Indic scripts such as Tamil' is ambiguous.


We were expecting to read something like 'Indic scripts, such as the Tamil  
we saw earlier' or 'most Indic scripts'.


On the other hand, this may be intentional because the XDGCs are intended  
to only address the needs of a simpler Indic script like Tamil that doesn't  
generally use conjunct forms (so the statement should say something more  
like "the set of Indic scripts that are like Tamil").


If this latter interpretation is true, a. there needs to be a clearer  
statement about the relevance of XDGCs to Indic and South-East Asian  
scripts in general, and b. we think the document is definitely setting its  
sights too low.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 7
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
One way to think of this is as a sequence of characters that form a "stack".


Talking about Hangul characters "One way to think of this is as a sequence  
of characters that form a "stack"."  Some jamos stand side by side rather
than stacking.  Surely the point is that this constitutes a Korean  
syllable.
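
To make the point concrete, here is a small sketch of our own (the choice
of syllable is ours, not taken from the reviewed document). It shows that a
sequence of three conjoining jamos is canonically equivalent to a single
precomposed Hangul syllable, whatever the visual arrangement of the jamos:

```python
import unicodedata

# Three conjoining jamos: leading HIEUH, vowel A, trailing NIEUN.
jamos = "\u1112\u1161\u11AB"

# Canonical composition (NFC) maps the sequence to one precomposed syllable.
syllable = unicodedata.normalize("NFC", jamos)

print([unicodedata.name(ch) for ch in jamos])
print(len(syllable), unicodedata.name(syllable))
```

In this syllable the first two jamos sit side by side and only the third is
placed beneath them, so "stack" describes the layout only partially, while
"syllable" describes the unit exactly.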


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 8
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: S
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
We don't think extending default grapheme clusters to just incorporate
spacing marks goes far enough towards actually providing better results for a
very large proportion of the world's population. We feel that the Unicode
TC should conduct further research on how to extend default grapheme
clusters so that they incorporate the majority of Indic and South-East
Asian syllables.


Example: It is very common to have a sequence such as  
consonant+virama+consonant+vowel_sign, eg.


0938: स DEVANAGARI LETTER SA

094D: ् DEVANAGARI SIGN VIRAMA

0925: थ DEVANAGARI LETTER THA

093F: ि DEVANAGARI VOWEL SIGN I


See this as it would be rendered  
[http://www.w3.org/International/reviews/0601-css3-selectors/sthiti.gif].


Without tailoring, the current rules would result in text wrapping the THA  
to the next line, or attempting to highlight only part of the conjunct.  
The basic unit for grapheme clusters for Indic and South-East Asian scripts
is the syllable, and just addressing spacing marks will still leave you  
short of a useful solution.
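
The following sketch is our own approximation of the behaviour described
above, not the UAX 29 algorithm itself. It assumes that the general
categories Mn and Me can stand in for the Grapheme_Extend property, and
that Mc can stand in for the spacing marks added by the extended rules. The
function name and the `extended` flag are hypothetical.

```python
import unicodedata

def clusters(text, extended=False):
    """Split text into approximate default grapheme clusters.

    Assumption: Mn/Me approximate Grapheme_Extend; Mc approximates the
    spacing marks that the extended rules attach to the preceding base.
    """
    attach = {"Mn", "Me"} | ({"Mc"} if extended else set())
    result = []
    for ch in text:
        if result and unicodedata.category(ch) in attach:
            result[-1] += ch      # mark joins the preceding cluster
        else:
            result.append(ch)     # anything else starts a new cluster
    return result

# SA, VIRAMA, THA, VOWEL SIGN I (the example sequence above)
sthiti = "\u0938\u094D\u0925\u093F"

legacy = clusters(sthiti)
extended = clusters(sthiti, extended=True)
print([len(c) for c in legacy])
print([len(c) for c in extended])
```

Under these assumptions the extended variant keeps the vowel sign with THA,
but both variants still place a boundary between the virama and THA, which
is the break that splits the conjunct.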


We would like the Unicode TC to investigate the possibility of adding a  
rule to say that a vowel killer character extends the grapheme cluster to  
any immediately adjacent base character and all its combining characters.
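
One possible reading of such a rule is sketched below. This is our own
illustration, not a proposal for normative wording. It assumes that a
"vowel killer" can be detected by canonical combining class 9, that
general categories Mn, Me and Mc approximate the combining characters, and
that "base character" means any letter. The function name is hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by virama-type signs

def syllable_clusters(text):
    """Approximate clusters in which a virama binds the following letter."""
    marks = {"Mn", "Me", "Mc"}
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        prev = result[-1][-1] if result else ""
        after_virama = prev != "" and unicodedata.combining(prev) == VIRAMA_CCC
        joins = category in marks or (after_virama and category.startswith("L"))
        if result and joins:
            result[-1] += ch      # extend the current cluster
        else:
            result.append(ch)     # start a new cluster
    return result

# SA, VIRAMA, THA, VOWEL SIGN I
sthiti = "\u0938\u094D\u0925\u093F"
print(len(syllable_clusters(sthiti)))
```

With this reading the four code points form a single cluster. The sketch
ignores scripts in which the vowel killer is normally visible and does not
form conjuncts, which any real rule would need to address.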


We feel that introducing a definition of default grapheme clusters that  
addresses this issue will go a long way to helping ensure that implementers  
provide applications that can handle South Asian and South-East Asian  
scripts much better than now.


We feel that extending default grapheme clusters to include only spacing  
marks may only complicate things further. We do not, however, feel that the
extension of grapheme clusters should be abandoned.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 9
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E/S?
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
There are many types of grapheme clusters. Examples include:...


It is not clear whether this list refers to user perceived characters or  
different types of default grapheme cluster defined in this document.  
Please clarify, and if the former, please add an example of a complex Indic
syllable.


The Khmer coeng+consonant combinations do not seem to qualify as default
grapheme clusters according to the rules in this section, unless the fact  
that they are named sequences has some bearing, though that is not made  
clear. Please clarify this and provide some explanatory text for the link  
to the named sequences list.


(This is another example of inconsistent use of terminology related to  
grapheme clusters.)

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 10
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: S
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
We feel that the current definition of default grapheme clusters envisages  
only one way in which operations interact with grapheme clusters, whereas  
we probably require at least two different types of behaviour.


For example, in the case of Khmer, the subscript consonants are viewed as  
distinct letters by Cambodians.


On the one hand we suspect that it would make sense to delete the  
subjoined consonants separately from the 'base' character above them. This  
may not, however, be a question of deleting a character at a time - since  
it may be appropriate to delete vowel signs with the subjoined consonant.


On the other hand, we do not expect that it would make sense to highlight  
the subjoined character and its vowel sign separately from the rest of the  
syllable, especially since there could be some discontinuity between the  
subscript consonant and the following vowel sign. Nor would you expect to  
see parts of these clusters wrapping separately either. (Especially since  
vowels can appear to the left or on both sides of the stack produced by  
coeng combinations.)


1780:   ក   KHMER LETTER KA


17D2:   ្   KHMER SIGN COENG


179B:   ល   KHMER LETTER LO


17B8:   ី   KHMER VOWEL SIGN II


See this as it would be rendered
[http://www.w3.org/International/reviews/0801-uax29/khmerexample.gif].


We find ourselves wondering whether there may be two different types of  
grapheme cluster rules, one that produces the correct behaviour for  
wrapping or highlighting and another to produce correct behaviour for  
backspace deletion.
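
To show what we mean by two rule sets, here is a deliberately simplified
sketch for the Khmer example above. It is an assumption on our part that
deletion should treat COENG plus the subjoined consonant and its following
marks as one unit; the function names are hypothetical, and the code only
handles well-formed input of this shape.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG

def is_mark(ch):
    # Assumption: these general categories approximate combining marks.
    return unicodedata.category(ch) in {"Mn", "Me", "Mc"}

def selection_units(text):
    """Units for wrapping or highlighting: keep the whole stack together."""
    units = []
    for ch in text:
        if units and (is_mark(ch) or units[-1].endswith(COENG)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def deletion_units(text):
    """Units for backspace: each COENG starts a separately deletable unit."""
    units = []
    for ch in text:
        starts_unit = ch == COENG
        if units and not starts_unit and (is_mark(ch) or units[-1].endswith(COENG)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# KA, COENG, LO, VOWEL SIGN II
kli = "\u1780\u17D2\u179B\u17B8"
print([len(u) for u in selection_units(kli)])
print([len(u) for u in deletion_units(kli)])
```

Here selection sees one unit of four code points, while deletion sees the
base consonant followed by a second unit containing the coeng, the
subjoined consonant and the vowel sign.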


We would appreciate it if the authors of UAX 29 could point us to some  
discussions about this, or engage in some if they have not yet taken place.  


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 11
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
" Additional cases need to be added for complete, whereby any string of text "


Syntax error !

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 12
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:
The whole of section 3 is written in a way that suggests that default  
grapheme clusters are the norm, and extended grapheme clusters are a  
recommended extension. We feel that the section should be re-edited to
make it clear that the extended default grapheme cluster is the standard
way to do things in the future, but that you *could* find applications  
dealing with the former definition.


To help with this, we suggest that you find a different word than
'extended' for the name of extended default grapheme clusters, and that you  
rename default grapheme clusters to something like legacy default grapheme  
clusters.


[Note: the submitters omitted Comment #13. -- Ed]

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 14
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:Just following the Note: "A key feature... are"

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 15
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: AP

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:The examples for locale-specific tailorings are in a single  
run-on-like sentence and probably should be separated around the text:  
"...such as collation; Thai never breaks between..."


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 16
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
3 [http://www.unicode.org/reports/tr29/tr29-12.html#Grapheme_Cluster_Boundaries]

Comment:Under the heading "Grapheme Cluster Boundary Rules", the text  
refers to a rule "9b", but no such rule exists. This appears to mean rule  
9a. Note that no change bars are present here!

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 17
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: AP

Location in reviewed document:
4 [http://www.unicode.org/reports/tr29/tr29-12.html#Word_Boundaries]

Comment:
The added text about search engines, coupled with the somewhat obscure  
example about database queries, suggests that, as with our comment #3, more  
thought should be given to providing comprehensive or clear usage
scenarios.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 18
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: AP

Location in reviewed document:
4 [http://www.unicode.org/reports/tr29/tr29-12.html#Word_Boundaries]

Comment:All of the examples include space-separated languages. No mention  
is made of the fact that some languages don't use spaces between words,  
which we think is an extremely important point to make. It should be  
explicitly mentioned here and possibly an example given.
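
An example along the following lines might serve. It is our own
illustration, with a Thai phrase that we chose ourselves. It shows that
splitting on white space, which works for the space-separated examples,
finds no word boundary at all inside a two-word Thai phrase:

```python
# Four English words separated by spaces.
english = "The quick brown fox"

# "Thai language": two Thai words written with no space between them.
thai = "ภาษาไทย"

english_words = english.split()
thai_words = thai.split()

print(len(english_words))
print(len(thai_words))
```

Finding the boundary inside the Thai phrase needs something beyond white
space detection, typically a dictionary-based mechanism.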

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 19
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: AP

Location in reviewed document:
4 [http://www.unicode.org/reports/tr29/tr29-12.html#Word_Boundaries]

Comment:The problem with spaces in tailored word breaking should probably  
be added to the text. In particular, it should be pointed out (as with the  
Southeast Asian languages above) that the word break algorithm provides a  
"pretty good" default but that some more complex mechanisms may be needed  
to do a perfect job (with stuff like 1_234,56, where _ represents a space  
type character).
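
The sketch below illustrates the problem. It is our own approximation and
uses regular expressions as a stand-in for word break rules, not the UAX 29
algorithm. The pattern names are hypothetical, and we assume that the
space-type character is U+00A0 NO-BREAK SPACE or U+202F NARROW NO-BREAK
SPACE.

```python
import re

# Default-like behaviour: digits may be joined by "." or "," only.
DEFAULT_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

# Tailored behaviour: a no-break space may also join groups of digits.
TAILORED_NUMBER = re.compile(r"\d+(?:[.,\u00A0\u202F]\d+)*")

# "1_234,56" with a NO-BREAK SPACE in place of "_".
amount = "1\u00A0234,56"

default_tokens = DEFAULT_NUMBER.findall(amount)
tailored_tokens = TAILORED_NUMBER.findall(amount)

print(len(default_tokens))
print(len(tailored_tokens))
```

The default-like pattern splits the number at the space, while the tailored
pattern keeps it whole. Allowing an ordinary space as a joiner would risk
merging unrelated numbers, which is why a perfect job needs a more complex
mechanism.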

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Comment 20
At http://www.w3.org/International/reviews/0801-uax29/
Editorial/substantive: E
Owner: RI

Location in reviewed document:
1.1 [http://www.unicode.org/reports/tr29/tr29-12.html#Notation]

Comment:
"and not
 U+000D CARRIAGE RETURN (CR)<]"


We wonder if "<]" is a typo.  If this is intended, shouldn't there be some  
explanation ?