I have been reviewing UAX #29, Unicode Text Segmentation, because I have a
feeling we will be trying to do too much with the concept of grapheme
clusters, even with tailoring, when we extend it to include whole
aksharas.
What is the meaning of "Word boundaries, line boundaries, and sentence
boundaries should not occur within a grapheme cluster: in other words,
a grapheme cluster should be an atomic unit with respect to the process
of determining these other boundaries"? In particular, whom is it
directed to?
Now, once quadrate support is added and we are able to write Ancient
Egyptian in Unicode, we will probably have two very significant
languages that regularly breach parts of that rule. (At least, I
assume a whole Egyptian quadrate would be included in a dropped
capital.) Sanskrit word boundaries frequently occur within *legacy*
grapheme clusters, and sentence boundaries may occur within quadrates.
I presume UAX#29 does not intend that we should use means other than
Unicode to write samhita Sanskrit and Ancient Egyptian.
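To make the Sanskrit case concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the UAX#29 algorithm: the helper names are hypothetical, a legacy grapheme cluster is approximated as a base character plus any following characters of General_Category Mn or Me (a rough stand-in for Grapheme_Extend), and the sample is my own choice, namely "tad eva" (tat + eva after sandhi) written continuously as one string.

```python
import unicodedata


def legacy_like_clusters(text):
    """Split text into a base character plus trailing Mn/Me marks.

    Assumption: Mn/Me approximates Grapheme_Extend; this is not the
    full UAX#29 rule set.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def boundary_inside_cluster(text, offset):
    """Return True if a code point offset falls strictly inside a cluster."""
    pos = 0
    for cluster in legacy_like_clusters(text):
        end = pos + len(cluster)
        if pos < offset < end:
            return True
        pos = end
    return False


# "tadeva": TA, DA, VOWEL SIGN E, VA (illustrative sample, not from UAX#29).
TADEVA = "\u0924\u0926\u0947\u0935"

# The word boundary of "tad | eva" falls after DA, i.e. at offset 2,
# between the consonant and the vowel sign that belongs to the next word.
WORD_BREAK_OFFSET = 2

if __name__ == "__main__":
    print(legacy_like_clusters(TADEVA))
    print(boundary_inside_cluster(TADEVA, WORD_BREAK_OFFSET))
```

Under these assumptions the consonant DA and the vowel sign E form a single cluster, so the boundary between "tad" and "eva" lands inside it, which is the situation the quoted sentence says should not occur.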
Richard.
Received on Wed Dec 13 2017 - 12:37:10 CST