Re: unicode Digest V12 #108

From: Philippe Verdy <>
Date: Fri, 8 Jul 2011 22:54:46 +0200

2011/7/6 Asmus Freytag <>:
> On 7/3/2011 6:31 AM, Philippe Verdy wrote:
>> Regarding the previous comment about the Danish "aa",
> Sorry, most of that discussion missed the mark.
> "Modern" Danish can have "AA" for two reasons. Accidental occurrence, as in
> "dataanalyse" which is composed of two words which just happens to put two
> "A" together. The other is frozen spellings for names and the like. In the
> former case, you can never use "Å", in the latter case, you may not want to.

I had already understood that perfectly. Maybe you read only part of
my message and assumed that I consider the two cases equivalent,
which I don't. This was clear in my message.

> In the former case, you do not want to sort "AA" as if it was "Å", in the
> latter case, you do.
> None of that has anything to do with ASCII - it's a question of orthographic
> practices, not of legacy encoding.

Here again, I have not asserted anything about ASCII, except that it
was used (and probably continues to be used) in Danish practice when
"å" is not available in a more limited repertoire (including in DNS,
where IDNA is not an option).

> Because accidental digraphs (in Danish) happen at word boundaries in a
> compound, the SHY is an elegant way to mark them.

Yes, OK, but in a text where one does not want any word-breaking in
the rendered paragraphs (with or without justification of whitespace
or microjustification), it would be inconvenient. In fact, in a
previous message I had already favored ZWNJ for that additional
control, just as I also favor ZWJ for the usual Danish digram when it
occurs in a Danish word (such as a proper name) inserted in a
non-Danish text rendered with automatic word-breaking.

For such multilingual documents, I suspect that software should
probably not even recognize the Danish digram in those limited
occurrences of Danish within text in another language that doesn't
have this digraph, for example when indexing whole texts to create
word lists, notably sorted lists, as the unusual Danish word would
otherwise not be found at the expected place in the index.

My opinion is still that, in an almost 100% Danish text, nothing is
needed: the document should only be parsed globally, knowing (or at
least guessing) in which language it is written; then you can use an
external dictionary lookup for exceptions such as occurrences in
glued compound words like "dataanalyse" (in this word, anyway, the
gluing can be explicitly encoded with ZWNJ, or possibly better with
Word Joiner, which has the advantage of remaining outside the first
grapheme cluster instead of being part of it, as ZWNJ is).
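
To make the dictionary-plus-marker idea concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions,
not a real collation implementation: the names danish_sort_key and
COMPOUND_EXCEPTIONS are hypothetical, the alphabet order is a
simplified a..z, æ, ø, å, and "aa" is treated as fully identical to
"å", which a real Danish tailoring does not quite do. SHY, ZWNJ and
Word Joiner are all accepted as markers that keep two "a" apart.

```python
# Hypothetical sketch: a Danish-style sort key that treats "aa" as "å"
# unless an invisible marker or a dictionary exception says otherwise.
SHY = "\u00ad"   # SOFT HYPHEN
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
WJ = "\u2060"    # WORD JOINER
MARKERS = {SHY, ZWNJ, WJ}

# Assumed exception dictionary: compound word -> index of the boundary
# between its two parts ("data" + "analyse").
COMPOUND_EXCEPTIONS = {"dataanalyse": 4}

# Simplified alphabet order (assumption): a..z, then æ, ø, å.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_sort_key(word, exceptions=COMPOUND_EXCEPTIONS):
    """Return a list of letter ranks usable as a sort key."""
    text = word.lower()
    bare = "".join(ch for ch in text if ch not in MARKERS)
    split = exceptions.get(bare)
    # No explicit marker in the text: fall back to the dictionary.
    if split is not None and not any(ch in MARKERS for ch in text):
        text = bare[:split] + WJ + bare[split:]
    letters = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in MARKERS:
            i += 1  # markers only separate letters, they have no weight
            continue
        if ch == "a" and i + 1 < len(text) and text[i + 1] == "a":
            letters.append("å")  # adjacent "aa" is read as the digram
            i += 2
            continue
        letters.append(ch)
        i += 1
    # Unknown characters sort after the whole alphabet.
    return [RANK.get(ch, len(RANK)) for ch in letters]


words = ["Aalborg", "dataanalyse", "zebra"]
print(sorted(words, key=danish_sort_key))
```

With these assumptions "Aalborg" sorts after "zebra" (its "Aa" counts
as "å"), while "dataanalyse" stays under "d" with a plain "a" + "a",
whether the boundary comes from the dictionary or from an explicit
marker in the text.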

There's no good universal solution. Users will need to adapt to their
working environment and to the rendering and computed semantics they
get with each option.

Received on Fri Jul 08 2011 - 15:58:34 CDT
