Re: Invisible Distinctions

From: Asmus Freytag
Date: Thu Jun 29 2006 - 01:08:31 CDT

On 6/28/2006 2:37 PM, Richard Wordingham wrote:
> Does anyone know of invisible encoding distinctions (*not* canonically
> equivalent) actually being deliberately used by significant groups of
> users? I can think of a few possibilities:
> (1) <U+17D2 KHMER SIGN COENG, U+178A KHMER LETTER DA> v. <U+17D2,
> U+178F KHMER LETTER TA>. It is recommended that the choice be made
> according to pronunciation, but this would be unetymological in a few
> words.
> (2) Use of <U+034F COMBINING GRAPHEME JOINER> (CGJ) to distinguish
> digraphs from accidental sequences in sorting. The usual example
> given is Slovak 'ch'; Welsh 'ng' could also become a significant
> possibility. The two cases (Hebrew and German) where it is intended
> to affect the rendering are not relevant to my question.
The only German issue that the UTC considered is one of sorting, not
rendering.
> However, I have no evidence of whether these distinctions are actually
> being made by a significant number of users.
For the German distinction in sorting we have a dedicated user
community, i.e. those libraries signed up on the proposal. Unless
someone has actual intelligence to the contrary, I would expect that
they are in fact making the distinction they requested.
> I can well imagine CGJ only being used on keywords, and then only when
> the sorting process otherwise yields the wrong order for the data set
> in actual use.
The German case sorts U+0308 differently depending on whether it is
required for German (acting as an umlaut) or whether it's an optional
spelling indicating that two vowels are pronounced distinctly. This
distinction makes sense in German, esp. for cataloging purposes, as
there is a common alternate spelling of Umlauts (mostly historical/in
names) using the letter 'e'. Therefore sorting U+0308 like an 'e' brings
similar sounding names together in the catalog, unburdening the users
from having to constantly consider spelling variations.

At the same time, this would be completely incorrect for the other use
of U+0308, esp. as the common alternate spelling is to leave out the
mark. Therefore, in those instances U+0308 really needs to be sorted as
an accent.
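
To make the two behaviours concrete, here is a minimal Python sketch of
such a catalog sort key. It is not the libraries' actual implementation.
It assumes the convention that a CGJ placed directly before U+0308 marks
the diaeresis, while an unmarked U+0308 is an umlaut; the function name
catalog_key and the two-level key layout are invented for illustration.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def catalog_key(name):
    """Illustrative two-level sort key: (base letters, accents)."""
    text = unicodedata.normalize("NFD", name.casefold())
    primary, secondary = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == CGJ:
            if text[i + 1:i + 2] == DIAERESIS:
                # Assumed convention: CGJ + U+0308 is a diaeresis,
                # so it only counts as an accent (tie-breaker level).
                secondary.append(DIAERESIS)
                i += 1  # consume the mark as well
            # A CGJ anywhere else is simply ignored.
        elif ch == DIAERESIS:
            # Unmarked U+0308 is taken to be an umlaut and sorts like 'e'.
            primary.append("e")
        elif unicodedata.combining(ch):
            # Any other combining mark is an accent-level difference.
            secondary.append(ch)
        else:
            primary.append(ch)
        i += 1
    return "".join(primary), "".join(secondary)


names = ["Muller", "Müller", "Mueller", "Pie\u034F\u0308ch"]
for n in sorted(names, key=catalog_key):
    print(n, catalog_key(n))
```

With this key, "Müller" and "Mueller" land next to each other, while the
marked diaeresis in the last name leaves its base letters untouched.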

Yes, this is meaningful only in (large) lists of names/titles/places,
where examples of this would show up consistently, not in random word
lists or a short index.

That's why the library system was indeed a very credible user group to
make this kind of request.

There is a long-standing Danish case where the spelling reform of 1948
made AA into A-ring, except in certain names and place names,
or pre-reform titles. Such 'aa' will now sort, like a-ring, after Z.

However, some compound words have accidental occurrence of a double 'a',
such as dataanalyse. In that case, inserting SHY will take care of the
sorting (preventing a<SHY>a from being recognized as a-ring) and at the
same time provide the line-break opportunity. It's a nice solution to
not having to have BOTH a CGJ and SHY in the text, but it suffers from
not being part of a general model - not all semantic distinctions happen
at a word boundary inside a compound word!
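
As a rough illustration of the SHY approach, the following Python sketch
contracts 'aa' to a-ring for sorting unless a SHY separates the two
letters. It is a simplified assumption of how such a key might look, not
a description of any existing Danish implementation; the name danish_key
and the flat rank table are invented here, and real collation would
handle case, accents and runs like 'aaa' more carefully.

```python
import unicodedata

SHY = "\u00AD"  # SOFT HYPHEN

# Simplified Danish alphabet order: a-ring comes last, after z, ae, o-slash.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key(word):
    """Illustrative sort key: 'aa' sorts as a-ring unless split by SHY."""
    text = unicodedata.normalize("NFC", word.casefold())
    # Contract the digraph first. In a<SHY>a the SHY sits between the
    # two letters, so that sequence is not matched here.
    text = text.replace("aa", "å")
    # Only afterwards is the SHY itself dropped from the key.
    text = text.replace(SHY, "")
    # Characters outside the table all get the same rank after a-ring.
    return [RANK.get(ch, len(ALPHABET)) for ch in text]


words = ["zebra", "Aalborg", "datter", "data\u00ADanalyse", "dataanalyse"]
for w in sorted(words, key=danish_key):
    print(w.replace(SHY, "<SHY>"))
```

In this sketch the unmarked spelling of dataanalyse is wrongly contracted
and sorts after datter, while the spelling with SHY sorts before it.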

Nevertheless, it's been advocated for over a decade now, so if sorting
dataanalyse is in fact a problem in Danish, I would assume that it's
been implemented somewhere.

Spanish used to sort the 'ch' as a cluster, but has reformed to conform
to limits in technology. I don't know whether a non-clustered sorting of
'ch' would have been required. The German and Danish cases exhibit the
feature that both forms are required and must be treated differently.

> Richard.

This archive was generated by hypermail 2.1.5 : Thu Jun 29 2006 - 01:32:17 CDT