Re: Mixed up priorities

From: Otfried Cheong (otfried@cs.ust.hk)
Date: Fri Oct 22 1999 - 02:10:03 EDT


> >The English 'ch' can be separated into 't-sh', though 'sh' and 'th'
> >cannot be.

Actually, `ch' cannot be separated in English; it is a single phoneme.

> It is completely irrelevant how other languages treat the "ch". It is a
> character in at least two languages. If that's not good enough, why not
> remove the thorn from Unicode. Or the slashed L (only used by one
> language). Or the scharfes S. Or the Hungarian umlaut. (No, I'm not
> suggesting any of that, and yes, I know they all are written as a single
> glyph, but Unicode encodes characters, not glyphs.)

Actually, the scharfes S (which is really an awful misnomer) isn't
really a letter---it's a glyph variant (or ligature) of ss or sz. It
is an important variant, though, and it is semantically important to
be able to distinguish scharfes S from ss in plain text, so it needs
to be encoded separately somehow.
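
As a rough illustration of that last point, here is a small Python
sketch using the standard unicodedata module. The word pair
Maße/Masse is my own choice of example, not something from the
discussion above. It only shows that the sharp s is encoded as its own
character with no decomposition, so the two spellings differ in plain
text until case folding merges them.

```python
import unicodedata

sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

# The sharp s carries no decomposition mapping to "ss" in the character data.
has_decomposition = unicodedata.decomposition(sharp_s) != ""

# Two German words that differ only in sharp s versus ss.
with_sharp_s = "Ma\u00DFe"
with_double_s = "Masse"

distinct_in_plain_text = with_sharp_s != with_double_s
same_after_casefold = with_sharp_s.casefold() == with_double_s.casefold()

print(unicodedata.name(sharp_s))
print(has_decomposition, distinct_in_plain_text, same_after_casefold)
```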

If the same reasoning applies to CH in Slovak, then there is no good
reason not to encode it. Unicode has long accepted that there can be
letters/characters whose glyph looks like two letters, such as

LATIN CAPITAL LETTER LJ
LATIN CAPITAL LETTER NJ
LATIN CAPITAL LETTER DZ

An interesting case is

LATIN SMALL LIGATURE IJ

Unicode calls it a "ligature", but it really is considered a letter in
Dutch.

I suppose in these cases no discussion was necessary because they were
already in some legacy encoding and had to be encoded for compatibility
reasons. The Slovak national standards body apparently never came
forward with a national encoding that contained a LETTER CH.
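
The compatibility status of these characters can be inspected with
Python's standard unicodedata module. This is only a sketch for looking
at the character data; the helper name describe is made up for the
example. Each of the four characters carries a <compat> decomposition
into two ordinary letters, which is what one would expect if they were
encoded for compatibility.

```python
import unicodedata

names = [
    "LATIN CAPITAL LETTER LJ",
    "LATIN CAPITAL LETTER NJ",
    "LATIN CAPITAL LETTER DZ",
    "LATIN SMALL LIGATURE IJ",
]

def describe(name):
    """Return (code point, decomposition field, NFKD form) for a character name."""
    ch = unicodedata.lookup(name)
    return (
        f"U+{ord(ch):04X}",
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFKD", ch),
    )

table = {name: describe(name) for name in names}

for name, (code_point, decomposition, nfkd) in table.items():
    print(code_point, name, "|", decomposition, "|", nfkd)
```

Note that only the compatibility normalization forms (NFKD, NFKC) split
these characters; under NFC and NFD they stay single code points.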

The current strategy seems to be not to encode any more decomposable
entities. But even if we take this as gospel, the case for CH is not
clear. It may be possible to decompose it as C + H, but this does not
allow me to distinguish between <THE LETTER CH> and <THE LETTER
C><THE LETTER H>.
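
To make the problem concrete, here is a minimal Python sketch. The
function split_letters and the letter sequences are hypothetical and
chosen only for illustration; they are not real Slovak words. Two
different letter sequences serialize to the same plain text, and a
greedy reader that treats every c followed by h as one letter cannot
recover the second of them.

```python
def split_letters(text, digraphs=("ch",)):
    """Greedily split text into letters, treating each listed digraph as one letter."""
    letters, i = [], 0
    while i < len(text):
        for d in digraphs:
            if text.startswith(d, i):
                letters.append(d)
                i += len(d)
                break
        else:
            # No digraph starts here, so take a single character.
            letters.append(text[i])
            i += 1
    return letters

# Hypothetical letter sequences: <LETTER CH><a> versus <c><h><a>.
one_letter = ["ch", "a"]
two_letters = ["c", "h", "a"]

# Written out as plain text, both become the same string.
plain_one = "".join(one_letter)
plain_two = "".join(two_letters)

print(plain_one == plain_two)
print(split_letters(plain_two))
```

Whatever rule the reader applies, it has to guess, because the
information was lost when the text was written down as C + H.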

If it is necessary to be able to make this distinction in plain text
in any language that Unicode claims to cater for, then Unicode must
provide the means to make it. Whether this is done by creating a new
LETTER CH or by creating a COMBINING LETTER H is a different question.

Otfried


