Digraphs as Distinct Logical Units

From: Sean B. Palmer (sean@mysterylights.com)
Date: Fri Aug 02 2002 - 16:17:56 EDT


Hi all,

I came across the following point in the Unicode FAQ that explains why the
Unicode standard does not contain any characters for digraphs:-

http://www.unicode.org/unicode/faq/ligature_digraph.html#3

I find the comments therein rather perplexing, especially since, if each
digraph were in fact denoted by a single new glyph, it would certainly have
been included. As such a combination of glyphs is treated as a single
character in every way--from sorting in dictionaries to filling in
crossword puzzles--it seems counterintuitive that we should have to rely
upon (albeit well-developed) heuristics to collate words.
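
To make the collation point concrete, here is a minimal sketch of the kind
of heuristic involved, assuming a traditional Spanish ordering in which
"ch" and "ll" count as letters in their own right. The names ALPHABET,
letters and sort_key are made up for illustration, and the greedy matching
is only one possible approach, not what any particular library does:

```python
# Assumed alphabet: traditional Spanish order, with "ch" and "ll" as letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = {letter for letter in ALPHABET if len(letter) == 2}


def letters(word):
    """Split a word into logical letters, greedily preferring digraphs."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(pair)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def sort_key(word):
    """Rank each logical letter; unknown characters sort last."""
    return [RANK.get(letter, len(ALPHABET)) for letter in letters(word)]


words = ["cuna", "chico", "dama", "cosa", "llama", "luz"]
print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # digraph-aware order
```

The greedy split is exactly where the guesswork comes in: it cannot tell a
true digraph from two letters that merely happen to be adjacent (the "ng"
in Welsh "Bangor" is, as far as I know, n followed by g), which is why
marking the digraph in the text itself looks attractive.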

In all practicality, I do not expect that writers in languages such as
Spanish, Hungarian, and Welsh, where digraphs are used fairly commonly,
would immediately change all their texts to use the appropriate single
Unicode characters, had they existed. Of course, it is also true that a
decade or two ago, common substitutions such as i^ for i-with-circumflex
had to be made. Since then, these characters have appeared in various
character "sets", and whilst it's still fairly common to leave the
diacritics off, a great many more people are using the proper characters.

Since there are 676 possible digraph combinations (26 x 26 pairs of the
letters a-z), I endeavoured to come up with a simpler approach to marking
a digraph as a single character than simply creating a codepoint for each
one. I have two ideas so far:-

* Come up with a set of A-Za-z combining characters, such that c +
combining-h would form a "ch" grapheme
* Come up with a digraph combining character, such that c + h +
digraph-combining-character forms the "ch" grapheme

The former idea is the more costly, since it involves reserving more
codepoints, is not backwards compatible in languages such as HTML (with the
latter solution, you can use CSS to prevent the digraph combining
character from being displayed), and is limited to Latin digraphs.
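
As a rough sketch of how the second idea might behave in software: no such
character exists, so the code below uses the private-use code point U+E000
purely as a stand-in, and the names DIGRAPH_MARK, units and fallback are
hypothetical:

```python
# Hypothetical stand-in for the proposed digraph combining character.
# U+E000 is a private-use code point; nothing here is assigned by Unicode.
DIGRAPH_MARK = "\ue000"


def units(text):
    """Split text into logical letters, joining the two letters before a mark."""
    out = []
    for ch in text:
        if ch != DIGRAPH_MARK:
            out.append(ch)
        elif len(out) >= 2 and len(out[-1]) == 1 and len(out[-2]) == 1:
            second = out.pop()
            first = out.pop()
            out.append(first + second)
        # Otherwise the mark is stray (no two plain letters before it)
        # and is silently dropped.
    return out


def fallback(text):
    """What a consumer that ignores the mark would see."""
    return text.replace(DIGRAPH_MARK, "")


marked = "ch" + DIGRAPH_MARK + "ico"
print(units(marked))     # "ch" comes out as one unit
print(units("chico"))    # unmarked text stays letter by letter
print(fallback(marked))  # plain "chico"
```

The fallback function is the backwards-compatibility argument in
miniature: stripping or hiding the one extra character leaves ordinary
text, whereas a set of combining letters would leave a consumer that does
not know about them without the second letter of each digraph.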

If anyone has any comments on this, or any references to previous
discussions, they would be gladly received.

--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .


