We're taking a look at the encoding needs of one of our SIL
field entities - in Papua New Guinea - in relation to Unicode
as a trial for doing the same for our other entities around the
world that use extended Latin script. There are some issues
that arise, and I want to see if these have been considered
before and, if so, whether there are established preferences
for how these matters should be dealt with.
The following discussion makes reference to SIL PNG's standard
codepage, shown in the attached PDF. My understanding (I need
to double check this) is that this codepage covers the
character needs of most or all of the written languages of PNG
(which number in the hundreds).
The first question relates to the characters 0x8D and 0x8E, L/l
with equal sign overlay. These are not currently defined in
Unicode, neither is there a combining equal sign overlay
character. Would it be preferable to propose addition of one
combining character or of a pair of composite characters (with
no canonical decomposition)?
The second question relates to the following pairs of characters:
0x8F, 0x90 L/l with tilde overlay
0x9A, 0x9B U/u with middle bar
0xD0, 0xF0 L/l with middle bar
For each of these pairs, the lower case character - and only
the lower case character - is already defined in the standard:
U+026B LATIN SMALL LETTER L WITH MIDDLE TILDE
U+0289 LATIN SMALL LETTER U BAR
U+019A LATIN SMALL LETTER L WITH BAR
All three of these characters could potentially have canonical
decompositions using existing characters, but in fact none of
these three characters has a canonical decomposition.
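As a rough check of the statement above, the decomposition field of the
character database can be inspected directly. This is only a sketch: it
assumes Python's standard unicodedata module, and the helper name
canonical_decomposition is mine, not anything defined by the standard.

```python
import unicodedata

# The three lower-case letters named in the message.
LOWER_CASE = ["\u026B", "\u0289", "\u019A"]


def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping of ch, or "" if it has none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings begin with a tag such as "<compat>"; skip those,
    # since only canonical decompositions matter for this question.
    return "" if mapping.startswith("<") else mapping


for ch in LOWER_CASE:
    nfd_unchanged = unicodedata.normalize("NFD", ch) == ch
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"decomposition={canonical_decomposition(ch)!r}, "
        f"NFD unchanged={nfd_unchanged}"
    )
```

For comparison, the same helper returns a non-empty mapping for a letter
such as U+00E9, which does decompose canonically.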
The upper case counterparts to all three could be encoded using
combining sequences as follows:
L with tilde overlay: 004C + 0334
U with middle bar: 0055 + 0335
L with middle bar: 004C + 0335
(It's not entirely clear that U+0335 is the appropriate
combining mark for the latter two; the distinction between
U+0335 and U+0336 appears to be purely visual. U+0335 seems to
me to be the better choice here. I think it would be good to
clarify which should be used for cases like this.)
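The sequences listed above can be built and examined as follows. Again
this is only a sketch using Python's unicodedata module; the dictionary
name UPPER_SEQUENCES and the labels in it are mine, and the choice of
U+0335 simply follows the preference I expressed above rather than any
established rule.

```python
import unicodedata

TILDE_OVERLAY = "\u0334"  # COMBINING TILDE OVERLAY
SHORT_STROKE = "\u0335"   # COMBINING SHORT STROKE OVERLAY
LONG_STROKE = "\u0336"    # COMBINING LONG STROKE OVERLAY

# Upper-case forms expressed as base letter + combining overlay.
UPPER_SEQUENCES = {
    "L with tilde overlay": "L" + TILDE_OVERLAY,
    "U with middle bar": "U" + SHORT_STROKE,
    "L with middle bar": "L" + SHORT_STROKE,
}


def describe(seq: str) -> str:
    """List the code points of seq with their character names."""
    return " + ".join(f"{ord(c):04X} {unicodedata.name(c)}" for c in seq)


for label, seq in UPPER_SEQUENCES.items():
    # NFC leaves each sequence alone: nothing composes with these overlays.
    stable = unicodedata.normalize("NFC", seq) == seq
    print(f"{label}: {describe(seq)} (unchanged by NFC: {stable})")

# The two stroke overlays share a combining class, so normalization
# gives no guidance on which of them to prefer.
for mark in (SHORT_STROKE, LONG_STROKE):
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}: "
          f"combining class {unicodedata.combining(mark)}")
```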
The question is this: Is there any potential problem with having
an Ll character with no decomposition that gets case mapped to an
Lu character that is defined only as a (decomposed) sequence?
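To make the concern concrete, here is a sketch of one way such a case
mapping might misbehave. The table HYPOTHETICAL_UPPER and the two helper
functions are purely illustrative assumptions of mine; nothing like them
is defined by the standard. The sketch upper-cases through the table and
then lower-cases with the ordinary per-character mapping.

```python
import unicodedata

# Hypothetical special-casing table (an assumption for illustration only):
# each lower-case letter maps to a base letter plus a combining overlay.
HYPOTHETICAL_UPPER = {
    "\u026B": "L\u0334",  # l with middle tilde -> L + tilde overlay
    "\u0289": "U\u0335",  # u bar -> U + short stroke overlay
    "\u019A": "L\u0335",  # l with bar -> L + short stroke overlay
}


def upper_with_sequences(text: str) -> str:
    """Upper-case text, consulting the hypothetical table first."""
    return "".join(HYPOTHETICAL_UPPER.get(ch, ch.upper()) for ch in text)


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have identical canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


for lower in HYPOTHETICAL_UPPER:
    upper = upper_with_sequences(lower)
    # Ordinary lower-casing knows nothing about the table above.
    back = upper.lower()
    print(
        f"U+{ord(lower):04X} -> {[f'{ord(c):04X}' for c in upper]} "
        f"-> {[f'{ord(c):04X}' for c in back]}, "
        f"round trip equivalent: {canonically_equivalent(back, lower)}"
    )
```

Because the lower-case letters have no canonical decomposition, the
result of the round trip (a small base letter plus overlay) is not
canonically equivalent to the original single character, and the string
length changes along the way. Whether that matters in practice is the
thing I am asking about.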
The alternative would be to propose the upper case characters
as additions to the standard, but if added they would
certainly have to be added without canonical decompositions. (I
don't think we'd want a case pair where one is decomposable but
the other is not. Decompositions for the lower case characters
could, of course, in principle be added. But any additional
decompositions are to be avoided at all costs since they create
problems for existing implementations.)
You may notice some other items of curiosity in this codepage.
I don't have all the facts yet, so I'm not looking to discuss
anything more than the questions I've raised here.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT