From: Kent Karlsson (firstname.lastname@example.org)
Date: Tue Apr 04 2006 - 15:46:24 CST
James Kass wrote:
> Kent Karlsson wrote,
>>> > If it is Unicode's official position that traditional Malayalam
>>> > use U+0D4C and that reformed Malayalam must use U+0D57,
>> That would be it. And it's so obvious that that is how it must work
>> that I worry that if you (James) have such a hard time with it, how
>> many other more subtle issues in Unicode you've gotten all wrong.
> That which is obvious depends upon the point of view.
> Consider how someone who is "outside-looking-in" would regard
> the problems associated with the Latin two-contour vowel sign
> LATIN SMALL LETTER I.
I agree that is a nasty little thing... To overcome part of that nastiness
I proposed the soft-dotted property (so that this was recorded as a
formal property, not just something that "everyone is supposed to know").
If it had been possible to do the encoding of the Latin script de novo,
but still much like Unicode is now with properties and decompositions,
I would have argued for having "ordinary" small 'i' to be canonically
decomposable... (GASP!) ...with the Turkish casing rules as the simple
case, special-casing (pun intended) for the more common case.
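
A minimal sketch of the situation as Unicode stands today, assuming Python's
standard `unicodedata` module (the helper name `canonical_parts` is made up
for illustration): small 'i' is atomic, while the dotted capital U+0130 and
accented forms such as U+00ED do have canonical decompositions.

```python
import unicodedata


def canonical_parts(ch):
    """Return the code points of the canonical decomposition (NFD) of ch."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


# Small 'i' has no canonical decomposition: the dot is not a separate unit.
i_parts = canonical_parts("i")

# Capital I with dot above decomposes into I + COMBINING DOT ABOVE.
dotted_capital_parts = canonical_parts("\u0130")

# Small i with acute decomposes into plain 'i' + COMBINING ACUTE ACCENT;
# that the dot then goes away visually is what soft-dotted records.
i_acute_parts = canonical_parts("\u00ED")

print(i_parts, dotted_capital_parts, i_acute_parts)
```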
> Suppose there was a standards body in Kerala responsible
> for making a universal character set.
> Obviously, for casing pairs U+0130 would have to be the upper case
> form for LATIN SMALL LETTER I.
It is, for Turkish; see SpecialCasing.txt.
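
To make the Turkish case concrete, here is a rough sketch in Python. The
built-in `str.upper()` and `str.lower()` apply only the language-independent
mappings, so the helpers `upper_tr` and `lower_tr` below are hypothetical
names introduced here, and they deliberately ignore the context-sensitive
rules for sequences involving U+0307.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_tr(text):
    """Uppercase with the Turkish pairing i -> U+0130 (simplified sketch)."""
    # Map dotted small i first; the default mapping already sends
    # dotless small i (U+0131) to plain capital I.
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def lower_tr(text):
    """Lowercase with the Turkish pairing I -> U+0131 (simplified sketch)."""
    # Handle U+0130 before calling lower(), since the default lowercase
    # of U+0130 is the two-character sequence 'i' + U+0307.
    text = text.replace(DOTTED_CAPITAL_I, "i")
    return text.replace("I", DOTLESS_SMALL_I).lower()


default_upper = "i".upper()
turkish_upper = upper_tr("diyarbak\u0131r")
turkish_lower = lower_tr("D\u0130YARBAKIR")
print(default_upper, turkish_upper, turkish_lower)
```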
> If there were alternative orthographies for the Latin script, like
> roman or uncial for writing one Latin user community's language,
> and one form used the two-contour vowel sign in lower case (U+0069)
> while the other form used a one-contour glyph variant (U+0131), then
I'm very apprehensive about the "glyph variant" argument. I would not
consider those two "glyph variants".
And the dot on small i does NOT always disappear when accents are
applied. No, those cases are NOT considered glyph variants, as
evidenced by the SpecialCasing.txt rules for Lithuanian. I would
also not consider cedilla and comma below as glyph variants.
If one uses the CEDILLA, one should actually get cedilla
(regardless of font) and if one uses COMMA BELOW, one should
actually get comma below (but *for typographic space* reasons,
that actually is displayed as an inverted comma above if applied to
a small g).
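
That cedilla and comma below are distinct at the encoding level can be
checked directly. The following is only a sketch using Python's standard
`unicodedata` module; the helper name `mark_names` is an assumption made
for this example.

```python
import unicodedata


def mark_names(ch):
    """Return the character names in the canonical decomposition of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


s_cedilla = mark_names("\u015F")      # LATIN SMALL LETTER S WITH CEDILLA
s_comma_below = mark_names("\u0219")  # LATIN SMALL LETTER S WITH COMMA BELOW

# The two letters decompose to different combining marks, so no
# normalization form makes them equal.
same_after_nfkd = (
    unicodedata.normalize("NFKD", "\u015F")
    == unicodedata.normalize("NFKD", "\u0219")
)
print(s_cedilla, s_comma_below, same_after_nfkd)
```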
As Antoine says, the line is hard to draw in some cases. But not
w.r.t. two-part vowels...
> those favoring the one-contour glyph variant would be forced to
> use U+0131 even if they considered U+0069 and U+0131 to be
> semantically equivalent and representative of the same atomic
What semantics? Phonetics? That is irrelevant to character encoding.
Character properties? Those are (from the Unicode point of view) part
of the character semantics. The character properties are chosen on a
variety of criteria, and aren't all that obvious in all cases. There
are even errors that aren't corrected, some for reasons of a formal
stability guarantee, some for less firm reasons. E.g., the "Thai
danda" is recorded as a letter, though it should be recorded as
> character. If this should break their existing implementations,
> well, that would be too bad. If the government of that user
> community were faced with transcoding existing material and
> maintaining doubled files for everything, I'm sure that they'd
"maintain double files for everything"? Why that?
> all jump at the chance to do so rather than risk being called
> Now, in some cases that Latin two-contour vowel sign must lose
> one contour. An example is when the two-contour vowel sign is
> followed by an above-combining mark. The obvious solution is
> to require that U+0305 be used in these situations rather than
> U+0069. [[ corr: U+0305 -> U+0131 ]]
In a de novo setting... See above.
> Of course, if the standards body were faced with a similar problem
> for the script used in Kerala, the script with which they are most
> familiar, they'd probably come up with something like a "soft
> left-hand side" property for their two-part vowel. It may not
Well, it is a possible way of doing it. BUT:
* there is no other "flicky two-part vowel"
* both parts are already encoded as characters
* it already has a canonical decomposition
* the left side only character can be used, same script
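
The points in the list above about existing characters and the existing
canonical decomposition can be verified with a short sketch, again assuming
Python's standard `unicodedata` module (variable names are illustrative).

```python
import unicodedata

VOWEL_SIGN_AU = "\u0D4C"    # two-part vowel sign
VOWEL_SIGN_E = "\u0D46"     # the left-side part, encoded on its own
AU_LENGTH_MARK = "\u0D57"   # the right-side part, encoded on its own

# Raw decomposition field from the character database.
decomposition_field = unicodedata.decomposition(VOWEL_SIGN_AU)

# NFD splits the two-part vowel into its two encoded parts,
# and NFC puts them back together.
decomposed = unicodedata.normalize("NFD", VOWEL_SIGN_AU)
recomposed = unicodedata.normalize("NFC", VOWEL_SIGN_E + AU_LENGTH_MARK)

part_names = [unicodedata.name(c) for c in decomposed]
print(decomposition_field, part_names)
```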
> occur to that standards body that an approach similar to their
> "soft left-hand side" property could be applied to the pesky little
> dot on that Latin letter.
>> I agree. And I'm getting rather tired of beating this particular
>> dead horse, despite James's opposition.
> Any list member should be able to request a post-mortem in the
> event of an unexpected or unexplained death, and should be able
> to do so without being slammed, shouted at, or trivialized.
I don't think I've done that. But the arguments are going round and
round again. And not much progress.
Unicode is not a magic wand unifying different spellings,
unveiling them only when given "special orthography" fonts.
It encodes characters, units of writing; and for the two-part
vowels, there really are two units of writing being used (though
there are some missing decompositions for some of the scripts).
I consider it a requirement that texts (in the same language
and same script) in "old" and "new" orthographies can be
side by side in the same font ("plain text" if you like)
and be recognisable (and correctly renderable) as being in
"old" and "new" orthography, regardless of font (as long as
the font supports all of the characters in the text).
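
One consequence of this requirement is that the two spellings can be told
apart from the plain text alone, without consulting the font. A minimal
sketch, assuming the Malayalam AU case discussed earlier in this thread;
the function name `count_au_spellings` and the result keys are invented
for illustration.

```python
import unicodedata

VOWEL_SIGN_E = "\u0D46"     # left-side part of the two-part AU vowel sign
AU_LENGTH_MARK = "\u0D57"   # right-side part, also usable on its own


def count_au_spellings(text):
    """Count two-part and right-part-only AU spellings in text."""
    # NFD turns U+0D4C into U+0D46 U+0D57, so both encodings of the
    # two-part spelling are counted the same way.
    nfd = unicodedata.normalize("NFD", text)
    two_part = nfd.count(VOWEL_SIGN_E + AU_LENGTH_MARK)
    right_only = nfd.count(AU_LENGTH_MARK) - two_part
    return {"two_part": two_part, "right_only": right_only}


KA = "\u0D15"  # MALAYALAM LETTER KA
old_style = count_au_spellings(KA + "\u0D4C")
new_style = count_au_spellings(KA + AU_LENGTH_MARK)
print(old_style, new_style)
```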
"Old" vs. "new" overall font design is something quite different
(and I know it exists for Khmer; comparable to Fraktur vs. Antiqua).
Aside: I find it slightly surreal to have this debate alongside
the quite opposite one on danda disunification...