From: Nick Nicholas (opoudjis@optushome.com.au)
Date: Tue May 24 2005 - 02:54:28 CDT
from Dean Snyder:
> I gave several examples where glyphic
> information, in ancient texts, for example, is important information
> that is not conveyed when those texts are transliterated. Hence the
> utility of encoding those scripts.
>
>
I'm sorry, but the applications you've been talking about --- glyph-
based recognition, glyph-based restoration, caring that r and z in
Arabic differ by a dot --- require a glyph inventory. Not only
*cannot* Unicode provide you with that, it *must not*. Otherwise,
consider Italic lowercase Latin a: it is easily confusable with o,
whereas upright normal Latin a is not. This similarity is obviously
important for palaeographers or graphologists or whatever. Which
means --- what, that italic and upright a are to be disunified as
codepoints? (Mutatis mutandis, exactly the same goes for Serbian vs.
Russian italic Te. And no one bring up Latin Small Letter Alpha,
that's not used outside phonetic transcription.) If Roman-script
palaeographers can put up with two thousand years of ductus being
mooshed together into 52 codepoints, then cuneiformists can put up
with cuneiform being encoded on emic rather than etic principles
too. (Which is what I always meant by "Don't Prolif, Translit": the
emic repertoires are almost always cemented in transliteration, not
in ahistorical normalisation of the historical script.)
You've complained that transliteration is lossy. (Nice countering of
slogans, btw.) But at the level of glyph identity, so is going from
italics to upright. So is dropping language tagging of Serbian vs.
Russian. At the level of how deep the stylus impressions in the
tablet go, so are 2-D photographs, for that matter. The lossiness is
a given in any change of medium, or normalisation of glyphs, or
indeed any encoding at all. And specifically to what Unicode was
designed for, it's why *plain*text is not richtext. That does not
prove that the distinctions you may want to make are relevant to
plaintext; in fact, the more you speak of glyphs, the more it proves
the opposite. (Dots? There are no dots in a hex number.)
Transliteration is lossy; so is the character-glyph model. And that's
a *good* thing: I like being able to use "Find" on text, thank you.
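
A minimal Python sketch of the "Find" point, using only the standard unicodedata module. It assumes, purely for illustration, that U+1D44E MATHEMATICAL ITALIC SMALL A stands in for what a disunified "italic a" codepoint would look like; the sample word and the find() helper are hypothetical and not part of any proposal discussed here.

```python
import unicodedata

# Arabic reh and zain differ by a dot on paper; in plain text they are
# simply two different numbers.
reh, zain = "\u0631", "\u0632"
codepoints = [hex(ord(reh)), hex(ord(zain))]

# Upright and italic "a" share one codepoint, U+0061; the slant is styling.
plain_a = "a"

# Assumption for illustration: treat U+1D44E as a "disunified italic a".
italic_a = "\U0001D44E"

unified_text = "palaeography"
disunified_text = "p" + italic_a + "laeography"


def find(needle: str, haystack: str) -> bool:
    """Naive plain-text search, comparing codepoint sequences."""
    return needle in haystack


found_unified = find("pala", unified_text)
found_disunified = find("pala", disunified_text)

# Compatibility normalisation (NFKC) folds the glyph variant back onto
# the emic character, at the cost of losing the glyph distinction.
folded_text = unicodedata.normalize("NFKC", disunified_text)
found_folded = find("pala", folded_text)
```

In this sketch the search succeeds on the unified text, fails once the glyph variant has its own codepoint, and succeeds again only after the variant is folded away: the lossiness is what keeps the text searchable.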
You're to be lauded in envisaging ways a computer-driven glyph-
recognition system can revolutionise cuneiform studies. But that
cannot be Unicode's concern: Unicode has to provide an emic
repertoire of codepoints, whensoever possible.
To make myself clear: I don't oppose the Unicode encoding of
cuneiform --- more power to you. But where plaintext use of a script
is limited (and a lot of ancient script use is not obviously
plaintext), encoding that script is a much less pressing need: that's
a fact, and it's a fact because of the institutionalised preference
for transliteration of historical scripts. And I do object to the
risk of a script encoding ignoring the need to establish characters
out of glyphs, and making Unicode an open-ended glyph storehouse.
Where this can be avoided, it should. Where this means script
proposals need to be held up in discussion by whatever scholars UTC/
ISO brings over, it should. There's been some belittling of such
scholars in similar debates on this list and
elsewhere (the "spoilsport" reaction I refer to on my site); but for
all that David Starner doesn't "find the concerns of the academics
interesting" :-( , those academics have a crucial stake in preventing
poor encodings of their subject area, especially with the door shut
on canonical equivalences. (Yes, you can customise DUCET, but that's
patching.) It continues to baffle me that this is even arguable.
===
O Roeschen Roth! Der Mensch liegt in tiefster Noth! Der Mensch liegt in
tiefster Pein! Je lieber moecht' ich im Himmel sein! --- _Urlicht_
nickn@unimelb.edu.au http://www.opoudjis.net
Dr Nick NICHOLAS, French Italian & Spanish, Univ. Melbourne, Australia