From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 10 2004 - 20:10:31 CDT
> > Simply because some images appear in some
> > documents does not mean that they automatically should be
> > represented as encoded
> > characters.
>
> These aren't images. They're clearly letters; they occur in running texts and represent
> the sounds of a spoken language.
Well, I agree with that assessment.
> If I were transcribing them, I wouldn't encode them
> as pictures; I would encode them as PUA elements or XML elements (which are usually
> easier to use and more reliable than the PUA).
And with that assessment, as well.
> I'll admit that it's a bit sketchy encoding these characters based on one article by
> one author. But I think it important to remember that more and more text is available
> online, even stuff that might never get reprinted in hardcopy, and that needs Unicode.
And in general, I can't find fault with that, either.
But the argument in this particular case hinges on a particular,
nonce set of characters. We have this one scholar, who invented
a bunch of characters in the 1920s to represent click sounds that nobody
was doing justice to at that point, either in understanding their
phonetics or making sufficiently accurate distinctions in their
recording. Bully for Dokes -- it was an important advance in the
field of Khoisan studies and the phonetics of clicks. But even
though he published his analysis, using his characters, nobody
else chose to adopt his character conventions. Subsequent scholars,
and the IPA, chose *other* characters to represent the distinctions
involved, in part because Dokes' inventions were just weird and
hard to use, as well as neither (in my opinion) mnemonic nor
aesthetically pleasing.
Well, we've encoded ugly letters for ugly orthographies in ugly
scripts before. That isn't the issue. But the non-use of these
forms is.
It comes down then to a *prospective* claim that someone *might*
want to digitize the classic Dokes publication and that if they
did so they would require that the particular set of weird
phonetic letters used by Dokes would have to be representable
in Unicode plain text in order for that one publication to be
made available electronically. (Or a few other publications that
might cite Dokes verbatim, of course.)
Well, in terms of requirements, I consider that more than a little
cart before the horse. I'd be more sympathetic if someone was
actually *trying* to do this and had a technical problem with
representing the text accurately for an online edition which was
best resolved by adding a dozen characters to the Unicode Standard.
Then, at least there would be a valid *use* argument to be made,
as opposed to a scare claim that 50 years from now someone *might*
want to do this and not be able to if we don't encode these
characters right now.
Right *now* anyone could (if they had the rights) put a version of
Dokes online using PDF and an embedded font, and it would be perfectly
referenceable for anyone wanting access to the content of the
document. True, the dozen or so "weird" characters in the
orthography wouldn't have standard encodings, so searching inside
the document for them wouldn't be optimal. But is the burden that
might place on the dozen or so Khoisan orthographic historians and
phonetic historians who might actually be interested in doing so
out of scale with the burden placed permanently on the standard
itself for adding a dozen or so nonce characters for that *one*
document? After all, those historians and scholars today are
basically using the document in its printed-only (out-of-print)
hard copy format, and we aren't exactly worried about the difficulties
that *that* poses them, now are we?
I might point out at this point that the Unicode Standard itself is
published online using non-standard encodings for many of its
textual examples, simply because of the limitations of FrameMaker
and PDF and fonts and the specialized requirements of citing lots
and lots of characters outside normal text contexts. But I don't
hear people yelling that the online Unicode Standard is crippled for
use by people who wish to refer to it because you can't do an
automated search for <ksha> in it which will accurately find all
instances of Devanagari ksha in the text.
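For contrast, here is roughly what that search looks like when the text
*is* in standard encoding. This is only a sketch: ksha is the sequence
KA + VIRAMA + SSA (U+0915 U+094D U+0937), and the function name and
sample word are made up for illustration.

```python
# Devanagari ksha as a standard Unicode sequence: KA + VIRAMA + SSA.
KSHA = "\u0915\u094D\u0937"


def find_ksha(text: str) -> list[int]:
    """Return the start index of every occurrence of ksha in text."""
    positions = []
    start = text.find(KSHA)
    while start != -1:
        positions.append(start)
        start = text.find(KSHA, start + 1)
    return positions


# Sample word: A + KA + VIRAMA + SSA + RA
sample = "\u0905\u0915\u094D\u0937\u0930"
print(find_ksha(sample))
```

With a font-specific encoding in the PDF, the same glyph could sit
behind arbitrary byte values, and a plain substring search like this
would not find it.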
And the *database* arguments just don't cut it. If anybody is seriously
going to be using Dokes materials in comparative Khoisan studies,
they will *normalize* the material in their text databases.
After all, this is just one of a large variety of really varied
material, in all kinds of orthographies, and in all levels of
detail and quality. Arguing for making these particular dozen
nonce characters searchable by giving them standard Unicode values
just doesn't cut it for me, because if I were going to do that kind
of work, a significant amount of philological work would be required
to "massage" the data into comparable formats, anyway, and use of
intermediate normalized conventions would not be a problem -- in fact,
it would almost be mandatory.
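A rough illustration of that kind of normalization step, as a minimal
Python sketch. It assumes the source was transcribed with Private Use
Area code points standing in for the nonce letters. The PUA values
(U+E000 onward), the pairing of each one with a modern click letter,
and the function names are all hypothetical placeholders, not a
published mapping.

```python
# Hypothetical table: private-use code points (assumed transcription
# convention) mapped to the click letters already encoded in Unicode.
# Which nonce letter corresponds to which click is NOT asserted here;
# a real project would fill this in from the philological analysis.
PUA_TO_IPA = {
    0xE000: "\u01C0",  # -> LATIN LETTER DENTAL CLICK
    0xE001: "\u01C3",  # -> LATIN LETTER RETROFLEX CLICK
    0xE002: "\u01C2",  # -> LATIN LETTER ALVEOLAR CLICK
    0xE003: "\u01C1",  # -> LATIN LETTER LATERAL CLICK
}


def normalize(text: str) -> str:
    """Replace each mapped PUA code point with its normalized letter."""
    return text.translate(PUA_TO_IPA)


def has_unmapped_pua(text: str) -> bool:
    """Report whether any BMP private-use character lacks a mapping."""
    return any(
        0xE000 <= ord(ch) <= 0xF8FF and ord(ch) not in PUA_TO_IPA
        for ch in text
    )


raw = "\ue000a \ue003o"
clean = normalize(raw)
print(clean, has_unmapped_pua(clean))
```

Once the data is in the normalized convention, searching and comparison
work on standard characters, and the original forms remain recoverable
from the untouched source transcription.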
Finally, if someone actually wants to do a redacted publication of
Dokes for its *content*, as opposed to its orthographic antiquarian
interest, it is perfectly possible to do so with an updated set
of orthographic conventions that would make it more accessible to
people used to modern IPA usage. Usability of published or republished
documents is not limited to slavish facsimile reproduction of their
original form -- for that we have facsimiles. :-) I love Shakespeare,
but I don't have to read his plays with long ess's and antique typefaces.
--Ken