Re: Bantu click letters

From: Kenneth Whistler
Date: Thu Jun 10 2004 - 20:10:31 CDT

    > > Simply because some images appear in some documents does not mean
    > > that they automatically should be represented as encoded characters.
    > These aren't images. They're clearly letters; they occur in running texts and represent
    > the sounds of a spoken language.

    Well, I agree with that assessment.

    > If I were transcribing them, I wouldn't encode them
    > as pictures; I would encode them as PUA elements or XML elements (which are usually
    > easier to use and more reliable than the PUA).

    And with that assessment, as well.

    > I'll admit that it's a bit sketchy encoding these characters based on one article by
    > one author. But I think it important to remember that more and more text is available
    > online, even stuff that might never get reprinted in hardcopy, and that needs Unicode.

    And in general, I can't find fault with that, either.

    But the argument in this particular case hinges on a particular,
    nonce set of characters. We have this one scholar, who invented
    a bunch of characters in the 1920s to represent click sounds that nobody
    was doing justice to at that point, either in understanding their
    phonetics or making sufficiently accurate distinctions in their
    recording. Bully for Dokes -- it was an important advance in the
    field of Khoisan studies and the phonetics of clicks. But even
    though he published his analysis, using his characters, nobody
    else chose to adopt his character conventions. Subsequent scholars,
    and the IPA, chose *other* characters to represent the distinctions
    involved, in part because Dokes' inventions were just weird and
    hard to use, as well as neither (in my opinion) mnemonic nor
    aesthetically pleasing.

    Well, we've encoded ugly letters for ugly orthographies in ugly
    scripts before. That isn't the issue. But the non-use of these
    forms is.

    It comes down then to a *prospective* claim that someone *might*
    want to digitize the classic Dokes publication and that if they
    did so they would require that the particular set of weird
    phonetic letters used by Dokes would have to be representable
    in Unicode plain text in order for that one publication to be
    made available electronically. (Or a few other publications that
    might cite Dokes verbatim, of course.)

    Well, in terms of requirements, I consider that more than a little
    cart before the horse. I'd be more sympathetic if someone was
    actually *trying* to do this and had a technical problem with
    representing the text accurately for an online edition which was
    best resolved by adding a dozen characters to the Unicode Standard.
    Then, at least there would be a valid *use* argument to be made,
    as opposed to a scare claim that 50 years from now someone *might*
    want to do this and not be able to if we don't encode these
    characters right now.

    Right *now* anyone could (if they had the rights) put a version of
    Dokes online using PDF and an embedded font, and it would be perfectly
    referenceable for anyone wanting access to the content of the
    document. True, the dozen or so "weird" characters in the
    orthography wouldn't have standard encodings, so searching inside
    the document for them wouldn't be optimal. But is the burden that
    might place on the dozen or so Khoisan orthographic historians and
    phonetic historians who might actually be interested in doing so
    out of scale with the burden placed permanently on the standard
    itself for adding a dozen or so nonce characters for that *one*
    document? After all, those historians and scholars today are
    basically using the document in its printed-only (out-of-print)
    hard copy format, and we aren't exactly worried about the difficulties
    that *that* poses them, now are we?

    I might point out at this point that the Unicode Standard itself is
    published online using non-standard encodings for many of its
    textual examples, simply because of the limitations of FrameMaker
    and PDF and fonts and the specialized requirements of citing lots
    and lots of characters outside normal text contexts. But I don't
    hear people yelling that the online Unicode Standard is crippled for
    use by people who wish to refer to it because you can't do an
    automated search for <ksha> in it which will accurately find all
    instances of Devanagari ksha in the text.

    And the *database* arguments just don't cut it. If anybody is seriously
    going to be using Dokes materials in comparative Khoisan studies,
    they will *normalize* the material in their text databases.
    After all, this is just one of a large variety of really varied
    material, in all kinds of orthographies, and in all levels of
    detail and quality. Arguing for making these particular dozen
    nonce characters searchable by giving them standard Unicode values
    just doesn't cut it for me, because if I were going to do that kind
    of work, a significant amount of philological work would be required
    to "massage" the data into comparable formats, anyway, and use of
    intermediate normalized conventions would not be a problem -- in fact,
    it would almost be mandatory.
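
    As a rough illustration of that kind of intermediate normalization,
    here is a minimal Python sketch. It assumes the nonce letters were
    transcribed as Private Use Area code points; the particular PUA
    values and their pairing with the standard click letters are
    hypothetical placeholders, not a real concordance for Dokes'
    characters, and the names NONCE_TO_IPA, normalize and unmapped_pua
    are likewise invented for the example.

```python
# Hypothetical table: the PUA keys are placeholders for nonce letters as a
# transcriber might have keyed them; the values are the encoded click letters.
# The pairing shown is illustrative only, not a philological claim.
NONCE_TO_IPA = {
    "\uE000": "\u01C0",  # -> U+01C0 LATIN LETTER DENTAL CLICK
    "\uE001": "\u01C1",  # -> U+01C1 LATIN LETTER LATERAL CLICK
    "\uE002": "\u01C2",  # -> U+01C2 LATIN LETTER ALVEOLAR CLICK
    "\uE003": "\u01C3",  # -> U+01C3 LATIN LETTER RETROFLEX CLICK
    "\uE004": "\u0298",  # -> U+0298 LATIN LETTER BILABIAL CLICK
}

_TABLE = str.maketrans(NONCE_TO_IPA)


def normalize(text: str) -> str:
    """Replace each mapped nonce character with its normalized equivalent."""
    return text.translate(_TABLE)


def unmapped_pua(text: str) -> list[str]:
    """List BMP Private Use characters in text that the table does not cover."""
    return sorted(
        {ch for ch in text if "\uE000" <= ch <= "\uF8FF" and ch not in NONCE_TO_IPA}
    )
```

    Running unmapped_pua over a transcription first shows which nonce
    letters still need a decision before the data can be compared with
    material in other orthographies.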

    Finally, if someone actually wants to do a redacted publication of
    Dokes for its *content*, as opposed to its orthographic antiquarian
    interest, it is perfectly possible to do so with an updated set
    of orthographic conventions that would make it more accessible to
    people used to modern IPA usage. Usability of published or republished
    documents is not limited to slavish facsimile reproduction of their
    original form -- for that we have facsimiles. :-) I love Shakespeare,
    but I don't have to read his plays with long ess's and antique typefaces.

