Re: Mixed up priorities

From: peter_constable@sil.org
Date: Sun Oct 24 1999 - 23:41:27 EDT


>From this perspective it seems clearly incorrect to say that
       [[ch]] is two things; it is arguably incorrect even to say that
       //ch// is two things. It all depends on your criteria. The
       observation that //c// and //h// exist separately in the Slovak
       graphemic repertoire supports the argument that //ch// is
       partitionable; but that doesn't mean [[ch]] is partitionable.
       The observation that [[ch]] is a unit in the Slovak semantic
       repertoire argues with equal force that //ch// should be
       considered unitary. The graphical form itself tells us nothing
       in this respect. Font technologies and products are
       irrelevant.

       Interesting discussion, but I'm inclined to think it's not all
       that important. It really makes *no difference whatsoever*
       whether Slovak "ch" (or [[ch]] or //ch// or whatever) is an
       atomic unit of some system or not. All that matters is:

       - what is the behaviour that Slovak users expect of their
       software in relation to the whattzit(s) "ch", and
       - what is needed in the way of encoding in order to achieve
       software that meets those expectations

       It doesn't matter one bit what we call anything. (So I'll try
       to avoid the offending labels.) The argument should be whether
       the sequence U+0063 + U+0068 (and variations involving the
       corresponding upper case Unicode-thingies) is sufficient in
       order to develop implementations that satisfy the expectations
       of users, or whether a new Unicode-thingy, which probably would
       be called LATIN SMALL LETTER CH is necessary.
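As a concrete illustration of what the existing encoding looks like, the digraph and the case variants alluded to above ("CH", and titlecase "Ch") are just ordinary codepoint sequences (this small snippet is my own illustration, not anything proposed in the thread):

```python
# The Slovak digraph as a plain codepoint sequence, in all three
# case forms: lowercase "ch", titlecase "Ch", uppercase "CH".
for form in ["ch", "Ch", "CH"]:
    print(form, [f"U+{ord(c):04X}" for c in form])
# ch ['U+0063', 'U+0068']
# Ch ['U+0043', 'U+0068']
# CH ['U+0043', 'U+0048']
```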

       Some have said that the new Unicode-thingy is necessary, though
       nobody has yet identified any behaviour for which the existing
       Unicode-thingies are not sufficient. In this situation, the
       null argument is that the existing Unicode-thingies are
       sufficient unless demonstrated otherwise. Some may not like
       this (somebody asked why the burden of proof is on those
       requesting the new character), but that doesn't change the
       fact that that's the way things are. The null argument is what it is
       because we want to make sure we don't assign more fluff than is
       politically necessary, and that the thingies in the standard
       are there for good reason and are getting used.

>Appeals to Unicode definitions of "character", "plaintext",
       etc. don't help here - their brokenness is at the root of the
       kind of miscommunication that this thread has so eloquently
       illustrated.

       I don't agree. The definitions aren't broken. The
       Unicode-thingies are called "characters", and there is a good
       and clear definition given for them - at least, there's no
       question whatever in my mind what they mean, and I'm pretty
       sure that my understanding is effectively equivalent to the
       understanding that Michael, Ken, Rick, Asmus, and all our other
       favourite Uniheroes have. That definition is an abstract one,
       and is different from definitions for units of writing systems
       (whether you call them "characters" or "letters" or "graphemes"
       or what have you).

       A Unicode character is

       [from ISO/IEC 10646-1:1993] 4.6 character: A member of a set of
       elements used for the organisation, control, or representation
       of data.

       [from Unicode 2.0] abstract character: A unit of information
       used for the organization, control, or representation of data.

       (There is no significant difference in these two definitions.)

       Now, there is still room for debate in specific situations as
       to what *sorts* of units of information ought to be included in
       the repertoire of accepted Unicode characters; by this I mean
       debates that are really about the philosophy, principles and
       value system around which the standard should be built. There
       is a debate of this sort going on right now over several
       hundred proposed characters for math purposes. But there is no
       debate about whether the definition of abstract character is
       understood or whether it's the right definition.

       In the case of Slovak "ch", there's not really a question about
       what "ch", as part of the Slovak writing system, is, though some
       have been debating this. That doesn't matter. In particular,
       perceptions on the part of end users as to what "ch" is have no
       bearing whatsoever. What matters is whether we need to add an
       abstract character to represent it. I and others have suggested
       that the assumed encoding of U+0063 U+0068 is adequate. The way
       to argue that a new abstract character is necessary is to
       demonstrate that there are textual processes that people
       regularly want to do in software for which the expected results
       cannot be achieved (using reasonable means) without the
       proposed new character.
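One textual process people usually point to here is Slovak alphabetical ordering, where "ch" sorts as a unit after "h". A minimal sketch shows that a tailored sort key over the existing sequence U+0063 U+0068 can deliver that behaviour; the mini-alphabet below is my own illustration (and puts "ch" after "c" only for brevity), not a full Slovak tailoring:

```python
# Sketch: collation that treats the sequence "c" + "h" (U+0063 U+0068)
# as a single sorting unit, with no new character needed.
# ORDER is an illustrative fragment, not the real Slovak alphabet.

ORDER = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i"]
RANK = {unit: i for i, unit in enumerate(ORDER)}

def sort_key(word):
    """Tokenize into collation units, longest match first ("ch" before "c")."""
    key, i = [], 0
    w = word.lower()
    while i < len(w):
        if w[i:i + 2] in RANK:          # digraph takes priority
            key.append(RANK[w[i:i + 2]])
            i += 2
        else:
            # letters outside the fragment sort after everything in it
            key.append(RANK.get(w[i], len(ORDER)))
            i += 1
    return key

words = ["cena", "chata", "dom", "cibula"]
print(sorted(words, key=sort_key))
# ['cena', 'cibula', 'chata', 'dom']
```

In real software this tailoring is exactly what locale-aware collation libraries do with "contractions"; the point of the sketch is only that nothing about it requires a precomposed character.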

>BTW, I suggest we avoid "alphabet" and "letter" altogether.
       As I'm sure you know, alphabetisme has a long and shameful
       history in the toolbox of Western Imperialism. Unfortunately
       it is still not particularly difficult to find even scholarly
       works that carry an implicit assumption that "the alphabet" is
       superior to other forms.

       It doesn't matter to me; I'm happy to avoid them since, in my
       thinking, an alphabet is a specific structural type of the more
       general notion, "writing system", and letter is an element
       specifically of an alphabet (though I haven't banged my head at
       all over how I think letter should be defined). No type of
       writing system is inherently better or worse than any other,
       and I wouldn't want to suggest that any type is
       preferred. The only values that can reasonably be given to
       specific writing systems are the subjective attitudes of users,
       and utilitarian evaluations of their suitability for writing a
       particular language, which strictly speaking can only be based
       on observations regarding things like reading fluency, etc.

       I use "character" to talk about units of writing systems,
       except where there is confusion with the notion of "abstract
       character", as in this thread, in which case I'll generally
       resort to "orthographic character". (In an earlier message, I
       resorted to "orthographeme", but I've never used that before,
       and I'm not trying to push it. I was just in some mood.) I also
       use "character" to mean "abstract character" as defined above,
       except when there is confusion with the "orthographic
       character" sense, in which case I might use "codepoint" (not
       really the best choice), "encoded character", "abstract
       character" or "Unicode character". (Again, in an earlier
       message I used "encodeme", but that was a novel use and I'm not
       wanting to promote its use.)

       I suppose somebody out there will want to debate the definition
       of "orthographic character". Personally, I don't have the
       energy for that right now, so I don't think I'd enter that
       fray. And somebody will probably want to debate whether the
       definition of "abstract character" is appropriate, clear,
       unambiguous or correct. I'll just tell you now, don't expect it
       to change, because I'm pretty sure it won't.

       Peter



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT