Re: Mixed up priorities

From: peter_constable@sil.org
Date: Fri Oct 22 1999 - 00:01:11 EDT


       Adam:

       What you describe here is entirely the same for English "th"
       with the exception of sorting behaviour. This sequence
       represents a linguistic entity (two, actually) which cannot be
       analysed as a sequence of entities, whether reasonably
       represented orthographically by "t" and "h" or otherwise. There
       are other alphabets that have single symbols for the same
       entity/sound, and you would not ever put a hyphen between the
       two.

>The fact that it can be constructed from two glyphs, C and H,
>is irrelevant, many other characters can be so constructed
>(e.g. N with caron can be constructed from an N and a caron, yet
>it is a separate character).

       The reason this is irrelevant is that

       - we're not constructing entities (which would suggest that the
       meaning represented by "ch" is composed of the meanings
       represented by "c" and by "h" - this kind of semantic
       composition is not what anybody is suggesting in this case);
       we're establishing encoded representations for them

       - we're not talking about glyphs, but about characters

       The fact that "ch" *can* be given an encoded representation
       of C + H is entirely relevant. Since it can already be done
       this way, and since it is possible to make any textual process
       of interest work using this, there's no need to add
       something different.
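
       As an illustration of one such textual process, sorting, here
       is a minimal Python sketch. It assumes a simplified,
       hypothetical alphabet table (no diacritics) in which "ch" is
       ranked between "h" and "i"; the names ALPHABET, RANK and
       collation_key are invented for this example and do not come
       from any standard or library.

```python
# Hypothetical, simplified alphabet: "ch" is one letter ranked after "h".
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
            "l", "m", "n", "o", "p", "r", "s", "t", "u", "v", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def collation_key(word):
    """Map a word to a list of letter ranks, treating c + h as one letter."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            # The two encoded characters contribute a single sort weight.
            key.append(RANK["ch"])
            i += 2
        else:
            # Characters outside the table sort after the whole alphabet.
            key.append(RANK.get(text[i], len(ALPHABET)))
            i += 1
    return key


words = ["chata", "cesta", "hora", "cukor", "ihla", "dom"]
print(sorted(words, key=collation_key))
# ['cesta', 'cukor', 'dom', 'hora', 'chata', 'ihla']
```

       The text stays encoded as plain C + H; only the comparison
       key knows that the pair counts as one letter.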

>>But you are wrong. CH is not a _character_ in any language...

>Respectfully, I disagree...

       We need to be careful here because (a) there are two senses of
       "character" that are easy to confuse (sense 1: an atomic unit of
       textual information for encoding purposes, i.e. the Unicode
       definition; sense 2: a unit within an orthography/writing
       system); and (b) application of the second sense is subject to
       attitudes and perceptions on the part of individuals within a
       language community, and is therefore not necessarily easy to
       determine and not necessarily consistent across the community.
       Your response to Michael was operating on the second sense.
       Whether Michael was wrong or not on this point makes no
       difference whatsoever, because for this standard we need to
       focus on the first sense.

       It is possible to use a sequence of two characters (sense 1) to
       encode the (single) linguistic entity "ch" (for the sake of
       discussion, we'll say that it's one sense-2 character), and to
       do so while making it appear to the user that their software
       always perceives the sequence as a single character (sense 2,
       which is what users are interested in). Since it is possible to
       do this, it is better to do this than to introduce a new,
       single character (sense 1), which, in cases like these, ends up
       creating more problems than simplifications.
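
       A rough Python sketch of what "appearing as a single
       character" could look like in software: the stored text is
       still C + H, but letter counting and the positions where a
       cursor may rest or a line may break are computed per sense-2
       letter. The function names are hypothetical, and the sketch
       assumes every adjacent c + h is the digraph; real hyphenation
       would also have to consider syllable structure.

```python
def letters(word):
    """Split a word into user-perceived letters, keeping c + h together."""
    units = []
    i = 0
    while i < len(word):
        # Consume two encoded characters when they form the digraph.
        step = 2 if word[i:i + 2].lower() == "ch" else 1
        units.append(word[i:i + step])
        i += step
    return units


def break_offsets(word):
    """Offsets between letters; no offset ever falls inside a "ch"."""
    offsets = [0]
    for unit in letters(word):
        offsets.append(offsets[-1] + len(unit))
    return offsets


print(letters("Chlieb"))       # ['Ch', 'l', 'i', 'e', 'b']
print(len(letters("Chlieb")))  # 5
print(break_offsets("mucha"))  # [0, 1, 2, 4, 5]
```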

       Peter

       From: <adam@whizkidtech.net> AT Internet on 10/21/99 06:49 PM
             CDT

       Received on: 10/21/99

       To: Peter Constable/IntlAdmin/WCT, unicode@unicode.org AT
             Internet@Ccmail
       cc:
       Subject: Re: Mixed up priorities

       At 13:06 21-10-1999 -0700, Michael Everson wrote:
>But you are wrong. CH is not a _character_ in any language. It is a set of
>strings of characters (C-H, C-h, c-h) used (sorted etc.) as a _letter_ in
>languages like Slovak, Czech, Welsh, and traditional Spanish.

       Respectfully, I disagree. I cannot speak for Welsh and Spanish,
       but in Slovak and Czech, CH has all characteristics of a
       character: It denotes a specific sound which cannot be
       expressed in any other way. Nor can it be separated into two
       sounds.

       Many other alphabets have a separate character for this sound,
       e.g. the chi in Greek, or the Cyrillic character that looks
       like the Roman X.

       The fact that it can be constructed from two glyphs, C and H,
       is irrelevant, many other characters can be so constructed
       (e.g. N with caron can be constructed from an N and a caron, yet
       it is a separate character).

       It is not simply a string of characters because it cannot be
       separated. You cannot, for example, divide a word at the end of
       a line by following the C with a - and starting the next line
       with an H. It is *not* C-H, C-h, and c-h. It is CH, Ch, and ch.

       Also, ask any Slovak to tell you what the alphabet is, and he
       will inevitably list H, CH, I within the sequence.

       And, by the way, I am in no way trying to undermine your effort
       to have the Klingon alphabet included in Unicode. I just
       wish we treated real languages the way their native speakers
       treat them, not how Western experts perceive them.

       Adam


