Re: Mixed up priorities

From: peter_constable@sil.org
Date: Fri Oct 22 1999 - 01:12:22 EDT


       Adam:

>Argh, Peter, I am no more Czech than Michael is English! I am
       Slovak.

       Please accept my apology.

>By the way, we have already had this discussion once,

       This was sounding familiar.

>and I really do not expect to convince anyone. Nor am I trying
       to convince anyone in Slovakia to use Unicode anymore. It is
       useless.

       Oh, but there are many of us who hope to convince you, and we
       certainly don't think Unicode is useless to Slovaks (or Czechs,
       for that matter), even if attempts to convince them of this may
       be. If people refuse to try something because they have a
       preconceived expectation of how it ought to work, and don't
       allow themselves to see that it can also work another way, then
       that's a shame.

>Yes, we can type "ch" using the GLYPHS "c" and "h", but
       Unicode prides itself in being a character encoding, not a
       glyph encoding. To us, "ch" is a character. Period. In our
       dictionaries the "ch" follows the "h" and precedes the "i". We
       would never dream of looking for "ch" after "cg" and before
       "ci".

       We are not talking about GLYPHS here but about CHARACTERS. Yes,
       Unicode is a *character* encoding, but these characters (as
       I've just explained in a separate message) are not the same
       kind of objects as you're talking about when you say '"ch" is a
       character'. And it is this crucial difference that you're
       continuing to get hung up on.

       Let's start over with some new terms:

       "encodeme" = minimal unit of encoded textual information
       (Unicode's "character")
       "orthographeme" = unit within an orthographic system

       (I won't specify whether the latter are minimal units because
       some cases will be ambiguous due to variations in the attitudes
       and perceptions of members within a language community. For
       this discussion, it doesn't really matter; I'm not trying to be
       completely precise, but rather to get an idea across.)

       The following dialogue presents the situation in terms of this
       new terminology:

       A: "How should the Slovak orthographeme "ch" (I got the
       language right, yes?) be encoded in Unicode?"
       B: "Use the sequence of encodemes U+0063 + U+0068."

       A: "But, in such a sequence, what is the meaning of the
       separate parts? For example, what is the meaning of U+0063?"
       B: "Apart from the Unicode semantic properties, nothing: only
       the sequence of which this is a part has meaning, and that
       meaning is the orthographeme 'ch'."

       (Note that we either have a string that is tagged to indicate
       that the language is Slovak, or the sender and receiver have
       both implicitly agreed what language is involved.)
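       To make B's answer concrete, here is a minimal Python sketch
       (standard library only) showing the encodeme sequence behind
       the Slovak orthographeme "ch":

```python
# The Slovak orthographeme "ch" is encoded as a sequence of two
# Unicode encodemes: U+0063 (c) followed by U+0068 (h).
text = "ch"
codepoints = [f"U+{ord(c):04X}" for c in text]
print(codepoints)  # ['U+0063', 'U+0068']
```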

       A: "So orthographemes and encodemes don't need to match
       one-to-one?"
       B: "We could have done things that way, but we decided it would
       be pretty inefficient and create some issues we preferred to
       avoid. Instead, we allowed the relationship to be one-to-many,
       with some orthographemes encoded as sequences of several
       encodemes. There are lots of cases where this happens."
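       One such case, by way of illustration (Python standard
       library): an accented letter like "é" is a single orthographeme
       but, in its decomposed (NFD) form, a sequence of two encodemes:

```python
import unicodedata

# "é" as one orthographeme, two encodemes: U+0065 (e) followed by
# U+0301 (combining acute accent), in NFD normalization form.
decomposed = unicodedata.normalize("NFD", "\u00E9")
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']
```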

       A: "But doesn't that cause problems for things like sorting
       algorithms - the algorithm needs to treat a sequence of
       encodemes as though they are a unit for sorting purposes. This
       would be easier if only one-to-one relationships were allowed."
       B: "Well, we've had to solve that problem for years in systems
       based on legacy encodings, so we've already dealt with that.
       Besides, the problems we'd face with an encodeme for every
       single orthographeme are far worse."
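       B's point about sorting can be sketched in a few lines of
       Python. This is only a toy tailoring, not the full Unicode
       Collation Algorithm that real systems use; the simplified
       alphabet and the word list are invented for illustration:

```python
# Toy Slovak-style collation: "ch" counts as a single letter that
# sorts after "h". Real implementations use proper collation
# tailoring (e.g., the Unicode Collation Algorithm with locale data).
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {letter: i for i, letter in enumerate(ALPHABET)}

def sort_key(word):
    key, i = [], 0
    while i < len(word):
        # Treat the two-encodeme sequence "ch" as one sorting unit.
        if word[i:i + 2] == "ch":
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[word[i]])
            i += 1
    return key

words = ["cg", "ci", "chleba", "had", "ideal"]
print(sorted(words, key=sort_key))
# ['cg', 'ci', 'had', 'chleba', 'ideal']
```

       Note that "chleba" lands after "had" and before "ideal" — the
       sequence of encodemes is treated as a single unit, exactly as
       the orthography requires, even though it is stored as two code
       points.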

       A: "But won't this be confusing to users, who will assume that
       there is one encodeme for every orthographeme?"
       B: "The average user shouldn't ever have to know what's going
       on at the encoding level; it should simply be a black box that
       works and does what they want. If software is implemented
       properly, that can be achieved. Sure, there will always be some
       users who, for whatever reason, need to peek inside that black
       box, but anyone who can do that is clever enough to understand
       that things aren't necessarily one-to-one. This kind of thing
       should never be a problem.

       "Even if an average user becomes aware of what's going on
       inside, that shouldn't have to be a problem: there are lots of
       users out there who have no problem getting used to a
       many-to-one relationship between keystrokes and encodemes, and
       that situation really isn't all that different. For some
       reason, though, some users, like you, do trip up on the
       encoding side. There's some kind of broad perception going on
       for some people that encodemes and orthographemes should match
       one-to-one, but that's simply a false assumption. Rather
       than impose it on ourselves, which, as I've already said, has
       some undesirable side effects, it's much better to try to educate
       so that people understand that encodings can, and in this case
       do, work differently."

       A: "Oh, I think I'm starting to get it, Socrates. So for
       Slovak, the entity "ch", which is an orthographeme, gets
       encoded as a sequence of encodemes, but that sequence gets
       treated by algorithms for things like sorting as though it were
       a single entity, just like the orthographeme, and users don't
       know the difference. Is that what you're saying?"
       B: "Yes! At last. I was starting to despair that I'd ever find
       a way of explaining this to you. Maybe I'll put this hemlock
       back away... "

       Peter



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT