Adam:
>Argh, Peter, I am no more Czech than Michael is English! I am
>Slovak.
Please accept my apology.
>By the way, we have already had this discussion once,
This was sounding familiar.
>and I really do not expect to convince anyone. Nor am I trying
>to convince anyone in Slovakia to use Unicode anymore. It is
>useless.
Oh, but there are many of us who hope to convince you, and we
certainly don't think Unicode is useless to Slovaks (or Czechs,
for that matter), even if attempts to convince them of this
are. If people refuse to try something because they have a
preconceived expectation of how it ought to work and
don't allow themselves to see that it can also work another
way, then that's a shame.
>Yes, we can type "ch" using the GLYPHS "c" and "h", but
>Unicode prides itself in being a character encoding, not a
>glyph encoding. To us, "ch" is a character. Period. In our
>dictionaries the "ch" follows the "h" and precedes the "i". We
>would never dream of looking for "ch" after "cg" and before
>"ci".
We are not talking about GLYPHS here but about CHARACTERS. Yes,
Unicode is a *character* encoding, but these characters (as
I've just explained in a separate message) are not the same
kind of object you're talking about when you say '"ch" is a
character'. And it is this crucial difference that you
continue to get hung up on.
Let's start over with some new terms:
"encodeme" = minimal unit of encoded textual information
(Unicode's "character")
"orthographeme" = unit within an orthographic system
(I won't specify whether the latter are minimal units because
some cases will be ambiguous due to variations in the attitudes
and perceptions of members within a language community. For
this discussion, it doesn't really matter; I'm not trying to be
completely precise, but rather to get an idea across.)
The following dialogue presents the situation in terms of this
new terminology:
A: "How should the Slovak orthographeme "ch" (I got the
language right, yes?) be encoded in Unicode?"
B: "Use the sequence of encodemes U+0063 + U+0068."
A: "But, in such a sequence, what is the meaning of the
separate parts? For example, what is the meaning of U+0063?"
B: "Apart from the Unicode semantic properties, nothing: only
the sequence of which this is a part has meaning, and that
meaning is the orthographeme 'ch'."
(Note that we either have a string that is tagged to indicate
that the language is Slovak, or the sender and receiver have
both implicitly agreed what language is involved.)
A: "So orthographemes and encodemes don't need to match
one-to-one?"
B: "We could have done things that way, but we decided it would
be pretty inefficient and create some issues we preferred to
avoid. Instead, we allowed the relationship to be one-to-many,
with some orthographemes encoded as sequences of several
encodemes. There are lots of cases where this happens."
A: "But doesn't that cause problems for things like sorting
algorithms? The algorithm needs to treat a sequence of
encodemes as though it were a unit for sorting purposes. This
would be easier if only one-to-one relationships were allowed."
B: "Well, we've had to solve that problem for years in systems
based on legacy encodings, so we've already dealt with that.
Besides, the problems we'd face with an encodeme for every
single orthographeme are far worse."
A: "But won't this be confusing to users, who will assume that
there is one encodeme for every orthographeme?"
B: "The average user shouldn't ever have to know what's going
on at the encoding level; it should simply be a black box that
works and does what they want. If software is implemented
properly, that can be achieved. Sure, there will always be some
users who, for whatever reason, need to peek inside that black
box, but anyone who can do that is clever enough to understand
that things aren't necessarily one-to-one. This kind of thing
should never be a problem.
"Even if an average user becomes aware of what's going on
inside, that shouldn't have to be a problem: there are lots of
users out there who have no problem getting used to a
many-to-one relationship between keystrokes and encodemes, and
that situation really isn't all that different. For some
reason, though, some users, like you, do trip up on the
encoding side. There's some kind of broad perception going on
for some people that encodemes and orthographemes should match
one-to-one, but that's simply a false assumption. Rather
than impose it on ourselves, which I've already said has some
undesirable side effects, it's much better to educate people
so that they understand that encodings can, and in this case
do, work differently."
A: "Oh, I think I'm starting to get it, Socrates. So for
Slovak, the entity "ch", which is an orthographeme, gets
encoded as a sequence of encodemes, but that sequence gets
treated by algorithms for things like sorting as though it were
a single entity, just like the orthographeme, and users don't
know the difference. Is that what you're saying?"
B: "Yes! At last. I was starting to despair that I'd ever find
a way of explaining this to you. Maybe I'll put this hemlock
away..."
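
To make the sorting point concrete, here is a minimal sketch in Python.
It assumes a deliberately simplified alphabet (plain a-z with "ch"
slotted in after "h") and ignores accented letters, so it is an
illustration of the idea rather than real Slovak collation. The names
RANK, orthographemes and sort_key are my own, invented for this
example.

```python
import string

# Assumption: a simplified alphabet, plain a-z with "ch" placed
# right after "h". Accented letters are not handled here.
letters = list(string.ascii_lowercase)
letters.insert(letters.index("h") + 1, "ch")
RANK = {unit: i for i, unit in enumerate(letters)}


def orthographemes(word):
    """Split a word into sorting units, treating "ch" as one unit."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            units.append("ch")  # two encodemes, one orthographeme
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def sort_key(word):
    """Map each unit to its rank; unknown units sort last (and tie)."""
    return [RANK.get(u, len(RANK)) for u in orthographemes(word)]


# The encoded text is still just U+0063 followed by U+0068.
print([hex(ord(c)) for c in "ch"])

words = ["chata", "cesta", "hora", "ihla", "cukor"]
print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # "ch" treated as a unit after "h"
```

With plain code point order, "chata" lands between "cesta" and
"cukor"; with the key function it lands after "hora" and before
"ihla", which is the dictionary order Adam describes. The stored text
is the same in both cases; only the algorithm's view of it differs.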
Peter