>From this perspective it seems clearly incorrect to say that
>[[ch]] is two things; it is arguably incorrect even to say that
>//ch// is two things. It all depends on your criteria. The
>observation that //c// and //h// exist separately in the Slovak
>graphemic repertoire supports the argument that //ch// is
>partitionable; but that doesn't mean [[ch]] is partitionable.
>The observation that [[ch]] is a unit in the Slovak semantic
>repertoire argues with equal force that //ch// should be
>considered unitary. The graphical form itself tells us nothing
>in this respect. Font technologies and products are
>irrelevant.
Interesting discussion, but I'm inclined to think it's not all
that important. It really makes *no difference whatsoever*
whether Slovak "ch" (or [[ch]] or //ch// or whatever) is an
atomic unit of some system or not. All that matters is:
- what is the behaviour that Slovak users expect of their
software in relation to the whattzit(s) "ch", and
- what is needed in the way of encoding in order to achieve
software that meets those expectations
It doesn't matter one bit what we call anything. (So I'll try
to avoid the offending labels.) The argument should be whether
the sequence U+0063 + U+0068 (and variations involving the
corresponding upper case Unicode-thingies) is sufficient in
order to develop implementations that satisfy the expectations
of users, or whether a new Unicode-thingy, which probably would
be called LATIN SMALL LETTER CH is necessary.
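To make the two options concrete, here is a minimal Python sketch of the first one: Slovak "ch" represented with the existing characters as the sequence U+0063 U+0068. Nothing here is Slovak-specific machinery; it just shows what the sequence encoding looks like in practice.

```python
# Slovak "ch" as it can already be encoded today: a sequence of the two
# existing characters U+0063 LATIN SMALL LETTER C and U+0068 LATIN
# SMALL LETTER H.
ch = "\u0063\u0068"
print(ch)                # ch
print(len(ch))           # 2 code points for one orthographic unit
print(ch.upper())        # CH
print("chlieb".title())  # Chlieb: word-initial titlecase "Ch" comes free
```

Simple case mapping already behaves as Slovak users expect with the sequence; the open question below is whether any other textual process fails with it.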
Some have said that the new Unicode-thingy is necessary, though
nobody has yet identified any behaviour for which the existing
Unicode-thingies are not sufficient. In this situation, the
null argument is that the existing Unicode-thingies are
sufficient unless demonstrated otherwise. Some may not like it
that way (somebody asked why the burden of proof is on those
requesting the new character), but that doesn't change the fact
that that's the way things are. The null argument is what it is
because we want to make sure we don't assign more fluff than is
politically necessary, and that the thingies in the standard
are there for good reason and are getting used.
>Appeals to Unicode definitions of "character", "plaintext",
>etc. don't help here - their brokenness is at the root of the
>kind of miscommunication that this thread has so eloquently
>illustrated.
I don't agree. The definitions aren't broken. The
Unicode-thingies are called "characters", and there is a good
and clear definition given for them - at least, there's no
question whatever in my mind what they mean, and I'm pretty
sure that my understanding is effectively equivalent to the
understanding that Michael, Ken, Rick, Asmus, and all our other
favourite Uniheroes have. That definition is an abstract one,
and is different from definitions for units of writing systems
(whether you call them "characters" or "letters" or "graphemes"
or what have you).
A Unicode character is
[from ISO/IEC 10646-1:1993] 4.6 character: A member of a set of
elements used for the organisation, control, or representation
of data.
[from Unicode 2.0] abstract character: A unit of information
used for the organization, control, or representation of data.
(There is no significant difference between these two definitions.)
Now, there is still room for debate in specific situations as
to what *sorts* of units of information ought to be included in
the repertoire of accepted Unicode characters; by this I mean
debates that are really about the philosophy, principles and
value system around which the standard should be built. There
is a debate of this sort going on right now over several
hundred proposed characters for math purposes. But there is no
debate about whether the definition of abstract character is
understood or whether it's the right definition.
In the case of Slovak "ch", there's not really a question about
what "ch", as part of the Slovak writing system, is, though some
have been debating this. That doesn't matter. In particular,
perceptions on the part of end users as to what "ch" is have no
bearing whatsoever. What matters is whether we need to add an
abstract character to represent it. I and others have suggested
that the assumed encoding of U+0063 U+0068 is adequate. The way
to argue that a new abstract character is necessary is to
demonstrate that there are textual processes that people
regularly want to do in software for which the expected results
cannot be achieved (using reasonable means) without the
proposed new character.
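The usual candidate process here is sorting: in Slovak dictionary order, "ch" collates as a single letter after "h". But that behaviour is achievable with the plain two-character sequence plus a tailored comparison, as this toy Python sketch shows. It is an illustration only, not ICU or CLDR: it covers just lowercase a-z plus the digraph, and a real tailoring would also handle exceptional c+h sequences that are not the digraph.

```python
# Toy tailored Slovak collation: "ch" sorts as one letter, after "h",
# even though it is encoded as the sequence U+0063 U+0068.
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j",
            "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def slovak_key(word):
    """Map a word to a list of collation-element ranks."""
    key, i = [], 0
    while i < len(word):
        # Longest match first: consume "ch" as a single collation element.
        if word[i:i + 2] == "ch":
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

words = ["chyba", "hora", "cena", "ihla"]
print(sorted(words, key=slovak_key))
# ['cena', 'hora', 'chyba', 'ihla']  -- "chyba" lands after "hora"
```

If every expected behaviour can be met this way, the null argument stands; a proposal for LATIN SMALL LETTER CH would need a process for which no such tailoring is workable.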
>BTW, I suggest we avoid "alphabet" and "letter" altogether.
>As I'm sure you know, alphabetisme has a long and shameful
>history in the toolbox of Western Imperialism. Unfortunately
>it is still not particularly difficult to find even scholarly
>works that carry an implicit assumption that "the alphabet" is
>superior to other forms.
It doesn't matter to me; I'm happy to avoid them since, in my
thinking, an alphabet is a specific structural type of the more
general notion, "writing system", and letter is an element
specifically of an alphabet (though I haven't banged my head at
all over how I think letter should be defined). No type of
writing system is, in general, better or worse than any other,
and I wouldn't want to suggest that any type is preferred. The
only values that can reasonably be given to
specific writing systems are the subjective attitudes of users,
and utilitarian evaluations of their suitability for writing a
particular language, which strictly speaking can only be based
on observations regarding things like reading fluency, etc.
I use "character" to talk about units of writing systems,
except where there is confusion with the notion of "abstract
character", as in this thread, in which case I'll generally
resort to "orthographic character". (In an earlier message, I
resorted to "orthographeme", but I've never used that before,
and I'm not trying to push it. I was just in some mood.) I also
use "character" to mean "abstract character" as defined above,
except when there is confusion with the "orthographic
character" sense, in which case I might use "codepoint" (not
really the best choice), "encoded character", "abstract
character" or "Unicode character". (Again, in an earlier
message I used "encodeme", but that was a novel use and I'm not
wanting to promote its use.)
I suppose somebody out there will want to debate the definition
of "orthographic character". Personally, I don't have the
energy for that right now, so I don't think I'd enter that
fray. And somebody will probably want to debate whether the
definition of "abstract character" is appropriate, clear,
unambiguous or correct. I'll just tell you now, don't expect it
to change, because I'm pretty sure it won't.
Peter
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT