From: Asmus Freytag (firstname.lastname@example.org)
Date: Tue Dec 15 2009 - 15:30:16 CST
On 12/15/2009 11:17 AM, Julian Bradfield wrote:
> Asmus wrote:
>> On 12/15/2009 2:31 AM, Julian Bradfield wrote:
>>> On 2009-12-14, Michael Everson <email@example.com> wrote:
>>>> On 14 Dec 2009, at 20:56, Julian Bradfield wrote:
>>> As Asmus has pointed out, the question then is, do you ask users to
>>> understand this, and magically know that two apparently different
>>> strings are actually the same?
>> This is where the disconnect is, and where you may be misquoting me. The
>> typical user knows a writing system but not the code sequence.
>> Programmers have tools that make code sequences visible to them, so they
>> can distinguish them. Correctly formatted and displayed, ordinary users
>> cannot tell the difference between alternative code sequences for the
>> same abstract character. That is as it should be, because what is
>> encoded is the abstract character.
> Yes - but how many users can distinguish the different abstract
> characters (Latin) o, (Greek) ο and (Cyrillic) о ? I certainly
Most users have no problem with any of these, because, except in
somewhat artificial test cases, they tend to be used in context with
other Latin, Greek or Cyrillic letters, respectively. And, going beyond
that, most users stick to one of these three scripts for the majority of
their interaction with their computers. That does not mean that there
aren't any real-world issues.
> Is this inherently different from the distinction between
> precomposed and combining characters?
Yes, because the combining characters themselves are part and parcel of
the methodology of mapping writing systems to binary encoded data,
whereas precomposed characters are an artifact of the history of
character encoding. (There are also far more of these duplicates than of
the cross-script look-alikes.)
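To make the duplication concrete, here is a minimal sketch using Python's standard unicodedata module (the example itself is not from the original mail): the precomposed letter and the base-plus-combining-mark sequence are different code sequences, but canonical normalization maps one onto the other.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw code sequences differ...
print(precomposed == combining)                                # False
# ...but they are canonically equivalent under normalization.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```

Correctly rendered, both strings display identically, which is exactly why an ordinary user cannot (and should not need to) tell them apart.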
In contrast, the example you gave above is a result of the historical
development of writing systems, outside the sphere of their digital
encoding. On another (platonic?) level, the o and omicron were once
identical, but they no longer are (and strictly speaking, that identity
only ever applied to their upper case forms). There is no presumption,
other than a typographic one, that they share the exact same
representation. In fact, the Greek letter especially is often rendered
in a noticeably different style, because many fonts show Greek in a
different style from Latin. (View them with any font created for the JIS
character set and they will likely look distinct immediately - that's
what I did to verify that you didn't cheat.)
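The point that these are three distinct abstract characters, not one character with three encodings, can be checked programmatically - a small sketch (mine, not from the original mail) using Python's unicodedata module:

```python
import unicodedata

# Latin o, Greek omicron, Cyrillic o: visually alike in many fonts,
# but three distinct abstract characters with distinct code points.
for ch in ("o", "\u03bf", "\u043e"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+03BF  GREEK SMALL LETTER OMICRON
# U+043E  CYRILLIC SMALL LETTER O
```

Unlike the precomposed/combining case, no normalization form equates these: they are not canonically equivalent, merely confusable.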
>> Unix users have inherited the mess created by the design approach that
>> was based on "character set independence". That approach seemed a nice,
>> value-neutral way to handle competing character sets, until it became
>> clear that it would in many instances lead to the creation of
>> effectively uninterpretable byte-streams. Hence Unicode. But all of that
>> is, of course, history.
> I wonder why we didn't settle on ISO 2022 encoded filenames before
> Unicode came along? Just because of the overhead? Or just because of
> the timeline of non-ASCII use of computers?
Because of Unicode (and 10646). Absent these efforts to create a
unifying character set, 2022 would have been the only choice - and as
you note, the overhead would have been horrendous. Web access for
smaller languages would likely have suffered as a result.
>> How the encoding relates an abstract character to code sequence(s), on
>> the other hand, is well defined in the Standard.
> But the definition of abstract character doesn't necessarily match
> what users think!
And doesn't have to. As long as a given sequence of abstract characters
is rendered and processed in a manner the users expect, the actual
internal divisions are relatively irrelevant. If support of combining
accents had been present and seamless from day one, you can argue that
no-one would have missed the precomposed characters.
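That counterfactual is easy to motivate: every precomposed character in Unicode carries a canonical decomposition into a base letter plus combining marks, so the decomposed form alone would have sufficed had combining-mark rendering been seamless from the start. A sketch (my illustration, assuming Python's unicodedata module):

```python
import unicodedata

# Each precomposed letter decomposes canonically into base + mark(s):
# e-acute, n-tilde, c-caron.
for ch in ("\u00e9", "\u00f1", "\u010d"):
    print(f"U+{ord(ch):04X} -> {unicodedata.decomposition(ch)}")
# U+00E9 -> 0065 0301
# U+00F1 -> 006E 0303
# U+010D -> 0063 030C
```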
This archive was generated by hypermail 2.1.5 : Tue Dec 15 2009 - 15:32:58 CST