Re: (TC304WG4.50) Charset vs. codeset

From: Alain LaBonté - SCT (alb@riq.qc.ca)
Date: Sun Oct 05 1997 - 11:27:34 EDT


At 06:42 97-10-05 -0700, Martin J. Dürst wrote:
>On Sat, 4 Oct 1997, Keld Jørn Simonsen wrote:
>
>> I had a few comments to Kenneth Whistler's recent writing:

[Kenneth] :
>> > Unicode is an encoded character set.

[Keld] :
>> I am not so sure about that. It violates the general principle
>> that an encoded character set encodes each (abstract) character
>> in only one way.
>>
>> > ISO/IEC 10646 is an encoded character set.
>>
>> True.

[Martin] :
>You are probably referring to cases like A + combining ring above
>vs. A with ring above (sorry I don't remember the official names).
>
>In that sense, both Unicode and ISO/IEC 10646 are very much the
>same. Both include the possibility of using combining marks.
>Unicode is a little bit more explicit about them. But it doesn't
>allow more things than ISO/IEC 10646. ISO/IEC 10646 doesn't explicitly
>define equivalences, and therefore in theory, it's possible to
>say that these are different abstract characters (or combinations
>of them). But Unicode can say the same, namely that they are
>different abstract characters/combinations. That the difference
>shouldn't be visible to the user is patently obvious in both cases.

[Alain] :
My 2 cents:

On the one hand, some combinations in which you would see no difference,
even with bad implementations, are not recognized as equivalent in UNICODE.
SMALL DOTLESS I WITH CIRCUMFLEX and SMALL DOTLESS I WITH DIAERESIS are cases
in point, which typically affect French; with the I, other languages are
affected as well.

On the other hand, if the implementation is done on the fly by overprinting
or overdisplaying, the difference will be visible when the COMBINING
DIACRITICS are used with a SMALL DOTTED I (a traditional i!), even though,
according to UNICODE, there is no difference of interpretation between the
two encodings.
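
To make the two cases above concrete, here is a minimal sketch. It assumes
Python's standard unicodedata module and the canonical decompositions of
present-day Unicode data; the helper name canonically_equivalent is only
illustrative and comes from no standard.

```python
import unicodedata

# Precomposed characters
I_CIRCUMFLEX = "\u00EE"        # LATIN SMALL LETTER I WITH CIRCUMFLEX
A_RING = "\u00C5"              # LATIN CAPITAL LETTER A WITH RING ABOVE

# Combining sequences
DOTTED_I_SEQ = "i\u0302"       # dotted i + COMBINING CIRCUMFLEX ACCENT
DOTLESS_I_SEQ = "\u0131\u0302" # DOTLESS I + COMBINING CIRCUMFLEX ACCENT
A_RING_SEQ = "A\u030A"         # A + COMBINING RING ABOVE


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Dotted i + circumflex versus the precomposed character
    print(canonically_equivalent(I_CIRCUMFLEX, DOTTED_I_SEQ))
    # Dotless i + circumflex versus the precomposed character
    print(canonically_equivalent(I_CIRCUMFLEX, DOTLESS_I_SEQ))
    # A + ring above versus the precomposed character
    print(canonically_equivalent(A_RING, A_RING_SEQ))
```

Under those assumptions the first and third comparisons hold and the second
does not: the dotted-i sequence is treated as the same text as the
precomposed character, while the dotless-i sequence, which would normally
render identically, is not.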

This is of course only anecdotal. However, IMHO it should be corrected
in UNICODE. But nobody cares except me, it seems.

I would like the two following rules to be true (wish list) :

1. Within a given script, combinations that cannot be told apart from a
   precomposed character should be considered equivalent in UNICODE.

2. Showing a visible difference between UNICODE equivalents should be
   disallowed when only one font is used.

Personally, I also have a problem buying into applications that do double
encoding, as this (as we all know from QP and SGML entities) multiplies the
possibilities of bugs, and also of inconsistencies (in particular in search
engines). I would like everything to pass through the same coding/decoding
process, at the lowest possible level (complete application environment or
even operating system level).
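
As a rough illustration of the search problem, the sketch below decodes
each transfer form once, at the lowest level, before anything is compared.
It assumes Python's standard quopri and html modules; html.unescape only
covers HTML named character references and stands in here for a real SGML
entity resolver, and the function names and the ISO 8859-1 default are
illustrative choices, not anything prescribed.

```python
import html
import quopri
import unicodedata


def decode_qp(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Undo quoted-printable, then decode the bytes with the given charset."""
    return quopri.decodestring(raw).decode(charset)


def decode_entities(text: str) -> str:
    """Replace named and numeric character references with the characters."""
    return html.unescape(text)


def canonical(text: str) -> str:
    """Bring decoded text to one composed form (NFC) before comparing."""
    return unicodedata.normalize("NFC", text)


def matches(needle: str, haystack: str) -> bool:
    """Substring search on text that has already been decoded."""
    return canonical(needle) in canonical(haystack)


if __name__ == "__main__":
    query = "Qu\u00e9bec"
    as_qp = b"Qu=E9bec"
    as_entity = "Qu&eacute;bec"

    # Searching the still-encoded text misses the word.
    print(query in as_entity)
    # Searching after a single decoding pass finds it in both forms.
    print(matches(query, decode_qp(as_qp)))
    print(matches(query, decode_entities(as_entity)))
```

The point of keeping the decoding in one place is that every consumer
(display, sorting, search) then sees the same characters, whichever
transfer form the text arrived in.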

Alain LaBonté
Québec


