Re: Relationship between Unicode and 10646 (was: Re: Shift-JIS conversion.)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 26 2004 - 08:04:21 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > My impression is that Unicode and ISO/IEC 10646 are two distinct
    > standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
    > which have pledged to work together to keep the standards perfectly
    > aligned and interoperable, because it would be destructive to both
    > standards to do otherwise. I don't think of it at all as the "slave and
    > master" relationship Philippe describes.

    Probably not with the assumptions under which one would think of "slave
    and master", but it is still true that there can be only one standards
    body in charge of the character repertoire, and one formal process for
    adding new characters, even if two standards bodies are *working* (I do
    not say *deciding*) in cooperation.

    The alternative would have been for UTC and WG2 to each be allocated some
    code space in which to make the allocations they want, but with the risk
    of duplicate assignments. I really prefer to see the system as a "master
    and slave" relationship, because it gives a simpler view of how
    characters can be assigned in the common repertoire.

    For example, Unicode has no more rights than the national standardization
    bodies involved at ISO/IEC WG2. All of them make proposals, amend
    proposals, suggest modifications, or negotiate to turn the informal
    drafts into a final specification. All I see in the Unicode
    standardization process is that it will eventually approve a proposal,
    but Unicode cannot declare it standard until there has been a formal
    agreement at ISO/IEC WG2, which really rules the effective allocations in
    the common repertoire, even if most of the preparation work producing the
    finalized proposal has been heavily discussed within UTC, with Unicode
    partners or with ISO/IEC members.

    At the same time, ISO/IEC WG2 will also study the proposals made by other
    standardization bodies, including the specifications prepared by other
    ISO working groups or by national standardization bodies. Unicode is not
    the only approved source of proposals and specifications for ISO/IEC WG2
    (and I tend to think that Unicode best represents the interests of
    private companies, whilst national bodies are most often better
    represented by their permanent membership at ISO, where they have full
    rights to vote on or veto proposals according to their national
    interests...)

    The Unicode standard itself agrees to follow the ISO/IEC 10646
    allocations in the repertoire (character names, representative glyphs,
    code points, and code blocks), but in exchange, ISO/IEC has agreed with
    Unicode not to decide about character properties or behavior (which are
    defined either by Unicode, or by national standards based on the ISO/IEC
    10646 coded repertoire, for example the P.R. China GB18030 standard, or
    by other ISO standards like ISO 646 and ISO 8859).

    So, even if the UTC decides to veto a proposal submitted by Unicode
    members, nothing prevents the same members from finding allies within
    national standards bodies, so that they submit the (modified) proposal to
    ISO/IEC 10646 directly, bypassing Unicode, which refuses to transmit that
    proposal.

    Let me give a recent example: the UTC voted against the allocation of a
    new invisible character with the properties of a letter, zero width, and
    the same break opportunities as letters, considering that the existing
    NBSP was enough, even though NBSP causes various complexities related to
    its normative properties when used as a base character for combining
    diacritics. This proposal (previously under informal discussion) was
    rejected by UTC, but this leaves Indian and Israeli standards with
    complex problems for which Unicode proposes no easy solution.
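
    As a rough illustration (a minimal Python sketch using only the standard
    library; the sample word is made up, only the property values come from
    the Unicode Character Database), NBSP is classified as a space separator,
    so even the simplest implementations treat it as whitespace rather than
    as something with letter behavior:

        import re
        import unicodedata

        NBSP = "\u00A0"

        # NBSP's General_Category is Zs (space separator), not a letter category.
        print(unicodedata.category(NBSP))      # 'Zs'
        print(NBSP.isspace(), NBSP.isalpha())  # True False

        # So a naive word tokenizer splits any word that uses NBSP as an
        # invisible word-internal base character, whereas a character with
        # letter properties would keep the word whole.
        word = "ab" + NBSP + "cd"              # hypothetical example text
        print(re.findall(r"\w+", word))        # ['ab', 'cd']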

    So nothing prevents India and Israel from reformulating the proposal at
    ISO/IEC WG2, which may then accept it, even if Unicode previously voted
    against it. If ISO/IEC WG2 accepts the proposal, Unicode will have no
    choice but to accept it in the repertoire, and so to give the new
    character correct properties. Such a proposal will be accepted easily by
    ISO/IEC WG2 if India and Israel demonstrate that the allocation allows
    distinctions which are tricky, computationally difficult, or ambiguous to
    resolve when using NBSP. With a new distinct character, by contrast,
    ISO/IEC 10646 members can demonstrate to Unicode that defining its
    Unicode properties is not difficult, and that it simplifies the problem
    of correctly representing complex cases found in large text corpora.

    Unicode may think that this is a duplicate allocation, because there will
    be cases where two encodings are possible, but the two will not pose the
    same difficulties for implementations of applications like full-text
    search, collation, or determination of break opportunities, notably in
    the many cases where the current Unicode rules already contradict the
    normative behavior of existing national standards (like ISCII in India).
    My opinion is that the two encodings will coexist, but text encoded with
    the new preferred character will be easier to process correctly, and over
    time the legacy encoding using NBSP would be deprecated by usage, making
    the duplicate encodings less of a critical issue for the many
    applications that, for simplicity, are written with only partial
    implementations of the Unicode properties... Legacy encodings will still
    exist, but users of those texts will be given an optional opportunity to
    recode them to match the new preferred encoding, without changing their
    applications.
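
    To make the full-text search point concrete (again a minimal sketch; the
    word and the fold() helper are hypothetical and not part of any Unicode
    algorithm), the legacy NBSP encoding already defeats a naive search
    unless the application special-cases it:

        NBSP = "\u00A0"
        stored = "ab" + NBSP + "cd"   # legacy encoding using NBSP inside a word
        query = "abcd"                # what a user is likely to type

        print(query in stored)        # False: a naive substring search misses it

        # A partial implementation has to bolt on an ad-hoc folding step.
        def fold(text):
            return text.replace(NBSP, "")

        print(query in fold(stored))  # True, but only because the application
                                      # special-cases NBSP itself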

    Unicode already has tons of possible apparent duplicate encodings (see
    for example the non-canonically-equivalent strings that can be created
    with multiple diacritics having the same combining class, even though
    they cannot be made visually distinct, for example with some Indic
    vowels, or with the presentation of some diacritics like the cedilla on
    some Latin letters; see also the characters that should have been defined
    as canonically equivalent but are not, because Unicode has made string
    equivalence classes irrevocable, i.e. "stable", under an agreement signed
    with other standards bodies). Some purists may think that adding new
    apparent duplicates is a problem, but it will be less of a problem if the
    users of the national standards in effect for some scripts are exposed to
    tricky problems or ambiguities with the legacy encoding that simply do
    not appear with the encoding using the new separate allocation.
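
    Here is a small example of such an apparent duplicate (Python again; the
    base letter is arbitrary, only the combining-class values come from the
    UCD): two combining marks with the same canonical combining class do not
    reorder under normalization, so the two orderings remain distinct,
    non-equivalent strings:

        import unicodedata

        DOT_BELOW = "\u0323"      # COMBINING DOT BELOW, ccc=220
        MACRON_BELOW = "\u0331"   # COMBINING MACRON BELOW, ccc=220

        s1 = "a" + DOT_BELOW + MACRON_BELOW
        s2 = "a" + MACRON_BELOW + DOT_BELOW

        # Both marks have combining class 220, so neither NFC nor NFD
        # reorders them, and the two spellings never compare equal.
        print(unicodedata.combining(DOT_BELOW),
              unicodedata.combining(MACRON_BELOW))   # 220 220
        for form in ("NFC", "NFD"):
            same = unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)
            print(form, same)                        # NFC False, NFD False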

    The interests of Unicode and ISO/IEC 10646 diverge: Unicode works so that
    the common repertoire can be handled by the existing software created by
    its private members, but ISO/IEC 10646 members are concerned first with
    the correct representation of their national languages, without loss of
    semantics.

    In some cases, this correct representation conflicts with the simplest
    implementations in Unicode-enabled software, requiring unjustified use of
    large datasets to handle many exceptions; in the absence of such a
    dataset, the text is given wrong interpretations, so that text processing
    loses or changes parts of its semantics. (Note that many of the
    ambiguities come from the Unicode standard itself, as is the case for the
    normative behavior of NBSP at the beginning of a word, or after a
    breakable SPACE... sometimes because of omissions in past versions of the
    standard, or because of unsuspected errors...)

    The easiest solution to this problem: make it simpler to handle by using
    separate encodings when that resolves the difficult ambiguities (notably
    when it is ambiguous which Unicode version, or which of its amendments or
    corrigenda, was in effect when the text was encoded); then publish a
    guide that gives clearly separate interpretations (semantics) to texts
    coded with the legacy character and to texts coded with the new apparent
    "duplicate" character.

    The complex solution is to modify the Unicode algorithms, and this may be
    even more difficult if the change is part of the Unicode core standard or
    one of its standard annexes, or involves one of the normative character
    properties (like general categories or combining classes), or the script
    classification of characters (script-specific versus common).


