Re: Relationship between Unicode and 10646 (was: Re: Shift-JIS conversion.)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 26 2004 - 08:04:21 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > My impression is that Unicode and ISO/IEC 10646 are two distinct
    > standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
    > which have pledged to work together to keep the standards perfectly
    > aligned and interoperable, because it would be destructive to both
    > standards to do otherwise. I don't think of it at all as the "slave and
    > master" relationship Philippe describes.

    Probably not with the assumptions under which one would think of "slave
    and master", but it is still true that there can be only one standards
    body in charge of the character repertoire, and one formal process for
    adding new characters, even if two standards bodies are *working* (I do
    not say *deciding*) in cooperation.

    The alternative would have been for UTC and WG2 to each be allocated some
    code space in which to make the allocations they want, but with the risk
    of duplicate assignments. I really prefer to see the system as a "master
    and slave" relationship, because it gives a simpler view of how
    characters can be assigned in the common repertoire.

    For example, Unicode has no more rights than the national standardization
    bodies involved at ISO/IEC WG2. All of them make proposals, amend
    proposals, suggest modifications, or negotiate to turn the informal
    drafts into a final specification. All I see in the Unicode
    standardization process is that it will eventually approve a proposal,
    but Unicode cannot declare it standard until there has been a formal
    agreement at ISO/IEC WG2, which really rules the effective allocations in
    the common repertoire, even if most of the preparation work producing the
    finalized proposal has been heavily discussed within UTC, with Unicode
    partners or with ISO/IEC members.

    At the same time, ISO/IEC WG2 will also study the proposals made by other
    standardization bodies, including the specifications prepared by other
    ISO working groups or by national standardization bodies. Unicode is not
    the only approved source of proposals and specifications for ISO/IEC WG2
    (and I tend to think that Unicode best represents the interests of
    private companies, whilst national bodies are most often better
    represented by their permanent membership at ISO, where they have full
    rights to vote on or veto proposals according to their national
    interests...)

    The Unicode standard itself agrees to follow the ISO/IEC 10646
    allocations in the repertoire (character names, representative glyphs,
    code points, and code blocks), but in exchange, ISO/IEC has agreed with
    Unicode not to decide about character properties or behavior (which are
    defined either by Unicode, or by national standards based on the ISO/IEC
    10646 coded repertoire, for example the P.R. China GB18030 standard, or
    by other ISO standards like ISO 646 and ISO 8859).

    So, even if the UTC decides to veto a proposal submitted by Unicode
    members, nothing prevents the same members from finding allies within
    national standards bodies, so that they submit the (modified) proposal to
    ISO/IEC 10646 directly, bypassing Unicode, which refuses to transmit that
    proposal.

    Let me give a recent example: the UTC voted against the allocation of a
    new invisible character with the properties of a letter, zero width, and
    the same break opportunities as letters, considering that the existing
    NBSP was enough, even though NBSP causes various complexities related to
    its normative properties when used as a base character for combining
    diacritics. This proposal (previously under informal discussion) was
    rejected by UTC, but this leaves Indian and Israeli standards with
    complex problems for which Unicode proposes no easy solution.
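
    As a rough illustration (a minimal Python sketch using only the standard
    library; the sample word is made up, only the property values come from
    the Unicode Character Database), NBSP is classified as a space separator,
    so even the simplest implementations treat it as whitespace rather than
    as something with letter behavior:

        import re
        import unicodedata

        NBSP = "\u00A0"

        # NBSP's General_Category is Zs (space separator), not a letter category.
        print(unicodedata.category(NBSP))      # 'Zs'
        print(NBSP.isspace(), NBSP.isalpha())  # True False

        # So a naive word tokenizer splits any word that uses NBSP as an
        # invisible word-internal base character, whereas a character with
        # letter properties would keep the word whole.
        word = "ab" + NBSP + "cd"              # hypothetical example text
        print(re.findall(r"\w+", word))        # ['ab', 'cd']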

    So nothing prevents India and Israel from reformulating the proposal at
    ISO/IEC WG2, which may then accept it, even if Unicode previously voted
    against it. If ISO/IEC WG2 accepts the proposal, Unicode will have no
    choice but to accept it in the repertoire, and so to give the new
    character correct properties. Such a proposal will be accepted easily by
    ISO/IEC WG2 if India and Israel demonstrate that the allocation allows
    distinctions which are tricky, computationally difficult, or ambiguous to
    resolve when using NBSP. With a new distinct character, by contrast,
    ISO/IEC 10646 members can demonstrate to Unicode that defining its
    Unicode properties is not difficult, and that it simplifies the problem
    of correctly representing complex cases found in large text corpora.

    Unicode may think that this is a duplicate allocation, because there will
    be cases where two encodings are possible, but the two will not pose the
    same difficulties for implementations of applications like full-text
    search, collation, or determination of break opportunities, notably in
    the many cases where the current Unicode rules already contradict the
    normative behavior of existing national standards (like ISCII in India).
    My opinion is that the two encodings will coexist, but text encoded with
    the new preferred character will be easier to process correctly, and over
    time the legacy encoding using NBSP would be deprecated by usage, making
    the duplicate encodings less of a critical issue for the many
    applications that, for simplicity, are written with only partial
    implementations of the Unicode properties... Legacy encodings will still
    exist, but users of those texts will be given an optional opportunity to
    recode them to match the new preferred encoding, without changing their
    applications.
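
    To make the full-text search point concrete (again a minimal sketch; the
    word and the fold() helper are hypothetical and not part of any Unicode
    algorithm), the legacy NBSP encoding already defeats a naive search
    unless the application special-cases it:

        NBSP = "\u00A0"
        stored = "ab" + NBSP + "cd"   # legacy encoding using NBSP inside a word
        query = "abcd"                # what a user is likely to type

        print(query in stored)        # False: a naive substring search misses it

        # A partial implementation has to bolt on an ad-hoc folding step.
        def fold(text):
            return text.replace(NBSP, "")

        print(query in fold(stored))  # True, but only because the application
                                      # special-cases NBSP itself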

    Unicode already has tons of possible apparent duplicate encodings (see
    for example the non-canonically-equivalent strings that can be created
    with multiple diacritics having the same combining class, even though
    they cannot be made visually distinct, for example with some Indic
    vowels, or with the presentation of some diacritics like the cedilla on
    some Latin letters; see also the characters that should have been defined
    as canonically equivalent but are not, because Unicode has made string
    equivalence classes irrevocable, i.e. "stable", under an agreement signed
    with other standards bodies). Some purists may think that adding new
    apparent duplicates is a problem, but it will be less of a problem if the
    users of the national standards in effect for some scripts are exposed to
    tricky problems or ambiguities with the legacy encoding that simply do
    not appear with the encoding using the new separate allocation.
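
    Here is a small example of such an apparent duplicate (Python again; the
    base letter is arbitrary, only the combining-class values come from the
    UCD): two combining marks with the same canonical combining class do not
    reorder under normalization, so the two orderings remain distinct,
    non-equivalent strings:

        import unicodedata

        DOT_BELOW = "\u0323"      # COMBINING DOT BELOW, ccc=220
        MACRON_BELOW = "\u0331"   # COMBINING MACRON BELOW, ccc=220

        s1 = "a" + DOT_BELOW + MACRON_BELOW
        s2 = "a" + MACRON_BELOW + DOT_BELOW

        # Both marks have combining class 220, so neither NFC nor NFD
        # reorders them, and the two spellings never compare equal.
        print(unicodedata.combining(DOT_BELOW),
              unicodedata.combining(MACRON_BELOW))   # 220 220
        for form in ("NFC", "NFD"):
            same = unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)
            print(form, same)                        # NFC False, NFD False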

    The interests of Unicode and ISO/IEC 10646 diverge: Unicode works so that
    the common repertoire can be handled by the existing software created by
    its private members, but ISO/IEC 10646 members are concerned first with
    the correct representation of their national languages, without loss of
    semantics.

    In some cases, this correct representation conflicts with the simplest
    implementations in Unicode-enabled software, requiring unjustified use of
    large datasets to handle many exceptions; in the absence of such a
    dataset, the text is given wrong interpretations, so that text processing
    loses or changes parts of its semantics. (Note that many of the
    ambiguities come from the Unicode standard itself, as is the case for the
    normative behavior of NBSP at the beginning of a word, or after a
    breakable SPACE... sometimes because of omissions in past versions of the
    standard, or because of unsuspected errors...)

    The easiest solution to this problem: make it simpler to handle by using
    separate encodings when that resolves the difficult ambiguities (notably
    when it is ambiguous which Unicode version, or which of its amendments or
    corrigenda, was in effect when the text was encoded); then publish a
    guide that gives clearly separate interpretations (semantics) to texts
    coded with the legacy character and to texts coded with the new apparent
    "duplicate" character.

    The complex solution is to modify the Unicode algorithms, and this may be
    even more difficult if the change is part of the Unicode core standard or
    one of its standard annexes, or involves one of the normative character
    properties (like general categories or combining classes), or the script
    classification of characters (script-specific versus common).


