RE: marks (2 new symbols)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 02 2007 - 10:29:53 CST


    Mark E. Shoulson wrote:
    > And nobody is seriously proposing a
    > Next-Generation Unicode in which the so-called "Cleanicode" (Unicode
    > where everything is done *right*) is implemented from scratch. Such a
    > radical change would not be worth the pain of implementing it.

    The same argument used against Dmitry's arguments about the need to encode
    only things that *are* used could also be applied to your argument here.

    Even though there's still no argument for a radical change in the way
    Unicode encodes text, there's no guarantee that the Unicode or ISO 10646
    standards will last forever (though any replacement is most probably a very
    looooooong-term prospect...).

    Like all other standards, these standards have a lifetime. They will work
    and will be used until someone demonstrates that there's a superior way to
    handle text more consistently, *and* those proposing a newer standard have
    convinced enough other people to change the way they handle text by
    adopting the newer proposal as their core encoding for handling
    everything.

    But trying to convince people to shift to another core standard will run
    into the same issues: to convince people to make the change, the newer
    standard would need to define and implement conversion rules that allow good
    interoperability with the *huge* amount of text and applications that will
    still depend *only* on Unicode for a very long time.

    So the designer of the new "Unicode II" or "Cleanicode" or "Next-Generation"
    standard (whatever its name) will have to face the same problem as Unicode:
    handling lots of roundtrip compatibility with the best and most widely used
    standards of the past. The new standard will therefore also retain various
    tricks needed for compatibility: it too will have things encoded like
    "compatibility characters" (or other entities), not recommended for normal
    use, but still valid for a long time, until everything works using only the
    core standard with its "canonical" strings!

    And they will have to convince not only the users of the standard for
    encoding text, but also the designers of the countless other standards that
    have adopted Unicode or ISO 10646 as their core encoding, whose support is
    now mandatory (if not the only encoding they support now), plus convince
    software implementers to adapt their software to support these changes,
    plus convince users to buy, install and use the new software (and bear
    the cost of this software upgrade).

    I can't make any precise estimate of the cost of converting from
    Unicode/ISO 10646 to something else, but it would be tremendous worldwide
    (many, many billions of dollars, euros, pounds or other currencies) and
    would affect almost everybody on earth in their daily life, due to the huge
    number of applications and objects that now depend on a correct
    implementation of Unicode and ISO 10646.

    Nothing will prevent a newer text encoding "standard" from being designed,
    implemented and used, but it will remain confined to small local areas, for
    specific needs of small communities of users with little interaction with
    the rest of the world.

    The only thing that can realistically happen now is the development of
    several alternate encodings for scripts that are rarely used outside of
    these small communities of specialists, or for scripts that are still not
    encoded (and where Unicode or ISO 10646 should not interfere before there's
    some wide agreement among all users in these communities, *and* they request
    the encoding within the UCS to facilitate interoperability with
    applications and systems made by others outside these communities). In those
    areas, there may exist errors that *may* be partially corrected in Unicode.

    (Note: when I say "small" communities, I'm not speaking about the number of
    people needing the requested encoding, but about the number of people and
    applications actually using it. This includes, for example, the Burmese
    community, which is quite large but currently has lots of difficulties
    with the current encoding of its script, so that its members still
    consider the script as not encoded, even though it is currently part of the
    standard. As they perceive it, some parts of the existing standard may be
    kept, but other parts would need to be "deprecated" or "not recommended"
    for general use of the script, because they would cause unsolvable
    interoperability problems. In any case, Unicode will not change the existing
    encoding: the only thing it will do is *add* other, better characters with
    better behaviour and properties, where this is needed *and* demonstrated by
    actual use in some other non-standard encodings, with good roundtrip
    compatibility with the best practices demonstrated in those external
    encodings.)

    For now I see no justification for changing the standard: if it's not
    suitable as the core encoding for implementing some text handling algorithm
    in some application, nothing prevents this application from implementing
    another core encoding for local use, with conversion routines for its input
    and output, if this facilitates the work it needs to do internally (this
    includes the possibility of using other character properties internally, or
    other decompositions, or another normalization with a different ordering of
    the encoded entities, or supplementing the encoded entities with entities
    other than just characters).
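As a rough illustration of that idea, here is a minimal Python sketch of an application that keeps its own internal form and converts only at its input and output boundaries. The function names (to_internal, from_internal, strip_marks) and the choice of "a tuple of NFD code points" as the internal form are assumptions made for this example, not anything defined by Unicode; only the standard unicodedata module is used.

```python
import unicodedata
from typing import Tuple

# Hypothetical internal form: a tuple of integer code points in NFD order.
Internal = Tuple[int, ...]

def to_internal(text: str) -> Internal:
    """Input conversion: decompose the interchange text into the local form."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFD", text))

def from_internal(units: Internal) -> str:
    """Output conversion: recompose the local form into NFC text."""
    return unicodedata.normalize("NFC", "".join(chr(u) for u in units))

def strip_marks(units: Internal) -> Internal:
    """An internal algorithm that is easier on decomposed data:
    drop every code point with a non-zero combining class."""
    return tuple(u for u in units if unicodedata.combining(chr(u)) == 0)

word = "caf\u00e9"                       # "cafe" with a precomposed e-acute
internal = to_internal(word)             # 5 units: c, a, f, e, U+0301
roundtrip = from_internal(internal)      # back to the interchange form
stripped = from_internal(strip_marks(internal))
```

Note that the round trip only preserves the text up to canonical equivalence: an input that was not already in NFC comes back in NFC. That is the kind of trade-off such a local encoding has to document for itself.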

    Just look at the UCA, which applies such internal transforms to the
    encoded text by converting characters into collation keys, which are other
    entities not behaving like characters: even though the UCA is
    standardized, the entities it handles are not standardized in a mandatory
    way (they are "tailorable" everywhere) and it does not create a new standard
    encoding meant for general data interchange (even the existing documented
    collation keys for the DUCET are mutable across Unicode versions).
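To make the "collation keys are not characters" point concrete, here is a deliberately simplified Python sketch. It is NOT the UCA and does not use DUCET weights: the three-level key, the weights (plain code point values) and the tailoring dictionary are all assumptions invented for this illustration. It only shows that sorting can operate on derived, tailorable entities that never leave the application.

```python
import unicodedata

def collation_key(text, tailoring=None):
    """Toy three-level sort key (NOT the UCA, no DUCET weights).

    primary   : weights of lowercased base characters
    secondary : combining marks (accents)
    tertiary  : 1 where the base character was not already lowercase
    `tailoring` is a hypothetical map from a lowercased base character
    to an overriding primary weight.
    """
    tailoring = tailoring or {}
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            secondary.append(ord(ch))
            continue
        base = ch.lower()
        if base in tailoring:
            primary.append(tailoring[base])
        else:
            # lower() may yield more than one character, so extend.
            primary.extend(ord(c) for c in base)
        tertiary.append(0 if ch == base else 1)
    return (tuple(primary), tuple(secondary), tuple(tertiary))

words = ["zebra", "Apple", "\u00e9cole", "eclair", "apple"]
by_codepoint = sorted(words)
by_key = sorted(words, key=collation_key)

# A made-up tailoring that moves "z" before every other letter.
tailored = sorted(["apple", "zebra"],
                  key=lambda w: collation_key(w, {"z": 0}))
```

The keys are plain tuples of integers: they are useful only inside the comparison, can be changed by a tailoring at will, and would make no sense as an interchange format, which is the property the paragraph above attributes to real UCA collation keys.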


