Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jul 18 2006 - 17:38:47 CDT


    The list that highlights what is failing in Unicode is a factual list.

    I'll be taking the discussion at a slow pace. I'm going to start with
    graphics and Unicode.

    I am taking PhotoPlus as a starting point. I have several other examples, and
    I know quite a few others here (myself included) are going through the pain of
    justifying why we need Unicode, especially when we annoy a very large number
    of the population with Unicode scribble.

    As a starting point, please see the images at the links below. I would
    appreciate it if anyone is willing to come forward and help me document these
    problems.
    http://www.araichchi.net/kanini/unicode/fail/u-photoplus-fails.jpg
    http://www.araichchi.net/kanini/unicode/fail/unicode_status.htm

    Look for the question marks, and note where they start, despite the use of
    Unicode fonts.

    Regards
    Sinnathurai Srivas

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Asmus Freytag"
    <asmusf@ix.netcom.com>
    Cc: "Unicode Mailing List" <unicode@unicode.org>
    Sent: Monday, July 17, 2006 10:10 PM
    Subject: Re: Frequent incorrect guesses by the charset autodetection in IE7

    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>

    > Unicode email is not working properly
    Wrong. It works as described.

    > ISO Email is working properly.
    Not more reliably than Unicode-based implementations. In fact, most email
    agents now process emails by first converting every 7/8-bit charset they
    *know* into Unicode/ISO/IEC 10646. If it works in ISO 8859, it also works
    without change with Unicode. Implementations that do not first convert to
    Unicode/ISO/IEC 10646 fail to handle many 7/8-bit encodings, require
    extremely complex font configurations, or are not easily extensible to
    support more.
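
    For example, here is a minimal sketch in Java of that "convert known charsets
    to Unicode first" step. It is an illustration only, not code from any
    particular mail agent; the class and variable names are invented, and only
    the standard java.nio.charset API is used.

        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;

        public class DecodeToUnicode {
            public static void main(String[] args) {
                // Raw bytes of an email body declared as ISO-8859-1 ("été")
                byte[] legacyBytes = {(byte) 0xE9, 't', (byte) 0xE9};

                // Step 1: decode the declared legacy charset into Unicode
                // (Java strings hold UTF-16 code units).
                String text = new String(legacyBytes, Charset.forName("ISO-8859-1"));

                // Step 2: from here on the agent works in Unicode only;
                // re-encoding as UTF-8 for storage or display loses nothing.
                byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

                System.out.println(text);          // été
                System.out.println(utf8.length);   // 5 (each 'é' takes two bytes)
            }
        }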

    > Hacked 8bit coding works among vendor opposition.
    Wrong. Such hacks only work with *prior* mutual agreement and require
    specific implementations. Unicode/ISO/IEC 10646 does not require such
    agreement, especially for emails in modern languages, almost all of which
    have been encoded in ISO/IEC 10646. Many countries have chosen not to
    develop a 7/8-bit encoding because they don't need to.

    > Unicode desktop publishing is virtually non existent.
    Wrong.

    > ISO desktop publishing is working excellently.
    Wrong, because I think you mean ISO 8859 here, not ISO (bad terminology on
    your part). Such implementations fail to handle many characters that are
    supported with ISO 10646. The most serious publishing software is now based
    on ISO 10646 as its core working encoding, simply because fonts are now
    built around it and often omit the tables mapping legacy charsets to the
    needed glyphs. To support more languages and get consistent results, the
    largest high-quality fonts are based on technologies like OpenType and AAT,
    which DO require support of ISO/IEC 10646.

    > Hacked 8bit coding work very well in desktop publishing.
    Wrong. Such documents are not portable without the complete set of
    applications and tuning parameters that make up an environment. Due to
    licensing problems, such complete sets cannot be exchanged freely, so
    documents created by one user cannot be worked on by others; this causes
    problems between publishers and authors, because authors will not want to
    invest the money to match the publisher's environment exactly.

    Hacked encodings are also poorly specified, and if documentation is available
    at all, it exists only in one or a few languages, with poor translations and
    few people ready to help work on it. Unicode and ISO/IEC 10646 competence is
    available worldwide, and at very low cost. Authors and publishers can also
    choose the tools they want, and they have plenty of solutions.

    Remember that before desktop publishing, the main work is performed by
    authors who simply use word processors or common database or spreadsheet
    office tools. All serious publishers know how to handle the formats used in
    authors' submissions.

    Remember also that today, the work of a publisher is not only to print
    documents, but also to prepare them for publication on other media.
    Publishing on the Internet or on CD-ROMs, or making documents accessible
    through databases, is a great way to increase their audience; they can then
    reach a new public, and so gain a higher value, which benefits authors. (I am
    not speaking here about media protection; that is a separate issue and a
    separate choice by authors and publishers, on which the encoding of the texts
    has absolutely no influence.)

    > Collation (after 15 years) is not yet working in Unicode.
    Huh???? Completely wrong.

    > ISO collation works very well.
    Very well? Not sure. Not better than collation based on Unicode/ISO/IEC
    10646; the results are identical and do not depend on the encoding of the
    documents. In fact, implementations based on Unicode perform better, because
    they don't need the special hacks that 7/8-bit encodings require to handle
    characters present in one document's encoding but absent from another's.
    If you are speaking about binary ordering, that is not collation, and there
    exists *NO* 7/8-bit encoding whose binary order matches the conventions used
    in different languages. To support many languages, the issue is *not* the
    encoding of the documents, but the rules specific to each locale, which are
    *completely* independent of the encoding.
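
    For example, a minimal sketch in Java (my own illustration; the class and
    variable names are invented) of locale-based collation: the ordering rules
    come from the locale, not from whatever byte encoding the text was stored in.

        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class LocaleCollation {
            public static void main(String[] args) {
                String[] binary = {"cote", "côté", "coté", "côte"};
                String[] tailored = binary.clone();

                // Plain String ordering compares UTF-16 code units, so accented
                // letters sort after every ASCII letter; the result follows code
                // point values, not any language's dictionary conventions.
                Arrays.sort(binary);

                // Locale-aware collation applies the French tailoring of the
                // Unicode collation rules: accents become a secondary difference,
                // and the words interleave the way a dictionary expects.
                Arrays.sort(tailored, Collator.getInstance(Locale.FRENCH));

                System.out.println(Arrays.toString(binary));
                System.out.println(Arrays.toString(tailored));
            }
        }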

    > Hacked 8bit coding not known to support collation.
    Wrong. There are Unicode-based implementations that allow users to create
    simple custom mappings from "hacked" 8-bit encodings to Unicode. With those
    mappings in place, collation works immediately, based on the existing rules
    for many languages.
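
    A minimal sketch in Java of such a mapping (the byte positions and target
    characters are invented for illustration, not taken from any real hacked
    font): once the bytes are remapped to the intended Unicode characters, the
    ordinary locale collator applies unchanged.

        import java.text.Collator;
        import java.util.Locale;

        public class HackedMapping {
            // Hypothetical table: positions 0xC0..0xC2 of a "hacked" font are
            // declared to mean these Unicode characters; unmapped positions
            // simply stay U+0000 in this sketch.
            static final char[] MAP = new char[256];
            static {
                for (int i = 0; i < 128; i++) MAP[i] = (char) i;  // ASCII passes through
                MAP[0xC0] = 'à';
                MAP[0xC1] = 'é';
                MAP[0xC2] = 'ç';
            }

            static String toUnicode(byte[] hacked) {
                StringBuilder sb = new StringBuilder(hacked.length);
                for (byte b : hacked) sb.append(MAP[b & 0xFF]);
                return sb.toString();
            }

            public static void main(String[] args) {
                String a = toUnicode(new byte[]{'f', (byte) 0xC1, 't', 'e'});  // "féte"
                String b = toUnicode(new byte[]{'f', 'a', 't', 'e'});          // "fate"
                // Once mapped, the ordinary locale collator does the rest.
                System.out.println(Collator.getInstance(Locale.FRENCH).compare(a, b));
                // > 0: "féte" sorts after "fate" under French rules
            }
        }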

    > Unicode word processing works with few vendor applications with immense
    > difficulties.
    There's no difficulty today. There are plenty of implementations everywhere,
    either in well-known commercial office applications or in free open-source
    applications, which can be supported by a wide range of service providers
    (under contract) or by the community of users (though without guaranteed
    response times, and without any obligation for anyone in the community to
    reply to those who insult them, as you did here!)

    > ISO word processing works perfectly with mostly all applications.
    Not always (note your repeated terminology error: I suppose you mean ISO 8859
    or ISO 646 here, not ISO 10646...). There are plenty of word processing
    applications that support only a few (if not only one!) of the ISO 8859
    encodings.

    > Hacked 8bit works with applications, unless OS vendors deliberately
    > prevent
    > it.
    Wrong. Many applications provide absolutely no way to specify that a hacked
    encoding is in use, so they cannot guarantee consistent results: case
    conversion, hyphenation, word breaking, line breaking, and so on will not
    work appropriately, because the original encoding assigned different
    properties to the characters before they were hacked.

    Getting consistent results would require that hacked positions be replaced
    only by characters having ALL the same properties. In practice, this is
    impossible to achieve for complete alphabets, or for non-alphabetic scripts
    (just look at Indic abugidas, right-to-left scripts, ideographic scripts, or
    syllabaries, and you'll see that it's impossible to create a "hacked font"
    that supports them while mapping characters with the same properties as the
    non-hacked font).
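
    A minimal sketch in Java (my own illustration) of that property mismatch: the
    algorithms that do case conversion, line breaking, or bidirectional layout
    consult per-character properties, so a Hebrew or Tamil letter hacked into a
    Latin-1 code position inherits the wrong answers.

        public class CharProps {
            public static void main(String[] args) {
                int latinE = 'é';          // U+00E9, a position a hacked font might reuse
                int hebrewAlef = 0x05D0;   // U+05D0 HEBREW LETTER ALEF
                int tamilKa = 0x0B95;      // U+0B95 TAMIL LETTER KA

                // Directionality differs: a bidi algorithm reading the hacked
                // byte as 'é' would lay the Alef out left-to-right.
                System.out.println(Character.getDirectionality(latinE)
                        == Character.DIRECTIONALITY_LEFT_TO_RIGHT);        // true
                System.out.println(Character.getDirectionality(hebrewAlef)
                        == Character.DIRECTIONALITY_RIGHT_TO_LEFT);        // true

                // Case behaviour differs: 'é' uppercases to 'É', but Tamil
                // letters have no case, so "uppercasing" the hacked byte is wrong.
                System.out.println(Character.toUpperCase(latinE) == 'É');       // true
                System.out.println(Character.toUpperCase(tamilKa) == tamilKa);  // true
            }
        }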

    > Standardising the encoding is not to do with interoperability of
    > languages.
    Hmmm... this sentence makes no sense.
    Languages are by definition not interoperable; each has its own semantics
    with no exact equivalence. In fact, no human language is completely unified:
    each is really a family of cultures with differences across regions, social
    groups, and individuals (and their own experience of the language).

    > Inter operability is a welcome by product.
    This sentence has no meaning.

    > Unicode is designed primarily of interoperability as the target.
    Not interoperability alone! What matters is Unicode together with ISO/IEC
    10646 and the adoption of a common terminology and reference by OS and
    software vendors, by ISO, and by almost every other standards body in the
    world (public or private), which base other protocols on it, or adapt
    existing protocols to support it whenever possible (notably when the old
    protocol already allowed specifying several legacy 7/8-bit encodings, using
    encoding identifiers as in MIME, in the IANA registry or in Internet
    protocols, or other identifiers such as the CCSID on IBM platforms, or the
    codepage numbers on Microsoft and some IBM platforms).

    > It lost it first hurdle in definition.
    Nothing is demonstrated here. The sentence is too broad to be usable or to be
    supported by convincing facts. This just looks like a flaming attack.

    > Standard Encoding is to get A language working among all vendor and all
    > user
    > environment.
    > Unicode consortium do not seem to have this goal as it's primary target.
    > Email is not working, desk top publishing is not working, etc.. etc..
    > But inter operability between languages seems to take priority among other
    > things.
    >
    > Nearly 15 years, Unicode is not delivering yet. No one seem to care about
    > this status quo.
    >
    > ISO 8859 can do every thing at least for the near future.
    > For the pride and prejudice, 8859 is being made outcast.
    > Until Unicode works, new 8859 should be allowed so that within 6 to 9
    > months
    > all languages will start to work temporarily,
    >
    > while the technically superior Unicode begins to walk say, in about 10
    > years
    > time.
    >
    > It is the duty of ISO to support 8bit and Unicode, by law and charter.
    > It is not the duty of ISO to out cast tried and tested technology, while
    > allowing to fiddle with encoding for over 15 years now.
    >
    > If any one wish to reply, I prefer discussions on what in Unicode is not
    > working.
    >
    > Kindly
    > Sinnathurai Srivas
    >
    >
    > ----- Original Message -----
    > From: "Asmus Freytag" <asmusf@ix.netcom.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > Cc: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Unicode Mailing
    > List"
    > <unicode@unicode.org>
    > Sent: Monday, July 17, 2006 12:48 AM
    > Subject: Re: Frequent incorrect guesses by the charset autodetection in
    > IE7
    >
    >
    >> On 7/16/2006 4:56 AM, Philippe Verdy wrote:
    >>> There was nothing wrong in the ISO-8859 standard series. ISO just
    >>> stopped
    >>> working on this, because there remained nobody wanting to continue the
    >>> work in maintaining a 7/8 bit standard, when all focus (and a very large
    >>> consensus at ISO) was for accelerating the development of the newer ISO
    >>> 10646 standard, that the industry and lots of governments and
    >>> organizations wanted to develop.
    >>>
    >> fair statement
    >>> What is important to understand here is that ISO has changed its
    >>> priority: instead of creating many non-interoperable 7/8 bit encodings,
    >>> there was more value into creating a common international standard that
    >>> would contain a universal repertoire of characters.
    >>>
    >> ditto
    >>> Nothing in the iso 10646 standard or Unicode forbids any country from
    >>> deriving a 7/8 bit standard for their national usage and publishing it
    >>> so
    >>> that it can be supported with low or no cost by software vendors.
    >>> nothing
    >>> forbids them to even make its support required for use in future
    >>> products
    >>> sold in their countries, if they think it will be in the country's
    >>> interest.
    >>>
    >> I would quibble with 'low cost'. The total-lifetime cost of a new 7/8 bit
    >> standard is considerable, since it eventually does have to interwork with
    >> 10646 and Unicode, and the more 7/8 bit sets exist, the more difficult it
    >> becomes to manage the legacy sets in a clean way.
    >>> But honestly, the whole 7/8 bit encodings collection was becoming more
    >>> and more problematic and impossible to maintain consistently while also
    >>> ensuring interoperability! Only the ISO 10646 standard allowed to
    >>> reconcile the past incompatible standards, offering a uniform way to
    >>> handle international text and converting with much less errors between
    >>> otherwise incompatible encodings.
    >>>
    >> I think you are trying to say the same thing here.
    >>> The ISO body has NOT deprecated the ISO 646 and ISO 8859-* standard
    >>> series because of course they are widely used (and will continue to be
    >>> used at large scale for very long, probably many decades, if not more
    >>> than a century, unless there's a complete change of computing technology
    >>> and the changeover occurs at large scale; I even think that ISO 646/US
    >>> and possibly ISO 8859-1 will even survive the ISO 10646 standard when it
    >>> will be replaced by something better based on a new text encoding
    >>> paradigm with additional objectives not addressed today in ISO 10646 and
    >>> Unicode...)
    >>>
    >> ISO standards need to be affirmed or updated every 5 years. The character
    >> coding community realized that data, unlike parsers, operating systems,
    >> renderers and all other elements of software technology, once created
    >> remain in their original format. Therefore they pushed for an option to
    >> allow archiving of unchanged standards - keeping them officially
    >> available
    >> for people in need of interpreting legacy data, but not withdrawing them
    >> nor updating them. This is not the same as deprecation, which is usually
    >> the first step to withdrawal of a feature from a standard (and is a term
    >> that does not apply to ISO standards as a whole).
    >>
    >> Except for minor tweaks in language and terminology, affecting mostly the
    >> text of these standards and not the way they were supposed to be used, the
    >> 8859 standards could have been archived a long time ago. They are utterly
    >> stable and need to be so.
    >>> Don't say that Unicode and ISO 10646 does not work. All proves today
    >>> that
    >>> these standards are very successful and that their implementation is
    >>> advancing fast, and available on many computers, supported by most
    >>> languages and tools now, and that efficient implementation is possible
    >>> and available for all, on all types of systems (from the smallest
    >>> hand-held device to the largest mainframes or server farms or computing
    >>> grids).
    >>>
    >>> The complete migration from legacy 7/8 bit encodings to ISO/IEC 10646 is
    >>> an international ongoing effort which is successful and has really
    >>> helped
    >>> decreasing the digital divide between the richest countries that have
    >>> the
    >>> power to require support for their legacy 7/8-bit encodings in their
    >>> languages, and the poorest countries that had languages whose 7/8-bit
    >>> encoding was rarely supported. With ISO 10646, softwares can be written
    >>> once to support input, handling and rendering of all languages and
    >>> cultures of the world.
    >>>
    >> Data will likely never migrate - which is one of the factors that makes
    >> adding any new 7/8 bit set so expensive: if it becomes popular at all, it
    >> needs to be kept around essentially forever or there's the risk of
    >> abandoning data.
    >>> With ISO 10646 (and the help of Unicode in its effective
    >>> implementation),
    >>> it is now in fact much less expensive to convince commercial companies to
    >>> support fully internationalizable softwares, because this single
    >>> standard
    >>> can be understood by everyone in the world, and it also allows
    >>> collaboration with more parties than just a single supporting government
    >>> or organization.
    >>>
    >>> You want support for a language or script? You don't need to develop a
    >>> new standard. Instead you just need to document a minimum set of missing
    >>> characters to support, and they will be added to the same existing
    >>> standard, and easily supported in existing applications, after others
    >>> have contributed input methods, keyboard drivers, fonts... and academic
    >>> sources can already work on producing text corpus, and studying the
    >>> rules
    >>> needed to develop stable orthographies for many rare languages. Most of
    >>> the technologies and usage policies will already be there and
    >>> documented.
    >>>
    >>> In other words, ISO 10646 really saves money everywhere in the world,
    >>> unlike the past incompatible 7/8 bit encodings.
    >>>
    >> Fair conclusion.
    >>
    >> A./
    >>
    >>
    >>
    >
    >
    >
    >


