Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 17 2006 - 16:10:31 CDT

    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>

    > Unicode email is not working properly
    Wrong. It works as described.

    > ISO Email is working properly.
    Not more reliably than Unicode-based implementations. In fact, most email agents now process messages by first converting every 7/8-bit charset they *know* into Unicode/ISO/IEC 10646. If it works in ISO 8859, it also works without change in Unicode. Implementations that do not convert to Unicode/ISO/IEC 10646 first fail to handle many 7/8-bit encodings, need extremely complex font configurations, or cannot easily be extended to support more.
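
A minimal sketch of that convert-first approach, in Python. The sample strings and the helper name `to_unicode` are assumptions chosen for illustration; only the standard codecs shipped with Python are used.

```python
def to_unicode(raw: bytes, charset: str) -> str:
    """Decode bytes from a known legacy charset into a Unicode string."""
    return raw.decode(charset)

# Illustrative inputs: three texts, each stored in a different 8-bit charset.
latin1_bytes = "d\u00e9j\u00e0 vu".encode("iso-8859-1")   # French, ISO 8859-1
greek_bytes = "αβγ".encode("iso-8859-7")                  # Greek, ISO 8859-7
cyrillic_bytes = "привет".encode("koi8-r")                # Russian, KOI8-R

# After decoding, all further processing sees one representation only.
texts = [
    to_unicode(latin1_bytes, "iso-8859-1"),
    to_unicode(greek_bytes, "iso-8859-7"),
    to_unicode(cyrillic_bytes, "koi8-r"),
]
combined = " / ".join(texts)
print(combined)
```

Once decoded, the three texts can be mixed in a single string, which no single 8-bit charset among the three could represent.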

    > Hacked 8bit coding works among vendor opposition.
    Wrong. Such hacks only work with *prior* mutual agreement and require specific implementations. Unicode/ISO/IEC 10646 requires no such agreement, in particular for emails written in almost any modern language already encoded in ISO/IEC 10646. Many countries have chosen not to develop a 7/8-bit encoding because they do not need one.

    > Unicode desktop publishing is virtually non existent.
    Wrong.

    > ISO desktop publishing is working excellently.
    Wrong, because I think you mean ISO 8859 here, not ISO in general (your terminology is imprecise). Such implementations fail to handle many characters that ISO 10646 supports. The most serious publishing software now uses ISO 10646 as its core working encoding, simply because fonts are now built on it and omit the tables mapping legacy charsets to the needed glyphs. To support more languages and get consistent results, the largest high-quality fonts rely on technologies like OpenType and AAT, which DO require support for ISO/IEC 10646.

    > Hacked 8bit coding work very well in desktop publishing.
    Wrong. Such documents are not portable without the complete set of applications and tuning parameters that make up an environment. Because of licensing problems, such complete sets cannot be exchanged freely, so documents created by one user cannot be worked on by others; this causes problems between publishers and authors, because authors will not want to spend the money needed to match the publisher's environment exactly.

    Hacked encodings are also poorly encoded, and where documentation exists, it is available in only one or a few languages, with poor translations, and with few people ready to help work on it. Unicode and ISO/IEC 10646 competence is available worldwide, and at very low cost. Authors and publishers can also choose the tools they want, and they have plenty of solutions.

    Remember that before desktop publishing, the main work is performed by authors who simply use word processors or common database or spreadsheet office tools. All serious publishers know how to handle the formats authors use in their submissions.

    Remember also that today a publisher's work is not only to print documents but also to prepare them for publication on other media. Publishing on the Internet or on CD-ROMs, or making documents accessible through databases, is a great way to widen their audience; they can then reach a new public and gain a higher value, which benefits authors. (I am not speaking here about the protection of media; that is a separate issue and a separate choice by authors and publishers, on which the encoding of texts has absolutely no influence.)

    > Collation (after 15 years) is not yet working in Unicode.
    Huh???? Completely wrong.

    > ISO collation works very well.
    Very well? Not sure. Not better than collation based on Unicode/ISO/IEC 10646; the results are identical and do not depend on the encoding of the documents. In fact, implementations based on Unicode perform better, because they do not need the special hacks that 7/8-bit encodings require to handle characters present in a document using one encoding but absent from another.
    If you are speaking about binary ordering, that is not collation, and there exists *NO* 7/8-bit encoding whose binary order supports the conventions used in different languages. To support many languages, the issue is *not* the encoding of documents but the rules specific to each locale, which are *completely* independent of the encoding.
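
The difference between binary ordering and collation can be sketched in a few lines of Python. The key function below is a deliberately simplified stand-in (it ignores accents and case at the first level); it is an assumption for illustration and not the Unicode Collation Algorithm or any locale's real rules. The word list is also illustrative.

```python
import unicodedata

def toy_collation_key(text: str):
    """Simplified key: compare base letters first, then the exact string."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)

def binary_sorted(words, charset):
    """Sort by raw byte values in the given 8-bit charset."""
    return [b.decode(charset) for b in sorted(w.encode(charset) for w in words)]

def collated(words, charset):
    """Decode from the charset to Unicode, then sort with the toy key."""
    decoded = [w.encode(charset).decode(charset) for w in words]
    return sorted(decoded, key=toy_collation_key)

# "\u00e9tude" is "étude", "\u00e2pre" is "âpre".
words = ["zebra", "\u00e9tude", "eclair", "\u00e2pre"]

for charset in ("iso-8859-1", "cp850"):
    print(charset, "binary:  ", binary_sorted(words, charset))
    print(charset, "collated:", collated(words, charset))
```

The binary order differs between the two charsets and places the accented words after "zebra" in both, while the collated order is the same whichever charset the bytes came from.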

    > Hacked 8bit coding not known to support collation.
    Wrong. There are Unicode-based implementations that allow users to create simple custom mappings from "hacked" 8-bit encodings to Unicode. With those mappings, collation works immediately, based on existing rules for many languages.
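
Such a custom mapping can be as small as a lookup table. The sketch below assumes a purely hypothetical "hacked font" layout in which three Latin letter slots display Tamil letters; the table and the function name are invented for illustration and describe no real font.

```python
import unicodedata

# Hypothetical layout: byte slot in the hacked font -> real Unicode character.
HACKED_TO_UNICODE = {
    0x61: "\u0b85",  # slot of 'a' shows TAMIL LETTER A
    0x6B: "\u0b95",  # slot of 'k' shows TAMIL LETTER KA
    0x6D: "\u0bae",  # slot of 'm' shows TAMIL LETTER MA
}

def hacked_to_unicode(raw: bytes) -> str:
    """Map each byte through the table; unmapped bytes are read as ISO 8859-1."""
    return "".join(HACKED_TO_UNICODE.get(b, chr(b)) for b in raw)

converted = hacked_to_unicode(b"akm")
names = [unicodedata.name(ch) for ch in converted]
print(names)
```

After the conversion the text consists of ordinary Tamil code points, so any Unicode-aware collation, search, or rendering component can process it without knowing about the hack.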

    > Unicode word processing works with few vendor applications with immense
    > difficulties.
    There is no difficulty today. There are plenty of implementations everywhere, either in well-known commercial office applications or in free open-source applications, which can be supported by a wide range of service providers (under contract) or by the community of users (but without any guarantee of service or response time, and without any obligation for anyone in the community to reply to those who insult them, as you did here!).

    > ISO word processing woks perfectly with mostly all applications.
    Not always (note your repeated terminology error: I suppose you mean ISO 8859 or ISO 646 here, not ISO 10646...). There are plenty of word processing applications that support only a few ISO 8859 encodings (if not only one!).

    > Hacked 8bit works with applications, unless OS vendors deliberately prevent
    > it.
    Wrong. Many applications provide absolutely no way to specify that a hacked encoding is used, so they cannot guarantee consistent results: case conversion, hyphenation, word breaking, line breaking, and so on will not work appropriately, because the original encoding assigned distinct properties to the characters before they were hacked.

    To get consistent results, the hacked positions could only be replaced by characters having ALL the same properties. In practice, this is impossible to achieve for complete alphabets or for non-alphabetic scripts: just look at Indic abugidas, right-to-left scripts, ideographic scripts, or syllabaries, and you will see that it is impossible to create a "hacked font" that supports them and maps characters with the same properties as the non-hacked font.
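
The case-conversion problem can be shown with a small Python sketch. It assumes a hypothetical hacked font in which the byte slots of the Latin letters a, k and m display Tamil letters; the application, unaware of the hack, treats the bytes as ISO 8859-1.

```python
import unicodedata

# Text typed with the hypothetical hacked font: the bytes are Latin letter slots.
hacked_bytes = b"akm"

# The application believes this is ISO 8859-1 and applies Latin case rules.
as_latin = hacked_bytes.decode("iso-8859-1")
uppercased = as_latin.upper().encode("iso-8859-1")
# The bytes now point at other slots, so the hacked font shows different glyphs.
print(hacked_bytes, "->", uppercased)

# The same text as real Unicode (Tamil letters A, KA, MA) is left untouched,
# because these characters have no case mapping.
proper = "\u0b85\u0b95\u0bae"
print(proper.upper() == proper)

# The character properties differ: lowercase letter versus other letter.
print(unicodedata.category("a"), unicodedata.category("\u0b85"))
```

The hacked slot inherits the general category and case mapping of Latin "a", while the real Tamil letter carries its own properties, which is why property-driven operations give wrong results on hacked text.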

    > Standardising the encoding is not to do with interoperability of languages.
    Hmmm... this sentence makes no sense.
    Languages are by definition not interoperable; each has its own semantics, with no exact equivalence. In fact, no human language is completely unified; each is a family of cultures, with differences across regions, social groups, and people (and their own experience of the language).

    > Inter operability is a welcome by product.
    This sentence has no meaning.

    > Unicode is designed primarily of interoperability as the target.
    Not alone! In fact it is Unicode together with ISO/IEC 10646, and the adoption of a common terminology and reference by OS and software vendors, by ISO, and by almost all other standards bodies in the world (public or private), which base other protocols on it or adapt existing protocols to support it whenever possible (notably when the old protocol already allowed specifying several legacy 7/8-bit encodings, using encoding identifiers as in MIME, the IANA registry, or Internet protocols, or other identifiers such as the CCSID on IBM platforms or the codepage numbers on Microsoft and some IBM platforms).
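
As a small illustration of how such identifiers are used in practice, the Python sketch below resolves a few charset labels through the standard `codecs` registry and decodes bytes according to the declared label. The helper name `decode_with_label` and the sample bytes are assumptions for illustration.

```python
import codecs

def decode_with_label(raw: bytes, label: str) -> str:
    """Resolve a declared charset label and decode the bytes to Unicode."""
    codec = codecs.lookup(label)   # raises LookupError for unknown labels
    return raw.decode(codec.name)

# Several labels that designate the same encoding.
latin1_labels = ["ISO-8859-1", "latin1", "iso8859-1"]
resolved = {codecs.lookup(label).name for label in latin1_labels}
print(resolved)

# The same character arrives as different bytes under different labels.
from_mime = decode_with_label(b"caf\xe9", "ISO-8859-1")   # MIME-style label
from_codepage = decode_with_label(b"caf\x82", "IBM850")   # codepage-style label
print(from_mime == from_codepage)
```

Whatever label the protocol carries, the decoded result is the same Unicode string, which is what lets existing protocols adopt Unicode without abandoning their legacy identifiers.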

    > It lost it first hurdle in definition.
    Nothing is demonstrated here. This sentence is too broad to be usable or to be backed by convincing facts. It just looks like a flaming attack.

    > Standard Encoding is to get A language working among all vendor and all user
    > environment.
    > Unicode consortium do not seem to have this goal as it's primary target.
    > Email is not working, desk top publishing is not working, etc.. etc..
    > But inter operability between languages seems to take priority among other
    > things.
    >
    > Nearly 15 years, Unicode is not delivering yet. No one seem to care about
    > this status quo.
    >
    > ISO 8859 can do every thing at least for the near future.
    > For the pride and prejudice, 8859 is being made outcast.
    > Until Unicode works, new 8859 should be allowed so that within 6 to 9 moths
    > all languages will start to work temporarily,
    >
    > while the technically superior Unicode begins to walk say, in about 10 years
    > time.
    >
    > It is the duty of ISO to support 8bit and Unicode, by law and charter.
    > It is not the duty of ISO to out cast tried and tested technology, while
    > allowing to fiddle with encoding for over 15 years now.
    >
    > If any one wish to reply, I prefer discussions on what in Unicode is not
    > working.
    >
    > Kindly
    > Sinnathurai Srivas
    >
    >
    > ----- Original Message -----
    > From: "Asmus Freytag" <asmusf@ix.netcom.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > Cc: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Unicode Mailing List"
    > <unicode@unicode.org>
    > Sent: Monday, July 17, 2006 12:48 AM
    > Subject: Re: Frequent incorrect guesses by the charset autodetection in IE7
    >
    >
    >> On 7/16/2006 4:56 AM, Philippe Verdy wrote:
    >>> There was nothing wrong in the ISO-8859 standard series. ISO just stopped
    >>> working on this, because there remained nobody wanting to continue the
    >>> work in maintaining a 7/8 bit standard, when all focus (and a very large
    >>> consensus at ISO) was for accelerating the development of the newer ISO
    >>> 10646 standard, that the industry and lots of governments and
    >>> organizations wanted to develop.
    >>>
    >> fair statement
    >>> What is important to understand here is that ISO has changed its
    >>> priority: instead of creating many non-interoperable 7/8 bit encodings,
    >>> there was more value into creating a common international standard that
    >>> would contain a universal repertoire of characters.
    >>>
    >> ditto
    >>> Nothing in the iso 10646 standard or Unicode forbids any country from
    >>> deriving a 7/8 bit standard for their national usage and publishing it so
    >>> that it can be supported with low or no cost by software vendors. nothing
    >>> forbids them to even make its support required for use in future products
    >>> sold in their countries, if they think it will be in the country's
    >>> interest.
    >>>
    >> I would quibble with 'low cost'. The total-lifetime cost of a new 7/8 bit
    >> standard is considerable, since it eventually does have to interwork with
    >> 10646 and Unicode, and the more 7/8 bit sets exist, the more difficult it
    >> becomes to manage the legacy sets in a clean way.
    >>> But honestly, the whole 7/8 bit encodings collection was becoming more
    >>> and more problematic and impossible to maintain consistently while also
    >>> ensuring interoperability! Only the ISO 10646 standard allowed to
    >>> reconcile the past incompatible standards, offering a uniform way to
    >>> handle international text and converting with much less errors between
    >>> otherwise incompatible encodings.
    >>>
    >> I think you are trying to say the same thing here.
    >>> The ISO body has NOT deprecated the ISO 646 and ISO 8859-* standard
    >>> series because of course they are widely used (and will continue to be
    >>> used at large scale for very long, probably many decennials, if not more
    >>> than a century, unless there's a complete change of computing technology
    >>> and the changeover occurs at large scale; I even think that ISO 646/US
    >>> and possibly ISO 8859-1 will even survive the ISO 10646 standard when it
    >>> will be replaced by something better based on a new text encoding
    >>> paradigm with additional objectives not addressed today in ISO 10646 and
    >>> Unicode...)
    >>>
    >> ISO standards need to be affirmed or updated every 5 years. The character
    >> coding community realized that data, unlike parsers, operating systems,
    >> renderers and all other elements of software technology, once created
    >> remain in their original format. Therefore they pushed for an option to
    >> allow archiving of unchanged standards - keeping them officially available
    >> for people in need of interpreting legacy data, but not withdrawing them
    >> nor updating them. This is not the same as deprecation, which is usually
    >> the first step to withdrawal of a feature from a standard (and is a term
    >> that does not apply to ISO standards as a whole).
    >>
    >> Except for minor tweaks in language and terminology, affecting mostly the
    >> text of these standards and not the way they were supposed to be used, the
    >> 8859 standards could have been archived long time ago. They are utterly
    >> stable and need to be so.
    >>> Don't say that Unicode and ISO 10646 does not work. All proves today that
    >>> these standards are very successful and that their implementation is
    >>> advancing fast, and available on many computers, supported by most
    >>> languages and tools now, and that efficient implementation is possible
    >>> and available for all, on all types of systems (from the smallest
    >>> hand-held device to the largest mainframes or server farms or computing
    >>> grids).
    >>>
    >>> The complete migration from legacy 7/8 bit encodings to ISO/IEC 10646 is
    >>> an international ongoing effort which is successful and has really helped
    >>> decreasing the digital divide between the richest countries that have the
    >>> power to require support for their legacy 7/8-bit encodings in their
    >>> languages, and the poorest countries that had languages whose 7/8-bit
    >>> encoding was rarely supported. With ISO 10646, softwares can be written
    >>> once to support input, handling and rendering of all languages and
    >>> cultures of the world.
    >>>
    >> Data will likely never migrate - which is one of the factors that makes
    >> adding any new 7/8 bit set so expensive: if it becomes popular at all, it
    >> needs to be kept around essentially forever or there's the risk of
    >> abandoning data.
    >>> With ISO 10646 (and the help of Unicode in its effective implementation),
    >>> it is now in fact much less expensive to convince commercial companies to
    >>> support fully internationalizable softwares, because this single standard
    >>> can be understood by everyone in the world, and it also allows
    >>> collaboration with more parties than just a single supporting government
    >>> or organization.
    >>>
    >>> You want support for a language or script? You don't need to develop a
    >>> new standard. Instead you just need to document a minimum set of missing
    >>> characters to support, and they will be added to the same existing
    >>> standard, and easily supported in existing applications, after others
    >>> have contributed input methods, keyboard drivers, fonts... and academic
    >>> sources can already work on producing text corpus, and studying the rules
    >>> needed to develop stable orthographies for many rare languages. Most of
    >>> the technologies and usage policies will already be there and documented.
    >>>
    >>> In other words, ISO 10646 really saves money everywhere in the world,
    >>> unlike the past incompatible 7/8 bit encodings.
    >>>
    >> Fair conclusion.
    >>
    >> A./



    This archive was generated by hypermail 2.1.5 : Mon Jul 17 2006 - 16:16:30 CDT