Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Jul 18 2006 - 17:38:47 CDT


    The list that highlights what is failing in Unicode is a factual list.

    I'll be taking the discussion at a slow pace. I'm going to start with
    graphics and Unicode.

    I am taking PhotoPlus as a starting point. I have several other examples, and
    I know quite a few others here (myself included) are going through the pain of
    justifying why we need Unicode, especially when we annoy a very large number
    of the population with Unicode scribble.

    As a starting point, please see the images at the links below. I would
    appreciate it if anyone is willing to come forward and help me document these
    problems.
    http://www.araichchi.net/kanini/unicode/fail/u-photoplus-fails.jpg
    http://www.araichchi.net/kanini/unicode/fail/unicode_status.htm

    Look for the question marks, and note where they start, despite the use of
    Unicode fonts.

    Regards
    Sinnathurai Srivas

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Asmus Freytag"
    <asmusf@ix.netcom.com>
    Cc: "Unicode Mailing List" <unicode@unicode.org>
    Sent: Monday, July 17, 2006 10:10 PM
    Subject: Re: Frequent incorrect guesses by the charset autodetection in IE7

    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>

    > Unicode email is not working properly
    Wrong. It works as described.

    > ISO Email is working properly.
    Not more reliably than Unicode-based implementations. In fact, most email
    agents now process emails by first converting every 7/8-bit charset they
    *know* into Unicode/ISO/IEC 10646. If it works in ISO 8859, it also works
    without change with Unicode. Implementations that do not first convert to
    Unicode/ISO/IEC 10646 fail to handle many 7/8-bit encodings, require
    extremely complex font configurations, or are not easily extensible to
    support more.
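
    For example, here is a minimal sketch in Java of that "convert known charsets
    to Unicode first" step. It is an illustration only, not code from any
    particular mail agent; the class and variable names are invented, and only
    the standard java.nio.charset API is used.

        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;

        public class DecodeToUnicode {
            public static void main(String[] args) {
                // Raw bytes of an email body declared as ISO-8859-1 ("été")
                byte[] legacyBytes = {(byte) 0xE9, 't', (byte) 0xE9};

                // Step 1: decode the declared legacy charset into Unicode
                // (Java strings hold UTF-16 code units).
                String text = new String(legacyBytes, Charset.forName("ISO-8859-1"));

                // Step 2: from here on the agent works in Unicode only;
                // re-encoding as UTF-8 for storage or display loses nothing.
                byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

                System.out.println(text);          // été
                System.out.println(utf8.length);   // 5 (each 'é' takes two bytes)
            }
        }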

    > Hacked 8bit coding works among vendor opposition.
    Wrong. Such hacks only work with *prior* mutual agreement and require
    specific implementations. Unicode/ISO/IEC 10646 does not require such
    agreement, especially for emails in modern languages, almost all of which
    have been encoded in ISO/IEC 10646. Many countries have chosen not to
    develop a 7/8-bit encoding because they don't need to.

    > Unicode desktop publishing is virtually non existent.
    Wrong.

    > ISO desktop publishing is working excellently.
    Wrong, because I think you mean ISO 8859 here, not ISO (bad terminology on
    your part). Such implementations fail to handle many characters that are
    supported with ISO 10646. The most serious publishing software is now based
    on ISO 10646 as its core working encoding, simply because fonts are now
    built around it and often omit the tables mapping legacy charsets to the
    needed glyphs. To support more languages and get consistent results, the
    largest high-quality fonts are based on technologies like OpenType and AAT,
    which DO require support of ISO/IEC 10646.

    > Hacked 8bit coding work very well in desktop publishing.
    Wrong. Such documents are not portable without the complete set of
    applications and tuning parameters that make up an environment. Due to
    licensing problems, such complete sets cannot be exchanged freely, so
    documents created by one user cannot be worked on by others; this causes
    problems between publishers and authors, because authors will not want to
    invest the money to match the publisher's environment exactly.

    Hacked encodings are also poorly specified, and if documentation is available
    at all, it exists only in one or a few languages, with poor translations and
    few people ready to help work on it. Unicode and ISO/IEC 10646 competence is
    available worldwide, and at very low cost. Authors and publishers can also
    choose the tools they want, and they have plenty of solutions.

    Remember that before desktop publishing, the main work is performed by
    authors who simply use word processors or common database or spreadsheet
    office tools. All serious publishers know how to handle the formats used in
    authors' submissions.

    Remember also that today, the work of a publisher is not only to print
    documents, but also to prepare them for publication on other media.
    Publishing on the Internet or on CD-ROMs, or making documents accessible
    through databases, is a great way to increase their audience; they can then
    reach a new public, and so gain a higher value, which benefits authors. (I am
    not speaking here about media protection; that is a separate issue and a
    separate choice by authors and publishers, on which the encoding of the texts
    has absolutely no influence.)

    > Collation (after 15 years) is not yet working in Unicode.
    Huh???? Completely wrong.

    > ISO collation works very well.
    Very well? Not sure. Not better than collation based on Unicode/ISO/IEC
    10646; the results are identical and do not depend on the encoding of the
    documents. In fact, implementations based on Unicode perform better, because
    they don't need the special hacks that 7/8-bit encodings require to handle
    characters present in one document's encoding but absent from another's.
    If you are speaking about binary ordering, that is not collation, and there
    exists *NO* 7/8-bit encoding whose binary order matches the conventions used
    in different languages. To support many languages, the issue is *not* the
    encoding of the documents, but the rules specific to each locale, which are
    *completely* independent of the encoding.
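
    For example, a minimal sketch in Java (my own illustration; the class and
    variable names are invented) of locale-based collation: the ordering rules
    come from the locale, not from whatever byte encoding the text was stored in.

        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class LocaleCollation {
            public static void main(String[] args) {
                String[] binary = {"cote", "côté", "coté", "côte"};
                String[] tailored = binary.clone();

                // Plain String ordering compares UTF-16 code units, so accented
                // letters sort after every ASCII letter; the result follows code
                // point values, not any language's dictionary conventions.
                Arrays.sort(binary);

                // Locale-aware collation applies the French tailoring of the
                // Unicode collation rules: accents become a secondary difference,
                // and the words interleave the way a dictionary expects.
                Arrays.sort(tailored, Collator.getInstance(Locale.FRENCH));

                System.out.println(Arrays.toString(binary));
                System.out.println(Arrays.toString(tailored));
            }
        }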

    > Hacked 8bit coding not known to support collation.
    Wrong. There are Unicode-based implementations that allow users to create
    simple custom mappings from "hacked" 8-bit encodings to Unicode. With those
    mappings in place, collation works immediately, based on the existing rules
    for many languages.
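
    A minimal sketch in Java of such a mapping (the byte positions and target
    characters are invented for illustration, not taken from any real hacked
    font): once the bytes are remapped to the intended Unicode characters, the
    ordinary locale collator applies unchanged.

        import java.text.Collator;
        import java.util.Locale;

        public class HackedMapping {
            // Hypothetical table: positions 0xC0..0xC2 of a "hacked" font are
            // declared to mean these Unicode characters; unmapped positions
            // simply stay U+0000 in this sketch.
            static final char[] MAP = new char[256];
            static {
                for (int i = 0; i < 128; i++) MAP[i] = (char) i;  // ASCII passes through
                MAP[0xC0] = 'à';
                MAP[0xC1] = 'é';
                MAP[0xC2] = 'ç';
            }

            static String toUnicode(byte[] hacked) {
                StringBuilder sb = new StringBuilder(hacked.length);
                for (byte b : hacked) sb.append(MAP[b & 0xFF]);
                return sb.toString();
            }

            public static void main(String[] args) {
                String a = toUnicode(new byte[]{'f', (byte) 0xC1, 't', 'e'});  // "féte"
                String b = toUnicode(new byte[]{'f', 'a', 't', 'e'});          // "fate"
                // Once mapped, the ordinary locale collator does the rest.
                System.out.println(Collator.getInstance(Locale.FRENCH).compare(a, b));
                // > 0: "féte" sorts after "fate" under French rules
            }
        }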

    > Unicode word processing works with few vendor applications with immense
    > difficulties.
    There's no difficulty today. There are plenty of implementations everywhere,
    either in well-known commercial office applications or in free open-source
    applications, which can be supported by a wide range of service providers
    (under contract) or by the community of users (though without guaranteed
    response times, and without any obligation for anyone in the community to
    reply to those who insult them, as you did here!)

    > ISO word processing works perfectly with mostly all applications.
    Not always (note your repeated terminology error: I suppose you mean ISO 8859
    or ISO 646 here, not ISO 10646...). There are plenty of word processing
    applications that support only a few (if not only one!) of the ISO 8859
    encodings.

    > Hacked 8bit works with applications, unless OS vendors deliberately
    > prevent
    > it.
    Wrong. Many applications provide absolutely no way to specify that a hacked
    encoding is in use, so they cannot guarantee consistent results: case
    conversion, hyphenation, word breaking, line breaking, and so on will not
    work appropriately, because the original encoding assigned different
    properties to the characters before they were hacked.

    Getting consistent results would require that hacked positions be replaced
    only by characters having ALL the same properties. In practice, this is
    impossible to achieve for complete alphabets, or for non-alphabetic scripts
    (just look at Indic abugidas, right-to-left scripts, ideographic scripts, or
    syllabaries, and you'll see that it's impossible to create a "hacked font"
    that supports them while mapping characters with the same properties as the
    non-hacked font).
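
    A minimal sketch in Java (my own illustration) of that property mismatch: the
    algorithms that do case conversion, line breaking, or bidirectional layout
    consult per-character properties, so a Hebrew or Tamil letter hacked into a
    Latin-1 code position inherits the wrong answers.

        public class CharProps {
            public static void main(String[] args) {
                int latinE = 'é';          // U+00E9, a position a hacked font might reuse
                int hebrewAlef = 0x05D0;   // U+05D0 HEBREW LETTER ALEF
                int tamilKa = 0x0B95;      // U+0B95 TAMIL LETTER KA

                // Directionality differs: a bidi algorithm reading the hacked
                // byte as 'é' would lay the Alef out left-to-right.
                System.out.println(Character.getDirectionality(latinE)
                        == Character.DIRECTIONALITY_LEFT_TO_RIGHT);        // true
                System.out.println(Character.getDirectionality(hebrewAlef)
                        == Character.DIRECTIONALITY_RIGHT_TO_LEFT);        // true

                // Case behaviour differs: 'é' uppercases to 'É', but Tamil
                // letters have no case, so "uppercasing" the hacked byte is wrong.
                System.out.println(Character.toUpperCase(latinE) == 'É');       // true
                System.out.println(Character.toUpperCase(tamilKa) == tamilKa);  // true
            }
        }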

    > Standardising the encoding is not to do with interoperability of
    > languages.
    Hmmm... this sentence makes no sense.
    Languages are by definition not interoperable; each has its own semantics
    with no exact equivalence. In fact, no human language is completely unified:
    each is really a family of cultures with differences across regions, social
    groups, and individuals (and their own experience of the language).

    > Inter operability is a welcome by product.
    This sentence has no meaning.

    > Unicode is designed primarily of interoperability as the target.
    Not interoperability alone! What matters is Unicode together with ISO/IEC
    10646 and the adoption of a common terminology and reference by OS and
    software vendors, by ISO, and by almost every other standards body in the
    world (public or private), which base other protocols on it, or adapt
    existing protocols to support it whenever possible (notably when the old
    protocol already allowed specifying several legacy 7/8-bit encodings, using
    encoding identifiers as in MIME, in the IANA registry or in Internet
    protocols, or other identifiers such as the CCSID on IBM platforms, or the
    codepage numbers on Microsoft and some IBM platforms).

    > It lost it first hurdle in definition.
    Nothing is demonstrated here. The sentence is too broad to be usable or to be
    supported by convincing facts. This just looks like a flaming attack.

    > Standard Encoding is to get A language working among all vendor and all
    > user
    > environment.
    > Unicode consortium do not seem to have this goal as it's primary target.
    > Email is not working, desk top publishing is not working, etc.. etc..
    > But inter operability between languages seems to take priority among other
    > things.
    >
    > Nearly 15 years, Unicode is not delivering yet. No one seem to care about
    > this status quo.
    >
    > ISO 8859 can do every thing at least for the near future.
    > For the pride and prejudice, 8859 is being made outcast.
    > Until Unicode works, new 8859 should be allowed so that within 6 to 9
    > months
    > all languages will start to work temporarily,
    >
    > while the technically superior Unicode begins to walk say, in about 10
    > years
    > time.
    >
    > It is the duty of ISO to support 8bit and Unicode, by law and charter.
    > It is not the duty of ISO to out cast tried and tested technology, while
    > allowing to fiddle with encoding for over 15 years now.
    >
    > If any one wish to reply, I prefer discussions on what in Unicode is not
    > working.
    >
    > Kindly
    > Sinnathurai Srivas
    >
    >
    > ----- Original Message -----
    > From: "Asmus Freytag" <asmusf@ix.netcom.com>
    > To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > Cc: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Unicode Mailing
    > List"
    > <unicode@unicode.org>
    > Sent: Monday, July 17, 2006 12:48 AM
    > Subject: Re: Frequent incorrect guesses by the charset autodetection in
    > IE7
    >
    >
    >> On 7/16/2006 4:56 AM, Philippe Verdy wrote:
    >>> There was nothing wrong in the ISO-8859 standard series. ISO just
    >>> stopped
    >>> working on this, because there remained nobody wanting to continue the
    >>> work in maintaining a 7/8 bit standard, when all focus (and a very large
    >>> consensus at ISO) was for accelerating the development of the newer ISO
    >>> 10646 standard, that the industry and lots of governments and
    >>> organizations wanted to develop.
    >>>
    >> fair statement
    >>> What is important to understand here is that ISO has changed its
    >>> priority: instead of creating many non-interoperable 7/8 bit encodings,
    >>> there was more value into creating a common international standard that
    >>> would contain a universal repertoire of characters.
    >>>
    >> ditto
    >>> Nothing in the iso 10646 standard or Unicode forbids any country from
    >>> deriving a 7/8 bit standard for their national usage and publishing it
    >>> so
    >>> that it can be supported with low or no cost by software vendors.
    >>> nothing
    >>> forbids them to even make its support required for use in future
    >>> products
    >>> sold in their countries, if they think it will be in the country's
    >>> interest.
    >>>
    >> I would quibble with 'low cost'. The total-lifetime cost of a new 7/8 bit
    >> standard is considerable, since it eventually does have to interwork with
    >> 10646 and Unicode, and the more 7/8 bit sets exist, the more difficult it
    >> becomes to manage the legacy sets in a clean way.
    >>> But honestly, the whole 7/8 bit encodings collection was becoming more
    >>> and more problematic and impossible to maintain consistently while also
    >>> ensuring interoperability! Only the ISO 10646 standard allowed to
    >>> reconcile the past incompatible standards, offering a uniform way to
    >>> handle international text and converting with much less errors between
    >>> otherwise incompatible encodings.
    >>>
    >> I think you are trying to say the same thing here.
    >>> The ISO body has NOT deprecated the ISO 646 and ISO 8859-* standard
    >>> series because of course they are widely used (and will continue to be
    >>> used at large scale for very long, probably many decades, if not more
    >>> than a century, unless there's a complete change of computing technology
    >>> and the changeover occurs at large scale; I even think that ISO 646/US
    >>> and possibly ISO 8859-1 will even survive the ISO 10646 standard when it
    >>> will be replaced by something better based on a new text encoding
    >>> paradigm with additional objectives not addressed today in ISO 10646 and
    >>> Unicode...)
    >>>
    >> ISO standards need to be affirmed or updated every 5 years. The character
    >> coding community realized that data, unlike parsers, operating systems,
    >> renderers and all other elements of software technology, once created
    >> remain in their original format. Therefore they pushed for an option to
    >> allow archiving of unchanged standards - keeping them officially
    >> available
    >> for people in need of interpreting legacy data, but not withdrawing them
    >> nor updating them. This is not the same as deprecation, which is usually
    >> the first step to withdrawal of a feature from a standard (and is a term
    >> that does not apply to ISO standards as a whole).
    >>
    >> Except for minor tweaks in language and terminology, affecting mostly the
    >> text of these standards and not the way they were supposed to be used, the
    >> 8859 standards could have been archived a long time ago. They are utterly
    >> stable and need to be so.
    >>> Don't say that Unicode and ISO 10646 does not work. All proves today
    >>> that
    >>> these standards are very successful and that their implementation is
    >>> advancing fast, and available on many computers, supported by most
    >>> languages and tools now, and that efficient implementation is possible
    >>> and available for all, on all types of systems (from the smallest
    >>> hand-held device to the largest mainframes or server farms or computing
    >>> grids).
    >>>
    >>> The complete migration from legacy 7/8 bit encodings to ISO/IEC 10646 is
    >>> an international ongoing effort which is successful and has really
    >>> helped
    >>> decreasing the digital divide between the richest countries that have
    >>> the
    >>> power to require support for their legacy 7/8-bit encodings in their
    >>> languages, and the poorest countries that had languages whose 7/8-bit
    >>> encoding was rarely supported. With ISO 10646, softwares can be written
    >>> once to support input, handling and rendering of all languages and
    >>> cultures of the world.
    >>>
    >> Data will likely never migrate - which is one of the factors that makes
    >> adding any new 7/8 bit set so expensive: if it becomes popular at all, it
    >> needs to be kept around essentially forever or there's the risk of
    >> abandoning data.
    >>> With ISO 10646 (and the help of Unicode in its effective
    >>> implementation),
    >>> it is now in fact much less expensive to convince commercial companies to
    >>> support fully internationalizable softwares, because this single
    >>> standard
    >>> can be understood by everyone in the world, and it also allows
    >>> collaboration with more parties than just a single supporting government
    >>> or organization.
    >>>
    >>> You want support for a language or script? You don't need to develop a
    >>> new standard. Instead you just need to document a minimum set of missing
    >>> characters to support, and they will be added to the same existing
    >>> standard, and easily supported in existing applications, after others
    >>> have contributed input methods, keyboard drivers, fonts... and academic
    >>> sources can already work on producing text corpus, and studying the
    >>> rules
    >>> needed to develop stable orthographies for many rare languages. Most of
    >>> the technologies and usage policies will already be there and
    >>> documented.
    >>>
    >>> In other words, ISO 10646 really saves money everywhere in the world,
    >>> unlike the past incompatible 7/8 bit encodings.
    >>>
    >> Fair conclusion.
    >>
    >> A./
    >>
    >>
    >>
    >
    >
    >
    >


