Re: Frequent incorrect guesses by the charset autodetection in IE7

From: Richard Wordingham (
Date: Mon Jul 17 2006 - 19:23:43 CDT


    Adam Twardoch wrote on Monday, July 17, 2006 at 10:38 PM
    Subject: Re: Frequent incorrect guesses by the charset autodetection in IE7

    > Sinnathurai Srivas wrote:
    >> Unicode email is not working properly
    >> ISO Email is working properly.
    >> Hacked 8bit coding works among vendor opposition.
    > I’ve been using UTF-8 in my e-mail for several years now. I don’t know
    > what you mean.

    I'll hazard a guess. One issue is that many applications choose the font on
    the basis of the encoding. It is by no means unusual to see 'Unicode' in a
    pick list of languages! I notice these differences in Outlook Express
    because I use a mixture of Unicode, Latin-1 and Thai encodings. There is a
    further problem that fonts are scaled to a common pitch rather than a common
    x-height. I generally get round it by selecting Tahoma for Thai because it
    has a rather high x-height to pitch ratio for Thai - at the expense of
    making some vowel-tone combinations hard to distinguish. Of course, you
    might have exactly the same problem if you were to use a 7 or 8-bit ISO-2022
    scheme, though I suppose it might switch font according to character set.
    Word 2002 seems to do something similar, presumably working off Unicode
    blocks, and has styles that specify different fonts and pitches for
    different scripts.
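
    To illustrate what "working off Unicode blocks" might look like, here is
    a minimal sketch that splits text into runs by block and assigns a font
    to each run. It is only a guess at the general idea, not a description
    of what Word actually does. The block table is deliberately tiny, and
    the font choices and the names block_of and font_runs are hypothetical.

```python
from itertools import groupby

# Hypothetical, heavily abbreviated block table: just enough ranges for
# this illustration (the Latin range lumps several Latin blocks together).
BLOCKS = [
    (0x0000, 0x024F, "Latin"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0E00, 0x0E7F, "Thai"),
]

# Hypothetical per-block font choices, not taken from any real application.
FONTS = {"Latin": "Times New Roman", "Tamil": "Latha", "Thai": "Tahoma"}


def block_of(ch):
    """Return the name of the block containing ch, or "Other"."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"


def font_runs(text):
    """Split text into maximal same-block runs, each paired with a font."""
    return [
        (FONTS.get(block, "fallback"), "".join(chars))
        for block, chars in groupby(text, key=block_of)
    ]


for font, run in font_runs("abc \u0e44\u0e17\u0e22 xyz"):
    print(font, repr(run))
```

    Note that the space characters fall in the Latin range, so they join
    the Latin runs; a real implementation would need to treat such shared
    characters with more care.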

    Philippe Verdy wrote on Monday, July 17, 2006 10:10 PM

    > From: "Sinnathurai Srivas" <>
    >> Collation (after 15 years) is not yet working in Unicode.
    > Huh???? Completely wrong.

    It's still taking its time to work through. Thai collation in Excel 2000
    and 2002 has to be seen to be believed. I can only believe that someone
    misinterpreted the specification - perhaps he knew just a little about
    Devanagari, and misapplied it to Thai. Or perhaps the error lay with the
    composition of the specification, so it's a design fault rather than a bug.
    On the other hand, Thai collation in Word 2002 seems to work.
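
    For context on why Thai collation is easy to get wrong: Thai is stored
    in visual order, so the five leading vowels (U+0E40 to U+0E44) precede
    the consonant they follow in dictionary order. A minimal sketch of the
    usual first step, swapping each leading vowel with the next character
    before comparing, is below. It ignores tone marks and everything else a
    real collation has to handle, and I am not suggesting this is the
    specific fault in Excel. The name thai_key is hypothetical.

```python
# The five Thai leading vowels, which are written before the consonant
# but sort after it.
PREVOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def thai_key(word):
    """Sort key that swaps each leading vowel with the following character.

    A rough approximation only: tone marks and other secondary
    distinctions are left exactly as they are.
    """
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


words = ["\u0e40\u0e01\u0e21", "\u0e02\u0e32\u0e22", "\u0e01\u0e1a"]
print(sorted(words))                # code point order
print(sorted(words, key=thai_key))  # leading vowels reordered first
```

    In code point order the word beginning with SARA E sorts last; with the
    swap it files under KO KAI, between the other two words.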

    There are tools that will do collation, but being able to sort a table in
    Word that contains a mixture of Tamil and English script to the satisfaction
    of a Tamil non-programmer resident in England may yet be another matter.
    (I've seen a commercial, or at least non-free, add-on to do Lao collation in
    Word.) As to sorting such a table to the satisfaction of a Malayalee...

    >> ISO collation works very well.
    > If you are speaking about the binary ordering, this is not collation, and
    > there exists *NO* 7-8bit encoding whose binary encoding supports the
    > conventions used in different languages.
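
    The difference between binary ordering and collation can be seen with a
    few Latin-script words. The sketch below compares plain code point order
    with a crude key that drops accents and case. The key is an assumption
    made purely for illustration; it is not the Unicode Collation Algorithm,
    and it carries no language-specific tailoring. The name rough_key is
    hypothetical.

```python
import unicodedata


def rough_key(s):
    """Crude sort key: decompose, drop combining marks, fold case.

    Only an approximation of one convention; real collation needs
    per-language rules.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


words = ["zebra", "Zulu", "apple", "\u00e9migr\u00e9"]

# Binary order: capitals first, accented letters after 'z'.
print(sorted(words))

# Rough dictionary-like order.
print(sorted(words, key=rough_key))
```

    Even this key embodies just one convention. Languages that treat an
    accented letter as a separate letter of the alphabet would need a
    different key, which is the point being made about binary orderings.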

    >> Unicode word processing works with few vendor applications with immense
    >> difficulties.
    > There's no difficulty today.

    Half-full v. half-empty. While most scripts may work well, there are and
    have been problem areas such as Malayalam, Burmese and Bengali. However, I
    find it hard to believe that ISCII works any better for Malayalam and
    Bengali. On the other hand, adapting typewriter solutions ought to work -
    *provided* the typewriter solution works! Part of the problem a Windows
    user faces is that he can't override Uniscribe. If he thinks he knows
    better than Uniscribe, then he has to eschew Microsoft products.

    An important deficiency in the Uniscribe implementation of the Tamil script
    is that one cannot use the superscript or subscript digits on all
    combinations of consonant and vowel. The Unicode standard mentions these
    combinations, but does not say how to encode them. I don't believe new
    characters are needed, so I don't know how one can tackle this omission.

    A possible example of the issues is the Unicode standard. Much of the
    non-Latin text does not appear to have been composed in Unicode! For a new
    complex script that is inevitable, and of course the text in a new script
    for a script proposal cannot be encoded in Unicode. (It could in theory be
    encoded in the PUA, but I'm not sure that that happens much.)

    Some scripts are well integrated. For example, there are fonts that are
    encoded as hacks on Unicoded Thai.

    Tamil illustrates some of the problems. When SSA was added in Unicode
    4.1.0, one could not go out and use it with all applications. Uniscribe
    refused to combine it with Tamil vowels, let alone form the 'shri' ligature.
    Imagine finding that the only way to have your name displayed was to
    misspell it!

    At least now one can get around the Uniscribe limitation for HTML on Windows
    if one is desperate. Deer Park supports Graphite, which allows one to
    specify one's own Indic re-arrangement etc, and Graphite comes with a good
    tutorial to get you started. Graphite does not seem to allow the
    pixel-level control of positioning available in OpenType lay-out tables.

    Of course, it would be nice if Uniscribe could allow a font to opt out of
    its automatic re-ordering and do its own thing. I think this would still be
    in accord with the principle of not needlessly duplicating information.


    This archive was generated by hypermail 2.1.5 : Mon Jul 17 2006 - 19:28:15 CDT