RE: Google posting about U5.1

From: Philippe Verdy (
Date: Sat May 10 2008 - 00:58:22 CDT

    It looks like this fast growth of Unicode comes mostly from China, and from
    the need to support Chinese on websites operated outside China: GB encodings
    were not chosen, and even China does not very actively support its own GB
    encodings.
    With the exploding number of Chinese Internet users, this looks very
    decisive. We'll still have a minority of ASCII-only webpages, but it's true
    that even in Europe the ISO8859-based charsets are not enough, as many, many
    websites are now multilingual.
    The UCS has strong advantages everywhere, and the conversion from ISO8859-*
    to UTF-8 is quite easy now, given the number of tools and libraries capable
    of working with Unicode.
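    As a minimal sketch of how small that conversion can be, assuming the
    source bytes really are ISO-8859-1 (the function name and the sample
    string are only illustrations of mine, not taken from any particular
    tool):

```python
def latin1_to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode legacy single-byte data as UTF-8.

    Assumption: the caller knows the real source encoding; nothing is guessed here.
    """
    text = data.decode(source_encoding)  # bytes -> Unicode string
    return text.encode("utf-8")          # Unicode string -> UTF-8 bytes


# Hypothetical sample: one accented letter, stored as a single byte in ISO-8859-1.
legacy = "Administration française".encode("iso-8859-1")
converted = latin1_to_utf8(legacy)
```

    The only hard part is knowing the source encoding for sure; once it is
    known, the conversion itself is a decode followed by an encode.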
    The main problem that remains is with some popular tools that still don't
    come bundled and preinstalled with support for the full UCS; PHP is probably
    one of these tools, whose Unicode support is either poor or slow.
    By contrast, .Net, Java, and Javascript have native support for the UCS
    (at least for the BMP, where surrogates don't have to be treated specially;
    almost all the characters needed for modern usage are in the BMP anyway,
    except some "less frequent" characters occasionally taken from the
    supplementary ideographic plane).
    For languages that absolutely need support of the full UCS, going to a
    32-bit internal encoding is still possible, but most developers do not
    even care about it: this matters only for a limited number of pages or
    resources, in comparison to the tons of pages and sites that don't even
    need any character outside the BMP.
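    To make the BMP versus supplementary-plane difference concrete, here is a
    small sketch that counts code units per character in UTF-16 and in a
    32-bit encoding. The helper names and the two sample characters are my
    own choice for illustration:

```python
def utf16_code_units(ch: str) -> int:
    """Number of 16-bit code units needed for one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def utf32_code_units(ch: str) -> int:
    """Number of 32-bit code units needed for one character in UTF-32."""
    return len(ch.encode("utf-32-le")) // 4


bmp_char = "\u4e2d"                 # U+4E2D, an ideograph inside the BMP
supplementary_char = "\U00020000"   # U+20000, in the supplementary ideographic plane

bmp_units = utf16_code_units(bmp_char)                      # one code unit
supplementary_units = utf16_code_units(supplementary_char)  # a surrogate pair
```

    In a 32-bit encoding both characters take exactly one unit, which is
    what makes it attractive when supplementary characters really matter.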
    So the sharp increase of UTF-8 is highly correlated with the gradual
    abandonment of GB-* by millions (a billion soon?) of new Chinese Internet
    users.
    Is the support of GB18030 still mandatory for products sold in China, if the
    support of Unicode offers the same coverage benefits with the addition of
    more interoperability?
    If only this page on the Google blog could help convince European
    administrations or organizations to stop making pages encoded with ISO8859-*
    (and often not labelled at all)! Many French administrations and commercial
    websites are still using ISO-8859-1 without even labelling it explicitly, so
    their pages don't display properly: the heuristic algorithms used by
    browsers to "guess" the encoding frequently "detect" a JIS encoding, so that
    runs of characters that include an accented one appear replaced by
    Let's convince everyone to go with UTF-8 on the web; everything else will
    follow the UCS path, including documents and databases (even if they are
    handled internally with UTF-16, possibly leaving some bugs for incorrectly
    handled surrogates; these bugs are simple to solve)...
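
    As an illustration of why I call these bugs simple to solve, here is a
    sketch of surrogate-aware iteration over UTF-16 code units. The function
    name is hypothetical, and passing unpaired surrogates through unchanged
    is just one possible policy:

```python
def code_points_from_utf16(units: list[int]) -> list[int]:
    """Combine surrogate pairs in a sequence of UTF-16 code units into code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high surrogate followed by a low surrogate is one supplementary code point.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through unchanged here).
            result.append(unit)
            i += 1
    return result


# U+4E2D followed by U+20000 encoded as the surrogate pair D840 DC00.
decoded = code_points_from_utf16([0x4E2D, 0xD840, 0xDC00])
```

    The usual bug is to treat each 16-bit unit as a character; checking for
    the two surrogate ranges before advancing is all that is needed.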


    From: [] On behalf of Mark Davis
    Sent: Monday, May 5, 2008 21:25
    To: Unicode
    Subject: FYI: Google posting about U5.1

    FYI, we have a posting on Google's official blog
    ( about Unicode 5.1 and the growth of
    Unicode that we're seeing.


    This archive was generated by hypermail 2.1.5 : Sat May 10 2008 - 09:05:49 CDT