RE: languages - mostly not UTF-8

From: Philippe Verdy
Date: Mon Apr 13 2009 - 00:01:46 CDT

  • Next message: William J Poser: "RE: ASCII as a subset of Unicode"

    Don Osborn wrote:
    > A quick review of coding on BBC World Service pages in diverse languages
    > reveals a diversity of
    > charset codes used, with most pages *not* in utf-8. I suspect that BBC is
    > anticipating the kinds of systems that users in each language population
    > will rely on, trying to accommodate the least sophisticated systems and
    > font repertoires. Assuming that their read is accurate (and that they're
    > just being conservative about making the change to utf-8), this would
    > seem to be an interesting window on how widespread the use of Unicode
    > is or is not at the present time. On the other hand, it is worth noting
    > no Latin-based orthography is displayed in utf-8, even when
    > characters beyond Latin-1 are used (Turkish) or should be used (Hausa).
    > (...)
    > BBC lists 32 languages, but two of them - Kinyarwanda and Kirundi - lead
    > to the same "Great Lakes" page (the two languages are interintelligible).
    > Also for the sake of this list, I count Portuguese only once, even though
    > BBC has Brazilian and African varieties separate. Hence the total below
    > comes to 30.

    I can't remember when support for UTF-8 was added in major browsers.
    But the fact that BBC WS still maintains a collection of encodings, with
    UTF-8 only for some languages, and without errors, may mean that they use
    a backend that selects the "simplest" encoding from a list in order to
    automatically reduce page sizes. If no such tool is installed in their
    CMS, it may just be the effect of common policies that their journalists
    have maintained for a long time, based on their experience, without ever
    feeling the need to change them. But then I see no reason why they still
    maintain some Windows codepages yet not Windows-1252 (sticking to
    ISO 8859-1, even though it has a smaller set of characters, and despite
    the fact that many Microsoft web editing tools "transparently" encode
    Windows-1252 characters in an ISO 8859-1 page, creating problems if
    there's no filter to re-encode the extra characters as NCRs or as
    approximations).
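    Both ideas in the paragraph above can be sketched in a few lines of
    Python. This is purely illustrative: the candidate list and function
    names are my own assumptions, not whatever tooling BBC WS actually runs.
    Part (1) picks the "simplest" charset that can hold a text; part (2)
    shows the kind of filter that re-encodes characters a page's charset
    cannot hold (such as the Windows-1252 extras) as numeric character
    references.

    ```python
    # Hypothetical sketch -- illustrative names, not any real CMS backend.

    # (1) Assumed preference order, from "simplest" to most general.
    CANDIDATES = ["us-ascii", "iso-8859-1", "utf-8"]

    def pick_charset(text: str) -> str:
        """Return the first candidate encoding that can represent the text."""
        for charset in CANDIDATES:
            try:
                text.encode(charset)
                return charset
            except UnicodeEncodeError:
                continue
        return "utf-8"  # UTF-8 can always represent the text

    # (2) A filter for pages forced into a narrow charset: anything the
    # charset cannot hold (e.g. the euro sign, which Windows-1252 has but
    # ISO 8859-1 lacks) becomes a numeric character reference.
    def encode_with_ncrs(text: str, charset: str) -> bytes:
        return text.encode(charset, errors="xmlcharrefreplace")

    print(pick_charset("plain ASCII"))        # us-ascii
    print(pick_charset("café"))               # iso-8859-1
    print(pick_charset("café – 5\u20ac"))     # utf-8
    print(encode_with_ncrs("5\u20ac", "iso-8859-1"))  # b'5&#8364;'
    ```

    The `xmlcharrefreplace` error handler is Python's built-in way of doing
    exactly the NCR substitution described above; a real publishing filter
    would presumably work along the same lines.
    
    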

    Last year we saw a global statistics page showing that the UTF-8 web is
    now growing much faster, that more than half of the web is now Unicode
    encoded, and that legacy encodings are slowly decreasing in both
    frequency and accessible volume of information.

    The other problem caused by the use of multiple charsets is the
    difficulty (or complication) of integrating ads and utility JavaScript
    tools from third parties on a multilingual site. For me, this is the
    most decisive factor that has pushed many sites to use UTF-8 instead of
    multiple legacy encodings.

    If I just look at some other multilingual news websites
    (more recent than BBC WS), you can see that they use XHTML fully encoded
    in UTF-8 for all their pages, French, English and Arabic alike, even
    though this content could use ISO-8859-1 and windows-1256 as on the BBC
    site. The same is true for the content pushed to email (newsletters,
    RSS feeds, alerts...).

    The sites slowest to convert to UTF-8 are monolingual governmental
    sites and personal sites; but since the advent and huge success of
    "Web 2.0" interactive sites, most blogs, collaborative sites, and their
    related CMS software are now fully UTF-8 enabled by default (including
    for Western European languages). It is in this area that UTF-8 usage
    has exploded. I see little or no reason to continue using legacy
    encodings (including ISO-8859-1), except for the source code used in
    producing the software and submitted to compilers (such as sources in
    C, C++, Java, C#... but not JavaScript, HTML, CSS, or XML).

    This archive was generated by hypermail 2.1.5 : Mon Apr 13 2009 - 10:14:09 CDT