languages - mostly not UTF-8

From: Don Osborn (
Date: Sun Apr 12 2009 - 20:24:04 CDT


    A quick review of the encoding of BBC World Service pages in diverse languages reveals a diversity of
    charset codes used, with most pages *not* in utf-8. I suspect that BBC is
    anticipating the kinds of systems that users in each language population
    will rely on, trying to accommodate the least sophisticated systems and font
    repertoires. Assuming that their read is accurate (and that they're not
    just being conservative about making the change to utf-8), this would
    seem to be an interesting window on how widespread the use of Unicode is or
    is not at the present time. On the other hand, it is worth noting that no
    Latin-based orthography is displayed in utf-8, even when
    characters beyond Latin-1 are used (Turkish) or should be used (Hausa). If
    one had the time, it would be interesting to look also at other
    international radio sites - VOA, RFI, Deutsche Welle, Radio China, etc.
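
To make the Turkish and Hausa point concrete, here is a minimal sketch in Python that checks whether a few letters can be represented in a given charset. The sample letters (Turkish ğ, ş, ı, İ and the Hausa hooked letters ɓ, ɗ, ƙ) and the helper name `can_encode` are my own choices for illustration, not anything taken from the BBC pages.

```python
def can_encode(text, charset):
    """Return True if every character in text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample letters (an assumption, not a full alphabet).
samples = {
    "Turkish": "ğşıİ",  # letters beyond Latin-1
    "Hausa": "ɓɗƙ",     # hooked letters of the formal orthography
}

for language, letters in samples.items():
    for charset in ("iso-8859-1", "windows-1254", "utf-8"):
        verdict = "ok" if can_encode(letters, charset) else "not representable"
        print(f"{language:8} {charset:13} {verdict}")
```

Under these assumptions the Turkish letters fit in windows-1254 but not in iso-8859-1, while the Hausa hooked letters fit in neither 8-bit charset and need utf-8 (or another Unicode encoding).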


    Among the questions I have is whether we can expect that all web content
    (at least on high-profile international sites) will eventually go to utf-8
    or another Unicode encoding, or whether various non-Unicode 8-bit standards
    will continue to hold sway in selected areas for some time to come. I think that
    in the "ecology" of localization in a region such as West Africa, the use or
    non-use of utf-8 by international websites for a language like Hausa (which
    basically is the difference between being able to use the formal orthography
    or resorting to an ASCIIfied transcription as they currently do) certainly
    has an effect on the way that that language and others are used in text
    offline. At what point does the argument that too many local systems in a
    region do not have Unicode fonts lose its validity, and at what point should
    organizations like the BBC take the lead in using utf-8 (as it did a
    while back with a Unicode font for Urdu)?


    BBC lists 32 languages, but two of them - Kinyarwanda and Kirundi - lead to
    the same "Great Lakes" page (the two languages are mutually intelligible). Also
    for the sake of this list, I count Portuguese only once, even though BBC
    lists Brazilian and African varieties separately. Hence the total below comes to 30.


    Albanian charset=windows-1250

    Arabic charset=windows-1256

    Azeri charset=utf-8

    Bangla charset=utf-8

    Burmese charset=utf-8

    Chinese charset=gb2312

    English (Caribbean) charset=iso-8859-1

    French charset=iso-8859-1

    Hausa charset=iso-8859-1

    Hindi charset=utf-8

    Indonesian charset=iso-8859-1

    Kinyarwanda (& Kirundi) charset=iso-8859-1

    Kyrgyz charset=utf-8

    Macedonian charset=windows-1251

    Nepali charset=utf-8

    Pashto charset=utf-8

    Persian charset=utf-8

    Portuguese (both Brazilian and African) charset=iso-8859-1

    Russian charset=windows-1251

    Serbian charset=windows-1250

    Sinhala charset=utf-8

    Somali charset=iso-8859-1

    Spanish charset=iso-8859-1

    Swahili charset=iso-8859-1

    Tamil charset=utf-8

    Turkish charset=windows-1254

    Ukrainian charset=windows-1251

    Urdu charset=utf-8

    Uzbek charset=utf-8

    Vietnamese charset=utf-8



    13 utf-8

    9 iso-8859-1

    3 windows-1251

    2 windows-1250

    1 windows-1254

    1 windows-1256

    1 gb2312
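
For anyone who wants to repeat the count, or extend it to other sites, the tally above can be reproduced with a short Python sketch. The dictionary simply restates the list in this message; the variable names `survey` and `tally` are arbitrary.

```python
from collections import Counter

# Language -> declared charset, copied from the list above.
survey = {
    "Albanian": "windows-1250", "Arabic": "windows-1256",
    "Azeri": "utf-8", "Bangla": "utf-8", "Burmese": "utf-8",
    "Chinese": "gb2312", "English (Caribbean)": "iso-8859-1",
    "French": "iso-8859-1", "Hausa": "iso-8859-1", "Hindi": "utf-8",
    "Indonesian": "iso-8859-1",
    "Kinyarwanda (& Kirundi)": "iso-8859-1",
    "Kyrgyz": "utf-8", "Macedonian": "windows-1251", "Nepali": "utf-8",
    "Pashto": "utf-8", "Persian": "utf-8",
    "Portuguese": "iso-8859-1", "Russian": "windows-1251",
    "Serbian": "windows-1250", "Sinhala": "utf-8",
    "Somali": "iso-8859-1", "Spanish": "iso-8859-1",
    "Swahili": "iso-8859-1", "Tamil": "utf-8",
    "Turkish": "windows-1254", "Ukrainian": "windows-1251",
    "Urdu": "utf-8", "Uzbek": "utf-8", "Vietnamese": "utf-8",
}

# Count how many language pages declare each charset.
tally = Counter(survey.values())

for charset, count in tally.most_common():
    print(count, charset)

print(sum(tally.values()), "pages in total")
```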
