RE: languages - mostly not UTF-8

From: Philippe Verdy
Date: Mon Apr 13 2009 - 00:01:46 CDT

  • Next message: William J Poser: "RE: ASCII as a subset of Unicode"

    Don Osborn wrote:
    > A quick review of coding on BBC World Service pages in diverse languages
    > reveals a diversity of
    > charset codes used, with most pages *not* in utf-8. I suspect that BBC is
    > anticipating the kinds of systems that users in each language population
    > will rely on, trying to accommodate the least sophisticated systems and
    > font repertoires. Assuming that their read is accurate (and that they're
    > just being conservative about making the change to utf-8), this would
    > seem to be an interesting window on how widespread the use of Unicode
    > is or is not at the present time. On the other hand, it is worth noting
    > no Latin-based orthography is displayed in utf-8, even when
    > characters beyond Latin-1 are used (Turkish) or should be used (Hausa).
    > (...)
    > BBC lists 32 languages, but two of them - Kinyarwanda and Kirundi - lead
    > to the same "Great Lakes" page (the two languages are interintelligible).
    > Also for the sake of this list, I count Portuguese only once, even though
    > BBC has Brazilian and African varieties separate. Hence the total below
    > comes to 30.

    I can't remember when support for UTF-8 was added in major browsers.
    But the fact that BBC WS still maintains a collection of encodings, with
    UTF-8 only for some languages, and without errors, may mean that they use
    a backend that selects the "simplest" encoding from a list in order to
    automatically reduce page sizes. If no such tool is installed in their
    CMS, it may just be the effect of common policies that their journalists
    have maintained for a long time, based on their experience, without ever
    feeling the need to change them. But then I see no reason why they still
    maintain some Windows codepages yet not Windows-1252 (sticking to
    ISO 8859-1, even though it has a smaller set of characters, and despite
    the fact that many Microsoft web editing tools "transparently" encode
    Windows-1252 characters in an ISO 8859-1 page, creating problems if
    there's no filter to re-encode the extra characters as NCRs or as
    approximations).
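    Both ideas in the paragraph above can be sketched in a few lines of
    Python. This is purely illustrative: the candidate list and function
    names are my own assumptions, not whatever tooling BBC WS actually runs.
    Part (1) picks the "simplest" charset that can hold a text; part (2)
    shows the kind of filter that re-encodes characters a page's charset
    cannot hold (such as the Windows-1252 extras) as numeric character
    references.

    ```python
    # Hypothetical sketch -- illustrative names, not any real CMS backend.

    # (1) Assumed preference order, from "simplest" to most general.
    CANDIDATES = ["us-ascii", "iso-8859-1", "utf-8"]

    def pick_charset(text: str) -> str:
        """Return the first candidate encoding that can represent the text."""
        for charset in CANDIDATES:
            try:
                text.encode(charset)
                return charset
            except UnicodeEncodeError:
                continue
        return "utf-8"  # UTF-8 can always represent the text

    # (2) A filter for pages forced into a narrow charset: anything the
    # charset cannot hold (e.g. the euro sign, which Windows-1252 has but
    # ISO 8859-1 lacks) becomes a numeric character reference.
    def encode_with_ncrs(text: str, charset: str) -> bytes:
        return text.encode(charset, errors="xmlcharrefreplace")

    print(pick_charset("plain ASCII"))        # us-ascii
    print(pick_charset("café"))               # iso-8859-1
    print(pick_charset("café – 5\u20ac"))     # utf-8
    print(encode_with_ncrs("5\u20ac", "iso-8859-1"))  # b'5&#8364;'
    ```

    The `xmlcharrefreplace` error handler is Python's built-in way of doing
    exactly the NCR substitution described above; a real publishing filter
    would presumably work along the same lines.
    
    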

    Last year we saw a global statistics page showing that the UTF-8 web is
    now growing much faster, that more than half of the web is now Unicode
    encoded, and that legacy encodings are slowly decreasing in both
    frequency and accessible volume of information.

    The other problem caused by the use of multiple charsets is the
    difficulty (or complication) of integrating ads and utility JavaScript
    tools from third parties on a multilingual site. For me, this is the
    most decisive factor that has pushed many sites to use UTF-8 instead of
    multiple legacy encodings.

    If I just look at some other multilingual news websites
    (more recent than BBC WS), you can see that they use XHTML fully encoded
    in UTF-8 for all their pages, French, English and Arabic alike, even
    though this content could use ISO-8859-1 and windows-1256 as on the BBC
    site. The same is true for the content pushed to email (newsletters,
    RSS feeds, alerts...).

    The sites slowest to convert to UTF-8 are monolingual governmental
    sites and personal sites; but since the advent and huge success of
    "Web 2.0" interactive sites, most blogs, collaborative sites, and their
    related CMS software are now fully UTF-8 enabled by default (including
    for Western European languages). It is in this area that UTF-8 usage
    has exploded. I see little or no reason to continue using legacy
    encodings (including ISO-8859-1), except for the source code used in
    producing the software and submitted to compilers (such as sources in
    C, C++, Java, C#... but not JavaScript, HTML, CSS, or XML).

    This archive was generated by hypermail 2.1.5 : Mon Apr 13 2009 - 10:14:09 CDT