Re: VOA- utf-8, lang="en" (Re: BBC.co.uk languages ...)

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Tue Apr 14 2009 - 14:39:18 CDT

    It is a chicken & egg problem. Web page creators will only bother to set the
    language (or set it differently from the default) if the language setting
    makes a difference. Because so much content is badly tagged, all of the
    interpreters of the pages end up having to disregard that information, and
    compute the language heuristically ("language detection"). Because of that
    the language setting doesn't make a difference, so the creators don't bother
    setting it.

    Mark

    On Tue, Apr 14, 2009 at 11:44, Don Osborn <dzo@bisharat.net> wrote:

    > Thanks Mark, I can see why. I know about how smaller sites can miss on
    > this (and just recently mentioned this in regard to two Fula sites, one in
    > Pulaar of Mauritania with no language designation on most pages and ar-SA on
    > one section, and the other site based in Belgium listing lang="en-GB").
    > However I was a bit astonished to see that a major site like VOA appeared to
    > have totally disregarded the issue (or else they consider the page frame in
    > which the local content is situated to always be in English - but I see no
    > lang= attributes other than the "en" ones, so in any event they missed on
    > adding proper language tags).
    >
    >
    >
    > Don
    >
    > *From:* Mark Davis [mailto:mark.edward.davis@gmail.com]
    > *Sent:* Tuesday, April 14, 2009 1:14 PM
    > *To:* Donald Z. Osborn
    > *Cc:* A12n tech support; a12n-policy@bisharat.net; Unicode Mailing List
    > *Subject:* Re: VOA- utf-8, lang="en" (Re: BBC.co.uk languages ...)
    >
    >
    >
    > FYI, in Google we essentially ignore the language setting in the web page,
    > because it is too often missing or wrong to be useful.
    >
    > Mark
    >
    > On Tue, Apr 14, 2009 at 07:23, Donald Z. Osborn <dzo@bisharat.net> wrote:
    >
    > Thanks to all for the feedback on this topic. It sounds like the choice of
    > utf-8 or not is mainly one of policy (or lack of same) and not technical
    > constraints?
    >
    > Interesting on this point to contrast with VOA,* which has all of its
    > language pages in utf-8.
    >
    > On the other hand, while the BBC uses the lang attribute in page coding to
    > indicate the main language in each page, VOA pages are apparently all
    > lang="en"
    >
    > Like BBC, VOA ASCIIfies Hausa Boko orthography. It also has no text in
    > Amharic or Tigrinya (among non-Latin scripts), only audio from an English
    > language "Horn" page.
    >
    > Like BBC, it groups the similar languages Kinyarwanda and Kirundi on a
    > single page (with text in one, the other, both, or something in between). It
    > would be interesting to know what exactly is the language of the text
    > content of that page. BBC codes their page "rw" (for Kinyarwanda), not "rn"
    > (for Kirundi), even though both languages share it. But as already noted,
    > VOA incorrectly uses lang="en" everywhere.
    >
    >
    > * http://www.voa.gov (click on Languages) or
    > http://www.voanews.com/english/screen_map.cfm
    >
    >
    >
    >


