Re: Wanted: An Internet Unicode Meter

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Jul 26 2006 - 14:40:49 CDT

    On Wed, 26 Jul 2006, Daniel Yacob wrote:

    > I was asked twice within a week recently how many Amharic documents
    > were on the internet and I could only guess at a figure.

    If I knew Amharic, I would probably try the approach of doing some Google
    searches using some very common Amharic words that might be expected to
    appear in most Amharic documents. This might give a very rough estimate of
    the number of Amharic pages in Google's database (which is surely an
    incomplete extract of the content of the WWW, though nobody knows how
    incomplete).
    (Naturally, the common words used should be strings that do not normally
    appear in other languages.)
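
As a rough illustration of the marker-word idea, here is a minimal sketch in Python that works on a local collection of texts rather than on a search engine (no search API is assumed). The function name `count_docs_with_markers` and its parameters are hypothetical, and the marker words would have to be supplied by someone who knows the language.

```python
import re


def count_docs_with_markers(docs, markers):
    """Count documents that contain at least one marker word.

    docs    -- iterable of document texts (strings)
    markers -- common words of the target language that are not
               expected to occur in other languages (an assumption
               the caller has to verify)
    """
    marker_set = set(markers)
    hits = 0
    for text in docs:
        # \w+ matches runs of Unicode word characters, so whole
        # words are compared instead of arbitrary substrings.
        tokens = set(re.findall(r"\w+", text))
        if tokens & marker_set:
            hits += 1
    return hits
```

Matching whole tokens instead of substrings keeps a short marker from being counted inside a longer, unrelated word. The result is only as good as the choice of markers and the coverage of the collection.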

    > So it
    > dawned on me that it would be a nice service if search engine
    > companies could provide some statistics -based on language (if
    > identified) and script. Perhaps these stats are available and
    > I just wasn't able to find them?

    Well, you can use the Google Advanced Search and restrict searches to
    pages in a particular language. But Amharic is not among the languages
    recognized by Google, the recognition is heuristic only (and the methods
    have not been disclosed), and you would still need to use some search
    strings. So this is clumsy at best, and it would indeed be nice if search
    engine companies compiled the statistics.

    I found a study on linguistic diversity on the Web, by UNESCO, at
    http://unesdoc.unesco.org/images/0014/001421/142186e.pdf
    but it seems somewhat theoretical and contains some relatively old data,
    computed from a "sample" of web pages. (I wonder how you can draw a
    sample, in the statistical sense, from the content of the Web.)

    > Going a step further, stats on a per character basis, or even a
    > property basis would be useful and not just academically interesting.

    Well, maybe. It would be closer to specifically Unicode-related issues,
    but I don't quite see the practical or theoretical relevance. And I'm
    afraid search engines aren't so interested in all characters, just those
    that appear in words (for some definition of "word").
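
For what per-character or per-property statistics could look like on a single text, here is a minimal sketch using Python's standard `unicodedata` module. The function names are hypothetical. The script test is an approximation that assumes the Ethiopic, Ethiopic Supplement, and Ethiopic Extended block ranges stand in for the script; a block is not the same thing as the Unicode Script property.

```python
import unicodedata
from collections import Counter

# Assumed approximation of "Ethiopic script" by code point blocks:
# Ethiopic, Ethiopic Supplement, Ethiopic Extended.
ETHIOPIC_BLOCKS = [(0x1200, 0x137F), (0x1380, 0x139F), (0x2D80, 0x2DDF)]


def category_counts(text):
    """Count characters by Unicode General Category (Lu, Ll, Lo, Nd, ...)."""
    return Counter(unicodedata.category(ch) for ch in text)


def ethiopic_share(text):
    """Return the fraction of non-whitespace characters in the blocks above."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_blocks = sum(
        1
        for ch in chars
        if any(lo <= ord(ch) <= hi for lo, hi in ETHIOPIC_BLOCKS)
    )
    return in_blocks / len(chars)
```

Summed over a crawl, counts like these would give the kind of per-character and per-property figures asked about, though only for the pages the crawl happens to reach.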

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

