From: Jukka K. Korpela (firstname.lastname@example.org)
Date: Wed Jul 26 2006 - 14:40:49 CDT
On Wed, 26 Jul 2006, Daniel Yacob wrote:
> I was asked twice within a week recently how many Amharic documents
> were on the internet and I could only guess at a figure.
If I knew Amharic, I would probably try the approach of doing some Google
searches using some very common Amharic words that might be expected to
appear in most Amharic documents. This might give a very rough estimate of
the Amharic pages in Google's database (which is surely an incomplete,
though nobody knows how incomplete, extract of the content of the WWW).
(Naturally, the common words used should be strings that do not normally
appear in other languages.)
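As a rough sketch of the idea, assuming you have a local collection of
page texts to experiment with rather than a search engine's hit counts,
the counting could look like the following. The function name and the
marker strings are my own placeholders: I assume they are common Amharic
words, but someone who knows the language would have to pick the real list.

```python
# Hypothetical marker strings, assumed (not verified by a speaker) to be
# very common Amharic words that rarely occur in other languages.
MARKERS = ["ነው", "እና"]


def count_pages_with_markers(pages, markers=MARKERS):
    """Return how many page texts contain at least one marker string.

    This is plain substring matching, so it only gives a rough figure,
    in the same spirit as counting search-engine hits.
    """
    return sum(1 for text in pages if any(m in text for m in markers))


if __name__ == "__main__":
    sample_pages = [
        "ይህ ገጽ ነው",           # contains the first marker
        "An English page",      # contains no marker
        "ሻይ እና ቡና",           # contains the second marker
    ]
    print(count_pages_with_markers(sample_pages))
```

The same caveat applies as with the search engine: the result says
something about the collection you counted, not about the Web as a whole.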
> So it
> dawned on me that it would be a nice service if search engine
> companies could provide some statistics -based on language (if
> identified) and script. Perhaps these stats are available and
> I just wasn't able to find them?
Well, you can use the Google Advanced Search and restrict searches to
pages in a particular language, but Amharic is not among the languages
recognized by Google, the recognition is heuristic only (and the methods
have not been disclosed), and you would still need to use some search
strings. So this is clumsy at best, and it would indeed be nice if search
engine companies compiled the statistics.
I found a study on linguistic diversity on the Web, by Unesco, at
but it seems somewhat theoretical and contains some relatively old data,
computed from a "sample" of web pages. (I wonder how you can draw a
sample, in the statistical sense, from the content of the Web.)
> Going a step further, stats on a per character basis, or even a
> property basis would be useful and not just academically interesting.
Well, maybe. It would be closer to specifically Unicode-related issues,
but I don't quite see the practical or theoretical relevance. And I'm
afraid search engines aren't so interested in all characters, just those
that appear in words (for some definition of "word").
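For what it's worth, per-character and per-property counts are easy to
compute over any text you already have. The sketch below uses Python's
standard unicodedata module. Note that the module has no Script property,
so the second function only guesses a script from the first word of the
character name; that is my own crude approximation, not the real Unicode
Script property, and the function names are made up for this example.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters by Unicode General Category (e.g. 'Lu', 'Nd')."""
    return Counter(unicodedata.category(ch) for ch in text)


def script_guess_counts(text):
    """Count characters by the first word of their Unicode name.

    This is only a rough stand-in for the Script property: it yields
    labels such as 'LATIN' or 'ETHIOPIC', but also 'DIGIT' or 'SPACE'.
    Characters without a name are counted under 'UNKNOWN'.
    """
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


if __name__ == "__main__":
    sample = "Ab1 \u1200"  # U+1200 is ETHIOPIC SYLLABLE HA
    print(category_counts(sample))
    print(script_guess_counts(sample))
```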
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/