Re: Usage stats? from John D. Burger on 2015-03-27 (Unicode Mail List Archive)

From: John D. Burger <john_at_mitre.org>
Date: Fri, 27 Mar 2015 16:23:17 -0400

On Mar 27, 2015, at 15:57, Michael Norton <michaelanortonster_at_gmail.com> wrote:

> Why wouldn't Unicode itself have it?

Because as Ken explained, acquiring (and constantly updating) such statistics would require roughly the effort that Google puts into its crawler. And it wouldn't include all the printed material that isn't on the web.

Turning your question around, why would Unicode have this information? What would be the value, and how would it be worth the (considerable) effort required?

- John Burger
MITRE

>
> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <kenwhistler_at_att.net> wrote:
> Search engine companies (and in particular, Google) have such
> information squirreled away in their index databases, at least as
> far as usage stats for Unicode characters on the web go -- but it
> is proprietary information, and they generally don't publish
> information about such statistics.
>
> Perhaps there are researchers out there who have set web crawlers
> on a mission to generate such web statistics for publication, and maybe
> somebody on this list knows of such research -- but it would be
> virtually impossible to generate such information for the much
> wider collection of documents and data that are not easily accessible
> for web indexing. (Behind password walls, in PDF document archives,
> in proprietary databases, ...) As an example of why this is a problem,
> consider the fact that there are *peta*bytes of information picked up
> and stored in databases from scanners and other devices used at
> tens of millions of retail points of sale. Such data, by its nature, would tend
> to skew heavily towards use of ASCII a-z and digits 0-9 in its
> character data. How would you end up weighting such (mostly
> publicly inaccessible) data in trying to count up for overall statistics
> on character use?
>
> There are more traditional usage count studies that focus on
> counts of character frequency within single language orthographies
> in single scripts (e.g., letter frequencies for French text), but I don't
> think that is what you were asking about.
>
> Here is some discussion of a similar question posted on Stack Overflow:
>
> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>
> --Ken
>
> On 3/27/2015 9:31 AM, Michael Norton wrote:
> Hello and thank you for an incredible service (just joining the list). Is there a list of usage statistics per character of the Unicode set available somewhere?
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
>
> --
>
> Michael A. Norton, B.A. Cinema, M.P.A.
> My Cinema Home: http://www.NortonsNook.com
>
> "All great actors are mere mathematical masters of speech and the human body."
>

Received on Fri Mar 27 2015 - 15:21:50 CDT

This archive was generated by hypermail 2.2.0 : Fri Mar 27 2015 - 15:21:50 CDT