Re: Usage stats? from Michael Norton on 2015-03-27 (Unicode Mail List Archive)

From: Michael Norton <michaelanortonster_at_gmail.com>
Date: Fri, 27 Mar 2015 16:27:26 -0400

Easy example: what's the code for [blank space] U+0020 across all language
sets of Unicode? Is it the same, i.e., 100%?
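
To make the question concrete, here is a minimal Python sketch. It assumes
only the standard library, and the sample strings are made up for
illustration, not drawn from any real corpus. It shows that the code point
for SPACE is U+0020 whatever the language of the surrounding text, while the
share of characters that are spaces can differ from one sample to the next.

```python
import unicodedata
from collections import Counter

# Hypothetical sample strings, chosen only for illustration.
samples = {
    "English": "the quick brown fox",
    "French": "le renard brun rapide",
    "Greek": "η γρήγορη καφέ αλεπού",
}


def space_share(text):
    """Return the fraction of characters in text that are U+0020."""
    counts = Counter(text)
    return counts["\u0020"] / len(text)


# The code point itself does not depend on the language of the text.
assert ord(" ") == 0x20
assert unicodedata.name("\u0020") == "SPACE"

for language, text in samples.items():
    print(f"{language}: U+0020 is {space_share(text):.1%} of characters")
```

A real usage figure would depend entirely on which corpus is counted, which
is the weighting problem raised further down the thread.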

On Fri, Mar 27, 2015 at 4:24 PM, Michael Norton <
michaelanortonster_at_gmail.com> wrote:

> Just using the tools and formulations we have at present ought to allow
> Unicode to produce a usage set without indexing the entire web which would
> provide implementors with an indication of variances for traffic, overflow,
> and override purposes relative to users of the standard. If the figure
> varies significantly from page:website, website:region, region:language,
> for example, it simplifies our ability to standardize the set.
>
> I have particular concerns, but, like Google, they are proprietary.
>
> On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <john_at_mitre.org> wrote:
>
>> On Mar 27, 2015, at 15:57 , Michael Norton <michaelanortonster_at_gmail.com>
>> wrote:
>>
>> Why wouldn't Unicode itself have it?
>>
>>
>> Because as Ken explained, acquiring (and constantly updating) such
>> statistics would require roughly the effort that Google puts into its
>> crawler. And it wouldn't include all the printed material that isn't on the
>> web.
>>
>> Turning your question around, why would Unicode have this information?
>> What would be the value, and how would it be worth the (considerable)
>> effort required?
>>
>> - John Burger
>> MITRE
>>
>>
>> On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <kenwhistler_at_att.net>
>> wrote:
>>
>>> Search engine companies (and in particular, Google) have such
>>> information squirreled away in their index databases, at least as
>>> far as usage stats for Unicode characters on the web go -- but it
>>> is proprietary information, and they generally don't publish
>>> information about such statistics.
>>>
>>> Perhaps there are researchers out there who have set web crawlers
>>> on a mission to generate such web statistics for publication, and maybe
>>> somebody on this list knows of such research -- but it would be
>>> virtually impossible to generate such information for the much
>>> wider collection of documents and data that are not easily accessible
>>> for web indexing. (Behind password walls, in pdf document archives,
>>> in proprietary databases, ... ) As an example of why this is a problem,
>>> consider the fact that there are *peta*bytes of information picked up
>>> and stored in databases from scanners and other devices used at
>>> tens of millions of retail points of sale. Such data, by its nature,
>>> would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its
>>> character data. How would you end up weighting such (mostly
>>> publicly inaccessible) data in trying to count up for overall statistics
>>> on character use?
>>>
>>> There are more traditional usage count studies that focus on
>>> counts of character frequency within single language orthographies
>>> in single scripts (e.g., letter frequencies for French text), but I don't
>>> think that is what you were asking about.
>>>
>>> Here is some discussion of a similar question posted on stackoverflow:
>>>
>>> http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics
>>>
>>> --Ken
>>>
>>> On 3/27/2015 9:31 AM, Michael Norton wrote:
>>>
>>>> Hello and thank you for an incredible service (just joining the list).
>>>> Is there a list of usage statistics per character of the Unicode set
>>>> available somewhere?
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Unicode mailing list
>>> Unicode_at_unicode.org
>>> http://unicode.org/mailman/listinfo/unicode
>>>
>>
>>
>>
>> --
>>
>> Michael A. Norton, B.A. Cinema, M.P.A.
>> My Cinema Home: http://www.NortonsNook.com
>>
>> "All great actors are mere mathematical masters of speech and the human
>> body."
>>
>

-- 
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com
"All great actors are mere mathematical masters of speech and the human
body."

Received on Fri Mar 27 2015 - 15:28:21 CDT

This archive was generated by hypermail 2.2.0 : Fri Mar 27 2015 - 15:28:21 CDT