Re: Usage stats?

From: Michael Norton <michaelanortonster_at_gmail.com>
Date: Sat, 28 Mar 2015 07:30:23 -0400

Thanks Doug. I did not know there exists a *representative* sample of the
world's text. :) I do know that 400 years ago there were about 10,000
languages; now there are about 6,500. Time flies!

Your frequency chart is great. The average char appearance is 2.91%.
Only 34% from your list exceed 10% of it. Therefore, U+0020 is the
elephant in the room (i.e., 15.05% is far > 2.91%). In fact, it's more
than 50% greater than the next most frequent character.

So from the two frequency lists you've given me (my email and yours), we
begin to see some patterns emerge. Given prior data and observation, the
most useful patterns prevail over the more obscure ones and present a
provocative opportunity for webbers out there. While this is probably out
of context for most of the 700 Unicode members, I can report that it's good
news.

On Fri, Mar 27, 2015 at 5:31 PM, Doug Ewell <doug_at_ewellic.org> wrote:

> Here is a frequency chart for my previous message. I used the Character
> Frequency tool in Andrew West's BabelPad editor (
> http://www.babelstone.co.uk/Software/BabelPad.html) and sent the output
> to Excel to calculate the percentages. To make Excel happy, I had to
> manually add a single quote ' before the double quotation mark " .
>
> This is still *nowhere near* a realistic sample of which Unicode
> characters are used with what frequency in the entire world. There are
> still only 69 discrete characters, less than the printable ASCII set. And
> according to this sample, Regional Indicator Symbols occur as often as
> capital A, and capital R never occurs at all.
>
> In Japanese or Thai text you will have almost no instances of U+0020.
>
> If you search a non-representative sample of the world's text, you will
> get non-representative statistics.
>
>
> Code point  Character  Character Name                      Count  Percent
> U+0020                 SPACE                                 177   15.05%
> U+0065      e          LATIN SMALL LETTER E                   92    7.82%
> U+0074      t          LATIN SMALL LETTER T                   86    7.31%
> U+006F      o          LATIN SMALL LETTER O                   76    6.46%
> U+0061      a          LATIN SMALL LETTER A                   63    5.36%
> U+0069      i          LATIN SMALL LETTER I                   62    5.27%
> U+006E      n          LATIN SMALL LETTER N                   54    4.59%
> U+0072      r          LATIN SMALL LETTER R                   50    4.25%
> U+0073      s          LATIN SMALL LETTER S                   47    4.00%
> U+006C      l          LATIN SMALL LETTER L                   44    3.74%
> U+0063      c          LATIN SMALL LETTER C                   38    3.23%
> U+0068      h          LATIN SMALL LETTER H                   34    2.89%
> U+0075      u          LATIN SMALL LETTER U                   33    2.81%
> U+0064      d          LATIN SMALL LETTER D                   27    2.30%
> U+0079      y          LATIN SMALL LETTER Y                   25    2.13%
> U+0067      g          LATIN SMALL LETTER G                   18    1.53%
> U+002E      .          FULL STOP                              16    1.36%
> U+0030      0          DIGIT ZERO                             15    1.28%
> U+0062      b          LATIN SMALL LETTER B                   15    1.28%
> U+0066      f          LATIN SMALL LETTER F                   15    1.28%
> U+003E      >          GREATER-THAN SIGN                      13    1.11%
> U+0070      p          LATIN SMALL LETTER P                   13    1.11%
> U+0077      w          LATIN SMALL LETTER W                   12    1.02%
> U+002C      ,          COMMA                                  11    0.94%
> U+006D      m          LATIN SMALL LETTER M                   11    0.94%
> U+0055      U          LATIN CAPITAL LETTER U                  9    0.77%
> U+002D      -          HYPHEN-MINUS                            8    0.68%
> U+0076      v          LATIN SMALL LETTER V                    7    0.60%
> U+0078      x          LATIN SMALL LETTER X                    7    0.60%
> U+0027      '          APOSTROPHE                              6    0.51%
> U+0025      %          PERCENT SIGN                            5    0.43%
> U+002B      +          PLUS SIGN                               5    0.43%
> U+0037      7          DIGIT SEVEN                             5    0.43%
> U+006B      k          LATIN SMALL LETTER K                    5    0.43%
> U+0022      '"         QUOTATION MARK                          4    0.34%
> U+0031      1          DIGIT ONE                               4    0.34%
> U+0036      6          DIGIT SIX                               4    0.34%
> U+003A      :          COLON                                   4    0.34%
> U+003F      ?          QUESTION MARK                           4    0.34%
> U+0032      2          DIGIT TWO                               3    0.26%
> U+0033      3          DIGIT THREE                             3    0.26%
> U+0034      4          DIGIT FOUR                              3    0.26%
> U+0042      B          LATIN CAPITAL LETTER B                  3    0.26%
> U+004C      L          LATIN CAPITAL LETTER L                  3    0.26%
> U+004F      O          LATIN CAPITAL LETTER O                  3    0.26%
> U+0057      W          LATIN CAPITAL LETTER W                  3    0.26%
> U+002F      /          SOLIDUS                                 2    0.17%
> U+0035      5          DIGIT FIVE                              2    0.17%
> U+0043      C          LATIN CAPITAL LETTER C                  2    0.17%
> U+0045      E          LATIN CAPITAL LETTER E                  2    0.17%
> U+0046      F          LATIN CAPITAL LETTER F                  2    0.17%
> U+0049      I          LATIN CAPITAL LETTER I                  2    0.17%
> U+004E      N          LATIN CAPITAL LETTER N                  2    0.17%
> U+007C      |          VERTICAL LINE                           2    0.17%
> U+0028      (          LEFT PARENTHESIS                        1    0.09%
> U+0029      )          RIGHT PARENTHESIS                       1    0.09%
> U+0038      8          DIGIT EIGHT                             1    0.09%
> U+0039      9          DIGIT NINE                              1    0.09%
> U+003B      ;          SEMICOLON                               1    0.09%
> U+003C      <          LESS-THAN SIGN                          1    0.09%
> U+0041      A          LATIN CAPITAL LETTER A                  1    0.09%
> U+0044      D          LATIN CAPITAL LETTER D                  1    0.09%
> U+004A      J          LATIN CAPITAL LETTER J                  1    0.09%
> U+004D      M          LATIN CAPITAL LETTER M                  1    0.09%
> U+0050      P          LATIN CAPITAL LETTER P                  1    0.09%
> U+0054      T          LATIN CAPITAL LETTER T                  1    0.09%
> U+0059      Y          LATIN CAPITAL LETTER Y                  1    0.09%
> U+1F1F8     🇸          REGIONAL INDICATOR SYMBOL LETTER S      1    0.09%
> U+1F1FA     🇺          REGIONAL INDICATOR SYMBOL LETTER U      1    0.09%
> Total                                                       1176  100.00%
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>
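
For anyone who wants to build a chart like the one quoted above without
BabelPad and Excel, here is a rough Python sketch. It is not what Doug
used; the function name char_frequencies and the sample string are my own
inventions, and it counts code points (not grapheme clusters), so a flag
emoji shows up as two Regional Indicator Symbols, just as in the chart.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Return (code point, character, name, count, percent) rows,
    most frequent first. Hypothetical helper, for illustration only."""
    counts = Counter(text)  # one entry per distinct code point
    total = sum(counts.values())
    rows = []
    # Sort by descending count, then by code point so ties are stable.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0]))):
        # Control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", ch, name, n, 100.0 * n / total))
    return rows


if __name__ == "__main__":
    sample = "Hello, world! \U0001F1FA\U0001F1F8"  # made-up sample text
    for code, ch, name, n, pct in char_frequencies(sample):
        print(f"{code:<8} {ch!r:<6} {name:<36} {n:>4} {pct:6.2f}%")
```

As Doug says, the percentages only describe whatever text you feed in; a
single email will never stand in for the world's text.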

-- 
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com
"All great actors are mere mathematical masters of speech and the human
body."

Received on Sat Mar 28 2015 - 10:33:40 CDT