Re: lists of actual character/diacritic combinations

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Feb 29 2000 - 17:19:26 EST


Michael Everson surmised:
 
> At 14:06 -0800 2000-02-28, Chris Pratley wrote:
> >Does anyone have a list of combinations of character + combining
> >diacritic(s) that actually occur in use in the world's writing
> >systems?
> >I'm curious as to which are the most common, which are never found, etc.
>
> I suspect you'll find that acutes and circumflexes top the list. Diaereses
> and tildes next. "Never founds" never occur.... :-) There's always somebody
> who'll use something....
>

I took the data file that John Cowan has posted at:

http://www.ccil.org/~cowan/elsie/elsie.html

and did the parsing and counting. The raw figures are posted below.
These constitute the lumped sums from both the MUMS Books database and
the JACKPHY database, containing 12,421,528 instances of characters with
diacritics, out of a total of 1,492,948,727 Latin characters.
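
For anyone who wants to run the same kind of count over their own text,
here is a minimal Python sketch. It is not the script used for the figures
in this message, and it does not read the data file linked above, whose
format is not described here. The name count_diacritics is made up for
illustration, and the sketch assumes the input is a Unicode string already
in decomposed form (base letter followed by its combining marks).

```python
import unicodedata
from collections import Counter

def count_diacritics(text):
    """Tally combining marks, and the base characters that carry them.

    Assumes `text` is already decomposed: each base character is followed
    by zero or more combining marks. (Hypothetical helper, for illustration.)
    """
    marks = Counter()   # combining mark -> number of occurrences
    bases = Counter()   # base character -> number of marked occurrences
    base = None
    base_counted = False
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "Mark" classes.
        if unicodedata.category(ch).startswith("M"):
            marks[ch] += 1
            # Count a base once, however many marks are stacked on it.
            if base is not None and not base_counted:
                bases[base] += 1
                base_counted = True
        else:
            base = ch
            base_counted = False
    return marks, bases

# Usage: print the tallies in roughly the layout used in this message.
marks, bases = count_diacritics("a\u0301 o\u0304\u0301 e")
for ch, n in marks.most_common():
    print(f"{n:7d} : {ord(ch):04X} {unicodedata.name(ch).lower()}")
```

Precomposed text would first have to be decomposed, for example with
unicodedata.normalize("NFD", text). Be aware that NFD also splits some
letters that the tables in this message list as base characters in their
own right, such as U+01A1 (o with horn), so the results would not line up
exactly.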

As I noted before, take this with a grain of salt. This is merely a
corpus count, and the frequencies depend entirely on what is included
in that corpus.

--Ken

3320174 : 0304 macron
2761619 : 0301 acute (NOTE: 38 tokens are probably for double acute)
1529686 : 0308 diaeresis
1033167 : 0306 breve
 893998 : 0323 dot below
 691920 : FE20 ligature left half
 691829 : FE21 ligature right half (NOTE: some miscoding in data implied)
 252196 : 0300 grave
 229524 : 030C caron
 224048 : 0303 tilde
 184104 : 0307 dot above
 161285 : 0327 cedilla
 140016 : 0302 circumflex
  79076 : 0326 comma below
  77278 : 0331 macron below
  53663 : 030A ring above
  39912 : 0328 ogonek
  32388 : 031C left half ring below (NOTE: probably mostly intended for ogonek)
  31537 : 030B double acute
  22960 : 0325 ring below
   9067 : 0324 diaeresis below
   7087 : 0309 hook above
   5949 : 0310 candrabindu
   4430 : 0315 comma above right
   1220 : 0333 double low line
    696 : 0313 comma above
    172 : 032E breve below
    142 : FE22 double tilde left half
     85 : FE23 double tilde right half
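
The lines above follow a simple "count : code name" layout, so they are
easy to read back in for ranking or rescaling. A rough Python sketch,
assuming that layout holds for every line; parse_counts is a hypothetical
helper, and the only figures used are ones quoted in this message:

```python
def parse_counts(lines):
    """Parse 'count : code name' lines into (count, code point, name) tuples."""
    rows = []
    for line in lines:
        count, sep, rest = line.partition(" : ")
        if not sep:
            continue  # skip blank lines or anything not in the expected layout
        code, _, name = rest.strip().partition(" ")
        # Drop a trailing parenthesised "(NOTE: ...)" remark, if present.
        name = name.split(" (NOTE", 1)[0].strip()
        rows.append((int(count), int(code, 16), name))
    return rows

sample = [
    "3320174 : 0304 macron",
    "2761619 : 0301 acute (NOTE: 38 tokens are probably for double acute)",
    " 140016 : 0302 circumflex",
]
rows = parse_counts(sample)

# Total Latin characters in the corpus, as stated in this message.
LATIN_TOTAL = 1_492_948_727

# Rank by count and express each as occurrences per million Latin characters.
for count, code, name in sorted(rows, reverse=True):
    per_million = 1e6 * count / LATIN_TOTAL
    print(f"U+{code:04X} {name}: {per_million:.0f} per million")
```

Rescaling against the corpus size makes it easier to compare these counts
with counts from a corpus of a different size, though the caveat about
corpus composition still applies.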

And in case anyone is interested in *what* the diacritics get applied to,
here are the raw figures for the frequency of base characters:

2594522 : 0061 a
2391847 : 006F o
1792835 : 0069 i
1594686 : 0065 e
1486937 : 0075 u
 407268 : 0073 s
 331776 : 0074 t
 294089 : 006E n
 254005 : 0063 c
 201077 : 0068 h
  90764 : 006B k
  82214 : 0053 S
  74224 : 0049 I
  70229 : 0041 A
  64821 : 0045 E
  62795 : 007A z
  62553 : 0072 r
  58625 : 004F O
  56210 : 0055 U
  55316 : 006D m
  52729 : 0048 H
  48828 : 0076 v
  47061 : 0054 T
  43585 : 0064 d
  41736 : 0079 y
  36920 : 006C l
  24252 : 0043 C
  20431 : 004B K
  17107 : 0067 g
  11100 : 01B0 u-hook
   9727 : 00E6 ae
   8127 : 005A Z
   7767 : 0056 V
   7218 : 0044 D
   4621 : 0153 oe
   2995 : 01A1 o-hook
   2121 : 0052 R
   1897 : 0131 dotless-i
   1735 : 004E N
   1676 : 0047 G
    545 : 004C L
    528 : 0077 w
    497 : 0046 F
    363 : 0062 b
    252 : 0070 p
    199 : 006A j
    165 : 0066 f
    153 : 0071 q
     95 : 004A J
     93 : 0042 B
     66 : 0059 Y
     45 : 004D M
     31 : 0078 x
     22 : 0050 P
     21 : 01AF U-hook
     17 : 00C6 AE
     14 : 0051 Q
     11 : 01A0 O-hook
      8 : 0057 W
      4 : 0058 X
      3 : 0152 OE


