Re: lists of actual character/diacritic combinations

From: Joan Aliprand (BR.JMA@RLG.ORG)
Date: Wed Mar 01 2000 - 17:57:34 EST


Ken Whistler <kenw@sybase.com> wrote:

>John Cowan noted
[about Ken Whistler's initial calculation for the MUMS and
JACKPHY databases]:

 [snip]
>> BTW, the JACKPHY database (IIRC) is bibliographic information (in Latin
>> alphabet transliteration) for books written in non-Latin scripts.
>> So it represents "non-native" uses of diacritics.
>
>Fair enough. It appears that lumping the two sets of data from the
>differing corpora yields misleading results. So here is the raw
>data, recalculated, separating the MUMS Books database (first column)
>and the JACKPHY database (second column), sorted in descending
>frequency by number of occurrences in the MUMS Books database.
>
>Comparing the two sets of data, it is clear that the JACKPHY database
>contains an anomalously high frequency of macrons, breves, and dot belows,
>and an anomalously low frequency of acutes, graves, carons, and tildes,
>etc.
>
>For the base letters, the JACKPHY database has an anomalously high
>proportion of o's, u's, h's, k's, and v's carrying diacritics, and
>an anomalously low proportion of e's, n's, and c's, etc.

Not only does the JACKPHY database represent "non-native" uses of
diacritics, but the MUMS database includes "non-native" uses. :-(

JACKPHY stands for "Japanese, Arabic, Chinese, Korean, Persian,
Hebrew and Yiddish."

When the Library of Congress proposed discontinuing its printed
card service and supplying only machine-readable records, the US
library community requested retention of printed cards for the
"JACKPHY" languages until original script support was available.*

The base letter frequencies for the JACKPHY data reflect the
ALA-LC conventions for the transliteration of these languages.

All other non-Roman scripts have been cataloged in
transliteration since then, and are in the MUMS database. (I
don't know whether the transliterated versions of JACKPHY records
are as well.)

For example, FE20 and FE21 (ligature halves) are "high scorers"
in MUMS data; these diacritics are used exclusively in
transliteration (chiefly for languages written in Cyrillic
script).
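
To illustrate how the ligature halves end up attached to base letters
in counts like these, here is a minimal Python sketch. It assumes the
text is available as Unicode strings; count_marks is a hypothetical
helper for illustration, not the method used to produce the figures
quoted above. The sample string is the ALA-LC romanization of the
Cyrillic letter "ц" ("ts" joined by a ligature), written with one half
mark after each base letter.

```python
import unicodedata
from collections import Counter

LEFT_HALF = "\uFE20"   # COMBINING LIGATURE LEFT HALF
RIGHT_HALF = "\uFE21"  # COMBINING LIGATURE RIGHT HALF


def count_marks(text):
    """Count combining marks, keyed by (base letter, mark name).

    Hypothetical helper: decomposes the text (NFD) so that precomposed
    letters such as "é" are counted as base letter plus mark, then
    attributes each mark to the most recent non-mark character.
    """
    counts = Counter()
    base = None
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            if base is not None:
                counts[(base, unicodedata.name(ch))] += 1
        else:
            base = ch
    return counts


# "ts" with the two ligature halves, as used for Cyrillic "ц".
romanized_tse = "t" + LEFT_HALF + "s" + RIGHT_HALF

for (base, mark), n in sorted(count_marks(romanized_tse).items()):
    print(base, mark, n)
```

Because each half is a separate combining mark, one ligature adds two
entries to a tally of this kind, one per base letter.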

-- Joan Aliprand
   Research Libraries Group

* All of the scripts of the JACKPHY languages, plus Cyrillic, are
  supported on the Research Libraries Information Network (RLIN).
  LC uses RLIN for its JACKPHY script cataloging (including
  Arabic cataloging done by its office in Cairo).

To: UNICODE@UNICODE.ORG


