Re: Most commonly used characters not in BMP

From: Leonardo Boiko (
Date: Mon Jun 14 2010 - 11:27:54 CDT


    On Mon, Jun 14, 2010 at 13:10, John H. Jenkins <> wrote:
    > I imagine that the best data would come from Google.

    As far as I know, Google discards punctuation and other miscellaneous
    characters during tokenization, so it would only work for the subset
    of Unicode they are willing to index (I think? I’m rusty on the
    details). I’d like just a simple, unfiltered, raw usage count per
    codepoint (perhaps with separate counters per-language and country,
    with the usual caveats of how hard it is to auto-detect those).
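
    The kind of counter described above could look roughly like the
    following. This is only a minimal sketch in Python 3, not anything
    Google or anyone else is known to run. The names count_codepoints,
    non_bmp, per_locale and add_sample are hypothetical, and the
    language and country labels are assumed to be supplied by the
    caller, since auto-detecting them is the hard part.

```python
from collections import Counter, defaultdict


def count_codepoints(text, counts=None):
    """Add one count per code point in `text`; nothing is filtered out."""
    if counts is None:
        counts = Counter()
    # Iterating over a Python 3 str yields code points, not UTF-16 units,
    # so punctuation, symbols and supplementary characters are all counted.
    counts.update(text)
    return counts


def non_bmp(counts):
    """Keep only the code points above U+FFFF, i.e. outside the BMP."""
    return Counter({ch: n for ch, n in counts.items() if ord(ch) > 0xFFFF})


# Hypothetical layout: one counter per (language, country) pair.
per_locale = defaultdict(Counter)


def add_sample(text, language, country):
    """Count `text` under a caller-supplied (assumed correct) locale label."""
    count_codepoints(text, per_locale[(language, country)])
```

    Calling non_bmp(counts).most_common() on an accumulated counter
    then lists the supplementary-plane characters in descending order
    of frequency.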
