RE: Characters

From: Doug Ewell (doug@ewellic.org)
Date: Fri Feb 11 2011 - 12:14:33 CST

Next message: mpsuzuki@hiroshima-u.ac.jp: "Re: [unicode] RE: Characters"

Previous message: Frédéric Grosshans: "Proposed *HIRAGANA/KATAKANA LETTER SMALL KO"
Maybe in reply to: anbu@peoplestring.com: "Characters"
Next in thread: John Burger: "Re: Characters"
Reply: John Burger: "Re: Characters"
Reply: William_J_G Overington: "RE: Characters"
Reply: William_J_G Overington: "RE: Characters"

<anbu at peoplestring dot com> wrote:

> No, this is not a joke. Whenever I post something, you are making fun
> of it. What's the problem? I seriously want to know the characters
> present in Unicode 6 and each of their frequencies of usage.

The characters are available from the Unicode Character Database, as
others have said.
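
For a quick look at what is assigned, something like the following sketch may help. It relies on Python's standard `unicodedata` module, which carries whatever UCD version shipped with the interpreter (reported by `unicodedata.unidata_version`), so it will not necessarily match Unicode 6.0. The function name `assigned_by_category` is made up for this illustration, and treating "assigned" as "not unassigned, surrogate, or private use" is an assumption of the sketch.

```python
import sys
import unicodedata
from collections import Counter

def assigned_by_category():
    """Count code points per General_Category, skipping unassigned (Cn),
    surrogate (Cs) and private-use (Co) code points."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat not in ("Cn", "Cs", "Co"):
            counts[cat] += 1
    return counts

counts = assigned_by_category()
print("UCD version bundled with this Python:", unicodedata.unidata_version)
print("Assigned characters (by this sketch's definition):", sum(counts.values()))
for cat, n in counts.most_common(5):
    print(cat, n)
```

This only lists what exists; it says nothing about how often any character is used.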

If you know or have read anything about text compression -- and I assume
that is what you are trying to implement, based on this and previous
postings -- you know that frequency of usage of text characters is
completely, totally dependent on context.

In English text, there are different letter frequencies compared to
French or Greek or Tamil or Japanese text. SMS messages probably have
different frequencies compared to e-mails or scholarly works. Financial
or statistical reports may have a higher concentration of digits. C#
code has a high concentration of ( and ) and { and }. The beat goes on.
There is no one frequency chart, for alphabetic letters or for all of
Unicode, that is right for all text-compression needs; and a compression
scheme that assumes one will fail spectacularly for text samples that
fit a different model.
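
To make the point concrete, here is a minimal sketch that builds a character-frequency model from one kind of text and measures how many bits per character an ideal entropy coder would need for another kind. The two sample strings, the helper names `char_model` and `bits_per_char`, and the add-one smoothing are all assumptions chosen for illustration, not a real compressor or a real corpus.

```python
import math
from collections import Counter

def char_model(sample, alphabet, smoothing=1.0):
    """Estimate per-character probabilities from a sample.
    Smoothing gives unseen characters a small nonzero probability."""
    counts = Counter(sample)
    total = len(sample) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def bits_per_char(text, model):
    """Average ideal code length, in bits, of `text` under `model`."""
    return sum(-math.log2(model[ch]) for ch in text) / len(text)

# Tiny illustrative samples: English prose versus C-like code.
english = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
code = "if (a) { f(b); } else { g(c); } while (x) { h(y); }"
alphabet = set(english) | set(code)

english_model = char_model(english, alphabet)
code_model = char_model(code, alphabet)

matched = bits_per_char(code, code_model)        # model fits the text
mismatched = bits_per_char(code, english_model)  # model from the wrong context

print(f"code under code model:    {matched:.2f} bits/char")
print(f"code under English model: {mismatched:.2f} bits/char")
```

The mismatched figure comes out higher because the English-trained model assigns very low probability to brackets and semicolons, which is the failure mode described above, just at toy scale.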

I'm not trying to make fun of your posts, but the simple fact is that
your questions make me doubt whether you have enough background
knowledge to take on a project like this. I recommend "Data Compression:
The Complete Reference" by David Salomon for information about
compression in general, and maybe also Unicode Technical Note #14
(disclaimer: I wrote it) if you want to compress Unicode text.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s

This archive was generated by hypermail 2.1.5 : Fri Feb 11 2011 - 12:17:07 CST