RE: Tamil

From: Jonathan Rosenne
Date: Sun Feb 13 2011 - 14:29:50 CST


    Furthermore, while variable-length codes could save some bits in transmission or storage, characters of variable bit length are quite inconvenient for processing.

    At current transmission and storage costs it is not always worthwhile to compress text (as opposed to video), but when it is, use a compression algorithm, as has already been said here.
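
To make the point about compression concrete, here is a minimal sketch in Python. It assumes the text is stored as fixed-width UTF-16 and compressed with the standard zlib module; the helper name `compressed_size` and the sample string are illustrative only and not part of the original discussion.

```python
import zlib
from typing import Tuple


def compressed_size(text: str, encoding: str = "utf-16-le") -> Tuple[int, int]:
    """Return (raw byte count, zlib-compressed byte count) for text."""
    raw = text.encode(encoding)
    packed = zlib.compress(raw, 9)
    # Compression is lossless: decompressing gives back the same text.
    assert zlib.decompress(packed).decode(encoding) == text
    return len(raw), len(packed)


# Illustrative sample: a short Tamil phrase repeated many times.
sample = "தமிழ் எழுத்துக்கள் " * 200
raw_len, packed_len = compressed_size(sample)
print(raw_len, packed_len)
```

Every character still has the same fixed-width code while the text is being processed; the savings come only from the separate compression step applied for storage or transmission.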


    > -----Original Message-----
    > From: [] On
    > Behalf Of Mark E. Shoulson
    > Sent: Sunday, February 13, 2011 6:47 PM
    > To:
    > Subject: Re: Tamil
    > On 02/13/2011 09:59 AM, wrote:
    > > Tamil letters ஙா(0B99+0BBE), ஙி(0B99+0BBF), ஙீ(0B99+0BC0), ஙு(0B99+0BC1),
    > > ஙூ(0B99+0BC2), ஙெ(0B99+0BC6), ஙே(0B99+0BC7), ஙை(0B99+0BC8), ஙொ(0B99+0BCA),
    > > ஙோ(0B99+0BCB), ஙௌ(0B99+0BCC), ஞி(0B9E+0BBF), ஞீ(0B9E+0BC0), ஞு(0B9E+0BC1),
    > > ஞூ(0B9E+0BC2), ஞெ(0B9E+0BC6), ஞே(0B9E+0BC7), ஞை(0B9E+0BC8), ஞொ(0B9E+0BCA),
    > > ஞோ(0B9E+0BCB), ஞௌ(0B9E+0BCC) are almost unused and most Tamil symbols less
    > > used. We can assign them to more bits instead of the 16 bits they are
    > > assigned to, as they are occupying space with almost no use.
    > >
    > Indeed. This is the basis for Huffman Coding (see
    > ). And it should be
    > considered when compressing text. But if you are suggesting that the
    > codings in Unicode be changed, that really won't work, for several
    > reasons.
    >
    > For one thing, Unicode has all these stability regulations: they are not
    > going to change anything that's already been assigned (even if it's
    > actually wrong!). Too much depends on what is already done to allow that.
    >
    > Also, Unicode is generally about assigning codes to characters, and the
    > simplest way to do that is to assign codes of the same length to
    > everything. This is not the most efficient way in terms of bit-length,
    > as you point out, but that isn't the point of Unicode. For efficiency
    > in those terms, there are compression algorithms, like Huffman coding
    > and others. And that makes sense, too. Doing a general Huffman coding
    > over ALL of the Unicode characters and their general usage across the
    > whole corpus as it stands now would be very inefficient when applied to
    > individual documents. A document written in (say) Phags-Pa would
    > probably take a lot more bits per character than one written in ASCII,
    > because Phags-Pa has much less usage altogether, but if we do the
    > Huffman coding *afterwards*, based only on the frequency of that
    > document, then the rarity of Phags-Pa with respect to Latin letters no
    > longer matters, and we wind up with much shorter codes for the letters
    > we are actually using.
    >
    > Those characters aren't "occupying space". They only occupy space when
    > you use them, which as you said is not very often.
    >
    > ~mark
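
The per-document Huffman coding described in the quoted message can be sketched as follows. This is a minimal illustration in Python, not anything specified by Unicode: the function names `huffman_code_lengths` and `encoded_bits` are hypothetical, and the sketch only computes code lengths from the character frequencies of a single document rather than producing a full encoder.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_code_lengths(text: str) -> Dict[str, int]:
    """Return the Huffman code length (in bits) of each character in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreak, {char: depth in subtree}).
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every leaf moves one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encoded_bits(text: str) -> int:
    """Total bits needed to encode text with its own Huffman code."""
    lengths = huffman_code_lengths(text)
    return sum(lengths[ch] for ch in text)


# Frequencies come from this document only, not from global usage.
sample = "க" * 4 + "ங" * 2 + "ஞ"
print(huffman_code_lengths(sample))          # {'ஞ': 2, 'ங': 2, 'க': 1}
print(encoded_bits(sample), 16 * len(sample))  # 10 112
```

Because the frequencies are taken from the document itself, a letter that is rare in general usage still gets a short code if it is common in that document, which is the point made above about Phags-Pa.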
