From: Mark E. Shoulson (mark@kli.org)
Date: Sun Feb 13 2011 - 10:46:43 CST
On 02/13/2011 09:59 AM, anbu@peoplestring.com wrote:
> Tamil letters ஙா(0B99+0BBE), ஙி(0B99+0BBF), ஙீ(0B99+0BC0), ஙு(0B99+0BC1),
> ஙூ(0B99+0BC2), ஙெ(0B99+0BC6), ஙே(0B99+0BC7), ஙை(0B99+0BC8), ஙொ(0B99+0BCA),
> ஙோ(0B99+0BCB), ஙௌ(0B99+0BCC), ஞி(0B9E+0BBF), ஞீ(0B9E+0BC0), ஞு(0B9E+0BC1),
> ஞூ(0B9E+0BC2), ஞெ(0B9E+0BC6), ஞே(0B9E+0BC7), ஞை(0B9E+0BC8), ஞொ(0B9E+0BCA),
> ஞோ(0B9E+0BCB), ஞௌ(0B9E+0BCC) are almost unused and most Tamil symbols less
> used. We can assign them to more bits instead of the 16 bits they are
> assigned to, as they are occupying space with almost no use.
>
Indeed.  This is the basis for Huffman coding (see 
http://en.wikipedia.org/wiki/Huffman_coding ), and it should be 
considered when compressing text.  But if you are suggesting that the 
code assignments in Unicode be changed, that really won't work, for 
several reasons.

For one thing, Unicode has strict stability policies: they are not 
going to change anything that's already been assigned (even if it's 
actually wrong!).  Too much depends on what is already done to allow that.

Also, Unicode is generally about assigning codes to characters, and the 
simplest way to do that is to assign codes of the same length to 
everything.  This is not the most efficient way in terms of bit-length, 
as you point out, but that isn't the point of Unicode.  For efficiency 
in those terms, there are compression algorithms, like Huffman coding 
and others.  And that division of labor makes sense, too: doing a 
single Huffman coding over ALL of the Unicode characters, based on 
their usage across the whole corpus as it stands now, would be very 
inefficient when applied to individual documents.  A document written 
in (say) Phags-Pa would probably take a lot more bits per character 
than one written in ASCII, because Phags-Pa has much less usage 
altogether.  But if we do the Huffman coding *afterwards*, based only 
on the character frequencies in that document, then the rarity of 
Phags-Pa with respect to Latin letters no longer matters, and we wind 
up with much shorter codes for the letters we are actually using.
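
To make the per-document idea concrete, here is a rough sketch in Python.  It is only an illustration under my own assumptions: the function names are made up for this example, the sample string is an arbitrary run of four Phags-Pa code points (not meaningful text), and the cost of storing the code table alongside the compressed data is ignored.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built
    from the symbol frequencies of this one text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level down.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text, lengths):
    """Total bits needed to encode text with the given code lengths."""
    return sum(lengths[ch] for ch in text)


# Arbitrary Phags-Pa code points: U+A840 x4, U+A841 x2, U+A842 x1, U+A843 x1.
sample = "\ua840\ua841\ua842\ua840\ua843\ua840\ua841\ua840"
lengths = huffman_code_lengths(sample)

fixed_bits = len(sample.encode("utf-16-le")) * 8   # 16 bits per character
huffman_bits = encoded_bits(sample, lengths)

print("fixed-width:", fixed_bits, "bits")
print("per-document Huffman:", huffman_bits, "bits")
```

Because the code is built from this document alone, the most frequent letter in it gets a one-bit code no matter how rare Phags-Pa is in the world at large.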

Those characters aren't "occupying space".  They only occupy space when 
you use them, which, as you said, is not very often.
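
A quick way to see this, again as a small Python sketch of my own (the helper name `utf16_bytes` is made up, and UTF-16 is assumed since the question spoke of 16 bits): the encoded size of a text depends only on the characters that actually appear in it.

```python
def utf16_bytes(s):
    """Number of bytes s takes in UTF-16 (little-endian, no BOM)."""
    return len(s.encode("utf-16-le"))


# The word "Tamil" in Tamil script: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD.
common = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
# One of the rarely used combinations from the list: U+0B99 U+0BBE.
rare = "\u0b99\u0bbe"

print(utf16_bytes(common))         # text that never uses the rare letters
print(utf16_bytes(common + rare))  # cost appears only once they are used
```

A text that never contains the rare combinations pays nothing for their existence in the standard.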
~mark