Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Mon Sep 18 2006 - 05:00:40 CDT


    On 17 Sep 2006, at 23:41, Mark Davis wrote:

    > > Technical bias arises in encoding schemes for text such as
    > > Unicode UTF-8, which causes text in a non-roman script to require
    > > two to three times more space than comparable text in a roman script

    > Character frequency. One can't just compare the amount that a
    > particular character will grow or shrink; you have to look at the
    > frequency of usage of characters in the language.

    It seems to me that one should employ what might be called a character
    compression method, i.e., a compression method that compresses the
    character numbers (code points) rather than the encoded binary data,
    as this is probably more efficient in view of how compression
    algorithms work (i.e., by finding statistical regularities and using a
    variable-size encoding, with smaller sizes for the more frequent
    combinations).
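
    As a rough illustration of the idea, here is a minimal Python sketch
    that builds a Huffman code over the code points of a text, so that
    more frequent characters get shorter bit strings. Huffman coding is
    only one possible variable-size scheme, chosen here for brevity; the
    function names huffman_code_lengths and compressed_bits are
    hypothetical and not part of any existing compression format.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {character: code length in bits} for a Huffman code
    built over the code points of `text` (illustrative sketch only)."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs one bit per occurrence.
        return {ch: 1 for ch in freq}
    # Heap entries: (weight, unique tiebreak, {character: depth so far}).
    heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def compressed_bits(text):
    """Total payload size in bits when each code point is replaced by its
    Huffman code (the code table itself is not counted)."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[ch] * lengths[ch] for ch in freq)
```

    The input here is a sequence of code points (a Python str), so the
    result does not depend on how the text was stored on disk.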

    Then, of course, the compressed size of a file with Unicode text is
    independent of the encoding (UTF-N, N = 7, 8, 16, 32, etc.) used.
    The encoding can then be chosen based on the other criteria alone.
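
    The independence claim can be checked with a small Python sketch. It
    assumes a made-up canonical form (one 32-bit big-endian integer per
    code point) and uses zlib's DEFLATE as a stand-in for a real
    compressor; neither is a proposed format, and the helper names
    code_points and compress_code_points are hypothetical. The point is
    only that once the input is decoded to code points, the compressed
    size is the same whichever UTF the file was stored in, whereas
    compressing the encoded bytes directly depends on the encoding.

```python
import struct
import zlib


def code_points(data, encoding):
    """Decode encoded bytes into a list of Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def compress_code_points(points):
    """Compress a code point sequence via an assumed canonical form:
    one 32-bit big-endian integer per code point, then DEFLATE."""
    raw = struct.pack(f">{len(points)}I", *points)
    return zlib.compress(raw, 9)


# Sample non-roman text, repeated so the compressor has something to find.
text = "Ελληνικά κείμενο, ελληνικό κείμενο. " * 20

encoded_sizes = {}   # size of the text as stored in each encoding
byte_level = {}      # compressed size when compressing the encoded bytes
point_level = {}     # compressed size when compressing the code points

for enc in ("utf-7", "utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(enc)
    encoded_sizes[enc] = len(encoded)
    byte_level[enc] = len(zlib.compress(encoded, 9))
    point_level[enc] = len(compress_code_points(code_points(encoded, enc)))

for enc in encoded_sizes:
    print(enc, encoded_sizes[enc], byte_level[enc], point_level[enc])
```

    The last column is identical for every encoding, because all four
    inputs decode to the same code point sequence.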

    Perhaps Unicode should take up the initiative, persuading
    implementers of common compression formats to implement such
    character compression methods.

       Hans Aberg


