Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Tue Sep 19 2006 - 06:47:11 CDT


    On 19 Sep 2006, at 07:46, Asmus Freytag wrote:

    > In such situations, you cannot afford to compress/uncompress, as
    > most data is seen only once.

    Sure you can; you merely cannot base the compression on the whole
    of the data. So either divide it into subpackets, or make an
    assumption about what the statistical proportions might be. This is
    not as efficient as compressing the whole data, but modems, streaming
    video and the like use compression techniques, so it is surely
    possible to do it on the fly on a stream.
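
    As a rough sketch of the subpacket approach: the snippet below
    compresses a stream chunk by chunk, so each packet can be sent
    before the next chunk arrives. The use of Python's zlib module and
    the helper names compress_stream and decompress_stream are
    assumptions made for illustration, not something from this thread.

```python
import zlib


def compress_stream(chunks):
    """Compress byte chunks one at a time, never seeing the whole input."""
    comp = zlib.compressobj()
    for chunk in chunks:
        # Z_SYNC_FLUSH forces out everything compressed so far, so the
        # receiver can decode this packet without waiting for later data.
        yield comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
    # Final packet: terminates the compressed stream.
    yield comp.flush()


def decompress_stream(packets):
    """Decode packets in arrival order, yielding data as it becomes available."""
    decomp = zlib.decompressobj()
    for packet in packets:
        yield decomp.decompress(packet)
    yield decomp.flush()


# Hypothetical ASCII-heavy protocol traffic, seen only once, in order.
chunks = [b"GET /index.html HTTP/1.1\r\n", b"Host: example.org\r\n", b"\r\n"]
packets = list(compress_stream(chunks))
restored = b"".join(decompress_stream(packets))
```

    The compressor keeps its dictionary across packets, so later chunks
    still benefit from earlier ones; the cost of flushing per packet is
    a somewhat worse ratio than compressing the whole data at once.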

    > Finally, if most (much) of your data is ASCII due to the ASCII bias
    > of protocols, then any format that's close to ASCII is beneficial.
    > UTF-8 fits that bill. SCSU and BOCU take too much processing time
    > compared to UTF-8, and UTF-16/32 take too much space given the
    > assumptions.
    >
    > Add to that the fact that often data streams are already in UTF-8,
    > and that format becomes the format of choice for applications that
    > have the constraints mentioned. (As has been pointed out, the
    > 'bloat' for CJK is not a factor as long as the data always contains
    > high proportion of ASCII.)

    It is probably more efficient to translate the stream into code
    points and then use a compression technique on that, because then the
    full character structure is taken into account. Then it does not
    matter which character encoding is used.
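
    A minimal sketch of that idea, under some loud assumptions: zlib
    works on bytes, so the code points are serialised here as fixed
    3-byte units, which is only a stand-in for a compressor whose
    alphabet is the code points themselves. The helper names and the
    sample text are hypothetical. The sketch shows only that the result
    no longer depends on the incoming encoding, not that it is smaller.

```python
import zlib


def code_points(data, encoding):
    """Decode an encoded byte stream into a list of Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def compress_code_points(cps):
    """Compress code points via a fixed-width form (3 bytes covers U+10FFFF)."""
    raw = b"".join(cp.to_bytes(3, "big") for cp in cps)
    return zlib.compress(raw)


def decompress_code_points(blob):
    """Inverse of compress_code_points: recover the list of code points."""
    raw = zlib.decompress(blob)
    return [int.from_bytes(raw[i:i + 3], "big") for i in range(0, len(raw), 3)]


# Hypothetical mixed ASCII/CJK sample arriving in two different encodings.
text = "Content-Type: text/plain; 日本語のテキスト"
from_utf8 = compress_code_points(code_points(text.encode("utf-8"), "utf-8"))
from_utf16 = compress_code_points(code_points(text.encode("utf-16-le"), "utf-16-le"))
same_output = from_utf8 == from_utf16
```

    Both inputs decode to the same code point sequence, so the
    compressor sees identical input whichever encoding the stream used.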

       Hans Aberg


