Re: Unicode & space in programming & l10n

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Sep 19 2006 - 00:46:00 CDT


    Some additional thoughts:

    There are certain types of applications (web log processing, for one)
    where handling vast amounts of "text" data is important. In these
    situations, a reasonably dense representation of the data allows more
    processing with far fewer cache misses. The size of the mass-storage
    device is irrelevant; it is the size of the 'peephole' represented by
    your cache that is the limiting factor.

    In such situations, you cannot afford to compress/uncompress, as most
    data is seen only once.

    Finally, if most (much) of your data is ASCII due to the ASCII bias of
    protocols, then any format that's close to ASCII is beneficial. UTF-8
    fits that bill. SCSU and BOCU take too much processing time compared to
    UTF-8, and UTF-16/32 take too much space given the assumptions.

    Add to that the fact that often data streams are already in UTF-8, and
    that format becomes the format of choice for applications that have the
    constraints mentioned. (As has been pointed out, the 'bloat' for CJK is
    not a factor as long as the data always contains a high proportion of ASCII.)
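
    The CJK remark can be checked with a little arithmetic: a BMP CJK
    character takes 3 bytes in UTF-8 and 2 in UTF-16, while an ASCII
    character takes 1 and 2, so UTF-8 comes out smaller whenever ASCII
    characters outnumber the CJK ones. The sketch below illustrates this;
    the sample strings and the function name utf8_to_utf16_ratio are
    hypothetical, not taken from the thread.

```python
def utf8_to_utf16_ratio(text):
    """Return size(UTF-8) / size(UTF-16) for `text`; below 1.0 means UTF-8 is smaller."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_bytes = len(text.encode("utf-16-le"))  # little-endian variant, no BOM
    return utf8_bytes / utf16_bytes


# Illustrative samples (assumptions, not real log data).
PURE_CJK = "日本語のテキスト"                      # BMP CJK only
MIXED = "GET /search?q=日本語 HTTP/1.1 200"        # protocol ASCII around a CJK query

print("pure CJK:", utf8_to_utf16_ratio(PURE_CJK))
print("mixed   :", utf8_to_utf16_ratio(MIXED))
```

    Pure CJK text shows the familiar 1.5x 'bloat', but once protocol
    ASCII dominates the line the ratio drops below 1.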

    SCSU was developed for a system that had bandwidth limitations in
    straight transmission. Except for communications to remote areas
    (wilderness, marine, space etc.) such severe limitations on transmission
    bandwidth are a thing of the past, and even then, the block compression
    algorithms can often be used to good advantage.

    As for mass-storage limitations (or not), the rest of the thread contains
    sufficient discussion.

    A./


