Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Hans Aberg
Date: Fri Jun 02 2006 - 15:34:56 CDT


    For uses such as those below, it is probably better to find more
    efficient compression techniques than to hope that UTF-8 will do
    the job. The original idea behind UTF-8, which came from UNIX
    developers, was compatibility with ASCII, not text compression. In
    view of Moore's law, storage will fairly quickly become sufficient
    for UTF-32 in any given application. Moreover, the argument for
    UTF-8 on grounds of space efficiency can equally be made for UTF-32
    on grounds of time efficiency, for example in typesetting programs
    like TeX, where some people may want to plug in a whole
    encyclopedia and have it compiled interactively in a fraction of a
    second. So use UTF-8 or UTF-32, whichever offers the most practical
    and efficient tradeoff for the needs at hand.
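
    To make the space side of that tradeoff concrete, here is a minimal
    sketch in Python that measures how many bytes the same string takes
    in each encoding. The function name encoded_sizes and the sample
    strings are illustrative assumptions, not anything from the thread.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in UTF-8 and in UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-32-le" fixes the byte order, so no 4-byte BOM is prepended.
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # Hypothetical samples: ASCII, CJK, and one character outside the BMP.
    for sample in ("abc", "\u6f22\u5b57", "\U0001F600"):
        print(repr(sample), encoded_sizes(sample))
```

    UTF-32 always costs four bytes per code point, while UTF-8 costs
    between one and four, so the size gap is largest for ASCII-heavy
    text and disappears for characters outside the BMP.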

    On 2 Jun 2006, at 19:38, John D. Burger wrote:

    > Stephane Bortzmeyer wrote:
    >> Show me someone who can fill a modern hard disk with only raw text
    >> (Unicode is just that, raw text) encoded in UTF-32. Even UTF-256
    >> would not do it.
    > Huh? There's a lot of text out there. I'm pretty sure that
    > Google's cache fills far more than one hard disk, for instance.
    > For a personal example, I do research with this text collection:
    > catalogId=LDC2003T05
    > In UTF-32, this would take up close to 50 gigabytes, one-tenth of
    > the disk on my machine. And LDC has dozens of such collections,
    > although Gigaword is probably one of the biggest, and I'm typically
    > only working with a handful at a time.
    > I'm also about to begin some work on Wikipedia. The complete
    > English dump, with all page histories, which is what I'm interested
    > in, takes up about a terabyte. In UTF-8.
    > - John D. Burger
    > MITRE
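
    The 50-gigabyte figure quoted above can be turned into a rough
    UTF-8 estimate with a little arithmetic. This is only a sketch: the
    function name utf8_size_estimate is hypothetical, and the default
    of one UTF-8 byte per code point is an assumption that holds only
    for text that is almost entirely ASCII.

```python
UTF32_BYTES_PER_CODE_POINT = 4  # fixed width by definition of UTF-32


def utf8_size_estimate(utf32_bytes: int,
                       avg_utf8_bytes_per_code_point: float = 1.0) -> float:
    """Estimate the UTF-8 size of a corpus from its UTF-32 size.

    The average UTF-8 width is an assumed parameter (1.0 means pure ASCII).
    """
    code_points = utf32_bytes / UTF32_BYTES_PER_CODE_POINT
    return code_points * avg_utf8_bytes_per_code_point


if __name__ == "__main__":
    # "close to 50 gigabytes" in UTF-32, taken here as exactly 50 * 10**9 bytes.
    print(utf8_size_estimate(50 * 10**9))
```

    Under that assumption the same collection would need roughly a
    quarter of the space in UTF-8, which is the saving being argued
    about in this thread.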
