Re: UTF-24

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Apr 03 2003 - 15:01:50 EST


    Pim Blokland wrote:
    > Why is there no UTF-24?

    Well, I once proposed UTF-20...

    > See, these MathText characters take up a lot of space. No matter how
    > you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
    > long.

    True for them alone, in those UTFs. Short of defining another Unicode encoding, there are two
    answers that I can offer you:

    1. Such characters are expected to be a minority of the text, I suppose even in Math text, because
    there are lots of other characters in such documents - punctuation, spaces, digits, regular text -
    that are mostly in the BMP and thus shorter. So a whole Math document with some MathText
    supplementary characters will use, on average, fewer than 3 bytes per code point in UTF-8/16.

    2. If you want compression, use the existing SCSU (UTR #6) or BOCU-1 (UTN #6), or a general-purpose
    compressor like bzip2.
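
    To make the two answers above concrete, here is a rough sketch in Python. The sample sentence
    and the helper name bytes_per_code_point are made up for illustration, not taken from any real
    document, and Python's bz2 module merely stands in for "a general-purpose compressor".

```python
import bz2


def bytes_per_code_point(text: str) -> dict:
    """Average encoded size per code point in each UTF (no BOM)."""
    n = len(text)  # a Python str is indexed by code point
    return {
        "utf-8": len(text.encode("utf-8")) / n,
        "utf-16": len(text.encode("utf-16-le")) / n,
        "utf-32": len(text.encode("utf-32-le")) / n,
    }


# U+1D465 and U+1D466 are MATHEMATICAL ITALIC SMALL X and Y (supplementary plane).
math_only = "\U0001D465\U0001D466"

# Hypothetical sample: mostly ASCII with a few supplementary math letters.
mixed = "Let \U0001D465 + \U0001D466 = 2, then \U0001D465 = 2 - \U0001D466."

print(bytes_per_code_point(math_only))  # 4.0 in all three UTFs
print(bytes_per_code_point(mixed))      # UTF-8 and UTF-16 come out well under 3

# Answer 2: a general-purpose compressor removes most of the remaining overhead
# once the text is long enough (the repetition here is artificial).
raw = (mixed * 200).encode("utf-8")
packed = bz2.compress(raw)
print(len(raw), len(packed))
```

    The exact averages depend entirely on how much of the text is supplementary, so treat the
    numbers as an illustration of the argument, not as a measurement of real Math documents.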

    Note that this matters only for text interchange - the majority of Unicode-aware software uses
    UTF-16 internally.

    Best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

