Re: UTF-7 - I'm not really smarter

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Mar 29 2006 - 01:38:05 CST


    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > On Tuesday, March 28th, 2006 15:56Z, Jon Hanna wrote:
    >> UTF-7 is an obsolete means of encoding Unicode characters.
    >
    > What do you mean (sorry) by "obsolete"?
    >
    > If you mean, "do NOT use that for a new project", I agree with you
    > wholeheartedly, and I believe such advice has already been given for
    > about 8 years.
    >
    > On the other hand, I was under the impression UTF-7 is still used by e.g.
    > some IMAP (and more generally electronic mail servers) implementations, some
    > of them /quite/ widespread.

    My opinion is that UTF-7 does not fall into any category defined as an "encoded character set" or "charset" (in MIME). It looks more like a slightly compressed variant of the base64 transfer encoding syntax.

    It cannot be recommended for general purpose use, because a single byte in a UTF-7 stream may contain bits that belong to several distinct Unicode characters. This means you cannot safely extract a substring at an arbitrary codepoint boundary without breaking the encoding. Instead, you need to count the byte position since the last '+' to infer where each code unit starts, then check whether it is a surrogate, and if so look backward (or forward) to find the start (or end) of the codepoint in a supplementary plane.
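    A minimal demonstration (in Python, whose standard library happens to ship a 'utf-7' codec): three Greek letters encode into a single base64 run, and each byte of that run carries bits from more than one UTF-16 code unit, so cutting the stream mid-run breaks decoding.

        # Three Greek letters share one base64 run in UTF-7.
        text = "αβγ"
        encoded = text.encode("utf-7")   # b'+A7EDsgOz-'
        print(encoded)

        # Slicing the byte stream inside the run does not land on a
        # codepoint boundary: the prefix fails to decode (or decodes to
        # something else), unlike a UTF-8 or UTF-16 stream cut at a
        # character boundary.
        prefix = encoded[:5]             # b'+A7ED' -- mid-character
        try:
            print(prefix.decode("utf-7"))
        except UnicodeDecodeError as exc:
            print("cannot decode prefix:", exc)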

    Although it has all the features of a transfer encoding syntax for UTF-16 (similar to other compression mechanisms), it is not handled that way in email systems, which treat it as if it were a MIME charset. This sort of hybrid cannot be safely standardized, because of the interoperability problems caused by this exception.

    For this reason, it is simpler to use a standard UTF as the encoding scheme, and then apply a standard compression and a transfer encoding syntax layer on top of it. The separate compression step can be avoided by using BOCU-1 or SCSU (but SCSU has the same problem as UTF-7, and possibly worse: it is impossible to extract substrings from it, its compression step is very complex to implement efficiently, and the compressed stream is quite difficult to parse, much more so than UTF-7). If I had to choose a compression mechanism that can fall into the MIME charset category and respect codepoint boundaries, I would definitely use BOCU-1.
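    The layering itself is trivial to build from standard pieces. Here is a sketch with zlib standing in for a Unicode-aware compressor such as BOCU-1 or SCSU (neither of which is in the Python standard library, so this only illustrates the layering, not the recommended compressor):

        import base64, zlib

        text = "Any text at all, in any script."
        # Layer 1: a standard UTF as the encoding scheme.
        utf8 = text.encode("utf-8")
        # Layer 2: a standard compression (zlib here, as a stand-in).
        packed = zlib.compress(utf8)
        # Layer 3: a transfer encoding syntax safe for 7-bit channels.
        wire = base64.b64encode(packed)

        # Decoding simply peels the layers off in reverse order.
        assert zlib.decompress(base64.b64decode(wire)).decode("utf-8") == text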

    But for 7-bit only transfers, there is no better choice for now than the following (the sketch below gives a quick way to compare them):
    * base64 on UTF-16 for Asian texts,
    * base64 on UTF-8 for Greek and Cyrillic texts,
    * quoted-printable on UTF-8 for Latin texts.
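    A quick, unscientific way to check these trade-offs, using only the Python standard library (the sample strings are arbitrary):

        import base64, quopri

        samples = {
            "Latin":    "Hello, how are you doing today, my friend?",
            "Greek":    "Καλημέρα, τι κάνεις σήμερα, φίλε μου;",
            "Japanese": "こんにちは、今日はお元気ですか。",
        }
        for name, text in samples.items():
            b64_u16 = len(base64.b64encode(text.encode("utf-16-be")))
            b64_u8  = len(base64.b64encode(text.encode("utf-8")))
            qp_u8   = len(quopri.encodestring(text.encode("utf-8")))
            print(name, "b64/UTF-16:", b64_u16,
                  "b64/UTF-8:", b64_u8, "QP/UTF-8:", qp_u8)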

    So maybe there is a need for a compression mechanism similar to BOCU-1, but using 7-bit code units only. Such a mechanism would have to preserve all letters, digits, spaces, the most important punctuation signs, and the controls of ASCII. It would use the ASCII symbols (normally not used in most texts) to denote a codepoint that has been "packed" into this leading code unit followed by other 7-bit code units in the [0x21..0x7E] range. (The reserved ASCII symbols that cannot be represented by a single code unit would have to be encoded using this same scheme.)

    Some ASCII symbols that could be reassigned as code leaders would be:
    - The plus sign,
    - The slash,
    - The equals sign,
    - The ampersand,
    - The percent sign,
    - The dollar sign,
    - The at sign,
    - The opening and closing square brackets,
    - The opening and closing curly braces,
    - The tilde,
    - The vertical bar,
    - The backquote,
    - The backslash.

    (The less-than and greater-than symbols should not be used as leaders; they should be left free for easy integration with HTML and XML, so these characters would have to be encoded if present in the actual plain-text content. The underscore may also be useful for avoiding breaking symbolic identifiers, and so should be preserved as well.)

    (There may be other restrictions if one wants the scheme to be compatible with URIs embedded in 7-bit-only plain-text email, so that they are not altered by this encoding, but that constraint would be extremely hard to satisfy.)

    This gives 15 distinct leaders, more than enough to represent sequences of variable size. Two of them would be used to represent the ISO 8859-1 characters missing from ASCII, on two bytes each (so the encoding would not take more bytes than UTF-8). The assignment of the other 13 leading bytes remains to be studied.

    The trailing bytes could take any value between 0x21 and 0x7E, forming a simple base-94 numeral system (where each leading byte would indicate a starting codepoint offset, statically assigned in a very small lookup table). A rough sketch of the idea is given below.
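    Here is a hypothetical sketch of such an encoder. The leader table (which leaders map to which offsets, and how many trailers each takes) is invented purely for illustration, not a proposed assignment:

        # Each leader byte selects a static codepoint offset and a fixed
        # number of trailing bytes; the trailers spell the distance from
        # that offset in base 94 over [0x21..0x7E]. This table is a
        # placeholder, not a real assignment.
        LEADERS = {
            ord('+'): (0x0080, 1),  # first 94 non-ASCII ISO 8859-1 chars
            ord('/'): (0x00DE, 1),  # the rest of the ISO 8859-1 range
            ord('~'): (0x4E00, 2),  # an example CJK window, two trailers
        }

        def encode_cp(cp: int) -> bytes:
            """Encode one codepoint with the first leader window covering it."""
            for lead, (offset, ntrail) in LEADERS.items():
                delta = cp - offset
                if 0 <= delta < 94 ** ntrail:
                    trailers = []
                    for _ in range(ntrail):
                        delta, digit = divmod(delta, 94)
                        trailers.append(0x21 + digit)
                    return bytes([lead]) + bytes(reversed(trailers))
            raise ValueError("no leader window covers U+%04X" % cp)

        print(encode_cp(0x00E9))  # e-acute: 2 bytes, same as UTF-8
        print(encode_cp(0x4F60))  # a CJK ideograph: 3 bytes, same as UTF-8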

    Such an encoding scheme could outperform base64 even for Asian texts, would be roughly equal in size for most Latin texts (with a very small overhead), and would even be smaller than UTF-8 for Arabic, Hebrew, and Indic scripts. (These are just estimates; an actual study would be needed to tune such an encoding scheme.)


