RE: Rot13 and letters with accents

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 06 2007 - 17:53:58 CST


    Doug Ewell wrote:
    > Sent: Thursday, December 6, 2007 21:35
    > To: Unicode Mailing List
    > Cc: William J Poser; aprilop2007@trashmail.net
    > Subject: Re: Rot13 and letters with accents
    >
    > William J Poser <wjposer at ldc dot upenn dot edu> wrote:
    >
    > > But then even better would be to Unicode-ify rot13 so that it affects
    > > non-ASCII characters. For example, restricting ourselves to the BMP,
    > > we could have rot7FFF, which would produce meaningless strings of CJK
    > > characters from (extended) Latin text.
    >
    > (This is not quite the same thing, but you might find it interesting
    > nonetheless:
    > http://www.mindspring.com/~markus.scherer/unicode/base16k.html )

    One of the design goals for Base16k is:
    * The characters should be inert under most Unicode text transformations,
    especially normalization, but ideally also case mapping etc., so that such
    an encoding of binary data does not get corrupted by common processing.

    It also chooses a subset of the BMP that is contiguous and meant to be
    immune to almost all Unicode transforms (including normalizations). But
    I'm not sure that the chosen subset (in the Han ideographic block) is
    effectively immune to those transforms.

    I would probably have chosen a block that is really immune to all Unicode
    transforms and mappings, and used the large PUA block of the BMP to
    implement such a binary encoding, but another of the stated goals says:

    * Unassigned and private-use code points should be avoided because they are
    often restricted and could be affected by future or custom processing.

    Although I agree with this statement as far as unassigned code points are
    concerned, I don't understand the justification for excluding PUAs, which
    are standard in Unicode. The restrictions affecting some applications
    could just as well apply to the Han ideographs, which are also not immune
    to "custom processing", such as decomposition into component radicals and
    strokes.

    So, if we want a binary transform that is really immune, we should just
    use the PUAs of the BMP. Yes, there are fewer than 16K code points there,
    but using a block of only 4K code points would not severely degrade
    performance in terms of extra encoded length.

    So suppose you have 4K code points allocated in the PUA for this purpose.
    Each character then encodes 12 bits of binary data, i.e. one and a half
    bytes, and only the status of the final, possibly partial group of
    characters (which may encode one, two or three trailing bytes) needs to
    be signaled (you could also use the method of Base64, where an extra
    padding "=" sign is appended to complete a sequence).

    In addition, such processing would be much simpler than Base16k's:

     Number of      Number of
     binary bytes   Base4k characters
     ------------   ---------------------------------------
     3N             2N
     3N+1           2N+1  (+ optionally 1 padding character)
     3N+2           2N+2  (+ optionally 1 padding character)
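
    Here is a minimal sketch of such an encoder in Python, assuming (purely
    for illustration) that the 4K block starts at U+E000 in the BMP PUA; the
    constants BASE and PAD and the function name are hypothetical, not part
    of any standard:

      BASE = 0xE000   # hypothetical start of a 4096-code-point PUA block
      PAD = "="       # optional Base64-style padding character

      def base4k_encode(data: bytes) -> str:
          out = []
          # full groups: 3 bytes (24 bits) -> two 12-bit characters
          for i in range(0, len(data) - len(data) % 3, 3):
              group = int.from_bytes(data[i:i + 3], "big")
              out.append(chr(BASE + (group >> 12)))
              out.append(chr(BASE + (group & 0xFFF)))
          rem = len(data) % 3
          if rem == 1:    # 8 leftover bits -> one extra character
              out.append(chr(BASE + data[-1]))
              out.append(PAD)
          elif rem == 2:  # 16 leftover bits -> two extra characters
              tail = int.from_bytes(data[-2:], "big")
              out.append(chr(BASE + (tail >> 4)))
              out.append(chr(BASE + (tail & 0xF)))
              out.append(PAD)
          return "".join(out)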

    If UTF-16 is used, the transport stream has this length:

     Number of      Number of Base4k characters               Byte length of the
     binary bytes   (one UTF-16 code unit each)               transport stream
     ------------   ----------------------------------------  ------------------
     3N             2N                                         4N
     3N+1           2N+1  (+ optionally 1 padding character)   4N+2
     3N+2           2N+2  (+ optionally 1 padding character)   4N+4
     (byte lengths do not count the optional padding character)
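
    For instance, reusing the illustrative base4k_encode sketched above, the
    UTF-16LE byte length (not counting the optional padding character)
    matches this table:

      data = bytes(range(30))                             # 3N bytes, N = 10
      text = base4k_encode(data).rstrip(PAD)
      assert len(text.encode("utf-16-le")) == 4 * 10      # 4N transport bytes

      data = bytes(range(31))                             # 3N+1 bytes
      text = base4k_encode(data).rstrip(PAD)
      assert len(text.encode("utf-16-le")) == 4 * 10 + 2  # 4N+2 transport bytes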

    Unlike Base64 but like Base16k, the number of characters per binary byte
    is not constant, so the decoder needs a way to tell how many bytes the
    final group encodes. In the first case (3N+1) the last character carries
    only 8 bits of data, and in the second case (3N+2) the last character
    does not use all of its 12 bits either, so 4 or 8 bits remain unused. You
    could solve the problem by appending a single padding character whenever
    the final group is partial, so that the decoder can tell whether the last
    encoded characters carry one or two trailing bytes. You could also set
    one bit among the unused bits of the final partial character (there are 4
    or 8 unused bits there) to say whether that group encodes one or two
    binary bytes, so that no extra padding character is needed for the last
    group. (For decoding, an odd character count already tells you that the
    last character carries only the final 8 bits of source data; the flag bit
    or the padding character is what distinguishes a full final group from
    one carrying only two trailing bytes.)

    So encoding the total length as a decimal number prefix (as Base16k does)
    is not absolutely necessary. It could be done optionally and verified by
    the decoder, which would accept such leading digits when they are
    present, since they are clearly separated from the 4K characters used for
    the binary encoding.
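
    A matching decoder sketch (same hypothetical U+E000 block, with BASE and
    PAD as above), using the optional padding character to recover the size
    of the final group:

      def base4k_decode(text: str) -> bytes:
          padded = text.endswith(PAD)
          vals = [ord(c) - BASE for c in text.rstrip(PAD)]
          # a trailing partial group uses 1 character (1 byte) or 2 (2 bytes)
          tail = 0 if not padded else (1 if len(vals) % 2 else 2)
          out = bytearray()
          for i in range(0, len(vals) - tail, 2):
              group = (vals[i] << 12) | vals[i + 1]
              out += group.to_bytes(3, "big")
          if tail == 1:     # final character holds one byte
              out.append(vals[-1])
          elif tail == 2:   # final two characters hold two bytes (12 + 4 bits)
              out += ((vals[-2] << 4) | vals[-1]).to_bytes(2, "big")
          return bytes(out)

      for n in range(10):   # round-trip all remainder cases
          sample = bytes(range(n))
          assert base4k_decode(base4k_encode(sample)) == sample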

    If UTF-8 is used, the transport stream has this length:

     Number of      Number of Base4k characters               Byte length of the
     binary bytes   (three UTF-8 bytes each)                  transport stream
     ------------   ----------------------------------------  ------------------
     3N             2N                                         6N
     3N+1           2N+1  (+ optionally 1 padding character)   6N+3
     3N+2           2N+2  (+ optionally 1 padding character)   6N+6

    (the structure of this table is similar to Base64's, but each character
    costs 3 bytes instead of 1, so Base64 remains more compact for transport)

    Efficiency comparison (assuming that the 4k characters are allocated in a
    block where each one requires 3 bytes in UTF-8, so it is within the BMP):

                UTF-8    UTF-16   SCSU
     Base64     75.0%    37.5%    75.0%
     Base4k     50.0%    75.0%    75.0%
     Base16k    58.3%    87.5%    87.5%
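
    These percentages can be rederived from the payload bits per character
    and the byte cost per character in each transport form (a small
    illustrative calculation; the SCSU costs assume 1 byte per ASCII
    character and 2 bytes per BMP ideograph or PUA character):

      schemes = {          # (payload bits per character, bytes per character)
          "Base64":  (6,  {"UTF-8": 1, "UTF-16": 2, "SCSU": 1}),
          "Base4k":  (12, {"UTF-8": 3, "UTF-16": 2, "SCSU": 2}),
          "Base16k": (14, {"UTF-8": 3, "UTF-16": 2, "SCSU": 2}),
      }
      for name, (bits, costs) in schemes.items():
          cells = ["%s %.1f%%" % (enc, 100 * bits / 8 / size)
                   for enc, size in costs.items()]
          print(name, " ".join(cells))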

    The main interest would not be for the transport stream, because 8-bit
    bytes and byte-order independence are the important features there and
    Base64 over UTF-8 remains more compact. But for local data storage and
    management, Base4k is significantly better than Base64, assuming that
    UTF-16 code units are preserved and their byte order is predictable. What
    this suggests is that Base64 and Base4k could work in concert: Base64
    used only for transport, and local management using Base4k over UTF-16
    instead, as it is twice as efficient yet still very simple to decode
    (note that for computing addresses, a division by 3 is faster to
    implement than a division by 7, even when using the
    multiplication-and-shift trick).
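
    For instance, locating the character that holds a given byte of the
    original data only needs a division by 3 (a small illustrative helper;
    Base16k would need the analogous division by 7 for its 7-byte/4-character
    groups):

      def char_index_of_byte(byte_offset: int) -> int:
          # each 3-byte group occupies 2 characters (one UTF-16 code unit
          # each); byte 0 sits in the first character, byte 2 in the second,
          # and byte 1 straddles both (its high 4 bits are in the first)
          group, pos = divmod(byte_offset, 3)
          return 2 * group + (1 if pos == 2 else 0)

      assert char_index_of_byte(7) == 4   # byte 7: group 2, position 1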


