Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Mon, 15 Oct 2018 14:11:58 +0200

Note that all of this discussion about padding applies to every other base-N
encoding, including base-10.

For example, to represent numbers of arbitrary precision, padding does not
require a separate symbol: it can use the "0" digit, which is already part of
the 10-symbol alphabet, or encoders can drop those zeros on the left, or on
the right if there is a decimal point. When the precision is less than an
integral number of decimal digits, the extra bits or fractional bits of
information in the last digit of the encoded sequence do not matter: encoders
may choose not to set them to 0 and may prefer rounding, which may
conditionally set these bits to 1, depending on the value of the last
significant bits or fractional bits of maximum precision.

Likewise, the same encoders and decoders may want to allow extra whitespace
(notably to limit line lengths, for example when embedding the encoded
sequences in printed documents or in documents with a page layout rendered
at a readable font size for the page width, or for presentation purposes, by
grouping symbols).

In summary, padding is not required by all Base-N encoders/decoders, and
non-significant whitespace is frequently needed.
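
As a rough illustration of the two points above (no padding symbol needed,
whitespace ignorable), here is a minimal Python sketch of a tolerant decoder.
The name lenient_b64decode is hypothetical and not part of any library; the
sketch relies only on Python's standard base64 module and assumes the
standard 64-symbol alphabet and a payload made of whole octets.

```python
import base64


def lenient_b64decode(text: str) -> bytes:
    """Decode Base64 text that may lack '=' padding and contain whitespace.

    Hypothetical helper for illustration only.
    """
    # Remove every whitespace character (spaces, tabs, line breaks).
    compact = "".join(text.split())
    # Drop whatever trailing padding the encoder chose to emit, if any.
    compact = compact.rstrip("=")
    # Restore the padding that Python's decoder expects (0, 1 or 2 signs).
    compact += "=" * (-len(compact) % 4)
    return base64.b64decode(compact)


if __name__ == "__main__":
    for sample in ("bWVvdw==", "bWVvdw", "bWVv dw\n"):
        print(repr(sample), "->", lenient_b64decode(sample))
```

With this sketch, "bWVvdw==", "bWVvdw" and "bWVv dw" all decode to the same
four octets.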

On Mon, 15 Oct 2018 at 13:57, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> If you want examples where padding with "=" is not used at all:
> - look into URL-shortening schemes;
> - look into database fields, data input forms and numerous data formats
> where the "=" sign is restricted (just as in URLs, file paths or
> identifiers).
> Padding is not used anywhere in the middle of the binary encoding, or even
> at the end: only the 64 symbols of the encoding alphabet are needed, and the
> extra 2 or 4 lowest bits that may be encoded in the last character of the
> encoded sequence are discarded by the decoder (encoders do not necessarily
> set these extra bits to 0 in the last symbol, even though that is the
> canonical form recommended for encoders; decoders simply ignore their
> value).
> Some Base64 encoders do not necessarily encode binary octets-streams, but
> bits-streams whose length in bits is not necessarily a multiple of 8, in
> which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
> symbol of the encoded sequence.
> Other encoders use streams of binary code units that are larger than 8
> bits, and may want to encode more padding symbols to force the alignment of
> data required by their associated decoders, or will choose not to use any
> padding at all, letting the decoders themselves discard the trailing bits
> at the end of the encoded stream.
>
> On Mon, 15 Oct 2018 at 13:24, Philippe Verdy <verdy_p_at_wanadoo.fr>
> wrote:
>
>> Also, the rationale for supporting "unnecessary" whitespace is found in
>> MIME's version of Base64, and in the RFCs describing encoding formats for
>> digital certificates or for exchanging public keys for encryption
>> systems like PGP (notably, but not only, as text in the body of emails
>> or in documentation and websites).
>>
>> On Mon, 15 Oct 2018 at 03:56, Tex <textexin_at_xencraft.com> wrote:
>>
>>> Philippe,
>>>
>>>
>>>
>>> Where is the use of whitespace or the idea that 1-byte pieces do not
>>> need all the equal sign paddings documented?
>>>
>>> I read the RFC 3501 you pointed at; I don’t see it there.
>>>
>>>
>>>
>>> Are these part of any standards? Or are you claiming these are practices
>>> despite the standards? If so, are these just tolerated by parsers, or are
>>> they actually generated by encoders?
>>>
>>>
>>>
>>> What would be the rationale for supporting unnecessary whitespace? If
>>> linebreaks are forced at some line length they can presumably be removed at
>>> that length and not treated as part of the encoding.
>>>
>>> Maybe we differ on where we define the encoding to begin and end, and where
>>> higher-level protocols prescribe how it is embedded within the protocol.
>>>
>>>
>>>
>>> Tex
>>>
>>> *From:* Unicode [mailto:unicode-bounces_at_unicode.org] *On Behalf Of *Philippe
>>> Verdy via Unicode
>>> *Sent:* Sunday, October 14, 2018 1:41 AM
>>> *To:* Adam Borowski
>>> *Cc:* unicode Unicode Discussion
>>> *Subject:* Re: Base64 encoding applied to different unicode texts
>>> always yields different base64 texts ... true or false?
>>>
>>>
>>>
>>> Note that 1-byte pieces do not need to be padded with 2 "=" signs; 1 is
>>> enough to indicate the end of an octets-span. Any extra "=" after it does
>>> not add another octet. You are also allowed to insert whitespace
>>> anywhere in the encoded stream (this is what ensures that the
>>> Base64-encoded octets-stream will not be altered if line breaks are forced
>>> anywhere, notably within the body of emails).
>>>
>>>
>>>
>>> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB,
>>> CR, LF, NEL) in the middle is non-significant and ignorable on decoding
>>> (their "encoded" bit length is 0 and they don't terminate an octets-span,
>>> unlike "=" which discards extra bits remaining from the encoded stream
>>> before that are not on 8-bit boundaries).
>>>
>>>
>>>
>>> Also:
>>>
>>> - For 1-octet pieces the minimum format is "XX= ", but the 2nd "X"
>>> symbol before "=" can vary in its 4 lowest bits (which are then
>>> ignored/discarded by the "=" symbol)
>>>
>>> - For 2-octet pieces the minimum format is "XXX= ", but the 3rd "X"
>>> symbol before "=" can vary in its 2 lowest bits (which are then
>>> ignored/discarded by the "=" symbol)
>>>
>>>
>>>
>>> So you can use Base64 by encoding each octet in a separate piece, as two
>>> Base64 symbols followed by an "=" symbol, and even insert any amount of
>>> whitespace between the pieces: there is an infinite number of valid Base64
>>> encodings for representing the same octets-stream payload.
>>>
>>>
>>>
>>> Base64 allows encoding any octets-stream, but not directly any
>>> bits-stream: it assumes that the effective bits-stream has a bit length
>>> that is a multiple of 8. To encode a bits-stream with an exact number of
>>> bits (not a multiple of 8), you need to encode an extra payload to indicate
>>> the effective number of bits to keep at end of the encoded octets-stream
>>> (or at start):
>>>
>>> - Base64 does not specify how you convert a bitstream of arbitrary
>>> length to an octets-stream;
>>>
>>> - for that purpose, you may need to pad the bits-stream at start or at
>>> end with 1 to 7 bits (so that the resulting bitstream has a length that
>>> is a multiple of 8, and is then encodable with Base64, which takes only
>>> octets as input);
>>>
>>> - these extra padding bits are not significant for the original
>>> bitstream, but are significant for the Base64 encoder/decoder, they will be
>>> discarded by the bitstream decoder built on top of the Base64 decoder, but
>>> not by the Base64 decoder itself.
>>>
>>>
>>>
>>> You need to encode somewhere with the bitstream encoder how many padding
>>> bits (0 to 7) are present at start or end of the octets-stream; this can be
>>> done:
>>>
>>> - as a separate payload (not encoded by Base64), or
>>>
>>> - by prepending 3 bits at start of the bits-stream, which is then padded
>>> at end with 1 to 7 random bits to get a bit length that is a multiple of
>>> 8, suitable for Base64 encoding, or
>>>
>>> - by appending 3 bits at end of the bits-stream, just after the 1 to 7
>>> random bits needed to get a bit length that is a multiple of 8, suitable
>>> for Base64 encoding.
>>>
>>> Finally, your bits-stream decoder will be able to use this padding count
>>> to discard these random padding bits (and possibly realign the stream on
>>> different byte boundaries when the effective bit length of the bits-stream
>>> payload is not a multiple of 8 and padding bits were added).
>>>
>>>
>>>
>>> Base64 also does not specify how bits of the original bits-stream
>>> payload are packed into the octets-stream input suitable for
>>> Base64 encoding; notably, it does not specify their order and endianness.
>>> The same remark applies to MIME and HTTP. So a lot of network protocols
>>> and file formats need to specify how to encode which of the possible
>>> options is used to encode bits-streams of arbitrary length, or which
>>> default choice applies if this option is not encoded, or which option must
>>> be used (with no possible variation). This also adds to the number of
>>> distinct encodings that are possible but still equivalent for the same
>>> effective bits-stream payload.
>>>
>>>
>>>
>>> All these allowed variations are from the encoder's perspective. For
>>> interoperability, the decoder has to be flexible and support various
>>> options to be compatible with different implementations of the encoder,
>>> notably when the encoder was run on a different system. This is the
>>> case for MIME transport by mail, for HTTP and FTP transports, and for
>>> file/media storage formats, even if the file is stored on the same system
>>> (because it may actually be a copy stored locally but coming from another
>>> system where the file was actually encoded).
>>>
>>>
>>>
>>> Now, if we come back to the encoding of plain-text payloads, Unicode just
>>> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
>>> points (it does not actually mandate an exact bit length, because the range
>>> does not exactly fill 21 bits and an encoder can still pack
>>> multiple code points together into more compact code units).
>>>
>>>
>>>
>>> However Unicode provides and standardizes several encodings
>>> (UTF-8/16/32) which use code units whose size is directly suitable as input
>>> for an octets-stream, so that they are directly encodable with Base64,
>>> without having to specify an extra layer for the bits-stream
>>> encoder/decoder.
>>>
>>>
>>>
>>> But many other encodings are still possible (and can be conforming to
>>> Unicode, provided they preserve each Unicode scalar value, or at least the
>>> code point identity because an encoder/decoder is not required to support
>>> non-character code points such as surrogates or U+FFFE), where Base64 may
>>> be used for internally generated octets-streams.
>>>
>>>
>>>
>>>
>>>
>>> On Sun, 14 Oct 2018 at 03:47, Adam Borowski via Unicode <
>>> unicode_at_unicode.org> wrote:
>>>
>>> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode
>>> wrote:
>>> > On Sat, 13 Oct 2018 at 18:58, Steffen Nurpmeso via Unicode <
>>> > unicode_at_unicode.org> wrote:
>>> > > The only variance is described as:
>>> > >
>>> > > Care must be taken to use the proper octets for line breaks if base64
>>> > > encoding is applied directly to text material that has not been
>>> > > converted to canonical form. In particular, text line breaks must be
>>> > > converted into CRLF sequences prior to base64 encoding. The
>>> > > important thing to note is that this may be done directly by the
>>> > > encoder rather than in a prior canonicalization step in some
>>> > > implementations.
>>> > >
>>> > > This is MIME, it specifies (in the same RFC):
>>> >
>>> > I did not speak about the encoding of newlines **in the actual encoded
>>> > text**:
>>> > - if their existing text-encoding ever gets converted to Base64 as if the
>>> > whole text was an opaque binary object, their initial text-encoding will
>>> > be preserved (so yes, it will preserve the way these embedded newlines
>>> > are encoded as CR, LF, CR+LF, NL...)
>>> >
>>> > I spoke about newlines used in the transport syntax to split the initial
>>> > binary object (which may actually contain text, but that does not matter).
>>> > MIME defines this operation and even requires splitting the binary object
>>> > into fragments with a maximum binary size, so that these binary fragments
>>> > can be converted with Base64 into lines with a maximum length. In the MIME
>>> > Base64 representation you can insert newlines anywhere between fragments
>>> > encoded separately.
>>>
>>> There's another kind of fragmentation that can make the encoding differ
>>> (but still decode to the same payload):
>>>
>>> The data stream gets split into 3-byte internal, 4-byte external packets.
>>> Any packet may contain fewer than those 3 bytes, in which case it is
>>> padded with = characters:
>>> 3 bytes XXXX
>>> 2 bytes XXX=
>>> 1 byte XX==
>>>
>>> Usually, such smaller packets happen only at the end of a message, but to
>>> support encoding a stream piecewise, they are allowed at any point.
>>>
>>> For example:
>>> "meow" is bWVvdw==
>>> "me""ow" is bWU=b3c=
>>> yet both carry the same payload.
>>>
>>> > Base64 is used exactly to support this flexibility in transport (or
>>> > storage) without altering any bit of the initial content once it is
>>> > decoded.
>>>
>>> Right, any such variations are in packaging only.
>>>
>>>
>>> ᛗᛖᛟᚹ
>>> --
>>> ⢀⣴⠾⠻⢶⣦⠀
>>> ⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
>>> ⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
>>> ⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.
>>>
>>>
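
The "meow" example quoted above can be checked with a small Python sketch.
decode_pieces is a hypothetical helper, not a library function: it treats
every run of alphabet symbols, up to and including its "=" padding, as an
independently encoded piece. That is an assumption matching the piecewise
reading described in this thread, not the behaviour of any particular
decoder. It also accepts the single-"=" form ("XX=") mentioned earlier in
the thread and ignores whitespace between pieces.

```python
import base64
import re

# One piece = a run of Base64 alphabet symbols plus any '=' signs after it.
PIECE = re.compile(r"[A-Za-z0-9+/]+=*")


def decode_pieces(text: str) -> bytes:
    """Decode a concatenation of separately encoded (and padded) pieces.

    Hypothetical helper for illustration only.
    """
    compact = "".join(text.split())  # whitespace carries no payload bits
    out = bytearray()
    for piece in PIECE.findall(compact):
        body = piece.rstrip("=")
        # Normalise to the padding Python's decoder expects for this piece.
        body += "=" * (-len(body) % 4)
        out += base64.b64decode(body)
    return bytes(out)


if __name__ == "__main__":
    for sample in ("bWVvdw==", "bWU=b3c=", "bQ= ZQ= bw= dw="):
        print(repr(sample), "->", decode_pieces(sample))
```

All three sample strings carry the same payload, the four octets of "meow";
only the packaging differs.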
Received on Mon Oct 15 2018 - 07:12:23 CDT
