Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Mon, 10 Dec 2018 12:14:55 +0100

I tend to agree with your analysis that emitting U+FFFD when there is no
content between escapes in "shifting" encodings like ISO-2022-JP is
unnecessary, and for consistency between implementations should not be
recommended.

Can you file this at http://www.unicode.org/reporting.html so that the
committee can look at your proposal with an eye to changing
http://www.unicode.org/reports/tr36/?

Mark

On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <
unicode_at_unicode.org> wrote:

> We're about to remove the U+FFFD generation for the case where there
> is no content between two ISO-2022-JP escape sequences from the WHATWG
> Encoding Standard.
>
> Is there anything wrong with my analysis that U+FFFD generation in
> that case is not a useful security measure when unnecessary
> transitions between the ASCII and Roman states do not generate U+FFFD?
>
> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen_at_hsivonen.fi>
> wrote:
> >
> > Context: https://github.com/whatwg/encoding/issues/115
> >
> > Unicode Security Considerations say:
> > "3.6.2 Some Output For All Input
> >
> > Character encoding conversion must also not simply skip an illegal
> > input byte sequence. Instead, it must stop with an error or substitute
> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> > or an escape sequence in the output. (See also Section 3.5 Deletion of
> > Code Points.) It is important to do this not only for byte sequences
> > that encode characters, but also for unrecognized or "empty"
> > state-change sequences. For example:
> > [...]
> > ISO-2022 shift sequences without text characters before the next shift
> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> > require at least one character in a text segment between shift
> > sequences. Security software written to the formal specification may
> > not detect malicious text (for example, "delete" with a
> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
> >
> > The WHATWG Encoding Standard bakes in this requirement by means of the
> > "ISO-2022-JP output flag"
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) in its
> > ISO-2022-JP decoder algorithm
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
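
A rough illustration of the behavior under discussion, not the Encoding Standard's algorithm verbatim: the toy decoder below only understands the ASCII and Roman states, recognizes the JIS X 0208 escapes without decoding any double-byte content, and tracks whether an escape sequence has been seen with no output since. The names `toy_decode`, `flag_empty_segments`, and `escape_pending` are made up for this sketch.

```python
REPLACEMENT = "\ufffd"

# Escape sequences this sketch recognizes, mapped to a state name.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "roman",
    b"\x1b$B": "jis0208",
    b"\x1b$@": "jis0208",
}


def toy_decode(data: bytes, flag_empty_segments: bool = True) -> str:
    """Simplified model: emit U+FFFD when an escape directly follows an escape."""
    out = []
    state = "ascii"
    escape_pending = False  # True after an escape until some content is output
    i = 0
    while i < len(data):
        new_state = ESCAPES.get(data[i:i + 3])
        if new_state is not None:
            if escape_pending and flag_empty_segments:
                out.append(REPLACEMENT)  # empty segment between two escapes
            state = new_state
            escape_pending = True
            i += 3
            continue
        if state == "jis0208":
            # Assumption of this sketch: no JIS X 0208 table is included.
            raise NotImplementedError("double-byte content is outside this sketch")
        byte = data[i]
        i += 1
        escape_pending = False
        if byte == 0x1B or byte >= 0x80:
            out.append(REPLACEMENT)  # unrecognized escape or non-7-bit byte
        elif state == "roman" and byte == 0x5C:
            out.append("\u00a5")  # YEN SIGN in JIS X 0201 Roman
        elif state == "roman" and byte == 0x7E:
            out.append("\u203e")  # OVERLINE in JIS X 0201 Roman
        else:
            out.append(chr(byte))
    return "".join(out)


# "delete" with a shift-to-double-byte and an immediate shift-to-ASCII in the middle.
sample = b"del\x1b$B\x1b(Bete"
print(repr(toy_decode(sample)))
print(repr(toy_decode(sample, flag_empty_segments=False)))
```

With the check enabled, the empty segment shows up as U+FFFD in the middle of the word; with it disabled, the escapes vanish silently and the output is plain "delete".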
> >
> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> > WHATWG spec.
> >
> > After Gecko switched to encoding_rs from an implementation that didn't
> > implement this U+FFFD generation behavior (uconv), a bug has been
> > logged in the context of decoding Japanese email in Thunderbird:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
> >
> > Ken Lunde also recalls seeing such email:
> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
> >
> > The root problem seems to be that the requirement gives ISO-2022-JP
> > the unusual and surprising property that concatenating two ISO-2022-JP
> > outputs from a conforming encoder can result in a byte sequence that
> > is non-conforming as input to an ISO-2022-JP decoder.
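
The concatenation problem can be reproduced with an off-the-shelf encoder. The sketch below uses Python's built-in `iso2022_jp` codec purely as a convenient stand-in for "a conforming encoder" (it is not one of the implementations discussed in this thread); the `ADJACENT_ESCAPES` pattern is an ad hoc helper that only looks for the ASCII, Roman, and JIS X 0208 escapes.

```python
import re

# Two ISO-2022-JP escape sequences back to back, with nothing in between.
ADJACENT_ESCAPES = re.compile(rb"(?:\x1b(?:\([BJ]|\$[@B])){2}")

# Each string is encoded on its own, so each output starts and ends in ASCII.
first = "\u3042".encode("iso2022_jp")   # HIRAGANA LETTER A
second = "\u3044".encode("iso2022_jp")  # HIRAGANA LETTER I
joined = first + second

print(first)
print(joined)
print("adjacent escapes in first: ", bool(ADJACENT_ESCAPES.search(first)))
print("adjacent escapes in joined:", bool(ADJACENT_ESCAPES.search(joined)))
```

Each individual output is free of empty segments, but the return-to-ASCII escape that ends the first output lands directly in front of the escape that opens the second, which is exactly the pattern the U+FFFD requirement treats as an error.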
> >
> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> > sequence is immediately followed by another ISO-2022-JP escape
> > sequence. Chrome and Safari do, but their implementations of
> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> > decoder implementations generally are informed by the Encoding
> > Standard (though the ISO-2022-JP decoder specifically might not be
> > yet), and I suspect that Safari's implementation (ICU) is either
> > informed by Unicode Security Considerations or vice versa.
> >
> > The example given as rationale in Unicode Security Considerations,
> > obfuscating the ASCII string "delete", could be accomplished by
> > alternating between the ASCII and Roman states so that every other
> > character is in the ASCII state and the rest in the Roman state.
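
To make the alternation concrete, here is a small sketch that builds such a byte sequence and checks two things: the literal bytes `delete` never appear in it, and no two escape sequences are adjacent, so an empty-segment check has nothing to flag. Stripping the escapes stands in for decoding here, on the grounds that ASCII and JIS X 0201 Roman differ only at 0x5C and 0x7E, so these letters decode identically in both states. The helper name `alternate_states` is invented for this example.

```python
import re

ASCII_ESC = b"\x1b(B"  # switch to ASCII
ROMAN_ESC = b"\x1b(J"  # switch to JIS X 0201 Roman
ANY_ESC = re.compile(rb"\x1b\([BJ]")
ADJACENT_ESC = re.compile(rb"\x1b\([BJ]\x1b\([BJ]")


def alternate_states(text: bytes) -> bytes:
    """Put every other byte in the ASCII state and the rest in the Roman state."""
    out = bytearray()
    for index, byte in enumerate(text):
        out += ASCII_ESC if index % 2 == 0 else ROMAN_ESC
        out.append(byte)
    out += ASCII_ESC  # finish in the ASCII state
    return bytes(out)


obfuscated = alternate_states(b"delete")
has_empty_segment = ADJACENT_ESC.search(obfuscated) is not None
visible_text = ANY_ESC.sub(b"", obfuscated)

print(obfuscated)
print("literal 'delete' present:", b"delete" in obfuscated)
print("empty segment present:   ", has_empty_segment)
print("text after removing escapes:", visible_text)
```

Every escape here is followed by one character of content, so a byte-level scanner misses the word even though a decoder that generates U+FFFD for empty segments would accept the input without complaint.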
> >
> > Is the requirement to generate U+FFFD when there is no content between
> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> > transitions or useless transitions between ASCII and Roman are not
> > also required to generate U+FFFD? Would it even be feasible (in terms
> > of interop with legacy encoders) to make useless transitions between
> > ASCII and Roman generate U+FFFD?
> >
> > --
> > Henri Sivonen
> > hsivonen_at_hsivonen.fi
> > https://hsivonen.fi/
>
>
>
> --
> Henri Sivonen
> hsivonen_at_hsivonen.fi
> https://hsivonen.fi/
>
>
Received on Mon Dec 10 2018 - 05:15:32 CST