RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

From: CE Whitehead <>
Date: Sat, 28 Jul 2012 13:35:57 -0400

> From:
> Date: Fri, 27 Jul 2012 03:17:07 +0200
> Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
> To:
> CC:;
> I just wonder whether the XSS attack is really an issue here. XSS
> attacks involve bypassing the document's source domain in order to
> use or insert data found in another document issued or managed by
> another domain, in a distinct security realm.
> A more serious issue is that the parsed document has unknown
> security properties and is subject to inspection (for example by an
> antivirus or antimalware tool trying to identify malicious code that
> would remain usable, but hidden by the cipher-like invalid encoding
> that a browser would just interpret blindly).
Yes, that's what I think the issue is here.
> One problem with the strategy of blindly deleting invalid sequences
> is of course the fact that such invalid sequences may be complex and
> could be arbitrarily long. But antivirus/antimalware solutions
> already know how to ignore these invalid sequences when trying to
> identify malicious code, so that they will detect more possibilities.
Thanks for the info. I did not know this.
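For what it's worth, a minimal sketch of the deletion risk (Python;
the payload is purely illustrative):

    # Illustrative only: an invalid byte hides a tag from naive filters.
    payload = b"<scr\x80ipt>"

    # Deleting invalid bytes splices the dangerous token back together:
    print(payload.decode("utf-8", errors="ignore"))   # <script>

    # Replacing them keeps the token broken, which is why replacement
    # is recommended over deletion:
    print(payload.decode("utf-8", errors="replace"))  # <scr(U+FFFD)ipt>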
> In that case, the safest strategy for an antivirus is effectively to
> discard the invalid sequences, trying to mimic what an unaware
> browser would do blindly, with the consequence of running the
> potentially dangerous code. The strategy used in a browser for
> rendering the document and the strategy used in a security solution
> for detecting malicious code are then completely opposed.
Yes, this is a good strategy for antivirus and malware-detection programs; however, I think Unicode is more focused on general character handling and display.
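Still, a scanner can cover both behaviors cheaply. A minimal sketch
(Python; the signature list and all names are mine):

    # Hypothetical scanner: match signatures against both the raw bytes
    # and a deletion-normalized view that mimics a lenient browser.
    SIGNATURES = [b"<script>"]

    def delete_invalid(data: bytes) -> bytes:
        # Decode leniently, dropping undecodable bytes, then re-encode.
        return data.decode("utf-8", errors="ignore").encode("utf-8")

    def looks_malicious(data: bytes) -> bool:
        views = (data, delete_invalid(data))
        return any(sig in view for view in views for sig in SIGNATURES)

    print(looks_malicious(b"<scr\x80ipt>"))   # True: deletion reveals the tag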
> Another concern is the choice of the replacement character. This
> document only suggests the U+FFFD character, which may also not
> survive some encoding converters used when forwarding the document
> to a lower-layer API that actually runs the code.
> If the code (as opposed to the normal text) is used, it will
> frequently be restricted to ASCII or to an SBCS encoding. In that
> case, a better substitute is the ASCII C0 control that is normally
> invalid in plain-text programming/scripting source code.
> Traditionally this C0 control character is SUB (U+001A). It may even
> be used to replace all invalid bytes of an invalid UTF-8 sequence
> without changing its length (this is not always possible with U+FFFD
> in UTF-8, because it is encoded as 3 bytes and there may be
> invalid/rejected sequences containing only 1 or 2 bytes that should
> survive with the same length after the replacement).
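A sketch of that length-preserving repair (Python; the function name
is mine), replacing each invalid byte with one SUB byte:

    # Sketch under the assumptions above: one SUB (0x1A) byte per
    # invalid byte keeps the total length unchanged, whereas U+FFFD
    # costs 3 bytes each in UTF-8.
    def repair_with_sub(data: bytes) -> bytes:
        out = bytearray()
        i = 0
        while i < len(data):
            for n in (1, 2, 3, 4):          # UTF-8 sequences are 1-4 bytes
                chunk = data[i:i + n]
                try:
                    chunk.decode("utf-8")   # first success = one character
                    out += chunk
                    i += n
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(0x1A)            # SUB for this invalid byte
                i += 1
        return bytes(out)

    bad = b"abc\xc3(\xe2\x82def"
    print(len(bad), len(repair_with_sub(bad)))                  # 10 10
    print(len(bad.decode("utf-8", "replace").encode("utf-8")))  # 13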
> One concern is that SUB and U+FFFD have different character
> properties, and not all Unicode algorithms treat them the way they
> should (for example in boundary/break analysis or in some
> transforms).
> Another concern is that even this C0 control may be used for
> controlling some terminal functions (such uses probably survive only
> in very old applications), so some code converters use the question
> mark (?) instead, which is even worse: it may break a query URL,
> unexpectedly passing the data encoded after it to a different
> HTTP(S) resource than the expected one, and it will bypass some
> cache-control mechanisms.
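The query-URL hazard is easy to see with a parser. A made-up example
(Python):

    from urllib.parse import urlsplit

    # Suppose '?' was substituted for one invalid byte inside a path.
    repaired = "https://example.com/fil?name.txt"
    parts = urlsplit(repaired)
    print(parts.path)    # /fil
    print(parts.query)   # name.txt -- the tail is now sent as a query
                         # string, and caches treat such URLs differently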
> The document does not really discuss how to choose the replacement
> character. My opinion is that for UTF-8 encoded documents, the ASCII
> C0 control (SUB) is still better than the U+FFFD character, which
> works well only in the UTF-16 and UTF-32 encodings. SUB also works
> well with many legacy SBCS or MBCS encodings (including ISO 8859-*,
> Windows codepages and many PC/OEM codepages, and JIS or EUC
> variants; it is also mapped in many EBCDIC codepages, distinctly
> from simple filler/padding characters that are blindly stripped in
> many applications as if they were just whitespace at the end of a
> fixed-width data field).
> How many replacements must be made? My opinion is that replacements
> should be done so that no change occurs to the data length. For the
> remaining cases, data security can detect changes with strong data
> signatures: SHA-1 for documents that are not too long (like HTML
> pages, or full email contents, with some common headers needed for
> their indexing, routing, or delivery to the right person), or
> SHA-256 for very short documents (like single datagrams or the
> values of short database fields such as phone numbers, last names,
> or email addresses) and for very long documents. Security
> certificates over a secure channel will also detect otherwise
> undetected data corruption in the end-to-end communication channel,
> either one-to-one or one-to-many for broadcasts and selective
> multicasts. But the case of secure channels should not be a problem
> here, as they also have to detect and secure many other cases than
> just invalid plain-text encodings, notably man-in-the-middle attacks
> and replay attacks, or to reliably detect a DoS attack via a broken
> channel with unrecoverable data losses, something that can be
> enforced by reasonable timeout watchdogs if performance of the
> channel must be ensured.
Hmm, thanks for these suggestions (and sorry I have not looked into the replacement characters more at this point).
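For the detection idea above, here is a minimal sketch (Python; the
wiring is mine, the hash choice follows the suggestion):

    import hashlib

    def sha256_hex(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    original = b"abc\xc3(def"
    # Any repair strategy goes here; a lenient round-trip is shown.
    repaired = original.decode("utf-8", errors="replace").encode("utf-8")

    if sha256_hex(repaired) != sha256_hex(original):
        print("repair changed the byte stream; flag for re-validation")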


--C. E. Whitehead
> 2012/7/27 Mark Davis ☕ <>:
> > Thanks, good suggestion.
> >
> > Mark
> >
> > — Il meglio è l’inimico del bene (The best is the enemy of the good) —
> >
> >
> >
> > On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead <>
> > wrote:
> >>
> >> "Validation;" par 3, comment in parentheses
> >> ". . . (you never want to just delete it; that has security problems)."
> >> {COMMENT: would it be helpful to have a reference here to the
> >> Unicode security document that discusses this issue -- UTR #36,
> >> Section 3.5?}
> >
> >