Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

From: Philippe Verdy <>
Date: Fri, 27 Jul 2012 03:17:07 +0200

I just wonder where the XSS attack is really an issue here. XSS
attacks involve bypassing the document source domain in order to
attempt to use or insert data found in another document issued or
managed by another domain, in a distinct security realm.

What would be a more serious issue is the fact that the parsed
document has an unknown security status, and that it is subject to an
inspection (for example by an antivirus or antimalware tool trying to
identify sensitive code which would remain usable (but hidden by the
cipher-like invalid encoding that a browser would just interpret

One problem with the strategy of blindly deleting invalid sequences is
of course the fact that such invalid sequences may be complex and
could be arbitrarily long. But antivirus/antimalware solutions already
know how to ignore these invalid sequences when trying to identify
malicious code, so that they will detect more possibilities.

In that case, the safest strategy for an antivirus is effectively to
discard the invalid sequences, trying to mimic what an unaware browser
would do blindly, with the consequence of running the potentially
dangerous code. The strategy used in a browser for rendering the
document and the one used in a security solution when trying to detect
malicious code will then be completely opposed.
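
To illustrate the scanner side of this, here is a minimal Python
sketch. The function names and the "<script" signature are only
assumptions for the example, not taken from any real product. It shows
how an invalid byte can hide a pattern from a scanner that keeps or
replaces invalid sequences, while a consumer that deletes them blindly
would still see the pattern:

```python
def strip_invalid_utf8(data: bytes) -> str:
    """Decode as UTF-8, silently deleting invalid bytes (the blind strategy)."""
    return data.decode("utf-8", errors="ignore")


def looks_malicious(data: bytes, signature: str = "<script") -> bool:
    """Hypothetical scanner check: search both possible views of the data."""
    stripped = strip_invalid_utf8(data)                # what a deleting consumer sees
    replaced = data.decode("utf-8", errors="replace")  # what a replacing consumer sees
    return signature in stripped.lower() or signature in replaced.lower()


# An invalid byte (0xFF) splits the tag name in the raw data.
payload = b"<scr\xffipt>alert(1)</scr\xffipt>"

print(repr(payload.decode("utf-8", errors="replace")))  # tag stays broken
print(repr(strip_invalid_utf8(payload)))                # tag is reassembled
print(looks_malicious(payload))
```

The scanner has to assume the worst case (deletion), whereas the
renderer should never perform it.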

Another concern is the choice of the replacement character. This
document only suggests the U+FFFD character, which may also not pass
some encoding converters used when forwarding the document to a lower
layer API that actually runs the code.

If the code (as opposed to the normal text) is used, it will
frequently be restricted to ASCII or to an SBCS encoding. And in
that case, a better substitute will be an ASCII C0 control which is
normally invalid in plain-text programming/scripting source code.
Traditionally this C0 control character is SUB (U+001A). It may even
be used to replace all invalid bytes of an invalid UTF-8 sequence,
without changing its length (this is not always possible with U+FFFD
in UTF-8 because it will be encoded as 3 bytes and there may be
invalid/rejected sequences containing only 1 or 2 bytes that should
survive with the same length after the replacement).
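
As a rough sketch of the length-preserving idea (in Python; the error
handler name "sub_per_byte" and the helper names are made up for this
example), each rejected byte can be mapped to exactly one SUB, and the
result compared with the usual U+FFFD replacement:

```python
import codecs

SUB = "\x1a"  # ASCII C0 control SUBSTITUTE (U+001A)


def _sub_handler(err):
    """Decoding error handler: emit one SUB per rejected byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return SUB * (err.end - err.start), err.end


# "sub_per_byte" is a name chosen for this sketch, not a standard handler.
codecs.register_error("sub_per_byte", _sub_handler)


def replace_with_sub(data: bytes) -> bytes:
    """Replace every invalid UTF-8 byte with SUB; the byte length is unchanged."""
    return data.decode("utf-8", errors="sub_per_byte").encode("utf-8")


def replace_with_fffd(data: bytes) -> bytes:
    """Standard replacement with U+FFFD; the byte length may change."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# One stray byte, then a 3-byte sequence truncated after 2 bytes.
damaged = b"ab\xffcd\xe2\x82"

print(len(damaged), len(replace_with_sub(damaged)), len(replace_with_fffd(damaged)))
```

Valid sequences are re-encoded to the same bytes, so only the invalid
bytes are touched and offsets into the data remain usable.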

One concern is that SUB and U+FFFD have different character
properties, and not all Unicode algorithms treat SUB the way they
should (for example in boundary breakers or in some transforms).
Another concern is that even this C0 control may be used for
controlling some terminal functions (such uses are probably in very old
applications), so some code converters instead use the question
mark (?) which is even worse as it may break a query URL, unexpectedly
passing the data encoded after it to another HTTP(S) resource than the
expected one, and also because it will bypass some cache-control
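
The URL problem with the question mark can be sketched in Python as
follows. The URL is a made-up example; Python's own "replace" handler
for encoding happens to substitute "?" for unencodable characters:

```python
from urllib.parse import urlsplit

# Hypothetical URL whose path contains a non-ASCII character.
original = "http://example.org/files/caf\u00e9.txt"

# A converter that substitutes "?" for anything it cannot encode.
converted = original.encode("ascii", errors="replace").decode("ascii")

parts = urlsplit(converted)
print(converted)    # the "?" now sits in the middle of the path
print(parts.path)   # path is cut at the substituted character
print(parts.query)  # the rest of the file name became a query string
```

The request would then go to another resource than the one intended,
with the tail of the name passed to it as a query.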

The document does not really discuss how to choose the replacement
character. My opinion is that for UTF-8 encoded documents, the ASCII
C0 control (SUB) is still better than the U+FFFD character, which works
well only in UTF-16 and UTF-32 encodings. SUB also works well with many
legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages
and many PC/OEM codepages, JIS or EUC variants; it is also mapped in
many EBCDIC codepages, distinctly from simple filler/padding
characters that are blindly stripped in many applications as if they
were just whitespace at the end of a fixed-width data field).

How many replacements must be made? My opinion is that replacements
should be done so that no change occurs to the data length. For the
remaining cases, data security can detect this case with strong data
signatures like SHA1 for not too long documents (like HTML pages, or
full email contents, with some common headers needed for their
indexing or routing or delivery to the right person), or SHA256 for
very short documents (like single datagrams or the value of short
database fields like phone numbers, people's last names or email
addresses) or very long documents. It can also be detected with
security certificates over a secure channel, which will also detect
otherwise undetected data corruption in the end-to-end communication
channel, either one-to-one or one-to-many for broadcasts and selective
multicasts. But this case of secure channels should not be a problem
here, as such a channel also has to detect and secure many other cases
than just invalid plain-text encodings, notably man-in-the-middle
attacks or replay attacks, or to reliably detect DoS attacks by a
broken channel with unrecoverable data losses, something that can be
enforced by reasonable timeout watchdogs if the performance of the
channel should be ensured.
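
A small Python sketch of the signature check, assuming the digest of
the original bytes travels with the data by some trusted means (the
sample field value and helper name are invented for the example):

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of the raw bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical short database field containing one invalid byte.
original = b"tel:+33\xff123"
signed = digest(original)  # assumed to be computed by the sender

deleted = original.replace(b"\xff", b"")          # blind deletion
substituted = original.replace(b"\xff", b"\x1a")  # one SUB per invalid byte

print(len(original), len(deleted), len(substituted))
print(digest(deleted) == signed, digest(substituted) == signed)
```

Any modification changes the digest, so the receiver can tell that a
cleanup step altered the data; only the substitution also keeps the
length (and so the field layout) intact.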

2012/7/27 Mark Davis ☕ <>:
> Thanks, good suggestion.
> Mark
> — Il meglio è l’inimico del bene —
> On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead <>
> wrote:
>> "Validation;" par 3, comment in parentheses
>> ". . . (you never want to just delete it; that has security problems)."
>> { COMMENT: would it be helpful here to have a reference here to the
>> unicode security document that discusses this issue -- TR 36, 3.5
>> ?}
Received on Thu Jul 26 2012 - 20:19:56 CDT
