RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

From: CE Whitehead <>
Date: Tue, 31 Jul 2012 16:13:04 -0400

Hi once more, Philippe; one more note. My apologies: I am still trying to make sense of the effects of the various characters/noncharacters on the rest of the text when character strings are processed; so if there are any errors in my reply (below), someone please correct me. I am not really a programmer (apart from a knowledge of HTML/CSS, a little JavaScript, and maybe a bit of other stuff).

Subject: RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
Date: Sat, 28 Jul 2012 13:35:57 -0400

> From:
> Date: Fri, 27 Jul 2012 03:17:07 +0200
> Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
> To:
> CC:;
> I just wonder whether the XSS attack is really an issue here. XSS
> attacks involve bypassing the document's source domain in order to
> attempt to use or insert data found in another document issued or
> managed by another domain, in a distinct security realm.
> A more serious issue would be the fact that the parsed document has
> an unknown security status, and that it is subject to inspection
> (for example by an antivirus or antimalware tool trying to identify
> malicious code which would remain usable, but hidden by the
> cipher-like invalid encoding that a browser would just interpret
> blindly).
> Yes that's what I think is the issue here.

And this is also what's discussed in the Unicode security document I suggested linking to.
>> One problem with the strategy of deleting invalid sequences blindly is
>> of course the fact that such invalid sequences may be complex and
>> could be arbitrarily long. But antivirus/antimalware solutions already
>> know how to ignore these invalid sequences when trying to identify
>> malicious code, so that they will detect more possibilities.
> Thanks for the info. I did not know this.
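
(I tried to make this concrete for myself with a little Python -- my own
sketch, not anything from the thread; the byte 0xC0 can never appear in
well-formed UTF-8, so deleting it blindly can splice hidden content back
together, which is, as I understand it, exactly what a scanner must mimic:)

    payload = b"<scr\xc0ipt>alert(1)"   # 0xC0 is never valid in UTF-8

    # Deletion strategy: the lenient decoder silently drops the bad byte,
    # and the hidden tag re-emerges.
    print(payload.decode("utf-8", errors="ignore"))   # <script>alert(1)

    # Replacement strategy: the marker keeps the tag broken.
    print(payload.decode("utf-8", errors="replace"))  # <scr?ipt>alert(1), ? = U+FFFD
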
>> In that case, the safest strategy for an antivirus is effectively to
>> discard the invalid sequences, trying to mimic what an unaware browser
>> would do blindly, with the consequence of running the potentially
>> dangerous code. The strategies used in a browser for rendering the
>> document and in a security solution trying to detect malicious code
>> will then be completely opposed.
> Yes, this is a good strategy for anti-virus and malware detection
> programs; however, I think Unicode is more focused on general
> character handling/display.
>> Another concern is the choice of the replacement character. This
>> document only suggests the U+FFFD character, which may also not pass
>> some encoding converters used when forwarding the document to a lower
>> layer API that effectively runs the code.
>> If the code (as opposed to the normal text) is used, it will
>> frequently be restricted to ASCII or to an SBCS encoding. In that
>> case, a better substitute will be the ASCII C0 control which is
>> normally invalid in plain-text programming/scripting source code.
>> Traditionally this C0 control character is SUB. It may even be used to
>> replace all invalid bytes of an invalid UTF-8 sequence without
>> changing its length (this is not always possible with U+FFFD in UTF-8,
>> because it will be encoded as 3 bytes, and there may be
>> invalid/rejected sequences containing only 1 or 2 bytes that should
>> survive with the same length after the replacement).
> One concern is that SUB and U+FFFD have different character
> properties, and not all Unicode algorithms treat SUB the way they
> should (for example, in boundary breaking or in some transforms).
Hmm, after checking several Unicode documents and some of the FAQ, my understanding is that using a noncharacter code point is the best solution here; I don't know which noncharacter code point is best, but at least in collation any noncharacter code point should be ignored. That is, collation is ideally performed on "normalized" character strings and not on raw code points.
However, I do believe that some string processing/comparison algorithms that look at the string itself and not at the characters may be affected. So this is an issue to consider for some, yes.
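
(To check my understanding of the length argument, here is a rough Python
sketch of my own; the error-handler name is just something I made up. It
replaces each invalid byte one-for-one with SUB, so the byte length survives
a decode/re-encode round trip, which U+FFFD cannot guarantee since it
re-encodes as 3 bytes; the last line also shows that the two stand-ins carry
different character properties:)

    import codecs
    import unicodedata

    def sub_per_byte(err):
        # Replace every byte of the invalid span with U+001A (SUB),
        # one byte of input per one character of output.
        bad = err.object[err.start:err.end]
        return ("\x1a" * len(bad), err.end)

    codecs.register_error("sub-per-byte", sub_per_byte)

    data = b"abc\xc0\x80def"                    # 8 bytes, two of them invalid
    as_sub  = data.decode("utf-8", "sub-per-byte")
    as_fffd = data.decode("utf-8", "replace")

    print(len(as_sub.encode("utf-8")), len(data))   # 8 8   -- length preserved
    print(len(as_fffd.encode("utf-8")), len(data))  # 12 8  -- U+FFFD is 3 bytes each
    print(unicodedata.category("\x1a"),
          unicodedata.category("\ufffd"))           # Cc So -- different properties
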
>> Another concern is that even this C0 control may be used for
>> controlling some terminal functions (such uses are probably only in
>> very old applications), so some code converters instead use the
>> question mark (?), which is even worse, as it may break a query URL,
>> unexpectedly passing the data encoded after it to a different HTTP(S)
>> resource than the expected one, and also because it will bypass some
>> cache-control mechanisms.
Thanks for bringing this up. (I'm not a programmer and thus really can't discuss this further, but I do know how to create my own queries for a search engine, placing question marks where needed so I can bring a particular search page up by typing a URL, for example when I'm searching for particular text in a Google book . . . )
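
(A small sketch of my own, with a made-up URL, just to see what the question
mark does: once a converter substitutes '?' for an invalid byte inside a
path, everything after it is suddenly parsed as the query string:)

    from urllib.parse import urlsplit

    raw = b"http://example.com/docs/r\xc3sum\xc3/page"   # truncated UTF-8 bytes
    # '?'-substitution strategy: replace every non-ASCII byte with '?'.
    cleaned = bytes(b if b < 0x80 else ord("?") for b in raw).decode("ascii")

    parts = urlsplit(cleaned)
    print(cleaned)      # http://example.com/docs/r?sum?/page
    print(parts.path)   # /docs/r     -- the path is cut short
    print(parts.query)  # sum?/page   -- the rest leaks into the query
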
>> The document does not really discuss how to choose the replacement
>> character. My opinion is that for UTF-8 encoded documents, the ASCII
>> C0 control (SUB) is still better than the U+FFFD character, which
>> works well only in the UTF-16 and UTF-32 encodings. SUB also works
>> well with many legacy SBCS or MBCS encodings (including ISO 8859-*,
>> Windows codepages, many PC/OEM codepages, and JIS or EUC variants);
>> it is also mapped in many EBCDIC codepages, distinctly from the
>> simple filler/padding characters that are blindly stripped in many
>> applications as if they were just whitespace at the end of a
>> fixed-width data field.
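
(To see the encoding-converter point concretely, a tiny Python check of my
own: SUB survives a trip through a legacy single-byte codepage, while U+FFFD
has no mapping there at all:)

    # SUB passes through a legacy single-byte codepage unchanged...
    print("\x1a".encode("cp1252"))      # b'\x1a'
    # ...while U+FFFD has no mapping there and the converter rejects it.
    try:
        "\ufffd".encode("cp1252")
    except UnicodeEncodeError as e:
        print("U+FFFD rejected:", e.reason)
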
It seems that in a previous Unicode discussion, it has been recommended that applications use code points from the noncharacter code point range rather than non-Unicode control codes. Thus one would not use a character at all, just a placeholder.

See also:


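(Just to sketch what I mean by a placeholder rather than a character -- my
own toy example, nothing official: U+FDD0 is one of the noncharacters
reserved for internal use, so it can stand in for invalid input inside a
process, provided it is swapped out again before the text is interchanged:)

    import codecs

    PLACEHOLDER = "\ufdd0"   # a noncharacter: internal use only, never interchange

    def noncharacter_placeholder(err):
        # Stand in for each undecodable byte with the internal placeholder.
        bad = err.object[err.start:err.end]
        return (PLACEHOLDER * len(bad), err.end)

    codecs.register_error("nonchar-placeholder", noncharacter_placeholder)

    internal = b"abc\x80def".decode("utf-8", "nonchar-placeholder")
    # ... internal processing (searching, comparison, etc.) happens here ...

    # Before the text leaves the process, the placeholder must go, e.g.
    # replaced with U+FFFD, since noncharacters are not for interchange.
    for_output = internal.replace(PLACEHOLDER, "\ufffd")
    print(for_output)
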
>> How many replacements must be made? My opinion is that replacements
>> should be done so that no change occurs to the data length. For the
>> remaining cases, data security tools can detect the change with
>> strong data signatures: SHA-1 for not-too-long documents (like HTML
>> pages, or full email contents, with the common headers needed for
>> their indexing, routing, or delivery to the right person), or
>> SHA-256 for very short documents (like single datagrams, or the
>> values of short database fields such as phone numbers, last names,
>> or email addresses) or very long documents. Security certificates
>> over a secure channel will also detect otherwise undetected data
>> corruption in the end-to-end communication channel, either
>> one-to-one or one-to-many for broadcasts and selective multicasts;
>> but the case of secure channels should not be a problem here, as
>> such a channel also has to detect and secure many other cases than
>> just invalid plain-text encodings, notably man-in-the-middle attacks
>> or replay attacks, and to reliably detect a DoS attack by a broken
>> channel with unrecoverable data losses, something that can be
>> enforced by reasonable timeout watchdogs if the performance of the
>> channel is to be ensured.
IMO ("in my opinion"), just having any placeholder is helpful security-wise. (However, I'm still thinking this over.)
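
(One last sketch of my own on the signature point: a strong digest such as
SHA-256 makes any substitution detectable, whatever placeholder was used, as
long as a digest of the original bytes travels with the data:)

    import hashlib

    original  = b"abc\xc0\x80def"
    sanitized = original.replace(b"\xc0\x80", b"\x1a\x1a")  # SUB per invalid byte

    print(hashlib.sha256(original).hexdigest())
    print(hashlib.sha256(sanitized).hexdigest())
    # The digests differ, so a signed digest of the original reveals that
    # the cleaned copy is no longer byte-identical, even though the
    # length was preserved.
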


--C. E. Whitehead
> Hmm, thanks for these suggestions (and sorry I have not looked into the replacement characters more at this point).

> Sincerely,
> . . .

> 2012/7/27 Mark Davis ☕ <>:
> > Thanks, good suggestion.
> >
> > Mark
> >
> > — Il meglio è l’inimico del bene — [The best is the enemy of the good]
> >
> >
> >
> > On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead <>
> > wrote:
> >>
> >> "Validation;" par. 3, comment in parentheses:
> >> ". . . (you never want to just delete it; that has security problems)."
> >> { COMMENT: would it be helpful to have a reference here to the
> >> Unicode security document that discusses this issue -- TR 36, 3.5? }
> >
> >