Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models) from Philippe Verdy on 2012-07-31 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 1 Aug 2012 01:51:22 +0200

2012/7/31 CE Whitehead <cewcathar_at_hotmail.com>

> Hmm, after checking several unicode documents and some of the faq (
> http://unicode.org/faq/collation.html), my understanding is that using a
> non-character code point is the best solution here; I don't know which
> non-character code point is best, but at least in collation any
> non-character code point should be ignored. That is, collation is ideally
> performed on "normalized" character strings and not on code points.
> However, I do believe that some string processing/comparison algorithms
> that look at the string itself and not the characters may be affected. So
> this is an issue to consider for some yes.
>

The issue with using a placeholder to replace invalid sequences is that, in
many cases, the stream length must not be altered. If you use a
non-character in a UTF-8 stream, it will not always be possible to insert
it, because every non-character takes three or four bytes in UTF-8 while
the invalid sequence may be a single byte. The null character (even though
it is encoded as a single byte in UTF-8) is the worst choice, due to the
many assumptions made throughout software where it means end-of-string or
sometimes end-of-stream (and some downstream processes will represent the
actual character as a 2-byte sequence, even though that is not strictly
UTF-8).

In UTF-8 you may use 0xFF as a placeholder, but it will not pass through
some interfaces because it is an invalid sequence everywhere in UTF-8. So
you need a valid character that is still encoded as a single byte and is not
used in plain-text files. The SUB C0 control character (U+001A) meets these
needs.
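
A length-preserving substitution of this kind can be sketched in Python as
follows. This is only an illustration under stated assumptions: the handler
name "sub_per_byte" and the function replace_invalid_utf8 are hypothetical
names chosen here, and the sketch assumes the whole stream is available in
memory. Each undecodable byte becomes exactly one SUB byte (0x1A), so the
byte length is unchanged, whereas U+FFFD would cost three bytes per
replacement in UTF-8.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE: a valid character, one byte in UTF-8


def _sub_per_byte(err):
    """Decoding error handler: emit one SUB per undecodable byte."""
    if isinstance(err, UnicodeDecodeError):
        # err.start and err.end delimit the invalid byte range in the input.
        return (SUB * (err.end - err.start), err.end)
    raise err


# "sub_per_byte" is a hypothetical handler name, not a standard one.
codecs.register_error("sub_per_byte", _sub_per_byte)


def replace_invalid_utf8(data: bytes) -> bytes:
    """Return valid UTF-8 of the same byte length as the input."""
    # Valid sequences round-trip unchanged; invalid bytes become SUB.
    return data.decode("utf-8", "sub_per_byte").encode("utf-8")


if __name__ == "__main__":
    raw = b"abc\xffdef"
    fixed = replace_invalid_utf8(raw)
    print(fixed, len(raw) == len(fixed))
```

Whether SUB is acceptable still depends on what the downstream interfaces
do with C0 controls.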

As always, this is not a universal solution; every approach has pros and
cons when trying to manage encoding errors and decide how to pass over
them (if that is desirable).

>
>
> >> Another concern is that even this C0 control may be used for
> >> controlling some terminal functions (such uses are probably in very old
> >> applications), so some code converters are using instead the question
> >> mark (?) which is even worse as it may break a query URL, unexpectedly
> >> passing the data encoded after it to another HTTP(S) resource than the
> >> expected one, and also because it will bypass some cache-control
> >> mechanism.
> Thanks for bringing this up. (I'm not a programmer and really can't
> discuss this further thus but I do know how to create my own queries for
> the search engine, placing question marks wherever so I can bring a
> particular search page up by typing a url for example when I'm searching
> for particular text in a google book . . . )
>
> >>
> >> The document does not discuss really how to choose the replacement
> >> character. My opinion is that for UTF-8 encoded documents, the ASCII
> >> C0 control (SUB) is still better than the U+FFFD character which works
> >> well only in UTF-16 and UTF-32 encodings. It also works well with many
> >> legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages
> >> and many PC/OEM codepages, JIS or EUC variants; it is also mapped in
> >> many EBCDIC codepages, distinctly from simple filler/padding
> >> characters that are blindly stripped in many applications as if they
> >> were just whitespaces at end of a fixed-width data field).
> >>
> It seems that in a previous unicode discussion, it's been recommended
> that applications use codepoints in the noncharacter code points block
> rather than non-unicode control codes. Thus one should not use a character
> at all, just a placeholder.
>

If the encoding length is not an issue (UTF-16 and UTF-32 streams), yes
this is a good solution. Unfortunately we don't have any non-character in
the ASCII range which is encoded as one byte in most encodings.
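
To make the size argument concrete, here is a small Python sketch that
measures how many bytes some candidate placeholders take in each encoding
form. The table name CANDIDATES and the helper encoded_lengths are
hypothetical, and the two non-characters are just examples picked from the
non-character set.

```python
# Candidate placeholders; the selection is illustrative only.
CANDIDATES = {
    "SUB U+001A": "\u001a",
    "REPLACEMENT CHARACTER U+FFFD": "\ufffd",
    "non-character U+FDD0": "\ufdd0",
    "non-character U+1FFFE": "\U0001fffe",
}

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch: str) -> dict:
    """Byte length of one character in each encoding form (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for name, ch in CANDIDATES.items():
        print(name, encoded_lengths(ch))
```

Only SUB fits in one byte of UTF-8; in UTF-16 and UTF-32 the BMP candidates
all take one code unit, which is why the length constraint mostly bites in
UTF-8 and in legacy byte-oriented encodings.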

>
> IMO ("in my opinion"), just having any placeholder is helpful
> security-wise. (However, I'm still thinking this over.)
>

Not just any placeholder, but placeholders that can be universally
substituted for one another, depending on the situation and constraints.
Then you pass only that value. But if encoding length is an issue, you'll
have no other choice than to allow sequences of multiple placeholders.

The list of possible placeholders that an application can process on input
or return on output should be documented. Non-characters are not the only
possible choices.
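
As a rough sketch of that idea in Python: an application documents the
placeholders it accepts on input and converts them to the one it emits,
using a run of several placeholders when the encoded length has to stay the
same. The names ACCEPTED_PLACEHOLDERS and to_sub_same_length are
hypothetical, and the accepted list below is only an example, not a
recommendation.

```python
SUB = "\u001a"  # placeholder emitted on output

# Hypothetical documented list of placeholders accepted on input.
ACCEPTED_PLACEHOLDERS = (SUB, "\ufffd")


def to_sub_same_length(text: str, encoding: str = "utf-8") -> str:
    """Replace each accepted placeholder with SUBs of equal encoded size.

    Assumes SUB encodes to a single byte in `encoding` (true for UTF-8).
    """
    out = []
    for ch in text:
        if ch in ACCEPTED_PLACEHOLDERS:
            # One SUB per byte the original placeholder occupied.
            out.append(SUB * len(ch.encode(encoding)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    before = "a\ufffdb"
    after = to_sub_same_length(before)
    print(len(before.encode("utf-8")), len(after.encode("utf-8")))
```

Here one U+FFFD (three bytes in UTF-8) becomes three SUB characters, so the
byte offsets of everything after it are unchanged.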
Received on Tue Jul 31 2012 - 18:56:46 CDT