RE: RE: RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 14 2004 - 04:59:08 CST

Next message: Lars Kristan: "RE: Nicest UTF"

Previous message: James Kass: "Re: Subj: Displaying Chinese characters and Chu Nom characters"
Maybe in reply to: Lars Kristan: "RE: RE: Roundtripping in Unicode"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping in Unicode"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping in Unicode"

Philippe VERDY wrote:
> I don't think I miss the point. My suggested approach to
> perform roundtrip conversions between UTF's while keeping all
> invalid sequences as invalid (for the standard UTFs), is much
> less risky than converting them to valid codepoints (and by
> consequence to valid code units, because all valid code
> points need valid code units in UTF encoding forms).

I still do think you are missing the point. About two years ago I started a
similar thread. At that time I was pursuing the use of UTF-8B conversion,
which uses one invalid sequence to represent another. It uses unpaired low
surrogates. It works rather well, but one of the readers alerted me that I
cannot expect that a Unicode database will be able (or, rather, willing) to
process such data. Since I am not in the habit of writing every piece of the
code myself (or having my team write it, for that matter), I chose to use a
third party database. The data that I have is mainly UTF-8, and users expect
it to be interpreted as such. But they do not expect purism in the form of
rejecting data (filenames) which contain invalid sequences. I am thankful to the
person that pointed this out, and I have moved to using PUA. The rest of the
responses were much like what I am getting now. Useless. Telling me to
reject invalid sequences, telling me to rewrite everything and treat the
data as binary. Or use an escaping technique, forgetting that everything
they find wrong about the codepoint approach is also true for escaping.
Except that escaping has a lot of overhead and that there is an actual risk
of those escaping sequences being present in today's files. Not the ones on
UNIX, but the ones on Windows. It should work both ways.
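
As a rough illustration of the lone-low-surrogate scheme mentioned above,
Python's "surrogateescape" error handler (which postdates this message)
maps each undecodable byte 0xNN to U+DCNN. This is only a sketch, not the
UTF-8B implementation discussed here, and the filename is made up. It also
shows the objection raised at the time: a strict Unicode consumer refuses
the resulting string.

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 filename on a system assuming UTF-8

# Each undecodable byte 0xNN becomes the lone low surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9.txt"

# The paired conversion restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# A strict consumer (for example a database that insists on well-formed
# Unicode) cannot encode the lone surrogate and rejects the string.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```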

>
> The application doing that just preserves the original byte
> sequences, for its internal needs, but will not expose to
> other applications or modules such invalid sequences without
> the same risks: these other modules need their own strategy,
> and their strategy could simply be rejecting invalid
> sequences, assuming that all other valid sequences are
> encoding valid codepoints (this is the risk you take with
> your proposal to assign valid codepoints to invalid byte
> sequences in a UTF-8 stream, and a module that would
> implement your proposal would remove important security features).
Only applications that do use the new conversion need to worry about
security issues. And, of course, only those to which security issues apply
in the first place. All other applications can and should treat those
codepoints as letters. And convert them to UTF-8 just as any other valid
codepoint. I may have suggested otherwise at some point in time, but this is
my current position.

> Note also that once your proposal is implemented, all valid
> codepoints become convertible across all UTFs, without notice
> (this is the principle of UTF that they allow transparent
> conversions between each other).
Existing conversion is not modified. I am explaining how an alternate
conversion works simply to prove it is useful. And it does not convert to
UTF-8. It converts to byte sequences. And it can be used wherever one is
interfacing with such data, for example UNIX filenames. And 'supposedly
UTF-8' is not the only case. The same technique can be used on 'supposedly
Latin 3' data. The new conversions are used in pairs and existing UTF
conversions remain as they are. Any security issues are up to whoever
decides to use the new conversions. There are no security issues for those
that do not.

>
> Suppose that your proposal is accepted, and that invalid
> bytes 0xnn in UTF-8 sources (these bytes are necessarily
> between 0x80 and 0xFF) get encoded to some valid code units
> U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they
> become immediately and transparently convertible to valid
> UTF-16 or even valid UTF-8. Your assumption that the byte
> sequence will be preserved will be wrong, because each
> encoded binary byte will become valid sequences of 3 or 4
> UTF-8 bytes (one lead byte in 0xE0..EF if code points are in
> the BMP, or in 0xF0..0xF7 if they are in a supplementary
> plane, and 2 or 3 trail bytes in 0x80..0xBF).
Again, a UTF-8 to UTF-16 converter does not need to (and should not) encode
the invalid sequences as valid codepoints. Existing rules apply. Signal,
reject, replace with U+FFFD.
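
For comparison, the existing rules look roughly like this in Python (the
filename is made up). Strict decoding signals the error, and replacement
with U+FFFD succeeds but loses the original byte, so neither gives a
roundtrip.

```python
raw = b"report-\xff.txt"  # hypothetical filename containing one invalid byte

# Signal / reject: strict decoding raises on the invalid byte.
try:
    raw.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Replace: the invalid byte becomes U+FFFD and the original byte is lost.
replaced = raw.decode("utf-8", errors="replace")
roundtrip = replaced.encode("utf-8")
```

After this, `roundtrip` differs from `raw`, which is why a filename that
went through replacement can no longer be used to open the file.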

>
> How do you think that other applications will treat these
> sequences: they won't notice that they are originally
> equivalent to the new valid sequences, and the byte sequence
> itself would be transmitted across modules without any
> warning (applications most often don't check whether
> codepoints are assigned, just that they are valid and
> properly encoded).
Exactly. This is why nothing breaks. A Unicode application should treat
the new codepoints exactly the way it treats them today. Today they are
unassigned and are converted according to existing rules. Once they are
assigned, they just get some properties, but are still treated as valid and
should be converted as before.

>
> Which application will take the responsibility to convert
> back these 3-4 bytes valid sequences back to invalid 1-byte
> sequences, given that your data will already be treated by
> them as valid, and already encoded with valid UTF code units
> or encoding schemes?
Typically the application that generated them. This technique allows the
application to use Unicode sublayers, databases, sorting and so on to
process the data. Most of the data IS valid UTF-8 text, and, I can tell you
from experience, the rest of the data does sort (collate) usefully. Let's
not make up examples where this is not true. For data in UNIX filesystems
this is true.

Now, it is true that data from two applications using this technique can
become intermixed. But this is not something we should fear. On the
contrary, this is why I do want to standardize the approach. Because in most
cases what will happen is exactly what one expects. If each of the two
applications chose an arbitrary escaping technique to solve the problem,
then you get a bigger mess.

And each time I prove something works, someone steps in and finds abuses.
Yes, it can be abused, but there are cases where there are no security
issues and the abuser only finds himself amused, but no more. We can discuss
the possible abuses and exactly what they cause and how they can be
prevented. And in which cases they really need to be prevented. I have
discussed some of that in other replies. But I am willing to discuss it with
anyone.

>
> Come back to your filesystem problem. Suppose that there ARE
> filenames that already contain these valid 3-4 byte
> sequences. This hypothetical application will blindly convert
> the valid 3-4 byte sequences to invalid 1-byte sequences,
> and then won't be able to access these files, even though they
> were already correctly UTF-8 encoded. So your proposal breaks
> valid UTF-8 encoding of filenames. In addition it creates
> dangerous aliases that will redirect accesses from one
> filename to another (so yes it is also a security problem).
We need to separate the UNIX and Windows side here. Using my conversion,
Windows can access any file on UNIX, because my conversion guarantees
roundtrip UX=>Win=>UX (can't say UTF-8=>UTF-16=>UTF-8, because it is not
UTF-8). This holds even if an encoded replacement codepoint is already
present there, because such codepoints are escaped themselves (but only in
this conversion, not when using regular UTF-8 interpretation).

Win=>UX=>Win roundtrip is not guaranteed. I admit that and have stated so a
long time ago. And it fails only for names that contain certain
sequences of the new codepoints. Note that the sequences that are generated
by the UX=>Win conversion do roundtrip and are the ones we will expect to
see, mostly. Existing filenames shouldn't contain any of the codepoints in
question, because these codepoints are still unused. If you want to suppose
that this is not true, it just becomes the same case as the abuse attempt
and we will deal with that next. But let me stress that the fact there
shouldn't be any actually means there aren't any. OK, next suppose there are
some or that some concatenation was done or that someone attempts to abuse
the concept.
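
The two directions can be sketched as follows. This is a minimal sketch
under stated assumptions, not the actual implementation: ESCAPE_BASE is a
hypothetical PUA block (byte 0xNN maps to U+F7NN), and the function names
are made up for illustration. Bytes that already form the UTF-8 encoding of
an escape codepoint are escaped byte by byte, which is what keeps
UX=>Win=>UX exact.

```python
ESCAPE_BASE = 0xF700  # hypothetical: invalid byte 0xNN maps to U+F7NN (PUA)

def is_escape(ch: str) -> bool:
    return ESCAPE_BASE + 0x80 <= ord(ch) <= ESCAPE_BASE + 0xFF

def _decode_one(data: bytes, i: int):
    """Return (char, width) for one well-formed UTF-8 sequence at i, else (None, 1)."""
    for width in (1, 2, 3, 4):
        try:
            return data[i:i + width].decode("utf-8"), width
        except UnicodeDecodeError:
            pass
    return None, 1

def bytes_to_text(data: bytes) -> str:
    """UX => Win: invalid bytes, and encoded escape codepoints, are escaped."""
    out, i = [], 0
    while i < len(data):
        ch, width = _decode_one(data, i)
        if ch is None or is_escape(ch):
            out.append(chr(ESCAPE_BASE + data[i]))  # escape this single byte
            i += 1
        else:
            out.append(ch)
            i += width
    return "".join(out)

def text_to_bytes(text: str) -> bytes:
    """Win => UX: escape codepoints become raw bytes, the rest plain UTF-8."""
    out = bytearray()
    for ch in text:
        if is_escape(ch):
            out.append(ord(ch) - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# An invalid byte plus the UTF-8 form of an escape codepoint (U+F780).
unix_name = b"caf\xe9-\xef\x9e\x80"
win_name = bytes_to_text(unix_name)
assert text_to_bytes(win_name) == unix_name  # UX => Win => UX holds

# Two escape codepoints whose bytes happen to form valid UTF-8 ("é").
alias = chr(ESCAPE_BASE + 0xC3) + chr(ESCAPE_BASE + 0xA9)
assert bytes_to_text(text_to_bytes(alias)) == "é"  # Win => UX => Win does not
```

The second case is the kind of sequence UX=>Win never generates, since
bytes that form valid UTF-8 are decoded normally rather than escaped.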

I described this in another mail, but let's do it again. A fact is that you
already have multiple representations of filenames that map to the same
filename. Filenames are not case sensitive in Windows. That's it. Are there security
issues? Depends on what you do. If you let the system take over the security
and rely entirely on it, then there are no problems. And making double
checks and early assumptions is something that is not wise anyway, nor
efficient. Security is by definition centralized, and when it is,
bijectivity is not a requirement.

I am supposing most of security works the way I described above. Now, some
may not. Well, if they rely on bijectivity, they need strict validation. If
they use the UTF-8 conversion, they can again remain the same. Only if
someone wants to extend such a security layer to allow invalid sequences,
then they would need to strengthen the validation. But it can be done,
simply by roundtripping through UTF-8, and either use the result as-is, or
compare it to the original if rejection is desired. It can be made even
simpler. A very strict security layer could reject all the new codepoints.
But perhaps even before that, it should reject U+FFFD. U+FFFD may present a
security risk even today. And the new codepoints actually present less of a
risk. A pure Unicode security layer does not need to reject them, since it
doesn't use the new conversions. If anyone chooses to obtain a Unicode
string by using the new conversions, and feed it to the security layer, this
is no problem. As long as you don't compare such strings yourself and let
the security layer do all the work. An example where something would
apparently break is, suppose you have validated a user via such security,
and via the new conversion. And on your (UTF-16) system, the application
generated your home directory. You then use a string that you know will map
to the same user. Well, you ARE the same user, you had to use your password
and everything. It's no different from case insensitivity. The only risk is,
you are now not getting your home directory. Well, your loss, you shot
yourself in the foot, but the security didn't break. And just suppose this
would not happen only in malicious attempts but was really submitted as a
bug. The fix is simple, you just need to roundtrip (the 'broken' one) the
data through Unicode to get the same string the user database is getting.
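
The "very strict security layer" option can be sketched in a few lines.
The escape range below is a hypothetical PUA block chosen only for
illustration, and the function name is made up. The check rejects any
string containing U+FFFD or one of the escape codepoints.

```python
ESCAPE_FIRST = 0xF780  # hypothetical first escape codepoint (assumption)
ESCAPE_LAST = 0xF7FF   # hypothetical last escape codepoint (assumption)
REPLACEMENT = "\ufffd"

def strict_accepts(text: str) -> bool:
    """Accept only strings with no U+FFFD and no escape codepoints."""
    return not any(
        ch == REPLACEMENT or ESCAPE_FIRST <= ord(ch) <= ESCAPE_LAST
        for ch in text
    )
```

A layer that prefers to accept such names would instead roundtrip the
string through the paired conversion and compare the result with the
original, as described above.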

>
> My opinion is then that we must not allow the conversion of
> any invalid byte sequences to valid code points. All that
> your application can do is to convert them to invalid
> sequences of code units, to preserve the invalid status. Then
> it's up to that application to make this conversion privately
> and restore the original byte sequence before communicating
> again with the external system. Another process or module can
> do the same if it wishes to, but none will communicate
> directly to each other with their private code unit
> sequences. The decision to accept invalid byte sequences must
> remain local to each module and is not transmissible.
Applications are built from building blocks. Limiting the choice of blocks
to those that are willing to process invalid data is not a good idea. I
won't go into the discussion of whether building blocks should or should not
process the invalid data in the first place. Or should I? I think they
should have the ability and should only validate if told to do so. But not
everybody will agree. And even if they would, it would take ages to fix all
the building blocks (functions, databases, conversions, etc). The
straightforward solution is to make the data valid. By assigning valid
codepoints for it. Whoever chooses to interpret those codepoints in a
special way will also need to worry about the consequences. The rest can and
should remain as it is.

>
> This means that permanent files containing invalid byte
> sequences must not be converted and replaced to another UTF
> as long as they contain an invalid byte sequence. Such file
> converter should fail, and warn the user about file contents
> or filenames that could not be converted. Then it's up to the
> user to decide if it wishes to:
> - drop these files
Oh, please.

> - use a filter to remove invalid sequences (if it's a
> filename, the filter may need to append some indexing string
> to keep filenames unique in a directory)
Possibly valid if you are renaming the files (with serious risks involved
though). But very impractical if you want to simply present the files on the
network.

> - use a filter to replace some invalid sequences by a user
> specified valid substitution string
> - use a filter that will automatically generate valid
> substitution strings.
That's escaping. And has all the problems you brought up for my approach.
And contradicts one of your basic premises - that invalid sequences should
not be replaced with valid ones. So, if you are suggesting that invalid
sequences CAN be replaced by valid ones, let's drop that premise. We just
need to choose the most appropriate escaping technique. And assigning
codepoints is the best choice.

> - use other programs that will accept and will be able to
> process invalid files as opaque sequences of bytes instead of
> as a stream of Unicode characters.
Text based programs were usable with Latin 1. Text based programs will be
usable with UTF-8 once there are no invalid sequences anywhere. Why should
complex programs be rewritten to treat data as binary just to get over a
period of time where there will be some invalid sequences present? Is this
cost effective? Is it as easy as you make it sound? Eventually that binary
data will need to be displayed. Or entered. UI is text, isn't it? Or should
we start displaying all UNIX filenames in HEX codes? And saying that a text
based approach will work once everything really is in clean UTF-8 is also
not entirely true. There will always be occasional invalid sequences.
Suppose you are accessing a UNIX filesystem from Windows. Somehow, one file
has an invalid sequence. Isn't it better to be able to access that file? Or
at least rename it. But from Windows. You want to signal the error on the
UNIX side, which is not where the user is. Force the user to log in to that
system? Why? Because of some fear my conversion will break everything?
Because one can use philosophy to prove UNIX filenames are sequences of
bytes, yet we are all aware they are text?

> - change the meta-data file-type so that it will no longer be
> considered as plain-text
> - change the meta-data encoding label, so that it will be
> treated as ISO-8859-1 or some other complete 8-bit charset
> with 256 valid positions (like CP850, CP437, ISO-8859-2, MacRoman...).
And have all the other names displayed wrong? There may be applications
running at the same time that depend on accessing the files, according to
their names at a previous point in time. It also depends on where the
conversion is done - what if the setting is on the share side? Then fix the
application, right? Can you weigh the cost of that against your desire to
not have some 128 codepoints in Unicode, just because you THINK they are not
needed?

Lars


This archive was generated by hypermail 2.1.5 : Tue Dec 14 2004 - 05:01:40 CST