From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 13 2004 - 09:08:00 CST
Philippe Verdy wrote:
> An implementation that uses UTF-8 for valid strings could use the invalid
> ranges for lead bytes to encapsulate invalid byte values. Note however that
> the invalid bytes you would need to represent have 256 possible values, but
> the UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1), each for
> 64 codes, if you want to use an encoding on two bytes. The alternative
> would be to use the UTF-8 lead byte values which were initially assigned to
> byte sequences longer than 4 bytes, and that are now unassigned/invalid in
> standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}. Here also it will
> be a private encoding, that should NOT be named UTF-8, and the application
> should clearly document that it will not only accept any valid Unicode
> string, but also some invalid data which will have some roundtrip
> compatibility.
Now you are devising an algorithm to store invalid sequences as other
invalid sequences. In UTF-8. Why not simply stick with the original invalid
sequences?
And the whole purpose of what I am trying to do is to get VALID sequences,
in order to be able to store and manipulate Unicode strings.
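As a concrete illustration, the two-byte private encapsulation quoted above, {0xF8+(n/64); 0x80+(n%64)}, could be sketched as follows (hypothetical helper names; lead bytes 0xF8-0xFB are unassigned in standard UTF-8, so, as Verdy says himself, the output must not be called UTF-8):

```python
def encapsulate(n: int) -> bytes:
    """Encode one raw byte value (0x00-0xFF) into the private
    two-byte sequence {0xF8 + (n / 64), 0x80 + (n % 64)}."""
    assert 0 <= n <= 0xFF
    return bytes([0xF8 + (n // 64), 0x80 + (n % 64)])

def decapsulate(seq: bytes) -> int:
    """Recover the original byte value from a private two-byte sequence."""
    lead, trail = seq
    assert 0xF8 <= lead <= 0xFB and 0x80 <= trail <= 0xBF
    return (lead - 0xF8) * 64 + (trail - 0x80)
```

Since n runs over 256 values, the lead byte takes the four values 0xF8-0xFB, which is exactly why the two genuinely reserved lead bytes 0xC0 and 0xC1 (2 x 64 codes) would not suffice on their own.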
> So what is the problem: suppose that the application, internally, starts
> to generate strings containing any occurrences of such private sequences;
> then it will be possible for the application to generate on its output a
> byte stream that would NOT have roundtrip compatibility back to the
> private representation. So roundtripping would only be guaranteed for
> streams converted FROM a UTF-8 where some invalid sequences are present
> and must be preserved by the internal representation. So the
> transformation is not bijective as you would think, and this potentially
> creates lots of possible security issues.
Yes, it does. An application that uses my approach needs to be designed
accordingly. *IF* the security issues apply. For a UTF-16 text editor this
probably doesn't apply (for the document data, not the filenames). And this
is just an example: with a text editor you can perhaps force the user to
select a different encoding, but there are cases where this cannot be done,
yet the data still needs to be preserved.
So far, many people have suggested that there is no need to preserve
'invalid data'. After some argumentation and a couple of examples, the need
is acknowledged. But then they question the way it is done. They see the
codepoint approach as unsuitable or unneeded, and suggest using some form of
escaping instead. Now, any escaping scheme has exactly the same problems you
are mentioning, and some on top. And it actually represents invalid data
with valid codepoints (only with more than one codepoint per invalid byte),
which you say is a definite no-no.
And on top of all, the approach I am proposing is NOT intended to be used
everywhere. It should only be used when interfacing to a system that cannot
guarantee valid UTF-8, but does use UTF-8. For example, a UNIX filesystem.
And, actually, if the security is entirely done by the filesystem, then it
doesn't even matter if two UTF-16 strings map to the same filename. They
will open the same file. Or be both denied. Which is exactly what is
required. A Windows filesystem is case preserving but case insensitive. Did
it ever bother you that you can use either upper case or lower case filename
to open a file? Does it introduce security issues? Typically no, because you
leave the security to the filesystem. And those checks are always done in
the same UTF.
This is a simple example of something that doesn't even need to be fixed.
There are cases where validation would really need to be fixed. But then
again, only if you use the new conversion. If you don't, your security
remains exactly where it is today.
We should be analyzing the security aspects. Learning where it can break,
and in which cases. Get to know the enemy. And once we understand that
things are manageable and not as frightening as they seem at first, then we
can stop using this as an argument against introducing 128 codepoints.
People who will find them useful should and will bother with the
consequences. Others don't need to and can roundtrip them as today.
So, interpreting the 128 codepoints as 'recreate the original byte sequence'
is an option. If you convert from UTF-16 to UTF-8, then you do exactly as
you do now. Even I will do the same where I just want to represent Unicode
in UTF-8. I will only use this conversion in certain places. The fact that
my conversion produces valid UTF-8 for most Unicode codepoints does not
mean its output is UTF-8. The result is just a byte sequence, the same one
that I started with when I was replacing invalid sequences with the 128
codepoints. And this is not limited to conversion from 'byte sequence that
is mostly UTF-8' to UTF-16. I can (and even should) convert from this byte
sequence to UTF-8, preserving most of it and replacing each byte of the
invalid sequences with the several bytes that represent the appropriate
codepoint in UTF-8.
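The roundtrip described above can be sketched as follows, assuming a hypothetical contiguous block of 128 escape codepoints; the base value used here is purely illustrative, since the post does not fix which 128 codepoints would be assigned:

```python
BASE = 0xEE80  # hypothetical base of the 128 escape codepoints

def decode_preserving(data: bytes) -> str:
    """Decode mostly-UTF-8 bytes, mapping each byte of an invalid
    sequence to one of the 128 escape codepoints, so that the
    original bytes can be recovered later."""
    out = []
    i = 0
    while i < len(data):
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                out.append(chunk.decode('utf-8'))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            # Invalid byte: necessarily 0x80-0xFF, since every byte
            # below 0x80 decodes on its own. Map it to an escape codepoint.
            out.append(chr(BASE + (data[i] - 0x80)))
            i += 1
    return ''.join(out)

def encode_preserving(text: str) -> bytes:
    """Re-encode, turning escape codepoints back into the original
    raw bytes. The result is a byte sequence, not necessarily UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE <= cp < BASE + 128:
            out.append(0x80 + (cp - BASE))
        else:
            out.extend(ch.encode('utf-8'))
    return bytes(out)
```

Note that each byte of an invalid sequence is escaped individually (e.g. an overlong 0xC0 0x80 becomes two escape codepoints), which is what makes the mapping back to the original byte sequence unambiguous.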
> So the best thing you can do to secure your application is to
> REJECT/IGNORE all files whose names do not match the strict UTF-8 encoding
> rules that your application expects (all will happen as if those files
> were not present, but this may still create security problems if an
> application that does not see
Some situations favor security over preserving data; others (far more
common) favor preserving data and have no security aspects at all.
> any file in a directory wants to delete that directory, assuming it is
> empty... In that case the application must be ready to accept the presence
> of directories without any content, and must not depend on the presence of
> a directory to determine that it has some contents; anyway, on secured
> filesystems, such things could happen due to access restrictions,
> completely unrelated to the encoding of filenames, and it is not
> unreasonable to prepare the application so that it will behave correctly
> in the face of inaccessible files or directories, so that the application
> will also correctly handle the fact that the same filesystem will contain
> non plain-text and inaccessible filenames).
Inaccessible filenames are something we shouldn't accept. All your
discussion of non-empty 'empty' directories is just approaching the problem
from the wrong end. One should fix the root cause, not the consequences.
And you would be fixing just that, the consequences; the fact would remain
that there are inaccessible files. Isn't that a problem on its own? Why not
fix that and get rid of a plethora of problems?
> Notably, the concept of filenames is a legacy and badly designed concept,
> inherited from times where storage space was very limited, and the
> designers wanted to create a compact (but often cryptic) representation.
About as bad as a post-it label that you put on a box when you take the box
to the attic. I don't understand what is bad about them. And even if it is
bad, what is one supposed to do? We have them, and we should process them.
Lars
This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 09:14:17 CST