RE: Roundtripping in Unicode

From: Lars Kristan (
Date: Mon Dec 13 2004 - 09:08:00 CST

  • Next message: Lars Kristan: "RE: RE: Roundtripping in Unicode"

    Philippe Verdy wrote:
    > An implementation that uses UTF-8 for valid string could use
    > the invalid
    > ranges for lead bytes to encapsulate invalid byte values.
    > Note however that
    > invalid bytes you would need to represent have 256 possible
    > values, but the
    > UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1)
    > each for 64
    > codes, if you want to use an encoding on two bytes. The
    > alternative would be
    > to use the UTF-8 lead byte values which have initially been
    > assigned to byte
    > sequences longer than 4 bytes, and that are now unassigned/invalid in
    > standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
    > Here also it will be a private encoding, that should NOT be
    > named UTF-8, and
    > the application should clearly document that it will not only
    > accept any
    > valid Unicode string, but also some invalid data which will have some
    > roundtrip compatibility.
    Now you are devising an algorithm to store invalid sequences as other
    invalid sequences, in UTF-8. Why not simply stick with the original invalid
    sequences? The whole purpose of what I am trying to do is to get VALID
    sequences, in order to be able to store and manipulate Unicode strings.
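
    For concreteness, the two-byte private encoding Philippe sketches
    ({0xF8+(n/64); 0x80+(n%64)}) can be illustrated as follows. This is a
    minimal sketch of the byte-level mapping only, with hypothetical function
    names; a real codec would also have to validate the surrounding UTF-8:

```python
def encode_invalid_byte(n: int) -> bytes:
    """Map an invalid byte n (0..255) to the private two-byte
    sequence {0xF8 + n//64, 0x80 + n%64} described above."""
    assert 0 <= n <= 255
    return bytes([0xF8 + n // 64, 0x80 + n % 64])

def decode_private_pair(pair: bytes) -> int:
    """Inverse mapping: recover the original byte from the pair."""
    lead, trail = pair
    assert 0xF8 <= lead <= 0xFB and 0x80 <= trail <= 0xBF
    return (lead - 0xF8) * 64 + (trail - 0x80)

# Every byte value roundtrips through the private encoding.
assert all(decode_private_pair(encode_invalid_byte(n)) == n
           for n in range(256))
```

    Note that the lead bytes 0xF8..0xFB produced here are exactly the values
    standard UTF-8 no longer assigns, which is why, as Philippe says, the
    result must not be called UTF-8.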

    > So what is the problem: suppose that the application,
    > internally, starts to
    > generate strings containing any occurences of such private
    > sequences, then
    > it will be possible for the application to generate on its
    > output a byte
    > stream that would NOT have roundtrip compatibility, back to
    > the private
    > representation. So roundtripping would only be guaranteed for streams
    > converted FROM an UTF-8 where some invalid sequences are
    > present and must be
    > preserved by the internal representation. So the
    > transformation is not
    > bijective as you would think, and this potentially creates
    > lots of possible
    > security issues.
    Yes, it does. An application that uses my approach needs to be designed
    accordingly, *IF* the security issues apply. For a UTF-16 text editor this
    probably doesn't apply (in terms of data, not filenames). And this is just
    an example: with a text editor you can perhaps force the user to select a
    different encoding, but there are cases where that cannot be done and the
    data still needs to be preserved.

    So far, many people have suggested that there is no need to preserve
    'invalid data'. After some argumentation and a couple of examples, the need
    is acknowledged. But then they question the way it is done. They see the
    codepoint approach as unsuitable or unneeded. And suggest using some form of
    escaping. Now, any escaping has exactly the same problems you are
    mentioning, and some on top. And is actually representing invalid data with
    valid codepoints (except more than one per invalid byte), which you say is a
    definite no-no.

    And on top of all, the approach I am proposing is NOT intended to be used
    everywhere. It should only be used when interfacing to a system that cannot
    guarantee valid UTF-8, but does use UTF-8. For example, a UNIX filesystem.
    And, actually, if the security is entirely done by the filesystem, then it
    doesn't even matter if two UTF-16 strings map to the same filename. They
    will open the same file. Or be both denied. Which is exactly what is
    required. A Windows filesystem is case preserving but case insensitive. Did
    it ever bother you that you can use either upper case or lower case filename
    to open a file? Does it introduce security issues? Typically no, because you
    leave the security to the filesystem. And those checks are always done in
    the same UTF.

    This is a simple example of something that doesn't even need to be fixed.
    There are cases where validation would really need to be fixed. But then
    again, only if you use the new conversion. If you don't, your security
    remains exactly where it is today.

    We should be analyzing the security aspects. Learning where it can break,
    and in which cases. Get to know the enemy. And once we understand that
    things are manageable and not as frightening as it seems at first, then we
    can stop using this as an argument against introducing 128 codepoints.
    People who will find them useful should and will bother with the
    consequences. Others don't need to and can roundtrip them as today.

    So, interpreting the 128 codepoints as 'recreate the original byte sequence'
    is an option. If you convert from UTF-16 to UTF-8, then you do exactly as
    you do now. Even I will do the same where I just want to represent Unicode
    in UTF-8. I will only use this conversion in certain places. The fact that
    my conversion actually produces valid UTF-8 for most Unicode code points
    does not mean the result is UTF-8. The result is just a byte sequence. The same one
    that I started with when I was replacing invalid sequences with the 128
    codepoints. And this is not limited to conversion from 'byte sequence that
    is mostly UTF-8' to UTF-16. I can (and even should) convert from this byte
    sequence to UTF-8, preserving most of it and replacing each byte of an
    invalid sequence with the several bytes that represent the appropriate
    codepoint in UTF-8.
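
    To make the 'recreate the original byte sequence' interpretation concrete,
    here is a rough Python sketch of the two conversions. The proposal does not
    fix where the 128 codepoints live, so the block starting at U+EE80 below is
    purely a hypothetical placeholder:

```python
BASE = 0xEE80  # hypothetical location of the 128 escape codepoints

def decode_roundtrip(data: bytes) -> str:
    """Decode mostly-UTF-8 bytes; each byte of an invalid sequence
    becomes one of the 128 escape codepoints instead of U+FFFD."""
    out, i = [], 0
    while i < len(data):
        # Try the longest valid UTF-8 chunk starting at position i.
        for j in range(min(len(data), i + 4), i, -1):
            try:
                out.append(data[i:j].decode('utf-8'))
                i = j
                break
            except UnicodeDecodeError:
                continue
        else:
            # Invalid byte: always 0x80..0xFF, so 128 values suffice.
            out.append(chr(BASE + (data[i] & 0x7F)))
            i += 1
    return ''.join(out)

def encode_roundtrip(text: str) -> bytes:
    """Inverse conversion: escape codepoints turn back into the
    original bytes; everything else is encoded as normal UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE <= cp < BASE + 128:
            out.append(0x80 | (cp - BASE))  # restore original byte
        else:
            out.extend(ch.encode('utf-8'))
    return bytes(out)
```

    Bytes roundtrip (encode_roundtrip(decode_roundtrip(b)) == b) as long as the
    input does not itself contain UTF-8-encoded escape codepoints; that
    residual ambiguity is the non-bijectivity Philippe warns about.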

    > So the best thing you can do to secure your application, is to ignore
    > all files whose names do not match the strict UTF-8 encoding
    > rules that your
    > application expect (all will happen as if those files were
    > not present, but
    > this may still create security problems if an application
    > that does not see
    Some situations favor security over preserving data; others (far more
    common) favor preserving data and have no security aspects at all.

    > any file in a directory wants to delete that directory,
    > assuming it is
    > empty... In that case the application must be ready to accept
    > the presence
    > of directories without any content, and must not depend on
    > the presence of a
    > directory to determine that it has some contents; anyway, on secured
    > filesystems, such things could happen due to access
    > restrictions, completely
    > unrelated to the encoding of filenames, and it is not
    > unreasonnable to
    > prepare the application so that it will behave correctly face to
    > inaccessible files or directories, so that the application will also
    > correctly handle the fact that the same filesystem will contain non
    > plain-text and inaccessible filenames).
    Inaccessible filenames are something we shouldn't accept. All your
    discussion of non-empty empty directories is just approaching the problem
    from the wrong end. One should fix the root cause, not the consequences.
    And you would be fixing just that, the consequences; the fact would remain
    that there are inaccessible files. Isn't that a problem on its own? Why not
    fix that and get rid of a plethora of problems?

    > Notably, the concept of filenames is a legacy and badly
    > designed concept,
    > inherited from times where storage space was very limited,
    > and the designers
    > wanted to create a compact (but often cryptic) representation.
    About as bad as a post-it label that you put on a box when you take the box
    to the attic. I don't understand what is bad about them. And even if it is
    bad, what is one supposed to do? We have them, and should process them.


    This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 09:14:17 CST