RE: RE: RE: Roundtripping in Unicode

From: Lars Kristan (
Date: Tue Dec 14 2004 - 04:59:08 CST

    Philippe VERDY wrote:
    > I don't think I miss the point. My suggested approach to
    > perform roundtrip conversions between UTF's while keeping all
    > invalid sequences as invalid (for the standard UTFs), is much
    > less risky than converting them to valid codepoints (and by
    > consequence to valid code units, because all valid code
    > points need valid code units in UTF encoding forms).

    I still do think you are missing the point. About two years ago I started a
    similar thread. At that time I was pursuing the use of UTF-8B conversion,
    which uses one invalid sequence to represent another. It uses unpaired low
    surrogates. It works rather well, but one of the readers alerted me that I
    cannot expect that a Unicode database will be able (or, rather, willing) to
    process such data. Since I am not in the habit of writing every piece of the
    code myself (or having my team write it, for that matter), I chose to use a
    third-party database. The data that I have is mainly UTF-8, and users expect
    it to be interpreted as such. But they are not expecting purism in the form
    of rejecting data (filenames) that contain invalid sequences. I am thankful to the
    person that pointed this out, and I have moved to using PUA. The rest of the
    responses were much like what I am getting now. Useless. Telling me to
    reject invalid sequences, telling me to rewrite everything and treat the
    data as binary. Or use an escaping technique, forgetting that everything
    they find wrong about the codepoint approach is also true for escaping.
    Except that escaping has a lot of overhead and that there is an actual risk
    of those escaping sequences being present in today's files. Not the ones on
    UNIX, but the ones on Windows. It should work both ways.
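
    For illustration only: the UTF-8B idea described above matches what
    Python's built-in "surrogateescape" error handler does (each undecodable
    byte 0xNN becomes the lone low surrogate U+DCNN). The sketch below, with a
    made-up filename, shows both the roundtrip and why a strict Unicode
    component refuses such data. It is not the code used in the product
    discussed here.

```python
# UTF-8B-style roundtrip via Python's "surrogateescape" error handler.
# The filename is a made-up example: Latin-1 bytes that are not valid UTF-8.
raw = b"report-\xe9t\xe9.txt"

# Undecodable bytes become unpaired low surrogates (0xE9 -> U+DCE9).
text = raw.decode("utf-8", errors="surrogateescape")

# The matching encoder turns those surrogates back into the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encoder (standing in for a third-party Unicode database) refuses
# unpaired surrogates, which is the objection raised against UTF-8B.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

    Here restored equals raw, while strict_ok ends up False. That second
    result is the practical reason for moving from unpaired surrogates to
    codepoints that strict components accept.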

    > The application doing that just preserves the original byte
    > sequences, for its internal needs, but will not expose to
    > other applications or modules such invalid sequences without
    > the same risks: these other modules need their own strategy,
    > and their strategy could simply be rejecting invalid
    > sequences, assuming that all other valid sequences are
    > encoding valid codepoints (this is the risk you take with
    > your proposal to assign valid codepoints to invalid byte
    > sequences in a UTF-8 stream, and a module that would
    > implement your proposal would remove important security features).
    Only applications that do use the new conversion need to worry about
    security issues. And of those, only the ones that security issues apply to
    in the first place. All other applications can and should treat those
    codepoints as letters. And convert them to UTF-8 just as any other valid
    codepoint. I may have suggested otherwise at some point in time, but this is
    my current position.

    > Note also that once your proposal is implemented, all valid
    > codepoints become convertible across all UTFs, without notice
    > (this is the principle of UTF that they allow transparent
    > conversions between each other).
    Existing conversion is not modified. I am explaining how an alternate
    conversion works simply to prove it is useful. And it does not convert to
    UTF-8. It converts to byte sequences. It can be used wherever one needs to
    interface with such data, for example UNIX filenames. And 'supposedly
    UTF-8' is not the only case. The same technique can be used on 'supposedly
    Latin 3' data. The new conversions are used in pairs and existing UTF
    conversions remain as they are. Any security issues are up to whoever
    decides to use the new conversions. There are no security issues for those
    that do not.

    > Suppose that your proposal is accepted, and that invalid
    > bytes 0xnn in UTF-8 sources (these bytes are necessarily
    > between 0x80 and 0xFF) get encoded to some valid code units
    > U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they
    > become immediately and transparently convertible to valid
    > UTF-16 or even valid UTF-8. Your assumption that the byte
    > sequence will be preserved will be wrong, because each
    > encoded binary byte will become valid sequences of 3 or 4
    > UTF-8 bytes (one lead byte in 0xE0..EF if code points are in
    > the BMP, or in 0xF0..0xF7 if they are in a supplementary
    > plane, and 2 or 3 trail bytes in 0x80..0xBF).
    Again, a UTF-8 to UTF-16 converter does not need to (and should not) encode
    the invalid sequences as valid codepoints. Existing rules apply. Signal,
    reject, replace with U+FFFD.
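
    A small sketch of those existing rules, using Python's standard codecs as
    a stand-in for any conforming UTF-8 to UTF-16 converter. The function name
    and the sample bytes are illustrative assumptions.

```python
# Standard converter behaviour on an invalid sequence: signal or replace.
raw = b"caf\xe9"  # 0xE9 starts a multi-byte sequence that never completes


def convert_utf8_to_utf16(data: bytes, errors: str = "strict") -> bytes:
    """Plain UTF-8 -> UTF-16LE conversion; nothing here knows about escapes."""
    return data.decode("utf-8", errors=errors).encode("utf-16-le")


# Option 1: signal / reject.
try:
    convert_utf8_to_utf16(raw)
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Option 2: replace the invalid sequence with U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
```

    With the strict default the conversion is rejected; with "replace" the
    offending byte becomes U+FFFD and the original byte is lost.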

    > How do you think that other applications will treat these
    > sequences: they won't notice that they are originally
    > equivalent to the new valid sequences, and the byte sequence
    > itself would be transmitted across modules without any
    > warning (applications most often don't check whether
    > codepoints are assigned, just that they are valid and
    > properly encoded).
    Exactly. This is why nothing breaks. A Unicode application should treat
    the new codepoints exactly the way it treats them today. Today they are
    unassigned and are converted according to existing rules. Once they are
    assigned, they just get some properties, but are still treated as valid and
    should be converted as before.
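
    To make this concrete, here is a sketch of how standard conversions treat
    such a codepoint. U+EE80 is a hypothetical stand-in (a private-use
    codepoint chosen for this example, not one named in this thread) for "the
    codepoint that represents byte 0x80".

```python
# A codepoint used as an escape is, to every standard converter, just a
# valid codepoint. U+EE80 is a hypothetical choice for this example.
ch = "\uee80"

utf8 = ch.encode("utf-8")       # three bytes, as for any BMP codepoint >= U+0800
utf16 = ch.encode("utf-16-le")  # one 16-bit code unit

# Standard conversions roundtrip it like any other character.
back_from_utf8 = utf8.decode("utf-8")
back_from_utf16 = utf16.decode("utf-16-le")
```

    The standard UTF-8 form is the three bytes EE BA 80, not the single byte
    0x80. Only the separate, paired conversion maps it back to that byte.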

    > Which application will take the responsibility to convert
    > back these 3-4 bytes valid sequences back to invalid 1-byte
    > sequences, given that your data will already be treated by
    > them as valid, and already encoded with valid UTF code units
    > or encoding schemes?
    Typically the application that generated them. This technique allows the
    application to use Unicode sublayers, databases, sorting and so on to
    process the data. Most of the data IS valid UTF-8 text, and, I can tell you
    from experience, the rest of the data does sort (collate) usefully. Let's
    not make up examples where this is not true. For data in UNIX filesystems
    this is true.

    Now, it is true that data from two applications using this technique can
    become intermixed. But this is not something we should fear. On the
    contrary, this is why I do want to standardize the approach. Because in most
    cases what will happen is exactly what one expects. If each of the two
    applications chose an arbitrary escaping technique to solve the problem,
    then you get a bigger mess.

    And each time I prove something works, someone steps in and finds abuses.
    Yes, it can be abused, but there are cases where there are no security
    issues and the abuser only finds himself amused, but no more. We can discuss
    the possible abuses and exactly what they cause and how they can be
    prevented. And in which cases they really need to be prevented. I have
    discussed some of that in other replies. But I am willing to discuss it with

    > Come back to your filesystem problem. Suppose that there ARE
    > filenames that already contain these valid 3-4 byte
    > sequences. This hypothetic application will blindly convert
    > the valid 3-4 bytes sequences to invalid 1-byte sequences,
    > and then won't be able to access these files, despite they
    > were already correctly UTF-8 encoded. So your proposal breaks
    > valid UTF-8 encoding of filenames. In addition it creates
    > dangerous aliases that will redirect accesses from one
    > filename to another (so yes it is also a security problem).
    We need to separate the UNIX and Windows side here. Using my conversion,
    Windows can access any file on UNIX, because my conversion guarantees
    roundtrip UX=>Win=>UX (can't say UTF-8=>UTF-16=>UTF8, because it is not
    UTF-8). Even if an encoded replacement codepoint is present there, because
    they are escaped themselves (but only in this conversion, not when using
    regular UTF-8 interpretation).

    Win=>UX=>Win roundtrip is not guaranteed. I admit that and have stated so a
    long time ago. And it is not guaranteed only if they contain certain
    sequences of the new codepoints. Note that the sequences that are generated
    by the UX=>Win conversion do roundtrip and are the ones we will expect to
    see, mostly. Existing filenames shouldn't contain any of the codepoints in
    question, because these codepoints are still unused. If you want to suppose
    that this is not true, it just becomes the same case as the abuse attempt
    and we will deal with that next. But let me stress that the fact there
    shouldn't be any actually means there aren't any. OK, next suppose there are
    some or that some concatenation was done or that someone attempts to abuse
    the concept.
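
    The two directions can be sketched as follows. This is a minimal
    illustration, not the actual implementation: the escape range U+EE80 to
    U+EEFF, the names bytes_to_text and text_to_bytes, and the sample
    filenames are all assumptions made for the example.

```python
ESC_BASE = 0xEE00  # hypothetical: byte 0xNN <-> codepoint U+EENN
ESC_LO, ESC_HI = ESC_BASE + 0x80, ESC_BASE + 0xFF


def bytes_to_text(data: bytes) -> str:
    """UX=>Win: decode 'supposedly UTF-8' bytes, escaping what is not clean."""
    out = []
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # strict: shortest valid sequence
            except UnicodeDecodeError:
                continue
            if ESC_LO <= ord(ch) <= ESC_HI:
                # A validly encoded escape codepoint is itself escaped,
                # byte by byte, so it cannot be mistaken for an escaped byte.
                out.extend(chr(ESC_BASE + b) for b in chunk)
            else:
                out.append(ch)
            i += len(chunk)
            break
        else:
            # No valid sequence starts here: escape this single byte.
            out.append(chr(ESC_BASE + data[i]))
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Win=>UX: escape codepoints become raw bytes, the rest plain UTF-8."""
    out = bytearray()
    for ch in text:
        if ESC_LO <= ord(ch) <= ESC_HI:
            out.append(ord(ch) - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# UX=>Win=>UX: guaranteed, even when the bytes already contain an escape
# codepoint in its regular UTF-8 form.
name = b"caf\xe9-\xc3\xa9"  # a stray Latin-1 byte, then a valid UTF-8 e-acute
tricky = "\ueee9".encode("utf-8")
ux_roundtrip_ok = (text_to_bytes(bytes_to_text(name)) == name
                   and text_to_bytes(bytes_to_text(tricky)) == tricky)

# Win=>UX=>Win: not guaranteed for crafted sequences of escape codepoints
# whose bytes happen to form valid UTF-8.
crafted = chr(ESC_BASE + 0xC3) + chr(ESC_BASE + 0xA9)
win_roundtrip_ok = bytes_to_text(text_to_bytes(crafted)) == crafted
```

    In this sketch ux_roundtrip_ok is True and win_roundtrip_ok is False. The
    crafted string collapses to a plain e-acute, which is exactly the kind of
    sequence the UX=>Win conversion never generates on its own.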

    I described this in another mail, but let's do it again. A fact is that you
    do have multiple representations of filenames that map to the same filename.
    Filenames are not case sensitive in Windows. That's it. Are there security
    issues? Depends on what you do. If you let the system take over the security
    and rely entirely on it, then there are no problems. And making double
    checks and early assumptions is something that is not wise anyway, nor
    efficient. Security is by definition centralized, and when it is,
    bijectivity is not a requirement.

    I am supposing most of security works the way I described above. Now, some
    may not. Well, if they rely on bijectivity, they need strict validation. If
    they use the UTF-8 conversion, they can again remain the same. Only if
    someone wants to extend such a security layer to allow invalid sequences,
    then they would need to strengthen the validation. But it can be done,
    simply by roundtripping through UTF-8, and either use the result as-is, or
    compare it to the original if rejection is desired. It can be made even
    simpler. A very strict security layer could reject all the new codepoints.
    But perhaps even before that, it should reject U+FFFD. U+FFFD may present a
    security risk even today. And the new codepoints actually present less of a
    risk. A pure Unicode security layer does not need to reject them, since it
    doesn't use the new conversions. If anyone chooses to obtain a Unicode
    string by using the new conversions, and feed it to the security layer, this
    is no problem. As long as you don't compare such strings yourself and let
    the security layer do all the work. An example where something would
    apparently break is, suppose you have validated a user via such security,
    and via the new conversion. And on your (UTF-16) system, the application
    generated your home directory. You then use a string that you know will map
    to the same user. Well, you ARE the same user, you had to use your password
    and everything. It's no different from case insensitivity. The only risk is,
    you are now not getting your home directory. Well, your loss, you shot
    yourself in the foot, but the security didn't break. And just suppose this
    would not happen only in malicious attempts but was really submitted as a
    bug. The fix is simple, you just need to roundtrip (the 'broken' one) the
    data through Unicode to get the same string the user database is getting.
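
    A rough sketch of the two validation options mentioned above. It uses
    Python's "surrogateescape" handler as a stand-in for the proposed
    conversion pair, so the "new codepoints" here are U+DC80 to U+DCFF; that
    substitution and all function names are assumptions made for the example.

```python
def to_bytes(text: str) -> bytes:
    """Stand-in for the Win=>UX half of the conversion pair."""
    return text.encode("utf-8", errors="surrogateescape")


def to_text(data: bytes) -> str:
    """Stand-in for the UX=>Win half of the conversion pair."""
    return data.decode("utf-8", errors="surrogateescape")


def canonical(text: str) -> str:
    """Roundtrip through bytes to get the string the other side will see."""
    return to_text(to_bytes(text))


def is_canonical(text: str) -> bool:
    """Compare with the original when rejection is desired."""
    return canonical(text) == text


def strict_accept(text: str) -> bool:
    """Very strict layer: refuse escape codepoints and U+FFFD outright."""
    return not any(ch == "\ufffd" or "\udc80" <= ch <= "\udcff" for ch in text)


generated = "home/\udcff"   # what the conversion itself produces for b"home/\xff"
crafted = "\udcc3\udca9"    # escapes whose bytes form a valid UTF-8 sequence
```

    With these definitions the generated name is canonical and the crafted
    one is not: its roundtrip yields a plain e-acute. The strict layer
    refuses both, along with anything containing U+FFFD.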

    > My opinion is then that we must not allow the conversion of
    > any invalid byte sequences to valid code points. All what
    > your application can do is to convert them to invalid
    > sequences code units, to preserve the invalid status. Then
    > it's up to that application to make this conversion privately
    > and restoring the original byte sequence before communicating
    > again with the external system. Another process or module can
    > do the same if it wishes to, but none will communicate
    > directly to each other with their private code unit
    > sequences. The decision to accept invalid byte sequences must
    > remain local to each module and is not transmissible.
    Applications are built from building blocks. Limiting the choice of blocks
    to those that are willing to process invalid data is not a good idea. I
    won't go into the discussion of whether building blocks should or should not
    process the invalid data in the first place. Or should I? I think they
    should have the ability and should only validate if told to do so. But not
    everybody will agree. And even if they would, it would take ages to fix all
    the building blocks (functions, databases, conversions, etc). The
    straightforward solution is to make the data valid. By assigning valid
    codepoints for it. Whoever chooses to interpret those codepoints in a
    special way will also need to worry about the consequences. The rest can and
    should remain as it is.

    > This means that permanent files containing invalid byte
    > sequences must not be converted and replaced to another UTF
    > as long as they contain an invalid byte sequence. Such file
    > converter should fail, and warn the user about file contents
    > or filenames that could not be converted. Then it's up to the
    > user to decide if it wishes to:
    > - drop these files
    Oh, please.

    > - use a filter to remove invalid sequences (if it's a
    > filename, the filter may need to append some indexing string
    > to keep filenames unique in a directory)
    Possibly valid if you are renaming the files (with serious risks involved
    though). But very impractical if you want to simply present the files on the

    > - use a filter to replace some invalid sequences by a user
    > specified valid substitution string
    > - use a filter that will automatically generate valid
    > substitution strings.
    That's escaping. And has all the problems you brought up for my approach.
    And contradicts one of your basic premises - that invalid sequences should
    not be replaced with valid ones. So, if you are suggesting that invalid
    sequences CAN be replaced by valid ones, let's drop that premise. We just
    need to choose the most appropriate escaping technique. And assigning
    codepoints is the best choice.

    > - use other programs that will accept and will be able to
    > process invalid files as opaque sequences of bytes instead of
    > as a stream of Unicode characters.
    Text based programs were usable with Latin 1. Text based programs will be
    usable with UTF-8 once there are no invalid sequences anywhere. Why should
    complex programs be rewritten to treat data as binary just to get over a
    period of time where there will be some invalid sequences present? Is this
    cost effective? Is it as easy as you make it sound? Eventually that binary
    data will need to be displayed. Or entered. UI is text, isn't it? Or should
    we start displaying all UNIX filenames in HEX codes? And saying that the
    text-based approach will work once everything really is in clean UTF-8 is also
    not entirely true. There will always be occasional invalid sequences.
    Suppose you are accessing a UNIX filesystem from Windows. Somehow, one file
    has an invalid sequence. Isn't it better to be able to access that file? Or
    at least rename it. But from Windows. You want to signal the error on the
    UNIX side, which is not where the user is. Force the user to log in to that
    system? Why? Because of some fear my conversion will break everything?
    Because one can use philosophy to prove UNIX filenames are sequences of
    bytes, yet we are all aware they are text?

    > - change the meta-data file-type so that it will no longer be
    > considered as plain-text
    > - change the meta-data encoding label, so that it will be
    > treated as ISO-8859-1 or some other complete 8-bit charset
    > with 256 valid positions (like CP850, CP437, ISO-8859-2, MacRoman...).
    And have all the other names displayed wrong? There may be applications
    running at the same time that depend on accessing the files, according to
    their names at a previous point in time. Also depends on where the
    conversion is done - what if the setting is on the share side? Then fix the
    application, right? Can you weigh the cost of that against your desires to
    not have some 128 codepoints in Unicode, just because you THINK they are not

