RE: Roundtripping Solved

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Fri Dec 17 2004 - 04:13:30 CST


    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On Behalf
    Of Lars Kristan
    Subject: RE: Roundtripping Solved

    >However, requirements 1 and 2 are actually taken from Unicode standard, they
    >are not my requirements.
    >How's that? Well, they are my requirements also, but instead of "for all valid
    >UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8
    >strings that do not contain the 128 replacement codepoints".

    Yes, I follow that. But if you replace the phrase "128 replacement codepoints"
    with the phrase "128 replacement codepoint strings", or "128 replacement escape
    sequences" then you do actually still have a workable scheme which does the job
    just as well. You don't seem to have acknowledged this, but think it through.

    I know you argued against replacement strings a while back for "performance
    reasons". I should have replied to that at the time, but I let it go. But
    realistically, Lars, I think you should just take the performance hit. The
    computing cost of counting characters in a null-terminated UTF-8 stream is
    really not that much more than the cost of strlen(). Think about it - all you
    have to do is to disregard bytes which match the bit pattern 10xxxxxx. Just
    count all the rest. You're talking about adding a couple of machine code
    instructions to the loop, that's all. Not only that, as a programmer, you
    /must/ surely realise that the performance cost of even the most complex UTF
    conversion is going to be utterly insignificant when compared with the time it
    takes to move the drive head from one part of a hard disc to another. Your
    conversions will be totally swamped out by all the snail-pace fstat()s etc.
    that you'll need to do to get your filenames in the first place. And even if
    you don't accept that, I hope you can understand that if it is suggested to the
    UTC that they reserve some codepoints just so you don't have to take a
    performance hit, the proposal won't get much past their inbox.
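    For what it's worth, here is a minimal C sketch of that counting loop (my
    own illustration, not anything taken from the standard): the only work
    beyond what strlen() does is one mask-and-compare per byte to skip the
    10xxxxxx continuation bytes.

        #include <stddef.h>

        /* Count the characters in a NUL-terminated UTF-8 string by counting
           every byte EXCEPT the continuation bytes of the form 10xxxxxx. */
        size_t utf8_strlen(const char *s)
        {
            size_t count = 0;
            for (; *s != '\0'; s++) {
                if (((unsigned char)*s & 0xC0) != 0x80)   /* not 10xxxxxx? */
                    count++;
            }
            return count;
        }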

    So let's hypothesise that you /can/ take the performance hit. In that case,
    escape sequences will work just as well as reserved characters. They will
    fulfil exactly the same function ... EXCEPT that you no longer have to worry
    that Unicode text might contain one of your single codepoints by accident. Instead, you
    have a relaxed requirement - that Unicode text should not contain any escape
    strings by accident ... and that can be arranged with an utterly astronomical
    degree of certainty (though never /absolute/ certainty of course). I submit,
    therefore, again, that all of your needs will be met (possibly apart from the
    "no performance hit" thing) if you accept strings of characters instead of
    single characters. /This is workable/.
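    To make that concrete, here is a purely hypothetical sketch in C (the
    marker below is an arbitrary choice of mine, not a proposed sequence): on
    decoding, each byte that is not valid UTF-8 is replaced by the marker
    string plus two hex digits; on encoding, the substitution is reversed so
    that the original byte comes back.

        #include <stdio.h>
        #include <string.h>

        /* Arbitrary illustrative marker - in practice you would pick a
           sequence whose accidental appearance in genuine text is
           astronomically unlikely. */
        #define ESC_MARKER "\xEF\xBF\xBD" "!"   /* U+FFFD in UTF-8, then '!' */

        /* Append the escape string for one raw byte b to out (out is assumed
           to be large enough for this sketch). */
        void escape_byte(unsigned char b, char *out)
        {
            sprintf(out + strlen(out), ESC_MARKER "%02X", (unsigned)b);
        }

        /* If s begins with an escape string, recover the original byte into
           *b and return the number of bytes consumed; otherwise return 0. */
        size_t unescape_byte(const char *s, unsigned char *b)
        {
            size_t mlen = strlen(ESC_MARKER);
            unsigned int v;
            if (strncmp(s, ESC_MARKER, mlen) == 0 &&
                sscanf(s + mlen, "%2X", &v) == 1) {
                *b = (unsigned char)v;
                return mlen + 2;
            }
            return 0;
        }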

    >Furthermore, today, y should not contain any of the 128 codepoints (assuming
    >UTC takes unassigned codepoints and assigns them today).

    This is also true of suitably chosen escape sequences. Except that the UTC does
    not need to assign them - you can choose them yourself - with any desired level of
    probability that they won't turn up by accident.

    > And considerably less than inability to access files or even files being
    > displayed with missing characters (or no characters at all).

    There is also one other thing which you seem not to have considered. It is
    possible (and /much/ more likely than that a suitably chosen escape sequence
    might turn up by accident) that, in some non-Unicode encoding ... let's say the
    fictitious encoding Krakozhian ... the byte sequence emitted by UTF-8(c) might
    be extremely common (where c is one of your 128 reserved codepoints). In other
    words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's, not
    just in Unicode (which, granted, you could do by reserving the characters, c,
    assuming you could wave a magic wand at the UTC), but in ALL OTHER ENCODINGS
    also. It strikes me that you have no way to guarantee that.

    Further, if you argue that this circumstance is unlikely enough not to bother
    about, then my previous arguments involving probability hold.

    I hope I don't come across as arguing for the sake of arguing. I'm actually
    trying to help here. But you WILL NOT get your 128 codepoints, so it seems
    reasonable to look for other ways of solving the original problem which those
    codepoints were designed to solve.

    One last question - why /can't/ locale conversion be automated? I don't really
    get this one, but it's the root of this whole topic. Surely, if we make the
    following assumptions:
    (1) No user has a locale of UTF-8, and
    (2) Some users will have created UTF-8 filenames and UTF-8 text files, and
    (3) Some of those text files may have been concatenated, leading to
    mixed-encoding text files
    then we can surely automate everything. (Assumption (1) can be met simply by
    asking all users who have changed their locale to UTF-8 to change it back
    again, temporarily.) Given these assumptions, all you have to do is:

    # for (all users)
    # {
    #     for (all filenames below ~/)
    #     {
    #         if (filename not valid UTF-8)
    #         {
    #             rename it by re-encoding it (assuming it to be currently
    #             encoded in the user's locale) to UTF-8
    #         }
    #     }
    #     for (all files below ~/)
    #     {
    #         if (the file can be positively identified as a text file)
    #         {
    #             re-encode all non-UTF-8 substrings (assuming them to be
    #             in the user's locale) to UTF-8
    #         }
    #     }
    #     change the user's locale to UTF-8
    # }
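
    For the two re-encoding steps above, the obvious workhorse is something
    like POSIX iconv(); a rough sketch (assuming the name of the user's
    legacy charset is known, e.g. from nl_langinfo(CODESET), and glossing
    over buffer sizing and partial-conversion handling):

        #include <iconv.h>
        #include <string.h>

        /* Convert the NUL-terminated bytes in 'in' from the given legacy
           charset to UTF-8, writing the result into 'out'.
           Returns 0 on success, -1 on failure. */
        int to_utf8(const char *charset, const char *in, char *out, size_t outsize)
        {
            iconv_t cd = iconv_open("UTF-8", charset);
            if (cd == (iconv_t)-1)
                return -1;

            char  *inp     = (char *)in;      /* iconv() wants char** */
            size_t inleft  = strlen(in);
            char  *outp    = out;
            size_t outleft = outsize - 1;     /* leave room for the NUL */

            size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
            iconv_close(cd);
            if (rc == (size_t)-1)
                return -1;

            *outp = '\0';
            return 0;
        }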

    Kernel files and other files under / but not under /user should all have ASCII
    filenames and contain ASCII text, so they won't be a problem anyway. (And even
    if that's not true, the superuser can do the same thing, taking care to avoid
    traversing /user). References to filenames in scripts will have been modified
    along with the filenames, because scripts are text files. All that would fall
    through would be references to non-ASCII filenames in binary files, and you can
    mitigate even that, at least partially - for instance by spitting out all
    databases into .sql files before conversion and reloading them after;
    recompiling as much as possible from source after the conversion; etc. A small
    amount of stuff would still fall through, but that set will be so small that by
    now it would be pretty reasonable just to say "hell - let it break". And when
    it breaks, fix it. I mean - if you actually /can/ automate things, then the
    whole of the rest of this line of discussion becomes unnecessary.

    Just my thoughts.
    Jill

    PS. I'm on holiday from tomorrow, so if I fail to respond to any comments,
    it'll be because I'm not here. :-)


