RE: Roundtripping Solved

From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Dec 17 2004 - 09:37:07 CST


    Arcane Jill wrote:
    > realistically, Lars, I think you should just take the
    > performance hit. The

    It is not just about performance and CPU cycles. Suppose I have a
    million lines of code and want to replace a UTF-8 conversion with my
    conversion. If my conversion has different size requirements than the
    previous one, I have to carefully analyze what the programmers did in the
    code, or risk a buffer overrun in some odd corner of the application.

    And even if it were about performance: suppose I am processing thousands
    of filenames per second, gathering them from multiple systems onto one.
    This one system will have little disk activity, no fstats, just a bunch of
    conversions. Suppose I have put the filenames in an XML document along
    with their properties. Now I have to convert entire XML documents.

    Now, for as long as some odd characters are present, the network load
    will also increase. Sure, that will become irrelevant after some time. But I
    will still need oversized buffers, just in case, indefinitely. That is only
    slightly better than calling 'strconvlen' on each incoming buffer and
    burdening the system with a bunch of malloc and free calls.

    > In that case,
    > escape sequences will work just as well as reserved
    > characters. They will
    > fulfil exactly the same function ... EXCEPT that you no
    > longer have to worry
    > that Unicode text might contain single codepoints by
    > accident.

    I am not worried about it. My solution with PUA is solid enough for me. The
    range was carefully chosen. The performance (and convenience) requirements
    were stronger and have prevailed. In my case it was a trade-off. In itself,
    this makes the solution unclean. But if other people wanted to use the
    same solution and we agreed to standardize it, then assigning the 128
    codepoints would solve that problem too. That would remove the unclean
    part of my solution and make it suitable for standardization.
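
    As an illustration of the kind of conversion under discussion, here is a
    minimal Python sketch, not the actual implementation. The message does not
    say which 128 codepoints are used, so the range U+E080..U+E0FF and the names
    decode_lossless and encode_lossless are assumptions made for this example.
    Each byte that is not part of a well-formed UTF-8 sequence becomes exactly
    one escape codepoint, and the encoder turns it back into that byte.

```python
# Hypothetical escape range: the message does not say which 128 code points
# are used, so U+E080..U+E0FF (inside the BMP Private Use Area) is assumed.
ESC_OFFSET = 0xE000
ESC_FIRST = ESC_OFFSET + 0x80
ESC_LAST = ESC_OFFSET + 0xFF


def _sequence_length(lead: int) -> int:
    """Length implied by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, turning every byte that is not part of a well-formed
    sequence into one escape code point."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        n = _sequence_length(b)
        ch = None
        if n and i + n <= len(data):
            try:
                ch = data[i:i + n].decode("utf-8")  # strict decode
            except UnicodeDecodeError:
                ch = None
        if ch is not None and not (ESC_FIRST <= ord(ch) <= ESC_LAST):
            out.append(ch)
            i += n
        else:
            # Invalid byte, or a sequence that encodes an escape code point:
            # escape this single byte and resynchronise on the next one.
            out.append(chr(ESC_OFFSET + b))
            i += 1
    return "".join(out)


def encode_lossless(text: str) -> bytes:
    """Inverse direction: escape code points become the single byte they stand for."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_OFFSET)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A Latin-1 filename: 0xE9 is not valid UTF-8 here, so it is escaped.
name = decode_lossless(b"caf\xe9.txt")
restored = encode_lossless(name)
```

    In this sketch, name holds U+E0E9 in place of the stray byte and restored
    equals the original bytes, so the filename survives the trip through
    Unicode unchanged.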

    > There is also one other thing which you seem not to have
    > considered. It is
    > possible (and /much/ more likely than that a suitably chosen
    > escape sequence
    > might turn up by accident) that, in some non-Unicode encoding
    > ... let's say the
    > fictitious encoding Krakozhian ... the byte sequence emitted
    > by UTF-8(c) might
    > be extremely common (where c is one of your 128 reserved
    > codepoints).

    No problem. They are escaped themselves and do roundtrip. My size
    requirements are also met.
    You could also be worried not about the 128 sequences, but about all UTF-8
    sequences. Those will be far more frequent. One could argue that the
    presence of the escape codepoints in Unicode should indicate a legacy
    encoding, and that this is not guaranteed. Well, this possibility of late
    detection is only a side effect of what I am doing. It is not guaranteed
    and is not a requirement. Eventually the problem will be detected, even if
    not a single invalid sequence was encountered, and the important thing is
    that the original byte sequence can be recreated entirely.
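
    The reply above, that the escape codepoints "are escaped themselves and do
    roundtrip", can be made concrete with a small Python sketch. It assumes a
    hypothetical escape range U+E080..U+E0FF (the message does not name the
    real one) and a decoder that refuses to accept UTF-8(c) as a valid sequence
    when c is an escape codepoint. The helper names are invented for this
    example. The code only simulates what such a decoder would produce for the
    three bytes of UTF-8(c).

```python
ESC_OFFSET = 0xE000  # hypothetical; escape code points are U+E080..U+E0FF


def escape_each_byte(raw: bytes) -> str:
    """Map every byte (all assumed to be >= 0x80) to its escape code point."""
    return "".join(chr(ESC_OFFSET + b) for b in raw)


def unescape(text: str) -> bytes:
    """Map escape code points back to the bytes they stand for."""
    return bytes(ord(ch) - ESC_OFFSET for ch in text)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


c = chr(ESC_OFFSET + 0xE9)       # one of the 128 escape code points
raw = c.encode("utf-8")          # its UTF-8 form: bytes EE 83 A9
escaped = escape_each_byte(raw)  # each byte escaped individually
recovered = unescape(escaped)
```

    Here recovered equals raw, so the byte sequence survives. The escaped form
    takes three UTF-16 code units for three input bytes, so under these
    assumptions a buffer sized at one code unit per input byte is still
    sufficient.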

    > In other
    > words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's,
    > not just in Unicode
    The codepoints in Unicode are not to be forbidden (on the contrary) nor
    reserved. They are merely assigned for a specific purpose. Using codepoints
    that are already assigned for some other purpose is bad. Good enough for my
    private solution, but I am looking for a solution that can be used by
    everyone. You are frustrated because you cannot find it. Well, there isn't
    one, at least not one that would meet all the requirements. I still claim
    that my solution works and that there is just one step missing.

    > One last question - why /can't/ locale conversion be
    > automated?
    It *sorta* works in *some* cases. Not all users will do it. And the odd
    filenames will keep reappearing for a long time. Perhaps even for malicious
    reasons.

    Lars

    P.S.
    > PS. I'm on holiday from tomorrow, so if I fail to respond to
    > any comments,
    > it'll be because I'm not here. :-)
    You have taken my "take a break" seriously :) Merry Christmas ;)
    L.


