RE: Roundtripping Solved

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Dec 16 2004 - 08:33:21 CST


    Arcane Jill wrote:

    > They are therefore nothing to do with
    > Unicode or the UTC (... or even this list!).

    This is one of the excuses UTC *can* use to stay out of this mess. I am
    hoping they won't do that.

    But I do not agree with you. Those functions can solve several problems, by
    allowing:

    * Retaining the relevant bits when (during conversion to Unicode strings)
    encountering an unassigned character in some SBCS or an invalid sequence in
    any MBCS, including, but not limited to, UTF-8. This also provides a means
    to reliably reconstruct the data should the original be lost by the time the
    problem is detected. As Marcin would say, it is better to prevent it in the
    first place by signaling the problem when the conversion is done, but that
    is not always practiced, nor is it always practical.

    * Temporary coexistence of UTF-8 and legacy encoded filenames on the same
    filesystem, or within the same LAN. No matter how good the tools for
    speeding up the transition, it will take time, and the number of legacy
    encoded filenames will only decrease exponentially. Making the coexistence a
    pain should (in theory) make the transition faster, but will not make the
    problem go away. It could, however, delay the transition.

    * Reliable manipulation of filenames even if they contain invalid UTF-8
    sequences, thus reducing security risks and the load on IT departments.

    * A simple way to fix any application that HAS to deal with non-validated
    UTF-8 data, as opposed to declaring the data as binary and having to rewrite
    existing code or, in the case of fresh development, implement functions,
    transports and protocols to deal with it.
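
To make the roundtripping idea in the list above concrete, here is a minimal sketch in Python. It does not use the 128 codepoints discussed in this thread. As an assumption for illustration, it relies on Python's built-in `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to one of the 128 lone surrogates U+DC80..U+DCFF. The helper names `to_text` and `to_bytes` and the sample filename are hypothetical.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, keeping each invalid byte as a lone surrogate.

    Assumption: Python's "surrogateescape" handler stands in for the
    128 escape codepoints; byte 0xNN becomes U+DCNN.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Encode back to bytes, restoring every escaped byte unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


# A legacy (Latin-1) filename and a UTF-8 filename side by side.
legacy_name = "caf\u00e9.txt".encode("latin-1")  # contains the lone byte 0xE9
utf8_name = "caf\u00e9.txt".encode("utf-8")      # valid UTF-8

decoded_legacy = to_text(legacy_name)  # invalid byte is escaped, not dropped
decoded_utf8 = to_text(utf8_name)      # valid input decodes normally

# Both names survive the trip through a Unicode string byte for byte.
restored_legacy = to_bytes(decoded_legacy)
restored_utf8 = to_bytes(decoded_utf8)
```

The point of the sketch is only that the original bytes can be reconstructed from the Unicode string after the conversion has happened, without the application having to treat the data as binary.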

    All this should help Unicode (in general, and UTF-8 in UNIX filesystems in
    particular) to be accepted faster and with less pain.

    And that is something that definitely has something to do with both UTC and
    this list.

    > I'm not quite sure why Lars isn't happy with
    > these suggestions
    I already have a solution. I would be embarrassed if you managed to
    find a better one overnight :)

    > - maybe his goal has still not been clearly
    > stated -
    To verify the solution and possibly provide the 128 codepoints. Not just for
    me, but for anyone else who might find them useful.

    Lars


