From: Lars Kristan (email@example.com)
Date: Thu Dec 16 2004 - 08:33:21 CST
Arcane Jill wrote:
> They are therefore nothing to do with
> Unicode or the UTC (... or even this list!).
This is one of the excuses UTC *can* use to stay out of this mess. I am
hoping they won't do that.
But I do not agree with you. Those functions can solve several problems, by:
* Retaining the relevant bits when, during conversion to Unicode strings, an
unassigned character in some SBCS or an invalid sequence in any MBCS
(including, but not limited to, UTF-8) is encountered, and providing a means
to reliably reconstruct the data should the original be lost by the time the
problem is detected. As Marcin would say, it is better to prevent it in the
first place by signaling the problem when the conversion is done, but that
is not always practiced, nor is it always practical.
* Temporary coexistence of UTF-8 and legacy-encoded filenames on the same
filesystem, or within the same LAN. No matter how good the tools for
speeding up the migration, it will take time, and the number of
legacy-encoded filenames will only decay exponentially rather than drop to
zero. Making the coexistence a pain should (in theory) make the migration
faster, but it will not make the coexistence go away. It could, however,
delay the migration.
* Reliable manipulation of filenames even if they contain invalid UTF-8
sequences, thus reducing security risks and the load on IT departments.
* A simple way to fix any application that HAS to deal with non-validated
UTF-8 data, as opposed to declaring the data as binary and having to rewrite
existing code or, in the case of fresh development, implement functions,
transports and protocols to deal with it.
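
To make the first and third points concrete, here is a minimal sketch of the
round trip. It does not use the functions proposed in this thread. It relies
on Python's built-in "surrogateescape" error handler (PEP 383), which takes a
similar approach but maps each undecodable byte to a lone surrogate in
U+DC80..U+DCFF rather than to newly assigned codepoints. The sample filename
is an assumption chosen for illustration.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, keeping each invalid byte as an escape code point."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Encode as UTF-8, turning escape code points back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded filename: the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

text = to_text(raw)        # the invalid byte is retained, not replaced or dropped
restored = to_bytes(text)  # the original bytes are reconstructed exactly

print(ascii(text))
print(restored == raw)
```

The string can be passed around and manipulated like any other, and the
original bytes are still recoverable after the input buffer is gone.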
All this should help Unicode (in general, and UTF-8 in UNIX filesystems in
particular) to be accepted faster and with less pain.
And that is something that definitely has something to do with both UTC and
> I'm not quite sure why Lars isn't happy with
> these suggestions
I already have a solution. I would be embarrassed if you managed to
find a better one overnight :)
> - maybe his goal has still not been clearly stated -
To verify the solution and possibly provide the 128 codepoints. Not just for
me, but for anyone else who might find them useful.
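
As a rough illustration of what a block of 128 codepoints could be used for,
the sketch below maps each invalid byte (presumably one codepoint per byte
value 0x80..0xFF) into a contiguous block and back. The block location
(BASE, placed in the Private Use Area), the handler name and the function
names are all hypothetical and chosen only for this example. No such block
has been assigned.

```python
import codecs

# Hypothetical block: BASE + 0x80 .. BASE + 0xFF (128 code points in the PUA).
BASE = 0xEE00


def _escape_invalid(exc):
    """Decode error handler: map each undecodable byte b to chr(BASE + b)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(BASE + b) for b in bad), exc.end


# "hypothetical-escape" is a made-up handler name for this sketch.
codecs.register_error("hypothetical-escape", _escape_invalid)


def to_text(raw: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as code points from the block."""
    return raw.decode("utf-8", errors="hypothetical-escape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text: block code points become single bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE + 0x80 <= cp <= BASE + 0xFF:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"ok \xff\xfe caf\xc3\xa9"
print(to_bytes(to_text(raw)) == raw)
```

One limitation of this sketch: if the input already contains the valid UTF-8
encoding of a codepoint from the block, to_bytes will turn it into a single
byte, so the round trip is exact only for data that does not use the block
itself.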