From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Dec 17 2004 - 09:37:07 CST
Arcane Jill wrote:
> realistically, Lars, I think you should just take the
> performance hit. The
It is not just about performance and CPU cycles. Suppose I have a
million lines of code and want to replace a UTF-8 conversion with my own
conversion. If my conversion has different size requirements than the
previous one, I have to carefully analyze what programmers did in the code,
or risk a buffer overrun in some odd corner of the application.
And even if it were about performance: suppose I am processing thousands of
filenames per second, gathered from multiple systems onto one. Well, this one
system will have little disk activity, no fstats, just a bunch of
conversions. Suppose I have put the filenames in an XML document along with
their properties. Now I have to convert entire XML documents.
Now, as long as some odd characters are present, the network load
will also increase. Sure, that will become irrelevant after some time. But I
will still need oversized buffers, just in case, indefinitely. That
is only slightly better than calling 'strconvlen' on each incoming buffer and
burdening the system with a bunch of malloc and free calls.
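To make the buffer-size point concrete, here is a rough sketch rather than the conversion discussed in this thread. It uses two error handlers that ship with Python's UTF-8 codec as stand-ins: "surrogateescape" turns each undecodable byte into exactly one code point, while "backslashreplace" turns it into a four-character escape sequence. The helper names and the sample filename are made up for illustration.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to hold text as UTF-16."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def decode_one_codepoint_per_byte(raw: bytes) -> str:
    # Stand-in for a "one reserved code point per invalid byte" scheme:
    # each undecodable byte becomes a single code point (U+DC80..U+DCFF).
    return raw.decode("utf-8", errors="surrogateescape")


def decode_escape_sequence(raw: bytes) -> str:
    # Stand-in for an escape-sequence scheme: each undecodable byte
    # becomes the four characters "\xNN".
    return raw.decode("utf-8", errors="backslashreplace")


raw = b"caf\xe9_\xff\xfe.txt"  # 11 bytes, three of them not valid UTF-8
one = decode_one_codepoint_per_byte(raw)
seq = decode_escape_sequence(raw)

# Bytes in, UTF-16 units out for each scheme.
print(len(raw), utf16_units(one), utf16_units(seq))  # prints "11 11 20"
```

With one code point per invalid byte, the output never needs more 16-bit units than the input had bytes, which is the same bound a plain UTF-8 to UTF-16 conversion has, so existing buffers stay large enough. With an escape sequence the bound grows (here up to four units per input byte), which is what forces the oversized buffers or a length pre-pass.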
> In that case,
> escape sequences will work just as well as reserved
> characters. They will
> fulfil exactly the same function ... EXCEPT that you no
> longer have to worry
> that Unicode text might contain single codepoints by
> accident.
I am not worried about it. My solution with PUA is solid enough for me. The
range was carefully chosen. The performance (and convenience) requirements
were stronger and have prevailed. In my case it was a trade-off. In itself,
this makes the solution unclean. But if other people wanted to use the
same solution and we agreed to have it standardized, then assigning the
128 codepoints would solve that problem too. That would remove the
unclean part of my solution and make it suitable for standardization.
> There is also one other thing which you seem not to have
> considered. It is
> possible (and /much/ more likely than that a suitably chosen
> escape sequence
> might turn up by accident) that, in some non-Unicode encoding
> ... let's say the
> fictitious encoding Krakozhian ... the byte sequence emitted
> by UTF-8(c) might
> be extremely common (where c is one of your 128 reserved
> codepoints).
No problem. They are escaped themselves and do roundtrip. My size
requirements are also met.
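A minimal sketch of what "escaped themselves" could look like, under stated assumptions: the range U+E080..U+E0FF is only a placeholder (the actual PUA range is not given in this message), and the function names are hypothetical. Bytes that are not part of valid UTF-8 map to one escape codepoint each, and a valid UTF-8 sequence that happens to encode one of the escape codepoints is treated as if it were invalid, so its bytes are escaped one by one.

```python
ESC_OFFSET = 0xE000                   # placeholder, not the range from the post
ESC_FIRST, ESC_LAST = 0xE080, 0xE0FF  # escape codepoints for bytes 0x80..0xFF


def _seq_len(lead: int) -> int:
    """Expected UTF-8 sequence length for a lead byte, 0 if not a valid lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def bytes_to_text(raw: bytes) -> str:
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:                  # ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        n = _seq_len(b)
        ch = None
        if n:
            try:
                ch = raw[i:i + n].decode("utf-8")  # strict decode of one sequence
            except UnicodeDecodeError:
                ch = None
        if ch is not None and not (ESC_FIRST <= ord(ch) <= ESC_LAST):
            out.append(ch)            # ordinary valid character
            i += n
        else:
            # Invalid byte, or the start of a sequence that encodes an escape
            # codepoint: escape this single byte and move on by one.
            out.append(chr(ESC_OFFSET + b))
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_OFFSET)   # restore the original raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


legacy = b"caf\xe9.txt"               # Latin-1 filename, not valid UTF-8
clash = "\ue080".encode("utf-8")      # valid UTF-8 for an escape codepoint
for sample in (legacy, clash):
    text = bytes_to_text(sample)
    assert text_to_bytes(text) == sample   # bytes -> text -> bytes round-trips
    assert len(text) <= len(sample)        # never more code points than bytes
```

In this sketch the three bytes that encode U+E080 become three escape codepoints, so the original byte sequence is recreated exactly and the output still uses at most one code point per input byte.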
You could also be worried not about the 128 sequences, but about all UTF-8
sequences. Those will be far more frequent. One could argue that the presence of
the escape codepoints in Unicode should indicate a legacy encoding and that
this is not guaranteed. Well, this possibility of late detection is only a
side-effect of what I am doing. It is not guaranteed and is not a
requirement. Eventually, the problem will be detected, even if not a single
invalid sequence was encountered, and the important thing is that the
original byte sequence can be recreated entirely.
> In other
> words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's,
> not just in Unicode
The codepoints in Unicode are not to be forbidden (on the contrary) nor
reserved. They are merely assigned for a specific purpose. Using codepoints
that are already assigned for some other purpose is bad. Good enough for my
private solution, but I am looking for a solution that can be used by
everyone. You are frustrated, because you cannot find it. Well, there isn't
one, at least not one that would meet all the requirements. I still claim
that my solution works and that there is just one step missing.
> One last question - why /can't/ locale conversion be
> automated?
It *sorta* works in *some* cases. Not all users will do it. And the odd
filenames will keep reappearing for a long time. Perhaps even for malicious
reasons.
Lars
P.S.
> PS. I'm on holiday from tomorrow, so if I fail to respond to
> any comments,
> it'll be because I'm not here. :-)
You have taken my "take a break" seriously :) Merry Christmas ;)
L.