RE: Roundtripping Solved

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Fri Dec 17 2004 - 04:13:30 CST


    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On Behalf
    Of Lars Kristan
    Subject: RE: Roundtripping Solved

    >However, requirements 1 and 2 are actually taken from Unicode standard, they
    >are not my requirements.
    >How's that? Well, they are my requirements also, but instead of "for all valid
    >UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8
    >strings that do not contain the 128 replacement codepoints".

    Yes, I follow that. But if you replace the phrase "128 replacement codepoints"
    with the phrase "128 replacement codepoint strings", or "128 replacement escape
    sequences" then you do actually still have a workable scheme which does the job
    just as well. You don't seem to have acknowledged this, but think it through.

    I know you argued against replacement strings a while back for "performance
    reasons". I should have replied to that at the time, but I let it go. But
    realistically, Lars, I think you should just take the performance hit. The
    computing cost of counting characters in a null-terminated UTF-8 stream is
    really not that much more than the cost of strlen(). Think about it - all you
    have to do is to disregard bytes which match the bit pattern 10xxxxxx. Just
    count all the rest. You're talking about adding a couple of machine code
    instructions to the loop, that's all. Not only that, as a programmer, you
    /must/ surely realise that the performance cost of even the most complex UTF
    conversion is going to be utterly insignificant when compared with the time it
    takes to move the drive head from one part of a hard disc to another. Your
    conversions will be totally swamped out by all the snail-pace fstat()s etc.
    that you'll need to do to get your filenames in the first place. And even if
    you don't accept that, I hope you can understand that if it is suggested to the
    UTC that they reserve some codepoints just so you don't have to take a
    performance hit, the proposal won't get much past their inbox.
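    For what it's worth, here is a minimal C sketch of that counting loop (my
    own illustration, not anything taken from the standard): the only work
    beyond what strlen() does is one mask-and-compare per byte to skip the
    10xxxxxx continuation bytes.

        #include <stddef.h>

        /* Count the characters in a NUL-terminated UTF-8 string by counting
           every byte EXCEPT the continuation bytes of the form 10xxxxxx. */
        size_t utf8_strlen(const char *s)
        {
            size_t count = 0;
            for (; *s != '\0'; s++) {
                if (((unsigned char)*s & 0xC0) != 0x80)   /* not 10xxxxxx? */
                    count++;
            }
            return count;
        }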

    So let's hypothesise that you /can/ take the performance hit. In that case,
    escape sequences will work just as well as reserved characters. They will
    fulfil exactly the same function ... EXCEPT that you no longer have to worry
    that Unicode text might contain one of your single codepoints by accident. Instead, you
    have a relaxed requirement - that Unicode text should not contain any escape
    strings by accident ... and that can be arranged with an utterly astronomical
    degree of certainty (though never /absolute/ certainty of course). I submit,
    therefore, again, that all of your needs will be met (possibly apart from the
    "no performance hit" thing) if you accept strings of characters instead of
    single characters. /This is workable/.
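    To make that concrete, here is a purely hypothetical sketch in C (the
    marker below is an arbitrary choice of mine, not a proposed sequence): on
    decoding, each byte that is not valid UTF-8 is replaced by the marker
    string plus two hex digits; on encoding, the substitution is reversed so
    that the original byte comes back.

        #include <stdio.h>
        #include <string.h>

        /* Arbitrary illustrative marker - in practice you would pick a
           sequence whose accidental appearance in genuine text is
           astronomically unlikely. */
        #define ESC_MARKER "\xEF\xBF\xBD" "!"   /* U+FFFD in UTF-8, then '!' */

        /* Append the escape string for one raw byte b to out (out is assumed
           to be large enough for this sketch). */
        void escape_byte(unsigned char b, char *out)
        {
            sprintf(out + strlen(out), ESC_MARKER "%02X", (unsigned)b);
        }

        /* If s begins with an escape string, recover the original byte into
           *b and return the number of bytes consumed; otherwise return 0. */
        size_t unescape_byte(const char *s, unsigned char *b)
        {
            size_t mlen = strlen(ESC_MARKER);
            unsigned int v;
            if (strncmp(s, ESC_MARKER, mlen) == 0 &&
                sscanf(s + mlen, "%2X", &v) == 1) {
                *b = (unsigned char)v;
                return mlen + 2;
            }
            return 0;
        }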

    >Furthermore, today, y should not contain any of the 128 codepoints (assuming
    >UTC takes unassigned codepoints and assigns them today).

    This is also true of suitably chosen escape sequences. Except that the UTC does
    not need to assign them - you can choose them yourself - with any desired level of
    probability that they won't turn up by accident.

    > And considerably less than inability to access files or even files being
    > displayed with missing characters (or no characters at all).

    There is also one other thing which you seem not to have considered. It is
    possible (and /much/ more likely than that a suitably chosen escape sequence
    might turn up by accident) that, in some non-Unicode encoding ... let's say the
    fictitious encoding Krakozhian ... the byte sequence emitted by UTF-8(c) might
    be extremely common (where c is one of your 128 reserved codepoints). In other
    words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's, not
    just in Unicode (which, granted, you could do by reserving the characters, c,
    assuming you could wave a magic wand at the UTC), but in ALL OTHER ENCODINGS
    also. It strikes me that you have no way to guarantee that.

    Further, if you argue that this circumstance is unlikely enough not to bother
    about, then my previous arguments involving probability hold.

    I hope I don't come across as arguing for the sake of arguing. I'm actually
    trying to help here. But you WILL NOT get your 128 codepoints, so it seems
    reasonable to look for other ways of solving the original problem which those
    codepoints were designed to solve.

    One last question - why /can't/ locale conversion be automated? I don't really
    get this one, but it's the root of this whole topic. Surely, if we make the
    following assumptions:
    (1) No user has a locale of UTF-8, and
    (2) Some users will have created UTF-8 filenames and UTF-8 text files, and
    (3) Some of those text files may have been concatenated, leading to
    mixed-encoding text files
    then we can surely automate everything. (Assumption (1) can be met simply by
    asking all users who have changed their locale to UTF-8 to change it back
    again, temporarily.) Given these assumptions, all you have to do is:

    # for (all users)
    # {
    #     for (all filenames below ~/)
    #     {
    #         if (filename not valid UTF-8)
    #         {
    #             rename it by re-encoding it (assuming it to be currently
    #             encoded in the user's locale) to UTF-8
    #         }
    #     }
    #     for (all files below ~/)
    #     {
    #         if (the file can be positively identified as a text file)
    #         {
    #             re-encode all non-UTF-8 substrings (assuming them to be
    #             in the user's locale) to UTF-8
    #         }
    #     }
    #     change the user's locale to UTF-8
    # }
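
    For the two re-encoding steps above, the obvious workhorse is something
    like POSIX iconv(); a rough sketch (assuming the name of the user's
    legacy charset is known, e.g. from nl_langinfo(CODESET), and glossing
    over buffer sizing and partial-conversion handling):

        #include <iconv.h>
        #include <string.h>

        /* Convert the NUL-terminated bytes in 'in' from the given legacy
           charset to UTF-8, writing the result into 'out'.
           Returns 0 on success, -1 on failure. */
        int to_utf8(const char *charset, const char *in, char *out, size_t outsize)
        {
            iconv_t cd = iconv_open("UTF-8", charset);
            if (cd == (iconv_t)-1)
                return -1;

            char  *inp     = (char *)in;      /* iconv() wants char** */
            size_t inleft  = strlen(in);
            char  *outp    = out;
            size_t outleft = outsize - 1;     /* leave room for the NUL */

            size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
            iconv_close(cd);
            if (rc == (size_t)-1)
                return -1;

            *outp = '\0';
            return 0;
        }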

    Kernel files and other files under / but not under /user should all have ASCII
    filenames and contain ASCII text, so they won't be a problem anyway. (And even
    if that's not true, the superuser can do the same thing, taking care to avoid
    traversing /user). References to filenames in scripts will have been modified
    along with the filenames, because scripts are text files. All that would fall
    through would be references to non-ASCII filenames in binary files, and you can
    mitigate even that, at least partially - for instance by spitting out all
    databases into .sql files before conversion and reloading them after;
    recompiling as much as possible from source after the conversion; etc. A small
    amount of stuff would still fall through, but that set will be so small that by
    now it would be pretty reasonable just to say "hell - let it break". And when
    it breaks, fix it. I mean - if you actually /can/ automate things, then the
    whole of the rest of this line of discussion becomes unnecessary.

    Just my thoughts.
    Jill

    PS. I'm on holiday from tomorrow, so if I fail to respond to any comments,
    it'll be because I'm not here. :-)


