From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 20:17:00 CDT
>Yes, if you do *everything* in UTF-32, the same arguments
>for string APIs would apply without having to do surrogate
>detection at the point of parsing code point boundaries,
>but there are a number of good reasons why people choose
>to (or have to) process text in UTF-16, as well.

For most purposes I do do everything in UTF-32. I read UTF-8,
convert it to UTF-32, work on the UTF-32, and convert it to
UTF-8 again on output. In a UTF-16 world that may not be the
best approach, but in my overwhelmingly Unix world, the input
I see is ASCII, UTF-8, or some parochial encoding. I don't think
that I have ever encountered UTF-16 in the wild, though I have
created it for testing purposes. Your mileage may vary.
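The pipeline described above can be sketched as follows. This is a minimal illustration, not Bill's actual code; the function names and the placeholder "work" step (upper-casing ASCII letters) are invented for the example.

```python
# Sketch of the workflow: read UTF-8 bytes, widen to UTF-32 code
# points, operate on fixed-width units, serialize back to UTF-8.

def process_codepoints(codepoints):
    # Placeholder for the "work on the UTF-32" step: upper-case
    # ASCII letters, leave everything else alone.
    return [cp - 32 if ord('a') <= cp <= ord('z') else cp
            for cp in codepoints]

def utf8_roundtrip(data: bytes) -> bytes:
    text = data.decode('utf-8')                  # UTF-8 in
    codepoints = [ord(ch) for ch in text]        # one int per code point
    codepoints = process_codepoints(codepoints)  # fixed-width processing
    return ''.join(map(chr, codepoints)).encode('utf-8')  # UTF-8 out

print(utf8_roundtrip('héllo'.encode('utf-8')))  # b'H\xc3\xa9LLO'
```

Because every element of the list is a full code point, no surrogate detection is needed at any step; that is the advantage the quoted text refers to.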

(The weirdest parochial encoding that I have encountered was
one used by an Indian word processor whose native encoding I
reverse-engineered. It was a stateful encoding in which the same
codepoint could represent different characters depending on whether
it was expecting a consonant or a vowel.)
Bill
This archive was generated by hypermail 2.1.5 : Mon Jul 07 2008 - 20:18:56 CDT