Re: Getting A Newb Started

From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 20:17:00 CDT

    >Yes, if you do *everything* in UTF-32, the same arguments
    >for string APIs would apply without having to do surrogate
    >detection at the point of parsing code point boundaries,
    >but there are a number of good reasons why people choose
    >to (or have to) process text in UTF-16, as well.

    For most purposes I do do everything in UTF-32. I read UTF-8,
    convert it to UTF-32, work on the UTF-32, and convert it to
    UTF-8 again on output. In a UTF-16 world that may not be the
    best approach, but in my overwhelmingly Unix world, the input
    I see is ASCII, UTF-8, or some parochial encoding. I don't think
    that I have ever encountered UTF-16 in the wild, though I have
    created it for testing purposes. Your mileage may vary.
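
A minimal sketch of that round trip, in Python rather than whatever the
original tools were written in. The function names are made up for
illustration, and reversing the text simply stands in for "work on the
UTF-32". The point is that with one integer per code point, indexing and
reordering can never split a character, so no surrogate detection is needed.

```python
def utf8_to_utf32(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points, one int per character."""
    return [ord(ch) for ch in data.decode("utf-8")]


def utf32_to_utf8(code_points: list[int]) -> bytes:
    """Encode a list of code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


def reverse_text(data: bytes) -> bytes:
    """Hypothetical processing step: reverse the text code point by code point."""
    code_points = utf8_to_utf32(data)
    # Fixed-width units: every list element is a whole code point.
    code_points.reverse()
    return utf32_to_utf8(code_points)


def utf16_unit_count(code_points: list[int]) -> int:
    """Count the UTF-16 code units the same text would need (for comparison)."""
    text = "".join(chr(cp) for cp in code_points)
    return len(text.encode("utf-16-le")) // 2
```

For input containing a character outside the BMP, such as U+1D11E, the
code point list has one entry for it, while the UTF-16 form needs two code
units (a surrogate pair) that any UTF-16 string routine has to keep together.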

    (The weirdest parochial encoding that I have encountered was
    one used by an Indian word processor whose native encoding I
    reverse-engineered. It was a stateful encoding in which the same
    codepoint could represent different characters depending on whether
    it was expecting a consonant or a vowel.)
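
To make "stateful" concrete, here is a toy decoder in Python. It is not the
word processor's actual encoding: the byte values, the two tables, and the
strict consonant/vowel alternation are all invented for illustration. It only
shows how one byte value can decode to different characters depending on what
the decoder expects next.

```python
# Hypothetical tables: the same byte maps to a different character per state.
CONSONANTS = {0x41: "\u0915", 0x42: "\u0916"}   # DEVANAGARI KA, KHA
VOWEL_SIGNS = {0x41: "\u093E", 0x42: "\u093F"}  # vowel signs AA, I


def decode_toy(data: bytes) -> str:
    """Decode the toy encoding, assuming consonant and vowel bytes alternate."""
    expecting_consonant = True
    out = []
    for byte in data:
        # The decoder state, not the byte alone, selects the lookup table.
        table = CONSONANTS if expecting_consonant else VOWEL_SIGNS
        out.append(table[byte])
        expecting_consonant = not expecting_consonant
    return "".join(out)
```

Here `decode_toy(bytes([0x41, 0x41]))` yields KA followed by the AA vowel
sign, even though both input bytes are identical. A decoder for such an
encoding cannot start in the middle of the data without knowing the state.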

    Bill


