Re: Getting A Newb Started

From: William J Poser (wjposer@ldc.upenn.edu)
Date: Mon Jul 07 2008 - 20:07:28 CDT


    >Most low-level string processing code shouldn't need to be rewritten
    >for each application. If you've got UCS-2 only code, you have to
    >reevaluate it for each project, or introduce a subtle bug by the reuse
    >of code. If you don't reuse code, you're probably rewriting code,
    >which introduces bugs, especially in the parts that aren't
    >well-tested--which for most people will include non-BMP characters.

    True, but it depends what code you're writing. Some things are general
    purpose and should handle anything. Some things are very application
    specific.

    >And just because you can clean and validate user input doesn't mean
    >that you should arbitrarily forbid non-BMP characters. One of the
    >principles of Unicode is that you can pass through arbitrary scripts
    >and not worry about the difference.

    That's a principle of Unicode, not a design requirement for particular
    applications. It is perfectly appropriate to write programs
    using Unicode that are aimed at a particular language or writing
    system and so can make "arbitrary" assumptions.

    >>I don't get the point. Whether you're dealing with one character or
    >>many, life is simpler if they're all the same size.
            
    >If I have to look up a single character in an array, it makes a
    >difference. If I'm looking up multiple characters, it no longer
    >matters the length of any one of them; you're passing and returning
    >strings.

    True, for string lookup it doesn't matter. For determining how much storage
    to allocate assuming you know how many characters are in a string, it does.
    For moving from character to character, it does.
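
    A minimal Python sketch of those two points, offered as an illustration
    rather than anyone's actual code. The function names (storage_bytes,
    utf16_units, next_char_index) are hypothetical. It shows that the bytes
    needed for a known number of characters are fixed only in UTF-32, and
    that stepping to the next character in UTF-16 requires a surrogate check.

```python
def storage_bytes(text):
    """Bytes needed to store text in each encoding form (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),  # always 4 * number of characters
    }


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def next_char_index(units, i):
    """Index of the code unit that starts the character after the one at i.

    A high surrogate followed by a low surrogate is one character occupying
    two code units; everything else occupies a single code unit.
    """
    is_high = 0xD800 <= units[i] <= 0xDBFF
    has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
    return i + 2 if is_high and has_low else i + 1


if __name__ == "__main__":
    sample = "a\U00010400b"  # U+10400 lies outside the BMP
    print(storage_bytes(sample))
    units = utf16_units(sample)
    i = 0
    while i < len(units):
        print("character starts at code unit", i)
        i = next_char_index(units, i)
```

    With UTF-32 the step is always i + 1 and the allocation is always four
    bytes per character, which is the simplicity being argued for here.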

    >>But for some purposes, yes, you can assume that input is BMP-only.
    >>Not all input comes direct from the user.

    >Even for the times that you can assume integer input is positive, you
    >generally need to guard that code carefully with run-time tests. I
    >would regard nothing less as reasonable and necessary for code that
    >assumes the input is in the BMP.

    I agree that you need run-time tests, but you can't put them everywhere.
    You have to have a few points at which you test and elsewhere rely
    on your code not to take the data out of bounds.
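
    One way to picture "a few points at which you test", sketched in Python
    under the assumption that all input enters through a single function.
    The names require_bmp and bmp_code_units are hypothetical, chosen only
    for this illustration.

```python
def require_bmp(text):
    """Boundary check: reject any character outside the BMP, else return text."""
    for position, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "non-BMP character U+%04X at position %d" % (ord(ch), position)
            )
    return text


def bmp_code_units(text):
    """Internal helper that relies on the boundary check having been done.

    Because every character is in the BMP, each one maps to exactly one
    16-bit code unit, so no surrogate handling is needed here.
    """
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    checked = require_bmp("na\u00efve")   # tested once, at the point of entry
    print(bmp_code_units(checked))        # downstream code does not re-test
```

    The test happens once where data comes in; code past that point trusts
    that nothing it does can introduce a non-BMP character.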

    >If simplicity is your goal, why not use UTF-32?

    That is what I normally do. You seem to think that I am advocating
    UTF-16. Far from it. I never use it. I was simply listing pros and cons
    of the various formats.


