Re: Getting A Newb Started

From: David Starner
Date: Mon Jul 07 2008 - 19:33:06 CDT

    On Mon, Jul 7, 2008 at 7:17 PM, William J Poser wrote:
    > Of course you want to be prepared for any possible input, but
    > in some cases you do know what the range of possible inputs is.
    > The input may not be coming directly from the user. It may be user
    > input that has already been cleaned or validated, or it may be
    > data that you yourself have generated.

    Most low-level string processing code shouldn't need to be rewritten
    for each application. If you've got UCS-2 only code, you have to
    reëvaluate it for each project, or introduce a subtle bug by the reuse
    of code. If you don't reuse code, you're probably rewriting code,
    which introduces bugs, especially in the parts that aren't
    well-tested--which for most people will include non-BMP characters.
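
    As a minimal sketch of the reuse problem described above (the
    function names are hypothetical and not from the original message),
    the following compares a routine that assumes UCS-2 with one that
    handles UTF-16 surrogate pairs. Both agree on BMP-only text and
    quietly disagree as soon as a non-BMP character shows up.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def count_ucs2(units):
    """UCS-2 assumption: every 16-bit code unit is one character."""
    return len(units)


def count_utf16(units):
    """Count code points, treating a high+low surrogate pair as one character."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A well-formed pair consumes two code units; anything else consumes one.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


bmp_only = utf16_units("abc")
with_gothic = utf16_units("a\U00010330c")  # U+10330 lies outside the BMP

print(count_ucs2(bmp_only), count_utf16(bmp_only))
print(count_ucs2(with_gothic), count_utf16(with_gothic))
```

    The UCS-2 version raises no error on the second input; it simply
    returns a different count, which is the kind of bug that survives
    testing on BMP-only data.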

    And just because you can clean and validate user input doesn't mean
    that you should arbitrarily forbid non-BMP characters. One of the
    principles of Unicode is that you can pass through arbitrary scripts
    and not worry about the difference.

    > I don't get the point. Whether you're dealing with one character or
    > many, life is simpler if they're all the same size.

    If I have to look up a single character in an array, it makes a
    difference. If I'm looking up multiple characters, the length of any
    one of them no longer matters; you're passing and returning

    > But for some purposes, yes, you can assume that input is BMP-only.
    > Not all input comes direct from the user.

    Even for the times that you can assume integer input is positive, you
    generally need to guard that code carefully with run-time tests. I
    would regard nothing less as reasonable and necessary for code that
    assumes the input is in the BMP. If simplicity is your goal, why not
    use UTF-32?
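
    A rough sketch of both suggestions, under the assumption that the
    text is already available as a decoded string (require_bmp and
    utf32_units are hypothetical helper names, not an existing API):
    the first function is a run-time guard for code that assumes
    BMP-only input, and the second shows the fixed-width alternative,
    where every code point occupies exactly one 32-bit unit.

```python
def require_bmp(text):
    """Return text unchanged, or raise if it contains a non-BMP character."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"non-BMP character U+{ord(ch):04X} at index {index}"
            )
    return text


def utf32_units(text):
    """Return one 32-bit code unit per code point (fixed width, no surrogates)."""
    data = text.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]


# BMP-only input passes the guard untouched.
print(require_bmp("abc"))

# In UTF-32 a non-BMP character is still a single unit, so indexing stays simple.
print([hex(u) for u in utf32_units("a\U00010330")])
```

    With the guard in place, a violated BMP assumption fails loudly at
    the boundary instead of corrupting data further down; with UTF-32
    the assumption is not needed at all.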
