Re: Getting A Newb Started

From: Mike
Date: Mon Jul 07 2008 - 22:19:36 CDT

    >> Writing your code with the assumption that you're dealing with BMP-
    >> only is nonetheless still a bad idea, since the day will inevitably
    >> come when you want to re-use it in a situation where the assumption is
    >> false. Best to write for Unicode with as much generality as possible
    >> from the get-go rather than having to rewrite later.

    In my own code, I solved this by creating iterators for the various
    UTF encoding forms. For example, when you ask the UTF-16 iterator for
    the next character, it examines the next two bytes of the String to
    determine whether they form a surrogate. If they don't, it returns a
    uint32 holding the code point and advances two bytes. If they form a
    high (lead) surrogate, it checks that the following two bytes form a
    low (trail) surrogate, combines the pair into the effective code
    point, and returns that, advancing four bytes.
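
    To make that concrete, here is a rough Python sketch of such a UTF-16
    iterator. It is not my actual code: the name iter_utf16_be, the
    big-endian byte order, and the choice to raise ValueError on malformed
    input are all assumptions made for the illustration.

```python
from typing import Iterator


def iter_utf16_be(data: bytes) -> Iterator[int]:
    """Yield code points from UTF-16 bytes (big-endian is assumed here)."""
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated code unit at end of input")
        unit = int.from_bytes(data[pos:pos + 2], "big")

        if 0xD800 <= unit <= 0xDBFF:
            # High (lead) surrogate: the next two bytes must be a low surrogate.
            if pos + 4 > len(data):
                raise ValueError("high surrogate at end of input")
            low = int.from_bytes(data[pos + 2:pos + 4], "big")
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            # Combine the pair into the effective code point, advance 4 bytes.
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pos += 4
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is malformed.
            raise ValueError("unpaired low surrogate")
        else:
            # Ordinary BMP code unit: it is the code point, advance 2 bytes.
            yield unit
            pos += 2
```

    Feeding it the bytes 00 41 D8 34 DD 1E gives U+0041 followed by
    U+1D11E, so the caller never sees the surrogates themselves.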

    By creating these iterators, I was able to write all my other
    higher-level code independent of the encoding form in use. For
    example, I can normalize UTF-8 input directly into UTF-16 output
    (which requires my Char class to turn a uint32 into the proper
    sequence of bytes for the output encoding). One of these days I plan
    to write a GB18030 iterator (if I can ever find a decent reference on
    how to en/decode it), and all the high-level functions will "just
    work" without even knowing the form of the original data.
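
    The output side is the mirror image: something has to turn a code
    point back into bytes. A minimal sketch of that step follows. The
    function names encode_utf8 and encode_utf16_be are made up for this
    example (they stand in for what my Char class does), and big-endian
    UTF-16 is again an assumption.

```python
def _check_scalar(cp: int) -> None:
    """Reject values that are not Unicode scalar values."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    _check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf16_be(cp: int) -> bytes:
    """Encode one code point as 2 or 4 big-endian UTF-16 bytes."""
    _check_scalar(cp)
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    # Supplementary plane: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

    With a decoder on one side and an encoder on the other, converting
    between forms is just a loop over code points, and nothing in the
    middle needs to know about bytes.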

    This approach lets you separate the input byte stream processing from
    the rest of the code, and even allows you to expand your ability to
    handle different encodings as the demand for them surfaces, with
    minimal pain and effort. If you think UTF-16 is right today, perhaps
    tomorrow you will be rewriting everything for UTF-8. If you follow my
    example, though, you will not need to rewrite anything, and will be
    able to support as many different encodings as you eventually need.
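
    As an illustration of that separation, the sketch below keeps all
    byte handling in one function and lets the high-level code see only
    code points. To keep it short it leans on Python's built-in codecs
    rather than hand-written iterators, which is a substitution made for
    this example; code_points and count_non_bmp are hypothetical names.

```python
from typing import Iterable, Iterator


def code_points(data: bytes, encoding: str) -> Iterator[int]:
    """Byte-level layer: decode `data` and yield one code point at a time.

    Python's standard codecs stand in for the per-encoding iterators here.
    """
    for ch in data.decode(encoding):
        yield ord(ch)


def count_non_bmp(cps: Iterable[int]) -> int:
    """High-level layer: count code points outside the BMP.

    It only ever sees integers, so it has no idea what the input encoding was.
    """
    return sum(1 for cp in cps if cp > 0xFFFF)


if __name__ == "__main__":
    text = "A\U0001D11E"  # one BMP character, one supplementary character
    for enc in ("utf-8", "utf-16-be", "gb18030"):
        print(enc, count_non_bmp(code_points(text.encode(enc), enc)))
```

    Supporting another encoding means touching only the byte-level
    layer; count_non_bmp and anything else written against code points
    stays exactly as it is.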

