RE: Software support costs (was: Nicest UTF

From: Carl W. Brown (
Date: Sat Dec 11 2004 - 10:32:18 CST


    >>However, within the program itself UTF-8 presents a
    >>problem when looking for specific data in memory buffers.
    >>It is nasty, time consuming and error prone. Mapping
    >>UTF-16 to code points is a snap as long as you
    >>do not have a lot of surrogates. If you do then probably
    >>UTF-32 should be considered.
    > This is not demonstrated by experience. Parsing UTF-8 or
    > UTF-16 is not complex, even in the case of random accesses
    > to the text data, because you always have a bounded and
    > small limit to the number of steps needed to find
    > the beginning offset of a fully encoded code point: for
    > UTF-16, this means at most 1 range test and 1 possible
    > backward step. For UTF-8, this limit for random accesses
    > is at most 3 range tests and 3 possible backward steps.
    > UTF-8 and UTF-16 are very easily supporting backwards and
    > forwards enumerators; so what else do you need to perform
    > any string handling?
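
    For reference, the bounded backward scan described in the quoted text
    could be sketched in Python roughly as below. This is only a minimal
    sketch that assumes well-formed input; the names utf8_start and
    utf16_start are made up for illustration and come from no library.

```python
from typing import Sequence


def utf8_start(buf: bytes, i: int) -> int:
    """Return the offset of the lead byte of the code point covering buf[i].

    Assumes buf is well-formed UTF-8, so at most 3 continuation
    bytes (0x80..0xBF) ever need to be stepped over.
    """
    steps = 0
    while steps < 3 and i > 0 and 0x80 <= buf[i] <= 0xBF:
        i -= 1          # continuation byte: step back toward the lead byte
        steps += 1
    return i


def utf16_start(units: Sequence[int], i: int) -> int:
    """Same idea for 16-bit code units: one range test, one backward step."""
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF:   # low (trail) surrogate
        i -= 1
    return i
```

    Here "a" followed by U+20AC and U+1D11E encodes to 8 UTF-8 bytes, and
    utf8_start on the last byte (offset 7) steps back three times to
    offset 4, the lead byte of the 4-byte sequence.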

    Sorry, I was unclear. I was thinking of raw data displayed in hex, for
    example in a sniffer trace, a debugger, or a memory dump.

    In this case an algorithm that is very simple for a program is not easy
    to carry out by hand: to convert UTF-8 to code points manually you have
    to disassemble the hex into bits and recombine the bits. With UTF-16, at
    worst you have to flip the hex digits of each little-endian code unit,
    except for surrogates, which should be few.
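
    To make the difference in manual effort concrete, here is a rough
    Python sketch of the two hand procedures. It handles one code point at
    a time and assumes well-formed input; the function names
    decode_utf8_hex and decode_utf16le_hex are hypothetical, chosen only
    for this illustration.

```python
def decode_utf8_hex(hex_bytes: str) -> int:
    """Decode ONE UTF-8 sequence given as hex text, e.g. 'E2 82 AC'."""
    data = bytes.fromhex(hex_bytes)
    lead = data[0]
    if lead < 0x80:                 # 0xxxxxxx: ASCII, nothing to recombine
        cp, n = lead, 0
    elif lead >> 5 == 0b110:        # 110xxxxx: one continuation byte
        cp, n = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
        cp, n = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:      # 11110xxx: three continuation bytes
        cp, n = lead & 0x07, 3
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(data) != n + 1:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b >> 6 != 0b10:          # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
    return cp


def decode_utf16le_hex(hex_bytes: str) -> int:
    """Decode ONE UTF-16LE code point given as hex text, e.g. 'AC 20'."""
    data = bytes.fromhex(hex_bytes)
    # The "flip": swap each byte pair to get the 16-bit code unit.
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    if len(units) == 1:
        return units[0]             # BMP: the code unit is the code point
    hi, lo = units                  # surrogate pair, the rare case
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

    For U+20AC the dump shows E2 82 AC in UTF-8, which takes masking and
    shifting to decode, but AC 20 in UTF-16LE, which only needs the two
    bytes swapped to read 20AC.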

    Some dumps provide not only hex but also an ASCII representation of the
    data, and there UTF-8 is great for finding tags, as in XML. It allows
    you to analyze the tree because the tags show up in the ASCII side of
    the trace data display, making it easy to find your specific data
    elements as well as missing tags, tree structure errors, or data that
    is not well formed. It is rare that systems use non-ASCII tags.
    Certainly, since the tags are only used internally, there is no reason
    that they cannot be limited to ASCII just for improved support.
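
    As an illustration of the ASCII side of a dump, the following Python
    sketch renders bytes in the usual hex-plus-ASCII layout. The hexdump
    helper and the sample element are assumptions made up for this
    example, not the output format of any particular sniffer or debugger.

```python
def hexdump(data: bytes, width: int = 16) -> list:
    """Return dump lines: offset, hex bytes, then printable ASCII or '.'."""
    lines = []
    pad = width * 3 - 1             # width of a full hex column
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(
            chr(b) if 0x20 <= b < 0x7F else "." for b in chunk
        )
        lines.append(f"{off:08x}  {hex_part:<{pad}}  {ascii_part}")
    return lines


sample = "<name>Zo\u00eb</name>"    # hypothetical element with one non-ASCII letter
utf8_lines = hexdump(sample.encode("utf-8"), width=32)
utf16_lines = hexdump(sample.encode("utf-16-le"), width=64)
```

    In the UTF-8 dump the ASCII column reads <name>Zo..</name>, so the tags
    are visible and only the non-ASCII letter turns into dots. In the
    UTF-16 dump every tag character is separated by a dot for the zero
    byte, which makes the tags much harder to spot or search for.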

