From: Carl W. Brown (email@example.com)
Date: Sat Dec 11 2004 - 10:32:18 CST
>>However, within the program itself UTF-8 presents a
>>problem when looking for specific data in memory buffers.
>>It is nasty, time consuming and error prone. Mapping
>>UTF-16 to code points is a snap as long as you
>>do not have a lot of surrogates. If you do then probably
>>UTF-32 should be considered.
> This is not demonstrated by experience. Parsing UTF-8 or
> UTF-16 is not complex, even in the case of random accesses
> to the text data, because you always have a bounded and
> small limit to the number of steps needed to find
> the beginning offset of a fully encoded code point: for
> UTF-16, this means at most 1 range test and 1 possible
> backward step. For UTF-8, this limit for random accesses
> is at most 3 range tests and 3 possible backward steps.
> UTF-8 and UTF-16 are very easily supporting backwards and
> forwards enumerators; so what else do you need to perform
> any string handling?
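The bounded backward scan described in the quoted text can be sketched
roughly as follows. This is only an illustration, assuming well-formed
input; the function names utf8_start and utf16_start are made up for
this example and do not come from any library.

```python
from typing import Sequence


def utf8_start(buf: bytes, i: int) -> int:
    """Return the offset of the lead byte of the code point covering buf[i].

    Hypothetical helper; assumes buf holds well-formed UTF-8.
    """
    steps = 0
    # Continuation bytes have the form 10xxxxxx. At most three range
    # tests and three backward steps are performed.
    while steps < 3 and i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    return i


def utf16_start(units: Sequence[int], i: int) -> int:
    """Return the index of the first code unit of the code point at units[i].

    Hypothetical helper; assumes units holds well-formed UTF-16 code units.
    """
    # One range test: a trail surrogate (DC00..DFFF) means the lead
    # surrogate sits exactly one unit earlier.
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF:
        return i - 1
    return i
```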
Sorry, but I was unclear. I was thinking of raw data displays in hex, for
example with a sniffer, a debugger, or a memory dump.
In this case, what is a very simple algorithm for a program is not easy
when you are manually converting from UTF-8 to code points by disassembling
the hex into bits and recombining the bits to find the code points. With
UTF-16, at most you may have to do a little-endian flip of the hex digits,
except for surrogates, which should be few.
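To make the difference concrete, here is a rough sketch of the two
conversions one would otherwise do by hand from a hex dump. It assumes
well-formed input given as space-separated hex bytes; the names
decode_utf8_hex and decode_utf16le_hex are invented for this example.

```python
def decode_utf8_hex(hex_bytes: str) -> list:
    """Turn hex bytes of well-formed UTF-8 into a list of code points."""
    data = bytes.fromhex(hex_bytes)
    code_points = []
    i = 0
    while i < len(data):
        b = data[i]
        # The lead byte says how many continuation bytes follow and
        # which of its own low bits carry payload.
        if b < 0x80:             # 0xxxxxxx: ASCII, read directly
            cp, extra = b, 0
        elif b >> 5 == 0b110:    # 110xxxxx: 5 payload bits, 1 more byte
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:   # 1110xxxx: 4 payload bits, 2 more bytes
            cp, extra = b & 0x0F, 2
        else:                    # assumed 11110xxx: 3 payload bits, 3 more
            cp, extra = b & 0x07, 3
        # Each continuation byte (10xxxxxx) contributes its low 6 bits.
        for c in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (c & 0x3F)
        code_points.append(cp)
        i += 1 + extra
    return code_points


def decode_utf16le_hex(hex_bytes: str) -> list:
    """Turn hex bytes of well-formed UTF-16LE into a list of code points.

    Assumes an even number of bytes.
    """
    data = bytes.fromhex(hex_bytes)
    # The "flip": swap each byte pair to get the 16-bit code unit.
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            # Surrogate pair: the only case that needs arithmetic.
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(u)
            i += 1
    return code_points
```

For the euro sign, the dump shows E2 82 AC in UTF-8, which has to be
masked and shifted to reach U+20AC, while in UTF-16LE it shows AC 20,
which only needs the two bytes swapped.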
Because some dumps provide not only hex but also ASCII representations of
the data, UTF-8 is great for finding tags such as those in XML. It allows
you to analyze the tree because the tags show up on the ASCII side of the
trace data display, making it easy to find your specific data elements as
well as missing tags, tree structure errors, or data that is not well
formed. It is rare that systems use non-ASCII tags. Certainly, since the
tags are only used internally, there is no reason that they cannot be
limited to ASCII just for improved support.
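As an illustration of the ASCII-column point, the following minimal
sketch prints the same small XML fragment as a hex dump in UTF-8 and in
UTF-16LE. The hexdump function and its layout are invented for this
example and only loosely imitate what typical dump tools show.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format data as offset, hex bytes, and an ASCII column."""
    pad = width * 3 - 1  # width of a full row of "xx xx xx ..." text
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off : off + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as dots.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{off:08x}  {hex_part.ljust(pad)}  |{ascii_part}|")
    return "\n".join(lines)


xml = "<name>caf\u00e9</name>"
print(hexdump(xml.encode("utf-8")))
print(hexdump(xml.encode("utf-16-le")))
```

In the UTF-8 dump the tags <name> and </name> appear intact in the ASCII
column and only the accented letter turns into dots. In the UTF-16LE
dump every character is followed by a dot for its zero byte, so the tags
are much harder to spot or search for.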