Re: Getting A Newb Started

From: Doug Ewell (dewell@roadrunner.com)
Date: Mon Jul 07 2008 - 20:57:36 CDT

Next message: Doug Ewell: "Re: how to add all latin (and greek) subscripts"

Previous message: William J Poser: "Re: Getting A Newb Started"
In reply to: John H. Jenkins: "Re: Getting A Newb Started"
Next in thread: Kenneth Whistler: "Re: Getting A Newb Started"

John H. Jenkins <jenkins at apple dot com> wrote:

>> ... If you use UTF-32, every char is four bytes. If you use UTF-8,
>> characters take from one to four bytes depending on where the
>> corresponding codepoint is. If you use UTF-16, every character in the
>> BMP is two bytes, any character outside of the BMP takes four bytes.
>
> This isn't as much of an advantage as it sounds, since in most Unicode
> processes you need to be prepared to deal with multiple characters at
> once anyway.

I hear this argument every so often, from different people, and it has
never carried any weight with me. Sure, there are lots of situations
in text processing (Unicode or otherwise) where you need to deal with
more than one character at a time -- especially so with
Unicode, with its combining marks and such. But there are still many
other string-processing situations that require functions like Length
and IndexOf and Remove. The need to do that kind of
character-by-character processing hasn't vanished. Just last week I
wrote a program that operated on a fixed-column-width UTF-8 file, and to
do that, you have to deal with characters by position.
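
To illustrate the fixed-column case, here is a minimal Python sketch, not the program mentioned above. The function name `slice_columns`, the sample record, and its 6+3 column layout are all made up for the example. It assumes columns are counted in code points (not bytes, and not user-perceived characters, so text with combining marks would need extra care).

```python
def slice_columns(line_bytes: bytes, start: int, width: int) -> str:
    """Return `width` characters starting at 0-based column `start`.

    Columns are counted in code points, not bytes, so the line is
    decoded from UTF-8 before slicing.
    """
    text = line_bytes.decode("utf-8")
    return text[start:start + width]


# Hypothetical record: a 6-column name field followed by a 3-column code.
# U+00EB (e with diaeresis) takes two bytes in UTF-8, so the record is
# 9 characters long but 10 bytes long.
record = "Zo\u00eb   DEU".encode("utf-8")

name = slice_columns(record, 0, 6).rstrip()
code = slice_columns(record, 6, 3)

# Slicing the raw bytes at the same offsets lands one byte early.
naive_code = record[6:9]
```

Indexing by character position recovers the code field, while the same offsets applied to the raw UTF-8 bytes are thrown off by the one two-byte character earlier in the line.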

>> The downside of UTF-16 and UTF-8 is that characters are not the same
>> length, which makes processing more complicated. With UTF-16,
>> however, if you know that there are no characters outside the BMP,
>> every character is a constant two bytes wide.
>
> That's the problem. You really can't make the assumption that you're
> dealing with BMP-only text.

Agreed. UTN #12 notwithstanding, I'm with Bill Poser in preferring to
store Unicode text in memory as UTF-32 -- when I have to do it manually
at all, which is less and less often, as I complete the transition from
C++ and MFC to C# and .NET.
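
For reference, the size differences discussed in this thread can be checked with a short Python sketch. The helper name `encoded_sizes` is made up for the example, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: no BOM prepended
        "utf-32": len(ch.encode("utf-32-le")),
    }


ascii_char = encoded_sizes("A")           # U+0041, in the BMP
bmp_char = encoded_sizes("\u00e9")        # U+00E9, in the BMP
non_bmp = encoded_sizes("\U00010400")     # U+10400, outside the BMP
```

Only the UTF-32 figure stays constant across all three characters; the UTF-16 figure is two bytes for the BMP characters and four for the one outside the BMP.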

--
Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

