Re: Getting A Newb Started

From: John H. Jenkins
Date: Mon Jul 07 2008 - 18:18:17 CDT

    On Jul 7, 2008, at 3:19 PM, William J Poser wrote:

    > There's no way to avoid using more than one byte per character if
    > you're using Unicode since there are more than 256 characters. If
    > you use UTF-32, every char is four bytes. If you use UTF-8, characters
    > take from one to four bytes depending on where the corresponding
    > codepoint is. If you use UTF-16, every character in the BMP is two
    > bytes, any character outside of the BMP takes four bytes.
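
    For anyone who wants to check the sizes quoted above, here is a
    minimal Python sketch. The sample characters and the helper name
    encoded_sizes are just illustrative choices, not anything from the
    original thread.

```python
# Encoded size of one code point in each Unicode encoding form.
SAMPLES = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, Latin-1 range",
    "\u4e2d": "U+4E2D, BMP CJK ideograph",
    "\U00020000": "U+20000, outside the BMP",
}

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single code point."""
    # The -le codecs are used so Python does not prepend a byte order mark.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch, label in SAMPLES.items():
    print(label, encoded_sizes(ch))
```

    UTF-8 grows from one to four bytes across these samples, UTF-16 is
    two bytes until the code point leaves the BMP, and UTF-32 is always
    four.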

    The fixed width of UTF-32 isn't as much of an advantage as it sounds,
    since in most Unicode processes you need to be prepared to deal with
    multiple characters at once anyway.
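
    One common case of this, assuming the point is about combining
    sequences, is a base letter followed by a combining mark. The sketch
    below (Python, with made-up helper names) shows that even in UTF-32
    a single user-perceived character can span more than one code point.

```python
import unicodedata

# Two spellings of what a reader sees as the single character "é".
DECOMPOSED = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

def utf32_bytes(s):
    """Byte length in UTF-32 (little-endian, no byte order mark)."""
    return len(s.encode("utf-32-le"))

def same_text(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(DECOMPOSED), utf32_bytes(DECOMPOSED))    # 2 code points, 8 bytes
print(len(PRECOMPOSED), utf32_bytes(PRECOMPOSED))  # 1 code point, 4 bytes
print(DECOMPOSED == PRECOMPOSED, same_text(DECOMPOSED, PRECOMPOSED))
```

    So code that steps through text one fixed-width unit at a time still
    has to look at neighbouring code points, whatever the encoding form.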

    > The downside of UTF-16 and UTF-8 is that characters are not the same
    > length, which makes processing more complicated. With UTF-16, however,
    > if you know that there are no characters outside the BMP, every
    > character is a constant two bytes wide.

    That's the problem. You really can't make the assumption that you're
    dealing with BMP-only text.
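
    A rough Python sketch of what goes wrong under the BMP-only
    assumption: a character outside the BMP becomes a surrogate pair, so
    the number of 16-bit code units no longer equals the number of
    characters. The function names are hypothetical, and
    count_code_points assumes well-formed UTF-16 input.

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_bmp_only(s):
    """True if every code point in s is at or below U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in s)

def count_code_points(units):
    """Count code points in well-formed UTF-16 code units.

    Trailing (low) surrogates, 0xDC00..0xDFFF, are skipped because each
    one only completes the code point started by the preceding unit.
    """
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

text = "a\U0001D11E"  # "a" followed by U+1D11E MUSICAL SYMBOL G CLEF
units = utf16_code_units(text)
print([hex(u) for u in units])   # three code units
print(count_code_points(units))  # two code points
print(is_bmp_only(text))
```

    Code that treats each two-byte unit as a character would report
    three characters here and could split the pair in the middle.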

    John H. Jenkins
