Re: Getting A Newb Started

From: John H. Jenkins (jenkins@apple.com)
Date: Mon Jul 07 2008 - 18:18:17 CDT

    On Jul 7, 2008, at 3:19 PM, William J Poser wrote:

    > There's no way to avoid using more than one byte per character if
    > you're using Unicode since there are more than 256 characters. If
    > you use UTF-32, every char is four bytes. If you use UTF-8, characters
    > take from one to four bytes depending on where the corresponding
    > codepoint is. If you use UTF-16, every character in the BMP is two
    > bytes, any character outside of the BMP takes four bytes.
    >

    The fixed width of UTF-32 isn't as much of an advantage as it sounds,
    since in most Unicode processes you need to be prepared to deal with
    multiple characters at once anyway (a base letter followed by
    combining marks, for example).
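
    To make the sizes in the quoted paragraph concrete, here is a rough
    Python sketch. The helper name encoded_sizes and the sample characters
    are my own choices for illustration, not anything from the thread. The
    last few lines show one case where a single displayed letter is two
    code points, so even four-byte UTF-32 units do not give you one unit
    per user-visible character.

```python
import unicodedata


def encoded_sizes(ch):
    """Return how many bytes one code point occupies in each encoding form.

    encoded_sizes is a hypothetical helper written for this example.
    The "-le" codec variants are used so no byte order mark is counted.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# ASCII letter, Latin-1 letter, CJK ideograph (all BMP), then a non-BMP symbol.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible letter,
# two code points, hence eight bytes in UTF-32.
decomposed = "e\u0301"
print(len(decomposed), len(decomposed.encode("utf-32-le")))

# NFC normalization composes this particular pair into U+00E9, but not
# every combining sequence has a precomposed form.
print(unicodedata.normalize("NFC", decomposed) == "\u00e9")
```

    The loop reports 1/2/4, 2/2/4, 3/2/4 and 4/4/4 bytes for the four
    samples, which matches the ranges given in the quoted text.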

    > The downside of UTF-16 and UTF-8 is that characters are not the same
    > length, which makes processing more complicated. With UTF-16, however,
    > if you know that there are no characters outside the BMP, every
    > character is a constant two bytes wide.
    >

    That's the problem. You really can't make the assumption that you're
    dealing with BMP-only text.
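
    As an illustration of what goes wrong under a BMP-only assumption,
    here is a small Python sketch. The function names are hypothetical
    and the sample string is mine. It counts UTF-16 code units, checks
    whether a string stays inside the BMP, and computes the surrogate
    pair for a supplementary code point using the standard UTF-16
    arithmetic.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def is_bmp_only(text):
    """True if every code point in text is U+FFFF or below."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def surrogate_pair(ch):
    """Return the (high, low) surrogates for a code point above U+FFFF."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP code points are stored as a single code unit")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


text = "G clef: \U0001D11E"  # ends with U+1D11E MUSICAL SYMBOL G CLEF

print(len(text))               # code points: 9
print(utf16_code_units(text))  # UTF-16 code units: 10
print(is_bmp_only(text))       # False

high, low = surrogate_pair("\U0001D11E")
print(f"{high:04X} {low:04X}")  # D834 DD1E
```

    Code that treats every two bytes as one character would see ten
    "characters" here and could split the string between the two
    surrogates.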

    =====
    John H. Jenkins
    jenkins@apple.com


