Re: ANSI and Unicode for x00 - xFF

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Oct 26 2005 - 12:20:32 CST

    On Wed, 26 Oct 2005, Velasquez, Carlos wrote:

    > I am new to this list and somewhat new to the Unicode standard.

    Welcome. You might find the FAQ at http://www.unicode.org useful, since it
    addresses some of the questions you asked. Admittedly, parts of the FAQ
    are rather hard reading. I hope you can find a suitable tutorial.

    > I am hoping someone can help me understand the difference between ANSI
    > and UTF-8 for characters in the domain of x00 and xFF.

    First, the abbreviation "ANSI", when used to denote a character code,
    is a misnomer. There was once a draft by the American National Standards
    Institute. Microsoft created its own version of "Latin 1" and started
    calling it "ANSI", but ANSI never approved it. The Microsoft code
    commonly called "ANSI" is properly called "windows-1252" (the official
    MIME encoding name) or "Windows Latin 1" (a common descriptive name).

    Second, UTF-8 is one of the encodings you can use for Unicode (and often
    the best choice). It might be confusing to compare it with windows-1252.
    _Logically_, Unicode assigns a unique number to each character, and this
    number can be physically represented in different ways - in UTF-8, you use
    one to four bytes (octets) per character. For the first 256 code positions
    (0 to FF in hexadecimal), these numbers are the same as in ISO-8859-1,
    also known as ISO Latin 1. The difference between ISO Latin 1 and Windows
    Latin 1 is that code positions 128 to 159 decimal (80 to 9F in
    hexadecimal) are reserved for control characters in the former, but
    assigned (in part) to printable characters (mostly punctuation) in the
    latter.
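
    The difference in the 80 to 9F range can be checked with a short Python
    sketch. It assumes Python's codec names "iso-8859-1" and "cp1252" (the
    latter being Python's name for windows-1252); the sample bytes are chosen
    only for illustration.

```python
# Decode the same three bytes under ISO Latin 1 and Windows Latin 1.
# 0x80 and 0x93 lie in the disputed 80-9F range; 0xE9 lies above it.
raw = bytes([0x80, 0x93, 0xE9])

as_latin1 = raw.decode("iso-8859-1")  # each byte maps to the code point of the same number
as_cp1252 = raw.decode("cp1252")      # 0x80 and 0x93 map to printable characters

latin1_points = [hex(ord(ch)) for ch in as_latin1]
cp1252_points = [hex(ord(ch)) for ch in as_cp1252]

print(latin1_points)  # control-character code points for the first two bytes
print(cp1252_points)  # euro sign and left double quotation mark, then e-acute
```

    Only the first two bytes decode differently; 0xE9 is "é" in both codes.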

    > Are the 7 bit ASCII characters a subset of the 8 bit ANSI character?

    The ASCII characters have the same numbers in ASCII, ISO Latin 1, Windows
    Latin 1 ("ANSI"), and Unicode.
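
    As a rough check of that statement, the following Python sketch decodes
    each of the 128 ASCII byte values under four codecs and compares the
    result with the Unicode character of the same number. The function name
    is made up for this example.

```python
# Codec names as Python spells them; "cp1252" is windows-1252.
ENCODINGS = ["ascii", "iso-8859-1", "cp1252", "utf-8"]

def ascii_agrees_everywhere():
    """Return True if bytes 0x00-0x7F decode to the same character in every codec."""
    for code in range(128):
        byte = bytes([code])
        for name in ENCODINGS:
            if byte.decode(name) != chr(code):
                return False
    return True

print(ascii_agrees_everywhere())
```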

    > I understand that the 7 bit ASCII characters are definitely a subset of
    > the UTF-8 set but am not sure if ANSI is a subset of UTF-8.

    UTF-8 is an encoding, thus at a different conceptual level. However,
    each ASCII character is represented "as such" in UTF-8, as a single
    8-bit byte with the first bit zero, whereas "ANSI" characters outside
    ASCII have a completely different representation (each of them occupies
    at least two bytes).
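
    The byte counts can be illustrated with a small Python sketch. The
    helper name and the three sample characters are chosen for this example
    only; all three characters exist in windows-1252, where each takes one
    byte.

```python
def utf8_length(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# "e" is ASCII; e-acute (U+00E9) and the euro sign (U+20AC) are not.
samples = {ch: utf8_length(ch) for ch in ["e", "\u00e9", "\u20ac"]}
cp1252_lengths = {ch: len(ch.encode("cp1252")) for ch in samples}

print(samples)         # UTF-8: one byte for ASCII, more for the others
print(cp1252_lengths)  # windows-1252: always one byte
```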

    > Here is why I ask:
    > Our database contains name information for a Spanish population. As
    > such, we store names such as "Sérgio Murilo" in our database which is
    > set to Unicode UTF-8. However, when we generate files and specify the
    > file encoding to be ANSI, we get the character "é" in double byte (xC3
    > and xA9).

    That's to be expected. The letter "é" (e with acute accent) is outside
    the ASCII range, and it is represented as two octets (C3 and A9) in UTF-8.
    If you view a UTF-8 encoded document in a program that interprets its
    input as "ANSI", you will see two characters ("Ã©") in place of "é".
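
    The effect can be reproduced in Python. This is only a sketch of the
    general mechanism, not of the database or file-generation tool in
    question, whose behaviour is not known here; "cp1252" is Python's name
    for windows-1252.

```python
name = "S\u00e9rgio Murilo"  # the name from the question, with e-acute as U+00E9

utf8_bytes = name.encode("utf-8")        # e-acute becomes the two bytes C3 A9
misread = utf8_bytes.decode("cp1252")    # what an "ANSI" viewer shows for those bytes
as_cp1252 = name.encode("cp1252")        # a real windows-1252 file has the single byte E9

print(utf8_bytes)
print(misread)
print(as_cp1252)
```

    If the output file is really meant to be windows-1252, the text has to
    be converted from UTF-8 when it is written, as in the last line of the
    sketch, rather than just labelled differently.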

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

