Microsoft Unicode Article Review

From: John Tisdale (jtisdale@ocean.org)
Date: Thu Aug 05 2004 - 15:52:07 CDT


    I'm in the early stages of writing an article for Microsoft for publication
    on Developing Multilingual Web Sites. I want to include a brief overview of
    Unicode. I must say that I read a lot of contradictory information on the
    topic online from various sources. I've done my best to differentiate fact
    from fiction so that I can provide readers with an accurate introduction to
    the topic.

     

    I would really appreciate some review of the following segment of the
    article (in first draft form) for accuracy. Any technical corrections or
    general enhancements that anyone may wish to offer would be much
    appreciated. Please be gentle in dispensing criticism as this is just a
    starting point.

     

    Feel free to respond directly to me at jtisdale@ocean.org (as this topic
    probably doesn't warrant group discussion and bandwidth).

     

    Thanks very much, John

     

     

    Unicode Fundamentals

    For our discussion, there are two fundamental terms with which you must be
    familiar. First, a character set (more precisely, a coded character set) is
    an organized collection of characters in which each character is assigned a
    numeric code point; the collection of characters on its own, without those
    numbers, is called a character repertoire. Second, an encoding scheme is a
    system for representing those code points in a computing environment.
    Distinguishing between these two terms is crucial to understanding how to
    leverage the benefits of Unicode.

    Before Unicode, the majority of character sets contained only those
    characters needed by a single language or a small group of associated
    languages (such as ISO 8859-2, which covers a number of Central and Eastern
    European languages). The popularization of the Internet elevated the need
    for a more universal character set.

    In 1989, the International Organization for Standardization (ISO) published
    the first draft of a character set standard that supported a broad range of
    languages. It was called the ISO/IEC 10646 standard or the Universal
    Multiple-Octet Coded Character Set (commonly referred to as the Universal
    Character Set or UCS).

    Around the same time, a group of manufacturers in the U.S. formed the
    Unicode Consortium with a similar goal of creating a broad multilingual
    character set standard. The result of their work was the Unicode Standard.
    Since the early releases of these two standards, both
    groups have worked together closely to ensure compatibility between their
    standards. For details on the development of these standards, see
    http://www.unicode.org/versions/Unicode4.0.0/appC.pdf.

    When someone refers to Unicode, they are usually discussing the
    collective offerings of these two standards bodies (whether they realize it
    or not). Technically, this isn't accurate but it certainly does simplify the
    discussion. In this article, I will sometimes use the term Unicode in a
    generic manner to refer to these collective standards (with apologies to
    those offended by this generalization) and when applicable I will make
    distinctions between them (referring to the Unicode Standard as Unicode and
    the ISO/IEC 10646 standard as UCS).

     

    On Character Sets and Encoding Schemes

    First, you should recognize that both of these standards separate the
    character repertoire from the encoding scheme. Many people blur this
    distinction and describe Unicode as a 16-bit character set, yet neither
    standard defines it that way. The number of bits used is a property of the
    encoding scheme, not of the character set. A character set assigns each
    character a code point (a number, conventionally written in hexadecimal),
    not a particular pattern of bits and bytes. So, to say that Unicode is
    represented by some number of bits is not correct. If you want to talk
    about bits and bytes, you need to talk about encoding schemes.

    Each character in Unicode is represented by a code point, usually notated
    as the letter U (for Unicode), a + sign, and the code point's value as a
    hexadecimal number padded to at least four digits. For example, the English
    uppercase letter A is represented as U+0041.
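
    As a quick illustration (Python is used here purely for convenience;
    nothing about the snippet is part of either standard), you can recover a
    character's code point and print it in this notation:

        ch = "A"
        print(f"U+{ord(ch):04X}")   # ord() gives the code point: prints U+0041
        print(chr(0x0041))          # chr() goes the other way: prints A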

    One way of encoding this character would be with UTF-8, which encodes it as
    the single octet 0x41. Encoding the same character using UCS-2 produces the
    two octets 0x00, 0x41 (in big-endian byte order; more on byte order below).
    You can run the Windows Character Map (charmap) utility (if you are running
    Windows 2000, XP or 2003) to see how characters are mapped in Unicode on
    your system.
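
    The same example can be reproduced in a couple of lines of Python (the
    codec names are Python's own; its UTF-16 codec is used here because, for a
    character in the first 65,536 code points, it yields the same two octets
    that UCS-2 would):

        print("A".encode("utf-8"))      # b'A'     -> the single octet 0x41
        print("A".encode("utf-16-be"))  # b'\x00A' -> the octets 0x00, 0x41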

    For practical purposes, the Unicode Standard and UCS character repertoires
    are the same. Whenever one group publishes a new version of
    their standard, the other eventually releases a corresponding one. For
    example the Unicode Standard, Version 4.0 is the same as ISO/IEC 10646:2003.
    Hence, the code points are synchronized between these two standards.

    So, the differences between these two standards are not with the character
    sets themselves, but with the standards they offer for encoding and
    processing the characters contained therein. Both standards provide
    multiple encoding schemes (each with its own characteristics). A term that
    appears frequently in encoding-scheme definitions is octet, which is simply
    an 8-bit byte.

    UCS provides two encoding schemes. UCS-2 uses two octets (or 16 bits) and
    UCS-4 uses four octets (or 32 bits) to encode characters. Unicode has three
    primary encoding schemes: UTF-8, UTF-16 and UTF-32. UTF stands for Unicode
    (or UCS) Transformation Format. Although you will occasionally see someone
    refer to UTF-7, this is a specialized derivative designed so that its
    output stays entirely within 7-bit ASCII, for applications such as email
    systems that are not designed to handle non-ASCII data. As such, it is not
    part of the current definition of the Unicode Standard.

    One of the differences between Unicode and UCS encoding schemes is that the
    former provides variable-width encoding lengths and the latter does not.
    That is, UCS-2 is encoded with 2 bytes only and UCS-4 with 4 bytes only.
    Based on the naming convention, some people assume that UTF-8 is a
    single-byte encoding scheme, but this isn't the case. UTF-8 actually uses
    variable lengths from 1 to 4 octets. Additionally, UTF-16 encodes
    characters in either 2-octet or 4-octet lengths. UTF-32 can only encode with
    four octets.
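
    A short Python sketch makes these lengths concrete (the characters below
    are simply arbitrary examples drawn from different ranges):

        for ch in ("A", "é", "中", "\U00010400"):
            print(f"U+{ord(ch):04X}",
                  len(ch.encode("utf-8")),      # 1, 2, 3, 4 octets
                  len(ch.encode("utf-16-be")),  # 2, 2, 2, 4 octets
                  len(ch.encode("utf-32-be")))  # always 4 octets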

    Also, you should be aware that the byte order can differ with UCS-2,
    UTF-16, UCS-4 and UTF-32. The two variations are known as big-endian (BE),
    in which the most significant byte of each code unit comes first, and
    little-endian (LE), in which the least significant byte comes first. See
    Figure 1 for a synopsis of these encoding schemes.
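
    The byte-order difference is easy to see in Python (the plain "utf-16"
    codec is included to show the byte order mark, or BOM, that is often
    written at the start of a file to signal which order was used):

        ch = "A"
        print(ch.encode("utf-16-be"))  # b'\x00A' -> 0x00 0x41, most significant byte first
        print(ch.encode("utf-16-le"))  # b'A\x00' -> 0x41 0x00, least significant byte first
        print(ch.encode("utf-16"))     # native order preceded by a BOM,
                                       # e.g. b'\xff\xfeA\x00' on a little-endian machine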

     

    Choosing a Unicode Encoding Scheme

    In developing for the Web, most of your choices for Unicode encoding schemes
    will have already been made for you when you select a protocol or
    technology. Yet, you may find instances in which you will have the freedom
    to select which scheme to use for your application (especially in customized
    desktop applications). In such cases, there are several dynamics that will
    influence your decision.

    First, it should be stated that there isn't necessarily a right or wrong
    choice when it comes to a Unicode encoding scheme. In general, UCS confines
    itself largely to defining the character set and its encoding forms,
    whereas the Unicode Standard adds more precise rules for conformance and
    for processing the encoded text. So, which you choose may depend upon how
    much precise definition you want versus freedom in tailoring the standard
    to your application.

    The variable-length capability of UTF-8 may give you greater flexibility in
    your application (the number of octets used varies with the character being
    encoded). Yet, if you are designing an application that needs to parse
    Unicode at the byte level, the variable length of UTF-8 will require more
    complex algorithms than the fixed-length encoding schemes of UCS (granted,
    you could use the fixed-length UTF-32, but if you don't need to encode more
    than 65,536 characters you would be using twice as much space as the
    fixed-length UCS-2 scheme).
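
    The following sketch hints at that extra complexity: with UTF-8, even
    finding out how many octets a character occupies means inspecting its lead
    byte, whereas a UCS-2 unit is always exactly two octets (the helper
    function is ours, not part of any standard library):

        def utf8_sequence_length(lead_byte: int) -> int:
            """Number of octets in the UTF-8 sequence starting with lead_byte."""
            if lead_byte < 0x80:
                return 1                      # 0xxxxxxx: ASCII range
            if lead_byte >> 5 == 0b110:
                return 2                      # 110xxxxx
            if lead_byte >> 4 == 0b1110:
                return 3                      # 1110xxxx
            if lead_byte >> 3 == 0b11110:
                return 4                      # 11110xxx
            raise ValueError("continuation byte or invalid lead byte")

        data = "Aé中".encode("utf-8")
        print(utf8_sequence_length(data[0]))  # 1 (the 'A')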

    Because UTF-8 is a byte-oriented encoding and Unicode's first 128 code
    points precisely match ASCII, UTF-8 represents those characters with single
    bytes identical to their ASCII values, giving you Unicode and ASCII
    compatibility at the same time (talk about having your cake and eating it
    too). So, for cases in which
    maintaining ASCII compatibility is highly valued, UTF-8 makes an obvious
    choice. This is one of the primary reasons that Active Server Pages and
    Internet Explorer use UTF-8 encoding for Unicode.
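
    In Python terms (again purely as an illustration), an ASCII-only string
    produces exactly the same bytes whether you encode it as ASCII or as UTF-8:

        text = "Hello, world"
        print(text.encode("utf-8") == text.encode("ascii"))  # True: identical bytes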

    Yet, if you are working with an application that must parse and manipulate
    text at the byte-level, the costliness of variable length encoding will
    probably outweigh the benefits of ASCII compatibility. In such a case the
    fixed length of UCS-2 will usually prove the better choice. This is why
    Windows NT and subsequent Microsoft operating systems, SQL Server 7 and
    later, XML, Java, COM, ODBC, OLE DB and the .NET Framework all represent
    text with 16-bit code units (UCS-2 originally; newer versions generally
    treat those units as UTF-16). The uniform length of these code units
    provides a good foundation when it comes to complex data manipulation.
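
    A small sketch of why fixed-width units simplify things: with a 2-octet
    encoding, the byte offset of the n-th character is simply n * 2, so you can
    jump straight to it (the sample string is an arbitrary BMP-only example):

        data = "naïve".encode("utf-16-be")  # BMP-only text: 2 octets per character
        n = 3                               # index of the character we want ('v')
        print(data[n * 2 : n * 2 + 2].decode("utf-16-be"))  # prints v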

    If, on the other hand, you are creating an application that must handle
    characters beyond the first 65,536 code points (for example, the rarer
    ideographs that can come into play when displaying several Asian languages
    together), UCS-2 alone will not be enough: you will need UTF-32, UCS-4, or
    UTF-16 with its surrogate-pair mechanism.
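
    For example (U+20000, the first CJK Extension B ideograph, serves here as a
    sample supplementary character):

        ch = "\U00020000"
        print(ch.encode("utf-16-be").hex())  # d840dc00 -> a surrogate pair (4 octets)
        print(ch.encode("utf-32-be").hex())  # 00020000 -> a single 4-octet unit
        # UCS-2 has no way to represent this code point at all.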

    There are other technical differences between these standards that you may
    want to consider that are beyond the scope of this article (such as how
    UTF-16 supports surrogate pairs but UCS-2 does not). For a more detailed
    explanation of Unicode, see the Unicode Consortium's article The Unicode®
    Standard: A Technical Introduction
    (http://www.unicode.org/standard/principles.html) as well as Chapter 2 of
    the Unicode Consortium's The Unicode Standard, Version 4.0
    (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11178).

    The separation that Unicode provides between the character set and the
    encoding scheme allows you to choose the smallest and most appropriate
    encoding scheme for referencing all of the characters you need for a given
    application (thus providing considerable power and flexibility). Unicode is
    an evolving standard that continues to be refined and extended.

     


