Re: MSDN Article, Second Draft

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri Aug 20 2004 - 22:33:30 CDT

    John Tisdale wrote:

    > Unicode Fundamentals

    > Early character sets were very limited in scope. ASCII required only 7 bits
    > to represent its repertoire of 128 characters. ANSI pushed this scope 8 bits
    > which represented 256 characters while providing backward compatibility with
    > ASCII. Countless other character sets emerged that represented the

    As is often the case, Unicode experts are not necessarily experts on
    'legacy' character sets and encodings. The 'official' name of 'ASCII' is
    ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode,
    I'm afraid you're spreading misinformation about what came before it.
    The sentence that 'ANSI pushed this scope ... represents 256 characters'
    is misleading. ANSI has nothing to do with the various single-, double-,
    and triple-byte character sets that make up single-byte and multibyte
    character encodings. They were devised and published by national and
    international standards organizations as well as various vendors. Perhaps
    you'd better just get rid of the sentence 'ANSI pushed ... providing
    backward compatibility with ASCII'.

    > characters needed by various languages and language groups. The growing
    > complexities of managing numerous international character sets escalated the

       numerous national and vendor character sets that are specific to a
    small subset of scripts/characters in use (or that can cover only a
    small subset of ....)

    > Two standards emerged about the same time to address this demand. The
    > Unicode Consortium published the Unicode Standard and the International
    > Organization for Standardization (ISO) offered the ISO/IEF 10646 standard.

    A typo: it's ISO/IEC, not ISO/IEF. Or perhaps it's not a typo; you
    consistently used ISO/IEF in place of ISO/IEC ;-)

    > Fortunately, these two standards bodies synchronized their character sets
    > some years ago and continue to do so as new characters are added.
    > Yet, although the character sets are mapped identically, the standards for
    > encoding them vary in many ways (which are beyond the scope of this
    > article).

    I'm afraid that 'Yet ...' can give the false impression that the Unicode
    Consortium and ISO/IEC have some differences in their encoding standards,
    especially considering that the sentence begins with 'although ....
    identically'.

    > Coded Character Sets

    > A coded character set (sometimes called a character repertoire) is a mapping
    > from a set of abstract characters to a set of nonnegative, noncontiguous
    > integers (between 0 and 1,114,111, called code points).

      A 'character repertoire' is different from a coded character set in
    that it's more like a set of abstract characters **without** numbers
    associated with them. (Needless to say, a 'coded character set' is a set
    of character-integer pairs.)
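
    Just to make that pairing concrete, a tiny Python sketch (only an
    illustration, nothing specific to your article):

        # A coded character set is exactly this character <-> integer pairing;
        # ord() gives the code point assigned to a character.
        for ch in [u'A', u'\u20ac', u'\uac00']:
            print(hex(ord(ch)))    # 0x41, 0x20ac, 0xac00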

    > Character Encoding Forms
    > The second component in Unicode is character encoding forms. Their purpose

    I'm not sure whether 'component' is the best word to use here.

    > The Unicode Standard provides three forms for encoding its repertoire
    > (UTF-8, UTF-16 and UTF-32).

    Note that ISO 10646:2003 also defines all three of them exactly the same
    way as Unicode does.

    > You will often find references to USC-2 and
    > USC-4. These are competing encoding forms offered by ISO/IEF 10646 (USC-2 is
    > equivalent to UTF-16 and USC-4 to UTF-32). I will not discuss the

    UCS-2 IS different from UTF-16. UCS-2 can only represent a subset of the
    characters in Unicode/ISO 10646 (namely, those in the BMP). BTW, it's not
    USC but UCS. Also note that UTF in UTF-16/UTF-32/UTF-8 stands for either
    'UCS Transformation Format' (UCS stands for Universal Character Set, ISO
    10646) or 'Unicode Transformation Format'.
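
    If it helps, here's a small Python sketch of that difference (U+1D11E,
    MUSICAL SYMBOL G CLEF, is just an arbitrary non-BMP example):

        import binascii
        # A non-BMP character needs two 16-bit code units (a surrogate pair)
        # in UTF-16; UCS-2 has no surrogate mechanism, so it cannot represent
        # such a character at all.
        print(binascii.hexlify(u'\U0001D11E'.encode('utf-16-be')))  # d834dd1e
        # A BMP character fits in a single 16-bit code unit:
        print(binascii.hexlify(u'\uac00'.encode('utf-16-be')))      # ac00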

    > significant enough to limit its implementation (as at least half of the 32
    > bits will contain zeros in the majority of applications). Except in some
    > UNIX operating systems and specialized applications with specific needs,

       Note that ISO C 9x specifies that wchar_t hold UCS-4/UTF-32 values
    when __STDC_ISO_10646__ is defined. Recent versions of Python also use
    UTF-32 internally.
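
    (For what it's worth, you can check which internal form a given Python
    build uses; on a 'wide' UCS-4 build sys.maxunicode is 0x10FFFF, on a
    'narrow' build it's 0xFFFF:)

        import sys
        print(hex(sys.maxunicode))   # 0x10ffff on a UCS-4/UTF-32 build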

    > UTF-32 is seldom implemented as an end-to-end solution (yet it does have its
    > strengths in certain applications).
    > UTF-16 is the default means of encoding the Unicode character repertoire
    > (which has perhaps played a role in the misnomer that Unicode is a 16-bit
    > character set).

       I would not say UTF-16 is the default means of encoding ..... It's
    probably the most widely used, but that's different from being the
    default ... unless you're talking specifically about Win32 APIs (you're
    not in this paragraph, right?)

    > UTF-8 is a variable-width encoding form based on byte-sized code units
    > (ranging between 1 and 4 bytes per code unit).

       The code unit of UTF-8 is an 8-bit byte, just as the code units of
    UTF-16 and UTF-32 are a 16-bit 'half-word' and a 32-bit 'word',
    respectively. A single Unicode character is represented with 1 to 4 code
    units (bytes), depending on which code point it is assigned in Unicode.
      Please see p. 73 of the Unicode Standard 4.0.
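
    For example, in Python:

        # The UTF-8 code unit is always one byte; what varies is how many
        # code units (1 to 4) a single character takes, depending on its
        # code point.
        for ch in [u'A', u'\u00e9', u'\uac00', u'\U0001D11E']:
            print(len(ch.encode('utf-8')))    # 1, 2, 3, 4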

    > In UTF-8, the high bits of each byte are reserved to indicate where in the
    > unit code sequence that byte belongs. A range of 8-bit code unit values are

    where in the code unit sequence that byte belongs.

    > reserved to indicate the leading byte and the trailing byte in the sequence.
    > By sequencing four bytes to represent a code unit, UTF-8 is able to
    > represent the entire Unicode character repertoire.

    By using one to four code units (bytes) to represent a character

    > Character Encoding Schemes

    > method. This issue is not relevant with UTF-8 because it utilizes individual
    > bytes that are encapsulated with the sequencing data (with bounded look
    > ahead).

       'because ....' reads as too cryptic. Why don't you just say that 'the
    code unit in UTF-8 is a byte, so there's no need for serialization'
    (i.e., sequences of code units in UTF-8 are identical to sequences of
    bytes in UTF-8)?
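
    Again a quick Python sketch, just to make the byte-order point concrete:

        import binascii
        s = u'\uac00'
        # UTF-16 code units are 16 bits wide, so an encoding *scheme* must
        # pick a byte order (or prepend a BOM):
        print(binascii.hexlify(s.encode('utf-16-be')))  # ac00
        print(binascii.hexlify(s.encode('utf-16-le')))  # 00ac
        # UTF-8 code units are single bytes, so there is nothing to reorder:
        print(binascii.hexlify(s.encode('utf-8')))      # eab080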

    > Choosing an Encoding Solution

    > high value in the multi-platform world of the Web. As such, HTML and current
    > versions of Internet Explorer running on Windows 2000 or later use the UTF-8
    > encoding form. If you try to force UTF-16 encoding on IE, you will encounter
    > an error or it will default to UTF-8 anyway.

    I'm not sure what you're trying to say here, although I can't agree with
    you more that UTF-8 is the most sensible choice to transmit information
    (**serve** documents) over 'mostly' byte-oriented protocols/media such
    as internet mail and html/xml (html/xml can be in UTF-16/UTF-32 as
    well). As a web user agent/**client**, MS IE can (and must) render
    documents in UTF-16 just as well as documents in UTF-8 and many other
    character encodings. It even supports UTF-7.

    > valuable asset. The extra code and processor bandwidth required to
    > accommodate variable-width code units can outweigh the cost of using 32-bits
    > to represent each code unit.

       You keep misusing 'code unit'. Code units cannot be of variable
    width; the code unit is fixed in each encoding form: 8 bits in UTF-8,
    16 bits in UTF-16, and 32 bits in UTF-32. The last sentence should end
    with 'to represent each character'.

    > In such cases, the internal processing can be done using UTF-32 encoding and
    > the results can be transmitted or stored in UTF-16

       can be transmitted or stored in UTF-16 or UTF-8.
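
    In Python terms, for instance, you might process text as code points
    (effectively fixed-width) and only serialize at the I/O boundary:

        # Any per-character processing happens over code points; the compact
        # form is chosen only when storing or transmitting.
        text = u'caf\u00e9 \uac00'
        processed = u''.join(reversed(text))     # stand-in for real processing
        data = processed.encode('utf-8')          # or 'utf-16' for storage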

       Hope this helps,

       Jungshik


