Re: Chapter on character sets

From: Tony Graham
Date: Mon Jun 19 2000 - 21:17:45 EDT

At 15 Jun 2000 04:02 -0800, Lars Marius Garshol wrote:
> I'm writing a book on XML programming for Prentice-Hall, and included
> a chapter on basic character handling issues as they relate to XML.
> This chapter is 95% finished now, but as I'm not really an expert on
> this and it is so easy to make mistakes when writing about i18n issues
> I would be glad if people here could read through it and tell me if
> they see any mistakes (or other kinds of things that could be
> improved).

A few points:

 - Your combining characters example puts the non-spacing mark before
   the base character. Unicode would put it after.

 - You don't address the fact that, although XML lets you use both
   precomposed characters and combining characters, the XML processor
   doesn't equate equivalent sequences. For example, an end-tag
   containing a precomposed character will not match a start-tag
   containing what, in Unicode, would be an equivalent base +
   combining mark sequence.

 - Thank you for mentioning my book, but its title is "Unicode: A

 - In section 1.3.4., you state that "most of the world only need the
   lower 256 characters". Saying "much of the world" is more

 - In the same section, you state that UCS-2 was replaced by UTF-16 to
   handle the "potential 32 bits of ISO 10646". UTF-16 adds *only*
   another million or so characters, and it doesn't address the 31-bit
   address range of ISO/IEC 10646.

 - You refer to "surrogate characters", but code values in the
   Surrogates area aren't considered characters.

 - You say that UCS-4 lets you represent all of Unicode with no extra
   tricks. UCS-4 can represent more code points than Unicode attempts
   and more than XML allows as a character.

 - You say that UCS-4 and UTF-32 are nearly identical but don't go
   into the differences. The biggest and simplest difference is that
   UTF-32 is defined to only cover the range of characters you can
   represent with UTF-16, which happens to be the same characters that
   you can represent in XML.

 - In section 1.3.6., you state that XML refers to the Unicode
   definition of a character, but in section 1.1.1., you say that XML
   defines a character in terms of the ISO/IEC 10646 definition. XML
   really picks and chooses from both standards without fully
   conforming to either.

 - You state that ISO 8859-1 being interpreted as UTF-8 will 'result
   in illegal UTF-8 bit sequences'. Wouldn't it be more accurate to
   say that it would 'be seen as illegal UTF-8 bit sequences'?

 - I don't think that you give enough prominence to the notion that
   numeric character references are always to Unicode code values.

 - In this chapter, at least, you don't define or delineate decimal
   and hexadecimal character references.

 - Given the recent discussions on XML-Dev, you should also mention
   that most C0 control characters aren't allowed in XML documents,
   either directly or as numeric character references.

 - You could also mention that code values from the Surrogates area
   are not allowed in XML documents, and that you can't construct a
   character from Planes 1 to 16 using numeric references to two code
   values from the Surrogates area.

 - In this chapter, at least, you don't describe the formats of the
   XML Declaration and the Text Declaration.

 - You refer to XML "documents" having "other character encodings",
   but every parsed entity making up an XML "document" can have a
   different encoding if you like.

 - There was some discussion of "C99" on this list on May 24 that may
   be relevant to your discussion of the C programming language.
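
A few of the points above can be checked with a short script. The
following is a minimal sketch, not taken from the chapter: it assumes
Python 3 and uses the standard library's ElementTree (expat) parser as
one example of an XML processor. The helper name `parses` and the
sample strings are illustrative only.

```python
import unicodedata
import xml.etree.ElementTree as ET

def parses(doc: str) -> bool:
    """Return True if the example parser accepts doc as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Precomposed vs. base + combining mark (the mark follows the base).
precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT
equal_as_strings = precomposed == combining
equal_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# The processor compares names code point by code point, so an end-tag
# spelled with the combining sequence does not close a start-tag
# spelled with the precomposed character.
matching_tags_ok = parses(f"<{precomposed}></{precomposed}>")
mixed_tags_ok = parses(f"<{precomposed}></{combining}>")

# UTF-16 reaches 16 planes beyond the BMP; UCS-4 was defined on 31 bits.
supplementary_code_points = 0x10FFFF - 0xFFFF
ucs4_code_positions = 2 ** 31

# ISO 8859-1 bytes are not changed by being read as UTF-8; they are
# simply seen as an illegal UTF-8 sequence by the decoder.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    seen_as_legal_utf8 = True
except UnicodeDecodeError:
    seen_as_legal_utf8 = False

# Decimal and hexadecimal references both name Unicode code values,
# whatever the encoding of the entity they appear in.
refs_text = ET.fromstring("<a>&#233;&#xE9;</a>").text

# Most C0 controls and all surrogate code values are rejected, even as
# numeric references; a Plane 1 character is referenced directly.
c0_ref_ok = parses("<a>&#1;</a>")
surrogate_pair_ref_ok = parses("<a>&#xD800;&#xDC00;</a>")
plane1_text = ET.fromstring("<a>&#x10000;</a>").text
```

Other conforming processors should agree on the accept/reject results,
although the error messages they report will differ.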


Tony Graham
Mulberry Technologies, Inc.
17 West Jefferson Street    Direct Phone: 301/315-9632
Suite 207                          Phone: 301/315-9631
Rockville, MD 20850                  Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
