At 15 Jun 2000 04:02 -0800, Lars Marius Garshol wrote:
> I'm writing a book on XML programming for Prentice-Hall, and included
> a chapter on basic character handling issues as they relate to XML.
> This chapter is 95% finished now, but as I'm not really an expert on
> this and it is so easy to make mistakes when writing about i18n issues
> I would be glad if people here could read through it and tell me if
> they see any mistakes (or other kinds of things that could be
A few points:
- Your combining characters example puts the non-spacing mark before
the base character. Unicode would put it after.
- You don't address the fact that in XML although you can use both
precomposed characters and combining characters, the XML processor
doesn't equate equivalent sequences, so, for example, an end-tag
containing a precomposed character will not match a start-tag
containing what, in Unicode, would be an equivalent base +
combining mark sequence.
- Thank you for mentioning my book, but its title is "Unicode: A
- In section 1.3.4., you state that "most of the world only need the
lower 256 characters". Saying "much of the world" is more
- In the same section, you state that UCS-2 was replaced by UTF-16 to
handle the "potential 32 bits of ISO 10646". UTF-16 adds *only*
another million or so characters, and it doesn't address the 31-bit
address range of ISO/IEC 10646.
- You refer to "surrogate characters", but code values in the
Surrogates area aren't considered characters.
- You say that UCS-4 lets you represent all of Unicode with no extra
tricks. UCS-4 can represent more code points than Unicode attempts
to cover, and more than XML allows as characters.
- You say that UCS-4 and UTF-32 are nearly identical but don't go
into the differences. The biggest and simplest difference is that
UTF-32 is defined to only cover the range of characters you can
represent with UTF-16, which happens to be the same characters that
you can represent in XML.
- In section 1.3.6., you state that XML refers to the Unicode
definition of a character, but in section 1.1.1., you say that XML
defines a character in terms of the ISO/IEC 10646 definition. XML
really picks and chooses from both standards without fully
conforming to either.
- You state that ISO 8859-1 being interpreted as UTF-8 will 'result
in illegal UTF-8 bit sequences'. Wouldn't it be more accurate to
say that it would 'be seen as illegal UTF-8 bit sequences'?
- I don't think that you give enough prominence to the notion that
numeric character references are always to Unicode code values.
- In this chapter, at least, you don't define or delineate decimal
and hexadecimal character references.
- Given the recent discussions on XML-Dev, you should also mention
that most C0 control characters aren't allowed in XML documents,
either directly or as numeric character references.
- You could also mention that code values from the Surrogates area
are not allowed in XML documents, and that you can't construct a
character from Planes 1 to 16 using numeric references to two code
values from the Surrogates area.
- In this chapter, at least, you don't describe the formats of the
XML Declaration and the Text Declaration.
- You refer to XML "documents" having "other character encodings",
but every parsed entity making up an XML "document" can have a
different encoding if you like.
- There was some discussion of "C99" on this list on May 24 that may
be relevant to your discussion of the C programming language.
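
To make a few of the points above concrete, here is a minimal Python
sketch. It is an illustration only, not code from the chapter, and the
function names are made up for this example. It shows that a
precomposed character and its base + combining mark sequence compare
unequal unless the application normalizes them itself, how a code point
in Planes 1 to 16 maps onto two UTF-16 code values from the Surrogates
area, and a range check written from the Char production of XML 1.0.

```python
import unicodedata

# Precomposed e-acute versus base letter + combining acute accent.
# The combining mark comes AFTER the base character.
precomposed = "\u00E9"
decomposed = "e\u0301"


def names_match_raw(a, b):
    """Compare code point by code point, with no normalization applied."""
    return a == b


def names_match_normalized(a, b):
    """Compare after NFC normalization (an application-level choice)."""
    nfc_a = unicodedata.normalize("NFC", a)
    nfc_b = unicodedata.normalize("NFC", b)
    return nfc_a == nfc_b


def to_surrogate_pair(code_point):
    """Split a code point in Planes 1 to 16 into two UTF-16 code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def is_xml_char(code_point):
    """Check a code point against the Char production of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)          # the only C0 controls allowed
        or 0x20 <= code_point <= 0xD7FF        # stops before the Surrogates area
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF   # Planes 1 to 16
    )


# UTF-16 adds 16 planes of 65,536 code points each: "a million or so".
EXTRA_UTF16_CODE_POINTS = 16 * 0x10000
```

With these definitions, names_match_raw(precomposed, decomposed) is
false while names_match_normalized(precomposed, decomposed) is true,
and is_xml_char() rejects both U+0001 and any code value in the range
D800 to DFFF.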
Tony Graham mailto:firstname.lastname@example.org
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9632
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
Mulberry Technologies: A Consultancy Specializing in SGML and XML