Re: Chapter on character sets

From: Tony Graham (tgraham@mulberrytech.com)
Date: Mon Jun 19 2000 - 21:17:45 EDT

Next message: Tony Graham: "Re: UTF-8 and UTF-16 issues"
Previous message: J.Schneider@epixtech.com: "Re: Characters for Programming Languages"
Maybe in reply to: Keld Jørn Simonsen: "Re: Chapter on character sets"
Next in thread: Peter_Constable@sil.org: "Re: Chapter on character sets"

At 15 Jun 2000 04:02 -0800, Lars Marius Garshol wrote:
> I'm writing a book on XML programming for Prentice-Hall, and included
> a chapter on basic character handling issues as they relate to XML.
> This chapter is 95% finished now, but as I'm not really an expert on
> this and it is so easy to make mistakes when writing about i18n issues
> I would be glad if people here could read through it and tell me if
> they see any mistakes (or other kinds of things that could be
> improved).

A few points:

- Your combining characters example puts the non-spacing mark before
the base character. Unicode would put it after.

- You don't address the fact that in XML although you can use both
   precomposed characters and combining characters, the XML processor
   doesn't equate equivalent sequences, so, for example, an end-tag
   containing a precomposed character will not match a start-tag
   containing what, in Unicode, would be an equivalent base +
   combining mark sequence.

- Thank you for mentioning my book, but its title is "Unicode: A
Primer".

- In section 1.3.4., you state that "most of the world only need the
lower 256 characters". Saying "much of the world" is more
defensible.

- In the same section, you state that UCS-2 was replaced by UTF-16 to
   handle the "potential 32 bits of ISO 10646". UTF-16 adds *only*
   another million or so characters, and it doesn't address the 31-bit
   address range of ISO/IEC 10646.

- You refer to "surrogate characters", but code values in the
Surrogates area aren't considered characters.

- You say that UCS-4 lets you represent all of Unicode with no extra
tricks. UCS-4 can represent more code points than Unicode attempts
and more than XML allows as a character.

- You say that UCS-4 and UTF-32 are nearly identical but don't go
   into the differences. The biggest and simplest difference is that
   UTF-32 is defined to only cover the range of characters you can
   represent with UTF-16, which happens to be the same characters that
   you can represent in XML.

- In section 1.3.6., you state that XML refers to the Unicode
   definition of a character, but in section 1.1.1., you say that XML
   defines a character in terms of the ISO/IEC 10646 definition. XML
   really picks and chooses from both standards without fully
   conforming to either.

- You state that ISO 8859-1 being interpreted as UTF-8 will 'result
in illegal UTF-8 bit sequences'. Wouldn't it be more accurate to
say that it would 'be seen as illegal UTF-8 bit sequences'?

- I don't think that you give enough prominence to the notion that
numeric character references are always to Unicode code values.

- In this chapter, at least, you don't define or delineate decimal
and hexadecimal character references.

- Given the recent discussions on XML-Dev, you should also mention
that most C0 control characters aren't allowed in XML documents,
either directly or as numeric character references.

- You could also mention that code values from the Surrogates area
   are not allowed in XML documents, and that you can't construct a
   character from Planes 1 to 16 using numeric references to two code
   values from the Surrogates area.

- In this chapter, at least, you don't describe the formats of the
XML Declaration and the Text Declaration.

- You refer to XML "documents" having "other character encodings",
but every parsed entity making up an XML "document" can have a
different encoding if you like.

- There was some discussion of "C99" on this list on May 24 that may
be relevant to your discussion of the C programming language.
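
Several of the points above can be checked mechanically. What follows is a minimal Python sketch, not something taken from the chapter. The names (`is_xml_char`, `looks_like_utf8`, and the variables) are hypothetical, and the ranges in `is_xml_char` are transcribed from the Char production of XML 1.0. It illustrates that equivalent precomposed and combining sequences do not compare equal code point for code point, how far UTF-16 reaches, which code points XML allows, how ISO 8859-1 bytes are seen by a UTF-8 decoder, and that decimal and hexadecimal character references both name Unicode code values.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Precomposed character vs. base character followed by a combining mark.
# Unicode treats these as equivalent, but a code-point-for-code-point
# comparison (what an XML processor does with names) sees them as different.
precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # base "e", then COMBINING ACUTE ACCENT
names_match_raw = precomposed == decomposed
names_match_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# UTF-16 adds 1024 x 1024 surrogate pairs on top of the 16-bit range.
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1
supplementary = HIGH_SURROGATES * LOW_SURROGATES
last_utf16_code_point = 0xFFFF + supplementary


def is_xml_char(cp: int) -> bool:
    """Return True if cp is allowed by the XML 1.0 Char production.

    Most C0 controls, the Surrogates area, U+FFFE and U+FFFF are excluded.
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ISO 8859-1 bytes handed to a UTF-8 decoder are *seen as* illegal UTF-8.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

# Decimal and hexadecimal references to the same Unicode code value.
parsed_text = ET.fromstring("<a>&#233;&#xE9;</a>").text
```

Here `names_match_raw` is false while `names_match_nfc` is true, `last_utf16_code_point` works out to 0x10FFFF, and `looks_like_utf8(latin1_bytes)` is false because the lone 0xE9 byte starts a multi-byte sequence that never completes.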

Regards,

Tony Graham
======================================================================
Tony Graham mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9632
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT