The World Wide Web (WWW) is a collection of interoperating applications that exchange data using various protocols and formats. A large part of the data exchanged is text. In order for this text to be handled correctly independent of character encoding, format, protocol, or application, a clear understanding of character encoding and processing issues, i.e. a Character Model, is necessary.
The paper will discuss the various aspects of the character model. The base part of the character model deals with character encoding, including issues such as the distinction between bytes, characters, and glyphs, and recommendations on escaping techniques to include any Unicode character in any character encoding. This part is based on the model of RFC 2070 (Internationalization of HTML) and includes experience from HTML 4.0, XML 1.0, and CSS.
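The distinction between characters and bytes, and the idea of escaping characters that an encoding cannot represent, can be illustrated with a minimal Python sketch. The sample string and variable names are assumptions chosen for illustration, and numeric character references (as used in HTML and XML) stand in for just one of the possible escaping techniques; glyphs are a rendering matter and are not modelled here.

```python
import html

# Sample text (an assumption for illustration): "Grüße €"
text = "Gr\u00fc\u00dfe \u20ac"

# The same characters map to different byte sequences in different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

char_count = len(text)          # number of Unicode code points
utf8_count = len(utf8_bytes)    # number of bytes in UTF-8
utf16_count = len(utf16_bytes)  # number of bytes in UTF-16 (big-endian, no BOM)

# Escaping: characters outside the target encoding (here ASCII) are replaced
# by numeric character references, so the text survives a limited encoding.
escaped = text.encode("ascii", errors="xmlcharrefreplace")

# Unescaping restores the original character sequence.
restored = html.unescape(escaped.decode("ascii"))

print(char_count, utf8_count, utf16_count)
print(escaped)
print(restored == text)
```

The character count stays the same whichever encoding is chosen, while the byte count varies; the escaped form contains only ASCII bytes yet still identifies every original character.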
With the WWW changing more and more from a one-way content-delivery system to a very large integrated application, more areas of character handling seem to need clear specifications. This applies in particular to the handling of canonical equivalences (precomposed vs. decomposed) and to character indexing.
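Both issues can be made concrete with a short Python sketch. It uses the standard `unicodedata` module; the sample strings are assumptions for illustration, and normalizing to NFC is shown only as one possible way to compare canonically equivalent strings, not as the approach the character model prescribes.

```python
import unicodedata

# Canonical equivalence: "é" as one precomposed code point versus
# "e" followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

equal_raw = precomposed == decomposed  # code point comparison
equal_nfc = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Character indexing: the "length" of a string, and therefore the meaning
# of an index into it, depends on the unit being counted.
sample = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (outside the BMP), 'b'

code_points = len(sample)
utf16_units = len(sample.encode("utf-16-le")) // 2
utf8_bytes = len(sample.encode("utf-8"))

print(equal_raw, equal_nfc)
print(code_points, utf16_units, utf8_bytes)
```

Two applications that index the same text by different units (code points, UTF-16 code units, or bytes) will disagree about where the final `b` sits, which is why a shared definition of indexing matters.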
As the character model is still under development and will evolve as new needs arise, the actual presentation may differ somewhat from this summary.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
GMS is pleased to be able to offer the International Unicode Conferences under an exclusive
license granted by the Unicode Consortium. All responsibility for conference finances and
operations is borne by GMS. The independent conference board serves solely at the pleasure
of GMS and is composed of volunteers active in Unicode and in international software
development. All inquiries regarding International Unicode Conferences should be addressed
Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.
24 January 1999, Webmaster