Proposed DRAFT Unicode Technical Report #20

The Use of Unicode with Markup Languages

Revision	1
Authors	Martin Dürst (mduerst@w3.org), Mark Davis (mark@unicode.org), Hideki Hiura (hideki.hiura@eng.sun.com), and Asmus Freytag
Date	1999-07-05
This Version	http://www.unicode.org/unicode/reports/tr20/tr20-1.html
Previous Version	none
Latest Version	http://www.unicode.org/unicode/reports/tr20

Summary

This document contains guidelines on the use of the Unicode Standard in conjunction with markup languages.

Status of this document

This proposed draft is published for review purposes. This draft has been considered by the Unicode Technical Committee and approved as proposed draft for internal review by Unicode Members and members of W3C i18n WG. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document. It is intended that this document will become a joint Unicode - W3C document.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Introduction
General Considerations
List of Characters
Versioning
Conformance
References
Change History
Copyright

1. Introduction

The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons. It also provides specifications for use of these characters.

For document and data interchange, the Internet and the World Wide Web increasingly make use of marked-up text. In many instances, markup provides features that are the same as, or essentially similar to, those provided by formatting characters in the Unicode Standard for use in plain text. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language.

[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]

1.1 Notation

This report uses XML as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'html:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'html:' are assumed to include a namespace declaration of xmlns:html="..." (insert the appropriate URI for XHTML later).

Characters are denoted using the notation used in the Unicode Standard, i.e. U+ followed by their hexadecimal number. [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
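As a small illustration of this notation, the following Python sketch formats a single character as U+ followed by its hexadecimal number, padded to at least four uppercase digits. The helper name u_notation is hypothetical and chosen only for this example.

```python
def u_notation(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Uppercase hexadecimal, zero-padded to at least four digits.
    return f"U+{ord(ch):04X}"


print(u_notation("\uFFFC"))  # the object replacement character
```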

2. General Considerations

This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:

Linearity of text vs. hierarchy of markup structure
Coincidence (in most cases) of semantic markup and functions of control characters (e.g. <html:q> for insertions of fragments from another language,...)
Extensibility of markup
Problems with structural alignment between markup and control characters
Ambiguity or interference of control characters in markup source

3. List of Characters

The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.

Codepoints	Names/Description	Short Comment
U+202A .. U+202E	BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)	Strongly discouraged in HTML 4.0; RLM and LRM are allowed
U+2028 .. U+2029	Line and paragraph separator (under discussion)	Use <html:br />, <html:p></html:p>, or equivalent
U+206A .. U+206B	Activate/Inhibit Symmetric swapping	Deprecated in Unicode 3.0
U+206C .. U+206D	Activate/Inhibit Arabic form shaping	Deprecated in Unicode 3.0
U+206E .. U+206F	Activate/Inhibit National digit shapes	Deprecated in Unicode 3.0
U+FFF9 .. U+FFFB	Interlinear annotation characters	Use ruby markup
U+FFFC	Object replacement character (under discussion)	Use markup, e.g. HTML <object>
U+1xxxx????	Language Tag codepoints (if and when they will be encoded)	Use html:lang or xml:lang
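The table above can be turned into a simple checker. The following Python sketch is one possible way to do so and is not a normative implementation. The names UNSUITABLE_RANGES and find_unsuitable are hypothetical. The language tag codepoints are left out because the table does not give concrete values for them, and the two entries marked "under discussion" are included only provisionally.

```python
# Inclusive codepoint ranges copied from the table above.
# Language tag codepoints are omitted: the table gives no concrete values.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "Line and paragraph separator"),  # under discussion
    (0x206A, 0x206B, "Activate/Inhibit Symmetric swapping"),
    (0x206C, 0x206D, "Activate/Inhibit Arabic form shaping"),
    (0x206E, 0x206F, "Activate/Inhibit National digit shapes"),
    (0xFFF9, 0xFFFB, "Interlinear annotation characters"),
    (0xFFFC, 0xFFFC, "Object replacement character"),  # under discussion
]


def find_unsuitable(text):
    """Return (index, 'U+XXXX', description) for each listed character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for low, high, description in UNSUITABLE_RANGES:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", description))
                break
    return hits


for hit in find_unsuitable("abc\u202Adef\uFFFC"):
    print(hit)
```

A caller could use the returned positions to report the characters to the user or to decide how to handle them, following the per-character guidance in the subsections below.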

A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be mentioned/discussed:

Short description of semantics
Reason for inclusion in Unicode
Specific problems when used with markup
Other areas where problems may occur (e.g. plain text)
What kind of markup to use in place
What to do if detected (remove/ignore/replace/complain,...)

The following subsection gives an example:

3.1 Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.
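The application-internal arrangement described above might look roughly like the following Python sketch: the text is one linear string, each object leaves a U+FFFC sentinel in it, and the details of the object are kept in a separate table keyed by text position. The class and method names are hypothetical, and real applications will differ considerably.

```python
from dataclasses import dataclass, field

OBJECT_REPLACEMENT = "\uFFFC"


@dataclass
class Document:
    # The linear text, containing one sentinel per embedded object.
    text: str = ""
    # Side table: text position of a sentinel -> information about the object.
    objects: dict = field(default_factory=dict)

    def append_text(self, s: str) -> None:
        self.text += s

    def append_object(self, info: dict) -> None:
        # The sentinel marks the position; the details live outside the text.
        self.objects[len(self.text)] = info
        self.text += OBJECT_REPLACEMENT


doc = Document()
doc.append_text("See ")
doc.append_object({"kind": "image", "name": "logo"})
doc.append_text(" for details.")
```

Exchanging only doc.text would lose the side table, which is the problem with using this character in interchanged text.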

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include,...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all additional information needed to identify and use the object in question.

What to do if detected: In a proxy context or browser context, ignore. When received in an editing context, remove, maybe with a warning to the user.
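For the editing context, the removal with an optional warning could be sketched as follows in Python. The function name strip_object_replacement and the wording of the warning are assumptions made for this example only. The proxy and browser cases are not covered.

```python
import warnings

OBJECT_REPLACEMENT = "\uFFFC"


def strip_object_replacement(text: str, warn: bool = True) -> str:
    """Remove every U+FFFC from text, optionally warning the user."""
    count = text.count(OBJECT_REPLACEMENT)
    if count and warn:
        warnings.warn(f"removed {count} occurrence(s) of U+FFFC")
    return text.replace(OBJECT_REPLACEMENT, "")


cleaned = strip_object_replacement("Figure: \uFFFC (see caption)", warn=False)
```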

3.2 Interlinear Annotation Characters, U+FFF9-U+FFFB

Short description: The interlinear annotation characters are used to delimit interlinear annotations, in certain circumstances.

Reason for inclusion: The interlinear annotation characters were included in Unicode only in order to reserve codepoints for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains interlinear annotations, it is extremely helpful for implementations to have delimiters in the text itself, even though delimiters are not otherwise used for style markup. With this method, and unlike the case of the object replacement character, all textual information can remain in the standard text stream, but any additional formatting information is kept separately. In addition, the Interlinear Annotation Anchor serves as a placeholder for formatting information for the whole annotation object, the same way a paragraph mark can be a placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation characters in markup text does not work because the additional formatting information (how to position the annotation,...) is not available.

Problems with other uses: The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose.

Replacement markup: The markup to be used in place of the interlinear annotation characters depends on the formatting and nature of the interlinear annotation in question. [Pointer to the RUBY draft]

What to do if detected: In a proxy context or browser context, remove U+FFF9 and remove all characters between U+FFFA and following U+FFFB. When received in an editing context, either remove in the same manner, maybe with a warning to the user, or convert into appropriate ruby markup for further editing and formatting by the user.
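The removal rule for the proxy or browser context can be sketched in Python as follows. This is only an illustration under stated assumptions: the delimiters U+FFFA and U+FFFB are removed together with the annotation text between them, and a U+FFFA with no following U+FFFB (or a stray U+FFFB) is left untouched. The function name strip_interlinear_annotations is hypothetical.

```python
import re

ANCHOR = "\uFFF9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # INTERLINEAR ANNOTATION TERMINATOR

# A separator, any run of non-terminator characters, then the terminator.
_ANNOTATION = re.compile(f"{SEPARATOR}[^{TERMINATOR}]*{TERMINATOR}")


def strip_interlinear_annotations(text: str) -> str:
    """Drop annotation text and anchors, keeping only the annotated text."""
    # Assumption: the delimiters are removed along with the annotation.
    text = _ANNOTATION.sub("", text)
    # Remove the anchors that introduced the annotated text.
    return text.replace(ANCHOR, "")


sample = "\uFFF9base\uFFFAnote\uFFFB text"
print(strip_interlinear_annotations(sample))
```

Converting to ruby markup in an editing context would need the same three delimiters to locate the annotated text and the annotation, but the target markup depends on the ruby specification and is not shown here.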

4. Versioning

When this report is finalized, it will treat all relevant characters in the then current version of the Unicode Standard, and it may include some others whose addition is anticipated/planned/feared.

As the Unicode Standard is updated and new characters are added, some of the new characters may also be unsuitable for use with markup. However, it is hoped that this report will help to reduce such additions as much as possible. These characters will be flagged as such in the appropriate data file. This file should always be checked for the most up-to-date information. This report itself may be updated periodically to give additional background information.

For more information, see:

Versions of the Unicode Standard (http://www.unicode.org/unicode/standard/versions)
Unicode Character Database Format (ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html)
Unicode Character Database (ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt)

5. Conformance

In the context of the Unicode Standard, the material in this technical report is informative. However, other documents, particularly markup language specifications, may specify conformance including normative references to this document.

6. References

(to be completed)

[Charmod]

[Charreq]

[Namespace]

[HTML]

[Unicode]

[XHTML]

[XML]

7. Change History

Changes from the initial draft: Fixed the header. Fixed the numbering. Fixed the title. Put references to final version of data files based on naming conventions. Minor wording changes. Added proposed language on annotation characters to match example on FFFC. Posted for internal review by UTC and W3C (AF).

8. Copyright

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/