Part of L2/99-181

Proposed DRAFT Unicode Technical Report #20

Characters not Suitable for Markup

Revision 1
Authors Martin Dürst, Mark Davis, Hideki Hiura
Date 1999-06-08
This Version most probably:
Previous Version none
Latest Version


This document contains guidelines for avoiding the use of certain characters in markup.

Status of this document

This proposed draft is published for review purposes. This draft has not yet been considered by the Unicode Technical Committee. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to

Table of Contents

  1. Introduction
  2. General Considerations
  3. List of Characters
  4. Versioning
  5. Conformance

1. Introduction

The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons.

For document and data interchange, the Internet and the World Wide Web are increasingly making use of marked-up text. Some control-like characters can conflict with the principles of marked-up text in various undesirable ways.

[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]


This report uses XML as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. For example, the prefix 'html:' indicates that the element is taken from [XHTML]. Examples containing the namespace prefix 'html:' are therefore assumed to include a namespace declaration of xmlns:html="..." (insert the appropriate URI for XHTML later).
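As a minimal sketch of why the namespace declaration matters, the following Python fragment parses a small document that uses the 'html:' prefix. The URI shown is a hypothetical placeholder, not the XHTML namespace URI (which this draft leaves open), and the element name 'doc' and the helper 'parses' are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; this draft does not yet give the XHTML namespace URI.
PLACEHOLDER_XHTML_URI = "urn:example:xhtml-placeholder"

# A document that declares the 'html:' prefix before using it.
declared = (
    f'<doc xmlns:html="{PLACEHOLDER_XHTML_URI}">'
    "first line<html:br />second line</doc>"
)

# The same content without the declaration; the prefix is unbound.
undeclared = "<doc>first line<html:br />second line</doc>"


def parses(xml_text):
    """Return True if xml_text is accepted by the namespace-aware parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(declared)
br = root[0]  # the html:br element
# ElementTree reports the expanded name as "{namespace URI}local name".
print(br.tag)
print(parses(declared), parses(undeclared))
```

The parser resolves the prefix to the declared URI, so the prefix itself carries no meaning without the declaration.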

Characters are denoted using the notation of the Unicode Standard, i.e. U+ followed by the character's hexadecimal number. [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
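For readers unfamiliar with the two conventions mentioned here, the following Python sketch formats a character both in the U+ notation and as an XML hexadecimal numeric character reference. The function names are illustrative assumptions, not part of any specification.

```python
def u_notation(ch):
    """Return the U+ notation for a character: 'U+' and at least four hex digits."""
    return f"U+{ord(ch):04X}"


def xml_char_ref(ch):
    """Return the XML hexadecimal numeric character reference for a character."""
    return f"&#x{ord(ch):X};"


print(u_notation("\uFFFC"))    # U+FFFC
print(xml_char_ref("\uFFFC"))  # &#xFFFC;
```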

2. General Considerations

This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:

3. List of Characters

The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.



Codepoints         Description                                                  Short Comment

U+202A .. U+202E   BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)            Strongly discouraged in HTML 4.0; RLM and LRM are allowed
U+2028 .. U+2029   Line and paragraph separator (under discussion)              Use <html:br />, <html:p></html:p>, or equivalent
U+206A .. U+206B   Symmetric swapping                                           Strongly discouraged in Unicode 2.0
U+206C .. U+206D   Arabic form shaping                                          Strongly discouraged in Unicode 2.0
U+206E .. U+206F   National digit shapes                                        Strongly discouraged in Unicode 2.0
U+FFF9 .. U+FFFB   Interlinear annotation controls                              Use ruby markup
U+FFFC             Object replacement character (under discussion)              Use markup, e.g. HTML <object>
U+1xxxx????        Language Tag codepoints (if and when they will be encoded)   Use html:lang or xml:lang
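As an illustration of how the table above might be applied, the following Python sketch scans a string for the listed characters. The names UNSUITABLE_RANGES and find_unsuitable are assumptions made for this example. The language tag row is omitted because its codepoints are not assigned in this draft, and the list reflects this draft only, including the entries marked as under discussion.

```python
# Ranges copied from the table in this draft: (first, last, category).
# The language tag row is left out because its codepoints are not yet assigned.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "line and paragraph separator"),
    (0x206A, 0x206B, "symmetric swapping"),
    (0x206C, 0x206D, "Arabic form shaping"),
    (0x206E, 0x206F, "national digit shapes"),
    (0xFFF9, 0xFFFB, "interlinear annotation controls"),
    (0xFFFC, 0xFFFC, "object replacement character"),
]


def find_unsuitable(text):
    """Return (index, U+ notation, category) for each listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for first, last, category in UNSUITABLE_RANGES:
            if first <= cp <= last:
                hits.append((index, f"U+{cp:04X}", category))
                break
    return hits


for hit in find_unsuitable("a\u202Eb\uFFFC"):
    print(hit)
```

Characters outside the listed ranges, such as LRM (U+200E) and RLM (U+200F), are not reported.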

A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be mentioned/discussed:

The following subsection gives an example:

Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately, with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information about the object is kept separately.
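The application-internal arrangement described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name Document, its methods, and the use of a dictionary keyed by text position are hypothetical and do not describe any particular application.

```python
from dataclasses import dataclass, field

OBJECT_REPLACEMENT = "\uFFFC"


@dataclass
class Document:
    """Linear text plus a separate structure that points into it."""
    text: str = ""
    # Maps a position in the text to a description of the object stored there.
    objects: dict = field(default_factory=dict)

    def append_text(self, s):
        self.text += s

    def append_object(self, description):
        # Record the object separately and put only a sentinel in the text.
        self.objects[len(self.text)] = description
        self.text += OBJECT_REPLACEMENT

    def in_sync(self):
        """Check that every sentinel has an entry and every entry a sentinel."""
        sentinels = {i for i, ch in enumerate(self.text) if ch == OBJECT_REPLACEMENT}
        return sentinels == set(self.objects)


doc = Document()
doc.append_text("See ")
doc.append_object({"src": "logo.png"})
doc.append_text(" here")
print(repr(doc.text), doc.objects, doc.in_sync())
```

If only the text were written out, the sentinel would remain but the entry in the separate structure would be lost, which is the situation the next paragraph describes.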

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (which object to include, and so on) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all the additional information needed to identify and use the object in question.
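As a rough sketch of the image case, the following Python function writes out text containing object replacement characters as markup, emitting an <html:img> element for each sentinel. The function name to_markup and the mapping from text positions to image URIs are assumptions made for this example; other object types would need other elements.

```python
from xml.sax.saxutils import escape, quoteattr

OBJECT_REPLACEMENT = "\uFFFC"


def to_markup(text, image_sources):
    """Serialize text, replacing each sentinel with an html:img element.

    image_sources maps the position of each sentinel in text to an image URI
    (a hypothetical structure chosen for this sketch).
    """
    parts = []
    for index, ch in enumerate(text):
        if ch == OBJECT_REPLACEMENT:
            # quoteattr adds the surrounding quotes and escapes the value.
            parts.append(f"<html:img src={quoteattr(image_sources[index])} />")
        else:
            # escape handles &, < and > in ordinary character data.
            parts.append(escape(ch))
    return "".join(parts)


print(to_markup("See \uFFFC & more", {4: "logo.png"}))
# See <html:img src="logo.png" /> &amp; more
```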

What to do if detected: In a proxy or browser context, ignore the character. In an editing context, remove it, possibly with a warning to the user.
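The handling rule above might be sketched in Python as follows. The function name, the context labels, and the use of the standard warnings module are assumptions for illustration. In particular, "ignore" is interpreted here as passing the text through unchanged; a browser could equally choose not to render the character.

```python
import warnings

OBJECT_REPLACEMENT = "\uFFFC"


def handle_object_replacement(text, context):
    """Apply the suggested handling for U+FFFC found in marked-up text.

    context is one of "proxy", "browser" or "editor" (labels assumed here).
    """
    if context in ("proxy", "browser"):
        # Ignore: in this sketch the text is passed through unchanged.
        return text
    if context == "editor":
        if OBJECT_REPLACEMENT in text:
            warnings.warn("Removed object replacement character(s) (U+FFFC)")
        return text.replace(OBJECT_REPLACEMENT, "")
    raise ValueError(f"unknown context: {context!r}")


print(repr(handle_object_replacement("a\uFFFCb", "proxy")))
print(repr(handle_object_replacement("a\uFFFCb", "editor")))
```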

4. Versioning

When this report is finalized, it will cover all relevant characters in the then-current version of the Unicode Standard, and it may include some others whose addition is anticipated/planned/feared.

As the Unicode Standard is updated and new characters are added, further characters that are not suitable for markup may also be added. However, it is hoped that this report will help to keep such additions to a minimum. These characters will be flagged as such in the appropriate data file, which should always be consulted for the most up-to-date information. This report itself may be updated periodically to give additional background information.

For more information, see:

[add a pointer to the latest data file once we have one]


5. Conformance

This document does not specify any kind of conformance clause. However, other documents may specify conformance, including references to this document.


(to be completed)









Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Unicode Home Page:

Unicode Technical Reports: