Inter-Client Exchange of Unicode Text

Juliusz Chroboczek
16 Jun 1999

DRAFT! This is not an actual proposal, but an early draft of
something that may someday, hopefully, become a proposal.

{Author's comments, not part of the text, are in curly brackets.}


Introduction and background
***************************

Unicode [UNICODE] is a coded character set with the ambition of being
suitable for interchange of textual data in all known scripts, current
and historical. The Unicode character repertoire is indexed by
unsigned integers known as codepoints. By convention, we write U+89AB
for the Unicode character with codepoint 89AB (hexadecimal).

UTF-8 [UTF-8] is a technique for encoding Unicode text as a stream of
eight-bit bytes. The UTF-8 encoding enjoys the following desirable
properties:

  * it is compatible with 7-bit ASCII [ASCII], in that mapping
    arbitrary ASCII strings to UTF-8 doesn't require any conversion;

  * it is stateless;

  * conversion between UTF-8 and streams of 16- or 32-bit Unicode
    values is computationally trivial.

Together with the universality of Unicode, these properties make UTF-8
the ideal interchange format for plain text. In particular, UTF-8
could allow clients to seamlessly exchange selections [ICCCM]
containing multilingual text in a locale-independent manner.


Normative part
**************

This document proposes that the atom name UTF8_STRING should be
registered with The Open Group. The main uses of this atom will be
property types and selection targets, although use for other purposes,
such as input method encodings, is not excluded.

If a property has type UTF8_STRING, it should carry eight-bit data
(i.e. specify format `8'). It should be interpreted as a string
encoded according to UTF-8, as defined by the Unicode standard,
version 2.0 or any later version. Nothing more is implied about the
data.
In particular, it need not be in canonical form, and may contain any
Unicode control characters, including legacy control characters,
paragraph and line breaks, directional marks, or Plane 14 language
tags.

If a selection request specifies target UTF8_STRING, the selection
holder should make the selection available as a UTF-8 string, with
type UTF8_STRING.

The selection target TEXT is extended to explicitly allow a property
with type UTF8_STRING as the reply.

{Rationale: the atom name UTF8_STRING has been chosen for consistency
with names such as _SJIS_STRING.}


Guidelines for clients
**********************

Clients that internally use a text encoding that maps easily into a
subset of Unicode are encouraged to use UTF-8 as their preferred
interchange format.

Composing characters
--------------------

It is expected that, in the short term, a number of applications will
not be able to properly treat composing characters. It is also
expected that any application that can accept composing characters
will be able to properly deal with composed forms.

For this reason, it is suggested that clients make characters
available in their composed form whenever possible. Of course,
clients are also encouraged to accept composing characters whenever
possible.

Legacy control characters
-------------------------

Unicode contains two ranges of legacy control characters, C0 and C1,
isomorphic to the ranges of control characters in the ISO 8859 series
of encodings. The use of these control characters should be avoided,
as they do not have well-defined semantics.

However, in order to ease the transition from legacy character sets,
these characters may be used in some circumstances. In particular,
the characters LINE FEED U+000A and FORM FEED U+000C are likely to be
used by applications running under X11, especially terminal emulators
and text editors (as opposed to word processors).
Those applications will typically want to include such characters in
selections, so other applications need to be able to interpret them.
We suggest that they should be interpreted as follows.

A single LINE FEED (LF) causes a line break but not a paragraph break.
A sequence of two or more LFs causes a paragraph break. In
particular, a single LF should not reset the state of the Unicode BIDI
algorithm, while a sequence of two or more does.

A FORM FEED (FF) causes a page break. It does not cause a paragraph
break, and thus does not cause the state of the BIDI algorithm to be
reset.

{Question: is TR13 mature enough to be mentioned at this point?}

The rest of this document assumes that applications use line
separators and paragraph separators internally.

Guidelines for the selection owner
----------------------------------

A selection owner wishing to make a selection available as UTF-8 text
should:

1. Respond to a conversion request with target TARGETS with (at
   least) the atoms TEXT, STRING, UTF8_STRING, and possibly
   COMPOUND_TEXT.

2. Respond to a conversion request with target UTF8_STRING with a
   UTF-8 encoded string stored in a property of type UTF8_STRING. In
   order to maximise interoperability, this string should preferably
   be in composed form, but clients are free to make arbitrary Unicode
   strings available when conversion to composed form is not
   desirable.

   Producing such a property will typically involve generating Unicode
   control characters, including U+2029 PARAGRAPH SEPARATOR, U+2028
   LINE SEPARATOR, direction marks, and possibly Plane 14 language
   tags. Legacy control characters (U+0000 to U+001F, and U+0080 to
   U+009F) should in general not be used, with the possible exception
   of LF U+000A, which may be used to represent a single LINE
   SEPARATOR (when alone) or one or more PARAGRAPH SEPARATORs (when in
   a sequence of two or more), as well as FF U+000C, which marks a
   page break.

3. Respond to a conversion request with target STRING by forcibly
   mapping the selection to ISO 8859-1, and presenting the result in a
   property of type STRING. Characters that do not map to ISO 8859-1
   should be replaced by 0x23 `#'.

   Treatment of Unicode control characters is application-dependent,
   but a reasonable approach would be to map a sequence of an
   arbitrary number of U+2028 to a single 0x0A, a sequence of n U+2029
   to a sequence of n+1 0x0A, pass through all legacy control
   characters, and discard all other Unicode control characters.

   {Rationale: `#' is easy to spot, and, unlike `?' and `~', is not
   used by standard shells and utilities on systems running X11.
   Defining a standard character for this purpose will allow users to
   automate treatment of text with lost characters.}

4. Optionally respond to a conversion request with target
   COMPOUND_TEXT with an application-specific conversion of the data
   into ISO 2022 [CTEXT].

5. Respond to a conversion request with the polymorphic target TEXT
   by checking whether the selected text can be represented exactly as
   an ISO 8859-1 string. If this is the case, the selection owner
   should proceed as in point 3; otherwise, proceed as in point 2.

Guidelines for the requestor
----------------------------

A client, known as the requestor, wishing to use a selection that may
be available as UTF-8 should:

1. Make a conversion request with target TARGETS, and check for the
   availability of the targets UTF8_STRING, STRING, TEXT.

2. If the target UTF8_STRING was found, the requestor should issue a
   conversion request with target UTF8_STRING. If this conversion
   succeeds, the requestor should process the resulting UTF-8 encoded
   string in an application-specific manner. This will typically
   involve treating U+2029 as a paragraph separator, U+2028 as a line
   separator, composing or decomposing characters, and interpreting
   (or ignoring) directional marks and Plane 14 language tags.
   Interpretation of the legacy control character LINE FEED U+000A is
   application-dependent. A reasonable strategy could be as follows:
   a single LF could be interpreted as a line separator, while a
   sequence of n LFs, where n>1, could be interpreted as a sequence of
   n-1 paragraph separators. The above conventions imply that a
   single LF should not reset the BIDI state, while a sequence of two
   or more LFs should.

3. If the target UTF8_STRING was not found, or the conversion in step
   2 above failed, the client should issue a conversion request with
   target STRING. The resulting property should have type STRING and
   contain an ISO 8859-1 encoded string. Any control characters in
   this string will be interpreted in an application-specific manner,
   but typically 0x0A will be interpreted just like U+000A in step 2
   above.

A client may also make a conversion request with target COMPOUND_TEXT
if it is prepared to convert the resulting ISO 2022 data into its
internal encoding format.

Finally, a client may make a conversion request with the polymorphic
target TEXT, but it should then be prepared to receive a property with
any text encoding, including, but not limited to, STRING, UTF8_STRING,
and COMPOUND_TEXT.

Sample implementation
---------------------

A sample implementation of the guidelines above has been integrated
with Thomas Dickey's `XTerm', patch 108. It is available from

  http://www.clark.net/pub/dickey/

{Okay, fair enough, it is not available yet.}


Current practice
****************

None of the clients available to the author behave as proposed above.

A number of clients make UTF-8 strings available as property type
STRING, which violates the ICCCM conventions [ICCCM]. Some clients
try to convert Unicode strings into ISO 2022 and make them available
as property type COMPOUND_TEXT, which follows the ICCCM, but requires
a complex, stateful conversion which is not one-to-one.
Some other clients have been found to make UTF-8 encoded strings
available in a locale-dependent property type (such as en_US.UTF-8),
which makes interoperation between clients running in different
locales very difficult or impossible.


Other Unicode formats
*********************

The main other Unicode formats are UTF-16 and UCS-4. Conversion
between UTF-8 and these other Unicode formats is computationally
trivial. When restricted to the BMP, UTF-8 carries at most a 50%
overhead with respect to UTF-16, and is always more compact than
UCS-4. Finally, unlike UTF-16, UTF-8 encodes the whole range of both
Unicode and ISO 10646.

For these reasons, this document does not suggest defining mechanisms
for exchanging data encoded as UTF-16 or UCS-4, as creating such
mechanisms would only cause confusion among users and implementors,
while bringing no visible benefits to users. It does not, of course,
exclude defining such mechanisms in the future.


References
**********

[ASCII]    ISO/IEC 646:1991, Information technology -- ISO 7-bit
           coded character set for information interchange.

[CTEXT]    Compound Text Encoding, version 1.1. Robert W. Scheifler.

[ICCCM]    Inter-Client Communication Conventions Manual, Version 2.0.
           David Rosenthal and Stuart W. Marks.

[UTF-8]    UTF-8, a transformation format of ISO 10646. F. Yergeau.
           RFC 2279, January 1998.

[UNICODE]  The Unicode Standard -- Version 2.0. The Unicode
           Consortium. Addison-Wesley, 1996.
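

Appendix: illustrative sketch
*****************************

The conversions suggested in the guidelines for the selection owner
and for the requestor can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not part of the
proposal and not the XTerm implementation. The function names are
hypothetical; Unicode NFC normalisation stands in for "composed form";
and "all other Unicode control characters" is taken to mean general
category Cf (directional marks, Plane 14 tags). No X11 calls are made:
the functions only compute the bytes or text that would be stored in,
or read from, a property.

```python
import re
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def to_utf8_string_target(text: str) -> bytes:
    """Owner side: data for a property of type UTF8_STRING."""
    # Assumption: NFC is used as the "composed form".
    return unicodedata.normalize("NFC", text).encode("utf-8")


def to_string_target(text: str) -> bytes:
    """Owner side: force the text into ISO 8859-1 for target STRING."""
    text = unicodedata.normalize("NFC", text)
    # A run of any number of U+2028 becomes a single 0x0A.
    text = re.sub(LS + "+", "\n", text)
    # A run of n U+2029 becomes n+1 0x0A.
    text = re.sub(PS + "+", lambda m: "\n" * (len(m.group()) + 1), text)
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFF:
            # Maps directly to ISO 8859-1; this also passes the legacy
            # C0 and C1 control characters through unchanged.
            out.append(cp)
        elif unicodedata.category(ch) == "Cf":
            # Assumption: other Unicode control characters are the
            # format characters (category Cf); they are discarded.
            continue
        else:
            out.append(0x23)  # `#' marks a character that was lost
    return bytes(out)


def answer_text_target(text: str):
    """Owner side: choose the reply type for the polymorphic TEXT."""
    normalized = unicodedata.normalize("NFC", text)
    if all(ord(ch) <= 0xFF for ch in normalized):
        # Exactly representable in ISO 8859-1: reply as STRING.
        return "STRING", to_string_target(text)
    return "UTF8_STRING", normalized.encode("utf-8")


def interpret_line_feeds(text: str) -> str:
    """Requestor side: map runs of LF to line/paragraph separators."""
    def replace_run(match):
        n = len(match.group())
        # A single LF is a line separator; n > 1 LFs are n-1
        # paragraph separators.
        return LS if n == 1 else PS * (n - 1)
    return re.sub("\n+", replace_run, text)
```

With these definitions the two directions agree on paragraph breaks: a
run of n U+2029 is sent as n+1 LFs, and a run of n+1 LFs is read back
as n U+2029. A run of several U+2028, on the other hand, collapses to
a single LF, so that conversion is lossy, as is the replacement of
unmappable characters by `#'.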