[Working Draft for a Proposed] Unicode Technical Report #XX

Unicode NAMED CHARACTER SEQUENCES

L2/04-339

Version	1.0 - L2/04-339
Authors	Asmus Freytag (asmus@unicode.org), Mark Davis (mark@example.com)
Date	2002-05-08
This Version	http://www.unicode.org/reports/tr25/trXX-1.html
Previous Version	none
Latest Version	http://www.unicode.org/reports/tr25/
Tracking Number	1

Summary

This report defines named sequences of Unicode characters.

[ Many known issues. - Need overall direction, not so much detailed edits, from reviewers.]

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

This document is a proposed draft / draft / proposed update / proposed update of a previously approved Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Overview
2 Notation
3 Conformance
4 Names
5 Data File
References
Modifications

1 Overview

This report specifies sequences of characters that may be treated as single units, whether in particular types of processing, in reference by standards, in listing of repertoires (such as for fonts or keyboards), or in communicating with users.

Some standards, notably those developed by ISO/IEC JTC1/SC2, have a long-standing tradition of using the formal name of a character as the means to identify corresponding characters across standards. With Unicode as the universal character set, this practice has largely given way to using the Unicode code point as the unique identifier. However, some standards contain entities or characters that are mapped not to a single Unicode code point, but to a sequence of characters. In these instances it is convenient to have a name for the sequence.

Here are examples of such characters and their representations as sequences of code points.

[TBD formatting of tables, subsetting]

Character	Code Points	Linguistic Usage
	0063 0068	Slovak, traditional Spanish
	0074 02B0	Native American languages
	0078 0323
	019B 0313
	00E1 0328	Lithuanian
	0069 0307 0301	Lithuanian
	30C8 309A	Ainu in kana transcription

Additional Khmer Character Names

Glyph	Code	Name
	17BB 17C6	khmer vowel sign srak om
	17B6 17C6	khmer vowel sign srak am
	17D2 1780	khmer consonant sign coeng ka
	17D2 1781	khmer consonant sign coeng kha
	17D2 1782	khmer consonant sign coeng ko
	17D2 1783	khmer consonant sign coeng kho
	17D2 1784	khmer consonant sign coeng ngo
	17D2 1785	khmer consonant sign coeng ca
	17D2 1786	khmer consonant sign coeng cha
	17D2 1787	khmer consonant sign coeng co
	17D2 1788	khmer consonant sign coeng cho
	17D2 1789	khmer consonant sign coeng nyo
	17D2 178A	khmer consonant sign coeng da
	17D2 178B	khmer consonant sign coeng ttha
	17D2 178C	khmer consonant sign coeng do
	17D2 178D	khmer consonant sign coeng ttho
	17D2 178E	khmer consonant sign coeng na
	17D2 178F	khmer consonant sign coeng ta
	17D2 1790	khmer consonant sign coeng tha
	17D2 1791	khmer consonant sign coeng to
	17D2 1792	khmer consonant sign coeng tho
	17D2 1793	khmer consonant sign coeng no
	17D2 1794	khmer consonant sign coeng ba
	17D2 1795	khmer consonant sign coeng pha
	17D2 1796	khmer consonant sign coeng po
	17D2 1797	khmer consonant sign coeng pho
	17D2 1798	khmer consonant sign coeng mo
	17D2 1799	khmer consonant sign coeng yo
	17D2 179A	khmer consonant sign coeng ro
	17D2 179B	khmer consonant sign coeng lo
	17D2 179C	khmer consonant sign coeng vo
	17D2 179D	khmer consonant sign coeng sha
	17D2 179E	khmer consonant sign coeng ssa
	17D2 179F	khmer consonant sign coeng sa
	17D2 17A0	khmer consonant sign coeng ha
	17D2 17A2	khmer consonant sign coeng qa
	17D2 17A7	khmer vowel sign coeng qu
	17D2 17AB	khmer vowel sign coeng ry
	17D2 17AF	khmer vowel sign coeng qe
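The tables above can be read as a mapping from a name to a sequence of code points. The following is a minimal sketch of that idea, assuming rows laid out as in the Khmer table (space-separated hexadecimal code points, a tab, then the name). The function name `load_named_sequences` and the three sample rows are illustrative choices only; the sketch does not describe the format of any published data file.

```python
# Three rows copied from the Khmer table above: code points, a tab, the name.
ROWS = """\
17BB 17C6\tkhmer vowel sign srak om
17B6 17C6\tkhmer vowel sign srak am
17D2 1780\tkhmer consonant sign coeng ka
"""


def load_named_sequences(text):
    """Map each sequence name to the string formed by its code points.

    Hypothetical helper: assumes one row per line, with the code points
    and the name separated by a single tab.
    """
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        codes, name = line.split("\t", 1)
        table[name.strip()] = "".join(chr(int(cp, 16)) for cp in codes.split())
    return table


NAMED = load_named_sequences(ROWS)
```

With such a table, `NAMED["khmer consonant sign coeng ka"]` yields the two-character string U+17D2 U+1780, so a repertoire listing can refer to the sequence by name instead of spelling out its code points.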

While all combinations of accents and base characters are encodable in Unicode, not all combinations are required for a particular purpose, and only some may be supported. In these contexts, named sequences would provide a useful shorthand for referring to a specific sequence. However, while there are many sequences of characters that get special treatment that varies by language, such as sequences of characters that are collated as single units, not all such sequences necessarily need to be named.

2 Notation

The standard notation for a sequence of characters defined by the Unicode Standard is

<HHHH, HHHH, ..., HHHH>

where each HHHH is a sequence of 4 to 6 uppercase hexadecimal digits, optionally preceded by "U+".
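As an illustration of the notation, the sketch below reads a sequence written in this form and writes one back out. The function names `parse_sequence` and `format_sequence` are hypothetical, and two points the text leaves open are treated as assumptions: items are separated by commas with optional surrounding spaces, and values above 10FFFF are rejected.

```python
import re

# One item: 4 to 6 uppercase hexadecimal digits, optionally preceded by "U+".
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-F]{4,6})")


def parse_sequence(notation):
    """Return the list of code points written as <HHHH, HHHH, ..., HHHH>."""
    text = notation.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("sequence must be enclosed in < and >")
    code_points = []
    for item in text[1:-1].split(","):
        match = _CODE_POINT.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"not a code point: {item.strip()!r}")
        value = int(match.group(1), 16)
        if value > 0x10FFFF:  # assumption: stay within the Unicode code space
            raise ValueError(f"outside the Unicode code space: {item.strip()!r}")
        code_points.append(value)
    return code_points


def format_sequence(code_points):
    """Write code points back in the notation, without the optional "U+"."""
    return "<" + ", ".join(f"{cp:04X}" for cp in code_points) + ">"
```

For example, `parse_sequence("<0069, 0307, 0301>")` returns the three code points of the Lithuanian sequence from the Overview, and `format_sequence` turns them back into the same notation.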

3 Conformance

[TBD: edit boilerplate]

	Conformance to the Unicode Standard does not require conformance to the specification in this document.
	Conformance to the Unicode Standard does not require / requires conformance to the specification in this document. The relationship between conformance to the Unicode Standard, and conformance to an individual Unicode Standard Annex (UAX) is described in more detail in the Unicode Standard in Section 3.2 Conformance.

	Unicode-conformant implementations that implement this specification must do so as described in the following clause.
	There are many different ways to [break lines of text], and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement this specification must do so as described in the following clause. Implementations are free to deviate from this, as long as they do not purport to conform to this specification.

C1	An implementation that claims conformance to the default [Unicode Line Break Algorithm] shall produce the same results as the algorithm published in this specification. As specified in Section 3.2 of the Unicode Standard, Unicode specifications are generally described as an algorithm or process, producing a result from a given input. However, these are simply logical specifications; particular implementations can change or optimize the internal processing as long as they provide the same results from the same input.
C2	This specification defines default behavior, which is to be used in the absence of tailoring for particular languages and environments. Where a particular environment requires tailoring, such modifications to this specification can be done without affecting conformance.
C3	If tailoring is used by an implementation that claims conformance to the default [Unicode Line Break Algorithm], the existence of such tailoring must be documented. This does not require that the tailoring be described in a reproducible manner; for example, a statement 'tailored to language X' is sufficient.

At times, this specification recommends best practice. These recommendations are not normative and conformance with this specification does not depend on their realization. These recommendations contain the expression "We recommend ...", "This specification recommends ...", or some similar wording.

4 Names

Names of Unicode sequences are unique. Where possible, they are constructed by appending the names of the constituent elements together while eliding duplicate elements. If a sequence would have a name that already exists, then the name is modified suitably to avoid the clash. Names of sequences are enclosed in <>.

Examples:

<A, B, C> <LATIN LETTER CAPITAL A B C>

<AE, F> <LATIN LETTER CAPITAL A E F>

<A, E, F> <LATIN LETTER CAPITAL A WITH E AND F>
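One possible reading of "appending the names of the constituent elements together while eliding duplicate elements" is to keep the leading words shared by every character name once and then append what remains of each name. The sketch below implements only that reading and is not a normative algorithm. The function name `sequence_name` is hypothetical, and the step that modifies a name to avoid a clash is not covered. Character names come from Python's `unicodedata` module, so the output uses the word order of the actual character names (LATIN CAPITAL LETTER), which differs from the schematic examples above.

```python
import unicodedata


def sequence_name(chars):
    """Build a candidate name by eliding the shared leading words.

    Illustrative only: does not check uniqueness against existing names,
    and raises ValueError if a character has no name.
    """
    if not chars:
        raise ValueError("empty sequence")
    names = [unicodedata.name(c).split() for c in chars]
    # Count the leading words common to every name, but always leave at
    # least one distinguishing word for each character.
    limit = min(len(words) for words in names) - 1
    shared = 0
    while shared < limit and all(w[shared] == names[0][shared] for w in names):
        shared += 1
    parts = names[0][:shared]
    for words in names:
        parts.extend(words[shared:])
    return "<" + " ".join(parts) + ">"
```

Under this reading, the sequence A, B, C is named `<LATIN CAPITAL LETTER A B C>`. A real assignment process would still need to compare the result against existing names and adjust it where a clash occurs.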

5 Data File

A data file is available. See [Composite].

References

[Charts]	The online code charts can be found at http://www.unicode.org/charts/ An index to character names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html
[Composite]	[Tentative] Named composite entities data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/NamedCompositeEntities-4.1.0d2.txt For other versions, see: http://www.unicode.org/standard/versions/
[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Normal]	Unicode Technical Report #15: Unicode Normalization Forms http://www.unicode.org/unicode/reports/tr15/
[RegEx]	Unicode Technical Standard #18: Regular Expressions http://www.unicode.org/unicode/reports/tr18/
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts]	Scripts data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/Scripts.txt For other versions, see: http://www.unicode.org/standard/versions/
[UCD]	Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files
[Unicode]	The Unicode Standard For the latest version see: http://www.unicode.org/versions/latest/. For the current version see: http://www.unicode.org/versions/Unicode4.1.0/. For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1).
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous version of this document.

1	Initial version

Copyright © 2001-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.