[Working Draft for a Proposed] Unicode Technical Report #XX

Unicode NAMED CHARACTER SEQUENCES

L2/04-339

Version 1.0 - L2/04-339
Authors Asmus Freytag (asmus@unicode.org), Mark Davis (mark@example.com)
Date 2002-05-08
This Version http://www.unicode.org/reports/tr25/trXX-1.html  
Previous Version none
Latest Version http://www.unicode.org/reports/tr25/
Tracking Number 1

Summary

This report defines named sequences of Unicode characters.

[ Many known issues. - Need overall direction, not so much detailed edits, from reviewers.]

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

This document is a proposed draft / draft / proposed update / proposed update of a previously approved Unicode Standard Annex / Unicode Technical Standard / Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. Overview
  2. Notation
  3. Conformance
  4. Names
  5. Data File
     References
     Modifications

1 Overview 

This technical report specifies sequences of characters that may be treated as single units, whether in particular types of processing, in references by standards, in listings of repertoires (such as for fonts or keyboards), or in communicating with users.

Some standards, notably those developed by ISO/IEC JTC1/SC2, have a long-standing tradition of using the formal name of a character as the means to identify corresponding characters across standards. With Unicode as the universal character set, this practice has largely given way to using the Unicode code point as the unique identifier. However, some standards contain entities or characters that map not to a single Unicode code point but to a sequence of characters. In these instances it is convenient to have a name for the sequence.

Here are examples of such characters, and their representation as a sequence of code points.

[TBD formatting of tables, subsetting]

Character  Code Points     Linguistic Usage
ch         0063 0068       Slovak, traditional Spanish
tʰ         0074 02B0       Native American languages
x̣          0078 0323
ƛ̓          019B 0313
ą́          00E1 0328       Lithuanian
i̇́          0069 0307 0301
ト゚         30C8 309A       Ainu in kana transcription

Additional Khmer Character Names
Code Points  Name
17BB 17C6    khmer vowel sign srak om
17B6 17C6    khmer vowel sign srak am
17D2 1780    khmer consonant sign coeng ka
17D2 1781    khmer consonant sign coeng kha
17D2 1782    khmer consonant sign coeng ko
17D2 1783    khmer consonant sign coeng kho
17D2 1784    khmer consonant sign coeng ngo
17D2 1785    khmer consonant sign coeng ca
17D2 1786    khmer consonant sign coeng cha
17D2 1787    khmer consonant sign coeng co
17D2 1788    khmer consonant sign coeng cho
17D2 1789    khmer consonant sign coeng nyo
17D2 178A    khmer consonant sign coeng da
17D2 178B    khmer consonant sign coeng ttha
17D2 178C    khmer consonant sign coeng do
17D2 178D    khmer consonant sign coeng ttho
17D2 178E    khmer consonant sign coeng na
17D2 178F    khmer consonant sign coeng ta
17D2 1790    khmer consonant sign coeng tha
17D2 1791    khmer consonant sign coeng to
17D2 1792    khmer consonant sign coeng tho
17D2 1793    khmer consonant sign coeng no
17D2 1794    khmer consonant sign coeng ba
17D2 1795    khmer consonant sign coeng pha
17D2 1796    khmer consonant sign coeng po
17D2 1797    khmer consonant sign coeng pho
17D2 1798    khmer consonant sign coeng mo
17D2 1799    khmer consonant sign coeng yo
17D2 179A    khmer consonant sign coeng ro
17D2 179B    khmer consonant sign coeng lo
17D2 179C    khmer consonant sign coeng vo
17D2 179D    khmer consonant sign coeng sha
17D2 179E    khmer consonant sign coeng ssa
17D2 179F    khmer consonant sign coeng sa
17D2 17A0    khmer consonant sign coeng ha
17D2 17A2    khmer consonant sign coeng qa
17D2 17A7    khmer vowel sign coeng qu
17D2 17AB    khmer vowel sign coeng ry
17D2 17AF    khmer vowel sign coeng qe
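
As an illustration only, rows of the table above could be held in a small lookup table that maps each name to its code point sequence. The Python sketch below copies three rows from the table; the names `KHMER_SEQUENCES` and `sequence_to_string` are hypothetical and are not defined by this report.

```python
# Illustrative sketch only: three rows copied from the table above.
# KHMER_SEQUENCES and sequence_to_string are hypothetical names.
KHMER_SEQUENCES = {
    "khmer vowel sign srak om": "17BB 17C6",
    "khmer vowel sign srak am": "17B6 17C6",
    "khmer consonant sign coeng ka": "17D2 1780",
}


def sequence_to_string(code_points: str) -> str:
    """Convert space-separated hexadecimal code points to a Python string."""
    return "".join(chr(int(cp, 16)) for cp in code_points.split())


if __name__ == "__main__":
    for name, code_points in KHMER_SEQUENCES.items():
        text = sequence_to_string(code_points)
        # Every named sequence consists of more than one code point.
        print(f"{name}: {code_points} ({len(text)} code points)")
```

A table like this lets a repertoire listing refer to the sequence by name while the processing code works with the underlying code points.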

 

While all combinations of accents and base characters are encodable in Unicode, not all combinations are required for a particular purpose, and only some may be supported. In these contexts, named sequences provide a shorthand for referring to a specific sequence. However, while there are many sequences of characters that receive special, language-dependent treatment, such as sequences that are collated as single units, not all such sequences need to be named.

2 Notation

The standard notation for a sequence of characters defined by the Unicode Standard is

<HHHH, HHHH, ..., HHHH>

where each HHHH is a sequence of four to six uppercase hexadecimal digits, optionally preceded by "U+".
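
The notation is simple enough to read and write mechanically. The following Python sketch shows one possible reader and writer for it; the function names `parse_sequence` and `format_sequence` are hypothetical, and the check against 10FFFF is an assumption based on the Unicode code space rather than something stated in this section.

```python
import re

# One code point: an optional "U+" prefix, then 4 to 6 uppercase hex digits.
_CODE_POINT = re.compile(r"(?:U\+)?([0-9A-F]{4,6})")


def parse_sequence(notation: str) -> list[int]:
    """Parse '<HHHH, HHHH, ...>' into a list of integer code points."""
    text = notation.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError(f"not enclosed in angle brackets: {notation!r}")
    code_points = []
    for item in text[1:-1].split(","):
        match = _CODE_POINT.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"bad code point: {item.strip()!r}")
        value = int(match.group(1), 16)
        # Assumption: reject values beyond the Unicode code space.
        if value > 0x10FFFF:
            raise ValueError(f"code point out of range: {item.strip()!r}")
        code_points.append(value)
    return code_points


def format_sequence(code_points: list[int]) -> str:
    """Format code points as '<HHHH, HHHH, ...>' with at least four digits."""
    return "<" + ", ".join(f"{cp:04X}" for cp in code_points) + ">"


if __name__ == "__main__":
    parsed = parse_sequence("<U+0069, U+0307, U+0301>")
    print(parsed)
    print(format_sequence(parsed))
```

The sketch rejects lowercase digits because the notation calls for uppercase; a more lenient reader could normalize case before matching.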

3 Conformance

[TBD: edit boilerplate]

  Conformance to the Unicode Standard does not require conformance to the specification in this document.
  Conformance to the Unicode Standard does not require / requires conformance to the specification in this document. The relationship between conformance to the Unicode Standard, and conformance to an individual Unicode Standard Annex (UAX) is described in more detail in the Unicode Standard in Section 3.2 Conformance.

 

  Unicode-conformant implementations that implement this specification must do so as described in the following clause.
  There are many different ways to [break lines of text], and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement this specification must do so as described in the following clause. Implementations are free to deviate from this, as long as they do not purport to conform to this specification.

 

C1 An implementation that claims conformance to the default [Unicode Line Break Algorithm] shall produce the same results as the algorithm published in this specification.
  • As specified in Section 3.2 of the Unicode Standard, Unicode specifications are generally described as an algorithm or process, producing a result from a given input. However, these are simply logical specifications; particular implementations can change or optimize the internal processing as long as they provide the same results from the same input.
C2 This specification defines default behavior, which is to be used in the absence of tailoring for particular languages and environments.
  • Where a particular environment requires tailoring, such modifications to this specification can be done without affecting conformance.
C3 If tailoring is used by an implementation that claims conformance to the default [Unicode Line Break Algorithm], the existence of such tailoring must be documented.
  • This does not require that the tailoring be described in a reproducible manner; for example, a statement 'tailored to language X' is sufficient.

At times, this specification recommends best practice. These recommendations are not normative and conformance with this specification does not depend on their realization. These recommendations contain the expression "We recommend ...", "This specification recommends ...", or some similar wording.

4 Names

Names of Unicode sequences are unique. Where possible, they are constructed by appending the names of the constituent elements together while eliding duplicate elements. If a sequence would have a name that already exists, then the name is modified suitably to avoid the clash. Names of sequences are enclosed in <>.

Examples:

<A, B, C>    <LATIN LETTER CAPITAL A B C>

<AE, F>    <LATIN LETTER CAPITAL A E F>

<A, E, F>    <LATIN LETTER CAPITAL A WITH E AND F>

<A, COMBINING DIACRITIC ABOVE>    <A WITH DIACRITIC ABOVE>
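
The construction rule in this section can be sketched in Python under one assumed reading of "eliding duplicate elements": the first element name is kept in full, and each later element drops the leading words it shares with the first. The functions `compose_name` and `register_name` are hypothetical. Because this report does not say how a clashing name is to be modified, the sketch only detects the clash and leaves the resolution to an editor.

```python
def compose_name(element_names: list[str]) -> str:
    """Join element names, eliding leading words shared with the first name."""
    if not element_names:
        raise ValueError("at least one element name is required")
    first = element_names[0].split()
    words = list(first)
    for name in element_names[1:]:
        parts = name.split()
        # Always keep at least the final word of each later element.
        limit = min(len(first), len(parts) - 1)
        shared = 0
        while shared < limit and first[shared] == parts[shared]:
            shared += 1
        words.extend(parts[shared:])
    # Names of sequences are enclosed in angle brackets.
    return "<" + " ".join(words) + ">"


def register_name(name: str, existing: set[str]) -> None:
    """Record a new name, refusing any name that is already in use."""
    if name in existing:
        # Assumption: how to modify a clashing name is left to an editor.
        raise ValueError(f"name already exists: {name}")
    existing.add(name)


if __name__ == "__main__":
    names = [
        "LATIN LETTER CAPITAL A",
        "LATIN LETTER CAPITAL B",
        "LATIN LETTER CAPITAL C",
    ]
    registry: set[str] = set()
    composed = compose_name(names)
    register_name(composed, registry)
    print(composed)
```

Applied to the element names of the first example, this reading yields the name shown there. The remaining examples involve manual adjustments that the sketch does not attempt to reproduce.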

5 Data File

A data file is available. See [Composite].

References

[Charts] The online code charts can be found at http://www.unicode.org/charts/
An index to character names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html
[Composite] [Tentative] Named composite entities data file
For the latest version, see: http://www.unicode.org/Public/UNIDATA/NamedCompositeEntities-4.1.0d2.txt
For other versions, see: http://www.unicode.org/standard/versions/
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Normal] Unicode Technical Report #15: Unicode Normalization Forms
http://www.unicode.org/unicode/reports/tr15/
[RegEx] Unicode Technical Standard #18: Regular Expressions
http://www.unicode.org/unicode/reports/tr18/
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts] Scripts data file
For the latest version, see: http://www.unicode.org/Public/UNIDATA/Scripts.txt
For other versions, see: http://www.unicode.org/standard/versions/
[UCD] Unicode Character Database
http://www.unicode.org/ucd
For an overview of the Unicode Character Database and a list of its associated files.
[Unicode] The Unicode Standard
For the latest version, see: http://www.unicode.org/versions/latest/
For the current version, see: http://www.unicode.org/versions/Unicode4.1.0/
For the last major version, see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1).
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous version of this document.

1 Initial version

Copyright © 2001-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.