SC22/WG20 N618

Draft Text for Amendment #1 to TR 10176

Source: SC22/WG20 (Ken Whistler)

Date: 1998-10-21

Replace the text of Annex A with the following text.

Annex A

Recommended extended repertoire for user-defined identifiers

The recommended extended repertoire consists of those characters which collectively can be used to generate word-like identifiers for most natural languages of the world. This list comprises the letters (combining or not), syllables, and ideographs from ISO/IEC 10646-1, together with the modifier letters and marks conventionally used as parts of words. The list excludes punctuation and symbols not generally included in words or considered appropriate for use in identifiers. Also excluded are most presentation forms of letters and a number of compatibility characters. The inclusion of combining characters corresponds to those allowed under a level 2 implementation of ISO/IEC 10646-1. These are the minimum required to do a reasonable job of representing word-like identifiers in Hebrew, Arabic, and scripts of South and Southeast Asia, which make general use of combining marks. However, combining marks for level 3 implementations of ISO/IEC 10646-1 are not included in the list, so as to avoid the problem of alternative representations of identifiers.
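The problem of alternative representations mentioned above can be illustrated with a short sketch. It assumes Python and its standard unicodedata module, and it uses Unicode normalization only to show that the two spellings are equivalent; this annex does not itself prescribe normalization. The combining character U+0301 used here does not appear in the lists of this annex.

```python
import unicodedata

# Two spellings of the same visible identifier.
precomposed = "caf\u00e9"   # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends in U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def same_code_points(a: str, b: str) -> bool:
    """Compare two identifiers code point by code point."""
    return a == b


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two identifiers after composing them (illustration only)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A processor that compares identifiers code point by code point treats the two spellings as different names, even though they display identically. Leaving such combining marks out of the repertoire avoids the situation.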

Attention is drawn to the fact that using the extended repertoire for identifiers may impact source code portability, since the presence of these characters in program text may not be supported on systems that implement less than the full repertoire of ISO/IEC 10646-1.

The character repertoire listed in this annex is based on ISO/IEC 10646-1:1993 with its COR 1 and AMD 1 through 9. It is subject to expansion in the future, to track future amendments to the standard. However, characters currently listed in this Annex will not be removed from the recommended extended repertoire in future revisions.

The character repertoire listed in this annex should be understood as a recommendation for the minimum extended repertoire for use in user-defined identifiers. Each programming language standard, or implementation of the standard, may extend the repertoire when adapting it, in accordance with established practice of identifier usage for the language and any additional user requirements that may be present. For example, the C language should allow U005F LOW LINE in addition to the character repertoire listed below; COBOL should allow U002D HYPHEN-MINUS as well; Java allows a rather large extension to support a level 3 implementation of ISO/IEC 10646-1. Some programming language standards may allow half- or full-width compatibility characters from ISO/IEC 10646-1, and some of the standards, e.g. COBOL, may recognize these characters in a width-insensitive manner.
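The following is a minimal sketch of how a language might layer its own additions on top of the recommended repertoire. The names BASE_EXCERPT, EXTENSIONS, and allowed are hypothetical, and the base table holds only the first two Latin ranges of this annex rather than the full list.

```python
# Excerpt only: the first two Latin ranges from the list in this annex.
BASE_EXCERPT = [(0x0041, 0x005A), (0x0061, 0x007A)]

# Per-language additions named in the text above (hypothetical table layout).
EXTENSIONS = {
    "C": [(0x005F, 0x005F)],      # LOW LINE
    "COBOL": [(0x002D, 0x002D)],  # HYPHEN-MINUS
}


def allowed(ch, language=None):
    """Return True if ch is in the base excerpt or the language's extension."""
    ranges = BASE_EXCERPT + EXTENSIONS.get(language, [])
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)
```

Keeping the extension separate from the base table makes it clear which characters come from the recommendation and which are specific to one language.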

Programming language standards generally restrict which characters may appear as the first character of an identifier. For example, digits are often prohibited from appearing as the first character of an identifier. To assist in their identification, the decimal digits in ISO/IEC 10646-1 are separately noted in the list below. In addition, combining characters should not appear as the first character of an identifier. To maximize the chances of interoperability between programming languages (for example, when linking compiled objects between languages), programming language standards and their implementations should follow these restrictions when making use of the extended repertoire for user-defined identifiers.
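The first-character restrictions can be sketched as a simple check. This is an illustration under stated assumptions, not a normative algorithm: the function names are hypothetical, and the three tables hold only excerpts (Latin and Devanagari letters, Devanagari combining characters, and two digit ranges) of the lists in this annex.

```python
# Excerpts of the lists in this annex, as (first, last) code value pairs.
LETTERS = [(0x0041, 0x005A), (0x0061, 0x007A), (0x0905, 0x0939)]
COMBINING = [(0x0901, 0x0903), (0x093E, 0x094D)]  # from Devanagari (C)
DIGITS = [(0x0030, 0x0039), (0x0966, 0x096F)]


def in_ranges(ch, ranges):
    """Return True if the code value of ch falls in any of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_valid_identifier(name):
    """Check the repertoire and the first-character restrictions."""
    if not name:
        return False
    first = name[0]
    # Neither a digit nor a combining character may start an identifier.
    if in_ranges(first, DIGITS) or in_ranges(first, COMBINING):
        return False
    everything = LETTERS + COMBINING + DIGITS
    return all(in_ranges(ch, everything) for ch in name)
```

With these excerpts, "x1" is accepted and "1x" is rejected; a Devanagari letter followed by a vowel sign is accepted, while the same vowel sign in first position is rejected.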

The recommended characters consist of the following characters of ISO/IEC 10646-1, using their code values in hexadecimal form. Combining characters for scripts are separated out and marked with a "C" following the respective script entries.

 

Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6,
00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F

Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6,
03DA, 03DC, 03DE, 03E0, 03E2-03F3,
1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57,
1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC,
1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC,
1FF2-1FF4, 1FF6-1FFC

Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4,
04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9

Armenian: 0531-0556, 0561-0587

Hebrew: 05D0-05EA, 05F0-05F2

Hebrew (C): 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2

Arabic: 0621-063A, 0640-064A, 0671-06B7, 06BA-06BE, 06C0-06CE,
06D0-06D3, 06D5, 06E5-06E6

Arabic (C): 064B-0652, 0670, 06D6-06DC, 06E7-06E8, 06EA-06ED

Devanagari: 0905-0939, 0950, 0958-0961

Devanagari (C): 0901-0903, 093E-094D, 0951-0952, 0962-0963

Bengali: 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0,
09B2, 09B6-09B9, 09DC-09DD, 09DF-09E1, 09F0-09F1

Bengali (C): 0981-0983, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09E2-09E3

Gurmukhi: 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33,
0A35-0A36, 0A38-0A39, 0A59-0A5C, 0A5E, 0A74

Gurmukhi (C): 0A02, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D

Gujarati: 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0,
0AB2-0AB3, 0AB5-0AB9, 0ABD, 0AD0, 0AE0

Gujarati (C): 0A81-0A83, 0ABE-0AC5, 0AC7-0AC9, 0ACB-0ACD

Oriya: 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30,
0B32-0B33, 0B36-0B39, 0B5C-0B5D, 0B5F-0B61

Oriya (C): 0B01-0B03, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D

Tamil: 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A,
0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9

Tamil (C): 0B82-0B83, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD

Telugu: 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C60-0C61

Telugu (C): 0C01-0C03, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D

Kannada: 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3,
0CB5-0CB9, 0CDE, 0CE0-0CE1

Kannada (C): 0C82-0C83, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD

Malayalam: 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D60-0D61

Malayalam (C): 0D02-0D03, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D

Thai: 0E01-0E30, 0E32-0E33, 0E40-0E46, 0E50-0E59

Thai (C): 0E31, 0E34-0E3A, 0E47-0E4E

Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97,
0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE,
0EB0, 0EB2-0EB3, 0EBD, 0EC0-0EC4, 0EC6, 0EDC-0EDD

Lao (C): 0EB1, 0EB4-0EB9, 0EBB-0EBC, 0EC8-0ECD

Tibetan: 0F00, 0F40-0F47, 0F49-0F69, 0F88-0F8B

Tibetan (C): 0F18-0F19, 0F35, 0F37, 0F39, 0F71-0F84, 0F86-0F87,
0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9

Georgian: 10A0-10C5, 10D0-10F6

Hiragana: 3041-3093

Katakana: 30A1-30F6, 30FB-30FC

Bopomofo: 3105-312C

Hangul: AC00-D7A3

CJK Unified Ideographs: 4E00-9FA5

Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF,
0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F,
0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F29

Special characters: 00B5, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1,
02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102,
2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128,
212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029
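The notation used in the list above (hexadecimal code values, with a hyphen joining the first and last value of a range) can be read mechanically. The following is a minimal sketch, assuming Python; the names parse_ranges and contains are hypothetical, and only the Armenian entry is transcribed here as sample input.

```python
def parse_ranges(text):
    """Turn an entry such as '0531-0556, 0561-0587' into (first, last) pairs."""
    ranges = []
    for item in text.replace("\n", " ").split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or a wrapped line
        first, _, last = item.partition("-")
        start = int(first, 16)
        # A single code value is treated as a range of length one.
        end = int(last, 16) if last else start
        ranges.append((start, end))
    return ranges


def contains(ranges, ch):
    """Return True if the code value of ch lies in any parsed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


# Sample input: the Armenian entry from the list above.
ARMENIAN = parse_ranges("0531-0556, 0561-0587")
```

Parsing the entries directly from their printed form, rather than retyping them as numeric literals, reduces the chance of transcription errors when the list is expanded to track future amendments.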