L2/01-378

ISO/IEC JTC1/SC2/WG2 N2370

2001-10-10

Universal Multiple Octet Coded Character Set

International Organization for Standardization

Organisation internationale de normalisation

Международная организация по стандартизации

Doc Type: Working Group Document

Title: Unicode Consortium Liaison Report

Source: The Unicode Consortium

Status: Liaison report

Action: For consideration by JTC1/SC2/WG2

Related: N2339, N2361, N2362, N2366, N2369, N2383

References

In the interest of continuing synchronization between the Unicode Standard and ISO/IEC 10646 there are a few areas where it would be useful to give implementers wishing to create interoperable implementations access to additional information.

Therefore it would be helpful if ISO/IEC 10646 included references to the Unicode Standard and Unicode Technical Reports as sources of more detailed descriptions of the scripts in 10646 and of further information on how to render and process them.

Given the emerging prominence of normalization in the context of W3C protocols and related specifications, the Unicode Consortium suggests that WG2 add a reference to Unicode Standard Annex #15, Unicode Normalization Forms, as a source of information for implementers who wish to create normalized data streams for implementation levels 2 and 3.

Policies

The effect of Normalization Form C is that sequences of characters that are considered canonically equivalent in the Unicode Standard are normalized to the same sequence. This has some practical effects on the coding of new characters, and in the interest of synchronization WG2 should make sure these are reflected in the Principles and Procedures. In addition, singleton canonical equivalences have the practical effect of removing the distinction between pairs of characters; such pairs of characters should therefore be imaged with identical glyphs.

The Unicode Consortium would like to be able to advise implementers that the following ranges are set aside for characters that can be ignored in most processing, mostly because they are formatting characters.
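As a small illustration of the two effects described above, the following sketch uses Python's standard `unicodedata` module. The helper name `nfc` and the sample characters are chosen here for illustration and do not come from this report; the results depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Normalization Form C (helper name is illustrative)."""
    return unicodedata.normalize("NFC", text)


# Canonically equivalent sequences normalize to the same sequence:
# "e" followed by COMBINING ACUTE ACCENT versus the precomposed letter.
decomposed = "e\u0301"
precomposed = "\u00e9"
same_after_nfc = nfc(decomposed) == nfc(precomposed)

# A singleton canonical equivalence: ANGSTROM SIGN (U+212B) normalizes to
# LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5), so the distinction
# between the two characters does not survive normalization.
angstrom_normalized = nfc("\u212b")
```

Because the singleton is replaced outright, any glyph difference between such a pair would be lost in normalized data, which is the reason for imaging them identically.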

2060..2069

FFF0..FFF8

Plane E

Since formatting characters are very different from other graphic characters, implementations that intend to be robust in terms of additions to the Unicode Standard and 10646 need to be able to anticipate the ranges where such characters are to be added in the future. Therefore the Consortium asks WG2 to adopt a matching policy and reflect it in the roadmap.
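One way an implementation might anticipate such additions is to test code points against the reserved ranges rather than against a list of currently assigned characters. The sketch below is illustrative only: the names `IGNORABLE_RANGES` and `is_reserved_ignorable` are hypothetical, and "Plane E" is interpreted here as the full range E0000..EFFFF, which is an assumption about the intended extent.

```python
# Ranges taken from the list above; Plane E is assumed to mean E0000..EFFFF.
IGNORABLE_RANGES = [
    (0x2060, 0x2069),
    (0xFFF0, 0xFFF8),
    (0xE0000, 0xEFFFF),
]


def is_reserved_ignorable(code_point: int) -> bool:
    """Return True if code_point lies in one of the ranges listed above."""
    return any(low <= code_point <= high for low, high in IGNORABLE_RANGES)


def strip_ignorable(text: str) -> str:
    """Drop characters in the listed ranges, e.g. before comparing strings."""
    return "".join(ch for ch in text if not is_reserved_ignorable(ord(ch)))
```

A range test of this kind keeps working unchanged when new formatting characters are later assigned inside the reserved ranges.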

For reasons given in a separate technical paper, variation selectors need to be combining characters. It has turned out to be an intractable problem to allow variation selectors to act on combining characters. Therefore Unicode has established a policy that variation selectors cannot be used with combining marks, and asks WG2 to adopt the same policy.
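A checker for this policy could be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, only the selectors at FE00..FE0F are considered, and "combining mark" is approximated by a General Category beginning with "M" as reported by Python's `unicodedata` module.

```python
import unicodedata


def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F (the only selectors this sketch considers)."""
    return 0xFE00 <= ord(ch) <= 0xFE0F


def variation_selectors_well_placed(text: str) -> bool:
    """Return False if any variation selector follows a combining mark
    or appears with no preceding character at all."""
    for index, ch in enumerate(text):
        if not is_variation_selector(ch):
            continue
        if index == 0:
            return False
        # Categories Mn, Mc and Me are all treated as combining marks here.
        if unicodedata.category(text[index - 1]).startswith("M"):
            return False
    return True
```

Note that variation selectors are themselves combining characters, so under this approximation a selector directly following another selector is also rejected.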

Other information

There have been some questions regarding the proposed grapheme joiner. This character is not intended to affect ligation. The existing document provides enough information on everything else.

Roadmaps

The Unicode Consortium is proud to announce that the WG2 roadmaps are now hosted on its web site and will be maintained there by the roadmap ad-hoc committee. A separate document has been submitted with further details.

Publications

Unicode 3.1.1 has been released.

Unicode experienced great difficulties in publishing code charts for the Han Extensions for 3.1. Such problems can be avoided in the future by requiring that all scripts be processed with the same production process.

The Unicode Standard, Version 3.2, is tentatively scheduled for Spring 2002 and will incorporate the new characters for Amendment 1. Version 4.0 is anticipated for Spring 2003. It will be a complete reissue of the Unicode Standard in book form.

Summary of UTC actions

The following are brief summaries of recent Unicode Technical Committee (UTC) actions that are deemed of interest to WG2. In almost all instances these actions are reflected in separate, specific documents submitted to WG2; in those cases such documents override the summary information given below.

1. The UTC reviewed and accepted the changes to the repertoire for FPDAM1 with minor changes suggested for several character names.

2. The UTC supports one new script, Limbu, as a candidate for Amendment 2 of Part 1.

3. The UTC supports new scripts (Aegean scripts, Ugaritic 103C0..103DF, Osmanya 10480..104A9, Shavian 10450..1047F) in Plane 1 as candidates for Amendment 1 of Part 2.

4. The UTC would like to request a minor revision in a note to the text describing UTF-8 in 10646-1 to allow FFFE and FFFF. This would provide synchronization with the formal definition of UTF-8 as used in the Unicode Standard. This is part of an effort to align the definition of UTF-8 as used by Unicode with the definition of UTF-8 as used by IETF and others.

5. The UTC approved several additional characters, which will be put forth in specific proposal documents or ballot comments. They include

· two monogram and four digram characters at 2672..2677

· sixty-four hexagram characters at 4DC0..4DFF

· fifteen variation selector characters to be encoded at FE01..FE0F with the names VARIATION SELECTOR 2..VARIATION SELECTOR 16

· 240 variation selector characters to be encoded at E0110..E01FF with the names VARIATION SELECTOR 17..VARIATION SELECTOR 256.

· five phonetic characters:
U+0221 LATIN SMALL LETTER T WITH CURL
U+0234 LATIN SMALL LETTER D WITH CURL
U+0235 LATIN SMALL LETTER N WITH CURL
U+02AE LATIN SMALL LETTER TURNED H WITH FISHHOOK
U+02AF LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL

· eight bracket characters:
U+27E6 MATHEMATICAL LEFT WHITE SQUARE BRACKET
U+27E7 MATHEMATICAL RIGHT WHITE SQUARE BRACKET
U+27E8 MATHEMATICAL LEFT ANGLE BRACKET
U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET
U+27EA MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
U+27EB MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
U+FF5F FULLWIDTH LEFT WHITE PARENTHESIS
U+FF60 FULLWIDTH RIGHT WHITE PARENTHESIS

· ARABIC CURRENCY SIGN RIAL at U+FDFC

6. Unicode has considered a number of characters to be ‘deprecated’, with the consequence that their use is strongly discouraged, even though they formally remain in the standard. The list of these characters (0340, 0341, 206A..206F) will now be made available in machine-readable form. Other, less strongly discouraged characters are often annotated in the Unicode names list.

7. Certain characters need to be ignored by almost all general text processes, except for certain specific processes for which they were designed (example: JOINER and NON-JOINER). Many display engines will force a zero-width glyph for these characters, as too many fonts simply display the missing glyph symbol. To ensure that software that is created today can handle future addition of such characters, in the Unicode Character Database, the UTC designated the following ranges with appropriate code point properties:

· 2060..206F

· FE00..FE0F

· FFF0..FFFC

· E0000..E0FFF

8. The UTC accepted a Proposed Draft Unicode Technical Report #26, Compatibility Encoding Scheme for UTF-16: 8-Bit.

9. The UTC has formally rejected the proposal to encode Klingon.

10. The UTC supports the development of a proposal for encoding Egyptian hieroglyphs consisting of the Gardiner subset while leaving hieroglyphic extensions and markup issues for later study.

11. The UTC has become aware that there is no direct support for UTF-16 in the C and C++ programming languages and forwarded to the C and C++ committees a request for an unambiguous UTF-16 datatype and string literal support.

12. The UTC recommends adding a separate Help document to the Proposal Submission Form to assist the submitter in filling out the form.

13. The UTC welcomes dialogue with representatives of Cambodia to ensure that the Unicode Standard and 10646 meet the needs for text representation of Cambodian.
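As an illustration of items 4 and 8 above, the following sketch shows the UTF-8 byte sequences for FFFE and FFFF, and contrasts UTF-8 with a UTF-16-compatible 8-bit form in which a supplementary character is written as its two surrogate code units, each encoded as three bytes. The function name `encode_via_surrogates` is hypothetical, and the byte layout shown is an assumption based on the general description of the compatibility scheme, not on text in this report.

```python
def encode_via_surrogates(text: str) -> bytes:
    """Encode text so that each supplementary character becomes two
    three-byte sequences, one per UTF-16 surrogate (illustrative only)."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point >= 0x10000:
            offset = code_point - 0x10000
            high = 0xD800 + (offset >> 10)    # high (leading) surrogate
            low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
            # "surrogatepass" lets Python's UTF-8 codec emit surrogate bytes.
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# FFFE and FFFF have ordinary three-byte UTF-8 forms.
utf8_fffe = "\ufffe".encode("utf-8")
utf8_ffff = "\uffff".encode("utf-8")

# A Plane 1 code point (10480, from the Osmanya range in item 3).
utf8_plane1 = "\U00010480".encode("utf-8")
compat_plane1 = encode_via_surrogates("\U00010480")
```

For characters on the Basic Multilingual Plane the two forms are byte-for-byte identical; they differ only for supplementary characters, which take four bytes in UTF-8 and six in the surrogate-based form.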