PRI #497: Unicode 16.0 Alpha Review

Background Document

Date: February 6, 2024

This document provides background material for the alpha review period for Unicode 16.0.

Alpha review is for early review and comment on the repertoire proposed for eventual publication in Unicode 16.0. During alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.

This early review is provided so that reviewers may consider character repertoire issues prior to the start of beta review (currently scheduled to start in May 2024). Once beta review begins, the repertoire, code points, and character names will all be locked down and will no longer be subject to change.

How to Provide Feedback

Feedback for the alpha review period should be reported under this PRI #497, using the Unicode contact form.

Code Charts

For ease of review, a set of alpha review charts has been prepared. These are accessible on a block-by-block basis for new characters proposed to be added for Unicode 16.0. See 16.0 delta charts.

Alpha review charts also show glyph changes specifically planned for Unicode 16.0. Proposed glyph changes are highlighted in blue, while new characters proposed for encoding in Unicode 16.0 are highlighted in yellow. Note that a significant number of the glyph changes are cosmetic only.

Note that new “Moji Jōhō Kiban” (文字情報基盤) Japanese source references have been added for over 36,000 CJK unified ideographs. These are reflected in the alpha code charts for virtually all CJK unified ideograph blocks by additional representative glyphs added to the “J” column. Because this horizontal extension for the J-source impacts so many characters, the individual characters which have a new J-source glyph added are not highlighted one-by-one in the charts.

Emoji Repertoire

For more information about new emoji for inclusion in Unicode 16.0, see PRI #498 Unicode Emoji 16.0 Alpha Repertoire.

Landing Page

To provide some context for Unicode 16.0, an early draft of the Unicode 16.0 landing page is also available. That draft page is incomplete, and the UTC is not seeking feedback on the content of that page yet. A more complete draft will be available for beta review and comment in May.

Data Files

During the alpha review period, most of the data files associated with the Unicode Character Database are available for review at 16.0 UCD alpha data files. Data files associated with UTS #46 and IDNA are also available for review at 16.0 IDNA alpha data files and Idna2008-16.0.0.txt.

These files are not yet a complete set of data files for Unicode 16.0; a complete set will be provided later for the beta review. Caution: Please do not report missing data files. However, comments on data inconsistencies or the assignment of inappropriate values for new characters are welcome during alpha review.

Normalization: Important Novel Behavior

Unicode 16.0 adds several new characters (in the Kirat Rai, Tulu-Tigalari, and Gurung Khema scripts) with normalization behavior not seen in characters encoded in earlier releases. The normalization algorithm and the definitions of normalization-related properties have not changed. However, this is the first version of the Unicode Standard to include composite characters that can occur in NFC/NFKC strings but that are subsumed by NFC or NFKC normalization when they follow certain other characters. (A composite character has a Decomposition_Mapping (dm) value consisting of a sequence of more than one character. For these new characters, the first character of the decomposition can combine with certain preceding characters.) This situation is illustrated schematically in the following table, which uses an arbitrary convention of square brackets to indicate a composite character.

Character      dm          Full Decomposition   NFC
A              A           A                    A
B              B           B                    B
[BB]           B + B       B + B                [BB]
[AB]           A + B       A + B                [AB]
[ABB]          [AB] + B    A + B + B            [ABB]

Sequences                  Full Decomposition   NFC
A + [BB]                   A + B + B            [ABB]
B + [BB]                   B + B + B            [BB] + B
A + B + [BB]               A + B + B + B        [ABB] + B
[AB] + [BB]                A + B + B + B        [ABB] + B

In this schematic example, the composite character [BB] is in NFC form, and the composite character [AB] is also in NFC form. The problem arises when an implementation encounters a sequence such as A + B + B in text and needs to normalize it to NFC form. If it looks only locally, it might conclude that the B + B should be normalized to [BB] and stop there, but in this context, preceded by an A, the correct normalization is for the entire sequence A + B + B to be normalized to [ABB] in NFC form. More problematic are the sequences shown in the last four rows of the table. Faced with mixed input data, an optimized normalization implementation that makes incorrect assumptions about the status of [BB] can go astray and miss the implications of the characters that precede it.
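The schematic table can be reproduced with a small toy model. The following Python sketch is illustrative only: the tokens "A", "B", "[BB]", "[AB]", and "[ABB]" are the placeholders from the table, not real characters, and every token is assumed to have Canonical_Combining_Class 0, so composition only ever involves adjacent tokens. The function names toy_nfc and naive_nfc are hypothetical. The second function models one possible incorrect shortcut, in which a token that is NFC in isolation is never decomposed.

```python
# Toy model of the schematic table. Tokens are placeholders, not real
# characters, and all are assumed to have ccc=0 (an assumption of this sketch).
DM = {
    "[BB]": ["B", "B"],
    "[AB]": ["A", "B"],
    "[ABB]": ["[AB]", "B"],
}

# Composition pairs derived from the decomposition mappings above.
COMPOSE = {tuple(parts): composite for composite, parts in DM.items()}


def full_decompose(tokens):
    """Recursively replace each composite token by its decomposition."""
    out = []
    for token in tokens:
        if token in DM:
            out.extend(full_decompose(DM[token]))
        else:
            out.append(token)
    return out


def toy_nfc(tokens):
    """Decompose fully, then compose left to right (adjacent tokens only)."""
    out = []
    for token in full_decompose(tokens):
        if out and (out[-1], token) in COMPOSE:
            out[-1] = COMPOSE[(out[-1], token)]
        else:
            out.append(token)
    return out


def naive_nfc(tokens):
    """Incorrect shortcut: composes adjacent tokens but never decomposes,
    so a composite such as [BB] is copied through unchanged."""
    out = []
    for token in tokens:
        if out and (out[-1], token) in COMPOSE:
            out[-1] = COMPOSE[(out[-1], token)]
        else:
            out.append(token)
    return out


print(toy_nfc(["A", "[BB]"]))    # ['[ABB]']
print(naive_nfc(["A", "[BB]"]))  # ['A', '[BB]'], which is not in NFC form
```

In this model, toy_nfc reproduces the NFC column for all four sequences in the lower part of the table, while naive_nfc leaves each of them unchanged or only partly composed.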

Certain optimized implementations of normalization may normalize strings incorrectly if those strings contain these particular characters. Please review your implementation for this behavior and test it thoroughly using the NormalizationTest.txt data. This early notice about normalization issues is provided during alpha review because of its potential impact on such implementations of normalization.
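As one possible way to run such a test, the following Python sketch checks the NFC conformance conditions stated in the header of NormalizationTest.txt (c2 == toNFC(c1) == toNFC(c2) == toNFC(c3), and c4 == toNFC(c4) == toNFC(c5)) against a supplied normalizer. The function names are hypothetical. By default the sketch uses Python's unicodedata module, which reflects the Unicode version the interpreter was built with (see unicodedata.unidata_version) and so may predate Unicode 16.0; pass your own implementation as to_nfc to test it instead.

```python
import unicodedata


def parse_test_line(line):
    """Return columns c1..c5 as strings, or None for comment and @Part lines."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split()) for field in fields]


def nfc_failures(lines, to_nfc=lambda s: unicodedata.normalize("NFC", s)):
    """Return the 1-based numbers of lines that fail the NFC conditions."""
    failures = []
    for number, line in enumerate(lines, 1):
        columns = parse_test_line(line)
        if columns is None:
            continue
        c1, c2, c3, c4, c5 = columns
        ok = (c2 == to_nfc(c1) == to_nfc(c2) == to_nfc(c3)
              and c4 == to_nfc(c4) == to_nfc(c5))
        if not ok:
            failures.append(number)
    return failures


# A short excerpt in the file's format; in practice, pass the lines of the
# NormalizationTest.txt file from the alpha data directory.
SAMPLE_TEST_LINES = [
    "@Part1 # Character by character test",
    "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE",
]

print(nfc_failures(SAMPLE_TEST_LINES))
```

An empty result means every data line passed. The same structure extends to NFD, NFKC, and NFKD by adding the corresponding conditions from the file header.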

For the quickCheck() algorithm to work properly, the relevant characters with canonical decomposition mappings have NFC_Quick_Check=Maybe and NFKC_Quick_Check=Maybe values. If your implementation derives these quick check properties, then please compare your data with that provided in the UCD. Also, please test your quickCheck() implementation against the results specified in NormalizationTest.txt.
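The following Python sketch shows one way to load quick check values in the format used by DerivedNormalizationProps.txt and apply them in a quick check loop modeled on the quickCheck() description in UAX #15. The function names and the three-line excerpt are for illustration only; code points not listed are treated as Yes, which is the default for NFC_Quick_Check. Canonical combining classes are taken from Python's unicodedata module, which is an assumption of this sketch and reflects the interpreter's own Unicode version rather than the 16.0 alpha data.

```python
import unicodedata


def parse_quick_check(lines, prop="NFC_QC"):
    """Map code point -> 'N' or 'M' for the given quick check property."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        low = int(first, 16)
        high = int(last, 16) if last else low
        for cp in range(low, high + 1):
            table[cp] = fields[2]
    return table


def quick_check(text, table):
    """Return 'YES', 'NO', or 'MAYBE' for the given text."""
    result = "YES"
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return "NO"  # combining marks are not in canonical order
        value = table.get(ord(ch), "Y")  # unlisted code points default to Yes
        if value == "N":
            return "NO"
        if value == "M":
            result = "MAYBE"
        last_ccc = ccc
    return result


# Small excerpt for illustration; load the full UCD file in practice.
SAMPLE_QC_LINES = [
    "0340..0341    ; NFC_QC; N # COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK",
    "212A..212B    ; NFC_QC; N # KELVIN SIGN..ANGSTROM SIGN",
    "0300..0304    ; NFC_QC; M # COMBINING GRAVE ACCENT..COMBINING MACRON",
]

qc = parse_quick_check(SAMPLE_QC_LINES)
print(quick_check("e\u0301", qc))  # MAYBE
```

A MAYBE result means the text must go through full normalization to be checked. If an implementation derives its own table and marks one of the new composite characters as Yes rather than Maybe, this loop would report YES for strings that are not actually normalized, so comparing a derived table against the parsed UCD values is a useful first test.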

For details on the affected new characters and their properties and behavior, see L2/24-009R item 5.1.

Unicode 16.0 Core Specification

For the alpha review, placeholder pages of the core spec have been deployed to https://www.unicode.org/versions/Unicode16.0.0/core-spec/, without actual content. This URL may change before final release, and for alpha review is simply a placeholder intended to test deployment and navigation.