BETA Unicode 12.0.0

Note: The beta review period for Unicode 12.0.0 closed on January 7, 2019. Feedback received during the public review is available through PRI #389. This beta review page remains active, however, for convenient access to the prepublication versions of the Unicode 12.0.0 data files and annexes until the formal release planned for March 5, 2019.

The next version of the Unicode Standard will be Version 12.0.0, planned for release on March 5, 2019. This version updates several annexes to deal with segmentation issues and adds significant new repertoire. A total of 554 new characters are encoded, including 61 new emoji characters, 4 new scripts, and multiple additions to existing blocks.

A beta version of the 12.0.0 Unicode Character Database files is available for public review. We strongly encourage implementers to review the summary description, download the beta 12.0.0 Unicode Character Database files, and test their programs with the new data, well before the end of the beta period. It is especially important to review the Notable Issues for Beta Reviewers.

We encourage users to check the code charts carefully to verify correctness of the new characters added to Unicode 12.0.0 and to ensure that there are no regressions in glyph shapes for previously encoded characters.

Summary description

Unicode character database (UCD)

Summary of beta charts

Single-block delta charts with yellow highlighting for new characters

Single-block charts for all of Unicode 12.0.0

Code charts - single download (108 MB)

Auxiliary HTML charts for beta review

Related Unicode Technical Standards

In addition to the Unicode Standard proper, four other Unicode Technical Standards have significant text and data file updates that are correlated with the new additions for Unicode 12.0.0. Review of that text and data is also encouraged during the beta review period.

UTS #10, Unicode Collation Algorithm Data files

UTS #39, Unicode Security Mechanisms Data files

UTS #46, Unicode IDNA Compatibility Processing Data files

UTS #51, Unicode Emoji Data files

Review and Feedback

For guidance on how to focus your review, see the section Notable Issues for Beta Reviewers.

Any feedback should be reported using the contact form. Comments on the Unicode Standard Version 12.0.0 or the Unicode Character Database data files should refer to the beta review Public Review Issue #389. Comments on specific Version 12.0.0 UAXes and UTSes should refer to the respective Public Review Issue Numbers for each document, where available.

The comment period ends January 4, 2019. All substantive technical comments must have been received by that date for consideration at the January UTC meeting. Editorial comments (typos, etc.) may still be submitted after that date for consideration in the final editorial work.

Note: All beta files may be updated, replaced, or superseded by other files at any time. The beta files will be discarded once Unicode 12.0.0 is final. It is inappropriate to cite these files as other than a work in progress. No products or implementations should be released based on the beta UCD data files—use only the final, approved Version 12.0.0 data files, expected on March 5, 2019.

The Unicode Consortium provides early access to updated versions of the data files and text to give reviewers and developers as much time as possible to ensure a problem-free adoption of Version 12.0.0.

The assignment of characters for Unicode 12.0.0 is now stable. There will be no further additions or modifications of code points and no further changes to character names. Please do not submit feedback requesting changes to code points or character names for Unicode 12.0.0, as such feedback is not actionable.

One of the main purposes of the beta review period is to verify and correct the preliminary character property assignments in the Unicode Character Database. Reviewers should check for property changes to existing Unicode 11.0.0 characters, as well as the property values for the new Unicode 12.0.0 character additions. The Auxiliary HTML charts include the new characters highlighted in yellow, with names appearing when hovering over a cell. These charts may be useful for reviewing information such as the default collation order, Script property assignments, and so forth during beta review.

To facilitate verification of the property changes and additions, diffable XML versions of the Unicode Character Database are available. These XML files are dated, so that people can check the details of changes that occurred during the beta review period. For more information, see the diffs.readme.txt file.
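As an illustration of how the XML files might be compared, the following is a minimal sketch rather than an official tool. It assumes the flat UCD XML layout described in UAX #42: `char` elements in the namespace `http://www.unicode.org/ns/2003/ucd/1.0`, carrying either a `cp` attribute or a `first-cp`/`last-cp` pair, with property values stored as attributes named by their short aliases (for example `scx`). The function names and the two sample fragments are hypothetical and hand-written for this example.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from UAX #42; adjust if the files you download differ.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def load_property(xml_text, prop):
    """Map each code point (int) to its value for one property attribute."""
    values = {}
    root = ET.fromstring(xml_text)
    for char in root.iter(UCD_NS + "char"):
        value = char.get(prop)
        if value is None:
            continue
        if char.get("cp") is not None:
            first = last = int(char.get("cp"), 16)
        elif char.get("first-cp") is not None:
            first = int(char.get("first-cp"), 16)
            last = int(char.get("last-cp"), 16)
        else:
            continue
        # Ranges are expanded for simplicity; this costs memory on large files.
        for cp in range(first, last + 1):
            values[cp] = value
    return values

def diff_property(old_xml, new_xml, prop):
    """Return {code point: (old value, new value)} for every difference."""
    old = load_property(old_xml, prop)
    new = load_property(new_xml, prop)
    return {
        cp: (old.get(cp), new.get(cp))
        for cp in sorted(old.keys() | new.keys())
        if old.get(cp) != new.get(cp)
    }

# Hand-written fragments in the assumed layout, not copies of the real files.
OLD_SAMPLE = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="202F" scx="Zyyy"/>'
    '</repertoire></ucd>'
)
NEW_SAMPLE = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="202F" scx="Latn Mong"/>'
    '<char cp="1E94B" scx="Adlm"/>'
    '</repertoire></ucd>'
)

for cp, (before, after) in diff_property(OLD_SAMPLE, NEW_SAMPLE, "scx").items():
    print(f"U+{cp:04X}: {before!r} -> {after!r}")
```

A value of `None` on the left-hand side indicates a code point that is absent from the older file, which is how newly assigned characters show up in this sketch.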

The beta review period is a good opportunity to add support for the new Unicode 12.0.0 characters in internal versions of software, so that software can be tested to verify that the new characters and property assignments do not cause problems when upgraded to Version 12.0.0 of Unicode.

Notable Issues for Beta Reviewers

Changes to Unicode Standard Annexes

Some of the Unicode Standard Annexes have modifications for Unicode 12.0.0, often in coordination with changes to character properties. Most notably for Unicode 12.0.0:

UAX #29, Unicode Text Segmentation, has adjusted the derivation of Sentence_Break to account for the titlecasing behavior of Georgian.

UAX #38, Unicode Han Database (Unihan) includes a substantial update of the regular expressions for the kIRG_GSource and kIRG_JSource properties, to make them more comprehensible. Implementers should check the text of UAX #38 and their implementations of the CJK properties validation.

UAX #45, U-Source Ideographs, has added a new comment field to the data file, USourceData.txt, as well as new entries. Parsers may need to be checked to ensure they handle the comment field correctly. In addition, there is a new radical/stroke index file, USourceRSChart.pdf, added to the UCD, to assist in lookup of particular U-Source Ideographs.

See the Modifications section of each Annex for details of the relevant changes.

Core Specification Update

The core specification is undergoing extensive review, with numerous additions for Version 12.0.0. Although the draft text for Version 12.0.0 is not yet available, specific reports of any technical or editorial issues in the currently published core specification are also welcome during the beta review period. Such reports will be taken into consideration for corrections to the Version 12.0.0 draft. (Note: The Unicode Consortium has ongoing opportunities for subject-matter volunteers: experts interested in contributing to or editing relevant parts of the core specification or other Unicode specifications.)

Script-specific Issues

Four new scripts have been added in Unicode 12.0.0. Some of these scripts have particular attributes that may cause issues for implementations. The more important of these attributes are summarized here.

Nandinagari is a complex script of the Indic type.

Ottoman Siyaq numerals have complex formatting requirements when combined to represent large numbers.

A set of Egyptian format controls has been added in a new block in the range U+13430..U+13438. While these are intended for use with the existing Egyptian Hieroglyphs script, their use involves a complicated extension to the rendering model for hieroglyphs to enable quadrat formation. Implementers who wish to support these format controls will need to study the specification in the supporting proposal documents. See, in particular, L2/17-112.

U+1E94B ADLAM NASALIZATION MARK has been added for the Adlam script. Although the Adlam script was encoded earlier, implementations have run into trouble attempting to implement the Adlam nasalization mark with characters such as U+0027 APOSTROPHE. The new character is intended to eliminate those problems, but Adlam implementations will need to be updated to add the character and its correct rendering to Adlam fonts.

Casing Issues

A few new uppercase Latin letters have been added, which form case pairs with existing lowercase Latin letters. Casing tables should be checked carefully.
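One way to check a casing table is to confirm that each expected pair round-trips in both directions. The sketch below is only an illustration: it tests Python's built-in case mappings, which reflect whatever Unicode version the interpreter was built with (reported by `unicodedata.unidata_version`), and the function name `case_pair_problems` is hypothetical. The three candidate pairs are believed to be among the Version 12.0.0 additions but should be verified against UnicodeData.txt before being relied upon.

```python
import unicodedata

def case_pair_problems(pairs):
    """Return the (upper, lower) pairs whose simple mappings do not round-trip."""
    problems = []
    for upper, lower in pairs:
        if upper.lower() != lower or lower.upper() != upper:
            problems.append((upper, lower))
    return problems

# Candidate pairs (assumption: verify against the final UnicodeData.txt).
CANDIDATE_PAIRS = [
    ("\uA7C4", "\uA794"),  # C WITH PALATAL HOOK
    ("\uA7C5", "\u0282"),  # S WITH HOOK
    ("\uA7C6", "\u1D8E"),  # Z WITH PALATAL HOOK
]

print("Unicode data version:", unicodedata.unidata_version)
print("Pairs that fail to round-trip:", case_pair_problems(CANDIDATE_PAIRS))
```

An interpreter built on Unicode data older than 12.0 would be expected to report all three pairs as problems, which is the kind of stale-table condition this check is meant to reveal.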

General Character Property Issues

There are a number of issues related to particular character properties:

Numerous updates have been made to the Alphabetic and Diacritic property values, to help keep the DUCET table for collation stable when initial weights are assigned based on character property values. Most of the affected characters are tone marks for lesser-known scripts.

A Script_Extensions property value of {Latn Mong} has been added for U+202F NARROW NO-BREAK SPACE. Implementations that support Script_Extensions should check that they are handling this character appropriately, and that its identification in both Latin and Mongolian script runs is correct.
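The check described above can be sketched as follows. This is a minimal example, assuming the usual UCD data file layout of `code point or range ; value # comment`, and using a single hand-typed sample line in the style of ScriptExtensions.txt rather than the real file. The helper names `parse_ucd_ranges`, `script_extensions`, and `fits_script_run` are hypothetical.

```python
def parse_ucd_ranges(text):
    """Parse 'XXXX[..YYYY] ; value # comment' lines into (first, last, value)."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        code_field, value = (part.strip() for part in data.split(";", 1))
        first, _, last = code_field.partition("..")
        entries.append((int(first, 16), int(last or first, 16), value))
    return entries

def script_extensions(entries, cp):
    """Return the set of script codes listed for cp, or None if not listed."""
    for first, last, value in entries:
        if first <= cp <= last:
            return set(value.split())
    return None

def fits_script_run(entries, cp, script):
    """True if cp may be treated as part of a run in the given script."""
    scripts = script_extensions(entries, cp)
    return scripts is not None and script in scripts

# Hand-typed sample line in the assumed format.
SAMPLE = """
# ScriptExtensions-style sample
202F          ; Latn Mong # Zs       NARROW NO-BREAK SPACE
"""

ENTRIES = parse_ucd_ranges(SAMPLE)
for script in ("Latn", "Mong", "Cyrl"):
    print(script, fits_script_run(ENTRIES, 0x202F, script))
```

A real implementation would fall back to the Script property for code points that ScriptExtensions.txt does not list; that fallback is omitted here.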

Numeric Property Issues

Unicode 12.0 adds a large number of Tamil characters used for fractional values in traditional accounting practices. Some of these fraction characters introduce fractional values distinct from those noted for fraction characters in prior versions of the UCD. Implementations which handle numeric values of Unicode characters and which have special assumptions about how to deal with fractional values should take note of the following new fractional values among the Tamil fractions:

1/320, 1/80, 1/64, 1/32, 3/64

Note that these Tamil fractions share structural similarities (and many values) with Malayalam fractions. See DerivedNumericValues.txt for details.
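Implementations that store numeric values as floating-point numbers, or that assume a fixed set of denominators, may wish to test against exact rationals. The following sketch assumes that DerivedNumericValues.txt lines have the form `code point ; decimal value ; (empty) ; rational value # comment` and reads the rational field into Python's `fractions.Fraction`. The sample lines are hand-typed in that style, and the helper names and the `allowed` set are hypothetical.

```python
from fractions import Fraction

def parse_numeric_values(text):
    """Map code points to exact Fraction values from the rational field."""
    values = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_field, rational = fields[0], fields[3]
        first, _, last = code_field.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = Fraction(rational)  # accepts "3", "-1/2", "1/320"
    return values

def unexpected_denominators(values, allowed):
    """List denominators present in the data that the caller did not expect."""
    return sorted({v.denominator for v in values.values()} - set(allowed))

# Hand-typed sample lines in the assumed format; check against the real file.
SAMPLE = """
0F33  ; -0.5 ; ; -1/2 # No       TIBETAN DIGIT HALF ZERO
11FC0 ; 0.003125 ; ; 1/320 # No  TAMIL FRACTION ONE THREE-HUNDRED-AND-TWENTIETH
"""

VALUES = parse_numeric_values(SAMPLE)
print(unexpected_denominators(VALUES, allowed={1, 2, 4}))
```

Keeping the values as exact fractions avoids rounding surprises when comparing small values such as 1/320 with values computed elsewhere.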

Unihan-related Issues

All Unihan properties should be reviewed carefully. Additionally, the following deserve special attention:

The regular expressions for the kIRG_GSource and kIRG_JSource properties were completely rewritten to make them more comprehensible, and should be checked. See UAX #38 for details.

Standardized Variation Sequences

Many new standardized variation sequences have been added, to represent distinctions between variants of some common East Asian punctuation characters.

Code Charts

As always, careful review of the updated code charts for Version 12.0.0 is advised, especially for all newly added scripts. Particular issues to take note of include:

The old Phags-pa font has been replaced with a better design.

The old Bopomofo font has also been replaced with a better design. This impacts the Bopomofo and Bopomofo Extended blocks, as well as two Bopomofo tone mark characters (U+02EA and U+02EB) in the Spacing Modifier Letters block.

Collation-related Issues

The Default Unicode Collation Element Table (DUCET) was updated to the Unicode 12.0 repertoire for UCA 12.0. For the most part, the additions for new scripts and other characters are unremarkable, but implementations should be checked to ensure the new additions do not cause problems.

Other Issues

Please also check the following specific items carefully:

Sixty-one new emoji characters have been added. In addition to those individual characters, many new emoji sequences have also been recognized. If your implementation involves emoji support, be sure to carefully review UTS #51, Unicode Emoji (PRI #380) and the related beta review for Emoji 12.0 (PRI #387).

The end of the assigned ideograph range in the Tangut block has been extended by 6 code points, from U+187F1 to U+187F7. Implementations often have hard-coded ranges for large ideographic blocks, so verify that you have no dependencies on the previous end of this range.
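One way to avoid a hard-coded end point is to derive such ranges from the data files at build time. The sketch below assumes the UnicodeData.txt convention of marking large ranges with a pair of lines whose name fields read `<Name, First>` and `<Name, Last>`. The function name `named_ranges` is hypothetical, and the two sample lines are hand-typed in that style rather than copied from the file.

```python
def named_ranges(unicode_data_text):
    """Collect '<Name, First>' / '<Name, Last>' pairs from UnicodeData.txt."""
    ranges = {}
    pending = {}
    for line in unicode_data_text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            # Strip the leading '<' and the trailing ', First>'.
            pending[name[1:-len(", First>")]] = cp
        elif name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            ranges[label] = (pending.pop(label), cp)
    return ranges

# Hand-typed sample lines in the assumed format.
SAMPLE = """
17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;
187F7;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

RANGES = named_ranges(SAMPLE)
first, last = RANGES["Tangut Ideograph"]
print(f"Tangut ideographs: U+{first:04X}..U+{last:04X}")
```

Generating the range from the data in this way means that a future extension of the range requires only a data update, not a code change.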

The following blocks are new in Unicode 12.0.0. Check implementations carefully for any range or property value assumptions regarding these new blocks. See also the single-block delta charts.

Range           Block Name

10FE0..10FFF    Elymaic
119A0..119FF    Nandinagari
11FC0..11FFF    Tamil Supplement
13430..1343F    Egyptian Hieroglyph Format Controls
1B130..1B16F    Small Kana Extension
1E100..1E14F    Nyiakeng Puachue Hmong
1E2C0..1E2FF    Wancho
1ED00..1ED4F    Ottoman Siyaq Numbers
1FA70..1FAFF    Symbols and Pictographs Extended-A
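For implementations that look up blocks by code point, the new ranges listed above can be exercised with a small table-driven test. The following is a minimal sketch using a sorted list and binary search. The table contains only the blocks listed above, not the full contents of Blocks.txt, and the names `NEW_BLOCKS_12` and `block_of` are hypothetical.

```python
import bisect

# The new Unicode 12.0.0 blocks listed above, sorted by starting code point.
NEW_BLOCKS_12 = [
    (0x10FE0, 0x10FFF, "Elymaic"),
    (0x119A0, 0x119FF, "Nandinagari"),
    (0x11FC0, 0x11FFF, "Tamil Supplement"),
    (0x13430, 0x1343F, "Egyptian Hieroglyph Format Controls"),
    (0x1B130, 0x1B16F, "Small Kana Extension"),
    (0x1E100, 0x1E14F, "Nyiakeng Puachue Hmong"),
    (0x1E2C0, 0x1E2FF, "Wancho"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1FA70, 0x1FAFF, "Symbols and Pictographs Extended-A"),
]
STARTS = [first for first, _, _ in NEW_BLOCKS_12]

def block_of(cp, blocks=NEW_BLOCKS_12, starts=STARTS):
    """Return the block name containing cp, or None if cp is in no listed block."""
    i = bisect.bisect_right(starts, cp) - 1  # last block starting at or before cp
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None

for cp in (0x10FE0, 0x13438, 0x1FAFF, 0x11000):
    print(f"U+{cp:04X}: {block_of(cp)}")
```

Testing both end points of each range, as well as the code points immediately outside it, is a simple way to catch off-by-one assumptions.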

Some blocks have also had font updates; see the single-block delta charts for details. In such cases, careful review of the blocks in question is advised, to ensure that there have not been any regressions in representative glyph display.

General Issues

For current proposed updates to the particular UAXes, see Proposed Updates for Standard Annexes or use the links in the navigation bar on this page. Particular issues in the UAXes may also be the focus of specific Public Review Issues. Each proposed textual change in a UAX is highlighted, so that you can focus your review on those sections if you have limited time. The changes are also listed in detail in the Modifications sections (linked from the table of contents of each document), and are summarized in UAX changes, so you can check on those areas that might be of most interest.

Some links between beta documents and the proposed updates for UAXes will not work correctly during the beta review period. This is a known problem which does not need to be reported, as such links point to the eventual final names or revision numbers for the released versions.

Stability

Certain character properties for newly assigned characters cannot be changed after the formal release of each version of the standard, because of the Character Encoding Stability Policy. Such character property values need special attention during the beta review process, as they cannot be corrected after publication. These include:

Any property affecting Unicode Normalization, including Decomposition_Mapping, Canonical_Combining_Class, and Composition_Exclusion.

The determination of whether a character is included in identifiers (XID_Start, XID_Continue).

Case mappings and case foldings.
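Reviewers who wish to spot-check these properties for particular characters could record a snapshot of the relevant values and compare it against the beta data. The sketch below uses Python's standard `unicodedata` module and string methods, so it reports values for the Unicode version bundled with the interpreter, not the beta files. The identifier checks rely on Python's own identifier rules, which are based on XID_Start and XID_Continue but also accept the underscore, so they are only an approximation. The function name `stability_snapshot` is hypothetical.

```python
import unicodedata

def stability_snapshot(ch):
    """Collect stability-sensitive values for one character."""
    return {
        "decomposition": unicodedata.decomposition(ch),
        "ccc": unicodedata.combining(ch),
        # Approximations of XID_Start / XID_Continue via Python identifiers.
        "identifier_start": ch.isidentifier(),
        "identifier_continue": ("a" + ch).isidentifier(),
        "lower": ch.lower(),
        "upper": ch.upper(),
        "casefold": ch.casefold(),
    }

print("Unicode data version:", unicodedata.unidata_version)
for ch in ("\u00E9", "\u0301"):
    print(f"U+{ord(ch):04X}", stability_snapshot(ch))
```

Snapshots saved under one version of the data can later be compared with snapshots from a newer version; any difference for a previously assigned character in these fields would merit a closer look.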