Unicode® 15.1.0

2023 September 12 (Announcement)

STATUS: This is a preliminary draft page for an upcoming release. Some details may be missing or incorrect, and some links may be wrong or broken. During the alpha review period, errors are expected and feedback is not necessary. During the beta review period, feedback on errors will be helpful and appreciated.

This page summarizes the important changes for the Unicode Standard, Version 15.1.0. This version supersedes all previous versions of the Unicode Standard.

A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration

A. Summary

Unicode 15.1 adds 627 characters, for a total of 149,813 characters.

There are several significant themes for this release of the Unicode Standard.

  • The repertoire addition consists almost entirely of urgently needed CJK ideographs, synchronized with planned additions to the Chinese national standard, GB 18030. The remaining additions to the repertoire extend the set of ideographic description characters, to better enable description of unusual CJK ideographs.
  • Major updates were made to UAX #9, Unicode Bidirectional Algorithm, UAX #31, Unicode Identifiers and Syntax, and UTS #39, Unicode Security Mechanisms, to coordinate with the publication of an important new Unicode Technical Standard: UTS #55, Unicode Source Code Handling.
  • Segmentation rule changes, most notably:
    • Support was added to line breaking (UAX #14, Unicode Line Breaking Algorithm) for orthographic syllables in a number of South and Southeast Asian writing systems.
    • Grapheme cluster breaking (UAX #29, Unicode Text Segmentation) has adopted the aksara cluster behavior for six scripts. That cluster breaking behavior had previously been widely available via CLDR and ICU.
    • These changes involved significant character property updates.

Synchronization

Several other important Unicode specifications have been updated for Version 15.1. The following four Unicode Technical Standards are versioned in synchrony with the Unicode Standard, because their data files cover the same repertoire. All have been updated to Version 15.1:

Specification                                   Scope                                    Data Files
UTS #10, Unicode Collation Algorithm            Sorting Unicode text                     UCA data
UTS #39, Unicode Security Mechanisms            Reducing Unicode spoofing                Security data
UTS #46, Unicode IDNA Compatibility Processing  Compatible processing of non-ASCII URLs  IDNA data
UTS #51, Unicode Emoji                          Emoji and their behavior                 Emoji data

Some of the changes in Version 15.1 and associated Unicode Technical Standards may require modifications to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Technical Overview

Version 15.1 of the Unicode Standard consists of:

  • The core specification (unchanged from Version 15.0)
  • The code charts (delta and archival) for this version
  • The Unicode Standard Annexes
  • The Unicode Character Database (UCD)

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Core Specification

The core specification is available as a single PDF (14 MB) for viewing. Links are also available in the navigation bar on the left of this page to access individual chapters and appendices of the core specification.

Code Charts

Several sets of code charts are available. They serve different purposes:

  • The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 15.1.0 in particular two additional sets of code chart pages are provided:

  • A set of delta code charts showing the new blocks and any blocks in which characters were added for Unicode 15.1.0. The new characters are visually highlighted in the charts.
  • A set of archival code charts that represents the entire set of characters, names and representative glyphs at the time of publication of Unicode 15.1.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Unicode Standard Annexes

Links to the individual Unicode Standard Annexes are available in the navigation bar on the left of this page. The list of significant changes in the content of the Unicode Standard Annexes for Version 15.1 can be found in Section G below.

Unicode Character Database

Data files for Version 15.1 of the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap to the functions of the various subdirectories. Zipped versions of the UCD for bulk download are available, as well.

Version References

Version 15.1.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 15.1.0, (South San Francisco, CA: The Unicode Consortium, 2023. ISBN 978-1-936213-33-7)
https://www.unicode.org/versions/Unicode15.1.0/

The terms “Version 15.1” or “Unicode 15.1” are abbreviations for the full version reference, Version 15.1.0.

The citation and permalink for the latest published version of the Unicode Standard are:

The Unicode Consortium. The Unicode Standard.
https://www.unicode.org/versions/latest/

A complete specification of the contributory files for Unicode 15.1 is found on the page Components for 15.1.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

Errata

Errata incorporated into Unicode 15.1 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 15.1, see the list of current Updates and Errata.

C. Stability Policy Update

The Case Folding Stability policy has been extended with an explicit statement of the stability of case folding as applicable to toNFKC_Casefold(S) between versions of the Unicode Standard. A clarification has been added regarding the subtle distinction between toNFKC_Casefold(S) and toCasefold(toNFKC(S)).
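
The distinction can be illustrated with a minimal Python sketch. The Python standard library has no NFKC_Casefold function, so the code below implements only the composed form, toCasefold(toNFKC(S)); the helper name casefold_of_nfkc is hypothetical. One difference it makes visible: the NFKC_Casefold mapping removes default ignorable code points such as U+00AD SOFT HYPHEN, whereas the composed form keeps them.

```python
import unicodedata


def casefold_of_nfkc(s: str) -> str:
    """Hypothetical helper: toCasefold(toNFKC(S)).

    This is NFKC normalization followed by full case folding. It is NOT
    toNFKC_Casefold(S): among other things, it does not remove default
    ignorable code points.
    """
    return unicodedata.normalize("NFKC", s).casefold()


# U+00AD SOFT HYPHEN is a default ignorable code point. It survives the
# composed operation, so the result still contains it.
folded = casefold_of_nfkc("A\u00ADB")
print([f"U+{ord(ch):04X}" for ch in folded])

# Compatibility characters are still decomposed by the NFKC step:
# U+FB01 LATIN SMALL LIGATURE FI becomes the two letters "fi".
print(casefold_of_nfkc("\uFB01"))
```

An implementation that needs true toNFKC_Casefold(S) behavior would have to use the NFKC_Casefold mapping data from the Unicode Character Database, or a library that exposes it.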

D. Textual Changes and Character Additions

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview

627 characters have been added. For details, see the delta code charts.

New Blocks

There is one newly defined block in Version 15.1:

Range           Block Name
2EBF0..2EE5F    CJK Unified Ideographs Extension I

The block for CJK Unified Ideographs Extension I was placed near the end of Plane 2, immediately after Extension F, instead of on Plane 3 after Extension H, in order to make best use of the allocation space available on Plane 2.

E. Conformance Changes

There are no new conformance requirements for the core specification in Unicode 15.1. However, the conformance clauses in several Unicode Standard Annexes and Unicode Technical Standards have been reorganized and split in some cases to make it easier to exactly specify conformance to tailored versions of some Unicode algorithms. UAX #29 has added new conformance clauses.

F. Changes in the Unicode Character Database

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 15.1 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

G. Changes in the Unicode Standard Annexes

In Version 15.1, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex Changes
UAX #9
Unicode Bidirectional Algorithm
Significant clarification was added to the text regarding BD16 and the interaction of control flow between W4, W5, and W6. The use of sos and the treatment of AN/EN with brackets in N0 were also clarified. The text regarding retaining BNs and explicit formatting characters was updated. A major example of the use of HL4 for URLs was added in Section 4.3.3, and a reference to the new UTS #55 was added in Section 4.3.2.
UAX #11
East Asian Width
No significant changes in this version.
UAX #14
Unicode Line Breaking Algorithm
Support was added for line breaking at orthographic syllable boundaries, including the introduction of five new line breaking classes for characters. Rule LB15 was split into LB15a and LB15b, to improve the handling of French-style quotation marks. A clearer characterization of allowed tailorings was added to Section 8.1. Various other clarifications and small updates to the text and examples were also made.
UAX #15
Unicode Normalization Forms
No significant changes in this version.
UAX #24
Unicode Script Property
No significant changes in this version.
UAX #29
Unicode Text Segmentation
Explicit conformance rules for each type of segmentation were added to the Conformance section. Support for orthographic syllable breaking was added in a new rule, GB9c. The definition of "crlf" was updated in the table of Regex Definitions. Multiple changes were made to the table of Word_Break Property Values. A note was added in Section 3.1.1 clarifying that each emoji sequence constitutes a single grapheme cluster.
UAX #31
Unicode Identifiers and Syntax
This UAX was retitled to better reflect its scope. Multiple changes were made to the section of Default Identifiers, including the removal of UAX31-R1a, Restricted Format Characters. A significant example was added to UAX31-R1b, Stable Identifiers. Section 4 was completely rewritten, separating the discussion of whitespace and of syntax. The section on limited contexts for joining controls was moved out of this annex and into UTS #39, instead. Section 7 was added, with three new standard profiles: mathematical compatibility notation, emoji, and default ignorable exclusion.
UAX #34
Unicode Named Character Sequences
No significant changes in this version.
UAX #38
Unicode Han Database (Unihan)
Documentation was added for CJK Unified Ideographs Extension I and for 6 new provisional properties. 7 existing provisional properties were removed. The syntax, list of sources, and/or descriptions were updated for the kIRG_GSource, kIRG_KSource, and kIRG_KPSource properties. Syntax and descriptions were also updated for several other properties, including kRSUnicode.
UAX #41
Common References for Unicode Standard Annexes
All references were updated for Unicode 15.1.
UAX #42
Unicode Character Database in XML
New code point attributes, values, and patterns were added for Unicode 15.1.
UAX #44
Unicode Character Database
The documentation was updated to describe the changes to the UCD for Version 15.1.
UAX #45
U-Source Ideographs
A new Section 3 was added, documenting the ranges of U-source ideographs that were added in each version of the Unicode Standard. The N, V, W, and X status values were updated to the more descriptive FutureWS, Variant, Rejected, and NoAction, respectively. The now-obsolete UK-2015 and WS-2017 status values were removed.
UAX #50
Unicode Vertical Text Layout
No significant changes in this version.

H. Changes in Synchronized Unicode Technical Standards

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

Unicode Technical Standard Changes
UTS #10
Unicode Collation Algorithm
No significant changes in this version.
UTS #39
Unicode Security Mechanisms
The definition and discussion of the contexts for joining controls was moved from UAX #31 into this UTS. The definition of confusability was updated to take default ignorable code points into account. A new confusability relation suitable for identifiers containing bidirectional text was added.
UTS #46
Unicode IDNA Compatibility Processing
Transitional processing of Deviation characters has been deprecated. All major implementations now use nontransitional processing. Step 7 in Section 6 was changed to no longer check for NFD validity; this changed three characters from disallowed_STD3_valid to valid. In nontransitional processing, U+1E9E capital sharp s (ẞ) now maps to U+00DF small sharp s (ß).
UTS #51
Unicode Emoji
A short discussion of the interactions of emoji with computer language syntaxes was added. Minor updates were also made to account for new emoji sequences added in Version 15.1.

M. Implications for Migration

There are a significant number of changes in Unicode 15.1 which may impact implementations upgrading to Version 15.1 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.

Script-related Changes

Because of the limited scope of new repertoire for Version 15.1, there are no migration issues of note specifically tied to various scripts, other than the Han script (see below).

General Character Property Issues

  • There are 5 new ideographic description characters. These extend the syntax of ideographic description sequences.
  • Two of the new ideographic description characters function as unary operators, which necessitated introduction of a new binary property: IDS_Unary_Operator.
  • There are two new properties, ID_Compat_Math_Start and ID_Compat_Math_Continue, for the new Mathematical Compatibility Notation Profile in UAX #31.
  • There is a new property, NFKC_Simple_Casefold, which, like NFKC_Casefold, establishes another normalization form. The new property uses Simple_Case_Folding mappings rather than full Case_Folding mappings. It is intended for use in systems that support case-insensitive identifiers based on simple (1:1) case folding mappings.
  • Five new values have been added to the Line_Break property, in support of new orthographic line breaking rules for a significant number of South and Southeast Asian scripts.
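
As an illustration of how the unary operators change the syntax of ideographic description sequences, the following is a minimal sketch of an arity-based well-formedness check. The function names are hypothetical. The sketch assumes that U+2FFE and U+2FFF are the unary operators, that U+2FF2 and U+2FF3 are the trinary operators, and that the remaining operators in U+2FF0..U+2FFD plus U+31EF are binary. It also simplifies by treating every non-operator character as one operand, which is looser than the full grammar in the standard.

```python
def ids_arity(ch: str) -> int:
    """Number of operands an IDS operator takes; 0 for anything else."""
    cp = ord(ch)
    if cp in (0x2FFE, 0x2FFF):                   # IDS_Unary_Operator
        return 1
    if cp in (0x2FF2, 0x2FF3):                   # trinary operators
        return 3
    if 0x2FF0 <= cp <= 0x2FFD or cp == 0x31EF:   # binary operators
        return 2
    return 0


def is_well_formed_ids(seq: str) -> bool:
    """Check prefix-notation arity only (a simplification, see lead-in)."""
    needed = 1                     # operands still required
    for ch in seq:
        if needed == 0:            # trailing characters after a complete IDS
            return False
        needed += ids_arity(ch) - 1
    return needed == 0


print(is_well_formed_ids("\u2FF0\u4EBB\u8A00"))  # binary operator, two operands
print(is_well_formed_ids("\u2FFE\u4EBA"))        # unary operator, one operand
print(is_well_formed_ids("\u2FFE"))              # unary operator, operand missing
```

A parser written before Version 15.1 that assumes every operator takes two or three operands would miscount the operands of the two unary operators.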

Segmentation

There is a new grapheme cluster segmentation rule GB9c in UAX #29 which refers to a new enumerated property Indic_Conjunct_Break. The list of scripts affected by this rule is expected to expand in subsequent versions of the Unicode Standard. (Note that this outcome differs from the preliminary solution discussed during the beta review for Version 15.1.0, which used macros instead of a new property in the statement of rule GB9c.)
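
The following minimal sketch shows the kind of check that rule GB9c implies, under stated assumptions. It assumes the rule forbids a break before a consonant when that consonant is preceded by another consonant followed by a run of Extend and Linker characters containing at least one Linker; consult UAX #29 for the normative statement. The INCB table is a tiny hand-picked subset for Devanagari, not the real property data, and the function name is hypothetical.

```python
# Hypothetical, hand-picked subset of Indic_Conjunct_Break values.
# A real implementation would load the property from the UCD.
INCB = {
    "\u0915": "Consonant",  # DEVANAGARI LETTER KA
    "\u0937": "Consonant",  # DEVANAGARI LETTER SSA
    "\u094D": "Linker",     # DEVANAGARI SIGN VIRAMA
    "\u200D": "Extend",     # ZERO WIDTH JOINER
}


def gb9c_forbids_break_before(text: str, i: int) -> bool:
    """Return True if the GB9c-style context applies before text[i]."""
    if i <= 0 or i >= len(text) or INCB.get(text[i]) != "Consonant":
        return False
    j = i - 1
    seen_linker = False
    # Walk back over the run of Extend and Linker characters.
    while j >= 0 and INCB.get(text[j]) in ("Extend", "Linker"):
        if INCB[text[j]] == "Linker":
            seen_linker = True
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and INCB.get(text[j]) == "Consonant"


conjunct = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA
print(gb9c_forbids_break_before(conjunct, 2))        # context applies
print(gb9c_forbids_break_before("\u0915\u0937", 1))  # no Linker in between
```

Because the rule is stated in terms of a property rather than a list of scripts, extending the sketch to further scripts would only require more entries in the property table.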

There is a new line breaking rule, LB28a, in UAX #14, to prevent breaks inside orthographic syllables of Brahmic scripts. That new rule uses the new Line_Break property values. It also includes the use of a dotted circle in its regular expressions. The dotted circle is a literal character; that is, it matches U+25CC ◌ DOTTED CIRCLE.

Numeric Property Issues

  • There is one large new value in extracted/DerivedNumericValues.txt: 10000000000000000 (for U+4EAC)
  • U+5146 has two kPrimaryNumeric values: 1000000, 1000000000000
  • U+79ED has two kPrimaryNumeric values: 1000000000, 1000000000000
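
Code that reads kPrimaryNumeric as a single integer may need adjusting. The following minimal sketch parses one line in the tab-separated Unihan data format (code point, property name, value) and returns all space-separated values. The function name is hypothetical, and the sample line is constructed from the values listed above rather than copied from the data file.

```python
from typing import List, Tuple


def parse_primary_numeric(line: str) -> Tuple[int, List[int]]:
    """Parse one 'U+XXXX<TAB>kPrimaryNumeric<TAB>v1 v2 ...' line."""
    cp_field, prop, value = line.rstrip("\n").split("\t")
    if prop != "kPrimaryNumeric":
        raise ValueError(f"unexpected property: {prop}")
    code_point = int(cp_field[2:], 16)         # strip the leading "U+"
    values = [int(v) for v in value.split()]   # one or more values
    return code_point, values


cp, values = parse_primary_numeric("U+5146\tkPrimaryNumeric\t1000000 1000000000000")
print(f"U+{cp:04X}", values)

# The large new value for U+4EAC does not fit in 32 bits,
# but it does fit in a signed 64-bit integer.
big = 10000000000000000
print(big > 2**32, big < 2**63)
```

Python integers are arbitrary precision, so the sketch is unaffected by the size of the values; implementations in languages with fixed-width integers should check which integer type they use.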

CJK/Unihan Changes

  • A new CJK unified ideograph block, Extension I, has been added, with 622 characters in the range U+2EBF0..U+2EE5D. Implementers should check carefully for any hard-coded assumptions about CJK ranges. To keep the CJK block ranges as compact as possible, Extension I has been added to Plane 2, instead of directly after Extension H on Plane 3. Implementers should also check that their code does not assume that CJK extensions all occur in alphabetic order by the extension letter.
  • Some kRSUnicode values now include double-apostrophe radicals, sometimes as the only values for a code point.
  • Seven old provisional properties have been removed.
  • Six new provisional properties have been added.
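
A minimal sketch of a range table that avoids both assumptions follows. It lists block ranges, which include some unassigned code points at the end of each block. The names CJK_UNIFIED_BLOCKS and cjk_unified_block are hypothetical, and the ranges other than Extension I are filled in from the published block list rather than from this page. The table is ordered by code point, so Extension I sits between Extension F and Extension G.

```python
from typing import Optional

# Block ranges for CJK unified ideographs, ordered by code point.
# Note that the extension letters are NOT in alphabetical order.
CJK_UNIFIED_BLOCKS = [
    (0x3400, 0x4DBF, "Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
    (0x2B820, 0x2CEAF, "Extension E"),
    (0x2CEB0, 0x2EBEF, "Extension F"),
    (0x2EBF0, 0x2EE5F, "Extension I"),   # new in Version 15.1, on Plane 2
    (0x30000, 0x3134F, "Extension G"),
    (0x31350, 0x323AF, "Extension H"),
]


def cjk_unified_block(cp: int) -> Optional[str]:
    """Return the block label for a code point, or None if not in the table."""
    for first, last, name in CJK_UNIFIED_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(cjk_unified_block(0x2EBF0))   # first code point of Extension I
print(cjk_unified_block(0x31350))   # first code point of Extension H
print(cjk_unified_block(0x0041))    # LATIN CAPITAL LETTER A
```

Keeping the ranges in a data table, rather than in scattered comparisons, makes future extensions a one-line change.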

See UAX #38, Unicode Han Database (Unihan) for further details on these changes, especially Section 4.2, Listing by Date of Addition to the Unicode Standard, and Section 4.3, Listing by Location within Unihan.zip. UAX #38 also has updated regex values for numerous Unihan properties. For the double-apostrophe radicals, see:

UTS #46 (IDNA) Changes

  • Transitional processing (see conformance clause C1) has now been deprecated in UTS #46, Unicode IDNA Compatibility Processing.
  • In nontransitional processing, U+1E9E capital sharp s (ẞ) now maps to U+00DF small sharp s (ß), so that domain names with either input character always match. Until Unicode 15.0, capital sharp s mapped to "ss", which is the same as the mapping for small sharp s in transitional processing.
  • U+2260 (≠), U+226E (≮), and U+226F (≯) are now unconditionally valid, rather than disallowed_STD3_valid.
  • There are a couple of additional, minor changes to the validity criteria. See the UTS #46 Modifications section for details.

Changes to Code Charts

  • The code charts for the main CJK Unified Ideographs block (U+4E00) have an updated format that uses 7 columns for source glyphs, instead of 6. The KP source glyphs have been explicitly added to the code charts.
  • The font used for the representative glyphs of the Alchemical Symbols block has been updated.

Collation-related Changes

There has been an update to DUCET regarding the weighting of quotation marks. Various single quotation marks are now weighted as secondary variants of U+0027 (') APOSTROPHE, and various double quotation marks are now weighted as secondary variants of U+0022 (") QUOTATION MARK. U+05F3 (׳) HEBREW PUNCTUATION GERESH is also weighted as a secondary variant of U+0027, and U+05F4 (״) HEBREW PUNCTUATION GERSHAYIM is weighted as a secondary variant of U+0022. This change enables better behavior of geresh and gershayim for searching and sorting, and brings UCA more in line with the CLDR tailorings for quotation marks, geresh, and gershayim.

Emoji Changes

There are no new emoji characters in Unicode 15.1, but 118 new RGI emoji ZWJ sequences and 17 presentation sequences have been added to the overall emoji repertoire. For details, see the Unicode 15.1 emoji charts and Emoji Recently Added, v15.1.