Unicode 5.1 Released

From: Rick McGowan (rick@unicode.org)
Date: Fri Apr 04 2008 - 16:54:58 CST

    The Unicode Consortium is pleased to announce the release of Unicode 5.1.
    This release contains over 100,000 characters, and provides significant
    additions and improvements that extend text processing for software
    worldwide. Some of the key features are: increased security in data
    exchange, significant character additions for Indic and South East Asian
    scripts, expanded identifier specifications for Indic and Arabic scripts,
    improvements in the processing of Tamil and other Indic scripts,
    linebreaking conformance relaxation for HTML and other protocols,
    strengthened normalization stability, new case pair stability,
    plus others given below.

    The Version 5.1.0 data files and documentation are final and posted on the
    Unicode site. In addition to updated existing files, implementers will
    find new test data files (for example, for linebreaking) and new XML data
    files that encapsulate all of the Unicode character properties. For
    details, see the page for Unicode 5.1.0 at

    A major feature of Unicode 5.1.0 is the enabling of ideographic variation
    sequences. These sequences allow standardized representation of glyphic
    variants needed for Japanese, Chinese, and Korean text. The first
    registered collection, from Adobe Systems, is now available at

    Unicode 5.1 contains significant changes to properties and behavioral
    specifications. Several important property definitions were extended,
    improving linebreaking for Polish and Portuguese hyphenation. The Unicode
    Text Segmentation Algorithms, covering sentences, words, and characters,
    were greatly enhanced to improve the processing of Tamil and other Indic
    languages. The Unicode Normalization Algorithm now defines stabilized
    strings and provides guidelines for buffering. Standardized named sequences
    are added for Lithuanian, and provisional named sequences for Tamil.
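
    As a rough illustration of what the Normalization Algorithm mentioned
    above does, the following minimal sketch compares a precomposed and a
    decomposed spelling of the same letter. It assumes Python's standard
    unicodedata module, which tracks a later Unicode version than 5.1, and
    the helper name canonically_equivalent is made up for this example. It
    does not cover stabilized strings or the buffering guidelines.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two spellings differ as code point sequences...
differ_raw = composed != decomposed

# ...but normalizing either one to NFC yields the precomposed form.
nfc = unicodedata.normalize("NFC", decomposed)

print(differ_raw, nfc == composed, canonically_equivalent(composed, decomposed))
```

    Comparing normalized forms rather than raw code points is the usual way
    to treat such spellings as the same text.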

    Unicode 5.1.0 adds 1,624 newly encoded characters. These additions include
    characters required for Malayalam and Myanmar and important individual
    characters such as Latin capital sharp s for German. Version 5.1 extends
    support for languages in Africa, India, Indonesia, Myanmar, and Vietnam,
    with the addition of the Cham, Lepcha, Ol Chiki, Rejang, Saurashtra,
    Sundanese, and Vai scripts. Scholarly support includes important editorial
    punctuation marks, as well as the Carian, Lycian, and Lydian scripts, and
    the Phaistos disc symbols. Other new symbol sets include dominoes, Mahjong,
    dictionary punctuation marks, and math additions. This latest version of
    the Unicode Standard has exactly the same character assignments as ISO/IEC
    10646:2003 plus Amendments 1 through 4.
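
    The Latin capital sharp s mentioned above is encoded at U+1E9E. The short
    sketch below inspects it, again assuming Python's unicodedata module and
    built-in string methods (which reflect a later Unicode version than 5.1).
    The variable names are illustrative only.

```python
import unicodedata

CAPITAL_SHARP_S = "\u1e9e"   # added in Unicode 5.1
SMALL_SHARP_S = "\u00df"     # the long-standing lowercase letter

# Character name as recorded in the Unicode Character Database.
name = unicodedata.name(CAPITAL_SHARP_S)

# Lowercasing the new capital gives the ordinary small sharp s.
lowered = CAPITAL_SHARP_S.lower()

# Default uppercasing of the small letter still expands to "SS";
# it does not map to the new capital letter.
uppered = SMALL_SHARP_S.upper()

print(name, lowered, uppered)
```

    The asymmetry in the last step means that uppercasing and then
    lowercasing a string containing the small sharp s does not round-trip.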

    The Unicode Collation Algorithm (UCA), the core standard for sorting all
    text, is also being updated at the same time (see
    http://www.unicode.org/reports/tr10/). The major changes in UCA include
    coverage of all Unicode 5.1 characters, tightened conformance for canonical
    equivalence, clearer definitions of internationalized search and matching,
    specifications of parameters for customizing collation, and definitions of
    collation folding. There are also important clarifications on the use of
    contractions (such as "ch" in Slovak) in collation.
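
    To show what a contraction means for sorting, here is a toy sketch in
    which "ch" is treated as a single letter ordered directly after "h". It
    is not the Unicode Collation Algorithm: the alphabet, the weights, and
    the function name sort_key are assumptions made for this example, and
    diacritics and multi-level comparison are ignored.

```python
import re

# Hypothetical, simplified alphabet: a-z with the digraph "ch" as one
# letter placed directly after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {letter: index for index, letter in enumerate(ALPHABET)}

# Try the contraction first so "ch" is never split into "c" + "h".
TOKEN = re.compile(r"ch|[a-z]")


def sort_key(word: str) -> list:
    """Map a word to a list of weights, one per collation element."""
    return [WEIGHT[token] for token in TOKEN.findall(word.lower())]


words = ["cena", "hora", "chata", "ihla", "dom"]
ordered = sorted(words, key=sort_key)

print(ordered)
```

    With a plain code point sort, "chata" would land between "cena" and
    "dom"; with the contraction it sorts after "hora" and before "ihla".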

    The next version of the Unicode locale project (CLDR) is also being
    prepared on the basis of Unicode 5.1, and is now open for public data
    submission (see http://www.unicode.org/cldr/).
