Unicode Collation

From: Mark Davis (markdavis@ispchannel.com)
Date: Sun Apr 09 2000 - 13:06:30 EDT


At http://www.unicode.org/unicode/reports/tr10/charts/ I have put up a set of mechanically generated HTML charts for the Unicode collation data.

They show the full default Unicode collation order, so you can see what sorts before what (at least for those characters you have a font for -- Arial Unicode MS is perfect, if you have it). You can choose to see the characters with or without their sort keys.

The charts also contain some edge cases, so you can see how they sort:

surrogate pair characters:
            "\uD800\uDC00" (10000), "\uDBFF\uDFFD" (10FFFD)

other unassigned characters:
            "\u0220", "\uFFF0"

illegal UTF-16 code units:
            "\uD800", "\uDFFF", "\uFFFE", "\uFFFF", "\uDBFF\uDFFE" (10FFFE), "\uDBFF\uDFFF" (10FFFF)

sample Han, Hangul, and Yi characters:
            "\u3400", "\u3401", "\u3DB4", "\u4DB5", "\u4E00", "\u4E01", "\u9FA4", "\u9FA5",
            "\uAC00", "\uAC01", "\uD7A2", "\uD7A3", "\uA000", "\uA001", "\uA4C6"

[Note: Unicode 3.0-only characters are currently sorted as if they were unassigned.]
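
The parenthesized values in the lists above are the code points that the surrogate pairs encode. As a minimal sketch of the standard UTF-16 arithmetic behind them (the function name is hypothetical, chosen for illustration; this is not the code that generated the charts):

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:04X}")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pairs listed in the edge cases above.
pairs = [
    (0xD800, 0xDC00),
    (0xDBFF, 0xDFFD),
    (0xDBFF, 0xDFFE),
    (0xDBFF, 0xDFFF),
]

for high, low in pairs:
    cp = surrogate_pair_to_code_point(high, low)
    print(f"\\u{high:04X}\\u{low:04X} -> {cp:X}")
```

A lone "\uD800" or "\uDFFF" has no partner to combine with, so the function raises ValueError when either half falls outside its surrogate range.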

Feedback is welcome.

Mark


