Blocks and Ranges
Q: Much of the Unicode Standard is organized into "code blocks" that look like other 7-bit or 8-bit code pages.
Does this mean that Unicode is really an assemblage of 8-bit code pages for different languages?
A: No, not at all. The character blocks in the Unicode Standard are merely a convenience for exposition of the standard. The Unicode Standard
is not an assemblage of code pages, but a single, universal character encoding. All characters are equally accessible, and the blocks have no
implementation expression in most Unicode software. Some character blocks, and in particular the Indic script character blocks, bear a
superficial resemblance, in ordering and size, to other standards such as ISCII or the ISO/IEC 8859 series. That resemblance is primarily meant to help people interpret
the repertoire visually in comparison to legacy encodings, and to make it simpler to develop conversion tables for older character encodings.
Q: I understand that all Unicode characters are 16 bits, and that the high byte is used to switch between code blocks. Is that correct?
A: Absolutely not! Unicode characters may be encoded at any code point from U+0000 to U+10FFFF. The size of the code unit used for expressing those
code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for UTF-32) [See UTF & BOM]. Even when Unicode characters are expressed with 16-bit code units, there is no concept of a high byte switching values between "code pages" expressed in
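the low byte. Each 16-bit code unit is read as a single value: one unit expresses a code point up to U+FFFF, and a pair of surrogate code units expresses a code point above that.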
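To make the code unit sizes concrete, here is a minimal Python sketch that counts how many code units each encoding form needs for a given code point. The helper name code_unit_counts is hypothetical, chosen for this illustration; it relies only on Python's built-in codecs.

```python
def code_unit_counts(cp: int) -> dict:
    """Return how many code units each encoding form needs for one code point."""
    ch = chr(cp)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }

# Sample code points, from the start to the end of the codespace.
for cp in (0x0041, 0x0E3F, 0x4E00, 0x10FFFF):
    print(f"U+{cp:04X}", code_unit_counts(cp))
```

In no case does a "high byte" select a page: the code units together identify one code point in a single codespace.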
Q: If Unicode blocks aren't code pages, what are they?
A: Blocks in the Unicode Standard are named ranges of code points. They are used
to help organize the standard into groupings of related kinds of characters, for convenience
in reference. And they are used by a charting program to define the ranges of characters printed
out together for the code charts seen in the book or posted online.
Q: Can blocks overlap?
A: No. Every Unicode block is discrete, and cannot overlap with any other block.
Also, every assigned character in the Unicode Standard has to be in a block (and only one block,
of course). This ensures that when code charts are printed, no characters are omitted simply
because they aren't in a block.
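Because blocks never overlap, a sorted table of block ranges can be searched with a simple binary search. The following Python sketch assumes a tiny sample table holding only four blocks whose ranges are quoted elsewhere in this FAQ; a real implementation would load the complete list from the Unicode Character Database. The names BLOCKS, check_no_overlap, and block_of are hypothetical.

```python
from bisect import bisect_right

# A few (start, end, name) entries quoted in this FAQ; NOT the full block list.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
STARTS = [start for start, _, _ in BLOCKS]  # BLOCKS is already sorted by start

def check_no_overlap(blocks):
    """Return True if no two blocks share a code point."""
    ordered = sorted(blocks)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))

def block_of(cp):
    """Return the name of the sample block containing cp, or None."""
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

print(check_no_overlap(BLOCKS), block_of(0x00E9), block_of(0x0100))
```

With this partial table, block_of returns None for U+0100 only because that block was left out of the sample, not because the code point lacks a block.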
Q: Where can I find the definitive list of Unicode blocks?
A: The Unicode blocks and their names are a normative part of
the Unicode Standard. The exact list is always maintained in one of
the files of
the Unicode Character Database.
Q: Is casing significant for Unicode block names?
A: No. Block names are commonly represented in Titlecase, but can also appear in all UPPERCASE.
Other casing combinations can occur, and case should be ignored when comparing block names.
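A case-insensitive comparison can be sketched in Python as follows. The function name same_block_name is hypothetical, and the sketch normalizes case only, since that is all the answer above describes.

```python
def same_block_name(a: str, b: str) -> bool:
    """Compare two block names while ignoring case."""
    return a.casefold() == b.casefold()

print(same_block_name("Latin-1 Supplement", "LATIN-1 SUPPLEMENT"))
```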
Q: You said that blocks are named ranges of code points. Are Unicode ranges the same as Unicode blocks?
A: No. A range simply refers to any sequence of Unicode code points with a starting point and an ending point. It doesn't
have to be the same as the specific ranges for the Unicode blocks. A range can overlap block boundaries, and a range in general
doesn't have any name.
Q: How are Unicode ranges expressed?
A: By using the U+ form for the starting and ending code points, connected with dots. So, for example: U+0100..U+03FF.
Sometimes a dash or a long dash is substituted for the two dots, and the "U+" can be omitted if it is clear you are talking about
Unicode code points specifically.
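As an illustration, the range notation described above could be parsed with a small Python function like the one below. The name parse_range and the exact set of accepted separators (two dots, hyphen, en dash, em dash) are assumptions made for this sketch, not a defined syntax.

```python
import re

# Optional "U+" prefix, 4 to 6 hex digits, then a separator, then the same again.
_RANGE = re.compile(
    r"^\s*(?:U\+)?([0-9A-Fa-f]{4,6})\s*(?:\.\.|-|\u2013|\u2014)\s*"
    r"(?:U\+)?([0-9A-Fa-f]{4,6})\s*$"
)

def parse_range(text):
    """Parse a range such as 'U+0100..U+03FF' into (start, end) integers."""
    m = _RANGE.match(text)
    if m is None:
        raise ValueError(f"not a code point range: {text!r}")
    start, end = int(m.group(1), 16), int(m.group(2), 16)
    if start > end or end > 0x10FFFF:
        raise ValueError(f"invalid range bounds: {text!r}")
    return start, end

print(parse_range("U+0100..U+03FF"))
```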
Q: Are there any restrictions on the ranges used for Unicode blocks?
A: Yes. Every Unicode block starts with a code point of the form nnn0 and ends with a code point of the form nnnF.
That is another way of saying that every block consists of some number of complete columns of characters, when seen printed out
in charts. And the number of code points in every block is divisible by 16. Also, the ranges for the Unicode blocks do not
extend over planes in the standard. The reasons for these restrictions have mostly to do with convenience for printing out
the charts, but they also provide some minor benefits for implementations when constructing tables.
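These restrictions are easy to check mechanically. The following Python sketch tests a candidate range against the rules just listed; the function name is_valid_block_range is hypothetical.

```python
def is_valid_block_range(start: int, end: int) -> bool:
    """Check the structural restrictions on block ranges described above."""
    return (
        0 <= start <= end <= 0x10FFFF
        and start % 16 == 0            # starts at a code point of the form nnn0
        and end % 16 == 15             # ends at a code point of the form nnnF
        and start >> 16 == end >> 16   # does not extend over a plane boundary
    )

print(is_valid_block_range(0xAC00, 0xD7AF))  # Hangul Syllables block range
print(is_valid_block_range(0xAC00, 0xD7A3))  # ends mid-column
```

The divisibility of the block size by 16 needs no separate test: it follows from the first two conditions.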
Q: Are there any restrictions on what characters can be encoded in a Unicode block?
A: There are no absolute rules involved, but in general the encoding committees are careful to try to encode related
characters together when they can, given the constraints on what has already been encoded. So any additional Devanagari letters
would be encoded in the existing Devanagari block, if possible, and additional punctuation in one of the existing punctuation
blocks, and so on.
Q: Do Unicode blocks have defined character properties?
A: No. The character properties are associated with encoded characters themselves, rather than with the blocks they are encoded in.
Q: Does that even apply to the script for characters?
A: Yes. For example, the Thai block contains Thai characters that have the Thai script property, but it also
contains the character for the baht currency sign, which is used in Thai text, of course, but which is defined to have the
Common script property. To find the script property value for any character you need to rely on the
Unicode Character Database
rather than the block value alone.
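Python's standard unicodedata module does not expose the Script property, so the sketch below uses the General_Category property instead to make the same point: two characters from the Thai block carry different property values, and those values come from the characters themselves rather than from the block.

```python
import unicodedata

# Two characters that both sit in the Thai block.
for cp in (0x0E01, 0x0E3F):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))
```

A lookup of the Script property proper would need the Unicode Character Database data files or a library that ships them.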
Q: So block value is not the same as script value?
A: Correct. In some cases, such as Latin, the encoded characters are spread across as many as a dozen different Unicode blocks.
That is unfortunate, but is simply the result of the history of the standard. In other instances, a single block may contain characters
of more than one script. For example, the Greek and Coptic block contains mostly characters of the Greek script, but also a few historic
characters of the Coptic script.
Q: Do Unicode ranges ever have defined character properties?
A: Yes, there are a few special cases where specific ranges of code points are defined to have default property values.
The most important of these cases is for the Bidi_Class property, where certain ranges of code points, including unassigned code points,
are specified to be right-to-left. This is done to enable stability for implementations of the Bidirectional Algorithm, as characters
are added over time to the standard. There are other instances of special ranges with predefined character properties. For details,
see the documentation for the
Unicode Character Database.
Q: Are Unicode blocks predefined, even before characters are encoded for them?
A: Formally, no. However, the Unicode Consortium and SC2/WG2 jointly maintain a Roadmap that contains both existing blocks
and tentative allocations of blocks for future encoding. The tentative allocations help in the planning for encoding and provide a
convenient place for linking to proposal documents. However, they are not part of the standard itself, and such tentative block
allocations can be and frequently are moved around during the process of proposal review and approval. For details, see the Roadmap.
Q: Are Unicode blocks important for implementations of Unicode?
A: It may be surprising, but usually they are not. What matters for implementations of Unicode are the properties for characters.
Those are obtained from other data files in the
Unicode Character Database, and don't depend on blocks, per se. In particular, since block
identity is not exactly correlated with script identity, it is much better to rely on the script property value when
implementing an operation that depends on script identity for a character.
Blocks are sometimes convenient for display of characters, as for a character picker application. But even when expressing such
things as the supported repertoire for an application, it is generally better to express that in terms of explicit ranges of assigned characters,
rather than just in terms of blocks.
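One way to derive explicit ranges of assigned characters is sketched below in Python. It assumes that a code point counts as assigned when its General_Category is anything other than Cn, so private use and surrogate code points count as assigned while noncharacters do not. The result reflects whatever Unicode version the running Python's unicodedata module ships. The function name assigned_ranges is hypothetical.

```python
import unicodedata

def assigned_ranges(start: int, end: int):
    """Return (first, last) runs of assigned code points within start..end."""
    ranges = []
    run_start = None
    for cp in range(start, end + 1):
        assigned = unicodedata.category(chr(cp)) != "Cn"
        if assigned and run_start is None:
            run_start = cp                       # a new run begins
        elif not assigned and run_start is not None:
            ranges.append((run_start, cp - 1))   # the current run ends
            run_start = None
    if run_start is not None:
        ranges.append((run_start, end))          # run reaches the end of the span
    return ranges

# Hangul Syllables block, AC00..D7AF
for first, last in assigned_ranges(0xAC00, 0xD7AF):
    print(f"{first:04X}..{last:04X}")
```

For the Hangul Syllables block this yields the single range AC00..D7A3, which is narrower than the block itself.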
Q: Can Unicode blocks be used in defining sets for regular expressions?
A: Yes, but only with some care, as they may lead to surprises—particularly in not matching
characters that users may expect them to. For further discussion, see
cautions about use of blocks in regular expressions.
Q: I've noticed some discrepancies in the listing of block ranges in various places. Are these errors I should report?
A: There are several reasons for such discrepancies, and in most instances they are intentional distinctions. So, no, they are not errors.
Q: O.k., so what are those reasons?
A: Well, first of all, the names and ranges of blocks are occasionally modified editorially in the text of the Unicode Standard.
Block names are sometimes shortened a little in book headers, so they fit on a line and don't cause problems in the table of contents or index.
Sometimes when discussing characters in a single script where two adjacent blocks contain those characters, a header may be listed coalescing
the range under discussion, or a header may list one name and two discrete ranges. Such changes are simply to help in the presentation of
material about the standard, and in no way are intended to modify the normative block definitions. In all cases the normative block ranges and
names are those specified in the Unicode Character Database.
Q: What about block discrepancies in the Unicode names list?
A: The Unicode names list file, which can be found in the
Unicode Character Database, is actually a data file which is used to drive the charting
program for the Unicode code charts. It uses some special markup conventions explained in
documentation of the names list file.
In particular, the header entries in the names list file occasionally depart from normative block ranges because of constraints on how the charting program works
and also to prevent the printing of unnecessary blank columns or pages in the charts. The label used in a header entry may also differ from a block name, adding
annotations that are helpful for reading the charts. For example, here is the normative block definition for the Latin-1 Supplement:
0080..00FF; Latin-1 Supplement
But the names list file uses a header entry:
@@ 0080 C1 Controls and Latin-1 Supplement (Latin-1 Supplement) 00FF
The range used is the same, but the header entry adds "C1 Controls and" for clarity when printing the Unicode code charts. The parenthetical
string is used instead by the charting program when printing code charts for ISO/IEC 10646.
In another example, the normative block definition for CJK Unified Ideographs is:
4E00..9FFF; CJK Unified Ideographs
But the names list file uses a header entry:
@@ 4E00 CJK Unified Ideographs 9FD5
The charting program uses the "9FD5" value to know where the last assigned character to print is, since CJK Unified Ideographs are not
explicitly listed in UnicodeData.txt. And the charting program uses this information to optimize page breaks and prevent printing of empty columns.
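The two line formats quoted above can be compared programmatically. The Python sketch below parses both and checks that a header entry stays inside its block. It assumes the header fields are separated by whitespace, as they appear in this FAQ, and that the final field is always a hexadecimal end value. The function names are hypothetical.

```python
import re

BLOCK_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+)$")
HEADER_LINE = re.compile(r"^@@\s+([0-9A-F]{4,6})\s+(.+?)\s+([0-9A-F]{4,6})$")

def parse_block(line):
    """Parse a block definition such as '4E00..9FFF; CJK Unified Ideographs'."""
    m = BLOCK_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a block definition: {line!r}")
    return int(m[1], 16), int(m[2], 16), m[3]

def parse_header(line):
    """Parse a names list header such as '@@ 4E00 CJK Unified Ideographs 9FD5'."""
    m = HEADER_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a header entry: {line!r}")
    return int(m[1], 16), int(m[3], 16), m[2]

def header_within_block(block, header):
    """True if the header's range lies inside the block's range."""
    return block[0] <= header[0] and header[1] <= block[1]

block = parse_block("4E00..9FFF; CJK Unified Ideographs")
header = parse_header("@@ 4E00 CJK Unified Ideographs 9FD5")
print(header_within_block(block, header), hex(block[1]), hex(header[1]))
```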
Finally, the charting program departs from both Blocks.txt and NamesList.txt in some instances. For example, there are two normative high surrogate blocks:
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
But the code charts don't actually print out anything for those ranges, which aren't actually assignable for characters anyway; instead,
there is a consolidated single page explaining the High Surrogate Area, Range: D800-DBFF.
Q: Do Unicode blocks exactly match the blocks defined in ISO/IEC 10646?
A: For the most part they do, but there are several principled exceptions.
First, the Unicode blocks for Basic Latin and the Latin-1 Supplement are extended to incorporate the control characters, since the Unicode
Standard prints out all the code points for the control characters, as well as the graphic characters.
Unicode: 0000..007F; Basic Latin
10646: 0020-007E BASIC LATIN
Unicode: 0080..00FF; Latin-1 Supplement
10646: 00A0-00FF LATIN-1 SUPPLEMENT
There is a similar distinction for the special cases of the Byte Order Mark at U+FEFF and
the two noncharacters at the very end of the BMP.
Unicode: FE70..FEFF; Arabic Presentation Forms-B
10646: FE70-FEFE ARABIC PRESENTATION FORMS-B
Unicode: FFF0..FFFF; Specials
10646: FFF0-FFFD SPECIALS
Second, for Hangul syllables, 10646 defines a block that ends at the last encoded Hangul syllable, but the Unicode rules for block definitions
require ending a block at an even 16-character boundary:
Unicode: AC00..D7AF; Hangul Syllables
10646: AC00-D7A3 HANGUL SYLLABLES
Third, the Unicode Standard defines blocks for sub-ranges of surrogate code points; those have no blocks defined in 10646. Also, the Unicode
Standard defines blocks for the supplementary private use areas on planes 15 and 16, while no blocks are defined for those in 10646.