
Private-Use Characters, Noncharacters & Sentinels FAQ

Private-Use Characters

Noncharacters

Sentinels

Private-Use Characters

Q: What are private-use characters?

A: Private-use characters are code points whose interpretation is not specified by a character encoding standard and whose use and interpretation may be determined by private agreement among cooperating users. Private-use characters are sometimes also referred to as user-defined characters (UDC) or vendor-defined characters (VDC).

Q: Does Unicode have private-use characters?

A: Yes. There are three ranges of private-use characters in the standard. The main range in the BMP is U+E000..U+F8FF, containing 6,400 private-use characters. That range is often referred to as the Private Use Area (PUA). But there are also two large ranges of supplementary private-use characters, consisting of most of the code points on Planes 15 and 16: U+F0000..U+FFFFD and U+100000..U+10FFFD. Together those ranges allocate another 131,068 private-use characters. Altogether, then, there are 137,468 private-use characters in Unicode.
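The counts above follow directly from the range boundaries. As a minimal sketch (the names `PRIVATE_USE_RANGES`, `is_private_use`, and `range_size` are illustrative, not part of any standard API), the arithmetic can be checked like this:

```python
# Inclusive private-use ranges, as listed in the answer above.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # most of Plane 15
    (0x100000, 0x10FFFD),  # most of Plane 16
]

def is_private_use(cp: int) -> bool:
    """Return True if cp falls in one of the three private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

def range_size(lo: int, hi: int) -> int:
    """Number of code points in the inclusive range lo..hi."""
    return hi - lo + 1

bmp_count = range_size(*PRIVATE_USE_RANGES[0])
supplementary_count = sum(range_size(lo, hi) for lo, hi in PRIVATE_USE_RANGES[1:])
total_count = bmp_count + supplementary_count

print(bmp_count, supplementary_count, total_count)
```

The last two code points of Planes 15 and 16 are excluded from the ranges because they are noncharacters, which is why each supplementary range holds 65,534 code points rather than 65,536.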

Q: Why are there so many private-use characters in Unicode?

A: Unicode is a very large and inclusive character set, containing many more standardized characters than any of the legacy character encodings. Most users have little need for private-use characters, because the characters they need are already present in the standard.

However, some implementations, particularly those interoperating with East Asian legacy data, originally anticipated needing large numbers of private-use characters to enable round-trip conversion to private-use definitions in that data. In most cases, 6,400 private-use characters is more than enough, but there can be occasions when 6,400 does not suffice. Allocating a large number of private-use characters has the additional benefit of allowing implementations to choose ranges for their private-use characters that are less likely to conflict with ranges used by others.

The allocation of two entire additional planes full of private-use characters ensures that even the most extravagant implementation of private-use character definitions can be fully accommodated by Unicode.

Q: Will the number of private-use characters in Unicode ever change?

A: No. The set of private-use characters is formally immutable. This is guaranteed by a Unicode Stability Policy.

Q: So legacy character encodings also have private-use characters?

A: Yes. Private-use characters are commonly used in East Asia, particularly in Japan, China, and Korea, to extend the available characters in various standards and vendor character sets. Typically, such characters have been used to add Han characters not included in the standard repertoire of the character set. Such non-standard Han character extensions are often referred to as "gaiji" in Japanese contexts.

Q: So other than interoperating with legacy CJK, why would I use private-use characters?

A: Some characters may never get standard encodings for one reason or another. For example, they might be part of a constructed artificial script (ConScript) which has no general community of use. Or a particular implementation may need to use private-use characters for specific internal purposes. Private-use characters are also useful for testing implementations of scripts or other sets of characters which may be proposed for encoding in a future version of Unicode.

Q: How can private-use characters be input?

A: Some input method editors (IMEs) allow customizations whereby an input sequence and resulting private-use character can be added to their internal dictionaries.

Q: How are private-use characters displayed?

A: With common font technologies such as OpenType and AAT, private-use characters can be added to fonts for display.

Q: What happens if definitions of private-use characters conflict?

A: The same code points in the PUA may be given different meanings in different contexts, since they are, after all, defined by users and are not standardized. For example, if text comes from a legacy NEC encoding in Japan, the same code point in the PUA may mean something entirely different if interpreted on a legacy Fujitsu machine, even though both systems would share the same private-use code points. For each given interpretation of a private-use character one would have to pick the appropriate IME, user dictionary and fonts to work with it.

Q: What about properties for private-use characters?

A: One should not expect the rest of an operating system to override the character properties for private-use characters, since private-use characters can have different meanings, depending on how they originated. In terms of line breaking, case conversions, and other textual processes, private-use characters will typically be treated by the operating system as otherwise undistinguished letters (or ideographs) with no uppercase/lowercase distinctions.

Q: What does "private agreement among cooperating parties" mean?

A: A "private agreement" simply refers to the fact that agreement about the interpretation of some set of private-use characters is done privately, outside the context of the standard. The Unicode Standard does not specify any particular interpretation for any private-use character. There is no implication that a private agreement necessarily has any contractual or other legal status—it is simply an agreement between two or more parties about how a particular set of private-use characters should be interpreted.

Q: How would I define a private agreement?

A: One can share, or even publish, documentation containing particular assignments for private-use characters, their glyphs, and other relevant information about their interpretation. One can then ask others to use those private-use characters as documented. One can create appropriate fonts and IMEs, or request that others do so.


Noncharacters

Q: What are noncharacters?

A: A "noncharacter" is a code point that is permanently reserved in the Unicode Standard for internal use.

Q: How did noncharacters get that weird name?

A: Noncharacters are in a sense a kind of private-use character, because they are reserved for internal (private) use. However, that internal use is intended as a "super" private use, not normally interchanged with other users. Their allocation status in Unicode differs from that of ordinary private-use characters. They are considered unassigned to any abstract character, and they share the General_Category value Cn (Unassigned) with unassigned reserved code points in the standard. In this sense they are "less a character" than most characters in Unicode, and the moniker "noncharacter" seemed appropriate to the UTC to express that unique aspect of their identity.

In Unicode 1.0 the code points U+FFFE and U+FFFF were annotated in the code charts as "Not character codes" and instead of having actual names were labeled "NOT A CHARACTER". The term "noncharacter" in later versions of the standard evolved from those early annotations and labels.

Q: How many noncharacters does Unicode have?

A: Exactly 66.

Q: Which code points are noncharacters?

A: The 66 noncharacters are allocated as follows:

  • a contiguous range of 32 noncharacters: U+FDD0..U+FDEF in the BMP
  • the last two code points of the BMP, U+FFFE and U+FFFF
  • the last two code points of each of the 16 supplementary planes: U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... U+10FFFE, U+10FFFF
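A test for noncharacters follows directly from the list above. This is only a sketch, and the function name `is_noncharacter` is an illustrative choice:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 noncharacter code points."""
    # The contiguous range of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes:
    # the low 16 bits are FFFE or FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE

# Count noncharacters across the whole codespace U+0000..U+10FFFF.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

The count works out to 32 + 2 + (16 × 2) = 66.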

For convenient reference, the following table summarizes all of the noncharacters, showing their representations in UTF-32, UTF-16, and UTF-8. (In this table, "#" stands for either the hex digit "E" or "F".)

UTF-32     UTF-16       UTF-8
0000FDD0   FDD0         EF B7 90
...
0000FDEF   FDEF         EF B7 AF
0000FFF#   FFF#         EF BF B#
0001FFF#   D83F DFF#    F0 9F BF B#
0002FFF#   D87F DFF#    F0 AF BF B#
0003FFF#   D8BF DFF#    F0 BF BF B#
0004FFF#   D8FF DFF#    F1 8F BF B#
...
000FFFF#   DBBF DFF#    F3 BF BF B#
0010FFF#   DBFF DFF#    F4 8F BF B#
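Rows of this table can be reproduced with an ordinary UTF encoder. The sketch below assumes CPython's built-in codecs, which encode noncharacters like any other scalar value; the helper name `utf_forms` is hypothetical:

```python
def utf_forms(cp: int) -> tuple:
    """Return (UTF-32, UTF-16, UTF-8) hex strings for one code point."""
    ch = chr(cp)
    utf32 = f"{cp:08X}"
    # Big-endian UTF-16 without a BOM, shown as 16-bit code units.
    u16 = ch.encode("utf-16-be")
    utf16 = " ".join(u16[i:i + 2].hex().upper() for i in range(0, len(u16), 2))
    # UTF-8 shown byte by byte.
    utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return utf32, utf16, utf8

for cp in (0xFDD0, 0xFFFE, 0x1FFFE, 0x10FFFF):
    print(*utf_forms(cp), sep="   ")
```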

Q: Why are 32 of the noncharacters located in a block of Arabic characters?

A: The allocation of the range of noncharacters U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A block was mostly a matter of efficiency in the use of reserved code points in the rather fully-allocated BMP. The Arabic Presentation Forms-A block had a contiguous range of 32 unassigned code points, but as of 2001, when the need for more BMP noncharacters became apparent, it was already clear to the UTC that the encoding of many more Arabic presentation forms similar to those already in the Arabic Presentation Forms-A block would not be useful to anyone. Rather than designate an entirely new block for noncharacters, the unassigned range U+FDD0..U+FDEF was designated for them, instead.

Note that the range U+FDD0..U+FDEF for noncharacters is another example of why it is never safe to simply assume from the name of a block in the Unicode Standard that you know exactly what kinds of characters it contains. The identity of any character is determined by its actual properties in the Unicode Character Database. The noncharacter code points in the range U+FDD0..U+FDEF share none of their properties with other characters in the Arabic Presentation Forms-A block; they certainly are not Arabic script characters, for example.

Q: Will the set of noncharacters in Unicode ever change?

A: No. The set of noncharacters is formally immutable. This is guaranteed by a Unicode Stability Policy.

Q: Are noncharacters intended for interchange?

A: No. They are intended explicitly for internal use. For example, they might be used internally as a particular kind of object placeholder in a string. Or they might be used in a collation tailoring as a target for a weighting that comes between weights for "real" characters of different scripts, thus simplifying the support of "alphabetic index" implementations.

Q: Are noncharacters prohibited in interchange?

A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

Q: If my application makes specific, internal use of a noncharacter, what should I do with input text?

A: In cases where the input text cannot be guaranteed to use the same interpretation for the noncharacter as your program does, and the presence of that noncharacter would cause internal problems, it is best practice to replace that particular noncharacter on input by U+FFFD. Of course, such behavior should be clearly documented, so that external clients know what to expect.
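As a minimal sketch of that practice, suppose an application reserves U+FDD0 as an internal object placeholder. That reservation, and the names `RESERVED_INTERNAL` and `sanitize_input`, are assumptions made purely for illustration:

```python
REPLACEMENT_CHARACTER = "\uFFFD"

# Hypothetical: this application uses U+FDD0 internally as a placeholder.
RESERVED_INTERNAL = {"\uFDD0"}

def sanitize_input(text: str) -> str:
    """Replace only the noncharacters this program reserves for itself.

    Other noncharacters pass through untouched, and nothing is deleted,
    so the length of the text in code points is preserved.
    """
    return "".join(
        REPLACEMENT_CHARACTER if ch in RESERVED_INTERNAL else ch
        for ch in text
    )

print(ascii(sanitize_input("a\uFDD0b\uFFFFc")))
```

Only the code point with a conflicting internal meaning is replaced; U+FFFF in the example is left alone because this hypothetical application assigns it no internal use.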

Q: What should I do if downstream clients depend on noncharacters being passed through by my module?

A: In such a case, your module may need to use a more complicated mechanism to preserve noncharacters for pass through, while not interfering with their specific internal use. This behavior will prevent your downstream clients from breaking, at the cost of making your processing marginally more complex. However, because of this additional complexity, if you anticipate that a future version of your module may not pass through one or more noncharacters, it is best practice to document the reservation of those values from the start. In that way, any downstream client using your module can have clearly specified expectations regarding which noncharacter values your module may replace.

Q: Can failing to replace noncharacters with U+FFFD lead to problems?

A: If your implementation has no conflicting internal definition and use for the particular noncharacter in question, it is usually harmless to just leave noncharacters in the text stream. They definitely will not be displayable and might break up text units or create other "funny" effects in text, but these results are typically the same as could be expected for an uninterpreted private-use character or even a normal assigned character for which no display glyph is available.

Q: Can noncharacters simply be deleted from input text?

A: No. Doing so can lead to security problems. For more information, see Unicode Technical Report #36, Unicode Security Considerations.

Q: Can you summarize the basic differences between private-use characters and noncharacters?

A: Private-use characters do not have any meanings assigned by the Unicode Standard, but are intended to be interchanged among cooperating parties who share conventions about what the private-use characters mean. Typically, sharing those conventions means that there will also be some kind of public documentation about such use: for example, a website listing a table of interpretations for certain ranges of private-use characters. As an example, see the ConScript Unicode Registry—a private group unaffiliated with the Unicode Consortium—which has extensive tables listing private-use character definitions for various unencoded scripts. Or such public documentation might consist of the specification of all the glyphs in a font distributed for the purpose of displaying certain ranges of private-use characters. Of course, a group of cooperating users which have a private agreement about the interpretation of some private-use characters is under no obligation to publish the details of their agreement.

Noncharacters also do not have any meanings assigned by the Unicode Standard, but unlike private-use characters, they are intended only for internal use, and are not intended for interchange. Typically, there will be no public documentation available about their use in particular instances, and fonts typically do not have glyphs for them.

Noncharacters and private-use characters also differ significantly in their default Unicode character property values.

Code Point Type   Use Type               Properties
noncharacter      private, internal      gc=Cn, bc=BN, eaw=N
private use       private, interchange   gc=Co, bc=L, eaw=A
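The General_Category difference can be observed with any library that exposes the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module and checks only `gc`; the helper name `general_category` is illustrative, and the data version depends on the Python release in use:

```python
import unicodedata

def general_category(cp: int) -> str:
    """Return the General_Category value for a code point."""
    return unicodedata.category(chr(cp))

for cp in (0xFDD0, 0xFFFF, 0x10FFFF, 0xE000, 0xF0000, 0x10FFFD):
    print(f"U+{cp:04X}", general_category(cp))
```

Noncharacters report Cn, the same value as reserved unassigned code points, while private-use characters report Co.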

 


Sentinels

Q: What is a sentinel?

A: A sentinel is a special numeric value typically used to signal an edge condition of some sort. For text, in particular, sentinels are values stored with text but which are not interpreted as part of the text, and which indicate some special status. For example, a null byte is used as a sentinel in C strings to mark the end of the string.

Q: Is it safe to use a noncharacter as an end-of-string sentinel?

A: It is not recommended. The use of any Unicode code point U+0000..U+10FFFF as a sentinel value (such as "end of text" in APIs) can cause problems when that code point actually occurs in the text. It is preferable to use a true out-of-range value, for example -1. This is parallel to the use of -1 as the sentinel end-of-file (EOF) value in the standard C library, and is easy and fast to test for in code with a (result < 0) check. Alternatively, a clearly out-of-range positive value such as 0x7FFFFFFF could also be used as a sentinel value.
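A minimal sketch of the out-of-range approach follows. The names `END_OF_TEXT`, `next_code_point`, and `collect` are hypothetical, and Python is used here only for illustration, since the same pattern applies in C with a signed 32-bit return type:

```python
END_OF_TEXT = -1  # out of range: no Unicode code point is negative

def next_code_point(text: str, index: int) -> int:
    """Return the code point at index, or END_OF_TEXT past the end."""
    if index >= len(text):
        return END_OF_TEXT
    return ord(text[index])

def collect(text: str) -> list:
    """Read every code point using the (result < 0) end test."""
    out = []
    i = 0
    while (cp := next_code_point(text, i)) >= 0:
        out.append(cp)
        i += 1
    return out

# Noncharacters in the text are returned like any other code point.
print([hex(cp) for cp in collect("a\uFFFF\U0010FFFF")])
```

Because the sentinel lies outside U+0000..U+10FFFF, text that happens to contain U+FFFF or U+10FFFF is read in full rather than being cut short.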

Q: How about using NULL as an end-of-string sentinel?

A: When using UTF-8 in C strings, implementations follow the same conventions they would for any legacy 8-bit character encoding in C strings. The byte 0x00 marks the end of the string, consistent with the C standard. Because the byte 0x00 in UTF-8 also represents U+0000 NULL, a UTF-8 C string cannot have a NULL in its contents. This is precisely the same issue as for using C strings with ASCII. In fact, an ASCII C string is formally indistinguishable from a UTF-8 C string with the same character content.
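The effect can be seen from Python through the standard `ctypes` module, whose character buffers expose a C-style, NUL-terminated view. This is only an illustrative sketch, and the variable names are arbitrary:

```python
import ctypes

# U+0000 encodes in UTF-8 as the single byte 0x00.
utf8_bytes = "ab\u0000cd".encode("utf-8")

# The buffer holds all the bytes, but .value reads it as a C string,
# stopping at the first 0x00 byte.
buf = ctypes.create_string_buffer(utf8_bytes)
as_c_string = buf.value

print(utf8_bytes, as_c_string)

# An ASCII string and a UTF-8 string with the same content are byte-identical.
same_bytes = "plain text".encode("ascii") == "plain text".encode("utf-8")
print(same_bytes)
```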

It is also quite common for implementations which handle both UTF-8 and UTF-16 data to implement 16-bit string handling analogously to C strings, using 0x0000 as a 16-bit sentinel to indicate end of string for a 16-bit Unicode string. The rationale for this approach and the associated problems completely parallel those for UTF-8 C strings.

Q: The Unicode Standard talks about U+FEFF BYTE ORDER MARK (BOM) being a signature. Is that the same as a sentinel?

A: No. A signature is a defined sequence of bytes used to identify an object. In the case of Unicode text, certain encoding schemes use specific initial byte sequences to identify the byte order of a Unicode text stream. See the BOM FAQ entries for more details.

Q: But the byte-swapped BOM, U+FFFE, is a noncharacter. Why?

A: U+FFFE was designated as a noncharacter to make it unlikely that normal, interchanged text would begin with U+FFFE. The occurrence of U+FFFE as the initial character as part of text has the potential to confuse applications testing for the two initial signature bytes <FE FF ...> or <FF FE ...> of a byte stream labeled as using the UTF-16 encoding scheme. That can interfere with checking for the presence of a BOM which would indicate big-endian or little-endian order.
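The potential for confusion can be shown with a deliberately simplified signature check. The function name `utf16_byte_order` is hypothetical, and the sketch looks only at the first two bytes, ignoring other encoding schemes:

```python
import codecs

def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

with_bom_be = "\uFEFFhi".encode("utf-16-be")
with_bom_le = "\uFEFFhi".encode("utf-16-le")
# Big-endian text with no BOM whose first code point is U+FFFE:
# its first two bytes are FF FE, which looks like a little-endian BOM.
misleading = "\uFFFEhi".encode("utf-16-be")

print(utf16_byte_order(with_bom_be),
      utf16_byte_order(with_bom_le),
      utf16_byte_order(misleading))
```

The third input is big-endian, yet the check reports little-endian, which is the kind of misidentification that designating U+FFFE as a noncharacter makes unlikely in normal interchanged text.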

Q: I read somewhere that U+FFFE and U+FFFF were illegal in Unicode, and could be used as sentinels. Is that true?

A: Well, the short answer is no, that is not true—at least, not entirely true. U+FFFE and U+FFFF are noncharacters just like the other 64 noncharacters in the standard, and are valid in Unicode strings. Because they are noncharacters, nothing would prohibit a privately-defined internal use of either of them as a sentinel, but such use is problematical in the same way that use of any valid character as a sentinel can be problematical.

The claims about U+FFFE and U+FFFF being illegal in Unicode derive from the days of Unicode 1.0 [1991], when the standard was still architected as a pure 16-bit character encoding, before the invention of UTF-16 and supplementary characters. In that version of the standard, U+FFFE and U+FFFF did have an unusual status. The code charts were printed omitting the last two code points altogether, and in the names list, the code points U+FFFE and U+FFFF were labeled "NOT A CHARACTER". They were also annotated with notes like, "the value FFFF is guaranteed not to be a Unicode character at all". Section 2.3, p. 14 of Unicode 1.0 contains the statement, "U+FFFE and U+FFFF are reserved and should not be transmitted or stored," so it is clear that Unicode 1.0 intended that those values would not occur in Unicode strings. The block description for the Specials Block in Unicode 1.0 contained the following information:

U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode character value, and should be taken as a signal that Unicode characters should be byte-swapped before interpretation. U+FFFE should only be interpreted as an incorrectly byte-swapped version of U+FEFF.

U+FFFF. The 16-bit unsigned hexadecimal value U+FFFF is not a Unicode character value, and can be used by an application as a [sic] error code or other non-character value. The specific interpretation of U+FFFF is not defined by the Unicode standard, so it can be viewed as a kind of private-use non-character.

It should be apparent that U+FFFF in Unicode 1.0 was the prototype for what later became noncharacters in the standard—both in terms of how it was labeled and how its function was described.

Unicode 2.0 [1996] formally changed the architecture of Unicode, as a result of the merger with ISO/IEC 10646-1:1993 and the introduction of UTF-16 and UTF-8 (both dating from Unicode 1.1 times [1993]). However, both Unicode 2.0 and Unicode 3.0 effectively were still 16-bit standards, because no characters had been encoded beyond the BMP, and because implementations were still mostly treating the standard as a de facto fixed-width 16-bit encoding.

The conformance wording about U+FFFE and U+FFFF changed somewhat in Unicode 2.0, but these were still the only two code points with this unique status, and there were no other "noncharacters" in the standard. The code charts switched to the current convention of showing what we now know as "noncharacters" with black cells in the code charts, rather than omitting the code points altogether. The names list annotations were unchanged from Unicode 1.0, and the Specials Block description text was essentially unchanged as well. Unicode 3.0 introduced the term "noncharacter" to describe U+FFFE and U+FFFF, not as a formal definition, but simply as a subhead in the text.

The Chapter 2 language in Unicode 2.0 dropped the explicit prohibition against transmission or storage of U+FFFE and U+FFFF, but instead added the language, "U+FFFF is reserved for private program use as a sentinel or other signal." That statement effectively blessed existing practice for Unicode 2.0 (and 3.0), where 16-bit implementations were taking advantage of the fact that the very last code point in the BMP was reserved and conveniently could also be interpreted as a (signed) 16-bit value of -1, to use it as a sentinel value in some string processing.
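The numerical coincidence mentioned above is easy to verify: reinterpreting the 16-bit pattern FFFF as a signed two's-complement value gives -1. A minimal sketch, using a hypothetical helper `as_signed_16` built on the standard `struct` module:

```python
import struct

def as_signed_16(value: int) -> int:
    """Reinterpret an unsigned 16-bit value as a signed 16-bit value."""
    return struct.unpack("<h", struct.pack("<H", value))[0]

print(as_signed_16(0xFFFF), as_signed_16(0xFFFE), as_signed_16(0x7FFF))
```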

Unicode 3.0 [1999] formalized the definition of "transformations", now more widely referred to as UTFs. And there was one very important addition to the text which makes it clear that U+FFFE and U+FFFF still had a special status and were not considered "valid" Unicode characters. Chapter 3, p. 46 included the language:

To ensure that round-trip transcoding is possible, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE₁₆, FFFF₁₆, and unpaired surrogates.

That initial formulation of UTF mapping was erroneous. A lot of work was done to correct and clarify the concepts of encoding forms and UTF mapping in the versions immediately following Unicode 3.0, to correct various defects in the specification.

Unicode 3.1 [2001] was the watershed for the development of noncharacters in the standard. Unicode 3.1 was the first version to add supplementary characters to the standard. As a result, it also had to come to grips with the fact that ISO/IEC 10646-2:2001 had reserved the last two code points for every plane as "not a character", despite the fact that their code point values shared nothing with the rationale for reserving U+FFFE and U+FFFF when the entire codespace was just 16 bits.

The Unicode 3.1 text formally defined noncharacters, and also designated the code point range U+FDD0..U+FDEF as noncharacters, resulting in the 66 noncharacters defined in the standard.

Unicode 4.0 [2003] finally corrected the statement about mapping noncharacters and surrogate code points:

To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.

That correction results in the current situation for Unicode, where noncharacters are valid Unicode scalar values, are valid in Unicode strings, and must be mapped through UTFs, whereas surrogate code points are not valid Unicode scalar values, are not valid in UTFs, and cannot be mapped through UTFs.
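This distinction is visible in conforming encoders. The sketch below relies on the behavior of CPython 3's built-in codecs, which encode noncharacters but reject lone surrogates; the helper name `can_encode` is illustrative:

```python
def can_encode(cp: int, encoding: str) -> bool:
    """Return True if the code point can be encoded in the given UTF."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

UTFS = ("utf-8", "utf-16-le", "utf-32-le")

# Noncharacters are Unicode scalar values and must be mapped by every UTF.
noncharacters_ok = all(
    can_encode(cp, enc)
    for cp in (0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF)
    for enc in UTFS
)

# Surrogate code points are not scalar values and cannot be mapped.
surrogates_ok = any(
    can_encode(cp, enc) for cp in (0xD800, 0xDFFF) for enc in UTFS
)

# A noncharacter also survives a round trip between UTFs unchanged.
round_trip = chr(0xFFFF).encode("utf-8").decode("utf-8").encode("utf-16-le")

print(noncharacters_ok, surrogates_ok, round_trip)
```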

Unicode 4.0 also added an entire new informative section about noncharacters, which recommended the use of U+FFFF and U+10FFFF "for internal purposes as sentinels." That new text also stated that "[noncharacters] are forbidden for use in open interchange of Unicode text data," a claim which was stronger than the formal definition. And it made a contrast between noncharacters and "valid character value[s]", implying that noncharacters were not valid. Of course, noncharacters could not be interpreted in open interchange, but the text in this section had not really caught up with the implications of the change of wording in the conformance requirements for UTFs. The text still echoed the sense of "invalid" associated with noncharacters in Unicode 3.0.

Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

For up-to-date Unicode implementations, however, one should use caution when choosing sentinel values. U+FFFF and U+10FFFF still have interesting numerical properties which render them likely choices for internal use as sentinels, but implementers should be aware of the fact that those values, as for all noncharacters in the standard, are also valid in Unicode strings, must be converted between UTFs, and may be encountered in Unicode data—not necessarily used with the same interpretation as for one's own sentinel use. Just be careful out there!