Mark Davis, 2009-05-29. Last updated 07-07
There are some security and interoperability issues with
PEP 383 (http://python.org/dev/peps/pep-0383/),
as outlined below.
While it appears that the
security issues may not be problematic in Python, if
used as intended, we should present these issues
publicly in some way so that people using similar
systems are made aware of the problems.
There is a known problem with file systems that use a
legacy charset. When you use a Unicode API to find the
files in a directory, you typically get back a list of
Unicode file names. You use those names to access the
files through some other API. There are two possible
problems:
- One of the file names is invalid according to
the legacy charset converter. For example, it is an
SJIS string consisting of bytes <E0 30>.
- Two of the file names are mapped to the same
Unicode string by the legacy charset converter.
These problems come up in other situations besides file
systems as well. A common source is when a byte string
that is valid in one charset is converted by a different
charset's converter. For example, the byte string <E0
30> that is invalid in SJIS is perfectly meaningful in
Latin-1, representing "à0".
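The <E0 30> case can be reproduced directly. This is a minimal sketch, assuming Python's built-in `shift_jis` and `latin-1` codecs as stand-ins for the legacy charset converters discussed here:

```python
data = bytes([0xE0, 0x30])

# Under Shift_JIS, E0 is a lead byte and 30 is not a valid trail byte,
# so a strict converter rejects the sequence.
try:
    data.decode("shift_jis")
    sjis_valid = True
except UnicodeDecodeError:
    sjis_valid = False

# Under Latin-1 every byte is a character, so the same bytes are meaningful.
latin1_text = data.decode("latin-1")

print(sjis_valid)
print(latin1_text)
```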
A possible solution
to this is to enable all charset converters to
losslessly (reversibly) convert to Unicode. That is, any
sequence of bytes can be converted by each charset
converter to a Unicode string, and that Unicode string
will be converted back to exactly that original sequence
of bytes by that converter. This precludes, for example,
the charset converter's mapping two different
byte sequences to
U+FFFD ( � ) REPLACEMENT CHARACTER, since the
original bytes could not be recovered. It also precludes
having "fallbacks": cases where two
different byte sequences map to the same Unicode
string.
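To see why mapping to U+FFFD is not reversible, consider this small sketch. It assumes Python's `utf-8` codec with the `replace` error handler as the lossy converter; the byte values are chosen only for illustration:

```python
# Two different invalid byte sequences...
first = b"\xfe".decode("utf-8", errors="replace")
second = b"\xff".decode("utf-8", errors="replace")

# ...collapse to the same Unicode string, U+FFFD.
same_text = first == second == "\ufffd"

# Converting back yields the UTF-8 form of U+FFFD, not the original byte.
recovered = first.encode("utf-8")

print(same_text, recovered)
```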
PEP 383 Approach
The basic idea of PEP 383 is to be able to do this by
converting all "unmappable" sequences to a sequence of
one or more isolated high surrogate code points: that
is, each code point's value is 0xD800 plus the
corresponding unmappable byte value. With this
mechanism, every maximal subsequence of bytes that can
be reversibly mapped to Unicode by the charset converter
is so mapped; any intervening subsequences are converted
to a sequence of high surrogates. The result is a
Unicode String, but is not a well-formed UTF sequence.
For example, suppose that the byte 81 is illegal in
charset n. When converted to Unicode, PEP 383
represents this as U+D881. When mapped back to bytes
(for charset n), that turns back into the
byte 81. This allows the source byte sequence to be
reversibly represented in a Unicode String, no matter
what the contents. If this mechanism is applied to a
charset converter that has no fallbacks from bytes to
Unicode, then the charset converter becomes reversible
(from bytes to Unicode to bytes).
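The round trip can be tried with Python's `surrogateescape` error handler, which is the implementation of PEP 383. One caveat: the handler as shipped uses the code points U+DC80..U+DCFF rather than the D8xx values used in this note's notation, so byte 81 becomes U+DC81. A minimal sketch, with the `ascii` codec chosen only because 81 is invalid in it:

```python
raw = b"abc\x81def"

# B2Un: the unmappable byte 81 becomes an isolated surrogate.
text = raw.decode("ascii", errors="surrogateescape")
escaped = text[3]

# U2Bn with the same charset restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")

print(hex(ord(escaped)), restored == raw)
```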
Note that this
only works when the Unicode string is converted back
with the very same charset converter that was used to
convert from bytes. For more information on PEP 383, see
http://python.org/dev/peps/pep-0383/.
The following notation is used below:
- B2Un is the bytes-to-Unicode converter for
charset n.
- U2Bn is the Unicode-to-bytes converter for
charset n.
- An invalid byte is one that would be
mapped by PEP 383 to a high surrogate, because it is
(or is part of a sequence that is) not reversibly
mappable. Note that the context of the byte is
important: 81 alone might be unmappable, while 81
followed by 40 is valid.
Unicode implementations have been subject to a number of
security exploits centered around ill-formed encoding.
Systems making incorrect use of a PEP 383 style mechanism
are subject to such an attack.
Suppose the source byte stream
is <A B X D>, and that according to the charset
converter being used (n), X is an invalid byte. B2Un
transforms the byte stream into Unicode as <G Y H>,
where Y is an isolated surrogate. U2Bn maps back to the
correct original <A B X D>. That is the intended usage
of PEP 383.
The problem comes when that Unicode
sequence is converted back to bytes by a different
charset converter m. Suppose that U2Bm maps Y into a
valid byte representing "/", or any one of a number of
other security-sensitive characters. That means that
converting <G Y H> via U2Bm to bytes, and back to
Unicode via B2Um, results in the string "G/H", where the "/" did
not exist in the original.
This violates one of
the cardinal security rules for transformations of
Unicode strings: creating a character where no valid
character previously existed. This was, for example, at
the heart of the "non-shortest form" security exploits.
A gatekeeper is watching for suspicious characters. It
doesn't see Y as one of them, but past the gatekeeper, a
conversion of U2Bm followed by B2Um results in a
suspicious character where none previously existed.
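The gatekeeper scenario can be sketched with Python's `surrogateescape` handler. Two assumptions apply: Python only escapes bytes >= 128 (as U+DC80..U+DCFF, not D8xx), so "/" itself cannot be produced this way, and "é" is used here as a hypothetical stand-in for a security-sensitive character. Charset n is `ascii` and charset m is `latin-1`:

```python
SUSPICIOUS = {"é"}  # hypothetical stand-in for a sensitive character

def gatekeeper_allows(text):
    """Return True if no suspicious character is present."""
    return not any(ch in SUSPICIOUS for ch in text)

source = b"G\xe9H"

# B2Un: E9 is invalid in ASCII, so it becomes an isolated surrogate.
text = source.decode("ascii", errors="surrogateescape")
passed = gatekeeper_allows(text)

# Past the gatekeeper: U2Bm followed by B2Um with a different charset.
after = text.encode("latin-1", errors="surrogateescape").decode("latin-1")

print(passed, after, gatekeeper_allows(after))
```

The string passes the check while it holds the surrogate, yet contains the watched character after the U2Bm/B2Um round trip.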
The suggested solution for this is that a converter
may map an isolated surrogate Y onto a byte stream only
when the resulting byte would be an illegal byte in the
target charset. If not, then an exception should be thrown, or a
replacement byte or byte sequence must be used instead
(such as the SUB character). For details, see
Safely Converting to Bytes, below. This replacement
would be similar to what is used when trying to convert
a Unicode character that cannot be represented in the
target encoding. That preserves the ability to
round-trip when the same encoding is used, but prevents
security attacks. Note that simply not representing
(deleting) Y in the output is not an option, since that
is also open to security exploits.
It appears that PEP 383, when used as intended in
Python, is unlikely to present security problems.
According to information from the author:
- PEP 383 is only intended for use with
- Only bytes >= 128 will be transformed to D8xx or
- The combination of these factors means that no
ASCII-repertoire characters (which represent the
most serious problems for security) would ever be
generated.
- The primary use of PEP 383 is in file systems,
and where the Unicode resulting from PEP 383 is only
ever converted back to bytes on the same system,
using the same charset converter.
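The restriction to bytes >= 128 can be observed in Python's `surrogateescape` handler, which uses U+DC80..U+DCFF rather than this note's D8xx notation. In this minimal sketch, `can_unescape` is a hypothetical helper, not part of any library:

```python
def can_unescape(code_point, charset="ascii"):
    """Return True if the handler maps this code point back to a byte."""
    try:
        chr(code_point).encode(charset, errors="surrogateescape")
        return True
    except UnicodeEncodeError:
        return False

high_byte = can_unescape(0xDC81)    # would become byte 81
ascii_slash = can_unescape(0xDC2F)  # would become "/", and is refused

print(high_byte, ascii_slash)
```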
However, if PEP 383 is used more generally by
applications, or similar systems are used more
generally, security exploits are possible. Developers of
charset converters should also be aware of other
possible security issues, so that they avoid
non-shortest-form exploits and others, and should consult
UTR #36: Unicode Security Considerations.
The choice of isolated surrogates (D8xx) as the way to
represent the unconvertible bytes appears clever at
first glance. However, it presents certain
interoperability and security issues. Such isolated
surrogates are not well formed. Although they can be
represented in Unicode Strings, they are not supported
by conformant UTF-8, UTF-16, or UTF-32 converters or
implementations. This may cause interoperability
problems, since many systems replace incoming ill-formed
Unicode sequences by replacement characters.
It may also cause security problems. Although
strongly discouraged for security reasons, some
implementations may delete the isolated surrogates,
which can cause a security problem when two substrings
become adjacent that were previously separated.
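Both effects can be seen with Python's own UTF-8 encoder. This is a small sketch, assuming the `replace` and `ignore` error handlers as examples of how a receiving system might treat an isolated surrogate; the sample string is made up for illustration:

```python
text = "scr\udc81ipt"  # two substrings separated by an isolated surrogate

# A conformant UTF-8 encoder refuses the ill-formed string.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Replacement loses the original byte but keeps the substrings apart.
replaced = text.encode("utf-8", errors="replace")

# Deletion joins the two substrings, which is the dangerous case.
deleted = text.encode("utf-8", errors="ignore")

print(strict_ok, replaced, deleted)
```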
There are different alternatives:
- Use 256 private use code points, somewhere in
the ranges F0000..FFFFD or 100000..10FFFD. This
would probably cause the fewest security and
interoperability problems. There is some possibility
of collision with other uses of private use
code points.
- Use pairs of noncharacter code points in the
range FDD0..FDEF. These are "super" private use
characters, and are discouraged for general
interchange. The transformation would take each
nibble of a byte Y, and add to FDD0 and FDE0,
respectively. However, noncharacter code points may
be replaced by
U+FFFD ( � ) REPLACEMENT CHARACTER by
some implementations, especially when they use them
internally. (Again, incoming characters must
never be deleted, since that can cause security
problems.)
- One could also ask the Unicode Consortium to
encode characters expressly for that purpose. That
would be essentially 256 different versions of
U+FFFD ( � ) REPLACEMENT CHARACTER. The
downside of this approach is that even if it were
accepted, it would take a couple of years to get
into the standard.
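The first two alternatives amount to simple arithmetic on the byte value. The sketch below rests on two assumptions not fixed by the text: the private use block starts at F0000 (any 256 consecutive code points in the stated ranges would do), and the high nibble is added to FDD0 while the low nibble is added to FDE0. All function names are hypothetical:

```python
PUA_BASE = 0xF0000  # assumed base; the text only says "somewhere in the ranges"

def to_private_use(byte):
    """Map one byte to one supplementary private use code point."""
    return chr(PUA_BASE + byte)

def from_private_use(ch):
    return ord(ch) - PUA_BASE

def to_nonchar_pair(byte):
    """Map one byte to a pair of noncharacters in FDD0..FDEF."""
    return chr(0xFDD0 + (byte >> 4)) + chr(0xFDE0 + (byte & 0x0F))

def from_nonchar_pair(pair):
    return ((ord(pair[0]) - 0xFDD0) << 4) | (ord(pair[1]) - 0xFDE0)

print(hex(ord(to_private_use(0x81))))
print([hex(ord(c)) for c in to_nonchar_pair(0x81)])
```

Both mappings are reversible for every byte value, at the cost of two code points per byte in the noncharacter variant.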
Safely Converting to Bytes
The following describes how to safely convert a Unicode
buffer U1 to a byte buffer B1 when the D8xx convention
is used. It assumes that an exception is thrown if a
D8xx is problematic. It can be enhanced to use
substitution characters instead, if needed.
The simplest mechanism is just by brute force:
- Convert from Unicode buffer U1 to byte buffer
B1.
- If there were any D8xx's in U1
  - Convert back to Unicode buffer U2 (according
  to the same Charset C1)
  - If U1 != U2, throw an exception.
Because the frequency of D8xx's will be very low, this
approach is probably enough for many implementations.
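The brute-force check can be sketched on top of Python's `surrogateescape` handler. Assumptions: the escapes are U+DC80..U+DCFF (Python's convention) rather than D8xx, `ValueError` stands in for "throw an exception", and `brute_force_encode` is a hypothetical name:

```python
def has_escapes(text):
    """True if the text contains any surrogateescape code point."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

def brute_force_encode(u1, charset):
    """Convert U1 to B1, rejecting escapes that would not round-trip."""
    b1 = u1.encode(charset, errors="surrogateescape")
    if has_escapes(u1):
        # Convert back with the same charset and compare.
        u2 = b1.decode(charset, errors="surrogateescape")
        if u1 != u2:
            raise ValueError("escaped byte would be reinterpreted")
    return b1

print(brute_force_encode("G\udce9H", "ascii"))
```

With `ascii` the escaped byte E9 stays invalid and the conversion succeeds. With `latin-1` the same input raises, because E9 would come back as a valid character.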
There are a number of different ways to optimize this.
Such approaches may vary depending on whether the
converter is stateless or stateful.
Stateless
- At any point at which the converter sees a D8xx,
traverse the following D8xx's (up to a non-D8xx or
EOS), putting each low byte into a buffer B2.
- Convert enough extra input bytes to B2 to
provide for context (basically the maximum bytes per
character minus 1).
- So <D841 D842 XX> => <41 42 YY>
- Convert B2 back to Unicode buffer U3 with the
same converter, using an incremental conversion so
that extra bytes at the end don't cause an error.
- If U3 consists of all D8xx's with low bytes
matching B2, then append B2 (as far as you got) to
B1 and continue.
- This will be the normal -- and unproblematic
-- case.
- If not, throw an exception.
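The steps above might look roughly as follows in Python. This is only a sketch under several assumptions: escapes are Python's U+DC80..U+DCFF rather than D8xx, context is counted in following characters rather than bytes (which supplies at least as many bytes), `ValueError` stands in for the exception, and `stateless_safe_encode` and its `max_bytes_per_char` parameter are hypothetical:

```python
import codecs

def is_escape(ch):
    return 0xDC80 <= ord(ch) <= 0xDCFF

def stateless_safe_encode(u1, charset, max_bytes_per_char=2):
    """Encode u1, checking each run of escapes with local context only."""
    b1 = bytearray()
    i, n = 0, len(u1)
    while i < n:
        j = i
        if not is_escape(u1[i]):
            # Ordinary text: encode the whole run strictly.
            while j < n and not is_escape(u1[j]):
                j += 1
            b1 += u1[i:j].encode(charset)
        else:
            # Put the low byte of each escape in the run into B2.
            while j < n and is_escape(u1[j]):
                j += 1
            run = bytes(ord(ch) - 0xDC00 for ch in u1[i:j])
            # Append a little following input as context.
            k = j
            while (k < n and k - j < max_bytes_per_char - 1
                   and not is_escape(u1[k])):
                k += 1
            b2 = run + u1[j:k].encode(charset)
            # Convert B2 back incrementally; flush only at true end of input.
            decoder = codecs.getincrementaldecoder(charset)(
                errors="surrogateescape")
            u3 = decoder.decode(b2, final=(k == n))
            if u3[:j - i] != u1[i:j]:
                raise ValueError("escaped bytes would form a valid character")
            b1 += run
        i = j
    return bytes(b1)

print(stateless_safe_encode("\udc810", "shift_jis"))
```

Only the bytes near each run of escapes are converted back, so the cost stays proportional to the number of escapes rather than the length of the buffer.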
Stateful
This is basically the same as Stateless, except for the
conversion of B2 to U3, which has the following changes:
- Scan backwards in the output byte buffer, and
pick up the last shift sequence of each kind.
- For simple stateful encodings like 2022-JP,
this will be just the last shift sequence (G0)
- For more complex stateful encodings like
2022-CN, this will be until we get a shift
sequence for each of G0, G1, G2
- Create a new buffer B2a, which is B2 prepended
with those shift sequences.
- Convert B2a to U3
In building the B2Un conversion table, generate and
store the following classification for each byte xx:
- Safe: xx is never part of any valid character
in charset n.
- Unsafe: xx is always part of
some valid character in charset n.
- Mixed: anything else; that is, safe or unsafe depending on
context.
All single-byte charsets would only have safe or
unsafe bytes, so they are easy. The only ones that
require more work are the mixed bytes, which only occur
in multi-byte charsets. Let's take SJIS. In the sequence
<D881 0030> it is safe to map the D881 back to 81,
because <81 30> would map to <D881 0030>. But in the
sequence <D881 0040> it is not, because <81 40> would
map back to <3000>.
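For single-byte charsets the table can be derived by probing each byte value. This is a minimal sketch, assuming Python's codecs stand in for B2Un. `classify_single_byte` is a hypothetical helper, and it deliberately does not attempt the Mixed case, which needs knowledge of a multi-byte charset's lead and trail byte structure:

```python
def classify_single_byte(charset):
    """Classify each byte of a single-byte charset as 'safe' or 'unsafe'."""
    table = {}
    for value in range(256):
        try:
            bytes([value]).decode(charset)
            # The byte is a valid character, so unescaping to it is unsafe.
            table[value] = "unsafe"
        except UnicodeDecodeError:
            # The byte is never valid, so unescaping to it is safe.
            table[value] = "safe"
    return table

cp1252 = classify_single_byte("cp1252")
print(cp1252[0x80], cp1252[0x81])
```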
Once this is done, the
Stateless algorithm is modified to just map the Safe
D8xx's, throw an exception on the Unsafe D8xx's, and use
the plain Stateful process on any sequence of Mixed
D8xx's.
- Instead of traversing backwards, record the
state of the U2B converter, and use it for the B2U
conversion of B2.
Further optimizations are also possible, for both
Stateless and Stateful charset processes.