Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 8 May 2015 02:16:25 +0200

It would be more exact to say that JSON strings, like strings in
JavaScript, Java, and many other programming languages, are just binary
streams of 16-bit code units. The transport syntax of JSON does not even
require that the textual syntax itself be encoded in UTF-16, and in most
cases it will be transported as UTF-8.
So before processing a "text/json" content type, you first have to
determine the appropriate character encoding to decode this syntax (in HTTP
you would use a MIME header to specify the charset effectively used, but
the "text/json" MIME type uses UTF-8 by default).
Then the JSON processor will decode this text and remap it to an
internal UTF-16 encoding (for characters that are not escaped), and each
"\uXXXX" escape will be decoded as a plain 16-bit code unit. The result is a
stream of 16-bit code units, which can then be output externally and
encoded or stored in any convenient encoding that preserves this stream,
EVEN if it is not valid UTF-16.
If you need UTF-16 validation, this is not the job of JSON itself (or of
Java, JavaScript, or similar) but depends on the application using the
JSON data: some applications will reject the stream as invalid because they
expect their input to be a valid UTF (not necessarily UTF-16 or UTF-8), or
they may restrict the character set they support even further (e.g.
restrict it to just ASCII, or support some other encoding such as the GSM
encoding for SMS, or just use the lowest 8 bits of each code unit).
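
As a rough illustration of the split described above between JSON decoding
and application-level validation, here is a minimal Python sketch. It
assumes the payload arrives as UTF-8 bytes, and the helper names
(decode_json_string, lone_surrogate_positions, is_scalar_value_sequence) are
made up for this example. Python's json module joins well-formed surrogate
pairs and leaves unpaired surrogates in the decoded string, so any surrogate
code point that remains after parsing is an unpaired one.

```python
import json


def decode_json_string(payload: bytes, charset: str = "utf-8") -> str:
    """Decode the transport bytes, then parse a single JSON string value."""
    text = payload.decode(charset)   # step 1: undo the transport encoding
    value = json.loads(text)         # step 2: JSON syntax and \uXXXX escapes
    if not isinstance(value, str):
        raise TypeError("expected a JSON string")
    return value


def lone_surrogate_positions(s: str) -> list:
    """Indices of code points in s that are not Unicode scalar values."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_scalar_value_sequence(s: str) -> bool:
    """Application-level check: True if s contains no unpaired surrogate."""
    return not lone_surrogate_positions(s)


# The JSON layer accepts both inputs; only the application-level check differs.
ok = decode_json_string(b'"\\ud83d\\ude00"')   # a well-formed surrogate pair
bad = decode_json_string(b'"a\\udeadb"')       # an unpaired low surrogate
```

Whether to reject, replace, or pass through a string such as `bad` is then
the application's decision, not the parser's.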

JSON by itself is neutral: its syntax just assumes that any binary
stream of 16-bit code units is encodable and transportable. It could even
be used to transport executable binary code or bitmap image data (such as
JPEG or PNG), provided that there is a way to represent the effective
binary length (when it is not an exact multiple of 16 bits) with additional
data transmitted in the JSON-encoded data. However, the most common way to
store such binary data in JSON is Base64, for example with the "data:"
URL-encoding scheme; this scheme is commonly used in CSS, which can be
safely embedded in JSON strings.
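
Both ways of carrying binary data mentioned above can be sketched in Python.
This is only an illustration under stated assumptions: the function names and
the "length", "units", and "data" field names are hypothetical, and the first
variant relies on every processor along the way preserving arbitrary 16-bit
code units. Because Python's json module joins surrogate pairs when parsing,
the sketch re-encodes the decoded string as UTF-16BE with the "surrogatepass"
error handler to get the original code units back.

```python
import base64
import json


def pack_units(data: bytes) -> str:
    """Carry bytes as \\uXXXX code units plus an explicit byte length."""
    padded = data + b"\x00" * (len(data) % 2)   # pad to a multiple of 16 bits
    units = [int.from_bytes(padded[i:i + 2], "big")
             for i in range(0, len(padded), 2)]
    escaped = "".join("\\u%04x" % u for u in units)
    return '{"length": ' + str(len(data)) + ', "units": "' + escaped + '"}'


def unpack_units(doc: str) -> bytes:
    """Recover the bytes; 'surrogatepass' keeps unpaired surrogates intact."""
    obj = json.loads(doc)
    raw = obj["units"].encode("utf-16-be", "surrogatepass")
    return raw[:obj["length"]]                  # drop the padding byte, if any


def pack_base64(data: bytes) -> str:
    """The more common approach: Base64 text inside an ordinary JSON string."""
    return json.dumps({"data": base64.b64encode(data).decode("ascii")})


def unpack_base64(doc: str) -> bytes:
    return base64.b64decode(json.loads(doc)["data"])


# 7 bytes: an unpaired surrogate, a surrogate pair, and one odd trailing byte.
sample = bytes.fromhex("deadd83dde0001")
```

The Base64 variant produces only ASCII characters, so it survives processors
that validate or normalize text; the code-unit variant does not have that
guarantee.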

I don't think this is a bad thing about JSON: JSON strings are NOT
equivalent to text. (Nor is all text valid Unicode text: some encodings have
character entities with no one-to-one mapping in the UCS, for example
private-use characters that require an external agreement if we want to map
them to PUA in the UCS, or characters that the encoding maps to
non-characters of the UCS.) There is an "assumed" encoding only for
characters that are not reserved by the JSON syntax and not represented as
escape sequences, and this assumption is itself based on an external
agreement about the encoding used in the transport.
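
For applications that do want to know what kind of code point a decoded
value is, a check along the following lines is one possibility. It is a
sketch only: the function name classify and its labels are invented here,
and the "unassigned" answer depends on the version of the Unicode database
shipped with the Python build in use.

```python
import unicodedata


def classify(cp: int) -> str:
    """Rough classification of a code point in the range 0..0x10FFFF."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # a code point, but not a scalar value
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # U+FDD0..U+FDEF, and U+xxFFFE/U+xxFFFF
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private-use"      # meaning requires an external agreement
    if category == "Cn":
        return "unassigned"       # depends on the Unicode database version
    return "assigned"
```

For example, classify(0xDEAD) reports a surrogate and classify(0xE000) a
private-use code point; what to do with either is again up to the
application.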

2015-05-07 22:29 GMT+02:00 Daniel Bünzli <daniel.buenzli_at_erratique.ch>:

> On Thursday, 7 May 2015 at 21:59, Markus Scherer wrote:
> > I assume that the JSON spec deliberately allows anything that Java and
> > JavaScript allow. In particular, there is no requirement for a Java String
> > or JavaScript string to contain "text", or well-formed UTF-16, or only
> > assigned characters.
> >
> > Some code stores binary data (sequence of arbitrary 16-bit unsigned
> > integers) in a "string", just because it is easy and fairly efficient to
> > transport.
> >
> > You should "validate" *text* only when you are certain that it is indeed
> > text.
>
> Section 8.2 [1] of the spec specifically says that only strings that
> represent sequences of Unicode scalar values (they say "characters") are
> interoperable and that strings that do not represent such sequences like
> "\uDEAD" can lead to unpredictable behaviour.
>
> If you want to transmit binary data reliably in json you must apply some
> form of binary to Unicode scalar value encoding (like in most text based
> interchange formats).
>
> Best,
>
> Daniel
>
> [1] https://tools.ietf.org/html/rfc7159#section-8.2
>
>
Received on Thu May 07 2015 - 19:17:58 CDT
