L2/08-280

Date/Time: Sun Jul 20 20:58:30 CDT 2008
Contact: alex.purins@businesslink.com
Name: Alex Purins
Report Type: Public Review Issue
Opt Subject: Issue #121 Recommended Practice for Replacement Characters


Clearly #3, not #2, and definitely not #1.

Reasons
```````

Replacement character use should be decided by how well the number of
replacement characters matches the number of actual characters in the
input (which in most cases will be well defined), and by the flexibility
made available to applications that do not want the default behaviour.

Option #3, one replacement character per invalid code unit, is much better
than #2, one per maximal subpart, because

(a) #3 comes closer to reflecting the size of the invalid data.

(b) #3 is much easier to describe (complicated standards being sensible
    only for worthwhile benefits not deliverable by simpler means).

(c) #3 is not grossly larger than #2 in resultant size, so there is no
    significant penalty in choosing it (nor significant size gain by
    using #2).

Option #1, one replacement character per run of invalid code units, is not
worth considering because it will often collapse a large amount of data
into a single character, and therefore

(a) #1 completely fails to reflect the size of the invalid data.

(b) #1 wrongly reduces application flexibility, because its result cannot
    be further processed to give #2 or #3, while the output of either of
    those can be trivially collapsed to give the same result as #1 (see
    the sketch under Detail below).

Detail
``````

Even without good samples of international text and solid real-world
testing, a theoretical assessment supports #3 over #2.

For many inputs, #3 and #2 are the same.
UTF-32 has no code unit sequences, only individual values, some of which
are invalid (all above 10FFFF, plus some scattered lower ones).
UTF-16 has no maximal subparts longer than one code unit, only eg orphan
low surrogates and a few individual invalid values.
Simple non-Unicode code pages, both single and double byte, have no code
unit sequences.
EUC and other non-Unicode escape-driven forms would seem to be a
combination of the above (though I am not familiar with the details).

For UTF-8 input, #3 and #2 differ noticeably, with #3 more accurate.
Broadly, #3 will count all invalid bytes, while #2 will count a
complicated subset (00-7F and lead 80-FF bytes, dropping some number of
80-BF bytes where they follow a C0-F7 lead), suggesting #3 is preferable,
but the detail is worth investigating.
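To make the behavioural difference concrete, here is a minimal Python
sketch (illustrative only: the function names and the scanner are my own
construction, intended to follow the UTF-8 well-formedness table, and are
not taken from the proposal text). It decodes a byte string under option
#3, one U+FFFD per invalid code unit, and under option #2, one U+FFFD per
maximal subpart, and shows that either result can be collapsed after the
fact to give option #1, whereas #1 cannot be expanded back into #2 or #3.

import re

def _second_byte_range(lead):
    # Allowed range for the second byte of each lead byte class, per the
    # UTF-8 well-formedness table; later bytes are always 80..BF.
    if 0xC2 <= lead <= 0xDF: return (0x80, 0xBF)
    if lead == 0xE0:         return (0xA0, 0xBF)
    if 0xE1 <= lead <= 0xEC: return (0x80, 0xBF)
    if lead == 0xED:         return (0x80, 0x9F)
    if 0xEE <= lead <= 0xEF: return (0x80, 0xBF)
    if lead == 0xF0:         return (0x90, 0xBF)
    if 0xF1 <= lead <= 0xF3: return (0x80, 0xBF)
    if lead == 0xF4:         return (0x80, 0x8F)
    return None              # C0, C1, F5..FF, or a stray trailing byte

def _scan(data, i):
    """Return (consumed, ok): either a complete well-formed sequence
    (ok=True) or the maximal subpart of an ill-formed subsequence
    (ok=False), starting at index i."""
    lead = data[i]
    if lead <= 0x7F:
        return 1, True
    rng = _second_byte_range(lead)
    if rng is None:
        return 1, False                    # invalid single code unit
    need = 2 if lead <= 0xDF else 3 if lead <= 0xEF else 4
    lo, hi = rng
    consumed = 1
    for k in range(1, need):
        if i + k >= len(data) or not (lo <= data[i + k] <= hi):
            return consumed, False         # truncated or broken here
        lo, hi = 0x80, 0xBF                # later bytes: plain trail bytes
        consumed += 1
    return need, True

def decode_option_3(data):
    """Option #3: one U+FFFD per invalid code unit (byte)."""
    out, i = [], 0
    while i < len(data):
        n, ok = _scan(data, i)
        out.append(data[i:i + n].decode("utf-8") if ok else "\uFFFD" * n)
        i += n
    return "".join(out)

def decode_option_2(data):
    """Option #2: one U+FFFD per maximal subpart."""
    out, i = [], 0
    while i < len(data):
        n, ok = _scan(data, i)
        out.append(data[i:i + n].decode("utf-8") if ok else "\uFFFD")
        i += n
    return "".join(out)

def collapse_to_option_1(text):
    """Option #1 derived after the fact from #2 or #3 output:
    collapse each run of U+FFFD into a single U+FFFD."""
    return re.sub("\uFFFD+", "\uFFFD", text)

For example, for the bytes E2 82 61 62 63 F0 9F 41 (two truncated
multi-byte sequences around "abc", then "A"), decode_option_2 gives one
U+FFFD per truncated sequence (6 characters in all), decode_option_3 gives
two per truncated sequence (8 characters), and collapse_to_option_1
applied to either result gives the option #1 output.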
In detail, the effect of #2 and #3 on supposedly UTF-8 input data can be
summarised by language/script/encoding/transformation as follows.

Key to Codes Used

  xN   Treating this data as UTF-8 inherently multiplies the number of
       characters by N, even if the data is valid UTF-8 and therefore no
       replacement is done.

  n/r  No replacement needed.

  =    This replacement option results in exactly the inherent N
       multiplier effect, ie reflects the size of the invalid data as
       accurately as is possible.

  <    This replacement option results in less than the inherent N
       multiplier effect, ie the size of the invalid data is wrongly
       reduced.

Behaviour of Replacement Options

        #2    #3    Actual data which is being treated as UTF-8

  x1    n/r   n/r   Any UTF-8

  x1    n/r   n/r   English ASCII
                    - byte values 00 to 7F

  x1    <     =     Non-English Latin extended ASCII, eg Windows Western
                    European French
                    - byte values largely 00 to 7F with scattered A0 to FF

  x1    <     =     Non-Latin extended ASCII, eg ISO 8859-5 Russian
                    - byte values A0 to FF plus scattered 00 to 1F controls

  x1    <     =     Single byte EBCDIC, any language
                    - byte values 40 to FF plus scattered 00 to 3F controls

  x2    <     =     Double byte EBCDIC, any language
                    - byte values 40 to FF plus scattered 00 to 3F controls

  x2    n/r   n/r   English UTF-16
                    - byte values 00 to 7F, every alternate byte 00

  x2    =     =     Other Latin UTF-16
                    - byte values largely 00 to 7F with scattered A0 to FF,
                      every alternate byte 00

  x2    =     =     Non-Latin alphabetic UTF-16, eg Cyrillic
                    - byte values 00 to FF, every alternate byte the same
                      and less than 20 (eg 04 for Cyrillic)

  x2    <     =     CJK UTF-16
                    - byte values 00 to FF, alternate bytes only 4E to 9E
                      and possibly D8 to DF

  x4    n/r   n/r   English UTF-32
                    - varying byte values 00 to 7F, three lead 00 bytes
                      ahead of (or after) each varying byte

  x4    =     =     Non-English Latin UTF-32
                    - varying byte values 00 to FF, three lead 00 bytes
                      ahead of (or after) each varying byte

  x4    =     =     Non-Latin alphabetic UTF-32
                    - varying byte values 00 to FF, two lead 00 bytes plus
                      a fixed third byte (eg 04 for Cyrillic) ahead of each
                      varying byte

  x3~   <     =     Other UTF-32
                    - varying byte values 00 to FF, often with one or two
                      lead 00 bytes ahead of two or three varying byte
                      values

  x?    <     =     Random binary, eg images or encrypted data
                    - byte values 00 to FF, no input character concept,
                      but #2 will collapse some incomplete UTF-8 sequences

Heuristics to discover the true encoding are correctly not part of the
Unicode standard.

-- Alex Purins

(End of Report)