L2/08-280

Date/Time: Sun Jul 20 20:58:30 CDT 2008
Contact: alex.purins@businesslink.com
Name: Alex Purins
Report Type: Public Review Issue
Opt Subject: Issue #121 Recommended Practice for Replacement Characters


Clearly #3, not #2, and definitely not #1.

Reasons
```````

Replacement character use should be decided by how well the number of
replacement characters matches the number of actual characters in the
input (which in most cases will be well defined), and by the flexibility
made available to applications that do not want the default behaviour.

Option #3, one replacement character per invalid code unit, is much better
than #2, one per maximal subpart, because

(a) #3 comes closer to reflecting the size of the invalid data.

(b) #3 is much easier to describe (complicated standards being sensible
    only for worthwhile benefits not deliverable by simpler means).

(c) #3 is not grossly larger than #2 in resultant size, so there is no
    significant penalty in choosing it (nor significant size gain by
    using #2).

Option #1, one replacement character per run of invalid code units, is not
worth considering because it will often collapse a large amount of data
into a single character, and therefore

(a) #1 completely fails to reflect the size of the invalid data.

(b) #1 wrongly reduces application flexibility, because its result cannot
    be further processed to give #2 or #3, while the output of either of
    those can be trivially collapsed to give the same result as #1 (see
    the sketch under Detail below).

Detail
``````

Even without good samples of international text and solid real-world
testing, a theoretical assessment supports #3 over #2.

For many inputs, #3 and #2 are the same.
UTF-32 has no code unit sequences, only individual values, some of which
are invalid (all above 10FFFF, plus some scattered lower ones).
UTF-16 has no maximal subparts longer than one code unit, only eg orphan
low surrogates and a few individual invalid values.
Simple non-Unicode code pages, both single and double byte, have no code
unit sequences.
EUC and other non-Unicode escape-driven forms would seem to be a
combination of the above (though I am not familiar with the details).

For UTF-8 input, #3 and #2 differ noticeably, with #3 more accurate.
Broadly, #3 will count all invalid bytes, while #2 will count a
complicated subset (00-7F and lead 80-FF bytes, dropping some number of
80-BF bytes where they follow a C0-F7 lead), suggesting #3 is preferable,
but the detail is worth investigating.
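To make the behavioural difference concrete, here is a minimal Python
sketch (illustrative only: the function names and the scanner are my own
construction, intended to follow the UTF-8 well-formedness table, and are
not taken from the proposal text). It decodes a byte string under option
#3, one U+FFFD per invalid code unit, and under option #2, one U+FFFD per
maximal subpart, and shows that either result can be collapsed after the
fact to give option #1, whereas #1 cannot be expanded back into #2 or #3.

import re

def _second_byte_range(lead):
    # Allowed range for the second byte of each lead byte class, per the
    # UTF-8 well-formedness table; later bytes are always 80..BF.
    if 0xC2 <= lead <= 0xDF: return (0x80, 0xBF)
    if lead == 0xE0:         return (0xA0, 0xBF)
    if 0xE1 <= lead <= 0xEC: return (0x80, 0xBF)
    if lead == 0xED:         return (0x80, 0x9F)
    if 0xEE <= lead <= 0xEF: return (0x80, 0xBF)
    if lead == 0xF0:         return (0x90, 0xBF)
    if 0xF1 <= lead <= 0xF3: return (0x80, 0xBF)
    if lead == 0xF4:         return (0x80, 0x8F)
    return None              # C0, C1, F5..FF, or a stray trailing byte

def _scan(data, i):
    """Return (consumed, ok): either a complete well-formed sequence
    (ok=True) or the maximal subpart of an ill-formed subsequence
    (ok=False), starting at index i."""
    lead = data[i]
    if lead <= 0x7F:
        return 1, True
    rng = _second_byte_range(lead)
    if rng is None:
        return 1, False                    # invalid single code unit
    need = 2 if lead <= 0xDF else 3 if lead <= 0xEF else 4
    lo, hi = rng
    consumed = 1
    for k in range(1, need):
        if i + k >= len(data) or not (lo <= data[i + k] <= hi):
            return consumed, False         # truncated or broken here
        lo, hi = 0x80, 0xBF                # later bytes: plain trail bytes
        consumed += 1
    return need, True

def decode_option_3(data):
    """Option #3: one U+FFFD per invalid code unit (byte)."""
    out, i = [], 0
    while i < len(data):
        n, ok = _scan(data, i)
        out.append(data[i:i + n].decode("utf-8") if ok else "\uFFFD" * n)
        i += n
    return "".join(out)

def decode_option_2(data):
    """Option #2: one U+FFFD per maximal subpart."""
    out, i = [], 0
    while i < len(data):
        n, ok = _scan(data, i)
        out.append(data[i:i + n].decode("utf-8") if ok else "\uFFFD")
        i += n
    return "".join(out)

def collapse_to_option_1(text):
    """Option #1 derived after the fact from #2 or #3 output:
    collapse each run of U+FFFD into a single U+FFFD."""
    return re.sub("\uFFFD+", "\uFFFD", text)

For example, for the bytes E2 82 61 62 63 F0 9F 41 (two truncated
multi-byte sequences around "abc", then "A"), decode_option_2 gives one
U+FFFD per truncated sequence (6 characters in all), decode_option_3 gives
two per truncated sequence (8 characters), and collapse_to_option_1
applied to either result gives the option #1 output.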
In detail, the effect of #2 and #3 on supposedly UTF-8 input data can be
summarised by language/script/encoding/transformation as follows.

Key to Codes Used

  xN   Treating this data as UTF-8 inherently multiplies the number of
       characters by N, even if the data is valid UTF-8 and therefore no
       replacement is done.

  n/r  No replacement needed.

  =    This replacement option results in exactly the inherent N
       multiplier effect, ie reflects the size of the invalid data as
       accurately as is possible.

  <    This replacement option results in less than the inherent N
       multiplier effect, ie the size of the invalid data is wrongly
       reduced.

Behaviour of Replacement Options

        #2    #3    Actual data which is being treated as UTF-8

  x1    n/r   n/r   Any UTF-8

  x1    n/r   n/r   English ASCII
                    - byte values 00 to 7F

  x1    <     =     Non-English Latin extended ASCII, eg Windows Western
                    European French
                    - byte values largely 00 to 7F with scattered A0 to FF

  x1    <     =     Non-Latin extended ASCII, eg ISO 8859-5 Russian
                    - byte values A0 to FF plus scattered 00 to 1F controls

  x1    <     =     Single byte EBCDIC, any language
                    - byte values 40 to FF plus scattered 00 to 3F controls

  x2    <     =     Double byte EBCDIC, any language
                    - byte values 40 to FF plus scattered 00 to 3F controls

  x2    n/r   n/r   English UTF-16
                    - byte values 00 to 7F, every alternate byte 00

  x2    =     =     Other Latin UTF-16
                    - byte values largely 00 to 7F with scattered A0 to FF,
                      every alternate byte 00

  x2    =     =     Non-Latin alphabetic UTF-16, eg Cyrillic
                    - byte values 00 to FF, every alternate byte the same
                      and less than 20 (eg 04 for Cyrillic)

  x2    <     =     CJK UTF-16
                    - byte values 00 to FF, alternate bytes only 4E to 9E
                      and possibly D8 to DF

  x4    n/r   n/r   English UTF-32
                    - varying byte values 00 to 7F, three lead 00 bytes
                      ahead of (or after) each varying byte

  x4    =     =     Non-English Latin UTF-32
                    - varying byte values 00 to FF, three lead 00 bytes
                      ahead of (or after) each varying byte

  x4    =     =     Non-Latin alphabetic UTF-32
                    - varying byte values 00 to FF, two lead 00 bytes plus
                      a fixed third byte (eg 04 for Cyrillic) ahead of each
                      varying byte

  x3~   <     =     Other UTF-32
                    - varying byte values 00 to FF, often with one or two
                      lead 00 bytes ahead of two or three varying byte
                      values

  x?    <     =     Random binary, eg images or encrypted data
                    - byte values 00 to FF, no input character concept,
                      but #2 will collapse some incomplete UTF-8 sequences

Heuristics to discover the true encoding are correctly not part of the
Unicode standard.

-- Alex Purins

(End of Report)