UTC/2000-009

Proposal for a signature byte sequence for SCSU - updated post UTC 81

To the Unicode Technical Committee,

I propose to define and document a signature byte sequence for the Standard Compression Scheme for Unicode (TR 6).

Proposed text to be added to Unicode Technical Report 6, for example in the "Notes" section:

--------------------------------begin proposed TR 6 text

Unicode Signature Byte Sequence for SCSU

Depending on the implementation of an SCSU encoder, and depending on the following text, a leading U+feff character could be encoded as one of these initial byte sequences (hexadecimal, not showing following text):

 1) 0e fe ff (SQU fe ff, Single-byte mode Quote Unicode)
 2) 0f fe ff (SCU fe ff, Single-byte mode Change to Unicode)
 3) 18 a5 ff (SD0 a5 ff, Single-byte mode Define dynamic window 0 to 0xfe80)
 4) 19 a5 ff (SD1 a5 ff, Single-byte mode Define dynamic window 1 to 0xfe80)
 5) 1a a5 ff (SD2 a5 ff, Single-byte mode Define dynamic window 2 to 0xfe80)
 6) 1b a5 ff (SD3 a5 ff, Single-byte mode Define dynamic window 3 to 0xfe80)
 7) 1c a5 ff (SD4 a5 ff, Single-byte mode Define dynamic window 4 to 0xfe80)
 8) 1d a5 ff (SD5 a5 ff, Single-byte mode Define dynamic window 5 to 0xfe80)
 9) 1e a5 ff (SD6 a5 ff, Single-byte mode Define dynamic window 6 to 0xfe80)
10) 1f a5 ff (SD7 a5 ff, Single-byte mode Define dynamic window 7 to 0xfe80)

It is recommended to use only the byte sequence <0e fe ff> for an initial U+feff character (0e is the "SQU" tag). This convention will assist receiving processes that use initial byte sequences to identify a data file or stream as being encoded in SCSU.

This defines a signature byte sequence similar to the Unicode Signatures for UCS Transformation Formats (Unicode Standard, section 2.7, "Byte Order Mark"). It quotes U+feff, ZWNBSP, which is the same character that is used for signatures of the UTFs.
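As an illustration only (it is not meant as part of the proposed wording), the ten candidate sequences can be checked with a small decoding sketch. The helper names below are hypothetical, and the code handles only the SQU, SCU, and SD0..SD7 tags and the two simple ranges of the SCSU window offset table; it is not a complete SCSU decoder.

```python
# Sketch: decode the first code unit of a stream that starts in the
# initial SCSU state (single-byte mode). Helper names are hypothetical.

SQU = 0x0E               # Quote Unicode: next two bytes form one big-endian code unit
SCU = 0x0F               # Change to Unicode mode: byte pairs follow, big-endian
SD0, SD7 = 0x18, 0x1F    # Define dynamic window 0..7, followed by an offset byte


def window_offset(x: int) -> int:
    """Map an SDn offset byte to a window start (simple ranges only)."""
    if 0x01 <= x <= 0x67:
        return x * 0x80
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00
    raise ValueError(f"offset byte {x:#04x} is not handled in this sketch")


def first_code_unit(data: bytes) -> int:
    """Return the first UTF-16 code unit encoded by the given initial bytes."""
    tag = data[0]
    if tag in (SQU, SCU):
        # Both tags are followed here by the code unit itself, high byte first.
        return (data[1] << 8) | data[2]
    if SD0 <= tag <= SD7:
        # SDn defines and selects a window; a byte >= 0x80 then indexes into it.
        base = window_offset(data[1])
        if data[2] < 0x80:
            raise ValueError("byte below 0x80 does not address the dynamic window")
        return base + (data[2] - 0x80)
    raise ValueError(f"tag {tag:#04x} is not handled in this sketch")


# The ten sequences from the list above.
CANDIDATES = [bytes([SQU, 0xFE, 0xFF]), bytes([SCU, 0xFE, 0xFF])] + [
    bytes([tag, 0xA5, 0xFF]) for tag in range(SD0, SD7 + 1)
]

for seq in CANDIDATES:
    print(seq.hex(" "), "->", f"U+{first_code_unit(seq):04x}")
```

The offset byte a5 maps to 0xa5 * 0x80 + 0xac00 = 0xfe80, and the data byte ff then selects 0xfe80 + 0x7f = 0xfeff, which is why sequences 3 through 10 also yield U+feff.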
For UTF-16 and UTF-32, the signature also serves as a Byte Order Mark ("BOM") to help distinguish between little-endian and big-endian encodings. For SCSU, this is not necessary because SCSU has a defined byte order.

Every SCSU encoder should write this particular initial byte sequence if a U+feff is encountered as the first character in the stream. Any further occurrence of this character may be encoded in any way possible with SCSU and will always be interpreted as a ZWNBSP.

Note: If the input text starts with a U+feff that is to be interpreted as a ZWNBSP, then an encoder or sending process may prepend another U+feff to the text, which may be safely recognized as an SCSU signature and stripped by a receiving process. Otherwise, the initial ZWNBSP could itself be misinterpreted as a signature and stripped by a receiving process. This is equivalent to sending and receiving text in UTF-16 or UTF-32.

A process reading text from a file or stream could interpret the initial bytes <0e fe ff> as a signature for SCSU and assume the file or stream to be encoded with SCSU. The process or SCSU decoder may or may not strip the initial U+feff character from the resulting text. Any other encoding of an initial U+feff character, and any encoding of a U+feff after the initial character, must be interpreted as a ZWNBSP.

A signature should not be used where a protocol specification, database design, out-of-band information, or similar specifies the encoding.

--------------------------------end proposed TR 6 text

Discussion:

This would be useful in Unicode plain text files and could be used where such signature bytes are already used for other Unicode encodings. For example, newer versions of Windows Notepad (Windows 2000) detect and write such signature bytes for UTF-8, UTF-16LE, and UTF-16BE. Signature bytes are generally used only where there is no out-of-band indication of the encoding and byte order, as in files.
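A receiving process that already checks for the existing Unicode signatures could add the proposed SCSU signature to its table. The following is a minimal sketch under that assumption; the names SIGNATURES and detect_signature are hypothetical and do not come from any existing library.

```python
# Sketch: signature-based detection that includes the proposed SCSU signature.
# Longer signatures are listed first so that UTF-32LE (ff fe 00 00) is tried
# before UTF-16LE (ff fe). Note that ff fe 00 00 could also be UTF-16LE text
# starting with U+0000; this sketch simply prefers the longer match.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),       # proposed: SQU fe ff
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def detect_signature(data: bytes):
    """Return (encoding name, signature length), or (None, 0) if nothing matches."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


name, length = detect_signature(b"\x0e\xfe\xffHello")
print(name, length)
```

Only the exact bytes <0e fe ff> are treated as an SCSU signature here; a stream starting with, say, <0f fe ff> is reported as having no signature, in line with the rule that other encodings of an initial U+feff are an ordinary ZWNBSP.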
In protocols and databases, the encoding can be and typically is specified by out-of-band data (protocol field, database design).

I propose the bytes <0e fe ff> as the signature byte sequence for SCSU. It is the equivalent of quoting the ZWNBSP in the initial single-byte mode (SQU fe ff). The other signatures that have been defined so far for UTFs use this character, too, which provides some consistency. Writing and later stripping such a sequence could be done exactly as it is done now with the set of signatures that are in use today.

Behavior of the encoder:

It must be considered that any Unicode character can be encoded in a number of ways with SCSU, and that SCSU allows non-minimum-length results for any input. However, it can be expected that a ZWNBSP followed by typical non-ideographic text (and not followed by Arabic Presentation Forms characters) will almost always be encoded with a single-quote-unicode command. In this case, "reasonable" results from an encoder that produces compact byte streams, but that has not been modified to guarantee the proposed sequence for U+feff, could be any of the sequences listed in the proposed text above.

Note that the initial state of an SCSU encoder and decoder includes the single-byte mode, and that no predefined static or dynamic window includes U+feff. The window from 0xfe80 to 0xfeff includes most of the Arabic Presentation Forms.

With the documentation of the proposed signature, a few lines of code will suffice to make sure that any encoder will produce the intended result under all circumstances (always use SQU for U+feff as the first character in the stream, or more simply any time a U+feff occurs while in single-byte mode). If an encoder produces a different encoding, like SCU fe ff ... (see above), then the plain text just starts with a ZWNBSP, and auto-detection does not work as intended or is simply not used.

There are other possible signatures for SCSU.
From feedback on the unicore list, shorter sequences like a single <10> (SC0, switch to the already active default window) are viewed as being too short for detection purposes. (Such a sequence would be harder to write [needs special writeSignature() API on the encoder] but easier to read - no stripping of a character).

Sincerely,
markus

PS: The signature byte sequences for the other Unicode encodings are:

UTF-8     ef bb bf
UTF-16BE  fe ff
UTF-16LE  ff fe
UTF-32BE  00 00 fe ff
UTF-32LE  ff fe 00 00

PPS: <0e> in ASCII is the "SO" or "shift out" control; <10> in ASCII is the "DLE" or "data link escape" control.

PPPS: TR 6 is at http://www.unicode.org/unicode/reports/tr6/

Markus Scherer
IBM Cupertino, CA
schererm@us.ibm.com
markus.scherer@jtcsv.com
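The encoder-side rule discussed under "Behavior of the encoder" (always use SQU for a U+feff that is the first character in the stream) might look roughly like the sketch below. The function names are hypothetical, and encode_ascii_only is only a stand-in for a real SCSU encoder: it handles just the characters that pass through unchanged in single-byte mode. The wrapper relies on the fact that SQU quotes a single character without changing the encoder state, so the rest of the text can be encoded from the initial state.

```python
# Sketch: force the proposed signature <0e fe ff> for a leading U+feff.
SQU = 0x0E
SIGNATURE = bytes([SQU, 0xFE, 0xFF])


def encode_ascii_only(text: str) -> bytes:
    """Stand-in SCSU encoder (assumption): pass-through characters only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # In single-byte mode, 0x20..0x7f and NUL, TAB, LF, CR encode as themselves.
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)
        else:
            raise ValueError(f"U+{cp:04x} is not handled by this stand-in encoder")
    return bytes(out)


def encode_with_signature(text: str, encode_rest=encode_ascii_only) -> bytes:
    """Encode text so that an initial U+feff always becomes <0e fe ff>."""
    if text.startswith("\ufeff"):
        # SQU leaves the encoder in its initial state, so the remainder can be
        # handed to an encoder that starts from that state.
        return SIGNATURE + encode_rest(text[1:])
    return encode_rest(text)


print(encode_with_signature("\ufeffHello").hex(" "))
```

A real encoder would more likely apply the same check inside its own main loop; the wrapper form is used here only to keep the example short.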