UTC/1999-025

From: Markus Scherer
IBM
Cupertino, CA
schererm@us.ibm.com

Subject: Proposal for a signature byte sequence for SCSU

To the Unicode Technical Committee,

I propose to define and document a signature byte sequence for the Standard Compression Scheme for Unicode (TR 6). This would be useful in Unicode plain text files and could be used wherever such signature bytes are already used for other Unicode encodings. For example, newer versions of Windows Notepad (Windows 2000) detect and write such signature bytes for UTF-8, UTF-16LE, and UTF-16BE.

Signature bytes are generally used only where there is no out-of-band indication of the encoding and byte order, as in files. In protocols and databases, the encoding can be, and typically is, specified by out-of-band data (protocol field, database design).

I propose the bytes <0e fe ff> as the signature byte sequence for SCSU. It is the equivalent of quoting the ZWNBSP in the initial single-byte mode (SQU fe ff). The other signatures that have been defined so far use this character, too, which provides some consistency. Writing and later stripping such a sequence could be done exactly as it is done now with the set of signatures that are in use today.

Behavior of the encoder:
It must be considered that any Unicode character can be encoded in a number of ways with SCSU. However, it can be expected that a ZWNBSP followed by typical non-ideographic text will almost always be encoded with a single-quote-unicode command. With the documentation of this signature, a few lines of code will suffice to make sure that any encoder will produce the intended result under all circumstances (always use SQU for U+feff as the first character in the stream). If an encoder produces a different encoding, like SCU fe ff ..., then the plain text just starts with a ZWNBSP and auto-detection does not work as intended.

There are other possible signatures for SCSU.
From feedback on the unicore list, shorter sequences like a single <10> (SC0, switch to the already active default window) are viewed as being too short for detection purposes. (Such a sequence would be harder to write [needs special writeSignature() API on the encoder] but easier to read - no stripping of a character).

Sincerely,
markus

PS: The signature byte sequences for the other Unicode encodings are:

UTF-8      ef bb bf
UTF-16BE   fe ff
UTF-16LE   ff fe
UTF-32BE   00 00 fe ff
UTF-32LE   ff fe 00 00

PPS: <0e> in ASCII is the "SO" or "shift out" control; <10> in ASCII is the "DLE" or "data link escape" control.

PPPS: TR 6 is at http://www.unicode.org/unicode/reports/tr6/
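
Illustration: the writing, detection, and stripping of signatures described above might be sketched as follows in Python. This is only a sketch under stated assumptions, not a reference implementation. The names SIGNATURES, detect_signature, strip_signature, and with_scsu_signature are hypothetical and are not part of TR 6 or of any existing library. The sketch also assumes that the payload handed to with_scsu_signature is an SCSU stream that starts in the initial state; since SQU quotes only the one following character, prepending <0e fe ff> should leave the rest of the stream unaffected.

```python
from typing import Optional, Tuple

SQU = 0x0E  # SCSU "single quote Unicode" command byte

# Proposed SCSU signature: SQU followed by U+FEFF in big-endian order.
SCSU_SIGNATURE = bytes([SQU, 0xFE, 0xFF])

# Longer signatures come first so that UTF-32LE (ff fe 00 00) is not
# mistaken for UTF-16LE (ff fe). Data that is really UTF-16LE text
# beginning with U+FEFF U+0000 would be reported as UTF-32LE here.
SIGNATURES = [
    ("UTF-32BE", b"\x00\x00\xfe\xff"),
    ("UTF-32LE", b"\xff\xfe\x00\x00"),
    ("UTF-8", b"\xef\xbb\xbf"),
    ("SCSU", SCSU_SIGNATURE),
    ("UTF-16BE", b"\xfe\xff"),
    ("UTF-16LE", b"\xff\xfe"),
]


def detect_signature(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, signature length), or (None, 0) if none matches."""
    for name, signature in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def strip_signature(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (encoding name, data without its leading signature)."""
    name, length = detect_signature(data)
    return name, data[length:]


def with_scsu_signature(scsu_payload: bytes) -> bytes:
    """Prepend the proposed signature to an SCSU stream (assumed to start
    in the initial single-byte mode)."""
    return SCSU_SIGNATURE + scsu_payload
```

For example, strip_signature(with_scsu_signature(b"abc")) would report the encoding as "SCSU" and hand back the three payload bytes unchanged. Data without any recognized signature is returned as-is with the name None, which leaves the choice of a default encoding to the caller.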