UTC/2000-009


Proposal for a signature byte sequence for SCSU - updated post UTC 81


To the Unicode Technical Committee,

I propose to define and document a signature byte sequence for the Standard
Compression Scheme for Unicode (TR 6).

Proposed text to be added to Unicode Technical Report 6,
for example in the "Notes" section:

--------------------------------begin proposed TR 6 text

Unicode Signature Byte Sequence for SCSU


Depending on the implementation of an SCSU encoder, and on the text that
follows, a leading U+feff character could be encoded as one of these
initial byte sequences (hexadecimal, not showing the following text):

1)  0e fe ff
     (SQU fe ff, Single-byte mode Quote Unicode)
2)  0f fe ff
     (SCU fe ff, Single-byte mode Change to Unicode)
3)  18 a5 ff
     (SD0 a5 ff, Single-byte mode Define dynamic window 0 to 0xfe80)
4)  19 a5 ff
     (SD1 a5 ff, Single-byte mode Define dynamic window 1 to 0xfe80)
5)  1a a5 ff
     (SD2 a5 ff, Single-byte mode Define dynamic window 2 to 0xfe80)
6)  1b a5 ff
     (SD3 a5 ff, Single-byte mode Define dynamic window 3 to 0xfe80)
7)  1c a5 ff
     (SD4 a5 ff, Single-byte mode Define dynamic window 4 to 0xfe80)
8)  1d a5 ff
     (SD5 a5 ff, Single-byte mode Define dynamic window 5 to 0xfe80)
9)  1e a5 ff
     (SD6 a5 ff, Single-byte mode Define dynamic window 6 to 0xfe80)
10) 1f a5 ff
     (SD7 a5 ff, Single-byte mode Define dynamic window 7 to 0xfe80)

It is recommended to use only the byte sequence <0e fe ff> for an initial
U+feff character (0e is the "SQU" tag). This convention will assist
receiving processes that use initial byte sequences to identify a data file
or stream as being encoded in SCSU.

This defines a signature byte sequence similar to the Unicode Signatures for
UCS Transformation Formats (Unicode Standard, section 2.7, "Byte Order
Mark"). It quotes U+feff, ZWNBSP, which is the same character that is used
for signatures of the UTFs.

For UTF-16 and UTF-32, the signature also serves as a Byte Order Mark
("BOM") to help distinguish between little-endian and big-endian encodings.
For SCSU, this is not necessary because SCSU is defined directly as a
sequence of bytes and so has no byte-order variants.

Every SCSU encoder should write this particular initial byte sequence if a
U+feff is encountered as the first character in the stream. Any further
occurrence of this character may be encoded in any way possible with SCSU and
will always be interpreted as a ZWNBSP.

Note: If the input text starts with a U+feff that is to be interpreted as a
ZWNBSP, then an encoder or sending process may prepend another U+feff to the
text; this can be safely recognized as an SCSU signature and stripped by a
receiving process. Otherwise, the initial ZWNBSP could itself be
misinterpreted as a signature and stripped by a receiving process. This is
the same situation as when sending and receiving text in UTF-16 or UTF-32.

A process reading text from a file or stream could interpret the initial
bytes <0e fe ff> as a signature for SCSU and assume the file or stream to be
encoded with SCSU. The process or SCSU decoder may or may not strip the
initial U+feff character from the resulting text.

Any other encoding of an initial U+feff character, and any encoding of a
U+feff after the initial character, must be interpreted as a ZWNBSP.

A signature should not be used where a protocol specification, database
design, or similar out-of-band information specifies the encoding.

--------------------------------end proposed TR 6 text


Discussion:

This would be useful in Unicode plain text files and could be used wherever
such signature bytes are already used for other Unicode encodings. For
example, newer versions of Windows Notepad (Windows 2000) detect and write
such signature bytes for UTF-8, UTF-16LE, and UTF-16BE.

Signature bytes are generally used only where there is no out-of-band
indication of the encoding and byte order, as in files. In protocols and
databases, the encoding can be and is typically specified by out-of-band
data (protocol field, database design).

I propose the bytes <0e fe ff> as the signature byte sequence for SCSU. It
is the equivalent of quoting the ZWNBSP in the initial single-byte mode (SQU
fe ff). The other signatures that were defined so far for UTFs use this
character, too, which provides some consistency. Writing and later stripping
such a sequence could be done exactly as it is done now with the set of
signatures that are in use today.
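
As an illustration of the reading side, here is a minimal sketch in Python.
The function names are my own and are not part of TR 6. It relies on SQU
quoting a single character without changing the decoder state, so the bytes
after the signature can be decoded from the SCSU initial state. Whether to
strip at the byte level like this, or to strip the decoded U+feff from the
resulting text instead, is left to the receiving process.

```python
# Proposed SCSU signature: SQU (0x0e) followed by U+FEFF in big-endian order.
SCSU_SIGNATURE = b"\x0e\xfe\xff"


def has_scsu_signature(data: bytes) -> bool:
    """Return True if the byte stream starts with the proposed signature."""
    return data.startswith(SCSU_SIGNATURE)


def strip_scsu_signature(data: bytes) -> bytes:
    """Drop the signature if present; otherwise return the data unchanged."""
    if has_scsu_signature(data):
        return data[len(SCSU_SIGNATURE):]
    return data
```

For example, strip_scsu_signature(b"\x0e\xfe\xffabc") leaves the bytes for
"abc", while a stream that starts with <0f fe ff> is returned untouched.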

Behavior of the encoder: It must be considered that any Unicode character
can be encoded in a number of ways with SCSU, and that SCSU allows
non-minimum-length results for any input. However, it can be expected that a
ZWNBSP followed by typical non-ideographic text (and not followed by Arabic
Presentation Forms characters) will almost always be encoded with a
single-quote-unicode command.

In this case, "reasonable" results from an encoder that produces compact
byte streams, but has not been modified to guarantee the proposed sequence
for U+feff, could be any of the ones listed in the proposed text above.

Note that the initial state of an SCSU encoder and decoder includes the
single-byte mode, and that no predefined static or dynamic window includes
U+feff. The window from 0xfe80 to 0xfeff includes most of the Arabic
Presentation Forms.
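
The arithmetic behind the <a5 ff> sequences can be checked with a small
sketch in Python. It is not a decoder: it handles only the first character
of a stream in the initial state, and only the two linear ranges of the
TR 6 window offset table (the function names are mine).

```python
def dynamic_window_offset(x: int) -> int:
    """Window offset selected by the argument byte of an SDn tag.

    Only the two linear ranges of the TR 6 offset table are handled here.
    """
    if 0x01 <= x <= 0x67:
        return x * 0x80
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00
    raise ValueError("reserved or fixed-offset byte, not handled in this sketch")


def decode_initial_char(seq: bytes) -> int:
    """Decode the first character of a 3-byte initial sequence."""
    tag = seq[0]
    if tag in (0x0E, 0x0F):
        # SQU / SCU: the next two bytes are one big-endian 16-bit unit.
        return (seq[1] << 8) | seq[2]
    if 0x18 <= tag <= 0x1F:
        # SD0..SD7: define and select a window, then a byte >= 0x80
        # indexes into that window.
        if seq[2] < 0x80:
            raise ValueError("expected a window byte in 0x80..0xff")
        return dynamic_window_offset(seq[1]) + (seq[2] - 0x80)
    raise ValueError("tag not handled in this sketch")
```

With these definitions, dynamic_window_offset(0xa5) is 0xfe80, and each of
the ten sequences listed in the proposed text decodes to U+feff.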

With the documentation of the proposed signature, a few lines of code will
suffice to make sure that any encoder will produce the intended result under
all circumstances (always use SQU for U+feff as the first character in the
stream, or more simply any time a U+feff occurs while in single-byte mode).
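
A sketch of what those few lines might look like, in Python. Here
encode_body stands for whatever existing SCSU encoder is in use; it is a
hypothetical callable that encodes text starting from the SCSU initial
state. Because SQU quotes one character and leaves the encoder state
untouched, the signature can simply be written in front of its output.
The ascii_body helper is only a stand-in for demonstration.

```python
from typing import Callable

SQU = 0x0E
SCSU_SIGNATURE = bytes([SQU, 0xFE, 0xFF])


def encode_with_signature(text: str,
                          encode_body: Callable[[str], bytes]) -> bytes:
    """Force <0e fe ff> for an initial U+FEFF, then encode the rest.

    encode_body is an assumed callable (not a real library API) that
    encodes a string to SCSU starting from the initial state.
    """
    if text.startswith("\ufeff"):
        return SCSU_SIGNATURE + encode_body(text[1:])
    return encode_body(text)


def ascii_body(text: str) -> bytes:
    """Stand-in body encoder, valid only for printable ASCII text,
    which SCSU passes through unchanged in its initial single-byte mode."""
    return text.encode("ascii")
```

For example, encode_with_signature("\ufeffabc", ascii_body) yields the
bytes <0e fe ff 61 62 63>, and text without a leading U+feff is passed to
the body encoder unchanged.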

If an encoder produces a different encoding, like SCU fe ff ... (see above),
then the plain text just starts with a ZWNBSP and auto-detection does not
work as intended or is simply not used.

There are other possible signatures for SCSU. From feedback on the unicore
list, shorter sequences like a single <10> (SC0, switch to the already
active default window) are viewed as being too short for detection purposes.
(Such a sequence would be harder to write [needs special writeSignature()
API on the encoder] but easier to read - no stripping of a character).


Sincerely,

markus


PS: The signature byte sequences for the other Unicode encodings are:
UTF-8       ef bb bf
UTF-16BE    fe ff
UTF-16LE    ff fe
UTF-32BE    00 00 fe ff
UTF-32LE    ff fe 00 00
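
A receiving process could check all of these together with the proposed
SCSU signature. The following Python sketch (names are mine) tests longer
signatures first, because <ff fe 00 00> begins with the UTF-16LE signature
<ff fe>. This is a heuristic: UTF-16LE text whose first character after
the signature is U+0000 would be reported as UTF-32LE.

```python
# Longer signatures come first so that UTF-32LE is not mistaken for UTF-16LE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def sniff_signature(data: bytes):
    """Return the encoding name for a leading signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```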

PPS: <0e> in ASCII is the "SO" or "shift out" control;
     <10> in ASCII is the "DLE" or "data link escape" control.

PPPS: TR 6 is at http://www.unicode.org/unicode/reports/tr6/


Markus Scherer
IBM Cupertino, CA
schererm@us.ibm.com
markus.scherer@jtcsv.com

