L2/00-136

Proposal to Facilitate Random Access of SCSU Strings

by Peter Bishop
April 15, 2000

This paper proposes an enhancement of the SCSU encoding described in version 3.1 of Unicode Technical Report #6.

Overview

Both the UTF-8 and UTF-16 encodings of Unicode have the very important property that they can be parsed in reverse as well as forward. In addition, a random access into the middle of a string using either of these encodings can immediately correctly interpret the characters found there. The SCSU encoding as of version 3.1 of the technical report does not have this property. It is almost impossible to parse this encoding in reverse, but, much worse, a random access into the middle of an SCSU string cannot correctly interpret the bytes that are found there without parsing all of the bytes from the beginning of the string.

One of the stated purposes of the SCSU encoding is to be appropriate for short character strings. If a character string is short, then it is easy for any software that needs to access into the middle of the string to first convert it to UTF-16 or even UTF-8.

On the other hand, if SCSU is successful at its goal of being as compact as a native character set, or an encoding that switches between native character sets, while remaining an encoding of the Unicode character set, then it is possible that SCSU may become popular as a file encoding. Initial analysis suggests that performing file compression on SCSU results in smaller compressed files than when performing file compression on UTF-16 or UTF-8 files (if the file contains numerous non-ASCII characters). Thus it is possible that SCSU may be a very interesting encoding for large files.

If we begin to see large text files encoded in SCSU, then it will become more important for SCSU to support random access into the file rather than requiring the entire file to be parsed from the beginning.
This paper proposes a small enhancement to SCSU, defining some reserved tag codes, which, if used, will enable large SCSU strings to be randomly accessed without parsing from the beginning. This may be an important property to support, both to enhance the usefulness of the SCSU encoding and to increase the number of applications that will use it.

Obstacles to Random Access of SCSU Strings

The main problem with random access into the middle of an SCSU string is that SCSU supports numerous different encoding modes. Once an encoding mode is in effect, it mandates the interpretation of the following bytes. The mode change commands are designed to be interpreted only when parsing forward.

One of the largest challenges arises because there are no fully reserved tag bytes. There are two modes: single-byte mode and Unicode mode. Each of these modes uses different tag bytes, so what are tag bytes in one mode are legitimate characters in the other mode. Designing a proposal for tag bytes that are not legitimate character sequences in any mode was therefore a little challenging. Even the proposal that I am making is more complex than one would wish if this feature were considered more important than it has been, but I think that the small complexities of finding the encoding state in the middle of an SCSU string are a price worth paying for being able to randomly access the string.

The main problem is that when we randomly access an SCSU string, we do not know what mode we are in or what windows have been defined. If we had a tag value that would reset all windows and modes to a known value, then this tag could be placed periodically in the SCSU string. The challenge is to define a single tag value that can be easily searched for, forwards or backwards, but that, when found, can be determined to actually be a tag value rather than a legitimate character sequence in the other mode.

Basic Proposal

My proposal is to define one reserved tag byte value in Unicode mode.
The Unicode mode tag value F2 is reserved for future use. I am proposing that this tag value be called UX, for Unicode mode extension. The following byte will specify a particular Unicode mode extension. I propose that we then define one Unicode mode extension code, F2, which is to be followed by an additional F2 code. This combined code, F2 F2 F2, would be defined to mean, when parsed normally in SCSU, a reset of the SCSU mode to initial conditions: that is, reset all window definitions to their default values and switch out of Unicode mode to window 0.

Unfortunately, the code F2 is a legitimate character encoding in single-byte mode, so the sequence F2 F2 F2 is not sufficient to be sure it is a mode reset tag. In addition, we need a single-byte tag character as well. Fortunately, we can use an existing one: SCU (0F), Switch to Unicode Mode. The sequence 0F F2 F2 F2, if thought to be in Unicode mode, would be interpreted as a legitimate Unicode character followed by the Unicode mode tag F2, specifying Unicode mode extension F2, which is the reset mode tag.

This sequence is almost adequate. The sequence 0F F2 F2 F2 can be ensured to be the reset mode tag as long as it is not preceded by any of the various tag codes that allow arbitrary one-byte or two-byte values following them. Thus, I must add the requirement that the byte immediately preceding the 0F must not be a tag byte, from either single-byte mode or Unicode mode, that is followed by one or two arbitrary bytes, and that the second byte before the 0F must not be a tag byte, in either single-byte mode or Unicode mode, that is followed by two arbitrary bytes. This requirement applies to the bytes immediately preceding the 0F byte (in single-byte mode) of the full reset mode tag sequence in the SCSU encoding, regardless of whether these bytes actually are single-byte tags or Unicode mode tags in the SCSU string.
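The requirement on the two bytes preceding the 0F can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name is_reset_tag_at is hypothetical, and the lists of tag bytes that take one or two argument bytes reflect one reading of the tag tables in UTR #6 (SQ0 to SQ7, SD0 to SD7, UD0 to UD7 taking one byte; SDX, SQU, UQU, UDX taking two). They should be checked against the technical report before being relied upon.

```python
# Assumed tag tables (one reading of UTR #6; verify before use).
ONE_ARG_TAGS = (
    set(range(0x01, 0x09))    # SQ0..SQ7, single-byte mode, 1 byte follows
    | set(range(0x18, 0x20))  # SD0..SD7, single-byte mode, 1 byte follows
    | set(range(0xE8, 0xF0))  # UD0..UD7, Unicode mode, 1 byte follows
)
TWO_ARG_TAGS = {0x0B, 0x0E, 0xF0, 0xF1}  # SDX, SQU, UQU, UDX: 2 bytes follow
ANY_ARG_TAGS = ONE_ARG_TAGS | TWO_ARG_TAGS

# The proposed reset mode tag sequence: SCU followed by F2 F2 F2.
RESET = bytes([0x0F, 0xF2, 0xF2, 0xF2])


def is_reset_tag_at(data: bytes, i: int) -> bool:
    """Return True if data[i:i+4] can be trusted as a reset mode tag sequence.

    The check looks only at raw byte values, in both modes at once, because
    a random access does not know which mode is in effect.
    """
    if i < 0 or data[i:i + 4] != RESET:
        return False
    # The byte immediately before 0F must not take 1 or 2 argument bytes.
    if i >= 1 and data[i - 1] in ANY_ARG_TAGS:
        return False
    # The second byte before 0F must not take 2 argument bytes.
    if i >= 2 and data[i - 2] in TWO_ARG_TAGS:
        return False
    return True
```

For example, the four bytes 0F F2 F2 F2 following ordinary ASCII text pass the check, while the same four bytes immediately after an 0E (SQU) byte do not, since they could be the arguments of that tag.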
Thus, if the above requirement is not met when we wish to insert the reset mode tag sequence to enable a random access to start parsing forward from this point, we simply insert enough UC0 or SCU codes before the 0F F2 F2 F2 sequence to satisfy the requirement. In the worst case, a reset mode tag sequence could therefore require as many as 6 bytes: 0F E0 0F F2 F2 F2 if originally in single-byte mode, or E0 10 0F F2 F2 F2 if originally in Unicode mode. In practice, it might be possible to simply delay inserting the mode reset sequence until the next point in the SCSU stream where only 4 bytes would be needed.

Using the Reset Mode Tag Sequence

When someone wishes to enable random access into a large SCSU file, the reset mode tag sequence can be inserted periodically into the file. Software that wishes to perform a random access into such a file should first parse the file from the beginning to the first reset mode tag sequence. This should be considered validation that the reset mode tag sequence is being used, and it also indicates the approximate distance between reset mode tag sequences. Then the file can be randomly accessed and searched either forward or backward for the reset mode tag sequence, being sure to check that the preceding two bytes could not be mistaken for tags in either mode that take arbitrary byte sequences following them. If the reset mode tag sequence is not found within a span at least 10% longer, in characters, than the distance from the beginning of the file to the first reset mode tag sequence, then the software should assume that reset mode tag sequences have not been used periodically throughout the file and should parse the file from the beginning instead.

If the reset mode tag sequence is inserted into a file every 1,000 characters, then the file can very easily be accessed in a random access fashion. Even inserting the sequence every 2,000 or 4,000 characters still supports high-performance random access into the file.
Large scrolling jumps could easily be performed using random access with reset mode tag sequences every 4,000 characters or even more.
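The random access procedure described in this section might look roughly like the following sketch. Several things here are assumptions rather than part of the proposal: the names valid_reset and find_restart are hypothetical, the tag byte lists are one reading of UTR #6, the search runs backward only, and distances are measured in bytes rather than characters for simplicity (a real implementation would have to allow for two-byte characters in Unicode mode). The parameter first_gap stands for the distance to the first reset mode tag sequence, which the caller is assumed to have found by parsing from the beginning.

```python
# Assumed tag tables (one reading of UTR #6; verify before use).
ONE_ARG = set(range(0x01, 0x09)) | set(range(0x18, 0x20)) | set(range(0xE8, 0xF0))
TWO_ARG = {0x0B, 0x0E, 0xF0, 0xF1}
ANY_ARG = ONE_ARG | TWO_ARG
RESET = bytes([0x0F, 0xF2, 0xF2, 0xF2])


def valid_reset(data: bytes, i: int) -> bool:
    """Check the 4-byte sequence at i and the two bytes that precede it."""
    if i < 0 or data[i:i + 4] != RESET:
        return False
    if i >= 1 and data[i - 1] in ANY_ARG:
        return False
    if i >= 2 and data[i - 2] in TWO_ARG:
        return False
    return True


def find_restart(data: bytes, pos: int, first_gap: int) -> int:
    """Return an offset at or before pos where decoding can start in the
    initial SCSU state, or 0 if the file must be parsed from the beginning.
    """
    # Search window: the observed spacing plus 10% (in bytes, by assumption).
    limit = first_gap + first_gap // 10
    lowest = max(0, pos - limit)
    # Start far enough back that the whole sequence ends at or before pos.
    start = min(pos, len(data)) - len(RESET)
    for i in range(start, lowest - 1, -1):
        if valid_reset(data, i):
            return i + len(RESET)  # decoding resumes just after the tag
    return 0  # no tag nearby: fall back to parsing from the beginning
```

A return value of 0 is unambiguous in this sketch, because any offset found after a reset mode tag sequence is at least 4.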