RE: Double Byte enabled

From: Murray Sargent (murrays@microsoft.com)
Date: Wed Apr 05 2000 - 21:59:11 EDT


MBCS is a generic term that includes SBCS, DBCS, and character sets that use
more than two bytes per character. In an operational sense UTF-8 is a kind of
MBCS, since in order to deal with it directly (rather than translating it to
UTF-16 or UTF-32) you have to navigate over sequences of 1 to 4 bytes.
(Sequences of 5 and 6 bytes are ruled out by recent standards activities.) A
cool thing about UTF-8 is that you can easily find the start of a character
if you land on a trail byte. But you still have to deal with other problems
of MBCS, such as ensuring that the text cursor (or caret) always points to
the start of a character, and saving for the next read any partial character
sequence that ends an input buffer (if you need to translate to UTF-16 or
UTF-32).
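
As a minimal sketch of the two UTF-8 points above (finding the start of a
character from a trail byte, and holding back a partial sequence at the end
of an input buffer), something like the following Python could be used. The
function names are hypothetical and the code assumes well-formed UTF-8; it
is an illustration, not a validating decoder.

```python
def utf8_char_start(buf, i):
    """Return the index of the first byte of the character containing buf[i].

    Assumption: buf is well-formed UTF-8. Every trail byte has the bit
    pattern 10xxxxxx, so stepping back over such bytes reaches the lead byte.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def utf8_split_incomplete_tail(buf):
    """Split buf into (complete, pending).

    pending is a partial character at the end of the buffer (possibly
    empty); the caller prepends it to the next chunk that is read.
    """
    if not buf:
        return buf, b""
    start = utf8_char_start(buf, len(buf) - 1)
    lead = buf[start]
    if lead >= 0xF0:
        need = 4          # 11110xxx starts a 4-byte sequence
    elif lead >= 0xE0:
        need = 3          # 1110xxxx starts a 3-byte sequence
    elif lead >= 0xC0:
        need = 2          # 110xxxxx starts a 2-byte sequence
    else:
        need = 1          # ASCII, or a stray trail byte left for the decoder
    if len(buf) - start < need:
        return buf[:start], buf[start:]
    return buf, b""
```

A reader loop would call utf8_split_incomplete_tail on each chunk, decode
the complete part, and carry the pending bytes over to the next read.
Python's own incremental decoders in the codecs module do this bookkeeping
internally; the manual version only shows what is involved.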

UTF-16 surrogate pairs have similar considerations, but they are relatively
easy to deal with, especially if your code can already handle multicharacter
sequences such as CR LF and combining-mark sequences.
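
To illustrate how light the surrogate-pair case is, here is a small Python
sketch that keeps a caret from landing between the two halves of a pair.
The helper names are hypothetical, and the text is modelled as a list of
16-bit code units, which is an assumption about how an editor might store
it. The same shape of check would cover CR LF.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def snap_caret(units, pos):
    """Move pos back by one if it falls between a high and a low surrogate.

    pos is a caret position, i.e. an index into units in 0..len(units).
    """
    if (0 < pos < len(units)
            and is_low_surrogate(units[pos])
            and is_high_surrogate(units[pos - 1])):
        return pos - 1
    return pos
```

Only one neighbouring code unit ever has to be inspected, because high and
low surrogates occupy disjoint ranges.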

Again, the thing I'd recommend is Unicode enabling rather than MBCS or DBCS
enabling.

Murray

-----Original Message-----
From: Suzanne Topping [mailto:stopping@rochester.rr.com]
Sent: Wednesday, April 05, 2000 5:59 PM
To: Unicode List
Subject: Re: Double Byte enabled

Thanks for this response, Murray. It pointed out something that made me
uncomfortable with the term but that I had not been thinking about directly:
the differences in processing DBCS versus Unicode characters.

A lot of people think that Unicode -IS- a DBCS, so I didn't really want to
leave it the way it was.

I wonder whether "multibyte enabled" has the same connotations of processing
differences. A few people have suggested this as an acceptable option, but
now I'm trying to think back about whether there are processing methods for
MBCS that are similar to the old DBCS methods.

"Unicode enabled" is probably the clearest term, but I would appreciate
comments on historic use of "multibyte enabled".

----- Original Message -----
From: Murray Sargent <murrays@microsoft.com>
To: 'Suzanne Topping' <stopping@rochester.rr.com>
Cc: Unicode List <unicode@unicode.org>
Sent: Wednesday, April 05, 2000 6:14 PM
Subject: RE: Double Byte enabled

> "Double-byte enabled" is very different from "Unicode enabled". The former
> refers to apps that can navigate, maybe edit, and display DBCS text such as
> Shift-JIS. Typically the ASCII characters in a DBCS character repertoire
> are represented by single bytes and most, if not all, other characters are
> represented by a lead byte followed by a trail byte. So you have a mix of
> single and double byte characters. It gets particularly tricky, since some
> lead bytes can also be trail bytes. Accordingly if you land in the middle
> of the text it can be tricky to figure out where a character boundary is
> (finding it used to be a favorite interview question at Microsoft).
>
> In your document, it would be better to say something about making sure
> applications are Unicode enabled. Double-byte enabling used to be
> desirable, but now it should only be needed for import/export code.
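
One common way to answer the character-boundary question quoted above is to
count backwards over bytes whose values fall in the lead-byte range. A byte
outside that range always ends a character (it is either a single-byte
character or a trail byte), so the run of lead-range bytes after it starts
on a boundary and pairs up from there. The Python sketch below assumes
valid Shift-JIS text and the commonly documented lead-byte ranges 0x81-0x9F
and 0xE0-0xFC; the function names are hypothetical.

```python
def is_sjis_lead(byte):
    """True if byte lies in the Shift-JIS lead-byte ranges (assumed here)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def sjis_char_start(buf, i):
    """Return the index of the first byte of the character containing buf[i].

    Counts the consecutive lead-range bytes immediately before i. An odd
    count means the byte just before i is an unpaired lead byte, so buf[i]
    is its trail byte; an even count means buf[i] starts a character.
    """
    j = i
    while j > 0 and is_sjis_lead(buf[j - 1]):
        j -= 1
    run = i - j
    return i - 1 if run % 2 else i
```

In the worst case the scan walks back to the start of the buffer, which is
the contrast with UTF-8, where at most three bytes need to be examined.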


