Re: Unicode, SMS, PDA/cellphones

From: Doug Ewell
Date: Mon May 29 2006 - 12:57:02 CDT

    Theodore H. Smith <delete at elfdata dot com> replied to Cristian

    >> Every time I try to send a SMS message that includes accented
    >> characters for my language (Romanian), I can't stop to blame those
    >> who have established the SMS technical standard, because the fixed
    >> 2-bytes character for Latin is pure waste of space (and money :).
    > BOCU would have been more sensible. It can usually encode codepoints
    > above 256 in one byte per character, and it can represent every code
    > point.

    Actually that's not the full story with BOCU-1: it requires 2 bytes
    not only to encode a Latin character outside of ASCII, but also to
    encode the next ASCII character after it (except space or controls).
    BOCU-1 works better on text that fits within a 128-character block.

    The Romanian translation of the Universal Declaration of Human Rights --
    which is probably not representative of text that would be sent via
    SMS -- yields the following sizes:

    12,841 bytes in UTF-8
    12,454 bytes in SCSU (3% decrease)
    13,498 bytes in BOCU-1 (5% increase)
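
    The percentages above can be re-derived from the byte counts. A small
    sketch, using only the three sizes quoted in this message (the function
    name percent_change is just an illustrative helper):

```python
# Sizes in bytes, as quoted in the message for the Romanian UDHR text.
sizes = {"UTF-8": 12841, "SCSU": 12454, "BOCU-1": 13498}


def percent_change(size, baseline):
    """Signed percentage change of `size` relative to `baseline`."""
    return (size - baseline) / baseline * 100.0


baseline = sizes["UTF-8"]
changes = {name: percent_change(size, baseline) for name, size in sizes.items()}

for name, change in changes.items():
    print(f"{name}: {sizes[name]} bytes ({change:+.1f}% vs UTF-8)")
```

    Rounded to whole percent, this gives the 3% decrease for SCSU and the
    5% increase for BOCU-1 listed above.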

    Cristian can probably supply a more appropriate sample text for

    Additionally, BOCU-1 wasn't available when SMS was developed. And, like
    SCSU or UTF-8, it requires an 8-bit byte, so a message that fits wholly
    within the 7-bit GSM scheme would take 8 bits per character instead of
    7, a 14% increase in size.
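
    The 14% figure is just the ratio of 8 bits to 7. A rough sketch of the
    arithmetic, assuming the usual 140-octet SMS user-data payload (that
    payload size is my assumption here, not something stated above):

```python
# Assumption: a single SMS carries 140 octets of user data.
PAYLOAD_BITS = 140 * 8


def capacity(bits_per_unit):
    """How many fixed-width units of the given size fit in one message."""
    return PAYLOAD_BITS // bits_per_unit


septets = capacity(7)    # 7-bit GSM default alphabet
octets = capacity(8)     # any byte-oriented encoding (UTF-8, SCSU, BOCU-1)
ucs2_units = capacity(16)  # the 16-bit fallback

# Extra space needed when a 7-bit character is stored in an 8-bit byte.
overhead_percent = (8 - 7) / 7 * 100

print(septets, octets, ucs2_units, round(overhead_percent))
```

    Under that assumption, a byte-oriented encoding holds fewer pure-GSM
    characters per message than the 7-bit packing does, but still twice as
    many code units as the 16-bit fallback.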

    In any case, either SCSU or BOCU-1 would have been a dramatic
    improvement for Romanian over simply falling back to 16 bits.
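
    To get a feel for the gap, here is a sketch comparing a fixed 16-bit
    encoding with UTF-8 on one short Romanian sentence. Python has no
    built-in SCSU or BOCU-1 codec, so UTF-8 is used only as a rough
    stand-in for a byte-oriented encoding; the sample sentence is my own
    choice for illustration and is not the text measured above.

```python
# Illustrative sample only; not the text whose sizes are quoted above.
sample = "Toate ființele umane se nasc libere și egale în demnitate și în drepturi."

# Fixed 16 bits per character, no byte-order mark.
utf16_len = len(sample.encode("utf-16-be"))

# Byte-oriented stand-in: 1 byte for ASCII, 2 for the accented letters here.
utf8_len = len(sample.encode("utf-8"))

saving_percent = (utf16_len - utf8_len) / utf16_len * 100

print(f"16-bit: {utf16_len} bytes, UTF-8: {utf8_len} bytes, "
      f"saving: {saving_percent:.0f}%")
```

    Because most letters in Romanian text are plain ASCII, the
    byte-oriented form comes out at not much more than half the size of
    the 16-bit form for this sample.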

    Doug Ewell
    Fullerton, California, USA
