USV to UTF-8 mapping

From: Peter_Constable@sil.org
Date: Wed Nov 14 2001 - 11:15:06 EST


A week or so ago, I asked for comments on a C++ algorithm for converting
UTF-32 to UTF-8. There were a couple of things pointed out to me that had
to do with the pseudo-code algorithm I provided to the developer. Here's a
revised pseudo-code algorithm:

U is a Unicode scalar value; C1, C2, etc. are byte code units in a UTF-8
sequence; and \ is integer divide.

If U <= U+007F, then
        C1 = U
Else if U+0080 <= U <= U+07FF, then
        C1 = U \ x40 + xC0
        C2 = U mod x40 + x80
Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then
        C1 = U \ x1000 + xE0
        C2 = (U mod x1000) \ x40 + x80
        C3 = U mod x40 + x80
Else if U+10000 <= U <= U+10FFFF, then
        C1 = U \ x40000 + xF0
        C2 = (U mod x40000) \ x1000 + x80
        C3 = (U mod x1000) \ x40 + x80
        C4 = U mod x40 + x80
Else
        Error
End if
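
For anyone who wants to try the mapping directly, here is a rough C++
sketch of the same branches. The function name usv_to_utf8 and its
bool-plus-output-string signature are just illustrative choices, not the
developer's actual code. It assumes the four-byte branch covers U+10000
through U+10FFFF and treats surrogates and anything above U+10FFFF as
errors.

```cpp
#include <cstdint>
#include <string>

// Illustrative helper (hypothetical name and signature).
// Writes the UTF-8 byte sequence for the scalar value u into out and
// returns true; returns false for surrogates (U+D800..U+DFFF) and for
// values above U+10FFFF, leaving out empty.
bool usv_to_utf8(std::uint32_t u, std::string& out) {
    out.clear();
    if (u <= 0x7F) {
        // One byte: the value itself.
        out += static_cast<char>(u);
    } else if (u <= 0x7FF) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(u / 0x40 + 0xC0);
        out += static_cast<char>(u % 0x40 + 0x80);
    } else if (u <= 0xD7FF || (u >= 0xE000 && u <= 0xFFFF)) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(u / 0x1000 + 0xE0);
        out += static_cast<char>((u % 0x1000) / 0x40 + 0x80);
        out += static_cast<char>(u % 0x40 + 0x80);
    } else if (u >= 0x10000 && u <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(u / 0x40000 + 0xF0);
        out += static_cast<char>((u % 0x40000) / 0x1000 + 0x80);
        out += static_cast<char>((u % 0x1000) / 0x40 + 0x80);
        out += static_cast<char>(u % 0x40 + 0x80);
    } else {
        // Surrogate code point or value beyond U+10FFFF.
        return false;
    }
    return true;
}
```

As a spot check, U+20AC should come out as the three bytes E2 82 AC, and
U+D800 should be rejected.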

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


