Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Wed Jun 14 2006 - 16:26:36 CDT

Next message: Mike Ayers: "Re: What is a Jamo, and why is it staring at me?"

Previous message: Theodore H. Smith: "What is a Jamo, and why is it staring at me?"
In reply to: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Asmus Freytag: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

On 14 Jun 2006, at 18:39, Richard Wordingham wrote:

>
>> The re-ordering pass, is not using my "multiple-replaceall"
>> algorithm. It does use the canonical combining classes. A multi-
>> pass approach, while possible... I wouldn't do, it would take too
>> long.
>
> And this was the basis of the claim that you couldn't just treat
> characters as 'bags of bytes'!

But I do treat them as a "bag of bytes" :)

Even my combining-character reordering does that.

I'll copy/paste part of my code below. (There are two other necessary
functions which I won't bother you with.)

This line here:

    StartComb = dict.SearchObj( Data, nil, PrevCombEnd, ElfData.kEnd, FoundLen, FoundObj )

It searches the UTF-8 data, starting at PrevCombEnd, for the first
dictionary "key" that occurs in it. The length of the found item (a
character, in this case) is stored in FoundLen.

The "value" for that key in the dictionary is returned in FoundObj.
This "object" is actually just a 1-byte string, and that byte is the
canonical combining class! Hence the line
"CurrByte = ElfData( FoundObj ).ByteVal".
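
To make the shape of that dictionary concrete, here is a minimal sketch in Python (not the ElfData plugin, whose internals are not shown here). It assumes the keys are the UTF-8 bytes of combining marks and the values are their combining classes, each small enough to fit in one byte. The function name build_ccc_table and the U+0300..U+036F range are illustrative assumptions only.

```python
import unicodedata

def build_ccc_table(first=0x0300, last=0x036F):
    """Map the UTF-8 bytes of each combining mark in [first, last]
    to its canonical combining class (an integer from 1 to 254)."""
    table = {}
    for cp in range(first, last + 1):
        ccc = unicodedata.combining(chr(cp))
        if ccc:  # class 0 means "not reordered", so leave it out
            table[chr(cp).encode("utf-8")] = ccc
    return table

table = build_ccc_table()
# U+0301 COMBINING ACUTE ACCENT is the byte string CC 81, class 230.
acute_class = table[b"\xcc\x81"]
```

Once the table is built, nothing downstream needs to know about code points: both the lookup key and the stored class are plain bytes.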

Anyhow, all this code is byte-oriented, yet it cannot corrupt any
characters, miss them, or cause false hits.

Well, in theory it *could*, if the bytes of one UTF-8 character could
occur inside the bytes of another. But no ASCII byte occurs inside a
multi-byte UTF-8 character, and no character's encoding occurs inside
any other character's encoding, so this false matching never occurs.
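
A small Python sketch of that property, for illustration only: a plain byte search for one encoded character, followed by a check that every hit starts on a character boundary. The helper names byte_hits and on_boundary are made up for this example, and the sample string is arbitrary.

```python
def byte_hits(data, key):
    """Return every byte offset at which key occurs in data."""
    hits = []
    pos = data.find(key)
    while pos != -1:
        hits.append(pos)
        pos = data.find(key, pos + 1)
    return hits

def on_boundary(data, offset):
    """True if offset is the end of data or is not a continuation byte.
    UTF-8 continuation bytes all have the bit pattern 10xxxxxx."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80

text = "na\u00efve caf\u00e9 \u4e2d\u00e9".encode("utf-8")
key = "\u00e9".encode("utf-8")   # e-acute, the bytes C3 A9
hits = byte_hits(text, key)
aligned = all(on_boundary(text, h) for h in hits)
```

The search never looks at code points, yet it cannot land in the middle of a character: the first byte of the key is a lead byte, and lead bytes and continuation bytes use disjoint ranges.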

The whole thing is safe, despite treating everything as raw UTF-8
bytes and not doing any code point decoding or character boundary
checking!

The character boundaries are implied by the keys in the dictionary
themselves. If a key, which is a UTF-8 character, is found, and that
key is 3 bytes long, then the character boundary is after the 3rd
byte. It's all just treated as strings of bytes.
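
As a rough illustration of boundaries falling out of the keys, here is a Python sketch of an anchored lookup. It is not how SearchObj is implemented (that code is not shown here); match_at and the three-entry table are hypothetical.

```python
# Hypothetical table: UTF-8 bytes of a few combining marks -> combining class.
TABLE = {
    b"\xcc\x81": 230,      # U+0301 COMBINING ACUTE ACCENT
    b"\xcc\xa3": 220,      # U+0323 COMBINING DOT BELOW
    b"\xe3\x82\x99": 8,    # U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
}

def match_at(table, data, pos):
    """Return (length, value) for the key that starts at byte offset pos,
    or (0, None) if no key starts there."""
    for length in (1, 2, 3, 4):          # UTF-8 characters are 1 to 4 bytes
        key = data[pos:pos + length]
        if key in table:
            return length, table[key]
    return 0, None

data = "e\u0301\u3099".encode("utf-8")    # 'e', acute, voiced sound mark
first = match_at(TABLE, data, 1)          # the 2-byte key at offset 1
second = match_at(TABLE, data, 1 + first[0])
```

The length of the matched key is the only boundary information the caller needs: the next character starts at pos plus that length.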

Private Function ReorderSub(Dict as ElfDataDictionary, Data as ElfData, Start as integer, fs as FastString) As integer
    dim CurrByte, LastByte, StartComb, PrevCombEnd, FoundLen as Integer
    dim Chars() as ElfData
    dim Scores() as Integer
    dim UnOrdered as Boolean
    dim FoundObj as object

    PrevCombEnd = Start

    // Collect the run of consecutive combining characters that begins at Start.
    do
       StartComb = Dict.SearchObj( Data, nil, PrevCombEnd, ElfData.kEnd, FoundLen, FoundObj )
       // Stop unless a combining character starts exactly where the previous one ended.
       if StartComb = 0 or StartComb <> PrevCombEnd then
          exit
       end if

       // The dictionary value is a 1-byte string holding the combining class.
       CurrByte = ElfData( FoundObj ).ByteVal
       if CurrByte < LastByte then
          UnOrdered = true
       end if
       LastByte = CurrByte

       Chars.Append Data.mid( StartComb, FoundLen )
       Scores.Append CurrByte

       PrevCombEnd = StartComb + FoundLen
    loop

    // Already in order: nothing to rewrite.
    if UnOrdered = false then
       Return PrevCombEnd
    end if

    ReorderArrays Chars, Scores

    fs.AppendSectElfData Data, fs.Length + 1, Start - ( fs.Length + 1 )
    for CurrByte = 0 to UBound( Chars )
       fs.AppendElfData Chars( CurrByte )
    next

    Return fs.Length + 1

End Function
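
For readers without the plugin, the same idea can be sketched in Python. This is an illustration under stated assumptions, not a port: the table CCC and the function reorder_run are hypothetical, the table holds only three marks, and a stable sort stands in for ReorderArrays, whose implementation is not shown above.

```python
# Hypothetical table: UTF-8 bytes of a few combining marks -> combining class.
CCC = {
    b"\xcc\x81": 230,  # U+0301 COMBINING ACUTE ACCENT
    b"\xcc\xa3": 220,  # U+0323 COMBINING DOT BELOW
    b"\xcc\xa7": 202,  # U+0327 COMBINING CEDILLA
}

def reorder_run(data, start):
    """Reorder the run of combining marks that begins at byte offset start.
    Returns (reordered_bytes, end_offset); the run may be empty."""
    marks = []
    pos = start
    while True:
        for length in (2, 3, 4):         # combining marks are never ASCII
            key = data[pos:pos + length]
            if key in CCC:
                marks.append((CCC[key], key))
                pos += length
                break
        else:
            break                        # no key starts here: the run is over
    # sort() is stable, so marks with equal classes keep their order.
    marks.sort(key=lambda m: m[0])
    return b"".join(k for _, k in marks), pos

src = "a\u0301\u0323b".encode("utf-8")    # acute (230) before dot below (220)
fixed, end = reorder_run(src, 1)
result = src[:1] + fixed + src[end:]
```

Here result holds the dot below before the acute accent, which is the canonical order for that pair, and at no point was the data decoded into code points.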

--
http://elfdata.com/plugin/


This archive was generated by hypermail 2.1.5 : Wed Jun 14 2006 - 16:33:32 CDT