Re: What does one do if the encoding is unknown and all you have is a sequence of bytes? from Karl Williamson on 2013-07-19 (Unicode Mail List Archive)

From: Karl Williamson <public_at_khwilliamson.com>
Date: Fri, 19 Jul 2013 12:33:31 -0600

On 07/19/2013 11:51 AM, Costello, Roger L. wrote:
> Hi Folks,
>
> Suppose that these hex bytes:
>
> C3 83 C2 B1
>
> show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte characters:
>
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte characters:
>
> C383 = 쎃
> C2B1 = 슱
>

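That's not how UTF-8 works: the code point is not simply the two bytes
read as one 16-bit number. A 2-byte sequence has the form
110xxxxx 10xxxxxx, and only the x bits make up the code point. So in
UTF-8 it would be:

C3 83 = U+00C3 LATIN CAPITAL LETTER A WITH TILDE
C2 B1 = U+00B1 PLUS-MINUS SIGN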
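
For anyone who wants to check the two mappings above by hand, here is a
minimal Python sketch of the 2-byte case only. The helper name
decode_two_byte is made up for illustration; in practice you would just
call bytes.decode("utf-8").

```python
import unicodedata


def decode_two_byte(lead: int, trail: int) -> int:
    """Decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point.

    Hypothetical helper for illustration; it handles only the 2-byte form.
    """
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the trail byte.
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    if code_point < 0x80:
        # Values below U+0080 must be encoded in a single byte.
        raise ValueError("overlong encoding")
    return code_point


for lead, trail in [(0xC3, 0x83), (0xC2, 0xB1)]:
    cp = decode_two_byte(lead, trail)
    print(f"{lead:02X} {trail:02X} = U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The built-in decoder should agree: bytes.fromhex("C383C2B1").decode("utf-8")
gives the same two characters.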

It's unlikely that text in any other encoding that contains non-ASCII
characters will pass a UTF-8 validity test for inputs longer than just a
few bytes. So you can rule UTF-8 in or out fairly easily. You can also
look for BOMs to detect UTF-16 and UTF-32.
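
As a rough sketch of those two checks in Python (the function names
is_valid_utf8 and sniff_bom are made up for this example, and the
returned strings are Python codec names, not anything standardized):

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would misfire.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

Note that the four bytes from the original question pass the validity
check, while a lone Latin-1 byte such as F1 does not. Pure ASCII input
also passes, since ASCII is a subset of UTF-8.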

After that, there are various heuristics that can be applied, and people
have written tools that attempt to guess encodings. An example from
Perl is Encode::Guess,
http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm
but it requires a list of candidate encodings to try.
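
To illustrate the general idea of trying a list of candidates, here is a
minimal Python sketch. It is not how Encode::Guess works internally; the
function name plausible_encodings is made up, and the only test applied
is "does a strict decode succeed?".

```python
def plausible_encodings(data: bytes, candidates):
    """Return the candidate codec names under which data decodes cleanly.

    Illustrative only: a clean decode does not prove the guess is right.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding; raises on invalid input
        except (UnicodeDecodeError, LookupError):
            # LookupError covers codec names Python does not know.
            continue
        survivors.append(name)
    return survivors


sample = bytes.fromhex("C383C2B1")
print(plausible_encodings(sample, ["ascii", "utf-8", "latin-1"]))
```

For the bytes in the original question this rules out ASCII but keeps
both UTF-8 and Latin-1. A single-byte encoding like Latin-1 accepts any
byte sequence, so this kind of test can only narrow the list, and some
further heuristic is needed to choose among the survivors.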

> Or, perhaps it is some other encoding.
>
> What does one do in such a situation?
>
> /Roger
>
>
Received on Fri Jul 19 2013 - 13:35:58 CDT
