RE: What does one do if the encoding is unknown and all you have is a sequence of bytes? from Whistler, Ken on 2013-07-19 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 19 Jul 2013 18:40:04 +0000

> Suppose that these hex bytes:
>
> C3 83 C2 B1
>
> show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte
> characters:
>
> C3 = Ã
> 83 = the “no break here” character
> C2 = Â
> B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte
> characters:
>
> C383 = 쎃
> C2B1 = 슱

Actually, that would be interpreting it as UTF-16 (big-endian), not as
UTF-8. That can probably be quickly ruled out if the rest of the text is
not obviously in UTF-16.

Interpreted as UTF-8, it would be:

C3 83 --> U+00C3 = Ã
C2 B1 --> U+00B1 = ±

That is more likely than either of the two alternatives you cite.
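
As a rough illustration (not part of the original exchange), the three
readings can be compared by decoding the same four bytes under each
encoding. This is a minimal Python sketch; the variable names are
illustrative assumptions only.

```python
# The four bytes from the question.
data = bytes.fromhex("C3 83 C2 B1")

# ISO 8859-1 (Latin-1): every byte maps to exactly one character,
# so this decode can never fail and always yields four characters.
as_latin1 = data.decode("latin-1")

# UTF-16, big-endian: each pair of bytes forms one 16-bit code unit,
# giving the two Hangul syllables U+C383 and U+C2B1.
as_utf16be = data.decode("utf-16-be")

# UTF-8: each lead byte (C3, C2) plus one continuation byte (83, B1)
# forms one character, giving U+00C3 and U+00B1.
as_utf8 = data.decode("utf-8")

for label, text in [("latin-1", as_latin1),
                    ("utf-16-be", as_utf16be),
                    ("utf-8", as_utf8)]:
    # Show the code points rather than relying on the terminal's rendering.
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

All three decodes succeed here, which is the crux of the problem: the
bytes alone do not say which reading was intended.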

Of course, you also have to consider serial corruptions as a possibility.

It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.

Then the <C3 B1> got misinterpreted as Latin-1 (Ã followed by ±), and
those two characters were re-encoded as UTF-8, giving <C3 83 C2 B1>.
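
The serial corruption described above can be reproduced, and then
reversed, in a few lines. This is only a sketch under the assumption that
exactly one Latin-1 misreading happened; the helper name
undo_double_utf8 is hypothetical and not from any library.

```python
# Reproduce the corruption: start with U+00F1 (ñ).
original = "\u00f1"
utf8_once = original.encode("utf-8")        # bytes C3 B1
misread = utf8_once.decode("latin-1")       # wrongly read as U+00C3 U+00B1
utf8_twice = misread.encode("utf-8")        # bytes C3 83 C2 B1


def undo_double_utf8(data: bytes) -> str:
    """Reverse one round of 'UTF-8 read as Latin-1, re-encoded as UTF-8'.

    Hypothetical helper: raises UnicodeError if the bytes do not fit
    that pattern, which is a useful signal that the guess was wrong.
    """
    once_decoded = data.decode("utf-8")          # U+00C3 U+00B1
    inner_bytes = once_decoded.encode("latin-1")  # back to C3 B1
    return inner_bytes.decode("utf-8")            # U+00F1


repaired = undo_double_utf8(utf8_twice)
print(utf8_twice.hex(" "), "->", f"U+{ord(repaired):04X}")
```

A successful repair does not prove the text was double-encoded; it only
shows that the bytes are consistent with that history.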

--Ken
Received on Fri Jul 19 2013 - 13:42:11 CDT
