Re: A basic question on encoding Latin characters

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Fri Oct 01 1999 - 12:44:49 EDT


It is still not a problem. XML requires every '<' that could be interpreted as
the start of a tag to be escaped as "&lt;", so if you wanted to use the
combining character sequence, it would have to be written as "&lt;&#x0338;".
(Actually, the second character doesn't need to be escaped if the character set
can express it.)
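
As a rough illustration of the escaping point, here is a minimal Python sketch.
The element name "a" and the variable names are made up for the example, and
it assumes the standard library's xml.sax.saxutils and xml.etree.ElementTree
modules; it is not meant as a statement of how any particular XML tool behaves.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Content to embed: LESS-THAN SIGN followed by COMBINING LONG SOLIDUS OVERLAY.
content = "\u003C\u0338"

# escape() rewrites '<' as '&lt;' and leaves the combining mark as a literal.
escaped = escape(content)

# The same content with the combining mark written as a character reference.
with_ref = "&lt;&#x0338;"

# Hypothetical wrapper element "a", used only so the parser has a document.
parsed_literal = ET.fromstring("<a>" + escaped + "</a>").text
parsed_ref = ET.fromstring("<a>" + with_ref + "</a>").text

# Both spellings should parse back to the same two-character sequence.
same_text = parsed_literal == parsed_ref == content
```

Either spelling keeps a bare '<' out of the markup, which is the only thing the
parser cares about here.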

In detail, here is the relevant data from UnicodeData.txt:

226E;NOT LESS-THAN;Sm;0;ON;003C 0338;...
0338;COMBINING LONG SOLIDUS OVERLAY;Mn;1;...
003C;LESS-THAN SIGN;Sm;0;...

LESS-THAN SIGN is a base character (canonical combining class 0).
COMBINING LONG SOLIDUS OVERLAY is a combining mark (canonical combining class 1).
So in Normalization Form C, the pair 003C 0338 composes to 226E.
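
The same facts can be checked with a short Python sketch using the standard
unicodedata module. The variable names are only illustrative, and the module
uses whatever version of the Unicode data ships with the interpreter, which I
am assuming agrees with the entries quoted above for these three characters.

```python
import unicodedata

LESS_THAN = "\u003C"       # LESS-THAN SIGN
OVERLAY = "\u0338"         # COMBINING LONG SOLIDUS OVERLAY
NOT_LESS_THAN = "\u226E"   # NOT LESS-THAN

# Canonical combining classes, as listed in UnicodeData.txt.
classes = (unicodedata.combining(LESS_THAN), unicodedata.combining(OVERLAY))

# Canonical decomposition mapping of 226E, as a string of hex code points.
decomp = unicodedata.decomposition(NOT_LESS_THAN)

# Normalization Form C composes the pair into the single character.
composed = unicodedata.normalize("NFC", LESS_THAN + OVERLAY)

# A decomposing normalization form goes the other way and exposes a '<'.
decomposed = unicodedata.normalize("NFD", NOT_LESS_THAN)
naive_hit = "<" in decomposed
```

The last two lines show the case raised in the quoted message below: code that
decomposes its input and then scans for '<' alone will find one inside what
was originally 226E.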

If people have any questions as to the types of characters, they can always just
pull up the data files in their browser to look at them directly:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/CompositionExclusions.txt

Mark

Kevin Bracey wrote:

> In message <199910011421.HAA10109@unicode.org>
> "Mark E. Davis" <markdavis@ispchannel.com> wrote:
>
> > This is not a problem, since 226E remains as is in Normalization Form C,
> > which will be the recommended form for XML.
> >
>
> But it might be a problem for an unwitting implementor who runs everything
> through a decomposition engine on input, and is looking only for '<', rather
> than '<'+{non-combining character}.
>
> --
> Kevin Bracey, Senior Software Engineer
> Pace Micro Technology plc Tel: +44 (0) 1223 518566
> 645 Newmarket Road Fax: +44 (0) 1223 518526
> Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/


