RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 07 2003 - 20:05:57 EST

    Peter Kirk wrote:
    > On 07/12/2003 15:40, Philippe Verdy wrote:
    > > Peter Kirk wrote:
    > > > Of course there is an even simpler way to provide the glue I
    > > > was talking about. W3C simply needs to relax the rule forbidding
    > > > combining marks at the start of a string (and interpret the one
    > > > precomposed character with ">" as base as if it were decomposed,
    > > > as I suggested before), and, remembering that use of NFC is a
    > > > strong recommendation rather than a requirement, not insist on
    > > > NFC in such cases. Then nothing needs to be added to Unicode.
    > >
    > > There's little chance that this will be relaxed by the W3C, because
    > > now HTML is XML (since XHTML is the current recommended standard,
    > > and HTML 4.01 is just kept as is, and all other extensions are being
    > > developed since XHTML 1.1 as modules with DTDs or XML schemas), and
    > > because XML text elements are independent. What you propose would
    > > break the XML containment model (could it be implemented however in
    > > XSLT transforms from XHTML? I doubt it, because the output of XSLT is
    > > also XML, even if it does not always produce an XML syntax, but only
    > > a DOM-parsable tree or InfoSet...)
    >
    > Well, this is W3C's problem. They seem to have backed themselves into a
    > corner which they need to get out of but have no easy way of doing so.
    > Unicode is of course very familiar with this kind of situation e.g. with
    > character name errors, combining class errors, 11000+ redundant Korean
    > characters without decompositions, etc etc. So no doubt it can extend
    > its sympathy; and possibly even offer to help by encoding the kind of
    > character I was suggesting early (perhaps in exchange for some W3C
    > readiness to accept correction of errors in the normalisation data?).
    > But really this is not a Unicode issue.

    I don't agree with you there: going to XML was a good decision for the
    evolution, stabilisation and interoperability of HTML. Extensions are
    now in modules, described by DTDs or schemas, and this offers a good
    framework for interoperability of documents, even if they don't all
    implement the same set of optional modules.

    If you want something better, the way to get it is not by modifying XML
    (HTML will stick with XML now), but by changing the way the DOM tree or
    InfoSet generated from a parsed XHTML document is rendered. With CSS and
    XSLT, you have the tools to define precisely, in a compilable language,
    how this data tree can be transformed to prepare the rendering of
    documents.

    Nothing forbids the standard XHTML modules from defining standard
    transformations related to style, as an XSLT application. This applies
    to the transformation of the plain text contained in the XHTML document
    into another XML document containing all the associated glyphic, layout
    and style information. Some of this information may be used to control
    the behavior of font renderers, enabling or disabling features using the
    augmented data, which now contains more than just plain text.

    Such a stylesheet processor will be able to position diacritics clearly
    above letters, to create Korean syllabic clusters or even Han
    ideographic clusters, or to alter the relative positions of a diacritic
    and its base letter to take differences of style into account (for
    example, if the stylesheet instructs the HTML processor to render the
    dots above "i" with a custom star bitmap or SVG graphic, or in bold
    style from another font...)
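
    As a rough sketch of the kind of transformation meant here (written in
    Python rather than XSLT, purely for illustration), the code below turns
    a plain-text run into a small XML tree in which each base character and
    its combining marks are separate elements, so that a style can be
    attached to the marks alone. The element names "run", "cluster", "base"
    and "mark" and the "style" attribute are hypothetical, not part of XHTML
    or any W3C module, and the clustering rule (any character of general
    category M attaches to the preceding character) is a simplification of
    real grapheme cluster rules.

```python
import unicodedata
import xml.etree.ElementTree as ET


def split_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified rule: a character whose general category starts with "M"
    (Mn, Mc, Me) is attached to the preceding cluster, if there is one.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            # A mark with nothing before it starts its own cluster.
            clusters.append(ch)
    return clusters


def annotate(text, mark_style="color: red"):
    """Build a hypothetical <run> tree with one <cluster> per base character.

    Each cluster holds a <base> element and zero or more <mark> elements;
    only the marks carry the style attribute.
    """
    run = ET.Element("run")
    for cluster in split_clusters(text):
        node = ET.SubElement(run, "cluster")
        ET.SubElement(node, "base").text = cluster[0]
        for mark in cluster[1:]:
            ET.SubElement(node, "mark", style=mark_style).text = mark
    return run


if __name__ == "__main__":
    # "e" + COMBINING ACUTE ACCENT, followed by a plain "a".
    tree = annotate("e\u0301a")
    print(ET.tostring(tree, encoding="unicode"))
```

    A renderer consuming such a tree would still have to reposition each
    mark over its base; the sketch only shows that the style information
    can travel with the mark without splitting the original text run.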

    The initial problem of transcoding Tamil in the presence of markup is
    not a problem for Unicode, or even for HTML: the author has created
    separate runs of text in his document without specifying clearly how
    these separate runs should be rendered in a coherent layout. For both
    Unicode and HTML there is a default layout, the HTML "box model";
    attempts to break out of it require relative positioning (specified in
    CSS) and possibly transformation of the initial text into other text or
    markup (this is a job for XSLT, and a further revision of CSS could
    specify such complex rendering outside the default "box model").
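
    To make the "separate runs" problem concrete, here is a minimal Python
    sketch (an illustration under stated assumptions, not a W3C algorithm)
    that parses a fragment and reports every text run beginning with a
    combining mark, which is the situation created when markup is placed
    between a Tamil consonant and its vowel sign. The sample fragment and
    the function names are made up for the example. The last helper shows
    why the quoted ">" case matters: NFC composes ">" followed by U+0338
    into the single character U+226F.

```python
import unicodedata
import xml.etree.ElementTree as ET


def starts_with_mark(text):
    """True if the text run is non-empty and begins with a combining mark."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


def runs_starting_with_mark(xml_source):
    """List (tag, slot) pairs for text runs that begin with a combining mark.

    slot is "text" for the run right after an element's start tag and
    "tail" for the run right after its end tag (ElementTree's model).
    """
    root = ET.fromstring(xml_source)
    hits = []
    for elem in root.iter():
        if starts_with_mark(elem.text):
            hits.append((elem.tag, "text"))
        if starts_with_mark(elem.tail):
            hits.append((elem.tag, "tail"))
    return hits


def nfc_changes(text):
    """True if NFC normalization would alter the given string."""
    return unicodedata.normalize("NFC", text) != text


# Hypothetical fragment: TAMIL LETTER KA, then VOWEL SIGN O inside a span.
SAMPLE = "<p>\u0b95<span class='red'>\u0bca</span></p>"

if __name__ == "__main__":
    print(runs_starting_with_mark(SAMPLE))
    # ">" + COMBINING LONG SOLIDUS OVERLAY is not stable under NFC.
    print(nfc_changes(">\u0338"))
```

    The check only detects the split; deciding how the two runs should be
    laid out together is still left to the stylesheet.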
