Re: Subject: Re: 32'nd bit & UTF-8

From: Arcane Jill (
Date: Mon Jan 24 2005 - 02:41:50 CST

    > -----Original Message-----
    > From: []On
    > Behalf Of Marcin 'Qrczak' Kowalczyk
    > Sent: 21 January 2005 22:49
    > To:
    > Subject: Re: Subject: Re: 32'nd bit & UTF-8
    > Let's assume that I design a programming language, specify that its
    > source files should be encoded in UTF-8, don't mention anything about
    > BOM, implement a compiler ...

    It just so happens that I'm designing a programming language. (A toy, at
    present - purely an intellectual exercise. But you never know. One day...?).
    But I am, at least, implementing a compiler for it.

    But why on Earth would I specify its encoding? Why on Earth would /anyone/
    specify an encoding in a modern (indeed, future) programming language? My
    language is specified in CHARACTERS.

    My in-development parser goes through a decoding phase. The decoding phase
    translates the bytes from the source code file into /characters/. These
    characters are then fed into the lexing phase.
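
    A minimal sketch of that two-phase split, in Python rather than the
    language under discussion. The names decode and lex and the toy token
    classes are illustrative assumptions, not the actual parser; the point
    is only that the lexer is handed characters and never sees bytes.

```python
import re
from typing import Iterator, Tuple

# Toy token classes (an assumption for illustration): identifiers,
# decimal numbers, and single-character operators. In Python 3, \w and \d
# are Unicode-aware for str patterns, so non-ASCII identifiers and
# operators are handled by the same rules as ASCII ones.
TOKEN = re.compile(r"\s*(?:(?P<ident>[^\W\d]\w*)|(?P<number>\d+)|(?P<op>[^\w\s]))")

def decode(raw: bytes, encoding: str) -> str:
    """Decoding phase: bytes in, characters out."""
    return raw.decode(encoding)

def lex(chars: str) -> Iterator[Tuple[str, str]]:
    """Lexing phase: works on characters only, with no notion of encoding."""
    pos = 0
    while pos < len(chars):
        m = TOKEN.match(chars, pos)
        if m is None:  # only trailing whitespace remains
            break
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()

src = "größe ≤ 10"
# The same characters, stored in two different encodings, lex identically.
tokens_utf8 = list(lex(decode(src.encode("utf-8"), "utf-8")))
tokens_utf16 = list(lex(decode(src.encode("utf-16-le"), "utf-16-le")))
```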

    Currently the decoding phase can auto-detect all UTF encodings correctly,
    either with or without a BOM. The lexing phase doesn't care. It's all
    characters by then. Auto-detection is guaranteed to be possible (and correct)
    in my case, because the grammar dictates that the very first thing in the
    source file MUST be a comment. This guarantees that the first two characters
    of the file WILL be non-NUL ASCII (possibly preceded by a BOM), and given this
    restriction, auto-detection of UTFs is accurate given only the first four bytes
    of the source file. The grammar for the rest of the file permits (for example)
    non-ASCII operators, identifiers, etc.
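
    The detection rule described above can be sketched roughly as follows.
    This is Python, not the actual decoder, and the names detect_utf and
    decode_source are hypothetical. It assumes exactly what the paragraph
    states: the first two characters are non-NUL ASCII, optionally preceded
    by a BOM. Under that assumption a BOM, or failing that the positions of
    the zero bytes among the first four, identifies the encoding.

```python
import codecs
from typing import Tuple

# BOMs in test order: UTF-32LE must come before UTF-16LE, because
# FF FE is a prefix of FF FE 00 00. The two cannot be confused here,
# since the character after a UTF-16LE BOM is assumed to be non-NUL.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def detect_utf(head: bytes) -> Tuple[str, int]:
    """Return (codec name, BOM length) from the first four bytes.

    Assumes the first two characters of the file are non-NUL ASCII.
    """
    head = head[:4]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    if len(head) < 4:
        raise ValueError("need at least four bytes to detect the encoding")
    # No BOM: look at where the zero bytes fall.
    if head[0] == 0:
        # 00 00 00 a1 is UTF-32BE; 00 a1 00 a2 is UTF-16BE.
        return ("utf-32-be" if head[1] == 0 else "utf-16-be"), 0
    if head[1] != 0:
        # a1 a2 ... : two non-zero bytes in a row can only be UTF-8.
        return "utf-8", 0
    # a1 00 00 00 is UTF-32LE; a1 00 a2 00 is UTF-16LE.
    return ("utf-32-le" if head[2] == 0 else "utf-16-le"), 0

def decode_source(data: bytes) -> str:
    """Decoding phase: detect the UTF, skip any BOM, return characters."""
    name, bom_len = detect_utf(data[:4])
    return data[bom_len:].decode(name)
```

    Encodings that cannot be detected this way would need a separate code
    path that takes the codec name from outside the file.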

    In the future, it should be possible to permit other (non-auto-detectable)
    encodings, specified in some out-of-band (OOB) way, because, as I said,
    the lexing phase doesn't care.

    Of course, this is all for my own amusement. I've been writing this for a month
    or so now. I didn't bother with lex, flex, yacc, bison, etc., because they
    weren't sufficiently Unicode-aware.

    In other words, my response to your comment above: "Let's assume that I design
    a programming language, specify that its source files should be encoded in
    UTF-8..." is this. Let's not. It's silly.

