a parser generator for PHP – *finally*

[Update: oops. I forgot that the server parses files named “whatever.php” when they are uploaded :). The directory at pear.chiaraquartet.net/lemon (clickable link below) now contains both .phps and .php.txt files for downloading. Sorry about that.]

So, it’s been a while since I wrote an entry, but I have not been idle (are you really all that surprised?) I took a hard look at the projects I am involved in, and realized rather quickly that the only thing really holding back several of them is a good parser generator. Here’s the short list of projects that really need a parser generator:

  1. phpDocumentor. We use parsers for *everything*
  2. PHP_Parser. The name says it all
  3. Games_Chess/File_ChessPGN. For this team, we need a good parser for the PGN file format.

Recently, I completed a rather nice lexer for docblocks (documentation comments) and this is now available at http://pecl.php.net/docblock, for those who are interested. About three weeks ago, I looked at the state of the parser generator world out there for PHP, and it is pretty dismal. Antlr3 will theoretically support PHP 5 generation, but it’s impossible to find any source in spite of several fruitless hours of googling.

I finally decided that if this is ever going to happen, I’ll have to get off my butt and do it. So, two weeks ago, I grabbed the source of the Lemon parser generator from its website (conveniently compressed into two files: the generator and its template). Although the 4000+ lines might have scared me off, the code is very clearly written, makes minimal use of pre-processor macros, and actually lends itself very easily to translation into PHP code. I spent about a week doing the actual transcription from C to PHP, removing all the malloc-related crap and converting the C implementation of associative arrays into PHP associative arrays, and finally I had something that works. In the past week, I’ve been working on the template, which was pretty complete, but didn’t quite do everything I needed.

For one example, upon a syntax error, there was no easy way to retrieve a list of expected tokens. This turned out to be a very hard problem, until I broke down and yesterday added a generated array of expected tokens. Coupling this with a little reduce simulator, it is quite easy to grab the complete list of expected tokens based on the current token and the parser stack.
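To make the idea concrete, here is a minimal sketch of a reduce simulator over hand-written tables for a toy grammar. Everything in it (the table names `$shiftable`, `$defaultReduce`, `$rules`, `$goto`, and the function `expectedTokens()`) is made up for illustration and does not match the tables the generator actually emits. It only shows the principle: collect the tokens the top state accepts, then keep pretending to reduce and collect again.

```php
<?php
// Hypothetical hand-written tables for the toy grammar
//   start    ::= EOP interval EOP.
//   interval ::= NUMBER DASH NUMBER.
// These names are illustrative, not the generated parser's real tables.
$shiftable = [          // tokens with an explicit shift in each state
    0 => ['EOP'],
    1 => ['NUMBER'],
    2 => ['DASH'],
    3 => ['NUMBER'],
    4 => [],            // "interval ::= NUMBER DASH NUMBER ." is complete here
    5 => ['EOP'],
];
$defaultReduce = [4 => 1];                              // state => rule number
$rules = [1 => ['lhs' => 'interval', 'length' => 3]];   // rule => lhs and rhs size
$goto = [1 => ['interval' => 5]];                       // state => [lhs => next state]

/**
 * Walk a copy of the parser stack, applying default reductions,
 * and gather every token that could be shifted along the way.
 */
function expectedTokens(array $stack, array $shiftable, array $defaultReduce, array $rules, array $goto): array
{
    $expected = [];
    while (true) {
        $state = end($stack);
        $expected = array_merge($expected, $shiftable[$state] ?? []);
        if (!isset($defaultReduce[$state])) {
            break; // nothing left to reduce in this state
        }
        $rule = $rules[$defaultReduce[$state]];
        // Pop one state per right-hand-side symbol of the reduced rule.
        $stack = array_slice($stack, 0, count($stack) - $rule['length']);
        $exposed = end($stack);
        if ($exposed === false || !isset($goto[$exposed][$rule['lhs']])) {
            break; // reduced to the start symbol, or no goto entry
        }
        $stack[] = $goto[$exposed][$rule['lhs']];
    }
    return array_values(array_unique($expected));
}
```

With these tables, a stack of `[0, 1, 2, 3, 4]` (the parser has just seen `| 1 - 1`) reduces the interval and lands in state 5, so the only expected token is `EOP`.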

Another tricky problem was that if an unexpected token occurred right at a potential end-of-input moment, the parser simply reduced to an accept, and silently restarted parsing. This is a bad thing (TM). So, I implemented a simple “is this token possible in the current state of things?” function that catches these pesky errors.

In the process, I have a fully working PGN file parser that will make its way into a PEAR proposal as soon as I get around to integrating it with Games_Chess to do full validation of the contents of the PGN file. However, the parser works 100% even with some of the weirdest PGN things I could throw at it.

The parser generator works just like Lemon with a few small differences. I added a few line directives:

  1. %include_class – this works like %include, but puts stuff inside the generated parser class.
  2. %declare_class – this can be used to make the parser implement an interface, extend another class, etc. It is used like “%declare_class {extends blah}” (note the lack of a semicolon); the text is inserted between “class ParseyyParser” and “{”.

The parser should be called with a loop similar to this:

[geshi lang=php]
$lexer = new File_ChessPGN_Lexer($contents);
// The parser class name below is assumed; use whatever class your grammar generates.
$parser = new File_ChessPGN_Parser($lexer);
while ($lexer->advance($parser)) {
    $parser->doParse($lexer->token, $lexer->value);
}
$parser->doParse(0, 0); // for end of input
[/geshi]
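For anyone wondering what the lexer side of that loop has to provide, here is a rough sketch. `ToyLexer` and `RecordingParser` are hypothetical stand-ins, not part of the download: the lexer only mirrors the `advance()` / `token` / `value` interface the loop relies on, and the "parser" just records what it is fed. The token numbers are made up too; a real lexer would use the constants from the generated parser class.

```php
<?php
// Hypothetical toy lexer: one character per token, for input like "|1-1|".
class ToyLexer
{
    const NUMBER = 1;
    const DASH = 2;
    const EOP = 3;

    public $token;   // token number of the current token
    public $value;   // text of the current token
    private $input;
    private $pos = 0;

    public function __construct($input)
    {
        $this->input = $input;
    }

    // Returns true while there is another token; $parser is unused here,
    // but is accepted so the calling loop has the same shape.
    public function advance($parser = null)
    {
        if ($this->pos >= strlen($this->input)) {
            return false;
        }
        $char = $this->input[$this->pos++];
        if ($char === '-') {
            $this->token = self::DASH;
        } elseif ($char === '|') {
            $this->token = self::EOP;
        } elseif (ctype_digit($char)) {
            $this->token = self::NUMBER;
        } else {
            throw new InvalidArgumentException("Unexpected character: $char");
        }
        $this->value = $char;
        return true;
    }
}

// Stand-in for a generated parser: it only records each doParse() call.
class RecordingParser
{
    public $calls = [];

    public function doParse($token, $value)
    {
        $this->calls[] = [$token, $value];
    }
}

$lexer = new ToyLexer('|1-1|');
$parser = new RecordingParser();
while ($lexer->advance($parser)) {
    $parser->doParse($lexer->token, $lexer->value);
}
$parser->doParse(0, 0); // token 0 signals end of input
```

The point is that the parser is push-driven: the lexer never calls into the grammar, it just hands over one (token, value) pair at a time, followed by a final 0 for end of input.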

To run it, save Main.php and Lempar.php in the same directory, and take a look at the bottom of Main.php for an example run. You can also just run it from the command line:

php Main.php /path/to/Parser.y

Parser.y is the fully-functioning PGN file parser that will be integrated into File_ChessPGN. I plan to test out the parser generator and split it into multiple files, then potentially propose it as a PEAR package. It should be noted that the original Lemon authors disclaim copyright to the code. This port to PHP, however, will be licensed under either the PHP License 3.01, the new BSD license, or something along those lines.

You can grab the stuff from http://pear.chiaraquartet.net/lemon.

Posted in Computers, PEAR
13 comments on “a parser generator for PHP – *finally*”
  1. Ren says:

    I ported Lemon a while back for an SQL parser, never got it complete enough to release (documentation bleh..). http://ren.dotgeek.org/expr/ http://ren.dotgeek.org/expr/expr.y

  2. Sara Golemon says:

    The files at http://pear.chiaraquartet.net/lemon (with the exception of Parser.y) are downloading as empty documents (probably because they’re being parsed at the server…)

  3. Ed says:

    Hmm… it would be nice if the links actually worked. That being said, this parser generator sounds very interesting. I’d like to see it. BTW, you ought to link to the lemon site: http://www.hwaci.com/sw/lemon/ And… (does it have unit tests?)

  4. boots says:

    Beautiful, thanks Greg! Any comments on its performance speed wise?

  5. Greg Beaver says:

    So far, I’ve been concentrating on its correctness over the performance. The generator itself can be somewhat slow, but I can generate the PHP parser (almost finished) for phpDocumentor in about 2 seconds on my computer. I haven’t yet done any performance tests on the generated parser. The best part about all of this is that it can be optimized, but it will perform at the level of Lemon.

  6. Rune Zedeler says:

    Thanks for doing this work! What is the status of your work on Lemon? It appears that the linked files have not been completely looked through. For instance, on line 2787 of Main.php you have $this->{self::$options[$argv[$i][1]]['arg']}(substr($v, 2)); which is clearly an error because $v is a boolean. I would like to look more into Lemon, but first I want to make sure that you have abandoned your work. It would be really stupid for me to look at the code if you have made lots of updates and just forgot to upload them.

  7. Rune Zedeler says:

    Your code in yy_is_expected_token does not work. The following small grammar should only accept the string "|1-1|". But if you pass it the string "|1-1-1|" then it gives an argument-type warning, and accepts the string. (sorry for br-tokens at end of lines. Seems like a bug in geshi)

    [geshi lang=php ln=n]
    %name Bug_
    %include {
        $parser = new Bug_yyParser;
        $tokens['-'] = Bug_yyParser::DASH;
        $tokens['1'] = Bug_yyParser::NUMBER;
        $tokens['|'] = Bug_yyParser::EOP;
        $str = "|1-1-1|";
        for($i=0; $i< strlen($str); $i++) {
            $token = $tokens[$str{$i}];
            $parser->doParse($token,$val);
        }
    }
    %syntax_error {
        die("Syntax error!\n");
    }
    start ::= EOP interval EOP.
    interval ::= NUMBER DASH NUMBER.
    [/geshi]

  8. Greg Beaver says:

    Hi Rune, The parser has been released as an official PEAR package at http://pear.php.net/PHP_ParserGenerator I’ve been using the code extensively for a number of parsers, including a parser for the PGN (chess) file format, a parser for PHP itself (working on that right now), a parser for docblocks in PHP, and there are others who have been using it as well for parsing lots of things. Try that package out and see if it works for you.

  9. Rune Zedeler says:

    Thanks, this looks a lot better! :+) I’ll email you some more questions.

  10. Search Lyrics says:

    It isn’t so easy. I need a PHP parser script too, and I’m trying to understand… Thx, admin! :)

  11. Erigami says:

    I’m glad to see a parser generator written in PHP. Have you considered writing a tutorial, with a full lexer/parser so that lemon n00bs could start using this package? A link to a working example would be almost as helpful.

  12. Mirko Adari says:

    What lexer and parser would you recommend for a template engine? Only thing that you could count on is default php installation. I know that PRADO and Smarty 3 are implementing house-made, but for performance and specialization reasons I would look further.

  13. Nicholas Maietta says:

    I would seriously recommend simple_dom_parser found at Sourceforge.net. We are now using this in our php website framework/cms solution. This solution has allowed us to modify templates on the fly. Our templates are XHTML only with no PHP, unlike wordpress, joomla and drupal templates. Add the Commnetivity php framework to your webserver "stack" and it provides a layer that handles everything from templating (using simple dom parser) to security and can drop dynamic scripts into divs. (We love that feature). The simple dom parser has a very simple Object Oriented approach that we like a lot. It made integration into our system a breeze and we suspect that it can do the same for your projects too.
