> "...one of the most highly regarded and expertly designed C++ library projects in the world." — Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
The whole purpose of integrating Spirit.Lex into the Spirit library was to add a component allowing the merger of lexical analysis with the parsing process as defined by a Spirit grammar. Spirit parsers read their input from an input sequence accessed by iterators, so naturally we chose iterators as the interface between the lexer and the parser. A second goal of the lexer/parser integration was to enable the use of different lexical analyzer libraries. Iterators seemed to be the right choice from this standpoint as well, mainly because they can serve as an abstraction layer hiding the implementation specifics of the underlying lexer library. During combined lexing and parsing the parser drives the process: each time it advances the iterator, the lexer produces the next token on demand from the underlying character stream.
Another problem related to the integration of the lexical analyzer with the parser was to find a way to blend the defined tokens syntactically with the grammar definition syntax of Spirit. For tokens defined as instances of the token_def<> class, the most natural integration was to allow them to be used directly as parser components. Semantically, these parser components succeed in matching their input whenever the corresponding token type has been matched by the lexer. This quick start example demonstrates this (and more) by counting words again, simply by adding up the numbers inside the semantic actions of a parser (for the full example code see here: word_count.cpp).
This example uses two of the Spirit library components, Spirit.Lex and Spirit.Qi, so we have to #include the corresponding header files. Again, we need to include a couple of header files from the Boost.Phoenix library. This example shows how to attach functors to parser components, which could be done using any C++ technique resulting in a callable object. Using Boost.Phoenix for this task simplifies things and avoids adding dependencies to other libraries (Boost.Phoenix is already in use for Spirit anyway).
```cpp
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_statement.hpp>
#include <boost/spirit/include/phoenix_container.hpp>
```
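As an aside, here is a minimal standalone Spirit.Qi sketch (not part of the word count example) illustrating the point about callable objects: an ordinary function pointer works as a semantic action just as well as a Phoenix expression, as long as its signature is compatible with the component's attribute.

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

// a plain free function used as a semantic action; Qi passes the
// attribute of the attached parser component (here: the parsed int)
void print(int const& i) { std::cout << i << '\n'; }

int main()
{
    std::string input("1 2 3");
    std::string::const_iterator first = input.begin();
    std::string::const_iterator last = input.end();

    // attach the function with operator[]; a functor or a Phoenix
    // expression could be used in exactly the same position
    qi::phrase_parse(first, last, *qi::int_[&print], ascii::space);
    return 0;
}
```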
To make all the code below more readable, we introduce the following namespaces.
```cpp
using namespace boost::spirit;
using namespace boost::spirit::ascii;
```
Compared to the two previous quick start examples (Lex Quickstart 1 - A word counter using Spirit.Lex and Lex Quickstart 2 - A better word counter using Spirit.Lex), the token definition class for this example does not hold any surprises. However, it uses lexer token definition macros (patterns) to simplify the composition of the regular expressions, which will be described in more detail in the section FIXME. Generally, any token definition can be used without modification either from a stand-alone lexical analyzer or in conjunction with a parser.
```cpp
template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
    {
        // define patterns (lexer macros) to be used during token definition
        // below
        this->self.add_pattern
            ("WORD", "[^ \t\n]+")
        ;

        // define tokens and associate them with the lexer
        word = "{WORD}";    // reference the pattern 'WORD' as defined above

        // this lexer will recognize 3 token types: words, newlines, and
        // everything else
        this->self.add
            (word)          // no token id is needed here
            ('\n')          // characters are usable as tokens as well
            (".", IDANY)    // string literals will not be escaped by the library
        ;
    }

    // the token 'word' exposes the matched string as its parser attribute
    lex::token_def<std::string> word;
};
```
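To illustrate the stand-alone claim, here is a hedged sketch (not part of the tutorial) driving the very same token definition class without any parser. It assumes the word_count_tokens<> class from above and the IDANY token id enum from the full example (word_count.cpp); the lexer's begin()/end() members expose the generated token stream as an ordinary iterator range.

```cpp
#include <boost/spirit/include/lex_lexertl.hpp>
#include <iostream>
#include <string>

namespace lex = boost::spirit::lex;

// token id for the catch-all token, as in the full example word_count.cpp
enum token_ids { IDANY = lex::min_token_id + 10 };

// ... word_count_tokens<> exactly as defined above ...

int main()
{
    typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string> > token_type;
    typedef lex::lexertl::lexer<token_type> lexer_type;

    word_count_tokens<lexer_type> word_count;

    std::string str("two words\n");
    char const* first = str.c_str();
    char const* last = &first[str.size()];

    // no parser involved: we simply walk the token stream, which the
    // lexer produces on demand as the iterator is advanced
    lexer_type::iterator_type it = word_count.begin(first, last);
    lexer_type::iterator_type end = word_count.end();
    while (it != end && token_is_valid(*it))
    {
        std::cout << "token id: " << it->id() << '\n';
        ++it;
    }
    return 0;
}
```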
While the integration of lexer and parser in the control flow is achieved by using special iterators wrapping the lexical analyzer, we still need a means of expressing in the grammar what tokens to match and where. The token definition class above uses three different ways of defining a token:

- an instance of the token_def<> class, which is handy whenever you need to specify a token attribute (for more information about lexer related attributes please look here: Lexer Attributes);
- a single character (here '\n'), which represents itself as a token;
- a regular expression string associated with an explicit token id (here IDANY), which makes the token accessible from the grammar level by its id.
Each of these three token definition methods requires a different method of grammar integration. But as you can see from the following table, each of them is straightforward and blends the corresponding token instance naturally with the surrounding Spirit.Qi grammar syntax.
| Token definition | Parser integration |
|---|---|
| `token_def<>` | The `token_def<>` instance is directly usable as a parser component. Parsing of this component succeeds whenever the corresponding token has been matched by the lexer. |
| single character | The single character is directly usable in the grammar. However, under certain circumstances it needs to be wrapped by a `lit()` parser component (as done for `'\n'` in the grammar below). |
| explicit token id | To use an explicit token id in a Spirit.Qi grammar you are required to wrap it with the special `token()` parser component (as done for `IDANY` below). |
The grammar definition below demonstrates the usage of all three token definition types.
```cpp
template <typename Iterator>
struct word_count_grammar : qi::grammar<Iterator>
{
    template <typename TokenDef>
    word_count_grammar(TokenDef const& tok)
      : word_count_grammar::base_type(start)
      , c(0), w(0), l(0)
    {
        using boost::phoenix::ref;
        using boost::phoenix::size;

        start = *(   tok.word          [++ref(w), ref(c) += size(_1)]
                 |   lit('\n')         [++ref(c), ++ref(l)]
                 |   qi::token(IDANY)  [++ref(c)]
                 )
              ;
    }

    std::size_t c, w, l;
    qi::rule<Iterator> start;
};
```
As already described (see: Attributes), the Spirit.Qi parser library builds upon a set of fully attributed parser components. Consequently, all token definitions support this attribute model as well. The most natural way of implementing it was to use the token values as the attributes exposed by the parser component corresponding to the token definition (you can read more about this topic here: About Tokens and Token Values). The example above takes advantage of this full integration of token values as the token_def<>'s parser attributes: the word token definition is declared as a token_def<std::string>, making every instance of a word token carry the string representation of the matched input sequence as its value. The semantic action attached to tok.word receives this string (represented by the _1 placeholder) and uses it to calculate the number of matched characters: ref(c) += size(_1). For instance, matching the word "hello" binds _1 to std::string("hello"), so size(_1) is 5 and the character counter c increases by five.
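Because the token value arrives as a regular std::string attribute, it can be used for more than counting. The following hedged variation (not in the tutorial; it assumes the same includes, namespaces, token definition class, and IDANY enum as above, plus <vector>) collects every matched word into a vector using phoenix::push_back, which is what the phoenix_container.hpp include makes available:

```cpp
#include <vector>

template <typename Iterator>
struct word_list_grammar : qi::grammar<Iterator>
{
    template <typename TokenDef>
    word_list_grammar(TokenDef const& tok)
      : word_list_grammar::base_type(start)
    {
        using boost::phoenix::push_back;
        using boost::phoenix::ref;

        // store each matched word instead of merely counting it
        start = *(   tok.word         [push_back(ref(words), _1)]
                 |   lit('\n')
                 |   qi::token(IDANY)
                 )
              ;
    }

    std::vector<std::string> words;
    qi::rule<Iterator> start;
};
```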
The main function needs to implement a bit more logic now, as we have to initialize and start not only the lexical analysis but the parsing process as well. The three type definitions (typedef statements) simplify the creation of the lexical analyzer and the grammar. After reading the contents of the given file into memory, main() calls the function tokenize_and_parse() to initiate the lexical analysis and parsing processes.
```cpp
int main(int argc, char* argv[])
{
    // define the token type: std::string is available as token attribute type
    typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string>
    > token_type;

    // define the lexer type implementing the state machine
    typedef lex::lexertl::lexer<token_type> lexer_type;

    // define the iterator type exposed by the lexer type
    typedef word_count_tokens<lexer_type>::iterator_type iterator_type;

    // now we use the types defined above to create the lexer and grammar
    // object instances needed to invoke the parsing process
    word_count_tokens<lexer_type> word_count;          // Our lexer
    word_count_grammar<iterator_type> g (word_count);  // Our parser

    // read in the file into memory
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));
    char const* first = str.c_str();
    char const* last = &first[str.size()];

    bool r = lex::tokenize_and_parse(first, last, word_count, g);

    if (r) {
        std::cout << "lines: " << g.l << ", words: " << g.w
                  << ", characters: " << g.c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cerr << "Parsing failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}
```
A few notes on the code above:

- Define the token type to be used: `std::string` is available as the type of the token attribute.
- Define the lexer type to use, implementing the state machine.
- Define the iterator type exposed by the lexer type.
- Parsing is done based on the token stream, not the character stream read from the input. The function `tokenize_and_parse()` wraps the passed iterator range [first, last) by the lexical analyzer and uses its exposed iterators to parse the token stream. (The helper `read_from_file()` comes from the example sources accompanying word_count.cpp.)
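To make the counting concrete: for an input file containing just the line "Hello world\n", the lexer produces a word token for "Hello" (w becomes 1, c increases by 5), an IDANY token for the space (c increases by 1), a word token for "world" (w becomes 2, c increases by 5), and a '\n' token (c increases by 1, l becomes 1), so the program prints lines: 1, words: 2, characters: 12.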