> ...one of the most highly regarded and expertly designed C++ library projects in the world.
>
> — Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
Spirit.Lex is very modular, following the general design principle of the Spirit libraries: you never pay for features you don't use. It is nicely integrated with the other parts of Spirit but can nevertheless be used separately to build stand-alone lexical analyzers. The first quick start example describes a stand-alone application: counting characters, words, and lines in a file, very similar to what the well-known Unix command `wc` does (for the full example code see here: word_count_functor.cpp).
The only required `#include` specific to Spirit.Lex follows. It is a wrapper for all necessary definitions to use Spirit.Lex in a stand-alone fashion, on top of the Lexertl library. Additionally, we `#include` two Boost headers to define `boost::bind()` and `boost::ref()`.
```cpp
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/bind/bind.hpp>
#include <boost/ref.hpp>
```
To make all the code below more readable we introduce the following namespace alias.
```cpp
namespace lex = boost::spirit::lex;
```
The most important step while creating a lexer using Spirit.Lex is to define the tokens to be recognized in the input sequence. This is normally done by defining the regular expressions describing the matching character sequences, and optionally their corresponding token ids. Additionally the defined tokens need to be associated with an instance of a lexer object as provided by the library. The following code snippet shows how this can be done using Spirit.Lex.
The template `word_count_tokens` defines three different tokens: `ID_WORD`, `ID_EOL`, and `ID_CHAR`, representing a word (anything except a whitespace or a newline), a newline character, and any other character (`ID_WORD`, `ID_EOL`, and `ID_CHAR` are enum values representing the token ids, but could be anything else convertible to an integer as well). The direct base class of any token definition class needs to be the template `lex::lexer<>`, where the corresponding template parameter (here: `lex::lexertl::lexer<BaseIterator>`) defines which underlying lexer engine has to be used to provide the required state machine functionality. In this example we use the Lexertl based lexer engine as the underlying lexer type.
```cpp
template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
    {
        // define tokens (the regular expression to match and the corresponding
        // token id) and add them to the lexer
        this->self.add
            ("[^ \t\n]+", ID_WORD) // words (anything except ' ', '\t' or '\n')
            ("\n", ID_EOL)         // newline characters
            (".", ID_CHAR)         // anything else is a plain character
        ;
    }
};
```
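The token ids themselves are not defined in this snippet. Since they only need to be values convertible to an integer, a plain enum suffices; a minimal sketch (the full example file defines them along these lines) could look like this:

```cpp
// A minimal sketch of the token id definitions assumed by the snippets in
// this section; any values convertible to an integer would work equally well.
enum token_ids
{
    ID_WORD = 1000,
    ID_EOL,
    ID_CHAR
};
```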
We will use a setup where we want the Spirit.Lex library to invoke a given function after each of the generated tokens is recognized. For this reason we need to implement a functor taking at least the generated token as an argument and returning a boolean value that allows the tokenization process to be stopped. The default token type used in this example carries a token value of the type `boost::iterator_range<BaseIterator>` pointing to the matched range in the underlying input sequence.
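Because the token value is just an iterator range, retrieving the matched text amounts to constructing a string from that range. The following helper is a small illustrative sketch (the name `token_text` is ours, not part of the example), assuming the default token type described above:

```cpp
#include <string>

// Illustrative helper (hypothetical, not part of the original example):
// turn the iterator_range held by a default token into a std::string.
template <typename Token>
std::string token_text(Token const& t)
{
    return std::string(t.value().begin(), t.value().end());
}
```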
In this example the struct `counter` is used as a functor counting the characters, words, and lines in the analyzed input sequence by identifying the matched tokens as passed from the Spirit.Lex library.
```cpp
struct counter
{
    // the function operator gets called for each of the matched tokens
    // c, l, w are references to the counters used to keep track of the numbers
    template <typename Token>
    bool operator()(Token const& t, std::size_t& c, std::size_t& w, std::size_t& l) const
    {
        switch (t.id()) {
        case ID_WORD:       // matched a word
            // since we're using a default token type in this example, every
            // token instance contains an `iterator_range<BaseIterator>` as its
            // token attribute pointing to the matched character sequence in the input
            ++w; c += t.value().size();
            break;
        case ID_EOL:        // matched a newline character
            ++l; ++c;
            break;
        case ID_CHAR:       // matched something else
            ++c;
            break;
        }
        return true;        // always continue to tokenize
    }
};
```
All that is left is to write some boilerplate code tying together the pieces described so far. To simplify this example we call the `lex::tokenize()` function implemented in Spirit.Lex (for a more detailed description of this function see here: FIXME), even though we could have written a loop iterating over the lexer's token iterators [`first`, `last`) as well.
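For illustration, such a manual loop might look like the following sketch. It assumes the `word_count_tokens` definition and the `counter` functor shown above and uses the library's `token_is_valid()` helper to detect lexer errors; treat it as an outline rather than part of the original example:

```cpp
// A sketch of the manual alternative to lex::tokenize(), assuming the
// word_count_tokens and counter definitions shown earlier in this section.
typedef lex::lexertl::lexer<> lexer_type;
typedef word_count_tokens<lexer_type>::iterator_type iterator_type;

word_count_tokens<lexer_type> word_count;   // the lexer instance
std::size_t c = 0, w = 0, l = 0;            // the counters
counter count;                              // the functor

std::string str("Hello world\n");           // sample input
char const* first = str.c_str();
char const* last = &first[str.size()];

iterator_type it = word_count.begin(first, last);
iterator_type end = word_count.end();

// iterate over all tokens; token_is_valid() returns false once the lexer
// fails to match, which ends the loop just as lex::tokenize() would
while (it != end && token_is_valid(*it)) {
    if (!count(*it, c, w, l))   // invoke the functor for each matched token
        break;
    ++it;
}
```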
The `main` function simply loads the given file into memory (as a `std::string`), instantiates an instance of the token definition template using the correct lexer type (`word_count_tokens<lex::lexertl::lexer<> >`, where the default lexer type works on plain `char const*` iterators), and finally calls `lex::tokenize()`, passing an instance of the counter function object. The return value of `lex::tokenize()` will be `true` if the whole input sequence has been successfully tokenized, and `false` otherwise.
```cpp
int main(int argc, char* argv[])
{
    // these variables are used to count characters, words and lines
    std::size_t c = 0, w = 0, l = 0;

    // read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    // create the token definition instance needed to invoke the lexical analyzer
    word_count_tokens<lex::lexertl::lexer<> > word_count_functor;

    // tokenize the given string, the bound functor gets invoked for each of
    // the matched tokens
    using boost::placeholders::_1;
    char const* first = str.c_str();
    char const* last = &first[str.size()];

    bool r = lex::tokenize(first, last, word_count_functor,
        boost::bind(counter(), _1, boost::ref(c), boost::ref(w), boost::ref(l)));

    // print results
    if (r) {
        std::cout << "lines: " << l << ", words: " << w
                  << ", characters: " << c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cout << "Lexical analysis failed\n"
                  << "stopped at: \"" << rest << "\"\n";
    }
    return 0;
}
```
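To make the counting concrete, consider a hand-traced run (our own illustration, not output reproduced from the documentation): for an input file containing just `Hello world` followed by a newline, the two words contribute 10 characters, the space and the newline one character each, and the newline also terminates the single line, so the program prints:

```
lines: 1, words: 2, characters: 12
```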
This example was deliberately chosen to be as similar as possible to the equivalent Flex program (see below), which isn't too different from what has to be written when using Spirit.Lex.
> **Note:** Interestingly enough, performance comparisons of lexical analyzers written using Spirit.Lex with equivalent programs generated by Flex show that both have comparable execution speeds! Generally, thanks to the highly optimized Lexertl library and due to its carefully designed integration with Spirit, the abstraction penalty to be paid for using Spirit.Lex is negligible.
The remaining examples in this tutorial will use more sophisticated features of Spirit.Lex, mainly to allow further simplification of the code to be written, while maintaining the similarity with corresponding features of Flex. Spirit.Lex has been designed to be as similar to Flex as possible, which is why this documentation provides the corresponding Flex code for the shown Spirit.Lex examples almost everywhere. Consequently, here is the Flex code corresponding to the example shown above.
```
%{
    #define ID_WORD 1000
    #define ID_EOL  1001
    #define ID_CHAR 1002
    int c = 0, w = 0, l = 0;
%}
%%
[^ \t\n]+  { return ID_WORD; }
\n         { return ID_EOL; }
.          { return ID_CHAR; }
%%
bool count(int tok)
{
    switch (tok) {
    case ID_WORD: ++w; c += yyleng; break;
    case ID_EOL:  ++l; ++c; break;
    case ID_CHAR: ++c; break;
    default:      return false;
    }
    return true;
}
int main()
{
    int tok = EOF;
    do {
        tok = yylex();
        if (!count(tok))
            break;
    } while (EOF != tok);
    printf("%d %d %d\n", l, w, c);
    return 0;
}
```