...one of the most highly regarded and expertly designed C++ library projects in the world.
— Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
The tokenize() function is a helper function simplifying the usage of a lexer in a standalone fashion. For instance, you may have a standalone lexer where all of the functional requirements are implemented inside lexer semantic actions. A good example of this is the word_count_lexer described in more detail in the section Lex Quickstart 2 - A better word counter using Spirit.Lex.
template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
      : c(0), w(0), l(0)
      , word("[^ \t\n]+")     // define tokens
      , eol("\n")
      , any(".")
    {
        using boost::spirit::lex::_start;
        using boost::spirit::lex::_end;
        using boost::phoenix::ref;

        // associate tokens with the lexer
        this->self
            =   word  [++ref(w), ref(c) += distance(_start, _end)]
            |   eol   [++ref(c), ++ref(l)]
            |   any   [++ref(c)]
            ;
    }

    std::size_t c, w, l;
    lex::token_def<> word, eol, any;
};
Tokenizing the given input while discarding all generated tokens is a common application of a lexer. For this reason Spirit.Lex exposes an API function, tokenize(), minimizing the code required:
// Read input from the given file
std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

word_count_tokens<lexer_type> word_count_lexer;
std::string::iterator first = str.begin();

// Tokenize all the input, while discarding all generated tokens
bool r = tokenize(first, str.end(), word_count_lexer);
This code is completely equivalent to the more verbose version as shown in the section Lex Quickstart 2 - A better word counter using Spirit.Lex.
The function tokenize() returns either when the end of the input has been reached (in this case the return value will be true), or when the lexer couldn't match any of the token definitions against the input (in this case the return value will be false and the iterator first will point to the first character in the input sequence that could not be matched).
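This makes it straightforward to report where lexing stopped. A minimal sketch, continuing from the snippet above:

if (!r) {
    // 'first' now points to the first character the lexer could not match
    std::string rest(first, str.end());
    std::cout << "Lexical analysis stopped at: \"" << rest << "\"\n";
}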
The prototype of this function is:
template <typename Iterator, typename Lexer>
bool tokenize(Iterator& first, Iterator last, Lexer const& lex
  , typename Lexer::char_type const* initial_state = 0);
where:

first: The beginning of the input sequence to tokenize. The value of this iterator will be updated by the lexer, pointing to the first not-matched character of the input after the function returns.

last: The end of the input sequence to tokenize.

lex: The lexer instance to use for tokenization.

initial_state: This optional parameter can be used to specify the initial lexer state for tokenization (see the sketch below).
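For instance, assuming a lexer instance my_lexer whose definition adds token definitions to a second state named "COMMENT" (both the instance and the state name are hypothetical, used only for illustration), tokenization can be started in that state instead of the default "INITIAL" state:

std::string::iterator it = str.begin();

// start tokenization in the (hypothetical) "COMMENT" lexer state
bool ok = tokenize(it, str.end(), my_lexer, "COMMENT");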
A second overload of the tokenize() function allows specifying an arbitrary function or function object to be called for each of the generated tokens. For some applications this is very useful, as it may avoid the need for lexer semantic actions altogether. For an example of how to use this function, please have a look at word_count_functor.cpp:
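The heart of that example is a counter function object invoked once per matched token. The full version lives in word_count_functor.cpp; a minimal sketch, assuming the token definitions were registered with the ids ID_WORD, ID_EOL, and ID_CHAR (names taken from that example, not defined in this section), might look like this:

struct counter
{
    typedef bool result_type;   // result type required by boost::bind

    // called for each matched token; the counters are bound by reference
    template <typename Token>
    bool operator()(Token const& t, std::size_t& c, std::size_t& w
      , std::size_t& l) const
    {
        switch (t.id()) {
        case ID_WORD:       // a whole word has been matched
            ++w; c += t.value().size();
            break;
        case ID_EOL:        // an end-of-line has been matched
            ++l; ++c;
            break;
        case ID_CHAR:       // any other single character
            ++c;
            break;
        }
        return true;        // always continue tokenization
    }
};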
The main function simply loads the given file into memory (as a std::string), instantiates an instance of the token definition template using the correct iterator type (word_count_tokens<char const*>), and finally calls lex::tokenize, passing an instance of the counter function object. The return value of lex::tokenize() will be true if the whole input sequence has been successfully tokenized, and false otherwise.
int main(int argc, char* argv[])
{
    // these variables are used to count characters, words and lines
    std::size_t c = 0, w = 0, l = 0;

    // read input from the given file
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));

    // create the token definition instance needed to invoke the lexical analyzer
    word_count_tokens<lex::lexertl::lexer<> > word_count_functor;

    // tokenize the given string, the bound functor gets invoked for each of
    // the matched tokens
    char const* first = str.c_str();
    char const* last = &first[str.size()];

    bool r = lex::tokenize(first, last, word_count_functor,
        boost::bind(counter(), _1, boost::ref(c), boost::ref(w), boost::ref(l)));

    // print results
    if (r) {
        std::cout << "lines: " << l << ", words: " << w
                  << ", characters: " << c << "\n";
    }
    else {
        std::string rest(first, last);
        std::cout << "Lexical analysis failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}
Here is the prototype of this tokenize() function overload:
template <typename Iterator, typename Lexer, typename F>
bool tokenize(Iterator& first, Iterator last, Lexer const& lex, F f
  , typename Lexer::char_type const* initial_state = 0);
where:

first: The beginning of the input sequence to tokenize. The value of this iterator will be updated by the lexer, pointing to the first not-matched character of the input after the function returns.

last: The end of the input sequence to tokenize.

lex: The lexer instance to use for tokenization.

f: A function or function object to be called for each matched token. This function is expected to have the prototype bool f(Lexer::token_type). The tokenize() function will return immediately if f returns false (see the sketch following this list).

initial_state: This optional parameter can be used to specify the initial lexer state for tokenization.
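To illustrate the early-exit behaviour, the following sketch stops tokenization after a fixed number of tokens by returning false from the bound functor (the token_limiter name and the limit of 10 are made up for this example; the lexer instance is the word_count_functor from above):

// hypothetical function object stopping tokenization after max_count tokens
struct token_limiter
{
    typedef bool result_type;   // result type required by boost::bind

    template <typename Token>
    bool operator()(Token const&, std::size_t& count, std::size_t max_count) const
    {
        // returning false makes tokenize() return immediately
        return ++count < max_count;
    }
};

std::size_t count = 0;
char const* first = str.c_str();
bool r = lex::tokenize(first, first + str.size(), word_count_functor,
    boost::bind(token_limiter(), _1, boost::ref(count), 10));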