...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
New Features:
skip()
primitive for static regexes, which allows you to specify parts of the
input string to ignore during regex matching.
regex_replace()
algorithm interface.
regex_replace()
accepts formatter objects and formatter lambda expressions in addition
to format strings.
Bugs Fixed:
Bugs Fixed:
sub_match<>
constructor copies singular iterator causing debug assert.
New Features:
(?R)
construct
match_flag_type::format_perl
, match_flag_type::format_sed
,
and match_flag_type::format_all
operator+(std::string, sub_match<>)
and variants
tolower()
and toupper()
Bugs Fixed:
~(set='a')
works.
Bugs Fixed:
This is the version that shipped as part of Boost 1.34.
Bugs Fixed:
match_results::position()
works for nested results.
Version 1.0!
The version reviewed for acceptance into Boost. The review began September 8, 2005. Xpressive was accepted into Boost on September 28, 2005.
New Features:
syntax_option_type::ignore_white_space
New Features:
Announcement of xpressive: http://lists.boost.org/Archives/boost/2003/11/56312.php
The following features are planned for xpressive 2.X:
syntax_option_type::collate
[.a.]
[=a=]
syntax_option_type::nosubs
,
and a nosubs()
modifier for static xpressive.
Here are some wish-list features. You or your company should consider hiring me to implement them!
std::locale
.
Since many of xpressive's users are likely to be familiar with the Boost.Regex library, I would be remiss if
I failed to point out some important differences between xpressive and Boost.Regex. In particular:
xpressive::basic_regex<>
is a template on the iterator type, not the character type.
xpressive::basic_regex<>
cannot be constructed directly from a string; rather, you must use basic_regex::compile()
or regex_compiler<>
to build a regex object from a string.
xpressive::basic_regex<>
does not have an imbue()
member function; rather, the imbue()
member function is in the xpressive::regex_compiler<>
factory.
boost::basic_regex<>
has a subset of std::basic_string<>
's
members. xpressive::basic_regex<>
does not. The members lacking are: assign()
, operator[]()
, max_size()
, begin()
, end()
, size()
, compare()
, and operator=(std::basic_string<>)
.
boost::basic_regex<>
but do not exist in xpressive::basic_regex<>
are: set_expression()
,
get_allocator()
,
imbue()
,
getloc()
,
getflags()
,
and str()
.
xpressive::basic_regex<>
does not have a RegexTraits template parameter. Customization of regex
syntax and localization behavior will be controlled by regex_compiler<>
and a custom regex facet for std::locale
.
xpressive::basic_regex<>
and xpressive::match_results<>
do not have an Allocator template parameter. This is by design.
match_not_dot_null
and
match_not_dot_newline
have
moved from the match_flag_type
enum to the syntax_option_type
enum, and they have changed names to not_dot_null
and not_dot_newline
.
syntax_option_type
enumeration values are not supported: escape_in_lists
,
char_classes
, intervals
, limited_ops
,
newline_alt
, bk_plus_qm
, bk_braces
,
bk_parens
, bk_refs
, bk_vbar
,
use_except
, failbit
, literal
,
perlex
, basic
,
extended
, emacs
, awk
,
grep
,egrep
,
sed
, JavaScript
,
JScript
.
match_flag_type
enumeration values are not supported: match_not_bob
,
match_not_eob
, match_perl
, match_posix
,
and match_extra
.
Also, in the current implementation, the regex algorithms in xpressive will not detect pathological behavior and abort by throwing an exception. It is up to you to write efficient patterns that do not behave pathologically.
The performance of xpressive is competitive with Boost.Regex. I have run performance benchmarks comparing static xpressive, dynamic xpressive and Boost.Regex on two platforms: gcc (Cygwin) and Visual C++. The tests include short matches and long searches. For both platforms, xpressive comes off well on short matches and roughly on par with Boost.Regex on long searches.
<disclaimer> As with all benchmarks, the true test is how xpressive performs with your patterns, your input, and your platform, so if performance matters in your application, it's best to run your own tests. </disclaimer>
Below are the results of a performance comparison between:
Test Specifications
hyper-threaded 3GHz Xeon with 1Gb RAM
Windows XP Pro + Cygwin
GNU C++ version 3.4.4 (Cygwin special)
GNU libstdc++ version 3.4.4
1.33+, BOOST_REGEX_USE_CPP_LOCALE, BOOST_REGEX_RECURSIVE
0.9.6a
The following tests evaluate the time taken to match the expression to the input string. For each result, the top number has been normalized relative to the fastest time, so 1.0 is as good as it gets. The bottom number (in parentheses) is the actual time in seconds. The best time has been marked in green.
static xpressive | dynamic xpressive | Boost | Text | Expression |
---|---|---|---|---|
1(8.79e‑07s) | 1.08(9.54e‑07s) | 2.51(2.2e‑06s) | 100- this is a line of ftp response which contains a message string | ^([0-9]+)(\-| |$)(.*)$ |
1.06(1.07e‑06s) | 1(1.01e‑06s) | 4.01(4.06e‑06s) | 1234-5678-1234-456 | ([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4} |
1(1.4e‑06s) | 1.13(1.58e‑06s) | 2.89(4.05e‑06s) | john_maddock@compuserve.com | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1(1.28e‑06s) | 1.16(1.49e‑06s) | 3.07(3.94e‑06s) | foo12@foo.edu | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1(1.22e‑06s) | 1.2(1.46e‑06s) | 3.22(3.93e‑06s) | bob.smith@foo.tv | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1.04(8.64e‑07s) | 1(8.34e‑07s) | 2.5(2.09e‑06s) | EH10 2QQ | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1.11(9.09e‑07s) | 1(8.19e‑07s) | 2.47(2.03e‑06s) | G1 1AA | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1.12(9.38e‑07s) | 1(8.34e‑07s) | 2.5(2.08e‑06s) | SW1 1ZZ | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1(7.9e‑07s) | 1.06(8.34e‑07s) | 2.49(1.96e‑06s) | 4/1/2001 | ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ |
1(8.19e‑07s) | 1.04(8.49e‑07s) | 2.4(1.97e‑06s) | 12/12/2001 | ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ |
1.09(8.95e‑07s) | 1(8.19e‑07s) | 2.4(1.96e‑06s) | 123 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
1.11(8.79e‑07s) | 1(7.9e‑07s) | 2.57(2.03e‑06s) | +3.14159 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
1.09(8.94e‑07s) | 1(8.19e‑07s) | 2.47(2.03e‑06s) | -3.14159 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
The next test measures the time to find all matches in a long English text. The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long. As above, the top number is the normalized time and the bottom number is the actual time. The best time is in green.
static xpressive | dynamic xpressive | Boost | Expression |
---|---|---|---|
1(0.0263s) | 1(0.0263s) | 1.78(0.0469s) | Twain |
1(0.0234s) | 1(0.0234s) | 1.79(0.042s) | Huck[[:alpha:]]+ |
1.84(1.26s) | 2.21(1.51s) | 1(0.687s) | [[:alpha:]]+ing |
1.09(0.192s) | 2(0.351s) | 1(0.176s) | ^[^
]*?Twain |
1.41(0.08s) | 1.21(0.0684s) | 1(0.0566s) | Tom|Sawyer|Huckleberry|Finn |
1.56(0.195s) | 1.12(0.141s) | 1(0.125s) | (Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn) |
Below are the results of a performance comparison between:
Test Specifications
hyper-threaded 3GHz Xeon with 1Gb RAM
Windows XP Pro
Visual C++ .NET 2003 (7.1)
Dinkumware, version 313
1.33+, BOOST_REGEX_USE_CPP_LOCALE, BOOST_REGEX_RECURSIVE
0.9.6a
The following tests evaluate the time taken to match the expression to the input string. For each result, the top number has been normalized relative to the fastest time, so 1.0 is as good as it gets. The bottom number (in parentheses) is the actual time in seconds. The best time has been marked in green.
static xpressive | dynamic xpressive | Boost | Text | Expression |
---|---|---|---|---|
1(3.2e‑007s) | 1.37(4.4e‑007s) | 2.38(7.6e‑007s) | 100- this is a line of ftp response which contains a message string | ^([0-9]+)(\-| |$)(.*)$ |
1(6.4e‑007s) | 1.12(7.15e‑007s) | 1.72(1.1e‑006s) | 1234-5678-1234-456 | ([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4} |
1(9.82e‑007s) | 1.3(1.28e‑006s) | 1.61(1.58e‑006s) | john_maddock@compuserve.com | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1(8.94e‑007s) | 1.3(1.16e‑006s) | 1.7(1.52e‑006s) | foo12@foo.edu | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1(9.09e‑007s) | 1.28(1.16e‑006s) | 1.67(1.52e‑006s) | bob.smith@foo.tv | ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ |
1(3.06e‑007s) | 1.07(3.28e‑007s) | 1.95(5.96e‑007s) | EH10 2QQ | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1(3.13e‑007s) | 1.09(3.42e‑007s) | 1.86(5.81e‑007s) | G1 1AA | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1(3.2e‑007s) | 1.09(3.5e‑007s) | 1.86(5.96e‑007s) | SW1 1ZZ | ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ |
1(2.68e‑007s) | 1.22(3.28e‑007s) | 2(5.36e‑007s) | 4/1/2001 | ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ |
1(2.76e‑007s) | 1.16(3.2e‑007s) | 1.94(5.36e‑007s) | 12/12/2001 | ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ |
1(2.98e‑007s) | 1.03(3.06e‑007s) | 1.85(5.51e‑007s) | 123 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
1(3.2e‑007s) | 1.12(3.58e‑007s) | 1.81(5.81e‑007s) | +3.14159 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
1(3.28e‑007s) | 1.11(3.65e‑007s) | 1.77(5.81e‑007s) | -3.14159 | ^[-+]?[[:digit:]]*\.?[[:digit:]]*$ |
The next test measures the time to find all matches in a long English text. The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long. As above, the top number is the normalized time and the bottom number is the actual time. The best time is in green.
static xpressive | dynamic xpressive | Boost | Expression |
---|---|---|---|
1(0.019s) | 1(0.019s) | 2.98(0.0566s) | Twain |
1(0.0176s) | 1(0.0176s) | 3.17(0.0556s) | Huck[[:alpha:]]+ |
3.62(1.78s) | 3.97(1.95s) | 1(0.492s) | [[:alpha:]]+ing |
2.32(0.344s) | 3.06(0.453s) | 1(0.148s) | ^[^
]*?Twain |
1(0.0576s) | 1.05(0.0606s) | 1.15(0.0664s) | Tom|Sawyer|Huckleberry|Finn |
1.24(0.164s) | 1.44(0.191s) | 1(0.133s) | (Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn) |
In xpressive, regex objects can refer to each other and themselves by value
or by reference. In addition, they ref-count their referenced regexes to
keep them alive. This creates the possibility for cyclic reference counts,
and raises the possibility of memory leaks. xpressive avoids leaks by using
a type called tracking_ptr<>
. This doc describes at a high level
how tracking_ptr<>
works.
Our solution must meet the following design constraints:
To use tracking_ptr<>
,
you must separate your type into a handle and a body. In the case of xpressive,
the handle type is called basic_regex<>
and the body is called regex_impl<>
.
The handle will store a tracking_ptr<>
to the body.
The body type must inherit from enable_reference_tracking<>
. This gives the body the bookkeeping
data structures that tracking_ptr<>
will use. In particular, it gives
the body:
std::set<shared_ptr<body>
> refs_
: collection of bodies to which this body refers, and
std::set<weak_ptr<body>
> deps_
: collection of bodies which refer to this body.
We refer to (1) above as the "references" and (2) as the "dependencies".
It is crucial to the understanding of tracking_ptr<>
to recognize that the set of references
includes both those objects that are referred to directly as well as those
that are referred to indirectly (that is, through another reference). The
same is true for the set of dependencies. In other words, each body holds
a ref-count directly to every other body that it needs.
Why is this important? Because it means that when a body no longer has a handle referring to it, all its references can be released immediately without fear of creating dangling references.
References and dependencies cross-pollinate. Here's how it works:
Consider the following code:
sregex expr; { sregex group = '(' >> by_ref(expr) >> ')'; // (1) sregex fact = +_d | group; // (2) sregex term = fact >> *(('*' >> fact) | ('/' >> fact)); // (3) expr = term >> *(('+' >> term) | ('-' >> term)); // (4) } // (5)
Here is how the references and dependencies propagate, line by line:
Expression |
Effects |
---|---|
1) |
|
2) |
|
3) |
|
4) |
|
5) |
|
This shows how references and dependencies propagate when creating cycles of objects. After line (4), which closes the cycle, every object has a ref-count on every other object, even to itself. So how does this not leak? Read on.
Now that the bodies have their sets of references and dependencies, the
hard part is done. All that remains is to decide when and where to break
the cycle. That is the job of tracking_ptr<>
, which is part of the handle. The
tracking_ptr<>
holds 2 shared_ptr
s. The
first, obviously, is the shared_ptr<body>
-- the reference to the body to which
this handle refers. The other shared_ptr
is used to break the cycle. It ensures that when all the handles to a body
go out of scope, the body's set of references is cleared.
This suggests that more than one handle can refer to a body. In fact,
tracking_ptr<>
gives you copy-on-write semantics -- when you copy a handle, the body is
shared. That makes copies very efficient. Eventually, all the handles to
a particular body go out of scope. When that happens, the ref count to
the body might still be greater than 0 because some other body (or this
body itself!) might be holding a reference to it. However, we are certain
that the cycle-breaker's ref-count goes to 0 because the cycle-breaker
only lives in handles. No more handles, no more cycle-breakers.
What does the cycle-breaker do? Recall that the body has a set of references
of type std::set<shared_ptr<body> >
. Let's call this type "references_type".
The cycle-breaker is a shared_ptr<references_type>
. It uses a custom deleter, which is
defined as follows:
template<typename DerivedT> struct reference_deleter { void operator ()(std::set<shared_ptr<DerivedT> > *refs) const { refs->clear(); } };
The job of to the cycle breaker is to ensure that when the last handle to a body goes away, the body's set of references is cleared. That's it.
We can clearly see how this guarantees that all bodies are cleaned up eventually. Once every handle has gone out of scope, all the bodies' sets of references will be cleared, leaving none with a non-zero ref-count. No leaks, guaranteed.
It's a bit harder to see how this guarantees no dangling references. Imagine that there are 3 bodies: A, B and C. A refers to B which refers to C. Now all the handles to B go out of scope, so B's set of references is cleared. Doesn't this mean that C gets deleted, even though it is being used (indirectly) by A? It doesn't. This situation can never occur because we propagated the references and dependencies above such that A will be holding a reference directly to C in addition to B. When B's set of references is cleared, no bodies get deleted, because they are all still in use by A.
All these std::set
s and shared_ptr
s
and weak_ptr
s! Very inefficient.
I used them because they were handy. I could probably do better.
Also, some objects stick around longer than they need to. Consider:
sregex b; { sregex a = _; b = by_ref(a); b = _; } // a is still alive here!
Due to the way references and dependencies are propagated, the std::set
of references can only grow. It never
shrinks, even when some references are no longer needed. For xpressive
this isn't an issue. The graphs of referential objects generally stay small
and isolated. If someone were to try to use tracking_ptr<>
as a general ref-count-cycle-collection
mechanism, this problem would have to be addressed.