...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
iterator_to
Boost.MultiIndex provides six different index types, which can be classified as shown in the table below. Ordered and sequenced indices, which are the most commonly used, have been explained in the basics section; the rest of index types can be regarded as variations of the former providing some added benefits, functionally or in the area of performance.
type | specifier | |
---|---|---|
key-based | ordered | ordered_unique |
ordered_non_unique |
||
hashed | hashed_unique |
|
hashed_non_unique |
||
non key-based | sequenced |
|
random_access |
Key-based indices, of which ordered indices are the usual example, provide
efficient lookup of elements based on some piece of information called the
element key: there is an extensive suite of
key extraction
utility classes allowing for the specification of such keys. Fast lookup
imposes an internally managed order on these indices that the user is not
allowed to modify; non key-based indices, on the other hand, can be freely
rearranged at the expense of lacking lookup facilities. Sequenced indices,
modeled after the interface of std::list
, are the customary
example of a non key-based index.
Hashed indices constitute a trade-off with respect to ordered indices: if correctly used,
they provide much faster lookup of elements, at the expense of losing sorting
information.
Let us revisit our employee_set
example: suppose a field for storing
the Social Security number is added, with the requisite that lookup by this
number should be as fast as possible. Instead of the usual ordered index, a
hashed index can be resorted to:
#include <boost/multi_index_container.hpp> #include <boost/multi_index/hashed_index.hpp> #include <boost/multi_index/ordered_index.hpp> #include <boost/multi_index/identity.hpp> #include <boost/multi_index/member.hpp> struct employee { int id; std::string name; int ssnumber; employee(int id,const std::string& name,int ssnumber): id(id),name(name),ssnumber(ssnumber){} bool operator<(const employee& e)const{return id<e.id;} }; typedef multi_index_container< employee, indexed_by< // sort by employee::operator< ordered_unique<identity<employee> >, // sort by less<string> on name ordered_non_unique<member<employee,std::string,&employee::name> >, // hashed on ssnumber hashed_unique<member<employee,int,&employee::ssnumber> > > > employee_set
Note that the hashed index does not guarantee any particular ordering of the elements: so, for instance, we cannot efficiently query the employees whose SSN is greater than a given number. Usually, you must consider these restrictions when determining whether a hashed index is preferred over an ordered one.
If you are familiar with non-standard hash_set
s provided
by some compiler vendors, then learning to use hashed indices should be straightforward.
However, the interface of hashed indices is modeled after the specification
for unordered associative containers by the
C++ Standard
Library Technical Report (TR1),
which differs in some significant aspects from existing pre-standard
implementations:
lower_bound
or upper_bound
member functions (unlike Dinkumware's solution.)
Just like ordered indices, hashed indices have unique and non-unique variants, selected
with the specifiers hashed_unique
and hashed_non_unique
,
respectively. In the latter case, elements with equivalent keys are kept together and can
be jointly retrieved by means of the equal_range
member function.
Hashed indices specifiers have two alternative syntaxes, depending on whether tags are provided or not:
(hashed_unique | hashed_non_unique) <[(tag)[,(key extractor)[,(hash function)[,(equality predicate)]]]]> (hashed_unique | hashed_non_unique) <[(key extractor)[,(hash function)[,(equality predicate)]]]>
The key extractor parameter works in exactly the same way as for ordered indices; lookup, insertion, etc., are based on the key returned by the extractor rather than the whole element.
The hash function is the very core of the fast lookup capabilities of this type of
indices: a hasher is just a unary function object
returning an std::size_t
value for any given
key. In general, it is impossible that every key map to a different hash value, for
the space of keys can be greater than the number of permissible hash codes: what
makes for a good hasher is that the probability of a collision (two different
keys with the same hash value) is as close to zero as possible. This is a statistical
property depending on the typical distribution of keys in a given application, so
it is not feasible to have a general-purpose hash function with excellent results
in every possible scenario; the default value for this parameter uses
Boost.Hash, which often provides good
enough results.
The equality predicate is used to determine whether two keys are to be treated
as the same. The default
value std::equal_to<KeyFromValue::result_type>
is in most
cases exactly what is needed, so very rarely will you have to provide
your own predicate. Note that hashed indices require that two
equivalent keys have the same hash value, which
in practice greatly reduces the freedom in choosing an equality predicate.
The lookup interface of hashed indices consists in member functions
find
, count
and equal_range
.
Note that lower_bound
and upper_bound
are not
provided, as there is no intrinsic ordering of keys in this type of indices.
Just as with ordered indices, these member functions take keys
as their search arguments, rather than entire objects. Remember that
ordered indices lookup operations are further augmented to accept
compatible keys, which can roughly be regarded as "subkeys".
For hashed indices, a concept of
compatible key is also
supported, though its usefulness is much more limited: basically,
a compatible key is an object which is entirely equivalent to
a native object of key_type
value, though maybe with
a different internal representation:
// US SSN numbering scheme struct ssn { ssn(int area_no,int group_no,int serial_no): area_no(area_no),group_no(group_no),serial_no(serial_no) {} int to_int()const { return serial_no+10000*group_no+1000000*area_no; } private: int area_no; int group_no; int serial_no; }; // interoperability with SSNs in raw int form struct ssn_equal { bool operator()(const ssn& x,int y)const { return x.to_int()==y; } bool operator()(int x,const ssn& y)const { return x==y.to_int(); } }; struct ssn_hash { std::size_t operator()(const ssn& x)const { return boost::hash<int>()(x.to_int()); } std::size_t operator()(int x)const { return boost::hash<int>()(x); } }; typedef employee_set::nth_index<2>::type employee_set_by_ssn; employee_set es; employee_set_by_ssn& ssn_index=es.get<2>(); ... // find an employee by ssn employee e=*(ssn_index.find(ssn(12,1005,20678),ssn_hash(),ssn_equal()));
In the example, we provided a hash functor ssn_hash
and an
equality predicate ssn_equal
allowing for interoperability
between ssn
objects and the raw int
s stored as
SSN
s in employee_set
.
By far, the most useful application of compatible keys in the context of hashed indices lies in the fact that they allow for seamless usage of composite keys.
Hashed indices have
replace
,
modify
and
modify_key
member functions, with the same functionality as in ordered indices.
Due to the internal constraints imposed by the Boost.MultiIndex framework, hashed indices provide guarantees on iterator validity and exception safety that are actually stronger than required by the C++ Standard Library Technical Report (TR1) with respect to unordered associative containers:
rehash
provides the strong exception safety guarantee
unconditionally. TR1 only warrants it if the internal hash function and
equality predicate objects do not throw. The somewhat surprising consequence
is that a TR1-compliant unordered associative container might erase
elements if an exception is thrown during rehashing!
Random access indices offer the same kind of functionality as
sequenced indices, with the extra advantages
that their iterators are random access, and operator[]
and at()
are provided for accessing
elements based on their position in the index. Let us rewrite a
container used in a previous example,
using random access instead of sequenced indices:
#include <boost/multi_index_container.hpp> #include <boost/multi_index/random_access_index.hpp> #include <boost/multi_index/ordered_index.hpp> #include <boost/multi_index/identity.hpp> // text container with fast lookup based on a random access index typedef multi_index_container< std::string, indexed_by< random_access<>, ordered_non_unique<identity<std::string> > > > text_container; // global text container object text_container tc;
Random access capabilities allow us to efficiently write code like the following:
void print_page(std::size_t page_num) { static const std::size_t words_per_page=50; std::size_t pos0=std::min(tc.size(),page_num*words_per_page); std::size_t pos1=std::min(tc.size(),pos0+words_per_page); // note random access iterators can be added offsets std::copy( tc.begin()+pos0,tc.begin()+pos1, std::ostream_iterator<std::string>(std::cout)); } void print_random_word() { std::cout<<tc[rand()%tc.size()]; }
This added flexibility comes at a price: insertions and deletions at positions
other than the end of the index have linear complexity, whereas these operations
are constant time for sequenced indices. This situation is reminiscent of the
differences in complexity behavior between std::list
and
std::vector
: in the case of random access indices, however,
insertions and deletions never incur any element copying, so the actual
performance of these operations can be acceptable, despite the theoretical
disadvantage with respect to sequenced indices.
Example 10 and example 11 in the examples section put random access indices in practice.
Random access indices are specified with the random_access
construct,
where the tag parameter is, as usual, optional:
random_access<[(tag)]>
All public functions offered by sequenced indices are also provided
by random access indices, so that the latter can act as a drop-in replacement
of the former (save with respect to their complexity bounds, as explained above).
Besides, random access
indices have operator[]
and at()
for positional
access to the elements, and member functions
capacity
and
reserve
that control internal reallocation in a similar manner as the homonym
facilities in std::vector
. Check the
reference for details.
std::vector
It is tempting to see random access indices as an analogue of std::vector
for use in Boost.MultiIndex, but this metaphor can be misleading, as both constructs,
though similar in many respects, show important semantic differences. An
advantage of random access indices is that their iterators, as well as references
to their elements, are stable, that is, they remain valid after any insertions
or deletions. On the other hand, random access indices have several disadvantages with
respect to std::vector
s:
std::vector
s by which elements are stored adjacent to one
another in a single block of memory.
replace
and
modify
.
This precludes the usage of many mutating
algorithms that are nonetheless applicable to std::vector
s.
In general, it is more instructive to regard random access indices as
a variation of sequenced indices providing random access semantics, instead
of insisting on the std::vector
analogy.
By design, index elements are immutable, i.e. iterators only grant
const
access to them, and only through the provided
updating interface (replace
, modify
and
modify_key
) can the elements be modified. This restriction
is set up so that the internal invariants of key-based indices are
not broken (for instance, ascending order traversal in ordered
indices), but induces important limitations in non key-based indices:
typedef multi_index_container< int, indexed_by< random_access<>, ordered_unique<identity<int> > > > container; container c; ... // compiler error: assignment to read-only objects std::random_shuffle(c.begin(),c.end());
What is unfortunate about the previous example is that the operation
performed by std::random_shuffle
is potentially compatible
with multi_index_container
invariants, as its result can be
described by a permutation of the elements in the random access index
with no actual modifications to the elements themselves. There are many
more examples of such compatible algorithms in the C++ standard library,
like for instance all sorting and partition functions.
Sequenced and random access indices provide a means to take advantage
of such external algorithms. In order to introduce this facility we need
a preliminary concept: a view of an index is defined as
some iterator range [first
,last
) over the
elements of the index such that all its elements are contained in the
range exactly once. Continuing with our example, we can apply
std::random_suffle
on an ad hoc view obtained from the
container:
// note that the elements of the view are not copies of the elements // in c, but references to them std::vector<boost::reference_wrapper<const int> > v; BOOST_FOREACH(const int& i,c)v.push_back(boost::cref(i)); // this compiles OK, as reference_wrappers are swappable std::random_shuffle(v.begin(),v.end());
Elements of v
are reference_wrapper
s (from
Boost.Ref) to the actual elements
in the multi-index container. These objects still do not allow modification
of the referenced entities, but they are swappable,
which is the only requirement std::random_shuffle
imposes. Once
we have our desired rearrange stored in the view, we can transfer it to
the container with
c.rearrange(v.begin());
rearrange
accepts an input iterator signaling the beginning
of the external view (and end iterator is not needed since the length of
the view is the same as that of the index) and internally relocates the
elements of the index so that their traversal order matches the view.
Albeit with some circumventions, rearrange
allows for the
application of a varied range of algorithms to non key-based indices.
Please note that the view concept is very general, and in no way tied
to the particular implementation example shown above. For instance, indices
of a multi_index_container
are indeed views with respect to
its non key-based indices:
// rearrange as index #1 (ascending order) c.rearrange(c.get<1>().begin()); // rearrange in descending order c.rearrange(c.get<1>().rbegin());
The only important requirement imposed on views is that they must be
free, i.e. they are not affected by relocations on the base index:
thus, rearrange
does not accept the following:
// undefined behavior: [rbegin(),rend()) is not free with respect // to the base index c.rearrange(c.rbegin());
The view concept is defined in detail in the
reference.
See example 11 in the examples section
for a demonstration of use of rearrange
.
iterator_to
All indices of Boost.MultiIndex provide a member function called iterator_to
which returns an iterator to a given element of the container:
multi_index_container< int, indexed_by<sequenced<> > > c; ... // convoluted way to do c.pop_back() c.erase(c.iterator_to(c.back())); // The following, though similar to the previous code, // does not work: iterator_to accepts a reference to // the element in the container, not a copy. int x=c.back(); c.erase(c.iterator_to(x)); // run-time failure ensues
iterator_to
provides a way to retrieve an iterator to an element
from a pointer to the element, thus making iterators and pointers interchangeable
for the purposes of element pointing (not so for traversal) in many situations.
This notwithstanding, it is not the aim of iterator_to
to
promote the usage of pointers as substitutes for real iterators: the latter are
specifically designed for handling the elements of a container,
and not only benefit from the iterator orientation of container interfaces,
but are also capable of exposing many more programming bugs than raw pointers, both
at compile and run time. iterator_to
is thus meant to be used
in scenarios where access via iterators is not suitable or desireable:
Ordered indices are implemented by means of a data structure known as a red-black tree. Nodes of a red-back tree contain pointers to the parent and the two children nodes, plus a 1-bit field referred to as the node color (hence the name of the structure). Due to alignment issues, on most architectures the color field occupies one entire word, that is, 4 bytes in 32-bit systems and 8 bytes in 64-bit environments. This waste of space can be avoided by embedding the color bit inside one of the node pointers, provided not all the bits of the pointer representation contain useful information: this is precisely the case in many architectures where such nodes are aligned to even addresses, which implies that the least significant bit of the address must always be zero.
Boost.MultiIndex ordered indices implement this type of node compression
whenever applicable. As compared with common implementations of the STL
container std::set
, node compression can
result in a reduction of header overload by 25% (from 16 to 12 bytes on
typical 32-bit architectures, and from 32 to 24 bytes on 64-bit systems).
The impact on performance of this optimization has been checked to be negligible
for moderately sized containers, whereas containers with many elements (hundreds
of thousands or more) perform faster with this optimization, most likely due to
L1 and L2 cache effects.
Node compression can be disabled by globally setting the macro
BOOST_MULTI_INDEX_DISABLE_COMPRESSED_ORDERED_INDEX_NODES
.
Revised July 7th 2013
© Copyright 2003-2013 Joaquín M López Muñoz. Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)