Trees, roots and leaves #1 @ Software @ Cunobaros

Comments? Send me an e-mail, read what others have written, or say what you think below.

Trees, roots and leaves #1

Trees are very useful; they can help with sorting and organisation of data, parsing and problem solving. They can also provide the over-worked programmer with shade and a sturdy support for the hammock. With this in mind, I was quite surprised when I couldn't see any when I went looking recently. True to the proverb, I saw a forest. Tries, binary trees, cartesian trees, search trees... fine, but I needed a tree, and I couldn't find any.

Ah, well, I thought, I'll just have to write my own. How hard can it be? As it turns out, this was not an entirely trivial exercise, which is why I decided to write an article about it. I got a bit carried away, so it actually became three articles, roughly covering design, implementation and refactoring.

Not big and not clever

Most examples of trees I have seen have been, in one way or another, quite explicit about the difference between trees, branches and leaves. This approach has two significant drawbacks: it prohibits the use of trees and branches interchangeably in generic functions, like sorting and searching, and it prohibits the promotion from leaf to branch.

When would you want to do the latter? Well, in HTML it is quite common to see things like important . If we parsed this in a way that had each HTML element, or tag, as a node in our tree, this would be an leaf with < important > as content. But if we wanted even more emphasis, and put important in our tree, then the element would be a branch with a element as child.

I also found that many of the trees out there have either very specialised or very big interfaces. They tend to assume a lot about how and for what they are going to be used. While I do the same, to some extent, I have made an effort to make it as simple and generic as possible.

The following requirements summarise my needs:

Data indifference, i.e. generic and able to hold different data types in different instances.
Small, simple interface, with as high exception safety as possible.
Whole-part indifference, i.e. offering the ability to treat roots, branches and leaves alike.
Composable and decomposable, i.e. it should be easy, fast and cheap to merge and break up.
Unsorted but sortable, i.e. a client should be able to sort a tree, but the tree should not make any assumptions.
If possible, conform to STL container requirements.
Faithful, i.e. with a strict control of relationships.

These requirements affect the design in a couple of surprising ways. The first is a no-brainer: make a template class. I already knew that, but it's always a good idea to put down the requirements you take for granted - that makes it easier to question them later, should you need to. This also goes for the second, which is a requirement that should be in place for almost any class you'd care to design - the good old KISS principle, here in the alternative meaning <Keep It Simple and Safe>.

The third and fourth tells me I am not actually looking for a tree, but a tree node. Whether it is a root, branch or leaf is only determined by its relationships: a root has no parent, a leaf has no children, but a branch has both. If they are all of the same class, and can be treated uniformly, it helps a lot with the fourth requirement while also making the interface smaller.

The fifth asks for iterator support, to allow the use of the standard sorting algorithms. How to sort the tree is not my worry; as long as you can go through a list of branches with a common parent and swap them around, the client can implement sorting. The simplest way of achieving this is to make it appear like one of the standard containers, as wished for in the sixth requirement.

While full conformity with the STL containers is nice, as it makes the class easier to use, it's only an absolute requirement for library writers. For the rest of us, it's good enough if it does the job in a reliable and efficient manner. For this reason, I will do this backwards, i.e. start making a usable class, while keeping the standard in mind, and refactor it once I have it working. Normally, the arguments for this would be a lot weaker - one should always strive for standardised and familiar interfaces where applicable - but in this instance, I already know that it is impossible to make this fully conformant. But I'll tell that story later.

Finally, the seventh prohibits assignment and copy construction. This was a bit of a surprise, but it makes sense if you think about it. When I copy a leaf or branch - do I give the copy the same parent as the original? If I do, the copying of one node will alter another, since the parent must be notified it has a new child. If I don't, the copy will not be true, and I cannot do the same thing with the copy as I can with the original - in this case accessing the parent - which is hardly what you'd expect of a copy. It would definitely not be a faithful copy.

Without knowing how it's going to be used in every instance, I can't in good conscience decide which path to follow. Either way, the user might be surprised by the result of the assignment, which is rarely a good thing. Rather than baffle, I decided to provide a copy function that copied everything except the parent.

To be? No, it's not to be

The first draft looked something like this:

template <typename T>
class tree_node
{
public:
  ...
  tree_node<T> & front();
  ...

Right, I'll stop there because I have already made hamburgers of two holy cows, and I can see the friendly neighbourhood mob approaching with their quaint pitchforks and torches. I hope they're bringing something to drink, too.

The first is seen in the return type of front(), which is a node, rather than the template type which is the norm in the standard template library. When a client calls front(), the expected return is a reference to the first element held - that's how deque, list and vector works. Well, that's what you get here as well.

Unlike those containers, however, the tree_node is not one-dimensional. In a list, for instance, you get hold of the first value and use it, and what you use it for is wholly dependent on the type of element you have in your list. This would only make sense in a tree_node if it is a leaf; if it is a branch or root, you might want it for its value or you might want it for its children. Rather than try to provide two different idioms, I stick with one.

Remember the third requirement? Only by admitting that a tree_node contains tree_nodes can I make it whole-part indifferent. Everybody knows that a list contains structs holding not only data, but also pointers forwards and backwards, but this is an implementation detail and can be ignored. In this case, however, it's a design feature, not an implementation detail, and as such it should be visible, and should generate a compilation error if used wrong:

tree_node<string> tree("Hello");
string s = tree.front(); // Error: no conversion from tree_node<string> to string

The second holy cow on the barbecue is the composite pattern, in which, as we all know, the composite class is derived from the stand-alone class.[1] So why don't I follow that pattern?

First of all, patterns are only useful idioms - you are not required to follow them when they don't make sense. Secondly, the composite pattern is not this generic; like all good patterns it is quite specific when it comes to circumstances it is applicable. In this case, you write both the base and composite classes, so you know what you are doing. That luxury is not available when inheriting from an unknown type. Thirdly, in the words of Herb Sutter: "Only use public inheritance to model true IS-A, as per the Liskov Substitution Principle (LSP)".[2]

You cannot argue that a tree to store Widgets is-a Widget, as it has no knowledge of what a Widget is. Instead, you end up with a class with two distinct and unrelated purposes. When it comes to class purposes, there can be only one. Okay, so nobody will come and cut your head off with a sword if you write a class with two or more different purposes, but I am sure I’m not the only one who have wished for that when coming across a piece of old code to maintain.

Short of actual beheading, I know there is no way to stop some people from ignoring issues of style and try to resurrect both cows by re-writing the class declaration like this:

template 
class tree_node : public T
{
    ...
    tree_node & front();
    ...

Neat, or what? This follows the pattern, and it brings tree_node::front() back on track, so that it returns a reference to the template class. Well, it returns a reference to a class derived from it, which should be close enough. Shouldn't it? Well, as it turns out, it isn’t.

list<string> l;
// Add elements
// ...
// Change the first
l.front() = "bubble";      // fine

tree_node<string> t;
// Add elements
// ...
// Change the first
t.front() = "bobble";// Error: no operator= defined for tree_node

Oops.

And that's just the start. Even if I did not hide the copy constructor and assignment operator, I would hide the function front and any other functions in the template class that happens to share name with the functions in tree_node. Hopefully, if issues of style does not prevail, the practical considerations should convince everyone that in this particular forest, those two cows are not holy.

You want fries with those burgers?

To have or to hold

With those controversies out of the way, I can go on with the design and put the member data in. That's fairly straightforward - simple containment should do it. This means, of course, that a root or branch which is only used to keep track of children - like you would use a list or vector, for instance - always has space set aside for data, even if it's not used. If the contained class is expensive to construct, this is a bad thing.

To allow data-less nodes, I might as well keep the data as a pointer, with the appropriate care in constructor, destructor and assignment. This is the familiar pimpl idiom [3], a common an implementation of the bridge pattern[1].

template <typename T>
class tree_node
{
public:
    typedef T& value_reference;

    tree_node():
        value_(NULL);

    tree_node( const value_reference value ):
        value_( new T( value ) );

    ~tree_node()
    {
        delete value_;
    }

    bool has_value() const
    {
        return (NULL != value_ );
    }

    value_reference value()
    {
        if ( !has_value() )
            throw std::exception("No value");
        return *value_;
    }

    void assign_value(const value_reference value)
    {
        if ( has_value()_)
            delete value_;
        value_= new T(value);
    }
    ...
private:
    T* value_;
};

Since there is no guarantee there’ll be a value to return, I’ll have to check for that and throw an exception if the client has asked for something I can’t give. Since I’ll have to check it anyway, I might as well give the client the ability to do the check, to avoid having to put a try-catch block around every call to value(). I’ll also provide a const version, but there’s not much point in listing it here.

That's the first bit of data sorted, what's next? The parent of course - as most people know, you have to have a parent (or two, but bear with me) before you can get children. And because most people don't want the parent to move in with them, I just keep track of its address so I can send Christmas cards. Just remember to use the right type - the parent is not the template type T, but the tree_node that has us as a child:

    ...
    typedef tree_node<T>* pointer;

    pointer parent()
    {
        return parent_;
    }

    pointer root()
    {
        pointer tmp = this;
        while ( NULL != tmp->parent() )
            tmp = tmp->parent();
        return tmp;
    }
    ...
private:
    pointer parent_;

What, no assignment of parent? No, you don't get to choose your parents, but more on that later. How do you like the function to find the root? There's another way of doing this that doesn’t use a temporary variable, but recursion. If you want to, you can write something like this instead:

    pointer root()
    {
        if ( NULL != parent_ )
            return parent_->root();
        return this;
    }

This is slightly less efficient, but illustrates that this container is designed for recursion, something that makes it a different breed than the usual two types that are distributed in the standard template library, namely: sequence containers (deque, list and vector) and sorted associative containers (map, set and their multi counterparts. We will have reason to come back to this in the next part, when we begin to play with children.

[1] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: "Design Patterns", Addison-Wesley 1994

[2] Herb Sutter: "Uses and Abuses of Inheritance, Part 2", C++ Report, January 1999

[3] Herb Sutter: Pimpls - Beauty Marks You Can Depend On, C++ Report, 10(5), May 1998.
Herb Sutter: The Joy of Pimpls (or, More About the Compiler-Firewall Idiom), C++ Report, 10(7), July/August 1998

Comments