Friday 21 August 2009

Heapsort, the Binary Mafia, and Product Backlog Priorities

Dear Junior

I recently picked up my university book on algorithms and by accident came across the section on heap-sort and its data structures. When reading it, I was struck by a strong association with product backlogs and how product owners keep them sorted, often spending substantial time on that task.

Common advice to product owners is to give each user story a priority by assigning it an “importance number”. A high importance number means an important story, so if you want to send a story to the top, you can always change its number to something higher than the currently most important one. This is a neat trick that eliminates the common problem when things are given priority numbers (where “priority one” is most important) – because what do you do when something even more important arrives? Give it “prio zero”?

In the same breath, product owners are usually advised that all stories should have unique importance numbers, thus distinguishing them. For those stories that are to be developed in the near future (what I call the “Backlog Shortlist”), it is important to sort out their relative priorities, and I agree that assigning unique numbers to those stories is a feasible way to do it.

However, for the rest of the backlog, setting unique numbers is unnecessary work. Following the advice forces us to determine the exact order of even the bottom ten, most unimportant, stories – and to do this at an early stage. Guess what: those priorities might, and will, change many times before those stories are ripe for development. Chances are high that those stories will not even be developed, at least not in their current shape. So basically we have wasted precious product-owner time on establishing details that were not needed yet – what I call “premature exactness”.

Going back to the algorithms and data structures, this reminds me of how a list is sorted using insert-sort, where a new item “scans” through the list to find its place in the sorted list. Doing this one element at a time with all the elements you want to sort gives you a sorted list with the most important element at the beginning. If you extract the most important element one at a time from that list, you obviously get them in priority order. And if you let new elements enter the list during the process, you have what computer scientists call a “priority queue”. The connection to the backlog should be obvious.

However, it turns out that insert-sort is a pretty bad way of implementing a priority queue. This is because inserting a new element requires you to scan half the list on average – in other words, inserting a new element is linear in the number of elements in the list, or O(n) in computer science terminology. So, the complete work of processing n items through the priority queue will be O(n*n).
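Just to make that concrete, here is how an insert-sort-style priority queue might look as a little Java sketch (the class and names are of course made up for illustration) – each new story scans for its place, so one insertion costs O(n), and pushing n stories through costs O(n*n):

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of an insert-sort style priority queue (illustrative names only).
public class InsertSortQueue {
    private final List<Integer> sorted = new ArrayList<>(); // highest importance first

    public void insert(int importance) {
        int i = 0;
        while (i < sorted.size() && sorted.get(i) >= importance) {
            i++;                     // linear scan: on average half the list, O(n)
        }
        sorted.add(i, importance);   // place the story where the scan stopped
    }

    public int extractMax() {
        return sorted.remove(0);     // the most important story sits at the front
    }

    public static void main(String[] args) {
        InsertSortQueue backlog = new InsertSortQueue();
        for (int importance : new int[] {30, 80, 50, 90, 10}) {
            backlog.insert(importance);
        }
        for (int i = 0; i < 5; i++) {
            System.out.println(backlog.extractMax()); // 90, 80, 50, 30, 10
        }
    }
}
```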

Heapsort, on the other hand, uses another data structure called the “heap”. This heap is a binary tree with some extra restrictions on how nodes can sit relative to each other – not to be confused with the garbage-collected memory area used in runtime environments such as Java or LISP.

What makes the binary tree into a heap is an extra property: each node has a value that is larger than its immediate children’s. Thus, the largest value will be at the top. However, in the heap a node does not care about whether the right or the left child is larger; nor does it care about the values in the nodes two steps down. And it is exactly this ignorance of irrelevant details that happens to make the heap an excellent structure for implementing a priority queue.
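If I were to sketch that property in Java, using the usual array-backed layout where the children of node i sit at positions 2i+1 and 2i+2 (the layout and names are my own illustration, not taken from the book), it might look like this:

```java
// A sketch of the heap invariant on an array-backed binary heap:
// node i only has to dominate its two immediate children, nothing more.
public class HeapProperty {
    static boolean isHeap(int[] values) {
        for (int parent = 0; parent < values.length; parent++) {
            int left = 2 * parent + 1;
            int right = 2 * parent + 2;
            if (left < values.length && values[left] > values[parent]) return false;
            if (right < values.length && values[right] > values[parent]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 90 at the root, each node larger than its children - yet 70 (a
        // grandchild of the root) outranks 50 (a direct child of the root).
        int[] heap = {90, 80, 50, 70, 60, 40, 30};
        System.out.println(isHeap(heap)); // true
    }
}
```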

I think of a heap as a Mafia organisation with branches. In each cell (tree node), the toughest guy or gal is the cell boss, with two subordinates, each being the boss of one sub-branch. The other way around, the tough boss is part of another cell (the tree node above) where the boss of that cell is yet tougher. And of course, at the top is the head boss (root node) who is the toughest of them all.

So, as there are two sub-bosses directly under the head boss, are those two the second and third toughest in the org? Well, one of them is the second toughest, no doubt – but the other one might not be the third toughest. He or she might just be lucky enough that the third toughest is not in his or her part of the org. The third toughest might be one of the underdogs of the other sub-boss. In the same way, it is not necessarily the second-best team that plays the final in a cup. The winner of the final should be the best, but the second-best team could have lost against the best in some early round.

Back to the Mafia: it is easy to know who’s the toughest; it’s the boss right at the root of the heap. But what happens when the head boss gets shot ("extracted from the heap")? Well, the two sub-bosses will challenge each other, and the tougher will step up to take the empty place. Now there is a vacancy in the promoted sub-boss’s position. So, the tougher of the two in that cell will step up to fill that vacancy, leaving a vacancy another step down in the hierarchy – rattling down through one “leg” of the tree. At each level there will be a challenge between the peers on who is to be the new boss, and the number of challenges in total will be the height of the tree, which is roughly the logarithm of the size of the org. So, appointing a new head boss when the old one gets shot is an O(lg n) operation.
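In the textbook array-backed version, the vacancy at the root is filled by the last member of the org, who then has to defend the position through exactly those downward challenges. A minimal Java sketch, with illustrative names only, might look like this:

```java
// A sketch of extracting the "head boss" from an array-backed max-heap:
// the last member fills the empty root and is challenged downwards,
// one level at a time - at most lg n swaps.
public class HeapExtract {
    static int extractMax(int[] heap, int size) {
        int max = heap[0];
        heap[0] = heap[size - 1];             // last member steps into the vacancy
        siftDown(heap, 0, size - 1);
        return max;
    }

    static void siftDown(int[] heap, int parent, int size) {
        while (true) {
            int left = 2 * parent + 1;
            int right = 2 * parent + 2;
            int toughest = parent;
            if (left < size && heap[left] > heap[toughest]) toughest = left;
            if (right < size && heap[right] > heap[toughest]) toughest = right;
            if (toughest == parent) return;   // no challenger wins, vacancy filled
            int tmp = heap[parent];           // challenge won: switch places
            heap[parent] = heap[toughest];
            heap[toughest] = tmp;
            parent = toughest;                // the vacancy rattles down one level
        }
    }

    public static void main(String[] args) {
        int[] heap = {90, 80, 50, 70, 60, 40, 30};
        int size = heap.length;
        while (size > 0) {
            System.out.print(extractMax(heap, size--) + " "); // 90 80 70 60 50 40 30
        }
    }
}
```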

So, what happens when a newbie enters the org? Well, the newbie will enter as a leaf node in one of the cells, reporting to some lowest-level local cell boss. Now it may well be that the newbie is tougher than the local cell boss, in which case the latter will be challenged, and they will switch places, the newbie becoming the new local cell boss. Of course, the newbie might be tougher than the next-level boss as well, whereupon there will be a new challenge and another switch. At worst, the newbie will be tougher than everybody in the org, and will challenge his or her way all the way to the top, becoming the new head boss. This will take as many challenges as the number of levels in the org – i.e. again, the tree height. So, accepting a new member is an O(lg n) operation.
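And a corresponding sketch of the newbie joining as the last leaf and challenging upwards until stopped by a tougher boss (again, names made up for illustration):

```java
// A sketch of inserting a "newbie" into an array-backed max-heap:
// the new member starts at the bottom and challenges upwards,
// at most lg n challenges.
public class HeapInsert {
    static int insert(int[] heap, int size, int newbie) {
        heap[size] = newbie;                        // newbie joins at the bottom of the org
        int child = size;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (heap[parent] >= heap[child]) break; // the boss is tougher, stop here
            int tmp = heap[parent];                 // challenge won: switch places
            heap[parent] = heap[child];
            heap[child] = tmp;
            child = parent;                         // continue challenging one level up
        }
        return size + 1;
    }

    public static void main(String[] args) {
        int[] heap = new int[10];
        int size = 0;
        for (int toughness : new int[] {50, 30, 80, 20, 90}) {
            size = insert(heap, size, toughness);
        }
        System.out.println(heap[0]); // 90 - the toughest has risen to the top
    }
}
```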

If inserting a new element is O(lg n) and extracting the highest-priority element is O(lg n), then processing all elements of a list takes O(n lg n). As product backlogs can often contain some thirty to a hundred elements (if not more), the difference between n*n and n*lg n is substantial – especially if we talk about using precious product owner time. So, heap-sort is a much better role model than insert-sort when grooming our product backlog.

How come the heap performed so well? Its efficiency stems from not keeping the entire queue perfectly sorted. Instead it focuses on the most important thing – keeping the “toughest” element at the top, and keeping the rest of the queue roughly sorted, where “tough” elements tend to hang around near the top and “less tough” ones somewhere lower down. Thus it avoids premature exactness, e.g. decisions that are not needed yet. I also think about it as some kind of "fuzzy late evaluation" – if you get what I mean.

Applying this back to our product backlog, we might go so far as to structure our backlog as a heap. That would be cool; I have never done it, but I am definitely looking for the opportunity. Taking some of the wisdom back, we have learned that it suffices to have the lower part of the backlog “roughly sorted”. One way of applying that would be to relax the requirement on unique importance for each story. We should still require unique numbers for the shortlist, but for the rest it suffices that there are strata of stories that are roughly equally important and might share the same importance number.
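Should the opportunity come, java.util.PriorityQueue is itself a binary heap, so a heap-backed backlog could be sketched as simply as this (the Story type and the stories are of course made up):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A sketch of a heap-backed backlog: the most important story is always at
// the head, the rest stays only roughly ordered, and stories in the lower
// strata may freely share the same importance number.
public class HeapBacklog {
    record Story(String title, int importance) {}

    public static void main(String[] args) {
        PriorityQueue<Story> backlog =
                new PriorityQueue<>(Comparator.comparingInt(Story::importance).reversed());

        backlog.add(new Story("Single sign-on", 90));
        backlog.add(new Story("Export to PDF", 40)); // same stratum ...
        backlog.add(new Story("Audit logging", 40)); // ... shared importance is fine
        backlog.add(new Story("Faster search", 70));

        System.out.println(backlog.peek().title()); // "Single sign-on" - top of the heap
    }
}
```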

Yours

Dan