Friday, January 6, 2012

Quicksort: Is picking a pivot at random secure?

Eric offered this suggestion:
Another way to harden quicksort is to select the elements used in the pivot calculation randomly. This makes it nearly impossible for an attacker to cause worst case performance, and doesn't require you to also implement multiple sort algorithms.
This is well-worth considering. A single sort algorithm is simpler than combining multiple algorithms so there has to be a compelling benefit gained to justify the additional complexity.

I've done some testing and benchmarks but before we jump to the conclusion take a few minutes to ponder this with me and let's see just how far down this rabbit hole goes.

Hardening quicksort is a thorny problem and for an interesting reason. Quicksort's soft spot is in the method used to estimate the median. The risk that a sequence of partitionings will be unbalanced and lead to degenerate performance depends directly on the quality of the estimate. So why bother with a questionable estimate when it's easy to pick the actual median and have a perfectly secure quicksort? Ah, because there's a catch. Finding the actual median requires almost as much effort as it would to perform the sort itself. So, the only way that quicksort is viable is by making estimates and, as such, most quicksort algorithms are defined by their estimating technique: first item, random item, median-of-three, median-of-N, median-of-random-three, median-of-medians-of-three... you name it and it's probably been tried.

As long as we are estimating the median we have to accept that we are, in effect, rolling the dice and hoping a rare series of bad choices does not happen. And if an attacker is present they are only too happy to hand us loaded dice to ensure the worst possible outcome.

But wait... For an attack to work the attacker must be able to anticipate our method of estimating a median. What happens if we choose it at random? Does this defeat the attacker?

First, what exactly do we mean by 'random'? The pseudo-random number generator provided by most frameworks (the Random class in .Net is a good example) is highly predictable. It won't slow down an attacker. Alternatively, you could use a cryptographically secure random number generator but that comes at a performance penalty that almost certainly makes it unworkable. That cost will come down soon, though. Intel has new CPUs coming this year that offer fast, high-quality random number generation in hardware.

Let's assume we have that CPU today and it delivers fast and cryptogaphically secure random values. We are probably(1) not at risk from an attacker.

Not so fast. There are still two problems we can't erase:

1. Although highly unlikely, it doesn't eliminate the risk of a series of low-probability but catastrophic choices. I have seen quicksorts go quadratic on natural data, without the presence of an attacker, even when using the median-of-three method. Choosing at random should be more resilient. Whether this risk is acceptable or not depends on whether the sort is part of tax preparation software or whether it's running the auto-pilot of a jumbo jet.

2. It makes low-quality estimates of the median. Partitions will be less balanced and performance will suffer for it.

There is only one way I know of to reliably harden quicksort against both an attacker and an unlucky series of choices: Set an upper limit on how much effort quicksort is allowed to spend before abandoning it entirely.

Introsort is secure because it limits how many bad partitions it can accept before dropping out of quicksort and switching to heapsort to finish the job.

Of course, we could throw away all the complexity and just use heapsort. It has no degenerate cases and is secure against an attacker. But there's a catch. The reason we don't embrace heapsort so quickly is it tends to be significantly slower than quicksort in the average case.(2)

Let's look at the benchmarks: Sorting up to 100 million items (with no adversary)

The final algorithm, Zimbry_PureQuicksortFatPivotR1, is a pure quicksort (does not fall back to an insertion sort on tiny partitions) that uses a pseudo-random number generator to choose the pivot. It's what Eric suggests: A single quicksort algorithm that chooses a pivot at random. Introsort is 19% faster on average and is not vulnerable to either an attacker or at risk of degenerating on a bad series of random pivot choices.

So, here we are, at the bottom of the rabbit hole, and we find ourselves surrounded by trade-offs:

1. We can have a pure quicksort if we're willing to accept slower performance and the highly unlikely chance of bad behavior.

2. We can be invulnerable to both attackers and bad luck by choosing heapsort but that armor comes at a severe average performance cost compared to quicksort.

3. We can "have our performance cake and eat it too" by building a quicksort that falls back to heapsort in an emergency. We get good performance on average and no bad behavior to worry about but it comes at the cost of additional engineering complexity.

Engineering is all about trade-offs.

                                                                                                                                                            

(1) A motivated attacker won't give up so easily. If there is a flaw in Intel's random number generator it will be found and exploited. It's best to give the community (both the good guys and bad guys) time to stress-test any new security software or hardware before adopting it for your own use.

(2) Ball-park comparison: A well-implemented heapsort is typically three to five times slower than a well-implemented quicksort.

No comments:

Post a Comment