Doing the Fisher-Yates Shuffle

Keywords: probability, uncertainty, sampling

Recently, I have developed some systems that sample data from large datasets of households and individuals. These samples must be drawn according to designs that are sometimes complex, going beyond simple random sampling to designs involving stratification and clustering. The resulting samples may then be used for surveys of public opinion or for product testing.

Probability selection methods ensure that each individual (or household, or other population unit) has a known, non-zero chance of inclusion in the sample. Methods that give every individual in a sampling frame the same probability of selection are known as "epsem", short for "equal probability selection methods". While epsem designs are common, appropriate weighting can compensate for unequal selection probabilities using standard techniques derived from statistical theory.
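As one standard example of such weighting (not specific to the systems described here), the Horvitz-Thompson estimator produces an unbiased estimate of a population total by weighting each observed value by the inverse of its inclusion probability:

$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}$$

where $s$ is the sample, $y_i$ is the value observed for unit $i$, and $\pi_i$ is that unit's known, non-zero probability of inclusion. Under an epsem design the $\pi_i$ are all equal and the weights reduce to a single constant.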

There is no statistical theory to guide the use of non-probability samples; they can be assessed only through subjective evaluation. Inevitably, such samples will be biased in directions and magnitudes that are generally unknown. Despite this, non-probability sampling methods are often used, with cost, convenience, and cargo-cult beliefs[1] usually invoked to justify these inferior techniques.

Implementing probability selection methods without introducing spurious bias is not difficult. Unfortunately, implementing unreliable methods is even easier, and that is what often happens.

A programming problem that frequently arises in the implementation of sampling systems is the fair (epsem) selection of elements from a set. The set is represented as an array of known length and type, but with an unknown and very possibly far-from-random ordering. The set may contain a large number of elements, perhaps millions, but is usually small enough to be memory resident.

One clever solution is to sort the set using a random comparator; the desired number of samples can then be drawn as the first $k$ elements. In JavaScript this would look like:

var naiveShuffle = function (array) {
  array.sort(function () {
    return Math.random() - 0.5;
  });
  return array;
};

This has the virtue of being simple, short, memorable, and wrong! A little thought about how sort algorithms work reveals the basic problem: the sequence of comparison operations isn't guaranteed to be free of bias for any given length of input array, so some orderings come up more often than others. By the way, sort algorithms fed such random comparisons tend to execute in worst-case time, namely $O(N^2)$.
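A quick empirical check makes the bias visible. The sketch below (the helper name tally is mine) uses the naiveShuffle above to count how often each ordering of a three-element array appears; a fair shuffle would produce each of the six permutations about one sixth of the time:

var tally = function (shuffleFn, iters) {
  var counts = {};
  for (var i = 0; i < iters; i += 1) {
    // Shuffle a fresh copy each time and record the resulting order.
    var key = shuffleFn([0, 1, 2]).join(',');
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
};

// A fair shuffle would give each permutation a count near 600000 / 6 = 100000;
// the random comparator will not.
console.log(tally(naiveShuffle, 600000));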

An improvement is to associate a random number with each element of the set and then sort on those random numbers. This works well, at the cost of additional memory, and will generally execute in $O(N \log N)$ time.
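A minimal sketch of this decorate-sort-undecorate approach (the name randomKeySort is mine, not standard) could look like:

var randomKeySort = function (array) {
  return array
    .map(function (x) { return { key: Math.random(), value: x }; }) // decorate
    .sort(function (a, b) { return a.key - b.key; })                // sort on keys
    .map(function (pair) { return pair.value; });                   // undecorate
};

Because each element gets its key exactly once, the comparator is consistent and the result is unbiased (ties among keys are possible but vanishingly rare).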

This brings us to a lovely algorithm that dates to 1938, when it was described by Ronald Fisher and Frank Yates. The modern in-place adaptation was introduced by Richard Durstenfeld in 1964 and was later popularized by Donald Knuth as Algorithm P (Shuffling) in The Art of Computer Programming.

The core idea is to shuffle the elements in place, scanning through the entire set in a single pass, making the complexity $O(N)$. The JavaScript code, along with a variant useful for sampling $k$ elements from an array of size $N$, is provided here:

/*******************************************************************************

  FISHER-YATES Shuffle

  (sometimes attributed to Knuth or to Durstenfeld)

  Because sometimes you need to randomly shuffle an array, eh?
  Shuffle is in-place picking a random element from the front and swapping
  it with the backmost element not yet selected.

  Suppose we only need k random picks or we need to restart. That's what
  shufflePart does. Again, in place.

*******************************************************************************/

var shuffle = function (array) {
  var currentIndex = array.length;
  var temporaryValue, randomIndex;

  while (0 !== currentIndex) { // while more
    // Pick an element in the remaining part of the array
    randomIndex = Math.floor(Math.random() * currentIndex);
    currentIndex -= 1;

    // Swap selected element with the current element.
    temporaryValue = array[currentIndex];
    array[currentIndex] = array[randomIndex];
    array[randomIndex] = temporaryValue;
  }
  return array;
};

var shufflePart = function (array, count, current) {
  var currentIndex = current ? current : array.length;
  var stopIndex = count ? Math.max(0, currentIndex - count) : 0;
  var oldCurrentIndex = currentIndex;
  var temporaryValue, randomIndex;

  while (stopIndex !== currentIndex) { // while more
    // Pick an element in the remaining part of the array
    randomIndex = Math.floor(Math.random() * currentIndex);
    currentIndex -= 1;

    // Swap selected element with the current element.
    temporaryValue = array[currentIndex];
    array[currentIndex] = array[randomIndex];
    array[randomIndex] = temporaryValue;
  }

  return {array: array, currentIndex: currentIndex, oldCurrentIndex: oldCurrentIndex};
};
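
As a usage sketch (the variable names are mine), drawing $k = 3$ elements from a frame of ten leaves the sample in the top three positions of the array:

var frame = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
var result = shufflePart(frame, 3);
// The selected elements now occupy positions currentIndex .. oldCurrentIndex - 1.
var sample = result.array.slice(result.currentIndex, result.oldCurrentIndex);
console.log(sample); // e.g. [7, 2, 9]

// Passing currentIndex back in draws the next 3 picks, without replacement.
var more = shufflePart(result.array, 3, result.currentIndex);

Returning currentIndex is what makes restarting possible: the elements above it have already been drawn, and a subsequent call shuffles only among those below it.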

Just for fun, the Julia version of the Fisher-Yates shuffle is given here, including a test function that provides a quick sanity check of its fairness.

Note the use of the @inbounds macro, which promises the compiler that runtime bounds checking is unnecessary.

julia> function shuffle!(array)
           currentIndex = length(array)
           @inbounds while 1 != currentIndex
               # Pick an element in the remaining part of the array
               randomIndex = Int64(floor(rand() * float(currentIndex))) + 1

               # Swap selected element with the current element.
               temporaryValue = array[currentIndex]
               array[currentIndex] = array[randomIndex]
               array[randomIndex] = temporaryValue

               # Next
               currentIndex = currentIndex - 1
           end
           return array
       end
shuffle! (generic function with 1 method)

julia> function testShuffle(len, iters)
           acc = zeros(Int64, len)    # position-wise sums of shuffled values
           acc1 = zeros(Int64, len)   # sanity check: sums of re-sorted values
           for i in 1:iters
               perm = shuffle!(collect(1:len))
               acc = acc .+ perm
               sort!(perm)            # re-sorting must recover 1:len exactly
               acc1 = acc1 .+ perm
           end
           return (acc, acc1)
       end
testShuffle (generic function with 1 method)

julia> (shuffledSum, resortedSum) = testShuffle(10, 1000);

julia> (expectedShuffledSum, totShuffledSum) = (sum(shuffledSum)/length(shuffledSum), sum(shuffledSum))
(5500.0, 55000)

julia> deviation = ((shuffledSum .- expectedShuffledSum) / expectedShuffledSum) * 100
10-element Array{Float64,1}:
 -0.14545454545454545
  0.8
  0.8727272727272728
 -1.0181818181818183
 -2.090909090909091
 -1.6545454545454543
  0.27272727272727276
  1.4181818181818182
 -0.2
  1.7454545454545456

julia> [collect(1:length(shuffledSum)) shuffledSum deviation]
10×3 Array{Float64,2}:
  1.0  5492.0  -0.145455
  2.0  5544.0   0.8
  3.0  5548.0   0.872727
  4.0  5444.0  -1.01818
  5.0  5385.0  -2.09091
  6.0  5409.0  -1.65455
  7.0  5515.0   0.272727
  8.0  5578.0   1.41818
  9.0  5489.0  -0.2
 10.0  5596.0   1.74545

I finish with a few additional comments:

  • Mike Bostock (the creator of D3) has a wonderful blog entry that beautifully illustrates the degree of bias for various shuffle algorithms. https://bost.ocks.org/mike/shuffle/compare.html

  • Only very recently did the various browsers and Node converge on the same random number generation algorithm. Let the buyer beware: random number generation is subtle, and what is acceptable for one application may not be for another, particularly anything that must be cryptographically secure. A sketch of the cryptographic case follows this list.

  • Whenever two statisticians are in the same room an argument will inevitably break out over methodology.
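
For that cryptographic case, here is a minimal sketch (not production code; it assumes the Web Crypto crypto object is available, as in browsers and recent Node) of drawing an unbiased index with crypto.getRandomValues in place of Math.random():

// Unbiased random integer in [0, n) from a cryptographic source.
// Rejection sampling avoids the modulo bias a bare `x % n` would introduce.
var secureRandomIndex = function (n) {
  // Largest multiple of n representable in 32 bits; reject draws above it.
  var limit = Math.floor(0x100000000 / n) * n;
  var buf = new Uint32Array(1);
  do {
    crypto.getRandomValues(buf);
  } while (buf[0] >= limit);
  return buf[0] % n;
};

Substituting this for Math.floor(Math.random() * currentIndex) in the shuffle above gives a shuffle driven by a cryptographically strong source, at some cost in speed.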