Skip to content

Instantly share code, notes, and snippets.

@CAFxX
Last active March 6, 2025 09:02
Show Gist options
  • Save CAFxX/aff01b2c2ce53e556980a6f54cc23204 to your computer and use it in GitHub Desktop.
Save CAFxX/aff01b2c2ce53e556980a6f54cc23204 to your computer and use it in GitHub Desktop.
Minimal latency N-way sorter

Minimal Latency N-way Sorter (MLNS)

A MLNS is a $n$-sorter circuit that returns $N$ input values, sorted. It is functionally equivalent to a sorting network with $N$ inputs.

For small values of $N$ (3, 4, and possibly also 5) it can make sense, in order to minimize latency at the cost of slightly increased die area and gate count, to not use a traditional sorting network and instead perform all comparisons in parallel in a single stage, and then select the correct ordering, combinatorially, using the outputs of all comparisons.

Alternatively, an MLNS can also be used as the foundational building block of a $M$-inputs sorting network (with $M>N$) to reduce the number of stages (and therefore latency) of the sorting network.

Stage 1 - Comparisons

Pairwise comparison of each distinct pair of input elements. This should result in $C=\frac{N^2-N}{2}$ comparators performing the comparison of each pair simultaneously.

$i_0$ $i_1$ $i_2$ $i_3$ $i_{N-1}$ $i_N$
$i_0$ $i_0 < i_1$ $i_0 < i_2$ $i_0 < i_3$ $i_0 < i_{N-1}$ $i_0 < i_N$
$i_1$ $i_1 < i_2$ $i_1 < i_3$ $i_1 < i_{N-1}$ $i_1 < i_N$
$i_2$ $i_2 < i_3$ $i_2 < i_{N-1}$ $i_2 < i_N$
$i_3$ $i_3 < i_{N-1}$ $i_3 < i_N$
$i_{N-2}$ $i_{N-2} < i_{N-1}$ $i_{N-2} < i_N$
$i_{N-1}$ $i_{N-1} < i_N$
$i_N$

For example:

  • for $N=3$ and inputs i0, i1, and i2, this will result in 3 comparisons of (i0,i1), (i0,i2), (i1,i2).
  • for $N=4$ and inputs i0, i1, i2, and i3, this will result in 6 comparisons of (i0,i1), (i0,i2), (i0,i3), (i1,i2), (i1,i3), (i2,i3).

The $C$ resulting comparison results are used directly by the stage 2 logic.

In Verilog, the $N=3$ case may thus be expressed as:

/* C comparisons in parallel */
wire c0 = i0 < i1;
wire c1 = i0 < i2;
wire c2 = i1 < i2;

Stage 2 - Selection

In this stage, we map the results of the comparisons performed in the previous stage to an input ordering so that the outputs are in the desired, sorted, order.

This means mapping $2^C$ possible cases given by the $C$ comparison results into the correct one of the $N!$ possible ordering of the input that yields the outputs in the desired order, taking care that:

  1. distinct cases can map to the same permuation
  2. cases may be impossible (no set of input values can yield that set of comparison results)
  3. all $N!$ ordering must be mapped

For the case $N=3$ this may look something like:

case ({c0, c1, c2})
3'b111: out = '{i0,i1,i2}; // i0<i1, i0<i2, i1<i2     ->  i0 <  i1 <  i2
3'b110: out = '{i0,i2,i1}; // i0<i1, i0<i2, i1>=i2    ->  i0 <  i2 <= i1
3'b101: out = '{X, X, X }; // i0<i1, i0>=i2, i1<i2    ->  i2 <= i0 <  i1 <  i2  ->  NC
3'b100: out = '{i2,i0,i1}; // i0<i1, i0>=i2, i1>=i2   ->  i2 <= i0 <  i1
3'b011: out = '{i1,i0,i2}; // i0>=i1, i0<i2, i1<i2    ->  i1 <= i0 <  i2
3'b010: out = '{X, X, X }; // i0>=i1, i0<i2, i1>=i2   ->  i1 <= i0 <  i2 <= i1  ->  NC
3'b001: out = '{i1,i2,i0}; // i0>=i1, i0>=i2, i1<i2   ->  i1 <  i2 <= i0
3'b000: out = '{i2,i1,i0}; // i0>=i1, i0>=i2, i1>=i2  ->  i2 <= i1 <= i0
endcase

Obviously, for larger values of $N$ it is going to be beneficial to:

  1. omit all NC cases, and simply have a default case that returns out = '{X,X,X}
  2. merge (using comma-separated lists and don't care bits as needed) all cases that yield the same ordering

So for example, $N=4$ would look like:

case ({c0, c1, c2, c3, c4, c5})
6'b111111: out = '{i0,i1,i2,i3,i4,i5};
...
endcase

Comparison

$N$ MLNS size1 $C$ MLNS depth MLNS cases $2^C$ MLNS illegal cases MLNS valid cases Possible orderings $N!$ SN2 depth SN2 size1
23 1 1 2 0 2 2 1 1
3 3 1 8 2 6 6 3 3
4 6 1 64 ? ? 24 3 5
5 10 1 1K ? ? 120 5 9
6 15 1 32K ? ? 720 5 12
7 21 1 2M ? ? 5040 6 16
8 28 1 256M ? ? 40320 6 19

While the number of MLNS valid cases, especially for higher values of $N$, may seem to be too high to be practical it is worth pointing out that each output can only at most take one of $N$ values, and therefore the muxing latency (that is normally $\log_2(n)$, with $n$ the number of inputs to the mux) should remain practical.

Extensions

  1. In the example above we only handle unsigned integers, but it extends naturally to signed integers and floating point inputs by replacing the $C$ comparisons with comparators that can handle multiple formats.
  2. In the examples above we only handle ascending order, but it would be trivial to extend them to handle both ascending and descending orders.

Implementation

https://github.com/CAFxX/mlns

Footnotes

  1. Comparisons 2

  2. Sorting Network 2

  3. No difference between a SN and a MLNS of size 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment