@p typesetter = tex
% Set your own page size here . . . .
\hsize 160mm
\vsize 245mm
\def\fwseca#1#2{\fwlibc{#1}{#2}}
\def\fwsecb#1#2{\fwlibd{#1}{#2}}
\def\fwsecc#1#2{\fwlibe{#1}{#2}}
\def\fwsecd#1#2{\fwlibe{#1}{#2}}
\def\fwsece#1#2{\fwlibe{#1}{#2}}
@t vskip 40 mm
@t title titlefont centre "A Programming Pearl"
@t title titlefont centre "Generating sorted random numbers"
@t vskip 20 mm
@t title smalltitlefont centre "Don Knuth"
@t vskip 10 mm
@t title normalfont centre "FunnelWeb by Richard Walker"
@t new_page
@t table_of_contents
@t new_page
@!------------------------------------------------------------
@a@
Jon Bentley recently discussed the following interesting problem
as one of his `Programming Pearls' [{\sl Communications of the
ACM 27\/} (December, 1984), 1179--1182]:
The input consists of two integers @{M@} and @{N@}, with @{M@}$<$@{N@}.
The output is a sorted list of @{M@} random numbers in the range
1..@{N@} in which no integer occurs more than once. For probability
buffs, we desire a sorted selection without replacement in which each
selection occurs equiprobably.
The present program illustrates what I think is the best solution to
the problem, when @{M@} is reasonably large yet small compared to @{N@}.
It's the method described tersely in the answer to exercise 3.4.2--15 of
my book {\sl Seminumerical Algorithms}, pp.~141 and 555.
@!------------------------------------------------------------
@a@
For simplicity, all input and output in this program is assumed to be
handled at the terminal. The @{WEB@} macros @{read_terminal@},
@{print@}, and @{print_ln@} defined here can easily be changed to
accommodate other conventions.
Input a value from the terminal.
@$@@(@1@)@M==@{read(@1)@}
Output to the terminal.
@$@@(@1@)@M==@{write(@1)@}
Output to the terminal and end the line.
@$@@(@1@)@M==@{writeln(@1)@}
@!------------------------------------------------------------
@a@
Here's an outline of the entire Pascal program:
@o@==@{@-
program sample(input,output);
var @
@
begin @;
end.
@}
@!------------------------------------------------------------
@a@
The global variables @{M@} and @{N@} have already been mentioned; we had
better declare them. Other global variables will be declared later.
@{M_max@} is the maximum value of @{M@} allowed in this program.
@$@@M==@{5000@}
@{2M-1_max@} is $2M-1$ for use later on.
@$@<2M-1_max@>@M==@{9999@}
@$@+=@{@-
M: integer; { size of the sample }
N: integer; { size of the population }
@}
@!------------------------------------------------------------
@a@
We assume the existence of a system routine called @{rand_int(i,j)@}
that returns a random integer chosen uniformly in the range $i..j$.
@$@==@{@-
function rand_int(i,j: integer): integer; external;
@}
@!------------------------------------------------------------
@a@
After the user has specified @{M@} and @{N@}, we compute the sample by
following a general procedure recommended by Bentley:
@$@==@{@-
@;
size := 0; @;
while size < M do
begin
T := rand_int(1,N);
@;
end;
@@}
@!------------------------------------------------------------
@a@
The main program just sketched has introduced several more
globals. There's a set @{S@} of integers, whose representation
will be deferred until later; but we can declare two auxiliary
integer variables now.
@$@+=@{@-
size: integer; { the number of elements in set S }
T: integer; { new candidate for membership in S }
@}
@!------------------------------------------------------------
@a
The first order of business is to have a short dialogue with the
user.
@$@==@{@-
repeat @@('population size: N = '@);
@@(N@);
if N <= 0 then
@@('N should be positive!'@);
until N > 0;
repeat @@('sample size: M = '@);
@@(M@);
if M < 0 then
@@('M shouldn''t be negative!'@)
else if M > N then
@@('M shouldn''t exceed N!'@)
else if M > @ then
@@('(Sorry, M must be at most ',@:1,'.)'@);
until (M >= 0) and (M <= N) and (M <= @)@}
@!------------------------------------------------------------
@a@
The key idea to an efficient solution of this sampling problem is
to maintain a set whose entries are easily sorted. The method of
`ordered hash tables' [Amble and Knuth, {\sl The Computer Journal
17\/} (May 1974), 135--142] is ideally suited to this task, as we
shall see.
Ordered hashing is similar to ordinary linear probing, except that
the relative order of keys is taken into account. The cited
paper derives theoretical results that will not be rederived
here, but we shall use the following fundamental property: {\sl
The entries of an ordered hash table are independent of the order
in which its keys were inserted}. Thus, an ordered hash table is
a `canonical' representation of its set of entries.
We shall represent @{S@} by an array of $2M$ integers. Since
Pascal doesn't permit arrays of variable size, we must leave room
for the largest possible table.
@$@+=@{@-
hash: array[0..@<2M-1_max@>] of integer;
{ the ordered hash table }
H: 0..@<2M-1_max@>; { an index into hash }
H_max: 0..@<2M-1_max@>; { the current hash size }
alpha: real; { the ratio of table size to N }
@}
@!------------------------------------------------------------
@a
@$@==@{@-
H_max := 2 * M - 1; alpha := 2 * M / N;
for H := 0 to H_max do hash[H] := 0@}
@!------------------------------------------------------------
@a
Now we come to the interesting part, where the algorithm tries to
insert @{T@} into an ordered hash table. The hash address
$H=\lfloor2M(T-1)/N\rfloor$ is used as a starting point, since
this quantity is monotonic in @{T@} and almost uniformly
distributed in the range $0\le H<2M$.
@$@==@{@-
H := trunc(alpha * (T-1));
while hash[H] > T do
if H = 0 then H := H_max else H := H-1;
if hash[H] < T then { T is not present }
begin size := size + 1;
@;
end@}
@!------------------------------------------------------------
The heart of ordered hashing is the insertion process. In general,
the new key @{T@} will be inserted in place of a previous key $T_1==@{@-
while hash[H] > 0 do
begin TT := hash[H]; { we have 0 < TT < T }
hash[H] := T; T := TT;
repeat if H = 0 then H := H_max
else H := H - 1;
until hash[H] < T;
end;
hash[H] := T@}
@!------------------------------------------------------------
@a
@$@+=@{@-
TT: integer; { a key that's being moved }
@}
@!------------------------------------------------------------
@a@
The climax of this program is the fact that the entries in our ordered
hash table can easily be read out in increasing order.
Why is this true? Well, we know that the final state of the table is
independent of the order in which the elements entered. Furthermore
it's easy to understand what the table looks like when the entries are
inserted in decreasing order, because we have used a monotonic hash
function. Therefore we know that the table must have an especially
simple form.
Suppose the nonzero entries are $T_1<\cdots@M==@{@@(hash[H] : 10@)@}
@$@==@{@-
if hash[0] = 0 then { there was no wrap-around }
begin for H := 1 to H_max do
if hash[H] > 0 then @;
end
else begin for H := 1 to H_max do
{ print the wrapped-around entries }
if hash[H] > 0 then
if hash[H] < hash[0] then @;
for H := 0 to H_max do
if hash[H] >= hash[0] then @;
end@}