This is a rather long thread, and what follows are only *my*
messages on the subject. There are others by Roger Hui and Eugene
McDonnell that I haven't included here. (Get a copy of the APL
Newsreader and comp.lang.apl archive to read the other postings!) The
messages include:

- Problem description
- My first solutions
- Anagram timings
- More solutions, with results in lexical order
- Jeff Chilton's clever anagram function
- My analysis of Roger Hui's J solution
- The recode-matrix subproblem
- Ram Biyani's function
- Followup with "ultimate" function
- Mani Iyer's algorithm and i-th permutation functions

Here's the programming puzzle from the 1995 Quarter 2 issue of the Zark APL Tutor Newsletter. Perhaps someone at APL95 would be kind enough to print this out (please untransliterate first!), post it, and encourage attendees to submit solutions.

Permutations and Anagrams

In the 1992 Quarter 2 issue of the Zark APL Tutor News, a highly efficient (though obscure) function for constructing combinations was presented:

{del} R{<-}M COMBIN N;D;E;F;G;P
[1]  @ Returns a matrix of every possible
[2]  @ combination of M elements from the
[3]  @ vector {iota}N.  That is, returns a
[4]  @ matrix with M!N rows and N columns.
[5]  @
[6]  E{<-}({iota}P{<-}N-R{<-}M-1)-#IO
[7]  D{<-}R+{iota}P
[8]  R{<-}(P,1){rho}D
[9]  P{<-}P{rho}1
[10] L1:{->}(#IO>1{take}D{<-}D-1){rho}0
[11] P{<-}+\P
[12] G{<-}+\{neg}1{drop}0,F{<-}{reverse}P
[13] E{<-}F/E-G
[14] R{<-}(F/D),R[E+{iota}{rho}E;]
[15] E{<-}G
[16] {->}L1
{del}

      3 COMBIN 5
1 2 3
1 2 4
1 2 5
1 3 4
1 3 5
1 4 5
2 3 4
2 3 5
2 4 5
3 4 5

Your task is to write a monadic function PERMUTE, whose argument is a positive integer N and whose result is an N-column, (!N)-row matrix, in which each row is a different permutation of {iota}N.

      PERMUTE 3
1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1

Try to make the function as efficient as possible. In case you don't find this exercise adequately limbering, your second task is to write a monadic function ANAGRAM, whose argument is a character vector, and whose result is a character matrix in which each row is a different arrangement of the argument's letters. (ANAGRAM calls PERMUTE.)

      ANAGRAM 'SMSI'
IMSS
ISMS
ISSM
MISS
MSIS
MSSI
SIMS
SISM
SMIS
SMSI
SSIM
SSMI

Mail your solution on paper or APL*PLUS PC or APL*PLUS II disk to:

Zark Incorporated

Re: Limbering Up Task

23 Ketchbrook Lane

Ellington, CT 06029

U.S.A.

[And please post it here in comp.lang.apl, too! -JimW] The notable functions and their authors' names will be printed in the next issue of the Zark APL Tutor News. Entries must be received by July 15, 1995. Good luck and happy limbering.

Well, Gary asked us to try to make the function "as efficient as possible". Assuming this means "as fast as possible", here's my solution:

{del} Z{<-}PERM N
[1]  @Really fast version
[2]  Z{<-}{disclose}PERMS[1;#IO+N]
{del}

All possible results are precomputed and stored in a 2-by-X matrix PERMS. The columns of PERMS hold the result for arguments of 0, 1, 2, etc., with the first row holding the 1-origin results and the second row the 0-origin results. This is practical because PERM can be called with only a few different arguments. For arguments of 8, 9, and 10, the result is 1.3, 13, and 145 million bytes in size, one of which seems likely to be a practical upper bound for today's computers. Out-of-range arguments will give an INDEX ERROR instead of WS FULL, but you could correct that with some additional checking.
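The precompute-and-look-up idea translates naturally to memoization. Here is a hedged Python sketch (hypothetical names; `itertools` stands in for the PERMS global, and the cache is built lazily rather than all at once):

```python
from functools import lru_cache
from itertools import permutations

@lru_cache(maxsize=None)
def perm(n):
    """Precomputation in the spirit of PERM's table look-up: each result
    is built once, and every later call for the same n is a cheap cache
    hit.  An illustrative sketch, not the APL code."""
    return tuple(tuple(1 + i for i in p) for p in permutations(range(n)))
```

The first call to `perm(8)` does the work; subsequent calls return the same stored object, which is the whole point of the PERMS table.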

Of course you also need a program to generate the PERMS global variable. Here's what seems to be a reasonable solution. #IO is 1.

{del} Z{<-}PERM1 N;A;D;S
[1]  @Returns a (!{omega})-by-{omega} matrix of all permutations of {iota}{omega}
[2]  Z{<-}(N{min}1 1){rho}1          @ start with 1-by-1 or 0-by-0 matrix
[3]  L1:S{<-}{rho}Z                  @ Loop
[4]  D{<-}S[2]+1                     @ index of column we will add
[5]  {->}(D>N)/0                     @ quit if we have enough columns
[6]  Z{<-}(S{times}D,1){rho}Z        @ make D duplicate copies of Z atop each other
[7]  A{<-}S[1]/{iota}D               @ shift each copy by a different amount
[8]  Z{<-}(-A){rotate}D,A{rotate}Z   @ insert D in a different column for each copy
[9]  {->}L1                          @ Endloop
{del}

(Exposure time and f-stop not recorded, and not very relevant because I swiped the algorithm from Knuth's TAOCP Vol. 1.) The basic idea is that, given a single order-3 permutation such as 3 1 2, you can generate order-4 permutations by inserting 4 in all possible locations:

4 3 1 2
3 4 1 2
3 1 4 2
3 1 2 4

This is approximately what line [8] is doing. If Z is 4 3{rho}3 1 2 and A is 0 1 2 3, then A{rotate}Z shifts the insertion points to the first column, catenating D does the insertion, and (-A){rotate} restores the original alignment. When you repeat this process for each order-3 permutation, you get all possible order-4 permutations.
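The insertion scheme can be sketched in scalar form. This Python version (hypothetical name) builds each order from the previous one exactly as described, though without PERM1's vectorized rotate trick:

```python
from math import factorial

def perm_by_insertion(n):
    """All permutations of 1..n, built the way PERM1 does conceptually:
    insert d into every possible position of each order-(d-1) row.
    A scalar sketch, not a translation of the APL."""
    rows = [[]]
    for d in range(1, n + 1):
        rows = [row[:i] + [d] + row[i:] for row in rows for i in range(d)]
    return rows
```

Like PERM1, the rows come out in no particular order, but there are !n of them and none is repeated.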

Using PERM1, you can define PERMS as:

PERMS{<-}0 {neg}1{jot}.+PERM1{each}0,{iota}N

With APL*PLUS II/386 v5.2 on my 486/66 and a #WA of about 7.8E6, the statement above runs in 2.11 secs for N=8. Then, PERM takes about 0.0006 secs to return any result.

The original problem didn't say anything about row ordering, and PERM1 doesn't return its result in any particular order. If lexical order is desired, the result can be sorted with Z{<-}Z[{gradeup}Z;].

In writing ANAGRAM, you have to be careful about duplicate letters in the argument--they shouldn't result in duplicate entries in the result. Here's a straightforward version that calls PERM:

{del} Z{<-}ANAGRAM C
[1]  @Returns all anagrams of {omega}
[2]  C{<-},C                            @ arg might be scalar
[3]  Z{<-}C[PERM {rho}C]                @ all arrangements of C
[4]  Z{<-}Z[#AV{gradeup}Z;]             @ alphabetize
[5]  {->}((C{iota}C)^.={iota}{rho}C)/0  @ If any duplicate letters,
[6]  Z{<-}(1,1{drop}{or}/Z{/=}{neg}1{rotatebar}Z){slashbar}Z  @ remove dups from the result
{del}

It may be possible to write a more efficient version that deals with duplicate letters directly instead of calling PERM.

Jim

On June 10, Roger Hui wrote:

It may be informative and amusing to compare the time to compute "anagram" in APL and in J, where APL computes permutations by table look-up (as above) and J computes them from scratch.

Actually, using table look-up in ANAGRAM doesn't save as much time as you might think. Alphabetizing the result takes longer than building the permutation matrix from scratch. Using the #MF monitoring facility of APL*PLUS II/386, here are the line-by-line execution times for ANAGRAM (modified to eliminate table look-up):

      1 #MF'ANAGRAM2'            @ activate monitoring
1
      Z{<-}ANAGRAM2'ANAGRAM'     @ run the function
      ]LTIMES ANAGRAM2           @ display a report
 All Here Iter     {del} Z{<-}ANAGRAM2 C
   0    0    0 [1] @Returns all anagrams of {omega}
   0    0    1 [2] C{<-},C
 174    0    1 [3] Z{<-}PERM1 {rho}C  @ build permutation matrix
  25   25    1 [4] Z{<-}C[Z]
 336  336    1 [5] Z{<-}Z[#AV{gradeup}Z;]  @ alphabetize
   1    1    1 [6] {->}((C{iota}C)^.={iota}{rho}C)/0
  91   91    1 [7] Z{<-}(1,1{drop}{or}/Z{/=}{neg}1{rotatebar}Z){slashbar}Z
                   {del}
 627  452    1

The first two columns of the report are execution time in milliseconds on my 486/66. The "All" column gives the total time spent on each line, including time spent in subroutines. The "Here" column excludes time spent in subroutines. "Iter" is the number of times each line was executed. The last row is totals.

A much faster way of removing duplicate rows (without alphabetizing) is by using the MATIOTA (aka ROWFIND) fastfn. MATIOTA is similar to dyadic iota for character matrices--it looks up each row of the right argument in the left argument, and it can take the place of {iota} in the classic "nodups" idiom. Here's the execution profile for a version of ANAGRAM that uses this technique:

      1 #MF'ANAGRAM3'
1
      Y{<-}ANAGRAM3 'ANAGRAM'
      ]LTIMES ANAGRAM3
 All Here Iter     {del} Z{<-}ANAGRAM3 C
   0    0    0 [1] @Returns all anagrams of {omega}
   0    0    1 [2] C{<-},C
 176    0    1 [3] Z{<-}PERM1 {rho}C
  26   26    1 [4] Z{<-}C[Z]
  58   12    1 [5] Z{<-}((Z MATIOTA Z)={iota}1{take}{rho}Z){slashbar}Z
                   {del}
 261   38    1

MATIOTA is written in assembler and uses hashing to reduce the number of rows that are compared. Although using a non-APL subroutine is not as satisfying as using standard built-in operations, the subroutine is, at least, supplied with the APL system, so ANAGRAM3 qualifies as an out-of-the-box "stock" solution (for APL*PLUS, at least).
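For readers without the fastfn, the hash-based row lookup can be sketched in Python. The names below are hypothetical; a dict plays the role of MATIOTA's assembler hash table:

```python
def matiota(left_rows, right_rows):
    """Row lookup in the spirit of MATIOTA/ROWFIND: for each row of
    right_rows, the index of its first occurrence in left_rows, or
    len(left_rows) if absent -- dyadic iota for rows."""
    first = {}
    for i, row in enumerate(left_rows):
        first.setdefault(tuple(row), i)   # remember first occurrence only
    missing = len(left_rows)
    return [first.get(tuple(row), missing) for row in right_rows]

def nodup_rows(rows):
    """The classic nodups idiom with matiota in place of iota:
    keep row i only if its first occurrence is at position i."""
    idx = matiota(rows, rows)
    return [r for i, r in enumerate(rows) if idx[i] == i]
```

As in the APL idiom, `rows MATIOTA rows` equal to its own index vector marks first occurrences, and those rows are kept.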

In spite of the faster execution, the APL version is still slower than Roger's version running on J 2.06:

                 Argument
System     'ANAGRAM' 'ABCDEFG'  Function
--------   --------- ---------  --------
+II 5.2      0.636     0.486    ANAGRAM2
+III 1.1     0.450     0.373    ANAGRAM2
+II 5.2      0.253     0.257    ANAGRAM3
+III 1.1     0.236     0.242    ANAGRAM3
J 2.03       0.379     0.379    anagram
J 2.06       0.117     0.121    anagram   (timed by Roger)

Times are seconds on a 486/66. (I multiplied Roger's 486/50 times by 50{divide}66 to make them comparable with my times.)

Some observations on the times: APL*PLUS III is significantly faster than APL*PLUS II. The speedups illustrated above for ANAGRAM come mostly from improvements to gradeup. On +II, gradeup's execution time depends on the data in the argument; on III, execution is faster and the time appears to depend only on the size of the argument. Once again, J 2.06 demonstrates that it is much faster than J 2.03.

Jim

For what it's worth, here are two more ways of computing a permutation matrix.

{del} Z{<-}PERM2 N;C;I
[1]  @Returns a (!{omega})-by-{omega} matrix of all permutations of {iota}{omega}
[2]  Z{<-}1 0{rho}I{<-}0
[3]  L1:{->}(N<I{<-}I+1)/0             @ Loop for each column
[4]  C{<-}(1{take}{rho}Z)/{iota}I      @ new column to put in front
[5]  Z{<-}(({rho}C),I-1){rho}Z         @ stack I copies of Z atop each other
[6]  Z{<-}C,Z+Z{>=}{transpose}({reverse}{rho}Z){rho}C  @ recode Z and put on the col
[7]  {->}L1                            @ Endloop
{del}

This version returns the permutation matrix in lexical order, adapts to the current index origin, returns the correct result for N=0, and runs a bit faster than PERM1.

I derived this algorithm by looking at section 4 of an order-four permutation matrix (the rows that have 4 in the first column):

4 1 2 3
4 1 3 2
4 2 1 3
4 2 3 1
4 3 1 2
4 3 2 1

The stuff to the right of the first column is just an order-three permutation matrix, so this section can be built by induction. Section 3 of the order-four matrix is similar, but the 3s in the order-three matrix are replaced by 4s. And in section 2, the 2s are replaced by 3s and the 3s by 4s. So the first row of each section looks like this:

1 2 3 4
2 1 3 4
3 1 2 4
4 1 2 3

At first I thought of using indexing to recode P, the order-three permutation matrix (e.g., 2,(1 3 4)[P] generates section 2), but then I realized that the recoding simply increments the elements of P that are {>=} the number in the first column. This can be done by adding a Boolean matrix to P, as in 2,P+P{>=}2 which generates section 2. Line [4] of PERM2 generates the new first column, and on line [6] the reshape-transpose duplicates the first column I times for comparison with Z.
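The recoding construction can be restated as a scalar Python sketch (hypothetical name; APL's Boolean add Z+Z{>=}C carries over directly because Python booleans are integers):

```python
def perm_lex(n):
    """Lexical-order permutations of 1..n by the PERM2 construction:
    section c of the order-i matrix is c prefixed to the order-(i-1)
    matrix with every element >= c incremented by a Boolean add."""
    z = [[]]
    for i in range(1, n + 1):
        z = [[c] + [v + (v >= c) for v in row]   # recode and prefix c
             for c in range(1, i + 1) for row in z]
    return z
```

Because the sections are generated with c running from 1 to i, the rows come out already in lexical order, just as PERM2's do.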

After writing PERM2, I finally managed to figure out what Roger's J program is doing. Here is an APL translation:

{del} Z{<-}PERM3 N;I;T
[1]  @Roger Hui's perm algorithm, in APL
[2]  Z{<-}1 0{rho}I{<-}0
[3]  L1:{->}(N<I{<-}I+1)/0
[4]  T{<-}{gradedown}{each}{enclose}[1]({iota}I){jot}.={iota}I
[5]  Z{<-}{disclose}{commabar}/T NDX{each}{enclose}#IO,1+Z
[6]  {->}L1
{del}

NDX is just {alpha}[{omega}], needed for applying each to indexing. The {enclose}[1] on line [4] should be {split} in APL*PLUS Evlevel 1 or Dyalog APL. For I=4, the items of T on line [4] are the rows of the first-row-of-each-section matrix shown above. Indexing each item with #IO,1+Z gives the sections of the next higher-order permutation matrix as a vector of matrices, and {disclose}{commabar}/ stacks the sections atop each other, forming a flat matrix.
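As a cross-check on this reading of the algorithm, here is a 0-origin Python sketch (hypothetical names; Python's stable sort with reverse=True stands in for grade-down):

```python
def perm_hui(n):
    """Roger Hui's construction, 0-origin: the items of T are the
    grade-downs of the rows of the identity matrix; section k of the
    next-order matrix is row k of T indexed by (0, 1+Z)."""
    z = [[]]
    for i in range(1, n + 1):
        ident = [[int(r == c) for c in range(i)] for r in range(i)]
        # stable descending sort = grade down: row k yields (k, 0..k-1, k+1..)
        t = [sorted(range(i), key=row.__getitem__, reverse=True)
             for row in ident]
        z = [[t[k][j] for j in [0] + [v + 1 for v in row]]
             for k in range(i) for row in z]
    return z
```

The stability of the sort is what makes the grade-down of an identity row come out as (k, 0, 1, ..., k-1, k+1, ...), the first row of section k.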

Here are some timings:

Argument           6     7     8
+II 5.2   PERM1  .025  .178  1.59
          PERM2  .021  .137  1.36
          PERM3  .023  .098  1.37
+III 1.1  PERM1  .021  .134  1.63
          PERM2  .019  .132  1.38
          PERM3  .022  .093  1.00
J 2.03    perm   .077  .244  1.69
J 2.06    perm   .017  .037   .28   (timed by Roger Hui)

Times are seconds on my 486/66. The J 2.06 timings are from Roger's June 10 posting, scaled by 50{divide}66. The performance of J 2.06 is nothing short of astonishing--perm 8 apparently runs fully six times faster on J 2.06 than on 2.03!

Jim

I wrote PERM2 because I thought that if the permutation matrix P was in
lexical order and I sorted the letter list C, then the anagram matrix
C[P] would be in alphabetical order, and I could just shift and compare
to remove duplicates. However, on trying this, I found out I was wrong!
C[P] is *not* in alphabetical order. To see why, consider the letters
'AABC'. Both 1 and 2 in the numeric matrix P become 'A' in C[P]. When
sorting P, if one row has 1 in the first column and another has 2, you
can stop right there and put the 1-row before the 2-row. But in the
character matrix, both 1 and 2 are As, and the 2-row might actually
belong before the 1-row, depending on what's in later columns.
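A quick Python check makes the failure concrete (`itertools.permutations` yields index tuples in lexical order, standing in for the sorted numeric matrix P):

```python
from itertools import permutations

# Index 'AABC' by the lexically ordered permutations of 0..3.  Positions
# 0 and 1 both hold 'A', so the resulting character rows need not be in
# alphabetical order even though the index rows are sorted.
c = 'AABC'
rows = [''.join(c[i] for i in p) for p in permutations(range(len(c)))]
```

For example, the row from index tuple (0,3,2,1) is 'ACBA', yet the very next tuple (1,0,2,3) yields 'AABC', which belongs earlier alphabetically.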

Jeff Chilton suggested the following:

'ABCC' indexed by the in-order permutation matrix is not an alphabetized matrix, but the rows that are out of order are exactly the ones you want to delete. Algorithmically: examine the second through last rows of the anagram matrix and remove each one that is less than or equal to its last non-deleted predecessor.

Here's one way of doing this in APL:

{del} Z{<-}ANAGRAM4 C;B;D;J;K
[1]  @Jeff Chilton's algorithm
[2]  Z{<-}PERM2 {rho}C{<-},C                  @ numeric permutation matrix
[3]  C{<-}C[#AV{gradeup}C]                    @ alphabetize the letter list
[4]  {->}(({iota}{rho}C)^.=D{<-}C{iota}C)/L1  @ If duplicate letters in C,
[5]  J{<-}({rho}C){basevalue}{transpose}D[Z]  @ recode Z, then pack each row into a number
[6]  K{<-}(B{<-}J={max}\J)/J                  @ remove out-of-order entries
[7]  Z{<-}(B\1,1{drop}K{/=}{neg}1{rotate}K){slashbar}Z  @ remove duplicate entries as well
[8]  L1:Z{<-}C[Z]                             @ return letters
{del}

Line [5] recodes the full permutation matrix so it has the same duplicates as C, then packs each row into a single number. Duplicate rows will have identical numbers. Lines [6] and [7] do what Jeff suggested. This program isn't quite as fast as the MATIOTA version, but it's a lot faster than the sorting version (and it's pure APL). Incidentally, using PERM2 in the MATIOTA version and sorting the letter list before indexing with the permutation matrix will cause the anagram matrix to be in alphabetical order.
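Here is a scalar Python sketch of Jeff's deletion rule (hypothetical name; string comparison stands in for ANAGRAM4's base-({rho}C) packing of the recoded rows):

```python
from itertools import permutations

def anagram(c):
    """Jeff Chilton's rule: sort the letters, walk the lexically
    ordered arrangements, and keep a row only when it is strictly
    greater than the last kept row (a running max, as on lines [6]
    and [7] of ANAGRAM4)."""
    c = ''.join(sorted(c))
    out = []
    for p in permutations(range(len(c))):
        row = ''.join(c[i] for i in p)
        if not out or row > out[-1]:
            out.append(row)
    return out
```

Because duplicate letters compare equal as characters, the running max discards both the repeated rows and the out-of-order ones, leaving the distinct anagrams in alphabetical order.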

Jim

This is an analysis and comparison of the J and APL implementations of Roger Hui's p4 perm function (listed in his June 10 comp.lang.apl posting).

The main statement in Roger's function is:

z=. ,/ (0,.1+z) {"2 1 \:"1 =i.j

This is executed repeatedly with j=1, 2, 3, etc. First, a description of the steps and some commentary from an APLer who is just learning to creep in the world of J.

` i. j - {iota}j.` The . and : suffixes are one of the easier aspects
of J to get used to. They serve a role similar to the {quad}
prefix of APL, giving the J system two entire "namespaces" for built-in
system functions. One can only wish the foreign conjunctions had been
given names like "timer." instead of expecting users to define their own
covers for 6!:2.

` 1=i.j - `No, the = is not dyadic. The 1 is grabbed by the " to its
left, leaving the = as a monadic function. This is an example
of an aspect of J that is fundamentally more difficult than APL--
figuring out what's an argument to what. Monadic = is "self-classify":
for a vector X, =X is a Boolean table with one row for each distinct
item of X, marking where that item occurs. Since the items of i.j are
all distinct, =i.j is simply the j-by-j identity matrix.
Roger Hui pointed out in a later message that the 1 being grabbed by an operator to its left is a property that J shares with APL2. (I was not aware of this because APL*PLUS has neither user-defined operators nor any operators that take array arguments.) However, I stand by my point that figuring out what's an argument to what is more difficult in J than in APL.

` \:"1 - `grade down (\:) applied at rank 1, that is, to each row.
Grading down the rows of the identity matrix produces the
recode matrix whose rows begin each section of the result.
Rank is a much richer way of describing repetition than each. In
APL2 and its kin, you have one very simple repetition operator (each),
and to do anything complicated you have to transform the data structure
to specify the axes you want repetition over. For example, to do `\:"1 M`
in APL2, you would use `{enclose}[2]M` to hide the column axis, apply
grade-each, then use `{disclose}` to restore the column axis. The trouble
is, the enclose and disclose actually result in data movement, which is
not so good for efficiency, and they require some thought on the part of
the reader to figure out what is going on. With rank, the data sits
still and the numeric right argument describes the repetition. On the
other hand, rank works only if the overall result can fit within the
confines of a rectangular multidimensional array.* But, as the success
of non-nested APL demonstrates, you can do quite a lot within those
limits.

* Roger pointed out that this is not correct. He wrote: "The individual results from the argument cells need not have the same rank or shape. In constructing the overall result from the individual cellular results, a process which I call 'permissive assembly' uses appropriate padding to produce a rectangular overall result."

` (0,.1+z) - `this would be `0,1+Z` in APL. The ,. (stitch) catenates a column of zeros onto the front of 1+z, prefixing each permutation with an index of 0 for the new first column.

` {"2 1 - `"from", J's indexing function, with the rank operator. For
a vector V, `x { V` selects the items of V given by the indices
in x, like V[x] in APL. Here the rank "2 1 applies the entire rank-2
index array on the left to each rank-1 cell (row) of the right argument.

In APL2, a vector can be indexed with a matrix using `V[M]` or
`({enclose}M){squad}V`. To index a set of vectors, you can use
squad-each. If the vectors reside in the rows of a matrix, you must first
use `{enclose}[2]` to hide the column dimension. To specify that the left
argument should be applied in toto to each item of the right argument,
you have to apply an extra enclose to the left argument. After
indexing, disclose must be used to reveal the nested dimension. So the
J statement `L{"2 1 R` for matrices L and R can be done in APL2 using:

{disclose}({enclose}{enclose}L){squad}{each}{enclose}[2]R

As mentioned before, with rank the data stands still and the argument describes the repetition. With APL2 the data has to be twisted to get it ready for each, then straightened out afterwards.

` ,/ - `yes, it's catenate-reduction, but quite different from APL.
It has the effect of coalescing the first two dimensions of
the 3-D matrix. Remember that , stacks matrices vertically in J. In J,
f/ applies f between the items (major cells) of its argument, so ,/
catenates the planes of the rank-3 result into a single flat matrix.

In fairness to APL2, mimicking the p4 J program isn't as difficult as described above because the data doesn't always have to be de-nested after each step--it can be left nested, ready for the next step.

Now for the APL version. Why does it run so slowly? An execution profile for PERM 8 (executed 10 times using APL*PLUS II/386 v5.2) is quite revealing:

   All   Here Iter     {del} Z{<-}PERM3B N;I;T
     0      0    0 [1] @Roger Hui's algorithm, in APL
     2      2   10 [2] Z{<-}1 0{rho}I{<-}0
    21     21   90 [3] L1:{->}(N<I{<-}I+1)/0
    72     72   80 [4] T{<-}{gradedown}{each}({enclose}{iota}I)={each}{iota}I
   532    532   80 [5] Z{<-}{enclose}#IO,1+Z
 2,655     59   80 [6] Z{<-}T NDX{each}Z
10,454 10,454   80 [7] Z{<-}{commabar}/Z
     3      3   80 [8] Z{<-}{disclose}Z
     3      3   80 [9] {->}L1
                       {del}
13,747 11,146   10

Note that `{commabar}/` is responsible for most of the execution time.
It is slow because the reduction is implemented literally, as a series
of catenations, which moves a lot of data. On the last iteration (the
one that takes nearly all the time), it catenates together eight 158
Kbyte matrices, one at a time, for a total data movement of about 5.5
Mbytes, even though the result is only 1.3 Mbytes in size. Furthermore,
as it builds larger and larger arrays, it chews through memory, forcing
more frequent garbage collections.

A better way of doing this is by using `{disclose}` (or `{mix}` in
APL*PLUS Evlevel 1 or Dyalog APL). Disclose has the sense to figure out
the result size and copy each block of data just once. After the
disclose, reshape must be used to coalesce the first two dimensions.
Here's a version that does that:

  All  Here Iter     {del} Z{<-}PERM3C N;I;T
    0     0    0 [1] @Version that avoids {commabar}/
    1     1   10 [2] Z{<-}1 0{rho}I{<-}0
   20    20   90 [3] L1:{->}(N<I{<-}I+1)/0
   72    72   80 [4] T{<-}{gradedown}{each}({enclose}{iota}I)={each}{iota}I
  523   523   80 [5] Z{<-}{enclose}#IO,1+Z
2,691    60   80 [6] Z{<-}T NDX{each}Z
1,127 1,127   80 [7] Z{<-}{disclose}Z
1,131 1,131   80 [8] Z{<-}(({times}/2{take}{rho}Z),2{drop}{rho}Z){rho}Z
    1     1   80 [9] {->}L1
                     {del}
5,577 2,935   10

In fact, the dimension-coalescing reshape can be postponed until the very end of the program. Z will build up a lot of dimensions as the loop proceeds, but they don't affect any of the operations in the loop.

  All  Here Iter     {del} Z{<-}PERM3D N;I;T
    0     0    0 [1] @Version with the reshape postponed
    2     2   10 [2] Z{<-}1 0{rho}I{<-}0
   19    19   90 [3] L1:{->}(N<I{<-}I+1)/L2
   70    70   80 [4] T{<-}{gradedown}{each}({enclose}{iota}I)={each}{iota}I
  548   548   80 [5] Z{<-}{enclose}#IO,1+Z
2,662    57   80 [6] Z{<-}T NDX{each}Z
1,120 1,120   80 [7] Z{<-}{disclose}Z
    3     3   80 [8] {->}L1
  985   985   10 [9] L2:Z{<-}(({times}/{neg}1{drop}{rho}Z),{neg}1{take}{rho}Z){rho}Z
                     {del}
5,416 2,803   10

The saving is minor because the final reshape is the one that takes the most time; moving the reshape out of the loop only avoids the less time-consuming cases.

(Note that I'm using `NDX{each}` in this program because APL*PLUS
doesn't have `{squad}`. I don't expect much penalty for calling a
user-defined function here because the arrays are large.)

By the way, it's possible to do this using only standard APL operations. The index-each technique is replaced with indexing the enlist (or ravel) of T with copies of Z that are increased by 0, I, I{times}2, I{times}3, etc., to select the different "rows" of T. Here's the program and execution profile:

  All  Here Iter      {del} Z{<-}PERM3E N;I;M;T;V
    0     0    0  [1] @Roger Hui's algorithm, in flat APL
    2     2   10  [2] Z{<-}1 0{rho}I{<-}0
   20    20   90  [3] L1:{->}(N<I{<-}I+1)/L2
   12    12   80  [4] M{<-}({iota}I){jot}.={iota}I
   31    31   80  [5] M{<-}M+(2{times}I-{iota}I){jot}.+I{rho}0
   17    17   80  [6] T{<-}{gradedown},M
   31    31   80  [7] T{<-}T-,(I{times}{neg}1+{iota}I){jot}.+I{rho}0
  276   276   80  [8] Z{<-}0,Z
   16    16   80  [9] V{<-}1+I{times}{neg}1+{iota}I
3,230 3,230   80 [10] Z{<-}V{jot}.+Z
2,828 2,828   80 [11] Z{<-}T[Z]
    1     1   80 [12] {->}L1
  983   983   10 [13] L2:Z{<-}(({times}/{neg}1{drop}{rho}Z),{neg}1{take}{rho}Z){rho}Z
                      {del}
7,454 7,447   10

This runs slower than the nested version because the addition on line [10] generates a matrix the size of an I-th order permutation matrix, instead of an (I-1)th order matrix as in PERM3D. On the final iteration, this results in line [10] doing 8 times as much work as 1+Z on line [5] of PERM3D, and the times reflect this.

That the indexing on line [11] is *slower* than the NDX{each} on line
[6] of PERM3D is surprising, since both lines do the same amount of
work. Perhaps the difference is caused by my computer's 256K RAM
cache: in PERM3D, a 158K Z is used repeatedly, while in PERM3E, Z is
1.3M and is used only once.

The flat technique used on lines [4-7] to generate T may run fast,
but it sure is obtuse. Remember that these four lines are just `\:"1=i.I`
in J or `{gradedown}{each}({enclose}{iota}I)={each}{iota}I` in APL2.

PERM3D is by far the fastest of the APL versions. Here are some execution times (seconds on my 486/66):

Argument        6     7     8
+II 5.2      .019  .067   .53
+III 1.1     .020  .071   .54
J 2.06       .017  .037   .28   (scaled from Roger's timings)

So, why is the J version still twice as fast? One possibility is that J 2.06 might be able to alter the shape of an array without having to make a copy of the data, so the ,/ (which corresponds to lines [7] and [9] of PERM3D) is done without copying the large array. Another possibility is that J's block-move subroutine exploits the 80386's REP MOVSD instruction (something APL*PLUS does not appear to do) and that J's indexing function has a tighter inner loop than APL*PLUS's.

Jim

Roger Hui wrote on June 16:

The component

   \:"1 = i.

used in the enumeration of permutations is a neat little problem per se. [...] Can you think of other solutions?

For nested-array APL:

{disclose}{gradedown}{each}({enclose}{iota}N)={each}{iota}N

For flat APL:

+{backslashbar}({iota}N){commabar}1,(N-1 1){rho}N{take}{neg}1

      N{<-}5
      (N-1 1){rho}N{take}{neg}1
-1  0  0  0
 0 -1  0  0
 0  0 -1  0
 0  0  0 -1
      ({iota}N){commabar}1,(N-1 1){rho}N{take}{neg}1
1  2  3  4  5
1 -1  0  0  0
1  0 -1  0  0
1  0  0 -1  0
1  0  0  0 -1
      +{backslashbar}({iota}N){commabar}1,(N-1 1){rho}N{take}{neg}1
1 2 3 4 5
2 1 3 4 5
3 1 2 4 5
4 1 2 3 5
5 1 2 3 4
      #IO{<-}0
      +{backslashbar}({iota}N){commabar}1,(N-1 1){rho}N{take}{neg}1
0 1 2 3 4
1 0 2 3 4
2 0 1 3 4
3 0 1 2 4
4 0 1 2 3

Unfortunately, neither of these solutions works for N=0 (at least on APL*PLUS II).
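For what it's worth, the flat construction can be restated as a 0-origin Python sketch (hypothetical name; a running sum down the "columns" replaces +{backslashbar}):

```python
def recode_matrix(n):
    """The flat-APL trick in 0-origin: start from iota-n, append rows
    of (1, -identity), and running-sum down each column.  Row k comes
    out as (k, 0..k-1, k+1..n-1)."""
    rows = [list(range(n))]
    for k in range(n - 1):
        # delta row k: +1 in column 0, -1 in column k+1, zeros elsewhere
        deltas = [1] + [-(j == k) for j in range(n - 1)]
        rows.append([a + b for a, b in zip(rows[-1], deltas)])
    return rows
```

Each delta row bumps the leading value up by one and knocks one interior value back down, which is exactly what the cumulative sum of the (1, -identity) block accomplishes.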

Jim

Ram Biyani (rambi@vnet.ibm.com) sent me the following version of PERM:

{del} Z{<-}PERM3R N;I;S;T
[1]  @Ram Biyani's version
[2]  Z{<-}1 0{rho}I{<-}0
[3]  L1:{->}(N<I{<-}I+1)/L2
[4]  S{<-}I*2
[5]  T{<-}({iota}I),(I-0 1){rho}(S{rho}0,I{rho}1)/S{rho}{iota}I
[6]  Z{<-}T[;#IO,1+Z]
[7]  {->}L1
[8]  L2:Z{<-}(({times}/{neg}1{drop}{rho}Z),{neg}1{take}{rho}Z){rho}Z
{del}

His solution to building the recode matrix (lines [4] and [5]) is novel and very nice. Displaying intermediate results with (I,I) in place of S makes the technique apparent:

      I{<-}4
      +J{<-}(I,I){rho}{iota}4
1 2 3 4
1 2 3 4
1 2 3 4
1 2 3 4
      +B{<-}(I,I){rho}0,I{rho}1
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
      +T{<-}(I-0 1){rho}(,B)/,J
2 3 4
1 3 4
1 2 4
1 2 3
      ({iota}I),T
1 2 3 4
2 1 3 4
3 1 2 4
4 1 2 3

And line [6]! Oh, line [6]! Why didn't *I* think of it this
way? With `J{<-}#IO,1+Z`, what we want is
`(T[1;])[J]`, or `T[1;J]`, and `T[2;J]` and
`T[3;J]`, etc. Once stated this way, it's obvious that
`T[;J]` does it all at once. Beautiful and simple.
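The T[;J] step can be mimicked in Python to see what the single indexing operation produces (t and j below are small hypothetical values, 0-origin):

```python
# T[;J] in APL indexes every row of T by the whole array J at once:
# the result's sections are T[0;J], T[1;J], ... stacked.  Here t is
# the order-4 recode matrix and j a made-up pair of index rows (0,1+Z).
t = [[0, 1, 2, 3], [1, 0, 2, 3], [2, 0, 1, 3], [3, 0, 1, 2]]
j = [[0, 2, 1], [2, 0, 1]]
t_of_j = [[[row[c] for c in jrow] for jrow in j] for row in t]
```

Section k of the result is row k of t pushed through every index row of j, which is the whole next-order matrix in one operation.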

As you might expect, this is the fastest APL solution so far. Here are the execution times on APL*PLUS II and III on my 486/66. The last row gives the estimated run times for Roger's perm on the same computer.

Argument        6     7     8
+II 5.2      .012  .048   .40
+III 1.1     .012  .048   .40   (no kidding; exactly = +II)
J 2.06       .017  .037   .28   (scaled from Roger's timings)

By the way, fully 98% of the execution time is spent on lines [6] and [8], so improvements to the other lines won't result in significant speedups.

Jim

The 1995 Quarter 3 issue of the Zark APL Tutor Newsletter is out, with the results of the Permutations and Anagrams programming contest.

Published solutions include two interesting-but-not-practical nonlooping versions from Georges Brigham and J. Philip Benkard (originally printed in the Big APL Newsletter published by NY/SIGAPL; contact jhlaboyd@ix.netcom.com for info), two recursive versions from John Sullivan and Ram Biyani (who contributed solutions to c.l.a), and the "c.l.a group effort" that I sent in. The recursive versions gained speed by having the results for PERM 1, 2, and 3 embedded in the functions, which made Ram Biyani's function the fastest for arguments <6. The group effort was the fastest for arguments >6. Adding precomputation to the group effort yields the following "ultimate" APL version:

{del} Z{<-}PERM N;A;I
[1]  @Returns a (!{omega})-by-{omega} matrix of all permutations of {iota}{omega}
[2]  {->}(N{>=}3 2 1)/L3,L2,L1
[3]  Z{<-}1 0{rho}0
[4]  {->}0
[5]  L1:Z{<-}1 1{rho}#IO
[6]  {->}0
[7]  L2:Z{<-}2 2{rho}#IO+0 1 1 0
[8]  {->}0
[9]  L3:Z{<-}6 3{rho}#IO+0 1 2 0 2 1 1 0 2 1 2 0 2 0 1 2 1 0
[10] I{<-}3
[11] L4:{->}(N<I{<-}I+1)/L5
[12] A{<-}+{backslashbar}({iota}I),[#IO]1,(I-1 1){rho}I{take}{neg}1
[13] Z{<-}A[;#IO,1+Z]
[14] {->}L4
[15] L5:Z{<-}((!N),N){rho}Z
{del}

Gary printed only the simplest version of ANAGRAM (the version that sorts the matrix if there are any duplicates in the letter list), even though sorting the result takes more time than generating the permutation matrix. Maybe he didn't trust Jeff Chilton's clever and fast version (see my 14 Jun 1995 posting here) or maybe there wasn't enough space in the newsletter.

Jim

The November 1995 issue of Dr. Dobb's Journal has an article by Mani Iyer entitled "Permutation Generation Using Matrices" (p. 133). Unfortunately, the algorithm described in the article doesn't actually work. I mentioned this to Mani and he sent me a corrected version. The algorithm is not as fast (in APL, at least) as the algorithms described here in June, the result isn't in sorted order, and it doesn't work for N{<=}2. But the technique is intriguing because it builds a set of square matrices and extracts permutations by reading off the rows and columns of the matrices. Half the permutations are found in the rows and half in the columns! An APL implementation is appended below.

The article has a pointer to an interesting paper: "Permutation
Generation Methods", by Robert Sedgewick, in *Computing Surveys*, Vol.
9, No. 2, June 1977. (Incidentally, Roger Hui's indexing algorithm is
not among those listed.) The paper describes an algorithm, attributed
to D. N. Lehmer in 1906, for generating the i-th lexical permutation of {iota}N
(function LEXPERM, below). I had wondered how J's "atomic permutation"
function was able to generate the i-th lexical permutation so quickly.
It's simple enough to invert the algorithm to produce a function similar
to J's "atomic index" function (PERMNDX below).

Note: These programs assume `#IO` is 1.
For you IBMers, `{commabar}` is `,[#IO]`.

{del} Z{<-}PERMDD3 N;I;S
[1]  @Returns a (!{omega})-by-{omega} matrix of all permutations of {iota}{omega}
[2]  @ Algorithm by Mani Iyer, a corrected version of the technique
[3]  @ described in the Nov 95 issue of Dr. Dobbs on page 133.
[4]  @
[5]  Z{<-}1 2{rho}{iota}2
[6]  I{<-}2
[7]  L1:{->}(N<I{<-}I+1)/L2
[8]  Z{<-}I{slashbar}Z,I
[9]  Z{<-}((1{take}{rho}Z){rho}-{iota}I){rotate}Z
[10] {->}L1
[11] L2:S{<-}((1{take}{rho}Z){divide}N),N,N
[12] Z{<-}Z{commabar}({rho}Z){rho}1 3 2{transpose}S{rho}Z
[13] @ Z{<-}Z[{gradeup}Z;]  @ sort result if desired
{del}

LEXPERM1 and PERMNDX1 are simple versions that process only a single permutation. LEXPERM and PERMNDX are more complicated parallel versions that accept or return an array of permutation indices. Some examples:

      1 LEXPERM 7       @ first permutation (6 and 7 are reversed)
1 2 3 4 5 7 6
      P{<-}0 1000 5039 LEXPERM 7
      P
1 2 3 4 5 6 7           @ 0th permutation (no change)
2 4 3 6 7 1 5           @ 1000th
7 6 5 4 3 2 1           @ 5039th (last one, number (!7)-1)
      PERMNDX P
0 1000 5039

{del} Z{<-}J LEXPERM1 N;C;I
[1]  @Returns the {alpha}-th lexical permutation of {iota}{omega}
[2]  @ From an algorithm by D. N. Lehmer [1906]
[3]  Z{<-},1
[4]  C{<-}1+({reverse}1+{iota}N-1){represent}J
[5]  I{<-}N
[6]  L1:{->}(0=I{<-}I-1)/0
[7]  Z{<-}C[I],Z+Z{>=}C[I]
[8]  {->}L1
{del}

{del} Z{<-}PERMNDX1 P;N
[1]  @Returns the lexical index of permutation {omega}
[2]  N{<-}{rho}P{<-},P
[3]  Z{<-}{iota}0
[4]  L1:{->}(1{>=}{rho}P)/L2
[5]  Z{<-}Z,1{take}P
[6]  P{<-}1{drop}P-P>1{take}P
[7]  {->}L1
[8]  L2:Z{<-}({reverse}1+{iota}N-1){basevalue}Z-1
{del}

{del} Z{<-}J LEXPERM N;C;I
[1]  @Returns the {alpha}-th lexical permutations of {iota}{omega}
[2]  @ N must be a scalar, but J may have any shape.  {rho}Z {<-}{->} ({rho}J),N
[3]  @ From an algorithm by D. N. Lehmer [1906]
[4]  Z{<-}(1,{times}/{rho}J){rho}1
[5]  C{<-}1+({reverse}1+{iota}N-1){represent},{transpose}J
[6]  I{<-}N
[7]  L1:{->}(0=I{<-}I-1)/L2
[8]  Z{<-}C[I;]{commabar}Z+Z{>=}({rho}Z){rho}C[I;]
[9]  {->}L1
[10] L2:Z{<-}{transpose}((1{take}{rho}Z),{reverse}{rho}J){rho}Z
{del}

{del} Z{<-}PERMNDX P;S
[1]  @Returns the lexical indices of permutations {omega}
[2]  @ P may be a vector containing a single permutation, or an array
[3]  @ having a permutation in each row.  {rho}Z {<-}{->} {neg}1{drop}{rho}P
[4]  S{<-}{rho}P
[5]  P{<-}{transpose}(({times}/{neg}1{drop}S),{neg}1{take}1,S){rho}P
[6]  Z{<-}(0 1{times}{rho}P){rho}0
[7]  L1:{->}(1{>=}1{take}{rho}P)/L2
[8]  Z{<-}Z{commabar}P[1;]
[9]  P{<-}1 0{drop}P-P>({rho}P){rho}P[1;]
[10] {->}L1
[11] L2:Z{<-}({neg}1{drop}S){rho}({reverse}1+{iota}{neg}1+{neg}1{take}1,S){basevalue}Z-1
{del}
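The core of LEXPERM1 and PERMNDX1 can be restated as a 0-origin Python sketch (hypothetical names; the mixed-radix {represent}/{basevalue} pair becomes divmod arithmetic):

```python
from itertools import permutations

def lexperm(j, n):
    """Lehmer [1906], 0-origin: the j-th lexical permutation of 0..n-1
    from the factorial-base digits of j, using the same recoding
    insertion as LEXPERM1 line [7]."""
    digits = []
    for radix in range(2, n + 1):       # least-significant digit first
        j, d = divmod(j, radix)
        digits.append(d)
    z = [0]
    for d in digits:                    # build from the innermost position out
        z = [d] + [v + (v >= d) for v in z]
    return z

def permndx(p):
    """The inverse (cf. PERMNDX1): the 0-origin lexical index of a
    permutation p of 0..n-1."""
    idx = 0
    for i, v in enumerate(p):
        # digit i counts how many later elements are smaller than p[i]
        idx = idx * (len(p) - i) + sum(w < v for w in p[i + 1:])
    return idx
```

The two are mutual inverses, and `lexperm` enumerates permutations in the same order `itertools.permutations` does.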
