From: jimw@math.umass.edu (Jim Weigang)
Newsgroups: comp.lang.apl
Date: 5 Oct 1994 01:19:32 GMT
Subject: Re: Puzzler: Comparing Functions

After years of using a very crude comparison program (which just showed lines in version 1 not found in version 2 and vice versa), a friend sent me a paper by Webb Miller and Eugene Myers entitled "A File Comparison Program", from Software--Practice and Experience, Vol. 15(11), pp 1025-1040 (Nov 1985). Here's the abstract:

This paper presents a simple method for computing a shortest sequence of insertion and deletion commands that converts one given file to another. The method is particularly efficient when the difference between the two files is small compared to the files' lengths. In experiments performed on typical files, the program often ran four times faster than the UNIX diff command.

Notice that the output from this program is the (or one of the) shortest possible sequence of insert/delete commands. No hokey edge conditions here; this is the real stuff! I translated the published C program to APL and used it directly for a while. Later, I recoded the iterative core of the program in 80386 assembler, and it now runs nice and fast.

The output format is different from what Zark is asking for in their puzzler. Here's a sample of the format I settled on:

-[15]  old line deleted from version 1
-[16]  another deleted line
+[15]  replacement line in version 2
----------
-[30]  deleted
----------
+[40]  added

A "-" in the first column indicates a line found only in the first version; a "+" marks a line from the second version. Horizontal lines separate groups of changes, indicating unchanged lines common to the two versions.

I've used this program a lot for a couple of years, and I'm generally pleased with the results. There are times, however, when it is hard to tell what has changed when looking at only the original version and the difference. (As happens when there are more than two versions of a program and you're trying to merge multiple differences into a single version.) In these situations, I would have found it helpful for the program to display not just the changed lines, but also a few lines above and below the changed lines to give some idea of the context. This could be done pretty easily, as the output formatting code is just simple APL.

The pre-assembler version of the program is listed below. As usual, it's written in plain vanilla APL [except for a few diamonds (&) and a dyadic grade to sort a character matrix in MATIOTA], so everyone should be able to run it. By the way, use an assembler version of MATIOTA (aka ROWFIND) if you have one. #IO is assumed to be 1. APL*PLUS and APL2 users can translate the listing below back to APL symbols using the APLASCII workspace available via ftp from watserv1.waterloo.edu.

I checked with Gene Myers, and he says that the C program is uncopyrighted and that a more recent version of the algorithm is used in the GNU diff program.

Jim

      {del}X MMCOMP Y;A;B;C;D;DL;I;K;KS;L;LD;M;N;O;P;Q;R;T;U;CHG;CHGi;CHGn;OK
[1]    @Displays the differences between matrices {alpha} and {omega}
[2]    @ Uses the algorithm described in the article "A File Comparison
[3]    @    Program" by Webb Miller and Eugene W. Myers in "Software--Practice
[4]    @    and Experience", Vol. 15(11), pp. 1025-1040 (Nov. 1985).
[5]    @
[6]    DL{<-}X OVER Y
[7]    DL{<-}((DL MATIOTA DL)={iota}1{take}{shape}DL){slashbar}DL@distinct {+
      +}lines
[8]    A{<-}DL MATIOTA X@code the arguments as integers
[9]    B{<-}DL MATIOTA Y
[10]   M{<-}''{reshape}{shape}A & N{<-}''{reshape}{shape}B
[11]   C{<-}+/^\{reverse}((-M{min}N){take}A)=(-M{min}N){take}B@number of {+
      +}identical lines at end
[12]   M{<-}M-C & N{<-}N-C@drop 'em off
[13]   KS{<-}LD{<-}(2+M+N){reshape}0
[14]   O{<-}1+M@origin corrector for indexing KS and LS which go [-M..N+1]
[15]   @ The indexing should go from [-M..N], but if M and N are both 1,
[16]   @   the program for some reason examines diagonal 2.
[17]   CHG{<-}1000 3{reshape}1{take}0 {neg}1 & CHGn{<-}''{reshape}{shape}{+
      +}CHG & CHGi{<-}0
[18]   @ CHG is a linked list of lines to insert/delete
[19]   @        CHG[;1]  is the row number in A
[20]   @               - means delete this row of A
[21]   @               + means insert a row of B after this row
[22]   @        CHG[;2]  is the row number of B for insertion
[23]   @        CHG[;3]  is the pointer to the previous entry in the list
[24]   @               (a row index into CHG)
[25]   @ CHGn is the number of rows in CHG
[26]   @ CHGi is the pointer to the last row of CHG used so far
[27]   @ CHG, CHGn, and CHGi are accessed globally by MMAPPEND
[28]   @ KS is pointers into CHG (linked list heads)
[29]   @
[30]   R{<-}{neg}1+(((M{min}N){take}A){/=}(M{min}N){take}B){iota}1@point to {+
      +}last identical line
[31]   LD[O]{<-}R
[32]   L{<-}({neg}1 1)[1+R=M]
[33]   U{<-}(1 {neg}1)[1+R=N]
[34]   D{<-}0
[35]   {->}(~L>U)/L1
[36]   #{<-}'No differences.'
[37]   {->}0
[38]
[39]   @--------------------------------------------------------------------{+
      +}---
[40]   @ The following code is written in assembler for the production {+
      +}version:
[41]   @
[42]  L1:D{<-}D+1@Loop for each D level
[43]   K{<-}L-2 & OK{<-}O+K
[44]  L2:{->}(U<K{<-}K+2)/L9@   Loop for each diagonal L..U
[45]   OK{<-}O+K@      shift to use as index for LD and KS
[46]
[47]   @ Locate the last d on diagonal k
[48]   Q{<-}LD[OK+1]
[49]   {->}(K=-D)/L3
[50]   P{<-}LD[OK-1]
[51]   {->}((K=D){or}Q<P)/L4
[52]  L3:R{<-}Q+1
[53]   MMAPPEND(-R),0,KS[OK+1]
[54]   {->}L5
[55]  L4:R{<-}P
[56]   MMAPPEND R,(R+K),KS[OK-1]
[57]  L5:KS[OK]{<-}CHGi
[58]   C{<-}R+K
[59]  L6:{->}((R{>=}M){or}C{>=}N)/L7
[60]   {->}(A[P{<-}R+1]{/=}B[Q{<-}C+1])/L7
[61]   R{<-}P & C{<-}Q
[62]   {->}L6
[63]  L7:LD[OK]{<-}R
[64]
[65]   {->}((R=M)^C=N)/L10@jump if done
[66]   {->}(R{/=}M)/L8 & L{<-}K+2
[67]  L8:{->}(C{/=}N)/L2 & U{<-}K-2
[68]   {->}L2
[69]  L9:L{<-}L-1 & U{<-}U+1
[70]   {->}L1
[71]   @
[72]   @ End of assember-coded section
[73]   @--------------------------------------------------------------------{+
      +}---
[74]
[75]   @ Found a solution; reverse the linked list
[76]  L10:I{<-}KS[OK]
[77]   N{<-}0@end marker
[78]  L11:T{<-}CHG[I;3]@previous item
[79]   CHG[I;3]{<-}N@make this one point to the next
[80]   N{<-}I@this one will be the next item on the next pass
[81]   I{<-}T@move to previous item
[82]   {->}(I>0)/L11@quit if at end
[83]   I{<-}N@new head of list
[84]
[85]   @ Display the results
[86]   P{<-}(|CHG[I;1])+{neg}1{times}0>CHG[I;1]
[87]  L12:{->}(P{>=}(|CHG[I;1])+{neg}1{times}0>CHG[I;1])/L13
[88]   #{<-}30{reshape}'-'@put divider line between blocks
[89]  L13:P{<-}|CHG[I;1]
[90]   {->}(CHG[I;1]<0)/L14
[91]   #{<-}DTB'+',(6{take}'[',({format}{neg}1+CHG[I;2]),']'),Y[CHG[I;2];]
[92]   {->}L15
[93]  L14:#{<-}DTB'-',(6{take}'[',({format}{neg}1+-CHG[I;1]),']'),X[-CHG[I;{+
      +}1];]
[94]  L15:I{<-}CHG[I;3]@next in chain
[95]   {->}(I{/=}0)/L12
[96]
[97]   @ Copyright (c) 1993 by Jim Weigang.  Permission is granted to use
[98]   @ and reproduce this program, provided it is not sold for profit.
      {del}

      {del}MMAPPEND A
[1]    @Appends {omega} to the change list
[2]    {->}(CHGn{>=}CHGi{<-}CHGi+1)/L1@If no more room,
[3]    CHG{<-}(1000 0+{shape}CHG){take}CHG@   add a bunch more entries
[4]    CHGn{<-}''{reshape}{shape}CHG
[5]   L1:CHG[CHGi;]{<-}A
      {del}

      {del}Z{<-}A OVER B;S
[1]    @Forms a matrix with {alpha} over {omega}
[2]    A{<-}(S{<-}{neg}2{take}1 1,{shape}A){reshape}A
[3]    S{<-}0 1{times}S{max}{shape}B{<-}({neg}2{take}1 1,{shape}B){reshape}B
[4]    Z{<-}((S{max}{shape}A){take}A),[#IO](S{max}{shape}B){take}B
      {del}

      {del}Z{<-}A MATIOTA B;C;F;N;P;R;T
[1]    @Returns row indices of names {omega} in lookup table {alpha}, or 0 {+
      +}if not found
[2]    C{<-}A OVER B @join arguments
[3]    C{<-}C[P{<-}#AV{gradeup}C;] @alphabetize the rows
[4]    F{<-}{or}/C{/=}{neg}1{rotatebar}C @mark the first occurrence of each {+
      +}different row
[5]    F[{iota}{signum}{shape}F]{<-}1 @insure that the first element is 1
[6]    T{<-}F/P @indices of where the firsts occur
[7]    N{<-}T-{neg}1{drop}0,T @number of occurrences of each distinct row
[8]    Z{<-}+\F\N @make (N[1]{reshape}N[1]),(N[2]{reshape}N[2]),...
[9]    Z[P]{<-}Z @return to original ordering
[10]   R{<-}1{take}{neg}2{take}1 1,{shape}A @number of rows in A
[11]   Z{<-}R{drop}Z @drop off answers for rows of A
[12]   Z{<-}Z{times}Z{<=}R @convert out-of-range numbers to 0
[13]   Z{<-}((1<{shape}{shape}B)/{shape}Z){reshape}Z @return scalar if B {+
      +}was a vector
[14]   @ Copyright (c) 1994 Jim Weigang
      {del}

      {del}Z{<-}DTB A;B
[1]    @Deletes trailing blanks in vector or matrix {omega}
[2]    {->}(2>{shape}{shape}B{<-}A{/=}' ')/L1
[3]    B{<-}{or}{slashbar}B
[4]   L1:Z{<-}({reverse}{or}\{reverse}B)/A
      {del}


Home Page