Wednesday, May 9, 2012

How is diff implemented in Unix / How does it work?

diff essentially solves the classical computer science problem of finding the longest common subsequence (LCS).

The LCS problem has an optimal substructure: the problem can be broken down into smaller, simple "subproblems", which can be broken down into yet simpler subproblems, and so on, until, finally, the solution becomes trivial. The LCS problem also has overlapping subproblems: the solution to a higher subproblem depends on the solutions to several of the lower subproblems. Problems with these two properties—optimal substructure and overlapping subproblems—can be approached by a problem-solving technique called dynamic programming, in which the solution is built up starting with the simplest subproblems. The procedure requires memoization—saving the solutions to one level of subproblem in a table (analogous to writing them to a memo, hence the name) so that the solutions are available to the next level of subproblems. 

diff internally creates an edit distance table. i.e the line by line difference across two input files, and calculates the minimum number of changes required to translate one file into the other. A very nice and simple implementation. I found a very nice and simple implementation of calculating edit distance between two strings online:


#include 

#define MAXLEN 80

int findMin(int d1, int d2, int d3) {
   /*
    * return min of d1, d2 and d3.
    */
   if(d1 < d2 && d1 < d3)
       return d1;
   else if(d1 < d3)
       return d2;
   else if(d2 < d3)
       return d2;
   else
      return d3;
}

int findEditDistance(char *s1, char *s2) {
    /*
     * returns edit distance between s1 and s2.
     */
   int d1, d2, d3;

   if(*s1 == 0)
       return strlen(s2);
   if(*s2 == 0)
       return strlen(s1);
   if(*s1 == *s2)
       d1 = findEditDistance(s1+1, s2+1);
   else
       d1 = 1 + findEditDistance(s1+1, s2+1);    // update.
   d2 = 1+findEditDistance(s1, s2+1);                   // insert.
   d3 = 1+findEditDistance(s1+1, s2);                   // delete.

   return findMin(d1, d2, d3);
}

int main() {
    char s1[MAXLEN], s2[MAXLEN];

    printf("Enter string 1: ");
    gets(s1);

    while(*s1) {
        printf("Enter string 2: ");
        gets(s2);
        printf("Edit distance(%s, %s) = %d.\n", s1, s2, findEditDistance(s1, s2));
        printf("Enter string 1(enter to end): ");
        gets(s1);
    }

    return 0;
}


The basic algorithm for diff is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, 'Algorithmica' Vol. 1 No. 2, 1986, pp. 251-266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, 'Software--Practice and Experience' Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, `Information and Control' Vol. 64, 1985, pp. 100-118

0 comments:

Post a Comment