Wednesday, May 9, 2012

How is diff implemented in Unix / How does it work?

diff essentially solves the classical computer science problem of finding the longest common subsequence (LCS).

The LCS problem has an optimal substructure: the problem can be broken down into smaller, simple "subproblems", which can be broken down into yet simpler subproblems, and so on, until, finally, the solution becomes trivial. The LCS problem also has overlapping subproblems: the solution to a higher subproblem depends on the solutions to several of the lower subproblems. Problems with these two properties—optimal substructure and overlapping subproblems—can be approached by a problem-solving technique called dynamic programming, in which the solution is built up starting with the simplest subproblems. The procedure requires memoization—saving the solutions to one level of subproblem in a table (analogous to writing them to a memo, hence the name) so that the solutions are available to the next level of subproblems. 

diff internally creates an edit distance table. i.e the line by line difference across two input files, and calculates the minimum number of changes required to translate one file into the other. A very nice and simple implementation. I found a very nice and simple implementation of calculating edit distance between two strings online:


#define MAXLEN 80

int findMin(int d1, int d2, int d3) {
    * return min of d1, d2 and d3.
   if(d1 < d2 && d1 < d3)
       return d1;
   else if(d1 < d3)
       return d2;
   else if(d2 < d3)
       return d2;
      return d3;

int findEditDistance(char *s1, char *s2) {
     * returns edit distance between s1 and s2.
   int d1, d2, d3;

   if(*s1 == 0)
       return strlen(s2);
   if(*s2 == 0)
       return strlen(s1);
   if(*s1 == *s2)
       d1 = findEditDistance(s1+1, s2+1);
       d1 = 1 + findEditDistance(s1+1, s2+1);    // update.
   d2 = 1+findEditDistance(s1, s2+1);                   // insert.
   d3 = 1+findEditDistance(s1+1, s2);                   // delete.

   return findMin(d1, d2, d3);

int main() {
    char s1[MAXLEN], s2[MAXLEN];

    printf("Enter string 1: ");

    while(*s1) {
        printf("Enter string 2: ");
        printf("Edit distance(%s, %s) = %d.\n", s1, s2, findEditDistance(s1, s2));
        printf("Enter string 1(enter to end): ");

    return 0;

The basic algorithm for diff is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, 'Algorithmica' Vol. 1 No. 2, 1986, pp. 251-266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, 'Software--Practice and Experience' Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, `Information and Control' Vol. 64, 1985, pp. 100-118

Tuesday, May 8, 2012

What is memoization? Explanation with an example.

Memoization is the process of storing partially computed subproblems in a table to reduce the compute time of your problem.

We will explain the concept of memoization through a nice example from Wikipedia on fibonacci series:

Fibonacci sequence
Here is a na├»ve implementation of a function finding the nth member of the Fibonacci sequence, based directly on the mathematical definition:
   function fib(n)
       if n = 0 return 0
       if n = 1 return 1
       return fib(n − 1) + fib(n − 2)
Notice that if we call, say, fib(5), we produce a call tree that calls the function on the same value many different times:
  1. fib(5)
  2. fib(4) + fib(3)
  3. (fib(3) + fib(2)) + (fib(2) + fib(1))
  4. ((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
  5. (((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
In particular, fib(2) was calculated three times from scratch. In larger examples, many more values of fib, or subproblems, are recalculated, leading to an exponential time algorithm.
Now, suppose we have a simple map object, m, which maps each value of fib that has already been calculated to its result, and we modify our function to use it and update it. The resulting function requires only O(n) time instead of exponential time:
   var m := map(0 → 0, 1 → 1)
   function fib(n)
       if map m does not contain key n
           m[n] := fib(n − 1) + fib(n − 2)
       return m[n]
This technique of saving values that have already been calculated is called memoization; this is the top-down approach, since we first break the problem into subproblems and then calculate and store values.

Solution: function declaration isn't a prototype in C

This problem is seen when you declare a function separately for example:

int getData();

and then say somewhere within your code this function is defined as:

int getData() {

The reason for this error is that C treats a function defn  fun1() and fun1(void) as different functions. If you change the declaration above as:
int getData(void);

your problem will be solved.