Jump to content

Hirschberg's algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Flavio Ribeiro (talk | contribs) at 00:21, 2 January 2013 (split long sentence in the introduction). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a divide and conquer version of the Needleman–Wunsch algorithm.[1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

Algorithm information

Hirschberg's algorithm is a generally applicable algorithm for finding an optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman-Wunsch algorithm finds an optimal alignment in O(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman-Wunsch Algorithm which still takes O(nm) time, but needs only O(min{n,m}) space.[2] One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sets of data such as with the common diff tool.

Computation of the Levenshtein edit distance in linear space

The edit distance Edit(x,y) is the least cost of changing x into y by using the operations Insert, Substitute, and Delete, where the cost of each kind of operation is given. We write Ins(a) for the cost of inserting a symbol a into a string, Sub(a,b) for the cost of substituting a symbol b for a in a string, and Del(a) for the cost of deleting a symbol a from a string. The "standard" choice of those costs is Ins(a) = Del(a) = 1 for each symbol a, Sub(a,a) = 0, and Sub(a,b) = 1 if a is not equal to b. The Needleman-Wunsch algorithm computes Edit(x,y) by computing Edit(Prefix[x,i],Prefix[y,j]) for all i and j dynamically, where Prefix[x,i] denotes the prefix of x of length i. That algorithm requires O(nm) time and O(nm) space, where n = length(x) and m = length(y).

Algorithm organization

To understand Hirschberg's algorithm, it is important to first understand that edit distances can be computed using linear space.

What we call the "forward subprogram" computes the values of Edit(Prefix[x,i],Prefix[y,j]) for all i and j, just as the Needleman-Wunsch and returns the array {Edit(x,Prefix[y,j])}0 ≤ j ≤ m. The "backward subprogram" is similar, except that the dynamic program is done in the opposite direction, i.e., starting from the right ends of the strings. It returns the array {Edit(x,Suffix[y,j])}0 ≤ j ≤ m, where Suffix[y,j] is the suffix of y of length j.

The forward and backward subprograms compute values in the matrix by using previously computed values, but save space by erasing values that will no longer be needed during that execution of the subprogram. Unfortunately, the erased values will need to be recomputed later; thus, Hirschberg's algorithm takes roughly twice as much time as Needleman-Wunsch.

Pseudocode

High-level description of the forwards subprogram

Forwards[x,y] is

    1.  n = length(x);  m = length(y)
    2.  Edit[Prefix[x,0],Prefix[y,0]] = ;
    3.  For all j from 1 to m:
           Edit[Prefix[x,0],Prefix[y,j]] = Edit[Prefix[x,0],Prefix[y,j-1]] + Ins(y_j)
    4.  For all i from 1 to n:
           A.  Edit[Prefix[x,i],Prefix[y,0]] = Edit[Prefix[x,i-1],Prefix[y,0]] + Del(x_i)
           B.  For all j from 1 to m, execute the following steps:
                   i.  Edit[Prefix[x,i],Prefix[y,j]] =
                         min{Edit[Prefix[x,i-1],Prefix[y,j]] + Del(x_i),
                             Edit[Prefix[x,i-1],Prefix[y,j-1]] + Sub(x_i,y_j),
                             Edit[Prefix[x,i],Prefix[y,j-1]] + Ins(y_j)}
                   ii.  Erase Edit[Prefix[x,i-1],Prefix[y,j-1]]
           C.  Erase Edit[Prefix[x,i-1],Prefix[y,m]]
    5.  RETURN Edit[x]  %% an array of length m+1

High-level description of the backwards subprogram

Backwards[x,y] is

    1.  n = length(x);  m = length(y)
    2.  Edit[Suffix[x,0],Suffix[y,0]] = 0;
    3.  For all j from 1 to m:
            Edit[Suffix[x,0],Suffix[y,j]] = Edit[Suffix[x,0].Suffix[y,j-1]] + Ins(y_{m-j+1});
    4.  For all i from 1 to n:
          A.  Edit[Suffix[x,i],Suffix[y,0]] = Edit[Suffix[x,i-1],Suffix[y,0]] + Del(x_{n-i-1})
          B.  For all j from 1 to m:
                    i.  Edit[Suffix[x,i],Suffix[y,j]] =
                         min{Edit[Suffix[x,i-1],Suffix[y,j]] + Del(x_{n-i-1}),
                             Edit[Suffix[x,i-1],Suffix[y,j-1]] + Sub(x_{n-i-1},y_{m-j+1}),
                             Edit[Suffix[x,i],Suffix[y,j-1]] + Ins(y_{m-j+1})}
                   ii.  Erase Edit[Suffix[x,i-1],Suffix[y,j-1]]
          C.  Erase Edit[Suffix[x,i-1],Suffix[y,m]]
    5.  RETURN Edit[x] %% an array of length m+1

High level description of Hirschberg's algorithm:

Hirschberg(x,y) is

1.  n = length(x);  m = length(y)
2.  If n <= 1 or m <= 1:
       OUTPUT Alignment(x,y) using Needleman-Wunsch.
    Else:
      A.  mid = floor(n/2)
      B.  x_left = Prefix[x,mid]
      C.  x_right = Suffix[x,n-mid]
      D.  Edit[x_left] = Forwards(x_left,y)  %% an array of length m+1
      E.  Edit[x_right] = Backwards(x_right,y) %% an array of length m+1
      F.  cut = ArgMin{Edit[x_left,Prefix[y,j]] + Edit[x_right,Suffix[y,m-j]]}  %% j ranges from 1 to m-1
      G.  Hirschberg(x_left,Prefix[y,cut])
      H.  Hirschberg(x_right,Suffix[y,m-cut])

See also

References