Ruzzo–Tompa algorithm

The Ruzzo–Tompa algorithm is a linear-time algorithm for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers.^[1] It improves on previously known quadratic-time algorithms. The highest-scoring subsequence in the set produced by the algorithm is also a solution to the maximum subarray problem.

The Ruzzo–Tompa algorithm has applications in bioinformatics, web scraping, and information retrieval.

Applications

Bioinformatics

The Ruzzo–Tompa algorithm has been used in bioinformatics tools to study biological data. The problem of finding disjoint maximal subsequences is of practical importance in the analysis of DNA. Maximal-subsequence algorithms have been used in the identification of transmembrane segments and the evaluation of sequence homology.^[2]

Problem Definition

The problem of finding all maximal subsequences is defined as follows: Given a list of real-valued scores $x_{1},x_{2},\ldots,x_{n}$, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence is $S_{i,j}=\sum_{k=i}^{j}x_{k}$. The subsequences must be disjoint (non-overlapping) and have a positive score.
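As a minimal illustration of the scoring definition, the following Python sketch computes $S_{i,j}$ with 1-based, inclusive indices, as in the formula above. The function name `segment_score` is hypothetical and chosen for this example only.

```python
def segment_score(x, i, j):
    """Return S_{i,j} = x_i + ... + x_j using 1-based, inclusive indices."""
    if not 1 <= i <= j <= len(x):
        raise ValueError("indices must satisfy 1 <= i <= j <= n")
    # Python lists are 0-based, so x_i is stored at x[i - 1].
    return sum(x[i - 1:j])


scores = [4, -5, 3, -2, 1, 2]
print(segment_score(scores, 1, 1))  # 4
print(segment_score(scores, 3, 6))  # 3 - 2 + 1 + 2 = 4
print(segment_score(scores, 1, 6))  # 3
```

Here the whole list scores 3, while the two disjoint subsequences $x_{1}$ and $x_{3},\ldots,x_{6}$ each score 4.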

Algorithm

There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use an existing linear-time algorithm to find the maximum subsequence (see maximum subarray problem) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of Quicksort: the maximum subsequence could be small in comparison to the rest of the sequence, leading to a running time of $O(n^{2})$ in the worst case.
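The recursive approach described above can be sketched in Python as follows. This is an illustration, not the Ruzzo–Tompa algorithm itself: the function names `max_subarray` and `maximal_by_recursion` are hypothetical, the maximum subsequence is found with a Kadane-style scan, and the handling of ties and zero-sum prefixes (the first and shortest maximum is kept) is an assumption of this sketch.

```python
def max_subarray(scores, lo, hi):
    """Kadane-style scan of scores[lo:hi].

    Returns (best_sum, start, end) with end exclusive; best_sum is 0 and the
    range is empty when no positive-scoring subsequence exists.
    """
    best_sum, best_start, best_end = 0, lo, lo
    run_sum, run_start = 0, lo
    for i in range(lo, hi):
        if run_sum <= 0:
            # A non-positive running total cannot help; restart at i.
            run_sum, run_start = scores[i], i
        else:
            run_sum += scores[i]
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_sum, best_start, best_end


def maximal_by_recursion(scores, lo=0, hi=None):
    """Find the maximum subsequence, then recurse on both sides of it."""
    if hi is None:
        hi = len(scores)
    best_sum, start, end = max_subarray(scores, lo, hi)
    if best_sum <= 0:
        return []
    return (maximal_by_recursion(scores, lo, start)
            + [scores[start:end]]
            + maximal_by_recursion(scores, end, hi))


print(maximal_by_recursion([4, -5, 3, -2, 1, 2]))  # [[4], [3, -2, 1, 2]]
```

Each call scans its whole range, so when the maximum subsequence repeatedly leaves most of the range on one side, the total work grows quadratically. The recursion depth can also grow linearly with the input, which matters for long inputs in Python.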

The standard implementation of the Ruzzo–Tompa algorithm runs in $O(n)$ time and uses $O(n)$ space, where $n$ is the length of the list of scores. The algorithm uses dynamic programming to build the final solution incrementally, solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows:

Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list $I_{1},I_{2},\ldots,I_{j}$ of disjoint subsequences. For each subsequence $I_{j}$, record the cumulative total $L_{j}$ of all scores up to but not including the leftmost score of $I_{j}$, and the total $R_{j}$ up to and including the rightmost score of $I_{j}$.
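To make the bookkeeping concrete, the following Python sketch computes $L_{j}$ and $R_{j}$ for a given list of subsequences from the cumulative sums. The helper name `bounds` and the representation of each subsequence as a 0-based `(start, end)` pair with `end` exclusive are assumptions of this example.

```python
from itertools import accumulate


def bounds(scores, intervals):
    """Return the lists L and R for subsequences given as (start, end) pairs."""
    # prefix[t] is the sum of the first t scores, so prefix[0] == 0.
    prefix = [0] + list(accumulate(scores))
    L = [prefix[start] for start, _ in intervals]  # total before leftmost score
    R = [prefix[end] for _, end in intervals]      # total through rightmost score
    return L, R


scores = [4, -5, 3, -2, 1, 2]
intervals = [(0, 1), (2, 6)]  # the subsequences [4] and [3, -2, 1, 2]
L, R = bounds(scores, intervals)
print(L, R)  # [0, -1] [4, 3]
```

The score of each subsequence is then $R_{j}-L_{j}$, which is 4 for both subsequences in this example.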

The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores require no special processing, so the next score is read. A positive score is incorporated into a new subsequence $I_{k}$ of length one that is then integrated into the list by the following process.

1. The list $I$ is searched from right to left for the maximum value of $j$ satisfying $L_{j}<L_{k}$.
2. If there is no such $j$, then add $I_{k}$ to the end of the list.
3. If there is such a $j$, and $R_{j}\geq R_{k}$, then add $I_{k}$ to the end of the list.
4. Otherwise (i.e., there is such a $j$, but $R_{j}<R_{k}$), extend the subsequence $I_{k}$ to the left to encompass everything up to and including the leftmost score in $I_{j}$. Delete subsequences $I_{j},I_{j+1},\ldots,I_{k-1}$ from the list, and append $I_{k}$ to the end of the list. Reconsider the newly extended subsequence $I_{k}$ (now renumbered $I_{j}$) as in step 1.

Once the end of the input is reached, all subsequences remaining on the list $I$ are maximal.^[2]

As an example of the algorithm running, consider the input score list $[4,-5,3,-2,1,2]$. List positions $j$ below are counted from 0.

On input $4$, step 1 finds no satisfying $j$, so step 2 applies and $[4]$ is appended to $I$, giving $I=[[4]],R=[4],L=[0]$. The input $-5$ is skipped and the input $3$ is read, with $L[k]=-1$ and $R[k]=2$. In step 1, no satisfying $j$ is found, so $[3]$ is appended to $I$, giving $I=[[4],[3]],R=[4,2],L=[0,-1]$.

The input $-2$ is skipped and the input $1$ is read, with $L[k]=0$ and $R[k]=1$. In step 1, $j=1$ is found to satisfy $L[j]<L[k]$. Since $R[j]\geq R[k]$, step 3 applies and $[1]$ is appended to $I$, giving $I=[[4],[3],[1]],R=[4,2,1],L=[0,-1,0]$.

On input $2$, with $L[k]=1$ and $R[k]=3$, step 1 finds $j=2$ satisfying $L[j]<L[k]$. Since $R[j]<R[k]$, step 3 does not apply and step 4 does: $I[k]$ is extended to the left to cover $I[j]$, and $I[j]$ is removed from $I$, so now $I=[[4],[3]],R=[4,2],L=[0,-1]$ and the extended subsequence is $I[k]=[1,2]$ with $L[k]=0$ and $R[k]=3$. Reconsidering it in step 1, $j=1$ satisfies $L[j]<L[k]$, and again $R[j]<R[k]$, so step 4 applies: $I[k]$ is extended to the left to cover $I[j]$, and $I[j]$ is removed from $I$, so now $I=[[4]],R=[4],L=[0]$ and $I[k]=[3,-2,1,2]$ with $L[k]=-1$ and $R[k]=3$. In step 1, no satisfying value of $j$ is found, so $[3,-2,1,2]$ is appended to $I$. The end of the input has been reached, so the final value of $I$ is $[[4],[3,-2,1,2]]$.

The following Python code implements the Ruzzo-Tompa algorithm:

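def RuzzoTompa(scores):
    """Return the maximal scoring subsequences of scores as a list of lists."""
    n = len(scores)
    # Preallocate parallel arrays of size n; only the first k entries are live.
    I = [None] * n     # (start, end) index pairs into scores, end exclusive
    L = [0] * n        # cumulative total before the leftmost score of I[j]
    R = [0] * n        # cumulative total through the rightmost score of I[j]
    k = 0              # number of subsequences currently on the list
    total = 0          # cumulative sum of the scores read so far
    for i, s in enumerate(scores):
        total += s
        if s <= 0:
            continue   # nonpositive scores need no special processing
        # Start a new candidate subsequence of length one at position k.
        I[k] = (i, i + 1)
        L[k] = total - s
        R[k] = total
        while True:
            # Step 1: scan right to left for the largest j with L[j] < L[k].
            maxj = None
            for j in range(k - 1, -1, -1):
                if L[j] < L[k]:
                    maxj = j
                    break
            if maxj is not None and R[maxj] < R[k]:
                # Step 4: extend left to the start of I[maxj], drop the
                # entries in between, and reconsider the merged subsequence.
                I[maxj] = (I[maxj][0], i + 1)
                R[maxj] = total
                k = maxj
            else:
                # Steps 2 and 3: keep I[k] at the end of the list.
                k += 1
                break
    # Recover the subsequences from the stored index pairs.
    return [scores[start:end] for start, end in I[:k]]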

References

[1] Ruzzo, Walter L.; Tompa, Martin (1999). "A Linear Time Algorithm for Finding All Maximal Scoring Subsequences". Proceedings. International Conference on Intelligent Systems for Molecular Biology: 234–241. PMID 10786306.
[2] Karlin, S; Altschul, SF (Jun 15, 1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. PMID 8390686.
