
User:Xionbox/Algorithms


This page summarizes the algorithms and data structures one should know.

Analysis

Amortized analysis

Amortized analysis is a method of analyzing algorithms that considers the entire sequence of operations of the program. It allows for the establishment of a worst-case bound for the performance of an algorithm irrespective of the inputs by looking at all of the operations. At the heart of the method is the idea that while certain operations may be extremely costly in resources, they cannot occur at a high-enough frequency to weigh down the entire program because the number of less costly operations will far outnumber the costly ones in the long run, "paying back" the program over a number of iterations. It is particularly useful because it guarantees worst-case performance while accounting for the entire set of operations in an algorithm.
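To make the idea concrete, consider appending to a dynamic array that doubles its capacity when full: an individual append can cost Θ(n) for the copy, but any sequence of n appends performs O(n) element moves in total, so each append costs O(1) amortized. The following minimal sketch (illustrative, not a library implementation) shows where the costly and cheap cases occur:

// Minimal dynamic array whose append runs in amortized O(1) time.
// Doubling the capacity makes expensive copies rare: n appends perform
// at most about 2n element moves in total, so each append is O(1) amortized.
public class DynamicArray {
  private int[] data = new int[1];
  private int size = 0;

  public void append(int value) {
    if (size == data.length) {                  // costly case: Θ(size) copy
      int[] bigger = new int[data.length * 2];
      System.arraycopy(data, 0, bigger, 0, size);
      data = bigger;
    }
    data[size++] = value;                       // cheap case: Θ(1)
  }

  public int get(int i) { return data[i]; }
  public int size() { return size; }
}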

Big O

Big-O notation (Landau notation) describes the limiting behavior of the function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. Big O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation.

Notation | Name | Example
O(1) | constant | Determining if a number is even or odd; using a constant-size lookup table or hash table
O(log log n) | double logarithmic | Finding a key value in an array sorted on the keys with interpolation search
O(log n) | logarithmic | Finding an item in a sorted array with a binary search or a balanced search tree, as well as all operations in a binomial heap
O(n^c), 0 < c < 1 | fractional power | Searching in a kd-tree
O(n) | linear | Finding an item in an unsorted list or a malformed tree (worst case) or in an unsorted array; adding two n-bit integers by ripple carry
O(n log n) | linearithmic, loglinear, or quasilinear | Performing a fast Fourier transform; heapsort, quicksort (best and average case), or merge sort
O(n^2) | quadratic | Multiplying two n-digit numbers by a simple algorithm; bubble sort (worst case or naive implementation), Shell sort, quicksort (worst case), selection sort, or insertion sort
O(n^c), c > 1 | polynomial or algebraic | Tree-adjoining grammar parsing; maximum matching for bipartite graphs
L_n[α, c], 0 < α < 1 | L-notation or sub-exponential | Factoring a number using the quadratic sieve or number field sieve
O(c^n), c > 1 | exponential | Finding the (exact) solution to the traveling salesman problem using dynamic programming; determining if two logical statements are equivalent using brute-force search
O(n!) | factorial | Solving the traveling salesman problem via brute-force search; generating all unrestricted permutations of a poset; finding the determinant with expansion by minors

The statement "f(n) is O(n!)" is sometimes weakened to f(n) = O(n^n) to derive simpler formulas for asymptotic complexity. For any k > 0 and c > 0, O(n^c (log n)^k) is a subset of O(n^(c+ε)) for any ε > 0, so it may be considered as a polynomial with some bigger order.

NP Complexity

Diagram of complexity classes provided that P ≠ NP. The existence of problems within NP but outside both P and NP-complete, under this assumption, was established by Ladner.

NP is one of the most fundamental complexity classes. The abbreviation NP refers to "nondeterministic polynomial time".

Intuitively, NP is the set of all decision problems for which the instances where the answer is "yes" have efficiently verifiable proofs of the fact that the answer is indeed "yes." More precisely, these proofs have to be verifiable in polynomial time by a deterministic Turing machine. In an equivalent formal definition, NP is the set of decision problems where the "yes"-instances can be recognized in polynomial time by a non-deterministic Turing machine. The equivalence of the two definitions follows from the fact that an algorithm on such a non-deterministic machine consists of two phases, the first of which consists of a guess about the solution which is generated in a non-deterministic way, while the second consists of a deterministic algorithm which verifies or rejects the guess as a valid solution to the problem.
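As an illustration of the verifier view, consider subset sum: a "yes" instance comes with a short certificate (the subset itself) that a deterministic machine can check in polynomial time, even though no polynomial-time way to find such a subset is known. The sketch below is illustrative; the names are my own:

// Polynomial-time verifier for subset sum: given the instance (numbers,
// target) and a proposed certificate (indices of a subset), checking the
// claim takes one linear pass, even though *finding* such a subset is not
// known to be possible in polynomial time.
public class SubsetSumVerifier {
  static boolean verify(int[] numbers, int target, int[] certificate) {
    boolean[] used = new boolean[numbers.length];
    long sum = 0;
    for (int index : certificate) {
      // Reject out-of-range or repeated indices: a subset uses each at most once.
      if (index < 0 || index >= numbers.length || used[index]) return false;
      used[index] = true;
      sum += numbers[index];
    }
    return sum == target;
  }
}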

The complexity class P is contained in NP, but NP contains many important problems, the hardest of which are called NP-complete problems, for which no polynomial-time algorithms are known. The most important open question in complexity theory, the P = NP problem, asks whether such algorithms actually exist for NP-complete problems and, by corollary, for all NP problems.

NP complete

Euler diagram for P, NP, NP-complete, and NP-hard set of problems

The complexity class NP-complete (abbreviated NP-C or NPC) is a class of decision problems. A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to the decision problem can be verified in polynomial time, and also in the set of NP-hard problems so that any NP problem can be converted into L by a transformation of the inputs in polynomial time.

Although any given solution to such a problem can be verified quickly, there is no known efficient way to locate a solution in the first place; indeed, the most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows. As a result, the time required to solve even moderately large versions of many of these problems easily reaches into the billions or trillions of years, using any amount of computing power available today. As a consequence, determining whether or not it is possible to solve these problems quickly, called the P versus NP problem, is one of the principal unsolved problems in computer science today.

While a method for computing the solutions to NP-complete problems using a reasonable amount of time remains undiscovered, computer scientists and programmers still frequently encounter NP-complete problems. NP-complete problems are often addressed by using approximation algorithms.

Turing machine

A Turing machine is a theoretical device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer. A Turing machine that is able to simulate any other Turing machine is called a universal Turing machine (UTM, or simply a universal machine).

A limitation of Turing Machines is that they do not model the strengths of a particular arrangement well. For instance, modern stored-program computers are actually instances of a more specific form of abstract machine known as the random access stored program machine or RASP machine model. Like the Universal Turing machine the RASP stores its "program" in "memory" external to its finite-state machine's "instructions". Unlike the Universal Turing Machine, the RASP has an infinite number of distinguishable, numbered but unbounded "registers"—memory "cells" that can contain any integer. The RASP's finite-state machine is equipped with the capability for indirect addressing (e.g. the contents of one register can be used as an address to specify another register); thus the RASP's "program" can address any register in the register-sequence. The upshot of this distinction is that there are computational optimizations that can be performed based on the memory indices, which are not possible in a general Turing Machine; thus when Turing Machines are used as the basis for bounding running times, a 'false lower bound' can be proven on certain algorithms' running times (due to the false simplifying assumption of a Turing Machine). An example of this is binary search, an algorithm that can be shown to perform more quickly when using the RASP model of computation rather than the Turing machine model.

Data structures

Operation | Linked list | Array | Dynamic array | Balanced tree
Search | Θ(n) | Θ(1) | Θ(1) | Θ(log n)
Insert/delete at beginning | Θ(1) | N/A | Θ(n) | Θ(log n)
Insert/delete at end | Θ(1) | N/A | Θ(1) amortized | Θ(log n)
Insert/delete in middle | search time + Θ(1) | N/A | Θ(n) | Θ(log n)
Wasted space (average) | Θ(n) | 0 | Θ(n) | Θ(n)

Linked lists


A linked list whose nodes contain two fields: an integer value and a link to the next node

A linked list is a data structure used for collecting a sequence of objects, which allows efficient addition, removal and retrieval of elements from any position in the sequence. It is implemented as nodes, each of which contains a reference (i.e., a link) to the next and/or previous node in the sequence.
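A minimal sketch of a singly linked node structure (names are illustrative) showing why insertion and removal next to a known node are O(1):

// Minimal singly linked list: each node holds a value and a link to the
// next node. Insertion or removal next to a known node is O(1);
// finding a position in the first place is O(n).
public class LinkedList {
  static class Node {
    int value;
    Node next;
    Node(int value, Node next) { this.value = value; this.next = next; }
  }

  Node head;

  void pushFront(int value) {                       // O(1): relink the head
    head = new Node(value, head);
  }

  static void insertAfter(Node node, int value) {   // O(1) given the node
    node.next = new Node(value, node.next);
  }

  static void removeAfter(Node node) {              // O(1): bypass the removed node
    if (node.next != null) node.next = node.next.next;
  }
}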

Hash tables

A small phone book as a hash table.

A hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

Separate chaining

Hash collision resolved by separate chaining.

In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Lookup requires scanning the list for an entry with the given key. Insertion requires adding a new entry record to either end of the list belonging to the hashed slot. Deletion requires searching the list and removing the element. (The technique is also called open hashing or closed addressing, which should not be confused with 'open addressing' or 'closed hashing'.)
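A hedged sketch of chaining, assuming a fixed bucket count and Java's built-in hashCode as the hash function:

// Separate chaining: each bucket holds a linked list of the entries whose
// keys hashed to that slot. Lookup and insertion scan only the selected
// bucket's chain.
public class ChainedHashMap<K, V> {
  static class Entry<K, V> {
    final K key; V value; Entry<K, V> next;
    Entry(K key, V value, Entry<K, V> next) {
      this.key = key; this.value = value; this.next = next;
    }
  }

  private final Entry<K, V>[] buckets;

  @SuppressWarnings("unchecked")
  ChainedHashMap(int capacity) { buckets = new Entry[capacity]; }

  private int slot(K key) {                        // key -> bucket index
    return (key.hashCode() & 0x7fffffff) % buckets.length;
  }

  V get(K key) {                                   // scan one chain
    for (Entry<K, V> e = buckets[slot(key)]; e != null; e = e.next)
      if (e.key.equals(key)) return e.value;
    return null;
  }

  void put(K key, V value) {
    int i = slot(key);
    for (Entry<K, V> e = buckets[i]; e != null; e = e.next)
      if (e.key.equals(key)) { e.value = value; return; }
    buckets[i] = new Entry<>(key, value, buckets[i]); // prepend to chain
  }
}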

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, on the load factor.

For separate chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries; so the worst-case cost is proportional to the number n of entries in the table.

Separate chaining with list heads

Hash collision by separate chaining with head records in the bucket array.

Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase the cache efficiency of hash table access. To save memory space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

Heap

Example of a complete binary max-heap

In computer science, a heap is a specialized tree-based data structure. We distinguish two kinds of heaps:

  • Max-heap, which satisfies the heap property that each node's key is greater than or equal to the keys of its children (so the largest key is at the root).
  • Min-heap, which satisfies the heap property that each node's key is less than or equal to the keys of its children (so the smallest key is at the root).

There is no restriction as to how many children each node has in a heap, although in practice each node has at most two. The heap is one maximally-efficient implementation of an abstract data type called a priority queue. Heaps are crucial in several efficient graph algorithms such as Dijkstra's algorithm, and in the sorting algorithm heapsort.

Applications

The heap data structure has many applications.

Full and almost full binary heaps may be represented in a very space-efficient way using an array alone. The first (or last) element will contain the root. The next two elements of the array contain its children. The next four contain the four children of the two child nodes, etc. Thus the children of the node at position n would be at positions 2n and 2n+1 in a one-based array, or 2n+1 and 2n+2 in a zero-based array. This allows moving up or down the tree by doing simple index computations. Balancing a heap is done by swapping elements which are out of order. As we can build a heap from an array without requiring extra memory (for the nodes, for example), heapsort can be used to sort an array in-place.
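The index computations and the swap-based restructuring can be sketched as follows for a zero-based binary max-heap (an illustration, not a tuned implementation):

// Zero-based binary max-heap stored in an array: the children of index i
// sit at 2i+1 and 2i+2 and its parent at (i-1)/2, so navigation is pure
// index arithmetic.
public class MaxHeap {
  static int parent(int i) { return (i - 1) / 2; }
  static int left(int i)   { return 2 * i + 1; }
  static int right(int i)  { return 2 * i + 2; }

  // Restore the heap property by swapping out-of-order elements downward.
  static void siftDown(int[] a, int i, int size) {
    while (true) {
      int largest = i, l = left(i), r = right(i);
      if (l < size && a[l] > a[largest]) largest = l;
      if (r < size && a[r] > a[largest]) largest = r;
      if (largest == i) return;             // subtree already a heap
      int tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
      i = largest;
    }
  }

  // Build a heap in place without extra memory: sift down every internal
  // node, which is O(n) overall and is the first phase of heapsort.
  static void buildHeap(int[] a) {
    for (int i = a.length / 2 - 1; i >= 0; i--) siftDown(a, i, a.length);
  }
}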

Time computation

Function names assume a min-heap.

Operation | Binary | Binomial | Fibonacci | Pairing | Brodal
findMin | Θ(1) | Θ(log n) or Θ(1) | Θ(1) | Θ(1) | Θ(1)
deleteMin | Θ(log n) | Θ(log n) | O(log n)* | O(log n)* | O(log n)
insert | Θ(log n) | O(log n) | Θ(1) | O(1)* | Θ(1)
decreaseKey | Θ(log n) | Θ(log n) | Θ(1)* | O(log n)* | Θ(1)
merge | Θ(n) | O(log n)** | Θ(1) | O(1)* | Θ(1)

(*)Amortized time
(**)Where n is the size of the larger heap

Queue

Representation of a FIFO Queue

Double-ended queue

A double-ended queue (often abbreviated to deque, pronounced "deck") is an abstract data type that implements a queue for which elements can be added to or removed from either the front (head) or the back (tail). It is also often called a head-tail linked list.
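For example, Java's java.util.ArrayDeque exposes exactly this interface:

import java.util.ArrayDeque;

// A deque allows O(1) insertion and removal at both ends.
public class DequeDemo {
  public static void main(String[] args) {
    ArrayDeque<Integer> deque = new ArrayDeque<>();
    deque.addFirst(2);                      // deque: 2
    deque.addLast(3);                       // deque: 2 3
    deque.addFirst(1);                      // deque: 1 2 3
    System.out.println(deque.pollLast());   // 3 (removed from the tail)
    System.out.println(deque.pollFirst());  // 1 (removed from the head)
  }
}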

Priority queue

It is exactly like a regular queue or stack data structure, but additionally, each element is associated with a "priority".

A priority queue must at least support the following operations:

  • insert_with_priority: add an element to the queue with an associated priority
  • pull_highest_priority_element: remove the element with the highest priority from the queue and return it. This operation is also known as "pop_element(Off)", "get_maximum_element", or "get_front(most)_element". Some conventions consider lower priorities to be higher, so it may also be known as "get_minimum_element", and it is often referred to as "get-min" in the literature. Implementations sometimes provide separate "peek_at_highest_priority_element" and "delete_element" functions instead, which can be combined to produce "pull_highest_priority_element".
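Java's java.util.PriorityQueue, for example, follows the "get-min" convention, treating lower values as higher priorities:

import java.util.PriorityQueue;

// java.util.PriorityQueue is a binary min-heap: poll() returns the
// smallest element first, i.e. the "get-min" convention described above.
public class PriorityQueueDemo {
  public static void main(String[] args) {
    PriorityQueue<Integer> pq = new PriorityQueue<>();
    pq.offer(5);                       // insert_with_priority
    pq.offer(1);
    pq.offer(3);
    System.out.println(pq.peek());     // 1: inspect without removing
    System.out.println(pq.poll());     // 1: pull_highest_priority_element
    System.out.println(pq.poll());     // 3
  }
}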

Circular buffer (or circular queue)

A ring showing, conceptually, a circular buffer: the buffer has no real end, and reads and writes can loop around it. However, since memory is never physically created as a ring, a linear representation is generally used instead.

A circular buffer, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. The circular buffer is well suited as a FIFO buffer, while a standard, non-circular buffer is well suited as a LIFO buffer. This structure lends itself easily to buffering data streams; multimedia playback is one example where an overwriting circular buffer can be used. Circular buffering makes a good implementation strategy for a queue that has a fixed maximum size.

Start / End Pointers

Generally, a circular buffer requires three pointers:

  • one to the actual buffer in memory
  • one to point to the start of valid data
  • one to point to the end of valid data

Alternatively, a fixed-length buffer with two integers to keep track of indices can be used in languages that do not have pointers.

While there are numerous ways to label the pointers and the exact semantics can vary, one common scheme keeps the start pointer at the oldest element and the end pointer just past the newest. The point to note about an overwriting buffer is that once it is full, each new write overwrites the oldest element, so the start pointer is incremented along with the end pointer.
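A sketch of the two-integer variant (an overwriting buffer; names are illustrative):

// Fixed-size overwriting circular buffer tracked with two integers.
// When full, a write overwrites the oldest element and advances the
// start index as well, which is exactly the behavior noted above.
public class RingBuffer {
  private final int[] data;
  private int start = 0;   // index of the oldest element
  private int count = 0;   // number of valid elements

  RingBuffer(int capacity) { data = new int[capacity]; }

  void write(int value) {
    int end = (start + count) % data.length;
    data[end] = value;
    if (count < data.length) {
      count++;
    } else {
      start = (start + 1) % data.length;  // overwrite: advance start too
    }
  }

  int read() {                            // removes the oldest element
    if (count == 0) throw new IllegalStateException("buffer empty");
    int value = data[start];
    start = (start + 1) % data.length;
    count--;
    return value;
  }
}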

Stack

Simple representation of a stack

In computer science, a stack is a last in, first out (LIFO) abstract data type and data structure. A stack can have any abstract data type as an element, but it is characterized by only three fundamental operations: push, pop, and stack top. The push operation adds a new item to the top of the stack, or initializes the stack if it is empty; if the stack is full and cannot accept the given item, the stack is considered to be in an overflow state. The pop operation removes the item from the top of the stack; a pop either reveals previously concealed items or results in an empty stack, and popping an empty stack puts it into an underflow state (there are no items to remove). The stack top operation returns the data in the top-most position without deleting it; it produces the same underflow state if the stack is empty.

A stack is a restricted data structure, because only a small number of operations are performed on it. The nature of the pop and push operations also means that stack elements have a natural order. Elements are removed from the stack in the reverse order to the order of their addition: therefore, the lower elements are those that have been on the stack the longest.
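A bounded, array-backed sketch of these three operations, including the overflow and underflow states described above (illustrative only):

// Bounded array-backed stack: push, pop, and stack top, with explicit
// overflow and underflow states as described above.
public class Stack {
  private final int[] data;
  private int top = -1;    // index of the topmost item; -1 means empty

  Stack(int capacity) { data = new int[capacity]; }

  void push(int value) {
    if (top == data.length - 1) throw new IllegalStateException("overflow");
    data[++top] = value;
  }

  int pop() {
    if (top == -1) throw new IllegalStateException("underflow");
    return data[top--];
  }

  int stackTop() {         // return the top item without removing it
    if (top == -1) throw new IllegalStateException("underflow");
    return data[top];
  }
}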

Trees

In graph theory, a k-ary tree is a rooted tree in which each node has no more than k children. It is also sometimes known as a k-way tree, an N-ary tree, or an M-ary tree. A binary tree is the special case where k = 2.

A simple unordered tree; in this diagram, the node labeled 7 has two children, labeled 2 and 6, and one parent, labeled 2. The root node, at the top, has no parent.

A node is a structure which may contain a value, a condition, or represent a separate data structure (which could be a tree of its own). Each node in a tree has zero or more child nodes, which are below it in the tree (by convention, trees are drawn growing downwards). A node that has a child is called the child's parent node (or ancestor node, or superior). A node has at most one parent.

Nodes that do not have any children are called leaf nodes. They are also referred to as terminal nodes.

A free tree is a tree that is not rooted.

The height of a node is the length of the longest downward path to a leaf from that node. The height of the root is the height of the tree. The depth of a node is the length of the path to its root (i.e., its root path). This is commonly needed in the manipulation of the various self balancing trees, AVL Trees in particular. Conventionally, the value −1 corresponds to a subtree with no nodes, whereas zero corresponds to a subtree with one node.
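Under this convention, height can be computed with a short recursion (a minimal sketch):

// Height of a binary tree node under the convention that an empty
// subtree has height -1, so a single leaf node has height 0.
public class TreeHeight {
  static class Node { Node left, right; }

  static int height(Node node) {
    if (node == null) return -1;                 // empty subtree
    return 1 + Math.max(height(node.left), height(node.right));
  }
}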

The topmost node in a tree is called the root node. Being the topmost node, the root node will not have parents. It is the node at which operations on the tree commonly begin (although some algorithms begin with the leaf nodes and work up ending at the root). All other nodes can be reached from it by following edges or links. (In the formal definition, each such path is also unique). In diagrams, it is typically drawn at the top. In some trees, such as heaps, the root node has special properties. Every node in a tree can be seen as the root node of the subtree rooted at that node.

An internal node or inner node is any node of a tree that has child nodes and is thus not a leaf node. Similarly, an external node or outer node is any node that does not have child nodes and is thus a leaf.

A subtree of a tree T is a tree consisting of a node in T and all of its descendants in T. (This is different from the formal definition of subtree used in graph theory.) The subtree corresponding to the root node is the entire tree; the subtree corresponding to any other node is called a proper subtree (in analogy to the term proper subset).

B-tree

B-tree
Type: tree
Time complexity in big O notation:
Operation | Average | Worst case
Search | O(log n) | O(log n)
Insert | O(log n) | O(log n)
Delete | O(log n) | O(log n)
Space | O(n) | O(n)
A B-tree of order 2 (Bayer & McCreight 1972) or order 5 (Knuth 1997).

A B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic amortized time. The B-tree is a generalization of a binary search tree in that a node can have more than two children. (Comer, p. 123) Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases and filesystems.

Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search chooses the child pointer (subtree) whose separation values are on either side of the search value.
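A sketch of that descent, assuming each node stores its keys in sorted order alongside child pointers (field names are illustrative):

// Recursive B-tree search: within a node, scan the sorted keys to find
// the branch whose separators bracket the search value, then descend.
public class BTreeSearch {
  static class Node {
    int[] keys;      // sorted keys in this node
    int keyCount;    // number of keys actually used
    Node[] children; // child i holds keys between keys[i-1] and keys[i]; null in a leaf
  }

  static boolean search(Node node, int value) {
    if (node == null) return false;
    int i = 0;
    while (i < node.keyCount && value > node.keys[i]) i++;  // find branch
    if (i < node.keyCount && value == node.keys[i]) return true;
    if (node.children == null) return false;                // leaf: absent
    return search(node.children[i], value);                 // descend
  }
}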

A B-tree insertion example, showing each iteration.

Deletion

There are two popular strategies for deletion from a B-Tree.

  • locate and delete the item, then restructure the tree to regain its invariants

or

  • do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be deleted is encountered, it can be deleted without triggering the need for any further restructuring

The algorithm below uses the former strategy.

There are two special cases to consider when deleting an element:

  1. the element in an internal node may be a separator for its child nodes
  2. deleting an element may put its node under the minimum number of elements and children.

Each of these cases will be dealt with in order.

Deletion from a leaf node
  • Search for the value to delete.
  • If the value is in a leaf node, it can simply be deleted from the node.
  • If underflow happens, check siblings to either transfer a key or fuse the siblings together.
  • If the deletion occurred in a right child, retrieve the maximum value from the left child, provided the left child does not underflow;
  • in the vice-versa situation, retrieve the minimum element from the right child.
Deletion from an internal node

Each element in an internal node acts as a separation value for two subtrees, and when such an element is deleted, two cases arise. In the first case, both of the two child nodes to the left and right of the deleted element have the minimum number of elements, namely L−1. They can then be joined into a single node with 2L−2 elements, a number which does not exceed U−1 and so is a legal node. Unless it is known that this particular B-tree does not contain duplicate data, we must then also (recursively) delete the element in question from the new node.

In the second case, one of the two child nodes contains more than the minimum number of elements. Then a new separator for those subtrees must be found. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is the smallest element which is still greater than the separator. Both of those elements are in leaf nodes, and either can be the new separator for the two subtrees.

  • If the value is in an internal node, choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree), remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
  • This has deleted an element from a leaf node, and so is now equivalent to the previous case.
Rebalancing after deletion

If deleting an element from a leaf node has brought it under the minimum size, some elements must be redistributed to bring all nodes up to the minimum. In some cases the rearrangement will move the deficiency to the parent, and the redistribution must be applied iteratively up the tree, perhaps even to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:[citation needed]

  • If the right sibling has more than the minimum number of elements
    • Add the separator to the end of the deficient node.
    • Replace the separator in the parent with the first element of the right sibling.
    • Append the first child of the right sibling as the last child of the deficient node
  • Otherwise, if the left sibling has more than the minimum number of elements.
    • Add the separator to the start of the deficient node.
    • Replace the separator in the parent with the last element of the left sibling.
    • Insert the last child of the left sibling as the first child of the deficient node
  • If both immediate siblings have only the minimum number of elements
    • Create a new node with all the elements from the deficient node, all the elements from one of its siblings, and the separator in the parent between the two combined sibling nodes.
    • Remove the separator from the parent, and replace the two children it separated with the combined node.
    • If that brings the number of elements in the parent under the minimum, repeat these steps with that deficient node, unless it is the root, since the root is permitted to be deficient.

The only other case to account for is when the root has no elements and one child. In this case it is sufficient to replace it with its only child.

Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.

DSW algorithm

The DSW algorithm is a method for efficiently balancing binary search trees — that is, decreasing their height to O(log n) nodes, where n is the total number of nodes. Unlike a self-balancing binary search tree, it does not do this incrementally during each operation, but periodically, so that its cost can be amortized over many operations.

The algorithm requires linear (O(n)) time and is in-place. The original algorithm generates as compact a tree as possible: all levels of the tree are completely full except possibly the bottom-most. The modification (by S/W) generates a complete binary tree, namely one in which the bottom-most level is filled strictly from left to right. This is a useful transformation to perform if it is known that no more inserts will be done.
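A hedged sketch of the two phases, following the usual presentation of the algorithm (routine names are conventional; this is an illustration rather than a drop-in library):

// DSW balancing in two phases: tree-to-vine flattens the tree into a
// right-leaning chain using right rotations, then vine-to-tree applies
// rounds of left rotations ("compressions") that halve the height.
public class Dsw {
  static class Node { int key; Node left, right; }

  static Node balance(Node root) {
    Node pseudoRoot = new Node();        // dummy node above the real root
    pseudoRoot.right = root;
    int size = treeToVine(pseudoRoot);
    vineToTree(pseudoRoot, size);
    return pseudoRoot.right;
  }

  // Phase 1: rotate right at every node with a left child; returns node count.
  static int treeToVine(Node pseudoRoot) {
    int size = 0;
    Node scanner = pseudoRoot;
    while (scanner.right != null) {
      Node child = scanner.right;
      if (child.left == null) {
        scanner = child;                 // already on the vine
        size++;
      } else {
        Node left = child.left;          // right rotation at child
        child.left = left.right;
        left.right = child;
        scanner.right = left;
      }
    }
    return size;
  }

  // Phase 2: leave the excess nodes for the bottom level, then repeatedly
  // compress, halving the spine each round to get a complete tree.
  static void vineToTree(Node pseudoRoot, int size) {
    int leaves = size + 1 - Integer.highestOneBit(size + 1);
    compress(pseudoRoot, leaves);
    size -= leaves;
    while (size > 1) {
      size /= 2;
      compress(pseudoRoot, size);
    }
  }

  // One round of left rotations along the right spine.
  static void compress(Node pseudoRoot, int count) {
    Node scanner = pseudoRoot;
    for (int i = 0; i < count; i++) {
      Node child = scanner.right;        // left rotation at child
      Node grand = child.right;
      scanner.right = grand;
      child.right = grand.left;
      grand.left = child;
      scanner = grand;
    }
  }
}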

Left child-right sibling binary tree

In computer science, a left child-right sibling binary tree is a binary tree representation of a k-ary tree. The process of converting from a k-ary tree to an LC-RS binary tree is not reversible in general without additional information.

To form a binary tree from an arbitrary k-ary tree by this method, the root of the original tree is made the root of the binary tree. Then, starting with the root, each node's leftmost child in the original tree is made its left child in the binary tree, and its nearest sibling to the right in the original tree is made its right child in the binary tree.

If the original tree was sorted, the new tree will be a binary search tree.
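A sketch of the conversion (node types and names are illustrative):

import java.util.List;

// Convert a k-ary tree to its left-child right-sibling binary form:
// the leftmost child becomes the left child, and the nearest sibling
// to the right becomes the right child.
public class LcRs {
  static class KNode { int key; List<KNode> children; }
  static class BNode { int key; BNode left, right; }

  // Convert node and, recursively, the siblings that follow it.
  static BNode convert(KNode node, List<KNode> siblingsAfter) {
    if (node == null) return null;
    BNode b = new BNode();
    b.key = node.key;
    if (node.children != null && !node.children.isEmpty()) {
      b.left = convert(node.children.get(0),
                       node.children.subList(1, node.children.size()));
    }
    if (siblingsAfter != null && !siblingsAfter.isEmpty()) {
      b.right = convert(siblingsAfter.get(0),
                        siblingsAfter.subList(1, siblingsAfter.size()));
    }
    return b;
  }

  static BNode convertRoot(KNode root) {
    return convert(root, null);          // the root has no siblings
  }
}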

Trie or prefix tree

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".

A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a deterministic finite automaton, although the symbol on each edge is often implicit in the order of the branches.

It is not necessary for keys to be explicitly stored in nodes. (In the figure, words are shown only to illustrate how the trie works.)

Though it is most common, tries need not be keyed by character strings. The same algorithms can easily be adapted to serve similar functions of ordered lists of any construct, e.g., permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up a short, fixed size of bits such as an integer number or pointer to memory.
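A minimal sketch of a trie keyed by lowercase ASCII strings (a production trie would typically hold a map per node; names here are illustrative):

// Minimal trie over lowercase ASCII keys: a node stores no key itself;
// the path of edges from the root spells the key, and only nodes that
// terminate a key carry a value.
public class Trie {
  static class Node {
    Node[] children = new Node[26]; // one slot per letter 'a'..'z'
    Integer value;                  // non-null only where a key ends
  }

  private final Node root = new Node();

  void put(String key, int value) {
    Node node = root;
    for (char c : key.toCharArray()) {
      int i = c - 'a';
      if (node.children[i] == null) node.children[i] = new Node();
      node = node.children[i];
    }
    node.value = value;
  }

  Integer get(String key) {
    Node node = root;
    for (char c : key.toCharArray()) {
      node = node.children[c - 'a'];
      if (node == null) return null;  // no key with this prefix
    }
    return node.value;
  }
}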

Algorithms

Searching

Binary search in a sorted integer array:

public class BinarySearch {
  // Search sorted array A for key k; return its index, or -1 if absent.
  static int search(int[] A, int k) {
    int b = 0;                    // beginning of the search range
    int e = A.length - 1;         // end of the search range
    while (b <= e) {
      int m = b + (e - b) / 2;    // midpoint, written to avoid overflow
      if (A[m] < k) {
        b = m + 1;                // key can only be in the upper half
      } else if (A[m] == k) {
        return m;                 // found
      } else {
        e = m - 1;                // key can only be in the lower half
      }
    }
    return -1;                    // not found
  }
}