Compressed pattern matching
![]() | This article needs more links to other articles to help integrate it into the encyclopedia. (October 2012) |
In computer science compressed pattern matching or CPM is the process of searching for patterns in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less space.
Approximate CPM
![]() | This section is empty. You can help by adding to it. (March 2009) |
Multi-pattern CPM
![]() | This section needs expansion. You can help by adding to it. (March 2009) |
Aho-Corasick technique
![]() | This section is empty. You can help by adding to it. (January 2011) |
Boyer-Moore technique
![]() | This section is empty. You can help by adding to it. (January 2011) |
Bit parallel technique
![]() | This section is empty. You can help by adding to it. (January 2011) |
Compressed matching problem
If the compressed file uses a variable width encoding it could be present a problem: for example, let “100” be the codeword for a and let “110100” be the codeword for b. If we are looking for an occurrence of a in the text we could obtain as result also an occurrence that is within the codeword of b: we call this event false match. So we have to verify if the occurrence detected is effectively aligned on a codeword boundary. However we could always decode the entire text and then apply a classic string matching algorithm, but this usually requires more space and time and often is not possible, for example if we think to a compressed file hosted online. This problem of verify if the match returned by the compressed pattern matching algorithm is a true or a false match together with the impossibility of decode entire text is called compressed matching problem. Exist many strategies for find the boundaries of codewords and avoid full decompression of the text, for example:
- List of the indices of first bit of each codeword, where we can apply a binary search;
- List of the indices of first bit of each codeword with differential coding, so we can take less space within the file;
- Mask of bit, where bit 1 marks the starting bit of each codeword;
- Subdivision in blocks, for a partial and aimed decompression.
References
- Shmuel T. Klein and Dana Shapira PATTERN MATCHING IN HUFFMAN ENCODED TEXTS (2003)
External links
- Almost optimal fully LZW-compressed pattern matching
- A Dictionary-based Compressed Pattern Matching Algorithm
- A unifying framework for compressed pattern matching
- Speeding Up String Pattern Matching by Text Compression: The Dawn of a New Era
- Shift-and approach to pattern matching in LZW compressed text
- LZW Algorithm