Sequential pattern mining
Sequence mining is a topic of Data Mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. [1] It is usually presumed that the values are discrete, and thus Time series mining is closely related, but usually considered a different activity. Sequence mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning.
String Mining
String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself may be typically very long. Examples of an alphabet can be those in the ASCII character set used in nantural language text, nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or amino acids for protein sequences. In biology applications analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties. Knowing the sequence of letters of a DNA a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and Function (Biology). This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when insertions, deletions and mutations occur in a string. A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented in the paper String Mining in Bioinformatics,[2] which include:
- Repeat-related problems: that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
- Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods includeBLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See sequence alignment.
Itemset Mining
Some problems in sequence mining lend themselves discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or in the context of stock prices, "if {Nokia up and Ericsson Up}, it is likely that {Motorolla up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".
Two common techniques that are applied to sequence databases for frequent itemset mining are the influential apriori algorithm and the more-recent FP-Growth technique. A survey and taxonomy of the key algorithms for item set mining is presented in the paper Frequent pattern mining: current status and future directions.[3]
See also
- Association rule learning
- Data Mining
- GSP Algorithm
- Process mining
- Sequence analysis (Bioinformatics)
- Sequence clustering
- Sequence labeling
- string (computer science)
- Sequence alignment
- Time series
References
- ^ Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi:10.1145/1824795.1824798 , please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with
|doi=10.1145/1824795.1824798
instead. - ^ String M. Abouelhoda, M. Ghanem. String Mining in Bioinformatics. In M. M. Gaber (Editor) Scientific Data Mining and Knowledge Discovery. Springer2009
- ^ J. Han, H. Cheng, D. Xin and X. Yan Frequent pattern mining: current status and future directions. In Data Mining and Knowledge Discovery Volume 15, Number 1, 55-86, DOI: 10.1007/s10618-006-0059-1