
Frequent item set mining


Frequent itemset mining is the process of finding frequent itemsets among all possible itemsets. A lattice structure can be used to enumerate all possible itemsets.

The figure shows the lattice structure for the items I = {a,b,c,d,e}. In general, a data set comprising k items can generate up to 2^k − 1 frequent itemsets, excluding the null set. In real applications k can be very large, which means that the search space of itemsets grows exponentially.[1]

The brute-force method for finding frequent itemsets is to determine the support count of every candidate itemset in the lattice structure. This can be done by comparing each candidate against every transaction; a candidate's support count is incremented only if the candidate is contained in the transaction. This method can be very time consuming, because its cost is O(NMω), where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and ω is the maximum transaction width.
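
For illustration, the following Python sketch (not from the cited source; the toy transaction list and variable names are chosen here for the example) enumerates every candidate itemset in the lattice and counts its support by scanning all transactions, which makes the O(NMω) cost directly visible:

    from itertools import combinations

    # A toy transaction database over the items {a, b, c, d, e}.
    transactions = [
        {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'},
        {'a', 'd', 'e'}, {'a', 'b', 'c'},
    ]
    items = sorted(set().union(*transactions))

    # Enumerate all M = 2^k - 1 non-empty candidate itemsets in the lattice
    # and compare each candidate against every transaction (N transactions,
    # maximum transaction width ω).
    support_counts = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            support_counts[candidate] = sum(
                1 for t in transactions if candidate <= t
            )

    minsup = 2  # minimum support count threshold (an arbitrary choice here)
    frequent = [c for c, n in support_counts.items() if n >= minsup]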



Methods for simplifying frequent itemset generation


There are two ways to reduce the computational complexity of frequent itemset generation:[2]

  • By reducing the number of candidate itemsets (M). The Apriori principle is an effective way to exclude candidate itemsets without counting their support values.
  • By reducing the number of comparisons. Instead of matching each candidate against every transaction, an alternative or compressed data structure can be used.

The Apriori Principle


If an itemset is frequent, then all of its subsets must also be frequent.[3]


The Apriori principle is applied in the Apriori algorithm.


Suppose {c,d,e} is a frequent itemset. Then every transaction that contains {c,d,e} must also contain its subsets {c,d}, {c,e}, {d,e}, {c}, {d}, and {e}. Hence, if {c,d,e} is frequent, all of its subsets must be frequent as well. Conversely, if an itemset such as {a,b} is infrequent, all of its supersets must be infrequent too. As shown in figure 4, the infrequent itemset {a,b} can be excluded from the graph together with all of its supersets. This technique is called support-based pruning. The pruning strategy is based on a key property of the support measure: the support of an itemset never exceeds the support of its subsets. This property is known as the anti-monotone property of the support measure.
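
A minimal sketch of support-based pruning in Python (illustrative; the helper name and the toy data are assumptions, not code from the source): once an itemset is known to be infrequent, any candidate containing it is discarded without counting its support.

    # Support-based pruning: once an itemset is known to be infrequent,
    # every superset of it can be discarded without counting its support.
    infrequent = {frozenset({'a', 'b'})}   # e.g. {a,b} failed the minsup test

    def survives_pruning(candidate):
        """Return False if the candidate contains a known infrequent itemset."""
        return not any(bad <= candidate for bad in infrequent)

    # {a,b,c} is a superset of {a,b}, so it is pruned; {c,d,e} is kept.
    assert not survives_pruning(frozenset({'a', 'b', 'c'}))
    assert survives_pruning(frozenset({'c', 'd', 'e'}))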


Monotonicity Property

Let I be a set of items, and let J = 2^I be the power set of I. A measure f is monotone (or upward closed) if

    ∀ X, Y ∈ J : (X ⊆ Y) ⟹ f(X) ≤ f(Y),

which means that if X is a subset of Y, then f(X) must not exceed f(Y). On the other hand, f is anti-monotone (or downward closed) if

    ∀ X, Y ∈ J : (X ⊆ Y) ⟹ f(Y) ≤ f(X),

which means that if X is a subset of Y, then f(Y) must not exceed f(X). Any measure that possesses the anti-monotone property can be incorporated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets.[4]
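
The anti-monotone property of the support measure can be checked on a toy example; the following Python sketch (illustrative, with made-up data) verifies that f(Y) ≤ f(X) for one particular pair X ⊆ Y:

    # Toy transactions (same style as the brute-force example above).
    transactions = [
        {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd', 'e'},
        {'a', 'd', 'e'}, {'a', 'b', 'c'},
    ]

    def support(itemset, transactions):
        """Support count: number of transactions that contain the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    X = frozenset({'c', 'd'})
    Y = frozenset({'c', 'd', 'e'})          # X is a subset of Y
    # Anti-monotone property of support: f(Y) must not exceed f(X).
    assert support(Y, transactions) <= support(X, transactions)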

General-to-Specific


The Apriori algorithm uses a general-to-specific search strategy, in which pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This type of search strategy is effective because the maximum length of a frequent itemset is often not very long.
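
One common way to realize this merge step, sketched here in Python (an illustrative version, not the textbook's pseudocode), is to merge two frequent (k−1)-itemsets only when their first k−2 items agree, so that each candidate k-itemset is generated exactly once:

    def merge_candidates(frequent_km1):
        """Merge pairs of frequent (k-1)-itemsets into candidate k-itemsets.

        Two itemsets are merged only if they agree on their first k-2 items
        (in sorted order), so each candidate is generated exactly once.
        """
        sorted_sets = sorted(tuple(sorted(s)) for s in frequent_km1)
        candidates = set()
        for i, x in enumerate(sorted_sets):
            for y in sorted_sets[i + 1:]:
                if x[:-1] == y[:-1]:       # first k-2 items agree
                    candidates.add(frozenset(x) | frozenset(y))
                else:
                    break                  # sorted order: no later match
        return candidates

    # Example: merging {a,b} and {a,c} yields the candidate {a,b,c}, which
    # survives pruning only if all of its 2-subsets (including {b,c}) are frequent.
    print(merge_candidates([{'a', 'b'}, {'a', 'c'}, {'b', 'c'}]))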

FP-Tree Representation


The FP-Growth algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it represents the data set using a compact data structure called an FP-tree, from which frequent itemsets can be extracted directly.

An FP-tree is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the tree.[5]

Because different transactions can have items in common, their paths may overlap. The more the paths overlap, the more the structure is compressed.
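
A compact sketch of this construction in Python (illustrative only; the node layout and names are assumptions, and a full FP-tree would additionally order items by descending support and keep header links for mining):

    class FPNode:
        """One node of the FP-tree: an item, a count, and child nodes."""
        def __init__(self, item=None):
            self.item = item
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions):
        root = FPNode()
        for transaction in transactions:
            node = root
            # Items are inserted in a fixed (here: alphabetical) order so
            # that transactions sharing a prefix share a path in the tree.
            for item in sorted(transaction):
                node = node.children.setdefault(item, FPNode(item))
                node.count += 1
        return root

    # {a,b} and {a,b,c} share the prefix a -> b, so the two transactions
    # map onto overlapping paths and only the node counts differ.
    tree = build_fp_tree([{'a', 'b'}, {'a', 'b', 'c'}])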

The figure shows a data set with 10 transactions over the items I = {a,b,c,d,e,f}.[6] After all of the transactions have been read from the data set, the FP-Growth algorithm can be applied to the resulting tree.


References

  1. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Section 6.2 "Frequent Itemset Generation", WP Co (2006). ISBN 9780273769224.
  2. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Section 6.2, WP Co (2006). ISBN 9780273769224.
  3. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Section 6.2.1 "The Apriori Principle", WP Co (2006). ISBN 9780273769224.
  4. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Chapter 6, Definition 6.2, WP Co (2006). ISBN 9780273769224.
  5. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Section 6.6 "FP-tree", WP Co (2006). ISBN 9780273769224.
  6. ^ P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Section 6.6 "FP-Growth Algorithm", WP Co (2006). ISBN 9780273769224.