Draft:Infinite Feature Selection (Inf-FS)
Infinite Feature Selection
Infinite Feature Selection (Inf-FS) is an unsupervised algorithm for ranking features by their global importance, using a graph-based representation of pairwise feature affinities. It was first introduced by Giorgio Roffo and colleagues at the IEEE International Conference on Computer Vision in 2015.[1]
While originally designed for feature filtering in classical machine learning pipelines, Inf-FS has since been recognized as an early mathematical formulation of the attention mechanism used in deep learning models such as Transformers. Recent work shows that the structure and computation of Inf-FS are mathematically equivalent to self-attention under specific parameterizations.[2][3]
Method
Inf-FS models each feature as a node in a fully connected undirected graph. The edge weights between nodes represent pairwise affinity, typically computed based on correlation or mutual information. These relationships are encoded in an affinity matrix $A \in \mathbb{R}^{n \times n}$, where $n$ is the number of features.
The algorithm computes feature importance through an infinite sum over weighted powers of $A$, capturing multi-hop interactions along paths of every length:[1]

$$S \;=\; \sum_{l=1}^{\infty} \alpha^{l} A^{l}$$
Here, $\alpha$ is a regularization parameter such that $0 < \alpha < 1/\rho(A)$, where $\rho(A)$ is the spectral radius of $A$.
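This condition guarantees convergence: since $\rho(\alpha A) = \alpha\,\rho(A) < 1$, the sum is a convergent Neumann (geometric) series and admits a closed form, so $S$ can be obtained from a single matrix inversion rather than by summing the powers term by term:

$$S \;=\; \sum_{l=1}^{\infty} \alpha^{l} A^{l} \;=\; (I - \alpha A)^{-1} - I.$$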
Feature scores are derived by summing the rows of $S$:

$$s_i \;=\; \sum_{j=1}^{n} [S]_{ij}, \qquad i = 1, \dots, n$$
This yields a global relevance score for each feature, reflecting its contextual influence across the graph.
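In code, the scoring step amounts to a few lines of linear algebra. The following is a minimal sketch under simplifying assumptions, not the authors' reference implementation: it uses absolute Pearson correlation as the pairwise affinity (the published method combines correlation- and dispersion-based terms), and the function name inf_fs_scores and the alpha_scale parameter exist only for this example.

import numpy as np

def inf_fs_scores(X, alpha_scale=0.9):
    """Rank the columns (features) of X by Inf-FS relevance."""
    n_features = X.shape[1]
    # Pairwise affinity: absolute Pearson correlation between features.
    # (One possible choice; the published method uses a richer affinity.)
    A = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(A, 0.0)
    # The series converges when alpha < 1 / spectral_radius(A).
    alpha = alpha_scale / np.max(np.abs(np.linalg.eigvals(A)))
    # Closed form of the infinite sum of weighted powers of A:
    #   S = sum_{l >= 1} (alpha * A)^l = (I - alpha * A)^{-1} - I
    I = np.eye(n_features)
    S = np.linalg.inv(I - alpha * A) - I
    # Global relevance of feature i = sum of row i of S.
    scores = S.sum(axis=1)
    return np.argsort(-scores), scores

# Example: rank 20 features of a synthetic data set with 100 samples.
X = np.random.default_rng(0).normal(size=(100, 20))
ranking, scores = inf_fs_scores(X)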
Variants
A probabilistic extension, Infinite Latent Feature Selection (ILFS), was proposed in 2017; it infers a latent distribution over affinities through data-driven probabilistic modeling, allowing the affinity matrix to be learned rather than manually defined.[4]
Relationship to Self-Attention
Inf-FS has been shown to share the same underlying structure as self-attention mechanisms. Both compute contextual weights based on an affinity matrix $A$ between elements (features or tokens). The key difference lies in how $A$ is constructed:[2]
- In Inf-FS, $A$ can be handcrafted or probabilistically inferred.
- In Transformers, $A = \operatorname{softmax}\!\big(QK^{\top} / \sqrt{d_k}\big)$, where $Q$ and $K$ are learned projections of the input.[5]
Importantly, when the infinite power series in Inf-FS is truncated to its first term (i.e., $S = \alpha A$), the resulting computation mirrors the dot-product attention used in Transformers. This establishes Inf-FS as a superset of self-attention, generalizing the mechanism to arbitrary affinity structures and multi-hop propagation.[3]
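A toy numerical check of this correspondence is sketched below; the array sizes, random projection matrices, and the constant alpha are arbitrary example values, not settings from the cited papers.

import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8                                  # number of tokens/features, embedding width
Xemb = rng.normal(size=(n, d))               # input embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = Xemb @ Wq, Xemb @ Wk, Xemb @ Wv

A = softmax(Q @ K.T / np.sqrt(d))            # learned affinity matrix of a Transformer layer

# Truncating the Inf-FS series after l = 1 leaves S = alpha * A, so applying the
# contextual weights to the values reduces to dot-product attention up to the
# scalar alpha (alpha = 0.5 is an arbitrary example value).
alpha = 0.5
S_truncated = alpha * A
attention_output = A @ V
assert np.allclose(S_truncated @ V, alpha * attention_output)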
Theoretical Foundations
A 2024 study by Teo and Nguyen demonstrated that self-attention can be derived as a projection operation from kernel principal component analysis (kernel PCA).[3] In this framework, the attention output corresponds to projecting the query vectors onto the eigenvectors of the Gram matrix of the key vectors in an implicitly defined kernel feature space.
This matches the original Inf-FS power series when kernel affinity matrices are used, further validating the shared mathematical foundation between the two mechanisms.
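For orientation, the generic kernel-PCA projection can be written as below; this is the textbook formulation with the keys playing the role of the training points, offered as an illustration rather than the exact parameterization of the cited work.

$$\big\langle \varphi(q_i),\, u_m \big\rangle \;=\; \sum_{j=1}^{n} a^{(m)}_{j}\, \kappa(q_i, k_j), \qquad u_m \;=\; \sum_{j=1}^{n} a^{(m)}_{j}\, \varphi(k_j),$$

where $\varphi$ is the (implicit) feature map, $\kappa(q, k) = \varphi(q)^{\top} \varphi(k)$ the associated kernel, and $\mathbf{a}^{(m)}$ an eigenvector of the Gram matrix $G_{jl} = \kappa(k_j, k_l)$ of the (centered) key vectors.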
Historical Context
The use of affinity matrices in machine learning predates Inf-FS. Notable precursors include the bilateral filter (1998), PageRank (1998), and non-local means (2005), all of which weight or propagate information according to pairwise affinities.[2]
Inf-FS was the first to formalize the idea of using infinite-path propagation over an affinity graph for feature relevance. The connection to attention mechanisms became more explicit after the advent of the Transformer architecture in 2017.[5]
Applications
Inf-FS has been used in:[6]
- Feature selection for high-dimensional biomedical data
- Visual tracking and object detection
- Text and document classification
- Dimensionality reduction prior to clustering
Its unsupervised nature makes it useful in low-label settings or where model interpretability is critical.
See also
- Self-attention
- Transformer (machine learning model)
- Feature selection
- Graph-based learning
- Kernel methods
References
- ^ a b Roffo, Giorgio; Melzi, Simone; Cristani, Marco (2015). Infinite Feature Selection. 2015 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 4202–4210. doi:10.1109/ICCV.2015.478.
- ^ a b c Roffo, Giorgio (2025). "The Origin of Self-Attention: From Pairwise Affinity Matrices to Transformers". arXiv:2507.14560.
- ^ a b c Teo, Rachel S.Y.; Nguyen, Tan M. (2024). Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis. Advances in Neural Information Processing Systems (NeurIPS).
- ^ Roffo, Giorgio; Melzi, Simone (2017). Infinite Latent Feature Selection: A Probabilistic Latent Graph-Based Ranking Approach. 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 1398–1406. doi:10.1109/ICCV.2017.154.
- ^ a b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
- ^ Roffo, Giorgio; Melzi, Simone; Cristani, Marco (2021). "Infinite Feature Selection". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (7): 2438–2452. doi:10.1109/TPAMI.2020.2968526.