
Talk:Attention (machine learning)

From Wikipedia, the free encyclopedia

Confusing line "X is the input matrix of word embeddings, size 4 x 300. x is the word vector for "that". "


After "4x300" it immediately says "x is the word vector for 'that'." That's super-confusing, because one might think that the second x refers to the x between 4 and 300. There are three different uses of "x" in the sentence. Someone familiar with the field will be able to understand it, but Wikipedia is meant to be as clear as possible. ThinkerFeeler (talk) 00:20, 30 July 2023 (UTC)[reply]
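For illustration, the three uses of "x" can be kept apart in a short NumPy sketch (the numbers and the word index are illustrative, taken from the article's 4 x 300 example):

```python
import numpy as np

# "4 x 300" here just separates the two dimensions: 4 words, 300-dim embeddings.
n_words, d_embed = 4, 300

X = np.random.rand(n_words, d_embed)  # X: the whole input matrix, shape (4, 300)
x = X[2]                              # x: the embedding of one word, e.g. "that"
                                      # (index 2 is an assumed position)

print(X.shape)  # (4, 300)
print(x.shape)  # (300,)
```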

typo in "asterix"?


In the following extract: "The asterix within parenthesis "(*)" denotes the softmax", shouldn't the word be asterisk, not asterix? :-) Jrob kiwi (talk) 16:35, 23 August 2023 (UTC)[reply]

Does RNN mean "recursive neural network" or "recurrent neural network"?


In this article, is RNN supposed to mean "recursive neural network" or "recurrent neural network", or maybe sometimes one and sometimes the other? Once we figure this out, let's replace all occurrences with the correct three words, so that it is immediately clear even to novices. —Quantling (talk | contribs) 16:14, 24 October 2023 (UTC)[reply]

I'm pretty sure it is "recurrent". I am going to go ahead and edit. If I have it wrong, please accept my apologies ... and fix my edit. —Quantling (talk | contribs) 16:23, 24 October 2023 (UTC)[reply]

hard vs soft weights


The intro mentions hard and soft weights, which I haven't heard before in this context. Can someone provide a citation showing it is actually used terminology? DMH43 (talk) 15:15, 26 December 2023 (UTC)[reply]
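For what it's worth, the distinction the intro seems to be drawing can be sketched in a few lines of NumPy (the scores are made-up numbers; "soft" meaning softmax and "hard" meaning one-hot selection is the usual reading, not a claim about the article's sources):

```python
import numpy as np

scores = np.array([1.2, 0.3, 2.5])  # hypothetical attention scores

# "Soft" weights: a differentiable softmax; every element gets some weight.
soft = np.exp(scores) / np.exp(scores).sum()

# "Hard" weights: a one-hot selection of the single best-scoring element
# (non-differentiable, typically trained with sampling-based estimators).
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0

print(soft)  # all positive, sums to 1
print(hard)  # [0. 0. 1.]
```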

'word' should be replaced with something more generic


The article frequently uses the word "word" when talking about attention. For example the opening paragraph states: "It calculates "soft" weights for each word, more precisely for its embedding, in the context window.". However, attention is a concept that is independent of input type - it can and has been applied to words, pixel values, quantities, etc. I believe it would be clearer to replace the use of "word" in reference to the inputs that attention is applied to, with something more generic such as "input element" or "token". 180.150.65.6 (talk) 14:31, 5 March 2024 (UTC)[reply]
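The point that attention is input-agnostic is easy to demonstrate: the computation never inspects what its rows represent. A minimal sketch (function name and shapes are illustrative, not from the article):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; works on any embeddings."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ V

# The same function applies regardless of what the rows stand for:
word_embeddings  = np.random.rand(4, 8)    # e.g. 4 word tokens
patch_embeddings = np.random.rand(16, 8)   # e.g. 16 image patches

print(attention(word_embeddings, word_embeddings, word_embeddings).shape)    # (4, 8)
print(attention(patch_embeddings, patch_embeddings, patch_embeddings).shape) # (16, 8)
```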

Where are the matrices coming from?


The article does not explain where the Q, K, V matrices come from or how the corresponding networks are trained. 108.53.169.6 (talk) 02:38, 4 August 2024 (UTC)[reply]
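In case it helps whoever picks this up: the usual answer is that Q, K, V are linear projections of the input through weight matrices that are learned like any other layer. A hedged sketch (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 16, 8

X = rng.normal(size=(n_tokens, d_model))   # input token embeddings

# W_Q, W_K, W_V are ordinary weight matrices: initialized randomly,
# then learned by backpropagation like any other layer's parameters.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```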

Article dispute resolution


@Ffid tham you have been repeatedly reverting all article edits back to one very specific version of the article. However, that version is disorganized and hard to read. Consider, for example:

> The attention network was designed to identify high correlations patterns amongst words in a given sentence, assuming that it has learned word correlation patterns from the training data. This correlation is captured as neuronal weights learned during training with backpropagation.

This uses awkward phrasing like "neuronal weights learned". It also says "attention network", but the attention mechanism is not a network; it is a module that can go into different kinds of neural networks.

> The diagram shows the Attention forward pass calculating correlations

This diagram is hard to understand, especially placed up there as the first image, showing all the mathematical operations at once. For good style, the article should start simple and build the attention mechanism piece by piece. Specifically, the section on seq2seq was written to build the attention mechanism piece by piece.

After that section, the picture can be displayed as a big summary (although I believe better pictures are available).

Furthermore, the "Encoder-decoder with attention" diagram is deeply confusing. I don't know what it shows, and I suspect neither would the readers. I have worked on the Transformer page a great deal, so I would know what the encoder-decoder mechanism is, but this diagram has defeated me. There are better diagrams out there that I can put in from seq2seq:

Seq2seq RNN encoder-decoder with attention mechanism, where the detailed construction of the attention mechanism is exposed. See the attention mechanism page for details.

Please justify your choice of that very specific version of the article, despite all the problems I have pointed out. See WP:DISPUTE for dispute-resolution guidelines.

pony in a strange land (talk) 17:36, 25 October 2024 (UTC)[reply]

Wiki Education assignment: Linguistics in the Digital Age


This article was the subject of a Wiki Education Foundation-supported course assignment, between 15 January 2025 and 9 May 2025. Further details are available on the course page. Student editor(s): Mirabbosm (article contribs).

— Assignment last updated by FblthpTheLost (talk) 00:10, 8 May 2025 (UTC)[reply]

Confusion with attention scores and attention mechanisms


In the article, additive, (scaled) dot-product, and multiplicative scoring seem to be confused with attention itself. Self-attention, multi-headed attention, and cross-attention are actual attention mechanisms. Using these terms interchangeably may lead to confusion. Nema9994 (talk) 15:02, 14 June 2025 (UTC)[reply]

Request to update historical overview of self-attention and affinity matrices


I would like to address recent feedback regarding my proposed edits to the Wikipedia page on attention mechanisms. The motivation behind these contributions is to correct and expand the current historical timeline, which presently omits critical, peer-reviewed research that significantly influenced the development of self-attention and affinity-based methods.

In particular, the omitted work includes publications in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and the International Conference on Computer Vision (ICCV). These are internationally recognized and highly selective venues in artificial intelligence and computer vision. The cited 2015 ICCV paper, for instance, introduces a novel affinity matrix formulation that directly anticipates the structure of self-attention popularized in 2017. This constitutes a clear case of academic precedence that should be represented in the timeline.

The intention is not to promote individual authors, but to ensure Wikipedia provides an accurate and comprehensive overview for readers—especially those unfamiliar with the technical background. The topic of infinite feature selection and its connection to early formulations of self-attention is underrepresented, despite its foundational role.

I am fully open to rephrasing the content to better align with Wikipedia’s neutrality and notability guidelines. However, rejecting inclusion of this material on the basis of undue weight without considering the quality and relevance of the cited sources (e.g., TPAMI and ICCV) risks perpetuating a misleading or incomplete record of the field’s development.

I kindly ask that we reconsider the edit and collaborate on a version that ensures historical accuracy while respecting Wikipedia’s standards. The goal is not to elevate any one contributor, but to present a truthful, verifiable account supported by top-tier independent sources.

Thank you. — KernelChronicles KernelChronicles (talk) 05:03, 26 July 2025 (UTC)[reply]