
Talk:Attention (machine learning)

From Wikipedia, the free encyclopedia

Confusing line "X is the input matrix of word embeddings, size 4 x 300. x is the word vector for "that". "


After "4x300" it immediately says "x is the word vector for 'that'." That's super-confusing, because one might think that the second x refers to the x between 4 and 300. There are three different uses of "x" in the sentence. Someone familiar with the field will be able to understand it, but Wikipedia is meant to be as clear as possible. ThinkerFeeler (talk) 00:20, 30 July 2023 (UTC)[reply]
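For illustration, the three uses of "x" can be kept apart in a short NumPy sketch (the numbers and the word index are illustrative, taken from the article's 4 x 300 example):

```python
import numpy as np

# "4 x 300" here just separates the two dimensions: 4 words, 300-dim embeddings.
n_words, d_embed = 4, 300

X = np.random.rand(n_words, d_embed)  # X: the whole input matrix, shape (4, 300)
x = X[2]                              # x: the embedding of one word, e.g. "that"
                                      # (index 2 is an assumed position)

print(X.shape)  # (4, 300)
print(x.shape)  # (300,)
```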

typo in "asterix"?


In the following extract: "The asterix within parenthesis "(*)" denotes the softmax", shouldn't the word be asterisk, not asterix? :-) Jrob kiwi (talk) 16:35, 23 August 2023 (UTC)[reply]

Does RNN mean "recursive neural network" or "recurrent neural network"?


In this article, is RNN supposed to mean "recursive neural network" or "recurrent neural network", or maybe sometimes one and sometimes the other? Once we figure this out, let's replace all occurrences with the correct three words, so that it is immediately clear even to novices. —Quantling (talk | contribs) 16:14, 24 October 2023 (UTC)[reply]

I'm pretty sure it is "recurrent". I am going to go ahead and edit. If I have it wrong, please accept my apologies ... and fix my edit. —Quantling (talk | contribs) 16:23, 24 October 2023 (UTC)[reply]

hard vs soft weights


The intro mentions hard and soft weights, which I haven't heard before in this context. Can someone provide a citation showing it is actually used terminology? DMH43 (talk) 15:15, 26 December 2023 (UTC)[reply]
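For what it's worth, the distinction the intro seems to be drawing can be sketched in a few lines of NumPy (the scores are made-up numbers; "soft" meaning softmax and "hard" meaning one-hot selection is the usual reading, not a claim about the article's sources):

```python
import numpy as np

scores = np.array([1.2, 0.3, 2.5])  # hypothetical attention scores

# "Soft" weights: a differentiable softmax; every element gets some weight.
soft = np.exp(scores) / np.exp(scores).sum()

# "Hard" weights: a one-hot selection of the single best-scoring element
# (non-differentiable, typically trained with sampling-based estimators).
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0

print(soft)  # all positive, sums to 1
print(hard)  # [0. 0. 1.]
```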

'word' should be replaced with something more generic


The article frequently uses the word "word" when talking about attention. For example the opening paragraph states: "It calculates "soft" weights for each word, more precisely for its embedding, in the context window.". However, attention is a concept that is independent of input type - it can and has been applied to words, pixel values, quantities, etc. I believe it would be clearer to replace the use of "word" in reference to the inputs that attention is applied to, with something more generic such as "input element" or "token". 180.150.65.6 (talk) 14:31, 5 March 2024 (UTC)[reply]
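The point that attention is input-agnostic is easy to demonstrate: the computation never inspects what its rows represent. A minimal sketch (function name and shapes are illustrative, not from the article):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; works on any embeddings."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ V

# The same function applies regardless of what the rows stand for:
word_embeddings  = np.random.rand(4, 8)    # e.g. 4 word tokens
patch_embeddings = np.random.rand(16, 8)   # e.g. 16 image patches

print(attention(word_embeddings, word_embeddings, word_embeddings).shape)    # (4, 8)
print(attention(patch_embeddings, patch_embeddings, patch_embeddings).shape) # (16, 8)
```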

Where are the matrices coming from?


The article does not explain where the Q, K, V matrices come from or how the corresponding networks are trained. 108.53.169.6 (talk) 02:38, 4 August 2024 (UTC)[reply]
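In case it helps whoever picks this up: the usual answer is that Q, K, V are linear projections of the input through weight matrices that are learned like any other layer. A hedged sketch (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 16, 8

X = rng.normal(size=(n_tokens, d_model))   # input token embeddings

# W_Q, W_K, W_V are ordinary weight matrices: initialized randomly,
# then learned by backpropagation like any other layer's parameters.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```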

Article dispute resolution


@Ffid tham you have been repeatedly reverting all article edits back to one very specific version of the article. However, that version is disorganized and hard to read. Consider, for example:

> The attention network was designed to identify high correlations patterns amongst words in a given sentence, assuming that it has learned word correlation patterns from the training data. This correlation is captured as neuronal weights learned during training with backpropagation.

This uses awkward phrasing like "neuronal weights learned". It also says "attention network", but the attention mechanism is not a network; it is a module that can go into different kinds of neural networks.

> The diagram shows the Attention forward pass calculating correlations

This diagram is hard to understand, especially placed up there as the first image, showing all the mathematical operations at once. For good style, the article should start simple and build the attention mechanism piece by piece. Specifically, the section on seq2seq was written to build the attention mechanism piece by piece.

After that section, the picture can be displayed as a big summary (although I believe better pictures are available).

Furthermore, the "Encoder-decoder with attention" diagram is deeply confusing. I don't know what it shows, and I suspect neither would the readers. I have worked on the Transformer page a great deal, so I would know what the encoder-decoder mechanism is, but this diagram has defeated me. There are better diagrams out there that I can put in from seq2seq:

Seq2seq RNN encoder-decoder with attention mechanism, where the detailed construction of the attention mechanism is exposed. See the attention mechanism page for details.

Please justify your choice of that very specific version of the article, despite all the problems I have pointed out. See WP:DISPUTE for dispute-resolution guidelines.

pony in a strange land (talk) 17:36, 25 October 2024 (UTC)[reply]

Wiki Education assignment: Linguistics in the Digital Age


This article was the subject of a Wiki Education Foundation-supported course assignment, between 15 January 2025 and 9 May 2025. Further details are available on the course page. Student editor(s): Mirabbosm (article contribs).

— Assignment last updated by FblthpTheLost (talk) 00:10, 8 May 2025 (UTC)[reply]

Confusion with attention scores and attention mechanisms


In the article, additive, (scaled) dot-product, and multiplicative scoring seem to be confused with attention itself. Self-attention, multi-headed attention, and cross-attention are actual attention mechanisms. Using these terms interchangeably may lead to confusion. Nema9994 (talk) 15:02, 14 June 2025 (UTC)[reply]

Request to update historical overview of self-attention and affinity matrices


I would like to address recent feedback regarding my proposed edits to the Wikipedia page on attention mechanisms. The motivation behind these contributions is to correct and expand the current historical timeline, which presently omits critical, peer-reviewed research that significantly influenced the development of self-attention and affinity-based methods.

In particular, the omitted work includes publications in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and the International Conference on Computer Vision (ICCV). These are internationally recognized and highly selective venues in artificial intelligence and computer vision. The cited 2015 ICCV paper, for instance, introduces a novel affinity matrix formulation that directly anticipates the structure of self-attention popularized in 2017. This constitutes a clear case of academic precedence that should be represented in the timeline.

The intention is not to promote individual authors, but to ensure Wikipedia provides an accurate and comprehensive overview for readers—especially those unfamiliar with the technical background. The topic of infinite feature selection and its connection to early formulations of self-attention is underrepresented, despite its foundational role.

I am fully open to rephrasing the content to better align with Wikipedia’s neutrality and notability guidelines. However, rejecting inclusion of this material on the basis of undue weight without considering the quality and relevance of the cited sources (e.g., TPAMI and ICCV) risks perpetuating a misleading or incomplete record of the field’s development.

I kindly ask that we reconsider the edit and collaborate on a version that ensures historical accuracy while respecting Wikipedia’s standards. The goal is not to elevate any one contributor, but to present a truthful, verifiable account supported by top-tier independent sources.

Thank you. — KernelChronicles KernelChronicles (talk) 05:03, 26 July 2025 (UTC)[reply]