Attention in Machine Learning
Introduction
Attention mechanisms in machine learning allow models to focus on the most relevant parts of their input when performing tasks such as translation, summarization, or image analysis. First introduced for neural machine translation, attention has since become a core component of deep learning architectures, most notably the Transformer model (Vaswani et al., 2017).
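In general terms (the notation below is illustrative rather than drawn from any single paper), an attention mechanism assigns a normalized weight to each input representation and returns their weighted sum:

    \alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad c = \sum_i \alpha_i h_i

where h_i are the input representations, e_i is a learned relevance score for h_i, and c is the resulting context vector.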
Origins and Development
The attention mechanism was introduced by Bahdanau, Cho, and Bengio in 2014 to address a limitation of sequence-to-sequence translation models, which compressed the entire input sentence into a single fixed-length vector. Their mechanism, often called additive attention, instead aligns each output word with the most relevant input positions (Bahdanau et al., 2014). This improved performance on long sequences and inspired further work on attention.
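In the notation of Bahdanau et al. (2014), with indices renamed here for readability, the decoder state s_{t-1} is scored against each encoder annotation h_j by a small feed-forward network, and the scores are normalized into alignment weights that form a context vector:

    e_{tj} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j), \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}, \qquad c_t = \sum_j \alpha_{tj} h_j

where W_a, U_a, and v_a are learned parameters and c_t is used when predicting the t-th output word.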
Luong et al. (2015) later refined this approach, distinguishing global and local attention and introducing multiplicative (dot-product) scoring functions that simplify computation and speed up training.
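Luong et al. (2015) compare several scoring functions; the two usually grouped under multiplicative attention score the current target hidden state h_t against each source hidden state \bar{h}_s as

    \mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} \bar{h}_s \quad \text{(dot)}, \qquad \mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s \quad \text{(general)}

where W_a is a learned matrix; the scores are then normalized with a softmax as in the additive case.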
Self-Attention and Transformers
A major breakthrough came with self-attention, in which each element of the input sequence attends to every other element, allowing the model to capture global dependencies. This idea is central to the Transformer architecture, which replaces recurrence entirely with attention mechanisms (Vaswani et al., 2017). As a result, Transformers became the foundation of models such as BERT, GPT, and T5.
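At the heart of the Transformer is scaled dot-product attention, \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V, where Q, K, and V are query, key, and value matrices and d_k is the key dimension (Vaswani et al., 2017). The following NumPy sketch is a minimal single-head illustration of that formula; the function name, toy shapes, and random projection matrices are chosen here for demonstration and do not come from any particular library.

    import numpy as np

    def scaled_dot_product_self_attention(X, W_q, W_k, W_v):
        """Single-head self-attention: every row of X attends to every row of X."""
        Q = X @ W_q                          # queries, shape (n, d_k)
        K = X @ W_k                          # keys,    shape (n, d_k)
        V = X @ W_v                          # values,  shape (n, d_v)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarities, scaled by sqrt(d_k)
        # row-wise softmax so each token's attention weights sum to 1
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                   # weighted sum of value vectors

    # Toy usage: 5 tokens with model dimension 8, projected to dimension 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
    print(scaled_dot_product_self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)

Dividing the dot products by \sqrt{d_k} keeps them in a range where the softmax gradients remain well behaved, a design choice discussed by Vaswani et al. (2017).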
Applications
Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves contextual understanding in tasks such as question answering and summarization. In computer vision, visual attention helps models focus on relevant image regions, improving object detection and image captioning.
References
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.