User:Rbdota23/Information retrieval
Article Draft
Below is the article draft. I will add a section named SPLADE, and add to the History and Timeline sections of the Information Retrieval article. I understand that History and Timeline are connected and similar, but it is our responsibility to update both.
History
By the late 1990s, the rise of the World Wide Web fundamentally transformed information retrieval. While early search engines such as AltaVista (1995) and Yahoo! (1994) offered keyword-based retrieval, they were limited in scale and ranking sophistication. The breakthrough came in 1998 with the founding of Google, which introduced the PageRank algorithm[1], leveraging the web’s hyperlink structure to assess page importance and improve relevance ranking.
During the early 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, Microsoft launched Bing, introducing features that would later incorporate semantic web technologies through the development of its Satori knowledge base. Academic analyses[2] have highlighted Bing’s semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing.
A major leap occurred in 2018, when Google deployed BERT (Bidirectional Encoder Representations from Transformers) to better understand the contextual meaning of queries and documents. This marked the integration of deep neural language models into production-scale retrieval systems.[3] BERT’s bidirectional training enabled a more nuanced comprehension of word relationships in context, improving the handling of natural language queries. Its success spurred adoption of transformer-based models across academic research and commercial search applications.[4]
Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods. Long-standing benchmarks such as the Text REtrieval Conference (TREC), initiated in 1992, and more recent evaluation frameworks such as MS MARCO (Microsoft MAchine Reading COmprehension)[5] became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment[6].
As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: sparse, dense, and hybrid models. Sparse models, including traditional term-based methods and learned variants like SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals[7]. Dense models use continuous vector embeddings to support semantic similarity beyond keyword overlap; they range from dual-encoder architectures that compare single query and document vectors to late-interaction models such as ColBERT that match contextualized token embeddings[8]. Hybrid models aim to combine the advantages of both, balancing the lexical precision of sparse methods with the semantic depth of dense models. This classification reflects a broader shift in the 2020s toward retrieval techniques that scale effectively while optimizing for both relevance and computational efficiency[9].
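The three scoring styles can be sketched in a few lines (an illustrative toy only: the term dictionaries, embedding vectors, and the `alpha` interpolation weight are hypothetical stand-ins, not the formulation of any specific published system):

```python
import numpy as np

def sparse_score(q_terms, d_terms):
    """Lexical evidence: a dot product over explicit vocabulary terms,
    the kind of exact matching an inverted index serves efficiently."""
    return sum(w * d_terms.get(t, 0.0) for t, w in q_terms.items())

def dense_score(q_vec, d_vec):
    """Semantic evidence: cosine similarity between continuous embeddings,
    which can match documents that share no query terms at all."""
    return float(q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def hybrid_score(q_terms, d_terms, q_vec, d_vec, alpha=0.5):
    """Hybrid retrieval: interpolate the two signals; alpha trades
    lexical precision against semantic recall."""
    return alpha * sparse_score(q_terms, d_terms) + (1 - alpha) * dense_score(q_vec, d_vec)

# Toy query/document pair: one shared term plus identical embeddings.
q_terms = {"retrieval": 1.0, "neural": 0.5}
d_terms = {"retrieval": 2.0, "index": 1.0}
q_vec = np.array([1.0, 0.0])
d_vec = np.array([1.0, 0.0])
print(hybrid_score(q_terms, d_terms, q_vec, d_vec))  # 0.5*2.0 + 0.5*1.0 = 1.5
```

In practice the interpolation weight is tuned per collection, and the sparse and dense candidate lists are often retrieved separately and merged, but the arithmetic above captures the basic idea.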
As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come to the forefront. Research is now focused not just on relevance and efficiency, but on transparency, accountability, and user trust in retrieval algorithms.
Timeline
1998
- Google is founded by Larry Page and Sergey Brin. It introduces the PageRank algorithm, which evaluates the importance of web pages based on hyperlink structure.[1]
2001
- Wikipedia launches as a free, collaborative online encyclopedia. It quickly becomes a major resource for information retrieval, particularly for natural language processing and semantic search benchmarks.[10]
2009
- Microsoft launches Bing, introducing features such as related searches, semantic suggestions, and later incorporating deep learning techniques into its ranking algorithms.[2]
2013
- Google’s Hummingbird algorithm goes live, marking a shift from keyword matching toward understanding query intent and semantic context in search queries.[11]
2016
- Microsoft introduces MS MARCO (Microsoft MAchine Reading COmprehension), a large-scale dataset designed for training and evaluating machine reading and passage ranking models.[5]
2018
- Google AI researchers release BERT (Bidirectional Encoder Representations from Transformers), enabling deep bidirectional understanding of language and improving document ranking and query understanding in IR.[3]
2020
- ColBERT (Contextualized Late Interaction over BERT) is proposed and introduced at SIGIR 2020, enabling scalable neural retrieval by combining efficient indexing with contextual deep matching.[12][8]
2021
- SPLADE is introduced at SIGIR 2021. It’s a sparse neural retrieval model that balances lexical and semantic features using masked language modeling and sparsity regularization.[13]
- The BEIR benchmark is released to evaluate zero-shot IR across 18 datasets covering diverse tasks. It standardizes comparisons between dense, sparse, and hybrid IR models.[14]
SPLADE
SPLADE is a neural retrieval model designed to learn sparse vector representations for queries and documents, effectively merging the strengths of traditional lexical matching methods and modern neural embeddings. Unlike dense retrieval models that rely on continuous vector spaces, SPLADE produces sparse representations, allowing for efficient integration with inverted index structures commonly used in information retrieval systems.[15]
SPLADE and its successors have advanced the field of learned sparse retrieval. The original SPLADE model was presented at the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), a peer-reviewed conference in the field.[16]
SPLADE v2 refines the original model, producing even sparser term-weighted representations and further improving compatibility with the inverted index structures common in production retrieval systems.[15]
The model operates by using the masked language modeling (MLM) head of BERT, applying explicit sparsity regularization to generate term-weighted representations. This approach enables SPLADE v2 to perform exact term matching while capturing semantic nuances, addressing challenges like the vocabulary mismatch problem encountered in traditional sparse methods.[15]
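The transformation described above can be sketched numerically (a minimal illustration: random numbers stand in for real BERT MLM logits; `splade_pool` mirrors the published log-saturated max pooling, and `flops_reg` approximates the FLOPS-style sparsity regularizer used during training):

```python
import numpy as np

def splade_pool(mlm_logits):
    """Collapse per-token MLM logits (seq_len x vocab_size) into a single
    vocabulary-sized vector: w_j = max_i log(1 + relu(logit_ij)).
    ReLU keeps only positive evidence; log1p saturates dominant terms."""
    return np.log1p(np.maximum(mlm_logits, 0.0)).max(axis=0)

def flops_reg(batch_reps):
    """FLOPS-style regularizer: squared mean activation per vocabulary
    entry, summed over the vocabulary; minimizing it pushes rarely useful
    dimensions toward zero, which keeps the representations index-friendly."""
    return float(np.sum(np.mean(batch_reps, axis=0) ** 2))

# Stand-in logits for a 5-token text over a 10-term vocabulary.
rng = np.random.default_rng(0)
rep = splade_pool(rng.normal(size=(5, 10)))
# Vocabulary entries that never receive a positive logit stay exactly zero,
# so the resulting vector can be stored in an ordinary inverted index.
```

Because expansion happens in the vocabulary space, a query term absent from the document can still match a document-side expansion term, which is how the model mitigates vocabulary mismatch.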
Key improvements in SPLADE v2 include modifications to the pooling mechanism, benchmarking models based solely on document expansion, and introducing models trained with knowledge distillation. These enhancements have led to significant gains in retrieval effectiveness, achieving more than a 9% improvement in NDCG@10 on the TREC DL 2019 dataset and leading to state-of-the-art results on the BEIR benchmark.[15]
One of the notable advantages of SPLADE is its ability to maintain efficiency comparable to traditional lexical search methods like BM25, while offering the enriched semantic understanding characteristic of neural models. This balance makes it a compelling choice for real-world applications where both speed and accuracy are critical.[16]
The official implementation and model weights for SPLADE are available under a Creative Commons NonCommercial license, with community-driven versions like SPLADE++ released under more permissive licenses.[16][17]
References
- ^ a b "The Anatomy of a Search Engine". infolab.stanford.edu. Retrieved 2025-04-09.
- ^ a b Uyar, Ahmet; Aliyu, Farouk Musa (2015-01-01). "Evaluating search features of Google Knowledge Graph and Bing Satori: Entity types, list searches and query interfaces". Online Information Review. 39 (2): 197–213. doi:10.1108/OIR-10-2014-0257. ISSN 1468-4527.
- ^ a b Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019-05-24), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv, doi:10.48550/arXiv.1810.04805, arXiv:1810.04805, retrieved 2025-04-09
- ^ Gardazi, Nadia Mushtaq; Daud, Ali; Malik, Muhammad Kamran; Bukhari, Amal; Alsahfi, Tariq; Alshemaimri, Bader (2025-03-15). "BERT applications in natural language processing: a review". Artificial Intelligence Review. 58 (6): 166. doi:10.1007/s10462-025-11162-5. ISSN 1573-7462.
- ^ a b Bajaj, Payal; Campos, Daniel; Craswell, Nick; Deng, Li; Gao, Jianfeng; Liu, Xiaodong; Majumder, Rangan; McNamara, Andrew; Mitra, Bhaskar (2018-10-31), MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, arXiv, doi:10.48550/arXiv.1611.09268, arXiv:1611.09268, retrieved 2025-04-24
- ^ Craswell, Nick; Mitra, Bhaskar; Yilmaz, Emine; Rahmani, Hossein A.; Campos, Daniel; Lin, Jimmy; Voorhees, Ellen M.; Soboroff, Ian (2024-02-28). "Overview of the TREC 2023 Deep Learning Track".
- ^ Formal, Thibault; Piwowarski, Benjamin; Clinchant, Stéphane (2021-07-11). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery: 2288–2292. doi:10.1145/3404835.3463098. ISBN 978-1-4503-8037-9.
- ^ a b Khattab, Omar; Zaharia, Matei (2020-07-25). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT". Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20. New York, NY, USA: Association for Computing Machinery: 39–48. doi:10.1145/3397271.3401075. ISBN 978-1-4503-8016-4.
- ^ Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2021-08-19), Pretrained Transformers for Text Ranking: BERT and Beyond, arXiv, doi:10.48550/arXiv.2010.06467, arXiv:2010.06467, retrieved 2025-04-24
- ^ "History of Wikipedia", Wikipedia, 2025-02-21, retrieved 2025-04-09
- ^ Sullivan, Danny (2013-09-26). "FAQ: All About The New Google "Hummingbird" Algorithm". Search Engine Land. Retrieved 2025-04-09.
- ^ Khattab, Omar; Zaharia, Matei (2020-06-04), ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, arXiv, doi:10.48550/arXiv.2004.12832, arXiv:2004.12832, retrieved 2025-04-09
- ^ Formal, Thibault; Piwowarski, Benjamin; Clinchant, Stéphane (2021-07-11). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery: 2288–2292. doi:10.1145/3404835.3463098. ISBN 978-1-4503-8037-9.
- ^ Thakur, Nandan; Reimers, Nils; Rücklé, Andreas; Srivastava, Abhishek; Gurevych, Iryna (2021-10-21), BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, arXiv, doi:10.48550/arXiv.2104.08663, arXiv:2104.08663, retrieved 2025-04-09
- ^ a b c d Formal, Thibault; Lassance, Carlos; Piwowarski, Benjamin; Clinchant, Stéphane (2021-09-21), SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, arXiv, doi:10.48550/arXiv.2109.10086, arXiv:2109.10086, retrieved 2025-03-12
- ^ a b c "Learned sparse retrieval", Wikipedia, 2024-10-23, retrieved 2025-03-12
- ^ naver/splade, NAVER, 2025-03-10, retrieved 2025-03-12