User:Rbdota23/Information retrieval
Article Draft
Below is the article draft. I will add a section named SPLADE, and add to the History and Timeline sections of the Information Retrieval article. I understand that the History and Timeline sections are connected and similar, but it is our responsibility to update both.
History
By the late 1990s, the rise of the World Wide Web fundamentally transformed information retrieval. While early search engines such as AltaVista (1995) and Yahoo! (1994) offered keyword-based retrieval, they were limited in scale and ranking refinement. The breakthrough came in 1998 with the founding of Google, which introduced the PageRank algorithm[1], using the web’s hyperlink structure to assess page importance and improve relevance ranking.
During the 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, Microsoft launched Bing, introducing features that would later incorporate semantic web technologies through the development of its Satori knowledge base. Academic analyses[2] have highlighted Bing’s semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing.
A major leap came with BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018 and subsequently deployed in its search engine to better understand the contextual meaning of queries and documents. This marked one of the first times deep neural language models were used at scale in real-world retrieval systems.[3] BERT’s bidirectional training enabled a more refined comprehension of word relationships in context, improving the handling of natural language queries. Because of its success, transformer-based models gained traction in academic research and commercial search applications.[4]
Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods. Long-standing benchmarks such as the Text REtrieval Conference (TREC), initiated in 1992, and more recent evaluation frameworks such as MS MARCO (Microsoft MAchine Reading COmprehension, 2016)[5] became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment[6].
As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: sparse, dense, and hybrid models. Sparse models, including traditional term-based methods and learned variants like SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals[7]. Dense models, such as the late-interaction architecture ColBERT, use continuous vector embeddings to support semantic similarity matching beyond keyword overlap[8]. Hybrid models aim to combine the advantages of both, balancing the lexical (token-level) precision of sparse methods with the semantic depth of dense models. This categorization reflects the trade-offs among scalability, relevance, and efficiency in retrieval systems[9].
As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come to the fore. Research now focuses not just on relevance and efficiency, but also on transparency, accountability, and user trust in retrieval algorithms.
Timeline
1998: Google is founded by Larry Page and Sergey Brin. It introduces the PageRank algorithm, which evaluates the importance of web pages based on hyperlink structure.[1]
2001: Wikipedia launches as a free, collaborative online encyclopedia. It quickly becomes a major resource for information retrieval, particularly for natural language processing and semantic search benchmarks.[10]
2009: Microsoft launches Bing, introducing features such as related searches and semantic suggestions, and later incorporating deep learning techniques into its ranking algorithms.[2]
2013: Google’s Hummingbird algorithm goes live, marking a shift from keyword matching toward understanding query intent and semantic context in search queries.[11]
2016: Microsoft releases MS MARCO (Microsoft MAchine Reading COmprehension), a large-scale dataset designed for training and evaluating machine reading and passage ranking models.[5]
2018: Google AI researchers release BERT (Bidirectional Encoder Representations from Transformers), enabling deep bidirectional understanding of language and improving document ranking and query understanding in IR.[3]
2020: ColBERT (Contextualized Late Interaction over BERT), a model designed for efficient passage retrieval using contextualized embeddings and late interaction, is introduced at SIGIR 2020.[12][8]
2021: SPLADE is introduced at SIGIR 2021. It is a sparse neural retrieval model that balances lexical and semantic features using masked language modeling and sparsity regularization.[13]
2021: The BEIR benchmark is released to evaluate zero-shot IR across 18 datasets covering diverse tasks. It standardizes comparisons between dense, sparse, and hybrid IR models.[14]
Model Type
Third Dimension: Representational approach-based classification
In addition to the theoretical distinctions, modern information retrieval models are also categorized by how queries and documents are represented and compared, using a practical classification that distinguishes between sparse, dense, and hybrid models[7].
- Sparse models utilize interpretable, term-based representations and typically rely on inverted index structures. Classical methods such as TF-IDF and BM25 fall under this category, along with more recent learned sparse models that integrate neural architectures while retaining sparsity[14].
- Dense models represent queries and documents as continuous vectors using deep learning models, typically transformer-based encoders. These models enable semantic similarity matching beyond exact term overlap and are used in tasks involving semantic search and question answering[15].
- Hybrid models aim to combine the strengths of both approaches, integrating lexical (token-level) and semantic signals through score fusion, late interaction, or multi-stage ranking pipelines[16]; a toy example of score fusion is sketched after this list.
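To make the three categories concrete, the following is a minimal, self-contained Python sketch, not drawn from any cited system: the sparse scorer is a simplified TF-IDF over exact term matches, the hash-based `embed` function is a toy stand-in for a trained transformer encoder, and the hybrid scorer fuses the two signals with an assumed weight `alpha`.

```python
import math
from collections import Counter

# Toy corpus of three short "documents".
DOCS = [
    "neural networks for information retrieval",
    "classical keyword search with inverted index",
    "semantic search using dense vector embeddings",
]

def sparse_score(query: str, doc: str, corpus: list[str]) -> float:
    """Simplified TF-IDF: exact term matching, as in classical sparse models."""
    q_terms, d_terms = Counter(query.split()), Counter(doc.split())
    score = 0.0
    for term, q_tf in q_terms.items():
        df = sum(1 for d in corpus if term in d.split())  # document frequency
        if df:
            score += q_tf * d_terms[term] * math.log(len(corpus) / df)
    return score

def embed(text: str, dim: int = 16) -> list[float]:
    """Hash tokens into a small vector; a real dense model uses a trained encoder."""
    vec = [0.0] * dim
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    return vec

def dense_score(query: str, doc: str) -> float:
    """Cosine similarity between the two vector representations."""
    q, d = embed(query), embed(doc)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def hybrid_score(query: str, doc: str, corpus: list[str], alpha: float = 0.5) -> float:
    """Weighted score fusion of the sparse and dense signals."""
    return alpha * sparse_score(query, doc, corpus) + (1 - alpha) * dense_score(query, doc)

for doc in DOCS:
    print(f"{hybrid_score('dense vector search', doc, DOCS):.3f}  {doc}")
```

In deployed systems the sparse component would run over an inverted index and the dense component over an approximate nearest-neighbour index, and the two score distributions would be normalized before fusion; the weight `alpha` here is an arbitrary illustration, not a recommended setting.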
This classification has become increasingly common in both academic research and real-world applications, and it is widely adopted in evaluation benchmarks for information retrieval models[9][14].
References
- ^ a b "The Anatomy of a Search Engine". infolab.stanford.edu. Retrieved 2025-04-09.
- ^ a b Uyar, Ahmet; Aliyu, Farouk Musa (2015-01-01). "Evaluating search features of Google Knowledge Graph and Bing Satori: Entity types, list searches and query interfaces". Online Information Review. 39 (2): 197–213. doi:10.1108/OIR-10-2014-0257. ISSN 1468-4527.
- ^ a b Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019-05-24), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv, doi:10.48550/arXiv.1810.04805, arXiv:1810.04805, retrieved 2025-04-09
- ^ Gardazi, Nadia Mushtaq; Daud, Ali; Malik, Muhammad Kamran; Bukhari, Amal; Alsahfi, Tariq; Alshemaimri, Bader (2025-03-15). "BERT applications in natural language processing: a review". Artificial Intelligence Review. 58 (6): 166. doi:10.1007/s10462-025-11162-5. ISSN 1573-7462.
- ^ a b Bajaj, Payal; Campos, Daniel; Craswell, Nick; Deng, Li; Gao, Jianfeng; Liu, Xiaodong; Majumder, Rangan; McNamara, Andrew; Mitra, Bhaskar (2018-10-31), MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, arXiv, doi:10.48550/arXiv.1611.09268, arXiv:1611.09268, retrieved 2025-04-24
- ^ Craswell, Nick; Mitra, Bhaskar; Yilmaz, Emine; Rahmani, Hossein A.; Campos, Daniel; Lin, Jimmy; Voorhees, Ellen M.; Soboroff, Ian (2024-02-28). "Overview of the TREC 2023 Deep Learning Track".
- ^ a b Kim, Dohyun; Zhao, Lina; Chung, Eric; Park, Eun-Jae (2021-07-20), Pressure-robust staggered DG methods for the Navier-Stokes equations on general meshes, arXiv, doi:10.48550/arXiv.2107.09226, arXiv:2107.09226, retrieved 2025-04-24
- ^ a b Khattab, Omar; Zaharia, Matei (2020-07-25). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT". Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20. New York, NY, USA: Association for Computing Machinery: 39–48. doi:10.1145/3397271.3401075. ISBN 978-1-4503-8016-4.
- ^ a b Lin, Jimmy; Nogueira, Rodrigo; Yates, Andrew (2021-08-19), Pretrained Transformers for Text Ranking: BERT and Beyond, arXiv, doi:10.48550/arXiv.2010.06467, arXiv:2010.06467, retrieved 2025-04-24
- ^ "History of Wikipedia", Wikipedia, 2025-02-21, retrieved 2025-04-09
- ^ Sullivan, Danny (2013-09-26). "FAQ: All About The New Google "Hummingbird" Algorithm". Search Engine Land. Retrieved 2025-04-09.
- ^ Khattab, Omar; Zaharia, Matei (2020-06-04), ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, arXiv, doi:10.48550/arXiv.2004.12832, arXiv:2004.12832, retrieved 2025-04-09
- ^ Jones, Rosie; Zamani, Hamed; Schedl, Markus; Chen, Ching-Wei; Reddy, Sravana; Clifton, Ann; Karlgren, Jussi; Hashemi, Helia; Pappu, Aasish; Nazari, Zahra; Yang, Longqi; Semerci, Oguz; Bouchard, Hugues; Carterette, Ben (2021-07-11). "Current Challenges and Future Directions in Podcast Information Access". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery: 1554–1565. doi:10.1145/3404835.3462805. ISBN 978-1-4503-8037-9.
- ^ a b c Thakur, Nandan; Reimers, Nils; Rücklé, Andreas; Srivastava, Abhishek; Gurevych, Iryna (2021-10-21), BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, arXiv, doi:10.48550/arXiv.2104.08663, arXiv:2104.08663, retrieved 2025-04-09
- ^ Lau, Jey Han; Armendariz, Carlos; Lappin, Shalom; Purver, Matthew; Shu, Chang (2020). Johnson, Mark; Roark, Brian; Nenkova, Ani (eds.). "How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in Context". Transactions of the Association for Computational Linguistics. 8: 296–310. doi:10.1162/tacl_a_00315.
- ^ Arabzadeh, Negar; Yan, Xinyi; Clarke, Charles L. A. (2021-09-22), Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection, arXiv, doi:10.48550/arXiv.2109.10739, arXiv:2109.10739, retrieved 2025-05-03