XML retrieval

XML-Retrieval, or XML Information Retrieval, is the content-based retrieval of documents structured with the eXtensible Markup Language (XML). As such, it is used to compute the relevance of XML-documents [7].

Most XML-Retrieval approaches compute relevance using techniques from information retrieval (IR), e.g. by computing the similarity between a keyword query (a set of query terms) and the document. However, in XML-Retrieval the query can also contain structural hints. So-called "content and structure" (CAS) queries let users specify what structure the requested content can or must have.
Exploiting the self-describing structure of XML-documents can improve search significantly. This includes the use of CAS-queries, assigning different weights to different XML-elements, and the focused retrieval of sub-documents.
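As an illustration of element weighting, the following minimal Python sketch counts occurrences of a query term and multiplies each occurrence by a weight that depends on the enclosing element. The tag names, the weights in TAG_WEIGHTS and the function weighted_term_frequency are hypothetical and chosen only for this example; they are not taken from any of the cited systems.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-tag weights: a hit in a title counts more than a hit in body text.
TAG_WEIGHTS = {"title": 3.0, "abstract": 2.0}
DEFAULT_WEIGHT = 1.0


def weighted_term_frequency(xml_text, term):
    """Sum the occurrences of `term`, each weighted by the tag of the element
    whose direct text contains it (tail text is ignored for brevity)."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    total = 0.0
    for elem in root.iter():
        weight = TAG_WEIGHTS.get(elem.tag, DEFAULT_WEIGHT)
        tokens = (elem.text or "").lower().split()
        total += weight * tokens.count(term)
    return total


doc_title_hit = "<article><title>XML retrieval</title><body>Ranking of elements</body></article>"
doc_body_hit = "<article><title>Ranking</title><body>Retrieval of XML elements</body></article>"

print(weighted_term_frequency(doc_title_hit, "xml"))
print(weighted_term_frequency(doc_body_hit, "xml"))
```

Under these assumed weights, the document that mentions the term in its title receives a higher score than the one that mentions it only in the body, although both contain the term once.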
Ranking in XML-Retrieval can incorporate both content relevance and structural similarity, which is the resemblance between the structure given in the query and the structure of the document. Also, the retrieval units resulting from an XML query are not always entire documents, but can be arbitrarily deeply nested XML elements, i.e. dynamic documents. The aim is to find the smallest retrieval unit that is highly relevant. Relevance can be defined according to the notion of specificity, which is the extent to which a retrieval unit focuses on the topic of the request [4].
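The interplay of content relevance, structural similarity and focused retrieval can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the scoring used by INEX or by any cited approach: content relevance is approximated by the fraction of an element's tokens that are query terms (a crude stand-in for specificity), structural similarity by whether the element's tag matches a single hinted tag, and the two are mixed linearly with an assumed parameter alpha. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML retrieval</title>
  <sec>
    <title>Ranking</title>
    <p>Ranking combines content relevance and structural similarity.</p>
  </sec>
  <sec>
    <title>History</title>
    <p>INEX was founded in 2002.</p>
  </sec>
</article>"""


def walk(elem, path=()):
    """Yield (tag path, element) for elem and all of its descendants."""
    path = path + (elem.tag,)
    yield path, elem
    for child in elem:
        yield from walk(child, path)


def content_score(elem, terms):
    """Fraction of the element's tokens (including nested text) that are query terms."""
    tokens = [t.strip(".,") for t in " ".join(elem.itertext()).lower().split()]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in terms) / len(tokens)


def structure_score(path, hint):
    """1.0 if the element itself has the hinted tag, 0.5 if an ancestor has it, else 0.0."""
    if path[-1] == hint:
        return 1.0
    return 0.5 if hint in path[:-1] else 0.0


def rank(xml_text, terms, hint, alpha=0.7):
    """Score every element that contains at least one query term; best first."""
    scored = []
    for path, elem in walk(ET.fromstring(xml_text)):
        content = content_score(elem, terms)
        if content == 0.0:
            continue  # elements without any query term are not retrieval candidates
        score = alpha * content + (1 - alpha) * structure_score(path, hint)
        scored.append((score, "/".join(path)))
    return sorted(scored, reverse=True)


for score, path in rank(DOC, {"content", "relevance"}, hint="p"):
    print(f"{score:.3f}  {path}")
```

In this example the paragraph element ranks above its enclosing section and the whole article: the larger units also contain the query terms, but they are less focused on them, which is the intuition behind returning the smallest highly relevant unit.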

An overview of existing approaches is given in [1] and [5]. The INitiative for the Evaluation of XML-Retrieval (INEX) was founded in 2002 and provides a platform for evaluating such algorithms [4]. Three different areas influence XML-Retrieval [2]:

• Traditional XML query languages:
Query languages such as the W3C standard XQuery [8] support complex queries, but only look for exact matches. Therefore, they need to be extended to allow for vague search with relevance computation. Most XML-centred approaches presuppose fairly exact knowledge of the documents’ schemas [6].
• Databases:
Classic database systems have adopted the ability to store semi-structured data [2], which led to the development of XML-databases. These systems are often very formal, concentrate more on searching than on ranking, and are used by experienced users who are able to formulate complex queries.
• Information Retrieval:
Classic Information Retrieval models such as the Vector Space Model provide relevance ranking, but do not take document structure into account; only flat queries are supported. Also, they apply a static document concept, so that retrieval units are usually entire documents [6]. They can be extended to consider structural information and dynamic document retrieval. Examples of approaches extending the Vector Space Model are [3] and [6]: they use document subtrees (index terms plus structure) as dimensions of the vector space.
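The idea of using terms plus structure as vector space dimensions can be illustrated with a simplified Python sketch. It is not the method of [3] or [6]: instead of general document subtrees, each dimension here is simply a pair of element path and term, weights are raw counts, and the query is itself written as a small XML fragment. The helper names structured_terms and cosine are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def structured_terms(xml_text):
    """Count (path, term) pairs, where path is the tag path of the element
    whose direct text contains the term (tail text is ignored for brevity)."""
    counts = Counter()

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        for token in (elem.text or "").lower().split():
            counts[(path, token)] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = "<book><title>xml retrieval</title><author>smith</author></book>"
doc_b = "<book><title>smith</title><author>xml retrieval</author></book>"
query = "<book><title>xml</title></book>"

q = structured_terms(query)
print(cosine(q, structured_terms(doc_a)))
print(cosine(q, structured_terms(doc_b)))
```

Both documents contain exactly the same words, so a flat Vector Space Model would score them identically. With path-qualified dimensions, only the document in which "xml" occurs inside a title element matches the query. A practical system would soften this exact path matching to allow vague structural hints.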

References:

[1] Amer-Yahia, S.; Lalmas, Mounia: XML Search: Languages, INEX and Scoring. SIGMOD Rec. Vol. 35, No. 4, 2006.
[2] Fuhr, Norbert; Gövert, N.; Kazai, Gabriella; Lalmas, Mounia (eds.): INitiative for the Evaluation of XML Retrieval (INEX).
In: Proc. of the First INEX Workshop, Dagstuhl, Germany, 2002, ERCIM Workshop Proceedings, France, 2003.
[3] Liu, S.; Zou, Q.; Chu, W.: Configurable Indexing and Ranking for XML Information Retrieval.
In: Proc. of the 27th Annual International ACM SIGIR Conference, ACM Press, 2004.
[4] Malik, Sadia; Trotman, Andrew; Lalmas, Mounia; Fuhr, Norbert: Overview of INEX 2006.
In: Proc. of the Fifth Workshop of the INitiative for the Evaluation of XML Retrieval, Germany, 2007.
[5] Pal, Sukomal: XML Retrieval – A Survey. 2007, Technical Report, CVPR.
[6] Schlieder, Torsten; Meuss, H.: Querying and Ranking XML Documents. Journal of the American Society for Information Science and Technology, Vol. 53, No. 6, 2002.
[7] Winter, Judith; Drobnik, Oswald: An Architecture for XML Information Retrieval in a Peer-to-Peer Environment.
ACM PIKM2007 at ACM 16th Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, 2007.
[8] World Wide Web Consortium: XQuery 1.0: An XML Query Language. W3C Recommendation, 23. Jan. 2007, http://www.w3.org/TR/xquery/.