
Search engine indexing


The search engine indexing process is a widely studied process in the field of computer science. The process concerns how information is indexed so that it can be searched quickly. Similar indexing processes are used by many of the popular internet search engines today, as well as by various software applications.

This article provides a high-level introduction to the general algorithm involved in creating a search engine index and to many of the data structures involved in the process. It does not detail the entire process; it starts from the point where a search engine has already collected the documents to be indexed. For more information about document collection, refer to the article on search engine crawling. This article only refers to the subclass of search engines which deal with natural language data (textual data such as web pages, reports, books, and documents).

The Corpus Data Structure

The algorithm covered in this article assumes we are referring to a class of full-text search engines (as opposed to agent-based or partial-text search engines). These search engines have gathered all the searchable documents or information into one large list and have stored this list on one or more computers. This list is referred to as the corpus.

The following figure is an overly simplified and fictitious example of a corpus. It is a list of documents. For each document, a unique identifying value is stored, along with a title and the contents of the document.

Document Id | Title                                 | Full Contents
1           | The Lion, The Witch, and the Wardrobe | Once upon a time...
2           | A Mid-Summer Night's Dream            | There was a great...
3           | War and Peace                         | ...
4           | A Tale of Two Cities                  | ...

Note that search engines differ greatly in corpus design and that several other pieces of information may be stored about each document (commonly referred to as metadata). For example, we might store the date the document was added to the corpus, when it was last updated, retrieved, or checked, the web address from which the document was obtained if it is a web page, the document's author, the type or format of the document (HTML, PDF, DOC), how many bytes the document takes up on the computer's hard drive (the document's length or disk size), and so forth. Some designs do not store the full document, some store the document in several separate pieces, and some do not store the document contents at all, only its meta information. For the purposes of this article, it is sufficient to understand that somewhere we have a list (or some other data structure) of documents, each identified in some manner, and that we have (or have access to) the full or partial contents of these documents.
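
For illustration only, a corpus entry might be modelled as a simple record holding an identifier, a title, the contents, and optional metadata. This is a minimal sketch in Python; the field names and types are assumptions for this article, not any particular engine's schema.

# A minimal, hypothetical sketch of a corpus record; real engines use far
# richer schemas and rarely keep the whole corpus in memory like this.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    doc_id: int                      # unique identifier within the corpus
    title: str                       # document title
    contents: str                    # full (or partial) text of the document
    url: Optional[str] = None        # where the document was obtained, if a web page
    mime_type: Optional[str] = None  # e.g. "text/html", "application/pdf"

# The corpus itself can be thought of as a list (or other collection) of such records.
corpus = [
    Document(1, "The Lion, The Witch, and the Wardrobe", "Once upon a time..."),
    Document(2, "A Mid-Summer Night's Dream", "There was a great..."),
]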


Corpus Traversal

The algorithm iterates through each document in the corpus. The order in which the algorithm accesses each document varies among different search engine designs. For introductory purposes it is sufficient to consider a sequential access approach, where we start with the first document, then move to the next, and so forth until we have processed all the documents in the list. Each document in the list undergoes a series of processes that are outlined in the following sections.

In practice, search engine designers pay attention to several issues regarding the scheduling of which documents should be indexed, and whether the full text of a single document is indexed all at once, in pieces over a period of time, or in pieces indexed simultaneously. When to traverse is often tied to how the search engine was designed to collect, add, or update documents in the list. In other cases it may be tied to how the document is actually stored on a computer's hard drive, whether it is stored in its original format or some derived format, or whether it is stored in a compressed or decompressed format.

Entire algorithms have been developed to choose the best way to traverse the corpus. Essentially, these all derive from the principles of corpus design (the corpus is also called the cache), which is not covered in this article. For understanding the process in the context of this article, it is only necessary to acknowledge that a document is accessed and opened for the next step in the indexing process.
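
Assuming the simple sequential order described above, traversal amounts to visiting each document once and handing it to the next stage. The sketch below is illustrative only; the index_document callback is a hypothetical placeholder for the steps described in the following sections.

# Sequential corpus traversal: visit each document once, in list order,
# and pass it to the (yet to be defined) indexing steps.
def traverse(corpus, index_document):
    for document in corpus:
        index_document(document)

# Example usage with a placeholder indexing step:
# traverse(corpus, lambda doc: print("indexing", doc.doc_id))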

Tokenization

One of the fundamental principles to understand is that computers are not inherently aware of the structure of a natural language document. Humans are trained early on to read text, separate it into words, and understand those words. To a computer, a document is only a long sequence of bytes. The bytes encode characters, and the characters together form the document. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document.

Instead, humans develop a computer program which instructs the computer how to identify what constitutes an individual or distinct word. This process is referred to as tokenization or lexical analysis, or lexing for short. The computer program or algorithm responsible for this step is commonly referred to as the parser or lexer. Lexing gets its name from the words lexeme and lexical. Tokenization is a difficult and complicated task. Entire programs have been created which specialize in tokenization and parsing alone, such as Lex, YACC, or JFlex, and many search engines incorporate such existing parsers. These tokenizer programs are also widely used in the field of natural language processing for purposes outside of search engine indexing.

Regarding the indexing algorithm, the current step is to parse each document into tokens. At a basic level of understanding, the algorithm splits the document up into words, which are called tokens. In reality this is a much more complex process. As a generalization, it is sufficient to come away with the understanding that each token is essentially a word (this is not always the case, but it is for the purposes of this reference). For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two tokens found: the token "Hello" and the token "World".
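
A very simplified tokenizer, assuming whitespace-delimited English text, might look like the sketch below. Real lexers (for example those generated with Lex or JFlex) handle punctuation, numbers, character sets, and much more.

# Naive whitespace tokenization: split the document text on runs of whitespace.
def tokenize(text):
    return text.split()

print(tokenize("Hello World"))  # ['Hello', 'World']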

Search engines differ in design at this point in the algorithm regarding what constitutes a token. Some consider punctuation characters to be tokens, whereas others ignore punctuation or do not treat it as a distinct case. Some distinguish between numbers, alphabetical words, and alphanumerical words (mixtures of letters and numbers), and further classify words as upper case, lower case, or mixed case. All kinds of token properties are considered, and every commercial search engine appears to store or use these properties differently. Some designs do not store any properties for a token, only its contents (the characters that make up the word). The matter is further complicated by the fact that documents come in different languages, and some documents are stored differently on a computer's hard drive depending on the document's character set. Issues with multilingual corpora are sometimes dealt with at this point in the algorithm, deferred to a later step, or not dealt with at all.

For each token, various characteristics may be recorded, such as whether the token's value is upper or lower case, its language, which natural language sentence or paragraph contains the token, the token's part of speech, the position of the token within its containing sentence, and the position of the token within the overall document. Stemming may be performed at this point, which essentially involves ignoring or stripping the affixes or suffixes of the token. Note that each search engine implementation works differently, and the characteristics collected at this point vary.

One accepted definition of token position is the count of characters preceding the occurrence of the word (citation needed). Position is sometimes referred to as the offset. The first token in a document begins with the first character in the document, so its position is 0; the second word might occur 5 characters later, so its recorded position is 5. An alternative definition of token position is the count of preceding tokens (citation needed): the first token is at position 0, the second token at position 1, and so on.

Suppose a document consists solely of the sentence "It's a dog eat dog world". The token array (a data structure similar to a list, or a subtype of a list) might look like the following:

Token (Word) | Position
It's         | 0
a            | 5
dog          | 7
eat          | 11
dog          | 15
world        | 19
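
Using the character-offset definition of position (a count of preceding characters, starting at 0), the token/position pairs above can be produced with a short sketch such as the following; the function name is illustrative only.

import re

# Record each token together with its character offset within the document,
# matching the 0-based "count of preceding characters" definition above.
def tokenize_with_positions(text):
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

print(tokenize_with_positions("It's a dog eat dog world"))
# [("It's", 0), ('a', 5), ('dog', 7), ('eat', 11), ('dog', 15), ('world', 19)]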

Token Hashing

Storing the token 'dog' repeatedly is a redundancy, and storing a list such as this for all the documents in a large corpus is not technically feasible. Instead, many search engine designs incorporate a token hash, which assigns a unique identifying value to each word. This word hash is a separately managed data structure commonly referred to as the lexicon. The following table displays what the lexicon might look like:

Word Id | Word
1       | a
2       | dog
3       | eat
4       | it's
5       | world

After hashing the words, the algorithm stores a list of each word's hash key (the key being the id or value identifying the word) and its position (and potentially other characteristics such as the part of speech). This listing consumes less computer memory (takes up less space) given the smaller key size.

Word Id | Position
4       | 0
1       | 5
2       | 7
3       | 11
2       | 15
5       | 19
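
A sketch of the lexicon idea follows: assign each distinct word an id the first time it is seen, then store (word id, position) pairs instead of the words themselves. The ids produced here are assigned in order of first appearance, so they need not match the alphabetically ordered ids in the tables above.

import re

# Build a lexicon (word -> id) and a compact (word id, position) listing.
def hash_tokens(text, lexicon):
    listing = []
    for m in re.finditer(r"\S+", text):
        word = m.group().lower()
        word_id = lexicon.setdefault(word, len(lexicon) + 1)  # assign id on first sight
        listing.append((word_id, m.start()))
    return listing

lexicon = {}
print(hash_tokens("It's a dog eat dog world", lexicon))
print(lexicon)  # {"it's": 1, 'a': 2, 'dog': 3, 'eat': 4, 'world': 5}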

Token Conflation

In the above figure, word id 2 appears twice. This is a redundancy, which means there is room for optimization. As such, the list is conflated, or aggregated, by the id. The positions of each token occurrence, commonly referred to as 'hits' (citation needed), are stored as a list for each key. The list might look something like the following:

Token Id | Positions
1        | 1, 20, 500, ...
2        | 10, 45, 3445, ...
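
Conflation groups the (word id, position) pairs by id, giving one position list (the 'hits') per token id. A minimal sketch, using the word ids and positions from the earlier single-document example:

from collections import defaultdict

# Group positions by token id so that each id appears only once,
# with all of its occurrence positions ('hits') collected into a list.
def conflate(id_position_pairs):
    hits = defaultdict(list)
    for token_id, position in id_position_pairs:
        hits[token_id].append(position)
    return dict(hits)

print(conflate([(4, 0), (1, 5), (2, 7), (3, 11), (2, 15), (5, 19)]))
# {4: [0], 1: [5], 2: [7, 15], 3: [11], 5: [19]}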

The Forward Index Data Structure

The algorithm continues iterating through the remaining documents, sequentially or according to a different logic. Assuming a simplistic sequential access order, it is generally accepted (citation needed) to immediately store the tokens per document in a data structure known as a forward index. The forward index is a very common index and is often referred to simply as 'the index'. The design of the forward index, and its use, varies greatly among different search engine implementations. The following table represents a simplified form of the forward index data structure.

Document Id | Hits
1           | (1 at 1, 20, 500) (2 at 10, 45, 3445)
2           | (1 at 3, 50, 60) (2 at 100, 120, 130, ...)
3           | etc.
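
Putting the previous steps together, a simplified forward index can be sketched as a mapping from document id to that document's conflated hits (word id to positions). The data and function names here are illustrative assumptions, not a real engine's layout.

import re
from collections import defaultdict

# Build a simplified forward index: document id -> {word id -> [positions]}.
def build_forward_index(corpus, lexicon):
    forward_index = {}
    for doc_id, text in corpus.items():
        hits = defaultdict(list)
        for m in re.finditer(r"\S+", text):
            word_id = lexicon.setdefault(m.group().lower(), len(lexicon) + 1)
            hits[word_id].append(m.start())
        forward_index[doc_id] = dict(hits)
    return forward_index

corpus = {1: "It's a dog eat dog world", 2: "a dog and a cat"}
lexicon = {}
print(build_forward_index(corpus, lexicon))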

This is a stopping point in the indexing algorithm for many search engines. However, the goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Querying the above table would require sequential iteration through each document, and through each word within it, to verify a matching document. The time, memory, and processing resources required to perform such a query are not always technically realistic.

The Inverted Index Data Structure

The next step in the indexing process is to convert, or transform, the forward index into an inverted index data structure, which will optimize the speed of the query: find the documents where word X occurs. The inverted index data structure is a central component of a typical search engine indexing algorithm.

Instead of listing the words per document, as in the forward index, a new structure is developed which lists the documents per word. It is created by inverting the forward index structure. For this example, the inverted index might look like the following:

Word Id | Document Occurrences
1       | 1, 3, 4, 5
2       | 2, 3, 4
3       | etc.
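
Inverting the forward index produces, for each word id, the list of documents in which it occurs. A minimal sketch, with a small hand-made forward index as input:

from collections import defaultdict

# Invert a forward index (doc id -> {word id -> positions}) into an
# inverted index (word id -> sorted list of doc ids containing the word).
def invert(forward_index):
    inverted = defaultdict(set)
    for doc_id, hits in forward_index.items():
        for word_id in hits:
            inverted[word_id].add(doc_id)
    return {word_id: sorted(docs) for word_id, docs in inverted.items()}

forward_index = {1: {1: [0], 2: [7, 15]}, 2: {2: [3], 3: [10]}}
print(invert(forward_index))  # {1: [1], 2: [1, 2], 3: [2]}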

With the inverted index created, a search query can now be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.
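
With those structures in place, resolving the query "find the documents where word X occurs" becomes a direct lookup: map the word to its id via the lexicon, then jump to that id's entry in the inverted index. A sketch, with hypothetical sample data:

# Resolve a single-word query by looking the word up in the lexicon
# and then jumping straight to its document list in the inverted index.
def query(word, lexicon, inverted_index):
    word_id = lexicon.get(word.lower())
    return inverted_index.get(word_id, []) if word_id is not None else []

lexicon = {'dog': 2}
inverted_index = {2: [2, 3, 4]}
print(query("dog", lexicon, inverted_index))  # [2, 3, 4]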