Linear hashing

GROWTH METHODOLOGY

Linear hashing expands the address space for the hashing function in an iterative process. One full iteration of the expansion doubles the size of the address space.

The essence of linear hashing is that records are located as if the file was fully expanded. The hashing algorithm recognizes how much of the file currently exists and accounts for this in its selection of the appropriate bucket in which to store a given record.

If the necessary bucket exists, the record is written to that bucket. If the bucket has not yet been created, the system walks backward through the number of buckets that do exist until it finds a bucket to which the record can be added. When the bucket to which the record correctly hashes is created, the bucket in which it has been placed temporarily is visited, and the record is moved to its true home.

When a frame is full, additional records assigned to the frame are stored in the overflow file. The overflow file is also divided into the same size frames. When the initial overflow frame is full, a link is established to another free frame in the overflow file.

When the percentage of primary space occupied by data exceeds the threshold, the table automatically expands by one group. Once this group has been added, a portion of the rows in up to three of the old groups are redistributed to the new group. Likewise, when primary space utilization falls too far below the resize threshold (10 to 20 percent, depending on the number of groups in the table), the table automatically contracts.

Linear Hashing

Linear hashing grew out of the work of many individuals who were searching for a method to dynamically extend the fixed address space of hashing functions.

Linear hashing expands the address space for the hashing function in an iterative process. One full iteration of the expansion doubles the size of the address space.

The essence of linear hashing is that records are located as if the file was fully expanded. The hashing algorithm recognizes how much of the file currently exists and accounts for this in its selection of the appropriate bucket in which to store a given record.

If the necessary bucket exists, the record is written to that bucket. If the bucket has not yet been created, the system walks backward through the number of buckets that do exist until it finds a bucket to which the record can be added. When the bucket to which the record correctly hashes is created, the bucket in which it has been placed temporarily is visited, and the record is moved to its true home.

OpenInsight's Linear Hash

OpenInsight stores data in a primary table file (the primary address space) and an overflow file. The operating system file names are REVnnnnn.LK for the primary frame file and REVnnnnn.OV for the overflow file (where nnnnn represents a five-digit integer, which is the same for a given .LK and .OV pair).

The primary address space is divided into fixed length frames. The default frame size is 1024 bytes. In order for the table to be optimized, frame size is of secondary importance to the number of records per frame. It is recommended that frames not contain more than 20 records. Because OpenInsight must search a frame sequentially for a given record, more records per frame can lead to a degradation in performance. When fine tuning the database application, the frame size can be altered. For example, when the median record size exceeds 50 bytes, testing the data table with a larger frame size may improve performance. Estimating record size for variable length data records is more an art than a science, as is estimating the best performing frame size. It is also recommended that frame size not exceed 4 kilobytes as performance degrades in this case as well.

When a frame is full, additional records assigned to the frame are stored in the overflow file. The overflow file is also divided into the same size frames. When the initial overflow frame is full, a link is established to another free frame in the overflow file.

In OpenInsight, a group is defined as the primary frame (located in the primary address space) plus all the additional, linked overflow frames. Group is analogous to "buckets" in the hashing algorithm example discussed above. The overflow frames are allocated, whenever possible, so that frames of the same group are contiguous. Overflow frames may later be emptied when the table resizes or rows are deleted. Empty frames occurring at the end of the overflow file are truncated.

The first frame of the overflow file contains the first frame of the overflow frame free list, which helps to efficiently manage the overflow. This list is a sorted list of pointers to empty overflow frames. As frames are emptied, corresponding pointers are added to the free list. If the size of the free list exceeds one frame, the filing system creates links to other frames in the overflow file, where the list is continued.

As overflow frames are added or removed from a group, the forward and skip pointers in the effected groups are updated. The forward pointer points to the next overflow frame. The skip pointer points to the overflow frame where the next record begins. When record size is small, forward and skip pointers may point to the same frame. When record size exceeds frame size, the frame pointer and skip pointer point to different overflow frames. When the system searches a group for a record, the skip pointer points beyond the overflow frames containing the overflow of the current record, optimizing record searches in overflowing files.

Automatic Table Sizing The greater the number of records assigned to a group, the slower the access time when overflow exists. As mentioned above, this is due to the sequential access process required to locate a specific record in a group. For example, to locate a record the primary table needs to be accessed, then the overflow file needs to be accessed, looking sequentially from one frame to the next in the group, until the specified record is found.

To resolve this problem, linear hash tables automatically resize themselves when primary space gets too full, thereby reducing the number of overflow frames and also the average number of records per frame. This automatic resizing is controlled by the resize threshold.

When the percentage of primary space occupied by data exceeds the threshold, the table automatically expands by one group. Once this group has been added, a portion of the rows in up to three of the old groups are redistributed to the new group. Likewise, when primary space utilization falls too far below the resize threshold (10 to 20 percent, depending on the number of groups in the table), the table automatically contracts.

In essence, the threshold setting is an indicator of the table's ability to add rows with minimum overflow. Adjusting the threshold also requires some experimentation and should be examined in the context of other parameters as well, such as average number of records per frame. A low threshold on a steadily growing table allows faster access to data since there are fewer rows in each group. A higher threshold allows more primary space to be used before resizing, which is suitable for a more static table. A threshold of 80 percent for the filled capacity of the primary address space is a good indicator that expansion is called for. The threshold value is set from the Database Manager tool. This is discussed later in this chapter.