Array DBMS
Array Database System
An array database system is a database system that supports the array (also called raster) data structure, as understood in programming languages: a homogeneous collection of data items where each item has a coordinate associated, and coordinates sit at grid points in a rectangular, axis-parallel subset of the Euclidean space Zd for some d>0.
Such arrays convey a very regular structure with a well-defined neighbourhood relation between its cells, which requires suitable storage structures. Frequently, operations on cells are applied simultaneously to large contiguous portions of arrays, such as overlaying two equally-sized images; efficient query processing needs to provide support for such patterns. Finally, arrays tend to be very large, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today’s earth and space observation archives grow by Terabytes a day.
Query language support for arrays consists of array primitives which can be nested as in standard SQL, and preferably integrated into SQL. Once sufficiently declarative query languages are available, query optimization and parallelization can be applied which has been proven to accelerate array evaluation up to orders of magnitude. Storage management must give efficient access to large arrays and sub-arrays thereof. Due to the large sizes compression often is applied, but still some high-volume applications require storage hierarchies.
An important application domain today is Web services on geo raster data, such as 1-D environmental sensor time series, 2-D satellite images, 3-D x/y/t image time series (Fig. 1), 3-D x/y/z exploration data, and 4-D x/y/z/t climate simulation data. Other spatio-temporal applications can be found, e.g., in the Life Sciences. On-line analytical processing (OLAP) and Statistical Databases consider arrays as well, often combining the time dimension with abstract axes such as sales and products.
HISTORICAL BACKGROUND
The array paradigm is not directly supported by the relational data model. Tradit¬ion¬ally, therefore, two different approaches have been pursued to achieve array support, BLOBs and OLAP data cubes.
Imagery in Array Databases
If at all stored in databases, image pixel matrices go into BLOBs (Binary Large OBjects), linearized and encoded in some data exchange format, and resulting in a byte string whose semantics is unknown to the database system. The database system, therefore, cannot provide any array operation aside from reading and writing the complete BLOB. As a consequence, for many potential application domains, users believe that databases cannot support their raster data, and they still tend to develop their own file-based services in an ad-hoc manner.
The requirement for further array processing capabilities distinguishes array databases from multimedia databases and imaging systems. Array databases are different from multimedia data-bases which analyse images using some built-in algorithms to obtain feature vectors; sub¬sequent-ly, search is performed on the feature vectors, not the images. Array databases, conversely, do not attempt to interpret the raster data, but always operate on the cell values themselves. Likewise, array databases are different from image processing and understanding systems: imaging systems offer elaborate functionality, however constrained to not excessively large images in relation to main memory size; array databases, conversely, constrain themselves in functionality to remain safe in evaluation, but aim at having no object size limit.
Statistics Data in Array Databases
Statistical data usually are modeled as OLAP data cubes. The main paradigm there today is Relational OLAP (ROLAP) where each data item, together with its coordinates, is stored in a so-called fact table in a relational DBMS. This explains why ROLAP is suitable only for very sparse data, where listing only valid points together with their coordinates represents an efficient compression scheme. Relational query operators, extended with specific data cube operators, provide powerful analysis capabilities on such data cubes. See the cube entry for more details.
A Unified View
Although treated differently, both images (here understood to have any spatio-temporal dimensions, not just 2-D x/y axes) and data cubes share characteristics to a large extent: Both consist of a data structure mapping n-D coordinates with rectangular boundaries into a value space. Operations convey similarities too, as the following examples show: Subsetting (cutting out sub-cubes) is known in both worlds, and an OLAP roll-up from days to weeks is equivalent to scaling an image by a factor of 7 using linear interpolation.
Actually the main difference motivating different treatment lies in a data property: statistical data tend to be very sparse, typically only up to 5% of the data cube holds values of interest. Image data, on the other hand, tend to be 100% dense.
Some authors [7] argue that OLAP data cubes are not strictly arrays because, in the case of categorial axes (in OLAP called measures), the neighbourhood relation between points does not induce a strict ordering. Others [2] subsume data cubes under the array paradigm, for example because categorial measures ultimately are mapped to integer coordinates during physical modeling.
History and Current State
A first important step beyond BLOBs was accomplished with PICDMS [4] where a 2-D array query language, still procedural and without suitable storage support, has been introduced. A declarative n-D array query language with an algebraic foundation and a suitable architecture, which is implemented and in operational use, has been presented in [1][2]. Marathe and Salem present a 2-D database array language [8]. In [4] nested relational calculus is extended with multidimensional arrays, obtaining a model called NCRA and a query language derived from it, AQL. Both seem to be more theoretically motivated and aim in particular at complexity studies for array addressing. Mennis et al [9] introduce 3-D map algebra, a framework suitable for handling 2-D and 3-D geo raster data. While OLAP is a well established business today, arrays are considered by industry only recently. ESRI’s ArcSDE and Oracle’s GeoRaster cartridge offer tiled storage of 2-D maps with mainly spatial subsetting support; the rasdaman system offers n-D full query support.
Fundamentals
The collection paradigm in databases encompasses sets, bags, lists, and arrays. The array concept forms a separate, distinct information category. While it might be emulated, e.g., by nested lists, this will not lead to usable query concepts (e.g., for slicing a data cube along all its axes), nor can this be a basis for efficient implementations. Consequently, all database aspects need to be reconsidered for array support, including conceptual modeling (what are appropriate operations?), storage management (how to manage objects spanning many disk blocks, if not several media?), and query evaluation (what are efficient and optimizable processing strategies?).
Formally, an array a is a function a: X F where X is a finite axis-parallel hypercube in Euclidean space Zd for some d>0 and F is some algebra. Let X be the array’s domain, F its cell type, and the individual locations xX carrying some value fF the array’s cells.
Following good practice in databases, an array query language should be declarative and safe in evaluation. Declarative in this context means that there is no explicit iteration sequence over an array (or part of it) during evaluation – conceptually, all cells should be inspected simultaneous¬ly. This opens up avenues for efficient storage and evaluation patterns, and query optimization in general (see below). Avoiding explicit iterations also contributes to be safe in evaluation, i.e., every query is evaluated in a finite number of (finite-time) steps.
Actually it turns out that a wide range of practically relevant operations can be expressed using only a few primitives [2]. The MARRAY operator creates an array over some given domain and sets its cells by evaluating an also given expression at each cell location. An application of the MARRAY operator has the general form
marray index-range-specification
values cell-value-expression
where index-range-specification defines the resulting array’s domain and binds an iteration variable to it. In cell-value-expression the value of the cells are determined for each cell coordinate; this expression may or may not use the current value of the iteration variable, e.g., to read out cells from some given array. The first example, domain subsetting, makes use of this addressing by copying the values of cells within the cutout region from the original array into the new array.
Example: “A cutout of array a specified by the corner points (100,100) and (200,300).”
marray p in [100:200,100:300]
values a[p]
In the query language this particular application pattern of MARRAY can be abbreviated as
a[100:200,100:300]
Aside from copying cells as above, the resulting array can also have new values assigned which may or may not depend on other arrays.
Example: “Array a (which is known to be of size 1024x768), with intensity reduced by a factor of 2.” Auxiliary function dom(a) returns the domain of a, over which the query has to iterate:
marray p in dom(a)
values a[p] / 2
Such operations are abbreviated in the query language by applying the operation on hand directly to the array, in the example obtaining:
a / 2
All unary and binary operations on cells can thus be lifted to become array operations.
Example: “Array a’s values where they are above threshold t, and 0 otherwise.”
(a > t) * a
In this example Boolean and arithmetic operations are combined; similarly to many pro¬gramming languages, Boolean results from the comparison are interpreted as 0 and 1 for the subsequent multiplication.
The CONDENSE operator aggregates cell values into one scalar result, similar to SQL aggregates. Its application has the general form
condense condense-op
over index-range-specification
using cell-value-expression
where, like with MARRAY before, index-range-specification indicates the array index domain to be iterated over and binds a variable to it. Similarly, cell-value-expression represents an expression to be evaluated for each index range coordinate. The additional element, condense-op, specifies the aggregating operation used to combine the cell value expressions into one single value.
Example: “The sum of all values in a.”
condense +
over p in sdom(a)
using a[p]
This is abbreviated as add_cells(a); further predefined condensers include count_cells(), minimum, maximum, and quantifiers.
Histogram computation demonstrates nesting of these operations.
Example: “A histogram of 256 buckets over 8-bit greyscale array a.”
marray n in [0:255]
values count_cells( a = n )
The inner expression, a = n, is an application of the formerly introduced “lifted” operation. For each coordinate p of a, the comparison a[p] = n is performed, yielding a Boolean array.
The SORT operator allows to slice an array along one of its axes and reorder the slices according to some sorting criterion.
By embedding this into a set-oriented model with array-valued attributes one can write queries over sets of arrays. The next example assumes a table BrainActivationMaps where one of the attributes, named data, is an array; the data array might contain processed scans of human brains indicating activity like electrical flow, temperature, blood pressure, etc. Note that for the query it doesn’t matter whether the data array is, say, 2-D or 3-D. This way queries can be kept rather generic.
Example: “Tuple identifier and histogram of all those BrainActivationMaps tuples where the average data intensity exceeds threshold 127.”
select id,
marray n in [0:255] values count_cells( bam.data = n )
from BrainActivationMaps as bam where avg_cells( bam.data ) > 127 Such languages allow formulating statistical and imaging operations which can be expressed analytically without using loops. In [6] it has been proven that the expressive power of such array languages in principle is equivalent to relational query languages with ranking. Physical Modeling In almost all practically relevant scenarios array objects are by orders of magnitude larger than disk blocks. The foremost task for an array storage manager is to preserve spatial proximity on disk. For n-D data this obviously is impossible to achieve on a linear storage space, hence approx-im¬at¬ions need to be employed. One such technique is tiling, adopted from imaging. Tiling (also called chunking) partitions an n-D cube into a set of non-overlapping n-D sub-cubes (Fig. 2), each of which is stored in a BLOB in the traditional manner. The partitioning pattern can be chosen freely, which opens up a wide space for storage optimization. In the most simple case equally sized tiles are employed where the tile size is chosen suitably for the disk and bus specs; an advanced alternative is to determine an optimal tile distribution with individual tile proportions and sizes for a given query set.
A spatial index, such as the R-Tree, helps to quickly determine the tiles affected by a query, which usually is a range query. As opposed to spatial (i.e., vector) databases the situation is simple: the target objects, which have a regular box structure, partition a space of known extent. Hence, almost any spatial index will perform decently. Often compression of tiles is advantageous. Still, in face of very large array databases tertiary storage may be required, such as tape robots [11] [10]. Query Evaluation A tile-based storage structure suggests a tile-by-tile processing strategy. Indeed a large class of practically relevant queries can be evaluated by inspecting a tile stream, rather than loading the whole source object into main memory. Additionally, for some query types such as condensers tile streaming can be terminated prematurely. Interestingly it turns out that array query evaluation times typically are CPU bound, except for the very rare cases where only domain subsetting and no processing is required. The reason lies in the large number of array cells to be processed per query. Preliminary results show that query parallelization by assigning tiles to different CPUs yields promising performance gains [5]. Query Optimization Array queries lend themselves well towards query optimization. Currently only heuristic optimization is reasonably understood, but shows good results. Consider the query select avg_cells( a + b ) from a, b Fig. 3 shows an equivalence rule which can speed up evaluation: substituting the left-hand side array expression with the right-hand side yields select avg_cells( a ) + avg_cells( b ) from a, b This query involves two array traversals and two (negligible) scalar operations. Bottom line, three array traversals have been reduced to two this way.
Fig. 3 Algebraic equivalence rule “avg_cells(m1+m2) avg_cells(m1)+avg_cells(m2)” KEY APPLICATIONS In many cases where some phenomenon is sampled or simulated, the result is a rasterized data set. Given the high volumes and multi-faceted retrieval arising, there is a remarkably wide application domain for array databases. Briefly, it can be summarized as sensor, image, and statistics data in the widest sense. Exemplarily geo and life sciences will be inspected next|. Earth Sciences By far the most important and visible application domain for large-scale array information systems is geospatial raster data, with high public visibility through novel, user-friendly inter-action techniques like the Virtual Globes of NASA WorldWind (worldwind.arc.nasa.gov) and GoogleEarth (earth.google.com). The Open GeoSpatial Consortium (OGC, www.open¬geo¬spatial.org), in collaboration with ISO, OASIS, W3C and others, provides standards for open, interoperable geo Web service interfaces. OGC develops a family of modular geo service standards, of which the Web Coverage Service (WCS) suite is particularly relevant for raster data (there also called coverage data). It offers basic subsetting services, scaling, and re¬projection. This static request type is augmented with a coverage processing language in the Web Coverage Processing Service (WCPS) Implem¬ent¬at¬ion Specification [3]. A typical WCPS request computes the Normalized Difference Vegetation Index (NDVI), which computes a measure of the degree of vegetation for each pixel. For a multi-spectral satellite image s with near-infrared channel nir and red channel red, the NDVI is defined as: NDVI(s) = (s.nir – s.red) / (s.nir + s.red) The result is a real value between -1 and +1; the closer it is to +1, the higher is the likelihood for vegetation. For some server object LandsatScene (note that WCPS lists single objects and does not operate on sets like database languages) an 8-bit greyscale NDVI image is specified as follows: for s in ( LandsatScene ) return
encode( (char) (255 * (s.nir – s.red) / (s.nir + s.red) + 1), “TIFF“ )
Life Science Human Brain Imaging In human brain imaging the research goal is to understand the relations between brain structure and its function. In experiments, human subjects have to perform some mental task while activity parameters such as brain temperature, electrical activity, and oxygen consumption are measured by PET or fMRI CAT scans. The resulting 3-D x/y/z data cubes have a resolution of 1 mm and an overall volume in the Megabyte size. In a computationally expensive warping operation the brain images get normalized against some chosen standard brain so that organs always sit at known voxel coordinates (Fig. 4 left). In the end a brain image set as large as possible is desired to achieve high statistical significance. Traditionally, feature search is the only property through standard SQL on structured and semi-structured data; specialized tools can perform search on single images or small sets thereof. With array databases, thousands of experiments each with large image sets can be searched by brain feature. The brain organs can be registered as voxel masks (Fig. 4 right). The query types arising mainly perform standard statistics per brain. Example: “A parasagittal view of all scans containing critical Hippocampus activations, TIFF-coded.“ Positional parameter $1 denotes the slicing position, $2 the intensity threshold value, $3 the confidence, as chosen by a user through some browser interface. select tiff( ht[ $1, *:*, *:* ] ) from HeadTomograms as ht, Hippocampus as mask where count_cells( ht > $2 and mask ) / count_cells( mask ) > $3
Fig. 4 Human brain activation map (left) and brain organ masks (right)
Gene Expression Analysis
Gene expression is the activitiy of reading out genes for reproduction in a living body. The research goal in gene expression analysis is to understand how and when genes express to become manifest in the phenotype. One step towards this is to understand the spatio-temporal expression patterns in a well-known research object, the fruit fly Drosophila melanogaster.
The staining process delivers activity patterns (Fig. 5 left). It is common to combine three gene images into one by randomly assigning them to the red, green, and blue channel, resp. (Fig. 5 top right). Traditionally, diagrams like Fig. 5 bottom right are constructed.
Fig. 5 Gene activity maps for a fruitfly embryo (left), overlay into RGB (top right), slice aggregating activity over the 10% central strip of a snapshot (bottom right)
Array queries allow searching the resulting 4-D x/y/z/t activity cube and generate, among others, the views researchers are used to. A gene expression database might contain 4-D objects where the first dimension lists the fruitfly genes in some chosen order and the other three spatial dimensions allow addressing into the fruitfly body. Then, a query can slice these 4-D cubes and recombine these slices into the aforementioned RGB overlay.
Example: “Genes $1, $2, and $3 at time $4, as RGB images.”
select jpeg( {1c,0c,0c}*e[ $1, *:*, *:*, $4 ]
+ {0c,1c,0c}*e[ $2, *:*, *:*, $4 ] + {0c,0c,1c}*e[ $3, *:*, *:*, $4 ] )
from EmbryoImages as e where oid(e)=193537 FUTURE DIRECTIONS* Array databases are in their infancy, with many inviting research avenues. On a conceptual level, a unified algebraic array theory seems promising which is to unify (dense) image handling with (spar¬se) statistical and OLAP handling. Advanced query optimization deserves further investig-at¬ion, with topics like cost-based optimization and complex array addressing schemes (such as filter kernels). Physical tuning parameters are not yet fully understood in their effects and inter-play. Finally, further experience in as many applications as possible is desirable for a better under-standing of the relevant query patterns. The ultimate goal is to establish arrays as first-class database citizens.
References
[1] Baumann P (1994): On the Management of Multidimensional Discrete Data, VLDB Journal 4(3), pp. 401 - 444. Special Issue on Spatial Database Systems, 1994 [2] Baumann P (1999): A Database Array Algebra for Spatio-Temporal Data and Beyond, Proc. NGITS’99, LNCS 1649, pp.76-93, 1999 [4] Chock M, Cardenas A, Klinger A (1984): Database structure and manipulation capabilities of a picture database management system (PICDMS). IEEE ToPAMI, 6(4):484-492, 1984 [5] Hahn K, Reiner B (2002): Parallel Query Support for Multidimensional Data: Inter-object Parallelism. Proc. DEXA, 2002, Aix en Provence, France [6] Libkin L, Machlin R, Wong L (1996): A query language for multidimensional arrays: design, implementation and optimization techniques. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’ 96), pages 228-239 [7] Machlin R (2007): Index-Based Multidimensional Array Queries: Safety and Equivalence. Proc. ACM PODS, Beijing, China, June 2007 [8] Marathe A, Salem K (1997): A language for manipulating arrays. In Proc. VLDB’97, pages 46-55, August 1997 [9] Mennis J, Viger R, and Tomlin C.D (2005): Cubic Map Algebra Functions for Spatio-Temporal Analysis, Cartography and Geographic Information Science, Vol. 32, No. 1, 2005, pp. 17-32 [10] Reiner B, Hahn K (2002): Hierarchical Storage Support and Management for Large-Scale Multi¬dim¬ensional Array Database Management Systems. Proc. DEXA, 2002, Aix en Provence, France [11] Sarawagi S, Stonebraker M (1994): Efficient Or¬gan¬i¬za¬tion of Large Multidimensional Arrays.. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336