Range query (computer science)

This sandbox is in the article namespace. Either move this page into your userspace, or remove the {{User sandbox}} template.

In data structures a \emph{range querie} consists of preprocessing some input data to efficiently answer any number of queries on any subset of the input. Particularly, there is a group of problems that have been extensively studied where the input is an array of numbers not necessarily sorted and a query consists in computing some function on a specific range of the array. Here we discuss about some of these problems.\\

To be more precise, we may state the problem of range queries in the following way: a range query $q_f(A,i,j)$ on an array $A=[a_1,a_2,\ldots,a_n]$ of $n$ elements of some set $S$, denoted $\A{1}{n}$, takes two indices $1\leq i\leq j\leq n$, a function $f$ defined over arrays of elements of $S$ and outputs

$f(\A{i}{j})= f(a_i,\ldots,a_j)$. This should be done space and time efficient.\\

\emph{sum}: consider $f = sum$ and $\A{1}{n}$ and array of real numbers, the range query $sum(A,i,j)$ computes $sum(A[i,j]) = (a_i+\ldots + a_j)$, for any $1 \leq i \leq j \leq n$. These queries may be answer in constant time and using $O(n)$ extra space by calculating the sums of the first $i$ elements of $A$ and storing them into an auxiliar array $B$, such that $B[i]$ contains the sum of the first $i$ elements of $A$ for every $0\leq i\leq n$. Therefore any query might be answered by doing $sum(A[i,j]) = B[j] - B[i-1]$. Let us see an example:\\

\[ A=[45, 5.5, 89, 2134, 1, 78, 23, 89, 77] \] \[B[45, 50.5,139.5,2273.5,2274.5,2352.5,2375.5,2464,2541.5]\] \[sum(A[5,9]) = B[9]-B[4] = 2541.5-2273.5 =268 \]

We can extend this solution for every group operator $f$[link] where $f^{-1}$ is well defined and easily computable [], for instance we can apply this solution for $f\in\{+,x,-\}$. Finally notice this solution might be extended for arrays of dimension two with a similar preprocessing \cite[men he].\\

\emph{semigroup operator:} a more difficult case is when the function of interest $f$ is a semigroup operator[linkoperator]. In a semigroup the notion of $f^{-1}$ is not always defined. Therefore we can not use an analogous strategy. Yao, Alon and Schieber showed that there exists an efficient solution for these type of range queries. They proved that for any constant $c$, a preprocessing of time and space $\theta(c\cdot n)$ allows to answer range queries on lists where $f$ is a semigroup operator in $\theta(\alpha_c(n))$ time, where $\alpha_k$ is a certain functional inverse of the Ackerman function [cite Yao82, Yao85, towards optimal]. The proof is elaborated and it is out of the scope of this article however it may found here [cite, etc].\\

There are some cases of semigroup operators that admit slightly better solutions. In case $f\in \{max,min\}$ we want to find the minimum/maximum element in any range of $\A{1}{n}$. This problem is known as the \emph{range minimum/maximum query problem} and there are several data structures that allows $O(1)$ time queries and $O(n)$ preprocessing time and space [link]. Probably the simplest to understand solution is based on the equivalence between this problem and the Lowest Common Ancestor Problem [link and cite]. The cartesian tree $T_A$ of an array $\A{1}{n}$ has as root $a_i = min\{a_1,a_2,\ldots,a_n\}$ and it has as left and right subtrees the cartesian tree of $\A{1}{i-1}$ and the cartesian tree of $\A{i+1}{n}$ respectively. It is easy to see that the range minimum query $min(A,i,j)$ is the lowest common ancestor in $T_A$ of $a_i$ and $a_j$. Since the lowest common ancestor problem is solvable in constant time and linear space thus so is range minimum query. The solution when f = max is analogous. Cartesian trees can be constructed in linear time. [cite and link].\\

\emph{range Mode:} The \emph{mode} of an array is the element the element that appears the most in the array. For instance the mode of $A=[4,5,6,7,4,]$ is $4$. In case of ties any of the most frequent elements might be picked as mode. A range mode query consists on preprocessing $\A{1}{n}$ such that we can find the mode in any range of $\A{1}{n}$. Several data structures have been devised for this problem, we summarize some of the results in table [cite].\\

\emph{range Median queries:} This particular case is of special interest since finding the median of an array $A[1\ldots n]$ has several applications, for further reference see cite[Sariel Har-Perel, S. Muthukrishman Range Medians]. The median problem was proven to be solvable in $O(n)$ time [link] by et al in . However its generalization through range median queries is recent. A range median query $median(A,i,j)$ where $A,i$ and $j$ have the usual meanings returns the median element of $\A{i}{j}$. In other words, $median(A,i,j)$ returns the element of $\A{i}{j}$ of rank $(j-i)/2$. Note that range median queries can not be solved by following any of the previous methods discussed above including Yao's approach for semigroup operators [cite].\\

There have been studied two variants of this problem, the \emph{offline} version, where all the $k$ queries of interest are given in a batch and we are interest in reduce the total cost and a version where all the preprocessing is done \emph{up front} and we are interested in optimize the cost of any subsequent single query. Concerning the first variant of the problem recently was proven that can be solved in time $O(n\log k + k \log n)$ and space $O(n\log k)$. Algorithm 1 shows how to find the element of rank $r$ in $\A{i}{j}$ an unsorted array of distinct elements, to find the range medians we set $r=(j-i)/2$.\\

rangeMedian(A,i,j,r)\\

if A.length() == 1 return A[1]\\

if A.low is undefined then\\

 m = median(A)\\
 A.low  = [e in A | e <= m]\\
 A.high = [e in A | e > m ]\\

calculate t the number of elements of A[i,j] that belong to A.low\\

if r <= t return rangeMedian(A.low, i,j,r)\\ else return rangeMedian(A.high, i,j, r-t)\\

Algorithm 1 partitions $A$, using $A$'s median, into two arrays $A.low$ and $A.high$, where the former contains the elements of $A$ that are less than or equal to the median $m$ and the latter the rest of the elements of $A$. If we know that the number of elements of $\A{i}{j}$ that ended up in $A.low$ are $t$ and this number is bigger than $r$ then we should keep looking for the element of rank $r$ in $A.low$ else we should look for the element of rank $(r-t)$ in A.high. To find $t$, it is enough to find the maximum index $m<=i-1$ such that $a_m$ is in A.low and the maximum index $l<=j$ such that $a_l$ in in A.high. Then $t=l-m$. The total cost for any query, without considering the partitioning part, is $\log n$ since at most $\log n$ recursion calls are done and only a constant number of operations are performed in each of them (to get the value of $t$ fractional cascading should be used and recall we are not considering the partitioning part). If a linear algorithm to find the medians is used, the total cost of preprocessing for $k$ range median queries is $n\log k$. Clearly this algorithm can be easily modified to solve the up front version of the problem.

\section{Related Problems} All the problems described above have been studied for higher dimensions as well as their dynamic versions. On the other hand, range queries might be extended to toher data structures like trees [citeMorin]. A similar family of problems are orthogonal range queries also known as counting queries.