In-place matrix transposition

In-place matrix transposition, also called in-situ matrix transposition, refers to the problem of transposing an $N\times M$ matrix in-place in computer memory: with $O(1)$ additional storage, or at most with additional storage much less than $NM$ . Typically, the matrix is assumed to be stored in row-major order or column-major order (i.e., contiguous rows or columns, respectively, arranged consecutively).

The problem is most difficult when $N\neq M$ , i.e. for a non-square (rectangular) matrix, where it involves a complicated permutation of the data elements, with many cycles of length greater than 2. In contrast, for a square matrix ( $N=M$ ), all of the cycles of are length 1 or 2, and the transposition can achieved by a simple loop to swap the upper triangle of the matrix with the lower triangle. Further complications arise if one wishes to maximize memory locality, however, to improve cache line utilization, since transposes naturally require non-consecutive memory accesses.

The problem of non-square in-place transposition has been studied since at least the late 1950s, and several algorithms are known, including several which attempt to optimize locality for cache optimization.

References

P. F. Windley, "Transposing matrices in a digital computer," Computer Journal 2, p. 47-48 (1959), 47-48.
G. Pall, and E. Seiden, "A problem in Abelian Groups, with application to the transposition of a matrix on an electronic computer," Math. Comp. 14, p. 189-192 (1960).
J. Boothroyd, "Algorithm 302: Transpose vector stored array," ACM Transactions on Mathematical Software 10 (5), p. 292-293 (1967).
Susan Laflin and M. A. Brebner, "Algorithm 380: in-situ transposition of a rectangular matrix," ACM Transactions on Mathematical Software 13 (5), p. 324-326 (1970). Source code.
Norman Brenner, "Algorithm 467: matrix transposition in place," ACM Transactions on Mathematical Software 16 (11), p. 692-694 (1973). Source code.
Esko G. Cate and David W. Twigg, "Algorithm 513: Analysis of In-Situ Transposition," ACM Transactions on Mathematical Software 3 (1), p. 104-110 (1977). Source code.
Murray Dow, "Transposing a matrix on a vector computer," Parallel Computing 21 (12), p. 1997-2005 (1995).
Donald E. Knuth, The Art of Computer Programming Volume 1: Fundamental Algorithms, third edition, section 1.3.3 exercise 12 (Addison-Wesley: New York, 1997).
M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms," in Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (FOCS 99), p. 285-297 (1999). Extended abstract at IEEE, at Citeseer.
M. Frigo and S. G. Johnson, "The Design and Implementation of FFTW3," Proceedings of the IEEE 93 (2), 216–231 (2005). Source code of the FFTW library, which includes optimized serial and parallel square and non-square transposes, in addition to FFTs.