Permute instruction
Permute (and Shuffle) instructions, part of Bit manipulation as well as Vector processing, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bit width) of the source elements is not restricted, but it is always the same as the size of the destination elements.
There exist two important permute variants, known as gather and scatter. The gather variant is as follows:
for i = 0 to length-1
    dest[i] = src[indices[i]]
whereas the scatter variant is:
for i = 0 to length-1
    dest[indices[i]] = src[i]
Note that, unlike in memory-based Gather-scatter, all three of dest, src, and indices are registers (or parts of registers in the case of bit-level permute), not memory locations.
The scatter variant can be seen to "scatter" the source elements across (into) the destination, whereas the "gather" variant gathers data from the indexed source elements.
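The two loops above can be sketched in Python as follows. This is only an illustrative model using lists in place of registers; the function names and the `fill` parameter (the value left in destination slots that are never written) are assumptions made for this example, not part of any instruction set.

```python
def gather(src, indices):
    """Gather: dest[i] = src[indices[i]]."""
    return [src[j] for j in indices]


def scatter(src, indices, fill=None):
    """Scatter: dest[indices[i]] = src[i].

    Destination slots that no index refers to keep the `fill` value
    (an assumption of this sketch; real hardware behaviour varies).
    """
    dest = [fill] * len(src)
    for i, j in enumerate(indices):
        dest[j] = src[i]
    return dest


src = ["a", "b", "c", "d"]
indices = [2, 0, 3, 1]

print(gather(src, indices))   # ['c', 'a', 'd', 'b']
print(scatter(src, indices))  # ['b', 'd', 'a', 'c']
```

When the index list contains every position exactly once, as here, gathering the scattered result with the same indices returns the original data.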
Because the indices may be repeated in both variants, the output is not necessarily a strict mathematical permutation: a repeated index duplicates a source element in the gather variant and overwrites a destination element in the scatter variant. Permute instructions, misleadingly, actually create combinations.
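A short Python sketch makes the effect of repeated indices visible. The helper names `gather` and `is_true_permutation` are hypothetical and exist only for this illustration.

```python
def gather(src, indices):
    """Gather: dest[i] = src[indices[i]]."""
    return [src[j] for j in indices]


def is_true_permutation(indices, length):
    """True only if every position 0..length-1 appears exactly once."""
    return sorted(indices) == list(range(length))


src = [10, 20, 30, 40]

reordered = [3, 2, 1, 0]   # each index used once: a real permutation
repeated = [1, 1, 3, 3]    # indices 1 and 3 used twice, 0 and 2 never

print(gather(src, reordered), is_true_permutation(reordered, 4))
# [40, 30, 20, 10] True
print(gather(src, repeated), is_true_permutation(repeated, 4))
# [20, 20, 40, 40] False
```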
When the instruction actually is a permutation, it is usually called a "shuffle". The AVX-512 instructions with "shuffle" capability simply swap different parts of the data, at different bit widths and in different areas.
A special case of permute is also used in GPU "swizzling" (again, confusingly, actually a combination), which performs on-the-fly reordering of subvector data so as to align elements with, or duplicate them into, the appropriate SIMD lane.
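Swizzling can be modelled as a gather over a short vector whose indices are written as component letters. The sketch below borrows the `x`, `y`, `z`, `w` naming used by shading languages such as GLSL; the `swizzle` function itself is a hypothetical helper, not a real GPU API.

```python
# Component letters mapped to lane numbers of a 4-element vector.
COMPONENTS = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec, pattern):
    """Return a new vector whose lanes are picked by the letters in `pattern`."""
    return [vec[COMPONENTS[c]] for c in pattern]


v = [1.0, 2.0, 3.0, 4.0]

print(swizzle(v, "wzyx"))  # reorder:   [4.0, 3.0, 2.0, 1.0]
print(swizzle(v, "xxyy"))  # duplicate: [1.0, 1.0, 2.0, 2.0]
```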
Occurrences of permute instructions
Permute instructions occur in scalar processors, Vector processing engines and GPUs. In Vector instruction sets they are typically named "Register Gather/Scatter" operations, as in RISC-V Vectors[1]; they take Vectors as input for both the source elements and the indices, and output another Vector.
In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked, SIMD style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result.
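The unpack, permute and repack steps can be sketched in Python for a 64-bit register split into eight bytes. This is a simplified model, not the behaviour of any particular instruction: the helper names, the little-endian byte numbering, and the wrapping of out-of-range indices with a modulo are all assumptions of this example.

```python
def unpack_bytes(reg, count=8):
    """Split an integer into `count` bytes, least significant byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(count)]


def pack_bytes(byte_list):
    """Concatenate bytes (least significant first) back into one integer."""
    reg = 0
    for i, b in enumerate(byte_list):
        reg |= (b & 0xFF) << (8 * i)
    return reg


def permute_bytes(src_reg, index_reg, count=8):
    """Gather bytes of src_reg using the bytes of index_reg as indices."""
    src = unpack_bytes(src_reg, count)
    idx = unpack_bytes(index_reg, count)
    # Assumption: indices wrap around instead of faulting or yielding zero.
    return pack_bytes([src[j % count] for j in idx])


# Indices 7,6,...,0 (from the lowest byte up) reverse the byte order.
result = permute_bytes(0x0807060504030201, 0x0001020304050607)
print(hex(result))  # 0x102030405060708
```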
Some ISAs, particularly for cryptographic applications, even have bit-level permute operations, such as in RISC-V bitmanip[2]; in the Power ISA it is known as bpermd and has been included for several decades, and is still in the Power ISA v.3.0 B spec.[3]
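A bit-level permute applies the same gather idea to single bits. The Python sketch below is a generic illustration only: it numbers bits from the least significant end and returns zero for out-of-range indices, and it should not be read as the exact definition of bpermd or of any RISC-V instruction. The function name `bit_gather` is hypothetical.

```python
def bit_gather(src, bit_indices, width=64):
    """Build a result whose bit i is bit bit_indices[i] of src.

    Bits are numbered from 0 at the least significant end (an assumption
    of this sketch). Indices outside 0..width-1 contribute a zero bit.
    """
    result = 0
    for i, idx in enumerate(bit_indices):
        bit = (src >> idx) & 1 if 0 <= idx < width else 0
        result |= bit << i
    return result


print(bin(bit_gather(0b1010, [1, 3, 0, 2])))  # 0b11
print(bin(bit_gather(0b0001, [0, 0, 0, 0])))  # 0b1111 (one bit replicated)
```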
In some non-Vector ISAs there is sometimes insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), so these ISAs include partial reordering instructions instead. Examples include VSHUFF32x4 from AVX-512 and bdep (bit deposit) from RISC-V bitmanip.
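As an illustration of a partial reordering, a bit deposit operation can be sketched as follows: the low-order bits of the source are placed, in order, into the positions where a mask has a 1 bit, so the mask stands in for a full index array. This is a generic model of bit deposit written for this article; the function name and the `width` parameter are assumptions, and the exact semantics of any given ISA's instruction should be taken from its specification.

```python
def bit_deposit(src, mask, width=64):
    """Place successive low bits of src at the set-bit positions of mask."""
    result = 0
    k = 0  # index of the next source bit to consume
    for pos in range(width):
        if (mask >> pos) & 1:
            if (src >> k) & 1:
                result |= 1 << pos
            k += 1
    return result


# Mask has ones at positions 1, 3 and 4; source bits 1, 0, 1 land there.
print(bin(bit_deposit(0b101, 0b11010)))  # 0b10010
```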
Permute operations in different forms are common, occurring in AltiVec, Power ISA, PowerPC G4, AVX-512 and SVE2[4], as well as in Vector processors and GPUs.
See also
- Kepler (microarchitecture) § Shuffle Instructions
- GeForce 700 series § New shuffle Instructions
- Intel AVX-512 ISA Manual
References
- ^ [1]
- ^ [2]
- ^ "Power ISA Version 3.0 B". Power.org. 2017-03-27. Retrieved 2019-08-11.
- ^ ARM HPC, SVE2 Extension summary, p32