
Cache control instruction


In computing, cache control instructions are hints embedded in the instruction stream of a microprocessor, intended to improve the performance of hardware caches by exploiting foreknowledge of the memory access pattern supplied by the programmer or compiler.[1] They may reduce cache pollution, reduce bandwidth requirements, and hide memory latencies, providing better control over the working set. Most cache control instructions do not affect the semantics of a program.

Examples

Several such instructions are supported, with some variation, by instruction set architectures such as PowerPC, x86, and MIPS.

Prefetch

Also known as data cache block touch, this hint requests that data be loaded into the cache ahead of use. Some variations bypass the higher levels of the cache hierarchy, which is useful in a 'streaming' context where data is traversed once rather than held in the working set. The prefetch should occur sufficiently far ahead in time to hide the latency of the memory access, for example in a loop traversing memory linearly.
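
For illustration, a minimal sketch in C using the GCC/Clang __builtin_prefetch builtin; the prefetch distance of 16 elements is an assumption for illustration and would need tuning against the actual memory latency and loop body:

    #include <stddef.h>

    /* Distance (in elements) to prefetch ahead of the loop: an
       illustrative value; too small fails to hide latency, too large
       risks the line being evicted before it is used. */
    #define PREFETCH_DISTANCE 16

    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                /* rw = 0 (read); locality = 0 requests a non-temporal
                   prefetch, hinting the data need not stay cached */
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 0);
            s += a[i];
        }
        return s;
    }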

Data Cache Block Allocate Zero

This hint is used to prepare a cache line prior to overwriting its contents completely. In this case, the CPU need not load anything from main memory; the line is simply allocated in the cache and filled with zeroes.
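
A sketch of this pattern using PowerPC inline assembly for the dcbz instruction; it assumes a GCC-style compiler, a cache-block-aligned buffer, and a 128-byte cache block size, all of which vary in practice:

    /* Assumed cache block size; dcbz zeroes one whole block, so real
       code must query the actual size for the target implementation. */
    #define LINE_SIZE 128

    /* Zero-allocate a buffer's cache lines before overwriting it,
       avoiding reads from main memory. buf must be block-aligned and
       bytes a multiple of LINE_SIZE. */
    void zero_alloc(char *buf, unsigned long bytes)
    {
        for (unsigned long off = 0; off < bytes; off += LINE_SIZE)
            __asm__ volatile ("dcbz 0, %0" : : "r" (buf + off) : "memory");
    }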

Data Cache Block Invalidate

This hint is used to discard cache lines without committing their contents to main memory. Care is needed: unlike other cache hints, this one significantly modifies program semantics, and incorrect results are possible if the discarded data was still live. It is used in conjunction with 'allocate zero' for managing temporary data, saving unnecessary main memory bandwidth and cache pollution.
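
Continuing the previous sketch, the PowerPC dcbi instruction could discard a temporary buffer once it is finished with; note that dcbi is a supervisor-level instruction on many PowerPC implementations, so a pattern like this is typically confined to kernel or bare-metal code:

    #define LINE_SIZE 128   /* assumed cache block size, as above */

    /* Discard a scratch buffer's cache lines so the dead temporary
       data is never written back to main memory. Unsafe if any line
       still holds live data. */
    void discard(char *buf, unsigned long bytes)
    {
        for (unsigned long off = 0; off < bytes; off += LINE_SIZE)
            __asm__ volatile ("dcbi 0, %0" : : "r" (buf + off) : "memory");
    }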

Data Cache Block Flush

This hint immediately evicts a cache line, writing back any modified contents and freeing up space for future allocations. It is used when it is known that data is no longer part of the working set.
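
On x86 the analogous operation is exposed as a compiler intrinsic: _mm_clflush (the CLFLUSH instruction, available since SSE2) writes back and invalidates the line containing a given address. A minimal sketch, assuming the common, but not architecturally guaranteed, 64-byte line size:

    #include <emmintrin.h>  /* SSE2 intrinsics: _mm_clflush */

    #define X86_LINE 64     /* typical line size; query CPUID in real code */

    /* Flush a buffer from the cache hierarchy once it has left the
       working set, freeing lines for other data. */
    void flush_buffer(const char *buf, unsigned long bytes)
    {
        for (unsigned long off = 0; off < bytes; off += X86_LINE)
            _mm_clflush(buf + off);
    }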

Other hints

Some processors support a variation of load/store instructions that also imply cache hints. An example is 'load last' in the PowerPC ISA, which suggests that the data will only be used once; the cache line in question may be pushed to the head of the eviction queue.
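
x86 offers a related family of non-temporal stores, which write data while bypassing, or minimally disturbing, the cache. A sketch using the SSE _mm_stream_ps intrinsic; the alignment and size restrictions noted in the comments are simplifications for illustration:

    #include <xmmintrin.h>  /* SSE intrinsics: _mm_stream_ps, _mm_sfence */

    /* Fill a write-once buffer (e.g. a frame buffer) with non-temporal
       stores, avoiding cache pollution. dst must be 16-byte aligned
       and n a multiple of 4 floats in this simplified sketch. */
    void fill(float *dst, float value, unsigned long n)
    {
        __m128 v = _mm_set1_ps(value);
        for (unsigned long i = 0; i < n; i += 4)
            _mm_stream_ps(dst + i, v);
        _mm_sfence();   /* make the streaming stores globally visible */
    }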

Alternatives

Scratchpad memory

Some processors support scratchpad memory, into which temporaries may be placed, and DMA engines to transfer data to and from main memory when needed. This approach allows similar levels of control over memory traffic and locality, but has the disadvantage of requiring significantly different software. A traditional microprocessor can run legacy code, which may then be accelerated by cache control instructions, whilst a scratchpad-based machine requires dedicated coding from the ground up to even function. Cache control instructions are also specific to a particular cache line size, which in practice may vary between generations of processors in the same architectural family.

Caches can also help coalesce reads and writes from less predictable access patterns (e.g. during texture mapping), whilst scratchpad DMA requires algorithms to be reworked around more predictable 'linear' traversals.

As such, scratchpads are generally harder to use with traditional programming models, although dataflow models (such as TensorFlow) might be more suitable.
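
The classic scratchpad idiom is double buffering: compute on one local tile while DMA fills the next. The sketch below is schematic; the dma_get and dma_wait calls are hypothetical placeholders for a platform DMA interface (such as the Cell SPE's MFC calls), and process stands for arbitrary per-tile computation:

    #include <stddef.h>

    #define TILE 1024                      /* illustrative tile size */
    static float buf[2][TILE];             /* two tiles in scratchpad */

    /* Hypothetical platform DMA interface. */
    extern void dma_get(void *local, const float *remote, size_t n, int tag);
    extern void dma_wait(int tag);
    /* Assumed per-tile compute function. */
    extern void process(float *tile, size_t n);

    /* Double-buffered traversal: the DMA for tile i+1 overlaps the
       computation on tile i. Remainder handling is omitted for brevity. */
    void run(const float *src, size_t total)
    {
        dma_get(buf[0], src, TILE, 0);                    /* first tile */
        for (size_t i = 0; i + TILE <= total; i += TILE) {
            int cur = (int)((i / TILE) & 1), nxt = cur ^ 1;
            if (i + 2 * TILE <= total)
                dma_get(buf[nxt], src + i + TILE, TILE, nxt);
            dma_wait(cur);                                /* tile ready */
            process(buf[cur], TILE);
        }
    }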

Vector fetch

Vector processors (for example, modern GPUs and Xeon Phi) use massive parallelism to achieve high throughput whilst working around memory latency. Many read operations are issued in parallel, for subsequent invocations of a compute kernel; calculations may be put on hold awaiting future data, whilst the execution units are devoted to working on data from past requests that has already arrived. This is easier for programmers to leverage in conjunction with the appropriate programming models (compute kernels), but harder to apply to general-purpose code.

The disadvantage is that many copies of temporary state may be held in the local memory of a processing element, awaiting data in flight.

Automatic prefetching

In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g. by performing automatic prefetching, with hardware that detects linear access patterns on the fly. However, the techniques may remain valid for throughput-oriented processors, which have a different throughput-versus-latency tradeoff, and may prefer to devote more silicon to execution units.

References

  1. ^ "PowerPC manual, see 1.10.3 Cache Control Instructions" (PDF).