Domain-specific architecture

Tensor Processing Unit
	Tensor Processing Unit 3.0
Designer	Google
Introduced	May 2016
Type	Neural network; Machine learning

A domain-specific architecture (DSA) is a programmable computer architecture tailored to operate very efficiently within a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, which are designed to run any computer program.

History

In conjunction with the semiconductor boom that started in the 1960s, computer architects were tasked with finding new ways to exploit the increasingly large number of transistors available. Moore's law and Dennard scaling enabled architects to focus on improving the performance of general-purpose microprocessors on general-purpose programs^[1]^[2].

These efforts yielded several technological innovations, such as multi-level caches, out-of-order execution, deep instruction pipelines, multithreading and multiprocessing. The impact of these innovations was measured on general-purpose benchmarks such as SPEC, and architects were not concerned with the internal structure or specific characteristics of these programs^[3].

The end of Dennard scaling pushed computer architects to switch from a single, very fast processor to several processor cores. Performance improvement could no longer be achieved by simply increasing the operating frequency of a single core.^[4]

The end of Moore's law shifted the focus away from general-purpose architectures and towards more specialized hardware. Although general-purpose CPUs will likely have a place in any computer system, heterogeneous systems composed of general-purpose and domain-specific components are the most recent trend for achieving high performance^[5].

While hardware accelerators and ASICs have been used in very specialized application domains since the inception of the semiconductor industry, they generally implement a specific function with very limited flexibility. In contrast, the shift towards domain-specific architectures aims to achieve a better balance between flexibility and specialization.

A notable early example of a domain-specific programmable architecture is the GPU. This specialized hardware was developed specifically to operate within the domain of image processing and computer graphics. These programmable processing units found widespread adoption in both gaming consoles and personal computers. With the improvement of the hardware/software stack for both NVIDIA and AMD GPUs, these architectures are increasingly used to accelerate embarrassingly parallel tasks, even outside the domain of image processing^[6].

Since the renaissance of machine-learning-based artificial intelligence in the 2010s, several domain-specific architectures have been developed to accelerate inference for different forms of artificial neural networks. Some examples are Google's TPU, NVIDIA's NVDLA^[7] and ARM's MLP^[8].

Guidelines for DSA Design

John Hennessy and David Patterson outlined five principles for DSA design that lead to better area efficiency and energy savings. A further objective for this type of architecture is often to reduce non-recurring engineering (NRE) costs, so that the investment in a specialized solution can be more easily amortized^[3].

Minimize distance over which data is moved

General-purpose memory hierarchies spend a remarkable amount of energy moving data in an attempt to minimize access latency. In a domain-specific architecture, hardware and compiler designers are expected to understand the application domain well enough to build simpler, specialized memory hierarchies, in which data movement is largely handled in software and memories are tailored to specific functions within the domain.
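
As a rough illustration of software-managed data movement, the following Python sketch compares a naive matrix multiplication, which fetches every operand from main memory, with a tiled version that first copies blocks into a small local buffer. The function names, the tile size and the read-counting model are assumptions made for this example and do not describe any particular DSA.

```python
def matmul_naive(a, b):
    """Multiply square matrices, counting one main-memory read per operand use."""
    n = len(a)
    reads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                reads += 2  # a[i][k] and b[k][j] are both fetched again
    return c, reads


def matmul_tiled(a, b, tile):
    """Same product, but operands are staged in a software-managed local buffer."""
    n = len(a)
    assert n % tile == 0, "this sketch assumes the tile size divides n"
    reads = 0
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Explicit copy into the (hypothetical) scratchpad: each element
                # is read from main memory once per tile, then reused locally.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                reads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        partial = 0
                        for k in range(tile):
                            partial += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += partial
    return c, reads


a = [[i + j for j in range(4)] for i in range(4)]
b = [[i * j for j in range(4)] for i in range(4)]
c_naive, reads_naive = matmul_naive(a, b)
c_tiled, reads_tiled = matmul_tiled(a, b, tile=2)
```

In this simplified model, a tile of side T cuts the number of main-memory operand reads by a factor of T, while the result stays the same. Writes to the output matrix are not counted.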

Invest saved resources into arithmetic units or bigger memories

Dropping general-purpose architectural optimizations such as out-of-order execution, prefetching, address coalescing and hardware speculation saves a remarkable amount of hardware resources. These resources should be reinvested to exploit the available parallelism as fully as possible, for example by adding more arithmetic units, or to relieve memory bandwidth bottlenecks by adding bigger memories.

Use the easiest form of parallelism that matches the domain

Since the target application domains almost always present an inherent form of parallelism, it is important to decide how to take advantage of this parallelism and expose it to the software. If, for example, a SIMD architecture can work in the domain, it would be easier for the programmer to use than a MIMD architecture.
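
The difference between the two programming models can be sketched in Python as follows. This is only an illustration of how the models look to the programmer: the helper names are hypothetical, and Python threads do not provide real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_add(xs, ys):
    """SIMD style: one operation (add) applied to every lane, single control flow."""
    return [x + y for x, y in zip(xs, ys)]


def mimd_run(programs, inputs):
    """MIMD style: each worker runs its own program on its own data."""
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        futures = [pool.submit(prog, data) for prog, data in zip(programs, inputs)]
        # Results are collected in submission order, whatever the finishing order.
        return [f.result() for f in futures]


lanes = simd_add([1, 2, 3], [10, 20, 30])
independent = mimd_run([sum, max, len], [[1, 2], [3, 4], [5, 6, 7]])
```

In the SIMD case the programmer reasons about a single instruction stream. In the MIMD case each worker has its own control flow, so scheduling and synchronization become part of the program.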

Reduce data size and type to the simplest needed for the domain

Whenever possible, using narrower and simpler data types yields several advantages. For example, it reduces the cost of moving data for memory-bound applications, and it can also reduce the amount of resources required to implement the respective arithmetic units.
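
A minimal Python sketch of this trade-off is shown below, assuming a symmetric linear quantization scheme that maps floating-point values to signed 8-bit integers. The scheme and the function names are chosen for illustration and are not taken from a specific DSA.

```python
from array import array


def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    quantized = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point values from the 8-bit codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.3, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

bytes_fp32 = array("f", weights).itemsize * len(weights)  # 32-bit floats
bytes_int8 = q.itemsize * len(q)                          # 8-bit integers
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The 8-bit representation needs a quarter of the storage of 32-bit floats, at the price of a rounding error of at most half a quantization step per value.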

Use a domain-specific programming language to port code to the DSA

One of the challenges for DSAs is ease of use, and more specifically the ability to program the architecture effectively and run applications on it. Whenever possible, it is advisable to use existing domain-specific languages (DSLs) such as Halide^[9] and TensorFlow^[10] to program a DSA more easily. Reusing existing compiler toolchains and software frameworks makes a new DSA significantly more accessible.

DSA for deep neural networks

One of the application domains where domain-specific architectures have been most successful is artificial intelligence. In particular, several architectures have been developed to accelerate deep neural networks (DNNs).^[11] The following sections describe some notable examples.

TPU

Google's TPU was developed in 2015 to accelerate DNN inference, after the company projected that the use of voice search would require doubling the computational resources allocated at the time for neural network inference^[12].

The TPU was designed as a co-processor communicating via a PCIe bus, so that it could be easily incorporated into existing servers. It is primarily a matrix-multiplication engine following a CISC (Complex Instruction Set Computer) ISA. The multiplication engine uses systolic execution to save energy by reducing the number of writes to SRAM^[13].
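
The general idea of systolic execution can be sketched with a small cycle-by-cycle simulation in Python. The sketch uses an output-stationary arrangement chosen for simplicity: each processing element (PE) keeps one accumulator, operands enter at the array edges and are passed from neighbor to neighbor. It is an assumption-laden teaching model and is not meant to describe the TPU's actual matrix unit.

```python
def systolic_matmul(a, b):
    """Simulate C = A x B on an n-by-m grid of multiply-accumulate PEs."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]         # one accumulator per PE
    a_reg = [[None] * m for _ in range(n)]    # A values, moving left to right
    b_reg = [[None] * m for _ in range(n)]    # B values, moving top to bottom
    cycles = n + m + k - 2                    # time for the last operand pair to meet
    for t in range(cycles):
        for i in range(n):
            # Shift row i one PE to the right, then feed the left edge.
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i                         # row i is delayed by i cycles
            a_reg[i][0] = a[i][s] if 0 <= s < k else None
        for j in range(m):
            # Shift column j one PE down, then feed the top edge.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j                         # column j is delayed by j cycles
            b_reg[0][j] = b[s][j] if 0 <= s < k else None
        for i in range(n):
            for j in range(m):
                # Each PE multiplies the operands currently passing through it.
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each operand is fetched from outside the array only once and then reused by every PE it passes through, which is where the reduction in memory accesses comes from in this model.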

The TPU was fabricated with a 28 nm process and clocked at 700 MHz. The portion of the application that runs on the TPU is implemented in TensorFlow.

The TPU computes primarily on reduced-precision integers, which further contributes to energy savings and increased performance.

Microsoft Catapult

References

[1] Moore, G.E. (January 1998). "Cramming More Components Onto Integrated Circuits". Proceedings of the IEEE. 86 (1): 82–85. doi:10.1109/jproc.1998.658762. ISSN 0018-9219.
[2] Dennard, R.H.; Gaensslen, F.H.; Yu, Hwa-Nien; Rideout, V.L.; Bassous, E.; LeBlanc, A.R. (October 1974). "Design of ion-implanted MOSFET's with very small physical dimensions". IEEE Journal of Solid-State Circuits. 9 (5): 256–268. doi:10.1109/jssc.1974.1050511. ISSN 0018-9200.
[3] Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 540. ISBN 978-0-12-811905-1.
[4] Schauer, Bryan. "Multicore Processors – A Necessity" (PDF). Archived from the original (PDF) on 2011-11-25. Retrieved 2023-07-06.
[5] Gajendra, Sharma; Prashant, Poudel (2022-11-24). "Current trends in heterogeneous systems: A review". Trends in Computer Science and Information Technology. 7 (3): 086–090. doi:10.17352/tcsit.000055. ISSN 2641-3086.
[6] "NVIDIA Accelerated Applications". NVIDIA. Retrieved 2023-07-06.
[7] "NVDLA - Microarchitectures - Nvidia - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
[8] "Machine Learning Processor (MLP) - Microarchitectures - ARM - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
[9] Ragan-Kelley, Jonathan. "Halide". halide-lang.org. Retrieved 2023-07-06.
[10] "TensorFlow". TensorFlow. Retrieved 2023-07-06.
[11] Ghayoumi, Mehdi (2021-10-12), "Deep Neural Networks (DNNs) Fundamentals and Architectures", Deep Learning in Practice, Boca Raton: Chapman and Hall/CRC, pp. 77–107, retrieved 2023-07-06
[12] Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 557. ISBN 978-0-12-811905-1.
[13] Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 560. ISBN 978-0-12-811905-1.
