
Supercomputer architecture

From Wikipedia, the free encyclopedia
The air-cooled Blue Gene/P supercomputer at Argonne National Laboratory with 250,000 processors.[1]

Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. The early architectures pioneered by Seymour Cray relied on compact, innovative designs and parallelism to achieve superior peak computational performance.[2] However, in time the demand for increased computational power ushered in the age of massively parallel systems.

While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and by the end of the 20th century massively parallel supercomputers with tens of thousands of "off-the-shelf" processors were the norm. Supercomputers of the 21st century can use over 100,000 processors connected by fast networks.[3][4]

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers.[5][6][7] The large amount of heat generated by a system may also have other effects, e.g. reducing the lifetime of other system components.[8] There have been diverse approaches to heat management, from pumping Fluorinert through the system, to a hybrid liquid-air cooling system, to air cooling with normal air conditioning temperatures.[9][10]

Context and overview

Since the late 1960s the growth in the power and proliferation of supercomputers has been dramatic, and the underlying architectural directions of these systems have taken significant turns.[3][4] While the early supercomputers relied on a small number of closely connected processors that accessed shared memory, the supercomputers of the 21st century use over 100,000 processors connected by fast networks.[3][4]

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers.[5] Seymour Cray's "get the heat out" motto was central to his design philosophy and has continued to be a key issue in supercomputer architectures, e.g. in large scale experiments such as Blue Waters.[6][5][7] The large amount of heat generated by a system may also have other effects, e.g. reducing the lifetime of other system components.[8]

An IBM HS22 blade

There have been diverse approaches to heat management, e.g. the Cray 2 pumped Fluorinert through the system, while System X used a hybrid liquid-air cooling system and the Blue Gene/P is air cooled with normal air conditioning temperatures.[9][11][12] The Aquasar supercomputer architecture uses the heat it generates to warm a university campus.[13][14]

The heat density generated by a supercomputer has a direct dependence on the processor type used in the system, with more powerful processors typically generating more heat, given similar underlying semiconductor technologies.[8] While early supercomputers used a few fast, closely packed processors that took advantage of local parallelism (e.g. pipelining and vector processing), in time the number of processors grew, and computing nodes could be placed further away, e.g. in a computer cluster, or could be geographically dispersed in grid computing.[3][15] As the number of processors in a supercomputer grows, "component failure" begins to become a serious issue. If a supercomputer uses thousands of nodes, each of which may fail once per year on average, then the system will experience several node failures each day.[10]
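
As a rough illustration of the failure arithmetic above, the following Python sketch computes the expected number of daily node failures; the node count and per-node failure rate are assumed figures chosen for illustration, not values taken from the cited sources.

    # Expected node failures per day for a large cluster (assumed figures).
    nodes = 5000                      # hypothetical number of nodes
    failures_per_node_per_year = 1.0  # hypothetical average failure rate per node

    failures_per_day = nodes * failures_per_node_per_year / 365.0
    print(f"Expected node failures per day: {failures_per_day:.1f}")  # about 13.7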

Early systems with a few processors

The cylindrical shape of the early Cray computers centralized access, keeping distances short and uniform.[4]

The CDC 6600 class machines were very early attempts at supercomputing and gained their advantage over the existing systems by relegating work to peripheral computing elements, freeing the CPU (Central Processing Unit) to process actual data. With the Minnesota FORTRAN compiler the 6600 could sustain 500 kilo-FLOPS on standard mathematical operations.[16]

Other early supercomputers such as the Cray 1 and Cray 2, which appeared afterwards, used a small number of fast processors that worked in harmony and were uniformly connected to the largest amount of shared memory that could be managed at the time.[4]

These early architectures introduced parallel processing at the processor level, with innovations such as vector processing, in which the processor can perform several operations during one clock cycle, rather than having to wait for successive cycles.
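
As a loose software analogy for vector processing (NumPy and the array sizes below are illustrative assumptions, not part of the architectures described), a single whole-array operation replaces an element-by-element loop, much as a vector instruction operates on an entire vector at once.

    # Analogy only: contrast an element-by-element loop with a whole-array
    # operation, loosely mirroring scalar versus vector processing.
    import numpy as np

    a = np.arange(100_000, dtype=np.float64)
    b = np.arange(100_000, dtype=np.float64)

    # Scalar style: one multiply-add per loop iteration.
    c_scalar = [a[i] * 2.0 + b[i] for i in range(len(a))]

    # Vector style: the whole arrays are combined by single array operations.
    c_vector = a * 2.0 + b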

In time, as the number of processors increased, different architectural issues emerged. Two issues that need to be addressed as the number of processors used in an architecture increases are the distribution of memory and of processing. In the distributed memory approach, each processor is physically packaged together with some local memory. The memory associated with other processors is then "further away", based on bandwidth and latency parameters, as in non-uniform memory access.

In the 1960s pipelining was viewed as an innovation, and by the 1970s the use of vector processors had been well established.[3] By the 1980s, many supercomputers used parallel vector processors,[3] and by 1990 parallel vector processing had gained further ground.[3]

The relatively small number of processors in early systems allowed them to easily use a shared memory architecture, in which processors could access a common pool of memory.[17] In the early days, a common approach was the use of uniform memory access (UMA), in which access time to a memory location was similar between processors. The use of non-uniform memory access (NUMA) allowed a processor to access its own local memory faster than other memory locations, while cache-only memory architectures (COMA) allowed the local memory of each processor to be used as cache, thus requiring coordination as memory values changed.[17]
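
The difference between these memory models can be sketched with a toy cost function; the latency numbers below are assumptions chosen only to illustrate that UMA charges every processor the same cost, while NUMA makes local accesses cheaper than remote ones.

    # Toy access-cost model for UMA versus NUMA (assumed latencies).
    UMA_LATENCY_NS = 100    # same cost for every processor and memory location
    NUMA_LOCAL_NS = 60      # cost when the data is on the processor's own node
    NUMA_REMOTE_NS = 200    # cost when the data is on another node

    def uma_cost(processor_node: int, memory_node: int) -> int:
        return UMA_LATENCY_NS

    def numa_cost(processor_node: int, memory_node: int) -> int:
        return NUMA_LOCAL_NS if processor_node == memory_node else NUMA_REMOTE_NS

    print(uma_cost(0, 3), numa_cost(0, 0), numa_cost(0, 3))  # 100 60 200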

As the number of processors increases, efficient interprocessor communication and synchronization on a supercomputer becomes a challenge.[18] A number of approaches may be used to achieve this goal. For instance, in the early 1980s, in the Cray X-MP system, shared registers were used. In this approach, all processors had access to shared registers that did not move data back and forth but were used only for interprocessor communication and synchronization.[18] However, inherent challenges in managing a large amount of shared memory among many processors resulted in a move to more distributed architectures.[18]
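
The idea of a small shared location used purely for coordination, rather than for moving bulk data, can be sketched in software; the example below is a loose analogy using Python's multiprocessing module, not the Cray X-MP hardware mechanism itself.

    # Analogy for a shared synchronization cell: workers update a shared
    # counter to signal completion, while the bulk data stays elsewhere.
    from multiprocessing import Process, Value

    def worker(done_count, worker_id):
        with done_count.get_lock():     # serialize updates to the shared cell
            done_count.value += 1

    if __name__ == "__main__":
        done_count = Value("i", 0)      # shared cell used only for coordination
        procs = [Process(target=worker, args=(done_count, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("workers finished:", done_count.value)   # 4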

Massive, centralized parallelism

During the 1980s, as the demand for computing power increased, the general trend toward using a much larger number of processors began, ushering in the age of massively parallel systems, with distributed memory and distributed file systems, given that shared memory architectures could not scale to a large number of processors.[3][19] Hybrid approaches such as distributed shared memory also appeared after the early systems.[20]

The computer clustering approach connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast, private local area network.[21] The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster, by and large, as one cohesive computing unit, e.g. via a single system image concept.[21]
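
A minimal sketch of the single system image idea follows; the node names and the round-robin policy are illustrative assumptions, standing in for what real clustering middleware does over the private network.

    # Toy "middleware" that presents several nodes as one scheduling target.
    from itertools import cycle

    class ToyClusterMiddleware:
        def __init__(self, node_names):
            self._nodes = cycle(node_names)      # simple round-robin policy

        def submit(self, task):
            node = next(self._nodes)
            # Real middleware would dispatch the task over the network; here
            # we only report the placement decision.
            return f"task {task!r} scheduled on {node}"

    cluster = ToyClusterMiddleware(["node01", "node02", "node03"])
    for job in ["render", "simulate", "analyze", "archive"]:
        print(cluster.submit(job))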

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing, which also use many nodes but with a far more distributed nature.[21] By the 21st century, the TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters, e.g. the world's fastest in 2011, the K computer, which has a distributed memory cluster architecture.[22][23]

Massive distributed parallelism

Example architecture of a geographically dispersed, distributively owned distributed computing system connecting many personal computers over a network

Grid computing uses a large number of computers in distributed, diverse administrative domains. It is an opportunistic approach which uses resources whenever they are available.[24] An example is BOINC, a volunteer-based, opportunistic grid system.[25] Some BOINC applications have reached multi-petaflop levels by using close to half a million computers connected over the Internet, whenever volunteer resources become available.[26] However, these types of results often do not appear in the TOP500 ratings because they do not run the general-purpose Linpack benchmark.

Although grid computing has had success in parallel task execution, demanding supercomputer applications such as weather simulations or computational fluid dynamics have remained out of reach, partly due to the barriers in the reliable sub-assignment of a large number of tasks, as well as the reliable availability of resources at a given time.[25][27][28]

In quasi-opportunistic supercomputing a large number of geographically dispersed computers are orchestrated with built-in safeguards.[29] The quasi-opportunistic approach goes beyond volunteer computing on highly distributed systems such as BOINC, or general grid computing on a system such as Globus, by allowing the middleware to provide almost seamless access to many computing clusters so that existing programs in languages such as Fortran or C can be distributed among multiple computing resources.[29]

Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.[30] The quasi-opportunistic approach enables the execution of demanding applications within computer grids by establishing grid-wise resource allocation agreements and fault-tolerant message passing to abstractly shield against the failures of the underlying resources, thus maintaining some opportunism while allowing a higher level of control.[29][24][31]
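
The fault-tolerance aspect can be sketched as a simple failover loop; the resource names and failure probability below are assumptions, and the sketch stands in for, rather than reproduces, the middleware described above.

    # Sketch of fault-tolerant task submission: if a resource drops out, the
    # task is retried on the next available resource.
    import random

    random.seed(0)                                 # reproducible demonstration
    RESOURCES = ["gridA", "gridB", "gridC"]        # hypothetical resource pool

    def run_on(resource, task):
        if random.random() < 0.3:                  # simulate a resource dropping out
            raise RuntimeError(f"{resource} became unavailable")
        return f"{task} completed on {resource}"

    def submit_with_failover(task, resources):
        for resource in resources:
            try:
                return run_on(resource, task)
            except RuntimeError:
                continue                           # fall back to the next resource
        raise RuntimeError(f"no resource could run {task}")

    print(submit_with_failover("sample-task", RESOURCES))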

Person walking between the racks of a Cray XE6

The air-cooled IBM Blue Gene supercomputer architecture trades processor speed for low power consumption so that a larger number of processors can be used at room temperature, with normal air conditioning.[32][33]

The K computer is a water-cooled, homogeneous-processor, distributed-memory system with a cluster architecture.[34][35][36] It uses over 80,000 SPARC-based processors, each with eight cores, for a total of over 700,000 cores, almost twice as many as any other system. It comprises over 800 cabinets, each with 96 computing nodes (each with 16 GB of memory) and 6 I/O nodes. It is more powerful than the next five systems on the TOP500 list combined, and at 824.56 MFLOPS/W it has the lowest power-to-performance ratio of any current major supercomputer system.[37][38]
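
The quoted efficiency figure can be restated as power per unit of performance with a line of unit arithmetic; no data beyond the 824.56 MFLOPS/W value above is used.

    # Convert MFLOPS per watt into watts per GFLOPS (1 GFLOPS = 1000 MFLOPS).
    mflops_per_watt = 824.56
    watts_per_gflops = 1000.0 / mflops_per_watt
    print(f"{watts_per_gflops:.2f} W per GFLOPS")   # about 1.21 W per GFLOPS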

Unlike the K computer, the Tianhe-1A system uses a hybrid architecture and integrates CPUs and GPUs.[5] It uses over 14,000 Xeon general-purpose processors and over 7,000 Nvidia Tesla graphics processors on about 3,500 blades.[39] It has 112 computer cabinets and 262 terabytes of distributed memory; 2 petabytes of disk storage are implemented via Lustre clustered files.[40][41][42][43] Tianhe-1 uses a proprietary high-speed communication network to connect the processors.[5] The proprietary interconnect network is based on Infiniband QDR, enhanced with Chinese-made FeiTeng-1000 CPUs.[5] The interconnect is twice as fast as Infiniband, but not as fast as some interconnects on other supercomputers.[44]

The limits of specific approaches continue to be tested, as boundaries are reached through large-scale experiments; e.g., in 2011 IBM ended its participation in the Blue Waters petaflops project at the University of Illinois.[45][46] The Blue Waters architecture was based on the IBM POWER7 processor and was intended to have 200,000 cores with a petabyte of "globally addressable memory" and 10 petabytes of disk space.[7] The goal of a sustained petaflop led to design choices that optimized single-core performance, and hence a lower number of cores. The lower number of cores was then expected to help performance on programs that did not scale well to a large number of processors.[7] The large globally addressable memory architecture aimed to solve memory address problems in an efficient manner, for the same type of programs.[7] Blue Waters had been expected to run at sustained speeds of at least one petaflop, and relied on a specific water-cooling approach to manage heat. In the first four years of operation, the National Science Foundation spent about $200 million on the project. IBM released the Power 775 computing node derived from that project's technology soon thereafter, but effectively abandoned the Blue Waters approach.[46][45]

Architectural experiments are continuing in a number of directions, e.g. the Cyclops64 system uses a "supercomputer on a chip" approach, in a direction away from the use of massive distributed processors.[47][48] Each 64-bit Cyclops64 chip contains 80 processors, and the entire system uses a globally addressable memory architecture.[49] The processors are connected with a non-internally blocking crossbar switch and communicate with each other via global interleaved memory. There is no data cache in the architecture, but half of each SRAM bank can be used as scratchpad memory.[49] Although this type of architecture allows unstructured parallelism in a dynamically non-contiguous memory system, it also produces challenges in the efficient mapping of parallel algorithms to a many-core system.[48]
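
The contrast between an implicit data cache and an explicitly managed scratchpad can be sketched as follows; the buffer size and data are illustrative assumptions, and the code is not the Cyclops64 programming interface.

    # Toy scratchpad usage: data is explicitly staged into a small local
    # buffer, processed there, and written back to global memory.
    GLOBAL_MEMORY = list(range(1024))    # stand-in for interleaved global memory
    SCRATCHPAD_SIZE = 16                 # assumed scratchpad capacity (elements)

    def process_block(start):
        scratchpad = GLOBAL_MEMORY[start:start + SCRATCHPAD_SIZE]   # stage in
        scratchpad = [x * x for x in scratchpad]                    # compute locally
        GLOBAL_MEMORY[start:start + SCRATCHPAD_SIZE] = scratchpad   # write back

    process_block(0)
    print(GLOBAL_MEMORY[:4])   # [0, 1, 4, 9]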

See also

References

  1. ^ IBM Blue Gene announcement
  2. ^ Hardware software co-design of a multimedia SOC platform by Sao-Jie Chen, Guang-Huei Lin, Pao-Ann Hsiung, Yu-Hen Hu 2009, pages 70-72
  3. ^ a b c d e f g h Supercomputers: directions in technology and applications by Allan R. Hoffman et al., National Academies, 1990 ISBN 0309040884 pages 35-47
  4. ^ a b c d e Readings in computer architecture by Mark Donald Hill, Norman Paul Jouppi, Gurindar Sohi 1999 ISBN 1558605398 pages 40-49
  5. ^ a b c d e f g The TianHe-1A Supercomputer: Its Hardware and Software by Xue-Jun Yang, Xiang-Ke Liao, et al in the Journal of Computer Science and Technology, Volume 26, Number 3, pages 344-351 [1]
  6. ^ a b The Supermen: Story of Seymour Cray and the Technical Wizards Behind the Supercomputer by Charles J. Murray 1997 ISBN 0471048852 pages 133-135
  7. ^ a b c d e Parallel Computational Fluid Dynamics: Recent Advances and Future Directions edited by Rupak Biswas 2010 ISBN 160595022X page 401
  8. ^ a b c Supercomputing Research Advances by Yongge Huáng 2008 ISBN 1604561866 pages 313-314
  9. ^ a b Parallel computing for real-time signal processing and control by M. O. Tokhi, Mohammad Alamgir Hossain 2003 ISBN 9781852335991 pages 201-202
  10. ^ a b Computational science -- ICCS 2005: 5th international conference edited by Vaidy S. Sunderam 2005 ISBN 3540260439 pages 60-67
  11. ^ Varadarajan, S. System X: building the Virginia Tech supercomputer in Proceedings. 13th International Conference on Computer Communications and Networks, 2004. ICCCN 2004. ISBN 0-7803-8814-3
  12. ^ IBM uncloaks 20 petaflops BlueGene/Q super The Register November 22, 2010
  13. ^ HPC Wire July 2, 2010
  14. ^ CNet May 10, 2010
  15. ^ Encyclopedia of Computer Science and Technology by Harry Henderson 2008 ISBN 0816063826 page 217
  16. ^ Frisch, Michael (Dec 1972). "Remarks on Algorithms". Communications of the ACM 15 (12): 1074.
  17. ^ a b Advanced computer architecture and parallel processing by Hesham El-Rewini et al. 2005 ISBN 978-0-471-46740-3 pages 77-80
  18. ^ a b c High Performance Computing: Technology, Methods and Applications, Volume 10 by J.J. Dongarra, L. Grandinetti, J. Kowalik and G.R. Joubert 1995 ISBN 0444821635 pages 123-125
  19. ^ Applications on advanced architecture computers by Greg Astfalk 1996 ISBN 0898713684 pages 61-64
  20. ^ Distributed Shared Memory: Concepts and Systems by Jelica Protic, Milo Tomasevic and Veljko Milutinović 1997 ISBN 0818677376 pages ix-x
  21. ^ a b c Network-Based Information Systems: First International Conference, NBIS 2007 ISBN 3540745726 page 375
  22. ^ TOP500 list To view all clusters on the TOP500 select "cluster" as architecture from the sublist menu.
  23. ^ M. Yokokawa et al The K Computer, in "International Symposium on Low Power Electronics and Design" (ISLPED) 1-3 Aug. 2011, pages 371-372
  24. ^ a b Grid computing: experiment management, tool integration, and scientific workflows by Radu Prodan, Thomas Fahringer 2007 ISBN 3540692614 pages 1-4
  25. ^ a b Parallel and Distributed Computational Intelligence by Francisco Fernández de Vega 2010 ISBN 3642106749 pages 65-68
  26. ^ BOINC statistics, 2011
  27. ^ Cite error: The named reference Gao was invoked but never defined.
  28. ^ Cite error: The named reference mario was invoked but never defined.
  29. ^ a b c d Quasi-opportunistic supercomputing in grids by Valentin Kravtsov, David Carmeli , Werner Dubitzky , Ariel Orda , Assaf Schuster , Benny Yoshpa, in IEEE International Symposium on High Performance Distributed Computing, 2007, pages 233-244 [2]
  30. ^ Computational Science - Iccs 2008: 8th International Conference edited by Marian Bubak 2008 ISBN 9783540693833 pages 112-113
  31. ^ Computational Science - Iccs 2009: 9th International Conference edited by Gabrielle Allen, Jarek Nabrzyski 2009 ISBN 3642019692 pages 387-388
  32. ^ Euro-Par 2005 parallel processing: 11th International Euro-Par Conference edited by José Cardoso Cunha, Pedro D. Medeiros 2005 ISBN 9783540287001 pages 560-567
  33. ^ IBM uncloaks 20 petaflops BlueGene/Q super The Register November 22, 2010
  34. ^ M. Yokokawa et al., The K computer: Japanese next-generation supercomputer development project, IEEE Xplore, 1-3 Aug. 2011, pp. 371-372. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5993668
  35. ^ TOP500 list
  36. ^ M. Yokokawa et al The K Computer, in "International Symposium on Low Power Electronics and Design" (ISLPED) 1-3 Aug. 2011, pages 371-372
  37. ^ Takumi Maruyama (2009). SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core Processor for PETA Scale computing (PDF). Proceedings of Hot Chips 21. IEEE Computer Society.
  38. ^ "RIKEN Advanced Institute for Computational Science" (PDF), RIKEN, retrieved 20 June 2011
  39. ^ http://www.msnbc.msn.com/id/39519135/ns/business-bloomberg_businessweek/
  40. ^ "China ..." 28 October 2010.
  41. ^ "Top100 ..." 28 October 2010.
  42. ^ Tianhe-1A
  43. ^ The TianHe-1A Supercomputer: Its Hardware and Software by Xue-Jun Yang, Xiang-Ke Liao, et al in the Journal of Computer Science and Technology, Volume 26, Number 3, pages 344-351
  44. ^ U.S. says China building 'entirely indigenous' supercomputer, by Patrick Thibodeau Computerworld, November 4, 2010 [3]
  45. ^ a b The Register: IBM yanks chain on 'Blue Waters' super
  46. ^ a b The Statesman IBM's Unix computer business is booming
  47. ^ Performance Modelling and Optimization of Memory Access on Cellular Computer Architecture Cyclops64 K Barner, GR Gao, Z Hu, Lecture Notes in Computer Science, 2005, Volume 3779, Network and Parallel Computing, Pages 132-143
  48. ^ a b Analysis and performance results of computing betweenness centrality on IBM Cyclops64 by Guangming Tan, Vugranam C. Sreedhar and Guang R. Gao The Journal of Supercomputing Volume 56, Number 1, 1-24 September 2011
  49. ^ a b Network and Parallel Computing: IFIP International Conference, NPC 2005, November 30 - December 3, 2005, by Hai Jin, Daniel Reed and Wenbin Jiang ISBN 354029810X pages 132-133