Supercomputer operating system
Since the end of the 20th century, supercomputer operating systems have undergone major transformations, driven by the sea changes that have taken place in supercomputer architecture.[1]
While in a traditional multi-user computer system job scheduling is, in effect, a scheduling problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources.[2]
Modern supercomputers may run different operating systems on different nodes, e.g. using a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but a larger and more full-fledged system such as a Linux-derivative on server and I/O nodes.[3][4]
Although most modern supercomputers use the Linux operating system, each manufacturer has made its own specific changes to the Linux derivative it uses, and no industry standard exists, partly because differences in hardware architecture require changes to optimize the operating system for each design.[1][5]
Context and overview

In the early days of supercomputing, the basic architectural concepts were evolving rapidly, and system software had to follow hardware innovations that usually took rapid turns.[1] In the early systems, operating systems were custom-tailored to each supercomputer to gain speed, yet in the rush to develop them serious software quality challenges surfaced, and in many cases the cost and complexity of system software development became as much of an issue as the hardware.[1] In the 1980s the cost of software development at Cray came to equal what was spent on hardware, and that trend was partly responsible for a move away from in-house operating systems toward the adaptation of generic software.[6] The first wave of operating system changes came in the mid-1980s, as vendor-specific operating systems were abandoned in favor of Unix, and despite early skepticism this transition proved successful.[1][6]
By the early 1990s major changes were taking place in supercomputing system software.[1] By this time, the use of Unix in itself had started to change the way system software was viewed. The use of a high-level language (C) to implement the operating system, and the reliance on standardized interfaces, stood in contrast to the assembly-language-oriented approaches of the past.[1] As hardware vendors adapted Unix to their systems, new and useful features were added to Unix, e.g. fast file systems and tunable process schedulers.[1] However, all the companies that adopted Unix made their own specific changes to it, rather than collaborating on an industry standard to create a "Unix for supercomputers". This was partly because the differences in their architectures required these changes to optimize Unix to each architecture.[1]
Thus, as general-purpose operating systems became stable, supercomputers began to borrow and adapt critical system code from them and to rely on the rich set of secondary functionality that came with them, without having to reinvent the wheel.[1] At the same time, however, the size of the code base for general-purpose operating systems was growing rapidly, and by the time Unix-based code had reached 500,000 lines its maintenance and use were a challenge.[1] This resulted in a move toward microkernels, which implemented only a minimal set of operating system functions. Systems such as Mach at Carnegie Mellon University and Chorus at INRIA were examples of early microkernels.[1]
While in a traditional multi-user computer system job scheduling is, in effect, a scheduling problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources.[2] It is essential to tune task scheduling, and the operating system itself, for the different configurations of a supercomputer. A typical parallel job scheduler has a master scheduler which instructs a number of slave schedulers to launch, monitor and control parallel jobs, and which periodically receives reports from them about the status of job progress.[2]
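The master/slave arrangement can be sketched as follows. This is a minimal, hypothetical illustration in Python, not code from any actual job management system; the class names, the round-robin assignment and the report format are assumptions made for the example.

```python
import queue
import threading
import time

class SlaveScheduler(threading.Thread):
    """Hypothetical slave scheduler: launches assigned jobs and reports their status."""
    def __init__(self, name, job_queue, report_queue):
        super().__init__(daemon=True)
        self.name = name
        self.job_queue = job_queue          # jobs assigned by the master
        self.report_queue = report_queue    # periodic status reports back to the master

    def run(self):
        while True:
            job = self.job_queue.get()
            self.report_queue.put((self.name, job, "started"))
            time.sleep(0.1)                 # stand-in for launching and monitoring a parallel job
            self.report_queue.put((self.name, job, "finished"))

class MasterScheduler:
    """Hypothetical master scheduler: assigns jobs to slaves and collects their reports."""
    def __init__(self, n_slaves=2):
        self.report_queue = queue.Queue()
        self.next_slave = 0
        self.slaves = []
        for i in range(n_slaves):
            q = queue.Queue()
            slave = SlaveScheduler(f"slave-{i}", q, self.report_queue)
            slave.start()
            self.slaves.append((slave, q))

    def submit(self, job):
        # naive round-robin assignment of jobs to slave schedulers
        _, q = self.slaves[self.next_slave % len(self.slaves)]
        self.next_slave += 1
        q.put(job)

    def poll_reports(self, n):
        # collect and print n progress reports from the slaves
        for _ in range(n):
            print(self.report_queue.get())

master = MasterScheduler()
for job in ("job-A", "job-B"):
    master.submit(job)
master.poll_reports(4)   # two "started" and two "finished" reports
```

In a real job management system the slave schedulers would run on separate nodes and communicate with the master over the machine's interconnect rather than through in-process queues.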
The separation of the operating system into separate components also became necessary as supercomputers developed different types of nodes, e.g. compute nodes versus I/O nodes. On a single supercomputer, several Linux-based operating systems may thus run side by side, e.g. Cray uses Compute Node Linux on the compute nodes and a fuller Linux derivative on other nodes, so the "entire OS" is really a combination of multiple operating systems.
Early systems

The CDC 6600, generally considered the first supercomputer in the world, ran the Chippewa Operating System, which was then deployed on various other CDC 6000 series computers.[8] The Chippewa was a rather simple job-control-oriented system derived from that of the earlier CDC 3000, but it influenced the later KRONOS and SCOPE systems.[9][8]
The first Cray 1 was delivered to the Los Alamos Lab without an operating system, or any other software.[7] Los Alamos developed not only the application software for it, but also the operating system.[7] The main timesharing system for the Cray 1, the Cray Time Sharing System (CTSS), was then developed at the Livermore Labs as a direct descendant of the Livermore Time Sharing System (LTSS), the CDC 6600 operating system from twenty years earlier.[7]
The rising cost of software development soon became dominant, as evidenced by the fact that in the 1980s the cost of software development at Cray came to equal what they spent on hardware.[6] That trend was partly responsible for the move away from the in-house Cray Operating System to the Unix-based UNICOS system.[6] In 1985, the Cray 2 was the first system to ship with the UNICOS operating system.[10]
Around the same time, the EOS operating system was developed by ETA Systems for use in their ETA10 supercomputers in the 1980s.[11] Written in Cybil, a Pascal-like language from Control Data Corporation, EOS highlighted the difficulty of developing a stable operating system for supercomputers, and eventually a Unix-like system was offered on the same machine.[12][11] The lessons learned from developing the ETA system software included the high level of risk associated with building a new supercomputer operating system, and the advantages of using Unix, with its large existing base of system software libraries.[11]
By the mid-1990s, despite the existing investment in older operating systems, the trend was toward the use of Unix-based systems, which also facilitated the use of interactive user interfaces for scientific computing across multiple platforms.[13] That trend continued to build momentum, and by 2005 the United States National Research Council's review of supercomputing could state directly: "virtually all supercomputers today use some variant of UNIX".[14] These variants of UNIX include AIX from IBM, the open source Linux system, and other adaptations such as UNICOS from Cray.[14] Linux is estimated to command the largest share of the supercomputing pie, but these are only estimates, since some sites do not reveal the exact operating system they use.[15]
Modern approaches
The IBM Blue Gene runs different operating systems on different nodes. It uses the CNK operating system on the compute nodes, but a modified Linux-based kernel called INK (for I/O Node Kernel) on the I/O nodes.[3][16] CNK is a lightweight kernel that runs on each compute node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal: physical memory is statically mapped, and CNK neither needs nor provides scheduling or context switching.[3] CNK does not even implement file I/O on the compute node, but delegates that to dedicated I/O nodes.[16] However, given that on the Blue Gene multiple compute nodes share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of a Linux-based operating system.
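The division of labor between compute nodes and I/O nodes can be illustrated with a small sketch. The following Python code is purely illustrative and is not the actual CNK or I/O-node protocol; the message format, function names and socket transport are assumptions chosen for the example. A compute-node process ships a write request to an I/O-node daemon, which performs the file operation on its behalf:

```python
import json
import socket
import threading

def io_node_daemon(listen_sock):
    """Stand-in for the I/O node: accepts a shipped I/O request and executes it locally."""
    conn, _ = listen_sock.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode())
        if request["op"] == "write":
            with open(request["path"], "a") as f:
                f.write(request["data"])
        conn.sendall(b"ok")   # acknowledge completion to the compute node

def compute_node_write(io_node_addr, path, data):
    """Stand-in for a compute node: performs no local file I/O, shipping the request instead."""
    with socket.create_connection(io_node_addr) as sock:
        sock.sendall(json.dumps({"op": "write", "path": path, "data": data}).encode())
        return sock.recv(16) == b"ok"

# Demo on one machine: the "I/O node" listens on a loopback port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
threading.Thread(target=io_node_daemon, args=(listener,), daemon=True).start()
print(compute_node_write(listener.getsockname(), "demo.out", "result from a compute node\n"))
```

Running the demo prints True once the "I/O node" has appended the data to demo.out; on a real machine, many compute-node kernels would forward such requests to the one I/O node they share.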
References
- ^ a b c d e f g h i j k l Encyclopedia of Parallel Computing by David Padua 2011 ISBN 0387097651 pages 426-429
- ^ a b c Yariv (full citation missing)
- ^ a b c Euro-Par 2004 Parallel Processing: 10th International Euro-Par Conference 2004, by Marco Danelutto, Marco Vanneschi and Domenico Laforenza ISBN 3540229248 page 835
- ^ An Evaluation of the Oak Ridge National Laboratory Cray XT3 by Sadaf R. Alam et al., International Journal of High Performance Computing Applications, February 2008, vol. 22, no. 1, pages 52-80
- ^ "Top500 OS chart". Top500.org. Retrieved 2010-10-31.
- ^ a b c d Knowing machines: essays on technical change by Donald MacKenzie 1998 ISBN 0262631881 pages 149-151
- ^ a b c d Targeting the computer: government support and international competition by Kenneth Flamm 1987 ISBN 0815728514 pages 81-83
- ^ a b The computer revolution in Canada by John N. Vardalas 2001 ISBN 0262220644 page 258
- ^ Design of a computer: the Control Data 6600 by James E. Thornton, Scott, Foresman Press 1970 page 163
- ^ Lester T. Davis, The balance of power, a brief history of Cray Research hardware architectures in "High performance computing: technology, methods, and applications" by J. J. Dongarra 1995 ISBN 0444821635 page 126
- ^ a b c Lloyd M. Thorndyke, The Demise of the ETA Systems in "Frontiers of Supercomputing II" by Karyn R. Ames, Alan Brenner 1994 ISBN 0520084012 pages 489-497
- ^ Past, present, parallel: a survey of available parallel computer systems by Arthur Trew 1991 ISBN 3540196641 page 326
- ^ Frontiers of Supercomputing II by Karyn R. Ames, Alan Brenner 1994 ISBN 0520084012 page 356
- ^ a b Getting up to speed: the future of supercomputing by Susan L. Graham, Marc Snir, Cynthia A. Patterson, National Research Council 2005 ISBN 0309095026 page 136
- ^ Forbes magazine, 03.15.05: Linux Rules Supercomputers
- ^ a b Euro-Par 2006 Parallel Processing: 12th International Euro-Par Conference, 2006, by Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner ISBN 3540377832 page