Panasas ActiveScale File System

The Panasas ActiveScale File System is a parallel, distributed, fault-tolerant, object-based file system, commonly known by its short name, PanFS. PanFS is a commercial (proprietary) storage system from Panasas designed for HPC environments such as large-scale simulation, seismic data processing, computational fluid dynamics, Monte Carlo simulation, and computer-aided design. It is a general-purpose file system that implements the POSIX interface.

A detailed description of the system can be found in the proceedings of the 2008 USENIX Conference on File and Storage Technologies (FAST '08) under the title "Performance of the Panasas Parallel File System".

Introduction

The PanFS file system is based on Object storage devices (OSDs) that store chunks of files and their attributes. Files are striped across multiple objects on different OSDs, and PanFS clients write redundant data into the objects using RAID algorithms so that the system can tolerate the loss of one or more OSDs. Research systems such as Zebra and Petal also striped file data across data containers, as do commercial systems such as IBRIX and OneFS. OneFS likewise stripes with redundant data, but that redundancy is written internally, whereas PanFS clients generate their own redundant data, which eliminates a network hop in the data path.
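
The following Python sketch illustrates the client-side RAID idea described above. It is not Panasas code: the stripe sizes, the FakeOSD class, and the write_object call are assumptions used only to show how a client can compute XOR parity itself and send each member of a parity group to a different OSD.

    STRIPE_UNIT = 64 * 1024   # hypothetical stripe unit size
    STRIPE_WIDTH = 4          # hypothetical number of data objects per parity group

    def xor_parity(blocks):
        """XOR equal-length blocks together to produce the parity block."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    def stripe_write(data, osds):
        """Split data into stripe units, compute parity on the client, and
        send each unit of a parity group to a different OSD."""
        units = [data[i:i + STRIPE_UNIT].ljust(STRIPE_UNIT, b"\x00")
                 for i in range(0, len(data), STRIPE_UNIT)]
        for g in range(0, len(units), STRIPE_WIDTH):
            group = units[g:g + STRIPE_WIDTH]
            group += [b"\x00" * STRIPE_UNIT] * (STRIPE_WIDTH - len(group))
            parity = xor_parity(group)
            for unit, osd in zip(group + [parity], osds):
                osd.write_object(unit)   # hypothetical OSD write call

    class FakeOSD:
        """Stand-in for an object storage device; real OSDs speak iSCSI/OSD."""
        def __init__(self):
            self.objects = []
        def write_object(self, data):
            self.objects.append(data)

    osds = [FakeOSD() for _ in range(STRIPE_WIDTH + 1)]
    stripe_write(b"x" * (256 * 1024), osds)   # one full parity group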

The Panasas ActiveScale File System is deployed as a combined hardware/software solution built on a blade-based hardware platform. StorageBlades implement the OSD interface and contain two high-capacity SATA disk drives, a processor, memory, and dual 1 GbE network interfaces. Up to 11 of these blades fit into a 4U rack-mount chassis that includes two power supplies and a battery, providing an integrated UPS and a non-volatile memory environment for efficient write caching. DirectorBlades run additional services that implement the distributed PanFS file system and the cluster management services that make a large collection of StorageBlades and DirectorBlades (hundreds to thousands of blades and petabytes of storage) appear as a single storage system.

Storage Model

PanFS is a file system with POSIX semantics. The administrator has to manage two basic concepts: the BladeSet and the Volume. The BladeSet is a physical hardware resource composed of one or more blade chassis. It represents a performance domain, and the system will automatically balance capacity and load across elements of a BladeSet. The Volume is a logical file hierarchy (i.e., a directory tree) that is resident on a particular BladeSet. One or more Volumes are co-resident on a BladeSet and compete for capacity and performance within that BladeSet.

A BladeSet can be grown by adding hardware, and the capacity of the existing OSDs in the BladeSet is rebalanced by migrating data in the background. Old hardware can be removed from a BladeSet by draining data from its OSDs onto new, replacement hardware.

Volumes have an optional directory quota that can be modified dynamically to manage space consumption by applications. The system also implements per-user and per-group quotas.
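
A rough way to picture the relationship among BladeSets, Volumes, and quotas is the following Python data model. The class and field names are hypothetical; they only encode the constraints described above (a Volume lives on exactly one BladeSet, several Volumes may share one, and a Volume may carry an optional quota).

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class BladeSet:
        """Physical pool of one or more blade chassis (a performance domain)."""
        name: str
        chassis: List[str] = field(default_factory=list)

    @dataclass
    class Volume:
        """Logical directory tree resident on exactly one BladeSet."""
        path: str                           # directory within the realm
        bladeset: BladeSet                  # the physical pool whose capacity it shares
        quota_bytes: Optional[int] = None   # optional, dynamically adjustable quota

    # Two volumes sharing the same BladeSet compete for its capacity and bandwidth.
    pool = BladeSet("set1", chassis=["shelf-1", "shelf-2"])
    home = Volume("/home", pool, quota_bytes=10 * 2**40)
    scratch = Volume("/scratch", pool)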

Volumes appear in the file system name space as directories below the mount point. By default, PanFS provides a global name space under /panfs/realm where the contents of /panfs are all realms known to a file system client, and the directories under /panfs/realm are the volumes within that realm. It is also possible to mount a realm or volume at a non-standard location in the client's local namespace.

Snapshots are implemented on a per-volume basis. A snapshot requires a coordinated action among the metadata managers, file system clients, and storage devices for that volume. Snapshot files appear under a .snapshot subdirectory. The management console provides a snapshot scheduler, and the number of snapshots can be limited or snapshots can be deleted automatically if the system runs out of capacity.

System Components

The main software components of the PanFS implementation include:

The OSDfs object storage system. This is a log-structured file system that manages block devices (e.g., the SATA drives in a StorageBlade) and provides the Object storage device interface, which offers a more flexible byte-oriented data container and extensible attributes. OSDfs performs efficient block allocation and data prefetching. It implements a non-overwrite (i.e., log-structured) system and supports snapshots.
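
The behavior that matters here, append-only writes plus extensible per-object attributes, can be sketched in a few lines of Python. This is a toy illustration, not the OSDfs on-disk format; all names are hypothetical.

    class LogStructuredObjectStore:
        """Toy non-overwrite object store: writes are appended to a log and
        an index maps each object to its log records."""
        def __init__(self):
            self.log = []     # append-only list of (object_id, offset, data)
            self.index = {}   # object_id -> positions of its records in the log
            self.attrs = {}   # object_id -> extensible attribute dictionary

        def write(self, object_id, offset, data):
            # Old data is never overwritten in place, which keeps snapshots
            # cheap: a snapshot only remembers which log records it can see.
            self.log.append((object_id, offset, data))
            self.index.setdefault(object_id, []).append(len(self.log) - 1)

        def set_attr(self, object_id, key, value):
            self.attrs.setdefault(object_id, {})[key] = value

        def read(self, object_id):
            # Replay the object's records in order to rebuild its current bytes.
            content = bytearray()
            for pos in self.index.get(object_id, []):
                _, offset, data = self.log[pos]
                if len(content) < offset + len(data):
                    content.extend(b"\x00" * (offset + len(data) - len(content)))
                content[offset:offset + len(data)] = data
            return bytes(content)

    store = LogStructuredObjectStore()
    store.write("obj-1", 0, b"hello")
    store.set_attr("obj-1", "owner", "panfs")
    assert store.read("obj-1") == b"hello"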

The Pan Manager file system metadata manager. This is divided into File Manager (FM), Storage Manager (SM), and Lock Manager (LM) layers. The FM implements the POSIX file system semantics and a cache consistency protocol with PanFS clients. The SM implements data striping and recovery from OSD failures. Each file has its own RAID equation and is chunked and written to different OSDs based on a map determined by the SM; a file's map can change over time as the file grows or adapts to failures. The LM implements byte-range mandatory or advisory locks (i.e., flock) and share locks on files. These are application-level locks, unrelated to the internal locking and consistency protocols among the FM, SM, and DirectFLOW clients.
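
From an application's point of view, the locks the LM arbitrates are ordinary POSIX advisory locks. A minimal Python example follows; the file path is a hypothetical location under the default /panfs namespace, and the same calls work on any POSIX file system.

    import fcntl
    import os

    # Hypothetical path on a PanFS mount.
    path = "/panfs/realm/vol1/shared.dat"

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Take an exclusive advisory lock on bytes 0..4095, blocking if needed.
        fcntl.lockf(fd, fcntl.LOCK_EX, 4096, 0)
        os.pwrite(fd, b"update", 0)   # critical section protected by the lock
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, 4096, 0)
        os.close(fd)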

The FM/SM/LM services use a high-speed, replicated transaction log to provide fault tolerance. A file system transaction (e.g., a file create, delete, or rename) is journaled, and the journal entries are reflected to a backup via a low-latency network protocol. If the FM/SM/LM service fails, a backup resumes operation based on the contents of the transaction log.
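
A minimal sketch of this journaling pattern, with invented names and the network replication reduced to a method call, looks like the following; the point is only that the entry is journaled and reflected to the backup before the operation is applied.

    class ReplicatedLog:
        def __init__(self, backup=None):
            self.entries = []
            self.backup = backup              # peer that mirrors every entry

        def append(self, entry):
            self.entries.append(entry)
            if self.backup is not None:
                self.backup.append(entry)     # reflected over a low-latency link (elided)

    def journal_and_apply(log, op, apply_fn):
        log.append(op)                        # journaled and replicated first
        apply_fn(op)                          # then applied to live file system state

    # Example: a primary with one backup copy of its journal.
    backup = ReplicatedLog()
    primary = ReplicatedLog(backup)
    journal_and_apply(primary, {"op": "create", "path": "/vol1/file"}, lambda e: None)
    assert backup.entries == primary.entries  # the backup can replay and take over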

The SM implements a parallel RAID rebuild based on declustered data placement. Each file has its own RAID protection which means each file can be reconstructed independently. File data is organized in multiple parity groups that provide a two-level data striping scheme. Parity groups are spread out (i.e., declustered) over available StorageBlades. When an OSD (i.e., StorageBlade) fails, all the files that had parity groups stored on that OSD are impacted by the failure. The SM can efficiently determine the set of impacted files, and then uses a master/worker model to farm out reconstruction work. Each worker is given a batch of files to rebuild, which is done by reading the surviving components of the file's parity group and regenerating the lost component(s) based on parity information in the remaining components. The advantage of this scheme is that larger systems that have more hardware resources can reconstruct lost data more quickly.
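
The master/worker structure can be sketched as follows. This is an illustration with hypothetical data structures, not the SM's actual code: file_maps stands in for the per-file component maps, and rebuild_file elides the actual parity reconstruction.

    from concurrent.futures import ThreadPoolExecutor

    def impacted_files(file_maps, failed_osd):
        """file_maps: {file_id: set of OSD ids holding that file's components}."""
        return [f for f, osds in file_maps.items() if failed_osd in osds]

    def rebuild_file(file_id):
        # In the real system a worker reads the surviving components of the
        # file's parity groups and regenerates the lost pieces from parity.
        return file_id

    def parallel_rebuild(file_maps, failed_osd, workers=4, batch_size=2):
        files = impacted_files(file_maps, failed_osd)
        batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
        rebuilt = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for done in pool.map(lambda b: [rebuild_file(f) for f in b], batches):
                rebuilt.extend(done)   # more hardware -> more workers -> faster rebuild
        return rebuilt

    # Files are declustered over OSDs 1-4; OSD 3 fails.
    maps = {"a": {1, 2, 3}, "b": {2, 4, 1}, "c": {3, 4, 2}}
    print(parallel_rebuild(maps, failed_osd=3))   # rebuilds "a" and "c"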

The DirectFLOW client. This is the PanFS client; it communicates with the FM/SM using an RPC protocol and with the StorageBlades using the iSCSI/OSD protocol. The client is allowed to cache file data, directory information, and file attributes, and these caches are kept consistent via a callback protocol with the FM. The PanFS client is responsible for writing data and parity to the OSDs so that the system can recover from failures. It uses a start-write/end-write protocol that informs the FM/SM about changes to a file's content. The DirectFLOW client is implemented as a dynamically loaded Linux kernel module that implements the Virtual File System (VFS) kernel interface.
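
The callback portion of the protocol can be illustrated with a toy model. The class and method names below are assumptions, and the real protocol also covers attribute caching and the end-write step, but the core idea, breaking other clients' callbacks before a write proceeds, is the same.

    class Client:
        """Stand-in DirectFLOW client that caches data for files it has read."""
        def __init__(self, name):
            self.name = name
            self.cache = {}                 # file_id -> cached data

        def invalidate(self, file_id):
            self.cache.pop(file_id, None)   # drop the stale copy on callback

    class FileManager:
        """Toy callback registry: remembers who caches what and breaks the
        callbacks of other clients before a write is allowed to proceed."""
        def __init__(self):
            self.callbacks = {}             # file_id -> set of caching clients

        def register(self, file_id, client):
            self.callbacks.setdefault(file_id, set()).add(client)

        def start_write(self, file_id, writer):
            for client in self.callbacks.get(file_id, set()) - {writer}:
                client.invalidate(file_id)
            self.callbacks[file_id] = {writer}

    fm = FileManager()
    a, b = Client("a"), Client("b")
    fm.register("file-1", a); a.cache["file-1"] = b"old"
    fm.register("file-1", b); b.cache["file-1"] = b"old"
    fm.start_write("file-1", writer=a)      # b's cached copy is invalidated
    assert "file-1" not in b.cache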

The Gateway. This is an instance of the PanFS client that runs on a DirectorBlade and exports the file system via the NFS and CIFS protocols. The collection of DirectorBlades in the system provides a scalable NFS or CIFS cluster for environments that cannot, or prefer not to, run the proprietary DirectFLOW client on their compute nodes. The file system supports concurrent access to the same data via DirectFLOW, NFS, and CIFS.

The Realm Manager (RM) system manager. This module maintains a model of all the hardware devices in the system (StorageBlades and DirectorBlades) and the various software services running on them. When components fail, the RM triggers recovery actions. The RM is implemented as a 3-way or 5-way replicated service, and each instance maintains a database that reflects a model of the distributed system. The RM uses the PTP (i.e., Paxos) quorum-based voting protocol to make decisions and to update its data model in lock step across the replicated set of RM services.
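
The effect of quorum-based voting can be shown with a toy model (a simplification; real Paxos involves multiple proposal rounds and persistent state). With 3 or 5 replicas, a strict majority must accept a change before the model is updated, so a minority of failed replicas can neither block the system nor make it diverge.

    class Replica:
        """One of the 3 or 5 replicated Realm Manager instances (toy model)."""
        def __init__(self, healthy=True):
            self.healthy = healthy
            self.model = []               # the replica's copy of the system model

        def accept(self, change):
            return self.healthy           # a real replica votes via Paxos rounds

        def apply(self, change):
            self.model.append(change)

    def commit(change, replicas):
        """Commit a configuration change only if a strict majority accepts it."""
        votes = sum(1 for r in replicas if r.accept(change))
        if votes > len(replicas) // 2:
            for r in replicas:
                if r.healthy:
                    r.apply(change)       # update the model in lock step
            return True
        return False

    replicas = [Replica(), Replica(), Replica(healthy=False)]
    assert commit("mark-osd-7-failed", replicas)   # 2 of 3 is a majority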

The Performance Manager (PM). This module monitors the capacity and utilization of StorageBlades and can migrate data among StorageBlades to balance capacity. It also supports the complete drain of an OSD to facilitate replacement of failing hardware or retirement of old hardware.
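
A simplified illustration of capacity leveling, with hypothetical utilization numbers and a naive policy of repeatedly moving data from the fullest OSD to the emptiest one:

    def rebalance(osds, tolerance=0.05):
        """osds: {osd_id: fraction of capacity used}. Shift load from the
        fullest OSD to the emptiest one until they are within tolerance
        (the real Performance Manager migrates whole component objects)."""
        moves = []
        while True:
            fullest = max(osds, key=osds.get)
            emptiest = min(osds, key=osds.get)
            gap = osds[fullest] - osds[emptiest]
            if gap <= tolerance:
                break
            shift = gap / 2
            osds[fullest] -= shift
            osds[emptiest] += shift
            moves.append((fullest, emptiest, round(shift, 3)))
        return moves

    print(rebalance({"osd-1": 0.90, "osd-2": 0.40, "osd-3": 0.55}))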

The Management Console (GUI). This is closely associated with the RM component and provides a command-line interface and an HTML-based GUI used to manage the system.

The Hardware Agent. This is a local agent that monitors the hardware and raises alerts about failed components, over-temperature conditions, and power loss (the blade chassis includes an integrated UPS).

The Update Agent. This agent manages software upgrades. All blades run the same software revision, so the addition or replacement of hardware triggers an automatic update to the matching software.

The Panasas Process Monitor. This is the process "nanny" for all services running on the blades. It implements a state machine that determines which set of applications should be running based on the state of the blade; blade states include Booting, Unconfigured, Clock Sync, Offline, Online, Failed, Rebooting, and Shutting Down. If processes fail, they are automatically restarted by the process monitor. The RM can remotely query the state of a blade and control it by commanding it to change state.
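
A simplified model of such a state machine is sketched below. The state names come from the list above; the allowed transitions and the per-state service lists are illustrative assumptions.

    ALLOWED = {
        "Booting": {"Unconfigured", "Clock Sync", "Failed"},
        "Unconfigured": {"Clock Sync", "Failed"},
        "Clock Sync": {"Offline", "Failed"},
        "Offline": {"Online", "Rebooting", "Shutting Down", "Failed"},
        "Online": {"Offline", "Rebooting", "Shutting Down", "Failed"},
        "Failed": {"Rebooting", "Shutting Down"},
        "Rebooting": {"Booting"},
        "Shutting Down": set(),
    }

    SERVICES_FOR_STATE = {"Online": {"osd", "hardware-agent"}}   # illustrative

    class ProcessMonitor:
        def __init__(self):
            self.state = "Booting"
            self.running = set()

        def set_state(self, new_state):
            if new_state not in ALLOWED[self.state]:
                raise ValueError(f"illegal transition {self.state} -> {new_state}")
            self.state = new_state
            self.running = set(SERVICES_FOR_STATE.get(new_state, set()))

        def on_process_exit(self, name):
            # A service that should be running in this state is restarted.
            if name in SERVICES_FOR_STATE.get(self.state, set()):
                self.running.add(name)

    pm = ProcessMonitor()
    pm.set_state("Clock Sync"); pm.set_state("Offline"); pm.set_state("Online")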

The Configuration Agent. This is a blade-local agent that reacts to changes in the global configuration made through the Management Console or the Realm Manager. It maintains the local configuration files that affect the services running on each blade.

Scale

The smallest system is one blade chassis, which holds 11 blades in total. The system needs at least one DirectorBlade, although three DirectorBlades are necessary for a fully fault-tolerant system because of the quorum-based protocol used by the Realm Manager. The blades have an identical form factor, so a single chassis can hold 1 to 3 DirectorBlades and 8 to 10 StorageBlades.

Blade chassis are grouped into a performance and fault domain called a BladeSet. The system balances capacity across the OSDs in a BladeSet. A BladeSet can be grown by adding new hardware, and older hardware can be drained and removed from it. Typical installations have BladeSets of 5, 10, or 20 chassis (i.e., roughly 50 to 200 StorageBlades), although any size of BladeSet is allowed. The largest BladeSet spans more than 100 chassis and 2 petabytes of storage and is deployed at Los Alamos National Laboratory to support the Roadrunner supercomputer.

See also

Category:Distributed file systems