
User:Beedub60/Panasas ActiveScale File System

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Beedub60 (talk | contribs) at 06:35, 24 July 2010 (Initial page content). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The Panasas ActiveScale File System, commonly abbreviated PanFS, is a parallel, distributed, fault-tolerant, object-based file system. PanFS is a commercial (proprietary) storage system from Panasas designed for high-performance computing (HPC) workloads such as large-scale simulations, seismic data processing, computational fluid dynamics, Monte Carlo simulations, and computer-aided design. It is a general-purpose file system that implements the POSIX interface.

A detailed description of the system appears in the proceedings of the 2008 USENIX Conference on File and Storage Technologies (FAST) under the title "Performance of the Panasas Parallel File System".

Introduction

The PanFS file system is built on Object storage devices (OSDs) that store chunks of files along with their attributes. Files are striped across multiple objects on different OSDs, and PanFS clients write redundant data into those objects using RAID algorithms so that the system can tolerate the loss of one or more OSDs. Research systems such as Zebra and Petal also striped file data across data containers, and other commercial systems such as IBRIX and OneFS likewise stripe file data across containers. OneFS also stripes with redundant data, but its redundancy is generated internally, whereas PanFS clients generate their own redundant data, which eliminates a network hop from the data path.
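The client-side striping with redundancy described above can be sketched as follows. This is an illustrative RAID-5-style example in Python, not the PanFS implementation; the names (`stripe_file`, `STRIPE_UNIT`) and the tiny stripe-unit size are assumptions made for demonstration.

```python
STRIPE_UNIT = 4  # bytes per stripe unit (unrealistically small, for illustration)

def xor_blocks(blocks):
    """XOR equal-length byte blocks to produce a parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def stripe_file(data, n_data_osds):
    """Split data into stripe units, compute parity per stripe, and
    return one list of writes per OSD (n_data_osds data OSDs + 1 parity)."""
    pad = (-len(data)) % STRIPE_UNIT          # pad to whole stripe units
    data += b"\0" * pad
    units = [data[i:i + STRIPE_UNIT] for i in range(0, len(data), STRIPE_UNIT)]
    writes = [[] for _ in range(n_data_osds + 1)]
    for s in range(0, len(units), n_data_osds):
        stripe = units[s:s + n_data_osds]
        stripe += [b"\0" * STRIPE_UNIT] * (n_data_osds - len(stripe))
        for osd, unit in enumerate(stripe):
            writes[osd].append(unit)
        # the client itself computes and writes the parity unit
        writes[n_data_osds].append(xor_blocks(stripe))
    return writes

writes = stripe_file(b"hello parallel file!", 3)
```

Because the client computes the parity, a lost unit on any single OSD can be regenerated by XORing the surviving units of its stripe, with no extra network hop through a RAID controller.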

The Panasas ActiveScale File System is deployed as a combined hardware/software solution on a blade-based hardware platform. StorageBlades implement the OSD interface and contain two high-capacity SATA disk drives, a processor, memory, and dual 1 GbE network interfaces. Up to 11 blades fit into a 4U rack chassis that includes two power supplies and a battery, providing an integrated UPS and a non-volatile memory environment for efficient write caching. DirectorBlades run additional services that implement the distributed PanFS file system and the cluster management services that make a large collection of StorageBlades and DirectorBlades (hundreds to thousands of blades and petabytes of storage) appear as a single storage system.

Components

The main software components of the PanFS implementation include:

The OSDfs object storage system. This is a log-structured file system that manages block devices (e.g., the SATA drives in a StorageBlade) and provides the Object storage device interface, which offers a more flexible byte-oriented data container and extensible attributes. OSDfs performs efficient block allocation and data prefetching, implements a non-overwrite (i.e., log-structured) layout, and supports snapshots.
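The byte-oriented object container with extensible attributes can be sketched as a minimal in-memory model. The class and method names below are illustrative and do not correspond to the actual OSD command set; the assumption that a write beyond the end of an object extends it with zeros is the usual sparse-object convention.

```python
class Object:
    """One object: a byte-addressable container plus key/value attributes."""
    def __init__(self):
        self.data = bytearray()
        self.attrs = {}          # extensible attributes

class OSD:
    """A toy object storage device holding many objects by object ID."""
    def __init__(self):
        self.objects = {}

    def create(self, oid):
        self.objects[oid] = Object()

    def write(self, oid, offset, buf):
        obj = self.objects[oid]
        end = offset + len(buf)
        if end > len(obj.data):
            obj.data.extend(b"\0" * (end - len(obj.data)))  # sparse extend
        obj.data[offset:end] = buf

    def read(self, oid, offset, length):
        return bytes(self.objects[oid].data[offset:offset + length])

    def set_attr(self, oid, key, value):
        self.objects[oid].attrs[key] = value
```

The contrast with a block device is that the OSD, not the client, decides how object bytes map onto disk blocks, which is what lets OSDfs do its own allocation and prefetching.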

The Pan Manager file system metadata manager. This is divided into the File Manager (FM) and Storage Manager (SM) layers. The FM implements the POSIX file system semantics and a cache consistency protocol with PanFS clients. The SM implements data striping and recovery from OSD failures. Each file has its own RAID equation and is chunked and written to different OSDs based on a map determined by the SM. A file's map can change over time as the file grows or adapts to failures.
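A per-file layout map of the kind the SM maintains might look like the following sketch. The field and method names are hypothetical; the point is that RAID scheme and component placement are tracked per file, and the map can widen as the file grows.

```python
from dataclasses import dataclass, field

@dataclass
class FileMap:
    """Hypothetical per-file layout record: its own RAID equation plus
    the (OSD, object) pairs holding its components."""
    raid_scheme: str                 # e.g. "RAID5" for this one file
    stripe_unit: int                 # bytes per stripe unit
    components: list = field(default_factory=list)  # (osd_id, object_id)

    def grow(self, osd_id, object_id):
        """Widen the map as the file grows onto another OSD."""
        self.components.append((osd_id, object_id))

    def impacted_by(self, failed_osd):
        """True if any component of this file lives on the failed OSD."""
        return any(osd == failed_osd for osd, _ in self.components)
```

Because each file carries its own RAID equation, failure handling and reconstruction can be decided file by file rather than for a whole fixed RAID group.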

The FM/SM services use a high-speed, replicated transaction log to provide fault tolerance. A file system transaction (e.g., a file create, delete, or rename) is journaled, and the journal entries are reflected to a backup via a low-latency network protocol. If an FM/SM service fails, a backup resumes operation from the contents of the transaction log.
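The journal-and-failover idea can be sketched as follows, under simplifying assumptions: a primary appends each metadata operation to its log, mirrors the entry to a backup before applying it, and the backup replays the log on failover. This is illustrative Python, not the FM/SM wire protocol.

```python
class MetadataServer:
    """Toy metadata service: a journal plus the namespace it produces."""
    def __init__(self):
        self.log = []          # replicated journal entries
        self.namespace = {}    # file name -> attributes

    def apply(self, entry):
        op, name = entry
        if op == "create":
            self.namespace[name] = {}
        elif op == "delete":
            self.namespace.pop(name, None)

class Primary(MetadataServer):
    def __init__(self, backup):
        super().__init__()
        self.backup = backup

    def transact(self, entry):
        """Journal locally, reflect to the backup, then apply."""
        self.log.append(entry)
        self.backup.log.append(entry)
        self.apply(entry)

def failover(backup):
    """On primary failure, the backup replays its journal copy
    to reconstruct the primary's state, then takes over."""
    for entry in backup.log:
        backup.apply(entry)
    return backup
```

The low-latency reflection step is what keeps the window of lost transactions at zero: an operation is not acknowledged until its journal entry exists on the backup.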

The SM implements a parallel RAID rebuild based on declustered data placement. Each file has its own RAID protection, which means each file can be reconstructed independently. When an OSD fails, every file with a component stored on that OSD is affected. The SM can efficiently determine the set of impacted files and then uses a master/worker model to farm out reconstruction work. Each worker is given a batch of files to rebuild, which it does by reading the surviving components of each file and regenerating the lost component(s) from the parity information in the remaining components.
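The master's side of this scheme, scanning placement for impacted files and farming them out in batches, can be sketched as follows. The placement table and batching policy here are made-up illustrations.

```python
def impacted_files(placement, failed_osd):
    """placement maps file name -> set of OSD ids holding its components.
    Return the files touched by the failed OSD, i.e. those needing rebuild."""
    return sorted(f for f, osds in placement.items() if failed_osd in osds)

def make_batches(files, batch_size):
    """Master/worker step: split the rebuild work into batches so many
    workers can reconstruct different files in parallel."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

# Hypothetical declustered placement: each file's components are spread
# over a different subset of OSDs, so a failure touches only some files.
placement = {
    "a": {1, 2, 3},
    "b": {2, 4, 5},
    "c": {1, 4, 6},
    "d": {3, 5, 6},
}
todo = impacted_files(placement, 4)   # only files with a component on OSD 4
batches = make_batches(todo, 1)       # here, one file per worker
```

Declustering is what makes the rebuild parallel: the surviving components of the impacted files are spread over many OSDs, so many workers can read and reconstruct at once instead of bottlenecking on a single spare disk.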

The DirectFLOW client. This is the PanFS client, which communicates with the FM/SM using an RPC protocol and with the StorageBlades using the iSCSI/OSD protocol. The client may cache file data, directory information, and file attributes; these caches are kept consistent via a callback protocol with the FM. The PanFS client is responsible for writing data and parity to the OSDs so that the system can recover from failures, and it uses a start-write/end-write protocol to inform the FM/SM about changes to a file's content. The DirectFLOW client is implemented as a dynamically loaded Linux kernel module that implements the Virtual File System (VFS) kernel interface.
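The callback-based cache consistency idea can be sketched as follows: the FM records which clients cache a file and recalls those caches when another client declares a write. The class and method names are illustrative, not the actual RPC set, and real callback protocols carry more state (e.g., read vs. write callbacks) than this sketch.

```python
class FileManager:
    """Toy FM tracking, per path, the set of clients caching it."""
    def __init__(self):
        self.callbacks = {}    # path -> set of caching clients

    def register(self, client, path):
        """A client fetched the file and now holds a callback on it."""
        self.callbacks.setdefault(path, set()).add(client)

    def start_write(self, writer, path):
        """start-write: recall the callback from every other caching
        client so no one serves stale data while the write proceeds."""
        for client in self.callbacks.get(path, set()) - {writer}:
            client.invalidate(path)
        self.callbacks[path] = {writer}

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def invalidate(self, path):
        self.cache.pop(path, None)
```

The matching end-write message (omitted here) would tell the FM/SM that the file's new length, attributes, and parity are in place, letting other clients re-cache it.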

The Realm Manager (RM) system manager. This module maintains a model of all the hardware devices in the system (StorageBlades and DirectorBlades) and the various software services running on them. When components fail, the RM triggers recovery actions. The RM is implemented as a 3-way or 5-way replicated service, and each instance maintains a database reflecting a model of the distributed system. The RM uses the PTP (i.e., Paxos) quorum-based voting protocol to make decisions and to update its data model in lock step across the replicated set of RM services.
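The quorum rule behind this design can be sketched as follows: an update to the realm model commits only when a majority of the replicated RM instances can acknowledge it. This is a heavily simplified illustration of majority voting, not Paxos itself, which also handles competing proposers, message loss, and recovery.

```python
class Replica:
    """One RM instance with its copy of the realm model."""
    def __init__(self, up=True):
        self.up = up
        self.model = {}

def propose(replicas, key, value):
    """Commit the update only if a strict majority can acknowledge it;
    otherwise reject so a partitioned minority can never diverge."""
    acks = [r for r in replicas if r.up]
    if len(acks) * 2 <= len(replicas):
        return False                      # no quorum: update rejected
    for r in acks:
        r.model[key] = value              # commit in lock step
    return True

# A 3-way replicated RM tolerates one failed instance:
replicas = [Replica(), Replica(), Replica(up=False)]
ok = propose(replicas, "osd7", "failed")  # 2 of 3 is a quorum
```

This is why one DirectorBlade is enough to run but three are needed for full fault tolerance: with fewer than three RM instances there is no majority that survives a failure.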

The Performance Manager (PM). This module monitors the capacity and utilization of StorageBlades and can migrate data among them to level capacity. It also supports the complete drain of an OSD to facilitate replacement of failing hardware or retirement of old hardware.
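Capacity leveling can be sketched as repeatedly moving data from the fullest OSD to the emptiest until utilization is within a tolerance; draining is the same loop applied until the source is empty. The function below is an illustration of the policy only, with assumed names and units.

```python
def level(used, move_unit=1, tolerance=1):
    """used: osd_id -> capacity consumed (arbitrary units).
    Repeatedly migrate one unit from the fullest to the emptiest OSD
    until the spread is within the tolerance; returns the leveled map."""
    while True:
        hi = max(used, key=used.get)      # fullest OSD (migration source)
        lo = min(used, key=used.get)      # emptiest OSD (migration target)
        if used[hi] - used[lo] <= tolerance:
            return used
        used[hi] -= move_unit
        used[lo] += move_unit
```

A real balancer would of course move whole objects and rate-limit migrations against foreground traffic; only the greedy source/target selection is meant to carry over.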

The Management Console (GUI). This component is closely associated with the RM and provides a command-line and HTML-based GUI used to manage the system.

The Hardware Agent. This is a local agent that monitors the hardware and raises alerts about failed components, over-temperature conditions, and power loss (the blade chassis includes an integrated UPS).

The Update Agent. This agent manages software upgrades. All blades run the same software revision, so adding or replacing hardware triggers an automatic update to the matching software.

Scale

The smallest system is a single blade chassis, which holds 11 blades in total. The system needs at least one DirectorBlade, although three DirectorBlades are necessary for a fully fault-tolerant system because of the quorum-based protocol used by the Realm Manager. The blades share an identical form factor, so a single chassis can hold 1 to 3 DirectorBlades and 8 to 10 StorageBlades.

Blade chassis are grouped into a performance and fault domain called a BladeSet. The system balances capacity across the OSDs in a BladeSet. A BladeSet can be grown to add new hardware, and older hardware can be drained and removed from it. Typical installations have BladeSets of 5, 10, or 20 chassis (i.e., 50 to 200 StorageBlades), although a BladeSet of any size is allowed. The largest BladeSet spans over 100 chassis and 2 petabytes and is deployed at Los Alamos National Laboratory to support the Roadrunner supercomputer.