
Oracle ZFS

From Wikipedia, the free encyclopedia
Not to be confused with Z-FS, a SAN filesystem by Zetera Corporation, or zFS, the file system for IBM z/OS. For other uses, see ZFS (disambiguation).
ZFS
Developer(s): Sun Microsystems
Full name: ZFS
Introduced: November 2005, with OpenSolaris
Structures
Directory contents: Extensible hash table
Limits
Max volume size: 16 EiB
Max file size: 16 EiB
Max no. of files: 2^48
Max filename length: 255 bytes
Features
Forks: Yes (called extended attributes)
Attributes: POSIX
File system permissions: POSIX
Transparent compression: Yes
Transparent encryption: No
Other
Supported operating systems: Sun Solaris, Apple Mac OS X 10.5, FreeBSD, Linux via FUSE

In computing, ZFS is a file system originally created by Sun Microsystems for the Solaris Operating System. The features of ZFS include high storage capacity, integration of the concepts of filesystem and volume management, a novel on-disk structure, lightweight instances, and easy storage pool management. ZFS is implemented as open source software, licensed under the Common Development and Distribution License (CDDL).

History

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004.[1] Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005,[2] and released as part of build 27 of OpenSolaris on November 16, 2005. In June 2006, one year after the opening of the OpenSolaris community, Sun announced that ZFS had been included in the 6/06 update to Solaris 10.[3]

The name originally stood for "Zettabyte File System", but is now a pseudo-initialism.[4]

Storage pools

Unlike a traditional file system, which resides on a single device and thus requires a volume manager to use more than one device, ZFS is built on top of virtual storage pools called zpools. A pool is constructed from virtual devices (vdevs), each of which is either a raw device, a mirror (RAID 1) of one or more devices, or a RAID-Z group of two or more devices. The storage capacity of all vdevs is available to all of the file system instances in the zpool.

A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
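As a hedged sketch of these concepts (the pool name "tank", the filesystem names, and the device names are all hypothetical), a pool can be built from mirrored vdevs and a quota and reservation applied to one of its file systems:

    zpool create tank mirror c0t0d0 c0t1d0 mirror c1t0d0 c1t1d0   # pool of two mirrored vdevs
    zfs create tank/home                                          # file system instances draw from the shared pool
    zfs create tank/home/alice
    zfs set quota=10G tank/home/alice        # this file system may occupy at most 10 GB of the pool
    zfs set reservation=5G tank/home/alice   # and is guaranteed at least 5 GB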

Capacity

ZFS is a 128-bit file system, so it can store 18 billion billion (18.4 × 10^18) times more data than current 64-bit systems. The limitations of ZFS are designed to be so large that they will not be encountered in practice for some time. Some theoretical limits in ZFS are:

  • 2^48 — Number of snapshots in any file system (2 × 10^14)
  • 2^48 — Number of files in any individual file system (2 × 10^14)
  • 16 EiB (2^64 bytes) — Maximum size of a file system
  • 16 EiB — Maximum size of a single file
  • 16 EiB — Maximum size of any attribute
  • 256 ZiB (2^78 bytes) — Maximum size of any zpool
  • 2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
  • 2^56 — Number of files in a directory (actually constrained to 2^48 for the number of files in a ZFS file system)
  • 2^64 — Number of devices in any zpool
  • 2^64 — Number of zpools in a system
  • 2^64 — Number of file systems in a zpool

Although it has been asserted that "If 1,000 files were created every second, it would take about 9,000 years to reach the limit of the number of files", this falls far short of the reality: if a billion computers each filled a billion individual file systems per second, the time required to reach the limit of the overall system would be almost 1,000 times the estimated age of the universe (see the arithmetic sketch below).
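One way to check this figure, assuming "the overall system" refers to the 2^128 file systems addressable by a single host (2^64 zpools, each holding up to 2^64 file systems), is the following back-of-the-envelope calculation:

    \begin{aligned}
    N &= 2^{64} \times 2^{64} = 2^{128} \approx 3.4 \times 10^{38} \text{ file systems} \\
    r &= 10^{9} \text{ computers} \times 10^{9} \text{ file systems/s} = 10^{18} \text{ file systems/s} \\
    t &= N / r \approx 3.4 \times 10^{20} \text{ s} \approx 1.1 \times 10^{13} \text{ yr} \approx 800 \times \left(1.37 \times 10^{10} \text{ yr}\right)
    \end{aligned}

That is, roughly 800 times the commonly cited 13.7-billion-year age of the universe, consistent with the "almost 1,000 times" figure above.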

Project leader Bonwick said, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."[1] Later he clarified:

Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogram of matter confined to 1 liter of space can perform at most 10^51 operations per second on at most 10^31 bits of information [see Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)]. A fully populated 128-bit storage pool would contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.

To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E=mc², the rest energy of 136 billion kg is 1.2×10^28 J. The mass of the oceans is about 1.4×10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4×10^6 J/kg × 1.4×10^21 kg = 3.4×10^27 J. Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.[5]

Copy-on-write transactional model

ZFS uses a copy-on-write, transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.

Snapshots and clones

An advantage of copy-on-write is that when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are created very quickly, since all the data composing the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
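For illustration (the pool, filesystem, and snapshot names are hypothetical), a snapshot and a writable clone can each be created with a single command:

    zfs snapshot tank/home/alice@monday               # instant, space-efficient snapshot
    zfs clone tank/home/alice@monday tank/alice-test  # writable clone sharing unchanged blocks with the snapshot
    zfs list -t snapshot                              # list existing snapshots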

Dynamic striping

ZFS dynamically stripes data across all devices to maximize throughput. As additional devices are added to the zpool, the stripe width automatically expands to include them; thus all disks in a pool are used, which balances the write load across them.
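A hedged sketch (pool and device names hypothetical): adding another vdev widens the stripe used for newly written data:

    zpool add tank mirror c2t0d0 c2t1d0   # new mirror vdev; future writes are striped across all vdevs
    zpool status tank                     # shows the pool's vdev layout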

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used, as certain workloads do not perform well with large blocks. Automatic tuning to match workload characteristics is contemplated[citation needed].

If compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
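For example (dataset name hypothetical), compression is enabled per filesystem and its effectiveness can be inspected afterwards:

    zfs set compression=on tank/data   # newly written blocks are compressed
    zfs get compressratio tank/data    # report the achieved compression ratio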

Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or resize a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.
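A minimal illustration (names hypothetical): creating and removing a filesystem each take a single short command, and the new filesystem is mounted automatically:

    zfs create tank/home/bob    # new filesystem, mounted at /tank/home/bob by default
    zfs destroy tank/home/bob   # remove it again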

Additional capabilities

  • Explicit I/O priority with deadline scheduling.
  • Claimed globally optimal I/O sorting and aggregation.
  • Multiple independent prefetch streams with automatic length and stride detection.
  • Parallel, constant-time directory operations.
  • End-to-end checksumming, allowing detection of data corruption and, if the pool has redundancy, recovery from it (see the example after this list).
  • Intelligent scrubbing and resilvering.[6]
  • Load and space usage sharing between disks in the pool.[7]
  • Ditto blocks: Metadata is replicated inside the pool, two or three times (according to metadata importance).[8] If the pool has several devices, ZFS tries to replicate across different devices. A pool without redundancy can therefore lose data to bad sectors, but its metadata should remain fairly safe even in this scenario.
  • ZFS design (copy-on-write + uberblocks) is safe when using disks with write cache enabled, if they support the cache flush commands issued by ZFS. This feature provides safety and a performance boost compared with some other filesystems.
  • When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know whether other slices are managed by filesystems that are not safe with the write cache enabled, such as UFS.
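A hedged sketch of how these checksums are exercised in practice (pool name hypothetical): a scrub walks every allocated block, verifies it against its checksum, and repairs it from a redundant copy where possible, while zpool status reports the per-device error counters:

    zpool scrub tank       # verify every block against its checksum, repairing where redundancy allows
    zpool status -v tank   # show read/write/checksum error counters and any files with unrecoverable errors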

Cache Management

ZFS also uses a new cache-management algorithm, the ARC (a variant of the Adaptive Replacement Cache), instead of the traditional Solaris virtual memory page cache.

Limitations

ZFS lacks transparent encryption, although there is an OpenSolaris project underway.[9]

ZFS does not support per-user or per-group quotas. Instead, it is possible to create user-owned filesystems, each with its own size limit. There is intrinsically no practical quota solution for file systems shared among several users (team projects, for example), where the data cannot be separated per user, although such a mechanism could be implemented on top of the ZFS stack.

Capacity expansion is normally achieved by adding groups of disks as a top-level vdev (stripe, RAID-Z, RAID-Z2, or mirror). Newly written data will dynamically start to use all available vdevs. It is also possible to expand the pool by iteratively swapping each drive in the array for a bigger drive and waiting for ZFS to heal itself; the heal time depends on the amount of stored information, not on the disk size. If a snapshot is taken during this process, the heal is restarted.
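A sketch of the drive-swapping approach (device names hypothetical); each replacement must finish resilvering before the next disk is swapped:

    zpool replace tank c1t0d0 c2t0d0   # replace one disk with a larger one
    zpool status tank                  # wait until the resilver completes, then repeat for the next disk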

It is currently not possible to reduce the number of vdevs in a pool or to otherwise reduce pool capacity, although this is being worked on by the ZFS team.[1]

It is not possible to add a disk to a RAID-Z or RAID-Z2 vdev. This feature appears very difficult to implement.

Reconfiguring storage requires copying data offline, destroying the pool, and recreating the pool with the new policy.
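A hedged sketch of such a reconfiguration using snapshots and zfs send/receive (all names, the backup location, and the new layout are hypothetical):

    zfs snapshot tank/data@migrate                         # freeze a consistent image
    zfs send tank/data@migrate > /backup/data.zfs          # copy it off the pool
    zpool destroy tank                                     # destroy the old pool
    zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0    # recreate it with the new policy
    zfs receive tank/data < /backup/data.zfs               # restore the data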

ZFS is a local file system, not a cluster, distributed, or parallel file system, and it cannot provide concurrent access from multiple hosts. (However, this does not exclude using it as a back-end store for software that provides such services.)

Solaris implementation issues

The current ZFS implementation (Solaris 10 11/06) has some issues administrators should be aware of before deploying it. These issues are not inherent to ZFS and may be resolved in future releases:

  • While both ZFS root and ZFS boot support are readily available, ZFS root filesystem support is currently disabled in default Solaris 10 installations, since the standard installer does not yet fully support ZFS roots. Creating bootable ZFS systems requires scripts or manual steps. The ZFS Boot project recently added boot support to OpenSolaris, and it is available in recent builds of Solaris Nevada.[10][11] ZFS boot is currently (as of 2007-02-08) planned for a Solaris 10 update in late 2007.
  • If a Solaris Zone is put on ZFS, the system cannot be upgraded; the OS will need to be reinstalled. This issue is planned to be addressed in a Solaris 10 update in 2007 [citation needed].
  • A file "fsync" will commit to disk all pending modifications on the filesystem. That is, an "fsync" on a file will flush out all deferred (cached) operations to the filesystem (not the pool) in which the file is located. This can make some fsync() slow when running alongside a workload which writes a lot of data to filesystem cache.[12]. The issue is currently fixed in Solaris Nevada.
  • New vdevs can be added to a storage pool, but they cannot be removed. A vdev can be exchanged for a bigger one, but it cannot be removed in order to reduce total pool capacity, even if the pool has enough unused space. The ability to shrink a zpool is a work in progress, currently targeted for a Solaris 10 update in late 2007.
  • ZFS encourages the creation of many filesystems inside a pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation that can take minutes.
  • ZFS filesystem compression and decompression are single-threaded; only one CPU per zpool is used. This issue has been fixed in Solaris Nevada (bug ID 6460622) and will appear in Solaris 10 update 4.
  • ZFS uses a lot of CPU when doing small writes (for example, of a single byte). There are two root causes, both currently being worked on: (a) translating from znode to dnode is slower than necessary because ZFS does not use translation information it already has, and (b) the current partial-block update code is very inefficient.[13]
  • ZFS copy-on-write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.
  • The ZFS block size (record size) is configurable per filesystem and currently defaults to 128 KB. Reads and writes smaller than the block size suffer a performance penalty. If a workload reads and writes data in fixed-size blocks, as a database typically does, the ZFS record size should be configured manually to match the application block size, for better performance and to conserve cache memory and disk bandwidth (see the sketch after this list).
  • ZFS only takes a faulty hard disk offline if it cannot be opened. Read/write errors and slow or timed-out operations do not currently cause a disk to be marked as faulty.
  • When listing ZFS space usage, the "used" column shows only non-shared usage, so the amount of space occupied by shared data (for example, data shared between snapshots) is not visible. It is therefore not obvious, for instance, which snapshot's deletion would free the most space.
  • There is work in progress to provide automatic and periodic disk scrubbing, in order to provide corruption detection and early detection of disk degradation. Currently, data scrubbing must be started manually with the "zpool scrub" command (a cron-based workaround is sketched after this list).
  • The current ZFS compression/decompression code is very fast, but its compression ratio is not comparable to that of gzip or similar algorithms. There is a project to add new compression modules to ZFS.[14][15][16]
  • If a snapshot is taken or destroyed while the zpool is scrubbing or resilvering, the process restarts from the beginning.[17]
  • Not all symbolic links are protected by ditto blocks.[18][19]
  • Swapping over ZVOL pseudo-devices can hang the system.[20][21]
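Two hedged sketches for the items above (the dataset name, pool name, 8 KB record size, and cron schedule are all hypothetical): matching a database filesystem's record size to the database's page size, and scheduling a manual scrub until automatic scrubbing is available:

    zfs set recordsize=8K tank/db   # match the database's page size; applies to newly written files
    zfs get recordsize tank/db      # confirm the setting

    # hypothetical root crontab entry: scrub the pool every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub tank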

Platforms

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

Nexenta OS, a complete GNU-based open source operating system built on top of the OpenSolaris kernel and runtime, includes a ZFS implementation, added in version alpha1.

Apple Inc. is porting ZFS to its Mac OS X operating system, according to a post by a Sun employee on the opensolaris.org zfs-discuss mailing list and to previewed screenshots of the next version of Mac OS X.[22] As of Mac OS X 10.5 (Developer Seed 9A321), support for ZFS is included, but it cannot act as a root partition, as noted above, and attempts to format local drives using ZFS fail; this is a known bug.[23] On June 6, 2007, Sun's CEO Jonathan I. Schwartz announced that Apple would make ZFS "the" filesystem in Mac OS X 10.5 Leopard.[24] Marc Hamilton, VP for Solaris Marketing, later wrote to clarify that, in his opinion, Apple is planning to use ZFS in future versions of Mac OS X, but not necessarily as the default filesystem for Mac OS X 10.5 Leopard.[25] Apple has announced that ZFS will be supported in read-only mode from the command line,[26] but it also released a write-capable implementation on the Apple Developer Connection site to anyone with an account (including free accounts); this implementation was later removed from the developer site.

Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, prohibits linking the kernel with code under certain licenses, such as the CDDL under which ZFS is released.[27] One solution to this problem is to port ZFS to Linux's FUSE system so that the filesystem runs in userspace instead. A project to do this was sponsored by Google's Summer of Code program in 2006 and is in beta stage as of May 2007.[28] However, running a file system outside the kernel on traditional Unix-like systems can have a significant performance impact. Sun Microsystems has stated that a Linux port is being investigated.[29]

There are no plans to port ZFS to HP-UX or AIX.[29]

Pawel Jakub Dawidek has ported and committed ZFS to FreeBSD for inclusion in FreeBSD 7.0, due to be released in 2007.[30]

Adaptive Endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness doesn't match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
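As a hedged illustration of moving a pool between architectures (pool name hypothetical), the pool is exported on one host and imported on another; if the byte orders differ, metadata is byte-swapped transparently when read:

    zpool export tank   # on the original host (for example, big-endian SPARC)
    # physically move the disks to the new host (for example, little-endian x86)
    zpool import tank   # the pool and its file systems become available unchanged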

References

  1. ^ a b "ZFS: the last word in file systems". Sun Microsystems. September 14 2004. Retrieved 2006-04-30. {{cite web}}: Check date values in: |date= (help)
  2. ^ Jeff Bonwick (October 31, 2005). "ZFS: The Last Word in Filesystems". Jeff Bonwick's Blog. Retrieved 2006-04-30.
  3. ^ "Sun Celebrates Successful One-Year Anniversary of OpenSolaris". Sun Microsystems. June 20 2006. {{cite web}}: Check date values in: |date= (help)
  4. ^ Jeff Bonwick (2006-05-04). "You say zeta, I say zetta". Jeff Bonwick's Blog. Retrieved 2006-09-08.
  5. ^ Jeff Bonwick (September 25, 2004). "128-bit storage: are you high?". Sun Microsystems. Retrieved 2006-07-12.
  6. ^ "Smokin' Mirrors". Jeff Bonwick's Weblog. 2006-05-02. Retrieved 2007-02-23.
  7. ^ "ZFS Block Allocation". Jeff Bonwick's Weblog. 2006-11-04. Retrieved 2007-02-23.
  8. ^ "Ditto Blocks - The Amazing Tape Repellent". Flippin' off bits Weblog. 2006-05-12. Retrieved 2007-03-01.
  9. ^ "OpenSolaris Project: ZFS on disk encryption support". OpenSolaris Project. Retrieved 2006-12-13.
  10. ^ "Latest ZFS add-ons". milek's blog. 2007-03-28. Retrieved 2007-03-29.
  11. ^ "ZFS Bootable datasets - happily rumbling". Tim Foster's blog. 2007-03-29. Retrieved 2007-04-01.
  12. ^ "The Dynamics of ZFS". Roch Bourbonnais' Weblog. 2006-06-21. Retrieved 2007-02-19.
  13. ^ "Implementing fbarrier() on ZFS". zfs-discuss. 2007-02-13. Retrieved 2007-02-13.
  14. ^ "gzip for ZFS update". Adam Leventhal's Weblog. 2007-01-31. Retrieved 2007-03-09.
  15. ^ "gzip compression support". zfs-discuss. 2007-03-23. Retrieved 2007-04-01.
  16. ^ "Gzip compression for ZFS". zfs-discuss. 2007-03-29. Retrieved 2007-04-01.
  17. ^ "scrub/resilver has to start over when a snapshot is taken". OpenSolaris Bug Tracker. 2005-10-30. Retrieved 2007-03-14.
  18. ^ "symlinks and ditto blocks". OpenSolaris Forums. 2007-03-28. Retrieved 2007-07-13.
  19. ^ "zpl symlinks should have their own object type". OpenSolaris Bug Tracker. 2007-01-23. Retrieved 2007-07-13.
  20. ^ "Plans for swapping to part of a pool". OpenSolaris Forums. 2007-07-12. Retrieved 2007-07-14.
  21. ^ "system hang while zvol swap space shorted". OpenSolaris Bug Tracker. 2007-02-26. Retrieved 2007-07-14.
  22. ^ "Porting ZFS to OSX". zfs-discuss. April 27 2006. Retrieved 2006-04-30. {{cite web}}: Check date values in: |date= (help)
  23. ^ "Mac OS X 10.5 9A326 Seeded". InsanelyMac Forums. December 14 2006. Retrieved 2006-12-14. {{cite web}}: Check date values in: |date= (help)
  24. ^ "Sun announce ZFS is "the file system" in Mac OS X v10.5". Sun. June 6 2007. Retrieved 2007-06-06. {{cite web}}: Check date values in: |date= (help)
  25. ^ "Marc Hamilton's weblog: Apple is planning to use the ZFS file system from OpenSolaris in future versions of their OS". Marc Hamilton's weblog. June 7 2007. Retrieved 2007-06-07. {{cite web}}: Check date values in: |date= (help)
  26. ^ "Apple: Leopard offers limited ZFS read-only". MacNN. June 12 2007. Retrieved 2007-06-23. {{cite web}}: Check date values in: |date= (help)
  27. ^ Jeremy Andrews (April 19 2007). "Linux: ZFS, Licenses and Patents". Retrieved 2007-04-21. {{cite web}}: Check date values in: |date= (help)
  28. ^ Ricardo Correia (May 26 2006). "ZFS on FUSE/Linux". Retrieved 2006-07-15. {{cite web}}: Check date values in: |date= (help)
  29. ^ a b "Fast Track to Solaris 10 Adoption: ZFS Technology". Solaris 10 Technical Knowledge Base. Sun Microsystems. Retrieved 2006-04-24.
  30. ^ Dawidek, Pawel (April 6, 2007). "ZFS committed to the FreeBSD base". Retrieved 2007-04-06.
