Shard (database architecture)

A database shard is a horizontal partition in a database or search engine. Each individual partition is referred to as a shard or database shard.

Database architecture

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (which is what Normalization and Vertical Partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.^[1]

Sharding is in practice far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons) a shard approach may also be useful.

Shards compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g. the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required both instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated into them en masse.

This is also why sharding is related to a shared nothing architecture - once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning can't). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.

Support for shards

dbShards

CodeFutures dbShards is a product dedicated to database shards.^[2]

Hibernate ORM

Hibernate Shards provides support for shards.^[3]^[4]

MongoDB

MongoDB supports sharding from version 1.6

Plugin for Grails

Grails supports sharding using the Grails Sharding Plugin. ^[5]

Redis

Redis is a datastore with support for client-side sharding.

Ruby ActiveRecord

Octopus works as a database sharding and replication extension for the ActiveRecord ORM.

Solr Search Server

Solr enterprise search server provides sharding capabilities.^[6]

SQLAlchemy ORM

SQLAlchemy is an object-relational mapper for the Python programming language that provides sharding capabilities.^[7]

SQL Azure (pending)

Microsoft have announced that SQL Azure will get sharding using "federations". This is scheduled for release in late 2011.^[8]

References

^ Rahul Roy (July 28, 2008). "Shard - A Database Design". {{cite journal}}: Cite journal requires |journal= (help) (http://technoroy.blogspot.com/2008/07/shard-database-design.html)
^ "dbShards product overview".
^ "Hibernate Shards". 2/ 8/2007. {{cite web}}: Check date values in: |date= (help)
^ "Hibernate Shards".
^ "Grails Sharding Plugin".
^ "Distributed Search".
^ "Basic example of using the SQLAlchemy Sharding API".
^ "Federations in SQL Azure: Database Solutions with Unlimited Scalability".

[Rahul_Roy,_Shard-1] Rahul Roy (July 28, 2008). "Shard - A Database Design". {{cite journal}}: Cite journal requires |journal= (help) (http://technoroy.blogspot.com/2008/07/shard-database-design.html)

[dbShards-2] "dbShards product overview".

[Hibernate_Shards-3] "Hibernate Shards". 2/ 8/2007. {{cite web}}: Check date values in: |date= (help)

[Hibernate_Shards_documentation-4] "Hibernate Shards".

[Grails-5] "Grails Sharding Plugin".

[SorlShard-6] "Distributed Search".

[SQLAlchemy-7] "Basic example of using the SQLAlchemy Sharding API".

[SQL_Azure_Federations-8] "Federations in SQL Azure: Database Solutions with Unlimited Scalability".

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

Database architecture

Shards compared to horizontal partitioning

Support for shards

See also

References