Shard (database architecture)

A shard is a method of horizontal partitioning in a database or search engine.

Shard database architecture

Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard. ^[1]

Sharding is in practice far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.

Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons) a shard approach may also be useful. This is of particular relevance to distributed database platforms, such as Mnesia.

Shards compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify which table a particular row will be found in, without needing to search the index, i.e. the classis example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in just the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost if querying the database required both instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, whilst smaller tables are replicated into them en masse.

This is also why sharding is related to a shared nothing architecture - once shared, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning can't). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

Obviously there is also a need for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options inbetween.

Support for shards

In Hibernate

Hibernate Shards is an extension to Hibernate Core that provides support for shards.^[2] Hibernate Shards makes sharding available to developers that are already using Hibernate without major refactoring.^[3]

In SQLAlchemy

SQLAlchemy SQLAlchemy is an object-relational mapper for the Python programming language that provides sharding capabilities.^[4]

dbShards

CodeFutures Software dbShards is a product dedicated to database shards.^[5]

In Solr

The enterprise search server Solr provides sharding capabilities.^[6]

In DB2

The IBM DB2 database has several methods of sharding (though the terminology is different). Database Partitioning automatically distributes data across multiple servers, Table Partitioning breaks data from a table into multiple partitions on the same server, Union-All Views is a primitive version of Table Partitioning that uses actual tables for each partition, Multi-Dimensional Clustering (MDC) keeps all data in a single table but achieves many of the benefits of Table Partitioning via block indexing and storage.

References

^ Rahul Roy (July 28, 2008). "Shard - A Database Design". {{cite journal}}: Cite journal requires |journal= (help) (http://technoroy.blogspot.com/2008/07/shard-database-design.html)
^ "Hibernate Shards". 2/ 8/2007. {{cite web}}: Check date values in: |date= (help)
^ "Hibernate Shards".
^ "Basic example of using the SQLAlchemy Sharding API".
^ "dbShards product overview".
^ "Distributed Search".

This database-related article is a stub. You can help Wikipedia by expanding it.

[Rahul_Roy,_Shard-1] Rahul Roy (July 28, 2008). "Shard - A Database Design". {{cite journal}}: Cite journal requires |journal= (help) (http://technoroy.blogspot.com/2008/07/shard-database-design.html)

[Hibernate_Shards-2] "Hibernate Shards". 2/ 8/2007. {{cite web}}: Check date values in: |date= (help)

[Hibernate_Shards_documentation-3] "Hibernate Shards".

[SQLAlchemy-4] "Basic example of using the SQLAlchemy Sharding API".

[dbShards-5] "dbShards product overview".

[SorlShard-6] "Distributed Search".

[1]

[2]

[3]

[4]

[5]

[6]