Distributed database

A distributed database is a database in which data is stored across different physical locations.^[1] The data may be stored on multiple computers in the same physical location (e.g. a data centre), or it may be dispersed over a network of interconnected computers. Unlike parallel systems, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components.

System administrators can distribute collections of data (e.g. in a database) across multiple physical locations. A distributed database can reside on organised network servers or decentralised independent computers on the Internet, on corporate intranets or extranets, or on other organisation networks. Because distributed databases store data across multiple computers, they may improve performance at end-user worksites by allowing transactions to be processed on many machines instead of being limited to one.^[2]

Two processes ensure that distributed databases remain current: replication and duplication.

Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Depending on the size and number of the distributed databases, replication can be complex and can require considerable time and computer resources.
Duplication, by contrast, is less complex. It identifies one database as a master and then copies that database to each distributed location, normally at a set time after hours, so that every location has the same data. In the duplication process, users may change only the master database, which ensures that local data will not be overwritten.

Both replication and duplication can keep the data current in all distributed locations.^[2]
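The difference between the two processes can be illustrated with a minimal Python sketch. It is an illustration only, not how any particular product works: each site is modelled as an in-memory dictionary, and the per-key version numbers, the `Site` alias and the function names `replicate` and `duplicate` are assumptions introduced here.

```python
# Hypothetical model: each site maps a key to a (version, value) pair.
Site = dict[str, tuple[int, str]]


def replicate(sites: list[Site]) -> None:
    """Replication: look for the newest change to each key on any site,
    then apply it everywhere so that all sites look the same."""
    newest: Site = {}
    for site in sites:
        for key, (version, value) in site.items():
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    for site in sites:
        site.update(newest)


def duplicate(master: Site, sites: list[Site]) -> None:
    """Duplication: overwrite every site with a copy of the master.
    Changes made locally at a site are discarded."""
    for site in sites:
        site.clear()
        site.update(master)


# Replication merges changes made at different sites.
a = {"x": (1, "old"), "y": (2, "kept")}
b = {"x": (3, "new")}
replicate([a, b])

# Duplication only ever copies the master outwards.
master = {"k": (1, "from master")}
branch = {"k": (9, "local edit")}
duplicate(master, [branch])
```

In this sketch replication has to inspect every site to find changes, which is why its cost grows with the size and number of databases, whereas duplication is a plain copy from a single source.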

Besides replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How these technologies are implemented depends on the needs of the business, the sensitivity and confidentiality of the data stored in the database, and the price the business is willing to pay to ensure data security, consistency and integrity.
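As a rough illustration of the synchronous and asynchronous approaches, the following Python sketch contrasts a write that is applied to every replica before it returns with a write that is queued and applied later. The class `ReplicatedStore` and its method names are hypothetical, and a real system would send these updates over a network rather than to in-memory dictionaries.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary with in-memory replicas (hypothetical names)."""

    def __init__(self, replica_count: int) -> None:
        self.primary: dict[str, str] = {}
        self.replicas: list[dict[str, str]] = [{} for _ in range(replica_count)]
        self.pending: deque[tuple[str, str]] = deque()

    def write_sync(self, key: str, value: str) -> None:
        # Synchronous: every replica is updated before the write returns,
        # so readers of any replica see the new value immediately.
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def write_async(self, key: str, value: str) -> None:
        # Asynchronous: only the primary is updated now; the change is
        # queued, so replicas are temporarily stale.
        self.primary[key] = value
        self.pending.append((key, value))

    def flush(self) -> None:
        # Apply queued changes to the replicas in the order they were made.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica[key] = value


store = ReplicatedStore(replica_count=2)
store.write_sync("a", "1")   # replicas already hold "a"
store.write_async("b", "2")  # replicas do not hold "b" yet
store.flush()                # now they do
```

The trade-off shown here is the usual one: the synchronous write pays for consistency with extra work on every write, while the asynchronous write returns sooner but leaves a window in which replicas disagree with the primary.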

When discussing access to distributed databases, Microsoft favors the term distributed query, which it defines in a protocol-specific manner as "[a]ny SELECT, INSERT, UPDATE, or DELETE statement that references tables and rowsets from one or more external OLE DB data sources".^[3] Oracle provides a more language-centric view in which distributed queries and distributed transactions form part of distributed SQL.^[4]
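The general idea of one statement referencing tables in more than one database can be sketched with Python's standard `sqlite3` module and SQLite's `ATTACH DATABASE` statement. This is only an analogy: both databases here are in-memory and live in the same process, no OLE DB provider or Oracle database link is involved, and the schema names, table names and sample rows are invented for the example.

```python
import sqlite3


def cross_database_totals() -> list[tuple[str, float]]:
    """Join a table in the main database with a table in an attached one."""
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second, separate in-memory database under the name "branch".
        conn.execute("ATTACH DATABASE ':memory:' AS branch")
        conn.execute(
            "CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)"
        )
        conn.execute(
            "CREATE TABLE branch.orders (customer_id INTEGER, total REAL)"
        )
        conn.executemany(
            "INSERT INTO main.customers VALUES (?, ?)",
            [(1, "Ada"), (2, "Grace")],
        )
        conn.executemany(
            "INSERT INTO branch.orders VALUES (?, ?)",
            [(1, 10.0), (1, 5.5), (2, 7.0)],
        )
        # A single SELECT that references tables from both databases.
        return conn.execute(
            """
            SELECT c.name, SUM(o.total)
            FROM main.customers AS c
            JOIN branch.orders AS o ON o.customer_id = c.id
            GROUP BY c.name
            ORDER BY c.name
            """
        ).fetchall()
    finally:
        conn.close()


print(cross_database_totals())
```

In a genuinely distributed setting the second database would sit on another machine, and the join would involve shipping rows across the network.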

Advantages

Management of distributed data with different levels of transparency like network transparency, fragmentation transparency, replication transparency, etc.
Increased reliability and availability
Easier expansion
Reflects organizational structure — database fragments potentially stored within the departments they relate to
Local autonomy or site autonomy — a department can control its own data (as its members are the ones familiar with it)
Protection of valuable data — if there were ever a catastrophic event such as a fire, all of the data would not be in one place, but distributed in multiple locations
Improved performance — data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing the load on the databases to be balanced among servers (a high load on one module of a distributed database does not affect its other modules)
Economics — it may cost less to create a network of smaller computers with the power of a single large computer
Modularity — systems can be modified, added and removed from the distributed database without affecting other modules (systems)
Reliable transactions - due to replication of the database
Hardware, operating system, network, fragmentation, DBMS, replication and location independence
Continuous operation, even if some nodes go offline (depending on design)
Distributed query processing can improve performance
Single-site failure does not affect the performance of the system.
For those systems that support full distributed transactions, operations enjoy the ACID properties:
- Atomicity: the transaction takes place as a whole or not at all
- Consistency: the transaction maps one consistent DB state to another
- Isolation: each transaction sees a consistent DB
- Durability: the results of a transaction must survive system failures
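Atomicity across several sites is commonly obtained with a commit protocol such as two-phase commit. The source does not name a protocol, so the following Python sketch is an assumption chosen for illustration: the coordinator function `run_transaction`, the `Participant` class and the `can_commit` flag are all hypothetical, and the durable logging, timeouts and coordinator-failure handling that a real implementation needs are omitted.

```python
from typing import Optional


class Participant:
    """One site taking part in a distributed transaction (hypothetical)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit  # simulates a site that must refuse
        self.committed: dict = {}
        self.staged: Optional[dict] = None

    def prepare(self, changes: dict) -> bool:
        # Phase 1: stage the changes and vote yes, or vote no.
        if not self.can_commit:
            return False
        self.staged = dict(changes)
        return True

    def commit(self) -> None:
        # Phase 2 (success): make the staged changes visible.
        if self.staged is not None:
            self.committed.update(self.staged)
            self.staged = None

    def rollback(self) -> None:
        # Phase 2 (failure): discard the staged changes.
        self.staged = None


def run_transaction(participants: list, changes: dict) -> bool:
    """Apply changes on all participants or on none of them."""
    prepared = []
    for participant in participants:
        if participant.prepare(changes):
            prepared.append(participant)
        else:
            # One "no" vote aborts the transaction everywhere.
            for done in prepared:
                done.rollback()
            return False
    for participant in participants:
        participant.commit()
    return True


site_a, site_b = Participant("a"), Participant("b")
ok = run_transaction([site_a, site_b], {"balance": 100})

failing = Participant("c", can_commit=False)
aborted = run_transaction([site_a, failing], {"balance": 0})
```

After the second call, `site_a` still holds the result of the first transaction only, which is the "as a whole or not at all" behaviour described above.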

Merge replication is a commonly used method for consolidating data between databases.^[5]

Disadvantages

Complexity — DBAs may have to do extra work to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one. Extra database design work must also be done to account for the disconnected nature of the database — for example, joins become prohibitively expensive when performed across multiple systems.
Economics — increased complexity and a more extensive infrastructure mean extra labor costs
Security — remote database fragments must be secured, and because they are not centralized, the remote sites must be secured as well. The infrastructure must also be secured (for example, by encrypting the network links between remote sites).
Difficult to maintain integrity — in a distributed database, enforcing integrity over a network may require too much of the network's resources to be feasible
Inexperience — distributed databases are difficult to work with, and in such a young field there is not much readily available experience in "proper" practice
Lack of standards — there are no tools or methodologies yet to help users convert a centralized DBMS into a distributed DBMS^[citation needed]
Database design more complex — In addition to traditional database design challenges, the design of a distributed database has to consider fragmentation of data, allocation of fragments to specific sites and data replication
Additional software is required
Operating system should support distributed environment
Concurrency control is a major issue; it can be addressed by techniques such as locking and timestamping.
Distributed access to data
Analysis of distributed data
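The timestamping approach to concurrency control mentioned in the list above can be sketched as basic timestamp ordering. This is a simplified, single-process illustration under stated assumptions: the names `Item`, `Aborted`, `read` and `write` are hypothetical, timestamps are plain integers assigned by the caller, and restarting an aborted transaction is left out.

```python
class Aborted(Exception):
    """Raised when an operation arrives too late and must be rejected."""


class Item:
    """A data item with the timestamps of its latest read and write."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.read_ts = 0
        self.write_ts = 0


def read(item: Item, ts: int) -> int:
    # A transaction may not read a value written by a younger transaction.
    if ts < item.write_ts:
        raise Aborted(f"read at ts={ts} is older than write_ts={item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item: Item, ts: int, value: int) -> None:
    # A transaction may not overwrite a value that a younger transaction
    # has already read or written.
    if ts < item.read_ts or ts < item.write_ts:
        raise Aborted(f"write at ts={ts} conflicts with a younger transaction")
    item.value = value
    item.write_ts = ts


stock = Item(100)
read(stock, ts=2)          # transaction 2 reads the item
write(stock, ts=3, value=70)  # transaction 3 may write after it
try:
    write(stock, ts=1, value=50)  # transaction 1 is too old
except Aborted as exc:
    print("aborted:", exc)
```

Unlike locking, this scheme never makes a transaction wait; conflicts are resolved by rejecting the older operation, which the transaction would then retry with a new timestamp.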

References

[1] "Definition: distributed database". www.its.bldrdoc.gov.
[2] O'Brien, J. & Marakas, G. M. (2008). Management Information Systems (pp. 185-189). New York, NY: McGraw-Hill Irwin.
[3] "TechNet Glossary". Microsoft. Retrieved 2013-07-16. distributed query[:] Any SELECT, INSERT, UPDATE, or DELETE statement that references tables and rowsets from one or more external OLE DB data sources.
[4] Ashdown, Lance; Kyte, Tom (September 2011). "Oracle Database Concepts, 11g Release 2 (11.2)". Oracle Corporation. Archived from the original on 2013-07-15. Retrieved 2013-07-17. Distributed SQL synchronously accesses and updates data distributed among multiple databases. [...] Distributed SQL includes distributed queries and distributed transactions.
[5] Security, Networx. "Distributed Database". www.networxsecurity.org. Retrieved 2018-02-06.

M. T. Özsu and P. Valduriez, Principles of Distributed Databases (3rd edition) (2011), Springer, ISBN 978-1-4419-8833-1
Elmasri and Navathe, Fundamentals of database systems (3rd edition), Addison-Wesley Longman, ISBN 0-201-54263-3
Oracle Database Administrator's Guide 10g (Release 1), http://docs.oracle.com/cd/B14117_01/server.101/b10739/ds_concepts.htm
