
Apache Cassandra

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Thomasvincent (talk | contribs) at 07:29, 29 May 2025 (Cleanup and added citations). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Apache Cassandra
Original author(s): Avinash Lakshman, Prashant Malik / Facebook
Developer(s): Apache Software Foundation
Initial release: July 2008
Stable release: 5.0.4 / April 10, 2025[1]
Written in: Java
Operating system: Cross-platform
Available in: English
Type: NoSQL database, wide column store
License: Apache License 2.0

Apache Cassandra is a free and open-source, distributed NoSQL database management system designed to handle large volumes of data across multiple commodity servers while providing high availability with no single point of failure.[2] As a wide column store, Cassandra efficiently handles data models with numerous sparse columns and is particularly suited for systems with high write throughput requirements due to its LSM tree storage architecture.[2] The database is used by over 30,000 organizations worldwide.[3]

The system combines Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data storage engine model.[4][2] Cassandra prioritizes availability and scalability over consistency, making it an AP (Availability and Partition tolerance) system in the CAP theorem framework.

History

Origins at Facebook

Avinash Lakshman, a co-author of Amazon's Dynamo, and Prashant Malik developed Cassandra at Facebook in 2007–2008 to power the inbox search feature.[5][6] Facebook needed a database that could handle massive scale across multiple data centers with high write throughput and no single point of failure. The database was named after Cassandra, the mythological Trojan prophetess whose prophecies were never believed, reflecting the challenges of consistency in distributed systems.[7]

Open source development

Facebook released Cassandra as open-source software on Google Code in July 2008.[8] In March 2009, it became an Apache Incubator project,[9] and on February 17, 2010, it graduated to a top-level Apache project.[10]

Architecture

Distributed design

Cassandra uses a peer-to-peer distributed system where all nodes are identical, eliminating single points of failure. Key architectural features include:

  • Masterless replication: Every node can accept read and write requests, regardless of where data resides
  • Linear scalability: Throughput grows roughly in proportion to the number of nodes[11]
  • Configurable replication: Data automatically replicates across multiple nodes for fault tolerance
  • Multi-datacenter support: Built-in support for clusters spanning multiple data centers[12]

Performance benchmarks have shown that Cassandra 4.0 achieved up to 33% better throughput compared to version 3.11, with significantly improved latencies.[13]

Consistency model

Cassandra offers tunable consistency, allowing developers to choose consistency levels per operation:[14][15]

  • Write consistency: From "ANY" (highest availability) to "ALL" (highest consistency)
  • Read consistency: Similar range with additional options like "LOCAL_QUORUM"
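  • Eventual consistency: The default model; replicas converge over time, with conflicting writes resolved by timestamp (the most recent write wins) and deletions recorded as tombstones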
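
The trade-off between these levels can be illustrated with a little arithmetic. The following is a minimal sketch, not Cassandra's implementation: the function names are hypothetical, and only the levels whose replica counts follow directly from the replication factor (RF) are modelled. "ANY" and "LOCAL_QUORUM" are left out because they depend on hints and on per-datacenter replication factors. The sketch relies on the standard overlap rule that a read is guaranteed to reach at least one replica holding the latest acknowledged write when the read and write replica counts sum to more than RF.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed at a given level.

    Only a subset of levels is modelled in this sketch.
    """
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not modelled in this sketch: {level}")


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, reading and writing at "QUORUM" needs two replicas each, so the sets always overlap; reading and writing at "ONE" does not guarantee overlap, which is the eventually consistent case.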

Cluster communication

The system employs a gossip protocol for cluster management:

  • Nodes exchange state information about themselves and other nodes
  • Uses the Phi Accrual Failure Detector for fault detection[16]
  • Implements "hinted handoff" for temporary node failures
  • Seed nodes serve as bootstrap points for cluster formation
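
The idea behind an accrual failure detector can be sketched briefly. Instead of a binary alive/dead verdict, it outputs a suspicion value, phi, that grows the longer a heartbeat is overdue relative to the intervals seen so far. The sketch below is a simplification under stated assumptions: it models heartbeat inter-arrival times as exponentially distributed (the original paper fits a normal distribution), and the class name, window size, and timestamps are hypothetical rather than taken from Cassandra's source.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified accrual failure detector (exponential approximation)."""

    def __init__(self, window: int = 1000):
        # Sliding window of recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level at time `now`; higher means more likely failed."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Assumed model: P(heartbeat arrives later than `elapsed`)
        #   = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

A node would be treated as down once phi crosses a configured threshold. Under this model, with heartbeats one second apart, a threshold of 8 (an illustrative value) is crossed after roughly 18.4 seconds of silence.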

Data model

Cassandra implements a wide column store model that combines elements of key-value and tabular databases:

Core concepts

  • Keyspace: Top-level namespace (analogous to a database in an RDBMS)
  • Table: Container for rows (formerly "column family")
  • Partition key: Determines data distribution across nodes
  • Clustering key: Orders data within a partition
  • Column: Basic data unit with name, value, type, and timestamp
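
How a partition key determines data placement can be sketched with a toy token ring. This is an illustration under several assumptions, not Cassandra's code: MD5 from the Python standard library stands in for the partitioner's hash function (Cassandra's default partitioner uses Murmur3), each node owns a single token rather than many virtual nodes, replicas are chosen by walking the ring clockwise without regard to racks or datacenters, and the node names are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 used as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy ring: each node owns the token range ending at its own token."""

    def __init__(self, node_tokens: Dict[str, int]):
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, partition_key: str, replication_factor: int) -> List[str]:
        """Return the nodes that store this key, primary replica first."""
        # First node whose token is >= the key's token; wraps past the end.
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

For example, `TokenRing({"node-a": 2**62, "node-b": 2**63, "node-c": 3 * 2**62}).replicas("user-42", 2)` returns two distinct neighbouring nodes. All rows sharing a partition key hash to the same token and therefore land on the same replicas, which is why the clustering key can order them within a single partition.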

Schema flexibility

Unlike traditional relational databases, Cassandra provides:

  • Dynamic columns per row
  • Runtime schema modifications without downtime[17]
  • Support for complex data types including collections and user-defined types (UDTs)

Data model comparison

| Feature | Cassandra | Traditional RDBMS |
|---|---|---|
| Primary structure | Keyspace → Table → Row | Database → Table → Row |
| Schema flexibility | Dynamic columns per row | Fixed schema |
| Relationships | Denormalized data model | Normalized with JOINs |
| Query patterns | Must follow data model | Ad hoc queries supported |

Storage engine

LSM tree architecture

Cassandra uses a Log-structured merge-tree (LSM tree) optimized for write-heavy workloads:[2][18]

1. Write path:

  * Writes go to commit log (durability) and memtable (performance)
  * Memtables flush to immutable SSTables when full
  * No in-place updates; all operations append new data

2. Read path:

  * Checks memtable first for latest data
  * Uses bloom filters to skip SSTables that cannot contain the requested partition
  * Merges data from multiple sources using timestamps

3. Compaction:

  * Periodically merges SSTables to reclaim space
  * Removes obsolete data and tombstones
  * Multiple strategies available (Size-Tiered, Leveled, Time-Window, Unified in 5.0)[3]
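
The three paths above can be sketched as a toy in-memory store. This is a deliberately small model, not Cassandra's storage engine: the class and method names are hypothetical, SSTables are represented as plain dictionaries rather than sorted on-disk files, a logical counter replaces wall-clock timestamps, and compaction merges everything at once instead of following one of the named strategies.

```python
import itertools

TOMBSTONE = object()  # marker written in place of a deleted value


class ToyLSMStore:
    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> (timestamp, value)
        self.sstables = []     # immutable tables flushed from the memtable
        self._clock = itertools.count(1)  # logical timestamps

    def put(self, key, value):
        """Write path: append to the commit log, then update the memtable."""
        ts = next(self._clock)
        self.commit_log.append((ts, key, value))
        self.memtable[key] = (ts, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        """Deletes are writes too: they append a tombstone."""
        self.put(key, TOMBSTONE)

    def flush(self):
        """Freeze the memtable into a new key-sorted table."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Read path: gather every version and keep the newest timestamp."""
        tables = [self.memtable, *self.sstables]
        versions = [table[key] for table in tables if key in table]
        if not versions:
            return None
        _, value = max(versions, key=lambda version: version[0])
        return None if value is TOMBSTONE else value

    def compact(self):
        """Merge all tables, keeping the newest version of each key."""
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # This toy drops tombstones immediately; a real system keeps them
        # for a grace period so that deletes can reach every replica.
        live = {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
        self.sstables = [dict(sorted(live.items()))] if live else []
```

Because nothing is updated in place, an overwritten or deleted key can exist in several tables at once until compaction runs, which is why reads must merge versions by timestamp.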

Storage components

  • Commit log: Write-ahead log for crash recovery
  • Memtable: In-memory write buffer
  • SSTable (Sorted String Table): Immutable on-disk data file
  • Bloom filter: Probabilistic data structure for efficient lookups
  • Index files: Primary key indexes and secondary indexes
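
A bloom filter answers "possibly present" or "definitely absent", which lets a read skip files that cannot hold the requested key. The sketch below shows the general technique only. The class name, bit-array size, and use of SHA-256 with a seed prefix are illustrative assumptions and do not reflect Cassandra's on-disk format or hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; real filters pack bits.
        self.bits = bytearray(size_bits)

    def _positions(self, key: str):
        """Derive `num_hashes` bit positions by hashing the key with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means the key was never added; True means it may have been."""
        return all(self.bits[pos] for pos in self._positions(key))
```

In a read path, one such filter per SSTable would be consulted first, and only the tables whose filter returns True would be opened. A larger bit array or a better-tuned number of hashes lowers the false-positive rate at the cost of memory.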

Query language

Cassandra Query Language (CQL)

CQL provides an SQL-like interface while respecting Cassandra's distributed nature:

```sql
-- Create a keyspace with replication
CREATE KEYSPACE my_app WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'datacenter1': 3,
  'datacenter2': 2
};

-- Create a table (qualified with the keyspace so no USE statement is needed)
CREATE TABLE my_app.users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT,
  created_at TIMESTAMP
);

-- Insert data
INSERT INTO my_app.users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john@example.com', toTimestamp(now()));
```

Query limitations

Due to its distributed architecture, Cassandra does not support the following, or supports them only in limited form:

  • Multi-table JOINs
  • Ad hoc aggregations (only limited aggregation support exists)
  • Arbitrary WHERE clauses (queries must generally restrict the partition key)
  • Foreign key constraints
  • ACID transactions across partitions (limited support in newer versions)

Major releases

Recent versions

  • Cassandra 4.0 (July 2021): Production-ready focus, 5x faster streaming, audit logging[19][20]
  • Cassandra 4.1 (December 2022): Pluggable memtable implementations, guardrails framework
  • Cassandra 5.0 (September 2024): Major performance and feature release[3]

Cassandra 5.0 features

The latest major release introduced significant enhancements:[21]

  • Storage Attached Indexes (SAI): More flexible secondary indexing with better performance[3]
  • Vector search: Native support for AI/ML workloads with vector data type[22]
  • Unified Compaction Strategy (UCS): Adaptive compaction that optimizes automatically[3]
  • JDK 17 support: Up to 20% performance improvement from better memory management[3][22]
  • Trie-based storage: New memtable and SSTable formats for improved efficiency[3]
  • ACID transactions: Limited support for multi-partition transactions[22]

Version support

As of 2025:[23]

  • Latest stable: 5.0.4 (April 2025)
  • Supported versions: 4.0.x, 4.1.x, 5.0.x
  • End-of-life: All 3.x versions, which reached end-of-life with the release of 5.0

Production deployment

Notable users

Large-scale Cassandra deployments include:[24]

  • Apple: 160,000+ instances, 100+ PB of data[24][5]
  • Netflix: 10,000+ instances, 6 PB of data, 1 trillion requests/day[24][25]
  • Uber: Mission-critical systems for real-time analytics[26]
  • Discord: Message storage and delivery[24]

Performance characteristics

  • Write-optimized but reads are also efficient with proper data modeling[2]
  • Linear scalability for both reads and writes[11]
  • Typical latencies: sub-millisecond for cached data, single-digit milliseconds for disk reads[27]
  • Handles thousands to millions of operations per second depending on cluster size[24]

Operational considerations

  • Hardware: Optimized for SSDs, benefits from high memory for caching[15]
  • Monitoring: JMX-based with tools like nodetool[28][29]
  • Maintenance: Regular repairs and compactions required[2]
  • Backup: Snapshot-based with support for incremental backups

Ecosystem

Client drivers

Official drivers available for:[30]

  • Java (native and JDBC)
  • Python
  • Node.js
  • C/C++[31]
  • C#/.NET
  • Go

Related products and services

  • DataStax Enterprise: Commercial distribution with additional features[26]
  • ScyllaDB: C++ reimplementation claiming higher performance[13]
  • Amazon Keyspaces: Managed Cassandra-compatible service
  • Azure Cosmos DB: Offers Cassandra API compatibility[15]

References

  1. "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
  2. Carpenter, Jeff; Hewitt, Eben (2022). Cassandra: The Definitive Guide (3rd ed.). O'Reilly Media. ISBN 978-1-4920-9710-5.
  3. "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
  4. "Apache Cassandra Documentation Overview". Retrieved January 21, 2021.
  5. "Apache Cassandra: Four Interesting Facts". DataStax. January 31, 2025. Retrieved May 29, 2025.
  6. "What is Apache Cassandra?". ScyllaDB. February 19, 2025. Retrieved May 29, 2025.
  7. "The meaning behind the name of Apache Cassandra". Archived from the original on November 1, 2016. Retrieved July 19, 2016.
  8. Hamilton, James (July 12, 2008). "Facebook Releases Cassandra as Open Source". Retrieved June 4, 2009.
  9. "Is this the new hotness now?". March 2, 2009. Archived from the original on April 25, 2010. Retrieved March 29, 2010.
  10. "Cassandra is an Apache top level project". February 18, 2010. Archived from the original on March 28, 2010. Retrieved March 29, 2010.
  11. "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
  12. Casares, Joaquin (November 5, 2012). "Multi-datacenter Replication in Cassandra". DataStax. Retrieved July 25, 2013.
  13. "Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance". ScyllaDB. August 29, 2023. Retrieved May 29, 2025.
  14. DataStax (January 15, 2013). "About data consistency". Archived from the original on July 26, 2013. Retrieved July 25, 2013.
  15. "Best practices for optimal performance in Azure Managed Instance for Apache Cassandra". Microsoft Learn. August 14, 2024. Retrieved May 29, 2025.
  16. Hayashibara, Naohiro; Défago, Xavier; Yared, Rami; Katayama, Takuya (2004). "The Φ Accrual Failure Detector". IEEE Symposium on Reliable Distributed Systems. pp. 66–78. doi:10.1109/RELDIS.2004.1353004.
  17. Ellis, Jonathan (March 2, 2012). "The Schema Management Renaissance in Cassandra 1.1". DataStax. Retrieved July 25, 2013.
  18. "Performance Analysis: Apache Cassandra 4.0.0 Release". benchANT. Retrieved May 29, 2025.
  19. "The Apache Cassandra Project Releases Apache® Cassandra™ v4.0". Apache Software Foundation. July 27, 2021. Retrieved May 29, 2025.
  20. "Apache Cassandra 4.0 Comes in Ready for Production". The New Stack. July 26, 2021. Retrieved May 29, 2025.
  21. "Apache Cassandra 5.0 Brings Major Updates". BigDataWire. September 9, 2024. Retrieved May 29, 2025.
  22. "Apache Cassandra 2024 Wrapped: A Year of Innovation and Growth". DataStax. March 7, 2025. Retrieved May 29, 2025.
  23. "Apache Cassandra - endoflife.date". Retrieved May 29, 2025.
  24. "Apache Cassandra Case Studies". Apache Cassandra. Retrieved May 29, 2025.
  25. "How Netflix Stores 140 Million Hours of Viewing Data Per Day". ByteByteGo. March 18, 2025. Retrieved May 29, 2025.
  26. "The Best Apache Cassandra Use Cases". DataStax. March 17, 2025. Retrieved May 29, 2025.
  27. "Apache Cassandra Performance Benchmarking". DataStax. February 1, 2025. Retrieved May 29, 2025.
  28. "How to monitor Cassandra performance metrics". Datadog. December 3, 2015. Retrieved January 5, 2016.
  29. "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
  30. "Client drivers". Apache Cassandra Documentation. Retrieved May 29, 2025.
  31. "DataStax C/C++ Driver for Apache Cassandra". DataStax. Retrieved December 15, 2014.

Further reading