Apache Cassandra
Original author(s) | Avinash Lakshman, Prashant Malik / Facebook |
---|---|
Developer(s) | Apache Software Foundation |
Initial release | July 2008 |
Stable release | 5.0.4 / April 10, 2025[1] |
Written in | Java |
Operating system | Cross-platform |
Available in | English |
Type | NoSQL database, wide column store |
License | Apache License 2.0 |
Website | Official website |
Apache Cassandra is a free and open-source, distributed NoSQL database management system designed to handle large volumes of data across multiple commodity servers while providing high availability with no single point of failure.[2] As a wide column store, Cassandra efficiently handles data models with numerous sparse columns and is particularly suited for systems with high write throughput requirements due to its LSM tree storage architecture.[2] The database is used by over 30,000 organizations worldwide.[3]
The system combines Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data storage engine model.[4][2] Cassandra prioritizes availability and scalability over strong consistency, which places it among AP (availability and partition tolerance) systems under the CAP theorem.
History
Origins at Facebook
Avinash Lakshman, a co-author of Amazon's Dynamo, and Prashant Malik developed Cassandra at Facebook in 2007–2008 to power the inbox search feature.[5][6] Facebook needed a database that could handle massive scale across multiple data centers with high write throughput and no single point of failure. The database was named after Cassandra, the mythological Trojan prophetess whose prophecies were never believed, reflecting the challenges of consistency in distributed systems.[7]
Open source development
Facebook released Cassandra as open-source software on Google Code in July 2008.[8] In March 2009, it became an Apache Incubator project,[9] and on February 17, 2010, it graduated to a top-level Apache project.[10]
Architecture
Distributed design
Cassandra uses a peer-to-peer distributed system where all nodes are identical, eliminating single points of failure. Key architectural features include:
- Masterless replication: Every node can accept read and write requests, regardless of where data resides
- Linear scalability: Performance increases proportionally with added nodes[11]
- Configurable replication: Data automatically replicates across multiple nodes for fault tolerance
- Multi-datacenter support: Built-in support for clusters spanning multiple data centers[12]
In published benchmarks, Cassandra 4.0 achieved up to 33% higher throughput than version 3.11, with significantly lower latencies.[13]
Consistency model
Cassandra offers tunable consistency, allowing developers to choose consistency levels per operation:[14][15]
- Write consistency: From "ANY" (highest availability) to "ALL" (highest consistency)
- Read consistency: Similar range with additional options like "LOCAL_QUORUM"
- Eventual consistency: Default model using timestamps and tombstones
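The trade-off between consistency levels can be illustrated with simple replica arithmetic: a read is guaranteed to overlap the most recent acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). The following is a minimal sketch of that rule, not Cassandra's implementation; the function names are hypothetical, and "ANY" is left out because a hint-only write does not count as a replica acknowledgement.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for a few common consistency levels."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when every read replica set must intersect every write replica set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, reading and writing at "QUORUM" satisfies the rule (2 + 2 > 3), while reading and writing at "ONE" does not (1 + 1 ≤ 3) and therefore relies on eventual consistency.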
Cluster communication
The system employs a gossip protocol for cluster management:
- Nodes exchange state information about themselves and other nodes
- Uses Phi Accrual Failure Detector for fault detection[16]
- Implements "hinted handoff" for temporary node failures
- Seed nodes serve as bootstrap points for cluster formation
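The idea behind an accrual failure detector can be sketched as follows: instead of a binary up/down verdict, the detector outputs a suspicion level (phi) that grows the longer a heartbeat is overdue relative to the intervals observed so far. This sketch assumes exponentially distributed heartbeat intervals, which is a simplification (the original paper models them with a normal distribution); the class name, window size and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified accrual failure detector (assumes exponential inter-arrival times)."""

    def __init__(self, window_size: int = 1000):
        # Sliding window of recent heartbeat intervals, in seconds.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability that a heartbeat is still coming."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # P(no heartbeat for `elapsed`) = exp(-elapsed / mean); phi = -log10 of that.
        return elapsed / (mean * math.log(10))
```

A caller would compare `phi(now)` against a configurable threshold and mark the peer as suspected once the threshold is exceeded; a higher threshold trades slower detection for fewer false positives.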
Data model
Cassandra implements a wide column store model that combines elements of key-value and tabular databases:
Core concepts
- Keyspace: Top-level namespace (analogous to a database in RDBMS)
- Table: Container for rows (formerly "column family")
- Partition key: Determines data distribution across nodes
- Clustering key: Orders data within a partition
- Column: Basic data unit with name, value, type, and timestamp
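How a partition key determines data placement can be sketched with a token ring: the key is hashed to a token, the first node whose token is greater than or equal to it owns the partition, and further replicas are the next nodes clockwise. This is an illustrative sketch only. It assumes one token per node and a simple clockwise placement, and it uses MD5 from the Python standard library as a stand-in hash; `token_for` and `TokenRing` are hypothetical names.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy ring with one token per node and clockwise replica placement."""

    def __init__(self, node_tokens: dict[str, int]):
        # Sorted (token, node) pairs form the ring.
        self.ring = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, partition_key: str, replication_factor: int) -> list[str]:
        """Return the nodes holding the partition, primary owner first."""
        # The first node whose token is >= the key's token owns the partition;
        # the modulo below wraps around to the start of the ring.
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        count = min(replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]
```

Because placement depends only on the hash of the partition key, any node can compute which replicas hold a given row without consulting a central directory.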
Schema flexibility
Unlike traditional relational databases, Cassandra provides:
- Dynamic columns per row
- Runtime schema modifications without downtime[17]
- Support for complex data types including collections and user-defined types (UDTs)
Feature | Cassandra | Traditional RDBMS |
---|---|---|
Primary structure | Keyspace → Table → Row | Database → Table → Row |
Schema flexibility | Dynamic columns per row | Fixed schema |
Relationships | Denormalized data model | Normalized with JOINs |
Query patterns | Must follow data model | Ad hoc queries supported |
Storage engine
LSM tree architecture
Cassandra uses a Log-structured merge-tree (LSM tree) optimized for write-heavy workloads:[2][18]
1. Write path:
   * Writes go to commit log (durability) and memtable (performance)
   * Memtables flush to immutable SSTables when full
   * No in-place updates; all operations append new data
2. Read path:
   * Checks memtable first for latest data
   * Uses bloom filters to efficiently search SSTables
   * Merges data from multiple sources using timestamps
3. Compaction:
   * Periodically merges SSTables to reclaim space
   * Removes obsolete data and tombstones
   * Multiple strategies available (Size-Tiered, Leveled, Time-Window, Unified in 5.0)[3]
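The write path, read path and compaction described above can be sketched as a toy in-memory model. It is a teaching aid under simplifying assumptions, not Cassandra's storage engine: the class `TinyLSM` and its methods are hypothetical, timestamps come from a local counter, "SSTables" are sorted tuples held in memory, and tombstones are dropped at compaction without any grace period.

```python
class TinyLSM:
    """Toy log-structured merge store: commit log, memtable, immutable sorted runs."""

    TOMBSTONE = object()  # marker written in place of a value on delete

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # key -> (timestamp, value)
        self.sstables = []     # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit
        self.clock = 0         # local counter standing in for write timestamps

    def put(self, key, value):
        self.clock += 1
        self.commit_log.append((self.clock, key, value))  # durability first
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # Deletes are writes too: they append a tombstone, never edit in place.
        self.put(key, self.TOMBSTONE)

    def flush(self):
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Collect every version of the key and keep the newest timestamp.
        candidates = []
        if key in self.memtable:
            candidates.append(self.memtable[key])
        for table in self.sstables:
            candidates.extend(entry for k, entry in table if k == key)
        if not candidates:
            return None
        _, value = max(candidates, key=lambda entry: entry[0])
        return None if value is self.TOMBSTONE else value

    def compact(self):
        # Merge all runs, keep the newest version per key, drop tombstones.
        merged = {}
        for table in self.sstables:
            for key, entry in table:
                if key not in merged or entry[0] > merged[key][0]:
                    merged[key] = entry
        live = {k: e for k, e in merged.items() if e[1] is not self.TOMBSTONE}
        self.sstables = [tuple(sorted(live.items()))] if live else []
```

The sketch shows why writes are cheap (an append plus an in-memory update) and why reads may need to reconcile several versions until compaction merges them.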
Storage components
- Commit log: Write-ahead log for crash recovery
- Memtable: In-memory write buffer
- SSTable (Sorted String Table): Immutable on-disk file
- Bloom filter: Probabilistic data structure for efficient lookups
- Index files: Primary key indexes and secondary indexes
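The role of the bloom filter can be illustrated with a small sketch: before reading an SSTable, the filter answers "definitely not present" or "possibly present", so most files that cannot contain the key are skipped without disk access. The sizes, hash construction (seeded SHA-256) and names below are assumptions chosen for readability, not the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means it may have been.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

In a read path, one such filter per SSTable would be consulted first, and only tables whose filter returns `True` would be searched for the key.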
Query language
Cassandra Query Language (CQL)
CQL provides an SQL-like interface while respecting Cassandra's distributed nature:
```sql
-- Create a keyspace with replication
CREATE KEYSPACE my_app WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'datacenter1': 3,
  'datacenter2': 2
};

-- Create a table
CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT,
  created_at TIMESTAMP
);

-- Insert data
INSERT INTO users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john@example.com', toTimestamp(now()));
```
Query limitations
Due to its distributed architecture, Cassandra does not support:
- Multi-table JOINs
- Ad hoc aggregations (though limited support exists)
- Arbitrary WHERE clauses (must include partition key)
- Foreign key constraints
- ACID transactions across partitions (limited support in newer versions)
Major releases
Recent versions
- Cassandra 4.0 (July 2021): Production-ready focus, 5x faster streaming, audit logging[19][20]
- Cassandra 4.1 (December 2022): Pluggable memtable implementations, guardrails framework
- Cassandra 5.0 (September 2024): Major performance and feature release[3]
Cassandra 5.0 features
The latest major release introduced significant enhancements:[21]
- Storage Attached Indexes (SAI): More flexible secondary indexing with better performance[3]
- Vector search: Native support for AI/ML workloads with vector data type[22]
- Unified Compaction Strategy (UCS): Adaptive compaction that optimizes automatically[3]
- JDK 17 support: Up to 20% performance improvement from better memory management[3][22]
- Trie-based storage: New memtable and SSTable formats for improved efficiency[3]
- ACID transactions: Limited support for multi-partition transactions[22]
Version support
As of 2025:[23]
- Latest stable: 5.0.4 (April 2025)
- Supported versions: 4.0.x, 4.1.x, 5.0.x
- End-of-life: All 3.x versions, which reached end of life with the release of 5.0
Production deployment
Notable users
Large-scale Cassandra deployments include:[24]
- Apple: 160,000+ instances, 100+ PB of data[24][5]
- Netflix: 10,000+ instances, 6 PB of data, 1 trillion requests/day[24][25]
- Uber: Mission-critical systems for real-time analytics[26]
- Discord: Message storage and delivery[24]
Performance characteristics
- Write-optimized, but reads are also efficient with proper data modeling[2]
- Linear scalability for both reads and writes[11]
- Typical latencies: sub-millisecond for cached data, single-digit milliseconds for disk reads[27]
- Handles thousands to millions of operations per second depending on cluster size[24]
Operational considerations
- Hardware: Optimized for SSDs, benefits from high memory for caching[15]
- Monitoring: JMX-based with tools like nodetool[28][29]
- Maintenance: Regular repairs and compactions required[2]
- Backup: Snapshot-based with support for incremental backups
Ecosystem
Client drivers
Official drivers available for:[30]
- Java (native and JDBC)
- Python
- Node.js
- C/C++[31]
- C#/.NET
- Go
Related projects
- DataStax Enterprise: Commercial distribution with additional features[26]
- ScyllaDB: C++ reimplementation claiming higher performance[13]
- Amazon Keyspaces: Managed Cassandra-compatible service
- Azure Cosmos DB: Offers Cassandra API compatibility[15]
See also
- Bigtable – Google's original wide column store
- Dynamo (storage system) – Amazon's distributed key-value store
- CAP theorem
- NoSQL
- Distributed database
- Comparison of database management systems
References
- ^ "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
- ^ a b c d e f Carpenter, Jeff; Hewitt, Eben (2022). Cassandra: The Definitive Guide (3rd ed.). O'Reilly Media. ISBN 978-1-4920-9710-5.
- ^ a b c d e f g "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
- ^ "Apache Cassandra Documentation Overview". Retrieved January 21, 2021.
- ^ a b "Apache Cassandra: Four Interesting Facts". DataStax. January 31, 2025. Retrieved May 29, 2025.
- ^ "What is Apache Cassandra?". ScyllaDB. February 19, 2025. Retrieved May 29, 2025.
- ^ "The meaning behind the name of Apache Cassandra". Archived from the original on November 1, 2016. Retrieved July 19, 2016.
- ^ Hamilton, James (July 12, 2008). "Facebook Releases Cassandra as Open Source". Retrieved June 4, 2009.
- ^ "Is this the new hotness now?". March 2, 2009. Archived from the original on April 25, 2010. Retrieved March 29, 2010.
- ^ "Cassandra is an Apache top level project". February 18, 2010. Archived from the original on March 28, 2010. Retrieved March 29, 2010.
- ^ a b "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
- ^ Casares, Joaquin (November 5, 2012). "Multi-datacenter Replication in Cassandra". DataStax. Retrieved July 25, 2013.
- ^ a b "Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance". ScyllaDB. August 29, 2023. Retrieved May 29, 2025.
- ^ DataStax (January 15, 2013). "About data consistency". Archived from the original on July 26, 2013. Retrieved July 25, 2013.
- ^ a b c "Best practices for optimal performance in Azure Managed Instance for Apache Cassandra". Microsoft Learn. August 14, 2024. Retrieved May 29, 2025.
- ^ Hayashibara, Naohiro; Défago, Xavier; Yared, Rami; Katayama, Takuya (2004). "The Φ Accrual Failure Detector". IEEE Symposium on Reliable Distributed Systems. pp. 66–78. doi:10.1109/RELDIS.2004.1353004.
- ^ Ellis, Jonathan (March 2, 2012). "The Schema Management Renaissance in Cassandra 1.1". DataStax. Retrieved July 25, 2013.
- ^ "Performance Analysis: Apache Cassandra 4.0.0 Release". benchANT. Retrieved May 29, 2025.
- ^ "The Apache Cassandra Project Releases Apache® Cassandra™ v4.0". Apache Software Foundation. July 27, 2021. Retrieved May 29, 2025.
- ^ "Apache Cassandra 4.0 Comes in Ready for Production". The New Stack. July 26, 2021. Retrieved May 29, 2025.
- ^ "Apache Cassandra 5.0 Brings Major Updates". BigDataWire. September 9, 2024. Retrieved May 29, 2025.
- ^ a b c "Apache Cassandra 2024 Wrapped: A Year of Innovation and Growth". DataStax. March 7, 2025. Retrieved May 29, 2025.
- ^ "Apache Cassandra - endoflife.date". Retrieved May 29, 2025.
- ^ a b c d e "Apache Cassandra Case Studies". Apache Cassandra. Retrieved May 29, 2025.
- ^ "How Netflix Stores 140 Million Hours of Viewing Data Per Day". ByteByteGo. March 18, 2025. Retrieved May 29, 2025.
- ^ a b "The Best Apache Cassandra Use Cases". DataStax. March 17, 2025. Retrieved May 29, 2025.
- ^ "Apache Cassandra Performance Benchmarking". DataStax. February 1, 2025. Retrieved May 29, 2025.
- ^ "How to monitor Cassandra performance metrics". Datadog. December 3, 2015. Retrieved January 5, 2016.
- ^ "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
- ^ "Client drivers". Apache Cassandra Documentation. Retrieved May 29, 2025.
- ^ "DataStax C/C++ Driver for Apache Cassandra". DataStax. Retrieved December 15, 2014.
Further reading
- Carpenter, Jeff; Hewitt, Eben (2022). Cassandra: The Definitive Guide (3rd ed.). O'Reilly Media. ISBN 978-1-4920-9710-5.
- Kan, C. Y. (2023). Cassandra Data Modeling and Analysis. Packt Publishing. ISBN 978-1-78961-091-5.