Apache Cassandra

Original author(s)	Avinash Lakshman, Prashant Malik / Facebook
Developer(s)	Apache Software Foundation
Initial release	July 2008
Stable release	5.0.4 / April 10, 2025
Repository	gitbox.apache.org/repos/asf/cassandra.git
Written in	Java
Operating system	Cross-platform
Available in	English
Type	NoSQL database, wide column store
License	Apache License 2.0
Website	cassandra.apache.org

Apache Cassandra is a free and open-source, distributed NoSQL database management system designed to handle large volumes of data across multiple commodity servers while providing high availability with no single point of failure.^[2] As a wide column store, Cassandra efficiently handles data models with numerous sparse columns and is particularly suited for systems with high write throughput requirements due to its LSM tree storage architecture.^[2] The database is used by over 30,000 organizations worldwide.^[3]

The system combines Amazon's Dynamo distributed storage and replication techniques with Google's Bigtable data storage engine model.^[4]^[2] Cassandra prioritizes availability and scalability over consistency, making it an AP (Availability and Partition tolerance) system in the CAP theorem framework.

History

Origins at Facebook

Avinash Lakshman, a co-author of Amazon's Dynamo, and Prashant Malik developed Cassandra at Facebook in 2007–2008 to power the inbox search feature.^[5]^[6] Facebook needed a database that could handle massive scale across multiple data centers with high write throughput and no single point of failure. The database was named after Cassandra, the mythological Trojan prophetess whose prophecies were never believed, reflecting the challenges of consistency in distributed systems.^[7]

Open source development

Facebook released Cassandra as open-source software on Google Code in July 2008.^[8] In March 2009, it became an Apache Incubator project,^[9] and on February 17, 2010, it graduated to a top-level Apache project.^[10]

Architecture

Distributed design

Cassandra uses a peer-to-peer architecture in which all nodes play the same role, eliminating single points of failure. Key architectural features include:

Masterless replication: Every node can accept read and write requests, regardless of where data resides
Linear scalability: Performance increases proportionally with added nodes^[11]
Configurable replication: Data automatically replicates across multiple nodes for fault tolerance
Multi-datacenter support: Built-in support for clusters spanning multiple data centers^[12]
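
The way a masterless cluster decides which nodes hold a given partition can be illustrated with a small consistent-hashing ring. The following Python sketch is not Cassandra's implementation: the `TokenRing` class, the `token_for` helper, the use of MD5, and the one-token-per-node layout are all assumptions made for illustration (Cassandra uses its own partitioners and replication strategies).

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used here purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class TokenRing:
    """Toy ring: each node owns one token; replicas are the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sorted (token, node) pairs form the ring.
        self.ring = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, partition_key: str):
        """Return the nodes that would store the given partition key."""
        # First node whose token is greater than the key's token, wrapping around.
        start = bisect.bisect_right(self.tokens, token_for(partition_key))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
# Any node can compute this list locally, so any node can coordinate a request.
print(ring.replicas("user:42"))
```

Because every node can compute the replica list from the key alone, no central directory is needed, which is the property the masterless design relies on.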

Performance benchmarks have shown that Cassandra 4.0 achieved up to 33% better throughput compared to version 3.11, with significantly improved latencies.^[13]

Consistency model

Cassandra offers tunable consistency, allowing developers to choose consistency levels per operation:^[14]^[15]

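Write consistency: From "ANY" (highest availability) to "ALL" (highest consistency)
Read consistency: From "ONE" to "ALL" ("ANY" applies only to writes); datacenter-aware levels such as "LOCAL_QUORUM" are available for both reads and writes
Eventual consistency: Default model; replicas converge over time, with conflicting writes resolved by timestamp (last write wins) and deletions recorded as tombstones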
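
The trade-off between consistency levels can be made concrete with the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF). The Python sketch below assumes a single datacenter and covers only a few levels; the function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for a few common consistency levels."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read replica set must overlap every write replica set (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


# With RF = 3, QUORUM writes plus QUORUM reads overlap on at least one replica.
print(read_sees_latest_write("QUORUM", "QUORUM", 3))
# ONE/ONE favors availability and latency; a read may hit a replica that missed the write.
print(read_sees_latest_write("ONE", "ONE", 3))
```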

Cluster communication

The system employs a gossip protocol for cluster management:

Nodes exchange state information about themselves and other nodes
Uses Phi Accrual Failure Detector for fault detection^[16]
Implements "hinted handoff" for temporary node failures
Seed nodes serve as bootstrap points for cluster formation
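
An accrual failure detector outputs a continuous suspicion level (phi) rather than a binary up/down verdict. The Python sketch below is a simplified illustration, not Cassandra's code: it assumes heartbeat intervals follow an exponential distribution (the cited paper models them with a normal distribution), and the class name, window size, and threshold value are chosen for the example.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy accrual detector: suspicion grows the longer a heartbeat is overdue."""

    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Return -log10 of the probability that a heartbeat is merely late."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(still alive) = exp(-elapsed / mean).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)

THRESHOLD = 8.0  # illustrative value; higher means slower but safer detection
for now in (3.5, 10.0, 25.0):
    print(now, round(detector.phi(now), 2), detector.phi(now) > THRESHOLD)
```

Because phi adapts to the observed heartbeat rhythm, the same threshold can work on both fast and slow networks, which is the main motivation for the accrual approach.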

Data model

Cassandra implements a wide column store model that combines elements of key-value and tabular databases:

Core concepts

Keyspace: Top-level namespace (analogous to a database in RDBMS)
Table: Container for rows (formerly "column family")
Partition key: Determines data distribution across nodes
Clustering key: Orders data within a partition
Column: Basic data unit with name, value, type, and timestamp
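
The roles of the partition key and clustering key can be pictured as a map from partition key to a list of rows kept sorted by clustering key. The following Python sketch is only a mental model under that assumption; `ToyTable` and its methods are hypothetical names, and a real table distributes partitions across nodes rather than holding them in one dictionary.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> sorted list of (clustering key, row columns)
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row: dict) -> None:
        rows = self.partitions[partition_key]
        keys = [ck for ck, _ in rows]
        index = bisect.bisect_left(keys, clustering_key)
        if index < len(rows) and rows[index][0] == clustering_key:
            # Same primary key: merge columns, so an insert behaves as an upsert.
            rows[index] = (clustering_key, {**rows[index][1], **row})
        else:
            rows.insert(index, (clustering_key, row))

    def select(self, partition_key, start=None, end=None):
        """Return rows of one partition, optionally limited to a clustering-key range."""
        return [
            row
            for ck, row in self.partitions.get(partition_key, [])
            if (start is None or ck >= start) and (end is None or ck <= end)
        ]


messages = ToyTable()
messages.insert("channel-1", 3, {"text": "third"})
messages.insert("channel-1", 1, {"text": "first"})
messages.insert("channel-1", 2, {"text": "second"})
# Rows come back ordered by clustering key, regardless of insertion order.
print(messages.select("channel-1", start=2))
```

This is why queries that name a partition key and a clustering-key range are cheap: they touch one partition and read a contiguous slice of it.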

Schema flexibility

Unlike traditional relational databases, Cassandra provides:

Dynamic columns per row
Runtime schema modifications without downtime^[17]
Support for complex data types including collections and user-defined types (UDTs)

Data Model Comparison
Feature	Cassandra	Traditional RDBMS
Primary structure	Keyspace → Table → Row	Database → Table → Row
Schema flexibility	Dynamic columns per row	Fixed schema
Relationships	Denormalized data model	Normalized with JOINs
Query patterns	Must follow data model	Ad hoc queries supported

Storage engine

LSM tree architecture

Cassandra uses a Log-structured merge-tree (LSM tree) optimized for write-heavy workloads:^[2]^[18]

1. Write path:

  * Writes go to commit log (durability) and memtable (performance)
  * Memtables flush to immutable SSTables when full
  * No in-place updates; all operations append new data

2. Read path:

  * Checks memtable first for latest data
  * Uses bloom filters to efficiently search SSTables
  * Merges data from multiple sources using timestamps

3. Compaction:

  * Periodically merges SSTables to reclaim space
  * Removes obsolete data and tombstones
  * Multiple strategies available (Size-Tiered, Leveled, Time-Window, Unified in 5.0)^[3]
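
The write path, read path, and compaction steps above can be sketched in a few dozen lines of Python. This is a teaching model, not Cassandra's storage engine: `ToyLSM`, its logical clock, and the tiny memtable limit are assumptions, SSTables are plain in-memory dictionaries, and tombstones are purged immediately at compaction, whereas a real cluster retains them for a grace period.

```python
class ToyLSM:
    """Toy LSM tree: commit log + memtable + immutable SSTables, merged by timestamp."""

    TOMBSTONE = object()  # marker written in place of a deleted value

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of unflushed writes
        self.memtable = {}     # key -> (timestamp, value)
        self.sstables = []     # flushed tables, oldest first; never modified in place
        self.memtable_limit = memtable_limit
        self.clock = 0         # logical timestamp for last-write-wins

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((key, self.clock, value))  # durability first
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, self.TOMBSTONE)  # deletes are writes too

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        candidates = [t[key] for t in [self.memtable, *self.sstables] if key in t]
        if not candidates:
            return None
        _, value = max(candidates, key=lambda cell: cell[0])  # newest timestamp wins
        return None if value is self.TOMBSTONE else value

    def compact(self):
        merged = {}
        for table in self.sstables:
            for key, cell in table.items():
                if key not in merged or cell[0] > merged[key][0]:
                    merged[key] = cell
        live = {k: c for k, c in sorted(merged.items()) if c[1] is not self.TOMBSTONE}
        self.sstables = [live] if live else []


db = ToyLSM()
db.write("a", 1)
db.write("b", 2)   # memtable is full, so it flushes to an SSTable
db.write("a", 3)   # newer version of "a" shadows the flushed one
db.delete("b")     # tombstone flushes into a second SSTable
db.compact()
print(db.read("a"), db.read("b"), len(db.sstables))
```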

Storage components

Commit log: Write-ahead log for crash recovery
Memtable: In-memory write buffer
SSTable: Sorted String Table - immutable on-disk files
Bloom filter: Probabilistic data structure for efficient lookups
Index files: Primary key indexes and secondary indexes
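
A Bloom filter lets a read skip SSTables that cannot contain the requested key: it may report false positives but never false negatives. The Python sketch below shows the idea only; the `BloomFilter` class, the bit-array size, the number of hashes, and the use of SHA-256 are assumptions for illustration and do not reflect Cassandra's actual filter format.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python integer used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> position) & 1 for position in self._positions(key))


sstable_filter = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(key)

# A key that was added is always reported as possibly present.
print(sstable_filter.might_contain("user:2"))
```

When `might_contain` returns False, the read path can skip that SSTable entirely and avoid a disk access.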

Query language

Cassandra Query Language (CQL)

CQL provides an SQL-like interface while respecting Cassandra's distributed nature:

```sql
-- Create a keyspace with replication
CREATE KEYSPACE my_app WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'datacenter1': 3,
  'datacenter2': 2
};

-- Create a table (qualified with the keyspace so no USE statement is needed)
CREATE TABLE my_app.users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT,
  created_at TIMESTAMP
);

-- Insert data
INSERT INTO my_app.users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john@example.com', toTimestamp(now()));
```

Query limitations

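Due to its distributed architecture, Cassandra does not support the following, or supports them only in limited form:

Multi-table JOINs
Ad hoc aggregations (built-in aggregates such as COUNT and SUM exist but are efficient only within a single partition)
Arbitrary WHERE clauses (efficient queries must restrict the partition key; other filters require a secondary index or ALLOW FILTERING)
Foreign key constraints
ACID transactions across partitions (limited support in newer versions)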

Major releases

Recent versions

Cassandra 4.0 (July 2021): Focus on production readiness, up to 5x faster streaming, audit logging^[19]^[20]
Cassandra 4.1 (June 2022): Pluggable memtable implementations, guardrails framework
Cassandra 5.0 (September 2024): Major performance and feature release^[3]

Cassandra 5.0 features

The latest major release introduced significant enhancements:^[21]

Storage Attached Indexes (SAI): More flexible secondary indexing with better performance^[3]
Vector search: Native support for AI/ML workloads with vector data type^[22]
Unified Compaction Strategy (UCS): Adaptive compaction that optimizes automatically^[3]
JDK 17 support: Up to 20% performance improvement from better memory management^[3]^[22]
Trie-based storage: New memtable and SSTable formats for improved efficiency^[3]
ACID transactions: Limited support for multi-partition transactions^[22]

Version support

As of 2025:^[23]

Latest stable: 5.0.4 (April 2025)
Supported versions: 4.0.x, 4.1.x, 5.0.x
End-of-life: All 3.x versions, which reached end-of-life with the release of 5.0

Production deployment

Notable users

Large-scale Cassandra deployments include:^[24]

Apple: 160,000+ instances, 100+ PB of data^[24]^[5]
Netflix: 10,000+ instances, 6 PB of data, 1 trillion requests/day^[24]^[25]
Uber: Mission-critical systems for real-time analytics^[26]
Discord: Message storage and delivery^[24]

Performance characteristics

Write-optimized but reads are also efficient with proper data modeling^[2]
Linear scalability for both reads and writes^[11]
Typical latencies: sub-millisecond for cached data, single-digit milliseconds for disk reads^[27]
Handles thousands to millions of operations per second depending on cluster size^[24]

Operational considerations

Hardware: Optimized for SSDs, benefits from high memory for caching^[15]
Monitoring: JMX-based with tools like nodetool^[28]^[29]
Maintenance: Regular repairs and compactions required^[2]
Backup: Snapshot-based with support for incremental backups

Ecosystem

Client drivers

Official drivers available for:^[30]

Java (native and JDBC)
Python
Node.js
C/C++^[31]
C#/.NET
Go

Related projects

DataStax Enterprise: Commercial distribution with additional features^[26]
ScyllaDB: C++ reimplementation claiming higher performance^[13]
Amazon Keyspaces: Managed Cassandra-compatible service
Azure Cosmos DB: Offers Cassandra API compatibility^[15]

References

[1] "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
[2] Carpenter, Jeff; Hewitt, Eben (2022). Cassandra: The Definitive Guide (3rd ed.). O'Reilly Media. ISBN 978-1-4920-9710-5.
[3] "Apache Cassandra 5.0 Announcement". Apache Cassandra. Retrieved May 29, 2025.
[4] "Apache Cassandra Documentation Overview". Retrieved January 21, 2021.
[5] "Apache Cassandra: Four Interesting Facts". DataStax. January 31, 2025. Retrieved May 29, 2025.
[6] "What is Apache Cassandra?". ScyllaDB. February 19, 2025. Retrieved May 29, 2025.
[7] "The meaning behind the name of Apache Cassandra". Archived from the original on November 1, 2016. Retrieved July 19, 2016.
[8] Hamilton, James (July 12, 2008). "Facebook Releases Cassandra as Open Source". Retrieved June 4, 2009.
[9] "Is this the new hotness now?". March 2, 2009. Archived from the original on April 25, 2010. Retrieved March 29, 2010.
[10] "Cassandra is an Apache top level project". February 18, 2010. Archived from the original on March 28, 2010. Retrieved March 29, 2010.
[11] "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
[12] Casares, Joaquin (November 5, 2012). "Multi-datacenter Replication in Cassandra". DataStax. Retrieved July 25, 2013.
[13] "Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance". ScyllaDB. August 29, 2023. Retrieved May 29, 2025.
[14] DataStax (January 15, 2013). "About data consistency". Archived from the original on July 26, 2013. Retrieved July 25, 2013.
[15] "Best practices for optimal performance in Azure Managed Instance for Apache Cassandra". Microsoft Learn. August 14, 2024. Retrieved May 29, 2025.
[16] Hayashibara, Naohiro; Défago, Xavier; Yared, Rami; Katayama, Takuya (2004). "The Φ Accrual Failure Detector". IEEE Symposium on Reliable Distributed Systems. pp. 66–78. doi:10.1109/RELDIS.2004.1353004.
[17] Ellis, Jonathan (March 2, 2012). "The Schema Management Renaissance in Cassandra 1.1". DataStax. Retrieved July 25, 2013.
[18] "Performance Analysis: Apache Cassandra 4.0.0 Release". benchANT. Retrieved May 29, 2025.
[19] "The Apache Cassandra Project Releases Apache® Cassandra™ v4.0". Apache Software Foundation. July 27, 2021. Retrieved May 29, 2025.
[20] "Apache Cassandra 4.0 Comes in Ready for Production". The New Stack. July 26, 2021. Retrieved May 29, 2025.
[21] "Apache Cassandra 5.0 Brings Major Updates". BigDataWire. September 9, 2024. Retrieved May 29, 2025.
[22] "Apache Cassandra 2024 Wrapped: A Year of Innovation and Growth". DataStax. March 7, 2025. Retrieved May 29, 2025.
[23] "Apache Cassandra - endoflife.date". Retrieved May 29, 2025.
[24] "Apache Cassandra Case Studies". Apache Cassandra. Retrieved May 29, 2025.
[25] "How Netflix Stores 140 Million Hours of Viewing Data Per Day". ByteByteGo. March 18, 2025. Retrieved May 29, 2025.
[26] "The Best Apache Cassandra Use Cases". DataStax. March 17, 2025. Retrieved May 29, 2025.
[27] "Apache Cassandra Performance Benchmarking". DataStax. February 1, 2025. Retrieved May 29, 2025.
[28] "How to monitor Cassandra performance metrics". Datadog. December 3, 2015. Retrieved January 5, 2016.
[29] "Cassandra Monitoring: Key Metrics & Best Practices". Netdata. Retrieved May 29, 2025.
[30] "Client drivers". Apache Cassandra Documentation. Retrieved May 29, 2025.
[31] "DataStax C/C++ Driver for Apache Cassandra". DataStax. Retrieved December 15, 2014.
