Lambda architecture
Lambda architecture is a data-processing architecture designed to process massive quantities of data while supporting ad-hoc queries at low latency. It attempts to solve the problem of balancing comprehensiveness (including all data), accuracy, and latency when querying big-data collections.
Overview
Lambda architecture describes a system consisting of three layers:[1]
- Batch – Precomputes results using a distributed processing system, typically Hadoop. This layer stores a master copy of the entire data set and acts as the system of record.
- Serving – Responds to ad-hoc queries by returning precomputed views from the batch layer or, where those are not yet available, from the speed layer.
- Speed – Processes data streams in real time, without regard to fix-ups or completeness. This layer relies on a combination of computation techniques such as partial recomputation (p. 287) and estimation (e.g., HyperLogLog for set cardinality), as well as optimizations in resource usage (p. 293) and data transformations.
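The interplay of the three layers can be sketched as a toy example: the batch layer recomputes a view over the full master data set, the speed layer aggregates only the events not yet absorbed by a batch run, and the serving layer merges the two at query time. This is a minimal illustrative sketch in plain Python; all names and data are hypothetical, and a real deployment would use systems such as Hadoop and Storm rather than in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical click events as (user_id, clicks) pairs.
master_dataset = [("alice", 3), ("bob", 1), ("alice", 2)]  # batch layer's system of record
recent_events = [("alice", 1), ("carol", 4)]               # not yet absorbed by a batch run

def batch_view(events):
    """Batch layer: recompute per-user totals from the full master data set."""
    view = defaultdict(int)
    for user, clicks in events:
        view[user] += clicks
    return view

def speed_view(events):
    """Speed layer: aggregate only the recent, unbatched events."""
    view = defaultdict(int)
    for user, clicks in events:
        view[user] += clicks
    return view

def query(user, batch, speed):
    """Serving layer: answer a query by merging the batch and speed views."""
    return batch.get(user, 0) + speed.get(user, 0)

batch = batch_view(master_dataset)
speed = speed_view(recent_events)
print(query("alice", batch, speed))  # 5 from the batch view + 1 from the speed view -> 6
```

When the next batch run absorbs the recent events into the master data set, the speed view for those events can be discarded; the serving-layer merge is what keeps query results complete in the meantime.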
Typical Components
In practice, each of the three layers can be built from any of a number of suitable components. For the serving layer, some implementations have used Cassandra to store data from the speed layer, and ElephantDB to do the same for the batch layer.[2]
Examples of Lambda Architecture in Use
Metamarkets, which provides analytics for players in the programmatic advertising space, employs a version of the lambda architecture that uses the open-source Druid data store for storing and serving both the streamed and batch-processed data.[3] For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Hadoop, and Druid.[4]
The Netflix Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not to provide the same type of views.[5] Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.
Criticism
Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths, while attempting to abstract the code bases into a single framework puts many of the specialized tools in each side's ecosystems out of reach.[6]
In a technical discussion over the merits of employing a pure streaming approach, it was noted that using a flexible streaming framework such as Apache Samza could provide some of the same benefits as batch processing without the latency.[7]
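The alternative raised in that discussion can be sketched as a single processing function applied both to a replayed historical log and to live events, so only one code base needs to be maintained. This is a hypothetical illustration of the idea, not the API of Samza or any particular framework.

```python
def process(state, event):
    """Single code path: fold one event into the running per-user totals."""
    user, clicks = event
    state[user] = state.get(user, 0) + clicks
    return state

# Historical events are replayed from a retained log through the SAME
# function that handles live traffic, instead of a separate batch job.
historical_log = [("alice", 3), ("bob", 1), ("alice", 2)]
live_events = [("alice", 1)]

state = {}
for event in historical_log + live_events:  # replay history, then keep consuming live
    state = process(state, event)

print(state["alice"])  # 3 + 2 replayed + 1 live -> 6
```

Because history and live data flow through one function, there is no second implementation to keep in sync, which is precisely the maintenance burden the criticism above targets.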
References
- ^ Marz, Nathan; Warren, James. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013, p. 13.
- ^ Bijnens, Nathan. "A real-time architecture using Hadoop and Storm". 11 December 2013, slide 24.
- ^ Yang, Fangjin; Merlino, Gian. "Real-time Analytics with Open Source Technologies". 30 July 2014, slide 42.
- ^ Rao, Supreeth; Gupta, Sunil. "Interactive Analytics in Human Time". 17 June 2014, slides 9 and 16.
- ^ Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline". Netflix, 9 December 2013.
- ^ Kreps, Jay. "Questioning the Lambda Architecture". radar.oreilly.com. O'Reilly. Retrieved 15 August 2014.
- ^ Hacker News discussion. Retrieved 20 August 2014.