Lambda Architecture explained
Mo Data stashed this in Big Data Technologies
Big data architecture paradigms are commonly separated into two (supposedly) diametrical models, the more traditional batch and the (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to exclude each other. The Lambda Architecture combines a slow and fast lane of data processing to achieve the best of both worlds. Fast results and deep, large scale processing.
Usually one or the other architecture has been implemented due to a business requirement. Commonly, business users or customers eventually arrive at the point where they either would like to get a more historic view or more real time insight either of which can not be provided by the deployed architecture. At this point a hybrid solution becomes the only realistic solution. One which brings some surprising benefits with it.
Lambda Architecture explainedThe Lambda Architecture centrally receives data and does as little as possible processing before copying and splitting the data stream to the real time and batch layer. The batch layer collects the data in a data sink like HDFS or S3 in its raw form. Hadoop jobs regularly process the data and write the result to a data store.
Since this process is fully batched the data store can have some significant simplification. It should support random reads, i.e. needs some kind of index, however, it can do away with random writing, locking, and consistency issues. This simplifies the store significantly. An example of such a system is ElephantDB.
The problem with batch processing is the time it takes. For example, the above process may take hours or days. In the meantime data has been arriving and subsequent processes or services continue to work with hours or days old information. The real time layer solves this by taking its copy of the data and processing it in seconds or minutes and stores it in a fast random read and write store. This store is more complex since it has to be constantly updated.
The complexity of the real time layer and it’s store is manageable since it only has to store and serve a sliding window of data, which needs to be roughly as long as the batch process takes. Both layers’ results are merged and real time information is replaced in favour of batch layer data. In many cases this enables for the real time process to work with good approximations since its results are replaced by highly precise data within a short period.
Lambda Architecture benefitsThe addition of another layer to an architecture has major advantages. Firstly, data can (historically) be processed with high precision and involved algorithms without losing short-term information, alerts, and insights provided by the real time layer. Secondly, the addition of a layer is offset by dramatically reducing the random write storage requirements. The batch write storage provides also the option to switch data at predefined times and version data.
Lastly and importantly, the addition of the data sink of raw data offers the option to recover from human mistakes, i.e. deploying bugs which write erroneous aggregated data from which other architectures can not recover. Another option is to retrospectively enhance data extraction or learning algorithms and apply them on the whole of the historic dataset. This is extremely helpful in agile and startup environments where MVPs push what can be done down the track.
See also this http://lambda-architecture.net/
What is the Lambda Architecture?Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.
The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.
Here’s how it looks like, from a high-level perspective:
- All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
- The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
- The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
- Any incoming query can be answered by merging results from batch views and real-time views.
The Lambda architecture: principles for architecting realtime Big Data systems I’ve started reading “Big Data - Principles and best practices of scalable realtime data systems" by Nathan Marz and James Warren. Throughout 2012 Manning have been releasing chapters as part of their early access program, and at the time of writing six chapters have been made available for download. Despite the slow drip feed of information, the wait is worthwhile because IMHO the authors have hit the mark with their "Lambda architecture".
The book describes the Lambda architecture as a clear set of principles for architecting Big Data systems. I like the concepts of building immutability and recomputation into a system, and it is the first architecture to really define how batch and stream processing can work together to solve a myriad of use cases. With the general emphasis moving more towards realtime, I see this book being a must read for all Big Data developers and architects alike.
The premise behind the Lambda architecture is you should be able to run ad-hoc queries against all of your data to get results, but doing so is unreasonably expensive in terms of resource. Technically it is now feasible to run ad-hoc queries against your Big Data (Cloudera Impala), but querying a petabyte dataset everytime you want to compute the number of pageviews for a URL may not always be the most efficient approach. So the idea is to precompute the results as a set of views, and you query the views. I tend to call these Question Focused Datasets (e.g. pageviews QFD).
The Lambda architecture The Lambda architecture is split into three layers, the batch layer, the serving layer, and the speed layer.
Batch layer (Apache Hadoop) The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration. The views should be computed from the entire dataset and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours. Serving layer (Cloudera Impala) The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. As the batch views are static, the serving layer only needs to provide batch updates and random reads, and for this I would use Cloudera Impala.
To expose the views using Impala all the serving layer would have to do is create a table in the Hive Metastore that points to the files in the HDFS. Users would then be able to use Impala to query the views immediately. Hadoop and Impala are perfect tools for the batch and serving layers. Hadoop can store and process petabytes of data, and Impala can query this data quickly and interactively. Although, the batch and serving layers alone do not satisfy any realtime requirement because MapReduce (by design) is high latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
Just a note about the use of the term ‘realtime’. When I say realtime, I actually mean near-realtime (NRT) and the time delay between the occurrence of an event and the availability of any processed data from that event. In the Lambda architecture, realtime is the ability to process the delta of data that has been captured after the start of the batch layers current iteration and now.
Speed layer (Storm, Apache HBase) In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer and it does this by computing realtime views in Storm.
The realtime views contain only the delta results to supplement the batch views. Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received. Whats clever about the speed layer is the realtime views are intended to be transient and as soon as the data propagates through the batch and serving layers the corresponding results in the realtime views can be discarded. This is referred to as “complexity isolation” in the book, meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.
The final piece of the puzzle is exposing the realtime views so that they can be queried and merged with the batch views to get the complete results. As the realtime views are incremental, the speed layer requires both random reads and writes, and for this I would use Apache HBase. HBase provides the ability for Storm to continuously increment the realtime views and at the same time can be queried by Impala for merging with the batch views. Impala’s ability to query both the batch views stored in the HDFS and the realtime views stored in HBase make it the perfect tool for the job.
The book describes some great architectural principles that can be applied to any Big Data architecture, specifically immutability and recomputation. Hadoop gives you a platform for storing all of your data and you don’t need a complex system for finding and updating individual records at scale, you can simply append new immutable records to your master dataset. An immutable record is a version of a record at a point in time. Newer versions of a record can be added, but a particular version can never be overridden, meaning that you can always revert to a previous state.
In the Lambda architecture this means that if you accidentally added some bad records, they can simply be deleted and the views recomputed to fix the problem. If the data was mutable then its much more difficult - and sometimes impossible - to revert to a previous state. The book describes this as having “human fault-tolerance”.
The book also emphasises that the batch views should be recomputed from scratch from the entire master dataset. You may think that this is a bad idea (as I did) and surely its more performant to implement incremental MapReduce algorithms to increase the frequency of the batch views. Although by doing this you will trade performance for human fault-tolerance, because when using an incremental algorithm it is much more difficult to fix any problems in the views. For example, lets say that you accidentally deployed an algorithm that incremented a counter by 2 instead of 1. If you were computing incremental results it would be difficult to go back and recompute each increment, whereas recomputing the results from scratch is simple and all you would have to do is fix the algorithm and the views would be fixed during the next batch iteration.
When talking about the Lambda architecture one question always comes up: can you achieve the same results using the Hadoop ecosystem alone? I think the reasons for implementing this architecture lie in your perception and requirements for realtime, and whether you think human fault-tolerance is an important factor in your system. It is feasible to implement a low latency system in Hadoop. For example, you could use Apache Flume to create an ingest pipeline into HDFS or HBase and use Impala to query the data. The latest version of Flume (1.2.0) also introduces the concept of Interceptors which can be used to modify the streaming data. Although Flume by design is not a streaming analytics platform like Storm and therefore I think it would be difficult to compute your realtime views in Flume (but not impossible). Storm on the other hand is a purpose built, scalable stream processing system that typically works at much lower latency.
What I’ve learnt the most from the Big Data book (or at least the first 6 chapters of it) is the architectural principles. The importance of immutability and human fault-tolerance, and the benefits of precomputation and recomputation. If you’re considering implementing the Lambda architecture in its entirety, ask yourself one question: how realtime do I need to be? If your answer is in the tens of seconds then the complete Lambda architecture maybe overkill, but if your answer is milliseconds then I think the Lambda architecture is your answer. storm-hbase connector
As a precursor to this post I’ve been working on a HBase connector for Storm. The connector provides a number of Storm Bolt and Trident state implementations for creating realtime views in a Lambda architecture. Please check it out on my GitHub page: https://github.com/jrkinley/storm-hbase. Links http://www.cloudera.com/content/cloudera/en/products/cdh.html https://github.com/nathanmarz/storm http://www.manning.com/marz/ http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it