Lambda Architecture explained

Mo Data stashed this in Big Data Technologies

http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/

Big data architecture paradigms are commonly separated into two (supposedly) diametrical models, the more traditional batch and the (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to exclude each other. The Lambda Architecture combines a slow and fast lane of data processing to achieve the best of both worlds. Fast results and deep, large scale processing.

Usually one or the other architecture has been implemented due to a business requirement. Commonly, business users or customers eventually arrive at the point where they either would like to get a more historic view or more real time insight either of which can not be provided by the deployed architecture. At this point a hybrid solution becomes the only realistic solution. One which brings some surprising benefits with it.

Lambda Architecture explainedThe Lambda Architecture centrally receives data and does as little as possible processing before copying and splitting the data stream to the real time and batch layer. The batch layer collects the data in a data sink like HDFS or S3 in its raw form. Hadoop jobs regularly process the data and write the result to a data store.

Since this process is fully batched the data store can have some significant simplification. It should support random reads, i.e. needs some kind of index, however, it can do away with random writing, locking, and consistency issues. This simplifies the store significantly. An example of such a system is ElephantDB.

The problem with batch processing is the time it takes. For example, the above process may take hours or days. In the meantime data has been arriving and subsequent processes or services continue to work with hours or days old information. The real time layer solves this by taking its copy of the data and processing it in seconds or minutes and stores it in a fast random read and write store. This store is more complex since it has to be constantly updated.

The complexity of the real time layer and it’s store is manageable since it only has to store and serve a sliding window of data, which needs to be roughly as long as the batch process takes. Both layers’ results are merged and real time information is replaced in favour of batch layer data. In many cases this enables for the real time process to work with good approximations since its results are replaced by highly precise data within a short period.

Lambda Architecture benefitsThe addition of another layer to an architecture has major advantages. Firstly, data can (historically) be processed with high precision and involved algorithms without losing short-term information, alerts, and insights provided by the real time layer. Secondly, the addition of a layer is offset by dramatically reducing the random write storage requirements. The batch write storage provides also the option to switch data at predefined times and version data.

Lastly and importantly, the addition of the data sink of raw data offers the option to recover from human mistakes, i.e. deploying bugs which write erroneous aggregated data from which other architectures can not recover. Another option is to retrospectively enhance data extraction or learning algorithms and apply them on the whole of the historic dataset. This is extremely helpful in agile and startup environments where MVPs push what can be done down the track.

See also this http://lambda-architecture.net/

What is the Lambda Architecture?Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

Here’s how it looks like, from a high-level perspective:

LA overview

All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.

And this

http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for

The Lambda architecture: principles for architecting realtime Big Data systems I’ve started reading “Big Data - Principles and best practices of scalable realtime data systems" by Nathan Marz and James Warren. Throughout 2012 Manning have been releasing chapters as part of their early access program, and at the time of writing six chapters have been made available for download. Despite the slow drip feed of information, the wait is worthwhile because IMHO the authors have hit the mark with their "Lambda architecture".

The book describes the Lambda architecture as a clear set of principles for architecting Big Data systems. I like the concepts of building immutability and recomputation into a system, and it is the first architecture to really define how batch and stream processing can work together to solve a myriad of use cases. With the general emphasis moving more towards realtime, I see this book being a must read for all Big Data developers and architects alike.

The premise behind the Lambda architecture is you should be able to run ad-hoc queries against all of your data to get results, but doing so is unreasonably expensive in terms of resource. Technically it is now feasible to run ad-hoc queries against your Big Data (Cloudera Impala), but querying a petabyte dataset everytime you want to compute the number of pageviews for a URL may not always be the most efficient approach. So the idea is to precompute the results as a set of views, and you query the views. I tend to call these Question Focused Datasets (e.g. pageviews QFD).

The Lambda architecture The Lambda architecture is split into three layers, the batch layer, the serving layer, and the speed layer.

Batch layer (Apache Hadoop) The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration. The views should be computed from the entire dataset and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours. Serving layer (Cloudera Impala) The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. As the batch views are static, the serving layer only needs to provide batch updates and random reads, and for this I would use Cloudera Impala.

To expose the views using Impala all the serving layer would have to do is create a table in the Hive Metastore that points to the files in the HDFS. Users would then be able to use Impala to query the views immediately. Hadoop and Impala are perfect tools for the batch and serving layers. Hadoop can store and process petabytes of data, and Impala can query this data quickly and interactively. Although, the batch and serving layers alone do not satisfy any realtime requirement because MapReduce (by design) is high latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.

Just a note about the use of the term ‘realtime’. When I say realtime, I actually mean near-realtime (NRT) and the time delay between the occurrence of an event and the availability of any processed data from that event. In the Lambda architecture, realtime is the ability to process the delta of data that has been captured after the start of the batch layers current iteration and now.

Speed layer (Storm, Apache HBase) In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer and it does this by computing realtime views in Storm.

The realtime views contain only the delta results to supplement the batch views. Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received. Whats clever about the speed layer is the realtime views are intended to be transient and as soon as the data propagates through the batch and serving layers the corresponding results in the realtime views can be discarded. This is referred to as “complexity isolation” in the book, meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.

The final piece of the puzzle is exposing the realtime views so that they can be queried and merged with the batch views to get the complete results. As the realtime views are incremental, the speed layer requires both random reads and writes, and for this I would use Apache HBase. HBase provides the ability for Storm to continuously increment the realtime views and at the same time can be queried by Impala for merging with the batch views. Impala’s ability to query both the batch views stored in the HDFS and the realtime views stored in HBase make it the perfect tool for the job.

Thoughts

The book describes some great architectural principles that can be applied to any Big Data architecture, specifically immutability and recomputation. Hadoop gives you a platform for storing all of your data and you don’t need a complex system for finding and updating individual records at scale, you can simply append new immutable records to your master dataset. An immutable record is a version of a record at a point in time. Newer versions of a record can be added, but a particular version can never be overridden, meaning that you can always revert to a previous state.

In the Lambda architecture this means that if you accidentally added some bad records, they can simply be deleted and the views recomputed to fix the problem. If the data was mutable then its much more difficult - and sometimes impossible - to revert to a previous state. The book describes this as having “human fault-tolerance”.

The book also emphasises that the batch views should be recomputed from scratch from the entire master dataset. You may think that this is a bad idea (as I did) and surely its more performant to implement incremental MapReduce algorithms to increase the frequency of the batch views. Although by doing this you will trade performance for human fault-tolerance, because when using an incremental algorithm it is much more difficult to fix any problems in the views. For example, lets say that you accidentally deployed an algorithm that incremented a counter by 2 instead of 1. If you were computing incremental results it would be difficult to go back and recompute each increment, whereas recomputing the results from scratch is simple and all you would have to do is fix the algorithm and the views would be fixed during the next batch iteration.

When talking about the Lambda architecture one question always comes up: can you achieve the same results using the Hadoop ecosystem alone? I think the reasons for implementing this architecture lie in your perception and requirements for realtime, and whether you think human fault-tolerance is an important factor in your system. It is feasible to implement a low latency system in Hadoop. For example, you could use Apache Flume to create an ingest pipeline into HDFS or HBase and use Impala to query the data. The latest version of Flume (1.2.0) also introduces the concept of Interceptors which can be used to modify the streaming data. Although Flume by design is not a streaming analytics platform like Storm and therefore I think it would be difficult to compute your realtime views in Flume (but not impossible). Storm on the other hand is a purpose built, scalable stream processing system that typically works at much lower latency.

What I’ve learnt the most from the Big Data book (or at least the first 6 chapters of it) is the architectural principles. The importance of immutability and human fault-tolerance, and the benefits of precomputation and recomputation. If you’re considering implementing the Lambda architecture in its entirety, ask yourself one question: how realtime do I need to be? If your answer is in the tens of seconds then the complete Lambda architecture maybe overkill, but if your answer is milliseconds then I think the Lambda architecture is your answer. storm-hbase connector

As a precursor to this post I’ve been working on a HBase connector for Storm. The connector provides a number of Storm Bolt and Trident state implementations for creating realtime views in a Lambda architecture. Please check it out on my GitHub page: https://github.com/jrkinley/storm-hbase. Links http://www.cloudera.com/content/cloudera/en/products/cdh.html https://github.com/nathanmarz/storm http://www.manning.com/marz/ http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it

<a rel="nofollow" target="_blank" href="http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/">http://www.semantikoz.com/blog/lambda-architecture-velocity-volume-big-data-hadoop-storm/</a>

Big data architecture paradigms are commonly separated into two 
(supposedly) diametrical models, the more traditional batch and the 
(near) real-time processing. The most popular technologies representing 
the two are Hadoop with MapReduce and Storm. However, a hybrid solution,
 the Lambda Architecture, challenges the idea that these approaches have
 to exclude each other. The Lambda Architecture combines a slow and fast
 lane of data processing to achieve the best of both worlds. Fast 
results and deep, large scale processing.

Usually
 one or the other architecture has been implemented due to a business 
requirement. Commonly, business users or customers eventually arrive at 
the point where they either would like to get a more historic view or 
more real time insight either of which can not be provided by the 
deployed architecture. At this point a hybrid solution becomes the only 
realistic solution. One which brings some surprising benefits with it.

Lambda Architecture explainedThe
 Lambda Architecture centrally receives data and does as little as 
possible processing before copying and splitting the data stream to the 
real time and batch layer. The batch layer collects the data in a data 
sink like HDFS or S3 in its raw form. Hadoop jobs regularly process the 
data and write the result to a data store.

Since this process is fully batched the data store can have some 
significant simplification. It should support random reads, i.e. needs 
some kind of index, however, it can do away with random writing, 
locking, and consistency issues. This simplifies the store 
significantly. An example of such a system is ElephantDB.

The 
problem with batch processing is the time it takes. For example, the 
above process may take hours or days. In the meantime data has been 
arriving and subsequent processes or services continue to work with 
hours or days old information. The real time layer solves this by taking
 its copy of the data and processing it in seconds or minutes and stores
 it in a fast random read and write store. This store is more complex 
since it has to be constantly updated.

The complexity of the real 
time layer and it’s store is manageable since it only has to store and 
serve a sliding window of data, which needs to be roughly as long as the
 batch process takes. Both layers’ results are merged and real time 
information is replaced in favour of batch layer data. In many cases 
this enables for the real time process to work with good approximations 
since its results are replaced by highly precise data within a short 
period.

Lambda Architecture benefitsThe addition of 
another layer to an architecture has major advantages. Firstly, data can
 (historically) be processed with high precision and involved algorithms
 without losing short-term information, alerts, and insights provided by
 the real time layer. Secondly, the addition of a layer is offset by 
dramatically reducing the random write storage requirements. The batch 
write storage provides also the option to switch data at predefined 
times and version data.

Lastly and importantly, the addition of 
the data sink of raw data offers the option to recover from human 
mistakes, i.e. deploying bugs which write erroneous aggregated data from
 which other architectures can not recover. Another option is to 
retrospectively enhance data extraction or learning algorithms and apply
 them on the whole of the historic dataset. This is extremely helpful in
 agile and startup environments where MVPs push what can be done down 
the track.

See also this <a rel="nofollow" target="_blank" href="http://lambda-architecture.net/">http://lambda-architecture.net/</a>

What is the Lambda Architecture?<a rel="nofollow" target="_blank" href="https://twitter.com/nathanmarz">Nathan Marz</a> came up with the term
Lambda Architecture (LA) for a generic, scalable and fault-tolerant data
processing architecture, based on his experience working on distributed data
processing systems at Backtype and Twitter.

The LA aims to satisfy the needs for a robust system that is fault-tolerant,
both against hardware failures and human mistakes, being able to serve a wide 
range of workloads and use cases, and in which low-latency reads and updates are 
required. The resulting system should be linearly scalable, and it should scale out 
rather than up.

Here’s how it looks like, from a high-level perspective:

<ol><li>All data entering the system is dispatched to both the batch layer and the speed layer for processing.</li>
 <li>The batch layer has two functions: (i) managing 
the master dataset (an immutable, append-only set of raw data), and (ii)
 to pre-compute the batch views.</li>
 <li>The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way. </li>
 <li>The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.</li>
 <li>Any incoming query can be answered by merging results from batch views and real-time views.</li></ol>

And this

<a rel="nofollow" target="_blank" href="http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for">http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for</a>

The Lambda architecture: principles for architecting realtime Big Data systems
I’ve started reading “Big Data - Principles and best practices of scalable realtime data systems"
 by Nathan Marz and James Warren. Throughout 2012 Manning have been 
releasing chapters as part of their early access program, and at the 
time of writing six chapters have been made available for download. 
Despite the slow drip feed of information, the wait is worthwhile 
because IMHO the authors have hit the mark with their "Lambda 
architecture".

The book describes the Lambda architecture as a clear set 
of principles for architecting Big Data systems. I like the concepts of 
building immutability and recomputation into a system, and it is the 
first architecture to really define how batch and stream processing can 
work together to solve a myriad of use cases. With the general emphasis 
moving more towards realtime, I see this book being a must read for all 
Big Data developers and architects alike.

The premise behind the Lambda architecture is you should 
be able to run ad-hoc queries against all of your data to get results, 
but doing so is unreasonably expensive in terms of resource. Technically
 it is now feasible to run ad-hoc queries against your Big Data 
(Cloudera Impala), but querying a petabyte dataset everytime you want to
 compute the number of pageviews for a URL may not always be the most 
efficient approach. So the idea is to precompute the results as a set of
 views, and you query the views. I tend to call these Question Focused 
Datasets (e.g. pageviews QFD).

The Lambda architecture
The Lambda architecture is split into three layers, the batch layer, the serving layer, and the speed layer.

Batch layer (Apache Hadoop)
The batch layer is responsible for two things. The first 
is to store the immutable, constantly growing master dataset (HDFS), and
 the second is to compute arbitrary views from this dataset (MapReduce).
 Computing the views is a continuous operation, so when new data arrives
 it will be aggregated into the views when they are recomputed during 
the next MapReduce iteration.
The views should be computed from the entire dataset and 
therefore the batch layer is not expected to update the views 
frequently. Depending on the size of your dataset and cluster, each 
iteration could take hours.
Serving layer (Cloudera Impala)
The output from the batch layer is a set of flat files 
containing the precomputed views. The serving layer is responsible for 
indexing and exposing the views so that they can be queried.
As the batch views are static, the serving layer only 
needs to provide batch updates and random reads, and for this I would 
use Cloudera Impala.

To expose the views using Impala all the serving 
layer would have to do is create a table in the Hive Metastore that 
points to the files in the HDFS. Users would then be able to use Impala 
to query the views immediately.
Hadoop and Impala are perfect tools for the batch and 
serving layers. Hadoop can store and process petabytes of data, and 
Impala can query this data quickly and interactively. Although, the 
batch and serving layers alone do not satisfy any realtime requirement 
because MapReduce (by design) is high latency and it could take a few 
hours for new data to be represented in the views and propagated to the 
serving layer. This is why we need the speed layer.

Just a note about the use of the term ‘realtime’. When
 I say realtime, I actually mean near-realtime (NRT) and the time delay 
between the occurrence of an event and the availability of any processed
 data from that event. In the Lambda architecture, realtime is the 
ability to process the delta of data that has been captured after the 
start of the batch layers current iteration and now.

Speed layer (Storm, Apache HBase)
In essence the speed layer is the same as the batch layer 
in that it computes views from the data it receives. The speed layer is 
needed to compensate for the high latency of the batch layer and it does
 this by computing realtime views in Storm.

The realtime views contain 
only the delta results to supplement the batch views.
Whilst the batch layer is designed to continuously 
recompute the batch views from scratch, the speed layer uses an 
incremental model whereby the realtime views are incremented as and when
 new data is received. Whats clever about the speed layer is the 
realtime views are intended to be transient and as soon as the data 
propagates through the batch and serving layers the corresponding 
results in the realtime views can be discarded. This is referred to as 
“complexity isolation” in the book, meaning that the most complex part 
of the architecture is pushed into the layer whose results are only 
temporary.

The final piece of the puzzle is exposing the realtime 
views so that they can be queried and merged with the batch views to get
 the complete results. As the realtime views are incremental, the speed 
layer requires both random reads and writes, and for this I would use 
Apache HBase. HBase provides the ability for Storm to continuously 
increment the realtime views and at the same time can be queried by 
Impala for merging with the batch views. Impala’s ability to query both 
the batch views stored in the HDFS and the realtime views stored in 
HBase make it the perfect tool for the job.

Thoughts

The book describes some great architectural principles 
that can be applied to any Big Data architecture, 
specifically immutability and recomputation. Hadoop gives you a platform
 for storing all of your data and you don’t need a complex system for 
finding and updating individual records at scale, you can simply append 
new immutable records to your master dataset. An immutable record is a 
version of a record at a point in time. Newer versions of a record can 
be added, but a particular version can never be overridden, meaning that
 you can always revert to a previous state.

In the Lambda architecture 
this means that if you accidentally added some bad records, they can 
simply be deleted and the views recomputed to fix the problem. If the 
data was mutable then its much more difficult - and sometimes impossible
 - to revert to a previous state. The book describes this as having 
“human fault-tolerance”.

The book also emphasises that the batch views should be 
recomputed from scratch from the entire master dataset. You may think 
that this is a bad idea (as I did) and surely its more performant to 
implement incremental MapReduce algorithms to increase the frequency of 
the batch views. Although by doing this you will trade performance for 
human fault-tolerance, because when using an incremental algorithm it is
 much more difficult to fix any problems in the views. For example, lets
 say that you accidentally deployed an algorithm that incremented a 
counter by 2 instead of 1. If you were computing incremental results it 
would be difficult to go back and recompute each increment, whereas 
recomputing the results from scratch is simple and all you would have to
 do is fix the algorithm and the views would be fixed during the next 
batch iteration.

When talking about the Lambda architecture one question 
always comes up: can you achieve the same results using the Hadoop 
ecosystem alone?
I think the reasons for implementing this architecture lie
 in your perception and requirements for realtime, and whether you think
 human fault-tolerance is an important factor in your system. It is 
feasible to implement a low latency system in Hadoop. For example, you 
could use Apache Flume to create an ingest pipeline into HDFS or HBase 
and use Impala to query the data. The latest version of Flume (1.2.0) 
also introduces the concept of Interceptors which can be used to modify 
the streaming data. Although Flume by design is not a streaming 
analytics platform like Storm and therefore I think it would be 
difficult to compute your realtime views in Flume (but not impossible). 
Storm on the other hand is a purpose built, scalable stream processing 
system that typically works at much lower latency.

What I’ve learnt the most from the Big Data book (or at 
least the first 6 chapters of it) is the architectural principles. The 
importance of immutability and human fault-tolerance, and the benefits 
of precomputation and recomputation. If you’re considering implementing 
the Lambda architecture in its entirety, ask yourself one question: how 
realtime do I need to be? If your answer is in the tens of seconds then 
the complete Lambda architecture maybe overkill, but if your answer is 
milliseconds then I think the Lambda architecture is your answer.
storm-hbase connector

As a precursor to this post I’ve been working on a HBase 
connector for Storm. The connector provides a number of Storm Bolt and 
Trident state implementations for creating realtime views in a Lambda 
architecture. Please check it out on my GitHub page: <a rel="nofollow" target="_blank" href="https://github.com/jrkinley/storm-hbase.">https://github.com/jrkinley/storm-hbase.</a>
Links
<a rel="nofollow" target="_blank" href="http://www.cloudera.com/content/cloudera/en/products/cdh.html">http://www.cloudera.com/content/cloudera/en/products/cdh.html</a>
<a rel="nofollow" target="_blank" href="https://github.com/nathanmarz/storm">https://github.com/nathanmarz/storm</a>
<a rel="nofollow" target="_blank" href="http://www.manning.com/marz/">http://www.manning.com/marz/</a>
<a rel="nofollow" target="_blank" href="http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it">http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it</a>

Mo Data
1:09 AM Mar 23 2015

Stashed in:

To save this post, select a stash from drop-down menu or type in a new one:

Really good to ask upfront how realtime the system needs to be.

Adam Rifkin
1:22 AM Mar 23 2015

Lambda Architecture explained

Mo Data stashed this in Big Data Technologies

You May Also Like: