map reduce explained with sandwiches
Wow, computer science would be so much easier if we could explain everything with SANDWICHES!
For some reason they chose to illustrate a sandwich without meat or cheese.
Or was that done on purpose to show that mapreduce is like an inferior sandwich?
Still easier to understand than the Apache mapreduce tutorial:
Damn, all my sandwiches are hung up on this one tomato shard.
In a crunch you can make sandwiches without the tomato shard.
This is a funny picture, but I don't think it really explains map/reduce. Here is my explanation:
Suppose four people had a big pile of mixed coins (pennies, nickels, dimes, and quarters) and wanted to figure out how many of each type they had. The map/reduce way to do this (sketched in code after the list) would be to:
- Map: split the pile into four equal piles and assign one to each person. Each person sorts their pile into pennies, nickels, dimes and quarters
- Shuffle: each person is designated one type of coin and all of those coins are passed to them, e.g. everyone passes their pennies to Alice, nickels to Bob, etc.
- Reduce: each person counts their pile
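To make the data flow concrete, here is a minimal single-process sketch in Python. The pile, the four-way split, and the variable names are all made up for illustration; in a real cluster each step would run on separate machines over much bigger data.

    from collections import defaultdict

    # The big pile of coins, as a flat list of coin types.
    pile = ["penny", "quarter", "nickel", "penny", "dime", "quarter", "penny"]

    # Map: deal the pile into four roughly equal sub-piles (one per person);
    # each person tags every coin in their sub-pile with a count of 1.
    sub_piles = [pile[i::4] for i in range(4)]
    mapped = [[(coin, 1) for coin in sub] for sub in sub_piles]

    # Shuffle: group by coin type, so one person ends up holding every
    # coin of their assigned type (all pennies to Alice, etc.).
    shuffled = defaultdict(list)
    for person_output in mapped:
        for coin, count in person_output:
            shuffled[coin].append(count)

    # Reduce: each person totals the pile for their assigned coin type.
    totals = {coin: sum(counts) for coin, counts in shuffled.items()}
    print(totals)  # {'penny': 3, 'dime': 1, 'quarter': 2, 'nickel': 1}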
The key ideas are:
- The map step partitions the input arbitrarily to parcel out the work evenly. There is no specialization at this point as all nodes are working on a random subset of the data. In the example, everyone gets 1/4 of the original coin pile.
- The shuffle step assigns work to the reduce nodes grouped by a key in the data. This introduces specialization and potentially uneven work load if the key distribution is uneven. In the example, there may be a lot more quarters than pennies.
- The reduce step nodes work through their assigned keys and have all the information associated with the key from the original dataset. In the example, Alice has every penny from the original pile and can therefore compute the total count.
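Real systems don't name people, of course; the shuffle typically routes each key to a reduce node with a deterministic partition function. Here is a minimal sketch of that idea (the names partition and NUM_REDUCERS are hypothetical):

    import zlib

    NUM_REDUCERS = 4

    def partition(key, num_reducers=NUM_REDUCERS):
        # A stable hash, so every map node routes a given key to the
        # same reducer no matter which machine produced the pair.
        return zlib.crc32(key.encode()) % num_reducers

    # All ("penny", 1) pairs land on reducer partition("penny"); if pennies
    # dominate the data, that one reducer does most of the work (skew).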
During steps 1 and 3, nodes work entirely independently. Map/reduce restricts inter-node communication to step 2 because such communication is a major bottleneck in parallel computing.
One additional concept is the combiner. In the example, instead of simply passing the coins, each person could count their own coin pile and pass the coin counts instead of the actual coins (sketched below). This saves bandwidth, but it only works if the reduce operation is associative and commutative (such as addition).
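Continuing the same hypothetical Python sketch, a combiner just runs the reduce logic locally on each mapper's output before the shuffle:

    from collections import Counter, defaultdict

    pile = ["penny", "quarter", "nickel", "penny", "dime", "quarter", "penny"]
    sub_piles = [pile[i::4] for i in range(4)]

    # Combiner: each person pre-counts their own sub-pile, so only
    # (coin_type, local_count) pairs cross the network, not every coin.
    combined = [Counter(sub) for sub in sub_piles]

    shuffled = defaultdict(list)
    for person_counts in combined:
        for coin, local_count in person_counts.items():
            shuffled[coin].append(local_count)

    # Reduce is unchanged: addition is associative and commutative, so
    # summing the pre-combined counts gives the same totals as before.
    totals = {coin: sum(counts) for coin, counts in shuffled.items()}
    print(totals)  # {'penny': 3, 'dime': 1, 'quarter': 2, 'nickel': 1}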
Great explanation, Tom. Thank you for adding this!
3:40 PM Dec 31 2014