map reduce explained with sandwiches
Wow, computer science would be so much easier if we could explain everything with SANDWICHES!
For some reason they chose to illustrate a sandwich without meat or cheese.
Or was that done on purpose to show that mapreduce is like an inferior sandwich?
Still easier to understand than the Apache mapreduce tutorial:
Damn, all my sandwiches are hung up on this one tomato shard.
In a crunch you can make sandwiches without the tomato shard.
This is a funny picture, but I don't think it really explains map/reduce. Here is my explanation:
Suppose four people had a big pile of mixed coins (pennies, nickels, dimes, and quarters) and wanted to figure out how many of each type they had. The map/reduce way to do this (sketched in code after the list) would be to:
- Map: split the pile into four equal piles and assign one to each person. Each person sorts their pile into pennies, nickels, dimes and quarters
- Shuffle: each person is designated one type of coin and all of those coins are passed to them, e.g. everyone passes their pennies to Alice, nickels to Bob, etc.
- Reduce: each person counts their pile
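To make the data flow concrete, here is a minimal single-process sketch in Python. The pile, the four-way split, and the variable names are all made up for illustration; in a real cluster each step would run on separate machines over much bigger data.

    from collections import defaultdict

    # The big pile of coins, as a flat list of coin types.
    pile = ["penny", "quarter", "nickel", "penny", "dime", "quarter", "penny"]

    # Map: deal the pile into four roughly equal sub-piles (one per person);
    # each person tags every coin in their sub-pile with a count of 1.
    sub_piles = [pile[i::4] for i in range(4)]
    mapped = [[(coin, 1) for coin in sub] for sub in sub_piles]

    # Shuffle: group by coin type, so one person ends up holding every
    # coin of their assigned type (all pennies to Alice, etc.).
    shuffled = defaultdict(list)
    for person_output in mapped:
        for coin, count in person_output:
            shuffled[coin].append(count)

    # Reduce: each person totals the pile for their assigned coin type.
    totals = {coin: sum(counts) for coin, counts in shuffled.items()}
    print(totals)  # {'penny': 3, 'dime': 1, 'quarter': 2, 'nickel': 1}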
The key ideas are:
- The map step partitions the input arbitrarily to parcel out the work evenly. There is no specialization at this point as all nodes are working on a random subset of the data. In the example, everyone gets 1/4 of the original coin pile.
- The shuffle step assigns work to the reduce nodes grouped by a key in the data. This introduces specialization and potentially uneven work load if the key distribution is uneven. In the example, there may be a lot more quarters than pennies.
- The reduce step nodes work through their assigned keys and have all the information associated with the key from the original dataset. In the example, Alice has every penny from the original pile and can therefore compute the total count.
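Real systems don't name people, of course; the shuffle typically routes each key to a reduce node with a deterministic partition function. Here is a minimal sketch of that idea (the names partition and NUM_REDUCERS are hypothetical):

    import zlib

    NUM_REDUCERS = 4

    def partition(key, num_reducers=NUM_REDUCERS):
        # A stable hash, so every map node routes a given key to the
        # same reducer no matter which machine produced the pair.
        return zlib.crc32(key.encode()) % num_reducers

    # All ("penny", 1) pairs land on reducer partition("penny"); if pennies
    # dominate the data, that one reducer does most of the work (skew).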
During steps 1 and 3, nodes work entirely independently. Map/reduce restricts inter-node communication to step 2 because such communication is a major bottleneck in parallel computing.
One additional concept is the combiner. In the example, instead of simply passing the coins, each person could count their own coin pile and pass the coin counts instead of the actual coins (sketched below). This saves bandwidth, but it only works if the reduce operation is associative and commutative (such as addition).
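Continuing the same hypothetical Python sketch, a combiner just runs the reduce logic locally on each mapper's output before the shuffle:

    from collections import Counter, defaultdict

    pile = ["penny", "quarter", "nickel", "penny", "dime", "quarter", "penny"]
    sub_piles = [pile[i::4] for i in range(4)]

    # Combiner: each person pre-counts their own sub-pile, so only
    # (coin_type, local_count) pairs cross the network, not every coin.
    combined = [Counter(sub) for sub in sub_piles]

    shuffled = defaultdict(list)
    for person_counts in combined:
        for coin, local_count in person_counts.items():
            shuffled[coin].append(local_count)

    # Reduce is unchanged: addition is associative and commutative, so
    # summing the pre-combined counts gives the same totals as before.
    totals = {coin: sum(counts) for coin, counts in shuffled.items()}
    print(totals)  # {'penny': 3, 'dime': 1, 'quarter': 2, 'nickel': 1}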
Great explanation, Tom. Thank you for adding this!
3:40 PM Dec 31 2014