MapReduce and Hadoop
Birkbeck, University of London
Source: Wikipedia (IBM Roadrunner)Divide and ConquerParallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanismSource: Ricardo Guimarães HerrmannManaging Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each
– We don’t know the order in which workers access
shared dataManaging Multiple Workers
• Thus, we need:
– semaphores (lock, unlock)
– conditional variables (wait, notify, broadcast)
• Still, lots of problems:
– deadlock, livelock, race conditions...
– dining philosophers, sleeping barbers, cigarette
• Moral of the story: be carefulCurrent Tools
• Programming models
– Shared Memory (pthreads)
– Message Passing (MPI)
Shared Memory Message Passing
P P P P P P P P P P
1 2 3 4 5 1 2 3 4 5
• Design Patterns
– Producer-Consumer Flows
– Shared Work Queues
producer consumerWhere the rubber meets the road
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason
– At the scale of datacenters (even across
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…Where the rubber meets the road
• The reality:
– Lots of one-off solutions, custom code
– Write you own dedicated library, then program
– Burden on the programmer to explicitly manage
everythingFallacies of Distributed Computing
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.What’s the point?
• It’s all about the right level of abstraction
– The von Neumann architecture has served us well,
but is no longer appropriate for the multi-core or
The datacenter is the computerWhat’s the point?
• Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separating the what from how
– Developer specifies the computation that needs to
– Execution framework (“runtime”) handles actual
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Cluster have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From mythical man-month to tradable machine-hourWarm-Up
• The task:
– We have a huge text document
– Count the number of times each distinct word
appears in the file
• Sample application:
– Analyse web server logs to find popular URLsWarm-Up
• In UNIX, it can be easily done:
words(doc.txt) sort uniq -c
– Here words is a script that takes a file and outputs
the words in it, one per a line.
– The file doc.txt may be too large for memory,
but all word, count pairs fit in memory
– The great thing is that it is naturally parallelizable
– This captures the essence of MapReduceTypical Big-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for
these two operations
(Dean and Ghemawat, OSDI 2004)