Cloud Computing and Hadoop
MapReduce and Hadoop
Birkbeck, University of London
[Image source: Wikipedia (IBM Roadrunner)]
Divide and Conquer
Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?
Common Theme?
• Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism
[Image source: Ricardo Guimarães Herrmann]
Managing Multiple Workers
• Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know the order in which workers access shared data
• Thus, we need:
– semaphores (lock, unlock)
– conditional variables (wait, notify, broadcast)
• Still, lots of problems:
– deadlock, livelock, race conditions...
– dining philosophers, sleeping barbers, cigarette smokers…
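To make the race-condition hazard concrete, here is a minimal Python sketch (an illustration, not from the original slides): four workers increment a shared counter, and the lock is what makes the final total deterministic.

```python
import threading

counter = 0
lock = threading.Lock()  # guards the shared counter

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # without the lock, interleaved read/writes can lose increments
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock; often less without it
```

Dropping the `with lock:` line turns this into exactly the kind of race condition that is hard to reproduce and debug.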
• Moral of the story: be careful
Current Tools
• Programming models
– Shared Memory (pthreads)
– Message Passing (MPI)
[Diagram: processes P1–P5 accessing a shared memory vs. processes P1–P5 exchanging messages]
• Design Patterns
– Producer-Consumer Flows
– Shared Work Queues
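The shared-work-queue pattern above can be sketched in a few lines of Python (illustrative only; the names and the sentinel convention are arbitrary choices): a producer fills a thread-safe queue, several consumers drain it, and a sentinel value signals shutdown.

```python
import queue
import threading

work = queue.Queue()      # thread-safe shared work queue
results = queue.Queue()
SENTINEL = None           # tells consumers to stop

def producer(items):
    for item in items:
        work.put(item)
    work.put(SENTINEL)

def consumer():
    while True:
        item = work.get()
        if item is SENTINEL:
            work.put(SENTINEL)  # re-queue so the other consumers also stop
            break
        results.put(item * item)

threads = [threading.Thread(target=consumer) for _ in range(3)]
for t in threads:
    t.start()
producer(range(5))
for t in threads:
    t.join()
print(sorted(results.queue))  # [0, 1, 4, 9, 16]
```

The queue hides the locking and condition-variable signalling; the programmer never touches a semaphore directly, which previews the "hide system-level details" argument below.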
[Diagram: producer → consumer flow]
Where the rubber meets the road
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about:
– At the scale of datacenters (even across datacenters)
– In the presence of failures
– In terms of multiple interacting services
• Not to mention debugging…
• The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage everything
Fallacies of Distributed Computing
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.
What’s the point?
• It’s all about the right level of abstraction
– The von Neumann architecture has served us well, but is no longer appropriate for the multi-core/cluster environment
• The datacenter is the computer
• Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separating the what from how
– Developer specifies the computation that needs to be performed
– Execution framework (“runtime”) handles actual execution
• Scale “out”, not “up”
– Limits of SMP and large shared-memory machines
• Move processing to the data
– Clusters have limited bandwidth
• Process data sequentially, avoid random access
– Seeks are expensive, disk throughput is reasonable
• Seamless scalability
– From mythical man-month to tradable machine-hour
Warm-Up
• The task:
– We have a huge text document
– Count the number of times each distinct word
appears in the file
• Sample application:
– Analyse web server logs to find popular URLs
• In UNIX, it can be easily done:
words(doc.txt) | sort | uniq -c
– Here words is a script that takes a file and outputs
the words in it, one per line.
– The file doc.txt may be too large for memory,
but all ⟨word, count⟩ pairs fit in memory
– The great thing is that it is naturally parallelizable
– This captures the essence of MapReduce
Typical Big-Data Problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
Key idea: provide a functional abstraction for
these two operations (map and reduce)
(Dean and Ghemawat, OSDI 2004)
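A toy, single-machine sketch of this pipeline in Python (an illustration of the idea, not Hadoop's actual API): the mapper emits ⟨word, 1⟩ pairs for each record, the shuffle stage groups the pairs by key, and the reducer sums each group — mirroring the five steps above.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc):
    # Map: emit a (word, 1) pair for every word in the record
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # Reduce: aggregate the partial counts for one word
    return (word, sum(counts))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_fn(d) for d in docs)
counts = dict(reduce_fn(w, c) for w, c in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

The framework's job is everything outside `map_fn` and `reduce_fn`: partitioning the input, scheduling workers, moving intermediate pairs, and handling failures — the developer only supplies the two functions.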