Cloud Computing and Hadoop
Cloud Computing: MapReduce and Hadoop
Dell Zhang
Birkbeck, University of London
2016/17

MapReduce Basics

Source: Wikipedia (IBM Roadrunner)

Divide and Conquer

Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?

Common Theme?
• Parallelization problems arise from:
  – Communication between workers (e.g., to exchange state)
  – Access to shared resources (e.g., data)
• Thus, we need a synchronization mechanism

Source: Ricardo Guimarães Herrmann

Managing Multiple Workers
• Difficult because:
  – We don't know the order in which workers run
  – We don't know when workers interrupt each other
  – We don't know the order in which workers access shared data

Managing Multiple Workers
• Thus, we need:
  – semaphores (lock, unlock)
  – condition variables (wait, notify, broadcast)
  – barriers
• Still, lots of problems:
  – deadlock, livelock, race conditions...
  – dining philosophers, sleeping barbers, cigarette smokers...
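These primitives and the hazards they guard against can be illustrated with a minimal sketch, assuming Python's standard `threading` module (the worker and iteration counts here are arbitrary): several workers do a read-modify-write on one shared counter, serialized by a lock, with `join` acting as a simple barrier.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:            # lock/unlock: one worker at a time in here
            counter += 1      # read-modify-write on shared state

# Four workers, 10,000 increments each
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # barrier: wait until every worker has finished

print(counter)                # 40000 with the lock held on every update
```

Without the lock, the interleaved read-modify-write can lose updates, which is exactly the kind of race condition listed above.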
• Moral of the story: be careful

Current Tools
• Programming models
  – Shared Memory (pthreads)
  – Message Passing (MPI)
[Diagram: processes P1–P5 over one shared memory vs. processes P1–P5 exchanging messages]

Current Tools
• Design patterns
  – Master-slaves
  – Producer-consumer flows
  – Shared work queues
[Diagram: a master feeding slaves; producers and consumers around a shared work queue]

Where the rubber meets the road
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about:
  – at the scale of datacenters (even across datacenters)
  – in the presence of failures
  – in terms of multiple interacting services
• Not to mention debugging...

Where the rubber meets the road
• The reality:
  – Lots of one-off solutions, custom code
  – Write your own dedicated library, then program with it
  – Burden on the programmer to explicitly manage everything

Fallacies of Distributed Computing
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.

What's the point?
• It's all about the right level of abstraction
  – The von Neumann architecture has served us well, but it is no longer appropriate for the multi-core or cluster environment.
  – The datacenter is the computer.

What's the point?
• Hide system-level details from the developers
  – No more race conditions, lock contention, etc.
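As a sketch of the low-level plumbing such a framework hides, here is a hand-rolled producer-consumer flow over a shared work queue, assuming Python's standard `queue` and `threading` modules (all names and the squaring "work" are illustrative, not part of any framework API):

```python
import queue
import threading

work_queue = queue.Queue()          # the shared work queue
results = []

def producer(items):
    for item in items:
        work_queue.put(item)        # master hands out work units
    work_queue.put(None)            # sentinel: no more work

def consumer():
    while True:
        item = work_queue.get()     # blocks until a work unit arrives
        if item is None:
            break
        results.append(item * item) # the "work": square each number

p = threading.Thread(target=producer, args=(range(5),))
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()

print(results)                      # [0, 1, 4, 9, 16]
```

Even in this toy version, the programmer must manage sentinels, blocking, and thread lifetimes by hand; an execution framework takes all of that off the developer's plate.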
• Separating the what from the how
  – Developer specifies the computation that needs to be performed
  – Execution framework ("runtime") handles actual execution

"Big Ideas"
• Scale "out", not "up"
  – Limits of SMP and large shared-memory machines
• Move processing to the data
  – Clusters have limited bandwidth
• Process data sequentially, avoid random access
  – Seeks are expensive, disk throughput is reasonable
• Seamless scalability
  – From the mythical man-month to tradable machine-hours

Warm-Up
• The task:
  – We have a huge text document
  – Count the number of times each distinct word appears in the file
• Sample application:
  – Analyse web server logs to find popular URLs

Warm-Up
• In UNIX, it can easily be done:
      words doc.txt | sort | uniq -c
  – Here words is a script that takes a file and outputs the words in it, one per line.
  – The file doc.txt may be too large for memory, but all (word, count) pairs fit in memory.
  – The great thing is that it is naturally parallelizable.
  – This captures the essence of MapReduce.

Typical Big-Data Problem
• Iterate over a large number of records
• Extract something of interest from each  [Map]
• Shuffle and sort intermediate results
• Aggregate intermediate results  [Reduce]
• Generate final output

Key idea: provide a functional abstraction for these two operations.
(Dean and Ghemawat, OSDI 2004)
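The two operations can be sketched for the word-count warm-up in plain Python. This is not the Hadoop API; the function names and the in-memory "shuffle" (grouping intermediate pairs by key) are illustrative stand-ins for what the framework does at scale.

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for each word
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: aggregate all counts emitted for one word
    return (word, sum(counts))

def map_reduce(documents):
    # "Shuffle and sort": group intermediate pairs by key
    groups = defaultdict(list)
    for doc in documents:
        for word, one in map_fn(doc):
            groups[word].append(one)
    return dict(reduce_fn(w, cs) for w, cs in sorted(groups.items()))

print(map_reduce(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The developer writes only `map_fn` and `reduce_fn` (the what); grouping, sorting, distribution, and fault tolerance are the runtime's job (the how).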