Question? Leave a message!




Big Data Massive Parallel Processing (Spark)

Big Data Massive Parallel Processing (Spark)
Ghislain Fourny Big Data 7. Massive Parallel Processing (Spark) 1kirtchanut / 123RF Stock Photo YARN 2Last week: MapReduce Input data Map Map Map Map Map Map Map Map Intermediate data (shuffled) Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 3Hadoop infrastructure (version 1) Namenode + JobTracker /dir/file Datanode Datanode Datanode Datanode Datanode Datanode + + + + + + 4 TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTrackerIssue 1: scalability M M M M M M M M M M M M 4,000 nodes 40,000 tasks 5Issue 2: bottleneck JobTracker Bottleneck TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker 6 6Issue 3: Jack of all trades Scheduling Monitoring 7 7Issue 4: Utilization (task slots) Static Fixedsize (Decide on M/R at configuration time) 8 8YARN Yet Another Resource Negotiator 9YARN Scheduling Monitoring Application management Application Master Application Master Application Master Application Master Resource Manager Application Master 10Scales more M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M 10,000 nodes 100,000 tasks 11YARN architecture ResourceManager Container Container Container NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager 12Remember... It does ring a bell, doesn't it 13Masterslave architecture Master Slave Slave Slave Slave Slave Slave 14HDFS server architecture Namenode /dir/file1 /dir/file2 /file3 Datanode Datanode Datanode Datanode Datanode Datanode 15YARN ResourceManager Container Container Container NodeManager NodeManager NodeManager NodeManager NodeManager 16ResourceManager ResourceManager Scheduler + Applications Manager 17YARN: Client posts a job Client ResourceManager ApplicationClient Protocol Job Container Container Container NodeManager NodeManager NodeManager NodeManager NodeManager 18YARN: RM allocates an Application Master Client ResourceManager (Scheduler) ApplicationMaster Protocol Job Schedules Application Master Container Container NodeManager NodeManager NodeManager NodeManager NodeManager 19Scheduling strategies FIFO scheduler Capacity scheduler Fair scheduler 20Fine grained resource requests CPU RAM 21YARN: Application Master asks for Containers Client ResourceManager (Applications Manager) ApplicationMaster Protocol Job Application Master Container Container NodeManager NodeManager NodeManager NodeManager NodeManager 22Application Master communicates with containers ContainerManagement Protocol Execute Monitor Container Application Master Container Container Container 23Forward compatibility with DAGs of tasks 24Nikki Zalewski / 123RF Stock Photo PageRank 25The web 26Each page divides its impact evenly across links 1/3 1/3 1/3 27Weighted graph of the web 1 6 3 5 2 7 4 28Stochastic approach 1 6 3 5 2 7 4 29Stochastic approach 1 6 3 5 2 7 4 30Stochastic approach 1 6 3 5 2 7 4 31Stochastic approach 1 6 3 5 2 7 4 32Stochastic approach 1 6 3 5 2 7 4 33Probability of being on each page as a random surfer 6 1 3 5 2 7 4 34… abstracted as a vector π‘₯ " π‘₯ π‘₯ π‘₯ π‘₯ π‘₯ ' π‘₯ ( 35Stochastic approach: 1 step 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 π‘₯ " 1 1 1 π‘₯ 0 0 0 0 2 3 2 π‘₯ 0 1 0 0 0 0 0 π‘₯ 1 π‘₯ 0 0 1 0 0 0 3 π‘₯ ' 1 π‘₯ 0 0 0 0 0 0 ( 2 1 1 1 0 0 0 0 3 2 2 36Stochastic approach: n steps 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 π‘₯ " 1 1 1 π‘₯ 0 0 0 0 2 3 2 π‘₯ 0 1 0 0 0 0 0 π‘₯ 1 π‘₯ 0 0 1 0 0 0 3 π‘₯ ' 1 π‘₯ 0 0 0 0 0 0 ( 2 1 1 1 0 0 0 0 3 2 2 37Convergence 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 π‘₯ π‘₯ " " 1 1 1 π‘₯ π‘₯ 0 0 0 0 2 3 2 π‘₯ π‘₯ 0 1 0 0 0 0 0 π‘₯ π‘₯ = 1 π‘₯ π‘₯ 0 0 1 0 0 0 3 π‘₯ π‘₯ ' ' 1 π‘₯ π‘₯ 0 0 0 0 0 0 ( ( 2 1 1 1 0 0 0 0 3 2 2 38Convergence MX=X That’s a diagonalization problem 39Damping 1 1 1βˆ’π‘ 𝑀+𝑝 ⨂ 𝑋=1.𝑋 … … 1 1 With probability p the surfer picks a page chosen uniformly at random 40Damping 1 … 1 𝑋 = 0.85𝑀+0.15 𝑋 … … … 7" 1 … 1 41Matrix algebra: sounds like mapping and shuffling Input data Map Map Map Map Map Map Map Map Shuffle Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 42Nikki Zalewski / 123RF Stock Photo Introduction to Spark 43MapReduce highlevel data Map Shuffle Map 44MapReduce highlevel data This is a very specific topology 45... and it's underusing YARN 46... because YARN supports any DAG 47YARN we can build something more general 48MapReduce (even more) highlevel: twostep... Map Reduce 49... to any DAGs 50FullDAG query processing enters Spark 51Spark's firstclass citizen Resilient RDD Distributed Dataset 52Spark's firstclass citizen it's just a RDD Big collection 53Spark's firstclass citizen RDD ... and it is par ti tion ed 54RDD lifecycle Local On the HDFS S3 Filesystem fly Creation RDD 55RDD lifecycle RDD Transformation RDD 56RDD lifecycle RDD Action Local On HDFS S3 Filesystem screen 57Lineage graph RDD RDD RDD RDD RDD RDD RDD RDD 58Lazy Evaluation RDD RDD RDD RDD RDD RDD RDD Each action triggers an RDD evaluation 59Lazy Evaluation RDD RDD RDD RDD RDD RDD RDD RDD 60Spark: Execution Application or Shell 61Spark: Hello, World val rdd1 = sc.parallelize( List("Hello, World", "Hello, there") ) val rdd2 = rdd1.flatMap( value = value.split(" ") ) Key Value rdd2.countByValue() Hello, 2 there 1 World 1 62Overview of transformations 63Transformations 64Transformations: filter function: 65Transformations: map function: 66Transformations: flatMap function: 67Transformations: distinct 68Transformations: sample fraction + seed 69Transformations on two RDDs 70Transformations: union 71Transformations: intersection 72Transformations: substract 73Transformations: cartesian product 74Overview of actions 75Actions: collect 76Actions: count 22 77Actions: count by value 12 4 6 78Actions: take, top, take ordered, take sample 4 79Actions: reduce, fold, aggregate, for each + 80Pair RDDs 81Transformations: reduce by key + 82Transformations: group by key + 83Transformations: map values function: 84Transformations: keys 85Transformations: values 86Transformations: join 87Transformations: substract by key 88Actions: count by key 6 1 4 89Actions: lookup 90Partitioning 91Parallel execution 92Parallel execution Executor 1 Executor 2 Executor 3 Executor 4 93Parallel execution RDD 1 Logical layer Task RDD 2 Physical layer 94Sequence of (parallelizable) tasks Logical layer Task Task Task 95Physical layer (3 tasks) Physical layer 96Optimization 97Optimization Stage 98Sequence of (parallelizable) tasks Logical layer Task Task Stage Task 99General DAG with stages Join Simple shuffle 100General DAG with stages (physical containers) Join Simple shuffle 101Performance tuning 102Inefficiency 103Inefficiency 104Inefficiency 105Inefficiency 106Inefficiency 107Inefficiency 108Inefficiency 109Persisting RDDs 110Persisting RDDs 111Persisting RDDs 112Persisting RDDs 113Persisting RDDs 114General DAG with stages Join Simple shuffle 115Prepartitioning Prepartitioned Optimized Join Simple shuffle 116
Website URL
Comment