Parallel algorithm for linear algebra

parallel linear algebra software for multicore architectures and parallel linear algebra package design overview parallel planes linear algebra parallel projection linear algebra
Dr.NaveenBansal Profile Pic
Published Date:25-10-2017
Your Website URL(Optional)
Chapter 4 Parallel Sorting 4.1 Quick Review of Sequential Sorting Let X =x ,..., x be a set of n real values stored in an array at locations 1 n X0,..., Xn − 1. Note that the index i is shifted by one: Xi= x , starting i+1 from0to n − 1.Weasktosortinincreasingorder,thatistoreportthesortedsequence (x ,..., x ) with x ≤ ··· ≤ x . Sorting amounts to find a permutation σ on (1) (n) (1) (n) theindices (1,..., n)suchthat x ≤ ··· ≤ x (ofteninorderstatistics,wewrite σ(1) σ(n) forshortσ(i) = (i)).Sincethereexist n(factorial n)suchdistinctpermutations,we can also unsort (shuffle) in n different ways a sorted sequence, say (1,..., n). We assume when sorting on a memory-distributed parallel architecture with P processesthatalldataarealreadyallocatedtotheprocessesinto P arraysdenotedby X ,... X .Attheendofaparallelsortingprocedure,allelementsof X aresorted 0 P−1 i in increasing order, and less or equal to all elements of X for 0 ≤ i ≤ P − 2. i+1 Let us concisely review a few common sequential sorting algorithms. 4.1.1 Main Sequential Sorting Algorithms BubbleSort.This bubble sorting procedure is incremental and use a propagation mechanism: It consists in letting the largest element of the input array to move up until it finds its place, and repeat this procedure for the second largest until the it reaches the smallest element. This sorting procedure bears its name from the analogy of an air bubble underwater that goes up to the surface. Figure4.1 illustrates an example of bubble sorting. This algorithm is quite naive but very easy to program. Its worst-case complexity for sorting n elements is quadratic, 2 in O(n ). QuickSort. QuickSortisarandomizedrecursivealgorithmthatchoosesanelement, ˜ called the pivot, and cost O(n log n) in amortized time. We first apply a random permutation on the input array in linear time before calling the QuickSort sorting © Springer International Publishing Switzerland 2016 99 F. Nielsen, Introduction to HPC with MPI for Data Science, Undergraduate Topics in Computer Science, DOI 10.1007/978-3-319-21903-5_4100 4 Parallel Sorting Bubble 4321 34 21 21 34 3421 2134 1 2 34 3421 21 34 34 21 Stage 1: Stage 2: Stage 3: The largest element The second largest The third largest move up element move up element move up Fig. 4.1 Example of BubbleSort sorting that requires quadratic time to sort: we compare pairs of consecutiveelements.Afterthefirststageiscompleted,thelargestelementreachedthelastposition of the array. Then we iteratively proceed for the second largest to move up, and so on procedure. Then Quicksort chooses the first element X0 as the pivot element, and partitions X into three sub-arrays: an array X of elements strictly less than the pivot value, an array X of elements strictly great than the pivot, and an array X of elements equal to the pivot value (always of size 1 when all elements are = distinct). QuickSort finally calls recursively itself on smaller-size arrays X and X , and return the sorted list by concatenating the sorted sub-arrays: QuickSort(X) = (QuickSort(X ), X ,QuickSort(X )). = Beware that if you do not choose the pivot randomly, then you need first to apply ˜ a random permutation to guarantee an amortized time of O(n log n). Otherwise, Quicksort may take quadratic time if you sort a sorted array A more careful ˜ analysisprovesthat Quicksortrequires O(n + n log p)timewhere p isthenumber of distinct elements. Thus when all elements are identical, we have the sorted sequence in linear time, as expected. MergeSort.The MergeSortsortingalgorithmproceedsrecursivelyasfollows:first, wesplitdataintotwolistsandperformthissplittingrecursionuntilgettingelemen- tary lists of single elements (henceforth sorted by definition, this is the terminal case of recursion). Then we merge those sorted lists two by two until we get the overall sorted list of all elements. Figure4.2 illustrates the MergeSort.Themain primitive consists in merging two sorted lists into a sorted list: this can be done easily in linear time, and therefore MergeSort costs overall O(n log n) time.4.1 Quick Review of Sequential Sorting 101 4 2 785 1 36 P 0 P P 4278 5 1 3 6 0 4 P P P P 6 4 2 7 8 51 36 0 2 4 42 7 8 5 1 36 P P P P P P P P 0 1 2 3 4 5 6 7 27 4 8 135 6 P P P P 6 0 2 4 27 4 8 135 6 P P 0 4 1327 4 568 P 0 Fig. 4.2 Illustrating the parallel merge sort algorithm (fine grained parallelism) RadixSort.The RadixSort sorting algorithm relies on the binary representation of  ( j) b−1 j numbers into b bits: x = x 2 . First, we sort elements into two groups i j=0 i according the value of their bit value (1 or 0), starting from the Least Significant Bit (LSB) to the Most Significant Bit (MSB). The time complexity of RadixSort is O(bn). Note that if we use integers represented using b bits, we can have at b most n = 2 distinct numbers. That is, we need to have b ≥ log n to guarantee 2 that all elements can be distinct, and in that case, the complexity of radix sort is O(n log n), matching the time complexity of the MergeSort algorithm. 4.1.2 Complexity of Sorting: A Lower Bound To get a lower bound of the complexity of sorting n distinct elements by compar- isons , let us observe that a single comparison splits the permutation space into twoparts.Thus tofindtherightpermutationthatsortsallelements,westartfromthe identity permutation and by making comparison operations we split the permutation space until we get a singleton permutation set: the permutation solution for sorting. That is, we ask for the depth of a decision tree on the permutation set. For a binary treewith n nodes, thedepthofthetreeisatleast log n(with ·denoting thefloor 2 function), and since we have n potential permutations, we can deduce that the mini- maldepthofadecisiontreeis log n.Using Stirling formulaforapproximatingthe 2 √ n n factorial n: n∼ 2πn( ) , we deduce that log n= O(n log n). This proves that 2 e sorting sequentially requires Ω(n log n) elementary comparison operations. Let us emphasize that a lower bound holds only on the considered computation model. We usually assume the real-RAM model where elementary arithmetic operations can be computed on the reals in constant time and without any numerical precision issues. On other computation models, we can sort deterministically in O(n loglog n) time usinglinearmemoryspace1byatechniquecalledintegersorting.Thereexistsalso adaptive algorithmsthatsortprovablyquickeralreadypartiallysortedsequences2. Divide recursively the lists Merge recursively sorted sub-lists102 4 Parallel Sorting 4.2 Parallel Sorting by Merging Lists Figure4.2 shows how to parallelize the merge sort algorithm, by a fine-grained parallelism. We use P = n processes to divide data and merge recursively the sorted sub-lists. Let us study the sequential time of this algorithm with this complexity analysis:   log n  n i t = O 2 = O(n log n). seq i 2 i=1 To contrast with, the complexity of a parallel implementation of merge sort is:   log n  n t = O 2 = O(n), par i 2 i=0  n+1 n 1−q k since q = . k=0 1−q t seq This method is thus inefficient since the obtained speed-up is = O(log n). t par Ideally, we aim at an optimal linear speed-up factor of O(P) = O(n). As depicted in Fig.4.2, when we merge sub-lists, some processes happen to have no workload. 4.3 Parallel Sorting Using Ranks An important question is whether we can sort in parallel in O(log n) time? We shall see that this is possible indeed with a trivial parallel algorithm based on computing the ranks of elements. However this RankSort algorithm does not yield an optimal speed-up. For each data element Xi, let us compute its rank in the array defined by: Ri=X j∈ X X j Xi. That is, the rank Ri of Xi is the number of array elements strictly less than itself. The smallest element has rank 0 and the largest element has rank n − 1. Then we put the elements into a new auxiliary array, Y, that will be sorted as follows: YRi = Xi∀i. Here, we have assumed that all elements were distinct to avoid degenerate rank definition (and get a total order on the elements). Computing the rank of an element can also be parallelized easily on P = n nodes as follows: For a given element Xi, we evaluate the predicate X j Xi,∀ j ∈ 1,..., n, and we aggregate all predicate evaluations by counting 0 when the pred- icate is false, and 1 when the predicate is true. That is, we have:4.3 Parallel Sorting Using Ranks 103 n−1  Ri= 1 X jXi   j=0 Boolean predicate value converted to 0 (false) or 1 (true) The sequential RankSort is described by the following code using a double loop: for (i =0; i n; i++) // for each element rang =0; for (j =0; j n; j++) // we count the number of elements smaller than itself if (ai aj) rang++; // then we copy the element at its right position into new array b brang= ai; 2 Sequential rank sorting costs quadratic time, t = O(n ), since it costs linear seq time to compute the rank of a single element. But its parallel implementation using P = n processors is linear, t = O(P) = O(n). par 2 Now, let us consider parallel RankSort with P = n processes. This can be done in practice for small values of n by using the Graphical Processing Units (GPUs) that are made of thousands of graphics cores. To compute the rank of an element, we now use n processes for evaluating the predicate and performs a reduce operation (a prefix sum,MPI_Reduce/MPI_Sum 2 2 in MPI). Therefore we use P = n for computing all ranks. Process P evaluates i, j boolean predicate 1 , and we compute the rank of Xi by aggregating all X jXi predicate values of processes P . Here, the P denotes the group of processes P i,∗ i,∗ i, j pour 1 ≤ j ≤ P (Fig.4.3). ai a0 ai a1 ai a2 ai a3 comparison 0/1 0/1 0/1 0/1 + + 0/1/2 0/1/2 + 0/1/2/3/4 Fig. 4.3 Computing the rank in RankSort by aggregating the result of the boolean predicates 1 , counting one each time it is true: This is a collaborative reduce operation X jXi reduce104 4 Parallel Sorting Overall, the parallel time of rank sort with a quadratic number of processes is the time required by a collaborative reduce operation. This reduce operation depends on thetopologyoftheinterconnectionnetwork.Itcanbedoneinlogarithmictimeforan hypercube topology of dimension log n, but takes linear time for the ring topology. 2 Thus, by using P = n processors on the hypercube topology, we get: t = O(log n). par Whenwechoosethecompleteinterconnectiongraphforthetopology(theclique), then the reduce operation can be calculated in constant time (assuming that we can receive at the same time the P − 1 data from its neighbors), and the RankSort 2 algorithm with P = n processes requires constant time, O(1). 4.4 Parallel Quicksort Werecallthatforapivotvalue x,wepartitiondataintotwoarrays: X and X .Here, ≤ forthesakeofconciseness,wemergedthe X arraywiththe X array.Thenwerecur- = sively sort the sub-arrays X ← QuickSort(X ) and X ← QuickSort(X ), ≤x ≤x x x and finally we get the sorted array by concatenation as: QuickSort(X) = (QuickSort(X ),QuickSort(X )). ≤x x Whenwechooserandomly x ∈ X,wegetarandomizedalgorithmwithexpected- ˜ time complexity O(n log n). Otherwise, when the pivot is chosen deterministically, we may compute the median element (an order statistics operation in linear time) to balance fairly the two sub-arrays, and we obtain a deterministic algorithm, in O(n log n) time. The QuickSort algorithm in C++ using the Standard Template Library (STL) is given below: WWW source code:SequentialQuickSort.cpp // filename: SequentialQuickSort.cpp include vector.h include iostream.h include multiset.h include algo.h // pivot template class T void quickSort(vectorT&v, unsigned int low, unsigned int high) if (low = high) return; // select median element for the pivot unsigned int pivotIndex =(low + high)/ 2;4.4 Parallel Quicksort 105 // partition pivotIndex = pivot (v, low, high, pivotIndex); // sort recursively if (low pivotIndex) quickSort(v, low, pivotIndex ); if (pivotIndex high) quickSort(v, pivotIndex + 1, high); template class T void quickSort(vectorT& v) unsigned int numberElements = v.size (); if (numberElements 1) quickSort(v,0, numberElements -1); template class T unsigned int pivot (vectorT& v, unsigned int start, unsigned int stop, unsigned int position) //swap pivot with initial position swap (vstart, vposition); // partition values unsigned int low = start +1; unsigned int high = stop; while (low high) if (vlow vstart) low++; else if (vhigh vstart) swap (vlow, vhigh); // swap again pivot with initial element swap (vstart, vlow); return low; void main() vectorint v(100); for (int i =0; i 100; i++) vi= rand(); quickSort(v); vectorint::iterator itr = v.begin(); while (itr = v.end ()) cout itr " "; itr++; cout "\n"; The median element for n = 2m + 1elements isthe middle element of the sorted n−1 array, located at position m = . For even number of elements, we choose the 2 n median element to be at position  . Algorithm1 recalls the classic recursive 2 linear-time algorithm to compute the median (or any other ranked element by a divide and conquer technique with pruning). These selection algorithms are called order statistics.106 4 Parallel Sorting Data: S a set of n =S number, k ∈ N Result: Return the k-th element of S if n ≤ 5 then // Terminal case of recursion Sort S and return the k-th element of S; else n Partition S in groups; 5 // The last group has 5 (complete) or n mod 5 elements (incomplete) n Compute recursively the group medians M =m ,..., m ; 1 5 // Calculate the pivot x as the median n +1 n 5 x ← SELECT(M, , ); 5 2 Partition S into two sub-sets L =y ∈ S : y ≤ x and R =y ∈ S : y x; if k ≤L then return SELECT(L,L, k); else return SELECT(R, n−L, k −L); end end n Algorithm 1: Computing the k-th element (median when k = )usinga 2 recursive procedure SELECT (deterministic) in linear time. Now, let us parallelize Quicksort:Let P computers (each taking charge of a process).Weseektosortdataalreadydistributedonthelocalmemoryofthemachines n P ,..., P into P sub-sets X ,..., X of size . Without loss of generality, 0 P−1 0 P−1 P we assume that n can be divided by P: n mod P = 0. We shall write X ≤ X if and only if ∀x ∈ X ,∀x ∈ X , x ≤ x . Initially, i j i i j j i j all groups X ,..., X are unordered. The key idea of this first parallelization of 0 P−1 Quicksort is to partition data on processes by exchanging messages so that at the end of the partition, we have X ≤ ··· ≤ X . A straightforward implementation 0 P−1 consists in randomly choosing the pivot x, and to broadcast this pivot value to all the other processes. Then each process P partitions its array into two sub-arrays p p p X and X using the pivot. Furthermore, each process of the upper process group ≤ p  p ≥ P/2 sends its lower list X to a corresponding process p = p − P/2 ≤ P/2, ≤  p and receives an upper list X , and vice-versa. Processes then split into two groups, and we recursively apply the parallel quicksort algorithm. Figure4.4 illustrates this algorithm denoted by Quicksort //, a short cut for Parallel QuickSort. Notice that the sequential Quicksort algorithm with log P recursive levels yields a tree of calling functions that we can visualize on the function stack. It partitions ˜ data: X ≤ X ≤ ··· ≤ X in expected time O(n log P) (randomized algorithm) 0 1 P−1 such that the group data X are not yet sorted but that we have X ≤ X for all i ≤ j. i i j4.4 Parallel Quicksort 107 P 0 x P 1 P 2 P 3 Selecting pivot and broadcasting ≤ ≤ P P 0 x 0 x P ≤ x P ≤ x 1 1 P P 2 2 x x ≤ ≤ P P 3 3 ≤ x ≤ x sub-array exchanges Local group partitions with respect to the pivot lower/upper process groups P x P 0 0 x ≤ P P 1 1 ≤ x P ≤ x’ P 2 2 ≤ x’ P ≤ P 3 3 ≤ x’ Process partition, pivot selection, broadcast Local partitions P P 0 0 x ≤ P P 1 1 ≤ x P P 2 2 ≤ x’ P P 3 3 ≤ x’ Exchanging sub-arrays Final stage: recurse on process group Local sorting on each process sub-array Fig. 4.4 Illustrating parallel QuickSort: selecting the pivot, broadcasting the pivot, partitioning locallydatawiththepivot,exchangingsub-arraysbetweencorrespondingprocesses,andperforming recursion. Notice that depending on the pivot choices the sub-arrays can be quite unbalanced108 4 Parallel Sorting It then remains to sort locally data inside each process using a dedicated sequential sorting algorithm, like the sequential QuickSort or the sequential mergesort. Let us summarize our first parallelization of the Quicksort algorithm as follows: • theupperprocesses(withrankgreaterthan P/2)havedatavaluesabovethepivot, and the lower processes (with rank less than P/2) have data values less than the pivot, • after log P recursive calls, each process has a list of values disjoint from all the others, and • the largest element of process P is less or equal than the smallest element of P i i+1 for all i, and • intheterminalcaseoftherecursion,eachprocesssortsitsowngroupofdatausing a sequential sorting algorithm like QuickSort. One of the major drawback of Quicksort // is that processes have potentially very different workload (size of their sub-lists). Indeed, the size of the process sub-lists depends of the chosen pivots that are broadcasted when splitting the process groups. This load-unbalancing phenomenon is illustrated in Fig.4.4 where it is graphically indicated that the size of sub-lists can be quite different from one stage to another stage.Weshallnowstudytwoalgorithmsthatconsidertheload-balancingissueatthe heart of their parallelization: The HyperQuickSort algorithm and the Parallel Sorting by Regular Sampling (or PSRS for short) algorithm. 4.5 HyperQuickSort In the HyperQuickSort algorithm, the P processes first start by calling a sequential n n n sorting procedure on their local data elements, in O( log ) time. Then the P P P process that has in charge to choose the pivot, choose the median from its sorted list n (hence,atindex ).This“pivotprocess”broadcaststhepivottoallotherprocessesof 2P its group. Processes then partition their data into two sub-lists X and X according ≤ tothepivotvalue.ThenweproceedsimilarlyasforQuickSort//:Processesexchange upper and lower sub-lists with partner process, and on each process, we merge the two sorted sub-lists into a sorted list (in linear time). Finally, we recursively call HyperQuickSort on the processes of its group. Figure4.5 illustrates this recursive HyperQuickSort algorithm. Let us analyze the amortized mean complexity of HyperQuickSort by making the following hypotheses: Lists are assumed to be more or less balanced, and com- munication times are dominated by transmission time (that is, latency times are n n ˜ ignored). The initial sequential Quicksort call costs O( log ), the comparisons for P P n ˜ the log P merging stages cost O( log P), and the cost of communications for the P n ˜ log P sub-list exchange is O( log P). Thus the overall parallel global time is in P n ˜ ˜ O( log(P + n)). Therefore, we obtain an optimal speed-up factor in O(P) (under P our mild no latency and well-balanced sub-list assumptions).4.5 HyperQuickSort 109 P2 11,50,53,95,36,67,86,44 P3 35,16,81,1,44,23,15,5 P2 11,36,44,50,53,67,86,95 P3 1,5,15,16,23,35,44,81 P0 97,48,16,8,66,96,17,49 P1 58,76,54,39,82,47,65,51 P0 8,16,17,48,49,66,96,97 P1 39,47,51,54,58,65,76,82 (1) (2) P 11,36,44,50,53,67,86,95 P 1,5,15,16,23,35,44,81 P 11,36,44,50,53,67,86,95 P 1,5,15,16,23,35,44,81 2 3 2 3 P 8,16,17,48,49,66,96,97 P 39,47,51,54,58,65,76,82 P 8,16,17,48,49,66,96,97 P 39,47,51,54,58,65,76,82 0 1 0 1 (3) (4) P2 49,66,96,97,50,53,67,86,95 P3 51,54,58,65,76,82,81 P2 49,50,53,66,67,86,95,96,97 P3 51,54,58,65,76,81,81 P 8,16,17,48,11,36,44 P 1,5,15,16,23,35,44,39,47 P 8,11,16,17,36,44,48 P 1,5,15,16,23,35,39,44,47 0 1 0 1 (5) (6) P 49,50,53,66,67,86,95,96,97 P 51,54,58,65,76,81,81 2 3 P 49,50,53,66,67,86,95,96,97 P 51,54,58,65,76,81,81 2 3 P 8,11,16,17,36,44,48 P 1,5,15,16,23,35,39,44,47 0 1 P 8,11,16,17,36,44,48 P 1,5,15,16,23,35,39,44,47 0 1 (7) (8) P2 76,81,82,86,95,96,97 P3 51,54,58,65,49,50,53,66,67 P2 76,81,82,86,95,96,97 P3 49,50,51,53,54,58,65,66,67 P 8,11,16,17,1,5,15,16 P 36,44,48,23,35,39,44,47 P 1,5,8,11,15,16,16,17 P 23,35,36,39,44,44,47,48 0 1 0 1 (9) (10) Fig. 4.5 Illustration of the HyperQuickSort algorithm: (1) Initialization, (2) chosen pivot 48, (3) data partitioning with respect to 48, (4) sub-list exchange between pairs of processes, (5) lists have been exchanged, (6) merging lists, (7) recursive calls → Pivot 6717, (8) partition and exchange, (9) lists exchanged, (10) merging sorted sub-lists In practice, the process lists that we have assumed more or less balanced are not well balanced in real-world applications Thus we shall present our last alternative choice for parallel sorting. This last algorithm chooses better pivots that make parti- tions further well balanced in practice. This method goes by the name Parallel Sort Regular Sampling (PSRS). 4.6 Parallel Sort Regular Sampling (PSRS) The Parallel Sort Regular Sampling algorithm (or PSRS method for short) proceeds in four stages. Here, we do not assume anymore that the number of processes is a power of 2, and P can be an arbitrary natural number. Let us describe the algorithm PSRS as follows: 1. Each process P sorts its own local data using a sequential algorithm (say, i QuickSort), and choose P elements sampled at the following regular positions:110 4 Parallel Sorting n 2n (P − 1)n 0, , ,..., 2 2 2 P P P We thus obtain a regular sampling of the sorted data. 2. A process gathers and sorts all these regular samples, and then selects P − 1 pivots among these P × P samples. This process broadcasts these P −1pivots, and all processes partition their local data into P pieces. 3. Each process P keeps its ith partition, and sends its jth partition to process i P , ∀ j = i. This is a total exchange (or all-to-all) collaborative communication i primitive. 4. Each process merges its P partitioned arrays into a final sorted list. Figure4.6 schematically illustrates the work flow of the PSRS algorithm on a given toy data-set. input array to sort 91 1763 12 1578 14 1345 11 16 2010 3 6 9 12 15 17 4 7 11 13 14 16 015 2 810 P P P 0 1 2 39 15 4 1114 0 28 step 1 : local sorting and regular sampling 0231 489 11145 P −1 pivots 411 step 2 gather and selection of P − 1 pivots 411 411 411 3 6 9 12 15 17 4 7 11 13 14 16 015 2 810 step 3 : partitionming and all-to-all exchange P0 P1 P2 P P 0 P 0 0 3 69 12 15 17 P 1 P P 1 1 4 711 13 14 16 P 2 P 0 12 58 10 P ∅ empty array 2 2 step 4 :merging locally P sub-arrays to a sorted array 411 0 1 2 3456789 1011121314151617 sorted array Fig. 4.6 Example of execution of the Parallel Sort Regular Sampling algorithm, PSRS4.6 Parallel Sort Regular Sampling (PSRS) 111 We analyze the complexity of this sorting algorithm as follows: Each process n merges about elements, and this is the time experimentally observed empirically P in practice We assume that the interconnection network of the processes allows to have P simultaneous communications. Let us summarize the costs of these different stages of PSRS as follows: • Local computation cost: n n ˜ – QuickSort in time O( log ), P P 2 – Sorting regular samples : O(P log P), n – Merging sub-lists : O( log P). P • Communication cost: – Gathering samples, broadcasting pivots, n – Total exchange: O( ). P 4.7 Sorting on Grids: ShearSort Here, we demonstrate a simple parallel sorting algorithm well suited to the grid topology: the ShearSort parallel algorithm. At the final stage, the sorted sequence can either be ordered line by line on the grid, or ordered following a snake pattern √ √ as depicted in Fig.4.7.Let P = P P = n be the number of grid processors. To sort on the process grid, we alternatively sort on rows and sort on columns until we get the sorted sequence after log n stages. For the snake pattern, we only need to alternate the sorting directions (meaning in increasing order or in decreasing order) on lines. Let us analyze the complexity of this ShearSort sorting on a 2D grid of dimen- √ √ √ √ sion P = n × n = n. By sorting in parallel n numbers in O( n) time, the smallest number snake-like pattern largest number Fig. 4.7 Here, after completion of the ShearSort parallel sorting, the sorted elements are stored in the snake pattern112 4 Parallel Sorting √ √ parallel time we get is t = O((log n) × n) = O( n log n). The cost of the par T seq sequential algorithm is t = O(n log n). Thus we obtain a speed-up in = seq T par √ √ O( n) = O( P), that is not optimal Figure4.8 illustrates the various sorting stages of the ShearSort algorithm for a given input sequence. Fig. 4.8 The various stages 414 8 2 of the ShearSort:log n steps are required to produce a 10 3 13 16 final totally ordered sequence 7 5 15 1 6 9 12 11 grid initialization 241 84 1 4 3 7 16 10 13 3 8 2 5 6 1 5 715 12 11 9 14 12 11 9 6 16 13 10 15 row sort (alternate) column sort 1 3 4 7 1 2 3 4 5 2 8 6 5 8 6 7 9 11 12 14 9 11 12 10 16 15 13 10 16 15 13 14 row sort (alternate) column sort 1 2 3 4 1 2 3 4 8 5 7 6 8 7 5 6 9 10 11 12 9 10 11 12 16 15 14 13 16 15 14 13 row sort (alternate) result4.8 Sorting Using Comparison Network: Odd–Even Sorting 113 4.8 Sorting Using Comparison Network: Odd–Even Sorting Let us now consider sorting by comparing pairs of elements. We introduce a sorting network, called odd–even transposition (or odd–even sorting). The main principle relies on the bubble sort idea. Here, sorting requires two stages per cycle, namely: • Odd stage: we compare and exchange (swap) the odd pairs of elements: (X0, X1),(X2, X3),... . • Even stage: we compare and exchange (swap) the even pairs of elements: (X1, X2),(X3, X4),... . Tosortanarrayof n elements,thisalgorithmrequires n cyclesofodd–evenstages. Figure4.9 depicts a running example for this comparison network sorting algorithm. In the C/C++ programming language, this algorithm can be concisely imple- mented as follows: WWW source code:OddEvenSort.cpp // filename: OddEvenSort.cpp void OddEvenSort(int a, int n) int phase, i; for (phase =0; phase n; phase++) if (phase % 2 == 0) // even stage for (i =1; i n; i += 2) if (ai-1 ai) swap(&ai, &ai-1); else // odd stage for (i =1; i n-1; i += 2) if (ai ai+1) swap(&ai, &ai+1); Wecangeneralizethisodd–evensortingalgorithmbyconsideringpairsof groups of elements instead of pairs of singleton elements. We sort the n/P local elements inside each group of the process (say, using your favorite sequential algorithm like the sequential QuickSort), and then we send and receive the corresponding elements fromthepairsofadjacentprocesses.Whentherankoftheprocessislessthantherank of the process of its matching pair, we keep half of the smaller values, otherwise we keephalfofthelargervalues.Werepeat P timesthisgroupodd–evencycle.Thuswe114 4 Parallel Sorting input 18 15 22 10 23 11 even stage 15 18 10 22 11 23 odd stage 15 10 18 11 22 23 even stage 10 15 11 18 22 23 odd stage 10 11 15 18 22 23 even stage 10 11 15 18 22 23 Fig. 4.9 Sorting by odd–even transposition: it requires n odd–even cycles to produce a sorted sequence can tune up the granularity of the parallelism by considering P ranging from P = n (fine grained parallelism) to P = 2 (coarsest-grained parallelism). Notice that this algorithm can be fairly easily implemented. Figure4.10 illustrates the different steps of this algorithm. Let us now analyze the complexity of this group odd/even sorting algorithm: the n n n initial sequential sorting requires O( log ) for sorting groups of size . Then we P P P n repeat P cycles by sorting the smallest values from the largest values in time O( ) P (bymerginglistsandbykeepingtherighthalfforeachprocess),andwecommunicate4.8 Sorting Using Comparison Network: Odd–Even Sorting 115 P P P P 0 1 2 3 Initial configuration 15,11,9,16 3,14,8,74,6,12,10 5,2,13,1 After sorting group data stored in local memory 9,11,15,16 3,7,8,14 4,6,10,12 1,2,5,13 Stage 1 (odd) 3,7,8,911,14,15,16 1,2,4,56,10,12,13 Stage 2 (even) 3,7,8,91,2,4,511,14,15,16 6,10,12,13 Stage 3 (odd) 1,2,3,45,7,8,96,10,11,12 13,14,15,16 Stage 4 (even) 1,2,3,45,6,7,89,10,11,12 13,14,15,16 Sorting completed Fig. 4.10 Generalizing the odd–even pair sorting algorithm to odd–even group sorting. The gran- ularity of the parallelism depends on the group size of data stored in the local memory of processes n to each process O( ) elements. Neglecting the latency times of communications, p n n wethusobtainanoverallcomplexityintime O( log + n).Thisalgorithmisvery P P interesting on a communication network that uses the bidirectional ring topology. 4.9 Merging Sorted Lists Using a Comparison Network From a comparison-swap block, we can build a tailored circuit that implements algorithmssortingnumbers.Figure4.11illustratesthebasicelementofthesecircuits: the comparison-swap box. We can sort two sorted sub-lists by using a comparison networkimplementedinhardwareasshowsinFig.4.12.Thuswecanbuildaphysical comparison network recursively as it is depicted in Fig.4.12.116 4 Parallel Sorting compare a max(a,b) and min(a,b) b swap Fig. 4.11 The comparison-swap elementary hardware box that takes as input two numbers and returns in output the minimum and the maximum of the input elements. Fig. 4.12 Comparison sorted 2458 1367 network for merging two sub-lists sorted sub-lists (top) built odd-even generically recursively comparisons (bottom) from basic comparison-swap blocks 1256 3478 sorted sequence123 45 67 8 c 2n b n c 2n−1 b n−1 c 2n−2 Sorting with MergeSort even b 4 b 3 b 2 b 1 a n a n−1 Sorting with c 7 MergeSort c 6 even c 5 c a 4 4 c a 3 3 c a 2 2 a c 1 1 4.10 The Bitonic Merge Sort Finally, to conclude this chapter on parallel sorting, let us describe the bitonic merge 1 sort algorithm that was first proposed by Ken Batcher. A sequence is said bitonic if it is a unimodal sequence (that is, with a unique extremum point, let it be minimum or maximum) by considering the cyclic sequence. Now, we can search efficiently 1 The Bitonic Merge Sort 117 for an element in a bitonic sequence in logarithmic time by using a bisection search algorithm. To obtain a bitonic partition, we proceed as follows: • We associate to each element of the first half of the list, an element of the second n half of the list: x ↔ x . i i+ 2 • We compare these pairs of elements and sort them, so they are ordered following the (min,max) ordering. • Thus each element of the first half is guaranteed to be smaller than all elements of the second half by construction. n • The two half lists are bitonic sequences of length . 2 • Observe that this comparison sequence does not depend semantically of the data. This property is important and is different with respect to the MergeSort algorithm whose behavior depends on the semantic of the data. Therefore we get a binary splitting, and obtain as output two bitonic sequences B and B such that the elements of B are all smaller than the elements of B .The 1 2 1 2 terminal case of the recursion is when we have a sequence with a single element: in that case, it is trivially a sorted bitonic list Figure4.13 shows the workflow of this algorithm. An example of sorting with bitonic sort is reported in Fig.4.14. Let us analyze the complexity of the BitonicMergeSort: (1) Each bitonic partition n costs comparisons,(2)Wehavelog n recursionlevelsofbitonicsequencesplitting, 2 and (3) log n levels of merging of bitonic sequences. Thus the overall number of bitonic sequence 35897421 comparison and swap 35 421 789 bitonic sequence bitonic sequence By taking splitted in two halves the minima and maxima bitonic sequence we obtain two bitonic sequences Fig. 4.13 Top Splitting a bitonic sequence into two bitonic sequences using comparison-swap n blocksbetweenelements x and x . Bottom Anintuitivevisualproofthatbytakingtheminimum i i+ 2 and the maximum on these two sub-sequences, we indeed obtain two bitonic sequences118 4 Parallel Sorting 35897421 comparison and swap 35 421 789 4 1 3 7 589 2 unit bitonic 8 2 1 4 5 7 9 3 sequence (sorted ) sorted sequence Fig. 4.14 Recursive calls for splitting bitonic sequences into bitonic sub-sequences 2 comparison-swap elementary operations is in O(n log n). Figure4.15 shows the bitonic sorting algorithm implemented using a comparison network. 4.11 Notes and References The celebrated parallel sorting algorithms are covered in the following textbook 3. 2 Sorting in O(log n(loglog n) ) on the hypercube has been investigated in 4. Even if sorting has a well-known lower bound of Ω(n log n) on the real-RAM model of computation, we observe that in practice that it can be more or less easy to sort alreadypartiallysortedsequences:thus,weratherseekfor adaptive algorithms2for sortingthattakeintoaccountotherinputparametersinordertobemorecompetitivein practice,andintheworst-casetoyieldtheunadaptivecomplexityof O(n log n)time. Thefundamentalprimitiveinsortingisthecomparisonoperationthatgivenapair of elements produce a sorted pair as output: comparison (a, b) − −−−−−→ (min(a, b),max(a, b)) One can choose the granularity of parallelism by considering A and B as group of elements, and not as single elements anymore. Then the operation A B means    to sort A ∪ B and to return the pair (A , B ) with A the first half of sorted elements,  and B the second half of sorted elements. Thus we can build sorting network for parallel algorithms that control the granularity of the parallelism by adjusting the size of the groups for the basic sorting comparison-swap primitive.

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.