Kokkos: Manycore Device Performance Portability for C++ HPC Applications

H. Carter Edwards and Christian Trott
Sandia National Laboratories
GPU Technology Conference 2015, March 16-20, 2015, San Jose, California
SAND2015-1885C (Unlimited Release)
Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

What is "Kokkos"?
- κόκκος (Greek), translated "granule", "grain", or "speck"; like grains of salt or sand on a beach
- Programming model abstractions
  - Identify / encapsulate grains of data and parallelizable operations
  - Aggregate these grains with data structures and parallel patterns
  - Map aggregated grains onto memory and cores / threads
- An implementation of the Kokkos programming model
  - Sandia National Laboratories' open source C++ library

Outline
- Core abstractions and capabilities
  - Performance portability challenge: memory access patterns
  - Layered C++ libraries
  - Spaces, policies, and patterns
  - Polymorphic multidimensional array
  - Easy parallel patterns with C++11 lambda
  - Managing memory access patterns
  - Atomic operations
  - Wrap up
- Portable hierarchical parallelism
- Initial scalable graph algorithms
- Conclusion

Performance Portability Challenge
Best (or even decent) performance requires computations to implement architecture-specific memory access patterns.
- CPUs (and Xeon Phi)
  - Core-data affinity: consistent NUMA access (first touch)
  - Array alignment for cache lines and vector units
  - Hyperthreads' cooperative use of the L1 cache
- GPUs
  - Thread-data affinity: coalesced access with cache-line alignment
  - Temporal locality and special hardware (texture cache)
- The Array of Structures (AoS) vs. Structure of Arrays (SoA) dilemma, i.e., architecture-specific data structure layout and access
- This has been the wrong concern; the right concern: abstractions for performance portability

Kokkos' Performance Portability Answer
Integrated mapping of thread-parallel computations and multidimensional array data onto manycore architectures.
1. Map the user's parallel computations to threads
   - Parallel pattern: for-each, reduce, scan, task-DAG, ...
   - Parallel loop/task body: C++11 lambda or C++98 functor
2. Map the user's data to memory
   - Multidimensional array of data, with a twist
   - Layout: multi-index (i,j,k,...) ↔ memory location
   - Kokkos chooses the layout for an architecture-specific memory access pattern
   - Polymorphic multidimensional array
3. Access user data through special hardware (bonus)
   - GPU texture cache to speed up read-only random access patterns
   - Atomic operations for thread safety

Layered Collection of C++ Libraries
- Standard C++, not a language extension
  - In contrast to language extensions: OpenMP, OpenACC, OpenCL, CUDA
  - In the spirit of Intel's TBB, NVIDIA's Thrust and CUSP, MS C++AMP, ...
- Uses C++ template metaprogramming
  - Previously relied upon the C++1998 standard
  - Now requires C++2011 for lambda functionality; supported by Cuda 7.0, with full functionality in Cuda 7.5
- Participating in the ISO/C++ standard committee for core capabilities
- Layering: Application and Library Domain Layer(s); Trilinos Sparse Linear Algebra; Kokkos Containers and Algorithms; Kokkos Core; back-ends: Cuda, OpenMP, pthreads, specialized libraries, ...
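To make the "pure library" point above concrete, here is a minimal sketch of a complete Kokkos program; it is illustrative only (not taken from the slides), assumes a standard installation, and API details may differ across Kokkos versions.

    // Minimal sketch: Kokkos is used like any other C++ library; the default
    // execution space is whatever the installation selected (e.g., OpenMP or Cuda).
    #include <Kokkos_Core.hpp>
    #include <cstdio>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);          // set up the default execution space
      {
        const int N = 1000000;
        double sum = 0.0;
        // Data-parallel reduction dispatched through the default execution space.
        Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(const int i, double& partial) {
          partial += 1.0 / double(i + 1);
        }, sum);
        std::printf("harmonic(%d) = %f\n", N, sum);
      }
      Kokkos::finalize();                      // release back-end resources
      return 0;
    }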
Abstractions: Spaces, Policies, and Patterns
- Memory Space: where data resides
  - Differentiated by performance, e.g., size, latency, bandwidth
- Execution Space: where functions execute
  - Encapsulates hardware resources, e.g., cores, GPU, vector units, ...
  - Denotes accessible memory spaces
- Execution Policy: how (and where) a user function is executed
  - E.g., data-parallel range: concurrently call function(i) for i = 0..N
  - The user's function is a C++ functor or C++11 lambda
- Pattern: parallel_for, parallel_reduce, parallel_scan, task-DAG, ...
- Compose: pattern + execution policy + user function, e.g.,
    parallel_pattern( Policy<Space>, Function );
  executes Function in Space according to the pattern and Policy
- Extensible spaces, policies, and patterns

Examples of Execution and Memory Spaces
[Diagram: a compute node (multicore socket with DDR memory) and an attached accelerator (GPU with GDDR memory); data is moved between them with deep_copy, or shared via pinned host memory (GPU::capacity) or UVM (GPU::perform).]

Polymorphic Multidimensional Array View
- View< double**[3][8], Space > a("a", N, M);
  - Allocates array data in memory Space with dimensions N x M x 3 x 8
  - A C++17 core-language improvement has been proposed to simplify specifying such compile-time dimensions
- a(i,j,k,l): the user's access to an array datum
  - "Space" accessibility is enforced; e.g., GPU code cannot access CPU memory
  - Optional array bounds checking of indices for debugging
- View semantics: View< double**[3][8], Space > b = a;
  - A shallow copy: 'a' and 'b' are pointers to the same allocated array data
  - Analogous to C++11 std::shared_ptr
- deep_copy( destination_view, source_view );
  - Copies data from 'source_view' to 'destination_view'
  - Kokkos policy: never hide an expensive deep copy operation
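The explicit deep-copy policy is illustrated by the following minimal sketch (assuming a Cuda-enabled build; the function and variable names are illustrative, not from the slides): fill a host mirror, then copy to the device.

    #include <Kokkos_Core.hpp>

    void fill_on_host_then_copy(int N, int M) {
      Kokkos::View<double**, Kokkos::CudaSpace> a("a", N, M); // device allocation
      // Host-accessible mirror view with a layout compatible with 'a'.
      auto h_a = Kokkos::create_mirror_view(a);
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
          h_a(i, j) = double(i + j);
      Kokkos::deep_copy(a, h_a);  // explicit, visible data transfer (never hidden)
    }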
Polymorphic Multidimensional Array Layout
- Layout mapping: a(i,j,k,l) → memory location
- Layout is polymorphic, defined at compile time
- Kokkos chooses a default array layout appropriate for "Space"
  - E.g., row-major, column-major, Morton ordering, dimension padding, ...
- The user can specify the Layout: View< ArrayType, Layout, Space >
  - Overrides Kokkos' default choice of layout
  - Why? For compatibility with legacy code, algorithmic performance tuning, ...
- Example tiling layout
  - View< double**, Tile<8,8>, Space > m("matrix", N, N);
  - The tiling layout is transparent to user code: m(i,j) is unchanged
  - A layout-aware algorithm extracts a tile subview

Multidimensional Array Subview and Access Attributes
- Array subview of an array view (new)
  - Y = subview( X, ...ranges_and_indices_argument_list... );
  - A view of the same data, with the appropriate layout and index map
  - Each index argument eliminates a dimension
  - Each range [begin,end) argument contracts a dimension
- Access intent attributes: View< ArrayType, Layout, Space, Attributes >
  - How the user intends to access the data
  - Example: View with const and random-access intent
      View< double**, Cuda > a("my_matrix", N, N);
      View< const double**, Cuda, RandomAccess > b = a;
    Kokkos implements b(i,j) with the GPU texture cache

Multidimensional Array Functionality Being Considered by the ISO/C++ Standard Committee
- TBD: add layout polymorphism, a critical capability
  - To be discussed at the May 2015 ISO/C++ meeting
- TBD: add explicit (compile-time) dimensions
  - Minor change to the core language to allow: T[][][3][8]
  - Concern: performance loss when restricted to implicit (run-time) dimensions
- TBD: use a simple / intuitive array access API: x(i,j,k,l)
  - Currently considering: x[{i,j,k,l}]
  - Concern: performance loss due to the intermediate initializer list
- TBD: add shared pointer (std::shared_ptr) semantics
  - Currently merely a wrapper on user-managed memory
  - Concern: coordinating management of view and memory lifetime

Easy Parallel Patterns with C++11 and Defaults
parallel_pattern( Policy<Space>, UserFunction )
- Easy example, BLAS-1 AXPY with views:
    parallel_for( N, KOKKOS_LAMBDA( int i ) { y(i) = a * x(i) + y(i); } );
  - The default execution space is chosen for the Kokkos installation
  - The execution policy "N" is shorthand for RangePolicy<DefaultSpace>(0,N)
  - #define KOKKOS_LAMBDA [=]            /* non-Cuda */
  - #define KOKKOS_LAMBDA [=] __device__ /* Cuda 7.5 candidate feature */
  - Tell the NVIDIA Cuda development team you like and want this in Cuda 7.5
- More verbose without lambda and defaults:
    struct axpy_functor {
      View<double*,Space> x, y;
      double a;
      KOKKOS_INLINE_FUNCTION
      void operator()( int i ) const { y(i) = a * x(i) + y(i); }
      // ... constructor ...
    };
    parallel_for( RangePolicy<Space>(0,N), axpy_functor(a,x,y) );

Kokkos Manages the Challenging Part of a Pattern's Implementation
- Example: DOT product reduction
    parallel_reduce( N, KOKKOS_LAMBDA( int i, double & value ) { value += x(i) * y(i); }, result );
- Challenges: temporary memory and inter-thread reduction operations
  - Cuda shared memory for inter-warp reductions
  - Cuda global memory for inter-block reductions
  - Intra-warp, inter-warp, and inter-block reduction operations
- The user may define the reduction type and operations
    struct my_reduction_functor {
      typedef ... value_type;
      KOKKOS_INLINE_FUNCTION void join( value_type &, const value_type & ) const;
      KOKKOS_INLINE_FUNCTION void init( value_type & ) const;
    };
  - 'value_type' can be a runtime-sized one-dimensional array
  - 'join' and 'init' are plugged into the inter-thread reduction algorithm
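A minimal sketch of such a user-defined reduction, tracking the minimum and maximum of x(i) in one pass with a runtime-sized value_type. This is illustrative, not from the slides; exact join()/init() signatures have varied across Kokkos versions (older releases used 'volatile' qualifiers), so treat it as a sketch of the concept.

    #include <Kokkos_Core.hpp>
    #include <cfloat>

    struct min_max_functor {
      typedef double value_type[];        // runtime-sized: value[0]=min, value[1]=max
      typedef Kokkos::View<const double*> view_type;

      view_type x;
      const int value_count;              // required when value_type is an array

      min_max_functor(view_type x_) : x(x_), value_count(2) {}

      KOKKOS_INLINE_FUNCTION
      void operator()(const int i, value_type v) const {
        if (x(i) < v[0]) v[0] = x(i);
        if (x(i) > v[1]) v[1] = x(i);
      }
      KOKKOS_INLINE_FUNCTION
      void join(value_type dst, const value_type src) const {
        if (src[0] < dst[0]) dst[0] = src[0];
        if (src[1] > dst[1]) dst[1] = src[1];
      }
      KOKKOS_INLINE_FUNCTION
      void init(value_type v) const {
        v[0] =  DBL_MAX;
        v[1] = -DBL_MAX;
      }
    };

    // Usage sketch:  double result[2];
    //                Kokkos::parallel_reduce(N, min_max_functor(x), result);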
Managing Memory Access Patterns: Compose Parallel Execution with Array Layout
- Map parallel execution
  - Maps calls to function(iw) onto threads
  - GPU: iw = threadIdx + blockDim * blockIdx
  - CPU: iw ∈ [begin,end), contiguous partitions among threads
- Choose the multidimensional array layout
  - The leading dimension is the parallel work dimension
  - The leading multi-index is 'iw': a( iw, j, k, l )
  - Choose the array layout appropriate for the space's architecture
  - E.g., AoS for CPU and SoA for GPU
- Fine-tune the array layout
  - E.g., padding dimensions for cache-line alignment

Performance Impact of Access Pattern
- Molecular dynamics computational kernel in miniMD
- Simple Lennard-Jones force model:
    F_i = 6ε Σ_{j : r_ij < r_cut} [ (ς/r_ij)^7 − 2 (ς/r_ij)^13 ]
- Atom neighbor list to avoid N^2 computations:
    pos_i = pos(i);
    for ( jj = 0; jj < num_neighbors(i); jj++ ) {
      j = neighbors(i,jj);
      r_ij = pos_i - pos(j);              // random read of 3 floats
      if ( r_ij < r_cut )
        f_i += 6*e*( (s/r_ij)^7 - 2*(s/r_ij)^13 );
    }
    f(i) = f_i;
- Test problem: 864k atoms, ~77 neighbors each; 2D neighbor array; different layouts for CPU vs. GPU; random read of 'pos' through the GPU texture cache
- Large performance loss with the wrong array layout
  [Figure: GFlop/s on Xeon, Xeon Phi, and K20x for correct layout (with texture), correct layout (without texture), and wrong layout (with texture)]

Atomic Operations
- atomic_exchange, atomic_compare_exchange_strong, atomic_fetch_add, atomic_fetch_or, atomic_fetch_and
- Thread-scalability of non-trivial algorithms and data structures
  - Essential for lock-free implementations
  - Concurrent summations to shared variables, e.g., finite element computations summing to shared nodes
  - Updating shared dynamic data structures, e.g., appending to a shared array or inserting into a shared map
- Portably map to compiler/hardware-specific capabilities
  - GNU and CUDA extensions when available
  - Current: any 32-bit or 64-bit type; may use a CAS-loop implementation
- ISO/C++ 2011 and 2014 atomics are not adequate for HPC
  - Proposed the necessary improvements for C++17

Thread-Scalable Fill of a Sparse Linear System
- MiniFENL: Newton iteration of FEM: x_{n+1} = x_n − J^{-1}(x_n) r(x_n)
- Fill the sparse matrix via Scatter-Atomic-Add or Gather-Sum
  - Scatter-Atomic-Add: simpler, less memory; slower HW atomic
  - Gather-Sum: bitwise reproducibility
- Performance win for scatter-atomic-add: equal on Xeon Phi, ~40% faster on the Kepler GPU; this pattern was chosen
  [Figure: matrix fill time in microsec/node vs. number of finite element nodes (1e3 to 1e7) for Phi-60, Phi-240, and K40X, each with GatherSum and ScatterAtomic variants]
- Feedback to HW vendors: performant atomics matter
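A minimal sketch of the scatter-atomic-add fill pattern just described (illustrative only; the element-to-node map, sizes, and function names are assumptions, not the miniFENL source): many elements concurrently add their contributions into shared nodal values using Kokkos atomics.

    #include <Kokkos_Core.hpp>

    void scatter_add(Kokkos::View<const int*[8]> elem_node,     // element -> 8 node ids
                     Kokkos::View<const double*[8]> elem_value, // per-element contributions
                     Kokkos::View<double*> node_value)          // shared nodal sums
    {
      const int nelem = elem_node.extent(0);
      Kokkos::parallel_for(nelem, KOKKOS_LAMBDA(const int e) {
        for (int k = 0; k < 8; ++k) {
          // Thread-safe update of a nodal value that many elements share.
          Kokkos::atomic_add(&node_value(elem_node(e, k)), elem_value(e, k));
        }
      });
    }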
Core Abstractions and Capabilities (wrap up)
- Abstractions
  - Identify / encapsulate grains of data and parallelizable operations
  - Aggregate these grains with data structures and parallel patterns
  - Map aggregated grains onto memory and cores / threads
- Grains and patterns
  - Parallelizable operation: C++11 lambda or C++98 functor
  - Parallel pattern: for-each, reduce, scan, task-DAG, ...
  - Multidimensional array of data
  - Atomic operations
- Extensible mappings
  - Polymorphic multidimensional array: space, layout, access intentions
  - Execution policy: where and how to execute
- Next step: finer-grain parallelism with hierarchical patterns
  - κόκκος: "like grains of sand on a beach"; how fine can we go?

Outline
- Core Abstractions and Capabilities
- Portable Hierarchical Parallelism
  - Two-level thread-team execution policy and nested parallel patterns
  - Thread-team shared memory
  - Three-level execution policy
  - Application to molecular dynamics kernels
  - Application to tensor mathematics kernels
- Initial Scalable Graph Algorithms (very new)
- Conclusion

Thread Team Execution Policy
- Expose and map more parallelism
- Vocabulary
  - OpenMP: league of teams of threads
  - Cuda: grid of blocks of threads
- Thread team functionality
  - Threads within a team execute concurrently; teams do not execute concurrently
  - Nested parallel patterns: for-each (parallel_for), reduce (parallel_reduce), scan
  - Team-shared scratch memory
- Thread team portability: map onto hardware
  - Cuda: team == thread block, possibly a sub-block group of warps
  - Xeon Phi: team == hyperthreads sharing an L1 cache
  - CPU: team == thread

Thread Team Example: Sparse Matrix-Vector Multiplication
- Traditional serial compressed row storage (CRS) algorithm:
    for ( int i = 0 ; i < nrow ; ++i )
      for ( int j = irow(i) ; j < irow(i+1) ; ++j )
        y(i) += A(j) * x( jcol(j) );
- Thread team algorithm, using a C++11 lambda:
    typedef TeamPolicy<Space> policy ;
    parallel_for( policy( nrow /* leagues */ ),
      KOKKOS_LAMBDA( policy::member_type const & member ) {
        double result = 0 ;
        const int i = member.league_rank();
        parallel_reduce( TeamThreadRange( member, irow(i), irow(i+1) ),
          [&]( int j , double & val ) { val += A(j) * x( jcol(j) ); }, result );
        if ( member.team_rank() == 0 ) y(i) = result ;
      } );

Thread Team Shared Scratch Memory
- Challenges
  - Multiple arrays residing in shared scratch memory
  - Arrays may have runtime dimensions
  - Arrays' dimensions possibly dependent upon team size
- Approach: reuse Kokkos abstractions
  - Shared scratch memory space of the execution space
  - Manage each array with a View defined on this space
  - A thread team executing in the execution space is given an instance of the associated shared scratch memory space
- Capability available via a user-defined functor
  - Typically needs richer information than a C++11 lambda can provide
- Example below.

Team Shared Scratch Memory Example
    struct my_functor {
      typedef TeamPolicy<ExecutionSpace> Policy ;
      typedef ExecutionSpace::scratch_memory_space Scratch ;
      typedef View<double**,Scratch,MemoryUnmanaged> SharedView ;

      int nx , ny ;

      KOKKOS_INLINE_FUNCTION
      void operator()( Policy::member_type const & member ) const {
        Scratch shmem_space = member.team_shmem();
        SharedView x( shmem_space, member.team_size(), nx );
        SharedView y( shmem_space, member.team_size(), ny );
        // ... team fill of arrays ...
        member.team_barrier();
        // ... team computations on arrays ...
        member.team_barrier();
      }

      // Query shared memory size before parallel dispatch:
      size_t team_shmem_size( int team_size ) const {
        return SharedView::shmem_size( team_size , nx )
             + SharedView::shmem_size( team_size , ny );
      }
    };
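How such a functor might be launched, as a hedged sketch: 'n_teams' and 'team_size' are assumed values, and newer Kokkos versions also allow scratch sizes to be specified on the policy itself rather than through this functor hook.

    my_functor f;
    f.nx = 32;  f.ny = 32;   // runtime scratch-array dimensions
    // Kokkos queries f.team_shmem_size(team_size) to size each team's scratch pad.
    Kokkos::parallel_for( Kokkos::TeamPolicy<ExecutionSpace>( n_teams, team_size ), f );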
Thread Team Execution Policy, 3rd Level
- Add a third level of vector parallelism
- Map the algorithm's thread teams onto hardware resources
  - Cuda: "thread" == warp, "vector lane" == thread of the warp
  - Xeon Phi: "thread" == hyperthread, "vector lane" == SSE or AVX lane
- Vector parallelism functionality
  - Vector lanes execute lock-step concurrently
  - Consistent parallel patterns at the vector level: for-each, reduce, scan
  - New "single" pattern denoting that only one vector lane performs an operation
- Portably covers all levels used in sophisticated Cuda kernels
- C++11 lambdas are necessary for usability
  - Vector-parallel lambdas nested within team-parallel lambdas
  - Fortunately Cuda 6.5 supports C++ lambdas within device kernels
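A minimal sketch of nesting all three levels with TeamPolicy, TeamThreadRange, and ThreadVectorRange. The blocking scheme and all names here are illustrative assumptions, not code from the slides.

    #include <Kokkos_Core.hpp>

    // League of teams -> threads in a team -> vector lanes of a thread.
    void row_sums_by_blocks(Kokkos::View<const double**> A,
                            Kokkos::View<double*> row_sums,
                            int rows_per_block)
    {
      const int n_rows   = A.extent(0);
      const int n_cols   = A.extent(1);
      const int n_blocks = (n_rows + rows_per_block - 1) / rows_per_block;

      typedef Kokkos::TeamPolicy<>::member_type member_type;

      Kokkos::parallel_for(
        Kokkos::TeamPolicy<>(n_blocks, Kokkos::AUTO),      // league of teams
        KOKKOS_LAMBDA(const member_type& member) {
          const int block = member.league_rank();
          // Threads of the team split the rows of this block.
          Kokkos::parallel_for(Kokkos::TeamThreadRange(member, rows_per_block),
            [&](const int r) {
              const int row = block * rows_per_block + r;
              if (row >= n_rows) return;
              double sum = 0;
              // Vector lanes of the thread split the columns of this row.
              Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(member, n_cols),
                [&](const int c, double& partial) { partial += A(row, c); }, sum);
              // Only one vector lane writes the result.
              Kokkos::single(Kokkos::PerThread(member), [&]() { row_sums(row) = sum; });
            });
        });
    }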
Application to a Molecular Dynamics Kernel: Atom Neighbor List Construction
- Atom IDs are stored in a Cartesian-grid (XYZ) locality-bin data structure; atoms are sorted by locality
- The non-team algorithm already has good cache efficiency
- Teams and shared memory improve cache efficiency on the GPU: a team works on a set of neighboring bins, one bin per thread in the team
- Non-team algorithm:
    parFor i in natoms
      n = 0
      bin_idx = bin_of(i)
      for bin in stencil(bin_idx)
        for j in bin_atom_ids(bin)
          if ( distance(i,j) < cut )
            neighbor(i,n++) = j
- Team algorithm:
    parForTeam base_bins in bins
      copy_to_shared(base_bins, shared_base_bins)
      for bin_row in YZ_part_of(base_bins)
        copy_to_shared(bin_row, shared_bin_row)
        parForTeam i in bin_atom_ids(shared_base_bins)
          parForVector j in bin_atom_ids(shared_bin_row)
            if ( distance(i,j) < cut )
              neighbor(i,n++) = j
- Previously a Cuda-specialized implementation; now a portable implementation

Performance of a Complete Simulation Step
- Timing data for the isolated kernel is not available
- Comparing compute nodes of roughly equivalent power:
  - 1/2 of a K80 (i.e., one of the two GPUs on the board)
  - 2 sockets of 8-core Sandy Bridge with 2-wide SMT
  - 2 sockets of 10-core Power 8 chips with 8-wide SMT
- CPUs use a team size of 1; GPUs use a team size of 2x32
  [Figure: time per step for the non-team vs. team algorithm on K80, Sandy Bridge, and Power8]

Application to Tensor Math Library Kernels
- Performed through the Harvey Mudd College clinic program
  - Advisor/Professor: Jeff Amelang
  - Undergraduate team: Brett Collins, Alex Gruver, Ellen Hui, Tyler Marklyn
- Project: re-engineer serial kernels to use Kokkos
  - Initially using the "flat" range policy
  - Progressing to the thread team policy for appropriate kernels
- Several candidate kernels for team parallelism; results for:
  - Multi-matrix multiply: ∀ c,d,e : V(c,d,e) = Σ_p A(c,p,d) * B(c,p,e)
- Thread team mapping
  - Outer (league-level) parallel_for over dimension 'c'
  - Inner (team-level) parallel_reduce over the summation dimension 'p'
  - Inner (team-level) parallel_for over tensor dimensions 'd', 'e'

Application to Tensor Math Library Kernels: Performance
- Performance of the "multi-matrix multiply" tensor contraction V(c,d,e) = Σ_p A(c,p,d) * B(c,p,e)
  - d = e = 6, symmetric tensor
  - p = 27-point numerical integration of a hexahedral cell
  - c = number of cells
  [Figure: throughput vs. number of cells; team-synchronization overhead dominates nested parallelism at small cell counts, while larger cell counts expose more parallelism to map]

Outline
- Core Abstractions and Capabilities
- Portable Hierarchical Parallel Execution Policies
- Initial Scalable Graph Algorithms
  - Construction of a sparse matrix graph from a finite element mesh
  - Breadth-first search of a graph with highly variable vertex degree
- Conclusion

Thread-Scalable Construction of a Sparse Matrix Graph from a Finite Element Mesh
- Given finite element mesh connectivity, element → nodes:
    View<int*[8],Space> element_node ;
- Generate the node → node graph as a compressed sparse row data structure:
    { ( node, column(j) ) : ∀ j ∈ [ irow(node), irow(node+1) ), ∀ node }
  where node = node index, irow = row offset array, column(j) = connected node index
- Challenges
  - Determine unique node-node entries given redundant entries (element → nodes share faces and edges)
  - The number of node-node entries is unknown; the upper bound N^2 is too large to allocate

Thread-Scalable Construction of a Sparse Matrix Graph from a Finite Element Mesh (algorithm)
1. parallel_for: fill a Kokkos lock-free unordered map with node-node pairs
   - element → nodes: for each element, for each pair of nodes
   - A successful insert atomically increments the node's column count
2. parallel_scan: the rows' column counts generate the row offsets
   - The last entry is the total count of unique node-node pairs
3. Allocate the sparse matrix column-index array
4. parallel_for: query the unordered map to fill the sparse matrix column-index array
   - For each entry in the unordered map of node-node pairs
5. parallel_for: sort each row's column-index subarray
  [Figure: fill time in microsec/node vs. number of finite element nodes (1e3 to 1e7) for Phi-60, Phi-240, and K40X]
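A minimal sketch of step 1 above, inserting candidate node-node pairs into a Kokkos::UnorderedMap and counting columns with atomics. The names, capacity handling, and 8-node element assumption are illustrative; this is not the miniFENL source.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_UnorderedMap.hpp>

    typedef Kokkos::UnorderedMap<Kokkos::pair<int,int>, void> PairSet; // key-only map

    void count_graph_entries(Kokkos::View<const int*[8]> elem_node,
                             Kokkos::View<int*> row_count,
                             PairSet node_node)   // pre-sized with a capacity guess
    {
      const int nelem = elem_node.extent(0);
      Kokkos::parallel_for(nelem, KOKKOS_LAMBDA(const int e) {
        for (int a = 0; a < 8; ++a)
          for (int b = a + 1; b < 8; ++b) {
            const int i = elem_node(e, a), j = elem_node(e, b);
            // The insert succeeds only for the first thread to see this pair.
            if (node_node.insert(Kokkos::pair<int,int>(i, j)).success()) {
              Kokkos::atomic_increment(&row_count(i));
              Kokkos::atomic_increment(&row_count(j));
            }
          }
      });
    }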
Breadth-First Search of a Graph with Highly Varied Degree Vertices
- Porting portions of MTGL to the GPU via Kokkos
  - MTGL: Sandia's MultiThreaded Graph Library
  - Internal laboratory directed research and development (LDRD) project
  - Sandia collaborators: Jonathan Berry and Greg Mackey
  - Layering: MTGL on top of Kokkos Containers and Algorithms, Kokkos Core, and back-ends: Cuda, OpenMP, pthreads, Qthreads, ...
- Evaluate the suitability of Kokkos and the GPU for graph algorithms
  - MTGL previously threaded for CPU via Qthreads
  - Ease and performance of layering MTGL on Kokkos
  - Performance of MTGL algorithms on the GPU

Breadth-First Search of a Graph with Vertices of Highly Varying Degree
- Iterative frontier-advancing algorithm (conceptually simple)
  - Given a frontier set of vertices
  - For each edge associated with each vertex in the frontier: if the edge's other vertex has not been visited, add it to the next frontier
- Challenges for thread-scalability
  - Maximizing parallelism in "for each edge of each frontier vertex"
  - Removing load imbalance in "for each edge of each frontier vertex"
  - The set of edges will not fit in GPU memory (the set of vertices will fit)
  - Concurrent growth of the global frontier set
- Strategy for thread-scalability
  - Manhattan loop collapse of "for each edge of each frontier vertex" (a technique used in Cray and LLVM compilers)
  - Thread-team coordinated growth of the global frontier set

Breadth-First Search Algorithm
- Graph implemented via a compressed sparse row (CSR) scheme:
    { ( v, edge(j) ) : ∀ j ∈ [ irow(v), irow(v+1) ), ∀ v }
  where v = vertex index, irow = offset array, edge(j) = subarray of paired vertices
- Given a search result array of vertices, search():
  - [0..a) = vertex indices accumulated from previous search iterations
  - [a..b) = vertex indices of the current search frontier
1. Generate the frontier vertex degree offset array 'fscan'
   - The frontier subarray of vertex indices is search([a..b))
   - parallel_scan of vertex degrees ( irow(v+1) - irow(v) ) generates fscan
2. Evaluate the search frontier's edges; #edges = fscan(b) - fscan(a)
   - parallel_for via TeamPolicy; each team searches a range of edges
   - Each thread evaluates the vertices of a collection of edges
   - Atomic update to determine whether this is the first visit; append to a thread-local buffer
   - Intra-team parallel_scan of local buffer sizes counts the team's search result
   - Append the team's result to the global search array with only one atomic update
3. Repeat for the updated frontier

Breadth-First Search Algorithm (continued)
- Maximizing parallelism
  - Manhattan loop collapse facilitates parallelizing over edges, not vertices
  - Removes load-imbalance concerns for highly variable degree vertices
- Minimizing synchronization
  - Thread-local buffer for accumulating search results
  - Intra-team parallel scan of thread-local buffer sizes for the team's result size
  - The team's single atomic update of the global search array
- Place arrays in appropriate memory spaces via Kokkos::View
  - Vertex arrays in GPU memory: irow(), search(), fscan()
  - Edge array in host-pinned memory: edge()
- Performance evaluation of the portable implementation
  - Scalability for graphs with highly variable degree vertices
  - CPU vs. GPU
  - Edge array in GPU vs. host-pinned memory
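A small sketch (assuming a Cuda build; 'nvertex' and 'nedge' are illustrative sizes) of how the memory-space placement just described can be expressed through View template arguments:

    // Vertex data lives in GPU device memory; the large edge array lives in
    // host-pinned memory that the GPU streams over the PCIe bus.
    void allocate_bfs_arrays(int nvertex, int nedge) {
      Kokkos::View<int*, Kokkos::CudaSpace>           irow("irow", nvertex + 1);
      Kokkos::View<int*, Kokkos::CudaSpace>           search("search", nvertex);
      Kokkos::View<int*, Kokkos::CudaSpace>           fscan("fscan", nvertex + 1);
      Kokkos::View<int*, Kokkos::CudaHostPinnedSpace> edge("edge", nedge);
      // ... build the graph and run the search ...
    }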
Breadth-First Search Performance Testing
- Sequence of generated test graphs
  - Doubling vertices and edges; the edges eventually cannot fit in GPU memory
  - Similar vertex degree histograms for all generated graphs
- The algorithm's iteration starts on the vertex of largest degree

Breadth-First Search Performance Results
- Good scalability on Kepler
  - Teams stream through the edge array with a coalesced access pattern
- Almost a 2x performance drop when reading the edge array from host-pinned memory
  - But this enables processing of large graphs whose edges cannot fit in GPU memory

Summary: Concepts and Abstractions
- κόκκος: "like grains of sand on a beach"
  - Identify / encapsulate grains of data and parallelizable operations
  - Aggregate these grains with data structures and parallel patterns
  - Map aggregated grains onto memory and cores / threads
- Mapping
  - User functions, execution spaces, parallel patterns, execution policies
  - Polymorphic multidimensional array, memory spaces, layout, access intent
  - Atomic operations
- Hierarchical parallel patterns
  - Maximizing the opportunity (grains) for parallelism

Conclusion
- Kokkos enables performance portability
  - parallel_pattern( ExecutionPolicy<ExecutionSpace>, UserFunction )
  - Polymorphic multidimensional arrays solve the array-of-structs versus struct-of-arrays dilemma
  - Atomic operations
  - Engaging with the ISO/C++ standard committee to advocate for these capabilities
- Pure library approach using C++ template metaprogramming
  - Significantly simplified when UserFunction is a C++11 lambda
  - Cuda 7.5 candidate feature for the device lambda: [=] __device__
  - Tell NVIDIA you like and want this
- Thread team execution policy for hierarchical parallelism
  - Portable abstraction for Cuda grids, blocks, warps, and shared memory
- Early R&D on application to graph algorithms