THE END OF THE GPU ROADMAP

Tim Sweeney
CEO, Founder
Epic Games
tim@epicgames.com

Background: Epic Games
• Independent game developer
• Located in Raleigh, North Carolina, USA
• Founded in 1991
• Over 30 games released
  – Gears of War
  – Unreal series
• Unreal Engine 3 is used by 100’s of games

History: Unreal Engine

Unreal Engine 1 (1996-1999)
• First modern game engine
  – Object-oriented
  – Real-time, visual toolset
  – Scripting language
• Last major software renderer
  – Software texture mapping
  – Colored lighting, shadowing
  – Volumetric lighting & fog
  – Pixel-accurate culling
• 25 games shipped

Unreal Engine 2 (2000-2005)
• PlayStation 2, Xbox, PC
• DirectX 7 graphics
• Single-threaded
• 40 games shipped

Unreal Engine 3 (2006-2012)
• PlayStation 3, Xbox 360, PC
• DirectX 9 graphics
  – Pixel shaders
  – Advanced lighting & shadowing
• Multithreading (6 threads)
• Advanced physics
• More visual tools
  – Game Scripting
  – Materials
  – Animation
  – Cinematics…
• 150 games in development

Unreal Engine 3 Games
• Army of Two (Electronic Arts)
• Mass Effect (BioWare)
• Undertow (Chair Entertainment)
• BioShock (2K Games)

Game Development: 2009

Gears of War 2: Project Overview
• Project Resources
  – 15 programmers
  – 45 artists
  – 2-year schedule
  – $12M development budget
• Software Dependencies
  – 1 middleware game engine
  – ~20 middleware libraries
  – Platform libraries

Gears of War 2: Software Dependencies
• Gears of War 2 Gameplay Code: ~250,000 lines of C++ and script code
• Unreal Engine 3 Middleware Game Engine: ~2,000,000 lines of C++ code
• Middleware and platform libraries:
  – ZLib (data compression)
  – SpeedTree (tree rendering)
  – FaceFX (face animation)
  – Bink (movie codec)
  – DirectX (graphics)
  – OpenAL (audio)
  – …

Hardware: History

Computing History
• 1985: Intel 80386: scalar, in-order CPU
• 1989: Intel 80486: caches
• 1993: Pentium: superscalar execution
• 1995: Pentium Pro: out-of-order execution
• 1999: Pentium 3: vector floating-point
• 2003: AMD Opteron: multi-core
• 2006: PlayStation 3, Xbox 360: “many-core”
  …and we’re back to in-order execution

Graphics History
• 1984: 3D workstation (SGI)
• 1997: GPU (3dfx)
• 2002: DirectX 9, pixel shaders (ATI)
• 2006: GPU with full programming language (NVIDIA GeForce 8)
• 2009: x86 CPU/GPU hybrid (Intel Larrabee)

Hardware: 2012-2020
[Diagram: two rows of five in-order processors, each with 4 threads and its own instruction and data caches, all sharing one L2 cache]

Intel Larrabee
• x86 CPU-GPU hybrid
• C/C++ compiler
• DirectX/OpenGL
• Many-core, vector architecture
• Teraflop-class performance

NVIDIA GeForce 8
• General-purpose GPU
• CUDA “C” compiler
• DirectX/OpenGL
• Many-core, vector architecture
• Teraflop-class performance

Hardware: 2012-2020
CONCLUSION: CPU and GPU architectures are getting closer

THE GPU TODAY

The GPU Today
• Large frame buffer
• Complicated pipeline
• It’s fixed-function
• But we can specify shader programs that execute in certain pipeline stages

Shader Program Limitations
• No random-access memory writes
  – Can write to current pixel in frame buffer
  – Can’t create data structures
• Can’t traverse data structures
  – Can hack it using texture accesses
• Hard to share data between the main program and shader programs
• Weird programming language
  – HLSL rather than C/C++
Result: “The Shader ALU Plateau”

Antialiasing Limitations
• MSAA & oversampling
  – Every 1 bit of output precision costs up to 2X memory & performance
  – Ideally want 10-20 bits
• Discrete sampling (in general)
  – Texture filtering only implies antialiasing when the shader equation is linear
  – Most shader equations are nonlinear
Aliasing is the #1 visual artifact in Gears of War

Texture Sampling Limitations
• Inherent artifacts of bilinear/trilinear filtering
• Poor approximation of Integrate(color,area) in the presence of:
  – Small triangles
  – Texture seams
  – Alpha translucency
  – Masking
• Fixed-function = poor scalability
  – Megatexture, etc.

Frame Buffer Model Limitations
• Frame buffer: 1 (or n) layers of 4-vectors, where n = small constant
• Ineffective for
  – General translucency
  – Complex shadowing models
• Memory bandwidth requirement = FPS × Pixel Count × Layers × Depth × pow(2,n), where n = quality of MSAA

Summary of Limitations
• “The Shader ALU Plateau”
• Antialiasing limitations
• Texture sampling limitations
• Frame buffer limitations

The Meta-Problem:
• The fixed-function pipeline is too fixed to solve its problems
• Result:
  – All games look similar
  – Derive little benefit from Moore’s Law
  – Crysis on a high-end NVIDIA SLI solution looks at most marginally better than top Xbox 360 games
This is a market BEGGING to be disrupted :)

SO...

Return to 100% “Software” Rendering
• Bypass the OpenGL/DirectX API
• Implement a 100% software renderer
  – Bypass all fixed-function pipeline hardware
  – Generate image directly
  – Build & traverse complex data structures
  – Unlimited possibilities
Could implement this…
• On Intel CPU using C/C++
• On NVIDIA GPU using CUDA (no DirectX)

Software Rendering in Unreal 1 (1998)
• Ran 100% on CPU
• No GPU required
Features
• Real-time colored lighting
• Volumetric fog
• Tiled rendering
• Occlusion detection

Software Rendering in 1998 vs 2012
• A 60 MHz Pentium could execute: 16 operations per pixel at 320x200, 30 Hz
• In 2012, a 4 Teraflop processor would execute: 16000 operations per pixel at 1920x1080, 60 Hz
Assumption: using 50% of computing power for graphics, 50% for gameplay

Future Graphics: Raytracing
• For each pixel
  – Cast a ray off into the scene
  – Determine which objects were hit
  – Continue for reflections, refraction, etc.
• Consider
  – Less efficient than pure rendering
  – Can use for reflections in a traditional render

Future Graphics: The REYES Rendering Model
• “Dice” all objects in the scene down into sub-pixel-sized triangles
• Rendering with
  – Flat shading
  – Analytic antialiasing
  – Per-pixel occlusion (A-Buffer/BSP)
• Benefits
  – Displacement maps for free
  – Analytic antialiasing
  – Advanced filtering (Gaussian)
  – Eliminates texture sampling

Future Graphics: The REYES Rendering Model
Today’s Pipeline
• Build 4M poly “high-res” character
• Generate normal maps from geometry in high-res
• Render 20K poly “low-res” character in-game
Potential 2012 Pipeline
• Build 4M poly “high-res” character
• Render it in-game
• Advanced LOD scheme assures proper sub-pixel-sized triangles

Future Graphics: Volumetric Rendering
• Direct Voxel Rendering
  – Raycasting
  – Efficient for trees, foliage
• Tessellated Volume Rendering
  – Marching Cubes
  – Marching Tetrahedrons
• Point Clouds
• Signal-Space Volume Rendering
  – Fourier Projection Slice Theorem
  – Great for clouds, translucent volumetric data

Future Graphics: Software Tiled Rendering
• Split the frame buffer up into bins
  – Example: 1 bin = 8x8 pixels
• Process one bin at a time
  – Transform, rasterize all objects in the bin
• Consider
  – Cache efficiency
  – Deep frame buffers, antialiasing

Hybrid Graphics Algorithms
• Analytic antialiasing
  – Analytic solution, better than 1024x MSAA
• Sort-independent translucency
  – Sorted linked list of fragments per pixel, requiring per-pixel memory allocation, pointer-following, and conditional branching (A-Buffer)
• Advanced shadowing techniques
  – Physically accurate per-pixel penumbra volumes
  – Extension of the well-known stencil buffering algorithm
  – Requires storing, traversing, and updating a very simple BSP tree per pixel, with memory allocation and pointer following
• Scenes with very large numbers of objects
  – Fixed-function GPU + API has a 10X-100X state change disadvantage

Graphics: 2012-2020
Potential Industry Goals
Achieve movie quality:
• Antialiasing
• Direct lighting
• Shadowing
• Particle effects
• Reflections
Significantly improve:
• Character animation
• Object counts
• Indirect lighting

SOFTWARE IMPLICATIONS

Software Implications
Software must scale to…
• 10’s – 100’s of threads
• Vector instruction sets

Software Implications
Programming Models
• Shared State Concurrency
• Message Passing
• Pure Functional Programming
• Software Transactional Memory

Multithreading in Unreal Engine 3: “Task Parallelism”
• Gameplay thread
  – AI, scripting
  – Thousands of interacting objects
• Rendering thread
  – Scene traversal, occlusion
  – Direct3D command submission
• Pool of helper threads for other work
  – Physics solver
  – Animation updates
Good for 4 threads. No good for 100 threads.

“Shared State Concurrency”
The standard C++/Java threading model
• Many threads are running
• There is 512MB of data
• Any thread can modify any data at any time
• All synchronization is explicit, manual
  – See: LOCK, MUTEX, SEMAPHORE
• No compile-time verification of correctness properties:
  – Deadlock-free
  – Race-free
  – Invariants

Multithreaded Gameplay Simulation: Manual Synchronization
Idea:
• Update objects in multiple threads
• Each object contains a lock
• “Just lock an object before using it”
Problems:
• “Deadlocks”
• “Data races”
• Debugging is difficult/expensive

Multithreaded Gameplay Simulation: “Message Passing”
Idea:
• Update objects in multiple threads
• Each object can only modify itself
• Communicate with other objects by sending messages
Problems:
• Requires writing 1000’s of message protocols
• Still need synchronization

Pure Functional Programming
“Pure Functional” programming style:
• Define algorithms that don’t write to shared memory or perform I/O operations (their only effect is to return a result)
Examples:
• Collision detection
• Physics solver
• Pixel shading

Pure Functional Programming
“Inside a function with no side effects, sub-computations can be run in any order, or concurrently, without affecting the function’s result”
With this property:
• A programmer can explicitly multithread the code, safely.
• Future compilers will be able to automatically multithread the code, safely.
See: “Implementing Lazy Functional Languages on Stock Hardware”; Simon Peyton Jones; Journal of Functional Programming, 1992

Multithreaded Gameplay Simulation: Software Transactional Memory
Idea:
• Update objects in multiple threads
• Each thread runs inside a transaction block and has an atomic view of its “local” changes to memory
• C++ runtime detects conflicts between transactions
  – Non-conflicting transactions are applied to “global” memory
  – Conflicting transactions are “rolled back” and re-run
Implemented 100% in software; no custom hardware required.
Problems:
• “Object update” code must be free of side effects
• Requires C++ runtime support
• Costs around 30% performance
See: “Composable Memory Transactions”; Tim Harris, Simon Marlow, Simon Peyton Jones, and Maurice Herlihy; ACM Conference on Principles and Practice of Parallel Programming, 2005

Vectorization
Supporting “Vector Instruction Sets” efficiently
NVIDIA GeForce 8:
• 8 to 15 cores
• 16-wide vectors

Vectorization
• C++, Java compilers generate “scalar” code
• GPU shader compilers generate “vector” code
  – Arbitrary vector size (4, 16, 64, …)
  – N-wide vectors yield N-wide speedup

Vectorization: “The Old Way”
• “Old Vectors” (SIMD): Intel SSE, Motorola Altivec
  – 4-wide vectors
  – 4-wide arithmetic operations
  – Vector loads: load a vector register from a vector stored in memory
  – Vector swizzle & mask

Future Programming Models: Vectorization
• “Old Vectors”: Intel SSE, Motorola Altivec

    vec4 x,y,z;
    ...
    z = x+y;

    x0  x1  x2  x3
    +   +   +   +
    y0  y1  y2  y3
    =   =   =   =
    z0  z1  z2  z3

Vectorization: “New Vectors” (ATI, NVIDIA GeForce 8, Intel Larrabee)
• 16-wide vectors
• 16-wide arithmetic
• Vector loads/stores
  – Load a 16-wide vector register from scalars at 16 independent memory addresses, where the addresses are stored in a vector
  – Analogy: register-indexed constant access in DirectX
• Conditional vector masks

“New SIMD” is better than “Old SIMD”
• “Old Vectors” were only useful when dealing with vector-like data types:
  – “XYZW” vectors from graphics
  – 4x4 matrices
• “New Vectors” are far more powerful:
  Any loop whose body has a statically known call graph free of sequential dependencies can be “vectorized”, or compiled into an equivalent 16-wide vector program. And it runs up to 16X faster.

“New Vectors” are universal
(Mandelbrot set generator)

    int n;
    cmplx coords[];
    int color[] = new int[n];

    for(int i=0; i<n; i++) {
        int j=0;
        cmplx c=cmplx(0,0);
        while(mag(c) < 2) {
            c=c*c + coords[i];
            j++;
        }
        color[i] = j;
    }

This code…
• is free of sequential dependencies
• has a statically known call graph
Therefore, we can mechanically transform it into an equivalent data-parallel code fragment.

“New Vectors” Translation
Standard data-parallel loop setup: the scalar loop

    for(int i=0; i<n; i++) {
        …
    }

becomes

    for(int i=0; i<n; i+=N) {
        i_vector={i,i+1,..i+N-1}
        i_mask={i<n,i+1<n,i+2<n,..i+N-1<n}
        …
    }

Note: any code outside this loop (which invokes the loop) is necessarily scalar.

“New Vectors” Translation
The full Mandelbrot loop, translated:

    int n;
    cmplx coords[];
    int color[] = new int[n];

    for(int i=0; i<n; i+=N) {
        int[N]   i_vector={i,i+1,..i+N-1};              // Loop index vector
        bool[N]  i_mask={i<n,i+1<n,i+2<n,..i+N-1<n};    // Loop mask vector
        complx[N] c_vector={cmplx(0,0),..};             // Vectorized loop variable
        int[N]   j_vector={0,..};                       // Vectorized loop variable
        while(1) {
            // Vectorized conditional: propagates loop mask to local condition
            bool[N] while_vector={
                i_mask[0] && mag(c_vector[0])<2,
                ..
            };
            if(all_false(while_vector))
                break;
            // Mask-predicated vector read
            c_vector=c_vector*c_vector + coords[i..i+N-1 : while_vector];
            j_vector=j_vector+1 : while_vector;
        }
        // Mask-predicated vector write
        color[i..i+N-1 : i_mask] = j_vector;
    }

Vectorization Tricks
• Vectorization of loops
  – Subexpressions independent of the loop variable are scalar and can be lifted out of the loop
  – Subexpressions dependent on the loop variable are vectorized
  – Each loop iteration computes an “active mask” enabling operation on some subset of the N components
• Vectorization of function calls
  – For every scalar function, generate an N-wide vector version of the function taking an N-wide “active mask”
• Vectorization of conditionals
  – Evaluate the N-wide conditional and combine it with the current active mask
  – Execute the “true” branch if any masked conditions are true
  – Execute the “false” branch if any masked conditions are false
  – Will often execute both branches

Vectorization Paradigms
• Hand-coded vector operations
  – Current approach to SSE/Altivec
• Loop vectorization
  – See: vectorizing compilers
• Run a big function with a big bundle of data
  – CUDA/OpenCL
• Nested Data Parallelism
  – See: NESL
  – Very general set of “vectorization” transforms for many categories of nested computations

Layers: Multithreading & Vectors
[Diagram: nested execution layers and the workloads that run in each]
• Vector (data parallel) subset: graphics shader programs
• Purely functional core: physics, collision detection, scene traversal, path finding, ...
• Software transactional memory: game world state
• Sequential execution: hardware I/O

Potential Performance Gains*: 2012-2020
Up to...
• 64X for multithreading
• 1024X for multithreading + vectors
* My estimate of feasibility based on Moore’s Law

Multithreading & Vectorization: Who Chooses?
• Hardware companies impose a limited model on developers
  – Sony Cell, NVIDIA CUDA, Apple OpenCL
• Hardware provides general feature; languages & runtimes make it nice; users choose
• Tradeoffs
  – Performance
  – Productivity
  – Familiarity

HARDWARE IMPLICATIONS

The Graphics Hardware of the Future
All else is just computing

Future Hardware: A unified architecture for computing and graphics
Hardware Model
• Three performance dimensions
  – Clock rate
  – Cores
  – Vector width
• Executes two kinds of code:
  – Scalar code (like x86, PowerPC)
  – Vector code (like GPU shaders or SSE/Altivec)
• Some fixed-function hardware
  – Texture sampling
  – Rasterization

Vector Instruction Issues
• A future computing device needs…
  – Full vector ISA
    – Masking & scatter/gather memory access
    – 64-bit integer ops & memory addressing
  – Full scalar ISA
    – Dynamic control flow is essential
  – Efficient support for scalar/vector transitions
    – Initiating a vector computation
    – Reducing the results
    – Repacking vectors
    – Must support billions of transitions per second

Memory System Issues
Effective bandwidth demands will be huge
• Typically read 1 byte of memory per FLOP
• 4 TFLOP of computing power demands 4 TBPS of effective memory bandwidth
Yes, really

Memory System Issues
Threads (GPU)
• Hide memory latency
• Lose data locality
Caches (CPU)
• Expose memory latency
• Exploit data locality to minimize main memory bandwidth

Memory System Issues
• Cache coherency is vital
• It should be the default

Revisiting REYES
• “Dice” all objects in the scene down into sub-pixel-sized triangles
• Tile-based setup
• Rendering with
  – Flat shading
  – No texture sampling
  – Analytic antialiasing
  – Per-pixel occlusion (A-Buffer/BSP)
Requires no artificial software threading or pipelining.

LESSONS LEARNED

Lessons learned: Productivity is vital
Hardware will become 20X faster, but:
• Game budgets will increase less than 2X.
Therefore...
• Developers must be willing to sacrifice performance in order to gain productivity.
• High-level programming beats low-level programming.
• Easier hardware beats faster hardware.
• We need great tools: compilers, engines, middleware libraries...

Lessons learned: Today’s hardware is too hard
• If it costs X (time, money, pain) to develop an efficient single-threaded algorithm, then…
  – Multithreaded version costs 2X
  – PlayStation 3 Cell version costs 5X
  – Current “GPGPU” version costs 10X or more
• Over 2X is uneconomical for most software companies
• This is an argument against:
  – Hardware that requires difficult programming techniques
  – Non-unified memory architectures
  – Limited “GPGPU” programming models

Lessons learned: Plan Ahead
Previous Generation:
• Lead time for engine development was 3 years
• Unreal Engine 3:
  – 2003: development started
  – 2006: first game shipped
Next Generation:
• Lead time for engine development is 5 years
• Start in 2009, ship in 2014
So, let’s get started

CONCLUSION

END
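
Supplementary sketch: “New Vectors” translation
The mask-predicated translation described on the “New Vectors” slides can be tried out without vector hardware. The sketch below is a rough emulation in plain Python, not the deck's own code: each "vector register" is just a list of N lanes, and the names N, MAX_ITER, mandelbrot_scalar and mandelbrot_vector are hypothetical. It also assumes an iteration cap (MAX_ITER), which the slide code does not have, so that points inside the set terminate.

```python
# Emulates N-wide mask-predicated execution with plain lists (one entry per lane).
# N, MAX_ITER and both function names are illustrative assumptions.

N = 4          # hypothetical vector width (the slides discuss 16-wide hardware)
MAX_ITER = 50  # assumed iteration cap so points inside the set terminate


def mandelbrot_scalar(coords):
    """Reference scalar loop: one point at a time."""
    color = []
    for p in coords:
        c, j = 0j, 0
        while abs(c) < 2 and j < MAX_ITER:
            c = c * c + p
            j += 1
        color.append(j)
    return color


def mandelbrot_vector(coords):
    """Same computation, N points per step under an active mask."""
    n = len(coords)
    color = [0] * n
    for i in range(0, n, N):
        # Loop mask vector: lanes past the end of the array are disabled.
        i_mask = [i + k < n for k in range(N)]
        # Mask-predicated read: disabled lanes get a dummy value.
        p_vec = [coords[i + k] if i_mask[k] else 0j for k in range(N)]
        c_vec = [0j] * N
        j_vec = [0] * N
        while True:
            # Vectorized conditional: loop mask combined with the per-lane test.
            active = [i_mask[k] and abs(c_vec[k]) < 2 and j_vec[k] < MAX_ITER
                      for k in range(N)]
            if not any(active):
                break
            # Only active lanes advance; finished lanes keep their values.
            c_vec = [c_vec[k] * c_vec[k] + p_vec[k] if active[k] else c_vec[k]
                     for k in range(N)]
            j_vec = [j_vec[k] + 1 if active[k] else j_vec[k] for k in range(N)]
        # Mask-predicated write.
        for k in range(N):
            if i_mask[k]:
                color[i + k] = j_vec[k]
    return color


# 35 sample points: not a multiple of N, so the last group exercises the loop mask.
points = [complex(-2 + 0.5 * x, 0.5 * y) for y in range(-2, 3) for x in range(7)]
```

Calling mandelbrot_vector(points) should give the same list as mandelbrot_scalar(points). Note that the inner loop keeps running until every lane in the group has finished, which is why divergent lanes cost vector efficiency.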